Message ID | 1407768567-171794-3-git-send-email-dzickus@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
* Don Zickus <dzickus@redhat.com> wrote: > From: chai wen <chaiw.fnst@cn.fujitsu.com> > > For now, soft lockup detector warns once for each case of process softlockup. > But the thread 'watchdog/n' may not always get the cpu at the time slot between > the task switch of two processes hogging that cpu to reset soft_watchdog_warn. > > An example would be two processes hogging the cpu. Process A causes the > softlockup warning and is killed manually by a user. Process B immediately > becomes the new process hogging the cpu preventing the softlockup code from > resetting the soft_watchdog_warn variable. > > This case is a false negative of "warn only once for a process", as there may > be a different process that is going to hog the cpu. Resolve this by > saving/checking the pid of the hogging process and use that to reset > soft_watchdog_warn too. > > Signed-off-by: chai wen <chaiw.fnst@cn.fujitsu.com> > [modified the comment and changelog to be more specific] > Signed-off-by: Don Zickus <dzickus@redhat.com> > --- > kernel/watchdog.c | 20 ++++++++++++++++++-- > 1 files changed, 18 insertions(+), 2 deletions(-) > > diff --git a/kernel/watchdog.c b/kernel/watchdog.c > index 4c2e11c..6d0a891 100644 > --- a/kernel/watchdog.c > +++ b/kernel/watchdog.c > @@ -42,6 +42,7 @@ static DEFINE_PER_CPU(bool, softlockup_touch_sync); > static DEFINE_PER_CPU(bool, soft_watchdog_warn); > static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts); > static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt); > +static DEFINE_PER_CPU(pid_t, softlockup_warn_pid_saved); > #ifdef CONFIG_HARDLOCKUP_DETECTOR > static DEFINE_PER_CPU(bool, hard_watchdog_warn); > static DEFINE_PER_CPU(bool, watchdog_nmi_touch); > @@ -317,6 +318,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) > */ > duration = is_softlockup(touch_ts); > if (unlikely(duration)) { > + pid_t pid = task_pid_nr(current); > + > /* > * If a virtual machine is stopped by the host it can look to > * the watchdog like a soft lockup, check to see if the host > @@ -326,8 +329,20 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) > return HRTIMER_RESTART; > > /* only warn once */ > - if (__this_cpu_read(soft_watchdog_warn) == true) > + if (__this_cpu_read(soft_watchdog_warn) == true) { > + > + /* > + * Handle the case where multiple processes are > + * causing softlockups but the duration is small > + * enough, the softlockup detector can not reset > + * itself in time. Use pids to detect this. > + */ > + if (__this_cpu_read(softlockup_warn_pid_saved) != pid) { So I agree with the motivation of this improvement, but is this implementation namespace-safe? Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Aug 18, 2014 at 11:03:19AM +0200, Ingo Molnar wrote: > * Don Zickus <dzickus@redhat.com> wrote: > > > From: chai wen <chaiw.fnst@cn.fujitsu.com> > > > > For now, soft lockup detector warns once for each case of process softlockup. > > But the thread 'watchdog/n' may not always get the cpu at the time slot between > > the task switch of two processes hogging that cpu to reset soft_watchdog_warn. > > > > An example would be two processes hogging the cpu. Process A causes the > > softlockup warning and is killed manually by a user. Process B immediately > > becomes the new process hogging the cpu preventing the softlockup code from > > resetting the soft_watchdog_warn variable. > > > > This case is a false negative of "warn only once for a process", as there may > > be a different process that is going to hog the cpu. Resolve this by > > saving/checking the pid of the hogging process and use that to reset > > soft_watchdog_warn too. > > > > Signed-off-by: chai wen <chaiw.fnst@cn.fujitsu.com> > > [modified the comment and changelog to be more specific] > > Signed-off-by: Don Zickus <dzickus@redhat.com> > > --- > > kernel/watchdog.c | 20 ++++++++++++++++++-- > > 1 files changed, 18 insertions(+), 2 deletions(-) > > > > diff --git a/kernel/watchdog.c b/kernel/watchdog.c > > index 4c2e11c..6d0a891 100644 > > --- a/kernel/watchdog.c > > +++ b/kernel/watchdog.c > > @@ -42,6 +42,7 @@ static DEFINE_PER_CPU(bool, softlockup_touch_sync); > > static DEFINE_PER_CPU(bool, soft_watchdog_warn); > > static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts); > > static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt); > > +static DEFINE_PER_CPU(pid_t, softlockup_warn_pid_saved); > > #ifdef CONFIG_HARDLOCKUP_DETECTOR > > static DEFINE_PER_CPU(bool, hard_watchdog_warn); > > static DEFINE_PER_CPU(bool, watchdog_nmi_touch); > > @@ -317,6 +318,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) > > */ > > duration = is_softlockup(touch_ts); > > if (unlikely(duration)) { > > + pid_t pid = task_pid_nr(current); > > + > > /* > > * If a virtual machine is stopped by the host it can look to > > * the watchdog like a soft lockup, check to see if the host > > @@ -326,8 +329,20 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) > > return HRTIMER_RESTART; > > > > /* only warn once */ > > - if (__this_cpu_read(soft_watchdog_warn) == true) > > + if (__this_cpu_read(soft_watchdog_warn) == true) { > > + > > + /* > > + * Handle the case where multiple processes are > > + * causing softlockups but the duration is small > > + * enough, the softlockup detector can not reset > > + * itself in time. Use pids to detect this. > > + */ > > + if (__this_cpu_read(softlockup_warn_pid_saved) != pid) { > > So I agree with the motivation of this improvement, but is this > implementation namespace-safe? What namespace are you worried about colliding with? I thought softlockup_ would provide the safety?? Maybe I am missing something obvious. :-( Cheers, Don -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* Don Zickus <dzickus@redhat.com> wrote: > On Mon, Aug 18, 2014 at 11:03:19AM +0200, Ingo Molnar wrote: > > * Don Zickus <dzickus@redhat.com> wrote: > > > > > From: chai wen <chaiw.fnst@cn.fujitsu.com> > > > > > > For now, soft lockup detector warns once for each case of process softlockup. > > > But the thread 'watchdog/n' may not always get the cpu at the time slot between > > > the task switch of two processes hogging that cpu to reset soft_watchdog_warn. > > > > > > An example would be two processes hogging the cpu. Process A causes the > > > softlockup warning and is killed manually by a user. Process B immediately > > > becomes the new process hogging the cpu preventing the softlockup code from > > > resetting the soft_watchdog_warn variable. > > > > > > This case is a false negative of "warn only once for a process", as there may > > > be a different process that is going to hog the cpu. Resolve this by > > > saving/checking the pid of the hogging process and use that to reset > > > soft_watchdog_warn too. > > > > > > Signed-off-by: chai wen <chaiw.fnst@cn.fujitsu.com> > > > [modified the comment and changelog to be more specific] > > > Signed-off-by: Don Zickus <dzickus@redhat.com> > > > --- > > > kernel/watchdog.c | 20 ++++++++++++++++++-- > > > 1 files changed, 18 insertions(+), 2 deletions(-) > > > > > > diff --git a/kernel/watchdog.c b/kernel/watchdog.c > > > index 4c2e11c..6d0a891 100644 > > > --- a/kernel/watchdog.c > > > +++ b/kernel/watchdog.c > > > @@ -42,6 +42,7 @@ static DEFINE_PER_CPU(bool, softlockup_touch_sync); > > > static DEFINE_PER_CPU(bool, soft_watchdog_warn); > > > static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts); > > > static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt); > > > +static DEFINE_PER_CPU(pid_t, softlockup_warn_pid_saved); > > > #ifdef CONFIG_HARDLOCKUP_DETECTOR > > > static DEFINE_PER_CPU(bool, hard_watchdog_warn); > > > static DEFINE_PER_CPU(bool, watchdog_nmi_touch); > > > @@ -317,6 +318,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) > > > */ > > > duration = is_softlockup(touch_ts); > > > if (unlikely(duration)) { > > > + pid_t pid = task_pid_nr(current); > > > + > > > /* > > > * If a virtual machine is stopped by the host it can look to > > > * the watchdog like a soft lockup, check to see if the host > > > @@ -326,8 +329,20 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) > > > return HRTIMER_RESTART; > > > > > > /* only warn once */ > > > - if (__this_cpu_read(soft_watchdog_warn) == true) > > > + if (__this_cpu_read(soft_watchdog_warn) == true) { > > > + > > > + /* > > > + * Handle the case where multiple processes are > > > + * causing softlockups but the duration is small > > > + * enough, the softlockup detector can not reset > > > + * itself in time. Use pids to detect this. > > > + */ > > > + if (__this_cpu_read(softlockup_warn_pid_saved) != pid) { > > > > So I agree with the motivation of this improvement, but is this > > implementation namespace-safe? > > What namespace are you worried about colliding with? I thought > softlockup_ would provide the safety?? Maybe I am missing something > obvious. :-( I meant PID namespaces - a PID in itself isn't guaranteed to be unique across the system. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Aug 18, 2014 at 08:01:58PM +0200, Ingo Molnar wrote: > > > > duration = is_softlockup(touch_ts); > > > > if (unlikely(duration)) { > > > > + pid_t pid = task_pid_nr(current); > > > > + > > > > /* > > > > * If a virtual machine is stopped by the host it can look to > > > > * the watchdog like a soft lockup, check to see if the host > > > > @@ -326,8 +329,20 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) > > > > return HRTIMER_RESTART; > > > > > > > > /* only warn once */ > > > > - if (__this_cpu_read(soft_watchdog_warn) == true) > > > > + if (__this_cpu_read(soft_watchdog_warn) == true) { > > > > + > > > > + /* > > > > + * Handle the case where multiple processes are > > > > + * causing softlockups but the duration is small > > > > + * enough, the softlockup detector can not reset > > > > + * itself in time. Use pids to detect this. > > > > + */ > > > > + if (__this_cpu_read(softlockup_warn_pid_saved) != pid) { > > > > > > So I agree with the motivation of this improvement, but is this > > > implementation namespace-safe? > > > > What namespace are you worried about colliding with? I thought > > softlockup_ would provide the safety?? Maybe I am missing something > > obvious. :-( > > I meant PID namespaces - a PID in itself isn't guaranteed to be > unique across the system. Ah, I don't think we thought about that. Is there a better way to do this? Is there a domain id or something that can be OR'd with the pid? Cheers, Don -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* Don Zickus <dzickus@redhat.com> wrote: > > > > So I agree with the motivation of this improvement, but > > > > is this implementation namespace-safe? > > > > > > What namespace are you worried about colliding with? I > > > thought softlockup_ would provide the safety?? Maybe I > > > am missing something obvious. :-( > > > > I meant PID namespaces - a PID in itself isn't guaranteed > > to be unique across the system. > > Ah, I don't think we thought about that. Is there a better > way to do this? Is there a domain id or something that can > be OR'd with the pid? What is always unique is the task pointer itself. We use pids when we interface with user-space - but we don't really do that here, right? Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Aug 18, 2014 at 09:02:00PM +0200, Ingo Molnar wrote: > > * Don Zickus <dzickus@redhat.com> wrote: > > > > > > So I agree with the motivation of this improvement, but > > > > > is this implementation namespace-safe? > > > > > > > > What namespace are you worried about colliding with? I > > > > thought softlockup_ would provide the safety?? Maybe I > > > > am missing something obvious. :-( > > > > > > I meant PID namespaces - a PID in itself isn't guaranteed > > > to be unique across the system. > > > > Ah, I don't think we thought about that. Is there a better > > way to do this? Is there a domain id or something that can > > be OR'd with the pid? > > What is always unique is the task pointer itself. We use pids > when we interface with user-space - but we don't really do that > here, right? No, I don't believe so. Ok, so saving 'current' and comparing that should be enough, correct? Cheers, Don -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 08/19/2014 04:38 AM, Don Zickus wrote: > On Mon, Aug 18, 2014 at 09:02:00PM +0200, Ingo Molnar wrote: >> >> * Don Zickus <dzickus@redhat.com> wrote: >> >>>>>> So I agree with the motivation of this improvement, but >>>>>> is this implementation namespace-safe? >>>>> >>>>> What namespace are you worried about colliding with? I >>>>> thought softlockup_ would provide the safety?? Maybe I >>>>> am missing something obvious. :-( >>>> >>>> I meant PID namespaces - a PID in itself isn't guaranteed >>>> to be unique across the system. >>> >>> Ah, I don't think we thought about that. Is there a better >>> way to do this? Is there a domain id or something that can >>> be OR'd with the pid? >> >> What is always unique is the task pointer itself. We use pids >> when we interface with user-space - but we don't really do that >> here, right? > > No, I don't believe so. Ok, so saving 'current' and comparing that should > be enough, correct? > I am not sure of the safety about using pid here with namespace. But as to the pointer of process, is there a chance that we got a 'historical' address saved in the 'softlockup_warn_pid(or address)_saved' and the current hogging process happened to get the same task pointer address? If it never happens, I think the comparing of address is ok. thanks chai wen > Cheers, > Don > . >
On 08/19/2014 09:36 AM, Chai Wen wrote: > On 08/19/2014 04:38 AM, Don Zickus wrote: > >> On Mon, Aug 18, 2014 at 09:02:00PM +0200, Ingo Molnar wrote: >>> >>> * Don Zickus <dzickus@redhat.com> wrote: >>> >>>>>>> So I agree with the motivation of this improvement, but >>>>>>> is this implementation namespace-safe? >>>>>> >>>>>> What namespace are you worried about colliding with? I >>>>>> thought softlockup_ would provide the safety?? Maybe I >>>>>> am missing something obvious. :-( >>>>> >>>>> I meant PID namespaces - a PID in itself isn't guaranteed >>>>> to be unique across the system. >>>> >>>> Ah, I don't think we thought about that. Is there a better >>>> way to do this? Is there a domain id or something that can >>>> be OR'd with the pid? >>> >>> What is always unique is the task pointer itself. We use pids >>> when we interface with user-space - but we don't really do that >>> here, right? >> >> No, I don't believe so. Ok, so saving 'current' and comparing that should >> be enough, correct? >> > > > I am not sure of the safety about using pid here with namespace. > But as to the pointer of process, is there a chance that we got a 'historical' > address saved in the 'softlockup_warn_pid(or address)_saved' and the current > hogging process happened to get the same task pointer address? > If it never happens, I think the comparing of address is ok. > Hi Ingo what do you think of Don's solution- 'comparing of task pointer' ? Anyway this is just an additional check about some very special cases, so I think the issue that I am concerned above is not a problem at all. And after learning some concepts about PID namespace, I think comparing of task pointer is reliable dealing with PID namespace here. And Don, If you want me to re-post this patch, please let me know that. thanks chai wen > thanks > chai wen > >> Cheers, >> Don >> . >> > > >
On Thu, Aug 21, 2014 at 09:37:04AM +0800, Chai Wen wrote: > On 08/19/2014 09:36 AM, Chai Wen wrote: > > > On 08/19/2014 04:38 AM, Don Zickus wrote: > > > >> On Mon, Aug 18, 2014 at 09:02:00PM +0200, Ingo Molnar wrote: > >>> > >>> * Don Zickus <dzickus@redhat.com> wrote: > >>> > >>>>>>> So I agree with the motivation of this improvement, but > >>>>>>> is this implementation namespace-safe? > >>>>>> > >>>>>> What namespace are you worried about colliding with? I > >>>>>> thought softlockup_ would provide the safety?? Maybe I > >>>>>> am missing something obvious. :-( > >>>>> > >>>>> I meant PID namespaces - a PID in itself isn't guaranteed > >>>>> to be unique across the system. > >>>> > >>>> Ah, I don't think we thought about that. Is there a better > >>>> way to do this? Is there a domain id or something that can > >>>> be OR'd with the pid? > >>> > >>> What is always unique is the task pointer itself. We use pids > >>> when we interface with user-space - but we don't really do that > >>> here, right? > >> > >> No, I don't believe so. Ok, so saving 'current' and comparing that should > >> be enough, correct? > >> > > > > > > I am not sure of the safety about using pid here with namespace. > > But as to the pointer of process, is there a chance that we got a 'historical' > > address saved in the 'softlockup_warn_pid(or address)_saved' and the current > > hogging process happened to get the same task pointer address? > > If it never happens, I think the comparing of address is ok. > > > > > Hi Ingo > > what do you think of Don's solution- 'comparing of task pointer' ? > Anyway this is just an additional check about some very special cases, > so I think the issue that I am concerned above is not a problem at all. > And after learning some concepts about PID namespace, I think comparing > of task pointer is reliable dealing with PID namespace here. > > And Don, If you want me to re-post this patch, please let me know that. Sure, just quickly test with the task pointer to make sure it still works and then re-post. Cheers, Don -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 4c2e11c..6d0a891 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -42,6 +42,7 @@ static DEFINE_PER_CPU(bool, softlockup_touch_sync); static DEFINE_PER_CPU(bool, soft_watchdog_warn); static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts); static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt); +static DEFINE_PER_CPU(pid_t, softlockup_warn_pid_saved); #ifdef CONFIG_HARDLOCKUP_DETECTOR static DEFINE_PER_CPU(bool, hard_watchdog_warn); static DEFINE_PER_CPU(bool, watchdog_nmi_touch); @@ -317,6 +318,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) */ duration = is_softlockup(touch_ts); if (unlikely(duration)) { + pid_t pid = task_pid_nr(current); + /* * If a virtual machine is stopped by the host it can look to * the watchdog like a soft lockup, check to see if the host @@ -326,8 +329,20 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) return HRTIMER_RESTART; /* only warn once */ - if (__this_cpu_read(soft_watchdog_warn) == true) + if (__this_cpu_read(soft_watchdog_warn) == true) { + + /* + * Handle the case where multiple processes are + * causing softlockups but the duration is small + * enough, the softlockup detector can not reset + * itself in time. Use pids to detect this. + */ + if (__this_cpu_read(softlockup_warn_pid_saved) != pid) { + __this_cpu_write(soft_watchdog_warn, false); + __touch_watchdog(); + } return HRTIMER_RESTART; + } if (softlockup_all_cpu_backtrace) { /* Prevent multiple soft-lockup reports if one cpu is already @@ -342,7 +357,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) printk(KERN_EMERG "BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n", smp_processor_id(), duration, - current->comm, task_pid_nr(current)); + current->comm, pid); + __this_cpu_write(softlockup_warn_pid_saved, pid); print_modules(); print_irqtrace_events(current); if (regs)