diff mbox

[REPOST] Revert mm/vmstat.c: fix vmstat_update() preemption BUG

Message ID 20180504104451.20278-1-bigeasy@linutronix.de (mailing list archive)
State New, archived
Headers show

Commit Message

Sebastian Andrzej Siewior May 4, 2018, 10:44 a.m. UTC
This patch reverts commit c7f26ccfb2c3 ("mm/vmstat.c: fix
vmstat_update() preemption BUG").
Steven saw a "using smp_processor_id() in preemptible" message and
added a preempt_disable() section around it to keep it quiet. This is
not the right thing to do it does not fix the real problem.

vmstat_update() is invoked by a kworker on a specific CPU. This worker
it bound to this CPU. The name of the worker was "kworker/1:1" so it
should have been a worker which was bound to CPU1. A worker which can
run on any CPU would have a `u' before the first digit.

smp_processor_id() can be used in a preempt-enabled region as long as
the task is bound to a single CPU which is the case here. If it could
run on an arbitrary CPU then this is the problem we have an should seek
to resolve.
Not only this smp_processor_id() must not be migrated to another CPU but
also refresh_cpu_vm_stats() which might access wrong per-CPU variables.
Not to mention that other code relies on the fact that such a worker
runs on one specific CPU only.

Therefore I revert that commit and we should look instead what broke the
affinity mask of the kworker.

Cc: Steven J. Hill <steven.hill@cavium.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 mm/vmstat.c | 2 --
 1 file changed, 2 deletions(-)

Comments

Vlastimil Babka May 7, 2018, 7:31 a.m. UTC | #1
On 05/04/2018 12:44 PM, Sebastian Andrzej Siewior wrote:
> This patch reverts commit c7f26ccfb2c3 ("mm/vmstat.c: fix
> vmstat_update() preemption BUG").
> Steven saw a "using smp_processor_id() in preemptible" message and
> added a preempt_disable() section around it to keep it quiet. This is
> not the right thing to do it does not fix the real problem.
> 
> vmstat_update() is invoked by a kworker on a specific CPU. This worker
> it bound to this CPU. The name of the worker was "kworker/1:1" so it
> should have been a worker which was bound to CPU1. A worker which can
> run on any CPU would have a `u' before the first digit.
> 
> smp_processor_id() can be used in a preempt-enabled region as long as
> the task is bound to a single CPU which is the case here. If it could
> run on an arbitrary CPU then this is the problem we have an should seek
> to resolve.
> Not only this smp_processor_id() must not be migrated to another CPU but
> also refresh_cpu_vm_stats() which might access wrong per-CPU variables.
> Not to mention that other code relies on the fact that such a worker
> runs on one specific CPU only.
> 
> Therefore I revert that commit and we should look instead what broke the
> affinity mask of the kworker.

Yes. I think there's an important detail that should be perhaps included
explicitly here. The check check_preemption_disabled() does include this
test:

        /*
         * Kernel threads bound to a single CPU can safely use
         * smp_processor_id():
         */
        if (cpumask_equal(&current->cpus_allowed, cpumask_of(this_cpu)))
                goto out;

So indeed, if kworkers work as expected, there's no false positive bug.

The report in commit c7f26ccfb2c3 included:

      BUG: using smp_processor_id() in preemptible [00000000] code:
      kworker/1:1/269
      caller is vmstat_update+0x50/0xa0
      CPU: 0 PID: 269 Comm: kworker/1:1 Not tainted

I.e. kworker/1 running on CPU 0. Because the BUG was reported, we know
that the test above did not prevent it. That means either the kworker's
cpumask was not restricted to CPU 1 (it included also cpu 0), or it was
restricted, but the restriction was ignored, and it still executed on cpu 0.

Note the report also said "Attempting to hotplug CPUs with
CONFIG_VM_EVENT_COUNTERS enabled can...". IIRC Tejun mentioned that
during hotplug (or hotremove, actually?) the guarantees are off, but
vmstat should not schedule work items on such cpus due to its hooks/checks.

In any case I agree that the revert should be done immediately even
before fixing the underlying bug. The preempt_disable/enable doesn't
prevent the bug, it only prevents the debugging code from actually
reporting it! Note that it's debugging code (CONFIG_DEBUG_PREEMPT) that
production kernels most likely don't have enabled, so we are not even
helping them not crash (while allowing possible data corruption).

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> Cc: Steven J. Hill <steven.hill@cavium.com>
> Cc: Tejun Heo <htejun@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  mm/vmstat.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 33581be705f0..40b2db6db6b1 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1839,11 +1839,9 @@ static void vmstat_update(struct work_struct *w)
>  		 * to occur in the future. Keep on running the
>  		 * update worker thread.
>  		 */
> -		preempt_disable();
>  		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
>  				this_cpu_ptr(&vmstat_work),
>  				round_jiffies_relative(sysctl_stat_interval));
> -		preempt_enable();
>  	}
>  }
>  
>
Andrew Morton May 8, 2018, 11:02 p.m. UTC | #2
On Mon, 7 May 2018 09:31:05 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:

> In any case I agree that the revert should be done immediately even
> before fixing the underlying bug. The preempt_disable/enable doesn't
> prevent the bug, it only prevents the debugging code from actually
> reporting it! Note that it's debugging code (CONFIG_DEBUG_PREEMPT) that
> production kernels most likely don't have enabled, so we are not even
> helping them not crash (while allowing possible data corruption).

Grumble.

I don't see much benefit in emitting warnings into end-users' logs for
bugs which we already know about.

The only thing this buys us is that people will hassle us if we forget
to fix the bug, and how pathetic is that?  I mean, we may as well put

	printk("don't forget to fix the vmstat_update() bug!\n");

into start_kernel().
Sebastian Andrzej Siewior May 9, 2018, 10:35 p.m. UTC | #3
On 2018-05-08 16:02:57 [-0700], Andrew Morton wrote:
> On Mon, 7 May 2018 09:31:05 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
> 
> > In any case I agree that the revert should be done immediately even
> > before fixing the underlying bug. The preempt_disable/enable doesn't
> > prevent the bug, it only prevents the debugging code from actually
> > reporting it! Note that it's debugging code (CONFIG_DEBUG_PREEMPT) that
> > production kernels most likely don't have enabled, so we are not even
> > helping them not crash (while allowing possible data corruption).
> 
> Grumble.
> 
> I don't see much benefit in emitting warnings into end-users' logs for
> bugs which we already know about.

not end-users (not to mention that neither Debian Stretch nor F28 has
preemption enabled in their kernels). And if so, they may provide
additional information for someone to fix the bug in the end. I wasn't
able to reproduce the bug but I don't have access to anything MIPSish
where I can boot my own kernels. At least two people were looking at the
code after I posted the revert and nobody spotted the bug.

> The only thing this buys us is that people will hassle us if we forget
> to fix the bug, and how pathetic is that?  I mean, we may as well put
> 
> 	printk("don't forget to fix the vmstat_update() bug!\n");

No that is different. That would be seen by everyone. The bug was only
reported by Steven J. Hill which did not respond since. This message
would also imply that we know how to fix the bug but didn't do it yet
which is not the case. We seen that something was wrong but have no idea
*how* it got there.

The preempt_disable() was added by the end of v4.16. The
smp_processor_id() in vmstat_update() was added in commit 7cc36bbddde5
("vmstat: on-demand vmstat workers V8") which was in v3.18-rc1. The
hotplug rework took place in v4.10-rc1. And it took (counting from the
hotplug rework) 6 kernel releases for someone to trigger that warning
_if_ this was related to the hotplug rework.

What we have *now* is way worse: We have a possible bug that triggered
the warning. As we see in report the code in question was _already_
invoked on the wrong CPU. The preempt_disable() just silences the
warning, hiding the real issue so nobody will do a thing about it since
it will be never reported again (in a kernel with preemption and debug
enabled).

> into start_kernel().

Sebastian
Vlastimil Babka May 10, 2018, 6:32 a.m. UTC | #4
On 05/10/2018 12:35 AM, Sebastian Andrzej Siewior wrote:
> On 2018-05-08 16:02:57 [-0700], Andrew Morton wrote:
>> On Mon, 7 May 2018 09:31:05 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>>> In any case I agree that the revert should be done immediately even
>>> before fixing the underlying bug. The preempt_disable/enable doesn't
>>> prevent the bug, it only prevents the debugging code from actually
>>> reporting it! Note that it's debugging code (CONFIG_DEBUG_PREEMPT) that
>>> production kernels most likely don't have enabled, so we are not even
>>> helping them not crash (while allowing possible data corruption).
>>
>> Grumble.
>>
>> I don't see much benefit in emitting warnings into end-users' logs for
>> bugs which we already know about.
> 
> not end-users (not to mention that neither Debian Stretch nor F28 has
> preemption enabled in their kernels). And if so, they may provide
> additional information for someone to fix the bug in the end. I wasn't

Even if end-users have enabled preemption, they likely won't have
enabled CONFIG_DEBUG_PREEMPT anyway.

> able to reproduce the bug but I don't have access to anything MIPSish
> where I can boot my own kernels. At least two people were looking at the
> code after I posted the revert and nobody spotted the bug.
> 
>> The only thing this buys us is that people will hassle us if we forget
>> to fix the bug, and how pathetic is that?  I mean, we may as well put
>>
>> 	printk("don't forget to fix the vmstat_update() bug!\n");
> 
> No that is different. That would be seen by everyone. The bug was only
> reported by Steven J. Hill which did not respond since. This message
> would also imply that we know how to fix the bug but didn't do it yet
> which is not the case. We seen that something was wrong but have no idea
> *how* it got there.
> 
> The preempt_disable() was added by the end of v4.16. The
> smp_processor_id() in vmstat_update() was added in commit 7cc36bbddde5
> ("vmstat: on-demand vmstat workers V8") which was in v3.18-rc1. The
> hotplug rework took place in v4.10-rc1. And it took (counting from the
> hotplug rework) 6 kernel releases for someone to trigger that warning
> _if_ this was related to the hotplug rework.
> 
> What we have *now* is way worse: We have a possible bug that triggered
> the warning. As we see in report the code in question was _already_
> invoked on the wrong CPU. The preempt_disable() just silences the
> warning, hiding the real issue so nobody will do a thing about it since
> it will be never reported again (in a kernel with preemption and debug
> enabled).

Fully agree with everything you said!

We could extend the warning to e.g. print affinity mask of the thread,
and e.g. state of cpus that are subject to ongoing hotplug/hotremove.
But maybe it's not so useful in general, as the common case is likely
indeed a missing preempt_disable, and this is an exception? In any case,
I would hope that Steven applies some patch locally and we get more
details about what's going on at that MIPS machine.
Thomas Gleixner June 13, 2018, 9:46 p.m. UTC | #5
On Thu, 10 May 2018, Vlastimil Babka wrote:
> On 05/10/2018 12:35 AM, Sebastian Andrzej Siewior wrote:
> >> The only thing this buys us is that people will hassle us if we forget
> >> to fix the bug, and how pathetic is that?  I mean, we may as well put
> >>
> >> 	printk("don't forget to fix the vmstat_update() bug!\n");
> > 
> > No that is different. That would be seen by everyone. The bug was only
> > reported by Steven J. Hill which did not respond since. This message
> > would also imply that we know how to fix the bug but didn't do it yet
> > which is not the case. We seen that something was wrong but have no idea
> > *how* it got there.
> > 
> > The preempt_disable() was added by the end of v4.16. The
> > smp_processor_id() in vmstat_update() was added in commit 7cc36bbddde5
> > ("vmstat: on-demand vmstat workers V8") which was in v3.18-rc1. The
> > hotplug rework took place in v4.10-rc1. And it took (counting from the
> > hotplug rework) 6 kernel releases for someone to trigger that warning
> > _if_ this was related to the hotplug rework.
> > 
> > What we have *now* is way worse: We have a possible bug that triggered
> > the warning. As we see in report the code in question was _already_
> > invoked on the wrong CPU. The preempt_disable() just silences the
> > warning, hiding the real issue so nobody will do a thing about it since
> > it will be never reported again (in a kernel with preemption and debug
> > enabled).
> 
> Fully agree with everything you said!
> 
> We could extend the warning to e.g. print affinity mask of the thread,
> and e.g. state of cpus that are subject to ongoing hotplug/hotremove.
> But maybe it's not so useful in general, as the common case is likely
> indeed a missing preempt_disable, and this is an exception? In any case,
> I would hope that Steven applies some patch locally and we get more
> details about what's going on at that MIPS machine.

So after this went completely silent and S.J. Hill seems to be lost in the
intertubes, I spent quite some time staring at that code and the commit
in question:

    "Attempting to hotplug CPUs with CONFIG_VM_EVENT_COUNTERS enabled can
     cause vmstat_update() to report a BUG due to preemption not being
     disabled around smp_processor_id()."

That changelog is pretty much useless as it just decscribes the symptom and
the 'fix' follows suit by papering over that symptom.

Plus it's even more obscure that only the queue_delayed_work_on() bit needs
to be fixed because vmstat_update() does:

        if (refresh_cpu_vm_stats(true))
	      queue_delayed_work_on(smp_processor_id() ....);

and refresh_cpu_vm_stats() is full of this_cpu_* accesses which all are
equipped with __this_cpu_preempt_check() calls which depend on
CONFIG_DEBUG_PREEMPT as well.

So how does only the queue_delayed_work_on(smp_processor_id()) part
trigger? That does not make any sense at all.

Can we please revert that master piece of duct tape engineering and wait
for someone to actually trigger the warning again? All we can hope it's
someone who really sits down and does a proper analysis of the problem
instead of throwing some half baken 'works for me' crap over the fence and
then running as fast as it goes.

As I was at it I stared at the hotplug code some more.

vmstat_update() is a delayed work function which is scheduled on a
particular per CPU mm_percpu_wq. In case of CPU hotplug the work (and the
timer) is canceled _before_ the per CPU workqueues are unbound. It cannot
be requeued because vmstat_shepherd() on some other CPU would be stuck in
get_online_cpus() and by the time it's unstuck the CPU is gone from the
online cpumask.

So this looks all about correct, but there is a very subtle case where the
above has a hole. That requires to execute the hotplug state control and
not the full /sys/..../cpu$N/online mechanism, e.g. by doing:

# echo $PERF_ONLINE_STATE > /sys/devices/system/cpu/cpu1/hotplug/target

CPU0					CPU1		

do_cpu_down(1, CPUHP_AP_PERF_ONLINE)
 write_lock_cpus()
 __cpu_down(1, CPUHP_AP_PERF_ONLINE)
  kick_cpu(1);
  wait_for_completion(); 		while (state > CPUHP_AP_PERF_ONLINE)
					   invoke_shutdown_callback(state--);

					 That invokes:

					     vmstat_cpu_offline();
					       cancel_delayed_work();
					     
					     ...
					     					     
					     workqueue_offline_cpu()
					       unbind_workers();
					     ...
					
					complete();
 write_unlock_cpus();

Note, after this returns, CPU1 is still in the online mask. So the next
invocation of vmstat_sheperd() can queue the work on CPU1 again.

If that happens then the work will run on an unbound work queue somewhere
if I'm not completely misreading the workqueue code. Tejun ???

I'll try that tomorrow if nobody beats me to it, but that's the only way I
found how the debug warning can trigger. That does not explain why it
triggers only the smp_processor_id() thing and not the this_cpu_* check,
but I don't trust that information in the changelog at all.

Thanks,

	tglx
Andrew Morton June 14, 2018, 9:27 p.m. UTC | #6
On Wed, 13 Jun 2018 23:46:45 +0200 (CEST) Thomas Gleixner <tglx@linutronix.de> wrote:

> Can we please revert that master piece of duct tape engineering and wait
> for someone to actually trigger the warning again?

OK.
Steven Rostedt June 27, 2018, 7:40 p.m. UTC | #7
On Thu, Jun 14, 2018 at 02:27:10PM -0700, Andrew Morton wrote:
> On Wed, 13 Jun 2018 23:46:45 +0200 (CEST) Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > Can we please revert that master piece of duct tape engineering and wait
> > for someone to actually trigger the warning again?
> 
> OK.

And while we're at it, can we revert it from stable as well. As this is just
an overly aggressive pulling anything that looks like a fix into stable.

-- Steve
diff mbox

Patch

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 33581be705f0..40b2db6db6b1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1839,11 +1839,9 @@  static void vmstat_update(struct work_struct *w)
 		 * to occur in the future. Keep on running the
 		 * update worker thread.
 		 */
-		preempt_disable();
 		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
 				this_cpu_ptr(&vmstat_work),
 				round_jiffies_relative(sysctl_stat_interval));
-		preempt_enable();
 	}
 }