[0/6] stackdepot, kasan, workqueue: Avoid expanding stackdepot slabs when holding raw_spin_lock

Message ID 20210907141307.1437816-1-elver@google.com (mailing list archive)

Message

Marco Elver Sept. 7, 2021, 2:13 p.m. UTC
Shuah Khan reported [1]:

 | When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled,
 | kasan_record_aux_stack() runs into "BUG: Invalid wait context" when
 | it tries to allocate memory attempting to acquire spinlock in page
 | allocation code while holding workqueue pool raw_spinlock.
 |
 | There are several instances of this problem when block layer tries
 | to __queue_work(). Call trace from one of these instances is below:
 |
 |     kblockd_mod_delayed_work_on()
 |       mod_delayed_work_on()
 |         __queue_delayed_work()
 |           __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held)
 |             insert_work()
 |               kasan_record_aux_stack()
 |                 kasan_save_stack()
 |                   stack_depot_save()
 |                     alloc_pages()
 |                       __alloc_pages()
 |                         get_page_from_freelist()
 |                           rm_queue()
 |                             rm_queue_pcplist()
 |                               local_lock_irqsave(&pagesets.lock, flags);
 |                               [ BUG: Invalid wait context triggered ]

[1] https://lkml.kernel.org/r/20210902200134.25603-1-skhan@linuxfoundation.org

PROVE_RAW_LOCK_NESTING is pointing out that (on RT kernels) the locking
rules are being violated. More generally, memory is being allocated from
a non-preemptive context (a raw_spin_lock'd critical section) where it is
not allowed.

To properly fix this, we must prevent stackdepot from replenishing its
"stack slab" pool if memory allocations cannot be done in the current
context: it's a bug to use either GFP_ATOMIC or GFP_NOWAIT in certain
non-preemptive contexts, including raw_spin_locks (see gfp.h and
ab00db216c9c7).
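
To make the constraint concrete, the disallowed pattern boils down to
something like this (a minimal, made-up sketch; demo_lock and
demo_bad_alloc are not code from this series):

#include <linux/gfp.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

static DEFINE_RAW_SPINLOCK(demo_lock);

static void demo_bad_alloc(void)
{
	unsigned long flags;
	void *p;

	raw_spin_lock_irqsave(&demo_lock, flags); /* raw_spinlock_t: spins even on RT */
	/*
	 * Even GFP_ATOMIC/GFP_NOWAIT may take spinlock_t/local_lock inside
	 * the page allocator; on PREEMPT_RT those are sleeping locks, so
	 * lockdep reports "BUG: Invalid wait context".
	 */
	p = kmalloc(16, GFP_ATOMIC);
	raw_spin_unlock_irqrestore(&demo_lock, flags);
	kfree(p);
}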

The only downside is that saving a stack trace may fail if stackdepot
runs out of space AND the same stack trace has not been recorded before.
I expect this to be unlikely, and a simple experiment (booting the
kernel) didn't result in any failure to record a stack trace from
insert_work().

The series includes a few minor fixes to stackdepot that I noticed in
preparing the series. It then introduces __stack_depot_save(), which
exposes the option to force stackdepot to not allocate any memory.
Finally, KASAN is changed to use the new stackdepot interface and
provide kasan_record_aux_stack_noalloc(), which is then used by
workqueue code.
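
For reference, a caller that must not allocate would use the new
interface roughly as follows (a minimal sketch based on the signature
proposed in patch 3; save_stack_noalloc() is made up for illustration):

#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/stackdepot.h>
#include <linux/stacktrace.h>

/*
 * Illustration only: save the current stack without ever calling
 * alloc_pages(), e.g. because the caller may hold a raw_spinlock.
 * Returns 0 if the trace is new and stackdepot has no space left.
 */
static depot_stack_handle_t save_stack_noalloc(void)
{
	unsigned long entries[64];
	unsigned int nr_entries;

	nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
	/* can_alloc == false: __stack_depot_save() must not replenish its pool */
	return __stack_depot_save(entries, nr_entries, GFP_NOWAIT, false);
}

kasan_record_aux_stack_noalloc() (patch 5) is the KASAN-side wrapper for
this mode, and patch 6 switches workqueue's insert_work() over to it.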

Marco Elver (6):
  lib/stackdepot: include gfp.h
  lib/stackdepot: remove unused function argument
  lib/stackdepot: introduce __stack_depot_save()
  kasan: common: provide can_alloc in kasan_save_stack()
  kasan: generic: introduce kasan_record_aux_stack_noalloc()
  workqueue, kasan: avoid alloc_pages() when recording stack

 include/linux/kasan.h      |  2 ++
 include/linux/stackdepot.h |  6 +++++
 kernel/workqueue.c         |  2 +-
 lib/stackdepot.c           | 51 ++++++++++++++++++++++++++++++--------
 mm/kasan/common.c          |  6 ++---
 mm/kasan/generic.c         | 14 +++++++++--
 mm/kasan/kasan.h           |  2 +-
 7 files changed, 65 insertions(+), 18 deletions(-)

Comments

Marco Elver Sept. 7, 2021, 2:17 p.m. UTC | #1
[+Cc: Thomas, Sebastian]

Sorry, forgot to Cc you... :-/

On Tue, 7 Sept 2021 at 16:14, Marco Elver <elver@google.com> wrote:
>
> Shuah Khan reported [1]:
>
>  | When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled,
>  | kasan_record_aux_stack() runs into "BUG: Invalid wait context" when
>  | it tries to allocate memory attempting to acquire spinlock in page
>  | allocation code while holding workqueue pool raw_spinlock.
>  |
>  | There are several instances of this problem when block layer tries
>  | to __queue_work(). Call trace from one of these instances is below:
>  |
>  |     kblockd_mod_delayed_work_on()
>  |       mod_delayed_work_on()
>  |         __queue_delayed_work()
>  |           __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held)
>  |             insert_work()
>  |               kasan_record_aux_stack()
>  |                 kasan_save_stack()
>  |                   stack_depot_save()
>  |                     alloc_pages()
>  |                       __alloc_pages()
>  |                         get_page_from_freelist()
>  |                           rm_queue()
>  |                             rm_queue_pcplist()
>  |                               local_lock_irqsave(&pagesets.lock, flags);
>  |                               [ BUG: Invalid wait context triggered ]
>
> [1] https://lkml.kernel.org/r/20210902200134.25603-1-skhan@linuxfoundation.org
>
> PROVE_RAW_LOCK_NESTING is pointing out that (on RT kernels) the locking
> rules are being violated. More generally, memory is being allocated from
> a non-preemptive context (raw_spin_lock'd c-s) where it is not allowed.
>
> To properly fix this, we must prevent stackdepot from replenishing its
> "stack slab" pool if memory allocations cannot be done in the current
> context: it's a bug to use either GFP_ATOMIC nor GFP_NOWAIT in certain
> non-preemptive contexts, including raw_spin_locks (see gfp.h and
> ab00db216c9c7).
>
> The only downside is that saving a stack trace may fail if: stackdepot
> runs out of space AND the same stack trace has not been recorded before.
> I expect this to be unlikely, and a simple experiment (boot the kernel)
> didn't result in any failure to record stack trace from insert_work().
>
> The series includes a few minor fixes to stackdepot that I noticed in
> preparing the series. It then introduces __stack_depot_save(), which
> exposes the option to force stackdepot to not allocate any memory.
> Finally, KASAN is changed to use the new stackdepot interface and
> provide kasan_record_aux_stack_noalloc(), which is then used by
> workqueue code.
>
> Marco Elver (6):
>   lib/stackdepot: include gfp.h
>   lib/stackdepot: remove unused function argument
>   lib/stackdepot: introduce __stack_depot_save()
>   kasan: common: provide can_alloc in kasan_save_stack()
>   kasan: generic: introduce kasan_record_aux_stack_noalloc()
>   workqueue, kasan: avoid alloc_pages() when recording stack
>
>  include/linux/kasan.h      |  2 ++
>  include/linux/stackdepot.h |  6 +++++
>  kernel/workqueue.c         |  2 +-
>  lib/stackdepot.c           | 51 ++++++++++++++++++++++++++++++--------
>  mm/kasan/common.c          |  6 ++---
>  mm/kasan/generic.c         | 14 +++++++++--
>  mm/kasan/kasan.h           |  2 +-
>  7 files changed, 65 insertions(+), 18 deletions(-)
>
> --
> 2.33.0.153.gba50c8fa24-goog
>
Shuah Khan Sept. 7, 2021, 8:05 p.m. UTC | #2
On 9/7/21 8:13 AM, Marco Elver wrote:
> Shuah Khan reported [1]:
> 
>   | When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled,
>   | kasan_record_aux_stack() runs into "BUG: Invalid wait context" when
>   | it tries to allocate memory attempting to acquire spinlock in page
>   | allocation code while holding workqueue pool raw_spinlock.
>   |
>   | There are several instances of this problem when block layer tries
>   | to __queue_work(). Call trace from one of these instances is below:
>   |
>   |     kblockd_mod_delayed_work_on()
>   |       mod_delayed_work_on()
>   |         __queue_delayed_work()
>   |           __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held)
>   |             insert_work()
>   |               kasan_record_aux_stack()
>   |                 kasan_save_stack()
>   |                   stack_depot_save()
>   |                     alloc_pages()
>   |                       __alloc_pages()
>   |                         get_page_from_freelist()
>   |                           rm_queue()
>   |                             rm_queue_pcplist()
>   |                               local_lock_irqsave(&pagesets.lock, flags);
>   |                               [ BUG: Invalid wait context triggered ]
> 
> [1] https://lkml.kernel.org/r/20210902200134.25603-1-skhan@linuxfoundation.org
> 
> PROVE_RAW_LOCK_NESTING is pointing out that (on RT kernels) the locking
> rules are being violated. More generally, memory is being allocated from
> a non-preemptive context (raw_spin_lock'd c-s) where it is not allowed.
> 
> To properly fix this, we must prevent stackdepot from replenishing its
> "stack slab" pool if memory allocations cannot be done in the current
> context: it's a bug to use either GFP_ATOMIC nor GFP_NOWAIT in certain
> non-preemptive contexts, including raw_spin_locks (see gfp.h and
> ab00db216c9c7).
> 
> The only downside is that saving a stack trace may fail if: stackdepot
> runs out of space AND the same stack trace has not been recorded before.
> I expect this to be unlikely, and a simple experiment (boot the kernel)
> didn't result in any failure to record stack trace from insert_work().
> 
> The series includes a few minor fixes to stackdepot that I noticed in
> preparing the series. It then introduces __stack_depot_save(), which
> exposes the option to force stackdepot to not allocate any memory.
> Finally, KASAN is changed to use the new stackdepot interface and
> provide kasan_record_aux_stack_noalloc(), which is then used by
> workqueue code.
> 
> Marco Elver (6):
>    lib/stackdepot: include gfp.h
>    lib/stackdepot: remove unused function argument
>    lib/stackdepot: introduce __stack_depot_save()
>    kasan: common: provide can_alloc in kasan_save_stack()
>    kasan: generic: introduce kasan_record_aux_stack_noalloc()
>    workqueue, kasan: avoid alloc_pages() when recording stack
> 
>   include/linux/kasan.h      |  2 ++
>   include/linux/stackdepot.h |  6 +++++
>   kernel/workqueue.c         |  2 +-
>   lib/stackdepot.c           | 51 ++++++++++++++++++++++++++++++--------
>   mm/kasan/common.c          |  6 ++---
>   mm/kasan/generic.c         | 14 +++++++++--
>   mm/kasan/kasan.h           |  2 +-
>   7 files changed, 65 insertions(+), 18 deletions(-)
> 

Thank you. Tested all 6 patches in this series on Linux 5.14. This problem
exists in 5.13 and needs to be marked for both the 5.14 and 5.13 stable
releases.

Here is my

Tested-by: Shuah Khan <skhan@linuxfoundation.org>

thanks,
-- Shuah
Vlastimil Babka Sept. 10, 2021, 10:50 a.m. UTC | #3
On 9/7/21 22:05, Shuah Khan wrote:
> On 9/7/21 8:13 AM, Marco Elver wrote:
>> Shuah Khan reported [1]:
>>
>>   | When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled,
>>   | kasan_record_aux_stack() runs into "BUG: Invalid wait context" when
>>   | it tries to allocate memory attempting to acquire spinlock in page
>>   | allocation code while holding workqueue pool raw_spinlock.
>>   |
>>   | There are several instances of this problem when block layer tries
>>   | to __queue_work(). Call trace from one of these instances is below:
>>   |
>>   |     kblockd_mod_delayed_work_on()
>>   |       mod_delayed_work_on()
>>   |         __queue_delayed_work()
>>   |           __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held)
>>   |             insert_work()
>>   |               kasan_record_aux_stack()
>>   |                 kasan_save_stack()
>>   |                   stack_depot_save()
>>   |                     alloc_pages()
>>   |                       __alloc_pages()
>>   |                         get_page_from_freelist()
>>   |                           rm_queue()
>>   |                             rm_queue_pcplist()
>>   |                               local_lock_irqsave(&pagesets.lock, flags);
>>   |                               [ BUG: Invalid wait context triggered ]
>>
>> [1]
>> https://lkml.kernel.org/r/20210902200134.25603-1-skhan@linuxfoundation.org
>>
>> PROVE_RAW_LOCK_NESTING is pointing out that (on RT kernels) the locking
>> rules are being violated. More generally, memory is being allocated from
>> a non-preemptive context (raw_spin_lock'd c-s) where it is not allowed.
>>
>> To properly fix this, we must prevent stackdepot from replenishing its
>> "stack slab" pool if memory allocations cannot be done in the current
>> context: it's a bug to use either GFP_ATOMIC nor GFP_NOWAIT in certain
>> non-preemptive contexts, including raw_spin_locks (see gfp.h and
>> ab00db216c9c7).
>>
>> The only downside is that saving a stack trace may fail if: stackdepot
>> runs out of space AND the same stack trace has not been recorded before.
>> I expect this to be unlikely, and a simple experiment (boot the kernel)
>> didn't result in any failure to record stack trace from insert_work().
>>
>> The series includes a few minor fixes to stackdepot that I noticed in
>> preparing the series. It then introduces __stack_depot_save(), which
>> exposes the option to force stackdepot to not allocate any memory.
>> Finally, KASAN is changed to use the new stackdepot interface and
>> provide kasan_record_aux_stack_noalloc(), which is then used by
>> workqueue code.
>>
>> Marco Elver (6):
>>    lib/stackdepot: include gfp.h
>>    lib/stackdepot: remove unused function argument
>>    lib/stackdepot: introduce __stack_depot_save()
>>    kasan: common: provide can_alloc in kasan_save_stack()
>>    kasan: generic: introduce kasan_record_aux_stack_noalloc()
>>    workqueue, kasan: avoid alloc_pages() when recording stack
>>
>>   include/linux/kasan.h      |  2 ++
>>   include/linux/stackdepot.h |  6 +++++
>>   kernel/workqueue.c         |  2 +-
>>   lib/stackdepot.c           | 51 ++++++++++++++++++++++++++++++--------
>>   mm/kasan/common.c          |  6 ++---
>>   mm/kasan/generic.c         | 14 +++++++++--
>>   mm/kasan/kasan.h           |  2 +-
>>   7 files changed, 65 insertions(+), 18 deletions(-)
>>
> 
> Thank you. Tested all the 6 patches in this series on Linux 5.14. This problem
> exists in 5.13 and needs to be marked for both 5.14 and 5.13 stable releases.

I think if this problem manifests only with CONFIG_PROVE_RAW_LOCK_NESTING
then it shouldn't be backported to stable. CONFIG_PROVE_RAW_LOCK_NESTING is
an experimental/development option to discover, ahead of time, what will
collide with RT lock semantics, without needing the full RT tree.
Thus, good to fix going forward, but not necessary to backport to stable.

> Here is my
> 
> Tested-by: Shuah Khan <skhan@linuxfoundation.org>
> 
> thanks,
> -- Shuah
>
Sebastian Andrzej Siewior Sept. 10, 2021, 3:28 p.m. UTC | #4
On 2021-09-10 12:50:51 [+0200], Vlastimil Babka wrote:
> > Thank you. Tested all the 6 patches in this series on Linux 5.14. This problem
> > exists in 5.13 and needs to be marked for both 5.14 and 5.13 stable releases.
> 
> I think if this problem manifests only with CONFIG_PROVE_RAW_LOCK_NESTING
> then it shouldn't be backported to stable. CONFIG_PROVE_RAW_LOCK_NESTING is
> an experimental/development option to earlier discover what will collide
> with RT lock semantics, without needing the full RT tree.
> Thus, good to fix going forward, but not necessary to stable backport.

  Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
for the series. Thank you.

As for the backport I agree here with Vlastimil.

I pulled it into my RT tree for some testing and it looked good. I had
to apply the following on top:
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3030,7 +3030,7 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func)
        head->func = func;
        head->next = NULL;
        local_irq_save(flags);
-       kasan_record_aux_stack(head);
+       kasan_record_aux_stack_noalloc(head);
        rdp = this_cpu_ptr(&rcu_data);
 
        /* Add the callback to our list. */

We could move kasan_record_aux_stack() before that local_irq_save(), but
call_rcu() can be called from a preempt-disabled section, so we would
have the same problem.

The second warning came from kasan_quarantine_remove_cache(). At the end,
per_cpu_remove_cache() -> qlist_free_all() will free memory with
interrupts disabled (due to that smp-function call).
Moving it to a kworker would solve the problem. I don't mind keeping that
smp-function call, assuming that it is all debug code and increases
overall latency anyway. But then could we maybe move all those objects
to a single list which is freed after on_each_cpu()?
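
Something along these lines, as a rough sketch only (reusing the qlist
helpers from mm/kasan/quarantine.c; handling of the global quarantine
batches is omitted and the shared-list locking would need more thought):

struct remove_cache_work {
	struct kmem_cache *cache;
	struct qlist_head to_free;	/* objects gathered from all CPUs */
	raw_spinlock_t lock;		/* only guards list pointer updates */
};

static void per_cpu_collect_cache(void *arg)
{
	struct remove_cache_work *w = arg;
	struct qlist_head local = QLIST_INIT;

	/* Runs in the on_each_cpu() IPI context: only unlink, don't free. */
	qlist_move_cache(this_cpu_ptr(&cpu_quarantine), &local, w->cache);

	raw_spin_lock(&w->lock);
	qlist_move_all(&local, &w->to_free);
	raw_spin_unlock(&w->lock);
}

void kasan_quarantine_remove_cache(struct kmem_cache *cache)
{
	struct remove_cache_work w = {
		.cache	 = cache,
		.to_free = QLIST_INIT,
		.lock	 = __RAW_SPIN_LOCK_UNLOCKED(w.lock),
	};

	on_each_cpu(per_cpu_collect_cache, &w, 1);
	/* global_quarantine batches would be drained here as today */

	/* Back in process context with interrupts enabled: free everything. */
	qlist_free_all(&w.to_free, cache);
}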

Otherwise I haven't seen any new warnings showing up with KASAN enabled.

Sebastian
Marco Elver Sept. 13, 2021, 6 a.m. UTC | #5
On Fri, 10 Sept 2021 at 17:28, Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
> On 2021-09-10 12:50:51 [+0200], Vlastimil Babka wrote:
> > > Thank you. Tested all the 6 patches in this series on Linux 5.14. This problem
> > > exists in 5.13 and needs to be marked for both 5.14 and 5.13 stable releases.
> >
> > I think if this problem manifests only with CONFIG_PROVE_RAW_LOCK_NESTING
> > then it shouldn't be backported to stable. CONFIG_PROVE_RAW_LOCK_NESTING is
> > an experimental/development option to earlier discover what will collide
> > with RT lock semantics, without needing the full RT tree.
> > Thus, good to fix going forward, but not necessary to stable backport.
>
>   Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> for the series. Thank you.

Thank you. I'll send v2 with Acks/Tested-by added and the comment
addition you suggested.

> As for the backport I agree here with Vlastimil.
>
> I pulled it into my RT tree for some testing and it looked good. I had
> to
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3030,7 +3030,7 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func)
>         head->func = func;
>         head->next = NULL;
>         local_irq_save(flags);
> -       kasan_record_aux_stack(head);
> +       kasan_record_aux_stack_noalloc(head);
>         rdp = this_cpu_ptr(&rcu_data);
>
>         /* Add the callback to our list. */
>
> We could move kasan_record_aux_stack() before that local_irq_save() but
> then call_rcu() can be called preempt-disabled section so we would have
> the same problem.
>
> The second warning came from kasan_quarantine_remove_cache(). At the end
> per_cpu_remove_cache() -> qlist_free_all() will free memory with
> disabled interrupts (due to that smp-function call).
> Moving it to kworker would solve the problem. I don't mind keeping that
> smp_function call assuming that it is all debug-code and it increases
> overall latency anyway. But then could we maybe move all those objects
> to a single list which freed after on_each_cpu()?

The quarantine is per-CPU, and I think what you suggest would
fundamentally change its design. If you have something that works on
RT without a fundamental change, that would be ideal (it is all debug
code and not used on non-KASAN kernels).


> Otherwise I haven't seen any new warnings showing up with KASAN enabled.
>
> Sebastian
Alexander Potapenko Sept. 14, 2021, 12:55 p.m. UTC | #6
On Tue, Sep 7, 2021 at 4:14 PM Marco Elver <elver@google.com> wrote:
>
> Shuah Khan reported [1]:
>
>  | When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled,
>  | kasan_record_aux_stack() runs into "BUG: Invalid wait context" when
>  | it tries to allocate memory attempting to acquire spinlock in page
>  | allocation code while holding workqueue pool raw_spinlock.
>  |
>  | There are several instances of this problem when block layer tries
>  | to __queue_work(). Call trace from one of these instances is below:
>  |
>  |     kblockd_mod_delayed_work_on()
>  |       mod_delayed_work_on()
>  |         __queue_delayed_work()
>  |           __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held)
>  |             insert_work()
>  |               kasan_record_aux_stack()
>  |                 kasan_save_stack()
>  |                   stack_depot_save()
>  |                     alloc_pages()
>  |                       __alloc_pages()
>  |                         get_page_from_freelist()
>  |                           rm_queue()
>  |                             rm_queue_pcplist()
>  |                               local_lock_irqsave(&pagesets.lock, flags);
>  |                               [ BUG: Invalid wait context triggered ]
>
> [1] https://lkml.kernel.org/r/20210902200134.25603-1-skhan@linuxfoundation.org
>
> PROVE_RAW_LOCK_NESTING is pointing out that (on RT kernels) the locking
> rules are being violated. More generally, memory is being allocated from
> a non-preemptive context (raw_spin_lock'd c-s) where it is not allowed.
>
> To properly fix this, we must prevent stackdepot from replenishing its
> "stack slab" pool if memory allocations cannot be done in the current
> context: it's a bug to use either GFP_ATOMIC nor GFP_NOWAIT in certain
> non-preemptive contexts, including raw_spin_locks (see gfp.h and
> ab00db216c9c7).
>
> The only downside is that saving a stack trace may fail if: stackdepot
> runs out of space AND the same stack trace has not been recorded before.
> I expect this to be unlikely, and a simple experiment (boot the kernel)
> didn't result in any failure to record stack trace from insert_work().
>
> The series includes a few minor fixes to stackdepot that I noticed in
> preparing the series. It then introduces __stack_depot_save(), which
> exposes the option to force stackdepot to not allocate any memory.
> Finally, KASAN is changed to use the new stackdepot interface and
> provide kasan_record_aux_stack_noalloc(), which is then used by
> workqueue code.
>
> Marco Elver (6):
>   lib/stackdepot: include gfp.h
>   lib/stackdepot: remove unused function argument
>   lib/stackdepot: introduce __stack_depot_save()
>   kasan: common: provide can_alloc in kasan_save_stack()
>   kasan: generic: introduce kasan_record_aux_stack_noalloc()
>   workqueue, kasan: avoid alloc_pages() when recording stack

Acked-by: Alexander Potapenko <glider@google.com>

for the whole series.

>
>  include/linux/kasan.h      |  2 ++
>  include/linux/stackdepot.h |  6 +++++
>  kernel/workqueue.c         |  2 +-
>  lib/stackdepot.c           | 51 ++++++++++++++++++++++++++++++--------
>  mm/kasan/common.c          |  6 ++---
>  mm/kasan/generic.c         | 14 +++++++++--
>  mm/kasan/kasan.h           |  2 +-
>  7 files changed, 65 insertions(+), 18 deletions(-)
>
> --
> 2.33.0.153.gba50c8fa24-goog
>