
[v2,1/1] rcu/tree: Drop the lock before entering the page allocator

Message ID 20200727211012.30948-1-urezki@gmail.com (mailing list archive)
State New, archived
Series [v2,1/1] rcu/tree: Drop the lock before entering the page allocator

Commit Message

Uladzislau Rezki July 27, 2020, 9:10 p.m. UTC
If the kernel is built with the CONFIG_PROVE_RAW_LOCK_NESTING
option, lockdep will complain about a violation of the
nesting rules:

<snip>
[   28.060389] =============================
[   28.060389] [ BUG: Invalid wait context ]
[   28.060389] 5.8.0-rc3-rcu #211 Tainted: G            E
[   28.060389] -----------------------------
[   28.060390] vmalloc_test/0/523 is trying to lock:
[   28.060390] ffff96df7ffe0228 (&zone->lock){-.-.}-{3:3}, at: get_page_from_freelist+0xcf0/0x16d0
[   28.060391] other info that might help us debug this:
[   28.060391] context-{5:5}
[   28.060392] 2 locks held by vmalloc_test/0/523:
[   28.060392]  #0: ffffffffc06750d0 (prepare_for_test_rwsem){++++}-{4:4}, at: test_func+0x76/0x240 [test_vmalloc]
[   28.060393]  #1: ffff96df5fa1d390 (krc.lock){..-.}-{2:2}, at: kvfree_call_rcu+0x5c/0x230
[   28.060395] stack backtrace:
[   28.060395] CPU: 0 PID: 523 Comm: vmalloc_test/0 Tainted: G            E     5.8.0-rc3-rcu #211
[   28.060395] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[   28.060396] Call Trace:
[   28.060397]  dump_stack+0x96/0xd0
[   28.060397]  __lock_acquire.cold.65+0x166/0x2d7
[   28.060398]  ? find_held_lock+0x2d/0x90
[   28.060399]  lock_acquire+0xad/0x370
[   28.060400]  ? get_page_from_freelist+0xcf0/0x16d0
[   28.060401]  ? mark_held_locks+0x48/0x70
[   28.060402]  _raw_spin_lock+0x25/0x30
[   28.060403]  ? get_page_from_freelist+0xcf0/0x16d0
[   28.060404]  get_page_from_freelist+0xcf0/0x16d0
[   28.060405]  ? __lock_acquire+0x3ee/0x1b90
[   28.060407]  __alloc_pages_nodemask+0x16a/0x3a0
[   28.060408]  __get_free_pages+0xd/0x30
[   28.060409]  kvfree_call_rcu+0x18a/0x230
<snip>

Internally, kfree_rcu() uses a raw_spinlock_t, whereas the
page allocator uses a spinlock_t to access
its zones.

To prevent the violation in question, we
can drop the internal raw_spinlock_t before entering
the page allocator.

Short changelog (v1 -> v2):
    - rework the commit message;
    - rework the patch making it smaller;
    - add more comments.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 kernel/rcu/tree.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)
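
In condensed form, the locking pattern the patch introduces looks
roughly like the sketch below. This is an editorial illustration of
the change, not the kernel code itself: the helper name
alloc_bulk_page() is invented, and the surrounding bulk-list
bookkeeping is elided.

	/* Sketch only; this exact helper does not exist in tree.c. */
	static struct kvfree_rcu_bulk_data *
	alloc_bulk_page(struct kfree_rcu_cpu *krcp)
		__releases(&krcp->lock) __acquires(&krcp->lock)
	{
		struct kvfree_rcu_bulk_data *bnode;

		lockdep_assert_held(&krcp->lock);
		lockdep_assert_irqs_disabled();

		/*
		 * zone->lock inside the page allocator is a spinlock_t,
		 * which must not be acquired with a raw_spinlock_t held.
		 * Drop the raw lock first; IRQs stay disabled, so the
		 * task remains on this CPU until the lock is retaken.
		 */
		raw_spin_unlock(&krcp->lock);
		bnode = (struct kvfree_rcu_bulk_data *)
			__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
		raw_spin_lock(&krcp->lock);

		return bnode;
	}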

Comments

Joel Fernandes July 30, 2020, 12:53 p.m. UTC | #1
On Mon, Jul 27, 2020 at 11:10:12PM +0200, Uladzislau Rezki (Sony) wrote:
> If the kernel is built with the CONFIG_PROVE_RAW_LOCK_NESTING
> option, lockdep will complain about a violation of the
> nesting rules:
> 
> <snip>
> [   28.060389] =============================
> [   28.060389] [ BUG: Invalid wait context ]
> [   28.060389] 5.8.0-rc3-rcu #211 Tainted: G            E
> [   28.060389] -----------------------------
> [   28.060390] vmalloc_test/0/523 is trying to lock:
> [   28.060390] ffff96df7ffe0228 (&zone->lock){-.-.}-{3:3}, at: get_page_from_freelist+0xcf0/0x16d0
> [   28.060391] other info that might help us debug this:
> [   28.060391] context-{5:5}
> [   28.060392] 2 locks held by vmalloc_test/0/523:
> [   28.060392]  #0: ffffffffc06750d0 (prepare_for_test_rwsem){++++}-{4:4}, at: test_func+0x76/0x240 [test_vmalloc]
> [   28.060393]  #1: ffff96df5fa1d390 (krc.lock){..-.}-{2:2}, at: kvfree_call_rcu+0x5c/0x230
> [   28.060395] stack backtrace:
> [   28.060395] CPU: 0 PID: 523 Comm: vmalloc_test/0 Tainted: G            E     5.8.0-rc3-rcu #211
> [   28.060395] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> [   28.060396] Call Trace:
> [   28.060397]  dump_stack+0x96/0xd0
> [   28.060397]  __lock_acquire.cold.65+0x166/0x2d7
> [   28.060398]  ? find_held_lock+0x2d/0x90
> [   28.060399]  lock_acquire+0xad/0x370
> [   28.060400]  ? get_page_from_freelist+0xcf0/0x16d0
> [   28.060401]  ? mark_held_locks+0x48/0x70
> [   28.060402]  _raw_spin_lock+0x25/0x30
> [   28.060403]  ? get_page_from_freelist+0xcf0/0x16d0
> [   28.060404]  get_page_from_freelist+0xcf0/0x16d0
> [   28.060405]  ? __lock_acquire+0x3ee/0x1b90
> [   28.060407]  __alloc_pages_nodemask+0x16a/0x3a0
> [   28.060408]  __get_free_pages+0xd/0x30
> [   28.060409]  kvfree_call_rcu+0x18a/0x230
> <snip>
> 
> Internally, kfree_rcu() uses a raw_spinlock_t, whereas the
> page allocator uses a spinlock_t to access
> its zones.
> 
> To prevent the violation in question, we
> can drop the internal raw_spinlock_t before entering
> the page allocator.
> 
> Short changelog (v1 -> v2):
>     - rework the commit message;
>     - rework the patch making it smaller;
>     - add more comments.
> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  kernel/rcu/tree.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 21c2fa5bd8c3..2de112404121 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3287,6 +3287,8 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
>  		return false;
>  
>  	lockdep_assert_held(&krcp->lock);
> +	lockdep_assert_irqs_disabled();
> +
>  	idx = !!is_vmalloc_addr(ptr);
>  
>  	/* Check if a new block is required. */
> @@ -3306,6 +3308,29 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
>  			if (IS_ENABLED(CONFIG_PREEMPT_RT))
>  				return false;
>  
> +			/*
> +			 * If built with the CONFIG_PROVE_RAW_LOCK_NESTING option,
> +			 * lockdep will complain about a violation of the
> +			 * nesting rules. It performs the raw_spinlock vs. spinlock
> +			 * nesting checks.
> +			 *
> +			 * That is why we drop the raw lock. Please note that IRQs
> +			 * are still disabled; this guarantees that "current" stays
> +			 * on the same CPU later on, when the raw lock is taken
> +			 * back.
> +			 *
> +			 * This is important because, if the page allocator is
> +			 * invoked in a fully preemptible context, we could get a
> +			 * page but end up on another CPU. That other CPU might not
> +			 * need a page, because it still has free spots in its
> +			 * internal array for collecting pointers. Staying on the
> +			 * same CPU eliminates this issue.
> +			 *
> +			 * Dropping the lock also reduces the critical section by
> +			 * the time taken by the page allocator to obtain a page.
> +			 */
> +			raw_spin_unlock(&krcp->lock);
> +
>  			/*
>  			 * NOTE: For one argument of kvfree_rcu() we can
>  			 * drop the lock and get the page in sleepable
> @@ -3315,6 +3340,8 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
>  			 */
>  			bnode = (struct kvfree_rcu_bulk_data *)
>  				__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> +
> +			raw_spin_lock(&krcp->lock);

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

thanks,

 - Joel


>  		}
>  
>  		/* Switch to emergency path. */
> -- 
> 2.20.1
>
Chris Wilson Oct. 28, 2020, 12:10 p.m. UTC | #2
Quoting Uladzislau Rezki (Sony) (2020-07-27 22:10:12)
> If the kernel is built with the CONFIG_PROVE_RAW_LOCK_NESTING
> option, lockdep will complain about a violation of the
> nesting rules:
> 
> <snip>
> [   28.060389] =============================
> [   28.060389] [ BUG: Invalid wait context ]
> [   28.060389] 5.8.0-rc3-rcu #211 Tainted: G            E
> [   28.060389] -----------------------------
> [   28.060390] vmalloc_test/0/523 is trying to lock:
> [   28.060390] ffff96df7ffe0228 (&zone->lock){-.-.}-{3:3}, at: get_page_from_freelist+0xcf0/0x16d0
> [   28.060391] other info that might help us debug this:
> [   28.060391] context-{5:5}
> [   28.060392] 2 locks held by vmalloc_test/0/523:
> [   28.060392]  #0: ffffffffc06750d0 (prepare_for_test_rwsem){++++}-{4:4}, at: test_func+0x76/0x240 [test_vmalloc]
> [   28.060393]  #1: ffff96df5fa1d390 (krc.lock){..-.}-{2:2}, at: kvfree_call_rcu+0x5c/0x230
> [   28.060395] stack backtrace:
> [   28.060395] CPU: 0 PID: 523 Comm: vmalloc_test/0 Tainted: G            E     5.8.0-rc3-rcu #211
> [   28.060395] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> [   28.060396] Call Trace:
> [   28.060397]  dump_stack+0x96/0xd0
> [   28.060397]  __lock_acquire.cold.65+0x166/0x2d7
> [   28.060398]  ? find_held_lock+0x2d/0x90
> [   28.060399]  lock_acquire+0xad/0x370
> [   28.060400]  ? get_page_from_freelist+0xcf0/0x16d0
> [   28.060401]  ? mark_held_locks+0x48/0x70
> [   28.060402]  _raw_spin_lock+0x25/0x30
> [   28.060403]  ? get_page_from_freelist+0xcf0/0x16d0
> [   28.060404]  get_page_from_freelist+0xcf0/0x16d0
> [   28.060405]  ? __lock_acquire+0x3ee/0x1b90
> [   28.060407]  __alloc_pages_nodemask+0x16a/0x3a0
> [   28.060408]  __get_free_pages+0xd/0x30
> [   28.060409]  kvfree_call_rcu+0x18a/0x230
> <snip>

We've encountered the same warning, and applying this patch resolves
the issue.

> Internally, kfree_rcu() uses a raw_spinlock_t, whereas the
> page allocator uses a spinlock_t to access
> its zones.
> 
> To prevent the violation in question, we
> can drop the internal raw_spinlock_t before entering
> the page allocator.
> 
> Short changelog (v1 -> v2):
>     - rework the commit message;
>     - rework the patch making it smaller;
>     - add more comments.
> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  kernel/rcu/tree.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 21c2fa5bd8c3..2de112404121 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3287,6 +3287,8 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
>                 return false;
>  
>         lockdep_assert_held(&krcp->lock);
> +       lockdep_assert_irqs_disabled();
> +
>         idx = !!is_vmalloc_addr(ptr);
>  
>         /* Check if a new block is required. */
> @@ -3306,6 +3308,29 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
>                         if (IS_ENABLED(CONFIG_PREEMPT_RT))
>                                 return false;
>  
> +                       /*
> +                        * If built with the CONFIG_PROVE_RAW_LOCK_NESTING option,
> +                        * lockdep will complain about a violation of the
> +                        * nesting rules. It performs the raw_spinlock vs. spinlock
> +                        * nesting checks.
> +                        *
> +                        * That is why we drop the raw lock. Please note that IRQs
> +                        * are still disabled; this guarantees that "current" stays
> +                        * on the same CPU later on, when the raw lock is taken
> +                        * back.
> +                        *
> +                        * This is important because, if the page allocator is
> +                        * invoked in a fully preemptible context, we could get a
> +                        * page but end up on another CPU. That other CPU might not
> +                        * need a page, because it still has free spots in its
> +                        * internal array for collecting pointers. Staying on the
> +                        * same CPU eliminates this issue.
> +                        *
> +                        * Dropping the lock also reduces the critical section by
> +                        * the time taken by the page allocator to obtain a page.
> +                        */
> +                       raw_spin_unlock(&krcp->lock);

Aiui, this makes the CONFIG_PREEMPT_RT check and subsequent comment redundant.

>                         /*
>                          * NOTE: For one argument of kvfree_rcu() we can
>                          * drop the lock and get the page in sleepable
> @@ -3315,6 +3340,8 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
>                          */
>                         bnode = (struct kvfree_rcu_bulk_data *)
>                                 __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> +
> +                       raw_spin_lock(&krcp->lock);
>                 }

The consequence of dropping the lock here is that we may create two
bnodes and add them both to krcp->bkvhead[idx]. The list remains intact,
but the second entry may be underutilised. That should be transient.
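
One hypothetical way to bound that waste (not something this patch
attempts, and assuming the existing bkvhead[]/KVFREE_BULK_MAX_ENTR
bookkeeping in tree.c) would be to re-check the list head once the
raw lock is retaken, roughly:

	raw_spin_lock(&krcp->lock);
	/*
	 * Hypothetical mitigation, not part of this patch: if a block
	 * with spare slots appeared while the lock was dropped, give
	 * the freshly allocated page back and reuse the existing block.
	 */
	if (bnode && krcp->bkvhead[idx] &&
	    krcp->bkvhead[idx]->nr_records < KVFREE_BULK_MAX_ENTR) {
		free_page((unsigned long)bnode);
		bnode = NULL;
	}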

It doesn't seem like the patch will break things, and removes the
lockdep splat,
Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris
Paul E. McKenney Oct. 28, 2020, 3:54 p.m. UTC | #3
On Wed, Oct 28, 2020 at 12:10:06PM +0000, Chris Wilson wrote:
> Quoting Uladzislau Rezki (Sony) (2020-07-27 22:10:12)
> > If the kernel is built with the CONFIG_PROVE_RAW_LOCK_NESTING
> > option, lockdep will complain about a violation of the
> > nesting rules:
> > 
> > <snip>
> > [   28.060389] =============================
> > [   28.060389] [ BUG: Invalid wait context ]
> > [   28.060389] 5.8.0-rc3-rcu #211 Tainted: G            E
> > [   28.060389] -----------------------------
> > [   28.060390] vmalloc_test/0/523 is trying to lock:
> > [   28.060390] ffff96df7ffe0228 (&zone->lock){-.-.}-{3:3}, at: get_page_from_freelist+0xcf0/0x16d0
> > [   28.060391] other info that might help us debug this:
> > [   28.060391] context-{5:5}
> > [   28.060392] 2 locks held by vmalloc_test/0/523:
> > [   28.060392]  #0: ffffffffc06750d0 (prepare_for_test_rwsem){++++}-{4:4}, at: test_func+0x76/0x240 [test_vmalloc]
> > [   28.060393]  #1: ffff96df5fa1d390 (krc.lock){..-.}-{2:2}, at: kvfree_call_rcu+0x5c/0x230
> > [   28.060395] stack backtrace:
> > [   28.060395] CPU: 0 PID: 523 Comm: vmalloc_test/0 Tainted: G            E     5.8.0-rc3-rcu #211
> > [   28.060395] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > [   28.060396] Call Trace:
> > [   28.060397]  dump_stack+0x96/0xd0
> > [   28.060397]  __lock_acquire.cold.65+0x166/0x2d7
> > [   28.060398]  ? find_held_lock+0x2d/0x90
> > [   28.060399]  lock_acquire+0xad/0x370
> > [   28.060400]  ? get_page_from_freelist+0xcf0/0x16d0
> > [   28.060401]  ? mark_held_locks+0x48/0x70
> > [   28.060402]  _raw_spin_lock+0x25/0x30
> > [   28.060403]  ? get_page_from_freelist+0xcf0/0x16d0
> > [   28.060404]  get_page_from_freelist+0xcf0/0x16d0
> > [   28.060405]  ? __lock_acquire+0x3ee/0x1b90
> > [   28.060407]  __alloc_pages_nodemask+0x16a/0x3a0
> > [   28.060408]  __get_free_pages+0xd/0x30
> > [   28.060409]  kvfree_call_rcu+0x18a/0x230
> > <snip>
> 
> We've encountered the same warning, and applying this patch resolves
> the issue.
> 
> > Internally, kfree_rcu() uses a raw_spinlock_t, whereas the
> > page allocator uses a spinlock_t to access
> > its zones.
> > 
> > To prevent the violation in question, we
> > can drop the internal raw_spinlock_t before entering
> > the page allocator.
> > 
> > Short changelog (v1 -> v2):
> >     - rework the commit message;
> >     - rework the patch making it smaller;
> >     - add more comments.
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > ---
> >  kernel/rcu/tree.c | 27 +++++++++++++++++++++++++++
> >  1 file changed, 27 insertions(+)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 21c2fa5bd8c3..2de112404121 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3287,6 +3287,8 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> >                 return false;
> >  
> >         lockdep_assert_held(&krcp->lock);
> > +       lockdep_assert_irqs_disabled();
> > +
> >         idx = !!is_vmalloc_addr(ptr);
> >  
> >         /* Check if a new block is required. */
> > @@ -3306,6 +3308,29 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> >                         if (IS_ENABLED(CONFIG_PREEMPT_RT))
> >                                 return false;
> >  
> > +                       /*
> > +                        * If built with the CONFIG_PROVE_RAW_LOCK_NESTING option,
> > +                        * lockdep will complain about a violation of the
> > +                        * nesting rules. It performs the raw_spinlock vs. spinlock
> > +                        * nesting checks.
> > +                        *
> > +                        * That is why we drop the raw lock. Please note that IRQs
> > +                        * are still disabled; this guarantees that "current" stays
> > +                        * on the same CPU later on, when the raw lock is taken
> > +                        * back.
> > +                        *
> > +                        * This is important because, if the page allocator is
> > +                        * invoked in a fully preemptible context, we could get a
> > +                        * page but end up on another CPU. That other CPU might not
> > +                        * need a page, because it still has free spots in its
> > +                        * internal array for collecting pointers. Staying on the
> > +                        * same CPU eliminates this issue.
> > +                        *
> > +                        * Dropping the lock also reduces the critical section by
> > +                        * the time taken by the page allocator to obtain a page.
> > +                        */
> > +                       raw_spin_unlock(&krcp->lock);
> 
> Aiui, this makes the CONFIG_PREEMPT_RT check and subsequent comment redundant.
> 
> >                         /*
> >                          * NOTE: For one argument of kvfree_rcu() we can
> >                          * drop the lock and get the page in sleepable
> > @@ -3315,6 +3340,8 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> >                          */
> >                         bnode = (struct kvfree_rcu_bulk_data *)
> >                                 __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > +
> > +                       raw_spin_lock(&krcp->lock);
> >                 }
> 
> The consequence of dropping the lock here is that we may create two
> bnodes and add them both to krcp->bkvhead[idx]. The list remains intact,
> but the second entry may be underutilised. That should be transient.
> 
> It doesn't seem like the patch will break things, and removes the
> lockdep splat,
> Tested-by: Chris Wilson <chris@chris-wilson.co.uk>

Thank you for testing this, Chris!  We had objections to all patches
thus far, so we are going back to the drawing board, hopefully having
something for the v5.11 merge window.

							Thanx, Paul

Patch

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 21c2fa5bd8c3..2de112404121 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3287,6 +3287,8 @@  kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
 		return false;
 
 	lockdep_assert_held(&krcp->lock);
+	lockdep_assert_irqs_disabled();
+
 	idx = !!is_vmalloc_addr(ptr);
 
 	/* Check if a new block is required. */
@@ -3306,6 +3308,29 @@  kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
 			if (IS_ENABLED(CONFIG_PREEMPT_RT))
 				return false;
 
+			/*
+			 * If built with the CONFIG_PROVE_RAW_LOCK_NESTING option,
+			 * lockdep will complain about a violation of the
+			 * nesting rules. It performs the raw_spinlock vs. spinlock
+			 * nesting checks.
+			 *
+			 * That is why we drop the raw lock. Please note that IRQs
+			 * are still disabled; this guarantees that "current" stays
+			 * on the same CPU later on, when the raw lock is taken
+			 * back.
+			 *
+			 * This is important because, if the page allocator is
+			 * invoked in a fully preemptible context, we could get a
+			 * page but end up on another CPU. That other CPU might not
+			 * need a page, because it still has free spots in its
+			 * internal array for collecting pointers. Staying on the
+			 * same CPU eliminates this issue.
+			 *
+			 * Dropping the lock also reduces the critical section by
+			 * the time taken by the page allocator to obtain a page.
+			 */
+			raw_spin_unlock(&krcp->lock);
+
 			/*
 			 * NOTE: For one argument of kvfree_rcu() we can
 			 * drop the lock and get the page in sleepable
@@ -3315,6 +3340,8 @@  kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
 			 */
 			bnode = (struct kvfree_rcu_bulk_data *)
 				__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
+
+			raw_spin_lock(&krcp->lock);
 		}
 
 		/* Switch to emergency path. */