btrfs: fix waitqueue_active without memory barrier in btrfs
diff mbox

Message ID 17EC94B0A072C34B8DCF0D30AD16044A02874794@BPXM09GP.gisp.nec.co.jp
State New
Headers show

Commit Message

Kosuke Tatsukawa Oct. 9, 2015, 12:35 a.m. UTC
btrfs_bio_counter_sub() seems to be missing a memory barrier which might
cause the waker to not notice the waiter and miss sending a wake_up as
in the following figure.

	btrfs_bio_counter_sub			btrfs_rm_dev_replace_blocked
------------------------------------------------------------------------
if (waitqueue_active(&fs_info->replace_wait))
/* The CPU might reorder the test for
   the waitqueue up here, before
   prior writes complete */
					/* wait_event */
					 /* __wait_event */
					  /* ___wait_event */
					  long __int = prepare_to_wait_event(&wq,
					    &__wait, state);
					  if (!percpu_counter_sum(&fs_info->bio_counter))
percpu_counter_sub(&fs_info->bio_counter,
  amount);
					  schedule()
------------------------------------------------------------------------

This patch removes the call to waitqueue_active() leaving just wake_up()
behind.  This fixes the problem because the call to spin_lock_irqsave()
in wake_up() will be an ACQUIRE operation.

I found this issue when I was looking through the linux source code
for places calling waitqueue_active() before wake_up*(), but without
preceding memory barriers, after sending a patch to fix a similar
issue in drivers/tty/n_tty.c  (Details about the original issue can be
found here: https://lkml.org/lkml/2015/9/28/849).

Signed-off-by: Kosuke Tatsukawa <tatsu@ab.jp.nec.com>
---
 fs/btrfs/dev-replace.c |    4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

Comments

Josef Bacik Oct. 9, 2015, 4:32 a.m. UTC | #1
On 10/08/2015 05:35 PM, Kosuke Tatsukawa wrote:
> btrfs_bio_counter_sub() seems to be missing a memory barrier which might
> cause the waker to not notice the waiter and miss sending a wake_up as
> in the following figure.
> 
> 	btrfs_bio_counter_sub			btrfs_rm_dev_replace_blocked
> ------------------------------------------------------------------------
> if (waitqueue_active(&fs_info->replace_wait))
> /* The CPU might reorder the test for
>     the waitqueue up here, before
>     prior writes complete */
> 					/* wait_event */
> 					 /* __wait_event */
> 					  /* ___wait_event */
> 					  long __int = prepare_to_wait_event(&wq,
> 					    &__wait, state);
> 					  if (!percpu_counter_sum(&fs_info->bio_counter))
> percpu_counter_sub(&fs_info->bio_counter,
>    amount);
> 					  schedule()

percpu_counter_sub can't be reordered, in its most basic form it does
preempt_disable/enable which in its most basic form does barrier().  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kosuke Tatsukawa Oct. 9, 2015, 4:50 a.m. UTC | #2
Josef Bacik wrote:
> On 10/08/2015 05:35 PM, Kosuke Tatsukawa wrote:
>> btrfs_bio_counter_sub() seems to be missing a memory barrier which might
>> cause the waker to not notice the waiter and miss sending a wake_up as
>> in the following figure.
>> 
>> 	btrfs_bio_counter_sub			btrfs_rm_dev_replace_blocked
>> ------------------------------------------------------------------------
>> if (waitqueue_active(&fs_info->replace_wait))
>> /* The CPU might reorder the test for
>>     the waitqueue up here, before
>>     prior writes complete */
>> 					/* wait_event */
>> 					 /* __wait_event */
>> 					  /* ___wait_event */
>> 					  long __int = prepare_to_wait_event(&wq,
>> 					    &__wait, state);
>> 					  if (!percpu_counter_sum(&fs_info->bio_counter))
>> percpu_counter_sub(&fs_info->bio_counter,
>>    amount);
>> 					  schedule()
>
> percpu_counter_sub can't be reordered, in its most basic form it does
> preempt_disable/enable which in its most basic form does barrier().  Thanks,

It's not the compiler, but the CPU that is doing the reordering.

The CPU can delay the write of the counter, so that the following read
of &fs_info->replace_wait is completed first.  Hence a memory barrier is
required, and not just a barrier.
---
Kosuke TATSUKAWA  | 3rd IT Platform Department
                  | IT Platform Division, NEC Corporation
                  | tatsu@ab.jp.nec.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba Oct. 9, 2015, 2:21 p.m. UTC | #3
On Fri, Oct 09, 2015 at 12:35:48AM +0000, Kosuke Tatsukawa wrote:
> This patch removes the call to waitqueue_active() leaving just wake_up()
> behind.  This fixes the problem because the call to spin_lock_irqsave()
> in wake_up() will be an ACQUIRE operation.

Either we can switch it to wake_up or put the barrier before the check.
Not all instances of waitqueue_active need the barrier though.

> I found this issue when I was looking through the linux source code
> for places calling waitqueue_active() before wake_up*(), but without
> preceding memory barriers, after sending a patch to fix a similar
> issue in drivers/tty/n_tty.c  (Details about the original issue can be
> found here: https://lkml.org/lkml/2015/9/28/849).

There are more in btrfs:

https://www.mail-archive.com/linux-btrfs%40vger.kernel.org/msg41914.html

> @@ -918,9 +918,7 @@ void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info)
>  void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount)
>  {
>  	percpu_counter_sub(&fs_info->bio_counter, amount);
> -
> -	if (waitqueue_active(&fs_info->replace_wait))
> -		wake_up(&fs_info->replace_wait);
> +	wake_up(&fs_info->replace_wait);

Chris had a comment on that one in
https://www.mail-archive.com/linux-btrfs%40vger.kernel.org/msg42551.html
it's in performance critial context and the explicit wake_up is even
worse than the barrier.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kosuke Tatsukawa Oct. 10, 2015, 5:03 a.m. UTC | #4
David Sterba wrote:
> On Fri, Oct 09, 2015 at 12:35:48AM +0000, Kosuke Tatsukawa wrote:
>> This patch removes the call to waitqueue_active() leaving just wake_up()
>> behind.  This fixes the problem because the call to spin_lock_irqsave()
>> in wake_up() will be an ACQUIRE operation.
>
> Either we can switch it to wake_up or put the barrier before the check.
> Not all instances of waitqueue_active need the barrier though.
>
>> I found this issue when I was looking through the linux source code
>> for places calling waitqueue_active() before wake_up*(), but without
>> preceding memory barriers, after sending a patch to fix a similar
>> issue in drivers/tty/n_tty.c  (Details about the original issue can be
>> found here: https://lkml.org/lkml/2015/9/28/849).
>
> There are more in btrfs:
>
> https://www.mail-archive.com/linux-btrfs%40vger.kernel.org/msg41914.html

Thank you for the pointer.
Your patch seems better than mine.

I think the other places in btrfs that use waitqueue_active() before
wake_up are preceded by either a smp_mb or some kind of atomic
operation.

The latter still needs smp_mb__after_atomic() but it's light-weight
compared to smp_mb().


>> @@ -918,9 +918,7 @@ void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info)
>>  void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount)
>>  {
>>  	percpu_counter_sub(&fs_info->bio_counter, amount);
>> -
>> -	if (waitqueue_active(&fs_info->replace_wait))
>> -		wake_up(&fs_info->replace_wait);
>> +	wake_up(&fs_info->replace_wait);
>
> Chris had a comment on that one in
> https://www.mail-archive.com/linux-btrfs%40vger.kernel.org/msg42551.html
> it's in performance critial context and the explicit wake_up is even
> worse than the barrier.
---
Kosuke TATSUKAWA  | 3rd IT Platform Department
                  | IT Platform Division, NEC Corporation
                  | tatsu@ab.jp.nec.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index e54dd59..ecb3e71 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -918,9 +918,7 @@  void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info)
 void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount)
 {
 	percpu_counter_sub(&fs_info->bio_counter, amount);
-
-	if (waitqueue_active(&fs_info->replace_wait))
-		wake_up(&fs_info->replace_wait);
+	wake_up(&fs_info->replace_wait);
 }
 
 void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info)