[4/5] btrfs: serialize blocking_writers updates

Message ID 2a310858bfa3754f6f7a4d4b7693959b0fdd7005.1571340084.git.dsterba@suse.com
State New
Series
  • Extent buffer locking and documentation

Commit Message

David Sterba Oct. 17, 2019, 7:39 p.m. UTC
There's one potential memory ordering problem regarding
eb::blocking_writers, although this hasn't been hit in practice. On TSO
(total store ordering) architectures, the writes will always be seen in
the right order.

In case the calls in btrfs_set_lock_blocking_write are seen in this
(wrong) order on another CPU:

0:  /* blocking_writers == 0 */
1:  write_unlock()
2:  blocking_writers = 1

btrfs_try_tree_read_lock, btrfs_tree_read_lock_atomic or
btrfs_try_tree_write_lock would have to squeeze all of this between 1:
and 2:

- check if blocking_writers is 0 (it is, continue)
- try read lock, read lock or write lock (succeeds after 1:)
- check blocking_writers again (continue)

All of that assumes that the reordered writes can survive for quite some
time (unlikely if it's in the internal store buffer).

Another point against observing that in practice is that
blocking_writers and the lock are on the same cacheline (64B), so it's
more likely both values are stored in order, or some sort of pending
write flush will update blocking_writers and the rwlock before the try
lock happens. Assuming the CPUs work like that and don't hide other
surprises.

struct extent_buffer {
	u64                        start;                /*     0     8 */
	long unsigned int          len;                  /*     8     8 */
	long unsigned int          bflags;               /*    16     8 */
	struct btrfs_fs_info *     fs_info;              /*    24     8 */
	spinlock_t                 refs_lock;            /*    32     4 */
	atomic_t                   refs;                 /*    36     4 */
	atomic_t                   io_pages;             /*    40     4 */
	int                        read_mirror;          /*    44     4 */
	struct callback_head callback_head __attribute__((__aligned__(8))); /*    48    16 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	pid_t                      lock_owner;           /*    64     4 */
	int                        blocking_writers;     /*    68     4 */
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	atomic_t                   blocking_readers;     /*    72     4 */
	bool                       lock_nested;          /*    76     1 */

	/* XXX 1 byte hole, try to pack */

	short int                  log_index;            /*    78     2 */
	rwlock_t                   lock;                 /*    80     8 */
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	wait_queue_head_t          write_lock_wq;        /*    88    24 */
	wait_queue_head_t          read_lock_wq;         /*   112    24 */
	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
	struct page *              pages[16];            /*   136   128 */

	/* size: 264, cachelines: 5, members: 18 */
	/* sum members: 263, holes: 1, sum holes: 1 */
	/* forced alignments: 1 */
	/* last cacheline: 8 bytes */
} __attribute__((__aligned__(8)));

Add the barriers for correctness' sake.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/locking.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

Comments

Nikolay Borisov Oct. 23, 2019, 9:57 a.m. UTC | #1
On 17.10.19 г. 22:39 ч., David Sterba wrote:
> There's one potential memory ordering problem regarding
> eb::blocking_writers, although this hasn't been hit in practice. On TSO
> (total store ordering) architectures, the writes will always be seen in
> right order.
> 
> In case the calls in btrfs_set_lock_blocking_write are seen in this
> (wrong) order on another CPU:
> 
> 0:  /* blocking_writers == 0 */
> 1:  write_unlock()
> 2:  blocking_writers = 1
> 
> btrfs_try_tree_read_lock, btrfs_tree_read_lock_atomic or
> btrfs_try_tree_write_lock would have to squeeze all of this between 1:
> and 2:

This is only a problem for unlocked (optimistic) accesses in those
functions, simply because from their POV the eb->lock doesn't play any
role, i.e. they don't know about it at all.

This implies there needs to be yet another synchronization/ordering
mechanism only for blocking_writers. But then further down you say that
there is no read-side barrier because observing the accesses in a
particular order is not important for correctness.

Given this, I fail to see what ordering this smp_wmb affects.
> 
> - check if blocking_writers is 0 (it is, continue)
> - try read lock, read lock or write lock (succeeds after 1:)
> - check blocking_writers again (continue)
> 
> All of that assumes that the reordered writes can survive for quite some
> time (unlikely if its in the internal store buffer).
> 
> Another point against observing that in practice is that
> blocking_writers and the lock are on the same cacheline (64B), so it's
> more likely both values are stored in order, or some sort of pending
> write flush will update blocking_writers, rwlock before the try lock
> happens. Assuming the CPUs work like that and don't hide other
> surprises.
> 
> struct extent_buffer {
> 	u64                        start;                /*     0     8 */
> 	long unsigned int          len;                  /*     8     8 */
> 	long unsigned int          bflags;               /*    16     8 */
> 	struct btrfs_fs_info *     fs_info;              /*    24     8 */
> 	spinlock_t                 refs_lock;            /*    32     4 */
> 	atomic_t                   refs;                 /*    36     4 */
> 	atomic_t                   io_pages;             /*    40     4 */
> 	int                        read_mirror;          /*    44     4 */
> 	struct callback_head callback_head __attribute__((__aligned__(8))); /*    48    16 */
> 	/* --- cacheline 1 boundary (64 bytes) --- */
> 	pid_t                      lock_owner;           /*    64     4 */
> 	int                        blocking_writers;     /*    68     4 */
> 	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> 	atomic_t                   blocking_readers;     /*    72     4 */
> 	bool                       lock_nested;          /*    76     1 */
> 
> 	/* XXX 1 byte hole, try to pack */
> 
> 	short int                  log_index;            /*    78     2 */
> 	rwlock_t                   lock;                 /*    80     8 */
>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> 	wait_queue_head_t          write_lock_wq;        /*    88    24 */
> 	wait_queue_head_t          read_lock_wq;         /*   112    24 */
> 	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
> 	struct page *              pages[16];            /*   136   128 */
> 
> 	/* size: 264, cachelines: 5, members: 18 */
> 	/* sum members: 263, holes: 1, sum holes: 1 */
> 	/* forced alignments: 1 */
> 	/* last cacheline: 8 bytes */
> } __attribute__((__aligned__(8)));
> 
> Add the barriers for correctness sake.
> 
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>  fs/btrfs/locking.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
> index a12f3edc3505..e0e0430577aa 100644
> --- a/fs/btrfs/locking.c
> +++ b/fs/btrfs/locking.c
> @@ -110,6 +110,18 @@ void btrfs_set_lock_blocking_write(struct extent_buffer *eb)
>  		btrfs_assert_spinning_writers_put(eb);
>  		btrfs_assert_tree_locked(eb);
>  		WRITE_ONCE(eb->blocking_writers, 1);
> +		/*
> +		 * Writers must be updated before the lock is released so
> +		 * that other threads see that in order. Otherwise this could
> +		 * theoretically lead to blocking_writers still being set to 1
> +		 * but this would be unexpected eg. for spinning locks.
> +		 *
> +		 * There are no pairing read barriers for unlocked access as we
> +		 * don't strictly need them for correctness (eg. in
> +		 * btrfs_try_tree_read_lock), and the unlock/lock is an implied
> +		 * barrier in other cases.
> +		 */
> +		smp_wmb();
>  		write_unlock(&eb->lock);

That comment sounds to me as if you only care about the readers of
blocking_writers _after_ they have acquired the eb::lock for reading. In
this case this smp_wmb is pointless because write_unlock/write_lock
imply release/acquire semantics.

Unlocked readers are not affected by this wmb.

>  	}
>  }
> @@ -316,7 +328,9 @@ void btrfs_tree_unlock(struct extent_buffer *eb)
>  		/*
>  		 * We need to order modifying blocking_writers above with
>  		 * actually waking up the sleepers to ensure they see the
> -		 * updated value of blocking_writers
> +		 * updated value of blocking_writers.
> +		 *
> +		 * cond_wake_up calls smp_mb
>  		 */
>  		cond_wake_up(&eb->write_lock_wq);
>  	} else {
>
David Sterba Oct. 29, 2019, 5:51 p.m. UTC | #2
On Wed, Oct 23, 2019 at 12:57:00PM +0300, Nikolay Borisov wrote:
> 
> 
> On 17.10.19 г. 22:39 ч., David Sterba wrote:
> > There's one potential memory ordering problem regarding
> > eb::blocking_writers, although this hasn't been hit in practice. On TSO
> > (total store ordering) architectures, the writes will always be seen in
> > right order.
> > 
> > In case the calls in btrfs_set_lock_blocking_write are seen in this
> > (wrong) order on another CPU:
> > 
> > 0:  /* blocking_writers == 0 */
> > 1:  write_unlock()
> > 2:  blocking_writers = 1
> > 
> > btrfs_try_tree_read_lock, btrfs_tree_read_lock_atomic or
> > btrfs_try_tree_write_lock would have to squeeze all of this between 1:
> > and 2:
> 
> This is only a problem for unlocked (optimistic) accesses in those
> functions. Simply because from its POV the eb->lock doesn't play any
> role e.g. they don't know about it at all.

No, the lock is important here. Let's take btrfs_try_tree_read_lock as
an example for the 2nd thread:

T1:                            T2:
write_unlock
                               blocking_writers == 0
			       try_readlock (success, locked)
			       if (blocking_writers) return -> not taken
blocking_writers = 1

Same for btrfs_tree_read_lock_atomic (with read_lock) and
btrfs_try_tree_write_lock (with write_lock)

IMO it's the other way around from what you say, the optimistic lock is
irrelevant. We need the blocking_writers update to sneak into a locked
section.

> This implies there needs to be yet another synchronization/ordering
> mechanism only for blocking_writer. But then further down you say that
> there are no read side barrier because observing the accesses in a
> particular order is not important for correctness.
> 
> Given this I fail to see what this smp_wmb affects ordering.

So in the steps above:

T1:                            T2:
blocking_writers = 1
smp_mb()
write_unlock
                               blocking_writers == 1
			       	-> take the fast path and return

> > +		/*
> > +		 * Writers must be updated before the lock is released so
> > +		 * that other threads see that in order. Otherwise this could
> > +		 * theoretically lead to blocking_writers still being set to 1
> > +		 * but this would be unexpected eg. for spinning locks.
> > +		 *
> > +		 * There are no pairing read barriers for unlocked access as we
> > +		 * don't strictly need them for correctness (eg. in
> > +		 * btrfs_try_tree_read_lock), and the unlock/lock is an implied
> > +		 * barrier in other cases.
> > +		 */
> > +		smp_wmb();
> >  		write_unlock(&eb->lock);
> 
> That comment sounds to me as if you only care about the readers of
> blocking_writers _after_ they have acquired the eb::lock for reading. In
> this case this smp_wmb is pointless because write_unlock/write_lock
> imply release/acquire semantics.

The pairing read side barrier would have to be before all reads of
blocking_writers, eg.

180 int btrfs_try_tree_read_lock(struct extent_buffer *eb)
181 {

            smp_rmb();

182         if (eb->blocking_writers)
183                 return 0;

and it won't be wrong, the value would be as up to date as it could be. And
given that this is an optimistic shortcut, lack of a synchronized read might
pessimize that. Ie. blocking_writers is already 0 but the update is not
propagated and btrfs_try_tree_read_lock returns 0.

This is in principle indistinguishable from a state where the lock is
still locked. Hence the 'not needed for correctness' part. And then the
barrier can be avoided.

The write side needs the barrier to order write and unlock. The
unlock/lock implied barrier won't help as the read-side check happens in
the middle of that.
Nikolay Borisov Oct. 29, 2019, 6:48 p.m. UTC | #3
On 29.10.19 г. 19:51 ч., David Sterba wrote:
> On Wed, Oct 23, 2019 at 12:57:00PM +0300, Nikolay Borisov wrote:
>>
>>
>> On 17.10.19 г. 22:39 ч., David Sterba wrote:
>>> There's one potential memory ordering problem regarding
>>> eb::blocking_writers, although this hasn't been hit in practice. On TSO
>>> (total store ordering) architectures, the writes will always be seen in
>>> right order.
>>>
>>> In case the calls in btrfs_set_lock_blocking_write are seen in this
>>> (wrong) order on another CPU:
>>>
>>> 0:  /* blocking_writers == 0 */
>>> 1:  write_unlock()
>>> 2:  blocking_writers = 1
>>>
>>> btrfs_try_tree_read_lock, btrfs_tree_read_lock_atomic or
>>> btrfs_try_tree_write_lock would have to squeeze all of this between 1:
>>> and 2:
>>
>> This is only a problem for unlocked (optimistic) accesses in those
>> functions. Simply because from its POV the eb->lock doesn't play any
>> role e.g. they don't know about it at all.
> 
> No the lock is important here. Let's take btrfs_try_tree_read_lock as
> example for 2nd thread:
> 
> T1:                            T2:
> write_unlock
>                                blocking_writers == 0
> 			       try_readlock (success, locked)
> 			       if (blocking_writers) return -> not taken
> blocking_writers = 1
> 
> Same for btrfs_tree_read_lock_atomic (with read_lock) and
> btrfs_try_tree_write_lock (with write_lock)
> 
> IMO it's the other way around than you say, the optimistic lock is
> irrelevant. We need the blocking_writers update sneak into a locked
> section.

But write_unlock is a release operation, as per memory-barriers.txt:

(6) RELEASE operations.

     This also acts as a one-way permeable barrier.  It guarantees that
     all memory operations before the RELEASE operation will appear to
     happen before the RELEASE operation with respect to the other
     components of the system.  RELEASE operations include UNLOCK
     operations and smp_store_release() operations.

...

The use of ACQUIRE and RELEASE operations generally precludes the need
for other sorts of memory barrier (but note the exceptions mentioned in
the subsection "MMIO write barrier").  In addition, a RELEASE+ACQUIRE
pair is -not- guaranteed to act as a full memory barrier.  However, after
an ACQUIRE on a given variable, all memory accesses preceding any prior
RELEASE on that same variable are guaranteed to be visible.

try_readlock is an ACQUIRE operation w.r.t. the lock. So if it succeeds,
then all previous accesses, e.g. blocking_writers = 1, are guaranteed to
be visible, because we have RELEASE (write_unlock) + ACQUIRE (try_readlock
in case of success).

So the blocking_writers check AFTER try_readlock has succeeded is
guaranteed to see the update to blocking_writers in
btrfs_set_lock_blocking_write.

> 
>> This implies there needs to be yet another synchronization/ordering
>> mechanism only for blocking_writer. But then further down you say that
>> there are no read side barrier because observing the accesses in a
>> particular order is not important for correctness.
>>
>> Given this I fail to see what this smp_wmb affects ordering.
> 
> So in the steps above:
> 
> T1:                            T2:
> blocking_writers = 1
> smp_mb()
> write_unlock
>                                blocking_writers == 1
> 			       	-> take the fast path and return
> 
>>> +		/*
>>> +		 * Writers must be updated before the lock is released so
>>> +		 * that other threads see that in order. Otherwise this could
>>> +		 * theoretically lead to blocking_writers still being set to 1
>>> +		 * but this would be unexpected eg. for spinning locks.
>>> +		 *
>>> +		 * There are no pairing read barriers for unlocked access as we
>>> +		 * don't strictly need them for correctness (eg. in
>>> +		 * btrfs_try_tree_read_lock), and the unlock/lock is an implied
>>> +		 * barrier in other cases.
>>> +		 */
>>> +		smp_wmb();
>>>  		write_unlock(&eb->lock);
>>
>> That comment sounds to me as if you only care about the readers of
>> blocking_writers _after_ they have acquired the eb::lock for reading. In
>> this case this smp_wmb is pointless because write_unlock/write_lock
>> imply release/acquire semantics.
> 
> The pairing read side barrier would have to be before all reads of
> blocking_writers, eg.
> 
> 180 int btrfs_try_tree_read_lock(struct extent_buffer *eb)
> 181 {
> 
>             smp_rmb();
> 
> 182         if (eb->blocking_writers)
> 183                 return 0;
> 
> and it won't be wrong, the value would be as up to date as it could be. And
> given that this is an optimistic shortcut, lack of a synchronized read might
> pessimize that. Ie. blocking_writers is already 0 but the update is not
> propagated and btrfs_try_tree_read_lock returns 0.
> 
> This is in principle indistinguishable from a state where the lock is
> still locked. Hence the 'not needed for correctness' part. And then the
> barrier can be avoided.

Nod

> 
> The write side needs the barrier to order write and unlock. The
> unlock/lock implied barrier won't help as the read-side check happens in
> the middle of that.
> 

But the unlock (as explained earlier) won't allow the write to 'leak'
past it.
David Sterba Oct. 29, 2019, 9:15 p.m. UTC | #4
On Tue, Oct 29, 2019 at 08:48:15PM +0200, Nikolay Borisov wrote:
> On 29.10.19 г. 19:51 ч., David Sterba wrote:
> > On Wed, Oct 23, 2019 at 12:57:00PM +0300, Nikolay Borisov wrote:
> >> On 17.10.19 г. 22:39 ч., David Sterba wrote:
> >>> There's one potential memory ordering problem regarding
> >>> eb::blocking_writers, although this hasn't been hit in practice. On TSO
> >>> (total store ordering) architectures, the writes will always be seen in
> >>> right order.
> >>>
> >>> In case the calls in btrfs_set_lock_blocking_write are seen in this
> >>> (wrong) order on another CPU:
> >>>
> >>> 0:  /* blocking_writers == 0 */
> >>> 1:  write_unlock()
> >>> 2:  blocking_writers = 1
> >>>
> >>> btrfs_try_tree_read_lock, btrfs_tree_read_lock_atomic or
> >>> btrfs_try_tree_write_lock would have to squeeze all of this between 1:
> >>> and 2:
> >>
> >> This is only a problem for unlocked (optimistic) accesses in those
> >> functions. Simply because from its POV the eb->lock doesn't play any
> >> role e.g. they don't know about it at all.
> > 
> > No the lock is important here. Let's take btrfs_try_tree_read_lock as
> > example for 2nd thread:
> > 
> > T1:                            T2:
> > write_unlock
> >                                blocking_writers == 0
> > 			       try_readlock (success, locked)
> > 			       if (blocking_writers) return -> not taken
> > blocking_writers = 1

Ok, it follows from what you write below that I got it wrong: the above
can't happen because of the simple unlock/lock sequence. I don't know
why I missed that, the original changelog has enough clues to see that
too.

> > Same for btrfs_tree_read_lock_atomic (with read_lock) and
> > btrfs_try_tree_write_lock (with write_lock)
> > 
> > IMO it's the other way around than you say, the optimistic lock is
> > irrelevant. We need the blocking_writers update sneak into a locked
> > section.
> 
> But write_unlock is a release operation, as per memory-barriers.txt:
> 
> (6) RELEASE operations.
> 
>      This also acts as a one-way permeable barrier.  It guarantees that
>      all memory operations before the RELEASE operation will appear to
>      happen before the RELEASE operation with respect to the other
>      components of the system.  RELEASE operations include UNLOCK
>      operations and smp_store_release() operations.
> 
> ...
> 
> The use of ACQUIRE and RELEASE operations generally precludes the need
> for other sorts of memory barrier (but note the exceptions mentioned in
> the subsection "MMIO write barrier").  In addition, a RELEASE+ACQUIRE
> pair is -not- guaranteed to act as a full memory barrier.  However, after
> an ACQUIRE on a given variable, all memory accesses preceding any prior
> RELEASE on that same variable are guaranteed to be visible.
> 
> try_readlock is an ACQUIRE operation w.r.t. the lock. So if it succeeds,
> then all previous accesses, e.g. blocking_writers = 1, are guaranteed to
> be visible, because we have RELEASE (write_unlock) + ACQUIRE (try_readlock
> in case of success).
> 
> So the blocking_writers check AFTER try_readlock has succeeded is
> guaranteed to see the update to blocking_writers in
> btrfs_set_lock_blocking_write.

Yeah. As long as the slow paths check the value inside locks, we're
safe. And this is done in all the functions AFAICS.

So far we haven't found a problem with the locks on TSO and non-TSO
arches. Thanks for the barrier reasoning exercise, I'll drop the patch.

Patch

diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
index a12f3edc3505..e0e0430577aa 100644
--- a/fs/btrfs/locking.c
+++ b/fs/btrfs/locking.c
@@ -110,6 +110,18 @@  void btrfs_set_lock_blocking_write(struct extent_buffer *eb)
 		btrfs_assert_spinning_writers_put(eb);
 		btrfs_assert_tree_locked(eb);
 		WRITE_ONCE(eb->blocking_writers, 1);
+		/*
+		 * Writers must be updated before the lock is released so
+		 * that other threads see that in order. Otherwise this could
+		 * theoretically lead to blocking_writers still being set to 1
+		 * but this would be unexpected eg. for spinning locks.
+		 *
+		 * There are no pairing read barriers for unlocked access as we
+		 * don't strictly need them for correctness (eg. in
+		 * btrfs_try_tree_read_lock), and the unlock/lock is an implied
+		 * barrier in other cases.
+		 */
+		smp_wmb();
 		write_unlock(&eb->lock);
 	}
 }
@@ -316,7 +328,9 @@  void btrfs_tree_unlock(struct extent_buffer *eb)
 		/*
 		 * We need to order modifying blocking_writers above with
 		 * actually waking up the sleepers to ensure they see the
-		 * updated value of blocking_writers
+		 * updated value of blocking_writers.
+		 *
+		 * cond_wake_up calls smp_mb
 		 */
 		cond_wake_up(&eb->write_lock_wq);
 	} else {