Message ID: 2a310858bfa3754f6f7a4d4b7693959b0fdd7005.1571340084.git.dsterba@suse.com (mailing list archive)
State: New, archived
Series: Extent buffer locking and documentation
On 17.10.19 22:39, David Sterba wrote:
> There's one potential memory ordering problem regarding
> eb::blocking_writers, although this hasn't been hit in practice. On TSO
> (total store ordering) architectures, the writes will always be seen in
> the right order.
>
> In case the calls in btrfs_set_lock_blocking_write are seen in this
> (wrong) order on another CPU:
>
> 0: /* blocking_writers == 0 */
> 1: write_unlock()
> 2: blocking_writers = 1
>
> btrfs_try_tree_read_lock, btrfs_tree_read_lock_atomic or
> btrfs_try_tree_write_lock would have to squeeze all of this between 1:
> and 2:

This is only a problem for unlocked (optimistic) accesses in those
functions, simply because from their POV the eb->lock doesn't play any
role, i.e. they don't know about it at all. This implies there needs to
be yet another synchronization/ordering mechanism just for
blocking_writers. But then further down you say that there is no
read-side barrier because observing the accesses in a particular order
is not important for correctness.

Given this, I fail to see what ordering this smp_wmb enforces.

> - check if blocking_writers is 0 (it is, continue)
> - try read lock, read lock or write lock (succeeds after 1:)
> - check blocking_writers again (continue)
>
> All of that assumes that the reordered writes can survive for quite some
> time (unlikely if it's in the internal store buffer).
>
> Another point against observing that in practice is that
> blocking_writers and the lock are on the same cacheline (64B), so it's
> more likely both values are stored in order, or some sort of pending
> write flush will update blocking_writers, rwlock before the try lock
> happens. Assuming the CPUs work like that and don't hide other
> surprises.
>
> struct extent_buffer {
> 	u64                        start;                /*     0     8 */
> 	long unsigned int          len;                  /*     8     8 */
> 	long unsigned int          bflags;               /*    16     8 */
> 	struct btrfs_fs_info *     fs_info;              /*    24     8 */
> 	spinlock_t                 refs_lock;            /*    32     4 */
> 	atomic_t                   refs;                 /*    36     4 */
> 	atomic_t                   io_pages;             /*    40     4 */
> 	int                        read_mirror;          /*    44     4 */
> 	struct callback_head       callback_head __attribute__((__aligned__(8))); /*    48    16 */
> 	/* --- cacheline 1 boundary (64 bytes) --- */
> 	pid_t                      lock_owner;           /*    64     4 */
> 	int                        blocking_writers;     /*    68     4 */
> 	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> 	atomic_t                   blocking_readers;     /*    72     4 */
> 	bool                       lock_nested;          /*    76     1 */
>
> 	/* XXX 1 byte hole, try to pack */
>
> 	short int                  log_index;            /*    78     2 */
> 	rwlock_t                   lock;                 /*    80     8 */
> 	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> 	wait_queue_head_t          write_lock_wq;        /*    88    24 */
> 	wait_queue_head_t          read_lock_wq;         /*   112    24 */
> 	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
> 	struct page *              pages[16];            /*   136   128 */
>
> 	/* size: 264, cachelines: 5, members: 18 */
> 	/* sum members: 263, holes: 1, sum holes: 1 */
> 	/* forced alignments: 1 */
> 	/* last cacheline: 8 bytes */
> } __attribute__((__aligned__(8)));
>
> Add the barriers for correctness' sake.
>
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>  fs/btrfs/locking.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
> index a12f3edc3505..e0e0430577aa 100644
> --- a/fs/btrfs/locking.c
> +++ b/fs/btrfs/locking.c
> @@ -110,6 +110,18 @@ void btrfs_set_lock_blocking_write(struct extent_buffer *eb)
>  		btrfs_assert_spinning_writers_put(eb);
>  		btrfs_assert_tree_locked(eb);
>  		WRITE_ONCE(eb->blocking_writers, 1);
> +		/*
> +		 * Writers must be updated before the lock is released so
> +		 * that other threads see that in order. Otherwise this could
> +		 * theoretically lead to blocking_writers be still set to 1 but
> +		 * this would be unexpected eg. for spinning locks.
> +		 *
> +		 * There are no pairing read barriers for unlocked access as we
> +		 * don't strictly need them for correctness (eg. in
> +		 * btrfs_try_tree_read_lock), and the unlock/lock is an implied
> +		 * barrier in other cases.
> +		 */
> +		smp_wmb();
>  		write_unlock(&eb->lock);

That comment sounds to me as if you only care about readers of
blocking_writers _after_ they have acquired the eb::lock for reading. In
that case this smp_wmb is pointless, because write_unlock/write_lock
imply release/acquire semantics. Unlocked readers are not affected by
this wmb.

>  	}
>  }
> @@ -316,7 +328,9 @@ void btrfs_tree_unlock(struct extent_buffer *eb)
>  		/*
>  		 * We need to order modifying blocking_writers above with
>  		 * actually waking up the sleepers to ensure they see the
> -		 * updated value of blocking_writers
> +		 * updated value of blocking_writers.
> +		 *
> +		 * cond_wake_up calls smp_mb
>  		 */
>  		cond_wake_up(&eb->write_lock_wq);
>  	} else {
On Wed, Oct 23, 2019 at 12:57:00PM +0300, Nikolay Borisov wrote:
> On 17.10.19 22:39, David Sterba wrote:
> > There's one potential memory ordering problem regarding
> > eb::blocking_writers, although this hasn't been hit in practice. On TSO
> > (total store ordering) architectures, the writes will always be seen in
> > the right order.
> >
> > In case the calls in btrfs_set_lock_blocking_write are seen in this
> > (wrong) order on another CPU:
> >
> > 0: /* blocking_writers == 0 */
> > 1: write_unlock()
> > 2: blocking_writers = 1
> >
> > btrfs_try_tree_read_lock, btrfs_tree_read_lock_atomic or
> > btrfs_try_tree_write_lock would have to squeeze all of this between 1:
> > and 2:
>
> This is only a problem for unlocked (optimistic) accesses in those
> functions, simply because from their POV the eb->lock doesn't play any
> role, i.e. they don't know about it at all.

No, the lock is important here. Let's take btrfs_try_tree_read_lock as
an example for the 2nd thread:

  T1:                     T2:
  write_unlock
                          blocking_writers == 0
                          try_readlock (success, locked)
                          if (blocking_writers) return -> not taken
  blocking_writers = 1

Same for btrfs_tree_read_lock_atomic (with read_lock) and
btrfs_try_tree_write_lock (with write_lock).

IMO it's the other way around from what you say: the optimistic check is
irrelevant. We need the blocking_writers update to sneak into a locked
section.

> This implies there needs to be yet another synchronization/ordering
> mechanism just for blocking_writers. But then further down you say that
> there is no read-side barrier because observing the accesses in a
> particular order is not important for correctness.
>
> Given this, I fail to see what ordering this smp_wmb enforces.

So in the steps above:

  T1:                     T2:
  blocking_writers = 1
  smp_mb()
  write_unlock
                          blocking_writers == 1
                          -> take the fast path and return

> > +		/*
> > +		 * Writers must be updated before the lock is released so
> > +		 * that other threads see that in order. Otherwise this could
> > +		 * theoretically lead to blocking_writers be still set to 1 but
> > +		 * this would be unexpected eg. for spinning locks.
> > +		 *
> > +		 * There are no pairing read barriers for unlocked access as we
> > +		 * don't strictly need them for correctness (eg. in
> > +		 * btrfs_try_tree_read_lock), and the unlock/lock is an implied
> > +		 * barrier in other cases.
> > +		 */
> > +		smp_wmb();
> >  		write_unlock(&eb->lock);
>
> That comment sounds to me as if you only care about the readers of
> blocking_writers _after_ they have acquired the eb::lock for reading. In
> this case this smp_wmb is pointless because write_unlock/write_lock
> imply release/acquire semantics.

The pairing read-side barrier would have to be before all reads of
blocking_writers, e.g.

  180 int btrfs_try_tree_read_lock(struct extent_buffer *eb)
  181 {

              smp_rmb();

  182         if (eb->blocking_writers)
  183                 return 0;

and it won't be wrong, the value would be as up to date as it could be.
And given that this is an optimistic shortcut, the lack of a
synchronized read might pessimize it, i.e. blocking_writers is already 0
but the update is not propagated yet and btrfs_try_tree_read_lock
returns 0.

This is in principle indistinguishable from a state where the lock is
still locked. Hence the 'not needed for correctness' part. And then the
barrier can be avoided.

The write side needs the barrier to order the write and the unlock. The
unlock/lock implied barrier won't help, as the read-side check happens
in the middle of that.
On 29.10.19 19:51, David Sterba wrote:
> On Wed, Oct 23, 2019 at 12:57:00PM +0300, Nikolay Borisov wrote:
>> On 17.10.19 22:39, David Sterba wrote:
>>> There's one potential memory ordering problem regarding
>>> eb::blocking_writers, although this hasn't been hit in practice. On TSO
>>> (total store ordering) architectures, the writes will always be seen in
>>> the right order.
>>>
>>> In case the calls in btrfs_set_lock_blocking_write are seen in this
>>> (wrong) order on another CPU:
>>>
>>> 0: /* blocking_writers == 0 */
>>> 1: write_unlock()
>>> 2: blocking_writers = 1
>>>
>>> btrfs_try_tree_read_lock, btrfs_tree_read_lock_atomic or
>>> btrfs_try_tree_write_lock would have to squeeze all of this between 1:
>>> and 2:
>>
>> This is only a problem for unlocked (optimistic) accesses in those
>> functions, simply because from their POV the eb->lock doesn't play any
>> role, i.e. they don't know about it at all.
>
> No, the lock is important here. Let's take btrfs_try_tree_read_lock as
> an example for the 2nd thread:
>
>   T1:                     T2:
>   write_unlock
>                           blocking_writers == 0
>                           try_readlock (success, locked)
>                           if (blocking_writers) return -> not taken
>   blocking_writers = 1
>
> Same for btrfs_tree_read_lock_atomic (with read_lock) and
> btrfs_try_tree_write_lock (with write_lock).
>
> IMO it's the other way around from what you say: the optimistic check is
> irrelevant. We need the blocking_writers update to sneak into a locked
> section.

But write_unlock is a release operation; as per memory-barriers.txt:

 (6) RELEASE operations.

     This also acts as a one-way permeable barrier.  It guarantees that
     all memory operations before the RELEASE operation will appear to
     happen before the RELEASE operation with respect to the other
     components of the system. RELEASE operations include UNLOCK
     operations and smp_store_release() operations.

     ...

     The use of ACQUIRE and RELEASE operations generally precludes the
     need for other sorts of memory barrier (but note the exceptions
     mentioned in the subsection "MMIO write barrier"). In addition, a
     RELEASE+ACQUIRE pair is -not- guaranteed to act as a full memory
     barrier. However, after an ACQUIRE on a given variable, all memory
     accesses preceding any prior RELEASE on that same variable are
     guaranteed to be visible.

try_readlock is an ACQUIRE operation w.r.t. the lock. So if it succeeds,
then all previous accesses, e.g. blocking_writers = 1, are guaranteed to
be visible, because we have a RELEASE (write_unlock) + ACQUIRE
(try_readlock, in case of success) pair.

So the blocking_writers check AFTER try_readlock has succeeded is
guaranteed to see the update to blocking_writers in
btrfs_set_lock_blocking_write.

>> This implies there needs to be yet another synchronization/ordering
>> mechanism just for blocking_writers. But then further down you say that
>> there is no read-side barrier because observing the accesses in a
>> particular order is not important for correctness.
>>
>> Given this, I fail to see what ordering this smp_wmb enforces.
>
> So in the steps above:
>
>   T1:                     T2:
>   blocking_writers = 1
>   smp_mb()
>   write_unlock
>                           blocking_writers == 1
>                           -> take the fast path and return
>
>>> +		/*
>>> +		 * Writers must be updated before the lock is released so
>>> +		 * that other threads see that in order. Otherwise this could
>>> +		 * theoretically lead to blocking_writers be still set to 1 but
>>> +		 * this would be unexpected eg. for spinning locks.
>>> +		 *
>>> +		 * There are no pairing read barriers for unlocked access as we
>>> +		 * don't strictly need them for correctness (eg. in
>>> +		 * btrfs_try_tree_read_lock), and the unlock/lock is an implied
>>> +		 * barrier in other cases.
>>> +		 */
>>> +		smp_wmb();
>>>  		write_unlock(&eb->lock);
>>
>> That comment sounds to me as if you only care about the readers of
>> blocking_writers _after_ they have acquired the eb::lock for reading. In
>> this case this smp_wmb is pointless because write_unlock/write_lock
>> imply release/acquire semantics.
>
> The pairing read-side barrier would have to be before all reads of
> blocking_writers, e.g.
>
>   180 int btrfs_try_tree_read_lock(struct extent_buffer *eb)
>   181 {
>
>               smp_rmb();
>
>   182         if (eb->blocking_writers)
>   183                 return 0;
>
> and it won't be wrong, the value would be as up to date as it could be.
> And given that this is an optimistic shortcut, the lack of a
> synchronized read might pessimize it, i.e. blocking_writers is already 0
> but the update is not propagated yet and btrfs_try_tree_read_lock
> returns 0.
>
> This is in principle indistinguishable from a state where the lock is
> still locked. Hence the 'not needed for correctness' part. And then the
> barrier can be avoided.

Nod.

> The write side needs the barrier to order the write and the unlock. The
> unlock/lock implied barrier won't help, as the read-side check happens
> in the middle of that.

But the unlock (as explained earlier) won't allow the write to 'leak'
past it.
On Tue, Oct 29, 2019 at 08:48:15PM +0200, Nikolay Borisov wrote:
> On 29.10.19 19:51, David Sterba wrote:
> > On Wed, Oct 23, 2019 at 12:57:00PM +0300, Nikolay Borisov wrote:
> >> On 17.10.19 22:39, David Sterba wrote:
> >>> There's one potential memory ordering problem regarding
> >>> eb::blocking_writers, although this hasn't been hit in practice. On TSO
> >>> (total store ordering) architectures, the writes will always be seen in
> >>> the right order.
> >>>
> >>> In case the calls in btrfs_set_lock_blocking_write are seen in this
> >>> (wrong) order on another CPU:
> >>>
> >>> 0: /* blocking_writers == 0 */
> >>> 1: write_unlock()
> >>> 2: blocking_writers = 1
> >>>
> >>> btrfs_try_tree_read_lock, btrfs_tree_read_lock_atomic or
> >>> btrfs_try_tree_write_lock would have to squeeze all of this between 1:
> >>> and 2:
> >>
> >> This is only a problem for unlocked (optimistic) accesses in those
> >> functions, simply because from their POV the eb->lock doesn't play any
> >> role, i.e. they don't know about it at all.
> >
> > No, the lock is important here. Let's take btrfs_try_tree_read_lock as
> > an example for the 2nd thread:
> >
> >   T1:                     T2:
> >   write_unlock
> >                           blocking_writers == 0
> >                           try_readlock (success, locked)
> >                           if (blocking_writers) return -> not taken
> >   blocking_writers = 1

Ok, this follows from what you write below, and I got it wrong: the
above can't happen because of the simple unlock/lock sequence. I don't
know why I missed that; the original changelog has enough clues to see
it too.

> > Same for btrfs_tree_read_lock_atomic (with read_lock) and
> > btrfs_try_tree_write_lock (with write_lock).
> >
> > IMO it's the other way around from what you say: the optimistic check is
> > irrelevant. We need the blocking_writers update to sneak into a locked
> > section.
>
> But write_unlock is a release operation; as per memory-barriers.txt:
>
>  (6) RELEASE operations.
>
>      This also acts as a one-way permeable barrier.  It guarantees that
>      all memory operations before the RELEASE operation will appear to
>      happen before the RELEASE operation with respect to the other
>      components of the system. RELEASE operations include UNLOCK
>      operations and smp_store_release() operations.
>
>      ...
>
>      The use of ACQUIRE and RELEASE operations generally precludes the
>      need for other sorts of memory barrier (but note the exceptions
>      mentioned in the subsection "MMIO write barrier"). In addition, a
>      RELEASE+ACQUIRE pair is -not- guaranteed to act as a full memory
>      barrier. However, after an ACQUIRE on a given variable, all memory
>      accesses preceding any prior RELEASE on that same variable are
>      guaranteed to be visible.
>
> try_readlock is an ACQUIRE operation w.r.t. the lock. So if it succeeds,
> then all previous accesses, e.g. blocking_writers = 1, are guaranteed to
> be visible, because we have a RELEASE (write_unlock) + ACQUIRE
> (try_readlock, in case of success) pair.
>
> So the blocking_writers check AFTER try_readlock has succeeded is
> guaranteed to see the update to blocking_writers in
> btrfs_set_lock_blocking_write.

Yeah. As long as the slow paths check the value inside the locks, we're
safe, and this is done in all the functions AFAICS. So far we haven't
found a problem with the locks on TSO and non-TSO arches. Thanks for the
barrier reasoning exercise, I'll drop the patch.
diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
index a12f3edc3505..e0e0430577aa 100644
--- a/fs/btrfs/locking.c
+++ b/fs/btrfs/locking.c
@@ -110,6 +110,18 @@ void btrfs_set_lock_blocking_write(struct extent_buffer *eb)
 		btrfs_assert_spinning_writers_put(eb);
 		btrfs_assert_tree_locked(eb);
 		WRITE_ONCE(eb->blocking_writers, 1);
+		/*
+		 * Writers must be updated before the lock is released so
+		 * that other threads see that in order. Otherwise this could
+		 * theoretically lead to blocking_writers be still set to 1 but
+		 * this would be unexpected eg. for spinning locks.
+		 *
+		 * There are no pairing read barriers for unlocked access as we
+		 * don't strictly need them for correctness (eg. in
+		 * btrfs_try_tree_read_lock), and the unlock/lock is an implied
+		 * barrier in other cases.
+		 */
+		smp_wmb();
 		write_unlock(&eb->lock);
 	}
 }
@@ -316,7 +328,9 @@ void btrfs_tree_unlock(struct extent_buffer *eb)
 		/*
 		 * We need to order modifying blocking_writers above with
 		 * actually waking up the sleepers to ensure they see the
-		 * updated value of blocking_writers
+		 * updated value of blocking_writers.
+		 *
+		 * cond_wake_up calls smp_mb
 		 */
 		cond_wake_up(&eb->write_lock_wq);
 	} else {
There's one potential memory ordering problem regarding
eb::blocking_writers, although this hasn't been hit in practice. On TSO
(total store ordering) architectures, the writes will always be seen in
the right order.

In case the calls in btrfs_set_lock_blocking_write are seen in this
(wrong) order on another CPU:

  0: /* blocking_writers == 0 */
  1: write_unlock()
  2: blocking_writers = 1

btrfs_try_tree_read_lock, btrfs_tree_read_lock_atomic or
btrfs_try_tree_write_lock would have to squeeze all of this between 1:
and 2:

  - check if blocking_writers is 0 (it is, continue)
  - try read lock, read lock or write lock (succeeds after 1:)
  - check blocking_writers again (continue)

All of that assumes that the reordered writes can survive for quite some
time (unlikely if it's in the internal store buffer).

Another point against observing that in practice is that
blocking_writers and the lock are on the same cacheline (64B), so it's
more likely both values are stored in order, or some sort of pending
write flush will update blocking_writers, rwlock before the try lock
happens. Assuming the CPUs work like that and don't hide other
surprises.
struct extent_buffer {
	u64                        start;                /*     0     8 */
	long unsigned int          len;                  /*     8     8 */
	long unsigned int          bflags;               /*    16     8 */
	struct btrfs_fs_info *     fs_info;              /*    24     8 */
	spinlock_t                 refs_lock;            /*    32     4 */
	atomic_t                   refs;                 /*    36     4 */
	atomic_t                   io_pages;             /*    40     4 */
	int                        read_mirror;          /*    44     4 */
	struct callback_head       callback_head __attribute__((__aligned__(8))); /*    48    16 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	pid_t                      lock_owner;           /*    64     4 */
	int                        blocking_writers;     /*    68     4 */
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	atomic_t                   blocking_readers;     /*    72     4 */
	bool                       lock_nested;          /*    76     1 */

	/* XXX 1 byte hole, try to pack */

	short int                  log_index;            /*    78     2 */
	rwlock_t                   lock;                 /*    80     8 */
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	wait_queue_head_t          write_lock_wq;        /*    88    24 */
	wait_queue_head_t          read_lock_wq;         /*   112    24 */
	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
	struct page *              pages[16];            /*   136   128 */

	/* size: 264, cachelines: 5, members: 18 */
	/* sum members: 263, holes: 1, sum holes: 1 */
	/* forced alignments: 1 */
	/* last cacheline: 8 bytes */
} __attribute__((__aligned__(8)));

Add the barriers for correctness' sake.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/locking.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)