diff mbox

[v3] NFSv4: replace seqcount_t with a rw_semaphore

Message ID 20161031131958.mrrez5t7sow75p6v@linutronix.de
State New, archived
Headers show

Commit Message

Sebastian Andrzej Siewior Oct. 31, 2016, 1:19 p.m. UTC
The raw_write_seqcount_begin() in nfs4_reclaim_open_state() bugs me
because it maps to preempt_disable() in -RT, which I can't have at this
point. So I took a look at the code.
The lockdep part was removed in commit abbec2da13f0 ("NFS: Use
raw_write_seqcount_begin/end int nfs4_reclaim_open_state") because
lockdep complained. The whole seqcount thing was introduced in commit
c137afabe330 ("NFSv4: Allow the state manager to mark an open_owner as
being recovered").
Trond Myklebust confirms that there is only one writer thread at
nfs4_reclaim_open_state().

The list_for_each_entry() in nfs4_reclaim_open_state():
It seems that the ->so_lock protects the ->so_states list among other
atomic_t & flags members. So at the beginning of the loop we increment
->count, ensuring that the entry is not removed while we use it. Then we
drop the ->so_lock during the loop, it seems. And after the
nfs4_reclaim_locks() invocation we call nfs4_put_open_state() and grab
the ->so_lock again. So if we were the last user of this struct and we
remove it, then the following list_next_entry() invocation is a
use-after-free. Even if we used list_for_each_entry_safe() there would
be no guarantee that the following member is still valid, because it
might have been removed by someone invoking nfs4_put_open_state() on it,
right?
So there is this.

However, to address my initial problem I have here a patch :) It uses
a rw_semaphore, which ensures that there is only one writer at a time
or multiple readers. So it should be basically what is happening now,
plus a tiny lock, plus lockdep coverage. I tried to test this myself
but I didn't manage to get into this code path at all, so I might be
doing something wrong.

Could you please check if this patch is working for you and whether my
list_for_each_entry() observation is correct or not?

v2…v3: replace the seqlock with a RW semaphore.

v1…v2: write_seqlock() disables preemption, but some functions called
in the write section need it enabled (thread_run(), non-GFP_ATOMIC
memory allocation). We don't want preemption enabled there, because a
preempted writer would stall readers spinning on the seqlock. This is a
duct-tape mutex. Maybe the seqlock should go.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 fs/nfs/delegation.c |  6 ++----
 fs/nfs/nfs4_fs.h    |  2 +-
 fs/nfs/nfs4proc.c   |  9 +++------
 fs/nfs/nfs4state.c  | 10 ++++------
 4 files changed, 10 insertions(+), 17 deletions(-)

Comments

Trond Myklebust Oct. 31, 2016, 3:30 p.m. UTC | #1
DQo+IE9uIE9jdCAzMSwgMjAxNiwgYXQgMDk6MTksIFNlYmFzdGlhbiBBbmRyemVqIFNpZXdpb3Ig
PGJpZ2Vhc3lAbGludXRyb25peC5kZT4gd3JvdGU6DQo+IA0KPiBUaGUgcmF3X3dyaXRlX3NlcWNv
dW50X2JlZ2luKCkgaW4gbmZzNF9yZWNsYWltX29wZW5fc3RhdGUoKSBidWdzIG1lDQo+IGJlY2F1
c2UgaXQgbWFwcyB0byBwcmVlbXB0X2Rpc2FibGUoKSBpbiAtUlQgd2hpY2ggSSBjYW4ndCBoYXZl
IGF0IHRoaXMNCj4gcG9pbnQuIFNvIEkgdG9vayBhIGxvb2sgYXQgdGhlIGNvZGUuDQo+IEl0IHRo
ZSBsb2NrZGVwIHBhcnQgd2FzIHJlbW92ZWQgaW4gY29tbWl0IGFiYmVjMmRhMTNmMCAoIk5GUzog
VXNlDQo+IHJhd193cml0ZV9zZXFjb3VudF9iZWdpbi9lbmQgaW50IG5mczRfcmVjbGFpbV9vcGVu
X3N0YXRlIikgYmVjYXVzZQ0KPiBsb2NrZGVwIGNvbXBsYWluZWQuIFRoZSB3aG9sZSBzZXFjb3Vu
dCB0aGluZyB3YXMgaW50cm9kdWNlZCBpbiBjb21taXQNCj4gYzEzN2FmYWJlMzMwICgiTkZTdjQ6
IEFsbG93IHRoZSBzdGF0ZSBtYW5hZ2VyIHRvIG1hcmsgYW4gb3Blbl9vd25lciBhcw0KPiBiZWlu
ZyByZWNvdmVyZWQiKS4NCj4gVHJvbmQgTXlrbGVidXN0IGNvbmZpcm1zIHRoYXQgdGhlcmUgaSBv
bmx5IG9uZSB3cml0ZXIgdGhyZWFkIGF0DQo+IG5mczRfcmVjbGFpbV9vcGVuX3N0YXRlKCkNCj4g
DQo+IFRoZSBsaXN0X2Zvcl9lYWNoX2VudHJ5KCkgaW4gbmZzNF9yZWNsYWltX29wZW5fc3RhdGU6
DQo+IEl0IHNlZW1zIHRoYXQgdGhpcyBsb2NrIHByb3RlY3RzIHRoZSAtPnNvX3N0YXRlcyBsaXN0
IGFtb25nIG90aGVyDQo+IGF0b21pY190ICYgZmxhZ3MgbWVtYmVycy4gU28gYXQgdGhlIGJlZ2lu
IG9mIHRoZSBsb29wIHdlIGluYyAtPmNvdW50DQo+IGVuc3VyaW5nIHRoYXQgdGhpcyBmaWVsZCBp
cyBub3QgcmVtb3ZlZCB3aGlsZSB3ZSB1c2UgaXQuIFNvIHdlIGRyb3AgdGhlDQo+IC0+c29fbG9j
ayBsb2MgZHVyaW5nIHRoZSBsb29wIGl0IHNlZW1zLiBBbmQgYWZ0ZXIgbmZzNF9yZWNsYWltX2xv
Y2tzKCkNCj4gaW52b2NhdGlvbiB3ZSBuZnM0X3B1dF9vcGVuX3N0YXRlKCkgYW5kIGdyYWIgdGhl
IC0+c29fbG9jayBhZ2Fpbi4gU28gaWYNCj4gd2Ugd2VyZSB0aGUgbGFzdCB1c2VyIG9mIHRoaXMg
c3RydWN0IGFuZCB3ZSByZW1vdmUgaXQsIHRoZW4gdGhlDQo+IGZvbGxvd2luZyBsaXN0X25leHRf
ZW50cnkoKSBpbnZvY2F0aW9uIGlzIGEgdXNlLWFmdGVyLWZyZWUuIEV2ZW4gaWYgd2UNCj4gdXNl
IGxpc3RfZm9yX2VhY2hfZW50cnlfc2FmZSgpIHRoZXJlIGlzIG5vIGd1YXJhbnRlZSB0aGF0IHRo
ZSBmb2xsb3dpbmcNCj4gbWVtYmVyIGlzIHN0aWxsIHZhbGlkIGJlY2F1c2UgaXQgbWlnaHQgaGF2
ZSBiZWVuIHJlbW92ZWQgYnkgc29tZW9uZQ0KPiBpbnZva2luZyBuZnM0X3B1dF9vcGVuX3N0YXRl
KCkgb24gaXQsIHJpZ2h0Pw0KPiBTbyB0aGVyZSBpcyB0aGlzLg0KPiANCj4gSG93ZXZlciB0byBh
ZGRyZXNzIG15IGluaXRpYWwgcHJvYmxlbSBJIGhhdmUgaGVyZSBhIHBhdGNoIDopIFNvIGl0IHVz
ZXMNCj4gYSByd19zZW1hcGhvcmUgd2hpY2ggZW5zdXJlcyB0aGF0IHRoZXJlIGlzIG9ubHkgb25l
IHdyaXRlciBhdCBhIHRpbWUgb3INCj4gbXVsdGlwbGUgcmVhZGVyLiBTbyBpdCBzaG91bGQgYmUg
YmFzaWNhbGx5IHdoYXQgaXMgaGFwcGVuaW5nIG5vdyBwbHVzIGENCj4gdGlueSB0aW55IHRpbnkg
bG9jayBwbHVzIGxvY2tkZXAgY292ZXJhZ2UuIEkgdHJpZWQgdG8gdGhpcyBteXNlbGYgYnV0IEkN
Cj4gZG9uJ3QgbWFuYWdlIHRvIGdldCBpbnRvIHRoaXMgY29kZSBwYXRoIGF0IGFsbCBzbyBJIG1p
Z2h0IGJlIGRvaW5nDQo+IHNvbWV0aGluZyB3cm9uZy4NCj4gDQo+IENvdWxkIHlvdSBwbGVhc2Ug
Y2hlY2sgaWYgdGhpcyBwYXRjaCBpcyB3b3JraW5nIGZvciB5b3UgYW5kIHdoZXRoZXIgbXkNCj4g
bGlzdF9mb3JfZWFjaF9lbnRyeSgpIG9ic2VydmF0aW9uIGlzIGNvcnJlY3Qgb3Igbm90Pw0KPiAN
Cj4gdjLigKZ2MzogcmVwbGFjZSB0aGUgc2VxbG9jayB3aXRoIGEgUlcgc2VtYXBob3JlLg0KPiAN
Cg0KTkFDSy4gVGhhdCB3aWxsIGRlYWRsb2NrLiBUaGUgcmVhc29uIHdoeSB3ZSB1c2UgYSBzZXFs
b2NrIHRoZXJlIGlzIHByZWNpc2VseSBiZWNhdXNlIHdlIGNhbm5vdCBhbGxvdyBvcmRpbmFyeSBS
UEMgY2FsbHMgdG8gbG9jayBvdXQgdGhlIHJlY292ZXJ5IHRocmVhZC4gSWYgdGhlIHNlcnZlciBy
ZWJvb3RzLCB0aGVuIG9yZGluYXJ5IFJQQyBjYWxscyB3aWxsIGZhaWwgdW50aWwgdGhlIHJlY292
ZXJ5IHRocmVhZCBoYXMgaGFkIGEgY2hhbmNlIHRvIHJlLWVzdGFibGlzaCB0aGUgc3RhdGUuDQoN
Cg0K

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sebastian Andrzej Siewior Oct. 31, 2016, 3:56 p.m. UTC | #2
On 2016-10-31 15:30:02 [+0000], Trond Myklebust wrote:
> > On Oct 31, 2016, at 09:19, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > The list_for_each_entry() in nfs4_reclaim_open_state:
> > It seems that this lock protects the ->so_states list among other
> > atomic_t & flags members. So at the begin of the loop we inc ->count
> > ensuring that this field is not removed while we use it. So we drop the
> > ->so_lock loc during the loop it seems. And after nfs4_reclaim_locks()
> > invocation we nfs4_put_open_state() and grab the ->so_lock again. So if
> > we were the last user of this struct and we remove it, then the
> > following list_next_entry() invocation is a use-after-free. Even if we
> > use list_for_each_entry_safe() there is no guarantee that the following
> > member is still valid because it might have been removed by someone
> > invoking nfs4_put_open_state() on it, right?
> > So there is this.
> > 
> > However to address my initial problem I have here a patch :) So it uses
> > a rw_semaphore which ensures that there is only one writer at a time or
> > multiple reader. So it should be basically what is happening now plus a
> > tiny tiny tiny lock plus lockdep coverage. I tried to this myself but I
> > don't manage to get into this code path at all so I might be doing
> > something wrong.
> > 
> > Could you please check if this patch is working for you and whether my
> > list_for_each_entry() observation is correct or not?
> > 
> > v2…v3: replace the seqlock with a RW semaphore.
> > 
> 
> NACK. That will deadlock. The reason why we use a seqlock there is precisely because we cannot allow ordinary RPC calls to lock out the recovery thread. 
Hmmm. So this is getting invoked if I reboot the server? Is a restart
of nfs-kernel-server the same thing?
Is the list_for_each_entry() observation I made correct?

>If the server reboots, then ordinary RPC calls will fail until the recovery thread has had a chance to re-establish the state.

This means that the ordinary RPC call won't return and fail, but will
wait with the lock held until the recovery thread has done its thing?

Sebastian
Trond Myklebust Oct. 31, 2016, 4:11 p.m. UTC | #3
> On Oct 31, 2016, at 11:56, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> 
> On 2016-10-31 15:30:02 [+0000], Trond Myklebust wrote:
>>> On Oct 31, 2016, at 09:19, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
>>> The list_for_each_entry() in nfs4_reclaim_open_state:
>>> It seems that this lock protects the ->so_states list among other
>>> atomic_t & flags members. So at the begin of the loop we inc ->count
>>> ensuring that this field is not removed while we use it. So we drop the
>>> ->so_lock loc during the loop it seems. And after nfs4_reclaim_locks()
>>> invocation we nfs4_put_open_state() and grab the ->so_lock again. So if
>>> we were the last user of this struct and we remove it, then the
>>> following list_next_entry() invocation is a use-after-free. Even if we
>>> use list_for_each_entry_safe() there is no guarantee that the following
>>> member is still valid because it might have been removed by someone
>>> invoking nfs4_put_open_state() on it, right?
>>> So there is this.
>>> 
>>> However to address my initial problem I have here a patch :) So it uses
>>> a rw_semaphore which ensures that there is only one writer at a time or
>>> multiple reader. So it should be basically what is happening now plus a
>>> tiny tiny tiny lock plus lockdep coverage. I tried to this myself but I
>>> don't manage to get into this code path at all so I might be doing
>>> something wrong.
>>> 
>>> Could you please check if this patch is working for you and whether my
>>> list_for_each_entry() observation is correct or not?
>>> 
>>> v2…v3: replace the seqlock with a RW semaphore.
>>> 
>> 
>> NACK. That will deadlock. The reason why we use a seqlock there is precisely because we cannot allow ordinary RPC calls to lock out the recovery thread.
> Hmmm. So this is getting invoked if I reboot the server? A restart of
> nfs-kernel-server is the same thing?
> Is the list_for_each_entry() observation I made correct?

Yes, and yes. We can’t rely on the list pointers remaining correct, so we restart the list scan and we use the ops->state_flag_bit to signal whether or not state has been recovered for the entry being scanned.

>> If the server reboots, then ordinary RPC calls will fail until the recovery thread has had a chance to re-establish the state.
> 
> This means that the ordinary RPC call won't return and fail but wait
> with the lock held until the recovery thread did its thing?

It uses the seqcount_t to figure out if a recovery occurred while an OPEN or CLOSE was being processed. If so, it schedules a new recovery of the stateids in question.
Sebastian Andrzej Siewior Nov. 2, 2016, 5:11 p.m. UTC | #4
On 2016-10-31 16:11:02 [+0000], Trond Myklebust wrote:
> 
> Yes, and yes. We can’t rely on the list pointers remaining correct, so we restart the list scan and we use the ops->state_flag_bit to signal whether or not state has been recovered for the entry being scanned.

but this is tested at the top of the loop, and by then you have already
looked at the list's ->next pointer, which might be invalid.

Sebastian
Trond Myklebust Nov. 2, 2016, 5:37 p.m. UTC | #5
> On Nov 2, 2016, at 13:11, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> 
> On 2016-10-31 16:11:02 [+0000], Trond Myklebust wrote:
>> 
>> Yes, and yes. We can’t rely on the list pointers remaining correct, so we restart the list scan and we use the ops->state_flag_bit to signal whether or not state has been recovered for the entry being scanned.
> 
> but this is tested at the top of the loop and by then you look at
> lists' ->next pointer which might be invalid.

No. We ensure we restart the list scan if we release the spinlock. It’s safe…

Patch

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index dff600ae0d74..55d5531aedbb 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -130,7 +130,6 @@  static int nfs_delegation_claim_opens(struct inode *inode,
 	struct nfs_open_context *ctx;
 	struct nfs4_state_owner *sp;
 	struct nfs4_state *state;
-	unsigned int seq;
 	int err;
 
 again:
@@ -150,12 +149,11 @@  static int nfs_delegation_claim_opens(struct inode *inode,
 		sp = state->owner;
 		/* Block nfs4_proc_unlck */
 		mutex_lock(&sp->so_delegreturn_mutex);
-		seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
+		down_read(&sp->so_reclaim_rw);
 		err = nfs4_open_delegation_recall(ctx, state, stateid, type);
 		if (!err)
 			err = nfs_delegation_claim_locks(ctx, state, stateid);
-		if (!err && read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
-			err = -EAGAIN;
+		up_read(&sp->so_reclaim_rw);
 		mutex_unlock(&sp->so_delegreturn_mutex);
 		put_nfs_open_context(ctx);
 		if (err != 0)
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index 9b3a82abab07..03d37826a183 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -111,7 +111,7 @@  struct nfs4_state_owner {
 	unsigned long	     so_flags;
 	struct list_head     so_states;
 	struct nfs_seqid_counter so_seqid;
-	seqcount_t	     so_reclaim_seqcount;
+	struct rw_semaphore  so_reclaim_rw;
 	struct mutex	     so_delegreturn_mutex;
 };
 
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 7897826d7c51..ba6589d1fd36 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -2682,10 +2682,9 @@  static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
 	struct nfs_server *server = sp->so_server;
 	struct dentry *dentry;
 	struct nfs4_state *state;
-	unsigned int seq;
 	int ret;
 
-	seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
+	down_read(&sp->so_reclaim_rw);
 
 	ret = _nfs4_proc_open(opendata);
 	if (ret != 0)
@@ -2721,12 +2720,10 @@  static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
 		goto out;
 
 	ctx->state = state;
-	if (d_inode(dentry) == state->inode) {
+	if (d_inode(dentry) == state->inode)
 		nfs_inode_attach_open_context(ctx);
-		if (read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
-			nfs4_schedule_stateid_recovery(server, state);
-	}
 out:
+	up_read(&sp->so_reclaim_rw);
 	return ret;
 }
 
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 5f4281ec5f72..61c431fa2fe3 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -488,7 +488,7 @@  nfs4_alloc_state_owner(struct nfs_server *server,
 	nfs4_init_seqid_counter(&sp->so_seqid);
 	atomic_set(&sp->so_count, 1);
 	INIT_LIST_HEAD(&sp->so_lru);
-	seqcount_init(&sp->so_reclaim_seqcount);
+	init_rwsem(&sp->so_reclaim_rw);
 	mutex_init(&sp->so_delegreturn_mutex);
 	return sp;
 }
@@ -1497,8 +1497,8 @@  static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
 	 * recovering after a network partition or a reboot from a
 	 * server that doesn't support a grace period.
 	 */
+	down_write(&sp->so_reclaim_rw);
 	spin_lock(&sp->so_lock);
-	raw_write_seqcount_begin(&sp->so_reclaim_seqcount);
 restart:
 	list_for_each_entry(state, &sp->so_states, open_states) {
 		if (!test_and_clear_bit(ops->state_flag_bit, &state->flags))
@@ -1566,14 +1566,12 @@  static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
 		spin_lock(&sp->so_lock);
 		goto restart;
 	}
-	raw_write_seqcount_end(&sp->so_reclaim_seqcount);
 	spin_unlock(&sp->so_lock);
+	up_write(&sp->so_reclaim_rw);
 	return 0;
 out_err:
 	nfs4_put_open_state(state);
-	spin_lock(&sp->so_lock);
-	raw_write_seqcount_end(&sp->so_reclaim_seqcount);
-	spin_unlock(&sp->so_lock);
+	up_write(&sp->so_reclaim_rw);
 	return status;
 }