diff mbox series

nfsd_copy_write_verifier: wrong usage of read_seqbegin_or_lock()

Message ID 20231025163006.GA8279@redhat.com (mailing list archive)
State New, archived
Headers show
Series nfsd_copy_write_verifier: wrong usage of read_seqbegin_or_lock() | expand

Commit Message

Oleg Nesterov Oct. 25, 2023, 4:30 p.m. UTC
Hello,

The usage of writeverf_lock is wrong and misleading no matter what and
I can not understand the intent.

nfsd_copy_write_verifier() uses read_seqbegin_or_lock() incorrectly.
"seq" is always even, so read_seqbegin_or_lock() can never take the
lock for writing. We need to make the counter odd for the 2nd round:

	--- a/fs/nfsd/nfssvc.c
	+++ b/fs/nfsd/nfssvc.c
	@@ -359,11 +359,14 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
	  */
	 void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
	 {
	-	int seq = 0;
	+	int seq, nextseq = 0;
	 
		do {
	+		seq = nextseq;
			read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
			memcpy(verf, nn->writeverf, sizeof(nn->writeverf));
	+		/* If lockless access failed, take the lock. */
	+		nextseq = 1;
		} while (need_seqretry(&nn->writeverf_lock, seq));
		done_seqretry(&nn->writeverf_lock, seq);
	 }

OTOH. This function just copies 8 bytes, this makes me think that it doesn't
need the conditional locking and read_seqbegin_or_lock() at all. So perhaps
the (untested) patch below makes more sense? Please note that it should not
change the current behaviour, it just makes the code look correct (and more
optimal but this is minor).

Another question is why we can't simply turn nn->writeverf into seqcount_t.
I guess we can't because nfsd_reset_write_verifier() needs spin_lock() to
serialise with itself, right?

Oleg.
---

Comments

Chuck Lever Oct. 25, 2023, 5 p.m. UTC | #1
On Wed, Oct 25, 2023 at 06:30:06PM +0200, Oleg Nesterov wrote:
> Hello,
> 
> The usage of writeverf_lock is wrong and misleading no matter what and
> I can not understand the intent.

The structure of the seqlock was introduced in commit 27c438f53e79
("nfsd: Support the server resetting the boot verifier").

The NFS write verifier is an 8-byte cookie that is supposed to
indicate the boot epoch of the server -- simply put, when the server
restarts, the epoch (and this verifier) changes.

NFSv3 and later have a two-phase write scheme where the client
sends data to the server (known as an UNSTABLE WRITE), then later
asks the server to commit that data (a COMMIT). Before the COMMIT,
that data is not durable and the client must hold onto it until
the server's COMMIT Reply indicates it's safe for the client to
discard that data and move on.

When an UNSTABLE WRITE is done, the server reports its current
epoch as part of each WRITE Reply. If this verifier cookie changes,
the client knows that the server might have lost previously
written written-but-uncommitted data, so it must send the WRITEs
again in that (rare) case.

NFSD abuses this slightly by changing the write verifier whenever
there is an underlying local write error that might have occurred in
the background (ie, there was no WRITE or COMMIT operation at the
time that the server could use to convey the error back to the
client). This is supposed to trigger clients to send UNSTABLE WRITEs
again to ensure that data is properly committed to durable storage.

The point of the seqlock is to ensure that

a) a write verifier update does not tear the verifier
b) a write verifier read does not see a torn verifier

This is a hot path, so we don't want a full spinlock to achieve
a) and b).

Way back when, the verifier was updated by two separate 32-bit
stores; hence the risk of tearing.


> nfsd_copy_write_verifier() uses read_seqbegin_or_lock() incorrectly.
> "seq" is always even, so read_seqbegin_or_lock() can never take the
> lock for writing. We need to make the counter odd for the 2nd round:
> 
> 	--- a/fs/nfsd/nfssvc.c
> 	+++ b/fs/nfsd/nfssvc.c
> 	@@ -359,11 +359,14 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
> 	  */
> 	 void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
> 	 {
> 	-	int seq = 0;
> 	+	int seq, nextseq = 0;
> 	 
> 		do {
> 	+		seq = nextseq;
> 			read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
> 			memcpy(verf, nn->writeverf, sizeof(nn->writeverf));
> 	+		/* If lockless access failed, take the lock. */
> 	+		nextseq = 1;
> 		} while (need_seqretry(&nn->writeverf_lock, seq));
> 		done_seqretry(&nn->writeverf_lock, seq);
> 	 }
> 
> OTOH. This function just copies 8 bytes, this makes me think that it doesn't
> need the conditional locking and read_seqbegin_or_lock() at all. So perhaps
> the (untested) patch below makes more sense? Please note that it should not
> change the current behaviour, it just makes the code look correct (and more
> optimal but this is minor).
> 
> Another question is why we can't simply turn nn->writeverf into seqcount_t.
> I guess we can't because nfsd_reset_write_verifier() needs spin_lock() to
> serialise with itself, right?

"reset" is supposed to be very rare operation. Using a lock in that
case is probably quite acceptable, as long as reading the verifier
is wait-free and guaranteed to be untorn.

But a seqcount_t is only 32 bits.


> Oleg.
> ---
> 
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index c7af1095f6b5..094b765c5397 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -359,13 +359,12 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
>   */
>  void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
>  {
> -	int seq = 0;
> +	unsigned seq;
>  
>  	do {
> -		read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
> +		seq = read_seqbegin(&nn->writeverf_lock);
>  		memcpy(verf, nn->writeverf, sizeof(nn->writeverf));
> -	} while (need_seqretry(&nn->writeverf_lock, seq));
> -	done_seqretry(&nn->writeverf_lock, seq);
> +	} while (read_seqretry(&nn->writeverf_lock, seq));
>  }
>  
>  static void nfsd_reset_write_verifier_locked(struct nfsd_net *nn)
>
Oleg Nesterov Oct. 25, 2023, 5:39 p.m. UTC | #2
Hi Chuck,

Thanks for your reply. But I am already sleeping and I can't understand it.
So let me ask a couple of questions.

1. Do you agree that the current nfsd_copy_write_verifier() code makes no sense?

   I mean, the usage of read_seqbegin_or_lock() suggests that if the lockless
   pass fails it should take writeverf_lock for writing. But this can't happen,
   and thus this code doesn't look right no matter what. None of the
   read_seqbegin_or_lock/need_seqretry/done_seqretry helpers make any sense
   because "seq" is alway even.

2. If yes, which change do you prefer? I'd prefer the patch at the end.

Oleg.

On 10/25, Chuck Lever wrote:
>
> On Wed, Oct 25, 2023 at 06:30:06PM +0200, Oleg Nesterov wrote:
> > Hello,
> >
> > The usage of writeverf_lock is wrong and misleading no matter what and
> > I can not understand the intent.
>
> The structure of the seqlock was introduced in commit 27c438f53e79
> ("nfsd: Support the server resetting the boot verifier").
>
> The NFS write verifier is an 8-byte cookie that is supposed to
> indicate the boot epoch of the server -- simply put, when the server
> restarts, the epoch (and this verifier) changes.
>
> NFSv3 and later have a two-phase write scheme where the client
> sends data to the server (known as an UNSTABLE WRITE), then later
> asks the server to commit that data (a COMMIT). Before the COMMIT,
> that data is not durable and the client must hold onto it until
> the server's COMMIT Reply indicates it's safe for the client to
> discard that data and move on.
>
> When an UNSTABLE WRITE is done, the server reports its current
> epoch as part of each WRITE Reply. If this verifier cookie changes,
> the client knows that the server might have lost previously
> written written-but-uncommitted data, so it must send the WRITEs
> again in that (rare) case.
>
> NFSD abuses this slightly by changing the write verifier whenever
> there is an underlying local write error that might have occurred in
> the background (ie, there was no WRITE or COMMIT operation at the
> time that the server could use to convey the error back to the
> client). This is supposed to trigger clients to send UNSTABLE WRITEs
> again to ensure that data is properly committed to durable storage.
>
> The point of the seqlock is to ensure that
>
> a) a write verifier update does not tear the verifier
> b) a write verifier read does not see a torn verifier
>
> This is a hot path, so we don't want a full spinlock to achieve
> a) and b).
>
> Way back when, the verifier was updated by two separate 32-bit
> stores; hence the risk of tearing.
>
>
> > nfsd_copy_write_verifier() uses read_seqbegin_or_lock() incorrectly.
> > "seq" is always even, so read_seqbegin_or_lock() can never take the
> > lock for writing. We need to make the counter odd for the 2nd round:
> >
> > 	--- a/fs/nfsd/nfssvc.c
> > 	+++ b/fs/nfsd/nfssvc.c
> > 	@@ -359,11 +359,14 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
> > 	  */
> > 	 void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
> > 	 {
> > 	-	int seq = 0;
> > 	+	int seq, nextseq = 0;
> >
> > 		do {
> > 	+		seq = nextseq;
> > 			read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
> > 			memcpy(verf, nn->writeverf, sizeof(nn->writeverf));
> > 	+		/* If lockless access failed, take the lock. */
> > 	+		nextseq = 1;
> > 		} while (need_seqretry(&nn->writeverf_lock, seq));
> > 		done_seqretry(&nn->writeverf_lock, seq);
> > 	 }
> >
> > OTOH. This function just copies 8 bytes, this makes me think that it doesn't
> > need the conditional locking and read_seqbegin_or_lock() at all. So perhaps
> > the (untested) patch below makes more sense? Please note that it should not
> > change the current behaviour, it just makes the code look correct (and more
> > optimal but this is minor).
> >
> > Another question is why we can't simply turn nn->writeverf into seqcount_t.
> > I guess we can't because nfsd_reset_write_verifier() needs spin_lock() to
> > serialise with itself, right?
>
> "reset" is supposed to be very rare operation. Using a lock in that
> case is probably quite acceptable, as long as reading the verifier
> is wait-free and guaranteed to be untorn.
>
> But a seqcount_t is only 32 bits.
>
>
> > Oleg.
> > ---
> >
> > diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> > index c7af1095f6b5..094b765c5397 100644
> > --- a/fs/nfsd/nfssvc.c
> > +++ b/fs/nfsd/nfssvc.c
> > @@ -359,13 +359,12 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
> >   */
> >  void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
> >  {
> > -	int seq = 0;
> > +	unsigned seq;
> >
> >  	do {
> > -		read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
> > +		seq = read_seqbegin(&nn->writeverf_lock);
> >  		memcpy(verf, nn->writeverf, sizeof(nn->writeverf));
> > -	} while (need_seqretry(&nn->writeverf_lock, seq));
> > -	done_seqretry(&nn->writeverf_lock, seq);
> > +	} while (read_seqretry(&nn->writeverf_lock, seq));
> >  }
> >
> >  static void nfsd_reset_write_verifier_locked(struct nfsd_net *nn)
> >
>
> --
> Chuck Lever
>
Oleg Nesterov Oct. 25, 2023, 5:47 p.m. UTC | #3
sorry for the noise, forgot to mention.

I personally don't care about nfsd_copy_write_verifier(), and this code doesn't
look really buggy. I am trying to audit the users of read_seqbegin_or_lock(),
see https://lore.kernel.org/all/20231024120808.GA15382@redhat.com/


On 10/25, Oleg Nesterov wrote:
>
> Hi Chuck,
>
> Thanks for your reply. But I am already sleeping and I can't understand it.
> So let me ask a couple of questions.
>
> 1. Do you agree that the current nfsd_copy_write_verifier() code makes no sense?
>
>    I mean, the usage of read_seqbegin_or_lock() suggests that if the lockless
>    pass fails it should take writeverf_lock for writing. But this can't happen,
>    and thus this code doesn't look right no matter what. None of the
>    read_seqbegin_or_lock/need_seqretry/done_seqretry helpers make any sense
>    because "seq" is alway even.
>
> 2. If yes, which change do you prefer? I'd prefer the patch at the end.
>
> Oleg.
>
> On 10/25, Chuck Lever wrote:
> >
> > On Wed, Oct 25, 2023 at 06:30:06PM +0200, Oleg Nesterov wrote:
> > > Hello,
> > >
> > > The usage of writeverf_lock is wrong and misleading no matter what and
> > > I can not understand the intent.
> >
> > The structure of the seqlock was introduced in commit 27c438f53e79
> > ("nfsd: Support the server resetting the boot verifier").
> >
> > The NFS write verifier is an 8-byte cookie that is supposed to
> > indicate the boot epoch of the server -- simply put, when the server
> > restarts, the epoch (and this verifier) changes.
> >
> > NFSv3 and later have a two-phase write scheme where the client
> > sends data to the server (known as an UNSTABLE WRITE), then later
> > asks the server to commit that data (a COMMIT). Before the COMMIT,
> > that data is not durable and the client must hold onto it until
> > the server's COMMIT Reply indicates it's safe for the client to
> > discard that data and move on.
> >
> > When an UNSTABLE WRITE is done, the server reports its current
> > epoch as part of each WRITE Reply. If this verifier cookie changes,
> > the client knows that the server might have lost previously
> > written written-but-uncommitted data, so it must send the WRITEs
> > again in that (rare) case.
> >
> > NFSD abuses this slightly by changing the write verifier whenever
> > there is an underlying local write error that might have occurred in
> > the background (ie, there was no WRITE or COMMIT operation at the
> > time that the server could use to convey the error back to the
> > client). This is supposed to trigger clients to send UNSTABLE WRITEs
> > again to ensure that data is properly committed to durable storage.
> >
> > The point of the seqlock is to ensure that
> >
> > a) a write verifier update does not tear the verifier
> > b) a write verifier read does not see a torn verifier
> >
> > This is a hot path, so we don't want a full spinlock to achieve
> > a) and b).
> >
> > Way back when, the verifier was updated by two separate 32-bit
> > stores; hence the risk of tearing.
> >
> >
> > > nfsd_copy_write_verifier() uses read_seqbegin_or_lock() incorrectly.
> > > "seq" is always even, so read_seqbegin_or_lock() can never take the
> > > lock for writing. We need to make the counter odd for the 2nd round:
> > >
> > > 	--- a/fs/nfsd/nfssvc.c
> > > 	+++ b/fs/nfsd/nfssvc.c
> > > 	@@ -359,11 +359,14 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
> > > 	  */
> > > 	 void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
> > > 	 {
> > > 	-	int seq = 0;
> > > 	+	int seq, nextseq = 0;
> > >
> > > 		do {
> > > 	+		seq = nextseq;
> > > 			read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
> > > 			memcpy(verf, nn->writeverf, sizeof(nn->writeverf));
> > > 	+		/* If lockless access failed, take the lock. */
> > > 	+		nextseq = 1;
> > > 		} while (need_seqretry(&nn->writeverf_lock, seq));
> > > 		done_seqretry(&nn->writeverf_lock, seq);
> > > 	 }
> > >
> > > OTOH. This function just copies 8 bytes, this makes me think that it doesn't
> > > need the conditional locking and read_seqbegin_or_lock() at all. So perhaps
> > > the (untested) patch below makes more sense? Please note that it should not
> > > change the current behaviour, it just makes the code look correct (and more
> > > optimal but this is minor).
> > >
> > > Another question is why we can't simply turn nn->writeverf into seqcount_t.
> > > I guess we can't because nfsd_reset_write_verifier() needs spin_lock() to
> > > serialise with itself, right?
> >
> > "reset" is supposed to be very rare operation. Using a lock in that
> > case is probably quite acceptable, as long as reading the verifier
> > is wait-free and guaranteed to be untorn.
> >
> > But a seqcount_t is only 32 bits.
> >
> >
> > > Oleg.
> > > ---
> > >
> > > diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> > > index c7af1095f6b5..094b765c5397 100644
> > > --- a/fs/nfsd/nfssvc.c
> > > +++ b/fs/nfsd/nfssvc.c
> > > @@ -359,13 +359,12 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
> > >   */
> > >  void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
> > >  {
> > > -	int seq = 0;
> > > +	unsigned seq;
> > >
> > >  	do {
> > > -		read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
> > > +		seq = read_seqbegin(&nn->writeverf_lock);
> > >  		memcpy(verf, nn->writeverf, sizeof(nn->writeverf));
> > > -	} while (need_seqretry(&nn->writeverf_lock, seq));
> > > -	done_seqretry(&nn->writeverf_lock, seq);
> > > +	} while (read_seqretry(&nn->writeverf_lock, seq));
> > >  }
> > >
> > >  static void nfsd_reset_write_verifier_locked(struct nfsd_net *nn)
> > >
> >
> > --
> > Chuck Lever
> >
Oleg Nesterov Oct. 25, 2023, 5:54 p.m. UTC | #4
On 10/25, Chuck Lever wrote:
>
> > Another question is why we can't simply turn nn->writeverf into seqcount_t.
> > I guess we can't because nfsd_reset_write_verifier() needs spin_lock() to
> > serialise with itself, right?
>
> "reset" is supposed to be very rare operation. Using a lock in that
> case is probably quite acceptable, as long as reading the verifier
> is wait-free and guaranteed to be untorn.
>
> But a seqcount_t is only 32 bits.

Again, I don't understand you.

Once again, we can turn writeverf into seqcount_t, see the patch below.

But this way nfsd_reset_write_verifier() can race with itself, no?

Oleg
---

diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h
index ec49b200b797..3e2adf3eb15f 100644
--- a/fs/nfsd/netns.h
+++ b/fs/nfsd/netns.h
@@ -110,7 +110,7 @@ struct nfsd_net {
 	bool nfsd_net_up;
 	bool lockd_up;
 
-	seqlock_t writeverf_lock;
+	seqcount_t writeverf_lock;
 	unsigned char writeverf[8];
 
 	/*
diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index 7ed02fb88a36..6320491018f8 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -1523,7 +1523,7 @@ static __net_init int nfsd_net_init(struct net *net)
 	nn->nfsd4_minorversions = NULL;
 	nfsd4_init_leases_net(nn);
 	get_random_bytes(&nn->siphash_key, sizeof(nn->siphash_key));
-	seqlock_init(&nn->writeverf_lock);
+	seqcount_init(&nn->writeverf_lock);
 
 	return 0;
 
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index c7af1095f6b5..fc4e31411508 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -359,13 +359,12 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
  */
 void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
 {
-	int seq = 0;
+	int seq;
 
 	do {
-		read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
+		seq = read_seqcount_begin(&nn->writeverf_lock);
 		memcpy(verf, nn->writeverf, sizeof(nn->writeverf));
-	} while (need_seqretry(&nn->writeverf_lock, seq));
-	done_seqretry(&nn->writeverf_lock, seq);
+	} while (read_seqcount_retry(&nn->writeverf_lock, seq));
 }
 
 static void nfsd_reset_write_verifier_locked(struct nfsd_net *nn)
@@ -397,9 +396,9 @@ static void nfsd_reset_write_verifier_locked(struct nfsd_net *nn)
  */
 void nfsd_reset_write_verifier(struct nfsd_net *nn)
 {
-	write_seqlock(&nn->writeverf_lock);
+	write_seqcount_begin(&nn->writeverf_lock);
 	nfsd_reset_write_verifier_locked(nn);
-	write_sequnlock(&nn->writeverf_lock);
+	write_seqcount_end(&nn->writeverf_lock);
 }
 
 /*
Chuck Lever Oct. 25, 2023, 5:57 p.m. UTC | #5
On Wed, Oct 25, 2023 at 07:39:31PM +0200, Oleg Nesterov wrote:
> Hi Chuck,
> 
> Thanks for your reply. But I am already sleeping and I can't understand it.

I was responding to "I can not understand the intent." But also I
was hoping that explanation would help you provide a correct
replacement for the existing code.


> So let me ask a couple of questions.
> 
> 1. Do you agree that the current nfsd_copy_write_verifier() code makes no sense?

Probably.


>    I mean, the usage of read_seqbegin_or_lock() suggests that if the lockless
>    pass fails it should take writeverf_lock for writing. But this can't happen,
>    and thus this code doesn't look right no matter what. None of the
>    read_seqbegin_or_lock/need_seqretry/done_seqretry helpers make any sense
>    because "seq" is alway even.

> 2. If yes, which change do you prefer? I'd prefer the patch at the end.

Based on my limited understanding of read_seqbegin(), the patch at
the end seems cleanest and is on-point. Please post an official
version of that to linux-nfs@ with a full patch description, and
I'll see that it gets into v6.8-rc with proper tags, review, and
testing.


> Oleg.
> 
> On 10/25, Chuck Lever wrote:
> >
> > On Wed, Oct 25, 2023 at 06:30:06PM +0200, Oleg Nesterov wrote:
> > > Hello,
> > >
> > > The usage of writeverf_lock is wrong and misleading no matter what and
> > > I can not understand the intent.
> >
> > The structure of the seqlock was introduced in commit 27c438f53e79
> > ("nfsd: Support the server resetting the boot verifier").
> >
> > The NFS write verifier is an 8-byte cookie that is supposed to
> > indicate the boot epoch of the server -- simply put, when the server
> > restarts, the epoch (and this verifier) changes.
> >
> > NFSv3 and later have a two-phase write scheme where the client
> > sends data to the server (known as an UNSTABLE WRITE), then later
> > asks the server to commit that data (a COMMIT). Before the COMMIT,
> > that data is not durable and the client must hold onto it until
> > the server's COMMIT Reply indicates it's safe for the client to
> > discard that data and move on.
> >
> > When an UNSTABLE WRITE is done, the server reports its current
> > epoch as part of each WRITE Reply. If this verifier cookie changes,
> > the client knows that the server might have lost previously
> > written written-but-uncommitted data, so it must send the WRITEs
> > again in that (rare) case.
> >
> > NFSD abuses this slightly by changing the write verifier whenever
> > there is an underlying local write error that might have occurred in
> > the background (ie, there was no WRITE or COMMIT operation at the
> > time that the server could use to convey the error back to the
> > client). This is supposed to trigger clients to send UNSTABLE WRITEs
> > again to ensure that data is properly committed to durable storage.
> >
> > The point of the seqlock is to ensure that
> >
> > a) a write verifier update does not tear the verifier
> > b) a write verifier read does not see a torn verifier
> >
> > This is a hot path, so we don't want a full spinlock to achieve
> > a) and b).
> >
> > Way back when, the verifier was updated by two separate 32-bit
> > stores; hence the risk of tearing.
> >
> >
> > > nfsd_copy_write_verifier() uses read_seqbegin_or_lock() incorrectly.
> > > "seq" is always even, so read_seqbegin_or_lock() can never take the
> > > lock for writing. We need to make the counter odd for the 2nd round:
> > >
> > > 	--- a/fs/nfsd/nfssvc.c
> > > 	+++ b/fs/nfsd/nfssvc.c
> > > 	@@ -359,11 +359,14 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
> > > 	  */
> > > 	 void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
> > > 	 {
> > > 	-	int seq = 0;
> > > 	+	int seq, nextseq = 0;
> > >
> > > 		do {
> > > 	+		seq = nextseq;
> > > 			read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
> > > 			memcpy(verf, nn->writeverf, sizeof(nn->writeverf));
> > > 	+		/* If lockless access failed, take the lock. */
> > > 	+		nextseq = 1;
> > > 		} while (need_seqretry(&nn->writeverf_lock, seq));
> > > 		done_seqretry(&nn->writeverf_lock, seq);
> > > 	 }
> > >
> > > OTOH. This function just copies 8 bytes, this makes me think that it doesn't
> > > need the conditional locking and read_seqbegin_or_lock() at all. So perhaps
> > > the (untested) patch below makes more sense? Please note that it should not
> > > change the current behaviour, it just makes the code look correct (and more
> > > optimal but this is minor).
> > >
> > > Another question is why we can't simply turn nn->writeverf into seqcount_t.
> > > I guess we can't because nfsd_reset_write_verifier() needs spin_lock() to
> > > serialise with itself, right?
> >
> > "reset" is supposed to be very rare operation. Using a lock in that
> > case is probably quite acceptable, as long as reading the verifier
> > is wait-free and guaranteed to be untorn.
> >
> > But a seqcount_t is only 32 bits.
> >
> >
> > > Oleg.
> > > ---
> > >
> > > diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> > > index c7af1095f6b5..094b765c5397 100644
> > > --- a/fs/nfsd/nfssvc.c
> > > +++ b/fs/nfsd/nfssvc.c
> > > @@ -359,13 +359,12 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
> > >   */
> > >  void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
> > >  {
> > > -	int seq = 0;
> > > +	unsigned seq;
> > >
> > >  	do {
> > > -		read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
> > > +		seq = read_seqbegin(&nn->writeverf_lock);
> > >  		memcpy(verf, nn->writeverf, sizeof(nn->writeverf));
> > > -	} while (need_seqretry(&nn->writeverf_lock, seq));
> > > -	done_seqretry(&nn->writeverf_lock, seq);
> > > +	} while (read_seqretry(&nn->writeverf_lock, seq));
> > >  }
> > >
> > >  static void nfsd_reset_write_verifier_locked(struct nfsd_net *nn)
> > >
> >
> > --
> > Chuck Lever
> >
>
Chuck Lever Oct. 25, 2023, 6:07 p.m. UTC | #6
On Wed, Oct 25, 2023 at 07:54:36PM +0200, Oleg Nesterov wrote:
> On 10/25, Chuck Lever wrote:
> >
> > > Another question is why we can't simply turn nn->writeverf into seqcount_t.
> > > I guess we can't because nfsd_reset_write_verifier() needs spin_lock() to
> > > serialise with itself, right?
> >
> > "reset" is supposed to be very rare operation. Using a lock in that
> > case is probably quite acceptable, as long as reading the verifier
> > is wait-free and guaranteed to be untorn.
> >
> > But a seqcount_t is only 32 bits.
> 
> Again, I don't understand you.
> 
> Once again, we can turn writeverf into seqcount_t, see the patch below.

The patch below does not turn "writeverf" into a seqcount_t, it
turns "writeverf_lock" into a seqcount_t. "writeverf" is an 8-byte
field, that's why I said "seqcount_t is only 32 bits" -- that is
not an adequate replacement for the 8-byte "writeverf" field.

Your original proposal made no sense. But I see now what you
would like to change.

I'm not familiar enough with these primitives to have a strong
opinion. What do you think would be the benefit?


> But this way nfsd_reset_write_verifier() can race with itself, no?

> Oleg
> ---
> 
> diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h
> index ec49b200b797..3e2adf3eb15f 100644
> --- a/fs/nfsd/netns.h
> +++ b/fs/nfsd/netns.h
> @@ -110,7 +110,7 @@ struct nfsd_net {
>  	bool nfsd_net_up;
>  	bool lockd_up;
>  
> -	seqlock_t writeverf_lock;
> +	seqcount_t writeverf_lock;
>  	unsigned char writeverf[8];
>  
>  	/*
> diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
> index 7ed02fb88a36..6320491018f8 100644
> --- a/fs/nfsd/nfsctl.c
> +++ b/fs/nfsd/nfsctl.c
> @@ -1523,7 +1523,7 @@ static __net_init int nfsd_net_init(struct net *net)
>  	nn->nfsd4_minorversions = NULL;
>  	nfsd4_init_leases_net(nn);
>  	get_random_bytes(&nn->siphash_key, sizeof(nn->siphash_key));
> -	seqlock_init(&nn->writeverf_lock);
> +	seqcount_init(&nn->writeverf_lock);
>  
>  	return 0;
>  
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index c7af1095f6b5..fc4e31411508 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -359,13 +359,12 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
>   */
>  void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
>  {
> -	int seq = 0;
> +	int seq;
>  
>  	do {
> -		read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
> +		seq = read_seqcount_begin(&nn->writeverf_lock);
>  		memcpy(verf, nn->writeverf, sizeof(nn->writeverf));
> -	} while (need_seqretry(&nn->writeverf_lock, seq));
> -	done_seqretry(&nn->writeverf_lock, seq);
> +	} while (read_seqcount_retry(&nn->writeverf_lock, seq));
>  }
>  
>  static void nfsd_reset_write_verifier_locked(struct nfsd_net *nn)
> @@ -397,9 +396,9 @@ static void nfsd_reset_write_verifier_locked(struct nfsd_net *nn)
>   */
>  void nfsd_reset_write_verifier(struct nfsd_net *nn)
>  {
> -	write_seqlock(&nn->writeverf_lock);
> +	write_seqcount_begin(&nn->writeverf_lock);
>  	nfsd_reset_write_verifier_locked(nn);
> -	write_sequnlock(&nn->writeverf_lock);
> +	write_seqcount_end(&nn->writeverf_lock);
>  }
>  
>  /*
>
Oleg Nesterov Oct. 25, 2023, 6:10 p.m. UTC | #7
On 10/25, Chuck Lever wrote:
>
> On Wed, Oct 25, 2023 at 07:39:31PM +0200, Oleg Nesterov wrote:
> > Hi Chuck,
> >
> > Thanks for your reply. But I am already sleeping and I can't understand it.
>
> I was responding to "I can not understand the intent." But also I
> was hoping that explanation would help you provide a correct
> replacement for the existing code.

In case I was not clear, I have already provided the replacement for the
existing code which looks "correct" for me ;) Nevermind, please forget.

> > 1. Do you agree that the current nfsd_copy_write_verifier() code makes no sense?
>
> Probably.
>
>
> >    I mean, the usage of read_seqbegin_or_lock() suggests that if the lockless
> >    pass fails it should take writeverf_lock for writing. But this can't happen,
> >    and thus this code doesn't look right no matter what. None of the
> >    read_seqbegin_or_lock/need_seqretry/done_seqretry helpers make any sense
> >    because "seq" is alway even.
>
> > 2. If yes, which change do you prefer? I'd prefer the patch at the end.
>
> Based on my limited understanding of read_seqbegin(), the patch at
> the end seems cleanest and is on-point. Please post an official
> version of that to linux-nfs@ with a full patch description, and
> I'll see that it gets into v6.8-rc with proper tags, review, and
> testing.

Ok, will do tomorrow.

Thanks,

Oleg.
Oleg Nesterov Oct. 25, 2023, 6:19 p.m. UTC | #8
On 10/25, Chuck Lever wrote:
>
> On Wed, Oct 25, 2023 at 07:54:36PM +0200, Oleg Nesterov wrote:
> > On 10/25, Chuck Lever wrote:
> > >
> > > > Another question is why we can't simply turn nn->writeverf into seqcount_t.
> > > > I guess we can't because nfsd_reset_write_verifier() needs spin_lock() to
> > > > serialise with itself, right?
> > >
> > > "reset" is supposed to be very rare operation. Using a lock in that
> > > case is probably quite acceptable, as long as reading the verifier
> > > is wait-free and guaranteed to be untorn.
> > >
> > > But a seqcount_t is only 32 bits.
> >
> > Again, I don't understand you.
> >
> > Once again, we can turn writeverf into seqcount_t, see the patch below.
>
> The patch below does not turn "writeverf" into a seqcount_t, it
> turns "writeverf_lock" into a seqcount_t.

Yes, typo. Of course I meant writeverf_lock. A bit strange it was not clear.

> Your original proposal made no sense.

Which one??? I thought that you agree that the current nfsd_copy_write_verifier()
code makes no send, at least that is how I interpreted your previous email. Confused.

> But I see now what you
> would like to change.

OK,

> I'm not familiar enough with these primitives to have a strong
> opinion. What do you think would be the benefit?

See above. And just in case let me repeat. No, I don't think we can/should turn
writeverf_lock (double check I didn't say "writeverf") into seqcount_t.

Oleg.
diff mbox series

Patch

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index c7af1095f6b5..094b765c5397 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -359,13 +359,12 @@  static bool nfsd_needs_lockd(struct nfsd_net *nn)
  */
 void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
 {
-	int seq = 0;
+	unsigned seq;
 
 	do {
-		read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
+		seq = read_seqbegin(&nn->writeverf_lock);
 		memcpy(verf, nn->writeverf, sizeof(nn->writeverf));
-	} while (need_seqretry(&nn->writeverf_lock, seq));
-	done_seqretry(&nn->writeverf_lock, seq);
+	} while (read_seqretry(&nn->writeverf_lock, seq));
 }
 
 static void nfsd_reset_write_verifier_locked(struct nfsd_net *nn)