diff mbox series

[3/3] nfsd: simplify write verifier handling

Message ID 20230213211345.385005-4-jlayton@kernel.org (mailing list archive)
State New, archived
Headers show
Series nfsd: write verifier fixes and optimization | expand

Commit Message

Jeff Layton Feb. 13, 2023, 9:13 p.m. UTC
The write verifier exists to tell the client when the server may have
forgotten some unstable writes. The typical way that this happens is if
the server crashes, but we've also extended nfsd to change it when there
are writeback errors as well.

The way it works today though, we call something like vfs_fsync (e.g.
for a COMMIT call) and if we get back an error, we'll reset the write
verifier.

This is non-optimal for a couple of reasons:

1/ There could be significant delay between an error being
recorded and the reset. It would be ideal if the write verifier were to
change as soon as the error was recorded.

2/ It's a bit of a waste, in that if we get a writeback error on a
single inode, we'll end up resetting the write verifier for everything,
even on inodes that may be fine (e.g. on a completely separate fs).

The protocol doesn't require that we use the same verifier for all
inodes. The only requirement is that the verifier change if the server
may have forgotten some unstable writes.

Instead of resetting the per-net write verifier on errors, we can just
fetch the current errseq_t for the inode, and fold that value into the
verifier on a write. If an error is reported, then that value will
change and the verifier will also naturally change without nfsd having
to take any explicit steps.

Make nfsd only set the per-net verifier at startup time. When we need a
verifier for a reply, fetch the current errseq_t value for the mapping
and xor it into the per-net verifier.

Cc: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/nfsd/filecache.c | 22 +--------------------
 fs/nfsd/netns.h     |  4 ----
 fs/nfsd/nfs4proc.c  | 17 +++++++---------
 fs/nfsd/nfsctl.c    |  1 -
 fs/nfsd/nfssvc.c    | 48 ++++++++++++++-------------------------------
 fs/nfsd/trace.h     | 28 --------------------------
 fs/nfsd/vfs.c       | 28 +++++---------------------
 fs/nfsd/vfs.h       |  1 +
 8 files changed, 29 insertions(+), 120 deletions(-)

Comments

Rick Macklem Feb. 14, 2023, 12:49 a.m. UTC | #1
On Mon, Feb 13, 2023 at 1:14 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to IThelp@uoguelph.ca
>
>
> The write verifier exists to tell the client when the server may have
> forgotten some unstable writes. The typical way that this happens is if
> the server crashes, but we've also extended nfsd to change it when there
> are writeback errors as well.
>
> The way it works today though, we call something like vfs_fsync (e.g.
> for a COMMIT call) and if we get back an error, we'll reset the write
> verifier.
>
> This is non-optimal for a couple of reasons:
>
> 1/ There could be significant delay between an error being
> recorded and the reset. It would be ideal if the write verifier were to
> change as soon as the error was recorded.
>
> 2/ It's a bit of a waste, in that if we get a writeback error on a
> single inode, we'll end up resetting the write verifier for everything,
> even on inodes that may be fine (e.g. on a completely separate fs).
>
Here's the snippet from RFC8881:
   The final portion of the result is the field writeverf.  This field
   is the write verifier and is a cookie that the client can use to
   determine whether a server has changed instance state (e.g., server
   restart) between a call to WRITE and a subsequent call to either
   WRITE or COMMIT.  This cookie MUST be unchanged during a single
   instance of the NFSv4.1 server and MUST be unique between instances
   of the NFSv4.1 server.  If the cookie changes, then the client MUST
   assume that any data written with an UNSTABLE4 value for committed
   and an old writeverf in the reply has been lost and will need to be
   recovered.

I've always interpreted the writeverf as "per-server" and not  "per-file".
Although I'll admit the above does not make that crystal clear, it does make
it clear that the writeverf applies to a "server instance" and not a file or
file system on the server.

The FreeBSD client assumes it is "per-server" and re-writes all uncommitted
writes for the server, not just ones for the file (or file system) the
writeverf is
replied with.  (I vaguely recall Solaris does the same?)

At the very least, I think you should run this past the IETF working group
(nfsv4@ietf.org) to see what they say w.r.t. the writeverf being "per-file" vs
"per-server".

rick

> The protocol doesn't require that we use the same verifier for all
> inodes. The only requirement is that the verifier change if the server
> may have forgotten some unstable writes.
>
> Instead of resetting the per-net write verifier on errors, we can just
> fetch the current errseq_t for the inode, and fold that value into the
> verifier on a write. If an error is reported, then that value will
> change and the verifier will also naturally change without nfsd having
> to take any explicit steps.
>
> Make nfsd only set the per-net verifier at startup time. When we need a
> verifier for a reply, fetch the current errseq_t value for the mapping
> and xor it into the per-net verifier.
>
> Cc: Trond Myklebust <trondmy@hammerspace.com>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/nfsd/filecache.c | 22 +--------------------
>  fs/nfsd/netns.h     |  4 ----
>  fs/nfsd/nfs4proc.c  | 17 +++++++---------
>  fs/nfsd/nfsctl.c    |  1 -
>  fs/nfsd/nfssvc.c    | 48 ++++++++++++++-------------------------------
>  fs/nfsd/trace.h     | 28 --------------------------
>  fs/nfsd/vfs.c       | 28 +++++---------------------
>  fs/nfsd/vfs.h       |  1 +
>  8 files changed, 29 insertions(+), 120 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 3b9a10378c83..1ca9ad0aabcd 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -233,23 +233,6 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
>         return nf;
>  }
>
> -/**
> - * nfsd_file_check_write_error - check for writeback errors on a file
> - * @nf: nfsd_file to check for writeback errors
> - *
> - * Check whether a nfsd_file has an unseen error. Reset the write
> - * verifier if so.
> - */
> -static void
> -nfsd_file_check_write_error(struct nfsd_file *nf)
> -{
> -       struct file *file = nf->nf_file;
> -
> -       if ((file->f_mode & FMODE_WRITE) &&
> -           filemap_check_wb_err(file->f_mapping, READ_ONCE(file->f_wb_err)))
> -               nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
> -}
> -
>  static void
>  nfsd_file_hash_remove(struct nfsd_file *nf)
>  {
> @@ -281,10 +264,8 @@ nfsd_file_free(struct nfsd_file *nf)
>         nfsd_file_unhash(nf);
>         if (nf->nf_mark)
>                 nfsd_file_mark_put(nf->nf_mark);
> -       if (nf->nf_file) {
> -               nfsd_file_check_write_error(nf);
> +       if (nf->nf_file)
>                 filp_close(nf->nf_file, NULL);
> -       }
>
>         /*
>          * If this item is still linked via nf_lru, that's a bug.
> @@ -1038,7 +1019,6 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  out:
>         if (status == nfs_ok) {
>                 this_cpu_inc(nfsd_file_acquisitions);
> -               nfsd_file_check_write_error(nf);
>                 *pnf = nf;
>         }
>         put_cred(cred);
> diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h
> index 51a4b7885cae..95d7adb73b24 100644
> --- a/fs/nfsd/netns.h
> +++ b/fs/nfsd/netns.h
> @@ -109,7 +109,6 @@ struct nfsd_net {
>         bool nfsd_net_up;
>         bool lockd_up;
>
> -       seqlock_t writeverf_lock;
>         unsigned char writeverf[8];
>
>         /*
> @@ -204,7 +203,4 @@ struct nfsd_net {
>  extern void nfsd_netns_free_versions(struct nfsd_net *nn);
>
>  extern unsigned int nfsd_net_id;
> -
> -void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn);
> -void nfsd_reset_write_verifier(struct nfsd_net *nn);
>  #endif /* __NFSD_NETNS_H__ */
> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> index 5ae670807449..d63648d79f17 100644
> --- a/fs/nfsd/nfs4proc.c
> +++ b/fs/nfsd/nfs4proc.c
> @@ -717,15 +717,6 @@ nfsd4_access(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>                            &access->ac_supported);
>  }
>
> -static void gen_boot_verifier(nfs4_verifier *verifier, struct net *net)
> -{
> -       __be32 *verf = (__be32 *)verifier->data;
> -
> -       BUILD_BUG_ON(2*sizeof(*verf) != sizeof(verifier->data));
> -
> -       nfsd_copy_write_verifier(verf, net_generic(net, nfsd_net_id));
> -}
> -
>  static __be32
>  nfsd4_commit(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>              union nfsd4_op_u *u)
> @@ -1586,11 +1577,17 @@ static const struct nfsd4_callback_ops nfsd4_cb_offload_ops = {
>
>  static void nfsd4_init_copy_res(struct nfsd4_copy *copy, bool sync)
>  {
> +       __be32 *verf;
> +       nfs4_verifier *verifier = &copy->cp_res.wr_verifier;
> +
> +       BUILD_BUG_ON(2*sizeof(*verf) != sizeof(verifier->data));
> +
>         copy->cp_res.wr_stable_how =
>                 test_bit(NFSD4_COPY_F_COMMITTED, &copy->cp_flags) ?
>                         NFS_FILE_SYNC : NFS_UNSTABLE;
>         nfsd4_copy_set_sync(copy, sync);
> -       gen_boot_verifier(&copy->cp_res.wr_verifier, copy->cp_clp->net);
> +       verf = (__be32 *)verifier->data;
> +       nfsd_set_write_verifier(verf, copy->nf_dst);
>  }
>
>  static ssize_t _nfsd_copy_file_range(struct nfsd4_copy *copy,
> diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
> index 04474b8ccf0a..28a73577beaa 100644
> --- a/fs/nfsd/nfsctl.c
> +++ b/fs/nfsd/nfsctl.c
> @@ -1501,7 +1501,6 @@ static __net_init int nfsd_init_net(struct net *net)
>         nn->nfsd4_minorversions = NULL;
>         nfsd4_init_leases_net(nn);
>         get_random_bytes(&nn->siphash_key, sizeof(nn->siphash_key));
> -       seqlock_init(&nn->writeverf_lock);
>
>         return 0;
>
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index 3a38ab304b02..8b09bfc1fae9 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -350,25 +350,27 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
>  }
>
>  /**
> - * nfsd_copy_write_verifier - Atomically copy a write verifier
> + * nfsd_set_write_verifier - set the write verifier for a call
>   * @verf: buffer in which to receive the verifier cookie
> - * @nn: NFS net namespace
> + * @nf: nfsd_file being operated on for write
>   *
> - * This function provides a wait-free mechanism for copying the
> - * namespace's write verifier without tearing it.
> + * Grab the (static) write verifier for the nfsd_net, and then fold
> + * the current errseq_t value into it for the inode to create a
> + * write verifier.
>   */
> -void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
> +void nfsd_set_write_verifier(__be32 verf[2], struct nfsd_file *nf)
>  {
> -       int seq = 0;
> +       struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id);
> +       errseq_t eseq = errseq_fetch(&nf->nf_file->f_mapping->wb_err);
>
> -       do {
> -               read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
> -               memcpy(verf, nn->writeverf, sizeof(*verf) * 2);
> -       } while (need_seqretry(&nn->writeverf_lock, seq));
> -       done_seqretry(&nn->writeverf_lock, seq);
> +       /* copy in the per-net write verifier */
> +       memcpy(verf, nn->writeverf, sizeof(*verf) * 2);
> +
> +       /* fold the errseq into first word */
> +       verf[0] ^= (__force __be32)eseq;
>  }
>
> -static void nfsd_reset_write_verifier_locked(struct nfsd_net *nn)
> +static void nfsd_init_write_verifier(struct nfsd_net *nn)
>  {
>         struct timespec64 now;
>         u64 verf;
> @@ -382,26 +384,6 @@ static void nfsd_reset_write_verifier_locked(struct nfsd_net *nn)
>         memcpy(nn->writeverf, &verf, sizeof(nn->writeverf));
>  }
>
> -/**
> - * nfsd_reset_write_verifier - Generate a new write verifier
> - * @nn: NFS net namespace
> - *
> - * This function updates the ->writeverf field of @nn. This field
> - * contains an opaque cookie that, according to Section 18.32.3 of
> - * RFC 8881, "the client can use to determine whether a server has
> - * changed instance state (e.g., server restart) between a call to
> - * WRITE and a subsequent call to either WRITE or COMMIT.  This
> - * cookie MUST be unchanged during a single instance of the NFSv4.1
> - * server and MUST be unique between instances of the NFSv4.1
> - * server."
> - */
> -void nfsd_reset_write_verifier(struct nfsd_net *nn)
> -{
> -       write_seqlock(&nn->writeverf_lock);
> -       nfsd_reset_write_verifier_locked(nn);
> -       write_sequnlock(&nn->writeverf_lock);
> -}
> -
>  static int nfsd_startup_net(struct net *net, const struct cred *cred)
>  {
>         struct nfsd_net *nn = net_generic(net, nfsd_net_id);
> @@ -687,7 +669,7 @@ int nfsd_create_serv(struct net *net)
>                 register_inet6addr_notifier(&nfsd_inet6addr_notifier);
>  #endif
>         }
> -       nfsd_reset_write_verifier(nn);
> +       nfsd_init_write_verifier(nn);
>         return 0;
>  }
>
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index 4183819ea082..93d76fe33514 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -744,34 +744,6 @@ DEFINE_EVENT(nfsd_net_class, nfsd_##name, \
>  DEFINE_NET_EVENT(grace_start);
>  DEFINE_NET_EVENT(grace_complete);
>
> -TRACE_EVENT(nfsd_writeverf_reset,
> -       TP_PROTO(
> -               const struct nfsd_net *nn,
> -               const struct svc_rqst *rqstp,
> -               int error
> -       ),
> -       TP_ARGS(nn, rqstp, error),
> -       TP_STRUCT__entry(
> -               __field(unsigned long long, boot_time)
> -               __field(u32, xid)
> -               __field(int, error)
> -               __array(unsigned char, verifier, NFS4_VERIFIER_SIZE)
> -       ),
> -       TP_fast_assign(
> -               __entry->boot_time = nn->boot_time;
> -               __entry->xid = be32_to_cpu(rqstp->rq_xid);
> -               __entry->error = error;
> -
> -               /* avoid seqlock inside TP_fast_assign */
> -               memcpy(__entry->verifier, nn->writeverf,
> -                      NFS4_VERIFIER_SIZE);
> -       ),
> -       TP_printk("boot_time=%16llx xid=0x%08x error=%d new verifier=0x%s",
> -               __entry->boot_time, __entry->xid, __entry->error,
> -               __print_hex_str(__entry->verifier, NFS4_VERIFIER_SIZE)
> -       )
> -);
> -
>  TRACE_EVENT(nfsd_clid_cred_mismatch,
>         TP_PROTO(
>                 const struct nfs4_client *clp,
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 21d5209f6e04..6e8ba8357735 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -628,17 +628,12 @@ __be32 nfsd4_clone_file_range(struct svc_rqst *rqstp,
>                 if (!status)
>                         status = commit_inode_metadata(file_inode(src));
>                 if (status < 0) {
> -                       struct nfsd_net *nn = net_generic(nf_dst->nf_net,
> -                                                         nfsd_net_id);
> -
>                         trace_nfsd_clone_file_range_err(rqstp,
>                                         &nfsd4_get_cstate(rqstp)->save_fh,
>                                         src_pos,
>                                         &nfsd4_get_cstate(rqstp)->current_fh,
>                                         dst_pos,
>                                         count, status);
> -                       nfsd_reset_write_verifier(nn);
> -                       trace_nfsd_writeverf_reset(nn, rqstp, status);
>                         ret = nfserrno(status);
>                 }
>         }
> @@ -1058,7 +1053,6 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
>                                 unsigned long *cnt, int stable,
>                                 __be32 *verf)
>  {
> -       struct nfsd_net         *nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
>         struct file             *file = nf->nf_file;
>         struct super_block      *sb = file_inode(file)->i_sb;
>         struct svc_export       *exp;
> @@ -1103,13 +1097,10 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
>         iov_iter_kvec(&iter, ITER_SOURCE, vec, vlen, *cnt);
>         since = READ_ONCE(file->f_wb_err);
>         if (verf)
> -               nfsd_copy_write_verifier(verf, nn);
> +               nfsd_set_write_verifier(verf, nf);
>         host_err = vfs_iter_write(file, &iter, &pos, flags);
> -       if (host_err < 0) {
> -               nfsd_reset_write_verifier(nn);
> -               trace_nfsd_writeverf_reset(nn, rqstp, host_err);
> +       if (host_err < 0)
>                 goto out_nfserr;
> -       }
>         *cnt = host_err;
>         nfsd_stats_io_write_add(exp, *cnt);
>         fsnotify_modify(file);
> @@ -1117,13 +1108,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
>         if (host_err < 0)
>                 goto out_nfserr;
>
> -       if (stable && use_wgather) {
> +       if (stable && use_wgather)
>                 host_err = wait_for_concurrent_writes(file);
> -               if (host_err < 0) {
> -                       nfsd_reset_write_verifier(nn);
> -                       trace_nfsd_writeverf_reset(nn, rqstp, host_err);
> -               }
> -       }
>
>  out_nfserr:
>         if (host_err >= 0) {
> @@ -1223,7 +1209,6 @@ nfsd_commit(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
>         __be32                  err = nfs_ok;
>         u64                     maxbytes;
>         loff_t                  start, end;
> -       struct nfsd_net         *nn;
>
>         /*
>          * Convert the client-provided (offset, count) range to a
> @@ -1240,7 +1225,6 @@ nfsd_commit(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
>                         end = offset + count - 1;
>         }
>
> -       nn = net_generic(nf->nf_net, nfsd_net_id);
>         if (EX_ISSYNC(fhp->fh_export)) {
>                 errseq_t since = READ_ONCE(nf->nf_file->f_wb_err);
>                 int err2;
> @@ -1248,7 +1232,7 @@ nfsd_commit(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
>                 err2 = vfs_fsync_range(nf->nf_file, start, end, 0);
>                 switch (err2) {
>                 case 0:
> -                       nfsd_copy_write_verifier(verf, nn);
> +                       nfsd_set_write_verifier(verf, nf);
>                         err2 = filemap_check_wb_err(nf->nf_file->f_mapping,
>                                                     since);
>                         err = nfserrno(err2);
> @@ -1257,12 +1241,10 @@ nfsd_commit(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
>                         err = nfserr_notsupp;
>                         break;
>                 default:
> -                       nfsd_reset_write_verifier(nn);
> -                       trace_nfsd_writeverf_reset(nn, rqstp, err2);
>                         err = nfserrno(err2);
>                 }
>         } else
> -               nfsd_copy_write_verifier(verf, nn);
> +               nfsd_set_write_verifier(verf, nf);
>
>         return err;
>  }
> diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> index dbdfef7ae85b..e69f304407ae 100644
> --- a/fs/nfsd/vfs.h
> +++ b/fs/nfsd/vfs.h
> @@ -176,4 +176,5 @@ static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
>                                     AT_STATX_SYNC_AS_STAT));
>  }
>
> +void nfsd_set_write_verifier(__be32 verf[2], struct nfsd_file *nf);
>  #endif /* LINUX_NFSD_VFS_H */
> --
> 2.39.1
>
>
Trond Myklebust Feb. 14, 2023, 3:28 a.m. UTC | #2
On Mon, 2023-02-13 at 16:49 -0800, Rick Macklem wrote:
> On Mon, Feb 13, 2023 at 1:14 PM Jeff Layton <jlayton@kernel.org>
> wrote:
> > 
> > CAUTION: This email originated from outside of the University of
> > Guelph. Do not click links or open attachments unless you recognize
> > the sender and know the content is safe. If in doubt, forward
> > suspicious emails to IThelp@uoguelph.ca
> > 
> > 
> > The write verifier exists to tell the client when the server may
> > have
> > forgotten some unstable writes. The typical way that this happens
> > is if
> > the server crashes, but we've also extended nfsd to change it when
> > there
> > are writeback errors as well.
> > 
> > The way it works today though, we call something like vfs_fsync
> > (e.g.
> > for a COMMIT call) and if we get back an error, we'll reset the
> > write
> > verifier.
> > 
> > This is non-optimal for a couple of reasons:
> > 
> > 1/ There could be significant delay between an error being
> > recorded and the reset. It would be ideal if the write verifier
> > were to
> > change as soon as the error was recorded.
> > 
> > 2/ It's a bit of a waste, in that if we get a writeback error on a
> > single inode, we'll end up resetting the write verifier for
> > everything,
> > even on inodes that may be fine (e.g. on a completely separate fs).
> > 
> Here's the snippet from RFC8881:
>    The final portion of the result is the field writeverf.  This
> field
>    is the write verifier and is a cookie that the client can use to
>    determine whether a server has changed instance state (e.g.,
> server
>    restart) between a call to WRITE and a subsequent call to either
>    WRITE or COMMIT.  This cookie MUST be unchanged during a single
>    instance of the NFSv4.1 server and MUST be unique between
> instances
>    of the NFSv4.1 server.  If the cookie changes, then the client
> MUST
>    assume that any data written with an UNSTABLE4 value for committed
>    and an old writeverf in the reply has been lost and will need to
> be
>    recovered.
> 
> I've always interpreted the writeverf as "per-server" and not  "per-
> file".
> Although I'll admit the above does not make that crystal clear, it
> does make
> it clear that the writeverf applies to a "server instance" and not a
> file or
> file system on the server.
> 
> The FreeBSD client assumes it is "per-server" and re-writes all
> uncommitted
> writes for the server, not just ones for the file (or file system)
> the
> writeverf is
> replied with.  (I vaguely recall Solaris does the same?)
> 
> At the very least, I think you should run this past the IETF working
> group
> (nfsv4@ietf.org) to see what they say w.r.t. the writeverf being
> "per-file" vs
> "per-server".
> 

As I recall, we've already had this discussion on the IETF NFSv4
working group mailing list:
https://mailarchive.ietf.org/arch/msg/nfsv4/99Ow2muMylXKWd9lzi9_BX2LJDY/


That's why I kept it a global in the first place.

Now note that RFC8881 does also clarify in Section 18.3.3 that:


   The server must vary the value of the write
   verifier at each server event or instantiation that may lead to a
   loss of uncommitted data.  Most commonly this occurs when the server
   is restarted; however, other events at the server may result in
   uncommitted data loss as well.

So I feel it is quite OK to use the verifier the way we do now in order
to signify that a fatal write error has occurred and that clients must
resend any data that was uncommitted.
Jeff Layton Feb. 14, 2023, 1:53 p.m. UTC | #3
On Mon, 2023-02-13 at 22:28 -0500, Trond Myklebust wrote:
> On Mon, 2023-02-13 at 16:49 -0800, Rick Macklem wrote:
> > On Mon, Feb 13, 2023 at 1:14 PM Jeff Layton <jlayton@kernel.org>
> > wrote:
> > > 
> > > CAUTION: This email originated from outside of the University of
> > > Guelph. Do not click links or open attachments unless you recognize
> > > the sender and know the content is safe. If in doubt, forward
> > > suspicious emails to IThelp@uoguelph.ca
> > > 
> > > 
> > > The write verifier exists to tell the client when the server may
> > > have
> > > forgotten some unstable writes. The typical way that this happens
> > > is if
> > > the server crashes, but we've also extended nfsd to change it when
> > > there
> > > are writeback errors as well.
> > > 
> > > The way it works today though, we call something like vfs_fsync
> > > (e.g.
> > > for a COMMIT call) and if we get back an error, we'll reset the
> > > write
> > > verifier.
> > > 
> > > This is non-optimal for a couple of reasons:
> > > 
> > > 1/ There could be significant delay between an error being
> > > recorded and the reset. It would be ideal if the write verifier
> > > were to
> > > change as soon as the error was recorded.
> > > 
> > > 2/ It's a bit of a waste, in that if we get a writeback error on a
> > > single inode, we'll end up resetting the write verifier for
> > > everything,
> > > even on inodes that may be fine (e.g. on a completely separate fs).
> > > 
> > Here's the snippet from RFC8881:
> >    The final portion of the result is the field writeverf.  This
> > field
> >    is the write verifier and is a cookie that the client can use to
> >    determine whether a server has changed instance state (e.g.,
> > server
> >    restart) between a call to WRITE and a subsequent call to either
> >    WRITE or COMMIT.  This cookie MUST be unchanged during a single
> >    instance of the NFSv4.1 server and MUST be unique between
> > instances
> >    of the NFSv4.1 server.  If the cookie changes, then the client
> > MUST
> >    assume that any data written with an UNSTABLE4 value for committed
> >    and an old writeverf in the reply has been lost and will need to
> > be
> >    recovered.
> > 
> > I've always interpreted the writeverf as "per-server" and not  "per-
> > file".
> > Although I'll admit the above does not make that crystal clear, it
> > does make
> > it clear that the writeverf applies to a "server instance" and not a
> > file or
> > file system on the server.
> > 
> > The FreeBSD client assumes it is "per-server" and re-writes all
> > uncommitted
> > writes for the server, not just ones for the file (or file system)
> > the
> > writeverf is
> > replied with.  (I vaguely recall Solaris does the same?)
> > 
> > At the very least, I think you should run this past the IETF working
> > group
> > (nfsv4@ietf.org) to see what they say w.r.t. the writeverf being
> > "per-file" vs
> > "per-server".
> > 
> 
> As I recall, we've already had this discussion on the IETF NFSv4
> working group mailing list:
> https://mailarchive.ietf.org/arch/msg/nfsv4/99Ow2muMylXKWd9lzi9_BX2LJDY/
> 
> 
> That's why I kept it a global in the first place.
> 
> Now note that RFC8881 does also clarify in Section 18.3.3 that:
> 
> 
>    The server must vary the value of the write
>    verifier at each server event or instantiation that may lead to a
>    loss of uncommitted data.  Most commonly this occurs when the server
>    is restarted; however, other events at the server may result in
>    uncommitted data loss as well.
> 
> So I feel it is quite OK to use the verifier the way we do now in order
> to signify that a fatal write error has occurred and that clients must
> resend any data that was uncommitted.
> 

Thanks, I missed that discussion. I think you guys have convinced me
that we have to keep this per-server. I won't bother starting a new
thread on it.

It's a pity. It would have been a lot more elegant as a per-inode thing!

Chuck, I think that means we'll just want to keep patch #1 in this 
series?

Thanks,
Chuck Lever Feb. 14, 2023, 2:58 p.m. UTC | #4
> On Feb 14, 2023, at 8:53 AM, Jeff Layton <jlayton@kernel.org> wrote:
> 
> On Mon, 2023-02-13 at 22:28 -0500, Trond Myklebust wrote:
>> On Mon, 2023-02-13 at 16:49 -0800, Rick Macklem wrote:
>>> On Mon, Feb 13, 2023 at 1:14 PM Jeff Layton <jlayton@kernel.org>
>>> wrote:
>>>> 
>>>> CAUTION: This email originated from outside of the University of
>>>> Guelph. Do not click links or open attachments unless you recognize
>>>> the sender and know the content is safe. If in doubt, forward
>>>> suspicious emails to IThelp@uoguelph.ca
>>>> 
>>>> 
>>>> The write verifier exists to tell the client when the server may
>>>> have
>>>> forgotten some unstable writes. The typical way that this happens
>>>> is if
>>>> the server crashes, but we've also extended nfsd to change it when
>>>> there
>>>> are writeback errors as well.
>>>> 
>>>> The way it works today though, we call something like vfs_fsync
>>>> (e.g.
>>>> for a COMMIT call) and if we get back an error, we'll reset the
>>>> write
>>>> verifier.
>>>> 
>>>> This is non-optimal for a couple of reasons:
>>>> 
>>>> 1/ There could be significant delay between an error being
>>>> recorded and the reset. It would be ideal if the write verifier
>>>> were to
>>>> change as soon as the error was recorded.
>>>> 
>>>> 2/ It's a bit of a waste, in that if we get a writeback error on a
>>>> single inode, we'll end up resetting the write verifier for
>>>> everything,
>>>> even on inodes that may be fine (e.g. on a completely separate fs).
>>>> 
>>> Here's the snippet from RFC8881:
>>>    The final portion of the result is the field writeverf.  This
>>> field
>>>    is the write verifier and is a cookie that the client can use to
>>>    determine whether a server has changed instance state (e.g.,
>>> server
>>>    restart) between a call to WRITE and a subsequent call to either
>>>    WRITE or COMMIT.  This cookie MUST be unchanged during a single
>>>    instance of the NFSv4.1 server and MUST be unique between
>>> instances
>>>    of the NFSv4.1 server.  If the cookie changes, then the client
>>> MUST
>>>    assume that any data written with an UNSTABLE4 value for committed
>>>    and an old writeverf in the reply has been lost and will need to
>>> be
>>>    recovered.
>>> 
>>> I've always interpreted the writeverf as "per-server" and not  "per-
>>> file".
>>> Although I'll admit the above does not make that crystal clear, it
>>> does make
>>> it clear that the writeverf applies to a "server instance" and not a
>>> file or
>>> file system on the server.
>>> 
>>> The FreeBSD client assumes it is "per-server" and re-writes all
>>> uncommitted
>>> writes for the server, not just ones for the file (or file system)
>>> the
>>> writeverf is
>>> replied with.  (I vaguely recall Solaris does the same?)
>>> 
>>> At the very least, I think you should run this past the IETF working
>>> group
>>> (nfsv4@ietf.org) to see what they say w.r.t. the writeverf being
>>> "per-file" vs
>>> "per-server".
>>> 
>> 
>> As I recall, we've already had this discussion on the IETF NFSv4
>> working group mailing list:
>> https://mailarchive.ietf.org/arch/msg/nfsv4/99Ow2muMylXKWd9lzi9_BX2LJDY/
>> 
>> 
>> That's why I kept it a global in the first place.
>> 
>> Now note that RFC8881 does also clarify in Section 18.3.3 that:
>> 
>> 
>>   The server must vary the value of the write
>>   verifier at each server event or instantiation that may lead to a
>>   loss of uncommitted data.  Most commonly this occurs when the server
>>   is restarted; however, other events at the server may result in
>>   uncommitted data loss as well.
>> 
>> So I feel it is quite OK to use the verifier the way we do now in order
>> to signify that a fatal write error has occurred and that clients must
>> resend any data that was uncommitted.
>> 
> 
> Thanks, I missed that discussion. I think you guys have convinced me
> that we have to keep this per-server. I won't bother starting a new
> thread on it.
> 
> It's a pity. It would have been a lot more elegant as a per-inode thing!
> 
> Chuck, I think that means we'll just want to keep patch #1 in this 
> series?

Regarding patch 1/3:

"sizeof(verf)" works as well as "sizeof(*verf) * 2" and is a little
more clear to boot. You can redrive a v2 of your patch or I can make
one. Up to you.


--
Chuck Lever
Jeff Layton Feb. 14, 2023, 3:01 p.m. UTC | #5
On Tue, 2023-02-14 at 14:58 +0000, Chuck Lever III wrote:
> 
> > On Feb 14, 2023, at 8:53 AM, Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Mon, 2023-02-13 at 22:28 -0500, Trond Myklebust wrote:
> > > On Mon, 2023-02-13 at 16:49 -0800, Rick Macklem wrote:
> > > > On Mon, Feb 13, 2023 at 1:14 PM Jeff Layton <jlayton@kernel.org>
> > > > wrote:
> > > > > 
> > > > > CAUTION: This email originated from outside of the University of
> > > > > Guelph. Do not click links or open attachments unless you recognize
> > > > > the sender and know the content is safe. If in doubt, forward
> > > > > suspicious emails to IThelp@uoguelph.ca
> > > > > 
> > > > > 
> > > > > The write verifier exists to tell the client when the server may
> > > > > have
> > > > > forgotten some unstable writes. The typical way that this happens
> > > > > is if
> > > > > the server crashes, but we've also extended nfsd to change it when
> > > > > there
> > > > > are writeback errors as well.
> > > > > 
> > > > > The way it works today though, we call something like vfs_fsync
> > > > > (e.g.
> > > > > for a COMMIT call) and if we get back an error, we'll reset the
> > > > > write
> > > > > verifier.
> > > > > 
> > > > > This is non-optimal for a couple of reasons:
> > > > > 
> > > > > 1/ There could be significant delay between an error being
> > > > > recorded and the reset. It would be ideal if the write verifier
> > > > > were to
> > > > > change as soon as the error was recorded.
> > > > > 
> > > > > 2/ It's a bit of a waste, in that if we get a writeback error on a
> > > > > single inode, we'll end up resetting the write verifier for
> > > > > everything,
> > > > > even on inodes that may be fine (e.g. on a completely separate fs).
> > > > > 
> > > > Here's the snippet from RFC8881:
> > > >    The final portion of the result is the field writeverf.  This
> > > > field
> > > >    is the write verifier and is a cookie that the client can use to
> > > >    determine whether a server has changed instance state (e.g.,
> > > > server
> > > >    restart) between a call to WRITE and a subsequent call to either
> > > >    WRITE or COMMIT.  This cookie MUST be unchanged during a single
> > > >    instance of the NFSv4.1 server and MUST be unique between
> > > > instances
> > > >    of the NFSv4.1 server.  If the cookie changes, then the client
> > > > MUST
> > > >    assume that any data written with an UNSTABLE4 value for committed
> > > >    and an old writeverf in the reply has been lost and will need to
> > > > be
> > > >    recovered.
> > > > 
> > > > I've always interpreted the writeverf as "per-server" and not  "per-
> > > > file".
> > > > Although I'll admit the above does not make that crystal clear, it
> > > > does make
> > > > it clear that the writeverf applies to a "server instance" and not a
> > > > file or
> > > > file system on the server.
> > > > 
> > > > The FreeBSD client assumes it is "per-server" and re-writes all
> > > > uncommitted
> > > > writes for the server, not just ones for the file (or file system)
> > > > the
> > > > writeverf is
> > > > replied with.  (I vaguely recall Solaris does the same?)
> > > > 
> > > > At the very least, I think you should run this past the IETF working
> > > > group
> > > > (nfsv4@ietf.org) to see what they say w.r.t. the writeverf being
> > > > "per-file" vs
> > > > "per-server".
> > > > 
> > > 
> > > As I recall, we've already had this discussion on the IETF NFSv4
> > > working group mailing list:
> > > https://mailarchive.ietf.org/arch/msg/nfsv4/99Ow2muMylXKWd9lzi9_BX2LJDY/
> > > 
> > > 
> > > That's why I kept it a global in the first place.
> > > 
> > > Now note that RFC8881 does also clarify in Section 18.3.3 that:
> > > 
> > > 
> > >   The server must vary the value of the write
> > >   verifier at each server event or instantiation that may lead to a
> > >   loss of uncommitted data.  Most commonly this occurs when the server
> > >   is restarted; however, other events at the server may result in
> > >   uncommitted data loss as well.
> > > 
> > > So I feel it is quite OK to use the verifier the way we do now in order
> > > to signify that a fatal write error has occurred and that clients must
> > > resend any data that was uncommitted.
> > > 
> > 
> > Thanks, I missed that discussion. I think you guys have convinced me
> > that we have to keep this per-server. I won't bother starting a new
> > thread on it.
> > 
> > It's a pity. It would have been a lot more elegant as a per-inode thing!
> > 
> > Chuck, I think that means we'll just want to keep patch #1 in this 
> > series?
> 
> Regarding patch 1/3:
> 
> "sizeof(verf)" works as well as "sizeof(*verf) * 2" and is a little
> more clear to boot. You can redrive a v2 of your patch or I can make
> one. Up to you.
> 

That sounds fine to me. Go for it!
Rick Macklem Feb. 14, 2023, 10:57 p.m. UTC | #6
On Tue, Feb 14, 2023 at 5:53 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Mon, 2023-02-13 at 22:28 -0500, Trond Myklebust wrote:
> > On Mon, 2023-02-13 at 16:49 -0800, Rick Macklem wrote:
> > > On Mon, Feb 13, 2023 at 1:14 PM Jeff Layton <jlayton@kernel.org>
> > > wrote:
> > > >
> > > > CAUTION: This email originated from outside of the University of
> > > > Guelph. Do not click links or open attachments unless you recognize
> > > > the sender and know the content is safe. If in doubt, forward
> > > > suspicious emails to IThelp@uoguelph.ca
> > > >
> > > >
> > > > The write verifier exists to tell the client when the server may
> > > > have
> > > > forgotten some unstable writes. The typical way that this happens
> > > > is if
> > > > the server crashes, but we've also extended nfsd to change it when
> > > > there
> > > > are writeback errors as well.
> > > >
> > > > The way it works today though, we call something like vfs_fsync
> > > > (e.g.
> > > > for a COMMIT call) and if we get back an error, we'll reset the
> > > > write
> > > > verifier.
> > > >
> > > > This is non-optimal for a couple of reasons:
> > > >
> > > > 1/ There could be significant delay between an error being
> > > > recorded and the reset. It would be ideal if the write verifier
> > > > were to
> > > > change as soon as the error was recorded.
> > > >
> > > > 2/ It's a bit of a waste, in that if we get a writeback error on a
> > > > single inode, we'll end up resetting the write verifier for
> > > > everything,
> > > > even on inodes that may be fine (e.g. on a completely separate fs).
> > > >
> > > Here's the snippet from RFC8881:
> > >    The final portion of the result is the field writeverf.  This
> > > field
> > >    is the write verifier and is a cookie that the client can use to
> > >    determine whether a server has changed instance state (e.g.,
> > > server
> > >    restart) between a call to WRITE and a subsequent call to either
> > >    WRITE or COMMIT.  This cookie MUST be unchanged during a single
> > >    instance of the NFSv4.1 server and MUST be unique between
> > > instances
> > >    of the NFSv4.1 server.  If the cookie changes, then the client
> > > MUST
> > >    assume that any data written with an UNSTABLE4 value for committed
> > >    and an old writeverf in the reply has been lost and will need to
> > > be
> > >    recovered.
> > >
> > > I've always interpreted the writeverf as "per-server" and not  "per-
> > > file".
> > > Although I'll admit the above does not make that crystal clear, it
> > > does make
> > > it clear that the writeverf applies to a "server instance" and not a
> > > file or
> > > file system on the server.
> > >
> > > The FreeBSD client assumes it is "per-server" and re-writes all
> > > uncommitted
> > > writes for the server, not just ones for the file (or file system)
> > > the
> > > writeverf is
> > > replied with.  (I vaguely recall Solaris does the same?)
> > >
> > > At the very least, I think you should run this past the IETF working
> > > group
> > > (nfsv4@ietf.org) to see what they say w.r.t. the writeverf being
> > > "per-file" vs
> > > "per-server".
> > >
> >
> > As I recall, we've already had this discussion on the IETF NFSv4
> > working group mailing list:
> > https://mailarchive.ietf.org/arch/msg/nfsv4/99Ow2muMylXKWd9lzi9_BX2LJDY/
> >
> >
> > That's why I kept it a global in the first place.
> >
> > Now note that RFC8881 does also clarify in Section 18.3.3 that:
> >
> >
> >    The server must vary the value of the write
> >    verifier at each server event or instantiation that may lead to a
> >    loss of uncommitted data.  Most commonly this occurs when the server
> >    is restarted; however, other events at the server may result in
> >    uncommitted data loss as well.
> >
> > So I feel it is quite OK to use the verifier the way we do now in order
> > to signify that a fatal write error has occurred and that clients must
> > resend any data that was uncommitted.
> >
>
> Thanks, I missed that discussion. I think you guys have convinced me
> that we have to keep this per-server. I won't bother starting a new
> thread on it.
>
> It's a pity. It would have been a lot more elegant as a per-inode thing!
>
If you think it is worth the effort, you could propose an extension to
4.2. Something like Write_plus, Commit_plus operations.

rick

> Chuck, I think that means we'll just want to keep patch #1 in this
> series?
>
> Thanks,
> --
> Jeff Layton <jlayton@kernel.org>
Jeff Layton Feb. 14, 2023, 11:16 p.m. UTC | #7
On Tue, 2023-02-14 at 14:57 -0800, Rick Macklem wrote:
> On Tue, Feb 14, 2023 at 5:53 AM Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Mon, 2023-02-13 at 22:28 -0500, Trond Myklebust wrote:
> > > On Mon, 2023-02-13 at 16:49 -0800, Rick Macklem wrote:
> > > > On Mon, Feb 13, 2023 at 1:14 PM Jeff Layton <jlayton@kernel.org>
> > > > wrote:
> > > > > 
> > > > > CAUTION: This email originated from outside of the University of
> > > > > Guelph. Do not click links or open attachments unless you recognize
> > > > > the sender and know the content is safe. If in doubt, forward
> > > > > suspicious emails to IThelp@uoguelph.ca
> > > > > 
> > > > > 
> > > > > The write verifier exists to tell the client when the server may
> > > > > have
> > > > > forgotten some unstable writes. The typical way that this happens
> > > > > is if
> > > > > the server crashes, but we've also extended nfsd to change it when
> > > > > there
> > > > > are writeback errors as well.
> > > > > 
> > > > > The way it works today though, we call something like vfs_fsync
> > > > > (e.g.
> > > > > for a COMMIT call) and if we get back an error, we'll reset the
> > > > > write
> > > > > verifier.
> > > > > 
> > > > > This is non-optimal for a couple of reasons:
> > > > > 
> > > > > 1/ There could be significant delay between an error being
> > > > > recorded and the reset. It would be ideal if the write verifier
> > > > > were to
> > > > > change as soon as the error was recorded.
> > > > > 
> > > > > 2/ It's a bit of a waste, in that if we get a writeback error on a
> > > > > single inode, we'll end up resetting the write verifier for
> > > > > everything,
> > > > > even on inodes that may be fine (e.g. on a completely separate fs).
> > > > > 
> > > > Here's the snippet from RFC8881:
> > > >    The final portion of the result is the field writeverf.  This
> > > > field
> > > >    is the write verifier and is a cookie that the client can use to
> > > >    determine whether a server has changed instance state (e.g.,
> > > > server
> > > >    restart) between a call to WRITE and a subsequent call to either
> > > >    WRITE or COMMIT.  This cookie MUST be unchanged during a single
> > > >    instance of the NFSv4.1 server and MUST be unique between
> > > > instances
> > > >    of the NFSv4.1 server.  If the cookie changes, then the client
> > > > MUST
> > > >    assume that any data written with an UNSTABLE4 value for committed
> > > >    and an old writeverf in the reply has been lost and will need to
> > > > be
> > > >    recovered.
> > > > 
> > > > I've always interpreted the writeverf as "per-server" and not  "per-
> > > > file".
> > > > Although I'll admit the above does not make that crystal clear, it
> > > > does make
> > > > it clear that the writeverf applies to a "server instance" and not a
> > > > file or
> > > > file system on the server.
> > > > 
> > > > The FreeBSD client assumes it is "per-server" and re-writes all
> > > > uncommitted
> > > > writes for the server, not just ones for the file (or file system)
> > > > the
> > > > writeverf is
> > > > replied with.  (I vaguely recall Solaris does the same?)
> > > > 
> > > > At the very least, I think you should run this past the IETF working
> > > > group
> > > > (nfsv4@ietf.org) to see what they say w.r.t. the writeverf being
> > > > "per-file" vs
> > > > "per-server".
> > > > 
> > > 
> > > As I recall, we've already had this discussion on the IETF NFSv4
> > > working group mailing list:
> > > https://mailarchive.ietf.org/arch/msg/nfsv4/99Ow2muMylXKWd9lzi9_BX2LJDY/
> > > 
> > > 
> > > That's why I kept it a global in the first place.
> > > 
> > > Now note that RFC8881 does also clarify in Section 18.3.3 that:
> > > 
> > > 
> > >    The server must vary the value of the write
> > >    verifier at each server event or instantiation that may lead to a
> > >    loss of uncommitted data.  Most commonly this occurs when the server
> > >    is restarted; however, other events at the server may result in
> > >    uncommitted data loss as well.
> > > 
> > > So I feel it is quite OK to use the verifier the way we do now in order
> > > to signify that a fatal write error has occurred and that clients must
> > > resend any data that was uncommitted.
> > > 
> > 
> > Thanks, I missed that discussion. I think you guys have convinced me
> > that we have to keep this per-server. I won't bother starting a new
> > thread on it.
> > 
> > It's a pity. It would have been a lot more elegant as a per-inode thing!
> > 
> If you think it is worth the effort, you could propose an extension to
> 4.2. Something like Write_plus, Commit_plus operations.
> 

I considered that, but I don't think it really helps. We'd have to bump
the verifier on a per-server basis anyway to keep up backward
compatibility. I think we're stuck unless we wanted to make a break with
the past.

> 
> > Chuck, I think that means we'll just want to keep patch #1 in this
> > series?
> > 
> > Thanks,
> > --
> > Jeff Layton <jlayton@kernel.org>
Trond Myklebust Feb. 14, 2023, 11:34 p.m. UTC | #8
On Tue, 2023-02-14 at 14:57 -0800, Rick Macklem wrote:
> On Tue, Feb 14, 2023 at 5:53 AM Jeff Layton <jlayton@kernel.org>
> wrote:
> > 
> > On Mon, 2023-02-13 at 22:28 -0500, Trond Myklebust wrote:
> > > On Mon, 2023-02-13 at 16:49 -0800, Rick Macklem wrote:
> > > > On Mon, Feb 13, 2023 at 1:14 PM Jeff Layton
> > > > <jlayton@kernel.org>
> > > > wrote:
> > > > > 
> > > > > CAUTION: This email originated from outside of the University
> > > > > of
> > > > > Guelph. Do not click links or open attachments unless you
> > > > > recognize
> > > > > the sender and know the content is safe. If in doubt, forward
> > > > > suspicious emails to IThelp@uoguelph.ca
> > > > > 
> > > > > 
> > > > > The write verifier exists to tell the client when the server
> > > > > may
> > > > > have
> > > > > forgotten some unstable writes. The typical way that this
> > > > > happens
> > > > > is if
> > > > > the server crashes, but we've also extended nfsd to change it
> > > > > when
> > > > > there
> > > > > are writeback errors as well.
> > > > > 
> > > > > The way it works today though, we call something like
> > > > > vfs_fsync
> > > > > (e.g.
> > > > > for a COMMIT call) and if we get back an error, we'll reset
> > > > > the
> > > > > write
> > > > > verifier.
> > > > > 
> > > > > This is non-optimal for a couple of reasons:
> > > > > 
> > > > > 1/ There could be significant delay between an error being
> > > > > recorded and the reset. It would be ideal if the write
> > > > > verifier
> > > > > were to
> > > > > change as soon as the error was recorded.
> > > > > 
> > > > > 2/ It's a bit of a waste, in that if we get a writeback error
> > > > > on a
> > > > > single inode, we'll end up resetting the write verifier for
> > > > > everything,
> > > > > even on inodes that may be fine (e.g. on a completely
> > > > > separate fs).
> > > > > 
> > > > Here's the snippet from RFC8881:
> > > >    The final portion of the result is the field writeverf. 
> > > > This
> > > > field
> > > >    is the write verifier and is a cookie that the client can
> > > > use to
> > > >    determine whether a server has changed instance state (e.g.,
> > > > server
> > > >    restart) between a call to WRITE and a subsequent call to
> > > > either
> > > >    WRITE or COMMIT.  This cookie MUST be unchanged during a
> > > > single
> > > >    instance of the NFSv4.1 server and MUST be unique between
> > > > instances
> > > >    of the NFSv4.1 server.  If the cookie changes, then the
> > > > client
> > > > MUST
> > > >    assume that any data written with an UNSTABLE4 value for
> > > > committed
> > > >    and an old writeverf in the reply has been lost and will
> > > > need to
> > > > be
> > > >    recovered.
> > > > 
> > > > I've always interpreted the writeverf as "per-server" and not 
> > > > "per-
> > > > file".
> > > > Although I'll admit the above does not make that crystal clear,
> > > > it
> > > > does make
> > > > it clear that the writeverf applies to a "server instance" and
> > > > not a
> > > > file or
> > > > file system on the server.
> > > > 
> > > > The FreeBSD client assumes it is "per-server" and re-writes all
> > > > uncommitted
> > > > writes for the server, not just ones for the file (or file
> > > > system)
> > > > the
> > > > writeverf is
> > > > replied with.  (I vaguely recall Solaris does the same?)
> > > > 
> > > > At the very least, I think you should run this past the IETF
> > > > working
> > > > group
> > > > (nfsv4@ietf.org) to see what they say w.r.t. the writeverf
> > > > being
> > > > "per-file" vs
> > > > "per-server".
> > > > 
> > > 
> > > As I recall, we've already had this discussion on the IETF NFSv4
> > > working group mailing list:
> > > https://mailarchive.ietf.org/arch/msg/nfsv4/99Ow2muMylXKWd9lzi9_BX2LJDY/
> > > 
> > > 
> > > That's why I kept it a global in the first place.
> > > 
> > > Now note that RFC8881 does also clarify in Section 18.3.3 that:
> > > 
> > > 
> > >    The server must vary the value of the write
> > >    verifier at each server event or instantiation that may lead
> > > to a
> > >    loss of uncommitted data.  Most commonly this occurs when the
> > > server
> > >    is restarted; however, other events at the server may result
> > > in
> > >    uncommitted data loss as well.
> > > 
> > > So I feel it is quite OK to use the verifier the way we do now in
> > > order
> > > to signify that a fatal write error has occurred and that clients
> > > must
> > > resend any data that was uncommitted.
> > > 
> > 
> > Thanks, I missed that discussion. I think you guys have convinced
> > me
> > that we have to keep this per-server. I won't bother starting a new
> > thread on it.
> > 
> > It's a pity. It would have been a lot more elegant as a per-inode
> > thing!
> > 
> If you think it is worth the effort, you could propose an extension
> to
> 4.2. Something like Write_plus, Commit_plus operations.
> 
> rick
> 
OK. Apparently some further clarification is required here.

The main reason for needing to bump the boot verifier on an error is
NFSv3. Unlike NFSv4, the NFSv3 clients are stateless, and so error
propagation using Jeff's "errseq" mechanism is inherently flawed
because it is ultimately caching the I/O errors in a file descriptor.
The fact that the file cache garbage collector can close these NFSv3
file descriptors at any time without any possibility of coordination
with the clients, and causing loss of the "errseq" cached data, means
that we must have an alternative mechanism to propagate error state for
unstable writes. Hence the use of the boot verifier.

NFSv4 does not have these limitations. The clients are required to use
stateful OPEN/CLOSE/DELEGRETURN/... operations in order to signify to
the NFSv4 server when file descriptors need to be kept open, and hence
"errseq" data needs to be preserved. The only cases where that
assumption is violated are that of a network partition and server
reboot, where it is obvious to all parties that information has been
lost.
Note: The Linux NFSv4 client does not currently assume that information
might be lost during network partition, so we might want to look into
fixing that. However it does obviously recover safely during a server
reboot thanks to the boot verifier mechanism.

IOW: there should be no need for a change in NFSv4 semantics in order
to address this issue. The problem is, as usual, mostly limited to
NFSv3.
diff mbox series

Patch

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 3b9a10378c83..1ca9ad0aabcd 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -233,23 +233,6 @@  nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
 	return nf;
 }
 
-/**
- * nfsd_file_check_write_error - check for writeback errors on a file
- * @nf: nfsd_file to check for writeback errors
- *
- * Check whether a nfsd_file has an unseen error. Reset the write
- * verifier if so.
- */
-static void
-nfsd_file_check_write_error(struct nfsd_file *nf)
-{
-	struct file *file = nf->nf_file;
-
-	if ((file->f_mode & FMODE_WRITE) &&
-	    filemap_check_wb_err(file->f_mapping, READ_ONCE(file->f_wb_err)))
-		nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
-}
-
 static void
 nfsd_file_hash_remove(struct nfsd_file *nf)
 {
@@ -281,10 +264,8 @@  nfsd_file_free(struct nfsd_file *nf)
 	nfsd_file_unhash(nf);
 	if (nf->nf_mark)
 		nfsd_file_mark_put(nf->nf_mark);
-	if (nf->nf_file) {
-		nfsd_file_check_write_error(nf);
+	if (nf->nf_file)
 		filp_close(nf->nf_file, NULL);
-	}
 
 	/*
 	 * If this item is still linked via nf_lru, that's a bug.
@@ -1038,7 +1019,6 @@  nfsd_file_do_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
 out:
 	if (status == nfs_ok) {
 		this_cpu_inc(nfsd_file_acquisitions);
-		nfsd_file_check_write_error(nf);
 		*pnf = nf;
 	}
 	put_cred(cred);
diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h
index 51a4b7885cae..95d7adb73b24 100644
--- a/fs/nfsd/netns.h
+++ b/fs/nfsd/netns.h
@@ -109,7 +109,6 @@  struct nfsd_net {
 	bool nfsd_net_up;
 	bool lockd_up;
 
-	seqlock_t writeverf_lock;
 	unsigned char writeverf[8];
 
 	/*
@@ -204,7 +203,4 @@  struct nfsd_net {
 extern void nfsd_netns_free_versions(struct nfsd_net *nn);
 
 extern unsigned int nfsd_net_id;
-
-void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn);
-void nfsd_reset_write_verifier(struct nfsd_net *nn);
 #endif /* __NFSD_NETNS_H__ */
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 5ae670807449..d63648d79f17 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -717,15 +717,6 @@  nfsd4_access(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 			   &access->ac_supported);
 }
 
-static void gen_boot_verifier(nfs4_verifier *verifier, struct net *net)
-{
-	__be32 *verf = (__be32 *)verifier->data;
-
-	BUILD_BUG_ON(2*sizeof(*verf) != sizeof(verifier->data));
-
-	nfsd_copy_write_verifier(verf, net_generic(net, nfsd_net_id));
-}
-
 static __be32
 nfsd4_commit(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	     union nfsd4_op_u *u)
@@ -1586,11 +1577,17 @@  static const struct nfsd4_callback_ops nfsd4_cb_offload_ops = {
 
 static void nfsd4_init_copy_res(struct nfsd4_copy *copy, bool sync)
 {
+	__be32 *verf;
+	nfs4_verifier *verifier = &copy->cp_res.wr_verifier;
+
+	BUILD_BUG_ON(2*sizeof(*verf) != sizeof(verifier->data));
+
 	copy->cp_res.wr_stable_how =
 		test_bit(NFSD4_COPY_F_COMMITTED, &copy->cp_flags) ?
 			NFS_FILE_SYNC : NFS_UNSTABLE;
 	nfsd4_copy_set_sync(copy, sync);
-	gen_boot_verifier(&copy->cp_res.wr_verifier, copy->cp_clp->net);
+	verf = (__be32 *)verifier->data;
+	nfsd_set_write_verifier(verf, copy->nf_dst);
 }
 
 static ssize_t _nfsd_copy_file_range(struct nfsd4_copy *copy,
diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index 04474b8ccf0a..28a73577beaa 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -1501,7 +1501,6 @@  static __net_init int nfsd_init_net(struct net *net)
 	nn->nfsd4_minorversions = NULL;
 	nfsd4_init_leases_net(nn);
 	get_random_bytes(&nn->siphash_key, sizeof(nn->siphash_key));
-	seqlock_init(&nn->writeverf_lock);
 
 	return 0;
 
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 3a38ab304b02..8b09bfc1fae9 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -350,25 +350,27 @@  static bool nfsd_needs_lockd(struct nfsd_net *nn)
 }
 
 /**
- * nfsd_copy_write_verifier - Atomically copy a write verifier
+ * nfsd_set_write_verifier - set the write verifier for a call
  * @verf: buffer in which to receive the verifier cookie
- * @nn: NFS net namespace
+ * @nf: nfsd_file being operated on for write
  *
- * This function provides a wait-free mechanism for copying the
- * namespace's write verifier without tearing it.
+ * Grab the (static) write verifier for the nfsd_net, and then fold
+ * the current errseq_t value into it for the inode to create a
+ * write verifier.
  */
-void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
+void nfsd_set_write_verifier(__be32 verf[2], struct nfsd_file *nf)
 {
-	int seq = 0;
+	struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id);
+	errseq_t eseq = errseq_fetch(&nf->nf_file->f_mapping->wb_err);
 
-	do {
-		read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
-		memcpy(verf, nn->writeverf, sizeof(*verf) * 2);
-	} while (need_seqretry(&nn->writeverf_lock, seq));
-	done_seqretry(&nn->writeverf_lock, seq);
+	/* copy in the per-net write verifier */
+	memcpy(verf, nn->writeverf, sizeof(*verf) * 2);
+
+	/* fold the errseq into first word */
+	verf[0] ^= (__force __be32)eseq;
 }
 
-static void nfsd_reset_write_verifier_locked(struct nfsd_net *nn)
+static void nfsd_init_write_verifier(struct nfsd_net *nn)
 {
 	struct timespec64 now;
 	u64 verf;
@@ -382,26 +384,6 @@  static void nfsd_reset_write_verifier_locked(struct nfsd_net *nn)
 	memcpy(nn->writeverf, &verf, sizeof(nn->writeverf));
 }
 
-/**
- * nfsd_reset_write_verifier - Generate a new write verifier
- * @nn: NFS net namespace
- *
- * This function updates the ->writeverf field of @nn. This field
- * contains an opaque cookie that, according to Section 18.32.3 of
- * RFC 8881, "the client can use to determine whether a server has
- * changed instance state (e.g., server restart) between a call to
- * WRITE and a subsequent call to either WRITE or COMMIT.  This
- * cookie MUST be unchanged during a single instance of the NFSv4.1
- * server and MUST be unique between instances of the NFSv4.1
- * server."
- */
-void nfsd_reset_write_verifier(struct nfsd_net *nn)
-{
-	write_seqlock(&nn->writeverf_lock);
-	nfsd_reset_write_verifier_locked(nn);
-	write_sequnlock(&nn->writeverf_lock);
-}
-
 static int nfsd_startup_net(struct net *net, const struct cred *cred)
 {
 	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
@@ -687,7 +669,7 @@  int nfsd_create_serv(struct net *net)
 		register_inet6addr_notifier(&nfsd_inet6addr_notifier);
 #endif
 	}
-	nfsd_reset_write_verifier(nn);
+	nfsd_init_write_verifier(nn);
 	return 0;
 }
 
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index 4183819ea082..93d76fe33514 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -744,34 +744,6 @@  DEFINE_EVENT(nfsd_net_class, nfsd_##name, \
 DEFINE_NET_EVENT(grace_start);
 DEFINE_NET_EVENT(grace_complete);
 
-TRACE_EVENT(nfsd_writeverf_reset,
-	TP_PROTO(
-		const struct nfsd_net *nn,
-		const struct svc_rqst *rqstp,
-		int error
-	),
-	TP_ARGS(nn, rqstp, error),
-	TP_STRUCT__entry(
-		__field(unsigned long long, boot_time)
-		__field(u32, xid)
-		__field(int, error)
-		__array(unsigned char, verifier, NFS4_VERIFIER_SIZE)
-	),
-	TP_fast_assign(
-		__entry->boot_time = nn->boot_time;
-		__entry->xid = be32_to_cpu(rqstp->rq_xid);
-		__entry->error = error;
-
-		/* avoid seqlock inside TP_fast_assign */
-		memcpy(__entry->verifier, nn->writeverf,
-		       NFS4_VERIFIER_SIZE);
-	),
-	TP_printk("boot_time=%16llx xid=0x%08x error=%d new verifier=0x%s",
-		__entry->boot_time, __entry->xid, __entry->error,
-		__print_hex_str(__entry->verifier, NFS4_VERIFIER_SIZE)
-	)
-);
-
 TRACE_EVENT(nfsd_clid_cred_mismatch,
 	TP_PROTO(
 		const struct nfs4_client *clp,
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 21d5209f6e04..6e8ba8357735 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -628,17 +628,12 @@  __be32 nfsd4_clone_file_range(struct svc_rqst *rqstp,
 		if (!status)
 			status = commit_inode_metadata(file_inode(src));
 		if (status < 0) {
-			struct nfsd_net *nn = net_generic(nf_dst->nf_net,
-							  nfsd_net_id);
-
 			trace_nfsd_clone_file_range_err(rqstp,
 					&nfsd4_get_cstate(rqstp)->save_fh,
 					src_pos,
 					&nfsd4_get_cstate(rqstp)->current_fh,
 					dst_pos,
 					count, status);
-			nfsd_reset_write_verifier(nn);
-			trace_nfsd_writeverf_reset(nn, rqstp, status);
 			ret = nfserrno(status);
 		}
 	}
@@ -1058,7 +1053,6 @@  nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
 				unsigned long *cnt, int stable,
 				__be32 *verf)
 {
-	struct nfsd_net		*nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
 	struct file		*file = nf->nf_file;
 	struct super_block	*sb = file_inode(file)->i_sb;
 	struct svc_export	*exp;
@@ -1103,13 +1097,10 @@  nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
 	iov_iter_kvec(&iter, ITER_SOURCE, vec, vlen, *cnt);
 	since = READ_ONCE(file->f_wb_err);
 	if (verf)
-		nfsd_copy_write_verifier(verf, nn);
+		nfsd_set_write_verifier(verf, nf);
 	host_err = vfs_iter_write(file, &iter, &pos, flags);
-	if (host_err < 0) {
-		nfsd_reset_write_verifier(nn);
-		trace_nfsd_writeverf_reset(nn, rqstp, host_err);
+	if (host_err < 0)
 		goto out_nfserr;
-	}
 	*cnt = host_err;
 	nfsd_stats_io_write_add(exp, *cnt);
 	fsnotify_modify(file);
@@ -1117,13 +1108,8 @@  nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
 	if (host_err < 0)
 		goto out_nfserr;
 
-	if (stable && use_wgather) {
+	if (stable && use_wgather)
 		host_err = wait_for_concurrent_writes(file);
-		if (host_err < 0) {
-			nfsd_reset_write_verifier(nn);
-			trace_nfsd_writeverf_reset(nn, rqstp, host_err);
-		}
-	}
 
 out_nfserr:
 	if (host_err >= 0) {
@@ -1223,7 +1209,6 @@  nfsd_commit(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
 	__be32			err = nfs_ok;
 	u64			maxbytes;
 	loff_t			start, end;
-	struct nfsd_net		*nn;
 
 	/*
 	 * Convert the client-provided (offset, count) range to a
@@ -1240,7 +1225,6 @@  nfsd_commit(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
 			end = offset + count - 1;
 	}
 
-	nn = net_generic(nf->nf_net, nfsd_net_id);
 	if (EX_ISSYNC(fhp->fh_export)) {
 		errseq_t since = READ_ONCE(nf->nf_file->f_wb_err);
 		int err2;
@@ -1248,7 +1232,7 @@  nfsd_commit(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
 		err2 = vfs_fsync_range(nf->nf_file, start, end, 0);
 		switch (err2) {
 		case 0:
-			nfsd_copy_write_verifier(verf, nn);
+			nfsd_set_write_verifier(verf, nf);
 			err2 = filemap_check_wb_err(nf->nf_file->f_mapping,
 						    since);
 			err = nfserrno(err2);
@@ -1257,12 +1241,10 @@  nfsd_commit(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
 			err = nfserr_notsupp;
 			break;
 		default:
-			nfsd_reset_write_verifier(nn);
-			trace_nfsd_writeverf_reset(nn, rqstp, err2);
 			err = nfserrno(err2);
 		}
 	} else
-		nfsd_copy_write_verifier(verf, nn);
+		nfsd_set_write_verifier(verf, nf);
 
 	return err;
 }
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index dbdfef7ae85b..e69f304407ae 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -176,4 +176,5 @@  static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
 				    AT_STATX_SYNC_AS_STAT));
 }
 
+void nfsd_set_write_verifier(__be32 verf[2], struct nfsd_file *nf);
 #endif /* LINUX_NFSD_VFS_H */