Message ID | 20241220154227.16873-15-cel@kernel.org (mailing list archive) |
---|---|
State | Under Review |
Headers | show |
Series | Client-side OFFLOAD_STATUS implementation | expand |
On Fri, Dec 20, 2024 at 10:46 AM <cel@kernel.org> wrote: > > From: Chuck Lever <chuck.lever@oracle.com> > > We've found that there are cases where a transport disconnection > results in the loss of callback RPCs. NFS servers typically do not > retransmit callback operations after a disconnect. > > This can be a problem for the Linux NFS client's current > implementation of asynchronous COPY, which waits indefinitely for a > CB_OFFLOAD callback. If a transport disconnect occurs while an async > COPY is running, there's a good chance the client will never get the > completing CB_OFFLOAD. > > Fix this by implementing the OFFLOAD_STATUS operation so that the > Linux NFS client can probe the NFS server if it doesn't see a > CB_OFFLOAD in a reasonable amount of time. > > This patch implements a simplistic check. As future work, the client > might also be able to detect whether there is no forward progress on > the request asynchronous COPY operation, and CANCEL it. > > Suggested-by: Olga Kornievskaia <kolga@netapp.com> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=218735 > Signed-off-by: Chuck Lever <chuck.lever@oracle.com> > --- > fs/nfs/nfs42proc.c | 70 ++++++++++++++++++++++++++++++++++++++-------- > 1 file changed, 59 insertions(+), 11 deletions(-) > > diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c > index 7fd0f2aa42d4..65cfdb5c7b02 100644 > --- a/fs/nfs/nfs42proc.c > +++ b/fs/nfs/nfs42proc.c > @@ -175,6 +175,25 @@ int nfs42_proc_deallocate(struct file *filep, loff_t offset, loff_t len) > return err; > } > > +/* Wait this long before checking progress on a COPY operation */ > +enum { > + NFS42_COPY_TIMEOUT = 3 * HZ, I'm really not a fan of such a short time out. This make the OFFLOAD_STATUS a more likely operation rather than a less likely operation to occur during a copy. OFFLOAD_STATUS and CB_OFFLOAD being concurrent operations introduce races which we have to try to account for. > +}; > + > +static void nfs4_copy_dequeue_callback(struct nfs_server *dst_server, > + struct nfs_server *src_server, > + struct nfs4_copy_state *copy) > +{ > + spin_lock(&dst_server->nfs_client->cl_lock); > + list_del_init(©->copies); > + spin_unlock(&dst_server->nfs_client->cl_lock); > + if (dst_server != src_server) { > + spin_lock(&src_server->nfs_client->cl_lock); > + list_del_init(©->src_copies); > + spin_unlock(&src_server->nfs_client->cl_lock); > + } > +} > + > static int handle_async_copy(struct nfs42_copy_res *res, > struct nfs_server *dst_server, > struct nfs_server *src_server, > @@ -184,9 +203,10 @@ static int handle_async_copy(struct nfs42_copy_res *res, > bool *restart) > { > struct nfs4_copy_state *copy, *tmp_copy = NULL, *iter; > - int status = NFS4_OK; > struct nfs_open_context *dst_ctx = nfs_file_open_context(dst); > struct nfs_open_context *src_ctx = nfs_file_open_context(src); > + int status = NFS4_OK; > + u64 copied; > > copy = kzalloc(sizeof(struct nfs4_copy_state), GFP_KERNEL); > if (!copy) > @@ -224,15 +244,12 @@ static int handle_async_copy(struct nfs42_copy_res *res, > spin_unlock(&src_server->nfs_client->cl_lock); > } > > - status = wait_for_completion_interruptible(©->completion); > - spin_lock(&dst_server->nfs_client->cl_lock); > - list_del_init(©->copies); > - spin_unlock(&dst_server->nfs_client->cl_lock); > - if (dst_server != src_server) { > - spin_lock(&src_server->nfs_client->cl_lock); > - list_del_init(©->src_copies); > - spin_unlock(&src_server->nfs_client->cl_lock); > - } > +wait: > + status = wait_for_completion_interruptible_timeout(©->completion, > + NFS42_COPY_TIMEOUT); > + if (!status) > + goto timeout; > + nfs4_copy_dequeue_callback(dst_server, src_server, copy); > if (status == -ERESTARTSYS) { > goto out_cancel; > } else if (copy->flags || copy->error == NFS4ERR_PARTNER_NO_AUTH) { > @@ -242,6 +259,7 @@ static int handle_async_copy(struct nfs42_copy_res *res, > } > out: > res->write_res.count = copy->count; > + /* Copy out the updated write verifier provided by CB_OFFLOAD. */ > memcpy(&res->write_res.verifier, ©->verf, sizeof(copy->verf)); > status = -copy->error; > > @@ -253,6 +271,36 @@ static int handle_async_copy(struct nfs42_copy_res *res, > if (!nfs42_files_from_same_server(src, dst)) > nfs42_do_offload_cancel_async(src, src_stateid); > goto out_free; > +timeout: > + status = nfs42_proc_offload_status(dst, ©->stateid, &copied); Regardless of what OFFLOAD_STATUS returned we have to check whether or not the CB_OFFLOAD had arrived while we were waiting for the reply to the OFFLOAD_STATUS. > + if (status == -EINPROGRESS) > + goto wait; > + nfs4_copy_dequeue_callback(dst_server, src_server, copy); > + switch (status) { > + case 0: > + /* The server recognized the copy stateid, so it hasn't > + * rebooted. Don't overwrite the verifier returned in the > + * COPY result. */ > + res->write_res.count = copied; > + goto out_free; In case OFFLOAD_STATUS was successful and CB_OFFLOAD was received we should take the verifier from the CB_OFFLOAD otherwise we are sending the unneeede and expensive COMMIT because OFFLOAD_STATUS carries the completion and value of copy it does not carry the "how committed" value and thus we are forced to use async copy's "how committed" value. > + case -EREMOTEIO: > + /* COPY operation failed on the server. */ > + status = -EOPNOTSUPP; > + res->write_res.count = copied; > + goto out_free; > + case -EBADF: > + /* Server did not recognize the copy stateid. It has > + * probably restarted and lost the plot. */ > + res->write_res.count = 0; > + status = -EOPNOTSUPP; > + break; This is the case of receiving a BAD_STATEID from OFFLOAD_STATUS and this would lead to copy falling back to read/write operation. IF we don't check the existence of CB_OFFLOAD reply, then OFFLOAD_STATUS can and will carry BAD_STATEID as the server is free to delete copy status after it get CB_OFFLOAD reply (which as i said we have to check). And If we did then we need take the result of the CB_OFFLOAD and not act on OFFLOAD_STATUS's error. > + case -EOPNOTSUPP: > + /* RFC 7862 REQUIREs server to support OFFLOAD_STATUS when > + * it has signed up for an async COPY, so server is not > + * spec-compliant. */ > + res->write_res.count = 0; > + } > + goto out_free; > } > > static int process_copy_commit(struct file *dst, loff_t pos_dst, > @@ -643,7 +691,7 @@ _nfs42_proc_offload_status(struct nfs_server *server, struct file *file, > * Other negative errnos indicate the client could not complete the > * request. > */ > -static int __maybe_unused > +static int > nfs42_proc_offload_status(struct file *dst, nfs4_stateid *stateid, u64 *copied) > { > struct inode *inode = file_inode(dst); > -- > 2.47.0 > >
On 12/21/24 9:36 AM, Olga Kornievskaia wrote: > On Fri, Dec 20, 2024 at 10:46 AM <cel@kernel.org> wrote: >> >> From: Chuck Lever <chuck.lever@oracle.com> >> >> We've found that there are cases where a transport disconnection >> results in the loss of callback RPCs. NFS servers typically do not >> retransmit callback operations after a disconnect. >> >> This can be a problem for the Linux NFS client's current >> implementation of asynchronous COPY, which waits indefinitely for a >> CB_OFFLOAD callback. If a transport disconnect occurs while an async >> COPY is running, there's a good chance the client will never get the >> completing CB_OFFLOAD. >> >> Fix this by implementing the OFFLOAD_STATUS operation so that the >> Linux NFS client can probe the NFS server if it doesn't see a >> CB_OFFLOAD in a reasonable amount of time. >> >> This patch implements a simplistic check. As future work, the client >> might also be able to detect whether there is no forward progress on >> the request asynchronous COPY operation, and CANCEL it. >> >> Suggested-by: Olga Kornievskaia <kolga@netapp.com> >> Link: https://bugzilla.kernel.org/show_bug.cgi?id=218735 >> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> >> --- >> fs/nfs/nfs42proc.c | 70 ++++++++++++++++++++++++++++++++++++++-------- >> 1 file changed, 59 insertions(+), 11 deletions(-) >> >> diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c >> index 7fd0f2aa42d4..65cfdb5c7b02 100644 >> --- a/fs/nfs/nfs42proc.c >> +++ b/fs/nfs/nfs42proc.c >> @@ -175,6 +175,25 @@ int nfs42_proc_deallocate(struct file *filep, loff_t offset, loff_t len) >> return err; >> } >> >> +/* Wait this long before checking progress on a COPY operation */ >> +enum { >> + NFS42_COPY_TIMEOUT = 3 * HZ, > > I'm really not a fan of such a short time out. I'm not wedded to 3 seconds. For the purpose of a proof-of-concept, it can be valuable to make such timeouts short to exacerbate races. But perhaps we are about to step out of the realm of proof-of-concept. Suggestions are welcome. > This make the > OFFLOAD_STATUS a more likely operation rather than a less likely > operation to occur during a copy. OFFLOAD_STATUS and CB_OFFLOAD being > concurrent operations introduce races which we have to try to account > for. However, we have to close these windows whether the timeout is short or long. Making a timeout longer simply to avoid races seems like a band aid to me. Back to choosing a value. For a very long running copy, every 3 seconds might indeed be overkill. We could use the advertised lease timeout (or a fraction of it) instead. Neil suggested an exponential backoff here. That feels to me like an optimization that can be made later if we find there is evidence in favor of doing so. >> +}; >> + >> +static void nfs4_copy_dequeue_callback(struct nfs_server *dst_server, >> + struct nfs_server *src_server, >> + struct nfs4_copy_state *copy) >> +{ >> + spin_lock(&dst_server->nfs_client->cl_lock); >> + list_del_init(©->copies); >> + spin_unlock(&dst_server->nfs_client->cl_lock); >> + if (dst_server != src_server) { >> + spin_lock(&src_server->nfs_client->cl_lock); >> + list_del_init(©->src_copies); >> + spin_unlock(&src_server->nfs_client->cl_lock); >> + } >> +} >> + >> static int handle_async_copy(struct nfs42_copy_res *res, >> struct nfs_server *dst_server, >> struct nfs_server *src_server, >> @@ -184,9 +203,10 @@ static int handle_async_copy(struct nfs42_copy_res *res, >> bool *restart) >> { >> struct nfs4_copy_state *copy, *tmp_copy = NULL, *iter; >> - int status = NFS4_OK; >> struct nfs_open_context *dst_ctx = nfs_file_open_context(dst); >> struct nfs_open_context *src_ctx = nfs_file_open_context(src); >> + int status = NFS4_OK; >> + u64 copied; >> >> copy = kzalloc(sizeof(struct nfs4_copy_state), GFP_KERNEL); >> if (!copy) >> @@ -224,15 +244,12 @@ static int handle_async_copy(struct nfs42_copy_res *res, >> spin_unlock(&src_server->nfs_client->cl_lock); >> } >> >> - status = wait_for_completion_interruptible(©->completion); >> - spin_lock(&dst_server->nfs_client->cl_lock); >> - list_del_init(©->copies); >> - spin_unlock(&dst_server->nfs_client->cl_lock); >> - if (dst_server != src_server) { >> - spin_lock(&src_server->nfs_client->cl_lock); >> - list_del_init(©->src_copies); >> - spin_unlock(&src_server->nfs_client->cl_lock); >> - } >> +wait: >> + status = wait_for_completion_interruptible_timeout(©->completion, >> + NFS42_COPY_TIMEOUT); >> + if (!status) >> + goto timeout; >> + nfs4_copy_dequeue_callback(dst_server, src_server, copy); >> if (status == -ERESTARTSYS) { >> goto out_cancel; >> } else if (copy->flags || copy->error == NFS4ERR_PARTNER_NO_AUTH) { >> @@ -242,6 +259,7 @@ static int handle_async_copy(struct nfs42_copy_res *res, >> } >> out: >> res->write_res.count = copy->count; >> + /* Copy out the updated write verifier provided by CB_OFFLOAD. */ >> memcpy(&res->write_res.verifier, ©->verf, sizeof(copy->verf)); >> status = -copy->error; >> >> @@ -253,6 +271,36 @@ static int handle_async_copy(struct nfs42_copy_res *res, >> if (!nfs42_files_from_same_server(src, dst)) >> nfs42_do_offload_cancel_async(src, src_stateid); >> goto out_free; >> +timeout: >> + status = nfs42_proc_offload_status(dst, ©->stateid, &copied); > > Regardless of what OFFLOAD_STATUS returned we have to check whether or > not the CB_OFFLOAD had arrived while we were waiting for the reply to > the OFFLOAD_STATUS. I think that could be done. The only reason to add that extra complexity is because CB_OFFLOAD carries a verifier and stable_how, and OFFLOAD_STATUS does not. But it seems to me this race scenario should be pretty infrequent. And I'm not yet convinced that the COPY will misbehave if the client ignores a racing CB_OFFLOAD request. It might be a little slower, is all. For me, the Minimum Viable Product for this series is that COPY should always be able to make forward progress. Are we at that point yet? Anything else is an optimization that should be made based on whether there is evidence that shows there is a substantial application-visible improvement. >> + if (status == -EINPROGRESS) >> + goto wait; >> + nfs4_copy_dequeue_callback(dst_server, src_server, copy); >> + switch (status) { >> + case 0: >> + /* The server recognized the copy stateid, so it hasn't >> + * rebooted. Don't overwrite the verifier returned in the >> + * COPY result. */ >> + res->write_res.count = copied; >> + goto out_free; > > In case OFFLOAD_STATUS was successful and CB_OFFLOAD was received we > should take the verifier from the CB_OFFLOAD otherwise we are sending > the unneeede and expensive COMMIT because OFFLOAD_STATUS carries the > completion and value of copy it does not carry the "how committed" > value and thus we are forced to use async copy's "how committed" > value. I quibble with the description of COMMIT as "expensive". - If the COPY was UNSTABLE, then a COMMIT will have to be done whether the client checks for an intervening CB_OFFLOAD or not. - If the COPY was FILE_SYNC, then the COMMIT is a no-op, and the only expense here is an extra network round trip. - This is not a performance path. If the client missed the CB_OFFLOAD in the first place, then it's already waited a few seconds longer than the server took to do the COPY. So my feeling is that avoiding the COMMIT is an optimization. It isn't required to make COPY or OFFLOAD_STATUS work correctly AFAICT. >> + case -EREMOTEIO: >> + /* COPY operation failed on the server. */ >> + status = -EOPNOTSUPP; >> + res->write_res.count = copied; >> + goto out_free; >> + case -EBADF: >> + /* Server did not recognize the copy stateid. It has >> + * probably restarted and lost the plot. */ >> + res->write_res.count = 0; >> + status = -EOPNOTSUPP; >> + break; > > This is the case of receiving a BAD_STATEID from OFFLOAD_STATUS and > this would lead to copy falling back to read/write operation. IF we > don't check the existence of CB_OFFLOAD reply, then OFFLOAD_STATUS can > and will carry BAD_STATEID as the server is free to delete copy status > after it get CB_OFFLOAD reply (which as i said we have to check). I don't agree that the client /has/ to check for it. I think it's an optimization for an infrequent race. As the meme goes: "Change my mind". ;-) But back to changing the logic in this arm: what is the problem with falling back to a read/write copy in such cases? This seems to me like an infrequent case where we should prioritize recovering reliably. > And > If we did then we need take the result of the CB_OFFLOAD and not act > on OFFLOAD_STATUS's error. Checking for CB_OFFLOAD after an OFFLOAD_STATUS operation seems racy to me. There is still a gap where the client has done OFFLOAD_STATUS, checks for the CB_OFFLOAD and doesn't see one, and then the CB_OFFLOAD actually shows up. So this does not eliminate a race, it merely narrows the window. >> + case -EOPNOTSUPP: >> + /* RFC 7862 REQUIREs server to support OFFLOAD_STATUS when >> + * it has signed up for an async COPY, so server is not >> + * spec-compliant. */ >> + res->write_res.count = 0; >> + } >> + goto out_free; >> } >> >> static int process_copy_commit(struct file *dst, loff_t pos_dst, >> @@ -643,7 +691,7 @@ _nfs42_proc_offload_status(struct nfs_server *server, struct file *file, >> * Other negative errnos indicate the client could not complete the >> * request. >> */ >> -static int __maybe_unused >> +static int >> nfs42_proc_offload_status(struct file *dst, nfs4_stateid *stateid, u64 *copied) >> { >> struct inode *inode = file_inode(dst); >> -- >> 2.47.0 >> >>
diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c index 7fd0f2aa42d4..65cfdb5c7b02 100644 --- a/fs/nfs/nfs42proc.c +++ b/fs/nfs/nfs42proc.c @@ -175,6 +175,25 @@ int nfs42_proc_deallocate(struct file *filep, loff_t offset, loff_t len) return err; } +/* Wait this long before checking progress on a COPY operation */ +enum { + NFS42_COPY_TIMEOUT = 3 * HZ, +}; + +static void nfs4_copy_dequeue_callback(struct nfs_server *dst_server, + struct nfs_server *src_server, + struct nfs4_copy_state *copy) +{ + spin_lock(&dst_server->nfs_client->cl_lock); + list_del_init(©->copies); + spin_unlock(&dst_server->nfs_client->cl_lock); + if (dst_server != src_server) { + spin_lock(&src_server->nfs_client->cl_lock); + list_del_init(©->src_copies); + spin_unlock(&src_server->nfs_client->cl_lock); + } +} + static int handle_async_copy(struct nfs42_copy_res *res, struct nfs_server *dst_server, struct nfs_server *src_server, @@ -184,9 +203,10 @@ static int handle_async_copy(struct nfs42_copy_res *res, bool *restart) { struct nfs4_copy_state *copy, *tmp_copy = NULL, *iter; - int status = NFS4_OK; struct nfs_open_context *dst_ctx = nfs_file_open_context(dst); struct nfs_open_context *src_ctx = nfs_file_open_context(src); + int status = NFS4_OK; + u64 copied; copy = kzalloc(sizeof(struct nfs4_copy_state), GFP_KERNEL); if (!copy) @@ -224,15 +244,12 @@ static int handle_async_copy(struct nfs42_copy_res *res, spin_unlock(&src_server->nfs_client->cl_lock); } - status = wait_for_completion_interruptible(©->completion); - spin_lock(&dst_server->nfs_client->cl_lock); - list_del_init(©->copies); - spin_unlock(&dst_server->nfs_client->cl_lock); - if (dst_server != src_server) { - spin_lock(&src_server->nfs_client->cl_lock); - list_del_init(©->src_copies); - spin_unlock(&src_server->nfs_client->cl_lock); - } +wait: + status = wait_for_completion_interruptible_timeout(©->completion, + NFS42_COPY_TIMEOUT); + if (!status) + goto timeout; + nfs4_copy_dequeue_callback(dst_server, src_server, copy); if (status == -ERESTARTSYS) { goto out_cancel; } else if (copy->flags || copy->error == NFS4ERR_PARTNER_NO_AUTH) { @@ -242,6 +259,7 @@ static int handle_async_copy(struct nfs42_copy_res *res, } out: res->write_res.count = copy->count; + /* Copy out the updated write verifier provided by CB_OFFLOAD. */ memcpy(&res->write_res.verifier, ©->verf, sizeof(copy->verf)); status = -copy->error; @@ -253,6 +271,36 @@ static int handle_async_copy(struct nfs42_copy_res *res, if (!nfs42_files_from_same_server(src, dst)) nfs42_do_offload_cancel_async(src, src_stateid); goto out_free; +timeout: + status = nfs42_proc_offload_status(dst, ©->stateid, &copied); + if (status == -EINPROGRESS) + goto wait; + nfs4_copy_dequeue_callback(dst_server, src_server, copy); + switch (status) { + case 0: + /* The server recognized the copy stateid, so it hasn't + * rebooted. Don't overwrite the verifier returned in the + * COPY result. */ + res->write_res.count = copied; + goto out_free; + case -EREMOTEIO: + /* COPY operation failed on the server. */ + status = -EOPNOTSUPP; + res->write_res.count = copied; + goto out_free; + case -EBADF: + /* Server did not recognize the copy stateid. It has + * probably restarted and lost the plot. */ + res->write_res.count = 0; + status = -EOPNOTSUPP; + break; + case -EOPNOTSUPP: + /* RFC 7862 REQUIREs server to support OFFLOAD_STATUS when + * it has signed up for an async COPY, so server is not + * spec-compliant. */ + res->write_res.count = 0; + } + goto out_free; } static int process_copy_commit(struct file *dst, loff_t pos_dst, @@ -643,7 +691,7 @@ _nfs42_proc_offload_status(struct nfs_server *server, struct file *file, * Other negative errnos indicate the client could not complete the * request. */ -static int __maybe_unused +static int nfs42_proc_offload_status(struct file *dst, nfs4_stateid *stateid, u64 *copied) { struct inode *inode = file_inode(dst);