[RFC] pnfs: Send layoutreturn, if Return-On-Close is set

Message ID 4DDE536E.8020805@panasas.com (mailing list archive)
State New, archived

Commit Message

Boaz Harrosh May 26, 2011, 1:19 p.m. UTC
Everything is already in place in pnfs_roc(): the segments are released
and the LO state is blocked until after the close is done. All that
is needed is to send the actual layoutreturn synchronously.

RFC1: It looks like the close that comes afterwards is also sync,
      so I assumed it is always safe to wait here. Is that
      true?

RFC2: The ideal would be to send the layoutreturn together
      with the close, in the same compound. The only little
      problem with that is that, in error handling, we would need
      to analyse whether the error came from the layoutreturn part
      of the compound, re-encode the close without the layoutreturn,
      and resend.
      I mean, what stupid server returns an error on layoutreturn?!
      Really? Why is it useful information to the client? What can
      it do differently? Anyway ...

      So I'm too lazy to do the right thing right now. I promise
      to put it on my TODO list and do it soon.

TODO:
	mark_lseg_invalid() should receive and collect the union
	*range* of all the segments that were ROC, and only send
	a layoutreturn on that collected range. (If lseg_list is empty,
	then do an all-file return.) See the sketch below.
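
As an illustration of that TODO, a minimal user-space sketch (with
simplified stand-in types, not the kernel's struct pnfs_layout_range or
struct pnfs_layout_segment) of collecting the union range of the ROC
segments, so that a single ranged layoutreturn could cover all of them:

#include <stddef.h>
#include <stdint.h>

#define NFS4_MAX_UINT64 (~0ULL)

/* Simplified stand-in for struct pnfs_layout_range. */
struct layout_range {
	uint64_t offset;
	uint64_t length;	/* NFS4_MAX_UINT64 means "to EOF" */
};

static uint64_t range_end(const struct layout_range *r)
{
	if (r->length == NFS4_MAX_UINT64)
		return NFS4_MAX_UINT64;
	return r->offset + r->length;
}

/* Grow *ret so that it also covers *seg. */
static void roc_range_union(struct layout_range *ret,
			    const struct layout_range *seg)
{
	uint64_t end = range_end(ret);
	uint64_t seg_end = range_end(seg);

	if (seg->offset < ret->offset)
		ret->offset = seg->offset;
	if (seg_end > end)
		end = seg_end;
	ret->length = (end == NFS4_MAX_UINT64) ?
			NFS4_MAX_UINT64 : end - ret->offset;
}

int main(void)
{
	/* Two ROC segments: [0, 1MiB) and [4MiB, 5MiB). */
	struct layout_range segs[] = {
		{ .offset = 0,       .length = 1 << 20 },
		{ .offset = 4 << 20, .length = 1 << 20 },
	};
	struct layout_range ret = segs[0];

	for (size_t i = 1; i < sizeof(segs) / sizeof(segs[0]); i++)
		roc_range_union(&ret, &segs[i]);
	/* ret is now { .offset = 0, .length = 5 << 20 }, covering both. */
	return 0;
}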

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/nfs/nfs4proc.c       |   10 ++++++++-
 fs/nfs/pnfs.c           |   52 +++++++++++++++++++++++++++++-----------------
 include/linux/nfs_xdr.h |    1 +
 3 files changed, 43 insertions(+), 20 deletions(-)

Comments

Trond Myklebust May 26, 2011, 2:16 p.m. UTC | #1
On Thu, 2011-05-26 at 16:19 +0300, Boaz Harrosh wrote: 
> Every thing was ready, in pnfs_roc(). The segments released
> and the LO state blocked til after the close is done. All that
> is needed is to send the actual layoutreturn synchronously.

Why would we want to do this?

Return-on-close was initially considered useful only for debugging.

At the interim IETF meeting in Sunnyvale, we also discussed the case
where the forgetful client has forgotten the layout: in this case the
server may decide to forget the layout too. There is no controversy in
doing this, since both the client and the server know that any
outstanding layout is supposed to be returned (and if there is a
problem, then the server always has the option of sending a
CB_LAYOUTRECALL).
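
Roughly, the server-side choice looks like the following sketch (a
user-space illustration with made-up types and helpers, not nfsd code):

#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-ins, not the Linux nfsd data structures. */
struct layout_state {
	bool return_on_close;	/* layout was granted with the ROC bit set */
	bool outstanding;	/* client has not sent LAYOUTRETURN */
};

static void layout_free(struct layout_state *lo)
{
	lo->outstanding = false;
	printf("server: freeing ROC layout state\n");
}

static void send_cb_layoutrecall(struct layout_state *lo)
{
	(void)lo;
	printf("server: recalling layout via CB_LAYOUTRECALL\n");
}

/* On the final CLOSE from a (possibly forgetful) client: */
static void on_final_close(struct layout_state *lo)
{
	if (!lo->outstanding)
		return;
	if (lo->return_on_close)
		layout_free(lo);	  /* both sides expect it returned */
	else
		send_cb_layoutrecall(lo); /* server can always ask for it back */
}

int main(void)
{
	struct layout_state lo = { .return_on_close = true, .outstanding = true };

	on_final_close(&lo);	/* frees rather than recalls */
	return 0;
}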

Adding a synchronous call to close is in any case a bug since close can
on occasion be sent in situations where we don't allow sleeping.

Cheers
  Trond
Boaz Harrosh May 26, 2011, 2:45 p.m. UTC | #2
On 05/26/2011 05:16 PM, Trond Myklebust wrote:
> On Thu, 2011-05-26 at 16:19 +0300, Boaz Harrosh wrote: 
>> Every thing was ready, in pnfs_roc(). The segments released
>> and the LO state blocked til after the close is done. All that
>> is needed is to send the actual layoutreturn synchronously.
> 
> Why would we want to do this?
> 
> Return-on-close was initially considered useful only for debugging.
> 
What ??

> At the interim IETF meeting in Sunnyvale, we also discussed the case
> where the forgetful client has forgotten the layout: in this case the
> server may decide to forget the layout too. There is no controversy in
> doing this, since both the client and the server know that any
> outstanding layout is supposed to be returned (and if there is a
> problem, then the server always has the option of sending a
> CB_LAYOUTRECALL).
> 

OK, I didn't know that. So what you are saying is that if the server sees
a final close it can go and proactively free all segments marked with ROC?

If so, then someone should fix the Linux server, because currently it never
frees them. On a modest machine like the UML I use, a few tens of "git checkout linux"
runs crash the machine with OOM. Today they are only freed on client umount.

> Adding a synchronous call to close is in any case a bug since close can
> on occasion be sent in situations where we don't allow sleeping.
> 

This is done only on the final close. Isn't the very final call sync?

OK, re-inspecting the code, I can see that nfs4_do_close() also takes a wait flag.
I thought that the last close should always wait for all operations
to end before proceeding with the close. That's how it is at the VFS level,
but I guess life is hard. So the only possible solution is within the same
compound as the close (not that we need it, as you say).

> Cheers
>   Trond

Thanks
Boaz
Trond Myklebust May 26, 2011, 3:04 p.m. UTC | #3
On Thu, 2011-05-26 at 17:45 +0300, Boaz Harrosh wrote: 
> On 05/26/2011 05:16 PM, Trond Myklebust wrote:
> > On Thu, 2011-05-26 at 16:19 +0300, Boaz Harrosh wrote: 
> >> Every thing was ready, in pnfs_roc(). The segments released
> >> and the LO state blocked til after the close is done. All that
> >> is needed is to send the actual layoutreturn synchronously.
> > 
> > Why would we want to do this?
> > 
> > Return-on-close was initially considered useful only for debugging.
> > 
> What ??
> 
> > At the interim IETF meeting in Sunnyvale, we also discussed the case
> > where the forgetful client has forgotten the layout: in this case the
> > server may decide to forget the layout too. There is no controversy in
> > doing this, since both the client and the server know that any
> > outstanding layout is supposed to be returned (and if there is a
> > problem, then the server always has the option of sending a
> > CB_LAYOUTRECALL).
> > 
> 
> OK I didn't know that. So what you are saying is that if the server see
> a final close he can go and provocative free all segments marked with ROC?
> 
> If so then someone should fix the Linux server. Because currently it never
> frees them. On a modest machine like the UML I use. Few 10s of "git checkout linux"
> crash the machine with oom. Today they are only freed on client umount.

That would be a bug. The server must either free or recall in this
situation.

> > Adding a synchronous call to close is in any case a bug since close can
> > on occasion be sent in situations where we don't allow sleeping.
> > 
> 
> This is done only on the final close. Isn't the very final call sync?
> 
> Ok re-inspecting the code I can see that nfs4_do_close also takes a wait flag.
> I thought that the last close should always be waiting for all operations
> to end before proceeding with the close. That's how it is at the VFS level
> but I guess life is hard. So the only possible solution is within the same
> compound as the close. (not that we need it as you say)

The problem is that LAYOUTRETURN may have to be sent with a _different_
credential than the CLOSE itself.

See the description of the EXCHGID4_FLAG_BIND_PRINC_STATEID exchangeid
flag. Although we don't set that flag in our exchangeid requests, the
_server_ may still set it in its reply, in which case we are supposed to
obey it.
This is also a reason why sending OPEN and LAYOUTGET in the same
compound can be problematic.
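
Concretely, the credential choice boils down to something like this sketch
(choose_layoutreturn_cred() and the two credential arguments are made up;
only the EXCHGID4_FLAG_BIND_PRINC_STATEID flag itself comes from the
NFSv4.1 spec):

/* Flag value as defined by the NFSv4.1 spec for the EXCHANGE_ID result. */
#define EXCHGID4_FLAG_BIND_PRINC_STATEID 0x00000100

struct cred;	/* opaque credential type for this sketch */

static const struct cred *
choose_layoutreturn_cred(unsigned int exchange_flags,
			 const struct cred *layout_stateid_cred,
			 const struct cred *close_cred)
{
	/*
	 * If the server set BIND_PRINC_STATEID in its EXCHANGE_ID reply,
	 * operations on a stateid must come from the principal that
	 * created it, so LAYOUTRETURN has to use the layout stateid's
	 * credential even when CLOSE is sent with a different one --
	 * which is why folding both into one compound gets awkward.
	 */
	if (exchange_flags & EXCHGID4_FLAG_BIND_PRINC_STATEID)
		return layout_stateid_cred;
	return close_cred;
}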

Cheers
  Trond
Trond Myklebust May 26, 2011, 3:19 p.m. UTC | #4
On Thu, 2011-05-26 at 11:04 -0400, Trond Myklebust wrote: 
> See the description of the EXCHGID4_FLAG_BIND_PRINC_STATEID exchangeid
> flag. Although we don't set that flag in our exchangeid requests, the
> _server_ may still set it in its reply, in which case we are supposed to
> obey it.
> This is also a reason why sending OPEN and LAYOUTGET in the same
> compound can be problematic.

BTW: We have yet to implement support for BIND_PRINC_STATEID in the
client. One of the more "interesting" issues, I think, is that we
probably want to handle EKEYEXPIRED differently: instead of just
retrying, we should really clear out the layouts (forgetfully - since we
don't have a credential) and reget them.

Note that SP4_MACH_CRED and SP4_SSV might help here, but we're not
guaranteed that the server supports those even if it requires
BIND_PRINC_STATEID.

Patch

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 5734009..f32cc46 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -5745,6 +5745,7 @@  int nfs4_proc_layoutreturn(struct nfs4_layoutreturn *lrp)
 		.rpc_message = &msg,
 		.callback_ops = &nfs4_layoutreturn_call_ops,
 		.callback_data = lrp,
+		.flags = RPC_TASK_ASYNC,
 	};
 	int status;
 
@@ -5753,8 +5754,15 @@  int nfs4_proc_layoutreturn(struct nfs4_layoutreturn *lrp)
 	if (IS_ERR(task))
 		return PTR_ERR(task);
 	status = task->tk_status;
-	dprintk("<-- %s status=%d\n", __func__, status);
+
+	if (!status && lrp->args.sync) {
+		status = nfs4_wait_for_completion_rpc_task(task);
+		if (status == 0)
+			status = task->tk_status;
+	}
+
 	rpc_put_task(task);
+	dprintk("<-- %s status=%d\n", __func__, status);
 	return status;
 }
 
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 9b749f2..7dc15d8 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -620,6 +620,31 @@  out_err_free:
 	return NULL;
 }
 
+static int __send_layoutreturn(struct inode *ino, nfs4_stateid *stateid,
+			       bool sync)
+{
+	struct nfs4_layoutreturn *lrp = NULL;
+	int status;
+
+	/* lrp is freed in nfs4_layoutreturn_release */
+	lrp = kzalloc(sizeof(*lrp), GFP_KERNEL);
+	if (unlikely(!lrp)) {
+		put_layout_hdr(NFS_I(ino)->layout);
+		status = -ENOMEM;
+		goto out;
+	}
+	lrp->args.sync = sync;
+	lrp->args.stateid = *stateid;
+	lrp->args.reclaim = 0;
+	lrp->args.layout_type = NFS_SERVER(ino)->pnfs_curr_ld->id;
+	lrp->args.inode = ino;
+	lrp->clp = NFS_SERVER(ino)->nfs_client;
+
+	status = nfs4_proc_layoutreturn(lrp);
+out:
+	dprintk("%s: ==> %d\n", __func__, status);
+	return status;
+}
 /* Initiates a LAYOUTRETURN(FILE) */
 int
 _pnfs_return_layout(struct inode *ino)
@@ -627,7 +652,6 @@  _pnfs_return_layout(struct inode *ino)
 	struct pnfs_layout_hdr *lo = NULL;
 	struct nfs_inode *nfsi = NFS_I(ino);
 	LIST_HEAD(tmp_list);
-	struct nfs4_layoutreturn *lrp = NULL;
 	nfs4_stateid stateid;
 	int status = 0;
 
@@ -647,37 +671,23 @@  _pnfs_return_layout(struct inode *ino)
 	pnfs_free_lseg_list(&tmp_list);
 
 	WARN_ON(test_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags));
-
-	/* lrp is freed in nfs4_layoutreturn_release */
-	lrp = kzalloc(sizeof(*lrp), GFP_KERNEL);
-	if (unlikely(!lrp)) {
-		put_layout_hdr(NFS_I(ino)->layout);
-		status = -ENOMEM;
-		goto out;
-	}
-	lrp->args.stateid = stateid;
-	lrp->args.reclaim = 0;
-	lrp->args.layout_type = NFS_SERVER(ino)->pnfs_curr_ld->id;
-	lrp->args.inode = ino;
-	lrp->clp = NFS_SERVER(ino)->nfs_client;
-
-	status = nfs4_proc_layoutreturn(lrp);
+	status = __send_layoutreturn(ino, &stateid, false);
 out:
-	if (unlikely(status))
-		kfree(lrp);
 	dprintk("<-- %s status: %d\n", __func__, status);
 	return status;
 }
 
 bool pnfs_roc(struct inode *ino)
 {
+	struct nfs_inode *nfsi = NFS_I(ino);
 	struct pnfs_layout_hdr *lo;
 	struct pnfs_layout_segment *lseg, *tmp;
 	LIST_HEAD(tmp_list);
 	bool found = false;
+	nfs4_stateid stateid;
 
 	spin_lock(&ino->i_lock);
-	lo = NFS_I(ino)->layout;
+	lo = nfsi->layout;
 	if (!lo || !test_and_clear_bit(NFS_LAYOUT_ROC, &lo->plh_flags) ||
 	    test_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags))
 		goto out_nolayout;
@@ -689,9 +699,13 @@  bool pnfs_roc(struct inode *ino)
 	if (!found)
 		goto out_nolayout;
 	lo->plh_block_lgets++;
+	get_layout_hdr(lo); /* matched in nfs4_layoutreturn_release */
 	get_layout_hdr(lo); /* matched in pnfs_roc_release */
+	stateid = nfsi->layout->plh_stateid;
 	spin_unlock(&ino->i_lock);
 	pnfs_free_lseg_list(&tmp_list);
+	/* We ignore the return code from layoutreturn, layout is forgotten */
+	__send_layoutreturn(ino, &stateid, true);
 	return true;
 
 out_nolayout:
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index df99102..00b391a 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -272,6 +272,7 @@  struct nfs4_layoutcommit_data {
 };
 
 struct nfs4_layoutreturn_args {
+	bool	sync;
 	__u32   reclaim;
 	__u32   layout_type;
 	struct inode *inode;