diff mbox

[3/9] pnfs: add return_range method

Message ID 1410362617-28018-4-git-send-email-hch@lst.de (mailing list archive)
State New, archived
Headers show

Commit Message

Christoph Hellwig Sept. 10, 2014, 3:23 p.m. UTC
If a layout driver keeps per-inode state outside of the layout segments it
needs to be notified of any layout returns or recalls on an inode, and not
just about the freeing of layout segments.  Add a method to acomplish this,
which will allow the block layout driver to handle the case of truncated
and re-expanded files properly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/nfs/callback_proc.c | 12 +++++++++---
 fs/nfs/pnfs.c          | 10 ++++++++++
 fs/nfs/pnfs.h          |  3 +++
 3 files changed, 22 insertions(+), 3 deletions(-)

Comments

Peng Tao Sept. 11, 2014, 1:54 p.m. UTC | #1
On Wed, Sep 10, 2014 at 11:23 PM, Christoph Hellwig <hch@lst.de> wrote:
> If a layout driver keeps per-inode state outside of the layout segments it
> needs to be notified of any layout returns or recalls on an inode, and not
> just about the freeing of layout segments.  Add a method to acomplish this,
> which will allow the block layout driver to handle the case of truncated
> and re-expanded files properly.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/nfs/callback_proc.c | 12 +++++++++---
>  fs/nfs/pnfs.c          | 10 ++++++++++
>  fs/nfs/pnfs.h          |  3 +++
>  3 files changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/fs/nfs/callback_proc.c b/fs/nfs/callback_proc.c
> index 0d26951..56bd0ea 100644
> --- a/fs/nfs/callback_proc.c
> +++ b/fs/nfs/callback_proc.c
> @@ -181,10 +181,16 @@ static u32 initiate_file_draining(struct nfs_client *clp,
>         spin_lock(&ino->i_lock);
>         if (test_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags) ||
>             pnfs_mark_matching_lsegs_invalid(lo, &free_me_list,
> -                                       &args->cbl_range))
> +                                       &args->cbl_range)) {
>                 rv = NFS4ERR_DELAY;
> -       else
> -               rv = NFS4ERR_NOMATCHING_LAYOUT;
> +               goto unlock;
> +       }
> +
> +       if (NFS_SERVER(ino)->pnfs_curr_ld->return_range) {
It looks better to put return_range inside
pnfs_mark_matching_lsegs_invalid() to have it called every time a
range of layout segments all get freed. So that ld is sure to free
things up.

Cheers,
Tao
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Sept. 11, 2014, 3:20 p.m. UTC | #2
On Thu, Sep 11, 2014 at 09:54:42PM +0800, Peng Tao wrote:
> It looks better to put return_range inside
> pnfs_mark_matching_lsegs_invalid() to have it called every time a
> range of layout segments all get freed. So that ld is sure to free
> things up.

Actually we specificly want to avoid that for cases like the internal
new stateid handling in pnfs_layout_process, where we might allocate
a new layout that overlaps an existing one, and where we need to keep
the extents around.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peng Tao Sept. 11, 2014, 3:30 p.m. UTC | #3
On Thu, Sep 11, 2014 at 11:20 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Sep 11, 2014 at 09:54:42PM +0800, Peng Tao wrote:
>> It looks better to put return_range inside
>> pnfs_mark_matching_lsegs_invalid() to have it called every time a
>> range of layout segments all get freed. So that ld is sure to free
>> things up.
>
> Actually we specificly want to avoid that for cases like the internal
> new stateid handling in pnfs_layout_process, where we might allocate
> a new layout that overlaps an existing one, and where we need to keep
> the extents around.
>
It looks dangerous to have extents lurking around without matching
layout segments. One example is NFS4ERR_EXPIRED and
NFS4ERR_BAD_STATEID in nfs4_layoutget_done(), client needs to drop all
layout segments but may keep the layout header, in which case blocks
layout would still hold all extents at hand.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Sept. 11, 2014, 3:36 p.m. UTC | #4
On Thu, Sep 11, 2014 at 11:30:50PM +0800, Peng Tao wrote:
> It looks dangerous to have extents lurking around without matching
> layout segments. One example is NFS4ERR_EXPIRED and
> NFS4ERR_BAD_STATEID in nfs4_layoutget_done(), client needs to drop all
> layout segments but may keep the layout header, in which case blocks
> layout would still hold all extents at hand.

We never do I/O without a valid layout, but we keep the extents around
because we need to keep state like that it's been written to or has
a commit pending in a single place, and also need to keep it if a layout
goes away temporarily.

I hate the way it has been done in the old client, and tried moving the
extent tracking to a per-layout basis, and spent way too much time on it
but still couldn't get it to work, so I finally gave up.  I think we're
between a rock (the idiocy on rfc5663 that it tracks data in extents and not
in layouts) and a hard place (the forgetful Linux client and it's lose layout
coherency) here.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peng Tao Sept. 12, 2014, 2:22 a.m. UTC | #5
On Thu, Sep 11, 2014 at 11:36 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Sep 11, 2014 at 11:30:50PM +0800, Peng Tao wrote:
>> It looks dangerous to have extents lurking around without matching
>> layout segments. One example is NFS4ERR_EXPIRED and
>> NFS4ERR_BAD_STATEID in nfs4_layoutget_done(), client needs to drop all
>> layout segments but may keep the layout header, in which case blocks
>> layout would still hold all extents at hand.
>
> We never do I/O without a valid layout, but we keep the extents around
> because we need to keep state like that it's been written to or has
> a commit pending in a single place, and also need to keep it if a layout
> goes away temporarily.
We don't know if layout goes away temporarily or permanently. If
server happens to move the file's data (enterprise NAS does that for a
lot of reasons) in between, next time client does layoutget, it will
get a different extent mapping. I'm not sure if the new extent
tracking code is able to cope with it. The old code will certainly
fail though.

If your concern is current layoutreturn implementation that ignores
pending layoutcommit (and inflight IOs), that needs to be fixed. I
have a patchset to fix it and it is still under internal QA. I'll post
it once it passes testing.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Sept. 13, 2014, 7:38 p.m. UTC | #6
On Fri, Sep 12, 2014 at 10:22:11AM +0800, Peng Tao wrote:
> > We never do I/O without a valid layout, but we keep the extents around
> > because we need to keep state like that it's been written to or has
> > a commit pending in a single place, and also need to keep it if a layout
> > goes away temporarily.
> We don't know if layout goes away temporarily or permanently. If
> server happens to move the file's data (enterprise NAS does that for a
> lot of reasons) in between, next time client does layoutget, it will
> get a different extent mapping. I'm not sure if the new extent
> tracking code is able to cope with it. The old code will certainly
> fail though.

A server is expected to do a layout recall with the clora_changed set
to true if it moves data around and thus the layout goes away permanently.
The Linux client currently irgnores that field, but that's something which
needs to fixed in the core pnfs code and not the block layout drivers.

But this isn't one of the cases were we keep the extent around but
not the layout - those are the ones were we have stateid mismatches
or similar (most of them now fixed by me).

We actually have a much bigger elephant in the room, which this series
doesn't address yet:  if a layoutcommit fails we curently don't have
a way to resend the data to the MDS.  This will be more common pnfs core
work as we basically need to keep the data in a state similar to NFS
unstable writes, instead of claiming that we finished a NFS_FILE_SYNC,
which allows the VM to drop the data from the pagecache before the
commit happened.

I'm slowly wading through all these issues - I initially only planned to
write a server but so far fixing up the client has taken way more of my
time compared to the relatively trivial server..

> If your concern is current layoutreturn implementation that ignores
> pending layoutcommit (and inflight IOs), that needs to be fixed. I
> have a patchset to fix it and it is still under internal QA. I'll post
> it once it passes testing.

That's a different issues, but I'm happy to review that as well.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/nfs/callback_proc.c b/fs/nfs/callback_proc.c
index 0d26951..56bd0ea 100644
--- a/fs/nfs/callback_proc.c
+++ b/fs/nfs/callback_proc.c
@@ -181,10 +181,16 @@  static u32 initiate_file_draining(struct nfs_client *clp,
 	spin_lock(&ino->i_lock);
 	if (test_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags) ||
 	    pnfs_mark_matching_lsegs_invalid(lo, &free_me_list,
-					&args->cbl_range))
+					&args->cbl_range)) {
 		rv = NFS4ERR_DELAY;
-	else
-		rv = NFS4ERR_NOMATCHING_LAYOUT;
+		goto unlock;
+	}
+
+	if (NFS_SERVER(ino)->pnfs_curr_ld->return_range) {
+		NFS_SERVER(ino)->pnfs_curr_ld->return_range(lo,
+			&args->cbl_range);
+	}
+unlock:
 	spin_unlock(&ino->i_lock);
 	pnfs_free_lseg_list(&free_me_list);
 	pnfs_put_layout_hdr(lo);
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index ce5d1ea..76de7f5 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -857,6 +857,16 @@  _pnfs_return_layout(struct inode *ino)
 	empty = list_empty(&lo->plh_segs);
 	pnfs_clear_layoutcommit(ino, &tmp_list);
 	pnfs_mark_matching_lsegs_invalid(lo, &tmp_list, NULL);
+
+	if (NFS_SERVER(ino)->pnfs_curr_ld->return_range) {
+		struct pnfs_layout_range range = {
+			.iomode		= IOMODE_ANY,
+			.offset		= 0,
+			.length		= NFS4_MAX_UINT64,
+		};
+		NFS_SERVER(ino)->pnfs_curr_ld->return_range(lo, &range);
+	}
+
 	/* Don't send a LAYOUTRETURN if list was initially empty */
 	if (empty) {
 		spin_unlock(&ino->i_lock);
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index a4c530e..2c24942 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -94,6 +94,9 @@  struct pnfs_layoutdriver_type {
 	struct pnfs_layout_segment * (*alloc_lseg) (struct pnfs_layout_hdr *layoutid, struct nfs4_layoutget_res *lgr, gfp_t gfp_flags);
 	void (*free_lseg) (struct pnfs_layout_segment *lseg);
 
+	void (*return_range) (struct pnfs_layout_hdr *lo,
+			      struct pnfs_layout_range *range);
+
 	/* test for nfs page cache coalescing */
 	const struct nfs_pageio_ops *pg_read_ops;
 	const struct nfs_pageio_ops *pg_write_ops;