diff mbox

[08/17] pnfs/blocklayout: reject pnfs blocksize larger than page size

Message ID 1407396229-4785-9-git-send-email-hch@lst.de (mailing list archive)
State New, archived
Headers show

Commit Message

Christoph Hellwig Aug. 7, 2014, 7:23 a.m. UTC
The Linux VM subsystem can't support block sizes larger than page size
for block based filesystems very well.  While this can be hacked around
to some extent for simple filesystems the read-modify-write cycles
required for pnfs block invalid extents are extremly deadlock prone
when operating on multiple pages.  Reject this case early on instead
of pretending to support it (badly).

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/nfs/blocklayout/blocklayout.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Peng Tao Aug. 7, 2014, 10:43 a.m. UTC | #1
On Thu, Aug 7, 2014 at 3:23 PM, Christoph Hellwig <hch@lst.de> wrote:
> The Linux VM subsystem can't support block sizes larger than page size
> for block based filesystems very well.  While this can be hacked around
> to some extent for simple filesystems the read-modify-write cycles
> required for pnfs block invalid extents are extremly deadlock prone
> when operating on multiple pages.  Reject this case early on instead
> of pretending to support it (badly).
>
So this kills EMC server support. Can you please share what kind of
badly deadlock you saw with large block size support?

Thanks,
Tao

> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/nfs/blocklayout/blocklayout.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
> index cbb1797..6c1a421 100644
> --- a/fs/nfs/blocklayout/blocklayout.c
> +++ b/fs/nfs/blocklayout/blocklayout.c
> @@ -1115,6 +1115,12 @@ bl_set_layoutdriver(struct nfs_server *server, const struct nfs_fh *fh)
>                 dprintk("%s Server did not return blksize\n", __func__);
>                 return -EINVAL;
>         }
> +       if (server->pnfs_blksize > PAGE_SIZE) {
> +               printk(KERN_ERR "%s: pNFS blksize %d not supported.\n",
> +                       __func__, server->pnfs_blksize);
> +               return -EINVAL;
> +       }
> +
>         b_mt_id = kzalloc(sizeof(struct block_mount_id), GFP_NOFS);
>         if (!b_mt_id) {
>                 status = -ENOMEM;
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Aug. 7, 2014, 11:25 a.m. UTC | #2
On Thu, Aug 07, 2014 at 06:43:14PM +0800, Peng Tao wrote:
> So this kills EMC server support.

Given the state the code  claiming support for any server is a large
exaggeration..

> Can you please share what kind of
> badly deadlock you saw with large block size support?

The read-modify write code (which I'll remove later) can lock arbitary
numbers of additional pages from the writeback back code without doing
a trylock, which is required for doing this in page writeback.  Note
that it's not a deadlock, but I can also triv?ally corrupt data in
those pages as it doesn't lock against them, you just need a race
window where it's modified after writeback has been started for a large
extents, which isn't too hard to hit with tools like fsstress.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peng Tao Aug. 7, 2014, 11:51 a.m. UTC | #3
On Thu, Aug 7, 2014 at 7:25 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Aug 07, 2014 at 06:43:14PM +0800, Peng Tao wrote:
>> So this kills EMC server support.
>
> Given the state the code  claiming support for any server is a large
> exaggeration..
>
>> Can you please share what kind of
>> badly deadlock you saw with large block size support?
>
> The read-modify write code (which I'll remove later) can lock arbitary
> numbers of additional pages from the writeback back code without doing
> a trylock, which is required for doing this in page writeback.  Note
> that it's not a deadlock, but I can also triv?ally corrupt data in
> those pages as it doesn't lock against them, you just need a race
> window where it's modified after writeback has been started for a large
> extents, which isn't too hard to hit with tools like fsstress.
>
Is it bl_find_get_zeroing_page() you are concerning about? I was
hoping page flags can tell if some other threads are flushing the same
page. And the extra page is always locked before readin or zeroed,
after which the page is marked uptodate before unlocking. So the
problem is that a page that is being written back gets modified by a
new writer, is it correct? How about marking it writeback before
unlocking in bl_find_get_zeroing_page()? That should keep new writers
from modifying it concurrently.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Aug. 7, 2014, 12:10 p.m. UTC | #4
On Thu, Aug 07, 2014 at 07:51:57PM +0800, Peng Tao wrote:
> Is it bl_find_get_zeroing_page() you are concerning about?

That's the primary culprit.

If you want to do the read-modify write cycle properly you'll have to
do it at the time the invalid extent is read, not written.  Take a look
at my patch to add a flag to do that in page 5, although it would
be a lot more involved to do it for large pnfs block sizes.  Given
that the code never really fully worked beforehand I'm not tempted
to come up with that myself, though.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
faibish, sorin Aug. 7, 2014, 12:56 p.m. UTC | #5
Why don't you send a patch?

-----Original Message-----
From: Peng Tao [mailto:bergwolf@gmail.com] 

Sent: Thursday, August 07, 2014 7:52 AM
To: Christoph Hellwig
Cc: Trond Myklebust; linuxnfs; faibish, sorin
Subject: Re: [PATCH 08/17] pnfs/blocklayout: reject pnfs blocksize larger than page size

On Thu, Aug 7, 2014 at 7:25 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Aug 07, 2014 at 06:43:14PM +0800, Peng Tao wrote:

>> So this kills EMC server support.

>

> Given the state the code  claiming support for any server is a large 

> exaggeration..

>

>> Can you please share what kind of

>> badly deadlock you saw with large block size support?

>

> The read-modify write code (which I'll remove later) can lock arbitary 

> numbers of additional pages from the writeback back code without doing 

> a trylock, which is required for doing this in page writeback.  Note 

> that it's not a deadlock, but I can also triv?ally corrupt data in 

> those pages as it doesn't lock against them, you just need a race 

> window where it's modified after writeback has been started for a 

> large extents, which isn't too hard to hit with tools like fsstress.

>

Is it bl_find_get_zeroing_page() you are concerning about? I was hoping page flags can tell if some other threads are flushing the same page. And the extra page is always locked before readin or zeroed, after which the page is marked uptodate before unlocking. So the problem is that a page that is being written back gets modified by a new writer, is it correct? How about marking it writeback before unlocking in bl_find_get_zeroing_page()? That should keep new writers from modifying it concurrently.
Steve Dickson Aug. 7, 2014, 1:13 p.m. UTC | #6
On 08/07/2014 07:25 AM, Christoph Hellwig wrote:
> On Thu, Aug 07, 2014 at 06:43:14PM +0800, Peng Tao wrote:
>> > So this kills EMC server support.
> Given the state the code  claiming support for any server is a large
> exaggeration..
> 
But eliminating the potential of supporting block layouts
does not sound like a very good idea to me.... 

There are rumors of other block layouts implementations
coming down the pike.... So hopefully we don't eliminating
any and all possible support of block layouts in the future...

steved.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Dickson Aug. 7, 2014, 1:17 p.m. UTC | #7
On 08/07/2014 09:13 AM, Steve Dickson wrote:
> 
> 
> On 08/07/2014 07:25 AM, Christoph Hellwig wrote:
>> On Thu, Aug 07, 2014 at 06:43:14PM +0800, Peng Tao wrote:
>>>> So this kills EMC server support.
>> Given the state the code  claiming support for any server is a large
>> exaggeration..
>>
> But eliminating the potential of supporting block layouts
> does not sound like a very good idea to me.... 
> 
> There are rumors of other block layouts implementations
> coming down the pike.... So hopefully we don't eliminating
> any and all possible support of block layouts in the future...
> 
My bad... I see this not the case... I jumped in with
out reading the entire thread.... 

But EMC server support is still not a good idea IMHO...

steved.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peng Tao Aug. 7, 2014, 1:43 p.m. UTC | #8
On Thu, Aug 7, 2014 at 8:10 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Aug 07, 2014 at 07:51:57PM +0800, Peng Tao wrote:
>> Is it bl_find_get_zeroing_page() you are concerning about?
>
> That's the primary culprit.
>
> If you want to do the read-modify write cycle properly you'll have to
> do it at the time the invalid extent is read, not written.  Take a look
> at my patch to add a flag to do that in page 5, although it would
> be a lot more involved to do it for large pnfs block sizes.  Given
> that the code never really fully worked beforehand I'm not tempted
> to come up with that myself, though.
>
we can't assume all pages written back have their pari pages (for 8K
block size e.g.) read in read_pagelists(). A page can also be read in
via MDS read. So what we need is a hook into nfs_readpage to read or
zero additional pages. But we might not even have a layout there.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peng Tao Aug. 7, 2014, 1:45 p.m. UTC | #9
On Thu, Aug 7, 2014 at 8:56 PM, faibish, sorin <faibish_sorin@emc.com> wrote:
> Why don't you send a patch?
>
I can't be sure what I proposed is correct and I don't have any server
to test against.

> -----Original Message-----
> From: Peng Tao [mailto:bergwolf@gmail.com]
> Sent: Thursday, August 07, 2014 7:52 AM
> To: Christoph Hellwig
> Cc: Trond Myklebust; linuxnfs; faibish, sorin
> Subject: Re: [PATCH 08/17] pnfs/blocklayout: reject pnfs blocksize larger than page size
>
> On Thu, Aug 7, 2014 at 7:25 PM, Christoph Hellwig <hch@lst.de> wrote:
>> On Thu, Aug 07, 2014 at 06:43:14PM +0800, Peng Tao wrote:
>>> So this kills EMC server support.
>>
>> Given the state the code  claiming support for any server is a large
>> exaggeration..
>>
>>> Can you please share what kind of
>>> badly deadlock you saw with large block size support?
>>
>> The read-modify write code (which I'll remove later) can lock arbitary
>> numbers of additional pages from the writeback back code without doing
>> a trylock, which is required for doing this in page writeback.  Note
>> that it's not a deadlock, but I can also triv?ally corrupt data in
>> those pages as it doesn't lock against them, you just need a race
>> window where it's modified after writeback has been started for a
>> large extents, which isn't too hard to hit with tools like fsstress.
>>
> Is it bl_find_get_zeroing_page() you are concerning about? I was hoping page flags can tell if some other threads are flushing the same page. And the extra page is always locked before readin or zeroed, after which the page is marked uptodate before unlocking. So the problem is that a page that is being written back gets modified by a new writer, is it correct? How about marking it writeback before unlocking in bl_find_get_zeroing_page()? That should keep new writers from modifying it concurrently.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Aug. 7, 2014, 4:10 p.m. UTC | #10
On Thu, Aug 07, 2014 at 08:56:04AM -0400, faibish, sorin wrote:
> Why don't you send a patch?

What patch?  Rewriting the broken code that was put in to support a server I
don't have access to and that apparently no one cared enough before to
fix even the most trivial bugs for?

I can leave the code for it in, but I can neither test nor really support it,
and from the state of the code it doesn't seem like anyone else really
cares for it.

How will sign up for supporting and testing > PAGE_SIZE servers if we
leave that support in?  And with support I mean regular runs of basic
QA like xfstests or just doing a simple dd if=/dev/zero of=/mnt/nfs/bigfile,
which already causes softlocks with the current code.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
faibish, sorin Aug. 7, 2014, 4:18 p.m. UTC | #11
What Peng suggested to solve: "marking it writeback before unlocking in bl_find_get_zeroing_page()?"

-----Original Message-----
From: Christoph Hellwig [mailto:hch@lst.de] 
Sent: Thursday, August 07, 2014 12:11 PM
To: faibish, sorin
Cc: Peng Tao; Trond Myklebust; linuxnfs
Subject: Re: [PATCH 08/17] pnfs/blocklayout: reject pnfs blocksize larger than page size

On Thu, Aug 07, 2014 at 08:56:04AM -0400, faibish, sorin wrote:
> Why don't you send a patch?

What patch?  Rewriting the broken code that was put in to support a server I don't have access to and that apparently no one cared enough before to fix even the most trivial bugs for?

I can leave the code for it in, but I can neither test nor really support it, and from the state of the code it doesn't seem like anyone else really cares for it.

How will sign up for supporting and testing > PAGE_SIZE servers if we leave that support in?  And with support I mean regular runs of basic QA like xfstests or just doing a simple dd if=/dev/zero of=/mnt/nfs/bigfile, which already causes softlocks with the current code.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Aug. 7, 2014, 4:20 p.m. UTC | #12
On Thu, Aug 07, 2014 at 09:43:09PM +0800, Peng Tao wrote:
> we can't assume all pages written back have their pari pages (for 8K
> block size e.g.) read in read_pagelists(). A page can also be read in
> via MDS read. So what we need is a hook into nfs_readpage to read or
> zero additional pages. But we might not even have a layout there.

We can't assume the page is there for writeback either, what all this
mess exists for.  That's why we really shouldn't even attempt to support
a a block size large than the page size, and that's also why the local
Linux filesystems strictly refuse to support it.  If you want to hack
around it you will run into problems in either case.

I also don't really see why a server would insist on this large block
size, there really isn't any major benefit in doing that today (aka the last 20
years) now that we have extent based filesystems.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peng Tao Aug. 8, 2014, 10:28 a.m. UTC | #13
On Fri, Aug 8, 2014 at 12:20 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Aug 07, 2014 at 09:43:09PM +0800, Peng Tao wrote:
>> we can't assume all pages written back have their pari pages (for 8K
>> block size e.g.) read in read_pagelists(). A page can also be read in
>> via MDS read. So what we need is a hook into nfs_readpage to read or
>> zero additional pages. But we might not even have a layout there.
>
> We can't assume the page is there for writeback either, what all this
> mess exists for.
In write_pagelist, we can find or create the pair page. It is indeed
cow extent that makes things complicated by requiring to read from
disk. If we drop cow support (which is required by rfc but I don't
know of any server that supports it), we can just zero the extra pages
and mark them uptodate. No extra read in or writeback required. That
is doable IMHO.

> That's why we really shouldn't even attempt to support
> a a block size large than the page size, and that's also why the local
> Linux filesystems strictly refuse to support it.  If you want to hack
> around it you will run into problems in either case.
>
> I also don't really see why a server would insist on this large block
> size, there really isn't any major benefit in doing that today (aka the last 20
> years) now that we have extent based filesystems.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index cbb1797..6c1a421 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -1115,6 +1115,12 @@  bl_set_layoutdriver(struct nfs_server *server, const struct nfs_fh *fh)
 		dprintk("%s Server did not return blksize\n", __func__);
 		return -EINVAL;
 	}
+	if (server->pnfs_blksize > PAGE_SIZE) {
+		printk(KERN_ERR "%s: pNFS blksize %d not supported.\n",
+			__func__, server->pnfs_blksize);
+		return -EINVAL;
+	}
+
 	b_mt_id = kzalloc(sizeof(struct block_mount_id), GFP_NOFS);
 	if (!b_mt_id) {
 		status = -ENOMEM;