diff mbox

NFS: filelayout should use nfs_generic_pg_test

Message ID 4DE65202.2010502@panasas.com (mailing list archive)
State New, archived
Headers show

Commit Message

Benny Halevy June 1, 2011, 2:51 p.m. UTC
On 2011-06-01 17:44, Weston Andros Adamson wrote:
> 
> On Jun 1, 2011, at 1:47 AM, Boaz Harrosh wrote:
> 
>> On 06/01/2011 06:18 AM, Weston Andros Adamson wrote:
>>> Use nfs_generic_pg_test instead of pnfs_generic_pg_test.
>>>
>>> This fixes the BUG at fs/nfs/write.c:941 introduced by
>>> 89a58e32d9105c01022a757fb32ddc3b51bf0025.
>>>
>>> I was able to trigger this BUG reliably using pynfs in pnfs mode,
>>> by using dd(1) to write many small blocks.
>>>
>>> Signed-off-by: Weston Andros Adamson <dros@netapp.com>
>>> ---
>>> Fix proposed by Trond.
>>>
>>> Benny- Does this make sense?
>>>
>>> fs/nfs/nfs4filelayout.c  |    2 +-
>>> fs/nfs/pagelist.c        |    5 ++++-
>>> include/linux/nfs_page.h |    3 ++-
>>> 3 files changed, 7 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
>>> index 4269088..1c3bb72 100644
>>> --- a/fs/nfs/nfs4filelayout.c
>>> +++ b/fs/nfs/nfs4filelayout.c
>>> @@ -661,7 +661,7 @@ filelayout_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
>>> 	u64 p_stripe, r_stripe;
>>> 	u32 stripe_unit;
>>>
>>> -	if (!pnfs_generic_pg_test(pgio, prev, req))
>>> +	if (!nfs_generic_pg_test(pgio, prev, req))
>>> 		return 0;
>>>
>>
>> pnfs_generic_pg_test is the one that gets the layout.
>>
>> What you've done is revert to MDS IO
>>
>> Boaz
> 
> Ah, you're right - I didn't even notice that!  I usually confirm client -> DS communication with tcpdump.  I was working for too long yesterday :)
> 
> Patch: recalled.  Discussion about a real fix: started.
> 
> -dros

I think the following should work:

Benny

git diff --stat -p -M
 fs/nfs/nfs4filelayout.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Weston Andros Adamson June 1, 2011, 3:36 p.m. UTC | #1
On Jun 1, 2011, at 10:51 AM, Benny Halevy wrote:

> On 2011-06-01 17:44, Weston Andros Adamson wrote:
>> 
>> On Jun 1, 2011, at 1:47 AM, Boaz Harrosh wrote:
>> 
>>> On 06/01/2011 06:18 AM, Weston Andros Adamson wrote:
>>>> Use nfs_generic_pg_test instead of pnfs_generic_pg_test.
>>>> 
>>>> This fixes the BUG at fs/nfs/write.c:941 introduced by
>>>> 89a58e32d9105c01022a757fb32ddc3b51bf0025.
>>>> 
>>>> I was able to trigger this BUG reliably using pynfs in pnfs mode,
>>>> by using dd(1) to write many small blocks.
>>>> 
>>>> Signed-off-by: Weston Andros Adamson <dros@netapp.com>
>>>> ---
>>>> Fix proposed by Trond.
>>>> 
>>>> Benny- Does this make sense?
>>>> 
>>>> fs/nfs/nfs4filelayout.c  |    2 +-
>>>> fs/nfs/pagelist.c        |    5 ++++-
>>>> include/linux/nfs_page.h |    3 ++-
>>>> 3 files changed, 7 insertions(+), 3 deletions(-)
>>>> 
>>>> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
>>>> index 4269088..1c3bb72 100644
>>>> --- a/fs/nfs/nfs4filelayout.c
>>>> +++ b/fs/nfs/nfs4filelayout.c
>>>> @@ -661,7 +661,7 @@ filelayout_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
>>>> 	u64 p_stripe, r_stripe;
>>>> 	u32 stripe_unit;
>>>> 
>>>> -	if (!pnfs_generic_pg_test(pgio, prev, req))
>>>> +	if (!nfs_generic_pg_test(pgio, prev, req))
>>>> 		return 0;
>>>> 
>>> 
>>> pnfs_generic_pg_test is the one that gets the layout.
>>> 
>>> What you've done is revert to MDS IO
>>> 
>>> Boaz
>> 
>> Ah, you're right - I didn't even notice that!  I usually confirm client -> DS communication with tcpdump.  I was working for too long yesterday :)
>> 
>> Patch: recalled.  Discussion about a real fix: started.
>> 
>> -dros
> 
> I think the following should work:
> 
> Benny
> 

This diff fixes the problem for me (but with s/desc/pgio).

-dros

> git diff --stat -p -M
> fs/nfs/nfs4filelayout.c |   10 ++++++++++
> 1 files changed, 10 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
> index 4269088..9f1d445 100644
> --- a/fs/nfs/nfs4filelayout.c
> +++ b/fs/nfs/nfs4filelayout.c
> @@ -661,6 +661,16 @@ filelayout_pg_test(struct nfs_pageio_descriptor
> *pgio, struct nfs_page *prev,
> 	u64 p_stripe, r_stripe;
> 	u32 stripe_unit;
> 
> +	/*
> +	 * FIXME: ideally we should be able to coalesce all requests
> +	 * that are not block boundary aligned, but currently this
> +	 * is problematic for the case of bsize < PAGE_CACHE_SIZE,
> +	 * since nfs_flush_multi and nfs_pagein_multi assume you
> +	 * can have only one struct nfs_page.
> +	 */
> +	if (desc->pg_bsize < PAGE_SIZE)
> +		return 0;
> +
> 	if (!pnfs_generic_pg_test(pgio, prev, req))
> 		return 0;
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Fred Isaman June 1, 2011, 4:01 p.m. UTC | #2
On Wed, Jun 1, 2011 at 10:51 AM, Benny Halevy <bhalevy@panasas.com> wrote:
> On 2011-06-01 17:44, Weston Andros Adamson wrote:
>>
>> On Jun 1, 2011, at 1:47 AM, Boaz Harrosh wrote:
>>
>>> On 06/01/2011 06:18 AM, Weston Andros Adamson wrote:
>>>> Use nfs_generic_pg_test instead of pnfs_generic_pg_test.
>>>>
>>>> This fixes the BUG at fs/nfs/write.c:941 introduced by
>>>> 89a58e32d9105c01022a757fb32ddc3b51bf0025.
>>>>
>>>> I was able to trigger this BUG reliably using pynfs in pnfs mode,
>>>> by using dd(1) to write many small blocks.
>>>>
>>>> Signed-off-by: Weston Andros Adamson <dros@netapp.com>
>>>> ---
>>>> Fix proposed by Trond.
>>>>
>>>> Benny- Does this make sense?
>>>>
>>>> fs/nfs/nfs4filelayout.c  |    2 +-
>>>> fs/nfs/pagelist.c        |    5 ++++-
>>>> include/linux/nfs_page.h |    3 ++-
>>>> 3 files changed, 7 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
>>>> index 4269088..1c3bb72 100644
>>>> --- a/fs/nfs/nfs4filelayout.c
>>>> +++ b/fs/nfs/nfs4filelayout.c
>>>> @@ -661,7 +661,7 @@ filelayout_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
>>>>     u64 p_stripe, r_stripe;
>>>>     u32 stripe_unit;
>>>>
>>>> -   if (!pnfs_generic_pg_test(pgio, prev, req))
>>>> +   if (!nfs_generic_pg_test(pgio, prev, req))
>>>>             return 0;
>>>>
>>>
>>> pnfs_generic_pg_test is the one that gets the layout.
>>>
>>> What you've done is revert to MDS IO
>>>
>>> Boaz
>>
>> Ah, you're right - I didn't even notice that!  I usually confirm client -> DS communication with tcpdump.  I was working for too long yesterday :)
>>
>> Patch: recalled.  Discussion about a real fix: started.
>>
>> -dros
>
> I think the following should work:
>
> Benny
>
> git diff --stat -p -M
>  fs/nfs/nfs4filelayout.c |   10 ++++++++++
>  1 files changed, 10 insertions(+), 0 deletions(-)
>
> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
> index 4269088..9f1d445 100644
> --- a/fs/nfs/nfs4filelayout.c
> +++ b/fs/nfs/nfs4filelayout.c
> @@ -661,6 +661,16 @@ filelayout_pg_test(struct nfs_pageio_descriptor
> *pgio, struct nfs_page *prev,
>        u64 p_stripe, r_stripe;
>        u32 stripe_unit;
>
> +       /*
> +        * FIXME: ideally we should be able to coalesce all requests
> +        * that are not block boundary aligned, but currently this
> +        * is problematic for the case of bsize < PAGE_CACHE_SIZE,
> +        * since nfs_flush_multi and nfs_pagein_multi assume you
> +        * can have only one struct nfs_page.
> +        */
> +       if (desc->pg_bsize < PAGE_SIZE)
> +               return 0;
> +
>        if (!pnfs_generic_pg_test(pgio, prev, req))
>                return 0;
>

Note this moves a test that was once part of the plain nfs code into
the file layout driver.  Why don't other drivers need this test?

Fred
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust June 1, 2011, 6:07 p.m. UTC | #3
On Wed, 2011-06-01 at 17:51 +0300, Benny Halevy wrote: 
> I think the following should work:
> 
> Benny
> 
> git diff --stat -p -M
>  fs/nfs/nfs4filelayout.c |   10 ++++++++++
>  1 files changed, 10 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
> index 4269088..9f1d445 100644
> --- a/fs/nfs/nfs4filelayout.c
> +++ b/fs/nfs/nfs4filelayout.c
> @@ -661,6 +661,16 @@ filelayout_pg_test(struct nfs_pageio_descriptor
> *pgio, struct nfs_page *prev,
>  	u64 p_stripe, r_stripe;
>  	u32 stripe_unit;
> 
> +	/*
> +	 * FIXME: ideally we should be able to coalesce all requests
> +	 * that are not block boundary aligned, but currently this
> +	 * is problematic for the case of bsize < PAGE_CACHE_SIZE,
> +	 * since nfs_flush_multi and nfs_pagein_multi assume you
> +	 * can have only one struct nfs_page.
> +	 */
> +	if (desc->pg_bsize < PAGE_SIZE)
> +		return 0;
> +
>  	if (!pnfs_generic_pg_test(pgio, prev, req))
>  		return 0;

So, there are several things that bother me about pnfs_generic_pg_test()
too now that I'm looking more closely at it:

     1. If the intention is to coalesce 'prev' and 'req', shouldn't we
        be asking for a layout with req_offset(prev) instead of
        req_offset(req)? 
     2. If we're only requesting a layout of length pg_count, don't we
        still need to test the layout length that the server actually
        returned before we can allow the coalescing? 
     3. if (!pgio->lseg), shouldn't we be returning an error of some
        sort? Right now we're returning 'true', and allowing the
        coalesce to occur. 
     4. Furthermore, shouldn't that test guarding the
        pnfs_update_layout() call rather be an 'if (pgio->pg_lseg ==
        NULL)' instead of looking at the values of pg_count and
        prev->wb_bytes?
Benny Halevy June 1, 2011, 6:56 p.m. UTC | #4
On 2011-06-01 19:01, Fred Isaman wrote:
> On Wed, Jun 1, 2011 at 10:51 AM, Benny Halevy <bhalevy@panasas.com> wrote:
>> On 2011-06-01 17:44, Weston Andros Adamson wrote:
>>>
>>> On Jun 1, 2011, at 1:47 AM, Boaz Harrosh wrote:
>>>
>>>> On 06/01/2011 06:18 AM, Weston Andros Adamson wrote:
>>>>> Use nfs_generic_pg_test instead of pnfs_generic_pg_test.
>>>>>
>>>>> This fixes the BUG at fs/nfs/write.c:941 introduced by
>>>>> 89a58e32d9105c01022a757fb32ddc3b51bf0025.
>>>>>
>>>>> I was able to trigger this BUG reliably using pynfs in pnfs mode,
>>>>> by using dd(1) to write many small blocks.
>>>>>
>>>>> Signed-off-by: Weston Andros Adamson <dros@netapp.com>
>>>>> ---
>>>>> Fix proposed by Trond.
>>>>>
>>>>> Benny- Does this make sense?
>>>>>
>>>>> fs/nfs/nfs4filelayout.c  |    2 +-
>>>>> fs/nfs/pagelist.c        |    5 ++++-
>>>>> include/linux/nfs_page.h |    3 ++-
>>>>> 3 files changed, 7 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
>>>>> index 4269088..1c3bb72 100644
>>>>> --- a/fs/nfs/nfs4filelayout.c
>>>>> +++ b/fs/nfs/nfs4filelayout.c
>>>>> @@ -661,7 +661,7 @@ filelayout_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
>>>>>     u64 p_stripe, r_stripe;
>>>>>     u32 stripe_unit;
>>>>>
>>>>> -   if (!pnfs_generic_pg_test(pgio, prev, req))
>>>>> +   if (!nfs_generic_pg_test(pgio, prev, req))
>>>>>             return 0;
>>>>>
>>>>
>>>> pnfs_generic_pg_test is the one that gets the layout.
>>>>
>>>> What you've done is revert to MDS IO
>>>>
>>>> Boaz
>>>
>>> Ah, you're right - I didn't even notice that!  I usually confirm client -> DS communication with tcpdump.  I was working for too long yesterday :)
>>>
>>> Patch: recalled.  Discussion about a real fix: started.
>>>
>>> -dros
>>
>> I think the following should work:
>>
>> Benny
>>
>> git diff --stat -p -M
>>  fs/nfs/nfs4filelayout.c |   10 ++++++++++
>>  1 files changed, 10 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
>> index 4269088..9f1d445 100644
>> --- a/fs/nfs/nfs4filelayout.c
>> +++ b/fs/nfs/nfs4filelayout.c
>> @@ -661,6 +661,16 @@ filelayout_pg_test(struct nfs_pageio_descriptor
>> *pgio, struct nfs_page *prev,
>>        u64 p_stripe, r_stripe;
>>        u32 stripe_unit;
>>
>> +       /*
>> +        * FIXME: ideally we should be able to coalesce all requests
>> +        * that are not block boundary aligned, but currently this
>> +        * is problematic for the case of bsize < PAGE_CACHE_SIZE,
>> +        * since nfs_flush_multi and nfs_pagein_multi assume you
>> +        * can have only one struct nfs_page.
>> +        */
>> +       if (desc->pg_bsize < PAGE_SIZE)
>> +               return 0;
>> +
>>        if (!pnfs_generic_pg_test(pgio, prev, req))
>>                return 0;
>>
> 
> Note this moves a test that was once part of the plain nfs code into
> the file layout driver.  Why don't other drivers need this test?

True.  Note I said it would work, not that it's the right fix? :-/
This just tells us what change exposed this issue...

Boaz moved this check to the nfs only path assuming that pg_bsize,
which holds the MDS's wsize/rsize is irrelevant for coalescing requests
for striping over pnfs.

I'm still convinced why nfs_flush_multi cannot use desc->pg_lseg
if it exists, but at the same time it seems like not doing the
right thing for pnfs coalescing in nfs_pageio_init_write and
nfs_pageio_do_add_request.

For pnfs, we need to ignore wsize, meaning we first need to try
to coalesce the pages and then decide if we're going the nfs_flush_multi
or the nfs_flush_one way, based on the coalesced length.

Benny

> 
> Fred
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benny Halevy June 1, 2011, 7:13 p.m. UTC | #5
On 2011-06-01 21:07, Trond Myklebust wrote:
> On Wed, 2011-06-01 at 17:51 +0300, Benny Halevy wrote: 
>> I think the following should work:
>>
>> Benny
>>
>> git diff --stat -p -M
>>  fs/nfs/nfs4filelayout.c |   10 ++++++++++
>>  1 files changed, 10 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
>> index 4269088..9f1d445 100644
>> --- a/fs/nfs/nfs4filelayout.c
>> +++ b/fs/nfs/nfs4filelayout.c
>> @@ -661,6 +661,16 @@ filelayout_pg_test(struct nfs_pageio_descriptor
>> *pgio, struct nfs_page *prev,
>>  	u64 p_stripe, r_stripe;
>>  	u32 stripe_unit;
>>
>> +	/*
>> +	 * FIXME: ideally we should be able to coalesce all requests
>> +	 * that are not block boundary aligned, but currently this
>> +	 * is problematic for the case of bsize < PAGE_CACHE_SIZE,
>> +	 * since nfs_flush_multi and nfs_pagein_multi assume you
>> +	 * can have only one struct nfs_page.
>> +	 */
>> +	if (desc->pg_bsize < PAGE_SIZE)
>> +		return 0;
>> +
>>  	if (!pnfs_generic_pg_test(pgio, prev, req))
>>  		return 0;
> 
> So, there are several things that bother me about pnfs_generic_pg_test()
> too now that I'm looking more closely at it:
> 
>      1. If the intention is to coalesce 'prev' and 'req', shouldn't we
>         be asking for a layout with req_offset(prev) instead of
>         req_offset(req)? 
>      2. If we're only requesting a layout of length pg_count, don't we
>         still need to test the layout length that the server actually
>         returned before we can allow the coalescing? 
>      3. if (!pgio->lseg), shouldn't we be returning an error of some
>         sort? Right now we're returning 'true', and allowing the
>         coalesce to occur. 
>      4. Furthermore, shouldn't that test guarding the
>         pnfs_update_layout() call rather be an 'if (pgio->pg_lseg ==
>         NULL)' instead of looking at the values of pg_count and
>         prev->wb_bytes?
> 

or rather we get the layout for the first page in
nfs_pageio_do_add_request when desc->pg_count == 0?

Then, this lseg would be useful for nfs_flush_multi
if we failed to coalesce, or we failed to get a layout
altogether we go the nfs path and can reset pg_test to
nfs_generic_pg_test.

Otherwise I agree with your assertions above.

Benny
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust June 1, 2011, 7:17 p.m. UTC | #6
On Wed, 2011-06-01 at 21:56 +0300, Benny Halevy wrote:

> For pnfs, we need to ignore wsize, meaning we first need to try
> to coalesce the pages and then decide if we're going the nfs_flush_multi
> or the nfs_flush_one way, based on the coalesced length.

No! Ignoring the wsize is definitely wrong... If the stripe size is
larger than the 'maxwrite' recommended attribute, then the DS is allowed
to do a short write, in which case we have to resend.

In any case, nfs_flush_multi and nfs_flush_one need a rewrite in order
to deal properly with O_DIRECT writes, and so I'm expecting to get rid
of the single nfs_page limit for the r/wsize<PAGE_SIZE case.
Please don't make any large changes to this code at this time.
Trond Myklebust June 1, 2011, 7:29 p.m. UTC | #7
On Wed, 2011-06-01 at 22:13 +0300, Benny Halevy wrote: 
> On 2011-06-01 21:07, Trond Myklebust wrote:
> > On Wed, 2011-06-01 at 17:51 +0300, Benny Halevy wrote: 
> >> I think the following should work:
> >>
> >> Benny
> >>
> >> git diff --stat -p -M
> >>  fs/nfs/nfs4filelayout.c |   10 ++++++++++
> >>  1 files changed, 10 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
> >> index 4269088..9f1d445 100644
> >> --- a/fs/nfs/nfs4filelayout.c
> >> +++ b/fs/nfs/nfs4filelayout.c
> >> @@ -661,6 +661,16 @@ filelayout_pg_test(struct nfs_pageio_descriptor
> >> *pgio, struct nfs_page *prev,
> >>  	u64 p_stripe, r_stripe;
> >>  	u32 stripe_unit;
> >>
> >> +	/*
> >> +	 * FIXME: ideally we should be able to coalesce all requests
> >> +	 * that are not block boundary aligned, but currently this
> >> +	 * is problematic for the case of bsize < PAGE_CACHE_SIZE,
> >> +	 * since nfs_flush_multi and nfs_pagein_multi assume you
> >> +	 * can have only one struct nfs_page.
> >> +	 */
> >> +	if (desc->pg_bsize < PAGE_SIZE)
> >> +		return 0;
> >> +
> >>  	if (!pnfs_generic_pg_test(pgio, prev, req))
> >>  		return 0;
> > 
> > So, there are several things that bother me about pnfs_generic_pg_test()
> > too now that I'm looking more closely at it:
> > 
> >      1. If the intention is to coalesce 'prev' and 'req', shouldn't we
> >         be asking for a layout with req_offset(prev) instead of
> >         req_offset(req)? 
> >      2. If we're only requesting a layout of length pg_count, don't we
> >         still need to test the layout length that the server actually
> >         returned before we can allow the coalescing? 
> >      3. if (!pgio->lseg), shouldn't we be returning an error of some
> >         sort? Right now we're returning 'true', and allowing the
> >         coalesce to occur. 
> >      4. Furthermore, shouldn't that test guarding the
> >         pnfs_update_layout() call rather be an 'if (pgio->pg_lseg ==
> >         NULL)' instead of looking at the values of pg_count and
> >         prev->wb_bytes?
> > 
> 
> or rather we get the layout for the first page in
> nfs_pageio_do_add_request when desc->pg_count == 0?

I can live with a desc->pg_init() callback or rather, converting
pg_test() and pg_doio() into a

struct nfs_pageio_ops {
	int (*pg_init)(struct nfs_pageio_descriptor *desc, struct nfs_page *req);
	bool (*pg_test)(struct nfs_pageio_descriptor *desc, struct nfs_page *prev, struct nfs_page *req);
	int (*pg_doio)(struct nfs_pageio_descriptor *desc);
};

and then replacing the two callback functions in the existing struct
nfs_pageio_descriptor with a single pointer to a 'const struct
nfs_pageio_ops'...

> Then, this lseg would be useful for nfs_flush_multi
> if we failed to coalesce, or we failed to get a layout
> altogether we go the nfs path and can reset pg_test to
> nfs_generic_pg_test.

It would presumably also get rid of all those pnfs_update_layout() calls
in read.c and write.c.
Boaz Harrosh June 1, 2011, 7:29 p.m. UTC | #8
On 06/01/2011 10:17 PM, Trond Myklebust wrote:
> On Wed, 2011-06-01 at 21:56 +0300, Benny Halevy wrote:
> 
>> For pnfs, we need to ignore wsize, meaning we first need to try
>> to coalesce the pages and then decide if we're going the nfs_flush_multi
>> or the nfs_flush_one way, based on the coalesced length.
> 
> No! Ignoring the wsize is definitely wrong... If the stripe size is
> larger than the 'maxwrite' recommended attribute, then the DS is allowed
> to do a short write, in which case we have to resend.
> 

As far as I could understand the current code, desc->bsize which derives
from wsize/rsize is negotiated with the MDS. But the wsize in question
is the DS's one. So I think in pnfs it is only the layout-driver that
can check this properly against the filer in question. Only if IO is
to go through MDS the bsize check is relevant.

BTW: The BUG_ON() Andy hit, does not look like an hard bug to fix ;-)

> In any case, nfs_flush_multi and nfs_flush_one need a rewrite in order
> to deal properly with O_DIRECT writes, and so I'm expecting to get rid
> of the single nfs_page limit for the r/wsize<PAGE_SIZE case.
> Please don't make any large changes to this code at this time.
> 

Amen, Good riddance
Boaz
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust June 1, 2011, 7:38 p.m. UTC | #9
On Wed, 2011-06-01 at 22:29 +0300, Boaz Harrosh wrote: 
> On 06/01/2011 10:17 PM, Trond Myklebust wrote:
> > On Wed, 2011-06-01 at 21:56 +0300, Benny Halevy wrote:
> > 
> >> For pnfs, we need to ignore wsize, meaning we first need to try
> >> to coalesce the pages and then decide if we're going the nfs_flush_multi
> >> or the nfs_flush_one way, based on the coalesced length.
> > 
> > No! Ignoring the wsize is definitely wrong... If the stripe size is
> > larger than the 'maxwrite' recommended attribute, then the DS is allowed
> > to do a short write, in which case we have to resend.
> > 
> 
> As far as I could understand the current code, desc->bsize which derives
> from wsize/rsize is negotiated with the MDS. But the wsize in question
> is the DS's one. So I think in pnfs it is only the layout-driver that
> can check this properly against the filer in question. Only if IO is
> to go through MDS the bsize check is relevant.

There is no distinction between the MDS's idea of the wsize and the DS's
idea of the wsize. The DS has to support the same wsize as the MDS,
since the client has no way of querying the wsize from the DS (GETATTR
maxwrite is not on the allowed set of DS operations)...

> BTW: The BUG_ON() Andy hit, does not look like an hard bug to fix ;-)
> 
> > In any case, nfs_flush_multi and nfs_flush_one need a rewrite in order
> > to deal properly with O_DIRECT writes, and so I'm expecting to get rid
> > of the single nfs_page limit for the r/wsize<PAGE_SIZE case.
> > Please don't make any large changes to this code at this time.
> > 
> 
> Amen, Good riddance

:-)
Boaz Harrosh June 1, 2011, 7:49 p.m. UTC | #10
On 06/01/2011 10:38 PM, Trond Myklebust wrote:
> 
> There is no distinction between the MDS's idea of the wsize and the DS's
> idea of the wsize. The DS has to support the same wsize as the MDS,
> since the client has no way of querying the wsize from the DS (GETATTR
> maxwrite is not on the allowed set of DS operations)...
> 

OK I did not know that. Good god how broken is that?

Any way you are probably right. But then the check properly belongs in
files-layout driver. Surly it has nothing to do with objects, and blocks

Thanks
Boaz
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust June 1, 2011, 7:52 p.m. UTC | #11
On Wed, 2011-06-01 at 22:49 +0300, Boaz Harrosh wrote: 
> On 06/01/2011 10:38 PM, Trond Myklebust wrote:
> > 
> > There is no distinction between the MDS's idea of the wsize and the DS's
> > idea of the wsize. The DS has to support the same wsize as the MDS,
> > since the client has no way of querying the wsize from the DS (GETATTR
> > maxwrite is not on the allowed set of DS operations)...
> > 
> 
> OK I did not know that. Good god how broken is that?
> 
> Any way you are probably right. But then the check properly belongs in
> files-layout driver. Surly it has nothing to do with objects, and blocks

It belongs in the 'files' layout and in the generic (i.e. v2/v3/v4)
code.
Benny Halevy June 1, 2011, 8:09 p.m. UTC | #12
On 2011-06-01 22:29, Trond Myklebust wrote:
> On Wed, 2011-06-01 at 22:13 +0300, Benny Halevy wrote: 
>> On 2011-06-01 21:07, Trond Myklebust wrote:
>>> On Wed, 2011-06-01 at 17:51 +0300, Benny Halevy wrote: 
>>>> I think the following should work:
>>>>
>>>> Benny
>>>>
>>>> git diff --stat -p -M
>>>>  fs/nfs/nfs4filelayout.c |   10 ++++++++++
>>>>  1 files changed, 10 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
>>>> index 4269088..9f1d445 100644
>>>> --- a/fs/nfs/nfs4filelayout.c
>>>> +++ b/fs/nfs/nfs4filelayout.c
>>>> @@ -661,6 +661,16 @@ filelayout_pg_test(struct nfs_pageio_descriptor
>>>> *pgio, struct nfs_page *prev,
>>>>  	u64 p_stripe, r_stripe;
>>>>  	u32 stripe_unit;
>>>>
>>>> +	/*
>>>> +	 * FIXME: ideally we should be able to coalesce all requests
>>>> +	 * that are not block boundary aligned, but currently this
>>>> +	 * is problematic for the case of bsize < PAGE_CACHE_SIZE,
>>>> +	 * since nfs_flush_multi and nfs_pagein_multi assume you
>>>> +	 * can have only one struct nfs_page.
>>>> +	 */
>>>> +	if (desc->pg_bsize < PAGE_SIZE)
>>>> +		return 0;
>>>> +
>>>>  	if (!pnfs_generic_pg_test(pgio, prev, req))
>>>>  		return 0;
>>>
>>> So, there are several things that bother me about pnfs_generic_pg_test()
>>> too now that I'm looking more closely at it:
>>>
>>>      1. If the intention is to coalesce 'prev' and 'req', shouldn't we
>>>         be asking for a layout with req_offset(prev) instead of
>>>         req_offset(req)? 
>>>      2. If we're only requesting a layout of length pg_count, don't we
>>>         still need to test the layout length that the server actually
>>>         returned before we can allow the coalescing? 
>>>      3. if (!pgio->lseg), shouldn't we be returning an error of some
>>>         sort? Right now we're returning 'true', and allowing the
>>>         coalesce to occur. 
>>>      4. Furthermore, shouldn't that test guarding the
>>>         pnfs_update_layout() call rather be an 'if (pgio->pg_lseg ==
>>>         NULL)' instead of looking at the values of pg_count and
>>>         prev->wb_bytes?
>>>
>>
>> or rather we get the layout for the first page in
>> nfs_pageio_do_add_request when desc->pg_count == 0?
> 
> I can live with a desc->pg_init() callback or rather, converting
> pg_test() and pg_doio() into a
> 
> struct nfs_pageio_ops {
> 	int (*pg_init)(struct nfs_pageio_descriptor *desc, struct nfs_page *req);
> 	bool (*pg_test)(struct nfs_pageio_descriptor *desc, struct nfs_page *prev, struct nfs_page *req);
> 	int (*pg_doio)(struct nfs_pageio_descriptor *desc);
> };
> 
> and then replacing the two callback functions in the existing struct
> nfs_pageio_descriptor with a single pointer to a 'const struct
> nfs_pageio_ops'...
> 

looks like a good way to do this!

>> Then, this lseg would be useful for nfs_flush_multi
>> if we failed to coalesce, or we failed to get a layout
>> altogether we go the nfs path and can reset pg_test to
>> nfs_generic_pg_test.
> 
> It would presumably also get rid of all those pnfs_update_layout() calls
> in read.c and write.c.
> 

Yup.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
William A. (Andy) Adamson June 6, 2011, 4:47 p.m. UTC | #13
On Wed, Jun 1, 2011 at 4:09 PM, Benny Halevy <bhalevy@panasas.com> wrote:
> On 2011-06-01 22:29, Trond Myklebust wrote:
>> On Wed, 2011-06-01 at 22:13 +0300, Benny Halevy wrote:
>>> On 2011-06-01 21:07, Trond Myklebust wrote:
>>>> On Wed, 2011-06-01 at 17:51 +0300, Benny Halevy wrote:
>>>>> I think the following should work:
>>>>>
>>>>> Benny
>>>>>
>>>>> git diff --stat -p -M
>>>>>  fs/nfs/nfs4filelayout.c |   10 ++++++++++
>>>>>  1 files changed, 10 insertions(+), 0 deletions(-)
>>>>>
>>>>> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
>>>>> index 4269088..9f1d445 100644
>>>>> --- a/fs/nfs/nfs4filelayout.c
>>>>> +++ b/fs/nfs/nfs4filelayout.c
>>>>> @@ -661,6 +661,16 @@ filelayout_pg_test(struct nfs_pageio_descriptor
>>>>> *pgio, struct nfs_page *prev,
>>>>>    u64 p_stripe, r_stripe;
>>>>>    u32 stripe_unit;
>>>>>
>>>>> +  /*
>>>>> +   * FIXME: ideally we should be able to coalesce all requests
>>>>> +   * that are not block boundary aligned, but currently this
>>>>> +   * is problematic for the case of bsize < PAGE_CACHE_SIZE,
>>>>> +   * since nfs_flush_multi and nfs_pagein_multi assume you
>>>>> +   * can have only one struct nfs_page.
>>>>> +   */
>>>>> +  if (desc->pg_bsize < PAGE_SIZE)
>>>>> +          return 0;
>>>>> +
>>>>>    if (!pnfs_generic_pg_test(pgio, prev, req))
>>>>>            return 0;
>>>>
>>>> So, there are several things that bother me about pnfs_generic_pg_test()
>>>> too now that I'm looking more closely at it:
>>>>
>>>>      1. If the intention is to coalesce 'prev' and 'req', shouldn't we
>>>>         be asking for a layout with req_offset(prev) instead of
>>>>         req_offset(req)?
>>>>      2. If we're only requesting a layout of length pg_count, don't we
>>>>         still need to test the layout length that the server actually
>>>>         returned before we can allow the coalescing?
>>>>      3. if (!pgio->lseg), shouldn't we be returning an error of some
>>>>         sort? Right now we're returning 'true', and allowing the
>>>>         coalesce to occur.
>>>>      4. Furthermore, shouldn't that test guarding the
>>>>         pnfs_update_layout() call rather be an 'if (pgio->pg_lseg ==
>>>>         NULL)' instead of looking at the values of pg_count and
>>>>         prev->wb_bytes?
>>>>
>>>
>>> or rather we get the layout for the first page in
>>> nfs_pageio_do_add_request when desc->pg_count == 0?
>>
>> I can live with a desc->pg_init() callback or rather, converting
>> pg_test() and pg_doio() into a
>>
>> struct nfs_pageio_ops {
>>       int (*pg_init)(struct nfs_pageio_descriptor *desc, struct nfs_page *req);
>>       bool (*pg_test)(struct nfs_pageio_descriptor *desc, struct nfs_page *prev, struct nfs_page *req);
>>       int (*pg_doio)(struct nfs_pageio_descriptor *desc);
>> };
>>
>> and then replacing the two callback functions in the existing struct
>> nfs_pageio_descriptor with a single pointer to a 'const struct
>> nfs_pageio_ops'...
>>
>
> looks like a good way to do this!

Is anyone coding this fix?

-->Andy

>
>>> Then, this lseg would be useful for nfs_flush_multi
>>> if we failed to coalesce, or we failed to get a layout
>>> altogether we go the nfs path and can reset pg_test to
>>> nfs_generic_pg_test.
>>
>> It would presumably also get rid of all those pnfs_update_layout() calls
>> in read.c and write.c.
>>
>
> Yup.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benny Halevy June 6, 2011, 6:21 p.m. UTC | #14
On 2011-06-06 12:47, William A. (Andy) Adamson wrote:
> On Wed, Jun 1, 2011 at 4:09 PM, Benny Halevy <bhalevy@panasas.com> wrote:
>> On 2011-06-01 22:29, Trond Myklebust wrote:
>>> On Wed, 2011-06-01 at 22:13 +0300, Benny Halevy wrote:
>>>> On 2011-06-01 21:07, Trond Myklebust wrote:
>>>>> On Wed, 2011-06-01 at 17:51 +0300, Benny Halevy wrote:
>>>>>> I think the following should work:
>>>>>>
>>>>>> Benny
>>>>>>
>>>>>> git diff --stat -p -M
>>>>>>  fs/nfs/nfs4filelayout.c |   10 ++++++++++
>>>>>>  1 files changed, 10 insertions(+), 0 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
>>>>>> index 4269088..9f1d445 100644
>>>>>> --- a/fs/nfs/nfs4filelayout.c
>>>>>> +++ b/fs/nfs/nfs4filelayout.c
>>>>>> @@ -661,6 +661,16 @@ filelayout_pg_test(struct nfs_pageio_descriptor
>>>>>> *pgio, struct nfs_page *prev,
>>>>>>    u64 p_stripe, r_stripe;
>>>>>>    u32 stripe_unit;
>>>>>>
>>>>>> +  /*
>>>>>> +   * FIXME: ideally we should be able to coalesce all requests
>>>>>> +   * that are not block boundary aligned, but currently this
>>>>>> +   * is problematic for the case of bsize < PAGE_CACHE_SIZE,
>>>>>> +   * since nfs_flush_multi and nfs_pagein_multi assume you
>>>>>> +   * can have only one struct nfs_page.
>>>>>> +   */
>>>>>> +  if (desc->pg_bsize < PAGE_SIZE)
>>>>>> +          return 0;
>>>>>> +
>>>>>>    if (!pnfs_generic_pg_test(pgio, prev, req))
>>>>>>            return 0;
>>>>>
>>>>> So, there are several things that bother me about pnfs_generic_pg_test()
>>>>> too now that I'm looking more closely at it:
>>>>>
>>>>>      1. If the intention is to coalesce 'prev' and 'req', shouldn't we
>>>>>         be asking for a layout with req_offset(prev) instead of
>>>>>         req_offset(req)?
>>>>>      2. If we're only requesting a layout of length pg_count, don't we
>>>>>         still need to test the layout length that the server actually
>>>>>         returned before we can allow the coalescing?
>>>>>      3. if (!pgio->lseg), shouldn't we be returning an error of some
>>>>>         sort? Right now we're returning 'true', and allowing the
>>>>>         coalesce to occur.
>>>>>      4. Furthermore, shouldn't that test guarding the
>>>>>         pnfs_update_layout() call rather be an 'if (pgio->pg_lseg ==
>>>>>         NULL)' instead of looking at the values of pg_count and
>>>>>         prev->wb_bytes?
>>>>>
>>>>
>>>> or rather we get the layout for the first page in
>>>> nfs_pageio_do_add_request when desc->pg_count == 0?
>>>
>>> I can live with a desc->pg_init() callback or rather, converting
>>> pg_test() and pg_doio() into a
>>>
>>> struct nfs_pageio_ops {
>>>       int (*pg_init)(struct nfs_pageio_descriptor *desc, struct nfs_page *req);
>>>       bool (*pg_test)(struct nfs_pageio_descriptor *desc, struct nfs_page *prev, struct nfs_page *req);
>>>       int (*pg_doio)(struct nfs_pageio_descriptor *desc);
>>> };
>>>
>>> and then replacing the two callback functions in the existing struct
>>> nfs_pageio_descriptor with a single pointer to a 'const struct
>>> nfs_pageio_ops'...
>>>
>>
>> looks like a good way to do this!
> 
> Is anyone coding this fix?
> 

I started working on this but switched to porting forward spnfs and
spnfs-block (which I've just pushed out).

Benny

> -->Andy
> 
>>
>>>> Then, this lseg would be useful for nfs_flush_multi
>>>> if we failed to coalesce, or we failed to get a layout
>>>> altogether we go the nfs path and can reset pg_test to
>>>> nfs_generic_pg_test.
>>>
>>> It would presumably also get rid of all those pnfs_update_layout() calls
>>> in read.c and write.c.
>>>
>>
>> Yup.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust June 6, 2011, 6:22 p.m. UTC | #15
I'm working on it. I'm doing a lot of surgery in that general area
anyway...

-----Original Message-----
From: Benny Halevy [mailto:bhalevy@panasas.com] 
Sent: Monday, June 06, 2011 2:21 PM
To: William A. (Andy) Adamson
Cc: Myklebust, Trond; Adamson, Dros; Boaz Harrosh; Myklebust, Trond;
linux-nfs@vger.kernel.org
Subject: Re: [PATCH] NFS: filelayout should use nfs_generic_pg_test

On 2011-06-06 12:47, William A. (Andy) Adamson wrote:
> On Wed, Jun 1, 2011 at 4:09 PM, Benny Halevy <bhalevy@panasas.com>
wrote:
>> On 2011-06-01 22:29, Trond Myklebust wrote:
>>> On Wed, 2011-06-01 at 22:13 +0300, Benny Halevy wrote:
>>>> On 2011-06-01 21:07, Trond Myklebust wrote:
>>>>> On Wed, 2011-06-01 at 17:51 +0300, Benny Halevy wrote:
>>>>>> I think the following should work:
>>>>>>
>>>>>> Benny
>>>>>>
>>>>>> git diff --stat -p -M
>>>>>>  fs/nfs/nfs4filelayout.c |   10 ++++++++++
>>>>>>  1 files changed, 10 insertions(+), 0 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
>>>>>> index 4269088..9f1d445 100644
>>>>>> --- a/fs/nfs/nfs4filelayout.c
>>>>>> +++ b/fs/nfs/nfs4filelayout.c
>>>>>> @@ -661,6 +661,16 @@ filelayout_pg_test(struct
nfs_pageio_descriptor
>>>>>> *pgio, struct nfs_page *prev,
>>>>>>    u64 p_stripe, r_stripe;
>>>>>>    u32 stripe_unit;
>>>>>>
>>>>>> +  /*
>>>>>> +   * FIXME: ideally we should be able to coalesce all requests
>>>>>> +   * that are not block boundary aligned, but currently this
>>>>>> +   * is problematic for the case of bsize < PAGE_CACHE_SIZE,
>>>>>> +   * since nfs_flush_multi and nfs_pagein_multi assume you
>>>>>> +   * can have only one struct nfs_page.
>>>>>> +   */
>>>>>> +  if (desc->pg_bsize < PAGE_SIZE)
>>>>>> +          return 0;
>>>>>> +
>>>>>>    if (!pnfs_generic_pg_test(pgio, prev, req))
>>>>>>            return 0;
>>>>>
>>>>> So, there are several things that bother me about
pnfs_generic_pg_test()
>>>>> too now that I'm looking more closely at it:
>>>>>
>>>>>      1. If the intention is to coalesce 'prev' and 'req',
shouldn't we
>>>>>         be asking for a layout with req_offset(prev) instead of
>>>>>         req_offset(req)?
>>>>>      2. If we're only requesting a layout of length pg_count,
don't we
>>>>>         still need to test the layout length that the server
actually
>>>>>         returned before we can allow the coalescing?
>>>>>      3. if (!pgio->lseg), shouldn't we be returning an error of
some
>>>>>         sort? Right now we're returning 'true', and allowing the
>>>>>         coalesce to occur.
>>>>>      4. Furthermore, shouldn't that test guarding the
>>>>>         pnfs_update_layout() call rather be an 'if (pgio->pg_lseg
==
>>>>>         NULL)' instead of looking at the values of pg_count and
>>>>>         prev->wb_bytes?
>>>>>
>>>>
>>>> or rather we get the layout for the first page in
>>>> nfs_pageio_do_add_request when desc->pg_count == 0?
>>>
>>> I can live with a desc->pg_init() callback or rather, converting
>>> pg_test() and pg_doio() into a
>>>
>>> struct nfs_pageio_ops {
>>>       int (*pg_init)(struct nfs_pageio_descriptor *desc, struct
nfs_page *req);
>>>       bool (*pg_test)(struct nfs_pageio_descriptor *desc, struct
nfs_page *prev, struct nfs_page *req);
>>>       int (*pg_doio)(struct nfs_pageio_descriptor *desc);
>>> };
>>>
>>> and then replacing the two callback functions in the existing struct
>>> nfs_pageio_descriptor with a single pointer to a 'const struct
>>> nfs_pageio_ops'...
>>>
>>
>> looks like a good way to do this!
> 
> Is anyone coding this fix?
> 

I started working on this but switched to porting forward spnfs and
spnfs-block (which I've just pushed out).

Benny

> -->Andy
> 
>>
>>>> Then, this lseg would be useful for nfs_flush_multi
>>>> if we failed to coalesce, or we failed to get a layout
>>>> altogether we go the nfs path and can reset pg_test to
>>>> nfs_generic_pg_test.
>>>
>>> It would presumably also get rid of all those pnfs_update_layout()
calls
>>> in read.c and write.c.
>>>
>>
>> Yup.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs"
in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs"
in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
index 4269088..9f1d445 100644
--- a/fs/nfs/nfs4filelayout.c
+++ b/fs/nfs/nfs4filelayout.c
@@ -661,6 +661,16 @@  filelayout_pg_test(struct nfs_pageio_descriptor
*pgio, struct nfs_page *prev,
 	u64 p_stripe, r_stripe;
 	u32 stripe_unit;

+	/*
+	 * FIXME: ideally we should be able to coalesce all requests
+	 * that are not block boundary aligned, but currently this
+	 * is problematic for the case of bsize < PAGE_CACHE_SIZE,
+	 * since nfs_flush_multi and nfs_pagein_multi assume you
+	 * can have only one struct nfs_page.
+	 */
+	if (desc->pg_bsize < PAGE_SIZE)
+		return 0;
+
 	if (!pnfs_generic_pg_test(pgio, prev, req))
 		return 0;