[v1,19/42] SUNRPC: Fix xdr_get_next_encode_buffer() page boundary handling

Message ID 161461183307.8508.17196295994390119297.stgit@klimt.1015granger.net (mailing list archive)
State New, archived
Series NFSv2/3 XDR encoder overhaul

Commit Message

Chuck Lever March 1, 2021, 3:17 p.m. UTC
The description of commit 2825a7f90753 ("nfsd4: allow encoding
across page boundaries") states:

> Also we can't handle a new operation starting close to the end of
> a page.

But it does not detail why this is the case.

Subtracting the scratch buffer's "shift" value from the remaining
stream space seems to make reserving space close to the end of the
buf->pages array reliable.
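
To illustrate the arithmetic with hypothetical numbers: suppose
PAGE_SIZE is 4096, frag1bytes is 4, and buf->buflen - buf->len is
2000 at this point. The current code sets xdr->end to p + 2000;
with this change it becomes p + 1996, shortening the encoding
window by the scratch buffer's shift value (the frag1bytes parked
at the end of the previous page until xdr_commit_encode() runs).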

This change is needed to make entry encoding with struct xdr_stream,
introduced in a subsequent patch, work correctly when it approaches
the end of the dirlist buffer.

Fixes: 2825a7f90753 ("nfsd4: allow encoding across page boundaries")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xdr.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

J. Bruce Fields March 2, 2021, 10:11 p.m. UTC | #1
On Mon, Mar 01, 2021 at 10:17:13AM -0500, Chuck Lever wrote:
> The description of commit 2825a7f90753 ("nfsd4: allow encoding
> across page boundaries") states:
> 
> > Also we can't handle a new operation starting close to the end of
> > a page.
> 
> But it does not detail why this is the case.

That wasn't very helpful of me, sorry.

> Subtracting the scratch buffer's "shift" value from the remaining
> stream space seems to make reserving space close to the end of the
> buf->pages array reliable.

So, why is that?

Thinking this through:

When somebody asks for a buffer that would straddle a page boundary,
with frag1bytes at the end of this page and frag2bytes at the start of
the next page, we instead give them a buffer starting at the start of the
next page.  That gives them a nice contiguous buffer to write into.
When they're done using it, we fix things up by copying what they wrote
back to where it should be.
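
In code, that mechanism is roughly the following (condensed from
xdr_get_next_encode_buffer() and xdr_commit_encode() in
net/sunrpc/xdr.c; iov bookkeeping and overflow checks omitted):

	/* xdr_get_next_encode_buffer(): the request straddles a page */
	frag1bytes = (xdr->end - xdr->p) << 2;	/* bytes left in this page */
	frag2bytes = nbytes - frag1bytes;	/* bytes in the next page */

	/* park frag1bytes in the scratch iov, then hand the caller a
	 * contiguous buffer at the start of the next page: */
	xdr_set_scratch_buffer(xdr, xdr->p, frag1bytes);
	xdr->page_ptr++;
	p = page_address(*xdr->page_ptr);
	xdr->p = (void *)p + frag2bytes;

	/* xdr_commit_encode(), called before the next reservation,
	 * copies the first fragment back to the end of the previous
	 * page and slides the rest down; page == page_address() of
	 * the new page, shift == the scratch iov length (frag1bytes) */
	memcpy(xdr->scratch.iov_base, page, shift);
	memmove(page, page + shift, (void *)xdr->p - page);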

That means we're temporarily wasting frag1bytes of space.  So, I don't
know, maybe that's the logic behind subtracting frag1bytes from
space_left.

It means you may end up with xdr->end frag1bytes short of the next page.
I'm not sure that's right.
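
(Concretely, with hypothetical numbers: if buf->buflen - buf->len
is exactly PAGE_SIZE, then space_left = PAGE_SIZE - frag1bytes,
and xdr->end = p + PAGE_SIZE - frag1bytes, i.e. frag1bytes shy of
the page boundary.)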

--b.


> 
> This change is needed to make entry encoding with struct xdr_stream,
> introduced in a subsequent patch, work correctly when it approaches
> the end of the dirlist buffer.
> 
> Fixes: 2825a7f90753 ("nfsd4: allow encoding across page boundaries")
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  net/sunrpc/xdr.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
> index 3964ff74ee51..043b67229792 100644
> --- a/net/sunrpc/xdr.c
> +++ b/net/sunrpc/xdr.c
> @@ -978,7 +978,7 @@ static __be32 *xdr_get_next_encode_buffer(struct xdr_stream *xdr,
>  	 * shifted this one back:
>  	 */
>  	xdr->p = (void *)p + frag2bytes;
> -	space_left = xdr->buf->buflen - xdr->buf->len;
> +	space_left = xdr->buf->buflen - xdr->buf->len - frag1bytes;
>  	xdr->end = (void *)p + min_t(int, space_left, PAGE_SIZE);
>  	xdr->buf->page_len += frag2bytes;
>  	xdr->buf->len += nbytes;
>
Chuck Lever March 3, 2021, 3:43 p.m. UTC | #2
Hi Bruce-

Thanks for your careful review of this series!


> On Mar 2, 2021, at 5:11 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> 
> On Mon, Mar 01, 2021 at 10:17:13AM -0500, Chuck Lever wrote:
>> The description of commit 2825a7f90753 ("nfsd4: allow encoding
>> across page boundaries") states:
>> 
>>> Also we can't handle a new operation starting close to the end of
>>> a page.
>> 
>> But it does not detail why this is the case.
> 
> That wasn't very helpful of me, sorry.
> 
>> Subtracting the scratch buffer's "shift" value from the remaining
>> stream space seems to make reserving space close to the end of the
>> buf->pages array reliable.
> 
> So, why is that?
> 
> Thinking this through:
> 
> When somebody asks for a buffer that would straddle a page boundary,
> with frag1bytes at the end of this page and frag2bytes at the start of
> the next page, we instead give them a buffer starting at the start of the
> next page.  That gives them a nice contiguous buffer to write into.
> When they're done using it, we fix things up by copying what they wrote
> back to where it should be.
> 
> That means we're temporarily wasting frag1bytes of space.  So, I don't
> know, maybe that's the logic behind subtracting frag1bytes from
> space_left.
> 
> It means you may end up with xdr->end frag1bytes short of the next page.
> I'm not sure that's right.

Why would that not be OK? The next call to xdr_get_next_encode_buffer()
should do the right thing and bounce the new encoded data from the
next page into this one again.

So far I have not encountered any problems. Would such a problem show
up with some frequency under normal use, or would it be especially
subtle?


> --b.
> 
> 
>> 
>> This change is needed to make entry encoding with struct xdr_stream,
>> introduced in a subsequent patch, work correctly when it approaches
>> the end of the dirlist buffer.
>> 
>> Fixes: 2825a7f90753 ("nfsd4: allow encoding across page boundaries")
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>> net/sunrpc/xdr.c |    2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
>> index 3964ff74ee51..043b67229792 100644
>> --- a/net/sunrpc/xdr.c
>> +++ b/net/sunrpc/xdr.c
>> @@ -978,7 +978,7 @@ static __be32 *xdr_get_next_encode_buffer(struct xdr_stream *xdr,
>> 	 * shifted this one back:
>> 	 */
>> 	xdr->p = (void *)p + frag2bytes;
>> -	space_left = xdr->buf->buflen - xdr->buf->len;
>> +	space_left = xdr->buf->buflen - xdr->buf->len - frag1bytes;
>> 	xdr->end = (void *)p + min_t(int, space_left, PAGE_SIZE);
>> 	xdr->buf->page_len += frag2bytes;
>> 	xdr->buf->len += nbytes;
>> 

--
Chuck Lever
J. Bruce Fields March 3, 2021, 4:52 p.m. UTC | #3
On Wed, Mar 03, 2021 at 03:43:28PM +0000, Chuck Lever wrote:
> Hi Bruce-
> 
> Thanks for your careful review of this series!
> 
> 
> > On Mar 2, 2021, at 5:11 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > 
> > On Mon, Mar 01, 2021 at 10:17:13AM -0500, Chuck Lever wrote:
> >> The description of commit 2825a7f90753 ("nfsd4: allow encoding
> >> across page boundaries") states:
> >> 
> >>> Also we can't handle a new operation starting close to the end of
> >>> a page.
> >> 
> >> But it does not detail why this is the case.
> > 
> > That wasn't very helpful of me, sorry.
> > 
> >> Subtracting the scratch buffer's "shift" value from the remaining
> >> stream space seems to make reserving space close to the end of the
> >> buf->pages array reliable.
> > 
> > So, why is that?
> > 
> > Thinking this through:
> > 
> > When somebody asks for a buffer that would straddle a page boundary,
> > with frag1bytes at the end of this page and frag2bytes at the start of
> > the next page, we instead give them a buffer starting at the start of the
> > next page.  That gives them a nice contiguous buffer to write into.
> > When they're done using it, we fix things up by copying what they wrote
> > back to where it should be.
> > 
> > That means we're temporarily wasting frag1bytes of space.  So, I don't
> > know, maybe that's the logic behind subtracting frag1bytes from
> > space_left.
> > 
> > It means you may end up with xdr->end frag1bytes short of the next page.

Wait, let me try that again:

	p = page_address(*xdr->page_ptr);
	xdr->p = (void *)p + frag2bytes;
	space_left = xdr->buf->buflen - xdr->buf->len - frag1bytes;
	xdr->end = (void *)p + min_t(int, space_left, PAGE_SIZE);

If you've still got a lot of buffer space left, then that'll put
xdr->end frag2bytes past the end of a page, won't it?
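
(For concreteness, with hypothetical values PAGE_SIZE = 4096,
frag1bytes = 4, frag2bytes = 8: when plenty of buffer remains,
min_t() picks PAGE_SIZE and xdr->end = p + 4096; when only 100
bytes of buflen remain, space_left = 96 and xdr->end = p + 96.)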

> > I'm not sure that's right.
> 
> > Why would that not be OK? The next call to xdr_get_next_encode_buffer()
> should do the right thing and bounce the new encoded data from the
> next page into this one again.
> 
> So far I have not encountered any problems. Would such a problem show
> up with some frequency under normal use, or would it be especially
> subtle?

I mainly just want to make sure we've got a coherent idea what this code
is doing....

For testing: large replies that aren't just read data are readdir and
getacl.  So reading large directories with lots of variably-named files
might be good. Also pynfs could easily send a single compound with lots
of variable-sized reads; that might be interesting.
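
A quick way to build such a directory (an illustrative sketch,
nothing from this series; the file count and name-length cycle
are arbitrary):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	/* Create many files whose name lengths cycle over a wide
	 * range, so READDIR replies land entry boundaries at many
	 * different offsets within the reply pages.
	 */
	int main(int argc, char **argv)
	{
		char name[256];
		int i, fd;

		if (argc != 2 || chdir(argv[1]) == -1) {
			fprintf(stderr, "usage: %s <existing-dir>\n", argv[0]);
			return 1;
		}
		for (i = 0; i < 50000; i++) {
			/* zero-pad the index to a width cycling 1..128 */
			snprintf(name, sizeof(name), "%0*d", 1 + (i % 128), i);
			fd = open(name, O_CREAT | O_WRONLY, 0644);
			if (fd == -1) {
				perror(name);
				return 1;
			}
			close(fd);
		}
		return 0;
	}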

Constructing a compound that will result in xdr_reserve_space calls that
exactly hit the various corner cases may be hard.  But maybe if we just
send a bunch of compounds that vary over some range we can still
guarantee hitting those cases.

--b.
Chuck Lever March 3, 2021, 6:19 p.m. UTC | #4
> On Mar 3, 2021, at 11:52 AM, Bruce Fields <bfields@fieldses.org> wrote:
> 
> On Wed, Mar 03, 2021 at 03:43:28PM +0000, Chuck Lever wrote:
>> Hi Bruce-
>> 
>> Thanks for your careful review of this series!
>> 
>> 
>>> On Mar 2, 2021, at 5:11 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>> 
>>> On Mon, Mar 01, 2021 at 10:17:13AM -0500, Chuck Lever wrote:
>>>> The description of commit 2825a7f90753 ("nfsd4: allow encoding
>>>> across page boundaries") states:
>>>> 
>>>>> Also we can't handle a new operation starting close to the end of
>>>>> a page.
>>>> 
>>>> But it does not detail why this is the case.
>>> 
>>> That wasn't very helpful of me, sorry.
>>> 
>>>> Subtracting the scratch buffer's "shift" value from the remaining
>>>> stream space seems to make reserving space close to the end of the
>>>> buf->pages array reliable.
>>> 
>>> So, why is that?
>>> 
>>> Thinking this through:
>>> 
>>> When somebody asks for a buffer that would straddle a page boundary,
>>> with frag1bytes at the end of this page and frag2bytes at the start of
>>> the next page, we instead give them a buffer starting at the start of the
>>> next page.  That gives them a nice contiguous buffer to write into.
>>> When they're done using it, we fix things up by copying what they wrote
>>> back to where it should be.
>>> 
>>> That means we're temporarily wasting frag1bytes of space.  So, I don't
>>> know, maybe that's the logic behind subtracting frag1bytes from
>>> space_left.
>>> 
>>> It means you may end up with xdr->end frag1bytes short of the next page.
> 
> Wait, let me try that again:
> 
> 	p = page_address(*xdr->page_ptr);
> 	xdr->p = (void *)p + frag2bytes;
> 	space_left = xdr->buf->buflen - xdr->buf->len - frag1bytes;
> 	xdr->end = (void *)p + min_t(int, space_left, PAGE_SIZE);
> 
> If you've still got a lot of buffer space left, then that'll put
> xdr->end frag2bytes past the end of a page, won't it?

So far I haven't heard any complaints from the kernel. I'll
instrument the code and see what comes up.


>>> I'm not sure that's right.
>> 
>> Why would that not be OK? The next call to xdr_get_next_encode_buffer()
>> should do the right thing and bounce the new encoded data from the
>> next page into this one again.
>> 
>> So far I have not encountered any problems. Would such a problem show
>> up with some frequency under normal use, or would it be especially
>> subtle?
> 
> I mainly just want to make sure we've got a coherent idea what this code
> is doing....

Agreed, that's a good thing.


> For testing: large replies that aren't just read data are readdir and
> getacl.  So reading large directories with lots of variably-named files
> might be good. Also pynfs could easily send a single compound with lots
> of variable-sized reads; that might be interesting.

I've run the full git regression suite over NFSv3 many many times with
this patch applied. That includes unpacking via tar, a full build from
scratch, and then running thousands of individual tests.

So that doesn't exercise a particular corner case, but it does reflect
a broad variety of directory operations.


> Constructing a compound that will result in xdr_reserve_space calls that
> exactly hit the various corner cases may be hard.  But maybe if we just
> send a bunch of compounds that vary over some range we can still
> guarantee hitting those cases.
> 
> --b.

--
Chuck Lever
J. Bruce Fields March 3, 2021, 6:25 p.m. UTC | #5
On Wed, Mar 03, 2021 at 06:19:33PM +0000, Chuck Lever wrote:
> 
> 
> > On Mar 3, 2021, at 11:52 AM, Bruce Fields <bfields@fieldses.org> wrote:
> > 
> > On Wed, Mar 03, 2021 at 03:43:28PM +0000, Chuck Lever wrote:
> >> Why would that not be OK? The next call to xdr_get_next_encode_buffer()
> >> should do the right thing and bounce the new encoded data from the
> >> next page into this one again.
> >> 
> >> So far I have not encountered any problems. Would such a problem show
> >> up with some frequency under normal use, or would it be especially
> >> subtle?
> > 
> > I mainly just want to make sure we've got a coherent idea what this code
> > is doing....
> 
> Agreed, that's a good thing.

I'm also a little vague on what exactly the problem is you're running
into.  (Probably because I haven't really looked at the v3 readdir
encoding.)

Is it approaching the end of a page, or is it running out of buflen?
How exactly does it fail?

--b.

> 
> 
> > For testing: large replies that aren't just read data are readdir and
> > getacl.  So reading large directories with lots of variably-named files
> > might be good. Also pynfs could easily send a single compound with lots
> > of variable-sized reads; that might be interesting.
> 
> I've run the full git regression suite over NFSv3 many many times with
> this patch applied. That includes unpacking via tar, a full build from
> scratch, and then running thousands of individual tests.
> 
> So that doesn't exercise a particular corner case, but it does reflect
> a broad variety of directory operations.
> 
> 
> > Constructing a compound that will result in xdr_reserve_space calls that
> > exactly hit the various corner cases may be hard.  But maybe if we just
> > send a bunch of compounds that vary over some range we can still
> > guarantee hitting those cases.
> > 
> > --b.
> 
> --
> Chuck Lever
> 
>
Chuck Lever March 3, 2021, 6:27 p.m. UTC | #6
> On Mar 3, 2021, at 1:25 PM, Bruce Fields <bfields@fieldses.org> wrote:
> 
> On Wed, Mar 03, 2021 at 06:19:33PM +0000, Chuck Lever wrote:
>> 
>> 
>>> On Mar 3, 2021, at 11:52 AM, Bruce Fields <bfields@fieldses.org> wrote:
>>> 
>>> On Wed, Mar 03, 2021 at 03:43:28PM +0000, Chuck Lever wrote:
>>>> Why would that not be OK? The next call to xdr_get_next_encode_buffer()
>>>> should do the right thing and bounce the new encoded data from the
>>>> next page into this one again.
>>>> 
>>>> So far I have not encountered any problems. Would such a problem show
>>>> up with some frequency under normal use, or would it be especially
>>>> subtle?
>>> 
>>> I mainly just want to make sure we've got a coherent idea what this code
>>> is doing....
>> 
>> Agreed, that's a good thing.
> 
> I'm also a little vague on what exactly the problem is you're running
> into.  (Probably because I haven't really looked at the v3 readdir
> encoding.)
> 
> Is it approaching the end of a page, or is it running out of buflen?
> How exactly does it fail?

I don't recall exactly; it was late last summer when I wrote all these.

Approaching the end of a page, IIRC, the unpatched code would leave
a gap in the directory entry stream.


> --b.
> 
>> 
>> 
>>> For testing: large replies that aren't just read data are readdir and
>>> getacl.  So reading large directories with lots of variably-named files
>>> might be good. Also pynfs could easily send a single compound with lots
>>> of variable-sized reads; that might be interesting.
>> 
>> I've run the full git regression suite over NFSv3 many many times with
>> this patch applied. That includes unpacking via tar, a full build from
>> scratch, and then running thousands of individual tests.
>> 
>> So that doesn't exercise a particular corner case, but it does reflect
>> a broad variety of directory operations.
>> 
>> 
>>> Constructing a compound that will result in xdr_reserve_space calls that
>>> exactly hit the various corner cases may be hard.  But maybe if we just
>>> send a bunch of compounds that vary over some range we can still
>>> guarantee hitting those cases.
>>> 
>>> --b.
>> 
>> --
>> Chuck Lever

--
Chuck Lever
Chuck Lever March 3, 2021, 6:30 p.m. UTC | #7
> On Mar 3, 2021, at 1:27 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> 
> 
> 
>> On Mar 3, 2021, at 1:25 PM, Bruce Fields <bfields@fieldses.org> wrote:
>> 
>> On Wed, Mar 03, 2021 at 06:19:33PM +0000, Chuck Lever wrote:
>>> 
>>> 
>>>> On Mar 3, 2021, at 11:52 AM, Bruce Fields <bfields@fieldses.org> wrote:
>>>> 
>>>> On Wed, Mar 03, 2021 at 03:43:28PM +0000, Chuck Lever wrote:
>>>>> Why would that not be OK? The next call to xdr_get_next_encode_buffer()
>>>>> should do the right thing and bounce the new encoded data from the
>>>>> next page into this one again.
>>>>> 
>>>>> So far I have not encountered any problems. Would such a problem show
>>>>> up with some frequency under normal use, or would it be especially
>>>>> subtle?
>>>> 
>>>> I mainly just want to make sure we've got a coherent idea what this code
>>>> is doing....
>>> 
>>> Agreed, that's a good thing.
>> 
>> I'm also a little vague on what exactly the problem is you're running
>> into.  (Probably because I haven't really looked at the v3 readdir
>> encoding.)
>> 
>> Is it approaching the end of a page, or is it running out of buflen?
>> How exactly does it fail?
> 
> I don't recall exactly; it was late last summer when I wrote all these.
> 
> Approaching the end of a page, IIRC, the unpatched code would leave
> a gap in the directory entry stream.

Well, when I converted the entry encoders to use xdr_stream, it would
have a problem around the end of a page. The existing encoders are
open-coded to deal with this case.
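
Roughly, the open-coded pattern is (a paraphrase, not the literal
fs/nfsd/nfs3xdr.c code; the variable names here are illustrative):

	/* does the rest of this entry fit in the current page? */
	if (buffer + bytes <= curr_page_addr + PAGE_SIZE) {
		memcpy(buffer, data, bytes);
	} else {
		/* split the copy across the page boundary by hand */
		size_t part = curr_page_addr + PAGE_SIZE - buffer;

		memcpy(buffer, data, part);
		buffer = page_address(*(++page_ptr));
		memcpy(buffer, data + part, bytes - part);
	}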


>> --b.
>> 
>>> 
>>> 
>>>> For testing: large replies that aren't just read data are readdir and
>>>> getacl.  So reading large directories with lots of variably-named files
>>>> might be good. Also pynfs could easily send a single compound with lots
>>>> of variable-sized reads; that might be interesting.
>>> 
>>> I've run the full git regression suite over NFSv3 many many times with
>>> this patch applied. That includes unpacking via tar, a full build from
>>> scratch, and then running thousands of individual tests.
>>> 
>>> So that doesn't exercise a particular corner case, but it does reflect
>>> a broad variety of directory operations.
>>> 
>>> 
>>>> Constructing a compound that will result in xdr_reserve_space calls that
>>>> exactly hit the various corner cases may be hard.  But maybe if we just
>>>> send a bunch of compounds that vary over some range we can still
>>>> guarantee hitting those cases.
>>>> 
>>>> --b.
>>> 
>>> --
>>> Chuck Lever
> 
> --
> Chuck Lever

--
Chuck Lever
J. Bruce Fields March 3, 2021, 6:38 p.m. UTC | #8
On Wed, Mar 03, 2021 at 06:30:40PM +0000, Chuck Lever wrote:
> 
> 
> > On Mar 3, 2021, at 1:27 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> > 
> > 
> > 
> >> On Mar 3, 2021, at 1:25 PM, Bruce Fields <bfields@fieldses.org> wrote:
> >> 
> >> On Wed, Mar 03, 2021 at 06:19:33PM +0000, Chuck Lever wrote:
> >>> 
> >>> 
> >>>> On Mar 3, 2021, at 11:52 AM, Bruce Fields <bfields@fieldses.org> wrote:
> >>>> 
> >>>> On Wed, Mar 03, 2021 at 03:43:28PM +0000, Chuck Lever wrote:
> >>>>> Why would that not be OK? The next call to xdr_get_next_encode_buffer()
> >>>>> should do the right thing and bounce the new encoded data from the
> >>>>> next page into this one again.
> >>>>> 
> >>>>> So far I have not encountered any problems. Would such a problem show
> >>>>> up with some frequency under normal use, or would it be especially
> >>>>> subtle?
> >>>> 
> >>>> I mainly just want to make sure we've got a coherent idea what this code
> >>>> is doing....
> >>> 
> >>> Agreed, that's a good thing.
> >> 
> >> I'm also a little vague on what exactly the problem is you're running
> >> into.  (Probably because I haven't really looked at the v3 readdir
> >> encoding.)
> >> 
> >> Is it approaching the end of a page, or is it running out of buflen?
> >> How exactly does it fail?
> > 
> > I don't recall exactly; it was late last summer when I wrote all these.
> > 
> > Approaching the end of a page, IIRC, the unpatched code would leave
> > a gap in the directory entry stream.
> 
> Well, when I converted the entry encoders to use xdr_stream, it would
> have a problem around the end of a page. The existing encoders are
> open-coded to deal with this case.

We're not seeing v4 readdir bugs, though; I wonder what's different?

--b.
Chuck Lever March 3, 2021, 6:42 p.m. UTC | #9
> On Mar 3, 2021, at 1:38 PM, Bruce Fields <bfields@fieldses.org> wrote:
> 
> On Wed, Mar 03, 2021 at 06:30:40PM +0000, Chuck Lever wrote:
>> 
>> 
>>> On Mar 3, 2021, at 1:27 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>> 
>>> 
>>> 
>>>> On Mar 3, 2021, at 1:25 PM, Bruce Fields <bfields@fieldses.org> wrote:
>>>> 
>>>> On Wed, Mar 03, 2021 at 06:19:33PM +0000, Chuck Lever wrote:
>>>>> 
>>>>> 
>>>>>> On Mar 3, 2021, at 11:52 AM, Bruce Fields <bfields@fieldses.org> wrote:
>>>>>> 
>>>>>> On Wed, Mar 03, 2021 at 03:43:28PM +0000, Chuck Lever wrote:
>>>>>>> Why would that not be OK? The next call to xdr_get_next_encode_buffer()
>>>>>>> should do the right thing and bounce the new encoded data from the
>>>>>>> next page into this one again.
>>>>>>> 
>>>>>>> So far I have not encountered any problems. Would such a problem show
>>>>>>> up with some frequency under normal use, or would it be especially
>>>>>>> subtle?
>>>>>> 
>>>>>> I mainly just want to make sure we've got a coherent idea what this code
>>>>>> is doing....
>>>>> 
>>>>> Agreed, that's a good thing.
>>>> 
>>>> I'm also a little vague on what exactly the problem is you're running
>>>> into.  (Probably because I haven't really looked at the v3 readdir
>>>> encoding.)
>>>> 
>>>> Is it approaching the end of a page, or is it running out of buflen?
>>>> How exactly does it fail?
>>> 
>>> I don't recall exactly; it was late last summer when I wrote all these.
>>> 
>>> Approaching the end of a page, IIRC, the unpatched code would leave
>>> a gap in the directory entry stream.
>> 
>> Well, when I converted the entry encoders to use xdr_stream, it would
>> have a problem around the end of a page. The existing encoders are
>> open-coded to deal with this case.
> 
> We're not seeing v4 readdir bugs, though; I wonder what's different?

It's a small patch, easy to revert and find out with your own preferred
tests.


--
Chuck Lever

Patch

diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
index 3964ff74ee51..043b67229792 100644
--- a/net/sunrpc/xdr.c
+++ b/net/sunrpc/xdr.c
@@ -978,7 +978,7 @@ static __be32 *xdr_get_next_encode_buffer(struct xdr_stream *xdr,
 	 * shifted this one back:
 	 */
 	xdr->p = (void *)p + frag2bytes;
-	space_left = xdr->buf->buflen - xdr->buf->len;
+	space_left = xdr->buf->buflen - xdr->buf->len - frag1bytes;
 	xdr->end = (void *)p + min_t(int, space_left, PAGE_SIZE);
 	xdr->buf->page_len += frag2bytes;
 	xdr->buf->len += nbytes;