[RFC] fix short copy handling in copy_mc_pipe_to_iter()

Message ID	YqaAcKsd6uGfIQzM@zeniv-ca.linux.org.uk (mailing list archive)
State	Accepted
Commit	c3497fd009ef2c59eea60d21c3ac22de3585ed7d
Headers	show Received: from zeniv-ca.linux.org.uk (zeniv-ca.linux.org.uk [142.44.231.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 13557EA0 for <nvdimm@lists.linux.dev>; Mon, 13 Jun 2022 00:10:34 +0000 (UTC) Date: Mon, 13 Jun 2022 00:10:24 +0000 From: Al Viro <viro@zeniv.linux.org.uk> To: Dan Williams <dan.j.williams@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev Subject: [RFC][PATCH] fix short copy handling in copy_mc_pipe_to_iter() Message-ID: <YqaAcKsd6uGfIQzM@zeniv-ca.linux.org.uk> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Sender: Al Viro <viro@ftp.linux.org.uk>
Series	[RFC] fix short copy handling in copy_mc_pipe_to_iter() \| expand [RFC] fix short copy handling in copy_mc_pipe_to_iter()

Al Viro June 13, 2022, 12:10 a.m. UTC

[commit in question sits in vfs.git#fixes]

Unlike other copying operations on ITER_PIPE, copy_mc_to_iter() can
result in a short copy.  In that case we need to trim the unused
buffers, as well as the length of partially filled one - it's not
enough to set ->head, ->iov_offset and ->count to reflect how
much had we copied.  Not hard to fix, fortunately...

I'd put a helper (pipe_discard_from(pipe, head)) into pipe_fs_i.h,
rather than iov_iter.c - it has nothing to do with iov_iter and
having it will allow us to avoid an ugly kludge in fs/splice.c.
We could put it into lib/iov_iter.c for now and move it later,
but I don't see the point going that way...

Fixes: ca146f6f091e "lib/iov_iter: Fix pipe handling in _copy_to_iter_mcsafe()"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---

Linus Torvalds June 13, 2022, 5:54 p.m. UTC | #1

On Sun, Jun 12, 2022 at 5:10 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Unlike other copying operations on ITER_PIPE, copy_mc_to_iter() can
> result in a short copy.  In that case we need to trim the unused
> buffers, as well as the length of partially filled one - it's not
> enough to set ->head, ->iov_offset and ->count to reflect how
> much had we copied.  Not hard to fix, fortunately...
>
> I'd put a helper (pipe_discard_from(pipe, head)) into pipe_fs_i.h,
> rather than iov_iter.c -

Actually, since this "copy_mc_xyz()" stuff is going to be entirely
impossible to debug and replicate for any normal situation, I would
suggest we take the approach that we (long ago) used to take with
copy_from_user(): zero out the destination buffer, so that developers
that can't test the faulting behavior don't have to worry about it.

And then the existing code is fine: it will break out of the loop, but
it won't do the odd revert games and the "randomnoise.len -= rem"
thing that I can't wrap my head around.

Hmm?

                Linus

Al Viro June 13, 2022, 10:28 p.m. UTC | #2

On Mon, Jun 13, 2022 at 10:54:36AM -0700, Linus Torvalds wrote:
> On Sun, Jun 12, 2022 at 5:10 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Unlike other copying operations on ITER_PIPE, copy_mc_to_iter() can
> > result in a short copy.  In that case we need to trim the unused
> > buffers, as well as the length of partially filled one - it's not
> > enough to set ->head, ->iov_offset and ->count to reflect how
> > much had we copied.  Not hard to fix, fortunately...
> >
> > I'd put a helper (pipe_discard_from(pipe, head)) into pipe_fs_i.h,
> > rather than iov_iter.c -
> 
> Actually, since this "copy_mc_xyz()" stuff is going to be entirely
> impossible to debug and replicate for any normal situation, I would
> suggest we take the approach that we (long ago) used to take with
> copy_from_user(): zero out the destination buffer, so that developers
> that can't test the faulting behavior don't have to worry about it.
> 
> And then the existing code is fine: it will break out of the loop, but
> it won't do the odd revert games and the "randomnoise.len -= rem"
> thing that I can't wrap my head around.
> 
> Hmm?

Not really - we would need to zero the rest of those pages somehow.
They are already allocated and linked into pipe; leaving them
there (and subsequent ones hadn't seen any stores whatsoever - they
are fresh out of alloc_page(GFP_USER)) is a non-starter.

We could do allocation as we go, but that's a much more intrusive
change...

BTW, speaking of pipes:
static inline unsigned int pipe_space_for_user(unsigned int head, unsigned int tail,
                                               struct pipe_inode_info *pipe)
{
        unsigned int p_occupancy, p_space;

        p_occupancy = pipe_occupancy(head, tail);
        if (p_occupancy >= pipe->max_usage)
                return 0;
        p_space = pipe->ring_size - p_occupancy;
        if (p_space > pipe->max_usage)
                p_space = pipe->max_usage;
        return p_space;
}

OK, if head - tail >= max_usage, we get 0.  Fair enough, since
pipe_full() callers will get "it's full, sod off" in that situation.
But...  what the hell is the rest doing?  p_space is the amount of
slots not in use.  So we return the lesser of it and max_usage?

Suppose we have 128 slots in the ring, with max_usage being below
that (e.g. 64).  63 slots are in use; you can add at most one.
And p_space is 65, so this sucker will return 64.

Dave, could you explain what's going on there?  Note that pipe_write()
does *not* use that thing at all; it's only splice (i.e. ITER_PIPE
stuff) that is using it.

What's wrong with
        p_occupancy = pipe_occupancy(head, tail);
        if (p_occupancy >= pipe->max_usage)
                return 0;
	else
		return pipe->max_usage - p_occupancy;

which would match the way you are using ->max_usage in pipe_write()
et.al.  Including the use in copy_page_to_iter_pipe(), BTW...

Al Viro June 13, 2022, 11:25 p.m. UTC | #3

On Mon, Jun 13, 2022 at 11:28:34PM +0100, Al Viro wrote:

> Dave, could you explain what's going on there?  Note that pipe_write()
> does *not* use that thing at all; it's only splice (i.e. ITER_PIPE
> stuff) that is using it.
> 
> What's wrong with
>         p_occupancy = pipe_occupancy(head, tail);
>         if (p_occupancy >= pipe->max_usage)
>                 return 0;
> 	else
> 		return pipe->max_usage - p_occupancy;
> 
> which would match the way you are using ->max_usage in pipe_write()
> et.al.  Including the use in copy_page_to_iter_pipe(), BTW...

The more I'm looking at that thing, the more it smells like a bug;
it had the same 3 callers since the time it had been introduced.

1) pipe_get_pages().  We are about to try and allocate up to that
many pipe buffers.  Allocation (done in push_pipe()) is done only
if we have !pipe_full(pipe->head, pipe->tail, pipe->max_usage).

It simply won't give you more than max_usage - occupancy.
Your function returns min(ring_size - occupancy, max_usage), which
is always greater than or equal to that (ring_size >= max_usage).

2) pipe_get_pages_alloc().  Same story, same push_pipe() being
called, same "we'll never get that much - it'll hit the limit
first".

3) iov_iter_npages() in case of ITER_PIPE.  Again, the value
is bogus - it should not be greater than the amount of pages
we would be able to write there.

AFAICS, 6718b6f855a0 "pipe: Allow pipes to have kernel-reserved slots"
broke it for cases when ring_size != max_usage...

Al Viro June 13, 2022, 11:34 p.m. UTC | #4

On Tue, Jun 14, 2022 at 12:25:03AM +0100, Al Viro wrote:

> The more I'm looking at that thing, the more it smells like a bug;
> it had the same 3 callers since the time it had been introduced.
> 
> 1) pipe_get_pages().  We are about to try and allocate up to that
> many pipe buffers.  Allocation (done in push_pipe()) is done only
> if we have !pipe_full(pipe->head, pipe->tail, pipe->max_usage).
> 
> It simply won't give you more than max_usage - occupancy.
> Your function returns min(ring_size - occupancy, max_usage), which
> is always greater than or equal to that (ring_size >= max_usage).
> 
> 2) pipe_get_pages_alloc().  Same story, same push_pipe() being
> called, same "we'll never get that much - it'll hit the limit
> first".
> 
> 3) iov_iter_npages() in case of ITER_PIPE.  Again, the value
> is bogus - it should not be greater than the amount of pages
> we would be able to write there.
> 
> AFAICS, 6718b6f855a0 "pipe: Allow pipes to have kernel-reserved slots"
> broke it for cases when ring_size != max_usage...

Unless I'm missing something, the following would do the right thing.
Dave?

diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 4ea496924106..c22173d6e500 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -165,15 +165,10 @@ static inline bool pipe_full(unsigned int head, unsigned int tail,
 static inline unsigned int pipe_space_for_user(unsigned int head, unsigned int tail,
 					       struct pipe_inode_info *pipe)
 {
-	unsigned int p_occupancy, p_space;
-
-	p_occupancy = pipe_occupancy(head, tail);
+	unsigned int p_occupancy = pipe_occupancy(head, tail);
 	if (p_occupancy >= pipe->max_usage)
 		return 0;
-	p_space = pipe->ring_size - p_occupancy;
-	if (p_space > pipe->max_usage)
-		p_space = pipe->max_usage;
-	return p_space;
+	return pipe->max_usage - p_occupancy;
 }
 
 /**

Al Viro June 14, 2022, 12:53 a.m. UTC | #5

On Mon, Jun 13, 2022 at 11:28:34PM +0100, Al Viro wrote:
> On Mon, Jun 13, 2022 at 10:54:36AM -0700, Linus Torvalds wrote:
> > On Sun, Jun 12, 2022 at 5:10 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > Unlike other copying operations on ITER_PIPE, copy_mc_to_iter() can
> > > result in a short copy.  In that case we need to trim the unused
> > > buffers, as well as the length of partially filled one - it's not
> > > enough to set ->head, ->iov_offset and ->count to reflect how
> > > much had we copied.  Not hard to fix, fortunately...
> > >
> > > I'd put a helper (pipe_discard_from(pipe, head)) into pipe_fs_i.h,
> > > rather than iov_iter.c -
> > 
> > Actually, since this "copy_mc_xyz()" stuff is going to be entirely
> > impossible to debug and replicate for any normal situation, I would
> > suggest we take the approach that we (long ago) used to take with
> > copy_from_user(): zero out the destination buffer, so that developers
> > that can't test the faulting behavior don't have to worry about it.
> > 
> > And then the existing code is fine: it will break out of the loop, but
> > it won't do the odd revert games and the "randomnoise.len -= rem"
> > thing that I can't wrap my head around.
> > 
> > Hmm?
> 
> Not really - we would need to zero the rest of those pages somehow.
> They are already allocated and linked into pipe; leaving them
> there (and subsequent ones hadn't seen any stores whatsoever - they
> are fresh out of alloc_page(GFP_USER)) is a non-starter.
> 
> We could do allocation as we go, but that's a much more intrusive
> change...

FWIW, I've got quite a bit of cleanups in the local tree; reordering and
cleaning that queue up at the moment, will post tonight or tomorrow.

I've looked into doing allocations page-by-page (instead of single
push_pipe(), followed by copying into those).  Doable, but it ends
up being much messier.

IMO this "truncate on failure" approach is saner.

David Howells June 14, 2022, 6:36 a.m. UTC | #6

Al Viro <viro@zeniv.linux.org.uk> wrote:

> What's wrong with
>         p_occupancy = pipe_occupancy(head, tail);
>         if (p_occupancy >= pipe->max_usage)
>                 return 0;
> 	else
> 		return pipe->max_usage - p_occupancy;

Because "pipe->max_usage - p_occupancy" can be negative.

post_one_notification() is limited by pipe->ring_size, not pipe->max_usage.

The idea is to allow some slack in a watch pipe for the watch_queue code to
use that userspace can't.

David

Al Viro June 14, 2022, 12:11 p.m. UTC | #7

On Tue, Jun 14, 2022 at 07:36:19AM +0100, David Howells wrote:
> Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> > What's wrong with
> >         p_occupancy = pipe_occupancy(head, tail);
> >         if (p_occupancy >= pipe->max_usage)
> >                 return 0;
> > 	else
> > 		return pipe->max_usage - p_occupancy;
> 
> Because "pipe->max_usage - p_occupancy" can be negative.

Sure can.  And in that case you return 0; no problem wiht that.
It's what happens when occupancy is below max_usage that is weird.

> post_one_notification() is limited by pipe->ring_size, not pipe->max_usage.
> 
> The idea is to allow some slack in a watch pipe for the watch_queue code to
> use that userspace can't.

Sure.  And if this function is supposed to report how many times would
userspace be able to grab a slot, it's returning the wrong value.

Look: 32-slot ring.  max_usage is 16.  14 slots are already occupied.
Userland (sure as hell, anything in iov_iter.c) will be able to occupy
two more before it runs into the pipe_full().  And your function returns
min(32 - 14, 16), i.e. 16.

What am I missing here?

Al Viro June 14, 2022, 5:12 p.m. UTC | #8

On Tue, Jun 14, 2022 at 01:53:50AM +0100, Al Viro wrote:

> FWIW, I've got quite a bit of cleanups in the local tree; reordering and
> cleaning that queue up at the moment, will post tonight or tomorrow.
> 
> I've looked into doing allocations page-by-page (instead of single
> push_pipe(), followed by copying into those).  Doable, but it ends
> up being much messier.

Hmm...  Maybe not - a possible interface would be
	append_pipe(iter, size, &off)

that would either do kmap_local_page() on the last buffer (if it's
anonymous and has space in it) or allocated and mapped a page and
added a new buffer.  Returning the mapped address and offset from it.
Then these loops would looks like this:

	while (left) {
		p = append_pipe(iter, left, &off);
		if (!p)
			break;
		chunk = min(left, PAGE_SIZE - off);
		rem = copy(p + off, whatever, chunk);
		chunk -= rem;
		kunmap_local(p);

		copied += chunk;
		left -= chunk;

		if (unlikely(rem)) {
			pipe_revert(i, rem);
			break;
		}
	}
	return copied;

with no push_pipe() used at all.  For operations that can't fail,
the things are simplified in an obvious way (rem is always 0).

Or we could have append_pipe() return a struct page * and leave
kmap_local_page() to the caller...

struct page *append_pipe(struct iov_iter *i, size_t size, unsigned *off)
{
	struct pipe_inode_info *pipe = i->pipe;
	unsigned offset = i->iov_offset;
	struct page_buffer *buf;
	struct page *page;

	if (offset && offset < PAGE_SIZE) {
		// some space in the last buffer; can we add to it?
		buf = pipe_buf(pipe, pipe->head - 1);
		if (allocated(buf)) {
			size = min(size, PAGE_SIZE - offset);
			buf->len += size;
			i->iov_offset += size;
			i->count -= size;
			*off = offset;
			return buf->page;	// or kmap_local_page(...)
		}
	}
	// OK, we need a new buffer
	size = min(size, PAGE_SIZE);
	if (pipe_full(.....))
		return NULL;
	page = alloc_page(GFP_USER);
	if (!page)
		return NULL;
	// got it...
	buf = pipe_buf(pipe, pipe->head++);
	*buf = (struct pipe_buffer){.ops = &default_pipe_buf_ops,
				    .page = page, .len = size };
	i->head = pipe->head - 1;
	i->iov_offset = size;
	i->count -= size;
	*off = 0;
	return page;	 // or kmap_local_page(...)
}

(matter of fact, the last part could use another helper in my tree - there
the tail would be
	// OK, we need a new buffer
	size = min(size, PAGE_SIZE);
	page = push_anon(pipe, size);
	if (!page)
		return NULL;
	i->head = pipe->head - 1;
	i->iov_offset = size;
	i->count -= size;
	*off = 0;
	return page;
)

Would that be readable enough from your POV?  That way push_pipe()
loses almost all callers and after the "make iov_iter_get_pages()
advancing" part of the series it simply goes away...

It's obviously too intrusive for backports, though - there I'd very much
prefer the variant I posted.

Comments?

PS: re local helpers:

static inline struct pipe_buffer *pipe_buf(const struct pipe_inode_info *pipe,
                                           unsigned int slot)
{
	return &pipe->bufs[slot & (pipe->ring_size - 1)];
}

pretty much all places where we cache pipe->ring_size - 1 had been
absolutely pointless; there are several exceptions, but back in 2019
"pipe: Use head and tail pointers for the ring, not cursor and length"
went overboard with microoptimizations...

Dan Williams June 16, 2022, 9:22 p.m. UTC | #9

[ add Tony and Boris ]

Al Viro wrote:
> [commit in question sits in vfs.git#fixes]
> 
> Unlike other copying operations on ITER_PIPE, copy_mc_to_iter() can
> result in a short copy.  In that case we need to trim the unused
> buffers, as well as the length of partially filled one - it's not
> enough to set ->head, ->iov_offset and ->count to reflect how
> much had we copied.  Not hard to fix, fortunately...
> 
> I'd put a helper (pipe_discard_from(pipe, head)) into pipe_fs_i.h,
> rather than iov_iter.c - it has nothing to do with iov_iter and
> having it will allow us to avoid an ugly kludge in fs/splice.c.
> We could put it into lib/iov_iter.c for now and move it later,
> but I don't see the point going that way...

Apologies for the delay in responding (reworking my email workflow after
a loss of Gmail access for my intel.com address). This looks good to me:

Acked-by: Dan Williams <dan.j.williams@intel.com>

...and I also share the concern from Linus about the lack of testing
this gets outside of systems with the necessary hardware/firmware to do
error injection testing.

Boris and I had agreed to remove some software error injection machinery
for copy_mc_* in commit 3adb776384f2 ("x86, libnvdimm/test: Remove
COPY_MC_TEST"). Is there an appetite to see some of that return and
write a regression test for this bug?

> 
> Fixes: ca146f6f091e "lib/iov_iter: Fix pipe handling in _copy_to_iter_mcsafe()"
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
> index cb0fd633a610..4ea496924106 100644
> --- a/include/linux/pipe_fs_i.h
> +++ b/include/linux/pipe_fs_i.h
> @@ -229,6 +229,15 @@ static inline bool pipe_buf_try_steal(struct pipe_inode_info *pipe,
>  	return buf->ops->try_steal(pipe, buf);
>  }
>  
> +static inline void pipe_discard_from(struct pipe_inode_info *pipe,
> +		unsigned int old_head)
> +{
> +	unsigned int mask = pipe->ring_size - 1;
> +
> +	while (pipe->head > old_head)
> +		pipe_buf_release(pipe, &pipe->bufs[--pipe->head & mask]);
> +}
> +
>  /* Differs from PIPE_BUF in that PIPE_SIZE is the length of the actual
>     memory allocation, whereas PIPE_BUF makes atomicity guarantees.  */
>  #define PIPE_SIZE		PAGE_SIZE
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 0b64695ab632..2bf20b48a04a 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -689,6 +689,7 @@ static size_t copy_mc_pipe_to_iter(const void *addr, size_t bytes,
>  	struct pipe_inode_info *pipe = i->pipe;
>  	unsigned int p_mask = pipe->ring_size - 1;
>  	unsigned int i_head;
> +	unsigned int valid = pipe->head;
>  	size_t n, off, xfer = 0;
>  
>  	if (!sanity(i))
> @@ -702,11 +703,17 @@ static size_t copy_mc_pipe_to_iter(const void *addr, size_t bytes,
>  		rem = copy_mc_to_kernel(p + off, addr + xfer, chunk);
>  		chunk -= rem;
>  		kunmap_local(p);
> -		i->head = i_head;
> -		i->iov_offset = off + chunk;
> -		xfer += chunk;
> -		if (rem)
> +		if (chunk) {
> +			i->head = i_head;
> +			i->iov_offset = off + chunk;
> +			xfer += chunk;
> +			valid = i_head + 1;
> +		}
> +		if (rem) {
> +			pipe->bufs[i_head & p_mask].len -= rem;
> +			pipe_discard_from(pipe, valid);
>  			break;
> +		}
>  		n -= chunk;
>  		off = 0;
>  		i_head++;

[RFC] fix short copy handling in copy_mc_pipe_to_iter()

Commit Message

Comments

Patch