[04/11] splice: lift pipe_lock out of splice_to_pipe()
diff mbox

Message ID 20160923190326.GB2356@ZenIV.linux.org.uk
State New
Headers show

Commit Message

Al Viro Sept. 23, 2016, 7:03 p.m. UTC
* splice_to_pipe() stops at pipe overflow and does *not* take pipe_lock
* ->splice_read() instances do the same
* vmsplice_to_pipe() and do_splice() (ultimate callers of splice_to_pipe())
  arrange for waiting, looping, etc. themselves.

That should make pipe_lock the outermost one.

Unfortunately, existing rules for the amount passed by vmsplice_to_pipe()
and do_splice() are quite ugly _and_ userland code can be easily broken
by changing those.  It's not even "no more than the maximal capacity of
this pipe" - it's "once we'd fed pipe->nr_buffers pages into the pipe,
leave instead of waiting".  I would like to change it to something saner,
but that's for later.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/fuse/dev.c |   2 -
 fs/splice.c   | 171 ++++++++++++++++++++++++++++++++--------------------------
 2 files changed, 96 insertions(+), 77 deletions(-)

Comments

Linus Torvalds Sept. 23, 2016, 7:45 p.m. UTC | #1
On Fri, Sep 23, 2016 at 12:03 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> @@ -1421,8 +1406,25 @@ static long do_splice(struct file *in, loff_t __user *off_in,
> +               ret = 0;
> +               pipe_lock(opipe);
> +               bogus_count = opipe->buffers;
> +               do {
> +                       bogus_count += opipe->nrbufs;
> +                       ret = do_splice_to(in, &offset, opipe, len, flags);
> +                       if (ret > 0) {
> +                               total += ret;
> +                               len -= ret;
> +                       }
> +                       bogus_count -= opipe->nrbufs;
> +                       if (bogus_count <= 0)
> +                               break;

I was like "oh, I'm sure this is some temporary hack, it will be gone
by the end of the series".

It wasn't gone by the end.

There's two copies of that pattern, and at the very least it needs a
big comment about what this pattern does and why.

But other than that reaction, I didn't get any hives from this. I
didn't *test* it, only looking at patches, but no red flags I could
notice.

               Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Sept. 23, 2016, 8:10 p.m. UTC | #2
On Fri, Sep 23, 2016 at 12:45:53PM -0700, Linus Torvalds wrote:

> I was like "oh, I'm sure this is some temporary hack, it will be gone
> by the end of the series".
> 
> It wasn't gone by the end.
> 
> There's two copies of that pattern, and at the very least it needs a
> big comment about what this pattern does and why.

The thing is, I'm not sure what to do with it; it was brought by the LTP
vmsplice test, which asks to feed 128Kb into a pipe.  With the caller
itself on the other end of that pipe, SPLICE_F_NONBLOCK *not* given and
the pipe capacity being 64Kb.  Unfortunately, "quietly truncate the
length down to 64Kb" does *not* suffice - the damn thing starts not at
the page boundary, so we only copy about 62Kb until hitting the pipe
overflow (the pipe is initially empty).  The reason why it doesn't go
to sleep indefinitely on the mainline kernel is that mainline collects
up to page->buffers *pages*, before feeding them into the pipe.  And these
~62Kb are just that.  Note that had there been anything already in the
pipe, the same call would've gone to sleep (and in the end transferred the
same ~62Kb worth of data).

All of that is completely undocumented in vmsplice(2) (or anywhere else that
I'd been able to find) ;-/

OTOH, considering the quality of documentation, I'm somewhat tempted to go
for "sleep only if it had been completely full when we entered; once there's
some space feed as much as fits and be done with that".  OTTH, I'm not sure
that no userland cr^Hode will manage to be hurt by that variant...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds Sept. 23, 2016, 8:36 p.m. UTC | #3
On Fri, Sep 23, 2016 at 1:10 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> OTOH, considering the quality of documentation, I'm somewhat tempted to go
> for "sleep only if it had been completely full when we entered; once there's
> some space feed as much as fits and be done with that".  OTTH, I'm not sure
> that no userland cr^Hode will manage to be hurt by that variant...

Let's just try it.

If that then doesn't work, we can introduce your odd code (with a
*big* comment). Ok?

               Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Sept. 24, 2016, 3:59 a.m. UTC | #4
On Fri, Sep 23, 2016 at 01:36:12PM -0700, Linus Torvalds wrote:
> On Fri, Sep 23, 2016 at 1:10 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > OTOH, considering the quality of documentation, I'm somewhat tempted to go
> > for "sleep only if it had been completely full when we entered; once there's
> > some space feed as much as fits and be done with that".  OTTH, I'm not sure
> > that no userland cr^Hode will manage to be hurt by that variant...
> 
> Let's just try it.
> 
> If that then doesn't work, we can introduce your odd code (with a
> *big* comment). Ok?

	FWIW, updated (with fixes) and force-pushed.  Added piece:
default_file_splice_read() converted to iov_iter.  Seems to work, after
fixing a braino in __pipe_get_pages().  Changed: #4 (sleep only in the
beginning, as described above), #6 (context changes from #4), #10 (missing
get_page() added in __pipe_get_pages()), #11 (removed pointless truncation
of len - ->read_iter() can bloody well handle that on its own) and added #12.
Stands at 28 files changed, 657 insertions(+), 1009 deletions(-) now...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Sept. 24, 2016, 5:29 p.m. UTC | #5
On Sat, Sep 24, 2016 at 04:59:08AM +0100, Al Viro wrote:

> 	FWIW, updated (with fixes) and force-pushed.  Added piece:
> default_file_splice_read() converted to iov_iter.  Seems to work, after
> fixing a braino in __pipe_get_pages().  Changed: #4 (sleep only in the
> beginning, as described above), #6 (context changes from #4), #10 (missing
> get_page() added in __pipe_get_pages()), #11 (removed pointless truncation
> of len - ->read_iter() can bloody well handle that on its own) and added #12.
> Stands at 28 files changed, 657 insertions(+), 1009 deletions(-) now...

	I think I see how to get full zero-copy (including the write side
of things).  Just add a "from" side for ITER_PIPE iov_iter (advance,
get_pages, get_pages_alloc, npages and alignment will need to behave
differently for "to" and "from" ones) and pull the following trick:
have fault_in_readable return NULL instead of 0, ERR_PTR(-EFAULT) instead
of -EFAULT *and* return a struct page if it was asked for a full-page
range on a page that could be successfully stolen (only "from pipe" iov_iter
would go for the last one, of course).  Then we make generic_perform_write()
shove the return value of fault-in into 'page'.  ->write_begin() is given
&page as an argument, to return the resulting page via that.  All instances
currently just store into that pointer, completely ignoring the prior value.
And they'll keep working just fine.

	Let's make sure that all method call sites outside of
generic_perform_write() (there's only one such, actually) have NULL
stored in there prior to the call.  Now we can start switching the
instances to zero-copy support - all it takes is replacing
grab_cache_page_write_begin() with "if *page is non-NULL, try to
shove it (locked, non-uptodate) into pagecache; if that succeeds grab a
reference to our page and we are done, if it fails - fall back to
grab_cache_page_write_begin()".  Then do get_block, etc., or whatever that
->write_begin() instance would normally do, just remember not to zero anything
if the page had been passed to us by caller.

	Now all we need is to make sure that iov_iter_copy_from_user_atomic()
for those guys recongnizes the case of full-page copy when source and target
are the same page and quietly returns PAGE_SIZE.  Voila - we can make
iter_file_splice_write() pass pipe-backed iov_iter instead of bvec-backed
one *and* get write-side zero-copy for all filesystems with ->write_begin()
taught to handle that (see above).  Since the filesystems with unmodified
->write_begin() will act correctly (just do the copying), we don't have
to make that a flagday change; ->write_begin() instances can be switched
one by one.  Similar treatment of iomap_write_begin()/iomap_write_actor()
would cover iomap-using ->write_iter() instances.

	It's clearly not something I want to touch until -rc1, but it looks
feasible for the next cycle, and if done right it promises to unify the
plain and splice sides of fuse_dev_...() stuff, simplifying the hell out
of them without losing zero-copy there.  And if everything really goes
right, we might be able to get rid of net/* ->splice_read() and ->sendpage()
methods as well...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nicholas Piggin Sept. 27, 2016, 3:38 p.m. UTC | #6
On Sat, 24 Sep 2016 18:29:01 +0100
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Sat, Sep 24, 2016 at 04:59:08AM +0100, Al Viro wrote:
> 
> > 	FWIW, updated (with fixes) and force-pushed.  Added piece:
> > default_file_splice_read() converted to iov_iter.  Seems to work, after
> > fixing a braino in __pipe_get_pages().  Changed: #4 (sleep only in the
> > beginning, as described above), #6 (context changes from #4), #10 (missing
> > get_page() added in __pipe_get_pages()), #11 (removed pointless truncation
> > of len - ->read_iter() can bloody well handle that on its own) and added #12.
> > Stands at 28 files changed, 657 insertions(+), 1009 deletions(-) now...  
> 
> 	I think I see how to get full zero-copy (including the write side
> of things).  Just add a "from" side for ITER_PIPE iov_iter (advance,
> get_pages, get_pages_alloc, npages and alignment will need to behave
> differently for "to" and "from" ones) and pull the following trick:
> have fault_in_readable return NULL instead of 0, ERR_PTR(-EFAULT) instead
> of -EFAULT *and* return a struct page if it was asked for a full-page
> range on a page that could be successfully stolen (only "from pipe" iov_iter
> would go for the last one, of course).  Then we make generic_perform_write()
> shove the return value of fault-in into 'page'.  ->write_begin() is given
> &page as an argument, to return the resulting page via that.  All instances
> currently just store into that pointer, completely ignoring the prior value.
> And they'll keep working just fine.
> 
> 	Let's make sure that all method call sites outside of
> generic_perform_write() (there's only one such, actually) have NULL
> stored in there prior to the call.  Now we can start switching the
> instances to zero-copy support - all it takes is replacing
> grab_cache_page_write_begin() with "if *page is non-NULL, try to
> shove it (locked, non-uptodate) into pagecache; if that succeeds grab a
> reference to our page and we are done, if it fails - fall back to
> grab_cache_page_write_begin()".  Then do get_block, etc., or whatever that
> ->write_begin() instance would normally do, just remember not to zero anything  
> if the page had been passed to us by caller.

Interesting stuff. It should also be possible for a filesystem to replace
existing pagecache as a zero-copy overwrite with the migration APIs and
just a little bit of work.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chuck Lever Sept. 27, 2016, 3:53 p.m. UTC | #7
> On Sep 24, 2016, at 1:29 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> On Sat, Sep 24, 2016 at 04:59:08AM +0100, Al Viro wrote:
> 
>> 	FWIW, updated (with fixes) and force-pushed.  Added piece:
>> default_file_splice_read() converted to iov_iter.  Seems to work, after
>> fixing a braino in __pipe_get_pages().  Changed: #4 (sleep only in the
>> beginning, as described above), #6 (context changes from #4), #10 (missing
>> get_page() added in __pipe_get_pages()), #11 (removed pointless truncation
>> of len - ->read_iter() can bloody well handle that on its own) and added #12.
>> Stands at 28 files changed, 657 insertions(+), 1009 deletions(-) now...
> 
> 	I think I see how to get full zero-copy (including the write side
> of things).  Just add a "from" side for ITER_PIPE iov_iter (advance,
> get_pages, get_pages_alloc, npages and alignment will need to behave
> differently for "to" and "from" ones) and pull the following trick:
> have fault_in_readable return NULL instead of 0, ERR_PTR(-EFAULT) instead
> of -EFAULT *and* return a struct page if it was asked for a full-page
> range on a page that could be successfully stolen (only "from pipe" iov_iter
> would go for the last one, of course).  Then we make generic_perform_write()
> shove the return value of fault-in into 'page'.  ->write_begin() is given
> &page as an argument, to return the resulting page via that.  All instances
> currently just store into that pointer, completely ignoring the prior value.
> And they'll keep working just fine.
> 
> 	Let's make sure that all method call sites outside of
> generic_perform_write() (there's only one such, actually) have NULL
> stored in there prior to the call.  Now we can start switching the
> instances to zero-copy support - all it takes is replacing
> grab_cache_page_write_begin() with "if *page is non-NULL, try to
> shove it (locked, non-uptodate) into pagecache; if that succeeds grab a
> reference to our page and we are done, if it fails - fall back to
> grab_cache_page_write_begin()".  Then do get_block, etc., or whatever that
> ->write_begin() instance would normally do, just remember not to zero anything
> if the page had been passed to us by caller.
> 
> 	Now all we need is to make sure that iov_iter_copy_from_user_atomic()
> for those guys recongnizes the case of full-page copy when source and target
> are the same page and quietly returns PAGE_SIZE.  Voila - we can make
> iter_file_splice_write() pass pipe-backed iov_iter instead of bvec-backed
> one *and* get write-side zero-copy for all filesystems with ->write_begin()
> taught to handle that (see above).  Since the filesystems with unmodified
> ->write_begin() will act correctly (just do the copying), we don't have
> to make that a flagday change; ->write_begin() instances can be switched
> one by one.  Similar treatment of iomap_write_begin()/iomap_write_actor()
> would cover iomap-using ->write_iter() instances.
> 
> 	It's clearly not something I want to touch until -rc1, but it looks
> feasible for the next cycle, and if done right it promises to unify the
> plain and splice sides of fuse_dev_...() stuff, simplifying the hell out
> of them without losing zero-copy there.  And if everything really goes
> right, we might be able to get rid of net/* ->splice_read() and ->sendpage()
> methods as well...

Kernel NFS server already uses splice for its read path, but the
write path appears to require a full data copy of incoming payloads.
Would be awesome to see write-side support for zero-copy.


--
Chuck Lever



--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index a94d2ed..eaf56c6 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1364,7 +1364,6 @@  static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
 		goto out;
 
 	ret = 0;
-	pipe_lock(pipe);
 
 	if (!pipe->readers) {
 		send_sig(SIGPIPE, current, 0);
@@ -1400,7 +1399,6 @@  static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
 	}
 
 out_unlock:
-	pipe_unlock(pipe);
 
 	if (do_wakeup) {
 		smp_mb();
diff --git a/fs/splice.c b/fs/splice.c
index 31c52e0..9ce6e62 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -183,79 +183,41 @@  ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 		       struct splice_pipe_desc *spd)
 {
 	unsigned int spd_pages = spd->nr_pages;
-	int ret, do_wakeup, page_nr;
+	int ret = 0, page_nr = 0;
 
 	if (!spd_pages)
 		return 0;
 
-	ret = 0;
-	do_wakeup = 0;
-	page_nr = 0;
-
-	pipe_lock(pipe);
-
-	for (;;) {
-		if (!pipe->readers) {
-			send_sig(SIGPIPE, current, 0);
-			if (!ret)
-				ret = -EPIPE;
-			break;
-		}
-
-		if (pipe->nrbufs < pipe->buffers) {
-			int newbuf = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
-			struct pipe_buffer *buf = pipe->bufs + newbuf;
-
-			buf->page = spd->pages[page_nr];
-			buf->offset = spd->partial[page_nr].offset;
-			buf->len = spd->partial[page_nr].len;
-			buf->private = spd->partial[page_nr].private;
-			buf->ops = spd->ops;
-			if (spd->flags & SPLICE_F_GIFT)
-				buf->flags |= PIPE_BUF_FLAG_GIFT;
-
-			pipe->nrbufs++;
-			page_nr++;
-			ret += buf->len;
-
-			if (pipe->files)
-				do_wakeup = 1;
+	if (unlikely(!pipe->readers)) {
+		send_sig(SIGPIPE, current, 0);
+		ret = -EPIPE;
+		goto out;
+	}
 
-			if (!--spd->nr_pages)
-				break;
-			if (pipe->nrbufs < pipe->buffers)
-				continue;
+	while (pipe->nrbufs < pipe->buffers) {
+		int newbuf = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
+		struct pipe_buffer *buf = pipe->bufs + newbuf;
 
-			break;
-		}
+		buf->page = spd->pages[page_nr];
+		buf->offset = spd->partial[page_nr].offset;
+		buf->len = spd->partial[page_nr].len;
+		buf->private = spd->partial[page_nr].private;
+		buf->ops = spd->ops;
+		if (spd->flags & SPLICE_F_GIFT)
+			buf->flags |= PIPE_BUF_FLAG_GIFT;
 
-		if (spd->flags & SPLICE_F_NONBLOCK) {
-			if (!ret)
-				ret = -EAGAIN;
-			break;
-		}
+		pipe->nrbufs++;
+		page_nr++;
+		ret += buf->len;
 
-		if (signal_pending(current)) {
-			if (!ret)
-				ret = -ERESTARTSYS;
+		if (!--spd->nr_pages)
 			break;
-		}
-
-		if (do_wakeup) {
-			wakeup_pipe_readers(pipe);
-			do_wakeup = 0;
-		}
-
-		pipe->waiting_writers++;
-		pipe_wait(pipe);
-		pipe->waiting_writers--;
 	}
 
-	pipe_unlock(pipe);
-
-	if (do_wakeup)
-		wakeup_pipe_readers(pipe);
+	if (!ret)
+		ret = -EAGAIN;
 
+out:
 	while (page_nr < spd_pages)
 		spd->spd_release(spd, page_nr++);
 
@@ -1339,6 +1301,27 @@  long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
 }
 EXPORT_SYMBOL(do_splice_direct);
 
+static bool splice_more(struct pipe_inode_info *pipe,
+			long *p, unsigned flags)
+{
+	if (pipe->nrbufs < pipe->buffers) // no overflows
+		return false;
+	if (flags & SPLICE_F_NONBLOCK) // not allowed to wait
+		return false;
+	if (*p < 0 && *p != -EAGAIN) // error happened
+		return false;
+	if (signal_pending(current)) { // interrupted
+		*p = -ERESTARTSYS;
+		return false;
+	}
+	if (*p > 0)
+		wakeup_pipe_readers(pipe);
+	pipe->waiting_writers++;
+	pipe_wait(pipe);
+	pipe->waiting_writers--;
+	return true;
+}
+
 static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			       struct pipe_inode_info *opipe,
 			       size_t len, unsigned int flags);
@@ -1410,6 +1393,8 @@  static long do_splice(struct file *in, loff_t __user *off_in,
 	}
 
 	if (opipe) {
+		size_t total = 0;
+		int bogus_count;
 		if (off_out)
 			return -ESPIPE;
 		if (off_in) {
@@ -1421,8 +1406,25 @@  static long do_splice(struct file *in, loff_t __user *off_in,
 			offset = in->f_pos;
 		}
 
-		ret = do_splice_to(in, &offset, opipe, len, flags);
-
+		ret = 0;
+		pipe_lock(opipe);
+		bogus_count = opipe->buffers;
+		do {
+			bogus_count += opipe->nrbufs;
+			ret = do_splice_to(in, &offset, opipe, len, flags);
+			if (ret > 0) {
+				total += ret;
+				len -= ret;
+			}
+			bogus_count -= opipe->nrbufs;
+			if (bogus_count <= 0)
+				break;
+		} while (len && splice_more(opipe, &ret, flags));
+		pipe_unlock(opipe);
+		if (total) {
+			wakeup_pipe_readers(opipe);
+			ret = total;
+		}
 		if (!off_in)
 			in->f_pos = offset;
 		else if (copy_to_user(off_in, &offset, sizeof(loff_t)))
@@ -1434,22 +1436,23 @@  static long do_splice(struct file *in, loff_t __user *off_in,
 	return -EINVAL;
 }
 
-static int get_iovec_page_array(struct iov_iter *from,
+static int get_iovec_page_array(const struct iov_iter *from,
 				struct page **pages,
 				struct partial_page *partial,
 				unsigned int pipe_buffers)
 {
+	struct iov_iter i = *from;
 	int buffers = 0;
-	while (iov_iter_count(from)) {
+	while (iov_iter_count(&i)) {
 		ssize_t copied;
 		size_t start;
 
-		copied = iov_iter_get_pages(from, pages + buffers, ~0UL,
+		copied = iov_iter_get_pages(&i, pages + buffers, ~0UL,
 					pipe_buffers - buffers, &start);
 		if (copied <= 0)
 			return buffers ? buffers : copied;
 
-		iov_iter_advance(from, copied);
+		iov_iter_advance(&i, copied);
 		while (copied) {
 			int size = min_t(int, copied, PAGE_SIZE - start);
 			partial[buffers].offset = start;
@@ -1530,7 +1533,8 @@  static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
 		.ops = &user_page_pipe_buf_ops,
 		.spd_release = spd_release_page,
 	};
-	long ret;
+	long ret, total = 0;
+	int bogus_count;
 
 	pipe = get_pipe_info(file);
 	if (!pipe)
@@ -1546,14 +1550,31 @@  static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
 		return -ENOMEM;
 	}
 
-	spd.nr_pages = get_iovec_page_array(&from, spd.pages,
-					    spd.partial,
-					    spd.nr_pages_max);
-	if (spd.nr_pages <= 0)
-		ret = spd.nr_pages;
-	else
+	pipe_lock(pipe);
+	bogus_count = pipe->buffers;
+	do {
+		bogus_count += pipe->nrbufs;
+		spd.nr_pages = get_iovec_page_array(&from, spd.pages,
+						    spd.partial,
+						    spd.nr_pages_max);
+		if (spd.nr_pages <= 0) {
+			ret = spd.nr_pages;
+			break;
+		}
 		ret = splice_to_pipe(pipe, &spd);
-
+		if (ret > 0) {
+			total += ret;
+			iov_iter_advance(&from, ret);
+		}
+		bogus_count -= pipe->nrbufs;
+		if (bogus_count <= 0)
+			break;
+	} while (iov_iter_count(&from) && splice_more(pipe, &ret, flags));
+	pipe_unlock(pipe);
+	if (total) {
+		wakeup_pipe_readers(pipe);
+		ret = total;
+	}
 	splice_shrink_spd(&spd);
 	kfree(iov);
 	return ret;