diff mbox

iov_iter_pipe warning.

Message ID 20170911200713.GK5426@ZenIV.linux.org.uk (mailing list archive)
State Superseded
Headers show

Commit Message

Al Viro Sept. 11, 2017, 8:07 p.m. UTC
On Mon, Sep 11, 2017 at 04:44:40PM +1000, Dave Chinner wrote:

> > iov_iter_get_pages() for pipe-backed destination does page allocation
> > and inserts freshly allocated pages into pipe.
> 
> Oh, it's hidden more layers down than the code implied I needed to
> look.
> 
> i.e. there's no obvious clue in the function names that there is
> allocation happening in these paths (get_pipe_pages ->
> __get_pipe_pages -> push_pipe -> page allocation). The function
> names imply it's getting a reference to pages (like
> (get_user_pages()) and the fact it does allocation is inconsistent
> with it's naming.  Worse, when push_pipe() fails to allocate pages,
> the error __get_pipe_pages() returns is -EFAULT, which further hides
> the fact push_pipe() does memory allocation that can fail....
> 
> And then there's the companion interface that implies page
> allocation: pipe_get_pages_alloc(). Which brings me back to there
> being no obvious clue while reading the code from the top down that
> pages are being allocated in push_pipe()....
> 
> Comments and documentation for this code would help, but I can't
> find any of that, either. Hence I assumed naming followed familiar
> patterns and so mistook these interfaces being one that does page
> allocation and the other for getting references to pre-existing
> pages.....

_NONE_ of those is a public interface - they are all static, to start
with.

The whole damn point is to have normal ->read_iter() work for read-to-pipe
without changes.  That's why -EFAULT as error (rather than some other
mechanism for saying that pipe is full), etc.

Filesystem should *not* be changed to use that.  At all.  As far as it is
concerned,
	copy_to_iter()
	copy_page_to_iter()
	iov_iter_get_pages()
	iov_iter_get_pages_alloc()
	iov_iter_advance()
are black boxes.

Note that one of the bugs there applies to normal read() as well - if you
are reading from a hole in file into an array with a read-only page in
the middle, you want a short read.  Ignoring return value from iov_iter_zero()
is wrong for iovec-backed case as well as for pipes.

Another one manages to work for iovec-backed case, albeit with rather odd
resulting semantics.  readv(2) is underspecified (to put it politely) enough
for compliance, but it's still bloody strange.  Namely, if you have a contiguous
50Mb chunk of file on disk and run into e.g. a failure to fault the destination
pages in halfway through that extent, you act as if *nothing* in the damn thing
had been read, nevermind that 25Mb had been actually already read and that had
there been a discontinuity 5Mb prior, the first 20Mb would've been reported
read just fine.

Strictly speaking that behaviour doesn't violate POSIX.  It is, however,
atrocious from the QoI standpoint, and for no good reason whatsoever.
It's quite easy to do better, and doing so would've eliminated the problems
in pipe-backed case as well (see below).  In addition to that, I would
consider teaching bio_iov_iter_get_pages() to take the maximal bio size
as an explict argument.  That would've killed the need of copying the
iterator and calling iov_iter_advance() in iomap_dio_actor() at all.
Anyway, the minimal candidate fix follows; it won't do anything about
the WARN_ON() in there, seeing that those are deliberate "program is
doing something bogus" things, but it should eliminate all crap with
->splice_read() misreporting the amount of data it has copied.

--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Dave Chinner Sept. 12, 2017, 6:02 a.m. UTC | #1
On Mon, Sep 11, 2017 at 09:07:13PM +0100, Al Viro wrote:
> On Mon, Sep 11, 2017 at 04:44:40PM +1000, Dave Chinner wrote:
> 
> > > iov_iter_get_pages() for pipe-backed destination does page allocation
> > > and inserts freshly allocated pages into pipe.
> > 
> > Oh, it's hidden more layers down than the code implied I needed to
> > look.
> > 
> > i.e. there's no obvious clue in the function names that there is
> > allocation happening in these paths (get_pipe_pages ->
> > __get_pipe_pages -> push_pipe -> page allocation). The function
> > names imply it's getting a reference to pages (like
> > (get_user_pages()) and the fact it does allocation is inconsistent
> > with it's naming.  Worse, when push_pipe() fails to allocate pages,
> > the error __get_pipe_pages() returns is -EFAULT, which further hides
> > the fact push_pipe() does memory allocation that can fail....
> > 
> > And then there's the companion interface that implies page
> > allocation: pipe_get_pages_alloc(). Which brings me back to there
> > being no obvious clue while reading the code from the top down that
> > pages are being allocated in push_pipe()....
> > 
> > Comments and documentation for this code would help, but I can't
> > find any of that, either. Hence I assumed naming followed familiar
> > patterns and so mistook these interfaces being one that does page
> > allocation and the other for getting references to pre-existing
> > pages.....
> 
> _NONE_ of those is a public interface - they are all static, to start
> with.

It still requires comments to explain *why* the code is doing what
it's doing. You wrote the code, the whole explaination of it is in
your head. I can't see any of that, so when I read the code I sit
there thinking "why the fuck is it doing this?" because there's no
explanations of the WTF? moments in the code...

> The whole damn point is to have normal ->read_iter() work for read-to-pipe
> without changes.  That's why -EFAULT as error (rather than some other
> mechanism for saying that pipe is full), etc.

... like this one.

That needs a *fucking big comment* because it's not at all obvious
why ENOMEM conditions are being hidden with EFAULT.

Comments and documentation are not for the person who writes the
code - they are for the stupid morons like me that need all the
help they can get to understand complex code that does tricksy,
subtle, non-obvious shit to work correctly.

> Filesystem should *not* be changed to use that.  At all.  As far as it is
> concerned,
> 	copy_to_iter()
> 	copy_page_to_iter()
> 	iov_iter_get_pages()
> 	iov_iter_get_pages_alloc()
> 	iov_iter_advance()
> are black boxes.

The implementation may be a black box, but the operations the black
box is performing for the callers still needs to be explained.

> Note that one of the bugs there applies to normal read() as well - if you
> are reading from a hole in file into an array with a read-only page in
> the middle, you want a short read.

And there's another WTF? moment.....

How do we get a read only page in the middle of an array of pages
we've been told to write data into? And why isn't that a bug in the
code that supplied us with those pages?

> Ignoring return value from iov_iter_zero()
> is wrong for iovec-backed case as well as for pipes.
> 
> Another one manages to work for iovec-backed case, albeit with rather odd
> resulting semantics.  readv(2) is underspecified (to put it politely) enough
> for compliance, but it's still bloody strange.  Namely, if you have a contiguous
> 50Mb chunk of file on disk and run into e.g. a failure to fault the destination
> pages in halfway through that extent, you act as if *nothing* in the damn thing
> had been read, nevermind that 25Mb had been actually already read and that had
> there been a discontinuity 5Mb prior, the first 20Mb would've been reported
> read just fine.
> 
> Strictly speaking that behaviour doesn't violate POSIX.

This is direct IO. POSIX compliant behaviour went out the window
long ago..... :/

> It is, however,
> atrocious from the QoI standpoint, and for no good reason whatsoever.
> It's quite easy to do better, and doing so would've eliminated the problems
> in pipe-backed case as well (see below).  In addition to that, I would
> consider teaching bio_iov_iter_get_pages() to take the maximal bio size
> as an explict argument.  That would've killed the need of copying the
> iterator and calling iov_iter_advance() in iomap_dio_actor() at all.
> Anyway, the minimal candidate fix follows; it won't do anything about
> the WARN_ON() in there, seeing that those are deliberate "program is
> doing something bogus" things, but it should eliminate all crap with
> ->splice_read() misreporting the amount of data it has copied.

I'll run your updated patch through my testing, but seeing as I have
nothing that tests splice+direct IO I'm not going to be able to test
that right now. I have slightly more important things to that need
urgent attention than writing splice+DIO test cases....

Cheers,

Dave.
Al Viro Sept. 12, 2017, 11:13 a.m. UTC | #2
On Tue, Sep 12, 2017 at 04:02:14PM +1000, Dave Chinner wrote:

> > Note that one of the bugs there applies to normal read() as well - if you
> > are reading from a hole in file into an array with a read-only page in
> > the middle, you want a short read.
> 
> And there's another WTF? moment.....
> 
> How do we get a read only page in the middle of an array of pages
> we've been told to write data into? And why isn't that a bug in the
> code that supplied us with those pages?

Sorry, I'd been unclear - I'm talking about read(2) or readv(2) called by
userland with a read-only piece in the middle of a (user-supplied) buffer.
Either due to mprotect() or simply with one of the iovecs passed to readv(2)
having ->iov_base set to some read-only area.

It may be a bug in userland code, but when handling that error is as trivial
as "don't assume iov_iter_zero(to, n) will return n, use the actual return
value", resorting to "the userland code was probably buggy and it's O_DIRECT,
so we can do whatever we want" looks wrong.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/iomap.c b/fs/iomap.c
index 269b24a01f32..836fe27b00e2 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -832,6 +832,7 @@  iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 	struct bio *bio;
 	bool need_zeroout = false;
 	int nr_pages, ret;
+	size_t copied = 0;
 
 	if ((pos | length | align) & ((1 << blkbits) - 1))
 		return -EINVAL;
@@ -843,7 +844,7 @@  iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 		/*FALLTHRU*/
 	case IOMAP_UNWRITTEN:
 		if (!(dio->flags & IOMAP_DIO_WRITE)) {
-			iov_iter_zero(length, dio->submit.iter);
+			length = iov_iter_zero(length, dio->submit.iter);
 			dio->size += length;
 			return length;
 		}
@@ -880,6 +881,7 @@  iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 	}
 
 	do {
+		size_t n;
 		if (dio->error)
 			return 0;
 
@@ -897,17 +899,21 @@  iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 			return ret;
 		}
 
+		n = bio->bi_iter.bi_size;
 		if (dio->flags & IOMAP_DIO_WRITE) {
 			bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
-			task_io_account_write(bio->bi_iter.bi_size);
+			task_io_account_write(n);
 		} else {
 			bio_set_op_attrs(bio, REQ_OP_READ, 0);
 			if (dio->flags & IOMAP_DIO_DIRTY)
 				bio_set_pages_dirty(bio);
 		}
 
-		dio->size += bio->bi_iter.bi_size;
-		pos += bio->bi_iter.bi_size;
+		iov_iter_advance(dio->submit.iter, n);
+
+		dio->size += n;
+		pos += n;
+		copied += n;
 
 		nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES);
 
@@ -923,9 +929,7 @@  iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 		if (pad)
 			iomap_dio_zero(dio, iomap, pos, fs_block_size - pad);
 	}
-
-	iov_iter_advance(dio->submit.iter, length);
-	return length;
+	return copied;
 }
 
 ssize_t