diff mbox

iov_iter_pipe warning.

Message ID 20170908010441.GZ5426@ZenIV.linux.org.uk (mailing list archive)
State Superseded
Headers show

Commit Message

Al Viro Sept. 8, 2017, 1:04 a.m. UTC
On Thu, Sep 07, 2017 at 09:46:17AM +1000, Dave Chinner wrote:
> On Wed, Sep 06, 2017 at 04:03:37PM -0400, Dave Jones wrote:
> > On Mon, Aug 28, 2017 at 09:25:42PM -0700, Darrick J. Wong wrote:
> >  > On Mon, Aug 28, 2017 at 04:31:30PM -0400, Dave Jones wrote:
> >  > > I'm still trying to narrow down an exact reproducer, but it seems having
> >  > > trinity do a combination of sendfile & writev, with pipes and regular
> >  > > files as fd's is the best repro.
> >  > > 
> >  > > Is this a real problem, or am I chasing ghosts ?  That it doesn't happen
> >  > > on ext4 or btrfs is making me wonder...
> >  > 
> >  > <shrug> I haven't heard of any problems w/ directio xfs lately, but OTOH
> >  > I think it's the only filesystem that uses iomap_dio_rw, which would
> >  > explain why ext4/btrfs don't have this problem.
> > 
> > Another warning, from likely the same root cause.
> > 
> > WARNING: CPU: 3 PID: 572 at lib/iov_iter.c:962 iov_iter_pipe+0xe2/0xf0
> 
> 	WARN_ON(pipe->nrbufs == pipe->buffers);
> 
>  *      @nrbufs: the number of non-empty pipe buffers in this pipe
>  *      @buffers: total number of buffers (should be a power of 2)
> 
> So that's warning that the pipe buffer is already full before we
> try to read from the filesystem?
> 
> That doesn't seem like an XFS problem - it indicates the pipe we are
> filling in generic_file_splice_read() is not being emptied by
> whatever we are splicing the file data to....

Or that XFS in some conditions shoves into pipe more than it reports,
so not all of that gets emptied, filling the sucker up after sufficient
amount of iterations...

There's at least one suspicious place in iomap_dio_actor() -
                if (!(dio->flags & IOMAP_DIO_WRITE)) {
                        iov_iter_zero(length, dio->submit.iter);
                        dio->size += length;
                        return length;
                }
which assumes that iov_iter_zero() always succeeds.  That's very
much _not_ true - neither for iovec-backed, not for pipe-backed.
Orangefs read_one_page() is fine (it calls that sucker for bvec-backed
iov_iter it's just created), but iomap_dio_actor() is not.

I'm not saying that it will suffice, but we definitely need this:

--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Dave Jones Sept. 10, 2017, 1:07 a.m. UTC | #1
On Fri, Sep 08, 2017 at 02:04:41AM +0100, Al Viro wrote:
 
 > There's at least one suspicious place in iomap_dio_actor() -
 >                 if (!(dio->flags & IOMAP_DIO_WRITE)) {
 >                         iov_iter_zero(length, dio->submit.iter);
 >                         dio->size += length;
 >                         return length;
 >                 }
 > which assumes that iov_iter_zero() always succeeds.  That's very
 > much _not_ true - neither for iovec-backed, not for pipe-backed.
 > Orangefs read_one_page() is fine (it calls that sucker for bvec-backed
 > iov_iter it's just created), but iomap_dio_actor() is not.
 > 
 > I'm not saying that it will suffice, but we definitely need this:
 > 
 > diff --git a/fs/iomap.c b/fs/iomap.c
 > index 269b24a01f32..4a671263475f 100644
 > --- a/fs/iomap.c
 > +++ b/fs/iomap.c
 > @@ -843,7 +843,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 >  		/*FALLTHRU*/
 >  	case IOMAP_UNWRITTEN:
 >  		if (!(dio->flags & IOMAP_DIO_WRITE)) {
 > -			iov_iter_zero(length, dio->submit.iter);
 > +			length = iov_iter_zero(length, dio->submit.iter);
 >  			dio->size += length;
 >  			return length;

With this in place, I'm still seeing -EBUSY from invalidate_inode_pages2_range
which doesn't end well...


WARNING: CPU: 3 PID: 11443 at fs/iomap.c:993 iomap_dio_rw+0x825/0x840
CPU: 3 PID: 11443 Comm: trinity-c39 Not tainted 4.13.0-think+ #9 
task: ffff880461080040 task.stack: ffff88043d720000
RIP: 0010:iomap_dio_rw+0x825/0x840
RSP: 0018:ffff88043d727730 EFLAGS: 00010286
RAX: 00000000fffffff0 RBX: ffff88044f036428 RCX: 0000000000000000
RDX: ffffed0087ae4e67 RSI: 0000000000000000 RDI: ffffed0087ae4ed7
RBP: ffff88043d727910 R08: ffff88046b4176c0 R09: 0000000000000000
R10: ffff88043d726d20 R11: 0000000000000001 R12: ffff88043d727a90
R13: 00000000027253f7 R14: 1ffff10087ae4ef4 R15: ffff88043d727c10
FS:  00007f5d8613e700(0000) GS:ffff88046b400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f5d84503000 CR3: 00000004594e1000 CR4: 00000000001606e0
Call Trace:
 ? iomap_seek_data+0xb0/0xb0
 ? down_read_nested+0xd3/0x160
 ? down_read_non_owner+0x40/0x40
 ? xfs_ilock+0x3cb/0x460 [xfs]
 ? sched_clock_cpu+0x14/0xf0
 ? __lock_is_held+0x51/0xc0
 ? xfs_file_dio_aio_read+0x123/0x350 [xfs]
 xfs_file_dio_aio_read+0x123/0x350 [xfs]
 ? xfs_file_fallocate+0x550/0x550 [xfs]
 ? lock_release+0xa00/0xa00
 ? ___might_sleep.part.70+0x118/0x320
 xfs_file_read_iter+0x1b1/0x1d0 [xfs]
 do_iter_readv_writev+0x2ea/0x330
 ? vfs_dedupe_file_range+0x400/0x400
 do_iter_read+0x149/0x280
 vfs_readv+0x107/0x180
 ? vfs_iter_read+0x60/0x60
 ? fget_raw+0x10/0x10
 ? native_sched_clock+0xf9/0x1a0
 ? __fdget_pos+0xd6/0x110
 ? __fdget_pos+0xd6/0x110
 ? __fdget_raw+0x10/0x10
 ? do_readv+0xc0/0x1b0
 do_readv+0xc0/0x1b0
 ? vfs_readv+0x180/0x180
 ? mark_held_locks+0x1c/0x90
 ? do_syscall_64+0xae/0x3e0
 ? compat_rw_copy_check_uvector+0x1b0/0x1b0
 do_syscall_64+0x182/0x3e0
 ? syscall_return_slowpath+0x250/0x250
 ? rcu_read_lock_sched_held+0x90/0xa0
 ? mark_held_locks+0x1c/0x90
 ? return_from_SYSCALL_64+0x2d/0x7a
 ? trace_hardirqs_on_caller+0x17a/0x250
 ? trace_hardirqs_on_thunk+0x1a/0x1c
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f5d85a69219
RSP: 002b:00007ffdf090afd8 EFLAGS: 00000246
 ORIG_RAX: 0000000000000013
RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00007f5d85a69219
RDX: 00000000000000ae RSI: 0000565183cd5490 RDI: 0000000000000056
RBP: 00007ffdf090b080 R08: 0141082b00011c63 R09: 0000000000000000
R10: 00000000ffffe000 R11: 0000000000000246 R12: 0000000000000002
R13: 00007f5d86026058 R14: 00007f5d8613e698 R15: 00007f5d86026000
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Sept. 10, 2017, 2:57 a.m. UTC | #2
On Sat, Sep 09, 2017 at 09:07:56PM -0400, Dave Jones wrote:

> With this in place, I'm still seeing -EBUSY from invalidate_inode_pages2_range
> which doesn't end well...

Different issue, and I'm not sure why that WARN_ON() is there in the
first place.  Note that in a similar situation generic_file_direct_write()
simply buggers off and lets the caller do buffered write...

iov_iter_pipe() warning is a sign of ->read_iter() on pipe-backed iov_iter
putting into the pipe more than it claims to have done.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Jones Sept. 10, 2017, 4:07 p.m. UTC | #3
On Sun, Sep 10, 2017 at 03:57:21AM +0100, Al Viro wrote:
 > On Sat, Sep 09, 2017 at 09:07:56PM -0400, Dave Jones wrote:
 > 
 > > With this in place, I'm still seeing -EBUSY from invalidate_inode_pages2_range
 > > which doesn't end well...
 > 
 > Different issue, and I'm not sure why that WARN_ON() is there in the
 > first place.  Note that in a similar situation generic_file_direct_write()
 > simply buggers off and lets the caller do buffered write...
 > 
 > iov_iter_pipe() warning is a sign of ->read_iter() on pipe-backed iov_iter
 > putting into the pipe more than it claims to have done.

(from a rerun after hitting that EBUSY warn; hence the taint)

WARNING: CPU: 0 PID: 14154 at fs/iomap.c:1055 iomap_dio_rw+0x78e/0x840
CPU: 0 PID: 14154 Comm: trinity-c33 Tainted: G        W       4.13.0-think+ #9 
task: ffff8801027e3e40 task.stack: ffff8801632d8000
RIP: 0010:iomap_dio_rw+0x78e/0x840
RSP: 0018:ffff8801632df370 EFLAGS: 00010286
RAX: 00000000fffffff0 RBX: ffff880428666428 RCX: ffffffffffffffea
RDX: ffffed002c65bdef RSI: 0000000000000000 RDI: ffffed002c65be5f
RBP: ffff8801632df550 R08: ffff88046ae176c0 R09: 0000000000000000
R10: ffff8801632de960 R11: 0000000000000001 R12: ffff8801632df7f0
R13: ffffffffffffffea R14: 1ffff1002c65be7c R15: ffff8801632df988
FS:  00007f3da2100700(0000) GS:ffff88046ae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000002f6223001 CR4: 00000000001606f0
DR0: 00007f3da1f3d000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Call Trace:
 ? iomap_seek_data+0xb0/0xb0
 ? find_inode_fast+0xd0/0xd0
 ? xfs_file_aio_write_checks+0x295/0x320 [xfs]
 ? __lock_is_held+0x51/0xc0
 ? xfs_file_dio_aio_write+0x286/0x7e0 [xfs]
 ? rcu_read_lock_sched_held+0x90/0xa0
 xfs_file_dio_aio_write+0x286/0x7e0 [xfs]
 ? xfs_file_aio_write_checks+0x320/0x320 [xfs]
 ? unwind_get_return_address+0x2f/0x50
 ? __save_stack_trace+0x92/0x100
 ? memcmp+0x45/0x70
 ? depot_save_stack+0x12e/0x480
 ? save_stack+0x89/0xb0
 ? save_stack+0x32/0xb0
 ? kasan_kmalloc+0xa0/0xd0
 ? __kmalloc+0x157/0x360
 ? iter_file_splice_write+0x154/0x760
 ? direct_splice_actor+0x86/0xa0
 ? splice_direct_to_actor+0x1c4/0x420
 ? do_splice_direct+0x173/0x1e0
 ? do_sendfile+0x3a2/0x6d0
 ? SyS_sendfile64+0xa4/0x130
 ? do_syscall_64+0x182/0x3e0
 ? entry_SYSCALL64_slow_path+0x25/0x25
 ? match_held_lock+0xa6/0x410
 ? iter_file_splice_write+0x154/0x760
 xfs_file_write_iter+0x227/0x280 [xfs]
 do_iter_readv_writev+0x267/0x330
 ? vfs_dedupe_file_range+0x400/0x400
 do_iter_write+0xd7/0x280
 ? splice_from_pipe_next.part.9+0x28/0x160
 iter_file_splice_write+0x4d5/0x760
 ? page_cache_pipe_buf_steal+0x2b0/0x2b0
 ? generic_file_splice_read+0x2e1/0x340
 ? pipe_to_user+0x80/0x80
 direct_splice_actor+0x86/0xa0
 splice_direct_to_actor+0x1c4/0x420
 ? generic_pipe_buf_nosteal+0x10/0x10
 ? do_splice_to+0xc0/0xc0
 do_splice_direct+0x173/0x1e0
 ? splice_direct_to_actor+0x420/0x420
 ? rcu_read_lock_sched_held+0x90/0xa0
 ? rcu_sync_lockdep_assert+0x43/0x70
 ? __sb_start_write+0x179/0x1e0
 do_sendfile+0x3a2/0x6d0
 ? do_compat_pwritev64+0xa0/0xa0
 ? __lock_is_held+0x2e/0xc0
 SyS_sendfile64+0xa4/0x130
 ? SyS_sendfile+0x140/0x140
 ? mark_held_locks+0x1c/0x90
 ? do_syscall_64+0xae/0x3e0
 ? SyS_sendfile+0x140/0x140
 do_syscall_64+0x182/0x3e0
 ? syscall_return_slowpath+0x250/0x250
 ? rcu_read_lock_sched_held+0x90/0xa0
 ? __context_tracking_exit.part.4+0x223/0x290
 ? mark_held_locks+0x1c/0x90
 ? return_from_SYSCALL_64+0x2d/0x7a
 ? trace_hardirqs_on_caller+0x17a/0x250
 ? trace_hardirqs_on_thunk+0x1a/0x1c
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f3da1a2b219
RSP: 002b:00007ffdd1642f38 EFLAGS: 00000246
 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 0000000000000028 RCX: 00007f3da1a2b219
RDX: 00007f3da1f3d000 RSI: 000000000000005f RDI: 0000000000000060
RBP: 00007ffdd1642fe0 R08: 30503123188dbe3f R09: ffffffffe7e7e7e7
R10: 000000000000f000 R11: 0000000000000246 R12: 0000000000000002
R13: 00007f3da2012058 R14: 00007f3da2100698 R15: 00007f3da2012000
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Sept. 10, 2017, 8:05 p.m. UTC | #4
On Sun, Sep 10, 2017 at 12:07:10PM -0400, Dave Jones wrote:
> On Sun, Sep 10, 2017 at 03:57:21AM +0100, Al Viro wrote:
>  > On Sat, Sep 09, 2017 at 09:07:56PM -0400, Dave Jones wrote:
>  > 
>  > > With this in place, I'm still seeing -EBUSY from invalidate_inode_pages2_range
>  > > which doesn't end well...
>  > 
>  > Different issue, and I'm not sure why that WARN_ON() is there in the
>  > first place.  Note that in a similar situation generic_file_direct_write()
>  > simply buggers off and lets the caller do buffered write...
>  > 
>  > iov_iter_pipe() warning is a sign of ->read_iter() on pipe-backed iov_iter
>  > putting into the pipe more than it claims to have done.
> 
> (from a rerun after hitting that EBUSY warn; hence the taint)
> 
> WARNING: CPU: 0 PID: 14154 at fs/iomap.c:1055 iomap_dio_rw+0x78e/0x840

... and that's another invalidate_inode_pages2_range() in the same
sucker.  Again, compare with generic_file_direct_write()...

I don't believe that this one has anything splice-specific to do with it.
And its only relation to iov_iter_pipe() splat is that it's in the same
fs/iomap.c...
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Jones Sept. 10, 2017, 8:07 p.m. UTC | #5
On Sun, Sep 10, 2017 at 09:05:48PM +0100, Al Viro wrote:
 > On Sun, Sep 10, 2017 at 12:07:10PM -0400, Dave Jones wrote:
 > > On Sun, Sep 10, 2017 at 03:57:21AM +0100, Al Viro wrote:
 > >  > On Sat, Sep 09, 2017 at 09:07:56PM -0400, Dave Jones wrote:
 > >  > 
 > >  > > With this in place, I'm still seeing -EBUSY from invalidate_inode_pages2_range
 > >  > > which doesn't end well...
 > >  > 
 > >  > Different issue, and I'm not sure why that WARN_ON() is there in the
 > >  > first place.  Note that in a similar situation generic_file_direct_write()
 > >  > simply buggers off and lets the caller do buffered write...
 > >  > 
 > >  > iov_iter_pipe() warning is a sign of ->read_iter() on pipe-backed iov_iter
 > >  > putting into the pipe more than it claims to have done.
 > > 
 > > (from a rerun after hitting that EBUSY warn; hence the taint)
 > > 
 > > WARNING: CPU: 0 PID: 14154 at fs/iomap.c:1055 iomap_dio_rw+0x78e/0x840
 > 
 > ... and that's another invalidate_inode_pages2_range() in the same
 > sucker.  Again, compare with generic_file_direct_write()...
 > 
 > I don't believe that this one has anything splice-specific to do with it.
 > And its only relation to iov_iter_pipe() splat is that it's in the same
 > fs/iomap.c...

The interesting part is that I'm hitting these two over and over now
rather than the iov_iter_pipe warning.  Could just be unlucky
randomness though..

	Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Sept. 10, 2017, 8:33 p.m. UTC | #6
On Sun, Sep 10, 2017 at 04:07:24PM -0400, Dave Jones wrote:
> On Sun, Sep 10, 2017 at 09:05:48PM +0100, Al Viro wrote:
>  > On Sun, Sep 10, 2017 at 12:07:10PM -0400, Dave Jones wrote:
>  > > On Sun, Sep 10, 2017 at 03:57:21AM +0100, Al Viro wrote:
>  > >  > On Sat, Sep 09, 2017 at 09:07:56PM -0400, Dave Jones wrote:
>  > >  > 
>  > >  > > With this in place, I'm still seeing -EBUSY from invalidate_inode_pages2_range
>  > >  > > which doesn't end well...
>  > >  > 
>  > >  > Different issue, and I'm not sure why that WARN_ON() is there in the
>  > >  > first place.  Note that in a similar situation generic_file_direct_write()
>  > >  > simply buggers off and lets the caller do buffered write...
>  > >  > 
>  > >  > iov_iter_pipe() warning is a sign of ->read_iter() on pipe-backed iov_iter
>  > >  > putting into the pipe more than it claims to have done.
>  > > 
>  > > (from a rerun after hitting that EBUSY warn; hence the taint)
>  > > 
>  > > WARNING: CPU: 0 PID: 14154 at fs/iomap.c:1055 iomap_dio_rw+0x78e/0x840
>  > 
>  > ... and that's another invalidate_inode_pages2_range() in the same
>  > sucker.  Again, compare with generic_file_direct_write()...
>  > 
>  > I don't believe that this one has anything splice-specific to do with it.
>  > And its only relation to iov_iter_pipe() splat is that it's in the same
>  > fs/iomap.c...
> 
> The interesting part is that I'm hitting these two over and over now
> rather than the iov_iter_pipe warning.  Could just be unlucky
> randomness though..

Well, if you are still running the same reproducer and it used to hit the "read
from hole longer than the amount of space left in pipe" case, fixing the other
bug would have led to a lot more data shoved through the pipe without choking.
So the write side would be exercised more than before...

Hell knows; the question I have right now is what the devil are those WARN_ON_ONCE()
doing there.  Again, the same conditions are possible on other filesystems, only
there we don't yell; invalidation failure before starting O_DIRECT write is
handled by quiet fallback to buffered IO, the one after the write is simply
ignored.

Doing those WARN_ON_ONCE() is an explicit choice in "iomap: implement direct I/O",
so it's a question to Christoph, AFAICS...
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Sept. 10, 2017, 9:11 p.m. UTC | #7
On Sun, Sep 10, 2017 at 03:57:21AM +0100, Al Viro wrote:
> On Sat, Sep 09, 2017 at 09:07:56PM -0400, Dave Jones wrote:
> 
> > With this in place, I'm still seeing -EBUSY from invalidate_inode_pages2_range
> > which doesn't end well...
> 
> Different issue, and I'm not sure why that WARN_ON() is there in the
> first place.  Note that in a similar situation generic_file_direct_write()
> simply buggers off and lets the caller do buffered write...

XFS does not fall back to buffered IO when direct IO fails.  A
direct IO failure is indicative of a problem that needs to be fixed,
not use a "let's hope we can hide this" fallback path. Especially in
this case - EBUSY usually comes from the app is doing something we
/know/ is dangerous and it's occurrence to completely timing
dependent - if the timing is slightly different, we miss detection
and that can lead to silent data corruption. Hence if we detect it,
and our coherency guards can't fix up the coherency problem, we
issue a warning and fail the IO.

The warning is mostly there for us developers and it's been there
for years - it's so we don't end up chasing ghosts when we see that
warning in the logs. The usual vector is an app that mixes
concurrent DIO with mmap access to the same file, which we
explicitly say "don't do this because data corruption" in the
open(2) man page....

Cheers,

Dave.
Al Viro Sept. 10, 2017, 9:19 p.m. UTC | #8
On Mon, Sep 11, 2017 at 07:11:10AM +1000, Dave Chinner wrote:
> On Sun, Sep 10, 2017 at 03:57:21AM +0100, Al Viro wrote:
> > On Sat, Sep 09, 2017 at 09:07:56PM -0400, Dave Jones wrote:
> > 
> > > With this in place, I'm still seeing -EBUSY from invalidate_inode_pages2_range
> > > which doesn't end well...
> > 
> > Different issue, and I'm not sure why that WARN_ON() is there in the
> > first place.  Note that in a similar situation generic_file_direct_write()
> > simply buggers off and lets the caller do buffered write...
> 
> XFS does not fall back to buffered IO when direct IO fails.  A
> direct IO failure is indicative of a problem that needs to be fixed,
> not use a "let's hope we can hide this" fallback path. Especially in
> this case - EBUSY usually comes from the app is doing something we
> /know/ is dangerous and it's occurrence to completely timing
> dependent - if the timing is slightly different, we miss detection
> and that can lead to silent data corruption.

In this case app is a fuzzer, which is bloody well supposed to poke
into all kinds of odd usage patterns, though...
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Sept. 10, 2017, 10:08 p.m. UTC | #9
On Sun, Sep 10, 2017 at 10:19:07PM +0100, Al Viro wrote:
> On Mon, Sep 11, 2017 at 07:11:10AM +1000, Dave Chinner wrote:
> > On Sun, Sep 10, 2017 at 03:57:21AM +0100, Al Viro wrote:
> > > On Sat, Sep 09, 2017 at 09:07:56PM -0400, Dave Jones wrote:
> > > 
> > > > With this in place, I'm still seeing -EBUSY from invalidate_inode_pages2_range
> > > > which doesn't end well...
> > > 
> > > Different issue, and I'm not sure why that WARN_ON() is there in the
> > > first place.  Note that in a similar situation generic_file_direct_write()
> > > simply buggers off and lets the caller do buffered write...
> > 
> > XFS does not fall back to buffered IO when direct IO fails.  A
> > direct IO failure is indicative of a problem that needs to be fixed,
> > not use a "let's hope we can hide this" fallback path. Especially in
> > this case - EBUSY usually comes from the app is doing something we
> > /know/ is dangerous and it's occurrence to completely timing
> > dependent - if the timing is slightly different, we miss detection
> > and that can lead to silent data corruption.
> 
> In this case app is a fuzzer, which is bloody well supposed to poke
> into all kinds of odd usage patterns, though...

Yup, and we have quite a few tests in xfstests that specifically
exercise this same dark corner. We filter out these warnings from
the xfstests that exercise this case, though, because we know they
are going to be emitted and so aren't a sign of test failures...

Cheers,

Dave.
Al Viro Sept. 10, 2017, 11:07 p.m. UTC | #10
On Mon, Sep 11, 2017 at 08:08:14AM +1000, Dave Chinner wrote:
> On Sun, Sep 10, 2017 at 10:19:07PM +0100, Al Viro wrote:
> > On Mon, Sep 11, 2017 at 07:11:10AM +1000, Dave Chinner wrote:
> > > On Sun, Sep 10, 2017 at 03:57:21AM +0100, Al Viro wrote:
> > > > On Sat, Sep 09, 2017 at 09:07:56PM -0400, Dave Jones wrote:
> > > > 
> > > > > With this in place, I'm still seeing -EBUSY from invalidate_inode_pages2_range
> > > > > which doesn't end well...
> > > > 
> > > > Different issue, and I'm not sure why that WARN_ON() is there in the
> > > > first place.  Note that in a similar situation generic_file_direct_write()
> > > > simply buggers off and lets the caller do buffered write...
> > > 
> > > XFS does not fall back to buffered IO when direct IO fails.  A
> > > direct IO failure is indicative of a problem that needs to be fixed,
> > > not use a "let's hope we can hide this" fallback path. Especially in
> > > this case - EBUSY usually comes from the app is doing something we
> > > /know/ is dangerous and it's occurrence to completely timing
> > > dependent - if the timing is slightly different, we miss detection
> > > and that can lead to silent data corruption.
> > 
> > In this case app is a fuzzer, which is bloody well supposed to poke
> > into all kinds of odd usage patterns, though...
> 
> Yup, and we have quite a few tests in xfstests that specifically
> exercise this same dark corner. We filter out these warnings from
> the xfstests that exercise this case, though, because we know they
> are going to be emitted and so aren't a sign of test failures...

BTW, another problem I see there is that iomap_dio_actor() should *NOT*
assume that do-while loop in there will always manage to shove 'length'
bytes out in case of success.  That is simply not true for pipe-backed
destination.  And I'm not sure if outright failures halfway through
are handled correctly.  What does it need a copy of dio->submit.iter for,
anyway?  Why not work with dio->submit.iter directly?
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Sept. 10, 2017, 11:15 p.m. UTC | #11
On Mon, Sep 11, 2017 at 12:07:23AM +0100, Al Viro wrote:

> BTW, another problem I see there is that iomap_dio_actor() should *NOT*
> assume that do-while loop in there will always manage to shove 'length'
> bytes out in case of success.  That is simply not true for pipe-backed
> destination.  And I'm not sure if outright failures halfway through
> are handled correctly.  What does it need a copy of dio->submit.iter for,
> anyway?  Why not work with dio->submit.iter directly?

I mean, if it's just a matter of iov_iter_truncate() to be undone in
the end, that's not hard to do - iov_iter_reexpand() is there.  Or is there
something more subtle in the play?
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Sept. 11, 2017, 12:31 a.m. UTC | #12
On Mon, Sep 11, 2017 at 12:07:23AM +0100, Al Viro wrote:
> On Mon, Sep 11, 2017 at 08:08:14AM +1000, Dave Chinner wrote:
> > On Sun, Sep 10, 2017 at 10:19:07PM +0100, Al Viro wrote:
> > > On Mon, Sep 11, 2017 at 07:11:10AM +1000, Dave Chinner wrote:
> > > > On Sun, Sep 10, 2017 at 03:57:21AM +0100, Al Viro wrote:
> > > > > On Sat, Sep 09, 2017 at 09:07:56PM -0400, Dave Jones wrote:
> > > > > 
> > > > > > With this in place, I'm still seeing -EBUSY from invalidate_inode_pages2_range
> > > > > > which doesn't end well...
> > > > > 
> > > > > Different issue, and I'm not sure why that WARN_ON() is there in the
> > > > > first place.  Note that in a similar situation generic_file_direct_write()
> > > > > simply buggers off and lets the caller do buffered write...
> > > > 
> > > > XFS does not fall back to buffered IO when direct IO fails.  A
> > > > direct IO failure is indicative of a problem that needs to be fixed,
> > > > not use a "let's hope we can hide this" fallback path. Especially in
> > > > this case - EBUSY usually comes from the app is doing something we
> > > > /know/ is dangerous and it's occurrence to completely timing
> > > > dependent - if the timing is slightly different, we miss detection
> > > > and that can lead to silent data corruption.
> > > 
> > > In this case app is a fuzzer, which is bloody well supposed to poke
> > > into all kinds of odd usage patterns, though...
> > 
> > Yup, and we have quite a few tests in xfstests that specifically
> > exercise this same dark corner. We filter out these warnings from
> > the xfstests that exercise this case, though, because we know they
> > are going to be emitted and so aren't a sign of test failures...
> 
> BTW, another problem I see there is that iomap_dio_actor() should *NOT*
> assume that do-while loop in there will always manage to shove 'length'
> bytes out in case of success.  That is simply not true for pipe-backed
> destination.

splice does not go down the direct IO path, so iomap_dio_actor()
should never be handled a pipe as the destination for the IO data.
Indeed, splice read has to supply the pages to be put into the pipe,
which the DIO path does not do - it requires pages be supplied to
it. So I'm not sure why we'd care about pipe destination limitations
in the DIO path?

> And I'm not sure if outright failures halfway through
> are handled correctly.  What does it need a copy of dio->submit.iter for,
> anyway?  Why not work with dio->submit.iter directly?

No idea - that's a question for Christoph...

Cheers,

Dave.
Al Viro Sept. 11, 2017, 3:32 a.m. UTC | #13
On Mon, Sep 11, 2017 at 10:31:13AM +1000, Dave Chinner wrote:

> splice does not go down the direct IO path, so iomap_dio_actor()
> should never be handled a pipe as the destination for the IO data.
> Indeed, splice read has to supply the pages to be put into the pipe,
> which the DIO path does not do - it requires pages be supplied to
> it. So I'm not sure why we'd care about pipe destination limitations
> in the DIO path?

splice doesn't give a rat's arse for direct IO; it's up to filesystem.
generic_file_splice_read() simply sets up a pipe-backed iov_iter and
calls ->read_iter(), period.

iov_iter_get_pages() for pipe-backed destination does page allocation
and inserts freshly allocated pages into pipe.  copy_to_iter() does
the same + copies data; copy_page_to_iter() grabs an extra reference
to page and inserts it into pipe, not that O_DIRECT ->read_iter()
had been likely to use the last one.

Normally O_DIRECT would work just fine - pages get allocated, references
to them put into pipe cyclic buffer *and* into a number of bio, bio
would get submitted and once the IO is completed we unlock the pipe,
making those pages available for readers.

With minimal care it works just fine - all you really need is
	* cope with failing copy_to_... / iov_iter_get_pages().
Short read if we'd already gotten something, -EFAULT otherwise.
That goes for pipe-backed same as for iovec-backed - any ->read_iter()
that fails to handle that is already in trouble.
	* make sure that iov_iter_get_pages()/iov_iter_get_pages_alloc()
is followed by iov_iter_advance() for the amount you've actually filled,
before any subsequent copy_to_iter()/copy_page_to_iter() or return
from ->read_iter(), whichever comes first.  That includes the situation
when you actually hadn't filled anything at all - just remember to
do iov_iter_advance(to, 0) in that case.  That's about the only
extra requirement imposed by pipes and it's not hard to satisfy.
Combination of iov_iter_advance() with iov_iter_revert() works as
usual.

Normally a filesystem doesn't need to care about splice at all -
just use generic_file_splice_read() and be done with that.
It will use the normal ->read_iter(), with whatever locking, etc.,
your filesystem would do on a normal read.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Sept. 11, 2017, 6:44 a.m. UTC | #14
On Mon, Sep 11, 2017 at 04:32:22AM +0100, Al Viro wrote:
> On Mon, Sep 11, 2017 at 10:31:13AM +1000, Dave Chinner wrote:
> 
> > splice does not go down the direct IO path, so iomap_dio_actor()
> > should never be handled a pipe as the destination for the IO data.
> > Indeed, splice read has to supply the pages to be put into the pipe,
> > which the DIO path does not do - it requires pages be supplied to
> > it. So I'm not sure why we'd care about pipe destination limitations
> > in the DIO path?
> 
> splice doesn't give a rat's arse for direct IO; it's up to filesystem.

[....]

It's news to me that splice works on direct IO - I thought it was
still required page cache based IO for file data for this stuff to
work. I must have missed the memo saying that splice interfaces now
work on O_DIRECT fds, there's certainly no documentation or comments
in the code I could have read to find this out myself...

As it is, we have very little test coverage for splice interfaces,
and all that I am aware of assumes that sendfile/splice only works
for buffered IO. So I'm not surprised there are bugs in this code,
it's likely to be completely untested.

I'm guessing the warnings are being thrown because sendfile's
source and/or destination is opened O_DIRECT and something else has
also mmap()d the same files and is doing concurrent sendfile/splice
and page faults. Hell, it could even be sendfile to/from the same
file mixing buffered and direct IO, or perhaps vmsplice of a mapped
range of the same file it's using as the O_DIRECT destination fd.
None of which are sane things to do and fall under the "not
supported" category....

> iov_iter_get_pages() for pipe-backed destination does page allocation
> and inserts freshly allocated pages into pipe.

Oh, it's hidden more layers down than the code implied I needed to
look.

i.e. there's no obvious clue in the function names that there is
allocation happening in these paths (get_pipe_pages ->
__get_pipe_pages -> push_pipe -> page allocation). The function
names imply it's getting a reference to pages (like
(get_user_pages()) and the fact it does allocation is inconsistent
with it's naming.  Worse, when push_pipe() fails to allocate pages,
the error __get_pipe_pages() returns is -EFAULT, which further hides
the fact push_pipe() does memory allocation that can fail....

And then there's the companion interface that implies page
allocation: pipe_get_pages_alloc(). Which brings me back to there
being no obvious clue while reading the code from the top down that
pages are being allocated in push_pipe()....

Comments and documentation for this code would help, but I can't
find any of that, either. Hence I assumed naming followed familiar
patterns and so mistook these interfaces being one that does page
allocation and the other for getting references to pre-existing
pages.....

[snip]

> Normally a filesystem doesn't need to care about splice at all -
> just use generic_file_splice_read() and be done with that.
> It will use the normal ->read_iter(), with whatever locking, etc.,
> your filesystem would do on a normal read.

Yup, that's my point - this is exactly what XFS does, and so I had
no clue that the generic splice code had been changed to accept and
use O_DIRECT semantics because no filesystem code was changed to
enable it.

Cheers,

Dave.
Christoph Hellwig Sept. 11, 2017, 12:07 p.m. UTC | #15
On Mon, Sep 11, 2017 at 12:07:23AM +0100, Al Viro wrote:
> BTW, another problem I see there is that iomap_dio_actor() should *NOT*
> assume that do-while loop in there will always manage to shove 'length'
> bytes out in case of success.  That is simply not true for pipe-backed
> destination.  And I'm not sure if outright failures halfway through
> are handled correctly.  What does it need a copy of dio->submit.iter for,
> anyway?  Why not work with dio->submit.iter directly?
> --

So that we only walk the pagetables and pin down the pages that
we can actually use in this iteration.

--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Sept. 11, 2017, 12:51 p.m. UTC | #16
On Mon, Sep 11, 2017 at 05:07:57AM -0700, Christoph Hellwig wrote:
> On Mon, Sep 11, 2017 at 12:07:23AM +0100, Al Viro wrote:
> > BTW, another problem I see there is that iomap_dio_actor() should *NOT*
> > assume that do-while loop in there will always manage to shove 'length'
> > bytes out in case of success.  That is simply not true for pipe-backed
> > destination.  And I'm not sure if outright failures halfway through
> > are handled correctly.  What does it need a copy of dio->submit.iter for,
> > anyway?  Why not work with dio->submit.iter directly?
> > --
> 
> So that we only walk the pagetables and pin down the pages that
> we can actually use in this iteration.

Er...  So why not simply do iov_iter_reexpand() in the end of segment with the
right argument?  IDGI...
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/iomap.c b/fs/iomap.c
index 269b24a01f32..4a671263475f 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -843,7 +843,7 @@  iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 		/*FALLTHRU*/
 	case IOMAP_UNWRITTEN:
 		if (!(dio->flags & IOMAP_DIO_WRITE)) {
-			iov_iter_zero(length, dio->submit.iter);
+			length = iov_iter_zero(length, dio->submit.iter);
 			dio->size += length;
 			return length;
 		}