
xfs: serialize unaligned dio writes against all other dio writes

Message ID 20190322165242.40662-1-bfoster@redhat.com (mailing list archive)
State Accepted, archived
Series: xfs: serialize unaligned dio writes against all other dio writes

Commit Message

Brian Foster March 22, 2019, 4:52 p.m. UTC
XFS applies more strict serialization constraints to unaligned
direct writes to accommodate things like direct I/O layer zeroing,
unwritten extent conversion, etc. Unaligned submissions acquire the
exclusive iolock and wait for in-flight dio to complete to ensure
multiple submissions do not race on the same block and cause data
corruption.

This generally works in the case of an aligned dio followed by an
unaligned dio, but the serialization is lost if I/Os occur in the
opposite order. If an unaligned write is submitted first and
immediately followed by an overlapping, aligned write, the latter
submits without the typical unaligned serialization barriers because
there is no indication of an unaligned dio still in-flight. This can
lead to unpredictable results.

To provide proper unaligned dio serialization, require that such
direct writes are always the only dio allowed in-flight at one time
for a particular inode. We already acquire the exclusive iolock and
drain pending dio before submitting the unaligned dio. Wait once
more after the dio submission to hold the iolock across the I/O and
prevent further submissions until the unaligned I/O completes. This
is heavy handed, but consistent with the current pre-submission
serialization for unaligned direct writes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

I was originally going to deal with this problem by hacking in an inode
flag to track unaligned dio writes in-flight and use that to block any
follow-on dio writes until cleared. Dave suggested we could use the
iolock to serialize by converting unaligned async dio writes to sync dio
writes and just letting the unaligned dio itself always block. That
seemed reasonable to me, but I morphed the approach slightly to just use
inode_dio_wait() because it seemed a bit cleaner. Thoughts?

Zorro,

You reproduced this problem originally. This patch addresses the problem in
the test case that reproduced for me. Care to confirm whether this patch
fixes the problem for you? Thanks.

Brian

 fs/xfs/xfs_file.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)
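
(As an illustrative aside: the racing submission pattern described above
boils down to something like the following minimal libaio sketch. The
path, block size and offsets here are hypothetical, and the device is
assumed to support 512 byte dio; Zorro's fstests case is the real
reproducer.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
	io_context_t	ctx = 0;
	struct iocb	unaligned, aligned;
	struct iocb	*iocbs[] = { &unaligned, &aligned };
	struct io_event	events[2];
	void		*buf1, *buf2;
	int		fd;

	fd = open("/mnt/scratch/file", O_RDWR | O_CREAT | O_DIRECT, 0644);
	if (fd < 0 || posix_memalign(&buf1, 4096, 4096) ||
	    posix_memalign(&buf2, 4096, 4096) || io_setup(2, &ctx))
		exit(1);
	memset(buf1, 0xab, 4096);
	memset(buf2, 0xcd, 4096);

	/*
	 * A sub-block (unaligned) dio write into block zero, immediately
	 * followed by an overlapping block-aligned dio write. Both can be
	 * in-flight at once, so sub-block zeroing for the first write can
	 * race with the second.
	 */
	io_prep_pwrite(&unaligned, fd, buf1, 512, 0);
	io_prep_pwrite(&aligned, fd, buf2, 4096, 0);
	if (io_submit(ctx, 2, iocbs) != 2)
		exit(1);
	if (io_getevents(ctx, 2, 2, events, NULL) != 2)
		exit(1);
	io_destroy(ctx);
	return 0;
}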

Comments

Allison Henderson March 22, 2019, 8:46 p.m. UTC | #1
Ok, I think your solution looks good.  You can add my review.  Thanks!

Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

On 3/22/19 9:52 AM, Brian Foster wrote:
> [ ... ]
Zorro Lang March 23, 2019, 10:29 a.m. UTC | #2
On Fri, Mar 22, 2019 at 12:52:42PM -0400, Brian Foster wrote:
> [ ... ]
> 
> Zorro,
> 
> You reproduced this problem originally. This patch addresses the problem in
> the test case that reproduced for me. Care to confirm whether this patch
> fixes the problem for you? Thanks.

Hi Brian,

Sure, but I can't reproduce this bug on an upstream kernel. I have to merge
it into an older kernel (you know that :) to verify that it works.

If the upstream kernel has this issue too, do you have a better idea for
reproducing it there? Maybe I can improve my test case to cover more.

Thanks,
Zorro

Dave Chinner March 24, 2019, 8:59 p.m. UTC | #3
On Fri, Mar 22, 2019 at 12:52:42PM -0400, Brian Foster wrote:
> [ ... ]
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 770cc2edf777..8b2aaed82343 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -529,18 +529,19 @@ xfs_file_dio_aio_write(
>  	count = iov_iter_count(from);
>  
>  	/*
> -	 * If we are doing unaligned IO, wait for all other IO to drain,
> -	 * otherwise demote the lock if we had to take the exclusive lock
> -	 * for other reasons in xfs_file_aio_write_checks.
> +	 * If we are doing unaligned IO, we can't allow any other IO in-flight

* any other overlapping IO in-flight

> +	 * at the same time or we risk data corruption. Wait for all other IO to
> +	 * drain, submit and wait for completion before we release the iolock.
> +	 *
> +	 * If the IO is aligned, demote the iolock if we had to take the
> +	 * exclusive lock in xfs_file_aio_write_checks() for other reasons.
>  	 */
>  	if (unaligned_io) {
> -		/* If we are going to wait for other DIO to finish, bail */
> -		if (iocb->ki_flags & IOCB_NOWAIT) {
> -			if (atomic_read(&inode->i_dio_count))
> -				return -EAGAIN;
> -		} else {
> +		/* unaligned dio always waits, bail */
> +		if (iocb->ki_flags & IOCB_NOWAIT)
> +			return -EAGAIN;
> +		else
>  			inode_dio_wait(inode);
> -		}
>  	} else if (iolock == XFS_IOLOCK_EXCL) {
>  		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
>  		iolock = XFS_IOLOCK_SHARED;
> @@ -548,6 +549,8 @@ xfs_file_dio_aio_write(
>  
>  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
>  	ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);
> +	if (unaligned_io && !is_sync_kiocb(iocb))
> +		inode_dio_wait(inode);

If it's AIO and it has already been completed, then this wait is
unnecessary. i.e. we only need to wait in the case where AIO has
been queued but not completed:

	/*
	 * If we are doing unaligned IO, it will be the only IO in
	 * progress right now. If it has not completed yet, wait on
	 * it before we drop the IOLOCK.
	 */
	if (ret == -EIOCBQUEUED && unaligned_io)
		inode_dio_wait(inode);

Next question: do we need to change the return value here to reflect
the actual completion result?

Hmmmm.  iomap_dio_complete() will return either the IO byte count or
an error for synchronous IO. And for AIO, ->ki_complete() will only be
called by the iomap bio completion path if it's the last reference.
So for AIO that is completed before the submitter returns, it will
return the result of iomap_dio_complete() without having called
iocb->ki_complete(). Which means we want to return a byte count or
IO error to the higher layers, and that will result in
aio_read/aio_write calling aio_rw_done() and calling the completion
appropriately.

Ok, so we don't need to futz with the return value, and we only
need to check for ret == -EIOCBQUEUED to determine if we should wait
or not, because any other return value indicates either IO completion
or an error has already occurred.

Cheers,

Dave.
Zorro Lang March 25, 2019, 3:47 a.m. UTC | #4
On Sat, Mar 23, 2019 at 06:29:30PM +0800, Zorro Lang wrote:
> On Fri, Mar 22, 2019 at 12:52:42PM -0400, Brian Foster wrote:
> > [ ... ]
> > 
> > Zorro,
> > 
> > You reproduced this problem originally. This patch addresses the problem in
> > the test case that reproduced for me. Care to confirm whether this patch
> > fixes the problem for you? Thanks.
> 
> Hi Brian,
> 
> Sure, but I can't reproduce this bug on an upstream kernel. I have to merge
> it into an older kernel (you know that :) to verify that it works.

I merged your patch into an older kernel on which I can reproduce this bug,
and the test passed.

BTW, Eryu said he hit this bug on upstream v5.0, on a KVM machine with LVM
devices, when he ran the case I sent to fstests@. So I think it's
reproducible on an upstream kernel; we just need the right conditions to
trigger it. If you know how to create those conditions, please tell me, and
I'll think about whether I can write another case to cover this bug
specifically.

Thanks,
Zorro

Christoph Hellwig March 25, 2019, 11:06 a.m. UTC | #5
>  	if (unaligned_io) {
> +		/* unaligned dio always waits, bail */
> +		if (iocb->ki_flags & IOCB_NOWAIT)
> +			return -EAGAIN;
> +		else
>  			inode_dio_wait(inode);

No need for the else here.

>  	} else if (iolock == XFS_IOLOCK_EXCL) {
>  		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
>  		iolock = XFS_IOLOCK_SHARED;
> @@ -548,6 +549,8 @@ xfs_file_dio_aio_write(
>  
>  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
>  	ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);
> +	if (unaligned_io && !is_sync_kiocb(iocb))
> +		inode_dio_wait(inode);

Instead of explicitly waiting here I'd much rather just mark the
I/O as sync before submitting it.  The only thing needed for that
is to clear iocb->ki_complete.  To avoid too much low-level hacking
that is probably best done with a:

static inline void mark_kiocb_sync(struct kiocb *kiocb)
{
	kiocb->ki_complete = NULL;
}

helper in fs.h.
Brian Foster March 25, 2019, 1:45 p.m. UTC | #6
On Mon, Mar 25, 2019 at 11:47:22AM +0800, Zorro Lang wrote:
> On Sat, Mar 23, 2019 at 06:29:30PM +0800, Zorro Lang wrote:
> > On Fri, Mar 22, 2019 at 12:52:42PM -0400, Brian Foster wrote:
> > > [ ... ]
> > > 
> > > Zorro,
> > > 
> > > You reproduced this problem originally. This patch addresses the problem in
> > > the test case that reproduced for me. Care to confirm whether this patch
> > > fixes the problem for you? Thanks.
> > 
> > Hi Brian,
> > 
> > Sure, but I can't reproduce this bug on an upstream kernel. I have to merge
> > it into an older kernel (you know that :) to verify that it works.
> 
> I merged your patch into an older kernel on which I can reproduce this bug,
> and the test passed.
> 

Excellent, thanks.

> BTW, Eryu said he hit this bug on upstream v5.0, on a KVM machine with LVM
> devices, when he ran the case I sent to fstests@. So I think it's
> reproducible on an upstream kernel; we just need the right conditions to
> trigger it. If you know how to create those conditions, please tell me, and
> I'll think about whether I can write another case to cover this bug
> specifically.
> 

Ok. I could only reproduce with your custom reproducer (over a loop device) on
kernels between v4.14 and v4.20 (inclusive). More specifically, I could
reproduce between commits 546e7be824 ("iomap_dio_rw: Allocate AIO
completion queue before submitting dio") and a79d40e9b0 ("aio: only use
blk plugs for > 2 depth submissions"). As discussed, these commits
mostly just alter timing and thus affect the issue indirectly.

I haven't taken a closer look at the fstest yet beyond the custom
variant you provided. I wonder if a test could reproduce this more
effectively by increasing the load of the test..? For example, can you
reproduce by running many iterations of the I/O? What about running a
deeper queue and submitting many such overlapping aligned/unaligned I/Os
all at once (to the same or different offsets of the file)? Just a
thought..
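
Something like the following submission-side sketch, say (hypothetical
code; it assumes the usual O_DIRECT test setup with a libaio context,
posix_memalign()ed buffers, and a device that supports 512 byte dio):

#include <libaio.h>
#include <stdlib.h>

#define NR_PAIRS	64
#define BLKSZ		4096

/*
 * Pair a sub-block (unaligned) write with an overlapping aligned write
 * at each block offset and push all of them into flight with a single
 * io_submit() to maximize the chance of overlap.
 */
static void
submit_overlapping_pairs(io_context_t ctx, int fd, void *ubuf, void *abuf)
{
	struct iocb	iocbs[2 * NR_PAIRS];
	struct iocb	*ptrs[2 * NR_PAIRS];
	struct io_event	events[2 * NR_PAIRS];
	int		i;

	for (i = 0; i < NR_PAIRS; i++) {
		long long	off = (long long)i * BLKSZ;

		io_prep_pwrite(&iocbs[2 * i], fd, ubuf, 512, off);
		io_prep_pwrite(&iocbs[2 * i + 1], fd, abuf, BLKSZ, off);
		ptrs[2 * i] = &iocbs[2 * i];
		ptrs[2 * i + 1] = &iocbs[2 * i + 1];
	}

	if (io_submit(ctx, 2 * NR_PAIRS, ptrs) != 2 * NR_PAIRS)
		exit(1);
	if (io_getevents(ctx, 2 * NR_PAIRS, 2 * NR_PAIRS, events, NULL) !=
	    2 * NR_PAIRS)
		exit(1);
}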

Brian

Brian Foster March 25, 2019, 1:48 p.m. UTC | #7
On Mon, Mar 25, 2019 at 07:59:26AM +1100, Dave Chinner wrote:
> On Fri, Mar 22, 2019 at 12:52:42PM -0400, Brian Foster wrote:
> > [ ... ]
> > 
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 770cc2edf777..8b2aaed82343 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -529,18 +529,19 @@ xfs_file_dio_aio_write(
> >  	count = iov_iter_count(from);
> >  
> >  	/*
> > -	 * If we are doing unaligned IO, wait for all other IO to drain,
> > -	 * otherwise demote the lock if we had to take the exclusive lock
> > -	 * for other reasons in xfs_file_aio_write_checks.
> > +	 * If we are doing unaligned IO, we can't allow any other IO in-flight
> 
> * any other overlapping IO in-flight
> 

Ack.

> > +	 * at the same time or we risk data corruption. Wait for all other IO to
> > +	 * drain, submit and wait for completion before we release the iolock.
> > +	 *
> > +	 * If the IO is aligned, demote the iolock if we had to take the
> > +	 * exclusive lock in xfs_file_aio_write_checks() for other reasons.
> >  	 */
> >  	if (unaligned_io) {
> > -		/* If we are going to wait for other DIO to finish, bail */
> > -		if (iocb->ki_flags & IOCB_NOWAIT) {
> > -			if (atomic_read(&inode->i_dio_count))
> > -				return -EAGAIN;
> > -		} else {
> > +		/* unaligned dio always waits, bail */
> > +		if (iocb->ki_flags & IOCB_NOWAIT)
> > +			return -EAGAIN;
> > +		else
> >  			inode_dio_wait(inode);
> > -		}
> >  	} else if (iolock == XFS_IOLOCK_EXCL) {
> >  		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
> >  		iolock = XFS_IOLOCK_SHARED;
> > @@ -548,6 +549,8 @@ xfs_file_dio_aio_write(
> >  
> >  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
> >  	ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);
> > +	if (unaligned_io && !is_sync_kiocb(iocb))
> > +		inode_dio_wait(inode);
> 
> If it's AIO and it has already been completed, then this wait is
> unnecessary. i.e. we only need to wait in the case where AIO has
> been queued but not completed:
> 

Yeah, I figured it would be a no-op...

> 	/*
> 	 * If we are doing unaligned IO, it will be the only IO in
> 	 * progress right now. If it has not completed yet, wait on
> 	 * it before we drop the IOLOCK.
> 	 */
> 	if (ret == -EIOCBQUEUED && unaligned_io)
> 		inode_dio_wait(inode);
> 

... but this looks fine to me. This also nicely filters out both the
sync and fast completion cases, which is a bit more consistent than just
filtering out the sync case.
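
I.e., the tail of xfs_file_dio_aio_write() would end up looking something
like this (an untested sketch on top of the posted patch):

	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
	ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);

	/*
	 * If we are doing unaligned IO, it is the only IO in progress
	 * right now. Any return value other than -EIOCBQUEUED means the
	 * IO has already completed (or failed), so we only need to wait
	 * on a queued-but-incomplete AIO before dropping the iolock.
	 */
	if (ret == -EIOCBQUEUED && unaligned_io)
		inode_dio_wait(inode);
out:
	xfs_iunlock(ip, iolock);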

> Next question: do we need to change the return value here to reflect
> the actual completion result?
> 

As noted in the bug report (and the reason you've outlined below), we
only have to consider the return value if we screw around with the
semantics of the I/O before we submit it to the iomap/dio code (i.e.,
change from async to sync). Thanks for the review..

Brian

> Hmmmm.  iomap_dio_complete() will return either the IO byte count or
> an error for synchronous IO. And for AIO, ->ki_complete() will only be
> called by the iomap bio completion path if it's the last reference.
> So for AIO that is completed before the submitter returns, it will
> return the result of iomap_dio_complete() without having called
> iocb->ki_complete(). Which means we want to return a byte count or
> IO error to the higher layers, and that will result in
> aio_read/aio_write calling aio_rw_done() and calling the completion
> appropriately.
> 
> Ok, so we don't need to futz with the return value, and we only
> need to check for ret == -EIOCBQUEUED to determine if we should wait
> or not, because any other return value indicates either IO completion
> or an error has already occurred.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Brian Foster March 25, 2019, 1:51 p.m. UTC | #8
On Mon, Mar 25, 2019 at 04:06:46AM -0700, Christoph Hellwig wrote:
> >  	if (unaligned_io) {
> > +		/* unaligned dio always waits, bail */
> > +		if (iocb->ki_flags & IOCB_NOWAIT)
> > +			return -EAGAIN;
> > +		else
> >  			inode_dio_wait(inode);
> 
> No need for the else here.
> 

Ack.

> >  	} else if (iolock == XFS_IOLOCK_EXCL) {
> >  		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
> >  		iolock = XFS_IOLOCK_SHARED;
> > @@ -548,6 +549,8 @@ xfs_file_dio_aio_write(
> >  
> >  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
> >  	ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);
> > +	if (unaligned_io && !is_sync_kiocb(iocb))
> > +		inode_dio_wait(inode);
> 
> Instead of explicitly waiting here I'd much rather just mark the
> I/O as sync before submitting it.  The only thing needed for that
> is to clear iocb->ki_complete.  To avoid too much low-level hacking
> that is probably best done with a:
> 
> static inline void mark_kiocb_sync(struct kiocb *kiocb)
> {
> 	kiocb->ki_complete = NULL;
> }
> 
> helper in fs.h.

It's not quite that simple..

FWIW, the discussion (between Dave and me) on how best to solve this
began offline prior to sending the patch and pretty much started with
the idea of changing the async I/O to sync as you suggest here. I backed
off from that because it's too subtle given the semantics between the
higher level aio code and lower level dio code for async I/O. By that I
mean either can be responsible for calling the ->ki_complete() callback
in the iocb on I/O completion.

IOW, if we receive an async direct I/O, clear ->ki_complete() as you
describe above and submit it, the dio code will wait on I/O and return
the size of the I/O on successful completion. It will not have called
->ki_complete(), however. Rather, the >0 return value indicates that
aio_rw_done() must call ->ki_complete() after xfs_file_write_iter()
returns, but we would have already cleared the function pointer.

I think it is technically possible to use this technique by clearing and
restoring ->ki_complete(), but in general we've visited this "change the
I/O type" approach twice now and we've (collectively) got it wrong both
times (the first error in thinking was that XFS would need to call
->ki_complete()). IMO, this demonstrates that it's not worth the
complexity to insert ourselves into this dependency chain when we can
accomplish the same thing with a simple dio wait call.
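
(For illustration only, a hypothetical sketch of that clear-and-restore
variant; the helper name is made up, and this is exactly the kind of
subtlety argued against above:)

/*
 * Clearing ->ki_complete() makes is_sync_kiocb() true, so iomap_dio_rw()
 * waits for completion and returns a byte count or error. Restoring the
 * callback afterwards lets aio_rw_done() deliver that >0 return value
 * via ->ki_complete() as usual.
 */
static ssize_t
xfs_dio_rw_forced_sync(
	struct kiocb		*iocb,
	struct iov_iter		*from)
{
	void (*complete)(struct kiocb *, long, long) = iocb->ki_complete;
	ssize_t			ret;

	iocb->ki_complete = NULL;
	ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);
	iocb->ki_complete = complete;
	return ret;
}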

Brian

Patch

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 770cc2edf777..8b2aaed82343 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -529,18 +529,19 @@ xfs_file_dio_aio_write(
 	count = iov_iter_count(from);
 
 	/*
-	 * If we are doing unaligned IO, wait for all other IO to drain,
-	 * otherwise demote the lock if we had to take the exclusive lock
-	 * for other reasons in xfs_file_aio_write_checks.
+	 * If we are doing unaligned IO, we can't allow any other IO in-flight
+	 * at the same time or we risk data corruption. Wait for all other IO to
+	 * drain, submit and wait for completion before we release the iolock.
+	 *
+	 * If the IO is aligned, demote the iolock if we had to take the
+	 * exclusive lock in xfs_file_aio_write_checks() for other reasons.
 	 */
 	if (unaligned_io) {
-		/* If we are going to wait for other DIO to finish, bail */
-		if (iocb->ki_flags & IOCB_NOWAIT) {
-			if (atomic_read(&inode->i_dio_count))
-				return -EAGAIN;
-		} else {
+		/* unaligned dio always waits, bail */
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EAGAIN;
+		else
 			inode_dio_wait(inode);
-		}
 	} else if (iolock == XFS_IOLOCK_EXCL) {
 		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
 		iolock = XFS_IOLOCK_SHARED;
@@ -548,6 +549,8 @@ xfs_file_dio_aio_write(
 
 	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
 	ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);
+	if (unaligned_io && !is_sync_kiocb(iocb))
+		inode_dio_wait(inode);
 out:
 	xfs_iunlock(ip, iolock);