Message ID | 20190322165242.40662-1-bfoster@redhat.com
---|---
State | Accepted, archived
Series | xfs: serialize unaligned dio writes against all other dio writes
Ok, I think your solution looks good. You can add my review. Thanks!

Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

On 3/22/19 9:52 AM, Brian Foster wrote:
> XFS applies more strict serialization constraints to unaligned
> direct writes to accommodate things like direct I/O layer zeroing,
> unwritten extent conversion, etc. Unaligned submissions acquire the
> exclusive iolock and wait for in-flight dio to complete to ensure
> multiple submissions do not race on the same block and cause data
> corruption.
> [...]
On Fri, Mar 22, 2019 at 12:52:42PM -0400, Brian Foster wrote:
> [...]
>
> Zorro,
>
> You reproduced this problem originally. It addresses the problem in the
> test case that reproduced for me. Care to confirm whether this patch
> fixes the problem for you? Thanks.

Hi Brian,

Sure, but I can't reproduce this bug on an upstream kernel. I have to merge
it into an older kernel (you know that :) to verify that it works.

If the upstream kernel has this issue too, do you have a better idea for
reproducing it there? Maybe I can improve my case to cover more.

Thanks,
Zorro
On Fri, Mar 22, 2019 at 12:52:42PM -0400, Brian Foster wrote:
> [...]
> @@ -529,18 +529,19 @@ xfs_file_dio_aio_write(
>  	count = iov_iter_count(from);
>  
>  	/*
> -	 * If we are doing unaligned IO, wait for all other IO to drain,
> -	 * otherwise demote the lock if we had to take the exclusive lock
> -	 * for other reasons in xfs_file_aio_write_checks.
> +	 * If we are doing unaligned IO, we can't allow any other IO in-flight

	 * any other overlapping IO in-flight

> +	 * at the same time or we risk data corruption. Wait for all other IO to
> +	 * drain, submit and wait for completion before we release the iolock.
> +	 *
> +	 * If the IO is aligned, demote the iolock if we had to take the
> +	 * exclusive lock in xfs_file_aio_write_checks() for other reasons.
>  	 */
> [...]
> @@ -548,6 +549,8 @@ xfs_file_dio_aio_write(
>  
>  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
>  	ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);
> +	if (unaligned_io && !is_sync_kiocb(iocb))
> +		inode_dio_wait(inode);

If it's AIO and it has already been completed, then this wait is
unnecessary, i.e. we only need to wait in the case where AIO has
been queued but not completed:

	/*
	 * If we are doing unaligned IO, it will be the only IO in
	 * progress right now. If it has not completed yet, wait on
	 * it before we drop the IOLOCK.
	 */
	if (ret == -EIOCBQUEUED && unaligned_io)
		inode_dio_wait(inode);

Next question: do we need to change the return value here to reflect
the actual completion result?

Hmmmm. iomap_dio_complete() will return either the IO byte count or
an error for synchronous IO. And for AIO, ->ki_complete() will only be
called by the iomap bio completion path if it's the last reference.
So for AIO that is completed before the submitter returns, it will
return the result of iomap_dio_complete() without having called
iocb->ki_complete(). Which means we want to return a byte count or
IO error to the higher layers, and that will result in
aio_read/aio_write calling aio_rw_done() and calling the completion
appropriately.

Ok, so we don't need to futz with the return value, and we only need
to check for ret == -EIOCBQUEUED to determine whether we should wait,
because any other return value indicates that either IO completion or
an error has already occurred.

Cheers,

Dave.
On Sat, Mar 23, 2019 at 06:29:30PM +0800, Zorro Lang wrote:
> On Fri, Mar 22, 2019 at 12:52:42PM -0400, Brian Foster wrote:
> > [...]
> > Zorro,
> >
> > You reproduced this problem originally. It addresses the problem in the
> > test case that reproduced for me. Care to confirm whether this patch
> > fixes the problem for you? Thanks.
>
> Hi Brian,
>
> Sure, but I can't reproduce this bug on an upstream kernel. I have to
> merge it into an older kernel (you know that :) to verify that it works.

I merged your patch into an older kernel on which I can reproduce this bug.
The test then passed.

BTW, Eryu said he hit this bug on upstream v5.0, on a KVM machine with LVM
devices, when he ran the case I sent to fstests@. So I think it's
reproducible on an upstream kernel; we just need certain conditions to
trigger it. If you know how to create those conditions, please tell me and
I'll think about whether I can write another case to cover this bug
specifically.

Thanks,
Zorro
> 	if (unaligned_io) {
> +		/* unaligned dio always waits, bail */
> +		if (iocb->ki_flags & IOCB_NOWAIT)
> +			return -EAGAIN;
> +		else
>  			inode_dio_wait(inode);

No need for the else here.

> 	} else if (iolock == XFS_IOLOCK_EXCL) {
>  		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
>  		iolock = XFS_IOLOCK_SHARED;
> @@ -548,6 +549,8 @@ xfs_file_dio_aio_write(
>  
>  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
>  	ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);
> +	if (unaligned_io && !is_sync_kiocb(iocb))
> +		inode_dio_wait(inode);

Instead of explicitly waiting here I'd much rather just mark the
I/O as sync before submitting it. The only thing needed for that
is to clear iocb->ki_complete. To avoid too much low-level hacking
that is probably best done with a:

	static inline void mark_kiocb_sync(struct kiocb *kiocb)
	{
		kiocb->ki_complete = NULL;
	}

helper in fs.h.
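For context, the xfs_file_dio_aio_write() call site under Christoph's
suggestion would presumably end up looking something like the sketch below
(an illustration only; no such patch appears in this thread, and
mark_kiocb_sync() exists only as the proposal above). Brian's reply to this
further down explains why the conversion is subtler than it looks:

	if (unaligned_io) {
		if (iocb->ki_flags & IOCB_NOWAIT)
			return -EAGAIN;
		inode_dio_wait(inode);
		/* tell the dio layer to wait for this I/O itself */
		mark_kiocb_sync(iocb);
	} else if (iolock == XFS_IOLOCK_EXCL) {
		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
		iolock = XFS_IOLOCK_SHARED;
	}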
On Mon, Mar 25, 2019 at 11:47:22AM +0800, Zorro Lang wrote:
> On Sat, Mar 23, 2019 at 06:29:30PM +0800, Zorro Lang wrote:
> > [...]
> > Sure, but I can't reproduce this bug on an upstream kernel. I have to
> > merge it into an older kernel (you know that :) to verify that it works.
>
> I merged your patch into an older kernel on which I can reproduce this
> bug. The test then passed.

Excellent, thanks.

> BTW, Eryu said he hit this bug on upstream v5.0, on a KVM machine with LVM
> devices, when he ran the case I sent to fstests@. So I think it's
> reproducible on an upstream kernel; we just need certain conditions to
> trigger it. If you know how to create those conditions, please tell me and
> I'll think about whether I can write another case to cover this bug
> specifically.

Ok. I could only reproduce with your custom reproducer (over loop) on
kernels between v4.14 and v4.20 (inclusive). More specifically, I could
reproduce between commits 546e7be824 ("iomap_dio_rw: Allocate AIO
completion queue before submitting dio") and a79d40e9b0 ("aio: only use
blk plugs for > 2 depth submissions"). As discussed, these commits mostly
just alter timing and thus affect the issue indirectly.

I haven't taken a closer look at the fstest yet beyond the custom variant
you provided. I wonder if a test could reproduce this more effectively by
increasing its load. For example, can you reproduce by running many
iterations of the I/O? What about running a deeper queue and submitting
many such overlapping aligned/unaligned I/Os all at once (to the same or
different offsets in the file)? Just a thought.

Brian
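The racy submission pattern at issue, an unaligned async dio immediately
followed by an overlapping aligned dio against the same inode, can be
sketched with libaio roughly as below. This is a minimal illustration under
stated assumptions (file name, a 4k filesystem block size on a 512-byte
sector device, offsets, and the omitted error handling are all arbitrary),
not the fstests case discussed in this thread, and actually hitting the
race depends on timing, as noted above. Build with something like
"cc repro.c -laio":

	#define _GNU_SOURCE		/* for O_DIRECT */
	#include <fcntl.h>
	#include <libaio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	#define FSB	4096		/* assumed filesystem block size */

	int main(void)
	{
		io_context_t ctx = 0;
		struct iocb cb[2], *cbs[2] = { &cb[0], &cb[1] };
		struct io_event ev[2];
		void *buf;
		int fd;

		fd = open("testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);
		if (fd < 0 || posix_memalign(&buf, FSB, 2 * FSB))
			return 1;
		memset(buf, 0xab, 2 * FSB);

		if (io_setup(2, &ctx))
			return 1;

		/* sub-block (unaligned) async dio write into the first fs
		 * block; 512 bytes keeps it sector-aligned for O_DIRECT */
		io_prep_pwrite(&cb[0], fd, buf, 512, 0);
		/* ...immediately followed by an overlapping, fs-block-aligned
		 * write to the same offset */
		io_prep_pwrite(&cb[1], fd, (char *)buf + FSB, FSB, 0);

		/* submit both in one call: this is the ordering the patch
		 * serializes */
		if (io_submit(ctx, 2, cbs) != 2)
			return 1;
		io_getevents(ctx, 2, 2, ev, NULL);

		io_destroy(ctx);
		close(fd);
		free(buf);
		return 0;
	}

On a kernel without the fix, the unaligned write's iolock is dropped right
after submission, so the aligned write can be in flight concurrently; that
is the window the patch closes.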
On Mon, Mar 25, 2019 at 07:59:26AM +1100, Dave Chinner wrote:
> On Fri, Mar 22, 2019 at 12:52:42PM -0400, Brian Foster wrote:
> > [...]
> > +	 * If we are doing unaligned IO, we can't allow any other IO in-flight
>
>	 * any other overlapping IO in-flight

Ack.

> > [...]
> >  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
> >  	ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);
> > +	if (unaligned_io && !is_sync_kiocb(iocb))
> > +		inode_dio_wait(inode);
>
> If it's AIO and it has already been completed, then this wait is
> unnecessary, i.e. we only need to wait in the case where AIO has
> been queued but not completed:

Yeah, I figured it would be a no-op...

> 	/*
> 	 * If we are doing unaligned IO, it will be the only IO in
> 	 * progress right now. If it has not completed yet, wait on
> 	 * it before we drop the IOLOCK.
> 	 */
> 	if (ret == -EIOCBQUEUED && unaligned_io)
> 		inode_dio_wait(inode);

... but this looks fine to me. It also nicely filters out both the sync
and fast completion cases, which is a bit more consistent than just
filtering out the sync case.

> Next question: do we need to change the return value here to reflect
> the actual completion result?

As noted in the bug report (and for the reason you've outlined below), we
only have to consider the return value if we screw around with the
semantics of the I/O before we submit it to the iomap/dio code (i.e.,
change it from async to sync).

Thanks for the review.

Brian

> Hmmmm. iomap_dio_complete() will return either the IO byte count or
> an error for synchronous IO.
> [...]
> Ok, so we don't need to futz with the return value, and we only need
> to check for ret == -EIOCBQUEUED to determine whether we should wait,
> because any other return value indicates that either IO completion or
> an error has already occurred.
>
> Cheers,
>
> Dave.
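Folding Dave's check into the patch, the post-submission hunk would
presumably end up as sketched below (this thread does not include a v2, so
the exact form is an assumption built from Dave's snippet and Brian's
original hunk):

	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
	ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);

	/*
	 * If we are doing unaligned IO, it is the only IO in progress right
	 * now. If it was queued (-EIOCBQUEUED) and has not yet completed,
	 * wait on it before dropping the iolock so no overlapping dio can
	 * be submitted in the meantime.
	 */
	if (ret == -EIOCBQUEUED && unaligned_io)
		inode_dio_wait(inode);
out:
	xfs_iunlock(ip, iolock);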
On Mon, Mar 25, 2019 at 04:06:46AM -0700, Christoph Hellwig wrote:
> > 	if (unaligned_io) {
> > +		/* unaligned dio always waits, bail */
> > +		if (iocb->ki_flags & IOCB_NOWAIT)
> > +			return -EAGAIN;
> > +		else
> >  			inode_dio_wait(inode);
>
> No need for the else here.

Ack.

> > [...]
> >  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
> >  	ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);
> > +	if (unaligned_io && !is_sync_kiocb(iocb))
> > +		inode_dio_wait(inode);
>
> Instead of explicitly waiting here I'd much rather just mark the
> I/O as sync before submitting it. The only thing needed for that
> is to clear iocb->ki_complete.
> [...]

It's not quite that simple. FWIW, the discussion (between Dave and me) on
how best to solve this started offline prior to sending the patch, and it
began with the idea of changing the async I/O to sync, as you suggest
here. I backed off from that because it's too subtle given the semantics
between the higher-level aio code and the lower-level dio code for async
I/O. By that I mean either one can be responsible for calling the
->ki_complete() callback in the iocb on I/O completion.

IOW, if we receive an async direct I/O, clear ->ki_complete() as you
describe above and submit it, the dio code will wait on the I/O and return
the size of the I/O on successful completion. It will not have called
->ki_complete(), however. Rather, the >0 return value indicates that
aio_rw_done() must call ->ki_complete() after xfs_file_write_iter()
returns, but we would have already cleared the function pointer.

I think it is technically possible to use this technique by clearing and
restoring ->ki_complete(), but in general we've visited this "change the
I/O type" approach twice now and we've (collectively) got it wrong both
times (the first error in thinking was that XFS would need to call
->ki_complete()). IMO, this demonstrates that it's not worth the
complexity to insert ourselves into this dependency chain when we can
accomplish the same thing with a simple dio wait call.

Brian
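The clear-and-restore variant Brian describes as technically possible, but
not worth the subtlety, would look roughly like the sketch below. This is a
hypothetical illustration of the pitfall, not code anyone proposed merging,
and it assumes the circa-v5.0 ->ki_complete() signature of
(struct kiocb *, long, long):

	/*
	 * Hypothetical sketch: force a sync dio by clearing ->ki_complete,
	 * then restore it so the aio layer can still run the completion.
	 */
	void (*complete)(struct kiocb *, long, long) = iocb->ki_complete;

	if (unaligned_io)
		iocb->ki_complete = NULL;	/* iomap_dio_rw() now waits */

	ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);

	/*
	 * Restore the pointer before returning: for a formerly async iocb,
	 * ret is now a byte count (or error) and aio_rw_done() still expects
	 * to invoke ->ki_complete() itself. Forgetting this restore is
	 * exactly the trap described above.
	 */
	iocb->ki_complete = complete;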
XFS applies more strict serialization constraints to unaligned
direct writes to accommodate things like direct I/O layer zeroing,
unwritten extent conversion, etc. Unaligned submissions acquire the
exclusive iolock and wait for in-flight dio to complete to ensure
multiple submissions do not race on the same block and cause data
corruption.

This generally works in the case of an aligned dio followed by an
unaligned dio, but the serialization is lost if I/Os occur in the
opposite order. If an unaligned write is submitted first and
immediately followed by an overlapping, aligned write, the latter
submits without the typical unaligned serialization barriers because
there is no indication of an unaligned dio still in-flight. This can
lead to unpredictable results.

To provide proper unaligned dio serialization, require that such
direct writes are always the only dio allowed in-flight at one time
for a particular inode. We already acquire the exclusive iolock and
drain pending dio before submitting the unaligned dio. Wait once
more after the dio submission to hold the iolock across the I/O and
prevent further submissions until the unaligned I/O completes. This
is heavy handed, but consistent with the current pre-submission
serialization for unaligned direct writes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

I was originally going to deal with this problem by hacking in an inode
flag to track unaligned dio writes in-flight and use that to block any
follow on dio writes until cleared. Dave suggested we could use the
iolock to serialize by converting unaligned async dio writes to sync dio
writes and just letting the unaligned dio itself always block. That
seemed reasonable to me, but I morphed the approach slightly to just use
inode_dio_wait() because it seemed a bit cleaner. Thoughts?

Zorro,

You reproduced this problem originally. It addresses the problem in the
test case that reproduced for me. Care to confirm whether this patch
fixes the problem for you? Thanks.

Brian

 fs/xfs/xfs_file.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 770cc2edf777..8b2aaed82343 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -529,18 +529,19 @@ xfs_file_dio_aio_write(
 	count = iov_iter_count(from);
 
 	/*
-	 * If we are doing unaligned IO, wait for all other IO to drain,
-	 * otherwise demote the lock if we had to take the exclusive lock
-	 * for other reasons in xfs_file_aio_write_checks.
+	 * If we are doing unaligned IO, we can't allow any other IO in-flight
+	 * at the same time or we risk data corruption. Wait for all other IO to
+	 * drain, submit and wait for completion before we release the iolock.
+	 *
+	 * If the IO is aligned, demote the iolock if we had to take the
+	 * exclusive lock in xfs_file_aio_write_checks() for other reasons.
 	 */
 	if (unaligned_io) {
-		/* If we are going to wait for other DIO to finish, bail */
-		if (iocb->ki_flags & IOCB_NOWAIT) {
-			if (atomic_read(&inode->i_dio_count))
-				return -EAGAIN;
-		} else {
+		/* unaligned dio always waits, bail */
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EAGAIN;
+		else
 			inode_dio_wait(inode);
-		}
 	} else if (iolock == XFS_IOLOCK_EXCL) {
 		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
 		iolock = XFS_IOLOCK_SHARED;
@@ -548,6 +549,8 @@ xfs_file_dio_aio_write(
 
 	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
 	ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);
+	if (unaligned_io && !is_sync_kiocb(iocb))
+		inode_dio_wait(inode);
 out:
 	xfs_iunlock(ip, iolock);