Message ID | 20170925231404.32723-4-ross.zwisler@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Mon, Sep 25, 2017 at 05:14:00PM -0600, Ross Zwisler wrote: > In the current XFS read I/O path we check IS_DAX() in xfs_file_read_iter() > to decide whether to do DAX I/O, direct I/O or buffered I/O. This check is > done without holding the XFS_IOLOCK, though, which means that if we allow > S_DAX to be manipulated via the inode flag we can run into this race: > > CPU 0 CPU 1 > ----- ----- > xfs_file_read_iter() > IS_DAX() << returns false > xfs_ioctl_setattr() > xfs_ioctl_setattr_dax_invalidate() > xfs_ilock(XFS_MMAPLOCK|XFS_IOLOCK) > sets S_DAX > releases XFS_MMAPLOCK and XFS_IOLOCK > xfs_file_buffered_aio_read() > does buffered I/O to DAX inode, death > > Fix this by ensuring that we only check S_DAX when we hold the XFS_IOLOCK > in the read path. > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > --- > fs/xfs/xfs_file.c | 42 +++++++++++++----------------------------- > 1 file changed, 13 insertions(+), 29 deletions(-) > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index ebdd0bd..ca4c8fd 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -207,7 +207,6 @@ xfs_file_dio_aio_read( > { > struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp)); > size_t count = iov_iter_count(to); > - ssize_t ret; > > trace_xfs_file_direct_read(ip, count, iocb->ki_pos); > > @@ -215,12 +214,7 @@ xfs_file_dio_aio_read( > return 0; /* skip atime */ > > file_accessed(iocb->ki_filp); > - > - xfs_ilock(ip, XFS_IOLOCK_SHARED); > - ret = iomap_dio_rw(iocb, to, &xfs_iomap_ops, NULL); > - xfs_iunlock(ip, XFS_IOLOCK_SHARED); > - > - return ret; > + return iomap_dio_rw(iocb, to, &xfs_iomap_ops, NULL); This puts file_accessed under the XFS_IOLOCK_SHARED now. Is that a safe/sane thing to do for DIO? Cheers, Dave.
We can't just take locking one level up, as we need differnet locking for different kinds of I/O. I think you probably want an IOCB_DAX flag to check IS_DAX once and then stick to it, similar to what we do for direct I/O.
On Mon, Sep 25, 2017 at 11:32 PM, Christoph Hellwig <hch@lst.de> wrote: > We can't just take locking one level up, as we need differnet locking > for different kinds of I/O. > > I think you probably want an IOCB_DAX flag to check IS_DAX once and > then stick to it, similar to what we do for direct I/O. I wonder if this works better with a reference count mechanism per-file so that we don't need a hold a lock over the whole transition. Similar to request_queue reference counting, when DAX is being turned off we block new references and drain the in-flight ones.
On Tue, Sep 26, 2017 at 06:59:37AM -0700, Dan Williams wrote: > > I think you probably want an IOCB_DAX flag to check IS_DAX once and > > then stick to it, similar to what we do for direct I/O. > > I wonder if this works better with a reference count mechanism > per-file so that we don't need a hold a lock over the whole > transition. Similar to request_queue reference counting, when DAX is > being turned off we block new references and drain the in-flight ones. Maybe. But that assumes we want to be stuck in a perpetual binary DAX on/off state on a given file. Which makes not only for an awkward interface (inode or mount flag), but also might be fundamentally the wrong thing to do for some media where you'd happily read directly from it but rather buffer writes in DRAM.
On Tue, Sep 26, 2017 at 7:33 AM, Christoph Hellwig <hch@lst.de> wrote: > On Tue, Sep 26, 2017 at 06:59:37AM -0700, Dan Williams wrote: >> > I think you probably want an IOCB_DAX flag to check IS_DAX once and >> > then stick to it, similar to what we do for direct I/O. >> >> I wonder if this works better with a reference count mechanism >> per-file so that we don't need a hold a lock over the whole >> transition. Similar to request_queue reference counting, when DAX is >> being turned off we block new references and drain the in-flight ones. > > Maybe. But that assumes we want to be stuck in a perpetual binary > DAX on/off state on a given file. Which makes not only for an awkward > interface (inode or mount flag), but also might be fundamentally the > wrong thing to do for some media where you'd happily read directly > from it but rather buffer writes in DRAM. I think we'll always need an explicit override available, but yes we need to think about what the override looks like in the context of a kernel that is able to automatically pick the right I/O policy relative to the media type. A potential mixed policy for reads vs writes makes sense. Where would this finer grained I/O policy selection go other than more inode flags?
On Tue, Sep 26, 2017 at 11:11:55AM -0700, Dan Williams wrote: > I think we'll always need an explicit override available, but yes we > need to think about what the override looks like in the context of a > kernel that is able to automatically pick the right I/O policy > relative to the media type. A potential mixed policy for reads vs > writes makes sense. Where would this finer grained I/O policy > selection go other than more inode flags? fadvise?
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index ebdd0bd..ca4c8fd 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -207,7 +207,6 @@ xfs_file_dio_aio_read( { struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp)); size_t count = iov_iter_count(to); - ssize_t ret; trace_xfs_file_direct_read(ip, count, iocb->ki_pos); @@ -215,12 +214,7 @@ xfs_file_dio_aio_read( return 0; /* skip atime */ file_accessed(iocb->ki_filp); - - xfs_ilock(ip, XFS_IOLOCK_SHARED); - ret = iomap_dio_rw(iocb, to, &xfs_iomap_ops, NULL); - xfs_iunlock(ip, XFS_IOLOCK_SHARED); - - return ret; + return iomap_dio_rw(iocb, to, &xfs_iomap_ops, NULL); } static noinline ssize_t @@ -230,23 +224,14 @@ xfs_file_dax_read( { struct xfs_inode *ip = XFS_I(iocb->ki_filp->f_mapping->host); size_t count = iov_iter_count(to); - ssize_t ret = 0; trace_xfs_file_dax_read(ip, count, iocb->ki_pos); if (!count) return 0; /* skip atime */ - if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) { - if (iocb->ki_flags & IOCB_NOWAIT) - return -EAGAIN; - xfs_ilock(ip, XFS_IOLOCK_SHARED); - } - ret = dax_iomap_rw(iocb, to, &xfs_iomap_ops); - xfs_iunlock(ip, XFS_IOLOCK_SHARED); - file_accessed(iocb->ki_filp); - return ret; + return dax_iomap_rw(iocb, to, &xfs_iomap_ops); } STATIC ssize_t @@ -255,19 +240,9 @@ xfs_file_buffered_aio_read( struct iov_iter *to) { struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp)); - ssize_t ret; trace_xfs_file_buffered_read(ip, iov_iter_count(to), iocb->ki_pos); - - if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) { - if (iocb->ki_flags & IOCB_NOWAIT) - return -EAGAIN; - xfs_ilock(ip, XFS_IOLOCK_SHARED); - } - ret = generic_file_read_iter(iocb, to); - xfs_iunlock(ip, XFS_IOLOCK_SHARED); - - return ret; + return generic_file_read_iter(iocb, to); } STATIC ssize_t @@ -276,7 +251,8 @@ xfs_file_read_iter( struct iov_iter *to) { struct inode *inode = file_inode(iocb->ki_filp); - struct xfs_mount *mp = XFS_I(inode)->i_mount; + struct xfs_inode *ip = XFS_I(inode); + struct xfs_mount *mp = ip->i_mount; ssize_t ret = 0; XFS_STATS_INC(mp, xs_read_calls); @@ -284,6 +260,12 @@ xfs_file_read_iter( if (XFS_FORCED_SHUTDOWN(mp)) return -EIO; + if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) { + if (iocb->ki_flags & IOCB_NOWAIT) + return -EAGAIN; + xfs_ilock(ip, XFS_IOLOCK_SHARED); + } + if (IS_DAX(inode)) ret = xfs_file_dax_read(iocb, to); else if (iocb->ki_flags & IOCB_DIRECT) @@ -291,6 +273,8 @@ xfs_file_read_iter( else ret = xfs_file_buffered_aio_read(iocb, to); + xfs_iunlock(ip, XFS_IOLOCK_SHARED); + if (ret > 0) XFS_STATS_ADD(mp, xs_read_bytes, ret); return ret;
In the current XFS read I/O path we check IS_DAX() in xfs_file_read_iter() to decide whether to do DAX I/O, direct I/O or buffered I/O. This check is done without holding the XFS_IOLOCK, though, which means that if we allow S_DAX to be manipulated via the inode flag we can run into this race: CPU 0 CPU 1 ----- ----- xfs_file_read_iter() IS_DAX() << returns false xfs_ioctl_setattr() xfs_ioctl_setattr_dax_invalidate() xfs_ilock(XFS_MMAPLOCK|XFS_IOLOCK) sets S_DAX releases XFS_MMAPLOCK and XFS_IOLOCK xfs_file_buffered_aio_read() does buffered I/O to DAX inode, death Fix this by ensuring that we only check S_DAX when we hold the XFS_IOLOCK in the read path. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> --- fs/xfs/xfs_file.c | 42 +++++++++++++----------------------------- 1 file changed, 13 insertions(+), 29 deletions(-)