diff mbox

[3/7] xfs: protect S_DAX transitions in XFS read path

Message ID 20170925231404.32723-4-ross.zwisler@linux.intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Ross Zwisler Sept. 25, 2017, 11:14 p.m. UTC
In the current XFS read I/O path we check IS_DAX() in xfs_file_read_iter()
to decide whether to do DAX I/O, direct I/O or buffered I/O.  This check is
done without holding the XFS_IOLOCK, though, which means that if we allow
S_DAX to be manipulated via the inode flag we can run into this race:

CPU 0				CPU 1
-----				-----
xfs_file_read_iter()
  IS_DAX() << returns false
  				xfs_ioctl_setattr()
				  xfs_ioctl_setattr_dax_invalidate()
				   xfs_ilock(XFS_MMAPLOCK|XFS_IOLOCK)
				  sets S_DAX
				  releases XFS_MMAPLOCK and XFS_IOLOCK
  xfs_file_buffered_aio_read()
  does buffered I/O to DAX inode, death

Fix this by ensuring that we only check S_DAX when we hold the XFS_IOLOCK
in the read path.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/xfs/xfs_file.c | 42 +++++++++++++-----------------------------
 1 file changed, 13 insertions(+), 29 deletions(-)

Comments

Dave Chinner Sept. 25, 2017, 11:27 p.m. UTC | #1
On Mon, Sep 25, 2017 at 05:14:00PM -0600, Ross Zwisler wrote:
> In the current XFS read I/O path we check IS_DAX() in xfs_file_read_iter()
> to decide whether to do DAX I/O, direct I/O or buffered I/O.  This check is
> done without holding the XFS_IOLOCK, though, which means that if we allow
> S_DAX to be manipulated via the inode flag we can run into this race:
> 
> CPU 0				CPU 1
> -----				-----
> xfs_file_read_iter()
>   IS_DAX() << returns false
>   				xfs_ioctl_setattr()
> 				  xfs_ioctl_setattr_dax_invalidate()
> 				   xfs_ilock(XFS_MMAPLOCK|XFS_IOLOCK)
> 				  sets S_DAX
> 				  releases XFS_MMAPLOCK and XFS_IOLOCK
>   xfs_file_buffered_aio_read()
>   does buffered I/O to DAX inode, death
> 
> Fix this by ensuring that we only check S_DAX when we hold the XFS_IOLOCK
> in the read path.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/xfs/xfs_file.c | 42 +++++++++++++-----------------------------
>  1 file changed, 13 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index ebdd0bd..ca4c8fd 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -207,7 +207,6 @@ xfs_file_dio_aio_read(
>  {
>  	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
>  	size_t			count = iov_iter_count(to);
> -	ssize_t			ret;
>  
>  	trace_xfs_file_direct_read(ip, count, iocb->ki_pos);
>  
> @@ -215,12 +214,7 @@ xfs_file_dio_aio_read(
>  		return 0; /* skip atime */
>  
>  	file_accessed(iocb->ki_filp);
> -
> -	xfs_ilock(ip, XFS_IOLOCK_SHARED);
> -	ret = iomap_dio_rw(iocb, to, &xfs_iomap_ops, NULL);
> -	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
> -
> -	return ret;
> +	return iomap_dio_rw(iocb, to, &xfs_iomap_ops, NULL);

This puts file_accessed under the XFS_IOLOCK_SHARED now. Is that a
safe/sane thing to do for DIO?

Cheers,

Dave.
Christoph Hellwig Sept. 26, 2017, 6:32 a.m. UTC | #2
We can't just take locking one level up, as we need differnet locking
for different kinds of I/O.

I think you probably want an IOCB_DAX flag to check IS_DAX once and
then stick to it, similar to what we do for direct I/O.
Dan Williams Sept. 26, 2017, 1:59 p.m. UTC | #3
On Mon, Sep 25, 2017 at 11:32 PM, Christoph Hellwig <hch@lst.de> wrote:
> We can't just take locking one level up, as we need differnet locking
> for different kinds of I/O.
>
> I think you probably want an IOCB_DAX flag to check IS_DAX once and
> then stick to it, similar to what we do for direct I/O.

I wonder if this works better with a reference count mechanism
per-file so that we don't need a hold a lock over the whole
transition. Similar to request_queue reference counting, when DAX is
being turned off we block new references and drain the in-flight ones.
Christoph Hellwig Sept. 26, 2017, 2:33 p.m. UTC | #4
On Tue, Sep 26, 2017 at 06:59:37AM -0700, Dan Williams wrote:
> > I think you probably want an IOCB_DAX flag to check IS_DAX once and
> > then stick to it, similar to what we do for direct I/O.
> 
> I wonder if this works better with a reference count mechanism
> per-file so that we don't need a hold a lock over the whole
> transition. Similar to request_queue reference counting, when DAX is
> being turned off we block new references and drain the in-flight ones.

Maybe.  But that assumes we want to be stuck in a perpetual binary
DAX on/off state on a given file.  Which makes not only for an awkward
interface (inode or mount flag), but also might be fundamentally the
wrong thing to do for some media where you'd happily read directly
from it but rather buffer writes in DRAM.
Dan Williams Sept. 26, 2017, 6:11 p.m. UTC | #5
On Tue, Sep 26, 2017 at 7:33 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Tue, Sep 26, 2017 at 06:59:37AM -0700, Dan Williams wrote:
>> > I think you probably want an IOCB_DAX flag to check IS_DAX once and
>> > then stick to it, similar to what we do for direct I/O.
>>
>> I wonder if this works better with a reference count mechanism
>> per-file so that we don't need a hold a lock over the whole
>> transition. Similar to request_queue reference counting, when DAX is
>> being turned off we block new references and drain the in-flight ones.
>
> Maybe.  But that assumes we want to be stuck in a perpetual binary
> DAX on/off state on a given file.  Which makes not only for an awkward
> interface (inode or mount flag), but also might be fundamentally the
> wrong thing to do for some media where you'd happily read directly
> from it but rather buffer writes in DRAM.

I think we'll always need an explicit override available, but yes we
need to think about what the override looks like in the context of a
kernel that is able to automatically pick the right I/O policy
relative to the media type. A potential mixed policy for reads vs
writes makes sense. Where would this finer grained I/O policy
selection go other than more inode flags?
Christoph Hellwig Oct. 1, 2017, 8:17 a.m. UTC | #6
On Tue, Sep 26, 2017 at 11:11:55AM -0700, Dan Williams wrote:
> I think we'll always need an explicit override available, but yes we
> need to think about what the override looks like in the context of a
> kernel that is able to automatically pick the right I/O policy
> relative to the media type. A potential mixed policy for reads vs
> writes makes sense. Where would this finer grained I/O policy
> selection go other than more inode flags?

fadvise?
diff mbox

Patch

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ebdd0bd..ca4c8fd 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -207,7 +207,6 @@  xfs_file_dio_aio_read(
 {
 	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
 	size_t			count = iov_iter_count(to);
-	ssize_t			ret;
 
 	trace_xfs_file_direct_read(ip, count, iocb->ki_pos);
 
@@ -215,12 +214,7 @@  xfs_file_dio_aio_read(
 		return 0; /* skip atime */
 
 	file_accessed(iocb->ki_filp);
-
-	xfs_ilock(ip, XFS_IOLOCK_SHARED);
-	ret = iomap_dio_rw(iocb, to, &xfs_iomap_ops, NULL);
-	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
-
-	return ret;
+	return iomap_dio_rw(iocb, to, &xfs_iomap_ops, NULL);
 }
 
 static noinline ssize_t
@@ -230,23 +224,14 @@  xfs_file_dax_read(
 {
 	struct xfs_inode	*ip = XFS_I(iocb->ki_filp->f_mapping->host);
 	size_t			count = iov_iter_count(to);
-	ssize_t			ret = 0;
 
 	trace_xfs_file_dax_read(ip, count, iocb->ki_pos);
 
 	if (!count)
 		return 0; /* skip atime */
 
-	if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) {
-		if (iocb->ki_flags & IOCB_NOWAIT)
-			return -EAGAIN;
-		xfs_ilock(ip, XFS_IOLOCK_SHARED);
-	}
-	ret = dax_iomap_rw(iocb, to, &xfs_iomap_ops);
-	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
-
 	file_accessed(iocb->ki_filp);
-	return ret;
+	return dax_iomap_rw(iocb, to, &xfs_iomap_ops);
 }
 
 STATIC ssize_t
@@ -255,19 +240,9 @@  xfs_file_buffered_aio_read(
 	struct iov_iter		*to)
 {
 	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
-	ssize_t			ret;
 
 	trace_xfs_file_buffered_read(ip, iov_iter_count(to), iocb->ki_pos);
-
-	if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) {
-		if (iocb->ki_flags & IOCB_NOWAIT)
-			return -EAGAIN;
-		xfs_ilock(ip, XFS_IOLOCK_SHARED);
-	}
-	ret = generic_file_read_iter(iocb, to);
-	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
-
-	return ret;
+	return generic_file_read_iter(iocb, to);
 }
 
 STATIC ssize_t
@@ -276,7 +251,8 @@  xfs_file_read_iter(
 	struct iov_iter		*to)
 {
 	struct inode		*inode = file_inode(iocb->ki_filp);
-	struct xfs_mount	*mp = XFS_I(inode)->i_mount;
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
 	ssize_t			ret = 0;
 
 	XFS_STATS_INC(mp, xs_read_calls);
@@ -284,6 +260,12 @@  xfs_file_read_iter(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
+	if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) {
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EAGAIN;
+		xfs_ilock(ip, XFS_IOLOCK_SHARED);
+	}
+
 	if (IS_DAX(inode))
 		ret = xfs_file_dax_read(iocb, to);
 	else if (iocb->ki_flags & IOCB_DIRECT)
@@ -291,6 +273,8 @@  xfs_file_read_iter(
 	else
 		ret = xfs_file_buffered_aio_read(iocb, to);
 
+	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
+
 	if (ret > 0)
 		XFS_STATS_ADD(mp, xs_read_bytes, ret);
 	return ret;