diff mbox

nfs: fix dio deadlock when O_DIRECT flag is flipped

Message ID 1421244543-32539-1-git-send-email-tao.peng@primarydata.com (mailing list archive)
State New, archived
Headers show

Commit Message

Peng Tao Jan. 14, 2015, 2:09 p.m. UTC
Running xfstest generic/036, we'll get following VM_BUG_ON in
nfs_direct_IO(). 036 toggles O_DIRECT flag while IO is going on.
So the VM_BUG_ON should not exist there. However, we have a deadlock
in the code path as well, because inode->i_mutex is taken when calling
into ->direct_IO. And nfs_file_direct_write() would grab inode->i_mutex
again.

Meanwhile, nfs_file_direct_write() does a lot of things that is already
done by vfs before ->direct_IO is called. So skip those duplicates. One
exception is i_size_write. vfs does not take i_lock when setting i_size.
But nfs appears to need i_lock when setting i_size.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
---
 fs/nfs/direct.c | 92 ++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 65 insertions(+), 27 deletions(-)

Comments

Christoph Hellwig Jan. 14, 2015, 3:48 p.m. UTC | #1
On Wed, Jan 14, 2015 at 10:09:03PM +0800, Peng Tao wrote:
> Running xfstest generic/036, we'll get following VM_BUG_ON in
> nfs_direct_IO(). 036 toggles O_DIRECT flag while IO is going on.
> So the VM_BUG_ON should not exist there. However, we have a deadlock
> in the code path as well, because inode->i_mutex is taken when calling
> into ->direct_IO. And nfs_file_direct_write() would grab inode->i_mutex
> again.
> 
> Meanwhile, nfs_file_direct_write() does a lot of things that is already
> done by vfs before ->direct_IO is called. So skip those duplicates. One
> exception is i_size_write. vfs does not take i_lock when setting i_size.
> But nfs appears to need i_lock when setting i_size.

But given that NFS implements direct I/O without ->direct_IO (except for
the horrible swap over NFS hack that is on it's way out) it shold
never be called.

The right fix is to determine the O_DIRECT flag in one place when
entering a write, and then pass it down on the stack.  We already do
this in XFS for example, it just needs to be expanded to filemap.c
so that more filesystems benefit from it.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Jan. 14, 2015, 4:21 p.m. UTC | #2
Resend for the benefit of the f*ing mailing list spam filter...

On Wed, Jan 14, 2015 at 11:20 AM, Trond Myklebust
<trond.myklebust@primarydata.com> wrote:
>
>
>
> On Wed, Jan 14, 2015 at 10:48 AM, Christoph Hellwig <hch@infradead.org> wrote:
>>
>> On Wed, Jan 14, 2015 at 10:09:03PM +0800, Peng Tao wrote:
>> > Running xfstest generic/036, we'll get following VM_BUG_ON in
>> > nfs_direct_IO(). 036 toggles O_DIRECT flag while IO is going on.
>> > So the VM_BUG_ON should not exist there. However, we have a deadlock
>> > in the code path as well, because inode->i_mutex is taken when calling
>> > into ->direct_IO. And nfs_file_direct_write() would grab inode->i_mutex
>> > again.
>> >
>> > Meanwhile, nfs_file_direct_write() does a lot of things that is already
>> > done by vfs before ->direct_IO is called. So skip those duplicates. One
>> > exception is i_size_write. vfs does not take i_lock when setting i_size.
>> > But nfs appears to need i_lock when setting i_size.
>>
>> But given that NFS implements direct I/O without ->direct_IO (except for
>> the horrible swap over NFS hack that is on it's way out) it shold
>> never be called.
>>
>> The right fix is to determine the O_DIRECT flag in one place when
>> entering a write, and then pass it down on the stack.  We already do
>> this in XFS for example, it just needs to be expanded to filemap.c
>> so that more filesystems benefit from it.
>
>
> Can't we just assume that anything that calls nfs_direct_IO() on a file for which !IS_SWAPFILE(inode) is actually doing the xfstests O_DIRECT flag flipping dance, and just return the value '0'?
>
> AFAICT that should cause both generic_file_read_iter() and generic_file_write_iter() to fall back to buffered I/O, which is what we want...
>
> --
> Trond Myklebust
> Linux NFS client maintainer, PrimaryData
> trond.myklebust@primarydata.com
Peng Tao Jan. 14, 2015, 4:55 p.m. UTC | #3
On Wed, Jan 14, 2015 at 11:48 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, Jan 14, 2015 at 10:09:03PM +0800, Peng Tao wrote:
>> Running xfstest generic/036, we'll get following VM_BUG_ON in
>> nfs_direct_IO(). 036 toggles O_DIRECT flag while IO is going on.
>> So the VM_BUG_ON should not exist there. However, we have a deadlock
>> in the code path as well, because inode->i_mutex is taken when calling
>> into ->direct_IO. And nfs_file_direct_write() would grab inode->i_mutex
>> again.
>>
>> Meanwhile, nfs_file_direct_write() does a lot of things that is already
>> done by vfs before ->direct_IO is called. So skip those duplicates. One
>> exception is i_size_write. vfs does not take i_lock when setting i_size.
>> But nfs appears to need i_lock when setting i_size.
>
> But given that NFS implements direct I/O without ->direct_IO (except for
> the horrible swap over NFS hack that is on it's way out) it shold
> never be called.
>
036 flips O_DIRECT flag with fcntl. So in theory user space is able to
hit the VM_BUG_ON().

> The right fix is to determine the O_DIRECT flag in one place when
> entering a write, and then pass it down on the stack.  We already do
> this in XFS for example, it just needs to be expanded to filemap.c
> so that more filesystems benefit from it.
NFS hijacks DIO in ->write_iter by checking O_DIRECT flag and rely on
generic_file_write_iter() for buffer write. generic_file_write_iter()
again checks for O_DIRECT flag.... It seems that we can avoid the
double check on O_DIRECT flag by calling generic_perform_write()
instead.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Jan. 15, 2015, 3:17 p.m. UTC | #4
On Thu, Jan 15, 2015 at 12:55:35AM +0800, Peng Tao wrote:
> > The right fix is to determine the O_DIRECT flag in one place when
> > entering a write, and then pass it down on the stack.  We already do
> > this in XFS for example, it just needs to be expanded to filemap.c
> > so that more filesystems benefit from it.
> NFS hijacks DIO in ->write_iter by checking O_DIRECT flag and rely on
> generic_file_write_iter() for buffer write. generic_file_write_iter()
> again checks for O_DIRECT flag.... It seems that we can avoid the
> double check on O_DIRECT flag by calling generic_perform_write()
> instead.

Yes, that sounds like the simplest version for now.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index bfd9d49..9f88b13 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -105,6 +105,8 @@  static const struct nfs_pgio_completion_ops nfs_direct_write_completion_ops;
 static const struct nfs_commit_completion_ops nfs_direct_commit_completion_ops;
 static void nfs_direct_write_complete(struct nfs_direct_req *dreq, struct inode *inode);
 static void nfs_direct_write_schedule_work(struct work_struct *work);
+static ssize_t nfs_direct_write(struct kiocb *iocb, struct iov_iter *iter,
+				loff_t pos);
 
 static inline void get_dreq(struct nfs_direct_req *dreq)
 {
@@ -257,11 +259,9 @@  ssize_t nfs_direct_IO(int rw, struct kiocb *iocb, struct iov_iter *iter, loff_t
 
 	return -EINVAL;
 #else
-	VM_BUG_ON(iocb->ki_nbytes != PAGE_SIZE);
-
 	if (rw == READ)
 		return nfs_file_direct_read(iocb, iter, pos);
-	return nfs_file_direct_write(iocb, iter, pos);
+	return nfs_direct_write(iocb, iter, pos);
 #endif /* CONFIG_NFS_SWAP */
 }
 
@@ -919,6 +919,66 @@  static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 	return 0;
 }
 
+static ssize_t _nfs_direct_write(struct kiocb *iocb, struct iov_iter *iter,
+				   struct inode *inode, struct nfs_direct_req **dreqp,
+				   size_t count, loff_t pos)
+{
+	ssize_t result;
+	struct nfs_direct_req *dreq;
+	struct nfs_lock_context *l_ctx;
+
+	task_io_account_write(count);
+
+	result = -ENOMEM;
+	dreq = nfs_direct_req_alloc();
+	if (!dreq)
+		goto out;
+
+	dreq->inode = inode;
+	dreq->bytes_left = count;
+	dreq->ctx = get_nfs_open_context(nfs_file_open_context(iocb->ki_filp));
+	l_ctx = nfs_get_lock_context(dreq->ctx);
+	if (IS_ERR(l_ctx)) {
+		result = PTR_ERR(l_ctx);
+		nfs_direct_req_release(dreq);
+		goto out;
+	}
+	dreq->l_ctx = l_ctx;
+	if (!is_sync_kiocb(iocb))
+		dreq->iocb = iocb;
+
+	result = nfs_direct_write_schedule_iovec(dreq, iter, pos);
+	*dreqp = dreq;
+
+out:
+	return result;
+}
+
+static ssize_t nfs_direct_write(struct kiocb *iocb, struct iov_iter *iter,
+				loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
+	struct nfs_direct_req *uninitialized_var(dreq);
+	size_t count = iov_iter_count(iter);
+	ssize_t result;
+
+	result = _nfs_direct_write(iocb, iter, inode, &dreq, count, pos);
+	if (!result) {
+		result = nfs_direct_wait(dreq);
+		if (result > 0) {
+			iocb->ki_pos = pos + result;
+			spin_lock(&inode->i_lock);
+			if (i_size_read(inode) < iocb->ki_pos)
+				i_size_write(inode, iocb->ki_pos);
+			spin_unlock(&inode->i_lock);
+		}
+	}
+	nfs_direct_req_release(dreq);
+	return result;
+}
+
 /**
  * nfs_file_direct_write - file direct write operation for NFS files
  * @iocb: target I/O control block
@@ -947,8 +1007,7 @@  ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
 	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
-	struct nfs_direct_req *dreq;
-	struct nfs_lock_context *l_ctx;
+	struct nfs_direct_req *uninitialized_var(dreq);
 	loff_t end;
 	size_t count = iov_iter_count(iter);
 	end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
@@ -982,26 +1041,7 @@  ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
 			goto out_unlock;
 	}
 
-	task_io_account_write(count);
-
-	result = -ENOMEM;
-	dreq = nfs_direct_req_alloc();
-	if (!dreq)
-		goto out_unlock;
-
-	dreq->inode = inode;
-	dreq->bytes_left = count;
-	dreq->ctx = get_nfs_open_context(nfs_file_open_context(iocb->ki_filp));
-	l_ctx = nfs_get_lock_context(dreq->ctx);
-	if (IS_ERR(l_ctx)) {
-		result = PTR_ERR(l_ctx);
-		goto out_release;
-	}
-	dreq->l_ctx = l_ctx;
-	if (!is_sync_kiocb(iocb))
-		dreq->iocb = iocb;
-
-	result = nfs_direct_write_schedule_iovec(dreq, iter, pos);
+	result = _nfs_direct_write(iocb, iter, inode, &dreq, count, pos);
 
 	if (mapping->nrpages) {
 		invalidate_inode_pages2_range(mapping,
@@ -1025,8 +1065,6 @@  ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
 	nfs_direct_req_release(dreq);
 	return result;
 
-out_release:
-	nfs_direct_req_release(dreq);
 out_unlock:
 	mutex_unlock(&inode->i_mutex);
 out: