diff mbox

[6/7] nfs: take i_mutex during direct I/O reads

Message ID 20131114165042.013211118@bombadil.infradead.org (mailing list archive)
State New, archived
Headers show

Commit Message

Christoph Hellwig Nov. 14, 2013, 4:50 p.m. UTC
We'll need the i_mutex to prevent i_dio_count from incrementing while
truncate is waiting for it to reach zero, and protects against having
the pagecache repopulated after we flushed it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/nfs/direct.c |   14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

Comments

Chuck Lever Nov. 14, 2013, 5 p.m. UTC | #1
Hi Christoph-

On Nov 14, 2013, at 11:50 AM, Christoph Hellwig <hch@infradead.org> wrote:

> We'll need the i_mutex to prevent i_dio_count from incrementing while
> truncate is waiting for it to reach zero, and protects against having
> the pagecache repopulated after we flushed it.

How was the performance impact of this change measured?


> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> fs/nfs/direct.c |   14 ++++++++++++--
> 1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index c32db2a..6cc7fe1 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -500,16 +500,17 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, const struct iovec *iov,
> 	if (!count)
> 		goto out;
> 
> +	mutex_lock(&inode->i_mutex);
> 	result = nfs_sync_mapping(mapping);
> 	if (result)
> -		goto out;
> +		goto out_unlock;
> 
> 	task_io_account_read(count);
> 
> 	result = -ENOMEM;
> 	dreq = nfs_direct_req_alloc();
> 	if (dreq == NULL)
> -		goto out;
> +		goto out_unlock;
> 
> 	dreq->inode = inode;
> 	dreq->bytes_left = iov_length(iov, nr_segs);
> @@ -525,13 +526,22 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, const struct iovec *iov,
> 
> 	NFS_I(inode)->read_io += iov_length(iov, nr_segs);
> 	result = nfs_direct_read_schedule_iovec(dreq, iov, nr_segs, pos, uio);
> +
> +	mutex_unlock(&inode->i_mutex);
> +
> 	if (!result) {
> 		result = nfs_direct_wait(dreq);
> 		if (result > 0)
> 			iocb->ki_pos = pos + result;
> 	}
> +
> +	nfs_direct_req_release(dreq);
> +	return result;
> +
> out_release:
> 	nfs_direct_req_release(dreq);
> +out_unlock:
> +	mutex_unlock(&inode->i_mutex);
> out:
> 	return result;
> }
> -- 
> 1.7.10.4
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Nov. 14, 2013, 8:43 p.m. UTC | #2
On Nov 14, 2013, at 11:50, Christoph Hellwig <hch@infradead.org> wrote:

> We'll need the i_mutex to prevent i_dio_count from incrementing while
> truncate is waiting for it to reach zero, and protects against having
> the pagecache repopulated after we flushed it.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Hi Christoph,

Why do we need to protect the page cache here? Using O_DIRECT and the page cache without some kind of application-specific synchronization method is technically not supported, since that violates close-to-open cache consistency.
What we _do_ want to support, though, is parallel reads and writes to the server by applications that know what they are doing. If we were to only protect the i_dio_count, then we could fix the truncate race, while continuing to allow parallelism. Is there any reason why we cannot do this?

Cheers
  Trond

PS: I appear to be missing 7/7 in this patch series. Should I have seen it?--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Nov. 15, 2013, 2:29 p.m. UTC | #3
On Thu, Nov 14, 2013 at 12:00:06PM -0500, Chuck Lever wrote:
> > We'll need the i_mutex to prevent i_dio_count from incrementing while
> > truncate is waiting for it to reach zero, and protects against having
> > the pagecache repopulated after we flushed it.
> 
> How was the performance impact of this change measured?

I've only done correctness tests.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Nov. 15, 2013, 2:32 p.m. UTC | #4
On Thu, Nov 14, 2013 at 03:43:21PM -0500, Trond Myklebust wrote:
> 
> Hi Christoph,
> 
> Why do we need to protect the page cache here? Using O_DIRECT and the page cache without some kind of application-specific synchronization method is technically not supported, since that violates close-to-open cache consistency.

As far as I understand close to open matter for synchronizations with
access from different clients.  Linux applications do expect tight
synchronization locally.

> What we _do_ want to support, though, is parallel reads and writes to the server by applications that know what they are doing. If we were to only protect the i_dio_count, then we could fix the truncate race, while continuing to allow parallelism. Is there any reason why we cannot do this?

To just fix that race we could do it.  Note that even with the patch we
only do the setup and I/O submission under the lock, and wait for it
outside, similar to what the local filesystems do for aio, but not for
synchronous direct I/O.

Another good way to speed this up is to use locking scheme XFS uses with
a shared exclusive lock:

buffered write:			exclusive
buffered read:			shared
direct I/O			exclusive if pagecache is present, then
				demote to shared for actual I/O,
				shared only if no pages are present on
				the file

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Nov. 15, 2013, 3:23 p.m. UTC | #5
On Nov 15, 2013, at 9:32, Christoph Hellwig <hch@infradead.org> wrote:

> On Thu, Nov 14, 2013 at 03:43:21PM -0500, Trond Myklebust wrote:
>> 
>> Hi Christoph,
>> 
>> Why do we need to protect the page cache here? Using O_DIRECT and the page cache without some kind of application-specific synchronization method is technically not supported, since that violates close-to-open cache consistency.
> 
> As far as I understand close to open matter for synchronizations with
> access from different clients.  Linux applications do expect tight
> synchronization locally.

Yes, but it seems to me that if you’re adding locking in order to synchronize O_DIRECT with the page cache, then something is very wrong. It is hard to close the races, and you are really mixing 2 models that shouldn’t coexist.

It seems a lot more correct in that case to have the kernel convert those O_DIRECT writes into O_SYNC if we really need page cache consistency. That’s the only way we can really _guarantee_ that consistency.

>> What we _do_ want to support, though, is parallel reads and writes to the server by applications that know what they are doing. If we were to only protect the i_dio_count, then we could fix the truncate race, while continuing to allow parallelism. Is there any reason why we cannot do this?
> 
> To just fix that race we could do it.  Note that even with the patch we
> only do the setup and I/O submission under the lock, and wait for it
> outside, similar to what the local filesystems do for aio, but not for
> synchronous direct I/O.
> 
> Another good way to speed this up is to use locking scheme XFS uses with
> a shared exclusive lock:
> 
> buffered write:			exclusive
> buffered read:			shared
> direct I/O			exclusive if pagecache is present, then
> 				demote to shared for actual I/O,
> 				shared only if no pages are present on
> 				the file

I think we should try to close the truncate race first. Then let’s look at whether or not people need better consistency with the page cache.

Cheers,
  Trond--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Nov. 15, 2013, 3:25 p.m. UTC | #6
On Fri, Nov 15, 2013 at 10:23:01AM -0500, Trond Myklebust wrote:
> > buffered read:			shared
> > direct I/O			exclusive if pagecache is present, then
> > 				demote to shared for actual I/O,
> > 				shared only if no pages are present on
> > 				the file
> 
> I think we should try to close the truncate race first. Then let?s look at whether or not people need better consistency with the page cache.

With the current locking model that will require the i_mutex in both
read and write around the i_dio_count increment.  The XFS model avoids
taking an exclusive lock for the direct I/O fast path.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Nov. 15, 2013, 3:34 p.m. UTC | #7
On Nov 15, 2013, at 10:25, Christoph Hellwig <hch@infradead.org> wrote:

> On Fri, Nov 15, 2013 at 10:23:01AM -0500, Trond Myklebust wrote:
>>> buffered read:			shared
>>> direct I/O			exclusive if pagecache is present, then
>>> 				demote to shared for actual I/O,
>>> 				shared only if no pages are present on
>>> 				the file
>> 
>> I think we should try to close the truncate race first. Then let?s look at whether or not people need better consistency with the page cache.
> 
> With the current locking model that will require the i_mutex in both
> read and write around the i_dio_count increment.  The XFS model avoids
> taking an exclusive lock for the direct I/O fast path.
> 

Is that the xfs_ilock vs xfs_rw_ilock code?

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Nov. 15, 2013, 3:37 p.m. UTC | #8
On Fri, Nov 15, 2013 at 10:34:52AM -0500, Trond Myklebust wrote:
> > With the current locking model that will require the i_mutex in both
> > read and write around the i_dio_count increment.  The XFS model avoids
> > taking an exclusive lock for the direct I/O fast path.
> > 
> 
> Is that the xfs_ilock vs xfs_rw_ilock code?

xfs_rw_ilock is where you should lock, that are the helpers around
the xfs iolock which is the shared/exclusive lock and the i_mutex
which we still have to take in the exclusive case as the VFS expects it.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Nov. 15, 2013, 4 p.m. UTC | #9
On Nov 15, 2013, at 10:25, Christoph Hellwig <hch@infradead.org> wrote:

> On Fri, Nov 15, 2013 at 10:23:01AM -0500, Trond Myklebust wrote:
>>> buffered read:			shared
>>> direct I/O			exclusive if pagecache is present, then
>>> 				demote to shared for actual I/O,
>>> 				shared only if no pages are present on
>>> 				the file
>> 
>> I think we should try to close the truncate race first. Then let?s look at whether or not people need better consistency with the page cache.
> 
> With the current locking model that will require the i_mutex in both
> read and write around the i_dio_count increment.  The XFS model avoids
> taking an exclusive lock for the direct I/O fast path.

OK, so afaics it is basically a wrapper around a 2 lock system: an rw_semaphore and the i_mutex. That means adding an rw_semaphore to the nfs_inode; we unfortunately cannot use the existing rw_sem since that is taken during lock recovery and so could deadlock if we take it during i/o.--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index c32db2a..6cc7fe1 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -500,16 +500,17 @@  ssize_t nfs_file_direct_read(struct kiocb *iocb, const struct iovec *iov,
 	if (!count)
 		goto out;
 
+	mutex_lock(&inode->i_mutex);
 	result = nfs_sync_mapping(mapping);
 	if (result)
-		goto out;
+		goto out_unlock;
 
 	task_io_account_read(count);
 
 	result = -ENOMEM;
 	dreq = nfs_direct_req_alloc();
 	if (dreq == NULL)
-		goto out;
+		goto out_unlock;
 
 	dreq->inode = inode;
 	dreq->bytes_left = iov_length(iov, nr_segs);
@@ -525,13 +526,22 @@  ssize_t nfs_file_direct_read(struct kiocb *iocb, const struct iovec *iov,
 
 	NFS_I(inode)->read_io += iov_length(iov, nr_segs);
 	result = nfs_direct_read_schedule_iovec(dreq, iov, nr_segs, pos, uio);
+
+	mutex_unlock(&inode->i_mutex);
+
 	if (!result) {
 		result = nfs_direct_wait(dreq);
 		if (result > 0)
 			iocb->ki_pos = pos + result;
 	}
+
+	nfs_direct_req_release(dreq);
+	return result;
+
 out_release:
 	nfs_direct_req_release(dreq);
+out_unlock:
+	mutex_unlock(&inode->i_mutex);
 out:
 	return result;
 }