[PATCH/RFC] NFS: Minimize COMMIT calls during writeback.
diff mbox series

Message ID 87y2s1rwof.fsf@notabene.neil.brown.name
State New
Headers show
Series
  • [PATCH/RFC] NFS: Minimize COMMIT calls during writeback.
Related show

Commit Message

NeilBrown March 16, 2020, 12:05 a.m. UTC
Since 4.13 (see Fixes: commit) NFS has sent a COMMIT request for every
.writepages() call it received - when the writes submitted for that call
have all completed.

This works well enough when COMMIT is fast, but if COMMIT is slow these
calls can overlap each other, overload the server, and substantially
reduce throughput.

I looked at this due to a report of regression when writing to a Ganesha
NFS server with tracing showing multiple overlapping COMMITs, and
individual commits typically taking 200ms to 300ms.  This resulted in 2
orders of magnitude slow down when writing 1000x1M from /dev/zero with dd.
This is easily reproduced by adding 'msleep(300)' to nfsd_commit() in the
Linux NFS server.

This patch changes the details of when COMMIT is sent without changing
the general approach.

Where previously the writes from a single .writepages() call were
accounted together in a nfs_io_completion, now the writes from multiple
writepages() calls are accounted together.  The set of writepages calls
that are grouped together are all those from when one COMMIT call ends
to when the next COMMIT call ends.  This automatically reduces the rate
of COMMIT requests when COMMIT itself is slow.
(If there are no COMMIT calls pending, the first .writepages will get
 an nfs_io_completion to itself, then subsequenct writepages requests
 until the first COMMIT completes will go in the next nfs_io_completion)

There are typically at most two nfs_io_completion structures allocated
for writeback to a given inode.

- If there was been any writepages calls since the last time a COMMIT
  completed, there will be an nfs_io_completion stored in the inode (in
  nfs_mds_commit_info) which new writepages requests are accounted it.

- If there is no pending COMMIT request, but there are pending writeback
  WRITES, there will be another nfs_io_completion which is not attached
  and which is draining.  When it fully drains a COMMIT will be sent.
  When that COMMIT completes, the attached nfs_io_completion will be
  detached and allowed to drain.

The rpcs_out counter now counts the unattached nfs_io_completion as well
as any pending COMMIT requests.  As an unattached nfs_io_completion will
soon turn into a COMMIT request, this seems reasonable.  It allows us to
clearly detect when there are no longer any COMMITs expected to
complete, so we know to detach any nfs_io_completion from the inode and
allow it to drain.

As we use atomic accesses (e.g.  xchg and kref_get_unless_zero()) to
access the attached nfs_io_completion, we now use kfree_rcu() to free
it, to ensure it is not accessed after it is freed.

With 300ms commits, this improves throught of a 1G dd by 2 orders of
magnitude.  Even without the 300ms delay, this noticeably improves
throughput to a Linux NFS server is a simple VM.

Fixes: 919e3bd9a875 ("NFS: Ensure we commit after writeback is complete")
Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/inode.c          |  1 +
 fs/nfs/write.c          | 50 ++++++++++++++++++++++++++++++++++++-----
 include/linux/nfs_xdr.h |  7 ++++++
 3 files changed, 53 insertions(+), 5 deletions(-)

Comments

Trond Myklebust March 16, 2020, 1:22 a.m. UTC | #1
On Mon, 2020-03-16 at 11:05 +1100, NeilBrown wrote:
> Since 4.13 (see Fixes: commit) NFS has sent a COMMIT request for
> every
> .writepages() call it received - when the writes submitted for that
> call
> have all completed.
> 
> This works well enough when COMMIT is fast, but if COMMIT is slow
> these
> calls can overlap each other, overload the server, and substantially
> reduce throughput.
> 
> I looked at this due to a report of regression when writing to a
> Ganesha
> NFS server with tracing showing multiple overlapping COMMITs, and
> individual commits typically taking 200ms to 300ms.  This resulted in
> 2
> orders of magnitude slow down when writing 1000x1M from /dev/zero
> with dd.
> This is easily reproduced by adding 'msleep(300)' to nfsd_commit() in
> the
> Linux NFS server.


When there is overlap of writeback to the same file, then it is almost
always because we're hitting memory reclaim on the client. In that case
we want to ensure that memory reclaim completes, which it won't do if
we delay COMMIT.
IOW: this behaviour is very much intentional.

> This patch changes the details of when COMMIT is sent without
> changing
> the general approach.
> 
> Where previously the writes from a single .writepages() call were
> accounted together in a nfs_io_completion, now the writes from
> multiple
> writepages() calls are accounted together.  The set of writepages
> calls
> that are grouped together are all those from when one COMMIT call
> ends
> to when the next COMMIT call ends.  This automatically reduces the
> rate
> of COMMIT requests when COMMIT itself is slow.
> (If there are no COMMIT calls pending, the first .writepages will get
>  an nfs_io_completion to itself, then subsequenct writepages requests
>  until the first COMMIT completes will go in the next
> nfs_io_completion)
> 
> There are typically at most two nfs_io_completion structures
> allocated
> for writeback to a given inode.
> 
> - If there was been any writepages calls since the last time a COMMIT
>   completed, there will be an nfs_io_completion stored in the inode
> (in
>   nfs_mds_commit_info) which new writepages requests are accounted
> it.
> 
> - If there is no pending COMMIT request, but there are pending
> writeback
>   WRITES, there will be another nfs_io_completion which is not
> attached
>   and which is draining.  When it fully drains a COMMIT will be sent.
>   When that COMMIT completes, the attached nfs_io_completion will be
>   detached and allowed to drain.
> 
> The rpcs_out counter now counts the unattached nfs_io_completion as
> well
> as any pending COMMIT requests.  As an unattached nfs_io_completion
> will
> soon turn into a COMMIT request, this seems reasonable.  It allows us
> to
> clearly detect when there are no longer any COMMITs expected to
> complete, so we know to detach any nfs_io_completion from the inode
> and
> allow it to drain.
> 
> As we use atomic accesses (e.g.  xchg and kref_get_unless_zero()) to
> access the attached nfs_io_completion, we now use kfree_rcu() to free
> it, to ensure it is not accessed after it is freed.
> 
> With 300ms commits, this improves throught of a 1G dd by 2 orders of
> magnitude.  Even without the 300ms delay, this noticeably improves
> throughput to a Linux NFS server is a simple VM.
> 
> Fixes: 919e3bd9a875 ("NFS: Ensure we commit after writeback is
> complete")
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/nfs/inode.c          |  1 +
>  fs/nfs/write.c          | 50 ++++++++++++++++++++++++++++++++++++---
> --
>  include/linux/nfs_xdr.h |  7 ++++++
>  3 files changed, 53 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index 11bf15800ac9..c00b54491949 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -2110,6 +2110,7 @@ static void init_once(void *foo)
>  	INIT_LIST_HEAD(&nfsi->commit_info.list);
>  	atomic_long_set(&nfsi->nrequests, 0);
>  	atomic_long_set(&nfsi->commit_info.ncommit, 0);
> +	nfsi->commit_info.ioc = NULL;
>  	atomic_set(&nfsi->commit_info.rpcs_out, 0);
>  	init_rwsem(&nfsi->rmdir_sem);
>  	mutex_init(&nfsi->commit_mutex);
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index c478b772cc49..57e209f964e4 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -47,6 +47,7 @@ struct nfs_io_completion {
>  	void (*complete)(void *data);
>  	void *data;
>  	struct kref refcount;
> +	struct rcu_head rcu;
>  };
>  
>  /*
> @@ -134,7 +135,7 @@ static void nfs_io_completion_release(struct kref
> *kref)
>  	struct nfs_io_completion *ioc = container_of(kref,
>  			struct nfs_io_completion, refcount);
>  	ioc->complete(ioc->data);
> -	kfree(ioc);
> +	kfree_rcu(ioc, rcu);
>  }
>  
>  static void nfs_io_completion_get(struct nfs_io_completion *ioc)
> @@ -720,6 +721,8 @@ static void nfs_io_completion_commit(void *inode)
>  	nfs_commit_inode(inode, 0);
>  }
>  
> +static void nfs_commit_end(struct nfs_mds_commit_info *cinfo);
> +
>  int nfs_writepages(struct address_space *mapping, struct
> writeback_control *wbc)
>  {
>  	struct inode *inode = mapping->host;
> @@ -729,9 +732,37 @@ int nfs_writepages(struct address_space
> *mapping, struct writeback_control *wbc)
>  
>  	nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
>  
> -	ioc = nfs_io_completion_alloc(GFP_KERNEL);
> -	if (ioc)
> -		nfs_io_completion_init(ioc, nfs_io_completion_commit,
> inode);
> +	rcu_read_lock();
> +	ioc = rcu_dereference(NFS_I(inode)->commit_info.ioc);
> +	if (ioc && kref_get_unless_zero(&ioc->refcount)) {
> +		rcu_read_unlock();
> +		/* We've successfully piggybacked on the attached ioc
> */
> +	} else {
> +		rcu_read_unlock();
> +		ioc = nfs_io_completion_alloc(GFP_KERNEL);
> +		if (ioc) {
> +			struct nfs_io_completion *ioc_prev;
> +
> +			nfs_io_completion_init(ioc,
> nfs_io_completion_commit,
> +					       inode);
> +			/* Temporarily elevate rpcs_out to ensure a
> commit
> +			 * completion *will* happen after we attach
> this ioc,
> +			 * so it *will* get a chance to drain.
> +			 */
> +			atomic_inc(&NFS_I(inode)-
> >commit_info.rpcs_out);
> +			nfs_io_completion_get(ioc);
> +			ioc_prev = xchg(&NFS_I(inode)->commit_info.ioc,
> +				    ioc);
> +			/* ioc_prev is normally NULL, but racing
> writepages
> +			 * calls might result in it being non-NULL.
> +			 * In either case it is safe to put it - worst
> case
> +			 * we get an extra COMMIT.
> +			 */
> +			nfs_io_completion_put(ioc_prev);
> +			/* release temporary ref on rpcs_out */
> +			nfs_commit_end(&NFS_I(inode)->commit_info);

So won't this normally trigger the xchg(NULL) in nfs_commmit_end()? For
most cases, I'd expect commit_info.rpcs_out to be zero until we start
actually sending out the COMMITs.

> +		}
> +	}
>  
>  	nfs_pageio_init_write(&pgio, inode, wb_priority(wbc), false,
>  				&nfs_async_write_completion_ops);
> @@ -1677,8 +1708,17 @@ static void nfs_commit_begin(struct
> nfs_mds_commit_info *cinfo)
>  
>  static void nfs_commit_end(struct nfs_mds_commit_info *cinfo)
>  {
> -	if (atomic_dec_and_test(&cinfo->rpcs_out))
> +	if (atomic_dec_and_test(&cinfo->rpcs_out)) {
> +		/* When a commit finishes, we must release any attached
> +		 * nfs_io_completion so that it can drain and then
> request
> +		 * another commit.
> +		 */
> +		struct nfs_io_completion *ioc;
> +		ioc = xchg(&cinfo->ioc, NULL);
> +		nfs_io_completion_put(ioc);
> +
>  		wake_up_var(&cinfo->rpcs_out);
> +	}
>  }
>  
>  void nfs_commitdata_release(struct nfs_commit_data *data)
> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
> index 94c77ed55ce1..89db1e9d461d 100644
> --- a/include/linux/nfs_xdr.h
> +++ b/include/linux/nfs_xdr.h
> @@ -1557,9 +1557,16 @@ struct nfs_pgio_header {
>  };
>  
>  struct nfs_mds_commit_info {
> +	/* rpcs_out counts pending COMMIT rpcs plus pendng
> nfs_io_completions
> +	 * which are *not* attached at 'ioc' below.  Such
> nfs_io_compleions
> +	 * (normally at most one) will drain as writes complete and
> then trigger
> +	 * a COMMIT, so they can be considered as pending COMMITs which
> haven't
> +	 * been sent yet
> +	 */
>  	atomic_t rpcs_out;
>  	atomic_long_t		ncommit;
>  	struct list_head	list;
> +	struct nfs_io_completion *ioc;
>  };
>  
>  struct nfs_commit_info;
NeilBrown March 16, 2020, 3:39 a.m. UTC | #2
On Mon, Mar 16 2020, Trond Myklebust wrote:

> On Mon, 2020-03-16 at 11:05 +1100, NeilBrown wrote:
>> Since 4.13 (see Fixes: commit) NFS has sent a COMMIT request for
>> every
>> .writepages() call it received - when the writes submitted for that
>> call
>> have all completed.
>> 
>> This works well enough when COMMIT is fast, but if COMMIT is slow
>> these
>> calls can overlap each other, overload the server, and substantially
>> reduce throughput.
>> 
>> I looked at this due to a report of regression when writing to a
>> Ganesha
>> NFS server with tracing showing multiple overlapping COMMITs, and
>> individual commits typically taking 200ms to 300ms.  This resulted in
>> 2
>> orders of magnitude slow down when writing 1000x1M from /dev/zero
>> with dd.
>> This is easily reproduced by adding 'msleep(300)' to nfsd_commit() in
>> the
>> Linux NFS server.
>
>
> When there is overlap of writeback to the same file, then it is almost
> always because we're hitting memory reclaim on the client. In that case
> we want to ensure that memory reclaim completes, which it won't do if
> we delay COMMIT.
> IOW: this behaviour is very much intentional.

What do you mean by "overlap of writeback" ?
The ->writepages calls are sequential - they don't overlap.
The WRITE rpcs are asynchronous so of course they overlap.
The COMMITs are expected to overlap with subsequent WRITEs.
It is only when COMMITs overlap with COMMITs that I think we might
overload the server, and I cannot see how that would ever be a useful
thing to do.

In the cases I've experiment with there is plenty of memory, but the
total write size exceeds dirty_threshold, so writeout starts to limit
the amount of dirty memory.  Certainly we want the COMMIT promptly as
a group of writes finish, but that group doesn't need to be tiny.

(Our initial explorations found that increasing dirty_threshold to
larger than the write size restored normal performance)

>
>> This patch changes the details of when COMMIT is sent without
>> changing
>> the general approach.
>> 
>> Where previously the writes from a single .writepages() call were
>> accounted together in a nfs_io_completion, now the writes from
>> multiple
>> writepages() calls are accounted together.  The set of writepages
>> calls
>> that are grouped together are all those from when one COMMIT call
>> ends
>> to when the next COMMIT call ends.  This automatically reduces the
>> rate
>> of COMMIT requests when COMMIT itself is slow.
>> (If there are no COMMIT calls pending, the first .writepages will get
>>  an nfs_io_completion to itself, then subsequenct writepages requests
>>  until the first COMMIT completes will go in the next
>> nfs_io_completion)
>> 
>> There are typically at most two nfs_io_completion structures
>> allocated
>> for writeback to a given inode.
>> 
>> - If there was been any writepages calls since the last time a COMMIT
>>   completed, there will be an nfs_io_completion stored in the inode
>> (in
>>   nfs_mds_commit_info) which new writepages requests are accounted
>> it.
>> 
>> - If there is no pending COMMIT request, but there are pending
>> writeback
>>   WRITES, there will be another nfs_io_completion which is not
>> attached
>>   and which is draining.  When it fully drains a COMMIT will be sent.
>>   When that COMMIT completes, the attached nfs_io_completion will be
>>   detached and allowed to drain.
>> 
>> The rpcs_out counter now counts the unattached nfs_io_completion as
>> well
>> as any pending COMMIT requests.  As an unattached nfs_io_completion
>> will
>> soon turn into a COMMIT request, this seems reasonable.  It allows us
>> to
>> clearly detect when there are no longer any COMMITs expected to
>> complete, so we know to detach any nfs_io_completion from the inode
>> and
>> allow it to drain.
>> 
>> As we use atomic accesses (e.g.  xchg and kref_get_unless_zero()) to
>> access the attached nfs_io_completion, we now use kfree_rcu() to free
>> it, to ensure it is not accessed after it is freed.
>> 
>> With 300ms commits, this improves throught of a 1G dd by 2 orders of
>> magnitude.  Even without the 300ms delay, this noticeably improves
>> throughput to a Linux NFS server is a simple VM.
>> 
>> Fixes: 919e3bd9a875 ("NFS: Ensure we commit after writeback is
>> complete")
>> Signed-off-by: NeilBrown <neilb@suse.de>
>> ---
>>  fs/nfs/inode.c          |  1 +
>>  fs/nfs/write.c          | 50 ++++++++++++++++++++++++++++++++++++---
>> --
>>  include/linux/nfs_xdr.h |  7 ++++++
>>  3 files changed, 53 insertions(+), 5 deletions(-)
>> 
>> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
>> index 11bf15800ac9..c00b54491949 100644
>> --- a/fs/nfs/inode.c
>> +++ b/fs/nfs/inode.c
>> @@ -2110,6 +2110,7 @@ static void init_once(void *foo)
>>  	INIT_LIST_HEAD(&nfsi->commit_info.list);
>>  	atomic_long_set(&nfsi->nrequests, 0);
>>  	atomic_long_set(&nfsi->commit_info.ncommit, 0);
>> +	nfsi->commit_info.ioc = NULL;
>>  	atomic_set(&nfsi->commit_info.rpcs_out, 0);
>>  	init_rwsem(&nfsi->rmdir_sem);
>>  	mutex_init(&nfsi->commit_mutex);
>> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>> index c478b772cc49..57e209f964e4 100644
>> --- a/fs/nfs/write.c
>> +++ b/fs/nfs/write.c
>> @@ -47,6 +47,7 @@ struct nfs_io_completion {
>>  	void (*complete)(void *data);
>>  	void *data;
>>  	struct kref refcount;
>> +	struct rcu_head rcu;
>>  };
>>  
>>  /*
>> @@ -134,7 +135,7 @@ static void nfs_io_completion_release(struct kref
>> *kref)
>>  	struct nfs_io_completion *ioc = container_of(kref,
>>  			struct nfs_io_completion, refcount);
>>  	ioc->complete(ioc->data);
>> -	kfree(ioc);
>> +	kfree_rcu(ioc, rcu);
>>  }
>>  
>>  static void nfs_io_completion_get(struct nfs_io_completion *ioc)
>> @@ -720,6 +721,8 @@ static void nfs_io_completion_commit(void *inode)
>>  	nfs_commit_inode(inode, 0);
>>  }
>>  
>> +static void nfs_commit_end(struct nfs_mds_commit_info *cinfo);
>> +
>>  int nfs_writepages(struct address_space *mapping, struct
>> writeback_control *wbc)
>>  {
>>  	struct inode *inode = mapping->host;
>> @@ -729,9 +732,37 @@ int nfs_writepages(struct address_space
>> *mapping, struct writeback_control *wbc)
>>  
>>  	nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
>>  
>> -	ioc = nfs_io_completion_alloc(GFP_KERNEL);
>> -	if (ioc)
>> -		nfs_io_completion_init(ioc, nfs_io_completion_commit,
>> inode);
>> +	rcu_read_lock();
>> +	ioc = rcu_dereference(NFS_I(inode)->commit_info.ioc);
>> +	if (ioc && kref_get_unless_zero(&ioc->refcount)) {
>> +		rcu_read_unlock();
>> +		/* We've successfully piggybacked on the attached ioc
>> */
>> +	} else {
>> +		rcu_read_unlock();
>> +		ioc = nfs_io_completion_alloc(GFP_KERNEL);
>> +		if (ioc) {
>> +			struct nfs_io_completion *ioc_prev;
>> +
>> +			nfs_io_completion_init(ioc,
>> nfs_io_completion_commit,
>> +					       inode);
>> +			/* Temporarily elevate rpcs_out to ensure a
>> commit
>> +			 * completion *will* happen after we attach
>> this ioc,
>> +			 * so it *will* get a chance to drain.
>> +			 */
>> +			atomic_inc(&NFS_I(inode)-
>> >commit_info.rpcs_out);
>> +			nfs_io_completion_get(ioc);
>> +			ioc_prev = xchg(&NFS_I(inode)->commit_info.ioc,
>> +				    ioc);
>> +			/* ioc_prev is normally NULL, but racing
>> writepages
>> +			 * calls might result in it being non-NULL.
>> +			 * In either case it is safe to put it - worst
>> case
>> +			 * we get an extra COMMIT.
>> +			 */
>> +			nfs_io_completion_put(ioc_prev);
>> +			/* release temporary ref on rpcs_out */
>> +			nfs_commit_end(&NFS_I(inode)->commit_info);
>
> So won't this normally trigger the xchg(NULL) in nfs_commmit_end()? For
> most cases, I'd expect commit_info.rpcs_out to be zero until we start
> actually sending out the COMMITs.

On the first writepages call after a period of quiet, that is exactly
correct.  The ioc that we have just allocated and attached will
immediately be detached.  It will have a refcount of 1 at this point so
nothing further happens.
Then WRITEs get queued elevating the refcount.  This will eventually
drain and a COMMIT will be sent.  So the behaviour for the firstpages
call is exactly like the current code (I think I mentioned that in the
commit message).
If the WRITEs and COMMIT complete before the next .writepages, then that
writepages will behave the same way.
But if the WRITEs or COMMIT are still pending, then rpcs_out will still
be elevated....
Only - no.  Unattached iocs are meant to be counted in rpcs_out, but I
haven't got that right.  So we can still end up with parallel COMMITs.
I think that might explain some slightly odd numbers in my testing.

Here is the new version.

Thanks for the review.

NeilBrown

From: NeilBrown <neilb@suse.de>
Subject: [PATCH] NFS: Minimize COMMIT calls during writeback.

Since 4.13 (see Fixes: commit) NFS has sent a COMMIT request for every
.writepages() call it received - when the writes submitted for that call
have all completed.

This works well enough when COMMIT is fast, but if COMMIT is slow these
calls can overlap each other, overload the server, and substantially
reduce throughput.

I looked at this due to a report of regression when writing to a Ganesha
NFS server with tracing showing multiple overlapping COMMITs, and
individual commits typically taking 200ms to 300ms.  This resulted in 2
orders of magnitude slow down when writing 1000x1M from /dev/zero with dd.
This is easily reproduced by adding 'msleep(300)' to nfsd_commit() in the
Linux NFS server.

This patch changes the details of when COMMIT is sent without changing
the general approach.

Where previously the writes from a single .writepages() call were
accounted together in a nfs_io_completion, now the writes from multiple
writepages() calls are accounted together.  The set of writepages calls
that are grouped together are all those from when one COMMIT call ends
to when the next COMMIT call ends.  This automatically reduces the rate
of COMMIT requests when COMMIT itself is slow.
(If there are no COMMIT calls pending, the first .writepages will get
 an nfs_io_completion to itself, then subsequenct writepages requests
 until the first COMMIT completes will go in the next nfs_io_completion)

There are typically at most two nfs_io_completion structures allocated
for writeback to a given inode.

- If there was been any writepages calls since the last time a COMMIT
  completed, there will be an nfs_io_completion stored in the inode (in
  nfs_mds_commit_info) which new writepages requests are accounted it.

- If there is no pending COMMIT request, but there are pending writeback
  WRITES, there will be another nfs_io_completion which is not attached
  and which is draining.  When it fully drains a COMMIT will be sent.
  When that COMMIT completes, the attached nfs_io_completion will be
  detached and allowed to drain.

The rpcs_out counter now counts the unattached nfs_io_completion as well
as any pending COMMIT requests.  As an unattached nfs_io_completion will
soon turn into a COMMIT request, this seems reasonable.  It allows us to
clearly detect when there are no longer any COMMITs expected to
complete, so we know to detach any nfs_io_completion from the inode and
allow it to drain.

As we use atomic accesses (e.g.  xchg and kref_get_unless_zero()) to
access the attached nfs_io_completion, we now use kfree_rcu() to free
it, to ensure it is not accessed after it is freed.

With 300ms commits, this improves throught of a 1G dd by 2 orders of
magnitude.  Even without the 300ms delay, this noticeably improves
throughput to a Linux NFS server is a simple VM.

Fixes: 919e3bd9a875 ("NFS: Ensure we commit after writeback is complete")
Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/inode.c          |  1 +
 fs/nfs/write.c          | 69 +++++++++++++++++++++++++++++++++++++----
 include/linux/nfs_xdr.h |  7 +++++
 3 files changed, 71 insertions(+), 6 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 11bf15800ac9..c00b54491949 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -2110,6 +2110,7 @@ static void init_once(void *foo)
 	INIT_LIST_HEAD(&nfsi->commit_info.list);
 	atomic_long_set(&nfsi->nrequests, 0);
 	atomic_long_set(&nfsi->commit_info.ncommit, 0);
+	nfsi->commit_info.ioc = NULL;
 	atomic_set(&nfsi->commit_info.rpcs_out, 0);
 	init_rwsem(&nfsi->rmdir_sem);
 	mutex_init(&nfsi->commit_mutex);
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index c478b772cc49..6dbb055f80bf 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -47,6 +47,7 @@ struct nfs_io_completion {
 	void (*complete)(void *data);
 	void *data;
 	struct kref refcount;
+	struct rcu_head rcu;
 };
 
 /*
@@ -134,7 +135,7 @@ static void nfs_io_completion_release(struct kref *kref)
 	struct nfs_io_completion *ioc = container_of(kref,
 			struct nfs_io_completion, refcount);
 	ioc->complete(ioc->data);
-	kfree(ioc);
+	kfree_rcu(ioc, rcu);
 }
 
 static void nfs_io_completion_get(struct nfs_io_completion *ioc)
@@ -715,9 +716,15 @@ static int nfs_writepages_callback(struct page *page, struct writeback_control *
 	return ret;
 }
 
+static void nfs_commit_end(struct nfs_mds_commit_info *cinfo);
+
 static void nfs_io_completion_commit(void *inode)
 {
 	nfs_commit_inode(inode, 0);
+	/* this came from a detached nfs_io_completion, which is now
+	 * no longer active, so must decrement rpcs_out.
+	 */
+	nfs_commit_end(&NFS_I(inode)->commit_info);
 }
 
 int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
@@ -729,9 +736,46 @@ int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
 
 	nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
 
-	ioc = nfs_io_completion_alloc(GFP_KERNEL);
-	if (ioc)
-		nfs_io_completion_init(ioc, nfs_io_completion_commit, inode);
+	rcu_read_lock();
+	ioc = rcu_dereference(NFS_I(inode)->commit_info.ioc);
+	if (ioc && kref_get_unless_zero(&ioc->refcount)) {
+		rcu_read_unlock();
+		/* We've successfully piggybacked on the attached ioc */
+	} else {
+		rcu_read_unlock();
+		ioc = nfs_io_completion_alloc(GFP_KERNEL);
+		if (ioc) {
+			struct nfs_io_completion *ioc_prev;
+
+			nfs_io_completion_init(ioc, nfs_io_completion_commit,
+					       inode);
+			/* This is now a detached ioc so we count it in
+			 * rpcs_out.  *After* we successfully attach it
+			 * below (likely, but not certain), we will call
+			 * nfs_commit_end() which might detach it immedately
+			 * if there are no outstanding commit.
+			 */
+			atomic_inc(&NFS_I(inode)->commit_info.rpcs_out);
+			nfs_io_completion_get(ioc);
+			ioc_prev = xchg(&NFS_I(inode)->commit_info.ioc,
+				    ioc);
+			/* ioc_prev is normally NULL, but racing writepages
+			 * calls might result in it being non-NULL.
+			 * In either case it is safe to put it - worst case
+			 * we get an extra COMMIT.
+			 */
+			if (ioc_prev)
+				/* This ioc_prev is now detached, so put it, and
+				 * keep the ref on rpcs_out
+				 */
+				nfs_io_completion_put(ioc_prev);
+			else
+				/* Attachment successful, so drop ref
+				 * on rpcs_out
+				 */
+				nfs_commit_end(&NFS_I(inode)->commit_info);
+		}
+	}
 
 	nfs_pageio_init_write(&pgio, inode, wb_priority(wbc), false,
 				&nfs_async_write_completion_ops);
@@ -1677,8 +1721,21 @@ static void nfs_commit_begin(struct nfs_mds_commit_info *cinfo)
 
 static void nfs_commit_end(struct nfs_mds_commit_info *cinfo)
 {
-	if (atomic_dec_and_test(&cinfo->rpcs_out))
-		wake_up_var(&cinfo->rpcs_out);
+	if (atomic_dec_and_test(&cinfo->rpcs_out)) {
+		/* When a commit finishes, we must release any attached
+		 * nfs_io_completion so that it can drain and then request
+		 * another commit.
+		 */
+		struct nfs_io_completion *ioc;
+
+		ioc = xchg(&cinfo->ioc, NULL);
+		if (ioc) {
+			/* Count the detached completion */
+			atomic_inc(&cinfo->rpcs_out);
+			nfs_io_completion_put(ioc);
+		} else
+			wake_up_var(&cinfo->rpcs_out);
+	}
 }
 
 void nfs_commitdata_release(struct nfs_commit_data *data)
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 94c77ed55ce1..89db1e9d461d 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1557,9 +1557,16 @@ struct nfs_pgio_header {
 };
 
 struct nfs_mds_commit_info {
+	/* rpcs_out counts pending COMMIT rpcs plus pendng nfs_io_completions
+	 * which are *not* attached at 'ioc' below.  Such nfs_io_compleions
+	 * (normally at most one) will drain as writes complete and then trigger
+	 * a COMMIT, so they can be considered as pending COMMITs which haven't
+	 * been sent yet
+	 */
 	atomic_t rpcs_out;
 	atomic_long_t		ncommit;
 	struct list_head	list;
+	struct nfs_io_completion *ioc;
 };
 
 struct nfs_commit_info;
Trond Myklebust March 16, 2020, 4:13 a.m. UTC | #3
On Mon, 2020-03-16 at 14:39 +1100, NeilBrown wrote:
> On Mon, Mar 16 2020, Trond Myklebust wrote:
> 
> > On Mon, 2020-03-16 at 11:05 +1100, NeilBrown wrote:
> > > Since 4.13 (see Fixes: commit) NFS has sent a COMMIT request for
> > > every
> > > .writepages() call it received - when the writes submitted for
> > > that
> > > call
> > > have all completed.
> > > 
> > > This works well enough when COMMIT is fast, but if COMMIT is slow
> > > these
> > > calls can overlap each other, overload the server, and
> > > substantially
> > > reduce throughput.
> > > 
> > > I looked at this due to a report of regression when writing to a
> > > Ganesha
> > > NFS server with tracing showing multiple overlapping COMMITs, and
> > > individual commits typically taking 200ms to 300ms.  This
> > > resulted in
> > > 2
> > > orders of magnitude slow down when writing 1000x1M from /dev/zero
> > > with dd.
> > > This is easily reproduced by adding 'msleep(300)' to
> > > nfsd_commit() in
> > > the
> > > Linux NFS server.
> > 
> > When there is overlap of writeback to the same file, then it is
> > almost
> > always because we're hitting memory reclaim on the client. In that
> > case
> > we want to ensure that memory reclaim completes, which it won't do
> > if
> > we delay COMMIT.
> > IOW: this behaviour is very much intentional.
> 
> What do you mean by "overlap of writeback" ?
> The ->writepages calls are sequential - they don't overlap.
> The WRITE rpcs are asynchronous so of course they overlap.
> The COMMITs are expected to overlap with subsequent WRITEs.
> It is only when COMMITs overlap with COMMITs that I think we might
> overload the server, and I cannot see how that would ever be a useful
> thing to do.
> 

I mean that the NFS writeback code has exactly two cases where it
starts to write out data before an fsync() or close() forces it to do
so:

1) Read of a page that was written to, but not up to date
2) Memory pressure.

IOW: If you're seeing overlapping commits, it is due to a situation
where we're trying to free up the pages as soon as possible.

> In the cases I've experiment with there is plenty of memory, but the
> total write size exceeds dirty_threshold, so writeout starts to limit
> the amount of dirty memory.  Certainly we want the COMMIT promptly as
> a group of writes finish, but that group doesn't need to be tiny.
> 
> (Our initial explorations found that increasing dirty_threshold to
> larger than the write size restored normal performance)
> 

That's the definition of memory pressure here. It's not the job of the
filesystem to second guess what is going on in the higher layers. Its
job is to write out data as quickly as possible when asked to do so.

> > > This patch changes the details of when COMMIT is sent without
> > > changing
> > > the general approach.
> > > 
> > > Where previously the writes from a single .writepages() call were
> > > accounted together in a nfs_io_completion, now the writes from
> > > multiple
> > > writepages() calls are accounted together.  The set of writepages
> > > calls
> > > that are grouped together are all those from when one COMMIT call
> > > ends
> > > to when the next COMMIT call ends.  This automatically reduces
> > > the
> > > rate
> > > of COMMIT requests when COMMIT itself is slow.
> > > (If there are no COMMIT calls pending, the first .writepages will
> > > get
> > >  an nfs_io_completion to itself, then subsequenct writepages
> > > requests
> > >  until the first COMMIT completes will go in the next
> > > nfs_io_completion)
> > > 
> > > There are typically at most two nfs_io_completion structures
> > > allocated
> > > for writeback to a given inode.
> > > 
> > > - If there was been any writepages calls since the last time a
> > > COMMIT
> > >   completed, there will be an nfs_io_completion stored in the
> > > inode
> > > (in
> > >   nfs_mds_commit_info) which new writepages requests are
> > > accounted
> > > it.
> > > 
> > > - If there is no pending COMMIT request, but there are pending
> > > writeback
> > >   WRITES, there will be another nfs_io_completion which is not
> > > attached
> > >   and which is draining.  When it fully drains a COMMIT will be
> > > sent.
> > >   When that COMMIT completes, the attached nfs_io_completion will
> > > be
> > >   detached and allowed to drain.
> > > 
> > > The rpcs_out counter now counts the unattached nfs_io_completion
> > > as
> > > well
> > > as any pending COMMIT requests.  As an unattached
> > > nfs_io_completion
> > > will
> > > soon turn into a COMMIT request, this seems reasonable.  It
> > > allows us
> > > to
> > > clearly detect when there are no longer any COMMITs expected to
> > > complete, so we know to detach any nfs_io_completion from the
> > > inode
> > > and
> > > allow it to drain.
> > > 
> > > As we use atomic accesses (e.g.  xchg and kref_get_unless_zero())
> > > to
> > > access the attached nfs_io_completion, we now use kfree_rcu() to
> > > free
> > > it, to ensure it is not accessed after it is freed.
> > > 
> > > With 300ms commits, this improves throught of a 1G dd by 2 orders
> > > of
> > > magnitude.  Even without the 300ms delay, this noticeably
> > > improves
> > > throughput to a Linux NFS server is a simple VM.
> > > 
> > > Fixes: 919e3bd9a875 ("NFS: Ensure we commit after writeback is
> > > complete")
> > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > ---
> > >  fs/nfs/inode.c          |  1 +
> > >  fs/nfs/write.c          | 50
> > > ++++++++++++++++++++++++++++++++++++---
> > > --
> > >  include/linux/nfs_xdr.h |  7 ++++++
> > >  3 files changed, 53 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> > > index 11bf15800ac9..c00b54491949 100644
> > > --- a/fs/nfs/inode.c
> > > +++ b/fs/nfs/inode.c
> > > @@ -2110,6 +2110,7 @@ static void init_once(void *foo)
> > >  	INIT_LIST_HEAD(&nfsi->commit_info.list);
> > >  	atomic_long_set(&nfsi->nrequests, 0);
> > >  	atomic_long_set(&nfsi->commit_info.ncommit, 0);
> > > +	nfsi->commit_info.ioc = NULL;
> > >  	atomic_set(&nfsi->commit_info.rpcs_out, 0);
> > >  	init_rwsem(&nfsi->rmdir_sem);
> > >  	mutex_init(&nfsi->commit_mutex);
> > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > index c478b772cc49..57e209f964e4 100644
> > > --- a/fs/nfs/write.c
> > > +++ b/fs/nfs/write.c
> > > @@ -47,6 +47,7 @@ struct nfs_io_completion {
> > >  	void (*complete)(void *data);
> > >  	void *data;
> > >  	struct kref refcount;
> > > +	struct rcu_head rcu;
> > >  };
> > >  
> > >  /*
> > > @@ -134,7 +135,7 @@ static void nfs_io_completion_release(struct
> > > kref
> > > *kref)
> > >  	struct nfs_io_completion *ioc = container_of(kref,
> > >  			struct nfs_io_completion, refcount);
> > >  	ioc->complete(ioc->data);
> > > -	kfree(ioc);
> > > +	kfree_rcu(ioc, rcu);
> > >  }
> > >  
> > >  static void nfs_io_completion_get(struct nfs_io_completion *ioc)
> > > @@ -720,6 +721,8 @@ static void nfs_io_completion_commit(void
> > > *inode)
> > >  	nfs_commit_inode(inode, 0);
> > >  }
> > >  
> > > +static void nfs_commit_end(struct nfs_mds_commit_info *cinfo);
> > > +
> > >  int nfs_writepages(struct address_space *mapping, struct
> > > writeback_control *wbc)
> > >  {
> > >  	struct inode *inode = mapping->host;
> > > @@ -729,9 +732,37 @@ int nfs_writepages(struct address_space
> > > *mapping, struct writeback_control *wbc)
> > >  
> > >  	nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
> > >  
> > > -	ioc = nfs_io_completion_alloc(GFP_KERNEL);
> > > -	if (ioc)
> > > -		nfs_io_completion_init(ioc, nfs_io_completion_commit,
> > > inode);
> > > +	rcu_read_lock();
> > > +	ioc = rcu_dereference(NFS_I(inode)->commit_info.ioc);
> > > +	if (ioc && kref_get_unless_zero(&ioc->refcount)) {
> > > +		rcu_read_unlock();
> > > +		/* We've successfully piggybacked on the attached ioc
> > > */
> > > +	} else {
> > > +		rcu_read_unlock();
> > > +		ioc = nfs_io_completion_alloc(GFP_KERNEL);
> > > +		if (ioc) {
> > > +			struct nfs_io_completion *ioc_prev;
> > > +
> > > +			nfs_io_completion_init(ioc,
> > > nfs_io_completion_commit,
> > > +					       inode);
> > > +			/* Temporarily elevate rpcs_out to ensure a
> > > commit
> > > +			 * completion *will* happen after we attach
> > > this ioc,
> > > +			 * so it *will* get a chance to drain.
> > > +			 */
> > > +			atomic_inc(&NFS_I(inode)-
> > > > commit_info.rpcs_out);
> > > +			nfs_io_completion_get(ioc);
> > > +			ioc_prev = xchg(&NFS_I(inode)->commit_info.ioc,
> > > +				    ioc);
> > > +			/* ioc_prev is normally NULL, but racing
> > > writepages
> > > +			 * calls might result in it being non-NULL.
> > > +			 * In either case it is safe to put it - worst
> > > case
> > > +			 * we get an extra COMMIT.
> > > +			 */
> > > +			nfs_io_completion_put(ioc_prev);
> > > +			/* release temporary ref on rpcs_out */
> > > +			nfs_commit_end(&NFS_I(inode)->commit_info);
> > 
> > So won't this normally trigger the xchg(NULL) in nfs_commmit_end()?
> > For
> > most cases, I'd expect commit_info.rpcs_out to be zero until we
> > start
> > actually sending out the COMMITs.
> 
> On the first writepages call after a period of quiet, that is exactly
> correct.  The ioc that we have just allocated and attached will
> immediately be detached.  It will have a refcount of 1 at this point
> so
> nothing further happens.
> Then WRITEs get queued elevating the refcount.  This will eventually
> drain and a COMMIT will be sent.  So the behaviour for the firstpages
> call is exactly like the current code (I think I mentioned that in
> the
> commit message).
> If the WRITEs and COMMIT complete before the next .writepages, then
> that
> writepages will behave the same way.
> But if the WRITEs or COMMIT are still pending, then rpcs_out will
> still
> be elevated....
> Only - no.  Unattached iocs are meant to be counted in rpcs_out, but
> I
> haven't got that right.  So we can still end up with parallel
> COMMITs.
> I think that might explain some slightly odd numbers in my testing.
> 
> Here is the new version.
> 
> Thanks for the review.
> 
> NeilBrown
> 
> From: NeilBrown <neilb@suse.de>
> Subject: [PATCH] NFS: Minimize COMMIT calls during writeback.
> 
> Since 4.13 (see Fixes: commit) NFS has sent a COMMIT request for
> every
> .writepages() call it received - when the writes submitted for that
> call
> have all completed.
> 
> This works well enough when COMMIT is fast, but if COMMIT is slow
> these
> calls can overlap each other, overload the server, and substantially
> reduce throughput.
> 
> I looked at this due to a report of regression when writing to a
> Ganesha
> NFS server with tracing showing multiple overlapping COMMITs, and
> individual commits typically taking 200ms to 300ms.  This resulted in
> 2
> orders of magnitude slow down when writing 1000x1M from /dev/zero
> with dd.
> This is easily reproduced by adding 'msleep(300)' to nfsd_commit() in
> the
> Linux NFS server.
> 
> This patch changes the details of when COMMIT is sent without
> changing
> the general approach.
> 
> Where previously the writes from a single .writepages() call were
> accounted together in a nfs_io_completion, now the writes from
> multiple
> writepages() calls are accounted together.  The set of writepages
> calls
> that are grouped together are all those from when one COMMIT call
> ends
> to when the next COMMIT call ends.  This automatically reduces the
> rate
> of COMMIT requests when COMMIT itself is slow.
> (If there are no COMMIT calls pending, the first .writepages will get
>  an nfs_io_completion to itself, then subsequenct writepages requests
>  until the first COMMIT completes will go in the next
> nfs_io_completion)
> 
> There are typically at most two nfs_io_completion structures
> allocated
> for writeback to a given inode.
> 
> - If there was been any writepages calls since the last time a COMMIT
>   completed, there will be an nfs_io_completion stored in the inode
> (in
>   nfs_mds_commit_info) which new writepages requests are accounted
> it.
> 
> - If there is no pending COMMIT request, but there are pending
> writeback
>   WRITES, there will be another nfs_io_completion which is not
> attached
>   and which is draining.  When it fully drains a COMMIT will be sent.
>   When that COMMIT completes, the attached nfs_io_completion will be
>   detached and allowed to drain.
> 
> The rpcs_out counter now counts the unattached nfs_io_completion as
> well
> as any pending COMMIT requests.  As an unattached nfs_io_completion
> will
> soon turn into a COMMIT request, this seems reasonable.  It allows us
> to
> clearly detect when there are no longer any COMMITs expected to
> complete, so we know to detach any nfs_io_completion from the inode
> and
> allow it to drain.
> 
> As we use atomic accesses (e.g.  xchg and kref_get_unless_zero()) to
> access the attached nfs_io_completion, we now use kfree_rcu() to free
> it, to ensure it is not accessed after it is freed.
> 
> With 300ms commits, this improves throught of a 1G dd by 2 orders of
> magnitude.  Even without the 300ms delay, this noticeably improves
> throughput to a Linux NFS server is a simple VM.
> 
> Fixes: 919e3bd9a875 ("NFS: Ensure we commit after writeback is
> complete")
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/nfs/inode.c          |  1 +
>  fs/nfs/write.c          | 69 +++++++++++++++++++++++++++++++++++++
> ----
>  include/linux/nfs_xdr.h |  7 +++++
>  3 files changed, 71 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index 11bf15800ac9..c00b54491949 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -2110,6 +2110,7 @@ static void init_once(void *foo)
>  	INIT_LIST_HEAD(&nfsi->commit_info.list);
>  	atomic_long_set(&nfsi->nrequests, 0);
>  	atomic_long_set(&nfsi->commit_info.ncommit, 0);
> +	nfsi->commit_info.ioc = NULL;
>  	atomic_set(&nfsi->commit_info.rpcs_out, 0);
>  	init_rwsem(&nfsi->rmdir_sem);
>  	mutex_init(&nfsi->commit_mutex);
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index c478b772cc49..6dbb055f80bf 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -47,6 +47,7 @@ struct nfs_io_completion {
>  	void (*complete)(void *data);
>  	void *data;
>  	struct kref refcount;
> +	struct rcu_head rcu;
>  };
>  
>  /*
> @@ -134,7 +135,7 @@ static void nfs_io_completion_release(struct kref
> *kref)
>  	struct nfs_io_completion *ioc = container_of(kref,
>  			struct nfs_io_completion, refcount);
>  	ioc->complete(ioc->data);
> -	kfree(ioc);
> +	kfree_rcu(ioc, rcu);
>  }
>  
>  static void nfs_io_completion_get(struct nfs_io_completion *ioc)
> @@ -715,9 +716,15 @@ static int nfs_writepages_callback(struct page
> *page, struct writeback_control *
>  	return ret;
>  }
>  
> +static void nfs_commit_end(struct nfs_mds_commit_info *cinfo);
> +
>  static void nfs_io_completion_commit(void *inode)
>  {
>  	nfs_commit_inode(inode, 0);
> +	/* this came from a detached nfs_io_completion, which is now
> +	 * no longer active, so must decrement rpcs_out.
> +	 */
> +	nfs_commit_end(&NFS_I(inode)->commit_info);
>  }
>  
>  int nfs_writepages(struct address_space *mapping, struct
> writeback_control *wbc)
> @@ -729,9 +736,46 @@ int nfs_writepages(struct address_space
> *mapping, struct writeback_control *wbc)
>  
>  	nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
>  
> -	ioc = nfs_io_completion_alloc(GFP_KERNEL);
> -	if (ioc)
> -		nfs_io_completion_init(ioc, nfs_io_completion_commit,
> inode);
> +	rcu_read_lock();
> +	ioc = rcu_dereference(NFS_I(inode)->commit_info.ioc);
> +	if (ioc && kref_get_unless_zero(&ioc->refcount)) {
> +		rcu_read_unlock();
> +		/* We've successfully piggybacked on the attached ioc
> */
> +	} else {
> +		rcu_read_unlock();
> +		ioc = nfs_io_completion_alloc(GFP_KERNEL);
> +		if (ioc) {
> +			struct nfs_io_completion *ioc_prev;
> +
> +			nfs_io_completion_init(ioc,
> nfs_io_completion_commit,
> +					       inode);
> +			/* This is now a detached ioc so we count it in
> +			 * rpcs_out.  *After* we successfully attach it
> +			 * below (likely, but not certain), we will
> call
> +			 * nfs_commit_end() which might detach it
> immedately
> +			 * if there are no outstanding commit.
> +			 */
> +			atomic_inc(&NFS_I(inode)-
> >commit_info.rpcs_out);

commit_info.rpcs_out is there in order to count the number of
outstanding COMMITs. I'm not seeing why we need to change that
definition, particularly given that there can be processes out there
that are trying to wait for the count to go to zero.

The I/O completion structure already has its own reference counter, and
I can see nothing stopping you from modifying
nfs_io_completion_commit() to do the same things you are trying to do
in nfs_commit_end() here.

However as I said, I have my doubts as to whether or not this is a good
idea. It sounds to me as if you're trying to fix a memory management
layer problem in the NFS layer.

> +			nfs_io_completion_get(ioc);
> +			ioc_prev = xchg(&NFS_I(inode)->commit_info.ioc,
> +				    ioc);
> +			/* ioc_prev is normally NULL, but racing
> writepages
> +			 * calls might result in it being non-NULL.
> +			 * In either case it is safe to put it - worst
> case
> +			 * we get an extra COMMIT.
> +			 */
> +			if (ioc_prev)
> +				/* This ioc_prev is now detached, so
> put it, and
> +				 * keep the ref on rpcs_out
> +				 */
> +				nfs_io_completion_put(ioc_prev);
> +			else
> +				/* Attachment successful, so drop ref
> +				 * on rpcs_out
> +				 */
> +				nfs_commit_end(&NFS_I(inode)-
> >commit_info);
> +		}
> +	}
>  
>  	nfs_pageio_init_write(&pgio, inode, wb_priority(wbc), false,
>  				&nfs_async_write_completion_ops);
> @@ -1677,8 +1721,21 @@ static void nfs_commit_begin(struct
> nfs_mds_commit_info *cinfo)
>  
>  static void nfs_commit_end(struct nfs_mds_commit_info *cinfo)
>  {
> -	if (atomic_dec_and_test(&cinfo->rpcs_out))
> -		wake_up_var(&cinfo->rpcs_out);
> +	if (atomic_dec_and_test(&cinfo->rpcs_out)) {
> +		/* When a commit finishes, we must release any attached
> +		 * nfs_io_completion so that it can drain and then
> request
> +		 * another commit.
> +		 */
> +		struct nfs_io_completion *ioc;
> +
> +		ioc = xchg(&cinfo->ioc, NULL);
> +		if (ioc) {
> +			/* Count the detached completion */
> +			atomic_inc(&cinfo->rpcs_out);
> +			nfs_io_completion_put(ioc);
> +		} else
> +			wake_up_var(&cinfo->rpcs_out);
> +	}
>  }
>  
>  void nfs_commitdata_release(struct nfs_commit_data *data)
> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
> index 94c77ed55ce1..89db1e9d461d 100644
> --- a/include/linux/nfs_xdr.h
> +++ b/include/linux/nfs_xdr.h
> @@ -1557,9 +1557,16 @@ struct nfs_pgio_header {
>  };
>  
>  struct nfs_mds_commit_info {
> +	/* rpcs_out counts pending COMMIT rpcs plus pendng
> nfs_io_completions
> +	 * which are *not* attached at 'ioc' below.  Such
> nfs_io_compleions
> +	 * (normally at most one) will drain as writes complete and
> then trigger
> +	 * a COMMIT, so they can be considered as pending COMMITs which
> haven't
> +	 * been sent yet
> +	 */
>  	atomic_t rpcs_out;
>  	atomic_long_t		ncommit;
>  	struct list_head	list;
> +	struct nfs_io_completion *ioc;
>  };
>  
>  struct nfs_commit_info;
NeilBrown March 17, 2020, 10:41 a.m. UTC | #4
On Mon, Mar 16 2020, Trond Myklebust wrote:

> On Mon, 2020-03-16 at 14:39 +1100, NeilBrown wrote:
>> On Mon, Mar 16 2020, Trond Myklebust wrote:
>> 
>> > On Mon, 2020-03-16 at 11:05 +1100, NeilBrown wrote:
>> > > Since 4.13 (see Fixes: commit) NFS has sent a COMMIT request for
>> > > every
>> > > .writepages() call it received - when the writes submitted for
>> > > that
>> > > call
>> > > have all completed.
>> > > 
>> > > This works well enough when COMMIT is fast, but if COMMIT is slow
>> > > these
>> > > calls can overlap each other, overload the server, and
>> > > substantially
>> > > reduce throughput.
>> > > 
>> > > I looked at this due to a report of regression when writing to a
>> > > Ganesha
>> > > NFS server with tracing showing multiple overlapping COMMITs, and
>> > > individual commits typically taking 200ms to 300ms.  This
>> > > resulted in
>> > > 2
>> > > orders of magnitude slow down when writing 1000x1M from /dev/zero
>> > > with dd.
>> > > This is easily reproduced by adding 'msleep(300)' to
>> > > nfsd_commit() in
>> > > the
>> > > Linux NFS server.
>> > 
>> > When there is overlap of writeback to the same file, then it is
>> > almost
>> > always because we're hitting memory reclaim on the client. In that
>> > case
>> > we want to ensure that memory reclaim completes, which it won't do
>> > if
>> > we delay COMMIT.
>> > IOW: this behaviour is very much intentional.
>> 
>> What do you mean by "overlap of writeback" ?
>> The ->writepages calls are sequential - they don't overlap.
>> The WRITE rpcs are asynchronous so of course they overlap.
>> The COMMITs are expected to overlap with subsequent WRITEs.
>> It is only when COMMITs overlap with COMMITs that I think we might
>> overload the server, and I cannot see how that would ever be a useful
>> thing to do.
>> 
>
> I mean that the NFS writeback code has exactly two cases where it
> starts to write out data before an fsync() or close() forces it to do
> so:
>
> 1) Read of a page that was written to, but not up to date
> 2) Memory pressure.
>
> IOW: If you're seeing overlapping commits, it is due to a situation
> where we're trying to free up the pages as soon as possible.
>
>> In the cases I've experiment with there is plenty of memory, but the
>> total write size exceeds dirty_threshold, so writeout starts to limit
>> the amount of dirty memory.  Certainly we want the COMMIT promptly as
>> a group of writes finish, but that group doesn't need to be tiny.
>> 
>> (Our initial explorations found that increasing dirty_threshold to
>> larger than the write size restored normal performance)
>> 
>
> That's the definition of memory pressure here. It's not the job of the
> filesystem to second guess what is going on in the higher layers. Its
> job is to write out data as quickly as possible when asked to do so.

Fair enough.  I don't usually think of writeback as due to memory
pressure, but I can see that I filesystem wouldn't much care for the
distinction.

Still there is no overlap imposed by the VM.  The VM is sending a
sensible streak of writepages requests - as dd dirties more pages, the
VM asks NFS to write out more pages.  All in a nice orderly fashion.

>
>> > > This patch changes the details of when COMMIT is sent without
>> > > changing
>> > > the general approach.
>> > > 
>> > > Where previously the writes from a single .writepages() call were
>> > > accounted together in a nfs_io_completion, now the writes from
>> > > multiple
>> > > writepages() calls are accounted together.  The set of writepages
>> > > calls
>> > > that are grouped together are all those from when one COMMIT call
>> > > ends
>> > > to when the next COMMIT call ends.  This automatically reduces
>> > > the
>> > > rate
>> > > of COMMIT requests when COMMIT itself is slow.
>> > > (If there are no COMMIT calls pending, the first .writepages will
>> > > get
>> > >  an nfs_io_completion to itself, then subsequenct writepages
>> > > requests
>> > >  until the first COMMIT completes will go in the next
>> > > nfs_io_completion)
>> > > 
>> > > There are typically at most two nfs_io_completion structures
>> > > allocated
>> > > for writeback to a given inode.
>> > > 
>> > > - If there was been any writepages calls since the last time a
>> > > COMMIT
>> > >   completed, there will be an nfs_io_completion stored in the
>> > > inode
>> > > (in
>> > >   nfs_mds_commit_info) which new writepages requests are
>> > > accounted
>> > > it.
>> > > 
>> > > - If there is no pending COMMIT request, but there are pending
>> > > writeback
>> > >   WRITES, there will be another nfs_io_completion which is not
>> > > attached
>> > >   and which is draining.  When it fully drains a COMMIT will be
>> > > sent.
>> > >   When that COMMIT completes, the attached nfs_io_completion will
>> > > be
>> > >   detached and allowed to drain.
>> > > 
>> > > The rpcs_out counter now counts the unattached nfs_io_completion
>> > > as
>> > > well
>> > > as any pending COMMIT requests.  As an unattached
>> > > nfs_io_completion
>> > > will
>> > > soon turn into a COMMIT request, this seems reasonable.  It
>> > > allows us
>> > > to
>> > > clearly detect when there are no longer any COMMITs expected to
>> > > complete, so we know to detach any nfs_io_completion from the
>> > > inode
>> > > and
>> > > allow it to drain.
>> > > 
>> > > As we use atomic accesses (e.g.  xchg and kref_get_unless_zero())
>> > > to
>> > > access the attached nfs_io_completion, we now use kfree_rcu() to
>> > > free
>> > > it, to ensure it is not accessed after it is freed.
>> > > 
>> > > With 300ms commits, this improves throught of a 1G dd by 2 orders
>> > > of
>> > > magnitude.  Even without the 300ms delay, this noticeably
>> > > improves
>> > > throughput to a Linux NFS server is a simple VM.
>> > > 
>> > > Fixes: 919e3bd9a875 ("NFS: Ensure we commit after writeback is
>> > > complete")
>> > > Signed-off-by: NeilBrown <neilb@suse.de>
>> > > ---
>> > >  fs/nfs/inode.c          |  1 +
>> > >  fs/nfs/write.c          | 50
>> > > ++++++++++++++++++++++++++++++++++++---
>> > > --
>> > >  include/linux/nfs_xdr.h |  7 ++++++
>> > >  3 files changed, 53 insertions(+), 5 deletions(-)
>> > > 
>> > > diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
>> > > index 11bf15800ac9..c00b54491949 100644
>> > > --- a/fs/nfs/inode.c
>> > > +++ b/fs/nfs/inode.c
>> > > @@ -2110,6 +2110,7 @@ static void init_once(void *foo)
>> > >  	INIT_LIST_HEAD(&nfsi->commit_info.list);
>> > >  	atomic_long_set(&nfsi->nrequests, 0);
>> > >  	atomic_long_set(&nfsi->commit_info.ncommit, 0);
>> > > +	nfsi->commit_info.ioc = NULL;
>> > >  	atomic_set(&nfsi->commit_info.rpcs_out, 0);
>> > >  	init_rwsem(&nfsi->rmdir_sem);
>> > >  	mutex_init(&nfsi->commit_mutex);
>> > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>> > > index c478b772cc49..57e209f964e4 100644
>> > > --- a/fs/nfs/write.c
>> > > +++ b/fs/nfs/write.c
>> > > @@ -47,6 +47,7 @@ struct nfs_io_completion {
>> > >  	void (*complete)(void *data);
>> > >  	void *data;
>> > >  	struct kref refcount;
>> > > +	struct rcu_head rcu;
>> > >  };
>> > >  
>> > >  /*
>> > > @@ -134,7 +135,7 @@ static void nfs_io_completion_release(struct
>> > > kref
>> > > *kref)
>> > >  	struct nfs_io_completion *ioc = container_of(kref,
>> > >  			struct nfs_io_completion, refcount);
>> > >  	ioc->complete(ioc->data);
>> > > -	kfree(ioc);
>> > > +	kfree_rcu(ioc, rcu);
>> > >  }
>> > >  
>> > >  static void nfs_io_completion_get(struct nfs_io_completion *ioc)
>> > > @@ -720,6 +721,8 @@ static void nfs_io_completion_commit(void
>> > > *inode)
>> > >  	nfs_commit_inode(inode, 0);
>> > >  }
>> > >  
>> > > +static void nfs_commit_end(struct nfs_mds_commit_info *cinfo);
>> > > +
>> > >  int nfs_writepages(struct address_space *mapping, struct
>> > > writeback_control *wbc)
>> > >  {
>> > >  	struct inode *inode = mapping->host;
>> > > @@ -729,9 +732,37 @@ int nfs_writepages(struct address_space
>> > > *mapping, struct writeback_control *wbc)
>> > >  
>> > >  	nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
>> > >  
>> > > -	ioc = nfs_io_completion_alloc(GFP_KERNEL);
>> > > -	if (ioc)
>> > > -		nfs_io_completion_init(ioc, nfs_io_completion_commit,
>> > > inode);
>> > > +	rcu_read_lock();
>> > > +	ioc = rcu_dereference(NFS_I(inode)->commit_info.ioc);
>> > > +	if (ioc && kref_get_unless_zero(&ioc->refcount)) {
>> > > +		rcu_read_unlock();
>> > > +		/* We've successfully piggybacked on the attached ioc
>> > > */
>> > > +	} else {
>> > > +		rcu_read_unlock();
>> > > +		ioc = nfs_io_completion_alloc(GFP_KERNEL);
>> > > +		if (ioc) {
>> > > +			struct nfs_io_completion *ioc_prev;
>> > > +
>> > > +			nfs_io_completion_init(ioc,
>> > > nfs_io_completion_commit,
>> > > +					       inode);
>> > > +			/* Temporarily elevate rpcs_out to ensure a
>> > > commit
>> > > +			 * completion *will* happen after we attach
>> > > this ioc,
>> > > +			 * so it *will* get a chance to drain.
>> > > +			 */
>> > > +			atomic_inc(&NFS_I(inode)-
>> > > > commit_info.rpcs_out);
>> > > +			nfs_io_completion_get(ioc);
>> > > +			ioc_prev = xchg(&NFS_I(inode)->commit_info.ioc,
>> > > +				    ioc);
>> > > +			/* ioc_prev is normally NULL, but racing
>> > > writepages
>> > > +			 * calls might result in it being non-NULL.
>> > > +			 * In either case it is safe to put it - worst
>> > > case
>> > > +			 * we get an extra COMMIT.
>> > > +			 */
>> > > +			nfs_io_completion_put(ioc_prev);
>> > > +			/* release temporary ref on rpcs_out */
>> > > +			nfs_commit_end(&NFS_I(inode)->commit_info);
>> > 
>> > So won't this normally trigger the xchg(NULL) in nfs_commmit_end()?
>> > For
>> > most cases, I'd expect commit_info.rpcs_out to be zero until we
>> > start
>> > actually sending out the COMMITs.
>> 
>> On the first writepages call after a period of quiet, that is exactly
>> correct.  The ioc that we have just allocated and attached will
>> immediately be detached.  It will have a refcount of 1 at this point
>> so
>> nothing further happens.
>> Then WRITEs get queued elevating the refcount.  This will eventually
>> drain and a COMMIT will be sent.  So the behaviour for the firstpages
>> call is exactly like the current code (I think I mentioned that in
>> the
>> commit message).
>> If the WRITEs and COMMIT complete before the next .writepages, then
>> that
>> writepages will behave the same way.
>> But if the WRITEs or COMMIT are still pending, then rpcs_out will
>> still
>> be elevated....
>> Only - no.  Unattached iocs are meant to be counted in rpcs_out, but
>> I
>> haven't got that right.  So we can still end up with parallel
>> COMMITs.
>> I think that might explain some slightly odd numbers in my testing.
>> 
>> Here is the new version.
>> 
>> Thanks for the review.
>> 
>> NeilBrown
>> 
>> From: NeilBrown <neilb@suse.de>
>> Subject: [PATCH] NFS: Minimize COMMIT calls during writeback.
>> 
>> Since 4.13 (see Fixes: commit) NFS has sent a COMMIT request for
>> every
>> .writepages() call it received - when the writes submitted for that
>> call
>> have all completed.
>> 
>> This works well enough when COMMIT is fast, but if COMMIT is slow
>> these
>> calls can overlap each other, overload the server, and substantially
>> reduce throughput.
>> 
>> I looked at this due to a report of regression when writing to a
>> Ganesha
>> NFS server with tracing showing multiple overlapping COMMITs, and
>> individual commits typically taking 200ms to 300ms.  This resulted in
>> 2
>> orders of magnitude slow down when writing 1000x1M from /dev/zero
>> with dd.
>> This is easily reproduced by adding 'msleep(300)' to nfsd_commit() in
>> the
>> Linux NFS server.
>> 
>> This patch changes the details of when COMMIT is sent without
>> changing
>> the general approach.
>> 
>> Where previously the writes from a single .writepages() call were
>> accounted together in a nfs_io_completion, now the writes from
>> multiple
>> writepages() calls are accounted together.  The set of writepages
>> calls
>> that are grouped together are all those from when one COMMIT call
>> ends
>> to when the next COMMIT call ends.  This automatically reduces the
>> rate
>> of COMMIT requests when COMMIT itself is slow.
>> (If there are no COMMIT calls pending, the first .writepages will get
>>  an nfs_io_completion to itself, then subsequenct writepages requests
>>  until the first COMMIT completes will go in the next
>> nfs_io_completion)
>> 
>> There are typically at most two nfs_io_completion structures
>> allocated
>> for writeback to a given inode.
>> 
>> - If there was been any writepages calls since the last time a COMMIT
>>   completed, there will be an nfs_io_completion stored in the inode
>> (in
>>   nfs_mds_commit_info) which new writepages requests are accounted
>> it.
>> 
>> - If there is no pending COMMIT request, but there are pending
>> writeback
>>   WRITES, there will be another nfs_io_completion which is not
>> attached
>>   and which is draining.  When it fully drains a COMMIT will be sent.
>>   When that COMMIT completes, the attached nfs_io_completion will be
>>   detached and allowed to drain.
>> 
>> The rpcs_out counter now counts the unattached nfs_io_completion as
>> well
>> as any pending COMMIT requests.  As an unattached nfs_io_completion
>> will
>> soon turn into a COMMIT request, this seems reasonable.  It allows us
>> to
>> clearly detect when there are no longer any COMMITs expected to
>> complete, so we know to detach any nfs_io_completion from the inode
>> and
>> allow it to drain.
>> 
>> As we use atomic accesses (e.g.  xchg and kref_get_unless_zero()) to
>> access the attached nfs_io_completion, we now use kfree_rcu() to free
>> it, to ensure it is not accessed after it is freed.
>> 
>> With 300ms commits, this improves throught of a 1G dd by 2 orders of
>> magnitude.  Even without the 300ms delay, this noticeably improves
>> throughput to a Linux NFS server is a simple VM.
>> 
>> Fixes: 919e3bd9a875 ("NFS: Ensure we commit after writeback is
>> complete")
>> Signed-off-by: NeilBrown <neilb@suse.de>
>> ---
>>  fs/nfs/inode.c          |  1 +
>>  fs/nfs/write.c          | 69 +++++++++++++++++++++++++++++++++++++
>> ----
>>  include/linux/nfs_xdr.h |  7 +++++
>>  3 files changed, 71 insertions(+), 6 deletions(-)
>> 
>> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
>> index 11bf15800ac9..c00b54491949 100644
>> --- a/fs/nfs/inode.c
>> +++ b/fs/nfs/inode.c
>> @@ -2110,6 +2110,7 @@ static void init_once(void *foo)
>>  	INIT_LIST_HEAD(&nfsi->commit_info.list);
>>  	atomic_long_set(&nfsi->nrequests, 0);
>>  	atomic_long_set(&nfsi->commit_info.ncommit, 0);
>> +	nfsi->commit_info.ioc = NULL;
>>  	atomic_set(&nfsi->commit_info.rpcs_out, 0);
>>  	init_rwsem(&nfsi->rmdir_sem);
>>  	mutex_init(&nfsi->commit_mutex);
>> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>> index c478b772cc49..6dbb055f80bf 100644
>> --- a/fs/nfs/write.c
>> +++ b/fs/nfs/write.c
>> @@ -47,6 +47,7 @@ struct nfs_io_completion {
>>  	void (*complete)(void *data);
>>  	void *data;
>>  	struct kref refcount;
>> +	struct rcu_head rcu;
>>  };
>>  
>>  /*
>> @@ -134,7 +135,7 @@ static void nfs_io_completion_release(struct kref
>> *kref)
>>  	struct nfs_io_completion *ioc = container_of(kref,
>>  			struct nfs_io_completion, refcount);
>>  	ioc->complete(ioc->data);
>> -	kfree(ioc);
>> +	kfree_rcu(ioc, rcu);
>>  }
>>  
>>  static void nfs_io_completion_get(struct nfs_io_completion *ioc)
>> @@ -715,9 +716,15 @@ static int nfs_writepages_callback(struct page
>> *page, struct writeback_control *
>>  	return ret;
>>  }
>>  
>> +static void nfs_commit_end(struct nfs_mds_commit_info *cinfo);
>> +
>>  static void nfs_io_completion_commit(void *inode)
>>  {
>>  	nfs_commit_inode(inode, 0);
>> +	/* this came from a detached nfs_io_completion, which is now
>> +	 * no longer active, so must decrement rpcs_out.
>> +	 */
>> +	nfs_commit_end(&NFS_I(inode)->commit_info);
>>  }
>>  
>>  int nfs_writepages(struct address_space *mapping, struct
>> writeback_control *wbc)
>> @@ -729,9 +736,46 @@ int nfs_writepages(struct address_space
>> *mapping, struct writeback_control *wbc)
>>  
>>  	nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
>>  
>> -	ioc = nfs_io_completion_alloc(GFP_KERNEL);
>> -	if (ioc)
>> -		nfs_io_completion_init(ioc, nfs_io_completion_commit,
>> inode);
>> +	rcu_read_lock();
>> +	ioc = rcu_dereference(NFS_I(inode)->commit_info.ioc);
>> +	if (ioc && kref_get_unless_zero(&ioc->refcount)) {
>> +		rcu_read_unlock();
>> +		/* We've successfully piggybacked on the attached ioc
>> */
>> +	} else {
>> +		rcu_read_unlock();
>> +		ioc = nfs_io_completion_alloc(GFP_KERNEL);
>> +		if (ioc) {
>> +			struct nfs_io_completion *ioc_prev;
>> +
>> +			nfs_io_completion_init(ioc,
>> nfs_io_completion_commit,
>> +					       inode);
>> +			/* This is now a detached ioc so we count it in
>> +			 * rpcs_out.  *After* we successfully attach it
>> +			 * below (likely, but not certain), we will
>> call
>> +			 * nfs_commit_end() which might detach it
>> immedately
>> +			 * if there are no outstanding commit.
>> +			 */
>> +			atomic_inc(&NFS_I(inode)-
>> >commit_info.rpcs_out);
>
> commit_info.rpcs_out is there in order to count the number of
> outstanding COMMITs. I'm not seeing why we need to change that
> definition, particularly given that there can be processes out there
> that are trying to wait for the count to go to zero.
>
> The I/O completion structure already has its own reference counter, and
> I can see nothing stopping you from modifying
> nfs_io_completion_commit() to do the same things you are trying to do
> in nfs_commit_end() here.

I don't see how that would work.
There are two events that need to happen asynchronously when a counter
hits zero.
One is that a COMMIT needs to be sent when the number of outstanding
writes from a particular set hits zero.  That is the nfs_io_completion
counter.
The other is that the currently growing set of pending commits needs to
be detached and allowed to drain so that another COMMIT will get sent.
This is what I'm using rpcs_out for.  It seems quite a natural extension
of its current use.

>
> However as I said, I have my doubts as to whether or not this is a good
> idea. It sounds to me as if you're trying to fix a memory management
> layer problem in the NFS layer.

I guess we really need to resolve this issue before we put too much
effort into sorting out differences over the code.

This is certainly not a memory management layer problem.  The MM layer
is doing exactly what it should - sending an orderly sequence of
writepages requests to the filesystem in response to relevant limits or
timeouts being reached.

NFS is putting these into effect by sending out WRITE requests for the
chosen pages, just as it should.

It is also sending COMMIT requests and this is where the problem lies.
It sometimes sends too many.

Clearly (I hope) sending one COMMIT request for every WRITE request
would be too many.  At the other end of the spectrum, only sending a
COMMIT when there are no out standing WRITEs and haven't been for some
time would be far too few.  We need something in between.

The current behaviour is to send one COMMIT for every writepages call.
This often works well but is quite arbitrary.  It ties details of the
NFS protocol to unrelated details of the MM behaviour which is not
designed to meet the specific needs of NFS.

This behaviour results in extremely poor performance in some
circumstances when COMMIT takes a long time - e.g. 300ms.  As the
frequency of COMMIT messages is tied to the behaviour of the MM, and not
the behaviour of the server, it can do things that hurt the server.
In particularly it can easily send COMMIT messages more frequently than
can be handled, which can overload the server.

I think it makes much more sense to tie the frequency of the COMMIT
messages to the speed at which the server replies to them, and in
particular only have a single COMMIT outstanding at a time.
Whether we then want to send a COMMIT immediately after the last one
returns, or after a short delay or with some other heuristic is
something I'm quite willing to discuss.  I have a solution that I like
but there might be a better one.

But can we at least agree that sending multiple concurrent COMMIT
requests is unlikely to be helpful, and to achieve this we need to let
the behaviour of the server affect the frequency of COMMITs??

Thanks,
NeilBrown
Trond Myklebust March 17, 2020, 2:22 p.m. UTC | #5
On Tue, 2020-03-17 at 21:41 +1100, NeilBrown wrote:
> On Mon, Mar 16 2020, Trond Myklebust wrote:
> 
> > On Mon, 2020-03-16 at 14:39 +1100, NeilBrown wrote:
> > > On Mon, Mar 16 2020, Trond Myklebust wrote:
> > > 
> > > > On Mon, 2020-03-16 at 11:05 +1100, NeilBrown wrote:
> > > > > Since 4.13 (see Fixes: commit) NFS has sent a COMMIT request
> > > > > for
> > > > > every
> > > > > .writepages() call it received - when the writes submitted
> > > > > for
> > > > > that
> > > > > call
> > > > > have all completed.
> > > > > 
> > > > > This works well enough when COMMIT is fast, but if COMMIT is
> > > > > slow
> > > > > these
> > > > > calls can overlap each other, overload the server, and
> > > > > substantially
> > > > > reduce throughput.
> > > > > 
> > > > > I looked at this due to a report of regression when writing
> > > > > to a
> > > > > Ganesha
> > > > > NFS server with tracing showing multiple overlapping COMMITs,
> > > > > and
> > > > > individual commits typically taking 200ms to 300ms.  This
> > > > > resulted in
> > > > > 2
> > > > > orders of magnitude slow down when writing 1000x1M from
> > > > > /dev/zero
> > > > > with dd.
> > > > > This is easily reproduced by adding 'msleep(300)' to
> > > > > nfsd_commit() in
> > > > > the
> > > > > Linux NFS server.
> > > > 
> > > > When there is overlap of writeback to the same file, then it is
> > > > almost
> > > > always because we're hitting memory reclaim on the client. In
> > > > that
> > > > case
> > > > we want to ensure that memory reclaim completes, which it won't
> > > > do
> > > > if
> > > > we delay COMMIT.
> > > > IOW: this behaviour is very much intentional.
> > > 
> > > What do you mean by "overlap of writeback" ?
> > > The ->writepages calls are sequential - they don't overlap.
> > > The WRITE rpcs are asynchronous so of course they overlap.
> > > The COMMITs are expected to overlap with subsequent WRITEs.
> > > It is only when COMMITs overlap with COMMITs that I think we
> > > might
> > > overload the server, and I cannot see how that would ever be a
> > > useful
> > > thing to do.
> > > 
> > 
> > I mean that the NFS writeback code has exactly two cases where it
> > starts to write out data before an fsync() or close() forces it to
> > do
> > so:
> > 
> > 1) Read of a page that was written to, but not up to date
> > 2) Memory pressure.
> > 
> > IOW: If you're seeing overlapping commits, it is due to a situation
> > where we're trying to free up the pages as soon as possible.
> > 
> > > In the cases I've experiment with there is plenty of memory, but
> > > the
> > > total write size exceeds dirty_threshold, so writeout starts to
> > > limit
> > > the amount of dirty memory.  Certainly we want the COMMIT
> > > promptly as
> > > a group of writes finish, but that group doesn't need to be tiny.
> > > 
> > > (Our initial explorations found that increasing dirty_threshold
> > > to
> > > larger than the write size restored normal performance)
> > > 
> > 
> > That's the definition of memory pressure here. It's not the job of
> > the
> > filesystem to second guess what is going on in the higher layers.
> > Its
> > job is to write out data as quickly as possible when asked to do
> > so.
> 
> Fair enough.  I don't usually think of writeback as due to memory
> pressure, but I can see that I filesystem wouldn't much care for the
> distinction.
> 
> Still there is no overlap imposed by the VM.  The VM is sending a
> sensible streak of writepages requests - as dd dirties more pages,
> the
> VM asks NFS to write out more pages.  All in a nice orderly fashion.

The question essentially boils down to: "What is the purpose of the
background writeback?" My answer would be "To free up pages", something
that does not happen until COMMIT is sent.

IOW: My concern is livelock, not serialisation. Right now, we do start
commit when a batch sent by the VM has been completely written (same as
we do when an O_DIRECT batch has been completely written). That ensures
that the pages can be freed. Your patch is deliberately designed to
defer that process that allows pages to be freed, and I'm not seeing
how it is bounded.


> > > > > This patch changes the details of when COMMIT is sent without
> > > > > changing
> > > > > the general approach.
> > > > > 
> > > > > Where previously the writes from a single .writepages() call
> > > > > were
> > > > > accounted together in a nfs_io_completion, now the writes
> > > > > from
> > > > > multiple
> > > > > writepages() calls are accounted together.  The set of
> > > > > writepages
> > > > > calls
> > > > > that are grouped together are all those from when one COMMIT
> > > > > call
> > > > > ends
> > > > > to when the next COMMIT call ends.  This automatically
> > > > > reduces
> > > > > the
> > > > > rate
> > > > > of COMMIT requests when COMMIT itself is slow.
> > > > > (If there are no COMMIT calls pending, the first .writepages
> > > > > will
> > > > > get
> > > > >  an nfs_io_completion to itself, then subsequenct writepages
> > > > > requests
> > > > >  until the first COMMIT completes will go in the next
> > > > > nfs_io_completion)
> > > > > 
> > > > > There are typically at most two nfs_io_completion structures
> > > > > allocated
> > > > > for writeback to a given inode.
> > > > > 
> > > > > - If there was been any writepages calls since the last time
> > > > > a
> > > > > COMMIT
> > > > >   completed, there will be an nfs_io_completion stored in the
> > > > > inode
> > > > > (in
> > > > >   nfs_mds_commit_info) which new writepages requests are
> > > > > accounted
> > > > > it.
> > > > > 
> > > > > - If there is no pending COMMIT request, but there are
> > > > > pending
> > > > > writeback
> > > > >   WRITES, there will be another nfs_io_completion which is
> > > > > not
> > > > > attached
> > > > >   and which is draining.  When it fully drains a COMMIT will
> > > > > be
> > > > > sent.
> > > > >   When that COMMIT completes, the attached nfs_io_completion
> > > > > will
> > > > > be
> > > > >   detached and allowed to drain.
> > > > > 
> > > > > The rpcs_out counter now counts the unattached
> > > > > nfs_io_completion
> > > > > as
> > > > > well
> > > > > as any pending COMMIT requests.  As an unattached
> > > > > nfs_io_completion
> > > > > will
> > > > > soon turn into a COMMIT request, this seems reasonable.  It
> > > > > allows us
> > > > > to
> > > > > clearly detect when there are no longer any COMMITs expected
> > > > > to
> > > > > complete, so we know to detach any nfs_io_completion from the
> > > > > inode
> > > > > and
> > > > > allow it to drain.
> > > > > 
> > > > > As we use atomic accesses (e.g.  xchg and
> > > > > kref_get_unless_zero())
> > > > > to
> > > > > access the attached nfs_io_completion, we now use kfree_rcu()
> > > > > to
> > > > > free
> > > > > it, to ensure it is not accessed after it is freed.
> > > > > 
> > > > > With 300ms commits, this improves throught of a 1G dd by 2
> > > > > orders
> > > > > of
> > > > > magnitude.  Even without the 300ms delay, this noticeably
> > > > > improves
> > > > > throughput to a Linux NFS server is a simple VM.
> > > > > 
> > > > > Fixes: 919e3bd9a875 ("NFS: Ensure we commit after writeback
> > > > > is
> > > > > complete")
> > > > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > > > ---
> > > > >  fs/nfs/inode.c          |  1 +
> > > > >  fs/nfs/write.c          | 50
> > > > > ++++++++++++++++++++++++++++++++++++---
> > > > > --
> > > > >  include/linux/nfs_xdr.h |  7 ++++++
> > > > >  3 files changed, 53 insertions(+), 5 deletions(-)
> > > > > 
> > > > > diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> > > > > index 11bf15800ac9..c00b54491949 100644
> > > > > --- a/fs/nfs/inode.c
> > > > > +++ b/fs/nfs/inode.c
> > > > > @@ -2110,6 +2110,7 @@ static void init_once(void *foo)
> > > > >  	INIT_LIST_HEAD(&nfsi->commit_info.list);
> > > > >  	atomic_long_set(&nfsi->nrequests, 0);
> > > > >  	atomic_long_set(&nfsi->commit_info.ncommit, 0);
> > > > > +	nfsi->commit_info.ioc = NULL;
> > > > >  	atomic_set(&nfsi->commit_info.rpcs_out, 0);
> > > > >  	init_rwsem(&nfsi->rmdir_sem);
> > > > >  	mutex_init(&nfsi->commit_mutex);
> > > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > > > index c478b772cc49..57e209f964e4 100644
> > > > > --- a/fs/nfs/write.c
> > > > > +++ b/fs/nfs/write.c
> > > > > @@ -47,6 +47,7 @@ struct nfs_io_completion {
> > > > >  	void (*complete)(void *data);
> > > > >  	void *data;
> > > > >  	struct kref refcount;
> > > > > +	struct rcu_head rcu;
> > > > >  };
> > > > >  
> > > > >  /*
> > > > > @@ -134,7 +135,7 @@ static void
> > > > > nfs_io_completion_release(struct
> > > > > kref
> > > > > *kref)
> > > > >  	struct nfs_io_completion *ioc = container_of(kref,
> > > > >  			struct nfs_io_completion, refcount);
> > > > >  	ioc->complete(ioc->data);
> > > > > -	kfree(ioc);
> > > > > +	kfree_rcu(ioc, rcu);
> > > > >  }
> > > > >  
> > > > >  static void nfs_io_completion_get(struct nfs_io_completion
> > > > > *ioc)
> > > > > @@ -720,6 +721,8 @@ static void nfs_io_completion_commit(void
> > > > > *inode)
> > > > >  	nfs_commit_inode(inode, 0);
> > > > >  }
> > > > >  
> > > > > +static void nfs_commit_end(struct nfs_mds_commit_info
> > > > > *cinfo);
> > > > > +
> > > > >  int nfs_writepages(struct address_space *mapping, struct
> > > > > writeback_control *wbc)
> > > > >  {
> > > > >  	struct inode *inode = mapping->host;
> > > > > @@ -729,9 +732,37 @@ int nfs_writepages(struct address_space
> > > > > *mapping, struct writeback_control *wbc)
> > > > >  
> > > > >  	nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
> > > > >  
> > > > > -	ioc = nfs_io_completion_alloc(GFP_KERNEL);
> > > > > -	if (ioc)
> > > > > -		nfs_io_completion_init(ioc,
> > > > > nfs_io_completion_commit,
> > > > > inode);
> > > > > +	rcu_read_lock();
> > > > > +	ioc = rcu_dereference(NFS_I(inode)->commit_info.ioc);
> > > > > +	if (ioc && kref_get_unless_zero(&ioc->refcount)) {
> > > > > +		rcu_read_unlock();
> > > > > +		/* We've successfully piggybacked on the
> > > > > attached ioc
> > > > > */
> > > > > +	} else {
> > > > > +		rcu_read_unlock();
> > > > > +		ioc = nfs_io_completion_alloc(GFP_KERNEL);
> > > > > +		if (ioc) {
> > > > > +			struct nfs_io_completion *ioc_prev;
> > > > > +
> > > > > +			nfs_io_completion_init(ioc,
> > > > > nfs_io_completion_commit,
> > > > > +					       inode);
> > > > > +			/* Temporarily elevate rpcs_out to
> > > > > ensure a
> > > > > commit
> > > > > +			 * completion *will* happen after we
> > > > > attach
> > > > > this ioc,
> > > > > +			 * so it *will* get a chance to drain.
> > > > > +			 */
> > > > > +			atomic_inc(&NFS_I(inode)-
> > > > > > commit_info.rpcs_out);
> > > > > +			nfs_io_completion_get(ioc);
> > > > > +			ioc_prev = xchg(&NFS_I(inode)-
> > > > > >commit_info.ioc,
> > > > > +				    ioc);
> > > > > +			/* ioc_prev is normally NULL, but
> > > > > racing
> > > > > writepages
> > > > > +			 * calls might result in it being non-
> > > > > NULL.
> > > > > +			 * In either case it is safe to put it
> > > > > - worst
> > > > > case
> > > > > +			 * we get an extra COMMIT.
> > > > > +			 */
> > > > > +			nfs_io_completion_put(ioc_prev);
> > > > > +			/* release temporary ref on rpcs_out */
> > > > > +			nfs_commit_end(&NFS_I(inode)-
> > > > > >commit_info);
> > > > 
> > > > So won't this normally trigger the xchg(NULL) in
> > > > nfs_commmit_end()?
> > > > For
> > > > most cases, I'd expect commit_info.rpcs_out to be zero until we
> > > > start
> > > > actually sending out the COMMITs.
> > > 
> > > On the first writepages call after a period of quiet, that is
> > > exactly
> > > correct.  The ioc that we have just allocated and attached will
> > > immediately be detached.  It will have a refcount of 1 at this
> > > point
> > > so
> > > nothing further happens.
> > > Then WRITEs get queued elevating the refcount.  This will
> > > eventually
> > > drain and a COMMIT will be sent.  So the behaviour for the
> > > firstpages
> > > call is exactly like the current code (I think I mentioned that
> > > in
> > > the
> > > commit message).
> > > If the WRITEs and COMMIT complete before the next .writepages,
> > > then
> > > that
> > > writepages will behave the same way.
> > > But if the WRITEs or COMMIT are still pending, then rpcs_out will
> > > still
> > > be elevated....
> > > Only - no.  Unattached iocs are meant to be counted in rpcs_out,
> > > but
> > > I
> > > haven't got that right.  So we can still end up with parallel
> > > COMMITs.
> > > I think that might explain some slightly odd numbers in my
> > > testing.
> > > 
> > > Here is the new version.
> > > 
> > > Thanks for the review.
> > > 
> > > NeilBrown
> > > 
> > > From: NeilBrown <neilb@suse.de>
> > > Subject: [PATCH] NFS: Minimize COMMIT calls during writeback.
> > > 
> > > Since 4.13 (see Fixes: commit) NFS has sent a COMMIT request for
> > > every
> > > .writepages() call it received - when the writes submitted for
> > > that
> > > call
> > > have all completed.
> > > 
> > > This works well enough when COMMIT is fast, but if COMMIT is slow
> > > these
> > > calls can overlap each other, overload the server, and
> > > substantially
> > > reduce throughput.
> > > 
> > > I looked at this due to a report of regression when writing to a
> > > Ganesha
> > > NFS server with tracing showing multiple overlapping COMMITs, and
> > > individual commits typically taking 200ms to 300ms.  This
> > > resulted in
> > > 2
> > > orders of magnitude slow down when writing 1000x1M from /dev/zero
> > > with dd.
> > > This is easily reproduced by adding 'msleep(300)' to
> > > nfsd_commit() in
> > > the
> > > Linux NFS server.
> > > 
> > > This patch changes the details of when COMMIT is sent without
> > > changing
> > > the general approach.
> > > 
> > > Where previously the writes from a single .writepages() call were
> > > accounted together in a nfs_io_completion, now the writes from
> > > multiple
> > > writepages() calls are accounted together.  The set of writepages
> > > calls
> > > that are grouped together are all those from when one COMMIT call
> > > ends
> > > to when the next COMMIT call ends.  This automatically reduces
> > > the
> > > rate
> > > of COMMIT requests when COMMIT itself is slow.
> > > (If there are no COMMIT calls pending, the first .writepages will
> > > get
> > >  an nfs_io_completion to itself, then subsequenct writepages
> > > requests
> > >  until the first COMMIT completes will go in the next
> > > nfs_io_completion)
> > > 
> > > There are typically at most two nfs_io_completion structures
> > > allocated
> > > for writeback to a given inode.
> > > 
> > > - If there was been any writepages calls since the last time a
> > > COMMIT
> > >   completed, there will be an nfs_io_completion stored in the
> > > inode
> > > (in
> > >   nfs_mds_commit_info) which new writepages requests are
> > > accounted
> > > it.
> > > 
> > > - If there is no pending COMMIT request, but there are pending
> > > writeback
> > >   WRITES, there will be another nfs_io_completion which is not
> > > attached
> > >   and which is draining.  When it fully drains a COMMIT will be
> > > sent.
> > >   When that COMMIT completes, the attached nfs_io_completion will
> > > be
> > >   detached and allowed to drain.
> > > 
> > > The rpcs_out counter now counts the unattached nfs_io_completion
> > > as
> > > well
> > > as any pending COMMIT requests.  As an unattached
> > > nfs_io_completion
> > > will
> > > soon turn into a COMMIT request, this seems reasonable.  It
> > > allows us
> > > to
> > > clearly detect when there are no longer any COMMITs expected to
> > > complete, so we know to detach any nfs_io_completion from the
> > > inode
> > > and
> > > allow it to drain.
> > > 
> > > As we use atomic accesses (e.g.  xchg and kref_get_unless_zero())
> > > to
> > > access the attached nfs_io_completion, we now use kfree_rcu() to
> > > free
> > > it, to ensure it is not accessed after it is freed.
> > > 
> > > With 300ms commits, this improves throught of a 1G dd by 2 orders
> > > of
> > > magnitude.  Even without the 300ms delay, this noticeably
> > > improves
> > > throughput to a Linux NFS server is a simple VM.
> > > 
> > > Fixes: 919e3bd9a875 ("NFS: Ensure we commit after writeback is
> > > complete")
> > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > ---
> > >  fs/nfs/inode.c          |  1 +
> > >  fs/nfs/write.c          | 69
> > > +++++++++++++++++++++++++++++++++++++
> > > ----
> > >  include/linux/nfs_xdr.h |  7 +++++
> > >  3 files changed, 71 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> > > index 11bf15800ac9..c00b54491949 100644
> > > --- a/fs/nfs/inode.c
> > > +++ b/fs/nfs/inode.c
> > > @@ -2110,6 +2110,7 @@ static void init_once(void *foo)
> > >  	INIT_LIST_HEAD(&nfsi->commit_info.list);
> > >  	atomic_long_set(&nfsi->nrequests, 0);
> > >  	atomic_long_set(&nfsi->commit_info.ncommit, 0);
> > > +	nfsi->commit_info.ioc = NULL;
> > >  	atomic_set(&nfsi->commit_info.rpcs_out, 0);
> > >  	init_rwsem(&nfsi->rmdir_sem);
> > >  	mutex_init(&nfsi->commit_mutex);
> > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > index c478b772cc49..6dbb055f80bf 100644
> > > --- a/fs/nfs/write.c
> > > +++ b/fs/nfs/write.c
> > > @@ -47,6 +47,7 @@ struct nfs_io_completion {
> > >  	void (*complete)(void *data);
> > >  	void *data;
> > >  	struct kref refcount;
> > > +	struct rcu_head rcu;
> > >  };
> > >  
> > >  /*
> > > @@ -134,7 +135,7 @@ static void nfs_io_completion_release(struct
> > > kref
> > > *kref)
> > >  	struct nfs_io_completion *ioc = container_of(kref,
> > >  			struct nfs_io_completion, refcount);
> > >  	ioc->complete(ioc->data);
> > > -	kfree(ioc);
> > > +	kfree_rcu(ioc, rcu);
> > >  }
> > >  
> > >  static void nfs_io_completion_get(struct nfs_io_completion *ioc)
> > > @@ -715,9 +716,15 @@ static int nfs_writepages_callback(struct
> > > page
> > > *page, struct writeback_control *
> > >  	return ret;
> > >  }
> > >  
> > > +static void nfs_commit_end(struct nfs_mds_commit_info *cinfo);
> > > +
> > >  static void nfs_io_completion_commit(void *inode)
> > >  {
> > >  	nfs_commit_inode(inode, 0);
> > > +	/* this came from a detached nfs_io_completion, which is now
> > > +	 * no longer active, so must decrement rpcs_out.
> > > +	 */
> > > +	nfs_commit_end(&NFS_I(inode)->commit_info);
> > >  }
> > >  
> > >  int nfs_writepages(struct address_space *mapping, struct
> > > writeback_control *wbc)
> > > @@ -729,9 +736,46 @@ int nfs_writepages(struct address_space
> > > *mapping, struct writeback_control *wbc)
> > >  
> > >  	nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
> > >  
> > > -	ioc = nfs_io_completion_alloc(GFP_KERNEL);
> > > -	if (ioc)
> > > -		nfs_io_completion_init(ioc, nfs_io_completion_commit,
> > > inode);
> > > +	rcu_read_lock();
> > > +	ioc = rcu_dereference(NFS_I(inode)->commit_info.ioc);
> > > +	if (ioc && kref_get_unless_zero(&ioc->refcount)) {
> > > +		rcu_read_unlock();
> > > +		/* We've successfully piggybacked on the attached ioc
> > > */
> > > +	} else {
> > > +		rcu_read_unlock();
> > > +		ioc = nfs_io_completion_alloc(GFP_KERNEL);
> > > +		if (ioc) {
> > > +			struct nfs_io_completion *ioc_prev;
> > > +
> > > +			nfs_io_completion_init(ioc,
> > > nfs_io_completion_commit,
> > > +					       inode);
> > > +			/* This is now a detached ioc so we count it in
> > > +			 * rpcs_out.  *After* we successfully attach it
> > > +			 * below (likely, but not certain), we will
> > > call
> > > +			 * nfs_commit_end() which might detach it
> > > immedately
> > > +			 * if there are no outstanding commit.
> > > +			 */
> > > +			atomic_inc(&NFS_I(inode)-
> > > > commit_info.rpcs_out);
> > 
> > commit_info.rpcs_out is there in order to count the number of
> > outstanding COMMITs. I'm not seeing why we need to change that
> > definition, particularly given that there can be processes out
> > there
> > that are trying to wait for the count to go to zero.
> > 
> > The I/O completion structure already has its own reference counter,
> > and
> > I can see nothing stopping you from modifying
> > nfs_io_completion_commit() to do the same things you are trying to
> > do
> > in nfs_commit_end() here.
> 
> I don't see how that would work.
> There are two events that need to happen asynchronously when a
> counter
> hits zero.
> One is that a COMMIT needs to be sent when the number of outstanding
> writes from a particular set hits zero.  That is the
> nfs_io_completion
> counter.
> The other is that the currently growing set of pending commits needs
> to
> be detached and allowed to drain so that another COMMIT will get
> sent.
> This is what I'm using rpcs_out for.  It seems quite a natural
> extension
> of its current use.
> 

You're not supposed to call kref_get() on something with a refcount of
zero. That means the nfs_io_completion counter defines the lifetime of
the object that your commit_info.ioc points to, and once that refcount
goes to zero, you should be removing it, which you are doing in
nfs_writepages().

commit_info.rpcs_out is non-zero for the lifetime of any COMMIT
operations. I'm not seeing how that relates to anything in the
preceding paragraph (other than that nfs_io_completion going to zero
will usually cause commit_info.rpcs_out to get incremented).
IOW: commit_info.rpcs_out goes to zero a long time after the object
pointed to by commit_info.ioc ought to be freed.

> > However as I said, I have my doubts as to whether or not this is a
> > good
> > idea. It sounds to me as if you're trying to fix a memory
> > management
> > layer problem in the NFS layer.
> 
> I guess we really need to resolve this issue before we put too much
> effort into sorting out differences over the code.
> 
> This is certainly not a memory management layer problem.  The MM
> layer
> is doing exactly what it should - sending an orderly sequence of
> writepages requests to the filesystem in response to relevant limits
> or
> timeouts being reached.
> 
> NFS is putting these into effect by sending out WRITE requests for
> the
> chosen pages, just as it should.
> 
> It is also sending COMMIT requests and this is where the problem
> lies.
> It sometimes sends too many.
> 
> Clearly (I hope) sending one COMMIT request for every WRITE request
> would be too many.  At the other end of the spectrum, only sending a
> COMMIT when there are no out standing WRITEs and haven't been for
> some
> time would be far too few.  We need something in between.
> 
> The current behaviour is to send one COMMIT for every writepages
> call.
> This often works well but is quite arbitrary.  It ties details of the
> NFS protocol to unrelated details of the MM behaviour which is not
> designed to meet the specific needs of NFS.
> 
> This behaviour results in extremely poor performance in some
> circumstances when COMMIT takes a long time - e.g. 300ms.  As the
> frequency of COMMIT messages is tied to the behaviour of the MM, and
> not
> the behaviour of the server, it can do things that hurt the server.
> In particularly it can easily send COMMIT messages more frequently
> than
> can be handled, which can overload the server.
> 
> I think it makes much more sense to tie the frequency of the COMMIT
> messages to the speed at which the server replies to them, and in
> particular only have a single COMMIT outstanding at a time.
> Whether we then want to send a COMMIT immediately after the last one
> returns, or after a short delay or with some other heuristic is
> something I'm quite willing to discuss.  I have a solution that I
> like
> but there might be a better one.
> 
> But can we at least agree that sending multiple concurrent COMMIT
> requests is unlikely to be helpful, and to achieve this we need to
> let
> the behaviour of the server affect the frequency of COMMITs??
> 
> Thanks,
> NeilBrown
>
NeilBrown March 17, 2020, 11:48 p.m. UTC | #6
On Tue, Mar 17 2020, Trond Myklebust wrote:

> On Tue, 2020-03-17 at 21:41 +1100, NeilBrown wrote:
>> 
>> Fair enough.  I don't usually think of writeback as due to memory
>> pressure, but I can see that I filesystem wouldn't much care for the
>> distinction.
>> 
>> Still there is no overlap imposed by the VM.  The VM is sending a
>> sensible streak of writepages requests - as dd dirties more pages,
>> the
>> VM asks NFS to write out more pages.  All in a nice orderly fashion.
>
> The question essentially boils down to: "What is the purpose of the
> background writeback?" My answer would be "To free up pages", something
> that does not happen until COMMIT is sent.
>
> IOW: My concern is livelock, not serialisation. Right now, we do start
> commit when a batch sent by the VM has been completely written (same as
> we do when an O_DIRECT batch has been completely written). That ensures
> that the pages can be freed. Your patch is deliberately designed to
> defer that process that allows pages to be freed, and I'm not seeing
> how it is bounded.

In the current code the COMMIT for any given writepages() call only
completes after:
  - all the WRITEs generated by that writeapages call completes, and
    then
  - a subsequent COMMIT completes.

After my patch, the bound is a little more generous, but still a strong
bound.  The COMMIT will be complete after:
  - Any currently running COMMIT completes (there is usually only one),
    then
  - any writes that were submitted through writepages() before that
    COMMIT completed, completed, then
  - a subsequent COMMIT completes.

So we wait extra time to allow for a COMMIT and some more WRITEs to
complete.  Once the pending COMMIT completes, no more work can be added
to the set of things that need to be waited for, so it cannot live-lock.
  

>> > 
>> > commit_info.rpcs_out is there in order to count the number of
>> > outstanding COMMITs. I'm not seeing why we need to change that
>> > definition, particularly given that there can be processes out
>> > there
>> > that are trying to wait for the count to go to zero.
>> > 
>> > The I/O completion structure already has its own reference counter,
>> > and
>> > I can see nothing stopping you from modifying
>> > nfs_io_completion_commit() to do the same things you are trying to
>> > do
>> > in nfs_commit_end() here.
>> 
>> I don't see how that would work.
>> There are two events that need to happen asynchronously when a
>> counter
>> hits zero.
>> One is that a COMMIT needs to be sent when the number of outstanding
>> writes from a particular set hits zero.  That is the
>> nfs_io_completion
>> counter.
>> The other is that the currently growing set of pending commits needs
>> to
>> be detached and allowed to drain so that another COMMIT will get
>> sent.
>> This is what I'm using rpcs_out for.  It seems quite a natural
>> extension
>> of its current use.
>> 
>
> You're not supposed to call kref_get() on something with a refcount of
> zero. That means the nfs_io_completion counter defines the lifetime of
> the object that your commit_info.ioc points to, and once that refcount
> goes to zero, you should be removing it, which you are doing in
> nfs_writepages().

Certainly kref_get() cannot be called on something with a refcount of
zero - and if you have something that might have just gone to zero you
can use kref_get_unless_zero() to ensure you don't do the wrong thing.
I use kref_get_unless_zero() in nfs_writepages, but there is nowhere
that might take a ref on a kref with refcount of zero.

The io completion stored in commit_info.ioc has an extra refcount
to reflect the reference from commit_info.ioc.

The reference is removed and the reference count dropped using atomic
ops (no locking) so nfs_writepages() might get the pointer and not
be able to increment that refcount because it is already zero.  RCU
makes it safe to look at the refcount, and kref_get_unless_zero() makes
it safe to try an increment it.

>
> commit_info.rpcs_out is non-zero for the lifetime of any COMMIT
> operations. I'm not seeing how that relates to anything in the
> preceding paragraph (other than that nfs_io_completion going to zero
> will usually cause commit_info.rpcs_out to get incremented).
> IOW: commit_info.rpcs_out goes to zero a long time after the object
> pointed to by commit_info.ioc ought to be freed.

Yes, I agree with this.  I confirm that rpcs_out will only go to zero
well after commit_info.ioc is freed.  The purpose of my change to
rpcs_out is to cause it to be non-zero *earlier*.

As soon as there exists a detached io completion (which will eventually
trigger a COMMIT), rpcs_out becomes non-zero.  The count held by the
io completion is transferred to the pending COMMIT and dropped when the
COMMIT completes.

Making it non-zero earlier allows nfs_writepages to detect if there is
already a pending COMMIT *even if it hasn't been sent yet*.

If there is no pending COMMIT (or detached io completion), then it must
create a detached io completion so that a commit will follow.
If there *is* a pending COMMIT or detached io completion, then it can
leave its io completion attached because it can be sure that a COMMIT
will complete in bounded time, and so the io completion that it has
attached will be detached and processed.

Thanks,
NeilBrown

Patch
diff mbox series

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 11bf15800ac9..c00b54491949 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -2110,6 +2110,7 @@  static void init_once(void *foo)
 	INIT_LIST_HEAD(&nfsi->commit_info.list);
 	atomic_long_set(&nfsi->nrequests, 0);
 	atomic_long_set(&nfsi->commit_info.ncommit, 0);
+	nfsi->commit_info.ioc = NULL;
 	atomic_set(&nfsi->commit_info.rpcs_out, 0);
 	init_rwsem(&nfsi->rmdir_sem);
 	mutex_init(&nfsi->commit_mutex);
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index c478b772cc49..57e209f964e4 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -47,6 +47,7 @@  struct nfs_io_completion {
 	void (*complete)(void *data);
 	void *data;
 	struct kref refcount;
+	struct rcu_head rcu;
 };
 
 /*
@@ -134,7 +135,7 @@  static void nfs_io_completion_release(struct kref *kref)
 	struct nfs_io_completion *ioc = container_of(kref,
 			struct nfs_io_completion, refcount);
 	ioc->complete(ioc->data);
-	kfree(ioc);
+	kfree_rcu(ioc, rcu);
 }
 
 static void nfs_io_completion_get(struct nfs_io_completion *ioc)
@@ -720,6 +721,8 @@  static void nfs_io_completion_commit(void *inode)
 	nfs_commit_inode(inode, 0);
 }
 
+static void nfs_commit_end(struct nfs_mds_commit_info *cinfo);
+
 int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
 	struct inode *inode = mapping->host;
@@ -729,9 +732,37 @@  int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
 
 	nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
 
-	ioc = nfs_io_completion_alloc(GFP_KERNEL);
-	if (ioc)
-		nfs_io_completion_init(ioc, nfs_io_completion_commit, inode);
+	rcu_read_lock();
+	ioc = rcu_dereference(NFS_I(inode)->commit_info.ioc);
+	if (ioc && kref_get_unless_zero(&ioc->refcount)) {
+		rcu_read_unlock();
+		/* We've successfully piggybacked on the attached ioc */
+	} else {
+		rcu_read_unlock();
+		ioc = nfs_io_completion_alloc(GFP_KERNEL);
+		if (ioc) {
+			struct nfs_io_completion *ioc_prev;
+
+			nfs_io_completion_init(ioc, nfs_io_completion_commit,
+					       inode);
+			/* Temporarily elevate rpcs_out to ensure a commit
+			 * completion *will* happen after we attach this ioc,
+			 * so it *will* get a chance to drain.
+			 */
+			atomic_inc(&NFS_I(inode)->commit_info.rpcs_out);
+			nfs_io_completion_get(ioc);
+			ioc_prev = xchg(&NFS_I(inode)->commit_info.ioc,
+				    ioc);
+			/* ioc_prev is normally NULL, but racing writepages
+			 * calls might result in it being non-NULL.
+			 * In either case it is safe to put it - worst case
+			 * we get an extra COMMIT.
+			 */
+			nfs_io_completion_put(ioc_prev);
+			/* release temporary ref on rpcs_out */
+			nfs_commit_end(&NFS_I(inode)->commit_info);
+		}
+	}
 
 	nfs_pageio_init_write(&pgio, inode, wb_priority(wbc), false,
 				&nfs_async_write_completion_ops);
@@ -1677,8 +1708,17 @@  static void nfs_commit_begin(struct nfs_mds_commit_info *cinfo)
 
 static void nfs_commit_end(struct nfs_mds_commit_info *cinfo)
 {
-	if (atomic_dec_and_test(&cinfo->rpcs_out))
+	if (atomic_dec_and_test(&cinfo->rpcs_out)) {
+		/* When a commit finishes, we must release any attached
+		 * nfs_io_completion so that it can drain and then request
+		 * another commit.
+		 */
+		struct nfs_io_completion *ioc;
+		ioc = xchg(&cinfo->ioc, NULL);
+		nfs_io_completion_put(ioc);
+
 		wake_up_var(&cinfo->rpcs_out);
+	}
 }
 
 void nfs_commitdata_release(struct nfs_commit_data *data)
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 94c77ed55ce1..89db1e9d461d 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1557,9 +1557,16 @@  struct nfs_pgio_header {
 };
 
 struct nfs_mds_commit_info {
+	/* rpcs_out counts pending COMMIT rpcs plus pendng nfs_io_completions
+	 * which are *not* attached at 'ioc' below.  Such nfs_io_compleions
+	 * (normally at most one) will drain as writes complete and then trigger
+	 * a COMMIT, so they can be considered as pending COMMITs which haven't
+	 * been sent yet
+	 */
 	atomic_t rpcs_out;
 	atomic_long_t		ncommit;
 	struct list_head	list;
+	struct nfs_io_completion *ioc;
 };
 
 struct nfs_commit_info;