[PATCH/RFC] MM: fix writeback for NFS
diff mbox series

Message ID 87tv2b7q72.fsf@notabene.neil.brown.name
State New
Headers show
  • [PATCH/RFC] MM: fix writeback for NFS
Related show

Commit Message

NeilBrown March 26, 2020, 3:25 a.m. UTC
Page account happens a little differently for NFS, and writeback is
aware of this, but handles it wrongly.
With NFS, pages go throught three states which are accounted separately:
 NR_FILE_DIRTY - page is dirty and unwritten
 NR_WRITEBACK  - page is being written back
 NR_UNSTABLE_NFS - page has been written, but might need to be written
		again if server restarts.  Once an NFS COMMIT
		completes, page becomes freeable.

writeback combines both the NF_FILE_DIRTY counters and NF_UNSTABLE_NFS
together, which doesn't make a lot of sense.  NR_UNSTABLE_NFS is more
like NR_WRITEBACK in that there is nothing that can be done for these
pages except wait.

Current behaviour is that when there are a lot of NR_UNSTABLE_NFS pages,
balance_dirty_pages() repeatedly schedules the writeback thread, even
when the number of dirty pages is below the threshold.  This result is
each ->write_pages() call into NFS only finding relatively few DIRTY
pages, so it creates requests that are well below the maximum size.
Depending on performance of characteristics of the server, this can
substantially reduce throughput.

There are two places where balance_dirty_pages() schedules the writeback
thread, and each of them schedule writeback too often.

1/ When balance_dirty_pages() determines that it must wait for the
   number of dirty pages to shrink, it unconditionally schedules
   This is probably not ideal for any filesystem but I measured it to be
   harmful for NFS.  This writeback should only be scheduled if the
   number of dirty pages is above the threshold.  While it is below,
   waiting for the pending writes to complete (and be committed) is a more
   appropriate behaviour.

2/ When balance_dirty_pages() finishes, it again shedules writeback
   if nr_reclaimable is above the background threshold.  This is
   conceptually correct, but wrong because nr_reclaimable is wrong.
   The impliciation of this usage is that nr_reclaimable is the number
   of pages that can be reclaimed if writeback is triggered.  In fact it
   is NR_FILE_DIRTY plus NR_UNSTABLE_NFS, which is not a meaningful

So this patch removes NR_UNSTABLE_NFS from nr_reclaimable, and only
schedules writeback when the new nr_reclaimable is greater than
->thresh or ->bg_thresh, depending on context.

The effect of this can be measured by looking at the count of
nfs_writepages calls while performing a large streaming write (larger
than dirty_threshold)

  mount -t nfs server:/path /mnt
  dd if=/dev/zero of=/mnt/zeros bs=1M count=1000
  awk '/events:/ {print $13}' /proc/self/mountstats

Each writepages call should normally submit writes for 4M (1024 pages)
or more (MIN_WRITEBACK_PAGES) (unless memory size is tiny).
With the patch I measure typically 200-300 writepages calls with the
above, which suggests an average of about 850 pages being made available to
each writepages call.  This is less than MIN_WRITEBACK_PAGES, but enough
to assemble large 1MB WRITE requests fairly often.

Without the patch, numbers range from about 2000 to over 10000, meaning
an average of maybe 100 pages per call, which means we rarely get large
1M requests, and often get small requests

Note that there are other places where NR_FILE_DIRTY and NR_UNSTABLE_NFS
are combined (wb_over_bg_thresh() and get_nr_dirty_pages()).  I suspect
they are wrong too.  I also suspect that unstable-NFS pages should not be
included in the WB_RECLAIMABLE wb_stat, but I haven't explored that in
detail yet.

Note that the current behaviour of NFS w.r.t sending COMMIT requests is
that as soon as a batch of writes all complete, a COMMIT is scheduled.
Prior to commit 919e3bd9a875 ("NFS: Ensure we commit after writeback is
complete"), other mechanisms triggered the COMMIT.  It is possible that
the current writeback code made more sense before that commit.

Signed-off-by: NeilBrown <neilb@suse.de>
 mm/page-writeback.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)


NeilBrown April 1, 2020, 11:52 p.m. UTC | #1
Please ignore my previous patch (which this is in reply to), it was
flawed in various ways.

I now understand the code a bit better and have a somewhat simpler patch
which appears to address the same problem.

The problem is that writeback to NFS often produces lots of small writes
(10s of K) rather than fewer large writes (1M).  This pattern can often
hurt throughput, but in certain circumstances it can hurt NFS throughput
more than expected.

Each nfs_writepages() call results in an NFS commit being sent to the
server.  If writeback triggers lots of smaller nfs_writepages calls,
this means lots of COMMITs.  If the server is slow to handle the COMMIT
(I've seen the Ganesha NFS server take over 200ms per commit), these
COMMITs can overlap, queue up, and choke the NFS server and cause
order-of-magnitude drop in throughput.

So we really want to only call nfs_writepages when there are a largish
number of pages to be written - i.e. that are 'dirty'.

For historical reasons that I didn't thoroughly research but I'm
confident are no longer relevant, pages that have been written to the
NFS server but have not yet been the subject of a COMMIT - so-called
"unstable" pages - are effectively accounted that same as "dirty" pages
(sometimes called "reclaimable").
This can result in writeback thinking there are lots of "dirty" pages to
reclaim, while nfs_writepages can only find a few that it can write out.

The second patch following changes the accounting for these "unstable"
pages.  They are now always accounted exactly the same was writeback
pages.  Conceptually they can be thought of as still in writeback, but the
writeback is now happening on the server.

A COMMIT will always automatically follow the writes generated by
nfs_writepages, so from the perspective of the VM, there really is no
difference: It has scheduled the write and there is nothing else it can
do except wait.

Testing this patch showed that loop-back NFS is prone to deadlocks
again.  I cannot see exactly how the change to 'unstable' accounting
affected this, but I can see that the old +25% heuristic can no longer
be justified given the complexity of writeback calculations.
So the first patch following changes how writeback is handled for NFS
servers handling loop-back requests (and other similar services) so that
it is more obviously safe against excessive dirty pages scheduled for
other devices.

NeilBrown April 2, 2020, 4:57 a.m. UTC | #2
On Thu, Apr 02 2020, Hillf Danton wrote:

> On Thu, 02 Apr 2020 10:53:20 +1100 NeilBrown wrote:
>> PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the
>> loop block driver, where a daemon needs to write to one bdi in
>> order to free up writes queued to another bdi.
>> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
>> pages, so that it can still dirty pages after other processses have been
>> throttled.
>> This approach was designed when all threads were blocked equally,
>> independently on which device they were writing to, or how fast it was.
>> Since that time the writeback algorithm has changed substantially with
>> different threads getting different allowances based on non-trivial
>> heuristics.  This means the simple "add 25%" heuristic is no longer
>> reliable.
>> This patch changes the heuristic to ignore the global limits and
>> consider only the limit relevant to the bdi being written to.  This
>> approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and
>> should not introduce surprises.  This has the desired result of
>> protecting the task from the consequences of large amounts of dirty data
>> queued for other devices.
>> This approach of "only consider the target bdi" is consistent with the
>> other use of PF_LESS_THROTTLE in current_may_throttle(), were it causes
>> attention to be focussed only on the target bdi.
>> So this patch
>>  - remove the 25% bonus that that flag gives, and
>>  - imposes 'strictlimit' handling for any process with PF_LOCAL_THROTTLE
>>    set.
> 	/*
> 	 * The strictlimit feature is a tool preventing mistrusted filesystems
> 	 * from growing a large number of dirty pages before throttling. For
> Based on the comment snippet, I suspect it is applicable to IO flushers
> unless they are likely generating tons of dirty pages. If they are,
> however, cutting their bonuses seem questionable.

The purpose of the strictlimit feature was to isolate one filesystem
(bdi) from all others, so that the one cannot create dirty pages which
unfairly disadvantage the others - this is what that comment says.
But the implementation appears to focus on the isolation, not the
specific purpose, and isolation works both ways.  It protects the others
from the one, and the one from the others.

fuse needs to be isolated so it doesn't harm others.
nfsd and loop need to be isolate so they aren't harmed by others.
I'm less familiar with IO flushers but I suspect that have exactly the
same need as nfsd and loop - they need to be isolated from dirty pages
other than on the device they are writing to.
The 25% bonus was never about giving them a bonus because they need it.
It was about protecting them from excess usage elsewhere.  I strongly
suspect that my change will provide a conceptually better service for IO
flushers. (whether it is better in a practical measurable sense I cannot
say, but I'd be surprised if it was worse).

One possible problem with strictlimit isolation is suggested by the

	 * In strictlimit case make decision based on the wb counters
	 * and limits. Small writeouts when the wb limits are ramping
	 * up are the price we consciously pay for strictlimit-ing.

This suggests that starting transients may be worse in some cases.
I haven't noticed any problems in my (limited) testing so while I
suspect there is a genuine difference here, I don't expect it to be problematic.

>> Note that previously realtime threads were treated the same as
>> PF_LESS_THROTTLE threads.  This patch does *not* change the behvaiour for
>> real-time threads, so it is now different from the behaviour of nfsd and
>> loop tasks.  I don't know what is wanted for realtime.
>> Signed-off-by: NeilBrown <neilb@suse.de>
>> =2D--
> Hrm corrupted delivery?

No, that's just normal email quotation for a multi-part message (needed
for crypto-signing which I do by default)
'git am', for example, is quite capable of coping with it.


>>  drivers/block/loop.c  |  2 +-
>>  fs/nfsd/vfs.c         |  9 +++++----
>>  include/linux/sched.h |  2 +-
>>  kernel/sys.c          |  2 +-
>>  mm/page-writeback.c   | 10 ++++++----
>>  mm/vmscan.c           |  4 ++--
>>  6 files changed, 16 insertions(+), 13 deletions(-)
NeilBrown April 6, 2020, 11:42 p.m. UTC | #3
This is a second version of my patches to improve writeback over NFS.
This was prompted by a customer report of very slow throughput to a
server running Ganesha NFS server which was taking about 200ms to handle
COMMIT requests.  While slow COMMIT requests do exacerbate the
problem, there are still problems when COMMIT is comparatively fast.
Latest test results from customer are that these changes provide
substantial improvements.

Changes over first version include improvements to the explanation in
the commit message, and preservation of "NFS Unstable" counts in various
statistics files, despite that fact that the internal accounting is
completely gone (merged with writeback accounting).

NeilBrown April 16, 2020, 12:29 a.m. UTC | #4
This is version 3 (I think) of my patches to improve NFS writeaback.

 - the code for adding legacy values to /proc/vmstat was broken.
   I haven't taken the approach that Michal and Jan discussed but
   a simpler (I hope) approach that just seq_puts() the lines at an
   appropriate place.

 - I've modified the handling of PF_LOCAL_THROTTLE - sufficiently that
   I dropped Jan's reviewed-by.
   Rather than invoking the same behaviour as BDI_CAP_STRICTLIMIT,
   I now just use the part I needed.
   So if the global threshold is not exceeded, PF_LOCAL_THROTTLE tasks
   are not throttled.  This is the case for normal processes, but not
   when BDI_CAP_STRICTLIMIT is in effect.
   If the global threshold *is* exceeded, only then to we check the
   local per-bdi threshold.  If that is not exceeded then
   PF_LOCAL_THROTTLE again avoid any throttling.  Only if both the
   thresholds are exceeded are these tasks throttled.
   I think this is more refletive if what we actually need.


diff mbox series

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2caf780a42e7..99419cba1e7a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1593,10 +1593,11 @@  static void balance_dirty_pages(struct bdi_writeback *wb,
 		 * written to the server's write cache, but has not yet
 		 * been flushed to permanent storage.
-		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) +
-					global_node_page_state(NR_UNSTABLE_NFS);
+		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY);
 		gdtc->avail = global_dirtyable_memory();
-		gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK);
+		gdtc->dirty = nr_reclaimable +
+			global_node_page_state(NR_WRITEBACK) +
+			global_node_page_state(NR_UNSTABLE_NFS);
@@ -1664,7 +1665,8 @@  static void balance_dirty_pages(struct bdi_writeback *wb,
-		if (unlikely(!writeback_in_progress(wb)))
+		if (nr_reclaimable > gdtc->thresh &&
+		    unlikely(!writeback_in_progress(wb)))