[1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE

Message ID 87blo4w57m.fsf@notabene.neil.brown.name (mailing list archive)
State New, archived
Series [1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE

Commit Message

NeilBrown April 6, 2020, 11:43 p.m. UTC
PF_LESS_THROTTLE exists for loop-back nfsd, and for a similar need in
the loop block driver, where a daemon needs to write to one bdi (the
final bdi) in order to free up writes queued to another bdi (the client
bdi).

The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
pages, so that it can still dirty pages after other processes have been
throttled.

This approach was designed when all threads were blocked equally,
independently of which device they were writing to, or how fast it was.
Since that time the writeback algorithm has changed substantially, with
different threads getting different allowances based on non-trivial
heuristics.  This means the simple "add 25%" heuristic is no longer
reliable.
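
For reference, the bonus being removed lives in domain_dirty_limits()
(see the mm/page-writeback.c hunk below), and is in fact a little more
than 25%, since a slice of the global dirty limit is added on top:

  if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
          bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32;
          thresh += thresh / 4 + global_wb_domain.dirty_limit / 32;
  }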

The important issue is not that the daemon needs a *larger* dirty page
allowance, but that it needs a *private* dirty page allowance, so that
dirty pages for the "client" bdi that it is helping to clear (the bdi for
an NFS filesystem or loop block device etc) do not affect the throttling
of the daemon writing to the "final" bdi.

This patch changes the heuristic to ignore the global limits and
consider only the limit relevant to the bdi being written to.  This
approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and
should not introduce surprises.  This has the desired result of
protecting the task from the consequences of large amounts of dirty data
queued for other devices.
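
Concretely, balance_dirty_pages() already derives a 'strictlimit' flag
from the bdi capabilities; the change (full hunk below) simply folds
PF_LOCAL_THROTTLE into that existing path:

  bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT;

  if (current->flags & PF_LOCAL_THROTTLE)
          /* This task must only be throttled by its own writeback */
          strictlimit = true;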

This approach of "only consider the target bdi" is consistent with the
other use of PF_LESS_THROTTLE in current_may_throttle(), where it causes
attention to be focussed only on the target bdi.

So this patch
 - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
 - removes the 25% bonus that the flag gives, and
 - imposes 'strictlimit' handling for any process with PF_LOCAL_THROTTLE
   set.
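
Note that a userspace flusher daemon can also acquire this flag via
prctl(PR_SET_IO_FLUSHER) (see the kernel/sys.c hunk below).  A minimal
sketch of such a daemon's setup, assuming it has CAP_SYS_RESOURCE:

  #include <stdio.h>
  #include <sys/prctl.h>

  #ifndef PR_SET_IO_FLUSHER
  #define PR_SET_IO_FLUSHER 57    /* for headers older than v5.6 */
  #endif

  int main(void)
  {
          /* With this patch, IO_FLUSHER now means
           * PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE. */
          if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) != 0)
                  perror("prctl(PR_SET_IO_FLUSHER)");

          /* ... write dirty data out to the "final" bdi ... */
          return 0;
  }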

Note that previously real-time threads were treated the same as
PF_LESS_THROTTLE threads.  This patch does *not* change the behaviour
of real-time threads, so their behaviour now differs from that of nfsd
and loop tasks.  I don't know what is wanted for real-time.

Note that the worst-case situation with this patch is that the threshold
might be calculated as zero.  In that case the daemon may block when
there are any dirty pages for the final bdi.  These will eventually
clear and the daemon will be able to proceed.  The writing of those
dirty pages will increase the apparent throughput of the final bdi and
thus increase its threshold for future calculations.
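
For in-kernel users, the pattern is the one nfsd already follows (see
the fs/nfsd/vfs.c hunk below): set the flag around the writes and
restore it afterwards.  A condensed sketch, where
do_writes_to_final_bdi() is a hypothetical stand-in for the actual
write-out:

  unsigned long pflags = current->flags;

  current->flags |= PF_LOCAL_THROTTLE;
  /* Dirty-page throttling now only considers the bdi we write to. */
  do_writes_to_final_bdi();
  current_restore_flags(pflags, PF_LOCAL_THROTTLE);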

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/block/loop.c  |  2 +-
 fs/nfsd/vfs.c         |  9 +++++----
 include/linux/sched.h |  3 ++-
 kernel/sys.c          |  2 +-
 mm/page-writeback.c   | 10 ++++++----
 mm/vmscan.c           |  4 ++--
 6 files changed, 17 insertions(+), 13 deletions(-)

Comments

Chuck Lever April 7, 2020, 4:10 p.m. UTC | #1
> On Apr 6, 2020, at 7:43 PM, NeilBrown <neilb@suse.de> wrote:
> 
> [ ... commit message and earlier hunks trimmed; the full patch appears below ... ]
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 0aa02eb18bd3..c3fbab1753ec 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -979,12 +979,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,

Assuming these patches are not going through the NFSD tree, this hunk is

Acked-by: Chuck Lever <chuck.lever@oracle.com>

If this isn't necessary or appropriate, then ignore me :-)


> [ ... remainder of quoted patch trimmed ... ]

--
Chuck Lever
chucklever@gmail.com

Patch

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index a42c49e04954..0e13b9fc8dfa 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -899,7 +899,7 @@  static void loop_unprepare_queue(struct loop_device *lo)
 
 static int loop_kthread_worker_fn(void *worker_ptr)
 {
-	current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO;
+	current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
 	return kthread_worker_fn(worker_ptr);
 }
 
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 0aa02eb18bd3..c3fbab1753ec 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -979,12 +979,13 @@  nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
 
 	if (test_bit(RQ_LOCAL, &rqstp->rq_flags))
 		/*
-		 * We want less throttling in balance_dirty_pages()
-		 * and shrink_inactive_list() so that nfs to
+		 * We want throttling in balance_dirty_pages()
+		 * and shrink_inactive_list() to only consider
+		 * the backing device we are writing to, so that nfs to
 		 * localhost doesn't cause nfsd to lock up due to all
 		 * the client's dirty pages or its congested queue.
 		 */
-		current->flags |= PF_LESS_THROTTLE;
+		current->flags |= PF_LOCAL_THROTTLE;
 
 	exp = fhp->fh_export;
 	use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp);
@@ -1037,7 +1038,7 @@  nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
 		nfserr = nfserrno(host_err);
 	}
 	if (test_bit(RQ_LOCAL, &rqstp->rq_flags))
-		current_restore_flags(pflags, PF_LESS_THROTTLE);
+		current_restore_flags(pflags, PF_LOCAL_THROTTLE);
 	return nfserr;
 }
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4418f5cb8324..5955a089df32 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1481,7 +1481,8 @@  extern struct pid *cad_pid;
 #define PF_KSWAPD		0x00020000	/* I am kswapd */
 #define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
 #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
-#define PF_LESS_THROTTLE	0x00100000	/* Throttle me less: I clean memory */
+#define PF_LOCAL_THROTTLE	0x00100000	/* Throttle writes only against the bdi I write to;
+						 * I am cleaning dirty pages from some other bdi. */
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
diff --git a/kernel/sys.c b/kernel/sys.c
index d325f3ab624a..180a2fa33f7f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2262,7 +2262,7 @@  int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
 	return -EINVAL;
 }
 
-#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LESS_THROTTLE)
+#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
 
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7326b54ab728..4c9875971de5 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -387,8 +387,7 @@  static unsigned long global_dirtyable_memory(void)
  * Calculate @dtc->thresh and ->bg_thresh considering
  * vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}.  The caller
  * must ensure that @dtc->avail is set before calling this function.  The
- * dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
- * real-time tasks.
+ * dirty limits will be lifted by 1/4 for real-time tasks.
  */
 static void domain_dirty_limits(struct dirty_throttle_control *dtc)
 {
@@ -436,7 +435,7 @@  static void domain_dirty_limits(struct dirty_throttle_control *dtc)
 	if (bg_thresh >= thresh)
 		bg_thresh = thresh / 2;
 	tsk = current;
-	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
+	if (rt_task(tsk)) {
 		bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32;
 		thresh += thresh / 4 + global_wb_domain.dirty_limit / 32;
 	}
@@ -486,7 +485,7 @@  static unsigned long node_dirty_limit(struct pglist_data *pgdat)
 	else
 		dirty = vm_dirty_ratio * node_memory / 100;
 
-	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk))
+	if (rt_task(tsk))
 		dirty += dirty / 4;
 
 	return dirty;
@@ -1580,6 +1579,9 @@  static void balance_dirty_pages(struct bdi_writeback *wb,
 	bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT;
 	unsigned long start_time = jiffies;
 
+	if (current->flags & PF_LOCAL_THROTTLE)
+		/* This task must only be throttled by its own writeback */
+		strictlimit = true;
 	for (;;) {
 		unsigned long now = jiffies;
 		unsigned long dirty, thresh, bg_thresh;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2e8e690d2813..b776da4bb8c8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1879,13 +1879,13 @@  static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 
 /*
  * If a kernel thread (such as nfsd for loop-back mounts) services
- * a backing device by writing to the page cache it sets PF_LESS_THROTTLE.
+ * a backing device by writing to the page cache it sets PF_LOCAL_THROTTLE.
  * In that case we should only throttle if the backing device it is
  * writing to is congested.  In other cases it is safe to throttle.
  */
 static int current_may_throttle(void)
 {
-	return !(current->flags & PF_LESS_THROTTLE) ||
+	return !(current->flags & PF_LOCAL_THROTTLE) ||
 		current->backing_dev_info == NULL ||
 		bdi_write_congested(current->backing_dev_info);
 }