Message ID | 87blo4w57m.fsf@notabene.neil.brown.name (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE | expand |
> On Apr 6, 2020, at 7:43 PM, NeilBrown <neilb@suse.de> wrote: > > > PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the > loop block driver, where a daemon needs to write to one bdi (the final > bdi) in order to free up writes queued to another bdi (the client bdi). > > The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty > pages, so that it can still dirty pages after other processses have been > throttled. > > This approach was designed when all threads were blocked equally, > independently on which device they were writing to, or how fast it was. > Since that time the writeback algorithm has changed substantially with > different threads getting different allowances based on non-trivial > heuristics. This means the simple "add 25%" heuristic is no longer > reliable. > > The important issue is not that the daemon needs a *larger* dirty page > allowance, but that it needs a *private* dirty page allowance, so that > dirty pages for the "client" bdi that it is helping to clear (the bdi for > an NFS filesystem or loop block device etc) do not affect the throttling > of the deamon writing to the "final" bdi. > > This patch changes the heuristic to ignore the global limits and > consider only the limit relevant to the bdi being written to. This > approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and > should not introduce surprises. This has the desired result of > protecting the task from the consequences of large amounts of dirty data > queued for other devices. > > This approach of "only consider the target bdi" is consistent with the > other use of PF_LESS_THROTTLE in current_may_throttle(), were it causes > attention to be focussed only on the target bdi. > > So this patch > - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE, > - removes the 25% bonus that that flag gives, and > - imposes 'strictlimit' handling for any process with PF_LOCAL_THROTTLE > set. > > Note that previously realtime threads were treated the same as > PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for > real-time threads, so it is now different from the behaviour of nfsd and > loop tasks. I don't know what is wanted for realtime. > > Note that the worst-case situation with this patch is that the threshold > might be calculated as zero. In that case the daemon may block when > there are any dirty pages for the final bdi. These will eventually > clear and the daemon will be able to proceed. The writing of those > dirty pages will increase the apparent throughput of the final bdi and > thus increase its threshold for future calculations. > > Reviewed-by: Jan Kara <jack@suse.cz> > Signed-off-by: NeilBrown <neilb@suse.de> > --- > drivers/block/loop.c | 2 +- > fs/nfsd/vfs.c | 9 +++++---- > include/linux/sched.h | 3 ++- > kernel/sys.c | 2 +- > mm/page-writeback.c | 10 ++++++---- > mm/vmscan.c | 4 ++-- > 6 files changed, 17 insertions(+), 13 deletions(-) > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c > index a42c49e04954..0e13b9fc8dfa 100644 > --- a/drivers/block/loop.c > +++ b/drivers/block/loop.c > @@ -899,7 +899,7 @@ static void loop_unprepare_queue(struct loop_device *lo) > > static int loop_kthread_worker_fn(void *worker_ptr) > { > - current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO; > + current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO; > return kthread_worker_fn(worker_ptr); > } > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c > index 0aa02eb18bd3..c3fbab1753ec 100644 > --- a/fs/nfsd/vfs.c > +++ b/fs/nfsd/vfs.c > @@ -979,12 +979,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, Assuming these patches are not going through the NFSD tree, so this hunk is Acked-by: Chuck Lever <chuck.lever@oracle.com> If this isn't necessary or appropriate, then ignore me :-) > if (test_bit(RQ_LOCAL, &rqstp->rq_flags)) > /* > - * We want less throttling in balance_dirty_pages() > - * and shrink_inactive_list() so that nfs to > + * We want throttling in balance_dirty_pages() > + * and shrink_inactive_list() to only consider > + * the backingdev we are writing to, so that nfs to > * localhost doesn't cause nfsd to lock up due to all > * the client's dirty pages or its congested queue. > */ > - current->flags |= PF_LESS_THROTTLE; > + current->flags |= PF_LOCAL_THROTTLE; > > exp = fhp->fh_export; > use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp); > @@ -1037,7 +1038,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, > nfserr = nfserrno(host_err); > } > if (test_bit(RQ_LOCAL, &rqstp->rq_flags)) > - current_restore_flags(pflags, PF_LESS_THROTTLE); > + current_restore_flags(pflags, PF_LOCAL_THROTTLE); > return nfserr; > } > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 4418f5cb8324..5955a089df32 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1481,7 +1481,8 @@ extern struct pid *cad_pid; > #define PF_KSWAPD 0x00020000 /* I am kswapd */ > #define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */ > #define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */ > -#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ > +#define PF_LOCAL_THROTTLE 0x00100000 /* Throttle writes only agasint the bdi I write to, > + * I am cleaning dirty pages from some other bdi. */ > #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ > #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ > #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ > diff --git a/kernel/sys.c b/kernel/sys.c > index d325f3ab624a..180a2fa33f7f 100644 > --- a/kernel/sys.c > +++ b/kernel/sys.c > @@ -2262,7 +2262,7 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which, > return -EINVAL; > } > > -#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LESS_THROTTLE) > +#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE) > > SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, > unsigned long, arg4, unsigned long, arg5) > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 7326b54ab728..4c9875971de5 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -387,8 +387,7 @@ static unsigned long global_dirtyable_memory(void) > * Calculate @dtc->thresh and ->bg_thresh considering > * vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}. The caller > * must ensure that @dtc->avail is set before calling this function. The > - * dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and > - * real-time tasks. > + * dirty limits will be lifted by 1/4 for real-time tasks. > */ > static void domain_dirty_limits(struct dirty_throttle_control *dtc) > { > @@ -436,7 +435,7 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc) > if (bg_thresh >= thresh) > bg_thresh = thresh / 2; > tsk = current; > - if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) { > + if (rt_task(tsk)) { > bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32; > thresh += thresh / 4 + global_wb_domain.dirty_limit / 32; > } > @@ -486,7 +485,7 @@ static unsigned long node_dirty_limit(struct pglist_data *pgdat) > else > dirty = vm_dirty_ratio * node_memory / 100; > > - if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) > + if (rt_task(tsk)) > dirty += dirty / 4; > > return dirty; > @@ -1580,6 +1579,9 @@ static void balance_dirty_pages(struct bdi_writeback *wb, > bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT; > unsigned long start_time = jiffies; > > + if (current->flags & PF_LOCAL_THROTTLE) > + /* This task must only be throttled by its own writeback */ > + strictlimit = true; > for (;;) { > unsigned long now = jiffies; > unsigned long dirty, thresh, bg_thresh; > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 2e8e690d2813..b776da4bb8c8 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1879,13 +1879,13 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, > > /* > * If a kernel thread (such as nfsd for loop-back mounts) services > - * a backing device by writing to the page cache it sets PF_LESS_THROTTLE. > + * a backing device by writing to the page cache it sets PF_LOCAL_THROTTLE. > * In that case we should only throttle if the backing device it is > * writing to is congested. In other cases it is safe to throttle. > */ > static int current_may_throttle(void) > { > - return !(current->flags & PF_LESS_THROTTLE) || > + return !(current->flags & PF_LOCAL_THROTTLE) || > current->backing_dev_info == NULL || > bdi_write_congested(current->backing_dev_info); > } > -- > 2.26.0 > -- Chuck Lever chucklever@gmail.com
diff --git a/drivers/block/loop.c b/drivers/block/loop.c index a42c49e04954..0e13b9fc8dfa 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -899,7 +899,7 @@ static void loop_unprepare_queue(struct loop_device *lo) static int loop_kthread_worker_fn(void *worker_ptr) { - current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO; + current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO; return kthread_worker_fn(worker_ptr); } diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 0aa02eb18bd3..c3fbab1753ec 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -979,12 +979,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, if (test_bit(RQ_LOCAL, &rqstp->rq_flags)) /* - * We want less throttling in balance_dirty_pages() - * and shrink_inactive_list() so that nfs to + * We want throttling in balance_dirty_pages() + * and shrink_inactive_list() to only consider + * the backingdev we are writing to, so that nfs to * localhost doesn't cause nfsd to lock up due to all * the client's dirty pages or its congested queue. */ - current->flags |= PF_LESS_THROTTLE; + current->flags |= PF_LOCAL_THROTTLE; exp = fhp->fh_export; use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp); @@ -1037,7 +1038,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, nfserr = nfserrno(host_err); } if (test_bit(RQ_LOCAL, &rqstp->rq_flags)) - current_restore_flags(pflags, PF_LESS_THROTTLE); + current_restore_flags(pflags, PF_LOCAL_THROTTLE); return nfserr; } diff --git a/include/linux/sched.h b/include/linux/sched.h index 4418f5cb8324..5955a089df32 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1481,7 +1481,8 @@ extern struct pid *cad_pid; #define PF_KSWAPD 0x00020000 /* I am kswapd */ #define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */ #define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */ -#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ +#define PF_LOCAL_THROTTLE 0x00100000 /* Throttle writes only agasint the bdi I write to, + * I am cleaning dirty pages from some other bdi. */ #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ diff --git a/kernel/sys.c b/kernel/sys.c index d325f3ab624a..180a2fa33f7f 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2262,7 +2262,7 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which, return -EINVAL; } -#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LESS_THROTTLE) +#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE) SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, unsigned long, arg4, unsigned long, arg5) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 7326b54ab728..4c9875971de5 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -387,8 +387,7 @@ static unsigned long global_dirtyable_memory(void) * Calculate @dtc->thresh and ->bg_thresh considering * vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}. The caller * must ensure that @dtc->avail is set before calling this function. The - * dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and - * real-time tasks. + * dirty limits will be lifted by 1/4 for real-time tasks. */ static void domain_dirty_limits(struct dirty_throttle_control *dtc) { @@ -436,7 +435,7 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc) if (bg_thresh >= thresh) bg_thresh = thresh / 2; tsk = current; - if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) { + if (rt_task(tsk)) { bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32; thresh += thresh / 4 + global_wb_domain.dirty_limit / 32; } @@ -486,7 +485,7 @@ static unsigned long node_dirty_limit(struct pglist_data *pgdat) else dirty = vm_dirty_ratio * node_memory / 100; - if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) + if (rt_task(tsk)) dirty += dirty / 4; return dirty; @@ -1580,6 +1579,9 @@ static void balance_dirty_pages(struct bdi_writeback *wb, bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT; unsigned long start_time = jiffies; + if (current->flags & PF_LOCAL_THROTTLE) + /* This task must only be throttled by its own writeback */ + strictlimit = true; for (;;) { unsigned long now = jiffies; unsigned long dirty, thresh, bg_thresh; diff --git a/mm/vmscan.c b/mm/vmscan.c index 2e8e690d2813..b776da4bb8c8 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1879,13 +1879,13 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, /* * If a kernel thread (such as nfsd for loop-back mounts) services - * a backing device by writing to the page cache it sets PF_LESS_THROTTLE. + * a backing device by writing to the page cache it sets PF_LOCAL_THROTTLE. * In that case we should only throttle if the backing device it is * writing to is congested. In other cases it is safe to throttle. */ static int current_may_throttle(void) { - return !(current->flags & PF_LESS_THROTTLE) || + return !(current->flags & PF_LOCAL_THROTTLE) || current->backing_dev_info == NULL || bdi_write_congested(current->backing_dev_info); }