Message ID: 170112272125.7109.6245462722883333440@noble.neil.brown.name (mailing list archive)
State:      New, archived
Series:     [PATCH/RFC] core/nfsd: allow kernel threads to use task_work.
On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:

> A simple way to fix this is to treat nfsd threads like normal processes
> for task_work.  Thus the pending files are queued for the thread, and
> the same thread finishes the work.
>
> Currently KTHREADs are assumed never to call task_work_run(). With this
> patch that is still the default, but it is implemented by storing the
> magic value TASK_WORKS_DISABLED in ->task_works. If a kthread, such as
> nfsd, will call task_work_run() periodically, it sets ->task_works
> to NULL to indicate this.

> 		svc_recv(rqstp);
> 		validate_process_creds();
> +		if (task_work_pending(current))
> +			task_work_run();

What locking environment and call chain do you have here?  And what
happens if you get something stuck in ->release()?

> 	p->pdeath_signal = 0;
> -	p->task_works = NULL;
> +	p->task_works = args->kthread ? TASK_WORKS_DISABLED : NULL;

Umm...  why not have them set (by helper in kernel/task_work.c) to
&work_exited?  Then the task_work_run parts wouldn't be needed at all...
On Tue, 28 Nov 2023, Al Viro wrote:
> On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
>
> > A simple way to fix this is to treat nfsd threads like normal processes
> > for task_work.  Thus the pending files are queued for the thread, and
> > the same thread finishes the work.
> >
> > Currently KTHREADs are assumed never to call task_work_run(). With this
> > patch that is still the default, but it is implemented by storing the
> > magic value TASK_WORKS_DISABLED in ->task_works. If a kthread, such as
> > nfsd, will call task_work_run() periodically, it sets ->task_works
> > to NULL to indicate this.
>
> > 		svc_recv(rqstp);
> > 		validate_process_creds();
> > +		if (task_work_pending(current))
> > +			task_work_run();
>
> What locking environment and call chain do you have here?  And what
> happens if you get something stuck in ->release()?

No locking. This is in the top-level function of the kthread.

A ->release function that waits for an NFS filesystem to flush out data
through a filesystem exported by this nfsd might hit problems. But that
really requires us to be nfs-exporting an nfs filesystem which is
loop-back mounted. While we do support nfs-reexport and nfs-loop-back
mounts, I don't think we make any pretence of supporting the
combination. Is that the sort of thing you were thinking of?

> > 	p->pdeath_signal = 0;
> > -	p->task_works = NULL;
> > +	p->task_works = args->kthread ? TASK_WORKS_DISABLED : NULL;
>
> Umm...  why not have them set (by helper in kernel/task_work.c) to
> &work_exited?  Then the task_work_run parts wouldn't be needed at all...

I hadn't tried to understand what work_exited was for, but now I see
that its purpose is precisely to block further work from being queued -
exactly what I need. Thanks - I'll make that change for v2.

I've realised that I'll also need to change the flush_delayed_fput() in
nfsd_file_close_inode_sync() to task_work_run().

Thanks,
NeilBrown
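The work_exited sentinel Al points at works because task_work_add() refuses to push onto a list whose head is that sentinel, while task_work_run() installs it when the task is exiting. A rough userspace model of that lock-free scheme (a sketch only: the names loosely mirror kernel/task_work.c, but this is not the kernel code, and all notification/signalling is omitted):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Userspace sketch: a lock-free LIFO of callbacks with a sentinel
 * node that permanently blocks further queueing. */
struct callback_head {
    struct callback_head *next;
    void (*func)(struct callback_head *);
};

static struct callback_head work_exited;  /* sentinel: queue closed */
static _Atomic(struct callback_head *) task_works = NULL;

static int task_work_add(struct callback_head *work)
{
    struct callback_head *head = atomic_load(&task_works);

    do {
        if (head == &work_exited)
            return -1;          /* the kernel returns -ESRCH here */
        work->next = head;
    } while (!atomic_compare_exchange_weak(&task_works, &head, work));
    return 0;
}

static void task_work_run(int exiting)
{
    /* On exit, swap in the sentinel so nothing more can be queued. */
    struct callback_head *work =
        atomic_exchange(&task_works, exiting ? &work_exited : NULL);

    while (work && work != &work_exited) {
        struct callback_head *next = work->next;
        work->func(work);
        work = next;
    }
}
```

Setting a kthread's ->task_works to &work_exited at creation, as Al suggests, then makes every later task_work_add() fail, so fput() naturally falls back to the delayed-fput path without any extra flag value.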
On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
>
> I have evidence from a customer site of 256 nfsd threads adding files to
> delayed_fput_lists nearly twice as fast as they are retired by a single
> work-queue thread running delayed_fput(). As you might imagine this
> does not end well (20 million files in the queue at the time a snapshot
> was taken for analysis).
>
> While this might point to a problem with the filesystem not handling the
> final close efficiently, such problems should only hurt throughput, not
> lead to memory exhaustion.

I have this patch queued for v6.8:

https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0

> For normal threads, the thread that closes the file also calls the
> final fput so there is natural rate limiting preventing excessive growth
> in the list of delayed fputs. For kernel threads, and particularly for
> nfsd, delays in the final fput do not impose any throttling to prevent
> the thread from closing more files.

I don't think we want to block nfsd threads waiting for files to
close. Won't that be a potential denial of service?

> A simple way to fix this is to treat nfsd threads like normal processes
> for task_work.  Thus the pending files are queued for the thread, and
> the same thread finishes the work.
>
> Currently KTHREADs are assumed never to call task_work_run(). With this
> patch that is still the default, but it is implemented by storing the
> magic value TASK_WORKS_DISABLED in ->task_works. If a kthread, such as
> nfsd, will call task_work_run() periodically, it sets ->task_works
> to NULL to indicate this.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>
> I wonder which tree this should go through assuming everyone likes it.
> VFS maybe??
>
> Thanks.
>
>  fs/file_table.c           | 2 +-
>  fs/nfsd/nfssvc.c          | 4 ++++
>  include/linux/sched.h     | 1 +
>  include/linux/task_work.h | 4 +++-
>  kernel/fork.c             | 2 +-
>  kernel/task_work.c        | 7 ++++---
>  6 files changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/fs/file_table.c b/fs/file_table.c
> index de4a2915bfd4..e79351df22be 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -445,7 +445,7 @@ void fput(struct file *file)
>  	if (atomic_long_dec_and_test(&file->f_count)) {
>  		struct task_struct *task = current;
>
> -		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
> +		if (likely(!in_interrupt())) {
>  			init_task_work(&file->f_rcuhead, ____fput);
>  			if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
>  				return;
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index 66ca50b38b27..c047961262ca 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -13,6 +13,7 @@
>  #include <linux/fs_struct.h>
>  #include <linux/swap.h>
>  #include <linux/siphash.h>
> +#include <linux/task_work.h>
>
>  #include <linux/sunrpc/stats.h>
>  #include <linux/sunrpc/svcsock.h>
> @@ -941,6 +942,7 @@ nfsd(void *vrqstp)
>  	}
>
>  	current->fs->umask = 0;
> +	current->task_works = NULL; /* Declare that I will call task_work_run() */
>
>  	atomic_inc(&nfsdstats.th_cnt);
>
> @@ -955,6 +957,8 @@
>
>  		svc_recv(rqstp);
>  		validate_process_creds();
> +		if (task_work_pending(current))
> +			task_work_run();
>  	}
>
>  	atomic_dec(&nfsdstats.th_cnt);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 292c31697248..c63c2bedbf71 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1117,6 +1117,7 @@ struct task_struct {
>  	unsigned int		sas_ss_flags;
>
>  	struct callback_head	*task_works;
> +#define TASK_WORKS_DISABLED ((void*)1)
>
>  #ifdef CONFIG_AUDIT
>  #ifdef CONFIG_AUDITSYSCALL
> diff --git a/include/linux/task_work.h b/include/linux/task_work.h
> index 795ef5a68429..3c74e3de81ed 100644
> --- a/include/linux/task_work.h
> +++ b/include/linux/task_work.h
> @@ -22,7 +22,9 @@ enum task_work_notify_mode {
>
>  static inline bool task_work_pending(struct task_struct *task)
>  {
> -	return READ_ONCE(task->task_works);
> +	struct callback_head *works = READ_ONCE(task->task_works);
> +
> +	return works && works != TASK_WORKS_DISABLED;
>  }
>
>  int task_work_add(struct task_struct *task, struct callback_head *twork,
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 10917c3e1f03..903b29804fe1 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2577,7 +2577,7 @@ __latent_entropy struct task_struct *copy_process(
>  	p->dirty_paused_when = 0;
>
>  	p->pdeath_signal = 0;
> -	p->task_works = NULL;
> +	p->task_works = args->kthread ? TASK_WORKS_DISABLED : NULL;
>  	clear_posix_cputimers_work(p);
>
>  #ifdef CONFIG_KRETPROBES
> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index 95a7e1b7f1da..ffdf4b0d7a0e 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -49,7 +49,8 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>
>  	head = READ_ONCE(task->task_works);
>  	do {
> -		if (unlikely(head == &work_exited))
> +		if (unlikely(head == &work_exited ||
> +			     head == TASK_WORKS_DISABLED))
>  			return -ESRCH;
>  		work->next = head;
>  	} while (!try_cmpxchg(&task->task_works, &head, work));
> @@ -157,7 +158,7 @@ void task_work_run(void)
>  		work = READ_ONCE(task->task_works);
>  		do {
>  			head = NULL;
> -			if (!work) {
> +			if (!work || work == TASK_WORKS_DISABLED) {
>  				if (task->flags & PF_EXITING)
>  					head = &work_exited;
>  				else
> @@ -165,7 +166,7 @@
>  			}
>  		} while (!try_cmpxchg(&task->task_works, &work, head));
>
> -		if (!work)
> +		if (!work || work == TASK_WORKS_DISABLED)
>  			break;
>  		/*
>  		 * Synchronize with task_work_cancel(). It can not remove
> --
> 2.42.1
>
On Tue, 28 Nov 2023, Chuck Lever wrote:
> On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
> >
> > I have evidence from a customer site of 256 nfsd threads adding files to
> > delayed_fput_lists nearly twice as fast as they are retired by a single
> > work-queue thread running delayed_fput(). As you might imagine this
> > does not end well (20 million files in the queue at the time a snapshot
> > was taken for analysis).
> >
> > While this might point to a problem with the filesystem not handling the
> > final close efficiently, such problems should only hurt throughput, not
> > lead to memory exhaustion.
>
> I have this patch queued for v6.8:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0

Thanks....
I think that change is good, but I don't think it addresses the problem
mentioned in the description, and it is not directly relevant to the
problem I saw ... though it is complicated.

The problem "workqueue ... hogged cpu..." probably means that
nfsd_file_dispose_list() needs a cond_resched() call in the loop.
That will stop it from hogging the CPU whether it is tied to one CPU or
free to roam.

Also that work is calling filp_close() which primarily calls
filp_flush().
It also calls fput() but that does minimal work. If there is much work
to do then that is offloaded to another work-item. *That* is the
work-item that I had problems with.

The problem I saw was with an older kernel which didn't have the nfsd
file cache and so probably is calling filp_close more often. So maybe
my patch isn't so important now. Particularly as nfsd now isn't closing
most files in-task but instead offloads that to another task. So the
final fput will not be handled by the nfsd task either.

But I think there is room for improvement. Gathering lots of files
together into a list and closing them sequentially is not going to be as
efficient as closing them in parallel.

> > For normal threads, the thread that closes the file also calls the
> > final fput so there is natural rate limiting preventing excessive growth
> > in the list of delayed fputs. For kernel threads, and particularly for
> > nfsd, delays in the final fput do not impose any throttling to prevent
> > the thread from closing more files.
>
> I don't think we want to block nfsd threads waiting for files to
> close. Won't that be a potential denial of service?

Not as much as the denial of service caused by memory exhaustion due to
an indefinitely growing list of files waiting to be closed by a single
workqueue thread.

I think it is perfectly reasonable that when handling an NFSv4 CLOSE,
the nfsd thread should completely handle that request including all the
flush and ->release etc. If that causes any denial of service, then
simply increase the number of nfsd threads.

For NFSv3 it is more complex. On the kernel where I saw a problem the
filp_close happens after each READ or WRITE (though I think the customer
was using NFSv4...). With the file cache there is no thread that is
obviously responsible for the close.
To get the sort of throttling that I think is needed, we could possibly
have each "nfsd_open" check if there are pending closes, and wait for
some small amount of progress.

But I don't think it is reasonable for the nfsd threads to take none of
the burden of closing files as that can result in imbalance.

I'll need to give this more thought.

Thanks,
NeilBrown
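The throttling Neil sketches above, an opener that waits for some progress whenever the disposal backlog is long, amounts to a high-water mark on the queue of pending closes. A toy single-lock illustration (the struct, threshold, and function names are all invented for this sketch; none of this is nfsd code):

```c
#include <pthread.h>

/* Toy model of backpressure on a disposal queue: producers that find
 * the backlog above a threshold wait until the consumer has retired
 * some entries.  The threshold is arbitrary, for illustration only. */
struct disposal_queue {
    pthread_mutex_t lock;
    pthread_cond_t  progress;
    unsigned long   pending;   /* files queued but not yet closed */
};

#define DISPOSAL_HIGH_WATER 64

void disposal_enqueue(struct disposal_queue *q)
{
    pthread_mutex_lock(&q->lock);
    /* Throttle: don't let the backlog grow without bound. */
    while (q->pending >= DISPOSAL_HIGH_WATER)
        pthread_cond_wait(&q->progress, &q->lock);
    q->pending++;
    pthread_mutex_unlock(&q->lock);
}

void disposal_retire_one(struct disposal_queue *q)
{
    pthread_mutex_lock(&q->lock);
    if (q->pending) {
        q->pending--;
        pthread_cond_signal(&q->progress); /* let one producer proceed */
    }
    pthread_mutex_unlock(&q->lock);
}
```

The point of the design is that producers share the burden: each new open pays a small wait when the close side is behind, instead of the backlog growing until memory runs out.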
On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote:
> On Tue, 28 Nov 2023, Chuck Lever wrote:
> > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
> > >
> > > I have evidence from a customer site of 256 nfsd threads adding files to
> > > delayed_fput_lists nearly twice as fast as they are retired by a single
> > > work-queue thread running delayed_fput(). As you might imagine this
> > > does not end well (20 million files in the queue at the time a snapshot
> > > was taken for analysis).
> > >
> > > While this might point to a problem with the filesystem not handling the
> > > final close efficiently, such problems should only hurt throughput, not
> > > lead to memory exhaustion.
> >
> > I have this patch queued for v6.8:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0
>
> Thanks....
> I think that change is good, but I don't think it addresses the problem
> mentioned in the description, and it is not directly relevant to the
> problem I saw ... though it is complicated.
>
> The problem "workqueue ... hogged cpu..." probably means that
> nfsd_file_dispose_list() needs a cond_resched() call in the loop.
> That will stop it from hogging the CPU whether it is tied to one CPU or
> free to roam.
>
> Also that work is calling filp_close() which primarily calls
> filp_flush().
> It also calls fput() but that does minimal work. If there is much work
> to do then that is offloaded to another work-item. *That* is the
> work-item that I had problems with.
>
> The problem I saw was with an older kernel which didn't have the nfsd
> file cache and so probably is calling filp_close more often.

Without the file cache, the filp_close() should be handled directly
by the nfsd thread handling the RPC, IIRC.

> So maybe
> my patch isn't so important now. Particularly as nfsd now isn't closing
> most files in-task but instead offloads that to another task. So the
> final fput will not be handled by the nfsd task either.
>
> But I think there is room for improvement. Gathering lots of files
> together into a list and closing them sequentially is not going to be as
> efficient as closing them in parallel.

I believe the file cache passes the filps to the work queue one at
a time, but I don't think there's anything that forces the work
queue to handle each flush/close completely before proceeding to the
next. IOW there is some parallelism there already, especially now that
nfsd_filecache_wq is UNBOUND.

> > > For normal threads, the thread that closes the file also calls the
> > > final fput so there is natural rate limiting preventing excessive growth
> > > in the list of delayed fputs. For kernel threads, and particularly for
> > > nfsd, delays in the final fput do not impose any throttling to prevent
> > > the thread from closing more files.
> >
> > I don't think we want to block nfsd threads waiting for files to
> > close. Won't that be a potential denial of service?
>
> Not as much as the denial of service caused by memory exhaustion due to
> an indefinitely growing list of files waiting to be closed by a single
> workqueue thread.

The cache garbage collector is single-threaded, but nfsd_filecache_wq
has a max_active setting of zero.

> I think it is perfectly reasonable that when handling an NFSv4 CLOSE,
> the nfsd thread should completely handle that request including all the
> flush and ->release etc. If that causes any denial of service, then
> simply increase the number of nfsd threads.
>
> For NFSv3 it is more complex. On the kernel where I saw a problem the
> filp_close happens after each READ or WRITE (though I think the customer
> was using NFSv4...). With the file cache there is no thread that is
> obviously responsible for the close.
> To get the sort of throttling that I think is needed, we could possibly
> have each "nfsd_open" check if there are pending closes, and wait for
> some small amount of progress.

Well nfsd_open() in particular appears to be used only for readdir.

But maybe nfsd_file_acquire() could wait briefly, in the garbage-
collected case, if the nfsd_net's disposal queue is long.

> But I don't think it is reasonable for the nfsd threads to take none of
> the burden of closing files as that can result in imbalance.
>
> I'll need to give this more thought.
(trimmed cc...)

On Tue, 28 Nov 2023, Chuck Lever wrote:
> On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote:
> > On Tue, 28 Nov 2023, Chuck Lever wrote:
> > > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
> > > >
> > > > I have evidence from a customer site of 256 nfsd threads adding files to
> > > > delayed_fput_lists nearly twice as fast as they are retired by a single
> > > > work-queue thread running delayed_fput(). As you might imagine this
> > > > does not end well (20 million files in the queue at the time a snapshot
> > > > was taken for analysis).
> > > >
> > > > While this might point to a problem with the filesystem not handling the
> > > > final close efficiently, such problems should only hurt throughput, not
> > > > lead to memory exhaustion.
> > >
> > > I have this patch queued for v6.8:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0
> >
> > Thanks....
> > I think that change is good, but I don't think it addresses the problem
> > mentioned in the description, and it is not directly relevant to the
> > problem I saw ... though it is complicated.
> >
> > The problem "workqueue ... hogged cpu..." probably means that
> > nfsd_file_dispose_list() needs a cond_resched() call in the loop.
> > That will stop it from hogging the CPU whether it is tied to one CPU or
> > free to roam.
> >
> > Also that work is calling filp_close() which primarily calls
> > filp_flush().
> > It also calls fput() but that does minimal work. If there is much work
> > to do then that is offloaded to another work-item. *That* is the
> > work-item that I had problems with.
> >
> > The problem I saw was with an older kernel which didn't have the nfsd
> > file cache and so probably is calling filp_close more often.
>
> Without the file cache, the filp_close() should be handled directly
> by the nfsd thread handling the RPC, IIRC.

Yes - but __fput() is handled by a workqueue.

> > So maybe
> > my patch isn't so important now. Particularly as nfsd now isn't closing
> > most files in-task but instead offloads that to another task. So the
> > final fput will not be handled by the nfsd task either.
> >
> > But I think there is room for improvement. Gathering lots of files
> > together into a list and closing them sequentially is not going to be as
> > efficient as closing them in parallel.
>
> I believe the file cache passes the filps to the work queue one at

nfsd_file_close_inode() does. nfsd_file_gc() and nfsd_file_lru_scan()
can pass multiple.

> a time, but I don't think there's anything that forces the work
> queue to handle each flush/close completely before proceeding to the
> next.

Parallelism with workqueues is controlled by the work items (struct
work_struct). Two different work items can run in parallel. But any
given work item can never run parallel to itself.

The only work items queued on nfsd_filecache_wq are from
nn->fcache_disposal->work. There is one of these for each network
namespace. So in any given network namespace, all work on
nfsd_filecache_wq is fully serialised.

> IOW there is some parallelism there already, especially now that
> nfsd_filecache_wq is UNBOUND.

No there is not. And UNBOUND makes no difference to parallelism in this
case. It allows the one work item to migrate between CPUs while it is
running, but it doesn't allow it to run concurrently on two different
CPUs.

(UNBOUND can improve parallelism when multiple different work items are
submitted all from the same CPU. Without UNBOUND all the work would
happen on the same CPU, though if the work sleeps, the different work
items can be interleaved. With UNBOUND the different work items can
enjoy true parallelism when needed.)

> > > > For normal threads, the thread that closes the file also calls the
> > > > final fput so there is natural rate limiting preventing excessive growth
> > > > in the list of delayed fputs. For kernel threads, and particularly for
> > > > nfsd, delays in the final fput do not impose any throttling to prevent
> > > > the thread from closing more files.
> > >
> > > I don't think we want to block nfsd threads waiting for files to
> > > close. Won't that be a potential denial of service?
> >
> > Not as much as the denial of service caused by memory exhaustion due to
> > an indefinitely growing list of files waiting to be closed by a single
> > workqueue thread.
>
> The cache garbage collector is single-threaded, but nfsd_filecache_wq
> has a max_active setting of zero.

This allows parallelism between network namespaces, but not within a
network namespace.

> > I think it is perfectly reasonable that when handling an NFSv4 CLOSE,
> > the nfsd thread should completely handle that request including all the
> > flush and ->release etc. If that causes any denial of service, then
> > simply increase the number of nfsd threads.
> >
> > For NFSv3 it is more complex. On the kernel where I saw a problem the
> > filp_close happens after each READ or WRITE (though I think the customer
> > was using NFSv4...). With the file cache there is no thread that is
> > obviously responsible for the close.
> > To get the sort of throttling that I think is needed, we could possibly
> > have each "nfsd_open" check if there are pending closes, and wait for
> > some small amount of progress.
>
> Well nfsd_open() in particular appears to be used only for readdir.
>
> But maybe nfsd_file_acquire() could wait briefly, in the garbage-
> collected case, if the nfsd_net's disposal queue is long.
>
> > But I don't think it is reasonable for the nfsd threads to take none of
> > the burden of closing files as that can result in imbalance.
> >
> > I'll need to give this more thought.
>
> --
> Chuck Lever

Thanks,
NeilBrown
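Neil's point about work items can be shown with a tiny model: queueing a work item that is already pending is a no-op, so however many producers feed one work_struct, it yields at most one execution stream. A minimal single-threaded sketch (names parallel the kernel's, but real workqueues use atomic pending bits and per-CPU worker pools; this is illustration only):

```c
#include <stdbool.h>

/* Toy model of the workqueue property the thread relies on:
 * a work item already pending cannot be queued a second time,
 * so one work_struct never buys extra parallelism. */
struct work_struct {
    bool pending;
    void (*func)(struct work_struct *);
};

bool queue_work_model(struct work_struct *w)
{
    if (w->pending)
        return false;   /* already queued: the request is coalesced */
    w->pending = true;
    return true;
}

void run_work_model(struct work_struct *w)
{
    if (w->pending) {
        w->pending = false;
        w->func(w);
    }
}
```

Since nn->fcache_disposal->work is the only item a network namespace ever puts on nfsd_filecache_wq, every disposal in that namespace funnels through this one serialised stream, regardless of how many CPUs or UNBOUND workers are available.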
On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
>
> I have evidence from a customer site of 256 nfsd threads adding files to
> delayed_fput_lists nearly twice as fast as they are retired by a single
> work-queue thread running delayed_fput(). As you might imagine this
> does not end well (20 million files in the queue at the time a snapshot
> was taken for analysis).
>
> While this might point to a problem with the filesystem not handling the
> final close efficiently, such problems should only hurt throughput, not
> lead to memory exhaustion.
>
> For normal threads, the thread that closes the file also calls the
> final fput so there is natural rate limiting preventing excessive growth
> in the list of delayed fputs. For kernel threads, and particularly for
> nfsd, delays in the final fput do not impose any throttling to prevent
> the thread from closing more files.
>
> A simple way to fix this is to treat nfsd threads like normal processes
> for task_work.  Thus the pending files are queued for the thread, and
> the same thread finishes the work.
>
> Currently KTHREADs are assumed never to call task_work_run(). With this
> patch that is still the default, but it is implemented by storing the
> magic value TASK_WORKS_DISABLED in ->task_works. If a kthread, such as
> nfsd, will call task_work_run() periodically, it sets ->task_works
> to NULL to indicate this.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>
> I wonder which tree this should go through assuming everyone likes it.
> VFS maybe??

Sure.

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 292c31697248..c63c2bedbf71 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1117,6 +1117,7 @@ struct task_struct {
>  	unsigned int		sas_ss_flags;
>
>  	struct callback_head	*task_works;
> +#define TASK_WORKS_DISABLED ((void*)1)

Should be simpler if you invert the logic?
COMPLETELY UNTESTED

---
 fs/file_table.c           |  2 +-
 fs/nfsd/nfssvc.c          |  4 ++++
 include/linux/task_work.h |  3 +++
 kernel/fork.c             |  3 +++
 kernel/task_work.c        | 12 ++++++++++++
 5 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index de4a2915bfd4..e79351df22be 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -445,7 +445,7 @@ void fput(struct file *file)
 	if (atomic_long_dec_and_test(&file->f_count)) {
 		struct task_struct *task = current;
 
-		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
+		if (likely(!in_interrupt())) {
 			init_task_work(&file->f_rcuhead, ____fput);
 			if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
 				return;
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index d6122bb2d167..cea76bad3a95 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -13,6 +13,7 @@
 #include <linux/fs_struct.h>
 #include <linux/swap.h>
 #include <linux/siphash.h>
+#include <linux/task_work.h>
 
 #include <linux/sunrpc/stats.h>
 #include <linux/sunrpc/svcsock.h>
@@ -943,6 +944,7 @@ nfsd(void *vrqstp)
 
 	current->fs->umask = 0;
 
+	task_work_manage(current); /* Declare that I will call task_work_run() */
 	atomic_inc(&nfsdstats.th_cnt);
 
 	set_freezable();
@@ -956,6 +958,8 @@ nfsd(void *vrqstp)
 
 		svc_recv(rqstp);
 		validate_process_creds();
+		if (task_work_pending(current))
+			task_work_run();
 	}
 
 	atomic_dec(&nfsdstats.th_cnt);
diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index 795ef5a68429..645fb94e47e0 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -20,6 +20,9 @@ enum task_work_notify_mode {
 	TWA_SIGNAL_NO_IPI,
 };
 
+void task_work_off(struct task_struct *task);
+void task_work_manage(struct task_struct *task);
+
 static inline bool task_work_pending(struct task_struct *task)
 {
 	return READ_ONCE(task->task_works);
diff --git a/kernel/fork.c b/kernel/fork.c
index 10917c3e1f03..348ed8fa9333 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2346,6 +2346,9 @@ __latent_entropy struct task_struct *copy_process(
 	if (args->io_thread)
 		p->flags |= PF_IO_WORKER;
 
+	if (args->kthread)
+		task_work_off(p);
+
 	if (args->name)
 		strscpy_pad(p->comm, args->name, sizeof(p->comm));
 
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 95a7e1b7f1da..2ae948d0d124 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -5,6 +5,18 @@
 
 static struct callback_head work_exited; /* all we need is ->next == NULL */
 
+void task_work_off(struct task_struct *task)
+{
+	task->task_works = &work_exited;
+}
+EXPORT_SYMBOL(task_work_off);
+
+void task_work_manage(struct task_struct *task)
+{
+	task->task_works = NULL;
+}
+EXPORT_SYMBOL(task_work_manage);
+
 /**
  * task_work_add - ask the @task to execute @work->func()
  * @task: the task which should run the callback
[Reusing the trimmed Cc]

On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote:
> On Tue, 28 Nov 2023, Chuck Lever wrote:
> > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
> > >
> > > I have evidence from a customer site of 256 nfsd threads adding files to
> > > delayed_fput_lists nearly twice as fast as they are retired by a single
> > > work-queue thread running delayed_fput(). As you might imagine this
> > > does not end well (20 million files in the queue at the time a snapshot
> > > was taken for analysis).
> > >
> > > While this might point to a problem with the filesystem not handling the
> > > final close efficiently, such problems should only hurt throughput, not
> > > lead to memory exhaustion.
> >
> > I have this patch queued for v6.8:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0
>
> Thanks....
> I think that change is good, but I don't think it addresses the problem
> mentioned in the description, and it is not directly relevant to the
> problem I saw ... though it is complicated.
>
> The problem "workqueue ... hogged cpu..." probably means that
> nfsd_file_dispose_list() needs a cond_resched() call in the loop.
> That will stop it from hogging the CPU whether it is tied to one CPU or
> free to roam.
>
> Also that work is calling filp_close() which primarily calls
> filp_flush().
> It also calls fput() but that does minimal work. If there is much work
> to do then that is offloaded to another work-item. *That* is the
> work-item that I had problems with.
>
> The problem I saw was with an older kernel which didn't have the nfsd
> file cache and so probably is calling filp_close more often. So maybe
> my patch isn't so important now. Particularly as nfsd now isn't closing
> most files in-task but instead offloads that to another task. So the
> final fput will not be handled by the nfsd task either.
>
> But I think there is room for improvement. Gathering lots of files
> together into a list and closing them sequentially is not going to be as
> efficient as closing them in parallel.
>
> > > For normal threads, the thread that closes the file also calls the
> > > final fput so there is natural rate limiting preventing excessive growth
> > > in the list of delayed fputs. For kernel threads, and particularly for
> > > nfsd, delays in the final fput do not impose any throttling to prevent
> > > the thread from closing more files.
> >
> > I don't think we want to block nfsd threads waiting for files to
> > close. Won't that be a potential denial of service?
>
> Not as much as the denial of service caused by memory exhaustion due to
> an indefinitely growing list of files waiting to be closed by a single
> workqueue thread.

It seems less likely that you run into memory exhaustion than a DoS
because nfsd() is busy closing fds. Especially because you default to a
single nfsd thread, afaict.

> I think it is perfectly reasonable that when handling an NFSv4 CLOSE,
> the nfsd thread should completely handle that request including all the
> flush and ->release etc. If that causes any denial of service, then
> simply increase the number of nfsd threads.

But isn't that a significant behavioral change? So I would expect to
make this configurable via a module- or Kconfig option?

> For NFSv3 it is more complex. On the kernel where I saw a problem the
> filp_close happens after each READ or WRITE (though I think the customer
> was using NFSv4...). With the file cache there is no thread that is
> obviously responsible for the close.
> To get the sort of throttling that I think is needed, we could possibly
> have each "nfsd_open" check if there are pending closes, and wait for
> some small amount of progress.
>
> But I don't think it is reasonable for the nfsd threads to take none of
> the burden of closing files as that can result in imbalance.

It feels like this really needs to be tested under a workload similar
to the one in question to see whether this is a viable solution.
On 11/28, Christian Brauner wrote:
>
> Should be simpler if you invert the logic?
>
> COMPLETELY UNTESTED

Agreed, this looks much better to me. But perhaps we can just add the
new PF_KTHREAD_XXX flag and change fput

--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -445,7 +445,8 @@ void fput(struct file *file)
 	if (atomic_long_dec_and_test(&file->f_count)) {
 		struct task_struct *task = current;
 
-		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
+		if (likely(!in_interrupt() &&
+		    task->flags & (PF_KTHREAD|PF_KTHREAD_XXX) != PF_KTHREAD) {
 			init_task_work(&file->f_rcuhead, ____fput);
 			if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
 				return;

?

Then nfsd() can simply set PF_KTHREAD_XXX. This looks even simpler to
me.

Oleg.
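One nit worth flagging in the sketch above: in C, `!=` binds more tightly than `&`, so `task->flags & (PF_KTHREAD|PF_KTHREAD_XXX) != PF_KTHREAD` evaluates the comparison first (and the `if` is also one closing parenthesis short). The intended test needs parentheses around the masking. A standalone illustration with made-up flag values (the real PF_* constants and the PF_KTHREAD_XXX name are placeholders here):

```c
/* != binds tighter than &, so the mask-and-compare must be
 * parenthesized.  Flag values below are illustrative only. */
#define PF_KTHREAD     0x00200000u
#define PF_KTHREAD_XXX 0x10000000u   /* hypothetical opt-in flag */

/* Intended logic: queue task_work unless this is a kthread that
 * has NOT opted in to running task_work. */
int queues_task_work(unsigned int flags)
{
    return (flags & (PF_KTHREAD | PF_KTHREAD_XXX)) != PF_KTHREAD;
}
```

Without the outer parentheses the expression becomes `flags & ((PF_KTHREAD|PF_KTHREAD_XXX) != PF_KTHREAD)`, i.e. `flags & 1`, which is wrong for every task whose low flag bit is clear.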
On 11/28, NeilBrown wrote:
>
> I have evidence from a customer site of 256 nfsd threads adding files to
> delayed_fput_lists nearly twice as fast as they are retired by a single
> work-queue thread running delayed_fput(). As you might imagine this
> does not end well (20 million files in the queue at the time a snapshot
> was taken for analysis).

On a related note... Neil, Al, et al, can you look at

	[PATCH 1/3] fput: don't abuse task_work_add() when possible
	https://lore.kernel.org/all/20150908171446.GA14589@redhat.com/

(please ignore 3/3).

Oleg.
On Tue, 2023-11-28 at 14:51 +0100, Christian Brauner wrote: > [Reusing the trimmed Cc] > > On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote: > > On Tue, 28 Nov 2023, Chuck Lever wrote: > > > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote: > > > > > > > > I have evidence from a customer site of 256 nfsd threads adding files to > > > > delayed_fput_lists nearly twice as fast they are retired by a single > > > > work-queue thread running delayed_fput(). As you might imagine this > > > > does not end well (20 million files in the queue at the time a snapshot > > > > was taken for analysis). > > > > > > > > While this might point to a problem with the filesystem not handling the > > > > final close efficiently, such problems should only hurt throughput, not > > > > lead to memory exhaustion. > > > > > > I have this patch queued for v6.8: > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0 > > > > > > > Thanks.... > > I think that change is good, but I don't think it addresses the problem > > mentioned in the description, and it is not directly relevant to the > > problem I saw ... though it is complicated. > > > > The problem "workqueue ... hogged cpu..." probably means that > > nfsd_file_dispose_list() needs a cond_resched() call in the loop. > > That will stop it from hogging the CPU whether it is tied to one CPU or > > free to roam. > > > > Also that work is calling filp_close() which primarily calls > > filp_flush(). > > It also calls fput() but that does minimal work. If there is much work > > to do then that is offloaded to another work-item. *That* is the > > workitem that I had problems with. > > > > The problem I saw was with an older kernel which didn't have the nfsd > > file cache and so probably is calling filp_close more often. So maybe > > my patch isn't so important now. 
Particularly as nfsd now isn't closing > > most files in-task but instead offloads that to another task. So the > > final fput will not be handled by the nfsd task either. > > > > But I think there is room for improvement. Gathering lots of files > > together into a list and closing them sequentially is not going to be as > > efficient as closing them in parallel. > > > > > > > > > For normal threads, the thread that closes the file also calls the > > > > final fput so there is natural rate limiting preventing excessive growth > > > > in the list of delayed fputs. For kernel threads, and particularly for > > > > nfsd, delayed in the final fput do not impose any throttling to prevent > > > > the thread from closing more files. > > > > > > I don't think we want to block nfsd threads waiting for files to > > > close. Won't that be a potential denial of service? > > > > Not as much as the denial of service caused by memory exhaustion due to > > an indefinitely growing list of files waiting to be closed by a single > > thread of workqueue. > > It seems less likely that you run into memory exhausting than a DOS > because nfsd() is busy closing fds. Especially because you default to > single nfsd thread afaict. > The default is currently 8 threads (which is ridiculously low for most uses, but that's another discussion). That policy is usually set by userland nfs-utils though. This is another place where we might want to reserve a "rescuer" thread that avoids doing work that can end up blocked. Maybe we could switch back to queuing them to the list when we're below a certain threshold of available threads (1? 2? 4?). > > I think it is perfectly reasonable that when handling an NFSv4 CLOSE, > > the nfsd thread should completely handle that request including all the > > flush and ->release etc. If that causes any denial of service, then > > simple increase the number of nfsd threads. > > But isn't that a significant behavioral change? 
So I would expect to > make this at configurable via a module- or Kconfig option? > I struggle to think about how we would document a new option like this. > > For NFSv3 it is more complex. On the kernel where I saw a problem the > > filp_close happen after each READ or WRITE (though I think the customer > > was using NFSv4...). With the file cache there is no thread that is > > obviously responsible for the close. > > To get the sort of throttling that I think is need, we could possibly > > have each "nfsd_open" check if there are pending closes, and to wait for > > some small amount of progress. > > > > But don't think it is reasonable for the nfsd threads to take none of > > the burden of closing files as that can result in imbalance. > > It feels that this really needs to be tested under a similar workload in > question to see whether this is a viable solution. Definitely. I'd also like to see how this behaves with NFS or Ceph reexport. Closing can be quite slow on those filesystems, so that might be a good place to try and break this.
On 11/28, Oleg Nesterov wrote: > > On a related note... Neil, Al, et al, can you look at > > [PATCH 1/3] fput: don't abuse task_work_add() when possible > https://lore.kernel.org/all/20150908171446.GA14589@redhat.com/ Cough... Now that I look at this 8-year-old patch again I think it is wrong, fput() can race with task_work_cancel() so it is not safe to dereference the first pending work in theory :/ Oleg.
On Tue, Nov 28, 2023 at 09:15:39AM -0500, Jeff Layton wrote: > On Tue, 2023-11-28 at 14:51 +0100, Christian Brauner wrote: > > [Reusing the trimmed Cc] > > > > On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote: > > > On Tue, 28 Nov 2023, Chuck Lever wrote: > > > > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote: > > > > > > > > > > I have evidence from a customer site of 256 nfsd threads adding files to > > > > > delayed_fput_lists nearly twice as fast they are retired by a single > > > > > work-queue thread running delayed_fput(). As you might imagine this > > > > > does not end well (20 million files in the queue at the time a snapshot > > > > > was taken for analysis). > > > > > > > > > > While this might point to a problem with the filesystem not handling the > > > > > final close efficiently, such problems should only hurt throughput, not > > > > > lead to memory exhaustion. > > > > > > > > I have this patch queued for v6.8: > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0 > > > > > > > > > > Thanks.... > > > I think that change is good, but I don't think it addresses the problem > > > mentioned in the description, and it is not directly relevant to the > > > problem I saw ... though it is complicated. > > > > > > The problem "workqueue ... hogged cpu..." probably means that > > > nfsd_file_dispose_list() needs a cond_resched() call in the loop. > > > That will stop it from hogging the CPU whether it is tied to one CPU or > > > free to roam. > > > > > > Also that work is calling filp_close() which primarily calls > > > filp_flush(). > > > It also calls fput() but that does minimal work. If there is much work > > > to do then that is offloaded to another work-item. *That* is the > > > workitem that I had problems with. 
> > > > > > The problem I saw was with an older kernel which didn't have the nfsd > > > file cache and so probably is calling filp_close more often. So maybe > > > my patch isn't so important now. Particularly as nfsd now isn't closing > > > most files in-task but instead offloads that to another task. So the > > > final fput will not be handled by the nfsd task either. > > > > > > But I think there is room for improvement. Gathering lots of files > > > together into a list and closing them sequentially is not going to be as > > > efficient as closing them in parallel. > > > > > > > > > > > > For normal threads, the thread that closes the file also calls the > > > > > final fput so there is natural rate limiting preventing excessive growth > > > > > in the list of delayed fputs. For kernel threads, and particularly for > > > > > nfsd, delayed in the final fput do not impose any throttling to prevent > > > > > the thread from closing more files. > > > > > > > > I don't think we want to block nfsd threads waiting for files to > > > > close. Won't that be a potential denial of service? > > > > > > Not as much as the denial of service caused by memory exhaustion due to > > > an indefinitely growing list of files waiting to be closed by a single > > > thread of workqueue. > > > > It seems less likely that you run into memory exhausting than a DOS > > because nfsd() is busy closing fds. Especially because you default to > > single nfsd thread afaict. I would expect a DoS too: the system should start pushing out dirty file data itself well before exhausting memory. > The default is currently 8 threads (which is ridiculously low for most > uses, but that's another discussion). That policy is usually set by > userland nfs-utils though. With only 8 threads, it might be /more/ difficult for clients to generate enough workload to cause an overwhelming flood of closes. As Neil said in the cover letter text, he observed this issue with 256 nfsd threads. 
> This is another place where we might want to reserve a "rescuer" thread > that avoids doing work that can end up blocked. Maybe we could switch > back to queuing them to the list when we're below a certain threshold of > available threads (1? 2? 4?). > > > > I think it is perfectly reasonable that when handling an NFSv4 CLOSE, > > > the nfsd thread should completely handle that request including all the > > > flush and ->release etc. If that causes any denial of service, then > > > simple increase the number of nfsd threads. > > > > But isn't that a significant behavioral change? So I would expect to > > make this at configurable via a module- or Kconfig option? > > I struggle to think about how we would document a new option like this. I think NFSv4 CLOSE can close files synchronously without an observable behavior change. NFSv4 clients almost always COMMIT dirty data before they CLOSE, so there should only rarely be a significant flush component to an fput done here. The problem is the garbage-collected (NFSv3) case, where the server frequently closes files well before a client might have COMMITted its dirty data. > > > For NFSv3 it is more complex. On the kernel where I saw a problem the > > > filp_close happen after each READ or WRITE (though I think the customer > > > was using NFSv4...). With the file cache there is no thread that is > > > obviously responsible for the close. > > > To get the sort of throttling that I think is need, we could possibly > > > have each "nfsd_open" check if there are pending closes, and to wait for > > > some small amount of progress. > > > > > > But don't think it is reasonable for the nfsd threads to take none of > > > the burden of closing files as that can result in imbalance. > > > > It feels that this really needs to be tested under a similar workload in > > question to see whether this is a viable solution. > > Definitely. I'd also like to see how this behaves with NFS or Ceph > reexport. 
Closing can be quite slow on those filesystems, so that might > be a good place to try and break this.
On Tue, Nov 28, 2023 at 02:52:59PM +0100, Oleg Nesterov wrote: > On 11/28, Christian Brauner wrote: > > > > Should be simpler if you invert the logic? > > > > COMPLETELY UNTESTED > > Agreed, this looks much better to me. But perhaps we can just add the new > PF_KTHREAD_XXX flag and change fput > > > --- a/fs/file_table.c > +++ b/fs/file_table.c > @@ -445,7 +445,8 @@ void fput(struct file *file) > if (atomic_long_dec_and_test(&file->f_count)) { > struct task_struct *task = current; > > - if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) { > + if (likely(!in_interrupt() && > + task->flags & (PF_KTHREAD|PF_KTHREAD_XXX) != PF_KTHREAD) { > init_task_work(&file->f_rcuhead, ____fput); > if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME)) > return; > > ? > > Then nfsd() can simply set PF_KTHREAD_XXX. This looks even simpler to me. Yeah, I had played with that as well. Only reason I didn't do it was to avoid a PF_* flag. If that's preferable it might be worth to just add PF_TASK_WORK and decouple this from PF_KTHREAD. kthread creation and userspace process creation are all based on the same struct kernel_clone_args for a while now ever since we added this for clone3() so we catch everything in copy_process(): diff --git a/fs/file_table.c b/fs/file_table.c index 6deac386486d..5d3eb5ef4fc7 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -437,7 +437,7 @@ void fput(struct file *file) file_free(file); return; } - if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) { + if (likely(!in_interrupt() && (task->flags & PF_TASK_WORK))) { init_task_work(&file->f_rcuhead, ____fput); if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME)) return; diff --git a/include/linux/sched.h b/include/linux/sched.h index 292c31697248..8dfc06acc6a0 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1755,7 +1755,7 @@ extern struct pid *cad_pid; * I am cleaning dirty pages from some other bdi. 
*/ #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ -#define PF__HOLE__00800000 0x00800000 +#define PF_TASK_WORK 0x00800000 #define PF__HOLE__01000000 0x01000000 #define PF__HOLE__02000000 0x02000000 #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */ diff --git a/kernel/fork.c b/kernel/fork.c index 10917c3e1f03..2604235c800f 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2346,6 +2346,14 @@ __latent_entropy struct task_struct *copy_process( if (args->io_thread) p->flags |= PF_IO_WORKER; + /* + * By default only non-kernel threads can use task work. Kernel + * threads that manage task work explicitly can add that flag in + * their kthread callback. + */ + if (!args->kthread) + p->flags |= PF_TASK_WORK; + if (args->name) strscpy_pad(p->comm, args->name, sizeof(p->comm));
On Tue, Nov 28, 2023 at 01:57:30PM +1100, NeilBrown wrote: > > (trimmed cc...) > > On Tue, 28 Nov 2023, Chuck Lever wrote: > > On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote: > > > On Tue, 28 Nov 2023, Chuck Lever wrote: > > > > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote: > > > > > > > > > > I have evidence from a customer site of 256 nfsd threads adding files to > > > > > delayed_fput_lists nearly twice as fast they are retired by a single > > > > > work-queue thread running delayed_fput(). As you might imagine this > > > > > does not end well (20 million files in the queue at the time a snapshot > > > > > was taken for analysis). > > > > > > > > > > While this might point to a problem with the filesystem not handling the > > > > > final close efficiently, such problems should only hurt throughput, not > > > > > lead to memory exhaustion. > > > > > > > > I have this patch queued for v6.8: > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0 > > > > > > > > > > Thanks.... > > > I think that change is good, but I don't think it addresses the problem > > > mentioned in the description, and it is not directly relevant to the > > > problem I saw ... though it is complicated. > > > > > > The problem "workqueue ... hogged cpu..." probably means that > > > nfsd_file_dispose_list() needs a cond_resched() call in the loop. > > > That will stop it from hogging the CPU whether it is tied to one CPU or > > > free to roam. > > > > > > Also that work is calling filp_close() which primarily calls > > > filp_flush(). > > > It also calls fput() but that does minimal work. If there is much work > > > to do then that is offloaded to another work-item. *That* is the > > > workitem that I had problems with. > > > > > > The problem I saw was with an older kernel which didn't have the nfsd > > > file cache and so probably is calling filp_close more often. 
> > > > Without the file cache, the filp_close() should be handled directly > > by the nfsd thread handling the RPC, IIRC. > > Yes - but __fput() is handled by a workqueue. > > > > > > > > So maybe > > > my patch isn't so important now. Particularly as nfsd now isn't closing > > > most files in-task but instead offloads that to another task. So the > > > final fput will not be handled by the nfsd task either. > > > > > > But I think there is room for improvement. Gathering lots of files > > > together into a list and closing them sequentially is not going to be as > > > efficient as closing them in parallel. > > > > I believe the file cache passes the filps to the work queue one at > > nfsd_file_close_inode() does. nfsd_file_gc() and nfsd_file_lru_scan() > can pass multiple. > > > a time, but I don't think there's anything that forces the work > > queue to handle each flush/close completely before proceeding to the > > next. > > Parallelism with workqueues is controlled by the work items (struct > work_struct). Two different work items can run in parallel. But any > given work item can never run parallel to itself. > > The only work items queued on nfsd_filecache_wq are from > nn->fcache_disposal->work. > There is one of these for each network namespace. So in any given > network namespace, all work on nfsd_filecache_wq is fully serialised. OIC, it's that specific case you are concerned with. The per- namespace laundrette was added by: 9542e6a643fc ("nfsd: Containerise filecache laundrette") Its purpose was to confine the close backlog to each container. Seems like it would be better if there was a struct work_struct in each struct nfsd_file. That wouldn't add real backpressure to nfsd threads, but it would enable file closes to run in parallel. > > IOW there is some parallelism there already, especially now that > > nfsd_filecache_wq is UNBOUND. > > No there is not. And UNBOUND makes no difference to parallelism in this > case.
It allows the one work item to migrate between CPUs while it is > running, but it doesn't allow it to run concurrently on two different > CPUs. Right. The laundrette can now run in parallel with other work by moving to a different core, but there still can be only one laundrette running per namespace. > (UNBOUND can improve parallelism when multiple different work items are > submitted all from the same CPU. Without UNBOUND all the work would > happen on the same CPU, though if the work sleeps, the different work > items can be interleaved. With UNBOUND the different work items can > enjoy true parallelism when needed). > > > > > > > > > > > For normal threads, the thread that closes the file also calls the > > > > > final fput so there is natural rate limiting preventing excessive growth > > > > > in the list of delayed fputs. For kernel threads, and particularly for > > > > > nfsd, delayed in the final fput do not impose any throttling to prevent > > > > > the thread from closing more files. > > > > > > > > I don't think we want to block nfsd threads waiting for files to > > > > close. Won't that be a potential denial of service? > > > > > > Not as much as the denial of service caused by memory exhaustion due to > > > an indefinitely growing list of files waiting to be closed by a single > > > thread of workqueue. > > > > The cache garbage collector is single-threaded, but nfsd_filecache_wq > > has a max_active setting of zero. > > This allows parallelism between network namespaces, but not within a > network namespace. > > > > > > > > I think it is perfectly reasonable that when handling an NFSv4 CLOSE, > > > the nfsd thread should completely handle that request including all the > > > flush and ->release etc. If that causes any denial of service, then > > > simple increase the number of nfsd threads. > > > > > > For NFSv3 it is more complex. 
On the kernel where I saw a problem the > > > filp_close happen after each READ or WRITE (though I think the customer > > > was using NFSv4...). With the file cache there is no thread that is > > > obviously responsible for the close. > > > To get the sort of throttling that I think is need, we could possibly > > > have each "nfsd_open" check if there are pending closes, and to wait for > > > some small amount of progress. > > > > Well nfsd_open() in particular appears to be used only for readdir. > > > > But maybe nfsd_file_acquire() could wait briefly, in the garbage- > > collected case, if the nfsd_net's disposal queue is long. > > > > > > > But don't think it is reasonable for the nfsd threads to take none of > > > the burden of closing files as that can result in imbalance. > > > > > > I'll need to give this more thought. > > > > > > -- > > Chuck Lever > > > > Thanks, > NeilBrown
On 11/28, Christian Brauner wrote: > > Yeah, I had played with that as well. Only reason I didn't do it was to > avoid a PF_* flag. If that's preferable it might be worth to just add > PF_TASK_WORK and decouple this from PF_KTHREAD. OK, I won't insist. But, > + /* > + * By default only non-kernel threads can use task work. Kernel > + * threads that manage task work explicitly can add that flag in > + * their kthread callback. > + */ > + if (!args->kthread) > + p->flags |= PF_TASK_WORK; The comment and the name of the new flag look a bit misleading to me... kthreads can use task_work's. You can create a kthread which does task_work_run() from time to time and use task_work_add() on it, nothing wrong with that. Probably nobody does this right now (I didn't try to check), but please note irq_thread()->task_work_add(on_exit_work). Oleg.
Forgot to mention, On 11/28, Oleg Nesterov wrote: > > but please > note irq_thread()->task_work_add(on_exit_work). and this means that Neil's patch and yours were wrong ;) Oleg.
On Wed, 29 Nov 2023, Christian Brauner wrote: > [Reusing the trimmed Cc] > > On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote: > > On Tue, 28 Nov 2023, Chuck Lever wrote: > > > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote: > > > > > > > > I have evidence from a customer site of 256 nfsd threads adding files to > > > > delayed_fput_lists nearly twice as fast they are retired by a single > > > > work-queue thread running delayed_fput(). As you might imagine this > > > > does not end well (20 million files in the queue at the time a snapshot > > > > was taken for analysis). > > > > > > > > While this might point to a problem with the filesystem not handling the > > > > final close efficiently, such problems should only hurt throughput, not > > > > lead to memory exhaustion. > > > > > > I have this patch queued for v6.8: > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0 > > > > > > > Thanks.... > > I think that change is good, but I don't think it addresses the problem > > mentioned in the description, and it is not directly relevant to the > > problem I saw ... though it is complicated. > > > > The problem "workqueue ... hogged cpu..." probably means that > > nfsd_file_dispose_list() needs a cond_resched() call in the loop. > > That will stop it from hogging the CPU whether it is tied to one CPU or > > free to roam. > > > > Also that work is calling filp_close() which primarily calls > > filp_flush(). > > It also calls fput() but that does minimal work. If there is much work > > to do then that is offloaded to another work-item. *That* is the > > workitem that I had problems with. > > > > The problem I saw was with an older kernel which didn't have the nfsd > > file cache and so probably is calling filp_close more often. So maybe > > my patch isn't so important now. 
Particularly as nfsd now isn't closing > > most files in-task but instead offloads that to another task. So the > > final fput will not be handled by the nfsd task either. > > > > But I think there is room for improvement. Gathering lots of files > > together into a list and closing them sequentially is not going to be as > > efficient as closing them in parallel. > > > > > > > > > For normal threads, the thread that closes the file also calls the > > > > final fput so there is natural rate limiting preventing excessive growth > > > > in the list of delayed fputs. For kernel threads, and particularly for > > > > nfsd, delayed in the final fput do not impose any throttling to prevent > > > > the thread from closing more files. > > > > > > I don't think we want to block nfsd threads waiting for files to > > > close. Won't that be a potential denial of service? > > > > Not as much as the denial of service caused by memory exhaustion due to > > an indefinitely growing list of files waiting to be closed by a single > > thread of workqueue. > > It seems less likely that you run into memory exhausting than a DOS > because nfsd() is busy closing fds. Especially because you default to > single nfsd thread afaict. An nfsd thread would not end up being busy closing fds any more than it can already be busy reading data or busy syncing out changes or busy renaming a file. Which is to say: of course it can be busy doing this, but doing this sort of thing is its whole purpose in life. If an nfsd thread only completes the close that it initiated (which is what I am currently proposing) then there would be at most one, or maybe 2, fds to close after handling each request. While that is certainly a non-zero burden, I can't see how it can realistically be called a DOS. > > > I think it is perfectly reasonable that when handling an NFSv4 CLOSE, > > the nfsd thread should completely handle that request including all the > > flush and ->release etc.
If that causes any denial of service, then > > simple increase the number of nfsd threads. > > But isn't that a significant behavioral change? So I would expect to > make this at configurable via a module- or Kconfig option? Not really. Certainly not more than the change to introduce the filecache to nfsd in v5.4. > > > For NFSv3 it is more complex. On the kernel where I saw a problem the > > filp_close happen after each READ or WRITE (though I think the customer > > was using NFSv4...). With the file cache there is no thread that is > > obviously responsible for the close. > > To get the sort of throttling that I think is need, we could possibly > > have each "nfsd_open" check if there are pending closes, and to wait for > > some small amount of progress. > > > > But don't think it is reasonable for the nfsd threads to take none of > > the burden of closing files as that can result in imbalance. > > It feels that this really needs to be tested under a similar workload in > question to see whether this is a viable solution. > Creating that workload might be a challenge. I know it involved accessing 10s of millions of files with a server that was somewhat memory constrained. I don't know anything about the access pattern. Certainly I'll try to reproduce something similar by inserting delays in suitable places. This will help exercise the code, but won't really replicate the actual workload. Thanks, NeilBrown
On Wed, 29 Nov 2023, Jeff Layton wrote: > > This is another place where we might want to reserve a "rescuer" thread > that avoids doing work that can end up blocked. Maybe we could switch > back to queuing them to the list when we're below a certain threshold of > available threads (1? 2? 4?). A rescuer isn't for cases where work might be blocked, but for cases where otherwise work might be deadlocked - though maybe that is what you meant. Could nfsd calling filp_close() or __fput() ever deadlock? I think we know from the experience pre v5.8 that calling filp_close() doesn't cause deadlocks. Could __fput, and particularly ->release, ever deadlock w.r.t nfsd? i.e. could a ->release function for a file exported through NFS ever wait for nfsd to handle an NFS request? We don't need to worry about indirect dependencies like allocating memory and waiting for nfsd to flush out writes - that is already handled so that we can support loop-back mounts. So to have a problem we would need to nfs-export an NFS filesystem that was being served by the local NFS server. Now that we support NFS re-export, and we support loop-back mounts, it is fair to ask if we support the combination of the two. If we did, then calling ->release from the nfsd thread could deadlock. But calling ->read and ->write (or whatever those interfaces are called today) would also deadlock. So I think we have to say that nfs-reexporting a loop-back NFS mount is not supported, and not supportable. Whether we should try to detect and reject this case is an interesting question, but quite a separate question from that of how to handle the closing of files. In short - I don't think there is any need or value in a dedicated "rescuer" thread here. Thanks, NeilBrown
On Wed, 29 Nov 2023, Oleg Nesterov wrote: > Forgot to menstion, > > On 11/28, Oleg Nesterov wrote: > > > > but please > > note irq_thread()->task_work_add(on_exit_work). > > and this means that Neil's and your more patch were wrong ;) Yes it does - thanks for that! I guess we need a setting that is focused particularly on fput(). Probably a PF flag is best for that. Thanks, NeilBrown > > Oleg. > >
On Wed, 29 Nov 2023, Oleg Nesterov wrote: > On 11/28, NeilBrown wrote: > > > > I have evidence from a customer site of 256 nfsd threads adding files to > > delayed_fput_lists nearly twice as fast they are retired by a single > > work-queue thread running delayed_fput(). As you might imagine this > > does not end well (20 million files in the queue at the time a snapshot > > was taken for analysis). > > On a related note... Neil, Al, et al, can you look at > > [PATCH 1/3] fput: don't abuse task_work_add() when possible > https://lore.kernel.org/all/20150908171446.GA14589@redhat.com/ > Would it make sense to create a separate task_struct->delayed_fput llist? fput() adds the file to this llist and if it was the first item on the list, it then adds the task_work. That would probably require adding a callback_head to struct task_struct as well. NeilBrown
On 11/29, NeilBrown wrote: > > On Wed, 29 Nov 2023, Oleg Nesterov wrote: > > On 11/28, NeilBrown wrote: > > > > > > I have evidence from a customer site of 256 nfsd threads adding files to > > > delayed_fput_lists nearly twice as fast they are retired by a single > > > work-queue thread running delayed_fput(). As you might imagine this > > > does not end well (20 million files in the queue at the time a snapshot > > > was taken for analysis). > > > > On a related note... Neil, Al, et al, can you look at > > > > [PATCH 1/3] fput: don't abuse task_work_add() when possible > > https://lore.kernel.org/all/20150908171446.GA14589@redhat.com/ > > > > Would it make sense to create a separate task_struct->delayed_fput > llist? Sure, I too thought about this, > fput() adds the file to this llist and if it was the first item on the > list, it then adds the task_work. That would probably request adding a > callback_head to struct task_struct as well. Even simpler, but perhaps I missed something... We can add a "struct file *fput_xxx" into task_struct and f_fput_xxx into the f_llist/f_rcuhead union in the struct file. fput: if (task->fput_xxx) { file->f_fput_xxx = task->fput_xxx; task->fput_xxx = file; } else { task_work_add(...); // XXX: file->f_fput_xxx != NULL task->fput_xxx = file; } ____fput: struct file *file = task->fput_xxx; struct file *tail = container_of(work, ...); // see XXX in fput() tail->f_fput_xxx = NULL; current->fput_xxx = NULL; do { next = READ_ONCE(file->f_fput_xxx); __fput(file); file = next; } while (file); Again, quite possibly I missed something, but something like this should work. But I am still trying to find a simpler solution which doesn't need another member in task_struct... Oleg.
On Tue, Nov 28, 2023 at 06:29:59PM +0100, Oleg Nesterov wrote: > Forgot to menstion, > > On 11/28, Oleg Nesterov wrote: > > > > but please > > note irq_thread()->task_work_add(on_exit_work). > > and this means that Neil's and your more patch were wrong ;) Hm, that's all the more reason to not hang this off of PF_KTHREAD then. I mean, it's functional but we likely wouldn't have run into this confusion if this would be PF_FPUT_DELAYED, for example.
> If an nfsd thread only completes the close that it initiated the close > on (which is what I am currently proposing) then there would be at most > one, or maybe 2, fds to close after handling each request. While that > is certainly a non-zero burden, I can't see how it can realistically be > called a DOS. The 10s of millions of files is what makes me curious. Because that's the workload that'd be interesting. > > It feels that this really needs to be tested under a similar workload in > > question to see whether this is a viable solution. > > > > Creating that workload might be a challenge. I know it involved > accessing 10s of millions of files with a server that was somewhat > memory constrained. I don't know anything about the access pattern. > > Certainly I'll try to reproduce something similar by inserting delays in > suitable places. This will help exercise the code, but won't really > replicate the actual workload.
On Wed, Nov 29, 2023 at 10:20:23AM +1100, NeilBrown wrote: > On Wed, 29 Nov 2023, Christian Brauner wrote: > > [Reusing the trimmed Cc] > > > > On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote: > > > On Tue, 28 Nov 2023, Chuck Lever wrote: > > > > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote: > > > > > > > > > > I have evidence from a customer site of 256 nfsd threads adding files to > > > > > delayed_fput_lists nearly twice as fast they are retired by a single > > > > > work-queue thread running delayed_fput(). As you might imagine this > > > > > does not end well (20 million files in the queue at the time a snapshot > > > > > was taken for analysis). > > > > > > > > > > While this might point to a problem with the filesystem not handling the > > > > > final close efficiently, such problems should only hurt throughput, not > > > > > lead to memory exhaustion. > > > > > > > > I have this patch queued for v6.8: > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0 > > > > > > > > > > Thanks.... > > > I think that change is good, but I don't think it addresses the problem > > > mentioned in the description, and it is not directly relevant to the > > > problem I saw ... though it is complicated. > > > > > > The problem "workqueue ... hogged cpu..." probably means that > > > nfsd_file_dispose_list() needs a cond_resched() call in the loop. > > > That will stop it from hogging the CPU whether it is tied to one CPU or > > > free to roam. > > > > > > Also that work is calling filp_close() which primarily calls > > > filp_flush(). > > > It also calls fput() but that does minimal work. If there is much work > > > to do then that is offloaded to another work-item. *That* is the > > > workitem that I had problems with. > > > > > > The problem I saw was with an older kernel which didn't have the nfsd > > > file cache and so probably is calling filp_close more often. 
So maybe > > > my patch isn't so important now. Particularly as nfsd now isn't closing > > > most files in-task but instead offloads that to another task. So the > > > final fput will not be handled by the nfsd task either. > > > > > > But I think there is room for improvement. Gathering lots of files > > > together into a list and closing them sequentially is not going to be as > > > efficient as closing them in parallel. > > > > > > > > > > > > For normal threads, the thread that closes the file also calls the > > > > > final fput so there is natural rate limiting preventing excessive growth > > > > > in the list of delayed fputs. For kernel threads, and particularly for > > > > > nfsd, delayed in the final fput do not impose any throttling to prevent > > > > > the thread from closing more files. > > > > > > > > I don't think we want to block nfsd threads waiting for files to > > > > close. Won't that be a potential denial of service? > > > > > > Not as much as the denial of service caused by memory exhaustion due to > > > an indefinitely growing list of files waiting to be closed by a single > > > thread of workqueue. > > > > It seems less likely that you run into memory exhausting than a DOS > > because nfsd() is busy closing fds. Especially because you default to > > single nfsd thread afaict. > > An nfsd thread would not end up being busy closing fds any more than it > can already be busy reading data or busy syncing out changes or busying > renaming a file. > Which it is say: of course it can be busy doing this, but doing this sort > of thing is its whole purpose in life. > > If an nfsd thread only completes the close that it initiated the close > on (which is what I am currently proposing) then there would be at most > one, or maybe 2, fds to close after handling each request. Closing files more aggressively would seem to entirely defeat the purpose of the file cache, which is to avoid the overhead of opens and closes on frequently-used files. 
And usually Linux prefers to let the workload consume as many free resources as possible before it applies back pressure or cache eviction. IMO the first step should be removing head-of-queue blocking from the file cache's background closing mechanism. That might be enough to avoid forming a backlog in most cases. > > > For NFSv3 it is more complex. On the kernel where I saw a problem the > > > filp_close happen after each READ or WRITE (though I think the customer > > > was using NFSv4...). With the file cache there is no thread that is > > > obviously responsible for the close. > > > To get the sort of throttling that I think is need, we could possibly > > > have each "nfsd_open" check if there are pending closes, and to wait for > > > some small amount of progress. > > > > > > But don't think it is reasonable for the nfsd threads to take none of > > > the burden of closing files as that can result in imbalance. > > > > It feels that this really needs to be tested under a similar workload in > > question to see whether this is a viable solution. > > Creating that workload might be a challenge. I know it involved > accessing 10s of millions of files with a server that was somewhat > memory constrained. I don't know anything about the access pattern. > > Certainly I'll try to reproduce something similar by inserting delays in > suitable places. This will help exercise the code, but won't really > replicate the actual workload. It's likely that the fundamental bottleneck is writeback during close.
On Wed, 2023-11-29 at 09:04 -0500, Chuck Lever wrote: > On Wed, Nov 29, 2023 at 10:20:23AM +1100, NeilBrown wrote: > > On Wed, 29 Nov 2023, Christian Brauner wrote: > > > [Reusing the trimmed Cc] > > > > > > On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote: > > > > On Tue, 28 Nov 2023, Chuck Lever wrote: > > > > > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote: > > > > > > > > > > > > I have evidence from a customer site of 256 nfsd threads adding files to > > > > > > delayed_fput_lists nearly twice as fast they are retired by a single > > > > > > work-queue thread running delayed_fput(). As you might imagine this > > > > > > does not end well (20 million files in the queue at the time a snapshot > > > > > > was taken for analysis). > > > > > > > > > > > > While this might point to a problem with the filesystem not handling the > > > > > > final close efficiently, such problems should only hurt throughput, not > > > > > > lead to memory exhaustion. > > > > > > > > > > I have this patch queued for v6.8: > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0 > > > > > > > > > > > > > Thanks.... > > > > I think that change is good, but I don't think it addresses the problem > > > > mentioned in the description, and it is not directly relevant to the > > > > problem I saw ... though it is complicated. > > > > > > > > The problem "workqueue ... hogged cpu..." probably means that > > > > nfsd_file_dispose_list() needs a cond_resched() call in the loop. > > > > That will stop it from hogging the CPU whether it is tied to one CPU or > > > > free to roam. > > > > > > > > Also that work is calling filp_close() which primarily calls > > > > filp_flush(). > > > > It also calls fput() but that does minimal work. If there is much work > > > > to do then that is offloaded to another work-item. *That* is the > > > > workitem that I had problems with. 
> > > > > > > > The problem I saw was with an older kernel which didn't have the nfsd > > > > file cache and so probably is calling filp_close more often. So maybe > > > > my patch isn't so important now. Particularly as nfsd now isn't closing > > > > most files in-task but instead offloads that to another task. So the > > > > final fput will not be handled by the nfsd task either. > > > > > > > > But I think there is room for improvement. Gathering lots of files > > > > together into a list and closing them sequentially is not going to be as > > > > efficient as closing them in parallel. > > > > > > > > > > > > > > > For normal threads, the thread that closes the file also calls the > > > > > > final fput so there is natural rate limiting preventing excessive growth > > > > > > in the list of delayed fputs. For kernel threads, and particularly for > > > > > > nfsd, delayed in the final fput do not impose any throttling to prevent > > > > > > the thread from closing more files. > > > > > > > > > > I don't think we want to block nfsd threads waiting for files to > > > > > close. Won't that be a potential denial of service? > > > > > > > > Not as much as the denial of service caused by memory exhaustion due to > > > > an indefinitely growing list of files waiting to be closed by a single > > > > thread of workqueue. > > > > > > It seems less likely that you run into memory exhausting than a DOS > > > because nfsd() is busy closing fds. Especially because you default to > > > single nfsd thread afaict. > > > > An nfsd thread would not end up being busy closing fds any more than it > > can already be busy reading data or busy syncing out changes or busying > > renaming a file. > > Which it is say: of course it can be busy doing this, but doing this sort > > of thing is its whole purpose in life. 
> > > > If an nfsd thread only completes the close that it initiated the close > > on (which is what I am currently proposing) then there would be at most > > one, or maybe 2, fds to close after handling each request. > > Closing files more aggressively would seem to entirely defeat the > purpose of the file cache, which is to avoid the overhead of opens > and closes on frequently-used files. > > And usually Linux prefers to let the workload consume as many free > resources as possible before it applies back pressure or cache > eviction. > > IMO the first step should be removing head-of-queue blocking from > the file cache's background closing mechanism. That might be enough > to avoid forming a backlog in most cases. > > That's not quite what task_work does. Neil's patch wouldn't result in closes happening more aggressively. It would just make it so that we don't queue the delayed part of the fput process to a workqueue like we do today. Instead, the nfsd threads would have to clean that part up themselves, like syscalls do before returning to userland. I think that idea makes sense overall since that mirrors what we already do in userland. In the event that all of the nfsd threads are tied up in slow task_work jobs...tough luck. That at least makes it more of a self-limiting problem since RPCs will start being queued, rather than allowing dead files to just pile onto the list. > > > > For NFSv3 it is more complex. On the kernel where I saw a problem the > > > > filp_close happen after each READ or WRITE (though I think the customer > > > > was using NFSv4...). With the file cache there is no thread that is > > > > obviously responsible for the close. > > > > To get the sort of throttling that I think is need, we could possibly > > > > have each "nfsd_open" check if there are pending closes, and to wait for > > > > some small amount of progress. 
> > > > > > > > But don't think it is reasonable for the nfsd threads to take none of > > > > the burden of closing files as that can result in imbalance. > > > > > > It feels that this really needs to be tested under a similar workload in > > > question to see whether this is a viable solution. > > > > Creating that workload might be a challenge. I know it involved > > accessing 10s of millions of files with a server that was somewhat > > memory constrained. I don't know anything about the access pattern. > > > > Certainly I'll try to reproduce something similar by inserting delays in > > suitable places. This will help exercise the code, but won't really > > replicate the actual workload. > > It's likely that the fundamental bottleneck is writeback during > close. >
On Tue, 2023-11-28 at 10:34 -0500, Chuck Lever wrote: > On Tue, Nov 28, 2023 at 01:57:30PM +1100, NeilBrown wrote: > > > > (trimmed cc...) > > > > On Tue, 28 Nov 2023, Chuck Lever wrote: > > > On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote: > > > > On Tue, 28 Nov 2023, Chuck Lever wrote: > > > > > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote: > > > > > > > > > > > > I have evidence from a customer site of 256 nfsd threads adding files to > > > > > > delayed_fput_lists nearly twice as fast they are retired by a single > > > > > > work-queue thread running delayed_fput(). As you might imagine this > > > > > > does not end well (20 million files in the queue at the time a snapshot > > > > > > was taken for analysis). > > > > > > > > > > > > While this might point to a problem with the filesystem not handling the > > > > > > final close efficiently, such problems should only hurt throughput, not > > > > > > lead to memory exhaustion. > > > > > > > > > > I have this patch queued for v6.8: > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0 > > > > > > > > > > > > > Thanks.... > > > > I think that change is good, but I don't think it addresses the problem > > > > mentioned in the description, and it is not directly relevant to the > > > > problem I saw ... though it is complicated. > > > > > > > > The problem "workqueue ... hogged cpu..." probably means that > > > > nfsd_file_dispose_list() needs a cond_resched() call in the loop. > > > > That will stop it from hogging the CPU whether it is tied to one CPU or > > > > free to roam. > > > > > > > > Also that work is calling filp_close() which primarily calls > > > > filp_flush(). > > > > It also calls fput() but that does minimal work. If there is much work > > > > to do then that is offloaded to another work-item. *That* is the > > > > workitem that I had problems with. 
> > > > > > > > The problem I saw was with an older kernel which didn't have the nfsd > > > > file cache and so probably is calling filp_close more often. > > > > > > Without the file cache, the filp_close() should be handled directly > > > by the nfsd thread handling the RPC, IIRC. > > > > Yes - but __fput() is handled by a workqueue. > > > > > > > > > > > > So maybe > > > > my patch isn't so important now. Particularly as nfsd now isn't closing > > > > most files in-task but instead offloads that to another task. So the > > > > final fput will not be handled by the nfsd task either. > > > > > > > > But I think there is room for improvement. Gathering lots of files > > > > together into a list and closing them sequentially is not going to be as > > > > efficient as closing them in parallel. > > > > > > I believe the file cache passes the filps to the work queue one at > > > > nfsd_file_close_inode() does. nfsd_file_gc() and nfsd_file_lru_scan() > > can pass multiple. > > > > > a time, but I don't think there's anything that forces the work > > > queue to handle each flush/close completely before proceeding to the > > > next. > > > > Parallelism with workqueues is controlled by the work items (struct > > work_struct). Two different work items can run in parallel. But any > > given work item can never run parallel to itself. > > > > The only work items queued on nfsd_filecache_wq are from > > nn->fcache_disposal->work. > > There is one of these for each network namespace. So in any given > > network namespace, all work on nfsd_filecache_wq is fully serialised. > > OIC, it's that specific case you are concerned with. The per- > namespace laundrette was added by: > > 9542e6a643fc ("nfsd: Containerise filecache laundrette") > > It's purpose was to confine the close backlog to each container. > > Seems like it would be better if there was a struct work_struct > in each struct nfsd_file. 
That wouldn't add real backpressure to > nfsd threads, but it would enable file closes to run in parallel. > I like this idea. That seems a lot simpler than all of this weirdo queueing of delayed closes that we do.
On Thu, Nov 30, 2023 at 12:47:58PM -0500, Jeff Layton wrote: > On Wed, 2023-11-29 at 09:04 -0500, Chuck Lever wrote: > > On Wed, Nov 29, 2023 at 10:20:23AM +1100, NeilBrown wrote: > > > On Wed, 29 Nov 2023, Christian Brauner wrote: > > > > [Reusing the trimmed Cc] > > > > > > > > On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote: > > > > > On Tue, 28 Nov 2023, Chuck Lever wrote: > > > > > > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote: > > > > > > > > > > > > > > I have evidence from a customer site of 256 nfsd threads adding files to > > > > > > > delayed_fput_lists nearly twice as fast they are retired by a single > > > > > > > work-queue thread running delayed_fput(). As you might imagine this > > > > > > > does not end well (20 million files in the queue at the time a snapshot > > > > > > > was taken for analysis). > > > > > > > > > > > > > > While this might point to a problem with the filesystem not handling the > > > > > > > final close efficiently, such problems should only hurt throughput, not > > > > > > > lead to memory exhaustion. > > > > > > > > > > > > I have this patch queued for v6.8: > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0 > > > > > > > > > > > > > > > > Thanks.... > > > > > I think that change is good, but I don't think it addresses the problem > > > > > mentioned in the description, and it is not directly relevant to the > > > > > problem I saw ... though it is complicated. > > > > > > > > > > The problem "workqueue ... hogged cpu..." probably means that > > > > > nfsd_file_dispose_list() needs a cond_resched() call in the loop. > > > > > That will stop it from hogging the CPU whether it is tied to one CPU or > > > > > free to roam. > > > > > > > > > > Also that work is calling filp_close() which primarily calls > > > > > filp_flush(). > > > > > It also calls fput() but that does minimal work. 
If there is much work > > > > > to do then that is offloaded to another work-item. *That* is the > > > > > workitem that I had problems with. > > > > > > > > > > The problem I saw was with an older kernel which didn't have the nfsd > > > > > file cache and so probably is calling filp_close more often. So maybe > > > > > my patch isn't so important now. Particularly as nfsd now isn't closing > > > > > most files in-task but instead offloads that to another task. So the > > > > > final fput will not be handled by the nfsd task either. > > > > > > > > > > But I think there is room for improvement. Gathering lots of files > > > > > together into a list and closing them sequentially is not going to be as > > > > > efficient as closing them in parallel. > > > > > > > > > > > > > > > > > > For normal threads, the thread that closes the file also calls the > > > > > > > final fput so there is natural rate limiting preventing excessive growth > > > > > > > in the list of delayed fputs. For kernel threads, and particularly for > > > > > > > nfsd, delayed in the final fput do not impose any throttling to prevent > > > > > > > the thread from closing more files. > > > > > > > > > > > > I don't think we want to block nfsd threads waiting for files to > > > > > > close. Won't that be a potential denial of service? > > > > > > > > > > Not as much as the denial of service caused by memory exhaustion due to > > > > > an indefinitely growing list of files waiting to be closed by a single > > > > > thread of workqueue. > > > > > > > > It seems less likely that you run into memory exhausting than a DOS > > > > because nfsd() is busy closing fds. Especially because you default to > > > > single nfsd thread afaict. > > > > > > An nfsd thread would not end up being busy closing fds any more than it > > > can already be busy reading data or busy syncing out changes or busying > > > renaming a file. 
> > > Which it is say: of course it can be busy doing this, but doing this sort > > > of thing is its whole purpose in life. > > > > > > If an nfsd thread only completes the close that it initiated the close > > > on (which is what I am currently proposing) then there would be at most > > > one, or maybe 2, fds to close after handling each request. > > > > Closing files more aggressively would seem to entirely defeat the > > purpose of the file cache, which is to avoid the overhead of opens > > and closes on frequently-used files. > > > > And usually Linux prefers to let the workload consume as many free > > resources as possible before it applies back pressure or cache > > eviction. > > > > IMO the first step should be removing head-of-queue blocking from > > the file cache's background closing mechanism. That might be enough > > to avoid forming a backlog in most cases. > > That's not quite what task_work does. Neil's patch wouldn't result in > closes happening more aggressively. It would just make it so that we > don't queue the delayed part of the fput process to a workqueue like we > do today. > > Instead, the nfsd threads would have to clean that part up themselves, > like syscalls do before returning to userland. I think that idea makes > sense overall since that mirrors what we already do in userland. > > In the event that all of the nfsd threads are tied up in slow task_work > jobs...tough luck. That at least makes it more of a self-limiting > problem since RPCs will start being queueing, rather than allowing dead > files to just pile onto the list. Thanks for helping me understand the proposal. task_work would cause nfsd threads to wait for flush/close operations that others have already started; it would not increase the rate of closing cached file descriptors. The thing that nfsd_filesystem_wq does is compartmentalize the flush/close workload so that a heavy flush/close workload in one net namespace does not negatively impact other namespaces. 
IIUC, then, task_work does not discriminate between namespaces -- if one namespace is creating a backlog of dirty files to close, all nfsd threads would need to handle that backlog, and thus all namespaces would bear (a part of) that backlog.
On Thu, 2023-11-30 at 13:07 -0500, Chuck Lever wrote: > On Thu, Nov 30, 2023 at 12:47:58PM -0500, Jeff Layton wrote: > > On Wed, 2023-11-29 at 09:04 -0500, Chuck Lever wrote: > > > On Wed, Nov 29, 2023 at 10:20:23AM +1100, NeilBrown wrote: > > > > On Wed, 29 Nov 2023, Christian Brauner wrote: > > > > > [Reusing the trimmed Cc] > > > > > > > > > > On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote: > > > > > > On Tue, 28 Nov 2023, Chuck Lever wrote: > > > > > > > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote: > > > > > > > > > > > > > > > > I have evidence from a customer site of 256 nfsd threads adding files to > > > > > > > > delayed_fput_lists nearly twice as fast they are retired by a single > > > > > > > > work-queue thread running delayed_fput(). As you might imagine this > > > > > > > > does not end well (20 million files in the queue at the time a snapshot > > > > > > > > was taken for analysis). > > > > > > > > > > > > > > > > While this might point to a problem with the filesystem not handling the > > > > > > > > final close efficiently, such problems should only hurt throughput, not > > > > > > > > lead to memory exhaustion. > > > > > > > > > > > > > > I have this patch queued for v6.8: > > > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0 > > > > > > > > > > > > > > > > > > > Thanks.... > > > > > > I think that change is good, but I don't think it addresses the problem > > > > > > mentioned in the description, and it is not directly relevant to the > > > > > > problem I saw ... though it is complicated. > > > > > > > > > > > > The problem "workqueue ... hogged cpu..." probably means that > > > > > > nfsd_file_dispose_list() needs a cond_resched() call in the loop. > > > > > > That will stop it from hogging the CPU whether it is tied to one CPU or > > > > > > free to roam. 
> > > > > > > > > > > > Also that work is calling filp_close() which primarily calls > > > > > > filp_flush(). > > > > > > It also calls fput() but that does minimal work. If there is much work > > > > > > to do then that is offloaded to another work-item. *That* is the > > > > > > workitem that I had problems with. > > > > > > > > > > > > The problem I saw was with an older kernel which didn't have the nfsd > > > > > > file cache and so probably is calling filp_close more often. So maybe > > > > > > my patch isn't so important now. Particularly as nfsd now isn't closing > > > > > > most files in-task but instead offloads that to another task. So the > > > > > > final fput will not be handled by the nfsd task either. > > > > > > > > > > > > But I think there is room for improvement. Gathering lots of files > > > > > > together into a list and closing them sequentially is not going to be as > > > > > > efficient as closing them in parallel. > > > > > > > > > > > > > > > > > > > > > For normal threads, the thread that closes the file also calls the > > > > > > > > final fput so there is natural rate limiting preventing excessive growth > > > > > > > > in the list of delayed fputs. For kernel threads, and particularly for > > > > > > > > nfsd, delayed in the final fput do not impose any throttling to prevent > > > > > > > > the thread from closing more files. > > > > > > > > > > > > > > I don't think we want to block nfsd threads waiting for files to > > > > > > > close. Won't that be a potential denial of service? > > > > > > > > > > > > Not as much as the denial of service caused by memory exhaustion due to > > > > > > an indefinitely growing list of files waiting to be closed by a single > > > > > > thread of workqueue. > > > > > > > > > > It seems less likely that you run into memory exhausting than a DOS > > > > > because nfsd() is busy closing fds. Especially because you default to > > > > > single nfsd thread afaict. 
> > > > > > > > An nfsd thread would not end up being busy closing fds any more than it > > > > can already be busy reading data or busy syncing out changes or busying > > > > renaming a file. > > > > Which it is say: of course it can be busy doing this, but doing this sort > > > > of thing is its whole purpose in life. > > > > > > > > If an nfsd thread only completes the close that it initiated the close > > > > on (which is what I am currently proposing) then there would be at most > > > > one, or maybe 2, fds to close after handling each request. > > > > > > Closing files more aggressively would seem to entirely defeat the > > > purpose of the file cache, which is to avoid the overhead of opens > > > and closes on frequently-used files. > > > > > > And usually Linux prefers to let the workload consume as many free > > > resources as possible before it applies back pressure or cache > > > eviction. > > > > > > IMO the first step should be removing head-of-queue blocking from > > > the file cache's background closing mechanism. That might be enough > > > to avoid forming a backlog in most cases. > > > > That's not quite what task_work does. Neil's patch wouldn't result in > > closes happening more aggressively. It would just make it so that we > > don't queue the delayed part of the fput process to a workqueue like we > > do today. > > > > Instead, the nfsd threads would have to clean that part up themselves, > > like syscalls do before returning to userland. I think that idea makes > > sense overall since that mirrors what we already do in userland. > > > > In the event that all of the nfsd threads are tied up in slow task_work > > jobs...tough luck. That at least makes it more of a self-limiting > > problem since RPCs will start being queueing, rather than allowing dead > > files to just pile onto the list. > > Thanks for helping me understand the proposal. 
task_work would cause > nfsd threads to wait for flush/close operations that others have > already started; it would not increase the rate of closing cached > file descriptors. > Note that task_work is completely a per-task thing. Each nfsd thread now becomes responsible for cleaning up the files that were closed as part of processing the last RPC that it processed. Closes that occur during the processing of an RPC would be finished before the thread picks up a new RPC to process, but I don't see how the thread would be responsible for closes that happen in a different task altogether. > The thing that nfsd_filesystem_wq does is compartmentalize the > flush/close workload so that a heavy flush/close workload in one > net namespace does not negatively impact other namespaces. IIUC, > then, task_work does not discriminate between namespaces -- if one > namespace is creating a backlog of dirty files to close, all nfsd > threads would need to handle that backlog, and thus all namespaces > would bear (a part of) that backlog. > Closes that are queued to the nfsd_filesystem_wq won't be affected by this patch, since those files get closed in the context of a workqueue thread. In most cases, those are v2/3 files that have timed out. FWIW, I expect that this patch will mostly affect NFSv4, since those get closed more directly by nfsd. I wonder if we might want to consider doing something like this with lockd as well.
On Wed, 29 Nov 2023, Christian Brauner wrote:
> > If an nfsd thread only completes the close that it initiated the close
> > on (which is what I am currently proposing) then there would be at most
> > one, or maybe 2, fds to close after handling each request. While that
> > is certainly a non-zero burden, I can't see how it can realistically be
> > called a DOS.
>
> The 10s of millions of files is what makes me curious. Because that's
> the workload that'd be interesting.
>

I think the main effect of the 10s of millions of files is to bloat the
icache which causes it to be cleaned and this results in synchronous
reads from storage in situations where you wouldn't usually expect them.

It appears from examining a memory-snapshot that some files being closed
have already been unlinked. This is quite unusual with NFS but can happen
if they are unlinked from one client while open on another client. (The
directory containing the file in one case is called "Cache". Maybe
cleaning of that cache by the applications often gets files that are in
use.)

This pattern means that the final __dput calls __dentry_kill and
eventually xfs_fs_destroy_inode. This sometimes needs to read
synchronously from storage - if the required info isn't cached. This
causes the delays.

I've modelled the delay with

diff --git a/fs/file_table.c b/fs/file_table.c
index d36cade6e366..51563f79385a 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -29,6 +29,7 @@
 #include <linux/ima.h>
 #include <linux/swap.h>
 #include <linux/kmemleak.h>
+#include <linux/delay.h>
 
 #include <linux/atomic.h>
 
@@ -375,6 +376,9 @@ static void __fput(struct file *file)
 
 	eventpoll_release(file);
 	locks_remove_file(file);
+	if ((file->f_mode & FMODE_WRITE) &&
+	    (current->flags & PF_KTHREAD))
+		msleep(25);
 	ima_file_free(file);
 	if (unlikely(file->f_flags & FASYNC)) {
 		if (file->f_op->fasync)

I loop-back mount a filesystem with NFS on the test machine. 
The PF_KTHREAD test ensures that when "cp -r" writes to the NFS
filesystem there is no delay, but when nfsd writes to the local
filesystem there is a delay.

With this patch I can easily demonstrate the number of open files
growing without bound.  With my patch as well (I'll send new version
shortly) the growth is bounded.

Thanks,
NeilBrown
diff --git a/fs/file_table.c b/fs/file_table.c
index de4a2915bfd4..e79351df22be 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -445,7 +445,7 @@ void fput(struct file *file)
 	if (atomic_long_dec_and_test(&file->f_count)) {
 		struct task_struct *task = current;
 
-		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
+		if (likely(!in_interrupt())) {
 			init_task_work(&file->f_rcuhead, ____fput);
 			if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
 				return;
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 66ca50b38b27..c047961262ca 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -13,6 +13,7 @@
 #include <linux/fs_struct.h>
 #include <linux/swap.h>
 #include <linux/siphash.h>
+#include <linux/task_work.h>
 
 #include <linux/sunrpc/stats.h>
 #include <linux/sunrpc/svcsock.h>
@@ -941,6 +942,7 @@ nfsd(void *vrqstp)
 	}
 
 	current->fs->umask = 0;
+	current->task_works = NULL; /* Declare that I will call task_work_run() */
 
 	atomic_inc(&nfsdstats.th_cnt);
 
@@ -955,6 +957,8 @@ nfsd(void *vrqstp)
 
 		svc_recv(rqstp);
 		validate_process_creds();
+		if (task_work_pending(current))
+			task_work_run();
 	}
 
 	atomic_dec(&nfsdstats.th_cnt);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 292c31697248..c63c2bedbf71 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1117,6 +1117,7 @@ struct task_struct {
 	unsigned int			sas_ss_flags;
 
 	struct callback_head		*task_works;
+#define TASK_WORKS_DISABLED ((void*)1)
 
 #ifdef CONFIG_AUDIT
 #ifdef CONFIG_AUDITSYSCALL
diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index 795ef5a68429..3c74e3de81ed 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -22,7 +22,9 @@ enum task_work_notify_mode {
 
 static inline bool task_work_pending(struct task_struct *task)
 {
-	return READ_ONCE(task->task_works);
+	struct callback_head *works = READ_ONCE(task->task_works);
+
+	return works && works != TASK_WORKS_DISABLED;
 }
 
 int task_work_add(struct task_struct *task, struct callback_head *twork,
diff --git a/kernel/fork.c b/kernel/fork.c
index 10917c3e1f03..903b29804fe1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2577,7 +2577,7 @@ __latent_entropy struct task_struct *copy_process(
 	p->dirty_paused_when = 0;
 
 	p->pdeath_signal = 0;
-	p->task_works = NULL;
+	p->task_works = args->kthread ? TASK_WORKS_DISABLED : NULL;
 	clear_posix_cputimers_work(p);
 
 #ifdef CONFIG_KRETPROBES
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 95a7e1b7f1da..ffdf4b0d7a0e 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -49,7 +49,8 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 
 	head = READ_ONCE(task->task_works);
 	do {
-		if (unlikely(head == &work_exited))
+		if (unlikely(head == &work_exited ||
+			     head == TASK_WORKS_DISABLED))
 			return -ESRCH;
 		work->next = head;
 	} while (!try_cmpxchg(&task->task_works, &head, work));
@@ -157,7 +158,7 @@ void task_work_run(void)
 		work = READ_ONCE(task->task_works);
 		do {
 			head = NULL;
-			if (!work) {
+			if (!work || work == TASK_WORKS_DISABLED) {
 				if (task->flags & PF_EXITING)
 					head = &work_exited;
 				else
@@ -165,7 +166,7 @@
 			}
 		} while (!try_cmpxchg(&task->task_works, &work, head));
 
-		if (!work)
+		if (!work || work == TASK_WORKS_DISABLED)
 			break;
 		/*
 		 * Synchronize with task_work_cancel(). It can not remove
I have evidence from a customer site of 256 nfsd threads adding files
to delayed_fput_lists nearly twice as fast as they are retired by a
single work-queue thread running delayed_fput().  As you might imagine
this does not end well (20 million files in the queue at the time a
snapshot was taken for analysis).

While this might point to a problem with the filesystem not handling
the final close efficiently, such problems should only hurt
throughput, not lead to memory exhaustion.

For normal threads, the thread that closes the file also calls the
final fput, so there is natural rate limiting preventing excessive
growth in the list of delayed fputs.  For kernel threads, and
particularly for nfsd, delays in the final fput do not impose any
throttling to prevent the thread from closing more files.

A simple way to fix this is to treat nfsd threads like normal
processes for task_work.  Thus the pending files are queued for the
thread, and the same thread finishes the work.

Currently KTHREADs are assumed never to call task_work_run().  With
this patch that is still the default, but it is implemented by storing
the magic value TASK_WORKS_DISABLED in ->task_works.  If a kthread,
such as nfsd, will call task_work_run() periodically, it sets
->task_works to NULL to indicate this.

Signed-off-by: NeilBrown <neilb@suse.de>
---

I wonder which tree this should go through assuming everyone likes it.
VFS maybe??

Thanks.

 fs/file_table.c           | 2 +-
 fs/nfsd/nfssvc.c          | 4 ++++
 include/linux/sched.h     | 1 +
 include/linux/task_work.h | 4 +++-
 kernel/fork.c             | 2 +-
 kernel/task_work.c        | 7 ++++---
 6 files changed, 14 insertions(+), 6 deletions(-)