Message ID | 20250105234022.2361576-2-neilb@suse.de (mailing list archive)
---|---
State | Changes Requested
Series | nfsd: add scheduling point in nfsd_file_gc()
On 1/5/25 6:11 PM, NeilBrown wrote:
> Under a high NFSv3 load with lots of different files being accessed, the
> list_lru of garbage-collectable files can become quite long.
>
> Asking list_lru_scan() to scan the whole list can result in a long
> period during which a spinlock is held and no scheduling is possible.
> This is impolite.
>
> So only ask list_lru_scan() to scan 1024 entries at a time, and repeat
> if necessary - calling cond_resched() each time.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/nfsd/filecache.c | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index a1cdba42c4fa..e99a86798e86 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -543,11 +543,18 @@ nfsd_file_gc(void)
>  {
>  	LIST_HEAD(dispose);
>  	unsigned long ret;
> -
> -	ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> -			    &dispose, list_lru_count(&nfsd_file_lru));
> -	trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> -	nfsd_file_dispose_list_delayed(&dispose);
> +	unsigned long cnt = list_lru_count(&nfsd_file_lru);
> +
> +	while (cnt > 0) {

Since @cnt is unsigned, it should be safe to use "while (cnt) {"

(I might use "total" and "remaining" here -- "cnt", when said aloud,
leaves me snickering).

> +		unsigned long num_to_scan = min(cnt, 1024UL);

I see long delays with fewer than 1024 items on the list. I might
drop this number by one or two orders of magnitude. And make it a
symbolic constant.

There's another naked integer (8) in nfsd_file_net_dispose() -- how does
that relate to this new cap? Should that also be a symbolic constant?

> +		ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> +				    &dispose, num_to_scan);
> +		trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> +		nfsd_file_dispose_list_delayed(&dispose);

I need to go back and review the function traces to see where the
delays add up -- to make sure rescheduling here, rather than at some
other point, is appropriate. It probably is, but my memory fails me
these days.

> +		cnt -= num_to_scan;
> +		if (cnt)
> +			cond_resched();

Another approach might be to poke the laundrette again and simply
exit here, but I don't feel strongly about that.

> +	}
>  }
>
>  static void
On Mon, 06 Jan 2025, Chuck Lever wrote:
> On 1/5/25 6:11 PM, NeilBrown wrote:
> [...]
> > +	unsigned long cnt = list_lru_count(&nfsd_file_lru);
> > +
> > +	while (cnt > 0) {
>
> Since @cnt is unsigned, it should be safe to use "while (cnt) {"

"while (cnt > 0) {" is equally safe and for me it emphasises that it is
a decreasing counter, not a Bool. But different people have different
tastes.

> (I might use "total" and "remaining" here -- "cnt", when said aloud,
> leaves me snickering).

"remaining" would be better than "cnt".

> > +		unsigned long num_to_scan = min(cnt, 1024UL);
>
> I see long delays with fewer than 1024 items on the list. I might
> drop this number by one or two orders of magnitude. And make it a
> symbolic constant.

In that case I seriously wonder if this is where the delays are coming
from.

nfsd_file_dispose_list_delayed() does take and drop a spinlock
repeatedly (though it may not always be the same lock) and call
svc_wake_up() repeatedly - although the head of the queue might already
be woken. We could optimise that to detect runs with the same nn and
only take the lock once, and only wake_up once.

> There's another naked integer (8) in nfsd_file_net_dispose() -- how does
> that relate to this new cap? Should that also be a symbolic constant?

I don't think they relate. The trade-off with "8" is:
  a bigger number might block an nfsd thread for longer,
  forcing serialising when the work can usefully be done in parallel;
  a smaller number might needlessly wake lots of threads
  to share out a tiny amount of work.

The 1024 is simply about "don't hold a spinlock for too long".

> > +		ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> > +				    &dispose, num_to_scan);
> > +		trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> > +		nfsd_file_dispose_list_delayed(&dispose);
>
> I need to go back and review the function traces to see where the
> delays add up -- to make sure rescheduling here, rather than at some
> other point, is appropriate. It probably is, but my memory fails me
> these days.

I would like to see those function traces too.

> > +		cnt -= num_to_scan;
> > +		if (cnt)
> > +			cond_resched();
>
> Another approach might be to poke the laundrette again and simply
> exit here, but I don't feel strongly about that.

That wouldn't work without storing the 'remaining' count somewhere.
With the current design we need to scan the whole list every 2 seconds.

Thanks,
NeilBrown

> > +	}
> >  }
> >
> >  static void
>
> --
> Chuck Lever
On 1/5/25 10:02 PM, NeilBrown wrote:
> On Mon, 06 Jan 2025, Chuck Lever wrote:
>> On 1/5/25 6:11 PM, NeilBrown wrote:
>>> +		unsigned long num_to_scan = min(cnt, 1024UL);
>>
>> I see long delays with fewer than 1024 items on the list. I might
>> drop this number by one or two orders of magnitude. And make it a
>> symbolic constant.
>
> In that case I seriously wonder if this is where the delays are coming
> from.
>
> nfsd_file_dispose_list_delayed() does take and drop a spinlock
> repeatedly (though it may not always be the same lock) and call
> svc_wake_up() repeatedly - although the head of the queue might already
> be woken. We could optimise that to detect runs with the same nn and
> only take the lock once, and only wake_up once.
>
>> There's another naked integer (8) in nfsd_file_net_dispose() -- how does
>> that relate to this new cap? Should that also be a symbolic constant?
>
> I don't think they relate.
> The trade-off with "8" is:
>   a bigger number might block an nfsd thread for longer,
>   forcing serialising when the work can usefully be done in parallel;
>   a smaller number might needlessly wake lots of threads
>   to share out a tiny amount of work.
>
> The 1024 is simply about "don't hold a spinlock for too long".

By that, I think you mean list_lru_walk() takes &l->lock for the
duration of the scan? For a long scan, that would effectively block
adding or removing LRU items for quite some time.

So here's a typical excerpt from a common test:

  kworker/u80:7-206  [003]  266.985735: nfsd_file_unhash: ...

  kworker/u80:7-206  [003]  266.987723: nfsd_file_gc_removed: 1309
  entries removed, 2972 remaining

  nfsd-1532  [015]  266.988626: nfsd_file_free: ...

Here, the nfsd_file_unhash record marks the beginning of the LRU
walk, and the nfsd_file_gc_removed record marks the end. The
timestamps indicate the walk took two milliseconds.

The nfsd_file_free record above marks the last disposal activity.
That takes almost a millisecond, but as far as I can tell, it
does not hold any locks for long.

This seems to me like a strong argument for cutting the scan size
down to no more than 32-64 items. Ideally spin locks are supposed
to be held only for simple operations (eg, list_add); this seems a
little outside that window (hence your remark that "a large
nr_to_walk is always a bad idea" -- I now see what you meant).

IMHO the patch description should single out that purpose:

  We want to significantly reduce the maximum amount of time that
  list_lru_walk() blocks foreground LRU activity such as
  list_lru_add_obj().

The cond_resched() in this case might be gravy.

>>> +		ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
>>> +				    &dispose, num_to_scan);
>>> +		trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
>>> +		nfsd_file_dispose_list_delayed(&dispose);
>>
>> I need to go back and review the function traces to see where the
>> delays add up -- to make sure rescheduling here, rather than at some
>> other point, is appropriate. It probably is, but my memory fails me
>> these days.
>
> I would like to see those function traces too.

Here's my reproducer:

1. On the client: Set up xfstests to use NFSv3
2. On the server: "sudo trace-cmd record -e nfsd &"
3. On the client: Run xfstests with "sudo ./check -nfs generic/750"
On Tue, 07 Jan 2025, Chuck Lever wrote:
> On 1/5/25 10:02 PM, NeilBrown wrote:
> [...]
>> The 1024 is simply about "don't hold a spinlock for too long".
>
> By that, I think you mean list_lru_walk() takes &l->lock for the
> duration of the scan? For a long scan, that would effectively block
> adding or removing LRU items for quite some time.
>
> So here's a typical excerpt from a common test:
>
>   kworker/u80:7-206  [003]  266.985735: nfsd_file_unhash: ...
>
>   kworker/u80:7-206  [003]  266.987723: nfsd_file_gc_removed: 1309
>   entries removed, 2972 remaining
>
>   nfsd-1532  [015]  266.988626: nfsd_file_free: ...
>
> Here, the nfsd_file_unhash record marks the beginning of the LRU
> walk, and the nfsd_file_gc_removed record marks the end. The
> timestamps indicate the walk took two milliseconds.
>
> The nfsd_file_free record above marks the last disposal activity.
> That takes almost a millisecond, but as far as I can tell, it
> does not hold any locks for long.
>
> This seems to me like a strong argument for cutting the scan size
> down to no more than 32-64 items. Ideally spin locks are supposed
> to be held only for simple operations (eg, list_add); this seems a
> little outside that window (hence your remark that "a large
> nr_to_walk is always a bad idea" -- I now see what you meant).

This is useful - thanks.

So the problem seems to be that holding the list_lru lock while scanning
the whole list can block all incoming NFSv3 for a noticeable amount of
time - 2 msecs above. That makes perfect sense, and as you say it
suggests that the lack of scheduling points isn't really the issue.

This confirms for me that the list_lru approach is not a good fit for
this problem. I have written a patch which replaces it with a pair of
simple lists as I described in my cover letter. It is a bit large and
needs careful review. In particular I haven't given thought to the
tracepoints which might need to be moved or changed or discarded.

Thanks,
NeilBrown
On 1/7/25 6:01 PM, NeilBrown wrote:
> On Tue, 07 Jan 2025, Chuck Lever wrote:
> [...]
>> This seems to me like a strong argument for cutting the scan size
>> down to no more than 32-64 items. Ideally spin locks are supposed
>> to be held only for simple operations (eg, list_add); this seems a
>> little outside that window (hence your remark that "a large
>> nr_to_walk is always a bad idea" -- I now see what you meant).
>
> This is useful - thanks.
> So the problem seems to be that holding the list_lru lock while scanning
> the whole list can block all incoming NFSv3 for a noticeable amount of
> time - 2 msecs above. That makes perfect sense and as you say it
> suggests that the lack of scheduling points isn't really the issue.
>
> This confirms for me that the list_lru approach is not a good fit for
> this problem. I have written a patch which replaces it with a pair of
> simple lists as I described in my cover letter.

Before proceeding with replacement of the LRU, is there interest in
addressing this issue in LTS kernels as well? If so, then IMO the
better approach would be to take a variant of your narrower fix for
v6.14, and then visit the deeper LRU changes for v6.15ff.

> It is a bit large and needs careful review. In particular I haven't
> given thought to the tracepoints which might need to be moved or
> changed or discarded.

Some of those trace points are specific to the state machine that is
part of the list_lru scan. Those should be removed if we replace that
state machine.
On Thu, 09 Jan 2025, Chuck Lever wrote:
> On 1/7/25 6:01 PM, NeilBrown wrote:
> [...]
>> This confirms for me that the list_lru approach is not a good fit for
>> this problem. I have written a patch which replaces it with a pair of
>> simple lists as I described in my cover letter.
>
> Before proceeding with replacement of the LRU, is there interest in
> addressing this issue in LTS kernels as well? If so, then IMO the
> better approach would be to take a variant of your narrower fix for
> v6.14, and then visit the deeper LRU changes for v6.15ff.

That is probably reasonable. You could take the first patch, drop the
1024 to 64 (or less if testing suggests that is still too high), and
maybe drop the cond_resched().

Thanks,
NeilBrown
On 1/8/25 3:41 PM, NeilBrown wrote:
> On Thu, 09 Jan 2025, Chuck Lever wrote:
> [...]
>> Before proceeding with replacement of the LRU, is there interest in
>> addressing this issue in LTS kernels as well? If so, then IMO the
>> better approach would be to take a variant of your narrower fix for
>> v6.14, and then visit the deeper LRU changes for v6.15ff.
>
> That is probably reasonable. You could take the first patch, drop the
> 1024 to 64 (or less if testing suggests that is still too high), and
> maybe drop the cond_resched().

I will make it so. Enjoy the rest of your leave!
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index a1cdba42c4fa..e99a86798e86 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -543,11 +543,18 @@ nfsd_file_gc(void)
 {
 	LIST_HEAD(dispose);
 	unsigned long ret;
-
-	ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
-			    &dispose, list_lru_count(&nfsd_file_lru));
-	trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
-	nfsd_file_dispose_list_delayed(&dispose);
+	unsigned long cnt = list_lru_count(&nfsd_file_lru);
+
+	while (cnt > 0) {
+		unsigned long num_to_scan = min(cnt, 1024UL);
+		ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
+				    &dispose, num_to_scan);
+		trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
+		nfsd_file_dispose_list_delayed(&dispose);
+		cnt -= num_to_scan;
+		if (cnt)
+			cond_resched();
+	}
 }
 
 static void
Under a high NFSv3 load with lots of different files being accessed, the
list_lru of garbage-collectable files can become quite long.

Asking list_lru_scan() to scan the whole list can result in a long
period during which a spinlock is held and no scheduling is possible.
This is impolite.

So only ask list_lru_scan() to scan 1024 entries at a time, and repeat
if necessary - calling cond_resched() each time.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfsd/filecache.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)