Message ID | 163090344807.19339.10071205771966144716@noble.neil.brown.name (mailing list archive)
---|---
State | New, archived
Series | SUNRPC: use congestion_wait() in svc_alloc_args()
Hi Neil-

> On Sep 6, 2021, at 12:44 AM, NeilBrown <neilb@suse.de> wrote:
>
> Many places that need to wait before retrying a memory allocation use
> congestion_wait(). xfs_buf_alloc_pages() is a good example which
> follows a similar pattern to that in svc_alloc_args().
>
> It makes sense to do the same thing in svc_alloc_args(); this will allow
> the allocation to be retried sooner if some backing device becomes
> non-congested before the timeout.
>
> Every call to congestion_wait() in the entire kernel passes BLK_RW_ASYNC
> as the first argument, so we should too.
>
> The second argument - an upper limit for waiting - seems fairly
> arbitrary. Many places use "HZ/50" or "HZ/10". As there is no obvious
> choice, it seems reasonable to leave the maximum time unchanged.
>
> If a service using svc_alloc_args() is terminated, it may now have to
> wait up to the full 500ms before termination completes, as
> congestion_wait() cannot be interrupted. I don't believe this will be a
> problem in practice, though it might be justification for using a
> smaller timeout.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>
> I happened to notice this inconsistency between svc_alloc_args() and
> xfs_buf_alloc_pages() despite them doing very similar things, so thought
> I'd send a patch.
>
> NeilBrown

When we first considered the alloc_pages_bulk() API, the SUNRPC patch in
that series replaced this schedule_timeout(). Mel suggested we postpone
that to a separate patch. Now is an ideal time to consider this change
again. I've added the MM folks for expert commentary.

I would rather see a shorter timeout, since that will be less disruptive
in practice, and today's systems shouldn't have to wait that long for
free memory to become available. DEFAULT_IO_TIMEOUT might be a defensible
choice -- it will slow down this loop effectively without adding a
significant delay.
> net/sunrpc/svc_xprt.c | 9 ++++-----
> 1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 796eebf1787d..161433ae0fab 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -10,6 +10,7 @@
>  #include <linux/freezer.h>
>  #include <linux/kthread.h>
>  #include <linux/slab.h>
> +#include <linux/backing-dev.h>
>  #include <net/sock.h>
>  #include <linux/sunrpc/addr.h>
>  #include <linux/sunrpc/stats.h>
> @@ -682,12 +683,10 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>  			/* Made progress, don't sleep yet */
>  			continue;
>
> -		set_current_state(TASK_INTERRUPTIBLE);
> -		if (signalled() || kthread_should_stop()) {
> -			set_current_state(TASK_RUNNING);
> +		if (signalled() || kthread_should_stop())
>  			return -EINTR;
> -		}
> -		schedule_timeout(msecs_to_jiffies(500));
> +
> +		congestion_wait(BLK_RW_ASYNC, msecs_to_jiffies(500));
>  	}
>  	rqstp->rq_page_end = &rqstp->rq_pages[pages];
>  	rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */
> --
> 2.33.0

--
Chuck Lever
On Mon, Sep 06, 2021 at 03:46:34PM +0000, Chuck Lever III wrote:
> Hi Neil-
>
> > On Sep 6, 2021, at 12:44 AM, NeilBrown <neilb@suse.de> wrote:
> >
> > Many places that need to wait before retrying a memory allocation use
> > congestion_wait(). xfs_buf_alloc_pages() is a good example which
> > follows a similar pattern to that in svc_alloc_args().
> >
> > It makes sense to do the same thing in svc_alloc_args(); this will allow
> > the allocation to be retried sooner if some backing device becomes
> > non-congested before the timeout.

It's adorable that you believe this is still true.

https://lore.kernel.org/linux-mm/20191231125908.GD6788@bombadil.infradead.org/
On Tue, 07 Sep 2021, Chuck Lever III wrote:
> Hi Neil-
>
> > On Sep 6, 2021, at 12:44 AM, NeilBrown <neilb@suse.de> wrote:
> >
> > Many places that need to wait before retrying a memory allocation use
> > congestion_wait(). xfs_buf_alloc_pages() is a good example which
> > follows a similar pattern to that in svc_alloc_args().
> >
> > It makes sense to do the same thing in svc_alloc_args(); this will allow
> > the allocation to be retried sooner if some backing device becomes
> > non-congested before the timeout.
> >
> > Every call to congestion_wait() in the entire kernel passes BLK_RW_ASYNC
> > as the first argument, so we should too.
> >
> > The second argument - an upper limit for waiting - seems fairly
> > arbitrary. Many places use "HZ/50" or "HZ/10". As there is no obvious
> > choice, it seems reasonable to leave the maximum time unchanged.
> >
> > If a service using svc_alloc_args() is terminated, it may now have to
> > wait up to the full 500ms before termination completes, as
> > congestion_wait() cannot be interrupted. I don't believe this will be a
> > problem in practice, though it might be justification for using a
> > smaller timeout.
> >
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >
> > I happened to notice this inconsistency between svc_alloc_args() and
> > xfs_buf_alloc_pages() despite them doing very similar things, so thought
> > I'd send a patch.
> >
> > NeilBrown
>
> When we first considered the alloc_pages_bulk() API, the SUNRPC
> patch in that series replaced this schedule_timeout(). Mel
> suggested we postpone that to a separate patch. Now is an ideal
> time to consider this change again. I've added the MM folks for
> expert commentary.
>
> I would rather see a shorter timeout, since that will be less
> disruptive in practice and today's systems shouldn't have to wait
> that long for free memory to become available. DEFAULT_IO_TIMEOUT
> might be a defensible choice -- it will slow down this loop
> effectively without adding a significant delay.

DEFAULT_IO_TIMEOUT is local to f2fs, so it might not be the best choice.

I could be comfortable with any number from 1 to HZ, and have no basis
on which to make a choice - which is why I deliberately avoided making
one. Ideally, the full timeout would (almost) never expire in practice.
Ideally, the interface would not even ask that we supply a timeout.
But we are not currently at the ideal ;-(

Thanks,
NeilBrown
On Mon, Sep 06, 2021 at 09:20:35PM +0100, Matthew Wilcox wrote:
> On Mon, Sep 06, 2021 at 03:46:34PM +0000, Chuck Lever III wrote:
> > Hi Neil-
> >
> > > On Sep 6, 2021, at 12:44 AM, NeilBrown <neilb@suse.de> wrote:
> > >
> > > Many places that need to wait before retrying a memory allocation use
> > > congestion_wait(). xfs_buf_alloc_pages() is a good example which
> > > follows a similar pattern to that in svc_alloc_args().
> > >
> > > It makes sense to do the same thing in svc_alloc_args(); this will allow
> > > the allocation to be retried sooner if some backing device becomes
> > > non-congested before the timeout.
>
> It's adorable that you believe this is still true.
>
> https://lore.kernel.org/linux-mm/20191231125908.GD6788@bombadil.infradead.org/

So, what's the advice for now? Do we add the congestion_wait() call
anyway and assume it'll be fixed to do something less broken in the
future, or just skip it completely?

--b.
On Tue, 07 Sep 2021, Matthew Wilcox wrote:
> On Mon, Sep 06, 2021 at 03:46:34PM +0000, Chuck Lever III wrote:
> > Hi Neil-
> >
> > > On Sep 6, 2021, at 12:44 AM, NeilBrown <neilb@suse.de> wrote:
> > >
> > > Many places that need to wait before retrying a memory allocation use
> > > congestion_wait(). xfs_buf_alloc_pages() is a good example which
> > > follows a similar pattern to that in svc_alloc_args().
> > >
> > > It makes sense to do the same thing in svc_alloc_args(); this will allow
> > > the allocation to be retried sooner if some backing device becomes
> > > non-congested before the timeout.
>
> It's adorable that you believe this is still true.

always happy to be called "adorable" !!

> https://lore.kernel.org/linux-mm/20191231125908.GD6788@bombadil.infradead.org/

Interesting ... a few filesystems call clear_bdi_congested(), but not
enough to make a difference.

At least my patch won't make things worse. And when (not if !!)
congestion_wait() gets fixed, sunrpc will immediately benefit.

I suspect that "congestion_wait()" needs to be replaced by several
different interfaces.

Some callers want to wait until memory might be available, which should
be tracked entirely by MM, not by filesystems.
Other callers are really only interested in their own bdi making
progress, and should be allowed to specify that bdi.

And in general, it seems that these waits aren't really interested in
congestion being eased, but in progress being made.

  reclaim_progress_wait()
  bdi_progress_wait()

??

Even if we just provided

  void reclaim_progress_wait(void)
  {
          schedule_timeout_uninterruptible(HZ/20);
  }

that would be an improvement. It could be called from svc_alloc_args()
and various other places that want to wait before retrying an
allocation, and it would be no worse than current behaviour. Then if
anyone had a clever idea of how to get a notification when progress had
been made, there would be an obvious place to implement that idea.

NeilBrown
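[Editor's note: the helper Neil sketches above would slot into the svc_alloc_arg() retry loop roughly as follows. This is purely illustrative -- reclaim_progress_wait() does not exist in the tree, and the caller-side fragment is an approximation of the loop shown in the patch.]

```c
/* Hypothetical helper, as proposed in the message above.  Today it
 * is just a bounded, uninterruptible sleep, but it gives MM one
 * obvious place to later wake waiters when reclaim actually makes
 * progress. */
static inline void reclaim_progress_wait(void)
{
	schedule_timeout_uninterruptible(HZ / 20);
}

/* The tail of svc_alloc_arg()'s retry loop would then become roughly: */
		if (signalled() || kthread_should_stop())
			return -EINTR;
		reclaim_progress_wait();	/* instead of congestion_wait() */
```

Because the helper is uninterruptible, the signalled()/kthread_should_stop() check still has to run before each wait, just as in the posted patch.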
On Tue, 07 Sep 2021, NeilBrown wrote:
> Even if we just provided
>
>   void reclaim_progress_wait(void)
>   {
>           schedule_timeout_uninterruptible(HZ/20);
>   }
>
> that would be an improvement.

I contemplated providing a patch, but the more I thought about it, the
less sure I was.

When does a single-page GFP_KERNEL allocation fail? Ever?

I know that if I add __GFP_NOFAIL then it won't fail, and that is
preferred to looping.
I know that if I add __GFP_RETRY_MAYFAIL (or others) then it might fail.
But what are the semantics for a plain GFP_KERNEL??

I recall a suggestion once that it would only fail if the process was
being killed by the oom killer. That seems reasonable and would suggest
that retrying is really bad. Is that true?

For svc_alloc_args(), it might be better to fail and have the calling
server thread exit. This would need to be combined with dynamic
thread-count management so that when a request arrived, a new thread
might be started.

So maybe we really don't want reclaim_progress_wait(), and all current
callers of congestion_wait() which are just waiting for an allocation to
succeed should either be changed to use __GFP_NOFAIL, or to handle
failure. For the ext4_truncate() case that would be easier if there were
a PF_MEMALLOC_NOFAIL flag, though maybe passing down __GFP_NOFAIL isn't
too hard.

(this is why we all work around problems in the platform rather than
fixing them. Fixing them is just too hard...)

NeilBrown
On Tue, Sep 07, 2021 at 08:22:39AM +1000, NeilBrown wrote:
> On Tue, 07 Sep 2021, Matthew Wilcox wrote:
> > On Mon, Sep 06, 2021 at 03:46:34PM +0000, Chuck Lever III wrote:
> > > Hi Neil-
> > >
> > > > On Sep 6, 2021, at 12:44 AM, NeilBrown <neilb@suse.de> wrote:
> > > >
> > > > Many places that need to wait before retrying a memory allocation use
> > > > congestion_wait(). xfs_buf_alloc_pages() is a good example which
> > > > follows a similar pattern to that in svc_alloc_args().
> > > >
> > > > It makes sense to do the same thing in svc_alloc_args(); this will allow
> > > > the allocation to be retried sooner if some backing device becomes
> > > > non-congested before the timeout.
> >
> > It's adorable that you believe this is still true.
>
> always happy to be called "adorable" !!
>
> > https://lore.kernel.org/linux-mm/20191231125908.GD6788@bombadil.infradead.org/
>
> Interesting ... a few filesystems call clear_bdi_congested(), but not
> enough to make a difference.
>
> At least my patch won't make things worse. And when (not if !!)
> congestion_wait() gets fixed, sunrpc will immediately benefit.
>
> I suspect that "congestion_wait()" needs to be replaced by several
> different interfaces.
>
> Some callers want to wait until memory might be available, which should
> be tracked entirely by MM, not by filesystems.
> Other callers are really only interested in their own bdi making
> progress, and should be allowed to specify that bdi.

For the available-memory side, I believe the interface would involve a
waitqueue combined with something like struct capture_control, except it
has a waitqueue, a zone, an order, a struct page pointer and a list_head
that is declared on the stack. Reclaimers for that zone would check if
there are any such waiters and, if so, add a page that has just been
reclaimed and wake the waiter. That would then be event driven rather
than time driven, which is usually what mm is meant to do.

Despite congestion_wait being known to be broken for a long time, I
don't recall anyone trying to actually fix it.

> And in general, it seems that these waits aren't really interested in
> congestion being eased, but in progress being made.
>
>   reclaim_progress_wait()
>   bdi_progress_wait()
>
> ??
>
> Even if we just provided
>
>   void reclaim_progress_wait(void)
>   {
>           schedule_timeout_uninterruptible(HZ/20);
>   }

reclaim_progress_wait at least would clarify that it's waiting on a
page, but ultimately it shouldn't be time based.
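[Editor's note: one way to read Mel's description above as code. Nothing here exists in the kernel -- the struct name and field layout are invented for illustration, following the fields he lists (waitqueue, zone, order, page pointer, on-stack list_head); struct capture_control is the analogy he gives.]

```c
/* Hypothetical on-stack waiter, modelled on the description above.
 * A reclaimer that frees a page of the right zone/order would take
 * the waiter off the per-zone list, hand the page over directly,
 * and wake the sleeper -- event driven rather than time driven. */
struct reclaim_waiter {
	struct list_head	list;	/* linked on a per-zone waiter list */
	struct zone		*zone;	/* zone being allocated from */
	unsigned int		order;	/* allocation order wanted */
	struct page		*page;	/* filled in by the reclaimer */
	wait_queue_head_t	wq;	/* where the allocator sleeps */
};
```

The key property is that the caller never picks a timeout: it sleeps until reclaim actually produces a usable page, which is the event-driven behaviour Mel contrasts with the time-based congestion_wait()/schedule_timeout() pattern.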
> On Sep 6, 2021, at 8:41 PM, NeilBrown <neilb@suse.de> wrote:
>
> When does a single-page GFP_KERNEL allocation fail? Ever?
>
> I know that if I add __GFP_NOFAIL then it won't fail, and that is
> preferred to looping.
> I know that if I add __GFP_RETRY_MAYFAIL (or others) then it might fail.
> But what are the semantics for a plain GFP_KERNEL??
>
> I recall a suggestion once that it would only fail if the process was
> being killed by the oom killer. That seems reasonable and would suggest
> that retrying is really bad. Is that true?
>
> For svc_alloc_args(), it might be better to fail and have the calling
> server thread exit. This would need to be combined with dynamic
> thread-count management so that when a request arrived, a new thread
> might be started.

I don't immediately see a benefit to killing server threads during
periods of memory exhaustion, but sometimes I lack imagination.

> So maybe we really don't want reclaim_progress_wait(), and all current
> callers of congestion_wait() which are just waiting for an allocation to
> succeed should either be changed to use __GFP_NOFAIL, or to handle
> failure.

I had completely forgotten about GFP_NOFAIL. That seems like the
preferred answer, as it avoids an arbitrary time-based wait for a memory
resource. (And maybe svc_rqst_alloc() should also get the NOFAIL
treatment?)

The question in my mind is how the new alloc_pages_bulk() will behave
when passed NOFAIL. I would expect that NOFAIL would simply guarantee
that at least one page is allocated on every call.

--
Chuck Lever
On Tue, Sep 07, 2021 at 02:53:48PM +0000, Chuck Lever III wrote:
> > On Sep 6, 2021, at 8:41 PM, NeilBrown <neilb@suse.de> wrote:
> >
> > When does a single-page GFP_KERNEL allocation fail? Ever?
> >
> > I know that if I add __GFP_NOFAIL then it won't fail, and that is
> > preferred to looping.
> > I know that if I add __GFP_RETRY_MAYFAIL (or others) then it might fail.
> > But what are the semantics for a plain GFP_KERNEL??
> >
> > I recall a suggestion once that it would only fail if the process was
> > being killed by the oom killer. That seems reasonable and would suggest
> > that retrying is really bad. Is that true?
> >
> > For svc_alloc_args(), it might be better to fail and have the calling
> > server thread exit. This would need to be combined with dynamic
> > thread-count management so that when a request arrived, a new thread
> > might be started.
>
> I don't immediately see a benefit to killing server threads
> during periods of memory exhaustion, but sometimes I lack
> imagination.

Give up parallelism in return for at least hope of forward progress?
(Which should be possible as long as there's at least one server
thread.)

--b.
On Tue, Sep 07, 2021 at 02:53:48PM +0000, Chuck Lever III wrote:
> > So maybe we really don't want reclaim_progress_wait(), and all current
> > callers of congestion_wait() which are just waiting for an allocation to
> > succeed should either be changed to use __GFP_NOFAIL, or to handle
> > failure.
>
> I had completely forgotten about GFP_NOFAIL. That seems like the
> preferred answer, as it avoids an arbitrary time-based wait for a
> memory resource. (And maybe svc_rqst_alloc() should also get the
> NOFAIL treatment?)

Don't use NOFAIL. It's intended for allocation requests that, if they
fail, leave the kernel with no sane way to recover. Among other things,
it can access emergency memory reserves, and if the memory is not
returned quickly it may cause premature OOM or even a livelock.

> The question in my mind is how the new alloc_pages_bulk() will behave
> when passed NOFAIL. I would expect that NOFAIL would simply guarantee
> that at least one page is allocated on every call.

alloc_pages_bulk ignores GFP_NOFAIL. If called repeatedly, it might pass
the GFP_NOFAIL flag to allocate at least one page, but you'll be
depleting reserves to do it from a call site that has other options for
recovery. The docs say it better:

 * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 * cannot handle allocation failures. The allocation could block
 * indefinitely but will never return with failure. Testing for
 * failure is pointless.
 * New users should be evaluated carefully (and the flag should be
 * used only when there is no reasonable failure policy) but it is
 * definitely preferable to use the flag rather than opencode endless
 * loop around allocator.
 * Using this flag for costly allocations is _highly_ discouraged.
> On Sep 7, 2021, at 11:41 AM, Mel Gorman <mgorman@suse.com> wrote:
>
> On Tue, Sep 07, 2021 at 02:53:48PM +0000, Chuck Lever III wrote:
> >> So maybe we really don't want reclaim_progress_wait(), and all current
> >> callers of congestion_wait() which are just waiting for an allocation to
> >> succeed should either be changed to use __GFP_NOFAIL, or to handle
> >> failure.
> >
> >> I had completely forgotten about GFP_NOFAIL. That seems like the
> >> preferred answer, as it avoids an arbitrary time-based wait for
> >> a memory resource. (And maybe svc_rqst_alloc() should also get
> >> the NOFAIL treatment?)
>
> Don't use NOFAIL. It's intended for allocation requests that, if they
> fail, leave the kernel with no sane way to recover. Among other things,
> it can access emergency memory reserves, and if the memory is not
> returned quickly it may cause premature OOM or even a livelock.
>
> >> The question in my mind is how the new alloc_pages_bulk() will
> >> behave when passed NOFAIL. I would expect that NOFAIL would simply
> >> guarantee that at least one page is allocated on every call.
>
> alloc_pages_bulk ignores GFP_NOFAIL. If called repeatedly, it might pass
> the GFP_NOFAIL flag to allocate at least one page, but you'll be
> depleting reserves to do it from a call site that has other options for
> recovery.

True, an allocation failure in svc_alloc_arg() won't cause kernel
instability. But AFAICS svc_alloc_arg() can't do anything but call again
if an allocation request fails to make forward progress. There really
aren't other options for recovery.

On its face, svc_alloc_arg() seems to match the "use the flag rather
than opencode [an] endless loop around [the] allocator" comment below.

> The docs say it better
>
> * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> * cannot handle allocation failures. The allocation could block
> * indefinitely but will never return with failure. Testing for
> * failure is pointless.
> * New users should be evaluated carefully (and the flag should be
> * used only when there is no reasonable failure policy) but it is
> * definitely preferable to use the flag rather than opencode endless
> * loop around allocator.

As an aside, there's nothing here that suggests that using NOFAIL would
be risky. And the "no reasonable failure policy" phrasing is a
parenthetical, but it seems to be the most pertinent part. IMHO the
comment should be updated to match the current implementation of NOFAIL.

> * Using this flag for costly allocations is _highly_ discouraged

I don't have a preference about using NOFAIL. I think it's a good idea
to base the wait for resources on the availability of those resources
rather than on an arbitrary time interval, so I'd like to see the
schedule_timeout() in svc_alloc_arg() replaced with something smarter.

--
Chuck Lever
On Wed, 08 Sep 2021, Chuck Lever III wrote:
>
> > On Sep 6, 2021, at 8:41 PM, NeilBrown <neilb@suse.de> wrote:
> >
> > When does a single-page GFP_KERNEL allocation fail? Ever?
> >
> > I know that if I add __GFP_NOFAIL then it won't fail, and that is
> > preferred to looping.
> > I know that if I add __GFP_RETRY_MAYFAIL (or others) then it might fail.
> > But what are the semantics for a plain GFP_KERNEL??
> >
> > I recall a suggestion once that it would only fail if the process was
> > being killed by the oom killer. That seems reasonable and would suggest
> > that retrying is really bad. Is that true?
> >
> > For svc_alloc_args(), it might be better to fail and have the calling
> > server thread exit. This would need to be combined with dynamic
> > thread-count management so that when a request arrived, a new thread
> > might be started.
>
> I don't immediately see a benefit to killing server threads
> during periods of memory exhaustion, but sometimes I lack
> imagination.

Dining philosophers problem. If you cannot get all the resources you
need, a polite response is to release *all* the resources you have
before trying again.

> > So maybe we really don't want reclaim_progress_wait(), and all current
> > callers of congestion_wait() which are just waiting for an allocation to
> > succeed should either be changed to use __GFP_NOFAIL, or to handle
> > failure.
>
> I had completely forgotten about GFP_NOFAIL. That seems like the
> preferred answer, as it avoids an arbitrary time-based wait for
> a memory resource. (And maybe svc_rqst_alloc() should also get
> the NOFAIL treatment?)

Apart from what others have said, GFP_NOFAIL is not appropriate for
svc_rqst_alloc() because it cannot honour kthread_stop().

NeilBrown
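[Editor's note: for comparison, the __GFP_NOFAIL variant Chuck floats would look something like the fragment below. This is a sketch only, not the actual svc_alloc_arg() code: as Mel points out, alloc_pages_bulk() ignores the flag, so it could only apply to a hypothetical page-at-a-time fallback, and as Neil points out the thread could no longer honour signals or kthread_stop() while blocked.]

```c
/* Hypothetical single-page fallback path using __GFP_NOFAIL.
 * Caveats from the thread: the allocation may dip into emergency
 * reserves, and the server thread cannot notice signalled() or
 * kthread_should_stop() while it blocks inside alloc_page(). */
for (i = 0; i < pages; i++) {
	if (rqstp->rq_pages[i])
		continue;		/* already filled by the bulk call */
	rqstp->rq_pages[i] = alloc_page(GFP_KERNEL | __GFP_NOFAIL);
}
```

The trade-off the thread circles around is visible here: the open-coded retry loop can check for termination between attempts, while the NOFAIL version delegates the endless retry to the allocator and gives up that check.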
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 796eebf1787d..161433ae0fab 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -10,6 +10,7 @@
 #include <linux/freezer.h>
 #include <linux/kthread.h>
 #include <linux/slab.h>
+#include <linux/backing-dev.h>
 #include <net/sock.h>
 #include <linux/sunrpc/addr.h>
 #include <linux/sunrpc/stats.h>
@@ -682,12 +683,10 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
 			/* Made progress, don't sleep yet */
 			continue;

-		set_current_state(TASK_INTERRUPTIBLE);
-		if (signalled() || kthread_should_stop()) {
-			set_current_state(TASK_RUNNING);
+		if (signalled() || kthread_should_stop())
 			return -EINTR;
-		}
-		schedule_timeout(msecs_to_jiffies(500));
+
+		congestion_wait(BLK_RW_ASYNC, msecs_to_jiffies(500));
 	}
 	rqstp->rq_page_end = &rqstp->rq_pages[pages];
 	rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */
Many places that need to wait before retrying a memory allocation use
congestion_wait(). xfs_buf_alloc_pages() is a good example which follows
a similar pattern to that in svc_alloc_args().

It makes sense to do the same thing in svc_alloc_args(); this will allow
the allocation to be retried sooner if some backing device becomes
non-congested before the timeout.

Every call to congestion_wait() in the entire kernel passes BLK_RW_ASYNC
as the first argument, so we should too.

The second argument - an upper limit for waiting - seems fairly
arbitrary. Many places use "HZ/50" or "HZ/10". As there is no obvious
choice, it seems reasonable to leave the maximum time unchanged.

If a service using svc_alloc_args() is terminated, it may now have to
wait up to the full 500ms before termination completes, as
congestion_wait() cannot be interrupted. I don't believe this will be a
problem in practice, though it might be justification for using a
smaller timeout.

Signed-off-by: NeilBrown <neilb@suse.de>
---

I happened to notice this inconsistency between svc_alloc_args() and
xfs_buf_alloc_pages() despite them doing very similar things, so thought
I'd send a patch.

NeilBrown

 net/sunrpc/svc_xprt.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)