[2/7] mm/page_alloc: Treat RT tasks similar to __GFP_HIGH

Message ID	20230109151631.24923-3-mgorman@techsingularity.net (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Mel Gorman <mgorman@techsingularity.net> To: Linux-MM <linux-mm@kvack.org> Cc: Andrew Morton <akpm@linux-foundation.org>, Michal Hocko <mhocko@suse.com>, NeilBrown <neilb@suse.de>, Thierry Reding <thierry.reding@gmail.com>, Matthew Wilcox <willy@infradead.org>, Vlastimil Babka <vbabka@suse.cz>, LKML <linux-kernel@vger.kernel.org>, Mel Gorman <mgorman@techsingularity.net> Subject: [PATCH 2/7] mm/page_alloc: Treat RT tasks similar to __GFP_HIGH Date: Mon, 9 Jan 2023 15:16:26 +0000 Message-Id: <20230109151631.24923-3-mgorman@techsingularity.net> In-Reply-To: <20230109151631.24923-1-mgorman@techsingularity.net> References: <20230109151631.24923-1-mgorman@techsingularity.net> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	Discard __GFP_ATOMIC \| expand [0/6,v2] Discard __GFP_ATOMIC [1/7] mm/page_alloc: Rename ALLOC_HIGH to ALLOC_MIN_RESERVE [2/7] mm/page_alloc: Treat RT tasks similar to __GFP_HIGH [3/7] mm/page_alloc: Explicitly record high-order atomic allocations in alloc_flags [4/7] mm/page_alloc: Explicitly define what alloc flags deplete min reserves [5/7] mm/page_alloc.c: Allow __GFP_NOFAIL requests deeper access to reserves [6/7] mm/page_alloc: Give GFP_ATOMIC and non-blocking allocations access to reserves [7/7] mm: discard __GFP_ATOMIC

Mel Gorman Jan. 9, 2023, 3:16 p.m. UTC

RT tasks are allowed to dip below the min reserve but ALLOC_HARDER is
typically combined with ALLOC_MIN_RESERVE so RT tasks are a little
unusual. While there is some justification for allowing RT tasks
access to memory reserves, there is a strong chance that a RT task
that is also under memory pressure is at risk of missing deadlines
anyway. Relax how much reserves an RT task can access by treating
it the same as __GFP_HIGH allocations.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Michal Hocko Jan. 11, 2023, 3:27 p.m. UTC | #1

On Mon 09-01-23 15:16:26, Mel Gorman wrote:
> RT tasks are allowed to dip below the min reserve but ALLOC_HARDER is
> typically combined with ALLOC_MIN_RESERVE so RT tasks are a little
> unusual. While there is some justification for allowing RT tasks
> access to memory reserves, there is a strong chance that a RT task
> that is also under memory pressure is at risk of missing deadlines
> anyway. Relax how much reserves an RT task can access by treating
> it the same as __GFP_HIGH allocations.

TBH, I would much rather drop the RT special casing here. As you say if
a RT task need to dip into memory reserves it is either already too late
because the execution is already under RT constrains or this is init
phase where the reclaim is not a problem yet.

I have tried to trace down this special case and only found a patch from
Robert Love from 2003 which says:
: - Let real-time tasks dip further into the reserves than usual in
:   __alloc_pages().  There are a lot of ways to special case this.  This
:   patch just cuts z->pages_low in half, before doing the incremental min
:   thing, for real-time tasks.  I do not do anything in the low memory slow
:   path.  We can be a _lot_ more aggressive if we want.  Right now, we just
:   give real-time tasks a little help.

This doesn't really explain why this is needed.

We are really great at preserving a behavior and cementing it for
future generations. Maybe we should just drop it and see if something
breaks. We would get some reasoning at least finally.

So I am not opposed to the patch per se but I would much rather see this
branch go away. If you want me I can condense the above into a changelog
and send a patch (either on top of this one or replacing it). WDYT?

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 244c1e675dc8..0040b4e00913 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4847,7 +4847,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  		 */
>  		alloc_flags &= ~ALLOC_CPUSET;
>  	} else if (unlikely(rt_task(current)) && in_task())
> -		alloc_flags |= ALLOC_HARDER;
> +		alloc_flags |= ALLOC_MIN_RESERVE;
>  
>  	alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, alloc_flags);
>  
> -- 
> 2.35.3

Mel Gorman Jan. 12, 2023, 9:36 a.m. UTC | #2

On Wed, Jan 11, 2023 at 04:27:29PM +0100, Michal Hocko wrote:
> On Mon 09-01-23 15:16:26, Mel Gorman wrote:
> > RT tasks are allowed to dip below the min reserve but ALLOC_HARDER is
> > typically combined with ALLOC_MIN_RESERVE so RT tasks are a little
> > unusual. While there is some justification for allowing RT tasks
> > access to memory reserves, there is a strong chance that a RT task
> > that is also under memory pressure is at risk of missing deadlines
> > anyway. Relax how much reserves an RT task can access by treating
> > it the same as __GFP_HIGH allocations.
> 
> TBH, I would much rather drop the RT special casing here. As you say if
> a RT task need to dip into memory reserves it is either already too late
> because the execution is already under RT constrains or this is init
> phase where the reclaim is not a problem yet.
> 

I completely agree. I included it in the changelog because I was tempted
to delete it now. I'm wary that the series will result in some
allocation failure bug reports and so played it cautious.

Hard realtime tasks should be locking down resources in advance. Even a
soft-realtime task like audio or video live decoding which cannot jitter
should be allocating both memory and any disk space required up-front
before the recording starts instead of relying on reserves. At best,
reserve access will only delay the problem by a very short interval.

> I have tried to trace down this special case and only found a patch from
> Robert Love from 2003 which says:
> : - Let real-time tasks dip further into the reserves than usual in
> :   __alloc_pages().  There are a lot of ways to special case this.  This
> :   patch just cuts z->pages_low in half, before doing the incremental min
> :   thing, for real-time tasks.  I do not do anything in the low memory slow
> :   path.  We can be a _lot_ more aggressive if we want.  Right now, we just
> :   give real-time tasks a little help.
> 
> This doesn't really explain why this is needed.
> 

No, it does not but I'm not willing to complain either. 20 years ago,
it might have been completely reasonable.

> We are really great at preserving a behavior and cementing it for
> future generations. Maybe we should just drop it and see if something
> breaks. We would get some reasoning at least finally.
> 
> So I am not opposed to the patch per se but I would much rather see this
> branch go away. If you want me I can condense the above into a changelog
> and send a patch (either on top of this one or replacing it). WDYT?
> 

I agree with you but given the risk of bisections hitting this series,
would you be opposed to delaying the removal by 1 kernel release? That
way bisections for failures will hit 6.3 and a single commit or at least
just a report against 6.3. That would mitigate the risk of a full revert
of the series. I can add a note to the changelog mentioning the expected
removal so git blame will also highlight it.

> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> 

Thanks.

Michal Hocko Jan. 12, 2023, 9:47 a.m. UTC | #3

On Thu 12-01-23 09:36:23, Mel Gorman wrote:
[...]
> I agree with you but given the risk of bisections hitting this series,
> would you be opposed to delaying the removal by 1 kernel release? That
> way bisections for failures will hit 6.3 and a single commit or at least
> just a report against 6.3. That would mitigate the risk of a full revert
> of the series. I can add a note to the changelog mentioning the expected
> removal so git blame will also highlight it.

Sure. I will post the removal on top of your series and put myself into
the "wait for regression chair".

Mel Gorman Jan. 12, 2023, 4:42 p.m. UTC | #4

On Thu, Jan 12, 2023 at 10:47:24AM +0100, Michal Hocko wrote:
> On Thu 12-01-23 09:36:23, Mel Gorman wrote:
> [...]
> > I agree with you but given the risk of bisections hitting this series,
> > would you be opposed to delaying the removal by 1 kernel release? That
> > way bisections for failures will hit 6.3 and a single commit or at least
> > just a report against 6.3. That would mitigate the risk of a full revert
> > of the series. I can add a note to the changelog mentioning the expected
> > removal so git blame will also highlight it.
> 
> Sure. I will post the removal on top of your series and put myself into
> the "wait for regression chair".
> 

I'm happy to sit in the same chair and send the patch. If there is an
example of an RT task actually caring about memory reserves, I'd like to
determine if it's a real problem or a badly designed RT application.

David Laight Jan. 13, 2023, 9:04 a.m. UTC | #5

From: Mel Gorman
> Sent: 12 January 2023 09:36
...
> Hard realtime tasks should be locking down resources in advance. Even a
> soft-realtime task like audio or video live decoding which cannot jitter
> should be allocating both memory and any disk space required up-front
> before the recording starts instead of relying on reserves. At best,
> reserve access will only delay the problem by a very short interval.

Or, at least, ensuring the system isn't memory limited.

The biggest effect on RT task latency/jitter (on a normal kernel)
is hardware interrupt and softint code 'stealing' the cpu.
The main 'culprit' being ethernet receive.
 
Unfortunately if you are doing RTP audio (UDP data) you absolutely
need the ethernet receive to run. When the softint code decides
to drop back to a normal priority kernel worker thread packets
get lost.

(I've been running 10000 RTP streams - with 10k receive UDP sockets.)

So I doubt avoiding sleeps in kmalloc() is going to make a
significant difference.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Mel Gorman Jan. 13, 2023, 11:09 a.m. UTC | #6

On Fri, Jan 13, 2023 at 09:04:50AM +0000, David Laight wrote:
> From: Mel Gorman
> > Sent: 12 January 2023 09:36
> ...
> > Hard realtime tasks should be locking down resources in advance. Even a
> > soft-realtime task like audio or video live decoding which cannot jitter
> > should be allocating both memory and any disk space required up-front
> > before the recording starts instead of relying on reserves. At best,
> > reserve access will only delay the problem by a very short interval.
> 
> Or, at least, ensuring the system isn't memory limited.
> 

Added to changelog.

> The biggest effect on RT task latency/jitter (on a normal kernel)
> is hardware interrupt and softint code 'stealing' the cpu.
> The main 'culprit' being ethernet receive.
>  
> Unfortunately if you are doing RTP audio (UDP data) you absolutely
> need the ethernet receive to run. When the softint code decides
> to drop back to a normal priority kernel worker thread packets
> get lost.
> 

Yes, although this is a fundamental problem for background networking
processing in general IIUC that is independent of mm/. ksoftirqd may be
getting stalled behind a higher priority, a realtime task, a task that has
built a credit due to sleep time under GENTLE_FAIR_SLEEPERS relative to
ksoftirqd etc. As a normal low-priority CPU hog may be at the same priority
as ksoftirqd, it can use enough of the scheduler slice for the runqueue to
cause an indirect priority inversion if a high priority task is sleeping
waiting on network traffic it needs ASAP that's stalled on ksoftirqd. I
didn't check for other examples but the only one I'm aware of that boosts
ksoftirq priority is during rcutorture (see rcutorture_booster_init). A
quick glance doesn't show any possibility of boosting ksoftirqd priority
temporarily if dealing with NET_TX_SOFTIRQ although it might be an
interesting idea if it was demonstrated with a realistic (or at least
semi-realistic) test case that priority inversion can occur due to background
RX processing. It's not even PREEMPT_RT-specific, I suspect it's a general
problem. I don't think this series makes a difference because if any of
the critical tasks are depending on the memory reserves, they're already
in serious trouble.

> (I've been running 10000 RTP streams - with 10k receive UDP sockets.)
> 

min_reserve access isn't going to fix any stalls with that volume of
streams :D

> So I doubt avoiding sleeps in kmalloc() is going to make a
> significant difference.
> 

Agreed.

[2/7] mm/page_alloc: Treat RT tasks similar to __GFP_HIGH

Commit Message

Comments

Patch