diff mbox series

[bpf-next,v4,1/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation

Message ID 20250114021922.92609-2-alexei.starovoitov@gmail.com (mailing list archive)
State Superseded
Delegated to: BPF
Headers show
Series bpf, mm: Introduce try_alloc_pages() | expand

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 557 this patch: 557
netdev/build_tools success Errors and warnings before: 0 (+0) this patch: 0 (+0)
netdev/cc_maintainers warning 2 maintainers not CCed: clrkwllms@kernel.org linux-rt-devel@lists.linux.dev
netdev/build_clang success Errors and warnings before: 22692 this patch: 22692
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 16001 this patch: 16002
netdev/checkpatch warning WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 96 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 5 this patch: 5
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Unittests
bpf/vmtest-bpf-next-VM_Test-3 success Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-5 success Logs for aarch64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-11 success Logs for aarch64-gcc / veristat-meta
bpf/vmtest-bpf-next-VM_Test-10 success Logs for aarch64-gcc / veristat-kernel
bpf/vmtest-bpf-next-VM_Test-13 success Logs for s390x-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-4 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for s390x-gcc / veristat-meta
bpf/vmtest-bpf-next-VM_Test-17 success Logs for s390x-gcc / veristat-kernel
bpf/vmtest-bpf-next-VM_Test-19 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-20 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-7 success Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-30 success Logs for x86_64-llvm-17 / build / build for x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-31 success Logs for x86_64-llvm-17 / build-release / build for x86_64 with llvm-17-O2
bpf/vmtest-bpf-next-VM_Test-8 success Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-36 success Logs for x86_64-llvm-17 / veristat-kernel
bpf/vmtest-bpf-next-VM_Test-37 success Logs for x86_64-llvm-17 / veristat-meta
bpf/vmtest-bpf-next-VM_Test-38 success Logs for x86_64-llvm-18 / build / build for x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-39 success Logs for x86_64-llvm-18 / build-release / build for x86_64 with llvm-18-O2
bpf/vmtest-bpf-next-VM_Test-45 success Logs for x86_64-llvm-18 / veristat-kernel
bpf/vmtest-bpf-next-VM_Test-46 success Logs for x86_64-llvm-18 / veristat-meta
bpf/vmtest-bpf-next-VM_Test-27 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for x86_64-gcc / veristat-kernel / x86_64-gcc veristat_kernel
bpf/vmtest-bpf-next-VM_Test-29 success Logs for x86_64-gcc / veristat-meta / x86_64-gcc veristat_meta
bpf/vmtest-bpf-next-VM_Test-15 success Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-32 success Logs for x86_64-llvm-17 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-33 success Logs for x86_64-llvm-17 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-34 success Logs for x86_64-llvm-17 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-35 success Logs for x86_64-llvm-17 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-40 success Logs for x86_64-llvm-18 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-41 success Logs for x86_64-llvm-18 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-44 success Logs for x86_64-llvm-18 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-42 success Logs for x86_64-llvm-18 / test (test_progs_cpuv4, false, 360) / test_progs_cpuv4 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-43 success Logs for x86_64-llvm-18 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-18

Commit Message

Alexei Starovoitov Jan. 14, 2025, 2:19 a.m. UTC
From: Alexei Starovoitov <ast@kernel.org>

Tracing BPF programs execute from tracepoints and kprobes where
running context is unknown, but they need to request additional
memory. The prior workarounds were using pre-allocated memory and
BPF specific freelists to satisfy such allocation requests.
Instead, introduce gfpflags_allow_spinning() condition that signals
to the allocator that running context is unknown.
Then rely on percpu free list of pages to allocate a page.
The rmqueue_pcplist() should be able to pop the page from.
If it fails (due to IRQ re-entrancy or list being empty) then
try_alloc_pages() attempts to spin_trylock zone->lock
and refill percpu freelist as normal.
BPF program may execute with IRQs disabled and zone->lock is
sleeping in RT, so trylock is the only option. In theory we can
introduce percpu reentrance counter and increment it every time
spin_lock_irqsave(&zone->lock, flags) is used, but we cannot rely
on it. Even if this cpu is not in page_alloc path the
spin_lock_irqsave() is not safe, since BPF prog might be called
from tracepoint where preemption is disabled. So trylock only.

Note, free_page and memcg are not taught about gfpflags_allow_spinning()
condition. The support comes in the next patches.

This is a first step towards supporting BPF requirements in SLUB
and getting rid of bpf_mem_alloc.
That goal was discussed at LSFMM: https://lwn.net/Articles/974138/

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/gfp.h | 22 ++++++++++++
 mm/internal.h       |  1 +
 mm/page_alloc.c     | 85 +++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 105 insertions(+), 3 deletions(-)

Comments

Peter Zijlstra Jan. 14, 2025, 9:53 a.m. UTC | #1
On Mon, Jan 13, 2025 at 06:19:17PM -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Tracing BPF programs execute from tracepoints and kprobes where
> running context is unknown, but they need to request additional
> memory.

> The prior workarounds were using pre-allocated memory and
> BPF specific freelists to satisfy such allocation requests.
> Instead, introduce gfpflags_allow_spinning() condition that signals
> to the allocator that running context is unknown.
> Then rely on percpu free list of pages to allocate a page.
> The rmqueue_pcplist() should be able to pop the page from.
> If it fails (due to IRQ re-entrancy or list being empty) then
> try_alloc_pages() attempts to spin_trylock zone->lock
> and refill percpu freelist as normal.

> BPF program may execute with IRQs disabled and zone->lock is
> sleeping in RT, so trylock is the only option. 

how is spin_trylock() from IRQ context not utterly broken in RT?

It can lead to try to priority boost the idle thread, among other crazy
things.
Michal Hocko Jan. 14, 2025, 10:19 a.m. UTC | #2
On Tue 14-01-25 10:53:55, Peter Zijlstra wrote:
> On Mon, Jan 13, 2025 at 06:19:17PM -0800, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> > 
> > Tracing BPF programs execute from tracepoints and kprobes where
> > running context is unknown, but they need to request additional
> > memory.
> 
> > The prior workarounds were using pre-allocated memory and
> > BPF specific freelists to satisfy such allocation requests.
> > Instead, introduce gfpflags_allow_spinning() condition that signals
> > to the allocator that running context is unknown.
> > Then rely on percpu free list of pages to allocate a page.
> > The rmqueue_pcplist() should be able to pop the page from.
> > If it fails (due to IRQ re-entrancy or list being empty) then
> > try_alloc_pages() attempts to spin_trylock zone->lock
> > and refill percpu freelist as normal.
> 
> > BPF program may execute with IRQs disabled and zone->lock is
> > sleeping in RT, so trylock is the only option. 
> 
> how is spin_trylock() from IRQ context not utterly broken in RT?

+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
+		return NULL;

Deals with that, right?
Michal Hocko Jan. 14, 2025, 10:31 a.m. UTC | #3
On Mon 13-01-25 18:19:17, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Tracing BPF programs execute from tracepoints and kprobes where
> running context is unknown, but they need to request additional
> memory. The prior workarounds were using pre-allocated memory and
> BPF specific freelists to satisfy such allocation requests.
> Instead, introduce gfpflags_allow_spinning() condition that signals
> to the allocator that running context is unknown.
> Then rely on percpu free list of pages to allocate a page.
> The rmqueue_pcplist() should be able to pop the page from.
> If it fails (due to IRQ re-entrancy or list being empty) then
> try_alloc_pages() attempts to spin_trylock zone->lock
> and refill percpu freelist as normal.
> BPF program may execute with IRQs disabled and zone->lock is
> sleeping in RT, so trylock is the only option. In theory we can
> introduce percpu reentrance counter and increment it every time
> spin_lock_irqsave(&zone->lock, flags) is used, but we cannot rely
> on it. Even if this cpu is not in page_alloc path the
> spin_lock_irqsave() is not safe, since BPF prog might be called
> from tracepoint where preemption is disabled. So trylock only.
> 
> Note, free_page and memcg are not taught about gfpflags_allow_spinning()
> condition. The support comes in the next patches.
> 
> This is a first step towards supporting BPF requirements in SLUB
> and getting rid of bpf_mem_alloc.
> That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

LGTM, I am not entirely clear on kmsan_alloc_page part though.
As long as that part is correct you can add
Acked-by: Michal Hocko <mhocko@suse.com>

Other than that try_alloc_pages_noprof begs some user documentation.

/**
 * try_alloc_pages_noprof - opportunistic reentrant allocation from any context
 * @nid - node to allocate from
 * @order - allocation order size
 *
 * Allocates pages of a given order from the given node. This is safe to
 * call from any context (from atomic, NMI but also reentrant 
 * allocator -> tracepoint -> try_alloc_pages_noprof).
 * Allocation is best effort and to be expected to fail easily so nobody should
 * rely on the succeess. Failures are not reported via warn_alloc().
 *
 * Return: allocated page or NULL on failure.
 */
> +struct page *try_alloc_pages_noprof(int nid, unsigned int order)
> +{
> +	/*
> +	 * Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed.
> +	 * Do not specify __GFP_KSWAPD_RECLAIM either, since wake up of kswapd
> +	 * is not safe in arbitrary context.
> +	 *
> +	 * These two are the conditions for gfpflags_allow_spinning() being true.
> +	 *
> +	 * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
> +	 * to warn. Also warn would trigger printk() which is unsafe from
> +	 * various contexts. We cannot use printk_deferred_enter() to mitigate,
> +	 * since the running context is unknown.
> +	 *
> +	 * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
> +	 * is safe in any context. Also zeroing the page is mandatory for
> +	 * BPF use cases.
> +	 *
> +	 * Though __GFP_NOMEMALLOC is not checked in the code path below,
> +	 * specify it here to highlight that try_alloc_pages()
> +	 * doesn't want to deplete reserves.
> +	 */
> +	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
> +	unsigned int alloc_flags = ALLOC_TRYLOCK;
> +	struct alloc_context ac = { };
> +	struct page *page;
> +
> +	/*
> +	 * In RT spin_trylock() may call raw_spin_lock() which is unsafe in NMI.
> +	 * If spin_trylock() is called from hard IRQ the current task may be
> +	 * waiting for one rt_spin_lock, but rt_spin_trylock() will mark the
> +	 * task as the owner of another rt_spin_lock which will confuse PI
> +	 * logic, so return immediately if called form hard IRQ or NMI.
> +	 *
> +	 * Note, irqs_disabled() case is ok. This function can be called
> +	 * from raw_spin_lock_irqsave region.
> +	 */
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> +		return NULL;
> +	if (!pcp_allowed_order(order))
> +		return NULL;
> +
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +	/* Bailout, since try_to_accept_memory_one() needs to take a lock */
> +	if (has_unaccepted_memory())
> +		return NULL;
> +#endif
> +	/* Bailout, since _deferred_grow_zone() needs to take a lock */
> +	if (deferred_pages_enabled())
> +		return NULL;
> +
> +	if (nid == NUMA_NO_NODE)
> +		nid = numa_node_id();
> +
> +	prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
> +			    &alloc_gfp, &alloc_flags);
> +
> +	/*
> +	 * Best effort allocation from percpu free list.
> +	 * If it's empty attempt to spin_trylock zone->lock.
> +	 */
> +	page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
> +
> +	/* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
> +
> +	trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
> +	kmsan_alloc_page(page, order, alloc_gfp);
> +	return page;
> +}
> -- 
> 2.43.5
Peter Zijlstra Jan. 14, 2025, 10:39 a.m. UTC | #4
On Tue, Jan 14, 2025 at 11:19:41AM +0100, Michal Hocko wrote:
> On Tue 14-01-25 10:53:55, Peter Zijlstra wrote:
> > On Mon, Jan 13, 2025 at 06:19:17PM -0800, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > > 
> > > Tracing BPF programs execute from tracepoints and kprobes where
> > > running context is unknown, but they need to request additional
> > > memory.
> > 
> > > The prior workarounds were using pre-allocated memory and
> > > BPF specific freelists to satisfy such allocation requests.
> > > Instead, introduce gfpflags_allow_spinning() condition that signals
> > > to the allocator that running context is unknown.
> > > Then rely on percpu free list of pages to allocate a page.
> > > The rmqueue_pcplist() should be able to pop the page from.
> > > If it fails (due to IRQ re-entrancy or list being empty) then
> > > try_alloc_pages() attempts to spin_trylock zone->lock
> > > and refill percpu freelist as normal.
> > 
> > > BPF program may execute with IRQs disabled and zone->lock is
> > > sleeping in RT, so trylock is the only option. 
> > 
> > how is spin_trylock() from IRQ context not utterly broken in RT?
> 
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> +		return NULL;
> 
> Deals with that, right?

Changelog didn't really mention that, did it? -- it seems to imply quite
the opposite :/

But maybe, I suppose any BPF program needs to expect failure due to this
being trylock. I just worry some programs will malfunction due to never
succeeding -- and RT getting blamed for this.

Maybe I worry too much.
Michal Hocko Jan. 14, 2025, 10:43 a.m. UTC | #5
On Tue 14-01-25 11:39:46, Peter Zijlstra wrote:
> On Tue, Jan 14, 2025 at 11:19:41AM +0100, Michal Hocko wrote:
> > On Tue 14-01-25 10:53:55, Peter Zijlstra wrote:
> > > On Mon, Jan 13, 2025 at 06:19:17PM -0800, Alexei Starovoitov wrote:
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > > 
> > > > Tracing BPF programs execute from tracepoints and kprobes where
> > > > running context is unknown, but they need to request additional
> > > > memory.
> > > 
> > > > The prior workarounds were using pre-allocated memory and
> > > > BPF specific freelists to satisfy such allocation requests.
> > > > Instead, introduce gfpflags_allow_spinning() condition that signals
> > > > to the allocator that running context is unknown.
> > > > Then rely on percpu free list of pages to allocate a page.
> > > > The rmqueue_pcplist() should be able to pop the page from.
> > > > If it fails (due to IRQ re-entrancy or list being empty) then
> > > > try_alloc_pages() attempts to spin_trylock zone->lock
> > > > and refill percpu freelist as normal.
> > > 
> > > > BPF program may execute with IRQs disabled and zone->lock is
> > > > sleeping in RT, so trylock is the only option. 
> > > 
> > > how is spin_trylock() from IRQ context not utterly broken in RT?
> > 
> > +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> > +		return NULL;
> > 
> > Deals with that, right?
> 
> Changelog didn't really mention that, did it? -- it seems to imply quite
> the opposite :/

yes, one has to look into the implementation see all the constrains and
the changelog could/should be improved in that regards.

> But maybe, I suppose any BPF program needs to expect failure due to this
> being trylock. I just worry some programs will malfunction due to never
> succeeding -- and RT getting blamed for this.

This is a good question. The current implementation has fewer constrains
and there are no data points about the success rate with the new
implementation. But to be completely honest I am not really sure how
much BPF is used on PREEMPT_RT systems and with RT workloads so I am not
sure how much of a practical problem that is.

> 
> Maybe I worry too much.
Alexei Starovoitov Jan. 14, 2025, 6:29 p.m. UTC | #6
On Tue, Jan 14, 2025 at 2:39 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jan 14, 2025 at 11:19:41AM +0100, Michal Hocko wrote:
> > On Tue 14-01-25 10:53:55, Peter Zijlstra wrote:
> > > On Mon, Jan 13, 2025 at 06:19:17PM -0800, Alexei Starovoitov wrote:
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > >
> > > > Tracing BPF programs execute from tracepoints and kprobes where
> > > > running context is unknown, but they need to request additional
> > > > memory.
> > >
> > > > The prior workarounds were using pre-allocated memory and
> > > > BPF specific freelists to satisfy such allocation requests.
> > > > Instead, introduce gfpflags_allow_spinning() condition that signals
> > > > to the allocator that running context is unknown.
> > > > Then rely on percpu free list of pages to allocate a page.
> > > > The rmqueue_pcplist() should be able to pop the page from.
> > > > If it fails (due to IRQ re-entrancy or list being empty) then
> > > > try_alloc_pages() attempts to spin_trylock zone->lock
> > > > and refill percpu freelist as normal.
> > >
> > > > BPF program may execute with IRQs disabled and zone->lock is
> > > > sleeping in RT, so trylock is the only option.
> > >
> > > how is spin_trylock() from IRQ context not utterly broken in RT?
> >
> > +     if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> > +             return NULL;
> >
> > Deals with that, right?
>
> Changelog didn't really mention that, did it? -- it seems to imply quite
> the opposite :/

Hmm. Until you said it I didn't read it as "imply the opposite" :(

The cover letter is pretty clear...
"
- Since spin_trylock() is not safe in RT from hard IRQ and NMI
  disable such usage in lock_trylock and in try_alloc_pages().
"

and the patch 2 commit log is clear too...

"
Since spin_trylock() cannot be used in RT from hard IRQ or NMI
it uses lockless link list...
"

and further in patch 3 commit log...

"
Use spin_trylock in PREEMPT_RT when not in hard IRQ and not in NMI
and fail instantly otherwise, since spin_trylock is not safe from IRQ
due to PI issues.
"

I guess I can reword this particular sentence in patch 1 commit log,
but before jumping to an incorrect conclusion please read the
whole set.

> But maybe, I suppose any BPF program needs to expect failure due to this
> being trylock. I just worry some programs will malfunction due to never
> succeeding -- and RT getting blamed for this.
>
> Maybe I worry too much.

Humans will find a way to blame BPF and/or RT for all of their problems
anyway. Just days ago BPF was blamed in RT for causing IPIs during JIT.
Valentin's patches are going to address that but ain't noone has time
to explain that continuously.

Seriously, though, the number of things that still run in hard irq context
in RT is so small that if some tracing BPF prog is attached there
it should be using prealloc mode. Full prealloc is still
the default for bpf hash map.
Steven Rostedt Jan. 14, 2025, 6:34 p.m. UTC | #7
On Tue, 14 Jan 2025 10:29:04 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> Seriously, though, the number of things that still run in hard irq context
> in RT is so small that if some tracing BPF prog is attached there
> it should be using prealloc mode. Full prealloc is still
> the default for bpf hash map.

The one thing to watch out for is hrtimer trace events. They will be called
in hard irq context even in RT. If a BPF program is attached to one of
them, then that could be an issue.

-- Steve
Alexei Starovoitov Jan. 15, 2025, 1:23 a.m. UTC | #8
On Tue, Jan 14, 2025 at 2:31 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 13-01-25 18:19:17, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Tracing BPF programs execute from tracepoints and kprobes where
> > running context is unknown, but they need to request additional
> > memory. The prior workarounds were using pre-allocated memory and
> > BPF specific freelists to satisfy such allocation requests.
> > Instead, introduce gfpflags_allow_spinning() condition that signals
> > to the allocator that running context is unknown.
> > Then rely on percpu free list of pages to allocate a page.
> > The rmqueue_pcplist() should be able to pop the page from.
> > If it fails (due to IRQ re-entrancy or list being empty) then
> > try_alloc_pages() attempts to spin_trylock zone->lock
> > and refill percpu freelist as normal.
> > BPF program may execute with IRQs disabled and zone->lock is
> > sleeping in RT, so trylock is the only option. In theory we can
> > introduce percpu reentrance counter and increment it every time
> > spin_lock_irqsave(&zone->lock, flags) is used, but we cannot rely
> > on it. Even if this cpu is not in page_alloc path the
> > spin_lock_irqsave() is not safe, since BPF prog might be called
> > from tracepoint where preemption is disabled. So trylock only.
> >
> > Note, free_page and memcg are not taught about gfpflags_allow_spinning()
> > condition. The support comes in the next patches.
> >
> > This is a first step towards supporting BPF requirements in SLUB
> > and getting rid of bpf_mem_alloc.
> > That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> LGTM, I am not entirely clear on kmsan_alloc_page part though.

Which part is still confusing?
I hoped the comment below is enough:
   * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
   * is safe in any context. Also zeroing the page is mandatory for
   * BPF use cases.

and once you zoom into:
void kmsan_alloc_page(struct page *page, unsigned int order, gfp_t flags)
{
        bool initialized = (flags & __GFP_ZERO) || !kmsan_enabled;
...
        if (initialized) {
                __memset(page_address(shadow), 0, PAGE_SIZE * pages);
                __memset(page_address(origin), 0, PAGE_SIZE * pages);
                return;
        }

So it's safe to call it and it's necessary to call it when KMSAN is on.

This was the easiest code path to analyze from doesnt-take-spinlocks pov.
I feel the comment is enough.

If/when people want to support !__GFP_ZERO case with KMSAN they would
need to make stack_depot_save() behave in
!gfpflags_allow_spinning() condition.

Since __GFP_ZERO is necessary for the BPF use case I left all
the extra work for the future follow ups.

> As long as that part is correct you can add
> Acked-by: Michal Hocko <mhocko@suse.com>
>
> Other than that try_alloc_pages_noprof begs some user documentation.
>
> /**
>  * try_alloc_pages_noprof - opportunistic reentrant allocation from any context
>  * @nid - node to allocate from
>  * @order - allocation order size
>  *
>  * Allocates pages of a given order from the given node. This is safe to
>  * call from any context (from atomic, NMI but also reentrant
>  * allocator -> tracepoint -> try_alloc_pages_noprof).
>  * Allocation is best effort and to be expected to fail easily so nobody should
>  * rely on the succeess. Failures are not reported via warn_alloc().
>  *
>  * Return: allocated page or NULL on failure.
>  */

Nicely worded. Will add.

Thanks for all the reviews. Appreciate it!
diff mbox series

Patch

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b0fe9f62d15b..b41bb6e01781 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -39,6 +39,25 @@  static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
 	return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
 }
 
+static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
+{
+	/*
+	 * !__GFP_DIRECT_RECLAIM -> direct claim is not allowed.
+	 * !__GFP_KSWAPD_RECLAIM -> it's not safe to wake up kswapd.
+	 * All GFP_* flags including GFP_NOWAIT use one or both flags.
+	 * try_alloc_pages() is the only API that doesn't specify either flag.
+	 *
+	 * This is stronger than GFP_NOWAIT or GFP_ATOMIC because
+	 * those are guaranteed to never block on a sleeping lock.
+	 * Here we are enforcing that the allaaction doesn't ever spin
+	 * on any locks (i.e. only trylocks). There is no highlevel
+	 * GFP_$FOO flag for this use in try_alloc_pages() as the
+	 * regular page allocator doesn't fully support this
+	 * allocation mode.
+	 */
+	return !(gfp_flags & __GFP_RECLAIM);
+}
+
 #ifdef CONFIG_HIGHMEM
 #define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
 #else
@@ -347,6 +366,9 @@  static inline struct page *alloc_page_vma_noprof(gfp_t gfp,
 }
 #define alloc_page_vma(...)			alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__))
 
+struct page *try_alloc_pages_noprof(int nid, unsigned int order);
+#define try_alloc_pages(...)			alloc_hooks(try_alloc_pages_noprof(__VA_ARGS__))
+
 extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order);
 #define __get_free_pages(...)			alloc_hooks(get_free_pages_noprof(__VA_ARGS__))
 
diff --git a/mm/internal.h b/mm/internal.h
index cb8d8e8e3ffa..5454fa610aac 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1174,6 +1174,7 @@  unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_NOFRAGMENT	  0x0
 #endif
 #define ALLOC_HIGHATOMIC	0x200 /* Allows access to MIGRATE_HIGHATOMIC */
+#define ALLOC_TRYLOCK		0x400 /* Only use spin_trylock in allocation path */
 #define ALLOC_KSWAPD		0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
 
 /* Flags that allow allocations below the min watermark. */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1cb4b8c8886d..0f4be88ff131 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2304,7 +2304,11 @@  static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	unsigned long flags;
 	int i;
 
-	spin_lock_irqsave(&zone->lock, flags);
+	if (!spin_trylock_irqsave(&zone->lock, flags)) {
+		if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+			return 0;
+		spin_lock_irqsave(&zone->lock, flags);
+	}
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
 								alloc_flags);
@@ -2904,7 +2908,11 @@  struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 
 	do {
 		page = NULL;
-		spin_lock_irqsave(&zone->lock, flags);
+		if (!spin_trylock_irqsave(&zone->lock, flags)) {
+			if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+				return NULL;
+			spin_lock_irqsave(&zone->lock, flags);
+		}
 		if (alloc_flags & ALLOC_HIGHATOMIC)
 			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
 		if (!page) {
@@ -4509,7 +4517,8 @@  static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 
 	might_alloc(gfp_mask);
 
-	if (should_fail_alloc_page(gfp_mask, order))
+	if (!(*alloc_flags & ALLOC_TRYLOCK) &&
+	    should_fail_alloc_page(gfp_mask, order))
 		return false;
 
 	*alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, *alloc_flags);
@@ -7023,3 +7032,73 @@  static bool __free_unaccepted(struct page *page)
 }
 
 #endif /* CONFIG_UNACCEPTED_MEMORY */
+
+struct page *try_alloc_pages_noprof(int nid, unsigned int order)
+{
+	/*
+	 * Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed.
+	 * Do not specify __GFP_KSWAPD_RECLAIM either, since wake up of kswapd
+	 * is not safe in arbitrary context.
+	 *
+	 * These two are the conditions for gfpflags_allow_spinning() being true.
+	 *
+	 * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
+	 * to warn. Also warn would trigger printk() which is unsafe from
+	 * various contexts. We cannot use printk_deferred_enter() to mitigate,
+	 * since the running context is unknown.
+	 *
+	 * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
+	 * is safe in any context. Also zeroing the page is mandatory for
+	 * BPF use cases.
+	 *
+	 * Though __GFP_NOMEMALLOC is not checked in the code path below,
+	 * specify it here to highlight that try_alloc_pages()
+	 * doesn't want to deplete reserves.
+	 */
+	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
+	unsigned int alloc_flags = ALLOC_TRYLOCK;
+	struct alloc_context ac = { };
+	struct page *page;
+
+	/*
+	 * In RT spin_trylock() may call raw_spin_lock() which is unsafe in NMI.
+	 * If spin_trylock() is called from hard IRQ the current task may be
+	 * waiting for one rt_spin_lock, but rt_spin_trylock() will mark the
+	 * task as the owner of another rt_spin_lock which will confuse PI
+	 * logic, so return immediately if called form hard IRQ or NMI.
+	 *
+	 * Note, irqs_disabled() case is ok. This function can be called
+	 * from raw_spin_lock_irqsave region.
+	 */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
+		return NULL;
+	if (!pcp_allowed_order(order))
+		return NULL;
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	/* Bailout, since try_to_accept_memory_one() needs to take a lock */
+	if (has_unaccepted_memory())
+		return NULL;
+#endif
+	/* Bailout, since _deferred_grow_zone() needs to take a lock */
+	if (deferred_pages_enabled())
+		return NULL;
+
+	if (nid == NUMA_NO_NODE)
+		nid = numa_node_id();
+
+	prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
+			    &alloc_gfp, &alloc_flags);
+
+	/*
+	 * Best effort allocation from percpu free list.
+	 * If it's empty attempt to spin_trylock zone->lock.
+	 */
+	page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
+
+	/* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
+
+	trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
+	kmsan_alloc_page(page, order, alloc_gfp);
+	return page;
+}