Message ID | 20250114021922.92609-2-alexei.starovoitov@gmail.com (mailing list archive)
---|---
State | New |
Series | bpf, mm: Introduce try_alloc_pages()
On Mon, Jan 13, 2025 at 06:19:17PM -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Tracing BPF programs execute from tracepoints and kprobes where
> running context is unknown, but they need to request additional
> memory.

> The prior workarounds were using pre-allocated memory and
> BPF specific freelists to satisfy such allocation requests.
> Instead, introduce gfpflags_allow_spinning() condition that signals
> to the allocator that running context is unknown.
> Then rely on percpu free list of pages to allocate a page.
> The rmqueue_pcplist() should be able to pop the page from it.
> If it fails (due to IRQ re-entrancy or list being empty) then
> try_alloc_pages() attempts to spin_trylock zone->lock
> and refill percpu freelist as normal.

> BPF program may execute with IRQs disabled and zone->lock is
> sleeping in RT, so trylock is the only option.

how is spin_trylock() from IRQ context not utterly broken in RT?

It can lead to trying to priority boost the idle thread, among other
crazy things.
On Tue 14-01-25 10:53:55, Peter Zijlstra wrote:
> On Mon, Jan 13, 2025 at 06:19:17PM -0800, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Tracing BPF programs execute from tracepoints and kprobes where
> > running context is unknown, but they need to request additional
> > memory.
>
> > The prior workarounds were using pre-allocated memory and
> > BPF specific freelists to satisfy such allocation requests.
> > Instead, introduce gfpflags_allow_spinning() condition that signals
> > to the allocator that running context is unknown.
> > Then rely on percpu free list of pages to allocate a page.
> > The rmqueue_pcplist() should be able to pop the page from it.
> > If it fails (due to IRQ re-entrancy or list being empty) then
> > try_alloc_pages() attempts to spin_trylock zone->lock
> > and refill percpu freelist as normal.
>
> > BPF program may execute with IRQs disabled and zone->lock is
> > sleeping in RT, so trylock is the only option.
>
> how is spin_trylock() from IRQ context not utterly broken in RT?

+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
+		return NULL;

Deals with that, right?
On Mon 13-01-25 18:19:17, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Tracing BPF programs execute from tracepoints and kprobes where
> running context is unknown, but they need to request additional
> memory. The prior workarounds were using pre-allocated memory and
> BPF specific freelists to satisfy such allocation requests.
> Instead, introduce gfpflags_allow_spinning() condition that signals
> to the allocator that running context is unknown.
> Then rely on percpu free list of pages to allocate a page.
> The rmqueue_pcplist() should be able to pop the page from it.
> If it fails (due to IRQ re-entrancy or list being empty) then
> try_alloc_pages() attempts to spin_trylock zone->lock
> and refill percpu freelist as normal.
> BPF program may execute with IRQs disabled and zone->lock is
> sleeping in RT, so trylock is the only option. In theory we can
> introduce percpu reentrance counter and increment it every time
> spin_lock_irqsave(&zone->lock, flags) is used, but we cannot rely
> on it. Even if this cpu is not in page_alloc path the
> spin_lock_irqsave() is not safe, since BPF prog might be called
> from tracepoint where preemption is disabled. So trylock only.
>
> Note, free_page and memcg are not taught about gfpflags_allow_spinning()
> condition. The support comes in the next patches.
>
> This is a first step towards supporting BPF requirements in SLUB
> and getting rid of bpf_mem_alloc.
> That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

LGTM, I am not entirely clear on the kmsan_alloc_page part though. As long
as that part is correct you can add
Acked-by: Michal Hocko <mhocko@suse.com>

Other than that, try_alloc_pages_noprof begs some user documentation:

/**
 * try_alloc_pages_noprof - opportunistic reentrant allocation from any context
 * @nid - node to allocate from
 * @order - allocation order size
 *
 * Allocates pages of a given order from the given node. This is safe to
 * call from any context (from atomic, NMI but also reentrant
 * allocator -> tracepoint -> try_alloc_pages_noprof).
 * Allocation is best effort and should be expected to fail easily so nobody
 * should rely on the success. Failures are not reported via warn_alloc().
 *
 * Return: allocated page or NULL on failure.
 */

> +struct page *try_alloc_pages_noprof(int nid, unsigned int order)
> +{
> +	/*
> +	 * Do not specify __GFP_DIRECT_RECLAIM, since direct reclaim is not allowed.
> +	 * Do not specify __GFP_KSWAPD_RECLAIM either, since wake up of kswapd
> +	 * is not safe in arbitrary context.
> +	 *
> +	 * These two are the conditions for gfpflags_allow_spinning() being true.
> +	 *
> +	 * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
> +	 * to warn. Also warn would trigger printk() which is unsafe from
> +	 * various contexts. We cannot use printk_deferred_enter() to mitigate,
> +	 * since the running context is unknown.
> +	 *
> +	 * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
> +	 * is safe in any context. Also zeroing the page is mandatory for
> +	 * BPF use cases.
> +	 *
> +	 * Though __GFP_NOMEMALLOC is not checked in the code path below,
> +	 * specify it here to highlight that try_alloc_pages()
> +	 * doesn't want to deplete reserves.
> +	 */
> +	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
> +	unsigned int alloc_flags = ALLOC_TRYLOCK;
> +	struct alloc_context ac = { };
> +	struct page *page;
> +
> +	/*
> +	 * In RT spin_trylock() may call raw_spin_lock() which is unsafe in NMI.
> +	 * If spin_trylock() is called from hard IRQ the current task may be
> +	 * waiting for one rt_spin_lock, but rt_spin_trylock() will mark the
> +	 * task as the owner of another rt_spin_lock which will confuse PI
> +	 * logic, so return immediately if called from hard IRQ or NMI.
> +	 *
> +	 * Note, irqs_disabled() case is ok. This function can be called
> +	 * from raw_spin_lock_irqsave region.
> +	 */
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> +		return NULL;
> +	if (!pcp_allowed_order(order))
> +		return NULL;
> +
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +	/* Bailout, since try_to_accept_memory_one() needs to take a lock */
> +	if (has_unaccepted_memory())
> +		return NULL;
> +#endif
> +	/* Bailout, since _deferred_grow_zone() needs to take a lock */
> +	if (deferred_pages_enabled())
> +		return NULL;
> +
> +	if (nid == NUMA_NO_NODE)
> +		nid = numa_node_id();
> +
> +	prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
> +			    &alloc_gfp, &alloc_flags);
> +
> +	/*
> +	 * Best effort allocation from percpu free list.
> +	 * If it's empty attempt to spin_trylock zone->lock.
> +	 */
> +	page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
> +
> +	/* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
> +
> +	trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
> +	kmsan_alloc_page(page, order, alloc_gfp);
> +	return page;
> +}
> --
> 2.43.5
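To illustrate the calling convention the kerneldoc above describes, here is a
minimal caller-side sketch. It is only a sketch: grab_scratch_page() and
pick_from_prealloc_pool() are hypothetical names, not anything from this
series; the point is that NULL is an expected outcome which the caller absorbs
without warning or retrying.

	/* Hypothetical caller; names are placeholders, not part of the series. */
	#include <linux/gfp.h>
	#include <linux/mm.h>
	#include <linux/numa.h>

	void *pick_from_prealloc_pool(void);	/* assumed BPF-style fallback */

	static void *grab_scratch_page(void)
	{
		struct page *page;

		/* Best effort; may return NULL from NMI/IRQ or when zone->lock is contended. */
		page = try_alloc_pages(NUMA_NO_NODE, 0);
		if (!page)
			return pick_from_prealloc_pool();

		/* Page is already zeroed because try_alloc_pages() passes __GFP_ZERO. */
		return page_address(page);
	}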
On Tue, Jan 14, 2025 at 11:19:41AM +0100, Michal Hocko wrote:
> On Tue 14-01-25 10:53:55, Peter Zijlstra wrote:
> > On Mon, Jan 13, 2025 at 06:19:17PM -0800, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > Tracing BPF programs execute from tracepoints and kprobes where
> > > running context is unknown, but they need to request additional
> > > memory.
> >
> > > The prior workarounds were using pre-allocated memory and
> > > BPF specific freelists to satisfy such allocation requests.
> > > Instead, introduce gfpflags_allow_spinning() condition that signals
> > > to the allocator that running context is unknown.
> > > Then rely on percpu free list of pages to allocate a page.
> > > The rmqueue_pcplist() should be able to pop the page from it.
> > > If it fails (due to IRQ re-entrancy or list being empty) then
> > > try_alloc_pages() attempts to spin_trylock zone->lock
> > > and refill percpu freelist as normal.
> >
> > > BPF program may execute with IRQs disabled and zone->lock is
> > > sleeping in RT, so trylock is the only option.
> >
> > how is spin_trylock() from IRQ context not utterly broken in RT?
>
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> +		return NULL;
>
> Deals with that, right?

Changelog didn't really mention that, did it? -- it seems to imply quite
the opposite :/

But maybe, I suppose any BPF program needs to expect failure due to this
being trylock. I just worry some programs will malfunction due to never
succeeding -- and RT getting blamed for this.

Maybe I worry too much.
On Tue 14-01-25 11:39:46, Peter Zijlstra wrote:
> On Tue, Jan 14, 2025 at 11:19:41AM +0100, Michal Hocko wrote:
> > On Tue 14-01-25 10:53:55, Peter Zijlstra wrote:
> > > On Mon, Jan 13, 2025 at 06:19:17PM -0800, Alexei Starovoitov wrote:
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > >
> > > > Tracing BPF programs execute from tracepoints and kprobes where
> > > > running context is unknown, but they need to request additional
> > > > memory.
> > >
> > > > The prior workarounds were using pre-allocated memory and
> > > > BPF specific freelists to satisfy such allocation requests.
> > > > Instead, introduce gfpflags_allow_spinning() condition that signals
> > > > to the allocator that running context is unknown.
> > > > Then rely on percpu free list of pages to allocate a page.
> > > > The rmqueue_pcplist() should be able to pop the page from it.
> > > > If it fails (due to IRQ re-entrancy or list being empty) then
> > > > try_alloc_pages() attempts to spin_trylock zone->lock
> > > > and refill percpu freelist as normal.
> > >
> > > > BPF program may execute with IRQs disabled and zone->lock is
> > > > sleeping in RT, so trylock is the only option.
> > >
> > > how is spin_trylock() from IRQ context not utterly broken in RT?
> >
> > +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> > +		return NULL;
> >
> > Deals with that, right?
>
> Changelog didn't really mention that, did it? -- it seems to imply quite
> the opposite :/

Yes, one has to look into the implementation to see all the constraints,
and the changelog could/should be improved in that regard.

> But maybe, I suppose any BPF program needs to expect failure due to this
> being trylock. I just worry some programs will malfunction due to never
> succeeding -- and RT getting blamed for this.

This is a good question. The current implementation has fewer constraints
and there are no data points about the success rate with the new
implementation. But to be completely honest I am not really sure how much
BPF is used on PREEMPT_RT systems and with RT workloads, so I am not sure
how much of a practical problem that is.

> Maybe I worry too much.
On Tue, Jan 14, 2025 at 2:39 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jan 14, 2025 at 11:19:41AM +0100, Michal Hocko wrote:
> > On Tue 14-01-25 10:53:55, Peter Zijlstra wrote:
> > > On Mon, Jan 13, 2025 at 06:19:17PM -0800, Alexei Starovoitov wrote:
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > >
> > > > Tracing BPF programs execute from tracepoints and kprobes where
> > > > running context is unknown, but they need to request additional
> > > > memory.
> > >
> > > > The prior workarounds were using pre-allocated memory and
> > > > BPF specific freelists to satisfy such allocation requests.
> > > > Instead, introduce gfpflags_allow_spinning() condition that signals
> > > > to the allocator that running context is unknown.
> > > > Then rely on percpu free list of pages to allocate a page.
> > > > The rmqueue_pcplist() should be able to pop the page from it.
> > > > If it fails (due to IRQ re-entrancy or list being empty) then
> > > > try_alloc_pages() attempts to spin_trylock zone->lock
> > > > and refill percpu freelist as normal.
> > >
> > > > BPF program may execute with IRQs disabled and zone->lock is
> > > > sleeping in RT, so trylock is the only option.
> > >
> > > how is spin_trylock() from IRQ context not utterly broken in RT?
> >
> > +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> > +		return NULL;
> >
> > Deals with that, right?
>
> Changelog didn't really mention that, did it? -- it seems to imply quite
> the opposite :/

Hmm. Until you said it I didn't read it as "imply the opposite" :(

The cover letter is pretty clear...
"
- Since spin_trylock() is not safe in RT from hard IRQ and NMI
  disable such usage in lock_trylock and in try_alloc_pages().
"

and the patch 2 commit log is clear too...
"
Since spin_trylock() cannot be used in RT from hard IRQ or NMI
it uses lockless link list...
"

and further in patch 3 commit log...
"
Use spin_trylock in PREEMPT_RT when not in hard IRQ and not in NMI
and fail instantly otherwise, since spin_trylock is not safe from IRQ
due to PI issues.
"

I guess I can reword this particular sentence in patch 1 commit log,
but before jumping to an incorrect conclusion please read the whole set.

> But maybe, I suppose any BPF program needs to expect failure due to this
> being trylock. I just worry some programs will malfunction due to never
> succeeding -- and RT getting blamed for this.
>
> Maybe I worry too much.

Humans will find a way to blame BPF and/or RT for all of their problems
anyway. Just days ago BPF was blamed in RT for causing IPIs during JIT.
Valentin's patches are going to address that, but ain't noone has time
to explain that continuously.

Seriously, though, the number of things that still run in hard irq context
in RT is so small that if some tracing BPF prog is attached there
it should be using prealloc mode. Full prealloc is still
the default for bpf hash map.
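To make the prealloc point concrete, a hedged libbpf-style sketch follows: a
hash map that simply omits BPF_F_NO_PREALLOC keeps the default full-prealloc
behaviour, so element allocations are served from the map's preallocated pool
rather than the page allocator at attach-site runtime. The map name and sizes
below are arbitrary.

	/* Sketch of a BPF-side map definition; name and sizes are made up. */
	#include <linux/bpf.h>
	#include <linux/types.h>
	#include <bpf/bpf_helpers.h>

	struct {
		__uint(type, BPF_MAP_TYPE_HASH);
		__uint(max_entries, 4096);
		__type(key, __u32);
		__type(value, __u64);
		/* No __uint(map_flags, BPF_F_NO_PREALLOC): elements are preallocated. */
	} counters SEC(".maps");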
On Tue, 14 Jan 2025 10:29:04 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> Seriously, though, the number of things that still run in hard irq context
> in RT is so small that if some tracing BPF prog is attached there
> it should be using prealloc mode. Full prealloc is still
> the default for bpf hash map.

The one thing to watch out for is hrtimer trace events. They will be
called in hard irq context even in RT. If a BPF program is attached to
one of them, then that could be an issue.

-- Steve
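Continuing the sketch above (still hypothetical): a tracepoint program on
timer:hrtimer_expire_entry, the case Steve mentions, that only touches the
preallocated counters map and therefore keeps working even though the hook
runs in hard IRQ context on RT, where try_alloc_pages() bails out.

	/* Hypothetical program using the counters map sketched earlier. */
	SEC("tracepoint/timer/hrtimer_expire_entry")
	int count_hrtimer_expire(void *ctx)
	{
		__u32 key = 0;
		__u64 one = 1, *val;

		val = bpf_map_lookup_elem(&counters, &key);
		if (val)
			__sync_fetch_and_add(val, 1);
		else
			/* New element comes from the map's prealloc pool, not the page allocator. */
			bpf_map_update_elem(&counters, &key, &one, BPF_ANY);
		return 0;
	}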
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b0fe9f62d15b..b41bb6e01781 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -39,6 +39,25 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
 	return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
 }
 
+static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
+{
+	/*
+	 * !__GFP_DIRECT_RECLAIM -> direct reclaim is not allowed.
+	 * !__GFP_KSWAPD_RECLAIM -> it's not safe to wake up kswapd.
+	 * All GFP_* flags including GFP_NOWAIT use one or both flags.
+	 * try_alloc_pages() is the only API that doesn't specify either flag.
+	 *
+	 * This is stronger than GFP_NOWAIT or GFP_ATOMIC because
+	 * those are guaranteed to never block on a sleeping lock.
+	 * Here we are enforcing that the allocation doesn't ever spin
+	 * on any locks (i.e. only trylocks). There is no highlevel
+	 * GFP_$FOO flag for this use in try_alloc_pages() as the
+	 * regular page allocator doesn't fully support this
+	 * allocation mode.
+	 */
+	return !(gfp_flags & __GFP_RECLAIM);
+}
+
 #ifdef CONFIG_HIGHMEM
 #define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
 #else
@@ -347,6 +366,9 @@ static inline struct page *alloc_page_vma_noprof(gfp_t gfp,
 }
 #define alloc_page_vma(...)	alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__))
 
+struct page *try_alloc_pages_noprof(int nid, unsigned int order);
+#define try_alloc_pages(...)	alloc_hooks(try_alloc_pages_noprof(__VA_ARGS__))
+
 extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order);
 #define __get_free_pages(...)	alloc_hooks(get_free_pages_noprof(__VA_ARGS__))
 
diff --git a/mm/internal.h b/mm/internal.h
index cb8d8e8e3ffa..5454fa610aac 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1174,6 +1174,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_NOFRAGMENT	  0x0
 #endif
 #define ALLOC_HIGHATOMIC	0x200 /* Allows access to MIGRATE_HIGHATOMIC */
+#define ALLOC_TRYLOCK		0x400 /* Only use spin_trylock in allocation path */
 #define ALLOC_KSWAPD		0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
 
 /* Flags that allow allocations below the min watermark. */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1cb4b8c8886d..0f4be88ff131 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2304,7 +2304,11 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	unsigned long flags;
 	int i;
 
-	spin_lock_irqsave(&zone->lock, flags);
+	if (!spin_trylock_irqsave(&zone->lock, flags)) {
+		if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+			return 0;
+		spin_lock_irqsave(&zone->lock, flags);
+	}
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
 					      alloc_flags);
@@ -2904,7 +2908,11 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 
 	do {
 		page = NULL;
-		spin_lock_irqsave(&zone->lock, flags);
+		if (!spin_trylock_irqsave(&zone->lock, flags)) {
+			if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+				return NULL;
+			spin_lock_irqsave(&zone->lock, flags);
+		}
 		if (alloc_flags & ALLOC_HIGHATOMIC)
 			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
 		if (!page) {
@@ -4509,7 +4517,8 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 
 	might_alloc(gfp_mask);
 
-	if (should_fail_alloc_page(gfp_mask, order))
+	if (!(*alloc_flags & ALLOC_TRYLOCK) &&
+	    should_fail_alloc_page(gfp_mask, order))
 		return false;
 
 	*alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, *alloc_flags);
@@ -7023,3 +7032,73 @@ static bool __free_unaccepted(struct page *page)
 }
 
 #endif /* CONFIG_UNACCEPTED_MEMORY */
+
+struct page *try_alloc_pages_noprof(int nid, unsigned int order)
+{
+	/*
+	 * Do not specify __GFP_DIRECT_RECLAIM, since direct reclaim is not allowed.
+	 * Do not specify __GFP_KSWAPD_RECLAIM either, since wake up of kswapd
+	 * is not safe in arbitrary context.
+	 *
+	 * These two are the conditions for gfpflags_allow_spinning() being true.
+	 *
+	 * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
+	 * to warn. Also warn would trigger printk() which is unsafe from
+	 * various contexts. We cannot use printk_deferred_enter() to mitigate,
+	 * since the running context is unknown.
+	 *
+	 * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
+	 * is safe in any context. Also zeroing the page is mandatory for
+	 * BPF use cases.
+	 *
+	 * Though __GFP_NOMEMALLOC is not checked in the code path below,
+	 * specify it here to highlight that try_alloc_pages()
+	 * doesn't want to deplete reserves.
+	 */
+	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
+	unsigned int alloc_flags = ALLOC_TRYLOCK;
+	struct alloc_context ac = { };
+	struct page *page;
+
+	/*
+	 * In RT spin_trylock() may call raw_spin_lock() which is unsafe in NMI.
+	 * If spin_trylock() is called from hard IRQ the current task may be
+	 * waiting for one rt_spin_lock, but rt_spin_trylock() will mark the
+	 * task as the owner of another rt_spin_lock which will confuse PI
+	 * logic, so return immediately if called from hard IRQ or NMI.
+	 *
+	 * Note, irqs_disabled() case is ok. This function can be called
+	 * from raw_spin_lock_irqsave region.
+	 */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
+		return NULL;
+	if (!pcp_allowed_order(order))
+		return NULL;
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	/* Bailout, since try_to_accept_memory_one() needs to take a lock */
+	if (has_unaccepted_memory())
+		return NULL;
+#endif
+	/* Bailout, since _deferred_grow_zone() needs to take a lock */
+	if (deferred_pages_enabled())
+		return NULL;
+
+	if (nid == NUMA_NO_NODE)
+		nid = numa_node_id();
+
+	prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
+			    &alloc_gfp, &alloc_flags);
+
+	/*
+	 * Best effort allocation from percpu free list.
+	 * If it's empty attempt to spin_trylock zone->lock.
+	 */
+	page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
+
+	/* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
+
+	trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
+	kmsan_alloc_page(page, order, alloc_gfp);
+	return page;
+}
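As a reading aid for the new gfpflags_allow_spinning() helper: the commit log
notes that free_page and memcg are not yet taught about it, so below is a
hedged sketch of how a callee outside the allocation path might consult it.
The function and counter names are hypothetical; the pattern simply mirrors
the ALLOC_TRYLOCK handling added to rmqueue_bulk() and rmqueue_buddy() above,
and assumes a kernel with this patch applied.

	/* Hypothetical callee sketch, not part of this series. */
	#include <linux/gfp.h>
	#include <linux/spinlock.h>

	static bool charge_counter(spinlock_t *lock, unsigned long *counter,
				   unsigned long nr, gfp_t gfp)
	{
		unsigned long flags;

		if (!spin_trylock_irqsave(lock, flags)) {
			/* Unknown context: the caller forbade spinning, fail gracefully. */
			if (!gfpflags_allow_spinning(gfp))
				return false;
			spin_lock_irqsave(lock, flags);
		}
		*counter += nr;
		spin_unlock_irqrestore(lock, flags);
		return true;
	}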