Message ID | 20250124035655.78899-1-alexei.starovoitov@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | bpf, mm: Introduce try_alloc_pages() | expand |
On Thu, Jan 23, 2025 at 07:56:49PM -0800, Alexei Starovoitov wrote: > - Considered using __GFP_COMP in try_alloc_pages to simplify > free_pages_nolock a bit, but then decided to make it work > for all types of pages, since free_pages_nolock() is used by > stackdepot and currently it's using non-compound order 2. > I felt it's best to leave it as-is and make free_pages_nolock() > support all pages. We're trying to eliminate non-use of __GFP_COMP. Because people don't use __GFP_COMP, there's a security check that we can't turn on. Would you reconsider this change you made?
On 1/24/25 15:16, Matthew Wilcox wrote: > On Thu, Jan 23, 2025 at 07:56:49PM -0800, Alexei Starovoitov wrote: >> - Considered using __GFP_COMP in try_alloc_pages to simplify >> free_pages_nolock a bit, but then decided to make it work >> for all types of pages, since free_pages_nolock() is used by >> stackdepot and currently it's using non-compound order 2. >> I felt it's best to leave it as-is and make free_pages_nolock() >> support all pages. > > We're trying to eliminate non-use of __GFP_COMP. Because people don't > use __GFP_COMP, there's a security check that we can't turn on. Would > you reconsider this change you made? This means changing stackdepot to use __GFP_COMP. Which would be a good thing on its own. But if you consider if off-topic to your series, I can look at it.
On Fri, Jan 24, 2025 at 6:19 AM Vlastimil Babka <vbabka@suse.cz> wrote: > > On 1/24/25 15:16, Matthew Wilcox wrote: > > On Thu, Jan 23, 2025 at 07:56:49PM -0800, Alexei Starovoitov wrote: > >> - Considered using __GFP_COMP in try_alloc_pages to simplify > >> free_pages_nolock a bit, but then decided to make it work > >> for all types of pages, since free_pages_nolock() is used by > >> stackdepot and currently it's using non-compound order 2. > >> I felt it's best to leave it as-is and make free_pages_nolock() > >> support all pages. > > > > We're trying to eliminate non-use of __GFP_COMP. Because people don't > > use __GFP_COMP, there's a security check that we can't turn on. Would > > you reconsider this change you made? > > This means changing stackdepot to use __GFP_COMP. Which would be a good > thing on its own. But if you consider if off-topic to your series, I can > look at it. Ohh. I wasn't aware of that. I can certainly add __GFP_COMP to try_alloc_pages() and will check stackdepot too. I spotted this line: VM_BUG_ON_PAGE(compound && compound_order(page) != order, page); that line alone was a good enough reason to use __GFP_COMP, but since it's debug only I could only guess where the future lies. Should it be something like: if (WARN_ON(compound && compound_order(page) != order)) order = compound_order(page); since proceeding with the wrong order is certain to crash. ?
On 1/24/25 17:19, Alexei Starovoitov wrote: > On Fri, Jan 24, 2025 at 6:19 AM Vlastimil Babka <vbabka@suse.cz> wrote: >> >> On 1/24/25 15:16, Matthew Wilcox wrote: >> > On Thu, Jan 23, 2025 at 07:56:49PM -0800, Alexei Starovoitov wrote: >> >> - Considered using __GFP_COMP in try_alloc_pages to simplify >> >> free_pages_nolock a bit, but then decided to make it work >> >> for all types of pages, since free_pages_nolock() is used by >> >> stackdepot and currently it's using non-compound order 2. >> >> I felt it's best to leave it as-is and make free_pages_nolock() >> >> support all pages. >> > >> > We're trying to eliminate non-use of __GFP_COMP. Because people don't >> > use __GFP_COMP, there's a security check that we can't turn on. Would >> > you reconsider this change you made? >> >> This means changing stackdepot to use __GFP_COMP. Which would be a good >> thing on its own. But if you consider if off-topic to your series, I can >> look at it. > > Ohh. I wasn't aware of that. > I can certainly add __GFP_COMP to try_alloc_pages() and Yeah IIRC I suggested that previously. > will check stackdepot too. Great, thanks. > I spotted this line: > VM_BUG_ON_PAGE(compound && compound_order(page) != order, page); > that line alone was a good enough reason to use __GFP_COMP, > but since it's debug only I could only guess where the future lies. > > Should it be something like: > > if (WARN_ON(compound && compound_order(page) != order)) > order = compound_order(page); > > since proceeding with the wrong order is certain to crash. > ? That's the common question, should we be paranoid and add overhead to fast paths in production. Here we do only for DEBUG_VM, which is meant for easier debugging during development of new code. I think it's not worth adding this overhead in normal configs, as the (increasing) majority of order > 0 parameters should come here from compound_order() anyway (i.e. put_folio()) As said we're trying to eliminate the other cases so we don't need to cater for them more.
On Fri, Jan 24, 2025 at 08:19:19AM -0800, Alexei Starovoitov wrote: > I spotted this line: > VM_BUG_ON_PAGE(compound && compound_order(page) != order, page); > that line alone was a good enough reason to use __GFP_COMP, > but since it's debug only I could only guess where the future lies. > > Should it be something like: > > if (WARN_ON(compound && compound_order(page) != order)) > order = compound_order(page); > > since proceeding with the wrong order is certain to crash. > ? It's hard to say. We've got a discrepancy between "order at free call site" and "order recorded in page". We might, for example, have a memory corruption which has overwritten the compound_order() stored in the struct page, in which case the 'order' passed in is the correct one, and "correcting" it to the corrupt one stored in struct page would be the thing which caused the crash. I'd leave it as VM_BUG_ON().
On Fri, Jan 24, 2025 at 8:28 AM Vlastimil Babka <vbabka@suse.cz> wrote: > > On 1/24/25 17:19, Alexei Starovoitov wrote: > > On Fri, Jan 24, 2025 at 6:19 AM Vlastimil Babka <vbabka@suse.cz> wrote: > >> > >> On 1/24/25 15:16, Matthew Wilcox wrote: > >> > On Thu, Jan 23, 2025 at 07:56:49PM -0800, Alexei Starovoitov wrote: > >> >> - Considered using __GFP_COMP in try_alloc_pages to simplify > >> >> free_pages_nolock a bit, but then decided to make it work > >> >> for all types of pages, since free_pages_nolock() is used by > >> >> stackdepot and currently it's using non-compound order 2. > >> >> I felt it's best to leave it as-is and make free_pages_nolock() > >> >> support all pages. > >> > > >> > We're trying to eliminate non-use of __GFP_COMP. Because people don't > >> > use __GFP_COMP, there's a security check that we can't turn on. Would > >> > you reconsider this change you made? > >> > >> This means changing stackdepot to use __GFP_COMP. Which would be a good > >> thing on its own. But if you consider if off-topic to your series, I can > >> look at it. > > > > Ohh. I wasn't aware of that. > > I can certainly add __GFP_COMP to try_alloc_pages() and > > Yeah IIRC I suggested that previously. Yes, as a way to simplify free_pages_nolock(). Hence I looked into it and added the above comment to the cover letter. Now I see that there are more and stronger reasons to use it. > > will check stackdepot too. > > Great, thanks. > > > I spotted this line: > > VM_BUG_ON_PAGE(compound && compound_order(page) != order, page); > > that line alone was a good enough reason to use __GFP_COMP, > > but since it's debug only I could only guess where the future lies. > > > > Should it be something like: > > > > if (WARN_ON(compound && compound_order(page) != order)) > > order = compound_order(page); > > > > since proceeding with the wrong order is certain to crash. > > ? > > That's the common question, should we be paranoid and add overhead to fast > paths in production. Here we do only for DEBUG_VM, which is meant for easier > debugging during development of new code. > > I think it's not worth adding this overhead in normal configs, as the > (increasing) majority of order > 0 parameters should come here from > compound_order() anyway (i.e. put_folio()) As said we're trying to eliminate > the other cases so we don't need to cater for them more. Understood. I also agree with Matthew comment about page corruption. Whether compound_order(page) or order is correct is indeed a question. I'll wait for review on patch 3 before resubmitting with GFP_COMP included.
From: Alexei Starovoitov <ast@kernel.org> Hi All, The main motivation is to make alloc page and slab reentrant and remove bpf_mem_alloc. v5->v6: - Addressed comments from Sebastian, Vlastimil - New approach for local_lock_t in patch 3. Instead of unconditionally increasing local_lock_t size to 4 bytes introduce local_trylock_t and use _Generic() tricks to manipulate active field. - Address stackdepot reentrance issues. alloc part in patch 1 and free part in patch 2. - Inlined mem_cgroup_cancel_charge() in patch 4 since this helper is being removed. - Added Acks. - Dropped failslab, kfence, kmemleak patch. - Improved bpf_map_alloc_pages() in patch 6 a bit to demo intended usage. It will be refactored further. - Considered using __GFP_COMP in try_alloc_pages to simplify free_pages_nolock a bit, but then decided to make it work for all types of pages, since free_pages_nolock() is used by stackdepot and currently it's using non-compound order 2. I felt it's best to leave it as-is and make free_pages_nolock() support all pages. v5: https://lore.kernel.org/all/20250115021746.34691-1-alexei.starovoitov@gmail.com/ v4->v5: - Fixed patch 1 and 4 commit logs and comments per Michal suggestions. Added Acks. - Added patch 6 to make failslab, kfence, kmemleak complaint with trylock mode. It's a prerequisite for reentrant slab patches. v4: https://lore.kernel.org/bpf/20250114021922.92609-1-alexei.starovoitov@gmail.com/ v3->v4: Addressed feedback from Michal and Shakeel: - GFP_TRYLOCK flag is gone. gfpflags_allow_spinning() is used instead. - Improved comments and commit logs. v3: https://lore.kernel.org/bpf/20241218030720.1602449-1-alexei.starovoitov@gmail.com/ v2->v3: To address the issues spotted by Sebastian, Vlastimil, Steven: - Made GFP_TRYLOCK internal to mm/internal.h try_alloc_pages() and free_pages_nolock() are the only interfaces. - Since spin_trylock() is not safe in RT from hard IRQ and NMI disable such usage in lock_trylock and in try_alloc_pages(). In such case free_pages_nolock() falls back to llist right away. - Process trylock_free_pages llist when preemptible. - Check for things like unaccepted memory and order <= 3 early. - Don't call into __alloc_pages_slowpath() at all. - Inspired by Vlastimil's struct local_tryirq_lock adopted it in local_lock_t. Extra 4 bytes in !RT in local_lock_t shouldn't affect any of the current local_lock_t users. This is patch 3. - Tested with bpf selftests in RT and !RT and realized how much more work is necessary on bpf side to play nice with RT. The urgency of this work got higher. The alternative is to convert bpf bits left and right to bpf_mem_alloc. v2: https://lore.kernel.org/bpf/20241210023936.46871-1-alexei.starovoitov@gmail.com/ v1->v2: - fixed buggy try_alloc_pages_noprof() in PREEMPT_RT. Thanks Peter. - optimize all paths by doing spin_trylock_irqsave() first and only then check for gfp_flags & __GFP_TRYLOCK. Then spin_lock_irqsave() if it's a regular mode. So new gfp flag will not add performance overhead. - patches 2-5 are new. They introduce lockless and/or trylock free_pages_nolock() and memcg support. So it's in usable shape for bpf in patch 6. v1: https://lore.kernel.org/bpf/20241116014854.55141-1-alexei.starovoitov@gmail.com/ Alexei Starovoitov (6): mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation mm, bpf: Introduce free_pages_nolock() locking/local_lock: Introduce local_trylock_t and local_trylock_irqsave() memcg: Use trylock to access memcg stock_lock. mm, bpf: Use memcg in try_alloc_pages(). bpf: Use try_alloc_pages() to allocate pages for bpf needs. include/linux/bpf.h | 2 +- include/linux/gfp.h | 23 ++++ include/linux/local_lock.h | 9 ++ include/linux/local_lock_internal.h | 79 ++++++++++- include/linux/mm_types.h | 4 + include/linux/mmzone.h | 3 + kernel/bpf/arena.c | 5 +- kernel/bpf/syscall.c | 23 +++- lib/stackdepot.c | 10 +- mm/internal.h | 1 + mm/memcontrol.c | 30 ++++- mm/page_alloc.c | 200 ++++++++++++++++++++++++++-- mm/page_owner.c | 8 +- 13 files changed, 365 insertions(+), 32 deletions(-)