
[GIT,PULL] Introduce try_alloc_pages for 6.15

Message ID 20250327145159.99799-1-alexei.starovoitov@gmail.com (mailing list archive)
State Not Applicable

Pull-request

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git tags/bpf_try_alloc_pages

Checks

Context Check Description
netdev/tree_selection success Pull request for net, async
netdev/build_32bit fail Errors and warnings before: 518 this patch: 525
netdev/build_tools success Errors and warnings before: 26 (+0) this patch: 26 (+0)
netdev/build_clang fail Errors and warnings before: 966 this patch: 974
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn fail Errors and warnings before: 15128 this patch: 15002
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 21 this patch: 21

Message

Alexei Starovoitov March 27, 2025, 2:51 p.m. UTC
Hi Linus,

The following changes since commit 2014c95afecee3e76ca4a56956a936e23283f05b:

  Linux 6.14-rc1 (2025-02-02 15:39:26 -0800)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git tags/bpf_try_alloc_pages

for you to fetch changes up to f90b474a35744b5d43009e4fab232e74a3024cae:

  mm: Fix the flipped condition in gfpflags_allow_spinning() (2025-03-15 11:18:19 -0700)

----------------------------------------------------------------
Please pull after main MM changes.

The pull includes work from Sebastian, Vlastimil and myself
with a lot of help from Michal and Shakeel.
This is a first step towards making kmalloc reentrant to get rid
of slab wrappers: bpf_mem_alloc, kretprobe's objpool, etc.
These patches make page allocator safe from any context.

Vlastimil kicked off this effort at LSFMM 2024:
https://lwn.net/Articles/974138/
and we continued at LSFMM 2025:
https://lore.kernel.org/all/CAADnVQKfkGxudNUkcPJgwe3nTZ=xohnRshx9kLZBTmR_E1DFEg@mail.gmail.com/

Why:
SLAB wrappers bind memory to a particular subsystem
making it unavailable to the rest of the kernel.
Some BPF maps in production consume Gbytes of preallocated
memory. Top 5 in Meta: 1.5G, 1.2G, 1.1G, 300M, 200M.
Once we have kmalloc that works in any context BPF map
preallocation won't be necessary.

How:
The synchronous kmalloc/page alloc stack has multiple
stages going from fast to slow:
cmpxchg16 -> slab_alloc -> new_slab -> alloc_pages ->
rmqueue_pcplist -> __rmqueue.
rmqueue_pcplist was already relying on a trylock.
This set changes rmqueue_bulk/rmqueue_buddy to attempt
a trylock and bail out (return NULL) when
alloc_flags & ALLOC_TRYLOCK is set.
Then it wraps this functionality into the try_alloc_pages() helper.
We make sure that the logic is sane under PREEMPT_RT.
End result: try_alloc_pages()/free_pages_nolock() are
safe to call from any context.
A try_kmalloc() for any context, built on the same trylock
approach, will follow. It will use try_alloc_pages()
when slab needs a new page.
Though such try_kmalloc/page_alloc() is an opportunistic
allocator, this design ensures that the probability of
successful allocation of small objects (up to one
page in size) is high.
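
For illustration, a call site is expected to look roughly like
this (a sketch only, assuming the try_alloc_pages(nid, order) /
free_pages_nolock(page, order) signatures from this series):

        /* Can run from IRQ/NMI or with locks held; never sleeps or spins. */
        struct page *page = try_alloc_pages(NUMA_NO_NODE, 0);

        if (!page)
                return -ENOMEM; /* opportunistic: callers must tolerate failure */
        /* ... use the page ... */
        free_pages_nolock(page, 0);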

Even before we have try_kmalloc(), we already use
try_alloc_pages() in the BPF arena implementation and it's
going to be used more extensively in BPF.

Once the set was applied to bpf-next we ran into
two small conflicts with the MM tree as reported by Stephen:
https://lore.kernel.org/bpf/20250311120422.1d9a8f80@canb.auug.org.au/
https://lore.kernel.org/bpf/20250312145247.380c2aa5@canb.auug.org.au/

So Andrew suggested keeping things as-is instead of moving
the patchset between the trees before the merge window:
https://lore.kernel.org/all/20250317132710.fbcde1c8bb66f91f36e78c89@linux-foundation.org/

Note "locking/local_lock: Introduce localtry_lock_t" patch is
later used in Vlastimil's sheaves and in Shakeel's changes.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
----------------------------------------------------------------
Alexei Starovoitov (6):
      mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
      mm, bpf: Introduce free_pages_nolock()
      memcg: Use trylock to access memcg stock_lock.
      mm, bpf: Use memcg in try_alloc_pages().
      bpf: Use try_alloc_pages() to allocate pages for bpf needs.
      Merge branch 'bpf-mm-introduce-try_alloc_pages'

Sebastian Andrzej Siewior (1):
      locking/local_lock: Introduce localtry_lock_t

Vlastimil Babka (1):
      mm: Fix the flipped condition in gfpflags_allow_spinning()

 include/linux/bpf.h                 |   2 +-
 include/linux/gfp.h                 |  23 ++++
 include/linux/local_lock.h          |  70 +++++++++++++
 include/linux/local_lock_internal.h | 146 ++++++++++++++++++++++++++
 include/linux/mm_types.h            |   4 +
 include/linux/mmzone.h              |   3 +
 kernel/bpf/arena.c                  |   5 +-
 kernel/bpf/syscall.c                |  23 +++-
 lib/stackdepot.c                    |  10 +-
 mm/internal.h                       |   1 +
 mm/memcontrol.c                     |  53 +++++++---
 mm/page_alloc.c                     | 203 +++++++++++++++++++++++++++++++++---
 mm/page_owner.c                     |   8 +-
 13 files changed, 509 insertions(+), 42 deletions(-)

Comments

Linus Torvalds March 30, 2025, 8:42 p.m. UTC | #1
On Thu, 27 Mar 2025 at 07:52, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> The pull includes work from Sebastian, Vlastimil and myself
> with a lot of help from Michal and Shakeel.
> This is a first step towards making kmalloc reentrant to get rid
> of slab wrappers: bpf_mem_alloc, kretprobe's objpool, etc.
> These patches make page allocator safe from any context.

So I've pulled this too, since it looked generally fine.

The one reaction I had is that when you basically change

        spin_lock_irqsave(&zone->lock, flags);

into

        if (!spin_trylock_irqsave(&zone->lock, flags)) {
                if (unlikely(alloc_flags & ALLOC_TRYLOCK))
                        return NULL;
                spin_lock_irqsave(&zone->lock, flags);
        }

we've seen bad cache behavior for this kind of pattern in other
situations: if the "try" fails, the subsequent "do the lock for real"
case now does the wrong thing, in that it will immediately try again
even if it's almost certainly just going to fail - causing extra write
cache accesses.

So typically, in places that can see contention, it's better to either do

 (a) trylock followed by a slowpath that takes the fact that it was
locked into account and does a read-only loop until it sees otherwise

     This is, for example, what the mutex code does with that
__mutex_trylock() -> mutex_optimistic_spin() pattern, but our
spinlocks end up doing similar things (ie "trylock" followed by
"release irq and do the 'relax loop' thing).

or

 (b) do the trylock and lock separately, ie

        if (unlikely(alloc_flags & ALLOC_TRYLOCK)) {
                if (!spin_trylock_irqsave(&zone->lock, flags))
                        return NULL;
        } else
                spin_lock_irqsave(&zone->lock, flags);

so that you don't end up doing two cache accesses for ownership that
can cause extra bouncing.

I'm not sure this matters at all in the allocation path - contention
may simply not be enough of an issue, and the trylock is purely about
"unlikely NMI worries", but I do worry that you might have made the
normal case slower.

It's easily fixable later if it ends up being the case, so I don't
worry too much about it, but I did want to mention it since going
through the code made me react to it.

                Linus
Linus Torvalds March 30, 2025, 8:56 p.m. UTC | #2
On Sun, 30 Mar 2025 at 13:42, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The one reaction I had is that when you basically change

Oh, actually, two reactions now that I fixed up the merge build issue
which forced me to look at the function naming.

That 'localtry_lock_irqsave()' naming is horrendous.

The "try" part in it makes me think it's a trylock. But no, the
"localtry" just comes from the lock naming, and the trylock version is
localtry_trylock_irqsave.
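
In other words (just restating the names above):

        localtry_lock_irqsave(&lock, flags);    /* not a trylock, despite the name */
        localtry_trylock_irqsave(&lock, flags); /* the actual trylock */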

That's horrible.

I missed that on the first read-through, and I'm not unpulling it
because the code generally makes sense.

But I do think that the lock name needs fixing.

"localtry_lock_t" is not a good name, and spreading that odd
"localtry" into the actual (non-try) locking functions makes the
naming actively insane.

If the *only* operation you could do on the lock was "trylock", then
"localtry" would be fine. Then the lock literally is a "only try"
thing. But as it is, the naming now ends up actively broken.

Honestly, the lock name should probably reflect the fact that it can
be used from any context (with a "trylock"), not about the trylock
part itself.

So maybe "nmisafe_local_lock_t" or something in that vein?

Please fix this up. There aren't *that* many users of
"localtry_xyzzy", let's get this fixed before there are more of them.

             Linus
pr-tracker-bot@kernel.org March 30, 2025, 9:05 p.m. UTC | #3
The pull request you sent on Thu, 27 Mar 2025 10:51:59 -0400:

> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git tags/bpf_try_alloc_pages

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/aa918db707fba507e85217961643281ee8dfb2ed

Thank you!
Alexei Starovoitov March 30, 2025, 9:30 p.m. UTC | #4
On Sun, Mar 30, 2025 at 1:42 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 27 Mar 2025 at 07:52, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > The pull includes work from Sebastian, Vlastimil and myself
> > with a lot of help from Michal and Shakeel.
> > This is a first step towards making kmalloc reentrant to get rid
> > of slab wrappers: bpf_mem_alloc, kretprobe's objpool, etc.
> > These patches make page allocator safe from any context.
>
> So I've pulled this too, since it looked generally fine.

Thanks!

> The one reaction I had is that when you basically change
>
>         spin_lock_irqsave(&zone->lock, flags);
>
> into
>
>         if (!spin_trylock_irqsave(&zone->lock, flags)) {
>                 if (unlikely(alloc_flags & ALLOC_TRYLOCK))
>                         return NULL;
>                 spin_lock_irqsave(&zone->lock, flags);
>         }
>
> we've seen bad cache behavior for this kind of pattern in other
> situations: if the "try" fails, the subsequent "do the lock for real"
> case now does the wrong thing, in that it will immediately try again
> even if it's almost certainly just going to fail - causing extra write
> cache accesses.
>
> So typically, in places that can see contention, it's better to either do
>
>  (a) trylock followed by a slowpath that takes the fact that it was
> locked into account and does a read-only loop until it sees otherwise
>
>      This is, for example, what the mutex code does with that
> __mutex_trylock() -> mutex_optimistic_spin() pattern, but our
> spinlocks end up doing similar things (ie "trylock" followed by
> "release irq and do the 'relax loop' thing).

Right,
__mutex_trylock(lock) -> mutex_optimistic_spin() pattern is
equivalent to 'pending' bit spinning in qspinlock.

> or
>
>  (b) do the trylock and lock separately, ie
>
>         if (unlikely(alloc_flags & ALLOC_TRYLOCK)) {
>                 if (!spin_trylock_irqsave(&zone->lock, flags))
>                         return NULL;
>         } else
>                 spin_lock_irqsave(&zone->lock, flags);
>
> so that you don't end up doing two cache accesses for ownership that
> can cause extra bouncing.

Ok, I will switch to above.

> I'm not sure this matters at all in the allocation path - contention
> may simply not be enough of an issue, and the trylock is purely about
> "unlikely NMI worries", but I do worry that you might have made the
> normal case slower.

We actually did see zone->lock being contended in production.
Last time the culprit was inadequate per-cpu caching, and
this series in 6.11 fixed it:
https://lwn.net/Articles/947900/
I don't think we've seen it contended in the newer kernels.

Johannes, pls correct me if I'm wrong.

But to avoid being finger pointed, I'll switch to checking alloc_flags
first. It does seem a better trade off to avoid cache bouncing because
of 2nd cmpxchg. Though when I wrote it this way I convinced myself and
others that it's faster to do trylock first to avoid branch misprediction.
Alexei Starovoitov March 30, 2025, 9:49 p.m. UTC | #5
On Sun, Mar 30, 2025 at 1:56 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But I do think that the lock name needs fixing.
>
> "localtry_lock_t" is not a good name, and spreading that odd
> "localtry" into the actual (non-try) locking functions makes the
> naming actively insane.
>
> If the *only* operation you could do on the lock was "trylock", then
> "localtry" would be fine. Then the lock literally is a "only try"
> thing. But as it is, the naming now ends up actively broken.
>
> Honestly, the lock name should probably reflect the fact that it can
> be used from any context (with a "trylock"), not about the trylock
> part itself.
>
> So maybe "nmisafe_local_lock_t" or something in that vein?
>
> Please fix this up, There aren't *that* many users of
> "localtry_xyzzy", let's get this fixed before there are more of them.

Ok. Agree with the reasoning that the name doesn't quite fit.

nmisafe_local_lock_t name works for me,
though nmisafe_local_lock_irqsave() is a bit verbose.

Don't have better name suggestions at the moment.

Sebastian, Vlastimil,
what do you prefer ?
Linus Torvalds March 30, 2025, 10:08 p.m. UTC | #6
On Sun, 30 Mar 2025 at 14:30, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> But to avoid being finger pointed, I'll switch to checking alloc_flags
> first. It does seem a better trade off to avoid cache bouncing because
> of 2nd cmpxchg. Though when I wrote it this way I convinced myself and
> others that it's faster to do trylock first to avoid branch misprediction.

Yes, the really hot paths (ie core locking) do the "trylock -> read
spinning" for that reason. Then for the normal case, _only_ the
trylock is in the path, and that's the best of both worlds.

And in practice, the "do two compare-and-exchange" operations actually
do work fine, because the cacheline will generally be sticky enough
that you don't actually get much extra cacheline bouncing.

So I'm not sure it matters in the end, but I did react to it.

             Linus
Alexei Starovoitov March 31, 2025, 12:33 a.m. UTC | #7
On Sun, Mar 30, 2025 at 3:08 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sun, 30 Mar 2025 at 14:30, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > But to avoid being finger pointed, I'll switch to checking alloc_flags
> > first. It does seem a better trade off to avoid cache bouncing because
> > of 2nd cmpxchg. Though when I wrote it this way I convinced myself and
> > others that it's faster to do trylock first to avoid branch misprediction.
>
> Yes, the really hot paths (ie core locking) do the "trylock -> read
> spinning" for that reason. Then for the normal case, _only_ the
> trylock is in the path, and that's the best of both worlds.
>
> And in practice, the "do two compare-and-exchange" operations actually
> does work fine, because the cacheline will generally be sticky enough
> that you don't actually get many extra cachline bouncing.

Right, but I also realized that in the contended case there is
an unnecessary irq save/restore pair.
Posted the fix:
https://lore.kernel.org/bpf/20250331002809.94758-1-alexei.starovoitov@gmail.com/

maybe apply directly?

I'll send the renaming fix once we converge on a good name.
Sebastian Andrzej Siewior March 31, 2025, 7:14 a.m. UTC | #8
On 2025-03-30 14:49:25 [-0700], Alexei Starovoitov wrote:
> On Sun, Mar 30, 2025 at 1:56 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > So maybe "nmisafe_local_lock_t" or something in that vein?
> >
> > Please fix this up, There aren't *that* many users of
> > "localtry_xyzzy", let's get this fixed before there are more of them.
> 
> Ok. Agree with the reasoning that the name doesn't quite fit.
> 
> nmisafe_local_lock_t name works for me,
> though nmisafe_local_lock_irqsave() is a bit verbose.
> 
> Don't have better name suggestions at the moment.
> 
> Sebastian, Vlastimil,
> what do you prefer ?

nmisafe_local_lock_t sounds okay assuming the "nmisafe" part does not
make it look like it can be used without the trylock part in NMI context.

But yeah, it sounds better than the previous one.

Sebastian
Vlastimil Babka March 31, 2025, 9:59 a.m. UTC | #9
On 3/31/25 09:14, Sebastian Siewior wrote:
> On 2025-03-30 14:49:25 [-0700], Alexei Starovoitov wrote:
>> On Sun, Mar 30, 2025 at 1:56 PM Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>> > So maybe "nmisafe_local_lock_t" or something in that vein?
>> >
>> > Please fix this up, There aren't *that* many users of
>> > "localtry_xyzzy", let's get this fixed before there are more of them.
>> 
>> Ok. Agree with the reasoning that the name doesn't quite fit.
>> 
>> nmisafe_local_lock_t name works for me,
>> though nmisafe_local_lock_irqsave() is a bit verbose.
>> 
>> Don't have better name suggestions at the moment.
>> 
>> Sebastian, Vlastimil,
>> what do you prefer ?
> 
> nmisafe_local_lock_t sounds okay assuming the "nmisafe" part does not
> make it look like it can be used without the trylock part in NMI context.

Yes I was going to point out that e.g. "nmisafe_local_lock_irqsave()" seems
rather misleading to me, as this operation is not an nmi-safe one?

I think it comes back to what we discussed in previous versions of the
patchset. IMHO conceptually it's still a local_lock, it just supports the
new trylock operations. However, to make them possible (on non-RT) if a
particular lock instance is to be used with the trylock anywhere, it needs
the new "acquired" field, and for the non-trylock operations to work with
the field too.

So (to inform Linus) the earlier attempt [1] was to just change the existing
local_lock_t, but that adds the overhead of the field and manipulating it
for instances that don't use trylock.

The following attempt [2] meant there would be only a new local_trylock_t
type, but the existing locking operations would remain the same, relying on
_Generic() parts inside them.

It was preferred to make the implementation more obvious, but then we have
the naming issue for the operations in addition to the type...

[1]
https://lore.kernel.org/bpf/20241218030720.1602449-4-alexei.starovoitov@gmail.com/
[2]
https://lore.kernel.org/bpf/20250124035655.78899-4-alexei.starovoitov@gmail.com/
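
A toy sketch of the _Generic() dispatch idea (simplified, with
made-up names; not the actual local_lock_internal.h code):

        /* toy userspace sketch; names are made up for illustration */
        typedef struct { int dummy; } local_lock_t;
        typedef struct { int acquired; } local_trylock_t;  /* has the extra field */

        static void lock_plain(local_lock_t *l)        { (void)l; /* plain path */ }
        static void lock_try_aware(local_trylock_t *l) { l->acquired = 1; /* maintains 'acquired' */ }

        /* one operation name; the operand's type picks the implementation */
        #define local_lock(l)                                   \
                _Generic((l),                                   \
                         local_lock_t *:    lock_plain,         \
                         local_trylock_t *: lock_try_aware)(l)

        /* local_lock(&a_local_lock)    -> lock_plain()     */
        /* local_lock(&a_local_trylock) -> lock_try_aware() */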

> But yeah, it sounds better than the previous one.
> 
> Sebastian
Vlastimil Babka March 31, 2025, 1:11 p.m. UTC | #10
On 3/31/25 00:08, Linus Torvalds wrote:
> On Sun, 30 Mar 2025 at 14:30, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>> But to avoid being finger pointed, I'll switch to checking alloc_flags
>> first. It does seem a better trade off to avoid cache bouncing because
>> of 2nd cmpxchg. Though when I wrote it this way I convinced myself and
>> others that it's faster to do trylock first to avoid branch misprediction.
> 
> Yes, the really hot paths (ie core locking) do the "trylock -> read
> spinning" for that reason. Then for the normal case, _only_ the
> trylock is in the path, and that's the best of both worlds.

I've been wondering if spin locks could expose the contended slowpath so we
can trylock, and on failure do the check and then call the slowpath directly
that doesn't include another trylock.

It would also be nice if the trylock part could become inline and only the
slowpath would be a function call - even during normal spin_lock_*()
operation. AFAIK right now everything is a function call on x86_64. Not sure
how feasible would that be with the alternatives and paravirt stuff we do.

> And in practice, the "do two compare-and-exchange" operations actually
> does work fine, because the cacheline will generally be sticky enough
> that you don't actually get many extra cachline bouncing.
> 
> So I'm not sure it matters in the end, but I did react to it.
> 
>              Linus
Johannes Weiner March 31, 2025, 2:59 p.m. UTC | #11
On Sun, Mar 30, 2025 at 02:30:15PM -0700, Alexei Starovoitov wrote:
> On Sun, Mar 30, 2025 at 1:42 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, 27 Mar 2025 at 07:52, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > The pull includes work from Sebastian, Vlastimil and myself
> > > with a lot of help from Michal and Shakeel.
> > > This is a first step towards making kmalloc reentrant to get rid
> > > of slab wrappers: bpf_mem_alloc, kretprobe's objpool, etc.
> > > These patches make page allocator safe from any context.
> >
> > So I've pulled this too, since it looked generally fine.
> 
> Thanks!
> 
> > The one reaction I had is that when you basically change
> >
> >         spin_lock_irqsave(&zone->lock, flags);
> >
> > into
> >
> >         if (!spin_trylock_irqsave(&zone->lock, flags)) {
> >                 if (unlikely(alloc_flags & ALLOC_TRYLOCK))
> >                         return NULL;
> >                 spin_lock_irqsave(&zone->lock, flags);
> >         }
> >
> > we've seen bad cache behavior for this kind of pattern in other
> > situations: if the "try" fails, the subsequent "do the lock for real"
> > case now does the wrong thing, in that it will immediately try again
> > even if it's almost certainly just going to fail - causing extra write
> > cache accesses.
> >
> > So typically, in places that can see contention, it's better to either do
> >
> >  (a) trylock followed by a slowpath that takes the fact that it was
> > locked into account and does a read-only loop until it sees otherwise
> >
> >      This is, for example, what the mutex code does with that
> > __mutex_trylock() -> mutex_optimistic_spin() pattern, but our
> > spinlocks end up doing similar things (ie "trylock" followed by
> > "release irq and do the 'relax loop' thing).
> 
> Right,
> __mutex_trylock(lock) -> mutex_optimistic_spin() pattern is
> equivalent to 'pending' bit spinning in qspinlock.
> 
> > or
> >
> >  (b) do the trylock and lock separately, ie
> >
> >         if (unlikely(alloc_flags & ALLOC_TRYLOCK)) {
> >                 if (!spin_trylock_irqsave(&zone->lock, flags))
> >                         return NULL;
> >         } else
> >                 spin_lock_irqsave(&zone->lock, flags);
> >
> > so that you don't end up doing two cache accesses for ownership that
> > can cause extra bouncing.
> 
> Ok, I will switch to above.
> 
> > I'm not sure this matters at all in the allocation path - contention
> > may simply not be enough of an issue, and the trylock is purely about
> > "unlikely NMI worries", but I do worry that you might have made the
> > normal case slower.
> 
> We actually did see zone->lock being contended in production.
> Last time the culprit was an inadequate per-cpu caching and
> these series in 6.11 fixed it:
> https://lwn.net/Articles/947900/
> I don't think we've seen it contended in the newer kernels.
>
> Johannes, pls correct me if I'm wrong.

Contention should indeed be rare in practice. This has become a very
coarse lock, with nowadays hundreds of HW threads still hitting only
one or two zones. A lot rides on the fastpath per-cpu caches, and it
becomes noticeable very quickly if those are sized inappropriately.

> But to avoid being finger pointed, I'll switch to checking alloc_flags
> first. It does seem a better trade off to avoid cache bouncing because
> of 2nd cmpxchg. Though when I wrote it this way I convinced myself and
> others that it's faster to do trylock first to avoid branch misprediction.

If you haven't yet, it could be interesting to check if/where branches
are generated at all, given the proximity and the heavy inlining
between where you pass the flag and where it's tested.
Linus Torvalds March 31, 2025, 3:35 p.m. UTC | #12
On Mon, 31 Mar 2025 at 02:59, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Yes I was going to point out that e.g. "nmisafe_local_lock_irqsave()" seems
> rather misleading to me as this operation is not a nmisafe one?

Yeah, it's not a great name either, I admit.

> The following attempt [2] meant there would be only a new local_trylock_t
> type, but the existing locking operations would remain the same, relying on
> _Generic() parts inside them.

Hmm. I actually like that approach.

That avoids having the misleading operation naming. IOW, you'd not
have a "localtry" when it's not a trylock, and you'd not have
"nmisafe" when it's not an operation that is actually nmi-safe.

The downside of _Generic() is that it's a bit subtle and can hide the
actual operation, but I think that in this situation that's the whole
point.

So yes, I'd vote for the "let's just introduce the new type that has
the required 'acquired' field, and then use a _Generic() model to
automatically pick the right op".

                Linus
Alexei Starovoitov April 1, 2025, 12:57 a.m. UTC | #13
On Mon, Mar 31, 2025 at 8:35 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 31 Mar 2025 at 02:59, Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > Yes I was going to point out that e.g. "nmisafe_local_lock_irqsave()" seems
> > rather misleading to me as this operation is not a nmisafe one?
>
> Yeah, it's not a great name either, IO admit.
>
> > The following attempt [2] meant there would be only a new local_trylock_t
> > type, but the existing locking operations would remain the same, relying on
> > _Generic() parts inside them.
>
> Hmm. I actually like that approach.
>
> That avoids having the misleading operation naming. IOW, you'd not
> have a "localtry" when it's not a trylock, and you'd not have
> "nmisafe" when it's not an operation that is actually nmi-safe.
>
> The downside of _Generic() is that it's a bit subtle and can hide the
> actual operation, but I think that in this situation that's the whole
> point.
>
> So yes, I'd vote for the "let's just introduce the new type that has
> the required 'acquired' field, and then use a _Generic() model to
> automatically pick the right op".

Here is the patch that goes back to that approach:

https://lore.kernel.org/bpf/20250401005134.14433-1-alexei.starovoitov@gmail.com/

btw the compiler error when local_lock_t (instead of
local_trylock_t) is passed into local_trylock_irqsave()
is imo quite readable:

../mm/memcontrol.c: In function ‘consume_stock’:
../include/linux/local_lock_internal.h:136:20: error: assignment to
‘local_trylock_t *’ from incompatible pointer type ‘local_lock_t *’
[-Werror=incompatible-pointer-types]
  136 |                 tl = this_cpu_ptr(lock);                        \
      |                    ^
../include/linux/local_lock.h:76:9: note: in expansion of macro
‘__local_trylock_irqsave’
   76 |         __local_trylock_irqsave(lock, flags)
      |         ^~~~~~~~~~~~~~~~~~~~~~~
../mm/memcontrol.c:1790:19: note: in expansion of macro ‘local_trylock_irqsave’
 1790 |         else if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags))
      |                   ^~~~~~~~~~~~~~~~~~~~~