[RFC-PATCH,1/2] mm: Add __GFP_NO_LOCKS flag

Message ID 20200809204354.20137-2-urezki@gmail.com
State New
Series
  • __GFP_NO_LOCKS

Commit Message

Uladzislau Rezki Aug. 9, 2020, 8:43 p.m. UTC
Some background and kfree_rcu()
===============================
The pointers to be freed are stored in a per-CPU array to improve
performance, to enable an easier-to-use API, to accommodate vmalloc
memory and to support the single-argument form of kfree_rcu() when
only a pointer is passed. More details are below.

In order to maintain such per-CPU arrays there is a need for dynamic
allocation when the current array is fully populated and a new block
is required. See the example below:

 0 1 2 3      0 1 2 3
|p|p|p|p| -> |p|p|p|p| -> NULL

there are two pointer blocks, each one can store 4 addresses
which will be freed after a grace period has passed. In reality
we store PAGE_SIZE / sizeof(void *) entries. To maintain such
blocks a single page is obtained via the page allocator:

    bnode = (struct kvfree_rcu_bulk_data *)
        __get_free_page(GFP_NOWAIT | __GFP_NOWARN);

after that it is attached to the "head" and its "next" pointer is
set to the previous "head", so the list of blocks can be maintained
and grow dynamically until it gets drained by the reclaiming thread.
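The list manipulation above can be sketched in plain C. This is a hypothetical user-space model: `bulk_data`, `add_ptr` and the block size of 4 are illustrative stand-ins for the kernel's kvfree_rcu_bulk_data handling, not the kernel code itself:

```c
#include <assert.h>
#include <stdlib.h>

#define BLOCK_NR 4 /* the kernel stores PAGE_SIZE / sizeof(void *) */

/* Simplified stand-in for the kernel's kvfree_rcu_bulk_data. */
struct bulk_data {
    struct bulk_data *next;   /* points at the previous "head" */
    unsigned long nr;         /* entries used in records[] */
    void *records[BLOCK_NR];  /* pointers awaiting a grace period */
};

/* Queue a pointer, attaching a fresh block when the head is full. */
static struct bulk_data *add_ptr(struct bulk_data *head, void *ptr)
{
    if (!head || head->nr == BLOCK_NR) {
        /* models __get_free_page(GFP_NOWAIT | __GFP_NOWARN) */
        struct bulk_data *bnode = calloc(1, sizeof(*bnode));
        if (!bnode)
            return head;      /* caller takes the fallback path */
        bnode->next = head;   /* "next" set to the previous head */
        head = bnode;
    }
    head->records[head->nr++] = ptr;
    return head;
}
```

Queueing a fifth pointer into 4-entry blocks yields a fresh head holding one entry, chained to a full block of four.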

Please note that there is always a fallback if an allocation fails. In
the single-argument case this is a call to synchronize_rcu(), and in
the two-argument case it is to use the rcu_head structure embedded in
the object being freed, paying the cache-miss penalty and invoking
kfree() per object instead of kfree_bulk() for groups of objects.
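The fallback choice described above can be modeled roughly as follows. This is only a sketch: `pick_path` and the enum tags are invented names, and the real code performs the RCU/slab calls rather than returning a tag:

```c
#include <assert.h>
#include <stdbool.h>

/* The three outcomes described above (illustrative names). */
enum kfree_rcu_path {
    USE_BULK,            /* block obtained: batch via kfree_bulk()  */
    SYNC_RCU,            /* single arg: synchronize_rcu(); kfree()  */
    PER_OBJECT_RCU_HEAD, /* two args: queue via embedded rcu_head   */
};

/* single_arg: the kvfree_rcu(ptr) form, no rcu_head offset known.
 * got_block : the __get_free_page() above succeeded. */
static enum kfree_rcu_path pick_path(bool single_arg, bool got_block)
{
    if (got_block)
        return USE_BULK;
    return single_arg ? SYNC_RCU : PER_OBJECT_RCU_HEAD;
}
```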

Why do we maintain arrays/blocks instead of linking objects via the
regular "struct rcu_head" technique? The main reasons are:

a) Memory can be reclaimed by invoking the kfree_bulk()
   interface, which requires passing an array and the number
   of entries in it. That avoids the per-object overhead of
   calling kfree() per object and reduces the reclamation
   time.

b) It improves locality and reduces the number of cache misses
   caused by "pointer chasing" between objects, which can be
   spread far apart from each other.

c) Supports the "single argument" form of kvfree_rcu():
   void *ptr = kvmalloc(some_bytes, GFP_KERNEL);
   if (ptr)
        kvfree_rcu(ptr);

   We need it when an "rcu_head" is not embedded into a structure
   but an object must still be freed after a grace period. With the
   single argument, such objects cannot be queued on a linked list.

   So far there has been no single-argument form, but we see the
   demand for it; to work around its absence people just use a
   simple but inefficient sequence:
   <snip>
       synchronize_rcu(); /* Can be long and blocks a current context */
       kfree(p);
   <snip>

   More details are here: https://lkml.org/lkml/2020/4/28/1626

d) To distinguish vmalloc pointers from SLAB ones. It becomes possible
   to invoke the right freeing API for the right kind of pointer, kfree_bulk()
   or TBD: vmalloc_bulk().

Also, please have a look here: https://lkml.org/lkml/2020/7/30/1166

Limitations and concerns (Main part)
====================================
The current memory-allocation interface presents the following
difficulties that this patch is designed to overcome:

a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, lockdep will
   complain about a violation ("BUG: Invalid wait context") of
   the nesting rules. It performs raw_spinlock vs. spinlock
   nesting checks, i.e. it is not legal to acquire a spinlock_t
   while holding a raw_spinlock_t.

   Internally kfree_rcu() uses raw_spinlock_t (in the rcu-dev
   branch), whereas the "page allocator" internally deals with
   spinlock_t to access its zones. The code can also be broken
   from a higher level of view:
   <snip>
       raw_spin_lock(&some_lock);
       kfree_rcu(some_pointer, some_field_offset);
   <snip>

b) If built with CONFIG_PREEMPT_RT. Please note, in that case
   spinlock_t is converted into a sleepable variant. Invoking the
   page allocator from atomic contexts then leads to
   "BUG: scheduling while atomic".

Proposal
========
1) Introduce a GFP_* flag that ensures the allocator returns NULL
rather than acquiring its own spinlock_t. Having such a flag will
address the (a) and (b) limitations described above. It will also
make the kfree_rcu() code common for RT and regular kernels, cleaner,
with fewer corner cases to handle and reduced code size.

Description:
The page allocator has two phases, a fast path and a slow one. We are
interested in the fast path and order-0 allocations. The fast path in
its turn is divided into two steps: a lock-less one and one that is
not:

a) As a first step the page allocator tries to obtain a page from
   the per-CPU list, of which each CPU has its own. That is why this
   step is lock-less and fast. Basically it disables irqs on the
   current CPU in order to access the per-CPU data and removes the
   first element from the pcp-list. The element/page is returned
   to the user.

b) If no page is available in the per-CPU list, the second step is
   involved. It removes a specified number of elements from the buddy
   allocator, transferring them to the per-CPU list described in [1].

The number of pre-fetched elements can be controlled via the
/proc/sys/vm/percpu_pagelist_fraction sysctl. This step is not
lock-less: it uses spinlock_t for accessing the buddy zone and
is fully covered by the rmqueue_bulk() function.
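The two steps above can be sketched as a user-space model. This is hypothetical: `PCP_BATCH`, `refill_from_buddy` and `alloc_page_fast` are invented stand-ins for the pcp-list and rmqueue_bulk() behaviour, with the locking reduced to comments:

```c
#include <assert.h>
#include <stddef.h>

#define PCP_BATCH 8 /* refill batch size; tunable in the kernel */

struct pcp_list { void *pages[PCP_BATCH]; int count; };

static int pcp_stub_page;         /* stand-in "page" target */
static int buddy_pages = 64;      /* free pages left in the zone */
static struct pcp_list pcp;       /* one instance per CPU in reality */

/* Step [2], the rmqueue_bulk() analogue: refills the per-CPU list
 * from the buddy zone and is the part that needs spinlock_t. */
static void refill_from_buddy(void)
{
    /* spin_lock(&zone->lock); */
    while (pcp.count < PCP_BATCH && buddy_pages > 0) {
        pcp.pages[pcp.count++] = &pcp_stub_page;
        buddy_pages--;
    }
    /* spin_unlock(&zone->lock); */
}

/* Step [1]: irqs-off, lock-less pop from the per-CPU list. With a
 * __GFP_NO_LOCKS-style policy we return NULL instead of entering
 * step [2], trading a higher failure rate for never taking the lock. */
static void *alloc_page_fast(int no_locks)
{
    if (pcp.count == 0) {
        if (no_locks)
            return NULL;      /* refuse to acquire spinlock_t */
        refill_from_buddy();
        if (pcp.count == 0)
            return NULL;      /* zone exhausted */
    }
    return pcp.pages[--pcp.count];
}
```

With an empty per-CPU list, a no-locks request fails immediately, a normal request refills and succeeds, and subsequent no-locks requests succeed from the cache.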

Summarizing: __GFP_NO_LOCKS covers only step [1] and cannot perform
step [2], because [2] acquires spinlock_t. This implies that it is
super fast, but a higher rate of failures is also expected. Having
such a flag will address the (a) and (b) limitations described above.

Usage: __get_free_page(__GFP_NO_LOCKS);

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 include/linux/gfp.h            | 11 +++++++++--
 include/trace/events/mmflags.h |  1 +
 mm/page_alloc.c                | 31 +++++++++++++++++++++++++------
 tools/perf/builtin-kmem.c      |  1 +
 4 files changed, 36 insertions(+), 8 deletions(-)

Comments

Michal Hocko Aug. 10, 2020, 12:31 p.m. UTC | #1
On Sun 09-08-20 22:43:53, Uladzislau Rezki (Sony) wrote:
[...]
> Limitations and concerns (Main part)
> ====================================
> The current memmory-allocation interface presents to following
> difficulties that this patch is designed to overcome:
> 
> a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, the lockdep will
>    complain about violation("BUG: Invalid wait context") of the
>    nesting rules. It does the raw_spinlock vs. spinlock nesting
>    checks, i.e. it is not legal to acquire a spinlock_t while
>    holding a raw_spinlock_t.
> 
>    Internally the kfree_rcu() uses raw_spinlock_t(in rcu-dev branch)
>    whereas the "page allocator" internally deals with spinlock_t to
>    access to its zones. The code also can be broken from higher level
>    of view:
>    <snip>
>        raw_spin_lock(&some_lock);
>        kfree_rcu(some_pointer, some_field_offset);
>    <snip>

Is there any fundamental problem to make zone raw_spin_lock?

> b) If built with CONFIG_PREEMPT_RT. Please note, in that case spinlock_t
>    is converted into sleepable variant. Invoking the page allocator from
>    atomic contexts leads to "BUG: scheduling while atomic".

[...]

> Proposal
> ========
> 1) Make GFP_* that ensures that the allocator returns NULL rather
> than acquire its own spinlock_t. Having such flag will address a and b
> limitations described above. It will also make the kfree_rcu() code
> common for RT and regular kernel, more clean, less handling corner
> cases and reduce the code size.

I do not think this is a good idea. Single purpose gfp flags that tend
to heavily depend on the current implementation of the page allocator
have turned out to be problematic. Users used to misunderstand their
meaning resulting in a lot of abuse which was not trivial to remove.
This flag seem to fall into exactly this sort of category. If there is a
problem in nesting then that should be addressed rather than a new flag
exported IMHO. If that is absolutely not possible for some reason then
we can try to figure out what to do but that really need a very strong
justification.
Uladzislau Rezki Aug. 10, 2020, 4:07 p.m. UTC | #2
> On Sun 09-08-20 22:43:53, Uladzislau Rezki (Sony) wrote:
> [...]
> > Limitations and concerns (Main part)
> > ====================================
> > The current memmory-allocation interface presents to following
> > difficulties that this patch is designed to overcome:
> > 
> > a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, the lockdep will
> >    complain about violation("BUG: Invalid wait context") of the
> >    nesting rules. It does the raw_spinlock vs. spinlock nesting
> >    checks, i.e. it is not legal to acquire a spinlock_t while
> >    holding a raw_spinlock_t.
> > 
> >    Internally the kfree_rcu() uses raw_spinlock_t(in rcu-dev branch)
> >    whereas the "page allocator" internally deals with spinlock_t to
> >    access to its zones. The code also can be broken from higher level
> >    of view:
> >    <snip>
> >        raw_spin_lock(&some_lock);
> >        kfree_rcu(some_pointer, some_field_offset);
> >    <snip>
> 
> Is there any fundamental problem to make zone raw_spin_lock?
> 
Good point. Converting a regular spinlock to the raw_* variant can
solve the issue and to me it seems partly reasonable. However, there
are other questions if we do it:

a) What to do with kswapd and the "wake-up path" that uses a
   sleepable lock:
    wakeup_kswapd() -> wake_up_interruptible(&pgdat->kswapd_wait).

b) How will RT people react to it? I guess they will not be happy.

As i described before, calling __get_free_page(0) with 0 as the
argument will solve (a). How correct is that? From my point of view
the logic that bypasses the wakeup path should be explicitly defined.

Or we can enter the allocator with (__GFP_HIGH|__GFP_ATOMIC), which
bypasses __GFP_KSWAPD_RECLAIM as well.

Any thoughts here? Please comment.

Having the proposed flag will not hurt RT latency and solves all the concerns.

> > b) If built with CONFIG_PREEMPT_RT. Please note, in that case spinlock_t
> >    is converted into sleepable variant. Invoking the page allocator from
> >    atomic contexts leads to "BUG: scheduling while atomic".
> 
> [...]
> 
> > Proposal
> > ========
> > 1) Make GFP_* that ensures that the allocator returns NULL rather
> > than acquire its own spinlock_t. Having such flag will address a and b
> > limitations described above. It will also make the kfree_rcu() code
> > common for RT and regular kernel, more clean, less handling corner
> > cases and reduce the code size.
> 
> I do not think this is a good idea. Single purpose gfp flags that tend
> to heavily depend on the current implementation of the page allocator
> have turned out to be problematic. Users used to misunderstand their
> meaning resulting in a lot of abuse which was not trivial to remove.
> This flag seem to fall into exactly this sort of category. If there is a
> problem in nesting then that should be addressed rather than a new flag
> exported IMHO. If that is absolutely not possible for some reason then
> we can try to figure out what to do but that really need a very strong
> justification.
> 
The problem that i see is that we can not use the page allocator from
atomic contexts, which is our case:

<snip>
    local_irq_save(flags) or preempt_disable() or raw_spinlock();
    __get_free_page(GFP_ATOMIC);
<snip>

So if we can convert the page allocator to a raw_* lock it will be
appreciated, at least from our side; IMHO, not from the RT one. But
as i stated above we need to sort out the raised questions if the
conversion is done.

What is your view?

Thank you for your help and feedback!

--
Vlad Rezki
Michal Hocko Aug. 10, 2020, 7:25 p.m. UTC | #3
On Mon 10-08-20 18:07:39, Uladzislau Rezki wrote:
> > On Sun 09-08-20 22:43:53, Uladzislau Rezki (Sony) wrote:
> > [...]
> > > Limitations and concerns (Main part)
> > > ====================================
> > > The current memmory-allocation interface presents to following
> > > difficulties that this patch is designed to overcome:
> > > 
> > > a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, the lockdep will
> > >    complain about violation("BUG: Invalid wait context") of the
> > >    nesting rules. It does the raw_spinlock vs. spinlock nesting
> > >    checks, i.e. it is not legal to acquire a spinlock_t while
> > >    holding a raw_spinlock_t.
> > > 
> > >    Internally the kfree_rcu() uses raw_spinlock_t(in rcu-dev branch)
> > >    whereas the "page allocator" internally deals with spinlock_t to
> > >    access to its zones. The code also can be broken from higher level
> > >    of view:
> > >    <snip>
> > >        raw_spin_lock(&some_lock);
> > >        kfree_rcu(some_pointer, some_field_offset);
> > >    <snip>
> > 
> > Is there any fundamental problem to make zone raw_spin_lock?
> > 
> Good point. Converting a regular spinlock to the raw_* variant can solve 
> an issue and to me it seems partly reasonable. Because there are other
> questions if we do it:
> 
> a) what to do with kswapd and "wake-up path" that uses sleepable lock:
>     wakeup_kswapd() -> wake_up_interruptible(&pgdat->kswapd_wait).

If there is no RT-friendly variant for waking up a process from
atomic context then we might need to special-case this for the RT
tree.

> b) How RT people reacts on it? I guess they will no be happy.

zone->lock should be held for a very limited amount of time.

> As i described before, calling the __get_free_page(0) with 0 as argument
> will solve the (a). How correctly is it? From my point of view the logic
> that bypass the wakeup path should be explicitly defined.

gfp_mask == 0 is GFP_NOWAIT (aka an atomic allocation request) which
doesn't wake up kswapd. So if the wakeup is a problem then this would be
a way to go.

> Or we can enter the allocator with (__GFP_HIGH|__GFP_ATOMIC) that bypass
> the __GFP_KSWAPD_RECLAIM as well.

This would be an alternative which consumes memory reserves. Is this
really needed for the particular case?

> 
> Any thoughts here? Please comment.
> 
> Having proposed flag will not heart RT latency and solve all concerns.
> 
> > > b) If built with CONFIG_PREEMPT_RT. Please note, in that case spinlock_t
> > >    is converted into sleepable variant. Invoking the page allocator from
> > >    atomic contexts leads to "BUG: scheduling while atomic".
> > 
> > [...]
> > 
> > > Proposal
> > > ========
> > > 1) Make GFP_* that ensures that the allocator returns NULL rather
> > > than acquire its own spinlock_t. Having such flag will address a and b
> > > limitations described above. It will also make the kfree_rcu() code
> > > common for RT and regular kernel, more clean, less handling corner
> > > cases and reduce the code size.
> > 
> > I do not think this is a good idea. Single purpose gfp flags that tend
> > to heavily depend on the current implementation of the page allocator
> > have turned out to be problematic. Users used to misunderstand their
> > meaning resulting in a lot of abuse which was not trivial to remove.
> > This flag seem to fall into exactly this sort of category. If there is a
> > problem in nesting then that should be addressed rather than a new flag
> > exported IMHO. If that is absolutely not possible for some reason then
> > we can try to figure out what to do but that really need a very strong
> > justification.
> > 
> The problem that i see is we can not use the page allocator from atomic
> contexts, what is our case:
> 
> <snip>
>     local_irq_save(flags) or preempt_disable() or raw_spinlock();
>     __get_free_page(GFP_ATOMIC);
> <snip>
> 
> So if we can convert the page allocator to raw_* lock it will be appreciated,
> at least from our side, IMHO, not from RT one. But as i stated above we need
> to sort raised questions out if converting is done.
> 
> What is your view?

To me it would make more sense to support atomic allocations also for
the RT tree. Having both GFP_NOWAIT and GFP_ATOMIC which do not really
work for atomic context in RT sounds subtle and wrong.
Michal Hocko Aug. 11, 2020, 8:19 a.m. UTC | #4
On Mon 10-08-20 21:25:26, Michal Hocko wrote:
> On Mon 10-08-20 18:07:39, Uladzislau Rezki wrote:
[...]
> > The problem that i see is we can not use the page allocator from atomic
> > contexts, what is our case:
> > 
> > <snip>
> >     local_irq_save(flags) or preempt_disable() or raw_spinlock();
> >     __get_free_page(GFP_ATOMIC);
> > <snip>
> > 
> > So if we can convert the page allocator to raw_* lock it will be appreciated,
> > at least from our side, IMHO, not from RT one. But as i stated above we need
> > to sort raised questions out if converting is done.
> > 
> > What is your view?
> 
> To me it would make more sense to support atomic allocations also for
> the RT tree. Having both GFP_NOWAIT and GFP_ATOMIC which do not really
> work for atomic context in RT sounds subtle and wrong.

I was thinking about this some more. I still think the above would be
a reasonable goal we should try to achieve, if for no other reason
than future maintainability (especially after the RT patchset is
merged). I have tried to search for any known problems/attempts to
make zone->lock raw but couldn't find anything. Maybe somebody more
involved in the RT world has something to say about that.

Anyway, if the zone->lock is not a good fit for raw_spin_lock then the
only way I can see forward is to detect real (RT) atomic contexts and
bail out early before taking the lock in the allocator for NOWAIT/ATOMIC
requests.
Uladzislau Rezki Aug. 11, 2020, 9:18 a.m. UTC | #5
On Mon, Aug 10, 2020 at 09:25:25PM +0200, Michal Hocko wrote:
> On Mon 10-08-20 18:07:39, Uladzislau Rezki wrote:
> > > On Sun 09-08-20 22:43:53, Uladzislau Rezki (Sony) wrote:
> > > [...]
> > > > Limitations and concerns (Main part)
> > > > ====================================
> > > > The current memmory-allocation interface presents to following
> > > > difficulties that this patch is designed to overcome:
> > > > 
> > > > a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, the lockdep will
> > > >    complain about violation("BUG: Invalid wait context") of the
> > > >    nesting rules. It does the raw_spinlock vs. spinlock nesting
> > > >    checks, i.e. it is not legal to acquire a spinlock_t while
> > > >    holding a raw_spinlock_t.
> > > > 
> > > >    Internally the kfree_rcu() uses raw_spinlock_t(in rcu-dev branch)
> > > >    whereas the "page allocator" internally deals with spinlock_t to
> > > >    access to its zones. The code also can be broken from higher level
> > > >    of view:
> > > >    <snip>
> > > >        raw_spin_lock(&some_lock);
> > > >        kfree_rcu(some_pointer, some_field_offset);
> > > >    <snip>
> > > 
> > > Is there any fundamental problem to make zone raw_spin_lock?
> > > 
> > Good point. Converting a regular spinlock to the raw_* variant can solve 
> > an issue and to me it seems partly reasonable. Because there are other
> > questions if we do it:
> > 
> > a) what to do with kswapd and "wake-up path" that uses sleepable lock:
> >     wakeup_kswapd() -> wake_up_interruptible(&pgdat->kswapd_wait).
> 
> If there is no RT friendly variant for waking up process from the atomic
> context then we might need to special case this for the RT tree.
> 
I do not see it in the RT kernel. The waiting primitives, see wait.c,
use sleepable locks all over the file.

> > b) How RT people reacts on it? I guess they will no be happy.
> 
> zone->lock should be held for a very limited amount of time.
> 
> > As i described before, calling the __get_free_page(0) with 0 as argument
> > will solve the (a). How correctly is it? From my point of view the logic
> > that bypass the wakeup path should be explicitly defined.
> 
> gfp_mask == 0 is GFP_NOWAIT (aka an atomic allocation request) which
> doesn't wake up kswapd. So if the wakeup is a problem then this would be
> a way to go.
> 
What do you mean, Michal? A gfp_mask of 0 is not GFP_NOWAIT:

#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)

it does wake up kswapd. Or am i missing something? Please comment.
If we want to avoid waking kswapd, should we define something special?

#define GFP_NOWWAKE_KSWAPD 0

> > Or we can enter the allocator with (__GFP_HIGH|__GFP_ATOMIC) that bypass
> > the __GFP_KSWAPD_RECLAIM as well.
> 
> This would be an alternative which consumes memory reserves. Is this
> really needed for the particular case?
> 
No. That was just another example illustrating how to bypass the
__GFP_KSWAPD_RECLAIM.

> > 
> > Any thoughts here? Please comment.
> > 
> > Having proposed flag will not heart RT latency and solve all concerns.
> > 
> > > > b) If built with CONFIG_PREEMPT_RT. Please note, in that case spinlock_t
> > > >    is converted into sleepable variant. Invoking the page allocator from
> > > >    atomic contexts leads to "BUG: scheduling while atomic".
> > > 
> > > [...]
> > > 
> > > > Proposal
> > > > ========
> > > > 1) Make GFP_* that ensures that the allocator returns NULL rather
> > > > than acquire its own spinlock_t. Having such flag will address a and b
> > > > limitations described above. It will also make the kfree_rcu() code
> > > > common for RT and regular kernel, more clean, less handling corner
> > > > cases and reduce the code size.
> > > 
> > > I do not think this is a good idea. Single purpose gfp flags that tend
> > > to heavily depend on the current implementation of the page allocator
> > > have turned out to be problematic. Users used to misunderstand their
> > > meaning resulting in a lot of abuse which was not trivial to remove.
> > > This flag seem to fall into exactly this sort of category. If there is a
> > > problem in nesting then that should be addressed rather than a new flag
> > > exported IMHO. If that is absolutely not possible for some reason then
> > > we can try to figure out what to do but that really need a very strong
> > > justification.
> > > 
> > The problem that i see is we can not use the page allocator from atomic
> > contexts, what is our case:
> > 
> > <snip>
> >     local_irq_save(flags) or preempt_disable() or raw_spinlock();
> >     __get_free_page(GFP_ATOMIC);
> > <snip>
> > 
> > So if we can convert the page allocator to raw_* lock it will be appreciated,
> > at least from our side, IMHO, not from RT one. But as i stated above we need
> > to sort raised questions out if converting is done.
> > 
> > What is your view?
> 
> To me it would make more sense to support atomic allocations also for
> the RT tree. Having both GFP_NOWAIT and GFP_ATOMIC which do not really
> work for atomic context in RT sounds subtle and wrong.
>
Same view on it.

Thank you for your comments!

--
Vlad Rezki
Uladzislau Rezki Aug. 11, 2020, 9:37 a.m. UTC | #6
On Tue, Aug 11, 2020 at 10:19:17AM +0200, Michal Hocko wrote:
> On Mon 10-08-20 21:25:26, Michal Hocko wrote:
> > On Mon 10-08-20 18:07:39, Uladzislau Rezki wrote:
> [...]
> > > The problem that i see is we can not use the page allocator from atomic
> > > contexts, what is our case:
> > > 
> > > <snip>
> > >     local_irq_save(flags) or preempt_disable() or raw_spinlock();
> > >     __get_free_page(GFP_ATOMIC);
> > > <snip>
> > > 
> > > So if we can convert the page allocator to raw_* lock it will be appreciated,
> > > at least from our side, IMHO, not from RT one. But as i stated above we need
> > > to sort raised questions out if converting is done.
> > > 
> > > What is your view?
> > 
> > To me it would make more sense to support atomic allocations also for
> > the RT tree. Having both GFP_NOWAIT and GFP_ATOMIC which do not really
> > work for atomic context in RT sounds subtle and wrong.
> 
> I was thinking about this some more. I still think the above would be a
> reasonable goal we should try to achieve. If for not other then for
> future maintainability (especially after the RT patchset is merged).
> I have tried to search for any known problems/attempts to make
> zone->lock raw but couldn't find anything. Maybe somebody more involved
> in RT world have something to say about that.
> 
I tried yesterday to convert zone->lock. See below the files i had to modify:
<snip>
        modified:   include/linux/mmzone.h
        modified:   mm/compaction.c
        modified:   mm/memory_hotplug.c
        modified:   mm/page_alloc.c
        modified:   mm/page_isolation.c
        modified:   mm/page_reporting.c
        modified:   mm/shuffle.c
        modified:   mm/vmscan.c
        modified:   mm/vmstat.c
<snip>

There is one more lock, the zone->lru_lock. zone->lock and this one
intersect with each other. If the lru_lock can be nested under
zone->lock it should be converted as well, but i need to analyze it
further. There are two wrapper functions which are used as a common
interface to lock/unlock both locks. See compact_lock_irqsave() and
compact_unlock_should_abort_lru() in mm/compaction.c.

Any thoughts here?

Anyway i tried to convert only zone->lock and to use the page
allocator passing gfp_mask=0 as the argument. It works:
CONFIG_PROVE_RAW_LOCK_NESTING does not complain about any "bad"
lock nesting.

> Anyway, if the zone->lock is not a good fit for raw_spin_lock then the
> only way I can see forward is to detect real (RT) atomic contexts and
> bail out early before taking the lock in the allocator for NOWAIT/ATOMIC
> requests.
>
For the RT kernel we can detect it for sure. preemptible() works just
fine there, i.e. we can identify the context we are currently in.
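The bail-out idea can be modeled as below. This is a hypothetical sketch: `atomic_ctx` stands in for "!preemptible()" on an RT kernel and `rt_alloc_nowait` is an invented name, not a kernel API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for detecting a non-preemptible context on RT. */
static bool atomic_ctx;

static char fake_page[4096]; /* pretend slow-path result */

/* Fail a NOWAIT/ATOMIC request before any sleeping lock would be
 * taken, returning NULL so the caller takes its fallback path. */
static void *rt_alloc_nowait(void)
{
    if (atomic_ctx)
        return NULL;  /* cannot take the sleeping zone lock here */
    /* ... normal allocation, may acquire sleeping zone->lock ... */
    return fake_page;
}
```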

Thanks!

--
Vlad Rezki
Uladzislau Rezki Aug. 11, 2020, 9:42 a.m. UTC | #7
On Tue, Aug 11, 2020 at 11:37:13AM +0200, Uladzislau Rezki wrote:
> On Tue, Aug 11, 2020 at 10:19:17AM +0200, Michal Hocko wrote:
> > On Mon 10-08-20 21:25:26, Michal Hocko wrote:
> > > On Mon 10-08-20 18:07:39, Uladzislau Rezki wrote:
> > [...]
> > > > The problem that i see is we can not use the page allocator from atomic
> > > > contexts, what is our case:
> > > > 
> > > > <snip>
> > > >     local_irq_save(flags) or preempt_disable() or raw_spinlock();
> > > >     __get_free_page(GFP_ATOMIC);
> > > > <snip>
> > > > 
> > > > So if we can convert the page allocator to raw_* lock it will be appreciated,
> > > > at least from our side, IMHO, not from RT one. But as i stated above we need
> > > > to sort raised questions out if converting is done.
> > > > 
> > > > What is your view?
> > > 
> > > To me it would make more sense to support atomic allocations also for
> > > the RT tree. Having both GFP_NOWAIT and GFP_ATOMIC which do not really
> > > work for atomic context in RT sounds subtle and wrong.
> > 
> > I was thinking about this some more. I still think the above would be a
> > reasonable goal we should try to achieve. If for not other then for
> > future maintainability (especially after the RT patchset is merged).
> > I have tried to search for any known problems/attempts to make
> > zone->lock raw but couldn't find anything. Maybe somebody more involved
> > in RT world have something to say about that.
> > 
> I tried yesterday to convert zone->lock. See below files i had to modify:
> <snip>
>         modified:   include/linux/mmzone.h
>         modified:   mm/compaction.c
>         modified:   mm/memory_hotplug.c
>         modified:   mm/page_alloc.c
>         modified:   mm/page_isolation.c
>         modified:   mm/page_reporting.c
>         modified:   mm/shuffle.c
>         modified:   mm/vmscan.c
>         modified:   mm/vmstat.c
> <snip>
> 
> There is one more lock, that is zone->lru_lock one. Both zone->lock and this
> one intersect between each other. If the lru_lock can be nested under zone->lock
> it should be converted as well. But i need to analyze it farther. There are
> two wrapper functions which are used as common interface to lock/unlock both
> locks. See compact_lock_irqsave()/compact_unlock_should_abort_lru() in the 
> mm/compaction.c.
> 
> Any thoughts here?
> 
> Anyway i tried to convert only zone->lock and use page allocator passing there
> gfp_mask=0 as argument. So it works. CONFIG_PROVE_RAW_LOCK_NESTING does not
> complain about any "bad" lock nesting.
> 
> > Anyway, if the zone->lock is not a good fit for raw_spin_lock then the
> > only way I can see forward is to detect real (RT) atomic contexts and
> > bail out early before taking the lock in the allocator for NOWAIT/ATOMIC
> > requests.
> >
This is similar to what i have done with "mm: Add __GFP_NO_LOCKS flag".
I just did it for order-0 pages (other paths are impossible) and made
it common for any kernel.

Because when you say "bail out early" i suspect that we would still
like to check the per-cpu-list cache.

> For RT kernel we can detect it for sure. preemtable() works just fine there,
> i.e. we can identify the context we are currently in.
> 

--
Vlad Rezki
Michal Hocko Aug. 11, 2020, 10:21 a.m. UTC | #8
On Tue 11-08-20 11:18:07, Uladzislau Rezki wrote:
> On Mon, Aug 10, 2020 at 09:25:25PM +0200, Michal Hocko wrote:
> > On Mon 10-08-20 18:07:39, Uladzislau Rezki wrote:
> > > > On Sun 09-08-20 22:43:53, Uladzislau Rezki (Sony) wrote:
[...]
> > > As i described before, calling the __get_free_page(0) with 0 as argument
> > > will solve the (a). How correctly is it? From my point of view the logic
> > > that bypass the wakeup path should be explicitly defined.
> > 
> > gfp_mask == 0 is GFP_NOWAIT (aka an atomic allocation request) which
> > doesn't wake up kswapd. So if the wakeup is a problem then this would be
> > a way to go.
> > 
> What do you mean Michal? gfp_mask 0 != GFP_NOWAIT:
> 
> #define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)
> 
> it does wakeup of the kswapd. Or am i missing something? Please comment.
> If we are about to avoid the kswapd, should we define something special?
> 
> #define GFP_NOWWAKE_KSWAPD 0

Sorry, I was more cryptic than necessary. What I meant is that
GFP_NOWAIT is the basic non-sleepable allocation. It does wake up
kswapd, but the lack of that wakeup can be expressed as
GFP_NOWAIT & ~__GFP_KSWAPD_RECLAIM, which is 0 now. The mouthful
variant is better for future maintainability.
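The flag arithmetic above can be checked with a tiny sketch. The bit value here is illustrative, not the kernel's actual one; only the identity matters (the document's own definition, GFP_NOWAIT == __GFP_KSWAPD_RECLAIM, is assumed):

```c
#include <assert.h>

/* Illustrative bit position; GFP_NOWAIT is defined as just the
 * kswapd-wakeup bit, per the thread above. */
#define __GFP_KSWAPD_RECLAIM 0x400u
#define GFP_NOWAIT           (__GFP_KSWAPD_RECLAIM)

/* "GFP_NOWAIT minus the kswapd wakeup" collapses to the zero mask,
 * which is why __get_free_page(0) skips waking kswapd today. */
enum { GFP_NOWAIT_NO_WAKE = GFP_NOWAIT & ~__GFP_KSWAPD_RECLAIM };
```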
Michal Hocko Aug. 11, 2020, 10:26 a.m. UTC | #9
On Tue 11-08-20 11:37:13, Uladzislau Rezki wrote:
> On Tue, Aug 11, 2020 at 10:19:17AM +0200, Michal Hocko wrote:
> > On Mon 10-08-20 21:25:26, Michal Hocko wrote:
> > > On Mon 10-08-20 18:07:39, Uladzislau Rezki wrote:
> > [...]
> > > > The problem that i see is we can not use the page allocator from atomic
> > > > contexts, what is our case:
> > > > 
> > > > <snip>
> > > >     local_irq_save(flags) or preempt_disable() or raw_spinlock();
> > > >     __get_free_page(GFP_ATOMIC);
> > > > <snip>
> > > > 
> > > > So if we can convert the page allocator to raw_* lock it will be appreciated,
> > > > at least from our side, IMHO, not from RT one. But as i stated above we need
> > > > to sort raised questions out if converting is done.
> > > > 
> > > > What is your view?
> > > 
> > > To me it would make more sense to support atomic allocations also for
> > > the RT tree. Having both GFP_NOWAIT and GFP_ATOMIC which do not really
> > > work for atomic context in RT sounds subtle and wrong.
> > 
> > I was thinking about this some more. I still think the above would be a
> > reasonable goal we should try to achieve. If for not other then for
> > future maintainability (especially after the RT patchset is merged).
> > I have tried to search for any known problems/attempts to make
> > zone->lock raw but couldn't find anything. Maybe somebody more involved
> > in RT world have something to say about that.
> > 
> I tried yesterday to convert zone->lock. See below files i had to modify:
> <snip>
>         modified:   include/linux/mmzone.h
>         modified:   mm/compaction.c
>         modified:   mm/memory_hotplug.c
>         modified:   mm/page_alloc.c
>         modified:   mm/page_isolation.c
>         modified:   mm/page_reporting.c
>         modified:   mm/shuffle.c
>         modified:   mm/vmscan.c
>         modified:   mm/vmstat.c
> <snip>
> 
> There is one more lock, the zone->lru_lock. Both zone->lock and this
> one intersect with each other. If the lru_lock can be nested under zone->lock
> it should be converted as well. But I need to analyze it further. There are
> two wrapper functions which are used as a common interface to lock/unlock both
> locks. See compact_lock_irqsave()/compact_unlock_should_abort_lru() in
> mm/compaction.c.
> 
> Any thoughts here?

I am not an expert on compaction. Vlastimil would know better. My
thinking was that zone->lock is a tail lock but compaction/page
isolation might be doing something I am not aware of right now.

> Anyway, I tried to convert only zone->lock and use the page allocator, passing
> gfp_mask=0 as the argument. It works. CONFIG_PROVE_RAW_LOCK_NESTING does not
> complain about any "bad" lock nesting.
> 
> > Anyway, if the zone->lock is not a good fit for raw_spin_lock then the
> > only way I can see forward is to detect real (RT) atomic contexts and
> > bail out early before taking the lock in the allocator for NOWAIT/ATOMIC
> > requests.
> >
> For an RT kernel we can detect it for sure. preemptible() works just fine there,
> i.e. we can identify the context we are currently in.

In the previous email I didn't mention why I prefer full NOWAIT semantics
over RT-specific bailouts. There are users making NOWAIT allocation
attempts as an opportunistic allocation request which is OK to fail
as they have a fallback to go through. This would imply they would
prefer to know this ASAP rather than get blocked and sleep. A lack of
reports for PREEMPT_RT would suggest that nobody has noticed this,
though.
Michal Hocko Aug. 11, 2020, 10:28 a.m. UTC | #10
On Tue 11-08-20 11:42:51, Uladzislau Rezki wrote:
> On Tue, Aug 11, 2020 at 11:37:13AM +0200, Uladzislau Rezki wrote:
> > On Tue, Aug 11, 2020 at 10:19:17AM +0200, Michal Hocko wrote:
[...]
> > > Anyway, if the zone->lock is not a good fit for raw_spin_lock then the
> > > only way I can see forward is to detect real (RT) atomic contexts and
> > > bail out early before taking the lock in the allocator for NOWAIT/ATOMIC
> > > requests.
> > >
> This is similar to what I have done with "mm: Add __GFP_NO_LOCKS flag". I just
> did it for order-0 pages (other paths are not possible) and made it common for
> any kernel.
> 
> Because when you say "bail out early" i suspect that we would like to check
> the per-cpu-list cache.

Bail out early means to do as much as possible until a raw non-compliant
lock has to be taken.
Uladzislau Rezki Aug. 11, 2020, 10:45 a.m. UTC | #11
On Tue, Aug 11, 2020 at 12:28:18PM +0200, Michal Hocko wrote:
> On Tue 11-08-20 11:42:51, Uladzislau Rezki wrote:
> > On Tue, Aug 11, 2020 at 11:37:13AM +0200, Uladzislau Rezki wrote:
> > > On Tue, Aug 11, 2020 at 10:19:17AM +0200, Michal Hocko wrote:
> [...]
> > > > Anyway, if the zone->lock is not a good fit for raw_spin_lock then the
> > > > only way I can see forward is to detect real (RT) atomic contexts and
> > > > bail out early before taking the lock in the allocator for NOWAIT/ATOMIC
> > > > requests.
> > > >
> > This is similar to what I have done with "mm: Add __GFP_NO_LOCKS flag". I just
> > did it for order-0 pages (other paths are not possible) and made it common for
> > any kernel.
> > 
> > Because when you say "bail out early" i suspect that we would like to check
> > the per-cpu-list cache.
> 
> Bail out early means to do as much as possible until a raw non-compliant
> lock has to be taken.
> 

<snip>
struct page *rmqueue(struct zone *preferred_zone,
			struct zone *zone, unsigned int order,
			gfp_t gfp_flags, unsigned int alloc_flags,
			int migratetype)
{
	unsigned long flags;
	struct page *page;

	if (likely(order == 0)) {
		page = rmqueue_pcplist(preferred_zone, zone, gfp_flags,
					migratetype, alloc_flags);
		goto out;
	}

	/*
	 * We most definitely don't want callers attempting to
	 * allocate greater than order-1 page units with __GFP_NOFAIL.
	 */
	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
	spin_lock_irqsave(&zone->lock, flags);
<snip>

only order-0 allocations can check whether the CPU's pcp-list cache has
something, without taking any locks, i.e. it is lockless. "Pre-fetching" is
not possible, since transferring pages from the buddy to the per-cpu lists
takes zone->lock. That is done in the rmqueue_bulk() function.
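The "bail out early" idea for this order-0 path can be modeled in plain C. The sketch below is only a userspace illustration under my own assumptions: the `GFP_NO_LOCKS` bit and the `pcp_list` shape are made up, while in the kernel the cache is `struct per_cpu_pages` and the refill is `rmqueue_bulk()` under `zone->lock`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag standing in for "never take a sleepable lock". */
#define GFP_NO_LOCKS 0x1u

/* Toy stand-in for the per-cpu page cache (per_cpu_pages in the kernel). */
struct pcp_list {
	void *pages[8];
	int count;
};

/*
 * Order-0 fast path: consume the per-cpu cache locklessly; if it is
 * empty, either bail out (NO_LOCKS callers have a fallback) or fall
 * through to the locked refill path, elided here.
 */
static void *alloc_order0(struct pcp_list *pcp, unsigned int gfp)
{
	if (pcp->count > 0)
		return pcp->pages[--pcp->count];	/* lockless hit */

	if (gfp & GFP_NO_LOCKS)
		return NULL;	/* bail before zone->lock would be needed */

	/* ... would take zone->lock and refill via an rmqueue_bulk()-like path ... */
	return NULL;
}
```

The point of the sketch is the contract, not the data structure: a cache hit never needs a lock, and a NO_LOCKS miss returns NULL so the caller can use its fallback.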

--
Vlad Rezki
Uladzislau Rezki Aug. 11, 2020, 11:10 a.m. UTC | #12
On Tue, Aug 11, 2020 at 12:21:24PM +0200, Michal Hocko wrote:
> On Tue 11-08-20 11:18:07, Uladzislau Rezki wrote:
> > On Mon, Aug 10, 2020 at 09:25:25PM +0200, Michal Hocko wrote:
> > > On Mon 10-08-20 18:07:39, Uladzislau Rezki wrote:
> > > > > On Sun 09-08-20 22:43:53, Uladzislau Rezki (Sony) wrote:
> [...]
> > > > As I described before, calling __get_free_page(0) with 0 as the argument
> > > > will solve (a). How correct is it? From my point of view the logic
> > > > that bypasses the wakeup path should be explicitly defined.
> > > 
> > > gfp_mask == 0 is GFP_NOWAIT (aka an atomic allocation request) which
> > > doesn't wake up kswapd. So if the wakeup is a problem then this would be
> > > a way to go.
> > > 
> > What do you mean, Michal? gfp_mask 0 != GFP_NOWAIT:
> > 
> > #define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)
> > 
> > it does wake up kswapd. Or am I missing something? Please comment.
> > If we want to avoid waking kswapd, should we define something special?
> > 
> > #define GFP_NOWWAKE_KSWAPD 0
> 
> Sorry, I was more cryptic than necessary. What I meant is that
> GFP_NOWAIT is the basic non-sleepable allocation. It does wake up
> kswapd, but a lack of it can be expressed as GFP_NOWAIT & ~__GFP_KSWAPD_RECLAIM,
> which is 0 now. The mouthful variant is better for future
> maintainability.
>
OK. I got it anyway. Just decided to clarify.
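For completeness, the flag arithmetic in that exchange can be spelled out in a few lines. The bit value below is my stand-in and should be treated as illustrative; the authoritative definitions live in include/linux/gfp.h and change between kernel versions.

```c
#include <assert.h>

/* Illustrative stand-ins for the kernel's gfp bits (see include/linux/gfp.h). */
#define __GFP_KSWAPD_RECLAIM	0x400000u
#define GFP_NOWAIT		(__GFP_KSWAPD_RECLAIM)

/*
 * Michal's point: GFP_NOWAIT with the kswapd wakeup bit removed
 * collapses to a gfp_mask of 0.
 */
static unsigned int nowait_without_kswapd_wakeup(void)
{
	return GFP_NOWAIT & ~__GFP_KSWAPD_RECLAIM;
}
```

So passing `gfp_mask=0` and passing `GFP_NOWAIT & ~__GFP_KSWAPD_RECLAIM` are the same request today; the spelled-out form just survives future flag changes.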

--
Vlad Rezki
Uladzislau Rezki Aug. 11, 2020, 11:33 a.m. UTC | #13
On Tue, Aug 11, 2020 at 12:26:49PM +0200, Michal Hocko wrote:
> On Tue 11-08-20 11:37:13, Uladzislau Rezki wrote:
> > On Tue, Aug 11, 2020 at 10:19:17AM +0200, Michal Hocko wrote:
> > > On Mon 10-08-20 21:25:26, Michal Hocko wrote:
> > > > On Mon 10-08-20 18:07:39, Uladzislau Rezki wrote:
> > > [...]
> > > > > The problem that I see is that we cannot use the page allocator from atomic
> > > > > contexts, which is our case:
> > > > > 
> > > > > <snip>
> > > > >     local_irq_save(flags) or preempt_disable() or raw_spinlock();
> > > > >     __get_free_page(GFP_ATOMIC);
> > > > > <snip>
> > > > > 
> > > > > So if we can convert the page allocator to a raw_* lock it will be appreciated,
> > > > > at least from our side; IMHO, not from the RT one. But as I stated above, we need
> > > > > to sort out the raised questions if the conversion is done.
> > > > > 
> > > > > What is your view?
> > > > 
> > > > To me it would make more sense to support atomic allocations also for
> > > > the RT tree. Having both GFP_NOWAIT and GFP_ATOMIC which do not really
> > > > work for atomic context in RT sounds subtle and wrong.
> > > 
> > > I was thinking about this some more. I still think the above would be a
> > > reasonable goal we should try to achieve. If for no other reason than
> > > future maintainability (especially after the RT patchset is merged).
> > > I have tried to search for any known problems/attempts to make
> > > zone->lock raw but couldn't find anything. Maybe somebody more involved
> > > in the RT world has something to say about that.
> > > 
> > I tried yesterday to convert zone->lock. See below the files I had to modify:
> > <snip>
> >         modified:   include/linux/mmzone.h
> >         modified:   mm/compaction.c
> >         modified:   mm/memory_hotplug.c
> >         modified:   mm/page_alloc.c
> >         modified:   mm/page_isolation.c
> >         modified:   mm/page_reporting.c
> >         modified:   mm/shuffle.c
> >         modified:   mm/vmscan.c
> >         modified:   mm/vmstat.c
> > <snip>
> > 
> > There is one more lock, the zone->lru_lock. Both zone->lock and this
> > one intersect with each other. If the lru_lock can be nested under zone->lock
> > it should be converted as well. But I need to analyze it further. There are
> > two wrapper functions which are used as a common interface to lock/unlock both
> > locks. See compact_lock_irqsave()/compact_unlock_should_abort_lru() in
> > mm/compaction.c.
> > 
> > Any thoughts here?
> 
> I am not an expert on compaction. Vlastimil would know better. My
> thinking was that zone->lock is a tail lock but compaction/page
> isolation might be doing something I am not aware of right now.
> 
> > Anyway, I tried to convert only zone->lock and use the page allocator, passing
> > gfp_mask=0 as the argument. It works. CONFIG_PROVE_RAW_LOCK_NESTING does not
> > complain about any "bad" lock nesting.
> > 
> > > Anyway, if the zone->lock is not a good fit for raw_spin_lock then the
> > > only way I can see forward is to detect real (RT) atomic contexts and
> > > bail out early before taking the lock in the allocator for NOWAIT/ATOMIC
> > > requests.
> > >
> > For an RT kernel we can detect it for sure. preemptible() works just fine there,
> > i.e. we can identify the context we are currently in.
> 
> In the previous email I didn't mention why I prefer full NOWAIT semantics
> over RT-specific bailouts. There are users making NOWAIT allocation
> attempts as an opportunistic allocation request which is OK to fail
> as they have a fallback to go through. This would imply they would
> prefer to know this ASAP rather than get blocked and sleep. A lack of
> reports for PREEMPT_RT would suggest that nobody has noticed this,
> though.
>
I agree here and share your view on it. To me, making *_ATOMIC and *_NOWAIT
fully workable on both kernels sounds like the correct way to go.

Indeed, there may be no complaints as of now. But later on there can be,
and the question of what to do will be raised again.

--
Vlad Rezki
Thomas Gleixner Aug. 11, 2020, 2:44 p.m. UTC | #14
Michal Hocko <mhocko@suse.com> writes:
> On Mon 10-08-20 18:07:39, Uladzislau Rezki wrote:
>> > On Sun 09-08-20 22:43:53, Uladzislau Rezki (Sony) wrote:
>> > Is there any fundamental problem to make zone raw_spin_lock?
>> > 
>> Good point. Converting a regular spinlock to the raw_* variant can solve
>> an issue, and to me it seems partly reasonable, because there are other
>> questions if we do it:
>> 
>> a) what to do with kswapd and "wake-up path" that uses sleepable lock:
>>     wakeup_kswapd() -> wake_up_interruptible(&pgdat->kswapd_wait).
>
> If there is no RT friendly variant for waking up process from the atomic
> context then we might need to special case this for the RT tree.

That's a solvable problem.

>> b) How will RT people react to it? I guess they will not be happy.
>
> zone->lock should be held for a very limited amount of time.

Emphasis on should. free_pcppages_bulk() can hold it for quite some time
when a large amount of pages are purged. We surely would have converted
it to a raw lock long time ago otherwise.

For regular enterprise stuff a few hundred microseconds might qualify as
a limited amount of time. For advanced RT applications that's way beyond
tolerable.

>> As I described before, calling __get_free_page(0) with 0 as the argument
>> will solve (a). How correct is it? From my point of view the logic
>> that bypasses the wakeup path should be explicitly defined.
>
> gfp_mask == 0 is GFP_NOWAIT (aka an atomic allocation request) which
> doesn't wake up kswapd. So if the wakeup is a problem then this would be
> a way to go.

The wakeup is the least of my worries.

> To me it would make more sense to support atomic allocations also for
> the RT tree. Having both GFP_NOWAIT and GFP_ATOMIC which do not really
> work for atomic context in RT sounds subtle and wrong.

Well, no. RT moves almost everything out of atomic context, which means
that GFP_ATOMIC is pretty meaningless on an RT kernel. RT sacrifices
performance for determinism. It's a known tradeoff.

Now RCU creates a new thing which enforces to make page allocation in
atomic context possible on RT. What for?

What's the actual use case in truly atomic context for this new thing on
an RT kernel?

The actual RCU code disabling interrupts is an implementation detail
which can easily be mitigated with a local lock.

Thanks,

        tglx
Thomas Gleixner Aug. 11, 2020, 3:22 p.m. UTC | #15
Thomas Gleixner <tglx@linutronix.de> writes:
> Michal Hocko <mhocko@suse.com> writes:
>> zone->lock should be held for a very limited amount of time.
>
> Emphasis on should. free_pcppages_bulk() can hold it for quite some time
> when a large amount of pages are purged. We surely would have converted
> it to a raw lock long time ago otherwise.
>
> For regular enterprise stuff a few hundred microseconds might qualify as
> a limited amount of time. For advanced RT applications that's way beyond
> tolerable.

Sebastian just tried with zone lock converted to a raw lock and maximum
latencies go up by a factor of 7 when putting a bit of stress on the
memory subsystem. Just a regular kernel compile kicks them up by a factor
of 5. Way out of tolerance.

We'll have a look whether it's solely free_pcppages_bulk() and if so we
could get away with dropping the lock in the loop.

Thanks,

        tglx
Paul E. McKenney Aug. 11, 2020, 3:33 p.m. UTC | #16
On Tue, Aug 11, 2020 at 04:44:21PM +0200, Thomas Gleixner wrote:
> Michal Hocko <mhocko@suse.com> writes:
> > On Mon 10-08-20 18:07:39, Uladzislau Rezki wrote:
> >> > On Sun 09-08-20 22:43:53, Uladzislau Rezki (Sony) wrote:
> >> > Is there any fundamental problem to make zone raw_spin_lock?
> >> > 
> >> Good point. Converting a regular spinlock to the raw_* variant can solve 
> >> an issue and to me it seems partly reasonable. Because there are other
> >> questions if we do it:
> >> 
> >> a) what to do with kswapd and "wake-up path" that uses sleepable lock:
> >>     wakeup_kswapd() -> wake_up_interruptible(&pgdat->kswapd_wait).
> >
> > If there is no RT friendly variant for waking up process from the atomic
> > context then we might need to special case this for the RT tree.
> 
> That's a solvable problem.
> 
> >> b) How will RT people react to it? I guess they will not be happy.
> >
> > zone->lock should be held for a very limited amount of time.
> 
> Emphasis on should. free_pcppages_bulk() can hold it for quite some time
> when a large amount of pages are purged. We surely would have converted
> it to a raw lock long time ago otherwise.
> 
> For regular enterprise stuff a few hundred microseconds might qualify as
> a limited amount of time. For advanced RT applications that's way beyond
> tolerable.
> 
> >> As I described before, calling __get_free_page(0) with 0 as the argument
> >> will solve (a). How correct is it? From my point of view the logic
> >> that bypasses the wakeup path should be explicitly defined.
> >
> > gfp_mask == 0 is GFP_NOWAIT (aka an atomic allocation request) which
> > doesn't wake up kswapd. So if the wakeup is a problem then this would be
> > a way to go.
> 
> The wakeup is the least of my worries.
> 
> > To me it would make more sense to support atomic allocations also for
> > the RT tree. Having both GFP_NOWAIT and GFP_ATOMIC which do not really
> > work for atomic context in RT sounds subtle and wrong.
> 
> Well, no. RT moves almost everything out of atomic context, which means
> that GFP_ATOMIC is pretty meaningless on an RT kernel. RT sacrifices
> performance for determinism. It's a known tradeoff.
> 
> Now RCU creates a new thing which enforces to make page allocation in
> atomic context possible on RT. What for?
> 
> What's the actual use case in truly atomic context for this new thing on
> an RT kernel?

It is not just RT kernels.  CONFIG_PROVE_RAW_LOCK_NESTING=y propagates
this constraint to all configurations, and a patch in your new favorite
subsystem really did trigger this lockdep check in a non-RT kernel.

> The actual RCU code disabling interrupts is an implementation detail
> which can easily be mitigated with a local lock.

In this case, we are in raw-spinlock context on entry to kfree_rcu().

							Thanx, Paul
Thomas Gleixner Aug. 11, 2020, 3:43 p.m. UTC | #17
"Paul E. McKenney" <paulmck@kernel.org> writes:
> On Tue, Aug 11, 2020 at 04:44:21PM +0200, Thomas Gleixner wrote:
>> Now RCU creates a new thing which enforces to make page allocation in
>> atomic context possible on RT. What for?
>> 
>> What's the actual use case in truly atomic context for this new thing on
>> an RT kernel?
>
> It is not just RT kernels.  CONFIG_PROVE_RAW_LOCK_NESTING=y propagates
> this constraint to all configurations, and a patch in your new favorite
> subsystem really did trigger this lockdep check in a non-RT kernel.
>
>> The actual RCU code disabling interrupts is an implementation detail
>> which can easily be mitigated with a local lock.
>
> In this case, we are in raw-spinlock context on entry to kfree_rcu().

Where?
Sebastian Andrzej Siewior Aug. 11, 2020, 3:56 p.m. UTC | #18
On 2020-08-11 17:43:16 [+0200], Thomas Gleixner wrote:
> Where?

See commit 8ac88f7177c75 ("rcu/tree: Keep kfree_rcu() awake during lock contention")

Sebastian
Paul E. McKenney Aug. 11, 2020, 4:02 p.m. UTC | #19
On Tue, Aug 11, 2020 at 05:43:16PM +0200, Thomas Gleixner wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> > On Tue, Aug 11, 2020 at 04:44:21PM +0200, Thomas Gleixner wrote:
> >> Now RCU creates a new thing which enforces to make page allocation in
> >> atomic context possible on RT. What for?
> >> 
> >> What's the actual use case in truly atomic context for this new thing on
> >> an RT kernel?
> >
> > It is not just RT kernels.  CONFIG_PROVE_RAW_LOCK_NESTING=y propagates
> > this constraint to all configurations, and a patch in your new favorite
> > subsystem really did trigger this lockdep check in a non-RT kernel.
> >
> >> The actual RCU code disabling interrupts is an implementation detail
> >> which can easily be mitigated with a local lock.
> >
> > In this case, we are in raw-spinlock context on entry to kfree_rcu().
> 
> Where?

Some BPF code that needs to process and free a list.  As noted above,
this is a patch rather than something that is already in mainline.
Not surprising, though, given call_rcu() invocations in similar contexts.

Yes, we can perhaps rework all current and future callers to avoid
invoking both call_rcu() and kfree_rcu() from raw atomic context, but
the required change to permit this is quite a bit simpler.

							Thanx, Paul
Paul E. McKenney Aug. 11, 2020, 4:19 p.m. UTC | #20
On Tue, Aug 11, 2020 at 09:02:40AM -0700, Paul E. McKenney wrote:
> On Tue, Aug 11, 2020 at 05:43:16PM +0200, Thomas Gleixner wrote:
> > "Paul E. McKenney" <paulmck@kernel.org> writes:
> > > On Tue, Aug 11, 2020 at 04:44:21PM +0200, Thomas Gleixner wrote:
> > >> Now RCU creates a new thing which enforces to make page allocation in
> > >> atomic context possible on RT. What for?
> > >> 
> > >> What's the actual use case in truly atomic context for this new thing on
> > >> an RT kernel?
> > >
> > > It is not just RT kernels.  CONFIG_PROVE_RAW_LOCK_NESTING=y propagates
> > > this constraint to all configurations, and a patch in your new favorite
> > > subsystem really did trigger this lockdep check in a non-RT kernel.
> > >
> > >> The actual RCU code disabling interrupts is an implementation detail
> > >> which can easily be mitigated with a local lock.
> > >
> > > In this case, we are in raw-spinlock context on entry to kfree_rcu().
> > 
> > Where?
> 
> Some BPF code that needs to process and free a list.  As noted above,
> this is a patch rather than something that is already in mainline.
> Not surprising, though, given call_rcu() invocations in similar contexts.
> 
> Yes, we can perhaps rework all current and future callers to avoid
> invoking both call_rcu() and kfree_rcu() from raw atomic context, but
> the required change to permit this is quite a bit simpler.

I should hasten to add that from what I can see right now, the required
change allows telling the memory allocator to bail out instead of acquiring
a non-raw spinlock.  I am absolutely not advocating converting the
allocator's spinlocks to raw spinlocks.

							Thanx, Paul
Thomas Gleixner Aug. 11, 2020, 7:39 p.m. UTC | #21
Thomas Gleixner <tglx@linutronix.de> writes:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> On Tue, Aug 11, 2020 at 04:44:21PM +0200, Thomas Gleixner wrote:
>>> Now RCU creates a new thing which enforces to make page allocation in
>>> atomic context possible on RT. What for?
>>> 
>>> What's the actual use case in truly atomic context for this new thing on
>>> an RT kernel?
>>
>> It is not just RT kernels.  CONFIG_PROVE_RAW_LOCK_NESTING=y propagates
>> this constraint to all configurations, and a patch in your new favorite
>> subsystem really did trigger this lockdep check in a non-RT kernel.
>>
>>> The actual RCU code disabling interrupts is an implementation detail
>>> which can easily be mitigated with a local lock.
>>
>> In this case, we are in raw-spinlock context on entry to kfree_rcu().
>
> Where?

And aside of the where, wasn't kfree_rcu() from within raw spinlock held
regions possible all the time? Either I'm missing something or you are
fundamentally changing RCU internals. kfree_rcu() saved RT in various
ways where invoking kfree() was just not an option. Confused...

Thanks,

        tglx
Paul E. McKenney Aug. 11, 2020, 9:09 p.m. UTC | #22
On Tue, Aug 11, 2020 at 09:39:10PM +0200, Thomas Gleixner wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
> > "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> On Tue, Aug 11, 2020 at 04:44:21PM +0200, Thomas Gleixner wrote:
> >>> Now RCU creates a new thing which enforces to make page allocation in
> >>> atomic context possible on RT. What for?
> >>> 
> >>> What's the actual use case in truly atomic context for this new thing on
> >>> an RT kernel?
> >>
> >> It is not just RT kernels.  CONFIG_PROVE_RAW_LOCK_NESTING=y propagates
> >> this constraint to all configurations, and a patch in your new favorite
> >> subsystem really did trigger this lockdep check in a non-RT kernel.
> >>
> >>> The actual RCU code disabling interrupts is an implementation detail
> >>> which can easily be mitigated with a local lock.
> >>
> >> In this case, we are in raw-spinlock context on entry to kfree_rcu().
> >
> > Where?
> 
> And aside of the where, wasn't kfree_rcu() from within raw spinlock held
> regions possible all the time? Either I'm missing something or you are
> fundamentally changing RCU internals. kfree_rcu() saved RT in various
> ways where invoking kfree() was just not an option. Confused...

Back in the old days (months ago!), it was possible to invoke kfree_rcu()
while holding a raw spinlock because kfree_rcu() didn't acquire any locks.
It didn't need to because it was just a wrapper around call_rcu().
And call_rcu(), along with the rest of RCU's internals, has used raw
spinlocks since near the beginnings of RT, which meant that it wasn't
a problem in the rare cases where call_rcu() needed to acquire one of
RCU's locks (for example, when call_rcu() sees that the current CPU has
accumulated more than 10,000 callbacks).

But one problem with the old kfree_rcu() approach is that the memory to
be freed is almost always cold in the cache by the time that the grace
period ends.  And this was made worse by the fact that the rcu_do_batch()
function traverses the list of objects to be freed as a linked list,
thus incurring a cache miss on each and every object.  The usual fix
for this sort of performance problem is to use arrays of pointers,
which on a 64-bit system with 64-byte cache lines reduces the number of
cache misses by a factor of eight.  In addition, decreasing the number
of post-grace-period cache misses increases the stability of RCU with
respect to callback flooding:  Because the kfree()s happen faster, it
is harder to overrun RCU with tight loops posting callbacks (as happened
some time back with certain types of ACLs).

Hence Ulad's work on kfree_rcu().  The approach is to allocate a
page-sized array to hold all the pointers, then fill in the rest of these
pointers on each subsequent kfree_rcu() call.  These arrays of pointers
also allow use of kfree_bulk() instead of kfree(), which can improve
performance yet more.  It is no big deal if kfree_rcu()'s allocation
attempts fail occasionally because it can simply fall back to the old
linked-list approach.  And given that the various lockless caches in
the memory allocator are almost never empty, in theory life is good.
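The batching scheme described above can be sketched in a few lines of userspace C. Names and sizes below are simplified stand-ins for the kernel's kvfree_rcu_bulk_data blocks (which hold PAGE_SIZE / sizeof(void *) entries), and `calloc()` stands in for the GFP_NOWAIT page allocation; what matters is the fallback contract when allocation fails.

```c
#include <assert.h>
#include <stdlib.h>

#define BLOCK_ENTRIES 4	/* PAGE_SIZE / sizeof(void *) in the kernel */

/* Simplified stand-in for struct kvfree_rcu_bulk_data. */
struct bulk_block {
	struct bulk_block *next;
	int count;
	void *ptrs[BLOCK_ENTRIES];
};

/*
 * Pack a pointer into the head block, growing the chain when the head
 * is full.  Returns 1 on success, 0 when allocation fails and the
 * caller must take the old rcu_head linked-list fallback.
 */
static int batch_ptr(struct bulk_block **head, void *p)
{
	if (!*head || (*head)->count == BLOCK_ENTRIES) {
		struct bulk_block *b = calloc(1, sizeof(*b));

		if (!b)
			return 0;	/* allocator said no: fall back */
		b->next = *head;
		*head = b;
	}
	(*head)->ptrs[(*head)->count++] = p;
	return 1;
}
```

After a grace period the whole `ptrs[]` array of each block can be handed to a kfree_bulk()-style routine, which is where the cache-miss savings come from.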

But in practice, mainline now has CONFIG_PROVE_RAW_LOCK_NESTING,
and for good reason -- this Kconfig option makes it at least a
little bit harder for mainline developers to mess up RT.  But with
CONFIG_PROVE_RAW_LOCK_NESTING=y and lockdep enabled, mainline will now
sometimes complain if you invoke kfree_rcu() while holding a raw spinlock.
This happens when kfree_rcu() needs to invoke the memory allocator and
the memory allocator's caches are empty, thus resulting in the memory
allocator attempting to acquire a non-raw spinlock.

Because kfree_rcu() has a fallback available (just using the old linked
list), kfree_rcu() would work well given a way to tell the memory
allocator to return NULL instead of acquiring a non-raw spinlock.
Which is exactly what Ulad's recent patches are intended to do.

Since then, this thread has discussed various other approaches,
including using existing combinations of GFP_ flags, converting
the allocator's zone lock to a raw spinlock, and so on.

Does that help, or am I missing the point of your question?

							Thanx, Paul
Thomas Gleixner Aug. 12, 2020, 12:13 a.m. UTC | #23
"Paul E. McKenney" <paulmck@kernel.org> writes:
> Hence Ulad's work on kfree_rcu().  The approach is to allocate a
> page-sized array to hold all the pointers, then fill in the rest of these
> pointers on each subsequent kfree_rcu() call.  These arrays of pointers
> also allows use of kfree_bulk() instead of kfree(), which can improve
> performance yet more.  It is no big deal if kfree_rcu()'s allocation
> attempts fail occasionally because it can simply fall back to the old
> linked-list approach.  And given that the various lockless caches in
> the memory allocator are almost never empty, in theory life is good.

Of course, it's always the damned reality which ruins the fun.

> But in practice, mainline now has CONFIG_PROVE_RAW_LOCK_NESTING,
> and for good reason -- this Kconfig option makes it at least a
> little bit harder for mainline developers to mess up RT.  But with
> CONFIG_PROVE_RAW_LOCK_NESTING=y and lockdep enabled, mainline will now
> sometimes complain if you invoke kfree_rcu() while holding a raw spinlock.
> This happens when kfree_rcu() needs to invoke the memory allocator and
> the memory allocator's caches are empty, thus resulting in the memory
> allocator attempting to acquire a non-raw spinlock.

Right.

> Because kfree_rcu() has a fallback available (just using the old linked
> list), kfree_rcu() would work well given a way to tell the memory
> allocator to return NULL instead of acquiring a non-raw spinlock.
> Which is exactly what Ulad's recent patches are intended to do.

That much I understood, but I somehow failed to figure the why out
despite the elaborate changelog. 2 weeks of 30+C seem to have cooked my
brain :)

> Since then, this thread has discussed various other approaches,
> including using existing combinations of GFP_ flags, converting
> the allocator's zone lock to a raw spinlock, and so on.
>
> Does that help, or am I missing the point of your question?

Yes, that helps so far that I understand what the actual problem is. It
does not really help to make me more happy. :)

That said, we can support atomic allocations on RT up to the point where
zone->lock comes into play. We don't know yet exactly where the
zone->lock induced damage happens. Presumably it's inside
free_pcppages_bulk() - at least that's where I have faint bad memories
from 15+ years ago. Aside of that I seriously doubt that it can be made
work within a reasonable time frame.

But what makes me really unhappy is that my defense line against
allocations from truly atomic contexts (from RT POV) which was enforced
on RT gets a real big gap shot into it.

It becomes pretty hard to argue why atomic allocations via kmalloc() or
kmem_cache_alloc() should be treated any different. Technically they can
work similar to the page allocations up to the point where regular
spinlocks come into play or the slab cache is exhausted. Where to draw
the line?

It's also unclear for the page allocator case whether we can and should
stick a limit on the number of pages and/or the page order.

Myself and others spent a considerable amount of time to kill off these
kinds of allocations from various interesting places, including the guts
of send IPI, the affinity setting path and others where people just
slapped allocations into them because the stack checker warned or
because they happened to copy the code from some other place.

RT was pretty much a quick crap detector whenever new incarnations of
this got added and to some extent continuous education about these
issues made them less prominent over the years. Using atomic allocations
should always have a real good rationale, not only in the cases where
they collide with RT.

I can understand your rationale and what you are trying to solve. So, if
we can actually have a distinct GFP variant:

  GFP_I_ABSOLUTELY_HAVE_TO_DO_THAT_AND_I_KNOW_IT_CAN_FAIL_EARLY

which is easy to grep for then having the page allocator go down to the
point where zone lock gets involved is not the end of the world for
RT in theory - unless that damned reality tells otherwise. :)

The page allocator allocations should also have a limit on the number of
pages and eventually also page order (need to stare at the code or let
Michal educate me that the order does not matter).

To make it consistent the same GFP_ variant should allow the slab
allocator go to the point where the slab cache is exhausted.

Having a distinct and clearly defined GFP_ variant is really key to
chase down offenders and to make reviewers double check upfront why this
is absolutely required.
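Those requirements (a distinct, grep-able flag, limited order, fail early before any regular spinlock) could look roughly like the sketch below. The flag name, bit value, and helper names here are hypothetical, not the posted patch, and the pcp stub models an empty cache.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int gfp_t;
struct page;

/* Hypothetical distinct, easily grep-able flag. */
#define __GFP_NO_LOCKS	((gfp_t)0x800000u)

/* Stub for the lockless per-cpu cache; models a cache miss. */
static struct page *pcplist_try_get(void)
{
	return NULL;
}

/*
 * Such a request is limited to order-0 (the only order with a
 * lockless path) and fails early instead of reaching zone->lock.
 */
static struct page *rmqueue_no_locks(unsigned int order, gfp_t gfp)
{
	if (!(gfp & __GFP_NO_LOCKS))
		return NULL;	/* normal locked path, elided here */

	if (order > 0)
		return NULL;	/* no lockless path for higher orders */

	return pcplist_try_get();	/* NULL is fine: caller falls back */
}
```

Being a single, dedicated bit makes every offender show up with one grep, which is the review property being asked for.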

Thanks,

        tglx
Paul E. McKenney Aug. 12, 2020, 4:29 a.m. UTC | #24
On Wed, Aug 12, 2020 at 02:13:25AM +0200, Thomas Gleixner wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> > Hence Ulad's work on kfree_rcu().  The approach is to allocate a
> > page-sized array to hold all the pointers, then fill in the rest of these
> > pointers on each subsequent kfree_rcu() call.  These arrays of pointers
> > also allows use of kfree_bulk() instead of kfree(), which can improve
> > performance yet more.  It is no big deal if kfree_rcu()'s allocation
> > attempts fail occasionally because it can simply fall back to the old
> > linked-list approach.  And given that the various lockless caches in
> > the memory allocator are almost never empty, in theory life is good.
> 
> Of course, it's always the damned reality which ruins the fun.

Classic!!!  And yes, it always is!

> > But in practice, mainline now has CONFIG_PROVE_RAW_LOCK_NESTING,
> > and for good reason -- this Kconfig option makes it at least a
> > little bit harder for mainline developers to mess up RT.  But with
> > CONFIG_PROVE_RAW_LOCK_NESTING=y and lockdep enabled, mainline will now
> > sometimes complain if you invoke kfree_rcu() while holding a raw spinlock.
> > This happens when kfree_rcu() needs to invoke the memory allocator and
> > the memory allocator's caches are empty, thus resulting in the memory
> > allocator attempting to acquire a non-raw spinlock.
> 
> Right.
> 
> > Because kfree_rcu() has a fallback available (just using the old linked
> > list), kfree_rcu() would work well given a way to tell the memory
> > allocator to return NULL instead of acquiring a non-raw spinlock.
> > Which is exactly what Ulad's recent patches are intended to do.
> 
> That much I understood, but I somehow failed to figure the why out
> despite the elaborate changelog. 2 weeks of 30+C seem to have cooked my
> brain :)

Ouch!!!  And what on earth is Germany doing being that warm???

I hate it when that happens...

> > Since then, this thread has discussed various other approaches,
> > including using existing combinations of GFP_ flags, converting
> > the allocator's zone lock to a raw spinlock, and so on.
> >
> > Does that help, or am I missing the point of your question?
> 
> Yes, that helps so far that I understand what the actual problem is. It
> does not really help to make me more happy. :)

I must confess that I was not expecting to find anything resembling
happiness anywhere down this road, whether for myself or anyone else...

> That said, we can support atomic allocations on RT up to the point where
> zone->lock comes into play. We don't know yet exactly where the
> zone->lock induced damage happens. Presumably it's inside
> free_pcppages_bulk() - at least that's where I have faint bad memories
> from 15+ years ago. Aside of that I seriously doubt that it can be made
> work within a reasonable time frame.

I was not considering any approach other than return NULL just before
the code would otherwise have acquired zone->lock.

> But what makes me really unhappy is that my defense line against
> allocations from truly atomic contexts (from RT POV) which was enforced
> on RT gets a real big gap shot into it.

Understood, and agreed:  We do need to keep the RT degradation in check.

> It becomes pretty hard to argue why atomic allocations via kmalloc() or
> kmem_cache_alloc() should be treated any different. Technically they can
> work similar to the page allocations up to the point where regular
> spinlocks come into play or the slab cache is exhausted. Where to draw
> the line?
> 
> It's also unclear for the page allocator case whether we can and should
> stick a limit on the number of pages and/or the pageorder.
> 
> Myself and others spent a considerable amount of time to kill off these
> kind of allocations from various interesting places including the guts
> of send IPI, the affinity setting path and others where people just
> slapped allocations into them because the stack checker warned or
> because they happened to copy the code from some other place.
> 
> RT was pretty much a quick crap detector whenever new incarnations of
> this got added and to some extent continuous education about these
> issues made them less prominent over the years. Using atomic allocations
> should always have a real good rationale, not only in the cases where
> they collide with RT.
> 
> I can understand your rationale and what you are trying to solve. So, if
> we can actually have a distinct GFP variant:
> 
>   GFP_I_ABSOLUTELY_HAVE_TO_DO_THAT_AND_I_KNOW_IT_CAN_FAIL_EARLY
> 
> which is easy to grep for then having the page allocator go down to the
> point where zone lock gets involved is not the end of the world for
> RT in theory - unless that damned reality tells otherwise. :)

I have no objection to an otherwise objectionable name in this particular
case.  After all, we now have 100 characters per line, right?  ;-)

> The page allocator allocations should also have a limit on the number of
> pages and eventually also page order (need to stare at the code or let
> Michal educate me that the order does not matter).

Understood.  I confess that I have but little understanding of that code.

> To make it consistent the same GFP_ variant should allow the slab
> allocator go to the point where the slab cache is exhausted.

Why not wait until someone has an extremely good reason for needing
this functionality from the slab allocators?  After all, leaving out
the slab allocators would provide a more robust defense line.  Yes,
consistent APIs are very good things as a general rule, but maybe this
situation is one of the exceptions to that rule.

> Having a distinct and clearly defined GFP_ variant is really key to
> chase down offenders and to make reviewers double check upfront why this
> is absolutely required.

Checks for that GFP_ variant could be added to automation, though reality
might eventually prove that to be a mixed blessing.

							Thanx, Paul
Thomas Gleixner Aug. 12, 2020, 8:32 a.m. UTC | #25
Paul,

"Paul E. McKenney" <paulmck@kernel.org> writes:
> On Wed, Aug 12, 2020 at 02:13:25AM +0200, Thomas Gleixner wrote:
>> That much I understood, but I somehow failed to figure the why out
>> despite the elaborate changelog. 2 weeks of 30+C seem to have cooked my
>> brain :)
>
> Ouch!!!  And what on earth is Germany doing being that warm???

The hot air exhaustion of politicians, managers and conspiracy
mythomaniacs seems to have contributed extensively to global warming
lately.

>> But what makes me really unhappy is that my defense line against
>> allocations from truly atomic contexts (from RT POV) which was enforced
>> on RT gets a real big gap shot into it.
>
> Understood, and agreed:  We do need to keep the RT degradation in
> check.

Not only that. It's bad practice in general to do memory allocations
from such contexts if not absolutely necessary and the majority of cases
which we cleaned up over time were just from the "works for me and why
should I care and start to think" department.

>> I can understand your rationale and what you are trying to solve. So, if
>> we can actually have a distinct GFP variant:
>> 
>>   GFP_I_ABSOLUTELY_HAVE_TO_DO_THAT_AND_I_KNOW_IT_CAN_FAIL_EARLY
>> 
>> which is easy to grep for then having the page allocator go down to the
>> point where zone lock gets involved is not the end of the world for
>> RT in theory - unless that damned reality tells otherwise. :)
>
> I have no objection to an otherwise objectionable name in this particular
> case.  After all, we now have 100 characters per line, right?  ;-)

Hehe. I can live with the proposed NO_LOCK name or anything distinct
which the mm people can agree on.

>> To make it consistent the same GFP_ variant should allow the slab
>> allocator go to the point where the slab cache is exhausted.
>
> Why not wait until someone has an extremely good reason for needing
> this functionality from the slab allocators?  After all, leaving out
> the slab allocators would provide a more robust defense line.  Yes,
> consistent APIs are very good things as a general rule, but maybe this
> situation is one of the exceptions to that rule.

Fair enough.

>> Having a distinct and clearly defined GFP_ variant is really key to
>> chase down offenders and to make reviewers double check upfront why this
>> is absolutely required.
>
> Checks for that GFP_ variant could be added to automation, though reality
> might eventually prove that to be a mixed blessing.

Did you really have to remind me and destroy my illusions before I was
able to marvel at them?

Thanks,

        tglx
Thomas Gleixner Aug. 12, 2020, 11:38 a.m. UTC | #26
Thomas Gleixner <tglx@linutronix.de> writes:
> Thomas Gleixner <tglx@linutronix.de> writes:
>> Michal Hocko <mhocko@suse.com> writes:
>>> zone->lock should be held for a very limited amount of time.
>>
>> Emphasis on should. free_pcppages_bulk() can hold it for quite some time
>> when a large amount of pages are purged. We surely would have converted
>> it to a raw lock long time ago otherwise.
>>
>> For regular enterprise stuff a few hundred microseconds might qualify as
>> a limited amount of time. For advanced RT applications that's way beyond
>> tolerable..
>
> Sebastian just tried with zone lock converted to a raw lock and maximum
> latencies go up by a factor of 7 when putting a bit of stress on the
> memory subsystem. Just a regular kernel compile kicks them up by a factor
> of 5. Way out of tolerance.
>
> We'll have a look whether it's solely free_pcppages_bulk() and if so we
> could get away with dropping the lock in the loop.

So even on !RT and just doing a kernel compile the time spent in
free_pcppages_bulk() is up to 270 usec.

It's not only the loop which processes a large pile of pages, part of it
is caused by lock contention on zone->lock. Dropping the lock after
processing a couple of pages does not make it much better if enough CPUs
are contending on the lock.

Thanks,

        tglx
Uladzislau Rezki Aug. 12, 2020, 12:01 p.m. UTC | #27
On Wed, Aug 12, 2020 at 01:38:35PM +0200, Thomas Gleixner wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
> > Thomas Gleixner <tglx@linutronix.de> writes:
> >> Michal Hocko <mhocko@suse.com> writes:
> >>> zone->lock should be held for a very limited amount of time.
> >>
> >> Emphasis on should. free_pcppages_bulk() can hold it for quite some time
> >> when a large amount of pages are purged. We surely would have converted
> >> it to a raw lock long time ago otherwise.
> >>
> >> For regular enterprise stuff a few hundred microseconds might qualify as
> >> a limited amount of time. For advanced RT applications that's way beyond
> >> tolerable..
> >
> > Sebastian just tried with zone lock converted to a raw lock and maximum
> > latencies go up by a factor of 7 when putting a bit of stress on the
> > memory subsystem. Just a regular kernel compile kicks them up by a factor
> > of 5. Way out of tolerance.
> >
> > We'll have a look whether it's solely free_pcppages_bulk() and if so we
> > could get away with dropping the lock in the loop.
> 
> So even on !RT and just doing a kernel compile the time spent in
> free_pcppages_bulk() is up to 270 usec.
> 
I suspect that if you measure the latency of zone->lock and its contention
on any embedded device, i.e. hardware much less powerful than a PC, it could
be milliseconds. IMHO.

>
> It's not only the loop which processes a large pile of pages, part of it
> is caused by lock contention on zone->lock. Dropping the lock after
> processing a couple of pages does not make it much better if enough CPUs
> are contending on the lock.
>
Initially I did not propose converting the lock, because I suspected that
there could be problems from the RT point of view. Also, as I mentioned before,
GFP_ATOMIC is not meaningful anymore in that case; it is a bit outside of what
GFP_ATOMIC stands for. But I see your point about "where is the stop line".

That is why I proposed to bail out as late as possible: mm: Add __GFP_NO_LOCKS flag.
On the other hand, we have been discussing other options, like converting the
lock. Just to cover as much as possible :)

Thanks Thomas for valuable comments!

--
Vlad Rezki
Paul E. McKenney Aug. 12, 2020, 1:30 p.m. UTC | #28
On Wed, Aug 12, 2020 at 10:32:50AM +0200, Thomas Gleixner wrote:
> Paul,
> 
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> > On Wed, Aug 12, 2020 at 02:13:25AM +0200, Thomas Gleixner wrote:
> >> That much I understood, but I somehow failed to figure the why out
> >> despite the elaborate changelog. 2 weeks of 30+C seem to have cooked my
> >> brain :)
> >
> > Ouch!!!  And what on earth is Germany doing being that warm???
> 
> The hot air exhaustion of politicians, managers and conspiracy
> mythomaniacs seems to have contributed extensively to global warming
> lately.

In that case, our only hope here in this geography is that we are in a
simulation, so that the hot air will cause a signed integer overflow to
negative numbers some fraction of the time.  :-(

> >> But what makes me really unhappy is that my defense line against
> >> allocations from truly atomic contexts (from RT POV) which was enforced
> >> on RT gets a real big gap shot into it.
> >
> > Understood, and agreed:  We do need to keep the RT degradation in
> > check.
> 
> Not only that. It's bad practice in general to do memory allocations
> from such contexts if not absolutely necessary and the majority of cases
> which we cleaned up over time were just from the "works for me and why
> should I care and start to think" department.

Agreed, and I continue to see some of that myself.  :-/

> >> I can understand your rationale and what you are trying to solve. So, if
> >> we can actually have a distinct GFP variant:
> >> 
> >>   GFP_I_ABSOLUTELY_HAVE_TO_DO_THAT_AND_I_KNOW_IT_CAN_FAIL_EARLY
> >> 
> >> which is easy to grep for then having the page allocator go down to the
> >> point where zone lock gets involved is not the end of the world for
> >> RT in theory - unless that damned reality tells otherwise. :)
> >
> > I have no objection to an otherwise objectionable name in this particular
> > case.  After all, we now have 100 characters per line, right?  ;-)
> 
> Hehe. I can live with the proposed NO_LOCK name or anything distinct
> which the mm people can agree on.

Sounds good.  ;-)

> >> To make it consistent the same GFP_ variant should allow the slab
> >> allocator go to the point where the slab cache is exhausted.
> >
> > Why not wait until someone has an extremely good reason for needing
> > this functionality from the slab allocators?  After all, leaving out
> > the slab allocators would provide a more robust defense line.  Yes,
> > consistent APIs are very good things as a general rule, but maybe this
> > situation is one of the exceptions to that rule.
> 
> Fair enough.
> 
> >> Having a distinct and clearly defined GFP_ variant is really key to
> >> chase down offenders and to make reviewers double check upfront why this
> >> is absolutely required.
> >
> > Checks for that GFP_ variant could be added to automation, though reality
> > might eventually prove that to be a mixed blessing.
> 
> Did you really have to remind me and destroy my illusions before I was
> able to marvel at them?

Apologies!  I am afraid that it has become a reflex due to living in
this time and place.  My further fear is that I will have all too great
an opportunity for further reinforcing this reflex in the future.  :-/

							Thanx, Paul
Michal Hocko Aug. 13, 2020, 7:18 a.m. UTC | #29
On Wed 12-08-20 13:38:35, Thomas Gleixner wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
> > Thomas Gleixner <tglx@linutronix.de> writes:
> >> Michal Hocko <mhocko@suse.com> writes:
> >>> zone->lock should be held for a very limited amount of time.
> >>
> >> Emphasis on should. free_pcppages_bulk() can hold it for quite some time
> >> when a large amount of pages are purged. We surely would have converted
> >> it to a raw lock long time ago otherwise.
> >>
> >> For regular enterprise stuff a few hundred microseconds might qualify as
> >> a limited amount of time. For advanced RT applications that's way beyond
> >> tolerable..
> >
> > Sebastian just tried with zone lock converted to a raw lock and maximum
> > latencies go up by a factor of 7 when putting a bit of stress on the
> > memory subsystem. Just a regular kernel compile kicks them up by a factor
> > of 5. Way out of tolerance.
> >
> > We'll have a look whether it's solely free_pcppages_bulk() and if so we
> > could get away with dropping the lock in the loop.
> 
> So even on !RT and just doing a kernel compile the time spent in
> free_pcppages_bulk() is up to 270 usec.
> 
> It's not only the loop which processes a large pile of pages, part of it
> is caused by lock contention on zone->lock. Dropping the lock after
> processing a couple of pages does not make it much better if enough CPUs
> are contending on the lock.

OK, this is a bit surprising to me but well, reality sucks sometimes.

I was really hoping for a solution which would allow the allocator to
do what the gfp flags say: if something is nowait then it shouldn't
block. I believe we need to document this properly.

I will comment on the dedicated gfp flag in reply to the other email.

Thanks for trying that out Sebastian and Thomas!
Michal Hocko Aug. 13, 2020, 7:50 a.m. UTC | #30
On Wed 12-08-20 02:13:25, Thomas Gleixner wrote:
[...]
> I can understand your rationale and what you are trying to solve. So, if
> we can actually have a distinct GFP variant:
> 
>   GFP_I_ABSOLUTELY_HAVE_TO_DO_THAT_AND_I_KNOW_IT_CAN_FAIL_EARLY

Even if we cannot make the zone->lock raw I would prefer to not
introduce a new gfp flag. Well we can do an alias for easier grepping
#define GFP_RT_SAFE	0

that would imply nowait semantic and would exclude waking up kswapd as
well. If we can make wake up safe under RT then the alias would reflect
that without any code changes.

The second, and the more important part, would be to bail out anytime
the page allocator is to take a lock which is not allowed in the current
RT context. Something like 
	
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 67a0774e080b..3ef3ac82d110 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -237,6 +237,9 @@ struct vm_area_struct;
  * that subsystems start with one of these combinations and then set/clear
  * %__GFP_FOO flags as necessary.
  *
+ * %GFP_RT_SAFE users can not sleep and they are running under RT atomic context
+ * e.g. under raw_spin_lock. Failure of an allocation is to be expected.
+ *
  * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
  * watermark is applied to allow access to "atomic reserves"
  *
@@ -293,6 +296,7 @@ struct vm_area_struct;
  * version does not attempt reclaim/compaction at all and is by default used
  * in page fault path, while the non-light is used by khugepaged.
  */
+#define GFP_RT_SAFE	0
 #define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
 #define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
 #define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e028b87ce294..268ae872cc2a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2824,6 +2824,13 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 {
 	int i, alloced = 0;
 
+	/*
+	 * Hard atomic contexts are not supported by the allocator for
+	 * anything but pcp requests
+	 */
+	if (!preemtable())
+		return 0;
+
 	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
@@ -3371,6 +3378,13 @@ struct page *rmqueue(struct zone *preferred_zone,
 		goto out;
 	}
 
+	/*
+	 * Hard atomic contexts are not supported by the allocator for high
+	 * order requests
+	 */
+	if (WARN_ON_ONCE(!preemtable()))
+		reurn NULL;
+
 	/*
 	 * We most definitely don't want callers attempting to
 	 * allocate greater than order-1 page units with __GFP_NOFAIL.
@@ -4523,6 +4537,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
 		gfp_mask &= ~__GFP_ATOMIC;
 
+	/* Hard atomic contexts support is very limited to the fast path */
+	if (!preemtable()) {
+		WARN_ON_ONCE(gfp_mask != GFP_RT_SAFE);
+		return NULL;
+	}
+
 retry_cpuset:
 	compaction_retries = 0;
 	no_progress_loops = 0;

What do you think?
 
> which is easy to grep for then having the page allocator go down to the
> point where zone lock gets involved is not the end of the world for
> RT in theory - unless that damned reality tells otherwise. :)
> 
> The page allocator allocations should also have a limit on the number of
> pages and eventually also page order (need to stare at the code or let
> Michal educate me that the order does not matter).

In practice anything but order 0 is out of the question because we need
zone->lock for that currently. Maybe we can introduce pcp lists for
higher orders in the future - I have a vague recollection Mel was
playing with that some time ago.

> To make it consistent the same GFP_ variant should allow the slab
> allocator go to the point where the slab cache is exhausted.
> 
> Having a distinct and clearly defined GFP_ variant is really key to
> chase down offenders and to make reviewers double check upfront why this
> is absolutely required.

Having a high level and recognizable gfp mask is OK but I would really
like not to introduce a dedicated flag. The page allocator should be
able to recognize the context which cannot be handled.
Uladzislau Rezki Aug. 13, 2020, 9:58 a.m. UTC | #31
On Thu, Aug 13, 2020 at 09:50:27AM +0200, Michal Hocko wrote:
> On Wed 12-08-20 02:13:25, Thomas Gleixner wrote:
> [...]
> > I can understand your rationale and what you are trying to solve. So, if
> > we can actually have a distinct GFP variant:
> > 
> >   GFP_I_ABSOLUTELY_HAVE_TO_DO_THAT_AND_I_KNOW_IT_CAN_FAIL_EARLY
> 
> Even if we cannot make the zone->lock raw I would prefer to not
> introduce a new gfp flag. Well we can do an alias for easier grepping
> #define GFP_RT_SAFE	0
> 
> that would imply nowait semantic and would exclude waking up kswapd as
> well. If we can make wake up safe under RT then the alias would reflect
> that without any code changes.
> 
> The second, and the more important part, would be to bail out anytime
> the page allocator is to take a lock which is not allowed in the current
> RT context. Something like 
> 	
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 67a0774e080b..3ef3ac82d110 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -237,6 +237,9 @@ struct vm_area_struct;
>   * that subsystems start with one of these combinations and then set/clear
>   * %__GFP_FOO flags as necessary.
>   *
> + * %GFP_RT_SAFE users can not sleep and they are running under RT atomic context
> + * e.g. under raw_spin_lock. Failure of an allocation is to be expected.
> + *
>   * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
>   * watermark is applied to allow access to "atomic reserves"
>   *
> @@ -293,6 +296,7 @@ struct vm_area_struct;
>   * version does not attempt reclaim/compaction at all and is by default used
>   * in page fault path, while the non-light is used by khugepaged.
>   */
> +#define GFP_RT_SAFE	0
>  #define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
>  #define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
>  #define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e028b87ce294..268ae872cc2a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2824,6 +2824,13 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  {
>  	int i, alloced = 0;
>  
> +	/*
> +	 * Hard atomic contexts are not supported by the allocator for
> +	 * anything but pcp requests
> +	 */
> +	if (!preemtable())
> +		return 0;
> +
>  	spin_lock(&zone->lock);
>  	for (i = 0; i < count; ++i) {
>  		struct page *page = __rmqueue(zone, order, migratetype,
> @@ -3371,6 +3378,13 @@ struct page *rmqueue(struct zone *preferred_zone,
>  		goto out;
>  	}
>  
> +	/*
> +	 * Hard atomic contexts are not supported by the allocator for high
> +	 * order requests
> +	 */
> +	if (WARN_ON_ONCE(!preemtable()))
> +		reurn NULL;
> +
>  	/*
>  	 * We most definitely don't want callers attempting to
>  	 * allocate greater than order-1 page units with __GFP_NOFAIL.
> @@ -4523,6 +4537,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
>  		gfp_mask &= ~__GFP_ATOMIC;
>  
> +	/* Hard atomic contexts support is very limited to the fast path */
> +	if (!preemtable()) {
> +		WARN_ON_ONCE(gfp_mask != GFP_RT_SAFE);
> +		return NULL;
> +	}
> +
>  retry_cpuset:
>  	compaction_retries = 0;
>  	no_progress_loops = 0;
> 
> What do you think?
>  
> > which is easy to grep for then having the page allocator go down to the
> > point where zone lock gets involved is not the end of the world for
> > RT in theory - unless that damned reality tells otherwise. :)
> > 
> > The page allocator allocations should also have a limit on the number of
> > pages and eventually also page order (need to stare at the code or let
> > Michal educate me that the order does not matter).
> 
> > In practice anything but order 0 is out of the question because we need
> zone->lock for that currently. Maybe we can introduce pcp lists for
> higher orders in the future - I have a vague recollection Mel was
> playing with that some time ago.
> 
> > To make it consistent the same GFP_ variant should allow the slab
> > allocator go to the point where the slab cache is exhausted.
> > 
> > Having a distinct and clearly defined GFP_ variant is really key to
> > chase down offenders and to make reviewers double check upfront why this
> > is absolutely required.
> 
> Having a high level and recognizable gfp mask is OK but I would really
> like not to introduce a dedicated flag. The page allocator should be
> able to recognize the context which cannot be handled. 
>
Sorry for jumping in. We can rely on preemptable() for sure, if CONFIG_PREEMPT_RT
is enabled, something like below:

if (IS_ENABLED_RT && preemptebale())

Also I have a question about pcp-lists. Can we introduce and use all allowed
MIGRATE_PCPTYPES if called with GFP_RT_SAFE? If not, please elaborate.
According to my tests it really helps when it comes to succeeding (returning
the page) rather than returning NULL, because any one of the below types:
 MIGRATE_UNMOVABLE,
 MIGRATE_MOVABLE,
 MIGRATE_RECLAIMABLE,

can have a page, making the allocation succeed.

Thanks!

--
Vlad Rezki
Michal Hocko Aug. 13, 2020, 11:15 a.m. UTC | #32
On Thu 13-08-20 11:58:40, Uladzislau Rezki wrote:
[...]
> Sorry for jumping in. We can rely on preemptable() for sure, if CONFIG_PREEMPT_RT
> is enabled, something like below:
> 
> if (IS_ENABLED_RT && preemptebale())

Sure. I thought this was an RT-specific thing that would be a noop
otherwise.

> Also i have a question about pcp-lists. Can we introduce and use all allowed 
> MIGRATE_PCPTYPES? If called with GFP_RT_SAFE? If not please elaborate.

Yes we can. This depends on the provided gfp_mask. Many gfp flags will
be meaningless for such a context but essentially all we care about is
that ((gfp_mask & __GFP_RECLAIM) == GFP_RT_SAFE) for the checking
purpose. We can sort out all these details if this is decided to be the
right way to go. My (pseudo) patch was mostly to show the direction I've
had in mind for easier further discussion.
Thomas Gleixner Aug. 13, 2020, 1:22 p.m. UTC | #33
Uladzislau Rezki <urezki@gmail.com> writes:
> On Thu, Aug 13, 2020 at 09:50:27AM +0200, Michal Hocko wrote:
>> On Wed 12-08-20 02:13:25, Thomas Gleixner wrote:
>> [...]
>> > I can understand your rationale and what you are trying to solve. So, if
>> > we can actually have a distinct GFP variant:
>> > 
>> >   GFP_I_ABSOLUTELY_HAVE_TO_DO_THAT_AND_I_KNOW_IT_CAN_FAIL_EARLY
>> 
>> Even if we cannot make the zone->lock raw I would prefer to not
>> introduce a new gfp flag. Well we can do an alias for easier grepping
>> #define GFP_RT_SAFE	0

Just using 0 is sneaky but yes, that's fine :)

Bikeshedding: GFP_RT_NOWAIT or such might be more obvious.

>> that would imply nowait semantic and would exclude waking up kswapd as
>> well. If we can make wake up safe under RT then the alias would reflect
>> that without any code changes.

It basically requires converting the wait queue to something else. Is
the waitqueue strictly single-waiter?

>> The second, and the more important part, would be to bail out anytime
>> the page allocator is to take a lock which is not allowed in the current
>> RT context. Something like

>> +	/*
>> +	 * Hard atomic contexts are not supported by the allocator for
>> +	 * anything but pcp requests
>> +	 */
>> +	if (!preemtable())

If you make that preemtible() it might even compile, but that still won't
work because if CONFIG_PREEMPT_COUNT=n then preemptible() is always
false.

So that should be:

	if (!preemptible() && gfp == GFP_RT_NOWAIT)

which is limiting the damage to those callers which hand in
GFP_RT_NOWAIT.

lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits
zone->lock in the wrong context. And we want to know about that so we
can look at the caller and figure out how to solve it.

>> > The page allocator allocations should also have a limit on the number of
>> > pages and eventually also page order (need to stare at the code or let
>> > Michal educate me that the order does not matter).
>> 
>> In practice anything but order 0 is out of the question because we need
>> zone->lock for that currently. Maybe we can introduce pcp lists for
>> higher orders in the future - I have a vague recollection Mel was
>> playing with that some time ago.

Ok.
 
>> > To make it consistent the same GFP_ variant should allow the slab
>> > allocator go to the point where the slab cache is exhausted.
>> > 
>> > Having a distinct and clearly defined GFP_ variant is really key to
>> > chase down offenders and to make reviewers double check upfront why this
>> > is absolutely required.
>> 
>> Having a high level and recognizable gfp mask is OK but I would really
>> like not to introduce a dedicated flag. The page allocator should be
>> able to recognize the context which cannot be handled.

The GFP_xxx == 0 is perfectly fine.

> Sorry for jumping in. We can rely on preemptable() for sure, if CONFIG_PREEMPT_RT
> is enabled, something like below:
>
> if (IS_ENABLED_RT && preemptebale())

Ha, you morphed preemtable() into preemptebale() which will not compile
either :)

Thanks,

        tglx
Thomas Gleixner Aug. 13, 2020, 1:27 p.m. UTC | #34
Michal Hocko <mhocko@suse.com> writes:
> On Thu 13-08-20 11:58:40, Uladzislau Rezki wrote:
> [...]
>> Sorry for jumping in. We can rely on preemptable() for sure, if CONFIG_PREEMPT_RT
>> is enabled, something like below:
>> 
>> if (IS_ENABLED_RT && preemptebale())
>
> Sure. I thought this was an RT specific thing that would be noop
> otherwise.

Well, even if it were RT-specific, it would still be something returning
either true or false unconditionally.

And guarding it with RT is not working either because then you are back
to square one with the problem which triggered the discussion in the
first place:

raw_spin_lock()
  alloc()
    if (RT && !preemptible())  <- False because RT == false
    	goto bail;

    spin_lock(&zone->lock)  --> LOCKDEP complains

So either you convince Paul not to do that or you need to do something
like I suggested in my other reply.

Thanks,

        tglx
Uladzislau Rezki Aug. 13, 2020, 1:29 p.m. UTC | #35
Hello, Michal.

> > On Wed 12-08-20 02:13:25, Thomas Gleixner wrote:
> > [...]
> > > I can understand your rationale and what you are trying to solve. So, if
> > > we can actually have a distinct GFP variant:
> > > 
> > >   GFP_I_ABSOLUTELY_HAVE_TO_DO_THAT_AND_I_KNOW_IT_CAN_FAIL_EARLY
> > 
> > Even if we cannot make the zone->lock raw I would prefer to not
> > introduce a new gfp flag. Well we can do an alias for easier grepping
> > #define GFP_RT_SAFE	0
> > 
> > that would imply nowait semantic and would exclude waking up kswapd as
> > well. If we can make wake up safe under RT then the alias would reflect
> > that without any code changes.
> > 
> > The second, and the more important part, would be to bail out anytime
> > the page allocator is to take a lock which is not allowed in the current
> > RT context. Something like 
> > 	
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 67a0774e080b..3ef3ac82d110 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -237,6 +237,9 @@ struct vm_area_struct;
> >   * that subsystems start with one of these combinations and then set/clear
> >   * %__GFP_FOO flags as necessary.
> >   *
> > + * %GFP_RT_SAFE users can not sleep and they are running under RT atomic context
> > + * e.g. under raw_spin_lock. Failure of an allocation is to be expected.
> > + *
> >   * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
> >   * watermark is applied to allow access to "atomic reserves"
> >   *
> > @@ -293,6 +296,7 @@ struct vm_area_struct;
> >   * version does not attempt reclaim/compaction at all and is by default used
> >   * in page fault path, while the non-light is used by khugepaged.
> >   */
> > +#define GFP_RT_SAFE	0
> >  #define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
> >  #define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
> >  #define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index e028b87ce294..268ae872cc2a 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2824,6 +2824,13 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> >  {
> >  	int i, alloced = 0;
> >  
> > +	/*
> > +	 * Hard atomic contexts are not supported by the allocator for
> > +	 * anything but pcp requests
> > +	 */
> > +	if (!preemtable())
> > +		return 0;
> > +
> >  	spin_lock(&zone->lock);
> >  	for (i = 0; i < count; ++i) {
> >  		struct page *page = __rmqueue(zone, order, migratetype,
> > @@ -3371,6 +3378,13 @@ struct page *rmqueue(struct zone *preferred_zone,
> >  		goto out;
> >  	}
> >  
> > +	/*
> > +	 * Hard atomic contexts are not supported by the allocator for high
> > +	 * order requests
> > +	 */
> > +	if (WARN_ON_ONCE(!preemtable()))
> > +		return NULL;
> > +
> >  	/*
> >  	 * We most definitely don't want callers attempting to
> >  	 * allocate greater than order-1 page units with __GFP_NOFAIL.
> > @@ -4523,6 +4537,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
> >  		gfp_mask &= ~__GFP_ATOMIC;
> >  
> > +	/* Hard atomic contexts support is very limited to the fast path */
> > +	if (!preemtable()) {
> > +		WARN_ON_ONCE(gfp_mask != GFP_RT_SAFE);
> > +		return NULL;
> > +	}
> > +
> >  retry_cpuset:
> >  	compaction_retries = 0;
> >  	no_progress_loops = 0;
> > 
> > What do you think?
> >  
> > > which is easy to grep for then having the page allocator go down to the
> > > point where zone lock gets involved is not the end of the world for
> > > RT in theory - unless that damned reality tells otherwise. :)
> > > 
> > > The page allocator allocations should also have a limit on the number of
> > > pages and eventually also page order (need to stare at the code or let
> > > Michal educate me that the order does not matter).
> > 
> > In practice anything but order 0 is out of question because we need
> > zone->lock for that currently. Maybe we can introduce pcp lists for
> > higher orders in the future - I have a vague recollection Mel was
> > playing with that some time ago.
> > 
> > > To make it consistent the same GFP_ variant should allow the slab
> > > allocator go to the point where the slab cache is exhausted.
> > > 
> > > Having a distinct and clearly defined GFP_ variant is really key to
> > > chase down offenders and to make reviewers double check upfront why this
> > > is absolutely required.
> > 
> > Having a high level and recognizable gfp mask is OK but I would really
> > like not to introduce a dedicated flag. The page allocator should be
> > able to recognize the context which cannot be handled. 
> >
> Sorry for jumping in. We can rely on preemptible() for sure, if CONFIG_PREEMPT_RT
> is enabled, something like below:
> 
> if (IS_ENABLED_RT && preemptible())
> 
I was a bit out of focus and did not mention one thing. Identifying the context
type using preemptible() primitives looks a bit unsuitable, IMHO. GFP_* flags are
not supposed to identify a context type, because it is not possible for all cases.
But that i

Also, bailing out based on the context type will not allow getting a page from atomic
contexts, which is what we try to achieve :)

Whereas, I mean, we do have the possibility to do a lock-less per-cpu-list allocation without
touching any zone lock.

if (gfp_mask == 0) {
   check_pcp_lists();
      if (page)
          return page;

    bail out here without doing further logic, like pre-fetching.
    return NULL;
}

For example, consider our case:
kfree_rcu() -> raw_spin_lock() -> page_alloc() -> preemptible() -> false

--
Vlad Rezki
Michal Hocko Aug. 13, 2020, 1:33 p.m. UTC | #36
On Thu 13-08-20 15:22:00, Thomas Gleixner wrote:
> Uladzislau Rezki <urezki@gmail.com> writes:
> > On Thu, Aug 13, 2020 at 09:50:27AM +0200, Michal Hocko wrote:
> >> On Wed 12-08-20 02:13:25, Thomas Gleixner wrote:
> >> [...]
> >> > I can understand your rationale and what you are trying to solve. So, if
> >> > we can actually have a distinct GFP variant:
> >> > 
> >> >   GFP_I_ABSOLUTELY_HAVE_TO_DO_THAT_AND_I_KNOW_IT_CAN_FAIL_EARLY
> >> 
> >> Even if we cannot make the zone->lock raw I would prefer to not
> >> introduce a new gfp flag. Well we can do an alias for easier grepping
> >> #define GFP_RT_SAFE	0
> 
> Just using 0 is sneaky but yes, that's fine :)
> 
> Bikeshedding: GFP_RT_NOWAIT or such might be more obvious.

Sounds good.

> >> that would imply nowait semantic and would exclude waking up kswapd as
> >> well. If we can make wake up safe under RT then the alias would reflect
> >> that without any code changes.
> 
> It basically requires to convert the wait queue to something else. Is
> the waitqueue strict single waiter?

I would have to double check. From what I remember only kswapd should
ever sleep on it.

> >> The second, and the more important part, would be to bail out anytime
> >> the page allocator is to take a lock which is not allowed in the current
> >> RT context. Something like
> 
> >> +	/*
> >> +	 * Hard atomic contexts are not supported by the allocator for
> >> +	 * anything but pcp requests
> >> +	 */
> >> +	if (!preemtable())
> 
> If you make that preemptible() it might even compile, but that still won't
> work because if CONFIG_PREEMPT_COUNT=n then preemptible() is always
> false.

It would be nice to hide all that behind a helper guarded by
PREEMPT_RT. That would imply PREEMPT_COUNT automatically, right?

> 
> So that should be:
> 
> 	if (!preemptible() && gfp == GFP_RT_NOWAIT)
> 
> which is limiting the damage to those callers which hand in
> GFP_RT_NOWAIT.
> 
> lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits
> zone->lock in the wrong context. And we want to know about that so we
> can look at the caller and figure out how to solve it.

Yes, that would have to somehow need to annotate the zone_lock to be ok
in those paths so that lockdep doesn't complain.
Michal Hocko Aug. 13, 2020, 1:41 p.m. UTC | #37
On Thu 13-08-20 15:29:31, Uladzislau Rezki wrote:
[...]
> I was a bit out of focus and did not mention one thing. Identifying the context
> type using preemptible() primitives looks a bit unsuitable, IMHO. GFP_* flags are
> not supposed to identify a context type, because it is not possible for all cases.

Yes, GFP flags do not identify context and that is why my draft didn't
really consider gfp flags for anything but the retry logic, which is
gfp based already. The buddy allocator path simply always bails
out for those rt atomic paths whenever it gets close to zone->lock.

> But that i

You meant to say more I guess

> Also, bailing out based on the context type will not allow getting a page from atomic
> contexts, which is what we try to achieve :)

Yes lockdep will need some special treatment but I suspect we want to
address the underlying problem first and only then take care of the
lockdep.

> Whereas, I mean, we do have the possibility to do a lock-less per-cpu-list allocation without
> touching any zone lock.
> 
> if (gfp_mask == 0) {
>    check_pcp_lists();
>       if (page)
>           return page;
> 
>     bail out here without doing further logic, like pre-fetching.
>     return NULL;
> }

The existing code does that already. __rmqueue_pcplist will go to
rmqueue_bulk only when the pcp lists are empty. Or did I miss your point?
Michal Hocko Aug. 13, 2020, 1:45 p.m. UTC | #38
On Thu 13-08-20 15:27:15, Thomas Gleixner wrote:
> Michal Hocko <mhocko@suse.com> writes:
> > On Thu 13-08-20 11:58:40, Uladzislau Rezki wrote:
> > [...]
> >> Sorry for jumping in. We can rely on preemptible() for sure, if CONFIG_PREEMPT_RT
> >> is enabled, something like below:
> >> 
> >> if (IS_ENABLED_RT && preemptible())
> >
> > Sure. I thought this was an RT specific thing that would be noop
> > otherwise.
> 
> Well, even if RT specific it would be still something returning either
> true or false unconditionally.
> 
> And guarding it with RT is not working either because then you are back
> to square one with the problem which triggered the discussion in the
> first place:
> 
> raw_spin_lock()
>   alloc()
>     if (RT && !preemptible())  <- False because RT == false
>     	goto bail;
> 
>     spin_lock(&zone->lock)  --> LOCKDEP complains
> 
> So either you convince Paul not to do that or you need to do something
> like I suggested in my other reply.

Can we somehow annotate the lock to be safe for nesting for lockdep?
Uladzislau Rezki Aug. 13, 2020, 2:22 p.m. UTC | #39
On Thu, Aug 13, 2020 at 03:41:39PM +0200, Michal Hocko wrote:
> On Thu 13-08-20 15:29:31, Uladzislau Rezki wrote:
> [...]
> > I was a bit out of focus and did not mention one thing. Identifying the context
> > type using preemptible() primitives looks a bit unsuitable, IMHO. GFP_* flags are
> > not supposed to identify a context type, because it is not possible for all cases.
> 
> Yes, GFP flags do not identify context and that is why my draft didn't
> really consider gfp flags for anything but the retry logic, which is
> gfp based already. The buddy allocator path simply always bails
> out for those rt atomic paths whenever it gets close to zone->lock.
> 
> > But that i
> 
> You meant to say more I guess
> 
Ahh. Right. It was not completed. The idea was that we do not really need
to identify whether we are preemptible or not. Unless we want to help RT proceed
further, based on "RT && preemptible()", allowing it to take zone->lock and
improving the chance of the allocation succeeding. Basically not bail out.

> > Also, bailing out based on the context type will not allow getting a page from atomic
> > contexts, which is what we try to achieve :)
> 
> Yes lockdep will need some special treatment but I suspect we want to
> address the underlying problem first and only then take care of the
> lockdep.
> 
> > Whereas, I mean, we do have the possibility to do a lock-less per-cpu-list allocation without
> > touching any zone lock.
> > 
> > if (gfp_mask == 0) {
> >    check_pcp_lists();
> >       if (page)
> >           return page;
> > 
> >     bail out here without doing further logic, like pre-fetching.
> >     return NULL;
> > }
> 
> The existing code does that already. __rmqueue_pcplist will go to
> rmqueue_bulk only when the pcp lists are empty. Or did I miss your point?
> 
Right. Probably we look at it from different angles :) Right, when the
pcp list is empty the zone->lock is accessed.

I got the feeling that you want to bail out based on "if (RT && !preemptible())",
i.e. on RT the code below will always return NULL:

<snip>
    raw_spin_lock();
    page_alloc();
<snip>

Thanks!

--
Vlad Rezki
Matthew Wilcox Aug. 13, 2020, 2:32 p.m. UTC | #40
On Thu, Aug 13, 2020 at 03:27:15PM +0200, Thomas Gleixner wrote:
> And guarding it with RT is not working either because then you are back
> to square one with the problem which triggered the discussion in the
> first place:
> 
> raw_spin_lock()
>   alloc()
>     if (RT && !preemptible())  <- False because RT == false
>     	goto bail;
> 
>     spin_lock(&zone->lock)  --> LOCKDEP complains
> 
> So either you convince Paul not to do that or you need to do something
> like I suggested in my other reply.

I'd like to throw in the possibility that we do something like:

  raw_spin_lock()
    alloc()
      if (!spin_trylock(&zone->lock))
        if (RT && !preemptible())
          goto bail;
        spin_lock(&zone->lock);

would that make us feel more comfortable about converting zone->lock to
a raw spinlock?
Thomas Gleixner Aug. 13, 2020, 2:34 p.m. UTC | #41
Michal Hocko <mhocko@suse.com> writes:
> On Thu 13-08-20 15:22:00, Thomas Gleixner wrote:
>> It basically requires to convert the wait queue to something else. Is
>> the waitqueue strict single waiter?
>
> I would have to double check. From what I remember only kswapd should
> ever sleep on it.

That would make it trivial as we could simply switch it over to rcu_wait.

>> So that should be:
>> 
>> 	if (!preemptible() && gfp == GFP_RT_NOWAIT)
>> 
>> which is limiting the damage to those callers which hand in
>> GFP_RT_NOWAIT.
>> 
>> lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits
>> zone->lock in the wrong context. And we want to know about that so we
>> can look at the caller and figure out how to solve it.
>
> Yes, that would have to somehow need to annotate the zone_lock to be ok
> in those paths so that lockdep doesn't complain.

That opens the worst of all cans of worms. If we start this here then
Joe programmer and his dog will use these lockdep annotation to evade
warnings and when exposed to RT it will fall apart in pieces. Just that
at that point Joe programmer moved on to something else and the usual
suspects can mop up the pieces. We've seen that all over the place and
some people even disable lockdep temporarily because annotations don't
help.

PeterZ might have opinions about that too I suspect.

Really, if your primary lockless caches are empty then any allocation
which comes from deep atomic context should simply always fail. Being
stuck in an interrupt handler or even deeper for 200+ microseconds
waiting for zone lock is just bonkers IMO.

Thanks,

        tglx
Michal Hocko Aug. 13, 2020, 2:53 p.m. UTC | #42
On Thu 13-08-20 16:34:57, Thomas Gleixner wrote:
> Michal Hocko <mhocko@suse.com> writes:
> > On Thu 13-08-20 15:22:00, Thomas Gleixner wrote:
> >> It basically requires to convert the wait queue to something else. Is
> >> the waitqueue strict single waiter?
> >
> > I would have to double check. From what I remember only kswapd should
> > ever sleep on it.
> 
> That would make it trivial as we could simply switch it over to rcu_wait.
> 
> >> So that should be:
> >> 
> >> 	if (!preemptible() && gfp == GFP_RT_NOWAIT)
> >> 
> >> which is limiting the damage to those callers which hand in
> >> GFP_RT_NOWAIT.
> >> 
> >> lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits
> >> zone->lock in the wrong context. And we want to know about that so we
> >> can look at the caller and figure out how to solve it.
> >
> > Yes, that would have to somehow need to annotate the zone_lock to be ok
> > in those paths so that lockdep doesn't complain.
> 
> That opens the worst of all cans of worms. If we start this here then
> Joe programmer and his dog will use these lockdep annotation to evade
> warnings and when exposed to RT it will fall apart in pieces. Just that
> at that point Joe programmer moved on to something else and the usual
> suspects can mop up the pieces. We've seen that all over the place and
> some people even disable lockdep temporarily because annotations don't
> help.

Hmm. I am likely missing something really important here. We have two
problems at hand:
1) RT will become broken as soon as this new RCU functionality which
requires an allocation from inside of raw_spinlock hits the RT tree
2) lockdep splats which are telling us that early because of the
raw_spinlock-> spin_lock dependency.

1) can be handled by bailing out whenever we have to use
zone->lock inside the buddy allocator - essentially even more strict
NOWAIT semantic than we have for RT tree - proposed (pseudo) patch is
trying to describe that.

2) would become a false positive if 1) is in place, right? RT wouldn't
do the illegal nesting and !RT would just work fine because
GFP_RT_NOWAIT would be simply GFP_NOWAIT & ~__GFP_KSWAPD_RECLAIM.
Why should we limit the functionality of the allocator for something
that is not a real problem?

> PeterZ might have opinions about that too I suspect.
> 
> Really, if your primary lockless caches are empty then any allocation
> which comes from deep atomic context should simply always fail. Being
> stuck in an interrupt handler or even deeper for 200+ microseconds
> waiting for zone lock is just bonkers IMO.

That would require changing NOWAIT/ATOMIC allocations semantic quite
drastically for !RT kernels as well. I am not sure this is something we
can do. Or maybe I am just missing your point.
Paul E. McKenney Aug. 13, 2020, 3:41 p.m. UTC | #43
On Thu, Aug 13, 2020 at 04:53:35PM +0200, Michal Hocko wrote:
> On Thu 13-08-20 16:34:57, Thomas Gleixner wrote:
> > Michal Hocko <mhocko@suse.com> writes:
> > > On Thu 13-08-20 15:22:00, Thomas Gleixner wrote:
> > >> It basically requires to convert the wait queue to something else. Is
> > >> the waitqueue strict single waiter?
> > >
> > > I would have to double check. From what I remember only kswapd should
> > > ever sleep on it.
> > 
> > That would make it trivial as we could simply switch it over to rcu_wait.
> > 
> > >> So that should be:
> > >> 
> > >> 	if (!preemptible() && gfp == GFP_RT_NOWAIT)
> > >> 
> > >> which is limiting the damage to those callers which hand in
> > >> GFP_RT_NOWAIT.
> > >> 
> > >> lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits
> > >> zone->lock in the wrong context. And we want to know about that so we
> > >> can look at the caller and figure out how to solve it.
> > >
> > > Yes, that would have to somehow need to annotate the zone_lock to be ok
> > > in those paths so that lockdep doesn't complain.
> > 
> > That opens the worst of all cans of worms. If we start this here then
> > Joe programmer and his dog will use these lockdep annotation to evade
> > warnings and when exposed to RT it will fall apart in pieces. Just that
> > at that point Joe programmer moved on to something else and the usual
> > suspects can mop up the pieces. We've seen that all over the place and
> > some people even disable lockdep temporarily because annotations don't
> > help.
> 
> Hmm. I am likely missing something really important here. We have two
> problems at hand:
> 1) RT will become broken as soon as this new RCU functionality which
> requires an allocation from inside of raw_spinlock hits the RT tree
> 2) lockdep splats which are telling us that early because of the
> raw_spinlock-> spin_lock dependency.

That is a reasonable high-level summary.

> 1) can be handled by bailing out whenever we have to use
> zone->lock inside the buddy allocator - essentially even more strict
> NOWAIT semantic than we have for RT tree - proposed (pseudo) patch is
> trying to describe that.

Unless I am missing something subtle, the problem with this approach
is that in production-environment CONFIG_PREEMPT_NONE=y kernels, there
is no way at runtime to distinguish between holding a spinlock on the
one hand and holding a raw spinlock on the other.  Therefore, without
some sort of indication from the caller, this approach will not make
CONFIG_PREEMPT_NONE=y users happy.

> 2) would become a false positive if 1) is in place, right? RT wouldn't
> do the illegal nesting and !RT would just work fine because
> GFP_RT_NOWAIT would be simply GFP_NOWAIT & ~__GFP_KSWAPD_RECLAIM.
> Why should we limit the functionality of the allocator for something
> that is not a real problem?

I will confess that I found this approach quite attractive, at least
until I dug into it.  But it runs up against the inability to distinguish
between holding a spinlock and holding a raw spinlock.  This inability
shows up in non-debugging CONFIG_PREEMPT_NONE=y kernels.  (Specifically,
production kernels of this type will have CONFIG_PREEMPT_COUNT=n.)

I will always have a warm spot in my heart for RT, but this
CONFIG_PREEMPT_NONE=y&&CONFIG_PREEMPT_COUNT=n configuration is still
very important.

> > PeterZ might have opinions about that too I suspect.
> > 
> > Really, if your primary lockless caches are empty then any allocation
> > which comes from deep atomic context should simply always fail. Being
> > stuck in an interrupt handler or even deeper for 200+ microseconds
> > waiting for zone lock is just bonkers IMO.
> 
> That would require changing NOWAIT/ATOMIC allocations semantic quite
> drastically for !RT kernels as well. I am not sure this is something we
> can do. Or maybe I am just missing your point.

Exactly, and avoiding changing this semantic for current users is
precisely why we are proposing some sort of indication to be passed
into the allocation request.  In Uladzislau's patch, this was the
__GFP_NO_LOCKS flag, but whatever works.

							Thanx, Paul
Michal Hocko Aug. 13, 2020, 3:54 p.m. UTC | #44
On Thu 13-08-20 08:41:59, Paul E. McKenney wrote:
> On Thu, Aug 13, 2020 at 04:53:35PM +0200, Michal Hocko wrote:
> > On Thu 13-08-20 16:34:57, Thomas Gleixner wrote:
> > > Michal Hocko <mhocko@suse.com> writes:
> > > > On Thu 13-08-20 15:22:00, Thomas Gleixner wrote:
> > > >> It basically requires to convert the wait queue to something else. Is
> > > >> the waitqueue strict single waiter?
> > > >
> > > > I would have to double check. From what I remember only kswapd should
> > > > ever sleep on it.
> > > 
> > > That would make it trivial as we could simply switch it over to rcu_wait.
> > > 
> > > >> So that should be:
> > > >> 
> > > >> 	if (!preemptible() && gfp == GFP_RT_NOWAIT)
> > > >> 
> > > >> which is limiting the damage to those callers which hand in
> > > >> GFP_RT_NOWAIT.
> > > >> 
> > > >> lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits
> > > >> zone->lock in the wrong context. And we want to know about that so we
> > > >> can look at the caller and figure out how to solve it.
> > > >
> > > > Yes, that would have to somehow need to annotate the zone_lock to be ok
> > > > in those paths so that lockdep doesn't complain.
> > > 
> > > That opens the worst of all cans of worms. If we start this here then
> > > Joe programmer and his dog will use these lockdep annotation to evade
> > > warnings and when exposed to RT it will fall apart in pieces. Just that
> > > at that point Joe programmer moved on to something else and the usual
> > > suspects can mop up the pieces. We've seen that all over the place and
> > > some people even disable lockdep temporarily because annotations don't
> > > help.
> > 
> > Hmm. I am likely missing something really important here. We have two
> > problems at hand:
> > 1) RT will become broken as soon as this new RCU functionality which
> > requires an allocation from inside of raw_spinlock hits the RT tree
> > 2) lockdep splats which are telling us that early because of the
> > raw_spinlock-> spin_lock dependency.
> 
> That is a reasonable high-level summary.
> 
> > 1) can be handled by bailing out whenever we have to use
> > zone->lock inside the buddy allocator - essentially even more strict
> > NOWAIT semantic than we have for RT tree - proposed (pseudo) patch is
> > trying to describe that.
> 
> Unless I am missing something subtle, the problem with this approach
> is that in production-environment CONFIG_PREEMPT_NONE=y kernels, there
> is no way at runtime to distinguish between holding a spinlock on the
> one hand and holding a raw spinlock on the other.  Therefore, without
> some sort of indication from the caller, this approach will not make
> CONFIG_PREEMPT_NONE=y users happy.

If the whole bailout is guarded by CONFIG_PREEMPT_RT specific atomicity
check then there is no functional problem - GFP_RT_SAFE would still be
GFP_NOWAIT so functional wise the allocator will still do the right
thing.

[...]

> > That would require changing NOWAIT/ATOMIC allocations semantic quite
> > drastically for !RT kernels as well. I am not sure this is something we
> > can do. Or maybe I am just missing your point.
> 
> Exactly, and avoiding changing this semantic for current users is
> precisely why we are proposing some sort of indication to be passed
> into the allocation request.  In Uladzislau's patch, this was the
> __GFP_NO_LOCKS flag, but whatever works.

As I've tried to explain already, I would really hope we can do without
any new gfp flags. We are running out of them and they tend to generate
a lot of maintenance burden. There is a lot of abuse etc. We should also
not expose such an implementation detail of the allocator to callers
because that would make future changes even harder. The alias, on the
other hand already builds on top of existing NOWAIT semantic and it
just helps the allocator to complain about a wrong usage while it
doesn't expose any internals.
Paul E. McKenney Aug. 13, 2020, 4:04 p.m. UTC | #45
On Thu, Aug 13, 2020 at 05:54:12PM +0200, Michal Hocko wrote:
> On Thu 13-08-20 08:41:59, Paul E. McKenney wrote:
> > On Thu, Aug 13, 2020 at 04:53:35PM +0200, Michal Hocko wrote:
> > > On Thu 13-08-20 16:34:57, Thomas Gleixner wrote:
> > > > Michal Hocko <mhocko@suse.com> writes:
> > > > > On Thu 13-08-20 15:22:00, Thomas Gleixner wrote:
> > > > >> It basically requires to convert the wait queue to something else. Is
> > > > >> the waitqueue strict single waiter?
> > > > >
> > > > > I would have to double check. From what I remember only kswapd should
> > > > > ever sleep on it.
> > > > 
> > > > That would make it trivial as we could simply switch it over to rcu_wait.
> > > > 
> > > > >> So that should be:
> > > > >> 
> > > > >> 	if (!preemptible() && gfp == GFP_RT_NOWAIT)
> > > > >> 
> > > > >> which is limiting the damage to those callers which hand in
> > > > >> GFP_RT_NOWAIT.
> > > > >> 
> > > > >> lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits
> > > > >> zone->lock in the wrong context. And we want to know about that so we
> > > > >> can look at the caller and figure out how to solve it.
> > > > >
> > > > > Yes, that would have to somehow need to annotate the zone_lock to be ok
> > > > > in those paths so that lockdep doesn't complain.
> > > > 
> > > > That opens the worst of all cans of worms. If we start this here then
> > > > Joe programmer and his dog will use these lockdep annotation to evade
> > > > warnings and when exposed to RT it will fall apart in pieces. Just that
> > > > at that point Joe programmer moved on to something else and the usual
> > > > suspects can mop up the pieces. We've seen that all over the place and
> > > > some people even disable lockdep temporarily because annotations don't
> > > > help.
> > > 
> > > Hmm. I am likely missing something really important here. We have two
> > > problems at hand:
> > > 1) RT will become broken as soon as this new RCU functionality which
> > > requires an allocation from inside of raw_spinlock hits the RT tree
> > > 2) lockdep splats which are telling us that early because of the
> > > raw_spinlock-> spin_lock dependency.
> > 
> > That is a reasonable high-level summary.
> > 
> > > 1) can be handled by bailing out whenever we have to use
> > > zone->lock inside the buddy allocator - essentially even more strict
> > > NOWAIT semantic than we have for RT tree - proposed (pseudo) patch is
> > > trying to describe that.
> > 
> > Unless I am missing something subtle, the problem with this approach
> > is that in production-environment CONFIG_PREEMPT_NONE=y kernels, there
> > is no way at runtime to distinguish between holding a spinlock on the
> > one hand and holding a raw spinlock on the other.  Therefore, without
> > some sort of indication from the caller, this approach will not make
> > CONFIG_PREEMPT_NONE=y users happy.
> 
> If the whole bailout is guarded by CONFIG_PREEMPT_RT specific atomicity
> check then there is no functional problem - GFP_RT_SAFE would still be
> GFP_NOWAIT so functional wise the allocator will still do the right
> thing.

Perhaps it was just me getting confused, early hour Pacific Time and
whatever other excuses might apply.  But I thought that you still had
an objection to GFP_RT_SAFE based on changes in allocator semantics for
other users.

Of course, if it is just me being confused, by all means let's give this
a shot!!!

> [...]
> 
> > > That would require changing NOWAIT/ATOMIC allocations semantic quite
> > > drastically for !RT kernels as well. I am not sure this is something we
> > > can do. Or maybe I am just missing your point.
> > 
> > Exactly, and avoiding changing this semantic for current users is
> > precisely why we are proposing some sort of indication to be passed
> > into the allocation request.  In Uladzislau's patch, this was the
> > __GFP_NO_LOCKS flag, but whatever works.
> 
> As I've tried to explain already, I would really hope we can do without
> any new gfp flags. We are running out of them and they tend to generate
> a lot of maintenance burden. There is a lot of abuse etc. We should also
> not expose such an implementation detail of the allocator to callers
> because that would make future changes even harder. The alias, on the
> other hand already builds on top of existing NOWAIT semantic and it
> just helps the allocator to complain about a wrong usage while it
> doesn't expose any internals.

And I am plenty happy if an existing flag or set of flags (up to and
including the all-zeroes empty set) can be used in this case.

							Thanx, Paul
Michal Hocko Aug. 13, 2020, 4:13 p.m. UTC | #46
On Thu 13-08-20 09:04:42, Paul E. McKenney wrote:
> On Thu, Aug 13, 2020 at 05:54:12PM +0200, Michal Hocko wrote:
[...]
> > If the whole bailout is guarded by CONFIG_PREEMPT_RT specific atomicity
> > check then there is no functional problem - GFP_RT_SAFE would still be
> > GFP_NOWAIT so functional wise the allocator will still do the right
> > thing.
> 
> Perhaps it was just me getting confused, early hour Pacific Time and
> whatever other excuses might apply.  But I thought that you still had
> an objection to GFP_RT_SAFE based on changes in allocator semantics for
> other users.

There is still that problem with lockdep complaining about raw->regular
spinlock on !PREEMPT_RT that would need to get resolved somehow. Thomas
is not really keen on adding some lockdep annotation mechanism and
unfortunatelly I do not have a different idea how to get rid of those.
Thomas Gleixner Aug. 13, 2020, 4:14 p.m. UTC | #47
Matthew Wilcox <willy@infradead.org> writes:
> On Thu, Aug 13, 2020 at 03:27:15PM +0200, Thomas Gleixner wrote:
>> And guarding it with RT is not working either because then you are back
>> to square one with the problem which triggered the discussion in the
>> first place:
>> 
>> raw_spin_lock()
>>   alloc()
>>     if (RT && !preemptible())  <- False because RT == false
>>     	goto bail;
>> 
>>     spin_lock(&zone->lock)  --> LOCKDEP complains
>> 
>> So either you convince Paul not to do that or you need to do something
>> like I suggested in my other reply.
>
> I'd like to throw in the possibility that we do something like:
>
>   raw_spin_lock()
>     alloc()
>       if (!spin_trylock(&zone->lock))
>         if (RT && !preemptible())
>           goto bail;
>         spin_lock(&zone->lock);
>
> would that make us feel more comfortable about converting zone->lock to
> a raw spinlock?

Even if that could cure that particular problem of allocations in deep
atomic context, making zone->lock raw brings back the problem of
zone->lock being held/contended for hundreds of microseconds with
interrupts disabled which is causing RT tasks to miss their deadlines by
big margins.

Thanks,

        tglx
Uladzislau Rezki Aug. 13, 2020, 4:20 p.m. UTC | #48
> On Thu 13-08-20 08:41:59, Paul E. McKenney wrote:
> > On Thu, Aug 13, 2020 at 04:53:35PM +0200, Michal Hocko wrote:
> > > On Thu 13-08-20 16:34:57, Thomas Gleixner wrote:
> > > > Michal Hocko <mhocko@suse.com> writes:
> > > > > On Thu 13-08-20 15:22:00, Thomas Gleixner wrote:
> > > > >> It basically requires to convert the wait queue to something else. Is
> > > > >> the waitqueue strict single waiter?
> > > > >
> > > > > I would have to double check. From what I remember only kswapd should
> > > > > ever sleep on it.
> > > > 
> > > > That would make it trivial as we could simply switch it over to rcu_wait.
> > > > 
> > > > >> So that should be:
> > > > >> 
> > > > >> 	if (!preemptible() && gfp == GFP_RT_NOWAIT)
> > > > >> 
> > > > >> which is limiting the damage to those callers which hand in
> > > > >> GFP_RT_NOWAIT.
> > > > >> 
> > > > >> lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits
> > > > >> zone->lock in the wrong context. And we want to know about that so we
> > > > >> can look at the caller and figure out how to solve it.
> > > > >
> > > > > Yes, that would have to somehow need to annotate the zone_lock to be ok
> > > > > in those paths so that lockdep doesn't complain.
> > > > 
> > > > That opens the worst of all cans of worms. If we start this here then
> > > > Joe programmer and his dog will use these lockdep annotation to evade
> > > > warnings and when exposed to RT it will fall apart in pieces. Just that
> > > > at that point Joe programmer moved on to something else and the usual
> > > > suspects can mop up the pieces. We've seen that all over the place and
> > > > some people even disable lockdep temporarily because annotations don't
> > > > help.
> > > 
> > > Hmm. I am likely missing something really important here. We have two
> > > problems at hand:
> > > 1) RT will become broken as soon as this new RCU functionality which
> > > requires an allocation from inside of raw_spinlock hits the RT tree
> > > 2) lockdep splats which are telling us that early because of the
> > > raw_spinlock-> spin_lock dependency.
> > 
> > That is a reasonable high-level summary.
> > 
> > > 1) can be handled by bailing out whenever we have to use
> > > zone->lock inside the buddy allocator - essentially even more strict
> > > NOWAIT semantic than we have for RT tree - proposed (pseudo) patch is
> > > trying to describe that.
> > 
> > Unless I am missing something subtle, the problem with this approach
> > is that in production-environment CONFIG_PREEMPT_NONE=y kernels, there
> > is no way at runtime to distinguish between holding a spinlock on the
> > one hand and holding a raw spinlock on the other.  Therefore, without
> > some sort of indication from the caller, this approach will not make
> > CONFIG_PREEMPT_NONE=y users happy.
> 
> If the whole bailout is guarded by CONFIG_PREEMPT_RT specific atomicity
> check then there is no functional problem - GFP_RT_SAFE would still be
> GFP_NOWAIT so functional wise the allocator will still do the right
> thing.
> 
> [...]
> 
> > > That would require changing NOWAIT/ATOMIC allocations semantic quite
> > > drastically for !RT kernels as well. I am not sure this is something we
> > > can do. Or maybe I am just missing your point.
> > 
> > Exactly, and avoiding changing this semantic for current users is
> > precisely why we are proposing some sort of indication to be passed
> > into the allocation request.  In Uladzislau's patch, this was the
> > __GFP_NO_LOCKS flag, but whatever works.
> 
> As I've tried to explain already, I would really hope we can do without
> any new gfp flags. We are running out of them and they tend to generate
> a lot of maintenance burden. There is a lot of abuse etc. We should also
> not expose such an implementation detail of the allocator to callers
> because that would make future changes even harder. The alias, on the
> other hand, already builds on top of the existing NOWAIT semantics and
> just helps the allocator complain about wrong usage while not exposing
> any internals.
> 
I know that Matthew and I raised it. We can indeed handle it without
introducing any new flag, simply by passing a zero gfp mask to the page
allocator:

#define __GFP_NO_LOCKS 0

so it would be handled the same way as in the "mm: Add __GFP_NO_LOCKS flag"
patch. I can re-spin the RFC patch and send it out for better understanding.

Does that work for you, Michal? Or is it better just to drop the patch here?

--
Vlad Rezki
Matthew Wilcox Aug. 13, 2020, 4:22 p.m. UTC | #49
On Thu, Aug 13, 2020 at 06:14:32PM +0200, Thomas Gleixner wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> > On Thu, Aug 13, 2020 at 03:27:15PM +0200, Thomas Gleixner wrote:
> >> And guarding it with RT is not working either because then you are back
> >> to square one with the problem which triggered the discussion in the
> >> first place:
> >> 
> >> raw_spin_lock()
> >>   alloc()
> >>     if (RT && !preemptible())  <- False because RT == false
> >>     	goto bail;
> >> 
> >>     spin_lock(&zone->lock)  --> LOCKDEP complains
> >> 
> >> So either you convince Paul not to do that or you need to do something
> >> like I suggested in my other reply.
> >
> > I'd like to throw in the possibility that we do something like:
> >
> >   raw_spin_lock()
> >     alloc()
> >       if (!spin_trylock(&zone->lock))
> >         if (RT && !preemptible())
> >           goto bail;
> >         spin_lock(&zone->lock);
> >
> > would that make us feel more comfortable about converting zone->lock to
> > a raw spinlock?
> 
> Even if that could cure that particular problem of allocations in deep
> atomic context, making zone->lock raw brings back the problem of
> zone->lock being held/contended for hundreds of microseconds with
> interrupts disabled which is causing RT tasks to miss their deadlines by
> big margins.

Ah, I see.  Yeah, that doesn't work.  Never mind.
Paul E. McKenney Aug. 13, 2020, 4:29 p.m. UTC | #50
On Thu, Aug 13, 2020 at 06:13:57PM +0200, Michal Hocko wrote:
> On Thu 13-08-20 09:04:42, Paul E. McKenney wrote:
> > On Thu, Aug 13, 2020 at 05:54:12PM +0200, Michal Hocko wrote:
> [...]
> > > If the whole bailout is guarded by CONFIG_PREEMPT_RT specific atomicity
> > > check then there is no functional problem - GFP_RT_SAFE would still be
> > > GFP_NOWAIT so functional wise the allocator will still do the right
> > > thing.
> > 
> > Perhaps it was just me getting confused, early hour Pacific Time and
> > whatever other excuses might apply.  But I thought that you still had
> > an objection to GFP_RT_SAFE based on changes in allocator semantics for
> > other users.
> 
> There is still that problem with lockdep complaining about raw->regular
> spinlock on !PREEMPT_RT that would need to get resolved somehow. Thomas
> is not really keen on adding some lockdep annotation mechanism and
> unfortunately I do not have a different idea how to get rid of those.

OK.  So the current situation requires a choice between these
alternatives, each of which has shortcomings that have been mentioned
earlier in this thread:

1.	Prohibit invoking allocators from raw atomic context, such
	as when holding a raw spinlock.

2.	Adding a GFP_ flag.

3.	Reusing existing GFP_ flags/values/whatever to communicate
	the raw-context information that was to be communicated by
	the new GFP_ flag.

4.	Making lockdep forgive acquiring spinlocks while holding
	raw spinlocks, but only in CONFIG_PREEMPT_NONE=y kernels.

Am I missing anything?

						Thanx, Paul
Michal Hocko Aug. 13, 2020, 4:36 p.m. UTC | #51
On Thu 13-08-20 18:20:47, Uladzislau Rezki wrote:
> > On Thu 13-08-20 08:41:59, Paul E. McKenney wrote:
> > > On Thu, Aug 13, 2020 at 04:53:35PM +0200, Michal Hocko wrote:
> > > > On Thu 13-08-20 16:34:57, Thomas Gleixner wrote:
> > > > > Michal Hocko <mhocko@suse.com> writes:
> > > > > > On Thu 13-08-20 15:22:00, Thomas Gleixner wrote:
> > > > > >> It basically requires to convert the wait queue to something else. Is
> > > > > >> the waitqueue strict single waiter?
> > > > > >
> > > > > > I would have to double check. From what I remember only kswapd should
> > > > > > ever sleep on it.
> > > > > 
> > > > > That would make it trivial as we could simply switch it over to rcu_wait.
> > > > > 
> > > > > >> So that should be:
> > > > > >> 
> > > > > >> 	if (!preemptible() && gfp == GFP_RT_NOWAIT)
> > > > > >> 
> > > > > >> which is limiting the damage to those callers which hand in
> > > > > >> GFP_RT_NOWAIT.
> > > > > >> 
> > > > > >> lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits
> > > > > >> zone->lock in the wrong context. And we want to know about that so we
> > > > > >> can look at the caller and figure out how to solve it.
> > > > > >
> > > > > > Yes, that would somehow need to annotate the zone_lock to be ok
> > > > > > in those paths so that lockdep doesn't complain.
> > > > > 
> > > > > That opens the worst of all cans of worms. If we start this here then
> > > > > Joe programmer and his dog will use these lockdep annotation to evade
> > > > > warnings and when exposed to RT it will fall apart in pieces. Just that
> > > > > at that point Joe programmer moved on to something else and the usual
> > > > > suspects can mop up the pieces. We've seen that all over the place and
> > > > > some people even disable lockdep temporarily because annotations don't
> > > > > help.
> > > > 
> > > > Hmm. I am likely missing something really important here. We have two
> > > > problems at hand:
> > > > 1) RT will become broken as soon as this new RCU functionality which
> > > > requires an allocation from inside of raw_spinlock hits the RT tree
> > > > 2) lockdep splats which are telling us that early because of the
> > > > raw_spinlock-> spin_lock dependency.
> > > 
> > > That is a reasonable high-level summary.
> > > 
> > > > 1) can be handled by bailing out whenever we have to use
> > > > zone->lock inside the buddy allocator - essentially even more strict
> > > > NOWAIT semantic than we have for RT tree - proposed (pseudo) patch is
> > > > trying to describe that.
> > > 
> > > Unless I am missing something subtle, the problem with this approach
> > > is that in production-environment CONFIG_PREEMPT_NONE=y kernels, there
> > > is no way at runtime to distinguish between holding a spinlock on the
> > > one hand and holding a raw spinlock on the other.  Therefore, without
> > > some sort of indication from the caller, this approach will not make
> > > CONFIG_PREEMPT_NONE=y users happy.
> > 
> > If the whole bailout is guarded by CONFIG_PREEMPT_RT specific atomicity
> > check then there is no functional problem - GFP_RT_SAFE would still be
> > GFP_NOWAIT so functional wise the allocator will still do the right
> > thing.
> > 
> > [...]
> > 
> > > > That would require changing NOWAIT/ATOMIC allocations semantic quite
> > > > drastically for !RT kernels as well. I am not sure this is something we
> > > > can do. Or maybe I am just missing your point.
> > > 
> > > Exactly, and avoiding changing this semantic for current users is
> > > precisely why we are proposing some sort of indication to be passed
> > > into the allocation request.  In Uladzislau's patch, this was the
> > > __GFP_NO_LOCKS flag, but whatever works.
> > 
> > As I've tried to explain already, I would really hope we can do without
> > any new gfp flags. We are running out of them and they tend to generate
> > a lot of maintenance burden. There is a lot of abuse etc. We should also
> > not expose such an implementation detail of the allocator to callers
> > because that would make future changes even harder. The alias, on the
> > other hand, already builds on top of the existing NOWAIT semantics and
> > just helps the allocator complain about wrong usage while not exposing
> > any internals.
> > 
> I know that Matthew and I raised it. We can indeed handle it without
> introducing any new flag, simply by passing a zero gfp mask to the page
> allocator:
> 
> #define __GFP_NO_LOCKS 0
> 
> so it would be handled the same way as in the "mm: Add __GFP_NO_LOCKS flag"
> patch. I can re-spin the RFC patch and send it out for better understanding.
> 
> Does that work for you, Michal? Or is it better just to drop the patch here?

That would change the semantics for GFP_NOWAIT users who decided to drop
__GFP_KSWAPD_RECLAIM or even use a 0 gfp mask right away, right? The point
I am trying to make is that an alias is good for RT because it currently
has no users (there is no RT atomic user of the allocator).
Thomas Gleixner Aug. 13, 2020, 5:09 p.m. UTC | #52
Michal Hocko <mhocko@suse.com> writes:
> On Thu 13-08-20 16:34:57, Thomas Gleixner wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>> > Yes, that would somehow need to annotate the zone_lock to be ok
>> > in those paths so that lockdep doesn't complain.
>> 
>> That opens the worst of all cans of worms. If we start this here then
>> Joe programmer and his dog will use these lockdep annotation to evade
>> warnings and when exposed to RT it will fall apart in pieces. Just that
>> at that point Joe programmer moved on to something else and the usual
>> suspects can mop up the pieces. We've seen that all over the place and
>> some people even disable lockdep temporarily because annotations don't
>> help.
>
> Hmm. I am likely missing something really important here. We have two
> problems at hand:
> 1) RT will become broken as soon as this new RCU functionality which
> requires an allocation from inside of raw_spinlock hits the RT tree
> 2) lockdep splats which are telling us that early because of the
> raw_spinlock-> spin_lock dependency.

Correct.

> 1) can be handled by bailing out whenever we have to use
> zone->lock inside the buddy allocator - essentially even more strict
> NOWAIT semantic than we have for RT tree - proposed (pseudo) patch is
> trying to describe that.
>
> 2) would become a false positive if 1) is in place, right? RT wouldn't
> do the illegal nesting and !RT would just work fine because
> GFP_RT_NOWAIT would be simply GFP_NOWAIT & ~__GFP_KSWAPD_RECLAIM.

And how do you deal with that false positive and the subsequent false
positives when this code hits the next regular spinlock in some code
path? Disabling lockdep or crippling coverage?

> Why should we limit the functionality of the allocator for something
> that is not a real problem?

We'd limit the allocator for exactly ONE new user which was aware of
this problem _before_ the code hit mainline. And that ONE user is
prepared to handle the fail.

Any other usage of the page allocator just works. The number of raw
spinlocks is very limited and there are very good reasons to make them
raw spinlocks. And none of them does allocations inside, except this
particular new one. Some did years ago, but none of those allocations
was necessary at all; quite the contrary, most of them were just
pointless and in frequent hot paths.

Let me ask the question the other way round:

  Is there a real request by Paul that going deeper into the allocator
  is necessary for his newfangled RCU thing?

I haven't seen one and if the lockless allocation fails then the system
might have other worries than getting a page to this particular RCU
thing which has a perfectly working fallback.
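
For illustration only, the bail-out being discussed can be modelled in
plain user-space C. All names here are made up: `lockless_cache` stands
in for the page allocator's per-CPU lists, and the NULL return models
failing the allocation instead of entering the zone->lock slow path.

```c
#include <stddef.h>

#define CACHE_SLOTS 4

/* A tiny free-list standing in for the page allocator's lockless
 * per-CPU cache (illustrative, not the real pcp implementation). */
struct lockless_cache {
	void *slot[CACHE_SLOTS];
	int nr;			/* number of filled slots */
};

/* Pop a "page" without taking any lock.  NULL means the cache is
 * empty and a real allocator would now need zone->lock; under the
 * proposed semantics we bail instead and let the caller use its
 * fallback (for kfree_rcu(): the rcu_head linked list). */
static void *alloc_no_locks(struct lockless_cache *c)
{
	if (c->nr == 0)
		return NULL;	/* would need zone->lock: fail instead */
	return c->slot[--c->nr];
}
```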

It's not affecting anything else. GFP_ATOMIC/NOWAIT still work the same
way as before from all other contexts and that's 99,9999999999% of all
use cases. Why, because none of them happen under a raw spinlock.

Even if we could make this lockdep thing work that does not mean that
it's a good thing to do.

Quite the contrary, you'd just encourage people to create more of those
use cases for probably the completely wrong reasons. Putting a
limitation into place upfront might make them think further than just
slapping GFP_RT_ATOMIC in and being done with it. Let me dream :)

I've dealt with tons of patches in the last 15+ years where people just
came up with 's/GFP_KERNEL/GFP_ATOMIC/ because tool complained'
patches. The vast majority of them were bogus because the alloc() was
simply at the wrong place.

Forcing people not to take the easy way out by making the infrastructure
restrictive is way better than encouraging mindless hackery. We have
enough of this (not restricted to memory allocations) all over the
kernel already. No need for more.

>> Really, if your primary lockless caches are empty then any allocation
>> which comes from deep atomic context should simply always fail. Being
>> stuck in an interrupt handler or even deeper for 200+ microseconds
>> waiting for zone lock is just bonkers IMO.
>
> That would require changing NOWAIT/ATOMIC allocations semantic quite
> drastically for !RT kernels as well. I am not sure this is something we
> can do. Or maybe I am just missing your point.

I really do not understand why you think that it affects everything. 

It's exactly ONE particular use case which is affected, i.e. Pauls new
RCU thing if he uses GFP_RT_NOWAIT.

Everything else is not affected at all and NOWAIT/ATOMIC just works as
it used to work because NOWAIT != 0 and ATOMIC != 0.

Thanks,

        tglx
Michal Hocko Aug. 13, 2020, 5:12 p.m. UTC | #53
On Thu 13-08-20 09:29:04, Paul E. McKenney wrote:
> On Thu, Aug 13, 2020 at 06:13:57PM +0200, Michal Hocko wrote:
> > On Thu 13-08-20 09:04:42, Paul E. McKenney wrote:
> > > On Thu, Aug 13, 2020 at 05:54:12PM +0200, Michal Hocko wrote:
> > [...]
> > > > If the whole bailout is guarded by CONFIG_PREEMPT_RT specific atomicity
> > > > check then there is no functional problem - GFP_RT_SAFE would still be
> > > > GFP_NOWAIT so functional wise the allocator will still do the right
> > > > thing.
> > > 
> > > Perhaps it was just me getting confused, early hour Pacific Time and
> > > whatever other excuses might apply.  But I thought that you still had
> > > an objection to GFP_RT_SAFE based on changes in allocator semantics for
> > > other users.
> > 
> > There is still that problem with lockdep complaining about raw->regular
> > spinlock on !PREEMPT_RT that would need to get resolved somehow. Thomas
> > is not really keen on adding some lockdep annotation mechanism and
> > unfortunately I do not have a different idea how to get rid of those.
> 
> OK.  So the current situation requires a choice between these
> alternatives, each of which has shortcomings that have been mentioned
> earlier in this thread:
> 
> 1.	Prohibit invoking allocators from raw atomic context, such
> 	as when holding a raw spinlock.
> 
> 2.	Adding a GFP_ flag.

Which would implement a completely new level of atomic allocation for all
preemption models

> 
> 3.	Reusing existing GFP_ flags/values/whatever to communicate
> 	the raw-context information that was to be communicated by
> 	the new GFP_ flag.

this would have to be RT specific to not change the semantic for
existing users. In other words make NOWAIT semantic working for
RT atomic contexts.

> 
> 4.	Making lockdep forgive acquiring spinlocks while holding
> 	raw spinlocks, but only in CONFIG_PREEMPT_NONE=y kernels.

and this would have to go along with 3 to remove false positives on !RT.
Michal Hocko Aug. 13, 2020, 5:22 p.m. UTC | #54
On Thu 13-08-20 19:09:29, Thomas Gleixner wrote:
> Michal Hocko <mhocko@suse.com> writes:
> > On Thu 13-08-20 16:34:57, Thomas Gleixner wrote:
[...]

I will go though the rest of the email tomorrow.

> >> Really, if your primary lockless caches are empty then any allocation
> >> which comes from deep atomic context should simply always fail. Being
> >> stuck in an interrupt handler or even deeper for 200+ microseconds
> >> waiting for zone lock is just bonkers IMO.
> >
> > That would require changing NOWAIT/ATOMIC allocations semantic quite
> > drastically for !RT kernels as well. I am not sure this is something we
> > can do. Or maybe I am just missing your point.
> 
> I really do not understand why you think that it affects everything. 

We are likely not on the same page here. Are you talking about the
original proposal in this thread (aka a new flag) or a way to reuse
existing gfp space (http://lkml.kernel.org/r/20200813075027.GD9477@dhcp22.suse.cz)
with a modification that would both silence the lockdep and keep the
existing NOWAIT semantic?

Sorry, bear with me, but I am getting lost in this thread.
Paul E. McKenney Aug. 13, 2020, 5:27 p.m. UTC | #55
On Thu, Aug 13, 2020 at 07:12:11PM +0200, Michal Hocko wrote:
> On Thu 13-08-20 09:29:04, Paul E. McKenney wrote:
> > On Thu, Aug 13, 2020 at 06:13:57PM +0200, Michal Hocko wrote:
> > > On Thu 13-08-20 09:04:42, Paul E. McKenney wrote:
> > > > On Thu, Aug 13, 2020 at 05:54:12PM +0200, Michal Hocko wrote:
> > > [...]
> > > > > If the whole bailout is guarded by CONFIG_PREEMPT_RT specific atomicity
> > > > > check then there is no functional problem - GFP_RT_SAFE would still be
> > > > > GFP_NOWAIT so functional wise the allocator will still do the right
> > > > > thing.
> > > > 
> > > > Perhaps it was just me getting confused, early hour Pacific Time and
> > > > whatever other excuses might apply.  But I thought that you still had
> > > > an objection to GFP_RT_SAFE based on changes in allocator semantics for
> > > > other users.
> > > 
> > > There is still that problem with lockdep complaining about raw->regular
> > > spinlock on !PREEMPT_RT that would need to get resolved somehow. Thomas
> > > is not really keen on adding some lockdep annotation mechanism and
> > > unfortunately I do not have a different idea how to get rid of those.
> > 
> > OK.  So the current situation requires a choice between these
> > alternatives, each of which has shortcomings that have been mentioned
> > earlier in this thread:
> > 
> > 1.	Prohibit invoking allocators from raw atomic context, such
> > 	as when holding a raw spinlock.
> > 
> > 2.	Adding a GFP_ flag.
> 
> Which would implement a completely new level of atomic allocation for all
> preemption models
> 
> > 
> > 3.	Reusing existing GFP_ flags/values/whatever to communicate
> > 	the raw-context information that was to be communicated by
> > 	the new GFP_ flag.
> 
> this would have to be RT specific to not change the semantic for
> existing users. In other words make NOWAIT semantic working for
> RT atomic contexts.
> 
> > 
> > 4.	Making lockdep forgive acquiring spinlocks while holding
> > 	raw spinlocks, but only in CONFIG_PREEMPT_NONE=y kernels.
> 
> and this would have to go along with 3 to remove false positives on !RT.

OK, let's fill in the issues, then:

1.	Prohibit invoking allocators from raw atomic context, such
	as when holding a raw spinlock.

	o	This would prevent an important cache-locality
		optimization.

2.	Adding a GFP_ flag.

	o	Requires a new level of atomic allocation for all
		preemption models, namely one confined to the allocator's
		lockless caches.

3.	Reusing existing GFP_ flags/values/whatever to communicate
	the raw-context information that was to be communicated by
	the new GFP_ flag.

	o	There are existing users of all combinations that might
		be unhappy with a change of semantics.

	o	But Michal is OK with this if usage is restricted to RT.
		Except that this requires #4 below:

4.	Making lockdep forgive acquiring spinlocks while holding
	raw spinlocks, but only in CONFIG_PREEMPT_NONE=y kernels.

	o	This would allow latency degradation and other bad coding
		practices to creep in, per Thomas's recent email.
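
For illustration only, option 3 combined with the preemptible() check
from Thomas's earlier pseudocode might look roughly like the following
user-space model. The flag values, GFP_RT_SAFE, and the function name
are all hypothetical and deliberately simplified from the kernel's real
definitions.

```c
#include <stdbool.h>

/* Simplified flag values; the real kernel definitions differ. */
#define __GFP_KSWAPD_RECLAIM	0x01u
#define GFP_NOWAIT		__GFP_KSWAPD_RECLAIM
#define GFP_RT_SAFE		(GFP_NOWAIT & ~__GFP_KSWAPD_RECLAIM)	/* == 0 */

/* Decide whether the slow path (zone->lock) may be entered.  On !RT
 * nothing changes; on RT, a GFP_RT_SAFE request from non-preemptible
 * context bails before the lock, per the quoted pseudocode. */
static bool may_take_zone_lock(unsigned int gfp, bool rt, bool preemptible)
{
	if (rt && !preemptible && gfp == GFP_RT_SAFE)
		return false;	/* lockless caches only */
	return true;
}
```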

Again, am I missing anything?

							Thanx, Paul
Peter Zijlstra Aug. 13, 2020, 6:26 p.m. UTC | #56
On Thu, Aug 13, 2020 at 04:34:57PM +0200, Thomas Gleixner wrote:
> Michal Hocko <mhocko@suse.com> writes:
> > On Thu 13-08-20 15:22:00, Thomas Gleixner wrote:
> >> It basically requires to convert the wait queue to something else. Is
> >> the waitqueue strict single waiter?
> >
> > I would have to double check. From what I remember only kswapd should
> > ever sleep on it.
> 
> That would make it trivial as we could simply switch it over to rcu_wait.
> 
> >> So that should be:
> >> 
> >> 	if (!preemptible() && gfp == GFP_RT_NOWAIT)
> >> 
> >> which is limiting the damage to those callers which hand in
> >> GFP_RT_NOWAIT.
> >> 
> >> lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits
> >> zone->lock in the wrong context. And we want to know about that so we
> >> can look at the caller and figure out how to solve it.
> >
> > Yes, that would somehow need to annotate the zone_lock to be ok
> > in those paths so that lockdep doesn't complain.
> 
> That opens the worst of all cans of worms. If we start this here then
> Joe programmer and his dog will use these lockdep annotation to evade
> warnings and when exposed to RT it will fall apart in pieces. Just that
> at that point Joe programmer moved on to something else and the usual
> suspects can mop up the pieces. We've seen that all over the place and
> some people even disable lockdep temporarily because annotations don't
> help.
> 
> PeterZ might have opinions about that too I suspect.

PeterZ is mightily confused by all of this -- also heat induced brain
melt.

I thought the rule was:

 - No allocators (alloc/free) inside raw_spinlock_t, full-stop.

Why are we trying to craft an exception?
Peter Zijlstra Aug. 13, 2020, 6:31 p.m. UTC | #57
On Thu, Aug 13, 2020 at 09:29:04AM -0700, Paul E. McKenney wrote:
> OK.  So the current situation requires a choice between these
> alternatives, each of which has shortcomings that have been mentioned
> earlier in this thread:
> 
> 1.	Prohibit invoking allocators from raw atomic context, such
> 	as when holding a raw spinlock.

This! This has always been the case, why are we even considering change
here?

> 2.	Adding a GFP_ flag.

The patch 1/2 in this thread is horrendous crap.

> 3.	Reusing existing GFP_ flags/values/whatever to communicate
> 	the raw-context information that was to be communicated by
> 	the new GFP_ flag.
> 
> 4.	Making lockdep forgive acquiring spinlocks while holding
> 	raw spinlocks, but only in CONFIG_PREEMPT_NONE=y kernels.
> 
> Am I missing anything?

How would 4 solve anything?


In other words, what is the actual friggin problem? I've not seen one
described anywhere.
Paul E. McKenney Aug. 13, 2020, 6:52 p.m. UTC | #58
On Thu, Aug 13, 2020 at 08:26:18PM +0200, peterz@infradead.org wrote:
> On Thu, Aug 13, 2020 at 04:34:57PM +0200, Thomas Gleixner wrote:
> > Michal Hocko <mhocko@suse.com> writes:
> > > On Thu 13-08-20 15:22:00, Thomas Gleixner wrote:
> > >> It basically requires to convert the wait queue to something else. Is
> > >> the waitqueue strict single waiter?
> > >
> > > I would have to double check. From what I remember only kswapd should
> > > ever sleep on it.
> > 
> > That would make it trivial as we could simply switch it over to rcu_wait.
> > 
> > >> So that should be:
> > >> 
> > >> 	if (!preemptible() && gfp == GFP_RT_NOWAIT)
> > >> 
> > >> which is limiting the damage to those callers which hand in
> > >> GFP_RT_NOWAIT.
> > >> 
> > >> lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits
> > >> zone->lock in the wrong context. And we want to know about that so we
> > >> can look at the caller and figure out how to solve it.
> > >
> > > Yes, that would somehow need to annotate the zone_lock to be ok
> > > in those paths so that lockdep doesn't complain.
> > 
> > That opens the worst of all cans of worms. If we start this here then
> > Joe programmer and his dog will use these lockdep annotation to evade
> > warnings and when exposed to RT it will fall apart in pieces. Just that
> > at that point Joe programmer moved on to something else and the usual
> > suspects can mop up the pieces. We've seen that all over the place and
> > some people even disable lockdep temporarily because annotations don't
> > help.
> > 
> > PeterZ might have opinions about that too I suspect.
> 
> PeterZ is mightily confused by all of this -- also heat induced brain
> melt.
> 
> I thought the rule was:
> 
>  - No allocators (alloc/free) inside raw_spinlock_t, full-stop.
> 
> Why are we trying to craft an exception?

So that we can reduce post-grace-period cache misses by a factor of
eight when invoking RCU callbacks.  This reduction in cache misses also
makes it more difficult to overrun RCU with floods of either call_rcu()
or kfree_rcu() invocations.

The idea is to allocate page-sized arrays of pointers so that the callback
invocation can sequence through the array instead of walking a linked
list, hence the reduction in cache misses.

If the allocation fails, for example, during OOM events, we fall back to
the linked-list approach.  So, as with much of the rest of the kernel,
under OOM we just run a bit slower.
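
As a rough user-space model of that scheme (illustrative names only;
the kernel's actual structure is kvfree_rcu_bulk_data, the real block
is page-sized, and the bulk free uses kfree_bulk()):

```c
#include <stdlib.h>
#include <stddef.h>

#define BLOCK_PTRS 510	/* roughly PAGE_SIZE / sizeof(void *) minus header */

struct bulk_block {
	struct bulk_block *next;	/* previous head, so blocks chain up */
	size_t nr;			/* pointers currently stored */
	void *ptrs[BLOCK_PTRS];
};

/* Queue one pointer; allocate a fresh block when the current one is
 * full.  Returns 0 on success, -1 when the block allocation fails and
 * the caller must use its fallback (per-object rcu_head list). */
static int bulk_queue(struct bulk_block **head, void *p)
{
	struct bulk_block *b = *head;

	if (!b || b->nr == BLOCK_PTRS) {
		b = malloc(sizeof(*b));	/* models __get_free_page(GFP_NOWAIT) */
		if (!b)
			return -1;
		b->next = *head;
		b->nr = 0;
		*head = b;
	}
	b->ptrs[b->nr++] = p;
	return 0;
}

/* Drain all blocks, freeing the queued objects by walking dense
 * arrays rather than chasing a linked list of rcu_head nodes. */
static size_t bulk_drain(struct bulk_block **head)
{
	size_t freed = 0;

	while (*head) {
		struct bulk_block *b = *head;

		for (size_t i = 0; i < b->nr; i++, freed++)
			free(b->ptrs[i]);	/* kernel would use kfree_bulk() */
		*head = b->next;
		free(b);			/* return the "page" */
	}
	return freed;
}
```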

							Thanx, Paul
Michal Hocko Aug. 13, 2020, 7:13 p.m. UTC | #59
On Thu 13-08-20 20:31:21, Peter Zijlstra wrote:
> On Thu, Aug 13, 2020 at 09:29:04AM -0700, Paul E. McKenney wrote:
[...]
> > 3.	Reusing existing GFP_ flags/values/whatever to communicate
> > 	the raw-context information that was to be communicated by
> > 	the new GFP_ flag.
> > 
> > 4.	Making lockdep forgive acquiring spinlocks while holding
> > 	raw spinlocks, but only in CONFIG_PREEMPT_NONE=y kernels.
> > 
> > Am I missing anything?
> 
> How would 4 solve anything?

Nothing on its own but along with http://lkml.kernel.org/r/20200813075027.GD9477@dhcp22.suse.cz
it would allow at least _some_ NOWAIT semantics for RT atomic contexts
and prevent lockdep false positives for !RT trees.
Peter Zijlstra Aug. 13, 2020, 10:06 p.m. UTC | #60
On Thu, Aug 13, 2020 at 11:52:57AM -0700, Paul E. McKenney wrote:
> On Thu, Aug 13, 2020 at 08:26:18PM +0200, peterz@infradead.org wrote:

> > I thought the rule was:
> > 
> >  - No allocators (alloc/free) inside raw_spinlock_t, full-stop.
> > 
> > Why are we trying to craft an exception?
> 
> So that we can reduce post-grace-period cache misses by a factor of
> eight when invoking RCU callbacks.  This reduction in cache misses also
> makes it more difficult to overrun RCU with floods of either call_rcu()
> or kfree_rcu() invocations.
> 
> The idea is to allocate page-sized arrays of pointers so that the callback
> invocation can sequence through the array instead of walking a linked
> list, hence the reduction in cache misses.

I'm still not getting it, how do we end up trying to allocate memory
from under raw spinlocks if you're not allowed to use kfree_rcu() under
one ?

Can someone please spell out the actual problem?
Paul E. McKenney Aug. 13, 2020, 11:23 p.m. UTC | #61
On Fri, Aug 14, 2020 at 12:06:19AM +0200, peterz@infradead.org wrote:
> On Thu, Aug 13, 2020 at 11:52:57AM -0700, Paul E. McKenney wrote:
> > On Thu, Aug 13, 2020 at 08:26:18PM +0200, peterz@infradead.org wrote:
> 
> > > I thought the rule was:
> > > 
> > >  - No allocators (alloc/free) inside raw_spinlock_t, full-stop.
> > > 
> > > Why are we trying to craft an exception?
> > 
> > So that we can reduce post-grace-period cache misses by a factor of
> > eight when invoking RCU callbacks.  This reduction in cache misses also
> > makes it more difficult to overrun RCU with floods of either call_rcu()
> > or kfree_rcu() invocations.
> > 
> > The idea is to allocate page-sized arrays of pointers so that the callback
> > invocation can sequence through the array instead of walking a linked
> > list, hence the reduction in cache misses.
> 
> I'm still not getting it, how do we end up trying to allocate memory
> from under raw spinlocks if you're not allowed to use kfree_rcu() under
> one ?

You are indeed not allowed to use kfree() under a raw spinlock, given
that it can acquire a non-raw spinlock.

But kfree_rcu() was just a wrapper around call_rcu(), which can be and
is called from raw atomic context.

> Can someone please spell out the actual problem?

And as noted above, reducing the kfree()-time cache misses would be
a good thing.

						Thanx, Paul
Thomas Gleixner Aug. 13, 2020, 11:59 p.m. UTC | #62
On Fri, Aug 14 2020 at 00:06, peterz wrote:
> I'm still not getting it, how do we end up trying to allocate memory
> from under raw spinlocks if you're not allowed to use kfree_rcu() under
> one ?
>
> Can someone please spell out the actual problem?

Your actual problem is the heat wave. Get an icepack already :)

To set things straight:

  1) kfree_rcu(), which used to be just a convenience wrapper around
     call_rcu() always worked in any context except NMI.

     Otherwise RT would have never worked or would have needed other
     horrible hacks to defer kfree in certain contexts including
     context switch.

  2) RCU grew an optimization for kfree_rcu() which avoids the linked
     list cache misses when a large number of objects is freed via
     kfree_rcu(). Paul says it speeds up post grace period processing by
     a factor of 8 which is impressive.

     That's done by avoiding the linked list and storing the object
     pointer in an array of pointers which is page sized. This array is
     then freed in bulk without having to touch the rcu head linked list
     which obviously avoids cache misses.

     Occasionally kfree_rcu() needs to allocate a page for this but it
     can fall back to the linked list when the allocation fails.
     
     Inconveniently, this allocation happens with an RCU-internal raw
     lock held, but even if we could do the allocation outside the RCU
     internal lock this would not solve the problem that kfree_rcu() can
     be called in any context except NMI even on RT.

     So RT forbids allocations from truly atomic contexts even with
     GFP_ATOMIC/NOWAIT. GFP_ATOMIC is kinda meaningless on RT because
     everything which calls it is in non-atomic context :) But still
     GFP_ATOMIC or GFP_NOWAIT retain the semantics of !RT and do not go
     down into deeper levels or wait for pages to become available.

  3) #2 upsets lockdep (with the raw lock nesting enabled) rightfully:
     when RCU tries to allocate a page, the lockless page cache is empty
     and it acquires zone->lock, which is a regular spinlock.

  4) As kfree_rcu() can be used from any context except NMI and RT
     relies on it, we ran into a circular dependency problem.

     If we disable that feature for RT then the problem would be solved
     except that lockdep still would complain about taking zone->lock
     within a forbidden context on !RT kernels.

     Preventing that feature because of that is not a feasible option
     either. Aside of that we discuss this post factum, IOW the stuff is
     upstream already despite the problem being known before.

  5) Ulad posted patches which introduce a new GFP allocation mode
     which makes the allocation fail if the lockless cache is empty,
     i.e. it does not try to touch zone->lock in that case.

     That works perfectly fine on RT and !RT, makes lockdep happy and
     Paul is happy as well.

     If the lockless cache, which works perfectly fine on RT, is empty
     then the performance of kfree_rcu() post grace period processing is
     probably not making the situation of the system worse.

     Michal is not fond of the new GFP flag and wants to explore options
     to avoid that completely. But so far there is no real feasible
     solution.

     A) It was suggested to make zone->lock raw, but that's a total
        disaster latency wise. With just a kernel compile (!RT kernel)
        spinwait times are in the hundreds of microseconds.

        RT tests showed max latency of cyclictest go up from 30 to 220
        microseconds, i.e. factor 7 just because of that.

        So not really great.

        It would have had the charm to keep the semantics of GFP_NOWAIT,
        but OTOH even if it would work I'm pretty opposed to opening the
        can of worms which allows allocations from the deepest atomic
        contexts in general without a really good reason.

     B) Michal suggested to have GFP_RT_ATOMIC defined to 0 which would
        not require a new GFP bit and bail out when RT is enabled and
        the context is atomic.

        That of course does not solve the problem vs. lockdep.

        Also the idea to make this conditional on !preemptible() does
        not work because preemptible() always returns false for
        CONFIG_PREEMPT_COUNT=n kernels.

     C) I suggested to make GFP == 0 fail unconditionally when the
        lockless cache is empty, but that changes the semantics on !RT
        kernels for existing users which hand in 0.
        
     D) To solve the lockdep issue of #B Michal suggested to have magic
        lockdep annotations which allow !RT kernels to take zone->lock
        from the otherwise forbidden contexts because on !RT this lock
        nesting could be considered a false positive.

        But this would be horrors of some sorts because there might be
        locks further down which then need the same treatment or some
        general 'yay, I'm exempt from everything' annotation which is
        short of disabling lockdep completely.

        Even if that could be solved and made correct for both RT and
        !RT then this opens the can of worms that this kind of
        annotation would show up all over the place within no time for
        the completely wrong reasons.

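For illustration, the array scheme from #2 above can be modeled in plain C. This is a userspace sketch, not the kernel code: the names (bulk_data, bulk_queue, bulk_drain) are invented here, malloc() stands in for __get_free_page(GFP_NOWAIT | __GFP_NOWARN), and the real implementation lives in kernel/rcu/tree.c:

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

/* One page-sized block: a small header plus an array of object pointers. */
struct bulk_data {
	struct bulk_data *next;
	unsigned long nr_records;
	void *records[(PAGE_SIZE - sizeof(struct bulk_data *)
		       - sizeof(unsigned long)) / sizeof(void *)];
};

#define BULK_CAP (sizeof(((struct bulk_data *)0)->records) / sizeof(void *))

/* Queue a pointer; grab a new "page" when the head block is full.
 * Returns 0 on success, -1 when the allocation fails, in which case
 * the caller falls back to the rcu_head linked list. */
static int bulk_queue(struct bulk_data **head, void *ptr)
{
	struct bulk_data *b = *head;

	if (!b || b->nr_records == BULK_CAP) {
		/* Stands in for __get_free_page(GFP_NOWAIT | __GFP_NOWARN). */
		b = malloc(PAGE_SIZE);
		if (!b)
			return -1;
		b->next = *head;	/* new block points at previous head */
		b->nr_records = 0;
		*head = b;
	}
	b->records[b->nr_records++] = ptr;
	return 0;
}

/* Drain after a grace period: free whole blocks in bulk, as
 * kfree_bulk() would, without walking per-object rcu_head links. */
static unsigned long bulk_drain(struct bulk_data **head)
{
	unsigned long freed = 0;

	while (*head) {
		struct bulk_data *b = *head;

		*head = b->next;
		freed += b->nr_records; /* kfree_bulk(b->nr_records, b->records); */
		free(b);
	}
	return freed;
}
```

When bulk_queue() returns -1 the caller would chain the object via its embedded rcu_head instead, which is exactly the fallback discussed in this thread.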
Paul compiled this options list:

> 1.	Prohibit invoking allocators from raw atomic context, such
>	as when holding a raw spinlock.

  Clearly the simplest solution but not Paul's favourite and
  unfortunately he has a good reason.

> 2.	Adding a GFP_ flag.

  The simplest and most consistent solution. If you really need to do
  allocations from such contexts then deal with the limitations whether
  it's RT or not. Paul has no problem with that and this newfangled
  kfree_rcu() is the only user and can live with that restriction.

  Michal does not like the restriction for !RT kernels and tries to
  avoid the introduction of a new allocation mode.

  My argument here is that consistency is the best solution and the
  extra GFP mode is explicitly restrictive due to the context which it
  is called from. Aside of that this affects exactly ONE use case which
  has a really good reason and does not care about the restriction even
  on !RT because in that situation kfree_rcu() performance is not the
  most urgent problem.

> 3.	Reusing existing GFP_ flags/values/whatever to communicate
>	the raw-context information that was to be communicated by
>	the new GFP_ flag.
>
> 4.	Making lockdep forgive acquiring spinlocks while holding
>	raw spinlocks, but only in CONFIG_PREEMPT_NONE=y kernels.

 These are not separate from each other as #3 requires #4. The most
 horrible solution IMO from a technical POV as it proliferates
 inconsistency for no good reason.

 Aside of that it'd be solving a problem which does not exist simply
 because kfree_rcu() does not depend on it and we really don't want to
 set precedence and encourage the (ab)use of this in any way.

Having a distinct GFP mode is technically correct, consistent on all
kernel variants and does not affect any existing user of the page
allocator aside of the new kfree_rcu().

I hope that the cloudy and rainy day cured most of my heat wave induced
brain damage to the extent that the above is correctly summarizing the
state of affairs. If not, then please yell and I get an icepack.

Thanks,

        tglx
Michal Hocko Aug. 14, 2020, 7:17 a.m. UTC | #63
On Thu 13-08-20 19:09:29, Thomas Gleixner wrote:
> Michal Hocko <mhocko@suse.com> writes:
[...]
> > Why should we limit the functionality of the allocator for something
> > that is not a real problem?
> 
> We'd limit the allocator for exactly ONE new user which was aware of
> this problem _before_ the code hit mainline. And that ONE user is
> prepared to handle the fail.

If we are to limit the functionality to this one particular user then
I would consider a dedicated gfp flag a huge overkill. It would be much
easier to have a preallocated pool of pages and use those and
completely avoid the core allocator. That would certainly only shift the
complexity to the caller but if it is expected there would be only that
single user then it would be probably better than opening a can of worms
like allocator usable from raw spin locks.

Paul would something like that be feasible?
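A minimal sketch of what such a preallocated pool might look like, as a userspace model under stated assumptions: pool_fill/pool_get/pool_put are hypothetical names, malloc() stands in for alloc_page(), and the kernel version would refill from process context and guard the counter with a raw_spinlock_t:

```c
#include <assert.h>
#include <stdlib.h>

#define POOL_PAGES 32
#define PAGE_SIZE 4096

/* Fixed pool filled up front from a sane (sleepable) context.
 * pool_get() never calls into the page allocator, so it is safe
 * under a raw lock; when it returns NULL the caller falls back. */
static void *pool[POOL_PAGES];
static int pool_nr;

static int pool_fill(void)
{
	while (pool_nr < POOL_PAGES) {
		void *p = malloc(PAGE_SIZE); /* stands in for alloc_page() */

		if (!p)
			return -1;
		pool[pool_nr++] = p;
	}
	return 0;
}

static void *pool_get(void)
{
	return pool_nr ? pool[--pool_nr] : NULL;
}

static void pool_put(void *p)
{
	if (pool_nr < POOL_PAGES)
		pool[pool_nr++] = p;
	else
		free(p);	/* pool already full: release the page */
}
```

The cost, as noted above, is that the pool memory is reserved whether or not kfree_rcu() is under load.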

Really, we have been bitten by single-usecase gfp flags in the past.

[...]
> Even if we could make this lockdep thing work that does not mean that
> it's a good thing to do.
> 
> Quite the contrary, you'd just encourage people to create more of those
> use cases for probably the completely wrong reasons. Putting a
> limitation into place upfront might makes them think farther than just
> slapping GFP_RT_ATOMIC in and be done with it. Let me dream :)

Good one ;) But seriously, I was suggesting a lockdep workaround because
I reckon it is less prone to abuse than gfp flags. Lockdep is that
scary thing people do not want to touch with a long pole, but gfp flags
are something you have to deal with when calling the allocator, and
people tend to be creative. We used to be bad at documentation so I am
not surprised, but things have improved, so maybe the usage is going to
improve as well. Anyway, __GFP_NO_LOCK would be a free ticket to "I want
to optimize even further" land. Maybe a different name would help, but
I am skeptical.

> I've dealt with tons of patches in the last 15+ years where people just
> came up with 's/GFP_KERNEL/GFP_ATOMIC/ because tool complained'
> patches. The vast majority of them were bogus because the alloc() was
> simply at the wrong place.

Completely agreed.
 
> Forcing people not to take the easy way out by making the infrastructure
> restrictive is way better than encouraging mindless hackery. We have
> enough of this (not restricted to memory allocations) all over the
> kernel already. No need for more.

I do agree with you. I just slightly disagree about where the danger is.
Explicit lockdep usage outside of the core is much less widespread than
allocator usage, so abuse is less likely.
Peter Zijlstra Aug. 14, 2020, 8:30 a.m. UTC | #64
On Fri, Aug 14, 2020 at 01:59:04AM +0200, Thomas Gleixner wrote:
> On Fri, Aug 14 2020 at 00:06, peterz wrote:
> > I'm still not getting it, how do we end up trying to allocate memory
> > from under raw spinlocks if you're not allowed to use kfree_rcu() under
> > one ?
> >
> > Can someone please spell out the actual problem?
> 
> Your actual problem is the heat wave. Get an icepack already :)

Sure, but also much of the below wasn't stated anywhere in the thread I
got Cc'ed on. All I got was half arsed solutions without an actual
problem statement.

> To set things straight:
> 
>   1) kfree_rcu() which used to be just a convenience wrapper around
>      call_rcu() always worked in any context except NMI.
> 
>      Otherwise RT would have never worked or would have needed other
>      horrible hacks to defer kfree in certain contexts including
>      context switch.

I've never used kfree_rcu(), htf would I know.

>   2) RCU grew an optimization for kfree_rcu() which avoids the linked
>      list cache misses when a large number of objects is freed via
>      kfree_rcu(). Paul says it speeds up post grace period processing by
>      a factor of 8 which is impressive.
> 
>      That's done by avoiding the linked list and storing the object
>      pointer in an array of pointers which is page sized. This array is
>      then freed in bulk without having to touch the rcu head linked list
>      which obviously avoids cache misses.
> 
>      Occasionally kfree_rcu() needs to allocate a page for this but it
>      can fallback to the linked list when the allocation fails.
>      
>      Inconveniently this allocation happens with an RCU internal raw
>      lock held, but even if we could do the allocation outside the RCU
>      internal lock this would not solve the problem that kfree_rcu() can
>      be called in any context except NMI even on RT.
> 
>      So RT forbids allocations from truly atomic contexts even with
>      GFP_ATOMIC/NOWAIT. GFP_ATOMIC is kinda meaningless on RT because
>      everything which calls it is in non-atomic context :) But still
>      GFP_ATOMIC or GFP_NOWAIT retain the semantics of !RT and do not go
>      down into deeper levels or wait for pages to become available.

Right, so on RT just cut out the allocation and unconditionally make it
NULL, since the caller needs to deal with the GFP_NOWAIT | __GFP_NOWARN
allocation failing anyway.

>   3) #2 upsets lockdep (with the raw lock nesting enabled) rightfully
>      when RCU tries to allocate a page, the lockless page cache is empty
>      and it acquires zone->lock which is a regular spinlock

There was no lockdep splat in the thread.

>   4) As kfree_rcu() can be used from any context except NMI and RT
>      relies on it we ran into a circular dependency problem.

Well, which actual usage sites are under a raw spinlock? Most of the
ones I could find are not.

>      If we disable that feature for RT then the problem would be solved
>      except that lockdep still would complain about taking zone->lock
>      within a forbidden context on !RT kernels.
> 
>      Preventing that feature because of that is not a feasible option
>      either. Aside of that we discuss this postfactum, IOW the stuff is
>      upstream already despite the problem being known before.

Well, that's a fail :-( I thought we were supposed to solve known issues
before shit got merged.

>   5) Ulad posted patches which introduce a new GFP allocation mode
>      which makes the allocation fail if the lockless cache is empty,
>      i.e. it does not try to touch zone->lock in that case.
> 
>      That works perfectly fine on RT and !RT, makes lockdep happy and
>      Paul is happy as well.
> 
>      If the lockless cache, which works perfectly fine on RT, is empty
>      then the performance of kfree_rcu() post grace period processing is
>      probably not making the situation of the system worse.
> 
>      Michal is not fond of the new GFP flag and wants to explore options
>      to avoid that completely. But so far there is no real feasible
>      solution.

Not only Michal, those patches looked pretty horrid.

>      A) It was suggested to make zone->lock raw, but that's a total
>         disaster latency wise. With just a kernel compile (!RT kernel)
>         spinwait times are in the hundreds of microseconds.

Yeah, I know, been there done that, like over a decade ago :-)

>      B) Michal suggested to have GFP_RT_ATOMIC defined to 0 which would
>         not require a new GFP bit and bail out when RT is enabled and
>         the context is atomic.

I would've written the code like:

	void *mem = NULL;

	...
#ifndef CONFIG_PREEMPT_RT
	mem = kmalloc(PAGE_SIZE, GFP_NOWAIT | __GFP_NOWARN);
#endif
	if (!mem)
	  ....

Seems much simpler and doesn't pollute the GFP_ space.

>      D) To solve the lockdep issue of #B Michal suggested to have magic
>         lockdep annotations which allow !RT kernels to take zone->lock
>         from the otherwise forbidden contexts because on !RT this lock
>         nesting could be considered a false positive.
> 
>         But this would be horrors of some sorts because there might be
>         locks further down which then need the same treatment or some
>         general 'yay, I'm exempt from everything' annotation which is
>         short of disabling lockdep completely.
> 
>         Even if that could be solved and made correct for both RT and
>         !RT then this opens the can of worms that this kind of
>         annotation would show up all over the place within no time for
>         the completely wrong reasons.

So due to this heat crap I've not slept more than a few hours a night
for the last week, I'm not entirely there, but we already have a bunch
of lockdep magic for this raw nesting stuff.

But yes, we need to avoid abuse, grep for lockdep_off() :-( This
drivers/md/dm-cache-target.c thing is just plain broken. It sorta
'works' on mainline (and even there it can behave badly), but on RT it
will come apart with a bang.

> Paul compiled this options list:
> 
> > 1.	Prohibit invoking allocators from raw atomic context, such
> >	as when holding a raw spinlock.
> 
>   Clearly the simplest solution but not Paul's favourite and
>   unfortunately he has a good reason.

Which isn't actually stated anywhere I suppose ?

> > 2.	Adding a GFP_ flag.
> 
>   Michal does not like the restriction for !RT kernels and tries to
>   avoid the introduction of a new allocation mode.

Like above, I tend to be with Michal on this, just wrap the actual
allocation in CONFIG_PREEMPT_RT, the code needs to handle a NULL pointer
there anyway.

> > 3.	Reusing existing GFP_ flags/values/whatever to communicate
> >	the raw-context information that was to be communicated by
> >	the new GFP_ flag.
> >
> > 4.	Making lockdep forgive acquiring spinlocks while holding
> >	raw spinlocks, but only in CONFIG_PREEMPT_NONE=y kernels.

Uhh, !CONFIG_PREEMPT_RT, the rest is 'fine'.

>  These are not separate from each other as #3 requires #4. The most
>  horrible solution IMO from a technical POV as it proliferates
>  inconsistency for no good reason.
> 
>  Aside of that it'd be solving a problem which does not exist simply
>  because kfree_rcu() does not depend on it and we really don't want to
>  set precedence and encourage the (ab)use of this in any way.

My preferred solution is 1, if you want to use an allocator, you simply
don't get to play under raw_spinlock_t. And from a quick grep, most
kfree_rcu() users are not under raw_spinlock_t context.

This here sounds like someone who wants to have his cake and eat it too.

I'll try and think about a lockdep annotation that isn't utter crap, but
that's probably next week, I need this heat to end and get a few nights
sleep, I'm an utter wreck atm.
Peter Zijlstra Aug. 14, 2020, 10:23 a.m. UTC | #65
On Fri, Aug 14, 2020 at 10:30:37AM +0200, Peter Zijlstra wrote:
> > > 1.	Prohibit invoking allocators from raw atomic context, such
> > >	as when holding a raw spinlock.
> > 
> >   Clearly the simplest solution but not Paul's favourite and
> >   unfortunately he has a good reason.
> 
> Which isn't actually stated anywhere I suppose ?

Introduce raw_kfree_rcu() that doesn't do the allocation, and fix the
few wonky callsites.
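A toy model of that split (raw_kfree_rcu() is only a proposal here, it does not exist; the enum and helper names below are invented to show the intended behavior): the raw variant never attempts the page allocation and takes the rcu_head fallback whenever no page is already cached:

```c
#include <assert.h>

/* Which reclamation path an object ends up on. */
enum path { PATH_BULK_ARRAY, PATH_RCU_HEAD };

/* Core queueing decision: may only touch the allocator if the caller
 * context permits it; otherwise it uses whatever page is cached. */
static enum path queue_object(int can_alloc, int page_available)
{
	if (can_alloc && !page_available)
		page_available = 1; /* __get_free_page(GFP_NOWAIT), may fail */
	return page_available ? PATH_BULK_ARRAY : PATH_RCU_HEAD;
}

/* kfree_rcu(): allowed to allocate a pointer page. */
static enum path kfree_rcu_model(int page_available)
{
	return queue_object(1, page_available);
}

/* Proposed raw_kfree_rcu(): callable under raw_spinlock_t, so it must
 * never allocate; without a cached page it chains via rcu_head. */
static enum path raw_kfree_rcu_model(int page_available)
{
	return queue_object(0, page_available);
}
```

Callsites holding a raw spinlock would be converted to the raw variant, and everything else keeps the fast bulk-array path.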
Uladzislau Rezki Aug. 14, 2020, 11:54 a.m. UTC | #66
On Thu, Aug 13, 2020 at 06:36:17PM +0200, Michal Hocko wrote:
> On Thu 13-08-20 18:20:47, Uladzislau Rezki wrote:
> > > On Thu 13-08-20 08:41:59, Paul E. McKenney wrote:
> > > > On Thu, Aug 13, 2020 at 04:53:35PM +0200, Michal Hocko wrote:
> > > > > On Thu 13-08-20 16:34:57, Thomas Gleixner wrote:
> > > > > > Michal Hocko <mhocko@suse.com> writes:
> > > > > > > On Thu 13-08-20 15:22:00, Thomas Gleixner wrote:
> > > > > > >> It basically requires to convert the wait queue to something else. Is
> > > > > > >> the waitqueue strict single waiter?
> > > > > > >
> > > > > > > I would have to double check. From what I remember only kswapd should
> > > > > > > ever sleep on it.
> > > > > > 
> > > > > > That would make it trivial as we could simply switch it over to rcu_wait.
> > > > > > 
> > > > > > >> So that should be:
> > > > > > >> 
> > > > > > >> 	if (!preemptible() && gfp == GFP_RT_NOWAIT)
> > > > > > >> 
> > > > > > >> which is limiting the damage to those callers which hand in
> > > > > > >> GFP_RT_NOWAIT.
> > > > > > >> 
> > > > > > >> lockdep will yell at invocations with gfp != GFP_RT_NOWAIT when it hits
> > > > > > >> zone->lock in the wrong context. And we want to know about that so we
> > > > > > >> can look at the caller and figure out how to solve it.
> > > > > > >
> > > > > > > Yes, that would have to somehow need to annotate the zone_lock to be ok
> > > > > > > in those paths so that lockdep doesn't complain.
> > > > > > 
> > > > > > That opens the worst of all cans of worms. If we start this here then
> > > > > > Joe programmer and his dog will use these lockdep annotation to evade
> > > > > > warnings and when exposed to RT it will fall apart in pieces. Just that
> > > > > > at that point Joe programmer moved on to something else and the usual
> > > > > > suspects can mop up the pieces. We've seen that all over the place and
> > > > > > some people even disable lockdep temporarily because annotations don't
> > > > > > help.
> > > > > 
> > > > > Hmm. I am likely missing something really important here. We have two
> > > > > problems at hand:
> > > > > 1) RT will become broken as soon as this new RCU functionality which
> > > > > requires an allocation from inside of raw_spinlock hits the RT tree
> > > > > 2) lockdep splats which are telling us that early because of the
> > > > > raw_spinlock-> spin_lock dependency.
> > > > 
> > > > That is a reasonable high-level summary.
> > > > 
> > > > > 1) can be handled by handled by the bailing out whenever we have to use
> > > > > zone->lock inside the buddy allocator - essentially even more strict
> > > > > NOWAIT semantic than we have for RT tree - proposed (pseudo) patch is
> > > > > trying to describe that.
> > > > 
> > > > Unless I am missing something subtle, the problem with this approach
> > > > is that in production-environment CONFIG_PREEMPT_NONE=y kernels, there
> > > > is no way at runtime to distinguish between holding a spinlock on the
> > > > one hand and holding a raw spinlock on the other.  Therefore, without
> > > > some sort of indication from the caller, this approach will not make
> > > > CONFIG_PREEMPT_NONE=y users happy.
> > > 
> > > If the whole bailout is guarded by CONFIG_PREEMPT_RT specific atomicity
> > > check then there is no functional problem - GFP_RT_SAFE would still be
> > > GFP_NOWAIT so functional wise the allocator will still do the right
> > > thing.
> > > 
> > > [...]
> > > 
> > > > > That would require changing NOWAIT/ATOMIC allocations semantic quite
> > > > > drastically for !RT kernels as well. I am not sure this is something we
> > > > > can do. Or maybe I am just missing your point.
> > > > 
> > > > Exactly, and avoiding changing this semantic for current users is
> > > > precisely why we are proposing some sort of indication to be passed
> > > > into the allocation request.  In Uladzislau's patch, this was the
> > > > __GFP_NO_LOCKS flag, but whatever works.
> > > 
> > > As I've tried to explain already, I would really hope we can do without
> > > any new gfp flags. We are running out of them and they tend to generate
> > > a lot of maintenance burden. There is a lot of abuse etc. We should also
> > > not expose such an implementation detail of the allocator to callers
> > > because that would make future changes even harder. The alias, on the
> > > othere hand already builds on top of existing NOWAIT semantic and it
> > > just helps the allocator to complain about a wrong usage while it
> > > doesn't expose any internals.
> > > 
> > I know that Matthew and me raised it. We do can handle it without
> > introducing any flag. I mean just use 0 as argument to the page_alloc(gfp_flags = 0) 
> > 
> > i.e. #define __GFP_NO_LOCKS 0
> > 
> > so it will be handled same way how it is done in the "mm: Add __GFP_NO_LOCKS flag"
> > I can re-spin the RFC patch and send it out for better understanding.
> > 
> > Does it work for you, Michal? Or it is better just to drop the patch here?
> 
> That would change the semantic for GFP_NOWAIT users who decided to drop
> __GFP_KSWAPD_RECLAIM or even use 0 gfp mask right away, right? The point
>
I see your point. Doing GFP_NOWAIT & ~__GFP_KSWAPD_RECLAIM will do something
different from what people expect. Right you are.

>
> I am trying to make is that an alias is good for RT because it doesn't
> have any users (because there is no RT atomic user of the allocator)
> currently.
>
Now I see your view. So we can handle the RT case by using "RT && !preemptible()"
and bail out based on that. GFP_ATOMIC and GFP_NOWAIT will at least keep the
same semantics.

Second, if CONFIG_PROVE_RAW_LOCK_NESTING is fixed for PREEMPT_COUNT=n,
then it would work. But I am a bit lost as to whether that is still up for
discussion.
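The PREEMPT_COUNT=n caveat Thomas raised earlier can be shown with a tiny model (a userspace sketch of the preemptible() definition from include/linux/preempt.h, with a fake config switch): without a preempt counter the kernel defines preemptible() to a constant 0, so it cannot tell atomic from non-atomic context. PREEMPT_RT selects PREEMPT_COUNT, which is why "RT && !preemptible()" remains usable on RT:

```c
#include <assert.h>

#define CONFIG_PREEMPT_COUNT 0	/* model a PREEMPT_COUNT=n kernel */

/* Stands in for the real per-CPU preempt counter, which would be
 * raised by preempt_disable(), spin_lock() etc. */
static int preempt_count_val;

static int preemptible(void)
{
#if CONFIG_PREEMPT_COUNT
	/* counting kernel: real check (kernel also tests irqs_disabled()) */
	return preempt_count_val == 0;
#else
	/* include/linux/preempt.h: #define preemptible() 0 */
	return 0;
#endif
}
```

With the config switch at 0, the check below returns 0 regardless of the actual context, which is exactly why a bare !preemptible() bailout cannot work on such kernels.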

Thanks!

--
Vlad Rezki
Uladzislau Rezki Aug. 14, 2020, 12:15 p.m. UTC | #67
> On Thu 13-08-20 19:09:29, Thomas Gleixner wrote:
> > Michal Hocko <mhocko@suse.com> writes:
> [...]
> > > Why should we limit the functionality of the allocator for something
> > > that is not a real problem?
> > 
> > We'd limit the allocator for exactly ONE new user which was aware of
> > this problem _before_ the code hit mainline. And that ONE user is
> > prepared to handle the fail.
> 
> If we are to limit the functionality to this one particular user then
> I would consider a dedicated gfp flag a huge overkill. It would be much
> easier to have a preallocated pool of pages and use those and
> completely avoid the core allocator. That would certainly only shift the
> complexity to the caller but if it is expected there would be only that
> single user then it would be probably better than opening a can of worms
> like allocator usable from raw spin locks.
> 
Vlastimil raised the same question earlier and I answered, but let me answer
again:

It is hard to achieve because the load does not follow a fixed static
pattern, i.e. it depends on how heavily kfree_rcu() (single/double argument)
is used. Based on that "how heavily", some number of pages gets formed until
the drain/reclaimer thread frees them.

Preloading pages and keeping them for internal use, IMHO, is not optimal
from the point of view of wasting resources. It is better to have a fast
mechanism to request a page and release it back for the needs of others. As
described above, we do not know in advance how much we will need.

--
Vlad Rezki
Michal Hocko Aug. 14, 2020, 12:48 p.m. UTC | #68
On Fri 14-08-20 14:15:44, Uladzislau Rezki wrote:
> > On Thu 13-08-20 19:09:29, Thomas Gleixner wrote:
> > > Michal Hocko <mhocko@suse.com> writes:
> > [...]
> > > > Why should we limit the functionality of the allocator for something
> > > > that is not a real problem?
> > > 
> > > We'd limit the allocator for exactly ONE new user which was aware of
> > > this problem _before_ the code hit mainline. And that ONE user is
> > > prepared to handle the fail.
> > 
> > If we are to limit the functionality to this one particular user then
> > I would consider a dedicated gfp flag a huge overkill. It would be much
> > easier to have a preallocated pool of pages and use those and
> > completely avoid the core allocator. That would certainly only shift the
> > complexity to the caller but if it is expected there would be only that
> > single user then it would be probably better than opening a can of worms
> > like allocator usable from raw spin locks.
> > 
> Vlastimil raised same question earlier, i answered, but let me answer again:
> 
> It is hard to achieve because the logic does not stick to certain static test
> case, i.e. it depends on how heavily kfree_rcu(single/double) are used. Based
> on that, "how heavily" - number of pages are formed, until the drain/reclaimer
> thread frees them.

How many pages are we talking about - ballpark? 100s, 1000s?

> Preloading pages and keeping them for internal use, IMHO, seems not optimal
> from the point of resources wasting. It is better to have a fast mechanism to
> request a page and release it back for needs of others. As described about
> we do not know how much we will need.

It would be wasted memory, but that is acceptable if we are talking about a
relatively small number of pages. Note that there are not that many pages on
the allocator's pcp list anyway.
Paul E. McKenney Aug. 14, 2020, 1:34 p.m. UTC | #69
On Fri, Aug 14, 2020 at 02:48:32PM +0200, Michal Hocko wrote:
> On Fri 14-08-20 14:15:44, Uladzislau Rezki wrote:
> > > On Thu 13-08-20 19:09:29, Thomas Gleixner wrote:
> > > > Michal Hocko <mhocko@suse.com> writes:
> > > [...]
> > > > > Why should we limit the functionality of the allocator for something
> > > > > that is not a real problem?
> > > > 
> > > > We'd limit the allocator for exactly ONE new user which was aware of
> > > > this problem _before_ the code hit mainline. And that ONE user is
> > > > prepared to handle the fail.
> > > 
> > > If we are to limit the functionality to this one particular user then
> > > I would consider a dedicated gfp flag a huge overkill. It would be much
> > > more easier to have a preallocated pool of pages and use those and
> > > completely avoid the core allocator. That would certainly only shift the
> > > complexity to the caller but if it is expected there would be only that
> > > single user then it would be probably better than opening a can of worms
> > > like allocator usable from raw spin locks.
> > > 
> > Vlastimil raised same question earlier, i answered, but let me answer again:
> > 
> > It is hard to achieve because the logic does not stick to certain static test
> > case, i.e. it depends on how heavily kfree_rcu(single/double) are used. Based
> > on that, "how heavily" - number of pages are formed, until the drain/reclaimer
> > thread frees them.
> 
> How many pages are talking about - ball park? 100s, 1000s?

Under normal operation, a couple of pages per CPU, which would make
preallocation entirely reasonable.  Except that if someone does something
that floods RCU callbacks (close(open) in a tight userspace loop, for but
one example), then 2000 per CPU might not be enough, which on a 64-CPU
system comes to about 500MB.  This is beyond excessive for preallocation
on the systems I am familiar with.
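As a sanity check of the figures above (a trivial helper, not kernel code):

```c
#include <assert.h>

#define PAGE_SZ 4096L

/* Footprint of the pointer-page cache: pages per CPU * CPUs * page size. */
static long cache_bytes(long pages_per_cpu, long cpus)
{
	return pages_per_cpu * cpus * PAGE_SZ;
}
```

cache_bytes(2, 64) is 512 KiB for the normal case; cache_bytes(2000, 64) is exactly 500 MiB for the flooding case, matching the estimate.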

And the flooding case is where you most want the reclamation to be
efficient, and thus where you want the pages.

This of course raises the question of how much memory the lockless caches
contain, but fortunately these RCU callback flooding scenarios also
involve process-context allocation of the memory that is being passed
to kfree_rcu().  That allocation should keep the lockless caches from
going empty in the common case, correct?

Please note that we will also need to apply this same optimization to
call_rcu(), which will have the same constraints.

> > Preloading pages and keeping them for internal use, IMHO, seems not optimal
> > from the point of resources wasting. It is better to have a fast mechanism to
> > request a page and release it back for needs of others. As described about
> > we do not know how much we will need.
> 
> It would be wasted memory but if we are talking about relatively small
> number of pages. Note that there are not that many pages on the
> allocator's pcp list anyway.

Indeed, if we were talking a maximum of (say) 10 pages per CPU, we would
not be having this conversation.  ;-)

							Thanx, Paul
Michal Hocko Aug. 14, 2020, 2:06 p.m. UTC | #70
On Fri 14-08-20 06:34:50, Paul E. McKenney wrote:
> On Fri, Aug 14, 2020 at 02:48:32PM +0200, Michal Hocko wrote:
> > On Fri 14-08-20 14:15:44, Uladzislau Rezki wrote:
> > > > On Thu 13-08-20 19:09:29, Thomas Gleixner wrote:
> > > > > Michal Hocko <mhocko@suse.com> writes:
> > > > [...]
> > > > > > Why should we limit the functionality of the allocator for something
> > > > > > that is not a real problem?
> > > > > 
> > > > > We'd limit the allocator for exactly ONE new user which was aware of
> > > > > this problem _before_ the code hit mainline. And that ONE user is
> > > > > prepared to handle the fail.
> > > > 
> > > > If we are to limit the functionality to this one particular user then
> > > > I would consider a dedicated gfp flag a huge overkill. It would be much
> > > > more easier to have a preallocated pool of pages and use those and
> > > > completely avoid the core allocator. That would certainly only shift the
> > > > complexity to the caller but if it is expected there would be only that
> > > > single user then it would be probably better than opening a can of worms
> > > > like allocator usable from raw spin locks.
> > > > 
> > > Vlastimil raised same question earlier, i answered, but let me answer again:
> > > 
> > > It is hard to achieve because the logic does not stick to certain static test
> > > case, i.e. it depends on how heavily kfree_rcu(single/double) are used. Based
> > > on that, "how heavily" - number of pages are formed, until the drain/reclaimer
> > > thread frees them.
> > 
> > How many pages are talking about - ball park? 100s, 1000s?
> 
> Under normal operation, a couple of pages per CPU, which would make
> preallocation entirely reasonable.  Except that if someone does something
> that floods RCU callbacks (close(open) in a tight userspace loop, for but
> one example), then 2000 per CPU might not be enough, which on a 64-CPU
> system comes to about 500MB.  This is beyond excessive for preallocation
> on the systems I am familiar with.
> 
> And the flooding case is where you most want the reclamation to be
> efficient, and thus where you want the pages.

I am not sure the page allocator would help you with this scenario
unless you are on very large machines. Pagesets scale with the available
memory and percpu_pagelist_fraction sysctl (have a look at
pageset_set_high_and_batch). It is roughly a 1000th of the zone size for
each zone. You can check that in /proc/zoneinfo (my 8G machine):

Node 0, zone      DMA
Not interesting at all
Node 0, zone    DMA32
  pagesets
    cpu: 0
              count: 242
              high:  378
              batch: 63
    cpu: 1
              count: 355
              high:  378
              batch: 63
    cpu: 2
              count: 359
              high:  378
              batch: 63
    cpu: 3
              count: 366
              high:  378
              batch: 63
Node 0, zone   Normal
  pagesets
    cpu: 0
              count: 359
              high:  378
              batch: 63
    cpu: 1
              count: 241
              high:  378
              batch: 63
    cpu: 2
              count: 297
              high:  378
              batch: 63
    cpu: 3
              count: 227
              high:  378
              batch: 63

Besides that do you need to be per-cpu? Having 1000 pages available and
managed under your raw spinlock should be good enough already no?
 
> This of course raises the question of how much memory the lockless caches
> contain, but fortunately these RCU callback flooding scenarios also
> involve process-context allocation of the memory that is being passed
> to kfree_rcu().  That allocation should keep the lockless caches from
> going empty in the common case, correct?

Yes, those are refilled both on the allocation/free paths. But you
cannot really rely on that to happen early enough.

Do you happen to have any numbers that would show the typical usage
> and how often the slow path has to be taken because pcp lists are
> depleted? In other words, even if we provide functionality for a
> completely lockless way to allocate memory, how useful is that?
Paul E. McKenney Aug. 14, 2020, 2:14 p.m. UTC | #71
On Fri, Aug 14, 2020 at 10:30:37AM +0200, Peter Zijlstra wrote:
> On Fri, Aug 14, 2020 at 01:59:04AM +0200, Thomas Gleixner wrote:
> > On Fri, Aug 14 2020 at 00:06, peterz wrote:
> > > I'm still not getting it, how do we end up trying to allocate memory
> > > from under raw spinlocks if you're not allowed to use kfree_rcu() under
> > > one ?
> > >
> > > Can someone please spell out the actual problem?
> > 
> > Your actual problem is the heat wave. Get an icepack already :)
> 
> Sure, but also much of the below wasn't stated anywhere in the thread I
> got Cc'ed on. All I got was half arsed solutions without an actual
> problem statement.
> 
> > To set things straight:
> > 
> >   1) kfree_rcu() which used to be just a convenience wrapper around
> >      call_rcu() always worked in any context except NMI.
> > 
> >      Otherwise RT would have never worked or would have needed other
> >      horrible hacks to defer kfree in certain contexts including
> >      context switch.
> 
> I've never used kfree_rcu(), htf would I know.

Doing this to kfree_rcu() is the first step.  We will also be doing this
to call_rcu(), which has some long-standing invocations from various
raw contexts, including hardirq handlers.

> >   2) RCU grew an optimization for kfree_rcu() which avoids the linked
> >      list cache misses when a large number of objects is freed via
> >      kfree_rcu(). Paul says it speeds up post grace period processing by
> >      a factor of 8 which is impressive.
> > 
> >      That's done by avoiding the linked list and storing the object
> >      pointer in an array of pointers which is page sized. This array is
> >      then freed in bulk without having to touch the rcu head linked list
> >      which obviously avoids cache misses.
> > 
> >      Occasionally kfree_rcu() needs to allocate a page for this but it
> >      can fallback to the linked list when the allocation fails.
> >      
> >      Inconveniently this allocation happens with an RCU internal raw
> >      lock held, but even if we could do the allocation outside the RCU
> >      internal lock this would not solve the problem that kfree_rcu() can
> >      be called in any context except NMI even on RT.
> > 
> >      So RT forbids allocations from truly atomic contexts even with
> >      GFP_ATOMIC/NOWAIT. GFP_ATOMIC is kinda meaningless on RT because
> >      everything which calls it is in non-atomic context :) But still
> >      GFP_ATOMIC or GFP_NOWAIT retain the semantics of !RT and do not go
> >      down into deeper levels or wait for pages to become available.
> 
> Right so on RT just cut out the allocation and unconditionally make it
> NULL. Since it needs to deal with GFP_ATOMIC|GFP_NOWARN failing the
> allocation anyway.

Except that this is not just RT due to CONFIG_PROVE_RAW_LOCK_NESTING=y.

> >   3) #2 upsets lockdep (with the raw lock nesting enabled) rightfully
> >      when RCU tries to allocate a page, the lockless page cache is empty
> >      and it acquires zone->lock which is a regular spinlock
> 
> There was no lockdep splat in the thread.

I don't see the point of including such a splat given that we know that
lockdep is supposed to splat in this situation.

> >   4) As kfree_rcu() can be used from any context except NMI and RT
> >      relies on it we ran into a circular dependency problem.
> 
> Well, which actual usage sites are under a raw spinlock? Most of the
> ones I could find are not.

There are some on their way in, but this same optimization will be applied
to call_rcu(), and there are no shortage of such call sites in that case.
And these call sites have been around for a very long time.

> >      If we disable that feature for RT then the problem would be solved
> >      except that lockdep still would complain about taking zone->lock
> >      within a forbidden context on !RT kernels.
> > 
> >      Preventing that feature because of that is not a feasible option
> >      either. Aside of that we discuss this postfactum, IOW the stuff is
> >      upstream already despite the problem being known before.
> 
> Well, that's a fail :-( I thought we were supposed to solve known issues
> before shit got merged.

The enforcement is coming in and the kfree_rcu() speed up is coming in
at the same time.  The call_rcu() speedup will appear later.

> >   5) Ulad posted patches which introduce a new GFP allocation mode
> >      which makes the allocation fail if the lockless cache is empty,
> >      i.e. it does not try to touch zone->lock in that case.
> > 
> >      That works perfectly fine on RT and !RT, makes lockdep happy and
> >      Paul is happy as well.
> > 
> >      If the lockless cache, which works perfectly fine on RT, is empty
> >      then the performance of kfree_rcu() post grace period processing is
> >      probably not making the situation of the system worse.
> > 
> >      Michal is not fond of the new GFP flag and wants to explore options
> >      to avoid that completely. But so far there is no real feasible
> >      solution.
> 
> Not only Michal, those patches looked pretty horrid.

They are the first round, and will be improved.

> >      A) It was suggested to make zone->lock raw, but that's a total
> >         disaster latency wise. With just a kernel compile (!RT kernel)
> >         spinwait times are in the hundreds of microseconds.
> 
> Yeah, I know, been there done that, like over a decade ago :-)
> 
> >      B) Michal suggested to have GFP_RT_ATOMIC defined to 0 which would
> >         not require a new GFP bit and bail out when RT is enabled and
> >         the context is atomic.
> 
> I would've written the code like:
> 
> 	void *mem = NULL;
> 
> 	...
> #ifndef CONFIG_PREEMPT_RT
> 	mem = kmalloc(PAGE_SIZE, GFP_ATOMIC|GFP_NOWAIT);
> #endif
> 	if (!mem)
> 	  ....
> 
> Seems much simpler and doesn't pollute the GFP_ space.

But fails in the CONFIG_PROVE_RAW_LOCK_NESTING=y case when lockdep
is enabled.

> >      D) To solve the lockdep issue of #B Michal suggested to have magic
> >         lockdep annotations which allow !RT kernels to take zone->lock
> >         from the otherwise forbidden contexts because on !RT this lock
> >         nesting could be considered a false positive.
> > 
> >         But this would be horrors of some sorts because there might be
> >         locks further down which then need the same treatment or some
> >         general 'yay, I'm exempt from everything' annotation which is
> >         short of disabling lockdep completely.
> > 
> >         Even if that could be solved and made correct for both RT and
> >         !RT then this opens the can of worms that this kind of
> >         annotation would show up all over the place within no time for
> >         the completely wrong reasons.
> 
> So due to this heat crap I've not slept more than a few hours a night
> for the last week, I'm not entirely there, but we already have a bunch
> of lockdep magic for this raw nesting stuff.

Ouch!  Here is hoping for cooler weather soon!

> But yes, we need to avoid abuse, grep for lockdep_off() :-( This
> drivers/md/dm-cache-target.c thing is just plain broken. It sorta
> 'works' on mainline (and even there it can behave badly), but on RT it
> will come apart with a bang.
> 
> > Paul compiled this options list:
> > 
> > > 1.	Prohibit invoking allocators from raw atomic context, such
> > >	as when holding a raw spinlock.
> > 
> >   Clearly the simplest solution but not Pauls favourite and
> >   unfortunately he has a good reason.
> 
> Which isn't actually stated anywhere I suppose ?

Several times, but why not one more?  ;-)

In CONFIG_PREEMPT_NONE=y kernels, which are heavily used, and for which
the proposed kfree_rcu() and later call_rcu() optimizations are quite
important, there is no way to tell at runtime whether or not you are in
atomic raw context.

> > > 2.	Adding a GFP_ flag.
> > 
> >   Michal does not like the restriction for !RT kernels and tries to
> >   avoid the introduction of a new allocation mode.
> 
> Like above, I tend to be with Michal on this, just wrap the actual
> allocation in CONFIG_PREEMPT_RT, the code needs to handle a NULL pointer
> there anyway.

That disables the optimization in the CONFIG_PREEMPT_NONE=y case,
where it really is needed.

> > > 3.	Reusing existing GFP_ flags/values/whatever to communicate
> > >	the raw-context information that was to be communicated by
> > >	the new GFP_ flag.
> > >
> > > 4.	Making lockdep forgive acquiring spinlocks while holding
> > >	raw spinlocks, but only in CONFIG_PREEMPT_NONE=y kernels.
> 
> Uhh, !CONFIG_PREEMPT_RT, the rest is 'fine'.

I would be OK with either.  In CONFIG_PREEMPT_NONE=n kernels, the
kfree_rcu() code could use preemptible() to determine whether it was safe
to invoke the allocator.  The code in kfree_rcu() might look like this:

	mem = NULL;
	if (IS_ENABLED(CONFIG_PREEMPT_NONE) || preemptible())
		mem = __get_free_page(...);

Is your point that the usual mistakes would then be caught by the
usual testing on CONFIG_PREEMPT_NONE=n kernels?

> >  These are not separate from each other as #3 requires #4. The most
> >  horrible solution IMO from a technical POV as it proliferates
> >  inconsistency for no good reason.
> > 
> >  Aside of that it'd be solving a problem which does not exist simply
> >  because kfree_rcu() does not depend on it and we really don't want to
> >  set precedence and encourage the (ab)use of this in any way.
> 
> My preferred solution is 1, if you want to use an allocator, you simply
> don't get to play under raw_spinlock_t. And from a quick grep, most
> kfree_rcu() users are not under raw_spinlock_t context.

There is at least one on its way in, but more to the point, we will
need to apply this same optimization to call_rcu(), which is used in
raw atomic context, including from hardirq handlers.

> This here sounds like someone who wants to have his cake and eat it too.

Just looking for a non-negative sized slice of cake, actually!

> I'll try and think about a lockdep annotation that isn't utter crap, but
> that's probably next week, I need this heat to end and get a few nights
> sleep, I'm an utter wreck atm.

Again, here is hoping for cooler weather!

							Thanx, Paul
Paul E. McKenney Aug. 14, 2020, 3:26 p.m. UTC | #72
On Fri, Aug 14, 2020 at 12:23:06PM +0200, peterz@infradead.org wrote:
> On Fri, Aug 14, 2020 at 10:30:37AM +0200, Peter Zijlstra wrote:
> > > > 1.	Prohibit invoking allocators from raw atomic context, such
> > > >	as when holding a raw spinlock.
> > > 
> > >   Clearly the simplest solution but not Pauls favourite and
> > >   unfortunately he has a good reason.
> > 
> > Which isn't actually stated anywhere I suppose ?
> 
> Introduce raw_kfree_rcu() that doesn't do the allocation, and fix the
> few wonky callsites.

The problem with that is common code along with the tendency of people
to just use the one that "works everywhere".

							Thanx, Paul
Paul E. McKenney Aug. 14, 2020, 4:11 p.m. UTC | #73
On Fri, Aug 14, 2020 at 07:14:25AM -0700, Paul E. McKenney wrote:
> On Fri, Aug 14, 2020 at 10:30:37AM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 14, 2020 at 01:59:04AM +0200, Thomas Gleixner wrote:

[ . . . ]

> > > > 3.	Reusing existing GFP_ flags/values/whatever to communicate
> > > >	the raw-context information that was to be communicated by
> > > >	the new GFP_ flag.
> > > >
> > > > 4.	Making lockdep forgive acquiring spinlocks while holding
> > > >	raw spinlocks, but only in CONFIG_PREEMPT_NONE=y kernels.
> > 
> > Uhh, !CONFIG_PREEMPT_RT, the rest is 'fine'.
> 
> I would be OK with either.  In CONFIG_PREEMPT_NONE=n kernels, the
> kfree_rcu() code could use preemptible() to determine whether it was safe
> to invoke the allocator.  The code in kfree_rcu() might look like this:
> 
> 	mem = NULL;
> 	if (IS_ENABLED(CONFIG_PREEMPT_NONE) || preemptible())
> 		mem = __get_free_page(...);
> 
> Is your point that the usual mistakes would then be caught by the
> usual testing on CONFIG_PREEMPT_NONE=n kernels?

Just to make sure we are talking about the same thing, please see below
for an untested patch that illustrates how I was interpreting your words.
Was this what you had in mind?

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 62a382d..42d0ff1 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -579,7 +579,7 @@ do {									\
 # define lockdep_assert_preemption_disabled() do { } while (0)
 #endif
 
-#ifdef CONFIG_PROVE_RAW_LOCK_NESTING
+#ifdef CONFIG_PROVE_RAW_LOCK_NESTING_EFFECTIVE
 
 # define lockdep_assert_RT_in_threaded_ctx() do {			\
 		WARN_ONCE(debug_locks && !current->lockdep_recursion &&	\
diff --git a/include/linux/lockdep_types.h b/include/linux/lockdep_types.h
index bb35b44..70867d58 100644
--- a/include/linux/lockdep_types.h
+++ b/include/linux/lockdep_types.h
@@ -20,7 +20,7 @@ enum lockdep_wait_type {
 	LD_WAIT_FREE,		/* wait free, rcu etc.. */
 	LD_WAIT_SPIN,		/* spin loops, raw_spinlock_t etc.. */
 
-#ifdef CONFIG_PROVE_RAW_LOCK_NESTING
+#ifdef CONFIG_PROVE_RAW_LOCK_NESTING_EFFECTIVE
 	LD_WAIT_CONFIG,		/* CONFIG_PREEMPT_LOCK, spinlock_t etc.. */
 #else
 	LD_WAIT_CONFIG = LD_WAIT_SPIN,
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index e068c3c..e02de40 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1215,6 +1215,9 @@ config PROVE_RAW_LOCK_NESTING
 
 	 If unsure, select N.
 
+config PROVE_RAW_LOCK_NESTING_EFFECTIVE
+	def_bool PROVE_RAW_LOCK_NESTING && !PREEMPTION
+
 config LOCK_STAT
 	bool "Lock usage statistics"
 	depends on DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT
Peter Zijlstra Aug. 14, 2020, 4:19 p.m. UTC | #74
On Fri, Aug 14, 2020 at 07:14:25AM -0700, Paul E. McKenney wrote:

> Doing this to kfree_rcu() is the first step.  We will also be doing this
> to call_rcu(), which has some long-standing invocations from various
> raw contexts, including hardirq handlers.

Most hardirq handlers are not raw on RT due to threaded interrupts.
Lockdep knows about this.

> > >   4) As kfree_rcu() can be used from any context except NMI and RT
> > >      relies on it we ran into a circular dependency problem.
> > 
> > Well, which actual usage sites are under a raw spinlock? Most of the
> > ones I could find are not.
> 
> There are some on their way in, but this same optimization will be applied
> to call_rcu(), and there are no shortage of such call sites in that case.
> And these call sites have been around for a very long time.

Luckily there is lockdep to help you find the ones that need converting
to raw_call_rcu() :-)

> > >   Clearly the simplest solution but not Pauls favourite and
> > >   unfortunately he has a good reason.
> > 
> > Which isn't actually stated anywhere I suppose ?
> 
> Several times, but why not one more?  ;-)
> 
> In CONFIG_PREEMPT_NONE=y kernels, which are heavily used, and for which
> the proposed kfree_rcu() and later call_rcu() optimizations are quite
> important, there is no way to tell at runtime whether or not you are in
> atomic raw context.

CONFIG_PREEMPT_NONE has nothing what so ever to do with any of this.

> > > > 2.	Adding a GFP_ flag.
> > > 
> > >   Michal does not like the restriction for !RT kernels and tries to
> > >   avoid the introduction of a new allocation mode.
> > 
> > Like above, I tend to be with Michal on this, just wrap the actual
> > allocation in CONFIG_PREEMPT_RT, the code needs to handle a NULL pointer
> > there anyway.
> 
> That disables the optimization in the CONFIG_PREEMPT_NONE=y case,
> where it really is needed.

No, it disables it for CONFIG_PREEMPT_RT.

> I would be OK with either.  In CONFIG_PREEMPT_NONE=n kernels, the
> kfree_rcu() code could use preemptible() to determine whether it was safe
> to invoke the allocator.  The code in kfree_rcu() might look like this:
> 
> 	mem = NULL;
> 	if (IS_ENABLED(CONFIG_PREEMPT_NONE) || preemptible())
> 		mem = __get_free_page(...);
> 
> Is your point that the usual mistakes would then be caught by the
> usual testing on CONFIG_PREEMPT_NONE=n kernels?

	mem = NULL;
#if !defined(CONFIG_PREEMPT_RT) && !defined(CONFIG_PROVE_LOCKING)
	mem = __get_free_page(...)
#endif
	if (!mem)

But I _really_ would much rather have raw_kfree_rcu() and raw_call_rcu()
variants for the few places that actually need it.

> > >  These are not separate from each other as #3 requires #4. The most
> > >  horrible solution IMO from a technical POV as it proliferates
> > >  inconsistency for no good reason.
> > > 
> > >  Aside of that it'd be solving a problem which does not exist simply
> > >  because kfree_rcu() does not depend on it and we really don't want to
> > >  set precedence and encourage the (ab)use of this in any way.
> > 
> > My preferred solution is 1, if you want to use an allocator, you simply
> > don't get to play under raw_spinlock_t. And from a quick grep, most
> > kfree_rcu() users are not under raw_spinlock_t context.
> 
> There is at least one on its way in, but more to the point, we will
> need to apply this same optimization to call_rcu(), which is used in

There is no need, call_rcu() works perfectly fine today, thank you.
You might want to, but that's an entirely different thing.

> raw atomic context, including from hardirq handlers.

Threaded IRQs. There really is very little code that is 'raw' on RT.
Peter Zijlstra Aug. 14, 2020, 5:49 p.m. UTC | #75
On Fri, Aug 14, 2020 at 09:11:06AM -0700, Paul E. McKenney wrote:
> Just to make sure we are talking about the same thing, please see below
> for an untested patch that illustrates how I was interpreting your words.
> Was this what you had in mind?

No, definitely not.

Also, since we used to be able to use call_rcu() _everywhere_, including
under zone->lock, how's that working with you calling the
page allocator from it?
Paul E. McKenney Aug. 14, 2020, 6:01 p.m. UTC | #76
On Fri, Aug 14, 2020 at 04:06:04PM +0200, Michal Hocko wrote:
> On Fri 14-08-20 06:34:50, Paul E. McKenney wrote:
> > On Fri, Aug 14, 2020 at 02:48:32PM +0200, Michal Hocko wrote:
> > > On Fri 14-08-20 14:15:44, Uladzislau Rezki wrote:
> > > > > On Thu 13-08-20 19:09:29, Thomas Gleixner wrote:
> > > > > > Michal Hocko <mhocko@suse.com> writes:
> > > > > [...]
> > > > > > > Why should we limit the functionality of the allocator for something
> > > > > > > that is not a real problem?
> > > > > > 
> > > > > > We'd limit the allocator for exactly ONE new user which was aware of
> > > > > > this problem _before_ the code hit mainline. And that ONE user is
> > > > > > prepared to handle the fail.
> > > > > 
> > > > > If we are to limit the functionality to this one particular user then
> > > > > I would consider a dedicated gfp flag a huge overkill. It would be much
> > > > > easier to have a preallocated pool of pages and use those and
> > > > > completely avoid the core allocator. That would certainly only shift the
> > > > > complexity to the caller but if it is expected there would be only that
> > > > > single user then it would be probably better than opening a can of worms
> > > > > like an allocator usable from raw spin locks.
> > > > > 
> > > > Vlastimil raised the same question earlier, I answered, but let me answer again:
> > > > 
> > > > It is hard to achieve because the logic does not stick to a certain static
> > > > test case, i.e. it depends on how heavily kfree_rcu() (single/double
> > > > argument) is used. Based on that "how heavily", a number of pages is
> > > > formed, until the drain/reclaimer thread frees them.
> > > 
> > > How many pages are we talking about - ball park? 100s, 1000s?
> > 
> > Under normal operation, a couple of pages per CPU, which would make
> > preallocation entirely reasonable.  Except that if someone does something
> > that floods RCU callbacks (close(open) in a tight userspace loop, for but
> > one example), then 2000 per CPU might not be enough, which on a 64-CPU
> > system comes to about 500MB.  This is beyond excessive for preallocation
> > on the systems I am familiar with.
> > 
> > And the flooding case is where you most want the reclamation to be
> > efficient, and thus where you want the pages.
> 
> I am not sure the page allocator would help you with this scenario
> unless you are on very large machines. Pagesets scale with the available
> memory and percpu_pagelist_fraction sysctl (have a look at
> pageset_set_high_and_batch). It is roughly 1000th of the zone size for
> each zone. You can check that in /proc/vmstat (my 8G machine)

Small systems might have ~64G.  The medium-sized systems might have
~250G.  There are a few big ones that might have 1.5T.  None of the
/proc/vmstat files from those machines contain anything resembling
the list below, though.

> Node 0, zone      DMA
> Not interesting at all
> Node 0, zone    DMA32
>   pagesets
>     cpu: 0
>               count: 242
>               high:  378
>               batch: 63
>     cpu: 1
>               count: 355
>               high:  378
>               batch: 63
>     cpu: 2
>               count: 359
>               high:  378
>               batch: 63
>     cpu: 3
>               count: 366
>               high:  378
>               batch: 63
> Node 0, zone   Normal
>   pagesets
>     cpu: 0
>               count: 359
>               high:  378
>               batch: 63
>     cpu: 1
>               count: 241
>               high:  378
>               batch: 63
>     cpu: 2
>               count: 297
>               high:  378
>               batch: 63
>     cpu: 3
>               count: 227
>               high:  378
>               batch: 63
> 
> Besides that do you need to be per-cpu? Having 1000 pages available and
> managed under your raw spinlock should be good enough already no?

It needs to be almost entirely per-CPU for performance reasons.  Plus
a user could do a tight close(open()) loop on each CPU.

> > This of course raises the question of how much memory the lockless caches
> > contain, but fortunately these RCU callback flooding scenarios also
> > involve process-context allocation of the memory that is being passed
> > to kfree_rcu().  That allocation should keep the lockless caches from
> > going empty in the common case, correct?
> 
> Yes, those are refilled both on the allocation/free paths. But you
> cannot really rely on that to happen early enough.

So the really ugly scenarios with the tight loops normally allocate
something and immediately either call_rcu() or kfree_rcu() it.
But you are right, someone doing "rm -rf" on a large file tree
with lots of small files might not be doing that many allocations.

> Do you happen to have any numbers that would show the typical usage
> and how often the slow path has to be taken because pcp lists are
> depleted? In other words, even if we provide functionality for a
> completely lockless way to allocate memory, how useful is that?

Not yet, but let's see what we can do.

							Thanx, Paul
Paul E. McKenney Aug. 14, 2020, 6:02 p.m. UTC | #77
On Fri, Aug 14, 2020 at 07:49:24PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 14, 2020 at 09:11:06AM -0700, Paul E. McKenney wrote:
> > Just to make sure we are talking about the same thing, please see below
> > for an untested patch that illustrates how I was interpreting your words.
> > Was this what you had in mind?
> 
> No, definitely not.
> 
> Also, since we used to be able to use call_rcu() _everywhere_, including
> under zone->lock, how's that working with you calling the
> page allocator from it?

Indeed, that is exactly the problem we are trying to solve.

							Thanx, Paul
Paul E. McKenney Aug. 14, 2020, 6:15 p.m. UTC | #78
On Fri, Aug 14, 2020 at 06:19:04PM +0200, peterz@infradead.org wrote:
> On Fri, Aug 14, 2020 at 07:14:25AM -0700, Paul E. McKenney wrote:
> 
> > Doing this to kfree_rcu() is the first step.  We will also be doing this
> > to call_rcu(), which has some long-standing invocations from various
> > raw contexts, including hardirq handlers.
> 
> Most hardirq handlers are not raw on RT due to threaded interrupts.
> Lockdep knows about this.

Understood.  But not all hardirq handlers are threaded.

> > > >   4) As kfree_rcu() can be used from any context except NMI and RT
> > > >      relies on it we ran into a circular dependency problem.
> > > 
> > > Well, which actual usage sites are under a raw spinlock? Most of the
> > > ones I could find are not.
> > 
> > There are some on their way in, but this same optimization will be applied
> > to call_rcu(), and there are no shortage of such call sites in that case.
> > And these call sites have been around for a very long time.
> 
> Luckily there is lockdep to help you find the ones that need converting
> to raw_call_rcu() :-)

I already gave you my views on raw_call_rcu().  :-/

> > > >   Clearly the simplest solution but not Pauls favourite and
> > > >   unfortunately he has a good reason.
> > > 
> > > Which isn't actually stated anywhere I suppose ?
> > 
> > Several times, but why not one more?  ;-)
> > 
> > In CONFIG_PREEMPT_NONE=y kernels, which are heavily used, and for which
> > the proposed kfree_rcu() and later call_rcu() optimizations are quite
> > important, there is no way to tell at runtime whether or not you are in
> > atomic raw context.
> 
> CONFIG_PREEMPT_NONE has nothing what so ever to do with any of this.

On the contrary, it has everything to do with this.  It is the environment
in which we cannot use preemptible() to dynamically determine whether
or not it is safe to invoke the memory allocator.

> > > > > 2.	Adding a GFP_ flag.
> > > > 
> > > >   Michal does not like the restriction for !RT kernels and tries to
> > > >   avoid the introduction of a new allocation mode.
> > > 
> > > Like above, I tend to be with Michal on this, just wrap the actual
> > > allocation in CONFIG_PREEMPT_RT, the code needs to handle a NULL pointer
> > > there anyway.
> > 
> > That disables the optimization in the CONFIG_PREEMPT_NONE=y case,
> > where it really is needed.
> 
> No, it disables it for CONFIG_PREEMPT_RT.

Except that lockdep still yells at us for CONFIG_PREEMPT_NONE=y, and
again, in the CONFIG_PREEMPT_NONE=y we cannot use preemptible() to
dynamically determine whether it is safe to invoke the memory allocator.

> > I would be OK with either.  In CONFIG_PREEMPT_NONE=n kernels, the
> > kfree_rcu() code could use preemptible() to determine whether it was safe
> > to invoke the allocator.  The code in kfree_rcu() might look like this:
> > 
> > 	mem = NULL;
> > 	if (IS_ENABLED(CONFIG_PREEMPT_NONE) || preemptible())
> > 		mem = __get_free_page(...);
> > 
> > Is your point that the usual mistakes would then be caught by the
> > usual testing on CONFIG_PREEMPT_NONE=n kernels?
> 
> 	mem = NULL;
> #if !defined(CONFIG_PREEMPT_RT) && !defined(CONFIG_PROVE_LOCKING)
> 	mem = __get_free_page(...)
> #endif
> 	if (!mem)
> 
> But I _really_ would much rather have raw_kfree_rcu() and raw_call_rcu()
> variants for the few places that actually need it.

Until people start propagating them all over because they happen to
unconditionally stop lockdep from complaining.

> > > >  These are not separate from each other as #3 requires #4. The most
> > > >  horrible solution IMO from a technical POV as it proliferates
> > > >  inconsistency for no good reason.
> > > > 
> > > >  Aside of that it'd be solving a problem which does not exist simply
> > > >  because kfree_rcu() does not depend on it and we really don't want to
> > > >  set precedence and encourage the (ab)use of this in any way.
> > > 
> > > My preferred solution is 1, if you want to use an allocator, you simply
> > > don't get to play under raw_spinlock_t. And from a quick grep, most
> > > kfree_rcu() users are not under raw_spinlock_t context.
> > 
> > There is at least one on its way in, but more to the point, we will
> > need to apply this same optimization to call_rcu(), which is used in
> 
> There is no need, call_rcu() works perfectly fine today, thank you.
> You might want to, but that's an entirely different thing.

Sorry, but no.  The call_rcu() callback invocation currently takes
a cache miss on each step through the rcu_head chain.

> > raw atomic context, including from hardirq handlers.
> 
> Threaded IRQs. There really is very little code that is 'raw' on RT.

Except that we also need to run non-RT kernels.

							Thanx, Paul
Thomas Gleixner Aug. 14, 2020, 7:33 p.m. UTC | #79
On Fri, Aug 14 2020 at 11:02, Paul E. McKenney wrote:
> On Fri, Aug 14, 2020 at 07:49:24PM +0200, Peter Zijlstra wrote:
>> On Fri, Aug 14, 2020 at 09:11:06AM -0700, Paul E. McKenney wrote:
>> > Just to make sure we are talking about the same thing, please see below
>> > for an untested patch that illustrates how I was interpreting your words.
>> > Was this what you had in mind?
>> 
>> No, definitely not.
>> 
>> Also, since we used to be able to use call_rcu() _everywhere_, including
>> under zone->lock, how's that working with you calling the
>> page allocator from it?
>
> Indeed, that is exactly the problem we are trying to solve.

Wait a moment. Why are we discussing RT-induced raw vs. non-raw lock
ordering at all?

Whatever kernel variant you look at, this is not working:

  lock(zone)  call_rcu() lock(zone)

It's a simple recursive dead lock, nothing else.

And that enforces the GFP_NOLOCK allocation mode or some other solution
unless you make a new rule that calling call_rcu() is forbidden while
holding zone lock or any other lock which might be nested inside the
GFP_NOWAIT zone::lock held region.

Thanks,

        tglx
Paul E. McKenney Aug. 14, 2020, 8:41 p.m. UTC | #80
On Fri, Aug 14, 2020 at 09:33:47PM +0200, Thomas Gleixner wrote:
> On Fri, Aug 14 2020 at 11:02, Paul E. McKenney wrote:
> > On Fri, Aug 14, 2020 at 07:49:24PM +0200, Peter Zijlstra wrote:
> >> On Fri, Aug 14, 2020 at 09:11:06AM -0700, Paul E. McKenney wrote:
> >> > Just to make sure we are talking about the same thing, please see below
> >> > for an untested patch that illustrates how I was interpreting your words.
> >> > Was this what you had in mind?
> >> 
> >> No, definitely not.
> >> 
> >> Also, since we used to be able to use call_rcu() _everywhere_, including
> >> under zone->lock, how's that working with you calling the
> >> page-allocating from it?
> >
> > Indeed, that is exactly the problem we are trying to solve.
> 
> Wait a moment. Why are we discussing RT-induced raw vs. non-raw lock
> ordering at all?

Because we like to argue?  (Sorry, couldn't resist.)

> Whatever kernel variant you look at, this is not working:
> 
>   lock(zone)  call_rcu() lock(zone)
> 
> It's a simple recursive dead lock, nothing else.

You are of course absolutely correct.

> And that enforces the GFP_NOLOCK allocation mode or some other solution
> unless you make a new rule that calling call_rcu() is forbidden while
> holding zone lock or any other lock which might be nested inside the
> GFP_NOWAIT zone::lock held region.

Again, you are correct.  Maybe the forecasted weekend heat will cause
my brain to hallucinate a better solution, but in the meantime, the
GFP_NOLOCK approach looks good from this end.

							Thanx, Paul
Peter Zijlstra Aug. 14, 2020, 9:52 p.m. UTC | #81
On Fri, Aug 14, 2020 at 01:41:40PM -0700, Paul E. McKenney wrote:
> > And that enforces the GFP_NOLOCK allocation mode or some other solution
> > unless you make a new rule that calling call_rcu() is forbidden while
> > holding zone lock or any other lock which might be nested inside the
> > GFP_NOWAIT zone::lock held region.
> 
> Again, you are correct.  Maybe the forecasted weekend heat will cause
> my brain to hallucinate a better solution, but in the meantime, the
> GFP_NOLOCK approach looks good from this end.

So I hate __GFP_NO_LOCKS for a whole number of reasons:

 - it should be called __GFP_LOCKLESS if anything
 - it sprinkles a bunch of ugly branches around the allocator fast path
 - it only works for order==0

Combined I really don't think this should be a GFP flag. How about a
special-purpose allocation function, something like so..

---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 901a21f61d68..cdec9c99fba7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4875,6 +4875,47 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
 
+struct page *__rmqueue_lockless(struct zone *zone, struct per_cpu_pages *pcp)
+{
+	struct list_head *list;
+	struct page *page;
+	int migratetype;
+
+	for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++) {
+		list = &pcp->list[migratetype];
+		page = list_first_entry_or_null(list, struct page, lru);
+		if (page && check_new_pcp(page)) {
+			list_del(&page->lru);
+			pcp->count--;
+			return page;
+		}
+	}
+
+	return NULL;
+}
+
+struct page *__alloc_page_lockless(void)
+{
+	struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
+	struct per_cpu_pages *pcp;
+	struct page *page = NULL;
+	unsigned long flags;
+	struct zoneref *z;
+	struct zone *zone;
+
+	for_each_zone_zonelist(zone, z, zonelist, ZONE_NORMAL) {
+		local_irq_save(flags);
+		pcp = &this_cpu_ptr(zone->pageset)->pcp;
+		page = __rmqueue_lockless(zone, pcp);
+		local_irq_restore(flags);
+
+		if (page)
+			break;
+	}
+
+	return page;
+}
+
 /*
  * Common helper functions. Never use with __GFP_HIGHMEM because the returned
  * address cannot represent highmem pages. Use alloc_pages and then kmap if
Thomas Gleixner Aug. 14, 2020, 11:14 p.m. UTC | #82
Paul,

On Fri, Aug 14 2020 at 11:01, Paul E. McKenney wrote:
> On Fri, Aug 14, 2020 at 04:06:04PM +0200, Michal Hocko wrote:
>> > > > Vlastimil raised the same question earlier, I answered, but let me answer again:
>> > > > 
>> > > > It is hard to achieve because the logic does not stick to certain static test
>> > > > case, i.e. it depends on how heavily kfree_rcu(single/double) are used. Based
>> > > > on that, "how heavily" - number of pages are formed, until the drain/reclaimer
>> > > > thread frees them.
>> > > 
>> > > How many pages are we talking about - ballpark? 100s, 1000s?
>> > 
>> > Under normal operation, a couple of pages per CPU, which would make
>> > preallocation entirely reasonable.  Except that if someone does something
>> > that floods RCU callbacks (close(open) in a tight userspace loop, for but
>> > one example), then 2000 per CPU might not be enough, which on a 64-CPU
>> > system comes to about 500MB.  This is beyond excessive for preallocation
>> > on the systems I am familiar with.
>> > 
>> > And the flooding case is where you most want the reclamation to be
>> > efficient, and thus where you want the pages.

As we now established that taking zone lock is impossible at all
independent of raw/non-raw ordering and independent of RT/PREEMPT
configs, can we just take a step back and look at the problem from
scratch again?

As a matter of fact I assume^Wdeclare that removing struct rcu_head which
provides a fallback is not an option at all. I know that you want to,
but it won't work ever. Dream on, but as we agreed on recently there is
this thing called reality which ruins everything.

For normal operations a couple of pages which can be preallocated are
enough. What you are concerned of is the case where you run out of
pointer storage space.

There are two reasons why that can happen:

      1) RCU call flooding
      2) RCU not being able to run and mop up the backlog

#1 is observable by looking at the remaining storage space and the RCU
   call frequency

#2 is uninteresting because it's caused by RCU being stalled / delayed
   e.g. by a runaway of some sorts or a plain RCU usage bug.
   
   Allocating more memory in that case does not solve or improve anything.

So the interesting case is #1. Which means we need to look at the
potential sources of the flooding:

    1) User space via syscalls, e.g. open/close
    2) Kernel thread
    3) Softirq
    4) Device interrupt
    5) System interrupts, deep atomic context, NMI ...

#1 trivial fix is to force switching to a high-prio thread or a soft
   interrupt which does the allocation

#2 Similar to #1 unless that thread loops with interrupts, softirqs or
   preemption disabled. If that's the case then running out of RCU
   storage space is the least of your worries.

#3 Similar to #2. The obvious candidates (e.g. NET) for monopolizing a
   CPU have loop limits in place already. If there is a bug which fails
   to care about the limit, why would RCU care and allocate more memory?

#4 Similar to #3. If the interrupt handler loops forever or if the
   interrupt is a runaway which prevents task/softirq processing then
   RCU free performance is the least of your worries.

#5 Clearly a bug and making RCU accommodate that is beyond silly.

So if call_rcu() detects that the remaining storage space for pointers
goes below the critical point or if it observes high frequency calls
then it simply should force a soft interrupt which does the allocation.

Allocating from softirq context obviously without holding the raw lock
which is used inside call_rcu() is safe on all configurations.

If call_rcu() is forced to use the fallback for a few calls until this
happens then that's not the end of the world. It is not going to be a
problem ever for the most obvious issue #1, user space madness, because
that case cannot delay the softirq processing unless there is a kernel
bug which makes again RCU free performance irrelevant.

So this will cure the problem for the most interesting case #1 and
handle all sane variants of the other possible flooding sources.

The other insane reasons do not justify any attempt to increase RCU
performance at all.

Watching the remaining storage space is good enough IMO. It clearly
covers #1 and for all others the occasional fallback won't hurt. If it
really matters for any case beyond #1 then a frequency-based prediction
is a straightforward optimization.

As usual I might be missing something, but as this is the second day
with reasonable temperatures here that would be caused by my ignorance
and not be excusable by brain usage outside of specified temperature
range.

Thanks,

        tglx
Paul E. McKenney Aug. 14, 2020, 11:27 p.m. UTC | #83
On Fri, Aug 14, 2020 at 11:52:06PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 14, 2020 at 01:41:40PM -0700, Paul E. McKenney wrote:
> > > And that enforces the GFP_NOLOCK allocation mode or some other solution
> > > unless you make a new rule that calling call_rcu() is forbidden while
> > > holding zone lock or any other lock which might be nested inside the
> > > GFP_NOWAIT zone::lock held region.
> > 
> > Again, you are correct.  Maybe the forecasted weekend heat will cause
> > my brain to hallucinate a better solution, but in the meantime, the
> > GFP_NOLOCK approach looks good from this end.
> 
> So I hate __GFP_NO_LOCKS for a whole number of reasons:
> 
>  - it should be called __GFP_LOCKLESS if anything
>  - it sprinkles a bunch of ugly branches around the allocator fast path
>  - it only works for order==0
> 
> Combined I really don't think this should be a GFP flag. How about a
> special-purpose allocation function, something like so..

This looks entirely reasonable to me!

							Thanx, Paul

> ---
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 901a21f61d68..cdec9c99fba7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4875,6 +4875,47 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
>  }
>  EXPORT_SYMBOL(__alloc_pages_nodemask);
>  
> +struct page *__rmqueue_lockless(struct zone *zone, struct per_cpu_pages *pcp)
> +{
> +	struct list_head *list;
> +	struct page *page;
> +	int migratetype;
> +
> +	for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++) {
> +		list = &pcp->list[migratetype];
> +		page = list_first_entry_or_null(list, struct page, lru);
> +		if (page && check_new_pcp(page)) {
> +			list_del(&page->lru);
> +			pcp->count--;
> +			return page;
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +struct page *__alloc_page_lockless(void)
> +{
> +	struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
> +	struct per_cpu_pages *pcp;
> +	struct page *page = NULL;
> +	unsigned long flags;
> +	struct zoneref *z;
> +	struct zone *zone;
> +
> +	for_each_zone_zonelist(zone, z, zonelist, ZONE_NORMAL) {
> +		local_irq_save(flags);
> +		pcp = &this_cpu_ptr(zone->pageset)->pcp;
> +		page = __rmqueue_lockless(zone, pcp);
> +		local_irq_restore(flags);
> +
> +		if (page)
> +			break;
> +	}
> +
> +	return page;
> +}
> +
>  /*
>   * Common helper functions. Never use with __GFP_HIGHMEM because the returned
>   * address cannot represent highmem pages. Use alloc_pages and then kmap if
Thomas Gleixner Aug. 14, 2020, 11:40 p.m. UTC | #84
On Fri, Aug 14 2020 at 23:52, Peter Zijlstra wrote:
> On Fri, Aug 14, 2020 at 01:41:40PM -0700, Paul E. McKenney wrote:
>> > And that enforces the GFP_NOLOCK allocation mode or some other solution
>> > unless you make a new rule that calling call_rcu() is forbidden while
>> > holding zone lock or any other lock which might be nested inside the
>> > GFP_NOWAIT zone::lock held region.
>> 
>> Again, you are correct.  Maybe the forecasted weekend heat will cause
>> my brain to hallucinate a better solution, but in the meantime, the
>> GFP_NOLOCK approach looks good from this end.
>
> So I hate __GFP_NO_LOCKS for a whole number of reasons:
>
>  - it should be called __GFP_LOCKLESS if anything
>  - it sprinkles a bunch of ugly branches around the allocator fast path
>  - it only works for order==0
>
> Combined I really don't think this should be a GFP flag. How about a
> special-purpose allocation function, something like so..

No. No. No.

It's not required at all after the context got set straight and my brain
started working again.

There is no need to provide such a thing which tries to "optimize"
unfixable problems and as a consequence makes other people use it for
the completely wrong reasons all over the place.

We have plenty of that stuff already. No need for more ...

Thanks,

        tglx
Paul E. McKenney Aug. 14, 2020, 11:41 p.m. UTC | #85
On Sat, Aug 15, 2020 at 01:14:53AM +0200, Thomas Gleixner wrote:
> Paul,
> 
> On Fri, Aug 14 2020 at 11:01, Paul E. McKenney wrote:
> > On Fri, Aug 14, 2020 at 04:06:04PM +0200, Michal Hocko wrote:
> >> > > > Vlastimil raised the same question earlier, I answered, but let me answer again:
> >> > > > 
> >> > > > It is hard to achieve because the logic does not stick to certain static test
> >> > > > case, i.e. it depends on how heavily kfree_rcu(single/double) are used. Based
> >> > > > on that, "how heavily" - number of pages are formed, until the drain/reclaimer
> >> > > > thread frees them.
> >> > > 
> >> > > How many pages are we talking about - ballpark? 100s, 1000s?
> >> > 
> >> > Under normal operation, a couple of pages per CPU, which would make
> >> > preallocation entirely reasonable.  Except that if someone does something
> >> > that floods RCU callbacks (close(open) in a tight userspace loop, for but
> >> > one example), then 2000 per CPU might not be enough, which on a 64-CPU
> >> > system comes to about 500MB.  This is beyond excessive for preallocation
> >> > on the systems I am familiar with.
> >> > 
> >> > And the flooding case is where you most want the reclamation to be
> >> > efficient, and thus where you want the pages.
> 
> As we now established that taking zone lock is impossible at all
> independent of raw/non-raw ordering and independent of RT/PREEMPT
> configs, can we just take a step back and look at the problem from
> scratch again?

Can't hurt!  (Famous last words...)

> As a matter of fact I assume^Wdeclare that removing struct rcu_head which
> provides a fallback is not an option at all. I know that you want to,
> but it won't work ever. Dream on, but as we agreed on recently there is
> this thing called reality which ruins everything.

For call_rcu(), agreed.  For kfree_rcu(), we know what the callback is
going to do, plus single-argument kfree_rcu() can only be invoked from
sleepable context.  (If you want to kfree_rcu() from non-sleepable
context, that will cost you an rcu_head in the data structure being
freed.)

So if the single-argument kfree_rcu() case gets hit with a
memory-allocation failure, it can fall back to waiting for a grace
period and doing the free.  Of course, grace-period waits have horrible
latency, but under OOM life is hard.  If this becomes a problem in
non-OOM situations due to the lockless caches becoming empty, we will
have to allocate memory if needed before acquiring the lock with the
usual backout logic.  Doing that means that we can let the allocator
acquire locks and maybe even do a little bit of blocking, so that the
inline grace-period-wait would only happen if the system was well and
truly OOMed.

> For normal operations a couple of pages which can be preallocated are
> enough. What you are concerned of is the case where you run out of
> pointer storage space.

Agreed.

> There are two reasons why that can happen:
> 
>       1) RCU call flooding
>       2) RCU not being able to run and mop up the backlog
> 
> #1 is observable by looking at the remaining storage space and the RCU
>    call frequency
> 
> #2 is uninteresting because it's caused by RCU being stalled / delayed
>    e.g. by a runaway of some sorts or a plain RCU usage bug.
>    
>    Allocating more memory in that case does not solve or improve anything.

Yes, #2 is instead RCU CPU stall warning territory.

If this becomes a problem, one approach is to skip the page-of-pointers
allocation if the grace period is more than (say) one second old.  If
the grace period never completes, OOM is unavoidable, but this is a way
of putting it off for a bit.

> So the interesting case is #1. Which means we need to look at the
> potential sources of the flooding:
> 
>     1) User space via syscalls, e.g. open/close
>     2) Kernel thread
>     3) Softirq
>     4) Device interrupt
>     5) System interrupts, deep atomic context, NMI ...
> 
> #1 trivial fix is to force switching to a high-prio thread or a soft
>    interrupt which does the allocation
> 
> #2 Similar to #1 unless that thread loops with interrupts, softirqs or
>    preemption disabled. If that's the case then running out of RCU
>    storage space is the least of your worries.
> 
> #3 Similar to #2. The obvious candidates (e.g. NET) for monopolizing a
>    CPU have loop limits in place already. If there is a bug which fails
>    to care about the limit, why would RCU care and allocate more memory?
> 
> #4 Similar to #3. If the interrupt handler loops forever or if the
>    interrupt is a runaway which prevents task/softirq processing then
>    RCU free performance is the least of your worries.
> 
> >> #5 Clearly a bug and making RCU accommodate that is beyond silly.
> 
> So if call_rcu() detects that the remaining storage space for pointers
> goes below the critical point or if it observes high frequency calls
> then it simply should force a soft interrupt which does the allocation.

Unless call_rcu() has been invoked with scheduler locks held.  But
eventually call_rcu() should be invoked with interrupts enabled, and at
that point it would be safe to raise_softirq(), wake_up(), or whatever.

> Allocating from softirq context obviously without holding the raw lock
> which is used inside call_rcu() is safe on all configurations.

Once we get there, agreed.

> If call_rcu() is forced to use the fallback for a few calls until this
> happens then that's not the end of the world. It is not going to be a
> problem ever for the most obvious issue #1, user space madness, because
> that case cannot delay the softirq processing unless there is a kernel
> bug which makes again RCU free performance irrelevant.
> 
> So this will cure the problem for the most interesting case #1 and
> handle all sane variants of the other possible flooding sources.
> 
> The other insane reasons do not justify any attempt to increase RCU
> performance at all.
> 
> Watching the remaining storage space is good enough IMO. It clearly
> covers #1 and for all others the occasional fallback won't hurt. If it
> really matters for any case beyond #1 then a frequency-based prediction
> is a straightforward optimization.
> 
> As usual I might be missing something, but as this is the second day
> with reasonable temperatures here that would be caused by my ignorance
> and not be excusable by brain usage outside of specified temperature
> range.

It is at the very least a new approach, so either way thank you for
that!  ;-)

							Thanx, Paul
Thomas Gleixner Aug. 15, 2020, 12:43 a.m. UTC | #86
Paul,

On Fri, Aug 14 2020 at 16:41, Paul E. McKenney wrote:
> On Sat, Aug 15, 2020 at 01:14:53AM +0200, Thomas Gleixner wrote:
>> As a matter of fact I assume^Wdeclare that removing struct rcu_head which
>> provides a fallback is not an option at all. I know that you want to,
>> but it won't work ever. Dream on, but as we agreed on recently there is
>> this thing called reality which ruins everything.
>
> For call_rcu(), agreed.  For kfree_rcu(), we know what the callback is
> going to do, plus single-argument kfree_rcu() can only be invoked from
> sleepable context.  (If you want to kfree_rcu() from non-sleepable
> context, that will cost you an rcu_head in the data structure being
> freed.)

kfree_rcu() as of today is just a convenience wrapper around
call_rcu(obj, rcu) which can be called from any context, and it still
takes TWO arguments.

Icepack?

So if you come up with a new kfree_rcu_magic(void *obj) single argument
variant which can only be called from sleepable contexts then this does
not require any of the raw lock vs. non raw hacks at all because you can
simply allocate without holding the raw lock in the rare case that you
run out of storage space. With four 4k pages preallocated per CPU that's
every 2048 invocations per CPU on 64bit.

So if you run into that situation then you drop the lock and then it's
racy because you might be preempted or migrated after dropping the lock
and you might have done a useless allocation, but that does not justify
having a special allocator just for that? You have an extra page, so
what?

To prevent subsequent callers from adding to the allocation race you can
simply let them wait on the first allocation attempt to finish. That
avoids more pointless allocations and as a side effect prevents all of
them from creating more pressure by continuing their open/close loop
naturally without extra work.

> So if the single-argument kfree_rcu() case gets hit with a
> memory-allocation failure, it can fall back to waiting for a grace
> period and doing the free.  Of course, grace-period waits have horrible
> latency, but under OOM life is hard.  If this becomes a problem in
> non-OOM situations due to the lockless caches becoming empty, we will
> have to allocate memory if needed before acquiring the lock with the
> usual backout logic.  Doing that means that we can let the allocator
> acquire locks and maybe even do a little bit of blocking, so that the
> inline grace-period-wait would only happen if the system was well and
> truly OOMed.

No. It drops the RCU internal lock and does a regular GFP_KERNEL
allocation which waits for the page to become available. Which is a good
thing in the open/close scenario because it throttles the offender.

>> For normal operations a couple of pages which can be preallocated are
>> enough. What you are concerned of is the case where you run out of
>> pointer storage space.
>
> Agreed.
>
>> There are two reasons why that can happen:
>> 
>>       1) RCU call flooding
>>       2) RCU not being able to run and mop up the backlog
>> 
>> #1 is observable by looking at the remaining storage space and the RCU
>>    call frequency
>> 
>> #2 is uninteresting because it's caused by RCU being stalled / delayed
>>    e.g. by a runaway of some sorts or a plain RCU usage bug.
>>    
>>    Allocating more memory in that case does not solve or improve anything.
>
> Yes, #2 is instead RCU CPU stall warning territory.
>
> If this becomes a problem, one approach is to skip the page-of-pointers
> allocation if the grace period is more than (say) one second old.  If
> the grace period never completes, OOM is unavoidable, but this is a way
> of putting it off for a bit.

Don't even think about optimizing your new thing for #2. It's a
pointless exercise. The task which runs into the 'can't allocate'
case then simply sleeps and waits. End of story.

>> So the interesting case is #1. Which means we need to look at the
>> potential sources of the flooding:
>> 
>>     1) User space via syscalls, e.g. open/close
>>     2) Kernel thread
>>     3) Softirq
>>     4) Device interrupt
>>     5) System interrupts, deep atomic context, NMI ...
>> 
>> #1 trivial fix is to force switching to a high-prio thread or a soft
>>    interrupt which does the allocation
>> 
>> #2 Similar to #1 unless that thread loops with interrupts, softirqs or
>>    preemption disabled. If that's the case then running out of RCU
>>    storage space is the least of your worries.
>> 
>> #3 Similar to #2. The obvious candidates (e.g. NET) for monopolizing a
>>    CPU have loop limits in place already. If there is a bug which fails
>>    to care about the limit, why would RCU care and allocate more memory?
>> 
>> #4 Similar to #3. If the interrupt handler loops forever or if the
>>    interrupt is a runaway which prevents task/softirq processing then
>>    RCU free performance is the least of your worries.
>> 
>> #5 Clearly a bug and making RCU accommodate that is beyond silly.
>> 
>> So if call_rcu() detects that the remaining storage space for pointers
>> goes below the critical point or if it observes high frequency calls
>> then it simply should force a soft interrupt which does the allocation.
>
> Unless call_rcu() has been invoked with scheduler locks held.  But
> eventually call_rcu() should be invoked with interrupts enabled, and at
> that point it would be safe to raise_softirq(), wake_up(), or
> whatever.

If this atomic context corner case is hit within a problematic context
then we talk about the RCU of today and not about the future single
argument thing. And that oldschool RCU has a fallback. We are talking
about pressure corner cases and you really want to squeeze out the last
cache miss?  What for? If there is pressure then these cache misses are
irrelevant.

Thanks,

        tglx
Paul E. McKenney Aug. 15, 2020, 3:01 a.m. UTC | #87
On Sat, Aug 15, 2020 at 02:43:51AM +0200, Thomas Gleixner wrote:
> Paul,
> 
> On Fri, Aug 14 2020 at 16:41, Paul E. McKenney wrote:
> > On Sat, Aug 15, 2020 at 01:14:53AM +0200, Thomas Gleixner wrote:
> >> As a matter of fact I assume^Wdeclare that removing struct rcu_head which
> >> provides a fallback is not an option at all. I know that you want to,
> >> but it won't work ever. Dream on, but as we agreed on recently there is
> >> this thing called reality which ruins everything.
> >
> > For call_rcu(), agreed.  For kfree_rcu(), we know what the callback is
> > going to do, plus single-argument kfree_rcu() can only be invoked from
> > sleepable context.  (If you want to kfree_rcu() from non-sleepable
> > context, that will cost you an rcu_head in the data structure being
> > freed.)
> 
> kfree_rcu() as of today is just a convenience wrapper around
> call_rcu(obj, rcu) which can be called from any context, and it still
> takes TWO arguments.
> 
> Icepack?

Indeed.  Make that not kfree_rcu(), but rather kvfree_rcu(), which is
in mainline.  :-/

> So if you come up with a new kfree_rcu_magic(void *obj) single argument
> variant which can only be called from sleepable contexts then this does
> not require any of the raw lock vs. non raw hacks at all because you can
> simply allocate without holding the raw lock in the rare case that you
> run out of storage space. With four 4k pages preallocated per CPU that's
> every 2048 invocations per CPU on 64bit.
> 
> So if you run into that situation then you drop the lock and then it's
> racy because you might be preempted or migrated after dropping the lock
> and you might have done a useless allocation, but that does not justify
> having a special allocator just for that? You have an extra page, so
> what?
> 
> To prevent subsequent callers from adding to the allocation race you can
> simply let them wait on the first allocation attempt to finish. That
> avoids more pointless allocations and as a side effect prevents all of
> them from creating more pressure by continuing their open/close loop
> naturally without extra work.

Agreed, as I said, it is the double-argument version that is the
challenge.

> > So if the single-argument kfree_rcu() case gets hit with a
> > memory-allocation failure, it can fall back to waiting for a grace
> > period and doing the free.  Of course, grace-period waits have horrible
> > latency, but under OOM life is hard.  If this becomes a problem in
> > non-OOM situations due to the lockless caches becoming empty, we will
> > have to allocate memory if needed before acquiring the lock with the
> > usual backout logic.  Doing that means that we can let the allocator
> > acquire locks and maybe even do a little bit of blocking, so that the
> > inline grace-period-wait would only happen if the system was well and
> > truly OOMed.
> 
> No. It drops the RCU internal lock and does a regular GFP_KERNEL
> allocation which waits for the page to become available. Which is a good
> thing in the open/close scenario because it throttles the offender.

Understood, especially that last.  But it really doesn't want to be
waiting in the memory allocator for more than a grace period.  But that
was hashed out quite some time ago, and there is a combination of GFP_*
flags that achieves the right balance for the can-sleep situation.

> >> For normal operations a couple of pages which can be preallocated are
> >> enough. What you are concerned of is the case where you run out of
> >> pointer storage space.
> >
> > Agreed.
> >
> >> There are two reasons why that can happen:
> >> 
> >>       1) RCU call flooding
> >>       2) RCU not being able to run and mop up the backlog
> >> 
> >> #1 is observable by looking at the remaining storage space and the RCU
> >>    call frequency
> >> 
> >> #2 is uninteresting because it's caused by RCU being stalled / delayed
> >>    e.g. by a runaway of some sorts or a plain RCU usage bug.
> >>    
> >>    Allocating more memory in that case does not solve or improve anything.
> >
> > Yes, #2 is instead RCU CPU stall warning territory.
> >
> > If this becomes a problem, one approach is to skip the page-of-pointers
> > allocation if the grace period is more than (say) one second old.  If
> > the grace period never completes, OOM is unavoidable, but this is a way
> > of putting it off for a bit.
> 
> Don't even think about optimizing your new thing for #2. It's a
> pointless exercise. The task which runs into the 'can't allocate'
> case then simply sleeps and waits. End of story.

Agreed, and hence my "If this becomes a problem".  Until such time,
it is pointless.  For one thing, we don't yet know the failure mode.
But it has been helpful for me to think a move or two ahead when
playing against RCU, hence the remainder of my paragraph.

> >> So the interesting case is #1. Which means we need to look at the
> >> potential sources of the flooding:
> >> 
> >>     1) User space via syscalls, e.g. open/close
> >>     2) Kernel thread
> >>     3) Softirq
> >>     4) Device interrupt
> >>     5) System interrupts, deep atomic context, NMI ...
> >> 
> >> #1 trivial fix is to force switching to a high-prio thread or a soft
> >>    interrupt which does the allocation
> >> 
> >> #2 Similar to #1 unless that thread loops with interrupts, softirqs or
> >>    preemption disabled. If that's the case then running out of RCU
> >>    storage space is the least of your worries.
> >> 
> >> #3 Similar to #2. The obvious candidates (e.g. NET) for monopolizing a
> >>    CPU have loop limits in place already. If there is a bug which fails
> >>    to care about the limit, why would RCU care and allocate more memory?
> >> 
> >> #4 Similar to #3. If the interrupt handler loops forever or if the
> >>    interrupt is a runaway which prevents task/softirq processing then
> >>    RCU free performance is the least of your worries.
> >> 
> >> #5 Clearly a bug and making RCU accommodate that is beyond silly.
> >> 
> >> So if call_rcu() detects that the remaining storage space for pointers
> >> goes below the critical point or if it observes high frequency calls
> >> then it simply should force a soft interrupt which does the allocation.
> >
> > Unless call_rcu() has been invoked with scheduler locks held.  But
> > eventually call_rcu() should be invoked with interrupts enabled, and at
> > that point it would be safe to raise_softirq(), wake_up(), or
> > whatever.
> 
> If this atomic context corner case is hit within a problematic context
> then we talk about the RCU of today and not about the future single
> argument thing. And that oldschool RCU has a fallback. We are talking
> about pressure corner cases and you really want to squeeze out the last
> cache miss?  What for? If there is pressure then these cache misses are
> irrelevant.

Of course.  My point was instead that even this atomic corner case was
likely to have escape routes in the form of occasional non-atomic calls,
and that these could do the wakeups.

Again, thank you.

							Thanx, Paul
Peter Zijlstra Aug. 15, 2020, 8:27 a.m. UTC | #88
On Sat, Aug 15, 2020 at 01:14:53AM +0200, Thomas Gleixner wrote:
> 
> As a matter of fact I assume^Wdeclare that removing struct rcu_head which
> provides a fallback is not an option at all. I know that you want to,
> but it won't work ever. Dream on, but as we agreed on recently there is
> this thing called reality which ruins everything.

It never was going to work, freeing memory can never hard rely on the
success of allocating memory.
Peter Zijlstra Aug. 15, 2020, 8:42 a.m. UTC | #89
On Sat, Aug 15, 2020 at 01:14:53AM +0200, Thomas Gleixner wrote:

> #1 trivial fix is to force switching to a high-prio thread or a soft
>    interrupt which does the allocation

Yeah, push the allocation out to another context. I did consider it, but
why bother?

Also, raising a softirq can't be done from every context, that's a whole
new problem. You can do irq_work I suppose, but not all architectures
support the self-IPI yet.

All in all, it's just more complexity than the fairly trivial
__alloc_page_lockless().

Whichever way around, we can't rely on the allocation.
Paul E. McKenney Aug. 15, 2020, 1:03 p.m. UTC | #90
On Sat, Aug 15, 2020 at 10:27:54AM +0200, Peter Zijlstra wrote:
> On Sat, Aug 15, 2020 at 01:14:53AM +0200, Thomas Gleixner wrote:
> > 
> > As a matter of fact I assume^Wdeclare that removing struct rcu_head which
> > provides a fallback is not an option at all. I know that you want to,
> but it won't ever work. Dream on, but as we agreed on recently there is
> > this thing called reality which ruins everything.
> 
> It was never going to work; freeing memory can never hard-rely on the
> success of allocating memory.

In neither case does the freeing of memory hard-rely on the success
of allocating memory.  This is because there is a fallback in both cases
should allocation fail.

Given an rcu_head structure, we use that, and accept the extra cache
misses at callback-invocation time.  Otherwise, without an rcu_head
structure, the allocation parameters are carefully chosen to avoid
indefinite sleeping, meaning that the allocation attempt either succeeds
or fails within a reasonable amount of time.  And upon failure we invoke
synchronize_rcu(), then immediately free.  Which is slow, but then again
life is like that under OOM conditions.

And yes, this means that the price of leaving the rcu_head structure out
of the structure to be freed is that you must call kvfree_rcu() from
a sleepable context.  If you don't like being restricted to sleepable
context, you can always supply the rcu_head structure.

							Thanx, Paul
Paul E. McKenney Aug. 15, 2020, 2:18 p.m. UTC | #91
On Sat, Aug 15, 2020 at 10:42:50AM +0200, Peter Zijlstra wrote:
> On Sat, Aug 15, 2020 at 01:14:53AM +0200, Thomas Gleixner wrote:
> 
> > #1 trivial fix is to force switching to a high-prio thread or a soft
> >    interrupt which does the allocation
> 
> Yeah, push the allocation out to another context. I did consider it, but
> why bother?
> 
> Also, raising a softirq can't be done from every context, that's a whole
> new problem. You can do irq_work I suppose, but not all architectures
> support the self-IPI yet.
> 
> All in all, it's just more complexity than the fairly trivial
> __alloc_page_lockless().
> 
> Whichever way around, we can't rely on the allocation.

One way to enforce that would be to put something like this at the
beginning of the __alloc_page_lockless() function:

	if (IS_ENABLED(CONFIG_PROVE_LOCKING) && (prandom_u32() & 0xffff))
		return NULL;

I am sure that there is a better choice than CONFIG_PROVE_LOCKING.
But whatever the choice, there is nothing quite like the occasional
allocation failure during testing to convince people that such failure
really can happen.

							Thanx, Paul
Paul E. McKenney Aug. 15, 2020, 2:23 p.m. UTC | #92
On Sat, Aug 15, 2020 at 07:18:39AM -0700, Paul E. McKenney wrote:
> On Sat, Aug 15, 2020 at 10:42:50AM +0200, Peter Zijlstra wrote:
> > On Sat, Aug 15, 2020 at 01:14:53AM +0200, Thomas Gleixner wrote:
> > 
> > > #1 trivial fix is to force switching to a high-prio thread or a soft
> > >    interrupt which does the allocation
> > 
> > Yeah, push the allocation out to another context. I did consider it, but
> > why bother?
> > 
> > Also, raising a softirq can't be done from every context, that's a whole
> > new problem. You can do irq_work I suppose, but not all architectures
> > support the self-IPI yet.
> > 
> > All in all, it's just more complexity than the fairly trivial
> > __alloc_page_lockless().
> > 
> > Whichever way around, we can't rely on the allocation.
> 
> One way to enforce that would be to put something like this at the
> beginning of the __alloc_page_lockless() function:
> 
> 	if (IS_ENABLED(CONFIG_PROVE_LOCKING) && (prandom_u32() & 0xffff))
> 		return NULL;

Right, too early in the morning.  :-/

This "slight" variation might include a bit of usefulness along with
the convincing:

	if (IS_ENABLED(CONFIG_PROVE_LOCKING) && !(prandom_u32() & 0xff))
		return NULL;

Plus failing one out of 256 times is likely a better choice than once
out of 65536 times, especially for the occasional caller of this
function.

							Thanx, Paul

> I am sure that there is a better choice than CONFIG_PROVE_LOCKING.
> But whatever the choice, there is nothing quite like the occasional
> allocation failure during testing to convince people that such failure
> really can happen.
> 
> 							Thanx, Paul
Uladzislau Rezki Aug. 16, 2020, 10:56 p.m. UTC | #93
On Fri, Aug 14, 2020 at 11:52:06PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 14, 2020 at 01:41:40PM -0700, Paul E. McKenney wrote:
> > > And that enforces the GFP_NOLOCK allocation mode or some other solution
> > > unless you make a new rule that calling call_rcu() is forbidden while
> > > holding zone lock or any other lock which might be nested inside the
> > > GFP_NOWAIT zone::lock held region.
> > 
> > Again, you are correct.  Maybe the forecasted weekend heat will cause
> > my brain to hallucinate a better solution, but in the meantime, the
> > GFP_NOLOCK approach looks good from this end.
> 
> So I hate __GFP_NO_LOCKS for a whole number of reasons:
> 
>  - it should be called __GFP_LOCKLESS if anything
>  - it sprinkles a bunch of ugly branches around the allocator fast path
>  - it only works for order==0
> 
I had a look at your proposal, which is below. The underlying logic stays
almost the same as what has been proposed by this RFC. I mean, I do not
see any fundamental difference: your approach does exactly the same thing, providing
lock-less access to the per-cpu lists. I am not talking about implementation
details and further improvements, like also doing a search over the zonelist -> ZONE_NORMAL.

Also, please note that the patch was tagged as RFC.

>
> Combined I really don't think this should be a GFP flag. How about a
> special purpose allocation function, something like so..
> 
I agree with you. Also, I think Michal does not like the GFP flag, since it introduces
more complexity to the page allocator. So, providing lock-less access as a separate
function is the better approach, IMHO.

Michal asked me to provide some data regarding how many pages we need and how
the "lockless allocation" behaves when it comes to success vs. failure scenarios.

Please see below some results. The test case is a tight loop of 1,000,000 allocations
doing kmalloc() and kvfree_rcu():

sudo ./test_vmalloc.sh run_test_mask=2048 single_cpu_test=1

<snip>
 for (i = 0; i < 1000000; i++) {
  p = kmalloc(sizeof(*p), GFP_KERNEL);
  if (!p)
   return -1;

  p->array[0] = 'a';
  kvfree_rcu(p, rcu);
 }
<snip>

wget ftp://vps418301.ovh.net/incoming/1000000_kmalloc_kfree_rcu_proc_percpu_pagelist_fractio_is_0.png
wget ftp://vps418301.ovh.net/incoming/1000000_kmalloc_kfree_rcu_proc_percpu_pagelist_fractio_is_8.png

Also, I would like to underline that the kfree_rcu() reclaim logic can be improved further,
making the drain logic more time-efficient, thus reducing the footprint and,
as a result, the number of required pages.

--
Vlad Rezki
Michal Hocko Aug. 17, 2020, 8:28 a.m. UTC | #94
On Mon 17-08-20 00:56:55, Uladzislau Rezki wrote:
[...]
> Michal asked me to provide some data regarding how many pages we need and how
> the "lockless allocation" behaves when it comes to success vs. failure scenarios.
> 
> Please see below some results. The test case is a tight loop of 1,000,000 allocations
> doing kmalloc() and kvfree_rcu():

It would be nice to cover some more realistic workloads as well.

> sudo ./test_vmalloc.sh run_test_mask=2048 single_cpu_test=1
> 
> <snip>
>  for (i = 0; i < 1000000; i++) {
>   p = kmalloc(sizeof(*p), GFP_KERNEL);
>   if (!p)
>    return -1;
> 
>   p->array[0] = 'a';
>   kvfree_rcu(p, rcu);
>  }
> <snip>
> 
> wget ftp://vps418301.ovh.net/incoming/1000000_kmalloc_kfree_rcu_proc_percpu_pagelist_fractio_is_0.png

If I understand this correctly then this means that failures happen very
often because pcp pages are not recycled quickly enough.

> wget ftp://vps418301.ovh.net/incoming/1000000_kmalloc_kfree_rcu_proc_percpu_pagelist_fractio_is_8.png

1/8 of the memory in pcp lists is quite large and likely not something
used very often.

Both these numbers just make me think that a dedicated pool of page
pre-allocated for RCU specifically might be a better solution. I still
haven't read through that branch of the email thread though so there
might be some pretty convincing arguments to not do that.

> Also, I would like to underline that the kfree_rcu() reclaim logic can be improved further,
> making the drain logic more time-efficient, thus reducing the footprint and,
> as a result, the number of required pages.
> 
> --
> Vlad Rezki
Michal Hocko Aug. 17, 2020, 8:47 a.m. UTC | #95
On Sat 15-08-20 01:14:53, Thomas Gleixner wrote:
[...]
> For normal operations a couple of pages which can be preallocated are
> enough. What you are concerned of is the case where you run out of
> pointer storage space.
> 
> There are two reasons why that can happen:
> 
>       1) RCU call flooding
>       2) RCU not being able to run and mop up the backlog
> 
> #1 is observable by looking at the remaining storage space and the RCU
>    call frequency
> 
> #2 is uninteresting because it's caused by RCU being stalled / delayed
>    e.g. by a runaway of some sorts or a plain RCU usage bug.
>    
>    Allocating more memory in that case does not solve or improve anything.
> 
> So the interesting case is #1. Which means we need to look at the
> potential sources of the flooding:
> 
>     1) User space via syscalls, e.g. open/close
>     2) Kernel thread
>     3) Softirq
>     4) Device interrupt
>     5) System interrupts, deep atomic context, NMI ...
> 
> #1 trivial fix is to force switching to a high-prio thread or a soft
>    interrupt which does the allocation
> 
> #2 Similar to #1 unless that thread loops with interrupts, softirqs or
>    preemption disabled. If that's the case then running out of RCU
>    storage space is the least of your worries.
> 
> #3 Similar to #2. The obvious candidates (e.g. NET) for monopolizing a
>    CPU have loop limits in place already. If there is a bug which fails
>    to care about the limit, why would RCU care and allocate more memory?
> 
> #4 Similar to #3. If the interrupt handler loops forever or if the
>    interrupt is a runaway which prevents task/softirq processing then
>    RCU free performance is the least of your worries.
> 
> #5 Clearly a bug and making RCU accommodate that is beyond silly.
> 
> So if call_rcu() detects that the remaining storage space for pointers
> goes below the critical point or if it observes high frequency calls
> then it simply should force a soft interrupt which does the allocation.
>
> Allocating from softirq context obviously without holding the raw lock
> which is used inside call_rcu() is safe on all configurations.
> 
> If call_rcu() is forced to use the fallback for a few calls until this
> happens then that's not the end of the world. It is not going to be a
> problem ever for the most obvious issue #1, user space madness, because
> that case cannot delay the softirq processing unless there is a kernel
> bug which makes again RCU free performance irrelevant.

Yes, this makes perfect sense to me! I really do not think we want to
optimize for a userspace abuse by allowing complete depletion of the pcp
allocator memory (or, in a worse case, control over it).

Thanks!
Uladzislau Rezki Aug. 17, 2020, 10:36 a.m. UTC | #96
On Mon, Aug 17, 2020 at 10:28:49AM +0200, Michal Hocko wrote:
> On Mon 17-08-20 00:56:55, Uladzislau Rezki wrote:
> [...]
> > Michal asked me to provide some data regarding how many pages we need and how
> > the "lockless allocation" behaves when it comes to success vs. failure scenarios.
> > 
> > Please see below some results. The test case is a tight loop of 1,000,000 allocations
> > doing kmalloc() and kvfree_rcu():
> 
> It would be nice to cover some more realistic workloads as well.
> 
Hmm... I tried to show a synthetic worst case, when a "flood" occurs.
In such conditions we can get failures, which is to be expected, and we
have a fallback mechanism for that.

> > sudo ./test_vmalloc.sh run_test_mask=2048 single_cpu_test=1
> > 
> > <snip>
> >  for (i = 0; i < 1000000; i++) {
> >   p = kmalloc(sizeof(*p), GFP_KERNEL);
> >   if (!p)
> >    return -1;
> > 
> >   p->array[0] = 'a';
> >   kvfree_rcu(p, rcu);
> >  }
> > <snip>
> > 
> > wget ftp://vps418301.ovh.net/incoming/1000000_kmalloc_kfree_rcu_proc_percpu_pagelist_fractio_is_0.png
> 
> If I understand this correctly then this means that failures happen very
> often because pcp pages are not recycled quickly enough.
> 
Yep, it happens, and that is the kind of worst-case scenario (the flood one). Therefore
we have a fallback, and failures are to be expected. Also, I did not provide the number
of pages used in the loop. On my test machine we need approximately 300-400 pages to
cover that flood case until we recycle the pages or return them back to the pcp.

Please note, as I mentioned before, our drain part is certainly not optimal,
which means we can rework it a bit to make it more efficient. For example,
when a flood occurs, instead of delaying the "reclaimer logic" thread, it can be
placed on a run-queue right away. We could use a separate "flush workqueue"
tagged with WQ_MEM_RECLAIM, raising the priority of the drain context.

I.e., there is room for reducing the page footprint.

> > wget ftp://vps418301.ovh.net/incoming/1000000_kmalloc_kfree_rcu_proc_percpu_pagelist_fractio_is_8.png
> 
> 1/8 of the memory in pcp lists is quite large and likely not something
> used very often.
> 
Just for illustration: when percpu_pagelist_fraction is set to 8, I do
not see any page allocation failures in the single-CPU flood case. If I run
such a flood simultaneously on all available CPUs, there will be failures for sure.

> Both these numbers just make me think that a dedicated pool of page
> pre-allocated for RCU specifically might be a better solution. I still
> haven't read through that branch of the email thread though so there
> might be some pretty convincing arguments to not do that.
> 
> > Also i would like to underline, that kfree_rcu() reclaim logic can be improved further,
> > making the drain logic more efficient when it comes to time, thus to reduce a footprint
> > as a result number of required pages.
> > 
> > --
> > Vlad Rezki
> 
> -- 
> Michal Hocko
> SUSE Labs
Paul E. McKenney Aug. 17, 2020, 10:28 p.m. UTC | #97
On Mon, Aug 17, 2020 at 10:28:49AM +0200, Michal Hocko wrote:
> On Mon 17-08-20 00:56:55, Uladzislau Rezki wrote:

[ . . . ]

> > wget ftp://vps418301.ovh.net/incoming/1000000_kmalloc_kfree_rcu_proc_percpu_pagelist_fractio_is_8.png
> 
> 1/8 of the memory in pcp lists is quite large and likely not something
> used very often.
> 
> Both these numbers just make me think that a dedicated pool of page
> pre-allocated for RCU specifically might be a better solution. I still
> haven't read through that branch of the email thread though so there
> might be some pretty convincing arguments to not do that.

To avoid the problematic corner cases, we would need way more dedicated
memory than is reasonable, as in well over one hundred pages per CPU.
Sure, we could choose a smaller number, but then we are failing to defend
against flooding, even on systems that have more than enough free memory
to be able to do so.  It would be better to live within what is available,
taking the performance/robustness hit only if there isn't enough.

My current belief is that we need a combination of (1) either the
GFP_NOLOCK flag or Peter Zijlstra's patch and (2) Thomas Gleixner's
delayed allocation approach.  As you say and as Uladzislau's measurements
suggest, if we only have Peter's approach, although we could handle short
floods just fine, we could not handle longer-term floods.  And Thomas's
approach is in fact designed to handle these longer-term floods.

Except that if we have only Thomas's approach, then we have to handle the
possibility that RCU_SOFTIRQ happened on the back of an interrupt that
happened while the interrupted process was holding a memory-allocator
lock.  This means further deferral, such as going into a workqueue,
which would allow better memory-allocator results, but which would also
mean further delays from the time that memory was needed until the time
that it was actually supplied.  Delays that could be bridged by either
a GFP_NOLOCK flag or Peter's patch.

So again, it looks like this is not an either/or situation, but rather
a need-both situation.

I freely confess that one of my hopes almost 30 years ago was that a
high-quality parallel memory allocator would eliminate the need for
special-purpose allocators, but as has been noted several times on this
thread, reality does not always seem to be compelled to take such
hopes into account.

							Thanx, Paul
Michal Hocko Aug. 18, 2020, 7:43 a.m. UTC | #98
On Mon 17-08-20 15:28:03, Paul E. McKenney wrote:
> On Mon, Aug 17, 2020 at 10:28:49AM +0200, Michal Hocko wrote:
> > On Mon 17-08-20 00:56:55, Uladzislau Rezki wrote:
> 
> [ . . . ]
> 
> > > wget ftp://vps418301.ovh.net/incoming/1000000_kmalloc_kfree_rcu_proc_percpu_pagelist_fractio_is_8.png
> > 
> > 1/8 of the memory in pcp lists is quite large and likely not something
> > used very often.
> > 
> > Both these numbers just make me think that a dedicated pool of page
> > pre-allocated for RCU specifically might be a better solution. I still
> > haven't read through that branch of the email thread though so there
> > might be some pretty convincing arguments to not do that.
> 
> To avoid the problematic corner cases, we would need way more dedicated
> memory than is reasonable, as in well over one hundred pages per CPU.
> Sure, we could choose a smaller number, but then we are failing to defend
> against flooding, even on systems that have more than enough free memory
> to be able to do so.  It would be better to live within what is available,
> taking the performance/robustness hit only if there isn't enough.

Thomas had a good point that it doesn't really make much sense to
optimize for flooders because that just makes them more effective.

> My current belief is that we need a combination of (1) either the
> GFP_NOLOCK flag or Peter Zijlstra's patch and

I must have missed the patch?
Paul E. McKenney Aug. 18, 2020, 1:53 p.m. UTC | #99
On Tue, Aug 18, 2020 at 09:43:44AM +0200, Michal Hocko wrote:
> On Mon 17-08-20 15:28:03, Paul E. McKenney wrote:
> > On Mon, Aug 17, 2020 at 10:28:49AM +0200, Michal Hocko wrote:
> > > On Mon 17-08-20 00:56:55, Uladzislau Rezki wrote:
> > 
> > [ . . . ]
> > 
> > > > wget ftp://vps418301.ovh.net/incoming/1000000_kmalloc_kfree_rcu_proc_percpu_pagelist_fractio_is_8.png
> > > 
> > > 1/8 of the memory in pcp lists is quite large and likely not something
> > > used very often.
> > > 
> > > Both these numbers just make me think that a dedicated pool of page
> > > pre-allocated for RCU specifically might be a better solution. I still
> > > haven't read through that branch of the email thread though so there
> > > might be some pretty convincing arguments to not do that.
> > 
> > To avoid the problematic corner cases, we would need way more dedicated
> > memory than is reasonable, as in well over one hundred pages per CPU.
> > Sure, we could choose a smaller number, but then we are failing to defend
> > against flooding, even on systems that have more than enough free memory
> > to be able to do so.  It would be better to live within what is available,
> > taking the performance/robustness hit only if there isn't enough.
> 
> Thomas had a good point that it doesn't really make much sense to
> optimize for flooders because that just makes them more effective.

The point is not to make the flooders go faster, but rather for the
system to be robust in the face of flooders.  Robust as in harder for
a flooder to OOM the system.

And reducing the number of post-grace-period cache misses makes it
easier for the callback-invocation-time memory freeing to keep up with
the flooder, thus avoiding (or at least delaying) the OOM.

> > My current belief is that we need a combination of (1) either the
> > GFP_NOLOCK flag or Peter Zijlstra's patch and
> 
> I must have missed the patch?

If I am keeping track, this one:

https://lore.kernel.org/lkml/20200814215206.GL3982@worktop.programming.kicks-ass.net/

							Thanx, Paul
Thomas Gleixner Aug. 18, 2020, 2:43 p.m. UTC | #100
On Tue, Aug 18 2020 at 06:53, Paul E. McKenney wrote:
> On Tue, Aug 18, 2020 at 09:43:44AM +0200, Michal Hocko wrote:
>> Thomas had a good point that it doesn't really make much sense to
>> optimize for flooders because that just makes them more effective.
>
> The point is not to make the flooders go faster, but rather for the
> system to be robust in the face of flooders.  Robust as in harder for
> a flooder to OOM the system.
>
> And reducing the number of post-grace-period cache misses makes it
> easier for the callback-invocation-time memory freeing to keep up with
> the flooder, thus avoiding (or at least delaying) the OOM.

Throttling the flooder is increasing robustness far more than reducing
cache misses.

Thanks,

        tglx
Michal Hocko Aug. 18, 2020, 3:02 p.m. UTC | #101
On Tue 18-08-20 06:53:27, Paul E. McKenney wrote:
> On Tue, Aug 18, 2020 at 09:43:44AM +0200, Michal Hocko wrote:
> > On Mon 17-08-20 15:28:03, Paul E. McKenney wrote:
> > > On Mon, Aug 17, 2020 at 10:28:49AM +0200, Michal Hocko wrote:
> > > > On Mon 17-08-20 00:56:55, Uladzislau Rezki wrote:
> > > 
> > > [ . . . ]
> > > 
> > > > > wget ftp://vps418301.ovh.net/incoming/1000000_kmalloc_kfree_rcu_proc_percpu_pagelist_fractio_is_8.png
> > > > 
> > > > 1/8 of the memory in pcp lists is quite large and likely not something
> > > > used very often.
> > > > 
> > > > Both these numbers just make me think that a dedicated pool of page
> > > > pre-allocated for RCU specifically might be a better solution. I still
> > > > haven't read through that branch of the email thread though so there
> > > > might be some pretty convincing arguments to not do that.
> > > 
> > > To avoid the problematic corner cases, we would need way more dedicated
> > > memory than is reasonable, as in well over one hundred pages per CPU.
> > > Sure, we could choose a smaller number, but then we are failing to defend
> > > against flooding, even on systems that have more than enough free memory
> > > to be able to do so.  It would be better to live within what is available,
> > > taking the performance/robustness hit only if there isn't enough.
> > 
> > Thomas had a good point that it doesn't really make much sense to
> > optimize for flooders because that just makes them more effective.
> 
> The point is not to make the flooders go faster, but rather for the
> system to be robust in the face of flooders.  Robust as in harder for
> a flooder to OOM the system.

Do we see this as a practical problem? I am really confused because
the initial argument revolved around an optimization; now you are
suggesting that this is actually a system stability measure. And I fail to
see how allowing an easy way to completely deplete pcp caches solves
any of that. Please do realize that if we allow that then every user who
relies on pcp caches will have to take a slow(er) path and that will
have performance consequences. The pool is a global and a scarce
resource. That's why I've suggested a dedicated preallocated pool and
use it instead of draining global pcp caches.

> And reducing the number of post-grace-period cache misses makes it
> easier for the callback-invocation-time memory freeing to keep up with
> the flooder, thus avoiding (or at least delaying) the OOM.
> 
> > > My current belief is that we need a combination of (1) either the
> > > GFP_NOLOCK flag or Peter Zijlstra's patch and
> > 
> > I must have missed the patch?
> 
> If I am keeping track, this one:
> 
> https://lore.kernel.org/lkml/20200814215206.GL3982@worktop.programming.kicks-ass.net/

OK, I had certainly noticed that one but didn't react; my response
would be similar to that for the dedicated gfp flag. This is less of a hack than
__GFP_NOLOCK, but it still exposes very internal parts of the allocator,
and I find that quite problematic for the future maintenance of the
allocator. The risk of an easy depletion of the pcp pool is also there,
of course.
Uladzislau Rezki Aug. 18, 2020, 3:45 p.m. UTC | #102
Hello, Michal.

You asked somewhere in the thread for figures regarding how often we hit
the fast path versus the "fallback" one. We take the fallback path when a page allocation fails.

Please see below the plot. I hope it is easy to understand:

wget ftp://vps418301.ovh.net/incoming/1000000_kfree_rcu_fast_hit_vs_fallback_hit.png

To summarize: when a tight loop is applied, i.e. a flood simulation, the fallback hit
is negligible. It is noise; out of 1,000,000 allocations we see 1% or 2% hitting the fallback.

Thanks!

--
Vlad Rezki
Paul E. McKenney Aug. 18, 2020, 4:13 p.m. UTC | #103
On Tue, Aug 18, 2020 at 04:43:14PM +0200, Thomas Gleixner wrote:
> On Tue, Aug 18 2020 at 06:53, Paul E. McKenney wrote:
> > On Tue, Aug 18, 2020 at 09:43:44AM +0200, Michal Hocko wrote:
> >> Thomas had a good point that it doesn't really make much sense to
> >> optimize for flooders because that just makes them more effective.
> >
> > The point is not to make the flooders go faster, but rather for the
> > system to be robust in the face of flooders.  Robust as in harder for
> > a flooder to OOM the system.
> >
> > And reducing the number of post-grace-period cache misses makes it
> > easier for the callback-invocation-time memory freeing to keep up with
> > the flooder, thus avoiding (or at least delaying) the OOM.
> 
> Throttling the flooder is increasing robustness far more than reducing
> cache misses.

True, but it takes time to identify a flooding event that needs to be
throttled (as in milliseconds).  This time cannot be made up.

And in the absence of a flooding event, the last thing you want to do
is to throttle call_rcu(), kfree_rcu(), and kvfree_rcu().

							Thanx, Paul
Paul E. McKenney Aug. 18, 2020, 4:18 p.m. UTC | #104
On Tue, Aug 18, 2020 at 05:02:32PM +0200, Michal Hocko wrote:
> On Tue 18-08-20 06:53:27, Paul E. McKenney wrote:
> > On Tue, Aug 18, 2020 at 09:43:44AM +0200, Michal Hocko wrote:
> > > On Mon 17-08-20 15:28:03, Paul E. McKenney wrote:
> > > > On Mon, Aug 17, 2020 at 10:28:49AM +0200, Michal Hocko wrote:
> > > > > On Mon 17-08-20 00:56:55, Uladzislau Rezki wrote:
> > > > 
> > > > [ . . . ]
> > > > 
> > > > > > wget ftp://vps418301.ovh.net/incoming/1000000_kmalloc_kfree_rcu_proc_percpu_pagelist_fractio_is_8.png
> > > > > 
> > > > > 1/8 of the memory in pcp lists is quite large and likely not something
> > > > > used very often.
> > > > > 
> > > > > Both these numbers just make me think that a dedicated pool of page
> > > > > pre-allocated for RCU specifically might be a better solution. I still
> > > > > haven't read through that branch of the email thread though so there
> > > > > might be some pretty convincing arguments to not do that.
> > > > 
> > > > To avoid the problematic corner cases, we would need way more dedicated
> > > > memory than is reasonable, as in well over one hundred pages per CPU.
> > > > Sure, we could choose a smaller number, but then we are failing to defend
> > > > against flooding, even on systems that have more than enough free memory
> > > > to be able to do so.  It would be better to live within what is available,
> > > > taking the performance/robustness hit only if there isn't enough.
> > > 
> > > Thomas had a good point that it doesn't really make much sense to
> > > optimize for flooders because that just makes them more effective.
> > 
> > The point is not to make the flooders go faster, but rather for the
> > system to be robust in the face of flooders.  Robust as in harder for
> > a flooder to OOM the system.
> 
> Do we see this as a practical problem? I am really confused because
> the initial argument revolved around an optimization; now you are
> suggesting that this is actually a system stability measure. And I fail to
> see how allowing an easy way to completely deplete pcp caches solves
> any of that. Please do realize that if we allow that then every user who
> relies on pcp caches will have to take a slow(er) path and that will
> have performance consequences. The pool is a global and a scarce
> resource. That's why I've suggested a dedicated preallocated pool and
> use it instead of draining global pcp caches.

Both the optimization and the robustness are important.

The problem with this thing is that I have to start describing it
somewhere, and I have not yet come up with a description of the whole
thing that isn't TL;DR.

> > And reducing the number of post-grace-period cache misses makes it
> > easier for the callback-invocation-time memory freeing to keep up with
> > the flooder, thus avoiding (or at least delaying) the OOM.
> > 
> > > > My current belief is that we need a combination of (1) either the
> > > > GFP_NOLOCK flag or Peter Zijlstra's patch and
> > > 
> > > I must have missed the patch?
> > 
> > If I am keeping track, this one:
> > 
> > https://lore.kernel.org/lkml/20200814215206.GL3982@worktop.programming.kicks-ass.net/
> 
> OK, I have certainly noticed that one but didn't react but my response
> would be similar to the dedicated gfp flag. This is less of a hack than
> __GFP_NOLOCK  but it still exposes very internal parts of the allocator
> and I find that a quite problematic from the future maintenance of the
> allocator. The risk of an easy depletion of the pcp pool is also there
> of course.

I had to ask.  ;-)

							Thanx, Paul
Thomas Gleixner Aug. 18, 2020, 4:55 p.m. UTC | #105
On Tue, Aug 18 2020 at 09:13, Paul E. McKenney wrote:
> On Tue, Aug 18, 2020 at 04:43:14PM +0200, Thomas Gleixner wrote:
>> On Tue, Aug 18 2020 at 06:53, Paul E. McKenney wrote:
>> > On Tue, Aug 18, 2020 at 09:43:44AM +0200, Michal Hocko wrote:
>> >> Thomas had a good point that it doesn't really make much sense to
>> >> optimize for flooders because that just makes them more effective.
>> >
>> > The point is not to make the flooders go faster, but rather for the
>> > system to be robust in the face of flooders.  Robust as in harder for
>> > a flooder to OOM the system.
>> >
>> > And reducing the number of post-grace-period cache misses makes it
>> > easier for the callback-invocation-time memory freeing to keep up with
>> > the flooder, thus avoiding (or at least delaying) the OOM.
>> 
>> Throttling the flooder is increasing robustness far more than reducing
>> cache misses.
>
> True, but it takes time to identify a flooding event that needs to be
> throttled (as in milliseconds).  This time cannot be made up.

Not really. A flooding event will deplete your preallocated pages very
fast, so you have to go into the allocator and get new ones which
naturally throttles the offender.

So if your open/close thing uses the new single argument free which has
to be called from sleepable context then the allocation either gives you
a page or that thing has to wait. No fancy extras.

You can still have a page reserved for your other regular things and
once that is gone, you have to fall back to the linked list for
those. But when that happens the extra cache misses are not your main
problem anymore.

Thanks,

        tglx
Paul E. McKenney Aug. 18, 2020, 5:13 p.m. UTC | #106
On Tue, Aug 18, 2020 at 06:55:11PM +0200, Thomas Gleixner wrote:
> On Tue, Aug 18 2020 at 09:13, Paul E. McKenney wrote:
> > On Tue, Aug 18, 2020 at 04:43:14PM +0200, Thomas Gleixner wrote:
> >> On Tue, Aug 18 2020 at 06:53, Paul E. McKenney wrote:
> >> > On Tue, Aug 18, 2020 at 09:43:44AM +0200, Michal Hocko wrote:
> >> >> Thomas had a good point that it doesn't really make much sense to
> >> >> optimize for flooders because that just makes them more effective.
> >> >
> >> > The point is not to make the flooders go faster, but rather for the
> >> > system to be robust in the face of flooders.  Robust as in harder for
> >> > a flooder to OOM the system.
> >> >
> >> > And reducing the number of post-grace-period cache misses makes it
> >> > easier for the callback-invocation-time memory freeing to keep up with
> >> > the flooder, thus avoiding (or at least delaying) the OOM.
> >> 
> >> Throttling the flooder is increasing robustness far more than reducing
> >> cache misses.
> >
> > True, but it takes time to identify a flooding event that needs to be
> > throttled (as in milliseconds).  This time cannot be made up.
> 
> Not really. A flooding event will deplete your preallocated pages very
> fast, so you have to go into the allocator and get new ones which
> naturally throttles the offender.

Should it turn out that we can in fact go into the allocator, completely
agreed.

> So if your open/close thing uses the new single argument free which has
> to be called from sleepable context then the allocation either gives you
> a page or that thing has to wait. No fancy extras.

In the single-argument kvfree_rcu() case, completely agreed.

> You still can have a page reserved for your other regular things and
> once that is gone, you have to fall back to the linked list for
> those. But when that happens the extra cache misses are not your main
> problem anymore.

The extra cache misses are a problem in that case because they throttle
the reclamation, which anti-throttles the producer, especially in the
case where callback invocation is offloaded.

							Thanx, Paul
Thomas Gleixner Aug. 18, 2020, 11:26 p.m. UTC | #107
Paul,

On Tue, Aug 18 2020 at 10:13, Paul E. McKenney wrote:
> On Tue, Aug 18, 2020 at 06:55:11PM +0200, Thomas Gleixner wrote:
>> On Tue, Aug 18 2020 at 09:13, Paul E. McKenney wrote:
>> > On Tue, Aug 18, 2020 at 04:43:14PM +0200, Thomas Gleixner wrote:
>> >> Throttling the flooder is increasing robustness far more than reducing
>> >> cache misses.
>> >
>> > True, but it takes time to identify a flooding event that needs to be
>> > throttled (as in milliseconds).  This time cannot be made up.
>> 
>> Not really. A flooding event will deplete your preallocated pages very
>> fast, so you have to go into the allocator and get new ones which
>> naturally throttles the offender.
>
> Should it turn out that we can in fact go into the allocator, completely
> agreed.

You better can for any user space controllable flooding source.

>> So if your open/close thing uses the new single argument free which has
>> to be called from sleepable context then the allocation either gives you
>> a page or that thing has to wait. No fancy extras.
>
> In the single-argument kvfree_rcu() case, completely agreed.
>
>> You still can have a page reserved for your other regular things and
>> once that is gone, you have to fall back to the linked list for
>> those. But when that happens the extra cache misses are not your main
>> problem anymore.
>
> The extra cache misses are a problem in that case because they throttle
> the reclamation, which anti-throttles the producer, especially in the
> case where callback invocation is offloaded.

You still did not explain which contexts can create flooding. I gave you
a complete list a few mails ago, but you still did not tell which of the
contexts can cause flooding.

If it's any context which is not sleepable or controllable in any way,
then any attempt to mitigate it is a lost battle:

  A dependency on allocating memory to free memory is a dead end by
  definition.

Any flooder which is uncontrollable is a bug and no matter what kind of
hacks you provide, it will be able to bring the whole thing down.

So far this looks like you're trying to cure the symptoms, which is
wrong to begin with.

If the flooder is controllable then there is no problem with cache
misses at all unless the RCU free callbacks are not able to catch up
which is yet another problem which you can't cure by allocating more
memory.

Thanks,

        tglx
Paul E. McKenney Aug. 19, 2020, 11:07 p.m. UTC | #108
On Wed, Aug 19, 2020 at 01:26:02AM +0200, Thomas Gleixner wrote:
> Paul,
> 
> On Tue, Aug 18 2020 at 10:13, Paul E. McKenney wrote:
> > On Tue, Aug 18, 2020 at 06:55:11PM +0200, Thomas Gleixner wrote:
> >> On Tue, Aug 18 2020 at 09:13, Paul E. McKenney wrote:
> >> > On Tue, Aug 18, 2020 at 04:43:14PM +0200, Thomas Gleixner wrote:
> >> >> Throttling the flooder is increasing robustness far more than reducing
> >> >> cache misses.
> >> >
> >> > True, but it takes time to identify a flooding event that needs to be
> >> > throttled (as in milliseconds).  This time cannot be made up.
> >> 
> >> Not really. A flooding event will deplete your preallocated pages very
> >> fast, so you have to go into the allocator and get new ones which
> >> naturally throttles the offender.
> >
> > Should it turn out that we can in fact go into the allocator, completely
> > agreed.
> 
> You better can for any user space controllable flooding source.

For the memory being passed to kvfree_rcu(), but of course.

However, that memory has just as good a chance of going deeply into the
allocator when being freed as when being allocated.  In contrast, the
page of pointers that kvfree_rcu() attempts to allocate can be handed back
via per-CPU variables, avoiding lengthy time on the allocator's free path.

Yes, yes, in theory we could make a devil's pact with potential flooders
so that RCU directly handed memory back to them via a back channel as
well, but in practice that sounds like an excellent source of complexity
and bugs.

> >> So if your open/close thing uses the new single argument free which has
> >> to be called from sleepable context then the allocation either gives you
> >> a page or that thing has to wait. No fancy extras.
> >
> > In the single-argument kvfree_rcu() case, completely agreed.
> >
> >> You still can have a page reserved for your other regular things and
> >> once that is gone, you have to fall back to the linked list for
> >> those. But when that happens the extra cache misses are not your main
> >> problem anymore.
> >
> > The extra cache misses are a problem in that case because they throttle
> > the reclamation, which anti-throttles the producer, especially in the
> > case where callback invocation is offloaded.
> 
> You still did not explain which contexts can create flooding. I gave you
> a complete list a few mails ago, but you still did not tell which of the
> contexts can cause flooding.

Message-ID: <87tux4kefm.fsf@nanos.tec.linutronix.de>, correct?

In case #1 (RCU call flooding), the page of pointers is helpful.

In case #2 (RCU not being able to run and mop up the backlog), allocating
pages of pointers is unhelpful, given that doing so simply speeds up the
potential OOM.  My thought is to skip kvfree_rcu()/call_rcu() pointer-page
allocation once the current grace period's duration exceeds one second.

> If it's any context which is not sleepable or controllable in any way,
> then any attempt to mitigate it is a lost battle:
> 
>   A dependency on allocating memory to free memory is a dead end by
>   definition.

Any attempt to mitigate that -lacks- -a- -fallback- is a losing battle.

> Any flooder which is uncontrollable is a bug and no matter what kind of
> hacks you provide, it will be able to bring the whole thing down.

On to the sources of flooding, based on what reality has managed to
teach me thus far:

1) User space via syscalls, e.g. open/close

	These are definitely in scope.

2) Kernel thread

	In general, these are out of scope, as you would expect.

	For completeness, based on rcutorture-induced callback flooding,
	RCU can keep up with a kernel-thread flooder provided the thread's
	loop contains at least one cond_resched() for CONFIG_PREEMPT_NONE=y,
	at least one schedule() for preemptible kernels running NO_HZ_FULL,
	and at least one point at which preemption is possible otherwise.

	As discussed at Plumbers last year, the sysadm can always
	configure callback offloading in such a way that RCU has no
	chance of keeping up with a flood.  Then again, the sysadm can
	also always crash the system in all sorts of interesting ways,
	so what is one more?

	But please note that rcutorture does -not- recycle the flooded
	memory via kfree(), and thus avoids any problems with kfree()
	needing to dive deep into the allocator.  What rcutorture
	does instead is to pass the memory back to the flooder using
	a dedicated queue.  At the end of the test, the whole mess
	is kfree()ed.

	And all of these will be solved (as you would hope) by "Don't do
	that!!!", as laid out in the paragraphs above.	Of course, the
	exact form of "Don't do that" will no doubt change over time, but
	this code is in the kernel and therefore can be changed as needed.

3) Softirq

	Flooding from within the confines of a single softirq handler
	is of course out of scope.  There are all sorts of ways for the
	softirq-handler writer to deal with this, for example, dropping
	into workqueue context to do the allocation, which brings things
	back to #2 (kernel thread).

	I have not experimented with this, nor do I intend to.

4) Device interrupt

	Likewise, flooding from within the confines of a single device
	interrupt handler is out of scope.  As with softirq handlers,
	there are all sorts of ways for the device-driver writer to deal
	with this, for example, dropping into workqueue context to do
	the allocation, which brings things back to #2 (kernel thread).

	Again, as with softirq, I have not experimented with this,
	nor do I intend to.

5) System interrupts, deep atomic context, NMI ...

	System interrupts have the same callback-flooding constraints
	as device interrupts, correct?	(I am thinking that by "system
	interrupt" you mean things like the scheduling-clock interrupt,
	except that I have always thought of this interrupt as being a
	form of device interrupt.)

	I have no idea what "deep atomic context" is, but it does sound
	intriguing.  ;-)  If you mean the entry/exit code that is not
	allowed to be traced, then you cannot allocate, call_rcu(),
	kfree_rcu(), or kvfree_rcu() from that context anyway.

	NMI handlers are not allowed to allocate or to invoke any of
	call_rcu(), kfree_rcu(), or kvfree_rcu(), so they are going to
	need help from some other execution context if they are going
	to do any flooding at all.

In short, the main concern is flooding driven one way or another from
user space, whether via syscalls, traps, exceptions, or whatever.
Therefore, of the above list, I am worried only about #1.

(OK, OK, I am worried about #2-#4, but only from the perspective of
knowing what to tell the developer not to do.)

> So far this looks like you're trying to cure the symptoms, which is
> wrong to begin with.
> 
> If the flooder is controllable then there is no problem with cache
> misses at all unless the RCU free callbacks are not able to catch up
> which is yet another problem which you can't cure by allocating more
> memory.

As much as I might like to agree with that statement, the possibility
of freeing (not just allocation) going deep into the allocator leads me
to believe that reality might have a rather different opinion.

And the only way to find out is to try it.  I therefore propose that
the eventual patch series be split into three parts:

1.	Maintain separate per-CPU cache of pages of pointers dedicated
	to kfree_rcu() and kvfree_rcu(), and later also to call_rcu().
	If a given cache is empty, the code takes the fallback (either
	use the rcu_head structure or block on synchronize_rcu(),
	depending), but the code also asks for an allocation from a
	sleepable context in order to replenish the cache.  (We could
	use softirq for kfree_rcu() and kvfree_rcu(), but call_rcu()
	would have deadlock issues with softirq.)

2.	Peter's patch or similar.

3.	Use of Peter's patch.

If the need for #2 and #3 are convincing, well and good.  If not, I
create a tag for them in the -rcu tree so that they can be found quickly
in case reality decides to express its opinion at some inconvenient time
at some inconvenient scale.

Thoughts?

							Thanx, Paul

Patch
diff mbox series

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 67a0774e080b..c6f11481c42a 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -39,8 +39,9 @@  struct vm_area_struct;
 #define ___GFP_HARDWALL		0x100000u
 #define ___GFP_THISNODE		0x200000u
 #define ___GFP_ACCOUNT		0x400000u
+#define ___GFP_NO_LOCKS		0x800000u
 #ifdef CONFIG_LOCKDEP
-#define ___GFP_NOLOCKDEP	0x800000u
+#define ___GFP_NOLOCKDEP	0x1000000u
 #else
 #define ___GFP_NOLOCKDEP	0
 #endif
@@ -215,16 +216,22 @@  struct vm_area_struct;
  * %__GFP_COMP address compound page metadata.
  *
  * %__GFP_ZERO returns a zeroed page on success.
+ *
+ * %__GFP_NO_LOCKS requests an order-0 allocation without taking any
+ * sleepable locks. It obtains a page from the per-cpu list and is
+ * considered lock-less. No other actions are performed, thus it
+ * returns NULL if the per-cpu list is empty.
  */
 #define __GFP_NOWARN	((__force gfp_t)___GFP_NOWARN)
 #define __GFP_COMP	((__force gfp_t)___GFP_COMP)
 #define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)
+#define __GFP_NO_LOCKS	((__force gfp_t)___GFP_NO_LOCKS)
 
 /* Disable lockdep for GFP context tracking */
 #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
 
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (23 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT (24 + IS_ENABLED(CONFIG_LOCKDEP))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /**
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 939092dbcb8b..653c56c478ad 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -45,6 +45,7 @@ 
 	{(unsigned long)__GFP_RECLAIMABLE,	"__GFP_RECLAIMABLE"},	\
 	{(unsigned long)__GFP_MOVABLE,		"__GFP_MOVABLE"},	\
 	{(unsigned long)__GFP_ACCOUNT,		"__GFP_ACCOUNT"},	\
+	{(unsigned long)__GFP_NO_LOCKS,		"__GFP_NO_LOCKS"},	\
 	{(unsigned long)__GFP_WRITE,		"__GFP_WRITE"},		\
 	{(unsigned long)__GFP_RECLAIM,		"__GFP_RECLAIM"},	\
 	{(unsigned long)__GFP_DIRECT_RECLAIM,	"__GFP_DIRECT_RECLAIM"},\
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e4896e674594..8bf1e3a9a1c3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3305,7 +3305,8 @@  static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
 }
 
 /* Remove page from the per-cpu list, caller must protect the list */
-static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
+static struct page *__rmqueue_pcplist(struct zone *zone, gfp_t gfp_flags,
+			int migratetype,
 			unsigned int alloc_flags,
 			struct per_cpu_pages *pcp,
 			struct list_head *list)
@@ -3314,7 +3315,8 @@  static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
 
 	do {
 		if (list_empty(list)) {
-			pcp->count += rmqueue_bulk(zone, 0,
+			if (!(gfp_flags & __GFP_NO_LOCKS))
+				pcp->count += rmqueue_bulk(zone, 0,
 					pcp->batch, list,
 					migratetype, alloc_flags);
 			if (unlikely(list_empty(list)))
@@ -3341,8 +3343,20 @@  static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 
 	local_irq_save(flags);
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
-	list = &pcp->lists[migratetype];
-	page = __rmqueue_pcplist(zone,  migratetype, alloc_flags, pcp, list);
+
+	if (!(gfp_flags & __GFP_NO_LOCKS)) {
+		list = &pcp->lists[migratetype];
+		page = __rmqueue_pcplist(zone, gfp_flags, migratetype, alloc_flags, pcp, list);
+	} else {
+		/* Iterate over all migrate types of the pcp-lists. */
+		for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++) {
+			list = &pcp->lists[migratetype];
+			page = __rmqueue_pcplist(zone, gfp_flags, migratetype, alloc_flags, pcp, list);
+			if (page)
+				break;
+		}
+	}
+
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
 		zone_statistics(preferred_zone, zone);
@@ -3790,7 +3804,8 @@  get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			 * grow this zone if it contains deferred pages.
 			 */
 			if (static_branch_unlikely(&deferred_pages)) {
-				if (_deferred_grow_zone(zone, order))
+				if (!(gfp_mask & __GFP_NO_LOCKS) &&
+						_deferred_grow_zone(zone, order))
 					goto try_this_zone;
 			}
 #endif
@@ -3835,7 +3850,7 @@  get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 				reserve_highatomic_pageblock(page, zone, order);
 
 			return page;
-		} else {
+		} else if (!(gfp_mask & __GFP_NO_LOCKS)) {
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 			/* Try again if zone has deferred pages */
 			if (static_branch_unlikely(&deferred_pages)) {
@@ -4880,6 +4895,10 @@  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 	if (likely(page))
 		goto out;
 
+	/* Bypass slow path if __GFP_NO_LOCKS. */
+	if ((gfp_mask & __GFP_NO_LOCKS))
+		goto out;
+
 	/*
 	 * Apply scoped allocation constraints. This is mainly about GFP_NOFS
 	 * resp. GFP_NOIO which has to be inherited for all allocation requests
diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index 38a5ab683ebc..662e1d9a0e99 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -656,6 +656,7 @@  static const struct {
 	{ "__GFP_RECLAIMABLE",		"RC" },
 	{ "__GFP_MOVABLE",		"M" },
 	{ "__GFP_ACCOUNT",		"AC" },
+	{ "__GFP_NO_LOCKS",		"NL" },
 	{ "__GFP_WRITE",		"WR" },
 	{ "__GFP_RECLAIM",		"R" },
 	{ "__GFP_DIRECT_RECLAIM",	"DR" },