diff mbox series

[RFC-PATCH,2/4] mm: Add __rcu_alloc_page_lockless() func.

Message ID	20200918194817.48921-3-urezki@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=yEt/=C3=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 6EFAA23119 From: "Uladzislau Rezki (Sony)" <urezki@gmail.com> To: LKML <linux-kernel@vger.kernel.org>, RCU <rcu@vger.kernel.org>, linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>, "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org>, Michal Hocko <mhocko@suse.com>, Vlastimil Babka <vbabka@suse.cz>, Thomas Gleixner <tglx@linutronix.de>, "Theodore Y . Ts'o" <tytso@mit.edu>, Joel Fernandes <joel@joelfernandes.org>, Sebastian Andrzej Siewior <bigeasy@linutronix.de>, Uladzislau Rezki <urezki@gmail.com>, Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Subject: [RFC-PATCH 2/4] mm: Add __rcu_alloc_page_lockless() func. Date: Fri, 18 Sep 2020 21:48:15 +0200 Message-Id: <20200918194817.48921-3-urezki@gmail.com> In-Reply-To: <20200918194817.48921-1-urezki@gmail.com> References: <20200918194817.48921-1-urezki@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	kvfree_rcu() and _LOCK_NESTING/_PREEMPT_RT \| expand [0/4] kvfree_rcu() and _LOCK_NESTING/_PREEMPT_RT [1/4] rcu/tree: Add a work to allocate pages from regular context [RFC-PATCH,2/4] mm: Add __rcu_alloc_page_lockless() func. [3/4] rcu/tree: use __rcu_alloc_page_lockless() func. [4/4] rcu/tree: Use schedule_delayed_work() instead of WQ_HIGHPRI queue

Commit Message

Uladzislau Rezki Sept. 18, 2020, 7:48 p.m. UTC

Some background and kfree_rcu()
===============================
The pointers to be freed are stored in the per-cpu array to improve
performance, to enable an easier-to-use API, to accommodate vmalloc
memmory and to support a single argument of the kfree_rcu() when only
a pointer is passed. More details are below.

In order to maintain such per-CPU arrays there is a need in dynamic
allocation when a current array is fully populated and a new block is
required. See below the example:

 0 1 2 3      0 1 2 3
|p|p|p|p| -> |p|p|p|p| -> NULL

there are two pointer-blocks, each one can store 4 addresses
which will be freed after a grace period is passed. In reality
we store PAGE_SIZE / sizeof(void *). So to maintain such blocks
a single page is obtain via the page allocator:

    bnode = (struct kvfree_rcu_bulk_data *)
        __get_free_page(GFP_NOWAIT | __GFP_NOWARN);

after that it is attached to the "head" and its "next" pointer is
set to previous "head", so the list of blocks can be maintained and
grow dynamically until it gets drained by the reclaiming thread.

Please note. There is always a fallback if an allocation fails. In the
single argument, this is a call to synchronize_rcu() and for the two
arguments case this is to use rcu_head structure embedded in the object
being free, and then paying cache-miss penalty, also invoke the kfree()
per object instead of kfree_bulk() for groups of objects.

Why we maintain arrays/blocks instead of linking objects by the regular
"struct rcu_head" technique. See below a few but main reasons:

a) A memory can be reclaimed by invoking of the kfree_bulk()
   interface that requires passing an array and number of
   entries in it. That reduces the per-object overhead caused
   by calling kfree() per-object. This reduces the reclamation
   time.

b) Improves locality and reduces the number of cache-misses, due to
   "pointer chasing" between objects, which can be far spread between
   each other.

c) Support a "single argument" in the kvfree_rcu()
   void *ptr = kvmalloc(some_bytes, GFP_KERNEL);
   if (ptr)
        kvfree_rcu(ptr);

   We need it when an "rcu_head" is not embed into a stucture but an
   object must be freed after a grace period. Therefore for the single
   argument, such objects cannot be queued on a linked list.

   So nowadays, since we do not have a single argument but we see the
   demand in it, to workaround it people just do a simple not efficient
   sequence:
   <snip>
       synchronize_rcu(); /* Can be long and blocks a current context */
       kfree(p);
   <snip>

   More details is here: https://lkml.org/lkml/2020/4/28/1626

d) To distinguish vmalloc pointers between SLAB ones. It becomes possible
   to invoke the right freeing API for the right kind of pointer, kfree_bulk()
   or TBD: vmalloc_bulk().

e) Speeding up the post-grace-period freeing reduces the chance of a flood
   of callback's OOMing the system.

Also, please have a look here: https://lkml.org/lkml/2020/7/30/1166

Proposal
========
Introduce a lock-free function that obtain a page from the per-cpu-lists
on current CPU. It returns NULL rather than acquiring any non-raw spinlock.

Description
===========
The page allocator has two phases, fast path and slow one. We are interested
in fast path and order-0 allocations. In its turn it is divided also into two
phases: lock-less and not:

1) As a first step the page allocator tries to obtain a page from the
   per-cpu-list, so each CPU has its own one. That is why this step is
   lock-less and fast. Basically it disables irqs on current CPU in order
   to access to per-cpu data and remove a first element from the pcp-list.
   An element/page is returned to an user.

2) If there is no any available page in per-cpu-list, the second step is
   involved. It removes a specified number of elements from the buddy allocator
   transferring them to the "supplied-list/per-cpu-list" described in [1].

Summarizing. The __rcu_alloc_page_lockless() covers only [1] and can not
do step [2], due to the fact that [2] requires an access to zone->lock.
It implies that it is super fast, but a higher rate of fails is also
expected.

Usage: __rcu_alloc_page_lockless();

Link: https://lore.kernel.org/lkml/20200814215206.GL3982@worktop.programming.kicks-ass.net/
Not-signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 include/linux/gfp.h |  1 +
 mm/page_alloc.c     | 82 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 83 insertions(+)

Comments

Michal Hocko Sept. 21, 2020, 7:47 a.m. UTC | #1

On Fri 18-09-20 21:48:15, Uladzislau Rezki (Sony) wrote:
[...]
> Proposal
> ========
> Introduce a lock-free function that obtain a page from the per-cpu-lists
> on current CPU. It returns NULL rather than acquiring any non-raw spinlock.

I was not happy about this solution when we have discussed this
last time and I have to say I am still not happy. This is exposing
an internal allocator optimization and allows a hard to estimate
consumption of pcp free pages. IIUC this run on pcp cache can be
controled directly from userspace (close(open) loop IIRC) which makes it
even bigger no-no.

I strongly agree with Thomas http://lkml.kernel.org/r/87tux4kefm.fsf@nanos.tec.linutronix.de
that this optimization is not aiming at reasonable workloads. Really, go
with pre-allocated buffer and fallback to whatever slow path you have
already. Exposing more internals of the allocator is not going to do any
good for long term maintainability.

Paul E. McKenney Sept. 21, 2020, 3:45 p.m. UTC | #2

On Mon, Sep 21, 2020 at 09:47:16AM +0200, Michal Hocko wrote:
> On Fri 18-09-20 21:48:15, Uladzislau Rezki (Sony) wrote:
> [...]
> > Proposal
> > ========
> > Introduce a lock-free function that obtain a page from the per-cpu-lists
> > on current CPU. It returns NULL rather than acquiring any non-raw spinlock.
> 
> I was not happy about this solution when we have discussed this
> last time and I have to say I am still not happy. This is exposing
> an internal allocator optimization and allows a hard to estimate
> consumption of pcp free pages. IIUC this run on pcp cache can be
> controled directly from userspace (close(open) loop IIRC) which makes it
> even bigger no-no.

Yes, I do well remember that you are unhappy with this approach.
Unfortunately, thus far, there is no solution that makes all developers
happy.  You might be glad to hear that we are also looking into other
solutions, each of which makes some other developers unhappy.  So we
are at least not picking on you alone.  :-/

> I strongly agree with Thomas http://lkml.kernel.org/r/87tux4kefm.fsf@nanos.tec.linutronix.de
> that this optimization is not aiming at reasonable workloads. Really, go
> with pre-allocated buffer and fallback to whatever slow path you have
> already. Exposing more internals of the allocator is not going to do any
> good for long term maintainability.

I suggest that you carefully re-read the thread following that email.

Given a choice between making users unhappy and making developers
unhappy, I will side with the users each and every time.

							Thanx, Paul

Michal Hocko Sept. 21, 2020, 4:03 p.m. UTC | #3

On Mon 21-09-20 08:45:58, Paul E. McKenney wrote:
> On Mon, Sep 21, 2020 at 09:47:16AM +0200, Michal Hocko wrote:
> > On Fri 18-09-20 21:48:15, Uladzislau Rezki (Sony) wrote:
> > [...]
> > > Proposal
> > > ========
> > > Introduce a lock-free function that obtain a page from the per-cpu-lists
> > > on current CPU. It returns NULL rather than acquiring any non-raw spinlock.
> > 
> > I was not happy about this solution when we have discussed this
> > last time and I have to say I am still not happy. This is exposing
> > an internal allocator optimization and allows a hard to estimate
> > consumption of pcp free pages. IIUC this run on pcp cache can be
> > controled directly from userspace (close(open) loop IIRC) which makes it
> > even bigger no-no.
> 
> Yes, I do well remember that you are unhappy with this approach.
> Unfortunately, thus far, there is no solution that makes all developers
> happy.  You might be glad to hear that we are also looking into other
> solutions, each of which makes some other developers unhappy.  So we
> are at least not picking on you alone.  :-/

No worries I do not feel like a whipping boy here. But do expect me to
argue against the approach. I would also appreciate it if there was some
more information on other attempts, why they have failed. E.g. why
pre-allocation is not an option that works well enough in most
reasonable workloads. I would also appreciate some more thoughts why we
need to optimize for heavy abusers of RCU (like close(open) extremes).

> > I strongly agree with Thomas http://lkml.kernel.org/r/87tux4kefm.fsf@nanos.tec.linutronix.de
> > that this optimization is not aiming at reasonable workloads. Really, go
> > with pre-allocated buffer and fallback to whatever slow path you have
> > already. Exposing more internals of the allocator is not going to do any
> > good for long term maintainability.
> 
> I suggest that you carefully re-read the thread following that email.

I clearly remember Thomas not being particularly happy that you optimize
for a corner case. I do not remember there being a consensus that this
is the right approach. There was some consensus that this is better than
a gfp flag. Still quite bad though if you ask me.

> Given a choice between making users unhappy and making developers
> unhappy, I will side with the users each and every time.

Well, let me rephrase. It is not only about me (as a developer) being
unhappy but also all the side effects this would have for users when
performance of their favorite workload declines for no apparent reason
just because pcp caches are depleted by an unrelated process.

Uladzislau Rezki Sept. 21, 2020, 7:48 p.m. UTC | #4

Hello, Michal.

> >
> > Yes, I do well remember that you are unhappy with this approach.
> > Unfortunately, thus far, there is no solution that makes all developers
> > happy.  You might be glad to hear that we are also looking into other
> > solutions, each of which makes some other developers unhappy.  So we
> > are at least not picking on you alone.  :-/
> 
> No worries I do not feel like a whipping boy here. But do expect me to
> argue against the approach. I would also appreciate it if there was some
> more information on other attempts, why they have failed. E.g. why
> pre-allocation is not an option that works well enough in most
> reasonable workloads.
Pre-allocating has some drawbacks:

a) It is impossible to predict how many pages will be required to
   cover a demand that is controlled by different workloads on
   various systems.

b) Memory overhead since we do not know how much pages should be
   preloaded: 100, 200 or 300

As for memory overhead, it is important to reduce it because of
embedded devices like phones, where a low memory condition is a
big issue. In that sense pre-allocating is something that we strongly
would like to avoid.

>
> I would also appreciate some more thoughts why we
> need to optimize for heavy abusers of RCU (like close(open) extremes).
> 
I think here is a small misunderstanding. Please note, that is not only
about performance and corner cases. There is a single argument support
of the kvfree_rcu(ptr), where maintaining an array in time is needed.
The fallback of the single argument case is extrimely slow.

Single-argument details is here: https://lkml.org/lkml/2020/4/28/1626

> > > I strongly agree with Thomas http://lkml.kernel.org/r/87tux4kefm.fsf@nanos.tec.linutronix.de
> > > that this optimization is not aiming at reasonable workloads. Really, go
> > > with pre-allocated buffer and fallback to whatever slow path you have
> > > already. Exposing more internals of the allocator is not going to do any
> > > good for long term maintainability.
> > 
> > I suggest that you carefully re-read the thread following that email.
> 
> I clearly remember Thomas not being particularly happy that you optimize
> for a corner case. I do not remember there being a consensus that this
> is the right approach. There was some consensus that this is better than
> a gfp flag. Still quite bad though if you ask me.
> 
> > Given a choice between making users unhappy and making developers
> > unhappy, I will side with the users each and every time.
> 
> Well, let me rephrase. It is not only about me (as a developer) being
> unhappy but also all the side effects this would have for users when
> performance of their favorite workload declines for no apparent reason
> just because pcp caches are depleted by an unrelated process.
>
If depleted, we have a special worker that charge it. From the other hand,
the pcplist can be depleted by its nature, what _is_ not wrong. But just
in case we secure it since you had a concern about it.

Could you please specify a real test case or workload you are talking about?

Thank you for your comments and help.

--
Vlad Rezki

Paul E. McKenney Sept. 22, 2020, 3:35 a.m. UTC | #5

On Mon, Sep 21, 2020 at 06:03:18PM +0200, Michal Hocko wrote:
> On Mon 21-09-20 08:45:58, Paul E. McKenney wrote:
> > On Mon, Sep 21, 2020 at 09:47:16AM +0200, Michal Hocko wrote:
> > > On Fri 18-09-20 21:48:15, Uladzislau Rezki (Sony) wrote:
> > > [...]
> > > > Proposal
> > > > ========
> > > > Introduce a lock-free function that obtain a page from the per-cpu-lists
> > > > on current CPU. It returns NULL rather than acquiring any non-raw spinlock.
> > > 
> > > I was not happy about this solution when we have discussed this
> > > last time and I have to say I am still not happy. This is exposing
> > > an internal allocator optimization and allows a hard to estimate
> > > consumption of pcp free pages. IIUC this run on pcp cache can be
> > > controled directly from userspace (close(open) loop IIRC) which makes it
> > > even bigger no-no.
> > 
> > Yes, I do well remember that you are unhappy with this approach.
> > Unfortunately, thus far, there is no solution that makes all developers
> > happy.  You might be glad to hear that we are also looking into other
> > solutions, each of which makes some other developers unhappy.  So we
> > are at least not picking on you alone.  :-/
> 
> No worries I do not feel like a whipping boy here. But do expect me to
> argue against the approach. I would also appreciate it if there was some
> more information on other attempts, why they have failed. E.g. why
> pre-allocation is not an option that works well enough in most
> reasonable workloads. I would also appreciate some more thoughts why we
> need to optimize for heavy abusers of RCU (like close(open) extremes).

Not optimizing for them, but rather defending against them.  Uladzislau
gave the example of low-memory phones.  And we have quite the array
of defenses against other userspace bugs including MMUs, the "limit"
command, and so on.

There have been quite a few successful attempts, starting from the
introduction of blimit and RCU-bh more than 15 years ago, continuing
through making call_rcu() jump-start grace periods, IPIing reluctant
CPUs, tuning RCU callback offloading, and many others.  But these prior
approaches have only taken us so far.

Other approaches under consideration include making CONFIG_PREEMPT_COUNT
unconditional and thus allowing call_rcu() and kvfree_rcu() to determine
whether direct calls to the allocator are safe (some guy named Linus
doesn't like this one), preallocation (Uladzislau covered this, and
the amount that would need to be preallocated is excessive), deferring
allocation to RCU_SOFTIRQ (this would also need CONFIG_PREEMPT_COUNT),
and deferring to some clean context (which is the best we can do within
the confines of RCU, but which can have issues with delay).

So it is not the need to address this general problem that is new.
Far from it!  What is new is the need for changes outside of RCU.

> > > I strongly agree with Thomas http://lkml.kernel.org/r/87tux4kefm.fsf@nanos.tec.linutronix.de
> > > that this optimization is not aiming at reasonable workloads. Really, go
> > > with pre-allocated buffer and fallback to whatever slow path you have
> > > already. Exposing more internals of the allocator is not going to do any
> > > good for long term maintainability.
> > 
> > I suggest that you carefully re-read the thread following that email.
> 
> I clearly remember Thomas not being particularly happy that you optimize
> for a corner case. I do not remember there being a consensus that this
> is the right approach. There was some consensus that this is better than
> a gfp flag. Still quite bad though if you ask me.

Again, this "optimization" is for robustness more than raw speed.

> > Given a choice between making users unhappy and making developers
> > unhappy, I will side with the users each and every time.
> 
> Well, let me rephrase. It is not only about me (as a developer) being
> unhappy but also all the side effects this would have for users when
> performance of their favorite workload declines for no apparent reason
> just because pcp caches are depleted by an unrelated process.

But in the close(open()) case, wouldn't the allocations on the open()
side refill those caches?  Yes, cases where one CPU is doing the
allocating and another the call_rcu()/kvfree_rcu() need additional
help, but as Uladzislau noted, we do have patches that ensure that the
refilling happens.

							Thanx, Paul

Michal Hocko Sept. 22, 2020, 7:50 a.m. UTC | #6

[Cc Mel - the thread starts http://lkml.kernel.org/r/20200918194817.48921-1-urezki@gmail.com]

On Mon 21-09-20 21:48:19, Uladzislau Rezki wrote:
> Hello, Michal.
> 
> > >
> > > Yes, I do well remember that you are unhappy with this approach.
> > > Unfortunately, thus far, there is no solution that makes all developers
> > > happy.  You might be glad to hear that we are also looking into other
> > > solutions, each of which makes some other developers unhappy.  So we
> > > are at least not picking on you alone.  :-/
> > 
> > No worries I do not feel like a whipping boy here. But do expect me to
> > argue against the approach. I would also appreciate it if there was some
> > more information on other attempts, why they have failed. E.g. why
> > pre-allocation is not an option that works well enough in most
> > reasonable workloads.
> Pre-allocating has some drawbacks:
> 
> a) It is impossible to predict how many pages will be required to
>    cover a demand that is controlled by different workloads on
>    various systems.

Yes, this is not trivial but not a rocket science either. Remember that
you are relying on a very dumb watermark based pcp pool from the
allocator. Mimicing a similar implementation shouldn't be all that hard
and you will get your own pool which doesn't affect other page allocator
users as much as a bonus.

> b) Memory overhead since we do not know how much pages should be
>    preloaded: 100, 200 or 300

Does anybody who really needs this optimization actually cares about 300
pages?

> As for memory overhead, it is important to reduce it because of
> embedded devices like phones, where a low memory condition is a
> big issue. In that sense pre-allocating is something that we strongly
> would like to avoid.

How big "machines" are we talking about here? I would expect that really
tiny machines would have hard times to really fill up thousands of pages
with pointers to free...

Would a similar scaling as the page allocator feasible. Really I mostly
do care about shared nature of the pcp allocator list that one user can
easily monopolize with this API.

> > I would also appreciate some more thoughts why we
> > need to optimize for heavy abusers of RCU (like close(open) extremes).
> > 
> I think here is a small misunderstanding. Please note, that is not only
> about performance and corner cases. There is a single argument support
> of the kvfree_rcu(ptr), where maintaining an array in time is needed.
> The fallback of the single argument case is extrimely slow.

This should be part of the changelog.
> 
> Single-argument details is here: https://lkml.org/lkml/2020/4/28/1626

Error 501

> > > > I strongly agree with Thomas http://lkml.kernel.org/r/87tux4kefm.fsf@nanos.tec.linutronix.de
> > > > that this optimization is not aiming at reasonable workloads. Really, go
> > > > with pre-allocated buffer and fallback to whatever slow path you have
> > > > already. Exposing more internals of the allocator is not going to do any
> > > > good for long term maintainability.
> > > 
> > > I suggest that you carefully re-read the thread following that email.
> > 
> > I clearly remember Thomas not being particularly happy that you optimize
> > for a corner case. I do not remember there being a consensus that this
> > is the right approach. There was some consensus that this is better than
> > a gfp flag. Still quite bad though if you ask me.
> > 
> > > Given a choice between making users unhappy and making developers
> > > unhappy, I will side with the users each and every time.
> > 
> > Well, let me rephrase. It is not only about me (as a developer) being
> > unhappy but also all the side effects this would have for users when
> > performance of their favorite workload declines for no apparent reason
> > just because pcp caches are depleted by an unrelated process.
> >
> If depleted, we have a special worker that charge it. From the other hand,
> the pcplist can be depleted by its nature, what _is_ not wrong. But just
> in case we secure it since you had a concern about it.

pcp free lists should ever get empty when we run out of memory and need
to reclaim. Otherwise they are constantly refilled/rebalanced on demand.
The fact that you are refilling them from outside just suggest that you
are operating on a wrong layer. Really, create your own pool of pages
and rebalance them based on the workload.

> Could you please specify a real test case or workload you are talking about?

I am not a performance expert but essentially any memory allocator heavy
workload might notice. I am pretty sure Mel would tell you more.

Michal Hocko Sept. 22, 2020, 8:03 a.m. UTC | #7

On Mon 21-09-20 20:35:53, Paul E. McKenney wrote:
> On Mon, Sep 21, 2020 at 06:03:18PM +0200, Michal Hocko wrote:
> > On Mon 21-09-20 08:45:58, Paul E. McKenney wrote:
> > > On Mon, Sep 21, 2020 at 09:47:16AM +0200, Michal Hocko wrote:
> > > > On Fri 18-09-20 21:48:15, Uladzislau Rezki (Sony) wrote:
> > > > [...]
> > > > > Proposal
> > > > > ========
> > > > > Introduce a lock-free function that obtain a page from the per-cpu-lists
> > > > > on current CPU. It returns NULL rather than acquiring any non-raw spinlock.
> > > > 
> > > > I was not happy about this solution when we have discussed this
> > > > last time and I have to say I am still not happy. This is exposing
> > > > an internal allocator optimization and allows a hard to estimate
> > > > consumption of pcp free pages. IIUC this run on pcp cache can be
> > > > controled directly from userspace (close(open) loop IIRC) which makes it
> > > > even bigger no-no.
> > > 
> > > Yes, I do well remember that you are unhappy with this approach.
> > > Unfortunately, thus far, there is no solution that makes all developers
> > > happy.  You might be glad to hear that we are also looking into other
> > > solutions, each of which makes some other developers unhappy.  So we
> > > are at least not picking on you alone.  :-/
> > 
> > No worries I do not feel like a whipping boy here. But do expect me to
> > argue against the approach. I would also appreciate it if there was some
> > more information on other attempts, why they have failed. E.g. why
> > pre-allocation is not an option that works well enough in most
> > reasonable workloads. I would also appreciate some more thoughts why we
> > need to optimize for heavy abusers of RCU (like close(open) extremes).
> 
> Not optimizing for them, but rather defending against them.  Uladzislau
> gave the example of low-memory phones.  And we have quite the array
> of defenses against other userspace bugs including MMUs, the "limit"
> command, and so on.
> 
> There have been quite a few successful attempts, starting from the
> introduction of blimit and RCU-bh more than 15 years ago, continuing
> through making call_rcu() jump-start grace periods, IPIing reluctant
> CPUs, tuning RCU callback offloading, and many others.  But these prior
> approaches have only taken us so far.

Thank you this is useful. It gives me some idea about the problem but I
still cannot picture how serious the problem really is. Is this a DoS
like scenario where an unprivileged task is able to monopolize the
system/CPU to do all the RCU heavy lifting? Are other RCU users deprived
from doing their portion? How much expediting helps? Does it really
solve the problem or it merely shifts it around as you will have hard
time to keep up with more and more tasks to hit the same path and the
pre-allocated memory is a very finite resource.

> Other approaches under consideration include making CONFIG_PREEMPT_COUNT
> unconditional and thus allowing call_rcu() and kvfree_rcu() to determine
> whether direct calls to the allocator are safe (some guy named Linus
> doesn't like this one),

I assume that the primary argument is the overhead, right? Do you happen
to have any reference?

> preallocation (Uladzislau covered this, and
> the amount that would need to be preallocated is excessive), deferring
> allocation to RCU_SOFTIRQ (this would also need CONFIG_PREEMPT_COUNT),
> and deferring to some clean context (which is the best we can do within
> the confines of RCU, but which can have issues with delay).

I have already replied to that so let's not split the discussion into
several fronts.

> So it is not the need to address this general problem that is new.
> Far from it!  What is new is the need for changes outside of RCU.
> 
> > > > I strongly agree with Thomas http://lkml.kernel.org/r/87tux4kefm.fsf@nanos.tec.linutronix.de
> > > > that this optimization is not aiming at reasonable workloads. Really, go
> > > > with pre-allocated buffer and fallback to whatever slow path you have
> > > > already. Exposing more internals of the allocator is not going to do any
> > > > good for long term maintainability.
> > > 
> > > I suggest that you carefully re-read the thread following that email.
> > 
> > I clearly remember Thomas not being particularly happy that you optimize
> > for a corner case. I do not remember there being a consensus that this
> > is the right approach. There was some consensus that this is better than
> > a gfp flag. Still quite bad though if you ask me.
> 
> Again, this "optimization" is for robustness more than raw speed.
> 
> > > Given a choice between making users unhappy and making developers
> > > unhappy, I will side with the users each and every time.
> > 
> > Well, let me rephrase. It is not only about me (as a developer) being
> > unhappy but also all the side effects this would have for users when
> > performance of their favorite workload declines for no apparent reason
> > just because pcp caches are depleted by an unrelated process.
> 
> But in the close(open()) case, wouldn't the allocations on the open()
> side refill those caches?

Yes pcp lists will eventually get replenished. This is not a correctness
problem I am pointing out. It is the fact that any user of this new
interface can monopolize the _global_ pool which would affect all other
users of the pool.

Uladzislau Rezki Sept. 22, 2020, 1:12 p.m. UTC | #8

> > > > Yes, I do well remember that you are unhappy with this approach.
> > > > Unfortunately, thus far, there is no solution that makes all developers
> > > > happy.  You might be glad to hear that we are also looking into other
> > > > solutions, each of which makes some other developers unhappy.  So we
> > > > are at least not picking on you alone.  :-/
> > > 
> > > No worries I do not feel like a whipping boy here. But do expect me to
> > > argue against the approach. I would also appreciate it if there was some
> > > more information on other attempts, why they have failed. E.g. why
> > > pre-allocation is not an option that works well enough in most
> > > reasonable workloads.
> > Pre-allocating has some drawbacks:
> > 
> > a) It is impossible to predict how many pages will be required to
> >    cover a demand that is controlled by different workloads on
> >    various systems.
> 
> Yes, this is not trivial but not a rocket science either. Remember that
> you are relying on a very dumb watermark based pcp pool from the
> allocator.
>
We rely on it, indeed. If the pcp-cache is depleted our special work is
triggered to charge our local cache(few pages) such way will also initiate
the process of pre-featching pages from the buddy allocator populating
the depleted pcp-cache. I do not have any concern here.

>
> Mimicing a similar implementation shouldn't be all that hard
> and you will get your own pool which doesn't affect other page allocator
> users as much as a bonus.
> 
I see your point Michal. As i mentioned before, it is important to avoid of
having such own pools, because the aim is not to waste memory resources. A
page will be returned back to "page allocator" as soon as a scheduler place  
our reclaim thread on a CPU and grace period is passed. So, the resource
can be used for other needs. What is important.

Otherwise a memory footprint is increased what is bad for low memory
conditions when OOM is involved. Just in case, it is a big issue for
mobile devices.

> > b) Memory overhead since we do not know how much pages should be
> >    preloaded: 100, 200 or 300
> 
> Does anybody who really needs this optimization actually cares about 300
> pages?
> 
It might be an issue for embedded devices when such devices run into a
low memory condition resulting in OOM or slow allocations due to mentioned
condition. For servers and big system it will not be visible.

> > As for memory overhead, it is important to reduce it because of
> > embedded devices like phones, where a low memory condition is a
> > big issue. In that sense pre-allocating is something that we strongly
> > would like to avoid.
> 
> How big "machines" are we talking about here? I would expect that really
> tiny machines would have hard times to really fill up thousands of pages
> with pointers to free...
> 
I mentioned above. We can not rely on static model. We would like to
have a mechanism that gives back ASAP used pages to page allocator
for other needs.

>
> Would a similar scaling as the page allocator feasible. Really I mostly
> do care about shared nature of the pcp allocator list that one user can
> easily monopolize with this API.
> 
I see your concern. pcplist can be monopolized by already existing API:

    while (i < 100)
        __get_free_page(GFP_NOWAIT | __GFP_NOWARN);

> > > I would also appreciate some more thoughts why we
> > > need to optimize for heavy abusers of RCU (like close(open) extremes).
> > > 
> > I think here is a small misunderstanding. Please note, that is not only
> > about performance and corner cases. There is a single argument support
> > of the kvfree_rcu(ptr), where maintaining an array in time is needed.
> > The fallback of the single argument case is extrimely slow.
> 
> This should be part of the changelog.
>
Hmm.. I think it is. Sorry if i missed that but i hope i mentioned about it.

> > 
> > Single-argument details is here: https://lkml.org/lkml/2020/4/28/1626
> 
> Error 501
> 
Could you please elaborate? Do not want to speculate :)

> > > > > I strongly agree with Thomas http://lkml.kernel.org/r/87tux4kefm.fsf@nanos.tec.linutronix.de
> > > > > that this optimization is not aiming at reasonable workloads. Really, go
> > > > > with pre-allocated buffer and fallback to whatever slow path you have
> > > > > already. Exposing more internals of the allocator is not going to do any
> > > > > good for long term maintainability.
> > > > 
> > > > I suggest that you carefully re-read the thread following that email.
> > > 
> > > I clearly remember Thomas not being particularly happy that you optimize
> > > for a corner case. I do not remember there being a consensus that this
> > > is the right approach. There was some consensus that this is better than
> > > a gfp flag. Still quite bad though if you ask me.
> > > 
> > > > Given a choice between making users unhappy and making developers
> > > > unhappy, I will side with the users each and every time.
> > > 
> > > Well, let me rephrase. It is not only about me (as a developer) being
> > > unhappy but also all the side effects this would have for users when
> > > performance of their favorite workload declines for no apparent reason
> > > just because pcp caches are depleted by an unrelated process.
> > >
> > If depleted, we have a special worker that charge it. From the other hand,
> > the pcplist can be depleted by its nature, what _is_ not wrong. But just
> > in case we secure it since you had a concern about it.
> 
> pcp free lists should ever get empty when we run out of memory and need
> to reclaim. Otherwise they are constantly refilled/rebalanced on demand.
> The fact that you are refilling them from outside just suggest that you
> are operating on a wrong layer. Really, create your own pool of pages
> and rebalance them based on the workload.
> 
I covered it above.

> > Could you please specify a real test case or workload you are talking about?
> 
> I am not a performance expert but essentially any memory allocator heavy
> workload might notice. I am pretty sure Mel would tell you more.
> 
OK.

Thank you for your comments, Michal!

--
Vlad Rezki

Michal Hocko Sept. 22, 2020, 3:35 p.m. UTC | #9

On Tue 22-09-20 15:12:57, Uladzislau Rezki wrote:
[...]
> > Mimicing a similar implementation shouldn't be all that hard
> > and you will get your own pool which doesn't affect other page allocator
> > users as much as a bonus.
> > 
> I see your point Michal. As i mentioned before, it is important to avoid of
> having such own pools, because the aim is not to waste memory resources. A
> page will be returned back to "page allocator" as soon as a scheduler place  
> our reclaim thread on a CPU and grace period is passed. So, the resource
> can be used for other needs. What is important.
> 
> Otherwise a memory footprint is increased what is bad for low memory
> conditions when OOM is involved. Just in case, it is a big issue for
> mobile devices.

Really, how much memory are we talking about here? Do you have any
estimation? How many pointers do you need to store at once? 10k (that
would be 20 pages per cpu? Doesn't sound too big to me. But again I do
not know the scale here. Also if you really care you can fine tune this
pool based on demand. All that is not a rocket science and it can be
tuned outside of the page allocator rather than other way around.

We will not move forward without any specific numbers here I am afraid.

[...]

> > Would a similar scaling as the page allocator feasible. Really I mostly
> > do care about shared nature of the pcp allocator list that one user can
> > easily monopolize with this API.
> > 
> I see your concern. pcplist can be monopolized by already existing API:
> 
>     while (i < 100)
>         __get_free_page(GFP_NOWAIT | __GFP_NOWARN);

They will usually not, because even non-sleeping allocations will refill
them unless the memory is scarce and memory reclaim is needed. As
replied to Paul in other email, this is not a question of correctness.
It is a matter of shifting the overhead around.

> > > Single-argument details is here: https://lkml.org/lkml/2020/4/28/1626
> > 
> > Error 501
> > 
> Could you please elaborate? Do not want to speculate :)

It thrown 501 on me. lkml.org is quite unreliable. It works now. I will
read through that. Please use lore or lkml.kernel.org/r/$msg in future.

Paul E. McKenney Sept. 22, 2020, 3:46 p.m. UTC | #10

On Tue, Sep 22, 2020 at 10:03:06AM +0200, Michal Hocko wrote:
> On Mon 21-09-20 20:35:53, Paul E. McKenney wrote:
> > On Mon, Sep 21, 2020 at 06:03:18PM +0200, Michal Hocko wrote:
> > > On Mon 21-09-20 08:45:58, Paul E. McKenney wrote:
> > > > On Mon, Sep 21, 2020 at 09:47:16AM +0200, Michal Hocko wrote:
> > > > > On Fri 18-09-20 21:48:15, Uladzislau Rezki (Sony) wrote:
> > > > > [...]
> > > > > > Proposal
> > > > > > ========
> > > > > > Introduce a lock-free function that obtain a page from the per-cpu-lists
> > > > > > on current CPU. It returns NULL rather than acquiring any non-raw spinlock.
> > > > > 
> > > > > I was not happy about this solution when we have discussed this
> > > > > last time and I have to say I am still not happy. This is exposing
> > > > > an internal allocator optimization and allows a hard to estimate
> > > > > consumption of pcp free pages. IIUC this run on pcp cache can be
> > > > > controled directly from userspace (close(open) loop IIRC) which makes it
> > > > > even bigger no-no.
> > > > 
> > > > Yes, I do well remember that you are unhappy with this approach.
> > > > Unfortunately, thus far, there is no solution that makes all developers
> > > > happy.  You might be glad to hear that we are also looking into other
> > > > solutions, each of which makes some other developers unhappy.  So we
> > > > are at least not picking on you alone.  :-/
> > > 
> > > No worries I do not feel like a whipping boy here. But do expect me to
> > > argue against the approach. I would also appreciate it if there was some
> > > more information on other attempts, why they have failed. E.g. why
> > > pre-allocation is not an option that works well enough in most
> > > reasonable workloads. I would also appreciate some more thoughts why we
> > > need to optimize for heavy abusers of RCU (like close(open) extremes).
> > 
> > Not optimizing for them, but rather defending against them.  Uladzislau
> > gave the example of low-memory phones.  And we have quite the array
> > of defenses against other userspace bugs including MMUs, the "limit"
> > command, and so on.
> > 
> > There have been quite a few successful attempts, starting from the
> > introduction of blimit and RCU-bh more than 15 years ago, continuing
> > through making call_rcu() jump-start grace periods, IPIing reluctant
> > CPUs, tuning RCU callback offloading, and many others.  But these prior
> > approaches have only taken us so far.
> 
> Thank you this is useful. It gives me some idea about the problem but I
> still cannot picture how serious the problem really is. Is this a DoS
> like scenario where an unprivileged task is able to monopolize the
> system/CPU to do all the RCU heavy lifting? Are other RCU users deprived
> from doing their portion? How much expediting helps? Does it really
> solve the problem or it merely shifts it around as you will have hard
> time to keep up with more and more tasks to hit the same path and the
> pre-allocated memory is a very finite resource.

Yes, this situation can result in an unprivileged task being able to
DoS some or even all of the system.  Which is why I have been pushing
on this despite resistance.

And again yes, expediting can help and RCU does in fact take
auto-expediting steps when any CPU's callback list exceeds 10,000
elements.  These auto-expediting changes have also been made over a
15-year time period.  Joel Fernandes is working on a patch series that
will allow RCU to better choose when to auto-expedite and when to favor
speedy RCU callback invocation.  Currently RCU cannot tell and so just
does both whenever it starts getting into trouble.

Both auto-expediting and speedy RCU callback invocation help the memory
allocator by freeing memory more quickly, thus reducing the system's
memory footprint.  Even more important, these help the RCU callback
invocation (and thus freeing of memory) to better keep up with the
allocation in the various difficult situations, for example, the tight
userspace loop around close(open()).

> > Other approaches under consideration include making CONFIG_PREEMPT_COUNT
> > unconditional and thus allowing call_rcu() and kvfree_rcu() to determine
> > whether direct calls to the allocator are safe (some guy named Linus
> > doesn't like this one),
> 
> I assume that the primary argument is the overhead, right? Do you happen
> to have any reference?

Jon Corbet wrote a very nice article summarizing the current situation:
https://lwn.net/Articles/831678/.  Thomas's measurements show no visible
system-level performance impact.  I will let Uladzislau present his more
microbenchmarky performance work.

We are pursuing this unconditional CONFIG_PREEMPT_COUNT approach as well,
in keeping with our current de-facto policy of irritating everyone across
the full expanse of the Linux-kernel source tree.  :-/

> > preallocation (Uladzislau covered this, and
> > the amount that would need to be preallocated is excessive), deferring
> > allocation to RCU_SOFTIRQ (this would also need CONFIG_PREEMPT_COUNT),
> > and deferring to some clean context (which is the best we can do within
> > the confines of RCU, but which can have issues with delay).
> 
> I have already replied to that so let's not split the discussion into
> several fronts.

I will look at other emails.

> > So it is not the need to address this general problem that is new.
> > Far from it!  What is new is the need for changes outside of RCU.
> > 
> > > > > I strongly agree with Thomas http://lkml.kernel.org/r/87tux4kefm.fsf@nanos.tec.linutronix.de
> > > > > that this optimization is not aiming at reasonable workloads. Really, go
> > > > > with pre-allocated buffer and fallback to whatever slow path you have
> > > > > already. Exposing more internals of the allocator is not going to do any
> > > > > good for long term maintainability.
> > > > 
> > > > I suggest that you carefully re-read the thread following that email.
> > > 
> > > I clearly remember Thomas not being particularly happy that you optimize
> > > for a corner case. I do not remember there being a consensus that this
> > > is the right approach. There was some consensus that this is better than
> > > a gfp flag. Still quite bad though if you ask me.
> > 
> > Again, this "optimization" is for robustness more than raw speed.
> > 
> > > > Given a choice between making users unhappy and making developers
> > > > unhappy, I will side with the users each and every time.
> > > 
> > > Well, let me rephrase. It is not only about me (as a developer) being
> > > unhappy but also all the side effects this would have for users when
> > > performance of their favorite workload declines for no apparent reason
> > > just because pcp caches are depleted by an unrelated process.
> > 
> > But in the close(open()) case, wouldn't the allocations on the open()
> > side refill those caches?
> 
> Yes pcp lists will eventually get replenished. This is not a correctness
> problem I am pointing out. It is the fact that any user of this new
> interface can monopolize the _global_ pool which would affect all other
> users of the pool.

Understood.

However, the kvfree_rcu() and call_rcu() allocations when under stress
will be about one page per 500 objects on 64-bit systems.  These 500
objects are there to begin with.  So it is not at all clear to me that
this additional one-page-per-500-objects memory overhead is significant.

What about when not under stress?  In that case, the absolute number of
additional objects allocated will be small, so it is once again not at
all clear to me that this additional memory overhead is significant.

							Thanx, Paul

Paul E. McKenney Sept. 22, 2020, 3:49 p.m. UTC | #11

On Tue, Sep 22, 2020 at 09:50:02AM +0200, Michal Hocko wrote:
> [Cc Mel - the thread starts http://lkml.kernel.org/r/20200918194817.48921-1-urezki@gmail.com]
> 
> On Mon 21-09-20 21:48:19, Uladzislau Rezki wrote:
> > Hello, Michal.
> > 
> > > >
> > > > Yes, I do well remember that you are unhappy with this approach.
> > > > Unfortunately, thus far, there is no solution that makes all developers
> > > > happy.  You might be glad to hear that we are also looking into other
> > > > solutions, each of which makes some other developers unhappy.  So we
> > > > are at least not picking on you alone.  :-/
> > > 
> > > No worries I do not feel like a whipping boy here. But do expect me to
> > > argue against the approach. I would also appreciate it if there was some
> > > more information on other attempts, why they have failed. E.g. why
> > > pre-allocation is not an option that works well enough in most
> > > reasonable workloads.
> > Pre-allocating has some drawbacks:
> > 
> > a) It is impossible to predict how many pages will be required to
> >    cover a demand that is controlled by different workloads on
> >    various systems.
> 
> Yes, this is not trivial but not a rocket science either. Remember that
> you are relying on a very dumb watermark based pcp pool from the
> allocator. Mimicing a similar implementation shouldn't be all that hard
> and you will get your own pool which doesn't affect other page allocator
> users as much as a bonus.
> 
> > b) Memory overhead since we do not know how much pages should be
> >    preloaded: 100, 200 or 300
> 
> Does anybody who really needs this optimization actually cares about 300
> pages?

That would be 100-300 (maybe more) pages -per- -CPU-, so yes, some people
will care quite deeply about this.

							Thanx, Paul

> > As for memory overhead, it is important to reduce it because of
> > embedded devices like phones, where a low memory condition is a
> > big issue. In that sense pre-allocating is something that we strongly
> > would like to avoid.
> 
> How big "machines" are we talking about here? I would expect that really
> tiny machines would have hard times to really fill up thousands of pages
> with pointers to free...
> 
> Would a similar scaling as the page allocator feasible. Really I mostly
> do care about shared nature of the pcp allocator list that one user can
> easily monopolize with this API.
> 
> > > I would also appreciate some more thoughts why we
> > > need to optimize for heavy abusers of RCU (like close(open) extremes).
> > > 
> > I think here is a small misunderstanding. Please note, that is not only
> > about performance and corner cases. There is a single argument support
> > of the kvfree_rcu(ptr), where maintaining an array in time is needed.
> > The fallback of the single argument case is extrimely slow.
> 
> This should be part of the changelog.
> > 
> > Single-argument details is here: https://lkml.org/lkml/2020/4/28/1626
> 
> Error 501
> 
> > > > > I strongly agree with Thomas http://lkml.kernel.org/r/87tux4kefm.fsf@nanos.tec.linutronix.de
> > > > > that this optimization is not aiming at reasonable workloads. Really, go
> > > > > with pre-allocated buffer and fallback to whatever slow path you have
> > > > > already. Exposing more internals of the allocator is not going to do any
> > > > > good for long term maintainability.
> > > > 
> > > > I suggest that you carefully re-read the thread following that email.
> > > 
> > > I clearly remember Thomas not being particularly happy that you optimize
> > > for a corner case. I do not remember there being a consensus that this
> > > is the right approach. There was some consensus that this is better than
> > > a gfp flag. Still quite bad though if you ask me.
> > > 
> > > > Given a choice between making users unhappy and making developers
> > > > unhappy, I will side with the users each and every time.
> > > 
> > > Well, let me rephrase. It is not only about me (as a developer) being
> > > unhappy but also all the side effects this would have for users when
> > > performance of their favorite workload declines for no apparent reason
> > > just because pcp caches are depleted by an unrelated process.
> > >
> > If depleted, we have a special worker that charge it. From the other hand,
> > the pcplist can be depleted by its nature, what _is_ not wrong. But just
> > in case we secure it since you had a concern about it.
> 
> pcp free lists should ever get empty when we run out of memory and need
> to reclaim. Otherwise they are constantly refilled/rebalanced on demand.
> The fact that you are refilling them from outside just suggest that you
> are operating on a wrong layer. Really, create your own pool of pages
> and rebalance them based on the workload.
> 
> > Could you please specify a real test case or workload you are talking about?
> 
> I am not a performance expert but essentially any memory allocator heavy
> workload might notice. I am pretty sure Mel would tell you more.
> 
> -- 
> Michal Hocko
> SUSE Labs

Mel Gorman Sept. 23, 2020, 10:37 a.m. UTC | #12

On Tue, Sep 22, 2020 at 03:12:57PM +0200, Uladzislau Rezki wrote:
> > > > > Yes, I do well remember that you are unhappy with this approach.
> > > > > Unfortunately, thus far, there is no solution that makes all developers
> > > > > happy.  You might be glad to hear that we are also looking into other
> > > > > solutions, each of which makes some other developers unhappy.  So we
> > > > > are at least not picking on you alone.  :-/
> > > > 
> > > > No worries I do not feel like a whipping boy here. But do expect me to
> > > > argue against the approach. I would also appreciate it if there was some
> > > > more information on other attempts, why they have failed. E.g. why
> > > > pre-allocation is not an option that works well enough in most
> > > > reasonable workloads.
> > > Pre-allocating has some drawbacks:
> > > 
> > > a) It is impossible to predict how many pages will be required to
> > >    cover a demand that is controlled by different workloads on
> > >    various systems.
> > 
> > Yes, this is not trivial but not a rocket science either. Remember that
> > you are relying on a very dumb watermark based pcp pool from the
> > allocator.
> >
> We rely on it, indeed. If the pcp-cache is depleted our special work is
> triggered to charge our local cache(few pages) such way will also initiate
> the process of pre-featching pages from the buddy allocator populating
> the depleted pcp-cache. I do not have any concern here.
> 

It can interfere with ATOMIC allocations in critical paths in extreme
circumstances as it potentially puts increased pressure on the emergency
reserve as watermarks are bypassed. That adds to the risk of a functional
failuure if reclaim fails to make progress.  The number of pages are likely
to be limited and unpredictable. As it uses any PCP type, it potentially
causes fragmention issues. For the last point, the allocations may be
transient in the RCU case now but not guaranteed forever. As the API is
in gfp.h, it's open to abuse so the next guy that comes along and thinks
"I am critical no matter what the name says" will cause problems. While
you could argue that would be caught in review, plenty of GFP flag abuses
made it through review.

Fundamentally, this is simply shifting the problem from RCU to the page
allocator because of the locking arrangements and hazard of acquiring zone
lock is a raw spinlock is held on RT. It does not even make the timing
predictable as an empty PCU list (for example, a full drain in low memory
situations) may mean the emergency path is hit anyway. About all it changes
is the timing of when the emergency path is hit in some circumstances --
it's not fixing the problem, it's simply changing the shape.

> >
> > Mimicing a similar implementation shouldn't be all that hard
> > and you will get your own pool which doesn't affect other page allocator
> > users as much as a bonus.
> > 
> I see your point Michal. As i mentioned before, it is important to avoid of
> having such own pools, because the aim is not to waste memory resources. A
> page will be returned back to "page allocator" as soon as a scheduler place  
> our reclaim thread on a CPU and grace period is passed. So, the resource
> can be used for other needs. What is important.
> 

As the emergency path and synchronising can be hit no matter what, why
not increase the pool temporarily after the emergency path is hit and
shrink it again later if necessary?

> Otherwise a memory footprint is increased what is bad for low memory
> conditions when OOM is involved.

OOM would only be a major factor if the size of the pools meant the
machine could not even operate or at least was severely degraded. However,
depleting the PCPU lists for RCU may slow kswapd making reclaim progress
and cause an OOM in itself, or at least an intervention by a userspace
monitor that kills non-critical applications in the background when memory
pressure exists.

> > > As for memory overhead, it is important to reduce it because of
> > > embedded devices like phones, where a low memory condition is a
> > > big issue. In that sense pre-allocating is something that we strongly
> > > would like to avoid.
> > 
> > How big "machines" are we talking about here? I would expect that really
> > tiny machines would have hard times to really fill up thousands of pages
> > with pointers to free...
> > 
> I mentioned above. We can not rely on static model. We would like to
> have a mechanism that gives back ASAP used pages to page allocator
> for other needs.
> 

After an emergency, temporarily increase the size of the pool to avoid
hitting the emergency path again in the near future.

> >
> > Would a similar scaling as the page allocator feasible. Really I mostly
> > do care about shared nature of the pcp allocator list that one user can
> > easily monopolize with this API.
> > 
> I see your concern. pcplist can be monopolized by already existing API:
> 
>     while (i < 100)
>         __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> 

That's not the same class of abuse as it can go to the buddy lists to
refill the correct PCP lists, avoid fragmentation issues, obeys watermarks
and wakes kswapd if it's not awake already.

Uladzislau Rezki Sept. 23, 2020, 11:27 a.m. UTC | #13

> > > Other approaches under consideration include making CONFIG_PREEMPT_COUNT
> > > unconditional and thus allowing call_rcu() and kvfree_rcu() to determine
> > > whether direct calls to the allocator are safe (some guy named Linus
> > > doesn't like this one),
> > 
> > I assume that the primary argument is the overhead, right? Do you happen
> > to have any reference?
> 
> Jon Corbet wrote a very nice article summarizing the current situation:
> https://lwn.net/Articles/831678/.  Thomas's measurements show no visible
> system-level performance impact.  I will let Uladzislau present his more
> microbenchmarky performance work.
> 
I have done some analysis of the !PREEMPT kernel with and without PREEMPT_COUNT
configuration. The aim is to show a performance impact if the PREEMPT_COUNT is
unconditionally enabled.

As for the test i used the refscale kernel module, that does:

<snip>
static void ref_rcu_read_section(const int nloops)
{
 int i;

 for (i = nloops; i >= 0; i--) {
  rcu_read_lock();
  rcu_read_unlock();
 }
}
<snip>

How to run the microbenchmark:

<snip>
urezki@pc638:~$ sudo modprobe refscale
<snip>

The below is an average duration per loop (nanoseconds):

  !PREEMPT_COUNT            PREEMPT_COUNT
 Runs     Time(ns)         Runc     Time(ns)
 1        109.640          1        99.915
 2        102.303          2        111.106
 3        90.520           3        98.713
 4        106.347          4        111.239
 5        108.374          5        111.797
 6        108.012          6        111.558
 7        103.989          7        113.122
 8        106.194          8        111.515
 9        107.330          9        107.559
 10       105.877          10       105.965
 11       104.860          11       104.835
 12       104.299          12       106.342
 13       104.794          13       106.664
 14       104.916          14       104.914
 15       105.485          15       104.280
 16       104.610          16       105.642
 17       104.981          17       105.646
 18       103.089          18       106.370
 19       105.251          19       105.284
 20       104.133          20       105.973
 21       105.589          21       105.271
 22       104.154          22       106.063
 23       104.963          23       106.248
 24       102.431          24       105.568
 25       102.610          25       105.556
 26       103.474          26       105.655
 27       100.194          27       102.887
 28       102.340          28       104.347
 29       102.075          29       102.389
 30       102.808          30       103.123

The difference is ~1.8% in average. The maximum value is 109.640 vs 113.122
The minimum value is 90.520 vs 98.713.

Tested on:
    processor       : 63
    vendor_id       : AuthenticAMD
    cpu family      : 6
    model           : 6
    model name      : QEMU Virtual CPU version 2.5+
    cpu MHz         : 3700.204

I also can do more detailed testing using "perf" tool.

--
Vlad Rezki

Paul E. McKenney Sept. 23, 2020, 3:41 p.m. UTC | #14

On Wed, Sep 23, 2020 at 11:37:06AM +0100, Mel Gorman wrote:
> On Tue, Sep 22, 2020 at 03:12:57PM +0200, Uladzislau Rezki wrote:
> > > > > > Yes, I do well remember that you are unhappy with this approach.
> > > > > > Unfortunately, thus far, there is no solution that makes all developers
> > > > > > happy.  You might be glad to hear that we are also looking into other
> > > > > > solutions, each of which makes some other developers unhappy.  So we
> > > > > > are at least not picking on you alone.  :-/
> > > > > 
> > > > > No worries I do not feel like a whipping boy here. But do expect me to
> > > > > argue against the approach. I would also appreciate it if there was some
> > > > > more information on other attempts, why they have failed. E.g. why
> > > > > pre-allocation is not an option that works well enough in most
> > > > > reasonable workloads.
> > > > Pre-allocating has some drawbacks:
> > > > 
> > > > a) It is impossible to predict how many pages will be required to
> > > >    cover a demand that is controlled by different workloads on
> > > >    various systems.
> > > 
> > > Yes, this is not trivial but not a rocket science either. Remember that
> > > you are relying on a very dumb watermark based pcp pool from the
> > > allocator.
> > >
> > We rely on it, indeed. If the pcp-cache is depleted our special work is
> > triggered to charge our local cache(few pages) such way will also initiate
> > the process of pre-featching pages from the buddy allocator populating
> > the depleted pcp-cache. I do not have any concern here.
> 
> It can interfere with ATOMIC allocations in critical paths in extreme
> circumstances as it potentially puts increased pressure on the emergency
> reserve as watermarks are bypassed. That adds to the risk of a functional
> failuure if reclaim fails to make progress.  The number of pages are likely
> to be limited and unpredictable. As it uses any PCP type, it potentially
> causes fragmention issues. For the last point, the allocations may be
> transient in the RCU case now but not guaranteed forever. As the API is
> in gfp.h, it's open to abuse so the next guy that comes along and thinks
> "I am critical no matter what the name says" will cause problems. While
> you could argue that would be caught in review, plenty of GFP flag abuses
> made it through review.
> 
> Fundamentally, this is simply shifting the problem from RCU to the page
> allocator because of the locking arrangements and hazard of acquiring zone
> lock is a raw spinlock is held on RT. It does not even make the timing
> predictable as an empty PCU list (for example, a full drain in low memory
> situations) may mean the emergency path is hit anyway. About all it changes
> is the timing of when the emergency path is hit in some circumstances --
> it's not fixing the problem, it's simply changing the shape.

All good points!

On the other hand, duplicating a portion of the allocator functionality
within RCU increases the amount of reserved memory, and needlessly most
of the time.

Is there some way that we can locklessly allocate memory, but return
failure instead of running down the emergency pool?  A change to the loop
that iterates over the migration types?  Or to the loop that iterates
over the zones?  Something else?

> > > Mimicing a similar implementation shouldn't be all that hard
> > > and you will get your own pool which doesn't affect other page allocator
> > > users as much as a bonus.
> > > 
> > I see your point Michal. As i mentioned before, it is important to avoid of
> > having such own pools, because the aim is not to waste memory resources. A
> > page will be returned back to "page allocator" as soon as a scheduler place  
> > our reclaim thread on a CPU and grace period is passed. So, the resource
> > can be used for other needs. What is important.
> 
> As the emergency path and synchronising can be hit no matter what, why
> not increase the pool temporarily after the emergency path is hit and
> shrink it again later if necessary?

If I understand what you are suggesting, this is in fact what Uladzislau's
prototyped commit 8c0a1269709d ("rcu/tree: Add a work to allocate
pages from regular context") on the -rcu "dev" branch is intended to do.
The issue, as Uladislau noted above, is that scheduler delays can prevent
these pool-increase actions until the point at which there is no memory.

> > Otherwise a memory footprint is increased what is bad for low memory
> > conditions when OOM is involved.
> 
> OOM would only be a major factor if the size of the pools meant the
> machine could not even operate or at least was severely degraded. However,
> depleting the PCPU lists for RCU may slow kswapd making reclaim progress
> and cause an OOM in itself, or at least an intervention by a userspace
> monitor that kills non-critical applications in the background when memory
> pressure exists.

When under emergency conditions, we have one page allocated per 500
objects passed to kvfree_rcu().  So the increase in total allocated
memory load due to this emergency path is quite small.

> > > > As for memory overhead, it is important to reduce it because of
> > > > embedded devices like phones, where a low memory condition is a
> > > > big issue. In that sense pre-allocating is something that we strongly
> > > > would like to avoid.
> > > 
> > > How big "machines" are we talking about here? I would expect that really
> > > tiny machines would have hard times to really fill up thousands of pages
> > > with pointers to free...
> > > 
> > I mentioned above. We can not rely on static model. We would like to
> > have a mechanism that gives back ASAP used pages to page allocator
> > for other needs.
> 
> After an emergency, temporarily increase the size of the pool to avoid
> hitting the emergency path again in the near future.

By which time we might well already be in OOM territory.  The emergency
situations can ramp up very quickly.

> > > Would a similar scaling as the page allocator feasible. Really I mostly
> > > do care about shared nature of the pcp allocator list that one user can
> > > easily monopolize with this API.
> > > 
> > I see your concern. pcplist can be monopolized by already existing API:
> > 
> >     while (i < 100)
> >         __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> 
> That's not the same class of abuse as it can go to the buddy lists to
> refill the correct PCP lists, avoid fragmentation issues, obeys watermarks
> and wakes kswapd if it's not awake already.

Good point, and we did try doing it this way.  Unfortunately, in current
!PREEMPT kernels, this approach can deadlock on one of the allocator
locks via call_rcu().  This experience caused us to look at lockless
allocator access.  It also inspired the unconditional PREEMPT_COUNT
approach, but that has its own detractors.  (Yes, we are of course still
persuing it as well.)

							Thanx, Paul

Mel Gorman Sept. 23, 2020, 11:22 p.m. UTC | #15

On Wed, Sep 23, 2020 at 08:41:05AM -0700, Paul E. McKenney wrote:
> > Fundamentally, this is simply shifting the problem from RCU to the page
> > allocator because of the locking arrangements and hazard of acquiring zone
> > lock is a raw spinlock is held on RT. It does not even make the timing
> > predictable as an empty PCU list (for example, a full drain in low memory
> > situations) may mean the emergency path is hit anyway. About all it changes
> > is the timing of when the emergency path is hit in some circumstances --
> > it's not fixing the problem, it's simply changing the shape.
> 
> All good points!
> 
> On the other hand, duplicating a portion of the allocator functionality
> within RCU increases the amount of reserved memory, and needlessly most
> of the time.
> 

But it's very similar to what mempools are for.

> Is there some way that we can locklessly allocate memory, but return
> failure instead of running down the emergency pool? A change to the loop
> that iterates over the migration types?  Or to the loop that iterates
> over the zones?  Something else?
> 

Only by duplicating some of the logic in get_page_from_freelist would
protect the reserves. Even if you had that, the contents of the pcp
pools can easily be zero for the local CPU and you still get stuck.

> > > > Mimicing a similar implementation shouldn't be all that hard
> > > > and you will get your own pool which doesn't affect other page allocator
> > > > users as much as a bonus.
> > > > 
> > > I see your point Michal. As i mentioned before, it is important to avoid of
> > > having such own pools, because the aim is not to waste memory resources. A
> > > page will be returned back to "page allocator" as soon as a scheduler place  
> > > our reclaim thread on a CPU and grace period is passed. So, the resource
> > > can be used for other needs. What is important.
> > 
> > As the emergency path and synchronising can be hit no matter what, why
> > not increase the pool temporarily after the emergency path is hit and
> > shrink it again later if necessary?
> 
> If I understand what you are suggesting, this is in fact what Uladzislau's
> prototyped commit 8c0a1269709d ("rcu/tree: Add a work to allocate
> pages from regular context") on the -rcu "dev" branch is intended to do.
> The issue, as Uladislau noted above, is that scheduler delays can prevent
> these pool-increase actions until the point at which there is no memory.
> 

Scheduler latency would be a problem. You would have to keep an emergency
"rescue" pool that is enough to make slow progress even if no other memory
is available until the worker takes action. The worker that allocates
pages from regular context would be to increase the pool size enough so
the emergency reserve does not have to be used again in the near future.

> > > Otherwise a memory footprint is increased what is bad for low memory
> > > conditions when OOM is involved.
> > 
> > OOM would only be a major factor if the size of the pools meant the
> > machine could not even operate or at least was severely degraded. However,
> > depleting the PCPU lists for RCU may slow kswapd making reclaim progress
> > and cause an OOM in itself, or at least an intervention by a userspace
> > monitor that kills non-critical applications in the background when memory
> > pressure exists.
> 
> When under emergency conditions, we have one page allocated per 500
> objects passed to kvfree_rcu().  So the increase in total allocated
> memory load due to this emergency path is quite small.
> 

Fair enough but any solution that depends on the PCP having even one
page available is going to need a fallback option of some sort.

> > > > > As for memory overhead, it is important to reduce it because of
> > > > > embedded devices like phones, where a low memory condition is a
> > > > > big issue. In that sense pre-allocating is something that we strongly
> > > > > would like to avoid.
> > > > 
> > > > How big "machines" are we talking about here? I would expect that really
> > > > tiny machines would have hard times to really fill up thousands of pages
> > > > with pointers to free...
> > > > 
> > > I mentioned above. We can not rely on static model. We would like to
> > > have a mechanism that gives back ASAP used pages to page allocator
> > > for other needs.
> > 
> > After an emergency, temporarily increase the size of the pool to avoid
> > hitting the emergency path again in the near future.
> 
> By which time we might well already be in OOM territory.  The emergency
> situations can ramp up very quickly.
> 

And if the pcp lists are empty, this problem still exists -- pretty much
anything that depends on the pcp lists always having pages is going to
have a failure case.

> > > > Would a similar scaling as the page allocator feasible. Really I mostly
> > > > do care about shared nature of the pcp allocator list that one user can
> > > > easily monopolize with this API.
> > > > 
> > > I see your concern. pcplist can be monopolized by already existing API:
> > > 
> > >     while (i < 100)
> > >         __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > 
> > That's not the same class of abuse as it can go to the buddy lists to
> > refill the correct PCP lists, avoid fragmentation issues, obeys watermarks
> > and wakes kswapd if it's not awake already.
> 
> Good point, and we did try doing it this way.  Unfortunately, in current
> !PREEMPT kernels, this approach can deadlock on one of the allocator
> locks via call_rcu().  This experience caused us to look at lockless
> allocator access.  It also inspired the unconditional PREEMPT_COUNT
> approach, but that has its own detractors.  (Yes, we are of course still
> persuing it as well.)
> 

I didn't keep up to date unfortunately but pretty much anything that
needs guaranteed access to pages has to maintain a mempool of some sort.
Even if the page allocator had a common pool for emergency use, it would
be subject to abuse because all it would take is one driver or subsystem
deciding that it was special enough to deplete it. For example, the ml4
driver decided that it should declare itself to be a "memory allocator"
like kswapd for iSCSI (it claims swap-over-nfs, but swap-over-nfs was
only meant to apply when a swapfile was backed by a network fs).

Uladzislau Rezki Sept. 24, 2020, 8:16 a.m. UTC | #16

> On Wed, Sep 23, 2020 at 08:41:05AM -0700, Paul E. McKenney wrote:
> > > Fundamentally, this is simply shifting the problem from RCU to the page
> > > allocator because of the locking arrangements and hazard of acquiring zone
> > > lock is a raw spinlock is held on RT. It does not even make the timing
> > > predictable as an empty PCU list (for example, a full drain in low memory
> > > situations) may mean the emergency path is hit anyway. About all it changes
> > > is the timing of when the emergency path is hit in some circumstances --
> > > it's not fixing the problem, it's simply changing the shape.
> > 
> > All good points!
> > 
> > On the other hand, duplicating a portion of the allocator functionality
> > within RCU increases the amount of reserved memory, and needlessly most
> > of the time.
> > 
> 
> But it's very similar to what mempools are for.
> 
As for dynamic caching or mempools. It requires extra logic on top of RCU
to move things forward and it might be not efficient way. As a side
effect, maintaining of the bulk arrays in the separate worker thread
will introduce other drawbacks:

 a) There is an extra latency window, a time during which a fallback
    mechanism is used until pages are obtained via the special
    worker for further pointers collecting over arrays.

 b) It is impossible to predict how many pages will be required to
    cover a demand that is controlled by different workloads on
    various systems. It would require a rough value.

 c) extra memory footprint.

Why a special worker is required. It is because non-sleeping flags, like
GFP_ATOMIC or GFP_NOWAIT apparently can be sleeping what is not supposed.
Therefore we proposed lock-less flag as a first step, and later a lock-less
function.

Other option is if we had unconditionally enabled PREEMPT_COUNT config.
It would be easy to identify a context type and invoke a page allocator
if a context is preemtale. But as of now preemptable() is "half" working.
Thomas uploaded patches to make it unconditional. But it can be blocked.

If both are not accepted, we will need to workaround it with dynamic
caching, at least for !PREEMPT kernel.

> > Is there some way that we can locklessly allocate memory, but return
> > failure instead of running down the emergency pool? A change to the loop
> > that iterates over the migration types?  Or to the loop that iterates
> > over the zones?  Something else?
> > 
> 
> Only by duplicating some of the logic in get_page_from_freelist would
> protect the reserves. Even if you had that, the contents of the pcp
> pools can easily be zero for the local CPU and you still get stuck.
> 
We have a special worker that is triggered when a lock-less alloc fails.
The worker we have, by requesting a new page will also initiate an internal
process that prefetches specified number of elements from the buddy allocator
populating the "pcplist" by new fresh pages.

> > > > > Mimicing a similar implementation shouldn't be all that hard
> > > > > and you will get your own pool which doesn't affect other page allocator
> > > > > users as much as a bonus.
> > > > > 
> > > > I see your point Michal. As i mentioned before, it is important to avoid of
> > > > having such own pools, because the aim is not to waste memory resources. A
> > > > page will be returned back to "page allocator" as soon as a scheduler place  
> > > > our reclaim thread on a CPU and grace period is passed. So, the resource
> > > > can be used for other needs. What is important.
> > > 
> > > As the emergency path and synchronising can be hit no matter what, why
> > > not increase the pool temporarily after the emergency path is hit and
> > > shrink it again later if necessary?
> > 
> > If I understand what you are suggesting, this is in fact what Uladzislau's
> > prototyped commit 8c0a1269709d ("rcu/tree: Add a work to allocate
> > pages from regular context") on the -rcu "dev" branch is intended to do.
> > The issue, as Uladislau noted above, is that scheduler delays can prevent
> > these pool-increase actions until the point at which there is no memory.
> > 
> 
> Scheduler latency would be a problem. You would have to keep an emergency
> "rescue" pool that is enough to make slow progress even if no other memory
> is available until the worker takes action. The worker that allocates
> pages from regular context would be to increase the pool size enough so
> the emergency reserve does not have to be used again in the near future.
> 
The key point is "enough". We need pages to make a) fast progress b) support
single argument of kvfree_rcu(one_arg). Not vice versa. That "enough" depends
on scheduler latency and vague pre-allocated number of pages, it might
be not enough what would require to refill it more and more or we can overshoot
that would lead to memory overhead. So we have here timing issues and
not accurate model. IMHO.

> > > > Otherwise a memory footprint is increased what is bad for low memory
> > > > conditions when OOM is involved.
> > > 
> > > OOM would only be a major factor if the size of the pools meant the
> > > machine could not even operate or at least was severely degraded. However,
> > > depleting the PCPU lists for RCU may slow kswapd making reclaim progress
> > > and cause an OOM in itself, or at least an intervention by a userspace
> > > monitor that kills non-critical applications in the background when memory
> > > pressure exists.
> > 
> > When under emergency conditions, we have one page allocated per 500
> > objects passed to kvfree_rcu().  So the increase in total allocated
> > memory load due to this emergency path is quite small.
> > 
> 
> Fair enough but any solution that depends on the PCP having even one
> page available is going to need a fallback option of some sort.
> 
We have a fallback mechanism.

> > > > > > As for memory overhead, it is important to reduce it because of
> > > > > > embedded devices like phones, where a low memory condition is a
> > > > > > big issue. In that sense pre-allocating is something that we strongly
> > > > > > would like to avoid.
> > > > > 
> > > > > How big "machines" are we talking about here? I would expect that really
> > > > > tiny machines would have hard times to really fill up thousands of pages
> > > > > with pointers to free...
> > > > > 
> > > > I mentioned above. We can not rely on static model. We would like to
> > > > have a mechanism that gives back ASAP used pages to page allocator
> > > > for other needs.
> > > 
> > > After an emergency, temporarily increase the size of the pool to avoid
> > > hitting the emergency path again in the near future.
> > 
> > By which time we might well already be in OOM territory.  The emergency
> > situations can ramp up very quickly.
> > 
> 
> And if the pcp lists are empty, this problem still exists -- pretty much
> anything that depends on the pcp lists always having pages is going to
> have a failure case.
> 
But it is enough just to trigger __get_free_page() on that CPU, so the PCP
will be populated. What we do in a separate worker.

> > > > > Would a similar scaling as the page allocator feasible. Really I mostly
> > > > > do care about shared nature of the pcp allocator list that one user can
> > > > > easily monopolize with this API.
> > > > > 
> > > > I see your concern. pcplist can be monopolized by already existing API:
> > > > 
> > > >     while (i < 100)
> > > >         __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > > 
> > > That's not the same class of abuse as it can go to the buddy lists to
> > > refill the correct PCP lists, avoid fragmentation issues, obeys watermarks
> > > and wakes kswapd if it's not awake already.
> > 
> > Good point, and we did try doing it this way.  Unfortunately, in current
> > !PREEMPT kernels, this approach can deadlock on one of the allocator
> > locks via call_rcu().  This experience caused us to look at lockless
> > allocator access.  It also inspired the unconditional PREEMPT_COUNT
> > approach, but that has its own detractors.  (Yes, we are of course still
> > persuing it as well.)
> > 
> 
> I didn't keep up to date unfortunately but pretty much anything that
> needs guaranteed access to pages has to maintain a mempool of some sort.
> Even if the page allocator had a common pool for emergency use, it would
> be subject to abuse because all it would take is one driver or subsystem
> deciding that it was special enough to deplete it. For example, the ml4
> driver decided that it should declare itself to be a "memory allocator"
> like kswapd for iSCSI (it claims swap-over-nfs, but swap-over-nfs was
> only meant to apply when a swapfile was backed by a network fs).
> 
We do not want to have many different memory allocators for sure :)

As it was noted above. GFP_ATOMIC/GFP_NOWAIT can sleep what is a problem
for atomic context. That is why we have proposed a lock-less way of getting
a page and you are not happy with that :) Other option is unconditionally
enable PREEMPT_COUNT config. That would help a lot not only for RCU but
also for others if they can use of GFP_ATOMIC/GFP_NOWAIT and similar.

Appreciate for comments.

--
Vlad Rezki

Peter Zijlstra Sept. 24, 2020, 11:16 a.m. UTC | #17

On Thu, Sep 24, 2020 at 10:16:14AM +0200, Uladzislau Rezki wrote:
> Other option is if we had unconditionally enabled PREEMPT_COUNT config.
> It would be easy to identify a context type and invoke a page allocator
> if a context is preemtale. But as of now preemptable() is "half" working.
> Thomas uploaded patches to make it unconditional. But it can be blocked.

While I in principle support Thomas' patch, I think this is an abuse and
still complete wrong.

Not all preemptible() context can deal with GFP_KERNEL (GFP_NOFS
existing should be a clue). !preemptible context cannot unconditionally
deal with GFP_NOWAIT, consider using it while holding zone->lock.

The preemption context does not say antying much useful about the
allocation context.

Peter Zijlstra Sept. 24, 2020, 11:19 a.m. UTC | #18

On Thu, Sep 24, 2020 at 10:16:14AM +0200, Uladzislau Rezki wrote:
> The key point is "enough". We need pages to make a) fast progress b) support
> single argument of kvfree_rcu(one_arg). Not vice versa. That "enough" depends
> on scheduler latency and vague pre-allocated number of pages, it might
> be not enough what would require to refill it more and more or we can overshoot
> that would lead to memory overhead. So we have here timing issues and
> not accurate model. IMHO.

I'm firmly opposed to the single argument kvfree_rcu() idea, that's
requiring memory to free memory.

Uladzislau Rezki Sept. 24, 2020, 3:16 p.m. UTC | #19

On Thu, Sep 24, 2020 at 01:16:32PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 24, 2020 at 10:16:14AM +0200, Uladzislau Rezki wrote:
> > Other option is if we had unconditionally enabled PREEMPT_COUNT config.
> > It would be easy to identify a context type and invoke a page allocator
> > if a context is preemtale. But as of now preemptable() is "half" working.
> > Thomas uploaded patches to make it unconditional. But it can be blocked.
> 
> While I in principle support Thomas' patch, I think this is an abuse and
> still complete wrong.
> 
Good that you support it :)

>
> Not all preemptible() context can deal with GFP_KERNEL (GFP_NOFS
> existing should be a clue). !preemptible context cannot unconditionally
> deal with GFP_NOWAIT, consider using it while holding zone->lock.
> 
Not sure if i fully follow you here. What i tried to express, if we had
preemtable() fully working, including !PREEMPT, we can at least detect
the following:

if (!preemptable())
    by pass using any ATOMIC/NOWAIT flags because they do not work;

as an example you mentioned z->lock that is sleepable.

>
> The preemption context does not say antying much useful about the
> allocation context.
>
I am talking about taking a decision. 

--
Vlad Rezki

Uladzislau Rezki Sept. 24, 2020, 3:21 p.m. UTC | #20

On Thu, Sep 24, 2020 at 01:19:07PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 24, 2020 at 10:16:14AM +0200, Uladzislau Rezki wrote:
> > The key point is "enough". We need pages to make a) fast progress b) support
> > single argument of kvfree_rcu(one_arg). Not vice versa. That "enough" depends
> > on scheduler latency and vague pre-allocated number of pages, it might
> > be not enough what would require to refill it more and more or we can overshoot
> > that would lead to memory overhead. So we have here timing issues and
> > not accurate model. IMHO.
> 
> I'm firmly opposed to the single argument kvfree_rcu() idea, that's
> requiring memory to free memory.
> 
Hmm.. The problem is there is a demand in it:

Please have a look at places in the kernel where people do not
embed the rcu_head into their stuctures and do like:

<snip>
    synchronize_rcu();
    kfree(p);
<snip>

<snip>
urezki@pc638:~/data/coding/linux-rcu.git$ find ./ -name "*.c" | xargs grep -C 1 -rn "synchronize_rcu" | grep kfree
./fs/nfs/sysfs.c-113-           kfree(old);
./fs/ext4/super.c-1708- kfree(old_qname);
./kernel/trace/ftrace.c-5079-                   kfree(direct);
./kernel/trace/ftrace.c-5156-                   kfree(direct);
./kernel/trace/trace_probe.c-1087-      kfree(link);
./kernel/module.c-3910- kfree(mod->args);
./net/core/sysctl_net_core.c-143-                               kfree(cur);
./arch/x86/mm/mmio-mod.c-314-           kfree(found_trace);
./drivers/mfd/dln2.c-183-               kfree(i);
./drivers/block/drbd/drbd_state.c-2074-         kfree(old_conf);
./drivers/block/drbd/drbd_nl.c-1689-    kfree(old_disk_conf);
./drivers/block/drbd/drbd_nl.c-2522-    kfree(old_net_conf);
./drivers/block/drbd/drbd_nl.c-2935-            kfree(old_disk_conf);
./drivers/block/drbd/drbd_receiver.c-3805-      kfree(old_net_conf);
./drivers/block/drbd/drbd_receiver.c-4177-                      kfree(old_disk_conf);
./drivers/ipack/carriers/tpci200.c-189- kfree(slot_irq);
./drivers/crypto/nx/nx-842-pseries.c-1010-      kfree(old_devdata);
./drivers/net/ethernet/myricom/myri10ge/myri10ge.c-3583-        kfree(mgp->ss);
./drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c:286:       synchronize_rcu(); /* before kfree(flow) */
./drivers/net/ethernet/mellanox/mlxsw/core.c-1574-      kfree(rxl_item);
./drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c-6642- kfree(adapter->mbox_log);
./drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c-6644- kfree(adapter);
./drivers/infiniband/hw/hfi1/sdma.c-1337-       kfree(dd->per_sdma);
./drivers/infiniband/core/device.c:2164:                         * synchronize_rcu before the netdev is kfreed, so we
./drivers/misc/vmw_vmci/vmci_context.c-692-             kfree(notifier);
./drivers/misc/vmw_vmci/vmci_event.c-213-       kfree(s);
./drivers/staging/fwserial/fwserial.c-2122-     kfree(peer);
urezki@pc638:~/data/coding/linux-rcu.git$
<snip>

--
Vlad Rezki

Paul E. McKenney Sept. 24, 2020, 3:38 p.m. UTC | #21

On Thu, Sep 24, 2020 at 01:19:07PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 24, 2020 at 10:16:14AM +0200, Uladzislau Rezki wrote:
> > The key point is "enough". We need pages to make a) fast progress b) support
> > single argument of kvfree_rcu(one_arg). Not vice versa. That "enough" depends
> > on scheduler latency and vague pre-allocated number of pages, it might
> > be not enough what would require to refill it more and more or we can overshoot
> > that would lead to memory overhead. So we have here timing issues and
> > not accurate model. IMHO.
> 
> I'm firmly opposed to the single argument kvfree_rcu() idea, that's
> requiring memory to free memory.

Not quite.

First, there is a fallback when memory allocation fails.  Second,
in heavy-use situations, there is only one allocation per about
500 kvfree_rcu() calls on 64-bit systems.  Third, there are other
long-standing situations that require allocating memory in order to
free memory.

So I agree that it is a good general rule of thumb to avoid allocating
on free paths, but there are exceptions.  This is one of them.

						Thanx, Paul

Michal Hocko Sept. 25, 2020, 8:05 a.m. UTC | #22

On Thu 24-09-20 10:16:14, Uladzislau Rezki wrote:
> > On Wed, Sep 23, 2020 at 08:41:05AM -0700, Paul E. McKenney wrote:
> > > > Fundamentally, this is simply shifting the problem from RCU to the page
> > > > allocator because of the locking arrangements and hazard of acquiring zone
> > > > lock is a raw spinlock is held on RT. It does not even make the timing
> > > > predictable as an empty PCU list (for example, a full drain in low memory
> > > > situations) may mean the emergency path is hit anyway. About all it changes
> > > > is the timing of when the emergency path is hit in some circumstances --
> > > > it's not fixing the problem, it's simply changing the shape.
> > > 
> > > All good points!
> > > 
> > > On the other hand, duplicating a portion of the allocator functionality
> > > within RCU increases the amount of reserved memory, and needlessly most
> > > of the time.
> > > 
> > 
> > But it's very similar to what mempools are for.
> > 
> As for dynamic caching or mempools. It requires extra logic on top of RCU
> to move things forward and it might be not efficient way. As a side
> effect, maintaining of the bulk arrays in the separate worker thread
> will introduce other drawbacks:

This is true but it is also true that it is RCU to require this special
logic and we can expect that we might need to fine tune this logic
depending on the RCU usage. We definitely do not want to tune the
generic page allocator for a very specific usecase, do we?

>  a) There is an extra latency window, a time during which a fallback
>     mechanism is used until pages are obtained via the special
>     worker for further pointers collecting over arrays.

This will be always the case for async refilling. More importantly this
will be a bigger problem at the page allocator level which has other
users other than RCU so more users are suffering...

>  b) It is impossible to predict how many pages will be required to
>     cover a demand that is controlled by different workloads on
>     various systems. It would require a rough value.

I have asked for some actual numbers for real life scenarios this work
is meant to cover. There was nothing presented so far. We can hand wave
for ever but this will not move us forward much. As I've said in other
email, few dozens pages per CPU by default will hardly get noticeable.
You have a trivial initial implementation and can build on top in
incremental steps. You can kick a background allocator to start new
allocations when the pool hits a watermark and aggressivelly remove
cached pages when they are not used. You will have quite a freedom to
fine tune the scheme which is much harder for the page allocator because
there are many other consumers.

Anyway, I am afraid that we are going in circles here. We do not have
any meaningful numbers to claim memory footprint problems. There is a
clear opposition to hook into page allocator for reasons already
mentioned. You are looking for a dedicated memory pool and it should be
quite trivial to develop one and fine tune it for your specific usecase.
All that on top of page allocator. Unless this is seen as completely
unfeasible based on some solid arguments then we can start talking about
the page allocator itself.

Peter Zijlstra Sept. 25, 2020, 8:15 a.m. UTC | #23

On Thu, Sep 24, 2020 at 05:21:12PM +0200, Uladzislau Rezki wrote:
> On Thu, Sep 24, 2020 at 01:19:07PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 24, 2020 at 10:16:14AM +0200, Uladzislau Rezki wrote:
> > > The key point is "enough". We need pages to make a) fast progress b) support
> > > single argument of kvfree_rcu(one_arg). Not vice versa. That "enough" depends
> > > on scheduler latency and vague pre-allocated number of pages, it might
> > > be not enough what would require to refill it more and more or we can overshoot
> > > that would lead to memory overhead. So we have here timing issues and
> > > not accurate model. IMHO.
> > 
> > I'm firmly opposed to the single argument kvfree_rcu() idea, that's
> > requiring memory to free memory.
> > 
> Hmm.. The problem is there is a demand in it:

People demand ponies all the time, the usual answer is: No.

Peter Zijlstra Sept. 25, 2020, 8:26 a.m. UTC | #24

On Thu, Sep 24, 2020 at 08:38:34AM -0700, Paul E. McKenney wrote:
> On Thu, Sep 24, 2020 at 01:19:07PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 24, 2020 at 10:16:14AM +0200, Uladzislau Rezki wrote:
> > > The key point is "enough". We need pages to make a) fast progress b) support
> > > single argument of kvfree_rcu(one_arg). Not vice versa. That "enough" depends
> > > on scheduler latency and vague pre-allocated number of pages, it might
> > > be not enough what would require to refill it more and more or we can overshoot
> > > that would lead to memory overhead. So we have here timing issues and
> > > not accurate model. IMHO.
> > 
> > I'm firmly opposed to the single argument kvfree_rcu() idea, that's
> > requiring memory to free memory.
> 
> Not quite.
> 
> First, there is a fallback when memory allocation fails.  Second,
> in heavy-use situations, there is only one allocation per about
> 500 kvfree_rcu() calls on 64-bit systems.  Third, there are other
> long-standing situations that require allocating memory in order to
> free memory.

Some of which are quite broken. And yes, I'm aware of all that, I'm the
one that started swap-over-NFS, which requires network traffic to free
memory, which is one insane step further.

But the way to make that 'work' is carefully account and pre-allocate
(or size the reserve) the required memory to make progress and to
strictly limit concurrency to ensure you stay in your bounds.

> So I agree that it is a good general rule of thumb to avoid allocating
> on free paths, but there are exceptions.  This is one of them.

The very first thing you need to do is proof your memory usage is
bounded, and then calculate your bound.

The problem is that with RCU you can't limit concurrency. call_rcu()
can't block, you can't wait for a grace period to end when you've ran
out of your reserve.

That is, you don't have a bound, so no reserve what so ever is going to
help.

You must have that callback_head fallback.

Uladzislau Rezki Sept. 25, 2020, 10:25 a.m. UTC | #25

On Fri, Sep 25, 2020 at 10:15:52AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 24, 2020 at 05:21:12PM +0200, Uladzislau Rezki wrote:
> > On Thu, Sep 24, 2020 at 01:19:07PM +0200, Peter Zijlstra wrote:
> > > On Thu, Sep 24, 2020 at 10:16:14AM +0200, Uladzislau Rezki wrote:
> > > > The key point is "enough". We need pages to make a) fast progress b) support
> > > > single argument of kvfree_rcu(one_arg). Not vice versa. That "enough" depends
> > > > on scheduler latency and vague pre-allocated number of pages, it might
> > > > be not enough what would require to refill it more and more or we can overshoot
> > > > that would lead to memory overhead. So we have here timing issues and
> > > > not accurate model. IMHO.
> > > 
> > > I'm firmly opposed to the single argument kvfree_rcu() idea, that's
> > > requiring memory to free memory.
> > > 
> > Hmm.. The problem is there is a demand in it:
> 
> People demand ponies all the time, the usual answer is: No.
>
I see your view. From the other hand it is clear, there is still
demand in it:

<snip>
void ext4_kvfree_array_rcu(void *to_free)
{
 struct ext4_rcu_ptr *ptr = kzalloc(sizeof(*ptr), GFP_KERNEL);

 if (ptr) {
  ptr->ptr = to_free;
  call_rcu(&ptr->rcu, ext4_rcu_ptr_callback);
  return;
 }
 synchronize_rcu();
 kvfree(ptr);
}
<snip>

--
Vlad Rezki

Uladzislau Rezki Sept. 25, 2020, 3:31 p.m. UTC | #26

> > > > 
> > > > All good points!
> > > > 
> > > > On the other hand, duplicating a portion of the allocator functionality
> > > > within RCU increases the amount of reserved memory, and needlessly most
> > > > of the time.
> > > > 
> > > 
> > > But it's very similar to what mempools are for.
> > > 
> > As for dynamic caching or mempools. It requires extra logic on top of RCU
> > to move things forward and it might be not efficient way. As a side
> > effect, maintaining of the bulk arrays in the separate worker thread
> > will introduce other drawbacks:
> 
> This is true but it is also true that it is RCU to require this special
> logic and we can expect that we might need to fine tune this logic
> depending on the RCU usage. We definitely do not want to tune the
> generic page allocator for a very specific usecase, do we?
> 
I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
to provide a memory service for contexts which are not allowed to
sleep, RCU is part of them. Both flags used to provide such ability
before but not anymore.

Do you agree with it?

>
> do we?
>
If you agree with above, then there is a reasonable question what to
do. Converting zone->lock into raw variant, the simplest and generic
solution, has been declined(RT people do not like and see a side effect),
NO_LOCK flag together with lock-less function were declined. All those
efforts in terms of suggestions were just to find the best way to go.
That is why we highlighted it outside of RCU.

If you do not agree, then at least a documentation of mentioned flags
should be changed, IMHO. We are, in our turn will focus on fixing it
differently.

> >  a) There is an extra latency window, a time during which a fallback
> >     mechanism is used until pages are obtained via the special
> >     worker for further pointers collecting over arrays.
> 
> This will be always the case for async refilling. More importantly this
> will be a bigger problem at the page allocator level which has other
> users other than RCU so more users are suffering...
> 
Agree on that, async refilling has its disadvantages.

> >  b) It is impossible to predict how many pages will be required to
> >     cover a demand that is controlled by different workloads on
> >     various systems. It would require a rough value.
> 
> I have asked for some actual numbers for real life scenarios this work
> is meant to cover. There was nothing presented so far. We can hand wave
> for ever but this will not move us forward much. As I've said in other
> email, few dozens pages per CPU by default will hardly get noticeable.
> You have a trivial initial implementation and can build on top in
> incremental steps. You can kick a background allocator to start new
> allocations when the pool hits a watermark and aggressivelly remove
> cached pages when they are not used. You will have quite a freedom to
> fine tune the scheme which is much harder for the page allocator because
> there are many other consumers.
> 

> Anyway, I am afraid that we are going in circles here. We do not have
> any meaningful numbers to claim memory footprint problems. There is a
> clear opposition to hook into page allocator for reasons already
> mentioned. You are looking for a dedicated memory pool and it should be
> quite trivial to develop one and fine tune it for your specific usecase.
> All that on top of page allocator. Unless this is seen as completely
> unfeasible based on some solid arguments then we can start talking about
> the page allocator itself.
>
We have only synthetic tests. As for real life scenarios, on highly
loaded servers we will have one picture, on small devices like phones
we will have another one. As i mentioned in my previous emails, in worst
case on my system it requires ~400 pages to cover the bandwidth.

But, again we would like to find the best way to go. As i mentioned
if Thomas patches are in place, we do not need to workaround everything
regarding dynamic caching or async preloading and so on.

Thank you Michal!

--
Vlad Rezki

Michal Hocko Sept. 25, 2020, 3:47 p.m. UTC | #27

On Fri 25-09-20 17:31:29, Uladzislau Rezki wrote:
> > > > > 
> > > > > All good points!
> > > > > 
> > > > > On the other hand, duplicating a portion of the allocator functionality
> > > > > within RCU increases the amount of reserved memory, and needlessly most
> > > > > of the time.
> > > > > 
> > > > 
> > > > But it's very similar to what mempools are for.
> > > > 
> > > As for dynamic caching or mempools. It requires extra logic on top of RCU
> > > to move things forward and it might be not efficient way. As a side
> > > effect, maintaining of the bulk arrays in the separate worker thread
> > > will introduce other drawbacks:
> > 
> > This is true but it is also true that it is RCU to require this special
> > logic and we can expect that we might need to fine tune this logic
> > depending on the RCU usage. We definitely do not want to tune the
> > generic page allocator for a very specific usecase, do we?
> > 
> I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> to provide a memory service for contexts which are not allowed to
> sleep, RCU is part of them. Both flags used to provide such ability
> before but not anymore.
> 
> Do you agree with it?

Yes this sucks. But this is something that we likely really want to live
with. We have to explicitly _document_ that really atomic contexts in RT
cannot use the allocator. From the past discussions we've had this is
likely the most reasonable way forward because we do not really want to
encourage anybody to do something like that and there should be ways
around that. The same is btw. true also for !RT. The allocator is not
NMI safe and while we should be able to make it compatible I am not
convinced we really want to.

Would something like this be helpful wrt documentation?

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 67a0774e080b..9fcd47606493 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -238,7 +238,9 @@ struct vm_area_struct;
  * %__GFP_FOO flags as necessary.
  *
  * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
- * watermark is applied to allow access to "atomic reserves"
+ * watermark is applied to allow access to "atomic reserves".
+ * The current implementation doesn't support NMI and other non-preemptive context
+ * (e.g. raw_spin_lock).
  *
  * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires
  * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim.

[...]

Mel Gorman Sept. 25, 2020, 4:17 p.m. UTC | #28

On Fri, Sep 25, 2020 at 05:31:29PM +0200, Uladzislau Rezki wrote:
> > > > > 
> > > > > All good points!
> > > > > 
> > > > > On the other hand, duplicating a portion of the allocator functionality
> > > > > within RCU increases the amount of reserved memory, and needlessly most
> > > > > of the time.
> > > > > 
> > > > 
> > > > But it's very similar to what mempools are for.
> > > > 
> > > As for dynamic caching or mempools. It requires extra logic on top of RCU
> > > to move things forward and it might be not efficient way. As a side
> > > effect, maintaining of the bulk arrays in the separate worker thread
> > > will introduce other drawbacks:
> > 
> > This is true but it is also true that it is RCU to require this special
> > logic and we can expect that we might need to fine tune this logic
> > depending on the RCU usage. We definitely do not want to tune the
> > generic page allocator for a very specific usecase, do we?
> > 
> I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> to provide a memory service for contexts which are not allowed to
> sleep, RCU is part of them. Both flags used to provide such ability
> before but not anymore.
> 
> Do you agree with it?
> 

I was led to believe that hte problem was taking the zone lock while
holding a raw spinlock that was specific to RCU. Are you saying that
GFP_ATOMIC/GFP_NOWAIT users are also holding raw spinlocks at the same
time on RT?

Uladzislau Rezki Sept. 25, 2020, 5:57 p.m. UTC | #29

On Fri, Sep 25, 2020 at 05:17:12PM +0100, Mel Gorman wrote:
> On Fri, Sep 25, 2020 at 05:31:29PM +0200, Uladzislau Rezki wrote:
> > > > > > 
> > > > > > All good points!
> > > > > > 
> > > > > > On the other hand, duplicating a portion of the allocator functionality
> > > > > > within RCU increases the amount of reserved memory, and needlessly most
> > > > > > of the time.
> > > > > > 
> > > > > 
> > > > > But it's very similar to what mempools are for.
> > > > > 
> > > > As for dynamic caching or mempools. It requires extra logic on top of RCU
> > > > to move things forward and it might be not efficient way. As a side
> > > > effect, maintaining of the bulk arrays in the separate worker thread
> > > > will introduce other drawbacks:
> > > 
> > > This is true but it is also true that it is RCU to require this special
> > > logic and we can expect that we might need to fine tune this logic
> > > depending on the RCU usage. We definitely do not want to tune the
> > > generic page allocator for a very specific usecase, do we?
> > > 
> > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> > to provide a memory service for contexts which are not allowed to
> > sleep, RCU is part of them. Both flags used to provide such ability
> > before but not anymore.
> > 
> > Do you agree with it?
> > 
> 
> I was led to believe that hte problem was taking the zone lock while
> holding a raw spinlock that was specific to RCU.
In RCU code we hold a raw spinlock, because the kfree_rcu() should
follow the call_rcu() rule and work in atomic contexts. So we can
not enter a page allocator because it uses spinlock_t z->lock(is sleepable for RT).

Because of CONFIG_PROVE_RAW_LOCK_NESTING option and CONFIG_PREEMPT_RT.

>
> Are you saying that GFP_ATOMIC/GFP_NOWAIT users are also holding raw
> spinlocks at the same time on RT?
>
I do not say it. And it is not possible because zone->lock has
a spinlock_t type. So, in case of CONFIG_PREEMPT_RT you will
hit a "BUG: scheduling while atomic". If allocator is called
when: raw lock is held or irqs are disabled or preempt_disable()
on a higher level.

--
Vlad Rezki

Paul E. McKenney Sept. 26, 2020, 2:37 p.m. UTC | #30

On Fri, Sep 25, 2020 at 10:26:18AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 24, 2020 at 08:38:34AM -0700, Paul E. McKenney wrote:
> > On Thu, Sep 24, 2020 at 01:19:07PM +0200, Peter Zijlstra wrote:
> > > On Thu, Sep 24, 2020 at 10:16:14AM +0200, Uladzislau Rezki wrote:
> > > > The key point is "enough". We need pages to make a) fast progress b) support
> > > > single argument of kvfree_rcu(one_arg). Not vice versa. That "enough" depends
> > > > on scheduler latency and vague pre-allocated number of pages, it might
> > > > be not enough what would require to refill it more and more or we can overshoot
> > > > that would lead to memory overhead. So we have here timing issues and
> > > > not accurate model. IMHO.
> > > 
> > > I'm firmly opposed to the single argument kvfree_rcu() idea, that's
> > > requiring memory to free memory.
> > 
> > Not quite.
> > 
> > First, there is a fallback when memory allocation fails.  Second,
> > in heavy-use situations, there is only one allocation per about
> > 500 kvfree_rcu() calls on 64-bit systems.  Third, there are other
> > long-standing situations that require allocating memory in order to
> > free memory.
> 
> Some of which are quite broken. And yes, I'm aware of all that, I'm the
> one that started swap-over-NFS, which requires network traffic to free
> memory, which is one insane step further.

I could easily imagine that experience might have left some scars.

> But the way to make that 'work' is carefully account and pre-allocate
> (or size the reserve) the required memory to make progress and to
> strictly limit concurrency to ensure you stay in your bounds.

But your situation is different.  When swapping over NFS, if you
cannot allocate the memory to do the I/O, you cannot free the memory
you are attempting to swap out, at least not unless you can kill the
corresponding process.  So if you don't want to kill processes, as you
say, worst case is what matters.

The kvfree_rcu() situation is rather different.  In all cases, there
is a fallback, namely using the existing rcu_head for double-argument
kvfree_rcu() or falling back to synchronize_rcu() for single-argument
kvfree_rcu().  As long as these fallbacks are sufficiently rare, the
system will probably survive.

> > So I agree that it is a good general rule of thumb to avoid allocating
> > on free paths, but there are exceptions.  This is one of them.
> 
> The very first thing you need to do is proof your memory usage is
> bounded, and then calculate your bound.

Again, you are confusing your old swap-over-NFS scars with the current
situation.  They really are not the same.

> The problem is that with RCU you can't limit concurrency. call_rcu()
> can't block, you can't wait for a grace period to end when you've ran
> out of your reserve.
> 
> That is, you don't have a bound, so no reserve what so ever is going to
> help.

Almost.  A dedicated reserve large enough to result in sufficiently low
use of the fallback paths is too large.  Again, we can tolerate a small
fraction of requests taking the fallback, with emphasis on "small".

> You must have that callback_head fallback.

And we do have that callback_head fallback.  And in the case of
single-argument kvfree_rcu(), that synchronize_rcu() fallback. And as
long as we can avoid using those fallbacks almost all the time, things
will be OK.  But we do need to able to allocate memory in the common
case when there is memory to be had.

							Thanx, Paul

Vlastimil Babka Sept. 29, 2020, 10:15 a.m. UTC | #31

On 9/18/20 9:48 PM, Uladzislau Rezki (Sony) wrote:
> Some background and kfree_rcu()
> ===============================
> The pointers to be freed are stored in the per-cpu array to improve
> performance, to enable an easier-to-use API, to accommodate vmalloc
> memmory and to support a single argument of the kfree_rcu() when only
> a pointer is passed. More details are below.
> 
> In order to maintain such per-CPU arrays there is a need in dynamic
> allocation when a current array is fully populated and a new block is
> required. See below the example:
> 
>  0 1 2 3      0 1 2 3
> |p|p|p|p| -> |p|p|p|p| -> NULL
> 
> there are two pointer-blocks, each one can store 4 addresses
> which will be freed after a grace period is passed. In reality
> we store PAGE_SIZE / sizeof(void *). So to maintain such blocks
> a single page is obtain via the page allocator:
> 
>     bnode = (struct kvfree_rcu_bulk_data *)
>         __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> 
> after that it is attached to the "head" and its "next" pointer is
> set to previous "head", so the list of blocks can be maintained and
> grow dynamically until it gets drained by the reclaiming thread.
> 
> Please note. There is always a fallback if an allocation fails. In the
> single argument, this is a call to synchronize_rcu() and for the two
> arguments case this is to use rcu_head structure embedded in the object
> being free, and then paying cache-miss penalty, also invoke the kfree()
> per object instead of kfree_bulk() for groups of objects.
> 
> Why we maintain arrays/blocks instead of linking objects by the regular
> "struct rcu_head" technique. See below a few but main reasons:
> 
> a) A memory can be reclaimed by invoking of the kfree_bulk()
>    interface that requires passing an array and number of
>    entries in it. That reduces the per-object overhead caused
>    by calling kfree() per-object. This reduces the reclamation
>    time.
> 
> b) Improves locality and reduces the number of cache-misses, due to
>    "pointer chasing" between objects, which can be far spread between
>    each other.
> 
> c) Support a "single argument" in the kvfree_rcu()
>    void *ptr = kvmalloc(some_bytes, GFP_KERNEL);
>    if (ptr)
>         kvfree_rcu(ptr);
> 
>    We need it when an "rcu_head" is not embed into a stucture but an
>    object must be freed after a grace period. Therefore for the single
>    argument, such objects cannot be queued on a linked list.
> 
>    So nowadays, since we do not have a single argument but we see the
>    demand in it, to workaround it people just do a simple not efficient
>    sequence:
>    <snip>
>        synchronize_rcu(); /* Can be long and blocks a current context */
>        kfree(p);
>    <snip>
> 
>    More details is here: https://lkml.org/lkml/2020/4/28/1626
> 
> d) To distinguish vmalloc pointers between SLAB ones. It becomes possible
>    to invoke the right freeing API for the right kind of pointer, kfree_bulk()
>    or TBD: vmalloc_bulk().
> 
> e) Speeding up the post-grace-period freeing reduces the chance of a flood
>    of callback's OOMing the system.
> 
> Also, please have a look here: https://lkml.org/lkml/2020/7/30/1166
> 
> Proposal
> ========
> Introduce a lock-free function that obtain a page from the per-cpu-lists
> on current CPU. It returns NULL rather than acquiring any non-raw spinlock.
> 
> Description
> ===========
> The page allocator has two phases, fast path and slow one. We are interested
> in fast path and order-0 allocations. In its turn it is divided also into two
> phases: lock-less and not:
> 
> 1) As a first step the page allocator tries to obtain a page from the
>    per-cpu-list, so each CPU has its own one. That is why this step is
>    lock-less and fast. Basically it disables irqs on current CPU in order
>    to access to per-cpu data and remove a first element from the pcp-list.
>    An element/page is returned to an user.
> 
> 2) If there is no any available page in per-cpu-list, the second step is
>    involved. It removes a specified number of elements from the buddy allocator
>    transferring them to the "supplied-list/per-cpu-list" described in [1].
> 
> Summarizing. The __rcu_alloc_page_lockless() covers only [1] and can not
> do step [2], due to the fact that [2] requires an access to zone->lock.
> It implies that it is super fast, but a higher rate of fails is also
> expected.
> 
> Usage: __rcu_alloc_page_lockless();
> 
> Link: https://lore.kernel.org/lkml/20200814215206.GL3982@worktop.programming.kicks-ass.net/
> Not-signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

After reading all the threads and mulling over this, I am going to deflect from
Mel and Michal and not oppose the idea of lockless allocation. I would even
prefer to do it via the gfp flag and not a completely separate path. Not using
the exact code from v1, I think it could be done in a way that we don't actually
look at the new flag until we find that pcplist is empty - which should not
introduce overhead to the fast-fast path when pcpclist is not empty. It's more
maintainable that adding new entry points, IMHO.

But there's still the condition that it's sufficiently shown that the allocation
is useful for RCU. In that case I prefer that the page allocator (or MM in
general) can give its users what they need without having to work around it.
Seems like GFP_ATOMIC is insufficient today so if that means we need a new flag
for the raw spinlock context, so be it. But if your usage of __GFP_NO_LOCKS
depends on the result of preempt_count() or similar check whether this is a
context that needs it, I'd prefer to keep this in the caller.

Uladzislau Rezki Sept. 29, 2020, 4:25 p.m. UTC | #32

> > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> > to provide a memory service for contexts which are not allowed to
> > sleep, RCU is part of them. Both flags used to provide such ability
> > before but not anymore.
> > 
> > Do you agree with it?
> 
> Yes this sucks. But this is something that we likely really want to live
> with. We have to explicitly _document_ that really atomic contexts in RT
> cannot use the allocator. From the past discussions we've had this is
> likely the most reasonable way forward because we do not really want to
> encourage anybody to do something like that and there should be ways
> around that. The same is btw. true also for !RT. The allocator is not
> NMI safe and while we should be able to make it compatible I am not
> convinced we really want to.
> 
> Would something like this be helpful wrt documentation?
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 67a0774e080b..9fcd47606493 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -238,7 +238,9 @@ struct vm_area_struct;
>   * %__GFP_FOO flags as necessary.
>   *
>   * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
> - * watermark is applied to allow access to "atomic reserves"
> + * watermark is applied to allow access to "atomic reserves".
> + * The current implementation doesn't support NMI and other non-preemptive context
> + * (e.g. raw_spin_lock).
>   *
>   * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires
>   * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
> 
To me it is clear. But also above conflicting statement:

<snip>
%GFP_ATOMIC users can not sleep and need the allocation to succeed. A %lower
<snip>

should be rephrased, IMHO.

--
Vlad Rezki

Uladzislau Rezki Sept. 29, 2020, 10:07 p.m. UTC | #33

On Tue, Sep 29, 2020 at 12:15:34PM +0200, Vlastimil Babka wrote:
> On 9/18/20 9:48 PM, Uladzislau Rezki (Sony) wrote:
> > Some background and kfree_rcu()
> > ===============================
> > The pointers to be freed are stored in the per-cpu array to improve
> > performance, to enable an easier-to-use API, to accommodate vmalloc
> > memmory and to support a single argument of the kfree_rcu() when only
> > a pointer is passed. More details are below.
> > 
> > In order to maintain such per-CPU arrays there is a need in dynamic
> > allocation when a current array is fully populated and a new block is
> > required. See below the example:
> > 
> >  0 1 2 3      0 1 2 3
> > |p|p|p|p| -> |p|p|p|p| -> NULL
> > 
> > there are two pointer-blocks, each one can store 4 addresses
> > which will be freed after a grace period is passed. In reality
> > we store PAGE_SIZE / sizeof(void *). So to maintain such blocks
> > a single page is obtain via the page allocator:
> > 
> >     bnode = (struct kvfree_rcu_bulk_data *)
> >         __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > 
> > after that it is attached to the "head" and its "next" pointer is
> > set to previous "head", so the list of blocks can be maintained and
> > grow dynamically until it gets drained by the reclaiming thread.
> > 
> > Please note. There is always a fallback if an allocation fails. In the
> > single argument, this is a call to synchronize_rcu() and for the two
> > arguments case this is to use rcu_head structure embedded in the object
> > being free, and then paying cache-miss penalty, also invoke the kfree()
> > per object instead of kfree_bulk() for groups of objects.
> > 
> > Why we maintain arrays/blocks instead of linking objects by the regular
> > "struct rcu_head" technique. See below a few but main reasons:
> > 
> > a) A memory can be reclaimed by invoking of the kfree_bulk()
> >    interface that requires passing an array and number of
> >    entries in it. That reduces the per-object overhead caused
> >    by calling kfree() per-object. This reduces the reclamation
> >    time.
> > 
> > b) Improves locality and reduces the number of cache-misses, due to
> >    "pointer chasing" between objects, which can be far spread between
> >    each other.
> > 
> > c) Support a "single argument" in the kvfree_rcu()
> >    void *ptr = kvmalloc(some_bytes, GFP_KERNEL);
> >    if (ptr)
> >         kvfree_rcu(ptr);
> > 
> >    We need it when an "rcu_head" is not embed into a stucture but an
> >    object must be freed after a grace period. Therefore for the single
> >    argument, such objects cannot be queued on a linked list.
> > 
> >    So nowadays, since we do not have a single argument but we see the
> >    demand in it, to workaround it people just do a simple not efficient
> >    sequence:
> >    <snip>
> >        synchronize_rcu(); /* Can be long and blocks a current context */
> >        kfree(p);
> >    <snip>
> > 
> >    More details is here: https://lkml.org/lkml/2020/4/28/1626
> > 
> > d) To distinguish vmalloc pointers between SLAB ones. It becomes possible
> >    to invoke the right freeing API for the right kind of pointer, kfree_bulk()
> >    or TBD: vmalloc_bulk().
> > 
> > e) Speeding up the post-grace-period freeing reduces the chance of a flood
> >    of callback's OOMing the system.
> > 
> > Also, please have a look here: https://lkml.org/lkml/2020/7/30/1166
> > 
> > Proposal
> > ========
> > Introduce a lock-free function that obtain a page from the per-cpu-lists
> > on current CPU. It returns NULL rather than acquiring any non-raw spinlock.
> > 
> > Description
> > ===========
> > The page allocator has two phases, fast path and slow one. We are interested
> > in fast path and order-0 allocations. In its turn it is divided also into two
> > phases: lock-less and not:
> > 
> > 1) As a first step the page allocator tries to obtain a page from the
> >    per-cpu-list, so each CPU has its own one. That is why this step is
> >    lock-less and fast. Basically it disables irqs on current CPU in order
> >    to access to per-cpu data and remove a first element from the pcp-list.
> >    An element/page is returned to an user.
> > 
> > 2) If there is no any available page in per-cpu-list, the second step is
> >    involved. It removes a specified number of elements from the buddy allocator
> >    transferring them to the "supplied-list/per-cpu-list" described in [1].
> > 
> > Summarizing. The __rcu_alloc_page_lockless() covers only [1] and can not
> > do step [2], due to the fact that [2] requires an access to zone->lock.
> > It implies that it is super fast, but a higher rate of fails is also
> > expected.
> > 
> > Usage: __rcu_alloc_page_lockless();
> > 
> > Link: https://lore.kernel.org/lkml/20200814215206.GL3982@worktop.programming.kicks-ass.net/
> > Not-signed-off-by: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> 
> After reading all the threads and mulling over this, I am going to deflect from
> Mel and Michal and not oppose the idea of lockless allocation. I would even
> prefer to do it via the gfp flag and not a completely separate path. Not using
> the exact code from v1, I think it could be done in a way that we don't actually
> look at the new flag until we find that pcplist is empty - which should not
> introduce overhead to the fast-fast path when pcpclist is not empty. It's more
> maintainable that adding new entry points, IMHO.
> 
Thanks for reading all that from the beginning! It must be tough due to the
fact there were lot of messages in the threads, so at least i was lost.

I agree that adding a new entry or separate lock-less function can be considered
as something that is hard to maintain. I have a question here. I mean about your
different look at it:

<snip>
bool is_pcp_cache_empty(gfp_t gfp)
{
    struct per_cpu_pages *pcp;
    struct zoneref *ref;
    unsigned long flags;
    bool empty;

    ref = first_zones_zonelist(node_zonelist(
            numa_node_id(), gfp), gfp_zone(gfp), NULL);
    if (!ref->zone)
            return true;

    local_irq_save(flags);
    pcp = &this_cpu_ptr(ref->zone->pageset)->pcp;
    empty = list_empty(&pcp->lists[gfp_migratetype(gfp)]);
    local_irq_restore(flags);

    return empty;
}

disable_irq();
if (!is_pcp_cache_empty(GFP_NOWAIT))
    __get_free_page(GFP_NOWAIT);
enable_irq();
<snip>

Do you mean to have something like above? I mean some extra API
function that returns true or false if fast-fast allocation can
either occur or not. Above code works just fine and never touches
main zone->lock.

i.e. Instead of introducing an extra GFP_LOCKLESS flag or any new
extra lock-less function. We could have something that checks a
pcp page cache list, thus it can guarantee that a request would
be accomplish using fast-fast path.

>
> But there's still the condition that it's sufficiently shown that the allocation
> is useful for RCU. In that case I prefer that the page allocator (or MM in
> general) can give its users what they need without having to work around it.
>
I see your point, if there is something that can be generic and not specif - makes
sense to me also. I do share your view.

>
> Seems like GFP_ATOMIC is insufficient today so if that means we need a new flag
> for the raw spinlock context, so be it. But if your usage of __GFP_NO_LOCKS
> depends on the result of preempt_count() or similar check whether this is a
> context that needs it, I'd prefer to keep this in the caller.
>
By having the preemptabele() thing, we can just workaround GFP_ATOMIC or GFP_NOWAIT
flag issues. Because both are insufficient. I mean when we see that a context is
atomic(can not sleep) we can bypass any usage of the page allocator because it will
hit BUG() with flags which are in question. And yes, the best is when we do not
think much about context, so there is something reliable from point of using the
page allocator.

Thanks Vlastimil!

--
Vlad Rezki

Michal Hocko Sept. 30, 2020, 9:27 a.m. UTC | #34

On Tue 29-09-20 18:25:14, Uladzislau Rezki wrote:
> > > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> > > to provide a memory service for contexts which are not allowed to
> > > sleep, RCU is part of them. Both flags used to provide such ability
> > > before but not anymore.
> > > 
> > > Do you agree with it?
> > 
> > Yes this sucks. But this is something that we likely really want to live
> > with. We have to explicitly _document_ that really atomic contexts in RT
> > cannot use the allocator. From the past discussions we've had this is
> > likely the most reasonable way forward because we do not really want to
> > encourage anybody to do something like that and there should be ways
> > around that. The same is btw. true also for !RT. The allocator is not
> > NMI safe and while we should be able to make it compatible I am not
> > convinced we really want to.
> > 
> > Would something like this be helpful wrt documentation?
> > 
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 67a0774e080b..9fcd47606493 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -238,7 +238,9 @@ struct vm_area_struct;
> >   * %__GFP_FOO flags as necessary.
> >   *
> >   * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
> > - * watermark is applied to allow access to "atomic reserves"
> > + * watermark is applied to allow access to "atomic reserves".
> > + * The current implementation doesn't support NMI and other non-preemptive context
> > + * (e.g. raw_spin_lock).
> >   *
> >   * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires
> >   * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
> > 
> To me it is clear. But also above conflicting statement:
> 
> <snip>
> %GFP_ATOMIC users can not sleep and need the allocation to succeed. A %lower
> <snip>
> 
> should be rephrased, IMHO.

Any suggestions? Or more specifics about which part is conflicting? It
tries to say that there is a higher demand to succeed even though the
context cannot sleep to take active measures to achieve that. So the
only way to achieve that is to break the watermakrs to a certain degree
which is making them more "higher class" than other allocations.

Michal Hocko Sept. 30, 2020, 10:35 a.m. UTC | #35

On Wed 30-09-20 00:07:42, Uladzislau Rezki wrote:
[...]
> <snip>
> bool is_pcp_cache_empty(gfp_t gfp)
> {
>     struct per_cpu_pages *pcp;
>     struct zoneref *ref;
>     unsigned long flags;
>     bool empty;
> 
>     ref = first_zones_zonelist(node_zonelist(
>             numa_node_id(), gfp), gfp_zone(gfp), NULL);
>     if (!ref->zone)
>             return true;
> 
>     local_irq_save(flags);
>     pcp = &this_cpu_ptr(ref->zone->pageset)->pcp;
>     empty = list_empty(&pcp->lists[gfp_migratetype(gfp)]);
>     local_irq_restore(flags);
> 
>     return empty;
> }
> 
> disable_irq();
> if (!is_pcp_cache_empty(GFP_NOWAIT))
>     __get_free_page(GFP_NOWAIT);
> enable_irq();
> <snip>
> 
> Do you mean to have something like above? I mean some extra API
> function that returns true or false if fast-fast allocation can
> either occur or not. Above code works just fine and never touches
> main zone->lock.

The above code works with the _current_ implementation and it restricts
its implementation to some degree. Future changes might get harder to
implement with a pattern like this. I do not think we want users to be
aware of internal implementation details like pcp caches, migrate types
or others. While pcp caches are here for years and unlikely to change in
a foreseeable future many details are changing on regular basis.

Uladzislau Rezki Sept. 30, 2020, 12:35 p.m. UTC | #36

On Wed, Sep 30, 2020 at 11:27:32AM +0200, Michal Hocko wrote:
> On Tue 29-09-20 18:25:14, Uladzislau Rezki wrote:
> > > > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> > > > to provide a memory service for contexts which are not allowed to
> > > > sleep, RCU is part of them. Both flags used to provide such ability
> > > > before but not anymore.
> > > > 
> > > > Do you agree with it?
> > > 
> > > Yes this sucks. But this is something that we likely really want to live
> > > with. We have to explicitly _document_ that really atomic contexts in RT
> > > cannot use the allocator. From the past discussions we've had this is
> > > likely the most reasonable way forward because we do not really want to
> > > encourage anybody to do something like that and there should be ways
> > > around that. The same is btw. true also for !RT. The allocator is not
> > > NMI safe and while we should be able to make it compatible I am not
> > > convinced we really want to.
> > > 
> > > Would something like this be helpful wrt documentation?
> > > 
> > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > > index 67a0774e080b..9fcd47606493 100644
> > > --- a/include/linux/gfp.h
> > > +++ b/include/linux/gfp.h
> > > @@ -238,7 +238,9 @@ struct vm_area_struct;
> > >   * %__GFP_FOO flags as necessary.
> > >   *
> > >   * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
> > > - * watermark is applied to allow access to "atomic reserves"
> > > + * watermark is applied to allow access to "atomic reserves".
> > > + * The current implementation doesn't support NMI and other non-preemptive context
> > > + * (e.g. raw_spin_lock).
> > >   *
> > >   * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires
> > >   * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
> > > 
> > To me it is clear. But also above conflicting statement:
> > 
> > <snip>
> > %GFP_ATOMIC users can not sleep and need the allocation to succeed. A %lower
> > <snip>
> > 
> > should be rephrased, IMHO.
> 
> Any suggestions? Or more specifics about which part is conflicting? It
> tries to say that there is a higher demand to succeed even though the
> context cannot sleep to take active measures to achieve that. So the
> only way to achieve that is to break the watermakrs to a certain degree
> which is making them more "higher class" than other allocations.
> 
Michal, i had only one concern about it. It says that %GFP_ATOMIC users
can not sleep, i.e. callers know that they are in atomic, thus no any
sleeping, but the chose they make will force them to sleep.

--
Vlad Rezki

Michal Hocko Sept. 30, 2020, 12:44 p.m. UTC | #37

On Wed 30-09-20 14:35:35, Uladzislau Rezki wrote:
> On Wed, Sep 30, 2020 at 11:27:32AM +0200, Michal Hocko wrote:
> > On Tue 29-09-20 18:25:14, Uladzislau Rezki wrote:
> > > > > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> > > > > to provide a memory service for contexts which are not allowed to
> > > > > sleep, RCU is part of them. Both flags used to provide such ability
> > > > > before but not anymore.
> > > > > 
> > > > > Do you agree with it?
> > > > 
> > > > Yes this sucks. But this is something that we likely really want to live
> > > > with. We have to explicitly _document_ that really atomic contexts in RT
> > > > cannot use the allocator. From the past discussions we've had this is
> > > > likely the most reasonable way forward because we do not really want to
> > > > encourage anybody to do something like that and there should be ways
> > > > around that. The same is btw. true also for !RT. The allocator is not
> > > > NMI safe and while we should be able to make it compatible I am not
> > > > convinced we really want to.
> > > > 
> > > > Would something like this be helpful wrt documentation?
> > > > 
> > > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > > > index 67a0774e080b..9fcd47606493 100644
> > > > --- a/include/linux/gfp.h
> > > > +++ b/include/linux/gfp.h
> > > > @@ -238,7 +238,9 @@ struct vm_area_struct;
> > > >   * %__GFP_FOO flags as necessary.
> > > >   *
> > > >   * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
> > > > - * watermark is applied to allow access to "atomic reserves"
> > > > + * watermark is applied to allow access to "atomic reserves".
> > > > + * The current implementation doesn't support NMI and other non-preemptive context
> > > > + * (e.g. raw_spin_lock).
> > > >   *
> > > >   * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires
> > > >   * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
> > > > 
> > > To me it is clear. But also above conflicting statement:
> > > 
> > > <snip>
> > > %GFP_ATOMIC users can not sleep and need the allocation to succeed. A %lower
> > > <snip>
> > > 
> > > should be rephrased, IMHO.
> > 
> > Any suggestions? Or more specifics about which part is conflicting? It
> > tries to say that there is a higher demand to succeed even though the
> > context cannot sleep to take active measures to achieve that. So the
> > only way to achieve that is to break the watermakrs to a certain degree
> > which is making them more "higher class" than other allocations.
> > 
> Michal, i had only one concern about it. It says that %GFP_ATOMIC users
> can not sleep, i.e. callers know that they are in atomic, thus no any
> sleeping, but the chose they make will force them to sleep.

I am not sure I follow you here. Do you mean they will be forced to
sleep with PREEMPT_RT?

Uladzislau Rezki Sept. 30, 2020, 1:39 p.m. UTC | #38

On Wed, Sep 30, 2020 at 02:44:13PM +0200, Michal Hocko wrote:
> On Wed 30-09-20 14:35:35, Uladzislau Rezki wrote:
> > On Wed, Sep 30, 2020 at 11:27:32AM +0200, Michal Hocko wrote:
> > > On Tue 29-09-20 18:25:14, Uladzislau Rezki wrote:
> > > > > > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> > > > > > to provide a memory service for contexts which are not allowed to
> > > > > > sleep, RCU is part of them. Both flags used to provide such ability
> > > > > > before but not anymore.
> > > > > > 
> > > > > > Do you agree with it?
> > > > > 
> > > > > Yes this sucks. But this is something that we likely really want to live
> > > > > with. We have to explicitly _document_ that really atomic contexts in RT
> > > > > cannot use the allocator. From the past discussions we've had this is
> > > > > likely the most reasonable way forward because we do not really want to
> > > > > encourage anybody to do something like that and there should be ways
> > > > > around that. The same is btw. true also for !RT. The allocator is not
> > > > > NMI safe and while we should be able to make it compatible I am not
> > > > > convinced we really want to.
> > > > > 
> > > > > Would something like this be helpful wrt documentation?
> > > > > 
> > > > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > > > > index 67a0774e080b..9fcd47606493 100644
> > > > > --- a/include/linux/gfp.h
> > > > > +++ b/include/linux/gfp.h
> > > > > @@ -238,7 +238,9 @@ struct vm_area_struct;
> > > > >   * %__GFP_FOO flags as necessary.
> > > > >   *
> > > > >   * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
> > > > > - * watermark is applied to allow access to "atomic reserves"
> > > > > + * watermark is applied to allow access to "atomic reserves".
> > > > > + * The current implementation doesn't support NMI and other non-preemptive context
> > > > > + * (e.g. raw_spin_lock).
> > > > >   *
> > > > >   * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires
> > > > >   * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
> > > > > 
> > > > To me it is clear. But also above conflicting statement:
> > > > 
> > > > <snip>
> > > > %GFP_ATOMIC users can not sleep and need the allocation to succeed. A %lower
> > > > <snip>
> > > > 
> > > > should be rephrased, IMHO.
> > > 
> > > Any suggestions? Or more specifics about which part is conflicting? It
> > > tries to say that there is a higher demand to succeed even though the
> > > context cannot sleep to take active measures to achieve that. So the
> > > only way to achieve that is to break the watermakrs to a certain degree
> > > which is making them more "higher class" than other allocations.
> > > 
> > Michal, i had only one concern about it. It says that %GFP_ATOMIC users
> > can not sleep, i.e. callers know that they are in atomic, thus no any
> > sleeping, but the chose they make will force them to sleep.
> 
> I am not sure I follow you here. Do you mean they will be forced to
> sleep with PREEMPT_RT?
> 
Exactly :)

--
Vlad Rezki

Vlastimil Babka Sept. 30, 2020, 2:39 p.m. UTC | #39

On 9/30/20 12:07 AM, Uladzislau Rezki wrote:
> On Tue, Sep 29, 2020 at 12:15:34PM +0200, Vlastimil Babka wrote:
>> On 9/18/20 9:48 PM, Uladzislau Rezki (Sony) wrote:
>> 
>> After reading all the threads and mulling over this, I am going to deflect from
>> Mel and Michal and not oppose the idea of lockless allocation. I would even
>> prefer to do it via the gfp flag and not a completely separate path. Not using
>> the exact code from v1, I think it could be done in a way that we don't actually
>> look at the new flag until we find that pcplist is empty - which should not
>> introduce overhead to the fast-fast path when pcpclist is not empty. It's more
>> maintainable that adding new entry points, IMHO.
>> 
> Thanks for reading all that from the beginning! It must be tough due to the
> fact there were lot of messages in the threads, so at least i was lost.
> 
> I agree that adding a new entry or separate lock-less function can be considered
> as something that is hard to maintain. I have a question here. I mean about your
> different look at it:
> 
> <snip>
> bool is_pcp_cache_empty(gfp_t gfp)
> {
>     struct per_cpu_pages *pcp;
>     struct zoneref *ref;
>     unsigned long flags;
>     bool empty;
> 
>     ref = first_zones_zonelist(node_zonelist(
>             numa_node_id(), gfp), gfp_zone(gfp), NULL);
>     if (!ref->zone)
>             return true;
> 
>     local_irq_save(flags);
>     pcp = &this_cpu_ptr(ref->zone->pageset)->pcp;
>     empty = list_empty(&pcp->lists[gfp_migratetype(gfp)]);
>     local_irq_restore(flags);
> 
>     return empty;
> }
> 
> disable_irq();
> if (!is_pcp_cache_empty(GFP_NOWAIT))
>     __get_free_page(GFP_NOWAIT);
> enable_irq();
> <snip>
> 
> Do you mean to have something like above? I mean some extra API
> function that returns true or false if fast-fast allocation can
> either occur or not. Above code works just fine and never touches
> main zone->lock.
> 
> i.e. Instead of introducing an extra GFP_LOCKLESS flag or any new
> extra lock-less function. We could have something that checks a
> pcp page cache list, thus it can guarantee that a request would
> be accomplish using fast-fast path.

No, I meant going back to idea of new gfp flag, but adjust the implementation in
the allocator (different from what you posted in previous version) so that it
only looks at the flag after it tries to allocate from pcplist and finds out
it's empty. So, no inventing of new page allocator entry points or checks such
as the one you wrote above, but adding the new gfp flag in a way that it doesn't
affect existing fast paths.

Joel Fernandes Sept. 30, 2020, 3:25 p.m. UTC | #40

On Fri, Sep 25, 2020 at 05:47:41PM +0200, Michal Hocko wrote:
> On Fri 25-09-20 17:31:29, Uladzislau Rezki wrote:
> > > > > > 
> > > > > > All good points!
> > > > > > 
> > > > > > On the other hand, duplicating a portion of the allocator functionality
> > > > > > within RCU increases the amount of reserved memory, and needlessly most
> > > > > > of the time.
> > > > > > 
> > > > > 
> > > > > But it's very similar to what mempools are for.
> > > > > 
> > > > As for dynamic caching or mempools. It requires extra logic on top of RCU
> > > > to move things forward and it might be not efficient way. As a side
> > > > effect, maintaining of the bulk arrays in the separate worker thread
> > > > will introduce other drawbacks:
> > > 
> > > This is true but it is also true that it is RCU to require this special
> > > logic and we can expect that we might need to fine tune this logic
> > > depending on the RCU usage. We definitely do not want to tune the
> > > generic page allocator for a very specific usecase, do we?
> > > 
> > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> > to provide a memory service for contexts which are not allowed to
> > sleep, RCU is part of them. Both flags used to provide such ability
> > before but not anymore.
> > 
> > Do you agree with it?
> 
> Yes this sucks. But this is something that we likely really want to live
> with. We have to explicitly _document_ that really atomic contexts in RT
> cannot use the allocator. From the past discussions we've had this is
> likely the most reasonable way forward because we do not really want to
> encourage anybody to do something like that and there should be ways
> around that. The same is btw. true also for !RT. The allocator is not
> NMI safe and while we should be able to make it compatible I am not
> convinced we really want to.
> 
> Would something like this be helpful wrt documentation?
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 67a0774e080b..9fcd47606493 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -238,7 +238,9 @@ struct vm_area_struct;
>   * %__GFP_FOO flags as necessary.
>   *
>   * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
> - * watermark is applied to allow access to "atomic reserves"
> + * watermark is applied to allow access to "atomic reserves".
> + * The current implementation doesn't support NMI and other non-preemptive context
> + * (e.g. raw_spin_lock).

I think documenting is useful.

Could it be more explicit in what the issue is? Something like:

* Even with GFP_ATOMIC, calls to the allocator can sleep on PREEMPT_RT
systems. Therefore, the current low-level allocator implementation does not
support being called from special contexts that are atomic on RT - such as
NMI and raw_spin_lock. Due to these constraints and considering calling code
usually has no control over the PREEMPT_RT configuration, callers of the
allocator should avoid calling the allocator from these cotnexts even in
non-RT systems.

thanks!

 - Joel


>   *
>   * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires
>   * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
> 
> [...]
> -- 
> Michal Hocko
> SUSE Labs

Joel Fernandes Sept. 30, 2020, 3:37 p.m. UTC | #41

On Wed, Sep 30, 2020 at 04:39:53PM +0200, Vlastimil Babka wrote:
> On 9/30/20 12:07 AM, Uladzislau Rezki wrote:
> > On Tue, Sep 29, 2020 at 12:15:34PM +0200, Vlastimil Babka wrote:
> >> On 9/18/20 9:48 PM, Uladzislau Rezki (Sony) wrote:
> >> 
> >> After reading all the threads and mulling over this, I am going to deflect from
> >> Mel and Michal and not oppose the idea of lockless allocation. I would even
> >> prefer to do it via the gfp flag and not a completely separate path. Not using
> >> the exact code from v1, I think it could be done in a way that we don't actually
> >> look at the new flag until we find that pcplist is empty - which should not
> >> introduce overhead to the fast-fast path when pcpclist is not empty. It's more
> >> maintainable that adding new entry points, IMHO.
> >> 
> > Thanks for reading all that from the beginning! It must be tough due to the
> > fact there were lot of messages in the threads, so at least i was lost.
> > 
> > I agree that adding a new entry or separate lock-less function can be considered
> > as something that is hard to maintain. I have a question here. I mean about your
> > different look at it:
> > 
> > <snip>
> > bool is_pcp_cache_empty(gfp_t gfp)
> > {
> >     struct per_cpu_pages *pcp;
> >     struct zoneref *ref;
> >     unsigned long flags;
> >     bool empty;
> > 
> >     ref = first_zones_zonelist(node_zonelist(
> >             numa_node_id(), gfp), gfp_zone(gfp), NULL);
> >     if (!ref->zone)
> >             return true;
> > 
> >     local_irq_save(flags);
> >     pcp = &this_cpu_ptr(ref->zone->pageset)->pcp;
> >     empty = list_empty(&pcp->lists[gfp_migratetype(gfp)]);
> >     local_irq_restore(flags);
> > 
> >     return empty;
> > }
> > 
> > disable_irq();
> > if (!is_pcp_cache_empty(GFP_NOWAIT))
> >     __get_free_page(GFP_NOWAIT);
> > enable_irq();
> > <snip>
> > 
> > Do you mean to have something like above? I mean some extra API
> > function that returns true or false if fast-fast allocation can
> > either occur or not. Above code works just fine and never touches
> > main zone->lock.
> > 
> > i.e. Instead of introducing an extra GFP_LOCKLESS flag or any new
> > extra lock-less function. We could have something that checks a
> > pcp page cache list, thus it can guarantee that a request would
> > be accomplish using fast-fast path.
> 
> No, I meant going back to idea of new gfp flag, but adjust the implementation in
> the allocator (different from what you posted in previous version) so that it
> only looks at the flag after it tries to allocate from pcplist and finds out
> it's empty. So, no inventing of new page allocator entry points or checks such
> as the one you wrote above, but adding the new gfp flag in a way that it doesn't
> affect existing fast paths.

I like the idea of a new flag too :) After all telling the allocator more
about what your context can tolerate, via GFP flags, is not without
precedent.

thanks,

 - Joel

Michal Hocko Sept. 30, 2020, 4:46 p.m. UTC | #42

On Wed 30-09-20 15:39:54, Uladzislau Rezki wrote:
> On Wed, Sep 30, 2020 at 02:44:13PM +0200, Michal Hocko wrote:
> > On Wed 30-09-20 14:35:35, Uladzislau Rezki wrote:
> > > On Wed, Sep 30, 2020 at 11:27:32AM +0200, Michal Hocko wrote:
> > > > On Tue 29-09-20 18:25:14, Uladzislau Rezki wrote:
> > > > > > > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> > > > > > > to provide a memory service for contexts which are not allowed to
> > > > > > > sleep, RCU is part of them. Both flags used to provide such ability
> > > > > > > before but not anymore.
> > > > > > > 
> > > > > > > Do you agree with it?
> > > > > > 
> > > > > > Yes this sucks. But this is something that we likely really want to live
> > > > > > with. We have to explicitly _document_ that really atomic contexts in RT
> > > > > > cannot use the allocator. From the past discussions we've had this is
> > > > > > likely the most reasonable way forward because we do not really want to
> > > > > > encourage anybody to do something like that and there should be ways
> > > > > > around that. The same is btw. true also for !RT. The allocator is not
> > > > > > NMI safe and while we should be able to make it compatible I am not
> > > > > > convinced we really want to.
> > > > > > 
> > > > > > Would something like this be helpful wrt documentation?
> > > > > > 
> > > > > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > > > > > index 67a0774e080b..9fcd47606493 100644
> > > > > > --- a/include/linux/gfp.h
> > > > > > +++ b/include/linux/gfp.h
> > > > > > @@ -238,7 +238,9 @@ struct vm_area_struct;
> > > > > >   * %__GFP_FOO flags as necessary.
> > > > > >   *
> > > > > >   * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
> > > > > > - * watermark is applied to allow access to "atomic reserves"
> > > > > > + * watermark is applied to allow access to "atomic reserves".
> > > > > > + * The current implementation doesn't support NMI and other non-preemptive context
> > > > > > + * (e.g. raw_spin_lock).
> > > > > >   *
> > > > > >   * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires
> > > > > >   * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
> > > > > > 
> > > > > To me it is clear. But also above conflicting statement:
> > > > > 
> > > > > <snip>
> > > > > %GFP_ATOMIC users can not sleep and need the allocation to succeed. A %lower
> > > > > <snip>
> > > > > 
> > > > > should be rephrased, IMHO.
> > > > 
> > > > Any suggestions? Or more specifics about which part is conflicting? It
> > > > tries to say that there is a higher demand to succeed even though the
> > > > context cannot sleep to take active measures to achieve that. So the
> > > > only way to achieve that is to break the watermakrs to a certain degree
> > > > which is making them more "higher class" than other allocations.
> > > > 
> > > Michal, i had only one concern about it. It says that %GFP_ATOMIC users
> > > can not sleep, i.e. callers know that they are in atomic, thus no any
> > > sleeping, but the chose they make will force them to sleep.
> > 
> > I am not sure I follow you here. Do you mean they will be forced to
> > sleep with PREEMPT_RT?
> > 
> Exactly :)

We can make that more specific once RT patchset is merged. As of now
this is not the thing in the Linus tree. I believe there will be more to
clarify about atomic contexts in the RT tree as it means something else
than people are used to think.

Michal Hocko Sept. 30, 2020, 4:48 p.m. UTC | #43

On Wed 30-09-20 11:25:17, Joel Fernandes wrote:
> On Fri, Sep 25, 2020 at 05:47:41PM +0200, Michal Hocko wrote:
> > On Fri 25-09-20 17:31:29, Uladzislau Rezki wrote:
> > > > > > > 
> > > > > > > All good points!
> > > > > > > 
> > > > > > > On the other hand, duplicating a portion of the allocator functionality
> > > > > > > within RCU increases the amount of reserved memory, and needlessly most
> > > > > > > of the time.
> > > > > > > 
> > > > > > 
> > > > > > But it's very similar to what mempools are for.
> > > > > > 
> > > > > As for dynamic caching or mempools. It requires extra logic on top of RCU
> > > > > to move things forward and it might be not efficient way. As a side
> > > > > effect, maintaining of the bulk arrays in the separate worker thread
> > > > > will introduce other drawbacks:
> > > > 
> > > > This is true but it is also true that it is RCU to require this special
> > > > logic and we can expect that we might need to fine tune this logic
> > > > depending on the RCU usage. We definitely do not want to tune the
> > > > generic page allocator for a very specific usecase, do we?
> > > > 
> > > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> > > to provide a memory service for contexts which are not allowed to
> > > sleep, RCU is part of them. Both flags used to provide such ability
> > > before but not anymore.
> > > 
> > > Do you agree with it?
> > 
> > Yes this sucks. But this is something that we likely really want to live
> > with. We have to explicitly _document_ that really atomic contexts in RT
> > cannot use the allocator. From the past discussions we've had this is
> > likely the most reasonable way forward because we do not really want to
> > encourage anybody to do something like that and there should be ways
> > around that. The same is btw. true also for !RT. The allocator is not
> > NMI safe and while we should be able to make it compatible I am not
> > convinced we really want to.
> > 
> > Would something like this be helpful wrt documentation?
> > 
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 67a0774e080b..9fcd47606493 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -238,7 +238,9 @@ struct vm_area_struct;
> >   * %__GFP_FOO flags as necessary.
> >   *
> >   * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
> > - * watermark is applied to allow access to "atomic reserves"
> > + * watermark is applied to allow access to "atomic reserves".
> > + * The current implementation doesn't support NMI and other non-preemptive context
> > + * (e.g. raw_spin_lock).
> 
> I think documenting is useful.
> 
> Could it be more explicit in what the issue is? Something like:
> 
> * Even with GFP_ATOMIC, calls to the allocator can sleep on PREEMPT_RT
> systems. Therefore, the current low-level allocator implementation does not
> support being called from special contexts that are atomic on RT - such as
> NMI and raw_spin_lock. Due to these constraints and considering calling code
> usually has no control over the PREEMPT_RT configuration, callers of the
> allocator should avoid calling the allocator from these cotnexts even in
> non-RT systems.

I do not mind documenting RT specific behavior but as mentioned in other
reply, this should likely go via RT tree for now. There is likely more
to clarify about atomicity for PREEMPT_RT.

Joel Fernandes Sept. 30, 2020, 5:03 p.m. UTC | #44

On Wed, Sep 30, 2020 at 12:48 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 30-09-20 11:25:17, Joel Fernandes wrote:
> > On Fri, Sep 25, 2020 at 05:47:41PM +0200, Michal Hocko wrote:
> > > On Fri 25-09-20 17:31:29, Uladzislau Rezki wrote:
> > > > > > > >
> > > > > > > > All good points!
> > > > > > > >
> > > > > > > > On the other hand, duplicating a portion of the allocator functionality
> > > > > > > > within RCU increases the amount of reserved memory, and needlessly most
> > > > > > > > of the time.
> > > > > > > >
> > > > > > >
> > > > > > > But it's very similar to what mempools are for.
> > > > > > >
> > > > > > As for dynamic caching or mempools. It requires extra logic on top of RCU
> > > > > > to move things forward and it might be not efficient way. As a side
> > > > > > effect, maintaining of the bulk arrays in the separate worker thread
> > > > > > will introduce other drawbacks:
> > > > >
> > > > > This is true but it is also true that it is RCU to require this special
> > > > > logic and we can expect that we might need to fine tune this logic
> > > > > depending on the RCU usage. We definitely do not want to tune the
> > > > > generic page allocator for a very specific usecase, do we?
> > > > >
> > > > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> > > > to provide a memory service for contexts which are not allowed to
> > > > sleep, RCU is part of them. Both flags used to provide such ability
> > > > before but not anymore.
> > > >
> > > > Do you agree with it?
> > >
> > > Yes this sucks. But this is something that we likely really want to live
> > > with. We have to explicitly _document_ that really atomic contexts in RT
> > > cannot use the allocator. From the past discussions we've had this is
> > > likely the most reasonable way forward because we do not really want to
> > > encourage anybody to do something like that and there should be ways
> > > around that. The same is btw. true also for !RT. The allocator is not
> > > NMI safe and while we should be able to make it compatible I am not
> > > convinced we really want to.
> > >
> > > Would something like this be helpful wrt documentation?
> > >
> > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > > index 67a0774e080b..9fcd47606493 100644
> > > --- a/include/linux/gfp.h
> > > +++ b/include/linux/gfp.h
> > > @@ -238,7 +238,9 @@ struct vm_area_struct;
> > >   * %__GFP_FOO flags as necessary.
> > >   *
> > >   * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
> > > - * watermark is applied to allow access to "atomic reserves"
> > > + * watermark is applied to allow access to "atomic reserves".
> > > + * The current implementation doesn't support NMI and other non-preemptive context
> > > + * (e.g. raw_spin_lock).
> >
> > I think documenting is useful.
> >
> > Could it be more explicit in what the issue is? Something like:
> >
> > * Even with GFP_ATOMIC, calls to the allocator can sleep on PREEMPT_RT
> > systems. Therefore, the current low-level allocator implementation does not
> > support being called from special contexts that are atomic on RT - such as
> > NMI and raw_spin_lock. Due to these constraints and considering calling code
> > usually has no control over the PREEMPT_RT configuration, callers of the
> > allocator should avoid calling the allocator from these cotnexts even in
> > non-RT systems.
>
> I do not mind documenting RT specific behavior but as mentioned in other
> reply, this should likely go via RT tree for now. There is likely more
> to clarify about atomicity for PREEMPT_RT.

I am sorry, I did not understand what you meant by something missing
in Linus Tree. CONFIG_PREEMPT_RT is there.

Could you clarify? Also the CONFIG_PREEMPT_RT is the only thing
driving this requirement for GFP_ATOMIC right? Or are there non-RT
reasons why GFP_ATOMIC allocation cannot be done from
NMI/raw_spin_lock ?

Thanks,

 - Joel

Michal Hocko Sept. 30, 2020, 5:22 p.m. UTC | #45

On Wed 30-09-20 13:03:29, Joel Fernandes wrote:
> On Wed, Sep 30, 2020 at 12:48 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 30-09-20 11:25:17, Joel Fernandes wrote:
> > > On Fri, Sep 25, 2020 at 05:47:41PM +0200, Michal Hocko wrote:
> > > > On Fri 25-09-20 17:31:29, Uladzislau Rezki wrote:
> > > > > > > > >
> > > > > > > > > All good points!
> > > > > > > > >
> > > > > > > > > On the other hand, duplicating a portion of the allocator functionality
> > > > > > > > > within RCU increases the amount of reserved memory, and needlessly most
> > > > > > > > > of the time.
> > > > > > > > >
> > > > > > > >
> > > > > > > > But it's very similar to what mempools are for.
> > > > > > > >
> > > > > > > As for dynamic caching or mempools. It requires extra logic on top of RCU
> > > > > > > to move things forward and it might be not efficient way. As a side
> > > > > > > effect, maintaining of the bulk arrays in the separate worker thread
> > > > > > > will introduce other drawbacks:
> > > > > >
> > > > > > This is true but it is also true that it is RCU to require this special
> > > > > > logic and we can expect that we might need to fine tune this logic
> > > > > > depending on the RCU usage. We definitely do not want to tune the
> > > > > > generic page allocator for a very specific usecase, do we?
> > > > > >
> > > > > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> > > > > to provide a memory service for contexts which are not allowed to
> > > > > sleep, RCU is part of them. Both flags used to provide such ability
> > > > > before but not anymore.
> > > > >
> > > > > Do you agree with it?
> > > >
> > > > Yes this sucks. But this is something that we likely really want to live
> > > > with. We have to explicitly _document_ that really atomic contexts in RT
> > > > cannot use the allocator. From the past discussions we've had this is
> > > > likely the most reasonable way forward because we do not really want to
> > > > encourage anybody to do something like that and there should be ways
> > > > around that. The same is btw. true also for !RT. The allocator is not
> > > > NMI safe and while we should be able to make it compatible I am not
> > > > convinced we really want to.
> > > >
> > > > Would something like this be helpful wrt documentation?
> > > >
> > > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > > > index 67a0774e080b..9fcd47606493 100644
> > > > --- a/include/linux/gfp.h
> > > > +++ b/include/linux/gfp.h
> > > > @@ -238,7 +238,9 @@ struct vm_area_struct;
> > > >   * %__GFP_FOO flags as necessary.
> > > >   *
> > > >   * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
> > > > - * watermark is applied to allow access to "atomic reserves"
> > > > + * watermark is applied to allow access to "atomic reserves".
> > > > + * The current implementation doesn't support NMI and other non-preemptive context
> > > > + * (e.g. raw_spin_lock).
> > >
> > > I think documenting is useful.
> > >
> > > Could it be more explicit in what the issue is? Something like:
> > >
> > > * Even with GFP_ATOMIC, calls to the allocator can sleep on PREEMPT_RT
> > > systems. Therefore, the current low-level allocator implementation does not
> > > support being called from special contexts that are atomic on RT - such as
> > > NMI and raw_spin_lock. Due to these constraints and considering calling code
> > > usually has no control over the PREEMPT_RT configuration, callers of the
> > > allocator should avoid calling the allocator from these cotnexts even in
> > > non-RT systems.
> >
> > I do not mind documenting RT specific behavior but as mentioned in other
> > reply, this should likely go via RT tree for now. There is likely more
> > to clarify about atomicity for PREEMPT_RT.
> 
> I am sorry, I did not understand what you meant by something missing
> in Linus Tree. CONFIG_PREEMPT_RT is there.

OK, I was not aware we already CONFIG_PREEMPT_RT in the three. There is
still a lot from the RT patchset including sleeping spin locks that make
a real difference. Or maybe I haven't checked properly.

> Could you clarify? Also the CONFIG_PREEMPT_RT is the only thing
> driving this requirement for GFP_ATOMIC right? Or are there non-RT
> reasons why GFP_ATOMIC allocation cannot be done from
> NMI/raw_spin_lock ?

I have already sent a clarification patch [1]. NMI is not supported
regardless of the preemption mode. Hope this clarifies it a bit.

[1] http://lkml.kernel.org/r/20200929123010.5137-1-mhocko@kernel.org

Joel Fernandes Sept. 30, 2020, 5:48 p.m. UTC | #46

On Wed, Sep 30, 2020 at 1:22 PM Michal Hocko <mhocko@suse.com> wrote:

> > > > I think documenting is useful.
> > > >
> > > > Could it be more explicit in what the issue is? Something like:
> > > >
> > > > * Even with GFP_ATOMIC, calls to the allocator can sleep on PREEMPT_RT
> > > > systems. Therefore, the current low-level allocator implementation does not
> > > > support being called from special contexts that are atomic on RT - such as
> > > > NMI and raw_spin_lock. Due to these constraints and considering calling code
> > > > usually has no control over the PREEMPT_RT configuration, callers of the
> > > > allocator should avoid calling the allocator from these cotnexts even in
> > > > non-RT systems.
> > >
> > > I do not mind documenting RT specific behavior but as mentioned in other
> > > reply, this should likely go via RT tree for now. There is likely more
> > > to clarify about atomicity for PREEMPT_RT.
> >
> > I am sorry, I did not understand what you meant by something missing
> > in Linus Tree. CONFIG_PREEMPT_RT is there.
>
> OK, I was not aware we already CONFIG_PREEMPT_RT in the three. There is
> still a lot from the RT patchset including sleeping spin locks that make
> a real difference. Or maybe I haven't checked properly.
>
> > Could you clarify? Also the CONFIG_PREEMPT_RT is the only thing
> > driving this requirement for GFP_ATOMIC right? Or are there non-RT
> > reasons why GFP_ATOMIC allocation cannot be done from
> > NMI/raw_spin_lock ?
>
> I have already sent a clarification patch [1]. NMI is not supported
> regardless of the preemption mode. Hope this clarifies it a bit.
>
> [1] http://lkml.kernel.org/r/20200929123010.5137-1-mhocko@kernel.org

That works for me. Thanks,

  - Joel

Uladzislau Rezki Sept. 30, 2020, 8:36 p.m. UTC | #47

On Wed, Sep 30, 2020 at 06:46:00PM +0200, Michal Hocko wrote:
> On Wed 30-09-20 15:39:54, Uladzislau Rezki wrote:
> > On Wed, Sep 30, 2020 at 02:44:13PM +0200, Michal Hocko wrote:
> > > On Wed 30-09-20 14:35:35, Uladzislau Rezki wrote:
> > > > On Wed, Sep 30, 2020 at 11:27:32AM +0200, Michal Hocko wrote:
> > > > > On Tue 29-09-20 18:25:14, Uladzislau Rezki wrote:
> > > > > > > > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> > > > > > > > to provide a memory service for contexts which are not allowed to
> > > > > > > > sleep, RCU is part of them. Both flags used to provide such ability
> > > > > > > > before but not anymore.
> > > > > > > > 
> > > > > > > > Do you agree with it?
> > > > > > > 
> > > > > > > Yes this sucks. But this is something that we likely really want to live
> > > > > > > with. We have to explicitly _document_ that really atomic contexts in RT
> > > > > > > cannot use the allocator. From the past discussions we've had this is
> > > > > > > likely the most reasonable way forward because we do not really want to
> > > > > > > encourage anybody to do something like that and there should be ways
> > > > > > > around that. The same is btw. true also for !RT. The allocator is not
> > > > > > > NMI safe and while we should be able to make it compatible I am not
> > > > > > > convinced we really want to.
> > > > > > > 
> > > > > > > Would something like this be helpful wrt documentation?
> > > > > > > 
> > > > > > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > > > > > > index 67a0774e080b..9fcd47606493 100644
> > > > > > > --- a/include/linux/gfp.h
> > > > > > > +++ b/include/linux/gfp.h
> > > > > > > @@ -238,7 +238,9 @@ struct vm_area_struct;
> > > > > > >   * %__GFP_FOO flags as necessary.
> > > > > > >   *
> > > > > > >   * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
> > > > > > > - * watermark is applied to allow access to "atomic reserves"
> > > > > > > + * watermark is applied to allow access to "atomic reserves".
> > > > > > > + * The current implementation doesn't support NMI and other non-preemptive context
> > > > > > > + * (e.g. raw_spin_lock).
> > > > > > >   *
> > > > > > >   * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires
> > > > > > >   * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
> > > > > > > 
> > > > > > To me it is clear. But also above conflicting statement:
> > > > > > 
> > > > > > <snip>
> > > > > > %GFP_ATOMIC users can not sleep and need the allocation to succeed. A %lower
> > > > > > <snip>
> > > > > > 
> > > > > > should be rephrased, IMHO.
> > > > > 
> > > > > Any suggestions? Or more specifics about which part is conflicting? It
> > > > > tries to say that there is a higher demand to succeed even though the
> > > > > context cannot sleep to take active measures to achieve that. So the
> > > > > only way to achieve that is to break the watermakrs to a certain degree
> > > > > which is making them more "higher class" than other allocations.
> > > > > 
> > > > Michal, i had only one concern about it. It says that %GFP_ATOMIC users
> > > > can not sleep, i.e. callers know that they are in atomic, thus no any
> > > > sleeping, but the chose they make will force them to sleep.
> > > 
> > > I am not sure I follow you here. Do you mean they will be forced to
> > > sleep with PREEMPT_RT?
> > > 
> > Exactly :)
> 
> We can make that more specific once RT patchset is merged. As of now
> this is not the thing in the Linus tree. I believe there will be more to
> clarify about atomic contexts in the RT tree as it means something else
> than people are used to think.
> 
Fair enough. Currently we have only the Kconfig CONFIG_PREEMPT_RT
symbol, so RT patches are still not in place.

--
Vlad Rezki

Uladzislau Rezki Oct. 1, 2020, 7:26 p.m. UTC | #48

> 
> No, I meant going back to idea of new gfp flag, but adjust the implementation in
> the allocator (different from what you posted in previous version) so that it
> only looks at the flag after it tries to allocate from pcplist and finds out
> it's empty. So, no inventing of new page allocator entry points or checks such
> as the one you wrote above, but adding the new gfp flag in a way that it doesn't
> affect existing fast paths.
>
OK. Now i see. Please have a look below at the patch, so we fully understand
each other. If that is something that is close to your view or not:

<snip>
t a/include/linux/gfp.h b/include/linux/gfp.h
index c603237e006c..7e613560a502 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -39,8 +39,9 @@ struct vm_area_struct;
 #define ___GFP_HARDWALL                0x100000u
 #define ___GFP_THISNODE                0x200000u
 #define ___GFP_ACCOUNT         0x400000u
+#define ___GFP_NO_LOCKS                0x800000u
 #ifdef CONFIG_LOCKDEP
-#define ___GFP_NOLOCKDEP       0x800000u
+#define ___GFP_NOLOCKDEP       0x1000000u
 #else
 #define ___GFP_NOLOCKDEP       0
 #endif
@@ -215,16 +216,22 @@ struct vm_area_struct;
  * %__GFP_COMP address compound page metadata.
  *
  * %__GFP_ZERO returns a zeroed page on success.
+ *
+ * %__GFP_NO_LOCKS order-0 allocation without sleepable-locks.
+ * It obtains a page from the per-cpu-list and considered as
+ * lock-less. No other actions are performed, thus it returns
+ * NULL if per-cpu-list is empty.
  */
 #define __GFP_NOWARN   ((__force gfp_t)___GFP_NOWARN)
 #define __GFP_COMP     ((__force gfp_t)___GFP_COMP)
 #define __GFP_ZERO     ((__force gfp_t)___GFP_ZERO)
+#define __GFP_NO_LOCKS ((__force gfp_t)___GFP_NO_LOCKS)

 /* Disable lockdep for GFP context tracking */
 #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)

 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (23 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT (24 + IS_ENABLED(CONFIG_LOCKDEP))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /**
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 67018d367b9f..d99af78237be 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -45,6 +45,7 @@
        {(unsigned long)__GFP_RECLAIMABLE,      "__GFP_RECLAIMABLE"},   \
        {(unsigned long)__GFP_MOVABLE,          "__GFP_MOVABLE"},       \
        {(unsigned long)__GFP_ACCOUNT,          "__GFP_ACCOUNT"},       \
+       {(unsigned long)__GFP_NO_LOCKS,         "__GFP_NO_LOCKS"},      \
        {(unsigned long)__GFP_WRITE,            "__GFP_WRITE"},         \
        {(unsigned long)__GFP_RECLAIM,          "__GFP_RECLAIM"},       \
        {(unsigned long)__GFP_DIRECT_RECLAIM,   "__GFP_DIRECT_RECLAIM"},\
diff --git a/mm/internal.h b/mm/internal.h
index 6345b08ce86c..5724fba921f9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -569,6 +569,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_NOFRAGMENT         0x0
 #endif
 #define ALLOC_KSWAPD           0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
+#define ALLOC_NO_LOCKS         0x1000 /* Lock free allocation. */
 
 enum ttu_flags;
 struct tlbflush_unmap_batch;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aff1f84bf268..19cd9794dd45 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2837,6 +2837,9 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 {
        int i, alloced = 0;
 
+       if (alloc_flags & ALLOC_NO_LOCKS)
+               return alloced;
+
        spin_lock(&zone->lock);
        for (i = 0; i < count; ++i) {
                struct page *page = __rmqueue(zone, order, migratetype,
@@ -3805,7 +3808,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
                         * grow this zone if it contains deferred pages.
                         */
                        if (static_branch_unlikely(&deferred_pages)) {
-                               if (_deferred_grow_zone(zone, order))
+                               if (!(alloc_flags & ALLOC_NO_LOCKS) &&
+                                               _deferred_grow_zone(zone, order))
                                        goto try_this_zone;
                        }
 #endif
@@ -3850,7 +3854,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
                                reserve_highatomic_pageblock(page, zone, order);

                        return page;
-               } else {
+               } else if (!(alloc_flags & ALLOC_NO_LOCKS)) {
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
                        /* Try again if zone has deferred pages */
                        if (static_branch_unlikely(&deferred_pages)) {
@@ -4846,6 +4850,9 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
        ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                                        ac->highest_zoneidx, ac->nodemask);

+       if (gfp_mask & __GFP_NO_LOCKS)
+               *alloc_flags |= ALLOC_NO_LOCKS;
+
        return true;
 }

@@ -4886,6 +4893,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
        if (likely(page))
                goto out;

+       /* Bypass slow path if __GFP_NO_LOCKS. */
+       if ((gfp_mask & __GFP_NO_LOCKS))
+               goto out;
+
        /*
         * Apply scoped allocation constraints. This is mainly about GFP_NOFS
         * resp. GFP_NOIO which has to be inherited for all allocation requests
diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index a50dae2c4ae9..fee3221bcf6a 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -656,6 +656,7 @@ static const struct {
        { "__GFP_RECLAIMABLE",          "RC" },
        { "__GFP_MOVABLE",              "M" },
        { "__GFP_ACCOUNT",              "AC" },
+       { "__GFP_NO_LOCKS",             "NL" },
        { "__GFP_WRITE",                "WR" },
        { "__GFP_RECLAIM",              "R" },
        { "__GFP_DIRECT_RECLAIM",       "DR" },
<snip>

If not, could you please provide some snips or any pseudo code?

Thanks Vlastimil!

--
Vlad Rezki

Uladzislau Rezki Oct. 1, 2020, 7:32 p.m. UTC | #49

On Wed, Sep 30, 2020 at 12:35:57PM +0200, Michal Hocko wrote:
> On Wed 30-09-20 00:07:42, Uladzislau Rezki wrote:
> [...]
> > <snip>
> > bool is_pcp_cache_empty(gfp_t gfp)
> > {
> >     struct per_cpu_pages *pcp;
> >     struct zoneref *ref;
> >     unsigned long flags;
> >     bool empty;
> > 
> >     ref = first_zones_zonelist(node_zonelist(
> >             numa_node_id(), gfp), gfp_zone(gfp), NULL);
> >     if (!ref->zone)
> >             return true;
> > 
> >     local_irq_save(flags);
> >     pcp = &this_cpu_ptr(ref->zone->pageset)->pcp;
> >     empty = list_empty(&pcp->lists[gfp_migratetype(gfp)]);
> >     local_irq_restore(flags);
> > 
> >     return empty;
> > }
> > 
> > disable_irq();
> > if (!is_pcp_cache_empty(GFP_NOWAIT))
> >     __get_free_page(GFP_NOWAIT);
> > enable_irq();
> > <snip>
> > 
> > Do you mean to have something like above? I mean some extra API
> > function that returns true or false if fast-fast allocation can
> > either occur or not. Above code works just fine and never touches
> > main zone->lock.
> 
> The above code works with the _current_ implementation and it restricts
> its implementation to some degree. Future changes might get harder to
> implement with a pattern like this. I do not think we want users to be
> aware of internal implementation details like pcp caches, migrate types
> or others. While pcp caches are here for years and unlikely to change in
> a foreseeable future many details are changing on regular basis.
>
I see your view. That was en example for better understanding.

Thanks.

--
Vlad Rezki

Michal Hocko Oct. 2, 2020, 7:11 a.m. UTC | #50

On Thu 01-10-20 21:26:26, Uladzislau Rezki wrote:
> > 
> > No, I meant going back to idea of new gfp flag, but adjust the implementation in
> > the allocator (different from what you posted in previous version) so that it
> > only looks at the flag after it tries to allocate from pcplist and finds out
> > it's empty. So, no inventing of new page allocator entry points or checks such
> > as the one you wrote above, but adding the new gfp flag in a way that it doesn't
> > affect existing fast paths.
> >
> OK. Now i see. Please have a look below at the patch, so we fully understand
> each other. If that is something that is close to your view or not:
> 
> <snip>
> t a/include/linux/gfp.h b/include/linux/gfp.h
> index c603237e006c..7e613560a502 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -39,8 +39,9 @@ struct vm_area_struct;
>  #define ___GFP_HARDWALL                0x100000u
>  #define ___GFP_THISNODE                0x200000u
>  #define ___GFP_ACCOUNT         0x400000u
> +#define ___GFP_NO_LOCKS                0x800000u

Even if a new gfp flag gains a sufficient traction and support I am
_strongly_ opposed against consuming another flag for that. Bit space is
limited.  Besides that we certainly do not want to allow craziness like
__GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we?

Mel Gorman Oct. 2, 2020, 8:06 a.m. UTC | #51

On Thu, Oct 01, 2020 at 09:26:26PM +0200, Uladzislau Rezki wrote:
> > 
> > No, I meant going back to idea of new gfp flag, but adjust the implementation in
> > the allocator (different from what you posted in previous version) so that it
> > only looks at the flag after it tries to allocate from pcplist and finds out
> > it's empty. So, no inventing of new page allocator entry points or checks such
> > as the one you wrote above, but adding the new gfp flag in a way that it doesn't
> > affect existing fast paths.
> >
> OK. Now i see. Please have a look below at the patch, so we fully understand
> each other. If that is something that is close to your view or not:
> 
> <snip>
> t a/include/linux/gfp.h b/include/linux/gfp.h
> index c603237e006c..7e613560a502 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -39,8 +39,9 @@ struct vm_area_struct;
>  #define ___GFP_HARDWALL                0x100000u
>  #define ___GFP_THISNODE                0x200000u
>  #define ___GFP_ACCOUNT         0x400000u
> +#define ___GFP_NO_LOCKS                0x800000u
>  #ifdef CONFIG_LOCKDEP
> -#define ___GFP_NOLOCKDEP       0x800000u
> +#define ___GFP_NOLOCKDEP       0x1000000u
>  #else
>  #define ___GFP_NOLOCKDEP       0
>  #endif
> @@ -215,16 +216,22 @@ struct vm_area_struct;
>   * %__GFP_COMP address compound page metadata.
>   *
>   * %__GFP_ZERO returns a zeroed page on success.
> + *
> + * %__GFP_NO_LOCKS order-0 allocation without sleepable-locks.
> + * It obtains a page from the per-cpu-list and considered as
> + * lock-less. No other actions are performed, thus it returns
> + * NULL if per-cpu-list is empty.
>   */
>  #define __GFP_NOWARN   ((__force gfp_t)___GFP_NOWARN)
>  #define __GFP_COMP     ((__force gfp_t)___GFP_COMP)
>  #define __GFP_ZERO     ((__force gfp_t)___GFP_ZERO)
> +#define __GFP_NO_LOCKS ((__force gfp_t)___GFP_NO_LOCKS)
> 

I'm not a fan of the GFP flag approach simply because we've had cases
before where GFP flags were used in inappropriate contexts like
__GFP_MEMALLOC which led to a surprising amount of bugs, particularly
from out-of-tree drivers but also in-tree drivers. Of course, there
are limited GFP flags available too but at least the comment should
be as robust as possible. Maybe something like

 * %__GFP_NO_LOCKS attempts order-0 allocation without sleepable-locks. It
 * attempts to obtain a page without acquiring any spinlocks. This
 * should only be used in a context where the holder holds a
 * raw_spin_lock that cannot be released for the allocation request.
 * This may be necessary in PREEMPT_RT kernels where a
 * raw_spin_lock is held which does not sleep tries to acquire a
 * spin_lock that can sleep with PREEMPT_RT. This should not be
 * confused with GFP_ATOMIC contexts. Like atomic allocation
 * requests, there is no guarantee a page will be returned and
 * the caller must be able to deal with allocation failures.
 * The risk of allocation failure is higher than using GFP_ATOMIC.

It's verbose but it would be hard to misinterpret. I think we're
going to go through a period of time before people get familiar
with PREEMPT_RT-related hazards as various comments that were
true are going to be misleading for a while.

For anyone reviewing, any use of __GFP_NO_LOCKS should meet a high
standard where there is no alternative except to use the flags. i.e. a
higher standard "but I'm an important driver".

Mel Gorman Oct. 2, 2020, 8:50 a.m. UTC | #52

On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote:
> On Thu 01-10-20 21:26:26, Uladzislau Rezki wrote:
> > > 
> > > No, I meant going back to idea of new gfp flag, but adjust the implementation in
> > > the allocator (different from what you posted in previous version) so that it
> > > only looks at the flag after it tries to allocate from pcplist and finds out
> > > it's empty. So, no inventing of new page allocator entry points or checks such
> > > as the one you wrote above, but adding the new gfp flag in a way that it doesn't
> > > affect existing fast paths.
> > >
> > OK. Now i see. Please have a look below at the patch, so we fully understand
> > each other. If that is something that is close to your view or not:
> > 
> > <snip>
> > t a/include/linux/gfp.h b/include/linux/gfp.h
> > index c603237e006c..7e613560a502 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -39,8 +39,9 @@ struct vm_area_struct;
> >  #define ___GFP_HARDWALL                0x100000u
> >  #define ___GFP_THISNODE                0x200000u
> >  #define ___GFP_ACCOUNT         0x400000u
> > +#define ___GFP_NO_LOCKS                0x800000u
> 
> Even if a new gfp flag gains a sufficient traction and support I am
> _strongly_ opposed against consuming another flag for that. Bit space is
> limited. 

That is definitely true. I'm not happy with the GFP flag at all, the
comment is at best a damage limiting move. It still would be better for
a memory pool to be reserved and sized for critical allocations.

> Besides that we certainly do not want to allow craziness like
> __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we?

That would deserve to be taken to a dumpster and set on fire. The flag
combination could be checked in the allocator but the allocator path fast
paths are bad enough already.

Michal Hocko Oct. 2, 2020, 9:05 a.m. UTC | #53

On Fri 02-10-20 09:50:14, Mel Gorman wrote:
> On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote:
> > On Thu 01-10-20 21:26:26, Uladzislau Rezki wrote:
> > > > 
> > > > No, I meant going back to idea of new gfp flag, but adjust the implementation in
> > > > the allocator (different from what you posted in previous version) so that it
> > > > only looks at the flag after it tries to allocate from pcplist and finds out
> > > > it's empty. So, no inventing of new page allocator entry points or checks such
> > > > as the one you wrote above, but adding the new gfp flag in a way that it doesn't
> > > > affect existing fast paths.
> > > >
> > > OK. Now i see. Please have a look below at the patch, so we fully understand
> > > each other. If that is something that is close to your view or not:
> > > 
> > > <snip>
> > > t a/include/linux/gfp.h b/include/linux/gfp.h
> > > index c603237e006c..7e613560a502 100644
> > > --- a/include/linux/gfp.h
> > > +++ b/include/linux/gfp.h
> > > @@ -39,8 +39,9 @@ struct vm_area_struct;
> > >  #define ___GFP_HARDWALL                0x100000u
> > >  #define ___GFP_THISNODE                0x200000u
> > >  #define ___GFP_ACCOUNT         0x400000u
> > > +#define ___GFP_NO_LOCKS                0x800000u
> > 
> > Even if a new gfp flag gains a sufficient traction and support I am
> > _strongly_ opposed against consuming another flag for that. Bit space is
> > limited. 
> 
> That is definitely true. I'm not happy with the GFP flag at all, the
> comment is at best a damage limiting move. It still would be better for
> a memory pool to be reserved and sized for critical allocations.

Completely agreed. The only existing usecase is so special cased that a
dedicated pool is not only easier to maintain but it should be also much
better tuned for the specific workload. Something not really feasible
with the allocator.

> > Besides that we certainly do not want to allow craziness like
> > __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we?
> 
> That would deserve to be taken to a dumpster and set on fire. The flag
> combination could be checked in the allocator but the allocator path fast
> paths are bad enough already.

If a new allocation/gfp mode is absolutely necessary then I believe that
the most reasoanble way forward would be
#define GFP_NO_LOCK	((__force gfp_t)0)

and explicitly document it as a final flag to use without any further
modifiers. Yeah there are some that could be used potentially - e.g. zone
specifiers, __GFP_ZERO and likely few others. But support for those can
be added when there is an actual and reasonable demand.

I would also strongly argue against implementation alowing to fully
consume pcp free pages.

Peter Zijlstra Oct. 2, 2020, 9:07 a.m. UTC | #54

On Fri, Oct 02, 2020 at 09:50:14AM +0100, Mel Gorman wrote:
> On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote:

> > > +#define ___GFP_NO_LOCKS                0x800000u
> > 
> > Even if a new gfp flag gains a sufficient traction and support I am
> > _strongly_ opposed against consuming another flag for that. Bit space is
> > limited. 
> 
> That is definitely true. I'm not happy with the GFP flag at all, the
> comment is at best a damage limiting move. It still would be better for
> a memory pool to be reserved and sized for critical allocations.

This is one of the reasons I did a separate allocation function. No GFP
flag to leak into general usage.

> > Besides that we certainly do not want to allow craziness like
> > __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we?
> 
> That would deserve to be taken to a dumpster and set on fire. The flag
> combination could be checked in the allocator but the allocator path fast
> paths are bad enough already.

Isn't that what we have CONFIG_DEBUG_VM for?

Mel Gorman Oct. 2, 2020, 9:45 a.m. UTC | #55

On Fri, Oct 02, 2020 at 11:07:29AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 02, 2020 at 09:50:14AM +0100, Mel Gorman wrote:
> > On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote:
> 
> > > > +#define ___GFP_NO_LOCKS                0x800000u
> > > 
> > > Even if a new gfp flag gains a sufficient traction and support I am
> > > _strongly_ opposed against consuming another flag for that. Bit space is
> > > limited. 
> > 
> > That is definitely true. I'm not happy with the GFP flag at all, the
> > comment is at best a damage limiting move. It still would be better for
> > a memory pool to be reserved and sized for critical allocations.
> 
> This is one of the reasons I did a separate allocation function. No GFP
> flag to leak into general usage.
> 

Even a specific function with a hint that "this is for RCU only" will
not prevent abuse.

> > > Besides that we certainly do not want to allow craziness like
> > > __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we?
> > 
> > That would deserve to be taken to a dumpster and set on fire. The flag
> > combination could be checked in the allocator but the allocator path fast
> > paths are bad enough already.
> 
> Isn't that what we have CONFIG_DEBUG_VM for?

It's enabled by default by enough distros that adding too many checks
is potentially painful. Granted it would be missed by most benchmarking
which tend to control allocations from userspace but a lot of performance
problems I see are the "death by a thousand cuts" variety.

Peter Zijlstra Oct. 2, 2020, 9:58 a.m. UTC | #56

On Fri, Oct 02, 2020 at 10:45:02AM +0100, Mel Gorman wrote:
> On Fri, Oct 02, 2020 at 11:07:29AM +0200, Peter Zijlstra wrote:
> > On Fri, Oct 02, 2020 at 09:50:14AM +0100, Mel Gorman wrote:
> > > On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote:
> > 
> > > > > +#define ___GFP_NO_LOCKS                0x800000u
> > > > 
> > > > Even if a new gfp flag gains a sufficient traction and support I am
> > > > _strongly_ opposed against consuming another flag for that. Bit space is
> > > > limited. 
> > > 
> > > That is definitely true. I'm not happy with the GFP flag at all, the
> > > comment is at best a damage limiting move. It still would be better for
> > > a memory pool to be reserved and sized for critical allocations.
> > 
> > This is one of the reasons I did a separate allocation function. No GFP
> > flag to leak into general usage.
> > 
> 
> Even a specific function with a hint that "this is for RCU only" will
> not prevent abuse.

Not exporting it for modules helps, but yes.

> > > > Besides that we certainly do not want to allow craziness like
> > > > __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we?
> > > 
> > > That would deserve to be taken to a dumpster and set on fire. The flag
> > > combination could be checked in the allocator but the allocator path fast
> > > paths are bad enough already.
> > 
> > Isn't that what we have CONFIG_DEBUG_VM for?
> 
> It's enabled by default by enough distros that adding too many checks
> is potentially painful. Granted it would be missed by most benchmarking
> which tend to control allocations from userspace but a lot of performance
> problems I see are the "death by a thousand cuts" variety.

Oh quite agreed, aka death by accounting. But if people are enabling
DEBUG options in production kernels, there's something wrong, no?

Should we now go add CONFIG_REALLY_DEBUG_STAY_AWAY_ALREADY options?

Mel Gorman Oct. 2, 2020, 10:19 a.m. UTC | #57

On Fri, Oct 02, 2020 at 11:58:58AM +0200, Peter Zijlstra wrote:
> > It's enabled by default by enough distros that adding too many checks
> > is potentially painful. Granted it would be missed by most benchmarking
> > which tend to control allocations from userspace but a lot of performance
> > problems I see are the "death by a thousand cuts" variety.
> 
> Oh quite agreed, aka death by accounting. But if people are enabling
> DEBUG options in production kernels, there's something wrong, no?
> 

You'd think but historically I believe DEBUG_VM was enabled for some
distributions because it made certain classes of problems easier to debug
early. There is also a recent trend for enabling various DEBUG options for
"hardening" even when they protect very specific corner cases or are for
intended for kernel development. I've pushed back where I have an opinion
that matters but it's generally corrosive.

> Should we now go add CONFIG_REALLY_DEBUG_STAY_AWAY_ALREADY options?

It's heading in that direction :(

Paul E. McKenney Oct. 2, 2020, 2:41 p.m. UTC | #58

On Fri, Oct 02, 2020 at 11:19:52AM +0100, Mel Gorman wrote:
> On Fri, Oct 02, 2020 at 11:58:58AM +0200, Peter Zijlstra wrote:
> > > It's enabled by default by enough distros that adding too many checks
> > > is potentially painful. Granted it would be missed by most benchmarking
> > > which tend to control allocations from userspace but a lot of performance
> > > problems I see are the "death by a thousand cuts" variety.
> > 
> > Oh quite agreed, aka death by accounting. But if people are enabling
> > DEBUG options in production kernels, there's something wrong, no?
> > 
> 
> You'd think but historically I believe DEBUG_VM was enabled for some
> distributions because it made certain classes of problems easier to debug
> early. There is also a recent trend for enabling various DEBUG options for
> "hardening" even when they protect very specific corner cases or are for
> intended for kernel development. I've pushed back where I have an opinion
> that matters but it's generally corrosive.
> 
> > Should we now go add CONFIG_REALLY_DEBUG_STAY_AWAY_ALREADY options?
> 
> It's heading in that direction :(

Given that you guys have just reiterated yet again that you are very
unhappy with either a GFP_ flag or a special function like the one that
Peter Zijlstra put together, it would be very helpful if you were to at
least voice some level of support for Thomas Gleixner's patchset, which,
if accepted, will allow me to solve at least 50% of the problem.

https://lore.kernel.org/lkml/20200928233041.GA23230@paulmck-ThinkPad-P72/
	Patch series including Thomas's patchset.
https://lore.kernel.org/lkml/20201001210750.GA25287@paulmck-ThinkPad-P72/
	Corresponding pull request.

							Thanx, Paul

Uladzislau Rezki Oct. 5, 2020, 1:58 p.m. UTC | #59

On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote:
> On Thu 01-10-20 21:26:26, Uladzislau Rezki wrote:
> > > 
> > > No, I meant going back to idea of new gfp flag, but adjust the implementation in
> > > the allocator (different from what you posted in previous version) so that it
> > > only looks at the flag after it tries to allocate from pcplist and finds out
> > > it's empty. So, no inventing of new page allocator entry points or checks such
> > > as the one you wrote above, but adding the new gfp flag in a way that it doesn't
> > > affect existing fast paths.
> > >
> > OK. Now i see. Please have a look below at the patch, so we fully understand
> > each other. If that is something that is close to your view or not:
> > 
> > <snip>
> > t a/include/linux/gfp.h b/include/linux/gfp.h
> > index c603237e006c..7e613560a502 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -39,8 +39,9 @@ struct vm_area_struct;
> >  #define ___GFP_HARDWALL                0x100000u
> >  #define ___GFP_THISNODE                0x200000u
> >  #define ___GFP_ACCOUNT         0x400000u
> > +#define ___GFP_NO_LOCKS                0x800000u
> 
> Even if a new gfp flag gains a sufficient traction and support I am
> _strongly_ opposed against consuming another flag for that. Bit space is
> limited.
>
That is a valid point.

>
> Besides that we certainly do not want to allow craziness like
> __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we?
>
Obviously not. And it seems that the way of implementing of the
NO_LOCK logic would be easier(less code changes) and better if
it was defined like below(what you proposed later in this thread):

-#define __GFP_NO_LOCKS ((__force gfp_t)___GFP_NO_LOCKS)
+#define __GFP_NO_LOCKS ((__force gfp_t) 0)

That could imply that calling the page allocator with zero
argument would apply a further limitation - that is lock free.

--
Vlad Rezki

Uladzislau Rezki Oct. 5, 2020, 2:12 p.m. UTC | #60

On Fri, Oct 02, 2020 at 09:06:24AM +0100, Mel Gorman wrote:
> On Thu, Oct 01, 2020 at 09:26:26PM +0200, Uladzislau Rezki wrote:
> > > 
> > > No, I meant going back to idea of new gfp flag, but adjust the implementation in
> > > the allocator (different from what you posted in previous version) so that it
> > > only looks at the flag after it tries to allocate from pcplist and finds out
> > > it's empty. So, no inventing of new page allocator entry points or checks such
> > > as the one you wrote above, but adding the new gfp flag in a way that it doesn't
> > > affect existing fast paths.
> > >
> > OK. Now i see. Please have a look below at the patch, so we fully understand
> > each other. If that is something that is close to your view or not:
> > 
> > <snip>
> > t a/include/linux/gfp.h b/include/linux/gfp.h
> > index c603237e006c..7e613560a502 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -39,8 +39,9 @@ struct vm_area_struct;
> >  #define ___GFP_HARDWALL                0x100000u
> >  #define ___GFP_THISNODE                0x200000u
> >  #define ___GFP_ACCOUNT         0x400000u
> > +#define ___GFP_NO_LOCKS                0x800000u
> >  #ifdef CONFIG_LOCKDEP
> > -#define ___GFP_NOLOCKDEP       0x800000u
> > +#define ___GFP_NOLOCKDEP       0x1000000u
> >  #else
> >  #define ___GFP_NOLOCKDEP       0
> >  #endif
> > @@ -215,16 +216,22 @@ struct vm_area_struct;
> >   * %__GFP_COMP address compound page metadata.
> >   *
> >   * %__GFP_ZERO returns a zeroed page on success.
> > + *
> > + * %__GFP_NO_LOCKS order-0 allocation without sleepable-locks.
> > + * It obtains a page from the per-cpu-list and considered as
> > + * lock-less. No other actions are performed, thus it returns
> > + * NULL if per-cpu-list is empty.
> >   */
> >  #define __GFP_NOWARN   ((__force gfp_t)___GFP_NOWARN)
> >  #define __GFP_COMP     ((__force gfp_t)___GFP_COMP)
> >  #define __GFP_ZERO     ((__force gfp_t)___GFP_ZERO)
> > +#define __GFP_NO_LOCKS ((__force gfp_t)___GFP_NO_LOCKS)
> > 
> 
> I'm not a fan of the GFP flag approach simply because we've had cases
> before where GFP flags were used in inappropriate contexts like
> __GFP_MEMALLOC which led to a surprising amount of bugs, particularly
> from out-of-tree drivers but also in-tree drivers. Of course, there
> are limited GFP flags available too but at least the comment should
> be as robust as possible. Maybe something like
> 
>  * %__GFP_NO_LOCKS attempts order-0 allocation without sleepable-locks. It
>  * attempts to obtain a page without acquiring any spinlocks. This
>  * should only be used in a context where the holder holds a
>  * raw_spin_lock that cannot be released for the allocation request.
>  * This may be necessary in PREEMPT_RT kernels where a
>  * raw_spin_lock is held which does not sleep tries to acquire a
>  * spin_lock that can sleep with PREEMPT_RT. This should not be
>  * confused with GFP_ATOMIC contexts. Like atomic allocation
>  * requests, there is no guarantee a page will be returned and
>  * the caller must be able to deal with allocation failures.
>  * The risk of allocation failure is higher than using GFP_ATOMIC.
> 
> It's verbose but it would be hard to misinterpret. I think we're
> going to go through a period of time before people get familiar
> with PREEMPT_RT-related hazards as various comments that were
> true are going to be misleading for a while.
> 
Yep, it should be properly documented for sure. Including new GFP_NOWAIT
limitations, same as GFP_ATOMIC once you mentioned.

Thanks!

--
Vlad Rezki

Uladzislau Rezki Oct. 5, 2020, 3:08 p.m. UTC | #61

On Fri, Oct 02, 2020 at 11:05:07AM +0200, Michal Hocko wrote:
> On Fri 02-10-20 09:50:14, Mel Gorman wrote:
> > On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote:
> > > On Thu 01-10-20 21:26:26, Uladzislau Rezki wrote:
> > > > > 
> > > > > No, I meant going back to idea of new gfp flag, but adjust the implementation in
> > > > > the allocator (different from what you posted in previous version) so that it
> > > > > only looks at the flag after it tries to allocate from pcplist and finds out
> > > > > it's empty. So, no inventing of new page allocator entry points or checks such
> > > > > as the one you wrote above, but adding the new gfp flag in a way that it doesn't
> > > > > affect existing fast paths.
> > > > >
> > > > OK. Now i see. Please have a look below at the patch, so we fully understand
> > > > each other. If that is something that is close to your view or not:
> > > > 
> > > > <snip>
> > > > t a/include/linux/gfp.h b/include/linux/gfp.h
> > > > index c603237e006c..7e613560a502 100644
> > > > --- a/include/linux/gfp.h
> > > > +++ b/include/linux/gfp.h
> > > > @@ -39,8 +39,9 @@ struct vm_area_struct;
> > > >  #define ___GFP_HARDWALL                0x100000u
> > > >  #define ___GFP_THISNODE                0x200000u
> > > >  #define ___GFP_ACCOUNT         0x400000u
> > > > +#define ___GFP_NO_LOCKS                0x800000u
> > > 
> > > Even if a new gfp flag gains a sufficient traction and support I am
> > > _strongly_ opposed against consuming another flag for that. Bit space is
> > > limited. 
> > 
> > That is definitely true. I'm not happy with the GFP flag at all, the
> > comment is at best a damage limiting move. It still would be better for
> > a memory pool to be reserved and sized for critical allocations.
> 
> Completely agreed. The only existing usecase is so special cased that a
> dedicated pool is not only easier to maintain but it should be also much
> better tuned for the specific workload. Something not really feasible
> with the allocator.
> 
> > > Besides that we certainly do not want to allow craziness like
> > > __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we?
> > 
> > That would deserve to be taken to a dumpster and set on fire. The flag
> > combination could be checked in the allocator but the allocator path fast
> > paths are bad enough already.
> 
> If a new allocation/gfp mode is absolutely necessary then I believe that
> the most reasoanble way forward would be
> #define GFP_NO_LOCK	((__force gfp_t)0)
> 
Agree. Even though i see that some code should be adjusted for it. There are
a few users of the __get_free_page(0); So, need to double check it:

<snip>
[    0.650351] BUG: kernel NULL pointer dereference, address: 0000000000000010
[    0.651083] #PF: supervisor read access in kernel mode
[    0.651639] #PF: error_code(0x0000) - not-present page
[    0.652200] PGD 0 P4D 0
[    0.652523] Oops: 0000 [#1] SMP NOPTI
[    0.652668] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc7-next-20200930+ #140
[    0.652668] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[    0.652668] RIP: 0010:__find_event_file+0x21/0x80
<snip>

Apart of that. There is a post_alloc_hook(), that gets called from the prep_new_page().
If "debug page alloc enabled", it maps a page for debug purposes invoking kernel_map_pages().
__kernel_map_pages() is ARCH specific. For example, powerpc variant uses sleep-able locks
what can be easily converted to raw variant. 

--
Vlad Rezki

Michal Hocko Oct. 5, 2020, 3:41 p.m. UTC | #62

On Mon 05-10-20 17:08:01, Uladzislau Rezki wrote:
> On Fri, Oct 02, 2020 at 11:05:07AM +0200, Michal Hocko wrote:
> > On Fri 02-10-20 09:50:14, Mel Gorman wrote:
> > > On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote:
> > > > On Thu 01-10-20 21:26:26, Uladzislau Rezki wrote:
> > > > > > 
> > > > > > No, I meant going back to idea of new gfp flag, but adjust the implementation in
> > > > > > the allocator (different from what you posted in previous version) so that it
> > > > > > only looks at the flag after it tries to allocate from pcplist and finds out
> > > > > > it's empty. So, no inventing of new page allocator entry points or checks such
> > > > > > as the one you wrote above, but adding the new gfp flag in a way that it doesn't
> > > > > > affect existing fast paths.
> > > > > >
> > > > > OK. Now i see. Please have a look below at the patch, so we fully understand
> > > > > each other. If that is something that is close to your view or not:
> > > > > 
> > > > > <snip>
> > > > > t a/include/linux/gfp.h b/include/linux/gfp.h
> > > > > index c603237e006c..7e613560a502 100644
> > > > > --- a/include/linux/gfp.h
> > > > > +++ b/include/linux/gfp.h
> > > > > @@ -39,8 +39,9 @@ struct vm_area_struct;
> > > > >  #define ___GFP_HARDWALL                0x100000u
> > > > >  #define ___GFP_THISNODE                0x200000u
> > > > >  #define ___GFP_ACCOUNT         0x400000u
> > > > > +#define ___GFP_NO_LOCKS                0x800000u
> > > > 
> > > > Even if a new gfp flag gains a sufficient traction and support I am
> > > > _strongly_ opposed against consuming another flag for that. Bit space is
> > > > limited. 
> > > 
> > > That is definitely true. I'm not happy with the GFP flag at all, the
> > > comment is at best a damage limiting move. It still would be better for
> > > a memory pool to be reserved and sized for critical allocations.
> > 
> > Completely agreed. The only existing usecase is so special cased that a
> > dedicated pool is not only easier to maintain but it should be also much
> > better tuned for the specific workload. Something not really feasible
> > with the allocator.
> > 
> > > > Besides that we certainly do not want to allow craziness like
> > > > __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we?
> > > 
> > > That would deserve to be taken to a dumpster and set on fire. The flag
> > > combination could be checked in the allocator but the allocator path fast
> > > paths are bad enough already.
> > 
> > If a new allocation/gfp mode is absolutely necessary then I believe that
> > the most reasoanble way forward would be
> > #define GFP_NO_LOCK	((__force gfp_t)0)
> > 
> Agree. Even though i see that some code should be adjusted for it. There are
> a few users of the __get_free_page(0); So, need to double check it:

Yes, I believe I have pointed that out in the previous discussion.

> <snip>
> [    0.650351] BUG: kernel NULL pointer dereference, address: 0000000000000010
> [    0.651083] #PF: supervisor read access in kernel mode
> [    0.651639] #PF: error_code(0x0000) - not-present page
> [    0.652200] PGD 0 P4D 0
> [    0.652523] Oops: 0000 [#1] SMP NOPTI
> [    0.652668] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc7-next-20200930+ #140
> [    0.652668] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> [    0.652668] RIP: 0010:__find_event_file+0x21/0x80
> <snip>
> 
> Apart of that. There is a post_alloc_hook(), that gets called from the prep_new_page().
> If "debug page alloc enabled", it maps a page for debug purposes invoking kernel_map_pages().
> __kernel_map_pages() is ARCH specific. For example, powerpc variant uses sleep-able locks
> what can be easily converted to raw variant. 

Yes, there are likely more surprises like that. I am not sure about
kasan, page owner (which depens on the stack unwinder) and others which
hook into this path.

Mel Gorman Oct. 6, 2020, 10:03 a.m. UTC | #63

On Fri, Oct 02, 2020 at 07:41:20AM -0700, Paul E. McKenney wrote:
> On Fri, Oct 02, 2020 at 11:19:52AM +0100, Mel Gorman wrote:
> > On Fri, Oct 02, 2020 at 11:58:58AM +0200, Peter Zijlstra wrote:
> > > > It's enabled by default by enough distros that adding too many checks
> > > > is potentially painful. Granted it would be missed by most benchmarking
> > > > which tend to control allocations from userspace but a lot of performance
> > > > problems I see are the "death by a thousand cuts" variety.
> > > 
> > > Oh quite agreed, aka death by accounting. But if people are enabling
> > > DEBUG options in production kernels, there's something wrong, no?
> > > 
> > 
> > You'd think but historically I believe DEBUG_VM was enabled for some
> > distributions because it made certain classes of problems easier to debug
> > early. There is also a recent trend for enabling various DEBUG options for
> > "hardening" even when they protect very specific corner cases or are for
> > intended for kernel development. I've pushed back where I have an opinion
> > that matters but it's generally corrosive.
> > 
> > > Should we now go add CONFIG_REALLY_DEBUG_STAY_AWAY_ALREADY options?
> > 
> > It's heading in that direction :(
> 
> Given that you guys have just reiterated yet again that you are very
> unhappy with either a GFP_ flag or a special function like the one that
> Peter Zijlstra put together, it would be very helpful if you were to at
> least voice some level of support for Thomas Gleixner's patchset, which,
> if accepted, will allow me to solve at least 50% of the problem.
> 

I read through the series and didn't find anything problematic that
had not been covered already. Minimally, avoiding surprises about what
preemptible() means in different contexts is nice. While I have not
run it through a test grid to check, I'd be very surprised if this was
problematic from a performance perspective on a preempt-disabled kernels.
Last I checked, the difference between PREEMPT_NONE and PREEMPT_VOLUNTARY
was less than 2% *at worst* and I don't think that was due to the preempt
accounting.

Paul E. McKenney Oct. 6, 2020, 3:41 p.m. UTC | #64

On Tue, Oct 06, 2020 at 11:03:34AM +0100, Mel Gorman wrote:
> On Fri, Oct 02, 2020 at 07:41:20AM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 02, 2020 at 11:19:52AM +0100, Mel Gorman wrote:
> > > On Fri, Oct 02, 2020 at 11:58:58AM +0200, Peter Zijlstra wrote:
> > > > > It's enabled by default by enough distros that adding too many checks
> > > > > is potentially painful. Granted it would be missed by most benchmarking
> > > > > which tend to control allocations from userspace but a lot of performance
> > > > > problems I see are the "death by a thousand cuts" variety.
> > > > 
> > > > Oh quite agreed, aka death by accounting. But if people are enabling
> > > > DEBUG options in production kernels, there's something wrong, no?
> > > > 
> > > 
> > > You'd think but historically I believe DEBUG_VM was enabled for some
> > > distributions because it made certain classes of problems easier to debug
> > > early. There is also a recent trend for enabling various DEBUG options for
> > > "hardening" even when they protect very specific corner cases or are for
> > > intended for kernel development. I've pushed back where I have an opinion
> > > that matters but it's generally corrosive.
> > > 
> > > > Should we now go add CONFIG_REALLY_DEBUG_STAY_AWAY_ALREADY options?
> > > 
> > > It's heading in that direction :(
> > 
> > Given that you guys have just reiterated yet again that you are very
> > unhappy with either a GFP_ flag or a special function like the one that
> > Peter Zijlstra put together, it would be very helpful if you were to at
> > least voice some level of support for Thomas Gleixner's patchset, which,
> > if accepted, will allow me to solve at least 50% of the problem.
> 
> I read through the series and didn't find anything problematic that
> had not been covered already. Minimally, avoiding surprises about what
> preemptible() means in different contexts is nice. While I have not
> run it through a test grid to check, I'd be very surprised if this was
> problematic from a performance perspective on a preempt-disabled kernels.
> Last I checked, the difference between PREEMPT_NONE and PREEMPT_VOLUNTARY
> was less than 2% *at worst* and I don't think that was due to the preempt
> accounting.

Thank you, Mel!

							Thanx, Paul

Uladzislau Rezki Oct. 6, 2020, 10:25 p.m. UTC | #65

On Mon, Oct 05, 2020 at 05:41:00PM +0200, Michal Hocko wrote:
> On Mon 05-10-20 17:08:01, Uladzislau Rezki wrote:
> > On Fri, Oct 02, 2020 at 11:05:07AM +0200, Michal Hocko wrote:
> > > On Fri 02-10-20 09:50:14, Mel Gorman wrote:
> > > > On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote:
> > > > > On Thu 01-10-20 21:26:26, Uladzislau Rezki wrote:
> > > > > > > 
> > > > > > > No, I meant going back to idea of new gfp flag, but adjust the implementation in
> > > > > > > the allocator (different from what you posted in previous version) so that it
> > > > > > > only looks at the flag after it tries to allocate from pcplist and finds out
> > > > > > > it's empty. So, no inventing of new page allocator entry points or checks such
> > > > > > > as the one you wrote above, but adding the new gfp flag in a way that it doesn't
> > > > > > > affect existing fast paths.
> > > > > > >
> > > > > > OK. Now i see. Please have a look below at the patch, so we fully understand
> > > > > > each other. If that is something that is close to your view or not:
> > > > > > 
> > > > > > <snip>
> > > > > > t a/include/linux/gfp.h b/include/linux/gfp.h
> > > > > > index c603237e006c..7e613560a502 100644
> > > > > > --- a/include/linux/gfp.h
> > > > > > +++ b/include/linux/gfp.h
> > > > > > @@ -39,8 +39,9 @@ struct vm_area_struct;
> > > > > >  #define ___GFP_HARDWALL                0x100000u
> > > > > >  #define ___GFP_THISNODE                0x200000u
> > > > > >  #define ___GFP_ACCOUNT         0x400000u
> > > > > > +#define ___GFP_NO_LOCKS                0x800000u
> > > > > 
> > > > > Even if a new gfp flag gains a sufficient traction and support I am
> > > > > _strongly_ opposed against consuming another flag for that. Bit space is
> > > > > limited. 
> > > > 
> > > > That is definitely true. I'm not happy with the GFP flag at all, the
> > > > comment is at best a damage limiting move. It still would be better for
> > > > a memory pool to be reserved and sized for critical allocations.
> > > 
> > > Completely agreed. The only existing usecase is so special cased that a
> > > dedicated pool is not only easier to maintain but it should be also much
> > > better tuned for the specific workload. Something not really feasible
> > > with the allocator.
> > > 
> > > > > Besides that we certainly do not want to allow craziness like
> > > > > __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we?
> > > > 
> > > > That would deserve to be taken to a dumpster and set on fire. The flag
> > > > combination could be checked in the allocator but the allocator path fast
> > > > paths are bad enough already.
> > > 
> > > If a new allocation/gfp mode is absolutely necessary then I believe that
> > > the most reasoanble way forward would be
> > > #define GFP_NO_LOCK	((__force gfp_t)0)
> > > 
> > Agree. Even though i see that some code should be adjusted for it. There are
> > a few users of the __get_free_page(0); So, need to double check it:
> 
> Yes, I believe I have pointed that out in the previous discussion.
> 
OK. I spent more time on it. A passed gfp_mask can be adjusted on the entry,
that adjustment depends on the gfp_allowed_mask. It can be changed in run-time.

For example during early boot it excludes: __GFP_RECLAIM|__GFP_IO|__GFP_FS flags,
what is GFP_KERNEL. So, GFP_KERNEL is adjusted on entry and becomes 0 during early
boot phase.

How to distinguish it:

<snip>
+       /*
+        * gfp_mask can become zero because gfp_allowed_mask changes in run-time.
+        */
+       if (!gfp_mask)
+               alloc_flags |= ALLOC_NO_LOCKS;
+
        gfp_mask &= gfp_allowed_mask;
        alloc_mask = gfp_mask;
        if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
<snip>

> > 
> > Apart of that. There is a post_alloc_hook(), that gets called from the prep_new_page().
> > If "debug page alloc enabled", it maps a page for debug purposes invoking kernel_map_pages().
> > __kernel_map_pages() is ARCH specific. For example, powerpc variant uses sleep-able locks
> > what can be easily converted to raw variant. 
> 
> Yes, there are likely more surprises like that. I am not sure about
> kasan, page owner (which depens on the stack unwinder) and others which
> hook into this path.
>
I have checked kasan_alloc_pages(), kernel_poison_pages() both are OK,
at least i did not find any locking there. As for set_page_owner(), it
requires more attention, since it uses arch specific unwind logic. Though,
i spent some time on it and so far have not noticed anything.

--
Vlad Rezki

Michal Hocko Oct. 7, 2020, 10:02 a.m. UTC | #66

On Wed 07-10-20 00:25:29, Uladzislau Rezki wrote:
> On Mon, Oct 05, 2020 at 05:41:00PM +0200, Michal Hocko wrote:
> > On Mon 05-10-20 17:08:01, Uladzislau Rezki wrote:
> > > On Fri, Oct 02, 2020 at 11:05:07AM +0200, Michal Hocko wrote:
> > > > On Fri 02-10-20 09:50:14, Mel Gorman wrote:
> > > > > On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote:
> > > > > > On Thu 01-10-20 21:26:26, Uladzislau Rezki wrote:
> > > > > > > > 
> > > > > > > > No, I meant going back to idea of new gfp flag, but adjust the implementation in
> > > > > > > > the allocator (different from what you posted in previous version) so that it
> > > > > > > > only looks at the flag after it tries to allocate from pcplist and finds out
> > > > > > > > it's empty. So, no inventing of new page allocator entry points or checks such
> > > > > > > > as the one you wrote above, but adding the new gfp flag in a way that it doesn't
> > > > > > > > affect existing fast paths.
> > > > > > > >
> > > > > > > OK. Now i see. Please have a look below at the patch, so we fully understand
> > > > > > > each other. If that is something that is close to your view or not:
> > > > > > > 
> > > > > > > <snip>
> > > > > > > t a/include/linux/gfp.h b/include/linux/gfp.h
> > > > > > > index c603237e006c..7e613560a502 100644
> > > > > > > --- a/include/linux/gfp.h
> > > > > > > +++ b/include/linux/gfp.h
> > > > > > > @@ -39,8 +39,9 @@ struct vm_area_struct;
> > > > > > >  #define ___GFP_HARDWALL                0x100000u
> > > > > > >  #define ___GFP_THISNODE                0x200000u
> > > > > > >  #define ___GFP_ACCOUNT         0x400000u
> > > > > > > +#define ___GFP_NO_LOCKS                0x800000u
> > > > > > 
> > > > > > Even if a new gfp flag gains a sufficient traction and support I am
> > > > > > _strongly_ opposed against consuming another flag for that. Bit space is
> > > > > > limited. 
> > > > > 
> > > > > That is definitely true. I'm not happy with the GFP flag at all, the
> > > > > comment is at best a damage limiting move. It still would be better for
> > > > > a memory pool to be reserved and sized for critical allocations.
> > > > 
> > > > Completely agreed. The only existing usecase is so special cased that a
> > > > dedicated pool is not only easier to maintain but it should be also much
> > > > better tuned for the specific workload. Something not really feasible
> > > > with the allocator.
> > > > 
> > > > > > Besides that we certainly do not want to allow craziness like
> > > > > > __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we?
> > > > > 
> > > > > That would deserve to be taken to a dumpster and set on fire. The flag
> > > > > combination could be checked in the allocator but the allocator path fast
> > > > > paths are bad enough already.
> > > > 
> > > > If a new allocation/gfp mode is absolutely necessary then I believe that
> > > > the most reasoanble way forward would be
> > > > #define GFP_NO_LOCK	((__force gfp_t)0)
> > > > 
> > > Agree. Even though i see that some code should be adjusted for it. There are
> > > a few users of the __get_free_page(0); So, need to double check it:
> > 
> > Yes, I believe I have pointed that out in the previous discussion.
> > 
> OK. I spent more time on it. A passed gfp_mask can be adjusted on the entry,
> that adjustment depends on the gfp_allowed_mask. It can be changed in run-time.
> 
> For example during early boot it excludes: __GFP_RECLAIM|__GFP_IO|__GFP_FS flags,
> what is GFP_KERNEL. So, GFP_KERNEL is adjusted on entry and becomes 0 during early
> boot phase.

Honestly I am not sure how much is GFP_BOOT_MASK still needed. The
remaining user of gfp_allowed_mask mask should be only hibernation and I
believe this should be removed in long term. Not as trivial because
scope API cannot be used for that as it needs a global flag but this is
a gross hack that should be implemented differently. It is waiting on my
todo list but never got around to that.

> How to distinguish it:
> 
> <snip>
> +       /*
> +        * gfp_mask can become zero because gfp_allowed_mask changes in run-time.
> +        */
> +       if (!gfp_mask)
> +               alloc_flags |= ALLOC_NO_LOCKS;
> +
>         gfp_mask &= gfp_allowed_mask;
>         alloc_mask = gfp_mask;
>         if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
> <snip>
> 
> > > 
> > > Apart of that. There is a post_alloc_hook(), that gets called from the prep_new_page().
> > > If "debug page alloc enabled", it maps a page for debug purposes invoking kernel_map_pages().
> > > __kernel_map_pages() is ARCH specific. For example, powerpc variant uses sleep-able locks
> > > what can be easily converted to raw variant. 
> > 
> > Yes, there are likely more surprises like that. I am not sure about
> > kasan, page owner (which depens on the stack unwinder) and others which
> > hook into this path.
> >
> I have checked kasan_alloc_pages(), kernel_poison_pages() both are OK,
> at least i did not find any locking there. As for set_page_owner(), it
> requires more attention, since it uses arch specific unwind logic. Though,
> i spent some time on it and so far have not noticed anything.

stack depod depends on a lock IIRC. Anyway, this is just showing how
this is going to grow in complexity and make future additions harder.
A niche usecase is inducing an additional complexity for future
maintenance.

Uladzislau Rezki Oct. 7, 2020, 11:02 a.m. UTC | #67

On Wed, Oct 07, 2020 at 12:02:34PM +0200, Michal Hocko wrote:
> On Wed 07-10-20 00:25:29, Uladzislau Rezki wrote:
> > On Mon, Oct 05, 2020 at 05:41:00PM +0200, Michal Hocko wrote:
> > > On Mon 05-10-20 17:08:01, Uladzislau Rezki wrote:
> > > > On Fri, Oct 02, 2020 at 11:05:07AM +0200, Michal Hocko wrote:
> > > > > On Fri 02-10-20 09:50:14, Mel Gorman wrote:
> > > > > > On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote:
> > > > > > > On Thu 01-10-20 21:26:26, Uladzislau Rezki wrote:
> > > > > > > > > 
> > > > > > > > > No, I meant going back to idea of new gfp flag, but adjust the implementation in
> > > > > > > > > the allocator (different from what you posted in previous version) so that it
> > > > > > > > > only looks at the flag after it tries to allocate from pcplist and finds out
> > > > > > > > > it's empty. So, no inventing of new page allocator entry points or checks such
> > > > > > > > > as the one you wrote above, but adding the new gfp flag in a way that it doesn't
> > > > > > > > > affect existing fast paths.
> > > > > > > > >
> > > > > > > > OK. Now i see. Please have a look below at the patch, so we fully understand
> > > > > > > > each other. If that is something that is close to your view or not:
> > > > > > > > 
> > > > > > > > <snip>
> > > > > > > > t a/include/linux/gfp.h b/include/linux/gfp.h
> > > > > > > > index c603237e006c..7e613560a502 100644
> > > > > > > > --- a/include/linux/gfp.h
> > > > > > > > +++ b/include/linux/gfp.h
> > > > > > > > @@ -39,8 +39,9 @@ struct vm_area_struct;
> > > > > > > >  #define ___GFP_HARDWALL                0x100000u
> > > > > > > >  #define ___GFP_THISNODE                0x200000u
> > > > > > > >  #define ___GFP_ACCOUNT         0x400000u
> > > > > > > > +#define ___GFP_NO_LOCKS                0x800000u
> > > > > > > 
> > > > > > > Even if a new gfp flag gains a sufficient traction and support I am
> > > > > > > _strongly_ opposed against consuming another flag for that. Bit space is
> > > > > > > limited. 
> > > > > > 
> > > > > > That is definitely true. I'm not happy with the GFP flag at all, the
> > > > > > comment is at best a damage limiting move. It still would be better for
> > > > > > a memory pool to be reserved and sized for critical allocations.
> > > > > 
> > > > > Completely agreed. The only existing usecase is so special cased that a
> > > > > dedicated pool is not only easier to maintain but it should be also much
> > > > > better tuned for the specific workload. Something not really feasible
> > > > > with the allocator.
> > > > > 
> > > > > > > Besides that we certainly do not want to allow craziness like
> > > > > > > __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we?
> > > > > > 
> > > > > > That would deserve to be taken to a dumpster and set on fire. The flag
> > > > > > combination could be checked in the allocator but the allocator path fast
> > > > > > paths are bad enough already.
> > > > > 
> > > > > If a new allocation/gfp mode is absolutely necessary then I believe that
> > > > > the most reasoanble way forward would be
> > > > > #define GFP_NO_LOCK	((__force gfp_t)0)
> > > > > 
> > > > Agree. Even though i see that some code should be adjusted for it. There are
> > > > a few users of the __get_free_page(0); So, need to double check it:
> > > 
> > > Yes, I believe I have pointed that out in the previous discussion.
> > > 
> > OK. I spent more time on it. A passed gfp_mask can be adjusted on the entry,
> > that adjustment depends on the gfp_allowed_mask. It can be changed in run-time.
> > 
> > For example during early boot it excludes: __GFP_RECLAIM|__GFP_IO|__GFP_FS flags,
> > what is GFP_KERNEL. So, GFP_KERNEL is adjusted on entry and becomes 0 during early
> > boot phase.
> 
> Honestly I am not sure how much is GFP_BOOT_MASK still needed. The
> remaining user of gfp_allowed_mask mask should be only hibernation and I
> believe this should be removed in long term. Not as trivial because
> scope API cannot be used for that as it needs a global flag but this is
> a gross hack that should be implemented differently. It is waiting on my
> todo list but never got around to that.
> 
> > How to distinguish it:
> > 
> > <snip>
> > +       /*
> > +        * gfp_mask can become zero because gfp_allowed_mask changes in run-time.
> > +        */
> > +       if (!gfp_mask)
> > +               alloc_flags |= ALLOC_NO_LOCKS;
> > +
> >         gfp_mask &= gfp_allowed_mask;
> >         alloc_mask = gfp_mask;
> >         if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
> > <snip>
> > 
> > > > 
> > > > Apart of that. There is a post_alloc_hook(), that gets called from the prep_new_page().
> > > > If "debug page alloc enabled", it maps a page for debug purposes invoking kernel_map_pages().
> > > > __kernel_map_pages() is ARCH specific. For example, powerpc variant uses sleep-able locks
> > > > what can be easily converted to raw variant. 
> > > 
> > > Yes, there are likely more surprises like that. I am not sure about
> > > kasan, page owner (which depens on the stack unwinder) and others which
> > > hook into this path.
> > >
> > I have checked kasan_alloc_pages(), kernel_poison_pages() both are OK,
> > at least i did not find any locking there. As for set_page_owner(), it
> > requires more attention, since it uses arch specific unwind logic. Though,
> > i spent some time on it and so far have not noticed anything.
> 
> stack depod depends on a lock IIRC. Anyway, this is just showing how
> this is going to grow in complexity and make future additions harder.
> A niche usecase is inducing an additional complexity for future
> maintenance.
> 
I agree regarding maintenance costs. That is what is hard to avoid.

--
Vlad Rezki

diff mbox series

Patch

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 67a0774e080b..c065031b4403 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -565,6 +565,7 @@  extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
+extern unsigned long __rcu_alloc_page_lockless(void);
 
 void *alloc_pages_exact(size_t size, gfp_t gfp_mask);
 void free_pages_exact(void *virt, size_t size);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e2bab486fea..360c68ea3491 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4908,6 +4908,88 @@  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
 
+static struct page *__rmqueue_lockless(struct zone *zone, struct per_cpu_pages *pcp)
+{
+	struct list_head *list;
+	struct page *page;
+	int migratetype;
+
+	for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++) {
+		list = &pcp->lists[migratetype];
+		page = list_first_entry_or_null(list, struct page, lru);
+		if (page && !check_new_pcp(page)) {
+			list_del(&page->lru);
+			pcp->count--;
+			return page;
+		}
+	}
+
+	return NULL;
+}
+
+/*
+ * Semantic of this function illustrates that a page
+ * is obtained in lock-free maneer. Instead of going
+ * deeper in the page allocator, it uses the pcplists
+ * only. Such way provides lock-less allocation method.
+ *
+ * Some notes are below:
+ * - intended to use for RCU code only;
+ * - it does not use any atomic reserves.
+ */
+unsigned long __rcu_alloc_page_lockless(void)
+{
+	struct zonelist *zonelist =
+		node_zonelist(numa_node_id(), GFP_KERNEL);
+	struct zoneref *z, *preferred_zoneref;
+	struct per_cpu_pages *pcp;
+	struct page *page;
+	unsigned long flags;
+	struct zone *zone;
+
+	/*
+	 * If DEBUG_PAGEALLOC is enabled, the post_alloc_hook()
+	 * in the prep_new_page() function also does some extra
+	 * page mappings via __kernel_map_pages(), what is arch
+	 * specific. It is for debug purpose only.
+	 *
+	 * For example, powerpc variant of __kernel_map_pages()
+	 * uses sleep-able locks. Thus a lock-less access can
+	 * not be provided if debug option is activated. In that
+	 * case it is fine to revert and return NULL, since RCU
+	 * code has a fallback mechanism. It is OK if it is used
+	 * for debug kernel.
+	 */
+	if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC))
+		return 0;
+
+	/*
+	 * Preferred zone is a first one in the zonelist.
+	 */
+	preferred_zoneref = NULL;
+
+	for_each_zone_zonelist(zone, z, zonelist, ZONE_NORMAL) {
+		if (!preferred_zoneref)
+			preferred_zoneref = z;
+
+		local_irq_save(flags);
+		pcp = &this_cpu_ptr(zone->pageset)->pcp;
+		page = __rmqueue_lockless(zone, pcp);
+		if (page) {
+			__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
+			zone_statistics(preferred_zoneref->zone, zone);
+		}
+		local_irq_restore(flags);
+
+		if (page) {
+			prep_new_page(page, 0, 0, 0);
+			return (unsigned long) page_address(page);
+		}
+	}
+
+	return 0;
+}
+
 /*
  * Common helper functions. Never use with __GFP_HIGHMEM because the returned
  * address cannot represent highmem pages. Use alloc_pages and then kmap if