[RFC,0/3] mm/page_alloc: Remote per-cpu lists drain support

Message ID 20211008161922.942459-1-nsaenzju@redhat.com

Message

Nicolas Saenz Julienne Oct. 8, 2021, 4:19 p.m. UTC
This series replaces mm/page_alloc's per-cpu lists drain mechanism so that it
can be run remotely. Currently, only the local CPU is permitted to change its
per-cpu lists, and it's expected to do so on demand whenever another process
requires it (by means of queueing a drain task on the local CPU). Most systems
handle this promptly, but it causes problems for NOHZ_FULL CPUs, which can't
take any sort of interruption without breaking their functional guarantees
(latency, bandwidth, etc.). Having a way for these processes to remotely drain
the lists themselves makes co-existing with isolated CPUs possible, and comes
at a minimal performance[1]/memory cost to other users.

The new algorithm atomically switches the pointer to the per-cpu lists and
uses RCU to make sure the old lists are no longer in use before draining them.
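
Roughly, the drain side ends up doing something like this. Only pcp->lp,
pcpu_drain_mutex and the RCU calls are what the series actually relies on;
'struct pcplists', 'pcp->spare' and free_stale_pcplists() are illustrative
names, see patch 3 for the real code:

/* Sketch of remotely draining one CPU's per-cpu lists. */
struct pcplists *old;

mutex_lock(&pcpu_drain_mutex);

/* Point the fast paths at a spare, empty set of lists... */
old = rcu_replace_pointer(pcp->lp, pcp->spare,
			  mutex_is_locked(&pcpu_drain_mutex));

/* ...wait until no CPU can still be looking at the old set... */
synchronize_rcu_expedited();

/* ...and only then free the pages left on it, from any CPU. */
free_stale_pcplists(zone, old);
pcp->spare = old;

mutex_unlock(&pcpu_drain_mutex);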

I'm interested in any sort of feedback, but especially in validating that the
approach is acceptable, and any tests/benchmarks you'd like to see run against
it. For now, I've been testing this successfully on both arm64 and x86_64
systems while forcing high memory pressure (i.e. forcing page_alloc's slow
path).

Patches 1-2 serve as cleanups/preparation to make patch 3 easier to follow.

Here's my previous attempt at fixing this:
https://lkml.org/lkml/2021/9/21/599

[1] Proper performance numbers will be provided if the approach is deemed
    acceptable. That said, mm/page_alloc.c's fast paths only grow by an extra
    pointer indirection and a compiler barrier, which I think is unlikely to be
    measurable.

---

Nicolas Saenz Julienne (3):
  mm/page_alloc: Simplify __rmqueue_pcplist()'s arguments
  mm/page_alloc: Access lists in 'struct per_cpu_pages' indirectly
  mm/page_alloc: Add remote draining support to per-cpu lists

 include/linux/mmzone.h |  24 +++++-
 mm/page_alloc.c        | 173 +++++++++++++++++++++--------------------
 mm/vmstat.c            |   6 +-
 3 files changed, 114 insertions(+), 89 deletions(-)

Comments

Vlastimil Babka Oct. 12, 2021, 3:45 p.m. UTC | #1
On 10/8/21 18:19, Nicolas Saenz Julienne wrote:
> This series replaces mm/page_alloc's per-cpu lists drain mechanism so that it
> can be run remotely. Currently, only the local CPU is permitted to change its
> per-cpu lists, and it's expected to do so on demand whenever another process
> requires it (by means of queueing a drain task on the local CPU). Most systems
> handle this promptly, but it causes problems for NOHZ_FULL CPUs, which can't
> take any sort of interruption without breaking their functional guarantees
> (latency, bandwidth, etc.). Having a way for these processes to remotely drain
> the lists themselves makes co-existing with isolated CPUs possible, and comes
> at a minimal performance[1]/memory cost to other users.
> 
> The new algorithm atomically switches the pointer to the per-cpu lists and
> uses RCU to make sure the old lists are no longer in use before draining them.
> 
> I'm interested in any sort of feedback, but especially in validating that the
> approach is acceptable, and any tests/benchmarks you'd like to see run against

So let's consider the added alloc/free fast paths overhead:
- Patch 1 - __alloc_pages_bulk() used to determine pcp_list once, now it's
determined for each allocated page in __rmqueue_pcplist().
- Patch 2 - adds indirection from pcp->$foo to pcp->lp->$foo in each operation
(sketched below)
- Patch 3
  - extra irqsave/irqrestore in free_pcppages_bulk (amortized)
  - rcu_dereference_check() in free_unref_page_commit() and __rmqueue_pcplist()
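
To illustrate the patch 2/3 items, free_unref_page_commit()'s list handling
changes roughly like this (the "after" form is just my reading of the series):

/* before: */
list_add(&page->lru, &pcp->lists[pindex]);
pcp->count += 1 << order;

/* after patches 2-3, roughly: */
lp = rcu_dereference_check(pcp->lp,
		lockdep_is_held(this_cpu_ptr(&pagesets.lock)));
list_add(&page->lru, &lp->lists[pindex]);
lp->count += 1 << order;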

BTW - I'm not sure if the RCU usage is valid here.

The "read side" (normal operations) is using:
rcu_dereference_check(pcp->lp,
		lockdep_is_held(this_cpu_ptr(&pagesets.lock)));

where the lockdep parameter according to the comments for
rcu_dereference_check() means

"indicate to lockdep that foo->bar may only be dereferenced if either
rcu_read_lock() is held, or that the lock required to replace the bar struct
at foo->bar is held."

but you are not taking rcu_read_lock() and the "write side" (remote
draining) actually doesn't take pagesets.lock, so it's not true that the
"lock required to replace ... is held"? The write side uses
rcu_replace_pointer(...,
			mutex_is_locked(&pcpu_drain_mutex))
which is a different lock.

IOW, synchronize_rcu_expedited() AFAICS has nothing (no rcu_read_lock()) to
synchronize against? It might accidentally work on !RT thanks to disabled
irqs, but I'm not sure about the RT lock semantics of the local_lock...

So back to overhead, if I'm correct above we can assume that there would also
be rcu_read_lock() in the fast paths.

The alternative proposed by tglx was IIRC that there would be a spinlock on
each cpu, which would be mostly uncontended except when draining. Maybe an
uncontended spin lock/unlock would have lower overhead than all of the
above? It would be certainly simpler, so I would probably try that first and
see if it's acceptable?
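
Something like this is what I have in mind (purely illustrative sketch of that
suggestion; irq/preemption details left out):

/* in struct per_cpu_pages: */
spinlock_t lock;	/* protects count/lists, almost always uncontended */

/* local alloc/free fast path: */
pcp = this_cpu_ptr(zone->per_cpu_pageset);
spin_lock(&pcp->lock);
/* ... add/remove pages on pcp->lists ... */
spin_unlock(&pcp->lock);

/* remote drain, from whichever CPU requested it: */
pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
spin_lock(&pcp->lock);
/* ... free everything left on pcp->lists ... */
spin_unlock(&pcp->lock);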

> it. For now, I've been testing this successfully on both arm64 and x86_64
> systems while forcing high memory pressure (i.e. forcing page_alloc's slow
> path).
> 
> Patches 1-2 serve as cleanups/preparation to make patch 3 easier to follow.
> 
> Here's my previous attempt at fixing this:
> https://lkml.org/lkml/2021/9/21/599
> 
> [1] Proper performance numbers will be provided if the approach is deemed
>     acceptable. That said, mm/page_alloc.c's fast paths only grow by an extra
>     pointer indirection and a compiler barrier, which I think is unlikely to be
>     measurable.
> 
> ---
> 
> Nicolas Saenz Julienne (3):
>   mm/page_alloc: Simplify __rmqueue_pcplist()'s arguments
>   mm/page_alloc: Access lists in 'struct per_cpu_pages' indirectly
>   mm/page_alloc: Add remote draining support to per-cpu lists
> 
>  include/linux/mmzone.h |  24 +++++-
>  mm/page_alloc.c        | 173 +++++++++++++++++++++--------------------
>  mm/vmstat.c            |   6 +-
>  3 files changed, 114 insertions(+), 89 deletions(-)
>
Nicolas Saenz Julienne Oct. 13, 2021, 12:50 p.m. UTC | #2
Hi Vlastimil, thanks for spending time on this.
Also, excuse me if I over-explain things.

On Tue, 2021-10-12 at 17:45 +0200, Vlastimil Babka wrote:
> On 10/8/21 18:19, Nicolas Saenz Julienne wrote:
> > This series replaces mm/page_alloc's per-cpu lists drain mechanism so that it
> > can be run remotely. Currently, only the local CPU is permitted to change its
> > per-cpu lists, and it's expected to do so on demand whenever another process
> > requires it (by means of queueing a drain task on the local CPU). Most systems
> > handle this promptly, but it causes problems for NOHZ_FULL CPUs, which can't
> > take any sort of interruption without breaking their functional guarantees
> > (latency, bandwidth, etc.). Having a way for these processes to remotely drain
> > the lists themselves makes co-existing with isolated CPUs possible, and comes
> > at a minimal performance[1]/memory cost to other users.
> > 
> > The new algorithm atomically switches the pointer to the per-cpu lists and
> > uses RCU to make sure the old lists are no longer in use before draining them.
> > 
> > I'm interested in any sort of feedback, but especially in validating that the
> > approach is acceptable, and any tests/benchmarks you'd like to see run against
> 
> So let's consider the added alloc/free fast paths overhead:
> - Patch 1 - __alloc_pages_bulk() used to determine pcp_list once, now it's
> determined for each allocated page in __rmqueue_pcplist().

This one I can avoid, I missed the performance aspect of it. I was aiming at
making the code bearable.

> - Patch 2 - adds indirection from pcp->$foo to pcp->lp->$foo in each operation
> - Patch 3
>   - extra irqsave/irqrestore in free_pcppages_bulk (amortized)
>   - rcu_dereference_check() in free_unref_page_commit() and __rmqueue_pcplist()

Yes.

> BTW - I'm not sure if the RCU usage is valid here.
>
> The "read side" (normal operations) is using:
> rcu_dereference_check(pcp->lp,
> 		lockdep_is_held(this_cpu_ptr(&pagesets.lock)));
> 
> where the lockdep parameter according to the comments for
> rcu_dereference_check() means
> 
> "indicate to lockdep that foo->bar may only be dereferenced if either
> rcu_read_lock() is held, or that the lock required to replace the bar struct
> at foo->bar is held."

You missed the "Could be used to" at the beginning of the sentence :). That
said, I believe this is similar to what I'm doing, only that the situation is
more complex.

> but you are not taking rcu_read_lock() 

I am taking the rcu_read_lock() implicitly; it's explained in 'struct
per_cpu_pages', and in more depth below.

> and the "write side" (remote draining) actually doesn't take pagesets.lock,
> so it's not true that the "lock required to replace ... is held"? The write
> side uses rcu_replace_pointer(..., mutex_is_locked(&pcpu_drain_mutex))
> which is a different lock.

The thing 'pagesets.lock' protects against is concurrent access to pcp->lp's
content, as opposed to its address. pcp->lp is dereferenced atomically, so no
need for locking on that operation.

The drain side never accesses pcp->lp's contents concurrently, it changes
pcp->lp's address and makes sure all CPUs are in sync with the new address
before clearing the stale data.

Just for the record, I think a better representation of what the 'check' in
rcu_dereference_check() means is:

 * Do an rcu_dereference(), but check that the conditions under which the
 * dereference will take place are correct.  Typically the conditions
 * indicate the various locking conditions that should be held at that
 * point. The check should return true if the conditions are satisfied.
 * An implicit check for being in an RCU read-side critical section
 * (rcu_read_lock()) is included.

So for the read side, that is, code reading pcp->lp's address and its contents,
the conditions to be met are: being in an RCU critical section, to make sure
RCU is keeping track of it, and holding 'pagesets.lock', to avoid concurrently
accessing pcp->lp's contents. The latter is achieved either by disabling local
irqs or by disabling migration and taking a per-cpu rt_spinlock. Conveniently,
these are actions that implicitly delimit an RCU critical section (see [1] and
[2]). So the 'pagesets.lock' check fully covers the read side locking/RCU
concerns.
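
In (simplified) code form, a read side critical section then looks roughly
like this, with 'lp' being the pointer patch 2 introduces:

/* Disables local irqs, or takes the per-cpu rt_spinlock on RT: */
local_lock_irqsave(&pagesets.lock, flags);

pcp = this_cpu_ptr(zone->per_cpu_pageset);
lp = rcu_dereference_check(pcp->lp,
		lockdep_is_held(this_cpu_ptr(&pagesets.lock)));

/* ... add/remove pages on lp's lists ... */

local_unlock_irqrestore(&pagesets.lock, flags);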

On the write side, the drain has to make sure the pcp->lp address change is
atomic (this is achieved through rcu_replace_pointer()) and that lp->drain is
emptied before a new drain happens. So checking for pcpu_drain_mutex being
held is good enough.

> IOW, synchronize_rcu_expedited() AFAICS has nothing (no rcu_read_lock()) to
> synchronize against? It might accidentally work on !RT thanks to disabled
> irqs, but I'm not sure about the RT lock semantics of the local_lock...
>
> So back to overhead, if I'm correct above we can assume that there would also
> be rcu_read_lock() in the fast paths.

As I explained above, no need.

> The alternative proposed by tglx was IIRC that there would be a spinlock on
> each cpu, which would be mostly uncontended except when draining. Maybe an
> uncontended spin lock/unlock would have lower overhead than all of the
> above? It would be certainly simpler, so I would probably try that first and
> see if it's acceptable?

You have a point here. I'll provide a performance rundown of both solutions.
This one is a bit more complex, that's for sure.

Thanks!

[1] See rcu_read_lock()'s description: "synchronize_rcu() wait for regions of
    code with preemption disabled, including regions of code with interrupts or
    softirqs disabled."

[2] See kernel/locking/spinlock_rt.c: "The RT [spinlock] substitutions
    explicitly disable migration and take rcu_read_lock() across the lock held
    section."
Vlastimil Babka Oct. 21, 2021, 8:27 a.m. UTC | #3
On 10/13/21 14:50, nsaenzju@redhat.com wrote:
> Hi Vlastimil, thanks for spending time on this.
> Also, excuse me if I over-explain things.

Hi, thanks for spending time on the explanation :) It was very useful.

...

> 
>> and the "write side" (remote draining) actually doesn't take pagesets.lock,
>> so it's not true that the "lock required to replace ... is held"? The write
>> side uses rcu_replace_pointer(..., mutex_is_locked(&pcpu_drain_mutex))
>> which is a different lock.
> 
> The thing 'pagesets.lock' protects against is concurrent access to pcp->lp's
> content, as opposed to its address. pcp->lp is dereferenced atomically, so no
> need for locking on that operation.
> 
> The drain side never accesses pcp->lp's contents concurrently, it changes
> pcp->lp's address and makes sure all CPUs are in sync with the new address
> before clearing the stale data.
> 
> Just for the record, I think a better representation of what the 'check' in
> rcu_dereference_check() means is:
> 
>  * Do an rcu_dereference(), but check that the conditions under which the
>  * dereference will take place are correct.  Typically the conditions
>  * indicate the various locking conditions that should be held at that
>  * point. The check should return true if the conditions are satisfied.
>  * An implicit check for being in an RCU read-side critical section
>  * (rcu_read_lock()) is included.
> 
> So for the read side, that is, code reading pcp->lp's address and its contents,
> the conditions to be met are: being in an RCU critical section, to make sure
> RCU is keeping track of it, and holding 'pagesets.lock', to avoid concurrently
> accessing pcp->lp's contents. The latter is achieved either by disabling local
> irqs or by disabling migration and taking a per-cpu rt_spinlock. Conveniently,
> these are actions that implicitly delimit an RCU critical section (see [1] and
> [2]). So the 'pagesets.lock' check fully covers the read side locking/RCU
> concerns.

Yeah, I wasn't aware of [2] especially. It makes sense that RT locks provide
the same guarantees for RCU as non-RT.

> On the write side, the drain has to make sure the pcp->lp address change is
> atomic (this is achieved through rcu_replace_pointer()) and that lp->drain is
> emptied before a new drain happens. So checking for pcpu_drain_mutex being
> held is good enough.
> 
>> IOW, synchronize_rcu_expedited() AFAICS has nothing (no rcu_read_lock()) to
>> synchronize against? It might accidentally work on !RT thanks to disabled
>> irqs, but I'm not sure about the RT lock semantics of the local_lock...
>>
>> So back to overhead, if I'm correct above we can assume that there would also
>> be rcu_read_lock() in the fast paths.
> 
> As I explained above, no need.
> 
>> The alternative proposed by tglx was IIRC that there would be a spinlock on
>> each cpu, which would be mostly uncontended except when draining. Maybe an
>> uncontended spin lock/unlock would have lower overhead than all of the
>> above? It would be certainly simpler, so I would probably try that first and
>> see if it's acceptable?
> 
> You have a point here. I'll provide a performance rundown of both solutions.
> This one is a bit more complex, that's for sure.

Great, thanks!

> Thanks!
> 
> [1] See rcu_read_lock()'s description: "synchronize_rcu() wait for regions of
>     code with preemption disabled, including regions of code with interrupts or
>     softirqs disabled."
> 
> [2] See kernel/locking/spinlock_rt.c: "The RT [spinlock] substitutions
>     explicitly disable migration and take rcu_read_lock() across the lock held
>     section."
>