
[0/6] mm: Remote LRU per-cpu pagevec cache/per-cpu page list drain support

Message ID 20210921161323.607817-1-nsaenzju@redhat.com (mailing list archive)

Nicolas Saenz Julienne Sept. 21, 2021, 4:13 p.m. UTC
This series introduces an alternative locking scheme around mm/swap.c's per-cpu
LRU pagevec caches and mm/page_alloc.c's per-cpu page lists which will allow
remote CPUs to drain them. Currently, only the local CPU is permitted to
change its per-cpu lists, and it's expected to do so, on-demand, whenever a
process demands it (by means of queueing a drain task on the local CPU). Most
systems will handle this promptly, but it'll cause problems for NOHZ_FULL CPUs
that can't take any sort of interruption without breaking their functional
guarantees (latency, bandwidth, etc.). Having a way for these processes to
remotely drain the lists themselves will make co-existing with isolated CPUs
possible, at the cost of more constraining locks.

Fortunately for non-NOHZ_FULL users, the alternative locking scheme and remote
drain code are conditional on a static key which is disabled by default. This
guarantees minimal functional or performance regressions. The feature will only
be enabled if NOHZ_FULL's initialization process was successful.

This work is based on a previous series by Thomas Gleixner, Anna-Maria
Gleixner, and Sebastian Andrzej Siewior[1].

[1] https://patchwork.kernel.org/project/linux-mm/patch/20190424111208.24459-3-bigeasy@linutronix.de/

Nicolas Saenz Julienne (6):
  mm/swap: Introduce lru_cpu_needs_drain()
  mm/swap: Introduce alternative per-cpu LRU cache locking
  mm/swap: Allow remote LRU cache draining
  mm/page_alloc: Introduce alternative per-cpu list locking
  mm/page_alloc: Allow remote per-cpu page list draining
  sched/isolation: Enable 'remote_pcpu_cache_access' on NOHZ_FULL
    systems

 kernel/sched/isolation.c |   9 +-
 mm/internal.h            |   2 +
 mm/page_alloc.c          | 111 ++++++++++++++++-----
 mm/swap.c                | 202 ++++++++++++++++++++++++++++++---------
 4 files changed, 253 insertions(+), 71 deletions(-)

Comments

Andrew Morton Sept. 21, 2021, 5:51 p.m. UTC | #1
On Tue, 21 Sep 2021 18:13:18 +0200 Nicolas Saenz Julienne <nsaenzju@redhat.com> wrote:

> This series introduces an alternative locking scheme around mm/swap.c's per-cpu
> LRU pagevec caches and mm/page_alloc.c's per-cpu page lists which will allow
> remote CPUs to drain them. Currently, only the local CPU is permitted to
> change its per-cpu lists, and it's expected to do so, on-demand, whenever a
> process demands it (by means of queueing a drain task on the local CPU). Most
> systems will handle this promptly, but it'll cause problems for NOHZ_FULL CPUs
> that can't take any sort of interruption without breaking their functional
> guarantees (latency, bandwidth, etc.). Having a way for these processes to
> remotely drain the lists themselves will make co-existing with isolated CPUs
> possible, at the cost of more constraining locks.
> 
> Fortunately for non-NOHZ_FULL users, the alternative locking scheme and remote
> drain code are conditional on a static key which is disabled by default. This
> guarantees minimal functional or performance regressions. The feature will only
> be enabled if NOHZ_FULL's initialization process was successful.

That all looks pretty straightforward.  Obvious problems are:

- Very little test coverage for the spinlocked code paths.  Virtually
  all test setups will be using local_lock() and the code path you care
  about will go untested.

  I hope that whoever does test the spinlock version will be running
  full debug kernels, including lockdep.  Because adding a spinlock
  where the rest of the code expects local_lock might introduce
  problems.

  A fix for all of this would be to enable the spin_lock code paths
  to be tested more widely.  Perhaps you could add a boot-time kernel
  parameter (or, not as good, a Kconfig thing) which forces the use of
  the spinlock code even on non-NOHZ_FULL systems.

  Or perhaps this debug/testing mode _should_ be enabled by Kconfig,
  so kernel fuzzers sometimes turn it on.

  Please have a think about all of this?

- Maintainability.  Few other MM developers will think about this new
  spinlocked mode much, and they are unlikely to runtime test the
  spinlock mode.  Adding the force-spinlocks-mode-on knob will help
  with this.
Vlastimil Babka Sept. 21, 2021, 5:59 p.m. UTC | #2
On 9/21/21 6:13 PM, Nicolas Saenz Julienne wrote:
> This series introduces an alternative locking scheme around mm/swap.c's per-cpu
> LRU pagevec caches and mm/page_alloc.c's per-cpu page lists which will allow
> remote CPUs to drain them. Currently, only the local CPU is permitted to
> change its per-cpu lists, and it's expected to do so, on-demand, whenever a
> process demands it (by means of queueing a drain task on the local CPU). Most
> systems will handle this promptly, but it'll cause problems for NOHZ_FULL CPUs
> that can't take any sort of interruption without breaking their functional
> guarantees (latency, bandwidth, etc.). Having a way for these processes to
> remotely drain the lists themselves will make co-existing with isolated CPUs
> possible, at the cost of more constraining locks.
> 
> Fortunately for non-NOHZ_FULL users, the alternative locking scheme and remote
> drain code are conditional on a static key which is disabled by default. This
> guarantees minimal functional or performance regressions. The feature will only
> be enabled if NOHZ_FULL's initialization process was successful.
> 
> This work is based on a previous series by Thomas Gleixner, Anna-Maria
> Gleixner, and Sebastian Andrzej Siewior[1].

These days the pcplist protection is done by local_lock, which solved
the RT concerns. Probably a stupid/infeasible idea, but maybe what you
want to achieve could be more generally solved at the local_lock level?
That on NOHZ_FULL CPUs, local_locks could have this mode where they
could synchronize with remote cpus?

> [1] https://patchwork.kernel.org/project/linux-mm/patch/20190424111208.24459-3-bigeasy@linutronix.de/
> 
> Nicolas Saenz Julienne (6):
>   mm/swap: Introduce lru_cpu_needs_drain()
>   mm/swap: Introduce alternative per-cpu LRU cache locking
>   mm/swap: Allow remote LRU cache draining
>   mm/page_alloc: Introduce alternative per-cpu list locking
>   mm/page_alloc: Allow remote per-cpu page list draining
>   sched/isolation: Enable 'remote_pcpu_cache_access' on NOHZ_FULL
>     systems
> 
>  kernel/sched/isolation.c |   9 +-
>  mm/internal.h            |   2 +
>  mm/page_alloc.c          | 111 ++++++++++++++++-----
>  mm/swap.c                | 202 ++++++++++++++++++++++++++++++---------
>  4 files changed, 253 insertions(+), 71 deletions(-)
>
Peter Zijlstra Sept. 22, 2021, 11:28 a.m. UTC | #3
On Tue, Sep 21, 2021 at 07:59:51PM +0200, Vlastimil Babka wrote:

> These days the pcplist protection is done by local_lock, which solved
> the RT concerns. Probably a stupid/infeasible idea, but maybe what you
> want to achieve could be more generally solved at the local_lock level?
> That on NOHZ_FULL CPUs, local_locks could have this mode where they
> could synchronize with remote cpus?

local_lock and spinlock have different rules, local_lock for example can
never cause an irq inversion, unlike a spinlock.
Thomas Gleixner Sept. 22, 2021, 10:09 p.m. UTC | #4
On Wed, Sep 22 2021 at 13:28, Peter Zijlstra wrote:
> On Tue, Sep 21, 2021 at 07:59:51PM +0200, Vlastimil Babka wrote:
>
>> These days the pcplist protection is done by local_lock, which solved
>> the RT concerns. Probably a stupid/infeasible idea, but maybe what you
>> want to achieve could be more generally solved at the local_lock level?
>> That on NOHZ_FULL CPUs, local_locks could have this mode where they
>> could synchronize with remote cpus?
>
> local_lock and spinlock have different rules, local_lock for example can
> never cause an irq inversion, unlike a spinlock.

TBH, I really regret an attempt I made at some point in the RT
development to abuse local locks for this kind of cross CPU protections
because that led to yet another semantically ill defined construct.

local locks are as the name says strictly local. IOW, they do not exist
for remote access. Just look at the !RT mapping:

  local_lock() 		-> preempt_disable()
  local_lock_irq()	-> local_irq_disable()
  ...

The only thing local_lock is addressing is the opaque nature of
preempt_disable(), local*_disable/save() protected critical sections,
which have per CPU BKL, i.e. undefined protection scope, semantics.

If you really want cross CPU protection then using a regular spinlock in
a per CPU data structure is the right way to go.

That makes it a bit awkward vs. the code at hand which already introduced
local locks to replace the opaque preempt/local_irq disabled critical
sections with scoped local locks which in turn allows RT to substitute
them with strict CPU local spinlocks.

But for clarity sake we really have to look at two different cases now:

   1) Strict per CPU local protection

      That's what the code does today via local lock and this includes
      RT where the locality is preserved via the local lock semantics

      I.e. for the !RT case the spinlock overhead is avoided 

   2) Global scoped per CPU protection

      That's what Nicolas is trying to achieve to be able to update
      data structures cross CPU for the sake of avoiding smp function
      calls or queuing work on remote CPUs for the NOHZ_FULL isolation
      use case.

That said, I completely understand Andrew's concerns versus these
distinctions and their test coverage.

In consequence the real interesting question here is whether any of
these use cases are sensitive to the extra overhead of #2.

IOW, does it really matter whether the !RT and !NOHZ_FULL case take an
uncontended per CPU spinlock unconditionally instead of just disabling
preemption / interrupts?

The same question arises vs. the remote work queuing. Does it really
matter? I think that a proper investigation is due to figure out whether
delegated work is really superior vs. doing the same work locked from a
remote CPU occasionally.

If you really think about it then the task which is initiating the work
is already in a slow path. So what's the tradeoff to make this path a
little bit slower due to remote access vs. scheduling work and thereby
disturbing the remote CPU which has a performance impact as well and in
the NOHZ_FULL case even a correctness impact? That's especially
questionable for situations where the initiator has to wait for
completion of the remote work.

The changelogs and the cover letter have a distinct void vs. that which
means this is just another example of 'scratch my itch' changes w/o
proper justification.

I'm well aware that the initial stuff which myself, and in consequence
Sebastian and Anna-Maria, were doing back then falls into the same
category. IOW, guilty as charged, but that does not make the current
approach more palatable than the one from years ago.

Thanks,

        tglx
Vlastimil Babka Sept. 23, 2021, 7:12 a.m. UTC | #5
On 9/23/21 00:09, Thomas Gleixner wrote:
> On Wed, Sep 22 2021 at 13:28, Peter Zijlstra wrote:
>> On Tue, Sep 21, 2021 at 07:59:51PM +0200, Vlastimil Babka wrote:
>>
>>> These days the pcplist protection is done by local_lock, which solved
>>> the RT concerns. Probably a stupid/infeasible idea, but maybe what you
>>> want to achieve could be more generally solved at the local_lock level?
>>> That on NOHZ_FULL CPUs, local_locks could have this mode where they
>>> could synchronize with remote cpus?
>>
>> local_lock and spinlock have different rules, local_lock for example can
>> never cause an irq inversion, unlike a spinlock.
> 
> TBH, I really regret an attempt I made at some point in the RT
> development to abuse local locks for this kind of cross CPU protections
> because that led to yet another semantically ill defined construct.
> 
> local locks are as the name says strictly local. IOW, they do not exist
> for remote access. Just look at the !RT mapping:
> 
>   local_lock() 		-> preempt_disable()
>   local_lock_irq()	-> local_irq_disable()
>   ...

Yes, to be clean, this would have to be a new primitive, not just an abused
local lock. It would just look similar to the RT version (a percpu array of
spinlocks), so for this patchset it would allow not to have two such locks
side by side (local + spin) while only one is being used. For maximum
flexibility the initialization would take a CONFIG_ (or something
compile-time bool) that when false would make the !RT version an empty
struct and "locking" would rely on preempt/irq disable (just as with !RT
local_lock). If compile-time true it would take a static key to decide on
boot whether the !RT version only does the preempt/irq disable or actually
takes the lock.

But as you say below, it's too much complexity for questionable benefit.

But maybe this can all be avoided anyway, as I recalled what we do for
vmstat already (IIUC). See quiet_vmstat() - when cpu enters the nohz mode,
it flushes per-cpu vmstat diffs and then there's no reason to further
disturb the cpu to do that while it's on NOHZ mode. We could do the same for
lru pagevecs and pcplists?

> The only thing local_lock is addressing is the opaque nature of
> preempt_disable(), local*_disable/save() protected critical sections,
> which have per CPU BKL, i.e. undefined protection scope, semantics.
> 
> If you really want cross CPU protection then using a regular spinlock in
> a per CPU data structure is the right way to go.
> 
> That makes it a bit awkward vs. the code at hand which already introduced
> local locks to replace the opaque preempt/local_irq disabled critical
> sections with scoped local locks which in turn allows RT to substitute
> them with strict CPU local spinlocks.
> 
> But for clarity sake we really have to look at two different cases now:
> 
>    1) Strict per CPU local protection
> 
>       That's what the code does today via local lock and this includes
>       RT where the locality is preserved via the local lock semantics
> 
>       I.e. for the !RT case the spinlock overhead is avoided 
> 
>    2) Global scoped per CPU protection
> 
>       That's what Nicolas is trying to achieve to be able to update
>       data structures cross CPU for the sake of avoiding smp function
>       calls or queuing work on remote CPUs for the NOHZ_FULL isolation
>       use case.
> 
> That said, I completely understand Andrew's concerns versus these
> distinctions and their test coverage.
> 
> In consequence the real interesting question here is whether any of
> these use cases are sensitive to the extra overhead of #2.
> 
> IOW, does it really matter whether the !RT and !NOHZ_FULL case take an
> uncontended per CPU spinlock unconditionally instead of just disabling
> preemption / interrupts?
> 
> The same question arises vs. the remote work queuing. Does it really
> matter? I think that a proper investigation is due to figure out whether
> delegated work is really superior vs. doing the same work locked from a
> remote CPU occasionally.
> 
> If you really think about it then the task which is initiating the work
> is already in a slow path. So what's the tradeoff to make this path a
> little bit slower due to remote access vs. scheduling work and thereby
> disturbing the remote CPU which has a performance impact as well and in
> the NOHZ_FULL case even a correctness impact? That's especially
> questionable for situations where the initiator has to wait for
> completion of the remote work.
> 
> The changelogs and the cover letter have a distinct void vs. that which
> means this is just another example of 'scratch my itch' changes w/o
> proper justification.
> 
> I'm well aware that the initial stuff which myself, and in consequence
> Sebastian and Anna-Maria, were doing back then falls into the same
> category. IOW, guilty as charged, but that does not make the current
> approach more palatable than the one from years ago.
> 
> Thanks,
> 
>         tglx
>
Thomas Gleixner Sept. 23, 2021, 10:36 a.m. UTC | #6
Vlastimil,

On Thu, Sep 23 2021 at 09:12, Vlastimil Babka wrote:
> On 9/23/21 00:09, Thomas Gleixner wrote:
>>   local_lock() 		-> preempt_disable()
>>   local_lock_irq()	-> local_irq_disable()
>>   ...
>
> Yes, to be clean, this would have to be a new primitive, not just an abused
> local lock. It would just look similar to the RT version (a percpu array of
> spinlocks), so for this patchset it would allow not to have two such locks
> side by side (local + spin) while only one is being used. For maximum
> flexibility the initialization would take a CONFIG_ (or something
> compile-time bool) that when false would make the !RT version an empty
> struct and "locking" would rely on preempt/irq disable (just as with !RT
> local_lock). If compile-time true it would take a static key to decide on
> boot whether the !RT version only does the preempt/irq disable or actually
> takes the lock.
>
> But as you say below, it's too much complexity for questionable benefit.
>
> But maybe this can all be avoided anyway, as I recalled what we do for
> vmstat already (IIUC). See quiet_vmstat() - when cpu enters the nohz mode,
> it flushes per-cpu vmstat diffs and then there's no reason to further
> disturb the cpu to do that while it's on NOHZ mode. We could do the same for
> lru pagevecs and pcplists?

I'm not sure about this. I like the idea of being able to offload things
to housekeeping CPUs not only in the full isolation case.

A good example is RCU, which allows offloading all RCU processing to some
other CPU(s), which is useful even w/o full isolation.

The synchronous quiescing on entering NOHZ full mode is a cute
workaround, but for one it makes entering NOHZ full more expensive, and it
does not necessarily provide good isolation guarantees under all
circumstances, while full remote processing definitely does.

I think it's at least worthwhile to investigate.

Thanks,

        tglx
Nicolas Saenz Julienne Sept. 27, 2021, 9:30 a.m. UTC | #7
Hi Thomas,
thanks for the in-depth review.

On Thu, 2021-09-23 at 00:09 +0200, Thomas Gleixner wrote:
> On Wed, Sep 22 2021 at 13:28, Peter Zijlstra wrote:
> > On Tue, Sep 21, 2021 at 07:59:51PM +0200, Vlastimil Babka wrote:
> > 
> > > These days the pcplist protection is done by local_lock, which solved
> > > the RT concerns. Probably a stupid/infeasible idea, but maybe what you
> > > want to achieve could be more generally solved at the local_lock level?
> > > That on NOHZ_FULL CPUs, local_locks could have this mode where they
> > > could synchronize with remote cpus?
> > 
> > local_lock and spinlock have different rules, local_lock for example can
> > never cause an irq inversion, unlike a spinlock.
> 
> TBH, I really regret an attempt I made at some point in the RT
> development to abuse local locks for this kind of cross CPU protections
> because that led to yet another semantically ill defined construct.
> 
> local locks are as the name says strictly local. IOW, they do not exist
> for remote access. Just look at the !RT mapping:
> 
>   local_lock() 		-> preempt_disable()
>   local_lock_irq()	-> local_irq_disable()
>   ...
> 
> The only thing local_lock is addressing is the opaque nature of
> preempt_disable(), local*_disable/save() protected critical sections,
> which have per CPU BKL, i.e. undefined protection scope, semantics.
> 
> If you really want cross CPU protection then using a regular spinlock in
> a per CPU data structure is the right way to go.
> 
> That makes it a bit awkward vs. the code at hand which already introduced
> local locks to replace the opaque preempt/local_irq disabled critical
> sections with scoped local locks which in turn allows RT to substitute
> them with strict CPU local spinlocks.
> 
> But for clarity sake we really have to look at two different cases now:
> 
>    1) Strict per CPU local protection
> 
>       That's what the code does today via local lock and this includes
>       RT where the locality is preserved via the local lock semantics
> 
>       I.e. for the !RT case the spinlock overhead is avoided 
> 
>    2) Global scoped per CPU protection
> 
>       That's what Nicolas is trying to achieve to be able to update
>       data structures cross CPU for the sake of avoiding smp function
>       calls or queuing work on remote CPUs for the NOHZ_FULL isolation
>       use case.
> 
> That said, I completely understand Andrew's concerns versus these
> distinctions and their test coverage.
> 
> In consequence the real interesting question here is whether any of
> these use cases are sensitive to the extra overhead of #2.
> 
> IOW, does it really matter whether the !RT and !NOHZ_FULL case take an
> uncontended per CPU spinlock unconditionally instead of just disabling
> preemption / interrupts?
> 
> The same question arises vs. the remote work queuing. Does it really
> matter? I think that a proper investigation is due to figure out whether
> delegated work is really superior vs. doing the same work locked from a
> remote CPU occasionally.
> 
> If you really think about it then the task which is initiating the work
> is already in a slow path. So what's the tradeoff to make this path a
> little bit slower due to remote access vs. scheduling work and thereby
> disturbing the remote CPU which has a performance impact as well and in
> the NOHZ_FULL case even a correctness impact? That's especially
> questionable for situations where the initiator has to wait for
> completion of the remote work.
> 
> The changelogs and the cover letter have a distinct void vs. that which
> means this is just another example of 'scratch my itch' changes w/o
> proper justification.

Point taken, I'll get to it.