[0/2] mm/swap: Add locking for pagevec

Message ID: 20180914145924.22055-1-bigeasy@linutronix.de

Message

Sebastian Andrzej Siewior Sept. 14, 2018, 2:59 p.m. UTC
The swap code synchronizes its access to the (four) per-CPU pagevec
structs by disabling preemption. This works, and the one struct that
needs to be accessed from interrupt context is protected by disabling
interrupts. This was manually audited and there is no lockdep coverage
for it.
There is one case where the per-CPU data of a remote CPU needs to be
accessed, and this is solved by starting a worker on the remote CPU and
waiting for it to finish.
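
For reference, the current pattern looks roughly like this (a simplified
sketch along the lines of mm/swap.c, not the verbatim code):

static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);

static void __lru_cache_add_sketch(struct page *page)
{
        struct pagevec *pvec = &get_cpu_var(lru_add_pvec);      /* preemption off */

        get_page(page);
        if (!pagevec_add(pvec, page) || PageCompound(page))
                __pagevec_lru_add(pvec);
        put_cpu_var(lru_add_pvec);                              /* preemption on */
}

static void lru_add_drain_all_sketch(void)
{
        int cpu;

        /* A remote CPU's pagevecs cannot be touched directly, so wake
         * each CPU and let it drain its own pagevecs ... */
        for_each_online_cpu(cpu) {
                struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

                INIT_WORK(work, lru_add_drain_per_cpu);
                queue_work_on(cpu, mm_percpu_wq, work);
        }
        /* ... and wait for all of them to finish. */
        for_each_online_cpu(cpu)
                flush_work(&per_cpu(lru_add_drain_work, cpu));
}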

I measured the invocation of lru_add_drain_all(), ensuring that it would
invoke the drain function but that the drain function would do nothing
except the locking (preempt / interrupts on/off) of the individual
pagevecs. On a Xeon E5-2650 (2 sockets, 8 cores each, dual threaded, 32
CPUs in total) I drained CPU4 and measured how long it took, in
microseconds:
               t-771   [001] ....   183.165619: lru_add_drain_all_test: took 92
               t-771   [001] ....   183.165710: lru_add_drain_all_test: took 87
               t-771   [001] ....   183.165781: lru_add_drain_all_test: took 68
               t-771   [001] ....   183.165826: lru_add_drain_all_test: took 43
               t-771   [001] ....   183.165837: lru_add_drain_all_test: took 9
               t-771   [001] ....   183.165847: lru_add_drain_all_test: took 9
               t-771   [001] ....   183.165858: lru_add_drain_all_test: took 9
               t-771   [001] ....   183.165868: lru_add_drain_all_test: took 9
               t-771   [001] ....   183.165878: lru_add_drain_all_test: took 9
               t-771   [001] ....   183.165889: lru_add_drain_all_test: took 9
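
A loop along these lines would produce such output (a hypothetical
reconstruction; the actual test code is not part of this series):

static void lru_add_drain_all_test(void)
{
        int i;

        for (i = 0; i < 10; i++) {
                u64 t0 = ktime_get_ns();

                /* drain function stubbed to do only the locking */
                lru_add_drain_all();
                /* trace_printk() prefixes the caller's name, matching
                 * the "lru_add_drain_all_test: took N" lines above */
                trace_printk("took %llu\n",
                             div_u64(ktime_get_ns() - t0, NSEC_PER_USEC));
        }
}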

Most of the long samples are the wake-up from idle; once the CPU is busy
and cache-hot it goes down to 9us. If all CPUs are busy in user land then
               t-1484  [001] .... 40864.452481: lru_add_drain_all_test: took 12
               t-1484  [001] .... 40864.452492: lru_add_drain_all_test: took 8
               t-1484  [001] .... 40864.452500: lru_add_drain_all_test: took 7
               t-1484  [001] .... 40864.452508: lru_add_drain_all_test: took 7
               t-1484  [001] .... 40864.452516: lru_add_drain_all_test: took 7
               t-1484  [001] .... 40864.452524: lru_add_drain_all_test: took 7
               t-1484  [001] .... 40864.452532: lru_add_drain_all_test: took 7
               t-1484  [001] .... 40864.452540: lru_add_drain_all_test: took 7
               t-1484  [001] .... 40864.452547: lru_add_drain_all_test: took 7
               t-1484  [001] .... 40864.452555: lru_add_drain_all_test: took 7

it goes to 7us once the cache is hot.
Invoking the same test on every CPU, it gets to:
               t-768   [000] ....    61.508781: lru_add_drain_all_test: took 133
               t-768   [000] ....    61.508892: lru_add_drain_all_test: took 105
               t-768   [000] ....    61.509004: lru_add_drain_all_test: took 108
               t-768   [000] ....    61.509112: lru_add_drain_all_test: took 104
               t-768   [000] ....    61.509220: lru_add_drain_all_test: took 104
               t-768   [000] ....    61.509333: lru_add_drain_all_test: took 109
               t-768   [000] ....    61.509414: lru_add_drain_all_test: took 78
               t-768   [000] ....    61.509493: lru_add_drain_all_test: took 76
               t-768   [000] ....    61.509558: lru_add_drain_all_test: took 63
               t-768   [000] ....    61.509623: lru_add_drain_all_test: took 62

on an idle machine and once the CPUs are busy:
               t-849   [020] ....   379.429727: lru_add_drain_all_test: took 57
               t-849   [020] ....   379.429777: lru_add_drain_all_test: took 47
               t-849   [020] ....   379.429823: lru_add_drain_all_test: took 45
               t-849   [020] ....   379.429870: lru_add_drain_all_test: took 45
               t-849   [020] ....   379.429916: lru_add_drain_all_test: took 45
               t-849   [020] ....   379.429962: lru_add_drain_all_test: took 45
               t-849   [020] ....   379.430009: lru_add_drain_all_test: took 45
               t-849   [020] ....   379.430055: lru_add_drain_all_test: took 45
               t-849   [020] ....   379.430101: lru_add_drain_all_test: took 45
               t-849   [020] ....   379.430147: lru_add_drain_all_test: took 45

so we get down to 45us.

If the preemption-based locking gets replaced with a per-CPU
spin_lock(), the operation gains a proper locking scope. The spin_lock()
should not add much overhead because it is not contended. Having the
lock there not only adds lockdep coverage, it also allows accessing the
data from a remote CPU. So the work can be done on the CPU that asked
for it and there is no need to wake a CPU from idle (or pull it out of
user land).
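
A minimal sketch of the idea (struct layout and names are illustrative;
see the patches for the real thing):

struct swap_pagevec {
        spinlock_t      lock;
        struct pagevec  pvec;
};

static DEFINE_PER_CPU(struct swap_pagevec, lru_add_pvec) = {
        .lock = __SPIN_LOCK_UNLOCKED(lru_add_pvec.lock),
};

static void __lru_cache_add_sketch(struct page *page)
{
        struct swap_pagevec *swpvec = raw_cpu_ptr(&lru_add_pvec);

        /* The lock, not CPU-locality, protects the pagevec, so a
         * migration between the deref and the lock is harmless. */
        spin_lock(&swpvec->lock);
        get_page(page);
        if (!pagevec_add(&swpvec->pvec, page) || PageCompound(page))
                __pagevec_lru_add(&swpvec->pvec);
        spin_unlock(&swpvec->lock);
}

static void drain_remote_cpu_sketch(int cpu)
{
        struct swap_pagevec *swpvec = per_cpu_ptr(&lru_add_pvec, cpu);

        /* Taken from the requesting CPU: no need to wake CPU 'cpu'. */
        spin_lock(&swpvec->lock);
        if (pagevec_count(&swpvec->pvec))
                __pagevec_lru_add(&swpvec->pvec);
        spin_unlock(&swpvec->lock);
}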

With this series applied, the test again:
Idle box, all CPUs:
               t-861   [000] ....   861.051780: lru_add_drain_all_test: took 16
               t-861   [000] ....   861.051789: lru_add_drain_all_test: took 7
               t-861   [000] ....   861.051797: lru_add_drain_all_test: took 7
               t-861   [000] ....   861.051805: lru_add_drain_all_test: took 7
               t-861   [000] ....   861.051813: lru_add_drain_all_test: took 7
               t-861   [000] ....   861.051821: lru_add_drain_all_test: took 7
               t-861   [000] ....   861.051829: lru_add_drain_all_test: took 7
               t-861   [000] ....   861.051837: lru_add_drain_all_test: took 7
               t-861   [000] ....   861.051844: lru_add_drain_all_test: took 7
               t-861   [000] ....   861.051852: lru_add_drain_all_test: took 7

which is almost the same as the "busy, one CPU" case. Invoking the test
for only a single remote CPU:
               t-863   [020] ....   906.579885: lru_add_drain_all_test: took 0
               t-863   [020] ....   906.579887: lru_add_drain_all_test: took 0
               t-863   [020] ....   906.579889: lru_add_drain_all_test: took 0
               t-863   [020] ....   906.579889: lru_add_drain_all_test: took 0
               t-863   [020] ....   906.579890: lru_add_drain_all_test: took 0
               t-863   [020] ....   906.579891: lru_add_drain_all_test: took 0
               t-863   [020] ....   906.579892: lru_add_drain_all_test: took 0
               t-863   [020] ....   906.579892: lru_add_drain_all_test: took 0
               t-863   [020] ....   906.579893: lru_add_drain_all_test: took 0
               t-863   [020] ....   906.579894: lru_add_drain_all_test: took 0

and it is less than a microsecond.

Sebastian

Comments

Vlastimil Babka Oct. 12, 2018, 7:21 a.m. UTC | #1
On 9/14/18 4:59 PM, Sebastian Andrzej Siewior wrote:
> The swap code synchronizes its access to the (four) per-CPU pagevec
> structs by disabling preemption. This works, and the one struct that
> needs to be accessed from interrupt context is protected by disabling
> interrupts. This was manually audited and there is no lockdep coverage
> for it.
> 
> [... measurements snipped ...]
> 
> and it is less than a microsecond.

+CC Mel.

Hi,

I think this evaluation is missing the other side of the story, and
that's the cost of using a spinlock (even uncontended) instead of
disabling preemption. The expectation for the LRU pagevecs is that local
operations will be much more common than draining of other CPUs, so
they are optimized for the former.

I guess the LKP report for patch 1/2 already hints at some regression,
if it can be trusted. I don't know the details, but AFAIU it shows that
fio completed 2.4% fewer operations, so it isn't the improvement you
had hoped for.

Vlastimil

Mel Gorman Oct. 15, 2018, 9:50 a.m. UTC | #2
On Fri, Oct 12, 2018 at 09:21:41AM +0200, Vlastimil Babka wrote:
> I think this evaluation is missing the other side of the story, and
> that's the cost of using a spinlock (even uncontended) instead of
> disabling preemption. The expectation for the LRU pagevecs is that local
> operations will be much more common than draining of other CPUs, so
> they are optimized for the former.
> 

Agreed, the drain operation should be extremely rare except under heavy
memory pressure, particularly if mixed with THP allocations. The overall
intent seems to be improving lockdep coverage but I don't think we
should take a hit in the common case just to get that coverage. Bear in
mind that the main point of the pagevec (whether it's true or not) is to
avoid the much heavier LRU lock.
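
(For reference, that batching is the point; a simplified sketch that
assumes all pages sit in the same node, details elided:)

static void __pagevec_lru_add_sketch(struct pagevec *pvec)
{
        int i;
        unsigned long flags;
        struct pglist_data *pgdat = page_pgdat(pvec->pages[0]);

        /* One acquisition of the heavy LRU lock per batch of up to
         * PAGEVEC_SIZE pages instead of one per page. */
        spin_lock_irqsave(&pgdat->lru_lock, flags);
        for (i = 0; i < pagevec_count(pvec); i++) {
                struct page *page = pvec->pages[i];

                add_page_to_lru_list(page,
                                     mem_cgroup_page_lruvec(page, pgdat),
                                     page_lru(page));
        }
        spin_unlock_irqrestore(&pgdat->lru_lock, flags);
        /* drop the references the pagevec held */
        release_pages(pvec->pages, pagevec_count(pvec));
        pagevec_reinit(pvec);
}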
Frederic Weisbecker Oct. 16, 2018, 4:26 p.m. UTC | #3
On Mon, Oct 15, 2018 at 10:50:48AM +0100, Mel Gorman wrote:
> On Fri, Oct 12, 2018 at 09:21:41AM +0200, Vlastimil Babka wrote:
> > On 9/14/18 4:59 PM, Sebastian Andrzej Siewior wrote:
> > I think this evaluation is missing the other side of the story, and
> > that's the cost of using a spinlock (even uncontended) instead of
> > disabling preemption. The expectation for LRU pagevec is that the local
> > operations will be much more common than draining of other CPU's, so
> > it's optimized for the former.
> > 
> 
> Agreed, the drain operation should be extremely rare except under heavy
> memory pressure, particularly if mixed with THP allocations. The overall
> intent seems to be improving lockdep coverage but I don't think we
> should take a hit in the common case just to get that coverage. Bear in
> mind that the main point of the pagevec (whether it's true or not) is to
> avoid the much heavier LRU lock.

So indeed, if the only purpose of this patch were to make lockdep wiser,
a pair of spin_lock_acquire() / spin_unlock_release() would be enough to
teach it and would avoid the overhead.
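
Something along these lines (just a sketch; a lockdep_map with no real
lock lets lockdep track the section without an atomic operation):

static struct lockdep_map pagevec_dep_map =
        STATIC_LOCKDEP_MAP_INIT("pagevec", &pagevec_dep_map);

static void pagevec_lock_sketch(void)
{
        preempt_disable();                      /* the real protection */
        lock_map_acquire(&pagevec_dep_map);     /* lockdep bookkeeping only */
}

static void pagevec_unlock_sketch(void)
{
        lock_map_release(&pagevec_dep_map);
        preempt_enable();
}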

Now another significant incentive behind this change is to improve CPU
isolation. Workloads relying on owning the entire CPU without being
disturbed are interested in this as it allows offloading some noise.
It's no big deal for those who can tolerate rare events, but CPU
isolation is often combined with deterministic latency requirements.

So, I'm not saying this per-CPU spinlock is necessarily the right
answer, I don't know that code well enough to have an opinion, but I
still hope we can find a solution.

Thanks.
Thomas Gleixner Oct. 16, 2018, 5:13 p.m. UTC | #4
On Tue, 16 Oct 2018, Frederic Weisbecker wrote:

> So indeed, if the only purpose of this patch were to make lockdep wiser,
> a pair of spin_lock_acquire() / spin_unlock_release() would be enough to
> teach it and would avoid the overhead.
> 
> Now another significant incentive behind this change is to improve CPU
> isolation. Workloads relying on owning the entire CPU without being
> disturbed are interested in this as it allows offloading some noise.
> It's no big deal for those who can tolerate rare events, but CPU
> isolation is often combined with deterministic latency requirements.
> 
> So, I'm not saying this per-CPU spinlock is necessarily the right
> answer, I don't know that code well enough to have an opinion, but I
> still hope we can find a solution.

One way to solve this, which I have already played with, is to make the
smp-function-call-based variant and the lock-based variant switchable at
boot time with a static key. That way CPU isolation can select it and
take the penalty while normal workloads are not affected.
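
Roughly (a sketch; the drain helpers are made-up names):

static DEFINE_STATIC_KEY_FALSE(pagevec_remote_drain);
/* CPU isolation setup flips the key once at boot:
 *      static_branch_enable(&pagevec_remote_drain);
 */

void lru_add_drain_all_sketch(void)
{
        int cpu;

        for_each_online_cpu(cpu) {
                if (static_branch_unlikely(&pagevec_remote_drain))
                        drain_cpu_pagevecs_locked(cpu); /* take the remote lock */
                else
                        queue_drain_work_on(cpu);       /* wake the CPU, as today */
        }
}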

Thanks,

	tglx
Frederic Weisbecker Oct. 16, 2018, 7:54 p.m. UTC | #5
On Tue, Oct 16, 2018 at 07:13:48PM +0200, Thomas Gleixner wrote:
> One way to solve this, which I have already played with, is to make the
> smp-function-call-based variant and the lock-based variant switchable at
> boot time with a static key. That way CPU isolation can select it and
> take the penalty while normal workloads are not affected.

Sounds good, and we can toggle that with "isolcpus=".

We could also make it modifiable through cpuset.sched_load_balance;
despite its name, it's not just used to govern sched balancing but
isolation in general. Making it togglable at runtime could be a bit
trickier in terms of correctness, yet is probably needed in the long
term.
Thomas Gleixner Oct. 16, 2018, 8:44 p.m. UTC | #6
On Tue, 16 Oct 2018, Frederic Weisbecker wrote:
> Sounds good, and we can toggle that with "isolcpus=".
> 
> We could also make it modifiable through cpuset.sched_load_balance;
> despite its name, it's not just used to govern sched balancing but
> isolation in general. Making it togglable at runtime could be a bit
> trickier in terms of correctness, yet is probably needed in the long
> term.

That wouldn't be so hard to do. I'll have a look in the next few days,
unless someone else beats me to it.

Thanks,

	tglx