[v11,00/16] per memcg lru lock

Message ID 1590663658-184131-1-git-send-email-alex.shi@linux.alibaba.com

Message

Alex Shi May 28, 2020, 11 a.m. UTC
This is a new version, based on linux-next.

Johannes Weiner has suggested:
"So here is a crazy idea that may be worth exploring:

Right now, pgdat->lru_lock protects both PageLRU *and* the lruvec's
linked list.

Can we make PageLRU atomic and use it to stabilize the lru_lock
instead, and then use the lru_lock only serialize list operations?
..."

With the new memcg charge path and this solution, we can isolate LRU
pages and visit them exclusively in compaction, page migration, reclaim,
memcg move_account, huge page split and other scenarios, while keeping
the pages' memcg stable. It then becomes possible to change the per-node
lru locking to per-memcg lru locking. As for the pagevec_lru_move_fn
funcs, it is safe to let pages remain on the lru list; the lru lock can
guard them for list integrity.
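The scheme above — make PageLRU an atomic gate for isolation, and let the lru lock only serialize the list operations — can be sketched with userspace C11 atomics. Everything below, names included, is an illustrative model under that assumption, not the kernel's implementation:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: PageLRU is an atomic flag, and whoever succeeds in
 * test-and-clearing it "owns" the page for isolation; racing callers back
 * off.  Illustrative names, not kernel code. */

#define PG_LRU (1u << 0)

struct page {
    _Atomic unsigned int flags;
};

/* Mirrors the idea of TestClearPageLRU(): atomically clear PG_LRU and
 * report whether it was set.  Only one of several racing callers sees
 * 'true'. */
static bool test_clear_page_lru(struct page *page)
{
    unsigned int old = atomic_fetch_and(&page->flags, ~PG_LRU);
    return (old & PG_LRU) != 0;
}

static bool isolate_lru_page(struct page *page)
{
    /* Isolation precondition: win the PG_LRU flag first; the lruvec's
     * lru lock then only has to serialize the list_del itself. */
    if (!test_clear_page_lru(page))
        return false;   /* someone else isolated it already */
    /* ... take lruvec lru lock, list_del(&page->lru), unlock ... */
    return true;
}
```

In this model the winner of test_clear_page_lru() is the only visitor of the page's list linkage until PG_LRU is set again, which is what allows the lock itself to shrink to per-memcg scope.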

The patchset includes 3 parts:
1, some code cleanup and minimum optimization as a preparation.
2, use TestClearPageLRU as page isolation's precondition
3, replace per node lru_lock with per memcg per node lru_lock

The 3rd part moves the per-node lru_lock into the lruvec, thus providing
a lru_lock for each memcg on each node. So on a large machine, memcgs no
longer have to suffer contention on the per-node pgdat->lru_lock; each
can go fast with its own lru_lock.
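The layout change can be pictured with a toy model: the lock moves from the node (pgdat) into a per-memcg, per-node lruvec, so pages of different memcgs stop contending on one lock. Structures and names below are simplified stand-ins, not the kernel's actual definitions:

```c
/* Toy model of the locking layout this series moves to: instead of one
 * pgdat->lru_lock shared by every memcg on a node, each memcg carries
 * its own lruvec (and lock) per node. */

#define MAX_NODES 2

struct lruvec {
    int lru_lock;                   /* stands in for spinlock_t */
    /* struct list_head lists[NR_LRU_LISTS]; ... */
};

struct mem_cgroup {
    struct lruvec nodeinfo[MAX_NODES];  /* one lruvec, one lock, per node */
};

struct page {
    struct mem_cgroup *memcg;       /* kept stable by the new charge path */
    int nid;                        /* node this page lives on */
};

/* Lock selection: pages in different memcgs, or on different nodes,
 * resolve to different locks, so they no longer contend. */
static struct lruvec *page_lruvec(struct page *page)
{
    return &page->memcg->nodeinfo[page->nid];
}
```

Compaction, reclaim and friends would then take page_lruvec(page)->lru_lock instead of pgdat->lru_lock, which is what shrinks the contention domain measured by the readtwice test below.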

Following Daniel Jordan's suggestion, I have run 208 'dd' tasks in 104
containers on a 2-socket * 26-core * HT box with a modified case:
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice

With this patchset, the readtwice performance increased by about 80%
across the concurrent containers.

Thanks to Hugh Dickins and Konstantin Khlebnikov, who both brought up
this idea 8 years ago, and to the others who gave comments as well:
Daniel Jordan, Mel Gorman, Shakeel Butt, Matthew Wilcox etc.

Thanks for the testing support from Intel 0day and Rong Chen, Fengguang Wu,
and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!


Alex Shi (14):
  mm/vmscan: remove unnecessary lruvec adding
  mm/page_idle: no unlikely double check for idle page counting
  mm/compaction: correct the comments of compact_defer_shift
  mm/compaction: rename compact_deferred as compact_should_defer
  mm/thp: move lru_add_page_tail func to huge_memory.c
  mm/thp: clean up lru_add_page_tail
  mm/thp: narrow lru locking
  mm/memcg: add debug checking in lock_page_memcg
  mm/lru: introduce TestClearPageLRU
  mm/compaction: do page isolation first in compaction
  mm/mlock: reorder isolation sequence during munlock
  mm/lru: replace pgdat lru_lock with lruvec lock
  mm/lru: introduce the relock_page_lruvec function
  mm/pgdat: remove pgdat lru_lock

Hugh Dickins (2):
  mm/vmscan: use relock for move_pages_to_lru
  mm/lru: revise the comments of lru_lock

 Documentation/admin-guide/cgroup-v1/memcg_test.rst |  15 +-
 Documentation/admin-guide/cgroup-v1/memory.rst     |   8 +-
 Documentation/trace/events-kmem.rst                |   2 +-
 Documentation/vm/unevictable-lru.rst               |  22 +--
 include/linux/compaction.h                         |   4 +-
 include/linux/memcontrol.h                         |  92 +++++++++++
 include/linux/mm_types.h                           |   2 +-
 include/linux/mmzone.h                             |   6 +-
 include/linux/page-flags.h                         |   1 +
 include/linux/swap.h                               |   4 +-
 include/trace/events/compaction.h                  |   2 +-
 mm/compaction.c                                    | 104 ++++++++-----
 mm/filemap.c                                       |   4 +-
 mm/huge_memory.c                                   |  51 +++++--
 mm/memcontrol.c                                    |  87 ++++++++++-
 mm/mlock.c                                         |  93 ++++++------
 mm/mmzone.c                                        |   1 +
 mm/page_alloc.c                                    |   1 -
 mm/page_idle.c                                     |   8 -
 mm/rmap.c                                          |   2 +-
 mm/swap.c                                          | 112 ++++----------
 mm/swap_state.c                                    |   6 +-
 mm/vmscan.c                                        | 168 +++++++++++----------
 mm/workingset.c                                    |   4 +-
 24 files changed, 487 insertions(+), 312 deletions(-)

Comments

Hugh Dickins June 8, 2020, 4:15 a.m. UTC | #1
On Thu, 28 May 2020, Alex Shi wrote:

> [ cover letter and diffstat snipped ]

Hi Alex,

I didn't get to try v10 at all, waited until Johannes's preparatory
memcg swap cleanup was in mmotm; but I have spent a while thrashing
this v11, and can happily report that it is much better than v9 etc:
I believe this memcg lru_lock work will soon be ready for v5.9.

I've not yet found any flaw at the swapping end, but fixes are needed
for isolate_migratepages_block() and mem_cgroup_move_account(): I've
got a series of 4 fix patches to send you (I guess two to fold into
existing patches of yours, and two to keep as separate from me).

I haven't yet written the patch descriptions, will return to that
tomorrow.  I expect you will be preparing a v12 rebased on v5.8-rc1
or v5.8-rc2, and will be able to include these fixes in that.

Tomorrow...
Hugh
Alex Shi June 8, 2020, 6:13 a.m. UTC | #2
在 2020/6/8 下午12:15, Hugh Dickins 写道:
>>  24 files changed, 487 insertions(+), 312 deletions(-)
> Hi Alex,
> 
> I didn't get to try v10 at all, waited until Johannes's preparatory
> memcg swap cleanup was in mmotm; but I have spent a while thrashing
> this v11, and can happily report that it is much better than v9 etc:
> I believe this memcg lru_lock work will soon be ready for v5.9.
> 
> I've not yet found any flaw at the swapping end, but fixes are needed
> for isolate_migratepages_block() and mem_cgroup_move_account(): I've
> got a series of 4 fix patches to send you (I guess two to fold into
> existing patches of yours, and two to keep as separate from me).
> 
> I haven't yet written the patch descriptions, will return to that
> tomorrow.  I expect you will be preparing a v12 rebased on v5.8-rc1
> or v5.8-rc2, and will be able to include these fixes in that.

I am very glad to have your help on this feature!

And I am looking forward to your fixes tomorrow. :)

Thanks a lot!
Alex
Hugh Dickins June 10, 2020, 3:22 a.m. UTC | #3
On Mon, 8 Jun 2020, Alex Shi wrote:
> 在 2020/6/8 下午12:15, Hugh Dickins 写道:
> > [...]
> 
> I am very glad to get your help on this feature! 
> 
> and looking forward for your fixes tomorrow. :)
> 
> Thanks a lot!
> Alex

Sorry, Alex, the news is not so good today.

You'll have noticed I sent nothing yesterday. That's because I got
stuck on my second patch: could not quite convince myself that it
was safe.

I keep hinting at these patches, and I can't complete their writeups
until I'm convinced; but to give you a better idea of what they do:

1. Fixes isolate_fail and isolate_abort in isolate_migratepages_block().
2. Fixes unsafe use of trylock_page() in __isolate_lru_page_prepare().
3. Reverts 07/16 inversion of lock ordering in split_huge_page_to_list().
4. Adds lruvec lock protection in mem_cgroup_move_account().

In the second, I was using rcu_read_lock() instead of trylock_page()
(like in my own patchset), but could not quite be sure of the case when
PageSwapCache gets set at the wrong moment. Gave up for the night, and
in the morning abandoned that, instead just shifting the call to
__isolate_lru_page_prepare() after the get_page_unless_zero(),
where that trylock_page() becomes safe (no danger of stomping on page
flags while page is being freed or newly allocated to another owner).
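The ordering Hugh describes — pin the page with a reference before letting trylock_page() touch page->flags — can be modeled in plain C11 atomics. This is a hypothetical userspace sketch, with get_page_unless_zero() rendered as a compare-exchange loop; it is not the kernel code:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model of the race: once a page's refcount has dropped to zero it may be
 * freed and reallocated, so poking at its flags (as trylock_page() does)
 * is only safe while we hold a reference. */

struct page {
    _Atomic int refcount;
};

/* Take a reference only if the page is not already on its way back to the
 * allocator (refcount 0): never resurrect a dying page. */
static bool get_page_unless_zero(struct page *page)
{
    int ref = atomic_load(&page->refcount);
    do {
        if (ref == 0)
            return false;   /* being freed: hands off its flags */
    } while (!atomic_compare_exchange_weak(&page->refcount, &ref, ref + 1));
    return true;
}

static bool isolate_candidate(struct page *page)
{
    /* Fixed ordering: pin the page first ... */
    if (!get_page_unless_zero(page))
        return false;
    /* ... and only then would __isolate_lru_page_prepare(), whose
     * trylock_page() touches page->flags, be safe to call. */
    return true;
}
```

With the call shifted after the successful refcount increment, there is no window in which the flags of a freed or freshly reallocated page can be stomped on.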

I thought that a very safe change, but best to do some test runs with
it in before finalizing. And was then unpleasantly surprised to hit a
VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup) from
lock_page_lruvec_irqsave < relock_page_lruvec < pagevec_lru_move_fn <
pagevec_move_tail < lru_add_drain_cpu after 6 hours on one machine.
Then similar but < rotate_reclaimable_page after 8 hours on another.

Only seen once before: that's what drove me to add patch 4 (with 3 to
revert the locking before it): somehow, when adding the lruvec locking
there, I just took it for granted that your patchset would have the
appropriate locking (or TestClearPageLRU magic) at the other end.

But apparently not. And I'm beginning to think that TestClearPageLRU
was just to distract the audience from the lack of proper locking.

I have certainly not concluded that yet, but I'm having to think about
an area of the code which I'd imagined you had under control (and I'm
puzzled why my testing has found it so very hard to hit). If we're
lucky, I'll find that pagevec_move_tail is a special case, and
nothing much else needs changing; but I doubt that will be so.

There's one other unexplained and unfixed bug I've seen several times
while exercising mem_cgroup_move_account(): refcount_warn_saturate()
from where __mem_cgroup_clear_mc() calls mem_cgroup_id_get_many().
I'll be glad if that goes away when the lruvec locking is fixed,
but don't understand the connection. And it's quite possible that
this refcounting bug has nothing to do with your changes: I have
not succeeded in reproducing it on 5.7 nor on 5.7-rc7-mm1,
but I didn't really try long enough to be sure.

(I should also warn, that I'm surprised by the amount of change
11/16 makes to mm/mlock.c: I've not been exercising mlock at all.)

Taking a break for the evening,
Hugh
Alex Shi June 11, 2020, 6:06 a.m. UTC | #4
在 2020/6/10 上午11:22, Hugh Dickins 写道:
> On Mon, 8 Jun 2020, Alex Shi wrote:
>> 在 2020/6/8 下午12:15, Hugh Dickins 写道:
>> [...]
> 
> Sorry, Alex, the news is not so good today.
> 
> You'll have noticed I sent nothing yesterday. That's because I got
> stuck on my second patch: could not quite convince myself that it
> was safe.

Hi Hugh,

Thanks a lot for your help and effort! I really appreciate it.

> 
> I keep hinting at these patches, and I can't complete their writeups
> until I'm convinced; but to give you a better idea of what they do:
> 
> 1. Fixes isolate_fail and isolate_abort in isolate_migratepages_block().

I guess I know this after mm-compaction-avoid-vm_bug_onpageslab-in-page_mapcount.patch
was removed.

> 2. Fixes unsafe use of trylock_page() in __isolate_lru_page_prepare().
> 3. Reverts 07/16 inversion of lock ordering in split_huge_page_to_list().
> 4. Adds lruvec lock protection in mem_cgroup_move_account().

Sorry, I can't quite follow you on the above issues. Anyway, I will send out a new patchset
with the first issue fixed, and then let's discuss based on it.

> 
> In the second, I was using rcu_read_lock() instead of trylock_page()
> (like in my own patchset), but could not quite be sure of the case when
> PageSwapCache gets set at the wrong moment. Gave up for the night, and
> in the morning abandoned that, instead just shifting the call to
> __isolate_lru_page_prepare() after the get_page_unless_zero(),
> where that trylock_page() becomes safe (no danger of stomping on page
> flags while page is being freed or newly allocated to another owner).

Sorry, I don't see the problem with trylock_page here. Could you
describe it as a race?

> 
> I thought that a very safe change, but best to do some test runs with
> it in before finalizing. And was then unpleasantly surprised to hit a
> VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup) from
> lock_page_lruvec_irqsave < relock_page_lruvec < pagevec_lru_move_fn <
> pagevec_move_tail < lru_add_drain_cpu after 6 hours on one machine.
> Then similar but < rotate_reclaimable_page after 8 hours on another.
> 
> Only seen once before: that's what drove me to add patch 4 (with 3 to
> revert the locking before it): somehow, when adding the lruvec locking
> there, I just took it for granted that your patchset would have the
> appropriate locking (or TestClearPageLRU magic) at the other end.
> 
> But apparently not. And I'm beginning to think that TestClearPageLRU
> was just to distract the audience from the lack of proper locking.
> 
> I have certainly not concluded that yet, but I'm having to think about
> an area of the code which I'd imagined you had under control (and I'm
> puzzled why my testing has found it so very hard to hit). If we're
> lucky, I'll find that pagevec_move_tail is a special case, and
> nothing much else needs changing; but I doubt that will be so.
> 
> There's one other unexplained and unfixed bug I've seen several times
> while exercising mem_cgroup_move_account(): refcount_warn_saturate()
> from where __mem_cgroup_clear_mc() calls mem_cgroup_id_get_many().
> I'll be glad if that goes away when the lruvec locking is fixed,
> but don't understand the connection. And it's quite possible that
> this refcounting bug has nothing to do with your changes: I have
> not succeeded in reproducing it on 5.7 nor on 5.7-rc7-mm1,
> but I didn't really try long enough to be sure.
> 
> (I should also warn, that I'm surprised by the amount of change
> 11/16 makes to mm/mlock.c: I've not been exercising mlock at all.)

Yes, that is a bit complex. I have tried the mlock cases in the selftests
together with your swap&build case. They all ran fine over 300 runs.

> 
> Taking a break for the evening,
> Hugh
>
Hugh Dickins June 11, 2020, 10:09 p.m. UTC | #5
On Thu, 11 Jun 2020, Alex Shi wrote:
> 在 2020/6/10 上午11:22, Hugh Dickins 写道:
> > On Mon, 8 Jun 2020, Alex Shi wrote:
> >> [...]
> > 
> > Sorry, Alex, the news is not so good today.
> > 
> > You'll have noticed I sent nothing yesterday. That's because I got
> > stuck on my second patch: could not quite convince myself that it
> > was safe.
> 
> Hi Hugh,
> 
> Thanks a lot for your help and effort! I really appreciate it.
> 
> > 
> > I keep hinting at these patches, and I can't complete their writeups
> > until I'm convinced; but to give you a better idea of what they do:
> > 
> > 1. Fixes isolate_fail and isolate_abort in isolate_migratepages_block().
> 
> I guess I know this after mm-compaction-avoid-vm_bug_onpageslab-in-page_mapcount.patch
> was removed.

No, I already assumed you had backed that out: these are fixes beyond that.

> 
> > 2. Fixes unsafe use of trylock_page() in __isolate_lru_page_prepare().
> > 3. Reverts 07/16 inversion of lock ordering in split_huge_page_to_list().
> > 4. Adds lruvec lock protection in mem_cgroup_move_account().
> 
> Sorry, I can't quite follow you on the above issues.

Indeed, more explanation needed: coming.

> Anyway, I will send out a new patchset
> with the first issue fixed, and then let's discuss based on it.

Sigh. I wish you had waited for me to send you fixes, or waited for an
identifiable tag like 5.8-rc1.  Andrew has been very hard at work with
mm patches to Linus, but it looks like there are still "data_race" mods
to come before -rc1, which may stop your v12 from applying cleanly.

> 
> > 
> > In the second, I was using rcu_read_lock() instead of trylock_page()
> > (like in my own patchset), but could not quite be sure of the case when
> > PageSwapCache gets set at the wrong moment. Gave up for the night, and
> > in the morning abandoned that, instead just shifting the call to
> > __isolate_lru_page_prepare() after the get_page_unless_zero(),
> > where that trylock_page() becomes safe (no danger of stomping on page
> > flags while page is being freed or newly allocated to another owner).
> 
> Sorry, I don't see the problem with trylock_page here. Could you
> describe it as a race?

Races, yes. Look, I'll send you now patches 1 and 2: at least with those
in it should be safe for you and others to test compaction (if 5.8-rc1
turns out well: I think so much has gone in that it might have unrelated
problems, and often the -rc2 is much more stable).

But no point in sending 3 and 4 at this point, since ...

> 
> > 
> > I thought that a very safe change, but best to do some test runs with
> > it in before finalizing. And was then unpleasantly surprised to hit a
> > VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup) from
> > lock_page_lruvec_irqsave < relock_page_lruvec < pagevec_lru_move_fn <
> > pagevec_move_tail < lru_add_drain_cpu after 6 hours on one machine.
> > Then similar but < rotate_reclaimable_page after 8 hours on another.
> > 
> > Only seen once before: that's what drove me to add patch 4 (with 3 to
> > revert the locking before it): somehow, when adding the lruvec locking
> > there, I just took it for granted that your patchset would have the
> > appropriate locking (or TestClearPageLRU magic) at the other end.
> > 
> > But apparently not. And I'm beginning to think that TestClearPageLRU
> > was just to distract the audience from the lack of proper locking.
> > 
> > I have certainly not concluded that yet, but I'm having to think about
> > an area of the code which I'd imagined you had under control (and I'm
> > puzzled why my testing has found it so very hard to hit). If we're
> > lucky, I'll find that pagevec_move_tail is a special case, and
> > nothing much else needs changing; but I doubt that will be so.

... shows that your locking primitives are not yet good enough
to handle the case when tasks are moved between memcgs with
move_charge_at_immigrate set.  "bin/cg m" in the tests I sent,
but today I'm changing its "seconds=60" to "seconds=1" in hope
of speeding up the reproduction.

Ah, good, two machines crashed in 1.5 hours: but I don't need to
examine the crashes, now that it's obvious there's no protection -
please, think about rotate_reclaimable_page() (there will be more
cases, but in practice that seems easiest to hit, so focus on that)
and how it is not protected from mem_cgroup_move_account().

I'm thinking too. Maybe judicious use of lock_page_memcg() can fix it
(8 years ago it was unsuitable, but a lot has changed for the better
since then); otherwise it's back to what I've been doing all along,
taking the likely lruvec lock, and checking under that lock whether
we have the right lock (as your lruvec_memcg_debug() does), retrying
if not. Which may be more efficient than involving lock_page_memcg().
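The "take the likely lock, then recheck" scheme sketched here can be illustrated as follows. The locking itself is elided (comments mark where spin_lock/spin_unlock would sit), and all names are stand-ins rather than the series' actual code:

```c
#include <stddef.h>

/* Sketch of the retry scheme: take the lruvec lock the page appears to
 * belong to, then re-check page->memcg under that lock, because
 * mem_cgroup_move_account() may have migrated the page in between. */

struct lruvec { int lock; /* stands in for spinlock_t */ };

struct mem_cgroup { struct lruvec lruvec; };

struct page { struct mem_cgroup *memcg; };

static struct lruvec *page_lruvec(struct page *page)
{
    return &page->memcg->lruvec;
}

/* Keep retrying until the lock we hold still matches the page's memcg. */
static struct lruvec *lock_page_lruvec(struct page *page)
{
    struct lruvec *lruvec;

    for (;;) {
        lruvec = page_lruvec(page);
        /* spin_lock(&lruvec->lock); */
        if (lruvec == page_lruvec(page))
            return lruvec;          /* still the right lock: done */
        /* spin_unlock(&lruvec->lock);
         * lost a race with mem_cgroup_move_account(): retry */
    }
}
```

This is the same shape of check that lruvec_memcg_debug() performs, just turned from a debug assertion into a retry loop so the caller always ends up holding the lock that actually covers the page.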

But I guess still worth sending my first two patches, since most of us
use move_charge_at_immigrate only for... testing move_charge_at_immigrate.
Whereas compaction bugs can hit any of us at any time.

> > 
> > There's one other unexplained and unfixed bug I've seen several times
> > while exercising mem_cgroup_move_account(): refcount_warn_saturate()
> > from where __mem_cgroup_clear_mc() calls mem_cgroup_id_get_many().
> > I'll be glad if that goes away when the lruvec locking is fixed,
> > but don't understand the connection. And it's quite possible that
> > this refcounting bug has nothing to do with your changes: I have
> > not succeeded in reproducing it on 5.7 nor on 5.7-rc7-mm1,
> > but I didn't really try long enough to be sure.

I got one of those quite quickly too after setting "cg m"'s seconds=1.
I think the best thing I can do while thinking and researching, is
give 5.7-rc7-mm1 a run on that machine with the speeded up moving,
to see whether or not that refcount bug reproduces.

> > 
> > (I should also warn, that I'm surprised by the amount of change
> > 11/16 makes to mm/mlock.c: I've not been exercising mlock at all.)
> 
> Yes, that is a bit complex. I have tried the mlock cases in the selftests
> together with your swap&build case. They all ran fine over 300 runs.

Good, thanks.

Hugh
Alex Shi June 12, 2020, 10:43 a.m. UTC | #6
在 2020/6/12 上午6:09, Hugh Dickins 写道:
>> Anyway, I will send out a new patchset
>> with the first issue fixed, and then let's discuss based on it.
> Sigh. I wish you had waited for me to send you fixes, or waited for an
> identifiable tag like 5.8-rc1.  Andrew has been very hard at work with
> mm patches to Linus, but it looks like there are still "data_race" mods
> to come before -rc1, which may stop your v12 from applying cleanly.

Sorry, I wasn't aware you would have another sending... My fault.
And yes, an official 5.8-rc is a better base.

> 
>>> In the second, I was using rcu_read_lock() instead of trylock_page()
>>> (like in my own patchset), but could not quite be sure of the case when
>>> PageSwapCache gets set at the wrong moment. Gave up for the night, and
>>> in the morning abandoned that, instead just shifting the call to
>>> __isolate_lru_page_prepare() after the get_page_unless_zero(),
>>> where that trylock_page() becomes safe (no danger of stomping on page
>>> flags while page is being freed or newly allocated to another owner).
>> Sorry, I don't see the problem with trylock_page here. Could you
>> describe it as a race?
> Races, yes. Look, I'll send you now patches 1 and 2: at least with those
> in it should be safe for you and others to test compaction (if 5.8-rc1
> turns out well: I think so much has gone in that it might have unrelated
> problems, and often the -rc2 is much more stable).
> 
> But no point in sending 3 and 4 at this point, since ...
> 

I guess some of the concern may come from a linux-next mm bug?

>>> I thought that a very safe change, but best to do some test runs with
>>> it in before finalizing. And was then unpleasantly surprised to hit a
>>> VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup) from
>>> lock_page_lruvec_irqsave < relock_page_lruvec < pagevec_lru_move_fn <
>>> pagevec_move_tail < lru_add_drain_cpu after 6 hours on one machine.
>>> Then similar but < rotate_reclaimable_page after 8 hours on another.
>>>
>>> Only seen once before: that's what drove me to add patch 4 (with 3 to
>>> revert the locking before it): somehow, when adding the lruvec locking
>>> there, I just took it for granted that your patchset would have the
>>> appropriate locking (or TestClearPageLRU magic) at the other end.
>>>
>>> But apparently not. And I'm beginning to think that TestClearPageLRU
>>> was just to distract the audience from the lack of proper locking.
>>>
>>> I have certainly not concluded that yet, but I'm having to think about
>>> an area of the code which I'd imagined you had under control (and I'm
>>> puzzled why my testing has found it so very hard to hit). If we're
>>> lucky, I'll find that pagevec_move_tail is a special case, and
>>> nothing much else needs changing; but I doubt that will be so.
> ... shows that your locking primitives are not yet good enough
> to handle the case when tasks are moved between memcgs with
> move_charge_at_immigrate set.  "bin/cg m" in the tests I sent,
> but today I'm changing its "seconds=60" to "seconds=1" in hope
> of speeding up the reproduction.

Yes, I am using your great cases with the 'm' parameter to do migration
testing, but unluckily, no error was found on my box.

> 
> Ah, good, two machines crashed in 1.5 hours: but I don't need to
> examine the crashes, now that it's obvious there's no protection -
> please, think about rotate_reclaimable_page() (there will be more
> cases, but in practice that seems easiest to hit, so focus on that)
> and how it is not protected from mem_cgroup_move_account().
> 
> I'm thinking too. Maybe judicious use of lock_page_memcg() can fix it
> (8 years ago it was unsuitable, but a lot has changed for the better
> since then); otherwise it's back to what I've been doing all along,
> taking the likely lruvec lock, and checking under that lock whether
> we have the right lock (as your lruvec_memcg_debug() does), retrying
> if not. Which may be more efficient than involving lock_page_memcg().
> 
> But I guess still worth sending my first two patches, since most of us
> use move_charge_at_immigrate only for... testing move_charge_at_immigrate.
> Whereas compaction bugs can hit any of us at any time.
> 
>>> There's one other unexplained and unfixed bug I've seen several times
>>> while exercising mem_cgroup_move_account(): refcount_warn_saturate()
>>> from where __mem_cgroup_clear_mc() calls mem_cgroup_id_get_many().
>>> I'll be glad if that goes away when the lruvec locking is fixed,
>>> but don't understand the connection. And it's quite possible that
>>> this refcounting bug has nothing to do with your changes: I have
>>> not succeeded in reproducing it on 5.7 nor on 5.7-rc7-mm1,
>>> but I didn't really try long enough to be sure.
> I got one of those quite quickly too after setting "cg m"'s seconds=1.
> I think the best thing I can do while thinking and researching, is
> give 5.7-rc7-mm1 a run on that machine with the speeded up moving,
> to see whether or not that refcount bug reproduces.
> 

Many thanks for your help on this patchset!

Alex
Alex Shi June 16, 2020, 6:14 a.m. UTC | #7
On 2020/6/12 6:09 AM, Hugh Dickins wrote:
>>> I thought that a very safe change, but best to do some test runs with
>>> it in before finalizing. And was then unpleasantly surprised to hit a
>>> VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup) from
>>> lock_page_lruvec_irqsave < relock_page_lruvec < pagevec_lru_move_fn <
>>> pagevec_move_tail < lru_add_drain_cpu after 6 hours on one machine.
>>> Then similar but < rotate_reclaimable_page after 8 hours on another.
>>>
>>> Only seen once before: that's what drove me to add patch 4 (with 3 to
>>> revert the locking before it): somehow, when adding the lruvec locking
>>> there, I just took it for granted that your patchset would have the
>>> appropriate locking (or TestClearPageLRU magic) at the other end.
>>>
>>> But apparently not. And I'm beginning to think that TestClearPageLRU
>>> was just to distract the audience from the lack of proper locking.
>>>
>>> I have certainly not concluded that yet, but I'm having to think about
>>> an area of the code which I'd imagined you had under control (and I'm
>>> puzzled why my testing has found it so very hard to hit). If we're
>>> lucky, I'll find that pagevec_move_tail is a special case, and
>>> nothing much else needs changing; but I doubt that will be so.
> ... shows that your locking primitives are not yet good enough
> to handle the case when tasks are moved between memcgs with
> move_charge_at_immigrate set.  "bin/cg m" in the tests I sent,
> but today I'm changing its "seconds=60" to "seconds=1" in hope
> of speeding up the reproduction.
> 
> Ah, good, two machines crashed in 1.5 hours: but I don't need to
> examine the crashes, now that it's obvious there's no protection -
> please, think about rotate_reclaimable_page() (there will be more
> cases, but in practice that seems easiest to hit, so focus on that)
> and how it is not protected from mem_cgroup_move_account().
> 
> I'm thinking too. Maybe judicious use of lock_page_memcg() can fix it
> (8 years ago it was unsuitable, but a lot has changed for the better
> since then); otherwise it's back to what I've been doing all along,
> taking the likely lruvec lock, and checking under that lock whether
> we have the right lock (as your lruvec_memcg_debug() does), retrying
> if not. Which may be more efficient than involving lock_page_memcg().
> 
Hi Hugh,

Thanks a lot for the report!

Thinking again about the relation between lru_move_fn and mem_cgroup_move_account:
I found that if we want to change pgdat->lru_lock to the memcg's lruvec lock, we
have to serialize mem_cgroup_move_account against pagevec_lru_move_fn. Otherwise
a bad scenario like the following is possible:

        cpu 0                                   cpu 1
    lruvec = mem_cgroup_page_lruvec()
                                        if (!isolate_lru_page())
                                                mem_cgroup_move_account

    spin_lock_irqsave(&lruvec->lru_lock <== wrong lock.

So we need the ClearPageLRU to block isolate_lru_page(), and thus serialize
the memcg change here. Doing a relock check would be a mitigation, but not
a solution.

The following patch folds the vm event PGROTATED into pagevec_move_tail_fn
and fixes this problem by clearing PageLRU before moving pages between LRU lists.
I will split it into 2 patches and merge them into the v12 patchset.

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>


diff --git a/mm/swap.c b/mm/swap.c
index eba0c17dffd8..fa211157bfec 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -200,8 +200,7 @@ int get_kernel_page(unsigned long start, int write, struct page **pages)
 EXPORT_SYMBOL_GPL(get_kernel_page);
 
 static void pagevec_lru_move_fn(struct pagevec *pvec,
-	void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg),
-	void *arg)
+	void (*move_fn)(struct page *page, struct lruvec *lruvec), bool add)
 {
 	int i;
 	struct lruvec *lruvec = NULL;
@@ -210,8 +209,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
 
+		if (!add && !TestClearPageLRU(page))
+			continue;
+
 		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
-		(*move_fn)(page, lruvec, arg);
+		(*move_fn)(page, lruvec);
+
+		if (!add)
+			SetPageLRU(page);
 	}
 	if (lruvec)
 		unlock_page_lruvec_irqrestore(lruvec, flags);
@@ -219,35 +224,23 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	pagevec_reinit(pvec);
 }
 
-static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
-				 void *arg)
+static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
 {
-	int *pgmoved = arg;
-
 	if (PageLRU(page) && !PageUnevictable(page)) {
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		ClearPageActive(page);
 		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
-		(*pgmoved) += hpage_nr_pages(page);
+		__count_vm_events(PGROTATED, hpage_nr_pages(page));
 	}
 }
 
 /*
- * pagevec_move_tail() must be called with IRQ disabled.
- * Otherwise this may cause nasty races.
- */
-static void pagevec_move_tail(struct pagevec *pvec)
-{
-	int pgmoved = 0;
-
-	pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, &pgmoved);
-	__count_vm_events(PGROTATED, pgmoved);
-}
-
-/*
  * Writeback is about to end against a page which has been marked for immediate
  * reclaim.  If it still appears to be reclaimable, move it to the tail of the
  * inactive list.
+ *
+ * pagevec_move_tail_fn() must be called with IRQ disabled.
+ * Otherwise this may cause nasty races.
  */
 void rotate_reclaimable_page(struct page *page)
 {
@@ -260,7 +253,7 @@ void rotate_reclaimable_page(struct page *page)
 		local_lock_irqsave(&lru_rotate.lock, flags);
 		pvec = this_cpu_ptr(&lru_rotate.pvec);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_move_tail(pvec);
+			pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, false);
 		local_unlock_irqrestore(&lru_rotate.lock, flags);
 	}
 }
@@ -302,8 +295,7 @@ void lru_note_cost_page(struct page *page)
 		      page_is_file_lru(page), hpage_nr_pages(page));
 }
 
-static void __activate_page(struct page *page, struct lruvec *lruvec,
-			    void *arg)
+static void __activate_page(struct page *page, struct lruvec *lruvec)
 {
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
@@ -327,7 +319,7 @@ static void activate_page_drain(int cpu)
 	struct pagevec *pvec = &per_cpu(lru_pvecs.activate_page, cpu);
 
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, __activate_page, NULL);
+		pagevec_lru_move_fn(pvec, __activate_page, false);
 }
 
 static bool need_activate_page_drain(int cpu)
@@ -345,7 +337,7 @@ void activate_page(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.activate_page);
 		get_page(page);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, __activate_page, NULL);
+			pagevec_lru_move_fn(pvec, __activate_page, false);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -515,8 +507,7 @@ void lru_cache_add_active_or_unevictable(struct page *page,
  * be write it out by flusher threads as this is much more effective
  * than the single-page writeout from reclaim.
  */
-static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
-			      void *arg)
+static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 {
 	int lru;
 	bool active;
@@ -563,8 +554,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 	}
 }
 
-static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
-			    void *arg)
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 {
 	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
@@ -581,8 +571,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
 	}
 }
 
-static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
-			    void *arg)
+static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec)
 {
 	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
 	    !PageSwapCache(page) && !PageUnevictable(page)) {
@@ -625,21 +614,21 @@ void lru_add_drain_cpu(int cpu)
 
 		/* No harm done if a racing interrupt already did this */
 		local_lock_irqsave(&lru_rotate.lock, flags);
-		pagevec_move_tail(pvec);
+		pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, false);
 		local_unlock_irqrestore(&lru_rotate.lock, flags);
 	}
 
 	pvec = &per_cpu(lru_pvecs.lru_deactivate_file, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, false);
 
 	pvec = &per_cpu(lru_pvecs.lru_deactivate, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_deactivate_fn, false);
 
 	pvec = &per_cpu(lru_pvecs.lru_lazyfree, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, false);
 
 	activate_page_drain(cpu);
 }
@@ -668,7 +657,7 @@ void deactivate_file_page(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
 
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+			pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, false);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -690,7 +679,7 @@ void deactivate_page(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate);
 		get_page(page);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+			pagevec_lru_move_fn(pvec, lru_deactivate_fn, false);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -712,7 +701,7 @@ void mark_page_lazyfree(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.lru_lazyfree);
 		get_page(page);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
+			pagevec_lru_move_fn(pvec, lru_lazyfree_fn, false);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -913,8 +902,7 @@ void __pagevec_release(struct pagevec *pvec)
 }
 EXPORT_SYMBOL(__pagevec_release);
 
-static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
-				 void *arg)
+static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
 {
 	enum lru_list lru;
 	int was_unevictable = TestClearPageUnevictable(page);
@@ -973,7 +961,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL);
+	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, true);
 }
 
 /**