[v16,00/22] per memcg lru_lock

Message ID 1594429136-20002-1-git-send-email-alex.shi@linux.alibaba.com (mailing list archive)

Message

Alex Shi July 11, 2020, 12:58 a.m. UTC
This new version is based on v5.8-rc4. It adds 2 more patches:
'mm/thp: remove code path which never got into'
'mm/thp: add tail pages into lru anyway in split_huge_page()'
and modifies 'mm/mlock: reorder isolation sequence during munlock'.

The current lru_lock is per node, pgdat->lru_lock, and guards the lru lists;
but the lru lists themselves were moved into memcg long ago. Still using a
per-node lru_lock is clearly unscalable: pages in different memcgs have to
compete with each other for one shared lru_lock. This patchset uses a
per-lruvec (per-memcg, per-node) lru_lock in place of the per-node lru lock to
guard the lru lists, making them scalable across memcgs and gaining performance.

Currently lru_lock still guards both the lru list and the page's lru bit,
which is fine. But if we want to take the lruvec lock specific to a page, we
need to pin down the page's lruvec/memcg while locking: just taking the lruvec
lock first can be undermined by a concurrent memcg charge/migration of the
page. To fix this, we pull the clearing of the page's lru bit forward and use
it as the pin-down action that blocks memcg changes. That is the reason for
the new atomic function TestClearPageLRU. So isolating a page now needs both
actions: TestClearPageLRU and holding the lru_lock.

The typical user of this is isolate_migratepages_block() in compaction.c: we
have to clear the lru bit before taking the lru lock, which serializes page
isolation against memcg page charge/migration, since those can change the
page's lruvec and hence the lru_lock in it.
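
To make the new rule concrete, below is a minimal sketch of isolating one page
this way. It is an illustration only, not code from the patchset itself; it
assumes the TestClearPageLRU primitive and the lock_page_lruvec_irq()/
unlock_page_lruvec_irq() helpers this series introduces.

/*
 * Illustrative sketch: isolate one page under the new rule.
 * Clearing PageLRU first pins the page's memcg/lruvec, so the lruvec
 * lock we then take cannot be switched underneath us by a concurrent
 * memcg charge/migration.
 */
static bool sketch_isolate_lru_page(struct page *page, struct list_head *dst)
{
	struct lruvec *lruvec;

	if (unlikely(!get_page_unless_zero(page)))
		return false;

	if (!TestClearPageLRU(page)) {
		/* another isolator owns the lru state; back off */
		put_page(page);
		return false;
	}

	lruvec = lock_page_lruvec_irq(page);	/* stable: lru bit is clear */
	del_page_from_lru_list(page, lruvec, page_lru(page));
	unlock_page_lruvec_irq(lruvec);

	list_add(&page->lru, dst);
	return true;
}

Compaction's isolate_migratepages_block() follows the same order, which is
what serializes it against the memcg charge/migration path mentioned above.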

The above solution was suggested by Johannes Weiner and builds on his new
memcg charge path; that is how this patchset came about. (Hugh Dickins tested
and contributed much code, from the compaction fix to general code polish,
thanks a lot!)

The patchset includes 3 parts:
1, some code cleanup and minimal optimization as preparation.
2, use TestClearPageLRU as the precondition for page isolation.
3, replace the per-node lru_lock with a per-memcg, per-node lru_lock.

Following Daniel Jordan's suggestion, I ran 208 'dd' tasks in 104 containers
on a 2-socket * 26-core * HT box with a modified case:
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
With this patchset, readtwice performance increased by about 80% in
concurrent containers.

Thanks to Hugh Dickins and Konstantin Khlebnikov, who both brought up this
idea 8 years ago, and to the others who gave comments as well: Daniel Jordan,
Mel Gorman, Shakeel Butt, Matthew Wilcox, etc.

Thanks for the testing support from Intel 0day and Rong Chen, Fengguang Wu,
and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!

Alex Shi (20):
  mm/vmscan: remove unnecessary lruvec adding
  mm/page_idle: no unlikely double check for idle page counting
  mm/compaction: correct the comments of compact_defer_shift
  mm/compaction: rename compact_deferred as compact_should_defer
  mm/thp: move lru_add_page_tail func to huge_memory.c
  mm/thp: clean up lru_add_page_tail
  mm/thp: remove code path which never got into
  mm/thp: narrow lru locking
  mm/memcg: add debug checking in lock_page_memcg
  mm/swap: fold vm event PGROTATED into pagevec_move_tail_fn
  mm/lru: move lru_lock holding in func lru_note_cost_page
  mm/lru: move lock into lru_note_cost
  mm/lru: introduce TestClearPageLRU
  mm/thp: add tail pages into lru anyway in split_huge_page()
  mm/compaction: do page isolation first in compaction
  mm/mlock: reorder isolation sequence during munlock
  mm/swap: serialize memcg changes during pagevec_lru_move_fn
  mm/lru: replace pgdat lru_lock with lruvec lock
  mm/lru: introduce the relock_page_lruvec function
  mm/pgdat: remove pgdat lru_lock

Hugh Dickins (2):
  mm/vmscan: use relock for move_pages_to_lru
  mm/lru: revise the comments of lru_lock

 Documentation/admin-guide/cgroup-v1/memcg_test.rst |  15 +-
 Documentation/admin-guide/cgroup-v1/memory.rst     |  21 +--
 Documentation/trace/events-kmem.rst                |   2 +-
 Documentation/vm/unevictable-lru.rst               |  22 +--
 include/linux/compaction.h                         |   4 +-
 include/linux/memcontrol.h                         |  98 +++++++++++
 include/linux/mm_types.h                           |   2 +-
 include/linux/mmzone.h                             |   6 +-
 include/linux/page-flags.h                         |   1 +
 include/linux/swap.h                               |   4 +-
 include/trace/events/compaction.h                  |   2 +-
 mm/compaction.c                                    | 113 ++++++++----
 mm/filemap.c                                       |   4 +-
 mm/huge_memory.c                                   |  47 +++--
 mm/memcontrol.c                                    |  71 +++++++-
 mm/memory.c                                        |   3 -
 mm/mlock.c                                         |  93 +++++-----
 mm/mmzone.c                                        |   1 +
 mm/page_alloc.c                                    |   1 -
 mm/page_idle.c                                     |   8 -
 mm/rmap.c                                          |   4 +-
 mm/swap.c                                          | 189 ++++++++-------------
 mm/swap_state.c                                    |   2 -
 mm/vmscan.c                                        | 174 ++++++++++---------
 mm/workingset.c                                    |   2 -
 25 files changed, 524 insertions(+), 365 deletions(-)

Comments

Alex Shi July 11, 2020, 1:02 a.m. UTC | #1
Hi Hugh,

I believe I owe you a 'Tested-by' for the previous version.
Could you give the new version a try and add a Reviewed-by or Tested-by
if it looks fine?

Thanks
Alex
Alex Shi July 16, 2020, 8:49 a.m. UTC | #2
Hi All,

This version has been tested and has passed Hugh Dickins' testing, as v15/v14 did.
Thanks, Hugh!

Would anyone like to give comments or raise concerns about the patches?


Thanks
Alex


Alexander Duyck July 16, 2020, 2:11 p.m. UTC | #3
On Fri, Jul 10, 2020 at 5:59 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
> Thanks for Testing support from Intel 0day and Rong Chen, Fengguang Wu,
> and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!

Hi Alex,

I think I am seeing a regression with this patch set when I run the
will-it-scale/page_fault3 test. Specifically the processes result is
dropping from 56371083 to 43127382 when I apply these patches.

I haven't had a chance to bisect and figure out what is causing it,
and wanted to let you know in case you are aware of anything specific
that may be causing this.

Thanks.

- Alex
Alex Shi July 17, 2020, 5:24 a.m. UTC | #4
On 2020/7/16 10:11 PM, Alexander Duyck wrote:
>> Thanks for Testing support from Intel 0day and Rong Chen, Fengguang Wu,
>> and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!
> Hi Alex,
> 
> I think I am seeing a regression with this patch set when I run the
> will-it-scale/page_fault3 test. Specifically the processes result is
> dropping from 56371083 to 43127382 when I apply these patches.
> 
> I haven't had a chance to bisect and figure out what is causing it,
> and wanted to let you know in case you are aware of anything specific
> that may be causing this.


Thanks a lot for the info!

Actually, patch 17 and patch 13 may change performance a little. For patch 17,
for example, Intel LKP found a 68.0% vm-scalability.throughput improvement, a
-76.3% stress-ng.remap.ops_per_sec regression, a +23.2% stress-ng.memfd.ops_per_sec
improvement, etc.

This kind of performance interference is known and acceptable.
Thanks
Alex
Hugh Dickins July 19, 2020, 3:23 p.m. UTC | #5
On Fri, 17 Jul 2020, Alex Shi wrote:
> On 2020/7/16 10:11 PM, Alexander Duyck wrote:
> >> Thanks for Testing support from Intel 0day and Rong Chen, Fengguang Wu,
> >> and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!
> > Hi Alex,
> > 
> > I think I am seeing a regression with this patch set when I run the
> > will-it-scale/page_fault3 test. Specifically the processes result is
> > dropping from 56371083 to 43127382 when I apply these patches.
> > 
> > I haven't had a chance to bisect and figure out what is causing it,
> > and wanted to let you know in case you are aware of anything specific
> > that may be causing this.
> 
> 
> Thanks a lot for the info!
> 
> Actually, patch 17 and patch 13 may change performance a little. For patch 17,
> for example, Intel LKP found a 68.0% vm-scalability.throughput improvement, a
> -76.3% stress-ng.remap.ops_per_sec regression, a +23.2% stress-ng.memfd.ops_per_sec
> improvement, etc.
> 
> This kind of performance interference is known and acceptable.

That may be too blithe a response.

I can see that I've lots of other mails to reply to, from you and from
others - I got held up for a week in advancing from gcc 4.8 on my test
machines. But I'd better rush this to you before reading further, because
what I was hunting the last few days rather invalidates earlier testing.
And I'm glad that I held back from volunteering a Tested-by - though,
yes, v13 and later are stable where the older versions were unstable.

I noticed that 5.8-rc5, with lrulock v16 applied, took significantly
longer to run loads than without it applied, when there should have been
only slight differences in system time. Comparing /proc/vmstat, something
that stood out was "pgrotated 0" for the patched kernels, which led here:

If pagevec_lru_move_fn() is now to TestClearPageLRU (I have still not
decided whether that's good or not, but assume here that it is good),
then functions called through it must be changed not to expect PageLRU!

Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/swap.c |   14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

--- 5.8-rc5-lru16/mm/swap.c	2020-07-15 21:03:42.781236769 -0700
+++ linux/mm/swap.c	2020-07-18 13:28:14.000000000 -0700
@@ -227,7 +227,7 @@ static void pagevec_lru_move_fn(struct p
 
 static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && !PageUnevictable(page)) {
+	if (!PageUnevictable(page)) {
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		ClearPageActive(page);
 		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
@@ -300,7 +300,7 @@ void lru_note_cost_page(struct page *pag
 
 static void __activate_page(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
+	if (!PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
 		int nr_pages = hpage_nr_pages(page);
 
@@ -357,7 +357,8 @@ void activate_page(struct page *page)
 
 	page = compound_head(page);
 	lruvec = lock_page_lruvec_irq(page);
-	__activate_page(page, lruvec);
+	if (PageLRU(page))
+		__activate_page(page, lruvec);
 	unlock_page_lruvec_irq(lruvec);
 }
 #endif
@@ -515,9 +516,6 @@ static void lru_deactivate_file_fn(struc
 	bool active;
 	int nr_pages = hpage_nr_pages(page);
 
-	if (!PageLRU(page))
-		return;
-
 	if (PageUnevictable(page))
 		return;
 
@@ -558,7 +556,7 @@ static void lru_deactivate_file_fn(struc
 
 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+	if (PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
 		int nr_pages = hpage_nr_pages(page);
 
@@ -575,7 +573,7 @@ static void lru_deactivate_fn(struct pag
 
 static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
+	if (PageAnon(page) && PageSwapBacked(page) &&
 	    !PageSwapCache(page) && !PageUnevictable(page)) {
 		bool active = PageActive(page);
 		int nr_pages = hpage_nr_pages(page);
Alex Shi July 20, 2020, 3:01 a.m. UTC | #6
On 2020/7/19 11:23 PM, Hugh Dickins wrote:
> I noticed that 5.8-rc5, with lrulock v16 applied, took significantly
> longer to run loads than without it applied, when there should have been
> only slight differences in system time. Comparing /proc/vmstat, something
> that stood out was "pgrotated 0" for the patched kernels, which led here:
> 
> If pagevec_lru_move_fn() is now to TestClearPageLRU (I have still not
> decided whether that's good or not, but assume here that it is good),
> then functions called through it must be changed not to expect PageLRU!
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Good catch!

Thanks a lot, Hugh!
Besides your 6 changes, which should all be applied, it looks like we can add one
more in swap.c to stop further action on !PageLRU pages:

Many Thanks!
Alex

@@ -649,7 +647,7 @@ void deactivate_file_page(struct page *page)
         * In a workload with many unevictable page such as mprotect,
         * unevictable page deactivation for accelerating reclaim is pointless.
         */
-       if (PageUnevictable(page))
+       if (PageUnevictable(page) || !PageLRU(page))
                return;

        if (likely(get_page_unless_zero(page))) {
Hugh Dickins July 20, 2020, 4:47 a.m. UTC | #7
On Mon, 20 Jul 2020, Alex Shi wrote:
> On 2020/7/19 11:23 PM, Hugh Dickins wrote:
> > I noticed that 5.8-rc5, with lrulock v16 applied, took significantly
> > longer to run loads than without it applied, when there should have been
> > only slight differences in system time. Comparing /proc/vmstat, something
> > that stood out was "pgrotated 0" for the patched kernels, which led here:
> > 
> > If pagevec_lru_move_fn() is now to TestClearPageLRU (I have still not
> > decided whether that's good or not, but assume here that it is good),
> > then functions called through it must be changed not to expect PageLRU!
> > 
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> 
> Good catch!
> 
> Thanks a lot, Hugh! 
> Besides your 6 changes, which should all be applied, it looks like we can add one
> more in swap.c to stop further action on !PageLRU pages:

Agreed, that's a minor optimization that wasn't done before, and it can
be added (but it's not a fix like the rest of them).

> 
> Many Thanks!
> Alex
> 
> @@ -649,7 +647,7 @@ void deactivate_file_page(struct page *page)
>          * In a workload with many unevictable page such as mprotect,
>          * unevictable page deactivation for accelerating reclaim is pointless.
>          */
> -       if (PageUnevictable(page))
> +       if (PageUnevictable(page) || !PageLRU(page))
>                 return;
> 
>         if (likely(get_page_unless_zero(page))) {
Alex Shi July 20, 2020, 7:30 a.m. UTC | #8
I am preparing and testing patch v17 according to the comments from Hugh Dickins
and Alexander Duyck.
Many thanks for the line-by-line review and patient suggestions!

Please send me any further comments or concerns about any of the patches!

Thanks a lot!
Alex
