[v7,00/10] per lruvec lru_lock for memcg

Message ID: 1577264666-246071-1-git-send-email-alex.shi@linux.alibaba.com

Message

Alex Shi Dec. 25, 2019, 9:04 a.m. UTC
Hi all,

Merry Christmas! :)

This patchset moves lru_lock into lruvec, giving each lruvec its own
lru_lock, and thus one lru_lock for each memcg per node.

We introduce the function lock_page_lruvec, which locks the page's
memcg and then the memcg's lruvec->lru_lock (thanks to Johannes Weiner,
Hugh Dickins and Konstantin Khlebnikov for the suggestion/reminder),
replacing the old pgdat->lru_lock.
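
For reference, a minimal sketch of the idea (assuming lock_page_memcg()
returns the page's memcg; see patch 03 for the real implementation,
which may differ in detail):

struct lruvec *lock_page_lruvec_irq(struct page *page)
{
	/* stabilize page->mem_cgroup before taking the lruvec lock */
	struct mem_cgroup *memcg = lock_page_memcg(page);
	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));

	spin_lock_irq(&lruvec->lru_lock);
	return lruvec;
}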

Following Daniel Jordan's suggestion, I ran 208 'dd' tasks in 104
containers on a 2-socket * 26-core * HT box with a modified case:
  https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice

With this patchset, the readtwice performance increased by about 80%
with containers, and there is no performance drop without containers.

Another way to guard move_account is by lru_lock instead of move_lock.
Consider the memcg task-moving path:
   mem_cgroup_move_task:
     mem_cgroup_move_charge:
	lru_add_drain_all();
	atomic_inc(&mc.from->moving_account); //ask lruvec's move_lock
	synchronize_rcu();
	walk_page_range: do charge_walk_ops(mem_cgroup_move_charge_pte_range):
	   isolate_lru_page();
	   mem_cgroup_move_account(page,)
		spin_lock(&from->move_lock) 
		page->mem_cgroup = to;
		spin_unlock(&from->move_lock) 
	   putback_lru_page(page)

Guarding 'page->mem_cgroup = to' by to_vec->lru_lock has a similar effect
to move_lock, so performance-wise both solutions are about the same.
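
Illustratively, guarding the commit with the destination lruvec's
lru_lock would replace the move_lock section above with something like
this (a sketch only, not code from this series; to_vec names the
destination memcg's lruvec):

	   mem_cgroup_move_account(page,)
		to_vec = mem_cgroup_lruvec(to, page_pgdat(page));
		spin_lock(&to_vec->lru_lock);
		page->mem_cgroup = to;	/* guarded by lru_lock */
		spin_unlock(&to_vec->lru_lock);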

Thanks to Hugh Dickins and Konstantin Khlebnikov, who both proposed the
same idea 7 years ago.

Thanks for all the comments from Hugh Dickins, Konstantin Khlebnikov,
Daniel Jordan, Johannes Weiner, Mel Gorman, Shakeel Butt, Rong Chen,
Fengguang Wu, Yun Wang and others, and for the testing support from
Intel 0day!

v7,
  a, rebase on v5.5-rc3.
  b, move the lock_page_lru() cleanup before the lock replacement.

v6,
  a, rebase on v5.5-rc2, and redo the testing.
  b, pick up changes from Johannes' comments and a lock_page_lru cleanup.

v5,
  a, lock the page's memcg according to Johannes' suggestion.
  b, use a macro for the non-memcg case, per Johannes' and Matthew's
     suggestions.

v4:
  a, fix the page->mem_cgroup dereferencing issue, thanks Johannes Weiner
  b, remove the irqsave flags changes, thanks Matthew Wilcox
  c, merge/split patches for better understanding and bisection purposes

v3: rebase on linux-next, and fold the relock fix patch into the
    introducing patch

v2: bypass a performance regression bug and fix some function issues

v1: initial version; aim testing showed a 5% performance increase on a
    16-thread box.


Alex Shi (9):
  mm/vmscan: remove unnecessary lruvec adding
  mm/memcg: fold lru_lock in lock_page_lru
  mm/lru: replace pgdat lru_lock with lruvec lock
  mm/lru: introduce the relock_page_lruvec function
  mm/mlock: optimize munlock_pagevec by relocking
  mm/swap: only change the lru_lock iff page's lruvec is different
  mm/pgdat: remove pgdat lru_lock
  mm/lru: add debug checking for page memcg moving
  mm/memcg: add debug checking in lock_page_memcg

Hugh Dickins (1):
  mm/lru: revise the comments of lru_lock

 Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +---
 Documentation/admin-guide/cgroup-v1/memory.rst     |  6 +-
 Documentation/trace/events-kmem.rst                |  2 +-
 Documentation/vm/unevictable-lru.rst               | 22 ++---
 include/linux/memcontrol.h                         | 63 ++++++++++++++
 include/linux/mm_types.h                           |  2 +-
 include/linux/mmzone.h                             |  5 +-
 mm/compaction.c                                    | 59 ++++++++-----
 mm/filemap.c                                       |  4 +-
 mm/huge_memory.c                                   | 18 ++--
 mm/memcontrol.c                                    | 84 +++++++++++++++----
 mm/mlock.c                                         | 28 +++----
 mm/mmzone.c                                        |  1 +
 mm/page_alloc.c                                    |  1 -
 mm/page_idle.c                                     |  7 +-
 mm/rmap.c                                          |  2 +-
 mm/swap.c                                          | 75 +++++++----------
 mm/vmscan.c                                        | 98 ++++++++++++----------
 18 files changed, 297 insertions(+), 195 deletions(-)

Comments

Andrew Morton Dec. 31, 2019, 11:05 p.m. UTC | #1
On Wed, 25 Dec 2019 17:04:16 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:

> This patchset moves lru_lock into lruvec, giving each lruvec its own
> lru_lock, and thus one lru_lock for each memcg per node.

I see that there has been plenty of feedback on previous versions, but
no acked/reviewed tags as yet.

I think I'll take a pass for now, see what the audience feedback looks
like ;)
Alex Shi Jan. 2, 2020, 10:21 a.m. UTC | #2
On 2020/1/1 7:05 AM, Andrew Morton wrote:
> On Wed, 25 Dec 2019 17:04:16 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
> 
>> This patchset moves lru_lock into lruvec, giving each lruvec its own
>> lru_lock, and thus one lru_lock for each memcg per node.
> 
> I see that there has been plenty of feedback on previous versions, but
> no acked/reviewed tags as yet.
> 
> I think I'll take a pass for now, see what the audience feedback looks
> like ;)
> 


Thanks a lot! Andrew.

Please drop the 10th patch since it's for debugging only and costs some performance.

Best regards & Happy new year! :)
Alex
Alex Shi Jan. 10, 2020, 2:01 a.m. UTC | #3
On 2020/1/2 6:21 PM, Alex Shi wrote:
> 
> 
> On 2020/1/1 7:05 AM, Andrew Morton wrote:
>> On Wed, 25 Dec 2019 17:04:16 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>
>>> This patchset moves lru_lock into lruvec, giving each lruvec its own
>>> lru_lock, and thus one lru_lock for each memcg per node.
>>
>> I see that there has been plenty of feedback on previous versions, but
>> no acked/reviewed tags as yet.
>>
>> I think I'll take a pass for now, see what the audience feedback looks
>> like ;)
>>
> 

Hi Johannes,

Any comments on this version? :)

Thanks
Alex

> 
> Thanks a lot! Andrew.
> 
> Please drop the 10th patch since it's for debugging only and costs some performance.
> 
> Best regards & Happy new year! :)
> Alex
>
Hugh Dickins Jan. 13, 2020, 8:48 a.m. UTC | #4
On Fri, 10 Jan 2020, Alex Shi wrote:
> On 2020/1/2 6:21 PM, Alex Shi wrote:
> > On 2020/1/1 7:05 AM, Andrew Morton wrote:
> >> On Wed, 25 Dec 2019 17:04:16 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
> >>
> >>> This patchset moves lru_lock into lruvec, giving each lruvec its own
> >>> lru_lock, and thus one lru_lock for each memcg per node.
> >>
> >> I see that there has been plenty of feedback on previous versions, but
> >> no acked/reviewed tags as yet.
> >>
> >> I think I'll take a pass for now, see what the audience feedback looks
> >> like ;)
> >>
> > 
> 
> Hi Johannes,
> 
> Any comments on this version? :)

I (Hugh) tried to test it on v5.5-rc5, but did not get very far at all -
perhaps because my particular interest tends towards tmpfs and swap,
and swap always made trouble for lruvec lock - one of the reasons why
our patches were more complicated than you thought necessary.

Booted a smallish kernel in mem=700M with 1.5G of swap, with intention
of running small kernel builds in tmpfs and in ext4-on-loop-on-tmpfs
(losetup was the last command started but I doubt it played much part):

mount -t tmpfs -o size=470M tmpfs /tst
cp /dev/zero /tst
losetup /dev/loop0 /tst/zero

and kernel crashed on the

VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
kernel BUG at mm/memcontrol.c:1268!
lock_page_lruvec_irqsave
relock_page_lruvec_irqsave
pagevec_lru_move_fn
__pagevec_lru_add
lru_add_drain_cpu
lru_add_drain
swap_cluster_readahead
shmem_swapin
shmem_swapin_page
shmem_getpage_gfp
shmem_getpage
shmem_write_begin
generic_perform_write
__generic_file_write_iter
generic_file_write_iter
new_sync_write
__vfs_write
vfs_write
ksys_write
__x86_sys_write
do_syscall_64

Hugh
Alex Shi Jan. 13, 2020, 12:45 p.m. UTC | #5
On 2020/1/13 4:48 PM, Hugh Dickins wrote:
> On Fri, 10 Jan 2020, Alex Shi wrote:
>> On 2020/1/2 6:21 PM, Alex Shi wrote:
>>> On 2020/1/1 7:05 AM, Andrew Morton wrote:
>>>> On Wed, 25 Dec 2019 17:04:16 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>>>
>>>>> This patchset moves lru_lock into lruvec, giving each lruvec its own
>>>>> lru_lock, and thus one lru_lock for each memcg per node.
>>>>
>>>> I see that there has been plenty of feedback on previous versions, but
>>>> no acked/reviewed tags as yet.
>>>>
>>>> I think I'll take a pass for now, see what the audience feedback looks
>>>> like ;)
>>>>
>>>
>>
>> Hi Johannes,
>>
>> Any comments on this version? :)
> 
> I (Hugh) tried to test it on v5.5-rc5, but did not get very far at all -
> perhaps because my particular interest tends towards tmpfs and swap,
> and swap always made trouble for lruvec lock - one of the reasons why
> our patches were more complicated than you thought necessary.
> 
> Booted a smallish kernel in mem=700M with 1.5G of swap, with intention
> of running small kernel builds in tmpfs and in ext4-on-loop-on-tmpfs
> (losetup was the last command started but I doubt it played much part):
> 
> mount -t tmpfs -o size=470M tmpfs /tst
> cp /dev/zero /tst
> losetup /dev/loop0 /tst/zero

Hi Hugh,

Many thanks for the testing!

I am trying to reproduce your testing: doing the above 3 steps, then building a kernel with 'make -j 8' in my qemu, but I cannot reproduce the problem with this v7 version or with the v8 version, https://github.com/alexshi/linux/tree/lru-next, which fixes the bug KK mentioned, as in the diff below.
My qemu VM looks like this:

[root@debug010000002015 ~]# mount -t tmpfs -o size=470M tmpfs /tst
[root@debug010000002015 ~]# cp /dev/zero /tst
cp: error writing ‘/tst/zero’: No space left on device
cp: failed to extend ‘/tst/zero’: No space left on device
[root@debug010000002015 ~]# losetup /dev/loop0 /tst/zero
[root@debug010000002015 ~]# cat /proc/cmdline
earlyprintk=ttyS0 root=/dev/sda1 console=ttyS0 debug crashkernel=128M printk.devkmsg=on

My kernel is configured with MEMCG/MEMCG_SWAP and an xfs root image, and I compile the kernel on ext4. Could you share your kernel config and detailed reproduction steps with me? And would you like to try my new version from the above github link at your convenience?

Thanks a lot!
Alex 

 static void commit_charge(struct page *page, struct mem_cgroup *memcg,
                          bool lrucare)
 {
-       int isolated;
+       struct lruvec *lruvec = NULL;

        VM_BUG_ON_PAGE(page->mem_cgroup, page);

@@ -2612,8 +2617,16 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
         * In some cases, SwapCache and FUSE(splice_buf->radixtree), the page
         * may already be on some other mem_cgroup's LRU.  Take care of it.
         */
-       if (lrucare)
-               lock_page_lru(page, &isolated);
+       if (lrucare) {
+               lruvec = lock_page_lruvec_irq(page);
+               if (likely(PageLRU(page))) {
+                       ClearPageLRU(page);
+                       del_page_from_lru_list(page, lruvec, page_lru(page));
+               } else {
+                       unlock_page_lruvec_irq(lruvec);
+                       lruvec = NULL;
+               }
+       }

        /*
         * Nobody should be changing or seriously looking at
@@ -2631,8 +2644,15 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
         */
        page->mem_cgroup = memcg;

-       if (lrucare)
-               unlock_page_lru(page, isolated);
+       if (lrucare && lruvec) {
+               unlock_page_lruvec_irq(lruvec);
+               lruvec = lock_page_lruvec_irq(page);
+
+               VM_BUG_ON_PAGE(PageLRU(page), page);
+               SetPageLRU(page);
+               add_page_to_lru_list(page, lruvec, page_lru(page));
+               unlock_page_lruvec_irq(lruvec);
+       }
 }
> 
> and kernel crashed on the
> 
> VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
> kernel BUG at mm/memcontrol.c:1268!
> lock_page_lruvec_irqsave
> relock_page_lruvec_irqsave
> pagevec_lru_move_fn
> __pagevec_lru_add
> lru_add_drain_cpu
> lru_add_drain
> swap_cluster_readahead
> shmem_swapin
> shmem_swapin_page
> shmem_getpage_gfp
> shmem_getpage
> shmem_write_begin
> generic_perform_write
> __generic_file_write_iter
> generic_file_write_iter
> new_sync_write
> __vfs_write
> vfs_write
> ksys_write
> __x86_sys_write
> do_syscall_64
> 
> Hugh
>
Hugh Dickins Jan. 13, 2020, 8:20 p.m. UTC | #6
On Mon, 13 Jan 2020, Alex Shi wrote:
> 在 2020/1/13 下午4:48, Hugh Dickins 写道:
> > 
> > I (Hugh) tried to test it on v5.5-rc5, but did not get very far at all -
> > perhaps because my particular interest tends towards tmpfs and swap,
> > and swap always made trouble for lruvec lock - one of the reasons why
> > our patches were more complicated than you thought necessary.
> > 
> > Booted a smallish kernel in mem=700M with 1.5G of swap, with intention
> > of running small kernel builds in tmpfs and in ext4-on-loop-on-tmpfs
> > (losetup was the last command started but I doubt it played much part):
> > 
> > mount -t tmpfs -o size=470M tmpfs /tst
> > cp /dev/zero /tst
> > losetup /dev/loop0 /tst/zero
> 
> Hi Hugh,
> 
> Many thanks for the testing!
> 
> I am trying to reproduce your testing: doing the above 3 steps, then building a kernel with 'make -j 8' in my qemu, but I cannot reproduce the problem with this v7 version or with the v8 version, https://github.com/alexshi/linux/tree/lru-next, which fixes the bug KK mentioned, as in the diff below.
> My qemu VM looks like this:
> 
> [root@debug010000002015 ~]# mount -t tmpfs -o size=470M tmpfs /tst
> [root@debug010000002015 ~]# cp /dev/zero /tst
> cp: error writing ‘/tst/zero’: No space left on device
> cp: failed to extend ‘/tst/zero’: No space left on device
> [root@debug010000002015 ~]# losetup /dev/loop0 /tst/zero
> [root@debug010000002015 ~]# cat /proc/cmdline
> earlyprintk=ttyS0 root=/dev/sda1 console=ttyS0 debug crashkernel=128M printk.devkmsg=on
> 
> My kernel is configured with MEMCG/MEMCG_SWAP and an xfs root image, and I compile the kernel on ext4. Could you share your kernel config and detailed reproduction steps with me? And would you like to try my new version from the above github link at your convenience?

I tried with the mods you had appended, from the [PATCH v7 02/10]
discussion with Konstantin: no, it still crashes in a similar way.

Does your github tree have other changes too?  I see it says "Latest
commit e05d0dd 22 days ago", which doesn't seem to fit.  Afraid I
don't have time to test many variations.

It looks like, in my case, systemd was usually jumping in and doing
something with shmem (perhaps via memfd) that read back from swap
and triggered the crash without any further intervention from me.

So please try booting with mem=700M and 1.5G swap,
mount -t tmpfs -o size=470M tmpfs /tst
cp /dev/zero /tst; cp /tst/zero /dev/null

That's enough to crash it for me, without getting into any losetup or
systemd complications. But you might have to adjust the numbers to be
sure of writing out and reading back from swap.

It's swap to SSD in my case, don't think that matters. I happen to
run with swappiness 100 (precisely to help generate swap problems),
but swappiness 60 is good enough to get these crashes.

Hugh
Alex Shi Jan. 14, 2020, 9:14 a.m. UTC | #7
> I tried with the mods you had appended, from the [PATCH v7 02/10]
> discussion with Konstantin: no, it still crashes in a similar way.
>
> Does your github tree have other changes too?  I see it says "Latest
> commit e05d0dd 22 days ago", which doesn't seem to fit.  Afraid I
> don't have time to test many variations.

Thanks a lot for testing! The github version is the same as the one you
tested. The github branches page has a bug: it doesn't show the correct
update time, https://github.com/alexshi/linux/branches, while the
detailed page does: https://github.com/alexshi/linux/tree/lru-next
> 
> It looks like, in my case, systemd was usually jumping in and doing
> something with shmem (perhaps via memfd) that read back from swap
> and triggered the crash without any further intervention from me.
> 
> So please try booting with mem=700M and 1.5G swap,
> mount -t tmpfs -o size=470M tmpfs /tst
> cp /dev/zero /tst; cp /tst/zero /dev/null
> 
> That's enough to crash it for me, without getting into any losetup or
> systemd complications. But you might have to adjust the numbers to be
> sure of writing out and reading back from swap.
> 
> It's swap to SSD in my case, don't think that matters. I happen to
> run with swappiness 100 (precisely to help generate swap problems),
> but swappiness 60 is good enough to get these crashes.
> 

I did use 700M of memory and a 1.5G swapfile in my qemu, but with a
swapfile, not a disk.
qemu-system-x86_64  -smp 4 -enable-kvm -cpu SandyBridge \
	-m 700M -kernel /home/kuiliang.as/linux/qemulru/arch/x86/boot/bzImage \
	-append "earlyprintk=ttyS0 root=/dev/sda1 console=ttyS0 debug crashkernel=128M printk.devkmsg=on " \
	-hda /home/kuiliang.as/rootimages/CentOS-7-x86_64-Azure-1703.qcow2 \
	-hdb /home/kuiliang.as/rootimages/hdb.qcow2 \
	--nographic \

Anyway, although I didn't reproduce the bug, I found a bug in my
debug function:
	VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);

If !page->mem_cgroup, this BUG could be triggered, so it seems to be a
bug in the debug function, not a real issue. The 9th patch should be
replaced by the following new patch.

Many thanks for testing!
Alex

From ac6d3e2bcfba5727d5c03f9655bb0c7443f655eb Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Mon, 23 Dec 2019 13:33:54 +0800
Subject: [PATCH v8 8/9] mm/lru: add debug checking for page memcg moving

This debug patch could give some clues if anything was left out of
consideration.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/memcontrol.h |  5 +++++
 mm/compaction.c            |  2 ++
 mm/memcontrol.c            | 13 +++++++++++++
 3 files changed, 20 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 09e861df48e8..ece88bb11d0f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -421,6 +421,7 @@ struct lruvec *lock_page_lruvec_irq(struct page *);
 struct lruvec *lock_page_lruvec_irqsave(struct page *, unsigned long*);
 void unlock_page_lruvec_irq(struct lruvec *);
 void unlock_page_lruvec_irqrestore(struct lruvec *, unsigned long);
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);
 
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
@@ -1183,6 +1184,10 @@ static inline
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/mm/compaction.c b/mm/compaction.c
index 8c0a2da217d8..151242817bf4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -971,6 +971,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
 			locked_lruvec = lruvec;
 
+			lruvec_memcg_debug(lruvec, page);
+
 			/* Try get exclusive access under lock */
 			if (!skip_updated) {
 				skip_updated = true;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 00fef8ddbd08..a473da8d2275 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1238,6 +1238,17 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 	return lruvec;
 }
 
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+	if (mem_cgroup_disabled())
+		return;
+
+	if (!page->mem_cgroup)
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page);
+	else
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
+}
+
 struct lruvec *lock_page_lruvec_irq(struct page *page)
 {
 	struct lruvec *lruvec;
@@ -1247,6 +1258,7 @@ struct lruvec *lock_page_lruvec_irq(struct page *page)
 	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
 	spin_lock_irq(&lruvec->lru_lock);
 
+	lruvec_memcg_debug(lruvec, page);
 	return lruvec;
 }
 
@@ -1259,6 +1271,7 @@ struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
 	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
 	spin_lock_irqsave(&lruvec->lru_lock, *flags);
 
+	lruvec_memcg_debug(lruvec, page);
 	return lruvec;
 }
Alex Shi Jan. 14, 2020, 9:29 a.m. UTC | #8
On 2020/1/14 5:14 PM, Alex Shi wrote:
> Anyway, although I didn't reproduce the bug, I found a bug in my
> debug function:
> 	VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
> 
> If !page->mem_cgroup, this BUG could be triggered, so it seems to be a
> bug in the debug function, not a real issue. The 9th patch should be
> replaced by the following new patch.

If !page->mem_cgroup, it means the page is on the root_mem_cgroup, so the
lruvec's memcg is root_mem_cgroup, not NULL. That triggers the issue.
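
In other words (a hypothetical helper just to restate the invariant;
not part of this series):

static struct mem_cgroup *page_memcg_for_lru(struct page *page)
{
	/* an uncharged page is accounted to the root memcg */
	return page->mem_cgroup ? page->mem_cgroup : root_mem_cgroup;
}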


Hi Johannes,

So I have a question about lock_page_memcg in this scenario: should
we lock the page to root_mem_cgroup? Or is there no need, as no tasks
move to a leaf memcg from the root?

Thanks
Alex