[1/7] mm/swap, workingset: make anon workingset nodes memcg aware

Message ID 20240624175313.47329-2-ryncsn@gmail.com (mailing list archive)
State New
Series Split list_lru lock into per-cgroup scope

Commit Message

Kairui Song June 24, 2024, 5:53 p.m. UTC
From: Kairui Song <kasong@tencent.com>

Currently, the (shadow) nodes of the swap cache are not accounted to
their corresponding memory cgroup; instead, they are all accounted
to the root cgroup. This leads to inaccurate accounting and
ineffective reclaiming.

This issue is similar to commit 7b785645e8f1 ("mm: fix page cache
convergence regression"), where page cache shadow nodes were incorrectly
accounted. That was due to the accidental dropping of the accounting
flag during the XArray conversion in commit a28334862993
("page cache: Finish XArray conversion").

However, this issue has a different cause: swap cache shadow nodes were
never accounted, even before the XArray conversion, since they did not
exist until commit 3852f6768ede ("mm/swapcache: support to handle the
shadow entries"), which landed years after the XArray conversion. Without
shadow nodes, swap cache nodes can only use a very small amount of memory,
so reclaiming them was not very important.

Now that shadow nodes exist, however, a cgroup that swaps out a large
amount of memory can leave behind shadow nodes that take up a significant
amount of memory.

This can be easily fixed by adding the proper accounting flag and LRU setter.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap_state.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Muchun Song July 17, 2024, 3:25 a.m. UTC | #1
> On Jun 25, 2024, at 01:53, Kairui Song <ryncsn@gmail.com> wrote:
> 
> From: Kairui Song <kasong@tencent.com>
> 
> Currently, the (shadow) nodes of the swap cache are not accounted to
> their corresponding memory cgroup; instead, they are all accounted
> to the root cgroup. This leads to inaccurate accounting and
> ineffective reclaiming.
> 
> This issue is similar to commit 7b785645e8f1 ("mm: fix page cache
> convergence regression"), where page cache shadow nodes were incorrectly
> accounted. That was due to the accidental dropping of the accounting
> flag during the XArray conversion in commit a28334862993
> ("page cache: Finish XArray conversion").
> 
> However, this issue has a different cause: swap cache shadow nodes were
> never accounted, even before the XArray conversion, since they did not
> exist until commit 3852f6768ede ("mm/swapcache: support to handle the
> shadow entries"), which landed years after the XArray conversion. Without
> shadow nodes, swap cache nodes can only use a very small amount of memory,
> so reclaiming them was not very important.
> 
> Now that shadow nodes exist, however, a cgroup that swaps out a large
> amount of memory can leave behind shadow nodes that take up a significant
> amount of memory.
> 
> This can be easily fixed by adding the proper accounting flag and LRU setter.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>

I haven't looked at the details of this patch yet, but I think it is not
related to this series; it could be sent as an individual patch. (If it is
a real problem, I think it will be quickly and easily merged.)

Thanks.
Kairui Song July 18, 2024, 11:33 a.m. UTC | #2
On Wed, Jul 17, 2024 at 11:25 AM Muchun Song <muchun.song@linux.dev> wrote:
> > On Jun 25, 2024, at 01:53, Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Currently, the (shadow) nodes of the swap cache are not accounted to
> > their corresponding memory cgroup; instead, they are all accounted
> > to the root cgroup. This leads to inaccurate accounting and
> > ineffective reclaiming.
> >
> > This issue is similar to commit 7b785645e8f1 ("mm: fix page cache
> > convergence regression"), where page cache shadow nodes were incorrectly
> > accounted. That was due to the accidental dropping of the accounting
> > flag during the XArray conversion in commit a28334862993
> > ("page cache: Finish XArray conversion").
> >
> > However, this issue has a different cause: swap cache shadow nodes were
> > never accounted, even before the XArray conversion, since they did not
> > exist until commit 3852f6768ede ("mm/swapcache: support to handle the
> > shadow entries"), which landed years after the XArray conversion. Without
> > shadow nodes, swap cache nodes can only use a very small amount of memory,
> > so reclaiming them was not very important.
> >
> > Now that shadow nodes exist, however, a cgroup that swaps out a large
> > amount of memory can leave behind shadow nodes that take up a significant
> > amount of memory.
> >
> > This can be easily fixed by adding the proper accounting flag and LRU setter.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
>
> I haven't looked at the details of this patch yet, but I think it is not
> related to this series; it could be sent as an individual patch. (If it is
> a real problem, I think it will be quickly and easily merged.)
>
> Thanks.
>

The gain of this patch is not as big on its own as it is as part of the
series, but yes, it can be sent independently.
Shakeel Butt July 19, 2024, 1:34 a.m. UTC | #3
On Tue, Jun 25, 2024 at 01:53:07AM GMT, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Currently, the (shadow) nodes of the swap cache are not accounted to
> their corresponding memory cgroup; instead, they are all accounted
> to the root cgroup. This leads to inaccurate accounting and
> ineffective reclaiming.
> 
> This issue is similar to commit 7b785645e8f1 ("mm: fix page cache
> convergence regression"), where page cache shadow nodes were incorrectly
> accounted. That was due to the accidental dropping of the accounting
> flag during the XArray conversion in commit a28334862993
> ("page cache: Finish XArray conversion").
> 
> However, this issue has a different cause: swap cache shadow nodes were
> never accounted, even before the XArray conversion, since they did not
> exist until commit 3852f6768ede ("mm/swapcache: support to handle the
> shadow entries"), which landed years after the XArray conversion. Without
> shadow nodes, swap cache nodes can only use a very small amount of memory,
> so reclaiming them was not very important.
> 
> Now that shadow nodes exist, however, a cgroup that swaps out a large
> amount of memory can leave behind shadow nodes that take up a significant
> amount of memory.
> 
> This can be easily fixed by adding the proper accounting flag and LRU setter.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>

As Muchun said, please send this patch separately. However, as I think
more about this patch, I believe it is incomplete; the full solution may
be much more involved and might not be worth it.

One of the differences between the file page cache and the swap page cache
is the context in which the underlying xarray node can be allocated. For
file pages, such allocations happen in the context of the process/cgroup
owning the file pages, and thus the memcg of the current task is used for
charging. However, xarray node allocations happen when a page is added to
the swap cache, which often happens during reclaim, and reclaim can happen
in any context, i.e. a kernel thread, an unrelated process/cgroup, etc. So
we may charge an unrelated memcg for these nodes.

Now you may argue that we can use the memcg of the page which is being
swapped out, but then you may have an xarray node containing pointers to
pages (or shadows) of different memcgs. Who should be charged?

BTW, a filesystem shared between multiple cgroups can face this issue as
well, but users have more control over a shared filesystem than over the
shared swap address space.

Shakeel
Patch

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 642c30d8376c..68afaaf1c09b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -95,6 +95,7 @@ int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
 	void *old;
 
 	xas_set_update(&xas, workingset_update_node);
+	xas_set_lru(&xas, &shadow_nodes);
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
@@ -714,7 +715,7 @@ int init_swap_address_space(unsigned int type, unsigned long nr_pages)
 		return -ENOMEM;
 	for (i = 0; i < nr; i++) {
 		space = spaces + i;
-		xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
+		xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
 		atomic_set(&space->i_mmap_writable, 0);
 		space->a_ops = &swap_aops;
 		/* swap cache doesn't use writeback related tags */