
[v3,5/9] mm/workingset: use the node counter if memcg is the root memcg

Message ID 1584423717-3440-6-git-send-email-iamjoonsoo.kim@lge.com (mailing list archive)
State New, archived
Series workingset protection/detection on the anonymous LRU list

Commit Message

Joonsoo Kim March 17, 2020, 5:41 a.m. UTC
From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

In the following patch, workingset detection is implemented for the
swap cache. The swap cache's nodes are usually allocated by kswapd
and are not charged to a kmemcg since they come from a kernel thread.
So the swap cache's shadow nodes are managed by the node list of the
list_lru rather than the memcg-specific one.

When shadow nodes are counted for the root memcg during slab reclaim,
the count returned is the one for the node list of the list_lru,
since the root memcg has kmem_cache_id -1.

However, the number of pages on the LRU is calculated for the
specific memcg, so the two do not match. As a result, the shadow node
target never grows large enough and workingset detection cannot work
correctly. This patch fixes the bug by checking whether the memcg is
the root memcg. If it is, the system-wide LRU is used instead of the
memcg-specific one to calculate the proper shadow node target, so
that the number of shadow nodes can grow as expected.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/workingset.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Johannes Weiner March 18, 2020, 7:18 p.m. UTC | #1
On Tue, Mar 17, 2020 at 02:41:53PM +0900, js1304@gmail.com wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> In the following patch, workingset detection is implemented for the
> swap cache. The swap cache's nodes are usually allocated by kswapd
> and are not charged to a kmemcg since they come from a kernel thread.
> So the swap cache's shadow nodes are managed by the node list of the
> list_lru rather than the memcg-specific one.
> 
> When shadow nodes are counted for the root memcg during slab reclaim,
> the count returned is the one for the node list of the list_lru,
> since the root memcg has kmem_cache_id -1.
> 
> However, the number of pages on the LRU is calculated for the
> specific memcg, so the two do not match. As a result, the shadow node
> target never grows large enough and workingset detection cannot work
> correctly. This patch fixes the bug by checking whether the memcg is
> the root memcg. If it is, the system-wide LRU is used instead of the
> memcg-specific one to calculate the proper shadow node target, so
> that the number of shadow nodes can grow as expected.
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>  mm/workingset.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 5fb8f85..a9f474a 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -468,7 +468,13 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>  	 * PAGE_SIZE / xa_nodes / node_entries * 8 / PAGE_SIZE
>  	 */
>  #ifdef CONFIG_MEMCG
> -	if (sc->memcg) {
> +	/*
> +	 * Kernel allocation on root memcg isn't regarded as allocation of
> +	 * specific memcg. So, if sc->memcg is the root memcg, we need to
> +	 * use the count for the node rather than one for the specific
> +	 * memcg.
> +	 */
> +	if (sc->memcg && !mem_cgroup_is_root(sc->memcg)) {

This is no good, unfortunately.

It allows the root cgroup's shadows to grow way too large. Consider a
large memory system where several workloads run in containers and only
some host software runs in the root, yet that tiny root group will
grow shadow entries in proportion to the entire RAM.

IMO, we have some choices here:

1. We say the swapcache is a shared system facility and its memory is
not accounted to anyone. In that case, we should either
   1a. Reclaim them to a fixed threshold, regardless of cgroup, or
   1b. Not reclaim them at all. Or
2. We account all nodes to the groups for which they are allocated.
   Essentially like this:

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8e7ce9a9bc5e..d0d0dcc357fb 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -125,6 +125,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
 	page_ref_add(page, nr);
 	SetPageSwapCache(page);
 
+	memalloc_use_memcg(page_memcg(page));
 	do {
 		xas_lock_irq(&xas);
 		xas_create_range(&xas);
@@ -142,6 +143,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
 unlock:
 		xas_unlock_irq(&xas);
 	} while (xas_nomem(&xas, gfp));
+	memalloc_unuse_memcg();
 
 	if (!xas_error(&xas))
 		return 0;
@@ -605,7 +607,8 @@ int init_swap_address_space(unsigned int type, unsigned long nr_pages)
 		return -ENOMEM;
 	for (i = 0; i < nr; i++) {
 		space = spaces + i;
-		xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
+		xa_init_flags(&space->i_pages,
+			      XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
 		atomic_set(&space->i_mmap_writable, 0);
 		space->a_ops = &swap_aops;
 		/* swap cache doesn't use writeback related tags */

(A reclaimer has PF_MEMALLOC set, so we'll bypass the limit when
recursing into charging the node.)
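
For reference, that bypass in try_charge() looks roughly like this -
paraphrased from memory, not a verbatim quote of mm/memcontrol.c:

	/*
	 * Reclaimers carry PF_MEMALLOC, so when kswapd recurses into
	 * charging the xa_node it skips the limit check and force
	 * charges instead of failing or re-entering reclaim.
	 */
	if (unlikely(current->flags & PF_MEMALLOC))
		goto force;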

I'm leaning more toward 1b, actually. The shadow shrinker was written
because the combined address space of files on the filesystem can
easily be in the terabytes, and practically unbounded with sparse
files. The shadow shrinker is there to keep users from DoSing the
system with shadow entries for files.

However, the swap address space is bounded by a privileged user. And
the size is usually in the GB range. On my system, radix_tree_node is
~583 bytes, so for a 16G swapfile, the swapcache xarray should max
out below 40M (36M worth of leaf nodes, plus some intermediate nodes).
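
Back-of-the-envelope, assuming 4k pages and 64 slots per xa_node:

	16G of swap / 4k pages    = 4,194,304 swapcache entries
	4,194,304 / 64 slots      = 65,536 leaf nodes
	65,536 * ~583 bytes       = ~36M of leaf nodes
	plus ~1/64 of that again for the interior nodes.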

It doesn't seem worth messing with the shrinker at all for these.
Joonsoo Kim March 19, 2020, 6:20 a.m. UTC | #2
On Thu, Mar 19, 2020 at 4:18 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Mar 17, 2020 at 02:41:53PM +0900, js1304@gmail.com wrote:
> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >
> > In the following patch, workingset detection is implemented for the
> > swap cache. The swap cache's nodes are usually allocated by kswapd
> > and are not charged to a kmemcg since they come from a kernel thread.
> > So the swap cache's shadow nodes are managed by the node list of the
> > list_lru rather than the memcg-specific one.
> >
> > When shadow nodes are counted for the root memcg during slab reclaim,
> > the count returned is the one for the node list of the list_lru,
> > since the root memcg has kmem_cache_id -1.
> >
> > However, the number of pages on the LRU is calculated for the
> > specific memcg, so the two do not match. As a result, the shadow node
> > target never grows large enough and workingset detection cannot work
> > correctly. This patch fixes the bug by checking whether the memcg is
> > the root memcg. If it is, the system-wide LRU is used instead of the
> > memcg-specific one to calculate the proper shadow node target, so
> > that the number of shadow nodes can grow as expected.
> >
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > ---
> >  mm/workingset.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/workingset.c b/mm/workingset.c
> > index 5fb8f85..a9f474a 100644
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -468,7 +468,13 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> >        * PAGE_SIZE / xa_nodes / node_entries * 8 / PAGE_SIZE
> >        */
> >  #ifdef CONFIG_MEMCG
> > -     if (sc->memcg) {
> > +     /*
> > +      * Kernel allocation on root memcg isn't regarded as allocation of
> > +      * specific memcg. So, if sc->memcg is the root memcg, we need to
> > +      * use the count for the node rather than one for the specific
> > +      * memcg.
> > +      */
> > +     if (sc->memcg && !mem_cgroup_is_root(sc->memcg)) {
>
> This is no good, unfortunately.
>
> It allows the root cgroup's shadows to grow way too large. Consider a
> large memory system where several workloads run in containers and only
> some host software runs in the root, yet that tiny root group will
> grow shadow entries in proportion to the entire RAM.

Okay.

> IMO, we have some choices here:
>
> 1. We say the swapcache is a shared system facility and its memory is
> not accounted to anyone. In that case, we should either
>    1a. Reclaim them to a fixed threshold, regardless of cgroup, or
>    1b. Not reclaim them at all. Or
> 2. We account all nodes to the groups for which they are allocated.
>    Essentially like this:
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 8e7ce9a9bc5e..d0d0dcc357fb 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -125,6 +125,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
>         page_ref_add(page, nr);
>         SetPageSwapCache(page);
>
> +       memalloc_use_memcg(page_memcg(page));
>         do {
>                 xas_lock_irq(&xas);
>                 xas_create_range(&xas);
> @@ -142,6 +143,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
>  unlock:
>                 xas_unlock_irq(&xas);
>         } while (xas_nomem(&xas, gfp));
> +       memalloc_unuse_memcg();
>
>         if (!xas_error(&xas))
>                 return 0;
> @@ -605,7 +607,8 @@ int init_swap_address_space(unsigned int type, unsigned long nr_pages)
>                 return -ENOMEM;
>         for (i = 0; i < nr; i++) {
>                 space = spaces + i;
> -               xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
> +               xa_init_flags(&space->i_pages,
> +                             XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
>                 atomic_set(&space->i_mmap_writable, 0);
>                 space->a_ops = &swap_aops;
>                 /* swap cache doesn't use writeback related tags */
>
> (A reclaimer has PF_MEMALLOC set, so we'll bypass the limit when
> recursing into charging the node.)
>
> I'm leaning more toward 1b, actually. The shadow shrinker was written
> because the combined address space of files on the filesystem can
> easily be in the terabytes, and practically unbounded with sparse
> files. The shadow shrinker is there to keep users from DoSing the
> system with shadow entries for files.
>
> However, the swap address space is bounded by a privileged user. And
> the size is usually in the GB range. On my system, radix_tree_node is
> ~583 bytes, so for a 16G swapfile, the swapcache xarray should max
> out below 40M (36M worth of leaf nodes, plus some intermediate nodes).
>
> It doesn't seem worth messing with the shrinker at all for these.

40M for 16G of swap, 0.25% of the used swap space, looks okay to me.
I will rework the patch in that way.

Thanks.

Patch

diff --git a/mm/workingset.c b/mm/workingset.c
index 5fb8f85..a9f474a 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -468,7 +468,13 @@  static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 	 * PAGE_SIZE / xa_nodes / node_entries * 8 / PAGE_SIZE
 	 */
 #ifdef CONFIG_MEMCG
-	if (sc->memcg) {
+	/*
+	 * Kernel allocation on root memcg isn't regarded as allocation of
+	 * specific memcg. So, if sc->memcg is the root memcg, we need to
+	 * use the count for the node rather than one for the specific
+	 * memcg.
+	 */
+	if (sc->memcg && !mem_cgroup_is_root(sc->memcg)) {
 		struct lruvec *lruvec;
 		int i;