
Partially revert "mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones"

Message ID 20190817004726.2530670-1-guro@fb.com (mailing list archive)
State New, archived

Commit Message

Roman Gushchin Aug. 17, 2019, 12:47 a.m. UTC
Commit 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync
with the hierarchical ones") effectively decreased the precision of
per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters.

That's good for displaying in memory.stat, but it introduces a serious
regression into the reclaim process.

One issue I've discovered and debugged is the following:
lruvec_lru_size() can return 0 instead of the actual number of pages
in the lru list, preventing the kernel from reclaiming the last
remaining pages. The result is yet more flooding of dying memory
cgroups. The opposite also happens: scanning an empty lru list
is a waste of cpu time.

Also, inactive_list_is_low() can return incorrect values, preventing
the active lru from being scanned and freed. It can fail both because
the size of active and inactive lists are inaccurate, and because
the number of workingset refaults isn't precise. In other words,
the result is pretty random.

I'm not sure whether using the approximate number of slab pages in
count_shadow_number() is acceptable, but the issues described above
are enough to partially revert the patch.

Let's keep the per-memcg vmstats_local counters batched (they are only
used for displaying stats to userspace), but keep the lruvec stats
precise. This change fixes the dead memcg flooding on my setup.

Fixes: 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones")
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)
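For context, the precision problem described in the commit message can be illustrated with a toy model of the per-cpu batching scheme. This is illustrative user-space C, not the kernel code: all struct and function names are hypothetical, and MEMCG_CHARGE_BATCH merely mirrors the kernel's per-cpu batch size of 32. A reader of the batched "local" counter can observe 0 even though the precise value is non-zero, which is exactly how lruvec_lru_size() can report an empty LRU list.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS 4               /* hypothetical cpu count for the toy model */
#define MEMCG_CHARGE_BATCH 32   /* mirrors the kernel's per-cpu batch size */

/* Toy model of one per-memcg-per-node counter, tracking both the
 * precise "local" scheme (pre-766a4c19d880) and the batched one. */
struct toy_counter {
    long percpu[NR_CPUS];  /* per-cpu deltas, flushed once |x| > batch */
    long local;            /* precise local counter: updated on every change */
    long batched_local;    /* batched local counter: updated only on flush */
};

/* Rough analogue of __mod_lruvec_state(): add val on the given cpu. */
void toy_mod(struct toy_counter *c, int cpu, long val)
{
    long x;

    c->local += val;                 /* precise scheme */
    x = c->percpu[cpu] + val;
    if (labs(x) > MEMCG_CHARGE_BATCH) {
        c->batched_local += x;       /* batched scheme: synced on flush only */
        x = 0;
    }
    c->percpu[cpu] = x;
}
```

With fewer than MEMCG_CHARGE_BATCH pages accumulated on a given cpu, the batched counter never flushes, so a small but non-empty LRU list reads as size 0 to anyone consulting the batched counter.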

Comments

Yafang Shao Aug. 17, 2019, 3:33 a.m. UTC | #1
On Sat, Aug 17, 2019 at 8:47 AM Roman Gushchin <guro@fb.com> wrote:
>
> Commit 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync
> with the hierarchical ones") effectively decreased the precision of
> per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters.
>
> That's good for displaying in memory.stat, but brings a serious regression
> into the reclaim process.
>
> One issue I've discovered and debugged is the following:
> lruvec_lru_size() can return 0 instead of the actual number of pages
> in the lru list, preventing the kernel to reclaim last remaining
> pages. Result is yet another dying memory cgroups flooding.
> The opposite is also happening: scanning an empty lru list
> is the waste of cpu time.
>
> Also, inactive_list_is_low() can return incorrect values, preventing
> the active lru from being scanned and freed. It can fail both because
> the size of active and inactive lists are inaccurate, and because
> the number of workingset refaults isn't precise. In other words,
> the result is pretty random.
>
> I'm not sure, if using the approximate number of slab pages in
> count_shadow_number() is acceptable, but issues described above
> are enough to partially revert the patch.
>
> Let's keep per-memcg vmstat_local batched (they are only used for
> displaying stats to the userspace), but keep lruvec stats precise.
> This change fixes the dead memcg flooding on my setup.
>

That will cause some confusion if the local counters are not in
sync with the hierarchical ones
(someone may suspect that something is leaking).
If we have to do it like this, I think we had better document this behavior.

> Fixes: 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones")
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Yafang Shao <laoar.shao@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/memcontrol.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 249187907339..3429340adb56 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -746,15 +746,13 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
>         /* Update memcg */
>         __mod_memcg_state(memcg, idx, val);
>
> +       /* Update lruvec */
> +       __this_cpu_add(pn->lruvec_stat_local->count[idx], val);
> +
>         x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
>         if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
>                 struct mem_cgroup_per_node *pi;
>
> -               /*
> -                * Batch local counters to keep them in sync with
> -                * the hierarchical ones.
> -                */
> -               __this_cpu_add(pn->lruvec_stat_local->count[idx], x);
>                 for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id))
>                         atomic_long_add(x, &pi->lruvec_stat[idx]);
>                 x = 0;
> --
> 2.21.0
>
Greg KH Aug. 17, 2019, 6:36 a.m. UTC | #2
On Fri, Aug 16, 2019 at 05:47:26PM -0700, Roman Gushchin wrote:
> Commit 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync
> with the hierarchical ones") effectively decreased the precision of
> per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters.
> 
> That's good for displaying in memory.stat, but brings a serious regression
> into the reclaim process.
> 
> One issue I've discovered and debugged is the following:
> lruvec_lru_size() can return 0 instead of the actual number of pages
> in the lru list, preventing the kernel to reclaim last remaining
> pages. Result is yet another dying memory cgroups flooding.
> The opposite is also happening: scanning an empty lru list
> is the waste of cpu time.
> 
> Also, inactive_list_is_low() can return incorrect values, preventing
> the active lru from being scanned and freed. It can fail both because
> the size of active and inactive lists are inaccurate, and because
> the number of workingset refaults isn't precise. In other words,
> the result is pretty random.
> 
> I'm not sure, if using the approximate number of slab pages in
> count_shadow_number() is acceptable, but issues described above
> are enough to partially revert the patch.
> 
> Let's keep per-memcg vmstat_local batched (they are only used for
> displaying stats to the userspace), but keep lruvec stats precise.
> This change fixes the dead memcg flooding on my setup.
> 
> Fixes: 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones")
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Yafang Shao <laoar.shao@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/memcontrol.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>
Roman Gushchin Aug. 17, 2019, 7:14 p.m. UTC | #3
On Sat, Aug 17, 2019 at 11:33:57AM +0800, Yafang Shao wrote:
> On Sat, Aug 17, 2019 at 8:47 AM Roman Gushchin <guro@fb.com> wrote:
> >
> > Commit 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync
> > with the hierarchical ones") effectively decreased the precision of
> > per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters.
> >
> > That's good for displaying in memory.stat, but brings a serious regression
> > into the reclaim process.
> >
> > One issue I've discovered and debugged is the following:
> > lruvec_lru_size() can return 0 instead of the actual number of pages
> > in the lru list, preventing the kernel to reclaim last remaining
> > pages. Result is yet another dying memory cgroups flooding.
> > The opposite is also happening: scanning an empty lru list
> > is the waste of cpu time.
> >
> > Also, inactive_list_is_low() can return incorrect values, preventing
> > the active lru from being scanned and freed. It can fail both because
> > the size of active and inactive lists are inaccurate, and because
> > the number of workingset refaults isn't precise. In other words,
> > the result is pretty random.
> >
> > I'm not sure, if using the approximate number of slab pages in
> > count_shadow_number() is acceptable, but issues described above
> > are enough to partially revert the patch.
> >
> > Let's keep per-memcg vmstat_local batched (they are only used for
> > displaying stats to the userspace), but keep lruvec stats precise.
> > This change fixes the dead memcg flooding on my setup.
> >
> 
> That will make some misunderstanding if the local counters are not in
> sync with the hierarchical ones
> (someone may doubt whether there're something leaked.).

Sure, but the actual leakage is a much more serious issue.

> If we have to do it like this, I think we should better document this behavior.

LRU size calculations can be done using per-zone counters, which is
actually cheaper, because the number of zones is usually smaller than
the number of cpus. I'll send a corresponding patch on Monday.
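For illustration, a per-zone lruvec_lru_size() along these lines might be sketched as follows. This is a user-space toy, not the actual follow-up patch: the struct, the function, and the zone count are all hypothetical.

```c
#include <assert.h>

#define MAX_NR_ZONES 4  /* hypothetical; the kernel's value is config-dependent */

/* Toy model: per-zone LRU sizes, kept precise. There are only a few
 * zones per node, so summing them is cheaper than folding the
 * per-cpu deltas of a percpu counter. */
struct toy_lruvec {
    long lru_zone_size[MAX_NR_ZONES];
};

/* Hypothetical lruvec_lru_size(): sum the per-zone counters for all
 * zones up to and including zone_idx. */
long toy_lruvec_lru_size(const struct toy_lruvec *lv, int zone_idx)
{
    long size = 0;

    for (int zid = 0; zid <= zone_idx; zid++)
        size += lv->lru_zone_size[zid];
    return size;
}
```

Because each per-zone counter is updated exactly, the sum is exact, so a non-empty LRU can never be mistaken for an empty one.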

Maybe other use cases can also be converted?

Thanks!

> 
> > Fixes: 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones")
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Cc: Yafang Shao <laoar.shao@gmail.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  mm/memcontrol.c | 8 +++-----
> >  1 file changed, 3 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 249187907339..3429340adb56 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -746,15 +746,13 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> >         /* Update memcg */
> >         __mod_memcg_state(memcg, idx, val);
> >
> > +       /* Update lruvec */
> > +       __this_cpu_add(pn->lruvec_stat_local->count[idx], val);
> > +
> >         x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
> >         if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
> >                 struct mem_cgroup_per_node *pi;
> >
> > -               /*
> > -                * Batch local counters to keep them in sync with
> > -                * the hierarchical ones.
> > -                */
> > -               __this_cpu_add(pn->lruvec_stat_local->count[idx], x);
> >                 for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id))
> >                         atomic_long_add(x, &pi->lruvec_stat[idx]);
> >                 x = 0;
> > --
> > 2.21.0
> >
Roman Gushchin Aug. 17, 2019, 7:15 p.m. UTC | #4
On Sat, Aug 17, 2019 at 08:36:16AM +0200, Greg KH wrote:
> On Fri, Aug 16, 2019 at 05:47:26PM -0700, Roman Gushchin wrote:
> > Commit 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync
> > with the hierarchical ones") effectively decreased the precision of
> > per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters.
> > 
> > That's good for displaying in memory.stat, but brings a serious regression
> > into the reclaim process.
> > 
> > One issue I've discovered and debugged is the following:
> > lruvec_lru_size() can return 0 instead of the actual number of pages
> > in the lru list, preventing the kernel to reclaim last remaining
> > pages. Result is yet another dying memory cgroups flooding.
> > The opposite is also happening: scanning an empty lru list
> > is the waste of cpu time.
> > 
> > Also, inactive_list_is_low() can return incorrect values, preventing
> > the active lru from being scanned and freed. It can fail both because
> > the size of active and inactive lists are inaccurate, and because
> > the number of workingset refaults isn't precise. In other words,
> > the result is pretty random.
> > 
> > I'm not sure, if using the approximate number of slab pages in
> > count_shadow_number() is acceptable, but issues described above
> > are enough to partially revert the patch.
> > 
> > Let's keep per-memcg vmstat_local batched (they are only used for
> > displaying stats to the userspace), but keep lruvec stats precise.
> > This change fixes the dead memcg flooding on my setup.
> > 
> > Fixes: 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones")
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Cc: Yafang Shao <laoar.shao@gmail.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  mm/memcontrol.c | 8 +++-----
> >  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> <formletter>
> 
> This is not the correct way to submit patches for inclusion in the
> stable kernel tree.  Please read:
>     https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> for how to do this properly.

Oh, I'm sorry, will read and follow next time. Thanks!

> 
> </formletter>
Yafang Shao Aug. 18, 2019, 12:30 a.m. UTC | #5
On Sun, Aug 18, 2019 at 3:14 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Sat, Aug 17, 2019 at 11:33:57AM +0800, Yafang Shao wrote:
> > On Sat, Aug 17, 2019 at 8:47 AM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > Commit 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync
> > > with the hierarchical ones") effectively decreased the precision of
> > > per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters.
> > >
> > > That's good for displaying in memory.stat, but brings a serious regression
> > > into the reclaim process.
> > >
> > > One issue I've discovered and debugged is the following:
> > > lruvec_lru_size() can return 0 instead of the actual number of pages
> > > in the lru list, preventing the kernel to reclaim last remaining
> > > pages. Result is yet another dying memory cgroups flooding.
> > > The opposite is also happening: scanning an empty lru list
> > > is the waste of cpu time.
> > >
> > > Also, inactive_list_is_low() can return incorrect values, preventing
> > > the active lru from being scanned and freed. It can fail both because
> > > the size of active and inactive lists are inaccurate, and because
> > > the number of workingset refaults isn't precise. In other words,
> > > the result is pretty random.
> > >
> > > I'm not sure, if using the approximate number of slab pages in
> > > count_shadow_number() is acceptable, but issues described above
> > > are enough to partially revert the patch.
> > >
> > > Let's keep per-memcg vmstat_local batched (they are only used for
> > > displaying stats to the userspace), but keep lruvec stats precise.
> > > This change fixes the dead memcg flooding on my setup.
> > >
> >
> > That will make some misunderstanding if the local counters are not in
> > sync with the hierarchical ones
> > (someone may doubt whether there're something leaked.).
>
> Sure, but the actual leakage is a much more serious issue.
>
> > If we have to do it like this, I think we should better document this behavior.
>
> Lru size calculations can be done using per-zone counters, which is
> actually cheaper, because the number of zones is usually smaller than
> the number of cpus. I'll send a corresponding patch on Monday.
>

Looks like a good idea.

> Maybe other use cases can also be converted?

We'd better keep the behavior consistent across counters. I think you
can give it a try.

Thanks
Yafang

>
> Thanks!
>
> >
> > > Fixes: 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones")
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > Cc: Yafang Shao <laoar.shao@gmail.com>
> > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > ---
> > >  mm/memcontrol.c | 8 +++-----
> > >  1 file changed, 3 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 249187907339..3429340adb56 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -746,15 +746,13 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> > >         /* Update memcg */
> > >         __mod_memcg_state(memcg, idx, val);
> > >
> > > +       /* Update lruvec */
> > > +       __this_cpu_add(pn->lruvec_stat_local->count[idx], val);
> > > +
> > >         x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
> > >         if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
> > >                 struct mem_cgroup_per_node *pi;
> > >
> > > -               /*
> > > -                * Batch local counters to keep them in sync with
> > > -                * the hierarchical ones.
> > > -                */
> > > -               __this_cpu_add(pn->lruvec_stat_local->count[idx], x);
> > >                 for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id))
> > >                         atomic_long_add(x, &pi->lruvec_stat[idx]);
> > >                 x = 0;
> > > --
> > > 2.21.0
> > >
Roman Gushchin Aug. 19, 2019, 9:20 p.m. UTC | #6
On Sun, Aug 18, 2019 at 08:30:15AM +0800, Yafang Shao wrote:
> On Sun, Aug 18, 2019 at 3:14 AM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Sat, Aug 17, 2019 at 11:33:57AM +0800, Yafang Shao wrote:
> > > On Sat, Aug 17, 2019 at 8:47 AM Roman Gushchin <guro@fb.com> wrote:
> > > >
> > > > Commit 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync
> > > > with the hierarchical ones") effectively decreased the precision of
> > > > per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters.
> > > >
> > > > That's good for displaying in memory.stat, but brings a serious regression
> > > > into the reclaim process.
> > > >
> > > > One issue I've discovered and debugged is the following:
> > > > lruvec_lru_size() can return 0 instead of the actual number of pages
> > > > in the lru list, preventing the kernel to reclaim last remaining
> > > > pages. Result is yet another dying memory cgroups flooding.
> > > > The opposite is also happening: scanning an empty lru list
> > > > is the waste of cpu time.
> > > >
> > > > Also, inactive_list_is_low() can return incorrect values, preventing
> > > > the active lru from being scanned and freed. It can fail both because
> > > > the size of active and inactive lists are inaccurate, and because
> > > > the number of workingset refaults isn't precise. In other words,
> > > > the result is pretty random.
> > > >
> > > > I'm not sure, if using the approximate number of slab pages in
> > > > count_shadow_number() is acceptable, but issues described above
> > > > are enough to partially revert the patch.
> > > >
> > > > Let's keep per-memcg vmstat_local batched (they are only used for
> > > > displaying stats to the userspace), but keep lruvec stats precise.
> > > > This change fixes the dead memcg flooding on my setup.
> > > >
> > >
> > > That will make some misunderstanding if the local counters are not in
> > > sync with the hierarchical ones
> > > (someone may doubt whether there're something leaked.).
> >
> > Sure, but the actual leakage is a much more serious issue.
> >
> > > If we have to do it like this, I think we should better document this behavior.
> >
> > Lru size calculations can be done using per-zone counters, which is
> > actually cheaper, because the number of zones is usually smaller than
> > the number of cpus. I'll send a corresponding patch on Monday.
> >
> 
> Looks like a good idea.
> 
> > Maybe other use cases can also be converted?
> 
> We'd better keep the behavior the same across counters. I think you
> can have a try.

As I said, consistency of counters is important, but not nearly as important
as the real behavior of the system. Especially because we talk about
per-node memcg statistics, which I believe is mostly used for debugging.

So for now I think the right thing to do is to revert the change to fix
the memory reclaim process. And then we can discuss how to get counters
right.

Thanks!
Yafang Shao Aug. 20, 2019, 1:29 a.m. UTC | #7
On Tue, Aug 20, 2019 at 5:20 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Sun, Aug 18, 2019 at 08:30:15AM +0800, Yafang Shao wrote:
> > On Sun, Aug 18, 2019 at 3:14 AM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Sat, Aug 17, 2019 at 11:33:57AM +0800, Yafang Shao wrote:
> > > > On Sat, Aug 17, 2019 at 8:47 AM Roman Gushchin <guro@fb.com> wrote:
> > > > >
> > > > > Commit 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync
> > > > > with the hierarchical ones") effectively decreased the precision of
> > > > > per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters.
> > > > >
> > > > > That's good for displaying in memory.stat, but brings a serious regression
> > > > > into the reclaim process.
> > > > >
> > > > > One issue I've discovered and debugged is the following:
> > > > > lruvec_lru_size() can return 0 instead of the actual number of pages
> > > > > in the lru list, preventing the kernel to reclaim last remaining
> > > > > pages. Result is yet another dying memory cgroups flooding.
> > > > > The opposite is also happening: scanning an empty lru list
> > > > > is the waste of cpu time.
> > > > >
> > > > > Also, inactive_list_is_low() can return incorrect values, preventing
> > > > > the active lru from being scanned and freed. It can fail both because
> > > > > the size of active and inactive lists are inaccurate, and because
> > > > > the number of workingset refaults isn't precise. In other words,
> > > > > the result is pretty random.
> > > > >
> > > > > I'm not sure, if using the approximate number of slab pages in
> > > > > count_shadow_number() is acceptable, but issues described above
> > > > > are enough to partially revert the patch.
> > > > >
> > > > > Let's keep per-memcg vmstat_local batched (they are only used for
> > > > > displaying stats to the userspace), but keep lruvec stats precise.
> > > > > This change fixes the dead memcg flooding on my setup.
> > > > >
> > > >
> > > > That will make some misunderstanding if the local counters are not in
> > > > sync with the hierarchical ones
> > > > (someone may doubt whether there're something leaked.).
> > >
> > > Sure, but the actual leakage is a much more serious issue.
> > >
> > > > If we have to do it like this, I think we should better document this behavior.
> > >
> > > Lru size calculations can be done using per-zone counters, which is
> > > actually cheaper, because the number of zones is usually smaller than
> > > the number of cpus. I'll send a corresponding patch on Monday.
> > >
> >
> > Looks like a good idea.
> >
> > > Maybe other use cases can also be converted?
> >
> > We'd better keep the behavior the same across counters. I think you
> > can have a try.
>
> As I said, consistency of counters is important, but not nearly as important
> as the real behavior of the system. Especially because we talk about
> per-node memcg statistics, which I believe is mostly used for debugging.
>
> So for now I think the right thing to do is to revert the change to fix
> the memory reclaim process. And then we can discuss how to get counters
> right.
>

Sure.

Thanks
Yafang
Roman Gushchin Aug. 23, 2019, 10:33 p.m. UTC | #8
On Fri, Aug 16, 2019 at 05:47:26PM -0700, Roman Gushchin wrote:
> Commit 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync
> with the hierarchical ones") effectively decreased the precision of
> per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters.
> 
> That's good for displaying in memory.stat, but brings a serious regression
> into the reclaim process.
> 
> One issue I've discovered and debugged is the following:
> lruvec_lru_size() can return 0 instead of the actual number of pages
> in the lru list, preventing the kernel to reclaim last remaining
> pages. Result is yet another dying memory cgroups flooding.
> The opposite is also happening: scanning an empty lru list
> is the waste of cpu time.
> 
> Also, inactive_list_is_low() can return incorrect values, preventing
> the active lru from being scanned and freed. It can fail both because
> the size of active and inactive lists are inaccurate, and because
> the number of workingset refaults isn't precise. In other words,
> the result is pretty random.
> 
> I'm not sure, if using the approximate number of slab pages in
> count_shadow_number() is acceptable, but issues described above
> are enough to partially revert the patch.
> 
> Let's keep per-memcg vmstat_local batched (they are only used for
> displaying stats to the userspace), but keep lruvec stats precise.
> This change fixes the dead memcg flooding on my setup.
> 
> Fixes: 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones")
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Yafang Shao <laoar.shao@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>

Any other concerns/comments here?

I'd prefer to fix the regression: we're likely leaking several pages
of memory for each created and destroyed memory cgroup, plus all the
internal structures, which are measured in hundreds of KB.

Thanks!
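To put the precision loss in rough numbers, here is a back-of-the-envelope sketch. The helper names are made up, and it assumes 4 KB pages and the kernel's per-cpu batch size of 32; the point is only that a batched counter's worst-case error scales with the cpu count.

```c
#include <assert.h>

#define MEMCG_CHARGE_BATCH 32   /* per-cpu batch size used by memcg stats */
#define TOY_PAGE_SIZE 4096L     /* 4 KB pages, assumed for illustration */

/* Worst case, each cpu may hold up to MEMCG_CHARGE_BATCH unflushed
 * pages, all invisible to a reader of the batched counter. */
long max_hidden_pages(int nr_cpus)
{
    return (long)nr_cpus * MEMCG_CHARGE_BATCH;
}

long max_hidden_bytes(int nr_cpus)
{
    return max_hidden_pages(nr_cpus) * TOY_PAGE_SIZE;
}
```

On a 64-cpu machine, that is up to 2048 pages, roughly 8 MB, that a single batched counter can fail to report; more than enough for lruvec_lru_size() to miss the last pages of a small cgroup.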
Yafang Shao Aug. 24, 2019, 3:41 a.m. UTC | #9
On Sat, Aug 24, 2019 at 6:33 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Fri, Aug 16, 2019 at 05:47:26PM -0700, Roman Gushchin wrote:
> > Commit 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync
> > with the hierarchical ones") effectively decreased the precision of
> > per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters.
> >
> > That's good for displaying in memory.stat, but brings a serious regression
> > into the reclaim process.
> >
> > One issue I've discovered and debugged is the following:
> > lruvec_lru_size() can return 0 instead of the actual number of pages
> > in the lru list, preventing the kernel to reclaim last remaining
> > pages. Result is yet another dying memory cgroups flooding.
> > The opposite is also happening: scanning an empty lru list
> > is the waste of cpu time.
> >
> > Also, inactive_list_is_low() can return incorrect values, preventing
> > the active lru from being scanned and freed. It can fail both because
> > the size of active and inactive lists are inaccurate, and because
> > the number of workingset refaults isn't precise. In other words,
> > the result is pretty random.
> >
> > I'm not sure, if using the approximate number of slab pages in
> > count_shadow_number() is acceptable, but issues described above
> > are enough to partially revert the patch.
> >
> > Let's keep per-memcg vmstat_local batched (they are only used for
> > displaying stats to the userspace), but keep lruvec stats precise.
> > This change fixes the dead memcg flooding on my setup.
> >
> > Fixes: 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones")
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Cc: Yafang Shao <laoar.shao@gmail.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
>
> Any other concerns/comments here?
>
> I'd prefer to fix the regression: we're likely leaking several pages
> of memory for each created and destroyed memory cgroup. Plus
> all internal structures, which are measured in hundreds of kb.
>

Hi Roman,

Since it really does introduce issues, I agree with you that we should fix it first.

So for your fix,
Acked-by: Yafang Shao <laoar.shao@gmail.com>

Thanks
Yafang
Andrew Morton Aug. 24, 2019, 7:57 p.m. UTC | #10
On Sat, 17 Aug 2019 19:15:23 +0000 Roman Gushchin <guro@fb.com> wrote:

> > > Fixes: 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones")
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > Cc: Yafang Shao <laoar.shao@gmail.com>
> > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > ---
> > >  mm/memcontrol.c | 8 +++-----
> > >  1 file changed, 3 insertions(+), 5 deletions(-)
> > 
> > <formletter>
> > 
> > This is not the correct way to submit patches for inclusion in the
> > stable kernel tree.  Please read:
> >     https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> > for how to do this properly.
> 
> Oh, I'm sorry, will read and follow next time. Thanks!

766a4c19d880 is not present in 5.2 so no -stable backport is needed, yes?
Thomas Backlund Aug. 24, 2019, 8:23 p.m. UTC | #11
Den 24-08-2019 kl. 22:57, skrev Andrew Morton:
> On Sat, 17 Aug 2019 19:15:23 +0000 Roman Gushchin <guro@fb.com> wrote:
> 
>>>> Fixes: 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones")
>>>> Signed-off-by: Roman Gushchin <guro@fb.com>
>>>> Cc: Yafang Shao <laoar.shao@gmail.com>
>>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>>> ---
>>>>   mm/memcontrol.c | 8 +++-----
>>>>   1 file changed, 3 insertions(+), 5 deletions(-)
>>>
>>> <formletter>
>>>
>>> This is not the correct way to submit patches for inclusion in the
>>> stable kernel tree.  Please read:
>>>      https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
>>> for how to do this properly.
>>
>> Oh, I'm sorry, will read and follow next time. Thanks!
> 
> 766a4c19d880 is not present in 5.2 so no -stable backport is needed, yes?
> 

Unfortunately it got added in 5.2.7, so backport is needed.

--
Thomas
Michal Hocko Aug. 27, 2019, 2:10 p.m. UTC | #12
On Sat 24-08-19 23:23:07, Thomas Backlund wrote:
> Den 24-08-2019 kl. 22:57, skrev Andrew Morton:
> > On Sat, 17 Aug 2019 19:15:23 +0000 Roman Gushchin <guro@fb.com> wrote:
> > 
> > > > > Fixes: 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones")
> > > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > > Cc: Yafang Shao <laoar.shao@gmail.com>
> > > > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > > > ---
> > > > >   mm/memcontrol.c | 8 +++-----
> > > > >   1 file changed, 3 insertions(+), 5 deletions(-)
> > > > 
> > > > <formletter>
> > > > 
> > > > This is not the correct way to submit patches for inclusion in the
> > > > stable kernel tree.  Please read:
> > > >      https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> > > > for how to do this properly.
> > > 
> > > Oh, I'm sorry, will read and follow next time. Thanks!
> > 
> > 766a4c19d880 is not present in 5.2 so no -stable backport is needed, yes?
> > 
> 
> Unfortunately it got added in 5.2.7, so backport is needed.

yet another example of a patch not marked for stable being backported to
the stable tree. yay...
Greg KH Aug. 27, 2019, 5:06 p.m. UTC | #13
On Tue, Aug 27, 2019 at 04:10:16PM +0200, Michal Hocko wrote:
> On Sat 24-08-19 23:23:07, Thomas Backlund wrote:
> > Den 24-08-2019 kl. 22:57, skrev Andrew Morton:
> > > On Sat, 17 Aug 2019 19:15:23 +0000 Roman Gushchin <guro@fb.com> wrote:
> > > 
> > > > > > Fixes: 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones")
> > > > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > > > Cc: Yafang Shao <laoar.shao@gmail.com>
> > > > > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > > > > ---
> > > > > >   mm/memcontrol.c | 8 +++-----
> > > > > >   1 file changed, 3 insertions(+), 5 deletions(-)
> > > > > 
> > > > > <formletter>
> > > > > 
> > > > > This is not the correct way to submit patches for inclusion in the
> > > > > stable kernel tree.  Please read:
> > > > >      https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> > > > > for how to do this properly.
> > > > 
> > > > Oh, I'm sorry, will read and follow next time. Thanks!
> > > 
> > > 766a4c19d880 is not present in 5.2 so no -stable backport is needed, yes?
> > > 
> > 
> > Unfortunately it got added in 5.2.7, so backport is needed.
> 
> yet another example of patch not marked for stable backported to the
> stable tree. yay...

If you do not want autobot to pick up patches for specific
subsystems/files, just let us know and we will add them to the
blacklist.

thanks,

greg k-h
Michal Hocko Aug. 27, 2019, 5:39 p.m. UTC | #14
On Tue 27-08-19 19:06:18, Greg KH wrote:
> On Tue, Aug 27, 2019 at 04:10:16PM +0200, Michal Hocko wrote:
> > On Sat 24-08-19 23:23:07, Thomas Backlund wrote:
> > > Den 24-08-2019 kl. 22:57, skrev Andrew Morton:
> > > > On Sat, 17 Aug 2019 19:15:23 +0000 Roman Gushchin <guro@fb.com> wrote:
> > > > 
> > > > > > > Fixes: 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones")
> > > > > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > > > > Cc: Yafang Shao <laoar.shao@gmail.com>
> > > > > > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > > > > > ---
> > > > > > >   mm/memcontrol.c | 8 +++-----
> > > > > > >   1 file changed, 3 insertions(+), 5 deletions(-)
> > > > > > 
> > > > > > <formletter>
> > > > > > 
> > > > > > This is not the correct way to submit patches for inclusion in the
> > > > > > stable kernel tree.  Please read:
> > > > > >      https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> > > > > > for how to do this properly.
> > > > > 
> > > > > Oh, I'm sorry, will read and follow next time. Thanks!
> > > > 
> > > > 766a4c19d880 is not present in 5.2 so no -stable backport is needed, yes?
> > > > 
> > > 
> > > Unfortunately it got added in 5.2.7, so backport is needed.
> > 
> > yet another example of patch not marked for stable backported to the
> > stable tree. yay...
> 
> If you do not want autobot to pick up patches for specific
> subsystems/files, just let us know and we will add them to the
> blacklist.

I've done that on several occasions over the last year or so. I always get "yep,
we are going to blacklist it" and whoops, we are back there with
patches going to stable like nothing happened. We've been through this
discussion so many times that I am tired of it and, to be honest, I simply do
not care anymore.

I will keep encouraging people to mark patches for stable, but I do not
give a wee bit about any reports for the stable tree. Nor do I care
whether something made it in and whether we should be careful to mark another
patch for stable as a fixup, like this one.
Greg KH Aug. 27, 2019, 6:39 p.m. UTC | #15
On Tue, Aug 27, 2019 at 07:39:50PM +0200, Michal Hocko wrote:
> On Tue 27-08-19 19:06:18, Greg KH wrote:
> > On Tue, Aug 27, 2019 at 04:10:16PM +0200, Michal Hocko wrote:
> > > On Sat 24-08-19 23:23:07, Thomas Backlund wrote:
> > > > Den 24-08-2019 kl. 22:57, skrev Andrew Morton:
> > > > > On Sat, 17 Aug 2019 19:15:23 +0000 Roman Gushchin <guro@fb.com> wrote:
> > > > > 
> > > > > > > > Fixes: 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones")
> > > > > > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > > > > > Cc: Yafang Shao <laoar.shao@gmail.com>
> > > > > > > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > > > > > > ---
> > > > > > > >   mm/memcontrol.c | 8 +++-----
> > > > > > > >   1 file changed, 3 insertions(+), 5 deletions(-)
> > > > > > > 
> > > > > > > <formletter>
> > > > > > > 
> > > > > > > This is not the correct way to submit patches for inclusion in the
> > > > > > > stable kernel tree.  Please read:
> > > > > > >      https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> > > > > > > for how to do this properly.
> > > > > > 
> > > > > > Oh, I'm sorry, will read and follow next time. Thanks!
> > > > > 
> > > > > 766a4c19d880 is not present in 5.2 so no -stable backport is needed, yes?
> > > > > 
> > > > 
> > > > Unfortunately it got added in 5.2.7, so backport is needed.
> > > 
> > > yet another example of patch not marked for stable backported to the
> > > stable tree. yay...
> > 
> > If you do not want autobot to pick up patches for specific
> > subsystems/files, just let us know and we will add them to the
> > blacklist.
> 
> Done that on several occasions over last year and so. I always get "yep
> we are going to black list" and whoops and we are back there with
> patches going to stable like nothing happened. We've been through this
> discussion so many times I am tired of it and to be honest I simply do
> not care anymore.
> 
> I will keep encouraging people to mark patches for stable but I do not
> give a wee bit about any reports for the stable tree. Nor do I care
> whether something made it in and we should be careful to mark another
> patch for stable as a fixup like this one.

Sasha, can you add these to the blacklist for autosel?

thanks,

greg k-h
Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 249187907339..3429340adb56 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -746,15 +746,13 @@  void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	/* Update memcg */
 	__mod_memcg_state(memcg, idx, val);
 
+	/* Update lruvec */
+	__this_cpu_add(pn->lruvec_stat_local->count[idx], val);
+
 	x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
 	if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
 		struct mem_cgroup_per_node *pi;
 
-		/*
-		 * Batch local counters to keep them in sync with
-		 * the hierarchical ones.
-		 */
-		__this_cpu_add(pn->lruvec_stat_local->count[idx], x);
 		for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id))
 			atomic_long_add(x, &pi->lruvec_stat[idx]);
 		x = 0;