mm: vmscan: consistent update to pgsteal and pgscan

Message ID 20200507204913.18661-1-shakeelb@google.com
State New, archived
Series mm: vmscan: consistent update to pgsteal and pgscan

Commit Message

Shakeel Butt May 7, 2020, 8:49 p.m. UTC
One way to measure the efficiency of memory reclaim is to look at the
ratio (pgscan+pgrefill)/pgsteal. However, at the moment these stats are
not updated consistently at the system level and the ratio is not very
meaningful. The pgsteal and pgscan counters are updated only for global
reclaim, while pgrefill gets updated for global as well as cgroup
reclaim.

Please note that this difference exists only for the system level
vmstats. The cgroup stats returned by memory.stat are already
consistent: a cgroup's pgsteal contains the number of pages reclaimed
by both global and cgroup reclaim. So one way to get consistent system
level stats would be to read them from the root's memory.stat, but the
root cgroup does not expose that interface. Also, on !CONFIG_MEMCG
machines /proc/vmstat is the only way to get these stats. So, make
these stats consistent.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 mm/vmscan.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)
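
As a reference for the ratio mentioned in the commit message, here is a
minimal userspace sketch (not part of this patch) that computes
(pgscan+pgrefill)/pgsteal from /proc/vmstat, assuming the stock counter
names pgscan_kswapd, pgscan_direct, pgrefill, pgsteal_kswapd and
pgsteal_direct:

/* reclaim_ratio.c: sketch, compute (pgscan + pgrefill) / pgsteal
 * from /proc/vmstat.
 */
#include <stdio.h>
#include <string.h>

static unsigned long long vmstat_counter(const char *name)
{
	FILE *f = fopen("/proc/vmstat", "r");
	char key[64];
	unsigned long long val, ret = 0;

	if (!f)
		return 0;
	while (fscanf(f, "%63s %llu", key, &val) == 2) {
		if (!strcmp(key, name)) {
			ret = val;
			break;
		}
	}
	fclose(f);
	return ret;
}

int main(void)
{
	unsigned long long scan = vmstat_counter("pgscan_kswapd") +
				  vmstat_counter("pgscan_direct");
	unsigned long long refill = vmstat_counter("pgrefill");
	unsigned long long steal = vmstat_counter("pgsteal_kswapd") +
				   vmstat_counter("pgsteal_direct");

	if (steal)
		printf("(pgscan + pgrefill) / pgsteal = %.2f\n",
		       (double)(scan + refill) / steal);
	return 0;
}

Before this patch, such a computation mixes global-only pgscan/pgsteal
with a pgrefill that also includes cgroup reclaim, which is the
inconsistency being fixed here.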

Comments

Roman Gushchin May 7, 2020, 10:28 p.m. UTC | #1
On Thu, May 07, 2020 at 01:49:13PM -0700, Shakeel Butt wrote:
> One way to measure the efficiency of memory reclaim is to look at the
> ratio (pgscan+pgrefill)/pgsteal. However, at the moment these stats are
> not updated consistently at the system level and the ratio is not very
> meaningful. The pgsteal and pgscan counters are updated only for global
> reclaim, while pgrefill gets updated for global as well as cgroup
> reclaim.
> 
> Please note that this difference exists only for the system level
> vmstats. The cgroup stats returned by memory.stat are already
> consistent: a cgroup's pgsteal contains the number of pages reclaimed
> by both global and cgroup reclaim. So one way to get consistent system
> level stats would be to read them from the root's memory.stat, but the
> root cgroup does not expose that interface. Also, on !CONFIG_MEMCG
> machines /proc/vmstat is the only way to get these stats. So, make
> these stats consistent.
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>

Acked-by: Roman Gushchin <guro@fb.com>

Thanks!
Yafang Shao May 8, 2020, 10:34 a.m. UTC | #2
On Fri, May 8, 2020 at 4:49 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> One way to measure the efficiency of memory reclaim is to look at the
> ratio (pgscan+pgrefill)/pgsteal. However, at the moment these stats are
> not updated consistently at the system level and the ratio is not very
> meaningful. The pgsteal and pgscan counters are updated only for global
> reclaim, while pgrefill gets updated for global as well as cgroup
> reclaim.
>

Hi Shakeel,

We always use pgscan and pgsteal for monitoring system level memory
pressure, for example with sysstat(sar) or other monitoring tools.
But with this change these two counters include memcg pressure as well,
so it is no longer easy to tell whether pgscan and pgsteal activity is
caused by system level pressure or only by some specific memcgs
reaching their memory limit.

How about adding the cgroup_reclaim() check to pgrefill as well?
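
For illustration only, that suggestion would roughly correspond to the
following change in shrink_active_list(), assuming its PGREFILL
accounting mirrors the pattern in the hunks quoted below (a sketch, not
part of the posted patch):

	/* hypothetical: count PGREFILL only for global reclaim,
	 * mirroring the old pgscan/pgsteal behaviour
	 */
	if (!cgroup_reclaim(sc))
		__count_vm_events(PGREFILL, nr_scanned);
	__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);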

> Please note that this difference exists only for the system level
> vmstats. The cgroup stats returned by memory.stat are already
> consistent: a cgroup's pgsteal contains the number of pages reclaimed
> by both global and cgroup reclaim. So one way to get consistent system
> level stats would be to read them from the root's memory.stat, but the
> root cgroup does not expose that interface. Also, on !CONFIG_MEMCG
> machines /proc/vmstat is the only way to get these stats. So, make
> these stats consistent.
>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  mm/vmscan.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cc555903a332..51f7d1efc912 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1943,8 +1943,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>         reclaim_stat->recent_scanned[file] += nr_taken;
>
>         item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
> -       if (!cgroup_reclaim(sc))
> -               __count_vm_events(item, nr_scanned);
> +       __count_vm_events(item, nr_scanned);
>         __count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
>         spin_unlock_irq(&pgdat->lru_lock);
>
> @@ -1957,8 +1956,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>         spin_lock_irq(&pgdat->lru_lock);
>
>         item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
> -       if (!cgroup_reclaim(sc))
> -               __count_vm_events(item, nr_reclaimed);
> +       __count_vm_events(item, nr_reclaimed);
>         __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
>         reclaim_stat->recent_rotated[0] += stat.nr_activate[0];
>         reclaim_stat->recent_rotated[1] += stat.nr_activate[1];
> --
> 2.26.2.526.g744177e7f7-goog
>
>
Shakeel Butt May 8, 2020, 1:25 p.m. UTC | #3
On Fri, May 8, 2020 at 3:34 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Fri, May 8, 2020 at 4:49 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > One way to measure the efficiency of memory reclaim is to look at the
> > ratio (pgscan+pgrefill)/pgsteal. However, at the moment these stats are
> > not updated consistently at the system level and the ratio is not very
> > meaningful. The pgsteal and pgscan counters are updated only for global
> > reclaim, while pgrefill gets updated for global as well as cgroup
> > reclaim.
> >
>
> Hi Shakeel,
>
> We always use pgscan and pgsteal for monitoring system level memory
> pressure, for example with sysstat(sar) or other monitoring tools.

Don't you need pgrefill in addition to pgscan and pgsteal to get the
full picture of the reclaim activity?

> But with this change these two counters include memcg pressure as well,
> so it is no longer easy to tell whether pgscan and pgsteal activity is
> caused by system level pressure or only by some specific memcgs
> reaching their memory limit.
>
> How about adding the cgroup_reclaim() check to pgrefill as well?
>

I am looking for all the reclaim activity on the system. Adding
!cgroup_reclaim to pgrefill will skip the cgroup reclaim activity.
Maybe adding pgsteal_cgroup and pgscan_cgroup would be better.
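
For illustration, such a split might look roughly like the following on
top of this patch, assuming new (hypothetical) vm event counters such as
PGSCAN_CGROUP/PGSTEAL_CGROUP were added to enum vm_event_item; this is a
sketch, not something proposed in this series:

	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
	__count_vm_events(item, nr_scanned);
	/* hypothetical: additionally account cgroup-limit reclaim
	 * under its own system-wide counter
	 */
	if (cgroup_reclaim(sc))
		__count_vm_events(PGSCAN_CGROUP, nr_scanned);
	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);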

> > Please note that this difference exists only for the system level
> > vmstats. The cgroup stats returned by memory.stat are already
> > consistent: a cgroup's pgsteal contains the number of pages reclaimed
> > by both global and cgroup reclaim. So one way to get consistent system
> > level stats would be to read them from the root's memory.stat, but the
> > root cgroup does not expose that interface. Also, on !CONFIG_MEMCG
> > machines /proc/vmstat is the only way to get these stats. So, make
> > these stats consistent.
> >
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > ---
> >  mm/vmscan.c | 6 ++----
> >  1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index cc555903a332..51f7d1efc912 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1943,8 +1943,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> >         reclaim_stat->recent_scanned[file] += nr_taken;
> >
> >         item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
> > -       if (!cgroup_reclaim(sc))
> > -               __count_vm_events(item, nr_scanned);
> > +       __count_vm_events(item, nr_scanned);
> >         __count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
> >         spin_unlock_irq(&pgdat->lru_lock);
> >
> > @@ -1957,8 +1956,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> >         spin_lock_irq(&pgdat->lru_lock);
> >
> >         item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
> > -       if (!cgroup_reclaim(sc))
> > -               __count_vm_events(item, nr_reclaimed);
> > +       __count_vm_events(item, nr_reclaimed);
> >         __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
> >         reclaim_stat->recent_rotated[0] += stat.nr_activate[0];
> >         reclaim_stat->recent_rotated[1] += stat.nr_activate[1];
> > --
> > 2.26.2.526.g744177e7f7-goog
> >
> >
>
>
> --
> Thanks
> Yafang
Johannes Weiner May 8, 2020, 1:38 p.m. UTC | #4
On Fri, May 08, 2020 at 06:25:14AM -0700, Shakeel Butt wrote:
> On Fri, May 8, 2020 at 3:34 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Fri, May 8, 2020 at 4:49 AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > One way to measure the efficiency of memory reclaim is to look at the
> > > ratio (pgscan+pgrefill)/pgsteal. However, at the moment these stats are
> > > not updated consistently at the system level and the ratio is not very
> > > meaningful. The pgsteal and pgscan counters are updated only for global
> > > reclaim, while pgrefill gets updated for global as well as cgroup
> > > reclaim.
> > >
> >
> > Hi Shakeel,
> >
> > We always use pgscan and pgsteal for monitoring system level memory
> > pressure, for example with sysstat(sar) or other monitoring tools.

I'm in the same boat. It's useful to be able to see activity that
happens purely due to machine capacity separately from localized
activity that happens due to limits throughout the cgroup tree.

> Don't you need pgrefill in addition to pgscan and pgsteal to get the
> full picture of the reclaim activity?

I actually almost never look at pgrefill.

> > But with this change these two counters include memcg pressure as well,
> > so it is no longer easy to tell whether pgscan and pgsteal activity is
> > caused by system level pressure or only by some specific memcgs
> > reaching their memory limit.
> >
> > How about adding the cgroup_reclaim() check to pgrefill as well?
> >
> 
> I am looking for all the reclaim activity on the system. Adding
> !cgroup_reclaim to pgrefill will skip the cgroup reclaim activity.
> Maybe adding pgsteal_cgroup and pgscan_cgroup would be better.

How would you feel about adding memory.stat at the root cgroup level?

There are subtle differences between /proc/vmstat and memory.stat, and
cgroup-aware code that wants to watch the full hierarchy currently has
to know about these intricacies and translate semantics back and forth.

Generally, having the fully recursive memory.stat at the root level
could help a broader range of use cases.
Shakeel Butt May 8, 2020, 2:05 p.m. UTC | #5
On Fri, May 8, 2020 at 6:38 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, May 08, 2020 at 06:25:14AM -0700, Shakeel Butt wrote:
> > On Fri, May 8, 2020 at 3:34 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Fri, May 8, 2020 at 4:49 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > One way to measure the efficiency of memory reclaim is to look at the
> > > > ratio (pgscan+pgrefill)/pgsteal. However, at the moment these stats are
> > > > not updated consistently at the system level and the ratio is not very
> > > > meaningful. The pgsteal and pgscan counters are updated only for global
> > > > reclaim, while pgrefill gets updated for global as well as cgroup
> > > > reclaim.
> > > >
> > >
> > > Hi Shakeel,
> > >
> > > We always use pgscan and pgsteal for monitoring system level memory
> > > pressure, for example with sysstat(sar) or other monitoring tools.
>
> I'm in the same boat. It's useful to be able to see activity that
> happens purely due to machine capacity separately from localized
> activity that happens due to limits throughout the cgroup tree.
>
> > Don't you need pgrefill in addition to pgscan and pgsteal to get the
> > full picture of the reclaim activity?
>
> I actually almost never look at pgrefill.
>

Nowadays we are looking at the reclaim cost on highly utilized
machines/devices and noticed that the rmap walk takes more than 60-70%
of the CPU cost of reclaim. The kernel does rmap walks in
shrink_active_list and shrink_page_list, and pgscan and pgrefill are
good approximations of the number of rmap walks during a reclaim.

> > > But with this change these two counters include memcg pressure as well,
> > > so it is no longer easy to tell whether pgscan and pgsteal activity is
> > > caused by system level pressure or only by some specific memcgs
> > > reaching their memory limit.
> > >
> > > How about adding the cgroup_reclaim() check to pgrefill as well?
> > >
> >
> > I am looking for all the reclaim activity on the system. Adding
> > !cgroup_reclaim to pgrefill will skip the cgroup reclaim activity.
> > Maybe adding pgsteal_cgroup and pgscan_cgroup would be better.
>
> How would you feel about adding memory.stat at the root cgroup level?
>

Actually I would prefer adding memory.stat at the root cgroup level
since, as you noted below, more use cases would benefit from it.

> There are subtle differences between /proc/vmstat and memory.stat, and
> cgroup-aware code that wants to watch the full hierarchy currently has
> to know about these intricacies and translate semantics back and forth.
>
> Generally, having the fully recursive memory.stat at the root level
> could help a broader range of use cases.

Thanks for the feedback. I will send the patch with the additional motivation.
Yafang Shao May 9, 2020, 6:53 a.m. UTC | #6
On Fri, May 8, 2020 at 9:38 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, May 08, 2020 at 06:25:14AM -0700, Shakeel Butt wrote:
> > On Fri, May 8, 2020 at 3:34 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Fri, May 8, 2020 at 4:49 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > One way to measure the efficiency of memory reclaim is to look at the
> > > > ratio (pgscan+pgrefill)/pgsteal. However, at the moment these stats are
> > > > not updated consistently at the system level and the ratio is not very
> > > > meaningful. The pgsteal and pgscan counters are updated only for global
> > > > reclaim, while pgrefill gets updated for global as well as cgroup
> > > > reclaim.
> > > >
> > >
> > > Hi Shakeel,
> > >
> > > We always use pgscan and pgsteal for monitoring system level memory
> > > pressure, for example with sysstat(sar) or other monitoring tools.
>
> I'm in the same boat. It's useful to be able to see activity that
> happens purely due to machine capacity separately from localized
> activity that happens due to limits throughout the cgroup tree.
>

Hi Johannes,

When I used PSI to monitor memory pressure, I found the same behavior
in PSI: /proc/pressure/{memory, io} can report very high pressure due
to some limited cgroups rather than the machine capacity.
Should we separate /proc/pressure/XXX from /sys/fs/cgroup/XXX.pressure
as well? Then /proc/pressure/XXX would only indicate the pressure due
to machine capacity and /sys/fs/cgroup/XXX.pressure would show the
pressure throughout the cgroup tree.

Besides that, there's another difference between /proc/pressure/XXX
and /sys/fs/cgroup/XXX.pressure: when you disable psi (i.e. psi=n),
/proc/pressure/ disappears but /sys/fs/cgroup/XXX.pressure still
exists. If we separated them, this difference would be reasonable.

> > Don't you need pgrefill in addition to pgscan and pgsteal to get the
> > full picture of the reclaim activity?
>
> I actually almost never look at pgrefill.
>
> > > But with this change these two counters include memcg pressure as well,
> > > so it is no longer easy to tell whether pgscan and pgsteal activity is
> > > caused by system level pressure or only by some specific memcgs
> > > reaching their memory limit.
> > >
> > > How about adding the cgroup_reclaim() check to pgrefill as well?
> > >
> >
> > I am looking for all the reclaim activity on the system. Adding
> > !cgroup_reclaim to pgrefill will skip the cgroup reclaim activity.
> > Maybe adding pgsteal_cgroup and pgscan_cgroup would be better.
>
> How would you feel about adding memory.stat at the root cgroup level?
>
> There are subtle differences between /proc/vmstat and memory.stat, and
> cgroup-aware code that wants to watch the full hierarchy currently has
> to know about these intricacies and translate semantics back and forth.
>
> Generally, having the fully recursive memory.stat at the root level
> could help a broader range of use cases.

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index cc555903a332..51f7d1efc912 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1943,8 +1943,7 @@  shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
 	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
-	if (!cgroup_reclaim(sc))
-		__count_vm_events(item, nr_scanned);
+	__count_vm_events(item, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
 	spin_unlock_irq(&pgdat->lru_lock);
 
@@ -1957,8 +1956,7 @@  shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	spin_lock_irq(&pgdat->lru_lock);
 
 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
-	if (!cgroup_reclaim(sc))
-		__count_vm_events(item, nr_reclaimed);
+	__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	reclaim_stat->recent_rotated[0] += stat.nr_activate[0];
 	reclaim_stat->recent_rotated[1] += stat.nr_activate[1];