Message ID: 20200507204913.18661-1-shakeelb@google.com (mailing list archive)
State: New, archived
Series: mm: vmscan: consistent update to pgsteal and pgscan
On Thu, May 07, 2020 at 01:49:13PM -0700, Shakeel Butt wrote:
> One way to measure the efficiency of memory reclaim is to look at the
> ratio (pgscan+pgrefill)/pgsteal. However, at the moment these stats are
> not updated consistently at the system level and their ratio is not
> very meaningful. The pgsteal and pgscan are updated only for global
> reclaim, while pgrefill gets updated for global as well as cgroup
> reclaim.
>
> Please note that this difference is only for system level vmstats. The
> cgroup stats returned by memory.stat are actually consistent. The
> cgroup's pgsteal contains the number of reclaimed pages for global as
> well as cgroup reclaim. So, one way to get the system level stats is to
> get these stats from root's memory.stat, but root does not expose that
> interface. Also, for !CONFIG_MEMCG machines, /proc/vmstat is the only
> way to get these stats. So, make these stats consistent.
>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>

Acked-by: Roman Gushchin <guro@fb.com>

Thanks!
On Fri, May 8, 2020 at 4:49 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> One way to measure the efficiency of memory reclaim is to look at the
> ratio (pgscan+pgrefill)/pgsteal. However, at the moment these stats are
> not updated consistently at the system level and their ratio is not
> very meaningful. The pgsteal and pgscan are updated only for global
> reclaim, while pgrefill gets updated for global as well as cgroup
> reclaim.
>

Hi Shakeel,

We always use pgscan and pgsteal for monitoring system level memory
pressure, for example with sysstat (sar) or some other monitoring tools.
But with this change, these two counters include the memcg pressure as
well. It is not easy to know whether the pgscan and pgsteal are caused
by system level pressure or only by some specific memcgs reaching their
memory limit.

How about adding cgroup_reclaim() to pgrefill as well?

> Please note that this difference is only for system level vmstats. The
> cgroup stats returned by memory.stat are actually consistent. The
> cgroup's pgsteal contains the number of reclaimed pages for global as
> well as cgroup reclaim. So, one way to get the system level stats is to
> get these stats from root's memory.stat, but root does not expose that
> interface. Also, for !CONFIG_MEMCG machines, /proc/vmstat is the only
> way to get these stats. So, make these stats consistent.
>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  mm/vmscan.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cc555903a332..51f7d1efc912 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1943,8 +1943,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	reclaim_stat->recent_scanned[file] += nr_taken;
>
>  	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
> -	if (!cgroup_reclaim(sc))
> -		__count_vm_events(item, nr_scanned);
> +	__count_vm_events(item, nr_scanned);
>  	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
>  	spin_unlock_irq(&pgdat->lru_lock);
>
> @@ -1957,8 +1956,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	spin_lock_irq(&pgdat->lru_lock);
>
>  	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
> -	if (!cgroup_reclaim(sc))
> -		__count_vm_events(item, nr_reclaimed);
> +	__count_vm_events(item, nr_reclaimed);
>  	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
>  	reclaim_stat->recent_rotated[0] += stat.nr_activate[0];
>  	reclaim_stat->recent_rotated[1] += stat.nr_activate[1];
> --
> 2.26.2.526.g744177e7f7-goog
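The monitoring Yafang describes is typically done by sampling /proc/vmstat
periodically and diffing the counters between snapshots. A minimal sketch of
that approach (parse_vmstat and reclaim_deltas are illustrative helpers, not
part of sysstat or the kernel; the snapshot text is synthetic, though the
counter names are the real /proc/vmstat keys):

```python
def parse_vmstat(text):
    """Parse '/proc/vmstat'-style 'name value' lines into a dict."""
    stats = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if value.strip().isdigit():
            stats[name] = int(value.strip())
    return stats

def reclaim_deltas(before, after):
    """Per-interval deltas of the scan/steal counters discussed above."""
    keys = ("pgscan_kswapd", "pgscan_direct",
            "pgsteal_kswapd", "pgsteal_direct")
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}

# Two synthetic snapshots standing in for reads taken one interval apart;
# on a live system each would come from open("/proc/vmstat").read().
snap1 = parse_vmstat("pgscan_kswapd 1000\npgscan_direct 50\n"
                     "pgsteal_kswapd 900\npgsteal_direct 40\n")
snap2 = parse_vmstat("pgscan_kswapd 1600\npgscan_direct 70\n"
                     "pgsteal_kswapd 1400\npgsteal_direct 55\n")
print(reclaim_deltas(snap1, snap2))
```

After the patch under discussion, these deltas cover global plus memcg
limit reclaim together, which is exactly the aggregation Yafang is
concerned about.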
On Fri, May 8, 2020 at 3:34 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Fri, May 8, 2020 at 4:49 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > One way to measure the efficiency of memory reclaim is to look at the
> > ratio (pgscan+pgrefill)/pgsteal. However, at the moment these stats
> > are not updated consistently at the system level and their ratio is
> > not very meaningful. The pgsteal and pgscan are updated only for
> > global reclaim, while pgrefill gets updated for global as well as
> > cgroup reclaim.
> >
>
> Hi Shakeel,
>
> We always use pgscan and pgsteal for monitoring system level memory
> pressure, for example with sysstat (sar) or some other monitoring
> tools.

Don't you need pgrefill in addition to pgscan and pgsteal to get the
full picture of the reclaim activity?

> But with this change, these two counters include the memcg pressure as
> well. It is not easy to know whether the pgscan and pgsteal are caused
> by system level pressure or only by some specific memcgs reaching their
> memory limit.
>
> How about adding cgroup_reclaim() to pgrefill as well?
>

I am looking for all the reclaim activity on the system. Adding
!cgroup_reclaim to pgrefill will skip the cgroup reclaim activity.
Maybe adding pgsteal_cgroup and pgscan_cgroup would be better.

> > Please note that this difference is only for system level vmstats.
> > The cgroup stats returned by memory.stat are actually consistent. The
> > cgroup's pgsteal contains the number of reclaimed pages for global as
> > well as cgroup reclaim. So, one way to get the system level stats is
> > to get these stats from root's memory.stat, but root does not expose
> > that interface. Also, for !CONFIG_MEMCG machines, /proc/vmstat is the
> > only way to get these stats. So, make these stats consistent.
> >
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > ---
> >  mm/vmscan.c | 6 ++----
> >  1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index cc555903a332..51f7d1efc912 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1943,8 +1943,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> >  	reclaim_stat->recent_scanned[file] += nr_taken;
> >
> >  	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
> > -	if (!cgroup_reclaim(sc))
> > -		__count_vm_events(item, nr_scanned);
> > +	__count_vm_events(item, nr_scanned);
> >  	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
> >  	spin_unlock_irq(&pgdat->lru_lock);
> >
> > @@ -1957,8 +1956,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> >  	spin_lock_irq(&pgdat->lru_lock);
> >
> >  	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
> > -	if (!cgroup_reclaim(sc))
> > -		__count_vm_events(item, nr_reclaimed);
> > +	__count_vm_events(item, nr_reclaimed);
> >  	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
> >  	reclaim_stat->recent_rotated[0] += stat.nr_activate[0];
> >  	reclaim_stat->recent_rotated[1] += stat.nr_activate[1];
> > --
> > 2.26.2.526.g744177e7f7-goog
> >
>
> --
> Thanks
> Yafang
On Fri, May 08, 2020 at 06:25:14AM -0700, Shakeel Butt wrote:
> On Fri, May 8, 2020 at 3:34 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Fri, May 8, 2020 at 4:49 AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > One way to measure the efficiency of memory reclaim is to look at
> > > the ratio (pgscan+pgrefill)/pgsteal. However, at the moment these
> > > stats are not updated consistently at the system level and their
> > > ratio is not very meaningful. The pgsteal and pgscan are updated
> > > only for global reclaim, while pgrefill gets updated for global as
> > > well as cgroup reclaim.
> > >
> >
> > Hi Shakeel,
> >
> > We always use pgscan and pgsteal for monitoring system level memory
> > pressure, for example with sysstat (sar) or some other monitoring
> > tools.

I'm in the same boat. It's useful to have activity that happens purely
due to machine capacity rather than localized activity that happens due
to the limits throughout the cgroup tree.

> Don't you need pgrefill in addition to pgscan and pgsteal to get the
> full picture of the reclaim activity?

I actually almost never look at pgrefill.

> > But with this change, these two counters include the memcg pressure
> > as well. It is not easy to know whether the pgscan and pgsteal are
> > caused by system level pressure or only by some specific memcgs
> > reaching their memory limit.
> >
> > How about adding cgroup_reclaim() to pgrefill as well?
> >
>
> I am looking for all the reclaim activity on the system. Adding
> !cgroup_reclaim to pgrefill will skip the cgroup reclaim activity.
> Maybe adding pgsteal_cgroup and pgscan_cgroup would be better.

How would you feel about adding memory.stat at the root cgroup level?

There are subtle differences between /proc/vmstat and memory.stat, and
cgroup-aware code that wants to watch the full hierarchy currently has
to know about these intricacies and translate semantics back and forth.

Generally, having the fully recursive memory.stat at the root level
could help a broader range of usecases.
On Fri, May 8, 2020 at 6:38 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, May 08, 2020 at 06:25:14AM -0700, Shakeel Butt wrote:
> > On Fri, May 8, 2020 at 3:34 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Fri, May 8, 2020 at 4:49 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > One way to measure the efficiency of memory reclaim is to look at
> > > > the ratio (pgscan+pgrefill)/pgsteal. However, at the moment these
> > > > stats are not updated consistently at the system level and their
> > > > ratio is not very meaningful. The pgsteal and pgscan are updated
> > > > only for global reclaim, while pgrefill gets updated for global
> > > > as well as cgroup reclaim.
> > > >
> > >
> > > Hi Shakeel,
> > >
> > > We always use pgscan and pgsteal for monitoring system level memory
> > > pressure, for example with sysstat (sar) or some other monitoring
> > > tools.
>
> I'm in the same boat. It's useful to have activity that happens purely
> due to machine capacity rather than localized activity that happens
> due to the limits throughout the cgroup tree.
>
> > Don't you need pgrefill in addition to pgscan and pgsteal to get the
> > full picture of the reclaim activity?
>
> I actually almost never look at pgrefill.
>

Nowadays we are looking at reclaim cost on high utilization
machines/devices and noticed that the rmap walk takes more than 60-70%
of the CPU cost of the reclaim. The kernel does rmap walks in
shrink_active_list and shrink_page_list, and pgscan and pgrefill are
good approximations of the number of rmap walks during a reclaim.

> > > But with this change, these two counters include the memcg pressure
> > > as well. It is not easy to know whether the pgscan and pgsteal are
> > > caused by system level pressure or only by some specific memcgs
> > > reaching their memory limit.
> > >
> > > How about adding cgroup_reclaim() to pgrefill as well?
> > >
> >
> > I am looking for all the reclaim activity on the system. Adding
> > !cgroup_reclaim to pgrefill will skip the cgroup reclaim activity.
> > Maybe adding pgsteal_cgroup and pgscan_cgroup would be better.
>
> How would you feel about adding memory.stat at the root cgroup level?
>

Actually, I would prefer adding memory.stat at the root cgroup level,
as you noted below that more use-cases would benefit from it.

> There are subtle differences between /proc/vmstat and memory.stat, and
> cgroup-aware code that wants to watch the full hierarchy currently has
> to know about these intricacies and translate semantics back and forth.
>
> Generally having the fully recursive memory.stat at the root level
> could help a broader range of usecases.

Thanks for the feedback. I will send the patch with the additional
motivation.
On Fri, May 8, 2020 at 9:38 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, May 08, 2020 at 06:25:14AM -0700, Shakeel Butt wrote:
> > On Fri, May 8, 2020 at 3:34 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Fri, May 8, 2020 at 4:49 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > One way to measure the efficiency of memory reclaim is to look at
> > > > the ratio (pgscan+pgrefill)/pgsteal. However, at the moment these
> > > > stats are not updated consistently at the system level and their
> > > > ratio is not very meaningful. The pgsteal and pgscan are updated
> > > > only for global reclaim, while pgrefill gets updated for global
> > > > as well as cgroup reclaim.
> > > >
> > >
> > > Hi Shakeel,
> > >
> > > We always use pgscan and pgsteal for monitoring system level memory
> > > pressure, for example with sysstat (sar) or some other monitoring
> > > tools.
>
> I'm in the same boat. It's useful to have activity that happens purely
> due to machine capacity rather than localized activity that happens
> due to the limits throughout the cgroup tree.
>

Hi Johannes,

When I used PSI to monitor memory pressure, I found the same behavior
in PSI: /proc/pressure/{memory,io} can become very large due to some
limited cgroups rather than the machine capacity. Should we separate
/proc/pressure/XXX from /sys/fs/cgroup/XXX.pressure as well? Then
/proc/pressure/XXX would only indicate the pressure due to machine
capacity, and /sys/fs/cgroup/XXX.pressure would show the pressure
throughout the cgroup tree.

Besides that, there is another difference between /proc/pressure/XXX
and /sys/fs/cgroup/XXX.pressure: when you disable psi (i.e. psi=n),
/proc/pressure/ will disappear, but /sys/fs/cgroup/XXX.pressure still
exists. If we separate them, this difference will be reasonable.

> > Don't you need pgrefill in addition to pgscan and pgsteal to get the
> > full picture of the reclaim activity?
>
> I actually almost never look at pgrefill.
>
> > > But with this change, these two counters include the memcg pressure
> > > as well. It is not easy to know whether the pgscan and pgsteal are
> > > caused by system level pressure or only by some specific memcgs
> > > reaching their memory limit.
> > >
> > > How about adding cgroup_reclaim() to pgrefill as well?
> > >
> >
> > I am looking for all the reclaim activity on the system. Adding
> > !cgroup_reclaim to pgrefill will skip the cgroup reclaim activity.
> > Maybe adding pgsteal_cgroup and pgscan_cgroup would be better.
>
> How would you feel about adding memory.stat at the root cgroup level?
>
> There are subtle differences between /proc/vmstat and memory.stat, and
> cgroup-aware code that wants to watch the full hierarchy currently has
> to know about these intricacies and translate semantics back and forth.
>
> Generally having the fully recursive memory.stat at the root level
> could help a broader range of usecases.
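For context on the PSI interface discussed above: both /proc/pressure/*
and the cgroup *.pressure files expose "some"/"full" lines of key=value
pairs. A minimal parsing sketch (parse_psi is an illustrative helper, not
part of any tool in this thread; the sample values are made up, though
the line format follows the kernel's PSI documentation):

```python
def parse_psi(text):
    """Parse a PSI file body: 'some'/'full' lines of key=value fields."""
    out = {}
    for line in text.splitlines():
        kind, *fields = line.split()
        # avg10/avg60/avg300 are percentages; total is stall time in us.
        out[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return out

# Synthetic stand-in for open("/proc/pressure/memory").read().
sample = ("some avg10=0.12 avg60=0.05 avg300=0.01 total=417963\n"
          "full avg10=0.00 avg60=0.00 avg300=0.00 total=205933")
psi = parse_psi(sample)
```

Because the format is identical at the system and cgroup level, a monitor
cannot tell from the numbers alone whether the stalls came from machine
capacity or from cgroup limits, which is the ambiguity Yafang raises.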
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cc555903a332..51f7d1efc912 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1943,8 +1943,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	reclaim_stat->recent_scanned[file] += nr_taken;

 	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
-	if (!cgroup_reclaim(sc))
-		__count_vm_events(item, nr_scanned);
+	__count_vm_events(item, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
 	spin_unlock_irq(&pgdat->lru_lock);

@@ -1957,8 +1956,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	spin_lock_irq(&pgdat->lru_lock);

 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
-	if (!cgroup_reclaim(sc))
-		__count_vm_events(item, nr_reclaimed);
+	__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	reclaim_stat->recent_rotated[0] += stat.nr_activate[0];
 	reclaim_stat->recent_rotated[1] += stat.nr_activate[1];
One way to measure the efficiency of memory reclaim is to look at the
ratio (pgscan+pgrefill)/pgsteal. However, at the moment these stats are
not updated consistently at the system level and their ratio is not
very meaningful. The pgsteal and pgscan are updated only for global
reclaim, while pgrefill gets updated for global as well as cgroup
reclaim.

Please note that this difference is only for system level vmstats. The
cgroup stats returned by memory.stat are actually consistent. The
cgroup's pgsteal contains the number of reclaimed pages for global as
well as cgroup reclaim. So, one way to get the system level stats is to
get these stats from root's memory.stat, but root does not expose that
interface. Also, for !CONFIG_MEMCG machines, /proc/vmstat is the only
way to get these stats. So, make these stats consistent.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 mm/vmscan.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)
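The efficiency metric in the changelog can be illustrated with a small
sketch (reclaim_inefficiency is a hypothetical helper and the counter
values are made up; only the (pgscan+pgrefill)/pgsteal formula comes
from the changelog itself):

```python
def reclaim_inefficiency(pgscan, pgrefill, pgsteal):
    """(pgscan + pgrefill) / pgsteal: pages touched per page reclaimed.

    A value near 1 means most scanned pages were reclaimed; larger
    values mean the kernel did more work (inactive-list scans plus
    active-list refills) for each page it actually freed.
    """
    if pgsteal == 0:
        return float("inf")  # reclaim activity with nothing reclaimed
    return (pgscan + pgrefill) / pgsteal

# e.g. 12000 pages scanned, 3000 refilled, 10000 reclaimed
ratio = reclaim_inefficiency(12000, 3000, 10000)
```

The changelog's point is that this ratio is only meaningful when all
three counters are accumulated over the same set of reclaim events,
which is what the patch makes true for /proc/vmstat.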