
[v2] mm/vmscan: check references from all memcgs for swapbacked memory

Message ID 20221005173713.1308832-1-yosryahmed@google.com (mailing list archive)
State New

Commit Message

Yosry Ahmed Oct. 5, 2022, 5:37 p.m. UTC
During page/folio reclaim, we check if a folio is referenced using
folio_referenced() to avoid reclaiming folios that have been recently
accessed (hot memory). The rationale is that this memory is likely to be
accessed soon, and hence reclaiming it will cause a refault.

For memcg reclaim, we currently only check accesses to the folio from
processes in the subtree of the target memcg. This behavior was
originally introduced by commit bed7161a519a ("Memory controller: make
page_referenced() cgroup aware") a long time ago. Back then, refaulted
pages would get charged to the memcg of the process that was faulting them
in. It made sense to only consider accesses coming from processes in the
subtree of target_mem_cgroup. If a page was charged to memcg A but only
being accessed by a sibling memcg B, we would reclaim it if memcg A was
the reclaim target. memcg B can then fault it back in and get charged
for it appropriately.

Today, this behavior still makes sense for file pages. However, unlike
file pages, when swapbacked pages are refaulted they are charged back to
the memcg they were originally charged to when they were swapped out. This
means that if a swapbacked page is charged to memcg A but only used by
memcg B, and we reclaim it from memcg A, it would simply be faulted back
in and charged again to memcg A once memcg B accesses it. In that sense,
accesses from all memcgs matter equally when considering if a swapbacked
page/folio is a viable reclaim target.

Modify folio_referenced() to always consider accesses from all memcgs if
the folio is swapbacked.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---

v1 -> v2:
- Moved the folio_test_swapbacked() check inside folio_referenced()
  (Johannes).
- Slightly rephrased the commit log and comment to make them
  clearer.
- Renamed the memcg argument of folio_referenced() to target_memcg.

---
 include/linux/rmap.h |  2 +-
 mm/rmap.c            | 35 +++++++++++++++++++++++++++--------
 2 files changed, 28 insertions(+), 9 deletions(-)

Comments

Johannes Weiner Oct. 5, 2022, 7:49 p.m. UTC | #1
On Wed, Oct 05, 2022 at 05:37:13PM +0000, Yosry Ahmed wrote:
> During page/folio reclaim, we check if a folio is referenced using
> folio_referenced() to avoid reclaiming folios that have been recently
> accessed (hot memory). The rationale is that this memory is likely to be
> accessed soon, and hence reclaiming it will cause a refault.
> 
> For memcg reclaim, we currently only check accesses to the folio from
> processes in the subtree of the target memcg. This behavior was
> originally introduced by commit bed7161a519a ("Memory controller: make
> page_referenced() cgroup aware") a long time ago. Back then, refaulted
> pages would get charged to the memcg of the process that was faulting them
> in. It made sense to only consider accesses coming from processes in the
> subtree of target_mem_cgroup. If a page was charged to memcg A but only
> being accessed by a sibling memcg B, we would reclaim it if memcg A is
> is the reclaim target. memcg B can then fault it back in and get charged
> for it appropriately.
> 
> Today, this behavior still makes sense for file pages. However, unlike
> file pages, when swapbacked pages are refaulted they are charged to the
> memcg that was originally charged for them during swapping out. Which
> means that if a swapbacked page is charged to memcg A but only used by
> memcg B, and we reclaim it from memcg A, it would simply be faulted back
> in and charged again to memcg A once memcg B accesses it. In that sense,
> accesses from all memcgs matter equally when considering if a swapbacked
> page/folio is a viable reclaim target.
> 
> Modify folio_referenced() to always consider accesses from all memcgs if
> the folio is swapbacked.
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Yu Zhao Oct. 5, 2022, 8:47 p.m. UTC | #2
On Wed, Oct 5, 2022 at 11:37 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> During page/folio reclaim, we check if a folio is referenced using
> folio_referenced() to avoid reclaiming folios that have been recently
> accessed (hot memory). The rationale is that this memory is likely to be
> accessed soon, and hence reclaiming it will cause a refault.
>
> For memcg reclaim, we currently only check accesses to the folio from
> processes in the subtree of the target memcg. This behavior was
> originally introduced by commit bed7161a519a ("Memory controller: make
> page_referenced() cgroup aware") a long time ago. Back then, refaulted
> pages would get charged to the memcg of the process that was faulting them
> in. It made sense to only consider accesses coming from processes in the
> subtree of target_mem_cgroup. If a page was charged to memcg A but only
> being accessed by a sibling memcg B, we would reclaim it if memcg A is
> is the reclaim target. memcg B can then fault it back in and get charged
> for it appropriately.
>
> Today, this behavior still makes sense for file pages. However, unlike
> file pages, when swapbacked pages are refaulted they are charged to the
> memcg that was originally charged for them during swapping out. Which
> means that if a swapbacked page is charged to memcg A but only used by
> memcg B, and we reclaim it from memcg A, it would simply be faulted back
> in and charged again to memcg A once memcg B accesses it. In that sense,
> accesses from all memcgs matter equally when considering if a swapbacked
> page/folio is a viable reclaim target.
>
> Modify folio_referenced() to always consider accesses from all memcgs if
> the folio is swapbacked.

It seems to me this change can potentially increase the number of
zombie memcgs. Any risk assessment done on this?
Yosry Ahmed Oct. 5, 2022, 9:01 p.m. UTC | #3
On Wed, Oct 5, 2022 at 1:48 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Wed, Oct 5, 2022 at 11:37 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > During page/folio reclaim, we check if a folio is referenced using
> > folio_referenced() to avoid reclaiming folios that have been recently
> > accessed (hot memory). The rationale is that this memory is likely to be
> > accessed soon, and hence reclaiming it will cause a refault.
> >
> > For memcg reclaim, we currently only check accesses to the folio from
> > processes in the subtree of the target memcg. This behavior was
> > originally introduced by commit bed7161a519a ("Memory controller: make
> > page_referenced() cgroup aware") a long time ago. Back then, refaulted
> > pages would get charged to the memcg of the process that was faulting them
> > in. It made sense to only consider accesses coming from processes in the
> > subtree of target_mem_cgroup. If a page was charged to memcg A but only
> > being accessed by a sibling memcg B, we would reclaim it if memcg A is
> > is the reclaim target. memcg B can then fault it back in and get charged
> > for it appropriately.
> >
> > Today, this behavior still makes sense for file pages. However, unlike
> > file pages, when swapbacked pages are refaulted they are charged to the
> > memcg that was originally charged for them during swapping out. Which
> > means that if a swapbacked page is charged to memcg A but only used by
> > memcg B, and we reclaim it from memcg A, it would simply be faulted back
> > in and charged again to memcg A once memcg B accesses it. In that sense,
> > accesses from all memcgs matter equally when considering if a swapbacked
> > page/folio is a viable reclaim target.
> >
> > Modify folio_referenced() to always consider accesses from all memcgs if
> > the folio is swapbacked.
>
> It seems to me this change can potentially increase the number of
> zombie memcgs. Any risk assessment done on this?

Do you mind elaborating the case(s) where this could happen? Is this
the cgroup v1 case in mem_cgroup_swapout() where we are reclaiming
from a zombie memcg and swapping out would let us move the charge to
the parent?
Yu Zhao Oct. 5, 2022, 9:13 p.m. UTC | #4
On Wed, Oct 5, 2022 at 3:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Oct 5, 2022 at 1:48 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Wed, Oct 5, 2022 at 11:37 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > During page/folio reclaim, we check if a folio is referenced using
> > > folio_referenced() to avoid reclaiming folios that have been recently
> > > accessed (hot memory). The rationale is that this memory is likely to be
> > > accessed soon, and hence reclaiming it will cause a refault.
> > >
> > > For memcg reclaim, we currently only check accesses to the folio from
> > > processes in the subtree of the target memcg. This behavior was
> > > originally introduced by commit bed7161a519a ("Memory controller: make
> > > page_referenced() cgroup aware") a long time ago. Back then, refaulted
> > > pages would get charged to the memcg of the process that was faulting them
> > > in. It made sense to only consider accesses coming from processes in the
> > > subtree of target_mem_cgroup. If a page was charged to memcg A but only
> > > being accessed by a sibling memcg B, we would reclaim it if memcg A is
> > > is the reclaim target. memcg B can then fault it back in and get charged
> > > for it appropriately.
> > >
> > > Today, this behavior still makes sense for file pages. However, unlike
> > > file pages, when swapbacked pages are refaulted they are charged to the
> > > memcg that was originally charged for them during swapping out. Which
> > > means that if a swapbacked page is charged to memcg A but only used by
> > > memcg B, and we reclaim it from memcg A, it would simply be faulted back
> > > in and charged again to memcg A once memcg B accesses it. In that sense,
> > > accesses from all memcgs matter equally when considering if a swapbacked
> > > page/folio is a viable reclaim target.
> > >
> > > Modify folio_referenced() to always consider accesses from all memcgs if
> > > the folio is swapbacked.
> >
> > It seems to me this change can potentially increase the number of
> > zombie memcgs. Any risk assessment done on this?
>
> Do you mind elaborating the case(s) where this could happen? Is this
> the cgroup v1 case in mem_cgroup_swapout() where we are reclaiming
> from a zombie memcg and swapping out would let us move the charge to
> the parent?

The scenario is quite straightforward: for a page charged to memcg A
and also actively used by memcg B, if we don't ignore the access from
memcg B, we won't be able to reclaim it after memcg A is deleted.
Yu Zhao Oct. 5, 2022, 10:22 p.m. UTC | #5
On Wed, Oct 5, 2022 at 3:13 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Wed, Oct 5, 2022 at 3:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Wed, Oct 5, 2022 at 1:48 PM Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > On Wed, Oct 5, 2022 at 11:37 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > During page/folio reclaim, we check if a folio is referenced using
> > > > folio_referenced() to avoid reclaiming folios that have been recently
> > > > accessed (hot memory). The rationale is that this memory is likely to be
> > > > accessed soon, and hence reclaiming it will cause a refault.
> > > >
> > > > For memcg reclaim, we currently only check accesses to the folio from
> > > > processes in the subtree of the target memcg. This behavior was
> > > > originally introduced by commit bed7161a519a ("Memory controller: make
> > > > page_referenced() cgroup aware") a long time ago. Back then, refaulted
> > > > pages would get charged to the memcg of the process that was faulting them
> > > > in. It made sense to only consider accesses coming from processes in the
> > > > subtree of target_mem_cgroup. If a page was charged to memcg A but only
> > > > being accessed by a sibling memcg B, we would reclaim it if memcg A is
> > > > is the reclaim target. memcg B can then fault it back in and get charged
> > > > for it appropriately.
> > > >
> > > > Today, this behavior still makes sense for file pages. However, unlike
> > > > file pages, when swapbacked pages are refaulted they are charged to the
> > > > memcg that was originally charged for them during swapping out. Which
> > > > means that if a swapbacked page is charged to memcg A but only used by
> > > > memcg B, and we reclaim it from memcg A, it would simply be faulted back
> > > > in and charged again to memcg A once memcg B accesses it. In that sense,
> > > > accesses from all memcgs matter equally when considering if a swapbacked
> > > > page/folio is a viable reclaim target.

I just read the entire commit message (sorry for not doing so
previously) to figure out where the confusion came from: the above
claim is wrong for two cases. I'll let you figure out why :)

> > > > Modify folio_referenced() to always consider accesses from all memcgs if
> > > > the folio is swapbacked.
> > >
> > > It seems to me this change can potentially increase the number of
> > > zombie memcgs. Any risk assessment done on this?
> >
> > Do you mind elaborating the case(s) where this could happen? Is this
> > the cgroup v1 case in mem_cgroup_swapout() where we are reclaiming
> > from a zombie memcg and swapping out would let us move the charge to
> > the parent?
>
> The scenario is quite straightforward: for a page charged to memcg A
> and also actively used by memcg B, if we don't ignore the access from
> memcg B, we won't be able to reclaim it after memcg A is deleted.
Yosry Ahmed Oct. 5, 2022, 10:45 p.m. UTC | #6
On Wed, Oct 5, 2022 at 3:23 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Wed, Oct 5, 2022 at 3:13 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Wed, Oct 5, 2022 at 3:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Wed, Oct 5, 2022 at 1:48 PM Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > On Wed, Oct 5, 2022 at 11:37 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > > During page/folio reclaim, we check if a folio is referenced using
> > > > > folio_referenced() to avoid reclaiming folios that have been recently
> > > > > accessed (hot memory). The rationale is that this memory is likely to be
> > > > > accessed soon, and hence reclaiming it will cause a refault.
> > > > >
> > > > > For memcg reclaim, we currently only check accesses to the folio from
> > > > > processes in the subtree of the target memcg. This behavior was
> > > > > originally introduced by commit bed7161a519a ("Memory controller: make
> > > > > page_referenced() cgroup aware") a long time ago. Back then, refaulted
> > > > > pages would get charged to the memcg of the process that was faulting them
> > > > > in. It made sense to only consider accesses coming from processes in the
> > > > > subtree of target_mem_cgroup. If a page was charged to memcg A but only
> > > > > being accessed by a sibling memcg B, we would reclaim it if memcg A is
> > > > > is the reclaim target. memcg B can then fault it back in and get charged
> > > > > for it appropriately.
> > > > >
> > > > > Today, this behavior still makes sense for file pages. However, unlike
> > > > > file pages, when swapbacked pages are refaulted they are charged to the
> > > > > memcg that was originally charged for them during swapping out. Which
> > > > > means that if a swapbacked page is charged to memcg A but only used by
> > > > > memcg B, and we reclaim it from memcg A, it would simply be faulted back
> > > > > in and charged again to memcg A once memcg B accesses it. In that sense,
> > > > > accesses from all memcgs matter equally when considering if a swapbacked
> > > > > page/folio is a viable reclaim target.
>
> I just read the entire commit message (sorry for not doing so
> previously) to figure out where the confusion came from: the above
> claim is wrong for two cases. I'll let you figure out why :)

I missed the cases with dead memcgs. I think the two cases are:

1) If a memcg is dead during swap out, it looks like the swap charge
is moved to the parent, so reclaim is effectively recharging to the
parent. This can be handled by only checking accesses from all memcgs
when the charged memcg is alive, something like this:

if (target_memcg && (!folio_test_swapbacked(folio) ||
                     !mem_cgroup_online(folio_memcg(folio))))
        ...

2) If a memcg dies after a page is already swapped out. During swap in,
it looks like we charge the page to the memcg of the process taking the
page fault in that case.

Now in this case this patch might actually increase zombie memcgs, but
only temporarily. Next time we try to reclaim the page we will go back
to case (1) and reclaim it. Also, one might argue that given that the
page is relatively hot (accessed by a different memcg), and therefore
likely to be faulted in soon, the chances of the memcg dying between
the time where the page is reclaimed and faulted back in are slim. One
might also argue that having swapbacked pages charged to one memcg and
accessed by another is generally less common compared to file page
cache.

So I *think* with an added check for offline memcgs there shouldn't be
any concerns, unless I got the two cases wrong :)

>
> > > > > Modify folio_referenced() to always consider accesses from all memcgs if
> > > > > the folio is swapbacked.
> > > >
> > > > It seems to me this change can potentially increase the number of
> > > > zombie memcgs. Any risk assessment done on this?
> > >
> > > Do you mind elaborating the case(s) where this could happen? Is this
> > > the cgroup v1 case in mem_cgroup_swapout() where we are reclaiming
> > > from a zombie memcg and swapping out would let us move the charge to
> > > the parent?
> >
> > The scenario is quite straightforward: for a page charged to memcg A
> > and also actively used by memcg B, if we don't ignore the access from
> > memcg B, we won't be able to reclaim it after memcg A is deleted.
Johannes Weiner Oct. 6, 2022, 4:19 a.m. UTC | #7
On Wed, Oct 05, 2022 at 03:13:38PM -0600, Yu Zhao wrote:
> On Wed, Oct 5, 2022 at 3:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Wed, Oct 5, 2022 at 1:48 PM Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > On Wed, Oct 5, 2022 at 11:37 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > During page/folio reclaim, we check if a folio is referenced using
> > > > folio_referenced() to avoid reclaiming folios that have been recently
> > > > accessed (hot memory). The rationale is that this memory is likely to be
> > > > accessed soon, and hence reclaiming it will cause a refault.
> > > >
> > > > For memcg reclaim, we currently only check accesses to the folio from
> > > > processes in the subtree of the target memcg. This behavior was
> > > > originally introduced by commit bed7161a519a ("Memory controller: make
> > > > page_referenced() cgroup aware") a long time ago. Back then, refaulted
> > > > pages would get charged to the memcg of the process that was faulting them
> > > > in. It made sense to only consider accesses coming from processes in the
> > > > subtree of target_mem_cgroup. If a page was charged to memcg A but only
> > > > being accessed by a sibling memcg B, we would reclaim it if memcg A is
> > > > is the reclaim target. memcg B can then fault it back in and get charged
> > > > for it appropriately.
> > > >
> > > > Today, this behavior still makes sense for file pages. However, unlike
> > > > file pages, when swapbacked pages are refaulted they are charged to the
> > > > memcg that was originally charged for them during swapping out. Which
> > > > means that if a swapbacked page is charged to memcg A but only used by
> > > > memcg B, and we reclaim it from memcg A, it would simply be faulted back
> > > > in and charged again to memcg A once memcg B accesses it. In that sense,
> > > > accesses from all memcgs matter equally when considering if a swapbacked
> > > > page/folio is a viable reclaim target.
> > > >
> > > > Modify folio_referenced() to always consider accesses from all memcgs if
> > > > the folio is swapbacked.
> > >
> > > It seems to me this change can potentially increase the number of
> > > zombie memcgs. Any risk assessment done on this?
> >
> > Do you mind elaborating the case(s) where this could happen? Is this
> > the cgroup v1 case in mem_cgroup_swapout() where we are reclaiming
> > from a zombie memcg and swapping out would let us move the charge to
> > the parent?
> 
> The scenario is quite straightforward: for a page charged to memcg A
> and also actively used by memcg B, if we don't ignore the access from
> memcg B, we won't be able to reclaim it after memcg A is deleted.

This patch changes the behavior of limit-induced reclaim. There is no
limit reclaim on A after it's been deleted. And parental/global
reclaim has always recognized outside references.
Yu Zhao Oct. 6, 2022, 5:10 a.m. UTC | #8
On Wed, Oct 5, 2022 at 10:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Oct 05, 2022 at 03:13:38PM -0600, Yu Zhao wrote:
> > On Wed, Oct 5, 2022 at 3:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Wed, Oct 5, 2022 at 1:48 PM Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > On Wed, Oct 5, 2022 at 11:37 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > > During page/folio reclaim, we check if a folio is referenced using
> > > > > folio_referenced() to avoid reclaiming folios that have been recently
> > > > > accessed (hot memory). The rationale is that this memory is likely to be
> > > > > accessed soon, and hence reclaiming it will cause a refault.
> > > > >
> > > > > For memcg reclaim, we currently only check accesses to the folio from
> > > > > processes in the subtree of the target memcg. This behavior was
> > > > > originally introduced by commit bed7161a519a ("Memory controller: make
> > > > > page_referenced() cgroup aware") a long time ago. Back then, refaulted
> > > > > pages would get charged to the memcg of the process that was faulting them
> > > > > in. It made sense to only consider accesses coming from processes in the
> > > > > subtree of target_mem_cgroup. If a page was charged to memcg A but only
> > > > > being accessed by a sibling memcg B, we would reclaim it if memcg A is
> > > > > is the reclaim target. memcg B can then fault it back in and get charged
> > > > > for it appropriately.
> > > > >
> > > > > Today, this behavior still makes sense for file pages. However, unlike
> > > > > file pages, when swapbacked pages are refaulted they are charged to the
> > > > > memcg that was originally charged for them during swapping out. Which
> > > > > means that if a swapbacked page is charged to memcg A but only used by
> > > > > memcg B, and we reclaim it from memcg A, it would simply be faulted back
> > > > > in and charged again to memcg A once memcg B accesses it. In that sense,
> > > > > accesses from all memcgs matter equally when considering if a swapbacked
> > > > > page/folio is a viable reclaim target.
> > > > >
> > > > > Modify folio_referenced() to always consider accesses from all memcgs if
> > > > > the folio is swapbacked.
> > > >
> > > > It seems to me this change can potentially increase the number of
> > > > zombie memcgs. Any risk assessment done on this?
> > >
> > > Do you mind elaborating the case(s) where this could happen? Is this
> > > the cgroup v1 case in mem_cgroup_swapout() where we are reclaiming
> > > from a zombie memcg and swapping out would let us move the charge to
> > > the parent?
> >
> > The scenario is quite straightforward: for a page charged to memcg A
> > and also actively used by memcg B, if we don't ignore the access from
> > memcg B, we won't be able to reclaim it after memcg A is deleted.
>
> This patch changes the behavior of limit-induced reclaim. There is no
> limit reclaim on A after it's been deleted. And parental/global
> reclaim has always recognized outside references.

We use memory.reclaim to scrape memcgs right before rmdir so that they
are unlikely to stick around. Otherwise our job scheduler would see
less available memory and become less eager to increase load. This in
turn reduces the chance of global reclaim, and deleted memcgs would
stick around even longer.
Yosry Ahmed Oct. 6, 2022, 7:30 a.m. UTC | #9
On Wed, Oct 5, 2022 at 9:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Oct 05, 2022 at 03:13:38PM -0600, Yu Zhao wrote:
> > On Wed, Oct 5, 2022 at 3:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Wed, Oct 5, 2022 at 1:48 PM Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > On Wed, Oct 5, 2022 at 11:37 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > > During page/folio reclaim, we check if a folio is referenced using
> > > > > folio_referenced() to avoid reclaiming folios that have been recently
> > > > > accessed (hot memory). The rationale is that this memory is likely to be
> > > > > accessed soon, and hence reclaiming it will cause a refault.
> > > > >
> > > > > For memcg reclaim, we currently only check accesses to the folio from
> > > > > processes in the subtree of the target memcg. This behavior was
> > > > > originally introduced by commit bed7161a519a ("Memory controller: make
> > > > > page_referenced() cgroup aware") a long time ago. Back then, refaulted
> > > > > pages would get charged to the memcg of the process that was faulting them
> > > > > in. It made sense to only consider accesses coming from processes in the
> > > > > subtree of target_mem_cgroup. If a page was charged to memcg A but only
> > > > > being accessed by a sibling memcg B, we would reclaim it if memcg A is
> > > > > is the reclaim target. memcg B can then fault it back in and get charged
> > > > > for it appropriately.
> > > > >
> > > > > Today, this behavior still makes sense for file pages. However, unlike
> > > > > file pages, when swapbacked pages are refaulted they are charged to the
> > > > > memcg that was originally charged for them during swapping out. Which
> > > > > means that if a swapbacked page is charged to memcg A but only used by
> > > > > memcg B, and we reclaim it from memcg A, it would simply be faulted back
> > > > > in and charged again to memcg A once memcg B accesses it. In that sense,
> > > > > accesses from all memcgs matter equally when considering if a swapbacked
> > > > > page/folio is a viable reclaim target.
> > > > >
> > > > > Modify folio_referenced() to always consider accesses from all memcgs if
> > > > > the folio is swapbacked.
> > > >
> > > > It seems to me this change can potentially increase the number of
> > > > zombie memcgs. Any risk assessment done on this?
> > >
> > > Do you mind elaborating the case(s) where this could happen? Is this
> > > the cgroup v1 case in mem_cgroup_swapout() where we are reclaiming
> > > from a zombie memcg and swapping out would let us move the charge to
> > > the parent?
> >
> > The scenario is quite straightforward: for a page charged to memcg A
> > and also actively used by memcg B, if we don't ignore the access from
> > memcg B, we won't be able to reclaim it after memcg A is deleted.
>
> This patch changes the behavior of limit-induced reclaim. There is no
> limit reclaim on A after it's been deleted. And parental/global
> reclaim has always recognized outside references.

Do you mind elaborating on the parental reclaim part?

I am looking at the code and it looks like memcg reclaim of a parent
(limit-induced or proactive) will only consider references coming from
its subtree, even when reclaiming from its dead children. It looks
like as long as sc->target_mem_cgroup is set, we ignore outside
references (relative to sc->target_mem_cgroup).

If that is true, maybe we want to keep ignoring outside references for
swap-backed pages if the folio is charged to a dead memcg? My
understanding is that in this case we will uncharge the page from the
dead memcg and charge the swapped entry to the parent, reducing the
number of refs on the dead memcg. Without this check, this patch might
prevent the charge from being moved to the parent in this case. WDYT?
Johannes Weiner Oct. 6, 2022, 3:32 p.m. UTC | #10
On Thu, Oct 06, 2022 at 12:30:45AM -0700, Yosry Ahmed wrote:
> On Wed, Oct 5, 2022 at 9:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Oct 05, 2022 at 03:13:38PM -0600, Yu Zhao wrote:
> > > On Wed, Oct 5, 2022 at 3:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Wed, Oct 5, 2022 at 1:48 PM Yu Zhao <yuzhao@google.com> wrote:
> > > > >
> > > > > On Wed, Oct 5, 2022 at 11:37 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > >
> > > > > > During page/folio reclaim, we check if a folio is referenced using
> > > > > > folio_referenced() to avoid reclaiming folios that have been recently
> > > > > > accessed (hot memory). The rationale is that this memory is likely to be
> > > > > > accessed soon, and hence reclaiming it will cause a refault.
> > > > > >
> > > > > > For memcg reclaim, we currently only check accesses to the folio from
> > > > > > processes in the subtree of the target memcg. This behavior was
> > > > > > originally introduced by commit bed7161a519a ("Memory controller: make
> > > > > > page_referenced() cgroup aware") a long time ago. Back then, refaulted
> > > > > > pages would get charged to the memcg of the process that was faulting them
> > > > > > in. It made sense to only consider accesses coming from processes in the
> > > > > > subtree of target_mem_cgroup. If a page was charged to memcg A but only
> > > > > > being accessed by a sibling memcg B, we would reclaim it if memcg A is
> > > > > > is the reclaim target. memcg B can then fault it back in and get charged
> > > > > > for it appropriately.
> > > > > >
> > > > > > Today, this behavior still makes sense for file pages. However, unlike
> > > > > > file pages, when swapbacked pages are refaulted they are charged to the
> > > > > > memcg that was originally charged for them during swapping out. Which
> > > > > > means that if a swapbacked page is charged to memcg A but only used by
> > > > > > memcg B, and we reclaim it from memcg A, it would simply be faulted back
> > > > > > in and charged again to memcg A once memcg B accesses it. In that sense,
> > > > > > accesses from all memcgs matter equally when considering if a swapbacked
> > > > > > page/folio is a viable reclaim target.
> > > > > >
> > > > > > Modify folio_referenced() to always consider accesses from all memcgs if
> > > > > > the folio is swapbacked.
> > > > >
> > > > > It seems to me this change can potentially increase the number of
> > > > > zombie memcgs. Any risk assessment done on this?
> > > >
> > > > Do you mind elaborating the case(s) where this could happen? Is this
> > > > the cgroup v1 case in mem_cgroup_swapout() where we are reclaiming
> > > > from a zombie memcg and swapping out would let us move the charge to
> > > > the parent?
> > >
> > > The scenario is quite straightforward: for a page charged to memcg A
> > > and also actively used by memcg B, if we don't ignore the access from
> > > memcg B, we won't be able to reclaim it after memcg A is deleted.
> >
> > This patch changes the behavior of limit-induced reclaim. There is no
> > limit reclaim on A after it's been deleted. And parental/global
> > reclaim has always recognized outside references.
> 
> Do you mind elaborating on the parental reclaim part?
> 
> I am looking at the code and it looks like memcg reclaim of a parent
> (limit-induced or proactive) will only consider references coming from
> its subtree, even when reclaiming from its dead children. It looks
> like as long as sc->target_mem_cgroup is set, we ignore outside
> references (relative to sc->target_mem_cgroup).

Yes, I was referring to outside of A.

As of today, any siblings of A can already pin its memory after it's
dead. I suppose your patch would add cousins to that list. It doesn't
seem like a categorial difference to me.

> If that is true, maybe we want to keep ignoring outside references for
> swap-backed pages if the folio is charged to a dead memcg? My
> understanding is that in this case we will uncharge the page from the
> dead memcg and charge the swapped entry to the parent, reducing the
> number of refs on the dead memcg. Without this check, this patch might
> prevent the charge from being moved to the parent in this case. WDYT?

I don't think it's worth it. Keeping the semantics simple and behavior
predictable is IMO more valuable.

It also wouldn't fix the scrape-before-rmdir issue Yu points out,
which I think is the more practical concern. In light of that, it
might be best to table the patch for now. (Until we have
reparent-on-delete for anon and file pages...)
Johannes Weiner Oct. 6, 2022, 3:38 p.m. UTC | #11
On Wed, Oct 05, 2022 at 11:10:37PM -0600, Yu Zhao wrote:
> On Wed, Oct 5, 2022 at 10:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Oct 05, 2022 at 03:13:38PM -0600, Yu Zhao wrote:
> > > On Wed, Oct 5, 2022 at 3:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Wed, Oct 5, 2022 at 1:48 PM Yu Zhao <yuzhao@google.com> wrote:
> > > > >
> > > > > On Wed, Oct 5, 2022 at 11:37 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > >
> > > > > > During page/folio reclaim, we check if a folio is referenced using
> > > > > > folio_referenced() to avoid reclaiming folios that have been recently
> > > > > > accessed (hot memory). The rationale is that this memory is likely to be
> > > > > > accessed soon, and hence reclaiming it will cause a refault.
> > > > > >
> > > > > > For memcg reclaim, we currently only check accesses to the folio from
> > > > > > processes in the subtree of the target memcg. This behavior was
> > > > > > originally introduced by commit bed7161a519a ("Memory controller: make
> > > > > > page_referenced() cgroup aware") a long time ago. Back then, refaulted
> > > > > > pages would get charged to the memcg of the process that was faulting them
> > > > > > in. It made sense to only consider accesses coming from processes in the
> > > > > > subtree of target_mem_cgroup. If a page was charged to memcg A but only
> > > > > > being accessed by a sibling memcg B, we would reclaim it if memcg A
> > > > > > is the reclaim target. memcg B can then fault it back in and get charged
> > > > > > for it appropriately.
> > > > > >
> > > > > > Today, this behavior still makes sense for file pages. However, unlike
> > > > > > file pages, when swapbacked pages are refaulted they are charged to the
> > > > > > memcg that was originally charged for them during swapping out. Which
> > > > > > means that if a swapbacked page is charged to memcg A but only used by
> > > > > > memcg B, and we reclaim it from memcg A, it would simply be faulted back
> > > > > > in and charged again to memcg A once memcg B accesses it. In that sense,
> > > > > > accesses from all memcgs matter equally when considering if a swapbacked
> > > > > > page/folio is a viable reclaim target.
> > > > > >
> > > > > > Modify folio_referenced() to always consider accesses from all memcgs if
> > > > > > the folio is swapbacked.
> > > > >
> > > > > It seems to me this change can potentially increase the number of
> > > > > zombie memcgs. Any risk assessment done on this?
> > > >
> > > > Do you mind elaborating the case(s) where this could happen? Is this
> > > > the cgroup v1 case in mem_cgroup_swapout() where we are reclaiming
> > > > from a zombie memcg and swapping out would let us move the charge to
> > > > the parent?
> > >
> > > The scenario is quite straightforward: for a page charged to memcg A
> > > and also actively used by memcg B, if we don't ignore the access from
> > > memcg B, we won't be able to reclaim it after memcg A is deleted.
> >
> > This patch changes the behavior of limit-induced reclaim. There is no
> > limit reclaim on A after it's been deleted. And parental/global
> > reclaim has always recognized outside references.
> 
> We use memory.reclaim to scrape memcgs right before rmdir so that they
> are unlikely to stick around. Otherwise our job scheduler would see
> less available memory and become less eager to increase load. This in
> turn reduces the chance of global reclaim, and deleted memcgs would
> stick around even longer.

Thanks for the context.

It's not great that we have to design reclaim policy around this
implementation detail of past-EOF-pins. But such is life until we get
rid of them.
Yosry Ahmed Oct. 6, 2022, 6:29 p.m. UTC | #12
On Thu, Oct 6, 2022 at 8:32 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Oct 06, 2022 at 12:30:45AM -0700, Yosry Ahmed wrote:
> > On Wed, Oct 5, 2022 at 9:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Wed, Oct 05, 2022 at 03:13:38PM -0600, Yu Zhao wrote:
> > > > On Wed, Oct 5, 2022 at 3:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > > On Wed, Oct 5, 2022 at 1:48 PM Yu Zhao <yuzhao@google.com> wrote:
> > > > > >
> > > > > > On Wed, Oct 5, 2022 at 11:37 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > >
> > > > > > > During page/folio reclaim, we check if a folio is referenced using
> > > > > > > folio_referenced() to avoid reclaiming folios that have been recently
> > > > > > > accessed (hot memory). The rationale is that this memory is likely to be
> > > > > > > accessed soon, and hence reclaiming it will cause a refault.
> > > > > > >
> > > > > > > For memcg reclaim, we currently only check accesses to the folio from
> > > > > > > processes in the subtree of the target memcg. This behavior was
> > > > > > > originally introduced by commit bed7161a519a ("Memory controller: make
> > > > > > > page_referenced() cgroup aware") a long time ago. Back then, refaulted
> > > > > > > pages would get charged to the memcg of the process that was faulting them
> > > > > > > in. It made sense to only consider accesses coming from processes in the
> > > > > > > subtree of target_mem_cgroup. If a page was charged to memcg A but only
> > > > > > > being accessed by a sibling memcg B, we would reclaim it if memcg A
> > > > > > > is the reclaim target. memcg B can then fault it back in and get charged
> > > > > > > for it appropriately.
> > > > > > >
> > > > > > > Today, this behavior still makes sense for file pages. However, unlike
> > > > > > > file pages, when swapbacked pages are refaulted they are charged to the
> > > > > > > memcg that was originally charged for them during swapping out. Which
> > > > > > > means that if a swapbacked page is charged to memcg A but only used by
> > > > > > > memcg B, and we reclaim it from memcg A, it would simply be faulted back
> > > > > > > in and charged again to memcg A once memcg B accesses it. In that sense,
> > > > > > > accesses from all memcgs matter equally when considering if a swapbacked
> > > > > > > page/folio is a viable reclaim target.
> > > > > > >
> > > > > > > Modify folio_referenced() to always consider accesses from all memcgs if
> > > > > > > the folio is swapbacked.
> > > > > >
> > > > > > It seems to me this change can potentially increase the number of
> > > > > > zombie memcgs. Any risk assessment done on this?
> > > > >
> > > > > Do you mind elaborating the case(s) where this could happen? Is this
> > > > > the cgroup v1 case in mem_cgroup_swapout() where we are reclaiming
> > > > > from a zombie memcg and swapping out would let us move the charge to
> > > > > the parent?
> > > >
> > > > The scenario is quite straightforward: for a page charged to memcg A
> > > > and also actively used by memcg B, if we don't ignore the access from
> > > > memcg B, we won't be able to reclaim it after memcg A is deleted.
> > >
> > > This patch changes the behavior of limit-induced reclaim. There is no
> > > limit reclaim on A after it's been deleted. And parental/global
> > > reclaim has always recognized outside references.
> >
> > Do you mind elaborating on the parental reclaim part?
> >
> > I am looking at the code and it looks like memcg reclaim of a parent
> > (limit-induced or proactive) will only consider references coming from
> > its subtree, even when reclaiming from its dead children. It looks
> > like as long as sc->target_mem_cgroup is set, we ignore outside
> > references (relative to sc->target_mem_cgroup).
>
> Yes, I was referring to outside of A.
>
> As of today, any siblings of A can already pin its memory after it's
> dead. I suppose your patch would add cousins to that list. It doesn't
> seem like a categorial difference to me.
>
> > If that is true, maybe we want to keep ignoring outside references for
> > swap-backed pages if the folio is charged to a dead memcg? My
> > understanding is that in this case we will uncharge the page from the
> > dead memcg and charge the swapped entry to the parent, reducing the
> > number of refs on the dead memcg. Without this check, this patch might
> > prevent the charge from being moved to the parent in this case. WDYT?
>
> I don't think it's worth it. Keeping the semantics simple and behavior
> predictable is IMO more valuable.
>
> It also wouldn't fix the scrape-before-rmdir issue Yu points out,
> which I think is the more practical concern. In light of that, it
> might be best to table the patch for now. (Until we have
> reparent-on-delete for anon and file pages...)

If we add a mem_cgroup_online() check, we partially solve the problem.
Maybe scrape-before-rmdir will not reclaim those pages at once, but
the next time we try to reclaim from the dead memcg (global, limit,
proactive, ...) we will reclaim the pages. So we will only be delaying
the freeing of those zombie memcgs.
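To make the proposed policy concrete, here is a toy userspace model (hypothetical, for illustration only; the function name and boolean flags are made up and this is not the actual folio_referenced() code). It captures the decision being discussed: targeted memcg reclaim ignores outside references, except for swapbacked folios charged to a still-online memcg, while the offline check preserves the charge-to-parent path for dead memcgs:

```python
def consider_outside_refs(targeted_reclaim, swapbacked, memcg_online):
    """Model: should references from outside the target memcg's subtree
    count when deciding whether a folio is a viable reclaim target?"""
    if not targeted_reclaim:
        # Global reclaim has always recognized outside references.
        return True
    # The patch: swapbacked pages would just be refaulted and recharged
    # to the same memcg, so accesses from all memcgs matter equally.
    # The proposed mem_cgroup_online() check restores the old behavior
    # for dead memcgs so scrape-before-rmdir keeps working.
    return swapbacked and memcg_online

# Global reclaim: outside references always count.
assert consider_outside_refs(False, False, False)
# Memcg reclaim, file page: old subtree-only behavior is kept.
assert not consider_outside_refs(True, False, True)
# Memcg reclaim, swapbacked page, live memcg: count all memcgs.
assert consider_outside_refs(True, True, True)
# Memcg reclaim, swapbacked page, dead memcg: allow reclaim so the
# charge can move to the parent.
assert not consider_outside_refs(True, True, False)
```

This is only a sketch of the semantics under discussion, not a claim about the final kernel implementation.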

We have had a variant of this behavior in our production for a while,
and we haven't been having a zombie problem, but I guess the way our
userspace is set up is smart enough to stop referencing shared memory
charged to a dead memcg. One could argue that in general, having
swap-backed pages charged to a dead memcg is less common compared to
file pages.

Anyway, I wanted to give some background about why this is a real
problem that we have been facing. If a tmpfs mount is used by multiple
memcgs (say A and B), and we try to reclaim from A using
memory.reclaim, we would observe pages that are more heavily used by B
as cold and reclaim them. Afterwards, they get swapped back once B
accesses them again, and charged back to A. We observe increased
refaults and fallback on proactive reclaim, missing the actual cold
memory. With the patched behavior, reclaim chooses pages that are cold
relative to both A and B, we observe fewer refaults, and end up
actually trimming cold memory. I suspect the same can happen with
limit-induced reclaim, where memcg A keeps hitting its limit, we may
reclaim pages used by B, they get refaulted back in, we go to reclaim
again,... thrashing.
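The refault/recharge asymmetry driving this scenario can be sketched as a toy model (hypothetical helper, not kernel code): a reclaimed swapbacked page is recharged to its original memcg on refault, while a reclaimed file page is charged to whichever memcg faults it back in.

```python
def refault_charge(kind, original_memcg, faulting_memcg):
    """Model which memcg pays for a page after reclaim + refault."""
    if kind == "swapbacked":
        # Swap entries remember the memcg originally charged at swapout,
        # so the charge returns to it regardless of who faults the page.
        return original_memcg
    # File pages are charged to the memcg of the faulting process.
    return faulting_memcg

# A's tmpfs page used only by B: reclaiming it from A is futile, since
# B's next access recharges A, and A hits its limit again (thrashing).
assert refault_charge("swapbacked", "A", "B") == "A"
# A file page in the same situation migrates to B, so subtree-only
# reference checks still make sense for file pages.
assert refault_charge("file", "A", "B") == "B"
```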

So it is a real problem that we are trying to solve. Given that adding
an online check would partially solve the scrape-before-rmdir problem
(zombie memcgs might stick around a little bit longer, but they should
go away eventually), and that this patch addresses a real
problem and would overall improve reclaim policy for living memcgs
(which is *hopefully* the majority), it would be great if you could
reconsider this. Please let me know if anything that I have said
doesn't make sense to you.
Yu Zhao Oct. 6, 2022, 9:56 p.m. UTC | #13
On Thu, Oct 6, 2022 at 12:30 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Oct 6, 2022 at 8:32 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, Oct 06, 2022 at 12:30:45AM -0700, Yosry Ahmed wrote:
> > > On Wed, Oct 5, 2022 at 9:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > On Wed, Oct 05, 2022 at 03:13:38PM -0600, Yu Zhao wrote:
> > > > > On Wed, Oct 5, 2022 at 3:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > >
> > > > > > On Wed, Oct 5, 2022 at 1:48 PM Yu Zhao <yuzhao@google.com> wrote:
> > > > > > >
> > > > > > > On Wed, Oct 5, 2022 at 11:37 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > >
> > > > > > > > During page/folio reclaim, we check if a folio is referenced using
> > > > > > > > folio_referenced() to avoid reclaiming folios that have been recently
> > > > > > > > accessed (hot memory). The rationale is that this memory is likely to be
> > > > > > > > accessed soon, and hence reclaiming it will cause a refault.
> > > > > > > >
> > > > > > > > For memcg reclaim, we currently only check accesses to the folio from
> > > > > > > > processes in the subtree of the target memcg. This behavior was
> > > > > > > > originally introduced by commit bed7161a519a ("Memory controller: make
> > > > > > > > page_referenced() cgroup aware") a long time ago. Back then, refaulted
> > > > > > > > pages would get charged to the memcg of the process that was faulting them
> > > > > > > > in. It made sense to only consider accesses coming from processes in the
> > > > > > > > subtree of target_mem_cgroup. If a page was charged to memcg A but only
> > > > > > > > being accessed by a sibling memcg B, we would reclaim it if memcg A
> > > > > > > > is the reclaim target. memcg B can then fault it back in and get charged
> > > > > > > > for it appropriately.
> > > > > > > >
> > > > > > > > Today, this behavior still makes sense for file pages. However, unlike
> > > > > > > > file pages, when swapbacked pages are refaulted they are charged to the
> > > > > > > > memcg that was originally charged for them during swapping out. Which
> > > > > > > > means that if a swapbacked page is charged to memcg A but only used by
> > > > > > > > memcg B, and we reclaim it from memcg A, it would simply be faulted back
> > > > > > > > in and charged again to memcg A once memcg B accesses it. In that sense,
> > > > > > > > accesses from all memcgs matter equally when considering if a swapbacked
> > > > > > > > page/folio is a viable reclaim target.
> > > > > > > >
> > > > > > > > Modify folio_referenced() to always consider accesses from all memcgs if
> > > > > > > > the folio is swapbacked.
> > > > > > >
> > > > > > > It seems to me this change can potentially increase the number of
> > > > > > > zombie memcgs. Any risk assessment done on this?
> > > > > >
> > > > > > Do you mind elaborating the case(s) where this could happen? Is this
> > > > > > the cgroup v1 case in mem_cgroup_swapout() where we are reclaiming
> > > > > > from a zombie memcg and swapping out would let us move the charge to
> > > > > > the parent?
> > > > >
> > > > > The scenario is quite straightforward: for a page charged to memcg A
> > > > > and also actively used by memcg B, if we don't ignore the access from
> > > > > memcg B, we won't be able to reclaim it after memcg A is deleted.
> > > >
> > > > This patch changes the behavior of limit-induced reclaim. There is no
> > > > limit reclaim on A after it's been deleted. And parental/global
> > > > reclaim has always recognized outside references.
> > >
> > > Do you mind elaborating on the parental reclaim part?
> > >
> > > I am looking at the code and it looks like memcg reclaim of a parent
> > > (limit-induced or proactive) will only consider references coming from
> > > its subtree, even when reclaiming from its dead children. It looks
> > > like as long as sc->target_mem_cgroup is set, we ignore outside
> > > references (relative to sc->target_mem_cgroup).
> >
> > Yes, I was referring to outside of A.
> >
> > As of today, any siblings of A can already pin its memory after it's
> > dead. I suppose your patch would add cousins to that list. It doesn't
> > seem like a categorial difference to me.
> >
> > > If that is true, maybe we want to keep ignoring outside references for
> > > swap-backed pages if the folio is charged to a dead memcg? My
> > > understanding is that in this case we will uncharge the page from the
> > > dead memcg and charge the swapped entry to the parent, reducing the
> > > number of refs on the dead memcg. Without this check, this patch might
> > > prevent the charge from being moved to the parent in this case. WDYT?
> >
> > I don't think it's worth it. Keeping the semantics simple and behavior
> > predictable is IMO more valuable.
> >
> > It also wouldn't fix the scrape-before-rmdir issue Yu points out,
> > which I think is the more practical concern. In light of that, it
> > might be best to table the patch for now. (Until we have
> > reparent-on-delete for anon and file pages...)
>
> If we add a mem_cgroup_online() check, we partially solve the problem.
> Maybe scrape-before-rmdir will not reclaim those pages at once, but
> the next time we try to reclaim from the dead memcg (global, limit,
> proactive,..) we will reclaim the pages. So we will only be delaying
> the freeing of those zombie memcgs.

As an observer, this seems to be the death by a thousand cuts of the
existing mechanism that Google has been using to virtually eliminate
zombie memcgs for the last decade.

I understand the desire to fix a specific problem with this patch. But
it's methodologically wrong to focus on specific problems without
considering the larger picture and how it's evolving.

Our internal memory.reclaim, which is being superseded, is a superset
of the mainline version. It has two flags relevant to this discussion:
    1. hierarchical walk of a parent
    2. target dead memcgs only
With these, our job scheduler (Borg) doesn't need to scrape before
rmdir at all. It does something called "applying root pressure",
which, as one might imagine, is to write to the root memory.reclaim
with the above flags. We have metrics on the efficiency of this
mechanism and they are closely monitored.

Why is this important? Because Murphy's law is generally true for a
fleet when its scale and diversity are high enough. *We used
to run out of memcg IDs.* And we are still carrying a patch that enlarges
swap_cgroup->id from unsigned short to unsigned int.

Compared with the recharging proposal we have been discussing, the two
cases that the above solution can't help:
    1. kernel long pins
    2. foreign mlocks
But it's still *a lot* more reliable than the scrape-before-rmdir
approach (or scrape-after-rmdir if we can hold the FD open before
rmdir), because it offers unlimited retries and lets no dead memcgs,
e.g., those created and deleted by jobs (not the job scheduler), escape.

Unless you can provide data, my past experience tells me that this
patch will make scrape-before-rmdir unacceptable (in terms of
effectiveness) to our fleet. Of course you can add additional code,
i.e., those two flags or the offline check, which I don't object to.
Frankly, let me ask the real question: are you really sure this is the
best for us and the rest of the community?

(I think the recharging proposal is.)
Yosry Ahmed Oct. 6, 2022, 11:07 p.m. UTC | #14
On Thu, Oct 6, 2022 at 2:57 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Thu, Oct 6, 2022 at 12:30 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Oct 6, 2022 at 8:32 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Thu, Oct 06, 2022 at 12:30:45AM -0700, Yosry Ahmed wrote:
> > > > On Wed, Oct 5, 2022 at 9:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > >
> > > > > On Wed, Oct 05, 2022 at 03:13:38PM -0600, Yu Zhao wrote:
> > > > > > On Wed, Oct 5, 2022 at 3:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > >
> > > > > > > On Wed, Oct 5, 2022 at 1:48 PM Yu Zhao <yuzhao@google.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Oct 5, 2022 at 11:37 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > > >
> > > > > > > > > During page/folio reclaim, we check if a folio is referenced using
> > > > > > > > > folio_referenced() to avoid reclaiming folios that have been recently
> > > > > > > > > accessed (hot memory). The rationale is that this memory is likely to be
> > > > > > > > > accessed soon, and hence reclaiming it will cause a refault.
> > > > > > > > >
> > > > > > > > > For memcg reclaim, we currently only check accesses to the folio from
> > > > > > > > > processes in the subtree of the target memcg. This behavior was
> > > > > > > > > originally introduced by commit bed7161a519a ("Memory controller: make
> > > > > > > > > page_referenced() cgroup aware") a long time ago. Back then, refaulted
> > > > > > > > > pages would get charged to the memcg of the process that was faulting them
> > > > > > > > > in. It made sense to only consider accesses coming from processes in the
> > > > > > > > > subtree of target_mem_cgroup. If a page was charged to memcg A but only
> > > > > > > > > being accessed by a sibling memcg B, we would reclaim it if memcg A
> > > > > > > > > is the reclaim target. memcg B can then fault it back in and get charged
> > > > > > > > > for it appropriately.
> > > > > > > > >
> > > > > > > > > Today, this behavior still makes sense for file pages. However, unlike
> > > > > > > > > file pages, when swapbacked pages are refaulted they are charged to the
> > > > > > > > > memcg that was originally charged for them during swapping out. Which
> > > > > > > > > means that if a swapbacked page is charged to memcg A but only used by
> > > > > > > > > memcg B, and we reclaim it from memcg A, it would simply be faulted back
> > > > > > > > > in and charged again to memcg A once memcg B accesses it. In that sense,
> > > > > > > > > accesses from all memcgs matter equally when considering if a swapbacked
> > > > > > > > > page/folio is a viable reclaim target.
> > > > > > > > >
> > > > > > > > > Modify folio_referenced() to always consider accesses from all memcgs if
> > > > > > > > > the folio is swapbacked.
> > > > > > > >
> > > > > > > > It seems to me this change can potentially increase the number of
> > > > > > > > zombie memcgs. Any risk assessment done on this?
> > > > > > >
> > > > > > > Do you mind elaborating the case(s) where this could happen? Is this
> > > > > > > the cgroup v1 case in mem_cgroup_swapout() where we are reclaiming
> > > > > > > from a zombie memcg and swapping out would let us move the charge to
> > > > > > > the parent?
> > > > > >
> > > > > > The scenario is quite straightforward: for a page charged to memcg A
> > > > > > and also actively used by memcg B, if we don't ignore the access from
> > > > > > memcg B, we won't be able to reclaim it after memcg A is deleted.
> > > > >
> > > > > This patch changes the behavior of limit-induced reclaim. There is no
> > > > > limit reclaim on A after it's been deleted. And parental/global
> > > > > reclaim has always recognized outside references.
> > > >
> > > > Do you mind elaborating on the parental reclaim part?
> > > >
> > > > I am looking at the code and it looks like memcg reclaim of a parent
> > > > (limit-induced or proactive) will only consider references coming from
> > > > its subtree, even when reclaiming from its dead children. It looks
> > > > like as long as sc->target_mem_cgroup is set, we ignore outside
> > > > references (relative to sc->target_mem_cgroup).
> > >
> > > Yes, I was referring to outside of A.
> > >
> > > As of today, any siblings of A can already pin its memory after it's
> > > dead. I suppose your patch would add cousins to that list. It doesn't
> > > seem like a categorial difference to me.
> > >
> > > > If that is true, maybe we want to keep ignoring outside references for
> > > > swap-backed pages if the folio is charged to a dead memcg? My
> > > > understanding is that in this case we will uncharge the page from the
> > > > dead memcg and charge the swapped entry to the parent, reducing the
> > > > number of refs on the dead memcg. Without this check, this patch might
> > > > prevent the charge from being moved to the parent in this case. WDYT?
> > >
> > > I don't think it's worth it. Keeping the semantics simple and behavior
> > > predictable is IMO more valuable.
> > >
> > > It also wouldn't fix the scrape-before-rmdir issue Yu points out,
> > > which I think is the more practical concern. In light of that, it
> > > might be best to table the patch for now. (Until we have
> > > reparent-on-delete for anon and file pages...)
> >
> > If we add a mem_cgroup_online() check, we partially solve the problem.
> > Maybe scrape-before-rmdir will not reclaim those pages at once, but
> > the next time we try to reclaim from the dead memcg (global, limit,
> > proactive,..) we will reclaim the pages. So we will only be delaying
> > the freeing of those zombie memcgs.
>
> As an observer, this seems to be the death by a thousand cuts of the
> existing mechanism that Google has been using to virtually eliminate
> zombie memcgs for the last decade.
>
> I understand the desire to fix a specific problem with this patch. But
> it's methodically wrong to focus on specific problems without
> considering the large picture and how it's evolving.
>
> Our internal memory.reclaim, which is being superseded, is a superset
> of the mainline version. It has two flags relevant to this discussion:
>     1. hierarchical walk of a parent
>     2. target dead memcgs only
> With these, our job scheduler (Borg) doesn't need to scrape before
> rmdir at all. It does something called "applying root pressure",
> which, as one might imagine, is to write to the root memory.reclaim
> with the above flags. We have metrics on the efficiency of this
> mechanism and they are closely monitored.
>
> Why is this important? Because Murphy's law is generally true for a
> fleet when its scale and diversity is large and high enough. *We used
> to run out memcg IDs.* And we are still carrying a patch that enlarges
> swap_cgroup->id from unsigned short to unsigned int.
>
> Compared with the recharging proposal we have been discussing, the two
> cases that the above solution can't help:
>     1. kernel long pins
>     2. foreign mlocks
> But it's still *a lot* more reliable than the scrape-before-rmdir
> approach (or scrape-after-rmdir if we can hold the FD open before
> rmdir), because it offers unlimited retries and lets no dead memcgs,
> e.g., those created and deleted by jobs (not the job scheduler), escape.
>
> Unless you can provide data, my past experience tells me that this
> patch will make scrape-before-rmdir unacceptable (in terms of
> effectiveness) to our fleet. Of course you can add additional code,
> i.e., those two flags or the offline check, which I don't object to.

I agree that the zombie memcgs problem is a serious problem that needs
to be dealt with, and recharging memory when a memcg is dying seems
like a promising direction. However, this patch's goal is to improve
reclaim of shared swapbacked memory in general, regardless of the
zombie memcgs problem. I understand that the current version affects
the zombie memcgs problem, but I believe this is an oversight that
needs to be fixed, not something that should make us leave the reclaim
problem unsolved.

I think this patch + an offline check should be sufficient to fix the
reclaim problem while not regressing the zombie memcgs problem for
multiple reasons, see below.

If we implement recharging memory of dying memcgs in the future, we
can always come back and remove the offline check.

> Frankly, let me ask the real question: are you really sure this is the
> best for us and the rest of the community?

Yes. This patch should improve reclaim of shared memory as I
elaborated in my previous email, and with an offline check I believe
we shouldn't be regressing the zombie memcgs problem whether for
Google or the community, for the following reasons.

For Google:
a) With an offline check, our root memcg pressure method should remain
unaffected, as this patch + offline check should be a nop for dead
memcgs.
b) The current version of our internal memory.reclaim actually
considers accesses from outside memcgs, specifically for the reason I
am sending this patch. In retrospect, maybe this should *not* be
the case for the root zombie cleanup scenario, but maybe this was an
oversight, or it wasn't the case originally. Anyway, as I have
mentioned before, I think for the case of shared memory our userspace
has become smart enough to stop referencing the shared memory charged
to dead memcgs.

For the community:
- In the scenario of scrape-before-rmdir where that memcg has shared
memory charged to it (which shouldn't be very likely), this patch
would only delay freeing the dead memcgs until the next reclaim cycle.
Given that memory.reclaim was only recently upstreamed, and
memory.force_empty is not in cgroup v2, I am not really sure if a lot
of people are depending on this scrape-before-rmdir behavior.

For both Google and the community, we get a better reclaim policy for
shared memory.

Again, this conversation got quite derailed. There is a clear problem
with reclaiming shared memory that we are trying to fix. We should not
regress the zombie cleanup behavior (thanks Yu for catching that), but
hopefully this can be accomplished with a simple offline check.

Does this make sense to you, Yu and Johannes? If yes, I can send a v3
with an added offline check.

>
> (I think the recharging proposal is.)
Yu Zhao Oct. 6, 2022, 11:55 p.m. UTC | #15
On Thu, Oct 6, 2022 at 5:07 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Oct 6, 2022 at 2:57 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Thu, Oct 6, 2022 at 12:30 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Thu, Oct 6, 2022 at 8:32 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > On Thu, Oct 06, 2022 at 12:30:45AM -0700, Yosry Ahmed wrote:
> > > > > On Wed, Oct 5, 2022 at 9:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > >
> > > > > > On Wed, Oct 05, 2022 at 03:13:38PM -0600, Yu Zhao wrote:
> > > > > > > On Wed, Oct 5, 2022 at 3:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Oct 5, 2022 at 1:48 PM Yu Zhao <yuzhao@google.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Oct 5, 2022 at 11:37 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > > > >
> > > > > > > > > > During page/folio reclaim, we check if a folio is referenced using
> > > > > > > > > > folio_referenced() to avoid reclaiming folios that have been recently
> > > > > > > > > > accessed (hot memory). The rationale is that this memory is likely to be
> > > > > > > > > > accessed soon, and hence reclaiming it will cause a refault.
> > > > > > > > > >
> > > > > > > > > > For memcg reclaim, we currently only check accesses to the folio from
> > > > > > > > > > processes in the subtree of the target memcg. This behavior was
> > > > > > > > > > originally introduced by commit bed7161a519a ("Memory controller: make
> > > > > > > > > > page_referenced() cgroup aware") a long time ago. Back then, refaulted
> > > > > > > > > > pages would get charged to the memcg of the process that was faulting them
> > > > > > > > > > in. It made sense to only consider accesses coming from processes in the
> > > > > > > > > > subtree of target_mem_cgroup. If a page was charged to memcg A but only
> > > > > > > > > > being accessed by a sibling memcg B, we would reclaim it if memcg A
> > > > > > > > > > is the reclaim target. memcg B can then fault it back in and get charged
> > > > > > > > > > for it appropriately.
> > > > > > > > > >
> > > > > > > > > > Today, this behavior still makes sense for file pages. However, unlike
> > > > > > > > > > file pages, when swapbacked pages are refaulted they are charged to the
> > > > > > > > > > memcg that was originally charged for them during swapping out. Which
> > > > > > > > > > means that if a swapbacked page is charged to memcg A but only used by
> > > > > > > > > > memcg B, and we reclaim it from memcg A, it would simply be faulted back
> > > > > > > > > > in and charged again to memcg A once memcg B accesses it. In that sense,
> > > > > > > > > > accesses from all memcgs matter equally when considering if a swapbacked
> > > > > > > > > > page/folio is a viable reclaim target.
> > > > > > > > > >
> > > > > > > > > > Modify folio_referenced() to always consider accesses from all memcgs if
> > > > > > > > > > the folio is swapbacked.
> > > > > > > > >
> > > > > > > > > It seems to me this change can potentially increase the number of
> > > > > > > > > zombie memcgs. Any risk assessment done on this?
> > > > > > > >
> > > > > > > > Do you mind elaborating the case(s) where this could happen? Is this
> > > > > > > > the cgroup v1 case in mem_cgroup_swapout() where we are reclaiming
> > > > > > > > from a zombie memcg and swapping out would let us move the charge to
> > > > > > > > the parent?
> > > > > > >
> > > > > > > The scenario is quite straightforward: for a page charged to memcg A
> > > > > > > and also actively used by memcg B, if we don't ignore the access from
> > > > > > > memcg B, we won't be able to reclaim it after memcg A is deleted.
> > > > > >
> > > > > > This patch changes the behavior of limit-induced reclaim. There is no
> > > > > > limit reclaim on A after it's been deleted. And parental/global
> > > > > > reclaim has always recognized outside references.
> > > > >
> > > > > Do you mind elaborating on the parental reclaim part?
> > > > >
> > > > > I am looking at the code and it looks like memcg reclaim of a parent
> > > > > (limit-induced or proactive) will only consider references coming from
> > > > > its subtree, even when reclaiming from its dead children. It looks
> > > > > like as long as sc->target_mem_cgroup is set, we ignore outside
> > > > > references (relative to sc->target_mem_cgroup).
> > > >
> > > > Yes, I was referring to outside of A.
> > > >
> > > > As of today, any siblings of A can already pin its memory after it's
> > > > dead. I suppose your patch would add cousins to that list. It doesn't
> > > > seem like a categorical difference to me.
> > > >
> > > > > If that is true, maybe we want to keep ignoring outside references for
> > > > > swap-backed pages if the folio is charged to a dead memcg? My
> > > > > understanding is that in this case we will uncharge the page from the
> > > > > dead memcg and charge the swapped entry to the parent, reducing the
> > > > > number of refs on the dead memcg. Without this check, this patch might
> > > > > prevent the charge from being moved to the parent in this case. WDYT?
> > > >
> > > > I don't think it's worth it. Keeping the semantics simple and behavior
> > > > predictable is IMO more valuable.
> > > >
> > > > It also wouldn't fix the scrape-before-rmdir issue Yu points out,
> > > > which I think is the more practical concern. In light of that, it
> > > > might be best to table the patch for now. (Until we have
> > > > reparent-on-delete for anon and file pages...)
> > >
> > > If we add a mem_cgroup_online() check, we partially solve the problem.
> > > Maybe scrape-before-rmdir will not reclaim those pages at once, but
> > > the next time we try to reclaim from the dead memcg (global, limit,
> > > proactive,..) we will reclaim the pages. So we will only be delaying
> > > the freeing of those zombie memcgs.
> >
> > As an observer, this seems to be the death by a thousand cuts of the
> > existing mechanism that Google has been using to virtually eliminate
> > zombie memcgs for the last decade.
> >
> > I understand the desire to fix a specific problem with this patch. But
> > it's methodically wrong to focus on specific problems without
> > considering the large picture and how it's evolving.
> >
> > Our internal memory.reclaim, which is being superseded, is a superset
> > of the mainline version. It has two flags relevant to this discussion:
> >     1. hierarchical walk of a parent
> >     2. target dead memcgs only
> > With these, our job scheduler (Borg) doesn't need to scrape before
> > rmdir at all. It does something called "applying root pressure",
> > which, as one might imagine, is to write to the root memory.reclaim
> > with the above flags. We have metrics on the efficiency of this
> > mechanism and they are closely monitored.
> >
> > Why is this important? Because Murphy's law is generally true for a
> > fleet when its scale and diversity are high enough. *We used to run
> > out of memcg IDs.* And we are still carrying a patch that enlarges
> > swap_cgroup->id from unsigned short to unsigned int.
> >
> > Compared with the recharging proposal we have been discussing, the two
> > cases that the above solution can't help with:
> >     1. kernel long pins
> >     2. foreign mlocks
> > But it's still *a lot* more reliable than the scrape-before-rmdir
> > approach (or scrape-after-rmdir if we can hold the FD open before
> > rmdir), because it offers unlimited retries and no dead memcgs, e.g.,
> > those created and deleted by jobs (not the job scheduler), can escape.
> >
> > Unless you can provide data, my past experience tells me that this
> > patch will make scrape-before-rmdir unacceptable (in terms of
> > effectiveness) to our fleet. Of course you can add additional code,
> > i.e., those two flags or the offline check, which I don't object to.
>
> I agree that the zombie memcgs problem is a serious problem that needs
> to be dealt with, and recharging memory when a memcg is dying seems
> like a promising direction. However, this patch's goal is to improve
> reclaim of shared swapbacked memory in general, regardless of the
> zombie memcgs problem. I understand that the current version affects
> the zombie memcgs problem, but I believe this is an oversight that
> needs to be fixed, not something that should make us leave the reclaim
> problem unsolved.
>
> I think this patch + an offline check should be sufficient to fix the
> reclaim problem while not regressing the zombie memcgs problem for
> multiple reasons, see below.
>
> If we implement recharging memory of dying memcgs in the future, we
> can always come back and remove the offline check.
>
> > Frankly, let me ask the real question: are you really sure this is the
> > best for us and the rest of the community?
>
> Yes. This patch should improve reclaim of shared memory as I
> elaborated in my previous email, and with an offline check I believe
> we shouldn't be regressing the zombie memcgs problem whether for
> Google or the community, for the following reasons.

But it will regress a (non-root) parent who wishes to keep the hot
memory it shared with its deleted children to itself.

I think we should sleep on this.

Patch

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index ca3e4ba6c58c..8130586eb637 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -352,7 +352,7 @@  static inline int page_try_share_anon_rmap(struct page *page)
  * Called from mm/vmscan.c to handle paging out
  */
 int folio_referenced(struct folio *, int is_locked,
-			struct mem_cgroup *memcg, unsigned long *vm_flags);
+		     struct mem_cgroup *target_memcg, unsigned long *vm_flags);
 
 void try_to_migrate(struct folio *folio, enum ttu_flags flags);
 void try_to_unmap(struct folio *, enum ttu_flags flags);
diff --git a/mm/rmap.c b/mm/rmap.c
index b6743c2b8b5f..7c98d0ca7cc6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -882,7 +882,7 @@  static bool invalid_folio_referenced_vma(struct vm_area_struct *vma, void *arg)
  * folio_referenced() - Test if the folio was referenced.
  * @folio: The folio to test.
  * @is_locked: Caller holds lock on the folio.
- * @memcg: target memory cgroup
+ * @target_memcg: target memory cgroup of reclaim.
  * @vm_flags: A combination of all the vma->vm_flags which referenced the folio.
  *
  * Quick test_and_clear_referenced for all mappings of a folio,
@@ -891,12 +891,12 @@  static bool invalid_folio_referenced_vma(struct vm_area_struct *vma, void *arg)
  * the function bailed out due to rmap lock contention.
  */
 int folio_referenced(struct folio *folio, int is_locked,
-		     struct mem_cgroup *memcg, unsigned long *vm_flags)
+		     struct mem_cgroup *target_memcg, unsigned long *vm_flags)
 {
 	int we_locked = 0;
 	struct folio_referenced_arg pra = {
 		.mapcount = folio_mapcount(folio),
-		.memcg = memcg,
+		.memcg = target_memcg,
 	};
 	struct rmap_walk_control rwc = {
 		.rmap_one = folio_referenced_one,
@@ -919,13 +919,32 @@  int folio_referenced(struct folio *folio, int is_locked,
 	}
 
 	/*
-	 * If we are reclaiming on behalf of a cgroup, skip
-	 * counting on behalf of references from different
-	 * cgroups
+	 * We check references to folios to make sure we don't reclaim hot
+	 * folios that are likely to be refaulted soon. If we are reclaiming
+	 * memory on behalf of a memcg, we may want to skip references from
+	 * processes outside the target memcg's subtree.
+	 *
+	 * For file folios, we only consider references from processes in the
+	 * subtree of the target memcg. If memcg A is under reclaim, and a
+	 * folio is charged to memcg A but only referenced by processes in
+	 * memcg B, we ignore references from memcg B and try to reclaim it.
+	 * If it is later accessed by memcg B it will be faulted back in and
+	 * charged appropriately to memcg B. For memcg A, this is cold memory
+	 * that should be reclaimed.
+	 *
+	 * On the other hand, when swapbacked folios are faulted in, they get
+	 * charged back to the memcg they were charged to at the time of
+	 * swapping out. This means that if a folio that is charged to
+	 * memcg A gets swapped out, it will get charged back to A when any
+	 * process in any memcg accesses it. In that sense, we need to consider
+	 * references from all memcgs when considering whether to reclaim a
+	 * swapbacked folio.
+	 *
+	 * Hence, only skip references from outside the target memcg (if any) if
+	 * the folio is not swapbacked.
 	 */
-	if (memcg) {
+	if (target_memcg && !folio_test_swapbacked(folio))
 		rwc.invalid_vma = invalid_folio_referenced_vma;
-	}
 
 	rmap_walk(folio, &rwc);
 	*vm_flags = pra.vm_flags;