diff mbox series

mm: madvise: fix uneven accounting of psi

Message ID 1685531374-6091-1-git-send-email-quic_charante@quicinc.com (mailing list archive)
State New
Headers show
Series mm: madvise: fix uneven accounting of psi | expand

Commit Message

Charan Teja Kalla May 31, 2023, 11:09 a.m. UTC
A folio turns into a Workingset during:
1) shrink_active_list() placing the folio from active to inactive list.
2) When a workingset transition is happening during the folio refault.

And when Workingset is set on a folio, PSI for memory can be accounted
during a) That folio is being reclaimed and b) Refault of that folio.

There exists clients who can do the proactive reclaim using the system
calls like madvise(), whose folios can be safely treated as inactive
folios assuming the client knows that these folios are not needed in the
near future thus wanted to reclaim them. For such folios psi is not
accounted uniformly:
a) A folio started at inactive and moved to active as part of accesses.
Workingset is absent on the folio thus madvise(MADV_PAGEOUT) don't
account such folios for PSI.

b) When the same folio transition from inactive->active and then to
inactive through shrink_active_list(). Workingset is set on the folio
thus madvise(MADV_PAGEOUT) account such folios for PSI.

c) When the same folio is part of active list directly as a result of
folio refault and this was a workingset folio prior to eviction.
Workingset is set on the folio thus madvise(MADV_PAGEOUT) account such
folios for PSI.

As said about the MADV_PAGEOUT on a folio is accounted in b) and c) but
not in a) which is inconsistent. Remove this inconsistency by always not
considering the PSI for folios that are getting reclaimed through
madvise(MADV_PAGEOUT) by clearing the Workingset on a folio. This
consistency of clearing the workingset was chosen under the assumption
that client knows these folios are not in active use thus reclaiming
them hence not eligible as workingset folios. Probably it is the same
reason why workingset is not set on a folio through MADV_COLD but during
the shrink_active_list() though both the actions make the folio put onto
the inactive list.

This patch is tested on Android, Snapdragon SoC with 8Gb RAM, 4GB swap
mounted on zram which has 2GB of backingdev. The test case involved
launching some memory hungry apps in an order and do the proactive
reclaim for the app that went to background using madvise(MADV_PAGEOUT).
We are seeing ~40% less total values of psi mem some and full when this
patch is combined with [1].

[1]https://lore.kernel.org/all/20220214214921.419687-1-hannes@cmpxchg.org/T/#u

Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
---
 mm/madvise.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

Suren Baghdasaryan June 6, 2023, 7:48 p.m. UTC | #1
Hi Charan,

On Tue, Jun 6, 2023 at 7:54 AM Charan Teja Kalla
<quic_charante@quicinc.com> wrote:
>
> Thanks Johannes for the detailed review comments...
>
> On 6/5/2023 11:30 PM, Johannes Weiner wrote:
> >> Agree that we shouldn't be really silence the thrashing. My point is we
> >> shouldn't be  considering the folios as thrashing If those were getting
> >> reclaim by the user him self through MADV_PAGEOUT under the assumption
> >> that __user knows they are not real working set__.  Please let me know
> >> if I am not making sense here.
> > I'm not sure I agree with this. I think it misses the point of what
> > the madvise is actually for.
> >
> > The workingset is defined based on access frequency and available
> > memory. Thrashing is defined as having to read pages back shortly
> > after their eviction.
> >
> > MADV_PAGEOUT is for the application to inform the kernel that it's
> > done accessing the pages, so that the kernel can accelerate their
> > eviction over other pages that may still be in use. This is ultimately
> > meant to REDUCE reclaim and paging.
> >
> > However, in this case, the MADVISE_PAGEOUT evicts pages that are
> > reused after and then refault. It INCREASED reclaim and paging.
> >
> I agree here...
> > Surely that's a problem? And the system would have behaved better
> > without the madvise() in the first place?
> >
> Yes, the system behavior could be much better without this PAGEOUT
> operation...
> > In fact, I would argue that the pressure spike is a great signal for
> > detecting overzealous madvising. If you're redefining the workingset
> > from access frequency to "whatever the user is saying", that will take
> > away an important mechanism to detect advise bugs and unnecessary IO.
> currently wanted to do the PAGEOUT operation but what information lacks
> is if I am really operating on the workingset pages. Had the client
> knows that he is operating on the workingset pages, he could have backed
> off from madvising.
>
> I now note that I shouldn't be defining the workingset from "whatever
> user is saying". But then, IMO, there should be a way from the kernel to
> the user that his madvise operation is being performed on the workingset
> pages.
>
> One way the user can do is monitoring the PSI events while PAGEOUT is
> being performed and he may exclude those VMA's from next time.
>
> Alternatively kernel itself can support it may be through like
> MADV_PAGEOUT_INACTIVE which doesn't pageout the Workingset pages.
>
> Please let me know your opinion about this interface.
>
> This has the usecase on android where it just assumes that 2nd
> background app will most likely to be not used in the future thus
> reclaim those app pages. It works well for most of the times but such
> assumption will go wrong with the usecase I had mentioned.

Hi Folks. Sorry for being late to the party.
Yeah, userspace does not have a crystal ball to predict future user
behavior, so there will always be pathological cases when usual
assumptions and resulting madvise() would make things worse.

I think this discussion can be split into several questions/issues:
1. Inconsistency in how madvise(MADV_PAGEOUT) would affect PSI
calculation when the page is refaulted, based on the path it took
before being evicted by madvise(). In your initial description case
(a) is inconsistent with (b) and (c) and it's probably worth fixing.
IMHO (a) should be made consistent with others, not the other way
around. My reasoning is that page was expelled from the active list,
so it was part of the active workingset.

2. Whether refaults caused by incorrect madvise(MADV_PAGEOUT) should
be counted as workingset refault and affect PSI.
This one I think is trickier. IMHO it should be counted as workingset
refault simply because it was refaulted and it was part of the
workingset. Whether it should affect PSI, which is supposed to be an
indicator of "pressure" is, I think, debatable. With madvise() in the
mix, refault might happen without any real memory pressure... So, the
answer is not obvious to me.

3. Should refaults caused by incorrect madvise(MADV_PAGEOUT) be
distinguished from the ones which were evicted by kernel reclaim
mechanisms.
I can see use for that from userspace to detect incorrect madvise()
and adjust its aggressiveness. I think the API might get a bit complex
because of the need to associate refaults with specific madvise()/VMAs
to understand which hint was incorrect and adjust the behavior.

Hope my feedback is useful and if we can improve Android's userspace
behavior, I'm happy to help make that happen.
Thanks,
Suren.

>
> --Thanks.
Charan Teja Kalla June 9, 2023, 12:42 p.m. UTC | #2
Thanks Suren & Johannes,

On 6/7/2023 1:18 AM, Suren Baghdasaryan wrote:
> Hi Folks. Sorry for being late to the party.
> Yeah, userspace does not have a crystal ball to predict future user
> behavior, so there will always be pathological cases when usual
> assumptions and resulting madvise() would make things worse.
> 
> I think this discussion can be split into several questions/issues:
> 1. Inconsistency in how madvise(MADV_PAGEOUT) would affect PSI
> calculation when the page is refaulted, based on the path it took
> before being evicted by madvise(). In your initial description case
> (a) is inconsistent with (b) and (c) and it's probably worth fixing.
> IMHO (a) should be made consistent with others, not the other way
> around. My reasoning is that page was expelled from the active list,
> so it was part of the active workingset.
> 
That means we should be setting Workingset on the page while it is on
the active list and when it is being pageout through madvising. Right? I
see, this makes it consistent.

On the same note, discussing with Suren offline, Should the refaulted
madvise pages start always at the inactive list? If they are really
active, they get promoted anyway..

> 2. Whether refaults caused by incorrect madvise(MADV_PAGEOUT) should
> be counted as workingset refault and affect PSI.
> This one I think is trickier. IMHO it should be counted as workingset
> refault simply because it was refaulted and it was part of the
> workingset. Whether it should affect PSI, which is supposed to be an
> indicator of "pressure" is, I think, debatable. With madvise() in the
> mix, refault might happen without any real memory pressure... So, the
> answer is not obvious to me.
> 
> 3. Should refaults caused by incorrect madvise(MADV_PAGEOUT) be
> distinguished from the ones which were evicted by kernel reclaim
> mechanisms.
> I can see use for that from userspace to detect incorrect madvise()
> and adjust its aggressiveness. I think the API might get a bit complex
> because of the need to associate refaults with specific madvise()/VMAs
> to understand which hint was incorrect and adjust the behavior.
> Instead what is the opinion about giving an MADV_PAGEOUT_INACTIVE
interface which does operate on a page only If it is on the inactive
list and !PageWorkingset ?

> Hope my feedback is useful and if we can improve Android's userspace
> behavior, I'm happy to help make that happen.
Thanks...
Suren Baghdasaryan June 9, 2023, 11:13 p.m. UTC | #3
On Fri, Jun 9, 2023 at 5:42 AM Charan Teja Kalla
<quic_charante@quicinc.com> wrote:
>
> Thanks Suren & Johannes,
>
> On 6/7/2023 1:18 AM, Suren Baghdasaryan wrote:
> > Hi Folks. Sorry for being late to the party.
> > Yeah, userspace does not have a crystal ball to predict future user
> > behavior, so there will always be pathological cases when usual
> > assumptions and resulting madvise() would make things worse.
> >
> > I think this discussion can be split into several questions/issues:
> > 1. Inconsistency in how madvise(MADV_PAGEOUT) would affect PSI
> > calculation when the page is refaulted, based on the path it took
> > before being evicted by madvise(). In your initial description case
> > (a) is inconsistent with (b) and (c) and it's probably worth fixing.
> > IMHO (a) should be made consistent with others, not the other way
> > around. My reasoning is that page was expelled from the active list,
> > so it was part of the active workingset.
> >
> That means we should be setting Workingset on the page while it is on
> the active list and when it is being pageout through madvising. Right? I
> see, this makes it consistent.

This was my opinion but others might think otherwise, like I found out
in some recent conversations. So, it would be great to get some more
feedback before making the change.

>
> On the same note, discussing with Suren offline, Should the refaulted
> madvise pages start always at the inactive list? If they are really
> active, they get promoted anyway..
>
> > 2. Whether refaults caused by incorrect madvise(MADV_PAGEOUT) should
> > be counted as workingset refault and affect PSI.
> > This one I think is trickier. IMHO it should be counted as workingset
> > refault simply because it was refaulted and it was part of the
> > workingset. Whether it should affect PSI, which is supposed to be an
> > indicator of "pressure" is, I think, debatable. With madvise() in the
> > mix, refault might happen without any real memory pressure... So, the
> > answer is not obvious to me.
> >
> > 3. Should refaults caused by incorrect madvise(MADV_PAGEOUT) be
> > distinguished from the ones which were evicted by kernel reclaim
> > mechanisms.
> > I can see use for that from userspace to detect incorrect madvise()
> > and adjust its aggressiveness. I think the API might get a bit complex
> > because of the need to associate refaults with specific madvise()/VMAs
> > to understand which hint was incorrect and adjust the behavior.
> > Instead what is the opinion about giving an MADV_PAGEOUT_INACTIVE
> interface which does operate on a page only If it is on the inactive
> list and !PageWorkingset ?

IOW you want a less aggressive mechanism which can be used by the
userspace to tell the kernel "I think these pages won't be used but
I'm not 100% sure, so drop them only if they are inactive"?
 I don't know how much that will help when the madvise() ends up being
wrong but maybe you can quickly experiment and tell us if the
difference is substantial?

>
> > Hope my feedback is useful and if we can improve Android's userspace
> > behavior, I'm happy to help make that happen.
> Thanks...
Johannes Weiner June 12, 2023, 1:40 p.m. UTC | #4
On Fri, Jun 09, 2023 at 04:13:14PM -0700, Suren Baghdasaryan wrote:
> On Fri, Jun 9, 2023 at 5:42 AM Charan Teja Kalla
> <quic_charante@quicinc.com> wrote:
> >
> > Thanks Suren & Johannes,
> >
> > On 6/7/2023 1:18 AM, Suren Baghdasaryan wrote:
> > > Hi Folks. Sorry for being late to the party.
> > > Yeah, userspace does not have a crystal ball to predict future user
> > > behavior, so there will always be pathological cases when usual
> > > assumptions and resulting madvise() would make things worse.
> > >
> > > I think this discussion can be split into several questions/issues:
> > > 1. Inconsistency in how madvise(MADV_PAGEOUT) would affect PSI
> > > calculation when the page is refaulted, based on the path it took
> > > before being evicted by madvise(). In your initial description case
> > > (a) is inconsistent with (b) and (c) and it's probably worth fixing.
> > > IMHO (a) should be made consistent with others, not the other way
> > > around. My reasoning is that page was expelled from the active list,
> > > so it was part of the active workingset.
> > >
> > That means we should be setting Workingset on the page while it is on
> > the active list and when it is being pageout through madvising. Right? I
> > see, this makes it consistent.
> 
> This was my opinion but others might think otherwise, like I found out
> in some recent conversations. So, it would be great to get some more
> feedback before making the change.

I also agree with the consistency fix: it should set Workingset when
madvise zaps pages from the active list.
Pavan Kondeti June 13, 2023, 3:13 a.m. UTC | #5
On Fri, Jun 09, 2023 at 06:12:28PM +0530, Charan Teja Kalla wrote:
> Thanks Suren & Johannes,
> 
> On 6/7/2023 1:18 AM, Suren Baghdasaryan wrote:
> > Hi Folks. Sorry for being late to the party.
> > Yeah, userspace does not have a crystal ball to predict future user
> > behavior, so there will always be pathological cases when usual
> > assumptions and resulting madvise() would make things worse.
> > 
> > I think this discussion can be split into several questions/issues:
> > 1. Inconsistency in how madvise(MADV_PAGEOUT) would affect PSI
> > calculation when the page is refaulted, based on the path it took
> > before being evicted by madvise(). In your initial description case
> > (a) is inconsistent with (b) and (c) and it's probably worth fixing.
> > IMHO (a) should be made consistent with others, not the other way
> > around. My reasoning is that page was expelled from the active list,
> > so it was part of the active workingset.
> > 
> That means we should be setting Workingset on the page while it is on
> the active list and when it is being pageout through madvising. Right? I
> see, this makes it consistent.
> 
> On the same note, discussing with Suren offline, Should the refaulted
> madvise pages start always at the inactive list? If they are really
> active, they get promoted anyway..
> 
Can you elaborate on the rationale why refaulted madvise pages needs to
be on inactive list? If it had not been paged out via madvise, it would
have been activated no?

Thanks,
Pavan
Charan Teja Kalla June 26, 2023, 2:31 p.m. UTC | #6
Hi Suren,

On 6/10/2023 4:43 AM, Suren Baghdasaryan wrote:
>>> I can see use for that from userspace to detect incorrect madvise()
>>> and adjust its aggressiveness. I think the API might get a bit complex
>>> because of the need to associate refaults with specific madvise()/VMAs
>>> to understand which hint was incorrect and adjust the behavior.
>>> Instead what is the opinion about giving an MADV_PAGEOUT_INACTIVE
>> interface which does operate on a page only If it is on the inactive
>> list and !PageWorkingset ?
> IOW you want a less aggressive mechanism which can be used by the
> userspace to tell the kernel "I think these pages won't be used but
> I'm not 100% sure, so drop them only if they are inactive"?
>  I don't know how much that will help when the madvise() ends up being
> wrong but maybe you can quickly experiment and tell us if the
> difference is substantial?

We did some extensive testing on Android and this ask is not helping us
much. I am really not sure if there is some other usecase that can
benefit from this. So, for now I just stick to your suggestion of making
the pages on the Active list as the Workingset at the time of pageout.
>

Thanks.
Suren Baghdasaryan June 27, 2023, 10:50 p.m. UTC | #7
On Mon, Jun 26, 2023 at 7:31 AM Charan Teja Kalla
<quic_charante@quicinc.com> wrote:
>
> Hi Suren,
>
> On 6/10/2023 4:43 AM, Suren Baghdasaryan wrote:
> >>> I can see use for that from userspace to detect incorrect madvise()
> >>> and adjust its aggressiveness. I think the API might get a bit complex
> >>> because of the need to associate refaults with specific madvise()/VMAs
> >>> to understand which hint was incorrect and adjust the behavior.
> >>> Instead what is the opinion about giving an MADV_PAGEOUT_INACTIVE
> >> interface which does operate on a page only If it is on the inactive
> >> list and !PageWorkingset ?
> > IOW you want a less aggressive mechanism which can be used by the
> > userspace to tell the kernel "I think these pages won't be used but
> > I'm not 100% sure, so drop them only if they are inactive"?
> >  I don't know how much that will help when the madvise() ends up being
> > wrong but maybe you can quickly experiment and tell us if the
> > difference is substantial?
>
> We did some extensive testing on Android and this ask is not helping us
> much. I am really not sure if there is some other usecase that can
> benefit from this. So, for now I just stick to your suggestion of making
> the pages on the Active list as the Workingset at the time of pageout.
> >

Thanks for checking that. Your plan SGTM.

>
> Thanks.
>
diff mbox series

Patch

diff --git a/mm/madvise.c b/mm/madvise.c
index 340125d..3410c39 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -409,8 +409,10 @@  static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			if (folio_isolate_lru(folio)) {
 				if (folio_test_unevictable(folio))
 					folio_putback_lru(folio);
-				else
+				else {
+					folio_clear_workingset(folio);
 					list_add(&folio->lru, &folio_list);
+				}
 			}
 		} else
 			folio_deactivate(folio);
@@ -503,8 +505,10 @@  static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			if (folio_isolate_lru(folio)) {
 				if (folio_test_unevictable(folio))
 					folio_putback_lru(folio);
-				else
+				else {
+					folio_clear_workingset(folio);
 					list_add(&folio->lru, &folio_list);
+				}
 			}
 		} else
 			folio_deactivate(folio);