
[v2,2/9] mm/vmscan: protect the workingset on anonymous LRU

Message ID 1582175513-22601-3-git-send-email-iamjoonsoo.kim@lge.com (mailing list archive)
State New, archived
Series workingset protection/detection on the anonymous LRU list

Commit Message

Joonsoo Kim Feb. 20, 2020, 5:11 a.m. UTC
From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

In the current implementation, a newly created or swapped-in anonymous
page starts out on the active list. Growing the active list results in
rebalancing the active/inactive lists, so old pages on the active list
are demoted to the inactive list. Hence, pages on the active list aren't
protected at all.
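
This behaviour comes from lru_cache_add_active_or_unevictable() in
mm/swap.c, which starts every evictable anon page out active (the
pre-patch code, also visible in the diff below):

	/* evictable anon pages go straight to the active list */
	if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
		SetPageActive(page);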

The following is an example of this situation.

Assume that there are 50 hot pages on the active list. Numbers denote
the number of pages on the active/inactive lists (active | inactive).

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)

3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)

This patch tries to fix this issue.
As with the file LRU, newly created or swapped-in anonymous pages will
be inserted into the inactive list. They are promoted to the active list
if enough references happen. This simple modification changes the above
example as follows.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)

3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)

As you can see, the hot pages on the active list are now protected.

Note that this implementation has a drawback: a page cannot be promoted
and will be swapped out if its re-access interval is greater than the
size of the inactive list but less than the size of the total list
(active + inactive). To solve this potential issue, a following patch
will apply to the anonymous LRU the workingset detection previously
applied to the file LRU.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/swap.h    |  2 +-
 kernel/events/uprobes.c |  2 +-
 mm/huge_memory.c        |  6 +++---
 mm/khugepaged.c         |  2 +-
 mm/memory.c             |  9 ++++-----
 mm/migrate.c            |  2 +-
 mm/swap.c               | 13 +++++++------
 mm/swapfile.c           |  2 +-
 mm/userfaultfd.c        |  2 +-
 mm/vmscan.c             | 21 +++++++++++++++++++--
 10 files changed, 39 insertions(+), 22 deletions(-)

Comments

Johannes Weiner March 12, 2020, 3:14 p.m. UTC | #1
On Thu, Feb 20, 2020 at 02:11:46PM +0900, js1304@gmail.com wrote:
> @@ -1010,8 +1010,15 @@ static enum page_references page_check_references(struct page *page,
>  		return PAGEREF_RECLAIM;
>  
>  	if (referenced_ptes) {
> -		if (PageSwapBacked(page))
> -			return PAGEREF_ACTIVATE;
> +		if (PageSwapBacked(page)) {
> +			if (referenced_page) {
> +				ClearPageReferenced(page);
> +				return PAGEREF_ACTIVATE;
> +			}

This looks odd to me. referenced_page = TestClearPageReferenced()
above, so it's already been cleared. Why clear it again?

> +
> +			SetPageReferenced(page);
> +			return PAGEREF_KEEP;
> +		}

The existing file code already does:

		SetPageReferenced(page);
		if (referenced_page || referenced_ptes > 1)
			return PAGEREF_ACTIVATE;
		if (vm_flags & VM_EXEC)
			return PAGEREF_ACTIVATE;
		return PAGEREF_KEEP;

The differences are:

1) referenced_ptes > 1. We did this so that heavily shared file
mappings are protected a bit better than others. Arguably the same
could apply for anon pages when we put them on the inactive list.

2) vm_flags & VM_EXEC. This mostly doesn't apply to anon pages. The
exception would be jit code pages, but if we put anon pages on the
inactive list we should protect jit code the same way we protect file
executables.

Seems to me you don't need to add anything. Just remove the
PageSwapBacked branch and apply equal treatment to both types.
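
To illustrate -- a sketch of the referenced_ptes block with the
PageSwapBacked branch simply dropped, so anon and file share one path
(not the posted patch, just what I mean):

	if (referenced_ptes) {
		/*
		 * Mark the page and give it another trip around the
		 * inactive list; activate when it is referenced again,
		 * heavily shared, or executable.
		 */
		SetPageReferenced(page);

		if (referenced_page || referenced_ptes > 1)
			return PAGEREF_ACTIVATE;

		if (vm_flags & VM_EXEC)
			return PAGEREF_ACTIVATE;

		return PAGEREF_KEEP;
	}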

> @@ -2056,6 +2063,15 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  			}
>  		}
>  
> +		/*
> +		 * Now, newly created anonymous page isn't appended to the
> +		 * active list. We don't need to clear the reference bit here.
> +		 */
> +		if (PageSwapBacked(page)) {
> +			ClearPageReferenced(page);
> +			goto deactivate;
> +		}

I don't understand this.

If you don't clear the pte references, you're leaving behind stale
data. You already decide here that we consider the page referenced
when it reaches the end of the inactive list, regardless of what
happens in between. That makes the deactivation kind of useless.

And it blurs the lines between the inactive and active list.

shrink_page_list() (and page_check_references()) are written with the
notion that any references they look at are from the inactive list. If
you carry over stale data, this can cause more subtle bugs later on.

And again, I don't quite understand why anon would need different
treatment here than file.

> +
>  		if (page_referenced(page, 0, sc->target_mem_cgroup,
>  				    &vm_flags)) {
>  			nr_rotated += hpage_nr_pages(page);
> @@ -2074,6 +2090,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  			}
>  		}
>  
> +deactivate:
>  		ClearPageActive(page);	/* we are de-activating */
>  		SetPageWorkingset(page);
>  		list_add(&page->lru, &l_inactive);
> -- 
> 2.7.4
>
Joonsoo Kim March 13, 2020, 7:40 a.m. UTC | #2
On Fri, Mar 13, 2020 at 12:14 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Feb 20, 2020 at 02:11:46PM +0900, js1304@gmail.com wrote:
> > @@ -1010,8 +1010,15 @@ static enum page_references page_check_references(struct page *page,
> >               return PAGEREF_RECLAIM;
> >
> >       if (referenced_ptes) {
> > -             if (PageSwapBacked(page))
> > -                     return PAGEREF_ACTIVATE;
> > +             if (PageSwapBacked(page)) {
> > +                     if (referenced_page) {
> > +                             ClearPageReferenced(page);
> > +                             return PAGEREF_ACTIVATE;
> > +                     }
>
> This looks odd to me. referenced_page = TestClearPageReferenced()
> above, so it's already been cleared. Why clear it again?

Oops... it's just my fault. Will remove it.

> > +
> > +                     SetPageReferenced(page);
> > +                     return PAGEREF_KEEP;
> > +             }
>
> The existing file code already does:
>
>                 SetPageReferenced(page);
>                 if (referenced_page || referenced_ptes > 1)
>                         return PAGEREF_ACTIVATE;
>                 if (vm_flags & VM_EXEC)
>                         return PAGEREF_ACTIVATE;
>                 return PAGEREF_KEEP;
>
> The differences are:
>
> 1) referenced_ptes > 1. We did this so that heavily shared file
> mappings are protected a bit better than others. Arguably the same
> could apply for anon pages when we put them on the inactive list.

Yes, this check should be included for anon.

> 2) vm_flags & VM_EXEC. This mostly doesn't apply to anon pages. The
> exception would be jit code pages, but if we put anon pages on the
> inactive list we should protect jit code the same way we protect file
> executables.

I'm not sure that this is necessary for anon pages. From my understanding,
an executable mapped file page is more precious than other mapped file
pages because this mapping is usually used by *multiple* threads and there
is no way for the MM to check that. If anon JIT code also has such a
characteristic, this check should be included for anon, but it should be
added separately. It seems to be beyond the scope of this patch.

> Seems to me you don't need to add anything. Just remove the
> PageSwapBacked branch and apply equal treatment to both types.

I will rework the code if you agree with my opinion.

> > @@ -2056,6 +2063,15 @@ static void shrink_active_list(unsigned long nr_to_scan,
> >                       }
> >               }
> >
> > +             /*
> > +              * Now, newly created anonymous page isn't appended to the
> > +              * active list. We don't need to clear the reference bit here.
> > +              */
> > +             if (PageSwapBacked(page)) {
> > +                     ClearPageReferenced(page);
> > +                     goto deactivate;
> > +             }
>
> I don't understand this.
>
> If you don't clear the pte references, you're leaving behind stale
> data. You already decide here that we consider the page referenced
> when it reaches the end of the inactive list, regardless of what
> happens in between. That makes the deactivation kind of useless.

My idea is that pages newly appended to the inactive list, for example
a newly allocated anon page or a deactivated page, should start from the
same line. A newly allocated anon page would have a mapping (reference),
so I wrote this code so that a deactivated page keeps a reference too.
I think there is no reason to devalue a page that was accessed on the
active list.

Before this patch, all newly allocated anon pages started on the active
list, so clearing the pte reference on deactivation was required to detect
further accesses. That is no longer the case, so I skip it here.

> And it blurs the lines between the inactive and active list.
>
> shrink_page_list() (and page_check_references()) are written with the
> notion that any references they look at are from the inactive list. If
> you carry over stale data, this can cause more subtle bugs later on.

It's not. For file pages, PageReferenced() is maintained even when
deactivation happens, and it counts as one reference.

> And again, I don't quite understand why anon would need different
> treatment here than file.

In order to preserve the current behaviour for file pages, I leave the
code as is for file pages and change it only for anon pages. There is a
fundamental difference between them in how references are checked:
access through a mapping versus access through a syscall. I think some
difference can be admitted.

Thanks.
Johannes Weiner March 13, 2020, 7:55 p.m. UTC | #3
On Fri, Mar 13, 2020 at 04:40:18PM +0900, Joonsoo Kim wrote:
> On Fri, Mar 13, 2020 at 12:14 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, Feb 20, 2020 at 02:11:46PM +0900, js1304@gmail.com wrote:
> > > @@ -1010,8 +1010,15 @@ static enum page_references page_check_references(struct page *page,
> > >               return PAGEREF_RECLAIM;
> > >
> > >       if (referenced_ptes) {
> > > -             if (PageSwapBacked(page))
> > > -                     return PAGEREF_ACTIVATE;
> > > +             if (PageSwapBacked(page)) {
> > > +                     if (referenced_page) {
> > > +                             ClearPageReferenced(page);
> > > +                             return PAGEREF_ACTIVATE;
> > > +                     }
> >
> > This looks odd to me. referenced_page = TestClearPageReferenced()
> > above, so it's already been cleared. Why clear it again?
> 
> Oops... it's just my fault. Will remove it.
> 
> > > +
> > > +                     SetPageReferenced(page);
> > > +                     return PAGEREF_KEEP;
> > > +             }
> >
> > The existing file code already does:
> >
> >                 SetPageReferenced(page);
> >                 if (referenced_page || referenced_ptes > 1)
> >                         return PAGEREF_ACTIVATE;
> >                 if (vm_flags & VM_EXEC)
> >                         return PAGEREF_ACTIVATE;
> >                 return PAGEREF_KEEP;
> >
> > The differences are:
> >
> > 1) referenced_ptes > 1. We did this so that heavily shared file
> > mappings are protected a bit better than others. Arguably the same
> > could apply for anon pages when we put them on the inactive list.
> 
> Yes, this check should be included for anon.
> 
> > 2) vm_flags & VM_EXEC. This mostly doesn't apply to anon pages. The
> > exception would be jit code pages, but if we put anon pages on the
> > inactive list we should protect jit code the same way we protect file
> > executables.
> 
> I'm not sure that this is necessary for anon page. From my understanding,
> executable mapped file page is more precious than other mapped file page
> because this mapping is usually used by *multiple* thread and there is
> no way to check it by MM. If anon JIT code has also such characteristic, this
> code should be included for anon, but, should be included separately. It
> seems that it's beyond of this patch.

The sharing is what the referenced_ptes > 1 check is for.

The problem with executables is that when they are referenced, they
get a *lot* of references compared to data pages. Think about an
instruction stream and how many of those instructions result in data
references. So when you see an executable page that is being accessed,
it's likely being accessed at a high rate. They're much hotter, and
that's why reference bits from VM_EXEC mappings carry more weight.

IMO this applies to executable file and anon equally.

> > Seems to me you don't need to add anything. Just remove the
> > PageSwapBacked branch and apply equal treatment to both types.
> 
> I will rework the code if you agree with my opinion.
> 
> > > @@ -2056,6 +2063,15 @@ static void shrink_active_list(unsigned long nr_to_scan,
> > >                       }
> > >               }
> > >
> > > +             /*
> > > +              * Now, newly created anonymous page isn't appended to the
> > > +              * active list. We don't need to clear the reference bit here.
> > > +              */
> > > +             if (PageSwapBacked(page)) {
> > > +                     ClearPageReferenced(page);
> > > +                     goto deactivate;
> > > +             }
> >
> > I don't understand this.
> >
> > If you don't clear the pte references, you're leaving behind stale
> > data. You already decide here that we consider the page referenced
> > when it reaches the end of the inactive list, regardless of what
> > happens in between. That makes the deactivation kind of useless.
> 
> My idea is that the pages newly appended to the inactive list, for example,
> a newly allocated anon page or deactivated page, start at the same line.
> A newly allocated anon page would have a mapping (reference) so I
> made this code to help for deactivated page to have a mapping (reference).
> I think that there is no reason to devalue the page accessed on active list.

I don't think that leads to desirable behavior, because it causes an
age inversion between deactivated and freshly instantiated pages.

We know the new page was referenced when it entered the head of the
inactive list. However, the old page's reference could be much, much
longer in the past. Hours ago. So when they both reach the end of the
list, we treat them as equally hot even though the new page has been
referenced very recently and the old page might be completely stale.

Keep in mind that we only deactivate in the first place because the
inactive list is struggling and we need to get rid of stale active
pages. We're in a workingset transition and *should* be giving old
pages the chance to move out quickly.

> Before this patch is applied, all newly allocated anon page are started
> at the active list so clearing the pte reference on deactivation is required
> to check the further access. However, it is not the case so I skip it here.
> 
> > And it blurs the lines between the inactive and active list.
> >
> > shrink_page_list() (and page_check_references()) are written with the
> > notion that any references they look at are from the inactive list. If
> > you carry over stale data, this can cause more subtle bugs later on.
> 
> It's not. For file page, PageReferenced() is maintained even if deactivation
> happens and it means one reference.

shrink_page_list() doesn't honor PageReferenced as a reference.

PG_referenced is primarily for the mark_page_accessed() state machine,
which is different from the reclaim scanner's reference tracking: for
unmapped pages we can detect accesses in realtime and don't need the
reference sampling from LRU cycle to LRU cycle. The bit carries over a
deactivation, but it doesn't prevent reclaim from freeing the page.

For mapped pages, we sample references using the LRU cycles, and
PG_referenced is otherwise unused. We repurpose it to implement
second-chance tracking of inactive pages with pte refs. It counts
inactive list cycles, not references.
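
For reference, a simplified sketch of that state machine in
mark_page_accessed(), with the pagevec and idle-page details elided:

	void mark_page_accessed(struct page *page)
	{
		if (!PageActive(page) && !PageUnevictable(page) &&
		    PageReferenced(page)) {
			/* 2nd access: inactive,referenced -> active,unreferenced */
			activate_page(page);
			ClearPageReferenced(page);
		} else if (!PageReferenced(page)) {
			/* 1st access: just set PG_referenced */
			SetPageReferenced(page);
		}
	}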

> > And again, I don't quite understand why anon would need different
> > treatment here than file.
> 
> In order to preserve the current behaviour for the file page, I leave the code
> as is for the file page and change the code for the anon page. There is
> fundamental difference between them such as how referenced is checked,
> accessed by mapping and accessed by syscall. I think that some difference
> would be admitted.

Right, unmapped pages have their own reference tracking system because
they can be detected synchronously.

My questions center around this:

We have an existing sampling algorithm for the coarse-grained page
table referenced bit, where we start pages on inactive, treat
references a certain way, target a certain inactive:active ratio, use
refault information to detect workingset transitions etc. Anon used a
different system in the past, but your patch set switches it over to
the more universal model we have developed for mapped file pages.

However, you don't switch it over to this exact model we have for
mapped files, but rather a slightly modified version. And I don't
quite understand the rationale behind the individual modifications.

So let me turn it around. What would be the downsides of aging mapped
anon exactly the same way we age mapped files? Can we identify where
differences are necessary and document them?
Joonsoo Kim March 16, 2020, 7:05 a.m. UTC | #4
On Sat, Mar 14, 2020 at 4:55 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Mar 13, 2020 at 04:40:18PM +0900, Joonsoo Kim wrote:
> > On Fri, Mar 13, 2020 at 12:14 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Thu, Feb 20, 2020 at 02:11:46PM +0900, js1304@gmail.com wrote:
> > > > @@ -1010,8 +1010,15 @@ static enum page_references page_check_references(struct page *page,
> > > >               return PAGEREF_RECLAIM;
> > > >
> > > >       if (referenced_ptes) {
> > > > -             if (PageSwapBacked(page))
> > > > -                     return PAGEREF_ACTIVATE;
> > > > +             if (PageSwapBacked(page)) {
> > > > +                     if (referenced_page) {
> > > > +                             ClearPageReferenced(page);
> > > > +                             return PAGEREF_ACTIVATE;
> > > > +                     }
> > >
> > > This looks odd to me. referenced_page = TestClearPageReferenced()
> > > above, so it's already been cleared. Why clear it again?
> >
> > Oops... it's just my fault. Will remove it.
> >
> > > > +
> > > > +                     SetPageReferenced(page);
> > > > +                     return PAGEREF_KEEP;
> > > > +             }
> > >
> > > The existing file code already does:
> > >
> > >                 SetPageReferenced(page);
> > >                 if (referenced_page || referenced_ptes > 1)
> > >                         return PAGEREF_ACTIVATE;
> > >                 if (vm_flags & VM_EXEC)
> > >                         return PAGEREF_ACTIVATE;
> > >                 return PAGEREF_KEEP;
> > >
> > > The differences are:
> > >
> > > 1) referenced_ptes > 1. We did this so that heavily shared file
> > > mappings are protected a bit better than others. Arguably the same
> > > could apply for anon pages when we put them on the inactive list.
> >
> > Yes, this check should be included for anon.
> >
> > > 2) vm_flags & VM_EXEC. This mostly doesn't apply to anon pages. The
> > > exception would be jit code pages, but if we put anon pages on the
> > > inactive list we should protect jit code the same way we protect file
> > > executables.
> >
> > I'm not sure that this is necessary for anon page. From my understanding,
> > executable mapped file page is more precious than other mapped file page
> > because this mapping is usually used by *multiple* thread and there is
> > no way to check it by MM. If anon JIT code has also such characteristic, this
> > code should be included for anon, but, should be included separately. It
> > seems that it's beyond of this patch.
>
> The sharing is what the referenced_ptes > 1 check is for.

Sharing between processes can be checked by referenced_ptes, but
sharing between threads cannot. I guess that the VM_EXEC check is for
the thread case, since most executables are usually shared between
threads. But your comment below made me reconsider the reason for the
VM_EXEC check. See below.

> The problem with executables is that when they are referenced, they
> get a *lot* of references compared to data pages. Think about an
> instruction stream and how many of those instructions result in data
> references. So when you see an executable page that is being accessed,
> it's likely being accessed at a high rate. They're much hotter, and
> that's why reference bits from VM_EXEC mappings carry more weight.

Sounds reasonable.

But now I have another thought about it. I think that the root reason
for this check is that there are two different reference tracking models
on the file LRU. If we assume that all file pages are mapped, this
special check isn't needed: if executable pages are accessed more
frequently than others, more references will be found for them at each
LRU cycle, so no special check is required.

However, file pages include both syscall-accessed and mapped pages, and
the reference tracking model for mapped pages is at a disadvantage
compared to the one for syscall-accessed pages: several references to a
mapped page are counted as just one per LRU cycle. I think that this
check is there to mitigate that disadvantage, at least for executable
file-mapped pages.

> IMO this applies to executable file and anon equally.

The anon LRU consists of mapped pages only, so, IMHO, there is no need
for special handling of executable pages on the anon LRU. Although
references are only checked at each LRU cycle, this correctly
approximates reference frequency.

Moreover, there is a comment in shrink_active_list() that explains one
reason for omitting this check for anon pages.

"Anon pages are not likely to be evicted by use-once streaming IO, plus JVM
can create lots of anon VM_EXEC pages, so we ignore them here."

Lastly, as I said before, this should at least be done separately, with
appropriate investigation.
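
For reference, that comment sits in this shrink_active_list() block,
where only file-backed VM_EXEC pages get the extra trip around the
active list:

	if (page_referenced(page, 0, sc->target_mem_cgroup,
			    &vm_flags)) {
		nr_rotated += hpage_nr_pages(page);
		/*
		 * Identify referenced, file-backed active pages and
		 * give them one more trip around the active list. So
		 * that executable code get better chances to stay in
		 * memory under moderate memory pressure.  Anon pages
		 * are not likely to be evicted by use-once streaming
		 * IO, plus JVM can create lots of anon VM_EXEC pages,
		 * so we ignore them here.
		 */
		if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) {
			list_add(&page->lru, &l_active);
			continue;
		}
	}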

> > > Seems to me you don't need to add anything. Just remove the
> > > PageSwapBacked branch and apply equal treatment to both types.
> >
> > I will rework the code if you agree with my opinion.
> >
> > > > @@ -2056,6 +2063,15 @@ static void shrink_active_list(unsigned long nr_to_scan,
> > > >                       }
> > > >               }
> > > >
> > > > +             /*
> > > > +              * Now, newly created anonymous page isn't appended to the
> > > > +              * active list. We don't need to clear the reference bit here.
> > > > +              */
> > > > +             if (PageSwapBacked(page)) {
> > > > +                     ClearPageReferenced(page);
> > > > +                     goto deactivate;
> > > > +             }
> > >
> > > I don't understand this.
> > >
> > > If you don't clear the pte references, you're leaving behind stale
> > > data. You already decide here that we consider the page referenced
> > > when it reaches the end of the inactive list, regardless of what
> > > happens in between. That makes the deactivation kind of useless.
> >
> > My idea is that the pages newly appended to the inactive list, for example,
> > a newly allocated anon page or deactivated page, start at the same line.
> > A newly allocated anon page would have a mapping (reference) so I
> > made this code to help for deactivated page to have a mapping (reference).
> > I think that there is no reason to devalue the page accessed on active list.
>
> I don't think that leads to desirable behavior, because it causes an
> age inversion between deactivated and freshly instantiated pages.
>
> We know the new page was referenced when it entered the head of the
> inactive list. However, the old page's reference could be much, much
> longer in the past. Hours ago. So when they both reach the end of the
> list, we treat them as equally hot even though the new page has been
> referenced very recently and the old page might be completely stale.
>
> Keep in mind that we only deactivate in the first place because the
> inactive list is struggling and we need to get rid of stale active
> pages. We're in a workingset transition and *should* be giving old
> pages the chance to move out quickly.

You're right. I will fix it.

> > Before this patch is applied, all newly allocated anon page are started
> > at the active list so clearing the pte reference on deactivation is required
> > to check the further access. However, it is not the case so I skip it here.
> >
> > > And it blurs the lines between the inactive and active list.
> > >
> > > shrink_page_list() (and page_check_references()) are written with the
> > > notion that any references they look at are from the inactive list. If
> > > you carry over stale data, this can cause more subtle bugs later on.
> >
> > It's not. For file page, PageReferenced() is maintained even if deactivation
> > happens and it means one reference.
>
> shrink_page_list() doesn't honor PageReferenced as a reference.
>
> PG_referenced is primarily for the mark_page_accessed() state machine,
> which is different from the reclaim scanner's reference tracking: for
> unmapped pages we can detect accesses in realtime and don't need the
> reference sampling from LRU cycle to LRU cycle. The bit carries over a
> deactivation, but it doesn't prevent reclaim from freeing the page.
>
> For mapped pages, we sample references using the LRU cycles, and
> PG_referenced is otherwise unused. We repurpose it to implement
> second-chance tracking of inactive pages with pte refs. It counts
> inactive list cycles, not references.

Okay.

> > > And again, I don't quite understand why anon would need different
> > > treatment here than file.
> >
> > In order to preserve the current behaviour for the file page, I leave the code
> > as is for the file page and change the code for the anon page. There is
> > fundamental difference between them such as how referenced is checked,
> > accessed by mapping and accessed by syscall. I think that some difference
> > would be admitted.
>
> Right, unmapped pages have their own reference tracking system because
> they can be detected synchronously.
>
> My questions center around this:
>
> We have an existing sampling algorithm for the coarse-grained page
> table referenced bit, where we start pages on inactive, treat
> references a certain way, target a certain inactive:active ratio, use
> refault information to detect workingset transitions etc. Anon used a
> different system in the past, but your patch set switches it over to
> the more universal model we have developed for mapped file pages.
>
> However, you don't switch it over to this exact model we have for
> mapped files, but rather a slightly modified version. And I don't
> quite understand the rationale behind the individual modifications.
>
> So let me turn it around. What would be the downsides of aging mapped
> anon exactly the same way we age mapped files? Can we identify where
> differences are necessary and document them?

Now, I plan to make a next version that applies all your comments
except for the VM_EXEC case. As I said above, the fundamental difference
between file-mapped and anon-mapped pages is that the file LRU, where
file-mapped pages are managed, uses two reference tracking models, while
the anon LRU uses just one. The file LRU needs some heuristic to adjust
for the difference between the two models, but the anon LRU doesn't need
one at all. And I think the VM_EXEC check exists for this reason.

Please let me know your opinion on the VM_EXEC case. I will start
reworking if you agree with my reasoning.

Anyway, all comments are very helpful to me. Really, thanks Johannes.

Thanks.
Johannes Weiner March 16, 2020, 4:12 p.m. UTC | #5
On Mon, Mar 16, 2020 at 04:05:30PM +0900, Joonsoo Kim wrote:
> On Sat, Mar 14, 2020 at 4:55 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > The problem with executables is that when they are referenced, they
> > get a *lot* of references compared to data pages. Think about an
> > instruction stream and how many of those instructions result in data
> > references. So when you see an executable page that is being accessed,
> > it's likely being accessed at a high rate. They're much hotter, and
> > that's why reference bits from VM_EXEC mappings carry more weight.
> 
> Sounds reasonable.
> 
> But, now, I have another thought about it. I think that the root of the reason
> of this check is that there are two different reference tracking models
> on file LRU. If we assume that all file pages are mapped ones, this special
> check isn't needed. If executable pages are accessed more frequently than
> others, reference can be found more for them at LRU cycle. No need for this
> special check.
> 
> However, file pages includes syscall-ed pages and mapped pages, and,
> reference tracking models for mapped page has a disadvantage to
> one for syscall-ed page. Several references for mapped page would be
> considered as one at LRU cycle. I think that this check is to mitigate
> this disadvantage, at least, for executable file mapped page.

Hm, I don't quite understand this reasoning. Yes, there are two models
for unmapped and mapped file pages. But both types of pages get
activated with two or more references: for unmapped it's tracked
through mark_page_accessed(), and for mapped it's the two LRU cycles
with the referenced bit set (unmapped pages don't get that extra trip
around the LRU with one reference). With that alone, mapped pages and
unmapped pages should have equal chances, no?

> > IMO this applies to executable file and anon equally.
> 
> anon LRU consist of all mapped pages, so, IMHO,  there is no need for
> special handling for executable pages on anon LRU. Although reference
> is just checked at LRU cycle, it would correctly approximate reference
> frequency.
> 
> Moreover, there is one comment in shrink_active_list() that explains one
> reason about omission of this check for anon pages.
> 
> "Anon pages are not likely to be evicted by use-once streaming IO, plus JVM
> can create lots of anon VM_EXEC pages, so we ignore them here."
>
> Lastly, as I said before, at least, it is done separately with appropriate
> investigation.

You do have a point here, though.

The VM_EXEC protection is a heuristic that I think was added for one
specific case. Instead of adopting it straight-away for anon pages, it
may be good to wait until a usecase materializes. If it never does,
all the better - one less heuristic to carry around.

> Now, I plan to make a next version applied all your comments except
> VM_EXEC case. As I said above, fundamental difference between
> file mapped page and anon mapped page is that file LRU where file mapped
> pages are managed uses two reference tracking model but anon LRU uses
> just one. File LRU needs some heuristic to adjust the difference of two models,
> but, anon LRU doesn't need at all. And, I think VM_EXEC check is for this case.
> 
> Please let me know your opinion about VM_EXEC case. I will start to rework
> if you agree with my thought.

That sounds like a good plan. I'm looking forward to the next version!

Johannes
Joonsoo Kim March 17, 2020, 4:52 a.m. UTC | #6
On Tue, Mar 17, 2020 at 1:12 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Mar 16, 2020 at 04:05:30PM +0900, Joonsoo Kim wrote:
> > On Sat, Mar 14, 2020 at 4:55 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > The problem with executables is that when they are referenced, they
> > > get a *lot* of references compared to data pages. Think about an
> > > instruction stream and how many of those instructions result in data
> > > references. So when you see an executable page that is being accessed,
> > > it's likely being accessed at a high rate. They're much hotter, and
> > > that's why reference bits from VM_EXEC mappings carry more weight.
> >
> > Sounds reasonable.
> >
> > But, now, I have another thought about it. I think that the root of the reason
> > of this check is that there are two different reference tracking models
> > on file LRU. If we assume that all file pages are mapped ones, this special
> > check isn't needed. If executable pages are accessed more frequently than
> > others, reference can be found more for them at LRU cycle. No need for this
> > special check.
> >
> > However, file pages includes syscall-ed pages and mapped pages, and,
> > reference tracking models for mapped page has a disadvantage to
> > one for syscall-ed page. Several references for mapped page would be
> > considered as one at LRU cycle. I think that this check is to mitigate
> > this disadvantage, at least, for executable file mapped page.
>
> Hm, I don't quite understand this reasoning. Yes, there are two models
> for unmapped and mapped file pages. But both types of pages get
> activated with two or more references: for unmapped it's tracked
> through mark_page_accessed(), and for mapped it's the two LRU cycles
> with the referenced bit set (unmapped pages don't get that extra trip
> around the LRU with one reference). With that alone, mapped pages and
> unmapped pages should have equal chances, no?
>

Think about the following example.

Assume that the size of the inactive list for file pages is 100.

1. the new page A enters the inactive list and an access to A happens
2. 50 inactive pages are reclaimed due to 50 new pages
3. a second access to A happens
4. 50 more inactive pages are reclaimed due to 50 new pages
5. 100 inactive pages are reclaimed due to 100 new pages

If A is:
an unmapped page: the page is activated at #3 and lives beyond #5
a mapped page: the page's reference is checked at #4 and it is
re-attached to the inactive list with PageReferenced() set, but it is
eventually reclaimed at #5

Even though they are referenced with the same pattern, the mapped page
is reclaimed while the unmapped one isn't. This is, as I said before,
the disadvantage mapped pages have compared to unmapped pages. My guess
is that the VM_EXEC test exists to mitigate this disadvantage for mapped
file pages, since VM_EXEC is the most important case for mapped file
pages.

Thanks.
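
To make the asymmetry concrete, here is a small userspace simulation of
the two policies in this example (illustrative only -- the list
mechanics are simplified and the names are made up; this is not kernel
code):

/*
 * Page A sits on a 100-slot inactive FIFO and is accessed at steps
 * 1 and 3.  Unmapped policy: the second mark_page_accessed() call
 * activates immediately.  Mapped policy: the referenced pte bit is
 * only sampled when A reaches the tail of the list.
 */
#include <stdbool.h>
#include <stdio.h>

#define LIST_SIZE 100

struct page_a {
	int pos;		/* distance from the list head */
	bool pg_referenced;	/* PG_referenced */
	bool pte_young;		/* referenced bit in the pte */
	bool active, reclaimed;
};

/* push @nr use-once pages; sample A if it reaches the tail */
static void age(struct page_a *a, int nr, bool mapped)
{
	a->pos += nr;
	if (a->active || a->reclaimed || a->pos < LIST_SIZE)
		return;
	if (mapped && a->pte_young) {
		/* PAGEREF_KEEP: second chance, rotate back to the head */
		a->pte_young = false;
		a->pg_referenced = true;
		a->pos = 0;
	} else {
		a->reclaimed = true;
	}
}

static void touch(struct page_a *a, bool mapped)
{
	if (mapped)
		a->pte_young = true;	/* only the pte bit is set */
	else if (a->pg_referenced)
		a->active = true;	/* mark_page_accessed() activates */
	else
		a->pg_referenced = true;
}

static void run(bool mapped)
{
	struct page_a a = { 0 };

	touch(&a, mapped);	/* step 1 */
	age(&a, 50, mapped);	/* step 2 */
	touch(&a, mapped);	/* step 3 */
	age(&a, 50, mapped);	/* step 4 */
	age(&a, 100, mapped);	/* step 5 */
	printf("%s page A: %s\n", mapped ? "mapped" : "unmapped",
	       a.reclaimed ? "reclaimed" : a.active ? "activated" : "kept");
}

int main(void)
{
	run(false);	/* prints: unmapped page A: activated */
	run(true);	/* prints: mapped page A: reclaimed */
	return 0;
}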

Patch

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1e99f7a..954e13e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -344,7 +344,7 @@  extern void deactivate_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
-extern void lru_cache_add_active_or_unevictable(struct page *page,
+extern void lru_cache_add_inactive_or_unevictable(struct page *page,
 						struct vm_area_struct *vma);
 
 /* linux/mm/vmscan.c */
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ece7e13..14156fc 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -190,7 +190,7 @@  static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 		get_page(new_page);
 		page_add_new_anon_rmap(new_page, vma, addr, false);
 		mem_cgroup_commit_charge(new_page, memcg, false, false);
-		lru_cache_add_active_or_unevictable(new_page, vma);
+		lru_cache_add_inactive_or_unevictable(new_page, vma);
 	} else
 		/* no new page, just dec_mm_counter for old_page */
 		dec_mm_counter(mm, MM_ANONPAGES);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a880932..6356dfd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -638,7 +638,7 @@  static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		page_add_new_anon_rmap(page, vma, haddr, true);
 		mem_cgroup_commit_charge(page, memcg, false, true);
-		lru_cache_add_active_or_unevictable(page, vma);
+		lru_cache_add_inactive_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
 		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
 		add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
@@ -1282,7 +1282,7 @@  static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
 		set_page_private(pages[i], 0);
 		page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false);
 		mem_cgroup_commit_charge(pages[i], memcg, false, false);
-		lru_cache_add_active_or_unevictable(pages[i], vma);
+		lru_cache_add_inactive_or_unevictable(pages[i], vma);
 		vmf->pte = pte_offset_map(&_pmd, haddr);
 		VM_BUG_ON(!pte_none(*vmf->pte));
 		set_pte_at(vma->vm_mm, haddr, vmf->pte, entry);
@@ -1435,7 +1435,7 @@  vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 		pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
 		page_add_new_anon_rmap(new_page, vma, haddr, true);
 		mem_cgroup_commit_charge(new_page, memcg, false, true);
-		lru_cache_add_active_or_unevictable(new_page, vma);
+		lru_cache_add_inactive_or_unevictable(new_page, vma);
 		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
 		update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 		if (!page) {
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b679908..246c155 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1092,7 +1092,7 @@  static void collapse_huge_page(struct mm_struct *mm,
 	page_add_new_anon_rmap(new_page, vma, address, true);
 	mem_cgroup_commit_charge(new_page, memcg, false, true);
 	count_memcg_events(memcg, THP_COLLAPSE_ALLOC, 1);
-	lru_cache_add_active_or_unevictable(new_page, vma);
+	lru_cache_add_inactive_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache_pmd(vma, address, pmd);
diff --git a/mm/memory.c b/mm/memory.c
index 45442d9..5f7813a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2513,7 +2513,7 @@  static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
 		page_add_new_anon_rmap(new_page, vma, vmf->address, false);
 		mem_cgroup_commit_charge(new_page, memcg, false, false);
-		lru_cache_add_active_or_unevictable(new_page, vma);
+		lru_cache_add_inactive_or_unevictable(new_page, vma);
 		/*
 		 * We call the notify macro here because, when using secondary
 		 * mmu page tables (such as kvm shadow page tables), we want the
@@ -3038,11 +3038,10 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (unlikely(page != swapcache && swapcache)) {
 		page_add_new_anon_rmap(page, vma, vmf->address, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
-		lru_cache_add_active_or_unevictable(page, vma);
+		lru_cache_add_inactive_or_unevictable(page, vma);
 	} else {
 		do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
 		mem_cgroup_commit_charge(page, memcg, true, false);
-		activate_page(page);
 	}
 
 	swap_free(entry);
@@ -3186,7 +3185,7 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, vmf->address, false);
 	mem_cgroup_commit_charge(page, memcg, false, false);
-	lru_cache_add_active_or_unevictable(page, vma);
+	lru_cache_add_inactive_or_unevictable(page, vma);
 setpte:
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
 
@@ -3449,7 +3448,7 @@  vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 		page_add_new_anon_rmap(page, vma, vmf->address, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
-		lru_cache_add_active_or_unevictable(page, vma);
+		lru_cache_add_inactive_or_unevictable(page, vma);
 	} else {
 		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
 		page_add_file_rmap(page, false);
diff --git a/mm/migrate.c b/mm/migrate.c
index 86873b6..ef034c0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2784,7 +2784,7 @@  static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	page_add_new_anon_rmap(page, vma, addr, false);
 	mem_cgroup_commit_charge(page, memcg, false, false);
 	if (!is_zone_device_page(page))
-		lru_cache_add_active_or_unevictable(page, vma);
+		lru_cache_add_inactive_or_unevictable(page, vma);
 	get_page(page);
 
 	if (flush) {
diff --git a/mm/swap.c b/mm/swap.c
index 5341ae9..18b2735 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -448,23 +448,24 @@  void lru_cache_add(struct page *page)
 }
 
 /**
- * lru_cache_add_active_or_unevictable
+ * lru_cache_add_inactive_or_unevictable
  * @page:  the page to be added to LRU
  * @vma:   vma in which page is mapped for determining reclaimability
  *
- * Place @page on the active or unevictable LRU list, depending on its
+ * Place @page on the inactive or unevictable LRU list, depending on its
  * evictability.  Note that if the page is not evictable, it goes
  * directly back onto it's zone's unevictable list, it does NOT use a
  * per cpu pagevec.
  */
-void lru_cache_add_active_or_unevictable(struct page *page,
+void lru_cache_add_inactive_or_unevictable(struct page *page,
 					 struct vm_area_struct *vma)
 {
+	bool evictable;
+
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
-	if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
-		SetPageActive(page);
-	else if (!TestSetPageMlocked(page)) {
+	evictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED;
+	if (!evictable && !TestSetPageMlocked(page)) {
 		/*
 		 * We use the irq-unsafe __mod_zone_page_stat because this
 		 * counter is not modified from interrupt context, and the pte
diff --git a/mm/swapfile.c b/mm/swapfile.c
index bb3261d..6bdcbf9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1888,7 +1888,7 @@  static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	} else { /* ksm created a completely new copy */
 		page_add_new_anon_rmap(page, vma, addr, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
-		lru_cache_add_active_or_unevictable(page, vma);
+		lru_cache_add_inactive_or_unevictable(page, vma);
 	}
 	swap_free(entry);
 	/*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 1b0d7ab..875e329 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -120,7 +120,7 @@  static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 	inc_mm_counter(dst_mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
 	mem_cgroup_commit_charge(page, memcg, false, false);
-	lru_cache_add_active_or_unevictable(page, dst_vma);
+	lru_cache_add_inactive_or_unevictable(page, dst_vma);
 
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e772f3f..4122a84 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1010,8 +1010,15 @@  static enum page_references page_check_references(struct page *page,
 		return PAGEREF_RECLAIM;
 
 	if (referenced_ptes) {
-		if (PageSwapBacked(page))
-			return PAGEREF_ACTIVATE;
+		if (PageSwapBacked(page)) {
+			if (referenced_page) {
+				ClearPageReferenced(page);
+				return PAGEREF_ACTIVATE;
+			}
+
+			SetPageReferenced(page);
+			return PAGEREF_KEEP;
+		}
 		/*
 		 * All mapped pages start out with page table
 		 * references from the instantiating fault, so we need
@@ -2056,6 +2063,15 @@  static void shrink_active_list(unsigned long nr_to_scan,
 			}
 		}
 
+		/*
+		 * Now, newly created anonymous page isn't appended to the
+		 * active list. We don't need to clear the reference bit here.
+		 */
+		if (PageSwapBacked(page)) {
+			ClearPageReferenced(page);
+			goto deactivate;
+		}
+
 		if (page_referenced(page, 0, sc->target_mem_cgroup,
 				    &vm_flags)) {
 			nr_rotated += hpage_nr_pages(page);
@@ -2074,6 +2090,7 @@  static void shrink_active_list(unsigned long nr_to_scan,
 			}
 		}
 
+deactivate:
 		ClearPageActive(page);	/* we are de-activating */
 		SetPageWorkingset(page);
 		list_add(&page->lru, &l_inactive);