
[05/14] mm: workingset: let cache workingset challenge anon

Message ID 20200520232525.798933-6-hannes@cmpxchg.org (mailing list archive)
State New, archived
Series: mm: balance LRU lists based on relative thrashing v2

Commit Message

Johannes Weiner May 20, 2020, 11:25 p.m. UTC
We activate cache refaults with reuse distances in pages smaller than
the size of the total cache. This allows new pages with competitive
access frequencies to establish themselves, as well as challenge and
potentially displace pages on the active list that have gone cold.

However, that assumes that active cache can only replace other active
cache in a competition for the hottest memory. This is not a great
default assumption. The page cache might be thrashing while there are
enough completely cold and unused anonymous pages sitting around that
we'd only have to write to swap once to stop all IO from the cache.

Activate cache refaults when their reuse distance in pages is smaller
than the total userspace workingset, including anonymous pages.

Reclaim can still decide how to balance pressure among the two LRUs
depending on the IO situation. Rotational drives will prefer avoiding
random IO from swap and go harder after cache. But fundamentally, hot
cache should be able to compete with anon pages for a place in RAM.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/workingset.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)
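
The diff body is not reproduced in this archive view. As a rough sketch
(reconstructed from the context lines of the follow-up diff quoted later in
this thread, so illustrative rather than the exact applied hunk), the
comparison described above ends up looking like this in workingset_refault():

	workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
	if (mem_cgroup_get_nr_swap_pages(memcg) > 0) {
		workingset_size += lruvec_page_state(eviction_lruvec,
						     NR_INACTIVE_ANON);
		workingset_size += lruvec_page_state(eviction_lruvec,
						     NR_ACTIVE_ANON);
	}
	if (refault_distance > workingset_size)
		goto out;

	SetPageActive(page);

That is, with swap available, a refaulting cache page is activated when its
distance is smaller than active file plus all of anon, instead of active file
alone.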

Comments

Joonsoo Kim May 27, 2020, 2:06 a.m. UTC | #1
On Thu, May 21, 2020 at 8:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> We activate cache refaults with reuse distances in pages smaller than
> the size of the total cache. This allows new pages with competitive
> access frequencies to establish themselves, as well as challenge and
> potentially displace pages on the active list that have gone cold.
>
> However, that assumes that active cache can only replace other active
> cache in a competition for the hottest memory. This is not a great
> default assumption. The page cache might be thrashing while there are
> enough completely cold and unused anonymous pages sitting around that
> we'd only have to write to swap once to stop all IO from the cache.
>
> Activate cache refaults when their reuse distance in pages is smaller
> than the total userspace workingset, including anonymous pages.

Hmm... I'm not sure about the correctness of this change.

IIUC, this patch leads to more activations in the file list, and those
activations will challenge the anon list since the rotation ratio for
the file list will be increased.

However, this change breaks the active/inactive concept of the file
list. The active/inactive separation is implemented by the in-list
refault distance. The anon list size has no direct connection with the
refault distance of the file list, so using the anon list size to
detect the workingset for file pages breaks that concept.

My suspicion started with this counterexample.

Environment:
anon: 500 MB (so hot) / 500 MB (so hot)
file: 50 MB (hot) / 50 MB (cold)

Consider the situation where there is periodic access to another file
(100 MB) with low frequency (refault distance is 500 MB).

Without your change, this periodic access doesn't cause thrashing of
the cached active file pages, since its refault distance is larger
than the size of the active file list. However, with your change, it
causes thrashing on the file list.

In fact, I'm not sure that I'm thinking correctly. It's very
complicated for me. :)
Please let me know if there is a missing piece.

Thanks.
Johannes Weiner May 27, 2020, 1:43 p.m. UTC | #2
On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> On Thu, May 21, 2020 at 8:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > We activate cache refaults with reuse distances in pages smaller than
> > the size of the total cache. This allows new pages with competitive
> > access frequencies to establish themselves, as well as challenge and
> > potentially displace pages on the active list that have gone cold.
> >
> > However, that assumes that active cache can only replace other active
> > cache in a competition for the hottest memory. This is not a great
> > default assumption. The page cache might be thrashing while there are
> > enough completely cold and unused anonymous pages sitting around that
> > we'd only have to write to swap once to stop all IO from the cache.
> >
> > Activate cache refaults when their reuse distance in pages is smaller
> > than the total userspace workingset, including anonymous pages.
> 
> Hmm... I'm not sure the correctness of this change.
> 
> IIUC, this patch leads to more activations in the file list and more activations
> here will challenge the anon list since rotation ratio for the file
> list will be increased.

Yes.

> However, this change breaks active/inactive concept of the file list.
> active/inactive
> separation is implemented by in-list refault distance. anon list size has
> no direct connection with refault distance of the file list so using
> anon list size
> to detect workingset for file page breaks the concept.

This is intentional, because there IS a connection: they both take up
space in RAM, and they both cost IO to bring back once reclaimed.

When file is refaulting, it means we need to make more space for
cache. That space can come from stale active file pages. But what if
active cache is all hot, and meanwhile there are cold anon pages that
we could swap out once and then serve everything from RAM?

When file is refaulting, we should find the coldest data that is
taking up RAM and kick it out. It doesn't matter whether it's file or
anon: the goal is to free up RAM with the least amount of IO risk.

Remember that the file/anon split, and the inactive/active split, are
there to optimize reclaim. It doesn't mean that these memory pools are
independent from each other.

The file list is split in two because of use-once cache. The anon and
file lists are split because of different IO patterns, because we may
not have swap etc. But once we are out of use-once cache, have swap
space available, and have corrected for the different cost of IO,
there needs to be a relative order between all pages in the system to
find the optimal candidates to reclaim.

> My suspicion is started by this counter example.
> 
> Environment:
> anon: 500 MB (so hot) / 500 MB (so hot)
> file: 50 MB (hot) / 50 MB (cold)
> 
> Think about the situation that there is periodical access to other file (100 MB)
> with low frequency (refault distance is 500 MB)
> 
> Without your change, this periodical access doesn't make thrashing for cached
> active file page since refault distance of periodical access is larger
> than the size of
> the active file list. However, with your change, it causes thrashing
> on the file list.

It doesn't cause thrashing. It causes scanning because that 100M file
IS thrashing: with or without my patch, that refault IO is occurring.

What this patch acknowledges is that the 100M file COULD fit fully
into memory, and not require any IO to serve, IFF 100M of the active
file or anon pages were cold and could be reclaimed or swapped out.

In your example, the anon set is hot. We'll scan it slowly (at the
rate of IO from the other file) and rotate the pages that are in use -
which would be all of them. Likewise for the file - there will be some
deactivations, but mark_page_accessed() or the second chance algorithm
in page_check_references() for mapped pages will keep the hottest pages active.

In a slightly modified example, 400M of the anon set is hot and 100M
cold. Without my patch, we would never look for them and the second
file would be IO-bound forever. After my patch, we would scan anon,
eventually find the cold pages, swap them out, and then serve the
entire workingset from memory.

Does it cause more scanning than before in your scenario? Yes, but we
don't even know it's your scenario instead of mine until we actually
sample references of all memory. Not scanning is a false stable state.

And your scenario could easily change over time. Even if the amount of
repeatedly accessed pages stays larger than memory, and will always
require IO to serve, the relative access frequencies could change.
Some pages could become hotter, others colder. Without scanning, we
wouldn't adapt the LRU ordering and cause more IO than necessary.
Joonsoo Kim May 28, 2020, 7:16 a.m. UTC | #3
On Wed, May 27, 2020 at 10:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> > On Thu, May 21, 2020 at 8:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > We activate cache refaults with reuse distances in pages smaller than
> > > the size of the total cache. This allows new pages with competitive
> > > access frequencies to establish themselves, as well as challenge and
> > > potentially displace pages on the active list that have gone cold.
> > >
> > > However, that assumes that active cache can only replace other active
> > > cache in a competition for the hottest memory. This is not a great
> > > default assumption. The page cache might be thrashing while there are
> > > enough completely cold and unused anonymous pages sitting around that
> > > we'd only have to write to swap once to stop all IO from the cache.
> > >
> > > Activate cache refaults when their reuse distance in pages is smaller
> > > than the total userspace workingset, including anonymous pages.
> >
> > Hmm... I'm not sure the correctness of this change.
> >
> > IIUC, this patch leads to more activations in the file list and more activations
> > here will challenge the anon list since rotation ratio for the file
> > list will be increased.
>
> Yes.
>
> > However, this change breaks active/inactive concept of the file list.
> > active/inactive
> > separation is implemented by in-list refault distance. anon list size has
> > no direct connection with refault distance of the file list so using
> > anon list size
> > to detect workingset for file page breaks the concept.
>
> This is intentional, because there IS a connection: they both take up
> space in RAM, and they both cost IO to bring back once reclaimed.

I know that. This is the reason I said 'no direct connection'. The
anon list size is directly related to the *possible* file list size.
But the active/inactive separation within one list is based primarily
on the *current* list size rather than the possible list size. Adding
the anon list size to detect the workingset means using the possible
list size, and I think that's wrong.

> When file is refaulting, it means we need to make more space for
> cache. That space can come from stale active file pages. But what if
> active cache is all hot, and meanwhile there are cold anon pages that
> we could swap out once and then serve everything from RAM?
>
> When file is refaulting, we should find the coldest data that is
> taking up RAM and kick it out. It doesn't matter whether it's file or
> anon: the goal is to free up RAM with the least amount of IO risk.

I understand your purpose and agree with it. We need to find a solution.
To achieve your goal, my suggestion is:

- refault distance < active file, then do activation and add up IO cost
- refault distance < active file + anon list, then add up IO cost

This doesn't break workingset detection on the file list, and it
challenges the anon list to the same degree as your patch does.
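
A rough sketch of that two-threshold idea (purely illustrative, reusing
helper names that appear elsewhere in this thread, not a tested patch):

	if (refault_distance <= active_file) {
		/* fits the current file workingset: activate and note the IO */
		SetPageActive(page);
		advance_inactive_age(memcg, pgdat);
		inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
		lru_note_cost_page(page);
	} else if (refault_distance <= active_file + anon) {
		/* challenges anon only: note the IO cost, no activation */
		lru_note_cost_page(page);
	}

(Locking around lru_note_cost_page() and the existing PageWorkingset
handling are left out for brevity.)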

> Remember that the file/anon split, and the inactive/active split, are
> there to optimize reclaim. It doesn't mean that these memory pools are
> independent from each other.
>
> The file list is split in two because of use-once cache. The anon and
> file lists are split because of different IO patterns, because we may
> not have swap etc. But once we are out of use-once cache, have swap
> space available, and have corrected for the different cost of IO,
> there needs to be a relative order between all pages in the system to
> find the optimal candidates to reclaim.
>
> > My suspicion is started by this counter example.
> >
> > Environment:
> > anon: 500 MB (so hot) / 500 MB (so hot)
> > file: 50 MB (hot) / 50 MB (cold)
> >
> > Think about the situation that there is periodical access to other file (100 MB)
> > with low frequency (refault distance is 500 MB)
> >
> > Without your change, this periodical access doesn't make thrashing for cached
> > active file page since refault distance of periodical access is larger
> > than the size of
> > the active file list. However, with your change, it causes thrashing
> > on the file list.
>
> It doesn't cause thrashing. It causes scanning because that 100M file
> IS thrashing: with or without my patch, that refault IO is occuring.

It could cause thrashing with your patch. Without the patch, the
current logic tries to find the hottest file pages that fit into the
current file list size and protects them successfully. Assume that the
access distance of the 50 MB of hot file pages is 60 MB, which is less
than the whole file list size but larger than the inactive list size.
Without your patch, the 50 MB of (hot) pages are not evicted at all.
All these hot pages will be protected from the 100 MB of pages with
low access frequency. The 100 MB of low-access-frequency pages will be
refaulted repeatedly, but that's correct behaviour.

However, with your patch, the 50 MB of (hot) file pages are deactivated
due to the newly added file pages with low access frequency. Then,
since the access distance of the 50 MB of (hot) pages is larger than
the inactive list size, they could not get a second chance and could
finally be evicted. I think this is thrashing, since
low-access-frequency pages that don't fit the current file list size
push out the high-access-frequency pages that do fit, and it would
happen again and again.

Maybe the logic could be corrected if the patch considered the
inactive age of the anon list, but I think my suggestion above would
be enough.

> What this patch acknowledges is that the 100M file COULD fit fully
> into memory, and not require any IO to serve, IFF 100M of the active
> file or anon pages were cold and could be reclaimed or swapped out.
>
> In your example, the anon set is hot. We'll scan it slowly (at the
> rate of IO from the other file) and rotate the pages that are in use -
> which would be all of them. Likewise for the file - there will be some
> deactivations, but mark_page_accessed() or the second chance algorithm
> in page_check_references() for mapped will keep the hottest pages active.
> In a slightly modified example, 400M of the anon set is hot and 100M
> cold. Without my patch, we would never look for them and the second
> file would be IO-bound forever. After my patch, we would scan anon,
> eventually find the cold pages, swap them out, and then serve the
> entire workingset from memory.

Again, I agree with your goal. What I don't agree with is the
implementation used to achieve it.

Thanks.
Johannes Weiner May 28, 2020, 5:01 p.m. UTC | #4
On Thu, May 28, 2020 at 04:16:50PM +0900, Joonsoo Kim wrote:
> On Wed, May 27, 2020 at 10:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> > > On Thu, May 21, 2020 at 8:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > We activate cache refaults with reuse distances in pages smaller than
> > > > the size of the total cache. This allows new pages with competitive
> > > > access frequencies to establish themselves, as well as challenge and
> > > > potentially displace pages on the active list that have gone cold.
> > > >
> > > > However, that assumes that active cache can only replace other active
> > > > cache in a competition for the hottest memory. This is not a great
> > > > default assumption. The page cache might be thrashing while there are
> > > > enough completely cold and unused anonymous pages sitting around that
> > > > we'd only have to write to swap once to stop all IO from the cache.
> > > >
> > > > Activate cache refaults when their reuse distance in pages is smaller
> > > > than the total userspace workingset, including anonymous pages.
> > >
> > > Hmm... I'm not sure the correctness of this change.
> > >
> > > IIUC, this patch leads to more activations in the file list and more activations
> > > here will challenge the anon list since rotation ratio for the file
> > > list will be increased.
> >
> > Yes.
> >
> > > However, this change breaks active/inactive concept of the file list.
> > > active/inactive
> > > separation is implemented by in-list refault distance. anon list size has
> > > no direct connection with refault distance of the file list so using
> > > anon list size
> > > to detect workingset for file page breaks the concept.
> >
> > This is intentional, because there IS a connection: they both take up
> > space in RAM, and they both cost IO to bring back once reclaimed.
> 
> I know that. This is the reason that I said 'no direct connection'. The anon
> list size is directly related the *possible* file list size. But,
> active/inactive
> separation in one list is firstly based on *current* list size rather than
> the possible list size. Adding anon list size to detect workingset means
> to use the possible list size and I think that it's wrong.

Hm so you're saying we should try to grow the page cache, but maintain
an inactive/active balance within the cache based on its current size
in case growing is not possible.

I think it would be implementable*, but I'm not quite convinced that
it's worth distinguishing. We're talking about an overcommitted
workingset and thereby an inherently unstable state that may thrash
purely based on timing differences anyway. See below.

*It would require another page flag to tell whether a refaulting cache
page has challenged the anon set once (transitioning) or repeatedly
(thrashing), as we currently use the active state for that. If we
would repurpose PG_workingset to tell the first from the second
refault, we'd need a new flag to mark a page for memstall accounting.

> > When file is refaulting, it means we need to make more space for
> > cache. That space can come from stale active file pages. But what if
> > active cache is all hot, and meanwhile there are cold anon pages that
> > we could swap out once and then serve everything from RAM?
> >
> > When file is refaulting, we should find the coldest data that is
> > taking up RAM and kick it out. It doesn't matter whether it's file or
> > anon: the goal is to free up RAM with the least amount of IO risk.
> 
> I understand your purpose and agree with it. We need to find a solution.
> To achieve your goal, my suggestion is:
> 
> - refault distance < active file, then do activation and add up IO cost
> - refault distance < active file + anon list, then add up IO cost
> 
> This doesn't break workingset detection on file list and challenge
> the anon list as the same degree as you did.

Unfortunately, it doesn't challenge the anon set. We'll stay in cache
trimming mode where the IO cost balance doesn't have any effect.

We would need to

  1. activate on distance < active file
  2. force reclaim to scan anon when distance < active file + anon
  3. record thrashing and IO cost when distance < active file + anon
     and it's the second refault of the page.

Something like this, where vmscan would scan anon when
WORKINGSET_RESTORE or WORKINGSET_EXPAND events are occurring:

diff --git a/mm/workingset.c b/mm/workingset.c
index d481ea452eeb..41dde6274fff 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -275,9 +275,9 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 void workingset_refault(struct page *page, void *shadow)
 {
 	struct mem_cgroup *eviction_memcg;
+	unsigned long active_file, anon;
 	struct lruvec *eviction_lruvec;
 	unsigned long refault_distance;
-	unsigned long workingset_size;
 	struct pglist_data *pgdat;
 	struct mem_cgroup *memcg;
 	unsigned long eviction;
@@ -330,7 +330,7 @@ void workingset_refault(struct page *page, void *shadow)
 	refault_distance = (refault - eviction) & EVICTION_MASK;
 
 	/*
-	 * The activation decision for this page is made at the level
+	 * The distance decisions for this page is made at the level
 	 * where the eviction occurred, as that is where the LRU order
 	 * during page reclaim is being determined.
 	 *
@@ -344,32 +344,52 @@ void workingset_refault(struct page *page, void *shadow)
 
 	/*
 	 * Compare the distance to the existing workingset size. We
-	 * don't activate pages that couldn't stay resident even if
-	 * all the memory was available to the page cache. Whether
-	 * cache can compete with anon or not depends on having swap.
+	 * ignore pages that couldn't stay resident even if all the
+	 * memory was available to the page cache. Whether cache can
+	 * compete with anon or not depends on having swap.
 	 */
-	workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
+	active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
+	anon = 0;
 	if (mem_cgroup_get_nr_swap_pages(memcg) > 0) {
-		workingset_size += lruvec_page_state(eviction_lruvec,
-						     NR_INACTIVE_ANON);
-		workingset_size += lruvec_page_state(eviction_lruvec,
-						     NR_ACTIVE_ANON);
+		anon += lruvec_page_state(eviction_lruvec, NR_INACTIVE_ANON);
+		anon += lruvec_page_state(eviction_lruvec, NR_ACTIVE_ANON);
 	}
-	if (refault_distance > workingset_size)
-		goto out;
 
-	SetPageActive(page);
-	advance_inactive_age(memcg, pgdat);
-	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
+	if (refault_distance > active_file + anon)
+		goto out;
 
-	/* Page was active prior to eviction */
+	/*
+	 * Page has already challenged the workingset before, and it's
+	 * refaulting again: we're not transitioning out old cache,
+	 * we're thrashing and need to grow. Record the IO to tip the
+	 * reclaim balance and mark the page for memstall detection.
+	 */
 	if (workingset) {
-		SetPageWorkingset(page);
+		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
+
 		/* XXX: Move to lru_cache_add() when it supports new vs putback */
 		spin_lock_irq(&page_pgdat(page)->lru_lock);
 		lru_note_cost_page(page);
 		spin_unlock_irq(&page_pgdat(page)->lru_lock);
-		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
+
+		SetPageThrashing(page);
+	}
+
+	SetPageWorkingset(page);
+
+	/*
+	 * Activate right away if page can compete with the existing
+	 * active set given the *current* size of the page cache.
+	 *
+	 * Otherwise, the page cache needs to grow to house this
+	 * page. Tell reclaim to rebalance against anonymous memory.
+	 */
+	if (refault_distance <= active_file) {
+		SetPageActive(page);
+		advance_inactive_age(memcg, pgdat);
+		inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
+	} else {
+		inc_lruvec_state(lruvec, WORKINGSET_EXPAND);
 	}
 out:
 	rcu_read_unlock();

> > Remember that the file/anon split, and the inactive/active split, are
> > there to optimize reclaim. It doesn't mean that these memory pools are
> > independent from each other.
> >
> > The file list is split in two because of use-once cache. The anon and
> > file lists are split because of different IO patterns, because we may
> > not have swap etc. But once we are out of use-once cache, have swap
> > space available, and have corrected for the different cost of IO,
> > there needs to be a relative order between all pages in the system to
> > find the optimal candidates to reclaim.
> >
> > > My suspicion is started by this counter example.
> > >
> > > Environment:
> > > anon: 500 MB (so hot) / 500 MB (so hot)
> > > file: 50 MB (hot) / 50 MB (cold)
> > >
> > > Think about the situation that there is periodical access to other file (100 MB)
> > > with low frequency (refault distance is 500 MB)
> > >
> > > Without your change, this periodical access doesn't make thrashing for cached
> > > active file page since refault distance of periodical access is larger
> > > than the size of
> > > the active file list. However, with your change, it causes thrashing
> > > on the file list.
> >
> > It doesn't cause thrashing. It causes scanning because that 100M file
> > IS thrashing: with or without my patch, that refault IO is occuring.
> 
> It could cause thrashing for your patch.
> Without the patch, current logic try to
> find most hottest file pages that are fit into the current file list
> size and protect them
> successfully. Assume that access distance of 50 MB hot file pages is 60 MB
> which is less than whole file list size but larger than inactive list
> size. Without
> your patch, 50 MB (hot) pages are not evicted at all. All these hot
> pages will be
> protected from the 100MB low access frequency pages. 100 MB low access
> frequency pages will be refaulted repeatedely but it's correct behaviour.

Hm, something doesn't quite add up. Why is the 50M hot set evicted
with my patch?

With a 60M access distance and a 100M page cache, they might get
deactivated, but then activated back to the head of the active list on
their next access. They should get scanned but not actually reclaimed.

The only way they could get reclaimed is if their access distance ends
up bigger than the file cache. But if that's the case, then the
workingset is overcommitted, and none of the pages qualify for reclaim
protection. Picking a subset to protect against the rest is arbitrary.

For example, if you started out with 50M of free space and an empty
cache and started accessing the 50M and 100M files simultaneously,
with the access distances you specified, none of them would get
activated. You would have the same behavior with or without my patch.

> However, with your patch, 50 MB (hot) file pages are deactivated due to newly
> added file pages with low access frequency. And, then, since access distance of
> 50 MB (hot) pages is larger than inactive list size, they could not
> get a second chance
> and finally could be evicted.

Ah okay, that's a different concern. We DO get two references, but we
fail to detect them properly.

First, we don't enforce the small inactive list size when there is
refault turnover like this. Deactivation is permanently allowed, which
means the lists are scanned in proportion to their size and should
converge on a 50/50 balance. The 50M pages should get referenced again
in the second half of their in-memory time and get re-activated.

It's somewhat debatable whether that's the right thing to do when
access distances are bunched up instead of uniform. Should a page
that's accessed twice with a 10M distance, and then not again for
another 80M, stay resident in a 60M cache?

It's not entirely unreasonable to say a page needs to get accessed in
each half of in-memory time to be considered still active, to ensure a
reasonable maximum distance between reference clusters.

But if this is an actual concern, shouldn't we instead improve the
accuracy of the LRU ordering? For example, if we're refaulting and
have dissolved the inactive/active separation, shouldn't
shrink_active_list() and shrink_page_list() actually rotate pages with
PG_referenced or pte references?

We ignore those references right now because we rely on second chance
to catch them; and respecting them would incur too much CPU overhead
when all we try to do is funnel one-off cache through the list.

But if we don't trust second-chance with this, and we're already
having refault IO and CPU is less of a concern, that seems like an
obvious change to make.
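
A self-contained illustration of that idea, with entirely hypothetical
names (it is not meant as a drop-in change to page_check_references()):

	/*
	 * When refault churn has dissolved the inactive/active
	 * separation, trust any observed reference and rotate the
	 * page instead of relying on second chance alone.
	 */
	enum ref_action { ROTATE_PAGE, RECLAIM_PAGE };

	static enum ref_action refs_under_refault(bool pte_referenced,
						  bool flag_referenced,
						  bool lists_dissolved)
	{
		if (lists_dissolved && (pte_referenced || flag_referenced))
			return ROTATE_PAGE;	/* keep it on the LRU */
		return RECLAIM_PAGE;		/* normal use-once treatment */
	}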
Joonsoo Kim May 29, 2020, 6:48 a.m. UTC | #5
On Fri, May 29, 2020 at 2:02 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, May 28, 2020 at 04:16:50PM +0900, Joonsoo Kim wrote:
> > On Wed, May 27, 2020 at 10:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> > > > On Thu, May 21, 2020 at 8:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > >
> > > > > We activate cache refaults with reuse distances in pages smaller than
> > > > > the size of the total cache. This allows new pages with competitive
> > > > > access frequencies to establish themselves, as well as challenge and
> > > > > potentially displace pages on the active list that have gone cold.
> > > > >
> > > > > However, that assumes that active cache can only replace other active
> > > > > cache in a competition for the hottest memory. This is not a great
> > > > > default assumption. The page cache might be thrashing while there are
> > > > > enough completely cold and unused anonymous pages sitting around that
> > > > > we'd only have to write to swap once to stop all IO from the cache.
> > > > >
> > > > > Activate cache refaults when their reuse distance in pages is smaller
> > > > > than the total userspace workingset, including anonymous pages.
> > > >
> > > > Hmm... I'm not sure the correctness of this change.
> > > >
> > > > IIUC, this patch leads to more activations in the file list and more activations
> > > > here will challenge the anon list since rotation ratio for the file
> > > > list will be increased.
> > >
> > > Yes.
> > >
> > > > However, this change breaks active/inactive concept of the file list.
> > > > active/inactive
> > > > separation is implemented by in-list refault distance. anon list size has
> > > > no direct connection with refault distance of the file list so using
> > > > anon list size
> > > > to detect workingset for file page breaks the concept.
> > >
> > > This is intentional, because there IS a connection: they both take up
> > > space in RAM, and they both cost IO to bring back once reclaimed.
> >
> > I know that. This is the reason that I said 'no direct connection'. The anon
> > list size is directly related the *possible* file list size. But,
> > active/inactive
> > separation in one list is firstly based on *current* list size rather than
> > the possible list size. Adding anon list size to detect workingset means
> > to use the possible list size and I think that it's wrong.
>
> Hm so you're saying we should try to grow the page cache, but maintain
> an inactive/active balance within the cache based on its current size
> in case growing is not possible.
>
> I think it would be implementable*, but I'm not quite convinced that
> it's worth distinguishing. We're talking about an overcommitted
> workingset and thereby an inherently unstable state that may thrash
> purely based on timing differences anyway. See below.

I think this patch wrongly judges the workingset to be overcommitted.
See the new example below in this reply.

> *It would require another page flag to tell whether a refaulting cache
> page has challenged the anon set once (transitioning) or repeatedly
> (thrashing), as we currently use the active state for that. If we
> would repurpose PG_workingset to tell the first from the second
> refault, we'd need a new flag to mark a page for memstall accounting.

I don't understand why a new flag is needed. Whenever we find that a
challenge is needed (dist < active + anon), we need to add up the IO cost.

> > > When file is refaulting, it means we need to make more space for
> > > cache. That space can come from stale active file pages. But what if
> > > active cache is all hot, and meanwhile there are cold anon pages that
> > > we could swap out once and then serve everything from RAM?
> > >
> > > When file is refaulting, we should find the coldest data that is
> > > taking up RAM and kick it out. It doesn't matter whether it's file or
> > > anon: the goal is to free up RAM with the least amount of IO risk.
> >
> > I understand your purpose and agree with it. We need to find a solution.
> > To achieve your goal, my suggestion is:
> >
> > - refault distance < active file, then do activation and add up IO cost
> > - refault distance < active file + anon list, then add up IO cost
> >
> > This doesn't break workingset detection on file list and challenge
> > the anon list as the same degree as you did.
>
> Unfortunately, it doesn't challenge the anon set. We'll stay in cache
> trimming mode where the IO cost balance doesn't have any effect.

Ah... I missed that. I think we can avoid cache trimming mode easily.
As with WORKINGSET_REFAULT, we need to snapshot WORKINGSET_EXPAND, and
if we find a change in it, we can skip cache trimming mode.
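
A rough sketch of that snapshot idea (WORKINGSET_EXPAND is the counter
proposed earlier in this thread; the 'expands' snapshot field is
hypothetical, while the existing 'refaults' snapshot in struct lruvec is
visible in the patch later in this thread):

	static bool file_workingset_needs_room(struct lruvec *lruvec)
	{
		unsigned long activates, expands;

		activates = lruvec_page_state(lruvec, WORKINGSET_ACTIVATE);
		expands = lruvec_page_state(lruvec, WORKINGSET_EXPAND);

		/* stay out of cache-trim mode while either advances */
		return activates != lruvec->refaults ||
		       expands != lruvec->expands;
	}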

> We would need to
>
>   1. activate on distance < active file
>   2. force reclaim to scan anon when distance < active file + anon
>   3. record thrashing and IO cost when distance < active file + anon
>      and it's the second refault of the page.
>
> Something like this, where vmscan would scan anon when
> WORKINGSET_RESTORE or WORKINGSET_EXPAND events are occuring:
>
> diff --git a/mm/workingset.c b/mm/workingset.c
> index d481ea452eeb..41dde6274fff 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -275,9 +275,9 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
>  void workingset_refault(struct page *page, void *shadow)
>  {
>         struct mem_cgroup *eviction_memcg;
> +       unsigned long active_file, anon;
>         struct lruvec *eviction_lruvec;
>         unsigned long refault_distance;
> -       unsigned long workingset_size;
>         struct pglist_data *pgdat;
>         struct mem_cgroup *memcg;
>         unsigned long eviction;
> @@ -330,7 +330,7 @@ void workingset_refault(struct page *page, void *shadow)
>         refault_distance = (refault - eviction) & EVICTION_MASK;
>
>         /*
> -        * The activation decision for this page is made at the level
> +        * The distance decisions for this page is made at the level
>          * where the eviction occurred, as that is where the LRU order
>          * during page reclaim is being determined.
>          *
> @@ -344,32 +344,52 @@ void workingset_refault(struct page *page, void *shadow)
>
>         /*
>          * Compare the distance to the existing workingset size. We
> -        * don't activate pages that couldn't stay resident even if
> -        * all the memory was available to the page cache. Whether
> -        * cache can compete with anon or not depends on having swap.
> +        * ignore pages that couldn't stay resident even if all the
> +        * memory was available to the page cache. Whether cache can
> +        * compete with anon or not depends on having swap.
>          */
> -       workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
> +       active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
> +       anon = 0;
>         if (mem_cgroup_get_nr_swap_pages(memcg) > 0) {
> -               workingset_size += lruvec_page_state(eviction_lruvec,
> -                                                    NR_INACTIVE_ANON);
> -               workingset_size += lruvec_page_state(eviction_lruvec,
> -                                                    NR_ACTIVE_ANON);
> +               anon += lruvec_page_state(eviction_lruvec, NR_INACTIVE_ANON);
> +               anon += lruvec_page_state(eviction_lruvec, NR_ACTIVE_ANON);
>         }
> -       if (refault_distance > workingset_size)
> -               goto out;
>
> -       SetPageActive(page);
> -       advance_inactive_age(memcg, pgdat);
> -       inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
> +       if (refault_distance > active_file + anon)
> +               goto out;
>
> -       /* Page was active prior to eviction */
> +       /*
> +        * Page has already challenged the workingset before, and it's
> +        * refaulting again: we're not transitioning out old cache,
> +        * we're thrashing and need to grow. Record the IO to tip the
> +        * reclaim balance and mark the page for memstall detection.
> +        */
>         if (workingset) {
> -               SetPageWorkingset(page);
> +               inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
> +
>                 /* XXX: Move to lru_cache_add() when it supports new vs putback */
>                 spin_lock_irq(&page_pgdat(page)->lru_lock);
>                 lru_note_cost_page(page);
>                 spin_unlock_irq(&page_pgdat(page)->lru_lock);
> -               inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
> +
> +               SetPageThrashing(page);
> +       }
> +
> +       SetPageWorkingset(page);
> +
> +       /*
> +        * Activate right away if page can compete with the existing
> +        * active set given the *current* size of the page cache.
> +        *
> +        * Otherwise, the page cache needs to grow to house this
> +        * page. Tell reclaim to rebalance against anonymous memory.
> +        */
> +       if (refault_distance <= active_file) {
> +               SetPageActive(page);
> +               advance_inactive_age(memcg, pgdat);
> +               inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
> +       } else {
> +               inc_lruvec_state(lruvec, WORKINGSET_EXPAND);
>         }
>  out:
>         rcu_read_unlock();
>
> > > Remember that the file/anon split, and the inactive/active split, are
> > > there to optimize reclaim. It doesn't mean that these memory pools are
> > > independent from each other.
> > >
> > > The file list is split in two because of use-once cache. The anon and
> > > file lists are split because of different IO patterns, because we may
> > > not have swap etc. But once we are out of use-once cache, have swap
> > > space available, and have corrected for the different cost of IO,
> > > there needs to be a relative order between all pages in the system to
> > > find the optimal candidates to reclaim.
> > >
> > > > My suspicion is started by this counter example.
> > > >
> > > > Environment:
> > > > anon: 500 MB (so hot) / 500 MB (so hot)
> > > > file: 50 MB (hot) / 50 MB (cold)
> > > >
> > > > Think about the situation that there is periodical access to other file (100 MB)
> > > > with low frequency (refault distance is 500 MB)
> > > >
> > > > Without your change, this periodical access doesn't make thrashing for cached
> > > > active file page since refault distance of periodical access is larger
> > > > than the size of
> > > > the active file list. However, with your change, it causes thrashing
> > > > on the file list.
> > >
> > > It doesn't cause thrashing. It causes scanning because that 100M file
> > > IS thrashing: with or without my patch, that refault IO is occuring.
> >
> > It could cause thrashing for your patch.
> > Without the patch, current logic try to
> > find most hottest file pages that are fit into the current file list
> > size and protect them
> > successfully. Assume that access distance of 50 MB hot file pages is 60 MB
> > which is less than whole file list size but larger than inactive list
> > size. Without
> > your patch, 50 MB (hot) pages are not evicted at all. All these hot
> > pages will be
> > protected from the 100MB low access frequency pages. 100 MB low access
> > frequency pages will be refaulted repeatedely but it's correct behaviour.
>
> Hm, something doesn't quite add up. Why is the 50M hot set evicted
> with my patch?

Thanks for the kind explanation. I read it all and found that I was
confused before. Let me correct the example.

Environment:
anon: 500 MB (hot) / 500 MB (hot)
file: 50 MB (so hot) / 50 MB (dummy)

I will call the 50 MB of hot file pages F(H).
Let's assume that periodic access to another file (500 MB) is started. That
file consists of 5 parts, each 100 MB. I will call them P1, P2, ..., P5.

The problem will occur with the following access pattern.

F(H) -> P1 -> F(H) -> P2 -> ... -> F(H) -> P5 -> F(H) -> P1 -> ...

With this access pattern, the access distances of F(H) and Pn are:

F(H) = 150 MB
Pn = 750 MB

Without your patch, F(H) is kept in memory since deactivation would not
happen. However, with your patch, deactivation happens since Pn's refault
distance is less than 'active file + anon'. In the end, F(H) would finally
be evicted.

> With a 60M access distance and a 100M page cache, they might get
> deactivated, but then activated back to the head of the active list on
> their next access. They should get scanned but not actually reclaimed.

Sorry about the wrong example. I have fixed it above.

> The only way they could get reclaimed is if their access distance ends
> up bigger than the file cache. But if that's the case, then the
> workingset is overcommitted, and none of the pages qualify for reclaim
> protection. Picking a subset to protect against the rest is arbitrary.

In the fixed example, although the other file (500 MB) is repeatedly
accessed, it's not the workingset. If we had a unified list (file + anon),
the access distance of Pn would be larger than the whole memory size.
Therefore, it's not an overcommitted workingset, and this patch wrongly
tries to activate it. As I said before, without considering inactive_age
for the anon list, this calculation cannot be correct.

> For example, if you started out with 50M of free space and an empty
> cache and started accessing the 50M and 100M files simultaneously,
> with the access distances you specified, none of them would get
> activated. You would have the same behavior with or without my patch.

It's not a good example for my case. The purpose of the active/inactive
separation is to protect previously hot pages until new evidence is
found. When accessing the 50M and 100M files simultaneously on 100M of
free space, none of them get activated. However, if the 50M is already
on the active list and access to the 50M and 100M files then happens
simultaneously, we need to protect the previous 50M. My example starts
with 50M on the active list, so it should be protected until new
evidence is found. Pn is not new evidence.

> > However, with your patch, 50 MB (hot) file pages are deactivated due to newly
> > added file pages with low access frequency. And, then, since access distance of
> > 50 MB (hot) pages is larger than inactive list size, they could not
> > get a second chance
> > and finally could be evicted.
>
> Ah okay, that's a different concern. We DO get two references, but we
> fail to detect them properly.

As I said before, I was confused here. Your explanation below is
correct and I understand it. Please focus on my fixed example.

Thanks.
Johannes Weiner May 29, 2020, 3:12 p.m. UTC | #6
On Fri, May 29, 2020 at 03:48:00PM +0900, Joonsoo Kim wrote:
> On Fri, May 29, 2020 at 2:02 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Thu, May 28, 2020 at 04:16:50PM +0900, Joonsoo Kim wrote:
> > > On Wed, May 27, 2020 at 10:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> > *It would require another page flag to tell whether a refaulting cache
> > page has challenged the anon set once (transitioning) or repeatedly
> > (thrashing), as we currently use the active state for that. If we
> > would repurpose PG_workingset to tell the first from the second
> > refault, we'd need a new flag to mark a page for memstall accounting.
> 
> I don't understand why a new flag is needed. Whenever we found that
> challenge is needed (dist < active + anon), we need to add up IO cost.

It sounds like this was cleared up later on in the email.

> > > It could cause thrashing for your patch.
> > > Without the patch, current logic try to
> > > find most hottest file pages that are fit into the current file list
> > > size and protect them
> > > successfully. Assume that access distance of 50 MB hot file pages is 60 MB
> > > which is less than whole file list size but larger than inactive list
> > > size. Without
> > > your patch, 50 MB (hot) pages are not evicted at all. All these hot
> > > pages will be
> > > protected from the 100MB low access frequency pages. 100 MB low access
> > > frequency pages will be refaulted repeatedely but it's correct behaviour.
> >
> > Hm, something doesn't quite add up. Why is the 50M hot set evicted
> > with my patch?
> 
> Thanks for kind explanation. I read all and I found that I was confused before.
> Please let me correct the example.
> 
> Environment:
> anon: 500 MB (hot) / 500 MB (hot)
> file: 50 MB (so hot) / 50 MB (dummy)
> 
> I will call 50 MB file hot pages as F(H).
> Let's assume that periodical access to other file (500 MB) is started. That
> file consists of 5 parts and each one is 100 MB. I will call it P1, P2, ..., P5.
> 
> Problem will occur on following access pattern.
> 
> F(H) -> P1 -> F(H) -> P2 -> ... -> F(H) -> P5 -> F(H) -> P1 -> ...
> 
> With this access pattern, access distance of F(H) and Pn is:
> 
> F(H) = 150 MB
> Pn = 750 MB
> 
> Without your patch, F(H) is kept on the memory since deactivation would not
> happen. However, with your patch, deactivation happens since Pn's refault
> distance is less than 'active file + anon'. In the end, F(H) would be finally
> evicted.

Okay, this example makes sense to me.

I do think P needs to challenge the workingset - at first. P could
easily fit into memory by itself if anon and active_file were cold, so
we need to challenge them to find out that they're hot. As you can
see, if anon and F(H) were completely unused, the current behavior
would be incorrect.

The current behavior would do the same in a cache-only example:

	anon = 1G (unreclaimable)
	file = 500M (hot) / 300M (dummy)

	P = 400M

	F(H) -> P1 -> F(H) -> P2 ...

If F(H) is already active, the first P refaults would have a distance
of 100M, thereby challenging F(H). As F(H) reactivates, its size will
be reflected in the refault distances, pushing them beyond the size of
memory that is available to the cache: 600M refault distance > 500M
active cache, or 900M access distance > 800M cache space.

However, I agree with your observation about the anon age below. When
we start aging the anon set, we have to reflect that in the refault
distances. Once we know that the 1G anon pages are indeed hotter than
the pages in P, there is no reason to keep churning the workingset.

> > The only way they could get reclaimed is if their access distance ends
> > up bigger than the file cache. But if that's the case, then the
> > workingset is overcommitted, and none of the pages qualify for reclaim
> > protection. Picking a subset to protect against the rest is arbitrary.
> 
> In the fixed example, although other file (500 MB) is repeatedly accessed,
> it's not workingset. If we have unified list (file + anon), access distance of
> Pn will be larger than whole memory size. Therefore, it's not overcommitted
> workingset and this patch wrongly try to activate it. As I said before,
> without considering inactive_age for anon list, this calculation can not be
> correct.

You're right. If we don't take anon age into account, the activations
could be over-eager; however, so would counting IO cost and exerting
pressure on anon, which means my previous patch to split these two
wouldn't fix the fundamental problem you're pointing out. We simply
have to take anon age into account for the refaults to be comparable.

Once we do that, in your example, we would see activations in the
beginning in order to challenge the combined workingset (active_file +
anon) - which is legitimate as long as we don't know it's hot. And as
the anon pages are scanned and rotated (and the challenged F(h)
reactivated), the refault distances increase accordingly to reflect
the size of the hot pages sampled, which will correctly put P's
distances beyond the size of available memory.

However, your example cannot have a completely silent stable state. As
we stop workingset aging, the refault distances will slowly increase
again. We will always have a bit of churn, and rightfully so, because
the workingset *could* go stale.

That's the same situation in my cache-only example above. Anytime you
have a subset of pages that by itself could fit into memory, but can't
because of an established workingset, ongoing sampling is necessary.

But the rate definitely needs to reduce as we detect that in-memory
pages are indeed hot. Otherwise we cause more churn than is required
for an appropriate rate of workingset sampling.

How about the patch below? It looks correct, but I will have to re-run
my tests to make sure I / we are not missing anything.

---

From b105c85bf1cebfc4d1057f7228d7484c0ee77127 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, 29 May 2020 09:40:00 -0400
Subject: [PATCH] mm: workingset: age nonresident information alongside
 anonymous pages

After ("mm: workingset: let cache workingset challenge anon fix"), we
compare refault distances to active_file + anon. But age of the
non-resident information is only driven by the file LRU. As a result,
we may overestimate the recency of any incoming refaults and activate
them too eagerly, causing unnecessary LRU churn in certain situations.

Make anon aging drive nonresident age as well to address that.

Reported-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |  4 ++--
 include/linux/swap.h   |  1 +
 mm/vmscan.c            |  3 +++
 mm/workingset.c        | 46 +++++++++++++++++++++++++-----------------
 4 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e959602140b4..39459b8a7bb8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -255,8 +255,8 @@ struct lruvec {
 	 */
 	unsigned long			anon_cost;
 	unsigned long			file_cost;
-	/* Evictions & activations on the inactive file list */
-	atomic_long_t			inactive_age;
+	/* Non-resident age, driven by LRU movement */
+	atomic_long_t			nonresident_age;
 	/* Refaults at the time of last reclaim cycle */
 	unsigned long			refaults;
 	/* Various lruvec state flags (enum lruvec_flags) */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 157e5081bf98..216df7bdaf62 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -312,6 +312,7 @@ struct vma_swap_readahead {
 };
 
 /* linux/mm/workingset.c */
+void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
 void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg);
 void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c628f9ab886b..2fee237063e7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -904,6 +904,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		__delete_from_swap_cache(page, swap);
 		xa_unlock_irqrestore(&mapping->i_pages, flags);
 		put_swap_page(page, swap);
+		workingset_eviction(page, target_memcg);
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
@@ -1884,6 +1885,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 				list_add(&page->lru, &pages_to_free);
 		} else {
 			nr_moved += nr_pages;
+			if (PageActive(page))
+				workingset_age_nonresident(lruvec, nr_pages);
 		}
 	}
 
diff --git a/mm/workingset.c b/mm/workingset.c
index d481ea452eeb..50b7937bab32 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -156,8 +156,8 @@
  *
  *		Implementation
  *
- * For each node's file LRU lists, a counter for inactive evictions
- * and activations is maintained (node->inactive_age).
+ * For each node's LRU lists, a counter for inactive evictions and
+ * activations is maintained (node->nonresident_age).
  *
  * On eviction, a snapshot of this counter (along with some bits to
  * identify the node) is stored in the now empty page cache
@@ -213,7 +213,17 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*workingsetp = workingset;
 }
 
-static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
+/**
+ * workingset_age_nonresident - age non-resident entries as LRU ages
+ * @lruvec: the lruvec that was aged
+ * @nr_pages: the number of pages to count
+ *
+ * As in-memory pages are aged, non-resident pages need to be aged as
+ * well, in order for the refault distances later on to be comparable
+ * to the in-memory dimensions. This function allows reclaim and LRU
+ * operations to drive the non-resident aging along in parallel.
+ */
+void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
 {
 	/*
 	 * Reclaiming a cgroup means reclaiming all its children in a
@@ -227,11 +237,8 @@ static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
 	 * the root cgroup's, age as well.
 	 */
 	do {
-		struct lruvec *lruvec;
-
-		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-		atomic_long_inc(&lruvec->inactive_age);
-	} while (memcg && (memcg = parent_mem_cgroup(memcg)));
+		atomic_long_add(nr_pages, &lruvec->nonresident_age);
+	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
 /**
@@ -254,12 +261,11 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	advance_inactive_age(page_memcg(page), pgdat);
-
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
+	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
 	/* XXX: target_memcg can be NULL, go through lruvec */
 	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
-	eviction = atomic_long_read(&lruvec->inactive_age);
+	eviction = atomic_long_read(&lruvec->nonresident_age);
 	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
 
@@ -309,20 +315,20 @@ void workingset_refault(struct page *page, void *shadow)
 	if (!mem_cgroup_disabled() && !eviction_memcg)
 		goto out;
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
-	refault = atomic_long_read(&eviction_lruvec->inactive_age);
+	refault = atomic_long_read(&eviction_lruvec->nonresident_age);
 
 	/*
 	 * Calculate the refault distance
 	 *
 	 * The unsigned subtraction here gives an accurate distance
-	 * across inactive_age overflows in most cases. There is a
+	 * across nonresident_age overflows in most cases. There is a
 	 * special case: usually, shadow entries have a short lifetime
 	 * and are either refaulted or reclaimed along with the inode
 	 * before they get too old.  But it is not impossible for the
-	 * inactive_age to lap a shadow entry in the field, which can
-	 * then result in a false small refault distance, leading to a
-	 * false activation should this old entry actually refault
-	 * again.  However, earlier kernels used to deactivate
+	 * nonresident_age to lap a shadow entry in the field, which
+	 * can then result in a false small refault distance, leading
+	 * to a false activation should this old entry actually
+	 * refault again.  However, earlier kernels used to deactivate
 	 * unconditionally with *every* reclaim invocation for the
 	 * longest time, so the occasional inappropriate activation
 	 * leading to pressure on the active list is not a problem.
@@ -359,7 +365,7 @@ void workingset_refault(struct page *page, void *shadow)
 		goto out;
 
 	SetPageActive(page);
-	advance_inactive_age(memcg, pgdat);
+	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
 	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
 
 	/* Page was active prior to eviction */
@@ -382,6 +388,7 @@ void workingset_refault(struct page *page, void *shadow)
 void workingset_activation(struct page *page)
 {
 	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
 
 	rcu_read_lock();
 	/*
@@ -394,7 +401,8 @@ void workingset_activation(struct page *page)
 	memcg = page_memcg_rcu(page);
 	if (!mem_cgroup_disabled() && !memcg)
 		goto out;
-	advance_inactive_age(memcg, page_pgdat(page));
+	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
 out:
 	rcu_read_unlock();
 }
Joonsoo Kim June 1, 2020, 6:14 a.m. UTC | #7
On Sat, May 30, 2020 at 12:12 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, May 29, 2020 at 03:48:00PM +0900, Joonsoo Kim wrote:
> > On Fri, May 29, 2020 at 2:02 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > On Thu, May 28, 2020 at 04:16:50PM +0900, Joonsoo Kim wrote:
> > > > On Wed, May 27, 2020 at 10:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> > > *It would require another page flag to tell whether a refaulting cache
> > > page has challenged the anon set once (transitioning) or repeatedly
> > > (thrashing), as we currently use the active state for that. If we
> > > would repurpose PG_workingset to tell the first from the second
> > > refault, we'd need a new flag to mark a page for memstall accounting.
> >
> > I don't understand why a new flag is needed. Whenever we found that
> > challenge is needed (dist < active + anon), we need to add up IO cost.
>
> It sounds like this was cleared up later on in the email.
>
> > > > It could cause thrashing for your patch.
> > > > Without the patch, current logic try to
> > > > find most hottest file pages that are fit into the current file list
> > > > size and protect them
> > > > successfully. Assume that access distance of 50 MB hot file pages is 60 MB
> > > > which is less than whole file list size but larger than inactive list
> > > > size. Without
> > > > your patch, 50 MB (hot) pages are not evicted at all. All these hot
> > > > pages will be
> > > > protected from the 100MB low access frequency pages. 100 MB low access
> > > > frequency pages will be refaulted repeatedely but it's correct behaviour.
> > >
> > > Hm, something doesn't quite add up. Why is the 50M hot set evicted
> > > with my patch?
> >
> > Thanks for kind explanation. I read all and I found that I was confused before.
> > Please let me correct the example.
> >
> > Environment:
> > anon: 500 MB (hot) / 500 MB (hot)
> > file: 50 MB (so hot) / 50 MB (dummy)
> >
> > I will call 50 MB file hot pages as F(H).
> > Let's assume that periodical access to other file (500 MB) is started. That
> > file consists of 5 parts and each one is 100 MB. I will call it P1, P2, ..., P5.
> >
> > Problem will occur on following access pattern.
> >
> > F(H) -> P1 -> F(H) -> P2 -> ... -> F(H) -> P5 -> F(H) -> P1 -> ...
> >
> > With this access pattern, access distance of F(H) and Pn is:
> >
> > F(H) = 150 MB
> > Pn = 750 MB
> >
> > Without your patch, F(H) is kept on the memory since deactivation would not
> > happen. However, with your patch, deactivation happens since Pn's refault
> > distance is less than 'active file + anon'. In the end, F(H) would be finally
> > evicted.
>
> Okay, this example makes sense to me.
>
> I do think P needs to challenge the workingset - at first. P could
> easily fit into memory by itself if anon and active_file were cold, so
> we need to challenge them to find out that they're hot. As you can
> see, if anon and F(H) were completely unused, the current behavior
> would be incorrect.
>
> The current behavior would do the same in a cache-only example:
>
>         anon = 1G (unreclaimable)
>         file = 500M (hot) / 300M (dummy)
>
>         P = 400M
>
>         F(H) -> P1 -> F(H) -> P2 ...
>
> If F(H) is already active, the first P refaults would have a distance
> of 100M, thereby challenging F(H). As F(H) reactivates, its size will
> be reflected in the refault distances, pushing them beyond the size of
> memory that is available to the cache: 600M refault distance > 500M
> active cache, or 900M access distance > 800M cache space.

Hmm... It seems that the current behavior (before your patch) has no
problem for this example. It causes deactivation but doesn't cause
eviction, so there is no workingset thrashing.

> However, I agree with your observation about the anon age below. When
> we start aging the anon set, we have to reflect that in the refault
> distances. Once we know that the 1G anon pages are indeed hotter than
> the pages in P, there is no reason to keep churning the workingset.

Okay.

> > > The only way they could get reclaimed is if their access distance ends
> > > up bigger than the file cache. But if that's the case, then the
> > > workingset is overcommitted, and none of the pages qualify for reclaim
> > > protection. Picking a subset to protect against the rest is arbitrary.
> >
> > In the fixed example, although other file (500 MB) is repeatedly accessed,
> > it's not workingset. If we have unified list (file + anon), access distance of
> > Pn will be larger than whole memory size. Therefore, it's not overcommitted
> > workingset and this patch wrongly try to activate it. As I said before,
> > without considering inactive_age for anon list, this calculation can not be
> > correct.
>
> You're right. If we don't take anon age into account, the activations
> could be over-eager; however, so would counting IO cost and exerting
> pressure on anon be, which means my previous patch to split these two
> wouldn't fix fundamental the problem you're pointing out. We simply

Splitting would not fix the fundamental problem (over-eager) but it would
greatly weaken the problem. Just counting IO cost doesn't break the
active/inactive separation in file list. It does cause more scan on anon list
but I think that it's endurable.

> have to take anon age into account for the refaults to be comparable.

Yes, taking anon age into account is also a good candidate to fix the problem.

> Once we do that, in your example, we would see activations in the
> beginning in order to challenge the combined workingset (active_file +
> anon) - which is legitimate as long as we don't know it's hot. And as
> the anon pages are scanned and rotated (and the challenged F(h)
> reactivated), the refault distances increase accordingly to reflect
> the size of the hot pages sampled, which will correctly put P's
> distances beyond the size of available memory.

Okay.

> However, your example cannot have a completely silent stable state. As
> we stop workingset aging, the refault distances will slowly increase
> again. We will always have a bit of churn, and rightfully so, because
> the workingset *could* go stale.
>
> That's the same situation in my cache-only example above. Anytime you
> have a subset of pages that by itself could fit into memory, but can't
> because of an established workingset, ongoing sampling is necessary.
>
> But the rate definitely needs to reduce as we detect that in-memory
> pages are indeed hot. Otherwise we cause more churn than is required
> for an appropriate rate of workingset sampling.
>
> How about the patch below? It looks correct, but I will have to re-run
> my tests to make sure I / we are not missing anything.

Much better! It may solve my concern mostly.

But, I still think that modified refault activation equation isn't
safe. The next
problem I found is related to the scan ratio limit patch ("limit the range of
LRU type balancing") on this series. See the below example.

anon: Hot (X M)
file: Hot (200 M) / dummy (200 M)
P: 1200 M (3 parts, each one 400 M, P1, P2, P3)
Access Pattern: A -> F(H) -> P1 -> A -> F(H) -> P2 -> ... ->

Without this patch, A and F(H) are kept on the memory and look like
it's correct.

With this patch and below fix, refault equation for Pn would be:

Refault dist of Pn = 1200 (from file non-resident) + 1200 * anon scan
ratio (from anon non-resident)
anon + active file = X + 200
1200 + 1200 * anon scan ratio (0.5 ~ 2) < X + 200

According to the size of X, Pn's refault result would be different. Pn could
be activated with large enough X and then F(H) could be evicted. In ideal
case (unified list), for this example, Pn should not be activated in any X.

This is a fundamental problem since we have two list type (file/anon) and
scan ratio limit is required. Anyway, we need to take care of this reality and
the way most safe is to count IO cost instead of doing activation in this
'non-resident dist < (active + anon list)' case.
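
To illustrate the idea, here is a very rough sketch (not a patch from
this thread; it assumes an IO-cost hook like the lru_note_cost_page()
helper from this series, and active_file/anon stand in for the
corresponding lruvec page counts):

if (refault_distance < active_file) {
        /* Fits into the file workingset on its own: activate as today. */
        SetPageActive(page);
        workingset_age_nonresident(lruvec, hpage_nr_pages(page));
        inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
} else if (refault_distance < active_file + anon) {
        /*
         * Would only fit if anon shrinks: don't churn the active file
         * list, just charge the refault as IO cost so that reclaim
         * leans on the anon list instead.
         */
        lru_note_cost_page(page);
}

This would keep the active/inactive separation of the file list intact
while still shifting reclaim pressure toward anon.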

Again, for this patch, I'm not confident myself so please let me know if I'm
wrong.

Thanks.
Johannes Weiner June 1, 2020, 3:56 p.m. UTC | #8
On Mon, Jun 01, 2020 at 03:14:24PM +0900, Joonsoo Kim wrote:
> On Sat, May 30, 2020 at 12:12 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Fri, May 29, 2020 at 03:48:00PM +0900, Joonsoo Kim wrote:
> > > On Fri, May 29, 2020 at 2:02 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > On Thu, May 28, 2020 at 04:16:50PM +0900, Joonsoo Kim wrote:
> > > > > On Wed, May 27, 2020 at 10:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > > On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> > > > The only way they could get reclaimed is if their access distance ends
> > > > up bigger than the file cache. But if that's the case, then the
> > > > workingset is overcommitted, and none of the pages qualify for reclaim
> > > > protection. Picking a subset to protect against the rest is arbitrary.
> > >
> > > In the fixed example, although other file (500 MB) is repeatedly accessed,
> > > it's not workingset. If we have unified list (file + anon), access distance of
> > > Pn will be larger than whole memory size. Therefore, it's not overcommitted
> > > workingset and this patch wrongly try to activate it. As I said before,
> > > without considering inactive_age for anon list, this calculation can not be
> > > correct.
> >
> > You're right. If we don't take anon age into account, the activations
> > could be over-eager; however, so would counting IO cost and exerting
> > pressure on anon be, which means my previous patch to split these two
> > wouldn't fix fundamental the problem you're pointing out. We simply
> 
> Splitting would not fix the fundamental problem (over-eager) but it would
> greatly weaken the problem. Just counting IO cost doesn't break the
> active/inactive separation in file list. It does cause more scan on anon list
> but I think that it's endurable.

I think the split is a good idea.

The only thing I'm not sure yet is if we can get away without an
additional page flag if the active flag cannot be reused to denote
thrashing. I'll keep at it, maybe I can figure something out.

But I think it would be follow-up work.

> > have to take anon age into account for the refaults to be comparable.
> 
> Yes, taking anon age into account is also a good candidate to fix the problem.

Okay, good.

> > However, your example cannot have a completely silent stable state. As
> > we stop workingset aging, the refault distances will slowly increase
> > again. We will always have a bit of churn, and rightfully so, because
> > the workingset *could* go stale.
> >
> > That's the same situation in my cache-only example above. Anytime you
> > have a subset of pages that by itself could fit into memory, but can't
> > because of an established workingset, ongoing sampling is necessary.
> >
> > But the rate definitely needs to reduce as we detect that in-memory
> > pages are indeed hot. Otherwise we cause more churn than is required
> > for an appropriate rate of workingset sampling.
> >
> > How about the patch below? It looks correct, but I will have to re-run
> > my tests to make sure I / we are not missing anything.
> 
> Much better! It may solve my concern mostly.

Okay thanks for confirming. I'll send a proper version to Andrew.

> But, I still think that modified refault activation equation isn't
> safe. The next
> problem I found is related to the scan ratio limit patch ("limit the range of
> LRU type balancing") on this series. See the below example.
> 
> anon: Hot (X M)
> file: Hot (200 M) / dummy (200 M)
> P: 1200 M (3 parts, each one 400 M, P1, P2, P3)
> Access Pattern: A -> F(H) -> P1 -> A -> F(H) -> P2 -> ... ->
> 
> Without this patch, A and F(H) are kept on the memory and look like
> it's correct.
> 
> With this patch and below fix, refault equation for Pn would be:
> 
> Refault dist of Pn = 1200 (from file non-resident) + 1200 * anon scan
> ratio (from anon non-resident)
> anon + active file = X + 200
> 1200 + 1200 * anon scan ratio (0.5 ~ 2) < X + 200

That doesn't look quite right to me. The anon part of the refault
distance is driven by X, so the left-hand of this formula contains X
as well.

1000 file (1200M reuse distance, 200M in-core size) + F(H) reactivations + X * scan ratio < X + 1000

Activations persist as long as anon isn't fully scanned and it isn't
established yet that it's fully hot. Meaning, we optimistically assume
the refaulting pages can be workingset until we're proven wrong.

> According to the size of X, Pn's refault result would be different. Pn could
> be activated with large enough X and then F(H) could be evicted. In ideal
> case (unified list), for this example, Pn should not be activated in any X.

Yes. The active/iocost split would allow us to be smarter about it.

> This is a fundamental problem since we have two list type (file/anon) and
> scan ratio limit is required. Anyway, we need to take care of this reality and
> the way most safe is to count IO cost instead of doing activation in this
> 'non-resident dist < (active + anon list)' case.

Agreed here again.

> Again, for this patch, I'm not confident myself so please let me know if I'm
> wrong.

As far as this patch goes, I think it's important to look at the
bigger picture.

We need to have convergence first before being able to worry about
optimizing. Stable states are optimizations, but false stable states
are correctness problems.

For the longest time, we scanned active pages unconditionally during
page reclaim. This was always safe in the sense that it wouldn't get
stuck on a stale workingset, but it incurs unnecessary workingset
churn when reclaim is driven by use-once patterns.

We optimized the latter too aggressively, and as a result caused
situations where we indefinitely fail to cache the hottest
data. That's not really a workable trade-off.

With the active/iocost split you're suggesting, we can reasonably
optimize your example scenario. But we can't do it if the flipside
means complete failure to transition between in-memory sets.

So I think we should go ahead with this patch (with the anon age
recognition fixed, because that's a correctness issue), and follow it
up with the stable state optimization to shrink anon first.
Johannes Weiner June 1, 2020, 8:44 p.m. UTC | #9
On Mon, Jun 01, 2020 at 11:56:17AM -0400, Johannes Weiner wrote:
> On Mon, Jun 01, 2020 at 03:14:24PM +0900, Joonsoo Kim wrote:
> > On Sat, May 30, 2020 at 12:12 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > However, your example cannot have a completely silent stable state. As
> > > we stop workingset aging, the refault distances will slowly increase
> > > again. We will always have a bit of churn, and rightfully so, because
> > > the workingset *could* go stale.
> > >
> > > That's the same situation in my cache-only example above. Anytime you
> > > have a subset of pages that by itself could fit into memory, but can't
> > > because of an established workingset, ongoing sampling is necessary.
> > >
> > > But the rate definitely needs to reduce as we detect that in-memory
> > > pages are indeed hot. Otherwise we cause more churn than is required
> > > for an appropriate rate of workingset sampling.
> > >
> > > How about the patch below? It looks correct, but I will have to re-run
> > > my tests to make sure I / we are not missing anything.
> > 
> > Much better! It may solve my concern mostly.
> 
> Okay thanks for confirming. I'll send a proper version to Andrew.

I sent the below patch through my battery of tests to make sure it
doesn't break anything fundamental, and they didn't. For the
convergence test, the overload test, and the kernel build, the
difference, if any, is in the noise, which is expected.

Andrew, absent any objections, can you please fold the below into the
original patch?

It's a bit noisy due to the rename of inactive_age now that it's used
for anon as well (frankly, it was a misnomer from the start), and
switching the interface to take an lruvec instead of a memcg and a
pgdat, which works better for the callers (I stole parent_lruvec()
from "mm: vmscan: determine anon/file pressure balance at the reclaim
root" later in the series, so that isn't new code).

The main change is the two bits in __remove_mapping and
move_pages_to_lru, so that shadow entries are aged when anon is
reclaimed or rotated.

---

From a8faceabc1454dfd878caee2a8422493d937a394 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 1 Jun 2020 14:04:09 -0400
Subject: [PATCH] mm: workingset: let cache workingset challenge anon fix

We compare refault distances to active_file + anon. But age of the
non-resident information is only driven by the file LRU. As a result,
we may overestimate the recency of any incoming refaults and activate
them too eagerly, causing unnecessary LRU churn in certain situations.

Make anon aging drive nonresident age as well to address that.

Reported-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 13 +++++++++++
 include/linux/mmzone.h     |  4 ++--
 include/linux/swap.h       |  1 +
 mm/vmscan.c                |  3 +++
 mm/workingset.c            | 46 ++++++++++++++++++++++----------------
 5 files changed, 46 insertions(+), 21 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 32a0b4d47540..d982c80da157 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1303,6 +1303,19 @@ static inline void dec_lruvec_page_state(struct page *page,
 	mod_lruvec_page_state(page, idx, -1);
 }
 
+static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = lruvec_memcg(lruvec);
+	if (!memcg)
+		return NULL;
+	memcg = parent_mem_cgroup(memcg);
+	if (!memcg)
+		return NULL;
+	return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c1fbda9ddd1f..7cccbb8bc2d7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -262,8 +262,8 @@ enum lruvec_flags {
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
 	struct zone_reclaim_stat	reclaim_stat;
-	/* Evictions & activations on the inactive file list */
-	atomic_long_t			inactive_age;
+	/* Non-resident age, driven by LRU movement */
+	atomic_long_t			nonresident_age;
 	/* Refaults at the time of last reclaim cycle */
 	unsigned long			refaults;
 	/* Various lruvec state flags (enum lruvec_flags) */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 30fd4641890e..f0d2dca28c99 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -312,6 +312,7 @@ struct vma_swap_readahead {
 };
 
 /* linux/mm/workingset.c */
+void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
 void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg);
 void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 43f88b1a4f14..3033f1c951cd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -898,6 +898,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		__delete_from_swap_cache(page, swap);
 		xa_unlock_irqrestore(&mapping->i_pages, flags);
 		put_swap_page(page, swap);
+		workingset_eviction(page, target_memcg);
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
@@ -1876,6 +1877,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 				list_add(&page->lru, &pages_to_free);
 		} else {
 			nr_moved += nr_pages;
+			if (PageActive(page))
+				workingset_age_nonresident(lruvec, nr_pages);
 		}
 	}
 
diff --git a/mm/workingset.c b/mm/workingset.c
index e69865739539..98b91e623b85 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -156,8 +156,8 @@
  *
  *		Implementation
  *
- * For each node's file LRU lists, a counter for inactive evictions
- * and activations is maintained (node->inactive_age).
+ * For each node's LRU lists, a counter for inactive evictions and
+ * activations is maintained (node->nonresident_age).
  *
  * On eviction, a snapshot of this counter (along with some bits to
  * identify the node) is stored in the now empty page cache
@@ -213,7 +213,17 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*workingsetp = workingset;
 }
 
-static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
+/**
+ * workingset_age_nonresident - age non-resident entries as LRU ages
+ * @lruvec: the lruvec that was aged
+ * @nr_pages: the number of pages to count
+ *
+ * As in-memory pages are aged, non-resident pages need to be aged as
+ * well, in order for the refault distances later on to be comparable
+ * to the in-memory dimensions. This function allows reclaim and LRU
+ * operations to drive the non-resident aging along in parallel.
+ */
+void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
 {
 	/*
 	 * Reclaiming a cgroup means reclaiming all its children in a
@@ -227,11 +237,8 @@ static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
 	 * the root cgroup's, age as well.
 	 */
 	do {
-		struct lruvec *lruvec;
-
-		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-		atomic_long_inc(&lruvec->inactive_age);
-	} while (memcg && (memcg = parent_mem_cgroup(memcg)));
+		atomic_long_add(nr_pages, &lruvec->nonresident_age);
+	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
 /**
@@ -254,12 +261,11 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	advance_inactive_age(page_memcg(page), pgdat);
-
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
+	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
 	/* XXX: target_memcg can be NULL, go through lruvec */
 	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
-	eviction = atomic_long_read(&lruvec->inactive_age);
+	eviction = atomic_long_read(&lruvec->nonresident_age);
 	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
 
@@ -309,20 +315,20 @@ void workingset_refault(struct page *page, void *shadow)
 	if (!mem_cgroup_disabled() && !eviction_memcg)
 		goto out;
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
-	refault = atomic_long_read(&eviction_lruvec->inactive_age);
+	refault = atomic_long_read(&eviction_lruvec->nonresident_age);
 
 	/*
 	 * Calculate the refault distance
 	 *
 	 * The unsigned subtraction here gives an accurate distance
-	 * across inactive_age overflows in most cases. There is a
+	 * across nonresident_age overflows in most cases. There is a
 	 * special case: usually, shadow entries have a short lifetime
 	 * and are either refaulted or reclaimed along with the inode
 	 * before they get too old.  But it is not impossible for the
-	 * inactive_age to lap a shadow entry in the field, which can
-	 * then result in a false small refault distance, leading to a
-	 * false activation should this old entry actually refault
-	 * again.  However, earlier kernels used to deactivate
+	 * nonresident_age to lap a shadow entry in the field, which
+	 * can then result in a false small refault distance, leading
+	 * to a false activation should this old entry actually
+	 * refault again.  However, earlier kernels used to deactivate
 	 * unconditionally with *every* reclaim invocation for the
 	 * longest time, so the occasional inappropriate activation
 	 * leading to pressure on the active list is not a problem.
@@ -359,7 +365,7 @@ void workingset_refault(struct page *page, void *shadow)
 		goto out;
 
 	SetPageActive(page);
-	advance_inactive_age(memcg, pgdat);
+	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
 	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
 
 	/* Page was active prior to eviction */
@@ -378,6 +384,7 @@ void workingset_refault(struct page *page, void *shadow)
 void workingset_activation(struct page *page)
 {
 	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
 
 	rcu_read_lock();
 	/*
@@ -390,7 +397,8 @@ void workingset_activation(struct page *page)
 	memcg = page_memcg_rcu(page);
 	if (!mem_cgroup_disabled() && !memcg)
 		goto out;
-	advance_inactive_age(memcg, page_pgdat(page));
+	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
 out:
 	rcu_read_unlock();
 }
Joonsoo Kim June 2, 2020, 2:34 a.m. UTC | #10
On Tue, Jun 2, 2020 at 12:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Jun 01, 2020 at 03:14:24PM +0900, Joonsoo Kim wrote:
> > On Sat, May 30, 2020 at 12:12 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Fri, May 29, 2020 at 03:48:00PM +0900, Joonsoo Kim wrote:
> > > > On Fri, May 29, 2020 at 2:02 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > On Thu, May 28, 2020 at 04:16:50PM +0900, Joonsoo Kim wrote:
> > > > > > On Wed, May 27, 2020 at 10:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > > > On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> > > > > The only way they could get reclaimed is if their access distance ends
> > > > > up bigger than the file cache. But if that's the case, then the
> > > > > workingset is overcommitted, and none of the pages qualify for reclaim
> > > > > protection. Picking a subset to protect against the rest is arbitrary.
> > > >
> > > > In the fixed example, although other file (500 MB) is repeatedly accessed,
> > > > it's not workingset. If we have unified list (file + anon), access distance of
> > > > Pn will be larger than whole memory size. Therefore, it's not overcommitted
> > > > workingset and this patch wrongly try to activate it. As I said before,
> > > > without considering inactive_age for anon list, this calculation can not be
> > > > correct.
> > >
> > > You're right. If we don't take anon age into account, the activations
> > > could be over-eager; however, so would counting IO cost and exerting
> > > pressure on anon be, which means my previous patch to split these two
> > > wouldn't fix fundamental the problem you're pointing out. We simply
> >
> > Splitting would not fix the fundamental problem (over-eager) but it would
> > greatly weaken the problem. Just counting IO cost doesn't break the
> > active/inactive separation in file list. It does cause more scan on anon list
> > but I think that it's endurable.
>
> I think the split is a good idea.
>
> The only thing I'm not sure yet is if we can get away without an
> additional page flag if the active flag cannot be reused to denote
> thrashing. I'll keep at it, maybe I can figure something out.
>
> But I think it would be follow-up work.
>
> > > have to take anon age into account for the refaults to be comparable.
> >
> > Yes, taking anon age into account is also a good candidate to fix the problem.
>
> Okay, good.
>
> > > However, your example cannot have a completely silent stable state. As
> > > we stop workingset aging, the refault distances will slowly increase
> > > again. We will always have a bit of churn, and rightfully so, because
> > > the workingset *could* go stale.
> > >
> > > That's the same situation in my cache-only example above. Anytime you
> > > have a subset of pages that by itself could fit into memory, but can't
> > > because of an established workingset, ongoing sampling is necessary.
> > >
> > > But the rate definitely needs to reduce as we detect that in-memory
> > > pages are indeed hot. Otherwise we cause more churn than is required
> > > for an appropriate rate of workingset sampling.
> > >
> > > How about the patch below? It looks correct, but I will have to re-run
> > > my tests to make sure I / we are not missing anything.
> >
> > Much better! It may solve my concern mostly.
>
> Okay thanks for confirming. I'll send a proper version to Andrew.

Okay.

> > But, I still think that modified refault activation equation isn't
> > safe. The next
> > problem I found is related to the scan ratio limit patch ("limit the range of
> > LRU type balancing") on this series. See the below example.
> >
> > anon: Hot (X M)
> > file: Hot (200 M) / dummy (200 M)
> > P: 1200 M (3 parts, each one 400 M, P1, P2, P3)
> > Access Pattern: A -> F(H) -> P1 -> A -> F(H) -> P2 -> ... ->
> >
> > Without this patch, A and F(H) are kept on the memory and look like
> > it's correct.
> >
> > With this patch and below fix, refault equation for Pn would be:
> >
> > Refault dist of Pn = 1200 (from file non-resident) + 1200 * anon scan
> > ratio (from anon non-resident)
> > anon + active file = X + 200
> > 1200 + 1200 * anon scan ratio (0.5 ~ 2) < X + 200
>
> That doesn't look quite right to me. The anon part of the refault
> distance is driven by X, so the left-hand of this formula contains X
> as well.
>
> 1000 file (1200M reuse distance, 200M in-core size) + F(H) reactivations + X * scan ratio < X + 1000

As I said before, there is no X on the left-hand side of this formula.
To access all Pn and re-access P1, we need to scan and reclaim 1200M of
the file list; no more scanning than that is needed. With your patch
"limit the range of LRU type balancing", the scan ratio between the file
and anon lists is limited to 0.5 ~ 2.0, so the maximum anon scan would be
1200 M * 2.0, that is, 2400 M, and it is not bounded by X. That means
that the file list cannot be stable for some X.

> Activations persist as long as anon isn't fully scanned and it isn't
> established yet that it's fully hot. Meaning, we optimistically assume
> the refaulting pages can be workingset until we're proven wrong.
>
> > According to the size of X, Pn's refault result would be different. Pn could
> > be activated with large enough X and then F(H) could be evicted. In ideal
> > case (unified list), for this example, Pn should not be activated in any X.
>
> Yes. The active/iocost split would allow us to be smarter about it.
>
> > This is a fundamental problem since we have two list type (file/anon) and
> > scan ratio limit is required. Anyway, we need to take care of this reality and
> > the way most safe is to count IO cost instead of doing activation in this
> > 'non-resident dist < (active + anon list)' case.
>
> Agreed here again.
>
> > Again, for this patch, I'm not confident myself so please let me know if I'm
> > wrong.
>
> As far as this patch goes, I think it's important to look at the
> bigger picture.
>
> We need to have convergence first before being able to worry about
> optimizing. Stable states are optimizations, but false stable states
> are correctness problems.
>
> For the longest time, we scanned active pages unconditionally during
> page reclaim. This was always safe in the sense that it wouldn't get
> stuck on a stale workingset, but it incurs unnecessary workingset
> churn when reclaim is driven by use-once patterns.
>
> We optimized the latter too aggressively, and as a result caused
> situations where we indefinitely fail to cache the hottest
> data. That's not really a workable trade-off.
>
> With the active/iocost split you're suggesting, we can reasonably
> optimize your example scenario. But we can't do it if the flipside
> means complete failure to transition between in-memory sets.
>
> So I think we should go ahead with this patch (with the anon age
> recognition fixed, because that's a correctness issue), and follow it
> up with the stable state optimization to shrink anon first.

If my last example is a correct one (your confirmation is required),
it is also related to the correctness issue, since cold pages cause
eviction of the hot pages repeatedly.

In this case, both options (without the patch, with the patch) have
some correctness issue, so we need to judge which one is better in
terms of overall impact. I don't have a strong opinion about it, so
it's up to you to decide the way to go.

Thanks.
Johannes Weiner June 2, 2020, 4:47 p.m. UTC | #11
On Tue, Jun 02, 2020 at 11:34:17AM +0900, Joonsoo Kim wrote:
> On Tue, Jun 2, 2020 at 12:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Mon, Jun 01, 2020 at 03:14:24PM +0900, Joonsoo Kim wrote:
> > > But, I still think that modified refault activation equation isn't
> > > safe. The next
> > > problem I found is related to the scan ratio limit patch ("limit the range of
> > > LRU type balancing") on this series. See the below example.
> > >
> > > anon: Hot (X M)
> > > file: Hot (200 M) / dummy (200 M)
> > > P: 1200 M (3 parts, each one 400 M, P1, P2, P3)
> > > Access Pattern: A -> F(H) -> P1 -> A -> F(H) -> P2 -> ... ->
> > >
> > > Without this patch, A and F(H) are kept on the memory and look like
> > > it's correct.
> > >
> > > With this patch and below fix, refault equation for Pn would be:
> > >
> > > Refault dist of Pn = 1200 (from file non-resident) + 1200 * anon scan
> > > ratio (from anon non-resident)
> > > anon + active file = X + 200
> > > 1200 + 1200 * anon scan ratio (0.5 ~ 2) < X + 200
> >
> > That doesn't look quite right to me. The anon part of the refault
> > distance is driven by X, so the left-hand of this formula contains X
> > as well.
> >
> > 1000 file (1200M reuse distance, 200M in-core size) + F(H) reactivations + X * scan ratio < X + 1000
> 
> As I said before, there is no X on the left-hand side of this formula.
> To access all Pn and re-access P1, we need to scan and reclaim 1200M of
> the file list; no more scanning than that is needed. With your patch
> "limit the range of LRU type balancing", the scan ratio between the file
> and anon lists is limited to 0.5 ~ 2.0, so the maximum anon scan would be
> 1200 M * 2.0, that is, 2400 M, and it is not bounded by X. That means
> that the file list cannot be stable for some X.

Oh, no X on the left because you're talking about the number of pages
scanned until the first refaults, which is fixed - so why are we still
interpreting the refault distance against a variable anon size X?

Well, that's misleading. We compare against anon because part of the
cache is already encoded in the refault distance. What we're really
checking is access distance against total amount of available RAM.

Consider this. We want to activate pages where

	access_distance <= RAM

and our measure of access distance is:

	access_distance = refault_distance + inactive_file

So the comparison becomes:

	refault_distance + inactive_file < RAM

which we simplify to:

	refault_distance < active_file + anon

There is a certain threshold for X simply because there is a certain
threshold for RAM beyond which we need to start activating. X cannot
be arbitrary, it must be X + cache filling up memory - after all we
have page reclaim evicting pages.

Again, this isn't new. In the current code, we activate when:

	refault_distance < active_file

which is

	access_distance <= RAM - anon

You can see, whether things are stable or not always depends on the
existing workingset size. It's just a proxy for how much total RAM we
have potentially available to the refaulting page.
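
As a toy illustration of that comparison (ordinary userspace C with
made-up names and numbers - this is not the kernel code):

#include <stdbool.h>
#include <stdio.h>

/* All sizes in the same unit, e.g. MB. */
static bool activate_on_refault(unsigned long refault_distance,
                                unsigned long active_file,
                                unsigned long anon)
{
        /*
         * access_distance = refault_distance + inactive_file
         * RAM             = inactive_file + active_file + anon
         * Activate when access_distance <= RAM, which, as above,
         * simplifies to:
         */
        return refault_distance < active_file + anon;
}

int main(void)
{
        /* 500M of active cache and 1000M of anon: */
        printf("%d\n", activate_on_refault(1400, 500, 1000)); /* 1 */
        printf("%d\n", activate_on_refault(1600, 500, 1000)); /* 0 */
        return 0;
}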

> If my last example is a correct one (your confirmation is required),
> it is also related to the correctness issue, since cold pages cause
> eviction of the hot pages repeatedly.

I think your example is correct, but it benefits from the VM
arbitrarily making an assumption that has a 50/50 shot of being true.

You and I know which pages are hot and which are cold because you
designed the example.

All the VM sees is this:

- We have an established workingset that has previously shown an
  access distance <= RAM and therefor was activated.

- We now have another set that also appears to have an access distance
  <= RAM. The only way to know for sure, however, is sample the
  established workingset and compare the relative access frequencies.

Currently, we just assume the incoming pages are colder. Clearly
that's beneficial when it's true. Clearly that can be totally wrong.

We must allow a fair comparison between these two sets.

For cache, that's already the case - that's why I brought up the
cache-only example: if refault distances are 50M and you have 60M of
active cache, we activate all refaults and force an even competition
between the established workingset and the new pages.

Whether we can protect active file when anon needs to shrink first and
can't (the activate/iocost split) that's a different question. But I'm
no longer so sure after looking into it further.

First, we would need two different refault distances: either we
consider anon age and need to compare to active_file + anon, or we
don't and compare to active_file only. We cannot mix willy nilly,
because the metrics wouldn't be comparable. We don't have the space to
store two different eviction timestamps, nor could we afford to cut
the precision in half.

Second, the additional page flag required to implement it.

Third, it's somewhat moot because we still have the same behavior when
active_file would need to shrink and can't. There can't be a stable
state as long as refault distances <= active_file.

> In this case, both options (without the patch, with the patch) have
> some correctness issue, so we need to judge which one is better in
> terms of overall impact. I don't have a strong opinion about it, so
> it's up to you to decide the way to go.

If my patch was simply changing the default assumption on which pages
are hot and which are cold, I would agree with you - the pros would be
equal to the cons, one way wouldn't be more correct than the other.

But that isn't what my patch is doing. What it does is get rid of the
assumption, to actually sample and compare the access frequencies when
there isn't enough data to make an informed decision.

That's a net improvement.
Joonsoo Kim June 3, 2020, 5:40 a.m. UTC | #12
On Wed, Jun 3, 2020 at 1:48 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Jun 02, 2020 at 11:34:17AM +0900, Joonsoo Kim wrote:
> > On Tue, Jun 2, 2020 at 12:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > On Mon, Jun 01, 2020 at 03:14:24PM +0900, Joonsoo Kim wrote:
> > > > But, I still think that modified refault activation equation isn't
> > > > safe. The next
> > > > problem I found is related to the scan ratio limit patch ("limit the range of
> > > > LRU type balancing") on this series. See the below example.
> > > >
> > > > anon: Hot (X M)
> > > > file: Hot (200 M) / dummy (200 M)
> > > > P: 1200 M (3 parts, each one 400 M, P1, P2, P3)
> > > > Access Pattern: A -> F(H) -> P1 -> A -> F(H) -> P2 -> ... ->
> > > >
> > > > Without this patch, A and F(H) are kept on the memory and look like
> > > > it's correct.
> > > >
> > > > With this patch and below fix, refault equation for Pn would be:
> > > >
> > > > Refault dist of Pn = 1200 (from file non-resident) + 1200 * anon scan
> > > > ratio (from anon non-resident)
> > > > anon + active file = X + 200
> > > > 1200 + 1200 * anon scan ratio (0.5 ~ 2) < X + 200
> > >
> > > That doesn't look quite right to me. The anon part of the refault
> > > distance is driven by X, so the left-hand of this formula contains X
> > > as well.
> > >
> > > 1000 file (1200M reuse distance, 200M in-core size) + F(H) reactivations + X * scan ratio < X + 1000
> >
> > As I said before, there is no X on the left-hand side of this formula.
> > To access all Pn and re-access P1, we need to scan and reclaim 1200M of
> > the file list; no more scanning than that is needed. With your patch
> > "limit the range of LRU type balancing", the scan ratio between the file
> > and anon lists is limited to 0.5 ~ 2.0, so the maximum anon scan would be
> > 1200 M * 2.0, that is, 2400 M, and it is not bounded by X. That means
> > that the file list cannot be stable for some X.
>
> Oh, no X on the left because you're talking about the number of pages
> scanned until the first refaults, which is fixed - so why are we still
> interpreting the refault distance against a variable anon size X?

Looks like I was confused again. Your formula is correct and mine is
wrong. My mistake was thinking that your patch "limit the range of LRU
type balancing", which limits the scan *ratio* to 2:1, also limits the
actual scan *count* ratio between anon/file to 2:1. But now I realize
that 2:1 is just the scan ratio, and the actual scan *count* ratio
could be far larger depending on the list sizes. It would be X * scan
ratio in the above example, so my explanation is wrong and you are right.

Sorry for the trouble.

> Well, that's misleading. We compare against anon because part of the
> cache is already encoded in the refault distance. What we're really
> checking is access distance against total amount of available RAM.
>
> Consider this. We want to activate pages where
>
>         access_distance <= RAM
>
> and our measure of access distance is:
>
>         access_distance = refault_distance + inactive_file
>
> So the comparison becomes:
>
>         refault_distance + inactive_file < RAM
>
> which we simplify to:
>
>         refault_distance < active_file + anon
>
> There is a certain threshold for X simply because there is a certain
> threshold for RAM beyond which we need to start activating. X cannot
> be arbitrary, it must be X + cache filling up memory - after all we
> have page reclaim evicting pages.
>
> Again, this isn't new. In the current code, we activate when:
>
>         refault_distance < active_file
>
> which is
>
>         access_distance <= RAM - anon
>
> You can see, whether things are stable or not always depends on the
> existing workingset size. It's just a proxy for how much total RAM we
> have potentially available to the refaulting page.
>
> > If my last example is a correct one (your confirmation is required),
> > it is also related to the correctness issue, since cold pages cause
> > eviction of the hot pages repeatedly.
>
> I think your example is correct, but it benefits from the VM
> arbitrarily making an assumption that has a 50/50 shot of being true.
>
> You and I know which pages are hot and which are cold because you
> designed the example.
>
> All the VM sees is this:
>
> - We have an established workingset that has previously shown an
>   access distance <= RAM and therefor was activated.
>
> - We now have another set that also appears to have an access distance
>   <= RAM. The only way to know for sure, however, is sample the
>   established workingset and compare the relative access frequencies.
>
> Currently, we just assume the incoming pages are colder. Clearly
> that's beneficial when it's true. Clearly that can be totally wrong.
>
> We must allow a fair comparison between these two sets.
>
> For cache, that's already the case - that's why I brought up the
> cache-only example: if refault distances are 50M and you have 60M of
> active cache, we activate all refaults and force an even competition
> between the established workingset and the new pages.
>
> Whether we can protect active file when anon needs to shrink first and
> can't (the activate/iocost split) that's a different question. But I'm
> no longer so sure after looking into it further.
>
> First, we would need two different refault distances: either we
> consider anon age and need to compare to active_file + anon, or we
> don't and compare to active_file only. We cannot mix willy nilly,
> because the metrics wouldn't be comparable. We don't have the space to
> store two different eviction timestamps, nor could we afford to cut
> the precision in half.
>
> Second, the additional page flag required to implement it.
>
> Third, it's somewhat moot because we still have the same behavior when
> active_file would need to shrink and can't. There can't be a stable
> state as long as refault distances <= active_file.
>
> > In this case, both options (without the patch, with the patch) have
> > some correctness issue, so we need to judge which one is better in
> > terms of overall impact. I don't have a strong opinion about it, so
> > it's up to you to decide the way to go.
>
> If my patch was simply changing the default assumption on which pages
> are hot and which are cold, I would agree with you - the pros would be
> equal to the cons, one way wouldn't be more correct than the other.
>
> But that isn't what my patch is doing. What it does is get rid of the
> assumption, to actually sample and compare the access frequencies when
> there isn't enough data to make an informed decision.
>
> That's a net improvement.

Okay. Now I think that this patch is a net improvement.

Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Thanks.
Vlastimil Babka June 4, 2020, 1:35 p.m. UTC | #13
On 6/1/20 10:44 PM, Johannes Weiner wrote:
> From a8faceabc1454dfd878caee2a8422493d937a394 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Mon, 1 Jun 2020 14:04:09 -0400
> Subject: [PATCH] mm: workingset: let cache workingset challenge anon fix

Looks like the whole series is now in mainline (that was quick!) without this
followup? As such it won't be squashed so perhaps the subject should be more
"standalone" now, description referencing commit 34e58cac6d8f ("mm: workingset:
let cache workingset challenge anon"), possibly with Fixes: tag etc...

> We compare refault distances to active_file + anon. But age of the
> non-resident information is only driven by the file LRU. As a result,
> we may overestimate the recency of any incoming refaults and activate
> them too eagerly, causing unnecessary LRU churn in certain situations.
> 
> Make anon aging drive nonresident age as well to address that.
> 
> Reported-by: Joonsoo Kim <js1304@gmail.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/memcontrol.h | 13 +++++++++++
>  include/linux/mmzone.h     |  4 ++--
>  include/linux/swap.h       |  1 +
>  mm/vmscan.c                |  3 +++
>  mm/workingset.c            | 46 ++++++++++++++++++++++----------------
>  5 files changed, 46 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 32a0b4d47540..d982c80da157 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1303,6 +1303,19 @@ static inline void dec_lruvec_page_state(struct page *page,
>  	mod_lruvec_page_state(page, idx, -1);
>  }
>  
> +static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
> +{
> +	struct mem_cgroup *memcg;
> +
> +	memcg = lruvec_memcg(lruvec);
> +	if (!memcg)
> +		return NULL;
> +	memcg = parent_mem_cgroup(memcg);
> +	if (!memcg)
> +		return NULL;
> +	return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
> +}
> +
>  #ifdef CONFIG_CGROUP_WRITEBACK
>  
>  struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index c1fbda9ddd1f..7cccbb8bc2d7 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -262,8 +262,8 @@ enum lruvec_flags {
>  struct lruvec {
>  	struct list_head		lists[NR_LRU_LISTS];
>  	struct zone_reclaim_stat	reclaim_stat;
> -	/* Evictions & activations on the inactive file list */
> -	atomic_long_t			inactive_age;
> +	/* Non-resident age, driven by LRU movement */
> +	atomic_long_t			nonresident_age;
>  	/* Refaults at the time of last reclaim cycle */
>  	unsigned long			refaults;
>  	/* Various lruvec state flags (enum lruvec_flags) */
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 30fd4641890e..f0d2dca28c99 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -312,6 +312,7 @@ struct vma_swap_readahead {
>  };
>  
>  /* linux/mm/workingset.c */
> +void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
>  void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg);
>  void workingset_refault(struct page *page, void *shadow);
>  void workingset_activation(struct page *page);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 43f88b1a4f14..3033f1c951cd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -898,6 +898,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>  		__delete_from_swap_cache(page, swap);
>  		xa_unlock_irqrestore(&mapping->i_pages, flags);
>  		put_swap_page(page, swap);
> +		workingset_eviction(page, target_memcg);
>  	} else {
>  		void (*freepage)(struct page *);
>  		void *shadow = NULL;
> @@ -1876,6 +1877,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>  				list_add(&page->lru, &pages_to_free);
>  		} else {
>  			nr_moved += nr_pages;
> +			if (PageActive(page))
> +				workingset_age_nonresident(lruvec, nr_pages);
>  		}
>  	}
>  
> diff --git a/mm/workingset.c b/mm/workingset.c
> index e69865739539..98b91e623b85 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -156,8 +156,8 @@
>   *
>   *		Implementation
>   *
> - * For each node's file LRU lists, a counter for inactive evictions
> - * and activations is maintained (node->inactive_age).
> + * For each node's LRU lists, a counter for inactive evictions and
> + * activations is maintained (node->nonresident_age).
>   *
>   * On eviction, a snapshot of this counter (along with some bits to
>   * identify the node) is stored in the now empty page cache
> @@ -213,7 +213,17 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
>  	*workingsetp = workingset;
>  }
>  
> -static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
> +/**
> + * workingset_age_nonresident - age non-resident entries as LRU ages
> + * @lruvec: the lruvec that was aged
> + * @nr_pages: the number of pages to count
> + *
> + * As in-memory pages are aged, non-resident pages need to be aged as
> + * well, in order for the refault distances later on to be comparable
> + * to the in-memory dimensions. This function allows reclaim and LRU
> + * operations to drive the non-resident aging along in parallel.
> + */
> +void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
>  {
>  	/*
>  	 * Reclaiming a cgroup means reclaiming all its children in a
> @@ -227,11 +237,8 @@ static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
>  	 * the root cgroup's, age as well.
>  	 */
>  	do {
> -		struct lruvec *lruvec;
> -
> -		lruvec = mem_cgroup_lruvec(memcg, pgdat);
> -		atomic_long_inc(&lruvec->inactive_age);
> -	} while (memcg && (memcg = parent_mem_cgroup(memcg)));
> +		atomic_long_add(nr_pages, &lruvec->nonresident_age);
> +	} while ((lruvec = parent_lruvec(lruvec)));
>  }
>  
>  /**
> @@ -254,12 +261,11 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
>  	VM_BUG_ON_PAGE(page_count(page), page);
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  
> -	advance_inactive_age(page_memcg(page), pgdat);
> -
>  	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
> +	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
>  	/* XXX: target_memcg can be NULL, go through lruvec */
>  	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
> -	eviction = atomic_long_read(&lruvec->inactive_age);
> +	eviction = atomic_long_read(&lruvec->nonresident_age);
>  	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
>  }
>  
> @@ -309,20 +315,20 @@ void workingset_refault(struct page *page, void *shadow)
>  	if (!mem_cgroup_disabled() && !eviction_memcg)
>  		goto out;
>  	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
> -	refault = atomic_long_read(&eviction_lruvec->inactive_age);
> +	refault = atomic_long_read(&eviction_lruvec->nonresident_age);
>  
>  	/*
>  	 * Calculate the refault distance
>  	 *
>  	 * The unsigned subtraction here gives an accurate distance
> -	 * across inactive_age overflows in most cases. There is a
> +	 * across nonresident_age overflows in most cases. There is a
>  	 * special case: usually, shadow entries have a short lifetime
>  	 * and are either refaulted or reclaimed along with the inode
>  	 * before they get too old.  But it is not impossible for the
> -	 * inactive_age to lap a shadow entry in the field, which can
> -	 * then result in a false small refault distance, leading to a
> -	 * false activation should this old entry actually refault
> -	 * again.  However, earlier kernels used to deactivate
> +	 * nonresident_age to lap a shadow entry in the field, which
> +	 * can then result in a false small refault distance, leading
> +	 * to a false activation should this old entry actually
> +	 * refault again.  However, earlier kernels used to deactivate
>  	 * unconditionally with *every* reclaim invocation for the
>  	 * longest time, so the occasional inappropriate activation
>  	 * leading to pressure on the active list is not a problem.
> @@ -359,7 +365,7 @@ void workingset_refault(struct page *page, void *shadow)
>  		goto out;
>  
>  	SetPageActive(page);
> -	advance_inactive_age(memcg, pgdat);
> +	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
>  	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
>  
>  	/* Page was active prior to eviction */
> @@ -378,6 +384,7 @@ void workingset_refault(struct page *page, void *shadow)
>  void workingset_activation(struct page *page)
>  {
>  	struct mem_cgroup *memcg;
> +	struct lruvec *lruvec;
>  
>  	rcu_read_lock();
>  	/*
> @@ -390,7 +397,8 @@ void workingset_activation(struct page *page)
>  	memcg = page_memcg_rcu(page);
>  	if (!mem_cgroup_disabled() && !memcg)
>  		goto out;
> -	advance_inactive_age(memcg, page_pgdat(page));
> +	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
>  out:
>  	rcu_read_unlock();
>  }
>
Johannes Weiner June 4, 2020, 3:05 p.m. UTC | #14
On Thu, Jun 04, 2020 at 03:35:27PM +0200, Vlastimil Babka wrote:
> On 6/1/20 10:44 PM, Johannes Weiner wrote:
> > From a8faceabc1454dfd878caee2a8422493d937a394 Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Date: Mon, 1 Jun 2020 14:04:09 -0400
> > Subject: [PATCH] mm: workingset: let cache workingset challenge anon fix
> 
> Looks like the whole series is now in mainline (that was quick!) without this
> followup? As such it won't be squashed so perhaps the subject should be more
> "standalone" now, description referencing commit 34e58cac6d8f ("mm: workingset:
> let cache workingset challenge anon"), possibly with Fixes: tag etc...

Yep, I'll send a stand-alone version of this. It was a bit fat to get
squashed last minute anyway, and I don't mind having a bit of extra
time to quantify the actual impact of this delta.

Thanks Vlastimil!
Joonsoo Kim June 12, 2020, 3:19 a.m. UTC | #15
On Fri, Jun 5, 2020 at 12:06 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Jun 04, 2020 at 03:35:27PM +0200, Vlastimil Babka wrote:
> > On 6/1/20 10:44 PM, Johannes Weiner wrote:
> > > From a8faceabc1454dfd878caee2a8422493d937a394 Mon Sep 17 00:00:00 2001
> > > From: Johannes Weiner <hannes@cmpxchg.org>
> > > Date: Mon, 1 Jun 2020 14:04:09 -0400
> > > Subject: [PATCH] mm: workingset: let cache workingset challenge anon fix
> >
> > Looks like the whole series is now in mainline (that was quick!) without this
> > followup? As such it won't be squashed so perhaps the subject should be more
> > "standalone" now, description referencing commit 34e58cac6d8f ("mm: workingset:
> > let cache workingset challenge anon"), possibly with Fixes: tag etc...
>
> Yep, I'll send a stand-alone version of this. It was a bit fat to get
> squashed last minute anyway, and I don't mind having a bit of extra
> time to quantify the actual impact of this delta.

Hello, Johannes.

Now, a week has passed. I hope that this patch is merged as soon as possible,
since my WIP patchset depends on this patch. How about trying to merge
this patch now? If you don't mind, I could send it on your behalf.

Thanks.
Johannes Weiner June 15, 2020, 1:41 p.m. UTC | #16
On Fri, Jun 12, 2020 at 12:19:58PM +0900, Joonsoo Kim wrote:
> On Fri, Jun 5, 2020 at 12:06 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, Jun 04, 2020 at 03:35:27PM +0200, Vlastimil Babka wrote:
> > > On 6/1/20 10:44 PM, Johannes Weiner wrote:
> > > > From a8faceabc1454dfd878caee2a8422493d937a394 Mon Sep 17 00:00:00 2001
> > > > From: Johannes Weiner <hannes@cmpxchg.org>
> > > > Date: Mon, 1 Jun 2020 14:04:09 -0400
> > > > Subject: [PATCH] mm: workingset: let cache workingset challenge anon fix
> > >
> > > Looks like the whole series is now in mainline (that was quick!) without this
> > > followup? As such it won't be squashed so perhaps the subject should be more
> > > "standalone" now, description referencing commit 34e58cac6d8f ("mm: workingset:
> > > let cache workingset challenge anon"), possibly with Fixes: tag etc...
> >
> > Yep, I'll send a stand-alone version of this. It was a bit fat to get
> > squashed last minute anyway, and I don't mind having a bit of extra
> > time to quantify the actual impact of this delta.
> 
> Hello, Johannes.
> 
> Now, a week has passed. I hope that this patch is merged as soon as possible,
> since my WIP patchset depends on this patch. How about trying to merge
> this patch now? If you don't mind, I could send it on your behalf.

I didn't realize you directly depended on it, my apologies.

Do you depend on it code-wise, or do you have a test case that shows
the difference?

Here is the latest version again, you can include it in your patch
series until we get it merged. But I think we should be able to merge
it as a follow-up fix into 5.8 still.

---

From d604062a66bc88ad1a1529c9e16952dc0d7c01c6 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, 29 May 2020 09:40:00 -0400
Subject: [PATCH] mm: workingset: age nonresident information alongside
 anonymous pages

After ("mm: workingset: let cache workingset challenge anon fix"), we
compare refault distances to active_file + anon. But age of the
non-resident information is only driven by the file LRU. As a result,
we may overestimate the recency of any incoming refaults and activate
them too eagerly, causing unnecessary LRU churn in certain situations.

Make anon aging drive nonresident age as well to address that.

Reported-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |  4 ++--
 include/linux/swap.h   |  1 +
 mm/vmscan.c            |  3 +++
 mm/workingset.c        | 46 +++++++++++++++++++++++++-----------------
 4 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c4c37fd12104..f6f884970511 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -257,8 +257,8 @@ struct lruvec {
 	 */
 	unsigned long			anon_cost;
 	unsigned long			file_cost;
-	/* Evictions & activations on the inactive file list */
-	atomic_long_t			inactive_age;
+	/* Non-resident age, driven by LRU movement */
+	atomic_long_t			nonresident_age;
 	/* Refaults at the time of last reclaim cycle */
 	unsigned long			refaults;
 	/* Various lruvec state flags (enum lruvec_flags) */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4c5974bb9ba9..5b3216ba39a9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -313,6 +313,7 @@ struct vma_swap_readahead {
 };
 
 /* linux/mm/workingset.c */
+void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
 void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg);
 void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b6d84326bdf2..749d239c62b2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -904,6 +904,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		__delete_from_swap_cache(page, swap);
 		xa_unlock_irqrestore(&mapping->i_pages, flags);
 		put_swap_page(page, swap);
+		workingset_eviction(page, target_memcg);
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
@@ -1884,6 +1885,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 				list_add(&page->lru, &pages_to_free);
 		} else {
 			nr_moved += nr_pages;
+			if (PageActive(page))
+				workingset_age_nonresident(lruvec, nr_pages);
 		}
 	}
 
diff --git a/mm/workingset.c b/mm/workingset.c
index d481ea452eeb..50b7937bab32 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -156,8 +156,8 @@
  *
  *		Implementation
  *
- * For each node's file LRU lists, a counter for inactive evictions
- * and activations is maintained (node->inactive_age).
+ * For each node's LRU lists, a counter for inactive evictions and
+ * activations is maintained (node->nonresident_age).
  *
  * On eviction, a snapshot of this counter (along with some bits to
  * identify the node) is stored in the now empty page cache
@@ -213,7 +213,17 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*workingsetp = workingset;
 }
 
-static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
+/**
+ * workingset_age_nonresident - age non-resident entries as LRU ages
+ * @lruvec: the lruvec that was aged
+ * @nr_pages: the number of pages to count
+ *
+ * As in-memory pages are aged, non-resident pages need to be aged as
+ * well, in order for the refault distances later on to be comparable
+ * to the in-memory dimensions. This function allows reclaim and LRU
+ * operations to drive the non-resident aging along in parallel.
+ */
+void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
 {
 	/*
 	 * Reclaiming a cgroup means reclaiming all its children in a
@@ -227,11 +237,8 @@ static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
 	 * the root cgroup's, age as well.
 	 */
 	do {
-		struct lruvec *lruvec;
-
-		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-		atomic_long_inc(&lruvec->inactive_age);
-	} while (memcg && (memcg = parent_mem_cgroup(memcg)));
+		atomic_long_add(nr_pages, &lruvec->nonresident_age);
+	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
 /**
@@ -254,12 +261,11 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	advance_inactive_age(page_memcg(page), pgdat);
-
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
+	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
 	/* XXX: target_memcg can be NULL, go through lruvec */
 	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
-	eviction = atomic_long_read(&lruvec->inactive_age);
+	eviction = atomic_long_read(&lruvec->nonresident_age);
 	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
 
@@ -309,20 +315,20 @@ void workingset_refault(struct page *page, void *shadow)
 	if (!mem_cgroup_disabled() && !eviction_memcg)
 		goto out;
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
-	refault = atomic_long_read(&eviction_lruvec->inactive_age);
+	refault = atomic_long_read(&eviction_lruvec->nonresident_age);
 
 	/*
 	 * Calculate the refault distance
 	 *
 	 * The unsigned subtraction here gives an accurate distance
-	 * across inactive_age overflows in most cases. There is a
+	 * across nonresident_age overflows in most cases. There is a
 	 * special case: usually, shadow entries have a short lifetime
 	 * and are either refaulted or reclaimed along with the inode
 	 * before they get too old.  But it is not impossible for the
-	 * inactive_age to lap a shadow entry in the field, which can
-	 * then result in a false small refault distance, leading to a
-	 * false activation should this old entry actually refault
-	 * again.  However, earlier kernels used to deactivate
+	 * nonresident_age to lap a shadow entry in the field, which
+	 * can then result in a false small refault distance, leading
+	 * to a false activation should this old entry actually
+	 * refault again.  However, earlier kernels used to deactivate
 	 * unconditionally with *every* reclaim invocation for the
 	 * longest time, so the occasional inappropriate activation
 	 * leading to pressure on the active list is not a problem.
@@ -359,7 +365,7 @@ void workingset_refault(struct page *page, void *shadow)
 		goto out;
 
 	SetPageActive(page);
-	advance_inactive_age(memcg, pgdat);
+	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
 	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
 
 	/* Page was active prior to eviction */
@@ -382,6 +388,7 @@ void workingset_refault(struct page *page, void *shadow)
 void workingset_activation(struct page *page)
 {
 	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
 
 	rcu_read_lock();
 	/*
@@ -394,7 +401,8 @@ void workingset_activation(struct page *page)
 	memcg = page_memcg_rcu(page);
 	if (!mem_cgroup_disabled() && !memcg)
 		goto out;
-	advance_inactive_age(memcg, page_pgdat(page));
+	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
 out:
 	rcu_read_unlock();
 }
Joonsoo Kim June 16, 2020, 6:09 a.m. UTC | #17
2020년 6월 15일 (월) 오후 10:41, Johannes Weiner <hannes@cmpxchg.org>님이 작성:
>
> On Fri, Jun 12, 2020 at 12:19:58PM +0900, Joonsoo Kim wrote:
> > 2020년 6월 5일 (금) 오전 12:06, Johannes Weiner <hannes@cmpxchg.org>님이 작성:
> > >
> > > On Thu, Jun 04, 2020 at 03:35:27PM +0200, Vlastimil Babka wrote:
> > > > On 6/1/20 10:44 PM, Johannes Weiner wrote:
> > > > > From a8faceabc1454dfd878caee2a8422493d937a394 Mon Sep 17 00:00:00 2001
> > > > > From: Johannes Weiner <hannes@cmpxchg.org>
> > > > > Date: Mon, 1 Jun 2020 14:04:09 -0400
> > > > > Subject: [PATCH] mm: workingset: let cache workingset challenge anon fix
> > > >
> > > > Looks like the whole series is now in mainline (that was quick!) without this
> > > > followup? As such it won't be squashed so perhaps the subject should be more
> > > > "standalone" now, description referencing commit 34e58cac6d8f ("mm: workingset:
> > > > let cache workingset challenge anon"), possibly with Fixes: tag etc...
> > >
> > > Yep, I'll send a stand-alone version of this. It was a bit fat to get
> > > squashed last minute anyway, and I don't mind having a bit of extra
> > > time to quantify the actual impact of this delta.
> >
> > Hello, Johannes.
> >
> > Now, a week has passed. I hope that this patch is merged as soon as possible,
> > since my WIP patchset depends on this patch. How about trying to merge
> > this patch now? If you don't mind, I could send it on your behalf.
>
> I didn't realize you directly depended on it, my apologies.

No problem.

> Do you depend on it code-wise, or do you have a test case that shows
> the difference?

Code-wise. My patchset also requires activation counting for anon
pages, so a conflict could occur in this case.

> Here is the latest version again; you can include it in your patch
> series until we get it merged. But I think we should be able to merge
> it as a follow-up fix into 5.8 still.

Yes, I will send it separately for 5.8, along with some minor fixes.
It would be appreciated if you could review them.

Thanks.

Patch

diff --git a/mm/workingset.c b/mm/workingset.c
index 474186b76ced..e69865739539 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -277,8 +277,8 @@  void workingset_refault(struct page *page, void *shadow)
 	struct mem_cgroup *eviction_memcg;
 	struct lruvec *eviction_lruvec;
 	unsigned long refault_distance;
+	unsigned long workingset_size;
 	struct pglist_data *pgdat;
-	unsigned long active_file;
 	struct mem_cgroup *memcg;
 	unsigned long eviction;
 	struct lruvec *lruvec;
@@ -310,7 +310,6 @@  void workingset_refault(struct page *page, void *shadow)
 		goto out;
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
 	refault = atomic_long_read(&eviction_lruvec->inactive_age);
-	active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
 
 	/*
 	 * Calculate the refault distance
@@ -345,10 +344,18 @@  void workingset_refault(struct page *page, void *shadow)
 
 	/*
 	 * Compare the distance to the existing workingset size. We
-	 * don't act on pages that couldn't stay resident even if all
-	 * the memory was available to the page cache.
+	 * don't activate pages that couldn't stay resident even if
+	 * all the memory was available to the page cache. Whether
+	 * cache can compete with anon or not depends on having swap.
 	 */
-	if (refault_distance > active_file)
+	workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
+	if (mem_cgroup_get_nr_swap_pages(memcg) > 0) {
+		workingset_size += lruvec_page_state(eviction_lruvec,
+						     NR_INACTIVE_ANON);
+		workingset_size += lruvec_page_state(eviction_lruvec,
+						     NR_ACTIVE_ANON);
+	}
+	if (refault_distance > workingset_size)
 		goto out;
 
 	SetPageActive(page);