diff mbox series

[v5,2/6] mm/page_idle: Add support for handling swapped PG_Idle pages

Message ID 20190807171559.182301-2-joel@joelfernandes.org (mailing list archive)
State New, archived
Series [v5,1/6] mm/page_idle: Add per-pid idle page tracking using virtual index

Commit Message

Joel Fernandes Aug. 7, 2019, 5:15 p.m. UTC
Idle page tracking currently does not work well in the following
scenario:
 1. mark page-A idle which was present at that time.
 2. run workload
 3. page-A is not touched by workload
 4. *sudden* memory pressure happens, so page A is finally swapped out
 5. now see the page A - it appears as if it was accessed (pte unmapped
    so idle bit not set in output) - but it's incorrect.

To fix this, we store the idle information into a new idle bit of the
swap PTE during swapping of anonymous pages.
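
As a rough userspace illustration of the idea above (not the kernel
implementation), the following C sketch models carrying a page's idle state
into a swap entry at swap-out time. The bit position, the entry layout, and
every sketch_* name are assumptions made up for this example; real swap PTE
layouts are architecture-specific.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical layout, for illustration only. The flag position and the
 * offset shift are NOT taken from the patch or from any architecture.
 */
#define SKETCH_SWP_IDLE_BIT	(UINT64_C(1) << 1)
#define SKETCH_SWP_OFFSET_SHIFT	8

typedef uint64_t sketch_pte_t;

struct sketch_page {
	bool idle;	/* stands in for the page's idle flag */
};

/* Set the (hypothetical) idle flag in a swap entry. */
static inline sketch_pte_t sketch_swp_mkidle(sketch_pte_t pte)
{
	return pte | SKETCH_SWP_IDLE_BIT;
}

/* Test the (hypothetical) idle flag in a swap entry. */
static inline bool sketch_swp_idle(sketch_pte_t pte)
{
	return (pte & SKETCH_SWP_IDLE_BIT) != 0;
}

/* Model of swap-out: build a swap entry and carry the idle state over. */
static sketch_pte_t sketch_swap_out(const struct sketch_page *page,
				    uint64_t swap_offset)
{
	sketch_pte_t swp_pte;

	/* Keep the offset small enough that the shift cannot overflow. */
	assert(swap_offset < (UINT64_C(1) << 50));

	swp_pte = swap_offset << SKETCH_SWP_OFFSET_SHIFT;
	if (page->idle)
		swp_pte = sketch_swp_mkidle(swp_pte);
	return swp_pte;
}
```

In this model, a page that was idle when it was swapped out yields an entry
for which sketch_swp_idle() returns true, so a later read can still report
the page as idle even though it is no longer mapped.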

Also in the future, madvise extensions will allow a system process
manager (like Android's ActivityManager) to swap pages out of a process
that it knows will be cold. To an external process like a heap profiler
that is doing idle tracking on another process, this procedure will
interfere with the idle page tracking similar to the above steps.

Suggested-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 arch/Kconfig                  |  3 +++
 include/asm-generic/pgtable.h |  6 ++++++
 mm/page_idle.c                | 26 ++++++++++++++++++++++++--
 mm/rmap.c                     |  2 ++
 4 files changed, 35 insertions(+), 2 deletions(-)

Comments

Michal Hocko Aug. 13, 2019, 3:04 p.m. UTC | #1
On Wed 07-08-19 13:15:55, Joel Fernandes (Google) wrote:
> Idle page tracking currently does not work well in the following
> scenario:
>  1. mark page-A idle which was present at that time.
>  2. run workload
>  3. page-A is not touched by workload
>  4. *sudden* memory pressure happens, so page A is finally swapped out
>  5. now see the page A - it appears as if it was accessed (pte unmapped
>     so idle bit not set in output) - but it's incorrect.
> 
> To fix this, we store the idle information into a new idle bit of the
> swap PTE during swapping of anonymous pages.
>
> Also in the future, madvise extensions will allow a system process
> manager (like Android's ActivityManager) to swap pages out of a process
> that it knows will be cold. To an external process like a heap profiler
> that is doing idle tracking on another process, this procedure will
> interfere with the idle page tracking similar to the above steps.

This could be solved by checking the !present/swapped out pages
right? Whoever decided to put the page out to the swap just made it
idle effectively.  So the monitor can make some educated guess for
tracking. If that is fundamentally not possible then please describe
why.
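
One way to picture the monitor-side guess described above is the following
C sketch. The bit positions (63 = present, 62 = swapped) are those of the
documented /proc/<pid>/pagemap entry format; the policy function
monitor_guess_idle() and the helper read_pagemap_entry() are hypothetical
names invented for this illustration and are not part of the patch.

```c
#define _XOPEN_SOURCE 700
#define _FILE_OFFSET_BITS 64

#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Bit positions from the /proc/<pid>/pagemap entry format. */
#define PM_PRESENT	(UINT64_C(1) << 63)
#define PM_SWAPPED	(UINT64_C(1) << 62)

static bool pm_present(uint64_t entry)
{
	return (entry & PM_PRESENT) != 0;
}

static bool pm_swapped(uint64_t entry)
{
	return (entry & PM_SWAPPED) != 0;
}

/*
 * Hypothetical monitor policy: a swapped-out page is assumed idle,
 * otherwise fall back to whatever the idle bitmap reported.
 */
static bool monitor_guess_idle(uint64_t entry, bool idle_bit_reported)
{
	if (pm_swapped(entry))
		return true;
	return idle_bit_reported;
}

/* Read the pagemap entry covering vaddr in process pid; 0 on success. */
static int read_pagemap_entry(pid_t pid, uintptr_t vaddr, uint64_t *entry)
{
	char path[64];
	long page_size = sysconf(_SC_PAGESIZE);
	off_t offset;
	ssize_t n;
	int fd;

	assert(entry != NULL);
	if (page_size <= 0)
		return -1;

	snprintf(path, sizeof(path), "/proc/%ld/pagemap", (long)pid);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* One 64-bit entry per virtual page. */
	offset = (off_t)(vaddr / (uintptr_t)page_size) * (off_t)sizeof(*entry);
	n = pread(fd, entry, sizeof(*entry), offset);
	close(fd);
	return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}
```

This is only a guess on the monitor's side: it cannot tell whether the page
was touched between being marked idle and being swapped out.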
Joel Fernandes Aug. 13, 2019, 3:36 p.m. UTC | #2
On Tue, Aug 13, 2019 at 05:04:50PM +0200, Michal Hocko wrote:
> On Wed 07-08-19 13:15:55, Joel Fernandes (Google) wrote:
> > Idle page tracking currently does not work well in the following
> > scenario:
> >  1. mark page-A idle which was present at that time.
> >  2. run workload
> >  3. page-A is not touched by workload
> >  4. *sudden* memory pressure happens, so page A is finally swapped out
> >  5. now see the page A - it appears as if it was accessed (pte unmapped
> >     so idle bit not set in output) - but it's incorrect.
> > 
> > To fix this, we store the idle information into a new idle bit of the
> > swap PTE during swapping of anonymous pages.
> >
> > Also in the future, madvise extensions will allow a system process
> > manager (like Android's ActivityManager) to swap pages out of a process
> > that it knows will be cold. To an external process like a heap profiler
> > that is doing idle tracking on another process, this procedure will
> > interfere with the idle page tracking similar to the above steps.
> 
> This could be solved by checking the !present/swapped out pages
> right? Whoever decided to put the page out to the swap just made it
> idle effectively.  So the monitor can make some educated guess for
> tracking. If that is fundamentally not possible then please describe
> why.

But the monitoring process (profiler) does not have control over the 'whoever
made it effectively idle' process.

As you said it will be a guess, it will not be accurate.

I am curious what is your concern with using a bit in the swap PTE?

(Adding Konstantin as well since he may be interested in this, since he also
suggested this idea).

thanks,

 - Joel
Konstantin Khlebnikov Aug. 13, 2019, 7:24 p.m. UTC | #3
On Tue, Aug 13, 2019 at 6:37 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Tue, Aug 13, 2019 at 05:04:50PM +0200, Michal Hocko wrote:
> > On Wed 07-08-19 13:15:55, Joel Fernandes (Google) wrote:
> > > Idle page tracking currently does not work well in the following
> > > scenario:
> > >  1. mark page-A idle which was present at that time.
> > >  2. run workload
> > >  3. page-A is not touched by workload
> > >  4. *sudden* memory pressure happens, so page A is finally swapped out
> > >  5. now see the page A - it appears as if it was accessed (pte unmapped
> > >     so idle bit not set in output) - but it's incorrect.
> > >
> > > To fix this, we store the idle information into a new idle bit of the
> > > swap PTE during swapping of anonymous pages.
> > >
> > > Also in the future, madvise extensions will allow a system process
> > > manager (like Android's ActivityManager) to swap pages out of a process
> > > that it knows will be cold. To an external process like a heap profiler
> > > that is doing idle tracking on another process, this procedure will
> > > interfere with the idle page tracking similar to the above steps.
> >
> > This could be solved by checking the !present/swapped out pages
> > right? Whoever decided to put the page out to the swap just made it
> > idle effectively.  So the monitor can make some educated guess for
> > tracking. If that is fundamentally not possible then please describe
> > why.
>
> But the monitoring process (profiler) does not have control over the 'whoever
> made it effectively idle' process.
>
> As you said it will be a guess, it will not be accurate.

Yep. Without saving the idle bit in the swap entry (and presuming that all
swapped pages are idle), the profiler could miss an access. This patch adds
accurate tracking almost for free. After that, the profiler could work at any
pace without races.

>
> I am curious what is your concern with using a bit in the swap PTE?
>
> (Adding Konstantin as well since he may be interested in this, since he also
> suggested this idea).
>
> thanks,
>
>  - Joel
>
>
Michal Hocko Aug. 14, 2019, 8:05 a.m. UTC | #4
On Tue 13-08-19 11:36:59, Joel Fernandes wrote:
> On Tue, Aug 13, 2019 at 05:04:50PM +0200, Michal Hocko wrote:
> > On Wed 07-08-19 13:15:55, Joel Fernandes (Google) wrote:
> > > Idle page tracking currently does not work well in the following
> > > scenario:
> > >  1. mark page-A idle which was present at that time.
> > >  2. run workload
> > >  3. page-A is not touched by workload
> > >  4. *sudden* memory pressure happens, so page A is finally swapped out
> > >  5. now see the page A - it appears as if it was accessed (pte unmapped
> > >     so idle bit not set in output) - but it's incorrect.
> > > 
> > > To fix this, we store the idle information into a new idle bit of the
> > > swap PTE during swapping of anonymous pages.
> > >
> > > Also in the future, madvise extensions will allow a system process
> > > manager (like Android's ActivityManager) to swap pages out of a process
> > > that it knows will be cold. To an external process like a heap profiler
> > > that is doing idle tracking on another process, this procedure will
> > > interfere with the idle page tracking similar to the above steps.
> > 
> > This could be solved by checking the !present/swapped out pages
> > right? Whoever decided to put the page out to the swap just made it
> > idle effectively.  So the monitor can make some educated guess for
> > tracking. If that is fundamentally not possible then please describe
> > why.
> 
> But the monitoring process (profiler) does not have control over the 'whoever
> made it effectively idle' process.

Why does that matter? Whether it is a global/memcg reclaim or somebody
calling MADV_PAGEOUT or whatever it is a decision to make the page not
hot. Sure you could argue that a missing idle bit on swap entries might
mean that the swap out decision was pre-mature/sub-optimal/wrong but is
this the aim of the interface?

> As you said it will be a guess, it will not be accurate.

Yes and the point I am trying to make is that having some space and not
giving a guarantee sounds like a safer option for this interface because
...
> 
> I am curious what is your concern with using a bit in the swap PTE?

... It is a promise of semantics that I find limiting for the future. The bit
in the pte might turn out to be insufficient (e.g. pte reclaim), so teaching
userspace to consider this a hard guarantee is a ticket to problems
later on. Maybe I am overly paranoid because I have seen so many "nice
to have" features turn into a maintenance burden in the past.

If this is really considered a mostly debugging-purpose interface, then a
certain level of imprecision should be tolerable. If there is a
really strong real-world usecase that simply has no other way to go,
then this might be added later. Adding information is always safer
than taking it away.

That being said, if I am a minority voice here then I will not really
stand in the way and won't nack the patch. I will not ack it either,
though.
Joel Fernandes Aug. 14, 2019, 4:32 p.m. UTC | #5
On Wed, Aug 14, 2019 at 10:05:31AM +0200, Michal Hocko wrote:
> On Tue 13-08-19 11:36:59, Joel Fernandes wrote:
> > On Tue, Aug 13, 2019 at 05:04:50PM +0200, Michal Hocko wrote:
> > > On Wed 07-08-19 13:15:55, Joel Fernandes (Google) wrote:
> > > > Idle page tracking currently does not work well in the following
> > > > scenario:
> > > >  1. mark page-A idle which was present at that time.
> > > >  2. run workload
> > > >  3. page-A is not touched by workload
> > > >  4. *sudden* memory pressure happens, so page A is finally swapped out
> > > >  5. now see the page A - it appears as if it was accessed (pte unmapped
> > > >     so idle bit not set in output) - but it's incorrect.
> > > > 
> > > > To fix this, we store the idle information into a new idle bit of the
> > > > swap PTE during swapping of anonymous pages.
> > > >
> > > > Also in the future, madvise extensions will allow a system process
> > > > manager (like Android's ActivityManager) to swap pages out of a process
> > > > that it knows will be cold. To an external process like a heap profiler
> > > > that is doing idle tracking on another process, this procedure will
> > > > interfere with the idle page tracking similar to the above steps.
> > > 
> > > This could be solved by checking the !present/swapped out pages
> > > right? Whoever decided to put the page out to the swap just made it
> > > idle effectively.  So the monitor can make some educated guess for
> > > tracking. If that is fundamentally not possible then please describe
> > > why.
> > 
> > But the monitoring process (profiler) does not have control over the 'whoever
> > made it effectively idle' process.
> 
> Why does that matter? Whether it is a global/memcg reclaim or somebody
> calling MADV_PAGEOUT or whatever it is a decision to make the page not
> hot. Sure you could argue that a missing idle bit on swap entries might
> mean that the swap out decision was pre-mature/sub-optimal/wrong but is
> this the aim of the interface?
> 
> > As you said it will be a guess, it will not be accurate.
> 
> Yes and the point I am trying to make is that having some space and not
> giving a guarantee sounds like a safer option for this interface because

I do see your point of view, but just because a future (and possibly never
happening) usecase, which you mentioned as pte reclaim, makes you feel that
userspace may be subject to inaccuracies anyway, that doesn't mean we should
make everything inaccurate. We already know idle page tracking is not
completely accurate. But that doesn't mean we should miss out on the
opportunity to make the "non pte-reclaim" usecase accurate.

IMO, we should do our best for today, and not hypothesize. How likely is pte
reclaim and is there a thread to describe that direction?

> > I am curious what is your concern with using a bit in the swap PTE?
> 
> ... It is a promise of semantics that I find limiting for the future. The bit
> in the pte might turn out to be insufficient (e.g. pte reclaim), so teaching
> userspace to consider this a hard guarantee is a ticket to problems
> later on. Maybe I am overly paranoid because I have seen so many "nice
> to have" features turn into a maintenance burden in the past.
> 
> If this is really considered a mostly debugging-purpose interface, then a
> certain level of imprecision should be tolerable. If there is a
> really strong real-world usecase that simply has no other way to go,
> then this might be added later. Adding information is always safer
> than taking it away.
> 
> That being said, if I am a minority voice here then I will not really
> stand in the way and won't nack the patch. I will not ack it either,
> though.

Ok.

thanks,

 - Joel
Michal Hocko Aug. 14, 2019, 6:36 p.m. UTC | #6
On Wed 14-08-19 12:32:03, Joel Fernandes wrote:
> On Wed, Aug 14, 2019 at 10:05:31AM +0200, Michal Hocko wrote:
> > On Tue 13-08-19 11:36:59, Joel Fernandes wrote:
> > > On Tue, Aug 13, 2019 at 05:04:50PM +0200, Michal Hocko wrote:
> > > > On Wed 07-08-19 13:15:55, Joel Fernandes (Google) wrote:
> > > > > Idle page tracking currently does not work well in the following
> > > > > scenario:
> > > > >  1. mark page-A idle which was present at that time.
> > > > >  2. run workload
> > > > >  3. page-A is not touched by workload
> > > > >  4. *sudden* memory pressure happens, so page A is finally swapped out
> > > > >  5. now see the page A - it appears as if it was accessed (pte unmapped
> > > > >     so idle bit not set in output) - but it's incorrect.
> > > > > 
> > > > > To fix this, we store the idle information into a new idle bit of the
> > > > > swap PTE during swapping of anonymous pages.
> > > > >
> > > > > Also in the future, madvise extensions will allow a system process
> > > > > manager (like Android's ActivityManager) to swap pages out of a process
> > > > > that it knows will be cold. To an external process like a heap profiler
> > > > > that is doing idle tracking on another process, this procedure will
> > > > > interfere with the idle page tracking similar to the above steps.
> > > > 
> > > > This could be solved by checking the !present/swapped out pages
> > > > right? Whoever decided to put the page out to the swap just made it
> > > > idle effectively.  So the monitor can make some educated guess for
> > > > tracking. If that is fundamentally not possible then please describe
> > > > why.
> > > 
> > > But the monitoring process (profiler) does not have control over the 'whoever
> > > made it effectively idle' process.
> > 
> > Why does that matter? Whether it is a global/memcg reclaim or somebody
> > calling MADV_PAGEOUT or whatever it is a decision to make the page not
> > hot. Sure you could argue that a missing idle bit on swap entries might
> > mean that the swap out decision was pre-mature/sub-optimal/wrong but is
> > this the aim of the interface?
> > 
> > > As you said it will be a guess, it will not be accurate.
> > 
> > Yes and the point I am trying to make is that having some space and not
> > giving a guarantee sounds like a safer option for this interface because
> 
> I do see your point of view, but just because a future (and possibly never
> happening) usecase, which you mentioned as pte reclaim, makes you feel that
> userspace may be subject to inaccuracies anyway, that doesn't mean we should
> make everything inaccurate. We already know idle page tracking is not
> completely accurate. But that doesn't mean we should miss out on the
> opportunity to make the "non pte-reclaim" usecase accurate.

Just keep in mind that you will add more burden to future features
because they would have to somehow overcome this user visible behavior
and we will get to the usual question - Is this going to break
something that relies on the idle bit being stable?

> IMO, we should do our best for today, and not hypothesize. How likely is pte
> reclaim and is there a thread to describe that direction?

Not that I am aware of now but with large NVDIMM mapped files I can see
that this will get more and more interesting.

Patch

diff --git a/arch/Kconfig b/arch/Kconfig
index a7b57dd42c26..3aa121ce824e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -575,6 +575,9 @@  config ARCH_WANT_HUGE_PMD_SHARE
 config HAVE_ARCH_SOFT_DIRTY
 	bool
 
+config HAVE_ARCH_PTE_SWP_PGIDLE
+	bool
+
 config HAVE_MOD_ARCH_SPECIFIC
 	bool
 	help
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 75d9d68a6de7..6d51d0a355a7 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -712,6 +712,12 @@  static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
 #define arch_start_context_switch(prev)	do {} while (0)
 #endif
 
+#ifndef CONFIG_HAVE_ARCH_PTE_SWP_PGIDLE
+static inline pte_t pte_swp_mkpage_idle(pte_t pte) { return pte; }
+static inline int pte_swp_page_idle(pte_t pte) { return 0; }
+static inline pte_t pte_swp_clear_mkpage_idle(pte_t pte) { return pte; }
+#endif
+
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 #ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION
 static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 9de4f4c67a8c..2766d4ab348c 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -276,7 +276,7 @@  struct page_idle_proc_priv {
 };
 
 /*
- * Add page to list to be set as idle later.
+ * Set a page as idle or add it to a list to be set as idle later.
  */
 static void pte_page_idle_proc_add(struct page *page,
 			       unsigned long addr, struct mm_walk *walk)
@@ -303,6 +303,13 @@  static void pte_page_idle_proc_add(struct page *page,
 		page_get = page_idle_get_page(page);
 		if (!page_get)
 			return;
+	} else {
+		/* For swapped pages, set output bit as idle */
+		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
+		bit = frames % BITMAP_CHUNK_BITS;
+		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
+		*chunk |= (1ULL << bit);
+		return;
 	}
 
 	/*
@@ -323,6 +330,7 @@  static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl;
 	struct page *page;
 	struct vm_area_struct *vma = walk->vma;
+	struct page_idle_proc_priv *priv = walk->private;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
@@ -341,6 +349,19 @@  static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		/* For swap_pte handling, we use an idle bit in the swap pte. */
+		if (is_swap_pte(*pte)) {
+			if (priv->write) {
+				set_pte_at(walk->mm, addr, pte,
+					   pte_swp_mkpage_idle(*pte));
+			} else {
+				/* If swap pte has idle bit set, report it as idle */
+				if (pte_swp_page_idle(*pte))
+					pte_page_idle_proc_add(NULL, addr, walk);
+			}
+			continue;
+		}
+
 		if (!pte_present(*pte))
 			continue;
 
@@ -432,7 +453,8 @@  ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
 				set_page_idle(page);
 			}
 		} else {
-			if (page_idle_pte_check(page)) {
+			/* If page is NULL, it was swapped out */
+			if (!page || page_idle_pte_check(page)) {
 				off = ((cur->addr) >> PAGE_SHIFT) - start_frame;
 				bit = off % BITMAP_CHUNK_BITS;
 				index = off / BITMAP_CHUNK_BITS;
diff --git a/mm/rmap.c b/mm/rmap.c
index e5dfe2ae6b0d..4bd618aab402 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1629,6 +1629,8 @@  static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (page_is_idle(page))
+				swp_pte = pte_swp_mkpage_idle(swp_pte);
 			set_pte_at(mm, address, pvmw.pte, swp_pte);
 			/* Invalidate as we cleared the pte */
 			mmu_notifier_invalidate_range(mm, address,