diff mbox series

[v1,1/4] mm: introduce MADV_COLD

Message ID 20190603053655.127730-2-minchan@kernel.org (mailing list archive)
State New, archived
Headers show
Series Introduce MADV_COLD and MADV_PAGEOUT | expand

Commit Message

Minchan Kim June 3, 2019, 5:36 a.m. UTC
When a process expects no accesses to a certain memory range, it could
give a hint to kernel that the pages can be reclaimed when memory pressure
happens but data should be preserved for future use.  This could reduce
workingset eviction so it ends up increasing performance.

This patch introduces the new MADV_COLD hint to madvise(2) syscall.
MADV_COLD can be used by a process to mark a memory range as not expected
to be used in the near future. The hint can help kernel in deciding which
pages to evict early during memory pressure.

It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves

	active file page -> inactive file LRU
	active anon page -> inacdtive anon LRU

Unlike MADV_FREE, it doesn't move active anonymous pages to inactive
files's head because MADV_COLD is a little bit different symantic.
MADV_FREE means it's okay to discard when the memory pressure because
the content of the page is *garbage* so freeing such pages is almost zero
overhead since we don't need to swap out and access afterward causes just
minor fault. Thus, it would make sense to put those freeable pages in
inactive file LRU to compete other used-once pages. Even, it could
give a bonus to make them be reclaimed on swapless system. However,
MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in
in the end. So it's better to move inactive anon's LRU list, not file LRU.
Furthermore, it would help to avoid unnecessary scanning of cold anonymous
if system doesn't have a swap device.

All of error rule is same with MADV_DONTNEED.

Note:
This hint works with only private pages(IOW, page_mapcount(page) < 2)
because shared page could have more chance to be accessed from other
processes sharing the page although the caller reset the reference bits.
It ends up preventing the reclaim of the page and wastes CPU cycle.

* RFCv2
 * add more description - mhocko

* RFCv1
 * renaming from MADV_COOL to MADV_COLD - hannes

* internal review
 * use clear_page_youn in deactivate_page - joelaf
 * Revise the description - surenb
 * Renaming from MADV_WARM to MADV_COOL - surenb

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/page-flags.h             |   1 +
 include/linux/page_idle.h              |  15 ++++
 include/linux/swap.h                   |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/internal.h                          |   2 +-
 mm/madvise.c                           | 115 ++++++++++++++++++++++++-
 mm/oom_kill.c                          |   2 +-
 mm/swap.c                              |  43 +++++++++
 8 files changed, 176 insertions(+), 4 deletions(-)

Comments

Joel Fernandes June 4, 2019, 8:38 p.m. UTC | #1
On Mon, Jun 03, 2019 at 02:36:52PM +0900, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range, it could
> give a hint to kernel that the pages can be reclaimed when memory pressure
> happens but data should be preserved for future use.  This could reduce
> workingset eviction so it ends up increasing performance.
> 
> This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> MADV_COLD can be used by a process to mark a memory range as not expected
> to be used in the near future. The hint can help kernel in deciding which
> pages to evict early during memory pressure.
> 
> It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
> 
> 	active file page -> inactive file LRU
> 	active anon page -> inacdtive anon LRU
> 
> Unlike MADV_FREE, it doesn't move active anonymous pages to inactive
> files's head because MADV_COLD is a little bit different symantic.
> MADV_FREE means it's okay to discard when the memory pressure because
> the content of the page is *garbage* so freeing such pages is almost zero
> overhead since we don't need to swap out and access afterward causes just
> minor fault. Thus, it would make sense to put those freeable pages in
> inactive file LRU to compete other used-once pages. Even, it could
> give a bonus to make them be reclaimed on swapless system. However,
> MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in
> in the end. So it's better to move inactive anon's LRU list, not file LRU.
> Furthermore, it would help to avoid unnecessary scanning of cold anonymous
> if system doesn't have a swap device.
> 
> All of error rule is same with MADV_DONTNEED.
> 
> Note:
> This hint works with only private pages(IOW, page_mapcount(page) < 2)
> because shared page could have more chance to be accessed from other
> processes sharing the page although the caller reset the reference bits.
> It ends up preventing the reclaim of the page and wastes CPU cycle.
> 
> * RFCv2
>  * add more description - mhocko
> 
> * RFCv1
>  * renaming from MADV_COOL to MADV_COLD - hannes
> 
> * internal review
>  * use clear_page_youn in deactivate_page - joelaf
>  * Revise the description - surenb
>  * Renaming from MADV_WARM to MADV_COOL - surenb
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/linux/page-flags.h             |   1 +
>  include/linux/page_idle.h              |  15 ++++
>  include/linux/swap.h                   |   1 +
>  include/uapi/asm-generic/mman-common.h |   1 +
>  mm/internal.h                          |   2 +-
>  mm/madvise.c                           | 115 ++++++++++++++++++++++++-
>  mm/oom_kill.c                          |   2 +-
>  mm/swap.c                              |  43 +++++++++
>  8 files changed, 176 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 9f8712a4b1a5..58b06654c8dd 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
>  TESTPAGEFLAG(Young, young, PF_ANY)
>  SETPAGEFLAG(Young, young, PF_ANY)
>  TESTCLEARFLAG(Young, young, PF_ANY)
> +CLEARPAGEFLAG(Young, young, PF_ANY)
>  PAGEFLAG(Idle, idle, PF_ANY)
>  #endif
>  
> diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> index 1e894d34bdce..f3f43b317150 100644
> --- a/include/linux/page_idle.h
> +++ b/include/linux/page_idle.h
> @@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
>  	SetPageYoung(page);
>  }
>  
> +static inline void clear_page_young(struct page *page)
> +{
> +	ClearPageYoung(page);
> +}
> +
>  static inline bool test_and_clear_page_young(struct page *page)
>  {
>  	return TestClearPageYoung(page);
> @@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
>  	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
>  }
>  
> +static void clear_page_young(struct page *page)
> +{
> +	struct page_ext *page_ext = lookup_page_ext(page);
> +
> +	if (unlikely(!page_ext))
> +		return;
> +
> +	clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> +}
> +
>  static inline bool test_and_clear_page_young(struct page *page)
>  {
>  	struct page_ext *page_ext = lookup_page_ext(page);
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index de2c67a33b7e..0ce997edb8bb 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
>  extern void lru_add_drain_all(void);
>  extern void rotate_reclaimable_page(struct page *page);
>  extern void deactivate_file_page(struct page *page);
> +extern void deactivate_page(struct page *page);
>  extern void mark_page_lazyfree(struct page *page);
>  extern void swap_setup(void);
>  
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index bea0278f65ab..1190f4e7f7b9 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -43,6 +43,7 @@
>  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
>  #define MADV_WILLNEED	3		/* will need these pages */
>  #define MADV_DONTNEED	4		/* don't need these pages */
> +#define MADV_COLD	5		/* deactivatie these pages */
>  
>  /* common parameters: try to keep these consistent across architectures */
>  #define MADV_FREE	8		/* free pages only if memory pressure */
> diff --git a/mm/internal.h b/mm/internal.h
> index 9eeaf2b95166..75a4f96ec0fb 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -43,7 +43,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
>  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>  		unsigned long floor, unsigned long ceiling);
>  
> -static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
> +static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
>  {
>  	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
>  }
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 628022e674a7..ab158766858a 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
>  	case MADV_REMOVE:
>  	case MADV_WILLNEED:
>  	case MADV_DONTNEED:
> +	case MADV_COLD:
>  	case MADV_FREE:
>  		return 0;
>  	default:
> @@ -307,6 +308,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
>  	return 0;
>  }
>  
> +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> +				unsigned long end, struct mm_walk *walk)
> +{
> +	pte_t *orig_pte, *pte, ptent;
> +	spinlock_t *ptl;
> +	struct page *page;
> +	struct vm_area_struct *vma = walk->vma;
> +	unsigned long next;
> +
> +	next = pmd_addr_end(addr, end);
> +	if (pmd_trans_huge(*pmd)) {
> +		ptl = pmd_trans_huge_lock(pmd, vma);
> +		if (!ptl)
> +			return 0;
> +
> +		if (is_huge_zero_pmd(*pmd))
> +			goto huge_unlock;
> +
> +		page = pmd_page(*pmd);
> +		if (page_mapcount(page) > 1)
> +			goto huge_unlock;
> +
> +		if (next - addr != HPAGE_PMD_SIZE) {
> +			int err;
> +
> +			get_page(page);
> +			spin_unlock(ptl);
> +			lock_page(page);
> +			err = split_huge_page(page);
> +			unlock_page(page);
> +			put_page(page);
> +			if (!err)
> +				goto regular_page;
> +			return 0;
> +		}
> +
> +		pmdp_test_and_clear_young(vma, addr, pmd);
> +		deactivate_page(page);
> +huge_unlock:
> +		spin_unlock(ptl);
> +		return 0;
> +	}
> +
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +
> +regular_page:
> +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> +		ptent = *pte;
> +
> +		if (pte_none(ptent))
> +			continue;
> +
> +		if (!pte_present(ptent))
> +			continue;
> +
> +		page = vm_normal_page(vma, addr, ptent);
> +		if (!page)
> +			continue;
> +
> +		if (page_mapcount(page) > 1)
> +			continue;
> +
> +		ptep_test_and_clear_young(vma, addr, pte);

Wondering here how it interacts with idle page tracking. Here since young
flag is cleared by the cold hint, page_referenced_one() or
page_idle_clear_pte_refs_one() will not be able to clear the page-idle flag
if it was previously set since it does not know any more that a page was
actively referenced.

Should you also be clearing the page-idle flag here if the ptep young/accessed
bit was previously set, just so that page-idle tracking works smoothly when
this hint is concurrently applied?

> +		deactivate_page(page);
> +	}
> +
> +	pte_unmap_unlock(orig_pte, ptl);
> +	cond_resched();
> +
> +	return 0;
> +}
> +
> +static void madvise_cold_page_range(struct mmu_gather *tlb,
> +			     struct vm_area_struct *vma,
> +			     unsigned long addr, unsigned long end)
> +{
> +	struct mm_walk cool_walk = {
> +		.pmd_entry = madvise_cold_pte_range,
> +		.mm = vma->vm_mm,
> +	};
> +
> +	tlb_start_vma(tlb, vma);
> +	walk_page_range(addr, end, &cool_walk);
> +	tlb_end_vma(tlb, vma);
> +}
> +
> +static long madvise_cold(struct vm_area_struct *vma,
> +			struct vm_area_struct **prev,
> +			unsigned long start_addr, unsigned long end_addr)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct mmu_gather tlb;
> +
> +	*prev = vma;
> +	if (!can_madv_lru_vma(vma))
> +		return -EINVAL;
> +
> +	lru_add_drain();
> +	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> +	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
> +	tlb_finish_mmu(&tlb, start_addr, end_addr);
> +
> +	return 0;
> +}
> +
>  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  				unsigned long end, struct mm_walk *walk)
>  
> @@ -519,7 +627,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>  				  int behavior)
>  {
>  	*prev = vma;
> -	if (!can_madv_dontneed_vma(vma))
> +	if (!can_madv_lru_vma(vma))
>  		return -EINVAL;
>  
>  	if (!userfaultfd_remove(vma, start, end)) {
> @@ -541,7 +649,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>  			 */
>  			return -ENOMEM;
>  		}
> -		if (!can_madv_dontneed_vma(vma))
> +		if (!can_madv_lru_vma(vma))
>  			return -EINVAL;
>  		if (end > vma->vm_end) {
>  			/*
> @@ -695,6 +803,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  		return madvise_remove(vma, prev, start, end);
>  	case MADV_WILLNEED:
>  		return madvise_willneed(vma, prev, start, end);
> +	case MADV_COLD:
> +		return madvise_cold(vma, prev, start, end);
>  	case MADV_FREE:
>  	case MADV_DONTNEED:
>  		return madvise_dontneed_free(vma, prev, start, end, behavior);
> @@ -716,6 +826,7 @@ madvise_behavior_valid(int behavior)
>  	case MADV_WILLNEED:
>  	case MADV_DONTNEED:
>  	case MADV_FREE:
> +	case MADV_COLD:
>  #ifdef CONFIG_KSM
>  	case MADV_MERGEABLE:
>  	case MADV_UNMERGEABLE:
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5a58778c91d4..f73d5f5145f0 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -515,7 +515,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
>  	set_bit(MMF_UNSTABLE, &mm->flags);
>  
>  	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> -		if (!can_madv_dontneed_vma(vma))
> +		if (!can_madv_lru_vma(vma))
>  			continue;
>  
>  		/*
> diff --git a/mm/swap.c b/mm/swap.c
> index 7b079976cbec..cebedab15aa2 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -47,6 +47,7 @@ int page_cluster;
>  static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
>  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
>  static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
> +static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
>  static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
>  #ifdef CONFIG_SMP
>  static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
> @@ -538,6 +539,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
>  	update_page_reclaim_stat(lruvec, file, 0);
>  }
>  
> +static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
> +			    void *arg)
> +{
> +	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> +		int file = page_is_file_cache(page);
> +		int lru = page_lru_base_type(page);
> +
> +		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
> +		ClearPageActive(page);
> +		ClearPageReferenced(page);
> +		clear_page_young(page);

I was curious, how does the above interact with clear_refs?

It appears that a range that is marked as cold will appear to be unreferenced
(since referenced flag is cleared) even though it may have been referenced in
the past. Very least, I believe this behavior should be documented somewhere.

I believe it is best for us to make the clear_refs technique behave similar
to the idle-page tracking for the purposes of figuring out which pages are
referenced, so that clear_refs is more immune to this hint.

thanks,

 - Joel
Minchan Kim June 10, 2019, 10:09 a.m. UTC | #2
Hi Joel,

On Tue, Jun 04, 2019 at 04:38:41PM -0400, Joel Fernandes wrote:
> On Mon, Jun 03, 2019 at 02:36:52PM +0900, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range, it could
> > give a hint to kernel that the pages can be reclaimed when memory pressure
> > happens but data should be preserved for future use.  This could reduce
> > workingset eviction so it ends up increasing performance.
> > 
> > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > MADV_COLD can be used by a process to mark a memory range as not expected
> > to be used in the near future. The hint can help kernel in deciding which
> > pages to evict early during memory pressure.
> > 
> > It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
> > 
> > 	active file page -> inactive file LRU
> > 	active anon page -> inacdtive anon LRU
> > 
> > Unlike MADV_FREE, it doesn't move active anonymous pages to inactive
> > files's head because MADV_COLD is a little bit different symantic.
> > MADV_FREE means it's okay to discard when the memory pressure because
> > the content of the page is *garbage* so freeing such pages is almost zero
> > overhead since we don't need to swap out and access afterward causes just
> > minor fault. Thus, it would make sense to put those freeable pages in
> > inactive file LRU to compete other used-once pages. Even, it could
> > give a bonus to make them be reclaimed on swapless system. However,
> > MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in
> > in the end. So it's better to move inactive anon's LRU list, not file LRU.
> > Furthermore, it would help to avoid unnecessary scanning of cold anonymous
> > if system doesn't have a swap device.
> > 
> > All of error rule is same with MADV_DONTNEED.
> > 
> > Note:
> > This hint works with only private pages(IOW, page_mapcount(page) < 2)
> > because shared page could have more chance to be accessed from other
> > processes sharing the page although the caller reset the reference bits.
> > It ends up preventing the reclaim of the page and wastes CPU cycle.
> > 
> > * RFCv2
> >  * add more description - mhocko
> > 
> > * RFCv1
> >  * renaming from MADV_COOL to MADV_COLD - hannes
> > 
> > * internal review
> >  * use clear_page_youn in deactivate_page - joelaf
> >  * Revise the description - surenb
> >  * Renaming from MADV_WARM to MADV_COOL - surenb
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/linux/page-flags.h             |   1 +
> >  include/linux/page_idle.h              |  15 ++++
> >  include/linux/swap.h                   |   1 +
> >  include/uapi/asm-generic/mman-common.h |   1 +
> >  mm/internal.h                          |   2 +-
> >  mm/madvise.c                           | 115 ++++++++++++++++++++++++-
> >  mm/oom_kill.c                          |   2 +-
> >  mm/swap.c                              |  43 +++++++++
> >  8 files changed, 176 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 9f8712a4b1a5..58b06654c8dd 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
> >  TESTPAGEFLAG(Young, young, PF_ANY)
> >  SETPAGEFLAG(Young, young, PF_ANY)
> >  TESTCLEARFLAG(Young, young, PF_ANY)
> > +CLEARPAGEFLAG(Young, young, PF_ANY)
> >  PAGEFLAG(Idle, idle, PF_ANY)
> >  #endif
> >  
> > diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> > index 1e894d34bdce..f3f43b317150 100644
> > --- a/include/linux/page_idle.h
> > +++ b/include/linux/page_idle.h
> > @@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
> >  	SetPageYoung(page);
> >  }
> >  
> > +static inline void clear_page_young(struct page *page)
> > +{
> > +	ClearPageYoung(page);
> > +}
> > +
> >  static inline bool test_and_clear_page_young(struct page *page)
> >  {
> >  	return TestClearPageYoung(page);
> > @@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
> >  	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> >  }
> >  
> > +static void clear_page_young(struct page *page)
> > +{
> > +	struct page_ext *page_ext = lookup_page_ext(page);
> > +
> > +	if (unlikely(!page_ext))
> > +		return;
> > +
> > +	clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> > +}
> > +
> >  static inline bool test_and_clear_page_young(struct page *page)
> >  {
> >  	struct page_ext *page_ext = lookup_page_ext(page);
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index de2c67a33b7e..0ce997edb8bb 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
> >  extern void lru_add_drain_all(void);
> >  extern void rotate_reclaimable_page(struct page *page);
> >  extern void deactivate_file_page(struct page *page);
> > +extern void deactivate_page(struct page *page);
> >  extern void mark_page_lazyfree(struct page *page);
> >  extern void swap_setup(void);
> >  
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index bea0278f65ab..1190f4e7f7b9 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -43,6 +43,7 @@
> >  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
> >  #define MADV_WILLNEED	3		/* will need these pages */
> >  #define MADV_DONTNEED	4		/* don't need these pages */
> > +#define MADV_COLD	5		/* deactivatie these pages */
> >  
> >  /* common parameters: try to keep these consistent across architectures */
> >  #define MADV_FREE	8		/* free pages only if memory pressure */
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 9eeaf2b95166..75a4f96ec0fb 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -43,7 +43,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
> >  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> >  		unsigned long floor, unsigned long ceiling);
> >  
> > -static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
> > +static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
> >  {
> >  	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
> >  }
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 628022e674a7..ab158766858a 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
> >  	case MADV_REMOVE:
> >  	case MADV_WILLNEED:
> >  	case MADV_DONTNEED:
> > +	case MADV_COLD:
> >  	case MADV_FREE:
> >  		return 0;
> >  	default:
> > @@ -307,6 +308,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
> >  	return 0;
> >  }
> >  
> > +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> > +				unsigned long end, struct mm_walk *walk)
> > +{
> > +	pte_t *orig_pte, *pte, ptent;
> > +	spinlock_t *ptl;
> > +	struct page *page;
> > +	struct vm_area_struct *vma = walk->vma;
> > +	unsigned long next;
> > +
> > +	next = pmd_addr_end(addr, end);
> > +	if (pmd_trans_huge(*pmd)) {
> > +		ptl = pmd_trans_huge_lock(pmd, vma);
> > +		if (!ptl)
> > +			return 0;
> > +
> > +		if (is_huge_zero_pmd(*pmd))
> > +			goto huge_unlock;
> > +
> > +		page = pmd_page(*pmd);
> > +		if (page_mapcount(page) > 1)
> > +			goto huge_unlock;
> > +
> > +		if (next - addr != HPAGE_PMD_SIZE) {
> > +			int err;
> > +
> > +			get_page(page);
> > +			spin_unlock(ptl);
> > +			lock_page(page);
> > +			err = split_huge_page(page);
> > +			unlock_page(page);
> > +			put_page(page);
> > +			if (!err)
> > +				goto regular_page;
> > +			return 0;
> > +		}
> > +
> > +		pmdp_test_and_clear_young(vma, addr, pmd);
> > +		deactivate_page(page);
> > +huge_unlock:
> > +		spin_unlock(ptl);
> > +		return 0;
> > +	}
> > +
> > +	if (pmd_trans_unstable(pmd))
> > +		return 0;
> > +
> > +regular_page:
> > +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> > +		ptent = *pte;
> > +
> > +		if (pte_none(ptent))
> > +			continue;
> > +
> > +		if (!pte_present(ptent))
> > +			continue;
> > +
> > +		page = vm_normal_page(vma, addr, ptent);
> > +		if (!page)
> > +			continue;
> > +
> > +		if (page_mapcount(page) > 1)
> > +			continue;
> > +
> > +		ptep_test_and_clear_young(vma, addr, pte);
> 
> Wondering here how it interacts with idle page tracking. Here since young
> flag is cleared by the cold hint, page_referenced_one() or
> page_idle_clear_pte_refs_one() will not be able to clear the page-idle flag
> if it was previously set since it does not know any more that a page was
> actively referenced.

ptep_test_and_clear_young doesn't change PG_idle/young so idle page tracking
doesn't affect.

> 
> Should you also be clearing the page-idle flag here if the ptep young/accessed
> bit was previously set, just so that page-idle tracking works smoothly when
> this hint is concurrently applied?

deactivate_page will remove PG_young bit so that the page will be reclaimed.
Do I miss your point?

> 
> > +		deactivate_page(page);
> > +	}
> > +
> > +	pte_unmap_unlock(orig_pte, ptl);
> > +	cond_resched();
> > +
> > +	return 0;
> > +}
> > +
> > +static void madvise_cold_page_range(struct mmu_gather *tlb,
> > +			     struct vm_area_struct *vma,
> > +			     unsigned long addr, unsigned long end)
> > +{
> > +	struct mm_walk cool_walk = {
> > +		.pmd_entry = madvise_cold_pte_range,
> > +		.mm = vma->vm_mm,
> > +	};
> > +
> > +	tlb_start_vma(tlb, vma);
> > +	walk_page_range(addr, end, &cool_walk);
> > +	tlb_end_vma(tlb, vma);
> > +}
> > +
> > +static long madvise_cold(struct vm_area_struct *vma,
> > +			struct vm_area_struct **prev,
> > +			unsigned long start_addr, unsigned long end_addr)
> > +{
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	struct mmu_gather tlb;
> > +
> > +	*prev = vma;
> > +	if (!can_madv_lru_vma(vma))
> > +		return -EINVAL;
> > +
> > +	lru_add_drain();
> > +	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> > +	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
> > +	tlb_finish_mmu(&tlb, start_addr, end_addr);
> > +
> > +	return 0;
> > +}
> > +
> >  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >  				unsigned long end, struct mm_walk *walk)
> >  
> > @@ -519,7 +627,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> >  				  int behavior)
> >  {
> >  	*prev = vma;
> > -	if (!can_madv_dontneed_vma(vma))
> > +	if (!can_madv_lru_vma(vma))
> >  		return -EINVAL;
> >  
> >  	if (!userfaultfd_remove(vma, start, end)) {
> > @@ -541,7 +649,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> >  			 */
> >  			return -ENOMEM;
> >  		}
> > -		if (!can_madv_dontneed_vma(vma))
> > +		if (!can_madv_lru_vma(vma))
> >  			return -EINVAL;
> >  		if (end > vma->vm_end) {
> >  			/*
> > @@ -695,6 +803,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >  		return madvise_remove(vma, prev, start, end);
> >  	case MADV_WILLNEED:
> >  		return madvise_willneed(vma, prev, start, end);
> > +	case MADV_COLD:
> > +		return madvise_cold(vma, prev, start, end);
> >  	case MADV_FREE:
> >  	case MADV_DONTNEED:
> >  		return madvise_dontneed_free(vma, prev, start, end, behavior);
> > @@ -716,6 +826,7 @@ madvise_behavior_valid(int behavior)
> >  	case MADV_WILLNEED:
> >  	case MADV_DONTNEED:
> >  	case MADV_FREE:
> > +	case MADV_COLD:
> >  #ifdef CONFIG_KSM
> >  	case MADV_MERGEABLE:
> >  	case MADV_UNMERGEABLE:
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 5a58778c91d4..f73d5f5145f0 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -515,7 +515,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
> >  	set_bit(MMF_UNSTABLE, &mm->flags);
> >  
> >  	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > -		if (!can_madv_dontneed_vma(vma))
> > +		if (!can_madv_lru_vma(vma))
> >  			continue;
> >  
> >  		/*
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 7b079976cbec..cebedab15aa2 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -47,6 +47,7 @@ int page_cluster;
> >  static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
> >  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
> >  static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
> > +static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
> >  static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
> >  #ifdef CONFIG_SMP
> >  static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
> > @@ -538,6 +539,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
> >  	update_page_reclaim_stat(lruvec, file, 0);
> >  }
> >  
> > +static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
> > +			    void *arg)
> > +{
> > +	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> > +		int file = page_is_file_cache(page);
> > +		int lru = page_lru_base_type(page);
> > +
> > +		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
> > +		ClearPageActive(page);
> > +		ClearPageReferenced(page);
> > +		clear_page_young(page);
> 
> I was curious, how does the above interact with clear_refs?

clear_refs couldn't find the access any longer. However, it's the
intention because we couldn't reclaim the page without it.

> 
> It appears that a range that is marked as cold will appear to be unreferenced
> (since referenced flag is cleared) even though it may have been referenced in
> the past. Very least, I believe this behavior should be documented somewhere.

Okay, I will add some comment.

Thanks.
Joel Fernandes June 12, 2019, 5:21 p.m. UTC | #3
On Mon, Jun 10, 2019 at 07:09:04PM +0900, Minchan Kim wrote:
> Hi Joel,
> 
> On Tue, Jun 04, 2019 at 04:38:41PM -0400, Joel Fernandes wrote:
> > On Mon, Jun 03, 2019 at 02:36:52PM +0900, Minchan Kim wrote:
> > > When a process expects no accesses to a certain memory range, it could
> > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > happens but data should be preserved for future use.  This could reduce
> > > workingset eviction so it ends up increasing performance.
> > > 
> > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > to be used in the near future. The hint can help kernel in deciding which
> > > pages to evict early during memory pressure.
> > > 
> > > It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
> > > 
> > > 	active file page -> inactive file LRU
> > > 	active anon page -> inacdtive anon LRU
> > > 
> > > Unlike MADV_FREE, it doesn't move active anonymous pages to inactive
> > > files's head because MADV_COLD is a little bit different symantic.
> > > MADV_FREE means it's okay to discard when the memory pressure because
> > > the content of the page is *garbage* so freeing such pages is almost zero
> > > overhead since we don't need to swap out and access afterward causes just
> > > minor fault. Thus, it would make sense to put those freeable pages in
> > > inactive file LRU to compete other used-once pages. Even, it could
> > > give a bonus to make them be reclaimed on swapless system. However,
> > > MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in
> > > in the end. So it's better to move inactive anon's LRU list, not file LRU.
> > > Furthermore, it would help to avoid unnecessary scanning of cold anonymous
> > > if system doesn't have a swap device.
> > > 
> > > All of error rule is same with MADV_DONTNEED.
> > > 
> > > Note:
> > > This hint works with only private pages(IOW, page_mapcount(page) < 2)
> > > because shared page could have more chance to be accessed from other
> > > processes sharing the page although the caller reset the reference bits.
> > > It ends up preventing the reclaim of the page and wastes CPU cycle.
> > > 
> > > * RFCv2
> > >  * add more description - mhocko
> > > 
> > > * RFCv1
> > >  * renaming from MADV_COOL to MADV_COLD - hannes
> > > 
> > > * internal review
> > >  * use clear_page_youn in deactivate_page - joelaf
> > >  * Revise the description - surenb
> > >  * Renaming from MADV_WARM to MADV_COOL - surenb
> > > 
> > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > ---
> > >  include/linux/page-flags.h             |   1 +
> > >  include/linux/page_idle.h              |  15 ++++
> > >  include/linux/swap.h                   |   1 +
> > >  include/uapi/asm-generic/mman-common.h |   1 +
> > >  mm/internal.h                          |   2 +-
> > >  mm/madvise.c                           | 115 ++++++++++++++++++++++++-
> > >  mm/oom_kill.c                          |   2 +-
> > >  mm/swap.c                              |  43 +++++++++
> > >  8 files changed, 176 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > > index 9f8712a4b1a5..58b06654c8dd 100644
> > > --- a/include/linux/page-flags.h
> > > +++ b/include/linux/page-flags.h
> > > @@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
> > >  TESTPAGEFLAG(Young, young, PF_ANY)
> > >  SETPAGEFLAG(Young, young, PF_ANY)
> > >  TESTCLEARFLAG(Young, young, PF_ANY)
> > > +CLEARPAGEFLAG(Young, young, PF_ANY)
> > >  PAGEFLAG(Idle, idle, PF_ANY)
> > >  #endif
> > >  
> > > diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> > > index 1e894d34bdce..f3f43b317150 100644
> > > --- a/include/linux/page_idle.h
> > > +++ b/include/linux/page_idle.h
> > > @@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
> > >  	SetPageYoung(page);
> > >  }
> > >  
> > > +static inline void clear_page_young(struct page *page)
> > > +{
> > > +	ClearPageYoung(page);
> > > +}
> > > +
> > >  static inline bool test_and_clear_page_young(struct page *page)
> > >  {
> > >  	return TestClearPageYoung(page);
> > > @@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
> > >  	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> > >  }
> > >  
> > > +static void clear_page_young(struct page *page)
> > > +{
> > > +	struct page_ext *page_ext = lookup_page_ext(page);
> > > +
> > > +	if (unlikely(!page_ext))
> > > +		return;
> > > +
> > > +	clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> > > +}
> > > +
> > >  static inline bool test_and_clear_page_young(struct page *page)
> > >  {
> > >  	struct page_ext *page_ext = lookup_page_ext(page);
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index de2c67a33b7e..0ce997edb8bb 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
> > >  extern void lru_add_drain_all(void);
> > >  extern void rotate_reclaimable_page(struct page *page);
> > >  extern void deactivate_file_page(struct page *page);
> > > +extern void deactivate_page(struct page *page);
> > >  extern void mark_page_lazyfree(struct page *page);
> > >  extern void swap_setup(void);
> > >  
> > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > > index bea0278f65ab..1190f4e7f7b9 100644
> > > --- a/include/uapi/asm-generic/mman-common.h
> > > +++ b/include/uapi/asm-generic/mman-common.h
> > > @@ -43,6 +43,7 @@
> > >  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
> > >  #define MADV_WILLNEED	3		/* will need these pages */
> > >  #define MADV_DONTNEED	4		/* don't need these pages */
> > > +#define MADV_COLD	5		/* deactivatie these pages */
> > >  
> > >  /* common parameters: try to keep these consistent across architectures */
> > >  #define MADV_FREE	8		/* free pages only if memory pressure */
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index 9eeaf2b95166..75a4f96ec0fb 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -43,7 +43,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
> > >  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> > >  		unsigned long floor, unsigned long ceiling);
> > >  
> > > -static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
> > > +static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
> > >  {
> > >  	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
> > >  }
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 628022e674a7..ab158766858a 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
> > >  	case MADV_REMOVE:
> > >  	case MADV_WILLNEED:
> > >  	case MADV_DONTNEED:
> > > +	case MADV_COLD:
> > >  	case MADV_FREE:
> > >  		return 0;
> > >  	default:
> > > @@ -307,6 +308,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
> > >  	return 0;
> > >  }
> > >  
> > > +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> > > +				unsigned long end, struct mm_walk *walk)
> > > +{
> > > +	pte_t *orig_pte, *pte, ptent;
> > > +	spinlock_t *ptl;
> > > +	struct page *page;
> > > +	struct vm_area_struct *vma = walk->vma;
> > > +	unsigned long next;
> > > +
> > > +	next = pmd_addr_end(addr, end);
> > > +	if (pmd_trans_huge(*pmd)) {
> > > +		ptl = pmd_trans_huge_lock(pmd, vma);
> > > +		if (!ptl)
> > > +			return 0;
> > > +
> > > +		if (is_huge_zero_pmd(*pmd))
> > > +			goto huge_unlock;
> > > +
> > > +		page = pmd_page(*pmd);
> > > +		if (page_mapcount(page) > 1)
> > > +			goto huge_unlock;
> > > +
> > > +		if (next - addr != HPAGE_PMD_SIZE) {
> > > +			int err;
> > > +
> > > +			get_page(page);
> > > +			spin_unlock(ptl);
> > > +			lock_page(page);
> > > +			err = split_huge_page(page);
> > > +			unlock_page(page);
> > > +			put_page(page);
> > > +			if (!err)
> > > +				goto regular_page;
> > > +			return 0;
> > > +		}
> > > +
> > > +		pmdp_test_and_clear_young(vma, addr, pmd);
> > > +		deactivate_page(page);
> > > +huge_unlock:
> > > +		spin_unlock(ptl);
> > > +		return 0;
> > > +	}
> > > +
> > > +	if (pmd_trans_unstable(pmd))
> > > +		return 0;
> > > +
> > > +regular_page:
> > > +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> > > +		ptent = *pte;
> > > +
> > > +		if (pte_none(ptent))
> > > +			continue;
> > > +
> > > +		if (!pte_present(ptent))
> > > +			continue;
> > > +
> > > +		page = vm_normal_page(vma, addr, ptent);
> > > +		if (!page)
> > > +			continue;
> > > +
> > > +		if (page_mapcount(page) > 1)
> > > +			continue;
> > > +
> > > +		ptep_test_and_clear_young(vma, addr, pte);
> > 
> > Wondering here how it interacts with idle page tracking. Here since young
> > flag is cleared by the cold hint, page_referenced_one() or
> > page_idle_clear_pte_refs_one() will not be able to clear the page-idle flag
> > if it was previously set since it does not know any more that a page was
> > actively referenced.
> 
> ptep_test_and_clear_young doesn't change PG_idle/young so idle page tracking
> doesn't affect.

Clearing of the young bit in the PTE does affect idle tracking.

Both page_referenced_one() and page_idle_clear_pte_refs_one() check this bit.

> > bit was previously set, just so that page-idle tracking works smoothly when
> > this hint is concurrently applied?
> 
> deactivate_page will remove PG_young bit so that the page will be reclaimed.
> Do I miss your point?

Say a process had accessed PTE bit not set, then idle tracking is run and PG_Idle
is set. Now the page is accessed from userspace thus setting the accessed PTE
bit.  Now a remote process passes this process_madvise cold hint (I know your
current series does not support remote process, but I am saying for future
when you post this). Because you cleared the PTE accessed bit through the
hint, idle tracking no longer will know that the page is referenced and the
user gets confused because accessed page appears to be idle.

I think to fix this, what you should do is clear the PG_Idle flag if the
young/accessed PTE bits are set. If PG_Idle is already cleared, then you
don't need to do anything.

thanks,

 - Joel
Minchan Kim June 13, 2019, 4:48 a.m. UTC | #4
On Wed, Jun 12, 2019 at 01:21:04PM -0400, Joel Fernandes wrote:
> On Mon, Jun 10, 2019 at 07:09:04PM +0900, Minchan Kim wrote:
> > Hi Joel,
> > 
> > On Tue, Jun 04, 2019 at 04:38:41PM -0400, Joel Fernandes wrote:
> > > On Mon, Jun 03, 2019 at 02:36:52PM +0900, Minchan Kim wrote:
> > > > When a process expects no accesses to a certain memory range, it could
> > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > happens but data should be preserved for future use.  This could reduce
> > > > workingset eviction so it ends up increasing performance.
> > > > 
> > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > to be used in the near future. The hint can help kernel in deciding which
> > > > pages to evict early during memory pressure.
> > > > 
> > > > It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
> > > > 
> > > > 	active file page -> inactive file LRU
> > > > 	active anon page -> inacdtive anon LRU
> > > > 
> > > > Unlike MADV_FREE, it doesn't move active anonymous pages to inactive
> > > > files's head because MADV_COLD is a little bit different symantic.
> > > > MADV_FREE means it's okay to discard when the memory pressure because
> > > > the content of the page is *garbage* so freeing such pages is almost zero
> > > > overhead since we don't need to swap out and access afterward causes just
> > > > minor fault. Thus, it would make sense to put those freeable pages in
> > > > inactive file LRU to compete other used-once pages. Even, it could
> > > > give a bonus to make them be reclaimed on swapless system. However,
> > > > MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in
> > > > in the end. So it's better to move inactive anon's LRU list, not file LRU.
> > > > Furthermore, it would help to avoid unnecessary scanning of cold anonymous
> > > > if system doesn't have a swap device.
> > > > 
> > > > All of error rule is same with MADV_DONTNEED.
> > > > 
> > > > Note:
> > > > This hint works with only private pages(IOW, page_mapcount(page) < 2)
> > > > because shared page could have more chance to be accessed from other
> > > > processes sharing the page although the caller reset the reference bits.
> > > > It ends up preventing the reclaim of the page and wastes CPU cycle.
> > > > 
> > > > * RFCv2
> > > >  * add more description - mhocko
> > > > 
> > > > * RFCv1
> > > >  * renaming from MADV_COOL to MADV_COLD - hannes
> > > > 
> > > > * internal review
> > > >  * use clear_page_youn in deactivate_page - joelaf
> > > >  * Revise the description - surenb
> > > >  * Renaming from MADV_WARM to MADV_COOL - surenb
> > > > 
> > > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > > ---
> > > >  include/linux/page-flags.h             |   1 +
> > > >  include/linux/page_idle.h              |  15 ++++
> > > >  include/linux/swap.h                   |   1 +
> > > >  include/uapi/asm-generic/mman-common.h |   1 +
> > > >  mm/internal.h                          |   2 +-
> > > >  mm/madvise.c                           | 115 ++++++++++++++++++++++++-
> > > >  mm/oom_kill.c                          |   2 +-
> > > >  mm/swap.c                              |  43 +++++++++
> > > >  8 files changed, 176 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > > > index 9f8712a4b1a5..58b06654c8dd 100644
> > > > --- a/include/linux/page-flags.h
> > > > +++ b/include/linux/page-flags.h
> > > > @@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
> > > >  TESTPAGEFLAG(Young, young, PF_ANY)
> > > >  SETPAGEFLAG(Young, young, PF_ANY)
> > > >  TESTCLEARFLAG(Young, young, PF_ANY)
> > > > +CLEARPAGEFLAG(Young, young, PF_ANY)
> > > >  PAGEFLAG(Idle, idle, PF_ANY)
> > > >  #endif
> > > >  
> > > > diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> > > > index 1e894d34bdce..f3f43b317150 100644
> > > > --- a/include/linux/page_idle.h
> > > > +++ b/include/linux/page_idle.h
> > > > @@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
> > > >  	SetPageYoung(page);
> > > >  }
> > > >  
> > > > +static inline void clear_page_young(struct page *page)
> > > > +{
> > > > +	ClearPageYoung(page);
> > > > +}
> > > > +
> > > >  static inline bool test_and_clear_page_young(struct page *page)
> > > >  {
> > > >  	return TestClearPageYoung(page);
> > > > @@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
> > > >  	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> > > >  }
> > > >  
> > > > +static void clear_page_young(struct page *page)
> > > > +{
> > > > +	struct page_ext *page_ext = lookup_page_ext(page);
> > > > +
> > > > +	if (unlikely(!page_ext))
> > > > +		return;
> > > > +
> > > > +	clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> > > > +}
> > > > +
> > > >  static inline bool test_and_clear_page_young(struct page *page)
> > > >  {
> > > >  	struct page_ext *page_ext = lookup_page_ext(page);
> > > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > > index de2c67a33b7e..0ce997edb8bb 100644
> > > > --- a/include/linux/swap.h
> > > > +++ b/include/linux/swap.h
> > > > @@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
> > > >  extern void lru_add_drain_all(void);
> > > >  extern void rotate_reclaimable_page(struct page *page);
> > > >  extern void deactivate_file_page(struct page *page);
> > > > +extern void deactivate_page(struct page *page);
> > > >  extern void mark_page_lazyfree(struct page *page);
> > > >  extern void swap_setup(void);
> > > >  
> > > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > > > index bea0278f65ab..1190f4e7f7b9 100644
> > > > --- a/include/uapi/asm-generic/mman-common.h
> > > > +++ b/include/uapi/asm-generic/mman-common.h
> > > > @@ -43,6 +43,7 @@
> > > >  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
> > > >  #define MADV_WILLNEED	3		/* will need these pages */
> > > >  #define MADV_DONTNEED	4		/* don't need these pages */
> > > > +#define MADV_COLD	5		/* deactivatie these pages */
> > > >  
> > > >  /* common parameters: try to keep these consistent across architectures */
> > > >  #define MADV_FREE	8		/* free pages only if memory pressure */
> > > > diff --git a/mm/internal.h b/mm/internal.h
> > > > index 9eeaf2b95166..75a4f96ec0fb 100644
> > > > --- a/mm/internal.h
> > > > +++ b/mm/internal.h
> > > > @@ -43,7 +43,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
> > > >  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> > > >  		unsigned long floor, unsigned long ceiling);
> > > >  
> > > > -static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
> > > > +static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
> > > >  {
> > > >  	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
> > > >  }
> > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > index 628022e674a7..ab158766858a 100644
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
> > > >  	case MADV_REMOVE:
> > > >  	case MADV_WILLNEED:
> > > >  	case MADV_DONTNEED:
> > > > +	case MADV_COLD:
> > > >  	case MADV_FREE:
> > > >  		return 0;
> > > >  	default:
> > > > @@ -307,6 +308,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
> > > >  	return 0;
> > > >  }
> > > >  
> > > > +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> > > > +				unsigned long end, struct mm_walk *walk)
> > > > +{
> > > > +	pte_t *orig_pte, *pte, ptent;
> > > > +	spinlock_t *ptl;
> > > > +	struct page *page;
> > > > +	struct vm_area_struct *vma = walk->vma;
> > > > +	unsigned long next;
> > > > +
> > > > +	next = pmd_addr_end(addr, end);
> > > > +	if (pmd_trans_huge(*pmd)) {
> > > > +		ptl = pmd_trans_huge_lock(pmd, vma);
> > > > +		if (!ptl)
> > > > +			return 0;
> > > > +
> > > > +		if (is_huge_zero_pmd(*pmd))
> > > > +			goto huge_unlock;
> > > > +
> > > > +		page = pmd_page(*pmd);
> > > > +		if (page_mapcount(page) > 1)
> > > > +			goto huge_unlock;
> > > > +
> > > > +		if (next - addr != HPAGE_PMD_SIZE) {
> > > > +			int err;
> > > > +
> > > > +			get_page(page);
> > > > +			spin_unlock(ptl);
> > > > +			lock_page(page);
> > > > +			err = split_huge_page(page);
> > > > +			unlock_page(page);
> > > > +			put_page(page);
> > > > +			if (!err)
> > > > +				goto regular_page;
> > > > +			return 0;
> > > > +		}
> > > > +
> > > > +		pmdp_test_and_clear_young(vma, addr, pmd);
> > > > +		deactivate_page(page);
> > > > +huge_unlock:
> > > > +		spin_unlock(ptl);
> > > > +		return 0;
> > > > +	}
> > > > +
> > > > +	if (pmd_trans_unstable(pmd))
> > > > +		return 0;
> > > > +
> > > > +regular_page:
> > > > +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> > > > +		ptent = *pte;
> > > > +
> > > > +		if (pte_none(ptent))
> > > > +			continue;
> > > > +
> > > > +		if (!pte_present(ptent))
> > > > +			continue;
> > > > +
> > > > +		page = vm_normal_page(vma, addr, ptent);
> > > > +		if (!page)
> > > > +			continue;
> > > > +
> > > > +		if (page_mapcount(page) > 1)
> > > > +			continue;
> > > > +
> > > > +		ptep_test_and_clear_young(vma, addr, pte);
> > > 
> > > Wondering here how it interacts with idle page tracking. Here since young
> > > flag is cleared by the cold hint, page_referenced_one() or
> > > page_idle_clear_pte_refs_one() will not be able to clear the page-idle flag
> > > if it was previously set since it does not know any more that a page was
> > > actively referenced.
> > 
> > ptep_test_and_clear_young doesn't change PG_idle/young so idle page tracking
> > doesn't affect.

You said *young flag* in the comment, which made me confused. I thought you meant
PG_young flag but you mean PTE access bit.

> 
> Clearing of the young bit in the PTE does affect idle tracking.
> 
> Both page_referenced_one() and page_idle_clear_pte_refs_one() check this bit.
> 
> > > bit was previously set, just so that page-idle tracking works smoothly when
> > > this hint is concurrently applied?
> > 
> > deactivate_page will remove PG_young bit so that the page will be reclaimed.
> > Do I miss your point?
> 
> Say a process had accessed PTE bit not set, then idle tracking is run and PG_Idle
> is set. Now the page is accessed from userspace thus setting the accessed PTE
> bit.  Now a remote process passes this process_madvise cold hint (I know your
> current series does not support remote process, but I am saying for future
> when you post this). Because you cleared the PTE accessed bit through the
> hint, idle tracking no longer will know that the page is referenced and the
> user gets confused because accessed page appears to be idle.

Right.

> 
> I think to fix this, what you should do is clear the PG_Idle flag if the
> young/accessed PTE bits are set. If PG_Idle is already cleared, then you
> don't need to do anything.

I'm not sure. What does it make MADV_COLD special?
How about MADV_FREE|MADV_DONTNEED?
Why don't they clear PG_Idle if pte was young at tearing down pte? 
The page could be shared by other processes so if we miss to clear out
PG_idle in there, page idle tracking could miss the access history forever.

If it's not what you want, maybe we need to fix all places all at once.
However, I'm not sure. Rather than, I want to keep PG_idle in those hints
even though pte was accesssed because the process now gives strong hint
"The page is idle from now on". It's valid because he knows himself better than
others, even admin. IOW, he declare the page is not workingset any more.
What's the problem if page idle tracking feature miss it?
If other processs still have access bit of their pte for the page, page idle
tracking could find the page as non-idle so it's no problem, either.
Joel Fernandes June 19, 2019, 5:13 p.m. UTC | #5
Sorry Minchan for late reply, I had a visa interview travel and then the weekend
came. I replied below:

On Thu, Jun 13, 2019 at 01:48:24PM +0900, Minchan Kim wrote:
> On Wed, Jun 12, 2019 at 01:21:04PM -0400, Joel Fernandes wrote:
> > On Mon, Jun 10, 2019 at 07:09:04PM +0900, Minchan Kim wrote:
> > > Hi Joel,
> > > 
> > > On Tue, Jun 04, 2019 at 04:38:41PM -0400, Joel Fernandes wrote:
> > > > On Mon, Jun 03, 2019 at 02:36:52PM +0900, Minchan Kim wrote:
> > > > > When a process expects no accesses to a certain memory range, it could
> > > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > > happens but data should be preserved for future use.  This could reduce
> > > > > workingset eviction so it ends up increasing performance.
> > > > > 
> > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > to be used in the near future. The hint can help kernel in deciding which
> > > > > pages to evict early during memory pressure.
> > > > > 
> > > > > It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
> > > > > 
> > > > > 	active file page -> inactive file LRU
> > > > > 	active anon page -> inacdtive anon LRU
> > > > > 
> > > > > Unlike MADV_FREE, it doesn't move active anonymous pages to inactive
> > > > > files's head because MADV_COLD is a little bit different symantic.
> > > > > MADV_FREE means it's okay to discard when the memory pressure because
> > > > > the content of the page is *garbage* so freeing such pages is almost zero
> > > > > overhead since we don't need to swap out and access afterward causes just
> > > > > minor fault. Thus, it would make sense to put those freeable pages in
> > > > > inactive file LRU to compete other used-once pages. Even, it could
> > > > > give a bonus to make them be reclaimed on swapless system. However,
> > > > > MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in
> > > > > in the end. So it's better to move inactive anon's LRU list, not file LRU.
> > > > > Furthermore, it would help to avoid unnecessary scanning of cold anonymous
> > > > > if system doesn't have a swap device.
> > > > > 
> > > > > All of error rule is same with MADV_DONTNEED.
> > > > > 
> > > > > Note:
> > > > > This hint works with only private pages(IOW, page_mapcount(page) < 2)
> > > > > because shared page could have more chance to be accessed from other
> > > > > processes sharing the page although the caller reset the reference bits.
> > > > > It ends up preventing the reclaim of the page and wastes CPU cycle.
> > > > > 
> > > > > * RFCv2
> > > > >  * add more description - mhocko
> > > > > 
> > > > > * RFCv1
> > > > >  * renaming from MADV_COOL to MADV_COLD - hannes
> > > > > 
> > > > > * internal review
> > > > >  * use clear_page_youn in deactivate_page - joelaf
> > > > >  * Revise the description - surenb
> > > > >  * Renaming from MADV_WARM to MADV_COOL - surenb
> > > > > 
> > > > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > > > ---
> > > > >  include/linux/page-flags.h             |   1 +
> > > > >  include/linux/page_idle.h              |  15 ++++
> > > > >  include/linux/swap.h                   |   1 +
> > > > >  include/uapi/asm-generic/mman-common.h |   1 +
> > > > >  mm/internal.h                          |   2 +-
> > > > >  mm/madvise.c                           | 115 ++++++++++++++++++++++++-
> > > > >  mm/oom_kill.c                          |   2 +-
> > > > >  mm/swap.c                              |  43 +++++++++
> > > > >  8 files changed, 176 insertions(+), 4 deletions(-)
> > > > > 
> > > > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > > > > index 9f8712a4b1a5..58b06654c8dd 100644
> > > > > --- a/include/linux/page-flags.h
> > > > > +++ b/include/linux/page-flags.h
> > > > > @@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
> > > > >  TESTPAGEFLAG(Young, young, PF_ANY)
> > > > >  SETPAGEFLAG(Young, young, PF_ANY)
> > > > >  TESTCLEARFLAG(Young, young, PF_ANY)
> > > > > +CLEARPAGEFLAG(Young, young, PF_ANY)
> > > > >  PAGEFLAG(Idle, idle, PF_ANY)
> > > > >  #endif
> > > > >  
> > > > > diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> > > > > index 1e894d34bdce..f3f43b317150 100644
> > > > > --- a/include/linux/page_idle.h
> > > > > +++ b/include/linux/page_idle.h
> > > > > @@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
> > > > >  	SetPageYoung(page);
> > > > >  }
> > > > >  
> > > > > +static inline void clear_page_young(struct page *page)
> > > > > +{
> > > > > +	ClearPageYoung(page);
> > > > > +}
> > > > > +
> > > > >  static inline bool test_and_clear_page_young(struct page *page)
> > > > >  {
> > > > >  	return TestClearPageYoung(page);
> > > > > @@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
> > > > >  	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> > > > >  }
> > > > >  
> > > > > +static void clear_page_young(struct page *page)
> > > > > +{
> > > > > +	struct page_ext *page_ext = lookup_page_ext(page);
> > > > > +
> > > > > +	if (unlikely(!page_ext))
> > > > > +		return;
> > > > > +
> > > > > +	clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> > > > > +}
> > > > > +
> > > > >  static inline bool test_and_clear_page_young(struct page *page)
> > > > >  {
> > > > >  	struct page_ext *page_ext = lookup_page_ext(page);
> > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > > > index de2c67a33b7e..0ce997edb8bb 100644
> > > > > --- a/include/linux/swap.h
> > > > > +++ b/include/linux/swap.h
> > > > > @@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
> > > > >  extern void lru_add_drain_all(void);
> > > > >  extern void rotate_reclaimable_page(struct page *page);
> > > > >  extern void deactivate_file_page(struct page *page);
> > > > > +extern void deactivate_page(struct page *page);
> > > > >  extern void mark_page_lazyfree(struct page *page);
> > > > >  extern void swap_setup(void);
> > > > >  
> > > > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > > > > index bea0278f65ab..1190f4e7f7b9 100644
> > > > > --- a/include/uapi/asm-generic/mman-common.h
> > > > > +++ b/include/uapi/asm-generic/mman-common.h
> > > > > @@ -43,6 +43,7 @@
> > > > >  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
> > > > >  #define MADV_WILLNEED	3		/* will need these pages */
> > > > >  #define MADV_DONTNEED	4		/* don't need these pages */
> > > > > +#define MADV_COLD	5		/* deactivatie these pages */
> > > > >  
> > > > >  /* common parameters: try to keep these consistent across architectures */
> > > > >  #define MADV_FREE	8		/* free pages only if memory pressure */
> > > > > diff --git a/mm/internal.h b/mm/internal.h
> > > > > index 9eeaf2b95166..75a4f96ec0fb 100644
> > > > > --- a/mm/internal.h
> > > > > +++ b/mm/internal.h
> > > > > @@ -43,7 +43,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
> > > > >  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> > > > >  		unsigned long floor, unsigned long ceiling);
> > > > >  
> > > > > -static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
> > > > > +static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
> > > > >  {
> > > > >  	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
> > > > >  }
> > > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > > index 628022e674a7..ab158766858a 100644
> > > > > --- a/mm/madvise.c
> > > > > +++ b/mm/madvise.c
> > > > > @@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
> > > > >  	case MADV_REMOVE:
> > > > >  	case MADV_WILLNEED:
> > > > >  	case MADV_DONTNEED:
> > > > > +	case MADV_COLD:
> > > > >  	case MADV_FREE:
> > > > >  		return 0;
> > > > >  	default:
> > > > > @@ -307,6 +308,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
> > > > >  	return 0;
> > > > >  }
> > > > >  
> > > > > +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> > > > > +				unsigned long end, struct mm_walk *walk)
> > > > > +{
> > > > > +	pte_t *orig_pte, *pte, ptent;
> > > > > +	spinlock_t *ptl;
> > > > > +	struct page *page;
> > > > > +	struct vm_area_struct *vma = walk->vma;
> > > > > +	unsigned long next;
> > > > > +
> > > > > +	next = pmd_addr_end(addr, end);
> > > > > +	if (pmd_trans_huge(*pmd)) {
> > > > > +		ptl = pmd_trans_huge_lock(pmd, vma);
> > > > > +		if (!ptl)
> > > > > +			return 0;
> > > > > +
> > > > > +		if (is_huge_zero_pmd(*pmd))
> > > > > +			goto huge_unlock;
> > > > > +
> > > > > +		page = pmd_page(*pmd);
> > > > > +		if (page_mapcount(page) > 1)
> > > > > +			goto huge_unlock;
> > > > > +
> > > > > +		if (next - addr != HPAGE_PMD_SIZE) {
> > > > > +			int err;
> > > > > +
> > > > > +			get_page(page);
> > > > > +			spin_unlock(ptl);
> > > > > +			lock_page(page);
> > > > > +			err = split_huge_page(page);
> > > > > +			unlock_page(page);
> > > > > +			put_page(page);
> > > > > +			if (!err)
> > > > > +				goto regular_page;
> > > > > +			return 0;
> > > > > +		}
> > > > > +
> > > > > +		pmdp_test_and_clear_young(vma, addr, pmd);
> > > > > +		deactivate_page(page);
> > > > > +huge_unlock:
> > > > > +		spin_unlock(ptl);
> > > > > +		return 0;
> > > > > +	}
> > > > > +
> > > > > +	if (pmd_trans_unstable(pmd))
> > > > > +		return 0;
> > > > > +
> > > > > +regular_page:
> > > > > +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > > +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> > > > > +		ptent = *pte;
> > > > > +
> > > > > +		if (pte_none(ptent))
> > > > > +			continue;
> > > > > +
> > > > > +		if (!pte_present(ptent))
> > > > > +			continue;
> > > > > +
> > > > > +		page = vm_normal_page(vma, addr, ptent);
> > > > > +		if (!page)
> > > > > +			continue;
> > > > > +
> > > > > +		if (page_mapcount(page) > 1)
> > > > > +			continue;
> > > > > +
> > > > > +		ptep_test_and_clear_young(vma, addr, pte);
> > > > 
> > > > Wondering here how it interacts with idle page tracking. Here since young
> > > > flag is cleared by the cold hint, page_referenced_one() or
> > > > page_idle_clear_pte_refs_one() will not be able to clear the page-idle flag
> > > > if it was previously set since it does not know any more that a page was
> > > > actively referenced.
> > > 
> > > ptep_test_and_clear_young doesn't change PG_idle/young so idle page tracking
> > > doesn't affect.
> 
> You said *young flag* in the comment, which made me confused. I thought you meant
> PG_young flag but you mean PTE access bit.
> 
> > 
> > Clearing of the young bit in the PTE does affect idle tracking.
> > 
> > Both page_referenced_one() and page_idle_clear_pte_refs_one() check this bit.
> > 
> > > > bit was previously set, just so that page-idle tracking works smoothly when
> > > > this hint is concurrently applied?
> > > 
> > > deactivate_page will remove PG_young bit so that the page will be reclaimed.
> > > Do I miss your point?
> > 
> > Say a process had accessed PTE bit not set, then idle tracking is run and PG_Idle
> > is set. Now the page is accessed from userspace thus setting the accessed PTE
> > bit.  Now a remote process passes this process_madvise cold hint (I know your
> > current series does not support remote process, but I am saying for future
> > when you post this). Because you cleared the PTE accessed bit through the
> > hint, idle tracking no longer will know that the page is referenced and the
> > user gets confused because accessed page appears to be idle.
> 
> Right.
> 
> > 
> > I think to fix this, what you should do is clear the PG_Idle flag if the
> > young/accessed PTE bits are set. If PG_Idle is already cleared, then you
> > don't need to do anything.
> 
> I'm not sure. What does it make MADV_COLD special?
> How about MADV_FREE|MADV_DONTNEED?
> Why don't they clear PG_Idle if pte was young at tearing down pte? 

Good point, so it sounds like those (MADV_FREE|MADV_DONTNEED) also need to be fixed then?

> The page could be shared by other processes so if we miss to clear out
> PG_idle in there, page idle tracking could miss the access history forever.

I did not understand this. So say a page X is shared process P and Q and
assume the PG_idle flag is set on the page.

P accesses memory and has the pte accessed bit set. P now gets the MADV_COLD
hint and forgets to clear the idle flag while clearing the pte accessed bit.

Now the page appears to be idle, even though it was not. This has nothing to
do with Q and whether the page is shared or not.

> If it's not what you want, maybe we need to fix all places all at once.
> However, I'm not sure. Rather than, I want to keep PG_idle in those hints
> even though pte was accesssed because the process now gives strong hint
> "The page is idle from now on". It's valid because he knows himself better than
> others, even admin. IOW, he declare the page is not workingset any more.

Even if the PG_idle flag is not cleared - it is not a strong hint for working
set size IMHO, because the page *was* accessed so the process definitely needed the
page at some point even though now it says it is MADV_COLD. So that is part
of working set. I don't think we should implicitly provide such hints and we
should fix it.

Also I was saying in previous email, if process_madvise (future extension) is
called from say activity manager, then the process and the user running the
idle tracking feature has no idea that the page was accessed because the idle
flag is still set. That is a bit weird and is loss of information.

It may not be a big deal in the long run if the page is accessed a lot, since
the PTE accessed bit will be set again and idle-tracking feature may not miss
it, but why leave it to chance if it is a simple fix?

> What's the problem if page idle tracking feature miss it?

What's the problem if PG_idle flag is cleared here? It is just a software
flag.

> If other processs still have access bit of their pte for the page, page idle
> tracking could find the page as non-idle so it's no problem, either.

Yes, but if the other process also does not access the page, then the access
information is lost.

thanks!

 - Joel
Minchan Kim June 20, 2019, 5:01 a.m. UTC | #6
On Wed, Jun 19, 2019 at 01:13:40PM -0400, Joel Fernandes wrote:
< snip >

Ccing Vladimir

> > > > > > +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> > > > > > +				unsigned long end, struct mm_walk *walk)
> > > > > > +{
> > > > > > +	pte_t *orig_pte, *pte, ptent;
> > > > > > +	spinlock_t *ptl;
> > > > > > +	struct page *page;
> > > > > > +	struct vm_area_struct *vma = walk->vma;
> > > > > > +	unsigned long next;
> > > > > > +
> > > > > > +	next = pmd_addr_end(addr, end);
> > > > > > +	if (pmd_trans_huge(*pmd)) {
> > > > > > +		ptl = pmd_trans_huge_lock(pmd, vma);
> > > > > > +		if (!ptl)
> > > > > > +			return 0;
> > > > > > +
> > > > > > +		if (is_huge_zero_pmd(*pmd))
> > > > > > +			goto huge_unlock;
> > > > > > +
> > > > > > +		page = pmd_page(*pmd);
> > > > > > +		if (page_mapcount(page) > 1)
> > > > > > +			goto huge_unlock;
> > > > > > +
> > > > > > +		if (next - addr != HPAGE_PMD_SIZE) {
> > > > > > +			int err;
> > > > > > +
> > > > > > +			get_page(page);
> > > > > > +			spin_unlock(ptl);
> > > > > > +			lock_page(page);
> > > > > > +			err = split_huge_page(page);
> > > > > > +			unlock_page(page);
> > > > > > +			put_page(page);
> > > > > > +			if (!err)
> > > > > > +				goto regular_page;
> > > > > > +			return 0;
> > > > > > +		}
> > > > > > +
> > > > > > +		pmdp_test_and_clear_young(vma, addr, pmd);
> > > > > > +		deactivate_page(page);
> > > > > > +huge_unlock:
> > > > > > +		spin_unlock(ptl);
> > > > > > +		return 0;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (pmd_trans_unstable(pmd))
> > > > > > +		return 0;
> > > > > > +
> > > > > > +regular_page:
> > > > > > +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > > > +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> > > > > > +		ptent = *pte;
> > > > > > +
> > > > > > +		if (pte_none(ptent))
> > > > > > +			continue;
> > > > > > +
> > > > > > +		if (!pte_present(ptent))
> > > > > > +			continue;
> > > > > > +
> > > > > > +		page = vm_normal_page(vma, addr, ptent);
> > > > > > +		if (!page)
> > > > > > +			continue;
> > > > > > +
> > > > > > +		if (page_mapcount(page) > 1)
> > > > > > +			continue;
> > > > > > +
> > > > > > +		ptep_test_and_clear_young(vma, addr, pte);
> > > > > 
> > > > > Wondering here how it interacts with idle page tracking. Here since young
> > > > > flag is cleared by the cold hint, page_referenced_one() or
> > > > > page_idle_clear_pte_refs_one() will not be able to clear the page-idle flag
> > > > > if it was previously set since it does not know any more that a page was
> > > > > actively referenced.
> > > > 
> > > > ptep_test_and_clear_young doesn't change PG_idle/young so idle page tracking
> > > > doesn't affect.
> > 
> > You said *young flag* in the comment, which made me confused. I thought you meant
> > PG_young flag but you mean PTE access bit.
> > 
> > > 
> > > Clearing of the young bit in the PTE does affect idle tracking.
> > > 
> > > Both page_referenced_one() and page_idle_clear_pte_refs_one() check this bit.
> > > 
> > > > > bit was previously set, just so that page-idle tracking works smoothly when
> > > > > this hint is concurrently applied?
> > > > 
> > > > deactivate_page will remove PG_young bit so that the page will be reclaimed.
> > > > Do I miss your point?
> > > 
> > > Say a process had accessed PTE bit not set, then idle tracking is run and PG_Idle
> > > is set. Now the page is accessed from userspace thus setting the accessed PTE
> > > bit.  Now a remote process passes this process_madvise cold hint (I know your
> > > current series does not support remote process, but I am saying for future
> > > when you post this). Because you cleared the PTE accessed bit through the
> > > hint, idle tracking no longer will know that the page is referenced and the
> > > user gets confused because accessed page appears to be idle.
> > 
> > Right.
> > 
> > > 
> > > I think to fix this, what you should do is clear the PG_Idle flag if the
> > > young/accessed PTE bits are set. If PG_Idle is already cleared, then you
> > > don't need to do anything.
> > 
> > I'm not sure. What does it make MADV_COLD special?
> > How about MADV_FREE|MADV_DONTNEED?
> > Why don't they clear PG_Idle if pte was young at tearing down pte? 
> 
> Good point, so it sounds like those (MADV_FREE|MADV_DONTNEED) also need to be fixed then?

Not sure. If you want it, maybe you need to fix every pte clearing and pte_mkold
part, which is more general to cover every sites like munmap, get_user_pages and
so on. Anyway, I don't think it's related to this patchset.

> 
> > The page could be shared by other processes so if we miss to clear out
> > PG_idle in there, page idle tracking could miss the access history forever.
> 
> I did not understand this. So say a page X is shared process P and Q and
> assume the PG_idle flag is set on the page.
> 
> P accesses memory and has the pte accessed bit set. P now gets the MADV_COLD
> hint and forgets to clear the idle flag while clearing the pte accessed bit.
> 
> Now the page appears to be idle, even though it was not. This has nothing to
> do with Q and whether the page is shared or not.

What I meant was MADV_FREE|MADV_DONTNEED.

> 
> > If it's not what you want, maybe we need to fix all places all at once.
> > However, I'm not sure. Rather than, I want to keep PG_idle in those hints
> > even though pte was accesssed because the process now gives strong hint
> > "The page is idle from now on". It's valid because he knows himself better than
> > others, even admin. IOW, he declare the page is not workingset any more.
> 
> Even if the PG_idle flag is not cleared - it is not a strong hint for working
> set size IMHO, because the page *was* accessed so the process definitely needed the
> page at some point even though now it says it is MADV_COLD. So that is part
> of working set. I don't think we should implicitly provide such hints and we
> should fix it.
> 
> Also I was saying in previous email, if process_madvise (future extension) is
> called from say activity manager, then the process and the user running the
> idle tracking feature has no idea that the page was accessed because the idle
> flag is still set. That is a bit weird and is loss of information.
> 
> It may not be a big deal in the long run if the page is accessed a lot, since
> the PTE accessed bit will be set again and idle-tracking feature may not miss
> it, but why leave it to chance if it is a simple fix?

Consistency with other madvise hints.

There are many places you could lose the information as I mentioned and I'm
really not conviced we need fixing because currently page-idle tracking
feature is biased to work with . If you believe we need to fix it,
it would be better to have a separate discussion, not here.

> 
> > What's the problem if page idle tracking feature miss it?
> 
> What's the problem if PG_idle flag is cleared here? It is just a software
> flag.

Again consistency. I don't think it's a MADV_PAGEOUT specific issue.
Since I pointed out other places idle tracking is missing(? not sure),
Let's discuss it separately if you feel we need fix.

Furthermore, once the page is reclaimed, that means the page could be
deallocated so you automatically don't see any PG_idle from the page.

> 
> > If other processs still have access bit of their pte for the page, page idle
> > tracking could find the page as non-idle so it's no problem, either.
> 
> Yes, but if the other process also does not access the page, then the access
> information is lost.
> 
> thanks!
> 
>  - Joel
>
Joel Fernandes June 20, 2019, 3:57 p.m. UTC | #7
On Thu, Jun 20, 2019 at 1:01 AM Minchan Kim <minchan@kernel.org> wrote:
[snip]
> > > >
> > > > I think to fix this, what you should do is clear the PG_Idle flag if the
> > > > young/accessed PTE bits are set. If PG_Idle is already cleared, then you
> > > > don't need to do anything.
> > >
> > > I'm not sure. What does it make MADV_COLD special?
> > > How about MADV_FREE|MADV_DONTNEED?
> > > Why don't they clear PG_Idle if pte was young at tearing down pte?
> >
> > Good point, so it sounds like those (MADV_FREE|MADV_DONTNEED) also need to be fixed then?
>
> Not sure. If you want it, maybe you need to fix every pte clearing and pte_mkold
> part, which is more general to cover every sites like munmap, get_user_pages and
> so on. Anyway, I don't think it's related to this patchset.

Ok, I can look into this issue on my own when I get time. I'll add it
to my list. No problems with your patch otherwise from my side.

 -Joel
diff mbox series

Patch

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9f8712a4b1a5..58b06654c8dd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -424,6 +424,7 @@  static inline bool set_hwpoison_free_buddy_page(struct page *page)
 TESTPAGEFLAG(Young, young, PF_ANY)
 SETPAGEFLAG(Young, young, PF_ANY)
 TESTCLEARFLAG(Young, young, PF_ANY)
+CLEARPAGEFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif
 
diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 1e894d34bdce..f3f43b317150 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -19,6 +19,11 @@  static inline void set_page_young(struct page *page)
 	SetPageYoung(page);
 }
 
+static inline void clear_page_young(struct page *page)
+{
+	ClearPageYoung(page);
+}
+
 static inline bool test_and_clear_page_young(struct page *page)
 {
 	return TestClearPageYoung(page);
@@ -65,6 +70,16 @@  static inline void set_page_young(struct page *page)
 	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
 }
 
+static void clear_page_young(struct page *page)
+{
+	struct page_ext *page_ext = lookup_page_ext(page);
+
+	if (unlikely(!page_ext))
+		return;
+
+	clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+}
+
 static inline bool test_and_clear_page_young(struct page *page)
 {
 	struct page_ext *page_ext = lookup_page_ext(page);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index de2c67a33b7e..0ce997edb8bb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -340,6 +340,7 @@  extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index bea0278f65ab..1190f4e7f7b9 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -43,6 +43,7 @@ 
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_COLD	5		/* deactivatie these pages */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_FREE	8		/* free pages only if memory pressure */
diff --git a/mm/internal.h b/mm/internal.h
index 9eeaf2b95166..75a4f96ec0fb 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -43,7 +43,7 @@  vm_fault_t do_swap_page(struct vm_fault *vmf);
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
-static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
+static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
 {
 	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
 }
diff --git a/mm/madvise.c b/mm/madvise.c
index 628022e674a7..ab158766858a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -40,6 +40,7 @@  static int madvise_need_mmap_write(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_COLD:
 	case MADV_FREE:
 		return 0;
 	default:
@@ -307,6 +308,113 @@  static long madvise_willneed(struct vm_area_struct *vma,
 	return 0;
 }
 
+static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+{
+	pte_t *orig_pte, *pte, ptent;
+	spinlock_t *ptl;
+	struct page *page;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long next;
+
+	next = pmd_addr_end(addr, end);
+	if (pmd_trans_huge(*pmd)) {
+		ptl = pmd_trans_huge_lock(pmd, vma);
+		if (!ptl)
+			return 0;
+
+		if (is_huge_zero_pmd(*pmd))
+			goto huge_unlock;
+
+		page = pmd_page(*pmd);
+		if (page_mapcount(page) > 1)
+			goto huge_unlock;
+
+		if (next - addr != HPAGE_PMD_SIZE) {
+			int err;
+
+			get_page(page);
+			spin_unlock(ptl);
+			lock_page(page);
+			err = split_huge_page(page);
+			unlock_page(page);
+			put_page(page);
+			if (!err)
+				goto regular_page;
+			return 0;
+		}
+
+		pmdp_test_and_clear_young(vma, addr, pmd);
+		deactivate_page(page);
+huge_unlock:
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+regular_page:
+	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
+		ptent = *pte;
+
+		if (pte_none(ptent))
+			continue;
+
+		if (!pte_present(ptent))
+			continue;
+
+		page = vm_normal_page(vma, addr, ptent);
+		if (!page)
+			continue;
+
+		if (page_mapcount(page) > 1)
+			continue;
+
+		ptep_test_and_clear_young(vma, addr, pte);
+		deactivate_page(page);
+	}
+
+	pte_unmap_unlock(orig_pte, ptl);
+	cond_resched();
+
+	return 0;
+}
+
+static void madvise_cold_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end)
+{
+	struct mm_walk cool_walk = {
+		.pmd_entry = madvise_cold_pte_range,
+		.mm = vma->vm_mm,
+	};
+
+	tlb_start_vma(tlb, vma);
+	walk_page_range(addr, end, &cool_walk);
+	tlb_end_vma(tlb, vma);
+}
+
+static long madvise_cold(struct vm_area_struct *vma,
+			struct vm_area_struct **prev,
+			unsigned long start_addr, unsigned long end_addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_gather tlb;
+
+	*prev = vma;
+	if (!can_madv_lru_vma(vma))
+		return -EINVAL;
+
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
+	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
+	tlb_finish_mmu(&tlb, start_addr, end_addr);
+
+	return 0;
+}
+
 static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 
@@ -519,7 +627,7 @@  static long madvise_dontneed_free(struct vm_area_struct *vma,
 				  int behavior)
 {
 	*prev = vma;
-	if (!can_madv_dontneed_vma(vma))
+	if (!can_madv_lru_vma(vma))
 		return -EINVAL;
 
 	if (!userfaultfd_remove(vma, start, end)) {
@@ -541,7 +649,7 @@  static long madvise_dontneed_free(struct vm_area_struct *vma,
 			 */
 			return -ENOMEM;
 		}
-		if (!can_madv_dontneed_vma(vma))
+		if (!can_madv_lru_vma(vma))
 			return -EINVAL;
 		if (end > vma->vm_end) {
 			/*
@@ -695,6 +803,8 @@  madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_remove(vma, prev, start, end);
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
+	case MADV_COLD:
+		return madvise_cold(vma, prev, start, end);
 	case MADV_FREE:
 	case MADV_DONTNEED:
 		return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -716,6 +826,7 @@  madvise_behavior_valid(int behavior)
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
 	case MADV_FREE:
+	case MADV_COLD:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5a58778c91d4..f73d5f5145f0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -515,7 +515,7 @@  bool __oom_reap_task_mm(struct mm_struct *mm)
 	set_bit(MMF_UNSTABLE, &mm->flags);
 
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
-		if (!can_madv_dontneed_vma(vma))
+		if (!can_madv_lru_vma(vma))
 			continue;
 
 		/*
diff --git a/mm/swap.c b/mm/swap.c
index 7b079976cbec..cebedab15aa2 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -47,6 +47,7 @@  int page_cluster;
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
 #ifdef CONFIG_SMP
 static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
@@ -538,6 +539,23 @@  static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 	update_page_reclaim_stat(lruvec, file, 0);
 }
 
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+			    void *arg)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		int file = page_is_file_cache(page);
+		int lru = page_lru_base_type(page);
+
+		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+		ClearPageActive(page);
+		ClearPageReferenced(page);
+		clear_page_young(page);
+		add_page_to_lru_list(page, lruvec, lru);
+
+		__count_vm_events(PGDEACTIVATE, hpage_nr_pages(page));
+		update_page_reclaim_stat(lruvec, file, 0);
+	}
+}
 
 static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
 			    void *arg)
@@ -590,6 +608,10 @@  void lru_add_drain_cpu(int cpu)
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
 
+	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+	if (pagevec_count(pvec))
+		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+
 	pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
@@ -623,6 +645,26 @@  void deactivate_file_page(struct page *page)
 	}
 }
 
+/*
+ * deactivate_page - deactivate a page
+ * @page: page to deactivate
+ *
+ * deactivate_page() moves @page to the inactive list if @page was on the active
+ * list and was not an unevictable page.  This is done to accelerate the reclaim
+ * of @page.
+ */
+void deactivate_page(struct page *page)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+		get_page(page);
+		if (!pagevec_add(pvec, page) || PageCompound(page))
+			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		put_cpu_var(lru_deactivate_pvecs);
+	}
+}
+
 /**
  * mark_page_lazyfree - make an anon page lazyfree
  * @page: page to deactivate
@@ -687,6 +729,7 @@  void lru_add_drain_all(void)
 		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
 		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
 		    need_activate_page_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);