diff mbox series

[RFC,v1] mm: add page preemption

Message ID 20191020134304.11700-1-hdanton@sina.com (mailing list archive)
State New, archived
Headers show
Series [RFC,v1] mm: add page preemption | expand

Commit Message

Hillf Danton Oct. 20, 2019, 1:43 p.m. UTC
Unlike cpu preemption, page preemption would have been a two-edge
option for quite a while. It is added by preventing tasks from
reclaiming as many lru pages as possible from other tasks of
higher priorities.

Who need pp?
Users who want to manage/control jitters in lru pages under memory
pressure. Way in parallel to scheduling with task's memory footprint
taken into account, pp makes task prio a part of page reclaiming.

[Off topic: prio can also be defined and added in memory controller
and then plays a role in memcg reclaiming, for example check prio
when selecting victim memcg.]

First on the page side, page->prio that is used to mirror the prio
of page owner tasks is added, and a couple of helpers for setting,
copying and comparing page->prio to help to add pages to lru.

Then on the reclaimer side, pgdat->kswapd_prio is added to mirror
the prio of tasks that wake up the kthread, and it is updated
every time kswapd raises its reclaiming priority.

Finally priorities on both sides are compared when deactivating lru
pages, and skip page if it is higher on prio.

V1 is based on next-20191018.

Changes since v0
- s/page->nice/page->prio/
- drop the role of kswapd's reclaiming prioirty in prio comparison
- add pgdat->kswapd_prio

Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Hillf Danton <hdanton@sina.com>
---

--

Comments

Matthew Wilcox Oct. 20, 2019, 3:24 p.m. UTC | #1
On Sun, Oct 20, 2019 at 09:43:04PM +0800, Hillf Danton wrote:
> First on the page side, page->prio that is used to mirror the prio
> of page owner tasks is added, and a couple of helpers for setting,
> copying and comparing page->prio to help to add pages to lru.

Um, no.  struct page is 64 bytes and shall remain so without a very very
good reason.

> @@ -197,6 +198,10 @@ struct page {
>  	/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
>  	atomic_t _refcount;
>  
> +#ifdef CONFIG_PAGE_PREEMPTION
> +	int prio; /* mirror page owner task->prio */
> +#endif
> +
>  #ifdef CONFIG_MEMCG
>  	struct mem_cgroup *mem_cgroup;
>  #endif
Michal Hocko Oct. 21, 2019, 12:27 p.m. UTC | #2
On Sun 20-10-19 21:43:04, Hillf Danton wrote:
> 
> Unlike cpu preemption, page preemption would have been a two-edge
> option for quite a while. It is added by preventing tasks from
> reclaiming as many lru pages as possible from other tasks of
> higher priorities.

This really begs for more explanation. I would start with what page
preemption actually is. Why do we care and which workloads would benefit
and how much. And last but not least why the existing infrastructure
doesn't help (e.g. if you have clearly defined workloads with different
memory consumption requirements then why don't you use memory cgroups to
reflect the priority).
 
> Who need pp?
> Users who want to manage/control jitters in lru pages under memory
> pressure. Way in parallel to scheduling with task's memory footprint
> taken into account, pp makes task prio a part of page reclaiming.

How do you assign priority to generally shared pages?

> [Off topic: prio can also be defined and added in memory controller
> and then plays a role in memcg reclaiming, for example check prio
> when selecting victim memcg.]
> 
> First on the page side, page->prio that is used to mirror the prio
> of page owner tasks is added, and a couple of helpers for setting,
> copying and comparing page->prio to help to add pages to lru.
> 
> Then on the reclaimer side, pgdat->kswapd_prio is added to mirror
> the prio of tasks that wake up the kthread, and it is updated
> every time kswapd raises its reclaiming priority.

This sounds like a very bad design to me. You essentially hand over
to a completely detached context while you want to handle priority
inversion problems (or at least that is what I think you want).

> Finally priorities on both sides are compared when deactivating lru
> pages, and skip page if it is higher on prio.
> 
> V1 is based on next-20191018.
> 
> Changes since v0
> - s/page->nice/page->prio/
> - drop the role of kswapd's reclaiming prioirty in prio comparison
> - add pgdat->kswapd_prio
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Jan Kara <jack@suse.cz>
> Signed-off-by: Hillf Danton <hdanton@sina.com>
> ---
> 
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -281,6 +281,15 @@ config VIRT_TO_BUS
>  	  should probably not select this.
>  
>  
> +config PAGE_PREEMPTION
> +	bool
> +	help
> +	  This makes a task unable to reclaim as many lru pages as
> +	  possible from other tasks of higher priorities.
> +
> +	  Say N if unsure.
> +
> +
>  config MMU_NOTIFIER
>  	bool
>  	select SRCU
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -14,6 +14,7 @@
>  #include <linux/uprobes.h>
>  #include <linux/page-flags-layout.h>
>  #include <linux/workqueue.h>
> +#include <linux/sched/prio.h>
>  
>  #include <asm/mmu.h>
>  
> @@ -197,6 +198,10 @@ struct page {
>  	/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
>  	atomic_t _refcount;
>  
> +#ifdef CONFIG_PAGE_PREEMPTION
> +	int prio; /* mirror page owner task->prio */
> +#endif
> +
>  #ifdef CONFIG_MEMCG
>  	struct mem_cgroup *mem_cgroup;
>  #endif
> @@ -232,6 +237,54 @@ struct page {
>  #define page_private(page)		((page)->private)
>  #define set_page_private(page, v)	((page)->private = (v))
>  
> +#ifdef CONFIG_PAGE_PREEMPTION
> +static inline bool page_prio_valid(struct page *p)
> +{
> +	return p->prio > MAX_PRIO;
> +}
> +
> +static inline void set_page_prio(struct page *p, int task_prio)
> +{
> +	/* store page prio low enough to help khugepaged add lru pages */
> +	if (!page_prio_valid(p))
> +		p->prio = task_prio + MAX_PRIO + 1;
> +}
> +
> +static inline void copy_page_prio(struct page *to, struct page *from)
> +{
> +	to->prio = from->prio;
> +}
> +
> +static inline int page_prio(struct page *p)
> +{
> +	return p->prio - MAX_PRIO - 1;
> +}
> +
> +static inline bool page_prio_higher(struct page *p, int prio)
> +{
> +	return page_prio(p) < prio;
> +}
> +#else
> +static inline bool page_prio_valid(struct page *p)
> +{
> +	return true;
> +}
> +static inline void set_page_prio(struct page *p, int task_prio)
> +{
> +}
> +static inline void copy_page_prio(struct page *to, struct page *from)
> +{
> +}
> +static inline int page_prio(struct page *p)
> +{
> +	return MAX_PRIO + 1;
> +}
> +static inline bool page_prio_higher(struct page *p, int prio)
> +{
> +	return false;
> +}
> +#endif
> +
>  struct page_frag_cache {
>  	void * va;
>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -402,6 +402,7 @@ static void __lru_cache_add(struct page
>  	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
>  
>  	get_page(page);
> +	set_page_prio(page, current->prio);
>  	if (!pagevec_add(pvec, page) || PageCompound(page))
>  		__pagevec_lru_add(pvec);
>  	put_cpu_var(lru_add_pvec);
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -647,6 +647,7 @@ void migrate_page_states(struct page *ne
>  		end_page_writeback(newpage);
>  
>  	copy_page_owner(page, newpage);
> +	copy_page_prio(newpage, page);
>  
>  	mem_cgroup_migrate(page, newpage);
>  }
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1575,6 +1575,7 @@ static int shmem_replace_page(struct pag
>  
>  	get_page(newpage);
>  	copy_highpage(newpage, oldpage);
> +	copy_page_prio(newpage, oldpage);
>  	flush_dcache_page(newpage);
>  
>  	__SetPageLocked(newpage);
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -671,6 +671,7 @@ static void __collapse_huge_page_copy(pt
>  			}
>  		} else {
>  			src_page = pte_page(pteval);
> +			copy_page_prio(page, src_page);
>  			copy_user_highpage(page, src_page, address, vma);
>  			VM_BUG_ON_PAGE(page_mapcount(src_page) != 1, src_page);
>  			release_pte_page(src_page);
> @@ -1723,6 +1724,7 @@ xa_unlocked:
>  				clear_highpage(new_page + (index % HPAGE_PMD_NR));
>  				index++;
>  			}
> +			copy_page_prio(new_page, page);
>  			copy_highpage(new_page + (page->index % HPAGE_PMD_NR),
>  					page);
>  			list_del(&page->lru);
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -738,6 +738,9 @@ typedef struct pglist_data {
>  	int kswapd_order;
>  	enum zone_type kswapd_classzone_idx;
>  
> +#ifdef CONFIG_PAGE_PREEMPTION
> +	int kswapd_prio; /* mirror task->prio waking up kswapd */
> +#endif
>  	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
>  
>  #ifdef CONFIG_COMPACTION
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -110,6 +110,10 @@ struct scan_control {
>  	/* The highest zone to isolate pages for reclaim from */
>  	s8 reclaim_idx;
>  
> +#ifdef CONFIG_PAGE_PREEMPTION
> +	s8 __pad;
> +	int reclaimer_prio; /* mirror task->prio */
> +#endif
>  	/* This context's GFP mask */
>  	gfp_t gfp_mask;
>  
> @@ -1710,11 +1714,18 @@ static unsigned long isolate_lru_pages(u
>  		total_scan += nr_pages;
>  
>  		if (page_zonenum(page) > sc->reclaim_idx) {
> +next_page:
>  			list_move(&page->lru, &pages_skipped);
>  			nr_skipped[page_zonenum(page)] += nr_pages;
>  			continue;
>  		}
>  
> +		if (IS_ENABLED(CONFIG_PAGE_PREEMPTION) &&
> +		    is_active_lru(lru) &&
> +		    global_reclaim(sc) &&
> +		    page_prio_higher(page, sc->reclaimer_prio))
> +			goto next_page;
> +
>  		/*
>  		 * Do not count skipped pages because that makes the function
>  		 * return with no isolated pages if the LRU mostly contains
> @@ -3260,6 +3271,9 @@ unsigned long try_to_free_pages(struct z
>  	unsigned long nr_reclaimed;
>  	struct scan_control sc = {
>  		.nr_to_reclaim = SWAP_CLUSTER_MAX,
> +#ifdef CONFIG_PAGE_PREEMPTION
> +		.reclaimer_prio = current->prio,
> +#endif
>  		.gfp_mask = current_gfp_context(gfp_mask),
>  		.reclaim_idx = gfp_zone(gfp_mask),
>  		.order = order,
> @@ -3586,6 +3600,9 @@ static int balance_pgdat(pg_data_t *pgda
>  	bool boosted;
>  	struct zone *zone;
>  	struct scan_control sc = {
> +#ifdef CONFIG_PAGE_PREEMPTION
> +		.reclaimer_prio = pgdat->kswapd_prio,
> +#endif
>  		.gfp_mask = GFP_KERNEL,
>  		.order = order,
>  		.may_unmap = 1,
> @@ -3739,6 +3756,9 @@ restart:
>  		if (nr_boost_reclaim && !nr_reclaimed)
>  			break;
>  
> +		if (IS_ENABLED(CONFIG_PAGE_PREEMPTION))
> +			sc.reclaimer_prio = pgdat->kswapd_prio;
> +
>  		if (raise_priority || !nr_reclaimed)
>  			sc.priority--;
>  	} while (sc.priority >= 1);
> @@ -3831,6 +3851,10 @@ static void kswapd_try_to_sleep(pg_data_
>  		 */
>  		wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
>  
> +		if (IS_ENABLED(CONFIG_PAGE_PREEMPTION)) {
> +			pgdat->kswapd_prio = MAX_PRIO + 1;
> +			smp_wmb();
> +		}
>  		remaining = schedule_timeout(HZ/10);
>  
>  		/*
> @@ -3865,8 +3889,13 @@ static void kswapd_try_to_sleep(pg_data_
>  		 */
>  		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
>  
> -		if (!kthread_should_stop())
> +		if (!kthread_should_stop()) {
> +			if (IS_ENABLED(CONFIG_PAGE_PREEMPTION)) {
> +				pgdat->kswapd_prio = MAX_PRIO + 1;
> +				smp_wmb();
> +			}
>  			schedule();
> +		}
>  
>  		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
>  	} else {
> @@ -3917,6 +3946,8 @@ static int kswapd(void *p)
>  	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
>  	set_freezable();
>  
> +	if (IS_ENABLED(CONFIG_PAGE_PREEMPTION))
> +		pgdat->kswapd_prio = MAX_PRIO + 1;
>  	pgdat->kswapd_order = 0;
>  	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
>  	for ( ; ; ) {
> @@ -3985,6 +4016,17 @@ void wakeup_kswapd(struct zone *zone, gf
>  		return;
>  	pgdat = zone->zone_pgdat;
>  
> +	if (IS_ENABLED(CONFIG_PAGE_PREEMPTION)) {
> +		int prio = current->prio;
> +
> +		if (pgdat->kswapd_prio < prio) {
> +			smp_rmb();
> +			return;
> +		}
> +		pgdat->kswapd_prio = prio;
> +		smp_wmb();
> +	}
> +
>  	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
>  		pgdat->kswapd_classzone_idx = classzone_idx;
>  	else
> --
>
Hillf Danton Oct. 22, 2019, 1:41 a.m. UTC | #3
On Sun, 20 Oct 2019 08:24:00 -0700 Matthew Wilcox wrote:
> On Sun, Oct 20, 2019 at 09:43:04PM +0800, Hillf Danton wrote:
> > First on the page side, page->prio that is used to mirror the prio
> > of page owner tasks is added, and a couple of helpers for setting,
> > copying and comparing page->prio to help to add pages to lru.
> 
> Um, no.  struct page is 64 bytes and shall remain so without a very very
> good reason.

Queued on todo list. Thanks:)

Hillf
Hillf Danton Oct. 22, 2019, 12:14 p.m. UTC | #4
On Mon, 21 Oct 2019 14:27:28 +0200 Michal Hocko wrote:
> 
> On Sun 20-10-19 21:43:04, Hillf Danton wrote:
> > 
> > Unlike cpu preemption, page preemption would have been a two-edge
> > option for quite a while. It is added by preventing tasks from
> > reclaiming as many lru pages as possible from other tasks of
> > higher priorities.
> 
> This really begs for more explanation.
> I would start with what page preemption actually is.

Page preemption is the system-provided facility that makes a task
able to preempt other tasks of lower priorities for page in the
same manner as for cpu.

> Why do we care and which workloads would benefit and how much.

Page preemption, disabled by default, should be turned on by those
who wish that the performance of their workloads can survive memory
pressure to certain extent.

The number of pp users is supposed near the people who change the
nice value of their apps either to -1 or higher at least once a week,
less than vi users among UK's undergraduates.

> And last but not least why the existing infrastructure doesn't help
> (e.g. if you have clearly defined workloads with different
> memory consumption requirements then why don't you use memory cgroups to
> reflect the priority).

Good question:)

Though pp is implemented by preventing any task from reclaiming as many
pages as possible from other tasks that are higher on priority, it is
trying to introduce prio into page reclaiming, to add a feature.

Page and memcg are different objects after all; pp is being added at
the page granularity. It should be an option available in environments
without memcg enabled.

What is way different from the protections offered by memory cgroup
is that pages protected by memcg:min/low can't be reclaimed regardless
of memory pressure. Such guarantee is not available under pp as it only
suggests an extra factor to consider on deactivating lru pages.

Adding prio in memory controller is another good topic, already queued
after pp and memcg lru on todo list.

> > Who need pp?
> > Users who want to manage/control jitters in lru pages under memory
> > pressure. Way in parallel to scheduling with task's memory footprint
> > taken into account, pp makes task prio a part of page reclaiming.
> 
> How do you assign priority to generally shared pages?

It is solved by setting page prio only when they are added to lru.
Prio will never change thereafter.
There is helper copy_page_prio(new_page, old_page) for scenarios like
migrating pages.

> > [Off topic: prio can also be defined and added in memory controller
> > and then plays a role in memcg reclaiming, for example check prio
> > when selecting victim memcg.]
> > 
> > First on the page side, page->prio that is used to mirror the prio
> > of page owner tasks is added, and a couple of helpers for setting,
> > copying and comparing page->prio to help to add pages to lru.
> > 
> > Then on the reclaimer side, pgdat->kswapd_prio is added to mirror
> > the prio of tasks that wake up the kthread, and it is updated
> > every time kswapd raises its reclaiming priority.
> 
> This sounds like a very bad design to me. You essentially hand over
> to a completely detached context while you want to handle priority
> inversion problems (or at least that is what I think you want).

What was added on the reclaimer side is

1, kswapd sets pgdat->kswapd_prio, the switch between page reclaimer
   and allocator in terms of prio, to the lowest value before taking
   a nap.

2, any allocator is able to wake up the reclaimer because of the
   lowest prio, and it starts reclaiming pages using the waker's prio.

3, allocator comes while kswapd is active, its prio is checked and
   no-op if kswapd is higher on prio; otherwise switch is updated
   with the higher prio.

4, every time kswapd raises sc.priority that starts with DEF_PRIORITY,
   it is checked if there is pending update of switch; and kswapd's
   prio steps up if there is a pending one, thus its prio never steps
   down. Nor prio inversion. 

5, goto 1 when kswapd finishes its work.

Thanks
Hillf
Michal Hocko Oct. 22, 2019, 12:42 p.m. UTC | #5
On Tue 22-10-19 20:14:39, Hillf Danton wrote:
> 
> On Mon, 21 Oct 2019 14:27:28 +0200 Michal Hocko wrote:
[...]
> > Why do we care and which workloads would benefit and how much.
> 
> Page preemption, disabled by default, should be turned on by those
> who wish that the performance of their workloads can survive memory
> pressure to certain extent.

I am sorry but this doesn't say anything to me. How come not all
workloads would fit that description?
 
> The number of pp users is supposed near the people who change the
> nice value of their apps either to -1 or higher at least once a week,
> less than vi users among UK's undergraduates.
> 
> > And last but not least why the existing infrastructure doesn't help
> > (e.g. if you have clearly defined workloads with different
> > memory consumption requirements then why don't you use memory cgroups to
> > reflect the priority).
> 
> Good question:)
> 
> Though pp is implemented by preventing any task from reclaiming as many
> pages as possible from other tasks that are higher on priority, it is
> trying to introduce prio into page reclaiming, to add a feature.
> 
> Page and memcg are different objects after all; pp is being added at
> the page granularity. It should be an option available in environments
> without memcg enabled.

So do you actually want to establish LRUs per priority? Why using memcgs
is not an option? This is the main facility to partition reclaimable
memory in the first place. You should really focus on explaining on why
a much more fine grained control is needed much more thoroughly.

> What is way different from the protections offered by memory cgroup
> is that pages protected by memcg:min/low can't be reclaimed regardless
> of memory pressure. Such guarantee is not available under pp as it only
> suggests an extra factor to consider on deactivating lru pages.

Well, low limit can be breached if there is no eliglible memcg to be
reclaimed. That means that you can shape some sort of priority by
setting the low limit already.

[...]

> What was added on the reclaimer side is
> 
> 1, kswapd sets pgdat->kswapd_prio, the switch between page reclaimer
>    and allocator in terms of prio, to the lowest value before taking
>    a nap.
> 
> 2, any allocator is able to wake up the reclaimer because of the
>    lowest prio, and it starts reclaiming pages using the waker's prio.
> 
> 3, allocator comes while kswapd is active, its prio is checked and
>    no-op if kswapd is higher on prio; otherwise switch is updated
>    with the higher prio.
> 
> 4, every time kswapd raises sc.priority that starts with DEF_PRIORITY,
>    it is checked if there is pending update of switch; and kswapd's
>    prio steps up if there is a pending one, thus its prio never steps
>    down. Nor prio inversion. 
> 
> 5, goto 1 when kswapd finishes its work.

What about the direct reclaim? What if pages of a lower priority are
hard to reclaim? Do you want a process of a higher priority stall more
just because it has to wait for those lower priority pages?
Hillf Danton Oct. 22, 2019, 2:28 p.m. UTC | #6
On Tue, 22 Oct 2019 14:42:41 +0200 Michal Hocko wrote:
> 
> On Tue 22-10-19 20:14:39, Hillf Danton wrote:
> > 
> > On Mon, 21 Oct 2019 14:27:28 +0200 Michal Hocko wrote:
> [...]
> > > Why do we care and which workloads would benefit and how much.
> > 
> > Page preemption, disabled by default, should be turned on by those
> > who wish that the performance of their workloads can survive memory
> > pressure to certain extent.
> 
> I am sorry but this doesn't say anything to me. How come not all
> workloads would fit that description?

That means pp plays a role when kswapd becomes active, and it may
prevent too much jitters in active lru pages.

> > The number of pp users is supposed near the people who change the
> > nice value of their apps either to -1 or higher at least once a week,
> > less than vi users among UK's undergraduates.
> > 
> > > And last but not least why the existing infrastructure doesn't help
> > > (e.g. if you have clearly defined workloads with different
> > > memory consumption requirements then why don't you use memory cgroups to
> > > reflect the priority).
> > 
> > Good question:)
> > 
> > Though pp is implemented by preventing any task from reclaiming as many
> > pages as possible from other tasks that are higher on priority, it is
> > trying to introduce prio into page reclaiming, to add a feature.
> > 
> > Page and memcg are different objects after all; pp is being added at
> > the page granularity. It should be an option available in environments
> > without memcg enabled.
> 
> So do you actually want to establish LRUs per priority?

No, no change other than the prio for every lru page was added. LRU per prio
is too much to implement.

> Why using memcgs is not an option?

I have plan to add prio in memcg. As you see, I sent a rfc before v0 with
nice added in memcg, and realised a couple days ago that its dependence on
soft limit reclaim is not acceptable.

But we can't do that without determining how to define memcg's prio.
What is in mind now is the highest (or lowest) prio of tasks in a memcg
with a knob offered to userspace.

If you like, I want to have a talk about it sometime later.

> This is the main facility to partition reclaimable
> memory in the first place. You should really focus on explaining on why
> a much more fine grained control is needed much more thoroughly.
> 
> > What is way different from the protections offered by memory cgroup
> > is that pages protected by memcg:min/low can't be reclaimed regardless
> > of memory pressure. Such guarantee is not available under pp as it only
> > suggests an extra factor to consider on deactivating lru pages.
> 
> Well, low limit can be breached if there is no eliglible memcg to be
> reclaimed. That means that you can shape some sort of priority by
> setting the low limit already.
> 
> [...]
> 
> > What was added on the reclaimer side is
> > 
> > 1, kswapd sets pgdat->kswapd_prio, the switch between page reclaimer
> >    and allocator in terms of prio, to the lowest value before taking
> >    a nap.
> > 
> > 2, any allocator is able to wake up the reclaimer because of the
> >    lowest prio, and it starts reclaiming pages using the waker's prio.
> > 
> > 3, allocator comes while kswapd is active, its prio is checked and
> >    no-op if kswapd is higher on prio; otherwise switch is updated
> >    with the higher prio.
> > 
> > 4, every time kswapd raises sc.priority that starts with DEF_PRIORITY,
> >    it is checked if there is pending update of switch; and kswapd's
> >    prio steps up if there is a pending one, thus its prio never steps
> >    down. Nor prio inversion. 
> > 
> > 5, goto 1 when kswapd finishes its work.
> 
> What about the direct reclaim?

Their prio will not change before reclaiming finishes, so leave it be.

> What if pages of a lower priority are
> hard to reclaim? Do you want a process of a higher priority stall more
> just because it has to wait for those lower priority pages?

The problems above are not introduced by pp, let Mr. Kswapd take care of
them.

(It is 22:23 local time, lets continue after a 7-hour sleep. Good night.)

Thanks
Hillf
Michal Hocko Oct. 23, 2019, 8:17 a.m. UTC | #7
On Tue 22-10-19 22:28:02, Hillf Danton wrote:
> 
> On Tue, 22 Oct 2019 14:42:41 +0200 Michal Hocko wrote:
> > 
> > On Tue 22-10-19 20:14:39, Hillf Danton wrote:
> > > 
> > > On Mon, 21 Oct 2019 14:27:28 +0200 Michal Hocko wrote:
> > [...]
> > > > Why do we care and which workloads would benefit and how much.
> > > 
> > > Page preemption, disabled by default, should be turned on by those
> > > who wish that the performance of their workloads can survive memory
> > > pressure to certain extent.
> > 
> > I am sorry but this doesn't say anything to me. How come not all
> > workloads would fit that description?
> 
> That means pp plays a role when kswapd becomes active, and it may
> prevent too much jitters in active lru pages.

This is still too vague to be useful in any way.

> > > The number of pp users is supposed near the people who change the
> > > nice value of their apps either to -1 or higher at least once a week,
> > > less than vi users among UK's undergraduates.
> > > 
> > > > And last but not least why the existing infrastructure doesn't help
> > > > (e.g. if you have clearly defined workloads with different
> > > > memory consumption requirements then why don't you use memory cgroups to
> > > > reflect the priority).
> > > 
> > > Good question:)
> > > 
> > > Though pp is implemented by preventing any task from reclaiming as many
> > > pages as possible from other tasks that are higher on priority, it is
> > > trying to introduce prio into page reclaiming, to add a feature.
> > > 
> > > Page and memcg are different objects after all; pp is being added at
> > > the page granularity. It should be an option available in environments
> > > without memcg enabled.
> > 
> > So do you actually want to establish LRUs per priority?
> 
> No, no change other than the prio for every lru page was added. LRU per prio
> is too much to implement.

Well, considering that per page priority is a no go as already pointed
out by Willy then you do not have other choice right?

> > Why using memcgs is not an option?
> 
> I have plan to add prio in memcg. As you see, I sent a rfc before v0 with
> nice added in memcg, and realised a couple days ago that its dependence on
> soft limit reclaim is not acceptable.
> 
> But we can't do that without determining how to define memcg's prio.
> What is in mind now is the highest (or lowest) prio of tasks in a memcg
> with a knob offered to userspace.
> 
> If you like, I want to have a talk about it sometime later.

This doesn't really answer my question. Why cannot you use memcgs as
they are now. Why exactly do you need a fixed priority?

> > This is the main facility to partition reclaimable
> > memory in the first place. You should really focus on explaining on why
> > a much more fine grained control is needed much more thoroughly.
> > 
> > > What is way different from the protections offered by memory cgroup
> > > is that pages protected by memcg:min/low can't be reclaimed regardless
> > > of memory pressure. Such guarantee is not available under pp as it only
> > > suggests an extra factor to consider on deactivating lru pages.
> > 
> > Well, low limit can be breached if there is no eliglible memcg to be
> > reclaimed. That means that you can shape some sort of priority by
> > setting the low limit already.
> > 
> > [...]
> > 
> > > What was added on the reclaimer side is
> > > 
> > > 1, kswapd sets pgdat->kswapd_prio, the switch between page reclaimer
> > >    and allocator in terms of prio, to the lowest value before taking
> > >    a nap.
> > > 
> > > 2, any allocator is able to wake up the reclaimer because of the
> > >    lowest prio, and it starts reclaiming pages using the waker's prio.
> > > 
> > > 3, allocator comes while kswapd is active, its prio is checked and
> > >    no-op if kswapd is higher on prio; otherwise switch is updated
> > >    with the higher prio.
> > > 
> > > 4, every time kswapd raises sc.priority that starts with DEF_PRIORITY,
> > >    it is checked if there is pending update of switch; and kswapd's
> > >    prio steps up if there is a pending one, thus its prio never steps
> > >    down. Nor prio inversion. 
> > > 
> > > 5, goto 1 when kswapd finishes its work.
> > 
> > What about the direct reclaim?
> 
> Their prio will not change before reclaiming finishes, so leave it be.

This doesn't answer my question.

> > What if pages of a lower priority are
> > hard to reclaim? Do you want a process of a higher priority stall more
> > just because it has to wait for those lower priority pages?
> 
> The problems above are not introduced by pp, let Mr. Kswapd take care of
> them.

No, this is not an answer.
Hillf Danton Oct. 23, 2019, 11:53 a.m. UTC | #8
On Wed, 23 Oct 2019 10:17:29 +0200 Michal Hocko wrote:
> 
> On Tue 22-10-19 22:28:02, Hillf Danton wrote:
> > 
> > On Tue, 22 Oct 2019 14:42:41 +0200 Michal Hocko wrote:
> > > 
> > > On Tue 22-10-19 20:14:39, Hillf Danton wrote:
> > > > 
> > > > On Mon, 21 Oct 2019 14:27:28 +0200 Michal Hocko wrote:
> > > [...]
> > > > > Why do we care and which workloads would benefit and how much.
> > > > 
> > > > Page preemption, disabled by default, should be turned on by those
> > > > who wish that the performance of their workloads can survive memory
> > > > pressure to certain extent.
> > > 
> > > I am sorry but this doesn't say anything to me. How come not all
> > > workloads would fit that description?
> > 
> > That means pp plays a role when kswapd becomes active, and it may
> > prevent too much jitters in active lru pages.
> 
> This is still too vague to be useful in any way.

Page preemption is designed to function only under memory pressure by
suggesting kswapd to skip deactivating some pages based on prio comparison.
No page will be skipped without difference found in prio by design.
That said, no workload can be picked out before updating prio, so let
users who know that their workloads are sensitive to jitters in lru pages
chage the nice.
We are simply adding the pp feature; users are responsible for turning pp
on and changing nice if they feel necessary.

> > > > The number of pp users is supposed near the people who change the
> > > > nice value of their apps either to -1 or higher at least once a week,
> > > > less than vi users among UK's undergraduates.
> > > > 
> > > > > And last but not least why the existing infrastructure doesn't help
> > > > > (e.g. if you have clearly defined workloads with different
> > > > > memory consumption requirements then why don't you use memory cgroups to
> > > > > reflect the priority).
> > > > 
> > > > Good question:)
> > > > 
> > > > Though pp is implemented by preventing any task from reclaiming as many
> > > > pages as possible from other tasks that are higher on priority, it is
> > > > trying to introduce prio into page reclaiming, to add a feature.
> > > > 
> > > > Page and memcg are different objects after all; pp is being added at
> > > > the page granularity. It should be an option available in environments
> > > > without memcg enabled.
> > > 
> > > So do you actually want to establish LRUs per priority?
> > 
> > No, no change other than the prio for every lru page was added. LRU per prio
> > is too much to implement.
> 
> Well, considering that per page priority is a no go as already pointed
> out by Willy then you do not have other choice right?

No need to seek extra choice because of the prio introduced to reclaiming as
no one is hurt by design without pp enabled and prio updated.

> > > Why using memcgs is not an option?
> > 
> > I have plan to add prio in memcg. As you see, I sent a rfc before v0 with
> > nice added in memcg, and realised a couple days ago that its dependence on
> > soft limit reclaim is not acceptable.
> > 
> > But we can't do that without determining how to define memcg's prio.
> > What is in mind now is the highest (or lowest) prio of tasks in a memcg
> > with a knob offered to userspace.
> > 
> > If you like, I want to have a talk about it sometime later.
> 
> This doesn't really answer my question.
> Why cannot you use memcgs as they are now.

No prio provided.

> Why exactly do you need a fixed priority?

Prio comparison in global reclaim is what was added. Because every task has
prio makes that comparison possible.

> > > This is the main facility to partition reclaimable
> > > memory in the first place.

Is every task (pid != 1) contained in memcg? And why?

> > > You should really focus on explaining on why
> > > a much more fine grained control is needed much more thoroughly.

Which do you prefer, cello or fiddle? And why?

> > > > What is way different from the protections offered by memory cgroup
> > > > is that pages protected by memcg:min/low can't be reclaimed regardless
> > > > of memory pressure. Such guarantee is not available under pp as it only
> > > > suggests an extra factor to consider on deactivating lru pages.
> > > 
> > > Well, low limit can be breached if there is no eliglible memcg to be
> > > reclaimed. That means that you can shape some sort of priority by
> > > setting the low limit already.
> > > 
> > > [...]
> > > 
> > > > What was added on the reclaimer side is
> > > > 
> > > > 1, kswapd sets pgdat->kswapd_prio, the switch between page reclaimer
> > > >    and allocator in terms of prio, to the lowest value before taking
> > > >    a nap.
> > > > 
> > > > 2, any allocator is able to wake up the reclaimer because of the
> > > >    lowest prio, and it starts reclaiming pages using the waker's prio.
> > > > 
> > > > 3, allocator comes while kswapd is active, its prio is checked and
> > > >    no-op if kswapd is higher on prio; otherwise switch is updated
> > > >    with the higher prio.
> > > > 
> > > > 4, every time kswapd raises sc.priority that starts with DEF_PRIORITY,
> > > >    it is checked if there is pending update of switch; and kswapd's
> > > >    prio steps up if there is a pending one, thus its prio never steps
> > > >    down. Nor prio inversion. 
> > > > 
> > > > 5, goto 1 when kswapd finishes its work.
> > > 
> > > What about the direct reclaim?
> > 
> > Their prio will not change before reclaiming finishes, so leave it be.
> 
> This doesn't answer my question.

No prio inversion in direct reclaim if you mean that.

> > > What if pages of a lower priority are
> > > hard to reclaim? Do you want a process of a higher priority stall more
> > > just because it has to wait for those lower priority pages?
> > 
> > The problems above are not introduced by pp, let Mr. Kswapd take care of
> > them.
> 
> No, this is not an answer.

Is pp making them worse?

Thanks
Hillf
Michal Hocko Oct. 23, 2019, 12:04 p.m. UTC | #9
On Wed 23-10-19 19:53:50, Hillf Danton wrote:
> 
> On Wed, 23 Oct 2019 10:17:29 +0200 Michal Hocko wrote:
[...]
> > This doesn't really answer my question.
> > Why cannot you use memcgs as they are now.
> 
> No prio provided.
> 
> > Why exactly do you need a fixed priority?
> 
> Prio comparison in global reclaim is what was added. Because every task has
> prio makes that comparison possible.

That still doesn't answer the question because it doesn't explain why is
the priority really necessary. I am sorry but I have more important
things to deal with than asking the same question again and again.
diff mbox series

Patch

--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -281,6 +281,15 @@  config VIRT_TO_BUS
 	  should probably not select this.
 
 
+config PAGE_PREEMPTION
+	bool
+	help
+	  This makes a task unable to reclaim as many lru pages as
+	  possible from other tasks of higher priorities.
+
+	  Say N if unsure.
+
+
 config MMU_NOTIFIER
 	bool
 	select SRCU
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -14,6 +14,7 @@ 
 #include <linux/uprobes.h>
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
+#include <linux/sched/prio.h>
 
 #include <asm/mmu.h>
 
@@ -197,6 +198,10 @@  struct page {
 	/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
 	atomic_t _refcount;
 
+#ifdef CONFIG_PAGE_PREEMPTION
+	int prio; /* mirror page owner task->prio */
+#endif
+
 #ifdef CONFIG_MEMCG
 	struct mem_cgroup *mem_cgroup;
 #endif
@@ -232,6 +237,54 @@  struct page {
 #define page_private(page)		((page)->private)
 #define set_page_private(page, v)	((page)->private = (v))
 
+#ifdef CONFIG_PAGE_PREEMPTION
+static inline bool page_prio_valid(struct page *p)
+{
+	return p->prio > MAX_PRIO;
+}
+
+static inline void set_page_prio(struct page *p, int task_prio)
+{
+	/* store page prio low enough to help khugepaged add lru pages */
+	if (!page_prio_valid(p))
+		p->prio = task_prio + MAX_PRIO + 1;
+}
+
+static inline void copy_page_prio(struct page *to, struct page *from)
+{
+	to->prio = from->prio;
+}
+
+static inline int page_prio(struct page *p)
+{
+	return p->prio - MAX_PRIO - 1;
+}
+
+static inline bool page_prio_higher(struct page *p, int prio)
+{
+	return page_prio(p) < prio;
+}
+#else
+static inline bool page_prio_valid(struct page *p)
+{
+	return true;
+}
+static inline void set_page_prio(struct page *p, int task_prio)
+{
+}
+static inline void copy_page_prio(struct page *to, struct page *from)
+{
+}
+static inline int page_prio(struct page *p)
+{
+	return MAX_PRIO + 1;
+}
+static inline bool page_prio_higher(struct page *p, int prio)
+{
+	return false;
+}
+#endif
+
 struct page_frag_cache {
 	void * va;
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -402,6 +402,7 @@  static void __lru_cache_add(struct page
 	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
 
 	get_page(page);
+	set_page_prio(page, current->prio);
 	if (!pagevec_add(pvec, page) || PageCompound(page))
 		__pagevec_lru_add(pvec);
 	put_cpu_var(lru_add_pvec);
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -647,6 +647,7 @@  void migrate_page_states(struct page *ne
 		end_page_writeback(newpage);
 
 	copy_page_owner(page, newpage);
+	copy_page_prio(newpage, page);
 
 	mem_cgroup_migrate(page, newpage);
 }
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1575,6 +1575,7 @@  static int shmem_replace_page(struct pag
 
 	get_page(newpage);
 	copy_highpage(newpage, oldpage);
+	copy_page_prio(newpage, oldpage);
 	flush_dcache_page(newpage);
 
 	__SetPageLocked(newpage);
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -671,6 +671,7 @@  static void __collapse_huge_page_copy(pt
 			}
 		} else {
 			src_page = pte_page(pteval);
+			copy_page_prio(page, src_page);
 			copy_user_highpage(page, src_page, address, vma);
 			VM_BUG_ON_PAGE(page_mapcount(src_page) != 1, src_page);
 			release_pte_page(src_page);
@@ -1723,6 +1724,7 @@  xa_unlocked:
 				clear_highpage(new_page + (index % HPAGE_PMD_NR));
 				index++;
 			}
+			copy_page_prio(new_page, page);
 			copy_highpage(new_page + (page->index % HPAGE_PMD_NR),
 					page);
 			list_del(&page->lru);
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -738,6 +738,9 @@  typedef struct pglist_data {
 	int kswapd_order;
 	enum zone_type kswapd_classzone_idx;
 
+#ifdef CONFIG_PAGE_PREEMPTION
+	int kswapd_prio; /* mirror task->prio waking up kswapd */
+#endif
 	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
 
 #ifdef CONFIG_COMPACTION
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -110,6 +110,10 @@  struct scan_control {
 	/* The highest zone to isolate pages for reclaim from */
 	s8 reclaim_idx;
 
+#ifdef CONFIG_PAGE_PREEMPTION
+	s8 __pad;
+	int reclaimer_prio; /* mirror task->prio */
+#endif
 	/* This context's GFP mask */
 	gfp_t gfp_mask;
 
@@ -1710,11 +1714,18 @@  static unsigned long isolate_lru_pages(u
 		total_scan += nr_pages;
 
 		if (page_zonenum(page) > sc->reclaim_idx) {
+next_page:
 			list_move(&page->lru, &pages_skipped);
 			nr_skipped[page_zonenum(page)] += nr_pages;
 			continue;
 		}
 
+		if (IS_ENABLED(CONFIG_PAGE_PREEMPTION) &&
+		    is_active_lru(lru) &&
+		    global_reclaim(sc) &&
+		    page_prio_higher(page, sc->reclaimer_prio))
+			goto next_page;
+
 		/*
 		 * Do not count skipped pages because that makes the function
 		 * return with no isolated pages if the LRU mostly contains
@@ -3260,6 +3271,9 @@  unsigned long try_to_free_pages(struct z
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
+#ifdef CONFIG_PAGE_PREEMPTION
+		.reclaimer_prio = current->prio,
+#endif
 		.gfp_mask = current_gfp_context(gfp_mask),
 		.reclaim_idx = gfp_zone(gfp_mask),
 		.order = order,
@@ -3586,6 +3600,9 @@  static int balance_pgdat(pg_data_t *pgda
 	bool boosted;
 	struct zone *zone;
 	struct scan_control sc = {
+#ifdef CONFIG_PAGE_PREEMPTION
+		.reclaimer_prio = pgdat->kswapd_prio,
+#endif
 		.gfp_mask = GFP_KERNEL,
 		.order = order,
 		.may_unmap = 1,
@@ -3739,6 +3756,9 @@  restart:
 		if (nr_boost_reclaim && !nr_reclaimed)
 			break;
 
+		if (IS_ENABLED(CONFIG_PAGE_PREEMPTION))
+			sc.reclaimer_prio = pgdat->kswapd_prio;
+
 		if (raise_priority || !nr_reclaimed)
 			sc.priority--;
 	} while (sc.priority >= 1);
@@ -3831,6 +3851,10 @@  static void kswapd_try_to_sleep(pg_data_
 		 */
 		wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
 
+		if (IS_ENABLED(CONFIG_PAGE_PREEMPTION)) {
+			pgdat->kswapd_prio = MAX_PRIO + 1;
+			smp_wmb();
+		}
 		remaining = schedule_timeout(HZ/10);
 
 		/*
@@ -3865,8 +3889,13 @@  static void kswapd_try_to_sleep(pg_data_
 		 */
 		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
 
-		if (!kthread_should_stop())
+		if (!kthread_should_stop()) {
+			if (IS_ENABLED(CONFIG_PAGE_PREEMPTION)) {
+				pgdat->kswapd_prio = MAX_PRIO + 1;
+				smp_wmb();
+			}
 			schedule();
+		}
 
 		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
 	} else {
@@ -3917,6 +3946,8 @@  static int kswapd(void *p)
 	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
 	set_freezable();
 
+	if (IS_ENABLED(CONFIG_PAGE_PREEMPTION))
+		pgdat->kswapd_prio = MAX_PRIO + 1;
 	pgdat->kswapd_order = 0;
 	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
 	for ( ; ; ) {
@@ -3985,6 +4016,17 @@  void wakeup_kswapd(struct zone *zone, gf
 		return;
 	pgdat = zone->zone_pgdat;
 
+	if (IS_ENABLED(CONFIG_PAGE_PREEMPTION)) {
+		int prio = current->prio;
+
+		if (pgdat->kswapd_prio < prio) {
+			smp_rmb();
+			return;
+		}
+		pgdat->kswapd_prio = prio;
+		smp_wmb();
+	}
+
 	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
 		pgdat->kswapd_classzone_idx = classzone_idx;
 	else