[RFC,v2] mm: add page preemption

Message ID 20191026112808.14268-1-hdanton@sina.com (mailing list archive)
State New, archived

Commit Message

Hillf Danton Oct. 26, 2019, 11:28 a.m. UTC
The cpu preemption feature makes a task able to preempt other tasks
of lower priorities for cpu. It has been around for a while.

This work introduces task prio into page reclaiming in order to add
the page preemption feature that makes a task able to preempt other
tasks of lower priorities for page.

Under pp, no page is reclaimed on behalf of a task of lower priority.
It is a double-edged feature that takes effect only under memory
pressure, laying a barrier against pages flowing to lower prio. Users
tune it with, for instance, the nice syscall, because no task is
preempted unless priorities differ. It is meant for users who have a
couple of workloads that are sensitive to jitter in lru pages and who
have some difficulty predicting their working set sizes.

Currently lru pages are reclaimed under memory pressure without prio
taken into account; pages can be reclaimed from tasks of lower
priorities on behalf of higher-prio tasks and vice versa.

s/and vice versa/only/ is what pp needs by definition, but that makes
no sense unless prio is introduced in reclaiming. Once it is, we can
simply skip deactivating lru pages based on prio comparison, and the
work is done.
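
The comparison can be sketched in user space as follows. This is a
minimal model of the condition the isolate_lru_pages() hunk of this
patch adds, not kernel code: pp_skip_page() is a hypothetical helper
name, and the sketch assumes the kernel convention that a numerically
smaller prio means a higher priority.

```c
#include <stdbool.h>

/*
 * User-space model, not kernel code. Assumption: as with
 * task_struct->prio, a numerically smaller value is a higher priority.
 */
static bool page_prio_higher(int page_prio, int reclaimer_prio)
{
	return page_prio < reclaimer_prio;
}

/*
 * Hypothetical helper mirroring the patch's condition: a page is
 * skipped only on the active lru, only in global reclaim, and only
 * when its owner outranks the task the reclaim runs on behalf of.
 */
static bool pp_skip_page(bool active_lru, bool global_reclaim,
			 int page_prio, int reclaimer_prio)
{
	return active_lru && global_reclaim &&
	       page_prio_higher(page_prio, reclaimer_prio);
}
```

With these definitions, pp_skip_page(true, true, 100, 120) is true,
while equal priorities or a lower-prio page owner leave the page
eligible for reclaim.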

The introduction consists of two parts. On the page side, we have to
store the page owner task's prio in the page, which needs extra room
the size of an int in struct page.

That room sounds impossible without inflating struct page, and the
problem is not solved but worked around by sharing room with the 32-bit
numa balancing field, see 75980e97dacc ("mm: fold page->_last_nid into
page->flags where possible").
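
How the prio is encoded in that int can be modelled in user space. The
sketch below follows the helpers in this patch (set_page_prio(),
page_prio(), page_prio_valid()), but struct page_model is a
hypothetical stand-in for struct page, MAX_PRIO is hard-coded to 140
as an assumption about the kernel of that time, and the field is
assumed to start out as zero.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PRIO 140	/* assumed value of the kernel's MAX_PRIO */

/* Hypothetical stand-in for the one field of struct page used here. */
struct page_model {
	int prio;	/* assumed to be 0 until an owner is recorded */
};

/* Stored values above MAX_PRIO mean "an owner prio has been recorded". */
static bool page_prio_valid(const struct page_model *p)
{
	return p->prio > MAX_PRIO;
}

/* Record the owner's prio once; later callers do not overwrite it. */
static void set_page_prio(struct page_model *p, int task_prio)
{
	assert(task_prio >= 0 && task_prio < MAX_PRIO);
	if (!page_prio_valid(p))
		p->prio = task_prio + MAX_PRIO + 1;
}

/* Undo the offset to recover the task prio. */
static int page_prio(const struct page_model *p)
{
	return p->prio - MAX_PRIO - 1;
}
```

The offset of MAX_PRIO + 1 keeps every recorded value distinct from
the zero of an untouched field, so a valid task prio of 0 is not
mistaken for "no owner".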

On the reclaimer side, kswapd's prio is set to the prio of its waker
and updated in the same manner as kswapd_order.
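
A single-threaded sketch of that bookkeeping, modelled on the
wakeup_kswapd() and kswapd_try_to_sleep() hunks of this patch. The
names struct pgdat_model, note_waker_prio() and kswapd_going_to_sleep()
are hypothetical, MAX_PRIO is assumed to be 140, and the memory
barriers of the real hunk are left out.

```c
#include <stdbool.h>

#define MAX_PRIO 140			/* assumed value of the kernel's MAX_PRIO */
#define KSWAPD_PRIO_IDLE (MAX_PRIO + 1)	/* lower than any task prio */

/* Hypothetical stand-in for the field added to pglist_data. */
struct pgdat_model {
	int kswapd_prio;
};

/* Before kswapd sleeps, forget the previous waker's prio. */
static void kswapd_going_to_sleep(struct pgdat_model *pgdat)
{
	pgdat->kswapd_prio = KSWAPD_PRIO_IDLE;
}

/*
 * Called on behalf of a waker. Returns false when kswapd is already
 * working for a higher-prio waker (smaller number), in which case the
 * patch returns from wakeup_kswapd() early; otherwise records the prio.
 */
static bool note_waker_prio(struct pgdat_model *pgdat, int waker_prio)
{
	if (pgdat->kswapd_prio < waker_prio)
		return false;
	pgdat->kswapd_prio = waker_prio;
	return true;
}
```

In this model the recorded value only moves toward higher priority
between two sleeps, which is the sense in which it tracks wakers the
way kswapd_order tracks the largest requested order.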

V2 is based on next-20191018.

Changes since v1
- page->prio shares room with _last_cpupid as per Matthew Wilcox

Changes since v0
- s/page->nice/page->prio/
- drop the role of kswapd's reclaiming priority in prio comparison
- add pgdat->kswapd_prio

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Hillf Danton <hdanton@sina.com>
---

--

Comments

Kirill A. Shutemov Oct. 28, 2019, 12:26 p.m. UTC | #1
On Sat, Oct 26, 2019 at 07:28:08PM +0800, Hillf Danton wrote:
> @@ -218,6 +219,9 @@ struct page {
>  
>  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
>  	int _last_cpupid;
> +#else
> +	int prio;
> +#define CONFIG_PAGE_PREEMPTION PP
>  #endif
>  } _struct_page_alignment;
>  

No.

There's a really good reason we try hard to push _last_cpupid into
page flags instead of growing struct page by 4 bytes.

I don't think your feature is worth 0.1% of RAM and a lot of cache misses
that this change would generate.
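
The 0.1% figure is consistent with simple arithmetic if one assumes
4 KiB pages (the message does not state a page size): 4 extra bytes of
struct page for every 4096 bytes of RAM. A minimal sketch, with a
hypothetical helper name:

```c
/*
 * Percentage of RAM taken by extra_bytes of per-page metadata.
 * Both arguments are in bytes.
 */
static double per_page_overhead_percent(double extra_bytes, double page_size)
{
	return 100.0 * extra_bytes / page_size;
}
```

per_page_overhead_percent(4, 4096) evaluates to 0.09765625, i.e.
roughly 0.1% under that assumption.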
Hillf Danton Oct. 28, 2019, 1:55 p.m. UTC | #2
Date: Mon, 28 Oct 2019 15:26:09 +0300 Kirill A. Shutemov wrote:
> 
> On Sat, Oct 26, 2019 at 07:28:08PM +0800, Hillf Danton wrote:
> > @@ -218,6 +219,9 @@ struct page {
> >  
> >  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
> >  	int _last_cpupid;
> > +#else
> > +	int prio;
> > +#define CONFIG_PAGE_PREEMPTION PP
> >  #endif
> >  } _struct_page_alignment;
> >  
> [...]
> a lot of cache misses that this change would generate.

Queued on todo list. Thanks.

Hillf
Johannes Weiner Oct. 28, 2019, 3:56 p.m. UTC | #3
On Sat, Oct 26, 2019 at 07:28:08PM +0800, Hillf Danton wrote:
> 
> The cpu preemption feature makes a task able to preempt other tasks
> of lower priorities for cpu. It has been around for a while.
> 
> This work introduces task prio into page reclaiming in order to add
> the page preemption feature that makes a task able to preempt other
> tasks of lower priorities for page.
> 
> No page will be reclaimed on behalf of tasks of lower priorities

... at which point they'll declare OOM and kill the high-pri task?

Please have a look at the cgroup2 memory.low control. This memory
prioritization problem has already been solved.
Hillf Danton Oct. 29, 2019, 3:04 a.m. UTC | #4
Date: Mon, 28 Oct 2019 11:56:17 -0400 Johannes Weiner wrote:
> 
> On Sat, Oct 26, 2019 at 07:28:08PM +0800, Hillf Danton wrote:
> > 
> > The cpu preemption feature makes a task able to preempt other tasks
> > of lower priorities for cpu. It has been around for a while.
> > 
> > This work introduces task prio into page reclaiming in order to add
> > the page preemption feature that makes a task able to preempt other
> > tasks of lower priorities for page.
> > 
> > No page will be reclaimed on behalf of tasks of lower priorities
> 
> ... at which point they'll declare OOM and kill the high-pri task?

Fixed. Thanks.

--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -164,6 +164,11 @@ static bool oom_unkillable_task(struct t
 		return true;
 	if (p->flags & PF_KTHREAD)
 		return true;
+
+#ifdef CONFIG_PAGE_PREEMPTION
+	if (p->prio < current->prio)
+		return true;
+#endif
 	return false;
 }
 
--


> Please have a look at the cgroup2 memory.low control. This memory
> prioritization problem has already been solved.

Scooter never runs in subway. Fixed.

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -197,8 +198,18 @@ struct page {
 	/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
 	atomic_t _refcount;
 
+	/* 32/56 bytes above */
+
 #ifdef CONFIG_MEMCG
 	struct mem_cgroup *mem_cgroup;
+#else
+#ifdef CONFIG_64BIT
+	union {
+		int prio;
+		void *__pad;
+	};
+#define CONFIG_PAGE_PREEMPTION PP
+#endif
 #endif
 
 	/*
--
Michal Hocko Oct. 29, 2019, 8:41 a.m. UTC | #5
On Sat 26-10-19 19:28:08, Hillf Danton wrote:
> 
> The cpu preemption feature makes a task able to preempt other tasks
> of lower priorities for cpu. It has been around for a while.
> 
> This work introduces task prio into page reclaiming in order to add
> the page preemption feature that makes a task able to preempt other
> tasks of lower priorities for page.
> 
> No page will be reclaimed on behalf of tasks of lower priorities
> under pp, a two-edge feature that functions only under memory
> pressure, laying a barrier to pages flowing to lower prio, and the
> nice syscall is what users need to fiddle with it for instance as
> no task will be preempted without prio shades, if they have a couple
> of workloads that are sensitive to jitters in lru pages, and some
> difficulty predicting their working set sizes.
> 
> Currently lru pages are reclaimed under memory pressure without prio
> taken into account; pages can be reclaimed from tasks of lower
> priorities on behalf of higher-prio tasks and vice versa.
> 
> s/and vice versa/only/ is what we need to make pp by definition, but
> it could not make a sense without prio introduced in reclaiming,
> otherwise we can simply skip deactivating the lru pages based on prio
> comprison, and work is done.
> 
> The introduction consists of two parts. On the page side, we have to
> store the page owner task's prio in page, which needs an extra room the
> size of the int type in the page struct.
> 
> That room sounds impossible without inflating the page struct size, and
> it is not solved but walked around by sharing room with the 32-bit numa
> balancing, see 75980e97dacc ("mm: fold page->_last_nid into page->flags
> where possible").
> 
> On the reclaimer side, kswapd's prio is set with the prio of its waker,
> and updated in the same manner as kswapd_order.
> 
> V2 is based on next-20191018.
> 
> Changes since v1
> - page->prio shares room with _last_cpupid as per Matthew Wilcox
> 
> Changes since v0
> - s/page->nice/page->prio/
> - drop the role of kswapd's reclaiming prioirty in prio comparison
> - add pgdat->kswapd_prio
> 
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Jan Kara <jack@suse.cz>
> Signed-off-by: Hillf Danton <hdanton@sina.com>

As already raised in the review of v1, there is no real-life use case
described in the changelog. I have also expressed concerns about how
such a reclaim would work in the first place (priority inversion,
expensive reclaim etc.). Until that is provided/clarified

Nacked-by: Michal Hocko <mhocko@suse.com>

Please do not ignore review feedback in the future.
Hillf Danton Oct. 29, 2019, 12:30 p.m. UTC | #6
Date: Tue, 29 Oct 2019 09:41:53 +0100 Michal Hocko wrote:
> 
> As already raised in the review of v1. There is no real life usecase
> described in the changelog.

No feature, no user; no user, no workloads.
No linux-6.x released, no 6.x users.
Are you going to be one of the users of linux-6.0?

Even so, I see a use case over at
https://lore.kernel.org/lkml/20191023120452.GN754@dhcp22.suse.cz/

That thread terminated because of preemption, showing us how useful
preemption might be in real life.

> I have also expressed concerns about how
> such a reclaim would work in the first place

Based on what?

> (priority inversion,

No prio inversion will happen after introducing prio to global reclaim.

> expensive reclaim etc.).

No cost, no earn.
David Hildenbrand Oct. 29, 2019, 1:26 p.m. UTC | #7
On 29.10.19 13:30, Hillf Danton wrote:
> 
> Date: Tue, 29 Oct 2019 09:41:53 +0100 Michal Hocko wrote:
>>
>> As already raised in the review of v1. There is no real life usecase
>> described in the changelog.
> 
> No feature, no user; no user, no workloads.
> No linux-6.x released, no 6.x users.
> Are you going to be one of the users of linux-6.0?
> 
> Even though, I see a use case over there at
> https://lore.kernel.org/lkml/20191023120452.GN754@dhcp22.suse.cz/
> 
> That thread terminated because of preemption, showing us how useful
> preemption might be in real life.
> 
>> I have also expressed concerns about how
>> such a reclaim would work in the first place
> 
> Based on what?
> 
>> (priority inversion,
> 
> No prio inversion will happen after introducing prio to global reclaim.
> 
>> expensive reclaim etc.).
> 
> No cost, no earn.
> 
> 

Side note: You should really have a look at what your mail client is
messing up here. E.g., the reply from Michal correctly had

Message-ID: <20191029084153.GD31513@dhcp22.suse.cz>
References: <20191026112808.14268-1-hdanton@sina.com>
In-Reply-To: <20191026112808.14268-1-hdanton@sina.com>

Once you reply to that, you have

Message-Id: <20191029123058.19060-1-hdanton@sina.com>
In-Reply-To: <20191026112808.14268-1-hdanton@sina.com>
References:

Instead of

Message-Id: <20191029123058.19060-1-hdanton@sina.com>
In-Reply-To: <20191029084153.GD31513@dhcp22.suse.cz>
References: <20191029084153.GD31513@dhcp22.suse.cz>

Which flattens the whole thread hierarchy. Nasty. Please fix that.
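
For illustration, the headers a reply is expected to carry can be
derived from the parent message as sketched below.
build_reply_headers() is a hypothetical helper, not part of any mail
client. It follows the RFC 5322 convention that In-Reply-To holds the
parent's Message-ID and References holds the parent's References
followed by the parent's Message-ID, so its References line also
lists the root message, which the shorter form shown above omits.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format the threading headers of a reply.
 * parent_refs may be NULL or empty when the parent starts the thread.
 * Returns what snprintf() returns.
 */
static int build_reply_headers(char *out, size_t size,
			       const char *parent_msgid,
			       const char *parent_refs)
{
	assert(out && parent_msgid);

	if (parent_refs && strlen(parent_refs) > 0)
		return snprintf(out, size,
				"In-Reply-To: %s\nReferences: %s %s\n",
				parent_msgid, parent_refs, parent_msgid);

	return snprintf(out, size,
			"In-Reply-To: %s\nReferences: %s\n",
			parent_msgid, parent_msgid);
}
```

Fed with the Message-ID and References of Michal's reply, this yields
an In-Reply-To of <20191029084153.GD31513@dhcp22.suse.cz> and a
References line naming first the original posting and then that reply.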
Michal Hocko Oct. 29, 2019, 1:41 p.m. UTC | #8
On Tue 29-10-19 14:26:36, David Hildenbrand wrote:
[...]
> Side note: You should really have a look what your mail client is messing up
> here. E.g., the reply from Michal correctly had
> 
> Message-ID: <20191029084153.GD31513@dhcp22.suse.cz>
> References: <20191026112808.14268-1-hdanton@sina.com>
> In-Reply-To: <20191026112808.14268-1-hdanton@sina.com>
> 
> Once you reply to that, you have
> 
> Message-Id: <20191029123058.19060-1-hdanton@sina.com>
> In-Reply-To: <20191026112808.14268-1-hdanton@sina.com>
> References:
> 
> Instead of
> 
> Message-Id: <20191029123058.19060-1-hdanton@sina.com>
> In-Reply-To: <20191029084153.GD31513@dhcp22.suse.cz>
> References: <20191029084153.GD31513@dhcp22.suse.cz>
> 
> Which flattens the whole thread hierarchy. Nasty. Please fix that.

This is not the first time. It's been like that for a long time
and several people have noted that before.
Johannes Weiner Oct. 29, 2019, 3:27 p.m. UTC | #9
On Tue, Oct 29, 2019 at 09:41:53AM +0100, Michal Hocko wrote:
> As already raised in the review of v1. There is no real life usecase
> described in the changelog. I have also expressed concerns about how
> such a reclaim would work in the first place (priority inversion,
> expensive reclaim etc.). Until that is provided/clarified
> 
> Nacked-by: Michal Hocko <mhocko@suse.com>

I second this.

Nacked-by: Johannes Weiner <hannes@cmpxchg.org>

Patch

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -14,6 +14,7 @@ 
 #include <linux/uprobes.h>
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
+#include <linux/sched/prio.h>
 
 #include <asm/mmu.h>
 
@@ -218,6 +219,9 @@  struct page {
 
 #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
 	int _last_cpupid;
+#else
+	int prio;
+#define CONFIG_PAGE_PREEMPTION PP
 #endif
 } _struct_page_alignment;
 
@@ -232,6 +236,53 @@  struct page {
 #define page_private(page)		((page)->private)
 #define set_page_private(page, v)	((page)->private = (v))
 
+#ifdef CONFIG_PAGE_PREEMPTION
+static inline bool page_prio_valid(struct page *p)
+{
+	return p->prio > MAX_PRIO;
+}
+
+static inline void set_page_prio(struct page *p, int task_prio)
+{
+	if (!page_prio_valid(p))
+		p->prio = task_prio + MAX_PRIO + 1;
+}
+
+static inline void copy_page_prio(struct page *to, struct page *from)
+{
+	to->prio = from->prio;
+}
+
+static inline int page_prio(struct page *p)
+{
+	return p->prio - MAX_PRIO - 1;
+}
+
+static inline bool page_prio_higher(struct page *p, int prio)
+{
+	return page_prio(p) < prio;
+}
+#else
+static inline bool page_prio_valid(struct page *p)
+{
+	return true;
+}
+static inline void set_page_prio(struct page *p, int task_prio)
+{
+}
+static inline void copy_page_prio(struct page *to, struct page *from)
+{
+}
+static inline int page_prio(struct page *p)
+{
+	return MAX_PRIO + 1;
+}
+static inline bool page_prio_higher(struct page *p, int prio)
+{
+	return false;
+}
+#endif /* CONFIG_PAGE_PREEMPTION */
+
 struct page_frag_cache {
 	void * va;
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -671,6 +671,7 @@  static void __collapse_huge_page_copy(pt
 			}
 		} else {
 			src_page = pte_page(pteval);
+			copy_page_prio(page, src_page);
 			copy_user_highpage(page, src_page, address, vma);
 			VM_BUG_ON_PAGE(page_mapcount(src_page) != 1, src_page);
 			release_pte_page(src_page);
@@ -1735,6 +1736,7 @@  xa_unlocked:
 				clear_highpage(new_page + (index % HPAGE_PMD_NR));
 				index++;
 			}
+			copy_page_prio(new_page, page);
 			copy_highpage(new_page + (page->index % HPAGE_PMD_NR),
 					page);
 			list_del(&page->lru);
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -647,6 +647,7 @@  void migrate_page_states(struct page *ne
 		end_page_writeback(newpage);
 
 	copy_page_owner(page, newpage);
+	copy_page_prio(newpage, page);
 
 	mem_cgroup_migrate(page, newpage);
 }
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1575,6 +1575,7 @@  static int shmem_replace_page(struct pag
 
 	get_page(newpage);
 	copy_highpage(newpage, oldpage);
+	copy_page_prio(newpage, oldpage);
 	flush_dcache_page(newpage);
 
 	__SetPageLocked(newpage);
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -407,6 +407,7 @@  static void __lru_cache_add(struct page
 	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
 
 	get_page(page);
+	set_page_prio(page, current->prio);
 	if (!pagevec_add(pvec, page) || PageCompound(page))
 		__pagevec_lru_add(pvec);
 	put_cpu_var(lru_add_pvec);
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -738,6 +738,7 @@  typedef struct pglist_data {
 	int kswapd_order;
 	enum zone_type kswapd_classzone_idx;
 
+	int kswapd_prio;
 	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
 
 #ifdef CONFIG_COMPACTION
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -110,6 +110,9 @@  struct scan_control {
 	/* The highest zone to isolate pages for reclaim from */
 	s8 reclaim_idx;
 
+	s8 __pad;
+	int reclaimer_prio;
+
 	/* This context's GFP mask */
 	gfp_t gfp_mask;
 
@@ -1707,11 +1710,17 @@  static unsigned long isolate_lru_pages(u
 		total_scan += nr_pages;
 
 		if (page_zonenum(page) > sc->reclaim_idx) {
+next_page:
 			list_move(&page->lru, &pages_skipped);
 			nr_skipped[page_zonenum(page)] += nr_pages;
 			continue;
 		}
 
+#ifdef CONFIG_PAGE_PREEMPTION
+		if (is_active_lru(lru) && global_reclaim(sc) &&
+		    page_prio_higher(page, sc->reclaimer_prio))
+			goto next_page;
+#endif
 		/*
 		 * Do not count skipped pages because that makes the function
 		 * return with no isolated pages if the LRU mostly contains
@@ -3257,6 +3266,7 @@  unsigned long try_to_free_pages(struct z
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
+		.reclaimer_prio = current->prio,
 		.gfp_mask = current_gfp_context(gfp_mask),
 		.reclaim_idx = gfp_zone(gfp_mask),
 		.order = order,
@@ -3583,6 +3593,7 @@  static int balance_pgdat(pg_data_t *pgda
 	bool boosted;
 	struct zone *zone;
 	struct scan_control sc = {
+		.reclaimer_prio = pgdat->kswapd_prio,
 		.gfp_mask = GFP_KERNEL,
 		.order = order,
 		.may_unmap = 1,
@@ -3736,6 +3747,8 @@  restart:
 		if (nr_boost_reclaim && !nr_reclaimed)
 			break;
 
+		sc.reclaimer_prio = pgdat->kswapd_prio;
+
 		if (raise_priority || !nr_reclaimed)
 			sc.priority--;
 	} while (sc.priority >= 1);
@@ -3828,6 +3841,7 @@  static void kswapd_try_to_sleep(pg_data_
 		 */
 		wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
 
+		pgdat->kswapd_prio = MAX_PRIO + 1;
 		remaining = schedule_timeout(HZ/10);
 
 		/*
@@ -3862,8 +3876,10 @@  static void kswapd_try_to_sleep(pg_data_
 		 */
 		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
 
-		if (!kthread_should_stop())
+		if (!kthread_should_stop()) {
+			pgdat->kswapd_prio = MAX_PRIO + 1;
 			schedule();
+		}
 
 		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
 	} else {
@@ -3914,6 +3930,7 @@  static int kswapd(void *p)
 	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
 	set_freezable();
 
+	pgdat->kswapd_prio = MAX_PRIO + 1;
 	pgdat->kswapd_order = 0;
 	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
 	for ( ; ; ) {
@@ -3982,6 +3999,19 @@  void wakeup_kswapd(struct zone *zone, gf
 		return;
 	pgdat = zone->zone_pgdat;
 
+#ifdef CONFIG_PAGE_PREEMPTION
+	do {
+		int prio = current->prio;
+
+		if (pgdat->kswapd_prio < prio) {
+			smp_rmb();
+			return;
+		}
+		pgdat->kswapd_prio = prio;
+		smp_wmb();
+	} while (0);
+#endif
+
 	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
 		pgdat->kswapd_classzone_idx = classzone_idx;
 	else