From patchwork Sat Oct 26 11:28:08 2019
X-Patchwork-Submitter: Hillf Danton
X-Patchwork-Id: 11213509
From: Hillf Danton <hdanton@sina.com>
To: linux-mm <linux-mm@kvack.org>
Cc: Andrew Morton, linux-kernel, Matthew Wilcox, Michal Hocko,
 Johannes Weiner, Shakeel Butt, Minchan Kim, Mel Gorman,
 Vladimir Davydov, Jan Kara, Hillf Danton
Subject: [RFC v2] mm: add page preemption
Date: Sat, 26 Oct 2019 19:28:08 +0800
Message-Id: <20191026112808.14268-1-hdanton@sina.com>
The cpu preemption feature makes a task able to preempt other tasks of
lower priority for the cpu, and it has been around for a while.

This work introduces task prio into page reclaim in order to add the
page preemption feature (pp), which makes a task able to preempt other
tasks of lower priority for pages. Under pp, no page is reclaimed on
behalf of a task of lower priority. It is a double-edged feature that
works only under memory pressure, laying a barrier in the way of pages
flowing to lower prio, and the nice syscall is the knob users fiddle
with to make use of it, since no task is preempted without a difference
in prio. It is meant, for instance, for users who have a couple of
workloads that are sensitive to jitters in lru pages and have some
difficulty predicting their working set sizes.

Currently lru pages are reclaimed under memory pressure without prio
taken into account: pages can be reclaimed from tasks of lower priority
on behalf of higher-prio tasks and vice versa. s/and vice versa/only/
is what we need to make pp what it is by definition, but that change
makes no sense without prio being introduced into reclaim; otherwise we
could simply skip deactivating lru pages based on a prio comparison and
the work would be done.

The introduction consists of two parts. On the page side, we have to
store the page owner task's prio in the page, which needs an extra room
the size of an int in struct page. Such a room cannot be had without
inflating struct page, so the problem is not solved but worked around
by sharing room with 32-bit numa balancing, see 75980e97dacc ("mm: fold
page->_last_nid into page->flags where possible").

On the reclaimer side, kswapd's prio is set to the prio of its waker
and updated in the same manner as kswapd_order.

V2 is based on next-20191018.
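To make the prio bookkeeping concrete, below is a small userspace model
of the encoding the patch uses; it is an illustration only and not part
of the patch. struct model_page and main() are made-up names, and the
MAX_PRIO value of 140 is assumed from the kernel. A page records its
owner's prio shifted by MAX_PRIO + 1 so that 0 can mean "never set",
and reclaim skips an active-lru page whose owner prio is numerically
lower (i.e. higher priority) than the reclaimer's prio.

/*
 * Illustration only -- a userspace model of the page-prio encoding,
 * not kernel code.  MAX_PRIO is assumed to be 140 as in the kernel;
 * struct model_page stands in for the int stuffed into struct page.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_PRIO	140

struct model_page {
	int prio;	/* 0 = never set, else owner prio + MAX_PRIO + 1 */
};

static bool page_prio_valid(struct model_page *p)
{
	return p->prio > MAX_PRIO;
}

static void set_page_prio(struct model_page *p, int task_prio)
{
	if (!page_prio_valid(p))
		p->prio = task_prio + MAX_PRIO + 1;
}

static int page_prio(struct model_page *p)
{
	return p->prio - MAX_PRIO - 1;
}

/* lower numeric prio means higher priority, as in the scheduler */
static bool page_prio_higher(struct model_page *p, int reclaimer_prio)
{
	return page_prio(p) < reclaimer_prio;
}

int main(void)
{
	struct model_page page = { .prio = 0 };

	set_page_prio(&page, 115);	/* owner ran at nice -5, i.e. prio 115 */

	/* reclaim on behalf of a prio-120 (nice 0) task skips the page */
	printf("skip for prio 120: %d\n", page_prio_higher(&page, 120));
	/* reclaim on behalf of a prio-110 (nice -10) task does not */
	printf("skip for prio 110: %d\n", page_prio_higher(&page, 110));
	return 0;
}

With that in mind, giving a jitter-sensitive workload a lower nice
value than its batch neighbours is all a user needs to do to keep its
lru pages from being reclaimed on the neighbours' behalf.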
Changes since v1
- page->prio shares room with _last_cpupid as per Matthew Wilcox

Changes since v0
- s/page->nice/page->prio/
- drop the role of kswapd's reclaiming priority in prio comparison
- add pgdat->kswapd_prio

Cc: Matthew Wilcox
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Shakeel Butt
Cc: Minchan Kim
Cc: Mel Gorman
Cc: Vladimir Davydov
Cc: Jan Kara
Signed-off-by: Hillf Danton
Nacked-by: Michal Hocko
Nacked-by: Johannes Weiner
---

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -14,6 +14,7 @@
 #include
 #include
 #include
+#include
 #include
@@ -218,6 +219,9 @@ struct page {
 #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
 	int _last_cpupid;
+#else
+	int prio;
+#define CONFIG_PAGE_PREEMPTION PP
 #endif
 } _struct_page_alignment;
@@ -232,6 +236,53 @@ struct page {
 #define page_private(page)		((page)->private)
 #define set_page_private(page, v)	((page)->private = (v))
 
+#ifdef CONFIG_PAGE_PREEMPTION
+static inline bool page_prio_valid(struct page *p)
+{
+	return p->prio > MAX_PRIO;
+}
+
+static inline void set_page_prio(struct page *p, int task_prio)
+{
+	if (!page_prio_valid(p))
+		p->prio = task_prio + MAX_PRIO + 1;
+}
+
+static inline void copy_page_prio(struct page *to, struct page *from)
+{
+	to->prio = from->prio;
+}
+
+static inline int page_prio(struct page *p)
+{
+	return p->prio - MAX_PRIO - 1;
+}
+
+static inline bool page_prio_higher(struct page *p, int prio)
+{
+	return page_prio(p) < prio;
+}
+#else
+static inline bool page_prio_valid(struct page *p)
+{
+	return true;
+}
+static inline void set_page_prio(struct page *p, int task_prio)
+{
+}
+static inline void copy_page_prio(struct page *to, struct page *from)
+{
+}
+static inline int page_prio(struct page *p)
+{
+	return MAX_PRIO + 1;
+}
+static inline bool page_prio_higher(struct page *p, int prio)
+{
+	return false;
+}
+#endif /* CONFIG_PAGE_PREEMPTION */
+
 struct page_frag_cache {
 	void * va;
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -671,6 +671,7 @@ static void __collapse_huge_page_copy(pt
 			}
 		} else {
 			src_page = pte_page(pteval);
+			copy_page_prio(page, src_page);
 			copy_user_highpage(page, src_page, address, vma);
 			VM_BUG_ON_PAGE(page_mapcount(src_page) != 1, src_page);
 			release_pte_page(src_page);
@@ -1735,6 +1736,7 @@ xa_unlocked:
 			clear_highpage(new_page + (index % HPAGE_PMD_NR));
 			index++;
 		}
+		copy_page_prio(new_page, page);
 		copy_highpage(new_page + (page->index % HPAGE_PMD_NR),
 				page);
 		list_del(&page->lru);
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -647,6 +647,7 @@ void migrate_page_states(struct page *ne
 		end_page_writeback(newpage);
 
 	copy_page_owner(page, newpage);
+	copy_page_prio(newpage, page);
 
 	mem_cgroup_migrate(page, newpage);
 }
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1575,6 +1575,7 @@ static int shmem_replace_page(struct pag
 
 	get_page(newpage);
 	copy_highpage(newpage, oldpage);
+	copy_page_prio(newpage, oldpage);
 	flush_dcache_page(newpage);
 
 	__SetPageLocked(newpage);
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -407,6 +407,7 @@ static void __lru_cache_add(struct page
 	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
 
 	get_page(page);
+	set_page_prio(page, current->prio);
 	if (!pagevec_add(pvec, page) || PageCompound(page))
 		__pagevec_lru_add(pvec);
 	put_cpu_var(lru_add_pvec);
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -738,6 +738,7 @@ typedef struct pglist_data {
 	int kswapd_order;
 	enum zone_type kswapd_classzone_idx;
+	int kswapd_prio;
 
 	int kswapd_failures;	/* Number of 'reclaimed == 0' runs */
 
 #ifdef CONFIG_COMPACTION
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -110,6 +110,9 @@ struct scan_control {
 	/* The highest zone to isolate pages for reclaim from */
 	s8 reclaim_idx;
 
+	s8 __pad;
+	int reclaimer_prio;
+
 	/* This context's GFP mask */
 	gfp_t gfp_mask;
 
@@ -1707,11 +1710,17 @@ static unsigned long isolate_lru_pages(u
 		total_scan += nr_pages;
 		if (page_zonenum(page) > sc->reclaim_idx) {
+next_page:
 			list_move(&page->lru, &pages_skipped);
 			nr_skipped[page_zonenum(page)] += nr_pages;
 			continue;
 		}
 
+#ifdef CONFIG_PAGE_PREEMPTION
+		if (is_active_lru(lru) && global_reclaim(sc) &&
+		    page_prio_higher(page, sc->reclaimer_prio))
+			goto next_page;
+#endif
 		/*
 		 * Do not count skipped pages because that makes the function
 		 * return with no isolated pages if the LRU mostly contains
@@ -3257,6 +3266,7 @@ unsigned long try_to_free_pages(struct z
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
+		.reclaimer_prio = current->prio,
 		.gfp_mask = current_gfp_context(gfp_mask),
 		.reclaim_idx = gfp_zone(gfp_mask),
 		.order = order,
@@ -3583,6 +3593,7 @@ static int balance_pgdat(pg_data_t *pgda
 	bool boosted;
 	struct zone *zone;
 	struct scan_control sc = {
+		.reclaimer_prio = pgdat->kswapd_prio,
 		.gfp_mask = GFP_KERNEL,
 		.order = order,
 		.may_unmap = 1,
@@ -3736,6 +3747,8 @@ restart:
 		if (nr_boost_reclaim && !nr_reclaimed)
 			break;
 
+		sc.reclaimer_prio = pgdat->kswapd_prio;
+
 		if (raise_priority || !nr_reclaimed)
 			sc.priority--;
 	} while (sc.priority >= 1);
@@ -3828,6 +3841,7 @@ static void kswapd_try_to_sleep(pg_data_
 		 */
 		wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
 
+		pgdat->kswapd_prio = MAX_PRIO + 1;
 		remaining = schedule_timeout(HZ/10);
 
 		/*
@@ -3862,8 +3876,10 @@ static void kswapd_try_to_sleep(pg_data_
 		 */
 		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
 
-		if (!kthread_should_stop())
+		if (!kthread_should_stop()) {
+			pgdat->kswapd_prio = MAX_PRIO + 1;
 			schedule();
+		}
 
 		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
 	} else {
@@ -3914,6 +3930,7 @@ static int kswapd(void *p)
 	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
 	set_freezable();
 
+	pgdat->kswapd_prio = MAX_PRIO + 1;
 	pgdat->kswapd_order = 0;
 	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
 	for ( ; ; ) {
@@ -3982,6 +3999,19 @@ void wakeup_kswapd(struct zone *zone, gf
 		return;
 	pgdat = zone->zone_pgdat;
 
+#ifdef CONFIG_PAGE_PREEMPTION
+	do {
+		int prio = current->prio;
+
+		if (pgdat->kswapd_prio < prio) {
+			smp_rmb();
+			return;
+		}
+		pgdat->kswapd_prio = prio;
+		smp_wmb();
+	} while (0);
+#endif
+
 	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
 		pgdat->kswapd_classzone_idx = classzone_idx;
 	else