From patchwork Sun Oct 20 13:43:04 2019
X-Patchwork-Submitter: Hillf Danton <hdanton@sina.com>
X-Patchwork-Id: 11200939
From: Hillf Danton <hdanton@sina.com>
To: linux-mm <linux-mm@kvack.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
    linux-kernel <linux-kernel@vger.kernel.org>,
    Michal Hocko <mhocko@suse.com>,
    Johannes Weiner <hannes@cmpxchg.org>,
    Shakeel Butt <shakeelb@google.com>,
    Minchan Kim <minchan@kernel.org>,
    Mel Gorman <mgorman@suse.de>,
    Vladimir Davydov <vdavydov.dev@gmail.com>,
    Jan Kara <jack@suse.cz>,
    Hillf Danton <hdanton@sina.com>
Subject: [RFC v1] mm: add page preemption
Date: Sun, 20 Oct 2019 21:43:04 +0800
Message-Id: <20191020134304.11700-1-hdanton@sina.com>
Unlike CPU preemption, page preemption would have been a double-edged
option for quite a while. It works by preventing a task from reclaiming
as many lru pages as possible from other tasks of higher priority.

Who needs pp?
Users who want to manage/control jitter in lru pages under memory
pressure.

In a way parallel to scheduling with a task's memory footprint taken
into account, pp makes task prio a part of page reclaim.

[Off topic: prio can also be defined and added in the memory controller
and then play a role in memcg reclaim, for example by checking prio
when selecting the victim memcg.]

First, on the page side, page->prio is added to mirror the prio of the
page owner task, along with a couple of helpers for setting, copying
and comparing page->prio that help add pages to the lru.

Then, on the reclaimer side, pgdat->kswapd_prio is added to mirror the
prio of the tasks that wake up the kthread, and it is updated every
time kswapd raises its reclaiming priority.

Finally, the priorities on both sides are compared when deactivating
lru pages, and a page is skipped if it is higher in prio.

V1 is based on next-20191018.

Changes since v0
- s/page->nice/page->prio/
- drop the role of kswapd's reclaiming priority in prio comparison
- add pgdat->kswapd_prio

Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Hillf Danton <hdanton@sina.com>
---
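[Not part of the commit log: a minimal userspace sketch of the
page->prio encoding implemented by the helpers in the diff below. It
assumes MAX_PRIO is 140 (MAX_RT_PRIO + NICE_WIDTH, the usual mainline
value); struct page is replaced by a stub, and everything prefixed
with fake_ is illustrative only, not kernel code.]

	#include <stdbool.h>
	#include <stdio.h>

	#define MAX_PRIO	140	/* assumed: MAX_RT_PRIO + NICE_WIDTH */

	struct fake_page {		/* stand-in for struct page */
		int prio;		/* 0 means "prio not set yet" */
	};

	/* valid only after set_page_prio() shifted it above MAX_PRIO */
	static bool page_prio_valid(struct fake_page *p)
	{
		return p->prio > MAX_PRIO;
	}

	/* store task prio shifted by MAX_PRIO + 1, and only once */
	static void set_page_prio(struct fake_page *p, int task_prio)
	{
		if (!page_prio_valid(p))
			p->prio = task_prio + MAX_PRIO + 1;
	}

	/* recover the owner task's prio */
	static int page_prio(struct fake_page *p)
	{
		return p->prio - MAX_PRIO - 1;
	}

	/* lower numeric prio means higher priority, as in the scheduler */
	static bool page_prio_higher(struct fake_page *p, int prio)
	{
		return page_prio(p) < prio;
	}

	int main(void)
	{
		struct fake_page p = { .prio = 0 };

		set_page_prio(&p, 120);	/* default CFS task: prio 120 */
		printf("stored %d, decoded %d\n", p.prio, page_prio(&p));
						/* stored 261, decoded 120 */
		printf("outranks a prio-110 reclaimer? %d\n",
		       page_prio_higher(&p, 110));	/* 0: it does not */
		return 0;
	}

The shift by MAX_PRIO + 1 appears to be what lets a freshly allocated
page (prio == 0) read as "not set", so set_page_prio() writes the prio
only once per page.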
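[Also not for the commit log: a companion sketch of the reclaimer side
under the same MAX_PRIO = 140 assumption, showing how
pgdat->kswapd_prio mirrors the tasks that wake kswapd and how the prio
comparison decides whether an active-lru page is skipped. The fake_
types and helper names are stand-ins for the kernel structures touched
in the diff below.]

	#include <stdbool.h>
	#include <stdio.h>

	#define MAX_PRIO	140	/* assumed, as in the sketch above */

	struct fake_pgdat {
		int kswapd_prio;
	};

	/* what kswapd does before sleeping: let any waker lower it again */
	static void kswapd_reset_prio(struct fake_pgdat *pgdat)
	{
		pgdat->kswapd_prio = MAX_PRIO + 1;
	}

	/*
	 * what wakeup_kswapd() does for an allocating task of prio @prio:
	 * keep the highest-priority (numerically lowest) waker; in the
	 * patch a lower-priority waker additionally skips the wakeup.
	 */
	static void record_waker_prio(struct fake_pgdat *pgdat, int prio)
	{
		if (pgdat->kswapd_prio > prio)
			pgdat->kswapd_prio = prio;
	}

	/* the isolate_lru_pages() test: skip pages whose owner outranks us */
	static bool skip_page(int page_owner_prio, int reclaimer_prio)
	{
		return page_owner_prio < reclaimer_prio;
	}

	int main(void)
	{
		struct fake_pgdat pgdat;

		kswapd_reset_prio(&pgdat);
		record_waker_prio(&pgdat, 120);	/* default task wakes kswapd */
		record_waker_prio(&pgdat, 110);	/* higher-priority task too */

		/* kswapd now reclaims on behalf of the prio-110 waker */
		printf("kswapd_prio = %d\n", pgdat.kswapd_prio);	/* 110 */
		printf("skip page of a prio-100 task? %d\n",
		       skip_page(100, pgdat.kswapd_prio));		/* 1 */
		printf("skip page of a prio-120 task? %d\n",
		       skip_page(120, pgdat.kswapd_prio));		/* 0 */
		return 0;
	}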
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -281,6 +281,15 @@ config VIRT_TO_BUS
 	  should probably not select this.
 
+config PAGE_PREEMPTION
+	bool
+	help
+	  This makes a task unable to reclaim as many lru pages as
+	  possible from other tasks of higher priorities.
+
+	  Say N if unsure.
+
+
 config MMU_NOTIFIER
 	bool
 	select SRCU
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -14,6 +14,7 @@
 #include
 #include
 #include
+#include
 
 #include
@@ -197,6 +198,10 @@ struct page {
 	/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
 	atomic_t _refcount;
+#ifdef CONFIG_PAGE_PREEMPTION
+	int prio;		/* mirror page owner task->prio */
+#endif
+
 #ifdef CONFIG_MEMCG
 	struct mem_cgroup *mem_cgroup;
 #endif
@@ -232,6 +237,54 @@ struct page {
 #define page_private(page)		((page)->private)
 #define set_page_private(page, v)	((page)->private = (v))
 
+#ifdef CONFIG_PAGE_PREEMPTION
+static inline bool page_prio_valid(struct page *p)
+{
+	return p->prio > MAX_PRIO;
+}
+
+static inline void set_page_prio(struct page *p, int task_prio)
+{
+	/* store page prio low enough to help khugepaged add lru pages */
+	if (!page_prio_valid(p))
+		p->prio = task_prio + MAX_PRIO + 1;
+}
+
+static inline void copy_page_prio(struct page *to, struct page *from)
+{
+	to->prio = from->prio;
+}
+
+static inline int page_prio(struct page *p)
+{
+	return p->prio - MAX_PRIO - 1;
+}
+
+static inline bool page_prio_higher(struct page *p, int prio)
+{
+	return page_prio(p) < prio;
+}
+#else
+static inline bool page_prio_valid(struct page *p)
+{
+	return true;
+}
+static inline void set_page_prio(struct page *p, int task_prio)
+{
+}
+static inline void copy_page_prio(struct page *to, struct page *from)
+{
+}
+static inline int page_prio(struct page *p)
+{
+	return MAX_PRIO + 1;
+}
+static inline bool page_prio_higher(struct page *p, int prio)
+{
+	return false;
+}
+#endif
+
 struct page_frag_cache {
 	void * va;
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -402,6 +402,7 @@ static void __lru_cache_add(struct page
 	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
 
 	get_page(page);
+	set_page_prio(page, current->prio);
 	if (!pagevec_add(pvec, page) || PageCompound(page))
 		__pagevec_lru_add(pvec);
 	put_cpu_var(lru_add_pvec);
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -647,6 +647,7 @@ void migrate_page_states(struct page *ne
 		end_page_writeback(newpage);
 
 	copy_page_owner(page, newpage);
+	copy_page_prio(newpage, page);
 
 	mem_cgroup_migrate(page, newpage);
 }
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1575,6 +1575,7 @@ static int shmem_replace_page(struct pag
 	get_page(newpage);
 	copy_highpage(newpage, oldpage);
+	copy_page_prio(newpage, oldpage);
 	flush_dcache_page(newpage);
 	__SetPageLocked(newpage);
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -671,6 +671,7 @@ static void __collapse_huge_page_copy(pt
 		}
 	} else {
 		src_page = pte_page(pteval);
+		copy_page_prio(page, src_page);
 		copy_user_highpage(page, src_page, address, vma);
 		VM_BUG_ON_PAGE(page_mapcount(src_page) != 1, src_page);
 		release_pte_page(src_page);
@@ -1723,6 +1724,7 @@ xa_unlocked:
 			clear_highpage(new_page + (index % HPAGE_PMD_NR));
 			index++;
 		}
+		copy_page_prio(new_page, page);
 		copy_highpage(new_page + (page->index % HPAGE_PMD_NR), page);
 		list_del(&page->lru);
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -738,6 +738,9 @@ typedef struct pglist_data {
 	int kswapd_order;
 	enum zone_type kswapd_classzone_idx;
 
+#ifdef CONFIG_PAGE_PREEMPTION
+	int kswapd_prio;	/* mirror task->prio waking up kswapd */
+#endif
 	int kswapd_failures;	/* Number of 'reclaimed == 0' runs */
 
 #ifdef CONFIG_COMPACTION
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -110,6 +110,10 @@ struct scan_control {
 	/* The highest zone to isolate pages for reclaim from */
 	s8 reclaim_idx;
 
+#ifdef CONFIG_PAGE_PREEMPTION
+	s8 __pad;
+	int reclaimer_prio;	/* mirror task->prio */
+#endif
 	/* This context's GFP mask */
 	gfp_t gfp_mask;
 
@@ -1710,11 +1714,18 @@ static unsigned long isolate_lru_pages(u
 		total_scan += nr_pages;
 
 		if (page_zonenum(page) > sc->reclaim_idx) {
+next_page:
 			list_move(&page->lru, &pages_skipped);
 			nr_skipped[page_zonenum(page)] += nr_pages;
 			continue;
 		}
 
+		if (IS_ENABLED(CONFIG_PAGE_PREEMPTION) &&
+		    is_active_lru(lru) &&
+		    global_reclaim(sc) &&
+		    page_prio_higher(page, sc->reclaimer_prio))
+			goto next_page;
+
 		/*
 		 * Do not count skipped pages because that makes the function
 		 * return with no isolated pages if the LRU mostly contains
@@ -3260,6 +3271,9 @@ unsigned long try_to_free_pages(struct z
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
+#ifdef CONFIG_PAGE_PREEMPTION
+		.reclaimer_prio = current->prio,
+#endif
 		.gfp_mask = current_gfp_context(gfp_mask),
 		.reclaim_idx = gfp_zone(gfp_mask),
 		.order = order,
@@ -3586,6 +3600,9 @@ static int balance_pgdat(pg_data_t *pgda
 	bool boosted;
 	struct zone *zone;
 	struct scan_control sc = {
+#ifdef CONFIG_PAGE_PREEMPTION
+		.reclaimer_prio = pgdat->kswapd_prio,
+#endif
 		.gfp_mask = GFP_KERNEL,
 		.order = order,
 		.may_unmap = 1,
@@ -3739,6 +3756,9 @@ restart:
 		if (nr_boost_reclaim && !nr_reclaimed)
 			break;
 
+		if (IS_ENABLED(CONFIG_PAGE_PREEMPTION))
+			sc.reclaimer_prio = pgdat->kswapd_prio;
+
 		if (raise_priority || !nr_reclaimed)
 			sc.priority--;
 	} while (sc.priority >= 1);
@@ -3831,6 +3851,10 @@ static void kswapd_try_to_sleep(pg_data_
 		 */
 		wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
 
+		if (IS_ENABLED(CONFIG_PAGE_PREEMPTION)) {
+			pgdat->kswapd_prio = MAX_PRIO + 1;
+			smp_wmb();
+		}
 		remaining = schedule_timeout(HZ/10);
 
 		/*
@@ -3865,8 +3889,13 @@ static void kswapd_try_to_sleep(pg_data_
 		 */
 		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
 
-		if (!kthread_should_stop())
+		if (!kthread_should_stop()) {
+			if (IS_ENABLED(CONFIG_PAGE_PREEMPTION)) {
+				pgdat->kswapd_prio = MAX_PRIO + 1;
+				smp_wmb();
+			}
 			schedule();
+		}
 
 		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
 	} else {
@@ -3917,6 +3946,8 @@ static int kswapd(void *p)
 	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
 	set_freezable();
 
+	if (IS_ENABLED(CONFIG_PAGE_PREEMPTION))
+		pgdat->kswapd_prio = MAX_PRIO + 1;
 	pgdat->kswapd_order = 0;
 	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
 	for ( ; ; ) {
@@ -3985,6 +4016,17 @@ void wakeup_kswapd(struct zone *zone, gf
 		return;
 	pgdat = zone->zone_pgdat;
 
+	if (IS_ENABLED(CONFIG_PAGE_PREEMPTION)) {
+		int prio = current->prio;
+
+		if (pgdat->kswapd_prio < prio) {
+			smp_rmb();
+			return;
+		}
+		pgdat->kswapd_prio = prio;
+		smp_wmb();
+	}
+
 	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
 		pgdat->kswapd_classzone_idx = classzone_idx;
 	else