From patchwork Tue Feb 15 02:20:24 2022
X-Patchwork-Submitter: Hugh Dickins X-Patchwork-Id: 12746439
Date: Mon, 14 Feb 2022 18:20:24 -0800 (PST) From: Hugh Dickins To: Andrew Morton cc: Michal Hocko , Vlastimil Babka , "Kirill A.
Shutemov" , Matthew Wilcox , David Hildenbrand , Alistair Popple , Johannes Weiner , Rik van Riel , Suren Baghdasaryan , Yu Zhao , Greg Thelen , Shakeel Butt , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v2 01/13] mm/munlock: delete page_mlock() and all its works In-Reply-To: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> Message-ID: <48c44eae-4cf0-a8ce-454c-5ec88457ffea@google.com> References: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> MIME-Version: 1.0 X-Rspam-User: X-Rspamd-Server: rspam08 X-Rspamd-Queue-Id: 45234140005 X-Stat-Signature: q4ya5gtx4oyyqb4emqaxcr4i8rh5qaf3 Authentication-Results: imf23.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=LpDY0yb0; spf=pass (imf23.hostedemail.com: domain of hughd@google.com designates 209.85.160.181 as permitted sender) smtp.mailfrom=hughd@google.com; dmarc=pass (policy=reject) header.from=google.com X-HE-Tag: 1644891628-784624 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: We have recommended some applications to mlock their userspace, but that turns out to be counter-productive: when many processes mlock the same file, contention on rmap's i_mmap_rwsem can become intolerable at exit: it is needed for write, to remove any vma mapping that file from rmap's tree; but hogged for read by those with mlocks calling page_mlock() (formerly known as try_to_munlock()) on *each* page mapped from the file (the purpose being to find out whether another process has the page mlocked, so therefore it should not be unmlocked yet). Several optimizations have been made in the past: one is to skip page_mlock() when mapcount tells that nothing else has this page mapped; but that doesn't help at all when others do have it mapped. This time around, I initially intended to add a preliminary search of the rmap tree for overlapping VM_LOCKED ranges; but that gets messy with locking order, when in doubt whether a page is actually present; and risks adding even more contention on the i_mmap_rwsem. A solution would be much easier, if only there were space in struct page for an mlock_count... but actually, most of the time, there is space for it - an mlocked page spends most of its life on an unevictable LRU, but since 3.18 removed the scan_unevictable_pages sysctl, that "LRU" has been redundant. Let's try to reuse its page->lru. But leave that until a later patch: in this patch, clear the ground by removing page_mlock(), and all the infrastructure that has gathered around it - which mostly hinders understanding, and will make reviewing new additions harder. Don't mind those old comments about THPs, they date from before 4.5's refcounting rework: splitting is not a risk here. Just keep a minimal version of munlock_vma_page(), as reminder of what it should attend to (in particular, the odd way PGSTRANDED is counted out of PGMUNLOCKED), and likewise a stub for munlock_vma_pages_range(). Move unchanged __mlock_posix_error_return() out of the way, down to above its caller: this series then makes no further change after mlock_fixup(). After this and each following commit, the kernel builds, boots and runs; but with deficiencies which may show up in testing of mlock and munlock. 
The system calls succeed or fail as before, and mlock remains effective in preventing page reclaim; but meminfo's Unevictable and Mlocked amounts may be shown too low after mlock, grow, then stay too high after munlock: with previously mlocked pages remaining unevictable for too long, until finally unmapped and freed and counts corrected. Normal service will be resumed in "mm/munlock: mlock_pte_range() when mlocking or munlocking". Signed-off-by: Hugh Dickins Acked-by: Vlastimil Babka --- v2: Restore munlock_vma_pages_range() VM_LOCKED_CLEAR_MASK, per Vlastimil. Add paragraph to changelog on what does not work yet, per Vlastimil. These v2 replacement patches based on 5.17-rc2 like v1. include/linux/rmap.h | 6 - mm/internal.h | 2 +- mm/mlock.c | 375 +++---------------------------------------- mm/rmap.c | 80 --------- 4 files changed, 25 insertions(+), 438 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index e704b1a4c06c..dc48aa8c2c94 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -237,12 +237,6 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *); */ int folio_mkclean(struct folio *); -/* - * called in munlock()/munmap() path to check for other vmas holding - * the page mlocked. - */ -void page_mlock(struct page *page); - void remove_migration_ptes(struct page *old, struct page *new, bool locked); /* diff --git a/mm/internal.h b/mm/internal.h index d80300392a19..e48c486d5ddf 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -409,7 +409,7 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma) * must be called with vma's mmap_lock held for read or write, and page locked. */ extern void mlock_vma_page(struct page *page); -extern unsigned int munlock_vma_page(struct page *page); +extern void munlock_vma_page(struct page *page); extern int mlock_future_check(struct mm_struct *mm, unsigned long flags, unsigned long len); diff --git a/mm/mlock.c b/mm/mlock.c index 8f584eddd305..aec4ce7919da 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -46,12 +46,6 @@ EXPORT_SYMBOL(can_do_mlock); * be placed on the LRU "unevictable" list, rather than the [in]active lists. * The unevictable list is an LRU sibling list to the [in]active lists. * PageUnevictable is set to indicate the unevictable state. - * - * When lazy mlocking via vmscan, it is important to ensure that the - * vma's VM_LOCKED status is not concurrently being modified, otherwise we - * may have mlocked a page that is being munlocked. So lazy mlock must take - * the mmap_lock for read, and verify that the vma really is locked - * (see mm/rmap.c). */ /* @@ -106,299 +100,28 @@ void mlock_vma_page(struct page *page) } } -/* - * Finish munlock after successful page isolation - * - * Page must be locked. This is a wrapper for page_mlock() - * and putback_lru_page() with munlock accounting. - */ -static void __munlock_isolated_page(struct page *page) -{ - /* - * Optimization: if the page was mapped just once, that's our mapping - * and we don't need to check all the other vmas. - */ - if (page_mapcount(page) > 1) - page_mlock(page); - - /* Did try_to_unlock() succeed or punt? */ - if (!PageMlocked(page)) - count_vm_events(UNEVICTABLE_PGMUNLOCKED, thp_nr_pages(page)); - - putback_lru_page(page); -} - -/* - * Accounting for page isolation fail during munlock - * - * Performs accounting when page isolation fails in munlock. There is nothing - * else to do because it means some other task has already removed the page - * from the LRU. 
putback_lru_page() will take care of removing the page from - * the unevictable list, if necessary. vmscan [page_referenced()] will move - * the page back to the unevictable list if some other vma has it mlocked. - */ -static void __munlock_isolation_failed(struct page *page) -{ - int nr_pages = thp_nr_pages(page); - - if (PageUnevictable(page)) - __count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages); - else - __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages); -} - /** * munlock_vma_page - munlock a vma page * @page: page to be unlocked, either a normal page or THP page head - * - * returns the size of the page as a page mask (0 for normal page, - * HPAGE_PMD_NR - 1 for THP head page) - * - * called from munlock()/munmap() path with page supposedly on the LRU. - * When we munlock a page, because the vma where we found the page is being - * munlock()ed or munmap()ed, we want to check whether other vmas hold the - * page locked so that we can leave it on the unevictable lru list and not - * bother vmscan with it. However, to walk the page's rmap list in - * page_mlock() we must isolate the page from the LRU. If some other - * task has removed the page from the LRU, we won't be able to do that. - * So we clear the PageMlocked as we might not get another chance. If we - * can't isolate the page, we leave it for putback_lru_page() and vmscan - * [page_referenced()/try_to_unmap()] to deal with. */ -unsigned int munlock_vma_page(struct page *page) +void munlock_vma_page(struct page *page) { - int nr_pages; - - /* For page_mlock() and to serialize with page migration */ + /* Serialize with page migration */ BUG_ON(!PageLocked(page)); - VM_BUG_ON_PAGE(PageTail(page), page); - - if (!TestClearPageMlocked(page)) { - /* Potentially, PTE-mapped THP: do not skip the rest PTEs */ - return 0; - } - - nr_pages = thp_nr_pages(page); - mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); - - if (!isolate_lru_page(page)) - __munlock_isolated_page(page); - else - __munlock_isolation_failed(page); - - return nr_pages - 1; -} - -/* - * convert get_user_pages() return value to posix mlock() error - */ -static int __mlock_posix_error_return(long retval) -{ - if (retval == -EFAULT) - retval = -ENOMEM; - else if (retval == -ENOMEM) - retval = -EAGAIN; - return retval; -} - -/* - * Prepare page for fast batched LRU putback via putback_lru_evictable_pagevec() - * - * The fast path is available only for evictable pages with single mapping. - * Then we can bypass the per-cpu pvec and get better performance. - * when mapcount > 1 we need page_mlock() which can fail. - * when !page_evictable(), we need the full redo logic of putback_lru_page to - * avoid leaving evictable page in unevictable list. - * - * In case of success, @page is added to @pvec and @pgrescued is incremented - * in case that the page was previously unevictable. @page is also unlocked. - */ -static bool __putback_lru_fast_prepare(struct page *page, struct pagevec *pvec, - int *pgrescued) -{ - VM_BUG_ON_PAGE(PageLRU(page), page); - VM_BUG_ON_PAGE(!PageLocked(page), page); - - if (page_mapcount(page) <= 1 && page_evictable(page)) { - pagevec_add(pvec, page); - if (TestClearPageUnevictable(page)) - (*pgrescued)++; - unlock_page(page); - return true; - } - - return false; -} -/* - * Putback multiple evictable pages to the LRU - * - * Batched putback of evictable pages that bypasses the per-cpu pvec. Some of - * the pages might have meanwhile become unevictable but that is OK. 
- */ -static void __putback_lru_fast(struct pagevec *pvec, int pgrescued) -{ - count_vm_events(UNEVICTABLE_PGMUNLOCKED, pagevec_count(pvec)); - /* - *__pagevec_lru_add() calls release_pages() so we don't call - * put_page() explicitly - */ - __pagevec_lru_add(pvec); - count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued); -} - -/* - * Munlock a batch of pages from the same zone - * - * The work is split to two main phases. First phase clears the Mlocked flag - * and attempts to isolate the pages, all under a single zone lru lock. - * The second phase finishes the munlock only for pages where isolation - * succeeded. - * - * Note that the pagevec may be modified during the process. - */ -static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) -{ - int i; - int nr = pagevec_count(pvec); - int delta_munlocked = -nr; - struct pagevec pvec_putback; - struct lruvec *lruvec = NULL; - int pgrescued = 0; - - pagevec_init(&pvec_putback); - - /* Phase 1: page isolation */ - for (i = 0; i < nr; i++) { - struct page *page = pvec->pages[i]; - struct folio *folio = page_folio(page); - - if (TestClearPageMlocked(page)) { - /* - * We already have pin from follow_page_mask() - * so we can spare the get_page() here. - */ - if (TestClearPageLRU(page)) { - lruvec = folio_lruvec_relock_irq(folio, lruvec); - del_page_from_lru_list(page, lruvec); - continue; - } else - __munlock_isolation_failed(page); - } else { - delta_munlocked++; - } + VM_BUG_ON_PAGE(PageTail(page), page); - /* - * We won't be munlocking this page in the next phase - * but we still need to release the follow_page_mask() - * pin. We cannot do it under lru_lock however. If it's - * the last pin, __page_cache_release() would deadlock. - */ - pagevec_add(&pvec_putback, pvec->pages[i]); - pvec->pages[i] = NULL; - } - if (lruvec) { - __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked); - unlock_page_lruvec_irq(lruvec); - } else if (delta_munlocked) { - mod_zone_page_state(zone, NR_MLOCK, delta_munlocked); - } + if (TestClearPageMlocked(page)) { + int nr_pages = thp_nr_pages(page); - /* Now we can release pins of pages that we are not munlocking */ - pagevec_release(&pvec_putback); - - /* Phase 2: page munlock */ - for (i = 0; i < nr; i++) { - struct page *page = pvec->pages[i]; - - if (page) { - lock_page(page); - if (!__putback_lru_fast_prepare(page, &pvec_putback, - &pgrescued)) { - /* - * Slow path. We don't want to lose the last - * pin before unlock_page() - */ - get_page(page); /* for putback_lru_page() */ - __munlock_isolated_page(page); - unlock_page(page); - put_page(page); /* from follow_page_mask() */ - } + mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); + if (!isolate_lru_page(page)) { + putback_lru_page(page); + count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages); + } else if (PageUnevictable(page)) { + count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages); } } - - /* - * Phase 3: page putback for pages that qualified for the fast path - * This will also call put_page() to return pin from follow_page_mask() - */ - if (pagevec_count(&pvec_putback)) - __putback_lru_fast(&pvec_putback, pgrescued); -} - -/* - * Fill up pagevec for __munlock_pagevec using pte walk - * - * The function expects that the struct page corresponding to @start address is - * a non-TPH page already pinned and in the @pvec, and that it belongs to @zone. - * - * The rest of @pvec is filled by subsequent pages within the same pmd and same - * zone, as long as the pte's are present and vm_normal_page() succeeds. These - * pages also get pinned. 
- * - * Returns the address of the next page that should be scanned. This equals - * @start + PAGE_SIZE when no page could be added by the pte walk. - */ -static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, - struct vm_area_struct *vma, struct zone *zone, - unsigned long start, unsigned long end) -{ - pte_t *pte; - spinlock_t *ptl; - - /* - * Initialize pte walk starting at the already pinned page where we - * are sure that there is a pte, as it was pinned under the same - * mmap_lock write op. - */ - pte = get_locked_pte(vma->vm_mm, start, &ptl); - /* Make sure we do not cross the page table boundary */ - end = pgd_addr_end(start, end); - end = p4d_addr_end(start, end); - end = pud_addr_end(start, end); - end = pmd_addr_end(start, end); - - /* The page next to the pinned page is the first we will try to get */ - start += PAGE_SIZE; - while (start < end) { - struct page *page = NULL; - pte++; - if (pte_present(*pte)) - page = vm_normal_page(vma, start, *pte); - /* - * Break if page could not be obtained or the page's node+zone does not - * match - */ - if (!page || page_zone(page) != zone) - break; - - /* - * Do not use pagevec for PTE-mapped THP, - * munlock_vma_pages_range() will handle them. - */ - if (PageTransCompound(page)) - break; - - get_page(page); - /* - * Increase the address that will be returned *before* the - * eventual break due to pvec becoming full by adding the page - */ - start += PAGE_SIZE; - if (pagevec_add(pvec, page) == 0) - break; - } - pte_unmap_unlock(pte, ptl); - return start; } /* @@ -413,75 +136,13 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, * * Returns with VM_LOCKED cleared. Callers must be prepared to * deal with this. - * - * We don't save and restore VM_LOCKED here because pages are - * still on lru. In unmap path, pages might be scanned by reclaim - * and re-mlocked by page_mlock/try_to_unmap before we unmap and - * free them. This will result in freeing mlocked pages. */ void munlock_vma_pages_range(struct vm_area_struct *vma, unsigned long start, unsigned long end) { vma->vm_flags &= VM_LOCKED_CLEAR_MASK; - while (start < end) { - struct page *page; - unsigned int page_mask = 0; - unsigned long page_increm; - struct pagevec pvec; - struct zone *zone; - - pagevec_init(&pvec); - /* - * Although FOLL_DUMP is intended for get_dump_page(), - * it just so happens that its special treatment of the - * ZERO_PAGE (returning an error instead of doing get_page) - * suits munlock very well (and if somehow an abnormal page - * has sneaked into the range, we won't oops here: great). - */ - page = follow_page(vma, start, FOLL_GET | FOLL_DUMP); - - if (page && !IS_ERR(page)) { - if (PageTransTail(page)) { - VM_BUG_ON_PAGE(PageMlocked(page), page); - put_page(page); /* follow_page_mask() */ - } else if (PageTransHuge(page)) { - lock_page(page); - /* - * Any THP page found by follow_page_mask() may - * have gotten split before reaching - * munlock_vma_page(), so we need to compute - * the page_mask here instead. - */ - page_mask = munlock_vma_page(page); - unlock_page(page); - put_page(page); /* follow_page_mask() */ - } else { - /* - * Non-huge pages are handled in batches via - * pagevec. The pin from follow_page_mask() - * prevents them from collapsing by THP. - */ - pagevec_add(&pvec, page); - zone = page_zone(page); - - /* - * Try to fill the rest of pagevec using fast - * pte walk. This will also update start to - * the next page to process. Then munlock the - * pagevec. 
- */ - start = __munlock_pagevec_fill(&pvec, vma, - zone, start, end); - __munlock_pagevec(&pvec, zone); - goto next; - } - } - page_increm = 1 + page_mask; - start += page_increm * PAGE_SIZE; -next: - cond_resched(); - } + /* Reimplementation to follow in later commit */ } /* @@ -645,6 +306,18 @@ static unsigned long count_mm_mlocked_page_nr(struct mm_struct *mm, return count >> PAGE_SHIFT; } +/* + * convert get_user_pages() return value to posix mlock() error + */ +static int __mlock_posix_error_return(long retval) +{ + if (retval == -EFAULT) + retval = -ENOMEM; + else if (retval == -ENOMEM) + retval = -EAGAIN; + return retval; +} + static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t flags) { unsigned long locked; diff --git a/mm/rmap.c b/mm/rmap.c index 6a1e8c7f6213..7ce7f1946cff 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1996,76 +1996,6 @@ void try_to_migrate(struct page *page, enum ttu_flags flags) rmap_walk(page, &rwc); } -/* - * Walks the vma's mapping a page and mlocks the page if any locked vma's are - * found. Once one is found the page is locked and the scan can be terminated. - */ -static bool page_mlock_one(struct page *page, struct vm_area_struct *vma, - unsigned long address, void *unused) -{ - struct page_vma_mapped_walk pvmw = { - .page = page, - .vma = vma, - .address = address, - }; - - /* An un-locked vma doesn't have any pages to lock, continue the scan */ - if (!(vma->vm_flags & VM_LOCKED)) - return true; - - while (page_vma_mapped_walk(&pvmw)) { - /* - * Need to recheck under the ptl to serialise with - * __munlock_pagevec_fill() after VM_LOCKED is cleared in - * munlock_vma_pages_range(). - */ - if (vma->vm_flags & VM_LOCKED) { - /* - * PTE-mapped THP are never marked as mlocked; but - * this function is never called on a DoubleMap THP, - * nor on an Anon THP (which may still be PTE-mapped - * after DoubleMap was cleared). - */ - mlock_vma_page(page); - /* - * No need to scan further once the page is marked - * as mlocked. - */ - page_vma_mapped_walk_done(&pvmw); - return false; - } - } - - return true; -} - -/** - * page_mlock - try to mlock a page - * @page: the page to be mlocked - * - * Called from munlock code. Checks all of the VMAs mapping the page and mlocks - * the page if any are found. The page will be returned with PG_mlocked cleared - * if it is not mapped by any locked vmas. - */ -void page_mlock(struct page *page) -{ - struct rmap_walk_control rwc = { - .rmap_one = page_mlock_one, - .done = page_not_mapped, - .anon_lock = page_lock_anon_vma_read, - - }; - - VM_BUG_ON_PAGE(!PageLocked(page) || PageLRU(page), page); - VM_BUG_ON_PAGE(PageCompound(page) && PageDoubleMap(page), page); - - /* Anon THP are only marked as mlocked when singly mapped */ - if (PageTransCompound(page) && PageAnon(page)) - return; - - rmap_walk(page, &rwc); -} - #ifdef CONFIG_DEVICE_PRIVATE struct make_exclusive_args { struct mm_struct *mm; @@ -2291,11 +2221,6 @@ static struct anon_vma *rmap_walk_anon_lock(struct page *page, * * Find all the mappings of a page using the mapping pointer and the vma chains * contained in the anon_vma struct it points to. - * - * When called from page_mlock(), the mmap_lock of the mm containing the vma - * where the page was found will be held for write. So, we won't recheck - * vm_flags for that VMA. That should be OK, because that vma shouldn't be - * LOCKED. 
*/ static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc, bool locked) @@ -2344,11 +2269,6 @@ static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc, * * Find all the mappings of a page using the mapping pointer and the vma chains * contained in the address_space struct it points to. - * - * When called from page_mlock(), the mmap_lock of the mm containing the vma * where the page was found will be held for write. So, we won't recheck * vm_flags for that VMA. That should be OK, because that vma shouldn't be * LOCKED. */ static void rmap_walk_file(struct page *page, struct rmap_walk_control *rwc, bool locked)

From patchwork Tue Feb 15 02:21:52 2022
X-Patchwork-Submitter: Hugh Dickins X-Patchwork-Id: 12746440
Date: Mon, 14 Feb 2022 18:21:52 -0800 (PST) From: Hugh Dickins To: Andrew Morton cc: Michal Hocko , Vlastimil Babka , "Kirill A. Shutemov" , Matthew Wilcox , David Hildenbrand , Alistair Popple , Johannes Weiner , Rik van Riel , Suren Baghdasaryan , Yu Zhao , Greg Thelen , Shakeel Butt , linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH v2 02/13] mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE
In-Reply-To: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> Message-ID: References: <55a49083-37f9-3766-1de9-9feea7428ac@google.com>

If counting page mlocks, we must not double-count: follow_page_pte() can tell if a page has already been Mlocked or not, but cannot tell if a pte has already been counted or not: that will have to be done when the pte is mapped in (which lru_cache_add_inactive_or_unevictable() already tracks for new anon pages, but there's no such tracking yet for others).

Delete all the FOLL_MLOCK code - faulting in the missing pages will do all that is necessary, without special mlock_vma_page() calls from here.

But then FOLL_POPULATE turns out to serve no purpose - it was there so that its absence would tell faultin_page() not to faultin page when setting up VM_LOCKONFAULT areas; but if there's no special work needed here for mlock, then there's no work at all here for VM_LOCKONFAULT.

Have I got that right? I've not looked into the history, but see that FOLL_POPULATE goes back before VM_LOCKONFAULT: did it serve a different purpose before? Ah, yes, it was used to skip the old stack guard page.

And is it intentional that COW is not broken on existing pages when setting up a VM_LOCKONFAULT area? I can see that being argued either way, and have no reason to disagree with current behaviour.

Signed-off-by: Hugh Dickins Acked-by: Vlastimil Babka --- v2: same as v1.
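As an aside, a minimal userspace sketch of the VM_LOCKONFAULT behaviour discussed above (not part of the patch; assumes Linux 4.4+ and a libc exposing mlock2()): mlock2() with MLOCK_ONFAULT marks the range locked without faulting it in, so a page is only mlocked when first touched, whereas plain mlock() populates and counts the whole range up front.

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		long page = sysconf(_SC_PAGESIZE);
		size_t len = 4 * page;
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;

		/* VM_LOCKONFAULT: nothing is faulted in or counted as Mlocked yet */
		if (mlock2(p, len, MLOCK_ONFAULT))
			perror("mlock2");

		/* first touch faults the page in; only now does Mlocked grow */
		memset(p, 0, page);

		/* contrast: plain mlock() faults in and counts the whole range */
		if (mlock(p, len))
			perror("mlock");

		munlock(p, len);
		munmap(p, len);
		return 0;
	}

Watching Mlocked in /proc/meminfo (or Unevictable) around the mlock2(), memset() and mlock() steps should show the difference between the two locking modes.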
include/linux/mm.h | 2 -- mm/gup.c | 43 ++++++++----------------------------------- mm/huge_memory.c | 33 --------------------------------- 3 files changed, 8 insertions(+), 70 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 213cc569b192..74ee50c2033b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2925,13 +2925,11 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, #define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */ #define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO * and return without waiting upon it */ -#define FOLL_POPULATE 0x40 /* fault in pages (with FOLL_MLOCK) */ #define FOLL_NOFAULT 0x80 /* do not fault in pages */ #define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */ #define FOLL_NUMA 0x200 /* force NUMA hinting page fault */ #define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */ #define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */ -#define FOLL_MLOCK 0x1000 /* lock present pages */ #define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */ #define FOLL_COW 0x4000 /* internal GUP flag */ #define FOLL_ANON 0x8000 /* don't do file mappings */ diff --git a/mm/gup.c b/mm/gup.c index f0af462ac1e2..2076902344d8 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -572,32 +572,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, */ mark_page_accessed(page); } - if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) { - /* Do not mlock pte-mapped THP */ - if (PageTransCompound(page)) - goto out; - - /* - * The preliminary mapping check is mainly to avoid the - * pointless overhead of lock_page on the ZERO_PAGE - * which might bounce very badly if there is contention. - * - * If the page is already locked, we don't need to - * handle it now - vmscan will handle it later if and - * when it attempts to reclaim the page. - */ - if (page->mapping && trylock_page(page)) { - lru_add_drain(); /* push cached pages to LRU */ - /* - * Because we lock page here, and migration is - * blocked by the pte's page reference, and we - * know the page is still mapped, we don't even - * need to check for file-cache page truncation. - */ - mlock_vma_page(page); - unlock_page(page); - } - } out: pte_unmap_unlock(ptep, ptl); return page; @@ -920,9 +894,6 @@ static int faultin_page(struct vm_area_struct *vma, unsigned int fault_flags = 0; vm_fault_t ret; - /* mlock all present pages, but do not fault in new pages */ - if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK) - return -ENOENT; if (*flags & FOLL_NOFAULT) return -EFAULT; if (*flags & FOLL_WRITE) @@ -1173,8 +1144,6 @@ static long __get_user_pages(struct mm_struct *mm, case -ENOMEM: case -EHWPOISON: goto out; - case -ENOENT: - goto next_page; } BUG(); } else if (PTR_ERR(page) == -EEXIST) { @@ -1472,9 +1441,14 @@ long populate_vma_page_range(struct vm_area_struct *vma, VM_BUG_ON_VMA(end > vma->vm_end, vma); mmap_assert_locked(mm); - gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK; + /* + * Rightly or wrongly, the VM_LOCKONFAULT case has never used + * faultin_page() to break COW, so it has no work to do here. + */ if (vma->vm_flags & VM_LOCKONFAULT) - gup_flags &= ~FOLL_POPULATE; + return nr_pages; + + gup_flags = FOLL_TOUCH; /* * We want to touch writable mappings with a write fault in order * to break COW, except for shared mappings because these don't COW @@ -1541,10 +1515,9 @@ long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start, * in the page table. 
* FOLL_HWPOISON: Return -EHWPOISON instead of -EFAULT when we hit * a poisoned page. - * FOLL_POPULATE: Always populate memory with VM_LOCKONFAULT. * !FOLL_FORCE: Require proper access permissions. */ - gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK | FOLL_HWPOISON; + gup_flags = FOLL_TOUCH | FOLL_HWPOISON; if (write) gup_flags |= FOLL_WRITE; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 406a3c28c026..9a34b85ebcf8 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1380,39 +1380,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, if (flags & FOLL_TOUCH) touch_pmd(vma, addr, pmd, flags); - if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) { - /* - * We don't mlock() pte-mapped THPs. This way we can avoid - * leaking mlocked pages into non-VM_LOCKED VMAs. - * - * For anon THP: - * - * In most cases the pmd is the only mapping of the page as we - * break COW for the mlock() -- see gup_flags |= FOLL_WRITE for - * writable private mappings in populate_vma_page_range(). - * - * The only scenario when we have the page shared here is if we - * mlocking read-only mapping shared over fork(). We skip - * mlocking such pages. - * - * For file THP: - * - * We can expect PageDoubleMap() to be stable under page lock: - * for file pages we set it in page_add_file_rmap(), which - * requires page to be locked. - */ - - if (PageAnon(page) && compound_mapcount(page) != 1) - goto skip_mlock; - if (PageDoubleMap(page) || !page->mapping) - goto skip_mlock; - if (!trylock_page(page)) - goto skip_mlock; - if (page->mapping && !PageDoubleMap(page)) - mlock_vma_page(page); - unlock_page(page); - } -skip_mlock: page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT; VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);

From patchwork Tue Feb 15 02:23:29 2022
X-Patchwork-Submitter: Hugh Dickins X-Patchwork-Id: 12746441
Date: Mon, 14 Feb 2022 18:23:29 -0800 (PST) From: Hugh Dickins To: Andrew Morton cc: Michal Hocko , Vlastimil Babka , "Kirill A. Shutemov" , Matthew Wilcox , David Hildenbrand , Alistair Popple , Johannes Weiner , Rik van Riel , Suren Baghdasaryan , Yu Zhao , Greg Thelen , Shakeel Butt , linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH v2 03/13] mm/munlock: delete munlock_vma_pages_all(), allow oomreap
In-Reply-To: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> Message-ID: References: <55a49083-37f9-3766-1de9-9feea7428ac@google.com>

munlock_vma_pages_range() will still be required, when munlocking but not munmapping a set of pages; but when unmapping a pte, the mlock count will be maintained in much the same way as it will be maintained when mapping in the pte. Which removes the need for munlock_vma_pages_all() on mlocked vmas when munmapping or exiting: eliminating the catastrophic contention on i_mmap_rwsem, and the need for page lock on the pages.

There is still a need to update locked_vm accounting according to the munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped(). exit_mmap() does not need locked_vm updates, so delete unlock_range().
And wasn't I the one who forbade the OOM reaper to attack mlocked vmas, because of the uncertainty in blocking on all those page locks? No fear of that now, so permit the OOM reaper on mlocked vmas. Signed-off-by: Hugh Dickins Acked-by: Vlastimil Babka --- v2: same as v1. mm/internal.h | 16 ++-------------- mm/madvise.c | 5 +++++ mm/mlock.c | 4 ++-- mm/mmap.c | 32 ++------------------------------ mm/oom_kill.c | 2 +- 5 files changed, 12 insertions(+), 47 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index e48c486d5ddf..f235aa92e564 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -71,11 +71,6 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma, unsigned long floor, unsigned long ceiling); void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte); -static inline bool can_madv_lru_vma(struct vm_area_struct *vma) -{ - return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP)); -} - struct zap_details; void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma, @@ -398,12 +393,8 @@ extern long populate_vma_page_range(struct vm_area_struct *vma, extern long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start, unsigned long end, bool write, int *locked); -extern void munlock_vma_pages_range(struct vm_area_struct *vma, - unsigned long start, unsigned long end); -static inline void munlock_vma_pages_all(struct vm_area_struct *vma) -{ - munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end); -} +extern int mlock_future_check(struct mm_struct *mm, unsigned long flags, + unsigned long len); /* * must be called with vma's mmap_lock held for read or write, and page locked. @@ -411,9 +402,6 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma) extern void mlock_vma_page(struct page *page); extern void munlock_vma_page(struct page *page); -extern int mlock_future_check(struct mm_struct *mm, unsigned long flags, - unsigned long len); - /* * Clear the page's PageMlocked(). This can be useful in a situation where * we want to unconditionally remove a page from the pagecache -- e.g., diff --git a/mm/madvise.c b/mm/madvise.c index 5604064df464..ae35d72627ef 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -530,6 +530,11 @@ static void madvise_cold_page_range(struct mmu_gather *tlb, tlb_end_vma(tlb, vma); } +static inline bool can_madv_lru_vma(struct vm_area_struct *vma) +{ + return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP)); +} + static long madvise_cold(struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start_addr, unsigned long end_addr) diff --git a/mm/mlock.c b/mm/mlock.c index aec4ce7919da..5d7ced8303be 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -137,8 +137,8 @@ void munlock_vma_page(struct page *page) * Returns with VM_LOCKED cleared. Callers must be prepared to * deal with this. 
*/ -void munlock_vma_pages_range(struct vm_area_struct *vma, - unsigned long start, unsigned long end) +static void munlock_vma_pages_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end) { vma->vm_flags &= VM_LOCKED_CLEAR_MASK; diff --git a/mm/mmap.c b/mm/mmap.c index 1e8fdb0b51ed..64b5985b5295 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2674,6 +2674,8 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma, vma->vm_prev = NULL; do { vma_rb_erase(vma, &mm->mm_rb); + if (vma->vm_flags & VM_LOCKED) + mm->locked_vm -= vma_pages(vma); mm->map_count--; tail_vma = vma; vma = vma->vm_next; @@ -2778,22 +2780,6 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma, return __split_vma(mm, vma, addr, new_below); } -static inline void -unlock_range(struct vm_area_struct *start, unsigned long limit) -{ - struct mm_struct *mm = start->vm_mm; - struct vm_area_struct *tmp = start; - - while (tmp && tmp->vm_start < limit) { - if (tmp->vm_flags & VM_LOCKED) { - mm->locked_vm -= vma_pages(tmp); - munlock_vma_pages_all(tmp); - } - - tmp = tmp->vm_next; - } -} - /* Munmap is split into 2 main parts -- this part which finds * what needs doing, and the areas themselves, which do the * work. This now handles partial unmappings. @@ -2874,12 +2860,6 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len, return error; } - /* - * unlock any mlock()ed ranges before detaching vmas - */ - if (mm->locked_vm) - unlock_range(vma, end); - /* Detach vmas from rbtree */ if (!detach_vmas_to_be_unmapped(mm, vma, prev, end)) downgrade = false; @@ -3147,20 +3127,12 @@ void exit_mmap(struct mm_struct *mm) * Nothing can be holding mm->mmap_lock here and the above call * to mmu_notifier_release(mm) ensures mmu notifier callbacks in * __oom_reap_task_mm() will not block. - * - * This needs to be done before calling unlock_range(), - * which clears VM_LOCKED, otherwise the oom reaper cannot - * reliably test it. 
*/ (void)__oom_reap_task_mm(mm); - set_bit(MMF_OOM_SKIP, &mm->flags); } mmap_write_lock(mm); - if (mm->locked_vm) - unlock_range(mm->mmap, ULONG_MAX); - arch_exit_mmap(mm); vma = mm->mmap; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 832fb330376e..6b875acabd1e 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -526,7 +526,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm) set_bit(MMF_UNSTABLE, &mm->flags); for (vma = mm->mmap ; vma; vma = vma->vm_next) { - if (!can_madv_lru_vma(vma)) + if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP)) continue; /*

From patchwork Tue Feb 15 02:26:39 2022
X-Patchwork-Submitter: Hugh Dickins X-Patchwork-Id: 12746442
Date: Mon, 14 Feb 2022 18:26:39 -0800 (PST) From: Hugh Dickins To: Andrew Morton cc: Michal Hocko , Vlastimil Babka , "Kirill A. Shutemov" , Matthew Wilcox , David Hildenbrand , Alistair Popple , Johannes Weiner , Rik van Riel , Suren Baghdasaryan , Yu Zhao , Greg Thelen , Shakeel Butt , Yang Li , linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH v2 04/13] mm/munlock: rmap call mlock_vma_page() munlock_vma_page()
In-Reply-To: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> Message-ID: <501673c-a5a-6c5f-ab65-38545dfb723d@google.com> References: <55a49083-37f9-3766-1de9-9feea7428ac@google.com>

Add vma argument to mlock_vma_page() and munlock_vma_page(), make them inline functions which check (vma->vm_flags & VM_LOCKED) before calling mlock_page() and munlock_page() in mm/mlock.c.

Add bool compound to mlock_vma_page() and munlock_vma_page(): this is because we have understandable difficulty in accounting pte maps of THPs, and if passed a PageHead page, mlock_page() and munlock_page() cannot tell whether it's a pmd map to be counted or a pte map to be ignored.

Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the others, and use that to call mlock_vma_page() at the end of the page adds, and munlock_vma_page() at the end of page_remove_rmap() (end or beginning? unimportant, but end was easier for assertions in testing).

No page lock is required (although almost all adds happen to hold it): delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s. Certainly page lock did serialize with page migration, but I'm having difficulty explaining why that was ever important.

Mlock accounting on THPs has been hard to define, differed between anon and file, involved PageDoubleMap in some places and not others, required clear_page_mlock() at some points. Keep it simple now: just count the pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.

page_add_new_anon_rmap() callers unchanged: they have long been calling lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED handling (it also checks for not VM_SPECIAL: I think that's overcautious, and inconsistent with other checks, that mmap_region() already prevents VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it).

Signed-off-by: Hugh Dickins Acked-by: Vlastimil Babka --- v2: Fix kerneldoc on mlock_page() to use colon, per Yang Li. mmotm will require minor fixup in mm/huge_memory.c like v1.
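The check in the new inline wrappers is small enough to model on its own. Below is a minimal userspace sketch of the "count pmd maps, ignore pte maps of THPs" rule, with mock types standing in for struct page and struct vm_area_struct; it is an illustration, not kernel code.

	#include <stdbool.h>
	#include <stdio.h>

	struct model_page { bool thp; };	/* stands in for PageTransCompound() */
	struct model_vma { bool vm_locked; };	/* stands in for vma->vm_flags & VM_LOCKED */

	/* mirrors the condition in the new inline mlock_vma_page()/munlock_vma_page() */
	static bool mlock_counted(const struct model_page *page,
				  const struct model_vma *vma, bool compound)
	{
		return vma->vm_locked && (compound || !page->thp);
	}

	int main(void)
	{
		struct model_vma locked = { .vm_locked = true };
		struct model_page small = { .thp = false };
		struct model_page thp = { .thp = true };

		printf("small page, pte map: %d\n", mlock_counted(&small, &locked, false)); /* 1 */
		printf("THP head, pmd map:   %d\n", mlock_counted(&thp, &locked, true));    /* 1 */
		printf("THP, pte map:        %d\n", mlock_counted(&thp, &locked, false));   /* 0 */
		return 0;
	}

Only the first two cases would reach mlock_page()/munlock_page(), which is what keeps NR_MLOCK accounting at pmd granularity for THPs.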
include/linux/rmap.h | 17 +++++++------ kernel/events/uprobes.c | 7 ++---- mm/huge_memory.c | 17 ++++++------- mm/hugetlb.c | 4 +-- mm/internal.h | 36 ++++++++++++++++++++++---- mm/khugepaged.c | 4 +-- mm/ksm.c | 12 +-------- mm/memory.c | 45 +++++++++++---------------------- mm/migrate.c | 9 ++----- mm/mlock.c | 21 ++++++---------- mm/rmap.c | 56 +++++++++++++++++++---------------------- mm/userfaultfd.c | 14 ++++++----- 12 files changed, 113 insertions(+), 129 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index dc48aa8c2c94..ac29b076082b 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -167,18 +167,19 @@ struct anon_vma *page_get_anon_vma(struct page *page); */ void page_move_anon_rmap(struct page *, struct vm_area_struct *); void page_add_anon_rmap(struct page *, struct vm_area_struct *, - unsigned long, bool); + unsigned long address, bool compound); void do_page_add_anon_rmap(struct page *, struct vm_area_struct *, - unsigned long, int); + unsigned long address, int flags); void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, - unsigned long, bool); -void page_add_file_rmap(struct page *, bool); -void page_remove_rmap(struct page *, bool); - + unsigned long address, bool compound); +void page_add_file_rmap(struct page *, struct vm_area_struct *, + bool compound); +void page_remove_rmap(struct page *, struct vm_area_struct *, + bool compound); void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *, - unsigned long); + unsigned long address); void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *, - unsigned long); + unsigned long address); static inline void page_dup_rmap(struct page *page, bool compound) { diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 6357c3580d07..eed2f7437d96 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -173,7 +173,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, return err; } - /* For try_to_free_swap() and munlock_vma_page() below */ + /* For try_to_free_swap() below */ lock_page(old_page); mmu_notifier_invalidate_range_start(&range); @@ -201,13 +201,10 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, set_pte_at_notify(mm, addr, pvmw.pte, mk_pte(new_page, vma->vm_page_prot)); - page_remove_rmap(old_page, false); + page_remove_rmap(old_page, vma, false); if (!page_mapped(old_page)) try_to_free_swap(old_page); page_vma_mapped_walk_done(&pvmw); - - if ((vma->vm_flags & VM_LOCKED) && !PageCompound(old_page)) - munlock_vma_page(old_page); put_page(old_page); err = 0; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 9a34b85ebcf8..d6477f48a27e 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1577,7 +1577,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, if (pmd_present(orig_pmd)) { page = pmd_page(orig_pmd); - page_remove_rmap(page, true); + page_remove_rmap(page, vma, true); VM_BUG_ON_PAGE(page_mapcount(page) < 0, page); VM_BUG_ON_PAGE(!PageHead(page), page); } else if (thp_migration_supported()) { @@ -1962,7 +1962,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, set_page_dirty(page); if (!PageReferenced(page) && pmd_young(old_pmd)) SetPageReferenced(page); - page_remove_rmap(page, true); + page_remove_rmap(page, vma, true); put_page(page); } add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR); @@ -2096,6 +2096,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, } } 
unlock_page_memcg(page); + + /* Above is effectively page_remove_rmap(page, vma, true) */ + munlock_vma_page(page, vma, true); } smp_wmb(); /* make pte visible before pmd */ @@ -2103,7 +2106,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, if (freeze) { for (i = 0; i < HPAGE_PMD_NR; i++) { - page_remove_rmap(page + i, false); + page_remove_rmap(page + i, vma, false); put_page(page + i); } } @@ -2163,8 +2166,6 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, do_unlock_page = true; } } - if (PageMlocked(page)) - clear_page_mlock(page); } else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd))) goto out; __split_huge_pmd_locked(vma, pmd, range.start, freeze); @@ -3138,7 +3139,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw, if (pmd_soft_dirty(pmdval)) pmdswp = pmd_swp_mksoft_dirty(pmdswp); set_pmd_at(mm, address, pvmw->pmd, pmdswp); - page_remove_rmap(page, true); + page_remove_rmap(page, vma, true); put_page(page); } @@ -3168,10 +3169,8 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new) if (PageAnon(new)) page_add_anon_rmap(new, vma, mmun_start, true); else - page_add_file_rmap(new, true); + page_add_file_rmap(new, vma, true); set_pmd_at(mm, mmun_start, pvmw->pmd, pmde); - if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new)) - mlock_vma_page(new); update_mmu_cache_pmd(vma, address, pvmw->pmd); } #endif diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 61895cc01d09..43fb3155298e 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5014,7 +5014,7 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct set_page_dirty(page); hugetlb_count_sub(pages_per_huge_page(h), mm); - page_remove_rmap(page, true); + page_remove_rmap(page, vma, true); spin_unlock(ptl); tlb_remove_page_size(tlb, page, huge_page_size(h)); @@ -5259,7 +5259,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, /* Break COW */ huge_ptep_clear_flush(vma, haddr, ptep); mmu_notifier_invalidate_range(mm, range.start, range.end); - page_remove_rmap(old_page, true); + page_remove_rmap(old_page, vma, true); hugepage_add_new_anon_rmap(new_page, vma, haddr); set_huge_pte_at(mm, haddr, ptep, make_huge_pte(vma, new_page, 1)); diff --git a/mm/internal.h b/mm/internal.h index f235aa92e564..3d7dfc8bc471 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -395,12 +395,35 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma, bool write, int *locked); extern int mlock_future_check(struct mm_struct *mm, unsigned long flags, unsigned long len); - /* - * must be called with vma's mmap_lock held for read or write, and page locked. + * mlock_vma_page() and munlock_vma_page(): + * should be called with vma's mmap_lock held for read or write, + * under page table lock for the pte/pmd being added or removed. + * + * mlock is usually called at the end of page_add_*_rmap(), + * munlock at the end of page_remove_rmap(); but new anon + * pages are managed in lru_cache_add_inactive_or_unevictable(). + * + * @compound is used to include pmd mappings of THPs, but filter out + * pte mappings of THPs, which cannot be consistently counted: a pte + * mapping of the THP head cannot be distinguished by the page alone. 
*/ -extern void mlock_vma_page(struct page *page); -extern void munlock_vma_page(struct page *page); +void mlock_page(struct page *page); +static inline void mlock_vma_page(struct page *page, + struct vm_area_struct *vma, bool compound) +{ + if (unlikely(vma->vm_flags & VM_LOCKED) && + (compound || !PageTransCompound(page))) + mlock_page(page); +} +void munlock_page(struct page *page); +static inline void munlock_vma_page(struct page *page, + struct vm_area_struct *vma, bool compound) +{ + if (unlikely(vma->vm_flags & VM_LOCKED) && + (compound || !PageTransCompound(page))) + munlock_page(page); +} /* * Clear the page's PageMlocked(). This can be useful in a situation where @@ -487,7 +510,10 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf, #else /* !CONFIG_MMU */ static inline void unmap_mapping_folio(struct folio *folio) { } static inline void clear_page_mlock(struct page *page) { } -static inline void mlock_vma_page(struct page *page) { } +static inline void mlock_vma_page(struct page *page, + struct vm_area_struct *vma, bool compound) { } +static inline void munlock_vma_page(struct page *page, + struct vm_area_struct *vma, bool compound) { } static inline void vunmap_range_noflush(unsigned long start, unsigned long end) { } diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 35f14d0a00a6..d5e387c58bde 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -773,7 +773,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page, */ spin_lock(ptl); ptep_clear(vma->vm_mm, address, _pte); - page_remove_rmap(src_page, false); + page_remove_rmap(src_page, vma, false); spin_unlock(ptl); free_page_and_swap_cache(src_page); } @@ -1497,7 +1497,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) if (pte_none(*pte)) continue; page = vm_normal_page(vma, addr, *pte); - page_remove_rmap(page, false); + page_remove_rmap(page, vma, false); } pte_unmap_unlock(start_pte, ptl); diff --git a/mm/ksm.c b/mm/ksm.c index c20bd4d9a0d9..c5a4403b5dc9 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1177,7 +1177,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, ptep_clear_flush(vma, addr, ptep); set_pte_at_notify(mm, addr, ptep, newpte); - page_remove_rmap(page, false); + page_remove_rmap(page, vma, false); if (!page_mapped(page)) try_to_free_swap(page); put_page(page); @@ -1252,16 +1252,6 @@ static int try_to_merge_one_page(struct vm_area_struct *vma, err = replace_page(vma, page, kpage, orig_pte); } - if ((vma->vm_flags & VM_LOCKED) && kpage && !err) { - munlock_vma_page(page); - if (!PageMlocked(kpage)) { - unlock_page(page); - lock_page(kpage); - mlock_vma_page(kpage); - page = kpage; /* for final unlock */ - } - } - out_unlock: unlock_page(page); out: diff --git a/mm/memory.c b/mm/memory.c index c125c4969913..53bd9e5f2e33 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -735,9 +735,6 @@ static void restore_exclusive_pte(struct vm_area_struct *vma, set_pte_at(vma->vm_mm, address, ptep, pte); - if (vma->vm_flags & VM_LOCKED) - mlock_vma_page(page); - /* * No need to invalidate - it was non-present before. However * secondary CPUs may have mappings that need invalidating. 
@@ -1377,7 +1374,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, mark_page_accessed(page); } rss[mm_counter(page)]--; - page_remove_rmap(page, false); + page_remove_rmap(page, vma, false); if (unlikely(page_mapcount(page) < 0)) print_bad_pte(vma, addr, ptent, page); if (unlikely(__tlb_remove_page(tlb, page))) { @@ -1397,10 +1394,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, continue; pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); rss[mm_counter(page)]--; - if (is_device_private_entry(entry)) - page_remove_rmap(page, false); - + page_remove_rmap(page, vma, false); put_page(page); continue; } @@ -1753,16 +1748,16 @@ static int validate_page_before_insert(struct page *page) return 0; } -static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte, +static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, unsigned long addr, struct page *page, pgprot_t prot) { if (!pte_none(*pte)) return -EBUSY; /* Ok, finally just insert the thing.. */ get_page(page); - inc_mm_counter_fast(mm, mm_counter_file(page)); - page_add_file_rmap(page, false); - set_pte_at(mm, addr, pte, mk_pte(page, prot)); + inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page)); + page_add_file_rmap(page, vma, false); + set_pte_at(vma->vm_mm, addr, pte, mk_pte(page, prot)); return 0; } @@ -1776,7 +1771,6 @@ static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte, static int insert_page(struct vm_area_struct *vma, unsigned long addr, struct page *page, pgprot_t prot) { - struct mm_struct *mm = vma->vm_mm; int retval; pte_t *pte; spinlock_t *ptl; @@ -1785,17 +1779,17 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr, if (retval) goto out; retval = -ENOMEM; - pte = get_locked_pte(mm, addr, &ptl); + pte = get_locked_pte(vma->vm_mm, addr, &ptl); if (!pte) goto out; - retval = insert_page_into_pte_locked(mm, pte, addr, page, prot); + retval = insert_page_into_pte_locked(vma, pte, addr, page, prot); pte_unmap_unlock(pte, ptl); out: return retval; } #ifdef pte_index -static int insert_page_in_batch_locked(struct mm_struct *mm, pte_t *pte, +static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte, unsigned long addr, struct page *page, pgprot_t prot) { int err; @@ -1805,7 +1799,7 @@ static int insert_page_in_batch_locked(struct mm_struct *mm, pte_t *pte, err = validate_page_before_insert(page); if (err) return err; - return insert_page_into_pte_locked(mm, pte, addr, page, prot); + return insert_page_into_pte_locked(vma, pte, addr, page, prot); } /* insert_pages() amortizes the cost of spinlock operations @@ -1842,7 +1836,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr, start_pte = pte_offset_map_lock(mm, pmd, addr, &pte_lock); for (pte = start_pte; pte_idx < batch_size; ++pte, ++pte_idx) { - int err = insert_page_in_batch_locked(mm, pte, + int err = insert_page_in_batch_locked(vma, pte, addr, pages[curr_page_idx], prot); if (unlikely(err)) { pte_unmap_unlock(start_pte, pte_lock); @@ -3098,7 +3092,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) * mapcount is visible. So transitively, TLBs to * old page will be flushed before it can be reused. */ - page_remove_rmap(old_page, false); + page_remove_rmap(old_page, vma, false); } /* Free the old page.. 
*/ @@ -3118,16 +3112,6 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) */ mmu_notifier_invalidate_range_only_end(&range); if (old_page) { - /* - * Don't let another task, with possibly unlocked vma, - * keep the mlocked page. - */ - if (page_copied && (vma->vm_flags & VM_LOCKED)) { - lock_page(old_page); /* LRU manipulation */ - if (PageMlocked(old_page)) - munlock_vma_page(old_page); - unlock_page(old_page); - } if (page_copied) free_swap_cache(old_page); put_page(old_page); @@ -3947,7 +3931,8 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); add_mm_counter(vma->vm_mm, mm_counter_file(page), HPAGE_PMD_NR); - page_add_file_rmap(page, true); + page_add_file_rmap(page, vma, true); + /* * deposit and withdraw with pmd lock held */ @@ -3996,7 +3981,7 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr) lru_cache_add_inactive_or_unevictable(page, vma); } else { inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page)); - page_add_file_rmap(page, false); + page_add_file_rmap(page, vma, false); } set_pte_at(vma->vm_mm, addr, vmf->pte, entry); } diff --git a/mm/migrate.c b/mm/migrate.c index c7da064b4781..7c4223ce2500 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -248,14 +248,9 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, if (PageAnon(new)) page_add_anon_rmap(new, vma, pvmw.address, false); else - page_add_file_rmap(new, false); + page_add_file_rmap(new, vma, false); set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); } - if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new)) - mlock_vma_page(new); - - if (PageTransHuge(page) && PageMlocked(page)) - clear_page_mlock(page); /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, pvmw.address, pvmw.pte); @@ -2331,7 +2326,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, * drop page refcount. Page won't be freed, as we took * a reference just above. */ - page_remove_rmap(page, false); + page_remove_rmap(page, vma, false); put_page(page); if (pte_present(pte)) diff --git a/mm/mlock.c b/mm/mlock.c index 5d7ced8303be..92f28258b4ae 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -78,17 +78,13 @@ void clear_page_mlock(struct page *page) } } -/* - * Mark page as mlocked if not already. - * If page on LRU, isolate and putback to move to unevictable list. +/** + * mlock_page - mlock a page + * @page: page to be mlocked, either a normal page or a THP head. */ -void mlock_vma_page(struct page *page) +void mlock_page(struct page *page) { - /* Serialize with page migration */ - BUG_ON(!PageLocked(page)); - VM_BUG_ON_PAGE(PageTail(page), page); - VM_BUG_ON_PAGE(PageCompound(page) && PageDoubleMap(page), page); if (!TestSetPageMlocked(page)) { int nr_pages = thp_nr_pages(page); @@ -101,14 +97,11 @@ void mlock_vma_page(struct page *page) } /** - * munlock_vma_page - munlock a vma page - * @page: page to be unlocked, either a normal page or THP page head + * munlock_page - munlock a page + * @page: page to be munlocked, either a normal page or a THP head. 
*/ -void munlock_vma_page(struct page *page) +void munlock_page(struct page *page) { - /* Serialize with page migration */ - BUG_ON(!PageLocked(page)); - VM_BUG_ON_PAGE(PageTail(page), page); if (TestClearPageMlocked(page)) { diff --git a/mm/rmap.c b/mm/rmap.c index 7ce7f1946cff..6cc8bf129f18 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1181,17 +1181,17 @@ void do_page_add_anon_rmap(struct page *page, __mod_lruvec_page_state(page, NR_ANON_MAPPED, nr); } - if (unlikely(PageKsm(page))) { + if (unlikely(PageKsm(page))) unlock_page_memcg(page); - return; - } /* address might be in next vma when migration races vma_adjust */ - if (first) + else if (first) __page_set_anon_rmap(page, vma, address, flags & RMAP_EXCLUSIVE); else __page_check_anon_rmap(page, vma, address); + + mlock_vma_page(page, vma, compound); } /** @@ -1232,12 +1232,14 @@ void page_add_new_anon_rmap(struct page *page, /** * page_add_file_rmap - add pte mapping to a file page - * @page: the page to add the mapping to - * @compound: charge the page as compound or small page + * @page: the page to add the mapping to + * @vma: the vm area in which the mapping is added + * @compound: charge the page as compound or small page * * The caller needs to hold the pte lock. */ -void page_add_file_rmap(struct page *page, bool compound) +void page_add_file_rmap(struct page *page, + struct vm_area_struct *vma, bool compound) { int i, nr = 1; @@ -1260,13 +1262,8 @@ void page_add_file_rmap(struct page *page, bool compound) nr_pages); } else { if (PageTransCompound(page) && page_mapping(page)) { - struct page *head = compound_head(page); - VM_WARN_ON_ONCE(!PageLocked(page)); - - SetPageDoubleMap(head); - if (PageMlocked(page)) - clear_page_mlock(head); + SetPageDoubleMap(compound_head(page)); } if (!atomic_inc_and_test(&page->_mapcount)) goto out; @@ -1274,6 +1271,8 @@ void page_add_file_rmap(struct page *page, bool compound) __mod_lruvec_page_state(page, NR_FILE_MAPPED, nr); out: unlock_page_memcg(page); + + mlock_vma_page(page, vma, compound); } static void page_remove_file_rmap(struct page *page, bool compound) @@ -1368,11 +1367,13 @@ static void page_remove_anon_compound_rmap(struct page *page) /** * page_remove_rmap - take down pte mapping from a page * @page: page to remove mapping from + * @vma: the vm area from which the mapping is removed * @compound: uncharge the page as compound or small page * * The caller needs to hold the pte lock. */ -void page_remove_rmap(struct page *page, bool compound) +void page_remove_rmap(struct page *page, + struct vm_area_struct *vma, bool compound) { lock_page_memcg(page); @@ -1414,6 +1415,8 @@ void page_remove_rmap(struct page *page, bool compound) */ out: unlock_page_memcg(page); + + munlock_vma_page(page, vma, compound); } /* @@ -1469,28 +1472,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, mmu_notifier_invalidate_range_start(&range); while (page_vma_mapped_walk(&pvmw)) { + /* Unexpected PMD-mapped THP? */ + VM_BUG_ON_PAGE(!pvmw.pte, page); + /* - * If the page is mlock()d, we cannot swap it out. + * If the page is in an mlock()d vma, we must not swap it out. */ if (!(flags & TTU_IGNORE_MLOCK) && (vma->vm_flags & VM_LOCKED)) { - /* - * PTE-mapped THP are never marked as mlocked: so do - * not set it on a DoubleMap THP, nor on an Anon THP - * (which may still be PTE-mapped after DoubleMap was - * cleared). But stop unmapping even in those cases. 
- */ - if (!PageTransCompound(page) || (PageHead(page) && - !PageDoubleMap(page) && !PageAnon(page))) - mlock_vma_page(page); + /* Restore the mlock which got missed */ + mlock_vma_page(page, vma, false); page_vma_mapped_walk_done(&pvmw); ret = false; break; } - /* Unexpected PMD-mapped THP? */ - VM_BUG_ON_PAGE(!pvmw.pte, page); - subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); address = pvmw.address; @@ -1668,7 +1664,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * * See Documentation/vm/mmu_notifier.rst */ - page_remove_rmap(subpage, PageHuge(page)); + page_remove_rmap(subpage, vma, PageHuge(page)); put_page(page); } @@ -1942,7 +1938,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, * * See Documentation/vm/mmu_notifier.rst */ - page_remove_rmap(subpage, PageHuge(page)); + page_remove_rmap(subpage, vma, PageHuge(page)); put_page(page); } @@ -2078,7 +2074,7 @@ static bool page_make_device_exclusive_one(struct page *page, * There is a reference on the page for the swap entry which has * been removed, so shouldn't take another. */ - page_remove_rmap(subpage, false); + page_remove_rmap(subpage, vma, false); } mmu_notifier_invalidate_range_end(&range); diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 0780c2a57ff1..15d3e97a6e04 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -95,10 +95,15 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, if (!pte_none(*dst_pte)) goto out_unlock; - if (page_in_cache) - page_add_file_rmap(page, false); - else + if (page_in_cache) { + /* Usually, cache pages are already added to LRU */ + if (newly_allocated) + lru_cache_add(page); + page_add_file_rmap(page, dst_vma, false); + } else { page_add_new_anon_rmap(page, dst_vma, dst_addr, false); + lru_cache_add_inactive_or_unevictable(page, dst_vma); + } /* * Must happen after rmap, as mm_counter() checks mapping (via @@ -106,9 +111,6 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, */ inc_mm_counter(dst_mm, mm_counter(page)); - if (newly_allocated) - lru_cache_add_inactive_or_unevictable(page, dst_vma); - set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); /* No need to invalidate - it was non-present before */ From patchwork Tue Feb 15 02:28:05 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Hugh Dickins X-Patchwork-Id: 12746443 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 187A5C433EF for ; Tue, 15 Feb 2022 02:28:11 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 860DB6B0078; Mon, 14 Feb 2022 21:28:10 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 7E8F06B007B; Mon, 14 Feb 2022 21:28:10 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 663496B007D; Mon, 14 Feb 2022 21:28:10 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0134.hostedemail.com [216.40.44.134]) by kanga.kvack.org (Postfix) with ESMTP id 539AA6B0078 for ; Mon, 14 Feb 2022 21:28:10 -0500 (EST) Received: from smtpin23.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 0CE62180ACF6C for ; Tue, 15 Feb 2022 02:28:10 +0000 (UTC) X-FDA: 79143429540.23.41FF647 Received: from 
mail-qv1-f54.google.com (mail-qv1-f54.google.com [209.85.219.54]) by imf19.hostedemail.com (Postfix) with ESMTP id 9BD021A0004 for ; Tue, 15 Feb 2022 02:28:09 +0000 (UTC) Received: by mail-qv1-f54.google.com with SMTP id c14so16419461qvl.12 for ; Mon, 14 Feb 2022 18:28:09 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:from:to:cc:subject:in-reply-to:message-id:references :mime-version; bh=WrirEeDxktGoFoKuWopzdOwWJ3BwcY0ugVm1nQfgOwo=; b=Z1l+Sif++U+qod3cX1xih6XIWKDMOr7jF2YtjAmActgSqsfZ3qw5UCyIsJW4Z9W4oD 0YJ0AGOmJVQbtriCIcVPOb1k5ri1y+mY44CRhFA3XPIZODSUhlPOa4/BLszFJ3I8jCCg EvrFjA+/oTU5vP8c0lrqk/1usT4My2LzlNCzY8QxP7B5aQe7FtFgJBKR3k2BUSkNlGuA 9ARTC7WLTN1ruwxCe52wL3Ryu0hOGs2vK7Nn/3xT7xbMk9zBjfl+muQ6pox2EWVJwL6z i43JwUVzleMcaubKf0DkGT/RExiycL+gOQtsGxzXspA4dkJNdCI34GwGQmAPUq/oKj+W ZTug== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:in-reply-to:message-id :references:mime-version; bh=WrirEeDxktGoFoKuWopzdOwWJ3BwcY0ugVm1nQfgOwo=; b=psp/UrmTo8vf8rErnYZK1x1/rqc2LD3qVDFTVneM61OqMD8lQ+iC4qo4FtQKaCAlFw oUe8hWoMCmzibeLiC2E1OjIIiIPC3IVRwPgbVxQwjWLR/opSREgXVyMsWTD+0lDwoYhM vkoLIIAJYIozhuuIXBqltDNq3X9gJOcakyvIv+F9nyf6ilDqLENEmyHo9KPtoZBbzNhA rqbxVvR96pJKM+n/Km/+Cylq6NLfOCQ1g78LHIZOJprW7PdG+8ocNSxEIWp9Dq1Sc+tp yil84PAgWqULYNgUPq8SSDtOEhCO4FoQBGj3qpvwsO/KwjOI3ouskQzB7SgyygFn3k5j 7dGA== X-Gm-Message-State: AOAM532pBlPa7zYq5EY1drGHdSMGcMpI0FU2j1foDBFSTfrVKpZpVlIo ze8DZroySoXVwIr/zYiz5c3NQg== X-Google-Smtp-Source: ABdhPJyUzoK9gED/txNJoOCGRekNw9dzEp4dB3JogIqFN0ZOacHeCuqePDmPX+2O3Rm1Jqc+X9CUyQ== X-Received: by 2002:a0c:aa10:: with SMTP id d16mr1081983qvb.71.1644892088670; Mon, 14 Feb 2022 18:28:08 -0800 (PST) Received: from ripple.attlocal.net (172-10-233-147.lightspeed.sntcca.sbcglobal.net. [172.10.233.147]) by smtp.gmail.com with ESMTPSA id v19sm17163332qkp.131.2022.02.14.18.28.06 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 14 Feb 2022 18:28:07 -0800 (PST) Date: Mon, 14 Feb 2022 18:28:05 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@ripple.anvils To: Andrew Morton cc: Michal Hocko , Vlastimil Babka , "Kirill A. Shutemov" , Matthew Wilcox , David Hildenbrand , Alistair Popple , Johannes Weiner , Rik van Riel , Suren Baghdasaryan , Yu Zhao , Greg Thelen , Shakeel Butt , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v2 05/13] mm/munlock: replace clear_page_mlock() by final clearance In-Reply-To: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> Message-ID: References: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> MIME-Version: 1.0 X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: 9BD021A0004 X-Stat-Signature: 47ohskptzkuu6z3fi3mbfwdixt5hbdmm X-Rspam-User: Authentication-Results: imf19.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=Z1l+Sif+; spf=pass (imf19.hostedemail.com: domain of hughd@google.com designates 209.85.219.54 as permitted sender) smtp.mailfrom=hughd@google.com; dmarc=pass (policy=reject) header.from=google.com X-HE-Tag: 1644892089-512801 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Placing munlock_vma_page() at the end of page_remove_rmap() shifts most of the munlocking to clear_page_mlock(), since PageMlocked is typically still set when mapcount has fallen to 0. 
That is not what we want: we want /proc/vmstat's unevictable_pgs_cleared to remain as a useful check on the integrity of the mlock/munlock protocol - small numbers are not surprising, but big numbers mean the protocol is not working. That could be easily fixed by placing munlock_vma_page() at the start of page_remove_rmap(); but later in the series we shall want to batch the munlocking, and that too would tend to leave PageMlocked still set at the point when it is checked. So delete clear_page_mlock() now: leave it instead to release_pages() (and __page_cache_release()) to do this backstop clearing of Mlocked, when page refcount has fallen to 0. If a pinned page occasionally gets counted as Mlocked and Unevictable until it is unpinned, that's okay. A slightly regrettable side-effect of this change is that, since release_pages() and __page_cache_release() may be called at interrupt time, those places which update NR_MLOCK with interrupts enabled had better use mod_zone_page_state() than __mod_zone_page_state() (but holding the lruvec lock always has interrupts disabled). This change, forcing Mlocked off when refcount 0 instead of earlier when mapcount 0, is not fundamental: it can be reversed if performance or something else is found to suffer; but this is the easiest way to separate the stats - let's not complicate that without good reason. Signed-off-by: Hugh Dickins Acked-by: Vlastimil Babka --- v2: same as v1. mm/internal.h | 12 ------------ mm/mlock.c | 30 ------------------------------ mm/rmap.c | 9 --------- mm/swap.c | 32 ++++++++++++++++++++++++++-------- 4 files changed, 24 insertions(+), 59 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 3d7dfc8bc471..a43d79335c16 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -425,17 +425,6 @@ static inline void munlock_vma_page(struct page *page, munlock_page(page); } -/* - * Clear the page's PageMlocked(). This can be useful in a situation where - * we want to unconditionally remove a page from the pagecache -- e.g., - * on truncation or freeing. - * - * It is legal to call this function for any page, mlocked or not. - * If called for a page that is still mapped by mlocked vmas, all we do - * is revert to lazy LRU behaviour -- semantics are not broken. - */ -extern void clear_page_mlock(struct page *page); - extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma); /* @@ -509,7 +498,6 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf, } #else /* !CONFIG_MMU */ static inline void unmap_mapping_folio(struct folio *folio) { } -static inline void clear_page_mlock(struct page *page) { } static inline void mlock_vma_page(struct page *page, struct vm_area_struct *vma, bool compound) { } static inline void munlock_vma_page(struct page *page, diff --git a/mm/mlock.c b/mm/mlock.c index 92f28258b4ae..3c26473050a3 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -48,36 +48,6 @@ EXPORT_SYMBOL(can_do_mlock); * PageUnevictable is set to indicate the unevictable state. */ -/* - * LRU accounting for clear_page_mlock() - */ -void clear_page_mlock(struct page *page) -{ - int nr_pages; - - if (!TestClearPageMlocked(page)) - return; - - nr_pages = thp_nr_pages(page); - mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); - count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages); - /* - * The previous TestClearPageMlocked() corresponds to the smp_mb() - * in __pagevec_lru_add_fn(). - * - * See __pagevec_lru_add_fn for more explanation.
- */ - if (!isolate_lru_page(page)) { - putback_lru_page(page); - } else { - /* - * We lost the race. the page already moved to evictable list. - */ - if (PageUnevictable(page)) - count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages); - } -} - /** * mlock_page - mlock a page * @page: page to be mlocked, either a normal page or a THP head. diff --git a/mm/rmap.c b/mm/rmap.c index 6cc8bf129f18..5442a5c97a85 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1315,9 +1315,6 @@ static void page_remove_file_rmap(struct page *page, bool compound) * pte lock(a spinlock) is held, which implies preemption disabled. */ __mod_lruvec_page_state(page, NR_FILE_MAPPED, -nr); - - if (unlikely(PageMlocked(page))) - clear_page_mlock(page); } static void page_remove_anon_compound_rmap(struct page *page) @@ -1357,9 +1354,6 @@ static void page_remove_anon_compound_rmap(struct page *page) nr = thp_nr_pages(page); } - if (unlikely(PageMlocked(page))) - clear_page_mlock(page); - if (nr) __mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr); } @@ -1398,9 +1392,6 @@ void page_remove_rmap(struct page *page, */ __dec_lruvec_page_state(page, NR_ANON_MAPPED); - if (unlikely(PageMlocked(page))) - clear_page_mlock(page); - if (PageTransCompound(page)) deferred_split_huge_page(compound_head(page)); diff --git a/mm/swap.c b/mm/swap.c index bcf3ac288b56..ff4810e4a4bc 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -74,8 +74,8 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = { }; /* - * This path almost never happens for VM activity - pages are normally - * freed via pagevecs. But it gets used by networking. + * This path almost never happens for VM activity - pages are normally freed + * via pagevecs. But it gets used by networking - and for compound pages. */ static void __page_cache_release(struct page *page) { @@ -89,6 +89,14 @@ static void __page_cache_release(struct page *page) __clear_page_lru_flags(page); unlock_page_lruvec_irqrestore(lruvec, flags); } + /* See comment on PageMlocked in release_pages() */ + if (unlikely(PageMlocked(page))) { + int nr_pages = thp_nr_pages(page); + + __ClearPageMlocked(page); + mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); + count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages); + } __ClearPageWaiters(page); } @@ -489,12 +497,8 @@ void lru_cache_add_inactive_or_unevictable(struct page *page, unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED; if (unlikely(unevictable) && !TestSetPageMlocked(page)) { int nr_pages = thp_nr_pages(page); - /* - * We use the irq-unsafe __mod_zone_page_state because this - * counter is not modified from interrupt context, and the pte - * lock is held(spinlock), which implies preemption disabled. - */ - __mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages); + + mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages); count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages); } lru_cache_add(page); @@ -969,6 +973,18 @@ void release_pages(struct page **pages, int nr) __clear_page_lru_flags(page); } + /* + * In rare cases, when truncation or holepunching raced with + * munlock after VM_LOCKED was cleared, Mlocked may still be + * found set here. This does not indicate a problem, unless + * "unevictable_pgs_cleared" appears worryingly large. 
+ */ + if (unlikely(PageMlocked(page))) { + __ClearPageMlocked(page); + dec_zone_page_state(page, NR_MLOCK); + count_vm_event(UNEVICTABLE_PGCLEARED); + } + __ClearPageWaiters(page); list_add(&page->lru, &pages_to_free); From patchwork Tue Feb 15 02:29:54 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Hugh Dickins X-Patchwork-Id: 12746445 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2BB8DC433EF for ; Tue, 15 Feb 2022 02:30:00 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id B1E7B6B0078; Mon, 14 Feb 2022 21:29:59 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id ACE176B007B; Mon, 14 Feb 2022 21:29:59 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 96EC16B007D; Mon, 14 Feb 2022 21:29:59 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0102.hostedemail.com [216.40.44.102]) by kanga.kvack.org (Postfix) with ESMTP id 81E9F6B0078 for ; Mon, 14 Feb 2022 21:29:59 -0500 (EST) Received: from smtpin30.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id 3F2D8181AC9C6 for ; Tue, 15 Feb 2022 02:29:59 +0000 (UTC) X-FDA: 79143434118.30.E0FEE94 Received: from mail-qv1-f46.google.com (mail-qv1-f46.google.com [209.85.219.46]) by imf30.hostedemail.com (Postfix) with ESMTP id C32CE80006 for ; Tue, 15 Feb 2022 02:29:58 +0000 (UTC) Received: by mail-qv1-f46.google.com with SMTP id fh9so16491830qvb.1 for ; Mon, 14 Feb 2022 18:29:58 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:from:to:cc:subject:in-reply-to:message-id:references :mime-version; bh=4mSmV42sXHHOhBCPioHFiH7XygaK1ndeYM6k9yLn7zw=; b=q33ygrKXdB5bUWyRE3WpNlU/3/9tpGTsIIJWN4RHrmeKZiMWIhQd8GCvt/JxhUxqbN YqZlq3EhPsWZKXRTj/8rdiKXZnTdF4AgGp+lUf0VrHKXJ0BuufxEoqcM6z04yw0fI9bx vQsbB7hWF2xN07TzuTNhJUpiDSerU1dR9x3xtV43TrYo+/XbkAQltCph7DFI1GAwPeZU JIcsxKpffv3vU3S8BFDHhnWCMCdRT4tXnZnB20nTCxs3Jh6+9jnJ0eWNc0j3qAYRHnpq 5gvZLtUcoXvAYy2uNB3h+CyL95tQnmko9mifwruf7lQRc1Xv7d08MfpS3GloAFKvpavO jprg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:in-reply-to:message-id :references:mime-version; bh=4mSmV42sXHHOhBCPioHFiH7XygaK1ndeYM6k9yLn7zw=; b=joKCqDAok5sPgKRJw6jTf7qGJ4lg+mqh4XTLqlhuPilbPg+sgsr8QPap6gq0tn3zbn J9BgS2DWb3E8q71Xpwla0tvpultkwUH00CHtMAXp/7ba2e9efM6NtTjWl9mx7RKinSqg ckThSp7RHsoFECiMmqldGu9UqTVKE0rtapMNisPJbSWY/zsAPzZv+sTk5nFIBbuYiW9e yQcN1sICP8UN2XOVwemlwwZAHZ479BFsSwQM6qY0v3rcJzsjG5mLS7Zy5R/+6SvXNbOD xl3NoVkJeVjOrMNRdzhjmEoi3XD7Fs9LcyyKVoKZPJZ0j2Z1ylT7rLtLcjuf/CPlwmJu rI3A== X-Gm-Message-State: AOAM530BX+T6Bki+E2w5dfEwgmWkvHUngilkPuWE2pZxZTbe3TWsaq1c eKCxaZV/bjI2NdbGAnPdmHVgPg== X-Google-Smtp-Source: ABdhPJxseS96A1RZAfo1sjpoCYpVexistVUUiHaF6St33J6IhEKCW95CNtgAVLHEjbEmg6orw24fag== X-Received: by 2002:ad4:5f8a:: with SMTP id jp10mr1288762qvb.23.1644892197880; Mon, 14 Feb 2022 18:29:57 -0800 (PST) Received: from ripple.attlocal.net (172-10-233-147.lightspeed.sntcca.sbcglobal.net. 
[172.10.233.147]) by smtp.gmail.com with ESMTPSA id a23sm16077736qko.96.2022.02.14.18.29.55 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 14 Feb 2022 18:29:57 -0800 (PST) Date: Mon, 14 Feb 2022 18:29:54 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@ripple.anvils To: Andrew Morton cc: Michal Hocko , Vlastimil Babka , "Kirill A. Shutemov" , Matthew Wilcox , David Hildenbrand , Alistair Popple , Johannes Weiner , Rik van Riel , Suren Baghdasaryan , Yu Zhao , Greg Thelen , Shakeel Butt , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v2 06/13] mm/munlock: maintain page->mlock_count while unevictable In-Reply-To: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> Message-ID: References: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> MIME-Version: 1.0 X-Rspamd-Queue-Id: C32CE80006 X-Rspam-User: Authentication-Results: imf30.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=q33ygrKX; spf=pass (imf30.hostedemail.com: domain of hughd@google.com designates 209.85.219.46 as permitted sender) smtp.mailfrom=hughd@google.com; dmarc=pass (policy=reject) header.from=google.com X-Stat-Signature: cydadd664ex8xb8i8k96ycuhioqkjig3 X-Rspamd-Server: rspam11 X-HE-Tag: 1644892198-495388 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Previous patches have been preparatory: now implement page->mlock_count. The ordering of the "Unevictable LRU" is of no significance, and there is no point holding unevictable pages on a list: place page->mlock_count to overlay page->lru.prev (since page->lru.next is overlaid by compound_head, which needs to be even so as not to satisfy PageTail - though 2 could be added instead of 1 for each mlock, if that's ever an improvement). But it's only safe to rely on or modify page->mlock_count while lruvec lock is held and page is on unevictable "LRU" - we can save lots of edits by continuing to pretend that there's an imaginary LRU here (there is an unevictable count which still needs to be maintained, but not a list). The mlock_count technique suffers from an unreliability much like with page_mlock(): while someone else has the page off LRU, not much can be done. As before, err on the safe side (behave as if mlock_count 0), and let try_to_unlock_one() move the page to unevictable if reclaim finds out later on - a few misplaced pages don't matter, what we want to avoid is imbalancing reclaim by flooding evictable lists with unevictable pages. I am not a fan of "if (!isolate_lru_page(page)) putback_lru_page(page);": if we have taken lruvec lock to get the page off its present list, then we save everyone trouble (and however many extra atomic ops) by putting it on its destination list immediately. Signed-off-by: Hugh Dickins Acked-by: Vlastimil Babka --- v2: same as v1. 
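
For illustration only, not part of Hugh's patch: a minimal userspace sketch of the lru.prev overlay described above, using a stand-in struct fake_page (the kernel's real layout is in the mm_types.h hunk below). It only shows that the counter aliases the second word of the list head, leaving the first word - the compound_head slot - untouched.

/*
 * Build with: cc -std=c11 -Wall overlay_sketch.c
 * fake_page and its fields are stand-ins, not kernel definitions.
 */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct list_head {
	struct list_head *next, *prev;
};

struct fake_page {
	unsigned long flags;
	union {
		struct list_head lru;		/* while on an ordinary LRU list */
		struct {			/* while on the "unevictable LRU" */
			void *__filler;		/* aliases lru.next: kept even, so never PageTail */
			unsigned int mlock_count;
		};
	};
};

int main(void)
{
	/* the counter must overlay lru.prev, never the compound_head word */
	static_assert(offsetof(struct fake_page, mlock_count) ==
		      offsetof(struct fake_page, lru.prev),
		      "mlock_count does not overlay lru.prev");

	struct fake_page page = { 0 };

	page.__filler = NULL;	/* the aliased first word stays an even "pointer" */
	page.mlock_count = 1;	/* first mlock of the page */
	page.mlock_count++;	/* a second vma mlocks the same page */
	printf("mlock_count = %u\n", page.mlock_count);
	return 0;
}
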
include/linux/mm_inline.h | 11 +++++-- include/linux/mm_types.h | 19 +++++++++-- mm/huge_memory.c | 5 ++- mm/memcontrol.c | 3 +- mm/mlock.c | 68 +++++++++++++++++++++++++++++++-------- mm/mmzone.c | 7 ++++ mm/swap.c | 1 + 7 files changed, 92 insertions(+), 22 deletions(-) diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index b725839dfe71..884d6f6af05b 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -99,7 +99,8 @@ void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio) update_lru_size(lruvec, lru, folio_zonenum(folio), folio_nr_pages(folio)); - list_add(&folio->lru, &lruvec->lists[lru]); + if (lru != LRU_UNEVICTABLE) + list_add(&folio->lru, &lruvec->lists[lru]); } static __always_inline void add_page_to_lru_list(struct page *page, @@ -115,6 +116,7 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio) update_lru_size(lruvec, lru, folio_zonenum(folio), folio_nr_pages(folio)); + /* This is not expected to be used on LRU_UNEVICTABLE */ list_add_tail(&folio->lru, &lruvec->lists[lru]); } @@ -127,8 +129,11 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page, static __always_inline void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio) { - list_del(&folio->lru); - update_lru_size(lruvec, folio_lru_list(folio), folio_zonenum(folio), + enum lru_list lru = folio_lru_list(folio); + + if (lru != LRU_UNEVICTABLE) + list_del(&folio->lru); + update_lru_size(lruvec, lru, folio_zonenum(folio), -folio_nr_pages(folio)); } diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 5140e5feb486..475bdb282769 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -85,7 +85,16 @@ struct page { * lruvec->lru_lock. Sometimes used as a generic list * by the page owner. */ - struct list_head lru; + union { + struct list_head lru; + /* Or, for the Unevictable "LRU list" slot */ + struct { + /* Always even, to negate PageTail */ + void *__filler; + /* Count page's or folio's mlocks */ + unsigned int mlock_count; + }; + }; /* See page-flags.h for PAGE_MAPPING_FLAGS */ struct address_space *mapping; pgoff_t index; /* Our offset within mapping. */ @@ -241,7 +250,13 @@ struct folio { struct { /* public: */ unsigned long flags; - struct list_head lru; + union { + struct list_head lru; + struct { + void *__filler; + unsigned int mlock_count; + }; + }; struct address_space *mapping; pgoff_t index; void *private; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d6477f48a27e..9afca0122723 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2300,8 +2300,11 @@ static void lru_add_page_tail(struct page *head, struct page *tail, } else { /* head is still on lru (and we have it frozen) */ VM_WARN_ON(!PageLRU(head)); + if (PageUnevictable(tail)) + tail->mlock_count = 0; + else + list_add_tail(&tail->lru, &head->lru); SetPageLRU(tail); - list_add_tail(&tail->lru, &head->lru); } } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 09d342c7cbd0..b10590926177 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1257,8 +1257,7 @@ struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio, * @nr_pages: positive when adding or negative when removing * * This function must be called under lru_lock, just before a page is added - * to or just after a page is removed from an lru list (that ordering being - * so as to allow it to check that lru_size 0 is consistent with list_empty). + * to or just after a page is removed from an lru list. 
*/ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, int zid, int nr_pages) diff --git a/mm/mlock.c b/mm/mlock.c index 3c26473050a3..f8a3a54687dd 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -54,16 +54,35 @@ EXPORT_SYMBOL(can_do_mlock); */ void mlock_page(struct page *page) { + struct lruvec *lruvec; + int nr_pages = thp_nr_pages(page); + VM_BUG_ON_PAGE(PageTail(page), page); if (!TestSetPageMlocked(page)) { - int nr_pages = thp_nr_pages(page); - mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages); - count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages); - if (!isolate_lru_page(page)) - putback_lru_page(page); + __count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages); + } + + /* There is nothing more we can do while it's off LRU */ + if (!TestClearPageLRU(page)) + return; + + lruvec = folio_lruvec_lock_irq(page_folio(page)); + if (PageUnevictable(page)) { + page->mlock_count++; + goto out; } + + del_page_from_lru_list(page, lruvec); + ClearPageActive(page); + SetPageUnevictable(page); + page->mlock_count = 1; + add_page_to_lru_list(page, lruvec); + __count_vm_events(UNEVICTABLE_PGCULLED, nr_pages); +out: + SetPageLRU(page); + unlock_page_lruvec_irq(lruvec); } /** @@ -72,19 +91,40 @@ void mlock_page(struct page *page) */ void munlock_page(struct page *page) { + struct lruvec *lruvec; + int nr_pages = thp_nr_pages(page); + VM_BUG_ON_PAGE(PageTail(page), page); + lock_page_memcg(page); + lruvec = folio_lruvec_lock_irq(page_folio(page)); + if (PageLRU(page) && PageUnevictable(page)) { + /* Then mlock_count is maintained, but might undercount */ + if (page->mlock_count) + page->mlock_count--; + if (page->mlock_count) + goto out; + } + /* else assume that was the last mlock: reclaim will fix it if not */ + if (TestClearPageMlocked(page)) { - int nr_pages = thp_nr_pages(page); - - mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); - if (!isolate_lru_page(page)) { - putback_lru_page(page); - count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages); - } else if (PageUnevictable(page)) { - count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages); - } + __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); + if (PageLRU(page) || !PageUnevictable(page)) + __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages); + else + __count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages); + } + + /* page_evictable() has to be checked *after* clearing Mlocked */ + if (PageLRU(page) && PageUnevictable(page) && page_evictable(page)) { + del_page_from_lru_list(page, lruvec); + ClearPageUnevictable(page); + add_page_to_lru_list(page, lruvec); + __count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages); } +out: + unlock_page_lruvec_irq(lruvec); + unlock_page_memcg(page); } /* diff --git a/mm/mmzone.c b/mm/mmzone.c index eb89d6e018e2..40e1d9428300 100644 --- a/mm/mmzone.c +++ b/mm/mmzone.c @@ -81,6 +81,13 @@ void lruvec_init(struct lruvec *lruvec) for_each_lru(lru) INIT_LIST_HEAD(&lruvec->lists[lru]); + /* + * The "Unevictable LRU" is imaginary: though its size is maintained, + * it is never scanned, and unevictable pages are not threaded on it + * (so that their lru fields can be reused to hold mlock_count). + * Poison its list head, so that any operations on it would crash. 
+ */ + list_del(&lruvec->lists[LRU_UNEVICTABLE]); } #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) diff --git a/mm/swap.c b/mm/swap.c index ff4810e4a4bc..682a03301a2c 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -1062,6 +1062,7 @@ static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec) } else { folio_clear_active(folio); folio_set_unevictable(folio); + folio->mlock_count = !!folio_test_mlocked(folio); if (!was_unevictable) __count_vm_events(UNEVICTABLE_PGCULLED, nr_pages); } From patchwork Tue Feb 15 02:31:48 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Hugh Dickins X-Patchwork-Id: 12746446 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 79E4DC433EF for ; Tue, 15 Feb 2022 02:31:53 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 095BD6B0078; Mon, 14 Feb 2022 21:31:53 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 043C26B007B; Mon, 14 Feb 2022 21:31:52 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id E4DBE6B007D; Mon, 14 Feb 2022 21:31:52 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0226.hostedemail.com [216.40.44.226]) by kanga.kvack.org (Postfix) with ESMTP id D440A6B0078 for ; Mon, 14 Feb 2022 21:31:52 -0500 (EST) Received: from smtpin16.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id 8FB7D8249980 for ; Tue, 15 Feb 2022 02:31:52 +0000 (UTC) X-FDA: 79143438864.16.BAF0212 Received: from mail-qv1-f41.google.com (mail-qv1-f41.google.com [209.85.219.41]) by imf17.hostedemail.com (Postfix) with ESMTP id 16A6240003 for ; Tue, 15 Feb 2022 02:31:51 +0000 (UTC) Received: by mail-qv1-f41.google.com with SMTP id a19so16462565qvm.4 for ; Mon, 14 Feb 2022 18:31:51 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:from:to:cc:subject:in-reply-to:message-id:references :mime-version; bh=SEHRUOH5N9XtoyGwgSgWRzInurRzgz1DBRcTDOVUssY=; b=Ef93WPz4BNfQgCvNVDzYXhPEoVfw8YCKHvG6ms7UdB7bFwbn+0qNzMPyUWbIg64oq+ MxE98w/ljM1MoSG1BvggJ6eAg4YLTUqcaz0jdQ1vd3MZhXYmijykAwPeuREWzO5BN4Gr GXxgwiuuLFphEWGcMIdagwrnRQCXPDiGXkFZk82Ss2VAHH/zIXJbXuqQjlrui/TdrF0f DEY4YioAZllKQPHkZQUSGcf4M8euKcAKwOPWjtibLdKxL+AQmQi+F+kgXZIW/ThkH5Xx Kb4sKrxpheoCiJY3zti20+oL6KCsGkFJiPHkSVNUGTrNMpHmDh3HGLwPvNcsVrZPq41P bvfg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:in-reply-to:message-id :references:mime-version; bh=SEHRUOH5N9XtoyGwgSgWRzInurRzgz1DBRcTDOVUssY=; b=r5tcdNKaLiinWAqgIoV9TDaU8pmyuIKvaU1U5GuT5N7O5FAjfISK3X7eHoT6DC6VE3 /zxujeKvNu3Yb8Yil2yM0JlvTdnJWabPi64XRzhlnVpyz/WKyxV7Sku6qSdoQ6jcN27w TLWK1dpXYXqRI5Ae1Tp9e4Yed8/XP6+t2M/E5FmhUGAZg5SYFXICCz9HjBZwBrDCnygt QvUvoHgEc5pECjF+mi1OWMCp+YSHIiAz3NWtgINcfn/STvd0n9RXSdfncnPoINcTTCts C+gZoIzWwzYohAl0qZgiV5khjLYIYrHAbW9gpozJ19k0W55hRZaJ//1ZD0njWj5O6hXS 6UFw== X-Gm-Message-State: AOAM532hYwheAW1g9+0XFWPsyvDQTbNjuaa5xUlAxNgiSGiTd8ixOoAK 6ucTyYwY8EJ2liOWAP30xzCNrg== X-Google-Smtp-Source: ABdhPJzxJaqmReL22Kes+tEvk+cR241J/U9I6IjlYTTv8kuo3FsPJYf1LN3+scnIctYC81CMtO/55w== X-Received: by 2002:a05:6214:20ab:: with SMTP id 11mr1304781qvd.98.1644892311112; Mon, 14 Feb 2022 
18:31:51 -0800 (PST) Received: from ripple.attlocal.net (172-10-233-147.lightspeed.sntcca.sbcglobal.net. [172.10.233.147]) by smtp.gmail.com with ESMTPSA id br35sm16757332qkb.118.2022.02.14.18.31.48 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 14 Feb 2022 18:31:50 -0800 (PST) Date: Mon, 14 Feb 2022 18:31:48 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@ripple.anvils To: Andrew Morton cc: Michal Hocko , Vlastimil Babka , "Kirill A. Shutemov" , Matthew Wilcox , David Hildenbrand , Alistair Popple , Johannes Weiner , Rik van Riel , Suren Baghdasaryan , Yu Zhao , Greg Thelen , Shakeel Butt , Hillf Danton , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v2 07/13] mm/munlock: mlock_pte_range() when mlocking or munlocking In-Reply-To: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> Message-ID: References: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> MIME-Version: 1.0 Authentication-Results: imf17.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=Ef93WPz4; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf17.hostedemail.com: domain of hughd@google.com designates 209.85.219.41 as permitted sender) smtp.mailfrom=hughd@google.com X-Rspam-User: X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: 16A6240003 X-Stat-Signature: h65qbhdm1j9zu81auw7ndoz1uk1c3seu X-HE-Tag: 1644892311-736172 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Fill in missing pieces: reimplementation of munlock_vma_pages_range(), required to lower the mlock_counts when munlocking without munmapping; and its complement, implementation of mlock_vma_pages_range(), required to raise the mlock_counts on pages already there when a range is mlocked. Combine them into just the one function mlock_vma_pages_range(), using walk_page_range() to run mlock_pte_range(). This approach fixes the "Very slow unlockall()" of unpopulated PROT_NONE areas, reported in https://lore.kernel.org/linux-mm/70885d37-62b7-748b-29df-9e94f3291736@gmail.com/ Munlock clears VM_LOCKED at the start, under exclusive mmap_lock; but if a racing truncate or holepunch (depending on i_mmap_rwsem) gets to the pte first, it will not try to munlock the page: leaving release_pages() to correct it when the last reference to the page is gone - that's okay, a page is not evictable anyway while it is held by an extra reference. Mlock sets VM_LOCKED at the start, under exclusive mmap_lock; but if a racing remove_migration_pte() or try_to_unmap_one() (depending on i_mmap_rwsem) gets to the pte first, it will try to mlock the page, then mlock_pte_range() mlock it a second time. This is harder to reproduce, but a more serious race because it could leave the page unevictable indefinitely though the area is munlocked afterwards. Guard against it by setting the (inappropriate) VM_IO flag, and modifying mlock_vma_page() to decline such vmas. Signed-off-by: Hugh Dickins Acked-by: Vlastimil Babka Reported-by: kernel test robot --- v2: Delete reintroduced VM_LOCKED_CLEAR_MASKing here, per Vlastimil. Use oldflags instead of vma->vm_flags in mlock_fixup(), per Vlastimil. Make clearer comment on mmap_lock and WRITE_ONCE, per Hillf. 
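
For illustration only, not part of Hugh's patch: a tiny userspace sketch of the VM_IO guard described above, with made-up flag bits standing in for the kernel's VM_LOCKED and VM_IO values. It shows why the combined test which this series adds to mlock_vma_page() declines to mlock while mlock_vma_pages_range() temporarily advertises VM_LOCKED|VM_IO (see the mm/internal.h and mm/mlock.c hunks below).

/* Build with: cc -std=c11 -Wall vm_io_guard_sketch.c */
#include <stdbool.h>
#include <stdio.h>

#define VM_LOCKED	0x1u	/* stand-in bits: the kernel's values differ */
#define VM_IO		0x2u

/* same shape as the check added to mlock_vma_page() in this patch */
static bool would_mlock(unsigned int vm_flags)
{
	return (vm_flags & (VM_LOCKED | VM_IO)) == VM_LOCKED;
}

int main(void)
{
	printf("during the range walk (VM_LOCKED|VM_IO): %d\n",
	       would_mlock(VM_LOCKED | VM_IO));
	printf("after VM_IO is dropped (VM_LOCKED only): %d\n",
	       would_mlock(VM_LOCKED));
	printf("munlocked vma          (neither flag)  : %d\n",
	       would_mlock(0));
	return 0;
}
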
mm/internal.h | 3 +- mm/mlock.c | 111 ++++++++++++++++++++++++++++++++++++++++---------- 2 files changed, 91 insertions(+), 23 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index a43d79335c16..b3f0dd3ffba2 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -412,7 +412,8 @@ void mlock_page(struct page *page); static inline void mlock_vma_page(struct page *page, struct vm_area_struct *vma, bool compound) { - if (unlikely(vma->vm_flags & VM_LOCKED) && + /* VM_IO check prevents migration from double-counting during mlock */ + if (unlikely((vma->vm_flags & (VM_LOCKED|VM_IO)) == VM_LOCKED) && (compound || !PageTransCompound(page))) mlock_page(page); } diff --git a/mm/mlock.c b/mm/mlock.c index f8a3a54687dd..581ea8bf1b83 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include @@ -127,25 +128,91 @@ void munlock_page(struct page *page) unlock_page_memcg(page); } +static int mlock_pte_range(pmd_t *pmd, unsigned long addr, + unsigned long end, struct mm_walk *walk) + +{ + struct vm_area_struct *vma = walk->vma; + spinlock_t *ptl; + pte_t *start_pte, *pte; + struct page *page; + + ptl = pmd_trans_huge_lock(pmd, vma); + if (ptl) { + if (!pmd_present(*pmd)) + goto out; + if (is_huge_zero_pmd(*pmd)) + goto out; + page = pmd_page(*pmd); + if (vma->vm_flags & VM_LOCKED) + mlock_page(page); + else + munlock_page(page); + goto out; + } + + start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) { + if (!pte_present(*pte)) + continue; + page = vm_normal_page(vma, addr, *pte); + if (!page) + continue; + if (PageTransCompound(page)) + continue; + if (vma->vm_flags & VM_LOCKED) + mlock_page(page); + else + munlock_page(page); + } + pte_unmap(start_pte); +out: + spin_unlock(ptl); + cond_resched(); + return 0; +} + /* - * munlock_vma_pages_range() - munlock all pages in the vma range.' - * @vma - vma containing range to be munlock()ed. + * mlock_vma_pages_range() - mlock any pages already in the range, + * or munlock all pages in the range. + * @vma - vma containing range to be mlock()ed or munlock()ed * @start - start address in @vma of the range - * @end - end of range in @vma. - * - * For mremap(), munmap() and exit(). + * @end - end of range in @vma + * @newflags - the new set of flags for @vma. * - * Called with @vma VM_LOCKED. - * - * Returns with VM_LOCKED cleared. Callers must be prepared to - * deal with this. + * Called for mlock(), mlock2() and mlockall(), to set @vma VM_LOCKED; + * called for munlock() and munlockall(), to clear VM_LOCKED from @vma. */ -static void munlock_vma_pages_range(struct vm_area_struct *vma, - unsigned long start, unsigned long end) +static void mlock_vma_pages_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end, vm_flags_t newflags) { - vma->vm_flags &= VM_LOCKED_CLEAR_MASK; + static const struct mm_walk_ops mlock_walk_ops = { + .pmd_entry = mlock_pte_range, + }; - /* Reimplementation to follow in later commit */ + /* + * There is a slight chance that concurrent page migration, + * or page reclaim finding a page of this now-VM_LOCKED vma, + * will call mlock_vma_page() and raise page's mlock_count: + * double counting, leaving the page unevictable indefinitely. + * Communicate this danger to mlock_vma_page() with VM_IO, + * which is a VM_SPECIAL flag not allowed on VM_LOCKED vmas. 
+ * mmap_lock is held in write mode here, so this weird + * combination should not be visible to other mmap_lock users; + * but WRITE_ONCE so rmap walkers must see VM_IO if VM_LOCKED. + */ + if (newflags & VM_LOCKED) + newflags |= VM_IO; + WRITE_ONCE(vma->vm_flags, newflags); + + lru_add_drain(); + walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL); + lru_add_drain(); + + if (newflags & VM_IO) { + newflags &= ~VM_IO; + WRITE_ONCE(vma->vm_flags, newflags); + } } /* @@ -164,10 +231,9 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev, pgoff_t pgoff; int nr_pages; int ret = 0; - int lock = !!(newflags & VM_LOCKED); - vm_flags_t old_flags = vma->vm_flags; + vm_flags_t oldflags = vma->vm_flags; - if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) || + if (newflags == oldflags || (oldflags & VM_SPECIAL) || is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) || vma_is_dax(vma) || vma_is_secretmem(vma)) /* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */ @@ -199,9 +265,9 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev, * Keep track of amount of locked VM. */ nr_pages = (end - start) >> PAGE_SHIFT; - if (!lock) + if (!(newflags & VM_LOCKED)) nr_pages = -nr_pages; - else if (old_flags & VM_LOCKED) + else if (oldflags & VM_LOCKED) nr_pages = 0; mm->locked_vm += nr_pages; @@ -211,11 +277,12 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev, * set VM_LOCKED, populate_vma_page_range will bring it back. */ - if (lock) + if ((newflags & VM_LOCKED) && (oldflags & VM_LOCKED)) { + /* No work to do, and mlocking twice would be wrong */ vma->vm_flags = newflags; - else - munlock_vma_pages_range(vma, start, end); - + } else { + mlock_vma_pages_range(vma, start, end, newflags); + } out: *prev = vma; return ret; From patchwork Tue Feb 15 02:33:17 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Hugh Dickins X-Patchwork-Id: 12746447 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id B2A9DC433EF for ; Tue, 15 Feb 2022 02:33:22 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 43C5E6B0078; Mon, 14 Feb 2022 21:33:22 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 3EBDD6B007B; Mon, 14 Feb 2022 21:33:22 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 28D0C6B007D; Mon, 14 Feb 2022 21:33:22 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id 1CA3C6B0078 for ; Mon, 14 Feb 2022 21:33:22 -0500 (EST) Received: from smtpin06.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id C9723901C2 for ; Tue, 15 Feb 2022 02:33:21 +0000 (UTC) X-FDA: 79143442602.06.DE215DB Received: from mail-qt1-f175.google.com (mail-qt1-f175.google.com [209.85.160.175]) by imf28.hostedemail.com (Postfix) with ESMTP id 63C68C0002 for ; Tue, 15 Feb 2022 02:33:21 +0000 (UTC) Received: by mail-qt1-f175.google.com with SMTP id l14so17265361qtp.7 for ; Mon, 14 Feb 2022 18:33:21 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; 
h=date:from:to:cc:subject:in-reply-to:message-id:references :mime-version; bh=hA6o7f3VWTxX95Mr0GyMKT1FTGKFiKlNt02jF6v+Tb0=; b=rtEYp0Pul7pel5CuZdN22/B3yxGbnNW3jrMy8534GdcI8lYDHPfs/goTA/UvG5nek9 Wo9HEjPNtZLebFqqkeMWvo5YlcgbC0vQ2MxVDE2fLyPpEN8ztcdUQIVWst7IWcjIJ/TK qfhWX4SVMD1lxSp76cqbh3Kw2Vx1FKnwxo9Bc71WyCwwHGSHBcVckXz6DjlzsyigRq8w aX5VjHhsufzKtYGundL8X2fHKjL7Risp5ebjQ20lyljFgCNyXta2nqn/gPvvEA9jcdxG ENHbUeKNdW4Csx+pHVD+S+m0y/MoXleQ+aI2YObanALTvC1nHb6JKBOsvTKEwz7yzaKJ P2uw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:in-reply-to:message-id :references:mime-version; bh=hA6o7f3VWTxX95Mr0GyMKT1FTGKFiKlNt02jF6v+Tb0=; b=VlU5wD8xg0bDbqI+IC8H+ifHebuAfFmCWVruRSAT3AYtQGE7uV3KOgG9NWXS26xDMx IGU39vxRw6o/oZJC+WBnCxk7QViKy2dSY7EhbObNWovAnpBYiCBGGBdMG9ZtKp5V4SAN +ydZUg9xHXp2cBs/4mSdri//Tq6t3DHcn0yickeSwD5MJEtXKoy4JmiG4cLDe0rqP6Kd hSigFz5lw/xygQec6HnH1N+uJZ3cyENJ3nUftxr9URRqYjnMqubOj2vTALX/i0B6e53f QbE5nx0GFj2pXqMsResjT5m1fNYDLBNkcwmm4NOZSw2ApNS+b8R+QpAeXfuUvuQpmibj 1swQ== X-Gm-Message-State: AOAM531Y9XNOB8kVFhpsJ4wkSz5/5zYLOuPKL8IEDfTBa/28aJ98gfJa B5ykjeTkGDz6JFf6R6fc8niyPA== X-Google-Smtp-Source: ABdhPJzAV7n2bz0VYf1XZvzGWHDRqLhyqek+4Pt0HQtX7UeiNiLfzxn1drhbZ4e+Dmo3KTDseUBAdg== X-Received: by 2002:ac8:4e86:: with SMTP id 6mr1377145qtp.84.1644892400587; Mon, 14 Feb 2022 18:33:20 -0800 (PST) Received: from ripple.attlocal.net (172-10-233-147.lightspeed.sntcca.sbcglobal.net. [172.10.233.147]) by smtp.gmail.com with ESMTPSA id u21sm18424379qtw.80.2022.02.14.18.33.18 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 14 Feb 2022 18:33:19 -0800 (PST) Date: Mon, 14 Feb 2022 18:33:17 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@ripple.anvils To: Andrew Morton cc: Michal Hocko , Vlastimil Babka , "Kirill A. Shutemov" , Matthew Wilcox , David Hildenbrand , Alistair Popple , Johannes Weiner , Rik van Riel , Suren Baghdasaryan , Yu Zhao , Greg Thelen , Shakeel Butt , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v2 08/13] mm/migrate: __unmap_and_move() push good newpage to LRU In-Reply-To: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> Message-ID: <269eec24-978a-984a-8a85-1d29f36ad343@google.com> References: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> MIME-Version: 1.0 X-Rspamd-Queue-Id: 63C68C0002 X-Rspam-User: Authentication-Results: imf28.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=rtEYp0Pu; spf=pass (imf28.hostedemail.com: domain of hughd@google.com designates 209.85.160.175 as permitted sender) smtp.mailfrom=hughd@google.com; dmarc=pass (policy=reject) header.from=google.com X-Stat-Signature: 31b7h738wymze9ujxbfm7i8tjfx9goi8 X-Rspamd-Server: rspam11 X-HE-Tag: 1644892401-961633 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Compaction, NUMA page movement, THP collapse/split, and memory failure do isolate unevictable pages from their "LRU", losing the record of mlock_count in doing so (isolators are likely to use page->lru for their own private lists, so mlock_count has to be presumed lost). That's unfortunate, and we should put in some work to correct that: one can imagine a function to build up the mlock_count again - but it would require i_mmap_rwsem for read, so be careful where it's called. Or page_referenced_one() and try_to_unmap_one() might do that extra work. 
But one place that can very easily be improved is page migration's __unmap_and_move(): a small adjustment to where the successful new page is put back on LRU, and its mlock_count (if any) is built back up by remove_migration_ptes(). Signed-off-by: Hugh Dickins Acked-by: Vlastimil Babka --- v2: same as v1. mm/migrate.c | 31 +++++++++++++++++++------------ 1 file changed, 19 insertions(+), 12 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 7c4223ce2500..f4bcf1541b62 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1032,6 +1032,21 @@ static int __unmap_and_move(struct page *page, struct page *newpage, if (!page_mapped(page)) rc = move_to_new_page(newpage, page, mode); + /* + * When successful, push newpage to LRU immediately: so that if it + * turns out to be an mlocked page, remove_migration_ptes() will + * automatically build up the correct newpage->mlock_count for it. + * + * We would like to do something similar for the old page, when + * unsuccessful, and other cases when a page has been temporarily + * isolated from the unevictable LRU: but this case is the easiest. + */ + if (rc == MIGRATEPAGE_SUCCESS) { + lru_cache_add(newpage); + if (page_was_mapped) + lru_add_drain(); + } + if (page_was_mapped) remove_migration_ptes(page, rc == MIGRATEPAGE_SUCCESS ? newpage : page, false); @@ -1045,20 +1060,12 @@ static int __unmap_and_move(struct page *page, struct page *newpage, unlock_page(page); out: /* - * If migration is successful, decrease refcount of the newpage + * If migration is successful, decrease refcount of the newpage, * which will not free the page because new page owner increased - * refcounter. As well, if it is LRU page, add the page to LRU - * list in here. Use the old state of the isolated source page to - * determine if we migrated a LRU page. newpage was already unlocked - * and possibly modified by its owner - don't rely on the page - * state. + * refcounter. 
*/ - if (rc == MIGRATEPAGE_SUCCESS) { - if (unlikely(!is_lru)) - put_page(newpage); - else - putback_lru_page(newpage); - } + if (rc == MIGRATEPAGE_SUCCESS) + put_page(newpage); return rc; } From patchwork Tue Feb 15 02:34:46 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Hugh Dickins X-Patchwork-Id: 12746451 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1E510C433F5 for ; Tue, 15 Feb 2022 02:34:52 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id AF3A76B0078; Mon, 14 Feb 2022 21:34:51 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id AA1536B007B; Mon, 14 Feb 2022 21:34:51 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 969046B007D; Mon, 14 Feb 2022 21:34:51 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0178.hostedemail.com [216.40.44.178]) by kanga.kvack.org (Postfix) with ESMTP id 897FC6B0078 for ; Mon, 14 Feb 2022 21:34:51 -0500 (EST) Received: from smtpin29.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id 4281A181AC9C6 for ; Tue, 15 Feb 2022 02:34:51 +0000 (UTC) X-FDA: 79143446382.29.339ADBA Received: from mail-qk1-f180.google.com (mail-qk1-f180.google.com [209.85.222.180]) by imf11.hostedemail.com (Postfix) with ESMTP id B6C9740006 for ; Tue, 15 Feb 2022 02:34:50 +0000 (UTC) Received: by mail-qk1-f180.google.com with SMTP id c189so16115416qkg.11 for ; Mon, 14 Feb 2022 18:34:50 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:from:to:cc:subject:in-reply-to:message-id:references :mime-version; bh=klh+DPqWZ+BlnUaC467qqykgES+6efAa+duJH3TYeKQ=; b=jBkp8ISwEWhgcDAzMbPhNGwAR1nASKVb7CoYeFtaLEF/od82lqGj5kJWJxxx3EfYvw HARVTVmf4PvZvD5gbkHVJGZL8Qpt7sIYLwOOI0ntN217CcbSoJjuWfBJwDPhMb66JRm+ j5bsT6aT3EgKAVckTOcYnP3SwkxTysD8FDjEYT6RFLZSY9XIuDUZjtHMHplFrNP2sq7W 74tuE87UchsdHKKjwsUvDOrSl3RyQRAchqim6HJRns8K/SSS8gn7WeTPMvEJk7UCSxl5 5yP8gG5lcuD1rwU1oZ0TOUdrlpwFG874LE/tYnD9s+RijMGeRxZ4XsycvemP6XdVIg74 LrZQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:in-reply-to:message-id :references:mime-version; bh=klh+DPqWZ+BlnUaC467qqykgES+6efAa+duJH3TYeKQ=; b=EB+SizVyFXCtl9a4+doWrT2mbO6hhVQ8LqHnII9mjm4bY+OsKN0MaJFNE09mJBk/3B qY+4EeL2hqqgNu4yP/r50qsWTyhLCk/eGmIDhKEbdfliiDCil9cyk1Tqw5E/2XDACeGf MlZR1I1IQp+bQA0VUfuLkzIOHaRfarkCQgS+zqrGtWafCAz7cZvDv4VyzOQHYboEt0jZ f/tW6RnfZICLM/AdKc98KNwJUqDzVMq7N15kGamI6qwbku12diq/Cl0WI2BUoUMrzaQJ wP0FDJo075BJP+Uy9QwzmZ+aIRw2IVnxqQdVffZDShV9kqWZpe3wUbldSmbt3RC7WHq7 Rvfg== X-Gm-Message-State: AOAM530ehXCS7qAD5YrT4jY5MsVhKOJlvkrW8VsiXz8qi/3lSGKs6h7P LxZcHi9UPkqCHH14wI7Cd1PrdsYoHWJVeA== X-Google-Smtp-Source: ABdhPJzGX7gP6ZEL5R3ZPtyweqq4fZ2IbQmSK/ERVYTD1l3o8HEmWHpmokabNmD4RrCsoVQTrR9s4g== X-Received: by 2002:a37:f902:: with SMTP id l2mr1013027qkj.392.1644892489924; Mon, 14 Feb 2022 18:34:49 -0800 (PST) Received: from ripple.attlocal.net (172-10-233-147.lightspeed.sntcca.sbcglobal.net. 
[172.10.233.147]) by smtp.gmail.com with ESMTPSA id bk23sm16516786qkb.3.2022.02.14.18.34.47 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 14 Feb 2022 18:34:49 -0800 (PST) Date: Mon, 14 Feb 2022 18:34:46 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@ripple.anvils To: Andrew Morton cc: Michal Hocko , Vlastimil Babka , "Kirill A. Shutemov" , Matthew Wilcox , David Hildenbrand , Alistair Popple , Johannes Weiner , Rik van Riel , Suren Baghdasaryan , Yu Zhao , Greg Thelen , Shakeel Butt , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v2 09/13] mm/munlock: delete smp_mb() from __pagevec_lru_add_fn() In-Reply-To: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> Message-ID: <28a7c6ff-6270-9060-8df0-862bdcaac366@google.com> References: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> MIME-Version: 1.0 Authentication-Results: imf11.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=jBkp8ISw; spf=pass (imf11.hostedemail.com: domain of hughd@google.com designates 209.85.222.180 as permitted sender) smtp.mailfrom=hughd@google.com; dmarc=pass (policy=reject) header.from=google.com X-Rspamd-Server: rspam07 X-Rspam-User: X-Rspamd-Queue-Id: B6C9740006 X-Stat-Signature: h8m8s63bdu3p6apd537opsiio159prxg X-HE-Tag: 1644892490-23758 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: My reading of comment on smp_mb__after_atomic() in __pagevec_lru_add_fn() says that it can now be deleted; and that remains so when the next patch is added. Signed-off-by: Hugh Dickins Acked-by: Vlastimil Babka --- v2: same as v1. mm/swap.c | 37 +++++++++---------------------------- 1 file changed, 9 insertions(+), 28 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index 682a03301a2c..3f770b1ea2c1 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -1025,37 +1025,18 @@ static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec) VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); + folio_set_lru(folio); /* - * A folio becomes evictable in two ways: - * 1) Within LRU lock [munlock_vma_page() and __munlock_pagevec()]. - * 2) Before acquiring LRU lock to put the folio on the correct LRU - * and then - * a) do PageLRU check with lock [check_move_unevictable_pages] - * b) do PageLRU check before lock [clear_page_mlock] - * - * (1) & (2a) are ok as LRU lock will serialize them. For (2b), we need - * following strict ordering: - * - * #0: __pagevec_lru_add_fn #1: clear_page_mlock - * - * folio_set_lru() folio_test_clear_mlocked() - * smp_mb() // explicit ordering // above provides strict - * // ordering - * folio_test_mlocked() folio_test_lru() + * Is an smp_mb__after_atomic() still required here, before + * folio_evictable() tests PageMlocked, to rule out the possibility + * of stranding an evictable folio on an unevictable LRU? I think + * not, because munlock_page() only clears PageMlocked while the LRU + * lock is held. * - * - * if '#1' does not observe setting of PG_lru by '#0' and - * fails isolation, the explicit barrier will make sure that - * folio_evictable check will put the folio on the correct - * LRU. Without smp_mb(), folio_set_lru() can be reordered - * after folio_test_mlocked() check and can make '#1' fail the - * isolation of the folio whose mlocked bit is cleared (#0 is - * also looking at the same folio) and the evictable folio will - * be stranded on an unevictable LRU. 
+ * (That is not true of __page_cache_release(), and not necessarily + * true of release_pages(): but those only clear PageMlocked after + * put_page_testzero() has excluded any other users of the page.) */ - folio_set_lru(folio); - smp_mb__after_atomic(); - if (folio_evictable(folio)) { if (was_unevictable) __count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages); From patchwork Tue Feb 15 02:37:29 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Hugh Dickins X-Patchwork-Id: 12746452 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3FECAC433F5 for ; Tue, 15 Feb 2022 02:37:35 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id A2C016B0078; Mon, 14 Feb 2022 21:37:34 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 9DBE56B007B; Mon, 14 Feb 2022 21:37:34 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 82DC36B007D; Mon, 14 Feb 2022 21:37:34 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0115.hostedemail.com [216.40.44.115]) by kanga.kvack.org (Postfix) with ESMTP id 6DF976B0078 for ; Mon, 14 Feb 2022 21:37:34 -0500 (EST) Received: from smtpin13.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id 2E9C28249980 for ; Tue, 15 Feb 2022 02:37:34 +0000 (UTC) X-FDA: 79143453228.13.DEEE4D5 Received: from mail-qt1-f173.google.com (mail-qt1-f173.google.com [209.85.160.173]) by imf26.hostedemail.com (Postfix) with ESMTP id ADC4D140003 for ; Tue, 15 Feb 2022 02:37:33 +0000 (UTC) Received: by mail-qt1-f173.google.com with SMTP id r14so17282749qtt.5 for ; Mon, 14 Feb 2022 18:37:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:from:to:cc:subject:in-reply-to:message-id:references :mime-version; bh=9TcSzHM06U1tSg8wXuUnkKlgQter2rEywNZIjXHa0b4=; b=Uab4eYFlLQjvIEcQs5mQreEPrFIMD5vJSc7c5PfRA9ufbrLfFilSoc5ulR8PoL6PKm Bw3pqhNGHTM4WY5eXIwd69E6xjXEQfFpGso0IpB598su0OBvtEF9ltujsNCj18ebptOA MEWg2qYrSjo7LHrlp9RIMt8b/wunusqwhOwS7iFtpWXuAw0ijE/c1e0A2Eo9Q1u1NTaq gLry3CYDrlpOl+IS8xPqIYVpDdfTB8OZYLESXNoQZVN1Ti6UsZLFx6V/lgn652YN82DW EKux11ciBTtKQ9WHmU1f3ip1QVF7PatDzI+aiIsa+Sq/fonYjOhfeOxjOjDt9+VvO2Qs JfyA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:in-reply-to:message-id :references:mime-version; bh=9TcSzHM06U1tSg8wXuUnkKlgQter2rEywNZIjXHa0b4=; b=xFq9D56LxNXF4jQTC2EB3EYQNLPMtEz7fXmkScvEiuS7VvdwYsJ5sH7eN8kWEjpQ+b lVkX3YdU02tqipDBpFP6UpG2bWi82ueQkuMuXMWNxsD81llLh7Y9FGx1l9dl7Qx+a6Z9 uUjiyM5WngjjaLfOOOSwsSKmKAsFGnSuZwi1CosMBgtKC0Hw4KrU7O29aq573qPS4tSV OwfVtOBvrFxBZ86P4MufDsO0QiqDR5imMyAm6INlvzNkHP/F4CEPAPBCgAuWjYCNGghG EjEiR9wTJQSWoZNLrx/Rh/XoWK2Pz3tvL6J5CFt79ZTWxQkVjjZZRiBO5Xsm6KHq4w5c Atqw== X-Gm-Message-State: AOAM531gHp3JmgB4L41zm4tsLyyKW3pkVvlRBME/ft6sGSy7XdH/O4mb /qDsHtzUAFgJAaueCEPqz5tEIA== X-Google-Smtp-Source: ABdhPJzJAP6sQ26g3dA4wEcQz8Qzeq5meQwxE2bHO3G5GW80SN9S6OmlLgTXYG8LZOJPXhvVTv9r+Q== X-Received: by 2002:ac8:5808:: with SMTP id g8mr1342559qtg.483.1644892652723; Mon, 14 Feb 2022 18:37:32 -0800 (PST) Received: from ripple.attlocal.net (172-10-233-147.lightspeed.sntcca.sbcglobal.net. 
[172.10.233.147]) by smtp.gmail.com with ESMTPSA id m130sm16121044qke.55.2022.02.14.18.37.30 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 14 Feb 2022 18:37:31 -0800 (PST) Date: Mon, 14 Feb 2022 18:37:29 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@ripple.anvils To: Andrew Morton cc: Michal Hocko , Vlastimil Babka , "Kirill A. Shutemov" , Matthew Wilcox , David Hildenbrand , Alistair Popple , Johannes Weiner , Rik van Riel , Suren Baghdasaryan , Yu Zhao , Greg Thelen , Shakeel Butt , Yang Li , SeongJae Park , Geert Uytterhoeven , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v2 10/13] mm/munlock: mlock_page() munlock_page() batch by pagevec In-Reply-To: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> Message-ID: <1abb94ee-fe72-dba9-3eb0-d1e576d148e6@google.com> References: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> MIME-Version: 1.0 X-Rspam-User: X-Rspamd-Server: rspam08 X-Rspamd-Queue-Id: ADC4D140003 X-Stat-Signature: 6e4g9ercdmcf34gkpdyayttnotty984z Authentication-Results: imf26.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=Uab4eYFl; spf=pass (imf26.hostedemail.com: domain of hughd@google.com designates 209.85.160.173 as permitted sender) smtp.mailfrom=hughd@google.com; dmarc=pass (policy=reject) header.from=google.com X-HE-Tag: 1644892653-677500 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: A weakness of the page->mlock_count approach is the need for lruvec lock while holding page table lock. That is not an overhead we would allow on normal pages, but I think acceptable just for pages in an mlocked area. But let's try to amortize the extra cost by gathering on per-cpu pagevec before acquiring the lruvec lock. I have an unverified conjecture that the mlock pagevec might work out well for delaying the mlock processing of new file pages until they have got off lru_cache_add()'s pagevec and on to LRU. The initialization of page->mlock_count is subject to races and awkward: 0 or !!PageMlocked or 1? Was it wrong even in the implementation before this commit, which just widens the window? I haven't gone back to think it through. Maybe someone can point out a better way to initialize it. Bringing lru_cache_add_inactive_or_unevictable()'s mlock initialization into mm/mlock.c has helped: mlock_new_page(), using the mlock pagevec, rather than lru_cache_add()'s pagevec. Experimented with various orderings: the right thing seems to be for mlock_page() and mlock_new_page() to TestSetPageMlocked before adding to pagevec, but munlock_page() to leave TestClearPageMlocked to the later pagevec processing. Dropped the VM_BUG_ON_PAGE(PageTail)s this time around: they have made their point, and the thp_nr_page()s already contain a VM_BUG_ON_PGFLAGS() for that. This still leaves acquiring lruvec locks under page table lock each time the pagevec fills (or a THP is added): which I suppose is rather silly, since they sit on pagevec waiting to be processed long after page table lock has been dropped; but I'm disinclined to uglify the calling sequence until some load shows an actual problem with it (nothing wrong with taking lruvec lock under page table lock, just "nicer" to do it less). Signed-off-by: Hugh Dickins Acked-by: Vlastimil Babka --- v2: Fix kerneldoc on mlock_page() and two more functions, per Yang Li. Fix build without CONFIG_MMU, per SeongJae, Geert, Naresh, lkp. 
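For anyone unfamiliar with the trick the patch below uses to keep a single pagevec serving three kinds of request: LRU_PAGE and NEW_PAGE are carried in the two low bits of the struct page pointer, which struct page's alignment leaves clear. Here is a minimal userspace sketch of the same encode/decode - toy_page, tag() and the batch array are invented names for illustration, not kernel code:

#include <assert.h>
#include <stdio.h>

/*
 * Toy stand-in for struct page: any naturally aligned object leaves
 * the two low pointer bits free for use as flags.
 */
struct toy_page { int id; };

#define LRU_PAGE 0x1UL	/* "page already on (or temporarily off) LRU" request */
#define NEW_PAGE 0x2UL	/* "newly allocated page" request */

static struct toy_page *tag(struct toy_page *page, unsigned long flag)
{
	return (struct toy_page *)((unsigned long)page | flag);
}

int main(void)
{
	struct toy_page pages[3] = { {0}, {1}, {2} };
	struct toy_page *batch[3];
	unsigned long mlock;
	int i;

	/* Queueing side: one of each kind of request */
	batch[0] = tag(&pages[0], LRU_PAGE);	/* as mlock_lru() */
	batch[1] = tag(&pages[1], NEW_PAGE);	/* as mlock_new() */
	batch[2] = &pages[2];			/* munlock: untagged */

	/* Drain side: decode the low bits and dispatch, as mlock_pagevec() */
	for (i = 0; i < 3; i++) {
		mlock = (unsigned long)batch[i] & (LRU_PAGE | NEW_PAGE);
		batch[i] = (struct toy_page *)((unsigned long)batch[i] - mlock);
		assert(batch[i] == &pages[i]);
		printf("page %d: %s\n", batch[i]->id,
		       mlock & LRU_PAGE ? "mlock" :
		       mlock & NEW_PAGE ? "mlock new" : "munlock");
	}
	return 0;
}

The alternative of three separate pagevecs is considered and rejected in the comment above mlock_pagevec() in the diff below.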
mm/internal.h | 9 ++- mm/mlock.c | 212 ++++++++++++++++++++++++++++++++++++++++++-------- mm/swap.c | 27 ++++--- 3 files changed, 201 insertions(+), 47 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index b3f0dd3ffba2..18af980bb1b8 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -402,7 +402,8 @@ extern int mlock_future_check(struct mm_struct *mm, unsigned long flags, * * mlock is usually called at the end of page_add_*_rmap(), * munlock at the end of page_remove_rmap(); but new anon - * pages are managed in lru_cache_add_inactive_or_unevictable(). + * pages are managed by lru_cache_add_inactive_or_unevictable() + * calling mlock_new_page(). * * @compound is used to include pmd mappings of THPs, but filter out * pte mappings of THPs, which cannot be consistently counted: a pte @@ -425,6 +426,9 @@ static inline void munlock_vma_page(struct page *page, (compound || !PageTransCompound(page))) munlock_page(page); } +void mlock_new_page(struct page *page); +bool need_mlock_page_drain(int cpu); +void mlock_page_drain(int cpu); extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma); @@ -503,6 +507,9 @@ static inline void mlock_vma_page(struct page *page, struct vm_area_struct *vma, bool compound) { } static inline void munlock_vma_page(struct page *page, struct vm_area_struct *vma, bool compound) { } +static inline void mlock_new_page(struct page *page) { } +static inline bool need_mlock_page_drain(int cpu) { return false; } +static inline void mlock_page_drain(int cpu) { } static inline void vunmap_range_noflush(unsigned long start, unsigned long end) { } diff --git a/mm/mlock.c b/mm/mlock.c index 581ea8bf1b83..93d616ba3e22 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -28,6 +28,8 @@ #include "internal.h" +static DEFINE_PER_CPU(struct pagevec, mlock_pvec); + bool can_do_mlock(void) { if (rlimit(RLIMIT_MEMLOCK) != 0) @@ -49,57 +51,79 @@ EXPORT_SYMBOL(can_do_mlock); * PageUnevictable is set to indicate the unevictable state. */ -/** - * mlock_page - mlock a page - * @page: page to be mlocked, either a normal page or a THP head. - */ -void mlock_page(struct page *page) +static struct lruvec *__mlock_page(struct page *page, struct lruvec *lruvec) { - struct lruvec *lruvec; - int nr_pages = thp_nr_pages(page); + /* There is nothing more we can do while it's off LRU */ + if (!TestClearPageLRU(page)) + return lruvec; - VM_BUG_ON_PAGE(PageTail(page), page); + lruvec = folio_lruvec_relock_irq(page_folio(page), lruvec); - if (!TestSetPageMlocked(page)) { - mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages); - __count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages); + if (unlikely(page_evictable(page))) { + /* + * This is a little surprising, but quite possible: + * PageMlocked must have got cleared already by another CPU. + * Could this page be on the Unevictable LRU? I'm not sure, + * but move it now if so. 
+ */ + if (PageUnevictable(page)) { + del_page_from_lru_list(page, lruvec); + ClearPageUnevictable(page); + add_page_to_lru_list(page, lruvec); + __count_vm_events(UNEVICTABLE_PGRESCUED, + thp_nr_pages(page)); + } + goto out; } - /* There is nothing more we can do while it's off LRU */ - if (!TestClearPageLRU(page)) - return; - - lruvec = folio_lruvec_lock_irq(page_folio(page)); if (PageUnevictable(page)) { - page->mlock_count++; + if (PageMlocked(page)) + page->mlock_count++; goto out; } del_page_from_lru_list(page, lruvec); ClearPageActive(page); SetPageUnevictable(page); - page->mlock_count = 1; + page->mlock_count = !!PageMlocked(page); add_page_to_lru_list(page, lruvec); - __count_vm_events(UNEVICTABLE_PGCULLED, nr_pages); + __count_vm_events(UNEVICTABLE_PGCULLED, thp_nr_pages(page)); out: SetPageLRU(page); - unlock_page_lruvec_irq(lruvec); + return lruvec; } -/** - * munlock_page - munlock a page - * @page: page to be munlocked, either a normal page or a THP head. - */ -void munlock_page(struct page *page) +static struct lruvec *__mlock_new_page(struct page *page, struct lruvec *lruvec) +{ + VM_BUG_ON_PAGE(PageLRU(page), page); + + lruvec = folio_lruvec_relock_irq(page_folio(page), lruvec); + + /* As above, this is a little surprising, but possible */ + if (unlikely(page_evictable(page))) + goto out; + + SetPageUnevictable(page); + page->mlock_count = !!PageMlocked(page); + __count_vm_events(UNEVICTABLE_PGCULLED, thp_nr_pages(page)); +out: + add_page_to_lru_list(page, lruvec); + SetPageLRU(page); + return lruvec; +} + +static struct lruvec *__munlock_page(struct page *page, struct lruvec *lruvec) { - struct lruvec *lruvec; int nr_pages = thp_nr_pages(page); + bool isolated = false; - VM_BUG_ON_PAGE(PageTail(page), page); + if (!TestClearPageLRU(page)) + goto munlock; - lock_page_memcg(page); - lruvec = folio_lruvec_lock_irq(page_folio(page)); - if (PageLRU(page) && PageUnevictable(page)) { + isolated = true; + lruvec = folio_lruvec_relock_irq(page_folio(page), lruvec); + + if (PageUnevictable(page)) { /* Then mlock_count is maintained, but might undercount */ if (page->mlock_count) page->mlock_count--; @@ -108,24 +132,144 @@ void munlock_page(struct page *page) } /* else assume that was the last mlock: reclaim will fix it if not */ +munlock: if (TestClearPageMlocked(page)) { __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); - if (PageLRU(page) || !PageUnevictable(page)) + if (isolated || !PageUnevictable(page)) __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages); else __count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages); } /* page_evictable() has to be checked *after* clearing Mlocked */ - if (PageLRU(page) && PageUnevictable(page) && page_evictable(page)) { + if (isolated && PageUnevictable(page) && page_evictable(page)) { del_page_from_lru_list(page, lruvec); ClearPageUnevictable(page); add_page_to_lru_list(page, lruvec); __count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages); } out: - unlock_page_lruvec_irq(lruvec); - unlock_page_memcg(page); + if (isolated) + SetPageLRU(page); + return lruvec; +} + +/* + * Flags held in the low bits of a struct page pointer on the mlock_pvec. + */ +#define LRU_PAGE 0x1 +#define NEW_PAGE 0x2 +#define mlock_lru(page) ((struct page *)((unsigned long)page + LRU_PAGE)) +#define mlock_new(page) ((struct page *)((unsigned long)page + NEW_PAGE)) + +/* + * mlock_pagevec() is derived from pagevec_lru_move_fn(): + * perhaps that can make use of such page pointer flags in future, + * but for now just keep it for mlock. 
We could use three separate + * pagevecs instead, but one feels better (munlocking a full pagevec + * does not need to drain mlocking pagevecs first). + */ +static void mlock_pagevec(struct pagevec *pvec) +{ + struct lruvec *lruvec = NULL; + unsigned long mlock; + struct page *page; + int i; + + for (i = 0; i < pagevec_count(pvec); i++) { + page = pvec->pages[i]; + mlock = (unsigned long)page & (LRU_PAGE | NEW_PAGE); + page = (struct page *)((unsigned long)page - mlock); + pvec->pages[i] = page; + + if (mlock & LRU_PAGE) + lruvec = __mlock_page(page, lruvec); + else if (mlock & NEW_PAGE) + lruvec = __mlock_new_page(page, lruvec); + else + lruvec = __munlock_page(page, lruvec); + } + + if (lruvec) + unlock_page_lruvec_irq(lruvec); + release_pages(pvec->pages, pvec->nr); + pagevec_reinit(pvec); +} + +void mlock_page_drain(int cpu) +{ + struct pagevec *pvec; + + pvec = &per_cpu(mlock_pvec, cpu); + if (pagevec_count(pvec)) + mlock_pagevec(pvec); +} + +bool need_mlock_page_drain(int cpu) +{ + return pagevec_count(&per_cpu(mlock_pvec, cpu)); +} + +/** + * mlock_page - mlock a page already on (or temporarily off) LRU + * @page: page to be mlocked, either a normal page or a THP head. + */ +void mlock_page(struct page *page) +{ + struct pagevec *pvec = &get_cpu_var(mlock_pvec); + + if (!TestSetPageMlocked(page)) { + int nr_pages = thp_nr_pages(page); + + mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages); + __count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages); + } + + get_page(page); + if (!pagevec_add(pvec, mlock_lru(page)) || + PageHead(page) || lru_cache_disabled()) + mlock_pagevec(pvec); + put_cpu_var(mlock_pvec); +} + +/** + * mlock_new_page - mlock a newly allocated page not yet on LRU + * @page: page to be mlocked, either a normal page or a THP head. + */ +void mlock_new_page(struct page *page) +{ + struct pagevec *pvec = &get_cpu_var(mlock_pvec); + int nr_pages = thp_nr_pages(page); + + SetPageMlocked(page); + mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages); + __count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages); + + get_page(page); + if (!pagevec_add(pvec, mlock_new(page)) || + PageHead(page) || lru_cache_disabled()) + mlock_pagevec(pvec); + put_cpu_var(mlock_pvec); +} + +/** + * munlock_page - munlock a page + * @page: page to be munlocked, either a normal page or a THP head. + */ +void munlock_page(struct page *page) +{ + struct pagevec *pvec = &get_cpu_var(mlock_pvec); + + /* + * TestClearPageMlocked(page) must be left to __munlock_page(), + * which will check whether the page is multiply mlocked. 
+ */ + + get_page(page); + if (!pagevec_add(pvec, page) || + PageHead(page) || lru_cache_disabled()) + mlock_pagevec(pvec); + put_cpu_var(mlock_pvec); } static int mlock_pte_range(pmd_t *pmd, unsigned long addr, diff --git a/mm/swap.c b/mm/swap.c index 3f770b1ea2c1..842d5cd92cf6 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -490,18 +490,12 @@ EXPORT_SYMBOL(folio_add_lru); void lru_cache_add_inactive_or_unevictable(struct page *page, struct vm_area_struct *vma) { - bool unevictable; - VM_BUG_ON_PAGE(PageLRU(page), page); - unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED; - if (unlikely(unevictable) && !TestSetPageMlocked(page)) { - int nr_pages = thp_nr_pages(page); - - mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages); - count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages); - } - lru_cache_add(page); + if (unlikely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED)) + mlock_new_page(page); + else + lru_cache_add(page); } /* @@ -640,6 +634,7 @@ void lru_add_drain_cpu(int cpu) pagevec_lru_move_fn(pvec, lru_lazyfree_fn); activate_page_drain(cpu); + mlock_page_drain(cpu); } /** @@ -842,6 +837,7 @@ inline void __lru_add_drain_all(bool force_all_cpus) pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) || pagevec_count(&per_cpu(lru_pvecs.lru_lazyfree, cpu)) || need_activate_page_drain(cpu) || + need_mlock_page_drain(cpu) || has_bh_in_lru(cpu, NULL)) { INIT_WORK(work, lru_add_drain_per_cpu); queue_work_on(cpu, mm_percpu_wq, work); @@ -1030,7 +1026,7 @@ static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec) * Is an smp_mb__after_atomic() still required here, before * folio_evictable() tests PageMlocked, to rule out the possibility * of stranding an evictable folio on an unevictable LRU? I think - * not, because munlock_page() only clears PageMlocked while the LRU + * not, because __munlock_page() only clears PageMlocked while the LRU * lock is held. * * (That is not true of __page_cache_release(), and not necessarily @@ -1043,7 +1039,14 @@ static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec) } else { folio_clear_active(folio); folio_set_unevictable(folio); - folio->mlock_count = !!folio_test_mlocked(folio); + /* + * folio->mlock_count = !!folio_test_mlocked(folio)? + * But that leaves __mlock_page() in doubt whether another + * actor has already counted the mlock or not. Err on the + * safe side, underestimate, let page reclaim fix it, rather + * than leaving a page on the unevictable LRU indefinitely. 
+ */ + folio->mlock_count = 0; if (!was_unevictable) __count_vm_events(UNEVICTABLE_PGCULLED, nr_pages); } From patchwork Tue Feb 15 02:38:47 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Hugh Dickins X-Patchwork-Id: 12746455 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2445BC433F5 for ; Tue, 15 Feb 2022 02:38:53 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id A2A436B0078; Mon, 14 Feb 2022 21:38:52 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 9DB576B007B; Mon, 14 Feb 2022 21:38:52 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 854596B007D; Mon, 14 Feb 2022 21:38:52 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 767626B0078 for ; Mon, 14 Feb 2022 21:38:52 -0500 (EST) Received: from smtpin28.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id 34380181AC9C6 for ; Tue, 15 Feb 2022 02:38:52 +0000 (UTC) X-FDA: 79143456504.28.1FAC599 Received: from mail-qt1-f169.google.com (mail-qt1-f169.google.com [209.85.160.169]) by imf28.hostedemail.com (Postfix) with ESMTP id AD1BCC0003 for ; Tue, 15 Feb 2022 02:38:51 +0000 (UTC) Received: by mail-qt1-f169.google.com with SMTP id s1so17257603qtw.9 for ; Mon, 14 Feb 2022 18:38:51 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:from:to:cc:subject:in-reply-to:message-id:references :mime-version; bh=7RY/qhKlJPQeH1IwAyGGlnES1Mny5ocT5QtyB/fayZE=; b=CssZuZOe3DBbjf8JJaBjw58IQNx1PjFb/exoxSarxXEXb0Oku+9ZRuCUj5Z7Ybceav LxQzjdOLq1kQgG8GGiKW3Qcn+MN7WOn76nP1logW5v3dLX60adeSoNzxXJZkNF2yUrwx XIP/VjCPuz4BreDgp6azxwICU4A7RuibWm8BmOIru5OKZPerzNFnTXIE3RgfavBFXe4E L7GwFOHtBTXGz0268x1zY/1+ABnLkf1a3F32q/cqc8MXzk5h1SULSzWDWIgkwDyxJKDQ RsslYR7OvL/db5PKlOCiW1UED3kwElfK3o6xq0CcIoCuGhhXPzPh8MgCllnZA31k3/Uj c6MQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:in-reply-to:message-id :references:mime-version; bh=7RY/qhKlJPQeH1IwAyGGlnES1Mny5ocT5QtyB/fayZE=; b=bP7i80+XD/bjZmQ9jU0N8TGO9+OkY2MXP3d4Eg2RjLswFyO5g0SUeRjRgsoo+5j18D C2cnoc3F94YntYKrZH4wUdWCUpfSz4D3FDUp/xTaDrjvqB6UuIhkx6pFOTe6Uf4yxhcc oW3bpgMaj9lKYKUVhCSv5Ac2nDv4qBShQZdEDi44qhh0KHtC7hsaTdhu4UpF5uh5Qijm mmeTYCbW0XlOgg4YqMgDb6Ik1lPO1IYZYhtNlNyvVO5xqElL/lxN6WBe7PSQGzSy6MYn CedGCUMRn8ORO0fuALZWZJuuVUvOPxxNGTZhY2BCtGUX/HgDmksljUuiEsrYzTX+ouKk bB/A== X-Gm-Message-State: AOAM533X2SpsdN2Qci1hc5J3hQHcjHLqjlpa+LL+LjqPQ2T9kDiNVsDX YvkWPwIPAcDXNJk3DD+uCH3BQw== X-Google-Smtp-Source: ABdhPJxVwCU2R6EW1pfoD9ezIoDE5R0PkX8gv3sUUegKvg3dhIR6WvgfD5QXF7JGwdRw6BNI9vYXVA== X-Received: by 2002:a05:622a:1ce:: with SMTP id t14mr1338296qtw.25.1644892730863; Mon, 14 Feb 2022 18:38:50 -0800 (PST) Received: from ripple.attlocal.net (172-10-233-147.lightspeed.sntcca.sbcglobal.net. 
[172.10.233.147]) by smtp.gmail.com with ESMTPSA id r13sm10993253qkp.129.2022.02.14.18.38.48 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 14 Feb 2022 18:38:50 -0800 (PST) Date: Mon, 14 Feb 2022 18:38:47 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@ripple.anvils To: Andrew Morton cc: Michal Hocko , Vlastimil Babka , "Kirill A. Shutemov" , Matthew Wilcox , David Hildenbrand , Alistair Popple , Johannes Weiner , Rik van Riel , Suren Baghdasaryan , Yu Zhao , Greg Thelen , Shakeel Butt , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v2 11/13] mm/munlock: page migration needs mlock pagevec drained In-Reply-To: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> Message-ID: <9e2ed861-951a-6e86-e298-a09d2d8e9b9f@google.com> References: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> MIME-Version: 1.0 X-Rspam-User: X-Rspamd-Server: rspam08 X-Rspamd-Queue-Id: AD1BCC0003 X-Stat-Signature: ib94wdx35xj7mkxcqjek8cfwiwz6uxc5 Authentication-Results: imf28.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=CssZuZOe; spf=pass (imf28.hostedemail.com: domain of hughd@google.com designates 209.85.160.169 as permitted sender) smtp.mailfrom=hughd@google.com; dmarc=pass (policy=reject) header.from=google.com X-HE-Tag: 1644892731-999795 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Page migration of a VM_LOCKED page tends to fail, because when the old page is unmapped, it is put on the mlock pagevec with raised refcount, which then fails the freeze. At first I thought this would be fixed by a local mlock_page_drain() at the upper rmap_walk() level - which would have nicely batched all the munlocks of that page; but tests show that the task can too easily move to another cpu, leaving pagevec residue behind which fails the migration. So try_to_migrate_one() drain the local pagevec after page_remove_rmap() from a VM_LOCKED vma; and do the same in try_to_unmap_one(), whose TTU_IGNORE_MLOCK users would want the same treatment; and do the same in remove_migration_pte() - not important when successfully inserting a new page, but necessary when hoping to retry after failure. Any new pagevec runs the risk of adding a new way of stranding, and we might discover other corners where mlock_page_drain() or lru_add_drain() would now help. Signed-off-by: Hugh Dickins Acked-by: Vlastimil Babka --- v2: Delete suggestion of sysctl from changelog, per Vlastimil. 
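To see why pagevec residue fails the freeze, a toy model helps - toy_page, freeze_ref() and the reference counts here are invented for the example, not the real migration code:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model of migration's refcount freeze: migration may only
 * proceed when the refcount is exactly what the caller can account
 * for; any unexpected reference makes the compare-and-swap fail.
 */
struct toy_page { atomic_int refcount; };

static bool freeze_ref(struct toy_page *page, int expected)
{
	int count = expected;

	return atomic_compare_exchange_strong(&page->refcount, &count, 0);
}

int main(void)
{
	struct toy_page page = { .refcount = 2 };	/* e.g. mapping + isolation */
	int expected = 2;

	/* munlock queued the page: the per-cpu pagevec holds a reference */
	atomic_fetch_add(&page.refcount, 1);
	printf("freeze with pagevec residue: %s\n",
	       freeze_ref(&page, expected) ? "succeeds" : "fails");

	/* mlock_page_drain() processes the pagevec and drops that reference */
	atomic_fetch_sub(&page.refcount, 1);
	printf("freeze after local drain:    %s\n",
	       freeze_ref(&page, expected) ? "succeeds" : "fails");
	return 0;
}

Which is why the patch drains the local pagevec right after page_remove_rmap() of a VM_LOCKED vma, before the migration path goes on to freeze the refcount.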
mm/migrate.c | 2 ++ mm/rmap.c | 4 ++++ 2 files changed, 6 insertions(+) diff --git a/mm/migrate.c b/mm/migrate.c index f4bcf1541b62..e7d0b68d5dcb 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -251,6 +251,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, page_add_file_rmap(new, vma, false); set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); } + if (vma->vm_flags & VM_LOCKED) + mlock_page_drain(smp_processor_id()); /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, pvmw.address, pvmw.pte); diff --git a/mm/rmap.c b/mm/rmap.c index 5442a5c97a85..714bfdc72c7b 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1656,6 +1656,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * See Documentation/vm/mmu_notifier.rst */ page_remove_rmap(subpage, vma, PageHuge(page)); + if (vma->vm_flags & VM_LOCKED) + mlock_page_drain(smp_processor_id()); put_page(page); } @@ -1930,6 +1932,8 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, * See Documentation/vm/mmu_notifier.rst */ page_remove_rmap(subpage, vma, PageHuge(page)); + if (vma->vm_flags & VM_LOCKED) + mlock_page_drain(smp_processor_id()); put_page(page); } From patchwork Tue Feb 15 02:40:55 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Hugh Dickins X-Patchwork-Id: 12746456 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2BF6DC433EF for ; Tue, 15 Feb 2022 02:41:01 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 8C8A36B0078; Mon, 14 Feb 2022 21:41:00 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 878E46B007D; Mon, 14 Feb 2022 21:41:00 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7405A6B007E; Mon, 14 Feb 2022 21:41:00 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0239.hostedemail.com [216.40.44.239]) by kanga.kvack.org (Postfix) with ESMTP id 63FB96B0078 for ; Mon, 14 Feb 2022 21:41:00 -0500 (EST) Received: from smtpin14.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 15FA3180AD837 for ; Tue, 15 Feb 2022 02:41:00 +0000 (UTC) X-FDA: 79143461880.14.B2E5D62 Received: from mail-qv1-f54.google.com (mail-qv1-f54.google.com [209.85.219.54]) by imf20.hostedemail.com (Postfix) with ESMTP id 84B721C0006 for ; Tue, 15 Feb 2022 02:40:59 +0000 (UTC) Received: by mail-qv1-f54.google.com with SMTP id fh9so16510058qvb.1 for ; Mon, 14 Feb 2022 18:40:59 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:from:to:cc:subject:in-reply-to:message-id:references :mime-version; bh=4LCswZrElV6Pmzb8YyUA7fug85GktLISWrDlYL4+cJU=; b=RK8Rg5/4qjjIjh2MTmUdnU7ZE6Ocg/1c/4CORITP3/BkGS1yJqcNkG4BUYhF3CsYX7 XNdKA9GUjgoh2Xw/MQZrmTMn98R8uxZp/nDVs2vc6bQ3D2VZ9zBBrjuJZ1MYtsEK7OpY ric3eQtbdsLIvZ3djy3yq/LSIvgsGyPLvO0Fo31gqGU8Q0uhhfMtJVKSzSm1GNxuhh2D NcWiyqOB4fIBiqJ6eIWhcEj2nkTnwxVYrKT6TP00O81qEGGsHr3do/QRqhygbnlSg0zf aopiU81m8kQKUrsRc6VhRrcX0ffKj4+M/8Z8aU/3XHNigrR3fv536/y/lnNuXm47bxK5 a0GA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:in-reply-to:message-id :references:mime-version; 
bh=4LCswZrElV6Pmzb8YyUA7fug85GktLISWrDlYL4+cJU=; b=ZTeclmxmnldGwjXsb/79yyBmRNmlwJG9saGdtbTKrT6l2VmgMtGC05g6eSdpQNrpFh YxG7mvO8DMVTMkm/q+C/YBqwov3U13ThHa4NyhW4PR2PGV/Ghm/OUd72zmBpWlLd/tom fTcKGv5fCBb9e/2wySNOlwkzDno4Ju9jvcZh9Y1ZdECkEIVx0A+540tnzlSu5+WK0nLh kg6Nz0Z5NOniXbg6FfPZi7u26tx8u2kao5H/wdbCepZY815WSA8DvwlyB/vHTiuJk8d3 WVtUVLV7eQ87GTu9vXt850q90AVBP033su6tcLIUww2SNjs8mtb79iUDMKsSshp+BFOb 7udQ== X-Gm-Message-State: AOAM532NsM/BCK+D+OykcHM7cw92heBskIjeeMt0hjPB0vrVbZ4kTVqc wQ8xVI6g38BI9eYAR3Xg4uoJgQ== X-Google-Smtp-Source: ABdhPJwoOBqHBajFmA50x9euRGxs5w57fNsAgoxQwuf0kGlCmHaynY42ZIqWK7LLuu2RSLh3vtEqlQ== X-Received: by 2002:a05:6214:627:: with SMTP id a7mr1412718qvx.91.1644892858736; Mon, 14 Feb 2022 18:40:58 -0800 (PST) Received: from ripple.attlocal.net (172-10-233-147.lightspeed.sntcca.sbcglobal.net. [172.10.233.147]) by smtp.gmail.com with ESMTPSA id m9sm17437831qkn.81.2022.02.14.18.40.56 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 14 Feb 2022 18:40:57 -0800 (PST) Date: Mon, 14 Feb 2022 18:40:55 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@ripple.anvils To: Andrew Morton cc: Michal Hocko , Vlastimil Babka , "Kirill A. Shutemov" , Matthew Wilcox , David Hildenbrand , Alistair Popple , Johannes Weiner , Rik van Riel , Suren Baghdasaryan , Yu Zhao , Greg Thelen , Shakeel Butt , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v2 12/13] mm/thp: collapse_file() do try_to_unmap(TTU_BATCH_FLUSH) In-Reply-To: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> Message-ID: References: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> MIME-Version: 1.0 X-Rspamd-Queue-Id: 84B721C0006 X-Rspam-User: Authentication-Results: imf20.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b="RK8Rg5/4"; spf=pass (imf20.hostedemail.com: domain of hughd@google.com designates 209.85.219.54 as permitted sender) smtp.mailfrom=hughd@google.com; dmarc=pass (policy=reject) header.from=google.com X-Stat-Signature: i35hpfgqt9kthsbd9dn87ut5icazd596 X-Rspamd-Server: rspam03 X-HE-Tag: 1644892859-118580 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: collapse_file() is using unmap_mapping_pages(1) on each small page found mapped, unlike others (reclaim, migration, splitting, memory-failure) who use try_to_unmap(). There are four advantages to try_to_unmap(): first, its TTU_IGNORE_MLOCK option now avoids leaving mlocked page in pagevec; second, its vma lookup uses i_mmap_lock_read() not i_mmap_lock_write(); third, it breaks out early if page is not mapped everywhere it might be; fourth, its TTU_BATCH_FLUSH option can be used, as in page reclaim, to save up all the TLB flushing until all of the pages have been unmapped. Wild guess: perhaps it was originally written to use try_to_unmap(), but hit the VM_BUG_ON_PAGE(page_mapped) after unmapping, because without TTU_SYNC it may skip page table locks; but unmap_mapping_pages() never skips them, so fixed the issue. I did once hit that VM_BUG_ON_PAGE() since making this change: we could pass TTU_SYNC here, but I think just delete the check - the race is very rare, this is an ordinary small page so we don't need to be so paranoid about mapcount surprises, and the page_ref_freeze() just below already handles the case adequately. Signed-off-by: Hugh Dickins --- v2: same as v1. 
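The fourth advantage is the familiar deferred-flush pattern: record that a TLB flush is owed while unmapping, and issue one flush for the whole batch afterwards. A minimal sketch of that idea, with invented flag and counter names rather than the kernel's TLB-batching machinery:

#include <stdbool.h>
#include <stdio.h>

static bool tlb_flush_pending;	/* stands in for the arch flush-batch state */
static int tlb_flushes;

static void unmap_one(bool batch)
{
	if (batch)
		tlb_flush_pending = true;	/* defer, as with TTU_BATCH_FLUSH */
	else
		tlb_flushes++;			/* flush per page */
}

static void flush_if_pending(void)	/* like try_to_unmap_flush() */
{
	if (tlb_flush_pending) {
		tlb_flush_pending = false;
		tlb_flushes++;
	}
}

int main(void)
{
	int page;

	for (page = 0; page < 512; page++)
		unmap_one(true);
	flush_if_pending();

	printf("flushes for 512 unmapped pages: %d\n", tlb_flushes);
	return 0;
}

A single try_to_unmap_flush() after the loop then replaces what would otherwise be one flush per mapped page; and as the comment added below says, doing it even on the failure path just clears the batched state.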
mm/khugepaged.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index d5e387c58bde..e0883a33efd6 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1829,13 +1829,12 @@ static void collapse_file(struct mm_struct *mm, } if (page_mapped(page)) - unmap_mapping_pages(mapping, index, 1, false); + try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_BATCH_FLUSH); xas_lock_irq(&xas); xas_set(&xas, index); VM_BUG_ON_PAGE(page != xas_load(&xas), page); - VM_BUG_ON_PAGE(page_mapped(page), page); /* * The page is expected to have page_count() == 3: @@ -1899,6 +1898,13 @@ static void collapse_file(struct mm_struct *mm, xas_unlock_irq(&xas); xa_unlocked: + /* + * If collapse is successful, flush must be done now before copying. + * If collapse is unsuccessful, does flush actually need to be done? + * Do it anyway, to clear the state. + */ + try_to_unmap_flush(); + if (result == SCAN_SUCCEED) { struct page *page, *tmp; From patchwork Tue Feb 15 02:42:33 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Hugh Dickins X-Patchwork-Id: 12746457 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C275FC433F5 for ; Tue, 15 Feb 2022 02:42:39 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 20FA06B0078; Mon, 14 Feb 2022 21:42:39 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 1C0806B007B; Mon, 14 Feb 2022 21:42:39 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 087B36B007D; Mon, 14 Feb 2022 21:42:39 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0098.hostedemail.com [216.40.44.98]) by kanga.kvack.org (Postfix) with ESMTP id EF9976B0078 for ; Mon, 14 Feb 2022 21:42:38 -0500 (EST) Received: from smtpin26.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id A7D5990085 for ; Tue, 15 Feb 2022 02:42:38 +0000 (UTC) X-FDA: 79143465996.26.C5C42DE Received: from mail-qt1-f171.google.com (mail-qt1-f171.google.com [209.85.160.171]) by imf30.hostedemail.com (Postfix) with ESMTP id 402A380002 for ; Tue, 15 Feb 2022 02:42:38 +0000 (UTC) Received: by mail-qt1-f171.google.com with SMTP id y8so17284795qtn.8 for ; Mon, 14 Feb 2022 18:42:38 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:from:to:cc:subject:in-reply-to:message-id:references :mime-version; bh=giPReZ8UhbPqkZLEqkAOxK4yInsc7PbE1m6SE66Te4Q=; b=btIELsE6p+AIrHoSPLSohcTUVe+luSKo9dd0AyNSjs7Ju9Xa9XzP3dnes8rgd8EOwp mm6tUMUxoPRUi6IiZyTjfcVWS1/wWYrOEJiRagFc0uLo1MraDS+RlIc43lg6+LoXvVZX 4p4LFiGoHfzMTc4q3TgVGzbSDb8K9+QqdoZIbP9HQ7U5RXMwijNjwTcSe2Qw+GF3LUz0 12psOMKrAo6jkpCt9p5nygK7/F9KKWOZimQoPlXIUCJjJUgT9SQpsXHB7j7TGRXaW/W9 S9f+IR9MLXYuRPp/+mThEHeH2tlKRWEPKnc7IPr8zPOSY/a6A0U5WhX69BF+/lTw9vkR DTpA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:in-reply-to:message-id :references:mime-version; bh=giPReZ8UhbPqkZLEqkAOxK4yInsc7PbE1m6SE66Te4Q=; b=OpCCOb2MaYXP+QYvHFhy92oSS8afp7RP9wAzbEJU15cmpqBKMj/HdnQ6Qtw5yaRfBU YZZPsl7puPKrreWiG1zRjklh061G0kCFpvWOk7kR3T78ZK+rdQbW4KGqOih5i+vup92w XDlEdo5/MPNb25xOODf66TRFQaZjSuwjqZt7QBn36r+DFzXF7ZjWaJGyyV6P0yy4nROX 
wKTkUnPGVjhPpuSeiOmJb7UsBfIrpEO8YSMwnBQ3ZHvn5kIH/+rewqjbR96CQGa12Tfh Qq0VP0JViMK16dNDu2yn13Ggpjj4ylYRcO+qtgpgyDzmYwVL1ZVaiY0TzelFUKHvXnlo vUpQ== X-Gm-Message-State: AOAM531kqrfLUhnFLHA9fbHDOv+ClRWEE3+lLILdCpCpqdf1izLcue/B 2QPkytnCUumS/NpQEAVdzQ+AEQ== X-Google-Smtp-Source: ABdhPJznFEk0d/fI+EPrzI5m8zqaIJocidCZFSgE1ynr0TcpDqCkqNNw84SymhENnH3FgBo3d8wQAw== X-Received: by 2002:ac8:4e87:: with SMTP id 7mr1362080qtp.9.1644892956787; Mon, 14 Feb 2022 18:42:36 -0800 (PST) Received: from ripple.attlocal.net (172-10-233-147.lightspeed.sntcca.sbcglobal.net. [172.10.233.147]) by smtp.gmail.com with ESMTPSA id c20sm19161403qtb.58.2022.02.14.18.42.34 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 14 Feb 2022 18:42:36 -0800 (PST) Date: Mon, 14 Feb 2022 18:42:33 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@ripple.anvils To: Andrew Morton cc: Michal Hocko , Vlastimil Babka , "Kirill A. Shutemov" , Matthew Wilcox , David Hildenbrand , Alistair Popple , Johannes Weiner , Rik van Riel , Suren Baghdasaryan , Yu Zhao , Greg Thelen , Shakeel Butt , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v2 13/13] mm/thp: shrink_page_list() avoid splitting VM_LOCKED THP In-Reply-To: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> Message-ID: <531d13ee-bc7d-329a-9748-5e272f699d78@google.com> References: <55a49083-37f9-3766-1de9-9feea7428ac@google.com> MIME-Version: 1.0 X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: 402A380002 X-Stat-Signature: uybd1scjkueboy989rgtod91zsbe7qma X-Rspam-User: Authentication-Results: imf30.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=btIELsE6; spf=pass (imf30.hostedemail.com: domain of hughd@google.com designates 209.85.160.171 as permitted sender) smtp.mailfrom=hughd@google.com; dmarc=pass (policy=reject) header.from=google.com X-HE-Tag: 1644892958-25186 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: 4.8 commit 7751b2da6be0 ("vmscan: split file huge pages before paging them out") inserted a split_huge_page_to_list() into shrink_page_list() without considering the mlock case: no problem if the page has already been marked as Mlocked (the !page_evictable check much higher up will have skipped all this), but it has always been the case that races or omissions in setting Mlocked can rely on page reclaim to detect this and correct it before actually reclaiming - and that remains so, but what a shame if a hugepage is needlessly split before discovering it. It is surprising that page_check_references() returns PAGEREF_RECLAIM when VM_LOCKED, but there was a good reason for that: try_to_unmap_one() is where the condition is detected and corrected; and until now it could not be done in page_referenced_one(), because that does not always have the page locked. Now that mlock's requirement for page lock has gone, copy try_to_unmap_one()'s mlock restoration into page_referenced_one(), and let page_check_references() return PAGEREF_ACTIVATE in this case. But page_referenced_one() may find a pte mapping one part of a hugepage: what hold should a pte mapped in a VM_LOCKED area exert over the entire huge page? That's debatable. The approach taken here is to treat that pte mapping in page_referenced_one() as if not VM_LOCKED, and if no VM_LOCKED pmd mapping is found later in the walk, and lack of reference permits, then PAGEREF_RECLAIM take it to attempted splitting as before. Signed-off-by: Hugh Dickins --- v2: same as v1. 
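The VM_LOCKED policy described above amounts to a small decision table; a toy rendering of it follows, where mlock_captures() and its arguments are invented names - the real test is the PageTransCompound/pvmw.pte check added to page_referenced_one() in the diff below:

#include <stdbool.h>
#include <stdio.h>

/*
 * A VM_LOCKED mapping only "captures" the page here (restore the
 * missed mlock, report VM_LOCKED so reclaim activates it) when the
 * mapping covers the whole page: a small page, or a pmd-mapped THP.
 * A pte mapping just one part of a THP is treated as not VM_LOCKED.
 */
static bool mlock_captures(bool vma_locked, bool is_thp, bool pte_mapped)
{
	return vma_locked && (!is_thp || !pte_mapped);
}

int main(void)
{
	printf("small page, VM_LOCKED pte mapping: %d\n",
	       mlock_captures(true, false, true));
	printf("THP, VM_LOCKED pmd mapping:        %d\n",
	       mlock_captures(true, true, false));
	printf("THP, VM_LOCKED pte of one part:    %d\n",
	       mlock_captures(true, true, true));
	printf("no VM_LOCKED mapping at all:       %d\n",
	       mlock_captures(false, true, false));
	return 0;
}

Only in the first two cases is the mlock restored and PAGEREF_ACTIVATE returned; a pte mapping one part of a hugepage, with no VM_LOCKED pmd mapping found later in the walk and lack of reference permitting, still falls through to attempted splitting as before.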
mm/rmap.c | 7 +++++-- mm/vmscan.c | 6 +++--- 2 files changed, 8 insertions(+), 5 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index 714bfdc72c7b..c7921c102bc0 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -812,7 +812,10 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma, while (page_vma_mapped_walk(&pvmw)) { address = pvmw.address; - if (vma->vm_flags & VM_LOCKED) { + if ((vma->vm_flags & VM_LOCKED) && + (!PageTransCompound(page) || !pvmw.pte)) { + /* Restore the mlock which got missed */ + mlock_vma_page(page, vma, !pvmw.pte); page_vma_mapped_walk_done(&pvmw); pra->vm_flags |= VM_LOCKED; return false; /* To break the loop */ @@ -851,7 +854,7 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma, if (referenced) { pra->referenced++; - pra->vm_flags |= vma->vm_flags; + pra->vm_flags |= vma->vm_flags & ~VM_LOCKED; } if (!pra->mapcount) diff --git a/mm/vmscan.c b/mm/vmscan.c index 090bfb605ecf..a160efba3c73 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1386,11 +1386,11 @@ static enum page_references page_check_references(struct page *page, referenced_page = TestClearPageReferenced(page); /* - * Mlock lost the isolation race with us. Let try_to_unmap() - * move the page to the unevictable list. + * The supposedly reclaimable page was found to be in a VM_LOCKED vma. + * Let the page, now marked Mlocked, be moved to the unevictable list. */ if (vm_flags & VM_LOCKED) - return PAGEREF_RECLAIM; + return PAGEREF_ACTIVATE; if (referenced_ptes) { /*