[043/155] mm/gup: track FOLL_PIN pages

Message ID 20200402040529.wHKICHtvB%akpm@linux-foundation.org (mailing list archive)
State New, archived
Series [001/155] tools/accounting/getdelays.c: fix netlink attribute length

Commit Message

Andrew Morton April 2, 2020, 4:05 a.m. UTC
From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup: track FOLL_PIN pages

Add tracking of pages that were pinned via FOLL_PIN.  This tracking is
implemented via overloading of page->_refcount: pins are added by adding
GUP_PIN_COUNTING_BIAS (1024) to the refcount.  This provides a fuzzy
indication of pinning, and it can have false positives (and that's OK). 
Please see the pre-existing Documentation/core-api/pin_user_pages.rst for
details.
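
As a condensed, purely illustrative sketch of the helpers added below (not
new API), the counting scheme amounts to:

   /* pinning (FOLL_PIN) a page adds the bias (GUP_PIN_COUNTING_BIAS == 1024): */
   page_ref_add(page, GUP_PIN_COUNTING_BIAS);

   /* unpinning subtracts it again, dropping the page on the last reference: */
   if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS))
           __put_page(page);

   /* the fuzzy "is this page pinned?" query is then a threshold check: */
   maybe_pinned = page_ref_count(compound_head(page)) >= GUP_PIN_COUNTING_BIAS;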

As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN
(typically via pin_user_pages*()) are required to ultimately free such
pages via unpin_user_page().

Please also note the limitation, discussed in pin_user_pages.rst under the
"TODO: for 1GB and larger huge pages" section.  (That limitation will be
removed in a following patch.)

The effect of a FOLL_PIN flag is similar to that of FOLL_GET, and may be
thought of as "FOLL_GET for DIO and/or RDMA use".
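
For illustration only, a DIO-style caller pairs the pin and unpin calls
roughly as follows (a minimal sketch; "user_addr" is just a hypothetical
user virtual address, and error handling is abbreviated):

   struct page *page;
   int ret;

   ret = pin_user_pages_fast(user_addr, 1, FOLL_WRITE, &page);
   if (ret != 1)
           return ret < 0 ? ret : -EFAULT;

   /* ... DMA or direct IO against the page ... */

   unpin_user_page(page);       /* not put_page() */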

Pages that have been pinned via FOLL_PIN are identifiable via a new
function call:

   bool page_maybe_dma_pinned(struct page *page);

What to do in response to encountering such a page is left to later
patchsets. There is discussion about this in [1], [2], [3], and [4].

This also changes a BUG_ON() to a WARN_ON() in follow_page_mask().

[1] Some slow progress on get_user_pages() (Apr 2, 2019):
    https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
    https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
    https://lwn.net/Articles/753027/
[4] LWN kernel index: get_user_pages():
    https://lwn.net/Kernel/Index/#Memory_management-get_user_pages

[jhubbard@nvidia.com: add kerneldoc]
  Link: http://lkml.kernel.org/r/20200307021157.235726-1-jhubbard@nvidia.com
[imbrenda@linux.ibm.com: if pin fails, we need to unpin, a simple put_page will not be enough]
  Link: http://lkml.kernel.org/r/20200306132537.783769-2-imbrenda@linux.ibm.com
[akpm@linux-foundation.org: fix put_compound_head defined but not used]
Link: http://lkml.kernel.org/r/20200211001536.1027652-7-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/core-api/pin_user_pages.rst |    6 
 include/linux/mm.h                        |   84 ++++-
 mm/gup.c                                  |  312 ++++++++++++++++----
 mm/huge_memory.c                          |   29 +
 mm/hugetlb.c                              |   54 ++-
 5 files changed, 380 insertions(+), 105 deletions(-)

Comments

Tetsuo Handa April 9, 2020, 6:08 a.m. UTC | #1
Hello.

I'm hitting WARN_ON() at try_get_page() from try_grab_page() due to page_ref_count(page) == -1021
(which is "3 - GUP_PIN_COUNTING_BIAS") if I use loadable kernel module version of TOMOYO security
module. Since I don't see recent changes in security/tomoyo regarding get_user_pages_remote(),
I'm wondering what is happening.

[   10.427414][    T1] ------------[ cut here ]------------
[   10.427425][    T1] WARNING: CPU: 3 PID: 1 at ./include/linux/mm.h:1009 try_grab_page+0x77/0x80
[   10.427426][    T1] Modules linked in: caitsith(O) akari(O) xfs libcrc32c crc32c_generic sd_mod ata_generic pata_acpi mptspi scsi_transport_spi vmwgfx mptscsih drm_kms_helper mptbase cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea fb ahci fbdev libahci ttm ata_piix nvme drm drm_panel_orientation_quirks libata nvme_core t10_pi agpgart e1000 i2c_core scsi_mod serio_raw atkbd libps2 i8042 serio unix
[   10.427451][    T1] CPU: 3 PID: 1 Comm: akari-init Tainted: G           O      5.6.0-05654-g3faa52c03f44 #984
[   10.427452][    T1] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/29/2019
[   10.427454][    T1] RIP: 0010:try_grab_page+0x77/0x80
[   10.427456][    T1] Code: 34 85 c0 7e 25 f0 ff 47 34 b8 01 00 00 00 5d c3 0f 0b eb b1 48 8d 78 ff eb c6 0f 0b 31 c0 5d 0f 1f 40 00 c3 48 8d 78 ff eb d4 <0f> 0b 31 c0 5d c3 0f 1f 00 55 48 89 e5 41 57 49 89 cf 41 56 49 89
[   10.427457][    T1] RSP: 0018:ffffa859c0013ac0 EFLAGS: 00010286
[   10.427459][    T1] RAX: 00000000fffffc03 RBX: ffffcb6208c71dc0 RCX: 8000000231c77067
[   10.427460][    T1] RDX: 8000000231c77067 RSI: 0000000000002016 RDI: ffffcb6208c71dc0
[   10.427461][    T1] RBP: ffffa859c0013ac0 R08: 0000000000000001 R09: 0000000000000000
[   10.427462][    T1] R10: 0000000000000001 R11: ffff95d91c615340 R12: 0000000000002016
[   10.427463][    T1] R13: ffff95d91c615328 R14: 0000000000000ff0 R15: ffff95d91e0c0ff0
[   10.427465][    T1] FS:  00007fbfc3772740(0000) GS:ffff95d927a00000(0000) knlGS:0000000000000000
[   10.427479][    T1] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   10.427483][    T1] CR2: 0000000000ac5190 CR3: 000000022090a002 CR4: 00000000003606e0
[   10.427491][    T1] Call Trace:
[   10.427494][    T1]  follow_page_pte+0x329/0x443
[   10.427503][    T1]  follow_p4d_mask+0x756/0x7c1
[   10.427511][    T1]  follow_page_mask+0x6a/0x70
[   10.427516][    T1]  __get_user_pages+0x110/0x880
[   10.427518][    T1]  ? ___slab_alloc.constprop.95+0x929/0x980
[   10.427532][    T1]  __get_user_pages_remote+0xce/0x230
[   10.427540][    T1]  get_user_pages_remote+0x27/0x40
[   10.427545][    T1]  ccs_dump_page+0x6a/0x140 [akari]
[   10.427552][    T1]  ccs_start_execve+0x28c/0x490 [akari]
[   10.427555][    T1]  ? ccs_start_execve+0x90/0x490 [akari]
[   10.427561][    T1]  ? ccs_load_policy+0xee/0x150 [akari]
[   10.427568][    T1]  ccs_bprm_check_security+0x4e/0x70 [akari]
[   10.427572][    T1]  security_bprm_check+0x26/0x40
[   10.427576][    T1]  search_binary_handler+0x22/0x1c0
[   10.427580][    T1]  __do_execve_file.isra.41+0x723/0xac0
[   10.427581][    T1]  ? __do_execve_file.isra.41+0x665/0xac0
[   10.427590][    T1]  __x64_sys_execve+0x44/0x50
[   10.427614][    T1]  do_syscall_64+0x4a/0x1e0
[   10.427618][    T1]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
[   10.427649][    T1] RIP: 0033:0x7fbfc2e36c37
[   10.427651][    T1] Code: ff ff 76 df 89 c6 f7 de 64 41 89 32 eb d5 89 c6 f7 de 64 41 89 32 eb db 66 2e 0f 1f 84 00 00 00 00 00 90 b8 3b 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 02 f3 c3 48 8b 15 08 12 30 00 f7 d8 64 89 02
[   10.427652][    T1] RSP: 002b:00007fff25f9c0f8 EFLAGS: 00000202 ORIG_RAX: 000000000000003b
[   10.427654][    T1] RAX: ffffffffffffffda RBX: 0000000000ac64f0 RCX: 00007fbfc2e36c37
[   10.427672][    T1] RDX: 0000000000ac6be0 RSI: 0000000000ac6ac0 RDI: 0000000000ac64f0
[   10.427673][    T1] RBP: 0000000000000000 R08: 00007fff25f9c0e0 R09: 0000000000000000
[   10.427674][    T1] R10: 00007fff25f9bb60 R11: 0000000000000202 R12: 0000000000ac6be0
[   10.427675][    T1] R13: 0000000000ac6ac0 R14: 0000000000ac6be0 R15: 0000000000ac6820
[   10.427701][    T1] irq event stamp: 10135190
[   10.427704][    T1] hardirqs last  enabled at (10135189): [<ffffffffb223092a>] __slab_alloc.constprop.94+0x48/0x5e
[   10.427706][    T1] hardirqs last disabled at (10135190): [<ffffffffb2001eb7>] trace_hardirqs_off_thunk+0x1a/0x1c
[   10.427707][    T1] softirqs last  enabled at (10135114): [<ffffffffb2a0032b>] __do_softirq+0x32b/0x455
[   10.427710][    T1] softirqs last disabled at (10135107): [<ffffffffb2072985>] irq_exit+0xa5/0xb0
[   10.427711][    T1] ---[ end trace 984e9bd0ce5a1e09 ]---

+bool __must_check try_grab_page(struct page *page, unsigned int flags)
+{
+       WARN_ON_ONCE((flags & (FOLL_GET | FOLL_PIN)) == (FOLL_GET | FOLL_PIN));
+
+       if (flags & FOLL_GET)
+               return try_get_page(page);
+       else if (flags & FOLL_PIN) {
+               page = compound_head(page);
+
+               if (WARN_ON_ONCE(page_ref_count(page) <= 0))
+                       return false;
+
+               page_ref_add(page, GUP_PIN_COUNTING_BIAS);
+       }
+
+       return true;
+}

Bisection says 3faa52c03f440d1b ("mm/gup: track FOLL_PIN pages") is the bad commit.

$ git bisect log
# bad: [3faa52c03f440d1b9ddef18c4f189f4790d52d7e] mm/gup: track FOLL_PIN pages
# good: [7111951b8d4973bda27ff663f2cf18b663d15b48] Linux 5.6
# good: [d5226fa6dbae0569ee43ecfc08bdcd6770fc4755] Linux 5.5
# good: [219d54332a09e8d8741c1e1982f5eae56099de85] Linux 5.4
# good: [4d856f72c10ecb060868ed10ff1b1453943fc6c8] Linux 5.3
# good: [0ecfebd2b52404ae0c54a878c872bb93363ada36] Linux 5.2
# good: [e93c9c99a629c61837d5a7fc2120cd2b6c70dbdd] Linux 5.1
# good: [1c163f4c7b3f621efff9b28a47abb36f7378d783] Linux 5.0
# good: [86dfbed49f88fd87ce8a12d2314b1f93266da7a7] mm/gup: pass a flags arg to __gup_device_* functions
# good: [83daf837884cc44c3cc0e4f8a096c5d1461cbcc2] mm/filemap.c: unexport find_get_entry
# good: [cc7b8f6245f0042a232c7f6807dc130d87233164] mm/page-writeback.c: write_cache_pages(): deduplicate identical checks
# good: [29d9f30d4ce6c7a38745a54a8cddface10013490] Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
# good: [4afdb733b1606c6cb86e7833f9335f4870cf7ddd] io-uring: drop completion when removing file
git bisect start '3faa52c03f440d1b9ddef18c4f189f4790d52d7e' 'v5.6' 'v5.5' 'v5.4' 'v5.3' 'v5.2' 'v5.1' 'v5.0' '86dfbed49f88fd87ce8a12d2314b1f93266da7a7' '83daf837884cc44c3cc0e4f8a096c5d1461cbcc2' 'cc7b8f6245f0042a232c7f6807dc130d87233164' '29d9f30d4ce6c7a38745a54a8cddface10013490' '4afdb733b1606c6cb86e7833f9335f4870cf7ddd'
# good: [3b78d8347d31a050bb4f378a5f42cf495d873796] mm/gup: pass gup flags to two more routines
git bisect good 3b78d8347d31a050bb4f378a5f42cf495d873796
# good: [94202f126f698691f8865906ad6a68203e5dde8c] mm/gup: require FOLL_GET for get_user_pages_fast()
git bisect good 94202f126f698691f8865906ad6a68203e5dde8c
# first bad commit: [3faa52c03f440d1b9ddef18c4f189f4790d52d7e] mm/gup: track FOLL_PIN pages
John Hubbard April 9, 2020, 6:38 a.m. UTC | #2
On 4/8/20 11:08 PM, Tetsuo Handa wrote:
> Hello.
> 
> I'm hitting WARN_ON() at try_get_page() from try_grab_page() due to page_ref_count(page) == -1021
> (which is "3 - GUP_PIN_COUNTING_BIAS") if I use the loadable kernel module version of the TOMOYO security
> module. Since I don't see any recent changes in security/tomoyo regarding get_user_pages_remote(),
> I'm wondering what is happening.


Hi Tetsuo,

Yes, commit 3faa52c03f440d1b ("mm/gup: track FOLL_PIN pages") is the one that turns everything on,
so if any problems with the whole FOLL_PIN scheme are to be found, it's natural that git bisect
would point to that commit.

Thanks for all the details here. One of the first questions I normally have is: is anything in the
system even using FOLL_PIN? And the way to answer that is to monitor the two new *foll_pin* entries
in /proc/vmstat, approximately like this:

$ cat /proc/vmstat |grep foll_pin
nr_foll_pin_acquired 0
nr_foll_pin_released 0

If you could do that before, during and after (ideally...or whatever you can get) the problem, I'd
love to see that data.

Also, if you happen to know if anything is calling pin_user_page*() and/or unpin_user_page*(), that
is extra credit. :)

I don't see anything here that jumps out at me, yet. The call stack below is as expected, and the WARN
often picks up problems that happened in some other calling path (for example, something else leaked a
FOLL_PIN page).

quick question below:

> 
> [   10.427414][    T1] ------------[ cut here ]------------
> [   10.427425][    T1] WARNING: CPU: 3 PID: 1 at ./include/linux/mm.h:1009 try_grab_page+0x77/0x80
> [   10.427426][    T1] Modules linked in: caitsith(O) akari(O) xfs libcrc32c crc32c_generic sd_mod ata_generic pata_acpi mptspi scsi_transport_spi vmwgfx mptscsih drm_kms_helper mptbase cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea fb ahci fbdev libahci ttm ata_piix nvme drm drm_panel_orientation_quirks libata nvme_core t10_pi agpgart e1000 i2c_core scsi_mod serio_raw atkbd libps2 i8042 serio unix
> [   10.427451][    T1] CPU: 3 PID: 1 Comm: akari-init Tainted: G           O      5.6.0-05654-g3faa52c03f44 #984
> [   10.427452][    T1] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/29/2019
> [   10.427454][    T1] RIP: 0010:try_grab_page+0x77/0x80
> [   10.427456][    T1] Code: 34 85 c0 7e 25 f0 ff 47 34 b8 01 00 00 00 5d c3 0f 0b eb b1 48 8d 78 ff eb c6 0f 0b 31 c0 5d 0f 1f 40 00 c3 48 8d 78 ff eb d4 <0f> 0b 31 c0 5d c3 0f 1f 00 55 48 89 e5 41 57 49 89 cf 41 56 49 89
> [   10.427457][    T1] RSP: 0018:ffffa859c0013ac0 EFLAGS: 00010286
> [   10.427459][    T1] RAX: 00000000fffffc03 RBX: ffffcb6208c71dc0 RCX: 8000000231c77067
> [   10.427460][    T1] RDX: 8000000231c77067 RSI: 0000000000002016 RDI: ffffcb6208c71dc0
> [   10.427461][    T1] RBP: ffffa859c0013ac0 R08: 0000000000000001 R09: 0000000000000000
> [   10.427462][    T1] R10: 0000000000000001 R11: ffff95d91c615340 R12: 0000000000002016
> [   10.427463][    T1] R13: ffff95d91c615328 R14: 0000000000000ff0 R15: ffff95d91e0c0ff0
> [   10.427465][    T1] FS:  00007fbfc3772740(0000) GS:ffff95d927a00000(0000) knlGS:0000000000000000
> [   10.427479][    T1] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   10.427483][    T1] CR2: 0000000000ac5190 CR3: 000000022090a002 CR4: 00000000003606e0
> [   10.427491][    T1] Call Trace:
> [   10.427494][    T1]  follow_page_pte+0x329/0x443
> [   10.427503][    T1]  follow_p4d_mask+0x756/0x7c1
> [   10.427511][    T1]  follow_page_mask+0x6a/0x70
> [   10.427516][    T1]  __get_user_pages+0x110/0x880
> [   10.427518][    T1]  ? ___slab_alloc.constprop.95+0x929/0x980
> [   10.427532][    T1]  __get_user_pages_remote+0xce/0x230
> [   10.427540][    T1]  get_user_pages_remote+0x27/0x40
> [   10.427545][    T1]  ccs_dump_page+0x6a/0x140 [akari]
> [   10.427552][    T1]  ccs_start_execve+0x28c/0x490 [akari]
> [   10.427555][    T1]  ? ccs_start_execve+0x90/0x490 [akari]
> [   10.427561][    T1]  ? ccs_load_policy+0xee/0x150 [akari]
> [   10.427568][    T1]  ccs_bprm_check_security+0x4e/0x70 [akari]


...say, I can't find the ccs_*() routines in my tree, can you point me to them? Probably not
important but I'm curious now.

thanks,
Tetsuo Handa April 9, 2020, 7:20 a.m. UTC | #3
Hello.

On 2020/04/09 15:38, John Hubbard wrote:
> Also, if you happen to know if anything is calling pin_user_page*() and/or unpin_user_page*(), that
> is extra credit. :)

Ah, I got it. I can see that only nr_foll_pin_released is increasing after I hit WARN_ON(),
and indeed AKARI is calling unpin_user_page() where put_page() should be used.

https://osdn.net/projects/akari/scm/svn/blobs/head/trunk/akari/permission.c line 1480

Sorry for the noise. Thank you.
John Hubbard April 9, 2020, 7:46 a.m. UTC | #4
On 4/9/20 12:20 AM, Tetsuo Handa wrote:
> Hello.
> 
> On 2020/04/09 15:38, John Hubbard wrote:
>> Also, if you happen to know if anything is calling pin_user_page*() and/or unpin_user_page*(), that
>> is extra credit. :)
> 
> Ah, I got it. I can see that only nr_foll_pin_released is increasing after I hit WARN_ON(),
> and indeed AKARI is calling unpin_user_page() where put_page() should be used.
> 
> https://osdn.net/projects/akari/scm/svn/blobs/head/trunk/akari/permission.c line 1480
> 
> Sorry for the noise. Thank you.
> 

Thanks for the quick investigation, it's a relief to hear that. :)

thanks,

Patch

--- a/Documentation/core-api/pin_user_pages.rst~mm-gup-track-foll_pin-pages
+++ a/Documentation/core-api/pin_user_pages.rst
@@ -173,8 +173,8 @@  CASE 4: Pinning for struct page manipula
 -------------------------------------------------
 Here, normal GUP calls are sufficient, so neither flag needs to be set.
 
-page_dma_pinned(): the whole point of pinning
-=============================================
+page_maybe_dma_pinned(): the whole point of pinning
+===================================================
 
 The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
 to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
@@ -186,7 +186,7 @@  and debates (see the References at the e
 here: fill in the details once that's worked out. Meanwhile, it's safe to say
 that having this available: ::
 
-        static inline bool page_dma_pinned(struct page *page)
+        static inline bool page_maybe_dma_pinned(struct page *page)
 
 ...is a prerequisite to solving the long-running gup+DMA problem.
 
--- a/include/linux/mm.h~mm-gup-track-foll_pin-pages
+++ a/include/linux/mm.h
@@ -1001,6 +1001,8 @@  static inline void get_page(struct page
 	page_ref_inc(page);
 }
 
+bool __must_check try_grab_page(struct page *page, unsigned int flags);
+
 static inline __must_check bool try_get_page(struct page *page)
 {
 	page = compound_head(page);
@@ -1029,29 +1031,79 @@  static inline void put_page(struct page
 		__put_page(page);
 }
 
-/**
- * unpin_user_page() - release a gup-pinned page
- * @page:            pointer to page to be released
+/*
+ * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
+ * the page's refcount so that two separate items are tracked: the original page
+ * reference count, and also a new count of how many pin_user_pages() calls were
+ * made against the page. ("gup-pinned" is another term for the latter).
+ *
+ * With this scheme, pin_user_pages() becomes special: such pages are marked as
+ * distinct from normal pages. As such, the unpin_user_page() call (and its
+ * variants) must be used in order to release gup-pinned pages.
+ *
+ * Choice of value:
+ *
+ * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
+ * counts with respect to pin_user_pages() and unpin_user_page() becomes
+ * simpler, due to the fact that adding an even power of two to the page
+ * refcount has the effect of using only the upper N bits, for the code that
+ * counts up using the bias value. This means that the lower bits are left for
+ * the exclusive use of the original code that increments and decrements by one
+ * (or at least, by much smaller values than the bias value).
+ *
+ * Of course, once the lower bits overflow into the upper bits (and this is
+ * OK, because subtraction recovers the original values), then visual inspection
+ * no longer suffices to directly view the separate counts. However, for normal
+ * applications that don't have huge page reference counts, this won't be an
+ * issue.
  *
- * Pages that were pinned via pin_user_pages*() must be released via either
- * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
- * that eventually such pages can be separately tracked and uniquely handled. In
- * particular, interactions with RDMA and filesystems need special handling.
- *
- * unpin_user_page() and put_page() are not interchangeable, despite this early
- * implementation that makes them look the same. unpin_user_page() calls must
- * be perfectly matched up with pin*() calls.
+ * Locking: the lockless algorithm described in page_cache_get_speculative()
+ * and page_cache_gup_pin_speculative() provides safe operation for
+ * get_user_pages and page_mkclean and other calls that race to set up page
+ * table entries.
  */
-static inline void unpin_user_page(struct page *page)
-{
-	put_page(page);
-}
+#define GUP_PIN_COUNTING_BIAS (1U << 10)
 
+void unpin_user_page(struct page *page);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 				 bool make_dirty);
-
 void unpin_user_pages(struct page **pages, unsigned long npages);
 
+/**
+ * page_maybe_dma_pinned() - report if a page is pinned for DMA.
+ *
+ * This function checks if a page has been pinned via a call to
+ * pin_user_pages*().
+ *
+ * For non-huge pages, the return value is partially fuzzy: false is not fuzzy,
+ * because it means "definitely not pinned for DMA", but true means "probably
+ * pinned for DMA, but possibly a false positive due to having at least
+ * GUP_PIN_COUNTING_BIAS worth of normal page references".
+ *
+ * False positives are OK, because: a) it's unlikely for a page to get that many
+ * refcounts, and b) all the callers of this routine are expected to be able to
+ * deal gracefully with a false positive.
+ *
+ * For more information, please see Documentation/vm/pin_user_pages.rst.
+ *
+ * @page:	pointer to page to be queried.
+ * @Return:	True, if it is likely that the page has been "dma-pinned".
+ *		False, if the page is definitely not dma-pinned.
+ */
+static inline bool page_maybe_dma_pinned(struct page *page)
+{
+	/*
+	 * page_ref_count() is signed. If that refcount overflows, then
+	 * page_ref_count() returns a negative value, and callers will avoid
+	 * further incrementing the refcount.
+	 *
+	 * Here, for that overflow case, use the signed bit to count a little
+	 * bit higher via unsigned math, and thus still get an accurate result.
+	 */
+	return ((unsigned int)page_ref_count(compound_head(page))) >=
+		GUP_PIN_COUNTING_BIAS;
+}
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
--- a/mm/gup.c~mm-gup-track-foll_pin-pages
+++ a/mm/gup.c
@@ -44,6 +44,135 @@  static inline struct page *try_get_compo
 	return head;
 }
 
+/*
+ * try_grab_compound_head() - attempt to elevate a page's refcount, by a
+ * flags-dependent amount.
+ *
+ * "grab" names in this file mean, "look at flags to decide whether to use
+ * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount.
+ *
+ * Either FOLL_PIN or FOLL_GET (or neither) must be set, but not both at the
+ * same time. (That's true throughout the get_user_pages*() and
+ * pin_user_pages*() APIs.) Cases:
+ *
+ *    FOLL_GET: page's refcount will be incremented by 1.
+ *    FOLL_PIN: page's refcount will be incremented by GUP_PIN_COUNTING_BIAS.
+ *
+ * Return: head page (with refcount appropriately incremented) for success, or
+ * NULL upon failure. If neither FOLL_GET nor FOLL_PIN was set, that's
+ * considered failure, and furthermore, a likely bug in the caller, so a warning
+ * is also emitted.
+ */
+static __maybe_unused struct page *try_grab_compound_head(struct page *page,
+							  int refs,
+							  unsigned int flags)
+{
+	if (flags & FOLL_GET)
+		return try_get_compound_head(page, refs);
+	else if (flags & FOLL_PIN) {
+		refs *= GUP_PIN_COUNTING_BIAS;
+		return try_get_compound_head(page, refs);
+	}
+
+	WARN_ON_ONCE(1);
+	return NULL;
+}
+
+/**
+ * try_grab_page() - elevate a page's refcount by a flag-dependent amount
+ *
+ * This might not do anything at all, depending on the flags argument.
+ *
+ * "grab" names in this file mean, "look at flags to decide whether to use
+ * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount.
+ *
+ * @page:    pointer to page to be grabbed
+ * @flags:   gup flags: these are the FOLL_* flag values.
+ *
+ * Either FOLL_PIN or FOLL_GET (or neither) may be set, but not both at the same
+ * time. Cases:
+ *
+ *    FOLL_GET: page's refcount will be incremented by 1.
+ *    FOLL_PIN: page's refcount will be incremented by GUP_PIN_COUNTING_BIAS.
+ *
+ * Return: true for success, or if no action was required (if neither FOLL_PIN
+ * nor FOLL_GET was set, nothing is done). False for failure: FOLL_GET or
+ * FOLL_PIN was set, but the page could not be grabbed.
+ */
+bool __must_check try_grab_page(struct page *page, unsigned int flags)
+{
+	WARN_ON_ONCE((flags & (FOLL_GET | FOLL_PIN)) == (FOLL_GET | FOLL_PIN));
+
+	if (flags & FOLL_GET)
+		return try_get_page(page);
+	else if (flags & FOLL_PIN) {
+		page = compound_head(page);
+
+		if (WARN_ON_ONCE(page_ref_count(page) <= 0))
+			return false;
+
+		page_ref_add(page, GUP_PIN_COUNTING_BIAS);
+	}
+
+	return true;
+}
+
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+static bool __unpin_devmap_managed_user_page(struct page *page)
+{
+	int count;
+
+	if (!page_is_devmap_managed(page))
+		return false;
+
+	count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS);
+
+	/*
+	 * devmap page refcounts are 1-based, rather than 0-based: if
+	 * refcount is 1, then the page is free and the refcount is
+	 * stable because nobody holds a reference on the page.
+	 */
+	if (count == 1)
+		free_devmap_managed_page(page);
+	else if (!count)
+		__put_page(page);
+
+	return true;
+}
+#else
+static bool __unpin_devmap_managed_user_page(struct page *page)
+{
+	return false;
+}
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
+
+/**
+ * unpin_user_page() - release a dma-pinned page
+ * @page:            pointer to page to be released
+ *
+ * Pages that were pinned via pin_user_pages*() must be released via either
+ * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
+ * that such pages can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special handling.
+ */
+void unpin_user_page(struct page *page)
+{
+	page = compound_head(page);
+
+	/*
+	 * For devmap managed pages we need to catch refcount transition from
+	 * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
+	 * page is free and we need to inform the device driver through
+	 * callback. See include/linux/memremap.h and HMM for details.
+	 */
+	if (__unpin_devmap_managed_user_page(page))
+		return;
+
+	if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS))
+		__put_page(page);
+}
+EXPORT_SYMBOL(unpin_user_page);
+
 /**
  * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.
@@ -230,10 +359,11 @@  retry:
 	}
 
 	page = vm_normal_page(vma, address, pte);
-	if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
+	if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
 		/*
-		 * Only return device mapping pages in the FOLL_GET case since
-		 * they are only valid while holding the pgmap reference.
+		 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
+		 * case since they are only valid while holding the pgmap
+		 * reference.
 		 */
 		*pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
 		if (*pgmap)
@@ -271,11 +401,10 @@  retry:
 		goto retry;
 	}
 
-	if (flags & FOLL_GET) {
-		if (unlikely(!try_get_page(page))) {
-			page = ERR_PTR(-ENOMEM);
-			goto out;
-		}
+	/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
+	if (unlikely(!try_grab_page(page, flags))) {
+		page = ERR_PTR(-ENOMEM);
+		goto out;
 	}
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
@@ -537,7 +666,7 @@  static struct page *follow_page_mask(str
 	/* make this handle hugepd */
 	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
 	if (!IS_ERR(page)) {
-		BUG_ON(flags & FOLL_GET);
+		WARN_ON_ONCE(flags & (FOLL_GET | FOLL_PIN));
 		return page;
 	}
 
@@ -1675,6 +1804,15 @@  long get_user_pages_remote(struct task_s
 {
 	return 0;
 }
+
+static long __get_user_pages_remote(struct task_struct *tsk,
+				    struct mm_struct *mm,
+				    unsigned long start, unsigned long nr_pages,
+				    unsigned int gup_flags, struct page **pages,
+				    struct vm_area_struct **vmas, int *locked)
+{
+	return 0;
+}
 #endif /* !CONFIG_MMU */
 
 /*
@@ -1814,7 +1952,24 @@  EXPORT_SYMBOL(get_user_pages_unlocked);
  * This code is based heavily on the PowerPC implementation by Nick Piggin.
  */
 #ifdef CONFIG_HAVE_FAST_GUP
+
+static void put_compound_head(struct page *page, int refs, unsigned int flags)
+{
+	if (flags & FOLL_PIN)
+		refs *= GUP_PIN_COUNTING_BIAS;
+
+	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
+	/*
+	 * Calling put_page() for each ref is unnecessarily slow. Only the last
+	 * ref needs a put_page().
+	 */
+	if (refs > 1)
+		page_ref_sub(page, refs - 1);
+	put_page(page);
+}
+
 #ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
+
 /*
  * WARNING: only to be used in the get_user_pages_fast() implementation.
  *
@@ -1877,7 +2032,10 @@  static void __maybe_unused undo_dev_page
 		struct page *page = pages[--(*nr)];
 
 		ClearPageReferenced(page);
-		put_page(page);
+		if (flags & FOLL_PIN)
+			unpin_user_page(page);
+		else
+			put_page(page);
 	}
 }
 
@@ -1919,12 +2077,12 @@  static int gup_pte_range(pmd_t pmd, unsi
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
-		head = try_get_compound_head(page, 1);
+		head = try_grab_compound_head(page, 1, flags);
 		if (!head)
 			goto pte_unmap;
 
 		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-			put_page(head);
+			put_compound_head(head, 1, flags);
 			goto pte_unmap;
 		}
 
@@ -1980,7 +2138,10 @@  static int __gup_device_huge(unsigned lo
 		}
 		SetPageReferenced(page);
 		pages[*nr] = page;
-		get_page(page);
+		if (unlikely(!try_grab_page(page, flags))) {
+			undo_dev_pagemap(nr, nr_start, flags, pages);
+			return 0;
+		}
 		(*nr)++;
 		pfn++;
 	} while (addr += PAGE_SIZE, addr != end);
@@ -2054,18 +2215,6 @@  static int record_subpages(struct page *
 	return nr;
 }
 
-static void put_compound_head(struct page *page, int refs, unsigned int flags)
-{
-	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
-	/*
-	 * Calling put_page() for each ref is unnecessarily slow. Only the last
-	 * ref needs a put_page().
-	 */
-	if (refs > 1)
-		page_ref_sub(page, refs - 1);
-	put_page(page);
-}
-
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
 				      unsigned long sz)
@@ -2099,7 +2248,7 @@  static int gup_hugepte(pte_t *ptep, unsi
 	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 
-	head = try_get_compound_head(head, refs);
+	head = try_grab_compound_head(head, refs, flags);
 	if (!head)
 		return 0;
 
@@ -2159,7 +2308,7 @@  static int gup_huge_pmd(pmd_t orig, pmd_
 	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 
-	head = try_get_compound_head(pmd_page(orig), refs);
+	head = try_grab_compound_head(pmd_page(orig), refs, flags);
 	if (!head)
 		return 0;
 
@@ -2193,7 +2342,7 @@  static int gup_huge_pud(pud_t orig, pud_
 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 
-	head = try_get_compound_head(pud_page(orig), refs);
+	head = try_grab_compound_head(pud_page(orig), refs, flags);
 	if (!head)
 		return 0;
 
@@ -2222,7 +2371,7 @@  static int gup_huge_pgd(pgd_t orig, pgd_
 	page = pgd_page(orig) + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 
-	head = try_get_compound_head(pgd_page(orig), refs);
+	head = try_grab_compound_head(pgd_page(orig), refs, flags);
 	if (!head)
 		return 0;
 
@@ -2505,11 +2654,11 @@  static int internal_get_user_pages_fast(
 
 /**
  * get_user_pages_fast() - pin user pages in memory
- * @start:	starting user address
- * @nr_pages:	number of pages from start to pin
- * @gup_flags:	flags modifying pin behaviour
- * @pages:	array that receives pointers to the pages pinned.
- *		Should be at least nr_pages long.
+ * @start:      starting user address
+ * @nr_pages:   number of pages from start to pin
+ * @gup_flags:  flags modifying pin behaviour
+ * @pages:      array that receives pointers to the pages pinned.
+ *              Should be at least nr_pages long.
  *
  * Attempt to pin user pages in memory without taking mm->mmap_sem.
  * If not successful, it will fall back to taking the lock and
@@ -2543,9 +2692,18 @@  EXPORT_SYMBOL_GPL(get_user_pages_fast);
 /**
  * pin_user_pages_fast() - pin user pages in memory without taking locks
  *
- * For now, this is a placeholder function, until various call sites are
- * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
- * this is identical to get_user_pages_fast().
+ * @start:      starting user address
+ * @nr_pages:   number of pages from start to pin
+ * @gup_flags:  flags modifying pin behaviour
+ * @pages:      array that receives pointers to the pages pinned.
+ *              Should be at least nr_pages long.
+ *
+ * Nearly the same as get_user_pages_fast(), except that FOLL_PIN is set. See
+ * get_user_pages_fast() for documentation on the function arguments, because
+ * the arguments here are identical.
+ *
+ * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for further details.
  *
  * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
  * is NOT intended for Case 2 (RDMA: long-term pins).
@@ -2553,21 +2711,39 @@  EXPORT_SYMBOL_GPL(get_user_pages_fast);
 int pin_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages)
 {
-	/*
-	 * This is a placeholder, until the pin functionality is activated.
-	 * Until then, just behave like the corresponding get_user_pages*()
-	 * routine.
-	 */
-	return get_user_pages_fast(start, nr_pages, gup_flags, pages);
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_PIN;
+	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
 }
 EXPORT_SYMBOL_GPL(pin_user_pages_fast);
 
 /**
  * pin_user_pages_remote() - pin pages of a remote process (task != current)
  *
- * For now, this is a placeholder function, until various call sites are
- * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
- * this is identical to get_user_pages_remote().
+ * @tsk:	the task_struct to use for page fault accounting, or
+ *		NULL if faults are not to be recorded.
+ * @mm:		mm_struct of target mm
+ * @start:	starting user address
+ * @nr_pages:	number of pages from start to pin
+ * @gup_flags:	flags modifying lookup behaviour
+ * @pages:	array that receives pointers to the pages pinned.
+ *		Should be at least nr_pages long. Or NULL, if caller
+ *		only intends to ensure the pages are faulted in.
+ * @vmas:	array of pointers to vmas corresponding to each page.
+ *		Or NULL if the caller does not require them.
+ * @locked:	pointer to lock flag indicating whether lock is held and
+ *		subsequently whether VM_FAULT_RETRY functionality can be
+ *		utilised. Lock must initially be held.
+ *
+ * Nearly the same as get_user_pages_remote(), except that FOLL_PIN is set. See
+ * get_user_pages_remote() for documentation on the function arguments, because
+ * the arguments here are identical.
+ *
+ * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for details.
  *
  * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
  * is NOT intended for Case 2 (RDMA: long-term pins).
@@ -2577,22 +2753,33 @@  long pin_user_pages_remote(struct task_s
 			   unsigned int gup_flags, struct page **pages,
 			   struct vm_area_struct **vmas, int *locked)
 {
-	/*
-	 * This is a placeholder, until the pin functionality is activated.
-	 * Until then, just behave like the corresponding get_user_pages*()
-	 * routine.
-	 */
-	return get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags, pages,
-				     vmas, locked);
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_PIN;
+	return __get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags,
+				       pages, vmas, locked);
 }
 EXPORT_SYMBOL(pin_user_pages_remote);
 
 /**
  * pin_user_pages() - pin user pages in memory for use by other devices
  *
- * For now, this is a placeholder function, until various call sites are
- * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
- * this is identical to get_user_pages().
+ * @start:	starting user address
+ * @nr_pages:	number of pages from start to pin
+ * @gup_flags:	flags modifying lookup behaviour
+ * @pages:	array that receives pointers to the pages pinned.
+ *		Should be at least nr_pages long. Or NULL, if caller
+ *		only intends to ensure the pages are faulted in.
+ * @vmas:	array of pointers to vmas corresponding to each page.
+ *		Or NULL if the caller does not require them.
+ *
+ * Nearly the same as get_user_pages(), except that FOLL_TOUCH is not set, and
+ * FOLL_PIN is set.
+ *
+ * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for details.
  *
  * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
  * is NOT intended for Case 2 (RDMA: long-term pins).
@@ -2601,11 +2788,12 @@  long pin_user_pages(unsigned long start,
 		    unsigned int gup_flags, struct page **pages,
 		    struct vm_area_struct **vmas)
 {
-	/*
-	 * This is a placeholder, until the pin functionality is activated.
-	 * Until then, just behave like the corresponding get_user_pages*()
-	 * routine.
-	 */
-	return get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_PIN;
+	return __gup_longterm_locked(current, current->mm, start, nr_pages,
+				     pages, vmas, gup_flags);
 }
 EXPORT_SYMBOL(pin_user_pages);
--- a/mm/huge_memory.c~mm-gup-track-foll_pin-pages
+++ a/mm/huge_memory.c
@@ -958,6 +958,11 @@  struct page *follow_devmap_pmd(struct vm
 	 */
 	WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set");
 
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return NULL;
+
 	if (flags & FOLL_WRITE && !pmd_write(*pmd))
 		return NULL;
 
@@ -973,7 +978,7 @@  struct page *follow_devmap_pmd(struct vm
 	 * device mapped pages can only be returned if the
 	 * caller will manage the page reference count.
 	 */
-	if (!(flags & FOLL_GET))
+	if (!(flags & (FOLL_GET | FOLL_PIN)))
 		return ERR_PTR(-EEXIST);
 
 	pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT;
@@ -981,7 +986,8 @@  struct page *follow_devmap_pmd(struct vm
 	if (!*pgmap)
 		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
-	get_page(page);
+	if (!try_grab_page(page, flags))
+		page = ERR_PTR(-ENOMEM);
 
 	return page;
 }
@@ -1101,6 +1107,11 @@  struct page *follow_devmap_pud(struct vm
 	if (flags & FOLL_WRITE && !pud_write(*pud))
 		return NULL;
 
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return NULL;
+
 	if (pud_present(*pud) && pud_devmap(*pud))
 		/* pass */;
 	else
@@ -1112,8 +1123,10 @@  struct page *follow_devmap_pud(struct vm
 	/*
 	 * device mapped pages can only be returned if the
 	 * caller will manage the page reference count.
+	 *
+	 * At least one of FOLL_GET | FOLL_PIN must be set, so assert that here:
 	 */
-	if (!(flags & FOLL_GET))
+	if (!(flags & (FOLL_GET | FOLL_PIN)))
 		return ERR_PTR(-EEXIST);
 
 	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
@@ -1121,7 +1134,8 @@  struct page *follow_devmap_pud(struct vm
 	if (!*pgmap)
 		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
-	get_page(page);
+	if (!try_grab_page(page, flags))
+		page = ERR_PTR(-ENOMEM);
 
 	return page;
 }
@@ -1497,8 +1511,13 @@  struct page *follow_trans_huge_pmd(struc
 
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+
+	if (!try_grab_page(page, flags))
+		return ERR_PTR(-ENOMEM);
+
 	if (flags & FOLL_TOUCH)
 		touch_pmd(vma, addr, pmd, flags);
+
 	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
 		/*
 		 * We don't mlock() pte-mapped THPs. This way we can avoid
@@ -1535,8 +1554,6 @@  struct page *follow_trans_huge_pmd(struc
 skip_mlock:
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
-	if (flags & FOLL_GET)
-		get_page(page);
 
 out:
 	return page;
--- a/mm/hugetlb.c~mm-gup-track-foll_pin-pages
+++ a/mm/hugetlb.c
@@ -4376,19 +4376,6 @@  long follow_hugetlb_page(struct mm_struc
 		page = pte_page(huge_ptep_get(pte));
 
 		/*
-		 * Instead of doing 'try_get_page()' below in the same_page
-		 * loop, just check the count once here.
-		 */
-		if (unlikely(page_count(page) <= 0)) {
-			if (pages) {
-				spin_unlock(ptl);
-				remainder = 0;
-				err = -ENOMEM;
-				break;
-			}
-		}
-
-		/*
 		 * If subpage information not requested, update counters
 		 * and skip the same_page loop below.
 		 */
@@ -4405,7 +4392,22 @@  long follow_hugetlb_page(struct mm_struc
 same_page:
 		if (pages) {
 			pages[i] = mem_map_offset(page, pfn_offset);
-			get_page(pages[i]);
+			/*
+			 * try_grab_page() should always succeed here, because:
+			 * a) we hold the ptl lock, and b) we've just checked
+			 * that the huge page is present in the page tables. If
+			 * the huge page is present, then the tail pages must
+			 * also be present. The ptl prevents the head page and
+			 * tail pages from being rearranged in any way. So this
+			 * page must be available at this point, unless the page
+			 * refcount overflowed:
+			 */
+			if (WARN_ON_ONCE(!try_grab_page(pages[i], flags))) {
+				spin_unlock(ptl);
+				remainder = 0;
+				err = -ENOMEM;
+				break;
+			}
 		}
 
 		if (vmas)
@@ -4965,6 +4967,12 @@  follow_huge_pmd(struct mm_struct *mm, un
 	struct page *page = NULL;
 	spinlock_t *ptl;
 	pte_t pte;
+
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return NULL;
+
 retry:
 	ptl = pmd_lockptr(mm, pmd);
 	spin_lock(ptl);
@@ -4977,8 +4985,18 @@  retry:
 	pte = huge_ptep_get((pte_t *)pmd);
 	if (pte_present(pte)) {
 		page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
-		if (flags & FOLL_GET)
-			get_page(page);
+		/*
+		 * try_grab_page() should always succeed here, because: a) we
+		 * hold the pmd (ptl) lock, and b) we've just checked that the
+		 * huge pmd (head) page is present in the page tables. The ptl
+		 * prevents the head page and tail pages from being rearranged
+		 * in any way. So this page must be available at this point,
+		 * unless the page refcount overflowed:
+		 */
+		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
+			page = NULL;
+			goto out;
+		}
 	} else {
 		if (is_hugetlb_entry_migration(pte)) {
 			spin_unlock(ptl);
@@ -4999,7 +5017,7 @@  struct page * __weak
 follow_huge_pud(struct mm_struct *mm, unsigned long address,
 		pud_t *pud, int flags)
 {
-	if (flags & FOLL_GET)
+	if (flags & (FOLL_GET | FOLL_PIN))
 		return NULL;
 
 	return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
@@ -5008,7 +5026,7 @@  follow_huge_pud(struct mm_struct *mm, un
 struct page * __weak
 follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int flags)
 {
-	if (flags & FOLL_GET)
+	if (flags & (FOLL_GET | FOLL_PIN))
 		return NULL;
 
 	return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT);