mbox series

[v2,00/17] mm: fixes for device-exclusive entries (hmm)

Message ID 20250210193801.781278-1-david@redhat.com (mailing list archive)
Headers show
Series mm: fixes for device-exclusive entries (hmm) | expand

Message

David Hildenbrand Feb. 10, 2025, 7:37 p.m. UTC
Against mm-hotfixes-stable for now.

Discussing the PageTail() call in make_device_exclusive_range() with
Willy, I recently discovered [1] that device-exclusive handling does
not properly work with THP, making the hmm-tests selftests fail if THPs
are enabled on the system.

Looking into more details, I found that hugetlb is not properly fenced,
and I realized that something that was bugging me for longer -- how
device-exclusive entries interact with mapcounts -- completely breaks
migration/swapout/split/hwpoison handling of these folios while they have
device-exclusive PTEs.

The program below can be used to allocate 1 GiB worth of pages and
making them device-exclusive on a kernel with CONFIG_TEST_HMM.

Once they are device-exclusive, these folios cannot get swapped out
(proc$pid/smaps_rollup will always indicate 1 GiB RSS no matter how
much one forces memory reclaim), and when having a memory block onlined
to ZONE_MOVABLE, trying to offline it will loop forever and complain about
failed migration of a page that should be movable.

# echo offline > /sys/devices/system/memory/memory136/state
# echo online_movable > /sys/devices/system/memory/memory136/state
# ./hmm-swap &
... wait until everything is device-exclusive
# echo offline > /sys/devices/system/memory/memory136/state
[  285.193431][T14882] page: refcount:2 mapcount:0 mapping:0000000000000000
  index:0x7f20671f7 pfn:0x442b6a
[  285.196618][T14882] memcg:ffff888179298000
[  285.198085][T14882] anon flags: 0x5fff0000002091c(referenced|uptodate|
  dirty|active|owner_2|swapbacked|node=1|zone=3|lastcpupid=0x7ff)
[  285.201734][T14882] raw: ...
[  285.204464][T14882] raw: ...
[  285.207196][T14882] page dumped because: migration failure
[  285.209072][T14882] page_owner tracks the page as allocated
[  285.210915][T14882] page last allocated via order 0, migratetype
  Movable, gfp_mask 0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO),
  id 14926, tgid 14926 (hmm-swap), ts 254506295376, free_ts 227402023774
[  285.216765][T14882]  post_alloc_hook+0x197/0x1b0
[  285.218874][T14882]  get_page_from_freelist+0x76e/0x3280
[  285.220864][T14882]  __alloc_frozen_pages_noprof+0x38e/0x2740
[  285.223302][T14882]  alloc_pages_mpol+0x1fc/0x540
[  285.225130][T14882]  folio_alloc_mpol_noprof+0x36/0x340
[  285.227222][T14882]  vma_alloc_folio_noprof+0xee/0x1a0
[  285.229074][T14882]  __handle_mm_fault+0x2b38/0x56a0
[  285.230822][T14882]  handle_mm_fault+0x368/0x9f0
...

This series fixes all issues I found so far. There is no easy way to fix
without a bigger rework/cleanup. I have a bunch of cleanups on top (some
previous sent, some the result of the discussion in v1) that I will send
out separately once this landed and I get to it.

I wish we could just use some special present PROT_NONE PTEs instead of
these (non-present, non-none) fake-swap entries; but that just results in
the same problem we keep having (lack of spare PTE bits), and staring at
other similar fake-swap entries, that ship has sailed.

With this series, make_device_exclusive() doesn't actually belong into
mm/rmap.c anymore, but I'll leave moving that for another day.

I only tested this series with the hmm-tests selftests due to lack of HW,
so I'd appreciate some testing, especially if the interaction between
two GPUs wanting a device-exclusive entry works as expected.

<program>
#include <stdio.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/ioctl.h>

#define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd)

struct hmm_dmirror_cmd {
	__u64 addr;
	__u64 ptr;
	__u64 npages;
	__u64 cpages;
	__u64 faults;
};

const size_t size = 1 * 1024 * 1024 * 1024ul;
const size_t chunk_size = 2 * 1024 * 1024ul;

int main(void)
{
	struct hmm_dmirror_cmd cmd;
	size_t cur_size;
	int fd, ret;
	char *addr, *mirror;

	fd = open("/dev/hmm_dmirror1", O_RDWR, 0);
	if (fd < 0) {
		perror("open failed\n");
		exit(1);
	}

	addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED) {
		perror("mmap failed\n");
		exit(1);
	}
	madvise(addr, size, MADV_NOHUGEPAGE);
	memset(addr, 1, size);

	mirror = malloc(chunk_size);

	for (cur_size = 0; cur_size < size; cur_size += chunk_size) {
		cmd.addr = (uintptr_t)addr + cur_size;
		cmd.ptr = (uintptr_t)mirror;
		cmd.npages = chunk_size / getpagesize();
		ret = ioctl(fd, HMM_DMIRROR_EXCLUSIVE, &cmd);
		if (ret) {
			perror("ioctl failed\n");
			exit(1);
		}
	}
	pause();
	return 0;
}
</program>

[1] https://lkml.kernel.org/r/25e02685-4f1d-47fa-be5b-01ff85bb0ce2@redhat.com

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Alex Shi <alexs@kernel.org>
Cc: Yanteng Si <si.yanteng@linux.dev>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>

v1 -> v2:
 * "mm/rmap: convert make_device_exclusive_range() to make_device_exclusive()"
  -> Fix and simplify return value handling when calling dmirror_atomic_map()
  -> Fix parameter order when calling make_device_exclusive()
  [both things were fixed by the separate cleanups I previously sent, realized
   it when re-testing the fixes here only]
  -> Heavily extend documentation of make_device_exclusive()
 * "mm/rmap: implement make_device_exclusive() using folio_walk instead of
    rmap walk"
  -> Keep MMU_NOTIFY_EXCLUSIVE, and update comments/description
 * "mm/rmap: handle device-exclusive entries correctly in try_to_migrate_one()"
  -> Handle PageHWPoison with device-private pages differently
 * Added a bunch of "handle device-exclusive entries correctly" fixes,
   now handling all page_vma_mapped_walk() callers correctly
 * Added "mm/rmap: avoid -EBUSY from make_device_exclusive()" to fix some
   hmm selftest failures I saw while testing under memory pressure
 * Plenty of comment/description updates and improvements

David Hildenbrand (17):
  mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs
  mm/rmap: reject hugetlb folios in folio_make_device_exclusive()
  mm/rmap: convert make_device_exclusive_range() to
    make_device_exclusive()
  mm/rmap: implement make_device_exclusive() using folio_walk instead of
    rmap walk
  mm/memory: detect writability in restore_exclusive_pte() through
    can_change_pte_writable()
  mm: use single SWP_DEVICE_EXCLUSIVE entry type
  mm/page_vma_mapped: device-exclusive entries are not migration entries
  kernel/events/uprobes: handle device-exclusive entries correctly in
    __replace_page()
  mm/ksm: handle device-exclusive entries correctly in
    write_protect_page()
  mm/rmap: handle device-exclusive entries correctly in
    try_to_unmap_one()
  mm/rmap: handle device-exclusive entries correctly in
    try_to_migrate_one()
  mm/rmap: handle device-exclusive entries correctly in
    page_vma_mkclean_one()
  mm/page_idle: handle device-exclusive entries correctly in
    page_idle_clear_pte_refs_one()
  mm/damon: handle device-exclusive entries correctly in
    damon_folio_young_one()
  mm/damon: handle device-exclusive entries correctly in
    damon_folio_mkold_one()
  mm/rmap: keep mapcount untouched for device-exclusive entries
  mm/rmap: avoid -EBUSY from make_device_exclusive()

 Documentation/mm/hmm.rst                    |   2 +-
 Documentation/translations/zh_CN/mm/hmm.rst |   2 +-
 drivers/gpu/drm/nouveau/nouveau_svm.c       |   5 +-
 include/linux/mmu_notifier.h                |   2 +-
 include/linux/rmap.h                        |   5 +-
 include/linux/swap.h                        |   7 +-
 include/linux/swapops.h                     |  27 +-
 kernel/events/uprobes.c                     |  13 +-
 lib/test_hmm.c                              |  41 +-
 mm/damon/ops-common.c                       |  23 +-
 mm/damon/paddr.c                            |  10 +-
 mm/gup.c                                    |   3 +
 mm/ksm.c                                    |   9 +-
 mm/memory.c                                 |  28 +-
 mm/mprotect.c                               |   8 -
 mm/page_idle.c                              |   9 +-
 mm/page_table_check.c                       |   5 +-
 mm/page_vma_mapped.c                        |   3 +-
 mm/rmap.c                                   | 469 +++++++++-----------
 19 files changed, 315 insertions(+), 356 deletions(-)


base-commit: e5b2a356dc8a88708d97bd47cca3b8f7ed7af6cb

Comments

Andrew Morton Feb. 10, 2025, 11:05 p.m. UTC | #1
On Mon, 10 Feb 2025 20:37:42 +0100 David Hildenbrand <david@redhat.com> wrote:

> Against mm-hotfixes-stable for now.
> 
> Discussing the PageTail() call in make_device_exclusive_range() with
> Willy, I recently discovered [1] that device-exclusive handling does
> not properly work with THP, making the hmm-tests selftests fail if THPs
> are enabled on the system.
> 
> Looking into more details, I found that hugetlb is not properly fenced,
> and I realized that something that was bugging me for longer -- how
> device-exclusive entries interact with mapcounts -- completely breaks
> migration/swapout/split/hwpoison handling of these folios while they have
> device-exclusive PTEs.
> 
> The program below can be used to allocate 1 GiB worth of pages and
> making them device-exclusive on a kernel with CONFIG_TEST_HMM.
> 
> Once they are device-exclusive, these folios cannot get swapped out
> (proc$pid/smaps_rollup will always indicate 1 GiB RSS no matter how
> much one forces memory reclaim), and when having a memory block onlined
> to ZONE_MOVABLE, trying to offline it will loop forever and complain about
> failed migration of a page that should be movable.
> 
> # echo offline > /sys/devices/system/memory/memory136/state
> # echo online_movable > /sys/devices/system/memory/memory136/state
> # ./hmm-swap &
> ... wait until everything is device-exclusive
> # echo offline > /sys/devices/system/memory/memory136/state
> [  285.193431][T14882] page: refcount:2 mapcount:0 mapping:0000000000000000
>   index:0x7f20671f7 pfn:0x442b6a
> [  285.196618][T14882] memcg:ffff888179298000
> [  285.198085][T14882] anon flags: 0x5fff0000002091c(referenced|uptodate|
>   dirty|active|owner_2|swapbacked|node=1|zone=3|lastcpupid=0x7ff)
> [  285.201734][T14882] raw: ...
> [  285.204464][T14882] raw: ...
> [  285.207196][T14882] page dumped because: migration failure
> [  285.209072][T14882] page_owner tracks the page as allocated
> [  285.210915][T14882] page last allocated via order 0, migratetype
>   Movable, gfp_mask 0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO),
>   id 14926, tgid 14926 (hmm-swap), ts 254506295376, free_ts 227402023774
> [  285.216765][T14882]  post_alloc_hook+0x197/0x1b0
> [  285.218874][T14882]  get_page_from_freelist+0x76e/0x3280
> [  285.220864][T14882]  __alloc_frozen_pages_noprof+0x38e/0x2740
> [  285.223302][T14882]  alloc_pages_mpol+0x1fc/0x540
> [  285.225130][T14882]  folio_alloc_mpol_noprof+0x36/0x340
> [  285.227222][T14882]  vma_alloc_folio_noprof+0xee/0x1a0
> [  285.229074][T14882]  __handle_mm_fault+0x2b38/0x56a0
> [  285.230822][T14882]  handle_mm_fault+0x368/0x9f0
> ...
> 
> This series fixes all issues I found so far.

Cool.

Barry, could you please redo your series "mm: batched unmap lazyfree
large folios during reclamation" on top of this (on top of mm-unstable,
ideally).
Barry Song Feb. 10, 2025, 11:39 p.m. UTC | #2
On Tue, Feb 11, 2025 at 12:05 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Mon, 10 Feb 2025 20:37:42 +0100 David Hildenbrand <david@redhat.com> wrote:
>
> > Against mm-hotfixes-stable for now.
> >
> > Discussing the PageTail() call in make_device_exclusive_range() with
> > Willy, I recently discovered [1] that device-exclusive handling does
> > not properly work with THP, making the hmm-tests selftests fail if THPs
> > are enabled on the system.
> >
> > Looking into more details, I found that hugetlb is not properly fenced,
> > and I realized that something that was bugging me for longer -- how
> > device-exclusive entries interact with mapcounts -- completely breaks
> > migration/swapout/split/hwpoison handling of these folios while they have
> > device-exclusive PTEs.
> >
> > The program below can be used to allocate 1 GiB worth of pages and
> > making them device-exclusive on a kernel with CONFIG_TEST_HMM.
> >
> > Once they are device-exclusive, these folios cannot get swapped out
> > (proc$pid/smaps_rollup will always indicate 1 GiB RSS no matter how
> > much one forces memory reclaim), and when having a memory block onlined
> > to ZONE_MOVABLE, trying to offline it will loop forever and complain about
> > failed migration of a page that should be movable.
> >
> > # echo offline > /sys/devices/system/memory/memory136/state
> > # echo online_movable > /sys/devices/system/memory/memory136/state
> > # ./hmm-swap &
> > ... wait until everything is device-exclusive
> > # echo offline > /sys/devices/system/memory/memory136/state
> > [  285.193431][T14882] page: refcount:2 mapcount:0 mapping:0000000000000000
> >   index:0x7f20671f7 pfn:0x442b6a
> > [  285.196618][T14882] memcg:ffff888179298000
> > [  285.198085][T14882] anon flags: 0x5fff0000002091c(referenced|uptodate|
> >   dirty|active|owner_2|swapbacked|node=1|zone=3|lastcpupid=0x7ff)
> > [  285.201734][T14882] raw: ...
> > [  285.204464][T14882] raw: ...
> > [  285.207196][T14882] page dumped because: migration failure
> > [  285.209072][T14882] page_owner tracks the page as allocated
> > [  285.210915][T14882] page last allocated via order 0, migratetype
> >   Movable, gfp_mask 0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO),
> >   id 14926, tgid 14926 (hmm-swap), ts 254506295376, free_ts 227402023774
> > [  285.216765][T14882]  post_alloc_hook+0x197/0x1b0
> > [  285.218874][T14882]  get_page_from_freelist+0x76e/0x3280
> > [  285.220864][T14882]  __alloc_frozen_pages_noprof+0x38e/0x2740
> > [  285.223302][T14882]  alloc_pages_mpol+0x1fc/0x540
> > [  285.225130][T14882]  folio_alloc_mpol_noprof+0x36/0x340
> > [  285.227222][T14882]  vma_alloc_folio_noprof+0xee/0x1a0
> > [  285.229074][T14882]  __handle_mm_fault+0x2b38/0x56a0
> > [  285.230822][T14882]  handle_mm_fault+0x368/0x9f0
> > ...
> >
> > This series fixes all issues I found so far.
>
> Cool.
>
> Barry, could you please redo your series "mm: batched unmap lazyfree
> large folios during reclamation" on top of this (on top of mm-unstable,
> ideally).

Sure. Thanks for letting me know.

>
Alistair Popple Feb. 13, 2025, 11:03 a.m. UTC | #3
On Mon, Feb 10, 2025 at 08:37:42PM +0100, David Hildenbrand wrote:
> Against mm-hotfixes-stable for now.
> 
> Discussing the PageTail() call in make_device_exclusive_range() with
> Willy, I recently discovered [1] that device-exclusive handling does
> not properly work with THP, making the hmm-tests selftests fail if THPs
> are enabled on the system.
> 
> Looking into more details, I found that hugetlb is not properly fenced,
> and I realized that something that was bugging me for longer -- how
> device-exclusive entries interact with mapcounts -- completely breaks
> migration/swapout/split/hwpoison handling of these folios while they have
> device-exclusive PTEs.
> 
> The program below can be used to allocate 1 GiB worth of pages and
> making them device-exclusive on a kernel with CONFIG_TEST_HMM.
> 
> Once they are device-exclusive, these folios cannot get swapped out
> (proc$pid/smaps_rollup will always indicate 1 GiB RSS no matter how
> much one forces memory reclaim), and when having a memory block onlined
> to ZONE_MOVABLE, trying to offline it will loop forever and complain about
> failed migration of a page that should be movable.
> 
> # echo offline > /sys/devices/system/memory/memory136/state
> # echo online_movable > /sys/devices/system/memory/memory136/state
> # ./hmm-swap &
> ... wait until everything is device-exclusive
> # echo offline > /sys/devices/system/memory/memory136/state
> [  285.193431][T14882] page: refcount:2 mapcount:0 mapping:0000000000000000
>   index:0x7f20671f7 pfn:0x442b6a
> [  285.196618][T14882] memcg:ffff888179298000
> [  285.198085][T14882] anon flags: 0x5fff0000002091c(referenced|uptodate|
>   dirty|active|owner_2|swapbacked|node=1|zone=3|lastcpupid=0x7ff)
> [  285.201734][T14882] raw: ...
> [  285.204464][T14882] raw: ...
> [  285.207196][T14882] page dumped because: migration failure
> [  285.209072][T14882] page_owner tracks the page as allocated
> [  285.210915][T14882] page last allocated via order 0, migratetype
>   Movable, gfp_mask 0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO),
>   id 14926, tgid 14926 (hmm-swap), ts 254506295376, free_ts 227402023774
> [  285.216765][T14882]  post_alloc_hook+0x197/0x1b0
> [  285.218874][T14882]  get_page_from_freelist+0x76e/0x3280
> [  285.220864][T14882]  __alloc_frozen_pages_noprof+0x38e/0x2740
> [  285.223302][T14882]  alloc_pages_mpol+0x1fc/0x540
> [  285.225130][T14882]  folio_alloc_mpol_noprof+0x36/0x340
> [  285.227222][T14882]  vma_alloc_folio_noprof+0xee/0x1a0
> [  285.229074][T14882]  __handle_mm_fault+0x2b38/0x56a0
> [  285.230822][T14882]  handle_mm_fault+0x368/0x9f0
> ...
> 
> This series fixes all issues I found so far. There is no easy way to fix
> without a bigger rework/cleanup. I have a bunch of cleanups on top (some
> previous sent, some the result of the discussion in v1) that I will send
> out separately once this landed and I get to it.
> I wish we could just use some special present PROT_NONE PTEs instead of

First off David thanks for finding and fixing these issues. If you have further
clean-ups in mind that you need help with please let me know as I'd be happy
to help.

> these (non-present, non-none) fake-swap entries; but that just results in
> the same problem we keep having (lack of spare PTE bits), and staring at
> other similar fake-swap entries, that ship has sailed.
> 
> With this series, make_device_exclusive() doesn't actually belong into
> mm/rmap.c anymore, but I'll leave moving that for another day.
> 
> I only tested this series with the hmm-tests selftests due to lack of HW,
> so I'd appreciate some testing, especially if the interaction between
> two GPUs wanting a device-exclusive entry works as expected.

I'm still reviewing the series but so far testing on my single GPU system
appears to be working as expected. I will try and fire up a dual GPU system
tomorrow and test it there as well.

 - Alistair

> <program>
> #include <stdio.h>
> #include <fcntl.h>
> #include <stdint.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/mman.h>
> #include <sys/ioctl.h>
> #include <linux/types.h>
> #include <linux/ioctl.h>
> 
> #define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd)
> 
> struct hmm_dmirror_cmd {
> 	__u64 addr;
> 	__u64 ptr;
> 	__u64 npages;
> 	__u64 cpages;
> 	__u64 faults;
> };
> 
> const size_t size = 1 * 1024 * 1024 * 1024ul;
> const size_t chunk_size = 2 * 1024 * 1024ul;
> 
> int main(void)
> {
> 	struct hmm_dmirror_cmd cmd;
> 	size_t cur_size;
> 	int fd, ret;
> 	char *addr, *mirror;
> 
> 	fd = open("/dev/hmm_dmirror1", O_RDWR, 0);
> 	if (fd < 0) {
> 		perror("open failed\n");
> 		exit(1);
> 	}
> 
> 	addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
> 		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 	if (addr == MAP_FAILED) {
> 		perror("mmap failed\n");
> 		exit(1);
> 	}
> 	madvise(addr, size, MADV_NOHUGEPAGE);
> 	memset(addr, 1, size);
> 
> 	mirror = malloc(chunk_size);
> 
> 	for (cur_size = 0; cur_size < size; cur_size += chunk_size) {
> 		cmd.addr = (uintptr_t)addr + cur_size;
> 		cmd.ptr = (uintptr_t)mirror;
> 		cmd.npages = chunk_size / getpagesize();
> 		ret = ioctl(fd, HMM_DMIRROR_EXCLUSIVE, &cmd);
> 		if (ret) {
> 			perror("ioctl failed\n");
> 			exit(1);
> 		}
> 	}
> 	pause();
> 	return 0;
> }
> </program>
> 
> [1] https://lkml.kernel.org/r/25e02685-4f1d-47fa-be5b-01ff85bb0ce2@redhat.com
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Alex Shi <alexs@kernel.org>
> Cc: Yanteng Si <si.yanteng@linux.dev>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: SeongJae Park <sj@kernel.org>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Jann Horn <jannh@google.com>
> Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> 
> v1 -> v2:
>  * "mm/rmap: convert make_device_exclusive_range() to make_device_exclusive()"
>   -> Fix and simplify return value handling when calling dmirror_atomic_map()
>   -> Fix parameter order when calling make_device_exclusive()
>   [both things were fixed by the separate cleanups I previously sent, realized
>    it when re-testing the fixes here only]
>   -> Heavily extend documentation of make_device_exclusive()
>  * "mm/rmap: implement make_device_exclusive() using folio_walk instead of
>     rmap walk"
>   -> Keep MMU_NOTIFY_EXCLUSIVE, and update comments/description
>  * "mm/rmap: handle device-exclusive entries correctly in try_to_migrate_one()"
>   -> Handle PageHWPoison with device-private pages differently
>  * Added a bunch of "handle device-exclusive entries correctly" fixes,
>    now handling all page_vma_mapped_walk() callers correctly
>  * Added "mm/rmap: avoid -EBUSY from make_device_exclusive()" to fix some
>    hmm selftest failures I saw while testing under memory pressure
>  * Plenty of comment/description updates and improvements
> 
> David Hildenbrand (17):
>   mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs
>   mm/rmap: reject hugetlb folios in folio_make_device_exclusive()
>   mm/rmap: convert make_device_exclusive_range() to
>     make_device_exclusive()
>   mm/rmap: implement make_device_exclusive() using folio_walk instead of
>     rmap walk
>   mm/memory: detect writability in restore_exclusive_pte() through
>     can_change_pte_writable()
>   mm: use single SWP_DEVICE_EXCLUSIVE entry type
>   mm/page_vma_mapped: device-exclusive entries are not migration entries
>   kernel/events/uprobes: handle device-exclusive entries correctly in
>     __replace_page()
>   mm/ksm: handle device-exclusive entries correctly in
>     write_protect_page()
>   mm/rmap: handle device-exclusive entries correctly in
>     try_to_unmap_one()
>   mm/rmap: handle device-exclusive entries correctly in
>     try_to_migrate_one()
>   mm/rmap: handle device-exclusive entries correctly in
>     page_vma_mkclean_one()
>   mm/page_idle: handle device-exclusive entries correctly in
>     page_idle_clear_pte_refs_one()
>   mm/damon: handle device-exclusive entries correctly in
>     damon_folio_young_one()
>   mm/damon: handle device-exclusive entries correctly in
>     damon_folio_mkold_one()
>   mm/rmap: keep mapcount untouched for device-exclusive entries
>   mm/rmap: avoid -EBUSY from make_device_exclusive()
> 
>  Documentation/mm/hmm.rst                    |   2 +-
>  Documentation/translations/zh_CN/mm/hmm.rst |   2 +-
>  drivers/gpu/drm/nouveau/nouveau_svm.c       |   5 +-
>  include/linux/mmu_notifier.h                |   2 +-
>  include/linux/rmap.h                        |   5 +-
>  include/linux/swap.h                        |   7 +-
>  include/linux/swapops.h                     |  27 +-
>  kernel/events/uprobes.c                     |  13 +-
>  lib/test_hmm.c                              |  41 +-
>  mm/damon/ops-common.c                       |  23 +-
>  mm/damon/paddr.c                            |  10 +-
>  mm/gup.c                                    |   3 +
>  mm/ksm.c                                    |   9 +-
>  mm/memory.c                                 |  28 +-
>  mm/mprotect.c                               |   8 -
>  mm/page_idle.c                              |   9 +-
>  mm/page_table_check.c                       |   5 +-
>  mm/page_vma_mapped.c                        |   3 +-
>  mm/rmap.c                                   | 469 +++++++++-----------
>  19 files changed, 315 insertions(+), 356 deletions(-)
> 
> 
> base-commit: e5b2a356dc8a88708d97bd47cca3b8f7ed7af6cb
> -- 
> 2.48.1
>
David Hildenbrand Feb. 13, 2025, 11:15 a.m. UTC | #4
On 13.02.25 12:03, Alistair Popple wrote:
> On Mon, Feb 10, 2025 at 08:37:42PM +0100, David Hildenbrand wrote:
>> Against mm-hotfixes-stable for now.
>>
>> Discussing the PageTail() call in make_device_exclusive_range() with
>> Willy, I recently discovered [1] that device-exclusive handling does
>> not properly work with THP, making the hmm-tests selftests fail if THPs
>> are enabled on the system.
>>
>> Looking into more details, I found that hugetlb is not properly fenced,
>> and I realized that something that was bugging me for longer -- how
>> device-exclusive entries interact with mapcounts -- completely breaks
>> migration/swapout/split/hwpoison handling of these folios while they have
>> device-exclusive PTEs.
>>
>> The program below can be used to allocate 1 GiB worth of pages and
>> making them device-exclusive on a kernel with CONFIG_TEST_HMM.
>>
>> Once they are device-exclusive, these folios cannot get swapped out
>> (proc$pid/smaps_rollup will always indicate 1 GiB RSS no matter how
>> much one forces memory reclaim), and when having a memory block onlined
>> to ZONE_MOVABLE, trying to offline it will loop forever and complain about
>> failed migration of a page that should be movable.
>>
>> # echo offline > /sys/devices/system/memory/memory136/state
>> # echo online_movable > /sys/devices/system/memory/memory136/state
>> # ./hmm-swap &
>> ... wait until everything is device-exclusive
>> # echo offline > /sys/devices/system/memory/memory136/state
>> [  285.193431][T14882] page: refcount:2 mapcount:0 mapping:0000000000000000
>>    index:0x7f20671f7 pfn:0x442b6a
>> [  285.196618][T14882] memcg:ffff888179298000
>> [  285.198085][T14882] anon flags: 0x5fff0000002091c(referenced|uptodate|
>>    dirty|active|owner_2|swapbacked|node=1|zone=3|lastcpupid=0x7ff)
>> [  285.201734][T14882] raw: ...
>> [  285.204464][T14882] raw: ...
>> [  285.207196][T14882] page dumped because: migration failure
>> [  285.209072][T14882] page_owner tracks the page as allocated
>> [  285.210915][T14882] page last allocated via order 0, migratetype
>>    Movable, gfp_mask 0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO),
>>    id 14926, tgid 14926 (hmm-swap), ts 254506295376, free_ts 227402023774
>> [  285.216765][T14882]  post_alloc_hook+0x197/0x1b0
>> [  285.218874][T14882]  get_page_from_freelist+0x76e/0x3280
>> [  285.220864][T14882]  __alloc_frozen_pages_noprof+0x38e/0x2740
>> [  285.223302][T14882]  alloc_pages_mpol+0x1fc/0x540
>> [  285.225130][T14882]  folio_alloc_mpol_noprof+0x36/0x340
>> [  285.227222][T14882]  vma_alloc_folio_noprof+0xee/0x1a0
>> [  285.229074][T14882]  __handle_mm_fault+0x2b38/0x56a0
>> [  285.230822][T14882]  handle_mm_fault+0x368/0x9f0
>> ...
>>
>> This series fixes all issues I found so far. There is no easy way to fix
>> without a bigger rework/cleanup. I have a bunch of cleanups on top (some
>> previous sent, some the result of the discussion in v1) that I will send
>> out separately once this landed and I get to it.
>> I wish we could just use some special present PROT_NONE PTEs instead of
> 
> First off David thanks for finding and fixing these issues. If you have further
> clean-ups in mind that you need help with please let me know as I'd be happy
> to help.

Sure! I have some cleanups TBD as result of the previous discussion, but 
nothing bigger so far.

(removing the folio lock could be considered bigger, if we want to go 
down that path)

> 
>> these (non-present, non-none) fake-swap entries; but that just results in
>> the same problem we keep having (lack of spare PTE bits), and staring at
>> other similar fake-swap entries, that ship has sailed.
>>
>> With this series, make_device_exclusive() doesn't actually belong into
>> mm/rmap.c anymore, but I'll leave moving that for another day.
>>
>> I only tested this series with the hmm-tests selftests due to lack of HW,
>> so I'd appreciate some testing, especially if the interaction between
>> two GPUs wanting a device-exclusive entry works as expected.
> 
> I'm still reviewing the series but so far testing on my single GPU system
> appears to be working as expected. I will try and fire up a dual GPU system
> tomorrow and test it there as well.

Great, thanks a bunch for testing!

Out of interest: does the nvidia driver make use of this interface as 
well, and are you testing with that or with the nouveau driver? I saw 
some reports that nvidia at least checks for it [1] when building the 
module:

	CONFTEST: make_device_exclusive_range

[1] 
https://www.googlecloudcommunity.com/gc/AI-ML/Can-t-Install-Nvidia-Drivers-on-6-1-0-18-Kernel/m-p/722596
Alistair Popple Feb. 14, 2025, 1:25 a.m. UTC | #5
On Thu, Feb 13, 2025 at 12:15:58PM +0100, David Hildenbrand wrote:
> On 13.02.25 12:03, Alistair Popple wrote:
> > On Mon, Feb 10, 2025 at 08:37:42PM +0100, David Hildenbrand wrote:
> > > Against mm-hotfixes-stable for now.
> > > 
> > > Discussing the PageTail() call in make_device_exclusive_range() with
> > > Willy, I recently discovered [1] that device-exclusive handling does
> > > not properly work with THP, making the hmm-tests selftests fail if THPs
> > > are enabled on the system.
> > > 
> > > Looking into more details, I found that hugetlb is not properly fenced,
> > > and I realized that something that was bugging me for longer -- how
> > > device-exclusive entries interact with mapcounts -- completely breaks
> > > migration/swapout/split/hwpoison handling of these folios while they have
> > > device-exclusive PTEs.
> > > 
> > > The program below can be used to allocate 1 GiB worth of pages and
> > > making them device-exclusive on a kernel with CONFIG_TEST_HMM.
> > > 
> > > Once they are device-exclusive, these folios cannot get swapped out
> > > (proc$pid/smaps_rollup will always indicate 1 GiB RSS no matter how
> > > much one forces memory reclaim), and when having a memory block onlined
> > > to ZONE_MOVABLE, trying to offline it will loop forever and complain about
> > > failed migration of a page that should be movable.
> > > 
> > > # echo offline > /sys/devices/system/memory/memory136/state
> > > # echo online_movable > /sys/devices/system/memory/memory136/state
> > > # ./hmm-swap &
> > > ... wait until everything is device-exclusive
> > > # echo offline > /sys/devices/system/memory/memory136/state
> > > [  285.193431][T14882] page: refcount:2 mapcount:0 mapping:0000000000000000
> > >    index:0x7f20671f7 pfn:0x442b6a
> > > [  285.196618][T14882] memcg:ffff888179298000
> > > [  285.198085][T14882] anon flags: 0x5fff0000002091c(referenced|uptodate|
> > >    dirty|active|owner_2|swapbacked|node=1|zone=3|lastcpupid=0x7ff)
> > > [  285.201734][T14882] raw: ...
> > > [  285.204464][T14882] raw: ...
> > > [  285.207196][T14882] page dumped because: migration failure
> > > [  285.209072][T14882] page_owner tracks the page as allocated
> > > [  285.210915][T14882] page last allocated via order 0, migratetype
> > >    Movable, gfp_mask 0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO),
> > >    id 14926, tgid 14926 (hmm-swap), ts 254506295376, free_ts 227402023774
> > > [  285.216765][T14882]  post_alloc_hook+0x197/0x1b0
> > > [  285.218874][T14882]  get_page_from_freelist+0x76e/0x3280
> > > [  285.220864][T14882]  __alloc_frozen_pages_noprof+0x38e/0x2740
> > > [  285.223302][T14882]  alloc_pages_mpol+0x1fc/0x540
> > > [  285.225130][T14882]  folio_alloc_mpol_noprof+0x36/0x340
> > > [  285.227222][T14882]  vma_alloc_folio_noprof+0xee/0x1a0
> > > [  285.229074][T14882]  __handle_mm_fault+0x2b38/0x56a0
> > > [  285.230822][T14882]  handle_mm_fault+0x368/0x9f0
> > > ...
> > > 
> > > This series fixes all issues I found so far. There is no easy way to fix
> > > without a bigger rework/cleanup. I have a bunch of cleanups on top (some
> > > previous sent, some the result of the discussion in v1) that I will send
> > > out separately once this landed and I get to it.
> > > I wish we could just use some special present PROT_NONE PTEs instead of

Yeah, that was my initial instinct when I first investigated this. As you point
out a lack of spare PTE bits made it hard/impossible. Of course I'm about to
give you all one back, maybe I should keep it :) I'm only kidding though - I'm
sure there's more interesting things to spend it on.

> > 
> > First off David thanks for finding and fixing these issues. If you have further
> > clean-ups in mind that you need help with please let me know as I'd be happy
> > to help.
> 
> Sure! I have some cleanups TBD as result of the previous discussion, but
> nothing bigger so far.
> 
> (removing the folio lock could be considered bigger, if we want to go down
> that path)
> 
> > 
> > > these (non-present, non-none) fake-swap entries; but that just results in
> > > the same problem we keep having (lack of spare PTE bits), and staring at
> > > other similar fake-swap entries, that ship has sailed.
> > > 
> > > With this series, make_device_exclusive() doesn't actually belong into
> > > mm/rmap.c anymore, but I'll leave moving that for another day.
> > > 
> > > I only tested this series with the hmm-tests selftests due to lack of HW,
> > > so I'd appreciate some testing, especially if the interaction between
> > > two GPUs wanting a device-exclusive entry works as expected.
> > 
> > I'm still reviewing the series but so far testing on my single GPU system
> > appears to be working as expected. I will try and fire up a dual GPU system
> > tomorrow and test it there as well.
> 
> Great, thanks a bunch for testing!
> 
> Out of interest: does the nvidia driver make use of this interface as well,
> and are you testing with that or with the nouveau driver? I saw some reports
> that nvidia at least checks for it [1] when building the module:

Both. I have tested Nouveau with the Mesa OpenCL stack and a simple stress test
that just thrashes atomic accesses between CPU and GPU and a similar test for
the nvidia driver.

In practice the nvidia driver probably doesn't use this that often as it
more aggressively migrates data but it does use this as a fallback. Also it's
possible for users to force residency on the CPU in which case this is used,
which is what the test does.

Anyway I have just finished testing on a multi-GPU setup so please feel free to
add for the series:

Tested-by: Alistair Popple <apopple@nvidia.com>

> 
> 	CONFTEST: make_device_exclusive_range
> 
> [1] https://www.googlecloudcommunity.com/gc/AI-ML/Can-t-Install-Nvidia-Drivers-on-6-1-0-18-Kernel/m-p/722596
> 
> -- 
> Cheers,
> 
> David / dhildenb
>
David Hildenbrand Feb. 14, 2025, 10:37 a.m. UTC | #6
On 14.02.25 02:25, Alistair Popple wrote:
> On Thu, Feb 13, 2025 at 12:15:58PM +0100, David Hildenbrand wrote:
>> On 13.02.25 12:03, Alistair Popple wrote:
>>> On Mon, Feb 10, 2025 at 08:37:42PM +0100, David Hildenbrand wrote:
>>>> Against mm-hotfixes-stable for now.
>>>>
>>>> Discussing the PageTail() call in make_device_exclusive_range() with
>>>> Willy, I recently discovered [1] that device-exclusive handling does
>>>> not properly work with THP, making the hmm-tests selftests fail if THPs
>>>> are enabled on the system.
>>>>
>>>> Looking into more details, I found that hugetlb is not properly fenced,
>>>> and I realized that something that was bugging me for longer -- how
>>>> device-exclusive entries interact with mapcounts -- completely breaks
>>>> migration/swapout/split/hwpoison handling of these folios while they have
>>>> device-exclusive PTEs.
>>>>
>>>> The program below can be used to allocate 1 GiB worth of pages and
>>>> making them device-exclusive on a kernel with CONFIG_TEST_HMM.
>>>>
>>>> Once they are device-exclusive, these folios cannot get swapped out
>>>> (proc$pid/smaps_rollup will always indicate 1 GiB RSS no matter how
>>>> much one forces memory reclaim), and when having a memory block onlined
>>>> to ZONE_MOVABLE, trying to offline it will loop forever and complain about
>>>> failed migration of a page that should be movable.
>>>>
>>>> # echo offline > /sys/devices/system/memory/memory136/state
>>>> # echo online_movable > /sys/devices/system/memory/memory136/state
>>>> # ./hmm-swap &
>>>> ... wait until everything is device-exclusive
>>>> # echo offline > /sys/devices/system/memory/memory136/state
>>>> [  285.193431][T14882] page: refcount:2 mapcount:0 mapping:0000000000000000
>>>>     index:0x7f20671f7 pfn:0x442b6a
>>>> [  285.196618][T14882] memcg:ffff888179298000
>>>> [  285.198085][T14882] anon flags: 0x5fff0000002091c(referenced|uptodate|
>>>>     dirty|active|owner_2|swapbacked|node=1|zone=3|lastcpupid=0x7ff)
>>>> [  285.201734][T14882] raw: ...
>>>> [  285.204464][T14882] raw: ...
>>>> [  285.207196][T14882] page dumped because: migration failure
>>>> [  285.209072][T14882] page_owner tracks the page as allocated
>>>> [  285.210915][T14882] page last allocated via order 0, migratetype
>>>>     Movable, gfp_mask 0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO),
>>>>     id 14926, tgid 14926 (hmm-swap), ts 254506295376, free_ts 227402023774
>>>> [  285.216765][T14882]  post_alloc_hook+0x197/0x1b0
>>>> [  285.218874][T14882]  get_page_from_freelist+0x76e/0x3280
>>>> [  285.220864][T14882]  __alloc_frozen_pages_noprof+0x38e/0x2740
>>>> [  285.223302][T14882]  alloc_pages_mpol+0x1fc/0x540
>>>> [  285.225130][T14882]  folio_alloc_mpol_noprof+0x36/0x340
>>>> [  285.227222][T14882]  vma_alloc_folio_noprof+0xee/0x1a0
>>>> [  285.229074][T14882]  __handle_mm_fault+0x2b38/0x56a0
>>>> [  285.230822][T14882]  handle_mm_fault+0x368/0x9f0
>>>> ...
>>>>
>>>> This series fixes all issues I found so far. There is no easy way to fix
>>>> without a bigger rework/cleanup. I have a bunch of cleanups on top (some
>>>> previous sent, some the result of the discussion in v1) that I will send
>>>> out separately once this landed and I get to it.
>>>> I wish we could just use some special present PROT_NONE PTEs instead of
> 
> Yeah, that was my initial instinct when I first investigated this. As you point
> out a lack of spare PTE bits made it hard/impossible. Of course I'm about to
> give you all one back, maybe I should keep it :) I'm only kidding though - I'm
> sure there's more interesting things to spend it on.

Yes. And I think it could actually be valuable to have the option for 
more fake-prot-none things.

For example, right now we cannot really distinguish NUMA-hinting 
prot-none from ordinary prot-none without guessing based on some VMA flags.

One could implement NUMA-hinting using a PFN swap entry in an 
arch-independent way I guess.

So there are pros and cons to it. The biggest con is, that while RMAP 
can now handle it, other page table walkers mostly skip these entries.

> 
>>>
>>> First off David thanks for finding and fixing these issues. If you have further
>>> clean-ups in mind that you need help with please let me know as I'd be happy
>>> to help.
>>
>> Sure! I have some cleanups TBD as result of the previous discussion, but
>> nothing bigger so far.
>>
>> (removing the folio lock could be considered bigger, if we want to go down
>> that path)
>>
>>>
>>>> these (non-present, non-none) fake-swap entries; but that just results in
>>>> the same problem we keep having (lack of spare PTE bits), and staring at
>>>> other similar fake-swap entries, that ship has sailed.
>>>>
>>>> With this series, make_device_exclusive() doesn't actually belong into
>>>> mm/rmap.c anymore, but I'll leave moving that for another day.
>>>>
>>>> I only tested this series with the hmm-tests selftests due to lack of HW,
>>>> so I'd appreciate some testing, especially if the interaction between
>>>> two GPUs wanting a device-exclusive entry works as expected.
>>>
>>> I'm still reviewing the series but so far testing on my single GPU system
>>> appears to be working as expected. I will try and fire up a dual GPU system
>>> tomorrow and test it there as well.
>>
>> Great, thanks a bunch for testing!
>>
>> Out of interest: does the nvidia driver make use of this interface as well,
>> and are you testing with that or with the nouveau driver? I saw some reports
>> that nvidia at least checks for it [1] when building the module:
> 
> Both. I have tested Nouveau with the Mesa OpenCL stack and a simple stress test
> that just thrashes atomic accesses between CPU and GPU and a similar test for
> the nvidia driver.
> 
> In practice the nvidia driver probably doesn't use this that often as it
> more aggressively migrates data but it does use this as a fallback. Also it's
> possible for users to force residency on the CPU in which case this is used,
> which is what the test does.

Cool, thanks! (so even though nouveau is not enabled in RHEL, we'd 
effectively be using that functionality in RHEL kernels using the nvidia 
driver)

> 
> Anyway I have just finished testing on a multi-GPU setup so please feel free to
> add for the series:
> 
> Tested-by: Alistair Popple <apopple@nvidia.com>

Thanks a bunch!