Message ID | 20230309135718.1490461-2-usama.anjum@collabora.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Implement IOCTL to get and optionally clear info about PTEs | expand |
Hello, Muhammad, On Thu, Mar 09, 2023 at 06:57:12PM +0500, Muhammad Usama Anjum wrote: > Add new WP Async mode (UFFD_FEATURE_WP_ASYNC) which resolves the page > faults on its own. It can be used to track that which pages have been > written-to from the time the pages were write-protected. It is very > efficient way to track the changes as uffd is by nature pte/pmd based. > > UFFD synchronous WP sends the page faults to the userspace where the > pages which have been written-to can be tracked. But it is not efficient. > This is why this asynchronous version is being added. After setting the > WP Async, the pages which have been written to can be found in the pagemap > file or information can be obtained from the PAGEMAP_IOCTL. > > Suggested-by: Peter Xu <peterx@redhat.com> > Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Here's the patch that can enable WP_ASYNC for all kinds of memories (as I promised..). Currently I only tested btrfs (besides the common three) which is the major fs I use locally, but I guess it'll also enable the rest no matter what's underneath, just like soft-dirty. As I mentioned, I just feel it very unfortunate to have a lot of suffixes for the UFFD_FEATURE_* on types of memory, and I hope we get rid of it for this WP_ASYNC from the start because the workflow should really be similar to anon/shmem handling for most of the rest, just a few tweaks here and there. I had a feeling that some type of special VMA will work weirdly, but let's see.. so far I don't come up with any. If the patch looks fine to you, please consider replace this patch with patch 1 of mine where I attached. Then patch 1 can be reviewed alongside with your series. Logically patch 1 can be reviewed separately too, because it works perfectly afaiu without the atomic version of pagemap already. But on my side I don't think it justifies anything really matters, so unless someone thinks it a good idea to post / review / merge it separately, you can keep that with your new pagemap ioctl. Patch 2 is only for your reference. It's not for merging quality so please don't put it into your series. I do plan to cleanup the userfaultfd selftests in the near future first (when I wrote this I am more eager to do so..). I also think your final pagemap test cases can cover quite a bit. Thanks,
Thanks you for sending. I'll perform testing and share results next. On 3/17/23 12:20 AM, Peter Xu wrote: > Hello, Muhammad, > > On Thu, Mar 09, 2023 at 06:57:12PM +0500, Muhammad Usama Anjum wrote: >> Add new WP Async mode (UFFD_FEATURE_WP_ASYNC) which resolves the page >> faults on its own. It can be used to track that which pages have been >> written-to from the time the pages were write-protected. It is very >> efficient way to track the changes as uffd is by nature pte/pmd based. >> >> UFFD synchronous WP sends the page faults to the userspace where the >> pages which have been written-to can be tracked. But it is not efficient. >> This is why this asynchronous version is being added. After setting the >> WP Async, the pages which have been written to can be found in the pagemap >> file or information can be obtained from the PAGEMAP_IOCTL. >> >> Suggested-by: Peter Xu <peterx@redhat.com> >> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> > > Here's the patch that can enable WP_ASYNC for all kinds of memories (as I > promised..). Currently I only tested btrfs (besides the common three) > which is the major fs I use locally, but I guess it'll also enable the rest > no matter what's underneath, just like soft-dirty. > > As I mentioned, I just feel it very unfortunate to have a lot of suffixes > for the UFFD_FEATURE_* on types of memory, and I hope we get rid of it for > this WP_ASYNC from the start because the workflow should really be similar > to anon/shmem handling for most of the rest, just a few tweaks here and > there. > > I had a feeling that some type of special VMA will work weirdly, but let's > see.. so far I don't come up with any. > > If the patch looks fine to you, please consider replace this patch with > patch 1 of mine where I attached. Then patch 1 can be reviewed alongside > with your series. > > Logically patch 1 can be reviewed separately too, because it works > perfectly afaiu without the atomic version of pagemap already. But on my > side I don't think it justifies anything really matters, so unless someone > thinks it a good idea to post / review / merge it separately, you can keep > that with your new pagemap ioctl. > > Patch 2 is only for your reference. It's not for merging quality so please > don't put it into your series. I do plan to cleanup the userfaultfd > selftests in the near future first (when I wrote this I am more eager to do > so..). I also think your final pagemap test cases can cover quite a bit. > > Thanks, >
On 3/17/23 12:20 AM, Peter Xu wrote: > Hello, Muhammad, > > On Thu, Mar 09, 2023 at 06:57:12PM +0500, Muhammad Usama Anjum wrote: >> Add new WP Async mode (UFFD_FEATURE_WP_ASYNC) which resolves the page >> faults on its own. It can be used to track that which pages have been >> written-to from the time the pages were write-protected. It is very >> efficient way to track the changes as uffd is by nature pte/pmd based. >> >> UFFD synchronous WP sends the page faults to the userspace where the >> pages which have been written-to can be tracked. But it is not efficient. >> This is why this asynchronous version is being added. After setting the >> WP Async, the pages which have been written to can be found in the pagemap >> file or information can be obtained from the PAGEMAP_IOCTL. >> >> Suggested-by: Peter Xu <peterx@redhat.com> >> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> > > Here's the patch that can enable WP_ASYNC for all kinds of memories (as I > promised..). Currently I only tested btrfs (besides the common three) > which is the major fs I use locally, but I guess it'll also enable the rest > no matter what's underneath, just like soft-dirty. > > As I mentioned, I just feel it very unfortunate to have a lot of suffixes > for the UFFD_FEATURE_* on types of memory, and I hope we get rid of it for > this WP_ASYNC from the start because the workflow should really be similar > to anon/shmem handling for most of the rest, just a few tweaks here and > there. > > I had a feeling that some type of special VMA will work weirdly, but let's > see.. so far I don't come up with any. > > If the patch looks fine to you, please consider replace this patch with > patch 1 of mine where I attached. Then patch 1 can be reviewed alongside > with your series. > > Logically patch 1 can be reviewed separately too, because it works > perfectly afaiu without the atomic version of pagemap already. But on my > side I don't think it justifies anything really matters, so unless someone > thinks it a good idea to post / review / merge it separately, you can keep > that with your new pagemap ioctl. > > Patch 2 is only for your reference. It's not for merging quality so please > don't put it into your series. I do plan to cleanup the userfaultfd > selftests in the near future first (when I wrote this I am more eager to do > so..). I also think your final pagemap test cases can cover quite a bit. > > Thanks, Thank you so much for the patch. I've tested hugetlb mem. This patch is working fine for hugetlb shmem: *shmid = shmget(2, size, SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W); mem = shmat(*shmid, 0, 0); I've found slight issue with hugetlb mem which has been mmaped: mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_HUGETLB | MAP_PRIVATE, -1, 0); The issue is that even after witting to this memory, the wp flag is still present there and memory doesn't appear to be dirty when it should have been dirty. The temporary fix is to write to memory and write protect the memory one extra time. Here is how I'm checking if WP flag is set or not: static inline bool is_huge_pte_uffd_wp(pte_t pte) { return ((pte_present(pte) && huge_pte_uffd_wp(pte)) || pte_swp_uffd_wp_any(pte)); } I've isolated the reproducer inside kselftests by commenting the unrelated code. Please have a look at the attached kselftest and follow from main or search `//#define tmpfix` in the code.
Hi, Muhammad, On Tue, Mar 21, 2023 at 05:21:15PM +0500, Muhammad Usama Anjum wrote: > Thank you so much for the patch. I've tested hugetlb mem. This patch is > working fine for hugetlb shmem: > *shmid = shmget(2, size, SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W); > mem = shmat(*shmid, 0, 0); > > I've found slight issue with hugetlb mem which has been mmaped: > mem = mmap(NULL, size, PROT_READ | PROT_WRITE, > MAP_ANONYMOUS | MAP_HUGETLB | MAP_PRIVATE, -1, 0); > The issue is that even after witting to this memory, the wp flag is still > present there and memory doesn't appear to be dirty when it should have > been dirty. The temporary fix is to write to memory and write protect the > memory one extra time. I looked into this today and found it's an existing bug that can trigger with sync mode too.. as long as protection applied to unpopulated hugetlb private mappings, then write to it. I've sent a fix for it here and have you copied: https://lore.kernel.org/linux-mm/20230321191840.1897940-1-peterx@redhat.com/T/#u Please have a look and see whether it also fixes your issue. PS: recently I added a warning in commit c2da319c2e2789 and that can indeed capture this one when verifying using pagemap. I'd guess your dmesg should also contain something dumped. Thanks,
On 3/22/23 12:25 AM, Peter Xu wrote: > Hi, Muhammad, > > On Tue, Mar 21, 2023 at 05:21:15PM +0500, Muhammad Usama Anjum wrote: >> Thank you so much for the patch. I've tested hugetlb mem. This patch is >> working fine for hugetlb shmem: >> *shmid = shmget(2, size, SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W); >> mem = shmat(*shmid, 0, 0); >> >> I've found slight issue with hugetlb mem which has been mmaped: >> mem = mmap(NULL, size, PROT_READ | PROT_WRITE, >> MAP_ANONYMOUS | MAP_HUGETLB | MAP_PRIVATE, -1, 0); >> The issue is that even after witting to this memory, the wp flag is still >> present there and memory doesn't appear to be dirty when it should have >> been dirty. The temporary fix is to write to memory and write protect the >> memory one extra time. > > I looked into this today and found it's an existing bug that can trigger > with sync mode too.. as long as protection applied to unpopulated hugetlb > private mappings, then write to it. > > I've sent a fix for it here and have you copied: > > https://lore.kernel.org/linux-mm/20230321191840.1897940-1-peterx@redhat.com/T/#u > > Please have a look and see whether it also fixes your issue. Thanks for sending the patch. I've replied on the sent patch. > > PS: recently I added a warning in commit c2da319c2e2789 and that can indeed > capture this one when verifying using pagemap. I'd guess your dmesg should > also contain something dumped. I didn't had debug_vm config enabled. I've enabled it now. I'm getting only the following stack trace in failure scenario: ok 1 Hugetlb shmem testing: all new pages must not be written (dirty) 0 ok 2 Hugetlb shmem testing: all pages must be written (dirty) 1 512 0 512 ok 3 Hugetlb mem testing: all new pages must not be written (dirty) 0 [ 10.086540] ------------[ cut here ]------------ [ 10.087758] WARNING: CPU: 0 PID: 175 at arch/x86/include/asm/pgtable.h:313 pagemap_scan_hugetlb_entry+0x19c/0x230 [ 10.090208] Modules linked in: [ 10.091059] CPU: 0 PID: 175 Comm: pagemap_ioctl Not tainted 6.3.0-rc3-next-20230320-00010-gdc395ccf1882 #88 [ 10.093224] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014 [ 10.095879] RIP: 0010:pagemap_scan_hugetlb_entry+0x19c/0x230 [ 10.097497] Code: 89 ca 41 89 c2 29 c8 4c 01 c2 49 39 d2 41 0f 43 c0 e9 53 ff ff ff 48 83 e2 9f 89 c7 31 ed 49 89 d1 83 e7 02 0f 84 30 ff ff ff <0f> 0b 31 ff e9 27 ff ff ff 48 83 e2 9f 44 89 c0 bf 01 00 00 00 bd [ 10.102528] RSP: 0018:ffffb6cd80303d10 EFLAGS: 00010202 [ 10.104002] RAX: 8000000000000ce7 RBX: 00007fcc84000000 RCX: 0000000000200000 [ 10.105989] RDX: 80000002f7c00c87 RSI: 0000000000000001 RDI: 0000000000000002 [ 10.108043] RBP: 0000000000000000 R08: 0000000000000200 R09: 80000002f7c00c87 [ 10.110004] R10: ffffa08541e3220c R11: 0000000000000000 R12: ffffa08541562420 [ 10.112335] R13: ffffb6cd80303e70 R14: 00007fcc84000000 R15: ffffffff8eae1520 [ 10.114688] FS: 00007fcc8454b740(0000) GS:ffffa0886fc00000(0000) knlGS:0000000000000000 [ 10.116960] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 10.118187] CR2: 00007fcc84000000 CR3: 0000000102838000 CR4: 0000000000750ef0 [ 10.119628] PKRU: 55555554 [ 10.120184] Call Trace: [ 10.120730] <TASK> [ 10.121206] __walk_page_range+0xbe/0x1b0 [ 10.122048] walk_page_range+0x15f/0x1a0 [ 10.122869] do_pagemap_cmd+0x239/0x390 [ 10.123672] __x64_sys_ioctl+0x8b/0xc0 [ 10.124462] do_syscall_64+0x3a/0x90 [ 10.125227] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 10.126326] RIP: 0033:0x7fcc8464bbab [ 10.127066] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00 [ 10.130868] RSP: 002b:00007fff9b864240 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 10.132412] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fcc8464bbab [ 10.133880] RDX: 00007fff9b8642c0 RSI: 00000000c0586610 RDI: 0000000000000003 [ 10.135328] RBP: 00007fff9b864320 R08: 0000000000000001 R09: 0000000000000000 [ 10.136790] R10: 00007fff9b864217 R11: 0000000000000246 R12: 0000000000000000 [ 10.138285] R13: 00007fff9b8644f8 R14: 0000000000409df0 R15: 00007fcc84862020 [ 10.139729] </TASK> [ 10.140197] ---[ end trace 0000000000000000 ]--- not ok 4 Hugetlb mem testing: all pages must be written (dirty) 0 -2072900416 0 512 > > Thanks, >
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index dac0ebe39774..992b0b21cd59 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1446,10 +1446,15 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, goto out_unlock; /* - * Note vmas containing huge pages + * Note vmas containing huge pages. Hugetlb isn't supported + * with UFFD_FEATURE_WP_ASYNC. */ - if (is_vm_hugetlb_page(cur)) + ret = -EINVAL; + if (is_vm_hugetlb_page(cur)) { + if (ctx->features & UFFD_FEATURE_WP_ASYNC) + goto out_unlock; basic_ioctls = true; + } found = true; } for_each_vma_range(vmi, cur, end); @@ -1874,6 +1879,10 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx, mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP; mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE; + /* The unprotection is not supported if in async WP mode */ + if (!mode_wp && (ctx->features & UFFD_FEATURE_WP_ASYNC)) + return -EINVAL; + if (mode_wp && mode_dontwake) return -EINVAL; @@ -1957,6 +1966,13 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg) return ret; } +int userfaultfd_wp_async(struct vm_area_struct *vma) +{ + struct userfaultfd_ctx *ctx = vma->vm_userfaultfd_ctx.ctx; + + return (ctx && (ctx->features & UFFD_FEATURE_WP_ASYNC)); +} + static inline unsigned int uffd_ctx_features(__u64 user_features) { /* @@ -1988,6 +2004,10 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx, ret = -EPERM; if ((features & UFFD_FEATURE_EVENT_FORK) && !capable(CAP_SYS_PTRACE)) goto err_out; + if ((features & UFFD_FEATURE_WP_ASYNC) && + !(features & UFFD_FEATURE_WP_UNPOPULATED)) + goto err_out; + /* report all available features and ioctls to userland */ uffdio_api.features = UFFD_API_FEATURES; #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR @@ -2000,6 +2020,7 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx, #ifndef CONFIG_PTE_MARKER_UFFD_WP uffdio_api.features &= ~UFFD_FEATURE_WP_HUGETLBFS_SHMEM; uffdio_api.features &= ~UFFD_FEATURE_WP_UNPOPULATED; + uffdio_api.features &= ~UFFD_FEATURE_WP_ASYNC; #endif uffdio_api.ioctls = UFFD_API_IOCTLS; ret = -EFAULT; diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 52cb3de88e20..b680c0ec8b85 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -178,6 +178,7 @@ extern int userfaultfd_unmap_prep(struct mm_struct *mm, unsigned long start, extern void userfaultfd_unmap_complete(struct mm_struct *mm, struct list_head *uf); extern bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma); +extern int userfaultfd_wp_async(struct vm_area_struct *vma); #else /* CONFIG_USERFAULTFD */ @@ -278,6 +279,11 @@ static inline bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma) return false; } +static inline int userfaultfd_wp_async(struct vm_area_struct *vma) +{ + return false; +} + #endif /* CONFIG_USERFAULTFD */ static inline bool pte_marker_entry_uffd_wp(swp_entry_t entry) diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h index 90c958952bfc..00dbe7d6551b 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -39,7 +39,8 @@ UFFD_FEATURE_MINOR_SHMEM | \ UFFD_FEATURE_EXACT_ADDRESS | \ UFFD_FEATURE_WP_HUGETLBFS_SHMEM | \ - UFFD_FEATURE_WP_UNPOPULATED) + UFFD_FEATURE_WP_UNPOPULATED | \ + UFFD_FEATURE_WP_ASYNC) #define UFFD_API_IOCTLS \ ((__u64)1 << _UFFDIO_REGISTER | \ (__u64)1 << _UFFDIO_UNREGISTER | \ @@ -210,6 +211,13 @@ struct uffdio_api { * (i.e. empty ptes). This will be the default behavior for shmem * & hugetlbfs, so this flag only affects anonymous memory behavior * when userfault write-protection mode is registered. + * + * UFFD_FEATURE_WP_ASYNC indicates that userfaultfd write-protection + * asynchronous mode is supported in which the write fault is + * automatically resolved and write-protection is un-set. It only + * supports anon and shmem (hugetlb isn't supported). It only takes + * effect when a vma is registered with write-protection mode. Otherwise + * the flag is ignored. It depends on UFFD_FEATURE_WP_UNPOPULATED. */ #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0) #define UFFD_FEATURE_EVENT_FORK (1<<1) @@ -225,6 +233,7 @@ struct uffdio_api { #define UFFD_FEATURE_EXACT_ADDRESS (1<<11) #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM (1<<12) #define UFFD_FEATURE_WP_UNPOPULATED (1<<13) +#define UFFD_FEATURE_WP_ASYNC (1<<14) __u64 features; __u64 ioctls; diff --git a/mm/memory.c b/mm/memory.c index 8d135a814c60..341071c2c49a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3348,11 +3348,28 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE; struct vm_area_struct *vma = vmf->vma; struct folio *folio = NULL; + pte_t pte; if (likely(!unshare)) { if (userfaultfd_pte_wp(vma, *vmf->pte)) { - pte_unmap_unlock(vmf->pte, vmf->ptl); - return handle_userfault(vmf, VM_UFFD_WP); + if (!userfaultfd_wp_async(vma)) { + pte_unmap_unlock(vmf->pte, vmf->ptl); + return handle_userfault(vmf, VM_UFFD_WP); + } + + /* + * Nothing needed (cache flush, TLB invalidations, + * etc.) because we're only removing the uffd-wp bit, + * which is completely invisible to the user. + */ + pte = pte_clear_uffd_wp(*vmf->pte); + + set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); + /* + * Update this to be prepared for following up CoW + * handling + */ + vmf->orig_pte = pte; } /* @@ -4824,8 +4841,11 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) if (vma_is_anonymous(vmf->vma)) { if (likely(!unshare) && - userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) + userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) { + if (userfaultfd_wp_async(vmf->vma)) + goto split; return handle_userfault(vmf, VM_UFFD_WP); + } return do_huge_pmd_wp_page(vmf); } @@ -4837,6 +4857,7 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) } } +split: /* COW or write-notify handled on pte level: split pmd. */ __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
Add new WP Async mode (UFFD_FEATURE_WP_ASYNC) which resolves the page faults on its own. It can be used to track that which pages have been written-to from the time the pages were write-protected. It is very efficient way to track the changes as uffd is by nature pte/pmd based. UFFD synchronous WP sends the page faults to the userspace where the pages which have been written-to can be tracked. But it is not efficient. This is why this asynchronous version is being added. After setting the WP Async, the pages which have been written to can be found in the pagemap file or information can be obtained from the PAGEMAP_IOCTL. Suggested-by: Peter Xu <peterx@redhat.com> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> --- Changes in v11: - Fix return code in userfaultfd_register() and minor changes here and there - Rebase on top of next-20230307 - Base patches on UFFD_FEATURE_WP_UNPOPULATED https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com - UFFD_FEATURE_WP_ASYNC depends on UFFD_FEATURE_WP_UNPOPULATED to work (correctly) Changes in v10: - Build fix - Update comments and add error condition to return error from uffd register if hugetlb pages are present when wp async flag is set Changes in v9: - Correct the fault resolution with code contributed by Peter Changes in v7: - Remove UFFDIO_WRITEPROTECT_MODE_ASYNC_WP and add UFFD_FEATURE_WP_ASYNC - Handle automatic page fault resolution in better way (thanks to Peter) --- fs/userfaultfd.c | 25 +++++++++++++++++++++++-- include/linux/userfaultfd_k.h | 6 ++++++ include/uapi/linux/userfaultfd.h | 11 ++++++++++- mm/memory.c | 27 ++++++++++++++++++++++++--- 4 files changed, 63 insertions(+), 6 deletions(-)