Message ID | 20210922175156.130228-1-peterx@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm/khugepaged: Detecting uffd-wp vma more efficiently | expand |
On 22.09.21 19:51, Peter Xu wrote: > We forbid merging thps for uffd-wp enabled regions, by breaking the khugepaged > scanning right after we detected a uffd-wp armed pte (either present, or swap). > > It works, but it's less efficient, because those ptes only exist for VM_UFFD_WP > enabled VMAs. Checking against the vma flag would be more efficient, and good > enough. To be explicit, we could still be able to merge some thps for > VM_UFFD_WP regions before this patch as long as they have zero uffd-wp armed > ptes, however that's not a major target for thp collapse anyways. > Hm, are we sure there are no users that could benefit from the current handling? I'm thinking about long-term uffd-wp users that effectively end up wp-ing on only a small fraction of a gigantic vma, or always wp complete blocks in a certain granularity in the range of THP. Databases come to mind ... In the past, I played with the idea of using uffd-wp to protect access to logically unplugged memory regions part of virtio-mem devices in QEMU -- which would exactly do something as described above. But I'll most probably be using ordinary uffd once any users that might read such logically unplugged memory have been "fixed". The change itself looks sane to me AFAIKT. > This mostly reverts commit e1e267c7928fe387e5e1cffeafb0de2d0473663a, but > instead we do the same check at vma level, so it's not a bugfix. > > This also paves the way for file-backed uffd-wp support, as the VM_UFFD_WP flag > will work for file-backed too. > > After this patch, the error for khugepaged for these regions will switch from > SCAN_PTE_UFFD_WP to SCAN_VMA_CHECK. > > Since uffd minor mode should not allow thp as well, do the same thing for minor > mode to stop early on trying to collapse pages in khugepaged. > > Cc: Andrea Arcangeli <aarcange@redhat.com> > Cc: Axel Rasmussen <axelrasmussen@google.com> > Cc: Hugh Dickins <hughd@google.com> > Cc: Nadav Amit <nadav.amit@gmail.com> > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > > Axel: as I asked in the other thread, please help check whether minor mode will > work properly with shmem thp enabled. If not, I feel like this patch could be > part of that effort at last, but it's also possible that I missed something. > > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > include/trace/events/huge_memory.h | 1 - > mm/khugepaged.c | 26 +++----------------------- > 2 files changed, 3 insertions(+), 24 deletions(-) > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h > index 4fdb14a81108..53532f5925c3 100644 > --- a/include/trace/events/huge_memory.h > +++ b/include/trace/events/huge_memory.h > @@ -15,7 +15,6 @@ > EM( SCAN_EXCEED_SWAP_PTE, "exceed_swap_pte") \ > EM( SCAN_EXCEED_SHARED_PTE, "exceed_shared_pte") \ > EM( SCAN_PTE_NON_PRESENT, "pte_non_present") \ > - EM( SCAN_PTE_UFFD_WP, "pte_uffd_wp") \ > EM( SCAN_PAGE_RO, "no_writable_page") \ > EM( SCAN_LACK_REFERENCED_PAGE, "lack_referenced_page") \ > EM( SCAN_PAGE_NULL, "page_null") \ > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 045cc579f724..3afe66d48db0 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -31,7 +31,6 @@ enum scan_result { > SCAN_EXCEED_SWAP_PTE, > SCAN_EXCEED_SHARED_PTE, > SCAN_PTE_NON_PRESENT, > - SCAN_PTE_UFFD_WP, > SCAN_PAGE_RO, > SCAN_LACK_REFERENCED_PAGE, > SCAN_PAGE_NULL, > @@ -467,6 +466,9 @@ static bool hugepage_vma_check(struct vm_area_struct *vma, > return false; > if (vma_is_temporary_stack(vma)) > return false; > + /* Don't allow thp merging for wp/minor enabled uffd regions */ > + if (userfaultfd_wp(vma) || userfaultfd_minor(vma)) > + return false; > return !(vm_flags & VM_NO_KHUGEPAGED); > } > > @@ -1246,15 +1248,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, > pte_t pteval = *_pte; > if (is_swap_pte(pteval)) { > if (++unmapped <= khugepaged_max_ptes_swap) { > - /* > - * Always be strict with uffd-wp > - * enabled swap entries. Please see > - * comment below for pte_uffd_wp(). > - */ > - if (pte_swp_uffd_wp(pteval)) { > - result = SCAN_PTE_UFFD_WP; > - goto out_unmap; > - } > continue; > } else { > result = SCAN_EXCEED_SWAP_PTE; > @@ -1270,19 +1263,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, > goto out_unmap; > } > } > - if (pte_uffd_wp(pteval)) { > - /* > - * Don't collapse the page if any of the small > - * PTEs are armed with uffd write protection. > - * Here we can also mark the new huge pmd as > - * write protected if any of the small ones is > - * marked but that could bring unknown > - * userfault messages that falls outside of > - * the registered range. So, just be simple. > - */ > - result = SCAN_PTE_UFFD_WP; > - goto out_unmap; > - } > if (pte_write(pteval)) > writable = true; > >
On Wed, Sep 22, 2021 at 08:21:40PM +0200, David Hildenbrand wrote: > On 22.09.21 19:51, Peter Xu wrote: > > We forbid merging thps for uffd-wp enabled regions, by breaking the khugepaged > > scanning right after we detected a uffd-wp armed pte (either present, or swap). > > > > It works, but it's less efficient, because those ptes only exist for VM_UFFD_WP > > enabled VMAs. Checking against the vma flag would be more efficient, and good > > enough. To be explicit, we could still be able to merge some thps for > > VM_UFFD_WP regions before this patch as long as they have zero uffd-wp armed > > ptes, however that's not a major target for thp collapse anyways. > > > > Hm, are we sure there are no users that could benefit from the current > handling? > > I'm thinking about long-term uffd-wp users that effectively end up wp-ing on > only a small fraction of a gigantic vma, or always wp complete blocks in a > certain granularity in the range of THP. Yes, that's a good question. > > Databases come to mind ... One thing to mention is that this patch didn't forbid thp being used within a uffd-wp-ed range. E.g., we still allow thp exist, we can uffd-wp a thp and it'll split only until when the thp is written. While what this patch does is it stops khugepaged from proactively merging those small pages into thps as long as VM_UFFD_WP|VM_UFFD_MINOR is set. It may still affect some user, but it's not a complete disable on thp. > > In the past, I played with the idea of using uffd-wp to protect access to > logically unplugged memory regions part of virtio-mem devices in QEMU -- > which would exactly do something as described above. But I'll most probably > be using ordinary uffd once any users that might read such logically > unplugged memory have been "fixed". Yes, even if you'd like to keep using uffd-wp that sounds like a very reasonable scenario. > > The change itself looks sane to me AFAIKT. So one major motivation of this patch of mine is to prepare for shmem, because the old commit obviously only covered anonymous. But after a 2nd thought, I just noticed shmem shouldn't have a problem with khugepaged merging at all! The thing is, when khugepaged is merging a shmem thp, unlike anonymous, it'll not merge the ptes into a pmd, but it'll simply zap the ptes. It means all uffd-wp tracking information won't be lost even if merging happened, those info will still be kept inside pgtables using (the upcoming) pte markers. When faulted, we'll just do small page mappings while it won't stop the shmem thp from being mapped hugely in other mm, afaict. With that in mind, indeed I see this patch less necessary to be merged; so for sparsely wr-protected vmas like virtio-mem we can still keep some of the ranges mergeable, that sounds a good thing to keep it as-is. NACK myself for now: let's not lose that good property of both thp+uffd-wp so easily, and I'll think more of it. (To Axel: my question to minor mode handling thp still stands, I think..) Thanks,
On Wed, Sep 22, 2021 at 11:58 AM Peter Xu <peterx@redhat.com> wrote: > > On Wed, Sep 22, 2021 at 08:21:40PM +0200, David Hildenbrand wrote: > > On 22.09.21 19:51, Peter Xu wrote: > > > We forbid merging thps for uffd-wp enabled regions, by breaking the khugepaged > > > scanning right after we detected a uffd-wp armed pte (either present, or swap). > > > > > > It works, but it's less efficient, because those ptes only exist for VM_UFFD_WP > > > enabled VMAs. Checking against the vma flag would be more efficient, and good > > > enough. To be explicit, we could still be able to merge some thps for > > > VM_UFFD_WP regions before this patch as long as they have zero uffd-wp armed > > > ptes, however that's not a major target for thp collapse anyways. > > > > > > > Hm, are we sure there are no users that could benefit from the current > > handling? > > > > I'm thinking about long-term uffd-wp users that effectively end up wp-ing on > > only a small fraction of a gigantic vma, or always wp complete blocks in a > > certain granularity in the range of THP. > > Yes, that's a good question. > > > > > Databases come to mind ... > > One thing to mention is that this patch didn't forbid thp being used within a > uffd-wp-ed range. E.g., we still allow thp exist, we can uffd-wp a thp and > it'll split only until when the thp is written. > > While what this patch does is it stops khugepaged from proactively merging > those small pages into thps as long as VM_UFFD_WP|VM_UFFD_MINOR is set. It may > still affect some user, but it's not a complete disable on thp. > > > > > In the past, I played with the idea of using uffd-wp to protect access to > > logically unplugged memory regions part of virtio-mem devices in QEMU -- > > which would exactly do something as described above. But I'll most probably > > be using ordinary uffd once any users that might read such logically > > unplugged memory have been "fixed". > > Yes, even if you'd like to keep using uffd-wp that sounds like a very > reasonable scenario. > > > > > The change itself looks sane to me AFAIKT. > > So one major motivation of this patch of mine is to prepare for shmem, because > the old commit obviously only covered anonymous. > > But after a 2nd thought, I just noticed shmem shouldn't have a problem with > khugepaged merging at all! > > The thing is, when khugepaged is merging a shmem thp, unlike anonymous, it'll > not merge the ptes into a pmd, but it'll simply zap the ptes. It means all > uffd-wp tracking information won't be lost even if merging happened, those info > will still be kept inside pgtables using (the upcoming) pte markers. khugepqged does remove the pgtables. Please check out retract_page_tables(). The pmd will be cleared and the ptes will be freed otherwise the collapsed THP won't get PMD mapped by later access. > > When faulted, we'll just do small page mappings while it won't stop the shmem > thp from being mapped hugely in other mm, afaict. > > With that in mind, indeed I see this patch less necessary to be merged; so for > sparsely wr-protected vmas like virtio-mem we can still keep some of the ranges > mergeable, that sounds a good thing to keep it as-is. > > NACK myself for now: let's not lose that good property of both thp+uffd-wp so > easily, and I'll think more of it. > > (To Axel: my question to minor mode handling thp still stands, I think..) > > Thanks, > > -- > Peter Xu > >
On Wed, Sep 22, 2021 at 01:51:56PM -0400, Peter Xu wrote: > Axel: as I asked in the other thread, please help check whether minor mode will > work properly with shmem thp enabled. If not, I feel like this patch could be > part of that effort at last, but it's also possible that I missed something. Hmm, this seems to be a false-alarm too. UFFDIO_CONTINUE is fine with thp because shmem_getpage() only returns small pages. Khugepaged is fine too on merging small shmem pages into thps, the same reason as uffd-wp: it zaps ptes only, so previous pte_none() ptes will keep the same, it makes sure all old pte_none ptes will not continue until a UFFDIO_CONTINUE. The only last problem is when khugepaged merged small ptes into a thp, then it could zap ptes even if they existed before. So for minor mode fault, the uffd service thread needs to be prepared for false positive messages. That shouldn't be a problem either, because file-backed memory pgtables are unstable and prone to lost, so uffd minor fault handler should always be prepared for false positives anyway.. In summary: please feel free to ignore the above note, and sorry for the noise.
On Wed, Sep 22, 2021 at 12:29:35PM -0700, Yang Shi wrote: > khugepqged does remove the pgtables. Please check out > retract_page_tables(). The pmd will be cleared and the ptes will be > freed otherwise the collapsed THP won't get PMD mapped by later > access. Indeed. I should probably still properly disable khugepaged for at least VM_SHARED && VM_UFFD_WP, then I'd keep the anonymous && minor mode behavior untouched. The other problem is even if current mm/vma doesn't have UFFD_WP registered, some other mm/vma could have UFFD_WP enabled there that mapped the same file. Checking that up within retract_page_tables() on all VMAs seems to be a bit too late. Checking it early may not trivially work too - I can walk the vma interval tree at the entry of khugepaged_scan_file(), making sure no vma has UFFD_WP set. However I don't see how it'll stop some of the vma from having UFFD_WP registered later after that point but before retract_page_tables(). I'll need to think about it, but thanks for the input, Yang. That's a very important point.
On Wed, Sep 22, 2021 at 04:04:44PM -0400, Peter Xu wrote: > On Wed, Sep 22, 2021 at 12:29:35PM -0700, Yang Shi wrote: > > khugepqged does remove the pgtables. Please check out > > retract_page_tables(). The pmd will be cleared and the ptes will be > > freed otherwise the collapsed THP won't get PMD mapped by later > > access. > > Indeed. > > I should probably still properly disable khugepaged for at least VM_SHARED && > VM_UFFD_WP, then I'd keep the anonymous && minor mode behavior untouched. > > The other problem is even if current mm/vma doesn't have UFFD_WP registered, > some other mm/vma could have UFFD_WP enabled there that mapped the same file. > Checking that up within retract_page_tables() on all VMAs seems to be a bit too > late. > > Checking it early may not trivially work too - I can walk the vma interval tree > at the entry of khugepaged_scan_file(), making sure no vma has UFFD_WP set. > However I don't see how it'll stop some of the vma from having UFFD_WP > registered later after that point but before retract_page_tables(). > > I'll need to think about it, but thanks for the input, Yang. That's a very > important point. Perhaps I need something like this: ---8<--- diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 045cc579f724..c63e957336d1 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1583,6 +1583,15 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) pmd = mm_find_pmd(mm, addr); if (!pmd) continue; + /* + * When a vma is registered with uffd-wp, we can't recycle the + * pmd pgtable because there can be pte markers installed. + * Skip it only, so the rest mm/vma can still have the same + * file mapped hugely, however it'll always mapped in small + * page size for uffd-wp registered ranges. + */ + if (userfaultfd_wp(vma)) + continue; /* * We need exclusive mmap_lock to retract page table. * ---8<--- I won't post a v2 because then that patch will be shmem-only and uffd-wp-only. I'll keep it with the upcoming series I'm going to post to support shmem. Thanks,
On Wed, Sep 22, 2021 at 10:52 AM Peter Xu <peterx@redhat.com> wrote: > > We forbid merging thps for uffd-wp enabled regions, by breaking the khugepaged > scanning right after we detected a uffd-wp armed pte (either present, or swap). > > It works, but it's less efficient, because those ptes only exist for VM_UFFD_WP > enabled VMAs. Checking against the vma flag would be more efficient, and good > enough. To be explicit, we could still be able to merge some thps for > VM_UFFD_WP regions before this patch as long as they have zero uffd-wp armed > ptes, however that's not a major target for thp collapse anyways. > > This mostly reverts commit e1e267c7928fe387e5e1cffeafb0de2d0473663a, but > instead we do the same check at vma level, so it's not a bugfix. > > This also paves the way for file-backed uffd-wp support, as the VM_UFFD_WP flag > will work for file-backed too. > > After this patch, the error for khugepaged for these regions will switch from > SCAN_PTE_UFFD_WP to SCAN_VMA_CHECK. > > Since uffd minor mode should not allow thp as well, do the same thing for minor > mode to stop early on trying to collapse pages in khugepaged. > > Cc: Andrea Arcangeli <aarcange@redhat.com> > Cc: Axel Rasmussen <axelrasmussen@google.com> > Cc: Hugh Dickins <hughd@google.com> > Cc: Nadav Amit <nadav.amit@gmail.com> > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > > Axel: as I asked in the other thread, please help check whether minor mode will > work properly with shmem thp enabled. If not, I feel like this patch could be > part of that effort at last, but it's also possible that I missed something. Sorry for missing the other thread. Unfortunately, I think shmem THP *doesn't* really work with minor faults, and what's worse, just checking the VMA flag isn't enough. First, let me note the guarantee UFFD minor faults are trying to provide: for a given mapping, any minor fault (that is, pte_none() but a page is present in the page cache) must result in a minor userfault event. Furthermore, the only way the fault may be resolved (i.e., a PTE installed) is via a UFFDIO_CONTINUE ioctl from userspace. A typical use case for minor faults is, we have two mappings (i.e., two VMAs), both pointing to the same underlying physical memory. It's typical for both to have MAP_SHARED. It's typical for one of these mappings to be fully faulted in (i.e., all of its PTEs exist), while the other one has some missing PTEs. The problem is, khugepaged might scan *either* of the two mappings. Say it picks the fully-faulted VMA: even if we set khugepaged_max_ptes_none to zero, it will still go ahead and collapse these pages - because *this* VMA has no missing PTEs. Why is this a problem? When we collapse, we install a PMD, for *all* VMAs which reference these pages. In other words, we might install PTEs for the other, minor-fault-registered mapping, and therefore userfaults will never trigger for some of those regions, even though userspace never UFFDIO_CONTINUE-ed them. I *think* the right place to check for this and solve it is in retract_page_tables(), and I have a patch which does this. I've been hesitant to send it though, as due to a lack of time and the complexity involved I haven't been able to write a clear reproducer program, which my patch clearly fixes. :/ > > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > include/trace/events/huge_memory.h | 1 - > mm/khugepaged.c | 26 +++----------------------- > 2 files changed, 3 insertions(+), 24 deletions(-) > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h > index 4fdb14a81108..53532f5925c3 100644 > --- a/include/trace/events/huge_memory.h > +++ b/include/trace/events/huge_memory.h > @@ -15,7 +15,6 @@ > EM( SCAN_EXCEED_SWAP_PTE, "exceed_swap_pte") \ > EM( SCAN_EXCEED_SHARED_PTE, "exceed_shared_pte") \ > EM( SCAN_PTE_NON_PRESENT, "pte_non_present") \ > - EM( SCAN_PTE_UFFD_WP, "pte_uffd_wp") \ > EM( SCAN_PAGE_RO, "no_writable_page") \ > EM( SCAN_LACK_REFERENCED_PAGE, "lack_referenced_page") \ > EM( SCAN_PAGE_NULL, "page_null") \ > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 045cc579f724..3afe66d48db0 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -31,7 +31,6 @@ enum scan_result { > SCAN_EXCEED_SWAP_PTE, > SCAN_EXCEED_SHARED_PTE, > SCAN_PTE_NON_PRESENT, > - SCAN_PTE_UFFD_WP, > SCAN_PAGE_RO, > SCAN_LACK_REFERENCED_PAGE, > SCAN_PAGE_NULL, > @@ -467,6 +466,9 @@ static bool hugepage_vma_check(struct vm_area_struct *vma, > return false; > if (vma_is_temporary_stack(vma)) > return false; > + /* Don't allow thp merging for wp/minor enabled uffd regions */ > + if (userfaultfd_wp(vma) || userfaultfd_minor(vma)) > + return false; > return !(vm_flags & VM_NO_KHUGEPAGED); > } > > @@ -1246,15 +1248,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, > pte_t pteval = *_pte; > if (is_swap_pte(pteval)) { > if (++unmapped <= khugepaged_max_ptes_swap) { > - /* > - * Always be strict with uffd-wp > - * enabled swap entries. Please see > - * comment below for pte_uffd_wp(). > - */ > - if (pte_swp_uffd_wp(pteval)) { > - result = SCAN_PTE_UFFD_WP; > - goto out_unmap; > - } > continue; > } else { > result = SCAN_EXCEED_SWAP_PTE; > @@ -1270,19 +1263,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, > goto out_unmap; > } > } > - if (pte_uffd_wp(pteval)) { > - /* > - * Don't collapse the page if any of the small > - * PTEs are armed with uffd write protection. > - * Here we can also mark the new huge pmd as > - * write protected if any of the small ones is > - * marked but that could bring unknown > - * userfault messages that falls outside of > - * the registered range. So, just be simple. > - */ > - result = SCAN_PTE_UFFD_WP; > - goto out_unmap; > - } > if (pte_write(pteval)) > writable = true; > > -- > 2.31.1 >
On Wed, Sep 22, 2021 at 01:49:42PM -0700, Axel Rasmussen wrote: > Sorry for missing the other thread. Heh, you didn't miss the other thread; I just posted both of the emails in a few hours. :) > > Unfortunately, I think shmem THP *doesn't* really work with minor > faults, and what's worse, just checking the VMA flag isn't enough. As I replied to myself (sorry to have done that), now I think minor mode is fine, but let's see what else I've missed, which is possible... Please see below. > > First, let me note the guarantee UFFD minor faults are trying to > provide: for a given mapping, any minor fault (that is, pte_none() but > a page is present in the page cache) must result in a minor userfault > event. Furthermore, the only way the fault may be resolved (i.e., a > PTE installed) is via a UFFDIO_CONTINUE ioctl from userspace. Yes. > > A typical use case for minor faults is, we have two mappings (i.e., > two VMAs), both pointing to the same underlying physical memory. It's > typical for both to have MAP_SHARED. It's typical for one of these > mappings to be fully faulted in (i.e., all of its PTEs exist), while > the other one has some missing PTEs. The problem is, khugepaged might > scan *either* of the two mappings. Say it picks the fully-faulted VMA: > even if we set khugepaged_max_ptes_none to zero, it will still go > ahead and collapse these pages - because *this* VMA has no missing > PTEs. Yes. > > Why is this a problem? When we collapse, we install a PMD, for *all* > VMAs which reference these pages. In other words, we might install > PTEs for the other, minor-fault-registered mapping, and therefore > userfaults will never trigger for some of those regions, even though > userspace never UFFDIO_CONTINUE-ed them. Nop - we don't install PMD for file-backed, do we? Please see khugepaged_scan_pmd() - that one installs PMDs indeed, but it's anonymous-only code. Then please also see khugepaged_scan_file() - that one handles file-backed (aka, shmem), and it does _not_ install pmd, afaict. The installation is lazy. Not installing pmd means uffd-minor can still trap any further faults just like before, afaiu. There's a very trivial detail that the pmd missing case will have a very slight code path change when the next page fault happens: in __handle_mm_fault() we'll first try to go into create_huge_pmd() once, however since shmem didn't provide huge_fault(), we'll go the VM_FAULT_FALLBACK path, and things will go like before when faulting on a small pte. The next UFFDIO_CONTINUE will allocate that missing pmd again, however it'll install a 4K page only. > > I *think* the right place to check for this and solve it is in > retract_page_tables(), and I have a patch which does this. I've been > hesitant to send it though, as due to a lack of time and the > complexity involved I haven't been able to write a clear reproducer > program, which my patch clearly fixes. :/ Yes retract_page_tables() could drop pmd pgtable for minor fault, but IMHO it's fine too as mentioned above. Minor mode should only care about trapping the page fault when the next access comes. retract_page_tables() will wipe the pmd pgtable page, that's not fine for uffd-wp, but IMHO that's still very fine for minor mode as it will keep trapping the old missing ptes; the difference is it'll just generate even more traps (rather than on the pte holes only, now it'll generate one message for each 4k over the merged 2M). As I mentioned in the other thread, I think that'll cause false positive minor fault messages, but IMHO that's fine, and minor fault userspace should always need to handle that. I fully agree with you that a reproducer would be very nice to try. So if my understanding is correct, the reproducer won't really fail on minor mode in any way, but it'll just need to be prepared to receive more messages than it should. Thanks,
On Wed, 22 Sep 2021, Peter Xu wrote: > > Not installing pmd means uffd-minor can still trap any further faults just like > before, afaiu. > > There's a very trivial detail that the pmd missing case will have a very slight > code path change when the next page fault happens: in __handle_mm_fault() we'll > first try to go into create_huge_pmd() once, however since shmem didn't provide > huge_fault(), we'll go the VM_FAULT_FALLBACK path, and things will go like > before when faulting on a small pte. The next UFFDIO_CONTINUE will allocate > that missing pmd again, however it'll install a 4K page only. I think you're mistaken there. I can't tell you much about ->huge_fault(), something introduced for DAX I believe; but shmem has managed pmd mappings without it, since before ->huge_fault() was ever added. Look for the call to do_set_pmd() in finish_fault(): I think you'll find that is the way shmem's huge pmds get in. Earlier in the thread you suggested "shmem_getpage() only returns small pages": but it can very well return PageTransCompound pages, head or tail, which arrive at this do_set_pmd(). Hugh
On Wed, Sep 22, 2021 at 04:18:09PM -0700, Hugh Dickins wrote: > On Wed, 22 Sep 2021, Peter Xu wrote: > > > > Not installing pmd means uffd-minor can still trap any further faults just like > > before, afaiu. > > > > There's a very trivial detail that the pmd missing case will have a very slight > > code path change when the next page fault happens: in __handle_mm_fault() we'll > > first try to go into create_huge_pmd() once, however since shmem didn't provide > > huge_fault(), we'll go the VM_FAULT_FALLBACK path, and things will go like > > before when faulting on a small pte. The next UFFDIO_CONTINUE will allocate > > that missing pmd again, however it'll install a 4K page only. > > I think you're mistaken there. > > I can't tell you much about ->huge_fault(), something introduced for > DAX I believe; but shmem has managed pmd mappings without it, since > before ->huge_fault() was ever added. Right, I wanted to express we didn't go into there, hence no way to allocate pmd there. > > Look for the call to do_set_pmd() in finish_fault(): I think you'll > find that is the way shmem's huge pmds get in. > > Earlier in the thread you suggested "shmem_getpage() only returns > small pages": but it can very well return PageTransCompound pages, > head or tail, which arrive at this do_set_pmd(). But note that uffd-minor will trap the shmem fault() even if pmd_none: page = pagecache_get_page(mapping, index, FGP_ENTRY | FGP_HEAD | FGP_LOCK, 0); if (page && vma && userfaultfd_minor(vma)) { if (!xa_is_value(page)) { unlock_page(page); put_page(page); } *fault_type = handle_userfault(vmf, VM_UFFD_MINOR); return 0; } That's why I think it'll be fine, because it should only be UFFDIO_CONTINUE that installs the pte (alongside with allocating the pmd). Or did I miss something? Thanks,
On Wed, 22 Sep 2021, Peter Xu wrote: > On Wed, Sep 22, 2021 at 04:18:09PM -0700, Hugh Dickins wrote: > > On Wed, 22 Sep 2021, Peter Xu wrote: > > > > > > Not installing pmd means uffd-minor can still trap any further faults just like > > > before, afaiu. > > > > > > There's a very trivial detail that the pmd missing case will have a very slight > > > code path change when the next page fault happens: in __handle_mm_fault() we'll > > > first try to go into create_huge_pmd() once, however since shmem didn't provide > > > huge_fault(), we'll go the VM_FAULT_FALLBACK path, and things will go like > > > before when faulting on a small pte. The next UFFDIO_CONTINUE will allocate > > > that missing pmd again, however it'll install a 4K page only. > > > > I think you're mistaken there. > > > > I can't tell you much about ->huge_fault(), something introduced for > > DAX I believe; but shmem has managed pmd mappings without it, since > > before ->huge_fault() was ever added. > > Right, I wanted to express we didn't go into there, hence no way to allocate > pmd there. > > > > > Look for the call to do_set_pmd() in finish_fault(): I think you'll > > find that is the way shmem's huge pmds get in. > > > > Earlier in the thread you suggested "shmem_getpage() only returns > > small pages": but it can very well return PageTransCompound pages, > > head or tail, which arrive at this do_set_pmd(). > > But note that uffd-minor will trap the shmem fault() even if pmd_none: > > page = pagecache_get_page(mapping, index, > FGP_ENTRY | FGP_HEAD | FGP_LOCK, 0); > > if (page && vma && userfaultfd_minor(vma)) { > if (!xa_is_value(page)) { > unlock_page(page); > put_page(page); > } > *fault_type = handle_userfault(vmf, VM_UFFD_MINOR); > return 0; > } > > That's why I think it'll be fine, because it should only be UFFDIO_CONTINUE > that installs the pte (alongside with allocating the pmd). > > Or did I miss something? No, I think I misunderstood you before: thanks for re-explaining. (And Axel's !userfaultfd_minor() check before calling do_fault_around() plays an important part in making sure that it does reach shmem_fault().) Hugh
On Wed, Sep 22, 2021 at 06:22:45PM -0700, Hugh Dickins wrote: > No, I think I misunderstood you before: thanks for re-explaining. > (And Axel's !userfaultfd_minor() check before calling do_fault_around() > plays an important part in making sure that it does reach shmem_fault().) Still thanks for confirming this, Hugh. Said that, Axel, I didn't mean I'm against doing something similar like uffd-wp; it's just a heads-up that maybe you won't find a reproducer with real issues with minor mode. Even if I think minor mode should be fine with current code, we could still choose to disable khugepaged from removing the pmd for VM_UFFD_MINOR vmas, just like what we'll do with VM_UFFD_WP. At least it can still reduce false positives. So far in my local branch I queued the patch which I attached, that's required for uffd-wp shmem afaict. If you think minor mode would like that too, I can post it separately with minor mode added in. Note that it's slightly different from what I pasted in reply to Yang Shi - I made it slightly more complicated just to make sure there's no race. I mentioned the possible race (I think) in the commit log. Let me know your preference. Thanks,
On Wed, Sep 22, 2021 at 7:18 PM Peter Xu <peterx@redhat.com> wrote: > > On Wed, Sep 22, 2021 at 06:22:45PM -0700, Hugh Dickins wrote: > > No, I think I misunderstood you before: thanks for re-explaining. > > (And Axel's !userfaultfd_minor() check before calling do_fault_around() > > plays an important part in making sure that it does reach shmem_fault().) > > Still thanks for confirming this, Hugh. > > Said that, Axel, I didn't mean I'm against doing something similar like > uffd-wp; it's just a heads-up that maybe you won't find a reproducer with real > issues with minor mode. > > Even if I think minor mode should be fine with current code, we could still > choose to disable khugepaged from removing the pmd for VM_UFFD_MINOR vmas, just > like what we'll do with VM_UFFD_WP. At least it can still reduce false > positives. > > So far in my local branch I queued the patch which I attached, that's required > for uffd-wp shmem afaict. If you think minor mode would like that too, I can > post it separately with minor mode added in. No worries, you can leave the minor fault case to me. My thinking there was a THP collapse bug was really just based on speculation, not a real reproducer, so it's very possible my speculation was wrong. It will take some more thinking and reading to convince myself one way or the other. :) Thanks to you and Hugh for all the details. I'd prefer not to add this fix "just in case", if it isn't a real problem, as it seems like it may confuse future readers of the code. I'll send out a patch for it if / when I manage to build a real reproducer. Or, in the meantime, some of my Google colleagues are testing this code via their live migration implementation, so if there is a bug here there's a good chance we'll find it that way too. > > Note that it's slightly different from what I pasted in reply to Yang Shi - I > made it slightly more complicated just to make sure there's no race. I > mentioned the possible race (I think) in the commit log. > > Let me know your preference. > > Thanks, > > -- > Peter Xu
On Thu, Sep 23, 2021 at 09:47:42AM -0700, Axel Rasmussen wrote: > My thinking there was a THP collapse bug was really just based on > speculation, not a real reproducer, so it's very possible my > speculation was wrong. It will take some more thinking and reading to > convince myself one way or the other. :) Thanks to you and Hugh for > all the details. > > I'd prefer not to add this fix "just in case", if it isn't a real > problem, as it seems like it may confuse future readers of the code. It's not "just in case" to me - IMHO it's theoretically causing more false positives as I used to mention, at least that's my understanding so far. So if the theory is correct it'll 100% happen when khugepaged merged some minor-registered regions. Uffd-wp could have many false positives like this if we don't support swap - at last we decided to fully support swap then we removed all the false positives regarding swapping. I think it's similar here, but khugepaged should trigger much less frequently on the false positives upon uffd-minor, than swapping upon uffd-wp. But yes, there's definitely no rush on thinking or anything - it'll never hurt to think more. And more importantly, verify it with some test program would be great; after all theoretically it'll just work like a charm to me. > > I'll send out a patch for it if / when I manage to build a real > reproducer. Or, in the meantime, some of my Google colleagues are > testing this code via their live migration implementation, so if there > is a bug here there's a good chance we'll find it that way too. Sounds like a good plan. Thanks,
On 22.09.21 20:58, Peter Xu wrote: > On Wed, Sep 22, 2021 at 08:21:40PM +0200, David Hildenbrand wrote: >> On 22.09.21 19:51, Peter Xu wrote: >>> We forbid merging thps for uffd-wp enabled regions, by breaking the khugepaged >>> scanning right after we detected a uffd-wp armed pte (either present, or swap). >>> >>> It works, but it's less efficient, because those ptes only exist for VM_UFFD_WP >>> enabled VMAs. Checking against the vma flag would be more efficient, and good >>> enough. To be explicit, we could still be able to merge some thps for >>> VM_UFFD_WP regions before this patch as long as they have zero uffd-wp armed >>> ptes, however that's not a major target for thp collapse anyways. >>> >> >> Hm, are we sure there are no users that could benefit from the current >> handling? >> >> I'm thinking about long-term uffd-wp users that effectively end up wp-ing on >> only a small fraction of a gigantic vma, or always wp complete blocks in a >> certain granularity in the range of THP. > > Yes, that's a good question. > >> >> Databases come to mind ... > > One thing to mention is that this patch didn't forbid thp being used within a > uffd-wp-ed range. E.g., we still allow thp exist, we can uffd-wp a thp and > it'll split only until when the thp is written. > > While what this patch does is it stops khugepaged from proactively merging > those small pages into thps as long as VM_UFFD_WP|VM_UFFD_MINOR is set. It may > still affect some user, but it's not a complete disable on thp. > >> >> In the past, I played with the idea of using uffd-wp to protect access to >> logically unplugged memory regions part of virtio-mem devices in QEMU -- >> which would exactly do something as described above. But I'll most probably >> be using ordinary uffd once any users that might read such logically >> unplugged memory have been "fixed". > > Yes, even if you'd like to keep using uffd-wp that sounds like a very > reasonable scenario. > >> >> The change itself looks sane to me AFAIKT. > > So one major motivation of this patch of mine is to prepare for shmem, because > the old commit obviously only covered anonymous. > > But after a 2nd thought, I just noticed shmem shouldn't have a problem with > khugepaged merging at all! > > The thing is, when khugepaged is merging a shmem thp, unlike anonymous, it'll > not merge the ptes into a pmd, but it'll simply zap the ptes. It means all > uffd-wp tracking information won't be lost even if merging happened, those info > will still be kept inside pgtables using (the upcoming) pte markers. > > When faulted, we'll just do small page mappings while it won't stop the shmem > thp from being mapped hugely in other mm, afaict. > > With that in mind, indeed I see this patch less necessary to be merged; so for > sparsely wr-protected vmas like virtio-mem we can still keep some of the ranges > mergeable, that sounds a good thing to keep it as-is. > > NACK myself for now: let's not lose that good property of both thp+uffd-wp so > easily, and I'll think more of it. > Thanks for the insights, shmem THP is still mostly unexplored territory on my end :)
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h index 4fdb14a81108..53532f5925c3 100644 --- a/include/trace/events/huge_memory.h +++ b/include/trace/events/huge_memory.h @@ -15,7 +15,6 @@ EM( SCAN_EXCEED_SWAP_PTE, "exceed_swap_pte") \ EM( SCAN_EXCEED_SHARED_PTE, "exceed_shared_pte") \ EM( SCAN_PTE_NON_PRESENT, "pte_non_present") \ - EM( SCAN_PTE_UFFD_WP, "pte_uffd_wp") \ EM( SCAN_PAGE_RO, "no_writable_page") \ EM( SCAN_LACK_REFERENCED_PAGE, "lack_referenced_page") \ EM( SCAN_PAGE_NULL, "page_null") \ diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 045cc579f724..3afe66d48db0 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -31,7 +31,6 @@ enum scan_result { SCAN_EXCEED_SWAP_PTE, SCAN_EXCEED_SHARED_PTE, SCAN_PTE_NON_PRESENT, - SCAN_PTE_UFFD_WP, SCAN_PAGE_RO, SCAN_LACK_REFERENCED_PAGE, SCAN_PAGE_NULL, @@ -467,6 +466,9 @@ static bool hugepage_vma_check(struct vm_area_struct *vma, return false; if (vma_is_temporary_stack(vma)) return false; + /* Don't allow thp merging for wp/minor enabled uffd regions */ + if (userfaultfd_wp(vma) || userfaultfd_minor(vma)) + return false; return !(vm_flags & VM_NO_KHUGEPAGED); } @@ -1246,15 +1248,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, pte_t pteval = *_pte; if (is_swap_pte(pteval)) { if (++unmapped <= khugepaged_max_ptes_swap) { - /* - * Always be strict with uffd-wp - * enabled swap entries. Please see - * comment below for pte_uffd_wp(). - */ - if (pte_swp_uffd_wp(pteval)) { - result = SCAN_PTE_UFFD_WP; - goto out_unmap; - } continue; } else { result = SCAN_EXCEED_SWAP_PTE; @@ -1270,19 +1263,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, goto out_unmap; } } - if (pte_uffd_wp(pteval)) { - /* - * Don't collapse the page if any of the small - * PTEs are armed with uffd write protection. - * Here we can also mark the new huge pmd as - * write protected if any of the small ones is - * marked but that could bring unknown - * userfault messages that falls outside of - * the registered range. So, just be simple. - */ - result = SCAN_PTE_UFFD_WP; - goto out_unmap; - } if (pte_write(pteval)) writable = true;