Message ID | 20210923232512.210092-1-peterx@redhat.com (mailing list archive) |
---|---|
State | New |
Series | mm/userfaultfd: selftests: Fix memory corruption with thp enabled |
On Thu, 23 Sep 2021 19:25:12 -0400 Peter Xu <peterx@redhat.com> wrote:

> In RHEL's gating selftests we've encountered memory corruption in the uffd
> event test even with upstream kernel:
>
> ...
>
> We can mark the Fixes tag upon 0db282ba2c12 as it's reported to only happen
> there, however the real "Fixes" IMHO should be 8ba6e8640844, as before that
> commit we'll always do explicit release_pages() before registration of uffd,
> and 8ba6e8640844 changed that logic by adding extra unmap/map and we didn't
> release the pages at the right place.  Meanwhile I don't have a solid clue
> anyway on whether posix_memalign() could always avoid triggering this bug,
> hence it's safer to attach this fix to commit 8ba6e8640844.

Thanks.  I added a cc:stable to this.  I don't think we want selftests
in older kernels to be falsely reporting kernel bugs?
Hi Peter,
Tested on my two platforms, this patch works for me.
Tested-by: Li Wang <liwang@redhat.com>
On Thu, Sep 23, 2021 at 07:19:41PM -0700, Andrew Morton wrote:
> On Thu, 23 Sep 2021 19:25:12 -0400 Peter Xu <peterx@redhat.com> wrote:
>
> > In RHEL's gating selftests we've encountered memory corruption in the uffd
> > event test even with upstream kernel:
> >
> > ...
> >
> > We can mark the Fixes tag upon 0db282ba2c12 as it's reported to only happen
> > there, however the real "Fixes" IMHO should be 8ba6e8640844, as before that
> > commit we'll always do explicit release_pages() before registration of uffd,
> > and 8ba6e8640844 changed that logic by adding extra unmap/map and we didn't
> > release the pages at the right place.  Meanwhile I don't have a solid clue
> > anyway on whether posix_memalign() could always avoid triggering this bug,
> > hence it's safer to attach this fix to commit 8ba6e8640844.
>
> Thanks.  I added a cc:stable to this.  I don't think we want selftests
> in older kernels to be falsely reporting kernel bugs?

Not sure how we normally handle such a case for selftests, but I agree.

Btw, 8ba6e8640844 is merged in 5.14, so the only stable branch that will need it
will be 5.14.y; it can be applied cleanly there.

Thanks,
On Thu, Sep 23, 2021 at 4:25 PM Peter Xu <peterx@redhat.com> wrote:
>
> In RHEL's gating selftests we've encountered memory corruption in the uffd
> event test even with upstream kernel:
>
> [...]
>
> The problem is, for each test we allocate buffers using two allocate_area()
> calls.  We assumed these two buffers won't affect each other, however they
> could, because mmap() could have found that the two buffers are near each other
> and having the same VMA flags, so they got merged into one VMA.
>
> It won't be a big problem if thp is not enabled, but when thp is aggressively
> enabled it means when initializing the src buffer it could accidentally setup
> part of the dest buffer too when there's a shared THP that overlaps the two
> regions.  Then some of the dest buffer won't be able to be trapped by
> userfaultfd missing mode, then it'll cause memory corruption as described.
>
> To fix it, do release_pages() after initializing the src buffer.

But, if I understand correctly, release_pages() will just free the
physical pages, but not touch the VMA(s). So, with the right
max_ptes_none setting, why couldn't khugepaged just decide to
re-collapse (with zero pages) immediately after we release the pages,
causing the same problem? It seems to me this change just
significantly narrows the race window (which explains why we see less
of the issue), but doesn't fix it fundamentally.

> [...]
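To make the point about release_pages() concrete, here is a minimal sketch of what "free the physical pages, but not touch the VMA(s)" looks like for private anonymous memory. It assumes the anonymous-memory release path boils down to madvise(MADV_DONTNEED); the helper names map_anon() and drop_pages() are hypothetical and are not the selftest's own functions.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: map len bytes of private anonymous memory. */
static void *map_anon(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/*
 * Hypothetical helper: drop the pages backing [area, area + len) while
 * leaving the mapping (and therefore the VMA) in place.
 * Returns 0 on success, -1 on error with errno set.
 */
static int drop_pages(void *area, size_t len)
{
	return madvise(area, len, MADV_DONTNEED);
}
```

After drop_pages() the range is still mapped and the next access faults in a fresh zero page. Nothing in this sketch prevents the range from being populated again later, which is the window Axel is describing.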
On Fri, Sep 24, 2021 at 10:21:30AM -0700, Axel Rasmussen wrote:
> On Thu, Sep 23, 2021 at 4:25 PM Peter Xu <peterx@redhat.com> wrote:
> > [...]
> >
> > To fix it, do release_pages() after initializing the src buffer.
>
> But, if I understand correctly, release_pages() will just free the
> physical pages, but not touch the VMA(s). So, with the right
> max_ptes_none setting, why couldn't khugepaged just decide to
> re-collapse (with zero pages) immediately after we release the pages,
> causing the same problem? It seems to me this change just
> significantly narrows the race window (which explains why we see less
> of the issue), but doesn't fix it fundamentally.

Did you mean you can reproduce the issue even with this patch?

It is a good point anyway, indeed I don't see anything stops it from happening.

I wanted to prepare a v2 by releasing the pages after uffdio registration where
we'll do the vma split, but it won't simply work because release_pages() will
cause the process to hang death since that test registers with EVENT_REMOVE,
and release_pages() upon the thp will trigger synchronous EVENT_REMOVE which
cannot be handled by anyone.

Another solution is to map some PROT_NONE regions between the buffers, to make
sure they won't share a VMA.  I'll need to think more about which is better..
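The PROT_NONE idea mentioned above could be sketched roughly as follows. This is only one possible shape, not the approach taken by any posted patch: it reserves a single PROT_NONE range and then opens up two read/write windows with mprotect(), leaving one inaccessible page between them. The struct and function names are hypothetical, and len is assumed to be a multiple of the page size.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical pair of buffers separated by an inaccessible guard page. */
struct area_pair {
	char *src;
	char *dst;
};

/*
 * Reserve src and dst (len bytes each, len a multiple of the page size)
 * with one PROT_NONE page in between. Returns 0 on success, -1 on error.
 */
static int alloc_guarded_pair(size_t len, struct area_pair *out)
{
	size_t guard = (size_t)sysconf(_SC_PAGESIZE);
	size_t total = 2 * len + guard;
	char *base = mmap(NULL, total, PROT_NONE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (base == MAP_FAILED)
		return -1;

	/* Open the two windows; the guard page in the middle stays PROT_NONE. */
	if (mprotect(base, len, PROT_READ | PROT_WRITE) ||
	    mprotect(base + len + guard, len, PROT_READ | PROT_WRITE)) {
		munmap(base, total);
		return -1;
	}

	out->src = base;
	out->dst = base + len + guard;
	return 0;
}
```

Because the guard page has different protections from its neighbours, src and dst end up in separate VMAs, so a single huge page should not be able to cover parts of both.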
On Fri, Sep 24, 2021 at 12:59 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Sep 24, 2021 at 10:21:30AM -0700, Axel Rasmussen wrote:
> > [...]
> >
> > But, if I understand correctly, release_pages() will just free the
> > physical pages, but not touch the VMA(s). So, with the right
> > max_ptes_none setting, why couldn't khugepaged just decide to
> > re-collapse (with zero pages) immediately after we release the pages,
> > causing the same problem? It seems to me this change just
> > significantly narrows the race window (which explains why we see less
> > of the issue), but doesn't fix it fundamentally.
>
> Did you mean you can reproduce the issue even with this patch?

No, I haven't actually seen this happen after the patch. I suspect
with this patch the window for it to happen is small, so reproducing
may be hard. But from a theoretical standpoint, I don't see why it
couldn't happen.

> It is a good point anyway, indeed I don't see anything stops it from happening.
>
> I wanted to prepare a v2 by releasing the pages after uffdio registration where
> we'll do the vma split, but it won't simply work because release_pages() will
> cause the process to hang death since that test registers with EVENT_REMOVE,
> and release_pages() upon the thp will trigger synchronous EVENT_REMOVE which
> cannot be handled by anyone.
>
> Another solution is to map some PROT_NONE regions between the buffers, to make
> sure they won't share a VMA.  I'll need to think more about which is better..

One possibility would be to MADV_NOHUGEPAGE the regions, which at
least would fix the immediate flakiness. Then we could spend some time
adding a test case which specifically targets THP interactions? (I do
think we want test coverage of that in the end, but with the current
tests it's kind of "accidental".)
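For reference, the MADV_NOHUGEPAGE suggestion could look something like the sketch below. The helper name map_anon_nohuge() is hypothetical, and tolerating EINVAL is an assumption made here so that the sketch also runs on kernels built without transparent huge page support.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map len bytes of private anonymous memory and ask
 * the kernel not to back it with transparent huge pages.
 * Returns NULL on failure.
 */
static void *map_anon_nohuge(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;

	/* EINVAL is tolerated: the kernel may lack THP support entirely. */
	if (madvise(p, len, MADV_NOHUGEPAGE) != 0 && errno != EINVAL) {
		munmap(p, len);
		return NULL;
	}
	return p;
}
```

The trade-off is that this disables THP for every buffer allocated through the helper, so any test relying on it no longer exercises huge pages.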
On Fri, Sep 24, 2021 at 03:59:32PM -0400, Peter Xu wrote:
> [...]
>
> I wanted to prepare a v2 by releasing the pages after uffdio registration where
> we'll do the vma split, but it won't simply work because release_pages() will
> cause the process to hang death since that test registers with EVENT_REMOVE,
> and release_pages() upon the thp will trigger synchronous EVENT_REMOVE which
> cannot be handled by anyone.
>
> Another solution is to map some PROT_NONE regions between the buffers, to make
> sure they won't share a VMA.  I'll need to think more about which is better..

Axel, let me know if you can reproduce an issue with this patch, or otherwise
would you mind we keep this patch in -mm and fix the issue first?  I can never
reproduce any issue with current patch even if I agree you're probably right,
however before the patch is mostly 100% reproducible to fail.

It's just that after the weekend when I look back I still don't see a 100%
clean way to fix it yet.  Mapping 4K PROT_NONE before/after each allocation is
the most ideal but still looks tricky to me.

Would you have time on looking for a better solution, so as to (see it a way
to) complete what commit 8ba6e8640844 wants to do afterwards?

Thanks,
On Mon, Sep 27, 2021 at 10:34:06AM -0700, Axel Rasmussen wrote:
> One possibility would be to MADV_NOHUGEPAGE the regions, which at
> least would fix the immediate flakiness. Then we could spend some time
> adding a test case which specifically targets THP interactions? (I do
> think we want test coverage of that in the end, but with the current
> tests it's kind of "accidental".)

If we can't reproduce it with khugepaged yet, I'd think we can also consider
keep torturing thp with this patch and at the meantime look for a clean
approach?

Now it's only the event test failing, if we apply NOHUGEPAGE we give up thp for
all.
On Mon, Sep 27, 2021 at 10:36 AM Peter Xu <peterx@redhat.com> wrote:
>
> [...]
>
> Axel, let me know if you can reproduce an issue with this patch, or otherwise
> would you mind we keep this patch in -mm and fix the issue first?  I can never
> reproduce any issue with current patch even if I agree you're probably right,
> however before the patch is mostly 100% reproducible to fail.

Totally fair, if nothing else the patch at least makes it a lot better. :)

Keeping it in the mm tree or even merging it seems fine to me, we can
continue iterating later.

One small comment:

I'd prefer to keep the "uffd_test_ops->release_pages(area_src);"
above, to ensure the src region is empty. It's not immediately obvious
to me that we overwrite *all* of the bytes in src when we initialize
it. (I'd have to go look at the definition of area_count and read the
loop carefully.) It may not be technically needed, but it makes the
guarantee that we're starting with a clean slate, free from any
changes from previous test cases, very clear + explicit.

Moving the release_pages(area_dst) down as you've done seems correct to me.

Either way you can take:

Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>

> It's just that after the weekend when I look back I still don't see a 100%
> clean way to fix it yet.  Mapping 4K PROT_NONE before/after each allocation is
> the most ideal but still looks tricky to me.
>
> Would you have time on looking for a better solution, so as to (see it a way
> to) complete what commit 8ba6e8640844 wants to do afterwards?

Sure, it seems related to the other THP investigations we talked about
in the other thread, so I'm happy to look into it.

Just to set expectations, progress may be slightly slow as I'm
balancing other work my employer wants done, and some upcoming time
off. But, I think with your patch the test is at least stable (not
flaky) enough that there is no *urgent* need for this, so it should be
fine.
On Mon, Sep 27, 2021 at 10:49:39AM -0700, Axel Rasmussen wrote:
> One small comment:
>
> I'd prefer to keep the "uffd_test_ops->release_pages(area_src);"
> above, to ensure the src region is empty. It's not immediately obvious
> to me that we overwrite *all* of the bytes in src when we initialize
> it. (I'd have to go look at the definition of area_count and read the
> loop carefully.) It may not be technically needed, but it makes the
> guarantee that we're starting with a clean slate, free from any
> changes from previous test cases, very clear + explicit.

I think there're only two fields used, area_mutex and area_count.  I'm not sure
what's the initial idea from Andrea when the test case is merged, but IMHO it
can be written as a struct too instead of using the long macros; struct could
make it easier to understand.

And note again that we have your uffd_test_ctx_clear() called which contains
munmap() of all the buffers before the release_pages() calls.  It means at
least for anon and shmem the pages won't be there 100% sure to me.  hugetlbfs
is the only one that may still keep the pages as the fs should hold another
refcount on the inode, however as all the two fields got re-written anyway, so
I think it'll be still very safe to drop the two release_pages().

> Moving the release_pages(area_dst) down as you've done seems correct to me.
>
> Either way you can take:
>
> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
>
> [...]
>
> Sure, it seems related to the other THP investigations we talked about
> in the other thread, so I'm happy to look into it.
>
> Just to set expectations, progress may be slightly slow as I'm
> balancing other work my employer wants done, and some upcoming time
> off. But, I think with your patch the test is at least stable (not
> flaky) enough that there is no *urgent* need for this, so it should be
> fine.

Thanks a lot on both reviewing the patch and willing to look into it.

As long as we don't get any report for khugepaged (if it happens, I'll provide
the PROT_NONE hack instead - that'll work 100% I believe but less clean; but
for now IMHO we don't need to bother) then we don't need to rush on that.
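The remark about replacing the area_mutex/area_count macros with a struct could be sketched as below. This is purely illustrative: struct page_hdr and page_hdr_at() are hypothetical names, and the layout (a mutex followed by a counter at the start of each page) is an assumption, not a statement about how the selftest actually lays out its per-page data.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-page header; not the selftest's actual definition. */
struct page_hdr {
	pthread_mutex_t mutex;             /* assumed to sit at the page start */
	volatile unsigned long long count; /* assumed per-page counter */
};

/*
 * Return the header of page number nr inside area. The caller is
 * responsible for area being page aligned and nr being in range.
 */
static inline struct page_hdr *page_hdr_at(char *area, size_t nr,
					   size_t page_size)
{
	return (struct page_hdr *)(area + nr * page_size);
}
```

With a struct like this, an expression such as page_hdr_at(area_src, nr, page_size)->count names the field directly instead of spelling out pointer arithmetic in a macro.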
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 10ab56c2484a..60aa1a4fc69b 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -414,9 +414,6 @@ static void uffd_test_ctx_init_ext(uint64_t *features)
 	uffd_test_ops->allocate_area((void **)&area_src);
 	uffd_test_ops->allocate_area((void **)&area_dst);
 
-	uffd_test_ops->release_pages(area_src);
-	uffd_test_ops->release_pages(area_dst);
-
 	userfaultfd_open(features);
 
 	count_verify = malloc(nr_pages * sizeof(unsigned long long));
@@ -437,6 +434,26 @@ static void uffd_test_ctx_init_ext(uint64_t *features)
 		*(area_count(area_src, nr) + 1) = 1;
 	}
 
+	/*
+	 * After initialization of area_src, we must explicitly release pages
+	 * for area_dst to make sure it's fully empty.  Otherwise we could have
+	 * some area_dst pages be errornously initialized with zero pages,
+	 * hence we could hit memory corruption later in the test.
+	 *
+	 * One example is when THP is globally enabled, above allocate_area()
+	 * calls could have the two areas merged into a single VMA (as they
+	 * will have the same VMA flags so they're mergeable).  When we
+	 * initialize the area_src above, it's possible that some part of
+	 * area_dst could have been faulted in via one huge THP that will be
+	 * shared between area_src and area_dst.  It could cause some of the
+	 * area_dst won't be trapped by missing userfaults.
+	 *
+	 * This release_pages() will guarantee even if that happened, we'll
+	 * proactively split the thp and drop any accidentally initialized
+	 * pages within area_dst.
+	 */
+	uffd_test_ops->release_pages(area_dst);
+
 	pipefd = malloc(sizeof(int) * nr_cpus * 2);
 	if (!pipefd)
 		err("pipefd");
In RHEL's gating selftests we've encountered memory corruption in the uffd
event test even with upstream kernel:

        # ./userfaultfd anon 128 4
        nr_pages: 32768, nr_pages_per_cpu: 32768
        bounces: 3, mode: rnd racing read, userfaults: 6240 missing (6240) 14729 wp (14729)
        bounces: 2, mode: racing read, userfaults: 1444 missing (1444) 28877 wp (28877)
        bounces: 1, mode: rnd read, userfaults: 6055 missing (6055) 14699 wp (14699)
        bounces: 0, mode: read, userfaults: 82 missing (82) 25196 wp (25196)
        testing uffd-wp with pagemap (pgsize=4096): done
        testing uffd-wp with pagemap (pgsize=2097152): done
        testing events (fork, remap, remove): ERROR: nr 32427 memory corruption 0 1 (errno=0, line=963)
        ERROR: faulting process failed (errno=0, line=1117)

It can be easily reproduced when global thp is enabled, which is the default for
RHEL.

It's also known as a side effect of commit 0db282ba2c12 ("selftest: use mmap
instead of posix_memalign to allocate memory", 2021-07-23), which is imho right
itself on using mmap() to make sure the addresses will be untagged even on arm.

The problem is, for each test we allocate buffers using two allocate_area()
calls.  We assumed these two buffers won't affect each other, however they
could, because mmap() could have found that the two buffers are near each other
and having the same VMA flags, so they got merged into one VMA.

It won't be a big problem if thp is not enabled, but when thp is aggressively
enabled it means when initializing the src buffer it could accidentally setup
part of the dest buffer too when there's a shared THP that overlaps the two
regions.  Then some of the dest buffer won't be able to be trapped by
userfaultfd missing mode, then it'll cause memory corruption as described.

To fix it, do release_pages() after initializing the src buffer.

Since the previous two release_pages() calls are after uffd_test_ctx_clear()
which will unmap all the buffers anyway (which is stronger than release pages;
as unmap() also tears down pgtables), drop them as they shouldn't really be
anything useful.

We can mark the Fixes tag upon 0db282ba2c12 as it's reported to only happen
there, however the real "Fixes" IMHO should be 8ba6e8640844, as before that
commit we'll always do explicit release_pages() before registration of uffd,
and 8ba6e8640844 changed that logic by adding extra unmap/map and we didn't
release the pages at the right place.  Meanwhile I don't have a solid clue
anyway on whether posix_memalign() could always avoid triggering this bug,
hence it's safer to attach this fix to commit 8ba6e8640844.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1994931
Fixes: 8ba6e8640844 ("userfaultfd/selftests: reinitialize test context in each test")
Reported-by: Li Wang <liwan@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)
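The geometry behind the commit message (two buffers landing next to each other, and a huge page spanning the boundary between them) can be illustrated with two small address checks. This is a rough sketch with hypothetical helper names; adjacency is only a precondition for the kernel merging two mappings, and whether a huge page is actually installed across the boundary is up to the kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the two len-byte areas sit back to back in the address space. */
static int areas_adjacent(const void *a, const void *b, size_t len)
{
	uintptr_t lo = (uintptr_t)a;
	uintptr_t hi = (uintptr_t)b;

	if (lo > hi) {
		uintptr_t tmp = lo;

		lo = hi;
		hi = tmp;
	}
	return hi - lo == len;
}

/*
 * Returns 1 if a huge page of huge_size bytes could cover addresses on both
 * sides of boundary, i.e. the boundary is not huge-page aligned.
 */
static int boundary_inside_huge_page(uintptr_t boundary, size_t huge_size)
{
	return boundary % huge_size != 0;
}
```

When both checks are true for area_src and area_dst, touching the tail of one buffer may populate the head of the other, which is the situation the patch handles by releasing area_dst after area_src has been initialized.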