diff mbox series

[1/3] userfaultfd/selftests: fix feature support detection

Message ID 20210921163323.944352-1-axelrasmussen@google.com (mailing list archive)
State New
Headers show
Series [1/3] userfaultfd/selftests: fix feature support detection | expand

Commit Message

Axel Rasmussen Sept. 21, 2021, 4:33 p.m. UTC
Currently, the test decides whether or not to test certain features
(e.g., writeprotect support) essentially by examining command-line
arguments. For example, if we're testing anonymous memory, then we
should test writeprotect support as well (since it generally is
supported for anonymous).

This is broken, however. Take writeprotect support as an example: sure
it's supported for anon, but it also requires that we have
CONFIG_HAVE_ARCH_USERFAULTFD_WP. I.e., it is not supported at all on
aarch64. So, running the test on such an arch fails: it tries to test
writeprotect for anon, but since it isn't *actually* supported, it
fails.

So, instead of checking command-line arguments to the test, check the
features the way the UFFD API intends: when we open a new userfaultfd,
pass in the feature(s) this test case would like to try to exercise. The
kernel reports back a subset of those features which are actually
supported: check these returned flags to see if the features are
*actually* supported.

(For a couple of cases, where *registration* would fail [with -EINVAL]
even though UFFDIO_API reports the feature as supported, we have to
check test_type as well as the feature flag.)

In some cases, we check immediately after opening the userfaultfd, and
if the features are missing, we skip the entire test. In some other
cases, we can proceed with "most" of the test, only skipping a few
pieces.

This lets us remove the global test_uffdio_wp and test_uffdio_minor
variables entirely.

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 94 +++++++++++-------------
 1 file changed, 43 insertions(+), 51 deletions(-)

Comments

Peter Xu Sept. 21, 2021, 5:44 p.m. UTC | #1
Hi, Axel,

On Tue, Sep 21, 2021 at 09:33:21AM -0700, Axel Rasmussen wrote:
> diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
> index 10ab56c2484a..2366caf90435 100644
> --- a/tools/testing/selftests/vm/userfaultfd.c
> +++ b/tools/testing/selftests/vm/userfaultfd.c
> @@ -79,10 +79,6 @@ static int test_type;
>  #define ALARM_INTERVAL_SECS 10
>  static volatile bool test_uffdio_copy_eexist = true;
>  static volatile bool test_uffdio_zeropage_eexist = true;
> -/* Whether to test uffd write-protection */
> -static bool test_uffdio_wp = false;
> -/* Whether to test uffd minor faults */
> -static bool test_uffdio_minor = false;

IMHO it's not a fault to have these variables; they're still the fastest way to
do branching.  It's just that in some cases we should set them to "false"
rather than "true", am I right?

How about we just set them properly in set_test_type?  Say, we can fetch the
feature bits in set_test_type rather than assuming it's only related to the
type of memory.

Thanks,
Axel Rasmussen Sept. 21, 2021, 6:26 p.m. UTC | #2
On Tue, Sep 21, 2021 at 10:44 AM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Axel,
>
> On Tue, Sep 21, 2021 at 09:33:21AM -0700, Axel Rasmussen wrote:
> > diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
> > index 10ab56c2484a..2366caf90435 100644
> > --- a/tools/testing/selftests/vm/userfaultfd.c
> > +++ b/tools/testing/selftests/vm/userfaultfd.c
> > @@ -79,10 +79,6 @@ static int test_type;
> >  #define ALARM_INTERVAL_SECS 10
> >  static volatile bool test_uffdio_copy_eexist = true;
> >  static volatile bool test_uffdio_zeropage_eexist = true;
> > -/* Whether to test uffd write-protection */
> > -static bool test_uffdio_wp = false;
> > -/* Whether to test uffd minor faults */
> > -static bool test_uffdio_minor = false;
>
> IMHO it's not a fault to have these variables; they're still the fastest way to
> do branching.  It's just that in some cases we should set them to "false"
> rather than "true", am I right?
>
> How about we just set them properly in set_test_type?  Say, we can fetch the
> feature bits in set_test_type rather than assuming it's only related to the
> type of memory.

We could do that, but it would require opening a userfaultfd, issuing
a UFFDIO_API ioctl, and getting the feature bits in set_test_type. And
then I guess just closing the UFFD again, as we aren't yet setting up
for any particular test. To me, it seemed "messier" than this
approach.

Another thing to consider is, for the next patch we don't just want to
know "does this kernel support $FEATURE in general?" but also "is
$FEATURE supported for this particular memory region I've
registered?", and we can't have a single global answer to that. It
seemed a bit cleaner to me to write the code as if I was dealing with
that case, and then re-use the infrastructure I'd built for patch 2/3.

Basically, I didn't initially have a goal of getting rid of these
variables, but it ended up being the cleanest way (IMHO).

Just trying to explain the thinking. :) In the end, I think it's a
stylistic choice and don't feel super strongly about it, either way
could work. So, I can change it if you or others do feel strongly.

>
> Thanks,
>
> --
> Peter Xu
>
Peter Xu Sept. 21, 2021, 7:21 p.m. UTC | #3
On Tue, Sep 21, 2021 at 11:26:14AM -0700, Axel Rasmussen wrote:
> On Tue, Sep 21, 2021 at 10:44 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, Axel,
> >
> > On Tue, Sep 21, 2021 at 09:33:21AM -0700, Axel Rasmussen wrote:
> > > diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
> > > index 10ab56c2484a..2366caf90435 100644
> > > --- a/tools/testing/selftests/vm/userfaultfd.c
> > > +++ b/tools/testing/selftests/vm/userfaultfd.c
> > > @@ -79,10 +79,6 @@ static int test_type;
> > >  #define ALARM_INTERVAL_SECS 10
> > >  static volatile bool test_uffdio_copy_eexist = true;
> > >  static volatile bool test_uffdio_zeropage_eexist = true;
> > > -/* Whether to test uffd write-protection */
> > > -static bool test_uffdio_wp = false;
> > > -/* Whether to test uffd minor faults */
> > > -static bool test_uffdio_minor = false;
> >
> > IMHO it's not a fault to have these variables; they're still the fastest way to
> > do branching.  It's just that in some cases we should set them to "false"
> > rather than "true", am I right?
> >
> > How about we just set them properly in set_test_type?  Say, we can fetch the
> > feature bits in set_test_type rather than assuming it's only related to the
> > type of memory.
> 
> We could do that, but it would require opening a userfaultfd, issuing
> a UFFDIO_API ioctl, and getting the feature bits in set_test_type. And
> then I guess just closing the UFFD again, as we aren't yet setting up
> for any particular test. To me, it seemed "messier" than this
> approach.
> 
> Another thing to consider is, for the next patch we don't just want to
> know "does this kernel support $FEATURE in general?" but also "is
> $FEATURE supported for this particular memory region I've
> registered?", and we can't have a single global answer to that.

Could I ask why?  For each run, the memory type doesn't change, isn't it?  Then
I think the capability it should support is a constant?

Btw, note that "open an uffd, detect features, close uffd quickly" during setup
phase is totally fine to me just for probing the capabilities, and instead of
thinking it being messy I see it a very clean approach..

> It seemed a bit cleaner to me to write the code as if I was dealing with that
> case, and then re-use the infrastructure I'd built for patch 2/3.

I didn't comment on patch 2, but I had the same confusion - aren't all these
information constant after we settle the hardware, the kernel and the memory
type to test?

> 
> Basically, I didn't initially have a goal of getting rid of these
> variables, but it ended up being the cleanest way (IMHO).
> 
> Just trying to explain the thinking. :) In the end, I think it's a
> stylistic choice and don't feel super strongly about it, either way
> could work. So, I can change it if you or others do feel strongly.

I have no strong opinion as long as the code works (which I trust you on :).
We can keep it in Andrew's queue unless you do feel the other way is better.

Thanks,
Axel Rasmussen Sept. 21, 2021, 8:31 p.m. UTC | #4
On Tue, Sep 21, 2021 at 12:21 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Sep 21, 2021 at 11:26:14AM -0700, Axel Rasmussen wrote:
> > On Tue, Sep 21, 2021 at 10:44 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Hi, Axel,
> > >
> > > On Tue, Sep 21, 2021 at 09:33:21AM -0700, Axel Rasmussen wrote:
> > > > diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
> > > > index 10ab56c2484a..2366caf90435 100644
> > > > --- a/tools/testing/selftests/vm/userfaultfd.c
> > > > +++ b/tools/testing/selftests/vm/userfaultfd.c
> > > > @@ -79,10 +79,6 @@ static int test_type;
> > > >  #define ALARM_INTERVAL_SECS 10
> > > >  static volatile bool test_uffdio_copy_eexist = true;
> > > >  static volatile bool test_uffdio_zeropage_eexist = true;
> > > > -/* Whether to test uffd write-protection */
> > > > -static bool test_uffdio_wp = false;
> > > > -/* Whether to test uffd minor faults */
> > > > -static bool test_uffdio_minor = false;
> > >
> > > IMHO it's not a fault to have these variables; they're still the fastest way to
> > > do branching.  It's just that in some cases we should set them to "false"
> > > rather than "true", am I right?
> > >
> > > How about we just set them properly in set_test_type?  Say, we can fetch the
> > > feature bits in set_test_type rather than assuming it's only related to the
> > > type of memory.
> >
> > We could do that, but it would require opening a userfaultfd, issuing
> > a UFFDIO_API ioctl, and getting the feature bits in set_test_type. And
> > then I guess just closing the UFFD again, as we aren't yet setting up
> > for any particular test. To me, it seemed "messier" than this
> > approach.
> >
> > Another thing to consider is, for the next patch we don't just want to
> > know "does this kernel support $FEATURE in general?" but also "is
> > $FEATURE supported for this particular memory region I've
> > registered?", and we can't have a single global answer to that.
>
> Could I ask why?  For each run, the memory type doesn't change, isn't it?  Then
> I think the capability it should support is a constant?

Ah, it has to do with us asserting the list of expected ioctls. The
kernel changes the list of ioctls it reports in response to a
UFFDIO_REGISTER, depending on the particular kind of vma being
registered, **as well as what mode(s) it is being registered with**.

So for example, consider the hugetlb_shared test. When registering,
the kernel might set the UFFDIO_CONTINUE bit or not, depending on
whether we registered with the MINOR mode bit set in particular. So it
will be present in one test case, but not in another, and so the set
of expected ioctls has to be computed at test time, rather than in
set_test_type.

>
> Btw, note that "open an uffd, detect features, close uffd quickly" during setup
> phase is totally fine to me just for probing the capabilities, and instead of
> thinking it being messy I see it a very clean approach..
>
> > It seemed a bit cleaner to me to write the code as if I was dealing with that
> > case, and then re-use the infrastructure I'd built for patch 2/3.
>
> I didn't comment on patch 2, but I had the same confusion - aren't all these
> information constant after we settle the hardware, the kernel and the memory
> type to test?
>
> >
> > Basically, I didn't initially have a goal of getting rid of these
> > variables, but it ended up being the cleanest way (IMHO).
> >
> > Just trying to explain the thinking. :) In the end, I think it's a
> > stylistic choice and don't feel super strongly about it, either way
> > could work. So, I can change it if you or others do feel strongly.
>
> I have no strong opinion as long as the code works (which I trust you on :).
> We can keep it in Andrew's queue unless you do feel the other way is better.
>
> Thanks,
>
> --
> Peter Xu
>
Peter Xu Sept. 22, 2021, 12:29 a.m. UTC | #5
On Tue, Sep 21, 2021 at 01:31:12PM -0700, Axel Rasmussen wrote:
> Ah, it has to do with us asserting the list of expected ioctls. The
> kernel changes the list of ioctls it reports in response to a
> UFFDIO_REGISTER, depending on the particular kind of vma being
> registered, **as well as what mode(s) it is being registered with**.
> 
> So for example, consider the hugetlb_shared test. When registering,
> the kernel might set the UFFDIO_CONTINUE bit or not, depending on
> whether we registered with the MINOR mode bit set in particular.

I can understand your point, but the "capability set" of the kernel is still
the same.  In this case we should have UFFDIO_CONTINUE capability for
hugetlb_shared test globally, as long as the kernel supports it, irrelevant of
what test case we're going to have.

Then in the test, if we don't register with MINOR mode, IMHO we should just
mask out the expected_ioctls with UFFDIO_CONTINUE because it does not make
sense to request UFFDIO_CONTINUE if we will never use it in the test.

In other words, having a "uffd_features" global variable and having it changing
all the time during tests is odd to me, but I agree it's not a big deal. :)
Axel Rasmussen Sept. 22, 2021, 5:04 p.m. UTC | #6
Thanks for discussing the design Peter. I have some ideas which might
make for a nicer v2; I'll massage the code a bit and see what I can
come up with.

On Tue, Sep 21, 2021 at 5:29 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Sep 21, 2021 at 01:31:12PM -0700, Axel Rasmussen wrote:
> > Ah, it has to do with us asserting the list of expected ioctls. The
> > kernel changes the list of ioctls it reports in response to a
> > UFFDIO_REGISTER, depending on the particular kind of vma being
> > registered, **as well as what mode(s) it is being registered with**.
> >
> > So for example, consider the hugetlb_shared test. When registering,
> > the kernel might set the UFFDIO_CONTINUE bit or not, depending on
> > whether we registered with the MINOR mode bit set in particular.
>
> I can understand your point, but the "capability set" of the kernel is still
> the same.  In this case we should have UFFDIO_CONTINUE capability for
> hugetlb_shared test globally, as long as the kernel supports it, irrelevant of
> what test case we're going to have.
>
> Then in the test, if we don't register with MINOR mode, IMHO we should just
> mask out the expected_ioctls with UFFDIO_CONTINUE because it does not make
> sense to request UFFDIO_CONTINUE if we will never use it in the test.

Right, this is how it was before. I didn't love how the base set
included everything, and then each test is responsible for removing
the things it isn't testing. It seems reversed: why not just have each
test compute the set of things it *is* testing?

>
> In other words, having a "uffd_features" global variable and having it changing
> all the time during tests is odd to me, but I agree it's not a big deal. :)

100% agree with this. From my perspective this is tech debt since:

8ba6e86408 userfaultfd/selftests: reinitialize test context in each test

It used to be that we just had one global context (variables like
uffd, count_verify, area_src, area_dst, etc). But this had the problem
where some previous test can mutate the context, breaking or affecting
following tests. After 8ba6e86408, we clear and reinitialize all these
variables for each test, but they're still global. I think it would be
cleaner if these instead were in a struct, which each test initialized
and then destroyed within its own scope. If we were to do such a
refactor, I would put uffd_features in that struct too - it should be
private to a test, since it's a property we get from the uffd.

But, I wasn't sure it was worth the churn to do something like this.

>
> --
> Peter Xu
>
Peter Xu Sept. 22, 2021, 5:32 p.m. UTC | #7
Hello, Axel,

On Wed, Sep 22, 2021 at 10:04:03AM -0700, Axel Rasmussen wrote:
> Thanks for discussing the design Peter. I have some ideas which might
> make for a nicer v2; I'll massage the code a bit and see what I can
> come up with.

Sure thing.  Note again that as I don't have a strong opinion on that, feel
free to keep it.  However if you provide v2, I'll read.

[off-topic below]

Another thing I probably have forgot but need your confirmation is, when you
worked on uffd minor mode, did you explicitly disable thp, or is it allowed?

When I'm reworking the uffd-wp series, I noticed that commit e1e267c7928f
("khugepaged: skip collapse if uffd-wp detected", 2020-04-07) was actually
awkward and not efficient, as we can simply lookup the vma flags for detecting
uffd-wp enablement.  I'm preparing a patch for it to do it by checking vmas
(and that patch will also pave the way for file-backed).

Then I noticed we need similar thing for minor mode?

I think the answer is yes, but I didn't see any code that explicitly handled
thp for minor mode, do you remember?

To be explicit, what if in mcontinue_atomic_pte() we get a shmem_getpage() call
with a thp returned?  Will minor mode break?

I plan to post the khugepaged patch soon and I plan to cover minor mode too
there, but I'm not sure whether that's enough, as the thp can be there from the
1st day I think, but I could have missed something.
Axel Rasmussen Sept. 22, 2021, 8:54 p.m. UTC | #8
On Wed, Sep 22, 2021 at 10:33 AM Peter Xu <peterx@redhat.com> wrote:
>
> Hello, Axel,
>
> On Wed, Sep 22, 2021 at 10:04:03AM -0700, Axel Rasmussen wrote:
> > Thanks for discussing the design Peter. I have some ideas which might
> > make for a nicer v2; I'll massage the code a bit and see what I can
> > come up with.
>
> Sure thing.  Note again that as I don't have a strong opinion on that, feel
> free to keep it.  However if you provide v2, I'll read.
>
> [off-topic below]
>
> Another thing I probably have forgot but need your confirmation is, when you
> worked on uffd minor mode, did you explicitly disable thp, or is it allowed?

I gave a more detailed answer in the other thread, but: currently it
is allowed, but this was a bug / oversight on my part. :) THP collapse
can break the guarantees minor fault registration is trying to
provide.

I think your approach of checking the VMA flags *in
retract_page_tables specifically* is correct, and a similar thing
should be done for minor registered VMAs too.

>
> When I'm reworking the uffd-wp series, I noticed that commit e1e267c7928f
> ("khugepaged: skip collapse if uffd-wp detected", 2020-04-07) was actually
> awkward and not efficient, as we can simply lookup the vma flags for detecting
> uffd-wp enablement.  I'm preparing a patch for it to do it by checking vmas
> (and that patch will also pave the way for file-backed).
>
> Then I noticed we need similar thing for minor mode?
>
> I think the answer is yes, but I didn't see any code that explicitly handled
> thp for minor mode, do you remember?
>
> To be explicit, what if in mcontinue_atomic_pte() we get a shmem_getpage() call
> with a thp returned?  Will minor mode break?

Ah so this I am not quite as sure about.

The issue I was describing in the other thread was more about THP
collapse racing with UFFDIO_CONTINUE. E.g., collapsing after
registration has happened, but before faults have been resolved.

But there's another scenario: what if the collapse happened well
before registration happened? I *think* the existing code deals with
THPs correctly in that case, but then again I don't think our selftest
really covers this case, and it's not something I've tested in
production either (to work around the other bug, we currently
MADV_NOHUGEPAGE the area until after VM demand paging completes, and
the UFFD registration is removed), so I am not super confident this is
the case.

>
> I plan to post the khugepaged patch soon and I plan to cover minor mode too
> there, but I'm not sure whether that's enough, as the thp can be there from the
> 1st day I think, but I could have missed something.
>
> --
> Peter Xu
>
Peter Xu Sept. 22, 2021, 9:51 p.m. UTC | #9
On Wed, Sep 22, 2021 at 01:54:53PM -0700, Axel Rasmussen wrote:
> On Wed, Sep 22, 2021 at 10:33 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hello, Axel,
> >
> > On Wed, Sep 22, 2021 at 10:04:03AM -0700, Axel Rasmussen wrote:
> > > Thanks for discussing the design Peter. I have some ideas which might
> > > make for a nicer v2; I'll massage the code a bit and see what I can
> > > come up with.
> >
> > Sure thing.  Note again that as I don't have a strong opinion on that, feel
> > free to keep it.  However if you provide v2, I'll read.
> >
> > [off-topic below]
> >
> > Another thing I probably have forgot but need your confirmation is, when you
> > worked on uffd minor mode, did you explicitly disable thp, or is it allowed?
> 
> I gave a more detailed answer in the other thread, but: currently it
> is allowed, but this was a bug / oversight on my part. :) THP collapse
> can break the guarantees minor fault registration is trying to
> provide.

I've replied there:

https://lore.kernel.org/linux-mm/YUueOUfoamxOvEyO@t490s/

We can try to keep the discussion unified there regarding this.

> But there's another scenario: what if the collapse happened well
> before registration happened?

Maybe yes, but my understanding of the current uffd-minor scenario tells me
that this is fine too.  Meanwhile I actually have another idea regarding minor
mode, please continue reading.

Firstly, let me try to re-cap on how minor mode is used in your production
systems: I believe there should have two processes A and B, if A is the main
process, B could be the migration process.  B migrates pages in the background,
while A so far should have been stopped and never ran.  When we want to start
A, we should register A with uffd-minor upon the whole range (note: I think so
far A does not have any pgtable mapped within uffd-minor range).  Then any page
access of A should kick B and asking "whether it is the latest page", if yes
then UFFDIO_CONTINUE, if no then B modifies the page, plus UFFDIO_CONTINUE
afterwards.  Am I right above?

So if that's the case, then A should have no page table at all.

Then, is that a problem if the shmem file that A maps contains huge thps?  I
think no - because UFFDIO_CONTINUE will only install small pages.

Let me know if I'm understanding it right above; I'll be happy to be corrected.

Actually besides this scenario, I'm also thinking of another scenario of using
minor fault in a single process - that's mostly what QEMU is doing right now,
as QEMU has the vcpu threads and migration thread sharing a single mm/pgtable.
So I think it'll be great to have a new madvise(MADV_ZAP) which will tear down
all the file-backed memory pgtables of a specific range.  I think it'll suite
perfectly for the minor fault use case, and it can be used for other things
too.  Let me know what you think about this idea, and whether that'll help in
your case too (e.g., if you worry a current process A mapped huge shmem thp
somewhere, we can use madvise(MADV_ZAP) to drop it).

> I *think* the existing code deals with THPs correctly in that case, but then
> again I don't think our selftest really covers this case, and it's not
> something I've tested in production either (to work around the other bug, we
> currently MADV_NOHUGEPAGE the area until after VM demand paging completes,
> and the UFFD registration is removed), so I am not super confident this is
> the case.

In all cases, enhancing the test program will always be welcomed.

Thanks,
Axel Rasmussen Sept. 22, 2021, 10:29 p.m. UTC | #10
On Wed, Sep 22, 2021 at 2:52 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Sep 22, 2021 at 01:54:53PM -0700, Axel Rasmussen wrote:
> > On Wed, Sep 22, 2021 at 10:33 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Hello, Axel,
> > >
> > > On Wed, Sep 22, 2021 at 10:04:03AM -0700, Axel Rasmussen wrote:
> > > > Thanks for discussing the design Peter. I have some ideas which might
> > > > make for a nicer v2; I'll massage the code a bit and see what I can
> > > > come up with.
> > >
> > > Sure thing.  Note again that as I don't have a strong opinion on that, feel
> > > free to keep it.  However if you provide v2, I'll read.
> > >
> > > [off-topic below]
> > >
> > > Another thing I probably have forgot but need your confirmation is, when you
> > > worked on uffd minor mode, did you explicitly disable thp, or is it allowed?
> >
> > I gave a more detailed answer in the other thread, but: currently it
> > is allowed, but this was a bug / oversight on my part. :) THP collapse
> > can break the guarantees minor fault registration is trying to
> > provide.
>
> I've replied there:
>
> https://lore.kernel.org/linux-mm/YUueOUfoamxOvEyO@t490s/
>
> We can try to keep the discussion unified there regarding this.
>
> > But there's another scenario: what if the collapse happened well
> > before registration happened?
>
> Maybe yes, but my understanding of the current uffd-minor scenario tells me
> that this is fine too.  Meanwhile I actually have another idea regarding minor
> mode, please continue reading.
>
> Firstly, let me try to re-cap on how minor mode is used in your production
> systems: I believe there should have two processes A and B, if A is the main
> process, B could be the migration process.  B migrates pages in the background,
> while A so far should have been stopped and never ran.  When we want to start
> A, we should register A with uffd-minor upon the whole range (note: I think so
> far A does not have any pgtable mapped within uffd-minor range).  Then any page
> access of A should kick B and asking "whether it is the latest page", if yes
> then UFFDIO_CONTINUE, if no then B modifies the page, plus UFFDIO_CONTINUE
> afterwards.  Am I right above?
>
> So if that's the case, then A should have no page table at all.
>
> Then, is that a problem if the shmem file that A maps contains huge thps?  I
> think no - because UFFDIO_CONTINUE will only install small pages.
>
> Let me know if I'm understanding it right above; I'll be happy to be corrected.

Right, except that our use case is even more similar to QEMU: the code
doing UFFDIO_CONTINUE / demand paging, and the code running the vCPUs,
are in the same process (same mm) - just different threads.

>
> Actually besides this scenario, I'm also thinking of another scenario of using
> minor fault in a single process - that's mostly what QEMU is doing right now,
> as QEMU has the vcpu threads and migration thread sharing a single mm/pgtable.
> So I think it'll be great to have a new madvise(MADV_ZAP) which will tear down
> all the file-backed memory pgtables of a specific range.  I think it'll suite
> perfectly for the minor fault use case, and it can be used for other things
> too.  Let me know what you think about this idea, and whether that'll help in
> your case too (e.g., if you worry a current process A mapped huge shmem thp
> somewhere, we can use madvise(MADV_ZAP) to drop it).

Yes, this would be convenient for our implementation too. :) There are
workarounds if the feature doesn't exist, but it would be nice to
have. It's also useful for memory poisoning, I think, if the host
decides some page(s) are "bad" and wants to intercept any future guest
accesses to those page(s).

>
> > I *think* the existing code deals with THPs correctly in that case, but then
> > again I don't think our selftest really covers this case, and it's not
> > something I've tested in production either (to work around the other bug, we
> > currently MADV_NOHUGEPAGE the area until after VM demand paging completes,
> > and the UFFD registration is removed), so I am not super confident this is
> > the case.
>
> In all cases, enhancing the test program will always be welcomed.
>
> Thanks,
>
> --
> Peter Xu
>
Peter Xu Sept. 22, 2021, 11:49 p.m. UTC | #11
On Wed, Sep 22, 2021 at 03:29:42PM -0700, Axel Rasmussen wrote:
> On Wed, Sep 22, 2021 at 2:52 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Sep 22, 2021 at 01:54:53PM -0700, Axel Rasmussen wrote:
> > > On Wed, Sep 22, 2021 at 10:33 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > Hello, Axel,
> > > >
> > > > On Wed, Sep 22, 2021 at 10:04:03AM -0700, Axel Rasmussen wrote:
> > > > > Thanks for discussing the design Peter. I have some ideas which might
> > > > > make for a nicer v2; I'll massage the code a bit and see what I can
> > > > > come up with.
> > > >
> > > > Sure thing.  Note again that as I don't have a strong opinion on that, feel
> > > > free to keep it.  However if you provide v2, I'll read.
> > > >
> > > > [off-topic below]
> > > >
> > > > Another thing I probably have forgot but need your confirmation is, when you
> > > > worked on uffd minor mode, did you explicitly disable thp, or is it allowed?
> > >
> > > I gave a more detailed answer in the other thread, but: currently it
> > > is allowed, but this was a bug / oversight on my part. :) THP collapse
> > > can break the guarantees minor fault registration is trying to
> > > provide.
> >
> > I've replied there:
> >
> > https://lore.kernel.org/linux-mm/YUueOUfoamxOvEyO@t490s/
> >
> > We can try to keep the discussion unified there regarding this.
> >
> > > But there's another scenario: what if the collapse happened well
> > > before registration happened?
> >
> > Maybe yes, but my understanding of the current uffd-minor scenario tells me
> > that this is fine too.  Meanwhile I actually have another idea regarding minor
> > mode, please continue reading.
> >
> > Firstly, let me try to re-cap on how minor mode is used in your production
> > systems: I believe there should have two processes A and B, if A is the main
> > process, B could be the migration process.  B migrates pages in the background,
> > while A so far should have been stopped and never ran.  When we want to start
> > A, we should register A with uffd-minor upon the whole range (note: I think so
> > far A does not have any pgtable mapped within uffd-minor range).  Then any page
> > access of A should kick B and asking "whether it is the latest page", if yes
> > then UFFDIO_CONTINUE, if no then B modifies the page, plus UFFDIO_CONTINUE
> > afterwards.  Am I right above?
> >
> > So if that's the case, then A should have no page table at all.
> >
> > Then, is that a problem if the shmem file that A maps contains huge thps?  I
> > think no - because UFFDIO_CONTINUE will only install small pages.
> >
> > Let me know if I'm understanding it right above; I'll be happy to be corrected.
> 
> Right, except that our use case is even more similar to QEMU: the code
> doing UFFDIO_CONTINUE / demand paging, and the code running the vCPUs,
> are in the same process (same mm) - just different threads.

I see.

> 
> >
> > Actually besides this scenario, I'm also thinking of another scenario of using
> > minor fault in a single process - that's mostly what QEMU is doing right now,
> > as QEMU has the vcpu threads and migration thread sharing a single mm/pgtable.
> > So I think it'll be great to have a new madvise(MADV_ZAP) which will tear down
> > all the file-backed memory pgtables of a specific range.  I think it'll suite
> > perfectly for the minor fault use case, and it can be used for other things
> > too.  Let me know what you think about this idea, and whether that'll help in
> > your case too (e.g., if you worry a current process A mapped huge shmem thp
> > somewhere, we can use madvise(MADV_ZAP) to drop it).
> 
> Yes, this would be convenient for our implementation too. :) There are
> workarounds if the feature doesn't exist, but it would be nice to
> have.

Could I know what's the workaround?  Normally if the workaround works solidly,
then there's less need to introduce a kernel interface for that.  Otherwise I'm
glad to look into such a formal proposal.

> It's also useful for memory poisoning, I think, if the host
> decides some page(s) are "bad" and wants to intercept any future guest
> accesses to those page(s).

Curious: isn't hwpoison information come from MCEs; or say, host kernel side?
Then I thought the host kernel will have full control of it already.

Or there's other way that the host can try to detect some pages are going to be
rotten?  So the userspace can do something before the kernel handles those
exceptions?
James Houghton Sept. 23, 2021, 4:17 a.m. UTC | #12
On Wed, Sep 22, 2021 at 4:49 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Sep 22, 2021 at 03:29:42PM -0700, Axel Rasmussen wrote:
> > On Wed, Sep 22, 2021 at 2:52 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Wed, Sep 22, 2021 at 01:54:53PM -0700, Axel Rasmussen wrote:
> > > > On Wed, Sep 22, 2021 at 10:33 AM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > Hello, Axel,
> > > > >
> > > > > On Wed, Sep 22, 2021 at 10:04:03AM -0700, Axel Rasmussen wrote:
> > > > > > Thanks for discussing the design Peter. I have some ideas which might
> > > > > > make for a nicer v2; I'll massage the code a bit and see what I can
> > > > > > come up with.
> > > > >
> > > > > Sure thing.  Note again that as I don't have a strong opinion on that, feel
> > > > > free to keep it.  However if you provide v2, I'll read.
> > > > >
> > > > > [off-topic below]
> > > > >
> > > > > Another thing I probably have forgot but need your confirmation is, when you
> > > > > worked on uffd minor mode, did you explicitly disable thp, or is it allowed?
> > > >
> > > > I gave a more detailed answer in the other thread, but: currently it
> > > > is allowed, but this was a bug / oversight on my part. :) THP collapse
> > > > can break the guarantees minor fault registration is trying to
> > > > provide.
> > >
> > > I've replied there:
> > >
> > > https://lore.kernel.org/linux-mm/YUueOUfoamxOvEyO@t490s/
> > >
> > > We can try to keep the discussion unified there regarding this.
> > >
> > > > But there's another scenario: what if the collapse happened well
> > > > before registration happened?
> > >
> > > Maybe yes, but my understanding of the current uffd-minor scenario tells me
> > > that this is fine too.  Meanwhile I actually have another idea regarding minor
> > > mode, please continue reading.
> > >
> > > Firstly, let me try to re-cap on how minor mode is used in your production
> > > systems: I believe there should have two processes A and B, if A is the main
> > > process, B could be the migration process.  B migrates pages in the background,
> > > while A so far should have been stopped and never ran.  When we want to start
> > > A, we should register A with uffd-minor upon the whole range (note: I think so
> > > far A does not have any pgtable mapped within uffd-minor range).  Then any page
> > > access of A should kick B and asking "whether it is the latest page", if yes
> > > then UFFDIO_CONTINUE, if no then B modifies the page, plus UFFDIO_CONTINUE
> > > afterwards.  Am I right above?
> > >
> > > So if that's the case, then A should have no page table at all.
> > >
> > > Then, is that a problem if the shmem file that A maps contains huge thps?  I
> > > think no - because UFFDIO_CONTINUE will only install small pages.
> > >
> > > Let me know if I'm understanding it right above; I'll be happy to be corrected.
> >
> > Right, except that our use case is even more similar to QEMU: the code
> > doing UFFDIO_CONTINUE / demand paging, and the code running the vCPUs,
> > are in the same process (same mm) - just different threads.
>
> I see.
>
> >
> > >
> > > Actually besides this scenario, I'm also thinking of another scenario of using
> > > minor fault in a single process - that's mostly what QEMU is doing right now,
> > > as QEMU has the vcpu threads and migration thread sharing a single mm/pgtable.
> > > So I think it'll be great to have a new madvise(MADV_ZAP) which will tear down
> > > all the file-backed memory pgtables of a specific range.  I think it'll suite
> > > perfectly for the minor fault use case, and it can be used for other things
> > > too.  Let me know what you think about this idea, and whether that'll help in
> > > your case too (e.g., if you worry a current process A mapped huge shmem thp
> > > somewhere, we can use madvise(MADV_ZAP) to drop it).
> >
> > Yes, this would be convenient for our implementation too. :) There are
> > workarounds if the feature doesn't exist, but it would be nice to
> > have.
>
> Could I know what's the workaround?  Normally if the workaround works solidly,
> then there's less need to introduce a kernel interface for that.  Otherwise I'm
> glad to look into such a formal proposal.

The workaround is, for the region that you want to zap, run through
this sequence of syscalls: mumap, mmap, and re-register with
userfaultfd if it was registered before. If we're using tmpfs, we can
use madvise(DONTNEED) instead, but this is kind of an abuse of the
API. I don't think there's a guarantee that the PTEs will get zapped,
but currently they will always get zapped if we're using tmpfs. I
really like the idea of adding a new madvise() mode that is guaranteed
to zap the PTEs.

>
> > It's also useful for memory poisoning, I think, if the host
> > decides some page(s) are "bad" and wants to intercept any future guest
> > accesses to those page(s).
>
> Curious: isn't hwpoison information come from MCEs; or say, host kernel side?
> Then I thought the host kernel will have full control of it already.
>
> Or there's other way that the host can try to detect some pages are going to be
> rotten?  So the userspace can do something before the kernel handles those
> exceptions?

Here's a general idea of how we would like to use userfaultfd to support MPR:

If a guest accesses a poisoned page for the first time, we will get an
MCE through the host kernel and send an MCE to the guest. The guest
will now no longer be able to access this page, and we have to enforce
this. After a live migration, the pages that were poisoned before
probably won't still be poisoned (from the host's perspective), so we
can't rely on the host kernel's MCE handling path. This is where
userfaultfd and this new madvise mode come in: we can just
madvise(MADV_ZAP) the poisoned page(s) on the target during a
migration. Now all accesses will be routed to the VMM and we can
inject an MCE. We don't *need* the new madvise mode, as we can also
use fallocate(PUNCH_HOLE) (works for tmpfs and hugetlbfs), but it
would be more convenient if we didn't have to use fallocate.

Jue Wang can provide more context here, so I've cc'd him. There may be
some things I'm wrong about, so Jue feel free to correct me.


- James


>
> --
> Peter Xu
>
Jue Wang Sept. 23, 2021, 5:43 a.m. UTC | #13
On Wed, Sep 22, 2021 at 9:18 PM James Houghton <jthoughton@google.com> wrote:
>
> On Wed, Sep 22, 2021 at 4:49 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Sep 22, 2021 at 03:29:42PM -0700, Axel Rasmussen wrote:
> > > On Wed, Sep 22, 2021 at 2:52 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Wed, Sep 22, 2021 at 01:54:53PM -0700, Axel Rasmussen wrote:
> > > > > On Wed, Sep 22, 2021 at 10:33 AM Peter Xu <peterx@redhat.com> wrote:
> > > > > >
> > > > > > Hello, Axel,
> > > > > >
> > > > > > On Wed, Sep 22, 2021 at 10:04:03AM -0700, Axel Rasmussen wrote:
> > > > > > > Thanks for discussing the design Peter. I have some ideas which might
> > > > > > > make for a nicer v2; I'll massage the code a bit and see what I can
> > > > > > > come up with.
> > > > > >
> > > > > > Sure thing.  Note again that as I don't have a strong opinion on that, feel
> > > > > > free to keep it.  However if you provide v2, I'll read.
> > > > > >
> > > > > > [off-topic below]
> > > > > >
> > > > > > Another thing I probably have forgot but need your confirmation is, when you
> > > > > > worked on uffd minor mode, did you explicitly disable thp, or is it allowed?
> > > > >
> > > > > I gave a more detailed answer in the other thread, but: currently it
> > > > > is allowed, but this was a bug / oversight on my part. :) THP collapse
> > > > > can break the guarantees minor fault registration is trying to
> > > > > provide.
> > > >
> > > > I've replied there:
> > > >
> > > > https://lore.kernel.org/linux-mm/YUueOUfoamxOvEyO@t490s/
> > > >
> > > > We can try to keep the discussion unified there regarding this.
> > > >
> > > > > But there's another scenario: what if the collapse happened well
> > > > > before registration happened?
> > > >
> > > > Maybe yes, but my understanding of the current uffd-minor scenario tells me
> > > > that this is fine too.  Meanwhile I actually have another idea regarding minor
> > > > mode, please continue reading.
> > > >
> > > > Firstly, let me try to re-cap on how minor mode is used in your production
> > > > systems: I believe there should have two processes A and B, if A is the main
> > > > process, B could be the migration process.  B migrates pages in the background,
> > > > while A so far should have been stopped and never ran.  When we want to start
> > > > A, we should register A with uffd-minor upon the whole range (note: I think so
> > > > far A does not have any pgtable mapped within uffd-minor range).  Then any page
> > > > access of A should kick B and asking "whether it is the latest page", if yes
> > > > then UFFDIO_CONTINUE, if no then B modifies the page, plus UFFDIO_CONTINUE
> > > > afterwards.  Am I right above?
> > > >
> > > > So if that's the case, then A should have no page table at all.
> > > >
> > > > Then, is that a problem if the shmem file that A maps contains huge thps?  I
> > > > think no - because UFFDIO_CONTINUE will only install small pages.
> > > >
> > > > Let me know if I'm understanding it right above; I'll be happy to be corrected.
> > >
> > > Right, except that our use case is even more similar to QEMU: the code
> > > doing UFFDIO_CONTINUE / demand paging, and the code running the vCPUs,
> > > are in the same process (same mm) - just different threads.
> >
> > I see.
> >
> > >
> > > >
> > > > Actually besides this scenario, I'm also thinking of another scenario of using
> > > > minor fault in a single process - that's mostly what QEMU is doing right now,
> > > > as QEMU has the vcpu threads and migration thread sharing a single mm/pgtable.
> > > > So I think it'll be great to have a new madvise(MADV_ZAP) which will tear down
> > > > all the file-backed memory pgtables of a specific range.  I think it'll suite
> > > > perfectly for the minor fault use case, and it can be used for other things
> > > > too.  Let me know what you think about this idea, and whether that'll help in
> > > > your case too (e.g., if you worry a current process A mapped huge shmem thp
> > > > somewhere, we can use madvise(MADV_ZAP) to drop it).
> > >
> > > Yes, this would be convenient for our implementation too. :) There are
> > > workarounds if the feature doesn't exist, but it would be nice to
> > > have.
> >
> > Could I know what's the workaround?  Normally if the workaround works solidly,
> > then there's less need to introduce a kernel interface for that.  Otherwise I'm
> > glad to look into such a formal proposal.
>
> The workaround is, for the region that you want to zap, run through
> this sequence of syscalls: mumap, mmap, and re-register with
> userfaultfd if it was registered before. If we're using tmpfs, we can
> use madvise(DONTNEED) instead, but this is kind of an abuse of the
> API. I don't think there's a guarantee that the PTEs will get zapped,
> but currently they will always get zapped if we're using tmpfs. I
> really like the idea of adding a new madvise() mode that is guaranteed
> to zap the PTEs.
>
> >
> > > It's also useful for memory poisoning, I think, if the host
> > > decides some page(s) are "bad" and wants to intercept any future guest
> > > accesses to those page(s).
> >
> > Curious: isn't hwpoison information come from MCEs; or say, host kernel side?
> > Then I thought the host kernel will have full control of it already.
> >
> > Or there's other way that the host can try to detect some pages are going to be
> > rotten?  So the userspace can do something before the kernel handles those
> > exceptions?
>
> Here's a general idea of how we would like to use userfaultfd to support MPR:
>
> If a guest accesses a poisoned page for the first time, we will get an
> MCE through the host kernel and send an MCE to the guest. The guest
> will now no longer be able to access this page, and we have to enforce
> this. After a live migration, the pages that were poisoned before
> probably won't still be poisoned (from the host's perspective), so we
> can't rely on the host kernel's MCE handling path. This is where
> userfaultfd and this new madvise mode come in: we can just
> madvise(MADV_ZAP) the poisoned page(s) on the target during a
> migration. Now all accesses will be routed to the VMM and we can
> inject an MCE. We don't *need* the new madvise mode, as we can also
> use fallocate(PUNCH_HOLE) (works for tmpfs and hugetlbfs), but it
> would be more convenient if we didn't have to use fallocate.
>
> Jue Wang can provide more context here, so I've cc'd him. There may be
> some things I'm wrong about, so Jue feel free to correct me.
>
James is right.

The page is marked PG_HWPoison in the source VM host's kernel. The need
of intercepting guest accesses to it exist on the target VM host, where
the same physical page is no longer poisoned.

On the target host, the hypervisor needs to intercept all guest accesses
to pages poisoned from the source VM host.
>
> - James
>
>
> >
> > --
> > Peter Xu
> >
Peter Xu Sept. 24, 2021, 8:09 p.m. UTC | #14
On Wed, Sep 22, 2021 at 10:43:40PM -0700, Jue Wang wrote:

[...]

> > > Could I know what's the workaround?  Normally if the workaround works solidly,
> > > then there's less need to introduce a kernel interface for that.  Otherwise I'm
> > > glad to look into such a formal proposal.
> >
> > The workaround is, for the region that you want to zap, run through
> > this sequence of syscalls: mumap, mmap, and re-register with
> > userfaultfd if it was registered before. If we're using tmpfs, we can
> > use madvise(DONTNEED) instead, but this is kind of an abuse of the
> > API. I don't think there's a guarantee that the PTEs will get zapped,
> > but currently they will always get zapped if we're using tmpfs. I
> > really like the idea of adding a new madvise() mode that is guaranteed
> > to zap the PTEs.

I see.

> >
> > >
> > > > It's also useful for memory poisoning, I think, if the host
> > > > decides some page(s) are "bad" and wants to intercept any future guest
> > > > accesses to those page(s).
> > >
> > > Curious: isn't hwpoison information come from MCEs; or say, host kernel side?
> > > Then I thought the host kernel will have full control of it already.
> > >
> > > Or there's other way that the host can try to detect some pages are going to be
> > > rotten?  So the userspace can do something before the kernel handles those
> > > exceptions?
> >
> > Here's a general idea of how we would like to use userfaultfd to support MPR:
> >
> > If a guest accesses a poisoned page for the first time, we will get an
> > MCE through the host kernel and send an MCE to the guest. The guest
> > will now no longer be able to access this page, and we have to enforce
> > this. After a live migration, the pages that were poisoned before
> > probably won't still be poisoned (from the host's perspective), so we
> > can't rely on the host kernel's MCE handling path. This is where
> > userfaultfd and this new madvise mode come in: we can just
> > madvise(MADV_ZAP) the poisoned page(s) on the target during a
> > migration. Now all accesses will be routed to the VMM and we can
> > inject an MCE. We don't *need* the new madvise mode, as we can also
> > use fallocate(PUNCH_HOLE) (works for tmpfs and hugetlbfs), but it
> > would be more convenient if we didn't have to use fallocate.
> >
> > Jue Wang can provide more context here, so I've cc'd him. There may be
> > some things I'm wrong about, so Jue feel free to correct me.
> >
> James is right.
> 
> The page is marked PG_HWPoison in the source VM host's kernel. The need
> of intercepting guest accesses to it exist on the target VM host, where
> the same physical page is no longer poisoned.
> 
> On the target host, the hypervisor needs to intercept all guest accesses
> to pages poisoned from the source VM host.

Thanks for these information, James, Jue, Axel.  I'm not familiar with memory
failures yet, so please bare with me with a few naive questions.

So now I can undertand that hw-poisonsed pages on src host do not mean these
pages will be hw-poisoned on dest host too, but I may have missed the reason on
why dest host needs to trap it with pgtable removed.

AFAIU after pages got hw-poisoned on src, and after vmm injects MCEs into the
guest, the guest shouldn't be accessing these pages any more, am I right?  Then
after migration completes, IIUC the guest shouldn't be accessing these pages
too.  My current understanding is, instead of trapping these pages on dest, we
should just (somehow, which I have no real idea...) un-hw-poison these pages
after migration because these pages are very possibly normal pages there.  When
there's real hw-poisoned pages reported on dst host, we should re-inject MCE
errors to guest with another set of pages.

Could you tell me where did I miss?

Thanks,
Jue Wang Sept. 24, 2021, 8:22 p.m. UTC | #15
On Fri, Sep 24, 2021 at 1:09 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Sep 22, 2021 at 10:43:40PM -0700, Jue Wang wrote:
>
> [...]
>
> > > > Could I know what's the workaround?  Normally if the workaround works solidly,
> > > > then there's less need to introduce a kernel interface for that.  Otherwise I'm
> > > > glad to look into such a formal proposal.
> > >
> > > The workaround is, for the region that you want to zap, run through
> > > this sequence of syscalls: mumap, mmap, and re-register with
> > > userfaultfd if it was registered before. If we're using tmpfs, we can
> > > use madvise(DONTNEED) instead, but this is kind of an abuse of the
> > > API. I don't think there's a guarantee that the PTEs will get zapped,
> > > but currently they will always get zapped if we're using tmpfs. I
> > > really like the idea of adding a new madvise() mode that is guaranteed
> > > to zap the PTEs.
>
> I see.
>
> > >
> > > >
> > > > > It's also useful for memory poisoning, I think, if the host
> > > > > decides some page(s) are "bad" and wants to intercept any future guest
> > > > > accesses to those page(s).
> > > >
> > > > Curious: isn't hwpoison information come from MCEs; or say, host kernel side?
> > > > Then I thought the host kernel will have full control of it already.
> > > >
> > > > Or there's other way that the host can try to detect some pages are going to be
> > > > rotten?  So the userspace can do something before the kernel handles those
> > > > exceptions?
> > >
> > > Here's a general idea of how we would like to use userfaultfd to support MPR:
> > >
> > > If a guest accesses a poisoned page for the first time, we will get an
> > > MCE through the host kernel and send an MCE to the guest. The guest
> > > will now no longer be able to access this page, and we have to enforce
> > > this. After a live migration, the pages that were poisoned before
> > > probably won't still be poisoned (from the host's perspective), so we
> > > can't rely on the host kernel's MCE handling path. This is where
> > > userfaultfd and this new madvise mode come in: we can just
> > > madvise(MADV_ZAP) the poisoned page(s) on the target during a
> > > migration. Now all accesses will be routed to the VMM and we can
> > > inject an MCE. We don't *need* the new madvise mode, as we can also
> > > use fallocate(PUNCH_HOLE) (works for tmpfs and hugetlbfs), but it
> > > would be more convenient if we didn't have to use fallocate.
> > >
> > > Jue Wang can provide more context here, so I've cc'd him. There may be
> > > some things I'm wrong about, so Jue feel free to correct me.
> > >
> > James is right.
> >
> > The page is marked PG_HWPoison in the source VM host's kernel. The need
> > of intercepting guest accesses to it exist on the target VM host, where
> > the same physical page is no longer poisoned.
> >
> > On the target host, the hypervisor needs to intercept all guest accesses
> > to pages poisoned from the source VM host.
>
> Thanks for these information, James, Jue, Axel.  I'm not familiar with memory
> failures yet, so please bare with me with a few naive questions.
>
> So now I can undertand that hw-poisonsed pages on src host do not mean these
> pages will be hw-poisoned on dest host too, but I may have missed the reason on
> why dest host needs to trap it with pgtable removed.
>
> AFAIU after pages got hw-poisoned on src, and after vmm injects MCEs into the
> guest, the guest shouldn't be accessing these pages any more, am I right?  Then

This is also our hope for the guest to behave but there is no guarantee on
guest behavior.

The goal here is to have the hypervisor provide consistent behavior
aligned to native
hardware, i.e., a guest page with "memory errors" stay persistently in
that state
no matter on source or target.

> after migration completes, IIUC the guest shouldn't be accessing these pages
> too.  My current understanding is, instead of trapping these pages on dest, we
> should just (somehow, which I have no real idea...) un-hw-poison these pages
> after migration because these pages are very possibly normal pages there.  When
> there's real hw-poisoned pages reported on dst host, we should re-inject MCE
> errors to guest with another set of pages.
>
> Could you tell me where did I miss?
>
> Thanks,
>
> --
> Peter Xu
>
diff mbox series

Patch

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 10ab56c2484a..2366caf90435 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -79,10 +79,6 @@  static int test_type;
 #define ALARM_INTERVAL_SECS 10
 static volatile bool test_uffdio_copy_eexist = true;
 static volatile bool test_uffdio_zeropage_eexist = true;
-/* Whether to test uffd write-protection */
-static bool test_uffdio_wp = false;
-/* Whether to test uffd minor faults */
-static bool test_uffdio_minor = false;
 
 static bool map_shared;
 static int shm_fd;
@@ -90,6 +86,7 @@  static int huge_fd;
 static char *huge_fd_off0;
 static unsigned long long *count_verify;
 static int uffd = -1;
+static uint64_t uffd_features;
 static int uffd_flags, finished, *pipefd;
 static char *area_src, *area_src_alias, *area_dst, *area_dst_alias;
 static char *zeropage;
@@ -345,7 +342,7 @@  static struct uffd_test_ops hugetlb_uffd_test_ops = {
 
 static struct uffd_test_ops *uffd_test_ops;
 
-static void userfaultfd_open(uint64_t *features)
+static void userfaultfd_open(uint64_t features)
 {
 	struct uffdio_api uffdio_api;
 
@@ -355,14 +352,20 @@  static void userfaultfd_open(uint64_t *features)
 	uffd_flags = fcntl(uffd, F_GETFD, NULL);
 
 	uffdio_api.api = UFFD_API;
-	uffdio_api.features = *features;
+	uffdio_api.features = features;
 	if (ioctl(uffd, UFFDIO_API, &uffdio_api))
 		err("UFFDIO_API failed.\nPlease make sure to "
 		    "run with either root or ptrace capability.");
 	if (uffdio_api.api != UFFD_API)
 		err("UFFDIO_API error: %" PRIu64, (uint64_t)uffdio_api.api);
 
-	*features = uffdio_api.features;
+	uffd_features = uffdio_api.features;
+}
+
+static inline bool uffd_wp_supported(void)
+{
+	return test_type == TEST_ANON &&
+		(uffd_features & UFFD_FEATURE_PAGEFAULT_FLAG_WP);
 }
 
 static inline void munmap_area(void **area)
@@ -397,6 +400,7 @@  static void uffd_test_ctx_clear(void)
 			err("close uffd");
 		uffd = -1;
 	}
+	uffd_features = 0;
 
 	huge_fd_off0 = NULL;
 	munmap_area((void **)&area_src);
@@ -405,7 +409,7 @@  static void uffd_test_ctx_clear(void)
 	munmap_area((void **)&area_dst_alias);
 }
 
-static void uffd_test_ctx_init_ext(uint64_t *features)
+static void uffd_test_ctx_init(uint64_t features)
 {
 	unsigned long nr, cpu;
 
@@ -445,11 +449,6 @@  static void uffd_test_ctx_init_ext(uint64_t *features)
 			err("pipe");
 }
 
-static inline void uffd_test_ctx_init(uint64_t features)
-{
-	uffd_test_ctx_init_ext(&features);
-}
-
 static int my_bcmp(char *str1, char *str2, size_t n)
 {
 	unsigned long i;
@@ -587,7 +586,7 @@  static int __copy_page(int ufd, unsigned long offset, bool retry)
 	uffdio_copy.dst = (unsigned long) area_dst + offset;
 	uffdio_copy.src = (unsigned long) area_src + offset;
 	uffdio_copy.len = page_size;
-	if (test_uffdio_wp)
+	if (uffd_wp_supported())
 		uffdio_copy.mode = UFFDIO_COPY_MODE_WP;
 	else
 		uffdio_copy.mode = 0;
@@ -778,7 +777,7 @@  static void *background_thread(void *arg)
 	 * at least the first half of the pages mapped already which
 	 * can be write-protected for testing
 	 */
-	if (test_uffdio_wp)
+	if (uffd_wp_supported())
 		wp_range(uffd, (unsigned long)area_dst + start_nr * page_size,
 			nr_pages_per_cpu * page_size, true);
 
@@ -1062,12 +1061,12 @@  static int userfaultfd_zeropage_test(void)
 	printf("testing UFFDIO_ZEROPAGE: ");
 	fflush(stdout);
 
-	uffd_test_ctx_init(0);
+	uffd_test_ctx_init(UFFD_FEATURE_PAGEFAULT_FLAG_WP);
 
 	uffdio_register.range.start = (unsigned long) area_dst;
 	uffdio_register.range.len = nr_pages * page_size;
 	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
-	if (test_uffdio_wp)
+	if (uffd_wp_supported())
 		uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
 	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
 		err("register failure");
@@ -1089,7 +1088,7 @@  static int userfaultfd_events_test(void)
 	struct uffdio_register uffdio_register;
 	unsigned long expected_ioctls;
 	pthread_t uffd_mon;
-	int err, features;
+	int err;
 	pid_t pid;
 	char c;
 	struct uffd_stats stats = { 0 };
@@ -1097,16 +1096,15 @@  static int userfaultfd_events_test(void)
 	printf("testing events (fork, remap, remove): ");
 	fflush(stdout);
 
-	features = UFFD_FEATURE_EVENT_FORK | UFFD_FEATURE_EVENT_REMAP |
-		UFFD_FEATURE_EVENT_REMOVE;
-	uffd_test_ctx_init(features);
+	uffd_test_ctx_init(UFFD_FEATURE_EVENT_FORK | UFFD_FEATURE_EVENT_REMAP |
+			   UFFD_FEATURE_EVENT_REMOVE | UFFD_FEATURE_PAGEFAULT_FLAG_WP);
 
 	fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK);
 
 	uffdio_register.range.start = (unsigned long) area_dst;
 	uffdio_register.range.len = nr_pages * page_size;
 	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
-	if (test_uffdio_wp)
+	if (uffd_wp_supported())
 		uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
 	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
 		err("register failure");
@@ -1144,7 +1142,7 @@  static int userfaultfd_sig_test(void)
 	unsigned long expected_ioctls;
 	unsigned long userfaults;
 	pthread_t uffd_mon;
-	int err, features;
+	int err;
 	pid_t pid;
 	char c;
 	struct uffd_stats stats = { 0 };
@@ -1152,15 +1150,15 @@  static int userfaultfd_sig_test(void)
 	printf("testing signal delivery: ");
 	fflush(stdout);
 
-	features = UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_SIGBUS;
-	uffd_test_ctx_init(features);
+	uffd_test_ctx_init(UFFD_FEATURE_EVENT_FORK | UFFD_FEATURE_SIGBUS |
+			   UFFD_FEATURE_PAGEFAULT_FLAG_WP);
 
 	fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK);
 
 	uffdio_register.range.start = (unsigned long) area_dst;
 	uffdio_register.range.len = nr_pages * page_size;
 	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
-	if (test_uffdio_wp)
+	if (uffd_wp_supported())
 		uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
 	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
 		err("register failure");
@@ -1209,25 +1207,23 @@  static int userfaultfd_minor_test(void)
 	void *expected_page;
 	char c;
 	struct uffd_stats stats = { 0 };
-	uint64_t req_features, features_out;
-
-	if (!test_uffdio_minor)
-		return 0;
+	uint64_t features;
 
 	printf("testing minor faults: ");
 	fflush(stdout);
 
-	if (test_type == TEST_HUGETLB)
-		req_features = UFFD_FEATURE_MINOR_HUGETLBFS;
+	if (test_type == TEST_HUGETLB && map_shared)
+		features = UFFD_FEATURE_MINOR_HUGETLBFS;
 	else if (test_type == TEST_SHMEM)
-		req_features = UFFD_FEATURE_MINOR_SHMEM;
-	else
-		return 1;
+		features = UFFD_FEATURE_MINOR_SHMEM;
+	else {
+		printf("skipping test due to unsupported memory type\n");
+		return 0;
+	}
 
-	features_out = req_features;
-	uffd_test_ctx_init_ext(&features_out);
+	uffd_test_ctx_init(features);
 	/* If kernel reports required features aren't supported, skip test. */
-	if ((features_out & req_features) != req_features) {
+	if ((uffd_features & features) != features) {
 		printf("skipping test due to lack of feature support\n");
 		fflush(stdout);
 		return 0;
@@ -1349,10 +1345,6 @@  static void userfaultfd_pagemap_test(unsigned int test_pgsize)
 	int pagemap_fd;
 	uint64_t value;
 
-	/* Pagemap tests uffd-wp only */
-	if (!test_uffdio_wp)
-		return;
-
 	/* Not enough memory to test this page size */
 	if (test_pgsize > nr_pages * page_size)
 		return;
@@ -1361,7 +1353,12 @@  static void userfaultfd_pagemap_test(unsigned int test_pgsize)
 	/* Flush so it doesn't flush twice in parent/child later */
 	fflush(stdout);
 
-	uffd_test_ctx_init(0);
+	uffd_test_ctx_init(UFFD_FEATURE_PAGEFAULT_FLAG_WP);
+	/* Pagemap tests uffd-wp only */
+	if (!uffd_wp_supported()) {
+		printf("skipping test due to lack of feature support\n");
+		return;
+	}
 
 	if (test_pgsize > page_size) {
 		/* This is a thp test */
@@ -1426,7 +1423,7 @@  static int userfaultfd_stress(void)
 	struct uffdio_register uffdio_register;
 	struct uffd_stats uffd_stats[nr_cpus];
 
-	uffd_test_ctx_init(0);
+	uffd_test_ctx_init(UFFD_FEATURE_PAGEFAULT_FLAG_WP);
 
 	if (posix_memalign(&area, page_size, page_size))
 		err("out of memory");
@@ -1464,7 +1461,7 @@  static int userfaultfd_stress(void)
 		uffdio_register.range.start = (unsigned long) area_dst;
 		uffdio_register.range.len = nr_pages * page_size;
 		uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
-		if (test_uffdio_wp)
+		if (uffd_wp_supported())
 			uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
 		if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
 			err("register failure");
@@ -1513,7 +1510,7 @@  static int userfaultfd_stress(void)
 			return 1;
 
 		/* Clear all the write protections if there is any */
-		if (test_uffdio_wp)
+		if (uffd_wp_supported())
 			wp_range(uffd, (unsigned long)area_dst,
 				 nr_pages * page_size, false);
 
@@ -1595,8 +1592,6 @@  static void set_test_type(const char *type)
 	if (!strcmp(type, "anon")) {
 		test_type = TEST_ANON;
 		uffd_test_ops = &anon_uffd_test_ops;
-		/* Only enable write-protect test for anonymous test */
-		test_uffdio_wp = true;
 	} else if (!strcmp(type, "hugetlb")) {
 		test_type = TEST_HUGETLB;
 		uffd_test_ops = &hugetlb_uffd_test_ops;
@@ -1604,13 +1599,10 @@  static void set_test_type(const char *type)
 		map_shared = true;
 		test_type = TEST_HUGETLB;
 		uffd_test_ops = &hugetlb_uffd_test_ops;
-		/* Minor faults require shared hugetlb; only enable here. */
-		test_uffdio_minor = true;
 	} else if (!strcmp(type, "shmem")) {
 		map_shared = true;
 		test_type = TEST_SHMEM;
 		uffd_test_ops = &shmem_uffd_test_ops;
-		test_uffdio_minor = true;
 	} else {
 		err("Unknown test type: %s", type);
 	}