diff mbox series

[1/3] mm: userfaultfd: add new UFFDIO_SIGBUS ioctl

Message ID 20230511182426.1898675-1-axelrasmussen@google.com (mailing list archive)
State New
Headers show
Series [1/3] mm: userfaultfd: add new UFFDIO_SIGBUS ioctl | expand

Commit Message

Axel Rasmussen May 11, 2023, 6:24 p.m. UTC
The basic idea here is to "simulate" memory poisoning for VMs. A VM
running on some host might encounter a memory error, after which some
page(s) are poisoned (i.e., future accesses SIGBUS). They expect that
once poisoned, pages can never become "un-poisoned". So, when we live
migrate the VM, we need to preserve the poisoned status of these pages.

When live migrating, we try to get the guest running on its new host as
quickly as possible. So, we start it running before all memory has been
copied, and before we're certain which pages should be poisoned or not.

So the basic way to use this new feature is:

- On the new host, the guest's memory is registered with userfaultfd, in
  either MISSING or MINOR mode (doesn't really matter for this purpose).
- On any first access, we get a userfaultfd event. At this point we can
  communicate with the old host to find out if the page was poisoned.
- If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
  so any future accesses will SIGBUS. Because the pte is now "present",
  future accesses won't generate more userfaultfd events, they'll just
  SIGBUS directly.

UFFDIO_SIGBUS does not handle unmapping previously-present PTEs. This
isn't needed, because during live migration we want to intercept
all accesses with userfaultfd (not just writes, so WP mode isn't useful
for this). So whether minor or missing mode is being used (or both), the
PTE won't be present in any case, so handling that case isn't needed.

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 fs/userfaultfd.c                 | 63 ++++++++++++++++++++++++++++++++
 include/linux/swapops.h          |  3 +-
 include/linux/userfaultfd_k.h    |  4 ++
 include/uapi/linux/userfaultfd.h | 25 +++++++++++--
 mm/memory.c                      |  4 ++
 mm/userfaultfd.c                 | 62 ++++++++++++++++++++++++++++++-
 6 files changed, 156 insertions(+), 5 deletions(-)

Comments

Mike Kravetz May 11, 2023, 8:22 p.m. UTC | #1
On 05/11/23 11:24, Axel Rasmussen wrote:
> The basic idea here is to "simulate" memory poisoning for VMs. A VM
> running on some host might encounter a memory error, after which some
> page(s) are poisoned (i.e., future accesses SIGBUS). They expect that
> once poisoned, pages can never become "un-poisoned". So, when we live
> migrate the VM, we need to preserve the poisoned status of these pages.
> 
> When live migrating, we try to get the guest running on its new host as
> quickly as possible. So, we start it running before all memory has been
> copied, and before we're certain which pages should be poisoned or not.
> 
> So the basic way to use this new feature is:
> 
> - On the new host, the guest's memory is registered with userfaultfd, in
>   either MISSING or MINOR mode (doesn't really matter for this purpose).
> - On any first access, we get a userfaultfd event. At this point we can
>   communicate with the old host to find out if the page was poisoned.

Just curious, what is this communication channel with the old host?
Axel Rasmussen May 11, 2023, 8:40 p.m. UTC | #2
On Thu, May 11, 2023 at 1:29 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 05/11/23 11:24, Axel Rasmussen wrote:
> > The basic idea here is to "simulate" memory poisoning for VMs. A VM
> > running on some host might encounter a memory error, after which some
> > page(s) are poisoned (i.e., future accesses SIGBUS). They expect that
> > once poisoned, pages can never become "un-poisoned". So, when we live
> > migrate the VM, we need to preserve the poisoned status of these pages.
> >
> > When live migrating, we try to get the guest running on its new host as
> > quickly as possible. So, we start it running before all memory has been
> > copied, and before we're certain which pages should be poisoned or not.
> >
> > So the basic way to use this new feature is:
> >
> > - On the new host, the guest's memory is registered with userfaultfd, in
> >   either MISSING or MINOR mode (doesn't really matter for this purpose).
> > - On any first access, we get a userfaultfd event. At this point we can
> >   communicate with the old host to find out if the page was poisoned.
>
> Just curious, what is this communication channel with the old host?

James can probably describe it in more detail / more correctly than I
can. My (possibly wrong :) ) understanding is:

On the source machine we maintain a bitmap indicating which pages are
clean or dirty (meaning, modified after the initial "precopy" of
memory to the target machine) or poisoned. Eventually the entire
bitmap is sent to the target machine, but this takes some time (maybe
seconds on large machines). After this point though we have all the
information we need, we no longer need to communicate with the source
to find out the status of pages (although there may still be some
memory contents to finish copying over).

In the meantime, I think the target machine can also ask the source
machine about the status of individual pages (for quick on-demand
paging).

As for the underlying mechanism, it's an internal protocol but the
publicly-available thing it's most similar to is probably gRPC [1]. At
a really basic level, we send binary serialized protocol buffers [2]
over the network in a request / response fashion.

[1] https://grpc.io/
[2] https://protobuf.dev/

> --
> Mike Kravetz
>
> > - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
> >   so any future accesses will SIGBUS. Because the pte is now "present",
> >   future accesses won't generate more userfaultfd events, they'll just
> >   SIGBUS directly.
> >
> > UFFDIO_SIGBUS does not handle unmapping previously-present PTEs. This
> > isn't needed, because during live migration we want to intercept
> > all accesses with userfaultfd (not just writes, so WP mode isn't useful
> > for this). So whether minor or missing mode is being used (or both), the
> > PTE won't be present in any case, so handling that case isn't needed.
> >
Axel Rasmussen May 11, 2023, 9:05 p.m. UTC | #3
On Thu, May 11, 2023 at 1:40 PM Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> On Thu, May 11, 2023 at 1:29 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > On 05/11/23 11:24, Axel Rasmussen wrote:

Apologies for the noise, I should have CC'ed +Jiaqi on this series
too, since he is working on other parts of the memory poisoning /
recovery stuff internally.

> > > The basic idea here is to "simulate" memory poisoning for VMs. A VM
> > > running on some host might encounter a memory error, after which some
> > > page(s) are poisoned (i.e., future accesses SIGBUS). They expect that
> > > once poisoned, pages can never become "un-poisoned". So, when we live
> > > migrate the VM, we need to preserve the poisoned status of these pages.
> > >
> > > When live migrating, we try to get the guest running on its new host as
> > > quickly as possible. So, we start it running before all memory has been
> > > copied, and before we're certain which pages should be poisoned or not.
> > >
> > > So the basic way to use this new feature is:
> > >
> > > - On the new host, the guest's memory is registered with userfaultfd, in
> > >   either MISSING or MINOR mode (doesn't really matter for this purpose).
> > > - On any first access, we get a userfaultfd event. At this point we can
> > >   communicate with the old host to find out if the page was poisoned.
> >
> > Just curious, what is this communication channel with the old host?
>
> James can probably describe it in more detail / more correctly than I
> can. My (possibly wrong :) ) understanding is:
>
> On the source machine we maintain a bitmap indicating which pages are
> clean or dirty (meaning, modified after the initial "precopy" of
> memory to the target machine) or poisoned. Eventually the entire
> bitmap is sent to the target machine, but this takes some time (maybe
> seconds on large machines). After this point though we have all the
> information we need, we no longer need to communicate with the source
> to find out the status of pages (although there may still be some
> memory contents to finish copying over).
>
> In the meantime, I think the target machine can also ask the source
> machine about the status of individual pages (for quick on-demand
> paging).
>
> As for the underlying mechanism, it's an internal protocol but the
> publicly-available thing it's most similar to is probably gRPC [1]. At
> a really basic level, we send binary serialized protocol buffers [2]
> over the network in a request / response fashion.
>
> [1] https://grpc.io/
> [2] https://protobuf.dev/
>
> > --
> > Mike Kravetz
> >
> > > - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
> > >   so any future accesses will SIGBUS. Because the pte is now "present",
> > >   future accesses won't generate more userfaultfd events, they'll just
> > >   SIGBUS directly.
> > >
> > > UFFDIO_SIGBUS does not handle unmapping previously-present PTEs. This
> > > isn't needed, because during live migration we want to intercept
> > > all accesses with userfaultfd (not just writes, so WP mode isn't useful
> > > for this). So whether minor or missing mode is being used (or both), the
> > > PTE won't be present in any case, so handling that case isn't needed.
> > >
James Houghton May 11, 2023, 10 p.m. UTC | #4
On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen
<axelrasmussen@google.com> wrote:
>
> So the basic way to use this new feature is:
>
> - On the new host, the guest's memory is registered with userfaultfd, in
>   either MISSING or MINOR mode (doesn't really matter for this purpose).
> - On any first access, we get a userfaultfd event. At this point we can
>   communicate with the old host to find out if the page was poisoned.
> - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
>   so any future accesses will SIGBUS. Because the pte is now "present",
>   future accesses won't generate more userfaultfd events, they'll just
>   SIGBUS directly.

I want to clarify the SIGBUS mechanism here when KVM is involved,
keeping in mind that we need to be able to inject an MCE into the
guest for this to be useful.

1. vCPU gets an EPT violation --> KVM attempts GUP.
2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
3. KVM finds that GUP failed and returns -EFAULT.

This is different than if GUP found poison, in which case KVM will
actually queue up a SIGBUS *containing the address of the fault*, and
userspace can use it to inject an appropriate MCE into the guest. With
UFFDIO_SIGBUS, we are missing the address!

I see three options:
1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
this is pointless.
2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a
UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS
instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated
accesses, just like how we get repeated signals for real poison.
3. Use this in conjunction with the additional KVM EFAULT info that
Anish proposed (the first part of [1]).

I think option 3 is fine. :)

[1]: https://lore.kernel.org/kvm/20230412213510.1220557-1-amoorthy@google.com/


- James
Peter Xu May 17, 2023, 10:12 p.m. UTC | #5
On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
> On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen
> <axelrasmussen@google.com> wrote:
> >
> > So the basic way to use this new feature is:
> >
> > - On the new host, the guest's memory is registered with userfaultfd, in
> >   either MISSING or MINOR mode (doesn't really matter for this purpose).
> > - On any first access, we get a userfaultfd event. At this point we can
> >   communicate with the old host to find out if the page was poisoned.
> > - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
> >   so any future accesses will SIGBUS. Because the pte is now "present",
> >   future accesses won't generate more userfaultfd events, they'll just
> >   SIGBUS directly.
> 
> I want to clarify the SIGBUS mechanism here when KVM is involved,
> keeping in mind that we need to be able to inject an MCE into the
> guest for this to be useful.
> 
> 1. vCPU gets an EPT violation --> KVM attempts GUP.
> 2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
> 3. KVM finds that GUP failed and returns -EFAULT.
> 
> This is different than if GUP found poison, in which case KVM will
> actually queue up a SIGBUS *containing the address of the fault*, and
> userspace can use it to inject an appropriate MCE into the guest. With
> UFFDIO_SIGBUS, we are missing the address!
> 
> I see three options:
> 1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
> this is pointless.
> 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a
> UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS
> instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated
> accesses, just like how we get repeated signals for real poison.
> 3. Use this in conjunction with the additional KVM EFAULT info that
> Anish proposed (the first part of [1]).
> 
> I think option 3 is fine. :)

Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)

Besides what James mentioned on "missing addr", I didn't quickly see what's
the major difference comparing to the old hwpoison injection methods even
without the addr requirement. If we want the addr for MCE then it's more of
a question to ask.

I also didn't quickly see why for whatever new way to inject a pte error we
need to have it registered with uffd.  Could it be something like
MADV_PGERR (even if MADV_HWPOISON won't suffice) so you can inject even
without an userfault context (but still usable when uffd registered)?

And it'll be alawys nice to have a cover letter too (if there'll be a new
version) explaining the bits.

Thanks,
Peter Xu May 17, 2023, 10:20 p.m. UTC | #6
On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
> On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
> > On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen
> > <axelrasmussen@google.com> wrote:
> > >
> > > So the basic way to use this new feature is:
> > >
> > > - On the new host, the guest's memory is registered with userfaultfd, in
> > >   either MISSING or MINOR mode (doesn't really matter for this purpose).
> > > - On any first access, we get a userfaultfd event. At this point we can
> > >   communicate with the old host to find out if the page was poisoned.
> > > - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
> > >   so any future accesses will SIGBUS. Because the pte is now "present",
> > >   future accesses won't generate more userfaultfd events, they'll just
> > >   SIGBUS directly.
> > 
> > I want to clarify the SIGBUS mechanism here when KVM is involved,
> > keeping in mind that we need to be able to inject an MCE into the
> > guest for this to be useful.
> > 
> > 1. vCPU gets an EPT violation --> KVM attempts GUP.
> > 2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
> > 3. KVM finds that GUP failed and returns -EFAULT.
> > 
> > This is different than if GUP found poison, in which case KVM will
> > actually queue up a SIGBUS *containing the address of the fault*, and
> > userspace can use it to inject an appropriate MCE into the guest. With
> > UFFDIO_SIGBUS, we are missing the address!
> > 
> > I see three options:
> > 1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
> > this is pointless.
> > 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a
> > UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS
> > instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated
> > accesses, just like how we get repeated signals for real poison.
> > 3. Use this in conjunction with the additional KVM EFAULT info that
> > Anish proposed (the first part of [1]).
> > 
> > I think option 3 is fine. :)
> 
> Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)

I just remember Axel mentioned this in the commit message, and just in case
this is why option 4) was ruled out:

        They expect that once poisoned, pages can never become
        "un-poisoned". So, when we live migrate the VM, we need to preserve
        the poisoned status of these pages.

Just to supplement on this point: we do have unpoison (echoing to
"debug/hwpoison/hwpoison_unpoison"), or am I wrong?

> 
> Besides what James mentioned on "missing addr", I didn't quickly see what's
> the major difference comparing to the old hwpoison injection methods even
> without the addr requirement. If we want the addr for MCE then it's more of
> a question to ask.
> 
> I also didn't quickly see why for whatever new way to inject a pte error we
> need to have it registered with uffd.  Could it be something like
> MADV_PGERR (even if MADV_HWPOISON won't suffice) so you can inject even
> without an userfault context (but still usable when uffd registered)?
> 
> And it'll be alawys nice to have a cover letter too (if there'll be a new
> version) explaining the bits.
> 
> Thanks,
> 
> -- 
> Peter Xu
Axel Rasmussen May 17, 2023, 10:28 p.m. UTC | #7
On Wed, May 17, 2023 at 3:20 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
> > On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
> > > On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen
> > > <axelrasmussen@google.com> wrote:
> > > >
> > > > So the basic way to use this new feature is:
> > > >
> > > > - On the new host, the guest's memory is registered with userfaultfd, in
> > > >   either MISSING or MINOR mode (doesn't really matter for this purpose).
> > > > - On any first access, we get a userfaultfd event. At this point we can
> > > >   communicate with the old host to find out if the page was poisoned.
> > > > - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
> > > >   so any future accesses will SIGBUS. Because the pte is now "present",
> > > >   future accesses won't generate more userfaultfd events, they'll just
> > > >   SIGBUS directly.
> > >
> > > I want to clarify the SIGBUS mechanism here when KVM is involved,
> > > keeping in mind that we need to be able to inject an MCE into the
> > > guest for this to be useful.
> > >
> > > 1. vCPU gets an EPT violation --> KVM attempts GUP.
> > > 2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
> > > 3. KVM finds that GUP failed and returns -EFAULT.
> > >
> > > This is different than if GUP found poison, in which case KVM will
> > > actually queue up a SIGBUS *containing the address of the fault*, and
> > > userspace can use it to inject an appropriate MCE into the guest. With
> > > UFFDIO_SIGBUS, we are missing the address!
> > >
> > > I see three options:
> > > 1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
> > > this is pointless.
> > > 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a
> > > UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS
> > > instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated
> > > accesses, just like how we get repeated signals for real poison.
> > > 3. Use this in conjunction with the additional KVM EFAULT info that
> > > Anish proposed (the first part of [1]).
> > >
> > > I think option 3 is fine. :)
> >
> > Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
>
> I just remember Axel mentioned this in the commit message, and just in case
> this is why option 4) was ruled out:
>
>         They expect that once poisoned, pages can never become
>         "un-poisoned". So, when we live migrate the VM, we need to preserve
>         the poisoned status of these pages.
>
> Just to supplement on this point: we do have unpoison (echoing to
> "debug/hwpoison/hwpoison_unpoison"), or am I wrong?
>
> >
> > Besides what James mentioned on "missing addr", I didn't quickly see what's
> > the major difference comparing to the old hwpoison injection methods even
> > without the addr requirement. If we want the addr for MCE then it's more of
> > a question to ask.
> >
> > I also didn't quickly see why for whatever new way to inject a pte error we
> > need to have it registered with uffd.  Could it be something like
> > MADV_PGERR (even if MADV_HWPOISON won't suffice) so you can inject even
> > without an userfault context (but still usable when uffd registered)?
> >
> > And it'll be alawys nice to have a cover letter too (if there'll be a new
> > version) explaining the bits.

I do plan a v2, if for no other reason than to update the
documentation. Happy to add a cover letter with it as well.

+Jiaqi back to CC, this is one piece of a larger memory poisoning /
recovery design Jiaqi is working on, so he may have some ideas why
MADV_HWPOISON or MADV_PGER will or won't work.

One idea is, at least for our use case, we have to have the range be
userfaultfd registered, because we need to intercept the first access
and check at that point whether or not it should be poisoned. But, I
think in principle a scheme like this could work:

1. Intercept first access with UFFD
2. Issue MADV_HWPOISON or MADV_PGERR or etc to put a pte denoting the
poisoned page in place
3. UFFDIO_WAKE to have the faulting thread retry, see the new entry, and SIGBUS

It's arguably slightly weird, since normally UFFD events are resolved
with UFFDIO_* operations, but I don't see why it *couldn't* work.

Then again I am not super familiar with MADV_HWPOISON, I will have to
do a bit of reading to understand if its semantics are the same
(future accesses to this address get SIGBUS).


> >
> > Thanks,
> >
> > --
> > Peter Xu
>
> --
> Peter Xu
>
Peter Xu May 18, 2023, 12:20 a.m. UTC | #8
Hi, Axel,

On Wed, May 17, 2023 at 03:28:36PM -0700, Axel Rasmussen wrote:
> I do plan a v2, if for no other reason than to update the
> documentation. Happy to add a cover letter with it as well.
> 
> +Jiaqi back to CC, this is one piece of a larger memory poisoning /
> recovery design Jiaqi is working on, so he may have some ideas why
> MADV_HWPOISON or MADV_PGER will or won't work.
> 
> One idea is, at least for our use case, we have to have the range be
> userfaultfd registered, because we need to intercept the first access
> and check at that point whether or not it should be poisoned. But, I
> think in principle a scheme like this could work:
> 
> 1. Intercept first access with UFFD
> 2. Issue MADV_HWPOISON or MADV_PGERR or etc to put a pte denoting the
> poisoned page in place
> 3. UFFDIO_WAKE to have the faulting thread retry, see the new entry, and SIGBUS
> 
> It's arguably slightly weird, since normally UFFD events are resolved
> with UFFDIO_* operations, but I don't see why it *couldn't* work.
> 
> Then again I am not super familiar with MADV_HWPOISON, I will have to
> do a bit of reading to understand if its semantics are the same
> (future accesses to this address get SIGBUS).

Yes, it'll be great if this can be checked up before sending v2.  What you
said match exactly what I was in mind. I hope it will already work, or we
can always discuss what is missing.
Jiaqi Yan May 18, 2023, 12:43 a.m. UTC | #9
On Wed, May 17, 2023 at 3:29 PM Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> On Wed, May 17, 2023 at 3:20 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
> > > On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
> > > > On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen
> > > > <axelrasmussen@google.com> wrote:
> > > > >
> > > > > So the basic way to use this new feature is:
> > > > >
> > > > > - On the new host, the guest's memory is registered with userfaultfd, in
> > > > >   either MISSING or MINOR mode (doesn't really matter for this purpose).
> > > > > - On any first access, we get a userfaultfd event. At this point we can
> > > > >   communicate with the old host to find out if the page was poisoned.
> > > > > - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
> > > > >   so any future accesses will SIGBUS. Because the pte is now "present",
> > > > >   future accesses won't generate more userfaultfd events, they'll just
> > > > >   SIGBUS directly.
> > > >
> > > > I want to clarify the SIGBUS mechanism here when KVM is involved,
> > > > keeping in mind that we need to be able to inject an MCE into the
> > > > guest for this to be useful.
> > > >
> > > > 1. vCPU gets an EPT violation --> KVM attempts GUP.
> > > > 2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
> > > > 3. KVM finds that GUP failed and returns -EFAULT.
> > > >
> > > > This is different than if GUP found poison, in which case KVM will
> > > > actually queue up a SIGBUS *containing the address of the fault*, and
> > > > userspace can use it to inject an appropriate MCE into the guest. With
> > > > UFFDIO_SIGBUS, we are missing the address!
> > > >
> > > > I see three options:
> > > > 1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
> > > > this is pointless.
> > > > 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a
> > > > UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS
> > > > instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated
> > > > accesses, just like how we get repeated signals for real poison.
> > > > 3. Use this in conjunction with the additional KVM EFAULT info that
> > > > Anish proposed (the first part of [1]).
> > > >
> > > > I think option 3 is fine. :)
> > >
> > > Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
> >
> > I just remember Axel mentioned this in the commit message, and just in case
> > this is why option 4) was ruled out:
> >
> >         They expect that once poisoned, pages can never become
> >         "un-poisoned". So, when we live migrate the VM, we need to preserve
> >         the poisoned status of these pages.
> >
> > Just to supplement on this point: we do have unpoison (echoing to
> > "debug/hwpoison/hwpoison_unpoison"), or am I wrong?

If I read unpoison_memory() correctly, once there is a real hardware
memory corruption (hw_memory_failure will be set), unpoison will stop
working and return EOPNOTSUPP.

I know some cloud providers evacuating VMs once a single memory error
happens, so not supporting unpoison is probably not a big deal for
them. BUT others do keep VM running until more errors show up later,
which could be long after the 1st error.

> >
> > >
> > > Besides what James mentioned on "missing addr", I didn't quickly see what's
> > > the major difference comparing to the old hwpoison injection methods even
> > > without the addr requirement. If we want the addr for MCE then it's more of
> > > a question to ask.
> > >
> > > I also didn't quickly see why for whatever new way to inject a pte error we
> > > need to have it registered with uffd.  Could it be something like
> > > MADV_PGERR (even if MADV_HWPOISON won't suffice) so you can inject even
> > > without an userfault context (but still usable when uffd registered)?
> > >
> > > And it'll be alawys nice to have a cover letter too (if there'll be a new
> > > version) explaining the bits.
>
> I do plan a v2, if for no other reason than to update the
> documentation. Happy to add a cover letter with it as well.
>
> +Jiaqi back to CC, this is one piece of a larger memory poisoning /
> recovery design Jiaqi is working on, so he may have some ideas why
> MADV_HWPOISON or MADV_PGER will or won't work.

Per https://man7.org/linux/man-pages/man2/madvise.2.html,
MADV_HWPOISON "is available only for privileged (CAP_SYS_ADMIN)
processes." So for a non-root VMM, MADV_HWPOISON is out of option.

Another issue with MADV_HWPOISON is, it requires to first successfully
get_user_pages_fast(). I don't think it will work if memory is not
mapped yet.

With the UFFDIO_SIGBUS feature introduced in this patchset, it may
even be possible to free the emulated-hwpoison page back to the kernel
so we don't lose a 4K page.

I didn't find any ref/doc for MADV_PGERR. Is it something you suggest
to build, Peter?


>
> One idea is, at least for our use case, we have to have the range be
> userfaultfd registered, because we need to intercept the first access
> and check at that point whether or not it should be poisoned. But, I
> think in principle a scheme like this could work:
>
> 1. Intercept first access with UFFD
> 2. Issue MADV_HWPOISON or MADV_PGERR or etc to put a pte denoting the
> poisoned page in place
> 3. UFFDIO_WAKE to have the faulting thread retry, see the new entry, and SIGBUS
>
> It's arguably slightly weird, since normally UFFD events are resolved
> with UFFDIO_* operations, but I don't see why it *couldn't* work.
>
> Then again I am not super familiar with MADV_HWPOISON, I will have to
> do a bit of reading to understand if its semantics are the same
> (future accesses to this address get SIGBUS).
>
>
> > >
> > > Thanks,
> > >
> > > --
> > > Peter Xu
> >
> > --
> > Peter Xu
> >
Peter Xu May 18, 2023, 4:05 p.m. UTC | #10
On Wed, May 17, 2023 at 05:43:53PM -0700, Jiaqi Yan wrote:
> On Wed, May 17, 2023 at 3:29 PM Axel Rasmussen <axelrasmussen@google.com> wrote:
> >
> > On Wed, May 17, 2023 at 3:20 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
> > > > On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
> > > > > On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen
> > > > > <axelrasmussen@google.com> wrote:
> > > > > >
> > > > > > So the basic way to use this new feature is:
> > > > > >
> > > > > > - On the new host, the guest's memory is registered with userfaultfd, in
> > > > > >   either MISSING or MINOR mode (doesn't really matter for this purpose).
> > > > > > - On any first access, we get a userfaultfd event. At this point we can
> > > > > >   communicate with the old host to find out if the page was poisoned.
> > > > > > - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
> > > > > >   so any future accesses will SIGBUS. Because the pte is now "present",
> > > > > >   future accesses won't generate more userfaultfd events, they'll just
> > > > > >   SIGBUS directly.
> > > > >
> > > > > I want to clarify the SIGBUS mechanism here when KVM is involved,
> > > > > keeping in mind that we need to be able to inject an MCE into the
> > > > > guest for this to be useful.
> > > > >
> > > > > 1. vCPU gets an EPT violation --> KVM attempts GUP.
> > > > > 2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
> > > > > 3. KVM finds that GUP failed and returns -EFAULT.
> > > > >
> > > > > This is different than if GUP found poison, in which case KVM will
> > > > > actually queue up a SIGBUS *containing the address of the fault*, and
> > > > > userspace can use it to inject an appropriate MCE into the guest. With
> > > > > UFFDIO_SIGBUS, we are missing the address!
> > > > >
> > > > > I see three options:
> > > > > 1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
> > > > > this is pointless.
> > > > > 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a
> > > > > UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS
> > > > > instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated
> > > > > accesses, just like how we get repeated signals for real poison.
> > > > > 3. Use this in conjunction with the additional KVM EFAULT info that
> > > > > Anish proposed (the first part of [1]).
> > > > >
> > > > > I think option 3 is fine. :)
> > > >
> > > > Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
> > >
> > > I just remember Axel mentioned this in the commit message, and just in case
> > > this is why option 4) was ruled out:
> > >
> > >         They expect that once poisoned, pages can never become
> > >         "un-poisoned". So, when we live migrate the VM, we need to preserve
> > >         the poisoned status of these pages.
> > >
> > > Just to supplement on this point: we do have unpoison (echoing to
> > > "debug/hwpoison/hwpoison_unpoison"), or am I wrong?
> 
> If I read unpoison_memory() correctly, once there is a real hardware
> memory corruption (hw_memory_failure will be set), unpoison will stop
> working and return EOPNOTSUPP.
> 
> I know some cloud providers evacuating VMs once a single memory error
> happens, so not supporting unpoison is probably not a big deal for
> them. BUT others do keep VM running until more errors show up later,
> which could be long after the 1st error.

We're talking about postcopy migrating a VM has poisoned page on src,
rather than on dst host, am I right?  IOW, the dest hwpoison should be
fake.

If so, then I would assume that's the case where all the pages on the dest
host is still all good (so hw_memory_failure not yet set, or I doubt the
judgement of being a migration target after all)?

The other thing is even if dest host has hw poisoned page, I'm not sure
whether hw_memory_failure is the only way to solve this.

I saw that this is something got worked on before from Zhenwei, David used
to have some reasoning on why it was suggested like using a global knob:

https://lore.kernel.org/all/d7927214-e433-c26d-7a9c-a291ced81887@redhat.com/

Two major issues here afaics:

  - Zhenwei's approach only considered x86 hwpoison - it relies on kpte
    having !present in entries but that's x86 specific rather than generic
    to memory_failure.c.

  - It is _assumed_ that hwpoison injection is for debugging only.

I'm not sure whether you can fix 1) by some other ways, e.g., what if the
host just remember all the hardware poisoned pfns (or remember
soft-poisoned ones, but then here we need to be careful on removing them
from the list when it's hwpoisoned for real)?  It sounds like there's
opportunity on providing a generic solution rather than relying on
!pte_present().

For 2) IMHO that's not a big issue, you can declare it'll be used in !debug
but production systems so as to boost the feature importance with a real
use case.

So far I'd say it'll be great to leverage what it's already there in linux
and make it as generic as possible. The only issue is probably
CAP_ADMIN... not sure whether we can have some way to provide !ADMIN
somehow, or you can simply work around this issue.

> 
> > >
> > > >
> > > > Besides what James mentioned on "missing addr", I didn't quickly see what's
> > > > the major difference comparing to the old hwpoison injection methods even
> > > > without the addr requirement. If we want the addr for MCE then it's more of
> > > > a question to ask.
> > > >
> > > > I also didn't quickly see why for whatever new way to inject a pte error we
> > > > need to have it registered with uffd.  Could it be something like
> > > > MADV_PGERR (even if MADV_HWPOISON won't suffice) so you can inject even
> > > > without an userfault context (but still usable when uffd registered)?
> > > >
> > > > And it'll be alawys nice to have a cover letter too (if there'll be a new
> > > > version) explaining the bits.
> >
> > I do plan a v2, if for no other reason than to update the
> > documentation. Happy to add a cover letter with it as well.
> >
> > +Jiaqi back to CC, this is one piece of a larger memory poisoning /
> > recovery design Jiaqi is working on, so he may have some ideas why
> > MADV_HWPOISON or MADV_PGER will or won't work.
> 
> Per https://man7.org/linux/man-pages/man2/madvise.2.html,
> MADV_HWPOISON "is available only for privileged (CAP_SYS_ADMIN)
> processes." So for a non-root VMM, MADV_HWPOISON is out of option.

It makes sense to me especially when the page can be shared with other
tasks.

> 
> Another issue with MADV_HWPOISON is, it requires to first successfully
> get_user_pages_fast(). I don't think it will work if memory is not
> mapped yet.

Fair point, so probably current MADV_HWPOISON got ruled out.
hwpoison-inject seems fine where only the PFN is needed rather than the
pte. But same issue on CAP_ADMIN indeed.

> 
> With the UFFDIO_SIGBUS feature introduced in this patchset, it may
> even be possible to free the emulated-hwpoison page back to the kernel
> so we don't lose a 4K page.
> 
> I didn't find any ref/doc for MADV_PGERR. Is it something you suggest
> to build, Peter?

That's something I made up just to show my question on why such an
interface (even if wanted) needs to be bound to userfaultfd, e.g. a
madvise() seems working if someone sololy want to install a poisoned pte.

IIUC even with an madvise one may not need CAP_ADMIN since we can limit the
op to current mm only, I assume it's safe.

Here you'd want to return VM_FAULT_HWPOISON for whatever swap pte you'd
like to install (in do_swap_page) with whatever new interface (assuming
still a new madvise). As James mentioned, I think KVM liked that to
recognize -EHWPOISON from -EFAULT.  I'd say we can even consider reusing
PTE_MARKER_SWAPIN_ERROR to let it just return VM_FAULT_HWPOISON directly if
so.

Thanks,
Axel Rasmussen May 18, 2023, 8:38 p.m. UTC | #11
On Thu, May 18, 2023 at 9:05 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, May 17, 2023 at 05:43:53PM -0700, Jiaqi Yan wrote:
> > On Wed, May 17, 2023 at 3:29 PM Axel Rasmussen <axelrasmussen@google.com> wrote:
> > >
> > > On Wed, May 17, 2023 at 3:20 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
> > > > > On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
> > > > > > On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen
> > > > > > <axelrasmussen@google.com> wrote:
> > > > > > >
> > > > > > > So the basic way to use this new feature is:
> > > > > > >
> > > > > > > - On the new host, the guest's memory is registered with userfaultfd, in
> > > > > > >   either MISSING or MINOR mode (doesn't really matter for this purpose).
> > > > > > > - On any first access, we get a userfaultfd event. At this point we can
> > > > > > >   communicate with the old host to find out if the page was poisoned.
> > > > > > > - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
> > > > > > >   so any future accesses will SIGBUS. Because the pte is now "present",
> > > > > > >   future accesses won't generate more userfaultfd events, they'll just
> > > > > > >   SIGBUS directly.
> > > > > >
> > > > > > I want to clarify the SIGBUS mechanism here when KVM is involved,
> > > > > > keeping in mind that we need to be able to inject an MCE into the
> > > > > > guest for this to be useful.
> > > > > >
> > > > > > 1. vCPU gets an EPT violation --> KVM attempts GUP.
> > > > > > 2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
> > > > > > 3. KVM finds that GUP failed and returns -EFAULT.
> > > > > >
> > > > > > This is different than if GUP found poison, in which case KVM will
> > > > > > actually queue up a SIGBUS *containing the address of the fault*, and
> > > > > > userspace can use it to inject an appropriate MCE into the guest. With
> > > > > > UFFDIO_SIGBUS, we are missing the address!
> > > > > >
> > > > > > I see three options:
> > > > > > 1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
> > > > > > this is pointless.
> > > > > > 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a
> > > > > > UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS
> > > > > > instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated
> > > > > > accesses, just like how we get repeated signals for real poison.
> > > > > > 3. Use this in conjunction with the additional KVM EFAULT info that
> > > > > > Anish proposed (the first part of [1]).
> > > > > >
> > > > > > I think option 3 is fine. :)
> > > > >
> > > > > Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
> > > >
> > > > I just remember Axel mentioned this in the commit message, and just in case
> > > > this is why option 4) was ruled out:
> > > >
> > > >         They expect that once poisoned, pages can never become
> > > >         "un-poisoned". So, when we live migrate the VM, we need to preserve
> > > >         the poisoned status of these pages.
> > > >
> > > > Just to supplement on this point: we do have unpoison (echoing to
> > > > "debug/hwpoison/hwpoison_unpoison"), or am I wrong?
> >
> > If I read unpoison_memory() correctly, once there is a real hardware
> > memory corruption (hw_memory_failure will be set), unpoison will stop
> > working and return EOPNOTSUPP.
> >
> > I know some cloud providers evacuating VMs once a single memory error
> > happens, so not supporting unpoison is probably not a big deal for
> > them. BUT others do keep VM running until more errors show up later,
> > which could be long after the 1st error.
>
> We're talking about postcopy migrating a VM has poisoned page on src,
> rather than on dst host, am I right?  IOW, the dest hwpoison should be
> fake.
>
> If so, then I would assume that's the case where all the pages on the dest
> host is still all good (so hw_memory_failure not yet set, or I doubt the
> judgement of being a migration target after all)?
>
> The other thing is even if dest host has hw poisoned page, I'm not sure
> whether hw_memory_failure is the only way to solve this.
>
> I saw that this is something got worked on before from Zhenwei, David used
> to have some reasoning on why it was suggested like using a global knob:
>
> https://lore.kernel.org/all/d7927214-e433-c26d-7a9c-a291ced81887@redhat.com/
>
> Two major issues here afaics:
>
>   - Zhenwei's approach only considered x86 hwpoison - it relies on kpte
>     having !present in entries but that's x86 specific rather than generic
>     to memory_failure.c.
>
>   - It is _assumed_ that hwpoison injection is for debugging only.
>
> I'm not sure whether you can fix 1) by some other ways, e.g., what if the
> host just remember all the hardware poisoned pfns (or remember
> soft-poisoned ones, but then here we need to be careful on removing them
> from the list when it's hwpoisoned for real)?  It sounds like there's
> opportunity on providing a generic solution rather than relying on
> !pte_present().
>
> For 2) IMHO that's not a big issue, you can declare it'll be used in !debug
> but production systems so as to boost the feature importance with a real
> use case.
>
> So far I'd say it'll be great to leverage what it's already there in linux
> and make it as generic as possible. The only issue is probably
> CAP_ADMIN... not sure whether we can have some way to provide !ADMIN
> somehow, or you can simply work around this issue.

As you mention below I think the key distinction is the scope - I
think MADV_HWPOISON affects the whole system, including other
processes.

For our purposes, we really just want to "poison" this particular
virtual address (the HVA, from the VM's perspective), not even other
mappings of the same shared memory. I think that behavior is different
from MADV_HWPOISON, at least.

>
> >
> > > >
> > > > >
> > > > > Besides what James mentioned on "missing addr", I didn't quickly see what's
> > > > > the major difference comparing to the old hwpoison injection methods even
> > > > > without the addr requirement. If we want the addr for MCE then it's more of
> > > > > a question to ask.
> > > > >
> > > > > I also didn't quickly see why for whatever new way to inject a pte error we
> > > > > need to have it registered with uffd.  Could it be something like
> > > > > MADV_PGERR (even if MADV_HWPOISON won't suffice) so you can inject even
> > > > > without an userfault context (but still usable when uffd registered)?
> > > > >
> > > > > And it'll be alawys nice to have a cover letter too (if there'll be a new
> > > > > version) explaining the bits.
> > >
> > > I do plan a v2, if for no other reason than to update the
> > > documentation. Happy to add a cover letter with it as well.
> > >
> > > +Jiaqi back to CC, this is one piece of a larger memory poisoning /
> > > recovery design Jiaqi is working on, so he may have some ideas why
> > > MADV_HWPOISON or MADV_PGER will or won't work.
> >
> > Per https://man7.org/linux/man-pages/man2/madvise.2.html,
> > MADV_HWPOISON "is available only for privileged (CAP_SYS_ADMIN)
> > processes." So for a non-root VMM, MADV_HWPOISON is out of option.
>
> It makes sense to me especially when the page can be shared with other
> tasks.
>
> >
> > Another issue with MADV_HWPOISON is, it requires to first successfully
> > get_user_pages_fast(). I don't think it will work if memory is not
> > mapped yet.
>
> Fair point, so probably current MADV_HWPOISON got ruled out.
> hwpoison-inject seems fine where only the PFN is needed rather than the
> pte. But same issue on CAP_ADMIN indeed.
>
> >
> > With the UFFDIO_SIGBUS feature introduced in this patchset, it may
> > even be possible to free the emulated-hwpoison page back to the kernel
> > so we don't lose a 4K page.
> >
> > I didn't find any ref/doc for MADV_PGERR. Is it something you suggest
> > to build, Peter?
>
> That's something I made up just to show my question on why such an
> interface (even if wanted) needs to be bound to userfaultfd, e.g. a
> madvise() seems working if someone sololy want to install a poisoned pte.

I look at it a bit differently...

Even existing UFFDIO_* operations could technically be separated from
userfaultfd. You could imagine a MADV_MAP_PAGE instead of
UFFDIO_CONTINUE. UFFDIO_COPY is a bit trickier since it takes an
argument, but it could be done with process_madvise(). (Granted, I'm
not sure this would be useful... But this is equally true for
UFFDIO_SIGBUS; it seems non-live-migration use cases could use
MADV_HWPOISON, and for live migration use cases we will be using
UFFD.)

We've sort of setup a convention with userfaultfd where at a high
level users are supposed to:

1. Receive events from the uffd
2. Resolve those events with UFFDIO_* ioctls
3. Wake up with UFFDIO_WAKE to retry the fault that generated the
original event (can be combined with step 2 of course)

So for me, even if MADV_PGERR or similar existed, I would be tempted
to add a UFFDIO_SIGBUS as well, even if it just calls the same
underlying function to do the same thing, if only for consistency
(with the idea "UFFD events are resolved by UFFD ioctls") from the
user's perspective.


>
> IIUC even with an madvise one may not need CAP_ADMIN since we can limit the
> op to current mm only, I assume it's safe.
>
> Here you'd want to return VM_FAULT_HWPOISON for whatever swap pte you'd
> like to install (in do_swap_page) with whatever new interface (assuming
> still a new madvise). As James mentioned, I think KVM liked that to
> recognize -EHWPOISON from -EFAULT.  I'd say we can even consider reusing
> PTE_MARKER_SWAPIN_ERROR to let it just return VM_FAULT_HWPOISON directly if
> so.
>
> Thanks,
>
> --
> Peter Xu
>
Peter Xu May 18, 2023, 9:38 p.m. UTC | #12
On Thu, May 18, 2023 at 01:38:09PM -0700, Axel Rasmussen wrote:
> On Thu, May 18, 2023 at 9:05 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, May 17, 2023 at 05:43:53PM -0700, Jiaqi Yan wrote:
> > > On Wed, May 17, 2023 at 3:29 PM Axel Rasmussen <axelrasmussen@google.com> wrote:
> > > >
> > > > On Wed, May 17, 2023 at 3:20 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
> > > > > > On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
> > > > > > > On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen
> > > > > > > <axelrasmussen@google.com> wrote:
> > > > > > > >
> > > > > > > > So the basic way to use this new feature is:
> > > > > > > >
> > > > > > > > - On the new host, the guest's memory is registered with userfaultfd, in
> > > > > > > >   either MISSING or MINOR mode (doesn't really matter for this purpose).
> > > > > > > > - On any first access, we get a userfaultfd event. At this point we can
> > > > > > > >   communicate with the old host to find out if the page was poisoned.
> > > > > > > > - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
> > > > > > > >   so any future accesses will SIGBUS. Because the pte is now "present",
> > > > > > > >   future accesses won't generate more userfaultfd events, they'll just
> > > > > > > >   SIGBUS directly.
> > > > > > >
> > > > > > > I want to clarify the SIGBUS mechanism here when KVM is involved,
> > > > > > > keeping in mind that we need to be able to inject an MCE into the
> > > > > > > guest for this to be useful.
> > > > > > >
> > > > > > > 1. vCPU gets an EPT violation --> KVM attempts GUP.
> > > > > > > 2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
> > > > > > > 3. KVM finds that GUP failed and returns -EFAULT.
> > > > > > >
> > > > > > > This is different than if GUP found poison, in which case KVM will
> > > > > > > actually queue up a SIGBUS *containing the address of the fault*, and
> > > > > > > userspace can use it to inject an appropriate MCE into the guest. With
> > > > > > > UFFDIO_SIGBUS, we are missing the address!
> > > > > > >
> > > > > > > I see three options:
> > > > > > > 1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
> > > > > > > this is pointless.
> > > > > > > 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a
> > > > > > > UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS
> > > > > > > instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated
> > > > > > > accesses, just like how we get repeated signals for real poison.
> > > > > > > 3. Use this in conjunction with the additional KVM EFAULT info that
> > > > > > > Anish proposed (the first part of [1]).
> > > > > > >
> > > > > > > I think option 3 is fine. :)
> > > > > >
> > > > > > Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
> > > > >
> > > > > I just remember Axel mentioned this in the commit message, and just in case
> > > > > this is why option 4) was ruled out:
> > > > >
> > > > >         They expect that once poisoned, pages can never become
> > > > >         "un-poisoned". So, when we live migrate the VM, we need to preserve
> > > > >         the poisoned status of these pages.
> > > > >
> > > > > Just to supplement on this point: we do have unpoison (echoing to
> > > > > "debug/hwpoison/hwpoison_unpoison"), or am I wrong?
> > >
> > > If I read unpoison_memory() correctly, once there is a real hardware
> > > memory corruption (hw_memory_failure will be set), unpoison will stop
> > > working and return EOPNOTSUPP.
> > >
> > > I know some cloud providers evacuating VMs once a single memory error
> > > happens, so not supporting unpoison is probably not a big deal for
> > > them. BUT others do keep VM running until more errors show up later,
> > > which could be long after the 1st error.
> >
> > We're talking about postcopy migrating a VM has poisoned page on src,
> > rather than on dst host, am I right?  IOW, the dest hwpoison should be
> > fake.
> >
> > If so, then I would assume that's the case where all the pages on the dest
> > host is still all good (so hw_memory_failure not yet set, or I doubt the
> > judgement of being a migration target after all)?
> >
> > The other thing is even if dest host has hw poisoned page, I'm not sure
> > whether hw_memory_failure is the only way to solve this.
> >
> > I saw that this is something got worked on before from Zhenwei, David used
> > to have some reasoning on why it was suggested like using a global knob:
> >
> > https://lore.kernel.org/all/d7927214-e433-c26d-7a9c-a291ced81887@redhat.com/
> >
> > Two major issues here afaics:
> >
> >   - Zhenwei's approach only considered x86 hwpoison - it relies on kpte
> >     having !present in entries but that's x86 specific rather than generic
> >     to memory_failure.c.
> >
> >   - It is _assumed_ that hwpoison injection is for debugging only.
> >
> > I'm not sure whether you can fix 1) by some other ways, e.g., what if the
> > host just remember all the hardware poisoned pfns (or remember
> > soft-poisoned ones, but then here we need to be careful on removing them
> > from the list when it's hwpoisoned for real)?  It sounds like there's
> > opportunity on providing a generic solution rather than relying on
> > !pte_present().
> >
> > For 2) IMHO that's not a big issue, you can declare it'll be used in !debug
> > but production systems so as to boost the feature importance with a real
> > use case.
> >
> > So far I'd say it'll be great to leverage what it's already there in linux
> > and make it as generic as possible. The only issue is probably
> > CAP_ADMIN... not sure whether we can have some way to provide !ADMIN
> > somehow, or you can simply work around this issue.
> 
> As you mention below I think the key distinction is the scope - I
> think MADV_HWPOISON affects the whole system, including other
> processes.
> 
> For our purposes, we really just want to "poison" this particular
> virtual address (the HVA, from the VM's perspective), not even other
> mappings of the same shared memory. I think that behavior is different
> from MADV_HWPOISON, at least.
> 
> >
> > >
> > > > >
> > > > > >
> > > > > > Besides what James mentioned on "missing addr", I didn't quickly see what's
> > > > > > the major difference comparing to the old hwpoison injection methods even
> > > > > > without the addr requirement. If we want the addr for MCE then it's more of
> > > > > > a question to ask.
> > > > > >
> > > > > > I also didn't quickly see why for whatever new way to inject a pte error we
> > > > > > need to have it registered with uffd.  Could it be something like
> > > > > > MADV_PGERR (even if MADV_HWPOISON won't suffice) so you can inject even
> > > > > > without an userfault context (but still usable when uffd registered)?
> > > > > >
> > > > > > And it'll be alawys nice to have a cover letter too (if there'll be a new
> > > > > > version) explaining the bits.
> > > >
> > > > I do plan a v2, if for no other reason than to update the
> > > > documentation. Happy to add a cover letter with it as well.
> > > >
> > > > +Jiaqi back to CC, this is one piece of a larger memory poisoning /
> > > > recovery design Jiaqi is working on, so he may have some ideas why
> > > > MADV_HWPOISON or MADV_PGER will or won't work.
> > >
> > > Per https://man7.org/linux/man-pages/man2/madvise.2.html,
> > > MADV_HWPOISON "is available only for privileged (CAP_SYS_ADMIN)
> > > processes." So for a non-root VMM, MADV_HWPOISON is out of option.
> >
> > It makes sense to me especially when the page can be shared with other
> > tasks.
> >
> > >
> > > Another issue with MADV_HWPOISON is, it requires to first successfully
> > > get_user_pages_fast(). I don't think it will work if memory is not
> > > mapped yet.
> >
> > Fair point, so probably current MADV_HWPOISON got ruled out.
> > hwpoison-inject seems fine where only the PFN is needed rather than the
> > pte. But same issue on CAP_ADMIN indeed.
> >
> > >
> > > With the UFFDIO_SIGBUS feature introduced in this patchset, it may
> > > even be possible to free the emulated-hwpoison page back to the kernel
> > > so we don't lose a 4K page.
> > >
> > > I didn't find any ref/doc for MADV_PGERR. Is it something you suggest
> > > to build, Peter?
> >
> > That's something I made up just to show my question on why such an
> > interface (even if wanted) needs to be bound to userfaultfd, e.g. a
> > madvise() seems working if someone sololy want to install a poisoned pte.
> 
> I look at it a bit differently...
> 
> Even existing UFFDIO_* operations could technically be separated from
> userfaultfd. You could imagine a MADV_MAP_PAGE instead of
> UFFDIO_CONTINUE. UFFDIO_COPY is a bit trickier since it takes an
> argument, but it could be done with process_madvise(). (Granted, I'm
> not sure this would be useful... But this is equally true for
> UFFDIO_SIGBUS; it seems non-live-migration use cases could use
> MADV_HWPOISON, and for live migration use cases we will be using
> UFFD.)
> 
> We've sort of setup a convention with userfaultfd where at a high
> level users are supposed to:
> 
> 1. Receive events from the uffd
> 2. Resolve those events with UFFDIO_* ioctls
> 3. Wake up with UFFDIO_WAKE to retry the fault that generated the
> original event (can be combined with step 2 of course)
> 
> So for me, even if MADV_PGERR or similar existed, I would be tempted
> to add a UFFDIO_SIGBUS as well, even if it just calls the same
> underlying function to do the same thing, if only for consistency
> (with the idea "UFFD events are resolved by UFFD ioctls") from the
> user's perspective.

I don't worry too much on "consistency", but I'm trying to understand
whether it's more beneficial to combine it with uffd or being generic.

One thing I was thinking is if I have a library that manages some memory
for the user, the library can use such madvise()/... to poison specific
small pages (without registering uffd with sigbus mode, also no lose on
page faults of other normal pages) so when illegal access it can trap it
for current mm rather than silently happen (e.g. use after free).  Unpoison
is also easy there, we can simply DONTNEED it.

One defect of such general solution for your case is we need one more
UFFDIO_WAKE, but since we're talking about real poisoned pages on src, so I
guess it's not a concern (unlike most of the rest ioctls).

I've no strong opinion if you still want to do that with an userfault
ioctl.  After all, I can't provide a solid example but just some rough
ideas.  But I hope I explained why I think it's still different from other
ioctls (e.g., an "atomic update a page" operation doesn't sound reasonable
at all as generic operation for any !uffd context, so that definitely
suites more as an uffd specific ioctl).

If with uffd, perhaps avoid calling it sigbus? As we have FEATURE_SIGBUS
and I'm afraid it'll cause confusion.  UFFDIO_HWPOISON may sound more
suitable?

Thanks,
Peter Xu May 18, 2023, 9:50 p.m. UTC | #13
On Thu, May 18, 2023 at 05:38:14PM -0400, Peter Xu wrote:
> If with uffd, perhaps avoid calling it sigbus? As we have FEATURE_SIGBUS
> and I'm afraid it'll cause confusion.  UFFDIO_HWPOISON may sound more
> suitable?

Or UFFDIO_POISON (to identify it from real hw poisons)..
David Hildenbrand May 19, 2023, 8:38 a.m. UTC | #14
On 18.05.23 22:38, Axel Rasmussen wrote:
> On Thu, May 18, 2023 at 9:05 AM Peter Xu <peterx@redhat.com> wrote:
>>
>> On Wed, May 17, 2023 at 05:43:53PM -0700, Jiaqi Yan wrote:
>>> On Wed, May 17, 2023 at 3:29 PM Axel Rasmussen <axelrasmussen@google.com> wrote:
>>>>
>>>> On Wed, May 17, 2023 at 3:20 PM Peter Xu <peterx@redhat.com> wrote:
>>>>>
>>>>> On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
>>>>>> On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
>>>>>>> On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen
>>>>>>> <axelrasmussen@google.com> wrote:
>>>>>>>>
>>>>>>>> So the basic way to use this new feature is:
>>>>>>>>
>>>>>>>> - On the new host, the guest's memory is registered with userfaultfd, in
>>>>>>>>    either MISSING or MINOR mode (doesn't really matter for this purpose).
>>>>>>>> - On any first access, we get a userfaultfd event. At this point we can
>>>>>>>>    communicate with the old host to find out if the page was poisoned.
>>>>>>>> - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
>>>>>>>>    so any future accesses will SIGBUS. Because the pte is now "present",
>>>>>>>>    future accesses won't generate more userfaultfd events, they'll just
>>>>>>>>    SIGBUS directly.
>>>>>>>
>>>>>>> I want to clarify the SIGBUS mechanism here when KVM is involved,
>>>>>>> keeping in mind that we need to be able to inject an MCE into the
>>>>>>> guest for this to be useful.
>>>>>>>
>>>>>>> 1. vCPU gets an EPT violation --> KVM attempts GUP.
>>>>>>> 2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
>>>>>>> 3. KVM finds that GUP failed and returns -EFAULT.
>>>>>>>
>>>>>>> This is different than if GUP found poison, in which case KVM will
>>>>>>> actually queue up a SIGBUS *containing the address of the fault*, and
>>>>>>> userspace can use it to inject an appropriate MCE into the guest. With
>>>>>>> UFFDIO_SIGBUS, we are missing the address!
>>>>>>>
>>>>>>> I see three options:
>>>>>>> 1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
>>>>>>> this is pointless.
>>>>>>> 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a
>>>>>>> UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS
>>>>>>> instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated
>>>>>>> accesses, just like how we get repeated signals for real poison.
>>>>>>> 3. Use this in conjunction with the additional KVM EFAULT info that
>>>>>>> Anish proposed (the first part of [1]).
>>>>>>>
>>>>>>> I think option 3 is fine. :)
>>>>>>
>>>>>> Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
>>>>>
>>>>> I just remember Axel mentioned this in the commit message, and just in case
>>>>> this is why option 4) was ruled out:
>>>>>
>>>>>          They expect that once poisoned, pages can never become
>>>>>          "un-poisoned". So, when we live migrate the VM, we need to preserve
>>>>>          the poisoned status of these pages.
>>>>>
>>>>> Just to supplement on this point: we do have unpoison (echoing to
>>>>> "debug/hwpoison/hwpoison_unpoison"), or am I wrong?
>>>
>>> If I read unpoison_memory() correctly, once there is a real hardware
>>> memory corruption (hw_memory_failure will be set), unpoison will stop
>>> working and return EOPNOTSUPP.
>>>
>>> I know some cloud providers evacuating VMs once a single memory error
>>> happens, so not supporting unpoison is probably not a big deal for
>>> them. BUT others do keep VM running until more errors show up later,
>>> which could be long after the 1st error.
>>
>> We're talking about postcopy migrating a VM has poisoned page on src,
>> rather than on dst host, am I right?  IOW, the dest hwpoison should be
>> fake.
>>
>> If so, then I would assume that's the case where all the pages on the dest
>> host is still all good (so hw_memory_failure not yet set, or I doubt the
>> judgement of being a migration target after all)?
>>
>> The other thing is even if dest host has hw poisoned page, I'm not sure
>> whether hw_memory_failure is the only way to solve this.
>>
>> I saw that this is something got worked on before from Zhenwei, David used
>> to have some reasoning on why it was suggested like using a global knob:
>>
>> https://lore.kernel.org/all/d7927214-e433-c26d-7a9c-a291ced81887@redhat.com/
>>
>> Two major issues here afaics:
>>
>>    - Zhenwei's approach only considered x86 hwpoison - it relies on kpte
>>      having !present in entries but that's x86 specific rather than generic
>>      to memory_failure.c.
>>
>>    - It is _assumed_ that hwpoison injection is for debugging only.
>>
>> I'm not sure whether you can fix 1) by some other ways, e.g., what if the
>> host just remember all the hardware poisoned pfns (or remember
>> soft-poisoned ones, but then here we need to be careful on removing them
>> from the list when it's hwpoisoned for real)?  It sounds like there's
>> opportunity on providing a generic solution rather than relying on
>> !pte_present().
>>
>> For 2) IMHO that's not a big issue, you can declare it'll be used in !debug
>> but production systems so as to boost the feature importance with a real
>> use case.
>>
>> So far I'd say it'll be great to leverage what it's already there in linux
>> and make it as generic as possible. The only issue is probably
>> CAP_ADMIN... not sure whether we can have some way to provide !ADMIN
>> somehow, or you can simply work around this issue.
> 
> As you mention below I think the key distinction is the scope - I
> think MADV_HWPOISON affects the whole system, including other
> processes.
> 
> For our purposes, we really just want to "poison" this particular
> virtual address (the HVA, from the VM's perspective), not even other
> mappings of the same shared memory. I think that behavior is different
> from MADV_HWPOISON, at least.

MADV_HWPOISON really is the wrong interface to use. See "man madvise".

We don't want to allow arbitrary users to hwpoison+offline absolutely 
healthy physical memory, which is what MADV_HWPOISON is all about.

As you say, we want to turn an unpopulated (!present) virtual address to 
mimic like we had a MCE on a page that would have been previously mapped 
here: install a hwpoison marker without actually poisoning any present 
page. In fact, we'd even want to fail if there *is* something mapped.

Sure, one could teach MADV_HWPOISON to allow unprivileged users to do 
that for !present PTE entries, and fail for unprivileged users if there 
is a present PTE entry. I'm not sure if that's the cleanest approach, 
though, and a new MADV as suggested in this thread would eventually be 
cleaner.
Jiaqi Yan May 19, 2023, 3:04 p.m. UTC | #15
On Fri, May 19, 2023 at 1:38 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 18.05.23 22:38, Axel Rasmussen wrote:
> > On Thu, May 18, 2023 at 9:05 AM Peter Xu <peterx@redhat.com> wrote:
> >>
> >> On Wed, May 17, 2023 at 05:43:53PM -0700, Jiaqi Yan wrote:
> >>> On Wed, May 17, 2023 at 3:29 PM Axel Rasmussen <axelrasmussen@google.com> wrote:
> >>>>
> >>>> On Wed, May 17, 2023 at 3:20 PM Peter Xu <peterx@redhat.com> wrote:
> >>>>>
> >>>>> On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
> >>>>>> On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
> >>>>>>> On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen
> >>>>>>> <axelrasmussen@google.com> wrote:
> >>>>>>>>
> >>>>>>>> So the basic way to use this new feature is:
> >>>>>>>>
> >>>>>>>> - On the new host, the guest's memory is registered with userfaultfd, in
> >>>>>>>>    either MISSING or MINOR mode (doesn't really matter for this purpose).
> >>>>>>>> - On any first access, we get a userfaultfd event. At this point we can
> >>>>>>>>    communicate with the old host to find out if the page was poisoned.
> >>>>>>>> - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
> >>>>>>>>    so any future accesses will SIGBUS. Because the pte is now "present",
> >>>>>>>>    future accesses won't generate more userfaultfd events, they'll just
> >>>>>>>>    SIGBUS directly.
> >>>>>>>
> >>>>>>> I want to clarify the SIGBUS mechanism here when KVM is involved,
> >>>>>>> keeping in mind that we need to be able to inject an MCE into the
> >>>>>>> guest for this to be useful.
> >>>>>>>
> >>>>>>> 1. vCPU gets an EPT violation --> KVM attempts GUP.
> >>>>>>> 2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
> >>>>>>> 3. KVM finds that GUP failed and returns -EFAULT.
> >>>>>>>
> >>>>>>> This is different than if GUP found poison, in which case KVM will
> >>>>>>> actually queue up a SIGBUS *containing the address of the fault*, and
> >>>>>>> userspace can use it to inject an appropriate MCE into the guest. With
> >>>>>>> UFFDIO_SIGBUS, we are missing the address!
> >>>>>>>
> >>>>>>> I see three options:
> >>>>>>> 1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
> >>>>>>> this is pointless.
> >>>>>>> 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a
> >>>>>>> UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS
> >>>>>>> instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated
> >>>>>>> accesses, just like how we get repeated signals for real poison.
> >>>>>>> 3. Use this in conjunction with the additional KVM EFAULT info that
> >>>>>>> Anish proposed (the first part of [1]).
> >>>>>>>
> >>>>>>> I think option 3 is fine. :)
> >>>>>>
> >>>>>> Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
> >>>>>
> >>>>> I just remember Axel mentioned this in the commit message, and just in case
> >>>>> this is why option 4) was ruled out:
> >>>>>
> >>>>>          They expect that once poisoned, pages can never become
> >>>>>          "un-poisoned". So, when we live migrate the VM, we need to preserve
> >>>>>          the poisoned status of these pages.
> >>>>>
> >>>>> Just to supplement on this point: we do have unpoison (echoing to
> >>>>> "debug/hwpoison/hwpoison_unpoison"), or am I wrong?
> >>>
> >>> If I read unpoison_memory() correctly, once there is a real hardware
> >>> memory corruption (hw_memory_failure will be set), unpoison will stop
> >>> working and return EOPNOTSUPP.
> >>>
> >>> I know some cloud providers evacuating VMs once a single memory error
> >>> happens, so not supporting unpoison is probably not a big deal for
> >>> them. BUT others do keep VM running until more errors show up later,
> >>> which could be long after the 1st error.
> >>
> >> We're talking about postcopy migrating a VM has poisoned page on src,
> >> rather than on dst host, am I right?  IOW, the dest hwpoison should be
> >> fake.

Yes, for this we are on the same page. The scenario I want to describe is...

> >>
> >> If so, then I would assume that's the case where all the pages on the dest
> >> host is still all good (so hw_memory_failure not yet set, or I doubt the

...target VM can get hw error anytime: before precopy (if cloud
provider is not carefully monitoring the machine health), during
precopy from src to target, during src blackout, during postcopy,
after migration done, and keep running on host. Both MADV_HWPOISON[1]
and hwpoison-inject[2] are subject to hw_memory_failure, so they just
seems unreliable to me: if target is in memory error trouble before or
in early phase of migration, we lose the unpoison feature in kernel.

[1] https://github.com/torvalds/linux/blob/2d1bcbc6cd703e64caf8df314e3669b4786e008a/mm/madvise.c#L1130
[2] https://github.com/torvalds/linux/blob/2d1bcbc6cd703e64caf8df314e3669b4786e008a/mm/hwpoison-inject.c#L51

> >> judgement of being a migration target after all)?
> >>
> >> The other thing is even if dest host has hw poisoned page, I'm not sure
> >> whether hw_memory_failure is the only way to solve this.
> >>
> >> I saw that this is something got worked on before from Zhenwei, David used
> >> to have some reasoning on why it was suggested like using a global knob:
> >>
> >> https://lore.kernel.org/all/d7927214-e433-c26d-7a9c-a291ced81887@redhat.com/
> >>
> >> Two major issues here afaics:
> >>
> >>    - Zhenwei's approach only considered x86 hwpoison - it relies on kpte
> >>      having !present in entries but that's x86 specific rather than generic
> >>      to memory_failure.c.
> >>
> >>    - It is _assumed_ that hwpoison injection is for debugging only.
> >>
> >> I'm not sure whether you can fix 1) by some other ways, e.g., what if the
> >> host just remember all the hardware poisoned pfns (or remember
> >> soft-poisoned ones, but then here we need to be careful on removing them
> >> from the list when it's hwpoisoned for real)?  It sounds like there's
> >> opportunity on providing a generic solution rather than relying on
> >> !pte_present().
> >>
> >> For 2) IMHO that's not a big issue, you can declare it'll be used in !debug
> >> but production systems so as to boost the feature importance with a real
> >> use case.
> >>
> >> So far I'd say it'll be great to leverage what it's already there in linux
> >> and make it as generic as possible. The only issue is probably
> >> CAP_ADMIN... not sure whether we can have some way to provide !ADMIN
> >> somehow, or you can simply work around this issue.

I don't think CAP_ADMIN is something we can work around: a VMM must be
a good citizen to avoid introducing any vulnerability to the host or
guest.

On the other hand, "Userfaults allow the implementation of on-demand
paging from userland and more generally they allow userland to take
control of various memory page faults, something otherwise only the
kernel code could do." [3]. I am not familiar with the UFFD internals,
but our use case seems to match what UFFD wants to provide: without
affecting the whole world, give a specific userspace (without
CAP_ADMIN) the ability to handle page faults (indirectly emulate a
HWPOISON page (in my mind I treat it as SetHWPOISON(page) +
TestHWPOISON(page) operation in kernel's PF code)). So is it fair to
say what Axel provided here is "provide !ADMIN somehow"?

[3]https://docs.kernel.org/admin-guide/mm/userfaultfd.html

> >
> > As you mention below I think the key distinction is the scope - I
> > think MADV_HWPOISON affects the whole system, including other
> > processes.
> >
> > For our purposes, we really just want to "poison" this particular
> > virtual address (the HVA, from the VM's perspective), not even other
> > mappings of the same shared memory. I think that behavior is different
> > from MADV_HWPOISON, at least.
>
> MADV_HWPOISON really is the wrong interface to use. See "man madvise".
>
> We don't want to allow arbitrary users to hwpoison+offline absolutely
> healthy physical memory, which is what MADV_HWPOISON is all about.
>
> As you say, we want to turn an unpopulated (!present) virtual address to
> mimic like we had a MCE on a page that would have been previously mapped
> here: install a hwpoison marker without actually poisoning any present
> page. In fact, we'd even want to fail if there *is* something mapped.
>
> Sure, one could teach MADV_HWPOISON to allow unprivileged users to do
> that for !present PTE entries, and fail for unprivileged users if there
> is a present PTE entry. I'm not sure if that's the cleanest approach,
> though, and a new MADV as suggested in this thread would eventually be
> cleaner.
>
> --
> Thanks,
>
> David / dhildenb
>
Peter Xu May 19, 2023, 4:20 p.m. UTC | #16
Hi, Jiaqi,

On Fri, May 19, 2023 at 08:04:09AM -0700, Jiaqi Yan wrote:
> I don't think CAP_ADMIN is something we can work around: a VMM must be
> a good citizen to avoid introducing any vulnerability to the host or
> guest.
> 
> On the other hand, "Userfaults allow the implementation of on-demand
> paging from userland and more generally they allow userland to take
> control of various memory page faults, something otherwise only the
> kernel code could do." [3]. I am not familiar with the UFFD internals,
> but our use case seems to match what UFFD wants to provide: without
> affecting the whole world, give a specific userspace (without
> CAP_ADMIN) the ability to handle page faults (indirectly emulate a
> HWPOISON page (in my mind I treat it as SetHWPOISON(page) +
> TestHWPOISON(page) operation in kernel's PF code)). So is it fair to
> say what Axel provided here is "provide !ADMIN somehow"?
> 
> [3]https://docs.kernel.org/admin-guide/mm/userfaultfd.html

Userfault keywords on "user", IMHO.  We don't strictly need userfault to
resolve anything regarding CAP_ADMIN problems.  MADV_DONTNEED also dosn't
need CAP_ADMIN, same to any new madvise() if we want to make it useful for
injecting poisoned ptes with !ADMIN and limit it within current->mm.

But I think you're right that userfaultfd always tried to avoid having
ADMIN and keep everything within its own scope of permissions.

So again, totally no objection on make it uffd specific for now if you guys
are all happy with it, but just to be clear that it's (to me) mostly for
avoiding another WAKE, and afaics that's not really for solving the ADMIN
issue here.

Thanks,
Axel Rasmussen May 19, 2023, 5:32 p.m. UTC | #17
On Fri, May 19, 2023 at 9:20 AM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Jiaqi,
>
> On Fri, May 19, 2023 at 08:04:09AM -0700, Jiaqi Yan wrote:
> > I don't think CAP_ADMIN is something we can work around: a VMM must be
> > a good citizen to avoid introducing any vulnerability to the host or
> > guest.
> >
> > On the other hand, "Userfaults allow the implementation of on-demand
> > paging from userland and more generally they allow userland to take
> > control of various memory page faults, something otherwise only the
> > kernel code could do." [3]. I am not familiar with the UFFD internals,
> > but our use case seems to match what UFFD wants to provide: without
> > affecting the whole world, give a specific userspace (without
> > CAP_ADMIN) the ability to handle page faults (indirectly emulate a
> > HWPOISON page (in my mind I treat it as SetHWPOISON(page) +
> > TestHWPOISON(page) operation in kernel's PF code)). So is it fair to
> > say what Axel provided here is "provide !ADMIN somehow"?
> >
> > [3]https://docs.kernel.org/admin-guide/mm/userfaultfd.html
>
> Userfault keywords on "user", IMHO.  We don't strictly need userfault to
> resolve anything regarding CAP_ADMIN problems.  MADV_DONTNEED also dosn't
> need CAP_ADMIN, same to any new madvise() if we want to make it useful for
> injecting poisoned ptes with !ADMIN and limit it within current->mm.
>
> But I think you're right that userfaultfd always tried to avoid having
> ADMIN and keep everything within its own scope of permissions.
>
> So again, totally no objection on make it uffd specific for now if you guys
> are all happy with it, but just to be clear that it's (to me) mostly for
> avoiding another WAKE, and afaics that's not really for solving the ADMIN
> issue here.

How about this plan:

Since the concrete use case we have (postcopy live migration) is
UFFD-specific, let's leave it as a UFFDIO_* operation for now.

If in the future we come up with a non-UFFD use case, we can add a new
MADV_* which does this operation at that point. From my perspective
they could even share most of the same implementation code.

I don't think it's a big problem keeping the UFFDIO_* version too at
that point, because it still provides some (perhaps small) value:

- Combines the operation + waking into one syscall
- It allows us to support additional UFFD flags which modify / extend
the operation in UFFD-specific ways, if we want to add those in the
future

Seem reasonable?

If so, I'll send a v2 with documentation updates.

>
> Thanks,
>
> --
> Peter Xu
>
Peter Xu May 23, 2023, 5:26 p.m. UTC | #18
On Thu, May 11, 2023 at 11:24:24AM -0700, Axel Rasmussen wrote:
> The basic idea here is to "simulate" memory poisoning for VMs. A VM
> running on some host might encounter a memory error, after which some
> page(s) are poisoned (i.e., future accesses SIGBUS). They expect that
> once poisoned, pages can never become "un-poisoned". So, when we live
> migrate the VM, we need to preserve the poisoned status of these pages.
> 
> When live migrating, we try to get the guest running on its new host as
> quickly as possible. So, we start it running before all memory has been
> copied, and before we're certain which pages should be poisoned or not.
> 
> So the basic way to use this new feature is:
> 
> - On the new host, the guest's memory is registered with userfaultfd, in
>   either MISSING or MINOR mode (doesn't really matter for this purpose).
> - On any first access, we get a userfaultfd event. At this point we can
>   communicate with the old host to find out if the page was poisoned.
> - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker

[as used to suggest..] maybe UFFDIO_POISON sounds better.

>   so any future accesses will SIGBUS. Because the pte is now "present",
>   future accesses won't generate more userfaultfd events, they'll just
>   SIGBUS directly.
> 
> UFFDIO_SIGBUS does not handle unmapping previously-present PTEs. This
> isn't needed, because during live migration we want to intercept
> all accesses with userfaultfd (not just writes, so WP mode isn't useful
> for this). So whether minor or missing mode is being used (or both), the
> PTE won't be present in any case, so handling that case isn't needed.
> 
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> ---
>  fs/userfaultfd.c                 | 63 ++++++++++++++++++++++++++++++++
>  include/linux/swapops.h          |  3 +-
>  include/linux/userfaultfd_k.h    |  4 ++
>  include/uapi/linux/userfaultfd.h | 25 +++++++++++--
>  mm/memory.c                      |  4 ++
>  mm/userfaultfd.c                 | 62 ++++++++++++++++++++++++++++++-
>  6 files changed, 156 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 0fd96d6e39ce..edc2928dae2b 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1966,6 +1966,66 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
>  	return ret;
>  }
>  
> +static inline int userfaultfd_sigbus(struct userfaultfd_ctx *ctx, unsigned long arg)
> +{
> +	__s64 ret;
> +	struct uffdio_sigbus uffdio_sigbus;
> +	struct uffdio_sigbus __user *user_uffdio_sigbus;
> +	struct userfaultfd_wake_range range;
> +
> +	user_uffdio_sigbus = (struct uffdio_sigbus __user *)arg;
> +
> +	ret = -EAGAIN;
> +	if (atomic_read(&ctx->mmap_changing))
> +		goto out;
> +
> +	ret = -EFAULT;
> +	if (copy_from_user(&uffdio_sigbus, user_uffdio_sigbus,
> +			   /* don't copy the output fields */
> +			   sizeof(uffdio_sigbus) - (sizeof(__s64))))
> +		goto out;
> +
> +	ret = validate_range(ctx->mm, uffdio_sigbus.range.start,
> +			     uffdio_sigbus.range.len);
> +	if (ret)
> +		goto out;
> +
> +	ret = -EINVAL;
> +	/* double check for wraparound just in case. */
> +	if (uffdio_sigbus.range.start + uffdio_sigbus.range.len <=
> +	    uffdio_sigbus.range.start) {
> +		goto out;
> +	}
> +	if (uffdio_sigbus.mode & ~UFFDIO_SIGBUS_MODE_DONTWAKE)
> +		goto out;
> +
> +	if (mmget_not_zero(ctx->mm)) {
> +		ret = mfill_atomic_sigbus(ctx->mm, uffdio_sigbus.range.start,
> +					  uffdio_sigbus.range.len,
> +					  &ctx->mmap_changing, 0);
> +		mmput(ctx->mm);
> +	} else {
> +		return -ESRCH;
> +	}
> +
> +	if (unlikely(put_user(ret, &user_uffdio_sigbus->updated)))
> +		return -EFAULT;
> +	if (ret < 0)
> +		goto out;
> +
> +	/* len == 0 would wake all */
> +	BUG_ON(!ret);
> +	range.len = ret;
> +	if (!(uffdio_sigbus.mode & UFFDIO_SIGBUS_MODE_DONTWAKE)) {
> +		range.start = uffdio_sigbus.range.start;
> +		wake_userfault(ctx, &range);
> +	}
> +	ret = range.len == uffdio_sigbus.range.len ? 0 : -EAGAIN;
> +
> +out:
> +	return ret;
> +}
> +
>  static inline unsigned int uffd_ctx_features(__u64 user_features)
>  {
>  	/*
> @@ -2067,6 +2127,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
>  	case UFFDIO_CONTINUE:
>  		ret = userfaultfd_continue(ctx, arg);
>  		break;
> +	case UFFDIO_SIGBUS:
> +		ret = userfaultfd_sigbus(ctx, arg);
> +		break;
>  	}
>  	return ret;
>  }
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 3a451b7afcb3..fa778a0ae730 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -405,7 +405,8 @@ typedef unsigned long pte_marker;
>  
>  #define  PTE_MARKER_UFFD_WP			BIT(0)
>  #define  PTE_MARKER_SWAPIN_ERROR		BIT(1)
> -#define  PTE_MARKER_MASK			(BIT(2) - 1)
> +#define  PTE_MARKER_UFFD_SIGBUS			BIT(2)
> +#define  PTE_MARKER_MASK			(BIT(3) - 1)

[as used to suggest..] I'd consider reusing SWAPIN_ERROR directly.

Actually.. I think maybe we should have 1 patch changing SWAPIN_ERROR from
VM_FAULT_SIGBUS to VM_FAULT_HWPOISON.

Let's imagine a VM having anonymous page backing and got a swapin error
when faulted on one of the guest page.  Instead of crashing the hypervisor
with sigbus we should probably make it a MCE injected into the guest too,
because there's no page corrupt in bare metal in this specific case,
however to the guest it's the same as having one page corrupted just like a
real MCE.

>  
>  static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
>  {
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index d78b01524349..6de1084939c5 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -46,6 +46,7 @@ enum mfill_atomic_mode {
>  	MFILL_ATOMIC_COPY,
>  	MFILL_ATOMIC_ZEROPAGE,
>  	MFILL_ATOMIC_CONTINUE,
> +	MFILL_ATOMIC_SIGBUS,
>  	NR_MFILL_ATOMIC_MODES,
>  };
>  
> @@ -83,6 +84,9 @@ extern ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm,
>  extern ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long dst_start,
>  				     unsigned long len, atomic_t *mmap_changing,
>  				     uffd_flags_t flags);
> +extern ssize_t mfill_atomic_sigbus(struct mm_struct *dst_mm, unsigned long start,
> +				   unsigned long len, atomic_t *mmap_changing,
> +				   uffd_flags_t flags);
>  extern int mwriteprotect_range(struct mm_struct *dst_mm,
>  			       unsigned long start, unsigned long len,
>  			       bool enable_wp, atomic_t *mmap_changing);
> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> index 66dd4cd277bd..616e33d3db97 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -39,7 +39,8 @@
>  			   UFFD_FEATURE_MINOR_SHMEM |		\
>  			   UFFD_FEATURE_EXACT_ADDRESS |		\
>  			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM |	\
> -			   UFFD_FEATURE_WP_UNPOPULATED)
> +			   UFFD_FEATURE_WP_UNPOPULATED |	\
> +			   UFFD_FEATURE_SIGBUS_IOCTL)
>  #define UFFD_API_IOCTLS				\
>  	((__u64)1 << _UFFDIO_REGISTER |		\
>  	 (__u64)1 << _UFFDIO_UNREGISTER |	\
> @@ -49,12 +50,14 @@
>  	 (__u64)1 << _UFFDIO_COPY |		\
>  	 (__u64)1 << _UFFDIO_ZEROPAGE |		\
>  	 (__u64)1 << _UFFDIO_WRITEPROTECT |	\
> -	 (__u64)1 << _UFFDIO_CONTINUE)
> +	 (__u64)1 << _UFFDIO_CONTINUE |		\
> +	 (__u64)1 << _UFFDIO_SIGBUS)
>  #define UFFD_API_RANGE_IOCTLS_BASIC		\
>  	((__u64)1 << _UFFDIO_WAKE |		\
>  	 (__u64)1 << _UFFDIO_COPY |		\
> +	 (__u64)1 << _UFFDIO_WRITEPROTECT |	\
>  	 (__u64)1 << _UFFDIO_CONTINUE |		\
> -	 (__u64)1 << _UFFDIO_WRITEPROTECT)
> +	 (__u64)1 << _UFFDIO_SIGBUS)
>  
>  /*
>   * Valid ioctl command number range with this API is from 0x00 to
> @@ -71,6 +74,7 @@
>  #define _UFFDIO_ZEROPAGE		(0x04)
>  #define _UFFDIO_WRITEPROTECT		(0x06)
>  #define _UFFDIO_CONTINUE		(0x07)
> +#define _UFFDIO_SIGBUS			(0x08)
>  #define _UFFDIO_API			(0x3F)
>  
>  /* userfaultfd ioctl ids */
> @@ -91,6 +95,8 @@
>  				      struct uffdio_writeprotect)
>  #define UFFDIO_CONTINUE		_IOWR(UFFDIO, _UFFDIO_CONTINUE,	\
>  				      struct uffdio_continue)
> +#define UFFDIO_SIGBUS		_IOWR(UFFDIO, _UFFDIO_SIGBUS, \
> +				      struct uffdio_sigbus)
>  
>  /* read() structure */
>  struct uffd_msg {
> @@ -225,6 +231,7 @@ struct uffdio_api {
>  #define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
>  #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
>  #define UFFD_FEATURE_WP_UNPOPULATED		(1<<13)
> +#define UFFD_FEATURE_SIGBUS_IOCTL		(1<<14)
>  	__u64 features;
>  
>  	__u64 ioctls;
> @@ -321,6 +328,18 @@ struct uffdio_continue {
>  	__s64 mapped;
>  };
>  
> +struct uffdio_sigbus {
> +	struct uffdio_range range;
> +#define UFFDIO_SIGBUS_MODE_DONTWAKE		((__u64)1<<0)
> +	__u64 mode;
> +
> +	/*
> +	 * Fields below here are written by the ioctl and must be at the end:
> +	 * the copy_from_user will not read past here.
> +	 */
> +	__s64 updated;
> +};
> +
>  /*
>   * Flags for the userfaultfd(2) system call itself.
>   */
> diff --git a/mm/memory.c b/mm/memory.c
> index f69fbc251198..e4b4207c2590 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3675,6 +3675,10 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
>  	if (WARN_ON_ONCE(!marker))
>  		return VM_FAULT_SIGBUS;
>  
> +	/* SIGBUS explicitly requested for this PTE. */
> +	if (marker & PTE_MARKER_UFFD_SIGBUS)
> +		return VM_FAULT_SIGBUS;
> +
>  	/* Higher priority than uffd-wp when data corrupted */
>  	if (marker & PTE_MARKER_SWAPIN_ERROR)
>  		return VM_FAULT_SIGBUS;
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index e97a0b4889fc..933587eebd5d 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -278,6 +278,51 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
>  	goto out;
>  }
>  
> +/* Handles UFFDIO_SIGBUS for all non-hugetlb VMAs. */
> +static int mfill_atomic_pte_sigbus(pmd_t *dst_pmd,
> +				   struct vm_area_struct *dst_vma,
> +				   unsigned long dst_addr,
> +				   uffd_flags_t flags)
> +{
> +	int ret;
> +	struct mm_struct *dst_mm = dst_vma->vm_mm;
> +	pte_t _dst_pte, *dst_pte;
> +	spinlock_t *ptl;
> +
> +	_dst_pte = make_pte_marker(PTE_MARKER_UFFD_SIGBUS);
> +	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
> +
> +	if (vma_is_shmem(dst_vma)) {
> +		struct inode *inode;
> +		pgoff_t offset, max_off;
> +
> +		/* serialize against truncate with the page table lock */
> +		inode = dst_vma->vm_file->f_inode;
> +		offset = linear_page_index(dst_vma, dst_addr);
> +		max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> +		ret = -EFAULT;
> +		if (unlikely(offset >= max_off))
> +			goto out_unlock;
> +	}
> +
> +	ret = -EEXIST;
> +	/*
> +	 * For now, we don't handle unmapping pages, so only support filling in
> +	 * none PTEs, or replacing PTE markers.
> +	 */
> +	if (!pte_none_mostly(*dst_pte))
> +		goto out_unlock;
> +
> +	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
> +
> +	/* No need to invalidate - it was non-present before */
> +	update_mmu_cache(dst_vma, dst_addr, dst_pte);
> +	ret = 0;
> +out_unlock:
> +	pte_unmap_unlock(dst_pte, ptl);
> +	return ret;
> +}
> +
>  static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
>  {
>  	pgd_t *pgd;
> @@ -328,8 +373,12 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
>  	 * supported by hugetlb.  A PMD_SIZE huge pages may exist as used
>  	 * by THP.  Since we can not reliably insert a zero page, this
>  	 * feature is not supported.
> +	 *
> +	 * PTE marker handling for hugetlb is a bit special, so for now
> +	 * UFFDIO_SIGBUS is not supported.

Can you be more specific on this?

What's the plan when HGM will be merged?  Is it possible that all memory
just support this always so we only need 1 feature flag?

>  	 */
> -	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) {
> +	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE) ||
> +	    uffd_flags_mode_is(flags, MFILL_ATOMIC_SIGBUS)) {
>  		mmap_read_unlock(dst_mm);
>  		return -EINVAL;
>  	}
> @@ -473,6 +522,9 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
>  	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
>  		return mfill_atomic_pte_continue(dst_pmd, dst_vma,
>  						 dst_addr, flags);
> +	} else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_SIGBUS)) {
> +		return mfill_atomic_pte_sigbus(dst_pmd, dst_vma,
> +					       dst_addr, flags);
>  	}
>  
>  	/*
> @@ -694,6 +746,14 @@ ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long start,
>  			    uffd_flags_set_mode(flags, MFILL_ATOMIC_CONTINUE));
>  }
>  
> +ssize_t mfill_atomic_sigbus(struct mm_struct *dst_mm, unsigned long start,
> +			    unsigned long len, atomic_t *mmap_changing,
> +			    uffd_flags_t flags)
> +{
> +	return mfill_atomic(dst_mm, start, 0, len, mmap_changing,
> +			    uffd_flags_set_mode(flags, MFILL_ATOMIC_SIGBUS));
> +}
> +
>  long uffd_wp_range(struct vm_area_struct *dst_vma,
>  		   unsigned long start, unsigned long len, bool enable_wp)
>  {
> -- 
> 2.40.1.606.ga4b1b128d6-goog
>
Peter Xu May 23, 2023, 5:27 p.m. UTC | #19
On Fri, May 19, 2023 at 10:32:13AM -0700, Axel Rasmussen wrote:
> On Fri, May 19, 2023 at 9:20 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, Jiaqi,
> >
> > On Fri, May 19, 2023 at 08:04:09AM -0700, Jiaqi Yan wrote:
> > > I don't think CAP_ADMIN is something we can work around: a VMM must be
> > > a good citizen to avoid introducing any vulnerability to the host or
> > > guest.
> > >
> > > On the other hand, "Userfaults allow the implementation of on-demand
> > > paging from userland and more generally they allow userland to take
> > > control of various memory page faults, something otherwise only the
> > > kernel code could do." [3]. I am not familiar with the UFFD internals,
> > > but our use case seems to match what UFFD wants to provide: without
> > > affecting the whole world, give a specific userspace (without
> > > CAP_ADMIN) the ability to handle page faults (indirectly emulate a
> > > HWPOISON page (in my mind I treat it as SetHWPOISON(page) +
> > > TestHWPOISON(page) operation in kernel's PF code)). So is it fair to
> > > say what Axel provided here is "provide !ADMIN somehow"?
> > >
> > > [3]https://docs.kernel.org/admin-guide/mm/userfaultfd.html
> >
> > Userfault keywords on "user", IMHO.  We don't strictly need userfault to
> > resolve anything regarding CAP_ADMIN problems.  MADV_DONTNEED also dosn't
> > need CAP_ADMIN, same to any new madvise() if we want to make it useful for
> > injecting poisoned ptes with !ADMIN and limit it within current->mm.
> >
> > But I think you're right that userfaultfd always tried to avoid having
> > ADMIN and keep everything within its own scope of permissions.
> >
> > So again, totally no objection on make it uffd specific for now if you guys
> > are all happy with it, but just to be clear that it's (to me) mostly for
> > avoiding another WAKE, and afaics that's not really for solving the ADMIN
> > issue here.
> 
> How about this plan:
> 
> Since the concrete use case we have (postcopy live migration) is
> UFFD-specific, let's leave it as a UFFDIO_* operation for now.
> 
> If in the future we come up with a non-UFFD use case, we can add a new
> MADV_* which does this operation at that point. From my perspective
> they could even share most of the same implementation code.
> 
> I don't think it's a big problem keeping the UFFDIO_* version too at
> that point, because it still provides some (perhaps small) value:
> 
> - Combines the operation + waking into one syscall
> - It allows us to support additional UFFD flags which modify / extend
> the operation in UFFD-specific ways, if we want to add those in the
> future
> 
> Seem reasonable?

Ok here.

> 
> If so, I'll send a v2 with documentation updates.

I've reviewed v1 in this case, please have a look, thanks.
Axel Rasmussen May 23, 2023, 5:59 p.m. UTC | #20
On Tue, May 23, 2023 at 10:26 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, May 11, 2023 at 11:24:24AM -0700, Axel Rasmussen wrote:
> > The basic idea here is to "simulate" memory poisoning for VMs. A VM
> > running on some host might encounter a memory error, after which some
> > page(s) are poisoned (i.e., future accesses SIGBUS). They expect that
> > once poisoned, pages can never become "un-poisoned". So, when we live
> > migrate the VM, we need to preserve the poisoned status of these pages.
> >
> > When live migrating, we try to get the guest running on its new host as
> > quickly as possible. So, we start it running before all memory has been
> > copied, and before we're certain which pages should be poisoned or not.
> >
> > So the basic way to use this new feature is:
> >
> > - On the new host, the guest's memory is registered with userfaultfd, in
> >   either MISSING or MINOR mode (doesn't really matter for this purpose).
> > - On any first access, we get a userfaultfd event. At this point we can
> >   communicate with the old host to find out if the page was poisoned.
> > - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
>
> [as used to suggest..] maybe UFFDIO_POISON sounds better.
>
> >   so any future accesses will SIGBUS. Because the pte is now "present",
> >   future accesses won't generate more userfaultfd events, they'll just
> >   SIGBUS directly.
> >
> > UFFDIO_SIGBUS does not handle unmapping previously-present PTEs. This
> > isn't needed, because during live migration we want to intercept
> > all accesses with userfaultfd (not just writes, so WP mode isn't useful
> > for this). So whether minor or missing mode is being used (or both), the
> > PTE won't be present in any case, so handling that case isn't needed.
> >
> > Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> > ---
> >  fs/userfaultfd.c                 | 63 ++++++++++++++++++++++++++++++++
> >  include/linux/swapops.h          |  3 +-
> >  include/linux/userfaultfd_k.h    |  4 ++
> >  include/uapi/linux/userfaultfd.h | 25 +++++++++++--
> >  mm/memory.c                      |  4 ++
> >  mm/userfaultfd.c                 | 62 ++++++++++++++++++++++++++++++-
> >  6 files changed, 156 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 0fd96d6e39ce..edc2928dae2b 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -1966,6 +1966,66 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
> >       return ret;
> >  }
> >
> > +static inline int userfaultfd_sigbus(struct userfaultfd_ctx *ctx, unsigned long arg)
> > +{
> > +     __s64 ret;
> > +     struct uffdio_sigbus uffdio_sigbus;
> > +     struct uffdio_sigbus __user *user_uffdio_sigbus;
> > +     struct userfaultfd_wake_range range;
> > +
> > +     user_uffdio_sigbus = (struct uffdio_sigbus __user *)arg;
> > +
> > +     ret = -EAGAIN;
> > +     if (atomic_read(&ctx->mmap_changing))
> > +             goto out;
> > +
> > +     ret = -EFAULT;
> > +     if (copy_from_user(&uffdio_sigbus, user_uffdio_sigbus,
> > +                        /* don't copy the output fields */
> > +                        sizeof(uffdio_sigbus) - (sizeof(__s64))))
> > +             goto out;
> > +
> > +     ret = validate_range(ctx->mm, uffdio_sigbus.range.start,
> > +                          uffdio_sigbus.range.len);
> > +     if (ret)
> > +             goto out;
> > +
> > +     ret = -EINVAL;
> > +     /* double check for wraparound just in case. */
> > +     if (uffdio_sigbus.range.start + uffdio_sigbus.range.len <=
> > +         uffdio_sigbus.range.start) {
> > +             goto out;
> > +     }
> > +     if (uffdio_sigbus.mode & ~UFFDIO_SIGBUS_MODE_DONTWAKE)
> > +             goto out;
> > +
> > +     if (mmget_not_zero(ctx->mm)) {
> > +             ret = mfill_atomic_sigbus(ctx->mm, uffdio_sigbus.range.start,
> > +                                       uffdio_sigbus.range.len,
> > +                                       &ctx->mmap_changing, 0);
> > +             mmput(ctx->mm);
> > +     } else {
> > +             return -ESRCH;
> > +     }
> > +
> > +     if (unlikely(put_user(ret, &user_uffdio_sigbus->updated)))
> > +             return -EFAULT;
> > +     if (ret < 0)
> > +             goto out;
> > +
> > +     /* len == 0 would wake all */
> > +     BUG_ON(!ret);
> > +     range.len = ret;
> > +     if (!(uffdio_sigbus.mode & UFFDIO_SIGBUS_MODE_DONTWAKE)) {
> > +             range.start = uffdio_sigbus.range.start;
> > +             wake_userfault(ctx, &range);
> > +     }
> > +     ret = range.len == uffdio_sigbus.range.len ? 0 : -EAGAIN;
> > +
> > +out:
> > +     return ret;
> > +}
> > +
> >  static inline unsigned int uffd_ctx_features(__u64 user_features)
> >  {
> >       /*
> > @@ -2067,6 +2127,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
> >       case UFFDIO_CONTINUE:
> >               ret = userfaultfd_continue(ctx, arg);
> >               break;
> > +     case UFFDIO_SIGBUS:
> > +             ret = userfaultfd_sigbus(ctx, arg);
> > +             break;
> >       }
> >       return ret;
> >  }
> > diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> > index 3a451b7afcb3..fa778a0ae730 100644
> > --- a/include/linux/swapops.h
> > +++ b/include/linux/swapops.h
> > @@ -405,7 +405,8 @@ typedef unsigned long pte_marker;
> >
> >  #define  PTE_MARKER_UFFD_WP                  BIT(0)
> >  #define  PTE_MARKER_SWAPIN_ERROR             BIT(1)
> > -#define  PTE_MARKER_MASK                     (BIT(2) - 1)
> > +#define  PTE_MARKER_UFFD_SIGBUS                      BIT(2)
> > +#define  PTE_MARKER_MASK                     (BIT(3) - 1)
>
> [as used to suggest..] I'd consider reusing SWAPIN_ERROR directly.
>
> Actually.. I think maybe we should have 1 patch changing SWAPIN_ERROR from
> VM_FAULT_SIGBUS to VM_FAULT_HWPOISON.
>
> Let's imagine a VM having anonymous page backing and got a swapin error
> when faulted on one of the guest page.  Instead of crashing the hypervisor
> with sigbus we should probably make it a MCE injected into the guest too,
> because there's no page corrupt in bare metal in this specific case,
> however to the guest it's the same as having one page corrupted just like a
> real MCE.

This is a great idea, you're right that injecting an MCE into the
guest is exactly the end goal, and it seems like VM_FAULT_HWPOISON
will "just work". Also the name UFFDIO_POISON resolves any confusion
with UFFD_FEATURE_SIGBUS, so that's a nice side benefit.

I'll make this change in v2, thanks for the idea Peter!

>
> >
> >  static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
> >  {
> > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > index d78b01524349..6de1084939c5 100644
> > --- a/include/linux/userfaultfd_k.h
> > +++ b/include/linux/userfaultfd_k.h
> > @@ -46,6 +46,7 @@ enum mfill_atomic_mode {
> >       MFILL_ATOMIC_COPY,
> >       MFILL_ATOMIC_ZEROPAGE,
> >       MFILL_ATOMIC_CONTINUE,
> > +     MFILL_ATOMIC_SIGBUS,
> >       NR_MFILL_ATOMIC_MODES,
> >  };
> >
> > @@ -83,6 +84,9 @@ extern ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm,
> >  extern ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long dst_start,
> >                                    unsigned long len, atomic_t *mmap_changing,
> >                                    uffd_flags_t flags);
> > +extern ssize_t mfill_atomic_sigbus(struct mm_struct *dst_mm, unsigned long start,
> > +                                unsigned long len, atomic_t *mmap_changing,
> > +                                uffd_flags_t flags);
> >  extern int mwriteprotect_range(struct mm_struct *dst_mm,
> >                              unsigned long start, unsigned long len,
> >                              bool enable_wp, atomic_t *mmap_changing);
> > diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> > index 66dd4cd277bd..616e33d3db97 100644
> > --- a/include/uapi/linux/userfaultfd.h
> > +++ b/include/uapi/linux/userfaultfd.h
> > @@ -39,7 +39,8 @@
> >                          UFFD_FEATURE_MINOR_SHMEM |           \
> >                          UFFD_FEATURE_EXACT_ADDRESS |         \
> >                          UFFD_FEATURE_WP_HUGETLBFS_SHMEM |    \
> > -                        UFFD_FEATURE_WP_UNPOPULATED)
> > +                        UFFD_FEATURE_WP_UNPOPULATED |        \
> > +                        UFFD_FEATURE_SIGBUS_IOCTL)
> >  #define UFFD_API_IOCTLS                              \
> >       ((__u64)1 << _UFFDIO_REGISTER |         \
> >        (__u64)1 << _UFFDIO_UNREGISTER |       \
> > @@ -49,12 +50,14 @@
> >        (__u64)1 << _UFFDIO_COPY |             \
> >        (__u64)1 << _UFFDIO_ZEROPAGE |         \
> >        (__u64)1 << _UFFDIO_WRITEPROTECT |     \
> > -      (__u64)1 << _UFFDIO_CONTINUE)
> > +      (__u64)1 << _UFFDIO_CONTINUE |         \
> > +      (__u64)1 << _UFFDIO_SIGBUS)
> >  #define UFFD_API_RANGE_IOCTLS_BASIC          \
> >       ((__u64)1 << _UFFDIO_WAKE |             \
> >        (__u64)1 << _UFFDIO_COPY |             \
> > +      (__u64)1 << _UFFDIO_WRITEPROTECT |     \
> >        (__u64)1 << _UFFDIO_CONTINUE |         \
> > -      (__u64)1 << _UFFDIO_WRITEPROTECT)
> > +      (__u64)1 << _UFFDIO_SIGBUS)
> >
> >  /*
> >   * Valid ioctl command number range with this API is from 0x00 to
> > @@ -71,6 +74,7 @@
> >  #define _UFFDIO_ZEROPAGE             (0x04)
> >  #define _UFFDIO_WRITEPROTECT         (0x06)
> >  #define _UFFDIO_CONTINUE             (0x07)
> > +#define _UFFDIO_SIGBUS                       (0x08)
> >  #define _UFFDIO_API                  (0x3F)
> >
> >  /* userfaultfd ioctl ids */
> > @@ -91,6 +95,8 @@
> >                                     struct uffdio_writeprotect)
> >  #define UFFDIO_CONTINUE              _IOWR(UFFDIO, _UFFDIO_CONTINUE, \
> >                                     struct uffdio_continue)
> > +#define UFFDIO_SIGBUS                _IOWR(UFFDIO, _UFFDIO_SIGBUS, \
> > +                                   struct uffdio_sigbus)
> >
> >  /* read() structure */
> >  struct uffd_msg {
> > @@ -225,6 +231,7 @@ struct uffdio_api {
> >  #define UFFD_FEATURE_EXACT_ADDRESS           (1<<11)
> >  #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM              (1<<12)
> >  #define UFFD_FEATURE_WP_UNPOPULATED          (1<<13)
> > +#define UFFD_FEATURE_SIGBUS_IOCTL            (1<<14)
> >       __u64 features;
> >
> >       __u64 ioctls;
> > @@ -321,6 +328,18 @@ struct uffdio_continue {
> >       __s64 mapped;
> >  };
> >
> > +struct uffdio_sigbus {
> > +     struct uffdio_range range;
> > +#define UFFDIO_SIGBUS_MODE_DONTWAKE          ((__u64)1<<0)
> > +     __u64 mode;
> > +
> > +     /*
> > +      * Fields below here are written by the ioctl and must be at the end:
> > +      * the copy_from_user will not read past here.
> > +      */
> > +     __s64 updated;
> > +};
> > +
> >  /*
> >   * Flags for the userfaultfd(2) system call itself.
> >   */
> > diff --git a/mm/memory.c b/mm/memory.c
> > index f69fbc251198..e4b4207c2590 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3675,6 +3675,10 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> >       if (WARN_ON_ONCE(!marker))
> >               return VM_FAULT_SIGBUS;
> >
> > +     /* SIGBUS explicitly requested for this PTE. */
> > +     if (marker & PTE_MARKER_UFFD_SIGBUS)
> > +             return VM_FAULT_SIGBUS;
> > +
> >       /* Higher priority than uffd-wp when data corrupted */
> >       if (marker & PTE_MARKER_SWAPIN_ERROR)
> >               return VM_FAULT_SIGBUS;
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index e97a0b4889fc..933587eebd5d 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -278,6 +278,51 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
> >       goto out;
> >  }
> >
> > +/* Handles UFFDIO_SIGBUS for all non-hugetlb VMAs. */
> > +static int mfill_atomic_pte_sigbus(pmd_t *dst_pmd,
> > +                                struct vm_area_struct *dst_vma,
> > +                                unsigned long dst_addr,
> > +                                uffd_flags_t flags)
> > +{
> > +     int ret;
> > +     struct mm_struct *dst_mm = dst_vma->vm_mm;
> > +     pte_t _dst_pte, *dst_pte;
> > +     spinlock_t *ptl;
> > +
> > +     _dst_pte = make_pte_marker(PTE_MARKER_UFFD_SIGBUS);
> > +     dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
> > +
> > +     if (vma_is_shmem(dst_vma)) {
> > +             struct inode *inode;
> > +             pgoff_t offset, max_off;
> > +
> > +             /* serialize against truncate with the page table lock */
> > +             inode = dst_vma->vm_file->f_inode;
> > +             offset = linear_page_index(dst_vma, dst_addr);
> > +             max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> > +             ret = -EFAULT;
> > +             if (unlikely(offset >= max_off))
> > +                     goto out_unlock;
> > +     }
> > +
> > +     ret = -EEXIST;
> > +     /*
> > +      * For now, we don't handle unmapping pages, so only support filling in
> > +      * none PTEs, or replacing PTE markers.
> > +      */
> > +     if (!pte_none_mostly(*dst_pte))
> > +             goto out_unlock;
> > +
> > +     set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
> > +
> > +     /* No need to invalidate - it was non-present before */
> > +     update_mmu_cache(dst_vma, dst_addr, dst_pte);
> > +     ret = 0;
> > +out_unlock:
> > +     pte_unmap_unlock(dst_pte, ptl);
> > +     return ret;
> > +}
> > +
> >  static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
> >  {
> >       pgd_t *pgd;
> > @@ -328,8 +373,12 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
> >        * supported by hugetlb.  A PMD_SIZE huge pages may exist as used
> >        * by THP.  Since we can not reliably insert a zero page, this
> >        * feature is not supported.
> > +      *
> > +      * PTE marker handling for hugetlb is a bit special, so for now
> > +      * UFFDIO_SIGBUS is not supported.
>
> Can you be more specific on this?
>
> What's the plan when HGM will be merged?  Is it possible that all memory
> just support this always so we only need 1 feature flag?

We'll want hugetlbfs support for this operation too, but it's only
really useful (at least for our use case) after HGM is merged. But,
there's no strong reason not to just implement both all at once - I'll
extend v2 to also work properly with hugetlbfs. Probably it isn't too
hard, I just need to do a bit more reading of how swap markers are
handled in hugetlbfs.


>
> >        */
> > -     if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) {
> > +     if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE) ||
> > +         uffd_flags_mode_is(flags, MFILL_ATOMIC_SIGBUS)) {
> >               mmap_read_unlock(dst_mm);
> >               return -EINVAL;
> >       }
> > @@ -473,6 +522,9 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
> >       if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
> >               return mfill_atomic_pte_continue(dst_pmd, dst_vma,
> >                                                dst_addr, flags);
> > +     } else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_SIGBUS)) {
> > +             return mfill_atomic_pte_sigbus(dst_pmd, dst_vma,
> > +                                            dst_addr, flags);
> >       }
> >
> >       /*
> > @@ -694,6 +746,14 @@ ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long start,
> >                           uffd_flags_set_mode(flags, MFILL_ATOMIC_CONTINUE));
> >  }
> >
> > +ssize_t mfill_atomic_sigbus(struct mm_struct *dst_mm, unsigned long start,
> > +                         unsigned long len, atomic_t *mmap_changing,
> > +                         uffd_flags_t flags)
> > +{
> > +     return mfill_atomic(dst_mm, start, 0, len, mmap_changing,
> > +                         uffd_flags_set_mode(flags, MFILL_ATOMIC_SIGBUS));
> > +}
> > +
> >  long uffd_wp_range(struct vm_area_struct *dst_vma,
> >                  unsigned long start, unsigned long len, bool enable_wp)
> >  {
> > --
> > 2.40.1.606.ga4b1b128d6-goog
> >
>
> --
> Peter Xu
>
Peter Xu May 24, 2023, 3:05 p.m. UTC | #21
On Tue, May 23, 2023 at 10:59:05AM -0700, Axel Rasmussen wrote:
> > Actually.. I think maybe we should have 1 patch changing SWAPIN_ERROR from
> > VM_FAULT_SIGBUS to VM_FAULT_HWPOISON.
> >
> > Let's imagine a VM having anonymous page backing and got a swapin error
> > when faulted on one of the guest page.  Instead of crashing the hypervisor
> > with sigbus we should probably make it a MCE injected into the guest too,
> > because there's no page corrupt in bare metal in this specific case,
> > however to the guest it's the same as having one page corrupted just like a
> > real MCE.
> 
> This is a great idea, you're right that injecting an MCE into the
> guest is exactly the end goal, and it seems like VM_FAULT_HWPOISON
> will "just work". Also the name UFFDIO_POISON resolves any confusion
> with UFFD_FEATURE_SIGBUS, so that's a nice side benefit.

Hopefully!  Please double check it with KVM running altogether to make sure
the patch works exactly as expected.

[...]

> We'll want hugetlbfs support for this operation too, but it's only
> really useful (at least for our use case) after HGM is merged. But,
> there's no strong reason not to just implement both all at once - I'll
> extend v2 to also work properly with hugetlbfs. Probably it isn't too
> hard, I just need to do a bit more reading of how swap markers are
> handled in hugetlbfs.

Sure.  We have too many flags separating different types of memory, so I
think it'll be nice if when it can still trivially work for everything.

For this specific case, since your goal is definitely having hugetlb hgm
supported so it'll be even more trickier if only hgm will be supported but
not !hgm hugetlbs, so we'd better target it initially with all mem types.

Thanks,
diff mbox series

Patch

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 0fd96d6e39ce..edc2928dae2b 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1966,6 +1966,66 @@  static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 	return ret;
 }
 
+static inline int userfaultfd_sigbus(struct userfaultfd_ctx *ctx, unsigned long arg)
+{
+	__s64 ret;
+	struct uffdio_sigbus uffdio_sigbus;
+	struct uffdio_sigbus __user *user_uffdio_sigbus;
+	struct userfaultfd_wake_range range;
+
+	user_uffdio_sigbus = (struct uffdio_sigbus __user *)arg;
+
+	ret = -EAGAIN;
+	if (atomic_read(&ctx->mmap_changing))
+		goto out;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_sigbus, user_uffdio_sigbus,
+			   /* don't copy the output fields */
+			   sizeof(uffdio_sigbus) - (sizeof(__s64))))
+		goto out;
+
+	ret = validate_range(ctx->mm, uffdio_sigbus.range.start,
+			     uffdio_sigbus.range.len);
+	if (ret)
+		goto out;
+
+	ret = -EINVAL;
+	/* double check for wraparound just in case. */
+	if (uffdio_sigbus.range.start + uffdio_sigbus.range.len <=
+	    uffdio_sigbus.range.start) {
+		goto out;
+	}
+	if (uffdio_sigbus.mode & ~UFFDIO_SIGBUS_MODE_DONTWAKE)
+		goto out;
+
+	if (mmget_not_zero(ctx->mm)) {
+		ret = mfill_atomic_sigbus(ctx->mm, uffdio_sigbus.range.start,
+					  uffdio_sigbus.range.len,
+					  &ctx->mmap_changing, 0);
+		mmput(ctx->mm);
+	} else {
+		return -ESRCH;
+	}
+
+	if (unlikely(put_user(ret, &user_uffdio_sigbus->updated)))
+		return -EFAULT;
+	if (ret < 0)
+		goto out;
+
+	/* len == 0 would wake all */
+	BUG_ON(!ret);
+	range.len = ret;
+	if (!(uffdio_sigbus.mode & UFFDIO_SIGBUS_MODE_DONTWAKE)) {
+		range.start = uffdio_sigbus.range.start;
+		wake_userfault(ctx, &range);
+	}
+	ret = range.len == uffdio_sigbus.range.len ? 0 : -EAGAIN;
+
+out:
+	return ret;
+}
+
 static inline unsigned int uffd_ctx_features(__u64 user_features)
 {
 	/*
@@ -2067,6 +2127,9 @@  static long userfaultfd_ioctl(struct file *file, unsigned cmd,
 	case UFFDIO_CONTINUE:
 		ret = userfaultfd_continue(ctx, arg);
 		break;
+	case UFFDIO_SIGBUS:
+		ret = userfaultfd_sigbus(ctx, arg);
+		break;
 	}
 	return ret;
 }
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 3a451b7afcb3..fa778a0ae730 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -405,7 +405,8 @@  typedef unsigned long pte_marker;
 
 #define  PTE_MARKER_UFFD_WP			BIT(0)
 #define  PTE_MARKER_SWAPIN_ERROR		BIT(1)
-#define  PTE_MARKER_MASK			(BIT(2) - 1)
+#define  PTE_MARKER_UFFD_SIGBUS			BIT(2)
+#define  PTE_MARKER_MASK			(BIT(3) - 1)
 
 static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
 {
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index d78b01524349..6de1084939c5 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -46,6 +46,7 @@  enum mfill_atomic_mode {
 	MFILL_ATOMIC_COPY,
 	MFILL_ATOMIC_ZEROPAGE,
 	MFILL_ATOMIC_CONTINUE,
+	MFILL_ATOMIC_SIGBUS,
 	NR_MFILL_ATOMIC_MODES,
 };
 
@@ -83,6 +84,9 @@  extern ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm,
 extern ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long dst_start,
 				     unsigned long len, atomic_t *mmap_changing,
 				     uffd_flags_t flags);
+extern ssize_t mfill_atomic_sigbus(struct mm_struct *dst_mm, unsigned long start,
+				   unsigned long len, atomic_t *mmap_changing,
+				   uffd_flags_t flags);
 extern int mwriteprotect_range(struct mm_struct *dst_mm,
 			       unsigned long start, unsigned long len,
 			       bool enable_wp, atomic_t *mmap_changing);
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 66dd4cd277bd..616e33d3db97 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -39,7 +39,8 @@ 
 			   UFFD_FEATURE_MINOR_SHMEM |		\
 			   UFFD_FEATURE_EXACT_ADDRESS |		\
 			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM |	\
-			   UFFD_FEATURE_WP_UNPOPULATED)
+			   UFFD_FEATURE_WP_UNPOPULATED |	\
+			   UFFD_FEATURE_SIGBUS_IOCTL)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -49,12 +50,14 @@ 
 	 (__u64)1 << _UFFDIO_COPY |		\
 	 (__u64)1 << _UFFDIO_ZEROPAGE |		\
 	 (__u64)1 << _UFFDIO_WRITEPROTECT |	\
-	 (__u64)1 << _UFFDIO_CONTINUE)
+	 (__u64)1 << _UFFDIO_CONTINUE |		\
+	 (__u64)1 << _UFFDIO_SIGBUS)
 #define UFFD_API_RANGE_IOCTLS_BASIC		\
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY |		\
+	 (__u64)1 << _UFFDIO_WRITEPROTECT |	\
 	 (__u64)1 << _UFFDIO_CONTINUE |		\
-	 (__u64)1 << _UFFDIO_WRITEPROTECT)
+	 (__u64)1 << _UFFDIO_SIGBUS)
 
 /*
  * Valid ioctl command number range with this API is from 0x00 to
@@ -71,6 +74,7 @@ 
 #define _UFFDIO_ZEROPAGE		(0x04)
 #define _UFFDIO_WRITEPROTECT		(0x06)
 #define _UFFDIO_CONTINUE		(0x07)
+#define _UFFDIO_SIGBUS			(0x08)
 #define _UFFDIO_API			(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -91,6 +95,8 @@ 
 				      struct uffdio_writeprotect)
 #define UFFDIO_CONTINUE		_IOWR(UFFDIO, _UFFDIO_CONTINUE,	\
 				      struct uffdio_continue)
+#define UFFDIO_SIGBUS		_IOWR(UFFDIO, _UFFDIO_SIGBUS, \
+				      struct uffdio_sigbus)
 
 /* read() structure */
 struct uffd_msg {
@@ -225,6 +231,7 @@  struct uffdio_api {
 #define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
 #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
 #define UFFD_FEATURE_WP_UNPOPULATED		(1<<13)
+#define UFFD_FEATURE_SIGBUS_IOCTL		(1<<14)
 	__u64 features;
 
 	__u64 ioctls;
@@ -321,6 +328,18 @@  struct uffdio_continue {
 	__s64 mapped;
 };
 
+struct uffdio_sigbus {
+	struct uffdio_range range;
+#define UFFDIO_SIGBUS_MODE_DONTWAKE		((__u64)1<<0)
+	__u64 mode;
+
+	/*
+	 * Fields below here are written by the ioctl and must be at the end:
+	 * the copy_from_user will not read past here.
+	 */
+	__s64 updated;
+};
+
 /*
  * Flags for the userfaultfd(2) system call itself.
  */
diff --git a/mm/memory.c b/mm/memory.c
index f69fbc251198..e4b4207c2590 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3675,6 +3675,10 @@  static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 	if (WARN_ON_ONCE(!marker))
 		return VM_FAULT_SIGBUS;
 
+	/* SIGBUS explicitly requested for this PTE. */
+	if (marker & PTE_MARKER_UFFD_SIGBUS)
+		return VM_FAULT_SIGBUS;
+
 	/* Higher priority than uffd-wp when data corrupted */
 	if (marker & PTE_MARKER_SWAPIN_ERROR)
 		return VM_FAULT_SIGBUS;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e97a0b4889fc..933587eebd5d 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -278,6 +278,51 @@  static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
 	goto out;
 }
 
+/* Handles UFFDIO_SIGBUS for all non-hugetlb VMAs. */
+static int mfill_atomic_pte_sigbus(pmd_t *dst_pmd,
+				   struct vm_area_struct *dst_vma,
+				   unsigned long dst_addr,
+				   uffd_flags_t flags)
+{
+	int ret;
+	struct mm_struct *dst_mm = dst_vma->vm_mm;
+	pte_t _dst_pte, *dst_pte;
+	spinlock_t *ptl;
+
+	_dst_pte = make_pte_marker(PTE_MARKER_UFFD_SIGBUS);
+	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+
+	if (vma_is_shmem(dst_vma)) {
+		struct inode *inode;
+		pgoff_t offset, max_off;
+
+		/* serialize against truncate with the page table lock */
+		inode = dst_vma->vm_file->f_inode;
+		offset = linear_page_index(dst_vma, dst_addr);
+		max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+		ret = -EFAULT;
+		if (unlikely(offset >= max_off))
+			goto out_unlock;
+	}
+
+	ret = -EEXIST;
+	/*
+	 * For now, we don't handle unmapping pages, so only support filling in
+	 * none PTEs, or replacing PTE markers.
+	 */
+	if (!pte_none_mostly(*dst_pte))
+		goto out_unlock;
+
+	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(dst_vma, dst_addr, dst_pte);
+	ret = 0;
+out_unlock:
+	pte_unmap_unlock(dst_pte, ptl);
+	return ret;
+}
+
 static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
 {
 	pgd_t *pgd;
@@ -328,8 +373,12 @@  static __always_inline ssize_t mfill_atomic_hugetlb(
 	 * supported by hugetlb.  A PMD_SIZE huge pages may exist as used
 	 * by THP.  Since we can not reliably insert a zero page, this
 	 * feature is not supported.
+	 *
+	 * PTE marker handling for hugetlb is a bit special, so for now
+	 * UFFDIO_SIGBUS is not supported.
 	 */
-	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) {
+	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE) ||
+	    uffd_flags_mode_is(flags, MFILL_ATOMIC_SIGBUS)) {
 		mmap_read_unlock(dst_mm);
 		return -EINVAL;
 	}
@@ -473,6 +522,9 @@  static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
 	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
 		return mfill_atomic_pte_continue(dst_pmd, dst_vma,
 						 dst_addr, flags);
+	} else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_SIGBUS)) {
+		return mfill_atomic_pte_sigbus(dst_pmd, dst_vma,
+					       dst_addr, flags);
 	}
 
 	/*
@@ -694,6 +746,14 @@  ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long start,
 			    uffd_flags_set_mode(flags, MFILL_ATOMIC_CONTINUE));
 }
 
+ssize_t mfill_atomic_sigbus(struct mm_struct *dst_mm, unsigned long start,
+			    unsigned long len, atomic_t *mmap_changing,
+			    uffd_flags_t flags)
+{
+	return mfill_atomic(dst_mm, start, 0, len, mmap_changing,
+			    uffd_flags_set_mode(flags, MFILL_ATOMIC_SIGBUS));
+}
+
 long uffd_wp_range(struct vm_area_struct *dst_vma,
 		   unsigned long start, unsigned long len, bool enable_wp)
 {