mbox series

[RESEND,v6,00/16] mm: Page fault enhancements

Message ID 20200220155353.8676-1-peterx@redhat.com (mailing list archive)
Headers show
Series mm: Page fault enhancements | expand

Message

Peter Xu Feb. 20, 2020, 3:53 p.m. UTC
[Resend v6]

This is v6 of the series.  It is majorly a rebase to 5.6-rc2, nothing
else to be expected (plus some tests after the rebase).  Instead of
rewrite the cover letter I decided to use what we have for v5.

Adding extra CCs for both Bobby Powers <bobbypowers@gmail.com> and
Brian Geffon <bgeffon@google.com>.

Online repo: https://github.com/xzpeter/linux/tree/mm-pf-signal-retry

Any review comment is appreciated.  Thanks,

=============== v5 cover letter ==================

This is v5 of the series.  As Matthew suggested, I split the previous
patch "mm: Return faster for non-fatal signals in user mode faults"
into a few smaller ones:

  1. One patch to introduce fatal_signal_pending(), and use it in
     archs that can directly apply

  2. A few more patches to let the rest archs to use the new helper.
     With that we can have an unified entry for signal detection

  3. One last patch to change fatal_signal_pending() to detect
     userspace non-fatal signal

Nothing should have changed in the rest patches.  Because the fault
retry patches will depend on the previous ones, I decided to simply
repost all the patches.

Here's the new patchset layout:

Patch 1-2:      cleanup, and potential bugfix of hugetlbfs on fault retry

Patch 3-9:      let page fault to respond to non-fatal signals faster

Patch 10:       remove the userfaultfd NOPAGE emulation

Patch 11-14:    allow page fault to retry more than once

Patch 15-16:    let gup code to use FAULT_FLAG_KILLABLE too

I would really appreciate any review comments for the series,
especially for the first two patches which IMHO are even not related
to this patchset and they should either cleanup or fix things.

Smoke tested on x86 only.

Thanks,

v5:
- split "mm: Return faster for non-fatal signals in user mode faults"
  into a few more patches, let all archs to use an unified entry for
  fast signal handling (fatal_signal_pending)

v4:
- use lore.kernel.org for all the links in commit messages [Kirill]
- one more patch ("mm/gup: Fix __get_user_pages() on fault retry of
  hugetlb") to fix hugetlb path on fault retry
- one more patch ("mm/gup: Allow to react to fatal signals") to:
  - use down_read_killable() properly [Linus]
  - pass in FAULT_FLAG_KILLABLE for all GUP [Linus]
- one more patch ("mm/userfaultfd: Honor FAULT_FLAG_KILLABLE in fault
  path") to let handle_userfaultfd() respect FAULT_FLAG_KILLABLE.
  Should have no functional change after previous two new patches.

v3:
- check fatal signals in __get_user_page_locked() [Linus]
- add r-bs

v2:
- resent previous version, rebase only

=============== v1 cover letter ==================

This series is split out of userfaultfd-wp series to only cover the
general page fault changes, since it seems to make sense itself.

Basically it does two things:

  (a) Allows the page fault handlers to be more interactive on not
      only SIGKILL, but also the rest of userspace signals (especially
      for user-mode faults), and,

  (b) Allows the page fault retry (VM_FAULT_RETRY) to happen for more
      than once.

I'm keeping the CC list as in uffd-wp v5, hopefully I'm not sending
too much spams...

And, instead of writting again the cover letter, I'm just copy-pasting
my previous link here which has more details on why we do this:

  https://patchwork.kernel.org/cover/10691991/

The major change from that latest version should be that we introduced
a new page fault flag FAULT_FLAG_INTERRUPTIBLE as suggested by Linus
[1] to represents that we would like the fault handler to respond to
non-fatal signals.  Also, we're more careful now on when to do the
immediate return of the page fault for such signals.  For example, now
we'll only check against signal_pending() for user-mode page faults
and we keep the kernel-mode page fault patch untouched for it.  More
information can be found in separate patches.

The patchset is only lightly tested on x86.

All comments are greatly welcomed.  Thanks,

[1] https://lkml.org/lkml/2019/6/25/1382

Peter Xu (16):
  mm/gup: Rename "nonblocking" to "locked" where proper
  mm/gup: Fix __get_user_pages() on fault retry of hugetlb
  mm: Introduce fault_signal_pending()
  x86/mm: Use helper fault_signal_pending()
  arc/mm: Use helper fault_signal_pending()
  arm64/mm: Use helper fault_signal_pending()
  powerpc/mm: Use helper fault_signal_pending()
  sh/mm: Use helper fault_signal_pending()
  mm: Return faster for non-fatal signals in user mode faults
  userfaultfd: Don't retake mmap_sem to emulate NOPAGE
  mm: Introduce FAULT_FLAG_DEFAULT
  mm: Introduce FAULT_FLAG_INTERRUPTIBLE
  mm: Allow VM_FAULT_RETRY for multiple times
  mm/gup: Allow VM_FAULT_RETRY for multiple times
  mm/gup: Allow to react to fatal signals
  mm/userfaultfd: Honor FAULT_FLAG_KILLABLE in fault path

 arch/alpha/mm/fault.c           |  6 +--
 arch/arc/mm/fault.c             | 35 +++++--------
 arch/arm/mm/fault.c             |  7 +--
 arch/arm64/mm/fault.c           | 26 +++------
 arch/hexagon/mm/vm_fault.c      |  5 +-
 arch/ia64/mm/fault.c            |  5 +-
 arch/m68k/mm/fault.c            |  7 +--
 arch/microblaze/mm/fault.c      |  5 +-
 arch/mips/mm/fault.c            |  5 +-
 arch/nds32/mm/fault.c           |  5 +-
 arch/nios2/mm/fault.c           |  7 +--
 arch/openrisc/mm/fault.c        |  5 +-
 arch/parisc/mm/fault.c          |  8 ++-
 arch/powerpc/mm/fault.c         | 20 ++-----
 arch/riscv/mm/fault.c           |  9 +---
 arch/s390/mm/fault.c            | 10 ++--
 arch/sh/mm/fault.c              | 13 +++--
 arch/sparc/mm/fault_32.c        |  5 +-
 arch/sparc/mm/fault_64.c        |  5 +-
 arch/um/kernel/trap.c           |  3 +-
 arch/unicore32/mm/fault.c       |  8 ++-
 arch/x86/mm/fault.c             | 30 +++++------
 arch/xtensa/mm/fault.c          |  5 +-
 drivers/gpu/drm/ttm/ttm_bo_vm.c | 12 +++--
 fs/userfaultfd.c                | 62 ++++++++++------------
 include/linux/mm.h              | 81 ++++++++++++++++++++++++----
 include/linux/sched/signal.h    | 14 +++++
 mm/filemap.c                    |  2 +-
 mm/gup.c                        | 93 +++++++++++++++++++++------------
 mm/hugetlb.c                    | 17 +++---
 mm/internal.h                   |  6 +--
 31 files changed, 281 insertions(+), 240 deletions(-)

Comments

Brian Geffon Feb. 21, 2020, 7:26 p.m. UTC | #1
I tested the entire patchset because I'm very interested in fault
retries with userfaultfd and the series has been stable and worked
well on x86.

Tested-by: Brian Geffon <bgeffon@google.com>


On Thu, Feb 20, 2020 at 7:54 AM Peter Xu <peterx@redhat.com> wrote:
>
> [Resend v6]
>
> This is v6 of the series.  It is majorly a rebase to 5.6-rc2, nothing
> else to be expected (plus some tests after the rebase).  Instead of
> rewrite the cover letter I decided to use what we have for v5.
>
> Adding extra CCs for both Bobby Powers <bobbypowers@gmail.com> and
> Brian Geffon <bgeffon@google.com>.
>
> Online repo: https://github.com/xzpeter/linux/tree/mm-pf-signal-retry
>
> Any review comment is appreciated.  Thanks,
>
> =============== v5 cover letter ==================
>
> This is v5 of the series.  As Matthew suggested, I split the previous
> patch "mm: Return faster for non-fatal signals in user mode faults"
> into a few smaller ones:
>
>   1. One patch to introduce fatal_signal_pending(), and use it in
>      archs that can directly apply
>
>   2. A few more patches to let the rest archs to use the new helper.
>      With that we can have an unified entry for signal detection
>
>   3. One last patch to change fatal_signal_pending() to detect
>      userspace non-fatal signal
>
> Nothing should have changed in the rest patches.  Because the fault
> retry patches will depend on the previous ones, I decided to simply
> repost all the patches.
>
> Here's the new patchset layout:
>
> Patch 1-2:      cleanup, and potential bugfix of hugetlbfs on fault retry
>
> Patch 3-9:      let page fault to respond to non-fatal signals faster
>
> Patch 10:       remove the userfaultfd NOPAGE emulation
>
> Patch 11-14:    allow page fault to retry more than once
>
> Patch 15-16:    let gup code to use FAULT_FLAG_KILLABLE too
>
> I would really appreciate any review comments for the series,
> especially for the first two patches which IMHO are even not related
> to this patchset and they should either cleanup or fix things.
>
> Smoke tested on x86 only.
>
> Thanks,
>
> v5:
> - split "mm: Return faster for non-fatal signals in user mode faults"
>   into a few more patches, let all archs to use an unified entry for
>   fast signal handling (fatal_signal_pending)
>
> v4:
> - use lore.kernel.org for all the links in commit messages [Kirill]
> - one more patch ("mm/gup: Fix __get_user_pages() on fault retry of
>   hugetlb") to fix hugetlb path on fault retry
> - one more patch ("mm/gup: Allow to react to fatal signals") to:
>   - use down_read_killable() properly [Linus]
>   - pass in FAULT_FLAG_KILLABLE for all GUP [Linus]
> - one more patch ("mm/userfaultfd: Honor FAULT_FLAG_KILLABLE in fault
>   path") to let handle_userfaultfd() respect FAULT_FLAG_KILLABLE.
>   Should have no functional change after previous two new patches.
>
> v3:
> - check fatal signals in __get_user_page_locked() [Linus]
> - add r-bs
>
> v2:
> - resent previous version, rebase only
>
> =============== v1 cover letter ==================
>
> This series is split out of userfaultfd-wp series to only cover the
> general page fault changes, since it seems to make sense itself.
>
> Basically it does two things:
>
>   (a) Allows the page fault handlers to be more interactive on not
>       only SIGKILL, but also the rest of userspace signals (especially
>       for user-mode faults), and,
>
>   (b) Allows the page fault retry (VM_FAULT_RETRY) to happen for more
>       than once.
>
> I'm keeping the CC list as in uffd-wp v5, hopefully I'm not sending
> too much spams...
>
> And, instead of writting again the cover letter, I'm just copy-pasting
> my previous link here which has more details on why we do this:
>
>   https://patchwork.kernel.org/cover/10691991/
>
> The major change from that latest version should be that we introduced
> a new page fault flag FAULT_FLAG_INTERRUPTIBLE as suggested by Linus
> [1] to represents that we would like the fault handler to respond to
> non-fatal signals.  Also, we're more careful now on when to do the
> immediate return of the page fault for such signals.  For example, now
> we'll only check against signal_pending() for user-mode page faults
> and we keep the kernel-mode page fault patch untouched for it.  More
> information can be found in separate patches.
>
> The patchset is only lightly tested on x86.
>
> All comments are greatly welcomed.  Thanks,
>
> [1] https://lkml.org/lkml/2019/6/25/1382
>
> Peter Xu (16):
>   mm/gup: Rename "nonblocking" to "locked" where proper
>   mm/gup: Fix __get_user_pages() on fault retry of hugetlb
>   mm: Introduce fault_signal_pending()
>   x86/mm: Use helper fault_signal_pending()
>   arc/mm: Use helper fault_signal_pending()
>   arm64/mm: Use helper fault_signal_pending()
>   powerpc/mm: Use helper fault_signal_pending()
>   sh/mm: Use helper fault_signal_pending()
>   mm: Return faster for non-fatal signals in user mode faults
>   userfaultfd: Don't retake mmap_sem to emulate NOPAGE
>   mm: Introduce FAULT_FLAG_DEFAULT
>   mm: Introduce FAULT_FLAG_INTERRUPTIBLE
>   mm: Allow VM_FAULT_RETRY for multiple times
>   mm/gup: Allow VM_FAULT_RETRY for multiple times
>   mm/gup: Allow to react to fatal signals
>   mm/userfaultfd: Honor FAULT_FLAG_KILLABLE in fault path
>
>  arch/alpha/mm/fault.c           |  6 +--
>  arch/arc/mm/fault.c             | 35 +++++--------
>  arch/arm/mm/fault.c             |  7 +--
>  arch/arm64/mm/fault.c           | 26 +++------
>  arch/hexagon/mm/vm_fault.c      |  5 +-
>  arch/ia64/mm/fault.c            |  5 +-
>  arch/m68k/mm/fault.c            |  7 +--
>  arch/microblaze/mm/fault.c      |  5 +-
>  arch/mips/mm/fault.c            |  5 +-
>  arch/nds32/mm/fault.c           |  5 +-
>  arch/nios2/mm/fault.c           |  7 +--
>  arch/openrisc/mm/fault.c        |  5 +-
>  arch/parisc/mm/fault.c          |  8 ++-
>  arch/powerpc/mm/fault.c         | 20 ++-----
>  arch/riscv/mm/fault.c           |  9 +---
>  arch/s390/mm/fault.c            | 10 ++--
>  arch/sh/mm/fault.c              | 13 +++--
>  arch/sparc/mm/fault_32.c        |  5 +-
>  arch/sparc/mm/fault_64.c        |  5 +-
>  arch/um/kernel/trap.c           |  3 +-
>  arch/unicore32/mm/fault.c       |  8 ++-
>  arch/x86/mm/fault.c             | 30 +++++------
>  arch/xtensa/mm/fault.c          |  5 +-
>  drivers/gpu/drm/ttm/ttm_bo_vm.c | 12 +++--
>  fs/userfaultfd.c                | 62 ++++++++++------------
>  include/linux/mm.h              | 81 ++++++++++++++++++++++++----
>  include/linux/sched/signal.h    | 14 +++++
>  mm/filemap.c                    |  2 +-
>  mm/gup.c                        | 93 +++++++++++++++++++++------------
>  mm/hugetlb.c                    | 17 +++---
>  mm/internal.h                   |  6 +--
>  31 files changed, 281 insertions(+), 240 deletions(-)
>
> --
> 2.24.1
>
Linus Torvalds Feb. 21, 2020, 7:32 p.m. UTC | #2
On Thu, Feb 20, 2020 at 7:54 AM Peter Xu <peterx@redhat.com> wrote:
>
> This is v6 of the series.  It is majorly a rebase to 5.6-rc2, nothing
> else to be expected (plus some tests after the rebase).  Instead of
> rewrite the cover letter I decided to use what we have for v5.

I continue to think this is the right thing to do, and the series
looks good to me.

I'd love for it to get more testing, but realistically I suspect that
being in linux-next will be the right thing.

I've been assuming this will go through Andrew. He's not explicitly
cc'd, though (although maybe he does read all of linux-mm and has seen
this several times as a result).

               Linus
Peter Xu Feb. 21, 2020, 8:11 p.m. UTC | #3
On Fri, Feb 21, 2020 at 11:32:36AM -0800, Linus Torvalds wrote:
> On Thu, Feb 20, 2020 at 7:54 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > This is v6 of the series.  It is majorly a rebase to 5.6-rc2, nothing
> > else to be expected (plus some tests after the rebase).  Instead of
> > rewrite the cover letter I decided to use what we have for v5.
> 
> I continue to think this is the right thing to do, and the series
> looks good to me.
> 
> I'd love for it to get more testing, but realistically I suspect that
> being in linux-next will be the right thing.
> 
> I've been assuming this will go through Andrew. He's not explicitly
> cc'd, though (although maybe he does read all of linux-mm and has seen
> this several times as a result).

I'm very sorry for missing that.  I should have cced Andrew for every
mm patches.  Hope that's not a problem for this series.  I'll remember
to do that from now on.  Thanks!
Peter Xu March 2, 2020, 5:31 p.m. UTC | #4
On Fri, Feb 21, 2020 at 11:26:12AM -0800, Brian Geffon wrote:
> I tested the entire patchset because I'm very interested in fault
> retries with userfaultfd and the series has been stable and worked
> well on x86.
> 
> Tested-by: Brian Geffon <bgeffon@google.com>

(Thanks again for Brian's quick follow up, and adding Andrew in again)

Hi, Andrew,

Do you have plan to queue this series for linux-next?  Please let me
know if you want me to repost with you CCed for the whole series.

Thanks!
David Hildenbrand March 7, 2020, 8:33 p.m. UTC | #5
On 20.02.20 16:53, Peter Xu wrote:
> [Resend v6]
> 
> This is v6 of the series.  It is majorly a rebase to 5.6-rc2, nothing
> else to be expected (plus some tests after the rebase).  Instead of
> rewrite the cover letter I decided to use what we have for v5.
> 
> Adding extra CCs for both Bobby Powers <bobbypowers@gmail.com> and
> Brian Geffon <bgeffon@google.com>.
> 
> Online repo: https://github.com/xzpeter/linux/tree/mm-pf-signal-retry
> 
> Any review comment is appreciated.  Thanks,

If I am not completely missing something (and all my testing today was
wrong) there is a very simple reason why I *LOVE* this series and it made
my weekend. It makes userfaultfd with concurrent discarding (e.g.,
MADV_DONTNEED) of pages actually usable.

The issue in current code is that between placing a page and waking
up a waiter, somebody can zap the new placed page and trigger
re-fault, triggering a SIGBUS and crashing an application where all
memory is supposed to be accessible. And there is no real way to protect
from that, because when the fault handler will be woken up and retry
is not deterministic (e.g., making madvise(MADV_DONTNEED) and
UFFDIO_ZEROPAGE mutually exclusive does not help).

Find a simple reproducer at the end of this mail.

Before this series:
[root@localhost ~]# ./a.out 
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
[   34.849604] FAULT_FLAG_ALLOW_RETRY missing 70
[   34.850466] CPU: 1 PID: 651 Comm: a.out Not tainted 5.6.0-rc2+ #92
[   34.851525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
[   34.852818] Call Trace:
[   34.853045]  dump_stack+0x8f/0xd0
[   34.853338]  handle_userfault.cold+0x1a/0x2e
[   34.853704]  ? find_held_lock+0x2b/0x80
[   34.854031]  ? __handle_mm_fault+0x18c5/0x1900
[   34.854409]  __handle_mm_fault+0x18d4/0x1900
[   34.854784]  handle_mm_fault+0x169/0x360
[   34.855120]  do_user_addr_fault+0x20d/0x490
[   34.855478]  async_page_fault+0x43/0x50
[   34.855809] RIP: 0033:0x401659
[   34.856069] Code: ba 1f 00 00 00 be 01 00 00 00 bf 10 21 40 00 e8 ad fa ff ff bf ff ff ff ff e8 93 fa ff ff 48 8b8
[   34.857629] RSP: 002b:00007ffcfd536ec0 EFLAGS: 00010246
[   34.858076] RAX: 00007fcba86a4000 RBX: 0000000000000000 RCX: 00007fcba85784ef
[   34.858675] RDX: 00007fcba86a4007 RSI: 00000000016524e0 RDI: 00007fcba864b320
[   34.859272] RBP: 00007ffcfd536f20 R08: 000000000000000a R09: 0000000000000070
[   34.859876] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000401120
[   34.860472] R13: 00007ffcfd537000 R14: 0000000000000000 R15: 0000000000000000

After this series:
Well, "Progress!" all day long.


Can we please have a way to identify that this "feature" is available?
I'd appreciate a new read-only UFFD_FEAT_ , so we can detect this from
user space easily and use concurrent discards without crashing our applications.


Questions:
1. I assume KVM will do multiple retries as well, and have the same behavior, right?

2. What will happen if I don't place a page on a pagefault, but only do a UFFDIO_WAKE?
   For now we were able to trigger a signal this way. If the behavior is changed, can
   we make this configurable via a UFFD_FEAT?

--- snip ---
#include <string.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <stdio.h>
#include <pthread.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <poll.h>
#include <linux/userfaultfd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>

static int page_size;

static void *fault_handler_thread(void *arg)
{
    const long uffd = (long) arg;
    struct pollfd pollfd = {
        .fd = uffd,
        .events = POLLIN,
    };
    int ret;

    while (true) {
        struct uffdio_zeropage zeropage = {};
        struct uffd_msg msg;
        ssize_t nread;

        if (poll(&pollfd, 1, -1) == -1) {
            fprintf(stderr, "POLL failed: %s\n", strerror(errno));
            exit(-1);
        }
        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg)) {
            fprintf(stderr, "READ failed\n");
            exit(-1);
        }
        if (msg.event != UFFD_EVENT_PAGEFAULT) {
            fprintf(stderr, "Not UFFD_EVENT_PAGEFAULT\n");
            exit(-1);
        }

        zeropage.range.start = msg.arg.pagefault.address;
        zeropage.range.len = page_size;
        do {
            ret = ioctl(uffd, UFFDIO_ZEROPAGE, &zeropage);
            if (ret && errno != EAGAIN) {
                fprintf(stderr, "UFFDIO_ZEROPAGE failed:%s\n", strerror(errno));
                exit(-1);
            }
        } while (ret);
    }
}
static void *discard_thread(void *arg)
{
    while (true) {
        if (madvise(arg, page_size, MADV_DONTNEED)) {
            fprintf(stderr, "MADV_DONTNEED failed:%s\n", strerror(errno));
            exit(-1);
        }
        usleep(1000);
    }
}

int main(void)
{
    struct uffdio_register reg;
    struct uffdio_api api = {
        .api = UFFD_API,
    };
    pthread_t fault, discard;
    long uffd;
    char *area;

    page_size = sysconf(_SC_PAGE_SIZE);

    uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd == -1) {
        fprintf(stderr, "Could not create uffd: %s\n", strerror(errno));
        exit(-1);
    }
    if (ioctl(uffd, UFFDIO_API, &api) == -1) {
        fprintf(stderr, "UFFDIO_API failed: %s\n", strerror(errno));
        exit(-1);
    }

    area = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED) {
        fprintf(stderr, "Could not allocate memory");
        exit(-1);
    }

    reg.range.start = (uint64_t) area;
    reg.range.len = page_size,
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1) {
        fprintf(stderr, "UFFDIO_REGISTER failed: %s\n", strerror(errno));
        exit(-1);
    }

    /* thread to provide zeropages */
    if (pthread_create(&fault, NULL, fault_handler_thread,
                       (void *) uffd)) {
        fprintf(stderr, "Could not create fault handling thread");
        exit(-1);
    }

    /* thread to discard the page */
    if (pthread_create(&discard, NULL, discard_thread,
                       (void *) area)) {
        fprintf(stderr, "Could not create discard thread");
        exit(-1);
    }

    /* keep reading/writing the page */
    while (true) {
        area[7] = area[1];
        usleep(10000);
        printf("Progress!\n");
    }
    return 0;
}
Peter Xu March 7, 2020, 9:47 p.m. UTC | #6
On Sat, Mar 07, 2020 at 09:33:08PM +0100, David Hildenbrand wrote:
> On 20.02.20 16:53, Peter Xu wrote:
> > [Resend v6]
> > 
> > This is v6 of the series.  It is majorly a rebase to 5.6-rc2, nothing
> > else to be expected (plus some tests after the rebase).  Instead of
> > rewrite the cover letter I decided to use what we have for v5.
> > 
> > Adding extra CCs for both Bobby Powers <bobbypowers@gmail.com> and
> > Brian Geffon <bgeffon@google.com>.
> > 
> > Online repo: https://github.com/xzpeter/linux/tree/mm-pf-signal-retry
> > 
> > Any review comment is appreciated.  Thanks,
> 
> If I am not completely missing something (and all my testing today was
> wrong) there is a very simple reason why I *LOVE* this series and it made
> my weekend. It makes userfaultfd with concurrent discarding (e.g.,
> MADV_DONTNEED) of pages actually usable.

Hi, David,

Thanks for doing further test against this branch!

> 
> The issue in current code is that between placing a page and waking
> up a waiter, somebody can zap the new placed page and trigger
> re-fault, triggering a SIGBUS and crashing an application where all
> memory is supposed to be accessible. And there is no real way to protect
> from that, because when the fault handler will be woken up and retry
> is not deterministic (e.g., making madvise(MADV_DONTNEED) and
> UFFDIO_ZEROPAGE mutually exclusive does not help).
> 
> Find a simple reproducer at the end of this mail.
> 
> Before this series:
> [root@localhost ~]# ./a.out 
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> [   34.849604] FAULT_FLAG_ALLOW_RETRY missing 70
> [   34.850466] CPU: 1 PID: 651 Comm: a.out Not tainted 5.6.0-rc2+ #92
> [   34.851525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
> [   34.852818] Call Trace:
> [   34.853045]  dump_stack+0x8f/0xd0
> [   34.853338]  handle_userfault.cold+0x1a/0x2e
> [   34.853704]  ? find_held_lock+0x2b/0x80
> [   34.854031]  ? __handle_mm_fault+0x18c5/0x1900
> [   34.854409]  __handle_mm_fault+0x18d4/0x1900
> [   34.854784]  handle_mm_fault+0x169/0x360
> [   34.855120]  do_user_addr_fault+0x20d/0x490
> [   34.855478]  async_page_fault+0x43/0x50
> [   34.855809] RIP: 0033:0x401659
> [   34.856069] Code: ba 1f 00 00 00 be 01 00 00 00 bf 10 21 40 00 e8 ad fa ff ff bf ff ff ff ff e8 93 fa ff ff 48 8b8
> [   34.857629] RSP: 002b:00007ffcfd536ec0 EFLAGS: 00010246
> [   34.858076] RAX: 00007fcba86a4000 RBX: 0000000000000000 RCX: 00007fcba85784ef
> [   34.858675] RDX: 00007fcba86a4007 RSI: 00000000016524e0 RDI: 00007fcba864b320
> [   34.859272] RBP: 00007ffcfd536f20 R08: 000000000000000a R09: 0000000000000070
> [   34.859876] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000401120
> [   34.860472] R13: 00007ffcfd537000 R14: 0000000000000000 R15: 0000000000000000
> 
> After this series:
> Well, "Progress!" all day long.

Yes, IIUC the race can happen like this in your below test:

     main thread          uffd thread             disgard thread
     ===========          ===========             ==============
     access page
       uffd page fault
         wait for page
                          UFFDIO_ZEROCOPY
                            put a page P there
                                                  MADV_DONTNEED on P
                            wakeup main thread
         return from fault
       page still missing
       uffd page fault again
       (without ALLOW_RETRY)
       --> SIGBUS.

And I agree this should be one if the major issues that this series
wants to address.

> 
> 
> Can we please have a way to identify that this "feature" is available?
> I'd appreciate a new read-only UFFD_FEAT_ , so we can detect this from
> user space easily and use concurrent discards without crashing our applications.

I'm not sure how others think about it, but to me this still fells
into the bucket of "solving an existing problem" rather than a
feature.  Also note that this should change the behavior for the page
fault logic in general, rather than an uffd-only change. So I'm also
not sure whether UFFD_FEAT_* suites here even if we want it.

> 
> 
> Questions:
> 1. I assume KVM will do multiple retries as well, and have the same behavior, right?

Yes I believe so, or we'll need to fix it.

> 
> 2. What will happen if I don't place a page on a pagefault, but only do a UFFDIO_WAKE?
>    For now we were able to trigger a signal this way.

If I'm not mistaken the UFFDIO_WAKE will directly trigger the sigbus
even without the help of the MADV_DONTNEED race.

> If the behavior is changed, can
>    we make this configurable via a UFFD_FEAT?

I'll still think that could be an overkill, but I'll leave the
discussion to the experts.

Thanks,

> 
> --- snip ---
> #include <string.h>
> #include <stdbool.h>
> #include <stdint.h>
> #include <sys/types.h>
> #include <stdio.h>
> #include <pthread.h>
> #include <errno.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <fcntl.h>
> #include <poll.h>
> #include <linux/userfaultfd.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <sys/ioctl.h>
> 
> static int page_size;
> 
> static void *fault_handler_thread(void *arg)
> {
>     const long uffd = (long) arg;
>     struct pollfd pollfd = {
>         .fd = uffd,
>         .events = POLLIN,
>     };
>     int ret;
> 
>     while (true) {
>         struct uffdio_zeropage zeropage = {};
>         struct uffd_msg msg;
>         ssize_t nread;
> 
>         if (poll(&pollfd, 1, -1) == -1) {
>             fprintf(stderr, "POLL failed: %s\n", strerror(errno));
>             exit(-1);
>         }
>         if (read(uffd, &msg, sizeof(msg)) != sizeof(msg)) {
>             fprintf(stderr, "READ failed\n");
>             exit(-1);
>         }
>         if (msg.event != UFFD_EVENT_PAGEFAULT) {
>             fprintf(stderr, "Not UFFD_EVENT_PAGEFAULT\n");
>             exit(-1);
>         }
> 
>         zeropage.range.start = msg.arg.pagefault.address;
>         zeropage.range.len = page_size;
>         do {
>             ret = ioctl(uffd, UFFDIO_ZEROPAGE, &zeropage);
>             if (ret && errno != EAGAIN) {
>                 fprintf(stderr, "UFFDIO_ZEROPAGE failed:%s\n", strerror(errno));
>                 exit(-1);
>             }
>         } while (ret);
>     }
> }
> static void *discard_thread(void *arg)
> {
>     while (true) {
>         if (madvise(arg, page_size, MADV_DONTNEED)) {
>             fprintf(stderr, "MADV_DONTNEED failed:%s\n", strerror(errno));
>             exit(-1);
>         }
>         usleep(1000);
>     }
> }
> 
> int main(void)
> {
>     struct uffdio_register reg;
>     struct uffdio_api api = {
>         .api = UFFD_API,
>     };
>     pthread_t fault, discard;
>     long uffd;
>     char *area;
> 
>     page_size = sysconf(_SC_PAGE_SIZE);
> 
>     uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
>     if (uffd == -1) {
>         fprintf(stderr, "Could not create uffd: %s\n", strerror(errno));
>         exit(-1);
>     }
>     if (ioctl(uffd, UFFDIO_API, &api) == -1) {
>         fprintf(stderr, "UFFDIO_API failed: %s\n", strerror(errno));
>         exit(-1);
>     }
> 
>     area = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
>                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>     if (area == MAP_FAILED) {
>         fprintf(stderr, "Could not allocate memory");
>         exit(-1);
>     }
> 
>     reg.range.start = (uint64_t) area;
>     reg.range.len = page_size,
>     reg.mode = UFFDIO_REGISTER_MODE_MISSING;
>     if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1) {
>         fprintf(stderr, "UFFDIO_REGISTER failed: %s\n", strerror(errno));
>         exit(-1);
>     }
> 
>     /* thread to provide zeropages */
>     if (pthread_create(&fault, NULL, fault_handler_thread,
>                        (void *) uffd)) {
>         fprintf(stderr, "Could not create fault handling thread");
>         exit(-1);
>     }
> 
>     /* thread to discard the page */
>     if (pthread_create(&discard, NULL, discard_thread,
>                        (void *) area)) {
>         fprintf(stderr, "Could not create discard thread");
>         exit(-1);
>     }
> 
>     /* keep reading/writing the page */
>     while (true) {
>         area[7] = area[1];
>         usleep(10000);
>         printf("Progress!\n");
>     }
>     return 0;
> }
> 
> 
> -- 
> Thanks,
> 
> David / dhildenb
>
David Hildenbrand March 8, 2020, 12:12 p.m. UTC | #7
[...]

> Yes, IIUC the race can happen like this in your below test:
> 
>      main thread          uffd thread             disgard thread
>      ===========          ===========             ==============
>      access page
>        uffd page fault
>          wait for page
>                           UFFDIO_ZEROCOPY
>                             put a page P there
>                                                   MADV_DONTNEED on P
>                             wakeup main thread
>          return from fault
>        page still missing
>        uffd page fault again
>        (without ALLOW_RETRY)
>        --> SIGBUS.

Exactly!

>> Can we please have a way to identify that this "feature" is available?
>> I'd appreciate a new read-only UFFD_FEAT_ , so we can detect this from
>> user space easily and use concurrent discards without crashing our applications.
> 
> I'm not sure how others think about it, but to me this still fells
> into the bucket of "solving an existing problem" rather than a
> feature.  Also note that this should change the behavior for the page
> fault logic in general, rather than an uffd-only change. So I'm also
> not sure whether UFFD_FEAT_* suites here even if we want it.

So, are we planning on backporting this to stable kernels?

Imagine using this in QEMU/KVM to allow discards (e.g., balloon
inflation) while postcopy is active . You certainly don't want random
guest crashes. So either, we treat this as a fix (and backport) or as a
change in behavior/feature.

[...]

>>
>> 2. What will happen if I don't place a page on a pagefault, but only do a UFFDIO_WAKE?
>>    For now we were able to trigger a signal this way.
> 
> If I'm not mistaken the UFFDIO_WAKE will directly trigger the sigbus
> even without the help of the MADV_DONTNEED race.

Yes, that's the current way of injecting a SIGBUS instead of resolving
the pagefault. And AFAIKs, you're changing that behavior. (I am not
aware of a user, there could be use cases, but somehow it's strange to
get a signal when accessing memory that is mapped READ|WRITE and also
represented like this in e.g., /proc/$PID/maps). So IMHO, only the new
behavior makes really sense.

> 
>> If the behavior is changed, can
>>    we make this configurable via a UFFD_FEAT?
> 
> I'll still think that could be an overkill, but I'll leave the
> discussion to the experts.

I'll be happy to hear what Andrea Et al. think. At least I really want
to see the new behavior - and if it's not a fix, then I want some way to
detect if a kernel has this new (fixed?) behavior.

Thanks a lot for this work!
David Hildenbrand March 8, 2020, 12:49 p.m. UTC | #8
On 07.03.20 21:33, David Hildenbrand wrote:
> On 20.02.20 16:53, Peter Xu wrote:
>> [Resend v6]
>>
>> This is v6 of the series.  It is majorly a rebase to 5.6-rc2, nothing
>> else to be expected (plus some tests after the rebase).  Instead of
>> rewrite the cover letter I decided to use what we have for v5.
>>
>> Adding extra CCs for both Bobby Powers <bobbypowers@gmail.com> and
>> Brian Geffon <bgeffon@google.com>.
>>
>> Online repo: https://github.com/xzpeter/linux/tree/mm-pf-signal-retry
>>
>> Any review comment is appreciated.  Thanks,
> 
> If I am not completely missing something (and all my testing today was
> wrong) there is a very simple reason why I *LOVE* this series and it made
> my weekend. It makes userfaultfd with concurrent discarding (e.g.,
> MADV_DONTNEED) of pages actually usable.
> 
> The issue in current code is that between placing a page and waking
> up a waiter, somebody can zap the new placed page and trigger
> re-fault, triggering a SIGBUS and crashing an application where all
> memory is supposed to be accessible. And there is no real way to protect
> from that, because when the fault handler will be woken up and retry
> is not deterministic (e.g., making madvise(MADV_DONTNEED) and
> UFFDIO_ZEROPAGE mutually exclusive does not help).
> 
> Find a simple reproducer at the end of this mail.

See below for another test case. It's not immediately clear why we can get
SIGBUS in this example, UFFD_FEATURE_EVENT_REMOVE was supposed to protect
us from this - I thought. I think because the actual discard is delayed after
the UFFD_EVENT_REMOVE has been processed until the next UFFDIO_ZEROPAGE
takes place. Then we have the same race as in the other test case.

It also showcases why the use of UFFD_FEATURE_EVENT_REMOVE in combination
with placing of pages in a handler can easily deadlock.

Before this series: SIGBUS - FAULT_FLAG_ALLOW_RETRY missing 70
After this series: Keeps on running

But the general behavior of UFFD_FEATURE_EVENT_REMOVE is not
really useful in practice due to the possible deadlock that I work
around in this patch. You can drop the read() etc. from the loop
to observe the deadlock.

---
#include <string.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <stdio.h>
#include <pthread.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <poll.h>
#include <linux/userfaultfd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>

static int page_size;

static void *fault_handler_thread(void *arg)
{
    const long uffd = (long) arg;
    struct pollfd pollfd = {
        .fd = uffd,
        .events = POLLIN,
    };
    int ret;

    while (true) {
        struct uffdio_zeropage zeropage = {};
        struct uffd_msg msg;
        ssize_t nread;

        if (poll(&pollfd, 1, -1) == -1) {
            fprintf(stderr, "POLL failed: %s\n", strerror(errno));
            exit(-1);
        }
        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg)) {
            fprintf(stderr, "READ failed\n");
            exit(-1);
        }
        if (msg.event == UFFD_EVENT_REMOVE) {
            continue;
        }
        if (msg.event != UFFD_EVENT_PAGEFAULT) {
            fprintf(stderr, "Not UFFD_EVENT_PAGEFAULT\n");
            exit(-1);
        }

        zeropage.range.start = msg.arg.pagefault.address;
        zeropage.range.len = page_size;
        /*
         * The following loop would deadlock in case we get a MADV_DONTNEED
         * at the wrong time. UFFDIO_ZEROPAGE won't be able to make progress
         * until we processed the UFFD_EVENT_REMOVE. So we manually have
         * to read and process the UFFD_EVENT_REMOVE. This is nasty, because
         * once we could get multiple UFFD_EVENT_PAGEFAULT, this would no
         * longer be usable - we would get another pagefault request, which
         * we cannot process and so on ...
         */
        do {
            ret = ioctl(uffd, UFFDIO_ZEROPAGE, &zeropage);
            if (ret && errno != EAGAIN) {
                fprintf(stderr, "UFFDIO_ZEROPAGE failed:%s\n", strerror(errno));
                exit(-1);
            }
            /*
             * We're expecting a UFFD_EVENT_REMOVE event Soemtimes we
             * get none ... ?
             */
            if (read(uffd, &msg, sizeof(msg)) != sizeof(msg)) {
                continue;
            }
            if (msg.event != UFFD_EVENT_REMOVE) {
                fprintf(stderr, "Not a remove event\n");
                continue;
            }
        } while (ret);
    }
}

static void *discard_thread(void *arg)
{
    while (true) {
        if (madvise(arg, page_size, MADV_DONTNEED)) {
            fprintf(stderr, "MADV_DONTNEED failed:%s\n", strerror(errno));
            exit(-1);
        }
        printf("Discard!\n");
        usleep(1000);
    }
}

int main(void)
{
    struct uffdio_register reg;
    struct uffdio_api api = {
        .api = UFFD_API,
        .features = UFFD_FEATURE_EVENT_REMOVE,
    };
    pthread_t fault, discard;
    long uffd;
    char *area;

    page_size = sysconf(_SC_PAGE_SIZE);

    uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd == -1) {
        fprintf(stderr, "Could not create uffd: %s\n", strerror(errno));
        exit(-1);
    }
    if (ioctl(uffd, UFFDIO_API, &api) == -1) {
        fprintf(stderr, "UFFDIO_API failed: %s\n", strerror(errno));
        exit(-1);
    }

    area = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED) {
        fprintf(stderr, "Could not allocate memory");
        exit(-1);
    }

    reg.range.start = (uint64_t) area;
    reg.range.len = page_size,
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1) {
        fprintf(stderr, "UFFDIO_REGISTER failed: %s\n", strerror(errno));
        exit(-1);
    }

    /* thread to provide zeropages */
    if (pthread_create(&fault, NULL, fault_handler_thread,
                       (void *) uffd)) {
        fprintf(stderr, "Could not create fault handing thread");
        exit(-1);
    }
    /* thread to discard the page */
    if (pthread_create(&discard, NULL, discard_thread,
                       (void *) area)) {
        fprintf(stderr, "Could not create discard thread");
        exit(-1);
    }

    /* keep reading/writing the page */
    while (true) {
        area[7] = area[1];
        usleep(10000);
        printf("Progress!\n");
    }
    return 0;
}
Peter Xu March 9, 2020, 7:51 p.m. UTC | #9
On Sun, Mar 08, 2020 at 01:12:34PM +0100, David Hildenbrand wrote:
> [...]
> 
> > Yes, IIUC the race can happen like this in your below test:
> > 
> >      main thread          uffd thread             disgard thread
> >      ===========          ===========             ==============
> >      access page
> >        uffd page fault
> >          wait for page
> >                           UFFDIO_ZEROCOPY
> >                             put a page P there
> >                                                   MADV_DONTNEED on P
> >                             wakeup main thread
> >          return from fault
> >        page still missing
> >        uffd page fault again
> >        (without ALLOW_RETRY)
> >        --> SIGBUS.
> 
> Exactly!
> 
> >> Can we please have a way to identify that this "feature" is available?
> >> I'd appreciate a new read-only UFFD_FEAT_ , so we can detect this from
> >> user space easily and use concurrent discards without crashing our applications.
> > 
> > I'm not sure how others think about it, but to me this still fells
> > into the bucket of "solving an existing problem" rather than a
> > feature.  Also note that this should change the behavior for the page
> > fault logic in general, rather than an uffd-only change. So I'm also
> > not sure whether UFFD_FEAT_* suites here even if we want it.
> 
> So, are we planning on backporting this to stable kernels?

I don't have a plan so far.  I'm still at the phase to only worry
about whether it can be at least merged in master.. :)

I would think it won't worth it to backport this to stables though,
considering that it could potentially change quite a bit for faulting
procedures, and after all the issues we're fixing shouldn't be common
to general users.

> 
> Imagine using this in QEMU/KVM to allow discards (e.g., balloon
> inflation) while postcopy is active . You certainly don't want random
> guest crashes. So either, we treat this as a fix (and backport) or as a
> change in behavior/feature.

I think we don't need to worry on that - QEMU will prohibit ballooning
during postcopy starting from the first day.  Feel free to see QEMU
commit 371ff5a3f04cd7 ("Inhibit ballooning during postcopy").

> 
> [...]
> 
> >>
> >> 2. What will happen if I don't place a page on a pagefault, but only do a UFFDIO_WAKE?
> >>    For now we were able to trigger a signal this way.
> > 
> > If I'm not mistaken the UFFDIO_WAKE will directly trigger the sigbus
> > even without the help of the MADV_DONTNEED race.
> 
> Yes, that's the current way of injecting a SIGBUS instead of resolving
> the pagefault. And AFAIKs, you're changing that behavior. (I am not
> aware of a user, there could be use cases, but somehow it's strange to
> get a signal when accessing memory that is mapped READ|WRITE and also
> represented like this in e.g., /proc/$PID/maps). So IMHO, only the new
> behavior makes really sense.

I agree, I'm not sure how other people think on ABI stability, but...
for my own preference I don't worry much on ABI breakage for a problem
like this.

Thanks,
David Hildenbrand March 9, 2020, 8:06 p.m. UTC | #10
> Am 09.03.2020 um 20:51 schrieb Peter Xu <peterx@redhat.com>:
> 
> On Sun, Mar 08, 2020 at 01:12:34PM +0100, David Hildenbrand wrote:
>> [...]
>> 
>>> Yes, IIUC the race can happen like this in your below test:
>>> 
>>>     main thread          uffd thread             disgard thread
>>>     ===========          ===========             ==============
>>>     access page
>>>       uffd page fault
>>>         wait for page
>>>                          UFFDIO_ZEROCOPY
>>>                            put a page P there
>>>                                                  MADV_DONTNEED on P
>>>                            wakeup main thread
>>>         return from fault
>>>       page still missing
>>>       uffd page fault again
>>>       (without ALLOW_RETRY)
>>>       --> SIGBUS.
>> 
>> Exactly!
>> 
>>>> Can we please have a way to identify that this "feature" is available?
>>>> I'd appreciate a new read-only UFFD_FEAT_ , so we can detect this from
>>>> user space easily and use concurrent discards without crashing our applications.
>>> 
>>> I'm not sure how others think about it, but to me this still fells
>>> into the bucket of "solving an existing problem" rather than a
>>> feature.  Also note that this should change the behavior for the page
>>> fault logic in general, rather than an uffd-only change. So I'm also
>>> not sure whether UFFD_FEAT_* suites here even if we want it.
>> 
>> So, are we planning on backporting this to stable kernels?
> 
> I don't have a plan so far.  I'm still at the phase to only worry
> about whether it can be at least merged in master.. :)
> 
> I would think it won't worth it to backport this to stables though,
> considering that it could potentially change quite a bit for faulting
> procedures, and after all the issues we're fixing shouldn't be common
> to general users.
> 
>> 
>> Imagine using this in QEMU/KVM to allow discards (e.g., balloon
>> inflation) while postcopy is active . You certainly don't want random
>> guest crashes. So either, we treat this as a fix (and backport) or as a
>> change in behavior/feature.
> 
> I think we don't need to worry on that - QEMU will prohibit ballooning
> during postcopy starting from the first day.  Feel free to see QEMU
> commit 371ff5a3f04cd7 ("Inhibit ballooning during postcopy").

Imagine I want to change that or imagine I have another user that heavily depends on such races to never happen.

IOW I want to know for sure if my application can crash or not.

@Andrea what are your thoughts on a new feature flag to identify this behavior?