[0/7] mm/userfaultfd/poll: Scale userfaultfd wakeups

Message ID 20230905214235.320571-1-peterx@redhat.com (mailing list archive)
Series mm/userfaultfd/poll: Scale userfaultfd wakeups

Message

Peter Xu Sept. 5, 2023, 9:42 p.m. UTC
Userfaultfd is the type of file that doesn't need wake-all semantics: if
there is a message enqueued (for either a fault address or an event), we
only need to wake up one service thread to handle it.  Waking up more
threads is normally a waste of CPU cycles.  Besides that, and more
importantly, it just doesn't scale.

Andrea used to have a patch that made read() wakeups O(1), but it never hit
upstream.  This is my effort to upstream it (it is a one-liner..).  On top
of that I also made poll() O(1) on wakeup (more or less bringing
EPOLLEXCLUSIVE semantics to poll()), with some tests showing the effect.
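
For context, and only as a sketch of the idea rather than the actual patch:
the read() side change boils down to registering the reading thread as an
*exclusive* waiter on ctx->fd_wqh, so that a wakeup pops a single reader
instead of all of them.  Roughly, using the generic waitqueue API
(uffd_msg_pending() is a made-up helper standing in for the real
pending-message check, and signal handling is omitted):

/*
 * Hedged sketch only, not the actual patch: it illustrates the
 * exclusive-wait idea with the generic waitqueue API, assuming it lives
 * next to the userfaultfd code so struct userfaultfd_ctx is visible.
 * uffd_msg_pending() is a made-up helper.
 */
static void uffd_wait_for_msg(struct userfaultfd_ctx *ctx)
{
	DECLARE_WAITQUEUE(wait, current);

	spin_lock_irq(&ctx->fd_wqh.lock);
	/*
	 * Exclusive entries are queued at the tail and wake_up() stops
	 * after waking one of them, so one enqueued message schedules
	 * one reader no matter how many others are sleeping here.
	 */
	__add_wait_queue_exclusive(&ctx->fd_wqh, &wait);
	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (uffd_msg_pending(ctx))
			break;
		spin_unlock_irq(&ctx->fd_wqh.lock);
		schedule();
		spin_lock_irq(&ctx->fd_wqh.lock);
	}
	__remove_wait_queue(&ctx->fd_wqh, &wait);
	__set_current_state(TASK_RUNNING);
	spin_unlock_irq(&ctx->fd_wqh.lock);
}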

To verify this, I added a test called uffd-perf (leveraging the refactored
uffd selftest suite) that measures the wakeup latency of the messaging
channel; the effect of the waitqueue optimizations shows up in the new test:

        Constants: 40 uffd threads, on N_CPUS=40, memsize=512M
        Units: milliseconds (to finish the test)
        |-----------------+--------+-------+------------|
        | test case       | before | after |   diff (%) |
        |-----------------+--------+-------+------------|
        | workers=8,poll  |   1762 |  1133 | -55.516328 |
        | workers=8,read  |   1437 |   585 | -145.64103 |
        | workers=16,poll |   1117 |  1097 | -1.8231541 |
        | workers=16,read |   1159 |   759 | -52.700922 |
        | workers=32,poll |   1001 |   973 | -2.8776978 |
        | workers=32,read |    866 |   713 | -21.458626 |
        |-----------------+--------+-------+------------|

The more threads hanging on fd_wqh, the bigger the difference shown in the
numbers.  "8 worker threads" is the worst case here because it means up to
40-8=32 threads can be hanging idle on the fd_wqh queue.

In real life there can be more workers than this, but a small number of
active worker threads will cause a similar effect.
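
For reference, each worker in such a test is essentially a uffd reader
loop.  Below is a minimal sketch of one, using the standard userfaultfd
userspace API; it is not the uffd-perf code itself, and resolve_fault() is
a made-up placeholder for whatever actually services the fault:

#include <poll.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Made-up placeholder for the actual fault servicing (e.g. UFFDIO_COPY). */
static void resolve_fault(int uffd, unsigned long addr);

/* Sketch of a uffd worker loop; not the actual uffd-perf code. */
static void *uffd_worker(void *arg)
{
	int uffd = *(int *)arg;
	struct uffd_msg msg;
	struct pollfd pfd = { .fd = uffd, .events = POLLIN };

	for (;;) {
		/* Before this series, one message wakes every poller... */
		if (poll(&pfd, 1, -1) <= 0)
			break;
		/* ...and every reader sleeping here, although only one of
		 * them ends up dequeueing the message. */
		if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
			continue;
		if (msg.event == UFFD_EVENT_PAGEFAULT)
			resolve_fault(uffd, (unsigned long)msg.arg.pagefault.address);
	}
	return NULL;
}

The interesting part for this series is only the poll()/read() pair: that
is where the idle workers pile up on fd_wqh.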

This is currently based on Andrew's mm-unstable branch, but it should apply
to most reasonably recent trees.

Comments welcome, thanks.

Andrea Arcangeli (1):
  mm/userfaultfd: Make uffd read() wait event exclusive

Peter Xu (6):
  poll: Add a poll_flags for poll_queue_proc()
  poll: POLL_ENQUEUE_EXCLUSIVE
  fs/userfaultfd: Use exclusive waitqueue for poll()
  selftests/mm: Replace uffd_read_mutex with a semaphore
  selftests/mm: Create uffd_fault_thread_create|join()
  selftests/mm: uffd perf test

 drivers/vfio/virqfd.c                    |   4 +-
 drivers/vhost/vhost.c                    |   2 +-
 drivers/virt/acrn/irqfd.c                |   2 +-
 fs/aio.c                                 |   2 +-
 fs/eventpoll.c                           |   2 +-
 fs/select.c                              |   9 +-
 fs/userfaultfd.c                         |   8 +-
 include/linux/poll.h                     |  25 ++-
 io_uring/poll.c                          |   4 +-
 mm/memcontrol.c                          |   4 +-
 net/9p/trans_fd.c                        |   3 +-
 tools/testing/selftests/mm/Makefile      |   2 +
 tools/testing/selftests/mm/uffd-common.c |  65 +++++++
 tools/testing/selftests/mm/uffd-common.h |   7 +
 tools/testing/selftests/mm/uffd-perf.c   | 207 +++++++++++++++++++++++
 tools/testing/selftests/mm/uffd-stress.c |  53 +-----
 virt/kvm/eventfd.c                       |   2 +-
 17 files changed, 337 insertions(+), 64 deletions(-)
 create mode 100644 tools/testing/selftests/mm/uffd-perf.c

Comments

Axel Rasmussen Sept. 7, 2023, 7:18 p.m. UTC | #1
On Tue, Sep 5, 2023 at 2:42 PM Peter Xu <peterx@redhat.com> wrote:
>
> Userfaultfd is the type of file that doesn't need wake-all semantics: if
> there is a message enqueued (for either a fault address or an event), we
> only need to wake up one service thread to handle it.  Waking up more
> threads is normally a waste of CPU cycles.  Besides that, and more
> importantly, it just doesn't scale.

Hi Peter,

I took a quick look over the series and didn't see anything
objectionable. I was planning to actually test the series out and then
send out R-b's, but it will take some additional time (next week).

In the meantime, I was curious about the use case. A design I've seen
for VM live migration is to have 1 thread reading events off the uffd,
and then have many threads actually resolving the fault events that
come in (e.g. fetching pages over the network, issuing UFFDIO_COPY or
UFFDIO_CONTINUE, or whatever). In that design, since we only have a
single reader anyway, I think this series doesn't help.
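
The resolver side of that design is mostly a matter of issuing the right
ioctl once the page contents are available.  A minimal sketch of a
UFFDIO_COPY-based resolver, purely illustrative (the function name and
parameters are made up; this is not code from QEMU or any other VMM):

#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Illustrative only: install a fetched page and wake the faulting thread. */
static int resolve_with_copy(int uffd, unsigned long fault_addr,
			     void *src, unsigned long page_size)
{
	struct uffdio_copy copy = {
		.dst  = fault_addr & ~(page_size - 1),	/* page-align the fault */
		.src  = (unsigned long)src,		/* buffer filled by the fetcher */
		.len  = page_size,
		.mode = 0,				/* 0 == also wake the faulter */
	};

	return ioctl(uffd, UFFDIO_COPY, &copy);
}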

But, I'm curious if you have data indicating that > 1 reader is more
performant overall? I suspect it might be the case that, with "enough"
vCPUs, it makes sense to do so, but I don't have benchmark data to
tell me what that tipping point is yet.

OTOH, if one reader is plenty in ~all cases, optimizing this path is
less important.




>
> Andrea used to have a patch that made read() wakeups O(1), but it never
> hit upstream.  This is my effort to upstream it (it is a one-liner..).
> On top of that I also made poll() O(1) on wakeup (more or less bringing
> EPOLLEXCLUSIVE semantics to poll()), with some tests showing the effect.
>
> To verify this, I added a test called uffd-perf (leveraging the refactored
> uffd selftest suite) that measures the wakeup latency of the messaging
> channel; the effect of the waitqueue optimizations shows up in the new test:
>
>         Constants: 40 uffd threads, on N_CPUS=40, memsize=512M
>         Units: milliseconds (to finish the test)
>         |-----------------+--------+-------+------------|
>         | test case       | before | after |   diff (%) |
>         |-----------------+--------+-------+------------|
>         | workers=8,poll  |   1762 |  1133 | -55.516328 |
>         | workers=8,read  |   1437 |   585 | -145.64103 |
>         | workers=16,poll |   1117 |  1097 | -1.8231541 |
>         | workers=16,read |   1159 |   759 | -52.700922 |
>         | workers=32,poll |   1001 |   973 | -2.8776978 |
>         | workers=32,read |    866 |   713 | -21.458626 |
>         |-----------------+--------+-------+------------|
>
> The more threads hanging on fd_wqh, the bigger the difference shown in the
> numbers.  "8 worker threads" is the worst case here because it means up to
> 40-8=32 threads can be hanging idle on the fd_wqh queue.
>
> In real life there can be more workers than this, but a small number of
> active worker threads will cause a similar effect.
>
> This is currently based on Andrew's mm-unstable branch, but it should
> apply to most reasonably recent trees.
>
> Comments welcome, thanks.
>
> Andrea Arcangeli (1):
>   mm/userfaultfd: Make uffd read() wait event exclusive
>
> Peter Xu (6):
>   poll: Add a poll_flags for poll_queue_proc()
>   poll: POLL_ENQUEUE_EXCLUSIVE
>   fs/userfaultfd: Use exclusive waitqueue for poll()
>   selftests/mm: Replace uffd_read_mutex with a semaphore
>   selftests/mm: Create uffd_fault_thread_create|join()
>   selftests/mm: uffd perf test
>
>  drivers/vfio/virqfd.c                    |   4 +-
>  drivers/vhost/vhost.c                    |   2 +-
>  drivers/virt/acrn/irqfd.c                |   2 +-
>  fs/aio.c                                 |   2 +-
>  fs/eventpoll.c                           |   2 +-
>  fs/select.c                              |   9 +-
>  fs/userfaultfd.c                         |   8 +-
>  include/linux/poll.h                     |  25 ++-
>  io_uring/poll.c                          |   4 +-
>  mm/memcontrol.c                          |   4 +-
>  net/9p/trans_fd.c                        |   3 +-
>  tools/testing/selftests/mm/Makefile      |   2 +
>  tools/testing/selftests/mm/uffd-common.c |  65 +++++++
>  tools/testing/selftests/mm/uffd-common.h |   7 +
>  tools/testing/selftests/mm/uffd-perf.c   | 207 +++++++++++++++++++++++
>  tools/testing/selftests/mm/uffd-stress.c |  53 +-----
>  virt/kvm/eventfd.c                       |   2 +-
>  17 files changed, 337 insertions(+), 64 deletions(-)
>  create mode 100644 tools/testing/selftests/mm/uffd-perf.c
>
> --
> 2.41.0
>
Peter Xu Sept. 8, 2023, 10:01 p.m. UTC | #2
On Thu, Sep 07, 2023 at 12:18:29PM -0700, Axel Rasmussen wrote:
> On Tue, Sep 5, 2023 at 2:42 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Userfaultfd is the type of file that doesn't need wake-all semantics: if
> > there is a message enqueued (for either a fault address or an event), we
> > only need to wake up one service thread to handle it.  Waking up more
> > threads is normally a waste of CPU cycles.  Besides that, and more
> > importantly, it just doesn't scale.
> 
> Hi Peter,

Hi, Axel,

Sorry to respond late.

> 
> I took a quick look over the series and didn't see anything
> objectionable. I was planning to actually test the series out and then
> send out R-b's, but it will take some additional time (next week).

Thanks.  The 2nd patch definitely needs some fixups in a few functions
(I overlooked them without enough CONFIG_* options enabled; I'm surprised I
even had vhost compiled out when testing..), hope that won't bring you too
much trouble.  I'll send a fixup soon on top of patch 2.

> 
> In the meantime, I was curious about the use case. A design I've seen
> for VM live migration is to have 1 thread reading events off the uffd,
> and then have many threads actually resolving the fault events that
> come in (e.g. fetching pages over the network, issuing UFFDIO_COPY or
> UFFDIO_CONTINUE, or whatever). In that design, since we only have a
> single reader anyway, I think this series doesn't help.

Yes.  If the setup only uses 1 reader thread, it shouldn't make much of a
difference.

> 
> But, I'm curious if you have data indicating that > 1 reader is more
> performant overall? I suspect it might be the case that, with "enough"
> vCPUs, it makes sense to do so, but I don't have benchmark data to
> tell me what that tipping point is yet.
> 
> OTOH, if one reader is plenty in ~all cases, optimizing this path is
> less important.

Indeed, I don't yet have an application myself that can leverage this much,
because QEMU so far only uses 1 reader thread.

IIRC Anish was proposing exactly some KVM-specific solutions to make a
single uffd scale, and this might be suitable for any use case like that
where we want to use a single uffd and make it scale with threads.  Using
1 reader + N workers is also a solution, but an app using N readers (which
also do the work) will hit this problem.

I am also aware that some apps use more than 1 reader thread (umap), but I
don't really know more than that.

The problem is, I think we shouldn't incur that overhead just because an
app uses more than one reader; meanwhile it also doesn't make much sense to
wake up all readers for a single userfault event.  So this should always be
a good thing to have.

Thanks,