Message ID: 20220106004656.126790-1-daniel.m.jordan@oracle.com (mailing list archive)
Series: padata, vfio, sched: Multithreaded VFIO page pinning
On Wed, Jan 05, 2022 at 07:46:40PM -0500, Daniel Jordan wrote:
> Get ready to parallelize. In particular, pinning can fail, so make jobs
> undo-able.
>
>  5 vfio/type1: Pass mm to vfio_pin_pages_remote()
>  6 vfio/type1: Refactor dma map removal
>  7 vfio/type1: Parallelize vfio_pin_map_dma()
>  8 vfio/type1: Cache locked_vm to ease mmap_lock contention

In some ways this kind of seems like overkill, why not just have
userspace break the guest VA into chunks and call map in parallel?
Similar to how it already does the prealloc in parallel?

This is a simpler kernel job of optimizing locking to allow
concurrency.

It is also not good that this inserts arbitrary cuts in the IOVA
address space, that will cause iommu_map() to be called with smaller
npages, and could result in a long term inefficiency in the iommu.

I don't know how the kernel can combat this without prior knowledge of
the likely physical memory layout (eg is the VM using 1G huge pages or
something)..

Personally I'd rather see the results from Matthew's work to allow GUP
to work on folios efficiently before reaching to this extreme.

The results you got of only 1.2x improvement don't seem so compelling.
Based on the unpin work I fully expect that folio-optimized GUP will do
much better than that single-threaded..

Jason
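A minimal userspace sketch of the chunk-and-map-in-parallel approach Jason suggests, against the real VFIO type1 uAPI. It assumes a container fd with the type1 IOMMU already set; the thread count and the 4k page-size mask are illustrative only, and, as the rest of the thread discusses, these concurrent ioctls would today still serialize on vfio's internal iommu->lock while pinning:

```c
#include <pthread.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

struct map_chunk {
	int fd;			/* VFIO container fd, type1 IOMMU set */
	__u64 vaddr, iova, size;
};

static void *map_one_chunk(void *arg)
{
	struct map_chunk *c = arg;
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = c->vaddr,
		.iova  = c->iova,
		.size  = c->size,
	};

	/* Each thread pins and maps its own slice of guest memory. */
	return (void *)(long)ioctl(c->fd, VFIO_IOMMU_MAP_DMA, &map);
}

/* Split [vaddr, vaddr + len) into nr page-aligned pieces, one per thread. */
static void map_in_parallel(int fd, __u64 vaddr, __u64 iova, __u64 len, int nr)
{
	pthread_t tid[nr];
	struct map_chunk c[nr];
	__u64 piece = (len / nr) & ~0xfffULL;	/* assumes 4k pages */
	int i;

	for (i = 0; i < nr; i++) {
		__u64 off = (__u64)i * piece;

		c[i] = (struct map_chunk) {
			.fd = fd, .vaddr = vaddr + off, .iova = iova + off,
			.size = (i == nr - 1) ? len - off : piece,
		};
		pthread_create(&tid[i], NULL, map_one_chunk, &c[i]);
	}
	for (i = 0; i < nr; i++)
		pthread_join(tid[i], NULL);
}
```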
On Wed, Jan 05, 2022 at 09:13:06PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 05, 2022 at 07:46:40PM -0500, Daniel Jordan wrote:
>
> > Get ready to parallelize. In particular, pinning can fail, so make jobs
> > undo-able.
> >
> >  5 vfio/type1: Pass mm to vfio_pin_pages_remote()
> >  6 vfio/type1: Refactor dma map removal
> >  7 vfio/type1: Parallelize vfio_pin_map_dma()
> >  8 vfio/type1: Cache locked_vm to ease mmap_lock contention
>
> In some ways this kind of seems like overkill, why not just have
> userspace break the guest VA into chunks and call map in parallel?
> Similar to how it already does the prealloc in parallel?
>
> This is a simpler kernel job of optimizing locking to allow
> concurrency.

I didn't consider doing it that way, and am not seeing a fundamental
reason it wouldn't work right off the bat. At a glance, I think pinning
would need to be moved out from under vfio's iommu->lock. I haven't
checked to see how hard it would be, but that plus the locking
optimizations might end up being about the same amount of complexity
as the multithreading in the vfio driver, and doing this in the kernel
would speed things up for all vfio users without having to duplicate
the parallelism in userspace.

But yes, agreed, the lock optimization could definitely be split out
and used in a different approach.

> It is also not good that this inserts arbitrary cuts in the IOVA
> address space, that will cause iommu_map() to be called with smaller
> npages, and could result in a long term inefficiency in the iommu.
>
> I don't know how the kernel can combat this without prior knowledge of
> the likely physical memory layout (eg is the VM using 1G huge pages or
> something)..

The cuts aren't arbitrary, padata controls where they happen. This is
optimizing for big memory ranges, so why isn't it enough that padata
breaks up the work along a big enough page-aligned chunk? The vfio
driver does one iommu mapping per physically contiguous range, and I
don't think those will be large enough to be affected using such a
chunk size. If cuts in per-thread ranges are an issue, I *think*
userspace has the same problem?

> Personally I'd rather see the results from Matthew's work to allow GUP
> to work on folios efficiently before reaching to this extreme.
>
> The results you got of only 1.2x improvement don't seem so
> compelling.

I know you understand, but just to be clear for everyone, that 1.2x is
the overall improvement to qemu init from multithreaded pinning alone
when prefaulting is done in both base and test.

Pinning itself, the only thing being optimized, improves 8.5x in that
experiment, bringing the time from 1.8 seconds to .2 seconds. That's a
significant savings IMHO

> Based on the unpin work I fully expect that folio-optimized GUP will
> do much better than that single-threaded..

Yeah, I'm curious to see how folio will do as well. And there are some
very nice, efficiently gained speedups in the unpin work. Changes like
that benefit all gup users, too, as you've pointed out before.

But, I'm skeptical that singlethreaded optimization alone will remove
the bottleneck with the enormous memory sizes we use. For instance,
scaling up the times from the unpin results with both optimizations
(the IB specific one too, which would need to be done for vfio), a 1T
guest would still take almost 2 seconds to pin/unpin.

If people feel strongly that we should try optimizing other ways first,
ok, but I think these are complementary approaches. I'm coming at this
problem this way because this is fundamentally a memory-intensive
operation where more bandwidth can help, and there are other kernel
paths we and others want this infrastructure for.

In any case, thanks a lot for the super quick feedback!
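As background for "padata controls where they happen": padata's multithreaded-job interface takes an overall range plus an alignment and a minimum chunk size, and divides the range among worker threads only at those boundaries. A rough sketch of the shape, where vfio_pin_map_range() is a hypothetical helper standing in for the series' per-chunk pin-and-map work, and the thread cap is illustrative:

```c
#include <linux/padata.h>

/* Worker: pin and iommu-map the pages backing [start, end) of the range.
 * vfio_pin_map_range() is hypothetical, standing in for the per-chunk work. */
static void vfio_pin_map_thread(unsigned long start, unsigned long end,
				void *arg)
{
	struct vfio_dma *dma = arg;

	vfio_pin_map_range(dma, start, end);
}

static void vfio_pin_map_parallel(struct vfio_dma *dma)
{
	struct padata_mt_job job = {
		.thread_fn   = vfio_pin_map_thread,
		.fn_arg      = dma,
		.start       = dma->vaddr,
		.size        = dma->size,
		.align       = PMD_SIZE,  /* cut points land on 2M boundaries */
		.min_chunk   = PMD_SIZE,  /* not worth threading below this */
		.max_threads = 16,        /* illustrative cap */
	};

	padata_do_multithreaded(&job);
}
```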
> > It is also not good that this inserts arbitrary cuts in the IOVA
> > address space, that will cause iommu_map() to be called with smaller
> > npages, and could result in a long term inefficiency in the iommu.
> >
> > I don't know how the kernel can combat this without prior knowledge of
> > the likely physical memory layout (eg is the VM using 1G huge pages or
> > something)..
>
> The cuts aren't arbitrary, padata controls where they happen.

Well, they are, you picked a PMD alignment if I recall.

If hugetlbfs is using PUD pages then this is the wrong alignment,
right?

I suppose it could compute the cuts differently to try to maximize
alignment at the cutpoints..

> size. If cuts in per-thread ranges are an issue, I *think* userspace
> has the same problem?

Userspace should know what it has done, if it is using hugetlbfs it
knows how big the pages are.

> > The results you got of only 1.2x improvement don't seem so
> > compelling.
>
> I know you understand, but just to be clear for everyone, that 1.2x is
> the overall improvement to qemu init from multithreaded pinning alone
> when prefaulting is done in both base and test.

Yes

> Pinning itself, the only thing being optimized, improves 8.5x in that
> experiment, bringing the time from 1.8 seconds to .2 seconds. That's a
> significant savings IMHO

And here is where I suspect we'd get similar results from folios based
on the unpin performance uplift we already saw.

As long as PUP doesn't have to COW, its work is largely proportional to
the number of struct pages it processes, so we should be expecting an
upper limit of 512x gains on the PUP alone with foliation.

This is in line with what we saw with the prior unpin work.

The other optimization that would help a lot here is to use
pin_user_pages_fast(), something like:

	if (current->mm != remote_mm)
		mmap_lock()
		pin_user_pages_remote(..)
		mmap_unlock()
	else
		pin_user_pages_fast(..)

But you can't get that gain with kernel-side parallelization, right?

(I haven't dug into if gup_fast relies on current due to IPIs or not,
maybe pin_user_pages_remote_fast can exist?)

> But, I'm skeptical that singlethreaded optimization alone will remove
> the bottleneck with the enormous memory sizes we use.

I think you can get the 1.2x at least.

> scaling up the times from the unpin results with both optimizations (the
> IB specific one too, which would need to be done for vfio),

Oh, I did the IB one already in iommufd...

> a 1T guest would still take almost 2 seconds to pin/unpin.

Single threaded? Isn't that excellent and completely dwarfed by the
populate overhead?

> If people feel strongly that we should try optimizing other ways first,
> ok, but I think these are complementary approaches. I'm coming at this
> problem this way because this is fundamentally a memory-intensive
> operation where more bandwidth can help, and there are other kernel
> paths we and others want this infrastructure for.

At least here I would like to see an apples-to-apples comparison before
we have this complexity: full user threading vs kernel auto threading.

Saying multithreaded kernel gets 8x over single threaded userspace is
nice, but sort of irrelevant because we can have multithreaded
userspace, right?

Jason
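Filling in that pseudocode against the GUP API as it stood at the time of this thread (v5.16), a hedged sketch; per Jason's parenthetical it is an open question whether a fast variant can safely walk a remote mm, so the remote path below keeps the slow, locked call:

```c
#include <linux/mm.h>

static long pin_pages_maybe_fast(struct mm_struct *mm, unsigned long vaddr,
				 int npages, unsigned int gup_flags,
				 struct page **pages)
{
	long ret;

	if (mm == current->mm) {
		/* gup_fast walks page tables locklessly, no mmap_lock taken */
		ret = pin_user_pages_fast(vaddr, npages, gup_flags, pages);
	} else {
		mmap_read_lock(mm);
		ret = pin_user_pages_remote(mm, vaddr, npages, gup_flags,
					    pages, NULL, NULL);
		mmap_read_unlock(mm);
	}
	return ret;
}
```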
On Fri, Jan 07, 2022 at 01:12:48PM -0400, Jason Gunthorpe wrote:
> > The cuts aren't arbitrary, padata controls where they happen.
>
> Well, they are, you picked a PMD alignment if I recall.
>
> If hugetlbfs is using PUD pages then this is the wrong alignment,
> right?
>
> I suppose it could compute the cuts differently to try to maximize
> alignment at the cutpoints..

Yes, this is what I was suggesting: increase the alignment.

> > size. If cuts in per-thread ranges are an issue, I *think* userspace
> > has the same problem?
>
> Userspace should know what it has done, if it is using hugetlbfs it
> knows how big the pages are.

Right, what I mean is both user and kernel threads can end up splitting
a physically contiguous range of pages, however large the page size.

> > Pinning itself, the only thing being optimized, improves 8.5x in that
> > experiment, bringing the time from 1.8 seconds to .2 seconds. That's a
> > significant savings IMHO
>
> And here is where I suspect we'd get similar results from folios based
> on the unpin performance uplift we already saw.
>
> As long as PUP doesn't have to COW, its work is largely proportional to
> the number of struct pages it processes, so we should be expecting an
> upper limit of 512x gains on the PUP alone with foliation.
>
> This is in line with what we saw with the prior unpin work.

"in line with what we saw" Not following. The unpin work had two
optimizations, I think, 4.5x and 3.5x, which together give 16x. Why is
that in line with the potential gains from pup? Overall I see what
you're saying, just curious what you meant here.

> The other optimization that would help a lot here is to use
> pin_user_pages_fast(), something like:
>
> 	if (current->mm != remote_mm)
> 		mmap_lock()
> 		pin_user_pages_remote(..)
> 		mmap_unlock()
> 	else
> 		pin_user_pages_fast(..)
>
> But you can't get that gain with kernel-side parallelization, right?
>
> (I haven't dug into if gup_fast relies on current due to IPIs or not,
> maybe pin_user_pages_remote_fast can exist?)

Yeah, not sure. I'll have a look.

> > But, I'm skeptical that singlethreaded optimization alone will remove
> > the bottleneck with the enormous memory sizes we use.
>
> I think you can get the 1.2x at least.
>
> > scaling up the times from the unpin results with both optimizations (the
> > IB specific one too, which would need to be done for vfio),
>
> Oh, I did the IB one already in iommufd...

Ahead of the curve!

> > a 1T guest would still take almost 2 seconds to pin/unpin.
>
> Single threaded?

Yes.

> Isn't that excellent

Depends on who you ask, I guess.

> and completely dwarfed by the populate overhead?

Well yes, but here we all are optimizing gup anyway :-)

> > If people feel strongly that we should try optimizing other ways first,
> > ok, but I think these are complementary approaches. I'm coming at this
> > problem this way because this is fundamentally a memory-intensive
> > operation where more bandwidth can help, and there are other kernel
> > paths we and others want this infrastructure for.
>
> At least here I would like to see an apples-to-apples comparison before
> we have this complexity: full user threading vs kernel auto threading.
>
> Saying multithreaded kernel gets 8x over single threaded userspace is
> nice, but sort of irrelevant because we can have multithreaded
> userspace, right?

One of my assumptions was that doing this in the kernel would benefit
all vfio users, avoiding duplicating the same sort of multithreading
logic across applications, including ones that didn't prefault.

Calling it irrelevant seems a bit strong. Parallelizing in either layer
has its upsides and downsides.

My assumption going into this series was that multithreading VFIO page
pinning in the kernel was a viable way forward given the positive
feedback I got from the VFIO maintainer last time I posted this, which
was admittedly a while ago, and I've since been focused on the other
parts of this series rather than what's been happening in the mm
lately. Anyway, your arguments are reasonable, so I'll go take a look
at some of these optimizations and see where I get.

Daniel
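One illustrative way to "increase the alignment" without knowing the backing page size up front is to start the per-thread cut alignment at PUD_SIZE (covering 1G hugetlb pages) and only fall back toward PMD_SIZE when the range is too small to give every thread a chunk. A sketch of such a helper, not from the series:

```c
/* Pick the coarsest cut alignment that still leaves each thread a chunk. */
static unsigned long cut_alignment(unsigned long size, int nthreads)
{
	unsigned long align = PUD_SIZE;		/* 1G on x86_64 */

	while (align > PMD_SIZE && size / align < (unsigned long)nthreads)
		align >>= 1;
	return align;
}
```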
On Mon, Jan 10, 2022 at 05:27:25PM -0500, Daniel Jordan wrote:
> > > Pinning itself, the only thing being optimized, improves 8.5x in that
> > > experiment, bringing the time from 1.8 seconds to .2 seconds. That's a
> > > significant savings IMHO
> >
> > And here is where I suspect we'd get similar results from folios based
> > on the unpin performance uplift we already saw.
> >
> > As long as PUP doesn't have to COW, its work is largely proportional to
> > the number of struct pages it processes, so we should be expecting an
> > upper limit of 512x gains on the PUP alone with foliation.
> >
> > This is in line with what we saw with the prior unpin work.
>
> "in line with what we saw" Not following. The unpin work had two
> optimizations, I think, 4.5x and 3.5x, which together give 16x. Why is
> that in line with the potential gains from pup?

It is the same basic issue, doing extra work, dirtying extra memory..

> > and completely dwarfed by the populate overhead?
>
> Well yes, but here we all are optimizing gup anyway :-)

Well, I assume because we can user thread the populate, so I'd user
thread the gup too..

> One of my assumptions was that doing this in the kernel would benefit
> all vfio users, avoiding duplicating the same sort of multithreading
> logic across applications, including ones that didn't prefault.

I don't know of other users that use such huge memory sizes that this
would matter, besides a VMM..

> My assumption going into this series was that multithreading VFIO page
> pinning in the kernel was a viable way forward given the positive
> feedback I got from the VFIO maintainer last time I posted this, which
> was admittedly a while ago, and I've since been focused on the other
> parts of this series rather than what's been happening in the mm
> lately. Anyway, your arguments are reasonable, so I'll go take a look
> at some of these optimizations and see where I get.

Well, it is not *unreasonable*, it just doesn't seem compelling to me
yet.

Especially since we are not anywhere close to the limit of single
threaded performance. Aside from GUP, the whole way we transfer the
physical pages into the iommu is just begging for optimizations, eg
Matthew's struct phyr needs to be an input and output at the iommu
layer to make this code really happy. How much time do we burn messing
around in redundant iommu layer locking because everything is page at a
time?

Jason
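For a sense of the iommu-side cost Jason points at: type1 already coalesces physically contiguous pinned pages so that iommu_map() is called once per contiguous run, which is also why cut points that split such runs cost real iommu efficiency. A simplified sketch of that coalescing, loosely modeled on vfio_iommu_type1 and using the v5.16-era iommu_map() signature:

```c
#include <linux/iommu.h>
#include <linux/mm.h>

/* Issue one iommu_map() per physically contiguous run of pinned pages. */
static int map_pinned_pages(struct iommu_domain *domain, unsigned long iova,
			    struct page **pages, unsigned long npages,
			    int prot)
{
	unsigned long i = 0, run;
	phys_addr_t phys;
	int ret;

	while (i < npages) {
		phys = page_to_phys(pages[i]);

		/* Extend the run while the next page is physically adjacent. */
		for (run = 1; i + run < npages; run++)
			if (page_to_phys(pages[i + run]) != phys + run * PAGE_SIZE)
				break;

		ret = iommu_map(domain, iova, phys, run * PAGE_SIZE, prot);
		if (ret)
			return ret;

		iova += run * PAGE_SIZE;
		i += run;
	}
	return 0;
}
```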
On Mon, Jan 10, 2022 at 08:17:51PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 10, 2022 at 05:27:25PM -0500, Daniel Jordan wrote:
>
> > > > Pinning itself, the only thing being optimized, improves 8.5x in that
> > > > experiment, bringing the time from 1.8 seconds to .2 seconds. That's a
> > > > significant savings IMHO
> > >
> > > And here is where I suspect we'd get similar results from folios based
> > > on the unpin performance uplift we already saw.
> > >
> > > As long as PUP doesn't have to COW, its work is largely proportional to
> > > the number of struct pages it processes, so we should be expecting an
> > > upper limit of 512x gains on the PUP alone with foliation.
> > >
> > > This is in line with what we saw with the prior unpin work.
> >
> > "in line with what we saw" Not following. The unpin work had two
> > optimizations, I think, 4.5x and 3.5x, which together give 16x. Why is
> > that in line with the potential gains from pup?
>
> It is the same basic issue, doing extra work, dirtying extra memory..

Ok, gotcha.

> I don't know of other users that use such huge memory sizes that this
> would matter, besides a VMM..

Right, all the VMMs out there that use vfio.

> > My assumption going into this series was that multithreading VFIO page
> > pinning in the kernel was a viable way forward given the positive
> > feedback I got from the VFIO maintainer last time I posted this, which
> > was admittedly a while ago, and I've since been focused on the other
> > parts of this series rather than what's been happening in the mm
> > lately. Anyway, your arguments are reasonable, so I'll go take a look
> > at some of these optimizations and see where I get.
>
> Well, it is not *unreasonable*, it just doesn't seem compelling to me
> yet.
>
> Especially since we are not anywhere close to the limit of single
> threaded performance. Aside from GUP, the whole way we transfer the
> physical pages into the iommu is just begging for optimizations, eg
> Matthew's struct phyr needs to be an input and output at the iommu
> layer to make this code really happy.

/nods/

There are other ways forward. As I say, I'll take a look.