Message ID: 20220106004656.126790-1-daniel.m.jordan@oracle.com (mailing list archive)
Series: padata, vfio, sched: Multithreaded VFIO page pinning
On Wed, Jan 05, 2022 at 07:46:40PM -0500, Daniel Jordan wrote:
> Get ready to parallelize. In particular, pinning can fail, so make jobs
> undo-able.
>
>  5 vfio/type1: Pass mm to vfio_pin_pages_remote()
>  6 vfio/type1: Refactor dma map removal
>  7 vfio/type1: Parallelize vfio_pin_map_dma()
>  8 vfio/type1: Cache locked_vm to ease mmap_lock contention

In some ways this kind of seems like overkill, why not just have
userspace break the guest VA into chunks and call map in parallel?
Similar to how it already does the prealloc in parallel?

This is a simpler kernel job of optimizing locking to allow
concurrency.

It is also not good that this inserts arbitrary cuts in the IOVA
address space, that will cause iommu_map() to be called with smaller
npages, and could result in a long term inefficiency in the iommu.

I don't know how the kernel can combat this without prior knowledge of
the likely physical memory layout (eg is the VM using 1G huge pages or
something)..

Personally I'd rather see the results from Matthew's work to allow GUP
to work on folios efficiently before reaching to this extreme.

The results you got of only 1.2x improvement don't seem so compelling.
Based on the unpin work I fully expect that folio-optimized GUP will do
much better than that single-threaded..

Jason
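A minimal userspace sketch of the chunk-and-map-in-parallel approach Jason suggests, against the real VFIO type1 uAPI. It assumes a container fd with the type1 IOMMU already set; the thread count and the 4k page-size mask are illustrative only, and, as the rest of the thread discusses, these concurrent ioctls would today still serialize on vfio's internal iommu->lock while pinning:

```c
#include <pthread.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

struct map_chunk {
	int fd;			/* VFIO container fd, type1 IOMMU set */
	__u64 vaddr, iova, size;
};

static void *map_one_chunk(void *arg)
{
	struct map_chunk *c = arg;
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = c->vaddr,
		.iova  = c->iova,
		.size  = c->size,
	};

	/* Each thread pins and maps its own slice of guest memory. */
	return (void *)(long)ioctl(c->fd, VFIO_IOMMU_MAP_DMA, &map);
}

/* Split [vaddr, vaddr + len) into nr page-aligned pieces, one per thread. */
static void map_in_parallel(int fd, __u64 vaddr, __u64 iova, __u64 len, int nr)
{
	pthread_t tid[nr];
	struct map_chunk c[nr];
	__u64 piece = (len / nr) & ~0xfffULL;	/* assumes 4k pages */
	int i;

	for (i = 0; i < nr; i++) {
		__u64 off = (__u64)i * piece;

		c[i] = (struct map_chunk) {
			.fd = fd, .vaddr = vaddr + off, .iova = iova + off,
			.size = (i == nr - 1) ? len - off : piece,
		};
		pthread_create(&tid[i], NULL, map_one_chunk, &c[i]);
	}
	for (i = 0; i < nr; i++)
		pthread_join(tid[i], NULL);
}
```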
On Wed, Jan 05, 2022 at 09:13:06PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 05, 2022 at 07:46:40PM -0500, Daniel Jordan wrote:
>
> > Get ready to parallelize. In particular, pinning can fail, so make jobs
> > undo-able.
> >
> >  5 vfio/type1: Pass mm to vfio_pin_pages_remote()
> >  6 vfio/type1: Refactor dma map removal
> >  7 vfio/type1: Parallelize vfio_pin_map_dma()
> >  8 vfio/type1: Cache locked_vm to ease mmap_lock contention
>
> In some ways this kind of seems like overkill, why not just have
> userspace break the guest VA into chunks and call map in parallel?
> Similar to how it already does the prealloc in parallel?
>
> This is a simpler kernel job of optimizing locking to allow
> concurrency.

I didn't consider doing it that way, and am not seeing a fundamental
reason it wouldn't work right off the bat. At a glance, I think pinning
would need to be moved out from under vfio's iommu->lock. I haven't
checked to see how hard it would be, but that plus the locking
optimizations might end up being about the same amount of complexity
as the multithreading in the vfio driver, and doing this in the kernel
would speed things up for all vfio users without having to duplicate
the parallelism in userspace.

But yes, agreed, the lock optimization could definitely be split out
and used in a different approach.

> It is also not good that this inserts arbitrary cuts in the IOVA
> address space, that will cause iommu_map() to be called with smaller
> npages, and could result in a long term inefficiency in the iommu.
>
> I don't know how the kernel can combat this without prior knowledge of
> the likely physical memory layout (eg is the VM using 1G huge pages or
> something)..

The cuts aren't arbitrary, padata controls where they happen. This is
optimizing for big memory ranges, so why isn't it enough that padata
breaks up the work along a big enough page-aligned chunk? The vfio
driver does one iommu mapping per physically contiguous range, and I
don't think those will be large enough to be affected using such a
chunk size. If cuts in per-thread ranges are an issue, I *think*
userspace has the same problem?

> Personally I'd rather see the results from Matthew's work to allow GUP
> to work on folios efficiently before reaching to this extreme.
>
> The results you got of only 1.2x improvement don't seem so
> compelling.

I know you understand, but just to be clear for everyone, that 1.2x is
the overall improvement to qemu init from multithreaded pinning alone
when prefaulting is done in both base and test.

Pinning itself, the only thing being optimized, improves 8.5x in that
experiment, bringing the time from 1.8 seconds to .2 seconds. That's a
significant savings IMHO

> Based on the unpin work I fully expect that folio-optimized GUP will
> do much better than that single-threaded..

Yeah, I'm curious to see how folio will do as well. And there are some
very nice, efficiently gained speedups in the unpin work. Changes like
that benefit all gup users, too, as you've pointed out before.

But, I'm skeptical that singlethreaded optimization alone will remove
the bottleneck with the enormous memory sizes we use. For instance,
scaling up the times from the unpin results with both optimizations
(the IB specific one too, which would need to be done for vfio), a 1T
guest would still take almost 2 seconds to pin/unpin.

If people feel strongly that we should try optimizing other ways first,
ok, but I think these are complementary approaches. I'm coming at this
problem this way because this is fundamentally a memory-intensive
operation where more bandwidth can help, and there are other kernel
paths we and others want this infrastructure for.

In any case, thanks a lot for the super quick feedback!
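As background for "padata controls where they happen": padata's multithreaded-job interface takes an overall range plus an alignment and a minimum chunk size, and divides the range among worker threads only at those boundaries. A rough sketch of the shape, where vfio_pin_map_range() is a hypothetical helper standing in for the series' per-chunk pin-and-map work, and the thread cap is illustrative:

```c
#include <linux/padata.h>

/* Worker: pin and iommu-map the pages backing [start, end) of the range.
 * vfio_pin_map_range() is hypothetical, standing in for the per-chunk work. */
static void vfio_pin_map_thread(unsigned long start, unsigned long end,
				void *arg)
{
	struct vfio_dma *dma = arg;

	vfio_pin_map_range(dma, start, end);
}

static void vfio_pin_map_parallel(struct vfio_dma *dma)
{
	struct padata_mt_job job = {
		.thread_fn   = vfio_pin_map_thread,
		.fn_arg      = dma,
		.start       = dma->vaddr,
		.size        = dma->size,
		.align       = PMD_SIZE,  /* cut points land on 2M boundaries */
		.min_chunk   = PMD_SIZE,  /* not worth threading below this */
		.max_threads = 16,        /* illustrative cap */
	};

	padata_do_multithreaded(&job);
}
```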
> > It is also not good that this inserts arbitrary cuts in the IOVA
> > address space, that will cause iommu_map() to be called with smaller
> > npages, and could result in a long term inefficiency in the iommu.
> >
> > I don't know how the kernel can combat this without prior knowledge of
> > the likely physical memory layout (eg is the VM using 1G huge pages or
> > something)..
>
> The cuts aren't arbitrary, padata controls where they happen.

Well, they are, you picked a PMD alignment if I recall.

If hugetlbfs is using PUD pages then this is the wrong alignment,
right?

I suppose it could compute the cuts differently to try to maximize
alignment at the cutpoints..

> size. If cuts in per-thread ranges are an issue, I *think* userspace
> has the same problem?

Userspace should know what it has done, if it is using hugetlbfs it
knows how big the pages are.

> > The results you got of only 1.2x improvement don't seem so
> > compelling.
>
> I know you understand, but just to be clear for everyone, that 1.2x is
> the overall improvement to qemu init from multithreaded pinning alone
> when prefaulting is done in both base and test.

Yes

> Pinning itself, the only thing being optimized, improves 8.5x in that
> experiment, bringing the time from 1.8 seconds to .2 seconds. That's a
> significant savings IMHO

And here is where I suspect we'd get similar results from folios based
on the unpin performance uplift we already saw.

As long as PUP doesn't have to COW, its work is largely proportional to
the number of struct pages it processes, so we should be expecting an
upper limit of 512x gains on the PUP alone with foliation.

This is in line with what we saw with the prior unpin work.

The other optimization that would help a lot here is to use
pin_user_pages_fast(), something like:

	if (current->mm != remote_mm)
		mmap_lock()
		pin_user_pages_remote(..)
		mmap_unlock()
	else
		pin_user_pages_fast(..)

But you can't get that gain with kernel-side parallelization, right?

(I haven't dug into if gup_fast relies on current due to IPIs or not,
maybe pin_user_pages_remote_fast can exist?)

> But, I'm skeptical that singlethreaded optimization alone will remove
> the bottleneck with the enormous memory sizes we use.

I think you can get the 1.2x at least.

> scaling up the times from the unpin results with both optimizations (the
> IB specific one too, which would need to be done for vfio),

Oh, I did the IB one already in iommufd...

> a 1T guest would still take almost 2 seconds to pin/unpin.

Single threaded? Isn't that excellent and completely dwarfed by the
populate overhead?

> If people feel strongly that we should try optimizing other ways first,
> ok, but I think these are complementary approaches. I'm coming at this
> problem this way because this is fundamentally a memory-intensive
> operation where more bandwidth can help, and there are other kernel
> paths we and others want this infrastructure for.

At least here I would like to see an apples-to-apples comparison before
we have this complexity: full user threading vs kernel auto threading.

Saying multithreaded kernel gets 8x over single threaded userspace is
nice, but sort of irrelevant because we can have multithreaded
userspace, right?

Jason
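Filling in that pseudocode against the GUP API as it stood at the time of this thread (v5.16), a hedged sketch; per Jason's parenthetical it is an open question whether a fast variant can safely walk a remote mm, so the remote path below keeps the slow, locked call:

```c
#include <linux/mm.h>

static long pin_pages_maybe_fast(struct mm_struct *mm, unsigned long vaddr,
				 int npages, unsigned int gup_flags,
				 struct page **pages)
{
	long ret;

	if (mm == current->mm) {
		/* gup_fast walks page tables locklessly, no mmap_lock taken */
		ret = pin_user_pages_fast(vaddr, npages, gup_flags, pages);
	} else {
		mmap_read_lock(mm);
		ret = pin_user_pages_remote(mm, vaddr, npages, gup_flags,
					    pages, NULL, NULL);
		mmap_read_unlock(mm);
	}
	return ret;
}
```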
On Fri, Jan 07, 2022 at 01:12:48PM -0400, Jason Gunthorpe wrote:
> > The cuts aren't arbitrary, padata controls where they happen.
>
> Well, they are, you picked a PMD alignment if I recall.
>
> If hugetlbfs is using PUD pages then this is the wrong alignment,
> right?
>
> I suppose it could compute the cuts differently to try to maximize
> alignment at the cutpoints..

Yes, this is what I was suggesting: increase the alignment.

> > size. If cuts in per-thread ranges are an issue, I *think* userspace
> > has the same problem?
>
> Userspace should know what it has done, if it is using hugetlbfs it
> knows how big the pages are.

Right, what I mean is both user and kernel threads can end up splitting
a physically contiguous range of pages, however large the page size.

> > Pinning itself, the only thing being optimized, improves 8.5x in that
> > experiment, bringing the time from 1.8 seconds to .2 seconds. That's a
> > significant savings IMHO
>
> And here is where I suspect we'd get similar results from folios based
> on the unpin performance uplift we already saw.
>
> As long as PUP doesn't have to COW, its work is largely proportional to
> the number of struct pages it processes, so we should be expecting an
> upper limit of 512x gains on the PUP alone with foliation.
>
> This is in line with what we saw with the prior unpin work.

"in line with what we saw" Not following. The unpin work had two
optimizations, I think, 4.5x and 3.5x, which together give 16x. Why is
that in line with the potential gains from pup? Overall I see what
you're saying, just curious what you meant here.

> The other optimization that would help a lot here is to use
> pin_user_pages_fast(), something like:
>
> 	if (current->mm != remote_mm)
> 		mmap_lock()
> 		pin_user_pages_remote(..)
> 		mmap_unlock()
> 	else
> 		pin_user_pages_fast(..)
>
> But you can't get that gain with kernel-side parallelization, right?
>
> (I haven't dug into if gup_fast relies on current due to IPIs or not,
> maybe pin_user_pages_remote_fast can exist?)

Yeah, not sure. I'll have a look.

> > But, I'm skeptical that singlethreaded optimization alone will remove
> > the bottleneck with the enormous memory sizes we use.
>
> I think you can get the 1.2x at least.
>
> > scaling up the times from the unpin results with both optimizations (the
> > IB specific one too, which would need to be done for vfio),
>
> Oh, I did the IB one already in iommufd...

Ahead of the curve!

> > a 1T guest would still take almost 2 seconds to pin/unpin.
>
> Single threaded?

Yes.

> Isn't that excellent

Depends on who you ask, I guess.

> and completely dwarfed by the populate overhead?

Well yes, but here we all are optimizing gup anyway :-)

> > If people feel strongly that we should try optimizing other ways first,
> > ok, but I think these are complementary approaches. I'm coming at this
> > problem this way because this is fundamentally a memory-intensive
> > operation where more bandwidth can help, and there are other kernel
> > paths we and others want this infrastructure for.
>
> At least here I would like to see an apples-to-apples comparison before
> we have this complexity: full user threading vs kernel auto threading.
>
> Saying multithreaded kernel gets 8x over single threaded userspace is
> nice, but sort of irrelevant because we can have multithreaded
> userspace, right?

One of my assumptions was that doing this in the kernel would benefit
all vfio users, avoiding duplicating the same sort of multithreading
logic across applications, including ones that didn't prefault.

Calling it irrelevant seems a bit strong. Parallelizing in either layer
has its upsides and downsides.

My assumption going into this series was that multithreading VFIO page
pinning in the kernel was a viable way forward given the positive
feedback I got from the VFIO maintainer last time I posted this, which
was admittedly a while ago, and I've since been focused on the other
parts of this series rather than what's been happening in the mm
lately. Anyway, your arguments are reasonable, so I'll go take a look
at some of these optimizations and see where I get.

Daniel
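One illustrative way to "increase the alignment" without knowing the backing page size up front is to start the per-thread cut alignment at PUD_SIZE (covering 1G hugetlb pages) and only fall back toward PMD_SIZE when the range is too small to give every thread a chunk. A sketch of such a helper, not from the series:

```c
/* Pick the coarsest cut alignment that still leaves each thread a chunk. */
static unsigned long cut_alignment(unsigned long size, int nthreads)
{
	unsigned long align = PUD_SIZE;		/* 1G on x86_64 */

	while (align > PMD_SIZE && size / align < (unsigned long)nthreads)
		align >>= 1;
	return align;
}
```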
On Mon, Jan 10, 2022 at 05:27:25PM -0500, Daniel Jordan wrote:
> > > Pinning itself, the only thing being optimized, improves 8.5x in that
> > > experiment, bringing the time from 1.8 seconds to .2 seconds. That's a
> > > significant savings IMHO
> >
> > And here is where I suspect we'd get similar results from folios based
> > on the unpin performance uplift we already saw.
> >
> > As long as PUP doesn't have to COW, its work is largely proportional to
> > the number of struct pages it processes, so we should be expecting an
> > upper limit of 512x gains on the PUP alone with foliation.
> >
> > This is in line with what we saw with the prior unpin work.
>
> "in line with what we saw" Not following. The unpin work had two
> optimizations, I think, 4.5x and 3.5x, which together give 16x. Why is
> that in line with the potential gains from pup?

It is the same basic issue, doing extra work, dirtying extra memory..

> > and completely dwarfed by the populate overhead?
>
> Well yes, but here we all are optimizing gup anyway :-)

Well, I assume because we can user thread the populate, so I'd user
thread the gup too..

> One of my assumptions was that doing this in the kernel would benefit
> all vfio users, avoiding duplicating the same sort of multithreading
> logic across applications, including ones that didn't prefault.

I don't know of other users that use such huge memory sizes that this
would matter, besides a VMM..

> My assumption going into this series was that multithreading VFIO page
> pinning in the kernel was a viable way forward given the positive
> feedback I got from the VFIO maintainer last time I posted this, which
> was admittedly a while ago, and I've since been focused on the other
> parts of this series rather than what's been happening in the mm
> lately. Anyway, your arguments are reasonable, so I'll go take a look
> at some of these optimizations and see where I get.

Well, it is not *unreasonable*, it just doesn't seem compelling to me
yet.

Especially since we are not anywhere close to the limit of single
threaded performance. Aside from GUP, the whole way we transfer the
physical pages into the iommu is just begging for optimizations, eg
Matthew's struct phyr needs to be an input and output at the iommu
layer to make this code really happy. How much time do we burn messing
around in redundant iommu layer locking because everything is page at a
time?

Jason
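For a sense of the iommu-side cost Jason points at: type1 already coalesces physically contiguous pinned pages so that iommu_map() is called once per contiguous run, which is also why cut points that split such runs cost real iommu efficiency. A simplified sketch of that coalescing, loosely modeled on vfio_iommu_type1 and using the v5.16-era iommu_map() signature:

```c
#include <linux/iommu.h>
#include <linux/mm.h>

/* Issue one iommu_map() per physically contiguous run of pinned pages. */
static int map_pinned_pages(struct iommu_domain *domain, unsigned long iova,
			    struct page **pages, unsigned long npages,
			    int prot)
{
	unsigned long i = 0, run;
	phys_addr_t phys;
	int ret;

	while (i < npages) {
		phys = page_to_phys(pages[i]);

		/* Extend the run while the next page is physically adjacent. */
		for (run = 1; i + run < npages; run++)
			if (page_to_phys(pages[i + run]) != phys + run * PAGE_SIZE)
				break;

		ret = iommu_map(domain, iova, phys, run * PAGE_SIZE, prot);
		if (ret)
			return ret;

		iova += run * PAGE_SIZE;
		i += run;
	}
	return 0;
}
```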
On Mon, Jan 10, 2022 at 08:17:51PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 10, 2022 at 05:27:25PM -0500, Daniel Jordan wrote:
>
> > > > Pinning itself, the only thing being optimized, improves 8.5x in that
> > > > experiment, bringing the time from 1.8 seconds to .2 seconds. That's a
> > > > significant savings IMHO
> > >
> > > And here is where I suspect we'd get similar results from folios based
> > > on the unpin performance uplift we already saw.
> > >
> > > As long as PUP doesn't have to COW, its work is largely proportional to
> > > the number of struct pages it processes, so we should be expecting an
> > > upper limit of 512x gains on the PUP alone with foliation.
> > >
> > > This is in line with what we saw with the prior unpin work.
> >
> > "in line with what we saw" Not following. The unpin work had two
> > optimizations, I think, 4.5x and 3.5x, which together give 16x. Why is
> > that in line with the potential gains from pup?
>
> It is the same basic issue, doing extra work, dirtying extra memory..

Ok, gotcha.

> I don't know of other users that use such huge memory sizes that this
> would matter, besides a VMM..

Right, all the VMMs out there that use vfio.

> > My assumption going into this series was that multithreading VFIO page
> > pinning in the kernel was a viable way forward given the positive
> > feedback I got from the VFIO maintainer last time I posted this, which
> > was admittedly a while ago, and I've since been focused on the other
> > parts of this series rather than what's been happening in the mm
> > lately. Anyway, your arguments are reasonable, so I'll go take a look
> > at some of these optimizations and see where I get.
>
> Well, it is not *unreasonable*, it just doesn't seem compelling to me
> yet.
>
> Especially since we are not anywhere close to the limit of single
> threaded performance. Aside from GUP, the whole way we transfer the
> physical pages into the iommu is just begging for optimizations, eg
> Matthew's struct phyr needs to be an input and output at the iommu
> layer to make this code really happy.

/nods/

There are other ways forward. As I say, I'll take a look.