Message ID: 20230412213510.1220557-1-amoorthy@google.com
Series: Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
Hi, Anish,

On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote:
> KVM's demand paging self test is extended to demonstrate the performance
> benefits of using the two new capabilities to bypass the userfaultfd
> wait queue. The performance samples below (rates in thousands of
> pages/s, n = 5), were generated using [2] on an x86 machine with 256
> cores.
>
> vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
>   1     150     340
>   2     191     477
>   4     210     809
>   8     155    1239
>  16     130    1595
>  32     108    2299
>  64      86    3482
> 128      62    4134
> 256      36    4012

The numbers look very promising. Though..

>
> [1] https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
> [2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
> A quick rundown of the new flags (also detailed in later commits)
> -a registers all of guest memory to a single uffd.

... this is the worst case scenario. I'd say it's slightly unfair to
compare by first introducing a bottleneck and then comparing against it. :)

Jokes aside: I think it would make more sense for such a performance
solution to be measured on real systems showing real benefits, because so
far it's still not convincing if the only data comes from the selftest,
especially with only one uffd.

I don't remember whether I discussed this with James before, but..

I know that having multiple uffds in production also means scattered guest
memory and scattered VMAs all over the place. However, splitting the large
guest memory into at least a few (or even tens of) VMAs may still be worth
trying. Do you think that would already solve some of the contention on
userfaultfd, either on the queue or elsewhere?

With a bunch of VMAs and userfaultfds (paired with uffd fault handler
threads and totally separate uffd queues), I'd expect that, to some extent,
other things would already become the bottleneck, e.g. the network
bandwidth, without teaching each vCPU thread to report uffd faults itself.

This is pure speculation on my part, though; I think that's also why it
would be great if such a solution could be tested on a more or less real
migration scenario to show its real benefits.

Thanks,
On Wed, Apr 19, 2023 at 12:56 PM Peter Xu <peterx@redhat.com> wrote: > > Hi, Anish, > > On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote: > > KVM's demand paging self test is extended to demonstrate the performance > > benefits of using the two new capabilities to bypass the userfaultfd > > wait queue. The performance samples below (rates in thousands of > > pages/s, n = 5), were generated using [2] on an x86 machine with 256 > > cores. > > > > vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps) > > 1 150 340 > > 2 191 477 > > 4 210 809 > > 8 155 1239 > > 16 130 1595 > > 32 108 2299 > > 64 86 3482 > > 128 62 4134 > > 256 36 4012 > > The number looks very promising. Though.. > > > > > [1] https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/ > > [2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w] > > A quick rundown of the new flags (also detailed in later commits) > > -a registers all of guest memory to a single uffd. > > ... this is the worst case scenario. I'd say it's slightly unfair to > compare by first introducing a bottleneck then compare with it. :) > > Jokes aside: I'd think it'll make more sense if such a performance solution > will be measured on real systems showing real benefits, because so far it's > still not convincing enough if it's only with the test especially with only > one uffd. > > I don't remember whether I used to discuss this with James before, but.. > > I know that having multiple uffds in productions also means scattered guest > memory and scattered VMAs all over the place. However split the guest > large mem into at least a few (or even tens of) VMAs may still be something > worth trying? Do you think that'll already solve some of the contentions > on userfaultfd, either on the queue or else? We considered sharding into several UFFDs. I do think it helps, but also I think there are two main problems with it: - One is, I think there's a limit to how much you'd want to do that. E.g. splitting guest memory in 1/2, or in 1/10, could be reasonable, but 1/100 or 1/1000 might become ridiculous in terms of the "scattering" of VMAs and so on like you mentioned. Especially for very large VMs (e.g. consider Google offers VMs with ~11T of RAM [1]) I'm not sure splitting just "slightly" is enough to get good performance. - Another is, sharding UFFDs sort of assumes accesses are randomly distributed across the guest physical address space. I'm not sure this is guaranteed for all possible VMs / customer workloads. In other words, even if we shard across several UFFDs, we may end up with a small number of them being "hot". A benefit to Anish's series is that it solves the problem more fundamentally, and allows demand paging with no "global" locking. So, it will scale better regardless of VM size, or access pattern. [1]: https://cloud.google.com/compute/docs/memory-optimized-machines > > With a bunch of VMAs and userfaultfds (paired with uffd fault handler > threads, totally separate uffd queues), I'd expect to some extend other > things can pop up already, e.g., the network bandwidth, without teaching > each vcpu thread to report uffd faults themselves. > > These are my pure imaginations though, I think that's also why it'll be > great if such a solution can be tested more or less on a real migration > scenario to show its real benefits. I wonder, is there an existing open source QEMU/KVM based live migration stress test? 
I think we could share numbers from some of our internal benchmarks, or at the very least give relative numbers (e.g. +50% increase), but since a lot of the software stack is proprietary (e.g. we don't use QEMU), it may not be that useful or reproducible for folks. > > Thanks, > > -- > Peter Xu >
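For concreteness, here is a minimal userspace sketch of the uffd "sharding" approach discussed in the two messages above: carve guest memory into slices and register each slice with its own userfaultfd (each with its own queue and reader thread). It assumes shmem-backed guest memory registered in MINOR mode, matching the selftest configuration; the helper names and the equal-slice split are illustrative, not code from this series.

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/userfaultfd.h>

    /* Create one userfaultfd covering [base, base + len) in MINOR mode. */
    static int register_shard(void *base, size_t len)
    {
            struct uffdio_api api = {
                    .api = UFFD_API,
                    /* Assumes a kernel with shmem minor-fault support. */
                    .features = UFFD_FEATURE_MINOR_SHMEM,
            };
            struct uffdio_register reg = {
                    .range = { .start = (unsigned long)base, .len = len },
                    .mode = UFFDIO_REGISTER_MODE_MINOR,
            };
            int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

            if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
                ioctl(uffd, UFFDIO_REGISTER, &reg))
                    return -1;

            return uffd; /* each shard gets its own queue and reader thread(s) */
    }

    /* Split guest memory into n_shards equal slices, one uffd per slice. */
    static int shard_guest_memory(char *mem, size_t size, int n_shards, int *uffds)
    {
            size_t slice = size / n_shards; /* assumes size divides evenly */

            for (int i = 0; i < n_shards; i++) {
                    uffds[i] = register_shard(mem + (size_t)i * slice, slice);
                    if (uffds[i] < 0)
                            return -1;
            }
            return 0;
    }

As the discussion notes, this only scales along the memory axis: if many vCPUs fault inside the same slice, that shard's single wait queue is still the bottleneck.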
On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote: > On Wed, Apr 19, 2023 at 12:56 PM Peter Xu <peterx@redhat.com> wrote: > > > > Hi, Anish, > > > > On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote: > > > KVM's demand paging self test is extended to demonstrate the performance > > > benefits of using the two new capabilities to bypass the userfaultfd > > > wait queue. The performance samples below (rates in thousands of > > > pages/s, n = 5), were generated using [2] on an x86 machine with 256 > > > cores. > > > > > > vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps) > > > 1 150 340 > > > 2 191 477 > > > 4 210 809 > > > 8 155 1239 > > > 16 130 1595 > > > 32 108 2299 > > > 64 86 3482 > > > 128 62 4134 > > > 256 36 4012 > > > > The number looks very promising. Though.. > > > > > > > > [1] https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/ > > > [2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w] > > > A quick rundown of the new flags (also detailed in later commits) > > > -a registers all of guest memory to a single uffd. > > > > ... this is the worst case scenario. I'd say it's slightly unfair to > > compare by first introducing a bottleneck then compare with it. :) > > > > Jokes aside: I'd think it'll make more sense if such a performance solution > > will be measured on real systems showing real benefits, because so far it's > > still not convincing enough if it's only with the test especially with only > > one uffd. > > > > I don't remember whether I used to discuss this with James before, but.. > > > > I know that having multiple uffds in productions also means scattered guest > > memory and scattered VMAs all over the place. However split the guest > > large mem into at least a few (or even tens of) VMAs may still be something > > worth trying? Do you think that'll already solve some of the contentions > > on userfaultfd, either on the queue or else? > > We considered sharding into several UFFDs. I do think it helps, but > also I think there are two main problems with it: > > - One is, I think there's a limit to how much you'd want to do that. > E.g. splitting guest memory in 1/2, or in 1/10, could be reasonable, > but 1/100 or 1/1000 might become ridiculous in terms of the > "scattering" of VMAs and so on like you mentioned. Especially for very > large VMs (e.g. consider Google offers VMs with ~11T of RAM [1]) I'm > not sure splitting just "slightly" is enough to get good performance. > > - Another is, sharding UFFDs sort of assumes accesses are randomly > distributed across the guest physical address space. I'm not sure this > is guaranteed for all possible VMs / customer workloads. In other > words, even if we shard across several UFFDs, we may end up with a > small number of them being "hot". I never tried to monitor this, but I had a feeling that it's actually harder to maintain physical continuity of pages being used and accessed at least on Linux. The more possible case to me is the system pages goes very scattered easily after boot a few hours unless special care is taken, e.g., on using hugetlb pages or reservations for specific purpose. I also think that's normally optimal to the system, e.g., numa balancing will help nodes / cpus using local memory which helps spread the memory consumptions, hence each core can access different pages that is local to it. But I agree I can never justify that it'll always work. 
If you or Anish could provide some data points to further support this issue that would be very interesting and helpful, IMHO, not required though. > > A benefit to Anish's series is that it solves the problem more > fundamentally, and allows demand paging with no "global" locking. So, > it will scale better regardless of VM size, or access pattern. > > [1]: https://cloud.google.com/compute/docs/memory-optimized-machines > > > > > With a bunch of VMAs and userfaultfds (paired with uffd fault handler > > threads, totally separate uffd queues), I'd expect to some extend other > > things can pop up already, e.g., the network bandwidth, without teaching > > each vcpu thread to report uffd faults themselves. > > > > These are my pure imaginations though, I think that's also why it'll be > > great if such a solution can be tested more or less on a real migration > > scenario to show its real benefits. > > I wonder, is there an existing open source QEMU/KVM based live > migration stress test? I am not aware of any. > > I think we could share numbers from some of our internal benchmarks, > or at the very least give relative numbers (e.g. +50% increase), but > since a lot of the software stack is proprietary (e.g. we don't use > QEMU), it may not be that useful or reproducible for folks. Those numbers can still be helpful. I was not asking for reproduceability, but some test to better justify this feature. IMHO the demand paging test (at least the current one) may or may not be a good test to show the value of this specific feature. When with 1-uffd, it obviously bottlenecks on the single uffd, so it doesn't explain whether scaling num of uffds could help. But it's not friendly to multi-uffd either, because it'll be the other extreme case where all mem accesses are spread the cores, so probably the feature won't show a result proving its worthwhile. From another aspect, if a kernel feature is proposed it'll be always nice (and sometimes mandatory) to have at least one user of it (besides the unit tests). I think that should also include proprietary softwares. It doesn't need to be used already in production, but some POC would definitely be very helpful to move a feature forward towards community acceptance. Thanks,
Hi, Anish, [Copied Nadav Amit for the last few paragraphs on userfaultfd, because Nadav worked on a few userfaultfd performance problems; so maybe he'll also have some ideas around] On Wed, Apr 19, 2023 at 02:53:46PM -0700, Anish Moorthy wrote: > On Wed, Apr 19, 2023 at 2:05 PM Peter Xu <peterx@redhat.com> wrote: > > > > On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote: > > > We considered sharding into several UFFDs. I do think it helps, but > > > also I think there are two main problems with it... > > > > But I agree I can never justify that it'll always work. If you or Anish > > could provide some data points to further support this issue that would be > > very interesting and helpful, IMHO, not required though. > > Axel covered the reasons for not pursuing the sharding approach nicely > (thanks!). It's not something we ever prototyped, so I don't have any > further numbers there. > > On Wed, Apr 19, 2023 at 2:05 PM Peter Xu <peterx@redhat.com> wrote: > > > > On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote: > > > > > I think we could share numbers from some of our internal benchmarks, > > > or at the very least give relative numbers (e.g. +50% increase), but > > > since a lot of the software stack is proprietary (e.g. we don't use > > > QEMU), it may not be that useful or reproducible for folks. > > > > Those numbers can still be helpful. I was not asking for reproduceability, > > but some test to better justify this feature. > > I do have some internal benchmarking numbers on this front, although > it's been a while since I've collected them so the details might be a > little sparse. Thanks for sharing these data points. I don't understand most of them yet, but I think it's better than the unit test numbers provided. > > I've confirmed performance gains with "scalable userfaultfd" using two > workloads besides the self-test: > > The first, cycler, spins up a VM and launches a binary which (a) maps > a large amount of memory and then (b) loops over it issuing writes as > fast as possible. It's not a very realistic guest but it at least > involves an actual migrating VM, and we often use it to > stress/performance test migration changes. The write rate which cycler > achieves during userfaultfd-based postcopy (without scalable uffd > enabled) is about 25% of what it achieves under KVM Demand Paging (the > internal KVM feature GCE currently uses for postcopy). With > userfaultfd-based postcopy and scalable uffd enabled that rate jumps > nearly 3x, so about 75% of what KVM Demand Paging achieves. The > attached "Cycler.png" illustrates this effect (though due to some > other details, faster demand paging actually makes the migrations > worse: the point is that scalable uffd performs more similarly to kvm > demand paging :) Yes I don't understand why vanilla uffd is so different, neither am I sure what does the graph mean, though. :) Is the first drop caused by starting migration/precopy? Is the 2nd (huge) drop (mostly to zero) caused by frequently accessing new pages during postcopy? Is the workload busy writes single thread, or NCPU threads? Is what you mentioned on the 25%-75% comparison can be shown on the graph? Or maybe that's part of the period where all three are very close to 0? > > The second is the redis memtier benchmark [1], a more realistic > workflow where we migrate a VM running the redis server. With scalable > userfaultfd, the client VM observes significantly higher transaction > rates during uffd-based postcopy (see "Memtier.png"). 
I can pull the > exact numbers if needed, but just from eyeballing the graph you can > see that the improvement is something like 5-10x (at least) for > several seconds. There's still a noticeable gap with KVM demand paging > based-postcopy, but the improvement is definitely significant. > > [1] https://github.com/RedisLabs/memtier_benchmark Does the "5-10x" difference rely in the "15s valley" you pointed out in the graph? Is it reproduceable that the blue line always has a totally different "valley" comparing to yellow/red? Personally I still really want to know what happens if we just split the vma and see how it goes with a standard workloads, but maybe I'm asking too much so don't yet worry. The solution here proposed still makes sense to me and I agree if this can be done well it can resolve the bottleneck over 1-userfaultfd. But after I read some of the patches I'm not sure whether it's possible it can be implemented in a complete way. You mentioned here and there on that things can be missing probably due to random places accessing guest pages all over kvm. Relying sololy on -EFAULT so far doesn't look very reliable to me, but it could be because I didn't yet really understand how it works. Is above a concern to the current solution? Have any of you tried to investigate the other approach to scale userfaultfd? It seems userfaultfd does one thing great which is to have the trapping at an unified place (when the page fault happens), hence it doesn't need to worry on random codes splat over KVM module read/write a guest page. The question is whether it'll be easy to do so. Split vma definitely is still a way to scale userfaultfd, but probably not in a good enough way because it's scaling in memory axis, not cores. If tens of cores accessing a small region that falls into the same VMA, then it stops working. However maybe it can be scaled in other form? So far my understanding is "read" upon uffd for messages is still not a problem - the read can be done in chunk, and each message will be converted into a request to be send later. If the real problem relies in a bunch of threads queuing, is it possible that we can provide just more queues for the events? The readers will just need to go over all the queues. Way to decide "which thread uses which queue" can be another problem, what comes ups quickly to me is a "hash(tid) % n_queues" but maybe it can be better. Each vcpu thread will have different tids, then they can hopefully scale on the queues. There's at least one issue that I know with such an idea, that after we have >1 uffd queues it means the message order will be uncertain. It may matter for some uffd users (e.g. cooperative userfaultfd, see UFFD_FEATURE_FORK|REMOVE|etc.) because I believe order of messages matter for them (mostly CRIU). But I think that's not a blocker either because we can forbid those features with multi queues. That's a wild idea that I'm just thinking about, which I have totally no idea whether it'll work or not. It's more or less of a generic question on "whether there's chance to scale on uffd side just in case it might be a cleaner approach", when above concern is a real concern. Thanks,
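To make the multi-queue idea above a bit more concrete, here is a tiny userspace model of the "hash(tid) % n_queues" selection. It is purely illustrative and not fs/userfaultfd.c code: the structure, hash, and queue count are assumptions, and the fd reader(s) would still have to drain all of the queues.

    #include <pthread.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define N_QUEUES 16

    /* One of several hypothetical fault queues; each has its own lock, so
     * faulting threads that hash to different queues never contend. */
    struct fault_queue {
            pthread_mutex_t lock;
            /* ... list of pending fault messages ... */
    };

    static struct fault_queue queues[N_QUEUES] = {
            /* GCC/Clang range initializer, used here for brevity. */
            [0 ... N_QUEUES - 1] = { .lock = PTHREAD_MUTEX_INITIALIZER },
    };

    /* hash(tid) % n_queues: vCPU threads have distinct tids, so they should
     * spread across the queues instead of piling onto one wait queue. */
    static struct fault_queue *pick_queue(void)
    {
            uint32_t tid = (uint32_t)syscall(SYS_gettid);

            return &queues[(tid * 2654435761u) % N_QUEUES];
    }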
My reply to Peter earlier bounced from the mailing list due to the attached images (sorry!). I've copied it below to get a record on-list. Just for completeness, the message ID of the bounced mail was <CAF7b7mo68VLNp=QynfT7QKgdq=d1YYGv1SEVEDxF9UwHzF6YDw@mail.gmail.com> On Wed, Apr 19, 2023 at 2:53 PM Anish Moorthy <amoorthy@google.com> wrote: > > On Wed, Apr 19, 2023 at 2:05 PM Peter Xu <peterx@redhat.com> wrote: > > > > On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote: > > > We considered sharding into several UFFDs. I do think it helps, but > > > also I think there are two main problems with it... > > > > But I agree I can never justify that it'll always work. If you or Anish > > could provide some data points to further support this issue that would be > > very interesting and helpful, IMHO, not required though. > > Axel covered the reasons for not pursuing the sharding approach nicely > (thanks!). It's not something we ever prototyped, so I don't have any > further numbers there. > > On Wed, Apr 19, 2023 at 2:05 PM Peter Xu <peterx@redhat.com> wrote: > > > > On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote: > > > > > I think we could share numbers from some of our internal benchmarks, > > > or at the very least give relative numbers (e.g. +50% increase), but > > > since a lot of the software stack is proprietary (e.g. we don't use > > > QEMU), it may not be that useful or reproducible for folks. > > > > Those numbers can still be helpful. I was not asking for reproduceability, > > but some test to better justify this feature. > > I do have some internal benchmarking numbers on this front, although > it's been a while since I've collected them so the details might be a > little sparse. > > I've confirmed performance gains with "scalable userfaultfd" using two > workloads besides the self-test: > > The first, cycler, spins up a VM and launches a binary which (a) maps > a large amount of memory and then (b) loops over it issuing writes as > fast as possible. It's not a very realistic guest but it at least > involves an actual migrating VM, and we often use it to > stress/performance test migration changes. The write rate which cycler > achieves during userfaultfd-based postcopy (without scalable uffd > enabled) is about 25% of what it achieves under KVM Demand Paging (the > internal KVM feature GCE currently uses for postcopy). With > userfaultfd-based postcopy and scalable uffd enabled that rate jumps > nearly 3x, so about 75% of what KVM Demand Paging achieves. The > attached "Cycler.png" illustrates this effect (though due to some > other details, faster demand paging actually makes the migrations > worse: the point is that scalable uffd performs more similarly to kvm > demand paging :) > > The second is the redis memtier benchmark [1], a more realistic > workflow where we migrate a VM running the redis server. With scalable > userfaultfd, the client VM observes significantly higher transaction > rates during uffd-based postcopy (see "Memtier.png"). I can pull the > exact numbers if needed, but just from eyeballing the graph you can > see that the improvement is something like 5-10x (at least) for > several seconds. There's still a noticeable gap with KVM demand paging > based-postcopy, but the improvement is definitely significant. > > [1] https://github.com/RedisLabs/memtier_benchmark
On Thu, Apr 20, 2023 at 2:29 PM Peter Xu <peterx@redhat.com> wrote: > > Yes I don't understand why vanilla uffd is so different, neither am I sure > what does the graph mean, though. :) > > Is the first drop caused by starting migration/precopy? > > Is the 2nd (huge) drop (mostly to zero) caused by frequently accessing new > pages during postcopy? Right on both counts. By the way, for anyone who notices that the userfaultfd (red/yellow) lines never recover to the initial level of performance, whereas the blue line does: that's a separate issue, please ignore :) > Is the workload busy writes single thread, or NCPU threads? One thread per vCPU. > Is what you mentioned on the 25%-75% comparison can be shown on the graph? > Or maybe that's part of the period where all three are very close to 0? Yes, unfortunately the absolute size of the improvement is still pretty small (we go from ~50 writes/s to ~150), so all looks like zero with this scale. > > The second is the redis memtier benchmark [1], a more realistic > > workflow where we migrate a VM running the redis server. With scalable > > userfaultfd, the client VM observes significantly higher transaction > > rates during uffd-based postcopy (see "Memtier.png"). I can pull the > > exact numbers if needed, but just from eyeballing the graph you can > > see that the improvement is something like 5-10x (at least) for > > several seconds. There's still a noticeable gap with KVM demand paging > > based-postcopy, but the improvement is definitely significant. > > > > [1] https://github.com/RedisLabs/memtier_benchmark > > Does the "5-10x" difference rely in the "15s valley" you pointed out in the > graph? Not quite sure what you mean: I meant to point out that the ~15s valley is where we observe improvements due to scalable userfaultfd. For most of that valley, the speedup of scalable uffd is 5-10x (or something, I admit to eyeballing those numbers :) > Is it reproduceable that the blue line always has a totally different > "valley" comparing to yellow/red? Yes, but the offset of that valley is just precopy taking longer for some reason on that configuration. Honestly it's probably just better to ignore the blue line, since that's a google-specific stack. > Personally I still really want to know what happens if we just split the > vma and see how it goes with a standard workloads, but maybe I'm asking too > much so don't yet worry. The solution here proposed still makes sense to > me and I agree if this can be done well it can resolve the bottleneck over > 1-userfaultfd. > > But after I read some of the patches I'm not sure whether it's possible it > can be implemented in a complete way. You mentioned here and there on that > things can be missing probably due to random places accessing guest pages > all over kvm. Relying sololy on -EFAULT so far doesn't look very reliable > to me, but it could be because I didn't yet really understand how it works. > > Is above a concern to the current solution? Based on your comment in [1], I think your impression of this series is that it tries to (a) catch all of the cases where userfaultfd would be triggered and (b) bypass userfaultfd by surfacing the page faults via vCPU exit. That's only happening in two places (the KVM_ABSENT_MAPPING_FAULT changes) corresponding to the EPT violation handler on x86 and the arm64 equivalent. Bypassing the queuing of faults onto a uffd in those two cases, and instead delivering those faults via vCPU exit, is what provides the performance gains I'm demonstrating. 
However, all of the other changes (KVM_MEMORY_FAULT_INFO, the bulk of this series) are totally unrelated to if/how faults are queued onto userfaultfd. Page faults from copy_to_user/copy_from_user, etc will continue to be delivered via uffd (if one is registered, obviously), and changing that is *not* a goal. All that KVM_MEMORY_FAULT_INFO does is deliver some extra information to userspace in cases where KVM_RUN currently just returns -EFAULT. Hopefully this, and my response to [1], clears things up. If not, let me know and I'll be glad to discuss further. [1] https://lore.kernel.org/kvm/ZEGuogfbtxPNUq7t@x1n/T/#m76f940846ecc94ea85efa80ffbe42366c2352636 > Have any of you tried to investigate the other approach to scale > userfaultfd? As Axel mentioned we considered sharding VMAs but didn't pursue it for a few different reasons. > It seems userfaultfd does one thing great which is to have the trapping at > an unified place (when the page fault happens), hence it doesn't need to > worry on random codes splat over KVM module read/write a guest page. The > question is whether it'll be easy to do so. See a couple of notes above. > Split vma definitely is still a way to scale userfaultfd, but probably not > in a good enough way because it's scaling in memory axis, not cores. If > tens of cores accessing a small region that falls into the same VMA, then > it stops working. > > However maybe it can be scaled in other form? So far my understanding is > "read" upon uffd for messages is still not a problem - the read can be done > in chunk, and each message will be converted into a request to be send > later. > > If the real problem relies in a bunch of threads queuing, is it possible > that we can provide just more queues for the events? The readers will just > need to go over all the queues. > > Way to decide "which thread uses which queue" can be another problem, what > comes ups quickly to me is a "hash(tid) % n_queues" but maybe it can be > better. Each vcpu thread will have different tids, then they can hopefully > scale on the queues. > > There's at least one issue that I know with such an idea, that after we > have >1 uffd queues it means the message order will be uncertain. It may > matter for some uffd users (e.g. cooperative userfaultfd, see > UFFD_FEATURE_FORK|REMOVE|etc.) because I believe order of messages matter > for them (mostly CRIU). But I think that's not a blocker either because we > can forbid those features with multi queues. > > That's a wild idea that I'm just thinking about, which I have totally no > idea whether it'll work or not. It's more or less of a generic question on > "whether there's chance to scale on uffd side just in case it might be a > cleaner approach", when above concern is a real concern. You bring up a good point, which is that this series only deals with uffd's performance in the context of KVM. I had another idea in this vein, which was to allow dedicating queues to certain threads: I even threw together a prototype, though there was some bug in it which stopped me from ever getting a real signal :( I think there's still potential to make uffd itself faster but, as you point out, that might get messy from an API perspective (I know my prototype did :) and is going to require more investigation and prototyping. 
The advantage of this approach is that it's simple, makes a lot of conceptual sense IMO (in that the previously-stalled vCPU threads can now participate in the work of demand fetching), and solves a very important (probably *the* most important) bottleneck when it comes to KVM + uffd-based postcopy.
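To illustrate the userspace side of what is described above: with KVM_CAP_MEMORY_FAULT_INFO, KVM_RUN still fails with -EFAULT, but the run struct is annotated with the faulting range, so the vCPU thread itself can resolve the fault and re-enter the guest instead of waiting on a uffd queue. The sketch below is hedged: the memory_fault field layout is an assumption based on the proposal and may not match this exact revision, and resolve_guest_page() is a placeholder for the VMM's demand-fetch path.

    #include <errno.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* VMM-specific placeholder: fetch the page (e.g. over the network) and
     * install it, e.g. via UFFDIO_CONTINUE as discussed later in the thread. */
    void resolve_guest_page(uint64_t gpa, uint64_t size);

    static void vcpu_run_loop(int vcpu_fd, struct kvm_run *run)
    {
            for (;;) {
                    if (ioctl(vcpu_fd, KVM_RUN, 0) == 0) {
                            /* handle the usual exits: MMIO, halt, ... */
                            continue;
                    }
                    if (errno == EFAULT) {
                            /* Assumed annotation layout from the series. */
                            resolve_guest_page(run->memory_fault.gpa,
                                               run->memory_fault.size);
                            continue; /* re-enter the guest; no uffd queue involved */
                    }
                    if (errno == EINTR || errno == EAGAIN)
                            continue;
                    break; /* unrecoverable error */
            }
    }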
> On Apr 20, 2023, at 2:29 PM, Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Anish,
>
> [Copied Nadav Amit for the last few paragraphs on userfaultfd, because
> Nadav worked on a few userfaultfd performance problems; so maybe he'll
> also have some ideas around]

Sorry for not following this thread and the previous ones. I skimmed
through and I hope my answers will be helpful and relevant.

Anyhow, I also encountered, to some extent, the cost of locking (not the
contention). In my case, I only made a small change to reduce the overhead
of acquiring the locks by "combining" the locks of fault_pending_wqh and
fault_wqh, which are (almost?) always acquired together. I only acquired
fault_pending_wqh and then introduced a "fake_spin_lock()" whose only
purpose was to let lockdep understand that fault_wqh is already protected.
But as I said, this solution was only intended to reduce the cost of
locking; it does not solve the contention.

If I understand the problem correctly, it sounds as if the proper solution
should be some kind of range-lock. If that is too heavy, or if the
interface can be changed/extended to wake a single address (instead of a
range), simpler hashed locks can be used.
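As a minimal model of the "simpler hashed-locks" suggestion above: instead of one lock serializing all wake-ups on a userfaultfd, waking a single faulting address takes only the lock its page hashes to. This is illustrative userspace code, not a patch to fs/userfaultfd.c; the bucket count and hash are arbitrary.

    #include <pthread.h>
    #include <stdint.h>

    #define N_WAKE_LOCKS    64
    #define PAGE_SHIFT      12

    static pthread_mutex_t wake_locks[N_WAKE_LOCKS] = {
            /* GCC/Clang range initializer, used here for brevity. */
            [0 ... N_WAKE_LOCKS - 1] = PTHREAD_MUTEX_INITIALIZER,
    };

    /* Waking (or queueing a waiter for) a single address contends only with
     * other addresses whose pages hash to the same bucket. */
    static pthread_mutex_t *wake_lock_for(uint64_t addr)
    {
            return &wake_locks[(addr >> PAGE_SHIFT) % N_WAKE_LOCKS];
    }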
On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@gmail.com> wrote: > > If I understand the problem correctly, it sounds as if the proper solution > should be some kind of a range-locks. If it is too heavy or the interface can > be changed/extended to wake a single address (instead of a range), > simpler hashed-locks can be used. Some sort of range-based locking system does seem relevant, although I don't see how that would necessarily speed up the delivery of faults to UFFD readers: I'll have to think about it more. On the KVM side though, I think there's value in merging KVM_CAP_ABSENT_MAPPING_FAULT and allowing performance improvements to UFFD itself proceed separately. It's likely that returning faults directly via the vCPU threads will be faster than even an improved UFFD, since the former approach basically removes one level of indirection. That seems important, given the criticality of the EPT-violation path during postcopy. Furthermore, if future performance improvements to UFFD involve changes/restrictions to its API, then KVM_CAP_ABSENT_MAPPING_FAULT could well be useful anyways. Sean did mention that he wanted KVM_CAP_MEMORY_FAULT_INFO in general, so I'm guessing (some version of) that will (eventually :) be merged in any case.
> On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote: > > On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@gmail.com> wrote: >> >> If I understand the problem correctly, it sounds as if the proper solution >> should be some kind of a range-locks. If it is too heavy or the interface can >> be changed/extended to wake a single address (instead of a range), >> simpler hashed-locks can be used. > > Some sort of range-based locking system does seem relevant, although I > don't see how that would necessarily speed up the delivery of faults > to UFFD readers: I'll have to think about it more. Perhaps I misread your issue. Based on the scalability issues you raised, I assumed that the problem you encountered is related to lock contention. I do not know whether your profiled it, but some information would be useful. Anyhow, if the issue you encounter is mostly about the general overhead of delivering userfaultfd, I encountered this one too. The solution I tried (and you can find some old patches) is in delivering and resolving userfaultfd using IO-uring. The main advantage is that this solution is generic and performance is pretty good. The disadvantage is that you do need to allocate a polling thread/core to handle the userfaultfd. If you want, I can send you privately the last iteration of my patches for you to give it a spin. > > On the KVM side though, I think there's value in merging > KVM_CAP_ABSENT_MAPPING_FAULT and allowing performance improvements to > UFFD itself proceed separately. It's likely that returning faults > directly via the vCPU threads will be faster than even an improved > UFFD, since the former approach basically removes one level of > indirection. That seems important, given the criticality of the > EPT-violation path during postcopy. Furthermore, if future performance > improvements to UFFD involve changes/restrictions to its API, then > KVM_CAP_ABSENT_MAPPING_FAULT could well be useful anyways. > > Sean did mention that he wanted KVM_CAP_MEMORY_FAULT_INFO in general, > so I'm guessing (some version of) that will (eventually :) be merged > in any case. It certainly not my call. But if you ask me, introducing a solution for a concrete use-case that requires API changes/enhancements is not guaranteed to be the best solution. It may be better first to fully understand the existing overheads and agree that there is no alternative cleaner and more general solution with similar performance. Considering the mess that KVM async-PF introduced, I would be very careful before introducing such API changes. I did not look too much on the details, but some things anyhow look slightly strange (which might be since I am out-of-touch with KVM). For instance, returning -EFAULT on from KVM_RUN? I would have assumed -EAGAIN would be more appropriate since the invocation did succeed.
On Mon, Apr 24, 2023, Nadav Amit wrote: > > > On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote: > > Sean did mention that he wanted KVM_CAP_MEMORY_FAULT_INFO in general, > > so I'm guessing (some version of) that will (eventually :) be merged > > in any case. > > It certainly not my call. But if you ask me, introducing a solution for > a concrete use-case that requires API changes/enhancements is not > guaranteed to be the best solution. It may be better first to fully > understand the existing overheads and agree that there is no alternative > cleaner and more general solution with similar performance. KVM already returns -EFAULT for these situations, the change I really want to land is to have KVM report detailed information about why the -EFAULT occurred. I'll be happy to carry the code in KVM even if userspace never does anything beyond dumping the extra information on failures. > Considering the mess that KVM async-PF introduced, I would be very careful > before introducing such API changes. I did not look too much on the details, > but some things anyhow look slightly strange (which might be since I am > out-of-touch with KVM). For instance, returning -EFAULT on from KVM_RUN? I > would have assumed -EAGAIN would be more appropriate since the invocation did > succeed. Yeah, returning -EFAULT is somewhat odd, but as above, that's pre-existing behavior that's been around for many years.
Feel free to tell me to shut up, as it is none of my business, and I might be missing a lot of context. Yet, it never stopped me before. :) > On Apr 24, 2023, at 1:35 PM, Sean Christopherson <seanjc@google.com> wrote: > > On Mon, Apr 24, 2023, Nadav Amit wrote: >> >>> On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote: >>> Sean did mention that he wanted KVM_CAP_MEMORY_FAULT_INFO in general, >>> so I'm guessing (some version of) that will (eventually :) be merged >>> in any case. >> >> It certainly not my call. But if you ask me, introducing a solution for >> a concrete use-case that requires API changes/enhancements is not >> guaranteed to be the best solution. It may be better first to fully >> understand the existing overheads and agree that there is no alternative >> cleaner and more general solution with similar performance. > > KVM already returns -EFAULT for these situations, the change I really want to land > is to have KVM report detailed information about why the -EFAULT occurred. I'll be > happy to carry the code in KVM even if userspace never does anything beyond dumping > the extra information on failures. I thought that the change is to inform on page-faults through a new interface instead of the existing uffd-file-based one. There is already another interface (signals) and I thought (but did not upstream) io-uring. You now suggest yet another one. I am not sure it is very clean. IIUC, it means that you would still need in userspace to monitor uffd, as qemu (or whatever-kvm-userspace-counterpart-you- use) might also trigger a page-fault. So userspace becomes more complicated, and the ordering of different events/page-faults reporting becomes even more broken. At the same time, you also break various assumptions of userfaultfd. I don’t immediately find some functionality that would break, but I am not very confident about it either. > >> Considering the mess that KVM async-PF introduced, I would be very careful >> before introducing such API changes. I did not look too much on the details, >> but some things anyhow look slightly strange (which might be since I am >> out-of-touch with KVM). For instance, returning -EFAULT on from KVM_RUN? I >> would have assumed -EAGAIN would be more appropriate since the invocation did >> succeed. > > Yeah, returning -EFAULT is somewhat odd, but as above, that's pre-existing > behavior that's been around for many years. Again, it is none of my business, but don’t you want to gradually try to fix the interface so maybe on day you would be able to deprecate it? IOW, if you already introduce a new interface which is enabled with a new flag, which would require userspace change, then you can return the more appropriate error-code.
On Mon, Apr 24, 2023 at 12:44 PM Nadav Amit <nadav.amit@gmail.com> wrote: > > > > > On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote: > > > > On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@gmail.com> wrote: > >> > >> If I understand the problem correctly, it sounds as if the proper solution > >> should be some kind of a range-locks. If it is too heavy or the interface can > >> be changed/extended to wake a single address (instead of a range), > >> simpler hashed-locks can be used. > > > > Some sort of range-based locking system does seem relevant, although I > > don't see how that would necessarily speed up the delivery of faults > > to UFFD readers: I'll have to think about it more. > > Perhaps I misread your issue. Based on the scalability issues you raised, > I assumed that the problem you encountered is related to lock contention. > I do not know whether your profiled it, but some information would be > useful. No, you had it right: the issue at hand is contention on the uffd wait queues. I'm just not sure what the range-based locking would really be doing. Events would still have to be delivered to userspace in an ordered manner, so it seems to me that each uffd would still need to maintain a queue (and the associated contention). With respect to the "sharding" idea, I collected some more runs of the self test (full command in [1]). This time I omitted the "-a" flag, so that every vCPU accesses a different range of guest memory with its own UFFD, and set the number of reader threads per UFFD to 1. vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps) 1 180 307 2 85 220 4 80 206 8 39 163 16 18 104 32 8 73 64 4 57 128 1 37 256 1 16 I'm reporting paging rate on a per-vcpu rather than total basis, which is why the numbers look so different than the ones in the cover letter. I'm actually not sure why the demand paging rate falls off with the number of vCPUs (maybe a prioritization issue on my side?), but even when UFFDs aren't being contended for it's clear that demand paging via memory fault exits is significantly faster. I'll try to get some perf traces as well: that will take a little bit of time though, as to do it for cycler will involve patching our VMM first. [1] ./demand_paging_test -b 64M -u MINOR -s shmem -v <n> -r 1 [-w] > It certainly not my call. But if you ask me, introducing a solution for > a concrete use-case that requires API changes/enhancements is not > guaranteed to be the best solution. It may be better first to fully > understand the existing overheads and agree that there is no alternative > cleaner and more general solution with similar performance. > > Considering the mess that KVM async-PF introduced, I > would be very careful before introducing such API changes. I did not look > too much on the details, but some things anyhow look slightly strange > (which might be since I am out-of-touch with KVM). For instance, returning > -EFAULT on from KVM_RUN? I would have assumed -EAGAIN would be more > appropriate since the invocation did succeed. I'm not quite sure whether you're focusing on KVM_CAP_MEMORY_FAULT_INFO or KVM_CAP_ABSENT_MAPPING_FAULT here. But to my knowledge, none of the KVM folks have objections to either: hopefully it stays that way, but we'll have to see :)
On Mon, Apr 24, 2023, Nadav Amit wrote: > Feel free to tell me to shut up, as it is none of my business, and I might be > missing a lot of context. Yet, it never stopped me before. :) > > > On Apr 24, 2023, at 1:35 PM, Sean Christopherson <seanjc@google.com> wrote: > > > > On Mon, Apr 24, 2023, Nadav Amit wrote: > >> > >>> On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote: > >>> Sean did mention that he wanted KVM_CAP_MEMORY_FAULT_INFO in general, > >>> so I'm guessing (some version of) that will (eventually :) be merged > >>> in any case. > >> > >> It certainly not my call. But if you ask me, introducing a solution for > >> a concrete use-case that requires API changes/enhancements is not > >> guaranteed to be the best solution. It may be better first to fully > >> understand the existing overheads and agree that there is no alternative > >> cleaner and more general solution with similar performance. > > > > KVM already returns -EFAULT for these situations, the change I really want to land > > is to have KVM report detailed information about why the -EFAULT occurred. I'll be > > happy to carry the code in KVM even if userspace never does anything beyond dumping > > the extra information on failures. > > I thought that the change is to inform on page-faults through a new interface > instead of the existing uffd-file-based one. There is already another interface > (signals) and I thought (but did not upstream) io-uring. You now suggest yet > another one. There are two capabilities proposed in this series: one to provide more information when KVM encounters a fault it can't resolve, and another to tell KVM to kick out to userspace instead of attempting to resolve a fault when a given address couldn't be resolved with fast gup. I'm talking purely about the first one: providing more information when KVM exits. As for the second, my plan is to try and stay out of the way and let people that actually deal with the userspace side of things settle on an approach. From the KVM side, supporting the "don't wait to resolve faults" flag is quite simple so long as KVM punts the heavy lifting to userspace, e.g. identifying _why_ the address isn't mapped with the appropriate permissions. > I am not sure it is very clean. IIUC, it means that you would still need in > userspace to monitor uffd, as qemu (or whatever-kvm-userspace-counterpart-you- > use) might also trigger a page-fault. So userspace becomes more complicated, > and the ordering of different events/page-faults reporting becomes even more > broken. > > At the same time, you also break various assumptions of userfaultfd. I don’t > immediately find some functionality that would break, but I am not very > confident about it either. > > > > >> Considering the mess that KVM async-PF introduced, I would be very careful > >> before introducing such API changes. I did not look too much on the details, > >> but some things anyhow look slightly strange (which might be since I am > >> out-of-touch with KVM). For instance, returning -EFAULT on from KVM_RUN? I > >> would have assumed -EAGAIN would be more appropriate since the invocation did > >> succeed. > > > > Yeah, returning -EFAULT is somewhat odd, but as above, that's pre-existing > > behavior that's been around for many years. > > Again, it is none of my business, but don’t you want to gradually try to fix > the interface so maybe on day you would be able to deprecate it? 
> > IOW, if you already introduce a new interface which is enabled with a new > flag, which would require userspace change, then you can return the more > appropriate error-code. In a perfect world, yes. But unfortunately, the relevant plumbing in KVM is quite brittle (understatement) with respect to returning "0", and AFAICT, returning -EFAULT instead of 0 is nothing more than an odditity. E.g. at worst, it could be suprising to users writing a new VMM from scratch. But I hadn't thought about returning a _different_ error code. -EAGAIN isn't obviously better though, e.g. my understanding is that -EAGAIN typically means that retrying will succeed, but in pretty much every case where KVM returns -EFAULT, simply trying again will never succeed. It's not even guaranteed to be appropriate in the "don't wait to resolve faults" case, because KVM won't actually know why an address isn't accessible, e.g. it could be because the page needs to be faulted in, but it could also be a due to a guest bug that resulted in a permission violation. All in all, returning -EFAULT doesn't seem egregious. I can't recall a single complaint about returning -EFAULT instead of -XYZ, just complaints about not KVM providing any useful information. So absent a concrete need for returning something other than -EFAULT, I'm definitely inclined to maintain the status quo, even though it's imperfect.
> On Apr 24, 2023, at 5:26 PM, Sean Christopherson <seanjc@google.com> wrote: > > On Mon, Apr 24, 2023, Nadav Amit wrote: >> Feel free to tell me to shut up, as it is none of my business, and I might be >> missing a lot of context. Yet, it never stopped me before. :) >> >>> On Apr 24, 2023, at 1:35 PM, Sean Christopherson <seanjc@google.com> wrote: >>> >>> On Mon, Apr 24, 2023, Nadav Amit wrote: >>>> >>>>> On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote: >>>>> Sean did mention that he wanted KVM_CAP_MEMORY_FAULT_INFO in general, >>>>> so I'm guessing (some version of) that will (eventually :) be merged >>>>> in any case. >>>> >>>> It certainly not my call. But if you ask me, introducing a solution for >>>> a concrete use-case that requires API changes/enhancements is not >>>> guaranteed to be the best solution. It may be better first to fully >>>> understand the existing overheads and agree that there is no alternative >>>> cleaner and more general solution with similar performance. >>> >>> KVM already returns -EFAULT for these situations, the change I really want to land >>> is to have KVM report detailed information about why the -EFAULT occurred. I'll be >>> happy to carry the code in KVM even if userspace never does anything beyond dumping >>> the extra information on failures. >> >> I thought that the change is to inform on page-faults through a new interface >> instead of the existing uffd-file-based one. There is already another interface >> (signals) and I thought (but did not upstream) io-uring. You now suggest yet >> another one. > > There are two capabilities proposed in this series: one to provide more information > when KVM encounters a fault it can't resolve, and another to tell KVM to kick out > to userspace instead of attempting to resolve a fault when a given address couldn't > be resolved with fast gup. I'm talking purely about the first one: providing more > information when KVM exits. > > As for the second, my plan is to try and stay out of the way and let people that > actually deal with the userspace side of things settle on an approach. From the > KVM side, supporting the "don't wait to resolve faults" flag is quite simple so > long as KVM punts the heavy lifting to userspace, e.g. identifying _why_ the address > isn't mapped with the appropriate permissions. Thanks for your kind and detailed reply. I removed the parts that I understand. As you might guess, I focus on the second part, which you leave for others. The interactions between two page-fault reporting mechanisms is not trivial, and it is already not fully correct. I understand the approach that Anish prefers is to do something that is tailored for KVM, but I am not sure it is the best thing.
> On Apr 24, 2023, at 5:15 PM, Anish Moorthy <amoorthy@google.com> wrote: > > On Mon, Apr 24, 2023 at 12:44 PM Nadav Amit <nadav.amit@gmail.com> wrote: >> >> >> >>> On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote: >>> >>> On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@gmail.com> wrote: >>>> >>>> If I understand the problem correctly, it sounds as if the proper solution >>>> should be some kind of a range-locks. If it is too heavy or the interface can >>>> be changed/extended to wake a single address (instead of a range), >>>> simpler hashed-locks can be used. >>> >>> Some sort of range-based locking system does seem relevant, although I >>> don't see how that would necessarily speed up the delivery of faults >>> to UFFD readers: I'll have to think about it more. >> >> Perhaps I misread your issue. Based on the scalability issues you raised, >> I assumed that the problem you encountered is related to lock contention. >> I do not know whether your profiled it, but some information would be >> useful. > > No, you had it right: the issue at hand is contention on the uffd wait > queues. I'm just not sure what the range-based locking would really be > doing. Events would still have to be delivered to userspace in an > ordered manner, so it seems to me that each uffd would still need to > maintain a queue (and the associated contention). There are 2 queues. One for the pending faults that were still not reported to userspace, and one for the faults that we might need to wake up. The second one can have range locks. Perhaps some hybrid approach would be best: do not block on page-faults that KVM runs into, which would prevent you from the need to enqueue on fault_wqh. But I do not know whether the reporting through KVM instead of userfaultfd-based mechanism is very clean. I think that an IO-uring based solution, such as the one I proposed before, would be more generic. Actually, now that I understand better your use-case, you do not need a core to poll and you would just be able to read the page-fault information from the IO-uring. Then, you can report whether the page-fault blocked or not in a flag. > > With respect to the "sharding" idea, I collected some more runs of the > self test (full command in [1]). This time I omitted the "-a" flag, so > that every vCPU accesses a different range of guest memory with its > own UFFD, and set the number of reader threads per UFFD to 1. Just wondering, did you run the benchmark with DONTWAKE? Sounds as if the wake is not needed.
On Mon, Apr 24, 2023 at 5:54 PM Nadav Amit <nadav.amit@gmail.com> wrote: > > > > > On Apr 24, 2023, at 5:15 PM, Anish Moorthy <amoorthy@google.com> wrote: > > > > On Mon, Apr 24, 2023 at 12:44 PM Nadav Amit <nadav.amit@gmail.com> wrote: > >> > >> > >> > >>> On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote: > >>> > >>> On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@gmail.com> wrote: > >>>> > >>>> If I understand the problem correctly, it sounds as if the proper solution > >>>> should be some kind of a range-locks. If it is too heavy or the interface can > >>>> be changed/extended to wake a single address (instead of a range), > >>>> simpler hashed-locks can be used. > >>> > >>> Some sort of range-based locking system does seem relevant, although I > >>> don't see how that would necessarily speed up the delivery of faults > >>> to UFFD readers: I'll have to think about it more. > >> > >> Perhaps I misread your issue. Based on the scalability issues you raised, > >> I assumed that the problem you encountered is related to lock contention. > >> I do not know whether your profiled it, but some information would be > >> useful. > > > > No, you had it right: the issue at hand is contention on the uffd wait > > queues. I'm just not sure what the range-based locking would really be > > doing. Events would still have to be delivered to userspace in an > > ordered manner, so it seems to me that each uffd would still need to > > maintain a queue (and the associated contention). > > There are 2 queues. One for the pending faults that were still not reported > to userspace, and one for the faults that we might need to wake up. The second > one can have range locks. > > Perhaps some hybrid approach would be best: do not block on page-faults that > KVM runs into, which would prevent you from the need to enqueue on fault_wqh. Hi Nadav, If we don't block on the page faults that KVM runs into, what are you suggesting that these threads do? 1. If you're saying that we should kick the threads out to userspace and then read the page fault event, then I would say that it's just unnecessary complexity. (Seems like this is what you mean from what you said below.) 2. If you're saying they should busy-wait, then unfortunately we can't afford that. 3. If it's neither of those, could you clarify? > > But I do not know whether the reporting through KVM instead of > userfaultfd-based mechanism is very clean. I think that an IO-uring based > solution, such as the one I proposed before, would be more generic. Actually, > now that I understand better your use-case, you do not need a core to poll > and you would just be able to read the page-fault information from the IO-uring. > > Then, you can report whether the page-fault blocked or not in a flag. This is a fine idea, but I don't think the required complexity is worth it. The memory fault info reporting piece of this series is relatively uncontentious, so let's assume we have it at our disposal. Now, the complexity to make KVM only attempt fast GUP (and EFAULT if it fails) is really minimal. We automatically know that we don't need to WAKE and which address to make ready. Userspace is also able to resolve the fault: UFFDIO_CONTINUE if we haven't already, then MADV_POPULATE_WRITE if we have (forces userspace page tables to be populated if they haven't been, potentially going through userfaultfd to do so, i.e., if UFFDIO_CONTINUE wasn't already done). It sounds like what you're suggesting is something like: 1. 
KVM attempts fast GUP then slow GUP. 2. In slow GUP, queue a "non-blocking" userfault, but don't go to sleep (return with VM_FAULT_SIGBUS or something). 3. The vCPU thread gets kicked out to userspace with EFAULT (+ fault info if we've enabled it). 4. Read a fault from the userfaultfd or io_uring. 5. Make the page ready, and if it were non-blocking, then don't WAKE. I have some questions/thoughts with this approach: 1. Is io_uring the only way to make reading from a userfaultfd scale? Maybe it's possible to avoid using a wait_queue for "non-blocking" faults, but then we'd need a special read() API specifically to *avoid* the standard fault_pending_wqh queue. Either approach will be quite complex. 2. We'll still need to annotate KVM in the same-ish place to tell userfaultfd that the fault should be non-blocking, but we'll probably *also* need like GUP_USERFAULT_NONBLOCK and/or FAULT_FLAG_USERFAULT_NOBLOCK or something. (UFFD_FEATURE_SIGBUS does not exactly solve this problem either.) 3. If the vCPU thread is getting kicked out to userspace, it seems like there is no way for it to find/read the #pf it generated. This seems problematic. > > > > > With respect to the "sharding" idea, I collected some more runs of the > > self test (full command in [1]). This time I omitted the "-a" flag, so > > that every vCPU accesses a different range of guest memory with its > > own UFFD, and set the number of reader threads per UFFD to 1. > > Just wondering, did you run the benchmark with DONTWAKE? Sounds as if the > wake is not needed. > Anish's selftest only WAKEs when it's necessary[1]. IOW, we only WAKE when we actually read the #pf from the userfaultfd. If we were to WAKE for each fault, we wouldn't get much of a scalability improvement at all (we would still be contending on the wait_queue locks, just not quite as much as before). [1]: https://lore.kernel.org/kvm/20230412213510.1220557-23-amoorthy@google.com/ Thanks for your insights/suggestions, Nadav. - James
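For reference, a hedged sketch of the resolution path described above, run on the faulting vCPU thread after the annotated -EFAULT exit: install the already-fetched page with UFFDIO_CONTINUE, skip the wake-up (the vCPU never sat on the uffd wait queue), and fall back to MADV_POPULATE_WRITE if the page has already been continued. Treating EEXIST as "already continued", and the hva/len plumbing, are simplifying assumptions.

    #include <errno.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/userfaultfd.h>

    #ifndef MADV_POPULATE_WRITE
    #define MADV_POPULATE_WRITE 23  /* in case the libc headers predate 5.14 */
    #endif

    /* hva/len describe the (page-aligned) host mapping of the faulting gpa. */
    static int resolve_minor_fault(int uffd, uint64_t hva, uint64_t len)
    {
            struct uffdio_continue cont = {
                    .range = { .start = hva, .len = len },
                    .mode  = UFFDIO_CONTINUE_MODE_DONTWAKE, /* vCPU isn't queued */
            };

            if (!ioctl(uffd, UFFDIO_CONTINUE, &cont))
                    return 0;

            /*
             * Assumed: EEXIST means another thread already continued this page,
             * so just force the userspace page tables to be populated before
             * re-entering the guest.
             */
            if (errno == EEXIST)
                    return madvise((void *)(uintptr_t)hva, len, MADV_POPULATE_WRITE);

            return -1;
    }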
Hi, Anish, On Mon, Apr 24, 2023 at 05:15:49PM -0700, Anish Moorthy wrote: > On Mon, Apr 24, 2023 at 12:44 PM Nadav Amit <nadav.amit@gmail.com> wrote: > > > > > > > > > On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@google.com> wrote: > > > > > > On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@gmail.com> wrote: > > >> > > >> If I understand the problem correctly, it sounds as if the proper solution > > >> should be some kind of a range-locks. If it is too heavy or the interface can > > >> be changed/extended to wake a single address (instead of a range), > > >> simpler hashed-locks can be used. > > > > > > Some sort of range-based locking system does seem relevant, although I > > > don't see how that would necessarily speed up the delivery of faults > > > to UFFD readers: I'll have to think about it more. > > > > Perhaps I misread your issue. Based on the scalability issues you raised, > > I assumed that the problem you encountered is related to lock contention. > > I do not know whether your profiled it, but some information would be > > useful. > > No, you had it right: the issue at hand is contention on the uffd wait > queues. I'm just not sure what the range-based locking would really be > doing. Events would still have to be delivered to userspace in an > ordered manner, so it seems to me that each uffd would still need to > maintain a queue (and the associated contention). > > With respect to the "sharding" idea, I collected some more runs of the > self test (full command in [1]). This time I omitted the "-a" flag, so > that every vCPU accesses a different range of guest memory with its > own UFFD, and set the number of reader threads per UFFD to 1. > > vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps) > 1 180 307 > 2 85 220 > 4 80 206 > 8 39 163 > 16 18 104 > 32 8 73 > 64 4 57 > 128 1 37 > 256 1 16 > > I'm reporting paging rate on a per-vcpu rather than total basis, which > is why the numbers look so different than the ones in the cover > letter. I'm actually not sure why the demand paging rate falls off > with the number of vCPUs (maybe a prioritization issue on my side?), > but even when UFFDs aren't being contended for it's clear that demand > paging via memory fault exits is significantly faster. > > I'll try to get some perf traces as well: that will take a little bit > of time though, as to do it for cycler will involve patching our VMM > first. > > [1] ./demand_paging_test -b 64M -u MINOR -s shmem -v <n> -r 1 [-w] Thanks (for doing this test, and also to Nadav for all his inputs), and sorry for a late response. These numbers caught my eye, and I'm very curious why even 2 vcpus can scale that bad. I gave it a shot on a test machine and I got something slightly different: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (20 cores, 40 threads) $ ./demand_paging_test -b 512M -u MINOR -s shmem -v N |-------+----------+--------| | n_thr | per-vcpu | total | |-------+----------+--------| | 1 | 39.5K | 39.5K | | 2 | 33.8K | 67.6K | | 4 | 31.8K | 127.2K | | 8 | 30.8K | 246.1K | | 16 | 21.9K | 351.0K | |-------+----------+--------| I used larger ram due to less cores. I didn't try 32+ vcpus to make sure I don't have two threads content on a core/thread already since I only got 40 hardware threads there, but still we can compare with your lower half. 
When I was testing I noticed bad numbers and another bug where NSEC_PER_SEC wasn't used properly, so I applied this before the test: https://lore.kernel.org/all/20230427201112.2164776-1-peterx@redhat.com/ I think it means it still doesn't scale that well, though it's not so bad either - no obvious 1/2 drop when using 2 vcpus. There are still a bunch of paths triggered in the test, so I also don't expect it to scale fully linearly. From my numbers I just didn't see a drop as drastic as yours. I'm not sure whether it's simply broken test numbers, parameter differences (e.g. you used only 64M per-vcpu), or hardware differences.
On Thu, Apr 27, 2023 at 1:26 PM Peter Xu <peterx@redhat.com> wrote: > > Thanks (for doing this test, and also to Nadav for all his inputs), and > sorry for a late response. No need to apologize: anyways, I've got you comfortably beat on being late at this point :) > These numbers caught my eye, and I'm very curious why even 2 vcpus can > scale that bad. > > I gave it a shot on a test machine and I got something slightly different: > > Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (20 cores, 40 threads) > $ ./demand_paging_test -b 512M -u MINOR -s shmem -v N > |-------+----------+--------| > | n_thr | per-vcpu | total | > |-------+----------+--------| > | 1 | 39.5K | 39.5K | > | 2 | 33.8K | 67.6K | > | 4 | 31.8K | 127.2K | > | 8 | 30.8K | 246.1K | > | 16 | 21.9K | 351.0K | > |-------+----------+--------| > > I used larger ram due to less cores. I didn't try 32+ vcpus to make sure I > don't have two threads content on a core/thread already since I only got 40 > hardware threads there, but still we can compare with your lower half. > > When I was testing I noticed bad numbers and another bug on not using > NSEC_PER_SEC properly, so I did this before the test: > > https://lore.kernel.org/all/20230427201112.2164776-1-peterx@redhat.com/ > > I think it means it still doesn't scale that good, however not so bad > either - no obvious 1/2 drop on using 2vcpus. There're still a bunch of > paths triggered in the test so I also don't expect it to fully scale > linearly. From my numbers I just didn't see as drastic as yours. I'm not > sure whether it's simply broken test number, parameter differences > (e.g. you used 64M only per-vcpu), or hardware differences. Hmm, I suspect we're dealing with hardware differences here. I rebased my changes onto those two patches you sent up, taking care not to clobber them, but even with the repro command you provided my results look very different than yours (at least on 1-4 vcpus) on the machine I've been testing on (4x AMD EPYC 7B13 64-Core, 2.2GHz). (n=20) n_thr per_vcpu total 1 154K 154K 2 92k 184K 4 71K 285K 8 36K 291K 16 19K 310K Out of interested I tested on another machine (Intel(R) Xeon(R) Platinum 8273CL CPU @ 2.20GHz) as well, and results are a bit different again (n=20) n_thr per_vcpu total 1 115K 115K 2 103k 206K 4 65K 262K 8 39K 319K 16 19K 398K It is interesting how all three sets of numbers start off different but seem to converge around 16 vCPUs. I did check to make sure the memory fault exits sped things up in all cases, and that at least stays true. By the way, I've got a little helper script that I've been using to run/average the selftest results (which can vary quite a bit). I've attached it below- hopefully it doesn't bounce from the mailing list. Just for reference, the invocation to test the command you provided is > python dp_runner.py --num_runs 20 --max_cores 16 --percpu_mem 512M
On Wed, May 03, 2023, Anish Moorthy wrote: > On Thu, Apr 27, 2023 at 1:26 PM Peter Xu <peterx@redhat.com> wrote: > > > > Thanks (for doing this test, and also to Nadav for all his inputs), and > > sorry for a late response. > > No need to apologize: anyways, I've got you comfortably beat on being > late at this point :) LOL, hold my beer and let me show you the true meaning of "late response". :-)
Oops, bounced back from the list.. Forward with no attachment this time - I assume the information is still enough in the paragraphs even without the flamegraphs. Sorry for the noise. On Wed, May 03, 2023 at 05:18:13PM -0400, Peter Xu wrote: > On Wed, May 03, 2023 at 12:45:07PM -0700, Anish Moorthy wrote: > > On Thu, Apr 27, 2023 at 1:26 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > Thanks (for doing this test, and also to Nadav for all his inputs), and > > > sorry for a late response. > > > > No need to apologize: anyways, I've got you comfortably beat on being > > late at this point :) > > > > > These numbers caught my eye, and I'm very curious why even 2 vcpus can > > > scale that bad. > > > > > > I gave it a shot on a test machine and I got something slightly different: > > > > > > Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (20 cores, 40 threads) > > > $ ./demand_paging_test -b 512M -u MINOR -s shmem -v N > > > |-------+----------+--------| > > > | n_thr | per-vcpu | total | > > > |-------+----------+--------| > > > | 1 | 39.5K | 39.5K | > > > | 2 | 33.8K | 67.6K | > > > | 4 | 31.8K | 127.2K | > > > | 8 | 30.8K | 246.1K | > > > | 16 | 21.9K | 351.0K | > > > |-------+----------+--------| > > > > > > I used larger ram due to less cores. I didn't try 32+ vcpus to make sure I > > > don't have two threads content on a core/thread already since I only got 40 > > > hardware threads there, but still we can compare with your lower half. > > > > > > When I was testing I noticed bad numbers and another bug on not using > > > NSEC_PER_SEC properly, so I did this before the test: > > > > > > https://lore.kernel.org/all/20230427201112.2164776-1-peterx@redhat.com/ > > > > > > I think it means it still doesn't scale that good, however not so bad > > > either - no obvious 1/2 drop on using 2vcpus. There're still a bunch of > > > paths triggered in the test so I also don't expect it to fully scale > > > linearly. From my numbers I just didn't see as drastic as yours. I'm not > > > sure whether it's simply broken test number, parameter differences > > > (e.g. you used 64M only per-vcpu), or hardware differences. > > > > Hmm, I suspect we're dealing with hardware differences here. I > > rebased my changes onto those two patches you sent up, taking care not > > to clobber them, but even with the repro command you provided my > > results look very different than yours (at least on 1-4 vcpus) on the > > machine I've been testing on (4x AMD EPYC 7B13 64-Core, 2.2GHz). > > > > (n=20) > > n_thr per_vcpu total > > 1 154K 154K > > 2 92k 184K > > 4 71K 285K > > 8 36K 291K > > 16 19K 310K > > > > Out of interested I tested on another machine (Intel(R) Xeon(R) > > Platinum 8273CL CPU @ 2.20GHz) as well, and results are a bit > > different again > > > > (n=20) > > n_thr per_vcpu total > > 1 115K 115K > > 2 103k 206K > > 4 65K 262K > > 8 39K 319K > > 16 19K 398K > > Interesting. > > > > > It is interesting how all three sets of numbers start off different > > but seem to converge around 16 vCPUs. I did check to make sure the > > memory fault exits sped things up in all cases, and that at least > > stays true. > > > > By the way, I've got a little helper script that I've been using to > > run/average the selftest results (which can vary quite a bit). I've > > attached it below- hopefully it doesn't bounce from the mailing list. 
> > Just for reference, the invocation to test the command you provided is > > > > > python dp_runner.py --num_runs 20 --max_cores 16 --percpu_mem 512M > > I found that indeed I shouldn't have stopped at 16 vcpus since that's > exactly where it starts to bottleneck. :) > > So out of my curiosity I tried to profile 32 vcpus case on my system with > this test case, meanwhile I tried it both with: > > - 1 uffd + 8 readers > - 32 uffds (so 32 readers) > > I've got the flamegraphs attached for both. > > It seems that when using >1 uffds the bottleneck is not the spinlock > anymore but something else. > > From what I got there, vmx_vcpu_load() gets more highlights than the > spinlocks. I think that's the tlb flush broadcast. > > While OTOH indeed when using 1 uffd we can see obviously the overhead of > spinlock contention on either the fault() path or read()/poll() as you and > James rightfully pointed out. > > I'm not sure whether my number is caused by special setup, though. After > all I only had 40 threads and I started 32 vcpus + 8 readers and there'll > be contention already between the workloads. > > IMHO this means that there's still chance to provide a more generic > userfaultfd scaling solution as long as we can remove the single spinlock > contention on the fault/fault_pending queues. I'll see whether I can still > explore a bit on the possibility of this and keep you guys updated. The > general idea here to me is still to make multi-queue out of 1 uffd. > > I _think_ this might also be a positive result to your work, because if the > bottleneck is not userfaultfd (as we scale it with creating multiple; > ignoring the split vma effect), then it cannot be resolved by scaling > userfaultfd alone anyway, anymore. So a general solution, even if existed, > may not work here for kvm, because we'll get stuck somewhere else already. > > -- > Peter Xu
On Wed, May 03, 2023, Peter Xu wrote: > Oops, bounced back from the list.. > > Forward with no attachment this time - I assume the information is still > enough in the paragraphs even without the flamegraphs. The flamegraphs are definitely useful beyond what is captured here. Not sure how to get them accepted on the list though. > > From what I got there, vmx_vcpu_load() gets more highlights than the > spinlocks. I think that's the tlb flush broadcast. No, it's KVM dealing with the vCPU being migrated to a different pCPU. The smp_call_function_single() that shows up is from loaded_vmcs_clear() and is triggered when KVM needs to VMCLEAR the VMCS on the _previous_ pCPU (yay for the VMCS caches not being coherent). Task migration can also trigger IBPB (if mitigations are enabled), and also does an "all contexts" INVEPT, i.e. flushes all TLB entries for KVM's MMU. Can you try 1:1 pinning of vCPUs to pCPUs? That _should_ eliminate the vmx_vcpu_load_vmcs() hotspot, and for large VMs is likely representative of a real world configuration.
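For reference, a minimal sketch of the 1:1 pinning being requested here, done from each vCPU thread before it enters KVM_RUN (roughly what the selftest's -c option used later in the thread accomplishes); choosing which pCPU goes with which vCPU is left to the caller:

/* Pin the calling vCPU thread to a single pCPU; a sketch of 1:1 pinning. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int pin_self_to_pcpu(int pcpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(pcpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}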
On Wed, May 03, 2023 at 02:42:35PM -0700, Sean Christopherson wrote: > On Wed, May 03, 2023, Peter Xu wrote: > > Oops, bounced back from the list.. > > > > Forward with no attachment this time - I assume the information is still > > enough in the paragraphs even without the flamegraphs. > > The flamegraphs are definitely useful beyond what is captured here. Not sure > how to get them accepted on the list though. Trying again with google drive: single uffd: https://drive.google.com/file/d/1bYVYefIRRkW8oViRbYv_HyX5Zf81p3Jl/view 32 uffds: https://drive.google.com/file/d/1T19yTEKKhbjU9G2FpANIvArSC61mqqtp/view > > > > From what I got there, vmx_vcpu_load() gets more highlights than the > > > spinlocks. I think that's the tlb flush broadcast. > > No, it's KVM dealing with the vCPU being migrated to a different pCPU. The > smp_call_function_single() that shows up is from loaded_vmcs_clear() and is > triggered when KVM needs to VMCLEAR the VMCS on the _previous_ pCPU (yay for the > VMCS caches not being coherent). > > Task migration can also trigger IBPB (if mitigations are enabled), and also does > an "all contexts" INVEPT, i.e. flushes all TLB entries for KVM's MMU. > > Can you trying 1:1 pinning of vCPUs to pCPUs? That _should_ eliminate the > vmx_vcpu_load_vmcs() hotspot, and for large VMs is likely represenative of a real > world configuration. Yes it does went away: https://drive.google.com/file/d/1ZFhWnWjoU33Lxy43jTYnKFuluo4zZArm/view With pinning vcpu threads only (again, over 40 hard cores/threads): ./demand_paging_test -b 512M -u MINOR -s shmem -v 32 -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32 It seems to me for some reason the scheduler ate more than I expected.. Maybe tomorrow I can try two more things: - Do cpu isolations, and - pin reader threads too (or just leave the readers on housekeeping cores)
On Wed, May 03, 2023 at 07:45:28PM -0400, Peter Xu wrote: > On Wed, May 03, 2023 at 02:42:35PM -0700, Sean Christopherson wrote: > > On Wed, May 03, 2023, Peter Xu wrote: > > > Oops, bounced back from the list.. > > > > > > Forward with no attachment this time - I assume the information is still > > > enough in the paragraphs even without the flamegraphs. > > > > The flamegraphs are definitely useful beyond what is captured here. Not sure > > how to get them accepted on the list though. > > Trying again with google drive: > > single uffd: > https://drive.google.com/file/d/1bYVYefIRRkW8oViRbYv_HyX5Zf81p3Jl/view > > 32 uffds: > https://drive.google.com/file/d/1T19yTEKKhbjU9G2FpANIvArSC61mqqtp/view > > > > > > > From what I got there, vmx_vcpu_load() gets more highlights than the > > > > spinlocks. I think that's the tlb flush broadcast. > > > > No, it's KVM dealing with the vCPU being migrated to a different pCPU. The > > smp_call_function_single() that shows up is from loaded_vmcs_clear() and is > > triggered when KVM needs to VMCLEAR the VMCS on the _previous_ pCPU (yay for the > > VMCS caches not being coherent). > > > > Task migration can also trigger IBPB (if mitigations are enabled), and also does > > an "all contexts" INVEPT, i.e. flushes all TLB entries for KVM's MMU. > > > > Can you trying 1:1 pinning of vCPUs to pCPUs? That _should_ eliminate the > > vmx_vcpu_load_vmcs() hotspot, and for large VMs is likely represenative of a real > > world configuration. > > Yes it does went away: > > https://drive.google.com/file/d/1ZFhWnWjoU33Lxy43jTYnKFuluo4zZArm/view > > With pinning vcpu threads only (again, over 40 hard cores/threads): > > ./demand_paging_test -b 512M -u MINOR -s shmem -v 32 -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32 > > It seems to me for some reason the scheduler ate more than I expected.. > Maybe tomorrow I can try two more things: > > - Do cpu isolations, and > - pin reader threads too (or just leave the readers on housekeeping cores) I gave it a shot by isolating 32 cores and split into two groups, 16 for uffd threads and 16 for vcpu threads. I got similiar results and I don't see much changed. I think it's possible it's just reaching the limit of my host since it only got 40 cores anyway. Throughput never hits over 350K faults/sec overall. I assume this might not be the case for Anish if he has a much larger host. So we can have similar test carried out and see how that goes. I think the idea is making sure vcpu load overhead during sched-in is ruled out, then see whether it can keep scaling with more cores.
On Thu, May 4, 2023 at 12:09 PM Peter Xu <peterx@redhat.com> wrote: > > On Wed, May 03, 2023 at 07:45:28PM -0400, Peter Xu wrote: > > On Wed, May 03, 2023 at 02:42:35PM -0700, Sean Christopherson wrote: > > > On Wed, May 03, 2023, Peter Xu wrote: > > > > Oops, bounced back from the list.. > > > > > > > > Forward with no attachment this time - I assume the information is still > > > > enough in the paragraphs even without the flamegraphs. > > > > > > The flamegraphs are definitely useful beyond what is captured here. Not sure > > > how to get them accepted on the list though. > > > > Trying again with google drive: > > > > single uffd: > > https://drive.google.com/file/d/1bYVYefIRRkW8oViRbYv_HyX5Zf81p3Jl/view > > > > 32 uffds: > > https://drive.google.com/file/d/1T19yTEKKhbjU9G2FpANIvArSC61mqqtp/view > > > > > > > > > > From what I got there, vmx_vcpu_load() gets more highlights than the > > > > > spinlocks. I think that's the tlb flush broadcast. > > > > > > No, it's KVM dealing with the vCPU being migrated to a different pCPU. The > > > smp_call_function_single() that shows up is from loaded_vmcs_clear() and is > > > triggered when KVM needs to VMCLEAR the VMCS on the _previous_ pCPU (yay for the > > > VMCS caches not being coherent). > > > > > > Task migration can also trigger IBPB (if mitigations are enabled), and also does > > > an "all contexts" INVEPT, i.e. flushes all TLB entries for KVM's MMU. > > > > > > Can you trying 1:1 pinning of vCPUs to pCPUs? That _should_ eliminate the > > > vmx_vcpu_load_vmcs() hotspot, and for large VMs is likely represenative of a real > > > world configuration. > > > > Yes it does went away: > > > > https://drive.google.com/file/d/1ZFhWnWjoU33Lxy43jTYnKFuluo4zZArm/view > > > > With pinning vcpu threads only (again, over 40 hard cores/threads): > > > > ./demand_paging_test -b 512M -u MINOR -s shmem -v 32 -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32 > > > > It seems to me for some reason the scheduler ate more than I expected.. > > Maybe tomorrow I can try two more things: I pulled in your patch adding the -c flag, and confirmed that it doesn't seem to make a huge difference to the self test's numbers/scalability. The percpu paging rate actually seems a bit lower, going 117-103-77-55-18-9k for 1-32 vcpus > > - Do cpu isolations, and > > - pin reader threads too (or just leave the readers on housekeeping cores) > > I gave it a shot by isolating 32 cores and split into two groups, 16 for > uffd threads and 16 for vcpu threads. I got similiar results and I don't > see much changed. > > I think it's possible it's just reaching the limit of my host since it only > got 40 cores anyway. Throughput never hits over 350K faults/sec overall. > > I assume this might not be the case for Anish if he has a much larger host. > So we can have similar test carried out and see how that goes. I think the > idea is making sure vcpu load overhead during sched-in is ruled out, then > see whether it can keep scaling with more cores. Peter, I'm afraid that isolating cores and splitting them into groups is new to me. Do you mind explaining exactly what you did here? Also, I finally got some of my own perf traces for the self test: [1] shows what happens with 32 vCPUs faulting on a single uffd with 32 reader threads, with the contention clearly being a huge issue, and [2] shows the effect of demand paging through memory faults on that configuration. 
Unfortunately the export-to-svg functionality on our internal tool seems broken, so I could only grab pngs :( [1] https://drive.google.com/file/d/1YWiZTjb2FPmqj0tkbk4cuH0Oq8l65nsU/view?usp=drivesdk [2] https://drive.google.com/file/d/1P76_6SSAHpLxNgDAErSwRmXBLkuDeFoA/view?usp=drivesdk
> On May 3, 2023, at 4:45 PM, Peter Xu <peterx@redhat.com> wrote: > > On Wed, May 03, 2023 at 02:42:35PM -0700, Sean Christopherson wrote: >> On Wed, May 03, 2023, Peter Xu wrote: >>> Oops, bounced back from the list.. >>> >>> Forward with no attachment this time - I assume the information is still >>> enough in the paragraphs even without the flamegraphs. >> >> The flamegraphs are definitely useful beyond what is captured here. Not sure >> how to get them accepted on the list though. > > Trying again with google drive: > > single uffd: > https://drive.google.com/file/d/1bYVYefIRRkW8oViRbYv_HyX5Zf81p3Jl/view > > 32 uffds: > https://drive.google.com/file/d/1T19yTEKKhbjU9G2FpANIvArSC61mqqtp/view > >> >>>> From what I got there, vmx_vcpu_load() gets more highlights than the >>>> spinlocks. I think that's the tlb flush broadcast. >> >> No, it's KVM dealing with the vCPU being migrated to a different pCPU. The >> smp_call_function_single() that shows up is from loaded_vmcs_clear() and is >> triggered when KVM needs to VMCLEAR the VMCS on the _previous_ pCPU (yay for the >> VMCS caches not being coherent). >> >> Task migration can also trigger IBPB (if mitigations are enabled), and also does >> an "all contexts" INVEPT, i.e. flushes all TLB entries for KVM's MMU. >> >> Can you trying 1:1 pinning of vCPUs to pCPUs? That _should_ eliminate the >> vmx_vcpu_load_vmcs() hotspot, and for large VMs is likely represenative of a real >> world configuration. > > Yes it does went away: > > https://drive.google.com/file/d/1ZFhWnWjoU33Lxy43jTYnKFuluo4zZArm/view > > With pinning vcpu threads only (again, over 40 hard cores/threads): > > ./demand_paging_test -b 512M -u MINOR -s shmem -v 32 -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32 > > It seems to me for some reason the scheduler ate more than I expected.. > Maybe tomorrow I can try two more things: > > - Do cpu isolations, and > - pin reader threads too (or just leave the readers on housekeeping cores) For the record (and I hope I do not repeat myself): these scheduler overheads is something that I have encountered before. The two main solutions I tried were: 1. Optional polling on the faulting thread to avoid context switch on the faulting thread. (something like https://lore.kernel.org/linux-mm/20201129004548.1619714-6-namit@vmware.com/ ) and 2. IO-uring to avoid context switch on the handler thread. In addition, as I mentioned before, the queue locks is something that can be simplified.
Hi, Nadav, On Fri, May 05, 2023 at 01:05:02PM -0700, Nadav Amit wrote: > > ./demand_paging_test -b 512M -u MINOR -s shmem -v 32 -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32 > > > > It seems to me for some reason the scheduler ate more than I expected.. > > Maybe tomorrow I can try two more things: > > > > - Do cpu isolations, and > > - pin reader threads too (or just leave the readers on housekeeping cores) > > For the record (and I hope I do not repeat myself): these scheduler overheads > is something that I have encountered before. > > The two main solutions I tried were: > > 1. Optional polling on the faulting thread to avoid context switch on the > faulting thread. > > (something like https://lore.kernel.org/linux-mm/20201129004548.1619714-6-namit@vmware.com/ ) > > and > > 2. IO-uring to avoid context switch on the handler thread. > > In addition, as I mentioned before, the queue locks is something that can be > simplified. Right, thanks for double checking on that. Though do you think these are two separate issues to be looked into? One thing on reducing context switch overhead with a static configuration, which I think is what can be resolved by what you mentioned above, and the iouring series. One thing on the possibility of scaling userfaultfd over splitting guest memory into a few chunks (literally demand paging test with no -a). Logically I think it should scale if with pcpu pinning on vcpu threads to avoid kvm bottlenecks around. Side note: IIUC none of above will resolve the problem right now if we assume we can only have 1 uffd to register to the guest mem. However I'm curious on testing multi-uffd because I wanted to make sure there's no other thing stops the whole system from scaling with threads, hence I'd expect to get higher fault/sec overall if we increase the cores we use in the test. If it already cannot scale for whatever reason then it means a generic solution may not be possible at least for kvm use case. While OTOH if multi-uffd can scale well, then there's a chance of general solution as long as we can remove the single-queue contention over the whole guest mem. PS: Nadav, I think you mentioned twice on avoiding taking two locks for the fault queue, which sounds reasonable. Do you have plan to post a patch? Thanks,
On Fri, May 05, 2023 at 11:32:11AM -0700, Anish Moorthy wrote: > Peter, I'm afraid that isolating cores and splitting them into groups > is new to me. Do you mind explaining exactly what you did here? So far I think the most important pinning is the vcpu thread pinning, we should test always with that in this case to avoid the vcpu load overhead not scaling with cores/vcpus. What I did was (1) isolate cores (using isolcpus=xxx), then (2) manually pinning the userfault threads to some other isolated cores. But maybe this is not needed. > > Also, I finally got some of my own perf traces for the self test: [1] > shows what happens with 32 vCPUs faulting on a single uffd with 32 > reader threads, with the contention clearly being a huge issue, and > [2] shows the effect of demand paging through memory faults on that > configuration. Unfortunately the export-to-svg functionality on our > internal tool seems broken, so I could only grab pngs :( > > [1] https://drive.google.com/file/d/1YWiZTjb2FPmqj0tkbk4cuH0Oq8l65nsU/view?usp=drivesdk > [2] https://drive.google.com/file/d/1P76_6SSAHpLxNgDAErSwRmXBLkuDeFoA/view?usp=drivesdk Understood. What I tested was without -a so it's using >1 uffds. I explained why I think it could be useful to test this in my reply to Nadav, do you think it makes sense to you? e.g. compare (1) 32 vcpus + 32 uffd threads and (2) 64 vcpus + 64 uffd threads, again we need to make sure vcpu threads are pinned using -c this time. It'll be nice to pin the uffd threads too but I'm not sure whether it'll make a huge difference. Thanks,
On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote: > > I explained why I think it could be useful to test this in my reply to > Nadav, do you think it makes sense to you? Ah, I actually missed your reply to Nadav: didn't realize you had sent *two* emails. > While OTOH if multi-uffd can scale well, then there's a chance of > general solution as long as we can remove the single-queue > contention over the whole guest mem. I don't quite understand your statement here: if we pursue multi-uffd, then it seems to me that by definition we've removed the single queue(s) for all of guest memory, and thus the associated contention. And we'd still have the issue of multiple vCPUs contending for a single UFFD. But I do share some of your curiosity about multi-uffd performance, especially since some of my earlier numbers indicated that multi-uffd doesn't scale linearly, even when each vCPU corresponds to a single UFFD. So, I grabbed some more profiles for 32 and 64 vcpus using the following command ./demand_paging_test -b 512M -u MINOR -s shmem -v <n> -r 1 -c <1,...,n> The 32-vcpu config achieves a per-vcpu paging rate of 8.8k. That rate goes down to 3.9k (!) with 64 vCPUs. I don't immediately see the issue from the traces, but safe to say it's definitely not scaling. Since I applied your fixes from earlier, the prefaulting isn't being counted against the demand paging rate either. 32-vcpu profile: https://drive.google.com/file/d/19ZZDxZArhSsbW_5u5VcmLT48osHlO9TG/view?usp=drivesdk 64-vcpu profile: https://drive.google.com/file/d/1dyLOLVHRNdkUoFFr7gxqtoSZGn1_GqmS/view?usp=drivesdk Do let me know if you need svg files instead and I'll try and figure that out.
On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote: > Upon receiving an annotated EFAULT, userspace may take appropriate > action to resolve the failed access. For instance, this might involve a > UFFDIO_CONTINUE or MADV_POPULATE_WRITE in the context of uffd-based live > migration postcopy. As implemented, I think it will be prohibitively expensive if not impossible for userspace to determine why KVM is returning EFAULT when KVM_CAP_ABSENT_MAPPING_FAULT is enabled, which means userspace can't decide the correct action to take (try to resolve or bail). Consider the direct_map() case in PATCH 15. The only way to hit that condition is a logic bug in KVM or data corruption. There isn't really anything userspace can do to handle this situation, and it has no way to distinguish that from faults due to absent mappings. We could end up hitting cases where userspace loops forever doing KVM_RUN, EFAULT, UFFDIO_CONTINUE/MADV_POPULATE_WRITE, KVM_RUN, EFAULT... Maybe we should just change direct_map() to use KVM_BUG() and return something other than EFAULT. But the general problem still exists and even if we have confidence in all the current EFAULT sites, we don't have much protection against someone adding an EFAULT in the future that userspace can't handle.
On Tue, May 9, 2023 at 3:19 PM David Matlack <dmatlack@google.com> wrote: > > On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote: > > Upon receiving an annotated EFAULT, userspace may take appropriate > > action to resolve the failed access. For instance, this might involve a > > UFFDIO_CONTINUE or MADV_POPULATE_WRITE in the context of uffd-based live > > migration postcopy. > > As implemented, I think it will be prohibitively expensive if not > impossible for userspace to determine why KVM is returning EFAULT when > KVM_CAP_ABSENT_MAPPING_FAULT is enabled, which means userspace can't > decide the correct action to take (try to resolve or bail). > > Consider the direct_map() case in patch in PATCH 15. The only way to hit > that condition is a logic bug in KVM or data corruption. There isn't > really anything userspace can do to handle this situation, and it has no > way to distinguish that from faults to due absent mappings. > > We could end up hitting cases where userspace loops forever doing > KVM_RUN, EFAULT, UFFDIO_CONTINUE/MADV_POPULATE_WRITE, KVM_RUN, EFAULT... > > Maybe we should just change direct_map() to use KVM_BUG() and return > something other than EFAULT. But the general problem still exists and > even if we have confidence in all the current EFAULT sites, we don't have > much protection against someone adding an EFAULT in the future that > userspace can't handle. Hmm, I had been operating under the assumption that userspace would always have been able to make the memory access succeed somehow- I (naively) didn't count on some guest memory access errors being unrecoverable. If that's the case, then we're back to needing some way to distinguish the new faults/exits emitted by user_mem_abort/kvm_faultin_pfn with the ABSENT_MAPPING_FAULT cap enabled :/ Let me paste in a bit of what Sean said to refute the idea of a special page-fault-failure set in those spots. (from https://lore.kernel.org/kvm/ZBoIzo8FGxSyUJ2I@google.com/) On Tue, Mar 21, 2023 at 12:43 PM Sean Christopherson <seanjc@google.com> wrote: > > Setting a flag that essentially says "failure when handling a guest page fault" > is problematic on multiple fronts. Tying the ABI to KVM's internal implementation > is not an option, i.e. the ABI would need to be defined as "on page faults from > the guest". And then the resulting behavior would be non-deterministic, e.g. > userspace would see different behavior if KVM accessed a "bad" gfn via emulation > instead of in response to a guest page fault. And because of hardware TLBs, it > would even be possible for the behavior to be non-deterministic on the same > platform running the same guest code (though this would be exteremly unliklely > in practice). > > And even if userspace is ok with only handling guest page faults_today_, I highly > doubt that will hold forever. I.e. at some point there will be a use case that > wants to react to uaccess failures on fast-only memslots. > > Ignoring all of those issues, simplify flagging "this -EFAULT occurred when > handling a guest page fault" isn't precise enough for userspace to blindly resolve > the failure. Even if KVM went through the trouble of setting information if and > only if get_user_page_fast_only() failed while handling a guest page fault, > userspace would still need/want a way to verify that the failure was expected and > can be resolved, e.g. to guard against userspace bugs due to wrongly unmapping > or mprotecting a page. 
I wonder, how much of this problem comes down to my description/name (I suggested MEMFAULT_REASON_PAGE_FAULT_FAILURE) for the flag? I see Sean's concerns of the behavior issues when fast-only pages are accessed via guest mode or via emulation/uaccess. What if the description of the fast-only fault cap was tightened to something like "generates vcpu faults/exits in response to *EPT/SLAT violations* which cannot be mapped by present userspace page table entries?" I think that would eliminate the emulation/uaccess issues (though I may be wrong, so please let me know). Of course, by the time we get to kvm_faultin_pfn we don't know that we're faulting pages in response to an EPT violation... but if the idea makes sense then that might justify some plumbing code.
Hi, Anish, On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote: > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote: > > > > I explained why I think it could be useful to test this in my reply to > > Nadav, do you think it makes sense to you? > > Ah, I actually missed your reply to Nadav: didn't realize you had sent > *two* emails. > > > While OTOH if multi-uffd can scale well, then there's a chance of > > general solution as long as we can remove the single-queue > > contention over the whole guest mem. > > I don't quite understand your statement here: if we pursue multi-uffd, > then it seems to me that by definition we've removed the single > queue(s) for all of guest memory, and thus the associated contention. > And we'd still have the issue of multiple vCPUs contending for a > single UFFD. Yes as I mentioned it's purely what I was curious and it also shows the best result we can have if we go a more generic solution; it doesn't really solve the issue immediately. > > But I do share some of your curiosity about multi-uffd performance, > especially since some of my earlier numbers indicated that multi-uffd > doesn't scale linearly, even when each vCPU corresponds to a single > UFFD. > > So, I grabbed some more profiles for 32 and 64 vcpus using the following command > ./demand_paging_test -b 512M -u MINOR -s shmem -v <n> -r 1 -c <1,...,n> > > The 32-vcpu config achieves a per-vcpu paging rate of 8.8k. That rate > goes down to 3.9k (!) with 64 vCPUs. I don't immediately see the issue > from the traces, but safe to say it's definitely not scaling. Since I > applied your fixes from earlier, the prefaulting isn't being counted > against the demand paging rate either. > > 32-vcpu profile: > https://drive.google.com/file/d/19ZZDxZArhSsbW_5u5VcmLT48osHlO9TG/view?usp=drivesdk > 64-vcpu profile: > https://drive.google.com/file/d/1dyLOLVHRNdkUoFFr7gxqtoSZGn1_GqmS/view?usp=drivesdk > > Do let me know if you need svg files instead and I'll try and figure that out. Thanks for trying all these out, and sorry if I caused confusion in my reply. What I wanted to do is to understand whether there's still chance to provide a generic solution. I don't know why you have had a bunch of pmu stack showing in the graph, perhaps you forgot to disable some of the perf events when doing the test? Let me know if you figure out why it happened like that (so far I didn't see), but I feel guilty to keep overloading you with such questions. The major problem I had with this series is it's definitely not a clean approach. Say, even if you'll all rely on userapp you'll still need to rely on userfaultfd for kernel traps on corner cases or it just won't work. IIUC that's also the concern from Nadav. But I also agree it seems to resolve every bottleneck in the kernel no matter whether it's in scheduler or vcpu loading. After all you throw everything into userspace.. :) Considering that most of the changes are for -EFAULT traps and the 2nd part change is very self contained and maintainable, no objection here to have it. I'll leave that to the maintainers to decide. Thanks,
On Tue, May 09, 2023, David Matlack wrote: > On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote: > > Upon receiving an annotated EFAULT, userspace may take appropriate > > action to resolve the failed access. For instance, this might involve a > > UFFDIO_CONTINUE or MADV_POPULATE_WRITE in the context of uffd-based live > > migration postcopy. > > As implemented, I think it will be prohibitively expensive if not > impossible for userspace to determine why KVM is returning EFAULT when > KVM_CAP_ABSENT_MAPPING_FAULT is enabled, which means userspace can't > decide the correct action to take (try to resolve or bail). > > Consider the direct_map() case in patch in PATCH 15. The only way to hit > that condition is a logic bug in KVM or data corruption. There isn't > really anything userspace can do to handle this situation, and it has no > way to distinguish that from faults to due absent mappings. > > We could end up hitting cases where userspace loops forever doing > KVM_RUN, EFAULT, UFFDIO_CONTINUE/MADV_POPULATE_WRITE, KVM_RUN, EFAULT... > > Maybe we should just change direct_map() to use KVM_BUG() and return > something other than EFAULT. But the general problem still exists and > even if we have confidence in all the current EFAULT sites, we don't have > much protection against someone adding an EFAULT in the future that > userspace can't handle. Yeah, when I speed read the series, several of the conversions stood out as being "wrong". My (potentially unstated) idea was that KVM would only signal KVM_EXIT_MEMORY_FAULT when the -EFAULT could be traced back to a user access, i.e. when the fault _might_ be resolvable by userspace. If we want to populate KVM_EXIT_MEMORY_FAULT even on kernel bugs, and anything else that userspace can't possibly resolve, then the easiest thing would be to add a flag to signal that the fault is fatal, i.e. that userspace shouldn't retry. Adding a flag may be more robust in the long term as it will force developers to think about whether or not a fault is fatal, versus relying on documentation to say "don't signal KVM_EXIT_MEMORY_FAULT for fatal EFAULT conditions". Side topic, KVM x86 really should have a version of KVM_SYNC_X86_REGS that stores registers for userspace, but doesn't load registers. That would allow userspace to detect many infinite loops with minimal overhead, e.g. (1) set KVM_STORE_X86_REGS during demand paging, (2) check RIP on every exit to see if the vCPU is making forward progress, (3) escalate to checking all registers if RIP hasn't changed for N exits, and finally (4) take action if the guest is well and truly stuck after N more exits. KVM could even store RIP on every exit if userspace wanted to avoid the overhead of storing registers until userspace actually wants all registers.
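As a rough illustration of that loop-detection idea, and assuming nothing beyond today's KVM_GET_REGS ioctl (the store-only sync-regs variant suggested above would avoid this extra ioctl per exit), userspace could track forward progress along these lines; the threshold is arbitrary:

/*
 * Hypothetical forward-progress check a VMM could run on each exit while
 * demand paging is active. RIP alone can give false positives (a guest can
 * legitimately keep exiting at the same instruction), so a real
 * implementation would escalate to comparing all registers, as suggested
 * above, before declaring the guest stuck.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define RIP_STALL_LIMIT 64      /* arbitrary number of exits at the same RIP */

struct progress_tracker {
        uint64_t last_rip;
        unsigned int stalled_exits;
};

/* Returns non-zero once the vCPU has stopped making apparent progress. */
static int vcpu_looks_stuck(int vcpu_fd, struct progress_tracker *pt)
{
        struct kvm_regs regs;

        if (ioctl(vcpu_fd, KVM_GET_REGS, &regs) < 0)
                return 0;                       /* can't tell; don't escalate */

        if (regs.rip != pt->last_rip) {
                pt->last_rip = regs.rip;
                pt->stalled_exits = 0;
                return 0;
        }
        return ++pt->stalled_exits >= RIP_STALL_LIMIT;
}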
On Wed, May 10, 2023 at 3:35 PM Sean Christopherson <seanjc@google.com> wrote: > > Yeah, when I speed read the series, several of the conversions stood out as being > "wrong". My (potentially unstated) idea was that KVM would only signal > KVM_EXIT_MEMORY_FAULT when the -EFAULT could be traced back to a user access, > i.e. when the fault _might_ be resolvable by userspace. Well, you definitely tried to get the idea across somehow: even in my cover letter here, I state > As a first step, KVM_CAP_MEMORY_FAULT_INFO is introduced. This > capability is meant to deliver useful information to userspace (i.e. the > problematic range of guest physical memory) when a vCPU fails a guest > memory access. So the fact that I'm doing something more here is unintentional and stems from unfamiliarity with all of the ways in which KVM does (or does not) perform user accesses. Sean, besides direct_map, which other patches did you notice as needing to be dropped/marked as unrecoverable errors?
On Wed, May 10, 2023 at 2:50 PM Peter Xu <peterx@redhat.com> wrote: > On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote: > > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote: > > What I wanted to do is to understand whether there's still chance to > provide a generic solution. I don't know why you have had a bunch of pmu > stack showing in the graph, perhaps you forgot to disable some of the perf > events when doing the test? Let me know if you figure out why it happened > like that (so far I didn't see), but I feel guilty to keep overloading you > with such questions. > > The major problem I had with this series is it's definitely not a clean > approach. Say, even if you'll all rely on userapp you'll still need to > rely on userfaultfd for kernel traps on corner cases or it just won't work. > IIUC that's also the concern from Nadav. This is a long thread, so apologies if the following has already been discussed. Would per-tid userfaultfd support be a generic solution? i.e. Allow userspace to create a userfaultfd that is tied to a specific task. Any userfaults encountered by that task use that fd, rather than the process-wide fd. I'm making the assumption here that each of these fds would have independent signaling mechanisms/queues and so this would solve the scaling problem. A VMM could use this to create 1 userfaultfd per vCPU and 1 thread per vCPU for handling userfault requests. This seems like it'd have roughly the same scalability characteristics as the KVM -EFAULT approach.
On Thu, May 11, 2023 at 10:18 AM David Matlack <dmatlack@google.com> wrote: > > On Wed, May 10, 2023 at 2:50 PM Peter Xu <peterx@redhat.com> wrote: > > On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote: > > > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote: > > > > What I wanted to do is to understand whether there's still chance to > > provide a generic solution. I don't know why you have had a bunch of pmu > > stack showing in the graph, perhaps you forgot to disable some of the perf > > events when doing the test? Let me know if you figure out why it happened > > like that (so far I didn't see), but I feel guilty to keep overloading you > > with such questions. > > > > The major problem I had with this series is it's definitely not a clean > > approach. Say, even if you'll all rely on userapp you'll still need to > > rely on userfaultfd for kernel traps on corner cases or it just won't work. > > IIUC that's also the concern from Nadav. > > This is a long thread, so apologies if the following has already been discussed. > > Would per-tid userfaultfd support be a generic solution? i.e. Allow > userspace to create a userfaultfd that is tied to a specific task. Any > userfaults encountered by that task use that fd, rather than the > process-wide fd. I'm making the assumption here that each of these fds > would have independent signaling mechanisms/queues and so this would > solve the scaling problem. > > A VMM could use this to create 1 userfaultfd per vCPU and 1 thread per > vCPU for handling userfault requests. This seems like it'd have > roughly the same scalability characteristics as the KVM -EFAULT > approach. I think this would work in principle, but it's significantly different from what exists today. The splitting of userfaultfds Peter is describing is splitting up the HVA address space, not splitting per-thread. I think for this design, we'd need to change UFFD registration so multiple UFFDs can register the same VMA, but can be filtered so they only receive fault events caused by some particular tid(s). This might also incur some (small?) overhead, because in the fault path we now need to maintain some data structure so we can lookup which UFFD to notify based on a combination of the address and our tid. Today, since VMAs and UFFDs are 1:1 this lookup is trivial. I think it's worth keeping in mind that a selling point of Anish's approach is that it's a very small change. It's plausible we can come up with some alternative way to scale, but it seems to me everything suggested so far is likely to require a lot more code, complexity, and effort vs. Anish's approach.
On Thu, May 11, 2023 at 10:33:24AM -0700, Axel Rasmussen wrote: > On Thu, May 11, 2023 at 10:18 AM David Matlack <dmatlack@google.com> wrote: > > > > On Wed, May 10, 2023 at 2:50 PM Peter Xu <peterx@redhat.com> wrote: > > > On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote: > > > > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > What I wanted to do is to understand whether there's still chance to > > > provide a generic solution. I don't know why you have had a bunch of pmu > > > stack showing in the graph, perhaps you forgot to disable some of the perf > > > events when doing the test? Let me know if you figure out why it happened > > > like that (so far I didn't see), but I feel guilty to keep overloading you > > > with such questions. > > > > > > The major problem I had with this series is it's definitely not a clean > > > approach. Say, even if you'll all rely on userapp you'll still need to > > > rely on userfaultfd for kernel traps on corner cases or it just won't work. > > > IIUC that's also the concern from Nadav. > > > > This is a long thread, so apologies if the following has already been discussed. > > > > Would per-tid userfaultfd support be a generic solution? i.e. Allow > > userspace to create a userfaultfd that is tied to a specific task. Any > > userfaults encountered by that task use that fd, rather than the > > process-wide fd. I'm making the assumption here that each of these fds > > would have independent signaling mechanisms/queues and so this would > > solve the scaling problem. > > > > A VMM could use this to create 1 userfaultfd per vCPU and 1 thread per > > vCPU for handling userfault requests. This seems like it'd have > > roughly the same scalability characteristics as the KVM -EFAULT > > approach. > > I think this would work in principle, but it's significantly different > from what exists today. > > The splitting of userfaultfds Peter is describing is splitting up the > HVA address space, not splitting per-thread. > > I think for this design, we'd need to change UFFD registration so > multiple UFFDs can register the same VMA, but can be filtered so they > only receive fault events caused by some particular tid(s). > > This might also incur some (small?) overhead, because in the fault > path we now need to maintain some data structure so we can lookup > which UFFD to notify based on a combination of the address and our > tid. Today, since VMAs and UFFDs are 1:1 this lookup is trivial. I was (perhaps naively) assuming the lookup would be as simple as:

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 44d1ee429eb0..e9856e2ba9ef 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -417,7 +417,10 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	 */
 	mmap_assert_locked(mm);
 
-	ctx = vma->vm_userfaultfd_ctx.ctx;
+	if (current->userfaultfd_ctx)
+		ctx = current->userfaultfd_ctx;
+	else
+		ctx = vma->vm_userfaultfd_ctx.ctx;
 	if (!ctx)
 		goto out;

> > I think it's worth keeping in mind that a selling point of Anish's > approach is that it's a very small change. It's plausible we can come > up with some alternative way to scale, but it seems to me everything > suggested so far is likely to require a lot more code, complexity, and > effort vs. Anish's approach. Agreed. Mostly I think the per-thread UFFD approach would add complexity on the userspace side of things. With Anish's approach userspace is able to trivially re-use the vCPU thread (and its associated pCPU if pinned) to handle the request.
That gets more complicated when juggling the extra paired threads. The per-thread approach would require a new userfault UAPI change, which I think is a higher bar than the KVM UAPI change proposed here. The per-thread approach would require KVM to call into slow GUP and take the mmap_lock before contacting userspace. I'm not 100% convinced that's a bad thing long term (e.g. it avoids the false-positive -EFAULT exits in Anish's proposal), but it could have performance implications. Lastly, inter-thread communication is likely slower than returning to userspace from KVM_RUN. So the per-thread approach might increase the end-to-end latency of demand fetches.
On Thu, May 11, 2023 at 12:05 PM David Matlack <dmatlack@google.com> wrote: > > On Thu, May 11, 2023 at 10:33:24AM -0700, Axel Rasmussen wrote: > > On Thu, May 11, 2023 at 10:18 AM David Matlack <dmatlack@google.com> wrote: > > > > > > On Wed, May 10, 2023 at 2:50 PM Peter Xu <peterx@redhat.com> wrote: > > > > On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote: > > > > > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote: > > > > > > > > What I wanted to do is to understand whether there's still chance to > > > > provide a generic solution. I don't know why you have had a bunch of pmu > > > > stack showing in the graph, perhaps you forgot to disable some of the perf > > > > events when doing the test? Let me know if you figure out why it happened > > > > like that (so far I didn't see), but I feel guilty to keep overloading you > > > > with such questions. > > > > > > > > The major problem I had with this series is it's definitely not a clean > > > > approach. Say, even if you'll all rely on userapp you'll still need to > > > > rely on userfaultfd for kernel traps on corner cases or it just won't work. > > > > IIUC that's also the concern from Nadav. > > > > > > This is a long thread, so apologies if the following has already been discussed. > > > > > > Would per-tid userfaultfd support be a generic solution? i.e. Allow > > > userspace to create a userfaultfd that is tied to a specific task. Any > > > userfaults encountered by that task use that fd, rather than the > > > process-wide fd. I'm making the assumption here that each of these fds > > > would have independent signaling mechanisms/queues and so this would > > > solve the scaling problem. > > > > > > A VMM could use this to create 1 userfaultfd per vCPU and 1 thread per > > > vCPU for handling userfault requests. This seems like it'd have > > > roughly the same scalability characteristics as the KVM -EFAULT > > > approach. > > > > I think this would work in principle, but it's significantly different > > from what exists today. > > > > The splitting of userfaultfds Peter is describing is splitting up the > > HVA address space, not splitting per-thread. > > > > I think for this design, we'd need to change UFFD registration so > > multiple UFFDs can register the same VMA, but can be filtered so they > > only receive fault events caused by some particular tid(s). > > > > This might also incur some (small?) overhead, because in the fault > > path we now need to maintain some data structure so we can lookup > > which UFFD to notify based on a combination of the address and our > > tid. Today, since VMAs and UFFDs are 1:1 this lookup is trivial. > > I was (perhaps naively) assuming the lookup would be as simple as: > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > index 44d1ee429eb0..e9856e2ba9ef 100644 > --- a/fs/userfaultfd.c > +++ b/fs/userfaultfd.c > @@ -417,7 +417,10 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason) > */ > mmap_assert_locked(mm); > > - ctx = vma->vm_userfaultfd_ctx.ctx; > + if (current->userfaultfd_ctx) > + ctx = current->userfaultfd_ctx; > + else > + ctx = vma->vm_userfaultfd_ctx.ctx; > if (!ctx) > goto out; Hmm, perhaps. It might have to be more complicated if we want to allow a single task to have both per-TID UFFDs for some addresses, and "global" UFFDs for others. Actually, while thinking about this, another wrinkle: Imagine we have per-thread UFFDs. 
Thread X faults on some address, and goes to sleep waiting for its paired resolver thread to resolve the fault. In the meantime, thread Y also faults on the same address, before the resolution happens. In the existing model, there is a single UFFD context per VMA, and therefore a single wait queue for all threads to wait on. In the per-TID-UFFD design, now each thread has its own context, and ostensibly its own wait queue (since the wait queue locks are where Anish saw the contention, I think this is exactly what we want to split up). When we have this "multiple threads waiting on the same address" situation, how do we ensure the fault is resolved exactly once? And how do we wake up all of the sleeping threads when it is resolved? I'm sure it's solvable, but especially doing it without any locks / contention seems like it could be a bit complicated. > > > > > I think it's worth keeping in mind that a selling point of Anish's > > approach is that it's a very small change. It's plausible we can come > > up with some alternative way to scale, but it seems to me everything > > suggested so far is likely to require a lot more code, complexity, and > > effort vs. Anish's approach. > > Agreed. > > Mostly I think the per-thread UFFD approach would add complexity on the > userspace side of things. With Anish's approach userspace is able to > trivially re-use the vCPU thread (and it's associated pCPU if pinned) to > handle the request. That gets more complicated when juggling the extra > paired threads. > > The per-thread approach would requires a new userfault UAPI change which > I think is a higher bar than the KVM UAPI change proposed here. > > The per-thread approach would require KVM call into slow GUP and take > the mmap_lock before contacting userspace. I'm not 100% convinced that's > a bad thing long term (e.g. it avoids the false-positive -EFAULT exits > in Anish's proposal), but could have performance implications. > > Lastly, inter-thread communication is likely slower than returning to > userspace from KVM_RUN. So the per-thread approach might increase the > end-to-end latency of demand fetches.
On Thu, May 11, 2023 at 10:33:24AM -0700, Axel Rasmussen wrote:
> On Thu, May 11, 2023 at 10:18 AM David Matlack <dmatlack@google.com> wrote:
> >
> > On Wed, May 10, 2023 at 2:50 PM Peter Xu <peterx@redhat.com> wrote:
> > > On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote:
> > > > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > What I wanted to do is to understand whether there's still chance to provide a generic solution. I don't know why you have had a bunch of pmu stack showing in the graph, perhaps you forgot to disable some of the perf events when doing the test? Let me know if you figure out why it happened like that (so far I didn't see), but I feel guilty to keep overloading you with such questions.
> > >
> > > The major problem I had with this series is it's definitely not a clean approach. Say, even if you'll all rely on userapp you'll still need to rely on userfaultfd for kernel traps on corner cases or it just won't work. IIUC that's also the concern from Nadav.
> > This is a long thread, so apologies if the following has already been discussed.
> >
> > Would per-tid userfaultfd support be a generic solution? i.e. Allow userspace to create a userfaultfd that is tied to a specific task. Any userfaults encountered by that task use that fd, rather than the process-wide fd. I'm making the assumption here that each of these fds would have independent signaling mechanisms/queues and so this would solve the scaling problem.
> >
> > A VMM could use this to create 1 userfaultfd per vCPU and 1 thread per vCPU for handling userfault requests. This seems like it'd have roughly the same scalability characteristics as the KVM -EFAULT approach.
>
> I think this would work in principle, but it's significantly different from what exists today.
>
> The splitting of userfaultfds Peter is describing is splitting up the HVA address space, not splitting per-thread.

[sorry mostly travel last week]

No, my idea was actually a split per thread, but since there's currently no way to split per thread, I was thinking we should start testing with a split per vma so it "emulates" the best we can have out of a split per thread.

> I think for this design, we'd need to change UFFD registration so multiple UFFDs can register the same VMA, but can be filtered so they only receive fault events caused by some particular tid(s).

Having multiple real uffds per vma is challenging: as you mentioned, enqueuing may be more of an effort, and meanwhile it's hard to know what the attributes of the uffd over this vma are, because each uffd has its own feature list. Here what we may need is only the "logical queue" of the uffd.

So I was considering supporting multi-queue for a _single_ userfaultfd. I actually mentioned some of it in the very initial reply to Anish:

https://lore.kernel.org/all/ZEGuogfbtxPNUq7t@x1n/

  If the real problem lies in a bunch of threads queuing, is it possible that we can provide just more queues for the events? The readers will just need to go over all the queues.

How to decide "which thread uses which queue" can be another problem; what comes up quickly to me is "hash(tid) % n_queues", but maybe it can be better. Each vcpu thread will have a different tid, so they can hopefully scale across the queues.

The queues may need to be created as sub-uffds, each only supporting part of the uffd interfaces (read/poll, COPY/CONTINUE/ZEROPAGE) but not all (e.g. UFFDIO_API shouldn't be supported there).

> This might also incur some (small?) overhead, because in the fault path we now need to maintain some data structure so we can lookup which UFFD to notify based on a combination of the address and our tid. Today, since VMAs and UFFDs are 1:1 this lookup is trivial. I think it's worth keeping in mind that a selling point of Anish's approach is that it's a very small change. It's plausible we can come up with some alternative way to scale, but it seems to me everything suggested so far is likely to require a lot more code, complexity, and effort vs. Anish's approach.

Yes, I think that's also the reason why I thought I overloaded too much on this work. If Anish eagerly wants that and makes it useful, then I'm totally fine, because maintaining the 2nd cap seems trivial assuming the maintainer would already accept the 1st cap. I just hope it'll be thoroughly tested, even with Google's private userspace hypervisor, so that the kernel interface is solid (even if not straightforward to a new user seeing this) and will serve the goal of the problem Anish is tackling.

Thanks,
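To make the "hash(tid) % n_queues" idea concrete, here is a toy, standalone illustration of how faulting threads could be spread across sub-queues (purely hypothetical; pick_queue() is an invented name, and an in-kernel version would presumably hash current->pid with hash_32()/hash_64() at fault time):

#include <stdint.h>
#include <stdio.h>

#define N_QUEUES 16

/* Toy stand-in for the in-kernel hash of the faulting thread's tid. */
static unsigned int pick_queue(uint32_t tid)
{
        return (tid * 2654435761u) % N_QUEUES;  /* Knuth multiplicative hash */
}

int main(void)
{
        /* vCPU threads typically get consecutive tids; show how they spread. */
        for (uint32_t tid = 12340; tid < 12348; tid++)
                printf("tid %u -> queue %u\n", tid, pick_queue(tid));
        return 0;
}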
On Thu, May 11, 2023 at 12:45:32PM -0700, Axel Rasmussen wrote:
> On Thu, May 11, 2023 at 12:05 PM David Matlack <dmatlack@google.com> wrote:
> >
> > On Thu, May 11, 2023 at 10:33:24AM -0700, Axel Rasmussen wrote:
> > > On Thu, May 11, 2023 at 10:18 AM David Matlack <dmatlack@google.com> wrote:
> > > >
> > > > On Wed, May 10, 2023 at 2:50 PM Peter Xu <peterx@redhat.com> wrote:
> > > > > On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote:
> > > > > > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > What I wanted to do is to understand whether there's still chance to provide a generic solution. I don't know why you have had a bunch of pmu stack showing in the graph, perhaps you forgot to disable some of the perf events when doing the test? Let me know if you figure out why it happened like that (so far I didn't see), but I feel guilty to keep overloading you with such questions.
> > > > >
> > > > > The major problem I had with this series is it's definitely not a clean approach. Say, even if you'll all rely on userapp you'll still need to rely on userfaultfd for kernel traps on corner cases or it just won't work. IIUC that's also the concern from Nadav.
> > > > This is a long thread, so apologies if the following has already been discussed.
> > > >
> > > > Would per-tid userfaultfd support be a generic solution? i.e. Allow userspace to create a userfaultfd that is tied to a specific task. Any userfaults encountered by that task use that fd, rather than the process-wide fd. I'm making the assumption here that each of these fds would have independent signaling mechanisms/queues and so this would solve the scaling problem.
> > > >
> > > > A VMM could use this to create 1 userfaultfd per vCPU and 1 thread per vCPU for handling userfault requests. This seems like it'd have roughly the same scalability characteristics as the KVM -EFAULT approach.
> > >
> > > I think this would work in principle, but it's significantly different from what exists today.
> > >
> > > The splitting of userfaultfds Peter is describing is splitting up the HVA address space, not splitting per-thread.
> > >
> > > I think for this design, we'd need to change UFFD registration so multiple UFFDs can register the same VMA, but can be filtered so they only receive fault events caused by some particular tid(s).
> > >
> > > This might also incur some (small?) overhead, because in the fault path we now need to maintain some data structure so we can lookup which UFFD to notify based on a combination of the address and our tid. Today, since VMAs and UFFDs are 1:1 this lookup is trivial.
> >
> > I was (perhaps naively) assuming the lookup would be as simple as:
> >
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 44d1ee429eb0..e9856e2ba9ef 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -417,7 +417,10 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >  	 */
> >  	mmap_assert_locked(mm);
> >
> > -	ctx = vma->vm_userfaultfd_ctx.ctx;
> > +	if (current->userfaultfd_ctx)
> > +		ctx = current->userfaultfd_ctx;
> > +	else
> > +		ctx = vma->vm_userfaultfd_ctx.ctx;
> >  	if (!ctx)
> >  		goto out;

This is an interesting idea, but I'll just double check before I grow task_struct and see whether that's the only solution. :)

I'd start with hash(tid) or even hash(pcpu) to choose the queue. In a pinned use case hash(pcpu) should probably reach a similar goal here (and I'd guess hash(tid) would too if vcpus are mostly always created in one shot, just slightly trickier).

> Hmm, perhaps. It might have to be more complicated if we want to allow a single task to have both per-TID UFFDs for some addresses, and "global" UFFDs for others.
>
> Actually, while thinking about this, another wrinkle:
>
> Imagine we have per-thread UFFDs. Thread X faults on some address, and goes to sleep waiting for its paired resolver thread to resolve the fault.
>
> In the meantime, thread Y also faults on the same address, before the resolution happens.
>
> In the existing model, there is a single UFFD context per VMA, and therefore a single wait queue for all threads to wait on. In the per-TID-UFFD design, now each thread has its own context, and ostensibly its own wait queue (since the wait queue locks are where Anish saw the contention, I think this is exactly what we want to split up). When we have this "multiple threads waiting on the same address" situation, how do we ensure the fault is resolved exactly once? And how do we wake up all of the sleeping threads when it is resolved?

We probably need to wake them one by one in that case. The 2nd-Nth UFFDIO_COPY/CONTINUE will fail with -EEXIST anyway, and then the userapp will need a UFFDIO_WAKE, I assume.

> I'm sure it's solvable, but especially doing it without any locks / contention seems like it could be a bit complicated.

IMHO no complicated locking is needed. Here the "complicated locking" is done by the pgtable lock, and it should be reflected back to the userapp as -EEXISTs.

> > > I think it's worth keeping in mind that a selling point of Anish's approach is that it's a very small change. It's plausible we can come up with some alternative way to scale, but it seems to me everything suggested so far is likely to require a lot more code, complexity, and effort vs. Anish's approach.
> >
> > Agreed.
> >
> > Mostly I think the per-thread UFFD approach would add complexity on the userspace side of things. With Anish's approach userspace is able to trivially re-use the vCPU thread (and its associated pCPU if pinned) to handle the request. That gets more complicated when juggling the extra paired threads.
> >
> > The per-thread approach would require a new userfault UAPI change, which I think is a higher bar than the KVM UAPI change proposed here.
> >
> > The per-thread approach would require KVM to call into slow GUP and take the mmap_lock before contacting userspace. I'm not 100% convinced that's a bad thing long term (e.g. it avoids the false-positive -EFAULT exits in Anish's proposal), but it could have performance implications.
> >
> > Lastly, inter-thread communication is likely slower than returning to userspace from KVM_RUN. So the per-thread approach might increase the end-to-end latency of demand fetches.

Right. The overhead here is (IMHO):

- KVM solution: vcpu exit -> enter again; whatever happens in the procedure of exit/enter will count.

- mm solution: at least the scheduling overhead; meanwhile let's hope we can scale elsewhere first (and I'm not sure if there'll be other issues, e.g., even if we can split the uffd queues, hopefully there's nothing I overlooked that still needs the shared uffd context).

I'm not sure which one will be higher, or maybe it depends (e.g., some specific cases where vcpu KVM_RUN can have higher overhead when loading?).

Thanks,
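For reference, the userspace side of the "-EEXIST then UFFDIO_WAKE" handling described above might look roughly like the sketch below (a minimal illustration against the existing uffd API, not code from the series; resolve_minor_fault() is an invented helper and error handling is trimmed):

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Invented helper: install an already-populated page (MINOR fault) and make
 * sure any sleeping faulters get woken even if another resolver won the race. */
static int resolve_minor_fault(int uffd, unsigned long addr, unsigned long page_size)
{
        struct uffdio_continue cont = {
                .range = { .start = addr, .len = page_size },
                .mode  = 0,
        };

        if (ioctl(uffd, UFFDIO_CONTINUE, &cont) == 0)
                return 0;       /* page table entry installed, waiters woken */

        if (errno == EEXIST) {
                /* 2nd..Nth resolver: the PTE is already there, but nobody was
                 * woken on our behalf, so issue an explicit UFFDIO_WAKE. */
                struct uffdio_range range = { .start = addr, .len = page_size };

                return ioctl(uffd, UFFDIO_WAKE, &range);
        }
        return -1;
}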
On Wed, May 10, 2023 at 2:51 PM Peter Xu <peterx@redhat.com> wrote:
>
> What I wanted to do is to understand whether there's still chance to provide a generic solution. I don't know why you have had a bunch of pmu stack showing in the graph, perhaps you forgot to disable some of the perf events when doing the test? Let me know if you figure out why it happened like that (so far I didn't see), but I feel guilty to keep overloading you with such questions.

Not at all, I'm happy to help try and answer as many questions as I can. It helps me learn as well.

I'll see about revisiting these traces, but I'll be busy for the next few days with other things, so I doubt they'll come soon. I'll jump back into the mailing list sometime on Thursday/Friday.
On Wed, May 10, 2023 at 4:44 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> On Wed, May 10, 2023 at 3:35 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Yeah, when I speed read the series, several of the conversions stood out as being "wrong". My (potentially unstated) idea was that KVM would only signal KVM_EXIT_MEMORY_FAULT when the -EFAULT could be traced back to a user access, i.e. when the fault _might_ be resolvable by userspace.
>
> Sean, besides direct_map which other patches did you notice as needing to be dropped/marked as unrecoverable errors?

I tried going through on my own to try and identify the incorrect annotations: here's my read.

Correct (or can easily be corrected)
-----------------------------------------------
- user_mem_abort
  Incorrect as is: the annotations in patch 19 are incorrect, as they cover an error-on-no-slot case and one more I don't fully understand: the one in patch 20 should be good though.

- kvm_vcpu_read/write_guest_page
  Incorrect as-is, but can be fixed: the current annotations cover gpa_to_hva_memslot(_prot) failures, which can happen when "gpa" isn't covered by a memslot. However, we can leave these as bare efaults and just annotate the copy_to/from_user failures, which userspace should be able to resolve by checking/changing the slot permissions.

- kvm_handle_error_pfn
  Correct: at the annotation point, the fault must be either (a) a read/write to a writable memslot or (b) a read from a readable one. hva_to_pfn must have returned KVM_PFN_ERR_FAULT, which userspace can attempt to resolve using a MADV.

Flatly Incorrect (will drop in next version)
-----------------------------------------------
- kvm_handle_page_fault
  The efault corresponds to a kernel bug not resolvable by userspace.

- direct_map
  Same as above.

- kvm_mmu_page_fault
  Not a "leaf" return of efault. Also, the check-for-efault-and-annotate here catches efaults which userspace can do nothing about, such as the one from direct_map [1].

Unsure (Switch kvm_read/write_guest to kvm_vcpu_read/write_guest?)
-----------------------------------------------
- setup_vmgexit_scratch and kvm_pv_clock_pairing
  These efault on errors from kvm_read/write_guest, and theoretically it does seem to make sense to annotate them. However, the annotations are incorrect as is for the same reason that the kvm_vcpu_read/write_guest_page ones need to be corrected.

  In fact, the kvm_read/write_guest calls are of the form "kvm_read_guest(vcpu->kvm, ...)": if we switched these calls to kvm_vcpu_read/write_guest instead, then it seems like we'd get correct annotations for free. Would it be correct to make this switch? If not, then perhaps an optional kvm_vcpu* parameter for the "non-vcpu" read/write functions, strictly for annotation purposes? That seems rather ugly though...

Unsure (Similar-ish to above)
-----------------------------------------------
- kvm_hv_get_assist_page
  Incorrect as-is: the existing annotation would cover some efaults which it doesn't seem likely that userspace can resolve [2]. Right after those, though, there's a copy_from_user which it could make sense to annotate.

  The efault here comes from failures of kvm_read_guest_cached/kvm_read_guest_offset_cached, for which all of the calls are again of the form "f(vcpu->kvm, ...)". Again, we'll need either an (optional) vcpu parameter or to refactor these to just take a "kvm_vcpu" instead if we want to annotate just the failing uaccesses.

PS: I plan to add a couple of flags to the memory fault exit to identify whether the failed access was a read/write/exec.

[1] https://github.com/torvalds/linux/blob/v6.3/arch/x86/kvm/mmu/mmu.c#L3196
[2] https://github.com/torvalds/linux/blob/v6.3/virt/kvm/kvm_main.c#L3261-L3270
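To illustrate the switch being asked about, the change at such a call site would be on the order of one line, along the lines of the sketch below (a hypothetical fragment, not a patch from the series; 'vcpu', 'gpa', 'buf' and 'len' are stand-in names, and both helpers are existing declarations in include/linux/kvm_host.h):

/* Today: only the VM pointer is passed, so the helper has no vCPU on which
 * to report a KVM_EXIT_MEMORY_FAULT-style annotation when the uaccess fails. */
ret = kvm_read_guest(vcpu->kvm, gpa, buf, len);

/* Proposed: use the vCPU-aware variant so, per the discussion above, the
 * annotated-EFAULT path would kick in for free. */
ret = kvm_vcpu_read_guest(vcpu, gpa, buf, len);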
On Tue, May 23, 2023 at 10:49:04AM -0700, Anish Moorthy wrote:
> On Wed, May 10, 2023 at 4:44 PM Anish Moorthy <amoorthy@google.com> wrote:
> >
> > On Wed, May 10, 2023 at 3:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > Yeah, when I speed read the series, several of the conversions stood out as being "wrong". My (potentially unstated) idea was that KVM would only signal KVM_EXIT_MEMORY_FAULT when the -EFAULT could be traced back to a user access, i.e. when the fault _might_ be resolvable by userspace.
> >
> > Sean, besides direct_map which other patches did you notice as needing to be dropped/marked as unrecoverable errors?
>
> I tried going through on my own to try and identify the incorrect annotations: here's my read.
>
> Correct (or can easily be corrected)
> -----------------------------------------------
> - user_mem_abort
>   Incorrect as is: the annotations in patch 19 are incorrect, as they cover an error-on-no-slot case and one more I don't fully understand:

That other case is a wart we endearingly refer to as MTE (Memory Tagging Extension). You theoretically _could_ pop out an annotated exit here, as userspace likely messed up the mapping (like PROT_MTE missing). But I'm perfectly happy letting someone complain about it before we go out of our way to annotate that one. So feel free to drop.