Message ID: 20200109145729.32898-1-peterx@redhat.com (mailing list archive)
Series: KVM: Dirty ring interface
On Thu, Jan 09, 2020 at 09:57:08AM -0500, Peter Xu wrote: > Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring > (based on kvm/queue) > > Please refer to either the previous cover letters, or documentation > update in patch 12 for the big picture. I would rather you pasted it here. There's no way to respond otherwise. For something that's presumably an optimization, isn't there some kind of testing that can be done to show the benefits? What kind of gain was observed? I know it's mostly relevant for huge VMs, but OTOH these probably use huge pages. > Previous posts: > > V1: https://lore.kernel.org/kvm/20191129213505.18472-1-peterx@redhat.com > V2: https://lore.kernel.org/kvm/20191221014938.58831-1-peterx@redhat.com > > The major change in V3 is that we dropped the whole waitqueue and the > global lock. With that, we have clean per-vcpu ring and no default > ring any more. The two kvmgt refactoring patches were also included > to show the dependency of the works. > > Patchset layout: > > Patch 1-2: Picked up from kvmgt refactoring > Patch 3-6: Small patches that are not directly related, > (So can be acked/nacked/picked as standalone) > Patch 7-11: Prepares for the dirty ring interface > Patch 12: Major implementation > Patch 13-14: Quick follow-ups for patch 8 > Patch 15-21: Test cases > > V3 changelog: > > - fail userspace writable maps on dirty ring ranges [Jason] > - commit message fixups [Paolo] > - change __x86_set_memory_region to return hva [Paolo] > - cacheline align for indices [Paolo, Jason] > - drop waitqueue, global lock, etc., include kvmgt rework patchset > - take lock for __x86_set_memory_region() (otherwise it triggers a > lockdep in latest kvm/queue) [Paolo] > - check KVM_DIRTY_LOG_PAGE_OFFSET in kvm_vm_ioctl_enable_dirty_log_ring > - one more patch to drop x86_set_memory_region [Paolo] > - one more patch to remove extra srcu usage in init_rmode_identity_map() > - add some r-bs for Paolo > > Please review, thanks. 
> > Paolo Bonzini (1): > KVM: Move running VCPU from ARM to common code > > Peter Xu (18): > KVM: Remove kvm_read_guest_atomic() > KVM: Add build-time error check on kvm_run size > KVM: X86: Change parameter for fast_page_fault tracepoint > KVM: X86: Don't take srcu lock in init_rmode_identity_map() > KVM: Cache as_id in kvm_memory_slot > KVM: X86: Drop x86_set_memory_region() > KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR] > KVM: Pass in kvm pointer into mark_page_dirty_in_slot() > KVM: X86: Implement ring-based dirty memory tracking > KVM: Make dirty ring exclusive to dirty bitmap log > KVM: Don't allocate dirty bitmap if dirty ring is enabled > KVM: selftests: Always clear dirty bitmap after iteration > KVM: selftests: Sync uapi/linux/kvm.h to tools/ > KVM: selftests: Use a single binary for dirty/clear log test > KVM: selftests: Introduce after_vcpu_run hook for dirty log test > KVM: selftests: Add dirty ring buffer test > KVM: selftests: Let dirty_log_test async for dirty ring test > KVM: selftests: Add "-c" parameter to dirty log test > > Yan Zhao (2): > vfio: introduce vfio_iova_rw to read/write a range of IOVAs > drm/i915/gvt: subsitute kvm_read/write_guest with vfio_iova_rw > > Documentation/virt/kvm/api.txt | 96 ++++ > arch/arm/include/asm/kvm_host.h | 2 - > arch/arm64/include/asm/kvm_host.h | 2 - > arch/x86/include/asm/kvm_host.h | 7 +- > arch/x86/include/uapi/asm/kvm.h | 1 + > arch/x86/kvm/Makefile | 3 +- > arch/x86/kvm/mmu/mmu.c | 6 + > arch/x86/kvm/mmutrace.h | 9 +- > arch/x86/kvm/svm.c | 3 +- > arch/x86/kvm/vmx/vmx.c | 86 ++-- > arch/x86/kvm/x86.c | 43 +- > drivers/gpu/drm/i915/gvt/kvmgt.c | 25 +- > drivers/vfio/vfio.c | 45 ++ > drivers/vfio/vfio_iommu_type1.c | 81 ++++ > include/linux/kvm_dirty_ring.h | 55 +++ > include/linux/kvm_host.h | 37 +- > include/linux/vfio.h | 5 + > include/trace/events/kvm.h | 78 ++++ > include/uapi/linux/kvm.h | 33 ++ > tools/include/uapi/linux/kvm.h | 38 ++ > tools/testing/selftests/kvm/Makefile | 
2 - > .../selftests/kvm/clear_dirty_log_test.c | 2 - > tools/testing/selftests/kvm/dirty_log_test.c | 420 ++++++++++++++++-- > .../testing/selftests/kvm/include/kvm_util.h | 4 + > tools/testing/selftests/kvm/lib/kvm_util.c | 72 +++ > .../selftests/kvm/lib/kvm_util_internal.h | 3 + > virt/kvm/arm/arch_timer.c | 2 +- > virt/kvm/arm/arm.c | 29 -- > virt/kvm/arm/perf.c | 6 +- > virt/kvm/arm/vgic/vgic-mmio.c | 15 +- > virt/kvm/dirty_ring.c | 162 +++++++ > virt/kvm/kvm_main.c | 215 +++++++-- > 32 files changed, 1379 insertions(+), 208 deletions(-) > create mode 100644 include/linux/kvm_dirty_ring.h > delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c > create mode 100644 virt/kvm/dirty_ring.c > > -- > 2.24.1
On Thu, Jan 09, 2020 at 10:59:50AM -0500, Michael S. Tsirkin wrote: > On Thu, Jan 09, 2020 at 09:57:08AM -0500, Peter Xu wrote: > > Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring > > (based on kvm/queue) > > > > Please refer to either the previous cover letters, or documentation > > update in patch 12 for the big picture. > > I would rather you pasted it here. There's no way to respond otherwise. Sure, will do in the next post. > > For something that's presumably an optimization, isn't there > some kind of testing that can be done to show the benefits? > What kind of gain was observed? Since the interface seems to settle soon, maybe it's time to work on the QEMU part so I can give some number. It would be interesting to know the curves between dirty logging and dirty ring even for some small vms that have some workloads inside. > > I know it's mostly relevant for huge VMs, but OTOH these > probably use huge pages. Yes huge VMs could benefit more, especially if the dirty rate is not that high, I believe. Though, could you elaborate on why huge pages are special here? Thanks,
On Thu, Jan 09, 2020 at 11:17:42AM -0500, Peter Xu wrote: > On Thu, Jan 09, 2020 at 10:59:50AM -0500, Michael S. Tsirkin wrote: > > On Thu, Jan 09, 2020 at 09:57:08AM -0500, Peter Xu wrote: > > > Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring > > > (based on kvm/queue) > > > > > > Please refer to either the previous cover letters, or documentation > > > update in patch 12 for the big picture. > > > > I would rather you pasted it here. There's no way to respond otherwise. > > Sure, will do in the next post. > > > > > For something that's presumably an optimization, isn't there > > some kind of testing that can be done to show the benefits? > > What kind of gain was observed? > > Since the interface seems to settle soon, maybe it's time to work on > the QEMU part so I can give some number. It would be interesting to > know the curves between dirty logging and dirty ring even for some > small vms that have some workloads inside. > > > > > I know it's mostly relevant for huge VMs, but OTOH these > > probably use huge pages. > > Yes huge VMs could benefit more, especially if the dirty rate is not > that high, I believe. Though, could you elaborate on why huge pages > are special here? > > Thanks, With hugetlbfs there are less bits to test: e.g. with 2M pages a single bit set marks 512 pages as dirty. We do not take advantage of this but it looks like a rather obvious optimization. > -- > Peter Xu
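Michael's 512-pages-per-bit arithmetic can be sketched with a small back-of-the-envelope helper. This is purely illustrative (not KVM code, and `dirty_bitmap_bytes` is an invented name): the number of bitmap bytes needed to track one dirty bit per page shrinks by the same factor as the tracking page size grows.

```c
/* Illustrative sketch, not KVM code: bytes of dirty bitmap needed to
 * track one bit per page over a memory region, for a given page size. */
#include <stdint.h>

static uint64_t dirty_bitmap_bytes(uint64_t region_bytes, uint64_t page_bytes)
{
    uint64_t pages = region_bytes / page_bytes;  /* one bit per page */
    return (pages + 7) / 8;                      /* bits rounded up to bytes */
}
```

For a 16 GiB guest, 4K granularity needs a 512 KiB bitmap, while 2M granularity needs only 1 KiB; one 2M page covers 512 4K pages, matching the "single bit set marks 512 pages" observation above.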
On Thu, 9 Jan 2020 09:57:08 -0500 Peter Xu <peterx@redhat.com> wrote: > Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring > (based on kvm/queue) > > Please refer to either the previous cover letters, or documentation > update in patch 12 for the big picture. Previous posts: > > V1: https://lore.kernel.org/kvm/20191129213505.18472-1-peterx@redhat.com > V2: https://lore.kernel.org/kvm/20191221014938.58831-1-peterx@redhat.com > > The major change in V3 is that we dropped the whole waitqueue and the > global lock. With that, we have clean per-vcpu ring and no default > ring any more. The two kvmgt refactoring patches were also included > to show the dependency of the works. Hi Peter, Would you recommend this style of interface for vfio dirty page tracking as well? This mechanism seems very tuned to sparse page dirtying, how well does it handle fully dirty, or even significantly dirty regions? We also don't really have "active" dirty page tracking in vfio, we simply assume that if a page is pinned or otherwise mapped that it's dirty, so I think we'd constantly be trying to re-populate the dirty ring with pages that we've seen the user consume, which doesn't seem like a good fit versus a bitmap solution. Thanks, Alex
On Thu, Jan 09, 2020 at 11:40:23AM -0500, Michael S. Tsirkin wrote: [...] > > > I know it's mostly relevant for huge VMs, but OTOH these > > > probably use huge pages. > > > > Yes huge VMs could benefit more, especially if the dirty rate is not > > that high, I believe. Though, could you elaborate on why huge pages > > are special here? > > > > Thanks, > > With hugetlbfs there are less bits to test: e.g. with 2M pages a single > bit set marks 512 pages as dirty. We do not take advantage of this > but it looks like a rather obvious optimization. Right, but isn't that the trade-off between granularity of dirty tracking and how easy it is to collect the dirty bits? Say, it'll be nearly impossible to migrate 1G-huge-page-backed guests if we track dirty bits using huge page granularity, since each touch of guest memory will cause another 1G of memory to be transferred even if most of the content is the same. 2M can be somewhere in the middle, but still the same write amplification issue exists. PS. that seems to be another topic after all besides the dirty ring series because we need to change our policy first if we want to track it with huge pages; with that, for dirty ring we can start to leverage the kvm_dirty_gfn.pad to store the page size with another new kvm cap when we really want. Thanks,
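The write-amplification trade-off Peter describes can be quantified with a trivial worst-case model (illustrative only; `worst_case_amplification` is an invented helper): when dirty tracking is done at a coarse granularity but the guest touches only a single 4K page, the entire tracking unit must be re-sent.

```c
/* Illustrative worst case: one 4K guest write dirties a whole tracking
 * unit of `track_bytes`, so this many guest pages get re-sent per write. */
#include <stdint.h>

#define GUEST_PAGE 4096ULL

static uint64_t worst_case_amplification(uint64_t track_bytes)
{
    return track_bytes / GUEST_PAGE;
}
```

At 1G granularity one stray write can force 262144 pages across the wire; at 2M it is 512, and only at 4K does the factor drop to 1, which is why finer tracking wins for migration even though it costs more bits.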
On Thu, Jan 09, 2020 at 09:47:11AM -0700, Alex Williamson wrote: > On Thu, 9 Jan 2020 09:57:08 -0500 > Peter Xu <peterx@redhat.com> wrote: > > > Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring > > (based on kvm/queue) > > > > Please refer to either the previous cover letters, or documentation > > update in patch 12 for the big picture. Previous posts: > > > > V1: https://lore.kernel.org/kvm/20191129213505.18472-1-peterx@redhat.com > > V2: https://lore.kernel.org/kvm/20191221014938.58831-1-peterx@redhat.com > > > > The major change in V3 is that we dropped the whole waitqueue and the > > global lock. With that, we have clean per-vcpu ring and no default > > ring any more. The two kvmgt refactoring patches were also included > > to show the dependency of the works. > > Hi Peter, Hi, Alex, > > Would you recommend this style of interface for vfio dirty page > tracking as well? This mechanism seems very tuned to sparse page > dirtying, how well does it handle fully dirty, or even significantly > dirty regions? That's truly the point why I think the dirty bitmap can still be used and should be kept. IIUC the dirty ring starts from COLO where (1) dirty rate is very low, and (2) sync happens frequently. That's a perfect ground for dirty ring. However it for sure does not mean that dirty ring can solve all the issues. As you said, I believe the fully dirty case is another extreme in that dirty bitmap could perform better. > We also don't really have "active" dirty page tracking > in vfio, we simply assume that if a page is pinned or otherwise mapped > that it's dirty, so I think we'd constantly be trying to re-populate > the dirty ring with pages that we've seen the user consume, which > doesn't seem like a good fit versus a bitmap solution. Thanks, Right, so I confess I don't know whether dirty ring is the ideal solution for vfio either. 
Actually if we're tracking by page maps or pinnings, then IMHO it also means that it could be more suitable to use a modified version of dirty ring buffer (as you suggested in the other thread), in that we can track dirty using (addr, len) ranges rather than a single page address. That could be hard for KVM because in KVM the page will be mostly trapped in 4K granularity in page faults, and it'll also be hard to merge continuous entries with previous ones because the userspace could be reading the entries (so after we publish the previous 4K dirty page, we should not modify the entry any more). VFIO should not have this restriction because the marking of a dirty page range can be atomic when the range of pages is mapped or pinned. Thanks,
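The (addr, len) tracking Peter sketches for VFIO could look roughly like the following. This is a hypothetical sketch, not a proposed UAPI; `struct dirty_range` and `dirty_range_try_merge` are invented names. The key point from the paragraph above is that an entry may be extended in place only while it is still unpublished, i.e. before userspace can observe it.

```c
/* Hypothetical sketch of range-based dirty tracking: a new dirty range
 * is coalesced into the last entry while that entry is unpublished and
 * contiguous with the incoming range. */
#include <stdbool.h>
#include <stdint.h>

struct dirty_range {
    uint64_t addr;
    uint64_t len;
};

/* Returns true if [addr, addr + len) was merged into *last in place. */
static bool dirty_range_try_merge(struct dirty_range *last,
                                  uint64_t addr, uint64_t len)
{
    if (last->len && last->addr + last->len == addr) {
        last->len += len;  /* safe only before the entry is published */
        return true;
    }
    return false;
}
```

KVM cannot do this after publication precisely because userspace may already be reading the ring; VFIO, marking ranges at map/pin time, does not have that constraint.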
On Thu, Jan 09, 2020 at 12:08:49PM -0500, Peter Xu wrote: > On Thu, Jan 09, 2020 at 11:40:23AM -0500, Michael S. Tsirkin wrote: > > [...] > > > > > I know it's mostly relevant for huge VMs, but OTOH these > > > > probably use huge pages. > > > > > > Yes huge VMs could benefit more, especially if the dirty rate is not > > > that high, I believe. Though, could you elaborate on why huge pages > > > are special here? > > > > > > Thanks, > > > > With hugetlbfs there are less bits to test: e.g. with 2M pages a single > > bit set marks 512 pages as dirty. We do not take advantage of this > > but it looks like a rather obvious optimization. > > Right, but isn't that the trade-off between granularity of dirty > tracking and how easy it is to collect the dirty bits? Say, it'll be > merely impossible to migrate 1G-huge-page-backed guests if we track > dirty bits using huge page granularity, since each touch of guest > memory will cause another 1G memory to be transferred even if most of > the content is the same. 2M can be somewhere in the middle, but still > the same write amplify issue exists. > OK I see I'm unclear. IIUC at the moment KVM never uses huge pages if any part of the huge page is tracked. But if all parts of the page are written to then huge page is used. In this situation the whole huge page is dirty and needs to be migrated. > PS. that seems to be another topic after all besides the dirty ring > series because we need to change our policy first if we want to track > it with huge pages; with that, for dirty ring we can start to leverage > the kvm_dirty_gfn.pad to store the page size with another new kvm cap > when we really want. > > Thanks, Seems like leaking implementation detail to UAPI to me. > -- > Peter Xu
On Thu, Jan 09, 2020 at 12:58:08PM -0500, Peter Xu wrote: > On Thu, Jan 09, 2020 at 09:47:11AM -0700, Alex Williamson wrote: > > On Thu, 9 Jan 2020 09:57:08 -0500 > > Peter Xu <peterx@redhat.com> wrote: > > > > > Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring > > > (based on kvm/queue) > > > > > > Please refer to either the previous cover letters, or documentation > > > update in patch 12 for the big picture. Previous posts: > > > > > > V1: https://lore.kernel.org/kvm/20191129213505.18472-1-peterx@redhat.com > > > V2: https://lore.kernel.org/kvm/20191221014938.58831-1-peterx@redhat.com > > > > > > The major change in V3 is that we dropped the whole waitqueue and the > > > global lock. With that, we have clean per-vcpu ring and no default > > > ring any more. The two kvmgt refactoring patches were also included > > > to show the dependency of the works. > > > > Hi Peter, > > Hi, Alex, > > > > > Would you recommend this style of interface for vfio dirty page > > tracking as well? This mechanism seems very tuned to sparse page > > dirtying, how well does it handle fully dirty, or even significantly > > dirty regions? > > That's truely the point why I think the dirty bitmap can still be used > and should be kept. IIUC the dirty ring starts from COLO where (1) > dirty rate is very low, and (2) sync happens frequently. That's a > perfect ground for dirty ring. However it for sure does not mean that > dirty ring can solve all the issues. As you said, I believe the full > dirty is another extreme in that dirty bitmap could perform better. > > > We also don't really have "active" dirty page tracking > > in vfio, we simply assume that if a page is pinned or otherwise mapped > > that it's dirty, so I think we'd constantly be trying to re-populate > > the dirty ring with pages that we've seen the user consume, which > > doesn't seem like a good fit versus a bitmap solution. 
Thanks, > > Right, so I confess I don't know whether dirty ring is the ideal > solutioon for vfio either. Actually if we're tracking by page maps or > pinnings, then IMHO it also means that it could be more suitable to > use an modified version of dirty ring buffer (as you suggested in the > other thread), in that we can track dirty using (addr, len) range > rather than a single page address. That could be hard for KVM because > in KVM the page will be mostly trapped in 4K granularity in page > faults, and it'll also be hard to merge continuous entries with > previous ones because the userspace could be reading the entries (so > after we publish the previous 4K dirty page, we should not modify the > entry any more). An easy way would be to keep a couple of entries around, not pushing them into the ring until later. In fact deferring queue write until there's a bunch of data to be pushed is a very handy optimization. When building UAPI's it makes sense to try and keep them generic rather than tying them to a given implementation. That's one of the reasons I called for using something resembling vring_packed_desc. > VFIO should not have this restriction because the > marking of dirty page range can be atomic when the range of pages are > mapped or pinned. > > Thanks, > > -- > Peter Xu
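The deferral Michael suggests, holding a few entries back and publishing in batches, can be sketched as below. All names here are illustrative assumptions (`stage_buf`, `stage_dirty`, `STAGE_MAX` are not real KVM or virtio identifiers); a real implementation would copy staged entries into the shared ring and advance the producer index with a release barrier.

```c
/* Sketch of deferred publication: dirty entries are staged in a small
 * private buffer and only pushed to the (imaginary) shared ring once a
 * batch has accumulated. */
#include <stdint.h>

#define STAGE_MAX 8

struct stage_buf {
    uint64_t entry[STAGE_MAX];
    int count;      /* staged but not yet published */
    int published;  /* total entries pushed to the ring so far */
};

static void stage_flush(struct stage_buf *s)
{
    /* Real code: memcpy to the shared ring, then release-store the
     * producer index so the consumer sees complete entries. */
    s->published += s->count;
    s->count = 0;
}

static void stage_dirty(struct stage_buf *s, uint64_t gfn)
{
    s->entry[s->count++] = gfn;
    if (s->count == STAGE_MAX)  /* publish only in batches */
        stage_flush(s);
}
```

Staged entries can still be merged or rewritten freely before the flush, which is exactly what makes range coalescing possible without racing the reader.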
On Thu, Jan 09, 2020 at 02:13:54PM -0500, Michael S. Tsirkin wrote: > On Thu, Jan 09, 2020 at 12:58:08PM -0500, Peter Xu wrote: > > On Thu, Jan 09, 2020 at 09:47:11AM -0700, Alex Williamson wrote: > > > On Thu, 9 Jan 2020 09:57:08 -0500 > > > Peter Xu <peterx@redhat.com> wrote: > > > > > > > Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring > > > > (based on kvm/queue) > > > > > > > > Please refer to either the previous cover letters, or documentation > > > > update in patch 12 for the big picture. Previous posts: > > > > > > > > V1: https://lore.kernel.org/kvm/20191129213505.18472-1-peterx@redhat.com > > > > V2: https://lore.kernel.org/kvm/20191221014938.58831-1-peterx@redhat.com > > > > > > > > The major change in V3 is that we dropped the whole waitqueue and the > > > > global lock. With that, we have clean per-vcpu ring and no default > > > > ring any more. The two kvmgt refactoring patches were also included > > > > to show the dependency of the works. > > > > > > Hi Peter, > > > > Hi, Alex, > > > > > > > > Would you recommend this style of interface for vfio dirty page > > > tracking as well? This mechanism seems very tuned to sparse page > > > dirtying, how well does it handle fully dirty, or even significantly > > > dirty regions? > > > > That's truely the point why I think the dirty bitmap can still be used > > and should be kept. IIUC the dirty ring starts from COLO where (1) > > dirty rate is very low, and (2) sync happens frequently. That's a > > perfect ground for dirty ring. However it for sure does not mean that > > dirty ring can solve all the issues. As you said, I believe the full > > dirty is another extreme in that dirty bitmap could perform better. 
> > > > > We also don't really have "active" dirty page tracking > > > in vfio, we simply assume that if a page is pinned or otherwise mapped > > > that it's dirty, so I think we'd constantly be trying to re-populate > > > the dirty ring with pages that we've seen the user consume, which > > > doesn't seem like a good fit versus a bitmap solution. Thanks, > > > > Right, so I confess I don't know whether dirty ring is the ideal > > solutioon for vfio either. Actually if we're tracking by page maps or > > pinnings, then IMHO it also means that it could be more suitable to > > use an modified version of dirty ring buffer (as you suggested in the > > other thread), in that we can track dirty using (addr, len) range > > rather than a single page address. That could be hard for KVM because > > in KVM the page will be mostly trapped in 4K granularity in page > > faults, and it'll also be hard to merge continuous entries with > > previous ones because the userspace could be reading the entries (so > > after we publish the previous 4K dirty page, we should not modify the > > entry any more). > > An easy way would be to keep a couple of entries around, not pushing > them into the ring until later. In fact deferring queue write until > there's a bunch of data to be pushed is a very handy optimization. I feel like I proposed similar thing in the other thread. :-) > > When building UAPI's it makes sense to try and keep them generic > rather than tying them to a given implementation. > > That's one of the reasons I called for using something > resembling vring_packed_desc. But again, I just want to make sure I don't over-engineer... I'll wait for further feedback from others for this. Thanks,
On Thu, Jan 09, 2020 at 02:23:18PM -0500, Peter Xu wrote: > On Thu, Jan 09, 2020 at 02:13:54PM -0500, Michael S. Tsirkin wrote: > > On Thu, Jan 09, 2020 at 12:58:08PM -0500, Peter Xu wrote: > > > On Thu, Jan 09, 2020 at 09:47:11AM -0700, Alex Williamson wrote: > > > > On Thu, 9 Jan 2020 09:57:08 -0500 > > > > Peter Xu <peterx@redhat.com> wrote: > > > > > > > > > Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring > > > > > (based on kvm/queue) > > > > > > > > > > Please refer to either the previous cover letters, or documentation > > > > > update in patch 12 for the big picture. Previous posts: > > > > > > > > > > V1: https://lore.kernel.org/kvm/20191129213505.18472-1-peterx@redhat.com > > > > > V2: https://lore.kernel.org/kvm/20191221014938.58831-1-peterx@redhat.com > > > > > > > > > > The major change in V3 is that we dropped the whole waitqueue and the > > > > > global lock. With that, we have clean per-vcpu ring and no default > > > > > ring any more. The two kvmgt refactoring patches were also included > > > > > to show the dependency of the works. > > > > > > > > Hi Peter, > > > > > > Hi, Alex, > > > > > > > > > > > Would you recommend this style of interface for vfio dirty page > > > > tracking as well? This mechanism seems very tuned to sparse page > > > > dirtying, how well does it handle fully dirty, or even significantly > > > > dirty regions? > > > > > > That's truely the point why I think the dirty bitmap can still be used > > > and should be kept. IIUC the dirty ring starts from COLO where (1) > > > dirty rate is very low, and (2) sync happens frequently. That's a > > > perfect ground for dirty ring. However it for sure does not mean that > > > dirty ring can solve all the issues. As you said, I believe the full > > > dirty is another extreme in that dirty bitmap could perform better. 
> > > > > > > We also don't really have "active" dirty page tracking > > > > in vfio, we simply assume that if a page is pinned or otherwise mapped > > > > that it's dirty, so I think we'd constantly be trying to re-populate > > > > the dirty ring with pages that we've seen the user consume, which > > > > doesn't seem like a good fit versus a bitmap solution. Thanks, > > > > > > Right, so I confess I don't know whether dirty ring is the ideal > > > solutioon for vfio either. Actually if we're tracking by page maps or > > > pinnings, then IMHO it also means that it could be more suitable to > > > use an modified version of dirty ring buffer (as you suggested in the > > > other thread), in that we can track dirty using (addr, len) range > > > rather than a single page address. That could be hard for KVM because > > > in KVM the page will be mostly trapped in 4K granularity in page > > > faults, and it'll also be hard to merge continuous entries with > > > previous ones because the userspace could be reading the entries (so > > > after we publish the previous 4K dirty page, we should not modify the > > > entry any more). > > > > An easy way would be to keep a couple of entries around, not pushing > > them into the ring until later. In fact deferring queue write until > > there's a bunch of data to be pushed is a very handy optimization. > > I feel like I proposed similar thing in the other thread. :-) > > > > > > When building UAPI's it makes sense to try and keep them generic > > rather than tying them to a given implementation. > > > > That's one of the reasons I called for using something > > resembling vring_packed_desc. > > But again, I just want to make sure I don't over-engineer... You will know when you start profiling in earnest. > I'll wait for further feedback from others for this. > > Thanks, > > -- > Peter Xu
On Thu, Jan 09, 2020 at 02:08:52PM -0500, Michael S. Tsirkin wrote: > On Thu, Jan 09, 2020 at 12:08:49PM -0500, Peter Xu wrote: > > On Thu, Jan 09, 2020 at 11:40:23AM -0500, Michael S. Tsirkin wrote: > > > > [...] > > > > > > > I know it's mostly relevant for huge VMs, but OTOH these > > > > > probably use huge pages. > > > > > > > > Yes huge VMs could benefit more, especially if the dirty rate is not > > > > that high, I believe. Though, could you elaborate on why huge pages > > > > are special here? > > > > > > > > Thanks, > > > > > > With hugetlbfs there are less bits to test: e.g. with 2M pages a single > > > bit set marks 512 pages as dirty. We do not take advantage of this > > > but it looks like a rather obvious optimization. > > > > Right, but isn't that the trade-off between granularity of dirty > > tracking and how easy it is to collect the dirty bits? Say, it'll be > > merely impossible to migrate 1G-huge-page-backed guests if we track > > dirty bits using huge page granularity, since each touch of guest > > memory will cause another 1G memory to be transferred even if most of > > the content is the same. 2M can be somewhere in the middle, but still > > the same write amplify issue exists. > > > > OK I see I'm unclear. > > IIUC at the moment KVM never uses huge pages if any part of the huge page is > tracked. To be more precise - I think it's per-memslot. Say, if the memslot is dirty tracked, then no huge page on the host on that memslot (even if guest used huge page over that). > But if all parts of the page are written to then huge page > is used. I'm not sure of this... I think it's still in 4K granularity. > > In this situation the whole huge page is dirty and needs to be migrated. Note that in QEMU we always migrate pages in 4K for x86, iiuc (please refer to ram_save_host_page() in QEMU). > > > PS. 
that seems to be another topic after all besides the dirty ring > > series because we need to change our policy first if we want to track > > it with huge pages; with that, for dirty ring we can start to leverage > > the kvm_dirty_gfn.pad to store the page size with another new kvm cap > > when we really want. > > > > Thanks, > > Seems like leaking implementation detail to UAPI to me. I'd say it's not the only place we have an assumption at least (please also refer to uffd_msg.pagefault.address). IMHO it's not something wrong because interfaces can be extended, but I am open to extending kvm_dirty_gfn to cover a length/size or make the pad larger (as long as Paolo is fine with this). Thanks,
On 09/01/20 20:39, Peter Xu wrote: >> >> IIUC at the moment KVM never uses huge pages if any part of the huge page is >> tracked. > > To be more precise - I think it's per-memslot. Say, if the memslot is > dirty tracked, then no huge page on the host on that memslot (even if > guest used huge page over that). > >> But if all parts of the page are written to then huge page >> is used. > > I'm not sure of this... I think it's still in 4K granularity. Right. Dirty tracking always uses 4K page size. Paolo
On 09/01/20 20:13, Michael S. Tsirkin wrote: > That's one of the reasons I called for using something > resembling vring_packed_desc. In principle it could make sense to use the ring-wrap detection mechanism from vring_packed_desc instead of the producer/consumer indices. However, the element address/length indirection is unnecessary. Also, unlike virtio, KVM needs to know if there are N free entries (N is ~512) before running a guest. I'm not sure if that is possible with ring-wrap counters, while it's trivial with producer/consumer indices. Paolo
On Thu, Jan 09, 2020 at 09:51:50PM +0100, Paolo Bonzini wrote: > On 09/01/20 20:13, Michael S. Tsirkin wrote: > > That's one of the reasons I called for using something > > resembling vring_packed_desc. > > In principle it could make sense to use the ring-wrap detection > mechanism from vring_packed_desc instead of the producer/consumer > indices. However, the element address/length indirection is unnecessary. > > Also, unlike virtio, KVM needs to know if there are N free entries (N is > ~512) before running a guest. I'm not sure if that is possible with > ring-wrap counters, while it's trivial with producer/consumer indices. > > Paolo Yes it's easy: just check whether current entry + 500 has been consumed. Unless scatter/gather is used, but then the answer is simple - just don't use it :)
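The "N free entries" check Paolo calls trivial with producer/consumer indices can be sketched as follows. This is illustrative only (the names and the 16-bit index width are assumptions, not the actual kvm_dirty_ring layout): with free-running unsigned indices over a power-of-two ring, unsigned wraparound makes `prod - cons` the in-flight count even after both indices wrap.

```c
/* Sketch of a free-space check with free-running producer/consumer
 * indices; size must be a power of two no larger than the index range. */
#include <stdint.h>

static uint16_t ring_free_entries(uint16_t prod, uint16_t cons, uint16_t size)
{
    return (uint16_t)(size - (uint16_t)(prod - cons));
}

static int ring_has_room(uint16_t prod, uint16_t cons, uint16_t size,
                         uint16_t needed)
{
    return ring_free_entries(prod, cons, size) >= needed;
}
```

Before entering the guest, KVM would only need `ring_has_room(prod, cons, size, 512)` (or whatever bound covers one run); a wrap-counter scheme has no cheap equivalent of this single subtraction, which is Paolo's point.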
On Thu, Jan 09, 2020 at 02:39:49PM -0500, Peter Xu wrote: > On Thu, Jan 09, 2020 at 02:08:52PM -0500, Michael S. Tsirkin wrote: > > On Thu, Jan 09, 2020 at 12:08:49PM -0500, Peter Xu wrote: > > > On Thu, Jan 09, 2020 at 11:40:23AM -0500, Michael S. Tsirkin wrote: > > > > > > [...] > > > > > > > > > I know it's mostly relevant for huge VMs, but OTOH these > > > > > > probably use huge pages. > > > > > > > > > > Yes huge VMs could benefit more, especially if the dirty rate is not > > > > > that high, I believe. Though, could you elaborate on why huge pages > > > > > are special here? > > > > > > > > > > Thanks, > > > > > > > > With hugetlbfs there are less bits to test: e.g. with 2M pages a single > > > > bit set marks 512 pages as dirty. We do not take advantage of this > > > > but it looks like a rather obvious optimization. > > > > > > Right, but isn't that the trade-off between granularity of dirty > > > tracking and how easy it is to collect the dirty bits? Say, it'll be > > > merely impossible to migrate 1G-huge-page-backed guests if we track > > > dirty bits using huge page granularity, since each touch of guest > > > memory will cause another 1G memory to be transferred even if most of > > > the content is the same. 2M can be somewhere in the middle, but still > > > the same write amplify issue exists. > > > > > > > OK I see I'm unclear. > > > > IIUC at the moment KVM never uses huge pages if any part of the huge page is > > tracked. > > To be more precise - I think it's per-memslot. Say, if the memslot is > dirty tracked, then no huge page on the host on that memslot (even if > guest used huge page over that). Yea ... so does it make sense to make this implementation detail leak through UAPI? > > But if all parts of the page are written to then huge page > > is used. > > I'm not sure of this... I think it's still in 4K granularity. > > > > > In this situation the whole huge page is dirty and needs to be migrated. 
> > Note that in QEMU we always migrate pages in 4K for x86, iiuc (please > refer to ram_save_host_page() in QEMU). > > > > > > PS. that seems to be another topic after all besides the dirty ring > > > series because we need to change our policy first if we want to track > > > it with huge pages; with that, for dirty ring we can start to leverage > > > the kvm_dirty_gfn.pad to store the page size with another new kvm cap > > > when we really want. > > > > > > Thanks, > > > > Seems like leaking implementation detail to UAPI to me. > > I'd say it's not the only place we have an assumption at least (please > also refer to uffd_msg.pagefault.address). IMHO it's not something > wrong because interfaces can be extended, but I am open to extending > kvm_dirty_gfn to cover a length/size or make the pad larger (as long > as Paolo is fine with this). > > Thanks, > > -- > Peter Xu
On Thu, Jan 09, 2020 at 05:28:36PM -0500, Michael S. Tsirkin wrote: > On Thu, Jan 09, 2020 at 02:39:49PM -0500, Peter Xu wrote: > > On Thu, Jan 09, 2020 at 02:08:52PM -0500, Michael S. Tsirkin wrote: > > > On Thu, Jan 09, 2020 at 12:08:49PM -0500, Peter Xu wrote: > > > > On Thu, Jan 09, 2020 at 11:40:23AM -0500, Michael S. Tsirkin wrote: > > > > > > > > [...] > > > > > > > > > > > I know it's mostly relevant for huge VMs, but OTOH these > > > > > > > probably use huge pages. > > > > > > > > > > > > Yes huge VMs could benefit more, especially if the dirty rate is not > > > > > > that high, I believe. Though, could you elaborate on why huge pages > > > > > > are special here? > > > > > > > > > > > > Thanks, > > > > > > > > > > With hugetlbfs there are less bits to test: e.g. with 2M pages a single > > > > > bit set marks 512 pages as dirty. We do not take advantage of this > > > > > but it looks like a rather obvious optimization. > > > > > > > > Right, but isn't that the trade-off between granularity of dirty > > > > tracking and how easy it is to collect the dirty bits? Say, it'll be > > > > merely impossible to migrate 1G-huge-page-backed guests if we track > > > > dirty bits using huge page granularity, since each touch of guest > > > > memory will cause another 1G memory to be transferred even if most of > > > > the content is the same. 2M can be somewhere in the middle, but still > > > > the same write amplify issue exists. > > > > > > > > > > OK I see I'm unclear. > > > > > > IIUC at the moment KVM never uses huge pages if any part of the huge page is > > > tracked. > > > > To be more precise - I think it's per-memslot. Say, if the memslot is > > dirty tracked, then no huge page on the host on that memslot (even if > > guest used huge page over that). > > Yea ... so does it make sense to make this implementation detail > leak through UAPI? 
I think that's not a leak of an internal implementation detail; we simply define the interface such that the address reported by each kvm_dirty_gfn is always host-page aligned (which by default means no huge pages) and covers a single host page, that's all.  Host page size is always visible to userspace after all, so imho it's fine.

Thanks,
On 09/01/20 15:57, Peter Xu wrote:
> Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
> (based on kvm/queue)
>
> Please refer to either the previous cover letters, or documentation
> update in patch 12 for the big picture.  Previous posts:
>
> V1: https://lore.kernel.org/kvm/20191129213505.18472-1-peterx@redhat.com
> V2: https://lore.kernel.org/kvm/20191221014938.58831-1-peterx@redhat.com
>
> The major change in V3 is that we dropped the whole waitqueue and the
> global lock.  With that, we have clean per-vcpu ring and no default
> ring any more.  The two kvmgt refactoring patches were also included
> to show the dependency of the works.
>
> Patchset layout:
>
> Patch 1-2:    Picked up from kvmgt refactoring
> Patch 3-6:    Small patches that are not directly related,
>               (So can be acked/nacked/picked as standalone)
> Patch 7-11:   Prepares for the dirty ring interface
> Patch 12:     Major implementation
> Patch 13-14:  Quick follow-ups for patch 8
> Patch 15-21:  Test cases
>
> V3 changelog:
>
> - fail userspace writable maps on dirty ring ranges [Jason]
> - commit message fixups [Paolo]
> - change __x86_set_memory_region to return hva [Paolo]
> - cacheline align for indices [Paolo, Jason]
> - drop waitqueue, global lock, etc., include kvmgt rework patchset
> - take lock for __x86_set_memory_region() (otherwise it triggers a
>   lockdep in latest kvm/queue) [Paolo]
> - check KVM_DIRTY_LOG_PAGE_OFFSET in kvm_vm_ioctl_enable_dirty_log_ring
> - one more patch to drop x86_set_memory_region [Paolo]
> - one more patch to remove extra srcu usage in init_rmode_identity_map()
> - add some r-bs for Paolo
>
> Please review, thanks.
>
> Paolo Bonzini (1):
>   KVM: Move running VCPU from ARM to common code
>
> Peter Xu (18):
>   KVM: Remove kvm_read_guest_atomic()
>   KVM: Add build-time error check on kvm_run size
>   KVM: X86: Change parameter for fast_page_fault tracepoint
>   KVM: X86: Don't take srcu lock in init_rmode_identity_map()
>   KVM: Cache as_id in kvm_memory_slot
>   KVM: X86: Drop x86_set_memory_region()
>   KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
>   KVM: Pass in kvm pointer into mark_page_dirty_in_slot()
>   KVM: X86: Implement ring-based dirty memory tracking
>   KVM: Make dirty ring exclusive to dirty bitmap log
>   KVM: Don't allocate dirty bitmap if dirty ring is enabled
>   KVM: selftests: Always clear dirty bitmap after iteration
>   KVM: selftests: Sync uapi/linux/kvm.h to tools/
>   KVM: selftests: Use a single binary for dirty/clear log test
>   KVM: selftests: Introduce after_vcpu_run hook for dirty log test
>   KVM: selftests: Add dirty ring buffer test
>   KVM: selftests: Let dirty_log_test async for dirty ring test
>   KVM: selftests: Add "-c" parameter to dirty log test
>
> Yan Zhao (2):
>   vfio: introduce vfio_iova_rw to read/write a range of IOVAs
>   drm/i915/gvt: substitute kvm_read/write_guest with vfio_iova_rw
>
>  Documentation/virt/kvm/api.txt                |  96 ++++
>  arch/arm/include/asm/kvm_host.h               |   2 -
>  arch/arm64/include/asm/kvm_host.h             |   2 -
>  arch/x86/include/asm/kvm_host.h               |   7 +-
>  arch/x86/include/uapi/asm/kvm.h               |   1 +
>  arch/x86/kvm/Makefile                         |   3 +-
>  arch/x86/kvm/mmu/mmu.c                        |   6 +
>  arch/x86/kvm/mmutrace.h                       |   9 +-
>  arch/x86/kvm/svm.c                            |   3 +-
>  arch/x86/kvm/vmx/vmx.c                        |  86 ++--
>  arch/x86/kvm/x86.c                            |  43 +-
>  drivers/gpu/drm/i915/gvt/kvmgt.c              |  25 +-
>  drivers/vfio/vfio.c                           |  45 ++
>  drivers/vfio/vfio_iommu_type1.c               |  81 ++++
>  include/linux/kvm_dirty_ring.h                |  55 +++
>  include/linux/kvm_host.h                      |  37 +-
>  include/linux/vfio.h                          |   5 +
>  include/trace/events/kvm.h                    |  78 ++++
>  include/uapi/linux/kvm.h                      |  33 ++
>  tools/include/uapi/linux/kvm.h                |  38 ++
>  tools/testing/selftests/kvm/Makefile          |   2 -
>  .../selftests/kvm/clear_dirty_log_test.c      |   2 -
>  tools/testing/selftests/kvm/dirty_log_test.c  | 420 ++++++++++++++++--
>  .../testing/selftests/kvm/include/kvm_util.h  |   4 +
>  tools/testing/selftests/kvm/lib/kvm_util.c    |  72 +++
>  .../selftests/kvm/lib/kvm_util_internal.h     |   3 +
>  virt/kvm/arm/arch_timer.c                     |   2 +-
>  virt/kvm/arm/arm.c                            |  29 --
>  virt/kvm/arm/perf.c                           |   6 +-
>  virt/kvm/arm/vgic/vgic-mmio.c                 |  15 +-
>  virt/kvm/dirty_ring.c                         | 162 +++++++
>  virt/kvm/kvm_main.c                           | 215 +++++++--
>  32 files changed, 1379 insertions(+), 208 deletions(-)
>  create mode 100644 include/linux/kvm_dirty_ring.h
>  delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c
>  create mode 100644 virt/kvm/dirty_ring.c

Queued patches 3-6, 8-9, 11; thanks!

Paolo