
[v3,00/21] KVM: Dirty ring interface

Message ID 20200109145729.32898-1-peterx@redhat.com (mailing list archive)
Series: KVM: Dirty ring interface

Message

Peter Xu Jan. 9, 2020, 2:57 p.m. UTC
Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
(based on kvm/queue)

Please refer to either the previous cover letters or the
documentation update in patch 12 for the big picture.  Previous posts:

V1: https://lore.kernel.org/kvm/20191129213505.18472-1-peterx@redhat.com
V2: https://lore.kernel.org/kvm/20191221014938.58831-1-peterx@redhat.com

The major change in V3 is that we dropped the waitqueue and the
global lock.  With that, we have a clean per-vcpu ring and no default
ring any more.  The two kvmgt refactoring patches are also included to
show the dependency of this work.
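[Editor's note: since the big picture is deferred to patch 12, here is a
minimal user-space sketch of the per-vcpu ring described above.  Struct
and field names (kvm_dirty_gfn, avail/fetch indices, RING_SIZE) loosely
follow the series' uapi patch and are assumptions, not the final
interface; the producer/consumer below is a plain simulation, not kernel
code.]

```c
/* Hypothetical sketch of a per-vCPU dirty ring.  Names are assumptions
 * based on this series' uapi patch; the logic is a user-space
 * simulation, not kernel code. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 8          /* a power of two in the real interface */

struct kvm_dirty_gfn {       /* one entry: which page was dirtied */
    uint32_t pad;            /* reserved (e.g. for a page size later) */
    uint32_t slot;           /* (as_id << 16) | memslot id */
    uint64_t offset;         /* page offset within the memslot */
};

struct dirty_ring {
    /* In the real interface the indices live in a page mmap()ed from
     * the vcpu fd at KVM_DIRTY_LOG_PAGE_OFFSET and are cacheline
     * aligned; a plain struct is enough for this sketch. */
    uint32_t avail_index;    /* advanced by KVM when producing */
    uint32_t fetch_index;    /* advanced by userspace when harvesting */
    struct kvm_dirty_gfn gfns[RING_SIZE];
};

/* KVM side: push one dirty gfn (conceptually from
 * mark_page_dirty_in_slot()). */
static void ring_push(struct dirty_ring *r, uint32_t slot, uint64_t offset)
{
    struct kvm_dirty_gfn *e = &r->gfns[r->avail_index % RING_SIZE];
    e->slot = slot;
    e->offset = offset;
    r->avail_index++;        /* real code publishes with a barrier */
}

/* Userspace side: harvest everything between fetch and avail. */
static int ring_harvest(struct dirty_ring *r, struct kvm_dirty_gfn *out)
{
    int n = 0;
    while (r->fetch_index != r->avail_index) {
        out[n++] = r->gfns[r->fetch_index % RING_SIZE];
        r->fetch_index++;
    }
    return n;
}
```

After harvesting, per the series, userspace would invoke a
KVM_RESET_DIRTY_RINGS ioctl so KVM can write-protect the collected
pages again.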

Patchset layout:

Patch 1-2:         Picked up from kvmgt refactoring
Patch 3-6:         Small patches that are not directly related
                   (so they can be acked/nacked/picked standalone)
Patch 7-11:        Prepares for the dirty ring interface
Patch 12:          Major implementation
Patch 13-14:       Quick follow-ups for patch 8
Patch 15-21:       Test cases

V3 changelog:

- fail userspace writable maps on dirty ring ranges [Jason]
- commit message fixups [Paolo]
- change __x86_set_memory_region to return hva [Paolo]
- cacheline align for indices [Paolo, Jason]
- drop waitqueue, global lock, etc., include kvmgt rework patchset
- take lock for __x86_set_memory_region() (otherwise it triggers a
  lockdep in latest kvm/queue) [Paolo]
- check KVM_DIRTY_LOG_PAGE_OFFSET in kvm_vm_ioctl_enable_dirty_log_ring
- one more patch to drop x86_set_memory_region [Paolo]
- one more patch to remove extra srcu usage in init_rmode_identity_map()
- add some r-bs for Paolo

Please review, thanks.

Paolo Bonzini (1):
  KVM: Move running VCPU from ARM to common code

Peter Xu (18):
  KVM: Remove kvm_read_guest_atomic()
  KVM: Add build-time error check on kvm_run size
  KVM: X86: Change parameter for fast_page_fault tracepoint
  KVM: X86: Don't take srcu lock in init_rmode_identity_map()
  KVM: Cache as_id in kvm_memory_slot
  KVM: X86: Drop x86_set_memory_region()
  KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  KVM: Pass in kvm pointer into mark_page_dirty_in_slot()
  KVM: X86: Implement ring-based dirty memory tracking
  KVM: Make dirty ring exclusive to dirty bitmap log
  KVM: Don't allocate dirty bitmap if dirty ring is enabled
  KVM: selftests: Always clear dirty bitmap after iteration
  KVM: selftests: Sync uapi/linux/kvm.h to tools/
  KVM: selftests: Use a single binary for dirty/clear log test
  KVM: selftests: Introduce after_vcpu_run hook for dirty log test
  KVM: selftests: Add dirty ring buffer test
  KVM: selftests: Let dirty_log_test async for dirty ring test
  KVM: selftests: Add "-c" parameter to dirty log test

Yan Zhao (2):
  vfio: introduce vfio_iova_rw to read/write a range of IOVAs
  drm/i915/gvt: substitute kvm_read/write_guest with vfio_iova_rw

 Documentation/virt/kvm/api.txt                |  96 ++++
 arch/arm/include/asm/kvm_host.h               |   2 -
 arch/arm64/include/asm/kvm_host.h             |   2 -
 arch/x86/include/asm/kvm_host.h               |   7 +-
 arch/x86/include/uapi/asm/kvm.h               |   1 +
 arch/x86/kvm/Makefile                         |   3 +-
 arch/x86/kvm/mmu/mmu.c                        |   6 +
 arch/x86/kvm/mmutrace.h                       |   9 +-
 arch/x86/kvm/svm.c                            |   3 +-
 arch/x86/kvm/vmx/vmx.c                        |  86 ++--
 arch/x86/kvm/x86.c                            |  43 +-
 drivers/gpu/drm/i915/gvt/kvmgt.c              |  25 +-
 drivers/vfio/vfio.c                           |  45 ++
 drivers/vfio/vfio_iommu_type1.c               |  81 ++++
 include/linux/kvm_dirty_ring.h                |  55 +++
 include/linux/kvm_host.h                      |  37 +-
 include/linux/vfio.h                          |   5 +
 include/trace/events/kvm.h                    |  78 ++++
 include/uapi/linux/kvm.h                      |  33 ++
 tools/include/uapi/linux/kvm.h                |  38 ++
 tools/testing/selftests/kvm/Makefile          |   2 -
 .../selftests/kvm/clear_dirty_log_test.c      |   2 -
 tools/testing/selftests/kvm/dirty_log_test.c  | 420 ++++++++++++++++--
 .../testing/selftests/kvm/include/kvm_util.h  |   4 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  72 +++
 .../selftests/kvm/lib/kvm_util_internal.h     |   3 +
 virt/kvm/arm/arch_timer.c                     |   2 +-
 virt/kvm/arm/arm.c                            |  29 --
 virt/kvm/arm/perf.c                           |   6 +-
 virt/kvm/arm/vgic/vgic-mmio.c                 |  15 +-
 virt/kvm/dirty_ring.c                         | 162 +++++++
 virt/kvm/kvm_main.c                           | 215 +++++++--
 32 files changed, 1379 insertions(+), 208 deletions(-)
 create mode 100644 include/linux/kvm_dirty_ring.h
 delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c
 create mode 100644 virt/kvm/dirty_ring.c

Comments

Michael S. Tsirkin Jan. 9, 2020, 3:59 p.m. UTC | #1
On Thu, Jan 09, 2020 at 09:57:08AM -0500, Peter Xu wrote:
> Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
> (based on kvm/queue)
> 
> Please refer to either the previous cover letters, or documentation
> update in patch 12 for the big picture.

I would rather you pasted it here. There's no way to respond otherwise.

For something that's presumably an optimization, isn't there
some kind of testing that can be done to show the benefits?
What kind of gain was observed?

I know it's mostly relevant for huge VMs, but OTOH these
probably use huge pages.




> [...]
Peter Xu Jan. 9, 2020, 4:17 p.m. UTC | #2
On Thu, Jan 09, 2020 at 10:59:50AM -0500, Michael S. Tsirkin wrote:
> On Thu, Jan 09, 2020 at 09:57:08AM -0500, Peter Xu wrote:
> > Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
> > (based on kvm/queue)
> > 
> > Please refer to either the previous cover letters, or documentation
> > update in patch 12 for the big picture.
> 
> I would rather you pasted it here. There's no way to respond otherwise.

Sure, will do in the next post.

> 
> For something that's presumably an optimization, isn't there
> some kind of testing that can be done to show the benefits?
> What kind of gain was observed?

Since the interface seems close to settling, maybe it's time to work
on the QEMU part so I can give some numbers.  It would be interesting
to see the curves for dirty logging versus dirty ring, even for some
small VMs that run some workload inside.

> 
> I know it's mostly relevant for huge VMs, but OTOH these
> probably use huge pages.

Yes huge VMs could benefit more, especially if the dirty rate is not
that high, I believe.  Though, could you elaborate on why huge pages
are special here?

Thanks,
Michael S. Tsirkin Jan. 9, 2020, 4:40 p.m. UTC | #3
On Thu, Jan 09, 2020 at 11:17:42AM -0500, Peter Xu wrote:
> On Thu, Jan 09, 2020 at 10:59:50AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Jan 09, 2020 at 09:57:08AM -0500, Peter Xu wrote:
> > > Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
> > > (based on kvm/queue)
> > > 
> > > Please refer to either the previous cover letters, or documentation
> > > update in patch 12 for the big picture.
> > 
> > I would rather you pasted it here. There's no way to respond otherwise.
> 
> Sure, will do in the next post.
> 
> > 
> > For something that's presumably an optimization, isn't there
> > some kind of testing that can be done to show the benefits?
> > What kind of gain was observed?
> 
> Since the interface seems to settle soon, maybe it's time to work on
> the QEMU part so I can give some number.  It would be interesting to
> know the curves between dirty logging and dirty ring even for some
> small vms that have some workloads inside.
> 
> > 
> > I know it's mostly relevant for huge VMs, but OTOH these
> > probably use huge pages.
> 
> Yes huge VMs could benefit more, especially if the dirty rate is not
> that high, I believe.  Though, could you elaborate on why huge pages
> are special here?
> 
> Thanks,

With hugetlbfs there are fewer bits to test: e.g. with 2M pages a
single bit set marks 512 4K pages as dirty.  We do not take advantage
of this yet, but it looks like a rather obvious optimization.
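[Editor's note: the point above can be made concrete with a little
arithmetic.  This is an illustration only, not KVM code.]

```c
/* One bitmap bit per tracked page: with 2M pages the bitmap shrinks
 * by a factor of 512 compared to 4K tracking. */
#include <assert.h>
#include <stdint.h>

static uint64_t bitmap_bytes(uint64_t mem_bytes, uint64_t page_size)
{
    uint64_t pages = mem_bytes / page_size;   /* one bit per page */
    return (pages + 7) / 8;                   /* round up to bytes */
}
```

For a 16GB guest this gives a 512KB bitmap at 4K granularity versus a
1KB bitmap at 2M granularity.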

> -- 
> Peter Xu
Alex Williamson Jan. 9, 2020, 4:47 p.m. UTC | #4
On Thu,  9 Jan 2020 09:57:08 -0500
Peter Xu <peterx@redhat.com> wrote:

> Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
> (based on kvm/queue)
> 
> Please refer to either the previous cover letters, or documentation
> update in patch 12 for the big picture.  Previous posts:
> 
> V1: https://lore.kernel.org/kvm/20191129213505.18472-1-peterx@redhat.com
> V2: https://lore.kernel.org/kvm/20191221014938.58831-1-peterx@redhat.com
> 
> The major change in V3 is that we dropped the whole waitqueue and the
> global lock. With that, we have clean per-vcpu ring and no default
> ring any more.  The two kvmgt refactoring patches were also included
> to show the dependency of the works.

Hi Peter,

Would you recommend this style of interface for vfio dirty page
tracking as well?  This mechanism seems very tuned to sparse page
dirtying, how well does it handle fully dirty, or even significantly
dirty regions?  We also don't really have "active" dirty page tracking
in vfio, we simply assume that if a page is pinned or otherwise mapped
that it's dirty, so I think we'd constantly be trying to re-populate
the dirty ring with pages that we've seen the user consume, which
doesn't seem like a good fit versus a bitmap solution.  Thanks,

Alex
Peter Xu Jan. 9, 2020, 5:08 p.m. UTC | #5
On Thu, Jan 09, 2020 at 11:40:23AM -0500, Michael S. Tsirkin wrote:

[...]

> > > I know it's mostly relevant for huge VMs, but OTOH these
> > > probably use huge pages.
> > 
> > Yes huge VMs could benefit more, especially if the dirty rate is not
> > that high, I believe.  Though, could you elaborate on why huge pages
> > are special here?
> > 
> > Thanks,
> 
> With hugetlbfs there are fewer bits to test: e.g. with 2M pages a single
> bit set marks 512 pages as dirty.  We do not take advantage of this
> but it looks like a rather obvious optimization.

Right, but isn't that the trade-off between the granularity of dirty
tracking and how easy it is to collect the dirty bits?  Say, it would
be nearly impossible to migrate 1G-huge-page-backed guests if we track
dirty bits at huge page granularity, since each touch of guest memory
would cause another 1G of memory to be transferred even if most of the
content is the same.  2M is somewhere in the middle, but the same
write amplification issue exists.
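[Editor's note: a quick calculation of the amplification described
above.  Illustration only; it assumes a single 4K guest store per
tracking unit.]

```c
/* Bytes retransmitted per byte-range actually modified, when dirty is
 * tracked at granularity track_gran and the guest touched `touched`
 * bytes inside one tracking unit. */
#include <assert.h>
#include <stdint.h>

static uint64_t amplification(uint64_t track_gran, uint64_t touched)
{
    return track_gran / touched;
}
```

One 4K store under 1G tracking re-sends 262144 pages' worth of data;
under 2M tracking it is still 512x.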

PS: that seems to be a separate topic from the dirty ring series,
because we would need to change our policy first if we want to track
dirty bits at huge page granularity; with that in place, the dirty
ring could start to leverage kvm_dirty_gfn.pad to store the page size,
behind another new kvm cap when we really want it.

Thanks,
Peter Xu Jan. 9, 2020, 5:58 p.m. UTC | #6
On Thu, Jan 09, 2020 at 09:47:11AM -0700, Alex Williamson wrote:
> On Thu,  9 Jan 2020 09:57:08 -0500
> Peter Xu <peterx@redhat.com> wrote:
> 
> > [...]
> 
> Hi Peter,

Hi, Alex,

> 
> Would you recommend this style of interface for vfio dirty page
> tracking as well?  This mechanism seems very tuned to sparse page
> dirtying, how well does it handle fully dirty, or even significantly
> dirty regions?

That's truly the point why I think the dirty bitmap can still be used
and should be kept.  IIUC the dirty ring idea started from COLO, where
(1) the dirty rate is very low, and (2) sync happens frequently.
That's perfect ground for the dirty ring.  However it for sure does
not mean that the dirty ring can solve all the issues.  As you said, I
believe the fully-dirty case is the other extreme, where the dirty
bitmap could perform better.

> We also don't really have "active" dirty page tracking
> in vfio, we simply assume that if a page is pinned or otherwise mapped
> that it's dirty, so I think we'd constantly be trying to re-populate
> the dirty ring with pages that we've seen the user consume, which
> doesn't seem like a good fit versus a bitmap solution.  Thanks,

Right, so I confess I don't know whether the dirty ring is the ideal
solution for vfio either.  Actually, if we're tracking by page maps or
pinnings, then IMHO it could be more suitable to use a modified
version of the dirty ring buffer (as you suggested in the other
thread), in which we track dirty memory as (addr, len) ranges rather
than single page addresses.  That could be hard for KVM, because KVM
mostly traps pages at 4K granularity in page faults, and it would also
be hard to merge contiguous entries with previous ones because
userspace could be reading the entries (so after we publish a 4K dirty
page, we should not modify the entry any more).  VFIO should not have
this restriction, because marking a dirty page range can be atomic at
the time the range of pages is mapped or pinned.
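[Editor's note: a rough illustration of the (addr, len) idea discussed
above.  All names are invented for illustration; this is not a proposal
for an actual uapi.]

```c
/* Hypothetical (addr, len) dirty entry: one entry can cover a whole
 * pinned range, where KVM's page-fault path would emit one 4K entry
 * at a time. */
#include <assert.h>
#include <stdint.h>

struct dirty_range {
    uint64_t iova;   /* start of the dirtied range */
    uint64_t len;    /* length in bytes */
};

/* Extend the last entry when contiguous instead of appending a new
 * one.  Safe for VFIO at map/pin time; KVM cannot do this after an
 * entry is published, since userspace may already be reading it. */
static int push_or_merge(struct dirty_range *ring, int *count,
                         uint64_t iova, uint64_t len)
{
    if (*count > 0) {
        struct dirty_range *last = &ring[*count - 1];
        if (last->iova + last->len == iova) {
            last->len += len;                 /* contiguous: merge */
            return *count;
        }
    }
    ring[(*count)++] = (struct dirty_range){ iova, len };
    return *count;
}
```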

Thanks,
Michael S. Tsirkin Jan. 9, 2020, 7:08 p.m. UTC | #7
On Thu, Jan 09, 2020 at 12:08:49PM -0500, Peter Xu wrote:
> On Thu, Jan 09, 2020 at 11:40:23AM -0500, Michael S. Tsirkin wrote:
> 
> [...]
> 
> > > > I know it's mostly relevant for huge VMs, but OTOH these
> > > > probably use huge pages.
> > > 
> > > Yes huge VMs could benefit more, especially if the dirty rate is not
> > > that high, I believe.  Though, could you elaborate on why huge pages
> > > are special here?
> > > 
> > > Thanks,
> > 
> > With hugetlbfs there are fewer bits to test: e.g. with 2M pages a single
> > bit set marks 512 pages as dirty.  We do not take advantage of this
> > but it looks like a rather obvious optimization.
> 
> Right, but isn't that the trade-off between granularity of dirty
> tracking and how easy it is to collect the dirty bits?  Say, it'll be
> nearly impossible to migrate 1G-huge-page-backed guests if we track
> dirty bits using huge page granularity, since each touch of guest
> memory will cause another 1G memory to be transferred even if most of
> the content is the same.  2M can be somewhere in the middle, but still
> the same write amplification issue exists.
>

OK, I see I was unclear.

IIUC at the moment KVM never uses huge pages if any part of the huge
page is tracked.  But if all parts of the page are written to, then a
huge page is used.

In this situation the whole huge page is dirty and needs to be migrated.

> PS. that seems to be another topic after all besides the dirty ring
> series because we need to change our policy first if we want to track
> it with huge pages; with that, for dirty ring we can start to leverage
> the kvm_dirty_gfn.pad to store the page size with another new kvm cap
> when we really want.
> 
> Thanks,

Seems like leaking implementation detail to UAPI to me.


> -- 
> Peter Xu
Michael S. Tsirkin Jan. 9, 2020, 7:13 p.m. UTC | #8
On Thu, Jan 09, 2020 at 12:58:08PM -0500, Peter Xu wrote:
> On Thu, Jan 09, 2020 at 09:47:11AM -0700, Alex Williamson wrote:
> > On Thu,  9 Jan 2020 09:57:08 -0500
> > Peter Xu <peterx@redhat.com> wrote:
> > 
> > > [...]
> > 
> > Hi Peter,
> 
> Hi, Alex,
> 
> > 
> > Would you recommend this style of interface for vfio dirty page
> > tracking as well?  This mechanism seems very tuned to sparse page
> > dirtying, how well does it handle fully dirty, or even significantly
> > dirty regions?
> 
> > That's truly the point why I think the dirty bitmap can still be used
> and should be kept.  IIUC the dirty ring starts from COLO where (1)
> dirty rate is very low, and (2) sync happens frequently.  That's a
> perfect ground for dirty ring.  However it for sure does not mean that
> dirty ring can solve all the issues.  As you said, I believe the full
> dirty is another extreme in that dirty bitmap could perform better.
> 
> > We also don't really have "active" dirty page tracking
> > in vfio, we simply assume that if a page is pinned or otherwise mapped
> > that it's dirty, so I think we'd constantly be trying to re-populate
> > the dirty ring with pages that we've seen the user consume, which
> > doesn't seem like a good fit versus a bitmap solution.  Thanks,
> 
> Right, so I confess I don't know whether dirty ring is the ideal
> > solution for vfio either.  Actually if we're tracking by page maps or
> pinnings, then IMHO it also means that it could be more suitable to
> > use a modified version of dirty ring buffer (as you suggested in the
> other thread), in that we can track dirty using (addr, len) range
> rather than a single page address.  That could be hard for KVM because
> in KVM the page will be mostly trapped in 4K granularity in page
> faults, and it'll also be hard to merge continuous entries with
> previous ones because the userspace could be reading the entries (so
> after we publish the previous 4K dirty page, we should not modify the
> entry any more).

An easy way would be to keep a couple of entries around and not push
them into the ring until later.  In fact, deferring the queue write
until there's a bunch of data to be pushed is a very handy optimization.

When building UAPIs it makes sense to try and keep them generic
rather than tying them to a given implementation.

That's one of the reasons I called for using something
resembling vring_packed_desc.


> VFIO should not have this restriction because the
> marking of dirty page range can be atomic when the range of pages are
> mapped or pinned.
> 
> Thanks,
> 
> -- 
> Peter Xu
Peter Xu Jan. 9, 2020, 7:23 p.m. UTC | #9
On Thu, Jan 09, 2020 at 02:13:54PM -0500, Michael S. Tsirkin wrote:
> On Thu, Jan 09, 2020 at 12:58:08PM -0500, Peter Xu wrote:
> > On Thu, Jan 09, 2020 at 09:47:11AM -0700, Alex Williamson wrote:
> > [...]
> 
> An easy way would be to keep a couple of entries around, not pushing
> them into the ring until later.  In fact deferring queue write until
> there's a bunch of data to be pushed is a very handy optimization.

I feel like I proposed a similar thing in the other thread. :-)

> 
> When building UAPI's it makes sense to try and keep them generic
> rather than tying them to a given implementation.
> 
> That's one of the reasons I called for using something
> resembling vring_packed_desc.

But again, I just want to make sure I don't over-engineer...

I'll wait for further feedback from others for this.

Thanks,
Michael S. Tsirkin Jan. 9, 2020, 7:37 p.m. UTC | #10
On Thu, Jan 09, 2020 at 02:23:18PM -0500, Peter Xu wrote:
> On Thu, Jan 09, 2020 at 02:13:54PM -0500, Michael S. Tsirkin wrote:
> > On Thu, Jan 09, 2020 at 12:58:08PM -0500, Peter Xu wrote:
> > > [...]
> > 
> > An easy way would be to keep a couple of entries around, not pushing
> > them into the ring until later.  In fact deferring queue write until
> > there's a bunch of data to be pushed is a very handy optimization.
> 
> I feel like I proposed similar thing in the other thread. :-)
> 
> > 
> > When building UAPI's it makes sense to try and keep them generic
> > rather than tying them to a given implementation.
> > 
> > That's one of the reasons I called for using something
> > resembling vring_packed_desc.
> 
> But again, I just want to make sure I don't over-engineer...


You will know when you start profiling in earnest.

> I'll wait for further feedback from others for this.
> 
> Thanks,
> 
> -- 
> Peter Xu
Peter Xu Jan. 9, 2020, 7:39 p.m. UTC | #11
On Thu, Jan 09, 2020 at 02:08:52PM -0500, Michael S. Tsirkin wrote:
> On Thu, Jan 09, 2020 at 12:08:49PM -0500, Peter Xu wrote:
> > On Thu, Jan 09, 2020 at 11:40:23AM -0500, Michael S. Tsirkin wrote:
> > 
> > [...]
> > 
> > > > > I know it's mostly relevant for huge VMs, but OTOH these
> > > > > probably use huge pages.
> > > > 
> > > > Yes huge VMs could benefit more, especially if the dirty rate is not
> > > > that high, I believe.  Though, could you elaborate on why huge pages
> > > > are special here?
> > > > 
> > > > Thanks,
> > > 
> > > With hugetlbfs there are fewer bits to test: e.g. with 2M pages a single
> > > bit set marks 512 pages as dirty.  We do not take advantage of this
> > > but it looks like a rather obvious optimization.
> > 
> > Right, but isn't that the trade-off between granularity of dirty
> > tracking and how easy it is to collect the dirty bits?  Say, it'll be
> > merely impossible to migrate 1G-huge-page-backed guests if we track
> > dirty bits using huge page granularity, since each touch of guest
> > memory will cause another 1G memory to be transferred even if most of
> > the content is the same.  2M can be somewhere in the middle, but still
> > the same write amplify issue exists.
> >
> 
> OK I see I'm unclear.
> 
> IIUC at the moment KVM never uses huge pages if any part of the huge page is
> tracked.

To be more precise - I think it's per-memslot.  Say, if a memslot is
dirty tracked, then no huge pages are used on the host for that memslot
(even if the guest maps huge pages over it).

> But if all parts of the page are written to then huge page
> is used.

I'm not sure of this... I think it's still in 4K granularity.

> 
> In this situation the whole huge page is dirty and needs to be migrated.

Note that in QEMU we always migrate pages in 4K for x86, iiuc (please
refer to ram_save_host_page() in QEMU).

> 
> > PS. that seems to be another topic after all besides the dirty ring
> > series because we need to change our policy first if we want to track
> > it with huge pages; with that, for dirty ring we can start to leverage
> > the kvm_dirty_gfn.pad to store the page size with another new kvm cap
> > when we really want.
> > 
> > Thanks,
> 
> Seems like leaking implementation detail to UAPI to me.

I'd say this isn't the only place where we make such an assumption
(please also refer to uffd_msg.pagefault.address).  IMHO it's not
wrong, because interfaces can be extended, but I am open to extending
kvm_dirty_gfn to carry a length/size or to making the pad larger (as
long as Paolo is fine with this).

Thanks,
Paolo Bonzini Jan. 9, 2020, 8:42 p.m. UTC | #12
On 09/01/20 20:39, Peter Xu wrote:
>>
>> IIUC at the moment KVM never uses huge pages if any part of the huge page is
>> tracked.
>
> To be more precise - I think it's per-memslot.  Say, if the memslot is
> dirty tracked, then no huge page on the host on that memslot (even if
> guest used huge page over that).
> 
>> But if all parts of the page are written to then huge page
>> is used.
>
> I'm not sure of this... I think it's still in 4K granularity.

Right.  Dirty tracking always uses 4K page size.

Paolo
Paolo Bonzini Jan. 9, 2020, 8:51 p.m. UTC | #13
On 09/01/20 20:13, Michael S. Tsirkin wrote:
> That's one of the reasons I called for using something
> resembling vring_packed_desc.

In principle it could make sense to use the ring-wrap detection
mechanism from vring_packed_desc instead of the producer/consumer
indices.  However, the element address/length indirection is unnecessary.

Also, unlike virtio, KVM needs to know if there are N free entries (N is
~512) before running a guest.  I'm not sure if that is possible with
ring-wrap counters, while it's trivial with producer/consumer indices.
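As a minimal sketch of why the check is trivial with indices (all names
and the layout here are illustrative assumptions, not the series' actual
structures):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ring bookkeeping: both indices increase monotonically and
 * wrap naturally in 32 bits, so "used = prod - cons" holds across
 * wraparound as long as the ring has at most 2^32 entries. */
struct dirty_ring {
    uint32_t prod;  /* advanced by KVM (producer) */
    uint32_t cons;  /* advanced by userspace (consumer) */
    uint32_t size;  /* number of entries */
};

/* Answer "are there at least n free entries?" in O(1). */
static inline bool ring_has_free(const struct dirty_ring *r, uint32_t n)
{
    return r->size - (r->prod - r->cons) >= n;
}
```

With indices the N~512 test before entering the guest is a single
subtraction and needs no per-entry state.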

Paolo
Michael S. Tsirkin Jan. 9, 2020, 10:21 p.m. UTC | #14
On Thu, Jan 09, 2020 at 09:51:50PM +0100, Paolo Bonzini wrote:
> On 09/01/20 20:13, Michael S. Tsirkin wrote:
> > That's one of the reasons I called for using something
> > resembling vring_packed_desc.
> 
> In principle it could make sense to use the ring-wrap detection
> mechanism from vring_packed_desc instead of the producer/consumer
> indices.  However, the element address/length indirection is unnecessary.
> 
> Also, unlike virtio, KVM needs to know if there are N free entries (N is
> ~512) before running a guest.  I'm not sure if that is possible with
> ring-wrap counters, while it's trivial with producer/consumer indices.
> 
> Paolo

Yes it's easy: just check whether the entry at current + 500 has been
consumed.  Unless scatter/gather is used, but then the answer is
simple - just don't use it :)
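A hedged sketch of that check with packed-ring style entries, where the
consumer clears a per-entry flag once it has harvested the entry (all
names here are illustrative assumptions, not the series' API):

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 1024u  /* assumed power of two */

/* Hypothetical entry layout: the producer sets ENTRY_DIRTY when it
 * publishes a gfn, the consumer clears it after harvesting. */
struct ring_entry {
    uint64_t gfn;
    uint32_t flags;
};
#define ENTRY_DIRTY 0x1u

/* Entries are produced and consumed in order, so if the slot n-1 ahead
 * of the producer has already been harvested (flag clear), every slot
 * in between must be free as well. */
static bool ring_has_free(const struct ring_entry *ring,
                          uint32_t prod_head, uint32_t n)
{
    return !(ring[(prod_head + n - 1) & (RING_SIZE - 1)].flags & ENTRY_DIRTY);
}
```

So the "N free entries" question also reduces to a single load, at the
cost of per-entry state instead of two shared indices.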
Michael S. Tsirkin Jan. 9, 2020, 10:28 p.m. UTC | #15
On Thu, Jan 09, 2020 at 02:39:49PM -0500, Peter Xu wrote:
> On Thu, Jan 09, 2020 at 02:08:52PM -0500, Michael S. Tsirkin wrote:
> > On Thu, Jan 09, 2020 at 12:08:49PM -0500, Peter Xu wrote:
> > > On Thu, Jan 09, 2020 at 11:40:23AM -0500, Michael S. Tsirkin wrote:
> > > 
> > > [...]
> > > 
> > > > > > I know it's mostly relevant for huge VMs, but OTOH these
> > > > > > probably use huge pages.
> > > > > 
> > > > > Yes huge VMs could benefit more, especially if the dirty rate is not
> > > > > that high, I believe.  Though, could you elaborate on why huge pages
> > > > > are special here?
> > > > > 
> > > > > Thanks,
> > > > 
> > > > With hugetlbfs there are less bits to test: e.g. with 2M pages a single
> > > > bit set marks 512 pages as dirty.  We do not take advantage of this
> > > > but it looks like a rather obvious optimization.
> > > 
> > > Right, but isn't that the trade-off between granularity of dirty
> > > tracking and how easy it is to collect the dirty bits?  Say, it'll be
> > > merely impossible to migrate 1G-huge-page-backed guests if we track
> > > dirty bits using huge page granularity, since each touch of guest
> > > memory will cause another 1G memory to be transferred even if most of
> > > the content is the same.  2M can be somewhere in the middle, but still
> > > the same write amplify issue exists.
> > >
> > 
> > OK I see I'm unclear.
> > 
> > IIUC at the moment KVM never uses huge pages if any part of the huge page is
> > tracked.
> 
> To be more precise - I think it's per-memslot.  Say, if the memslot is
> dirty tracked, then no huge page on the host on that memslot (even if
> guest used huge page over that).

Yea ... so does it make sense to make this implementation detail
leak through UAPI?

> > But if all parts of the page are written to then huge page
> > is used.
> 
> I'm not sure of this... I think it's still in 4K granularity.
> 
> > 
> > In this situation the whole huge page is dirty and needs to be migrated.
> 
> Note that in QEMU we always migrate pages in 4K for x86, iiuc (please
> refer to ram_save_host_page() in QEMU).
> 
> > 
> > > PS. that seems to be another topic after all besides the dirty ring
> > > series because we need to change our policy first if we want to track
> > > it with huge pages; with that, for dirty ring we can start to leverage
> > > the kvm_dirty_gfn.pad to store the page size with another new kvm cap
> > > when we really want.
> > > 
> > > Thanks,
> > 
> > Seems like leaking implementation detail to UAPI to me.
> 
> I'd say it's not the only place we have an assumption at least (please
> also refer to uffd_msg.pagefault.address).  IMHO it's not something
> wrong because interfaces can be extended, but I am open to extending
> kvm_dirty_gfn to cover a length/size or make the pad larger (as long
> as Paolo is fine with this).
> 
> Thanks,
> 
> -- 
> Peter Xu
Peter Xu Jan. 10, 2020, 3:10 p.m. UTC | #16
On Thu, Jan 09, 2020 at 05:28:36PM -0500, Michael S. Tsirkin wrote:
> On Thu, Jan 09, 2020 at 02:39:49PM -0500, Peter Xu wrote:
> > On Thu, Jan 09, 2020 at 02:08:52PM -0500, Michael S. Tsirkin wrote:
> > > On Thu, Jan 09, 2020 at 12:08:49PM -0500, Peter Xu wrote:
> > > > On Thu, Jan 09, 2020 at 11:40:23AM -0500, Michael S. Tsirkin wrote:
> > > > 
> > > > [...]
> > > > 
> > > > > > > I know it's mostly relevant for huge VMs, but OTOH these
> > > > > > > probably use huge pages.
> > > > > > 
> > > > > > Yes huge VMs could benefit more, especially if the dirty rate is not
> > > > > > that high, I believe.  Though, could you elaborate on why huge pages
> > > > > > are special here?
> > > > > > 
> > > > > > Thanks,
> > > > > 
> > > > > With hugetlbfs there are less bits to test: e.g. with 2M pages a single
> > > > > bit set marks 512 pages as dirty.  We do not take advantage of this
> > > > > but it looks like a rather obvious optimization.
> > > > 
> > > > Right, but isn't that the trade-off between granularity of dirty
> > > > tracking and how easy it is to collect the dirty bits?  Say, it'll be
> > > > merely impossible to migrate 1G-huge-page-backed guests if we track
> > > > dirty bits using huge page granularity, since each touch of guest
> > > > memory will cause another 1G memory to be transferred even if most of
> > > > the content is the same.  2M can be somewhere in the middle, but still
> > > > the same write amplify issue exists.
> > > >
> > > 
> > > OK I see I'm unclear.
> > > 
> > > IIUC at the moment KVM never uses huge pages if any part of the huge page is
> > > tracked.
> > 
> > To be more precise - I think it's per-memslot.  Say, if the memslot is
> > dirty tracked, then no huge page on the host on that memslot (even if
> > guest used huge page over that).
> 
> Yea ... so does it make sense to make this implementation detail
> leak through UAPI?

I think that's not a leak of an internal implementation detail; we
just define the interface such that the address for each kvm_dirty_gfn
is always host page aligned (which by default means no huge pages) and
points to a single host page, that's all.  The host page size is
always known to userspace after all, so IMHO it's fine.  Thanks,
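To illustrate what that guarantee buys userspace, here is a hedged
sketch of turning one harvested entry into a host virtual address.  The
kvm_dirty_gfn fields follow this series (with offset assumed to count
pages within the memslot); the per-slot lookup helper and slot table are
purely hypothetical:

```c
#include <stdint.h>
#include <unistd.h>

/* Layout as used by this series (pad could later carry a page size). */
struct kvm_dirty_gfn {
    uint32_t pad;
    uint32_t slot;
    uint64_t offset;   /* page offset within the memslot (assumed) */
};

/* Hypothetical userspace bookkeeping: where each memslot is mapped. */
static uint64_t slot_base[64];

static uint64_t slot_userspace_addr(uint32_t slot)
{
    return slot_base[slot];
}

/* Every harvested entry names exactly one host page, so the translation
 * is a multiply and an add, and the result is host page aligned by
 * construction (assuming slot bases are page aligned). */
static uint64_t dirty_gfn_to_hva(const struct kvm_dirty_gfn *gfn)
{
    uint64_t psz = (uint64_t)sysconf(_SC_PAGESIZE);
    return slot_userspace_addr(gfn->slot) + gfn->offset * psz;
}
```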
Paolo Bonzini Jan. 19, 2020, 9:11 a.m. UTC | #17
On 09/01/20 15:57, Peter Xu wrote:
> Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
> (based on kvm/queue)
> 
> Please refer to either the previous cover letters, or documentation
> update in patch 12 for the big picture.  Previous posts:
> 
> V1: https://lore.kernel.org/kvm/20191129213505.18472-1-peterx@redhat.com
> V2: https://lore.kernel.org/kvm/20191221014938.58831-1-peterx@redhat.com
> 
> The major change in V3 is that we dropped the whole waitqueue and the
> global lock. With that, we have clean per-vcpu ring and no default
> ring any more.  The two kvmgt refactoring patches were also included
> to show the dependency of the works.
> 
> Patchset layout:
> 
> Patch 1-2:         Picked up from kvmgt refactoring
> Patch 3-6:         Small patches that are not directly related,
>                    (So can be acked/nacked/picked as standalone)
> Patch 7-11:        Prepares for the dirty ring interface
> Patch 12:          Major implementation
> Patch 13-14:       Quick follow-ups for patch 8
> Patch 15-21:       Test cases
> 
> V3 changelog:
> 
> - fail userspace writable maps on dirty ring ranges [Jason]
> - commit message fixups [Paolo]
> - change __x86_set_memory_region to return hva [Paolo]
> - cacheline align for indices [Paolo, Jason]
> - drop waitqueue, global lock, etc., include kvmgt rework patchset
> - take lock for __x86_set_memory_region() (otherwise it triggers a
>   lockdep in latest kvm/queue) [Paolo]
> - check KVM_DIRTY_LOG_PAGE_OFFSET in kvm_vm_ioctl_enable_dirty_log_ring
> - one more patch to drop x86_set_memory_region [Paolo]
> - one more patch to remove extra srcu usage in init_rmode_identity_map()
> - add some r-bs for Paolo
> 
> Please review, thanks.
> 
> Paolo Bonzini (1):
>   KVM: Move running VCPU from ARM to common code
> 
> Peter Xu (18):
>   KVM: Remove kvm_read_guest_atomic()
>   KVM: Add build-time error check on kvm_run size
>   KVM: X86: Change parameter for fast_page_fault tracepoint
>   KVM: X86: Don't take srcu lock in init_rmode_identity_map()
>   KVM: Cache as_id in kvm_memory_slot
>   KVM: X86: Drop x86_set_memory_region()
>   KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
>   KVM: Pass in kvm pointer into mark_page_dirty_in_slot()
>   KVM: X86: Implement ring-based dirty memory tracking
>   KVM: Make dirty ring exclusive to dirty bitmap log
>   KVM: Don't allocate dirty bitmap if dirty ring is enabled
>   KVM: selftests: Always clear dirty bitmap after iteration
>   KVM: selftests: Sync uapi/linux/kvm.h to tools/
>   KVM: selftests: Use a single binary for dirty/clear log test
>   KVM: selftests: Introduce after_vcpu_run hook for dirty log test
>   KVM: selftests: Add dirty ring buffer test
>   KVM: selftests: Let dirty_log_test async for dirty ring test
>   KVM: selftests: Add "-c" parameter to dirty log test
> 
> Yan Zhao (2):
>   vfio: introduce vfio_iova_rw to read/write a range of IOVAs
>   drm/i915/gvt: subsitute kvm_read/write_guest with vfio_iova_rw
> 
>  Documentation/virt/kvm/api.txt                |  96 ++++
>  arch/arm/include/asm/kvm_host.h               |   2 -
>  arch/arm64/include/asm/kvm_host.h             |   2 -
>  arch/x86/include/asm/kvm_host.h               |   7 +-
>  arch/x86/include/uapi/asm/kvm.h               |   1 +
>  arch/x86/kvm/Makefile                         |   3 +-
>  arch/x86/kvm/mmu/mmu.c                        |   6 +
>  arch/x86/kvm/mmutrace.h                       |   9 +-
>  arch/x86/kvm/svm.c                            |   3 +-
>  arch/x86/kvm/vmx/vmx.c                        |  86 ++--
>  arch/x86/kvm/x86.c                            |  43 +-
>  drivers/gpu/drm/i915/gvt/kvmgt.c              |  25 +-
>  drivers/vfio/vfio.c                           |  45 ++
>  drivers/vfio/vfio_iommu_type1.c               |  81 ++++
>  include/linux/kvm_dirty_ring.h                |  55 +++
>  include/linux/kvm_host.h                      |  37 +-
>  include/linux/vfio.h                          |   5 +
>  include/trace/events/kvm.h                    |  78 ++++
>  include/uapi/linux/kvm.h                      |  33 ++
>  tools/include/uapi/linux/kvm.h                |  38 ++
>  tools/testing/selftests/kvm/Makefile          |   2 -
>  .../selftests/kvm/clear_dirty_log_test.c      |   2 -
>  tools/testing/selftests/kvm/dirty_log_test.c  | 420 ++++++++++++++++--
>  .../testing/selftests/kvm/include/kvm_util.h  |   4 +
>  tools/testing/selftests/kvm/lib/kvm_util.c    |  72 +++
>  .../selftests/kvm/lib/kvm_util_internal.h     |   3 +
>  virt/kvm/arm/arch_timer.c                     |   2 +-
>  virt/kvm/arm/arm.c                            |  29 --
>  virt/kvm/arm/perf.c                           |   6 +-
>  virt/kvm/arm/vgic/vgic-mmio.c                 |  15 +-
>  virt/kvm/dirty_ring.c                         | 162 +++++++
>  virt/kvm/kvm_main.c                           | 215 +++++++--
>  32 files changed, 1379 insertions(+), 208 deletions(-)
>  create mode 100644 include/linux/kvm_dirty_ring.h
>  delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c
>  create mode 100644 virt/kvm/dirty_ring.c
> 

Queued patches 3-6, 8-9, 11; thanks!

Paolo