
[v8,00/14] KVM: Dirty ring interface

Message ID 20200331190000.659614-1-peterx@redhat.com (mailing list archive)

Message

Peter Xu March 31, 2020, 6:59 p.m. UTC
KVM branch:
  https://github.com/xzpeter/linux/tree/kvm-dirty-ring

QEMU branch for testing:
  https://github.com/xzpeter/qemu/tree/kvm-dirty-ring

v8:
- rebase to kvm/next
- fix test bisection issues [Drew]
- reword comment for __x86_set_memory_region [Sean]
- document fixup on "mutual exclusive", etc. [Sean]

For previous versions, please refer to:

V1: https://lore.kernel.org/kvm/20191129213505.18472-1-peterx@redhat.com
V2: https://lore.kernel.org/kvm/20191221014938.58831-1-peterx@redhat.com
V3: https://lore.kernel.org/kvm/20200109145729.32898-1-peterx@redhat.com
V4: https://lore.kernel.org/kvm/20200205025105.367213-1-peterx@redhat.com
V5: https://lore.kernel.org/kvm/20200304174947.69595-1-peterx@redhat.com
V6: https://lore.kernel.org/kvm/20200309214424.330363-1-peterx@redhat.com
V7: https://lore.kernel.org/kvm/20200318163720.93929-1-peterx@redhat.com

Overview
============

This is continued work from Lei Cao <lei.cao@stratus.com> and Paolo
Bonzini on the KVM dirty ring interface.

The new dirty ring interface is another way to collect dirty pages of
a virtual machine.  It differs from the existing dirty logging
interface in a few major ways:

  - Data format: The dirty data is kept in a ring rather than in a
    bitmap, so the amount of data to sync for dirty logging no longer
    depends on the size of guest memory, only on the rate of
    dirtying.  Also, the dirty ring is per-vcpu, while the dirty
    bitmap is per-vm.

  - Data copy: Syncing dirty pages no longer requires copying data;
    instead the ring is shared between userspace and the kernel via
    shared pages (mmap() on the vcpu fd; see the sketch after this
    list).

  - Interface: Instead of the old KVM_GET_DIRTY_LOG and
    KVM_CLEAR_DIRTY_LOG ioctls, the ring uses the new
    KVM_RESET_DIRTY_RINGS ioctl to put the collected dirty pages back
    into write-protected mode (it works like KVM_CLEAR_DIRTY_LOG, but
    ring based).  Collecting the dirty bits only requires reading the
    ring data; no ioctl is needed.
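
For concreteness, here is a minimal userspace sketch of enabling the
ring and mapping one vcpu's ring.  It assumes the uapi names added by
this series (KVM_CAP_DIRTY_LOG_RING, KVM_DIRTY_LOG_PAGE_OFFSET, struct
kvm_dirty_gfn); the updated Documentation/virt/kvm/api.rst in the
series is the authoritative reference, and error handling is omitted:

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch only: ring_bytes must be a power of two and a multiple of the
 * page size, and the capability must be enabled before any vcpu is
 * created. */
static int enable_dirty_ring(int vm_fd, __u32 ring_bytes)
{
        struct kvm_enable_cap cap = {
                .cap = KVM_CAP_DIRTY_LOG_RING,
                .args[0] = ring_bytes,
        };

        return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

/* Each vcpu's ring is exposed at a fixed page offset of its mmap area. */
static struct kvm_dirty_gfn *map_dirty_ring(int vcpu_fd, __u32 ring_bytes)
{
        return mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
                    vcpu_fd, (off_t)KVM_DIRTY_LOG_PAGE_OFFSET * getpagesize());
}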

Ring Layout
===========

The KVM dirty ring is per-vcpu.  Each ring is an array of struct
kvm_dirty_gfn, defined as:

struct kvm_dirty_gfn {
        __u32 flags;
        __u32 slot; /* as_id | slot_id */
        __u64 offset;
};

Each GFN is a state machine in itself.  The state is embedded in the
flags field, as defined in the uapi header:

/*
 * KVM dirty GFN flags, defined as:
 *
 * |---------------+---------------+--------------|
 * | bit 1 (reset) | bit 0 (dirty) | Status       |
 * |---------------+---------------+--------------|
 * |             0 |             0 | Invalid GFN  |
 * |             0 |             1 | Dirty GFN    |
 * |             1 |             X | GFN to reset |
 * |---------------+---------------+--------------|
 *
 * Lifecycle of a dirty GFN goes like:
 *
 *      dirtied         collected        reset
 * 00 -----------> 01 -------------> 1X -------+
 *  ^                                          |
 *  |                                          |
 *  +------------------------------------------+
 *
 * The userspace program is only responsible for the 01->1X state
 * conversion (to collect dirty bits).  Also, it must not skip any
 * dirty bits so that dirty bits are always collected in sequence.
 */
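
To make the lifecycle above concrete, here is a hedged sketch of the
userspace harvest loop: walk the ring from a cursor kept purely in
userspace, collect every entry whose dirty bit is set, flag it for
reset, and finally ask the kernel to re-protect the collected pages.
The flag macros mirror the bit table above, consume_dirty_gfn() is a
hypothetical consumer, memory barriers are omitted, and the as_id
decode assumes the upper 16 bits of the slot field; the selftest added
by this series (dirty_log_test.c) is the authoritative example:

#include <linux/kvm.h>
#include <sys/ioctl.h>

#ifndef KVM_DIRTY_GFN_F_DIRTY           /* mirror the bit table above */
#define KVM_DIRTY_GFN_F_DIRTY   (1 << 0)
#define KVM_DIRTY_GFN_F_RESET   (1 << 1)
#endif

/* Hypothetical consumer, e.g. setting a bit in a migration bitmap. */
extern void consume_dirty_gfn(__u32 as_id, __u32 slot_id, __u64 offset);

/* Collect dirty GFNs from one vcpu ring (01 -> 1X), then ask the kernel
 * to reset them (1X -> 00).  'nents' is the number of kvm_dirty_gfn
 * entries in the mapped ring. */
static void harvest_ring(int vm_fd, struct kvm_dirty_gfn *ring,
                         __u32 nents, __u32 *fetch)
{
        while (ring[*fetch % nents].flags & KVM_DIRTY_GFN_F_DIRTY) {
                struct kvm_dirty_gfn *e = &ring[*fetch % nents];

                consume_dirty_gfn(e->slot >> 16,     /* as_id (assumed)   */
                                  e->slot & 0xffff,  /* slot_id (assumed) */
                                  e->offset);
                e->flags |= KVM_DIRTY_GFN_F_RESET;   /* collected: 01 -> 1X */
                (*fetch)++;
        }

        /* The kernel re-protects the collected pages (1X -> 00). */
        ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0);
}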

Testing
=======

This series provides both the implementation of the KVM dirty ring and
a test case.  I've also implemented the QEMU counterpart that can run
with the new KVM; the link can be found at the top of the cover
letter.  However, that is still a very early version, prone to change
and further optimization.

I did some measurements with the new method on a 24G guest running a
dirty workload.  I don't see any speedup so far; under a heavy dirty
load it can even be slower (e.g., at an 800MB/s random dirty rate, the
kvm dirty ring takes ~73s on average to complete migration while dirty
logging needs only ~55s on average).  That is understandable, though,
because a 24G guest means only a ~1MB dirty bitmap, which is still a
comfortable case for dirty logging, while a heavier workload is the
worst case for the dirty ring.

More tests are welcome on bigger hosts/guests, especially with
COLO-like workloads.

Please review, thanks.

Peter Xu (14):
  KVM: X86: Change parameter for fast_page_fault tracepoint
  KVM: Cache as_id in kvm_memory_slot
  KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
  KVM: Pass in kvm pointer into mark_page_dirty_in_slot()
  KVM: X86: Implement ring-based dirty memory tracking
  KVM: Make dirty ring exclusive to dirty bitmap log
  KVM: Don't allocate dirty bitmap if dirty ring is enabled
  KVM: selftests: Always clear dirty bitmap after iteration
  KVM: selftests: Sync uapi/linux/kvm.h to tools/
  KVM: selftests: Use a single binary for dirty/clear log test
  KVM: selftests: Introduce after_vcpu_run hook for dirty log test
  KVM: selftests: Add dirty ring buffer test
  KVM: selftests: Let dirty_log_test async for dirty ring test
  KVM: selftests: Add "-c" parameter to dirty log test

 Documentation/virt/kvm/api.rst                | 123 +++++
 arch/x86/include/asm/kvm_host.h               |   6 +-
 arch/x86/include/uapi/asm/kvm.h               |   1 +
 arch/x86/kvm/Makefile                         |   3 +-
 arch/x86/kvm/mmu/mmu.c                        |  10 +-
 arch/x86/kvm/mmutrace.h                       |   9 +-
 arch/x86/kvm/svm.c                            |   9 +-
 arch/x86/kvm/vmx/vmx.c                        |  89 +--
 arch/x86/kvm/x86.c                            |  48 +-
 include/linux/kvm_dirty_ring.h                | 103 ++++
 include/linux/kvm_host.h                      |  19 +
 include/trace/events/kvm.h                    |  78 +++
 include/uapi/linux/kvm.h                      |  53 ++
 tools/include/uapi/linux/kvm.h                | 100 +++-
 tools/testing/selftests/kvm/Makefile          |   2 -
 .../selftests/kvm/clear_dirty_log_test.c      |   6 -
 tools/testing/selftests/kvm/dirty_log_test.c  | 505 ++++++++++++++++--
 .../testing/selftests/kvm/include/kvm_util.h  |   4 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  68 +++
 .../selftests/kvm/lib/kvm_util_internal.h     |   4 +
 virt/kvm/dirty_ring.c                         | 195 +++++++
 virt/kvm/kvm_main.c                           | 162 +++++-
 22 files changed, 1459 insertions(+), 138 deletions(-)
 create mode 100644 include/linux/kvm_dirty_ring.h
 delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c
 create mode 100644 virt/kvm/dirty_ring.c

Comments

Peter Xu April 22, 2020, 6:51 p.m. UTC | #1
Hi,

TL;DR: I'm wondering whether we should record the pure GPA/GFN instead of the
(slot_id, slot_offset) tuple for dirty pages in the kvm dirty ring, to unbind
kvm_dirty_gfn from memslots.

(A slightly longer version starts...)

The problem is that binding dirty tracking operations to KVM memslots is a
restriction that requires synchronization with memslot changes, which in turn
requires synchronization across all the vcpus because they are the consumers
of memslots.  E.g., when we remove a memory slot, we need to flush all the
dirty bits correctly before we do the removal of the memslot.  That's actually
a known defect for QEMU/KVM [1] (I bet it could be a defect for many other
hypervisors...) right now with current dirty logging.  Meanwhile, even if we
fix it, that procedure does not scale at all and is prone to deadlocks.

Here memory removal is really a (still corner-case, but relatively) important
scenario to think about for dirty logging, compared to memory additions &
moves, because a memory addition always starts with no dirty pages, and we
don't really move RAM a lot (or do we ever?!) in a general VM use case.

Then I took a step back to think about why we need this dirty bit information
at all if the memslot is going to be removed.

There are two cases:

  - When the memslot is going to be removed forever, the dirty information
    is indeed meaningless and can be dropped, and,

  - When the memslot is going to be removed but quickly added back with a
    changed size, we need to keep those dirty bits, because that is just a
    common way to e.g. punch an MMIO hole in an existing RAM region (here
    I'd confess I feel like using "slot_id" to identify a memslot is really
    unfriendly syscall design for things like "hole punching" in the RAM
    address space...  However, such a "hole punch" operation is really
    needed even for a common guest, for either system reboots or device
    hotplugs, etc.).

The real scenario we want to cover for dirty tracking is the 2nd one.

If we can track dirty pages using the raw GPA, the 2nd scenario solves itself:
because we know we'll add those memslots back (though it might be with a
different slot ID), the GPA value will still make sense, which means we should
be able to avoid any kind of synchronization for things like memory removals,
as long as userspace is aware of that.

With that, when we fetch the dirty bits, we look up the memslot dynamically,
drop the bits if no memslot exists at that address any more (e.g., permanent
removals), and otherwise use whatever memslot currently covers that guest
physical address (a rough sketch of that follows below).  We do still need to
handle memory moves, where userspace must still take care of dirty bit
flushing and syncing, but that essentially never happens, so there is little
to take care of there either.
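
As a purely illustrative sketch of that idea (nothing below is existing code;
the type and both helpers are hypothetical), the fetch side could look like:

#include <linux/types.h>

/* Hypothetical userspace view of whatever memslot/RAM region currently
 * covers a GPA. */
struct memslot_view {
        __u64 base_gpa;
        __u64 size;
};

extern struct memslot_view *find_memslot_by_gpa(__u64 gpa);   /* hypothetical */
extern void mark_dirty_in_ramblock(struct memslot_view *s, __u64 offset);

static void consume_dirty_gpa(__u64 gpa)
{
        struct memslot_view *slot = find_memslot_by_gpa(gpa);

        /* Permanently removed region: the dirty bit is meaningless, drop it. */
        if (!slot)
                return;

        /* Whatever slot covers the GPA now gets the bit, even if the region
         * was re-added with a different slot ID (e.g. after an MMIO hole
         * punch), so no synchronization with memslot changes is needed. */
        mark_dirty_in_ramblock(slot, gpa - slot->base_gpa);
}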

Does this make sense?  Comments greatly welcomed.

Thanks,

[1] https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg08361.html
Tian, Kevin April 23, 2020, 6:28 a.m. UTC | #2
> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, April 23, 2020 2:52 AM
> 
> Hi,
> 
> TL;DR: I'm thinking whether we should record pure GPA/GFN instead of
> (slot_id,
> slot_offset) tuple for dirty pages in kvm dirty ring to unbind kvm_dirty_gfn
> with memslots.
> 
> (A slightly longer version starts...)
> 
> The problem is that binding dirty tracking operations to KVM memslots is a
> restriction that needs synchronization to memslot changes, which further
> needs
> synchronization across all the vcpus because they're the consumers of
> memslots.
> E.g., when we remove a memory slot, we need to flush all the dirty bits
> correctly before we do the removal of the memslot.  That's actually an
> known
> defect for QEMU/KVM [1] (I bet it could be a defect for many other
> hypervisors...) right now with current dirty logging.  Meanwhile, even if we
> fix it, that procedure is not scale at all, and error prone to dead locks.
> 
> Here memory removal is really an (still corner-cased but relatively) important
> scenario to think about for dirty logging comparing to memory additions &
> movings.  Because memory addition will always have no initial dirty page,
> and
> we don't really move RAM a lot (or do we ever?!) for a general VM use case.
> 
> Then I went a step back to think about why we need these dirty bit
> information
> after all if the memslot is going to be removed?
> 
> There're two cases:
> 
>   - When the memslot is going to be removed forever, then the dirty
> information
>     is indeed meaningless and can be dropped, and,
> 
>   - When the memslot is going to be removed but quickly added back with
> changed
>     size, then we need to keep those dirty bits because it's just a commmon
> way
>     to e.g. punch an MMIO hole in an existing RAM region (here I'd confess I
>     feel like using "slot_id" to identify memslot is really unfriendly syscall
>     design for things like "hole punchings" in the RAM address space...
>     However such "punch hold" operation is really needed even for a common
>     guest for either system reboots or device hotplugs, etc.).

why would device hotplug punch a hole in an existing RAM region? 

> 
> The real scenario we want to cover for dirty tracking is the 2nd one.
> 
> If we can track dirty using raw GPA, the 2nd scenario is solved itself.
> Because we know we'll add those memslots back (though it might be with a
> different slot ID), then the GPA value will still make sense, which means we
> should be able to avoid any kind of synchronization for things like memory
> removals, as long as the userspace is aware of that.

A curious question: what if the backing storage of the affected GPA
is changed after adding it back?  Does the dirty info recorded for the
previous backing storage still make sense for the newer one?

Thanks
Kevin

> 
> With that, when we fetch the dirty bits, we lookup the memslot dynamically,
> drop bits if the memslot does not exist on that address (e.g., permanent
> removals), and use whatever memslot is there for that guest physical
> address.
> Though we for sure still need to handle memory move, that the userspace
> needs
> to still take care of dirty bit flushing and sync for a memory move, however
> that's merely not happening so nothing to take care about either.
> 
> Does this makes sense?  Comments greatly welcomed..
> 
> Thanks,
> 
> [1] https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg08361.html
> 
> --
> Peter Xu
Peter Xu April 23, 2020, 3:22 p.m. UTC | #3
On Thu, Apr 23, 2020 at 06:28:43AM +0000, Tian, Kevin wrote:
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Thursday, April 23, 2020 2:52 AM
> > 
> > Hi,
> > 
> > TL;DR: I'm thinking whether we should record pure GPA/GFN instead of
> > (slot_id,
> > slot_offset) tuple for dirty pages in kvm dirty ring to unbind kvm_dirty_gfn
> > with memslots.
> > 
> > (A slightly longer version starts...)
> > 
> > The problem is that binding dirty tracking operations to KVM memslots is a
> > restriction that needs synchronization to memslot changes, which further
> > needs
> > synchronization across all the vcpus because they're the consumers of
> > memslots.
> > E.g., when we remove a memory slot, we need to flush all the dirty bits
> > correctly before we do the removal of the memslot.  That's actually an
> > known
> > defect for QEMU/KVM [1] (I bet it could be a defect for many other
> > hypervisors...) right now with current dirty logging.  Meanwhile, even if we
> > fix it, that procedure is not scale at all, and error prone to dead locks.
> > 
> > Here memory removal is really an (still corner-cased but relatively) important
> > scenario to think about for dirty logging comparing to memory additions &
> > movings.  Because memory addition will always have no initial dirty page,
> > and
> > we don't really move RAM a lot (or do we ever?!) for a general VM use case.
> > 
> > Then I went a step back to think about why we need these dirty bit
> > information
> > after all if the memslot is going to be removed?
> > 
> > There're two cases:
> > 
> >   - When the memslot is going to be removed forever, then the dirty
> > information
> >     is indeed meaningless and can be dropped, and,
> > 
> >   - When the memslot is going to be removed but quickly added back with
> > changed
> >     size, then we need to keep those dirty bits because it's just a commmon
> > way
> >     to e.g. punch an MMIO hole in an existing RAM region (here I'd confess I
> >     feel like using "slot_id" to identify memslot is really unfriendly syscall
> >     design for things like "hole punchings" in the RAM address space...
> >     However such "punch hold" operation is really needed even for a common
> >     guest for either system reboots or device hotplugs, etc.).
> 
> why would device hotplug punch a hole in an existing RAM region? 

I thought it could happen because I used to trace the KVM ioctls and saw
memslot changes during driver loading.  But later, when I tried to hotplug a
device, I saw that it doesn't...  The new MMIO regions are added only in the
0xfe000000 area for a virtio-net:

  00000000fe000000-00000000fe000fff (prio 0, i/o): virtio-pci-common
  00000000fe001000-00000000fe001fff (prio 0, i/o): virtio-pci-isr
  00000000fe002000-00000000fe002fff (prio 0, i/o): virtio-pci-device
  00000000fe003000-00000000fe003fff (prio 0, i/o): virtio-pci-notify
  00000000fe840000-00000000fe84002f (prio 0, i/o): msix-table
  00000000fe840800-00000000fe840807 (prio 0, i/o): msix-pba

Does it mean that device plugging is guaranteed not to trigger RAM changes?  I
am really curious which cases we need to consider where we still need to keep
the dirty bits across a memory removal; if system reset is the only case, then
it could be even easier (because we might be able to avoid the sync on memory
removal and instead do it once in a system reset hook)...

> 
> > 
> > The real scenario we want to cover for dirty tracking is the 2nd one.
> > 
> > If we can track dirty using raw GPA, the 2nd scenario is solved itself.
> > Because we know we'll add those memslots back (though it might be with a
> > different slot ID), then the GPA value will still make sense, which means we
> > should be able to avoid any kind of synchronization for things like memory
> > removals, as long as the userspace is aware of that.
> 
> A curious question. What about the backing storage of the affected GPA 
> is changed after adding back? Is recorded dirty info for previous backing 
> storage still making sense for the newer one?

It's the case of a permanent removal plus a separate addition, IIUC.  Then the
worst case is that we get some extra dirty bits set on that new memory region,
but IMHO that's benign (we'll migrate some extra pages even though they could
be zero pages).

Thanks,

> 
> Thanks
> Kevin
> 
> > 
> > With that, when we fetch the dirty bits, we lookup the memslot dynamically,
> > drop bits if the memslot does not exist on that address (e.g., permanent
> > removals), and use whatever memslot is there for that guest physical
> > address.
> > Though we for sure still need to handle memory move, that the userspace
> > needs
> > to still take care of dirty bit flushing and sync for a memory move, however
> > that's merely not happening so nothing to take care about either.
> > 
> > Does this makes sense?  Comments greatly welcomed..
> > 
> > Thanks,
> > 
> > [1] https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg08361.html
> > 
> > --
> > Peter Xu
>
Tian, Kevin April 24, 2020, 6:01 a.m. UTC | #4
> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, April 23, 2020 11:23 PM
> 
> On Thu, Apr 23, 2020 at 06:28:43AM +0000, Tian, Kevin wrote:
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Thursday, April 23, 2020 2:52 AM
> > >
> > > Hi,
> > >
> > > TL;DR: I'm thinking whether we should record pure GPA/GFN instead of
> > > (slot_id,
> > > slot_offset) tuple for dirty pages in kvm dirty ring to unbind
> kvm_dirty_gfn
> > > with memslots.
> > >
> > > (A slightly longer version starts...)
> > >
> > > The problem is that binding dirty tracking operations to KVM memslots is
> a
> > > restriction that needs synchronization to memslot changes, which further
> > > needs
> > > synchronization across all the vcpus because they're the consumers of
> > > memslots.
> > > E.g., when we remove a memory slot, we need to flush all the dirty bits
> > > correctly before we do the removal of the memslot.  That's actually an
> > > known
> > > defect for QEMU/KVM [1] (I bet it could be a defect for many other
> > > hypervisors...) right now with current dirty logging.  Meanwhile, even if
> we
> > > fix it, that procedure is not scale at all, and error prone to dead locks.
> > >
> > > Here memory removal is really an (still corner-cased but relatively)
> important
> > > scenario to think about for dirty logging comparing to memory additions
> &
> > > movings.  Because memory addition will always have no initial dirty page,
> > > and
> > > we don't really move RAM a lot (or do we ever?!) for a general VM use
> case.
> > >
> > > Then I went a step back to think about why we need these dirty bit
> > > information
> > > after all if the memslot is going to be removed?
> > >
> > > There're two cases:
> > >
> > >   - When the memslot is going to be removed forever, then the dirty
> > > information
> > >     is indeed meaningless and can be dropped, and,
> > >
> > >   - When the memslot is going to be removed but quickly added back with
> > > changed
> > >     size, then we need to keep those dirty bits because it's just a commmon
> > > way
> > >     to e.g. punch an MMIO hole in an existing RAM region (here I'd confess
> I
> > >     feel like using "slot_id" to identify memslot is really unfriendly syscall
> > >     design for things like "hole punchings" in the RAM address space...
> > >     However such "punch hold" operation is really needed even for a
> common
> > >     guest for either system reboots or device hotplugs, etc.).
> >
> > why would device hotplug punch a hole in an existing RAM region?
> 
> I thought it could happen because I used to trace the KVM ioctls and see the
> memslot changes during driver loading.  But later when I tried to hotplug a

Is there more detail on why driver loading may lead to memslot changes?

> device I do see that it won't...  The new MMIO regions are added only into
> 0xfe000000 for a virtio-net:
> 
>   00000000fe000000-00000000fe000fff (prio 0, i/o): virtio-pci-common
>   00000000fe001000-00000000fe001fff (prio 0, i/o): virtio-pci-isr
>   00000000fe002000-00000000fe002fff (prio 0, i/o): virtio-pci-device
>   00000000fe003000-00000000fe003fff (prio 0, i/o): virtio-pci-notify
>   00000000fe840000-00000000fe84002f (prio 0, i/o): msix-table
>   00000000fe840800-00000000fe840807 (prio 0, i/o): msix-pba
> 
> Does it mean that device plugging is guaranteed to not trigger RAM changes?

I'd think so.  Otherwise, from the guest's p.o.v., any device hotplug would
imply doing a memory hot-unplug first, which would be a bad design.

> I
> am really curious about what cases we need to consider in which we need to
> keep
> the dirty bits for a memory removal, and if system reset is the only case, then
> it could be even easier (because we might be able to avoid the sync in
> memory
> removal but do that once in a sys reset hook)...

Possibly memory hot-unplug, as allowed by recent virtio-mem? 

Btw, VFIO faces a similar problem when unmapping a DMA range (e.g. when a
vIOMMU is enabled) during the dirty log phase: there could be some dirty bits
which are not retrieved when the unmapping happens.  VFIO chooses to return the
dirty bits in a buffer passed in the unmapping parameters.  Could the memslot
interface do a similar thing, by allowing userspace to specify a buffer pointer
to hold whatever dirty pages were recorded for the slot that is being removed?

> 
> >
> > >
> > > The real scenario we want to cover for dirty tracking is the 2nd one.
> > >
> > > If we can track dirty using raw GPA, the 2nd scenario is solved itself.
> > > Because we know we'll add those memslots back (though it might be with
> a
> > > different slot ID), then the GPA value will still make sense, which means
> we
> > > should be able to avoid any kind of synchronization for things like
> memory
> > > removals, as long as the userspace is aware of that.
> >
> > A curious question. What about the backing storage of the affected GPA
> > is changed after adding back? Is recorded dirty info for previous backing
> > storage still making sense for the newer one?
> 
> It's the case of a permanent removal, plus another addition iiuc.  Then the
> worst case is we get some extra dirty bits set on that new memory region,
> but
> IMHO that's benigh (we'll migrate some extra pages even they could be zero
> pages).

yes, reporting more than necessary dirty bits doesn't hurt. 

> 
> Thanks,
> 
> >
> > Thanks
> > Kevin
> >
> > >
> > > With that, when we fetch the dirty bits, we lookup the memslot
> dynamically,
> > > drop bits if the memslot does not exist on that address (e.g., permanent
> > > removals), and use whatever memslot is there for that guest physical
> > > address.
> > > Though we for sure still need to handle memory move, that the
> userspace
> > > needs
> > > to still take care of dirty bit flushing and sync for a memory move,
> however
> > > that's merely not happening so nothing to take care about either.
> > >
> > > Does this makes sense?  Comments greatly welcomed..
> > >
> > > Thanks,
> > >
> > > [1] https://lists.gnu.org/archive/html/qemu-devel/2020-
> 03/msg08361.html
> > >
> > > --
> > > Peter Xu
> >
> 
> --
> Peter Xu
Peter Xu April 24, 2020, 2:19 p.m. UTC | #5
On Fri, Apr 24, 2020 at 06:01:46AM +0000, Tian, Kevin wrote:
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Thursday, April 23, 2020 11:23 PM
> > 
> > On Thu, Apr 23, 2020 at 06:28:43AM +0000, Tian, Kevin wrote:
> > > > From: Peter Xu <peterx@redhat.com>
> > > > Sent: Thursday, April 23, 2020 2:52 AM
> > > >
> > > > Hi,
> > > >
> > > > TL;DR: I'm thinking whether we should record pure GPA/GFN instead of
> > > > (slot_id,
> > > > slot_offset) tuple for dirty pages in kvm dirty ring to unbind
> > kvm_dirty_gfn
> > > > with memslots.
> > > >
> > > > (A slightly longer version starts...)
> > > >
> > > > The problem is that binding dirty tracking operations to KVM memslots is
> > a
> > > > restriction that needs synchronization to memslot changes, which further
> > > > needs
> > > > synchronization across all the vcpus because they're the consumers of
> > > > memslots.
> > > > E.g., when we remove a memory slot, we need to flush all the dirty bits
> > > > correctly before we do the removal of the memslot.  That's actually an
> > > > known
> > > > defect for QEMU/KVM [1] (I bet it could be a defect for many other
> > > > hypervisors...) right now with current dirty logging.  Meanwhile, even if
> > we
> > > > fix it, that procedure is not scale at all, and error prone to dead locks.
> > > >
> > > > Here memory removal is really an (still corner-cased but relatively)
> > important
> > > > scenario to think about for dirty logging comparing to memory additions
> > &
> > > > movings.  Because memory addition will always have no initial dirty page,
> > > > and
> > > > we don't really move RAM a lot (or do we ever?!) for a general VM use
> > case.
> > > >
> > > > Then I went a step back to think about why we need these dirty bit
> > > > information
> > > > after all if the memslot is going to be removed?
> > > >
> > > > There're two cases:
> > > >
> > > >   - When the memslot is going to be removed forever, then the dirty
> > > > information
> > > >     is indeed meaningless and can be dropped, and,
> > > >
> > > >   - When the memslot is going to be removed but quickly added back with
> > > > changed
> > > >     size, then we need to keep those dirty bits because it's just a commmon
> > > > way
> > > >     to e.g. punch an MMIO hole in an existing RAM region (here I'd confess
> > I
> > > >     feel like using "slot_id" to identify memslot is really unfriendly syscall
> > > >     design for things like "hole punchings" in the RAM address space...
> > > >     However such "punch hold" operation is really needed even for a
> > common
> > > >     guest for either system reboots or device hotplugs, etc.).
> > >
> > > why would device hotplug punch a hole in an existing RAM region?
> > 
> > I thought it could happen because I used to trace the KVM ioctls and see the
> > memslot changes during driver loading.  But later when I tried to hotplug a
> 
> Is there more detail why driver loading may lead to memslot changes?

E.g., I can observe these after Linux loads and before the prompt, on the
simplest VM with default devices:

41874@1587736345.192636:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.192760:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.193884:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.193956:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.195788:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.195838:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.196769:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.196827:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.197787:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.197832:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.198777:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.198836:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.200491:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.200537:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.201592:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.201649:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.202415:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.202461:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.203169:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.203225:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.204037:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.204083:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.204983:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.205041:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.205940:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.206022:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
41874@1587736345.206981:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41874@1587736345.207038:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
41875@1587736351.141052:kvm_set_user_memory Slot#9 flags=0x1 gpa=0xa0000 size=0x10000 ua=0x7fadf6800000 ret=0

After a careful look, I noticed it's mostly the VGA device turning slot 3
off & on.  Frankly speaking, I don't know why it does that.

> 
> > device I do see that it won't...  The new MMIO regions are added only into
> > 0xfe000000 for a virtio-net:
> > 
> >   00000000fe000000-00000000fe000fff (prio 0, i/o): virtio-pci-common
> >   00000000fe001000-00000000fe001fff (prio 0, i/o): virtio-pci-isr
> >   00000000fe002000-00000000fe002fff (prio 0, i/o): virtio-pci-device
> >   00000000fe003000-00000000fe003fff (prio 0, i/o): virtio-pci-notify
> >   00000000fe840000-00000000fe84002f (prio 0, i/o): msix-table
> >   00000000fe840800-00000000fe840807 (prio 0, i/o): msix-pba
> > 
> > Does it mean that device plugging is guaranteed to not trigger RAM changes?
> 
> I'd think so. Otherwise from guest p.o.v any device hotplug implies doing
> a memory hot-unplug first then it's a bad design.

Right that's what I was confused about.  Then maybe you're right. :)

> 
> > I
> > am really curious about what cases we need to consider in which we need to
> > keep
> > the dirty bits for a memory removal, and if system reset is the only case, then
> > it could be even easier (because we might be able to avoid the sync in
> > memory
> > removal but do that once in a sys reset hook)...
> 
> Possibly memory hot-unplug, as allowed by recent virtio-mem? 

That should fall into the case where the dirty bits do not matter at all after
the removal, right?  I would be mostly curious about cases where we (1) remove
a memory slot and, at the same time, (2) still care about the dirty bits of
that slot.

I'll see whether I can remove the dirty bit sync in kvm_set_phys_mem(), which I
think is really nasty.

> 
> btw VFIO faces a similar problem when unmapping a DMA range (e.g. when
> vIOMMU is enabled) in dirty log phase. There could be some dirty bits which are
> not retrieved when unmapping happens. VFIO chooses to return the dirty
> bits in a buffer passed in the unmapping parameters. Can memslot interface
> do similar thing by allowing the userspace to specify a buffer pointer to hold
> whatever dirty pages recorded for the slot that is being removed?

Yes, I think we can, but it may not be necessary.  Actually, IMHO CPU accesses
to pages are slightly different from device DMA, in that we can use this
sequence to collect the dirty bits of a slot safely:

  - mark the slot as READONLY
  - KVM_GET_DIRTY_LOG on the slot
  - finally remove the slot

I guess VFIO cannot do that because there's no way to really "mark the region
as read-only" for a device: DMA could still happen, and the DMA would then fail
when writing to a read-only slot.

On the KVM/CPU side, after we mark the slot as READONLY, CPU writes will page
fault and fall back to QEMU userspace, and QEMU will take care of the writes
(so those writes could be slow but still work); even though the slot is marked
READONLY, the writes won't fail, they just fall back to QEMU.

Btw, since we're discussing VFIO dirty logging across memory removal... the
DMA range being unmapped that you're talking about needs to be added back
later, right?  Then what if the device does DMA after the removal but before
it's added back?  I think this is a more general question, not only for dirty
logging but also for when dirty logging is not enabled - I never understood
how this could be fixed with the existing facilities.

Thanks,
Tian, Kevin April 26, 2020, 10:29 a.m. UTC | #6
> From: Peter Xu
> Sent: Friday, April 24, 2020 10:20 PM
> 
> On Fri, Apr 24, 2020 at 06:01:46AM +0000, Tian, Kevin wrote:
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Thursday, April 23, 2020 11:23 PM
> > >
> > > On Thu, Apr 23, 2020 at 06:28:43AM +0000, Tian, Kevin wrote:
> > > > > From: Peter Xu <peterx@redhat.com>
> > > > > Sent: Thursday, April 23, 2020 2:52 AM
> > > > >
> > > > > Hi,
> > > > >
> > > > > TL;DR: I'm thinking whether we should record pure GPA/GFN instead
> of
> > > > > (slot_id,
> > > > > slot_offset) tuple for dirty pages in kvm dirty ring to unbind
> > > kvm_dirty_gfn
> > > > > with memslots.
> > > > >
> > > > > (A slightly longer version starts...)
> > > > >
> > > > > The problem is that binding dirty tracking operations to KVM
> memslots is
> > > a
> > > > > restriction that needs synchronization to memslot changes, which
> further
> > > > > needs
> > > > > synchronization across all the vcpus because they're the consumers of
> > > > > memslots.
> > > > > E.g., when we remove a memory slot, we need to flush all the dirty
> bits
> > > > > correctly before we do the removal of the memslot.  That's actually an
> > > > > known
> > > > > defect for QEMU/KVM [1] (I bet it could be a defect for many other
> > > > > hypervisors...) right now with current dirty logging.  Meanwhile, even
> if
> > > we
> > > > > fix it, that procedure is not scale at all, and error prone to dead locks.
> > > > >
> > > > > Here memory removal is really an (still corner-cased but relatively)
> > > important
> > > > > scenario to think about for dirty logging comparing to memory
> additions
> > > &
> > > > > movings.  Because memory addition will always have no initial dirty
> page,
> > > > > and
> > > > > we don't really move RAM a lot (or do we ever?!) for a general VM
> use
> > > case.
> > > > >
> > > > > Then I went a step back to think about why we need these dirty bit
> > > > > information
> > > > > after all if the memslot is going to be removed?
> > > > >
> > > > > There're two cases:
> > > > >
> > > > >   - When the memslot is going to be removed forever, then the dirty
> > > > > information
> > > > >     is indeed meaningless and can be dropped, and,
> > > > >
> > > > >   - When the memslot is going to be removed but quickly added back
> with
> > > > > changed
> > > > >     size, then we need to keep those dirty bits because it's just a
> commmon
> > > > > way
> > > > >     to e.g. punch an MMIO hole in an existing RAM region (here I'd
> confess
> > > I
> > > > >     feel like using "slot_id" to identify memslot is really unfriendly
> syscall
> > > > >     design for things like "hole punchings" in the RAM address space...
> > > > >     However such "punch hold" operation is really needed even for a
> > > common
> > > > >     guest for either system reboots or device hotplugs, etc.).
> > > >
> > > > why would device hotplug punch a hole in an existing RAM region?
> > >
> > > I thought it could happen because I used to trace the KVM ioctls and see
> the
> > > memslot changes during driver loading.  But later when I tried to hotplug
> a
> >
> > Is there more detail why driver loading may lead to memslot changes?
> 
> E.g., I can observe these after Linux loads and before the prompt, which is a
> simplest VM with default devices on:
> 
> 41874@1587736345.192636:kvm_set_user_memory Slot#3 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.192760:kvm_set_user_memory Slot#65539 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.193884:kvm_set_user_memory Slot#3 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.193956:kvm_set_user_memory Slot#65539 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.195788:kvm_set_user_memory Slot#3 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.195838:kvm_set_user_memory Slot#65539 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.196769:kvm_set_user_memory Slot#3 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.196827:kvm_set_user_memory Slot#65539 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.197787:kvm_set_user_memory Slot#3 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.197832:kvm_set_user_memory Slot#65539 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.198777:kvm_set_user_memory Slot#3 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.198836:kvm_set_user_memory Slot#65539 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.200491:kvm_set_user_memory Slot#3 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.200537:kvm_set_user_memory Slot#65539 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.201592:kvm_set_user_memory Slot#3 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.201649:kvm_set_user_memory Slot#65539 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.202415:kvm_set_user_memory Slot#3 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.202461:kvm_set_user_memory Slot#65539 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.203169:kvm_set_user_memory Slot#3 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.203225:kvm_set_user_memory Slot#65539 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.204037:kvm_set_user_memory Slot#3 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.204083:kvm_set_user_memory Slot#65539 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.204983:kvm_set_user_memory Slot#3 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.205041:kvm_set_user_memory Slot#65539 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.205940:kvm_set_user_memory Slot#3 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.206022:kvm_set_user_memory Slot#65539 flags=0x0
> gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@1587736345.206981:kvm_set_user_memory Slot#3 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@1587736345.207038:kvm_set_user_memory Slot#65539 flags=0x1
> gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41875@1587736351.141052:kvm_set_user_memory Slot#9 flags=0x1
> gpa=0xa0000 size=0x10000 ua=0x7fadf6800000 ret=0
> 
> After a careful look, I noticed it's only the VGA device mostly turning slot 3
> off & on.  Frankly speaking I don't know why it happens to do so.
> 
> >
> > > device I do see that it won't...  The new MMIO regions are added only
> into
> > > 0xfe000000 for a virtio-net:
> > >
> > >   00000000fe000000-00000000fe000fff (prio 0, i/o): virtio-pci-common
> > >   00000000fe001000-00000000fe001fff (prio 0, i/o): virtio-pci-isr
> > >   00000000fe002000-00000000fe002fff (prio 0, i/o): virtio-pci-device
> > >   00000000fe003000-00000000fe003fff (prio 0, i/o): virtio-pci-notify
> > >   00000000fe840000-00000000fe84002f (prio 0, i/o): msix-table
> > >   00000000fe840800-00000000fe840807 (prio 0, i/o): msix-pba
> > >
> > > Does it mean that device plugging is guaranteed to not trigger RAM
> changes?
> >
> > I'd think so. Otherwise from guest p.o.v any device hotplug implies doing
> > a memory hot-unplug first then it's a bad design.
> 
> Right that's what I was confused about.  Then maybe you're right. :)
> 
> >
> > > I
> > > am really curious about what cases we need to consider in which we
> need to
> > > keep
> > > the dirty bits for a memory removal, and if system reset is the only case,
> then
> > > it could be even easier (because we might be able to avoid the sync in
> > > memory
> > > removal but do that once in a sys reset hook)...
> >
> > Possibly memory hot-unplug, as allowed by recent virtio-mem?
> 
> That should belong to the case where dirty bits do not matter at all after the
> removal, right?  I would be mostly curious about when we (1) remove a
> memory
> slot, and at the meantime (2) we still care about the dirty bits of that slot.

I remember one case.  The PCIe spec defines a resizable BAR capability,
allowing hardware to communicate the supported resource sizes and software to
set an optimal size back to the hardware.  If such a capability is presented to
the guest and the BAR is backed by memory, it's possible to observe a
removal-and-add-back scenario.  However, in that case the spec requires
software to clear the memory space enable bit in the command register before
changing the BAR size.  I suppose such a thing should happen in the boot phase,
where dirty bits related to the old BAR size don't matter.

> 
> I'll see whether I can remove the dirty bit sync in kvm_set_phys_mem(),
> which I
> think is really nasty.
> 
> >
> > btw VFIO faces a similar problem when unmapping a DMA range (e.g.
> when
> > vIOMMU is enabled) in dirty log phase. There could be some dirty bits
> which are
> > not retrieved when unmapping happens. VFIO chooses to return the dirty
> > bits in a buffer passed in the unmapping parameters. Can memslot
> interface
> > do similar thing by allowing the userspace to specify a buffer pointer to
> hold
> > whatever dirty pages recorded for the slot that is being removed?
> 
> Yes I think we can, but may not be necessary.  Actually IMHO CPU access to
> pages are slightly different to device DMAs in that we can do these sequence
> to
> collect the dirty bits of a slot safely:
> 
>   - mark slot as READONLY
>   - KVM_GET_DIRTY_LOG on the slot
>   - finally remove the slot
> 
> I guess VFIO cannot do that because there's no way to really "mark the
> region
> as read-only" for a device because DMA could happen and DMA would fail
> then
> when writting to a readonly slot.
> 
> On the KVM/CPU side, after we mark the slot as READONLY then the CPU
> writes
> will page fault and fallback to the QEMU userspace, then QEMU will take care
> of
> the writes (so those writes could be slow but still working) then even if we
> mark it READONLY it won't fail the writes but just fallback to QEMU.

Yes, you are right.

> 
> Btw, since we're discussing the VFIO dirty logging across memory removal...
> the
> unmapping DMA range you're talking about needs to be added back later,
> right?

Pure memory removal doesn't need an add-back.  Or are you specifically
referring to the removal-and-add-back case?

> Then what if the device does DMA after the removal but before it's added
> back?

Then it would just get an IOMMU page fault, but I guess I didn't get your real
question...

> I think this is a more general question not for dirty logging but also for when
> dirty logging is not enabled - I never understand how this could be fixed with
> existing facilities.
> 

Thanks,
Kevin
Peter Xu April 27, 2020, 2:27 p.m. UTC | #7
On Sun, Apr 26, 2020 at 10:29:51AM +0000, Tian, Kevin wrote:
> > From: Peter Xu
> > Sent: Friday, April 24, 2020 10:20 PM
> > 
> > On Fri, Apr 24, 2020 at 06:01:46AM +0000, Tian, Kevin wrote:
> > > > From: Peter Xu <peterx@redhat.com>
> > > > Sent: Thursday, April 23, 2020 11:23 PM
> > > >
> > > > On Thu, Apr 23, 2020 at 06:28:43AM +0000, Tian, Kevin wrote:
> > > > > > From: Peter Xu <peterx@redhat.com>
> > > > > > Sent: Thursday, April 23, 2020 2:52 AM
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > TL;DR: I'm thinking whether we should record pure GPA/GFN instead
> > of
> > > > > > (slot_id,
> > > > > > slot_offset) tuple for dirty pages in kvm dirty ring to unbind
> > > > kvm_dirty_gfn
> > > > > > with memslots.
> > > > > >
> > > > > > (A slightly longer version starts...)
> > > > > >
> > > > > > The problem is that binding dirty tracking operations to KVM
> > memslots is
> > > > a
> > > > > > restriction that needs synchronization to memslot changes, which
> > further
> > > > > > needs
> > > > > > synchronization across all the vcpus because they're the consumers of
> > > > > > memslots.
> > > > > > E.g., when we remove a memory slot, we need to flush all the dirty
> > bits
> > > > > > correctly before we do the removal of the memslot.  That's actually an
> > > > > > known
> > > > > > defect for QEMU/KVM [1] (I bet it could be a defect for many other
> > > > > > hypervisors...) right now with current dirty logging.  Meanwhile, even
> > if
> > > > we
> > > > > > fix it, that procedure is not scale at all, and error prone to dead locks.
> > > > > >
> > > > > > Here memory removal is really an (still corner-cased but relatively)
> > > > important
> > > > > > scenario to think about for dirty logging comparing to memory
> > additions
> > > > &
> > > > > > movings.  Because memory addition will always have no initial dirty
> > page,
> > > > > > and
> > > > > > we don't really move RAM a lot (or do we ever?!) for a general VM
> > use
> > > > case.
> > > > > >
> > > > > > Then I went a step back to think about why we need these dirty bit
> > > > > > information
> > > > > > after all if the memslot is going to be removed?
> > > > > >
> > > > > > There're two cases:
> > > > > >
> > > > > >   - When the memslot is going to be removed forever, then the dirty
> > > > > > information
> > > > > >     is indeed meaningless and can be dropped, and,
> > > > > >
> > > > > >   - When the memslot is going to be removed but quickly added back
> > with
> > > > > > changed
> > > > > >     size, then we need to keep those dirty bits because it's just a
> > commmon
> > > > > > way
> > > > > >     to e.g. punch an MMIO hole in an existing RAM region (here I'd
> > confess
> > > > I
> > > > > >     feel like using "slot_id" to identify memslot is really unfriendly
> > syscall
> > > > > >     design for things like "hole punchings" in the RAM address space...
> > > > > >     However such "punch hold" operation is really needed even for a
> > > > common
> > > > > >     guest for either system reboots or device hotplugs, etc.).
> > > > >
> > > > > why would device hotplug punch a hole in an existing RAM region?
> > > >
> > > > I thought it could happen because I used to trace the KVM ioctls and see
> > the
> > > > memslot changes during driver loading.  But later when I tried to hotplug
> > a
> > >
> > > Is there more detail why driver loading may lead to memslot changes?
> > 
> > E.g., I can observe these after Linux loads and before the prompt, which is a
> > simplest VM with default devices on:
> > 
> > 41874@1587736345.192636:kvm_set_user_memory Slot#3 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.192760:kvm_set_user_memory Slot#65539 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.193884:kvm_set_user_memory Slot#3 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.193956:kvm_set_user_memory Slot#65539 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.195788:kvm_set_user_memory Slot#3 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.195838:kvm_set_user_memory Slot#65539 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.196769:kvm_set_user_memory Slot#3 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.196827:kvm_set_user_memory Slot#65539 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.197787:kvm_set_user_memory Slot#3 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.197832:kvm_set_user_memory Slot#65539 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.198777:kvm_set_user_memory Slot#3 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.198836:kvm_set_user_memory Slot#65539 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.200491:kvm_set_user_memory Slot#3 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.200537:kvm_set_user_memory Slot#65539 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.201592:kvm_set_user_memory Slot#3 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.201649:kvm_set_user_memory Slot#65539 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.202415:kvm_set_user_memory Slot#3 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.202461:kvm_set_user_memory Slot#65539 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.203169:kvm_set_user_memory Slot#3 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.203225:kvm_set_user_memory Slot#65539 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.204037:kvm_set_user_memory Slot#3 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.204083:kvm_set_user_memory Slot#65539 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.204983:kvm_set_user_memory Slot#3 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.205041:kvm_set_user_memory Slot#65539 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.205940:kvm_set_user_memory Slot#3 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.206022:kvm_set_user_memory Slot#65539 flags=0x0
> > gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.206981:kvm_set_user_memory Slot#3 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41874@1587736345.207038:kvm_set_user_memory Slot#65539 flags=0x1
> > gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> > 41875@1587736351.141052:kvm_set_user_memory Slot#9 flags=0x1
> > gpa=0xa0000 size=0x10000 ua=0x7fadf6800000 ret=0
> > 
> > After a careful look, I noticed it's only the VGA device mostly turning slot 3
> > off & on.  Frankly speaking I don't know why it happens to do so.
> > 
> > >
> > > > device I do see that it won't...  The new MMIO regions are added only
> > into
> > > > 0xfe000000 for a virtio-net:
> > > >
> > > >   00000000fe000000-00000000fe000fff (prio 0, i/o): virtio-pci-common
> > > >   00000000fe001000-00000000fe001fff (prio 0, i/o): virtio-pci-isr
> > > >   00000000fe002000-00000000fe002fff (prio 0, i/o): virtio-pci-device
> > > >   00000000fe003000-00000000fe003fff (prio 0, i/o): virtio-pci-notify
> > > >   00000000fe840000-00000000fe84002f (prio 0, i/o): msix-table
> > > >   00000000fe840800-00000000fe840807 (prio 0, i/o): msix-pba
> > > >
> > > > Does it mean that device plugging is guaranteed to not trigger RAM
> > changes?
> > >
> > > I'd think so. Otherwise from guest p.o.v any device hotplug implies doing
> > > a memory hot-unplug first then it's a bad design.
> > 
> > Right that's what I was confused about.  Then maybe you're right. :)
> > 
> > >
> > > > I
> > > > am really curious about what cases we need to consider in which we
> > need to
> > > > keep
> > > > the dirty bits for a memory removal, and if system reset is the only case,
> > then
> > > > it could be even easier (because we might be able to avoid the sync in
> > > > memory
> > > > removal but do that once in a sys reset hook)...
> > >
> > > Possibly memory hot-unplug, as allowed by recent virtio-mem?
> > 
> > That should belong to the case where dirty bits do not matter at all after the
> > removal, right?  I would be mostly curious about when we (1) remove a
> > memory
> > slot, and at the meantime (2) we still care about the dirty bits of that slot.
> 
> I remember one case. PCIe spec defines a resizable BAR capability, allowing
> hardware to communicate supported resource sizes and software to set
> an optimal size back to hardware. If such a capability is presented to guest
> and the BAR is backed by memory, it's possible to observe a removal-and-
> add-back scenario. However in such case, the spec requires the software 
> to clear memory space enable bit in command register before changing 
> the BAR size. I suppose such thing should happen at boot phase where
> dirty bits related to old BAR size don't matter. 

Good to know.  Yes, it makes sense to have no valid data if the driver is
trying to decide the BAR size at boot time.

> 
> > 
> > I'll see whether I can remove the dirty bit sync in kvm_set_phys_mem(),
> > which I
> > think is really nasty.
> > 
> > >
> > > btw VFIO faces a similar problem when unmapping a DMA range (e.g.
> > when
> > > vIOMMU is enabled) in dirty log phase. There could be some dirty bits
> > which are
> > > not retrieved when unmapping happens. VFIO chooses to return the dirty
> > > bits in a buffer passed in the unmapping parameters. Can memslot
> > interface
> > > do similar thing by allowing the userspace to specify a buffer pointer to
> > hold
> > > whatever dirty pages recorded for the slot that is being removed?
> > 
> > Yes I think we can, but may not be necessary.  Actually IMHO CPU access to
> > pages are slightly different to device DMAs in that we can do these sequence
> > to
> > collect the dirty bits of a slot safely:
> > 
> >   - mark slot as READONLY
> >   - KVM_GET_DIRTY_LOG on the slot
> >   - finally remove the slot
> > 
> > I guess VFIO cannot do that because there's no way to really "mark the
> > region
> > as read-only" for a device because DMA could happen and DMA would fail
> > then
> > when writting to a readonly slot.
> > 
> > On the KVM/CPU side, after we mark the slot as READONLY then the CPU
> > writes
> > will page fault and fallback to the QEMU userspace, then QEMU will take care
> > of
> > the writes (so those writes could be slow but still working) then even if we
> > mark it READONLY it won't fail the writes but just fallback to QEMU.
> 
> Yes, you are right.
> 
> > 
> > Btw, since we're discussing the VFIO dirty logging across memory removal...
> > the
> > unmapping DMA range you're talking about needs to be added back later,
> > right?
> 
> pure memory removal doesn't need add-back. Or are you specifically
> referring to removal-and-add-back case?

Oh right, I get the point now - VFIO unmap is different in that it does not
mean the RAM is going away, only that the RAM is no longer accessible from the
device, so it's actually a different scenario compared to real memory
removals.  Sorry, I obviously missed that... :)

Thanks,

> 
> > Then what if the device does DMA after the removal but before it's added
> > back?
> 
> then just got IOMMU page fault, but I guess that I didn't get your real 
> question...
> 
> > I think this is a more general question not for dirty logging but also for when
> > dirty logging is not enabled - I never understand how this could be fixed with
> > existing facilities.
> > 
> 
> Thanks,
> Kevin