
[RFC,04/15] KVM: Implement ring-based dirty memory tracking

Message ID 20191129213505.18472-5-peterx@redhat.com (mailing list archive)
State New, archived
Series KVM: Dirty ring interface

Commit Message

Peter Xu Nov. 29, 2019, 9:34 p.m. UTC
This patch is heavily based on previous work from Lei Cao
<lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]

KVM currently uses large bitmaps to track dirty memory.  These bitmaps
are copied to userspace when userspace queries KVM for its dirty page
information.  The use of bitmaps is mostly sufficient for live
migration, as large parts of memory are dirtied from one log-dirty
pass to another.  However, in a checkpointing system, the number of
dirty pages is small and in fact it is often bounded---the VM is
paused when it has dirtied a pre-defined number of pages. Traversing a
large, sparsely populated bitmap to find set bits is time-consuming,
as is copying the bitmap to user-space.

A similar issue arises for live migration when the guest memory is
huge but only a small number of pages get dirtied.  In that case,
each dirty sync still pulls the whole dirty bitmap to userspace and
analyses every bit, even though it is mostly zeros.

The preferred data structure for the above scenarios is a dense list of
guest frame numbers (GFN).  This patch series stores the dirty list in
kernel memory that can be memory mapped into userspace to allow speedy
harvesting.

We defined two new data structures:

  struct kvm_dirty_ring;
  struct kvm_dirty_ring_indexes;

Firstly, kvm_dirty_ring is defined to represent a ring of dirty
pages.  When dirty tracking is enabled, we can push dirty GFNs onto
the ring.

Secondly, kvm_dirty_ring_indexes is defined to represent the
user/kernel interface of each ring.  Currently it contains two
indexes: (1) avail_index, which represents where we should push the
next dirty GFN (written by the kernel), and (2) fetch_index, which
represents where userspace should fetch the next dirty GFN (written
by userspace).

One complete ring is composed of one kvm_dirty_ring plus its
corresponding kvm_dirty_ring_indexes.
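
For illustration, the layout roughly looks like below (just a sketch;
see the uapi changes in this patch for the authoritative definitions,
which may also contain padding or reserved fields):

struct kvm_dirty_gfn {
	__u32 slot;    /* encoded as (as_id << 16) | slot_id */
	__u64 offset;  /* gfn offset within the memslot */
};

struct kvm_dirty_ring_indexes {
	__u32 avail_index;  /* written by the kernel: next slot to push to */
	__u32 fetch_index;  /* written by userspace: next slot to harvest */
};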

Currently, we have N+1 rings for each VM of N vcpus:

  - for each vcpu, we have 1 per-vcpu dirty ring,
  - for each vm, we have 1 per-vm dirty ring

Please refer to the documentation update in this patch for more
details.

Note that this patch implements the core logic of the dirty ring
buffer.  It is still disabled for all archs for now.  Also, we'll
address some of the other issues in follow-up patches before it is
first enabled on x86.

[1] https://patchwork.kernel.org/patch/10471409/

Signed-off-by: Lei Cao <lei.cao@stratus.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 Documentation/virt/kvm/api.txt | 109 +++++++++++++++
 arch/x86/kvm/Makefile          |   3 +-
 include/linux/kvm_dirty_ring.h |  67 +++++++++
 include/linux/kvm_host.h       |  33 +++++
 include/linux/kvm_types.h      |   1 +
 include/uapi/linux/kvm.h       |  36 +++++
 virt/kvm/dirty_ring.c          | 156 +++++++++++++++++++++
 virt/kvm/kvm_main.c            | 240 ++++++++++++++++++++++++++++++++-
 8 files changed, 642 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/kvm_dirty_ring.h
 create mode 100644 virt/kvm/dirty_ring.c

Comments

Sean Christopherson Dec. 2, 2019, 8:10 p.m. UTC | #1
On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> This patch is heavily based on previous work from Lei Cao
> <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
> 
> KVM currently uses large bitmaps to track dirty memory.  These bitmaps
> are copied to userspace when userspace queries KVM for its dirty page
> information.  The use of bitmaps is mostly sufficient for live
> migration, as large parts of memory are dirtied from one log-dirty
> pass to another.  However, in a checkpointing system, the number of
> dirty pages is small and in fact it is often bounded---the VM is
> paused when it has dirtied a pre-defined number of pages. Traversing a
> large, sparsely populated bitmap to find set bits is time-consuming,
> as is copying the bitmap to user-space.
> 
> A similar issue will be there for live migration when the guest memory
> is huge while the page dirty procedure is trivial.  In that case for
> each dirty sync we need to pull the whole dirty bitmap to userspace
> and analyse every bit even if it's mostly zeros.
> 
> The preferred data structure for above scenarios is a dense list of
> guest frame numbers (GFN).  This patch series stores the dirty list in
> kernel memory that can be memory mapped into userspace to allow speedy
> harvesting.
> 
> We defined two new data structures:
> 
>   struct kvm_dirty_ring;
>   struct kvm_dirty_ring_indexes;
> 
> Firstly, kvm_dirty_ring is defined to represent a ring of dirty
> pages.  When dirty tracking is enabled, we can push dirty gfn onto the
> ring.
> 
> Secondly, kvm_dirty_ring_indexes is defined to represent the
> user/kernel interface of each ring.  Currently it contains two
> indexes: (1) avail_index represents where we should push our next
> PFN (written by kernel), while (2) fetch_index represents where the
> userspace should fetch the next dirty PFN (written by userspace).
> 
> One complete ring is composed by one kvm_dirty_ring plus its
> corresponding kvm_dirty_ring_indexes.
> 
> Currently, we have N+1 rings for each VM of N vcpus:
> 
>   - for each vcpu, we have 1 per-vcpu dirty ring,
>   - for each vm, we have 1 per-vm dirty ring

Why?  I assume the purpose of per-vcpu rings is to avoid contention between
threads, but the motivation needs to be explicitly stated.  And why is a
per-vm fallback ring needed?

If my assumption is correct, have other approaches been tried/profiled?
E.g. using cmpxchg to reserve N entries in a shared ring.  IMO,
adding kvm_get_running_vcpu() is a hack that is just asking for future
abuse and the vcpu/vm/as_id interactions in mark_page_dirty_in_ring()
look extremely fragile.  I also dislike having two different mechanisms
for accessing the ring (lock for per-vm, something else for per-vcpu).
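
To illustrate the cmpxchg idea (completely untested, names are made up,
but reusing the dirty_index/reset_index fields from this patch),
reserving entries in a shared ring could look something like:

static int dirty_ring_reserve(struct kvm_dirty_ring *ring, u32 nr, u32 *first)
{
	u32 old, new;

	do {
		old = READ_ONCE(ring->dirty_index);
		/* Indices are free running, used = dirty_index - reset_index. */
		if (old - READ_ONCE(ring->reset_index) + nr > ring->size)
			return -EBUSY;	/* not enough free slots */
		new = old + nr;
	} while (cmpxchg(&ring->dirty_index, old, new) != old);

	/* Caller owns slots [old, old + nr) and can fill them in
	 * parallel with other vCPUs. */
	*first = old;
	return 0;
}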

> Please refer to the documentation update in this patch for more
> details.
> 
> Note that this patch implements the core logic of dirty ring buffer.
> It's still disabled for all archs for now.  Also, we'll address some
> of the other issues in follow up patches before it's firstly enabled
> on x86.
> 
> [1] https://patchwork.kernel.org/patch/10471409/
> 
> Signed-off-by: Lei Cao <lei.cao@stratus.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---

...

> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> new file mode 100644
> index 000000000000..9264891f3c32
> --- /dev/null
> +++ b/virt/kvm/dirty_ring.c
> @@ -0,0 +1,156 @@
> +#include <linux/kvm_host.h>
> +#include <linux/kvm.h>
> +#include <linux/vmalloc.h>
> +#include <linux/kvm_dirty_ring.h>
> +
> +u32 kvm_dirty_ring_get_rsvd_entries(void)
> +{
> +	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> +}
> +
> +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
> +{
> +	u32 size = kvm->dirty_ring_size;

Just pass in @size, that way you don't need @kvm.  And the callers will be
less ugly, e.g. the initial allocation won't need to speculatively set
kvm->dirty_ring_size.

> +
> +	ring->dirty_gfns = vmalloc(size);
> +	if (!ring->dirty_gfns)
> +		return -ENOMEM;
> +	memset(ring->dirty_gfns, 0, size);
> +
> +	ring->size = size / sizeof(struct kvm_dirty_gfn);
> +	ring->soft_limit =
> +	    (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -

And passing @size avoids issues like this where a local var is ignored.

> +	    kvm_dirty_ring_get_rsvd_entries();
> +	ring->dirty_index = 0;
> +	ring->reset_index = 0;
> +	spin_lock_init(&ring->lock);
> +
> +	return 0;
> +}
> +

...

> +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
> +{
> +	if (ring->dirty_gfns) {

Why condition freeing the dirty ring on kvm->dirty_ring_size?  This
check obviously protects itself.  Not to mention vfree() also plays
nice with a NULL input.

> +		vfree(ring->dirty_gfns);
> +		ring->dirty_gfns = NULL;
> +	}
> +}
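
I.e. this could simply be (sketch):

void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
{
	vfree(ring->dirty_gfns);
	ring->dirty_gfns = NULL;
}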
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 681452d288cd..8642c977629b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -64,6 +64,8 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/kvm.h>
>  
> +#include <linux/kvm_dirty_ring.h>
> +
>  /* Worst case buffer size needed for holding an integer. */
>  #define ITOA_MAX_LEN 12
>  
> @@ -149,6 +151,10 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
>  				    struct kvm_vcpu *vcpu,
>  				    struct kvm_memory_slot *memslot,
>  				    gfn_t gfn);
> +static void mark_page_dirty_in_ring(struct kvm *kvm,
> +				    struct kvm_vcpu *vcpu,
> +				    struct kvm_memory_slot *slot,
> +				    gfn_t gfn);
>  
>  __visible bool kvm_rebooting;
>  EXPORT_SYMBOL_GPL(kvm_rebooting);
> @@ -359,11 +365,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>  	vcpu->preempted = false;
>  	vcpu->ready = false;
>  
> +	if (kvm->dirty_ring_size) {
> +		r = kvm_dirty_ring_alloc(vcpu->kvm, &vcpu->dirty_ring);
> +		if (r) {
> +			kvm->dirty_ring_size = 0;
> +			goto fail_free_run;

This looks wrong, kvm->dirty_ring_size is used to free allocations, i.e.
previous allocations will leak if a vcpu allocation fails.

> +		}
> +	}
> +
>  	r = kvm_arch_vcpu_init(vcpu);
>  	if (r < 0)
> -		goto fail_free_run;
> +		goto fail_free_ring;
>  	return 0;
>  
> +fail_free_ring:
> +	if (kvm->dirty_ring_size)
> +		kvm_dirty_ring_free(&vcpu->dirty_ring);
>  fail_free_run:
>  	free_page((unsigned long)vcpu->run);
>  fail:
> @@ -381,6 +398,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
>  	put_pid(rcu_dereference_protected(vcpu->pid, 1));
>  	kvm_arch_vcpu_uninit(vcpu);
>  	free_page((unsigned long)vcpu->run);
> +	if (vcpu->kvm->dirty_ring_size)
> +		kvm_dirty_ring_free(&vcpu->dirty_ring);
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
>  
> @@ -690,6 +709,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  	struct kvm *kvm = kvm_arch_alloc_vm();
>  	int r = -ENOMEM;
>  	int i;
> +	struct page *page;
>  
>  	if (!kvm)
>  		return ERR_PTR(-ENOMEM);
> @@ -705,6 +725,14 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  
>  	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
>  
> +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	if (!page) {
> +		r = -ENOMEM;
> +		goto out_err_alloc_page;
> +	}
> +	kvm->vm_run = page_address(page);
> +	BUILD_BUG_ON(sizeof(struct kvm_vm_run) > PAGE_SIZE);
> +
>  	if (init_srcu_struct(&kvm->srcu))
>  		goto out_err_no_srcu;
>  	if (init_srcu_struct(&kvm->irq_srcu))
> @@ -775,6 +803,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  out_err_no_irq_srcu:
>  	cleanup_srcu_struct(&kvm->srcu);
>  out_err_no_srcu:
> +	free_page((unsigned long)page);
> +	kvm->vm_run = NULL;

No need to nullify vm_run.

> +out_err_alloc_page:
>  	kvm_arch_free_vm(kvm);
>  	mmdrop(current->mm);
>  	return ERR_PTR(r);
> @@ -800,6 +831,15 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  	int i;
>  	struct mm_struct *mm = kvm->mm;
>  
> +	if (kvm->dirty_ring_size) {
> +		kvm_dirty_ring_free(&kvm->vm_dirty_ring);
> +	}

Unnecessary parentheses.

> +
> +	if (kvm->vm_run) {
> +		free_page((unsigned long)kvm->vm_run);
> +		kvm->vm_run = NULL;
> +	}
> +
>  	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
>  	kvm_destroy_vm_debugfs(kvm);
>  	kvm_arch_sync_events(kvm);
> @@ -2301,7 +2341,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
>  {
>  	if (memslot && memslot->dirty_bitmap) {
>  		unsigned long rel_gfn = gfn - memslot->base_gfn;
> -
> +		mark_page_dirty_in_ring(kvm, vcpu, memslot, gfn);
>  		set_bit_le(rel_gfn, memslot->dirty_bitmap);
>  	}
>  }
> @@ -2649,6 +2689,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
Peter Xu Dec. 2, 2019, 9:16 p.m. UTC | #2
On Mon, Dec 02, 2019 at 12:10:36PM -0800, Sean Christopherson wrote:
> On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> > This patch is heavily based on previous work from Lei Cao
> > <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
> > 
> > KVM currently uses large bitmaps to track dirty memory.  These bitmaps
> > are copied to userspace when userspace queries KVM for its dirty page
> > information.  The use of bitmaps is mostly sufficient for live
> > migration, as large parts of memory are dirtied from one log-dirty
> > pass to another.  However, in a checkpointing system, the number of
> > dirty pages is small and in fact it is often bounded---the VM is
> > paused when it has dirtied a pre-defined number of pages. Traversing a
> > large, sparsely populated bitmap to find set bits is time-consuming,
> > as is copying the bitmap to user-space.
> > 
> > A similar issue will be there for live migration when the guest memory
> > is huge while the page dirty procedure is trivial.  In that case for
> > each dirty sync we need to pull the whole dirty bitmap to userspace
> > and analyse every bit even if it's mostly zeros.
> > 
> > The preferred data structure for above scenarios is a dense list of
> > guest frame numbers (GFN).  This patch series stores the dirty list in
> > kernel memory that can be memory mapped into userspace to allow speedy
> > harvesting.
> > 
> > We defined two new data structures:
> > 
> >   struct kvm_dirty_ring;
> >   struct kvm_dirty_ring_indexes;
> > 
> > Firstly, kvm_dirty_ring is defined to represent a ring of dirty
> > pages.  When dirty tracking is enabled, we can push dirty gfn onto the
> > ring.
> > 
> > Secondly, kvm_dirty_ring_indexes is defined to represent the
> > user/kernel interface of each ring.  Currently it contains two
> > indexes: (1) avail_index represents where we should push our next
> > PFN (written by kernel), while (2) fetch_index represents where the
> > userspace should fetch the next dirty PFN (written by userspace).
> > 
> > One complete ring is composed by one kvm_dirty_ring plus its
> > corresponding kvm_dirty_ring_indexes.
> > 
> > Currently, we have N+1 rings for each VM of N vcpus:
> > 
> >   - for each vcpu, we have 1 per-vcpu dirty ring,
> >   - for each vm, we have 1 per-vm dirty ring
> 
> Why?  I assume the purpose of per-vcpu rings is to avoid contention between
> threads, but the motivation needs to be explicitly stated.  And why is a
> per-vm fallback ring needed?

Yes, as explained in the previous reply, the problem is that there can
be guest memory writes without a vcpu context.

> 
> If my assumption is correct, have other approaches been tried/profiled?
> E.g. using cmpxchg to reserve N number of entries in a shared ring.

Not yet, but I'd be fine with trying anything if there are better
alternatives.  Besides, could you help explain why sharing one ring
and letting each vcpu reserve a region in it would be helpful from a
performance point of view?

> IMO,
> adding kvm_get_running_vcpu() is a hack that is just asking for future
> abuse and the vcpu/vm/as_id interactions in mark_page_dirty_in_ring()
> look extremely fragile.

I agree.  Another way is to put the heavier traffic onto the per-vm
ring, but the downside could be that the per-vm ring fills up more
easily (though I haven't tested that).

> I also dislike having two different mechanisms
> for accessing the ring (lock for per-vm, something else for per-vcpu).

Actually I proposed dropping the per-vm ring (I had a version that
implemented this, and I just changed it back to the per-vm ring later
on, see below), and for the case where there's no vcpu context I
thought about:

  (1) use vcpu0 ring

  (2) or a better algorithm to pick a per-vcpu ring (e.g., the least
      full ring; we can do many things here, such as easily maintaining
      a structure to track this so we get an O(1) search, I think)

I discussed this with Paolo, but I think Paolo preferred the per-vm
ring because there's no good reason to choose vcpu0 as (1) suggests.
And if we choose (2), we probably need locking even for the per-vcpu
rings, so it could be a bit slower.

Since this is still an RFC, I think we still have a chance to change
this, depending on how the discussion goes.

> 
> > Please refer to the documentation update in this patch for more
> > details.
> > 
> > Note that this patch implements the core logic of dirty ring buffer.
> > It's still disabled for all archs for now.  Also, we'll address some
> > of the other issues in follow up patches before it's firstly enabled
> > on x86.
> > 
> > [1] https://patchwork.kernel.org/patch/10471409/
> > 
> > Signed-off-by: Lei Cao <lei.cao@stratus.com>
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> 
> ...
> 
> > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > new file mode 100644
> > index 000000000000..9264891f3c32
> > --- /dev/null
> > +++ b/virt/kvm/dirty_ring.c
> > @@ -0,0 +1,156 @@
> > +#include <linux/kvm_host.h>
> > +#include <linux/kvm.h>
> > +#include <linux/vmalloc.h>
> > +#include <linux/kvm_dirty_ring.h>
> > +
> > +u32 kvm_dirty_ring_get_rsvd_entries(void)
> > +{
> > +	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> > +}
> > +
> > +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > +{
> > +	u32 size = kvm->dirty_ring_size;
> 
> Just pass in @size, that way you don't need @kvm.  And the callers will be
> less ugly, e.g. the initial allocation won't need to speculatively set
> kvm->dirty_ring_size.

Sure.

> 
> > +
> > +	ring->dirty_gfns = vmalloc(size);
> > +	if (!ring->dirty_gfns)
> > +		return -ENOMEM;
> > +	memset(ring->dirty_gfns, 0, size);
> > +
> > +	ring->size = size / sizeof(struct kvm_dirty_gfn);
> > +	ring->soft_limit =
> > +	    (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -
> 
> And passing @size avoids issues like this where a local var is ignored.
> 
> > +	    kvm_dirty_ring_get_rsvd_entries();
> > +	ring->dirty_index = 0;
> > +	ring->reset_index = 0;
> > +	spin_lock_init(&ring->lock);
> > +
> > +	return 0;
> > +}
> > +
> 
> ...
> 
> > +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
> > +{
> > +	if (ring->dirty_gfns) {
> 
> Why condition freeing the dirty ring on kvm->dirty_ring_size, this
> obviously protects itself.  Not to mention vfree() also plays nice with a
> NULL input.

Ok I can drop this check.

> 
> > +		vfree(ring->dirty_gfns);
> > +		ring->dirty_gfns = NULL;
> > +	}
> > +}
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 681452d288cd..8642c977629b 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -64,6 +64,8 @@
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/kvm.h>
> >  
> > +#include <linux/kvm_dirty_ring.h>
> > +
> >  /* Worst case buffer size needed for holding an integer. */
> >  #define ITOA_MAX_LEN 12
> >  
> > @@ -149,6 +151,10 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> >  				    struct kvm_vcpu *vcpu,
> >  				    struct kvm_memory_slot *memslot,
> >  				    gfn_t gfn);
> > +static void mark_page_dirty_in_ring(struct kvm *kvm,
> > +				    struct kvm_vcpu *vcpu,
> > +				    struct kvm_memory_slot *slot,
> > +				    gfn_t gfn);
> >  
> >  __visible bool kvm_rebooting;
> >  EXPORT_SYMBOL_GPL(kvm_rebooting);
> > @@ -359,11 +365,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
> >  	vcpu->preempted = false;
> >  	vcpu->ready = false;
> >  
> > +	if (kvm->dirty_ring_size) {
> > +		r = kvm_dirty_ring_alloc(vcpu->kvm, &vcpu->dirty_ring);
> > +		if (r) {
> > +			kvm->dirty_ring_size = 0;
> > +			goto fail_free_run;
> 
> This looks wrong, kvm->dirty_ring_size is used to free allocations, i.e.
> previous allocations will leak if a vcpu allocation fails.

You are right.  That's overkill.

> 
> > +		}
> > +	}
> > +
> >  	r = kvm_arch_vcpu_init(vcpu);
> >  	if (r < 0)
> > -		goto fail_free_run;
> > +		goto fail_free_ring;
> >  	return 0;
> >  
> > +fail_free_ring:
> > +	if (kvm->dirty_ring_size)
> > +		kvm_dirty_ring_free(&vcpu->dirty_ring);
> >  fail_free_run:
> >  	free_page((unsigned long)vcpu->run);
> >  fail:
> > @@ -381,6 +398,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
> >  	put_pid(rcu_dereference_protected(vcpu->pid, 1));
> >  	kvm_arch_vcpu_uninit(vcpu);
> >  	free_page((unsigned long)vcpu->run);
> > +	if (vcpu->kvm->dirty_ring_size)
> > +		kvm_dirty_ring_free(&vcpu->dirty_ring);
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
> >  
> > @@ -690,6 +709,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> >  	struct kvm *kvm = kvm_arch_alloc_vm();
> >  	int r = -ENOMEM;
> >  	int i;
> > +	struct page *page;
> >  
> >  	if (!kvm)
> >  		return ERR_PTR(-ENOMEM);
> > @@ -705,6 +725,14 @@ static struct kvm *kvm_create_vm(unsigned long type)
> >  
> >  	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
> >  
> > +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> > +	if (!page) {
> > +		r = -ENOMEM;
> > +		goto out_err_alloc_page;
> > +	}
> > +	kvm->vm_run = page_address(page);
> > +	BUILD_BUG_ON(sizeof(struct kvm_vm_run) > PAGE_SIZE);
> > +
> >  	if (init_srcu_struct(&kvm->srcu))
> >  		goto out_err_no_srcu;
> >  	if (init_srcu_struct(&kvm->irq_srcu))
> > @@ -775,6 +803,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
> >  out_err_no_irq_srcu:
> >  	cleanup_srcu_struct(&kvm->srcu);
> >  out_err_no_srcu:
> > +	free_page((unsigned long)page);
> > +	kvm->vm_run = NULL;
> 
> No need to nullify vm_run.

Ok.

> 
> > +out_err_alloc_page:
> >  	kvm_arch_free_vm(kvm);
> >  	mmdrop(current->mm);
> >  	return ERR_PTR(r);
> > @@ -800,6 +831,15 @@ static void kvm_destroy_vm(struct kvm *kvm)
> >  	int i;
> >  	struct mm_struct *mm = kvm->mm;
> >  
> > +	if (kvm->dirty_ring_size) {
> > +		kvm_dirty_ring_free(&kvm->vm_dirty_ring);
> > +	}
> 
> Unnecessary parentheses.

True.

Thanks,

> 
> > +
> > +	if (kvm->vm_run) {
> > +		free_page((unsigned long)kvm->vm_run);
> > +		kvm->vm_run = NULL;
> > +	}
> > +
> >  	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
> >  	kvm_destroy_vm_debugfs(kvm);
> >  	kvm_arch_sync_events(kvm);
> > @@ -2301,7 +2341,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> >  {
> >  	if (memslot && memslot->dirty_bitmap) {
> >  		unsigned long rel_gfn = gfn - memslot->base_gfn;
> > -
> > +		mark_page_dirty_in_ring(kvm, vcpu, memslot, gfn);
> >  		set_bit_le(rel_gfn, memslot->dirty_bitmap);
> >  	}
> >  }
> > @@ -2649,6 +2689,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
>
Sean Christopherson Dec. 2, 2019, 9:50 p.m. UTC | #3
On Mon, Dec 02, 2019 at 04:16:40PM -0500, Peter Xu wrote:
> On Mon, Dec 02, 2019 at 12:10:36PM -0800, Sean Christopherson wrote:
> > On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> > > Currently, we have N+1 rings for each VM of N vcpus:
> > > 
> > >   - for each vcpu, we have 1 per-vcpu dirty ring,
> > >   - for each vm, we have 1 per-vm dirty ring
> > 
> > Why?  I assume the purpose of per-vcpu rings is to avoid contention between
> > threads, but the motivation needs to be explicitly stated.  And why is a
> > per-vm fallback ring needed?
> 
> Yes, as explained in previous reply, the problem is there could have
> guest memory writes without vcpu contexts.
> 
> > 
> > If my assumption is correct, have other approaches been tried/profiled?
> > E.g. using cmpxchg to reserve N number of entries in a shared ring.
> 
> Not yet, but I'd be fine to try anything if there's better
> alternatives.  Besides, could you help explain why sharing one ring
> and let each vcpu to reserve a region in the ring could be helpful in
> the pov of performance?

The goal would be to avoid taking a lock, or at least to avoid holding a
lock for an extended duration, e.g. some sort of multi-step process where
entries in the ring are first reserved, then filled, and finally marked
valid.  That'd allow the "fill" action to be done in parallel.
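
Roughly, the fill/publish step could look like this (just a sketch; the
per-entry "valid" flag is hypothetical, the posted kvm_dirty_gfn
doesn't have one, and the reservation step is sketched in my earlier
reply):

static void dirty_ring_fill_one(struct kvm_dirty_ring *ring, u32 idx,
				u32 slot, u64 offset)
{
	struct kvm_dirty_gfn *entry = &ring->dirty_gfns[idx & (ring->size - 1)];

	/* Fill a previously reserved slot, in parallel with other producers. */
	entry->slot = slot;
	entry->offset = offset;

	/* Publish last, so the consumer skips slots that are reserved
	 * but not yet filled. */
	smp_store_release(&entry->flags, 1 /* hypothetical "valid" flag */);
}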

In case it isn't clear, I haven't thought through an actual solution :-).

My point is that I think it's worth exploring and profiling other
implementations because the dual per-vm and per-vcpu rings have a few warts
that we'd be stuck with forever.

> > IMO,
> > adding kvm_get_running_vcpu() is a hack that is just asking for future
> > abuse and the vcpu/vm/as_id interactions in mark_page_dirty_in_ring()
> > look extremely fragile.
> 
> I agree.  Another way is to put heavier traffic to the per-vm ring,
> but the downside could be that the per-vm ring could get full easier
> (but I haven't tested).

There's nothing that prevents increasing the size of the common ring each
time a new vCPU is added.  Alternatively, userspace could explicitly
request or hint the desired ring size.

> > I also dislike having two different mechanisms
> > for accessing the ring (lock for per-vm, something else for per-vcpu).
> 
> Actually I proposed to drop the per-vm ring (actually I had a version
> that implemented this.. and I just changed it back to the per-vm ring
> later on, see below) and when there's no vcpu context I thought about:
> 
>   (1) use vcpu0 ring
> 
>   (2) or a better algo to pick up a per-vcpu ring (like, the less full
>       ring, we can do many things here, e.g., we can easily maintain a
>       structure track this so we can get O(1) search, I think)
> 
> I discussed this with Paolo, but I think Paolo preferred the per-vm
> ring because there's no good reason to choose vcpu0 as what (1)
> suggested.  While if to choose (2) we probably need to lock even for
> per-cpu ring, so could be a bit slower.

Ya, per-vm is definitely better than dumping on vcpu0.  I'm hoping we can
find a third option that provides comparable performance without using any
per-vcpu rings.
Peter Xu Dec. 2, 2019, 11:09 p.m. UTC | #4
On Mon, Dec 02, 2019 at 01:50:49PM -0800, Sean Christopherson wrote:
> On Mon, Dec 02, 2019 at 04:16:40PM -0500, Peter Xu wrote:
> > On Mon, Dec 02, 2019 at 12:10:36PM -0800, Sean Christopherson wrote:
> > > On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> > > > Currently, we have N+1 rings for each VM of N vcpus:
> > > > 
> > > >   - for each vcpu, we have 1 per-vcpu dirty ring,
> > > >   - for each vm, we have 1 per-vm dirty ring
> > > 
> > > Why?  I assume the purpose of per-vcpu rings is to avoid contention between
> > > threads, but the motivation needs to be explicitly stated.  And why is a
> > > per-vm fallback ring needed?
> > 
> > Yes, as explained in previous reply, the problem is there could have
> > guest memory writes without vcpu contexts.
> > 
> > > 
> > > If my assumption is correct, have other approaches been tried/profiled?
> > > E.g. using cmpxchg to reserve N number of entries in a shared ring.
> > 
> > Not yet, but I'd be fine to try anything if there's better
> > alternatives.  Besides, could you help explain why sharing one ring
> > and let each vcpu to reserve a region in the ring could be helpful in
> > the pov of performance?
> 
> The goal would be to avoid taking a lock, or at least to avoid holding a
> lock for an extended duration, e.g. some sort of multi-step process where
> entries in the ring are first reserved, then filled, and finally marked
> valid.  That'd allow the "fill" action to be done in parallel.

Considering that a per-vcpu ring should be no worse than this, IIUC
you prefer a single per-vm ring here, without per-vcpu rings.
However, I don't see a good reason to manually split a per-vm
resource into per-vcpu regions, instead of using the per-vcpu
structure directly like this series does...  Or could you show me
what I've missed?

IMHO it's really natural to use kvm_vcpu to split the ring, as long as
we still want it to work in parallel across the vcpus.

> 
> In case it isn't clear, I haven't thought through an actual solution :-).

Feel free to shoot when the ideas come. :) I'd be glad to test your
idea, especially where it could be better!

> 
> My point is that I think it's worth exploring and profiling other
> implementations because the dual per-vm and per-vcpu rings has a few warts
> that we'd be stuck with forever.

I do agree that the interface could be a bit awkward if we keep these
two rings.  Besides this, do you have other concerns?

And regarding profiling, I hope I understand it right that it should
be about something unrelated to the specific issue we're discussing
(say, whether to use a per-vm ring, or per-vm + per-vcpu rings),
because for performance IMHO it's really the layout of the ring that
matters more, along with how the ring is shared and accessed between
userspace and the kernel.

For the current implementation (I'm not sure whether that's the
initial version from Lei, or Paolo's, anyway...), IMHO it's good
enough from a performance point of view in that it at least supports:

  (1) zero copy
  (2) a completely async model
  (3) per-vcpu isolation

None of these exists for KVM_GET_DIRTY_LOG.  Not to mention that
tracking dirty bits is not really that "performance critical" - in
QEMU we have plenty of ways to explicitly throttle the CPU (like
cpu-throttle), precisely because dirtying pages, even with the whole
tracking overhead and even using KVM_GET_DIRTY_LOG, is already too
fast, and the slow part is QEMU collecting and sending the pages! :)

> 
> > > IMO,
> > > adding kvm_get_running_vcpu() is a hack that is just asking for future
> > > abuse and the vcpu/vm/as_id interactions in mark_page_dirty_in_ring()
> > > look extremely fragile.
> > 
> > I agree.  Another way is to put heavier traffic to the per-vm ring,
> > but the downside could be that the per-vm ring could get full easier
> > (but I haven't tested).
> 
> There's nothing that prevents increasing the size of the common ring each
> time a new vCPU is added.  Alternatively, userspace could explicitly
> request or hint the desired ring size.

Yeah I don't have a strong opinion on this, but I just don't see
explicitly exposing this API to userspace as greatly helpful.  IMHO
for now a global ring size should be good enough.  If userspace wants
to be fast, the ring can hardly get full (because collection of the
dirty ring can be really, really fast if userspace wants it to be).

> 
> > > I also dislike having two different mechanisms
> > > for accessing the ring (lock for per-vm, something else for per-vcpu).
> > 
> > Actually I proposed to drop the per-vm ring (actually I had a version
> > that implemented this.. and I just changed it back to the per-vm ring
> > later on, see below) and when there's no vcpu context I thought about:
> > 
> >   (1) use vcpu0 ring
> > 
> >   (2) or a better algo to pick up a per-vcpu ring (like, the less full
> >       ring, we can do many things here, e.g., we can easily maintain a
> >       structure track this so we can get O(1) search, I think)
> > 
> > I discussed this with Paolo, but I think Paolo preferred the per-vm
> > ring because there's no good reason to choose vcpu0 as what (1)
> > suggested.  While if to choose (2) we probably need to lock even for
> > per-cpu ring, so could be a bit slower.
> 
> Ya, per-vm is definitely better than dumping on vcpu0.  I'm hoping we can
> find a third option that provides comparable performance without using any
> per-vcpu rings.

I'm still uncertain whether it's a good idea to drop the per-vcpu
ring (as stated above).  But I'm still open to any further thoughts,
as long as I can start to understand when an only-per-vm ring would
be better.

Thanks!
Paolo Bonzini Dec. 3, 2019, 1:48 p.m. UTC | #5
On 02/12/19 22:50, Sean Christopherson wrote:
>>
>> I discussed this with Paolo, but I think Paolo preferred the per-vm
>> ring because there's no good reason to choose vcpu0 as what (1)
>> suggested.  While if to choose (2) we probably need to lock even for
>> per-cpu ring, so could be a bit slower.
> Ya, per-vm is definitely better than dumping on vcpu0.  I'm hoping we can
> find a third option that provides comparable performance without using any
> per-vcpu rings.
> 

The advantage of per-vCPU rings is that they naturally: 1) parallelize
the processing of dirty pages; 2) make the userspace vCPU thread do
more work on vCPUs that dirty more pages.

I agree that on the producer side we could reserve multiple entries in
the case of PML (and without PML only one entry should be added at a
time).  But I'm afraid that things get ugly when the ring is full,
because you'd have to wait for all vCPUs to finish publishing the
entries they have reserved.

It's ugly that we _also_ need a per-VM ring, but unfortunately some
operations do not really have a vCPU that they can refer to.

Paolo
Sean Christopherson Dec. 3, 2019, 6:46 p.m. UTC | #6
On Tue, Dec 03, 2019 at 02:48:10PM +0100, Paolo Bonzini wrote:
> On 02/12/19 22:50, Sean Christopherson wrote:
> >>
> >> I discussed this with Paolo, but I think Paolo preferred the per-vm
> >> ring because there's no good reason to choose vcpu0 as what (1)
> >> suggested.  While if to choose (2) we probably need to lock even for
> >> per-cpu ring, so could be a bit slower.
> > Ya, per-vm is definitely better than dumping on vcpu0.  I'm hoping we can
> > find a third option that provides comparable performance without using any
> > per-vcpu rings.
> > 
> 
> The advantage of per-vCPU rings is that it naturally: 1) parallelizes
> the processing of dirty pages; 2) makes userspace vCPU thread do more
> work on vCPUs that dirty more pages.
> 
> I agree that on the producer side we could reserve multiple entries in
> the case of PML (and without PML only one entry should be added at a
> time).  But I'm afraid that things get ugly when the ring is full,
> because you'd have to wait for all vCPUs to finish publishing the
> entries they have reserved.

Ah, I take it the intended model is that userspace will only start pulling
entries off the ring when KVM explicitly signals that the ring is "full"?

Rather than reserve entries, what if vCPUs reserved an entire ring?  Create
a pool of N=nr_vcpus rings that are shared by all vCPUs.  To mark pages
dirty, a vCPU claims a ring, pushes the pages into the ring, and then
returns the ring to the pool.  If pushing pages hits the soft limit, a
request is made to drain the ring and the ring is not returned to the pool
until it is drained.

Except for acquiring a ring, which likely can be heavily optimized, that'd
allow parallel processing (#1), and would provide a facsimile of #2 as
pushing more pages onto a ring would naturally increase the likelihood of
triggering a drain.  And it might be interesting to see the effect of using
different methods of ring selection, e.g. pure round robin, LRU, last used
on the current vCPU, etc...
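
Something like the following, just to sketch the idea (all names here
are made up, and re-adding a drained ring back to the pool is omitted):

struct kvm_dirty_ring_pool {
	spinlock_t lock;
	unsigned long free_map;		/* bit set => ring is available */
	struct kvm_dirty_ring *rings;	/* N rings, N ~ nr_vcpus */
};

static struct kvm_dirty_ring *ring_pool_claim(struct kvm_dirty_ring_pool *pool)
{
	struct kvm_dirty_ring *ring = NULL;
	unsigned long idx;

	spin_lock(&pool->lock);
	idx = find_first_bit(&pool->free_map, BITS_PER_LONG);
	if (idx < BITS_PER_LONG) {
		__clear_bit(idx, &pool->free_map);
		ring = &pool->rings[idx];
	}
	spin_unlock(&pool->lock);

	return ring;	/* NULL => all rings busy or waiting to be drained */
}

static void ring_pool_release(struct kvm_vcpu *vcpu,
			      struct kvm_dirty_ring_pool *pool,
			      struct kvm_dirty_ring *ring)
{
	/* Keep a nearly full ring out of the pool until it's drained. */
	if (kvm_dirty_ring_used(ring) >= ring->soft_limit) {
		kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
		return;
	}

	spin_lock(&pool->lock);
	__set_bit(ring - pool->rings, &pool->free_map);
	spin_unlock(&pool->lock);
}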

> It's ugly that we _also_ need a per-VM ring, but unfortunately some
> operations do not really have a vCPU that they can refer to.
Sean Christopherson Dec. 3, 2019, 7:13 p.m. UTC | #7
On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> +static void mark_page_dirty_in_ring(struct kvm *kvm,
> +				    struct kvm_vcpu *vcpu,
> +				    struct kvm_memory_slot *slot,
> +				    gfn_t gfn)
> +{
> +	u32 as_id = 0;

Redundant initialization of as_id.

> +	u64 offset;
> +	int ret;
> +	struct kvm_dirty_ring *ring;
> +	struct kvm_dirty_ring_indexes *indexes;
> +	bool is_vm_ring;
> +
> +	if (!kvm->dirty_ring_size)
> +		return;
> +
> +	offset = gfn - slot->base_gfn;
> +
> +	if (vcpu) {
> +		as_id = kvm_arch_vcpu_memslots_id(vcpu);
> +	} else {
> +		as_id = 0;

The setting of as_id is wrong, both with and without a vCPU.  as_id should
come from slot->as_id.  It may not be actually broken in the current code
base, but at best it's fragile, e.g. Ben's TDP MMU rewrite[*] adds a call
to mark_page_dirty_in_slot() with a potentially non-zero as_id.

[*] https://lkml.kernel.org/r/20190926231824.149014-25-bgardon@google.com

> +		vcpu = kvm_get_running_vcpu();
> +	}
> +
> +	if (vcpu) {
> +		ring = &vcpu->dirty_ring;
> +		indexes = &vcpu->run->vcpu_ring_indexes;
> +		is_vm_ring = false;
> +	} else {
> +		/*
> +		 * Put onto per vm ring because no vcpu context.  Kick
> +		 * vcpu0 if ring is full.
> +		 */
> +		vcpu = kvm->vcpus[0];

Is this a rare event?

> +		ring = &kvm->vm_dirty_ring;
> +		indexes = &kvm->vm_run->vm_ring_indexes;
> +		is_vm_ring = true;
> +	}
> +
> +	ret = kvm_dirty_ring_push(ring, indexes,
> +				  (as_id << 16)|slot->id, offset,
> +				  is_vm_ring);
> +	if (ret < 0) {
> +		if (is_vm_ring)
> +			pr_warn_once("vcpu %d dirty log overflow\n",
> +				     vcpu->vcpu_id);
> +		else
> +			pr_warn_once("per-vm dirty log overflow\n");
> +		return;
> +	}
> +
> +	if (ret)
> +		kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
> +}
Paolo Bonzini Dec. 4, 2019, 10:05 a.m. UTC | #8
On 03/12/19 19:46, Sean Christopherson wrote:
> On Tue, Dec 03, 2019 at 02:48:10PM +0100, Paolo Bonzini wrote:
>> On 02/12/19 22:50, Sean Christopherson wrote:
>>>>
>>>> I discussed this with Paolo, but I think Paolo preferred the per-vm
>>>> ring because there's no good reason to choose vcpu0 as what (1)
>>>> suggested.  While if to choose (2) we probably need to lock even for
>>>> per-cpu ring, so could be a bit slower.
>>> Ya, per-vm is definitely better than dumping on vcpu0.  I'm hoping we can
>>> find a third option that provides comparable performance without using any
>>> per-vcpu rings.
>>>
>>
>> The advantage of per-vCPU rings is that it naturally: 1) parallelizes
>> the processing of dirty pages; 2) makes userspace vCPU thread do more
>> work on vCPUs that dirty more pages.
>>
>> I agree that on the producer side we could reserve multiple entries in
>> the case of PML (and without PML only one entry should be added at a
>> time).  But I'm afraid that things get ugly when the ring is full,
>> because you'd have to wait for all vCPUs to finish publishing the
>> entries they have reserved.
> 
> Ah, I take it the intended model is that userspace will only start pulling
> entries off the ring when KVM explicitly signals that the ring is "full"?

No, it's not.  But perhaps in the asynchronous case you can delay
pushing the reserved entries to the consumer until a moment where no
CPUs have left empty slots in the ring buffer (somebody must have done
multi-producer ring buffers before).  In the ring-full case that is
harder because it requires synchronization.

> Rather than reserve entries, what if vCPUs reserved an entire ring?  Create
> a pool of N=nr_vcpus rings that are shared by all vCPUs.  To mark pages
> dirty, a vCPU claims a ring, pushes the pages into the ring, and then
> returns the ring to the pool.  If pushing pages hits the soft limit, a
> request is made to drain the ring and the ring is not returned to the pool
> until it is drained.
> 
> Except for acquiring a ring, which likely can be heavily optimized, that'd
> allow parallel processing (#1), and would provide a facsimile of #2 as
> pushing more pages onto a ring would naturally increase the likelihood of
> triggering a drain.  And it might be interesting to see the effect of using
> different methods of ring selection, e.g. pure round robin, LRU, last used
> on the current vCPU, etc...

If you are creating nr_vcpus rings, and draining is done on the vCPU
thread that has filled the ring, why not create nr_vcpus+1?  The current
code then is exactly the same as pre-claiming a ring per vCPU and never
releasing it, and using a spinlock to claim the per-VM ring.

However, we could build on top of my other suggestion to add
slot->as_id, and wrap kvm_get_running_vcpu() with a nice API, mimicking
exactly what you've suggested.  Maybe even add a scary comment around
kvm_get_running_vcpu() suggesting that users only do so to avoid locking
and wrap it with a nice API.  Similar to what get_cpu/put_cpu do with
smp_processor_id.

1) Add a pointer from struct kvm_dirty_ring to struct
kvm_dirty_ring_indexes:

vcpu->dirty_ring->data = &vcpu->run->vcpu_ring_indexes;
kvm->vm_dirty_ring->data = &kvm->vm_run->vm_ring_indexes;

2) push the ring choice and locking to two new functions

struct kvm_ring *kvm_get_dirty_ring(struct kvm *kvm)
{
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

	if (vcpu && !WARN_ON_ONCE(vcpu->kvm != kvm)) {
		return &vcpu->dirty_ring;
	} else {
		/*
		 * Put onto per vm ring because no vcpu context.
		 * We'll kick vcpu0 if ring is full.
		 */
		spin_lock(&kvm->vm_dirty_ring->lock);
		return &kvm->vm_dirty_ring;
	}
}

void kvm_put_dirty_ring(struct kvm *kvm,
			struct kvm_dirty_ring *ring)
{
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
	bool full = kvm_dirty_ring_used(ring) >= ring->soft_limit;

	if (ring == &kvm->vm_dirty_ring) {
		if (vcpu == NULL)
			vcpu = kvm->vcpus[0];
		spin_unlock(&kvm->vm_dirty_ring->lock);
	}

	if (full)
		kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
}

3) simplify kvm_dirty_ring_push to

void kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
			 u32 slot, u64 offset)
{
	/* left as an exercise to the reader */
}

and mark_page_dirty_in_ring to

static void mark_page_dirty_in_ring(struct kvm *kvm,
				    struct kvm_memory_slot *slot,
				    gfn_t gfn)
{
	struct kvm_dirty_ring *ring;

	if (!kvm->dirty_ring_size)
		return;

	ring = kvm_get_dirty_ring(kvm);
	kvm_dirty_ring_push(ring, (slot->as_id << 16) | slot->id,
			    gfn - slot->base_gfn);
	kvm_put_dirty_ring(kvm, ring);
}
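
For completeness, one possible body for the simplified push (a sketch
based on the push logic in the RFC; the soft-limit/full handling now
lives entirely in kvm_put_dirty_ring):

void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
{
	struct kvm_dirty_gfn *entry;

	/* kvm_get_dirty_ring() guarantees exclusive access to @ring. */
	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
	entry->slot = slot;
	entry->offset = offset;

	/* Make the entry visible before publishing the new index. */
	smp_wmb();
	ring->dirty_index++;
	WRITE_ONCE(ring->data->avail_index, ring->dirty_index);
}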

Paolo

>> It's ugly that we _also_ need a per-VM ring, but unfortunately some
>> operations do not really have a vCPU that they can refer to.
>
Paolo Bonzini Dec. 4, 2019, 10:14 a.m. UTC | #9
On 03/12/19 20:13, Sean Christopherson wrote:
> The setting of as_id is wrong, both with and without a vCPU.  as_id should
> come from slot->as_id.

Which doesn't exist, but is an excellent suggestion nevertheless.

>> +		/*
>> +		 * Put onto per vm ring because no vcpu context.  Kick
>> +		 * vcpu0 if ring is full.
>> +		 */
>> +		vcpu = kvm->vcpus[0];
> 
> Is this a rare event?

Yes, every time a vCPU exit happens, the vCPU is supposed to reap the VM
ring as well.  (Most of the time it will be empty, and while the reaping
of VM ring entries needs locking, the emptiness check doesn't).

Paolo

>> +		ring = &kvm->vm_dirty_ring;
>> +		indexes = &kvm->vm_run->vm_ring_indexes;
>> +		is_vm_ring = true;
>> +	}
>> +
>> +	ret = kvm_dirty_ring_push(ring, indexes,
>> +				  (as_id << 16)|slot->id, offset,
>> +				  is_vm_ring);
>> +	if (ret < 0) {
>> +		if (is_vm_ring)
>> +			pr_warn_once("vcpu %d dirty log overflow\n",
>> +				     vcpu->vcpu_id);
>> +		else
>> +			pr_warn_once("per-vm dirty log overflow\n");
>> +		return;
>> +	}
>> +
>> +	if (ret)
>> +		kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
>> +}
>
Jason Wang Dec. 4, 2019, 10:38 a.m. UTC | #10
On 2019/11/30 上午5:34, Peter Xu wrote:
> +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> +			struct kvm_dirty_ring_indexes *indexes,
> +			u32 slot, u64 offset, bool lock)
> +{
> +	int ret;
> +	struct kvm_dirty_gfn *entry;
> +
> +	if (lock)
> +		spin_lock(&ring->lock);
> +
> +	if (kvm_dirty_ring_full(ring)) {
> +		ret = -EBUSY;
> +		goto out;
> +	}
> +
> +	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> +	entry->slot = slot;
> +	entry->offset = offset;


I haven't gone through the whole series, sorry if this is a silly
question, but I wonder whether things like this will suffer from a
similar issue on virtually tagged archs, as mentioned in [1].

Is it better to allocate the ring from userspace and pass it to KVM
instead?  Then we can use the copy_to/from_user() friends (a little
bit slow on recent CPUs).

[1] https://lkml.org/lkml/2019/4/9/5

Thanks


> +	smp_wmb();
> +	ring->dirty_index++;
> +	WRITE_ONCE(indexes->avail_index, ring->dirty_index);
> +	ret = kvm_dirty_ring_used(ring) >= ring->soft_limit;
> +	pr_info("%s: slot %u offset %llu used %u\n",
> +		__func__, slot, offset, kvm_dirty_ring_used(ring));
> +
> +out:
Paolo Bonzini Dec. 4, 2019, 11:04 a.m. UTC | #11
On 04/12/19 11:38, Jason Wang wrote:
>>
>> +    entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
>> +    entry->slot = slot;
>> +    entry->offset = offset;
> 
> 
> Haven't gone through the whole series, sorry if it was a silly question
> but I wonder things like this will suffer from similar issue on
> virtually tagged archs as mentioned in [1].

There is no new infrastructure to track the dirty pages---it's just a
different way to pass them to userspace.

> Is this better to allocate the ring from userspace and set to KVM
> instead? Then we can use copy_to/from_user() friends (a little bit slow
> on recent CPUs).

Yeah, I don't think that would be better than mmap.

Paolo


> [1] https://lkml.org/lkml/2019/4/9/5
Sean Christopherson Dec. 4, 2019, 2:33 p.m. UTC | #12
On Wed, Dec 04, 2019 at 11:14:19AM +0100, Paolo Bonzini wrote:
> On 03/12/19 20:13, Sean Christopherson wrote:
> > The setting of as_id is wrong, both with and without a vCPU.  as_id should
> > come from slot->as_id.
> 
> Which doesn't exist, but is an excellent suggestion nevertheless.

Huh, I explicitly looked at the code to make sure as_id existed before
making this suggestion.  No idea what code I actually pulled up.
Peter Xu Dec. 4, 2019, 7:52 p.m. UTC | #13
On Wed, Dec 04, 2019 at 12:04:53PM +0100, Paolo Bonzini wrote:
> On 04/12/19 11:38, Jason Wang wrote:
> >>
> >> +    entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> >> +    entry->slot = slot;
> >> +    entry->offset = offset;
> > 
> > 
> > Haven't gone through the whole series, sorry if it was a silly question
> > but I wonder things like this will suffer from similar issue on
> > virtually tagged archs as mentioned in [1].
> 
> There is no new infrastructure to track the dirty pages---it's just a
> different way to pass them to userspace.
> 
> > Is this better to allocate the ring from userspace and set to KVM
> > instead? Then we can use copy_to/from_user() friends (a little bit slow
> > on recent CPUs).
> 
> Yeah, I don't think that would be better than mmap.

Yeah I agree, because I didn't see how copy_to/from_user() would help
with the icache/dcache flushing...

Some context here: Jason first raised this question offlist, on
whether we also need these flush_[d]cache_page() helpers for
operations like kvm dirty ring accesses.  I feel like we do, however
I've got two other questions:

  - if we need to do flush_dcache_page() on kernel-modified pages
    (assuming the same page is also mapped to userspace), then why
    don't we need flush_cache_page() too on the page, where
    flush_cache_page() is defined as not-a-nop on those archs?

  - assuming an arch has a not-a-nop implementation of
    flush_[d]cache_page(), would atomic operations like cmpxchg really
    work for it (assuming that instructions like cmpxchg depend on
    cache consistency)?

Sorry, these are for sure a bit off topic for the kvm dirty ring
patchset, but since we're at it, I'm raising the questions in case
there are answers...

Thanks,
Jason Wang Dec. 5, 2019, 6:51 a.m. UTC | #14
On 2019/12/5 上午3:52, Peter Xu wrote:
> On Wed, Dec 04, 2019 at 12:04:53PM +0100, Paolo Bonzini wrote:
>> On 04/12/19 11:38, Jason Wang wrote:
>>>> +    entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
>>>> +    entry->slot = slot;
>>>> +    entry->offset = offset;
>>>
>>> Haven't gone through the whole series, sorry if it was a silly question
>>> but I wonder things like this will suffer from similar issue on
>>> virtually tagged archs as mentioned in [1].
>> There is no new infrastructure to track the dirty pages---it's just a
>> different way to pass them to userspace.
>>
>>> Is this better to allocate the ring from userspace and set to KVM
>>> instead? Then we can use copy_to/from_user() friends (a little bit slow
>>> on recent CPUs).
>> Yeah, I don't think that would be better than mmap.
> Yeah I agree, because I didn't see how copy_to/from_user() helped to
> do icache/dcache flushings...


It looks to me that one advantage is that exactly the same VA is used
by both userspace and the kernel, so there will be no aliasing.

Thanks


>
> Some context here: Jason raised this question offlist first on whether
> we should also need these flush_dcache_cache() helpers for operations
> like kvm dirty ring accesses.  I feel like it should, however I've got
> two other questions, on:
>
>    - if we need to do flush_dcache_page() on kernel modified pages
>      (assuming the same page has mapped to userspace), then why don't
>      we need flush_cache_page() too on the page, where
>      flush_cache_page() is defined not-a-nop on those archs?
>
>    - assuming an arch has not-a-nop impl for flush_[d]cache_page(),
>      would atomic operations like cmpxchg really work for them
>      (assuming that ISAs like cmpxchg should depend on cache
>      consistency).
>
> Sorry I think these are for sure a bit out of topic for kvm dirty ring
> patchset, but since we're at it, I'm raising the questions up in case
> there're answers..
>
> Thanks,
>
Peter Xu Dec. 5, 2019, 12:08 p.m. UTC | #15
On Thu, Dec 05, 2019 at 02:51:15PM +0800, Jason Wang wrote:
> 
> On 2019/12/5 上午3:52, Peter Xu wrote:
> > On Wed, Dec 04, 2019 at 12:04:53PM +0100, Paolo Bonzini wrote:
> > > On 04/12/19 11:38, Jason Wang wrote:
> > > > > +    entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> > > > > +    entry->slot = slot;
> > > > > +    entry->offset = offset;
> > > > 
> > > > Haven't gone through the whole series, sorry if it was a silly question
> > > > but I wonder things like this will suffer from similar issue on
> > > > virtually tagged archs as mentioned in [1].
> > > There is no new infrastructure to track the dirty pages---it's just a
> > > different way to pass them to userspace.
> > > 
> > > > Is this better to allocate the ring from userspace and set to KVM
> > > > instead? Then we can use copy_to/from_user() friends (a little bit slow
> > > > on recent CPUs).
> > > Yeah, I don't think that would be better than mmap.
> > Yeah I agree, because I didn't see how copy_to/from_user() helped to
> > do icache/dcache flushings...
> 
> 
> It looks to me one advantage is that exact the same VA is used by both
> userspace and kernel so there will be no alias.

Hmm.. but what if the page is mapped more than once in userspace?  Thanks,
Jason Wang Dec. 5, 2019, 1:12 p.m. UTC | #16
On 2019/12/5 下午8:08, Peter Xu wrote:
> On Thu, Dec 05, 2019 at 02:51:15PM +0800, Jason Wang wrote:
>> On 2019/12/5 上午3:52, Peter Xu wrote:
>>> On Wed, Dec 04, 2019 at 12:04:53PM +0100, Paolo Bonzini wrote:
>>>> On 04/12/19 11:38, Jason Wang wrote:
>>>>>> +    entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
>>>>>> +    entry->slot = slot;
>>>>>> +    entry->offset = offset;
>>>>> Haven't gone through the whole series, sorry if it was a silly question
>>>>> but I wonder things like this will suffer from similar issue on
>>>>> virtually tagged archs as mentioned in [1].
>>>> There is no new infrastructure to track the dirty pages---it's just a
>>>> different way to pass them to userspace.
>>>>
>>>>> Is this better to allocate the ring from userspace and set to KVM
>>>>> instead? Then we can use copy_to/from_user() friends (a little bit slow
>>>>> on recent CPUs).
>>>> Yeah, I don't think that would be better than mmap.
>>> Yeah I agree, because I didn't see how copy_to/from_user() helped to
>>> do icache/dcache flushings...
>>
>> It looks to me one advantage is that exact the same VA is used by both
>> userspace and kernel so there will be no alias.
> Hmm.. but what if the page is mapped more than once in user?  Thanks,


Then it's the responsibility of the userspace program to do the flush, I think.

Thanks

>
Sean Christopherson Dec. 7, 2019, 12:29 a.m. UTC | #17
On Wed, Dec 04, 2019 at 11:05:47AM +0100, Paolo Bonzini wrote:
> On 03/12/19 19:46, Sean Christopherson wrote:
> > Rather than reserve entries, what if vCPUs reserved an entire ring?  Create
> > a pool of N=nr_vcpus rings that are shared by all vCPUs.  To mark pages
> > dirty, a vCPU claims a ring, pushes the pages into the ring, and then
> > returns the ring to the pool.  If pushing pages hits the soft limit, a
> > request is made to drain the ring and the ring is not returned to the pool
> > until it is drained.
> > 
> > Except for acquiring a ring, which likely can be heavily optimized, that'd
> > allow parallel processing (#1), and would provide a facsimile of #2 as
> > pushing more pages onto a ring would naturally increase the likelihood of
> > triggering a drain.  And it might be interesting to see the effect of using
> > different methods of ring selection, e.g. pure round robin, LRU, last used
> > on the current vCPU, etc...
> 
> If you are creating nr_vcpus rings, and draining is done on the vCPU
> thread that has filled the ring, why not create nr_vcpus+1?  The current
> code then is exactly the same as pre-claiming a ring per vCPU and never
> releasing it, and using a spinlock to claim the per-VM ring.

Because I really don't like kvm_get_running_vcpu() :-)

Binding the rings to vCPUs also makes for an inflexible API, e.g. the
amount of memory required for the rings scales linearly with the number of
vCPUs, or maybe there's a use case for having M:N vCPUs:rings.

That being said, I'm pretty clueless when it comes to implementing and
tuning the userspace side of this type of stuff, so feel free to ignore my
thoughts on the API.
Paolo Bonzini Dec. 9, 2019, 9:37 a.m. UTC | #18
On 07/12/19 01:29, Sean Christopherson wrote:
> On Wed, Dec 04, 2019 at 11:05:47AM +0100, Paolo Bonzini wrote:
>> On 03/12/19 19:46, Sean Christopherson wrote:
>>> Rather than reserve entries, what if vCPUs reserved an entire ring?  Create
>>> a pool of N=nr_vcpus rings that are shared by all vCPUs.  To mark pages
>>> dirty, a vCPU claims a ring, pushes the pages into the ring, and then
>>> returns the ring to the pool.  If pushing pages hits the soft limit, a
>>> request is made to drain the ring and the ring is not returned to the pool
>>> until it is drained.
>>>
>>> Except for acquiring a ring, which likely can be heavily optimized, that'd
>>> allow parallel processing (#1), and would provide a facsimile of #2 as
>>> pushing more pages onto a ring would naturally increase the likelihood of
>>> triggering a drain.  And it might be interesting to see the effect of using
>>> different methods of ring selection, e.g. pure round robin, LRU, last used
>>> on the current vCPU, etc...
>>
>> If you are creating nr_vcpus rings, and draining is done on the vCPU
>> thread that has filled the ring, why not create nr_vcpus+1?  The current
>> code then is exactly the same as pre-claiming a ring per vCPU and never
>> releasing it, and using a spinlock to claim the per-VM ring.
> 
> Because I really don't like kvm_get_running_vcpu() :-)

I also don't like it particularly, but I think it's okay to wrap it into
a nicer API.

> Binding the rings to vCPUs also makes for an inflexible API, e.g. the
> amount of memory required for the rings scales linearly with the number of
> vCPUs, or maybe there's a use case for having M:N vCPUs:rings.

If we can get rid of the dirty bitmap, the amount of memory is probably
going to be smaller anyway.  For example at 64k per ring, 256 rings
occupy 16 MiB of memory, and that is the cost of dirty bitmaps for 512
GiB of guest memory, and that's probably what you can expect for the
memory of a 256-vCPU guest (at least roughly: if the memory is 128 GiB,
the extra 12 MiB for dirty page rings don't really matter).

Paolo

> That being said, I'm pretty clueless when it comes to implementing and
> tuning the userspace side of this type of stuff, so feel free to ignore my
> thoughts on the API.
>
Peter Xu Dec. 9, 2019, 9:54 p.m. UTC | #19
On Wed, Dec 04, 2019 at 11:05:47AM +0100, Paolo Bonzini wrote:
> On 03/12/19 19:46, Sean Christopherson wrote:
> > On Tue, Dec 03, 2019 at 02:48:10PM +0100, Paolo Bonzini wrote:
> >> On 02/12/19 22:50, Sean Christopherson wrote:
> >>>>
> >>>> I discussed this with Paolo, but I think Paolo preferred the per-vm
> >>>> ring because there's no good reason to choose vcpu0 as what (1)
> >>>> suggested.  While if to choose (2) we probably need to lock even for
> >>>> per-cpu ring, so could be a bit slower.
> >>> Ya, per-vm is definitely better than dumping on vcpu0.  I'm hoping we can
> >>> find a third option that provides comparable performance without using any
> >>> per-vcpu rings.
> >>>
> >>
> >> The advantage of per-vCPU rings is that it naturally: 1) parallelizes
> >> the processing of dirty pages; 2) makes userspace vCPU thread do more
> >> work on vCPUs that dirty more pages.
> >>
> >> I agree that on the producer side we could reserve multiple entries in
> >> the case of PML (and without PML only one entry should be added at a
> >> time).  But I'm afraid that things get ugly when the ring is full,
> >> because you'd have to wait for all vCPUs to finish publishing the
> >> entries they have reserved.
> > 
> > Ah, I take it the intended model is that userspace will only start pulling
> > entries off the ring when KVM explicitly signals that the ring is "full"?
> 
> No, it's not.  But perhaps in the asynchronous case you can delay
> pushing the reserved entries to the consumer until a moment where no
> CPUs have left empty slots in the ring buffer (somebody must have done
> multi-producer ring buffers before).  In the ring-full case that is
> harder because it requires synchronization.
> 
> > Rather than reserve entries, what if vCPUs reserved an entire ring?  Create
> > a pool of N=nr_vcpus rings that are shared by all vCPUs.  To mark pages
> > dirty, a vCPU claims a ring, pushes the pages into the ring, and then
> > returns the ring to the pool.  If pushing pages hits the soft limit, a
> > request is made to drain the ring and the ring is not returned to the pool
> > until it is drained.
> > 
> > Except for acquiring a ring, which likely can be heavily optimized, that'd
> > allow parallel processing (#1), and would provide a facsimile of #2 as
> > pushing more pages onto a ring would naturally increase the likelihood of
> > triggering a drain.  And it might be interesting to see the effect of using
> > different methods of ring selection, e.g. pure round robin, LRU, last used
> > on the current vCPU, etc...
> 
> If you are creating nr_vcpus rings, and draining is done on the vCPU
> thread that has filled the ring, why not create nr_vcpus+1?  The current
> code then is exactly the same as pre-claiming a ring per vCPU and never
> releasing it, and using a spinlock to claim the per-VM ring.
> 
> However, we could build on top of my other suggestion to add
> slot->as_id, and wrap kvm_get_running_vcpu() with a nice API, mimicking
> exactly what you've suggested.  Maybe even add a scary comment around
> kvm_get_running_vcpu() suggesting that users only do so to avoid locking
> and wrap it with a nice API.  Similar to what get_cpu/put_cpu do with
> smp_processor_id.
> 
> 1) Add a pointer from struct kvm_dirty_ring to struct
> kvm_dirty_ring_indexes:
> 
> vcpu->dirty_ring->data = &vcpu->run->vcpu_ring_indexes;
> kvm->vm_dirty_ring->data = &kvm->vm_run->vm_ring_indexes;
> 
> 2) push the ring choice and locking to two new functions
> 
> struct kvm_ring *kvm_get_dirty_ring(struct kvm *kvm)
> {
> 	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> 
> 	if (vcpu && !WARN_ON_ONCE(vcpu->kvm != kvm)) {
> 		return &vcpu->dirty_ring;
> 	} else {
> 		/*
> 		 * Put onto per vm ring because no vcpu context.
> 		 * We'll kick vcpu0 if ring is full.
> 		 */
> 		spin_lock(&kvm->vm_dirty_ring->lock);
> 		return &kvm->vm_dirty_ring;
> 	}
> }
> 
> void kvm_put_dirty_ring(struct kvm *kvm,
> 			struct kvm_dirty_ring *ring)
> {
> 	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> 	bool full = kvm_dirty_ring_used(ring) >= ring->soft_limit;
> 
> 	if (ring == &kvm->vm_dirty_ring) {
> 		if (vcpu == NULL)
> 			vcpu = kvm->vcpus[0];
> 		spin_unlock(&kvm->vm_dirty_ring->lock);
> 	}
> 
> 	if (full)
> 		kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
> }
> 
> 3) simplify kvm_dirty_ring_push to
> 
> void kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> 			 u32 slot, u64 offset)
> {
> 	/* left as an exercise to the reader */
> }
> 
> and mark_page_dirty_in_ring to
> 
> static void mark_page_dirty_in_ring(struct kvm *kvm,
> 				    struct kvm_memory_slot *slot,
> 				    gfn_t gfn)
> {
> 	struct kvm_dirty_ring *ring;
> 
> 	if (!kvm->dirty_ring_size)
> 		return;
> 
> 	ring = kvm_get_dirty_ring(kvm);
> 	kvm_dirty_ring_push(ring, (slot->as_id << 16) | slot->id,
> 			    gfn - slot->base_gfn);
> 	kvm_put_dirty_ring(kvm, ring);
> }

I think I got the major point here.  Unless Sean has a better idea
in the future, I'll go with this.
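
For concreteness, the push that is "left as an exercise" above could
end up as small as something like this (only a sketch, assuming the
ring->data pointer added in step 1 and the existing dirty_index/size
fields; the bail-out on a completely full ring is my assumption):

void kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
			 u32 slot, u64 offset)
{
	struct kvm_dirty_gfn *entry;

	if (WARN_ON_ONCE(kvm_dirty_ring_full(ring)))
		return;

	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
	entry->slot = slot;
	entry->offset = offset;
	/* make the entry visible before publishing the new index */
	smp_wmb();
	ring->dirty_index++;
	WRITE_ONCE(ring->data->avail_index, ring->dirty_index);
}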

Only recently did I notice that kvm_get_running_vcpu() has a real
benefit in that it gives a very reliable answer on whether we are in
vcpu context, even more accurate than passing vcpu pointers around
(because sometimes we only pass the kvm pointer down the stack even
when we are in vcpu context, just like what we did with
mark_page_dirty_in_slot).  I'm thinking of using this information in
the next post to solve an issue I encountered with the waitqueue.

The current waitqueue is still problematic in that it could wait
with the mmu lock held when we are in vcpu context.

The issue is that KVM_RESET_DIRTY_RINGS needs the mmu lock to
manipulate the write bits, while it is also the only interface that
wakes up the dirty ring sleepers.  They could deadlock like this:

      main thread                            vcpu thread
      ===========                            ===========
                                             kvm page fault
                                               mark_page_dirty_in_slot
                                               mmu lock taken
                                               mark dirty, ring full
                                               queue on waitqueue
                                               (with mmu lock)
      KVM_RESET_DIRTY_RINGS
        take mmu lock               <------------ deadlock here
        reset ring gfns
        wakeup dirty ring sleepers

And when mark_page_dirty_in_slot() is called not from a vcpu context
path (such as kvm_mmu_page_fault) but from an ioctl context (in those
cases we use the per-vm dirty ring), it's probably fine.

My planned solution:

- When kvm_get_running_vcpu() != NULL, postpone the waitqueue wait
  until we have finished handling this page fault, probably somewhere
  around vcpu_enter_guest, so that we can do wait_event() after the
  mmu lock has been released (see the rough sketch below)

- For a per-vm ring full event, I'll keep what we do now (wait_event()
  directly in mark_page_dirty_in_ring), assuming that path should not
  be reached with the mmu lock held
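
A very rough sketch of the first point (the field and waitqueue names
are hypothetical, just to show where the wait would move to):

	/* mark_page_dirty_in_ring(), possibly with the mmu lock held */
	if (ret == -EBUSY && kvm_get_running_vcpu())
		vcpu->dirty_ring_full_waiting = true;   /* defer, don't sleep here */

	/* vcpu_enter_guest()/vcpu_run(), after all locks are dropped */
	if (vcpu->dirty_ring_full_waiting) {
		vcpu->dirty_ring_full_waiting = false;
		wait_event_killable(vcpu->kvm->dirty_ring_waitq,
				    !kvm_dirty_ring_full(&vcpu->dirty_ring));
	}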

To achieve the above, I really need to know exactly whether we are
in vcpu context, and I suppose kvm_get_running_vcpu() would work for
that, rather than checking against a vcpu pointer passed in.

I also want to let KVM_RUN return immediately whenever either the
per-vm ring or the per-vcpu ring reaches the soft limit, instead of
continuing execution until the next dirty-ring-full event.

I'd be glad to receive any early comment before I move on to these.

Thanks!
Paolo Bonzini Dec. 10, 2019, 10:07 a.m. UTC | #20
On 09/12/19 22:54, Peter Xu wrote:
> Just until recently I noticed that actually kvm_get_running_vcpu() has
> a real benefit in that it gives a very solid result on whether we're
> with the vcpu context, even more accurate than when we pass vcpu
> pointers around (because sometimes we just passed the kvm pointer
> along the stack even if we're with a vcpu context, just like what we
> did with mark_page_dirty_in_slot).

Right, that's the point.

> I'm thinking whether I can start
> to use this information in the next post on solving an issue I
> encountered with the waitqueue.
> 
> Current waitqueue is still problematic in that it could wait even with
> the mmu lock held when with vcpu context.

I think the idea of the soft limit is that the waiting just cannot
happen.  That is, the number of dirtied pages _outside_ the guest (guest
accesses are taken care of by PML, and are subtracted from the soft
limit) cannot exceed hard_limit - (soft_limit + pml_size).

> The issue is KVM_RESET_DIRTY_RINGS needs the mmu lock to manipulate
> the write bits, while it's the only interface to also wake up the
> dirty ring sleepers.  They could dead lock like this:
> 
>       main thread                            vcpu thread
>       ===========                            ===========
>                                              kvm page fault
>                                                mark_page_dirty_in_slot
>                                                mmu lock taken
>                                                mark dirty, ring full
>                                                queue on waitqueue
>                                                (with mmu lock)
>       KVM_RESET_DIRTY_RINGS
>         take mmu lock               <------------ deadlock here
>         reset ring gfns
>         wakeup dirty ring sleepers
> 
> And if we see if the mark_page_dirty_in_slot() is not with a vcpu
> context (e.g. kvm_mmu_page_fault) but with an ioctl context (those
> cases we'll use per-vm dirty ring) then it's probably fine.
> 
> My planned solution:
> 
> - When kvm_get_running_vcpu() != NULL, we postpone the waitqueue waits
>   until we finished handling this page fault, probably in somewhere
>   around vcpu_enter_guest, so that we can do wait_event() after the
>   mmu lock released

I think this can cause a race:

	vCPU 1			vCPU 2		host
	---------------------------------------------------------------
	mark page dirty
				write to page
						treat page as not dirty
	add page to ring

where vCPU 2 skips the clean-page slow path entirely.

Paolo
Michael S. Tsirkin Dec. 10, 2019, 1:25 p.m. UTC | #21
On Wed, Dec 04, 2019 at 12:04:53PM +0100, Paolo Bonzini wrote:
> On 04/12/19 11:38, Jason Wang wrote:
> >>
> >> +    entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> >> +    entry->slot = slot;
> >> +    entry->offset = offset;
> > 
> > 
> > Haven't gone through the whole series, sorry if it was a silly question
> > but I wonder things like this will suffer from similar issue on
> > virtually tagged archs as mentioned in [1].
> 
> There is no new infrastructure to track the dirty pages---it's just a
> different way to pass them to userspace.

Did you guys consider using one of the virtio ring formats?
Maybe reusing vhost code?

If you did and it's not a good fit, this is something good to mention
in the commit log.

I also wonder about performance numbers - any data here?


> > Is this better to allocate the ring from userspace and set to KVM
> > instead? Then we can use copy_to/from_user() friends (a little bit slow
> > on recent CPUs).
> 
> Yeah, I don't think that would be better than mmap.
> 
> Paolo
> 
> 
> > [1] https://lkml.org/lkml/2019/4/9/5
Paolo Bonzini Dec. 10, 2019, 1:31 p.m. UTC | #22
On 10/12/19 14:25, Michael S. Tsirkin wrote:
>> There is no new infrastructure to track the dirty pages---it's just a
>> different way to pass them to userspace.
> Did you guys consider using one of the virtio ring formats?
> Maybe reusing vhost code?

There are no used/available entries here, it's unidirectional
(kernel->user).

> If you did and it's not a good fit, this is something good to mention
> in the commit log.
> 
> I also wonder about performance numbers - any data here?

Yes some numbers would be useful.  Note however that the improvement is
asymptotical, O(#dirtied pages) vs O(#total pages) so it may differ
depending on the workload.

Paolo
Peter Xu Dec. 10, 2019, 3:52 p.m. UTC | #23
On Tue, Dec 10, 2019 at 11:07:31AM +0100, Paolo Bonzini wrote:
> > I'm thinking whether I can start
> > to use this information in the next post on solving an issue I
> > encountered with the waitqueue.
> > 
> > Current waitqueue is still problematic in that it could wait even with
> > the mmu lock held when with vcpu context.
> 
> I think the idea of the soft limit is that the waiting just cannot
> happen.  That is, the number of dirtied pages _outside_ the guest (guest
> accesses are taken care of by PML, and are subtracted from the soft
> limit) cannot exceed hard_limit - (soft_limit + pml_size).

So the question goes back to whether this is guaranteed somehow.  Or
do you prefer us to keep the WARN_ON_ONCE until it triggers and then
we can analyze it (which I doubt will happen..)?

One thing to mention is that for the with-vcpu cases we can probably
even stop KVM_RUN immediately as soon as either the per-vm or the
per-vcpu ring reaches the soft limit, so for the vcpu case it should
be easier to guarantee that.  What I want to know about is the rest
of the cases, like ioctls or even things not coming from userspace
(which I think I should read up on more later..).

If the answer is yes, I'd be more than glad to drop the waitqueue.

> 
> > The issue is KVM_RESET_DIRTY_RINGS needs the mmu lock to manipulate
> > the write bits, while it's the only interface to also wake up the
> > dirty ring sleepers.  They could dead lock like this:
> > 
> >       main thread                            vcpu thread
> >       ===========                            ===========
> >                                              kvm page fault
> >                                                mark_page_dirty_in_slot
> >                                                mmu lock taken
> >                                                mark dirty, ring full
> >                                                queue on waitqueue
> >                                                (with mmu lock)
> >       KVM_RESET_DIRTY_RINGS
> >         take mmu lock               <------------ deadlock here
> >         reset ring gfns
> >         wakeup dirty ring sleepers
> > 
> > And if we see if the mark_page_dirty_in_slot() is not with a vcpu
> > context (e.g. kvm_mmu_page_fault) but with an ioctl context (those
> > cases we'll use per-vm dirty ring) then it's probably fine.
> > 
> > My planned solution:
> > 
> > - When kvm_get_running_vcpu() != NULL, we postpone the waitqueue waits
> >   until we finished handling this page fault, probably in somewhere
> >   around vcpu_enter_guest, so that we can do wait_event() after the
> >   mmu lock released
> 
> I think this can cause a race:
> 
> 	vCPU 1			vCPU 2		host
> 	---------------------------------------------------------------
> 	mark page dirty
> 				write to page
> 						treat page as not dirty
> 	add page to ring
> 
> where vCPU 2 skips the clean-page slow path entirely.

If we still follow the rule in userspace that we first do RESET and
then collect and send the pages (just like what we've discussed
before), then IMHO it's fine to have vcpu2 skip the slow path?
Because RESET happens at "treat page as not dirty", so if we are sure
that we only collect and send pages after that point, the latest
"write to page" data from vcpu2 won't be lost even if vcpu2 is not
blocked by vcpu1's full ring?
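
The userspace loop I have in mind is roughly the following pseudo-C
sketch (record_dirty_gfn() and send_recorded_pages() are hypothetical
helpers; the load of avail_index needs acquire semantics as in the
documentation's load-acquire() pseudocode):

	/* 1. harvest the GFNs, no page contents touched yet */
	avail = load_acquire(&indexes->avail_index);
	for (i = indexes->fetch_index; i != avail; i++)
		record_dirty_gfn(&dirty_gfns[i & (size - 1)]);
	indexes->fetch_index = avail;

	/* 2. RESET, i.e. the "treat page as not dirty" point */
	ioctl(vm_fd, KVM_RESET_DIRTY_RINGS);

	/* 3. only now read and send the page contents */
	send_recorded_pages();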

Maybe we can also consider letting mark_page_dirty_in_slot() return
a value, so the upper layer has a chance to skip the spte update if
mark_page_dirty_in_slot() fails to mark the dirty bit, and can return
directly with RET_PF_RETRY.
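
Roughly (just a sketch; in reality the marking happens deeper in the
mmu code, and mark_page_dirty_in_slot() returning a value is only
hypothetical here):

	if (!mark_page_dirty_in_slot(kvm, vcpu, memslot, gfn))
		return RET_PF_RETRY;    /* ring full, retry after it drains */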

Thanks,
Peter Xu Dec. 10, 2019, 4:02 p.m. UTC | #24
On Tue, Dec 10, 2019 at 02:31:54PM +0100, Paolo Bonzini wrote:
> On 10/12/19 14:25, Michael S. Tsirkin wrote:
> >> There is no new infrastructure to track the dirty pages---it's just a
> >> different way to pass them to userspace.
> > Did you guys consider using one of the virtio ring formats?
> > Maybe reusing vhost code?
> 
> There are no used/available entries here, it's unidirectional
> (kernel->user).

Agreed.  Vring could be an overkill IMHO (the whole dirty_ring.c is
100+ LOC only).

> 
> > If you did and it's not a good fit, this is something good to mention
> > in the commit log.
> > 
> > I also wonder about performance numbers - any data here?
> 
> Yes some numbers would be useful.  Note however that the improvement is
> asymptotical, O(#dirtied pages) vs O(#total pages) so it may differ
> depending on the workload.

Yes.  I plan to give some numbers when I start to work on the QEMU
series (after this lands).  However, as Paolo said, those numbers
would probably only cover some special cases where I know the dirty
ring could win.  Frankly speaking, I don't even know whether we
should change the default logging mode once the QEMU work is done - I
feel like the old logging interface is still good in many major cases
(small VMs, or high dirty rates).  It could be that we just offer
another option that users can consider for solving specific problems.

Thanks,
Paolo Bonzini Dec. 10, 2019, 5:09 p.m. UTC | #25
On 10/12/19 16:52, Peter Xu wrote:
> On Tue, Dec 10, 2019 at 11:07:31AM +0100, Paolo Bonzini wrote:
>>> I'm thinking whether I can start
>>> to use this information in the next post on solving an issue I
>>> encountered with the waitqueue.
>>>
>>> Current waitqueue is still problematic in that it could wait even with
>>> the mmu lock held when with vcpu context.
>>
>> I think the idea of the soft limit is that the waiting just cannot
>> happen.  That is, the number of dirtied pages _outside_ the guest (guest
>> accesses are taken care of by PML, and are subtracted from the soft
>> limit) cannot exceed hard_limit - (soft_limit + pml_size).
> 
> So the question go backs to, whether this is guaranteed somehow?  Or
> do you prefer us to keep the warn_on_once until it triggers then we
> can analyze (which I doubt..)?

Yes, I would like to keep the WARN_ON_ONCE just because you never know.

Of course it would be much better to audit the calls to kvm_write_guest
and figure out how many could trigger (e.g. two from the operands of an
emulated instruction, 5 from a nested EPT walk, 1 from a page walk, etc.).

> One thing to mention is that for with-vcpu cases, we probably can even
> stop KVM_RUN immediately as long as either the per-vm or per-vcpu ring
> reaches the softlimit, then for vcpu case it should be easier to
> guarantee that.  What I want to know is the rest of cases like ioctls
> or even something not from the userspace (which I think I should read
> more later..).

Which ioctls?  Most ioctls shouldn't dirty memory at all.

>>> And if we see if the mark_page_dirty_in_slot() is not with a vcpu
>>> context (e.g. kvm_mmu_page_fault) but with an ioctl context (those
>>> cases we'll use per-vm dirty ring) then it's probably fine.
>>>
>>> My planned solution:
>>>
>>> - When kvm_get_running_vcpu() != NULL, we postpone the waitqueue waits
>>>   until we finished handling this page fault, probably in somewhere
>>>   around vcpu_enter_guest, so that we can do wait_event() after the
>>>   mmu lock released
>>
>> I think this can cause a race:
>>
>> 	vCPU 1			vCPU 2		host
>> 	---------------------------------------------------------------
>> 	mark page dirty
>> 				write to page
>> 						treat page as not dirty
>> 	add page to ring
>>
>> where vCPU 2 skips the clean-page slow path entirely.
> 
> If we're still with the rule in userspace that we first do RESET then
> collect and send the pages (just like what we've discussed before),
> then IMHO it's fine to have vcpu2 to skip the slow path?  Because
> RESET happens at "treat page as not dirty", then if we are sure that
> we only collect and send pages after that point, then the latest
> "write to page" data from vcpu2 won't be lost even if vcpu2 is not
> blocked by vcpu1's ring full?

Good point, the race would become

 	vCPU 1			vCPU 2		host
 	---------------------------------------------------------------
 	mark page dirty
 				write to page
						reset rings
						  wait for mmu lock
 	add page to ring
	release mmu lock
						  ...do reset...
						  release mmu lock
						page is now dirty

> Maybe we can also consider to let mark_page_dirty_in_slot() return a
> value, then the upper layer could have a chance to skip the spte
> update if mark_page_dirty_in_slot() fails to mark the dirty bit, so it
> can return directly with RET_PF_RETRY.

I don't think that's possible, most writes won't come from a page fault
path and cannot retry.

Paolo
Michael S. Tsirkin Dec. 10, 2019, 9:48 p.m. UTC | #26
On Tue, Dec 10, 2019 at 02:31:54PM +0100, Paolo Bonzini wrote:
> On 10/12/19 14:25, Michael S. Tsirkin wrote:
> >> There is no new infrastructure to track the dirty pages---it's just a
> >> different way to pass them to userspace.
> > Did you guys consider using one of the virtio ring formats?
> > Maybe reusing vhost code?
> 
> There are no used/available entries here, it's unidirectional
> (kernel->user).

Didn't look at the design yet, but flow control (to prevent overflow)
goes the other way, doesn't it?  That's what used is, essentially.

> > If you did and it's not a good fit, this is something good to mention
> > in the commit log.
> > 
> > I also wonder about performance numbers - any data here?
> 
> Yes some numbers would be useful.  Note however that the improvement is
> asymptotical, O(#dirtied pages) vs O(#total pages) so it may differ
> depending on the workload.
> 
> Paolo
Michael S. Tsirkin Dec. 10, 2019, 9:53 p.m. UTC | #27
On Tue, Dec 10, 2019 at 11:02:11AM -0500, Peter Xu wrote:
> On Tue, Dec 10, 2019 at 02:31:54PM +0100, Paolo Bonzini wrote:
> > On 10/12/19 14:25, Michael S. Tsirkin wrote:
> > >> There is no new infrastructure to track the dirty pages---it's just a
> > >> different way to pass them to userspace.
> > > Did you guys consider using one of the virtio ring formats?
> > > Maybe reusing vhost code?
> > 
> > There are no used/available entries here, it's unidirectional
> > (kernel->user).
> 
> Agreed.  Vring could be an overkill IMHO (the whole dirty_ring.c is
> 100+ LOC only).


I guess you don't do polling/event suppression and the other tricks
that virtio came up with for speed then?  Why won't they be helpful
for kvm?  To put it another way, LOC is irrelevant; virtio is already
in the kernel.

Anyway, this is something to be discussed in the cover letter.

> > 
> > > If you did and it's not a good fit, this is something good to mention
> > > in the commit log.
> > > 
> > > I also wonder about performance numbers - any data here?
> > 
> > Yes some numbers would be useful.  Note however that the improvement is
> > asymptotical, O(#dirtied pages) vs O(#total pages) so it may differ
> > depending on the workload.
> 
> Yes.  I plan to give some numbers when start to work on the QEMU
> series (after this lands).  However as Paolo said, those numbers would
> probably only be with some special case where I know the dirty ring
> could win.  Frankly speaking I don't even know whether we should
> change the default logging mode when the QEMU work is done - I feel
> like the old logging interface is still good in many major cases
> (small vms, or high dirty rates).  It could be that we just offer
> another option when the user could consider to solve specific problems.
> 
> Thanks,
> 
> -- 
> Peter Xu
Paolo Bonzini Dec. 11, 2019, 9:05 a.m. UTC | #28
On 10/12/19 22:53, Michael S. Tsirkin wrote:
> On Tue, Dec 10, 2019 at 11:02:11AM -0500, Peter Xu wrote:
>> On Tue, Dec 10, 2019 at 02:31:54PM +0100, Paolo Bonzini wrote:
>>> On 10/12/19 14:25, Michael S. Tsirkin wrote:
>>>>> There is no new infrastructure to track the dirty pages---it's just a
>>>>> different way to pass them to userspace.
>>>> Did you guys consider using one of the virtio ring formats?
>>>> Maybe reusing vhost code?
>>>
>>> There are no used/available entries here, it's unidirectional
>>> (kernel->user).
>>
>> Agreed.  Vring could be an overkill IMHO (the whole dirty_ring.c is
>> 100+ LOC only).
> 
> I guess you don't do polling/ event suppression and other tricks that
> virtio came up with for speed then?

There are no interrupts either, so no need for event suppression.  You
have vmexits when the ring gets full (and that needs to be synchronous),
but apart from that the migration thread will poll the rings once when
it needs to send more pages.

Paolo
Michael S. Tsirkin Dec. 11, 2019, 12:53 p.m. UTC | #29
On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> This patch is heavily based on previous work from Lei Cao
> <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
> 
> KVM currently uses large bitmaps to track dirty memory.  These bitmaps
> are copied to userspace when userspace queries KVM for its dirty page
> information.  The use of bitmaps is mostly sufficient for live
> migration, as large parts of memory are be dirtied from one log-dirty
> pass to another.  However, in a checkpointing system, the number of
> dirty pages is small and in fact it is often bounded---the VM is
> paused when it has dirtied a pre-defined number of pages. Traversing a
> large, sparsely populated bitmap to find set bits is time-consuming,
> as is copying the bitmap to user-space.
> 
> A similar issue will be there for live migration when the guest memory
> is huge while the page dirty procedure is trivial.  In that case for
> each dirty sync we need to pull the whole dirty bitmap to userspace
> and analyse every bit even if it's mostly zeros.
> 
> The preferred data structure for above scenarios is a dense list of
> guest frame numbers (GFN).  This patch series stores the dirty list in
> kernel memory that can be memory mapped into userspace to allow speedy
> harvesting.
> 
> We defined two new data structures:
> 
>   struct kvm_dirty_ring;
>   struct kvm_dirty_ring_indexes;
> 
> Firstly, kvm_dirty_ring is defined to represent a ring of dirty
> pages.  When dirty tracking is enabled, we can push dirty gfn onto the
> ring.
> 
> Secondly, kvm_dirty_ring_indexes is defined to represent the
> user/kernel interface of each ring.  Currently it contains two
> indexes: (1) avail_index represents where we should push our next
> PFN (written by kernel), while (2) fetch_index represents where the
> userspace should fetch the next dirty PFN (written by userspace).
> 
> One complete ring is composed by one kvm_dirty_ring plus its
> corresponding kvm_dirty_ring_indexes.
> 
> Currently, we have N+1 rings for each VM of N vcpus:
> 
>   - for each vcpu, we have 1 per-vcpu dirty ring,
>   - for each vm, we have 1 per-vm dirty ring
> 
> Please refer to the documentation update in this patch for more
> details.
> 
> Note that this patch implements the core logic of dirty ring buffer.
> It's still disabled for all archs for now.  Also, we'll address some
> of the other issues in follow up patches before it's firstly enabled
> on x86.
> 
> [1] https://patchwork.kernel.org/patch/10471409/
> 
> Signed-off-by: Lei Cao <lei.cao@stratus.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>


Thanks, that's interesting.

> ---
>  Documentation/virt/kvm/api.txt | 109 +++++++++++++++
>  arch/x86/kvm/Makefile          |   3 +-
>  include/linux/kvm_dirty_ring.h |  67 +++++++++
>  include/linux/kvm_host.h       |  33 +++++
>  include/linux/kvm_types.h      |   1 +
>  include/uapi/linux/kvm.h       |  36 +++++
>  virt/kvm/dirty_ring.c          | 156 +++++++++++++++++++++
>  virt/kvm/kvm_main.c            | 240 ++++++++++++++++++++++++++++++++-
>  8 files changed, 642 insertions(+), 3 deletions(-)
>  create mode 100644 include/linux/kvm_dirty_ring.h
>  create mode 100644 virt/kvm/dirty_ring.c
> 
> diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> index 49183add44e7..fa622c9a2eb8 100644
> --- a/Documentation/virt/kvm/api.txt
> +++ b/Documentation/virt/kvm/api.txt
> @@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
>  It is thus encouraged to use the vm ioctl to query for capabilities (available
>  with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
>  
> +
>  4.5 KVM_GET_VCPU_MMAP_SIZE
>  
>  Capability: basic
> @@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
>  memory region.  This ioctl returns the size of that region.  See the
>  KVM_RUN documentation for details.
>  
> +Besides the size of the KVM_RUN communication region, other areas of
> +the VCPU file descriptor can be mmap-ed, including:
> +
> +- if KVM_CAP_COALESCED_MMIO is available, a page at
> +  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
> +  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
> +  KVM_CAP_COALESCED_MMIO is not documented yet.
> +
> +- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> +  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
> +  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
> +
>  
>  4.6 KVM_SET_MEMORY_REGION
>  

PAGE_SIZE being which value?  It's not always trivial for
userspace to know what the kernel's PAGE_SIZE is ...
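
I assume what userspace is expected to do is something like this
sketch, where the page size reported by sysconf() is assumed to match
the kernel's PAGE_SIZE (which is exactly the assumption I'm worried
about), and ring_bytes is whatever ring size was negotiated:

	long psz = sysconf(_SC_PAGESIZE);  /* assumed == kernel PAGE_SIZE */
	void *gfns = mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE,
			  MAP_SHARED, vcpu_fd, KVM_DIRTY_LOG_PAGE_OFFSET * psz);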


> @@ -5358,6 +5371,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
>  AArch64, this value will be reported in the ISS field of ESR_ELx.
>  
>  See KVM_CAP_VCPU_EVENTS for more details.
> +
>  8.20 KVM_CAP_HYPERV_SEND_IPI
>  
>  Architectures: x86
> @@ -5365,6 +5379,7 @@ Architectures: x86
>  This capability indicates that KVM supports paravirtualized Hyper-V IPI send
>  hypercalls:
>  HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> +
>  8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
>  
>  Architecture: x86
> @@ -5378,3 +5393,97 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
>  flush hypercalls by Hyper-V) so userspace should disable KVM identification
>  in CPUID and only exposes Hyper-V identification. In this case, guest
>  thinks it's running on Hyper-V and only use Hyper-V hypercalls.
> +
> +8.22 KVM_CAP_DIRTY_LOG_RING
> +
> +Architectures: x86
> +Parameters: args[0] - size of the dirty log ring
> +
> +KVM is capable of tracking dirty memory using ring buffers that are
> +mmaped into userspace; there is one dirty ring per vcpu and one global
> +ring per vm.
> +
> +One dirty ring has the following two major structures:
> +
> +struct kvm_dirty_ring {
> +	u16 dirty_index;
> +	u16 reset_index;
> +	u32 size;
> +	u32 soft_limit;
> +	spinlock_t lock;
> +	struct kvm_dirty_gfn *dirty_gfns;
> +};
> +
> +struct kvm_dirty_ring_indexes {
> +	__u32 avail_index; /* set by kernel */
> +	__u32 fetch_index; /* set by userspace */

Sticking these next to each other seems to guarantee cache conflicts.

Avail/Fetch seems to mimic Virtio's avail/used exactly.  I am not saying
you must reuse the code really, but I think you should take a hard look
at e.g. the virtio packed ring structure. We spent a bunch of time
optimizing it for cache utilization.  It seems the kernel is the
driver, making entries available, and userspace the device, using them.
Again let's not develop a thread about this, but I think
this is something to consider and discuss in future versions
of the patches.


> +};
> +
> +While for each of the dirty entry it's defined as:
> +
> +struct kvm_dirty_gfn {

What does GFN stand for?

> +        __u32 pad;
> +        __u32 slot; /* as_id | slot_id */
> +        __u64 offset;
> +};

offset of what? a 4K page right? Seems like a waste e.g. for
hugetlbfs... How about replacing pad with size instead?

> +
> +The fields in kvm_dirty_ring will be only internal to KVM itself,
> +while the fields in kvm_dirty_ring_indexes will be exposed to
> +userspace to be either read or written.

I'm not sure what you are trying to say here. kvm_dirty_gfn
seems to be part of UAPI.

> +
> +The two indices in the ring buffer are free running counters.
> +
> +In pseudocode, processing the ring buffer looks like this:
> +
> +	idx = load-acquire(&ring->fetch_index);
> +	while (idx != ring->avail_index) {
> +		struct kvm_dirty_gfn *entry;
> +		entry = &ring->dirty_gfns[idx & (size - 1)];
> +		...
> +
> +		idx++;
> +	}
> +	ring->fetch_index = idx;
> +
> +Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
> +to enable this capability for the new guest and set the size of the
> +rings.  It is only allowed before creating any vCPU, and the size of
> +the ring must be a power of two.

All these seem like arbitrary limitations to me.

Sizing the ring correctly might prove to be a challenge.

Thus I think there's value in being able to resize the rings
without destroying the vCPUs.

Also, power of two just saves a branch here and there,
but wastes lots of memory. Just wrap the index around to
0 and then users can select any size?
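
I.e. something along these lines (sketch only):

	/* power-of-two size: free running counter plus mask */
	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];

	/* arbitrary size: keep a wrapped index, one extra branch */
	entry = &ring->dirty_gfns[idx];
	if (++idx == ring->size)
		idx = 0;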



>  The larger the ring buffer, the less
> +likely the ring is full and the VM is forced to exit to userspace. The
> +optimal size depends on the workload, but it is recommended that it be
> +at least 64 KiB (4096 entries).

OTOH larger buffers put lots of pressure on the system cache.


> +
> +After the capability is enabled, userspace can mmap the global ring
> +buffer (kvm_dirty_gfn[], offset KVM_DIRTY_LOG_PAGE_OFFSET) and the
> +indexes (kvm_dirty_ring_indexes, offset 0) from the VM file
> +descriptor.  The per-vcpu dirty ring instead is mmapped when the vcpu
> +is created, similar to the kvm_run struct (kvm_dirty_ring_indexes
> +locates inside kvm_run, while kvm_dirty_gfn[] at offset
> +KVM_DIRTY_LOG_PAGE_OFFSET).
> +
> +Just like for dirty page bitmaps, the buffer tracks writes to
> +all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
> +set in KVM_SET_USER_MEMORY_REGION.  Once a memory region is registered
> +with the flag set, userspace can start harvesting dirty pages from the
> +ring buffer.
> +
> +To harvest the dirty pages, userspace accesses the mmaped ring buffer
> +to read the dirty GFNs up to avail_index, and sets the fetch_index
> +accordingly.  This can be done when the guest is running or paused,
> +and dirty pages need not be collected all at once.  After processing
> +one or more entries in the ring buffer, userspace calls the VM ioctl
> +KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
> +fetch_index and to mark those pages clean.  Therefore, the ioctl
> +must be called *before* reading the content of the dirty pages.
> +
> +However, there is a major difference comparing to the
> +KVM_GET_DIRTY_LOG interface in that when reading the dirty ring from
> +userspace it's still possible that the kernel has not yet flushed the
> +hardware dirty buffers into the kernel buffer.  To achieve that, one
> +needs to kick the vcpu out for a hardware buffer flush (vmexit).
> +
> +If one of the ring buffers is full, the guest will exit to userspace
> +with the exit reason set to KVM_EXIT_DIRTY_LOG_FULL, and the
> +KVM_RUN ioctl will return -EINTR. Once that happens, userspace
> +should pause all the vcpus, then harvest all the dirty pages and
> +rearm the dirty traps. It can unpause the guest after that.

This last item means that the performance impact of the feature is
really hard to predict: it can improve some workloads drastically,
or it can slow some down.


One solution could be to actually allow using this together with the
existing bitmap.  Userspace can then decide whether it wants to block
the vCPU on ring full, or just record the ring-full condition and
recover by scanning the bitmap.


> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index b19ef421084d..0acee817adfb 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -5,7 +5,8 @@ ccflags-y += -Iarch/x86/kvm
>  KVM := ../../../virt/kvm
>  
>  kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> -				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> +				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
> +				$(KVM)/dirty_ring.o
>  kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
>  
>  kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
> diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> new file mode 100644
> index 000000000000..8335635b7ff7
> --- /dev/null
> +++ b/include/linux/kvm_dirty_ring.h
> @@ -0,0 +1,67 @@
> +#ifndef KVM_DIRTY_RING_H
> +#define KVM_DIRTY_RING_H
> +
> +/*
> + * struct kvm_dirty_ring is defined in include/uapi/linux/kvm.h.
> + *
> + * dirty_ring:  shared with userspace via mmap. It is the compact list
> + *              that holds the dirty pages.
> + * dirty_index: free running counter that points to the next slot in
> + *              dirty_ring->dirty_gfns  where a new dirty page should go.
> + * reset_index: free running counter that points to the next dirty page
> + *              in dirty_ring->dirty_gfns for which dirty trap needs to
> + *              be reenabled
> + * size:        size of the compact list, dirty_ring->dirty_gfns
> + * soft_limit:  when the number of dirty pages in the list reaches this
> + *              limit, vcpu that owns this ring should exit to userspace
> + *              to allow userspace to harvest all the dirty pages
> + * lock:        protects dirty_ring, only in use if this is the global
> + *              ring
> + *
> + * The number of dirty pages in the ring is calculated by,
> + * dirty_index - reset_index
> + *
> + * kernel increments dirty_ring->indices.avail_index after dirty index
> + * is incremented. When userspace harvests the dirty pages, it increments
> + * dirty_ring->indices.fetch_index up to dirty_ring->indices.avail_index.
> + * When kernel reenables dirty traps for the dirty pages, it increments
> + * reset_index up to dirty_ring->indices.fetch_index.
> + *
> + */
> +struct kvm_dirty_ring {
> +	u32 dirty_index;
> +	u32 reset_index;
> +	u32 size;
> +	u32 soft_limit;
> +	spinlock_t lock;
> +	struct kvm_dirty_gfn *dirty_gfns;
> +};
> +
> +u32 kvm_dirty_ring_get_rsvd_entries(void);
> +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring);
> +
> +/*
> + * called with kvm->slots_lock held, returns the number of
> + * processed pages.
> + */
> +int kvm_dirty_ring_reset(struct kvm *kvm,
> +			 struct kvm_dirty_ring *ring,
> +			 struct kvm_dirty_ring_indexes *indexes);
> +
> +/*
> + * returns 0: successfully pushed
> + *         1: successfully pushed, soft limit reached,
> + *            vcpu should exit to userspace
> + *         -EBUSY: unable to push, dirty ring full.
> + */
> +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> +			struct kvm_dirty_ring_indexes *indexes,
> +			u32 slot, u64 offset, bool lock);
> +
> +/* for use in vm_operations_struct */
> +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i);
> +
> +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
> +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring);
> +
> +#endif
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 498a39462ac1..7b747bc9ff3e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -34,6 +34,7 @@
>  #include <linux/kvm_types.h>
>  
>  #include <asm/kvm_host.h>
> +#include <linux/kvm_dirty_ring.h>
>  
>  #ifndef KVM_MAX_VCPU_ID
>  #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
> @@ -146,6 +147,7 @@ static inline bool is_error_page(struct page *page)
>  #define KVM_REQ_MMU_RELOAD        (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_PENDING_TIMER     2
>  #define KVM_REQ_UNHALT            3
> +#define KVM_REQ_DIRTY_RING_FULL   4
>  #define KVM_REQUEST_ARCH_BASE     8
>  
>  #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
> @@ -321,6 +323,7 @@ struct kvm_vcpu {
>  	bool ready;
>  	struct kvm_vcpu_arch arch;
>  	struct dentry *debugfs_dentry;
> +	struct kvm_dirty_ring dirty_ring;
>  };
>  
>  static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> @@ -501,6 +504,10 @@ struct kvm {
>  	struct srcu_struct srcu;
>  	struct srcu_struct irq_srcu;
>  	pid_t userspace_pid;
> +	/* Data structure to be exported by mmap(kvm->fd, 0) */
> +	struct kvm_vm_run *vm_run;
> +	u32 dirty_ring_size;
> +	struct kvm_dirty_ring vm_dirty_ring;
>  };
>  
>  #define kvm_err(fmt, ...) \
> @@ -832,6 +839,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>  					gfn_t gfn_offset,
>  					unsigned long mask);
>  
> +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
> +
>  int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
>  				struct kvm_dirty_log *log);
>  int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> @@ -1411,4 +1420,28 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
>  				uintptr_t data, const char *name,
>  				struct task_struct **thread_ptr);
>  
> +/*
> + * This defines how many reserved entries we want to keep before we
> + * kick the vcpu to the userspace to avoid dirty ring full.  This
> + * value can be tuned to higher if e.g. PML is enabled on the host.
> + */
> +#define  KVM_DIRTY_RING_RSVD_ENTRIES  64
> +
> +/* Max number of entries allowed for each kvm dirty ring */
> +#define  KVM_DIRTY_RING_MAX_ENTRIES  65536
> +
> +/*
> + * Arch needs to define these macro after implementing the dirty ring
> + * feature.  KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
> + * starting page offset of the dirty ring structures,

Confused - offset where?  You set a default for everyone; where would
an arch want to override it?

> while
> + * KVM_DIRTY_RING_VERSION should be defined as >=1.  By default, this
> + * feature is off on all archs.
> + */
> +#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
> +#define KVM_DIRTY_LOG_PAGE_OFFSET 0
> +#endif
> +#ifndef KVM_DIRTY_RING_VERSION
> +#define KVM_DIRTY_RING_VERSION 0
> +#endif

One-way versioning, with no feature bits and no negotiation, will
make it hard to change things down the road.  What's wrong with the
existing KVM capabilities, that you feel there's a need for dedicated
versioning for this?

> +
>  #endif
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 1c88e69db3d9..d9d03eea145a 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -11,6 +11,7 @@ struct kvm_irq_routing_table;
>  struct kvm_memory_slot;
>  struct kvm_one_reg;
>  struct kvm_run;
> +struct kvm_vm_run;
>  struct kvm_userspace_memory_region;
>  struct kvm_vcpu;
>  struct kvm_vcpu_init;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index e6f17c8e2dba..0b88d76d6215 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
>  #define KVM_EXIT_IOAPIC_EOI       26
>  #define KVM_EXIT_HYPERV           27
>  #define KVM_EXIT_ARM_NISV         28
> +#define KVM_EXIT_DIRTY_RING_FULL  29
>  
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -247,6 +248,11 @@ struct kvm_hyperv_exit {
>  /* Encounter unexpected vm-exit reason */
>  #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON	4
>  
> +struct kvm_dirty_ring_indexes {
> +	__u32 avail_index; /* set by kernel */
> +	__u32 fetch_index; /* set by userspace */
> +};
> +
>  /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
>  struct kvm_run {
>  	/* in */
> @@ -421,6 +427,13 @@ struct kvm_run {
>  		struct kvm_sync_regs regs;
>  		char padding[SYNC_REGS_SIZE_BYTES];
>  	} s;
> +
> +	struct kvm_dirty_ring_indexes vcpu_ring_indexes;
> +};
> +
> +/* Returned by mmap(kvm->fd, offset=0) */
> +struct kvm_vm_run {
> +	struct kvm_dirty_ring_indexes vm_ring_indexes;
>  };
>  
>  /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> @@ -1009,6 +1022,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
>  #define KVM_CAP_ARM_NISV_TO_USER 177
>  #define KVM_CAP_ARM_INJECT_EXT_DABT 178
> +#define KVM_CAP_DIRTY_LOG_RING 179
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -1472,6 +1486,9 @@ struct kvm_enc_region {
>  /* Available with KVM_CAP_ARM_SVE */
>  #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
>  
> +/* Available with KVM_CAP_DIRTY_LOG_RING */
> +#define KVM_RESET_DIRTY_RINGS     _IO(KVMIO, 0xc3)
> +
>  /* Secure Encrypted Virtualization command */
>  enum sev_cmd_id {
>  	/* Guest initialization commands */
> @@ -1622,4 +1639,23 @@ struct kvm_hyperv_eventfd {
>  #define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
>  #define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
>  
> +/*
> + * The following are the requirements for supporting dirty log ring
> + * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
> + *
> + * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
> + *    of kvm_write_* so that the global dirty ring is not filled up
> + *    too quickly.
> + * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
> + *    enabling dirty logging.
> + * 3. There should not be a separate step to synchronize hardware
> + *    dirty bitmap with KVM's.
> + */
> +
> +struct kvm_dirty_gfn {
> +	__u32 pad;
> +	__u32 slot;
> +	__u64 offset;
> +};
> +
>  #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> new file mode 100644
> index 000000000000..9264891f3c32
> --- /dev/null
> +++ b/virt/kvm/dirty_ring.c
> @@ -0,0 +1,156 @@
> +#include <linux/kvm_host.h>
> +#include <linux/kvm.h>
> +#include <linux/vmalloc.h>
> +#include <linux/kvm_dirty_ring.h>
> +
> +u32 kvm_dirty_ring_get_rsvd_entries(void)
> +{
> +	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> +}
> +
> +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
> +{
> +	u32 size = kvm->dirty_ring_size;
> +
> +	ring->dirty_gfns = vmalloc(size);

So that's half a megabyte of kernel memory per VM that userspace ties
up.  Do we really have to, though?  Why not take a userspace pointer,
write to it with copy_to_user(), and sidestep all this?
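
E.g. (only a sketch of the alternative; passing the pointer through
cap->args[1] and the gfns_user field are made up for illustration):

	/* setup: remember a user pointer instead of vmalloc()ing */
	ring->gfns_user = (struct kvm_dirty_gfn __user *)cap->args[1];

	/* push: */
	struct kvm_dirty_gfn gfn = { .slot = slot, .offset = offset };

	if (copy_to_user(&ring->gfns_user[ring->dirty_index & (ring->size - 1)],
			 &gfn, sizeof(gfn)))
		return -EFAULT;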

> +	if (!ring->dirty_gfns)
> +		return -ENOMEM;
> +	memset(ring->dirty_gfns, 0, size);
> +
> +	ring->size = size / sizeof(struct kvm_dirty_gfn);
> +	ring->soft_limit =
> +	    (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -
> +	    kvm_dirty_ring_get_rsvd_entries();
> +	ring->dirty_index = 0;
> +	ring->reset_index = 0;
> +	spin_lock_init(&ring->lock);
> +
> +	return 0;
> +}
> +
> +int kvm_dirty_ring_reset(struct kvm *kvm,
> +			 struct kvm_dirty_ring *ring,
> +			 struct kvm_dirty_ring_indexes *indexes)
> +{
> +	u32 cur_slot, next_slot;
> +	u64 cur_offset, next_offset;
> +	unsigned long mask;
> +	u32 fetch;
> +	int count = 0;
> +	struct kvm_dirty_gfn *entry;
> +
> +	fetch = READ_ONCE(indexes->fetch_index);
> +	if (fetch == ring->reset_index)
> +		return 0;
> +
> +	entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> +	/*
> +	 * The ring buffer is shared with userspace, which might mmap
> +	 * it and concurrently modify slot and offset.  Userspace must
> +	 * not be trusted!  READ_ONCE prevents the compiler from changing
> +	 * the values after they've been range-checked (the checks are
> +	 * in kvm_reset_dirty_gfn).

What it doesn't do is prevent speculative attacks.  That's why things
like copy_from_user() have a speculation barrier.  Instead of worrying
about that, unless it's really critical, I think you'd do well to just
use copy_to_user()/copy_from_user().

> +	 */
> +	smp_read_barrier_depends();

What depends on what here? Looks suspicious ...

> +	cur_slot = READ_ONCE(entry->slot);
> +	cur_offset = READ_ONCE(entry->offset);
> +	mask = 1;
> +	count++;
> +	ring->reset_index++;
> +	while (ring->reset_index != fetch) {
> +		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> +		smp_read_barrier_depends();

same concerns here

> +		next_slot = READ_ONCE(entry->slot);
> +		next_offset = READ_ONCE(entry->offset);
> +		ring->reset_index++;
> +		count++;
> +		/*
> +		 * Try to coalesce the reset operations when the guest is
> +		 * scanning pages in the same slot.

what does guest scanning mean?

> +		 */
> +		if (next_slot == cur_slot) {
> +			int delta = next_offset - cur_offset;
> +
> +			if (delta >= 0 && delta < BITS_PER_LONG) {
> +				mask |= 1ull << delta;
> +				continue;
> +			}
> +
> +			/* Backwards visit, careful about overflows!  */
> +			if (delta > -BITS_PER_LONG && delta < 0 &&
> +			    (mask << -delta >> -delta) == mask) {
> +				cur_offset = next_offset;
> +				mask = (mask << -delta) | 1;
> +				continue;
> +			}
> +		}
> +		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> +		cur_slot = next_slot;
> +		cur_offset = next_offset;
> +		mask = 1;
> +	}
> +	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> +
> +	return count;
> +}
> +
> +static inline u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
> +{
> +	return ring->dirty_index - ring->reset_index;
> +}
> +
> +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
> +{
> +	return kvm_dirty_ring_used(ring) >= ring->size;
> +}
> +
> +/*
> + * Returns:
> + *   >0 if we should kick the vcpu out,
> + *   =0 if the gfn pushed successfully, or,
> + *   <0 if error (e.g. ring full)
> + */
> +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> +			struct kvm_dirty_ring_indexes *indexes,
> +			u32 slot, u64 offset, bool lock)
> +{
> +	int ret;
> +	struct kvm_dirty_gfn *entry;
> +
> +	if (lock)
> +		spin_lock(&ring->lock);

what's the story around locking here? Why is it safe
not to take the lock sometimes?

> +
> +	if (kvm_dirty_ring_full(ring)) {
> +		ret = -EBUSY;
> +		goto out;
> +	}
> +
> +	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> +	entry->slot = slot;
> +	entry->offset = offset;
> +	smp_wmb();
> +	ring->dirty_index++;
> +	WRITE_ONCE(indexes->avail_index, ring->dirty_index);
> +	ret = kvm_dirty_ring_used(ring) >= ring->soft_limit;
> +	pr_info("%s: slot %u offset %llu used %u\n",
> +		__func__, slot, offset, kvm_dirty_ring_used(ring));
> +
> +out:
> +	if (lock)
> +		spin_unlock(&ring->lock);
> +
> +	return ret;
> +}
> +
> +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i)
> +{
> +	return vmalloc_to_page((void *)ring->dirty_gfns + i * PAGE_SIZE);
> +}
> +
> +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
> +{
> +	if (ring->dirty_gfns) {
> +		vfree(ring->dirty_gfns);
> +		ring->dirty_gfns = NULL;
> +	}
> +}
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 681452d288cd..8642c977629b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -64,6 +64,8 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/kvm.h>
>  
> +#include <linux/kvm_dirty_ring.h>
> +
>  /* Worst case buffer size needed for holding an integer. */
>  #define ITOA_MAX_LEN 12
>  
> @@ -149,6 +151,10 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
>  				    struct kvm_vcpu *vcpu,
>  				    struct kvm_memory_slot *memslot,
>  				    gfn_t gfn);
> +static void mark_page_dirty_in_ring(struct kvm *kvm,
> +				    struct kvm_vcpu *vcpu,
> +				    struct kvm_memory_slot *slot,
> +				    gfn_t gfn);
>  
>  __visible bool kvm_rebooting;
>  EXPORT_SYMBOL_GPL(kvm_rebooting);
> @@ -359,11 +365,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>  	vcpu->preempted = false;
>  	vcpu->ready = false;
>  
> +	if (kvm->dirty_ring_size) {
> +		r = kvm_dirty_ring_alloc(vcpu->kvm, &vcpu->dirty_ring);
> +		if (r) {
> +			kvm->dirty_ring_size = 0;
> +			goto fail_free_run;
> +		}
> +	}
> +
>  	r = kvm_arch_vcpu_init(vcpu);
>  	if (r < 0)
> -		goto fail_free_run;
> +		goto fail_free_ring;
>  	return 0;
>  
> +fail_free_ring:
> +	if (kvm->dirty_ring_size)
> +		kvm_dirty_ring_free(&vcpu->dirty_ring);
>  fail_free_run:
>  	free_page((unsigned long)vcpu->run);
>  fail:
> @@ -381,6 +398,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
>  	put_pid(rcu_dereference_protected(vcpu->pid, 1));
>  	kvm_arch_vcpu_uninit(vcpu);
>  	free_page((unsigned long)vcpu->run);
> +	if (vcpu->kvm->dirty_ring_size)
> +		kvm_dirty_ring_free(&vcpu->dirty_ring);
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
>  
> @@ -690,6 +709,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  	struct kvm *kvm = kvm_arch_alloc_vm();
>  	int r = -ENOMEM;
>  	int i;
> +	struct page *page;
>  
>  	if (!kvm)
>  		return ERR_PTR(-ENOMEM);
> @@ -705,6 +725,14 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  
>  	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
>  
> +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	if (!page) {
> +		r = -ENOMEM;
> +		goto out_err_alloc_page;
> +	}
> +	kvm->vm_run = page_address(page);

So 4K with just 8 bytes used. Not as bad as 1/2Mbyte for the ring but
still. What is wrong with just a pointer and calling put_user?

> +	BUILD_BUG_ON(sizeof(struct kvm_vm_run) > PAGE_SIZE);
> +
>  	if (init_srcu_struct(&kvm->srcu))
>  		goto out_err_no_srcu;
>  	if (init_srcu_struct(&kvm->irq_srcu))
> @@ -775,6 +803,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  out_err_no_irq_srcu:
>  	cleanup_srcu_struct(&kvm->srcu);
>  out_err_no_srcu:
> +	free_page((unsigned long)page);
> +	kvm->vm_run = NULL;
> +out_err_alloc_page:
>  	kvm_arch_free_vm(kvm);
>  	mmdrop(current->mm);
>  	return ERR_PTR(r);
> @@ -800,6 +831,15 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  	int i;
>  	struct mm_struct *mm = kvm->mm;
>  
> +	if (kvm->dirty_ring_size) {
> +		kvm_dirty_ring_free(&kvm->vm_dirty_ring);
> +	}
> +
> +	if (kvm->vm_run) {
> +		free_page((unsigned long)kvm->vm_run);
> +		kvm->vm_run = NULL;
> +	}
> +
>  	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
>  	kvm_destroy_vm_debugfs(kvm);
>  	kvm_arch_sync_events(kvm);
> @@ -2301,7 +2341,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
>  {
>  	if (memslot && memslot->dirty_bitmap) {
>  		unsigned long rel_gfn = gfn - memslot->base_gfn;
> -
> +		mark_page_dirty_in_ring(kvm, vcpu, memslot, gfn);
>  		set_bit_le(rel_gfn, memslot->dirty_bitmap);
>  	}
>  }
> @@ -2649,6 +2689,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
>  
> +static bool kvm_fault_in_dirty_ring(struct kvm *kvm, struct vm_fault *vmf)
> +{
> +	return (vmf->pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
> +	    (vmf->pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
> +	     kvm->dirty_ring_size / PAGE_SIZE);
> +}
> +
>  static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
>  {
>  	struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
> @@ -2664,6 +2711,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
>  	else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
>  		page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
>  #endif
> +	else if (kvm_fault_in_dirty_ring(vcpu->kvm, vmf))
> +		page = kvm_dirty_ring_get_page(
> +		    &vcpu->dirty_ring,
> +		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
>  	else
>  		return kvm_arch_vcpu_fault(vcpu, vmf);
>  	get_page(page);
> @@ -3259,12 +3310,162 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>  #endif
>  	case KVM_CAP_NR_MEMSLOTS:
>  		return KVM_USER_MEM_SLOTS;
> +	case KVM_CAP_DIRTY_LOG_RING:
> +		/* Version will be zero if arch didn't implement it */
> +		return KVM_DIRTY_RING_VERSION;
>  	default:
>  		break;
>  	}
>  	return kvm_vm_ioctl_check_extension(kvm, arg);
>  }
>  
> +static void mark_page_dirty_in_ring(struct kvm *kvm,
> +				    struct kvm_vcpu *vcpu,
> +				    struct kvm_memory_slot *slot,
> +				    gfn_t gfn)
> +{
> +	u32 as_id = 0;
> +	u64 offset;
> +	int ret;
> +	struct kvm_dirty_ring *ring;
> +	struct kvm_dirty_ring_indexes *indexes;
> +	bool is_vm_ring;
> +
> +	if (!kvm->dirty_ring_size)
> +		return;
> +
> +	offset = gfn - slot->base_gfn;
> +
> +	if (vcpu) {
> +		as_id = kvm_arch_vcpu_memslots_id(vcpu);
> +	} else {
> +		as_id = 0;
> +		vcpu = kvm_get_running_vcpu();
> +	}
> +
> +	if (vcpu) {
> +		ring = &vcpu->dirty_ring;
> +		indexes = &vcpu->run->vcpu_ring_indexes;
> +		is_vm_ring = false;
> +	} else {
> +		/*
> +		 * Put onto per vm ring because no vcpu context.  Kick
> +		 * vcpu0 if ring is full.

What about tasks on vcpu 0?  Do guests realize it's a bad idea to put
critical tasks there, since they will be penalized disproportionately?

> +		 */
> +		vcpu = kvm->vcpus[0];
> +		ring = &kvm->vm_dirty_ring;
> +		indexes = &kvm->vm_run->vm_ring_indexes;
> +		is_vm_ring = true;
> +	}
> +
> +	ret = kvm_dirty_ring_push(ring, indexes,
> +				  (as_id << 16)|slot->id, offset,
> +				  is_vm_ring);
> +	if (ret < 0) {
> +		if (is_vm_ring)
> +			pr_warn_once("vcpu %d dirty log overflow\n",
> +				     vcpu->vcpu_id);
> +		else
> +			pr_warn_once("per-vm dirty log overflow\n");
> +		return;
> +	}
> +
> +	if (ret)
> +		kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
> +}
> +
> +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> +{
> +	struct kvm_memory_slot *memslot;
> +	int as_id, id;
> +
> +	as_id = slot >> 16;
> +	id = (u16)slot;
> +	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> +		return;
> +
> +	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
> +	if (offset >= memslot->npages)
> +		return;
> +
> +	spin_lock(&kvm->mmu_lock);
> +	/* FIXME: we should use a single AND operation, but there is no
> +	 * applicable atomic API.
> +	 */
> +	while (mask) {
> +		clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
> +		mask &= mask - 1;
> +	}
> +
> +	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> +	spin_unlock(&kvm->mmu_lock);
> +}
> +
> +static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
> +{
> +	int r;
> +
> +	/* the size should be power of 2 */
> +	if (!size || (size & (size - 1)))
> +		return -EINVAL;
> +
> +	/* Should be bigger to keep the reserved entries, or a page */
> +	if (size < kvm_dirty_ring_get_rsvd_entries() *
> +	    sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
> +		return -EINVAL;
> +
> +	if (size > KVM_DIRTY_RING_MAX_ENTRIES *
> +	    sizeof(struct kvm_dirty_gfn))
> +		return -E2BIG;

KVM_DIRTY_RING_MAX_ENTRIES is not part of the UAPI.
So how does userspace know what's legal?
Do you expect it to just try?
More likely it will just copy the number from the kernel, and then it
can never ever be made smaller.

> +
> +	/* We only allow it to set once */
> +	if (kvm->dirty_ring_size)
> +		return -EINVAL;
> +
> +	mutex_lock(&kvm->lock);
> +
> +	if (kvm->created_vcpus) {
> +		/* We don't allow to change this value after vcpu created */
> +		r = -EINVAL;
> +	} else {
> +		kvm->dirty_ring_size = size;
> +		r = kvm_dirty_ring_alloc(kvm, &kvm->vm_dirty_ring);
> +		if (r) {
> +			/* Unset dirty ring */
> +			kvm->dirty_ring_size = 0;
> +		}
> +	}
> +
> +	mutex_unlock(&kvm->lock);
> +	return r;
> +}
> +
> +static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
> +{
> +	int i;
> +	struct kvm_vcpu *vcpu;
> +	int cleared = 0;
> +
> +	if (!kvm->dirty_ring_size)
> +		return -EINVAL;
> +
> +	mutex_lock(&kvm->slots_lock);
> +
> +	cleared += kvm_dirty_ring_reset(kvm, &kvm->vm_dirty_ring,
> +					&kvm->vm_run->vm_ring_indexes);
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm)
> +		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring,
> +						&vcpu->run->vcpu_ring_indexes);
> +
> +	mutex_unlock(&kvm->slots_lock);
> +
> +	if (cleared)
> +		kvm_flush_remote_tlbs(kvm);
> +
> +	return cleared;
> +}
> +
>  int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  						  struct kvm_enable_cap *cap)
>  {
> @@ -3282,6 +3483,8 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  		kvm->manual_dirty_log_protect = cap->args[0];
>  		return 0;
>  #endif
> +	case KVM_CAP_DIRTY_LOG_RING:
> +		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
>  	default:
>  		return kvm_vm_ioctl_enable_cap(kvm, cap);
>  	}
> @@ -3469,6 +3672,9 @@ static long kvm_vm_ioctl(struct file *filp,
>  	case KVM_CHECK_EXTENSION:
>  		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
>  		break;
> +	case KVM_RESET_DIRTY_RINGS:
> +		r = kvm_vm_ioctl_reset_dirty_pages(kvm);
> +		break;
>  	default:
>  		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
>  	}
> @@ -3517,9 +3723,39 @@ static long kvm_vm_compat_ioctl(struct file *filp,
>  }
>  #endif
>  
> +static vm_fault_t kvm_vm_fault(struct vm_fault *vmf)
> +{
> +	struct kvm *kvm = vmf->vma->vm_file->private_data;
> +	struct page *page = NULL;
> +
> +	if (vmf->pgoff == 0)
> +		page = virt_to_page(kvm->vm_run);
> +	else if (kvm_fault_in_dirty_ring(kvm, vmf))
> +		page = kvm_dirty_ring_get_page(
> +		    &kvm->vm_dirty_ring,
> +		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
> +	else
> +		return VM_FAULT_SIGBUS;
> +
> +	get_page(page);
> +	vmf->page = page;
> +	return 0;
> +}
> +
> +static const struct vm_operations_struct kvm_vm_vm_ops = {
> +	.fault = kvm_vm_fault,
> +};
> +
> +static int kvm_vm_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	vma->vm_ops = &kvm_vm_vm_ops;
> +	return 0;
> +}
> +
>  static struct file_operations kvm_vm_fops = {
>  	.release        = kvm_vm_release,
>  	.unlocked_ioctl = kvm_vm_ioctl,
> +	.mmap           = kvm_vm_mmap,
>  	.llseek		= noop_llseek,
>  	KVM_COMPAT(kvm_vm_compat_ioctl),
>  };
> -- 
> 2.21.0
Michael S. Tsirkin Dec. 11, 2019, 1:04 p.m. UTC | #30
On Wed, Dec 11, 2019 at 10:05:28AM +0100, Paolo Bonzini wrote:
> On 10/12/19 22:53, Michael S. Tsirkin wrote:
> > On Tue, Dec 10, 2019 at 11:02:11AM -0500, Peter Xu wrote:
> >> On Tue, Dec 10, 2019 at 02:31:54PM +0100, Paolo Bonzini wrote:
> >>> On 10/12/19 14:25, Michael S. Tsirkin wrote:
> >>>>> There is no new infrastructure to track the dirty pages---it's just a
> >>>>> different way to pass them to userspace.
> >>>> Did you guys consider using one of the virtio ring formats?
> >>>> Maybe reusing vhost code?
> >>>
> >>> There are no used/available entries here, it's unidirectional
> >>> (kernel->user).
> >>
> >> Agreed.  Vring could be an overkill IMHO (the whole dirty_ring.c is
> >> 100+ LOC only).
> > 
> > I guess you don't do polling/ event suppression and other tricks that
> > virtio came up with for speed then?

I finally looked at the code: there's actually an avail index, and fetch
works exactly like used. I'm not saying the existing code is a great fit
for you, as you have an extra slot parameter to pass and the roles are
reversed compared to vhost, with the kernel being the driver and
userspace the device (vringh might fit, though it would need to be
updated to support packed rings).  But sticking to an existing format is
a good idea IMHO, and if not, I think it's not a bad idea to add some
justification.

> There are no interrupts either, so no need for event suppression.  You
> have vmexits when the ring gets full (and that needs to be synchronous),
> but apart from that the migration thread will poll the rings once when
> it needs to send more pages.
> 
> Paolo

OK don't use that then.
Paolo Bonzini Dec. 11, 2019, 2:14 p.m. UTC | #31
On 11/12/19 13:53, Michael S. Tsirkin wrote:
>> +
>> +struct kvm_dirty_ring_indexes {
>> +	__u32 avail_index; /* set by kernel */
>> +	__u32 fetch_index; /* set by userspace */
>
> Sticking these next to each other seems to guarantee cache conflicts.

I don't think that's an issue, because you'd have a conflict on the
actual entry anyway; userspace has to read the kernel-written index
regardless, which will cause cache traffic.

> Avail/Fetch seems to mimic Virtio's avail/used exactly.

No, avail_index/fetch_index are just the producer and consumer indices,
respectively.  There is only one ring buffer, not two as in virtio.
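
To make that concrete, harvesting amounts to something like the sketch
below on the userspace side (collect_dirty_page, dirty_gfns and
ring_size are placeholders, not part of the proposed API; the acquire
load pairs with the kernel's smp_wmb() before it bumps avail_index):

	static void harvest_ring(struct kvm_dirty_ring_indexes *indexes,
				 struct kvm_dirty_gfn *dirty_gfns,
				 uint32_t ring_size)
	{
		uint32_t avail = __atomic_load_n(&indexes->avail_index,
						 __ATOMIC_ACQUIRE);
		uint32_t fetch = indexes->fetch_index; /* written only by userspace */

		while (fetch != avail) {
			struct kvm_dirty_gfn *e = &dirty_gfns[fetch & (ring_size - 1)];

			collect_dirty_page(e->slot, e->offset);
			fetch++;
		}
		/*
		 * Tell the kernel how far we got; entries are only
		 * reclaimed later, via KVM_RESET_DIRTY_RINGS.
		 */
		__atomic_store_n(&indexes->fetch_index, fetch, __ATOMIC_RELEASE);
	}
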

> I am not saying
> you must reuse the code really, but I think you should take a hard look
> at e.g. the virtio packed ring structure. We spent a bunch of time
> optimizing it for cache utilization. It seems kernel is the driver,
> making entries available, and userspace the device, using them.
> Again let's not develop a thread about this, but I think
> this is something to consider and discuss in future versions
> of the patches.

Even in the packed ring you have two cache lines accessed, one for the
index and one for the descriptor.  Here you have one, because the data
is embedded in the ring buffer.

> 
>> +};
>> +
>> +While for each of the dirty entry it's defined as:
>> +
>> +struct kvm_dirty_gfn {
> 
> What does GFN stand for?
> 
>> +        __u32 pad;
>> +        __u32 slot; /* as_id | slot_id */
>> +        __u64 offset;
>> +};
> 
> offset of what? a 4K page right? Seems like a waste e.g. for
> hugetlbfs... How about replacing pad with size instead?

No, it's an offset in the memslot (which will usually be >4GB for any VM
with more memory than that).
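
As an illustration, assuming userspace keeps the
kvm_userspace_memory_region it registered for each (as_id, slot) pair
(lookup_region below is a placeholder), turning one entry into a host
address would look roughly like:

	static void *dirty_gfn_to_hva(struct kvm_dirty_gfn *e, long page_size)
	{
		uint16_t as_id = e->slot >> 16;
		uint16_t slot_id = e->slot & 0xffff;
		struct kvm_userspace_memory_region *r = lookup_region(as_id, slot_id);

		/* e->offset is a page offset into the memslot, not a raw GPA */
		return (void *)(uintptr_t)(r->userspace_addr + e->offset * page_size);
	}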

Paolo
Peter Xu Dec. 11, 2019, 2:54 p.m. UTC | #32
On Wed, Dec 11, 2019 at 08:04:36AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 11, 2019 at 10:05:28AM +0100, Paolo Bonzini wrote:
> > On 10/12/19 22:53, Michael S. Tsirkin wrote:
> > > On Tue, Dec 10, 2019 at 11:02:11AM -0500, Peter Xu wrote:
> > >> On Tue, Dec 10, 2019 at 02:31:54PM +0100, Paolo Bonzini wrote:
> > >>> On 10/12/19 14:25, Michael S. Tsirkin wrote:
> > >>>>> There is no new infrastructure to track the dirty pages---it's just a
> > >>>>> different way to pass them to userspace.
> > >>>> Did you guys consider using one of the virtio ring formats?
> > >>>> Maybe reusing vhost code?
> > >>>
> > >>> There are no used/available entries here, it's unidirectional
> > >>> (kernel->user).
> > >>
> > >> Agreed.  Vring could be an overkill IMHO (the whole dirty_ring.c is
> > >> 100+ LOC only).
> > > 
> > > I guess you don't do polling/ event suppression and other tricks that
> > > virtio came up with for speed then?
> 
> I looked at the code finally, there's actually available, and fetched is
> exactly like used. Not saying existing code is a great fit for you as
> you have an extra slot parameter to pass and it's reversed as compared
> to vhost, with kernel being the driver and userspace the device (even
> though vringh might fit, yet needs to be updated to support packed rings
> though).  But sticking to an existing format is a good idea IMHO,
> or if not I think it's not a bad idea to add some justification.

Right, I'll add a small paragraph in the next cover letter to justify.

Thanks,
Christophe de Dinechin Dec. 11, 2019, 5:24 p.m. UTC | #33
Peter Xu writes:

> This patch is heavily based on previous work from Lei Cao
> <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
>
> KVM currently uses large bitmaps to track dirty memory.  These bitmaps
> are copied to userspace when userspace queries KVM for its dirty page
> information.  The use of bitmaps is mostly sufficient for live
> migration, as large parts of memory are be dirtied from one log-dirty
> pass to another.

That statement sort of concerns me. If large parts of memory are
dirtied, won't this cause the rings to fill up quickly enough to cause a
lot of churn between user-space and kernel?

See a possible suggestion to address that below.

> However, in a checkpointing system, the number of
> dirty pages is small and in fact it is often bounded---the VM is
> paused when it has dirtied a pre-defined number of pages. Traversing a
> large, sparsely populated bitmap to find set bits is time-consuming,
> as is copying the bitmap to user-space.
>
> A similar issue will be there for live migration when the guest memory
> is huge while the page dirty procedure is trivial.  In that case for
> each dirty sync we need to pull the whole dirty bitmap to userspace
> and analyse every bit even if it's mostly zeros.
>
> The preferred data structure for above scenarios is a dense list of
> guest frame numbers (GFN).  This patch series stores the dirty list in
> kernel memory that can be memory mapped into userspace to allow speedy
> harvesting.
>
> We defined two new data structures:
>
>   struct kvm_dirty_ring;
>   struct kvm_dirty_ring_indexes;
>
> Firstly, kvm_dirty_ring is defined to represent a ring of dirty
> pages.  When dirty tracking is enabled, we can push dirty gfn onto the
> ring.
>
> Secondly, kvm_dirty_ring_indexes is defined to represent the
> user/kernel interface of each ring.  Currently it contains two
> indexes: (1) avail_index represents where we should push our next
> PFN (written by kernel), while (2) fetch_index represents where the
> userspace should fetch the next dirty PFN (written by userspace).
>
> One complete ring is composed by one kvm_dirty_ring plus its
> corresponding kvm_dirty_ring_indexes.
>
> Currently, we have N+1 rings for each VM of N vcpus:
>
>   - for each vcpu, we have 1 per-vcpu dirty ring,
>   - for each vm, we have 1 per-vm dirty ring
>
> Please refer to the documentation update in this patch for more
> details.
>
> Note that this patch implements the core logic of dirty ring buffer.
> It's still disabled for all archs for now.  Also, we'll address some
> of the other issues in follow up patches before it's firstly enabled
> on x86.
>
> [1] https://patchwork.kernel.org/patch/10471409/
>
> Signed-off-by: Lei Cao <lei.cao@stratus.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  Documentation/virt/kvm/api.txt | 109 +++++++++++++++
>  arch/x86/kvm/Makefile          |   3 +-
>  include/linux/kvm_dirty_ring.h |  67 +++++++++
>  include/linux/kvm_host.h       |  33 +++++
>  include/linux/kvm_types.h      |   1 +
>  include/uapi/linux/kvm.h       |  36 +++++
>  virt/kvm/dirty_ring.c          | 156 +++++++++++++++++++++
>  virt/kvm/kvm_main.c            | 240 ++++++++++++++++++++++++++++++++-
>  8 files changed, 642 insertions(+), 3 deletions(-)
>  create mode 100644 include/linux/kvm_dirty_ring.h
>  create mode 100644 virt/kvm/dirty_ring.c
>
> diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> index 49183add44e7..fa622c9a2eb8 100644
> --- a/Documentation/virt/kvm/api.txt
> +++ b/Documentation/virt/kvm/api.txt
> @@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
>  It is thus encouraged to use the vm ioctl to query for capabilities (available
>  with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
>
> +
>  4.5 KVM_GET_VCPU_MMAP_SIZE
>
>  Capability: basic
> @@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
>  memory region.  This ioctl returns the size of that region.  See the
>  KVM_RUN documentation for details.
>
> +Besides the size of the KVM_RUN communication region, other areas of
> +the VCPU file descriptor can be mmap-ed, including:
> +
> +- if KVM_CAP_COALESCED_MMIO is available, a page at
> +  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
> +  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
> +  KVM_CAP_COALESCED_MMIO is not documented yet.

Does the above really belong to this patch?

> +
> +- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> +  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
> +  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
> +
>
>  4.6 KVM_SET_MEMORY_REGION
>
> @@ -5358,6 +5371,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
>  AArch64, this value will be reported in the ISS field of ESR_ELx.
>
>  See KVM_CAP_VCPU_EVENTS for more details.
> +
>  8.20 KVM_CAP_HYPERV_SEND_IPI
>
>  Architectures: x86
> @@ -5365,6 +5379,7 @@ Architectures: x86
>  This capability indicates that KVM supports paravirtualized Hyper-V IPI send
>  hypercalls:
>  HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> +
>  8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
>
>  Architecture: x86
> @@ -5378,3 +5393,97 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
>  flush hypercalls by Hyper-V) so userspace should disable KVM identification
>  in CPUID and only exposes Hyper-V identification. In this case, guest
>  thinks it's running on Hyper-V and only use Hyper-V hypercalls.
> +
> +8.22 KVM_CAP_DIRTY_LOG_RING
> +
> +Architectures: x86
> +Parameters: args[0] - size of the dirty log ring
> +
> +KVM is capable of tracking dirty memory using ring buffers that are
> +mmaped into userspace; there is one dirty ring per vcpu and one global
> +ring per vm.
> +
> +One dirty ring has the following two major structures:
> +
> +struct kvm_dirty_ring {
> +	u16 dirty_index;
> +	u16 reset_index;

What is the benefit of using u16 for that? That means with 4K pages you
can share at most 2^16 entries * 4 KiB = 256M of dirty memory each time?
That seems low to me, especially since it's sufficient to touch one byte
in a page to dirty it.

Actually, this is not consistent with the definition in the code ;-)
So I'll assume it's actually u32.

> +	u32 size;
> +	u32 soft_limit;
> +	spinlock_t lock;
> +	struct kvm_dirty_gfn *dirty_gfns;
> +};
> +
> +struct kvm_dirty_ring_indexes {
> +	__u32 avail_index; /* set by kernel */
> +	__u32 fetch_index; /* set by userspace */
> +};
> +
> +While for each of the dirty entry it's defined as:
> +
> +struct kvm_dirty_gfn {
> +        __u32 pad;
> +        __u32 slot; /* as_id | slot_id */
> +        __u64 offset;
> +};

Like others have suggested, I think we might use "pad" to store size
information, to be able to dirty large pages more efficiently.

> +
> +The fields in kvm_dirty_ring will be only internal to KVM itself,
> +while the fields in kvm_dirty_ring_indexes will be exposed to
> +userspace to be either read or written.

The sentence above is confusing when contrasted with the "set by kernel"
comment above.

> +
> +The two indices in the ring buffer are free running counters.

Nit: this patch uses both "indices" and "indexes".
Both are correct, but it would be nice to be consistent.

> +
> +In pseudocode, processing the ring buffer looks like this:
> +
> +	idx = load-acquire(&ring->fetch_index);
> +	while (idx != ring->avail_index) {
> +		struct kvm_dirty_gfn *entry;
> +		entry = &ring->dirty_gfns[idx & (size - 1)];
> +		...
> +
> +		idx++;
> +	}
> +	ring->fetch_index = idx;
> +
> +Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
> +to enable this capability for the new guest and set the size of the
> +rings.  It is only allowed before creating any vCPU, and the size of
> +the ring must be a power of two.  The larger the ring buffer, the less
> +likely the ring is full and the VM is forced to exit to userspace. The
> +optimal size depends on the workload, but it is recommended that it be
> +at least 64 KiB (4096 entries).

Is there anything in the design that would preclude resizing the ring
buffer at a later time? Presumably, you'd want a large ring while you
are doing things like migrations, but it's mostly useless when you are
not monitoring memory. So it would be nice to be able to call
KVM_ENABLE_CAP at any time to adjust the size.

As I read the current code, one of the issues would be the mapping of the
rings in case of a later extension that adds something beyond the rings.
But I'm not sure that's a big deal at the moment.

> +
> +After the capability is enabled, userspace can mmap the global ring
> +buffer (kvm_dirty_gfn[], offset KVM_DIRTY_LOG_PAGE_OFFSET) and the
> +indexes (kvm_dirty_ring_indexes, offset 0) from the VM file
> +descriptor.  The per-vcpu dirty ring instead is mmapped when the vcpu
> +is created, similar to the kvm_run struct (kvm_dirty_ring_indexes
> +locates inside kvm_run, while kvm_dirty_gfn[] at offset
> +KVM_DIRTY_LOG_PAGE_OFFSET).
> +
> +Just like for dirty page bitmaps, the buffer tracks writes to
> +all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
> +set in KVM_SET_USER_MEMORY_REGION.  Once a memory region is registered
> +with the flag set, userspace can start harvesting dirty pages from the
> +ring buffer.
> +
> +To harvest the dirty pages, userspace accesses the mmaped ring buffer
> +to read the dirty GFNs up to avail_index, and sets the fetch_index
> +accordingly.  This can be done when the guest is running or paused,
> +and dirty pages need not be collected all at once.  After processing
> +one or more entries in the ring buffer, userspace calls the VM ioctl
> +KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
> +fetch_index and to mark those pages clean.  Therefore, the ioctl
> +must be called *before* reading the content of the dirty pages.

> +
> +However, there is a major difference comparing to the
> +KVM_GET_DIRTY_LOG interface in that when reading the dirty ring from
> +userspace it's still possible that the kernel has not yet flushed the
> +hardware dirty buffers into the kernel buffer.  To achieve that, one
> +needs to kick the vcpu out for a hardware buffer flush (vmexit).

When you refer to "buffers", are you referring to the cache lines that
contain the ring buffers, or to something else?

I'm a bit confused by this sentence. I think you mean that a VCPU
may still be running while you read its ring buffer, in which case the
values in the ring buffer are not necessarily in memory yet, so not
visible to a different CPU. But I wonder if you can't make this vmexit
requirement unnecessary by carefully ordering the writes, making sure
that avail_index is updated only after the corresponding ring entries
have been written to memory.

In other words, as seen by user-space, you would not care that some ring
entries have not been flushed yet, as long as avail_index is guaranteed
to still be behind those not-yet-flushed entries.

(I would know how to do that on a different architecture, not sure for x86)

> +
> +If one of the ring buffers is full, the guest will exit to userspace
> +with the exit reason set to KVM_EXIT_DIRTY_LOG_FULL, and the
> +KVM_RUN ioctl will return -EINTR. Once that happens, userspace
> +should pause all the vcpus, then harvest all the dirty pages and
> +rearm the dirty traps. It can unpause the guest after that.

Except for the condition above, why is it necessary to pause VCPUs other
than the one being harvested?


> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index b19ef421084d..0acee817adfb 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -5,7 +5,8 @@ ccflags-y += -Iarch/x86/kvm
>  KVM := ../../../virt/kvm
>
>  kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> -				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> +				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
> +				$(KVM)/dirty_ring.o
>  kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
>
>  kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
> diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> new file mode 100644
> index 000000000000..8335635b7ff7
> --- /dev/null
> +++ b/include/linux/kvm_dirty_ring.h
> @@ -0,0 +1,67 @@
> +#ifndef KVM_DIRTY_RING_H
> +#define KVM_DIRTY_RING_H
> +
> +/*
> + * struct kvm_dirty_ring is defined in include/uapi/linux/kvm.h.
> + *
> + * dirty_ring:  shared with userspace via mmap. It is the compact list
> + *              that holds the dirty pages.
> + * dirty_index: free running counter that points to the next slot in
> + *              dirty_ring->dirty_gfns  where a new dirty page should go.
> + * reset_index: free running counter that points to the next dirty page
> + *              in dirty_ring->dirty_gfns for which dirty trap needs to
> + *              be reenabled
> + * size:        size of the compact list, dirty_ring->dirty_gfns
> + * soft_limit:  when the number of dirty pages in the list reaches this
> + *              limit, vcpu that owns this ring should exit to userspace
> + *              to allow userspace to harvest all the dirty pages
> + * lock:        protects dirty_ring, only in use if this is the global
> + *              ring

If that's not used for vcpu rings, maybe move it out of kvm_dirty_ring?

> + *
> + * The number of dirty pages in the ring is calculated by,
> + * dirty_index - reset_index

Nit: the code calls it "used" (in kvm_dirty_ring_used). Maybe find an
unambiguous term. What about "posted", as in:

The number of posted dirty pages, i.e. the number of dirty pages in the
ring, is calculated as dirty_index - reset_index by function
kvm_dirty_ring_posted

(Replace "posted" by any adjective of your liking)

> + *
> + * kernel increments dirty_ring->indices.avail_index after dirty index
> + * is incremented. When userspace harvests the dirty pages, it increments
> + * dirty_ring->indices.fetch_index up to dirty_ring->indices.avail_index.
> + * When kernel reenables dirty traps for the dirty pages, it increments
> + * reset_index up to dirty_ring->indices.fetch_index.

Userspace should not be trusted to do this; see below.


> + *
> + */
> +struct kvm_dirty_ring {
> +	u32 dirty_index;
> +	u32 reset_index;
> +	u32 size;
> +	u32 soft_limit;
> +	spinlock_t lock;
> +	struct kvm_dirty_gfn *dirty_gfns;
> +};
> +
> +u32 kvm_dirty_ring_get_rsvd_entries(void);
> +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring);
> +
> +/*
> + * called with kvm->slots_lock held, returns the number of
> + * processed pages.
> + */
> +int kvm_dirty_ring_reset(struct kvm *kvm,
> +			 struct kvm_dirty_ring *ring,
> +			 struct kvm_dirty_ring_indexes *indexes);
> +
> +/*
> + * returns 0: successfully pushed
> + *         1: successfully pushed, soft limit reached,
> + *            vcpu should exit to userspace
> + *         -EBUSY: unable to push, dirty ring full.
> + */
> +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> +			struct kvm_dirty_ring_indexes *indexes,
> +			u32 slot, u64 offset, bool lock);
> +
> +/* for use in vm_operations_struct */
> +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i);

Not very clear what 'i' means; it seems to be a page offset, based on the
call sites?

> +
> +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
> +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring);
> +
> +#endif
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 498a39462ac1..7b747bc9ff3e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -34,6 +34,7 @@
>  #include <linux/kvm_types.h>
>
>  #include <asm/kvm_host.h>
> +#include <linux/kvm_dirty_ring.h>
>
>  #ifndef KVM_MAX_VCPU_ID
>  #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
> @@ -146,6 +147,7 @@ static inline bool is_error_page(struct page *page)
>  #define KVM_REQ_MMU_RELOAD        (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_PENDING_TIMER     2
>  #define KVM_REQ_UNHALT            3
> +#define KVM_REQ_DIRTY_RING_FULL   4
>  #define KVM_REQUEST_ARCH_BASE     8
>
>  #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
> @@ -321,6 +323,7 @@ struct kvm_vcpu {
>  	bool ready;
>  	struct kvm_vcpu_arch arch;
>  	struct dentry *debugfs_dentry;
> +	struct kvm_dirty_ring dirty_ring;
>  };
>
>  static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> @@ -501,6 +504,10 @@ struct kvm {
>  	struct srcu_struct srcu;
>  	struct srcu_struct irq_srcu;
>  	pid_t userspace_pid;
> +	/* Data structure to be exported by mmap(kvm->fd, 0) */
> +	struct kvm_vm_run *vm_run;
> +	u32 dirty_ring_size;
> +	struct kvm_dirty_ring vm_dirty_ring;

If you remove the lock from struct kvm_dirty_ring, you could just put it there.

>  };
>
>  #define kvm_err(fmt, ...) \
> @@ -832,6 +839,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>  					gfn_t gfn_offset,
>  					unsigned long mask);
>
> +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
> +
>  int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
>  				struct kvm_dirty_log *log);
>  int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> @@ -1411,4 +1420,28 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
>  				uintptr_t data, const char *name,
>  				struct task_struct **thread_ptr);
>
> +/*
> + * This defines how many reserved entries we want to keep before we
> + * kick the vcpu to the userspace to avoid dirty ring full.  This
> + * value can be tuned to higher if e.g. PML is enabled on the host.
> + */
> +#define  KVM_DIRTY_RING_RSVD_ENTRIES  64
> +
> +/* Max number of entries allowed for each kvm dirty ring */
> +#define  KVM_DIRTY_RING_MAX_ENTRIES  65536
> +
> +/*
> + * Arch needs to define these macro after implementing the dirty ring
> + * feature.  KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
> + * starting page offset of the dirty ring structures, while
> + * KVM_DIRTY_RING_VERSION should be defined as >=1.  By default, this
> + * feature is off on all archs.
> + */
> +#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
> +#define KVM_DIRTY_LOG_PAGE_OFFSET 0
> +#endif
> +#ifndef KVM_DIRTY_RING_VERSION
> +#define KVM_DIRTY_RING_VERSION 0
> +#endif
> +
>  #endif
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 1c88e69db3d9..d9d03eea145a 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -11,6 +11,7 @@ struct kvm_irq_routing_table;
>  struct kvm_memory_slot;
>  struct kvm_one_reg;
>  struct kvm_run;
> +struct kvm_vm_run;
>  struct kvm_userspace_memory_region;
>  struct kvm_vcpu;
>  struct kvm_vcpu_init;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index e6f17c8e2dba..0b88d76d6215 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
>  #define KVM_EXIT_IOAPIC_EOI       26
>  #define KVM_EXIT_HYPERV           27
>  #define KVM_EXIT_ARM_NISV         28
> +#define KVM_EXIT_DIRTY_RING_FULL  29
>
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -247,6 +248,11 @@ struct kvm_hyperv_exit {
>  /* Encounter unexpected vm-exit reason */
>  #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON	4
>
> +struct kvm_dirty_ring_indexes {
> +	__u32 avail_index; /* set by kernel */
> +	__u32 fetch_index; /* set by userspace */
> +};
> +
>  /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
>  struct kvm_run {
>  	/* in */
> @@ -421,6 +427,13 @@ struct kvm_run {
>  		struct kvm_sync_regs regs;
>  		char padding[SYNC_REGS_SIZE_BYTES];
>  	} s;
> +
> +	struct kvm_dirty_ring_indexes vcpu_ring_indexes;
> +};
> +
> +/* Returned by mmap(kvm->fd, offset=0) */
> +struct kvm_vm_run {
> +	struct kvm_dirty_ring_indexes vm_ring_indexes;
>  };
>
>  /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> @@ -1009,6 +1022,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
>  #define KVM_CAP_ARM_NISV_TO_USER 177
>  #define KVM_CAP_ARM_INJECT_EXT_DABT 178
> +#define KVM_CAP_DIRTY_LOG_RING 179
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -1472,6 +1486,9 @@ struct kvm_enc_region {
>  /* Available with KVM_CAP_ARM_SVE */
>  #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
>
> +/* Available with KVM_CAP_DIRTY_LOG_RING */
> +#define KVM_RESET_DIRTY_RINGS     _IO(KVMIO, 0xc3)
> +
>  /* Secure Encrypted Virtualization command */
>  enum sev_cmd_id {
>  	/* Guest initialization commands */
> @@ -1622,4 +1639,23 @@ struct kvm_hyperv_eventfd {
>  #define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
>  #define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
>
> +/*
> + * The following are the requirements for supporting dirty log ring
> + * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
> + *
> + * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
> + *    of kvm_write_* so that the global dirty ring is not filled up
> + *    too quickly.
> + * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
> + *    enabling dirty logging.
> + * 3. There should not be a separate step to synchronize hardware
> + *    dirty bitmap with KVM's.
> + */
> +
> +struct kvm_dirty_gfn {
> +	__u32 pad;
> +	__u32 slot;
> +	__u64 offset;
> +};
> +
>  #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> new file mode 100644
> index 000000000000..9264891f3c32
> --- /dev/null
> +++ b/virt/kvm/dirty_ring.c
> @@ -0,0 +1,156 @@
> +#include <linux/kvm_host.h>
> +#include <linux/kvm.h>
> +#include <linux/vmalloc.h>
> +#include <linux/kvm_dirty_ring.h>
> +
> +u32 kvm_dirty_ring_get_rsvd_entries(void)
> +{
> +	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> +}
> +
> +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
> +{
> +	u32 size = kvm->dirty_ring_size;
> +
> +	ring->dirty_gfns = vmalloc(size);
> +	if (!ring->dirty_gfns)
> +		return -ENOMEM;
> +	memset(ring->dirty_gfns, 0, size);
> +
> +	ring->size = size / sizeof(struct kvm_dirty_gfn);
> +	ring->soft_limit =
> +	    (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -
> +	    kvm_dirty_ring_get_rsvd_entries();

Minor, but what about

       ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries();


> +	ring->dirty_index = 0;
> +	ring->reset_index = 0;
> +	spin_lock_init(&ring->lock);
> +
> +	return 0;
> +}
> +
> +int kvm_dirty_ring_reset(struct kvm *kvm,
> +			 struct kvm_dirty_ring *ring,
> +			 struct kvm_dirty_ring_indexes *indexes)
> +{
> +	u32 cur_slot, next_slot;
> +	u64 cur_offset, next_offset;
> +	unsigned long mask;
> +	u32 fetch;
> +	int count = 0;
> +	struct kvm_dirty_gfn *entry;
> +
> +	fetch = READ_ONCE(indexes->fetch_index);

If I understand correctly, if a malicious user-space writes a stale value
such as ring->reset_index - 1 into fetch_index, the loop below will
execute close to 4 billion times.


> +	if (fetch == ring->reset_index)
> +		return 0;

To protect against the scenario above, I would add something like:

	if (fetch - ring->reset_index >= ring->size)
		return -EINVAL;

> +
> +	entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> +	/*
> +	 * The ring buffer is shared with userspace, which might mmap
> +	 * it and concurrently modify slot and offset.  Userspace must
> +	 * not be trusted!  READ_ONCE prevents the compiler from changing
> +	 * the values after they've been range-checked (the checks are
> +	 * in kvm_reset_dirty_gfn).
> +	 */
> +	smp_read_barrier_depends();
> +	cur_slot = READ_ONCE(entry->slot);
> +	cur_offset = READ_ONCE(entry->offset);
> +	mask = 1;
> +	count++;
> +	ring->reset_index++;
> +	while (ring->reset_index != fetch) {
> +		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> +		smp_read_barrier_depends();
> +		next_slot = READ_ONCE(entry->slot);
> +		next_offset = READ_ONCE(entry->offset);
> +		ring->reset_index++;
> +		count++;
> +		/*
> +		 * Try to coalesce the reset operations when the guest is
> +		 * scanning pages in the same slot.
> +		 */
> +		if (next_slot == cur_slot) {
> +			int delta = next_offset - cur_offset;

Since you take the difference of two u64 values, shouldn't that be an s64
rather than an int?

> +
> +			if (delta >= 0 && delta < BITS_PER_LONG) {
> +				mask |= 1ull << delta;
> +				continue;
> +			}
> +
> +			/* Backwards visit, careful about overflows!  */
> +			if (delta > -BITS_PER_LONG && delta < 0 &&
> +			    (mask << -delta >> -delta) == mask) {
> +				cur_offset = next_offset;
> +				mask = (mask << -delta) | 1;
> +				continue;
> +			}
> +		}
> +		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> +		cur_slot = next_slot;
> +		cur_offset = next_offset;
> +		mask = 1;
> +	}
> +	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);

So if you did not coalesce the last one, you call kvm_reset_dirty_gfn
twice? Something smells weird about this loop ;-) I have a gut feeling
that it could be done in a single while loop combined with the entry
test, but I may be wrong.
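
For what it's worth, here is a rough, completely untested sketch of what
such a single loop could look like, keeping the same coalescing rules but
starting with mask = 0 so that neither the first entry nor the
fetch == reset_index case needs special-casing:

	while (ring->reset_index != fetch) {
		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
		smp_read_barrier_depends();
		next_slot = READ_ONCE(entry->slot);
		next_offset = READ_ONCE(entry->offset);
		ring->reset_index++;
		count++;

		/* Try to coalesce with the pending mask, if any. */
		if (mask && next_slot == cur_slot) {
			s64 delta = next_offset - cur_offset;

			if (delta >= 0 && delta < BITS_PER_LONG) {
				mask |= 1ull << delta;
				continue;
			}
			if (delta > -BITS_PER_LONG && delta < 0 &&
			    (mask << -delta >> -delta) == mask) {
				cur_offset = next_offset;
				mask = (mask << -delta) | 1;
				continue;
			}
		}
		/* Flush the pending batch and start a new one. */
		if (mask)
			kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
		cur_slot = next_slot;
		cur_offset = next_offset;
		mask = 1;
	}
	/* Flush whatever is left (a no-op when nothing was fetched). */
	if (mask)
		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);

	return count;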


> +
> +	return count;
> +}
> +
> +static inline u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
> +{
> +	return ring->dirty_index - ring->reset_index;
> +}
> +
> +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
> +{
> +	return kvm_dirty_ring_used(ring) >= ring->size;
> +}
> +
> +/*
> + * Returns:
> + *   >0 if we should kick the vcpu out,
> + *   =0 if the gfn pushed successfully, or,
> + *   <0 if error (e.g. ring full)
> + */
> +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> +			struct kvm_dirty_ring_indexes *indexes,
> +			u32 slot, u64 offset, bool lock)

Obviously, if you go with the suggestion to have a "lock" only in struct
kvm, then you'd have to pass a lock ptr instead of a bool.

> +{
> +	int ret;
> +	struct kvm_dirty_gfn *entry;
> +
> +	if (lock)
> +		spin_lock(&ring->lock);
> +
> +	if (kvm_dirty_ring_full(ring)) {
> +		ret = -EBUSY;
> +		goto out;
> +	}
> +
> +	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> +	entry->slot = slot;
> +	entry->offset = offset;
> +	smp_wmb();
> +	ring->dirty_index++;
> +	WRITE_ONCE(indexes->avail_index, ring->dirty_index);

Following up on the comment above about having to vmexit the other VCPUs:
if you have a write barrier after filling in the entry, and then a
WRITE_ONCE for the index, isn't that sufficient to ensure that another
CPU will pick up the right values in the right order?


> +	ret = kvm_dirty_ring_used(ring) >= ring->soft_limit;
> +	pr_info("%s: slot %u offset %llu used %u\n",
> +		__func__, slot, offset, kvm_dirty_ring_used(ring));
> +
> +out:
> +	if (lock)
> +		spin_unlock(&ring->lock);
> +
> +	return ret;
> +}
> +
> +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i)

Still don't like 'i' :-)


(Stopped my review here for lack of time, decided to share what I had so far)

> +{
> +	return vmalloc_to_page((void *)ring->dirty_gfns + i * PAGE_SIZE);
> +}
> +
> +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
> +{
> +	if (ring->dirty_gfns) {
> +		vfree(ring->dirty_gfns);
> +		ring->dirty_gfns = NULL;
> +	}
> +}
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 681452d288cd..8642c977629b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -64,6 +64,8 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/kvm.h>
>
> +#include <linux/kvm_dirty_ring.h>
> +
>  /* Worst case buffer size needed for holding an integer. */
>  #define ITOA_MAX_LEN 12
>
> @@ -149,6 +151,10 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
>  				    struct kvm_vcpu *vcpu,
>  				    struct kvm_memory_slot *memslot,
>  				    gfn_t gfn);
> +static void mark_page_dirty_in_ring(struct kvm *kvm,
> +				    struct kvm_vcpu *vcpu,
> +				    struct kvm_memory_slot *slot,
> +				    gfn_t gfn);
>
>  __visible bool kvm_rebooting;
>  EXPORT_SYMBOL_GPL(kvm_rebooting);
> @@ -359,11 +365,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>  	vcpu->preempted = false;
>  	vcpu->ready = false;
>
> +	if (kvm->dirty_ring_size) {
> +		r = kvm_dirty_ring_alloc(vcpu->kvm, &vcpu->dirty_ring);
> +		if (r) {
> +			kvm->dirty_ring_size = 0;
> +			goto fail_free_run;
> +		}
> +	}
> +
>  	r = kvm_arch_vcpu_init(vcpu);
>  	if (r < 0)
> -		goto fail_free_run;
> +		goto fail_free_ring;
>  	return 0;
>
> +fail_free_ring:
> +	if (kvm->dirty_ring_size)
> +		kvm_dirty_ring_free(&vcpu->dirty_ring);
>  fail_free_run:
>  	free_page((unsigned long)vcpu->run);
>  fail:
> @@ -381,6 +398,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
>  	put_pid(rcu_dereference_protected(vcpu->pid, 1));
>  	kvm_arch_vcpu_uninit(vcpu);
>  	free_page((unsigned long)vcpu->run);
> +	if (vcpu->kvm->dirty_ring_size)
> +		kvm_dirty_ring_free(&vcpu->dirty_ring);
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
>
> @@ -690,6 +709,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  	struct kvm *kvm = kvm_arch_alloc_vm();
>  	int r = -ENOMEM;
>  	int i;
> +	struct page *page;
>
>  	if (!kvm)
>  		return ERR_PTR(-ENOMEM);
> @@ -705,6 +725,14 @@ static struct kvm *kvm_create_vm(unsigned long type)
>
>  	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
>
> +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	if (!page) {
> +		r = -ENOMEM;
> +		goto out_err_alloc_page;
> +	}
> +	kvm->vm_run = page_address(page);
> +	BUILD_BUG_ON(sizeof(struct kvm_vm_run) > PAGE_SIZE);
> +
>  	if (init_srcu_struct(&kvm->srcu))
>  		goto out_err_no_srcu;
>  	if (init_srcu_struct(&kvm->irq_srcu))
> @@ -775,6 +803,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  out_err_no_irq_srcu:
>  	cleanup_srcu_struct(&kvm->srcu);
>  out_err_no_srcu:
> +	free_page((unsigned long)page);
> +	kvm->vm_run = NULL;
> +out_err_alloc_page:
>  	kvm_arch_free_vm(kvm);
>  	mmdrop(current->mm);
>  	return ERR_PTR(r);
> @@ -800,6 +831,15 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  	int i;
>  	struct mm_struct *mm = kvm->mm;
>
> +	if (kvm->dirty_ring_size) {
> +		kvm_dirty_ring_free(&kvm->vm_dirty_ring);
> +	}
> +
> +	if (kvm->vm_run) {
> +		free_page((unsigned long)kvm->vm_run);
> +		kvm->vm_run = NULL;
> +	}
> +
>  	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
>  	kvm_destroy_vm_debugfs(kvm);
>  	kvm_arch_sync_events(kvm);
> @@ -2301,7 +2341,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
>  {
>  	if (memslot && memslot->dirty_bitmap) {
>  		unsigned long rel_gfn = gfn - memslot->base_gfn;
> -
> +		mark_page_dirty_in_ring(kvm, vcpu, memslot, gfn);
>  		set_bit_le(rel_gfn, memslot->dirty_bitmap);
>  	}
>  }
> @@ -2649,6 +2689,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
>
> +static bool kvm_fault_in_dirty_ring(struct kvm *kvm, struct vm_fault *vmf)
> +{
> +	return (vmf->pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
> +	    (vmf->pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
> +	     kvm->dirty_ring_size / PAGE_SIZE);
> +}
> +
>  static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
>  {
>  	struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
> @@ -2664,6 +2711,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
>  	else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
>  		page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
>  #endif
> +	else if (kvm_fault_in_dirty_ring(vcpu->kvm, vmf))
> +		page = kvm_dirty_ring_get_page(
> +		    &vcpu->dirty_ring,
> +		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
>  	else
>  		return kvm_arch_vcpu_fault(vcpu, vmf);
>  	get_page(page);
> @@ -3259,12 +3310,162 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>  #endif
>  	case KVM_CAP_NR_MEMSLOTS:
>  		return KVM_USER_MEM_SLOTS;
> +	case KVM_CAP_DIRTY_LOG_RING:
> +		/* Version will be zero if arch didn't implement it */
> +		return KVM_DIRTY_RING_VERSION;
>  	default:
>  		break;
>  	}
>  	return kvm_vm_ioctl_check_extension(kvm, arg);
>  }
>
> +static void mark_page_dirty_in_ring(struct kvm *kvm,
> +				    struct kvm_vcpu *vcpu,
> +				    struct kvm_memory_slot *slot,
> +				    gfn_t gfn)
> +{
> +	u32 as_id = 0;
> +	u64 offset;
> +	int ret;
> +	struct kvm_dirty_ring *ring;
> +	struct kvm_dirty_ring_indexes *indexes;
> +	bool is_vm_ring;
> +
> +	if (!kvm->dirty_ring_size)
> +		return;
> +
> +	offset = gfn - slot->base_gfn;
> +
> +	if (vcpu) {
> +		as_id = kvm_arch_vcpu_memslots_id(vcpu);
> +	} else {
> +		as_id = 0;
> +		vcpu = kvm_get_running_vcpu();
> +	}
> +
> +	if (vcpu) {
> +		ring = &vcpu->dirty_ring;
> +		indexes = &vcpu->run->vcpu_ring_indexes;
> +		is_vm_ring = false;
> +	} else {
> +		/*
> +		 * Put onto per vm ring because no vcpu context.  Kick
> +		 * vcpu0 if ring is full.
> +		 */
> +		vcpu = kvm->vcpus[0];
> +		ring = &kvm->vm_dirty_ring;
> +		indexes = &kvm->vm_run->vm_ring_indexes;
> +		is_vm_ring = true;
> +	}
> +
> +	ret = kvm_dirty_ring_push(ring, indexes,
> +				  (as_id << 16)|slot->id, offset,
> +				  is_vm_ring);
> +	if (ret < 0) {
> +		if (is_vm_ring)
> +			pr_warn_once("per-vm dirty log overflow\n");
> +		else
> +			pr_warn_once("vcpu %d dirty log overflow\n",
> +				     vcpu->vcpu_id);
> +		return;
> +	}
> +
> +	if (ret)
> +		kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
> +}
> +
> +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> +{
> +	struct kvm_memory_slot *memslot;
> +	int as_id, id;
> +
> +	as_id = slot >> 16;
> +	id = (u16)slot;
> +	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> +		return;
> +
> +	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
> +	if (offset >= memslot->npages)
> +		return;
> +
> +	spin_lock(&kvm->mmu_lock);
> +	/* FIXME: we should use a single AND operation, but there is no
> +	 * applicable atomic API.
> +	 */
> +	while (mask) {
> +		clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
> +		mask &= mask - 1;
> +	}
> +
> +	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> +	spin_unlock(&kvm->mmu_lock);
> +}
> +
> +static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
> +{
> +	int r;
> +
> +	/* the size should be power of 2 */
> +	if (!size || (size & (size - 1)))
> +		return -EINVAL;
> +
> +	/* Should be bigger to keep the reserved entries, or a page */
> +	if (size < kvm_dirty_ring_get_rsvd_entries() *
> +	    sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
> +		return -EINVAL;
> +
> +	if (size > KVM_DIRTY_RING_MAX_ENTRIES *
> +	    sizeof(struct kvm_dirty_gfn))
> +		return -E2BIG;
> +
> +	/* We only allow it to set once */
> +	if (kvm->dirty_ring_size)
> +		return -EINVAL;
> +
> +	mutex_lock(&kvm->lock);
> +
> +	if (kvm->created_vcpus) {
> +		/* We don't allow to change this value after vcpu created */
> +		r = -EINVAL;
> +	} else {
> +		kvm->dirty_ring_size = size;
> +		r = kvm_dirty_ring_alloc(kvm, &kvm->vm_dirty_ring);
> +		if (r) {
> +			/* Unset dirty ring */
> +			kvm->dirty_ring_size = 0;
> +		}
> +	}
> +
> +	mutex_unlock(&kvm->lock);
> +	return r;
> +}
> +
> +static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
> +{
> +	int i;
> +	struct kvm_vcpu *vcpu;
> +	int cleared = 0;
> +
> +	if (!kvm->dirty_ring_size)
> +		return -EINVAL;
> +
> +	mutex_lock(&kvm->slots_lock);
> +
> +	cleared += kvm_dirty_ring_reset(kvm, &kvm->vm_dirty_ring,
> +					&kvm->vm_run->vm_ring_indexes);
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm)
> +		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring,
> +						&vcpu->run->vcpu_ring_indexes);
> +
> +	mutex_unlock(&kvm->slots_lock);
> +
> +	if (cleared)
> +		kvm_flush_remote_tlbs(kvm);
> +
> +	return cleared;
> +}
> +
>  int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  						  struct kvm_enable_cap *cap)
>  {
> @@ -3282,6 +3483,8 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  		kvm->manual_dirty_log_protect = cap->args[0];
>  		return 0;
>  #endif
> +	case KVM_CAP_DIRTY_LOG_RING:
> +		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
>  	default:
>  		return kvm_vm_ioctl_enable_cap(kvm, cap);
>  	}
> @@ -3469,6 +3672,9 @@ static long kvm_vm_ioctl(struct file *filp,
>  	case KVM_CHECK_EXTENSION:
>  		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
>  		break;
> +	case KVM_RESET_DIRTY_RINGS:
> +		r = kvm_vm_ioctl_reset_dirty_pages(kvm);
> +		break;
>  	default:
>  		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
>  	}
> @@ -3517,9 +3723,39 @@ static long kvm_vm_compat_ioctl(struct file *filp,
>  }
>  #endif
>
> +static vm_fault_t kvm_vm_fault(struct vm_fault *vmf)
> +{
> +	struct kvm *kvm = vmf->vma->vm_file->private_data;
> +	struct page *page = NULL;
> +
> +	if (vmf->pgoff == 0)
> +		page = virt_to_page(kvm->vm_run);
> +	else if (kvm_fault_in_dirty_ring(kvm, vmf))
> +		page = kvm_dirty_ring_get_page(
> +		    &kvm->vm_dirty_ring,
> +		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
> +	else
> +		return VM_FAULT_SIGBUS;
> +
> +	get_page(page);
> +	vmf->page = page;
> +	return 0;
> +}
> +
> +static const struct vm_operations_struct kvm_vm_vm_ops = {
> +	.fault = kvm_vm_fault,
> +};
> +
> +static int kvm_vm_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	vma->vm_ops = &kvm_vm_vm_ops;
> +	return 0;
> +}
> +
>  static struct file_operations kvm_vm_fops = {
>  	.release        = kvm_vm_release,
>  	.unlocked_ioctl = kvm_vm_ioctl,
> +	.mmap           = kvm_vm_mmap,
>  	.llseek		= noop_llseek,
>  	KVM_COMPAT(kvm_vm_compat_ioctl),
>  };


--
Cheers,
Christophe de Dinechin (IRC c3d)
Peter Xu Dec. 11, 2019, 8:59 p.m. UTC | #34
On Wed, Dec 11, 2019 at 07:53:48AM -0500, Michael S. Tsirkin wrote:
> On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> > This patch is heavily based on previous work from Lei Cao
> > <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
> > 
> > KVM currently uses large bitmaps to track dirty memory.  These bitmaps
> > are copied to userspace when userspace queries KVM for its dirty page
> > information.  The use of bitmaps is mostly sufficient for live
> > migration, as large parts of memory are be dirtied from one log-dirty
> > pass to another.  However, in a checkpointing system, the number of
> > dirty pages is small and in fact it is often bounded---the VM is
> > paused when it has dirtied a pre-defined number of pages. Traversing a
> > large, sparsely populated bitmap to find set bits is time-consuming,
> > as is copying the bitmap to user-space.
> > 
> > A similar issue will be there for live migration when the guest memory
> > is huge while the page dirty procedure is trivial.  In that case for
> > each dirty sync we need to pull the whole dirty bitmap to userspace
> > and analyse every bit even if it's mostly zeros.
> > 
> > The preferred data structure for above scenarios is a dense list of
> > guest frame numbers (GFN).  This patch series stores the dirty list in
> > kernel memory that can be memory mapped into userspace to allow speedy
> > harvesting.
> > 
> > We defined two new data structures:
> > 
> >   struct kvm_dirty_ring;
> >   struct kvm_dirty_ring_indexes;
> > 
> > Firstly, kvm_dirty_ring is defined to represent a ring of dirty
> > pages.  When dirty tracking is enabled, we can push dirty gfn onto the
> > ring.
> > 
> > Secondly, kvm_dirty_ring_indexes is defined to represent the
> > user/kernel interface of each ring.  Currently it contains two
> > indexes: (1) avail_index represents where we should push our next
> > PFN (written by kernel), while (2) fetch_index represents where the
> > userspace should fetch the next dirty PFN (written by userspace).
> > 
> > One complete ring is composed by one kvm_dirty_ring plus its
> > corresponding kvm_dirty_ring_indexes.
> > 
> > Currently, we have N+1 rings for each VM of N vcpus:
> > 
> >   - for each vcpu, we have 1 per-vcpu dirty ring,
> >   - for each vm, we have 1 per-vm dirty ring
> > 
> > Please refer to the documentation update in this patch for more
> > details.
> > 
> > Note that this patch implements the core logic of dirty ring buffer.
> > It's still disabled for all archs for now.  Also, we'll address some
> > of the other issues in follow up patches before it's firstly enabled
> > on x86.
> > 
> > [1] https://patchwork.kernel.org/patch/10471409/
> > 
> > Signed-off-by: Lei Cao <lei.cao@stratus.com>
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> 
> Thanks, that's interesting.

Hi, Michael,

Thanks for reading the series.

> 
> > ---
> >  Documentation/virt/kvm/api.txt | 109 +++++++++++++++
> >  arch/x86/kvm/Makefile          |   3 +-
> >  include/linux/kvm_dirty_ring.h |  67 +++++++++
> >  include/linux/kvm_host.h       |  33 +++++
> >  include/linux/kvm_types.h      |   1 +
> >  include/uapi/linux/kvm.h       |  36 +++++
> >  virt/kvm/dirty_ring.c          | 156 +++++++++++++++++++++
> >  virt/kvm/kvm_main.c            | 240 ++++++++++++++++++++++++++++++++-
> >  8 files changed, 642 insertions(+), 3 deletions(-)
> >  create mode 100644 include/linux/kvm_dirty_ring.h
> >  create mode 100644 virt/kvm/dirty_ring.c
> > 
> > diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> > index 49183add44e7..fa622c9a2eb8 100644
> > --- a/Documentation/virt/kvm/api.txt
> > +++ b/Documentation/virt/kvm/api.txt
> > @@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
> >  It is thus encouraged to use the vm ioctl to query for capabilities (available
> >  with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
> >  
> > +
> >  4.5 KVM_GET_VCPU_MMAP_SIZE
> >  
> >  Capability: basic
> > @@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
> >  memory region.  This ioctl returns the size of that region.  See the
> >  KVM_RUN documentation for details.
> >  
> > +Besides the size of the KVM_RUN communication region, other areas of
> > +the VCPU file descriptor can be mmap-ed, including:
> > +
> > +- if KVM_CAP_COALESCED_MMIO is available, a page at
> > +  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
> > +  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
> > +  KVM_CAP_COALESCED_MMIO is not documented yet.
> > +
> > +- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> > +  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
> > +  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
> > +
> >  
> >  4.6 KVM_SET_MEMORY_REGION
> >  
> 
> PAGE_SIZE being which value? It's not always trivial for
> userspace to know what's the PAGE_SIZE for the kernel ...

I thought it can be easily fetched from getpagesize() or
sysconf(_SC_PAGESIZE)?  Especially considering that the document is for
kvm userspace, I'd say it should be common for a hypervisor process to
need this in tons of other places anyway.. no?
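
For completeness, the userspace side I have in mind is as trivial as the
sketch below (vcpu_fd and ring_bytes are placeholders here, ring_bytes
being whatever was passed to KVM_ENABLE_CAP; needs <sys/mman.h>,
<unistd.h> and <linux/kvm.h>):

	static struct kvm_dirty_gfn *map_vcpu_ring(int vcpu_fd, size_t ring_bytes)
	{
		long page_size = sysconf(_SC_PAGESIZE);
		void *p = mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE,
			       MAP_SHARED, vcpu_fd,
			       (off_t)KVM_DIRTY_LOG_PAGE_OFFSET * page_size);

		return p == MAP_FAILED ? NULL : p;
	}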

> 
> 
> > @@ -5358,6 +5371,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
> >  AArch64, this value will be reported in the ISS field of ESR_ELx.
> >  
> >  See KVM_CAP_VCPU_EVENTS for more details.
> > +
> >  8.20 KVM_CAP_HYPERV_SEND_IPI
> >  
> >  Architectures: x86
> > @@ -5365,6 +5379,7 @@ Architectures: x86
> >  This capability indicates that KVM supports paravirtualized Hyper-V IPI send
> >  hypercalls:
> >  HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> > +
> >  8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
> >  
> >  Architecture: x86
> > @@ -5378,3 +5393,97 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
> >  flush hypercalls by Hyper-V) so userspace should disable KVM identification
> >  in CPUID and only exposes Hyper-V identification. In this case, guest
> >  thinks it's running on Hyper-V and only use Hyper-V hypercalls.
> > +
> > +8.22 KVM_CAP_DIRTY_LOG_RING
> > +
> > +Architectures: x86
> > +Parameters: args[0] - size of the dirty log ring
> > +
> > +KVM is capable of tracking dirty memory using ring buffers that are
> > +mmaped into userspace; there is one dirty ring per vcpu and one global
> > +ring per vm.
> > +
> > +One dirty ring has the following two major structures:
> > +
> > +struct kvm_dirty_ring {
> > +	u16 dirty_index;
> > +	u16 reset_index;
> > +	u32 size;
> > +	u32 soft_limit;
> > +	spinlock_t lock;
> > +	struct kvm_dirty_gfn *dirty_gfns;
> > +};
> > +
> > +struct kvm_dirty_ring_indexes {
> > +	__u32 avail_index; /* set by kernel */
> > +	__u32 fetch_index; /* set by userspace */
> 
> Sticking these next to each other seems to guarantee cache conflicts.
> 
> Avail/Fetch seems to mimic Virtio's avail/used exactly.  I am not saying
> you must reuse the code really, but I think you should take a hard look
> at e.g. the virtio packed ring structure. We spent a bunch of time
> optimizing it for cache utilization. It seems kernel is the driver,
> making entries available, and userspace the device, using them.
> Again let's not develop a thread about this, but I think
> this is something to consider and discuss in future versions
> of the patches.

I think I completely understand your concern.  We should avoid wasting
time reinventing what is already there.  I'm just afraid that it'll take
even more time to use virtio for this use case while in the end we don't
really get much benefit out of it (e.g. most of the virtio features would
not be used).

Yeah, let's not develop a thread for this topic - I will read more on
virtio before my next post to see whether there's any chance we can
share anything with the virtio ring.

> 
> 
> > +};
> > +
> > +While for each of the dirty entry it's defined as:
> > +
> > +struct kvm_dirty_gfn {
> 
> What does GFN stand for?

It's the guest frame number, IIUC.  I'm not the one who named it, but
that's my understanding.

> 
> > +        __u32 pad;
> > +        __u32 slot; /* as_id | slot_id */
> > +        __u64 offset;
> > +};
> 
> offset of what? a 4K page right? Seems like a waste e.g. for
> hugetlbfs... How about replacing pad with size instead?

As Paolo explained, it's the page frame number of the guest.  IIUC even
for hugetlbfs we track dirty bits at 4K granularity.

> 
> > +
> > +The fields in kvm_dirty_ring will be only internal to KVM itself,
> > +while the fields in kvm_dirty_ring_indexes will be exposed to
> > +userspace to be either read or written.
> 
> I'm not sure what you are trying to say here. kvm_dirty_gfn
> seems to be part of UAPI.

It was talking about kvm_dirty_ring, which is KVM-internal and not
exposed to the UAPI, while kvm_dirty_gfn is exposed to userspace.

> 
> > +
> > +The two indices in the ring buffer are free running counters.
> > +
> > +In pseudocode, processing the ring buffer looks like this:
> > +
> > +	idx = load-acquire(&ring->fetch_index);
> > +	while (idx != ring->avail_index) {
> > +		struct kvm_dirty_gfn *entry;
> > +		entry = &ring->dirty_gfns[idx & (size - 1)];
> > +		...
> > +
> > +		idx++;
> > +	}
> > +	ring->fetch_index = idx;
> > +
> > +Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
> > +to enable this capability for the new guest and set the size of the
> > +rings.  It is only allowed before creating any vCPU, and the size of
> > +the ring must be a power of two.
> 
> All these seem like arbitrary limitations to me.

The dependency on vcpu creation is partly because we need to create the
per-vcpu rings, so it's simpler not to allow the size to change after
that.

> 
> Sizing the ring correctly might prove to be a challenge.
> 
> Thus I think there's value in resizing the rings
> without destroying VCPU.

Do you have an example of when we could use this feature?  My wild
guess is that even if we try hard to allow resizing (assuming that
won't bring more bugs, which I highly doubt...), people may not use it
at all.

The major scenario here is that kvm userspace will be collecting the
dirty bits quickly, so the ring should not really get full easily.
Then the ring size does not really matter much either, as long as it
is bigger than some threshold that avoids vmexits due to a full ring.

How about we start with the simple approach of not allowing it to change?
We can add resizing when the requirement comes.

> 
> Also, power of two just saves a branch here and there,
> but wastes lots of memory. Just wrap the index around to
> 0 and then users can select any size?

Same as above - postpone until we need it?

> 
> 
> 
> >  The larger the ring buffer, the less
> > +likely the ring is full and the VM is forced to exit to userspace. The
> > +optimal size depends on the workload, but it is recommended that it be
> > +at least 64 KiB (4096 entries).
> 
> OTOH larger buffers put lots of pressure on the system cache.
> 
> > +
> > +After the capability is enabled, userspace can mmap the global ring
> > +buffer (kvm_dirty_gfn[], offset KVM_DIRTY_LOG_PAGE_OFFSET) and the
> > +indexes (kvm_dirty_ring_indexes, offset 0) from the VM file
> > +descriptor.  The per-vcpu dirty ring instead is mmapped when the vcpu
> > +is created, similar to the kvm_run struct (kvm_dirty_ring_indexes
> > +locates inside kvm_run, while kvm_dirty_gfn[] at offset
> > +KVM_DIRTY_LOG_PAGE_OFFSET).
> > +
> > +Just like for dirty page bitmaps, the buffer tracks writes to
> > +all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
> > +set in KVM_SET_USER_MEMORY_REGION.  Once a memory region is registered
> > +with the flag set, userspace can start harvesting dirty pages from the
> > +ring buffer.
> > +
> > +To harvest the dirty pages, userspace accesses the mmaped ring buffer
> > +to read the dirty GFNs up to avail_index, and sets the fetch_index
> > +accordingly.  This can be done when the guest is running or paused,
> > +and dirty pages need not be collected all at once.  After processing
> > +one or more entries in the ring buffer, userspace calls the VM ioctl
> > +KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
> > +fetch_index and to mark those pages clean.  Therefore, the ioctl
> > +must be called *before* reading the content of the dirty pages.
> > +
> > +However, there is a major difference comparing to the
> > +KVM_GET_DIRTY_LOG interface in that when reading the dirty ring from
> > +userspace it's still possible that the kernel has not yet flushed the
> > +hardware dirty buffers into the kernel buffer.  To achieve that, one
> > +needs to kick the vcpu out for a hardware buffer flush (vmexit).
> > +
> > +If one of the ring buffers is full, the guest will exit to userspace
> > +with the exit reason set to KVM_EXIT_DIRTY_LOG_FULL, and the
> > +KVM_RUN ioctl will return -EINTR. Once that happens, userspace
> > +should pause all the vcpus, then harvest all the dirty pages and
> > +rearm the dirty traps. It can unpause the guest after that.
> 
> This last item means that the performance impact of the feature is
> really hard to predict. Can improve some workloads drastically. Or can
> slow some down.
> 
> 
> One solution could be to actually allow using this together with the
> existing bitmap. Userspace can then decide whether it wants to block
> VCPU on ring full, or just record ring full condition and recover by
> bitmap scanning.

That's true, but again allowing mixed use of the two might bring
extra complexity as well (especially after adding
KVM_CLEAR_DIRTY_LOG).

My understanding of this is that normally we do only want either one
of them depending on the major workload and the configuration of the
guest.  It's not trivial to try to provide a one-for-all solution.  So
again I would hope we can start from easy, then we extend when we have
better ideas on how to leverage the two interfaces when the ideas
really come, and then we can justify whether it's worth it to work on
that complexity.

> 
> 
> > diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> > index b19ef421084d..0acee817adfb 100644
> > --- a/arch/x86/kvm/Makefile
> > +++ b/arch/x86/kvm/Makefile
> > @@ -5,7 +5,8 @@ ccflags-y += -Iarch/x86/kvm
> >  KVM := ../../../virt/kvm
> >  
> >  kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> > -				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> > +				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
> > +				$(KVM)/dirty_ring.o
> >  kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
> >  
> >  kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
> > diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> > new file mode 100644
> > index 000000000000..8335635b7ff7
> > --- /dev/null
> > +++ b/include/linux/kvm_dirty_ring.h
> > @@ -0,0 +1,67 @@
> > +#ifndef KVM_DIRTY_RING_H
> > +#define KVM_DIRTY_RING_H
> > +
> > +/*
> > + * struct kvm_dirty_ring is defined in include/uapi/linux/kvm.h.
> > + *
> > + * dirty_ring:  shared with userspace via mmap. It is the compact list
> > + *              that holds the dirty pages.
> > + * dirty_index: free running counter that points to the next slot in
> > + *              dirty_ring->dirty_gfns  where a new dirty page should go.
> > + * reset_index: free running counter that points to the next dirty page
> > + *              in dirty_ring->dirty_gfns for which dirty trap needs to
> > + *              be reenabled
> > + * size:        size of the compact list, dirty_ring->dirty_gfns
> > + * soft_limit:  when the number of dirty pages in the list reaches this
> > + *              limit, vcpu that owns this ring should exit to userspace
> > + *              to allow userspace to harvest all the dirty pages
> > + * lock:        protects dirty_ring, only in use if this is the global
> > + *              ring
> > + *
> > + * The number of dirty pages in the ring is calculated by,
> > + * dirty_index - reset_index
> > + *
> > + * kernel increments dirty_ring->indices.avail_index after dirty index
> > + * is incremented. When userspace harvests the dirty pages, it increments
> > + * dirty_ring->indices.fetch_index up to dirty_ring->indices.avail_index.
> > + * When kernel reenables dirty traps for the dirty pages, it increments
> > + * reset_index up to dirty_ring->indices.fetch_index.
> > + *
> > + */
> > +struct kvm_dirty_ring {
> > +	u32 dirty_index;
> > +	u32 reset_index;
> > +	u32 size;
> > +	u32 soft_limit;
> > +	spinlock_t lock;
> > +	struct kvm_dirty_gfn *dirty_gfns;
> > +};
> > +
> > +u32 kvm_dirty_ring_get_rsvd_entries(void);
> > +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring);
> > +
> > +/*
> > + * called with kvm->slots_lock held, returns the number of
> > + * processed pages.
> > + */
> > +int kvm_dirty_ring_reset(struct kvm *kvm,
> > +			 struct kvm_dirty_ring *ring,
> > +			 struct kvm_dirty_ring_indexes *indexes);
> > +
> > +/*
> > + * returns 0: successfully pushed
> > + *         1: successfully pushed, soft limit reached,
> > + *            vcpu should exit to userspace
> > + *         -EBUSY: unable to push, dirty ring full.
> > + */
> > +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> > +			struct kvm_dirty_ring_indexes *indexes,
> > +			u32 slot, u64 offset, bool lock);
> > +
> > +/* for use in vm_operations_struct */
> > +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i);
> > +
> > +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
> > +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring);
> > +
> > +#endif
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 498a39462ac1..7b747bc9ff3e 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -34,6 +34,7 @@
> >  #include <linux/kvm_types.h>
> >  
> >  #include <asm/kvm_host.h>
> > +#include <linux/kvm_dirty_ring.h>
> >  
> >  #ifndef KVM_MAX_VCPU_ID
> >  #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
> > @@ -146,6 +147,7 @@ static inline bool is_error_page(struct page *page)
> >  #define KVM_REQ_MMU_RELOAD        (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> >  #define KVM_REQ_PENDING_TIMER     2
> >  #define KVM_REQ_UNHALT            3
> > +#define KVM_REQ_DIRTY_RING_FULL   4
> >  #define KVM_REQUEST_ARCH_BASE     8
> >  
> >  #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
> > @@ -321,6 +323,7 @@ struct kvm_vcpu {
> >  	bool ready;
> >  	struct kvm_vcpu_arch arch;
> >  	struct dentry *debugfs_dentry;
> > +	struct kvm_dirty_ring dirty_ring;
> >  };
> >  
> >  static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> > @@ -501,6 +504,10 @@ struct kvm {
> >  	struct srcu_struct srcu;
> >  	struct srcu_struct irq_srcu;
> >  	pid_t userspace_pid;
> > +	/* Data structure to be exported by mmap(kvm->fd, 0) */
> > +	struct kvm_vm_run *vm_run;
> > +	u32 dirty_ring_size;
> > +	struct kvm_dirty_ring vm_dirty_ring;
> >  };
> >  
> >  #define kvm_err(fmt, ...) \
> > @@ -832,6 +839,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> >  					gfn_t gfn_offset,
> >  					unsigned long mask);
> >  
> > +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
> > +
> >  int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
> >  				struct kvm_dirty_log *log);
> >  int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> > @@ -1411,4 +1420,28 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
> >  				uintptr_t data, const char *name,
> >  				struct task_struct **thread_ptr);
> >  
> > +/*
> > + * This defines how many reserved entries we want to keep before we
> > + * kick the vcpu to the userspace to avoid dirty ring full.  This
> > + * value can be tuned to higher if e.g. PML is enabled on the host.
> > + */
> > +#define  KVM_DIRTY_RING_RSVD_ENTRIES  64
> > +
> > +/* Max number of entries allowed for each kvm dirty ring */
> > +#define  KVM_DIRTY_RING_MAX_ENTRIES  65536
> > +
> > +/*
> > + * Arch needs to define these macro after implementing the dirty ring
> > + * feature.  KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
> > + * starting page offset of the dirty ring structures,
> 
> Confused. Offset where? You set a default for everyone - where does arch
> want to override it?

If arch defines KVM_DIRTY_LOG_PAGE_OFFSET then below will be a no-op,
please see [1] on #ifndef.
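
To make the override concrete, an arch that implements the feature
would carry something like the lines below in its headers (the values
here are purely illustrative; the real x86 numbers come with the later
enabling patch):

	#define KVM_DIRTY_LOG_PAGE_OFFSET	64	/* page offset of dirty_gfns[] in mmap */
	#define KVM_DIRTY_RING_VERSION		1	/* returned by KVM_CAP_DIRTY_LOG_RING */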

> 
> > while
> > + * KVM_DIRTY_RING_VERSION should be defined as >=1.  By default, this
> > + * feature is off on all archs.
> > + */
> > +#ifndef KVM_DIRTY_LOG_PAGE_OFFSET

[1]

> > +#define KVM_DIRTY_LOG_PAGE_OFFSET 0
> > +#endif
> > +#ifndef KVM_DIRTY_RING_VERSION
> > +#define KVM_DIRTY_RING_VERSION 0
> > +#endif
> 
> One way versioning, with no bits and negotiation
> will make it hard to change down the road.
> what's wrong with existing KVM capabilities that
> you feel there's a need for dedicated versioning for this?

Frankly speaking I don't even think it'll change in the near
future.. :)

Yeah, kvm versioning could work too.  Here we can also return a zero
just like most of the caps (or KVM_DIRTY_LOG_PAGE_OFFSET as in the
original patchset, but that doesn't really help either because it's
defined in uapi), but I just don't see how it helps...  So I returned
a version number just in case we'd like to change the layout some day
and when we don't want to bother introducing another cap bit for the
same feature (like KVM_CAP_MANUAL_DIRTY_LOG_PROTECT and
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2).

> 
> > +
> >  #endif
> > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > index 1c88e69db3d9..d9d03eea145a 100644
> > --- a/include/linux/kvm_types.h
> > +++ b/include/linux/kvm_types.h
> > @@ -11,6 +11,7 @@ struct kvm_irq_routing_table;
> >  struct kvm_memory_slot;
> >  struct kvm_one_reg;
> >  struct kvm_run;
> > +struct kvm_vm_run;
> >  struct kvm_userspace_memory_region;
> >  struct kvm_vcpu;
> >  struct kvm_vcpu_init;
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index e6f17c8e2dba..0b88d76d6215 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
> >  #define KVM_EXIT_IOAPIC_EOI       26
> >  #define KVM_EXIT_HYPERV           27
> >  #define KVM_EXIT_ARM_NISV         28
> > +#define KVM_EXIT_DIRTY_RING_FULL  29
> >  
> >  /* For KVM_EXIT_INTERNAL_ERROR */
> >  /* Emulate instruction failed. */
> > @@ -247,6 +248,11 @@ struct kvm_hyperv_exit {
> >  /* Encounter unexpected vm-exit reason */
> >  #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON	4
> >  
> > +struct kvm_dirty_ring_indexes {
> > +	__u32 avail_index; /* set by kernel */
> > +	__u32 fetch_index; /* set by userspace */
> > +};
> > +
> >  /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
> >  struct kvm_run {
> >  	/* in */
> > @@ -421,6 +427,13 @@ struct kvm_run {
> >  		struct kvm_sync_regs regs;
> >  		char padding[SYNC_REGS_SIZE_BYTES];
> >  	} s;
> > +
> > +	struct kvm_dirty_ring_indexes vcpu_ring_indexes;
> > +};
> > +
> > +/* Returned by mmap(kvm->fd, offset=0) */
> > +struct kvm_vm_run {
> > +	struct kvm_dirty_ring_indexes vm_ring_indexes;
> >  };
> >  
> >  /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> > @@ -1009,6 +1022,7 @@ struct kvm_ppc_resize_hpt {
> >  #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
> >  #define KVM_CAP_ARM_NISV_TO_USER 177
> >  #define KVM_CAP_ARM_INJECT_EXT_DABT 178
> > +#define KVM_CAP_DIRTY_LOG_RING 179
> >  
> >  #ifdef KVM_CAP_IRQ_ROUTING
> >  
> > @@ -1472,6 +1486,9 @@ struct kvm_enc_region {
> >  /* Available with KVM_CAP_ARM_SVE */
> >  #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
> >  
> > +/* Available with KVM_CAP_DIRTY_LOG_RING */
> > +#define KVM_RESET_DIRTY_RINGS     _IO(KVMIO, 0xc3)
> > +
> >  /* Secure Encrypted Virtualization command */
> >  enum sev_cmd_id {
> >  	/* Guest initialization commands */
> > @@ -1622,4 +1639,23 @@ struct kvm_hyperv_eventfd {
> >  #define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
> >  #define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
> >  
> > +/*
> > + * The following are the requirements for supporting dirty log ring
> > + * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
> > + *
> > + * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
> > + *    of kvm_write_* so that the global dirty ring is not filled up
> > + *    too quickly.
> > + * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
> > + *    enabling dirty logging.
> > + * 3. There should not be a separate step to synchronize hardware
> > + *    dirty bitmap with KVM's.
> > + */
> > +
> > +struct kvm_dirty_gfn {
> > +	__u32 pad;
> > +	__u32 slot;
> > +	__u64 offset;
> > +};
> > +
> >  #endif /* __LINUX_KVM_H */
> > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > new file mode 100644
> > index 000000000000..9264891f3c32
> > --- /dev/null
> > +++ b/virt/kvm/dirty_ring.c
> > @@ -0,0 +1,156 @@
> > +#include <linux/kvm_host.h>
> > +#include <linux/kvm.h>
> > +#include <linux/vmalloc.h>
> > +#include <linux/kvm_dirty_ring.h>
> > +
> > +u32 kvm_dirty_ring_get_rsvd_entries(void)
> > +{
> > +	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> > +}
> > +
> > +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > +{
> > +	u32 size = kvm->dirty_ring_size;
> > +
> > +	ring->dirty_gfns = vmalloc(size);
> 
> So 1/2 a megabyte of kernel memory per VM that userspace locks up.
> Do we really have to though? Why not get a userspace pointer,
> write it with copy to user, and sidestep all this?

I'd say it won't be a big issue on locking 1/2M of host mem for a
vm...

Also note that if dirty ring is enabled, I plan to evaporate the
dirty_bitmap in the next post. The old kvm->dirty_bitmap takes
$GUEST_MEM/32K*2 mem.  E.g., for 64G guest it's 64G/32K*2=4M.  If with
dirty ring of 8 vcpus, that could be 64K*8=0.5M, which could be even
less memory used.
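
Spelling the arithmetic out for that 64G, 8-vcpu example (4k pages,
64K rings):

	dirty_bitmap: 64G / 4K = 16M pages -> 16M bits = 2M bytes,
	              x2 (KVM keeps a second buffer) = 4M
	dirty rings:  4096 entries x 16 bytes = 64K per vcpu, x8 vcpus = 0.5M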

> 
> > +	if (!ring->dirty_gfns)
> > +		return -ENOMEM;
> > +	memset(ring->dirty_gfns, 0, size);
> > +
> > +	ring->size = size / sizeof(struct kvm_dirty_gfn);
> > +	ring->soft_limit =
> > +	    (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -
> > +	    kvm_dirty_ring_get_rsvd_entries();
> > +	ring->dirty_index = 0;
> > +	ring->reset_index = 0;
> > +	spin_lock_init(&ring->lock);
> > +
> > +	return 0;
> > +}
> > +
> > +int kvm_dirty_ring_reset(struct kvm *kvm,
> > +			 struct kvm_dirty_ring *ring,
> > +			 struct kvm_dirty_ring_indexes *indexes)
> > +{
> > +	u32 cur_slot, next_slot;
> > +	u64 cur_offset, next_offset;
> > +	unsigned long mask;
> > +	u32 fetch;
> > +	int count = 0;
> > +	struct kvm_dirty_gfn *entry;
> > +
> > +	fetch = READ_ONCE(indexes->fetch_index);
> > +	if (fetch == ring->reset_index)
> > +		return 0;
> > +
> > +	entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > +	/*
> > +	 * The ring buffer is shared with userspace, which might mmap
> > +	 * it and concurrently modify slot and offset.  Userspace must
> > +	 * not be trusted!  READ_ONCE prevents the compiler from changing
> > +	 * the values after they've been range-checked (the checks are
> > +	 * in kvm_reset_dirty_gfn).
> 
> What it doesn't is prevent speculative attacks.  That's why things like
> copy from user have a speculation barrier.  Instead of worrying about
> that, unless it's really critical, I think you'd do well do just use
> copy to/from user.

IMHO I would really hope these data be there without swapped out of
memory, just like what we did with kvm->dirty_bitmap... it's on the
hot path of mmu page fault, even we could be with mmu lock held if
copy_to_user() page faulted.  But indeed I've no experience on
avoiding speculative attacks, suggestions would be greatly welcomed on
that.  In our case we do (index & (size - 1)), so is it still
suffering from speculative attacks?
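
Not claiming this is the full answer, but if the buffer stays
kernel-resident, I suppose the usual pattern is to clamp the
range-checked values with array_index_nospec() right after the checks,
e.g. in kvm_reset_dirty_gfn() (sketch only):

	#include <linux/nospec.h>

	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
		return;
	/* clamp so a mis-speculated as_id/id cannot index out of bounds */
	as_id = array_index_nospec(as_id, KVM_ADDRESS_SPACE_NUM);
	id = array_index_nospec(id, KVM_USER_MEM_SLOTS);

	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
	if (offset >= memslot->npages)
		return;
	offset = array_index_nospec(offset, memslot->npages);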

> 
> > +	 */
> > +	smp_read_barrier_depends();
> 
> What depends on what here? Looks suspicious ...

Hmm, I think maybe it can be removed because the entry pointer
reference below should be an ordering constraint already?
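
To spell out the reasoning (assuming a post-4.15 baseline, where
READ_ONCE() already folds in the Alpha dependency barrier):

	entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
	/*
	 * The loads below carry an address dependency on 'entry', and
	 * READ_ONCE() already implies smp_read_barrier_depends() on the
	 * one architecture (Alpha) that needs it, so the explicit
	 * barrier should not add anything here.
	 */
	cur_slot = READ_ONCE(entry->slot);
	cur_offset = READ_ONCE(entry->offset);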

> 
> > +	cur_slot = READ_ONCE(entry->slot);
> > +	cur_offset = READ_ONCE(entry->offset);
> > +	mask = 1;
> > +	count++;
> > +	ring->reset_index++;
> > +	while (ring->reset_index != fetch) {
> > +		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > +		smp_read_barrier_depends();
> 
> same concerns here
> 
> > +		next_slot = READ_ONCE(entry->slot);
> > +		next_offset = READ_ONCE(entry->offset);
> > +		ring->reset_index++;
> > +		count++;
> > +		/*
> > +		 * Try to coalesce the reset operations when the guest is
> > +		 * scanning pages in the same slot.
> 
> what does guest scanning mean?

My wild guess is that it means the guest is accessing the pages
contiguously, so the dirty gfns are contiguous too.  Anyway I agree
it's not clear; I can try to rephrase it.
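
A worked example of what the loop coalesces (made-up offsets, all in
the same slot):

	pushed offsets: 100, 101, 103, 99
	  100 -> cur_offset=100, mask=0b1
	  101 -> delta=+1,  mask=0b11
	  103 -> delta=+3,  mask=0b1011
	   99 -> delta=-1 (backwards), cur_offset=99, mask=0b10111
	result: one kvm_reset_dirty_gfn(slot, 99, 0b10111) call covering
	offsets 99, 100, 101 and 103, instead of four separate calls.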

> 
> > +		 */
> > +		if (next_slot == cur_slot) {
> > +			int delta = next_offset - cur_offset;
> > +
> > +			if (delta >= 0 && delta < BITS_PER_LONG) {
> > +				mask |= 1ull << delta;
> > +				continue;
> > +			}
> > +
> > +			/* Backwards visit, careful about overflows!  */
> > +			if (delta > -BITS_PER_LONG && delta < 0 &&
> > +			    (mask << -delta >> -delta) == mask) {
> > +				cur_offset = next_offset;
> > +				mask = (mask << -delta) | 1;
> > +				continue;
> > +			}
> > +		}
> > +		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > +		cur_slot = next_slot;
> > +		cur_offset = next_offset;
> > +		mask = 1;
> > +	}
> > +	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > +
> > +	return count;
> > +}
> > +
> > +static inline u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
> > +{
> > +	return ring->dirty_index - ring->reset_index;
> > +}
> > +
> > +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
> > +{
> > +	return kvm_dirty_ring_used(ring) >= ring->size;
> > +}
> > +
> > +/*
> > + * Returns:
> > + *   >0 if we should kick the vcpu out,
> > + *   =0 if the gfn pushed successfully, or,
> > + *   <0 if error (e.g. ring full)
> > + */
> > +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> > +			struct kvm_dirty_ring_indexes *indexes,
> > +			u32 slot, u64 offset, bool lock)
> > +{
> > +	int ret;
> > +	struct kvm_dirty_gfn *entry;
> > +
> > +	if (lock)
> > +		spin_lock(&ring->lock);
> 
> what's the story around locking here? Why is it safe
> not to take the lock sometimes?

kvm_dirty_ring_push() is called with lock==true only when the per-vm
ring is used.  Pushes to a per-vcpu ring only ever happen from that
vcpu's own context, so no lock is needed there and kvm_dirty_ring_push()
is called with lock==false.
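
So the two call patterns end up looking roughly like this (condensed
from mark_page_dirty_in_ring() below):

	if (vcpu) {
		/* vcpu context: only this vcpu pushes here, no lock */
		kvm_dirty_ring_push(&vcpu->dirty_ring,
				    &vcpu->run->vcpu_ring_indexes,
				    slot, offset, false);
	} else {
		/* no vcpu context: anyone can push to the vm-wide ring,
		 * so the push takes ring->lock internally */
		kvm_dirty_ring_push(&kvm->vm_dirty_ring,
				    &kvm->vm_run->vm_ring_indexes,
				    slot, offset, true);
	}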

> 
> > +
> > +	if (kvm_dirty_ring_full(ring)) {
> > +		ret = -EBUSY;
> > +		goto out;
> > +	}
> > +
> > +	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> > +	entry->slot = slot;
> > +	entry->offset = offset;
> > +	smp_wmb();
> > +	ring->dirty_index++;
> > +	WRITE_ONCE(indexes->avail_index, ring->dirty_index);
> > +	ret = kvm_dirty_ring_used(ring) >= ring->soft_limit;
> > +	pr_info("%s: slot %u offset %llu used %u\n",
> > +		__func__, slot, offset, kvm_dirty_ring_used(ring));
> > +
> > +out:
> > +	if (lock)
> > +		spin_unlock(&ring->lock);
> > +
> > +	return ret;
> > +}
> > +
> > +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i)
> > +{
> > +	return vmalloc_to_page((void *)ring->dirty_gfns + i * PAGE_SIZE);
> > +}
> > +
> > +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
> > +{
> > +	if (ring->dirty_gfns) {
> > +		vfree(ring->dirty_gfns);
> > +		ring->dirty_gfns = NULL;
> > +	}
> > +}
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 681452d288cd..8642c977629b 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -64,6 +64,8 @@
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/kvm.h>
> >  
> > +#include <linux/kvm_dirty_ring.h>
> > +
> >  /* Worst case buffer size needed for holding an integer. */
> >  #define ITOA_MAX_LEN 12
> >  
> > @@ -149,6 +151,10 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> >  				    struct kvm_vcpu *vcpu,
> >  				    struct kvm_memory_slot *memslot,
> >  				    gfn_t gfn);
> > +static void mark_page_dirty_in_ring(struct kvm *kvm,
> > +				    struct kvm_vcpu *vcpu,
> > +				    struct kvm_memory_slot *slot,
> > +				    gfn_t gfn);
> >  
> >  __visible bool kvm_rebooting;
> >  EXPORT_SYMBOL_GPL(kvm_rebooting);
> > @@ -359,11 +365,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
> >  	vcpu->preempted = false;
> >  	vcpu->ready = false;
> >  
> > +	if (kvm->dirty_ring_size) {
> > +		r = kvm_dirty_ring_alloc(vcpu->kvm, &vcpu->dirty_ring);
> > +		if (r) {
> > +			kvm->dirty_ring_size = 0;
> > +			goto fail_free_run;
> > +		}
> > +	}
> > +
> >  	r = kvm_arch_vcpu_init(vcpu);
> >  	if (r < 0)
> > -		goto fail_free_run;
> > +		goto fail_free_ring;
> >  	return 0;
> >  
> > +fail_free_ring:
> > +	if (kvm->dirty_ring_size)
> > +		kvm_dirty_ring_free(&vcpu->dirty_ring);
> >  fail_free_run:
> >  	free_page((unsigned long)vcpu->run);
> >  fail:
> > @@ -381,6 +398,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
> >  	put_pid(rcu_dereference_protected(vcpu->pid, 1));
> >  	kvm_arch_vcpu_uninit(vcpu);
> >  	free_page((unsigned long)vcpu->run);
> > +	if (vcpu->kvm->dirty_ring_size)
> > +		kvm_dirty_ring_free(&vcpu->dirty_ring);
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
> >  
> > @@ -690,6 +709,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> >  	struct kvm *kvm = kvm_arch_alloc_vm();
> >  	int r = -ENOMEM;
> >  	int i;
> > +	struct page *page;
> >  
> >  	if (!kvm)
> >  		return ERR_PTR(-ENOMEM);
> > @@ -705,6 +725,14 @@ static struct kvm *kvm_create_vm(unsigned long type)
> >  
> >  	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
> >  
> > +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> > +	if (!page) {
> > +		r = -ENOMEM;
> > +		goto out_err_alloc_page;
> > +	}
> > +	kvm->vm_run = page_address(page);
> 
> So 4K with just 8 bytes used. Not as bad as 1/2Mbyte for the ring but
> still. What is wrong with just a pointer and calling put_user?

I want to make it the starting point for sharing per-vm fields between
userspace and the kernel, just like kvm_run does per-vcpu.

IMHO it'll be awkward if we always introduce a new interface just to
take a pointer to a userspace buffer and cache it...  I'd say so far
I like the design of kvm_run and the like because it's efficient, easy
to use, and easy to extend.
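
To make the analogy concrete, a minimal userspace sketch against the
interface as documented above (vm_fd and ring_bytes are assumed to
already exist, KVM_DIRTY_LOG_PAGE_OFFSET is whatever the arch defines):

	long psize = sysconf(_SC_PAGESIZE);

	/* per-VM shared struct at offset 0, like kvm_run on the vcpu fd */
	struct kvm_vm_run *vm_run =
		mmap(NULL, psize, PROT_READ | PROT_WRITE, MAP_SHARED,
		     vm_fd, 0);

	/* per-VM dirty_gfns[] starting at KVM_DIRTY_LOG_PAGE_OFFSET pages */
	struct kvm_dirty_gfn *vm_gfns =
		mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
		     vm_fd, (off_t)KVM_DIRTY_LOG_PAGE_OFFSET * psize);

	/* harvest: walk fetch_index up to vm_run->vm_ring_indexes.avail_index,
	 * then call KVM_RESET_DIRTY_RINGS on vm_fd */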

> 
> > +	BUILD_BUG_ON(sizeof(struct kvm_vm_run) > PAGE_SIZE);
> > +
> >  	if (init_srcu_struct(&kvm->srcu))
> >  		goto out_err_no_srcu;
> >  	if (init_srcu_struct(&kvm->irq_srcu))
> > @@ -775,6 +803,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
> >  out_err_no_irq_srcu:
> >  	cleanup_srcu_struct(&kvm->srcu);
> >  out_err_no_srcu:
> > +	free_page((unsigned long)page);
> > +	kvm->vm_run = NULL;
> > +out_err_alloc_page:
> >  	kvm_arch_free_vm(kvm);
> >  	mmdrop(current->mm);
> >  	return ERR_PTR(r);
> > @@ -800,6 +831,15 @@ static void kvm_destroy_vm(struct kvm *kvm)
> >  	int i;
> >  	struct mm_struct *mm = kvm->mm;
> >  
> > +	if (kvm->dirty_ring_size) {
> > +		kvm_dirty_ring_free(&kvm->vm_dirty_ring);
> > +	}
> > +
> > +	if (kvm->vm_run) {
> > +		free_page((unsigned long)kvm->vm_run);
> > +		kvm->vm_run = NULL;
> > +	}
> > +
> >  	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
> >  	kvm_destroy_vm_debugfs(kvm);
> >  	kvm_arch_sync_events(kvm);
> > @@ -2301,7 +2341,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> >  {
> >  	if (memslot && memslot->dirty_bitmap) {
> >  		unsigned long rel_gfn = gfn - memslot->base_gfn;
> > -
> > +		mark_page_dirty_in_ring(kvm, vcpu, memslot, gfn);
> >  		set_bit_le(rel_gfn, memslot->dirty_bitmap);
> >  	}
> >  }
> > @@ -2649,6 +2689,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
> >  
> > +static bool kvm_fault_in_dirty_ring(struct kvm *kvm, struct vm_fault *vmf)
> > +{
> > +	return (vmf->pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
> > +	    (vmf->pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
> > +	     kvm->dirty_ring_size / PAGE_SIZE);
> > +}
> > +
> >  static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
> >  {
> >  	struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
> > @@ -2664,6 +2711,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
> >  	else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
> >  		page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
> >  #endif
> > +	else if (kvm_fault_in_dirty_ring(vcpu->kvm, vmf))
> > +		page = kvm_dirty_ring_get_page(
> > +		    &vcpu->dirty_ring,
> > +		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
> >  	else
> >  		return kvm_arch_vcpu_fault(vcpu, vmf);
> >  	get_page(page);
> > @@ -3259,12 +3310,162 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> >  #endif
> >  	case KVM_CAP_NR_MEMSLOTS:
> >  		return KVM_USER_MEM_SLOTS;
> > +	case KVM_CAP_DIRTY_LOG_RING:
> > +		/* Version will be zero if arch didn't implement it */
> > +		return KVM_DIRTY_RING_VERSION;
> >  	default:
> >  		break;
> >  	}
> >  	return kvm_vm_ioctl_check_extension(kvm, arg);
> >  }
> >  
> > +static void mark_page_dirty_in_ring(struct kvm *kvm,
> > +				    struct kvm_vcpu *vcpu,
> > +				    struct kvm_memory_slot *slot,
> > +				    gfn_t gfn)
> > +{
> > +	u32 as_id = 0;
> > +	u64 offset;
> > +	int ret;
> > +	struct kvm_dirty_ring *ring;
> > +	struct kvm_dirty_ring_indexes *indexes;
> > +	bool is_vm_ring;
> > +
> > +	if (!kvm->dirty_ring_size)
> > +		return;
> > +
> > +	offset = gfn - slot->base_gfn;
> > +
> > +	if (vcpu) {
> > +		as_id = kvm_arch_vcpu_memslots_id(vcpu);
> > +	} else {
> > +		as_id = 0;
> > +		vcpu = kvm_get_running_vcpu();
> > +	}
> > +
> > +	if (vcpu) {
> > +		ring = &vcpu->dirty_ring;
> > +		indexes = &vcpu->run->vcpu_ring_indexes;
> > +		is_vm_ring = false;
> > +	} else {
> > +		/*
> > +		 * Put onto per vm ring because no vcpu context.  Kick
> > +		 * vcpu0 if ring is full.
> 
> What about tasks on vcpu 0? Do guests realize it's a bad idea to put
> critical tasks there, they will be penalized disproportionally?

Reasonable question.  So far we can't avoid it, because a vcpu exit is
the event mechanism we have to say "hey, please collect dirty bits".
Maybe there is a better way than this, but I'll need to rethink all of
this...

> 
> > +		 */
> > +		vcpu = kvm->vcpus[0];
> > +		ring = &kvm->vm_dirty_ring;
> > +		indexes = &kvm->vm_run->vm_ring_indexes;
> > +		is_vm_ring = true;
> > +	}
> > +
> > +	ret = kvm_dirty_ring_push(ring, indexes,
> > +				  (as_id << 16)|slot->id, offset,
> > +				  is_vm_ring);
> > +	if (ret < 0) {
> > +		if (is_vm_ring)
> > +			pr_warn_once("vcpu %d dirty log overflow\n",
> > +				     vcpu->vcpu_id);
> > +		else
> > +			pr_warn_once("per-vm dirty log overflow\n");
> > +		return;
> > +	}
> > +
> > +	if (ret)
> > +		kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
> > +}
> > +
> > +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> > +{
> > +	struct kvm_memory_slot *memslot;
> > +	int as_id, id;
> > +
> > +	as_id = slot >> 16;
> > +	id = (u16)slot;
> > +	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> > +		return;
> > +
> > +	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
> > +	if (offset >= memslot->npages)
> > +		return;
> > +
> > +	spin_lock(&kvm->mmu_lock);
> > +	/* FIXME: we should use a single AND operation, but there is no
> > +	 * applicable atomic API.
> > +	 */
> > +	while (mask) {
> > +		clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
> > +		mask &= mask - 1;
> > +	}
> > +
> > +	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> > +	spin_unlock(&kvm->mmu_lock);
> > +}
> > +
> > +static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
> > +{
> > +	int r;
> > +
> > +	/* the size should be power of 2 */
> > +	if (!size || (size & (size - 1)))
> > +		return -EINVAL;
> > +
> > +	/* Should be bigger to keep the reserved entries, or a page */
> > +	if (size < kvm_dirty_ring_get_rsvd_entries() *
> > +	    sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
> > +		return -EINVAL;
> > +
> > +	if (size > KVM_DIRTY_RING_MAX_ENTRIES *
> > +	    sizeof(struct kvm_dirty_gfn))
> > +		return -E2BIG;
> 
> KVM_DIRTY_RING_MAX_ENTRIES is not part of UAPI.
> So how does userspace know what's legal?
> Do you expect it to just try?

Yep that's what I thought. :)

Please grep for E2BIG in the QEMU repo's target/i386/kvm.c...  it
won't be hard to do imho.
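
I.e. probe by retrying with smaller sizes until the kernel stops
returning -E2BIG; a sketch:

	static int enable_dirty_ring(int vm_fd, uint32_t bytes)
	{
		struct kvm_enable_cap cap = { .cap = KVM_CAP_DIRTY_LOG_RING };

		while (bytes >= 4096) {			/* one page as a floor */
			cap.args[0] = bytes;
			if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) == 0)
				return bytes;		/* accepted */
			if (errno != E2BIG)
				return -errno;		/* unrelated failure */
			bytes >>= 1;			/* stays a power of two */
		}
		return -E2BIG;
	}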

> More likely it will just copy the number from kernel and can
> never ever make it smaller.

Not sure, but for sure I can probably move KVM_DIRTY_RING_MAX_ENTRIES
to uapi too.

Thanks,
Michael S. Tsirkin Dec. 11, 2019, 10:57 p.m. UTC | #35
On Wed, Dec 11, 2019 at 03:59:52PM -0500, Peter Xu wrote:
> On Wed, Dec 11, 2019 at 07:53:48AM -0500, Michael S. Tsirkin wrote:
> > On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> > > This patch is heavily based on previous work from Lei Cao
> > > <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
> > > 
> > > KVM currently uses large bitmaps to track dirty memory.  These bitmaps
> > > are copied to userspace when userspace queries KVM for its dirty page
> > > information.  The use of bitmaps is mostly sufficient for live
> > > migration, as large parts of memory are be dirtied from one log-dirty
> > > pass to another.  However, in a checkpointing system, the number of
> > > dirty pages is small and in fact it is often bounded---the VM is
> > > paused when it has dirtied a pre-defined number of pages. Traversing a
> > > large, sparsely populated bitmap to find set bits is time-consuming,
> > > as is copying the bitmap to user-space.
> > > 
> > > A similar issue will be there for live migration when the guest memory
> > > is huge while the page dirty procedure is trivial.  In that case for
> > > each dirty sync we need to pull the whole dirty bitmap to userspace
> > > and analyse every bit even if it's mostly zeros.
> > > 
> > > The preferred data structure for above scenarios is a dense list of
> > > guest frame numbers (GFN).  This patch series stores the dirty list in
> > > kernel memory that can be memory mapped into userspace to allow speedy
> > > harvesting.
> > > 
> > > We defined two new data structures:
> > > 
> > >   struct kvm_dirty_ring;
> > >   struct kvm_dirty_ring_indexes;
> > > 
> > > Firstly, kvm_dirty_ring is defined to represent a ring of dirty
> > > pages.  When dirty tracking is enabled, we can push dirty gfn onto the
> > > ring.
> > > 
> > > Secondly, kvm_dirty_ring_indexes is defined to represent the
> > > user/kernel interface of each ring.  Currently it contains two
> > > indexes: (1) avail_index represents where we should push our next
> > > PFN (written by kernel), while (2) fetch_index represents where the
> > > userspace should fetch the next dirty PFN (written by userspace).
> > > 
> > > One complete ring is composed by one kvm_dirty_ring plus its
> > > corresponding kvm_dirty_ring_indexes.
> > > 
> > > Currently, we have N+1 rings for each VM of N vcpus:
> > > 
> > >   - for each vcpu, we have 1 per-vcpu dirty ring,
> > >   - for each vm, we have 1 per-vm dirty ring
> > > 
> > > Please refer to the documentation update in this patch for more
> > > details.
> > > 
> > > Note that this patch implements the core logic of dirty ring buffer.
> > > It's still disabled for all archs for now.  Also, we'll address some
> > > of the other issues in follow up patches before it's firstly enabled
> > > on x86.
> > > 
> > > [1] https://patchwork.kernel.org/patch/10471409/
> > > 
> > > Signed-off-by: Lei Cao <lei.cao@stratus.com>
> > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > 
> > Thanks, that's interesting.
> 
> Hi, Michael,
> 
> Thanks for reading the series.
> 
> > 
> > > ---
> > >  Documentation/virt/kvm/api.txt | 109 +++++++++++++++
> > >  arch/x86/kvm/Makefile          |   3 +-
> > >  include/linux/kvm_dirty_ring.h |  67 +++++++++
> > >  include/linux/kvm_host.h       |  33 +++++
> > >  include/linux/kvm_types.h      |   1 +
> > >  include/uapi/linux/kvm.h       |  36 +++++
> > >  virt/kvm/dirty_ring.c          | 156 +++++++++++++++++++++
> > >  virt/kvm/kvm_main.c            | 240 ++++++++++++++++++++++++++++++++-
> > >  8 files changed, 642 insertions(+), 3 deletions(-)
> > >  create mode 100644 include/linux/kvm_dirty_ring.h
> > >  create mode 100644 virt/kvm/dirty_ring.c
> > > 
> > > diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> > > index 49183add44e7..fa622c9a2eb8 100644
> > > --- a/Documentation/virt/kvm/api.txt
> > > +++ b/Documentation/virt/kvm/api.txt
> > > @@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
> > >  It is thus encouraged to use the vm ioctl to query for capabilities (available
> > >  with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
> > >  
> > > +
> > >  4.5 KVM_GET_VCPU_MMAP_SIZE
> > >  
> > >  Capability: basic
> > > @@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
> > >  memory region.  This ioctl returns the size of that region.  See the
> > >  KVM_RUN documentation for details.
> > >  
> > > +Besides the size of the KVM_RUN communication region, other areas of
> > > +the VCPU file descriptor can be mmap-ed, including:
> > > +
> > > +- if KVM_CAP_COALESCED_MMIO is available, a page at
> > > +  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
> > > +  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
> > > +  KVM_CAP_COALESCED_MMIO is not documented yet.
> > > +
> > > +- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> > > +  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
> > > +  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
> > > +
> > >  
> > >  4.6 KVM_SET_MEMORY_REGION
> > >  
> > 
> > PAGE_SIZE being which value? It's not always trivial for
> > userspace to know what's the PAGE_SIZE for the kernel ...
> 
> I thought it can be easily fetched from getpagesize() or
> sysconf(PAGE_SIZE)?  Especially considering that the document should
> be for kvm userspace, I'd say it should be common that a hypervisor
> process will need to know this probably in other tons of places.. no?
> 
> > 
> > 
> > > @@ -5358,6 +5371,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
> > >  AArch64, this value will be reported in the ISS field of ESR_ELx.
> > >  
> > >  See KVM_CAP_VCPU_EVENTS for more details.
> > > +
> > >  8.20 KVM_CAP_HYPERV_SEND_IPI
> > >  
> > >  Architectures: x86
> > > @@ -5365,6 +5379,7 @@ Architectures: x86
> > >  This capability indicates that KVM supports paravirtualized Hyper-V IPI send
> > >  hypercalls:
> > >  HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> > > +
> > >  8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
> > >  
> > >  Architecture: x86
> > > @@ -5378,3 +5393,97 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
> > >  flush hypercalls by Hyper-V) so userspace should disable KVM identification
> > >  in CPUID and only exposes Hyper-V identification. In this case, guest
> > >  thinks it's running on Hyper-V and only use Hyper-V hypercalls.
> > > +
> > > +8.22 KVM_CAP_DIRTY_LOG_RING
> > > +
> > > +Architectures: x86
> > > +Parameters: args[0] - size of the dirty log ring
> > > +
> > > +KVM is capable of tracking dirty memory using ring buffers that are
> > > +mmaped into userspace; there is one dirty ring per vcpu and one global
> > > +ring per vm.
> > > +
> > > +One dirty ring has the following two major structures:
> > > +
> > > +struct kvm_dirty_ring {
> > > +	u16 dirty_index;
> > > +	u16 reset_index;
> > > +	u32 size;
> > > +	u32 soft_limit;
> > > +	spinlock_t lock;
> > > +	struct kvm_dirty_gfn *dirty_gfns;
> > > +};
> > > +
> > > +struct kvm_dirty_ring_indexes {
> > > +	__u32 avail_index; /* set by kernel */
> > > +	__u32 fetch_index; /* set by userspace */
> > 
> > Sticking these next to each other seems to guarantee cache conflicts.
> > 
> > Avail/Fetch seems to mimic Virtio's avail/used exactly.  I am not saying
> > you must reuse the code really, but I think you should take a hard look
> > at e.g. the virtio packed ring structure. We spent a bunch of time
> > optimizing it for cache utilization. It seems kernel is the driver,
> > making entries available, and userspace the device, using them.
> > Again let's not develop a thread about this, but I think
> > this is something to consider and discuss in future versions
> > of the patches.
> 
> I think I completely understand your concern.  We should avoid wasting
> time on those are already there.  I'm just afraid that it'll took even
> more time to use virtio for this use case while at last we don't
> really get much benefit out of it (e.g. most of the virtio features
> are not used).
> 
> Yeh let's not develop a thread for this topic - I will read more on
> virtio before my next post to see whether there's any chance we can
> share anything with virtio ring.
> 
> > 
> > 
> > > +};
> > > +
> > > +While for each of the dirty entry it's defined as:
> > > +
> > > +struct kvm_dirty_gfn {
> > 
> > What does GFN stand for?
> 
> It's guest frame number, iiuc.  I'm not the one who named this, but
> that's what I understand..
> 
> > 
> > > +        __u32 pad;
> > > +        __u32 slot; /* as_id | slot_id */
> > > +        __u64 offset;
> > > +};
> > 
> > offset of what? a 4K page right? Seems like a waste e.g. for
> > hugetlbfs... How about replacing pad with size instead?
> 
> As Paolo explained, it's the page frame number of the guest.  IIUC
> even for hugetlbfs we track dirty bits in 4k size.
> 
> > 
> > > +
> > > +The fields in kvm_dirty_ring will be only internal to KVM itself,
> > > +while the fields in kvm_dirty_ring_indexes will be exposed to
> > > +userspace to be either read or written.
> > 
> > I'm not sure what you are trying to say here. kvm_dirty_gfn
> > seems to be part of UAPI.
> 
> It was talking about kvm_dirty_ring, which is kvm internal and not
> exposed to uapi, while kvm_dirty_gfn is exposed to userspace.
> 
> > 
> > > +
> > > +The two indices in the ring buffer are free running counters.
> > > +
> > > +In pseudocode, processing the ring buffer looks like this:
> > > +
> > > +	idx = load-acquire(&ring->fetch_index);
> > > +	while (idx != ring->avail_index) {
> > > +		struct kvm_dirty_gfn *entry;
> > > +		entry = &ring->dirty_gfns[idx & (size - 1)];
> > > +		...
> > > +
> > > +		idx++;
> > > +	}
> > > +	ring->fetch_index = idx;
> > > +
> > > +Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
> > > +to enable this capability for the new guest and set the size of the
> > > +rings.  It is only allowed before creating any vCPU, and the size of
> > > +the ring must be a power of two.
> > 
> > All these seem like arbitrary limitations to me.
> 
> The dependency of vcpu is partly because we need to create per-vcpu
> ring, so it's easier that we don't allow it to change after that.
> 
> > 
> > Sizing the ring correctly might prove to be a challenge.
> > 
> > Thus I think there's value in resizing the rings
> > without destroying VCPU.
> 
> Do you have an example of when we could use this feature?

So e.g. start with a small ring, and if you see stalls too often,
increase it?  Otherwise I don't see how one decides on the ring size.

>  My wild
> guess is that even if we try hard to allow resizing (assuming that
> won't bring more bugs, but I highly doubt...), people may not use it
> at all.
> 
> The major scenario here is that kvm userspace will be collecting the
> dirty bits quickly, so the ring should not really get full easily.
> Then the ring size does not really matter much either, as long as it
> is bigger than some specific value to avoid vmexits due to full.

Exactly, but I don't see how you are going to find that value
unless it's auto-tuned dynamically.

> How about we start with the simple approach where we don't allow it to
> change?  We can do that when the requirement comes.
> 
> > 
> > Also, power of two just saves a branch here and there,
> > but wastes lots of memory. Just wrap the index around to
> > 0 and then users can select any size?
> 
> Same as above to postpone until we need it?

It's to save memory, don't we always need to do that?

> > 
> > 
> > 
> > >  The larger the ring buffer, the less
> > > +likely the ring is full and the VM is forced to exit to userspace. The
> > > +optimal size depends on the workload, but it is recommended that it be
> > > +at least 64 KiB (4096 entries).
> > 
> > OTOH larger buffers put lots of pressure on the system cache.
> > 
> > > +
> > > +After the capability is enabled, userspace can mmap the global ring
> > > +buffer (kvm_dirty_gfn[], offset KVM_DIRTY_LOG_PAGE_OFFSET) and the
> > > +indexes (kvm_dirty_ring_indexes, offset 0) from the VM file
> > > +descriptor.  The per-vcpu dirty ring instead is mmapped when the vcpu
> > > +is created, similar to the kvm_run struct (kvm_dirty_ring_indexes
> > > +locates inside kvm_run, while kvm_dirty_gfn[] at offset
> > > +KVM_DIRTY_LOG_PAGE_OFFSET).
> > > +
> > > +Just like for dirty page bitmaps, the buffer tracks writes to
> > > +all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
> > > +set in KVM_SET_USER_MEMORY_REGION.  Once a memory region is registered
> > > +with the flag set, userspace can start harvesting dirty pages from the
> > > +ring buffer.
> > > +
> > > +To harvest the dirty pages, userspace accesses the mmaped ring buffer
> > > +to read the dirty GFNs up to avail_index, and sets the fetch_index
> > > +accordingly.  This can be done when the guest is running or paused,
> > > +and dirty pages need not be collected all at once.  After processing
> > > +one or more entries in the ring buffer, userspace calls the VM ioctl
> > > +KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
> > > +fetch_index and to mark those pages clean.  Therefore, the ioctl
> > > +must be called *before* reading the content of the dirty pages.
> > > +
> > > +However, there is a major difference comparing to the
> > > +KVM_GET_DIRTY_LOG interface in that when reading the dirty ring from
> > > +userspace it's still possible that the kernel has not yet flushed the
> > > +hardware dirty buffers into the kernel buffer.  To achieve that, one
> > > +needs to kick the vcpu out for a hardware buffer flush (vmexit).
> > > +
> > > +If one of the ring buffers is full, the guest will exit to userspace
> > > +with the exit reason set to KVM_EXIT_DIRTY_LOG_FULL, and the
> > > +KVM_RUN ioctl will return -EINTR. Once that happens, userspace
> > > +should pause all the vcpus, then harvest all the dirty pages and
> > > +rearm the dirty traps. It can unpause the guest after that.
> > 
> > This last item means that the performance impact of the feature is
> > really hard to predict. Can improve some workloads drastically. Or can
> > slow some down.
> > 
> > 
> > One solution could be to actually allow using this together with the
> > existing bitmap. Userspace can then decide whether it wants to block
> > VCPU on ring full, or just record ring full condition and recover by
> > bitmap scanning.
> 
> That's true, but again allowing mixed use of the two might bring
> extra complexity as well (especially after adding
> KVM_CLEAR_DIRTY_LOG).
> 
> My understanding of this is that normally we do only want either one
> of them depending on the major workload and the configuration of the
> guest.

And again, how does one know which to enable?  No one has the
time to fine-tune a gazillion parameters.

>  It's not trivial to try to provide a one-for-all solution.  So
> again I would hope we can start from easy, then we extend when we have
> better ideas on how to leverage the two interfaces when the ideas
> really come, and then we can justify whether it's worth it to work on
> that complexity.

It's less *coding* work to build a simple thing, but it needs much more *testing*.

IMHO a huge amount of benchmarking has to happen if you just want to
set this loose on users as the default with these kinds of
limitations.  We need to be sure that even though in theory
it can be very bad, in practice it's actually good.
If it's auto-tuned then it's a much easier sell to upstream
even if there's a chance of some regressions.

> > 
> > 
> > > diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> > > index b19ef421084d..0acee817adfb 100644
> > > --- a/arch/x86/kvm/Makefile
> > > +++ b/arch/x86/kvm/Makefile
> > > @@ -5,7 +5,8 @@ ccflags-y += -Iarch/x86/kvm
> > >  KVM := ../../../virt/kvm
> > >  
> > >  kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> > > -				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> > > +				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
> > > +				$(KVM)/dirty_ring.o
> > >  kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
> > >  
> > >  kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
> > > diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> > > new file mode 100644
> > > index 000000000000..8335635b7ff7
> > > --- /dev/null
> > > +++ b/include/linux/kvm_dirty_ring.h
> > > @@ -0,0 +1,67 @@
> > > +#ifndef KVM_DIRTY_RING_H
> > > +#define KVM_DIRTY_RING_H
> > > +
> > > +/*
> > > + * struct kvm_dirty_ring is defined in include/uapi/linux/kvm.h.
> > > + *
> > > + * dirty_ring:  shared with userspace via mmap. It is the compact list
> > > + *              that holds the dirty pages.
> > > + * dirty_index: free running counter that points to the next slot in
> > > + *              dirty_ring->dirty_gfns  where a new dirty page should go.
> > > + * reset_index: free running counter that points to the next dirty page
> > > + *              in dirty_ring->dirty_gfns for which dirty trap needs to
> > > + *              be reenabled
> > > + * size:        size of the compact list, dirty_ring->dirty_gfns
> > > + * soft_limit:  when the number of dirty pages in the list reaches this
> > > + *              limit, vcpu that owns this ring should exit to userspace
> > > + *              to allow userspace to harvest all the dirty pages
> > > + * lock:        protects dirty_ring, only in use if this is the global
> > > + *              ring
> > > + *
> > > + * The number of dirty pages in the ring is calculated by,
> > > + * dirty_index - reset_index
> > > + *
> > > + * kernel increments dirty_ring->indices.avail_index after dirty index
> > > + * is incremented. When userspace harvests the dirty pages, it increments
> > > + * dirty_ring->indices.fetch_index up to dirty_ring->indices.avail_index.
> > > + * When kernel reenables dirty traps for the dirty pages, it increments
> > > + * reset_index up to dirty_ring->indices.fetch_index.
> > > + *
> > > + */
> > > +struct kvm_dirty_ring {
> > > +	u32 dirty_index;
> > > +	u32 reset_index;
> > > +	u32 size;
> > > +	u32 soft_limit;
> > > +	spinlock_t lock;
> > > +	struct kvm_dirty_gfn *dirty_gfns;
> > > +};
> > > +
> > > +u32 kvm_dirty_ring_get_rsvd_entries(void);
> > > +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring);
> > > +
> > > +/*
> > > + * called with kvm->slots_lock held, returns the number of
> > > + * processed pages.
> > > + */
> > > +int kvm_dirty_ring_reset(struct kvm *kvm,
> > > +			 struct kvm_dirty_ring *ring,
> > > +			 struct kvm_dirty_ring_indexes *indexes);
> > > +
> > > +/*
> > > + * returns 0: successfully pushed
> > > + *         1: successfully pushed, soft limit reached,
> > > + *            vcpu should exit to userspace
> > > + *         -EBUSY: unable to push, dirty ring full.
> > > + */
> > > +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> > > +			struct kvm_dirty_ring_indexes *indexes,
> > > +			u32 slot, u64 offset, bool lock);
> > > +
> > > +/* for use in vm_operations_struct */
> > > +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i);
> > > +
> > > +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
> > > +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring);
> > > +
> > > +#endif
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 498a39462ac1..7b747bc9ff3e 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -34,6 +34,7 @@
> > >  #include <linux/kvm_types.h>
> > >  
> > >  #include <asm/kvm_host.h>
> > > +#include <linux/kvm_dirty_ring.h>
> > >  
> > >  #ifndef KVM_MAX_VCPU_ID
> > >  #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
> > > @@ -146,6 +147,7 @@ static inline bool is_error_page(struct page *page)
> > >  #define KVM_REQ_MMU_RELOAD        (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> > >  #define KVM_REQ_PENDING_TIMER     2
> > >  #define KVM_REQ_UNHALT            3
> > > +#define KVM_REQ_DIRTY_RING_FULL   4
> > >  #define KVM_REQUEST_ARCH_BASE     8
> > >  
> > >  #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
> > > @@ -321,6 +323,7 @@ struct kvm_vcpu {
> > >  	bool ready;
> > >  	struct kvm_vcpu_arch arch;
> > >  	struct dentry *debugfs_dentry;
> > > +	struct kvm_dirty_ring dirty_ring;
> > >  };
> > >  
> > >  static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> > > @@ -501,6 +504,10 @@ struct kvm {
> > >  	struct srcu_struct srcu;
> > >  	struct srcu_struct irq_srcu;
> > >  	pid_t userspace_pid;
> > > +	/* Data structure to be exported by mmap(kvm->fd, 0) */
> > > +	struct kvm_vm_run *vm_run;
> > > +	u32 dirty_ring_size;
> > > +	struct kvm_dirty_ring vm_dirty_ring;
> > >  };
> > >  
> > >  #define kvm_err(fmt, ...) \
> > > @@ -832,6 +839,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> > >  					gfn_t gfn_offset,
> > >  					unsigned long mask);
> > >  
> > > +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
> > > +
> > >  int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
> > >  				struct kvm_dirty_log *log);
> > >  int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> > > @@ -1411,4 +1420,28 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
> > >  				uintptr_t data, const char *name,
> > >  				struct task_struct **thread_ptr);
> > >  
> > > +/*
> > > + * This defines how many reserved entries we want to keep before we
> > > + * kick the vcpu to the userspace to avoid dirty ring full.  This
> > > + * value can be tuned to higher if e.g. PML is enabled on the host.
> > > + */
> > > +#define  KVM_DIRTY_RING_RSVD_ENTRIES  64
> > > +
> > > +/* Max number of entries allowed for each kvm dirty ring */
> > > +#define  KVM_DIRTY_RING_MAX_ENTRIES  65536
> > > +
> > > +/*
> > > + * Arch needs to define these macro after implementing the dirty ring
> > > + * feature.  KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
> > > + * starting page offset of the dirty ring structures,
> > 
> > Confused. Offset where? You set a default for everyone - where does arch
> > want to override it?
> 
> If arch defines KVM_DIRTY_LOG_PAGE_OFFSET then below will be a no-op,
> please see [1] on #ifndef.

So which arches need to override it? Why do you say they should?

> > 
> > > while
> > > + * KVM_DIRTY_RING_VERSION should be defined as >=1.  By default, this
> > > + * feature is off on all archs.
> > > + */
> > > +#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
> 
> [1]
> 
> > > +#define KVM_DIRTY_LOG_PAGE_OFFSET 0
> > > +#endif
> > > +#ifndef KVM_DIRTY_RING_VERSION
> > > +#define KVM_DIRTY_RING_VERSION 0
> > > +#endif
> > 
> > One way versioning, with no bits and negotiation
> > will make it hard to change down the road.
> > what's wrong with existing KVM capabilities that
> > you feel there's a need for dedicated versioning for this?
> 
> Frankly speaking I don't even think it'll change in the near
> future.. :)
> 
> Yeah, kvm versioning could work too.  Here we can also return a zero
> just like most of the caps (or KVM_DIRTY_LOG_PAGE_OFFSET as in the
> original patchset, but that doesn't really help either because it's
> defined in uapi), but I just don't see how it helps...  So I returned
> a version number just in case we'd like to change the layout some day
> and when we don't want to bother introducing another cap bit for the
> same feature (like KVM_CAP_MANUAL_DIRTY_LOG_PROTECT and
> KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2).

I guess it's up to Paolo but really I don't see the point.
You can add a version later when it means something ...

> > 
> > > +
> > >  #endif
> > > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > > index 1c88e69db3d9..d9d03eea145a 100644
> > > --- a/include/linux/kvm_types.h
> > > +++ b/include/linux/kvm_types.h
> > > @@ -11,6 +11,7 @@ struct kvm_irq_routing_table;
> > >  struct kvm_memory_slot;
> > >  struct kvm_one_reg;
> > >  struct kvm_run;
> > > +struct kvm_vm_run;
> > >  struct kvm_userspace_memory_region;
> > >  struct kvm_vcpu;
> > >  struct kvm_vcpu_init;
> > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > index e6f17c8e2dba..0b88d76d6215 100644
> > > --- a/include/uapi/linux/kvm.h
> > > +++ b/include/uapi/linux/kvm.h
> > > @@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
> > >  #define KVM_EXIT_IOAPIC_EOI       26
> > >  #define KVM_EXIT_HYPERV           27
> > >  #define KVM_EXIT_ARM_NISV         28
> > > +#define KVM_EXIT_DIRTY_RING_FULL  29
> > >  
> > >  /* For KVM_EXIT_INTERNAL_ERROR */
> > >  /* Emulate instruction failed. */
> > > @@ -247,6 +248,11 @@ struct kvm_hyperv_exit {
> > >  /* Encounter unexpected vm-exit reason */
> > >  #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON	4
> > >  
> > > +struct kvm_dirty_ring_indexes {
> > > +	__u32 avail_index; /* set by kernel */
> > > +	__u32 fetch_index; /* set by userspace */
> > > +};
> > > +
> > >  /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
> > >  struct kvm_run {
> > >  	/* in */
> > > @@ -421,6 +427,13 @@ struct kvm_run {
> > >  		struct kvm_sync_regs regs;
> > >  		char padding[SYNC_REGS_SIZE_BYTES];
> > >  	} s;
> > > +
> > > +	struct kvm_dirty_ring_indexes vcpu_ring_indexes;
> > > +};
> > > +
> > > +/* Returned by mmap(kvm->fd, offset=0) */
> > > +struct kvm_vm_run {
> > > +	struct kvm_dirty_ring_indexes vm_ring_indexes;
> > >  };
> > >  
> > >  /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> > > @@ -1009,6 +1022,7 @@ struct kvm_ppc_resize_hpt {
> > >  #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
> > >  #define KVM_CAP_ARM_NISV_TO_USER 177
> > >  #define KVM_CAP_ARM_INJECT_EXT_DABT 178
> > > +#define KVM_CAP_DIRTY_LOG_RING 179
> > >  
> > >  #ifdef KVM_CAP_IRQ_ROUTING
> > >  
> > > @@ -1472,6 +1486,9 @@ struct kvm_enc_region {
> > >  /* Available with KVM_CAP_ARM_SVE */
> > >  #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
> > >  
> > > +/* Available with KVM_CAP_DIRTY_LOG_RING */
> > > +#define KVM_RESET_DIRTY_RINGS     _IO(KVMIO, 0xc3)
> > > +
> > >  /* Secure Encrypted Virtualization command */
> > >  enum sev_cmd_id {
> > >  	/* Guest initialization commands */
> > > @@ -1622,4 +1639,23 @@ struct kvm_hyperv_eventfd {
> > >  #define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
> > >  #define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
> > >  
> > > +/*
> > > + * The following are the requirements for supporting dirty log ring
> > > + * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
> > > + *
> > > + * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
> > > + *    of kvm_write_* so that the global dirty ring is not filled up
> > > + *    too quickly.
> > > + * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
> > > + *    enabling dirty logging.
> > > + * 3. There should not be a separate step to synchronize hardware
> > > + *    dirty bitmap with KVM's.
> > > + */
> > > +
> > > +struct kvm_dirty_gfn {
> > > +	__u32 pad;
> > > +	__u32 slot;
> > > +	__u64 offset;
> > > +};
> > > +
> > >  #endif /* __LINUX_KVM_H */
> > > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > > new file mode 100644
> > > index 000000000000..9264891f3c32
> > > --- /dev/null
> > > +++ b/virt/kvm/dirty_ring.c
> > > @@ -0,0 +1,156 @@
> > > +#include <linux/kvm_host.h>
> > > +#include <linux/kvm.h>
> > > +#include <linux/vmalloc.h>
> > > +#include <linux/kvm_dirty_ring.h>
> > > +
> > > +u32 kvm_dirty_ring_get_rsvd_entries(void)
> > > +{
> > > +	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> > > +}
> > > +
> > > +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > > +{
> > > +	u32 size = kvm->dirty_ring_size;
> > > +
> > > +	ring->dirty_gfns = vmalloc(size);
> > 
> > So 1/2 a megabyte of kernel memory per VM that userspace locks up.
> > Do we really have to though? Why not get a userspace pointer,
> > write it with copy to user, and sidestep all this?
> 
> I'd say it won't be a big issue on locking 1/2M of host mem for a
> vm...
> Also note that if dirty ring is enabled, I plan to evaporate the
> dirty_bitmap in the next post. The old kvm->dirty_bitmap takes
> $GUEST_MEM/32K*2 mem.  E.g., for 64G guest it's 64G/32K*2=4M.  If with
> dirty ring of 8 vcpus, that could be 64K*8=0.5M, which could be even
> less memory used.

Right - I think Avi described the bitmap in kernel memory as one of
design mistakes. Why repeat that with the new design?

> > 
> > > +	if (!ring->dirty_gfns)
> > > +		return -ENOMEM;
> > > +	memset(ring->dirty_gfns, 0, size);
> > > +
> > > +	ring->size = size / sizeof(struct kvm_dirty_gfn);
> > > +	ring->soft_limit =
> > > +	    (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -
> > > +	    kvm_dirty_ring_get_rsvd_entries();
> > > +	ring->dirty_index = 0;
> > > +	ring->reset_index = 0;
> > > +	spin_lock_init(&ring->lock);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +int kvm_dirty_ring_reset(struct kvm *kvm,
> > > +			 struct kvm_dirty_ring *ring,
> > > +			 struct kvm_dirty_ring_indexes *indexes)
> > > +{
> > > +	u32 cur_slot, next_slot;
> > > +	u64 cur_offset, next_offset;
> > > +	unsigned long mask;
> > > +	u32 fetch;
> > > +	int count = 0;
> > > +	struct kvm_dirty_gfn *entry;
> > > +
> > > +	fetch = READ_ONCE(indexes->fetch_index);
> > > +	if (fetch == ring->reset_index)
> > > +		return 0;
> > > +
> > > +	entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > > +	/*
> > > +	 * The ring buffer is shared with userspace, which might mmap
> > > +	 * it and concurrently modify slot and offset.  Userspace must
> > > +	 * not be trusted!  READ_ONCE prevents the compiler from changing
> > > +	 * the values after they've been range-checked (the checks are
> > > +	 * in kvm_reset_dirty_gfn).
> > 
> > What it doesn't is prevent speculative attacks.  That's why things like
> > copy from user have a speculation barrier.  Instead of worrying about
> > that, unless it's really critical, I think you'd do well do just use
> > copy to/from user.
> 
> IMHO I would really hope these data be there without swapped out of
> memory, just like what we did with kvm->dirty_bitmap... it's on the
> hot path of mmu page fault, even we could be with mmu lock held if
> copy_to_user() page faulted.  But indeed I've no experience on
> avoiding speculative attacks, suggestions would be greatly welcomed on
> that.  In our case we do (index & (size - 1)), so is it still
> suffering from speculative attacks?

I'm not saying I understand everything in depth.
Just reacting to this:
	READ_ONCE prevents the compiler from changing
	the values after they've been range-checked (the checks are
	in kvm_reset_dirty_gfn)

so any range checks you do can be attacked.

And the safest way to avoid the attacks is to do what most of the
kernel does and use copy from/to user when you talk to
userspace. That also avoids annoying things like bypassing SMAP.
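
To illustrate what I mean (just a sketch, assuming a hypothetical
user-provided pointer 'uindexes' instead of the mmap-ed page, which is
not what this patch does):

	struct kvm_dirty_ring_indexes __user *uindexes;
	u32 fetch;

	if (get_user(fetch, &uindexes->fetch_index))
		return -EFAULT;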


> > 
> > > +	 */
> > > +	smp_read_barrier_depends();
> > 
> > What depends on what here? Looks suspicious ...
> 
> Hmm, I think maybe it can be removed because the entry pointer
> reference below should be an ordering constraint already?
> 
> > 
> > > +	cur_slot = READ_ONCE(entry->slot);
> > > +	cur_offset = READ_ONCE(entry->offset);
> > > +	mask = 1;
> > > +	count++;
> > > +	ring->reset_index++;
> > > +	while (ring->reset_index != fetch) {
> > > +		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > > +		smp_read_barrier_depends();
> > 
> > same concerns here
> > 
> > > +		next_slot = READ_ONCE(entry->slot);
> > > +		next_offset = READ_ONCE(entry->offset);
> > > +		ring->reset_index++;
> > > +		count++;
> > > +		/*
> > > +		 * Try to coalesce the reset operations when the guest is
> > > +		 * scanning pages in the same slot.
> > 
> > what does guest scanning mean?
> 
> My wild guess is that it means when the guest is accessing the pages
> continuously so the dirty gfns are continuous too.  Anyway I agree
> it's not clear, where I can try to rephrase.
> 
> > 
> > > +		 */
> > > +		if (next_slot == cur_slot) {
> > > +			int delta = next_offset - cur_offset;
> > > +
> > > +			if (delta >= 0 && delta < BITS_PER_LONG) {
> > > +				mask |= 1ull << delta;
> > > +				continue;
> > > +			}
> > > +
> > > +			/* Backwards visit, careful about overflows!  */
> > > +			if (delta > -BITS_PER_LONG && delta < 0 &&
> > > +			    (mask << -delta >> -delta) == mask) {
> > > +				cur_offset = next_offset;
> > > +				mask = (mask << -delta) | 1;
> > > +				continue;
> > > +			}
> > > +		}
> > > +		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > > +		cur_slot = next_slot;
> > > +		cur_offset = next_offset;
> > > +		mask = 1;
> > > +	}
> > > +	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > > +
> > > +	return count;
> > > +}
> > > +
> > > +static inline u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
> > > +{
> > > +	return ring->dirty_index - ring->reset_index;
> > > +}
> > > +
> > > +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
> > > +{
> > > +	return kvm_dirty_ring_used(ring) >= ring->size;
> > > +}
> > > +
> > > +/*
> > > + * Returns:
> > > + *   >0 if we should kick the vcpu out,
> > > + *   =0 if the gfn pushed successfully, or,
> > > + *   <0 if error (e.g. ring full)
> > > + */
> > > +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> > > +			struct kvm_dirty_ring_indexes *indexes,
> > > +			u32 slot, u64 offset, bool lock)
> > > +{
> > > +	int ret;
> > > +	struct kvm_dirty_gfn *entry;
> > > +
> > > +	if (lock)
> > > +		spin_lock(&ring->lock);
> > 
> > what's the story around locking here? Why is it safe
> > not to take the lock sometimes?
> 
> kvm_dirty_ring_push() will be with lock==true only when the per-vm
> ring is used.  For per-vcpu ring, because that will only happen with
> the vcpu context, then we don't need locks (so kvm_dirty_ring_push()
> is called with lock==false).
> 
> > 
> > > +
> > > +	if (kvm_dirty_ring_full(ring)) {
> > > +		ret = -EBUSY;
> > > +		goto out;
> > > +	}
> > > +
> > > +	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> > > +	entry->slot = slot;
> > > +	entry->offset = offset;
> > > +	smp_wmb();
> > > +	ring->dirty_index++;
> > > +	WRITE_ONCE(indexes->avail_index, ring->dirty_index);
> > > +	ret = kvm_dirty_ring_used(ring) >= ring->soft_limit;
> > > +	pr_info("%s: slot %u offset %llu used %u\n",
> > > +		__func__, slot, offset, kvm_dirty_ring_used(ring));
> > > +
> > > +out:
> > > +	if (lock)
> > > +		spin_unlock(&ring->lock);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i)
> > > +{
> > > +	return vmalloc_to_page((void *)ring->dirty_gfns + i * PAGE_SIZE);
> > > +}
> > > +
> > > +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
> > > +{
> > > +	if (ring->dirty_gfns) {
> > > +		vfree(ring->dirty_gfns);
> > > +		ring->dirty_gfns = NULL;
> > > +	}
> > > +}
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 681452d288cd..8642c977629b 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -64,6 +64,8 @@
> > >  #define CREATE_TRACE_POINTS
> > >  #include <trace/events/kvm.h>
> > >  
> > > +#include <linux/kvm_dirty_ring.h>
> > > +
> > >  /* Worst case buffer size needed for holding an integer. */
> > >  #define ITOA_MAX_LEN 12
> > >  
> > > @@ -149,6 +151,10 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> > >  				    struct kvm_vcpu *vcpu,
> > >  				    struct kvm_memory_slot *memslot,
> > >  				    gfn_t gfn);
> > > +static void mark_page_dirty_in_ring(struct kvm *kvm,
> > > +				    struct kvm_vcpu *vcpu,
> > > +				    struct kvm_memory_slot *slot,
> > > +				    gfn_t gfn);
> > >  
> > >  __visible bool kvm_rebooting;
> > >  EXPORT_SYMBOL_GPL(kvm_rebooting);
> > > @@ -359,11 +365,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
> > >  	vcpu->preempted = false;
> > >  	vcpu->ready = false;
> > >  
> > > +	if (kvm->dirty_ring_size) {
> > > +		r = kvm_dirty_ring_alloc(vcpu->kvm, &vcpu->dirty_ring);
> > > +		if (r) {
> > > +			kvm->dirty_ring_size = 0;
> > > +			goto fail_free_run;
> > > +		}
> > > +	}
> > > +
> > >  	r = kvm_arch_vcpu_init(vcpu);
> > >  	if (r < 0)
> > > -		goto fail_free_run;
> > > +		goto fail_free_ring;
> > >  	return 0;
> > >  
> > > +fail_free_ring:
> > > +	if (kvm->dirty_ring_size)
> > > +		kvm_dirty_ring_free(&vcpu->dirty_ring);
> > >  fail_free_run:
> > >  	free_page((unsigned long)vcpu->run);
> > >  fail:
> > > @@ -381,6 +398,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
> > >  	put_pid(rcu_dereference_protected(vcpu->pid, 1));
> > >  	kvm_arch_vcpu_uninit(vcpu);
> > >  	free_page((unsigned long)vcpu->run);
> > > +	if (vcpu->kvm->dirty_ring_size)
> > > +		kvm_dirty_ring_free(&vcpu->dirty_ring);
> > >  }
> > >  EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
> > >  
> > > @@ -690,6 +709,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> > >  	struct kvm *kvm = kvm_arch_alloc_vm();
> > >  	int r = -ENOMEM;
> > >  	int i;
> > > +	struct page *page;
> > >  
> > >  	if (!kvm)
> > >  		return ERR_PTR(-ENOMEM);
> > > @@ -705,6 +725,14 @@ static struct kvm *kvm_create_vm(unsigned long type)
> > >  
> > >  	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
> > >  
> > > +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> > > +	if (!page) {
> > > +		r = -ENOMEM;
> > > +		goto out_err_alloc_page;
> > > +	}
> > > +	kvm->vm_run = page_address(page);
> > 
> > So 4K with just 8 bytes used. Not as bad as 1/2Mbyte for the ring but
> > still. What is wrong with just a pointer and calling put_user?
> 
> I want to make it the start point for sharing fields between
> user/kernel per-vm.  Just like kvm_run for per-vcpu.

And why is doing that without get/put user a good idea?
If nothing else this bypasses SMAP, so exploits can pass
data from userspace to the kernel through that page.

> IMHO it'll be awkward if we always introduce a new interface just to
> take a pointer of the userspace buffer and cache it...  I'd say so far
> I like the design of kvm_run and alike because it's efficient, easy to
> use, and easy for extensions.


Well, kvm_run at least isn't accessed while the kernel is processing it.
And the structure there is dead simple, not a tricky lockless ring
with indices and things.

Again I might be wrong, eventually it's up to the kvm maintainers.  But
really there's a standard thing all drivers do to talk to userspace, and
if there's no special reason to do otherwise I would do exactly that.
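
E.g. on the push side, with the same hypothetical 'uindexes' user
pointer as above (sketch only, not what the patch does):

	if (put_user(ring->dirty_index, &uindexes->avail_index))
		return -EFAULT;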

> > 
> > > +	BUILD_BUG_ON(sizeof(struct kvm_vm_run) > PAGE_SIZE);
> > > +
> > >  	if (init_srcu_struct(&kvm->srcu))
> > >  		goto out_err_no_srcu;
> > >  	if (init_srcu_struct(&kvm->irq_srcu))
> > > @@ -775,6 +803,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
> > >  out_err_no_irq_srcu:
> > >  	cleanup_srcu_struct(&kvm->srcu);
> > >  out_err_no_srcu:
> > > +	free_page((unsigned long)page);
> > > +	kvm->vm_run = NULL;
> > > +out_err_alloc_page:
> > >  	kvm_arch_free_vm(kvm);
> > >  	mmdrop(current->mm);
> > >  	return ERR_PTR(r);
> > > @@ -800,6 +831,15 @@ static void kvm_destroy_vm(struct kvm *kvm)
> > >  	int i;
> > >  	struct mm_struct *mm = kvm->mm;
> > >  
> > > +	if (kvm->dirty_ring_size) {
> > > +		kvm_dirty_ring_free(&kvm->vm_dirty_ring);
> > > +	}
> > > +
> > > +	if (kvm->vm_run) {
> > > +		free_page((unsigned long)kvm->vm_run);
> > > +		kvm->vm_run = NULL;
> > > +	}
> > > +
> > >  	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
> > >  	kvm_destroy_vm_debugfs(kvm);
> > >  	kvm_arch_sync_events(kvm);
> > > @@ -2301,7 +2341,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> > >  {
> > >  	if (memslot && memslot->dirty_bitmap) {
> > >  		unsigned long rel_gfn = gfn - memslot->base_gfn;
> > > -
> > > +		mark_page_dirty_in_ring(kvm, vcpu, memslot, gfn);
> > >  		set_bit_le(rel_gfn, memslot->dirty_bitmap);
> > >  	}
> > >  }
> > > @@ -2649,6 +2689,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> > >  }
> > >  EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
> > >  
> > > +static bool kvm_fault_in_dirty_ring(struct kvm *kvm, struct vm_fault *vmf)
> > > +{
> > > +	return (vmf->pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
> > > +	    (vmf->pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
> > > +	     kvm->dirty_ring_size / PAGE_SIZE);
> > > +}
> > > +
> > >  static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
> > >  {
> > >  	struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
> > > @@ -2664,6 +2711,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
> > >  	else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
> > >  		page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
> > >  #endif
> > > +	else if (kvm_fault_in_dirty_ring(vcpu->kvm, vmf))
> > > +		page = kvm_dirty_ring_get_page(
> > > +		    &vcpu->dirty_ring,
> > > +		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
> > >  	else
> > >  		return kvm_arch_vcpu_fault(vcpu, vmf);
> > >  	get_page(page);
> > > @@ -3259,12 +3310,162 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> > >  #endif
> > >  	case KVM_CAP_NR_MEMSLOTS:
> > >  		return KVM_USER_MEM_SLOTS;
> > > +	case KVM_CAP_DIRTY_LOG_RING:
> > > +		/* Version will be zero if arch didn't implement it */
> > > +		return KVM_DIRTY_RING_VERSION;
> > >  	default:
> > >  		break;
> > >  	}
> > >  	return kvm_vm_ioctl_check_extension(kvm, arg);
> > >  }
> > >  
> > > +static void mark_page_dirty_in_ring(struct kvm *kvm,
> > > +				    struct kvm_vcpu *vcpu,
> > > +				    struct kvm_memory_slot *slot,
> > > +				    gfn_t gfn)
> > > +{
> > > +	u32 as_id = 0;
> > > +	u64 offset;
> > > +	int ret;
> > > +	struct kvm_dirty_ring *ring;
> > > +	struct kvm_dirty_ring_indexes *indexes;
> > > +	bool is_vm_ring;
> > > +
> > > +	if (!kvm->dirty_ring_size)
> > > +		return;
> > > +
> > > +	offset = gfn - slot->base_gfn;
> > > +
> > > +	if (vcpu) {
> > > +		as_id = kvm_arch_vcpu_memslots_id(vcpu);
> > > +	} else {
> > > +		as_id = 0;
> > > +		vcpu = kvm_get_running_vcpu();
> > > +	}
> > > +
> > > +	if (vcpu) {
> > > +		ring = &vcpu->dirty_ring;
> > > +		indexes = &vcpu->run->vcpu_ring_indexes;
> > > +		is_vm_ring = false;
> > > +	} else {
> > > +		/*
> > > +		 * Put onto per vm ring because no vcpu context.  Kick
> > > +		 * vcpu0 if ring is full.
> > 
> > What about tasks on vcpu 0? Do guests realize it's a bad idea to put
> > critical tasks there, they will be penalized disproportionally?
> 
> Reasonable question.  So far we can't avoid it because vcpu exit is
> the event mechanism to say "hey please collect dirty bits".  Maybe
> someway is better than this, but I'll need to rethink all these
> over...

Maybe signal an eventfd, and let userspace worry about deciding what to
do.
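
Something like this, as a sketch only (it assumes a hypothetical
eventfd context, say kvm->dirty_ring_efd, registered by userspace
through some new ioctl, none of which exists in this series):

	if (ret > 0 && is_vm_ring && kvm->dirty_ring_efd)
		eventfd_signal(kvm->dirty_ring_efd, 1);

and then userspace decides whether, and which, vcpus to stop.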

> > 
> > > +		 */
> > > +		vcpu = kvm->vcpus[0];
> > > +		ring = &kvm->vm_dirty_ring;
> > > +		indexes = &kvm->vm_run->vm_ring_indexes;
> > > +		is_vm_ring = true;
> > > +	}
> > > +
> > > +	ret = kvm_dirty_ring_push(ring, indexes,
> > > +				  (as_id << 16)|slot->id, offset,
> > > +				  is_vm_ring);
> > > +	if (ret < 0) {
> > > +		if (is_vm_ring)
> > > +			pr_warn_once("vcpu %d dirty log overflow\n",
> > > +				     vcpu->vcpu_id);
> > > +		else
> > > +			pr_warn_once("per-vm dirty log overflow\n");
> > > +		return;
> > > +	}
> > > +
> > > +	if (ret)
> > > +		kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
> > > +}
> > > +
> > > +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> > > +{
> > > +	struct kvm_memory_slot *memslot;
> > > +	int as_id, id;
> > > +
> > > +	as_id = slot >> 16;
> > > +	id = (u16)slot;
> > > +	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> > > +		return;
> > > +
> > > +	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
> > > +	if (offset >= memslot->npages)
> > > +		return;
> > > +
> > > +	spin_lock(&kvm->mmu_lock);
> > > +	/* FIXME: we should use a single AND operation, but there is no
> > > +	 * applicable atomic API.
> > > +	 */
> > > +	while (mask) {
> > > +		clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
> > > +		mask &= mask - 1;
> > > +	}
> > > +
> > > +	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> > > +	spin_unlock(&kvm->mmu_lock);
> > > +}
> > > +
> > > +static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
> > > +{
> > > +	int r;
> > > +
> > > +	/* the size should be power of 2 */
> > > +	if (!size || (size & (size - 1)))
> > > +		return -EINVAL;
> > > +
> > > +	/* Should be bigger to keep the reserved entries, or a page */
> > > +	if (size < kvm_dirty_ring_get_rsvd_entries() *
> > > +	    sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
> > > +		return -EINVAL;
> > > +
> > > +	if (size > KVM_DIRTY_RING_MAX_ENTRIES *
> > > +	    sizeof(struct kvm_dirty_gfn))
> > > +		return -E2BIG;
> > 
> > KVM_DIRTY_RING_MAX_ENTRIES is not part of UAPI.
> > So how does userspace know what's legal?
> > Do you expect it to just try?
> 
> Yep that's what I thought. :)
> 
> Please grep E2BIG in QEMU repo target/i386/kvm.c...  won't be hard to
> do imho..

I don't see anything except just failing. Do we really have something
trying to find a working value? What would even be a reasonable range?
Start from UINT_MAX and work down? In which increments?
This is just a ton of overhead for what could have been a
simple query.

> > More likely it will just copy the number from kernel and can
> > never ever make it smaller.
> 
> Not sure, but for sure I can probably move KVM_DIRTY_RING_MAX_ENTRIES
> to uapi too.
> 
> Thanks,

That won't help, as you can never change it afterwards.
You need it to be runtime-discoverable.
Or again, keep it in userspace memory and then you don't
really care what size it is.


> -- 
> Peter Xu
Paolo Bonzini Dec. 12, 2019, 12:08 a.m. UTC | #36
On 11/12/19 23:57, Michael S. Tsirkin wrote:
>>> All these seem like arbitrary limitations to me.
>>>
>>> Sizing the ring correctly might prove to be a challenge.
>>>
>>> Thus I think there's value in resizing the rings
>>> without destroying VCPU.
>>
>> Do you have an example on when we could use this feature?
> 
> So e.g. start with a small ring, and if you see stalls too often
> increase it? Otherwise I don't see how does one decide
> on ring size.

If you see stalls often, it means the guest is dirtying memory very
fast.  Harvesting the ring puts back pressure on the guest, you may
prefer a smaller ring size to avoid a bufferbloat-like situation.

Note that having a larger ring is better, even though it does incur a
memory cost, because it means the migration thread will be able to reap
the ring buffer asynchronously with no vmexits.

With smaller ring sizes the cost of flushing the TLB when resetting the
rings goes up, but the initial bulk copy phase _will_ have vmexits and
then having to reap more dirty rings becomes more expensive and
introduces some jitter.  So it will require some experimentation to find
an optimal value.

Anyway if in the future we go for resizable rings, KVM_ENABLE_CAP can be
passed the largest desired size and then another ioctl can be introduced
to set the mask for indices.

>>> Also, power of two just saves a branch here and there,
>>> but wastes lots of memory. Just wrap the index around to
>>> 0 and then users can select any size?
>>
>> Same as above to postpone until we need it?
> 
> It's to save memory, don't we always need to do that?

Does it really save that much memory?  Would it really be so beneficial
to choose 12K entries rather than 8K or 16K in the ring?

>> My understanding of this is that normally we do only want either one
>> of them depending on the major workload and the configuration of the
>> guest.
> 
> And again how does one know which to enable? No one has the
> time to fine-tune gazillion parameters.

Hopefully we can always use just the ring buffer.

> IMHO a huge amount of benchmarking has to happen if you just want to
> set this loose on users as default with these kind of
> limitations. We need to be sure that even though in theory
> it can be very bad, in practice it's actually good.
> If it's auto-tuning then it's a much easier sell to upstream
> even if there's a chance of some regressions.

Auto-tuning is not a silver bullet, it requires just as much
benchmarking to make sure that it doesn't oscillate crazily and that it
actually outperforms a simple fixed size.

>> Yeh kvm versioning could work too.  Here we can also return a zero
>> just like the most of the caps (or KVM_DIRTY_LOG_PAGE_OFFSET as in the
>> original patchset, but it's really helpless either because it's
>> defined in uapi), but I just don't see how it helps...  So I returned
>> a version number just in case we'd like to change the layout some day
>> and when we don't want to bother introducing another cap bit for the
>> same feature (like KVM_CAP_MANUAL_DIRTY_LOG_PROTECT and
>> KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2).
> 
> I guess it's up to Paolo but really I don't see the point.
> You can add a version later when it means something ...

Yeah, we can return the maximum size of the ring buffer, too.

>> I'd say it won't be a big issue on locking 1/2M of host mem for a
>> vm...
>> Also note that if dirty ring is enabled, I plan to evaporate the
>> dirty_bitmap in the next post. The old kvm->dirty_bitmap takes
>> $GUEST_MEM/32K*2 mem.  E.g., for 64G guest it's 64G/32K*2=4M.  If with
>> dirty ring of 8 vcpus, that could be 64K*8=0.5M, which could be even
>> less memory used.
> 
> Right - I think Avi described the bitmap in kernel memory as one of
> design mistakes. Why repeat that with the new design?

Do you have a source for that?  At least the dirty bitmap has to be
accessed from atomic context so it seems unlikely that it can be moved
to user memory.

The dirty ring could use user memory indeed, but it would be much harder
to set up (multiple ioctls for each ring?  what to do if userspace
forgets one? etc.).  The mmap API is easier to use.
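
For comparison, the userspace side of the mmap API is just something
like this (sketch, assuming ring_size is the value passed to
KVM_ENABLE_CAP and page_size is the host page size):

	#include <sys/mman.h>

	struct kvm_dirty_gfn *gfns;

	gfns = mmap(NULL, ring_size, PROT_READ | PROT_WRITE, MAP_SHARED,
		    vcpu_fd, KVM_DIRTY_LOG_PAGE_OFFSET * page_size);
	if (gfns == MAP_FAILED)
		/* handle the error */;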

>>>> +	entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
>>>> +	/*
>>>> +	 * The ring buffer is shared with userspace, which might mmap
>>>> +	 * it and concurrently modify slot and offset.  Userspace must
>>>> +	 * not be trusted!  READ_ONCE prevents the compiler from changing
>>>> +	 * the values after they've been range-checked (the checks are
>>>> +	 * in kvm_reset_dirty_gfn).
>>>
>>> What it doesn't is prevent speculative attacks.  That's why things like
>>> copy from user have a speculation barrier.  Instead of worrying about
>>> that, unless it's really critical, I think you'd do well do just use
>>> copy to/from user.

An unconditional speculation barrier (lfence) is also expensive.  We
already have macros to add speculation checks with array_index_nospec at
the right places, for example __kvm_memslots.  We should add an
array_index_nospec to id_to_memslot as well.  I'll send a patch for that.
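
Something similar could be done on the ring index itself, e.g. (sketch
only):

	#include <linux/nospec.h>

	u32 idx = ring->reset_index & (ring->size - 1);

	idx = array_index_nospec(idx, ring->size);
	entry = &ring->dirty_gfns[idx];

i.e. clamp the index under speculation without paying for an
unconditional lfence.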

>>> What depends on what here? Looks suspicious ...
>>
>> Hmm, I think maybe it can be removed because the entry pointer
>> reference below should be an ordering constraint already?

entry->xxx depends on ring->reset_index.

>>> what's the story around locking here? Why is it safe
>>> not to take the lock sometimes?
>>
>> kvm_dirty_ring_push() will be with lock==true only when the per-vm
>> ring is used.  For per-vcpu ring, because that will only happen with
>> the vcpu context, then we don't need locks (so kvm_dirty_ring_push()
>> is called with lock==false).

FWIW this will be done much more nicely in v2.

>>>> +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
>>>> +	if (!page) {
>>>> +		r = -ENOMEM;
>>>> +		goto out_err_alloc_page;
>>>> +	}
>>>> +	kvm->vm_run = page_address(page);
>>>
>>> So 4K with just 8 bytes used. Not as bad as 1/2Mbyte for the ring but
>>> still. What is wrong with just a pointer and calling put_user?
>>
>> I want to make it the start point for sharing fields between
>> user/kernel per-vm.  Just like kvm_run for per-vcpu.

This page is actually not needed at all.  Userspace can just map at
KVM_DIRTY_LOG_PAGE_OFFSET; the indices reside there.  You can drop
kvm_vm_run completely.

>>>> +	} else {
>>>> +		/*
>>>> +		 * Put onto per vm ring because no vcpu context.  Kick
>>>> +		 * vcpu0 if ring is full.
>>>
>>> What about tasks on vcpu 0? Do guests realize it's a bad idea to put
>>> critical tasks there, they will be penalized disproportionally?
>>
>> Reasonable question.  So far we can't avoid it because vcpu exit is
>> the event mechanism to say "hey please collect dirty bits".  Maybe
>> someway is better than this, but I'll need to rethink all these
>> over...
> 
> Maybe signal an eventfd, and let userspace worry about deciding what to
> do.

This has to be done synchronously.  But the vm ring should be used very
rarely (it's for things like kvmclock updates that write to guest memory
outside a vCPU), possibly a handful of times in the whole run of the VM.

>>> KVM_DIRTY_RING_MAX_ENTRIES is not part of UAPI.
>>> So how does userspace know what's legal?
>>> Do you expect it to just try?
>>
>> Yep that's what I thought. :)

We should return it for KVM_CHECK_EXTENSION.
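
Then userspace could size its request with something like (sketch,
assuming the extension check is changed to return the maximum number
of entries):

	int max_entries = ioctl(vm_fd, KVM_CHECK_EXTENSION,
				KVM_CAP_DIRTY_LOG_RING);

	if (max_entries > 0 && entries > max_entries)
		entries = max_entries;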

Paolo
Michael S. Tsirkin Dec. 12, 2019, 7:36 a.m. UTC | #37
On Thu, Dec 12, 2019 at 01:08:14AM +0100, Paolo Bonzini wrote:
> >> I'd say it won't be a big issue on locking 1/2M of host mem for a
> >> vm...
> >> Also note that if dirty ring is enabled, I plan to evaporate the
> >> dirty_bitmap in the next post. The old kvm->dirty_bitmap takes
> >> $GUEST_MEM/32K*2 mem.  E.g., for 64G guest it's 64G/32K*2=4M.  If with
> >> dirty ring of 8 vcpus, that could be 64K*8=0.5M, which could be even
> >> less memory used.
> > 
> > Right - I think Avi described the bitmap in kernel memory as one of
> > design mistakes. Why repeat that with the new design?
> 
> Do you have a source for that?

Nope, it was a private talk.

> At least the dirty bitmap has to be
> accessed from atomic context so it seems unlikely that it can be moved
> to user memory.

Why is that? We could surely do it from VCPU context?

> The dirty ring could use user memory indeed, but it would be much harder
> to set up (multiple ioctls for each ring?  what to do if userspace
> forgets one? etc.).

Why multiple ioctls? If you do it like the virtio packed ring, you just
need the base and the size.
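
Purely as an illustration, the ioctl argument could be as small as
(hypothetical, not something proposed in this series):

	struct kvm_userspace_dirty_ring {
		__u64 base;	/* userspace address of kvm_dirty_gfn[] */
		__u32 size;	/* number of entries */
		__u32 pad;
	};
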
Paolo Bonzini Dec. 12, 2019, 8:12 a.m. UTC | #38
On 12/12/19 08:36, Michael S. Tsirkin wrote:
> On Thu, Dec 12, 2019 at 01:08:14AM +0100, Paolo Bonzini wrote:
>>>> I'd say it won't be a big issue on locking 1/2M of host mem for a
>>>> vm...
>>>> Also note that if dirty ring is enabled, I plan to evaporate the
>>>> dirty_bitmap in the next post. The old kvm->dirty_bitmap takes
>>>> $GUEST_MEM/32K*2 mem.  E.g., for 64G guest it's 64G/32K*2=4M.  If with
>>>> dirty ring of 8 vcpus, that could be 64K*8=0.5M, which could be even
>>>> less memory used.
>>>
>>> Right - I think Avi described the bitmap in kernel memory as one of
>>> design mistakes. Why repeat that with the new design?
>>
>> Do you have a source for that?
> 
> Nope, it was a private talk.
> 
>> At least the dirty bitmap has to be
>> accessed from atomic context so it seems unlikely that it can be moved
>> to user memory.
> 
> Why is that? We could surely do it from VCPU context?

Spinlock is taken.

>> The dirty ring could use user memory indeed, but it would be much harder
>> to set up (multiple ioctls for each ring?  what to do if userspace
>> forgets one? etc.).
> 
> Why multiple ioctls? If you do like virtio packed ring you just need the
> base and the size.

You have multiple rings, so multiple invocations of one ioctl.

Paolo
Michael S. Tsirkin Dec. 12, 2019, 10:38 a.m. UTC | #39
On Thu, Dec 12, 2019 at 09:12:04AM +0100, Paolo Bonzini wrote:
> On 12/12/19 08:36, Michael S. Tsirkin wrote:
> > On Thu, Dec 12, 2019 at 01:08:14AM +0100, Paolo Bonzini wrote:
> >>>> I'd say it won't be a big issue on locking 1/2M of host mem for a
> >>>> vm...
> >>>> Also note that if dirty ring is enabled, I plan to evaporate the
> >>>> dirty_bitmap in the next post. The old kvm->dirty_bitmap takes
> >>>> $GUEST_MEM/32K*2 mem.  E.g., for 64G guest it's 64G/32K*2=4M.  If with
> >>>> dirty ring of 8 vcpus, that could be 64K*8=0.5M, which could be even
> >>>> less memory used.
> >>>
> >>> Right - I think Avi described the bitmap in kernel memory as one of
> >>> design mistakes. Why repeat that with the new design?
> >>
> >> Do you have a source for that?
> > 
> > Nope, it was a private talk.
> > 
> >> At least the dirty bitmap has to be
> >> accessed from atomic context so it seems unlikely that it can be moved
> >> to user memory.
> > 
> > Why is that? We could surely do it from VCPU context?
> 
> Spinlock is taken.

Right, that's an implementation detail though isn't it?

> >> The dirty ring could use user memory indeed, but it would be much harder
> >> to set up (multiple ioctls for each ring?  what to do if userspace
> >> forgets one? etc.).
> > 
> > Why multiple ioctls? If you do like virtio packed ring you just need the
> > base and the size.
> 
> You have multiple rings, so multiple invocations of one ioctl.
> 
> Paolo

Oh. So when you said "multiple ioctls for each ring" - I guess you
meant: "multiple ioctls - one for each ring"?

And it's true, but then it allows supporting things like resize in a
clean way without any effort in the kernel. You get a new ring address -
you switch to that one.
Peter Xu Dec. 13, 2019, 8:23 p.m. UTC | #40
On Wed, Dec 11, 2019 at 06:24:00PM +0100, Christophe de Dinechin wrote:
> Peter Xu writes:
> 
> > This patch is heavily based on previous work from Lei Cao
> > <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
> >
> > KVM currently uses large bitmaps to track dirty memory.  These bitmaps
> > are copied to userspace when userspace queries KVM for its dirty page
> > information.  The use of bitmaps is mostly sufficient for live
> > migration, as large parts of memory are be dirtied from one log-dirty
> > pass to another.
> 
> That statement sort of concerns me. If large parts of memory are
> dirtied, won't this cause the rings to fill up quickly enough to cause a
> lot of churn between user-space and kernel?

We have cpu-throttle in QEMU to explicitly provide some "churn" just to
slow the vcpu down.  If dirtying is heavy during migration then we
might actually prefer that churn.  Also, this should not replace the
old dirty_bitmap; it should be a new interface only.  Even if we want
to switch to this as the default, we'll definitely keep the old
interface for the scenarios where the user wants it.

> 
> See a possible suggestion to address that below.
> 
> > However, in a checkpointing system, the number of
> > dirty pages is small and in fact it is often bounded---the VM is
> > paused when it has dirtied a pre-defined number of pages. Traversing a
> > large, sparsely populated bitmap to find set bits is time-consuming,
> > as is copying the bitmap to user-space.
> >
> > A similar issue will be there for live migration when the guest memory
> > is huge while the page dirty procedure is trivial.  In that case for
> > each dirty sync we need to pull the whole dirty bitmap to userspace
> > and analyse every bit even if it's mostly zeros.
> >
> > The preferred data structure for above scenarios is a dense list of
> > guest frame numbers (GFN).  This patch series stores the dirty list in
> > kernel memory that can be memory mapped into userspace to allow speedy
> > harvesting.
> >
> > We defined two new data structures:
> >
> >   struct kvm_dirty_ring;
> >   struct kvm_dirty_ring_indexes;
> >
> > Firstly, kvm_dirty_ring is defined to represent a ring of dirty
> > pages.  When dirty tracking is enabled, we can push dirty gfn onto the
> > ring.
> >
> > Secondly, kvm_dirty_ring_indexes is defined to represent the
> > user/kernel interface of each ring.  Currently it contains two
> > indexes: (1) avail_index represents where we should push our next
> > PFN (written by kernel), while (2) fetch_index represents where the
> > userspace should fetch the next dirty PFN (written by userspace).
> >
> > One complete ring is composed by one kvm_dirty_ring plus its
> > corresponding kvm_dirty_ring_indexes.
> >
> > Currently, we have N+1 rings for each VM of N vcpus:
> >
> >   - for each vcpu, we have 1 per-vcpu dirty ring,
> >   - for each vm, we have 1 per-vm dirty ring
> >
> > Please refer to the documentation update in this patch for more
> > details.
> >
> > Note that this patch implements the core logic of dirty ring buffer.
> > It's still disabled for all archs for now.  Also, we'll address some
> > of the other issues in follow up patches before it's firstly enabled
> > on x86.
> >
> > [1] https://patchwork.kernel.org/patch/10471409/
> >
> > Signed-off-by: Lei Cao <lei.cao@stratus.com>
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  Documentation/virt/kvm/api.txt | 109 +++++++++++++++
> >  arch/x86/kvm/Makefile          |   3 +-
> >  include/linux/kvm_dirty_ring.h |  67 +++++++++
> >  include/linux/kvm_host.h       |  33 +++++
> >  include/linux/kvm_types.h      |   1 +
> >  include/uapi/linux/kvm.h       |  36 +++++
> >  virt/kvm/dirty_ring.c          | 156 +++++++++++++++++++++
> >  virt/kvm/kvm_main.c            | 240 ++++++++++++++++++++++++++++++++-
> >  8 files changed, 642 insertions(+), 3 deletions(-)
> >  create mode 100644 include/linux/kvm_dirty_ring.h
> >  create mode 100644 virt/kvm/dirty_ring.c
> >
> > diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> > index 49183add44e7..fa622c9a2eb8 100644
> > --- a/Documentation/virt/kvm/api.txt
> > +++ b/Documentation/virt/kvm/api.txt
> > @@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
> >  It is thus encouraged to use the vm ioctl to query for capabilities (available
> >  with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
> >
> > +
> >  4.5 KVM_GET_VCPU_MMAP_SIZE
> >
> >  Capability: basic
> > @@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
> >  memory region.  This ioctl returns the size of that region.  See the
> >  KVM_RUN documentation for details.
> >
> > +Besides the size of the KVM_RUN communication region, other areas of
> > +the VCPU file descriptor can be mmap-ed, including:
> > +
> > +- if KVM_CAP_COALESCED_MMIO is available, a page at
> > +  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
> > +  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
> > +  KVM_CAP_COALESCED_MMIO is not documented yet.
> 
> Does the above really belong to this patch?

Probably not.  But sure, I can move that out in my next post.

> 
> > +
> > +- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> > +  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
> > +  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
> > +
> >
> >  4.6 KVM_SET_MEMORY_REGION
> >
> > @@ -5358,6 +5371,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
> >  AArch64, this value will be reported in the ISS field of ESR_ELx.
> >
> >  See KVM_CAP_VCPU_EVENTS for more details.
> > +
> >  8.20 KVM_CAP_HYPERV_SEND_IPI
> >
> >  Architectures: x86
> > @@ -5365,6 +5379,7 @@ Architectures: x86
> >  This capability indicates that KVM supports paravirtualized Hyper-V IPI send
> >  hypercalls:
> >  HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> > +
> >  8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
> >
> >  Architecture: x86
> > @@ -5378,3 +5393,97 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
> >  flush hypercalls by Hyper-V) so userspace should disable KVM identification
> >  in CPUID and only exposes Hyper-V identification. In this case, guest
> >  thinks it's running on Hyper-V and only use Hyper-V hypercalls.
> > +
> > +8.22 KVM_CAP_DIRTY_LOG_RING
> > +
> > +Architectures: x86
> > +Parameters: args[0] - size of the dirty log ring
> > +
> > +KVM is capable of tracking dirty memory using ring buffers that are
> > +mmaped into userspace; there is one dirty ring per vcpu and one global
> > +ring per vm.
> > +
> > +One dirty ring has the following two major structures:
> > +
> > +struct kvm_dirty_ring {
> > +	u16 dirty_index;
> > +	u16 reset_index;
> 
> What is the benefit of using u16 for that? That means with 4K pages, you
> can share at most 256M of dirty memory each time? That seems low to me,
> especially since it's sufficient to touch one byte in a page to dirty it.
> 
> Actually, this is not consistent with the definition in the code ;-)
> So I'll assume it's actually u32.

Yes it's u32 now.  Actually I believe at least Paolo would prefer u16
more. :)

I think even u16 would be mostly enough (note that the maximum allowed
value is currently only 64K entries, which is not that big).  Again,
the point is that userspace should be collecting the dirty bits
regularly, so the ring shouldn't easily become full.  Even if it does,
we should probably let the vcpu stop for a while as explained above.
It'll only be inefficient if we set the size to a value that is too
small, imho.

> 
> > +	u32 size;
> > +	u32 soft_limit;
> > +	spinlock_t lock;
> > +	struct kvm_dirty_gfn *dirty_gfns;
> > +};
> > +
> > +struct kvm_dirty_ring_indexes {
> > +	__u32 avail_index; /* set by kernel */
> > +	__u32 fetch_index; /* set by userspace */
> > +};
> > +
> > +While for each of the dirty entry it's defined as:
> > +
> > +struct kvm_dirty_gfn {
> > +        __u32 pad;
> > +        __u32 slot; /* as_id | slot_id */
> > +        __u64 offset;
> > +};
> 
> Like other have suggested, I think we might used "pad" to store size
> information to be able to dirty large pages more efficiently.

As explained in the other thread, KVM should only trap dirty bits in
4K granularity, never in huge page sizes.

> 
> > +
> > +The fields in kvm_dirty_ring will be only internal to KVM itself,
> > +while the fields in kvm_dirty_ring_indexes will be exposed to
> > +userspace to be either read or written.
> 
> The sentence above is confusing when contrasted with the "set by kernel"
> comment above.

Maybe "kvm_dirty_ring_indexes will be exposed to both KVM and
userspace" to be clearer?

"set by kernel" means kernel will write to it, then the userspace will
still need to read from it.

> 
> > +
> > +The two indices in the ring buffer are free running counters.
> 
> Nit: this patch uses both "indices" and "indexes".
> Both are correct, but it would be nice to be consistent.

I'll follow the original patch and change everything to "indices".

> 
> > +
> > +In pseudocode, processing the ring buffer looks like this:
> > +
> > +	idx = load-acquire(&ring->fetch_index);
> > +	while (idx != ring->avail_index) {
> > +		struct kvm_dirty_gfn *entry;
> > +		entry = &ring->dirty_gfns[idx & (size - 1)];
> > +		...
> > +
> > +		idx++;
> > +	}
> > +	ring->fetch_index = idx;
> > +
> > +Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
> > +to enable this capability for the new guest and set the size of the
> > +rings.  It is only allowed before creating any vCPU, and the size of
> > +the ring must be a power of two.  The larger the ring buffer, the less
> > +likely the ring is full and the VM is forced to exit to userspace. The
> > +optimal size depends on the workload, but it is recommended that it be
> > +at least 64 KiB (4096 entries).
> 
> Is there anything in the design that would preclude resizing the ring
> buffer at a later time? Presumably, you'd want a large ring while you
> are doing things like migrations, but it's mostly useless when you are
> not monitoring memory. So it would be nice to be able to call
> KVM_ENABLE_CAP at any time to adjust the size.

It would be scary to me to have it adjustable at any time...  Even
while dirty gfns are being pushed onto the ring?  We'd need to handle
all of those complexities...

IMHO such a feature doesn't really help that much, so I'd prefer that
we start simple.

> 
> As I read the current code, one of the issue would be the mapping of the
> rings in case of a later extension where we added something beyond the
> rings. But I'm not sure that's a big deal at the moment.

I think we must define something to make sure the number of mapped
ring pages is bounded, so that we can still extend things later.  IMHO
that's why I introduced the maximum allowed ring size.  That provides
the limit.

> 
> > +
> > +After the capability is enabled, userspace can mmap the global ring
> > +buffer (kvm_dirty_gfn[], offset KVM_DIRTY_LOG_PAGE_OFFSET) and the
> > +indexes (kvm_dirty_ring_indexes, offset 0) from the VM file
> > +descriptor.  The per-vcpu dirty ring instead is mmapped when the vcpu
> > +is created, similar to the kvm_run struct (kvm_dirty_ring_indexes
> > +locates inside kvm_run, while kvm_dirty_gfn[] at offset
> > +KVM_DIRTY_LOG_PAGE_OFFSET).
> > +
> > +Just like for dirty page bitmaps, the buffer tracks writes to
> > +all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
> > +set in KVM_SET_USER_MEMORY_REGION.  Once a memory region is registered
> > +with the flag set, userspace can start harvesting dirty pages from the
> > +ring buffer.
> > +
> > +To harvest the dirty pages, userspace accesses the mmaped ring buffer
> > +to read the dirty GFNs up to avail_index, and sets the fetch_index
> > +accordingly.  This can be done when the guest is running or paused,
> > +and dirty pages need not be collected all at once.  After processing
> > +one or more entries in the ring buffer, userspace calls the VM ioctl
> > +KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
> > +fetch_index and to mark those pages clean.  Therefore, the ioctl
> > +must be called *before* reading the content of the dirty pages.
> 
> > +
> > +However, there is a major difference comparing to the
> > +KVM_GET_DIRTY_LOG interface in that when reading the dirty ring from
> > +userspace it's still possible that the kernel has not yet flushed the
> > +hardware dirty buffers into the kernel buffer.  To achieve that, one
> > +needs to kick the vcpu out for a hardware buffer flush (vmexit).
> 
> When you refer to "buffers", are you referring to the cache lines that
> contain the ring buffers, or to something else?
> 
> I'm a bit confused by this sentence. I think that you mean that a VCPU
> may still be running while you read its ring buffer, in which case the
> values in the ring buffer are not necessarily in memory yet, so not
> visible to a different CPU. But I wonder if you can't make this
> requirement to cause a vmexit unnecessary by carefully ordering the
> writes, to make sure that the fetch_index is updated only after the
> corresponding ring entries have been written to memory,
> 
> In other words, as seen by user-space, you would not care that the ring
> entries have not been flushed as long as the fetch_index itself is
> guaranteed to still be behind the not-flushed-yet entries.
> 
> (I would know how to do that on a different architecture, not sure for x86)

Sorry for not being clear.  Do you mean the "hardware dirty buffers"?
For Intel, that could be PML.  Vmexits guarantee that even the PML
buffers are flushed to the dirty rings.  It has nothing to do with
cache lines.

I used "hardware dirty buffer" only because this document is for KVM
in general, while PML is only one way to do such buffering.  I can add
"(for example, PML)" to make it clearer if you like.

> 
> > +
> > +If one of the ring buffers is full, the guest will exit to userspace
> > +with the exit reason set to KVM_EXIT_DIRTY_LOG_FULL, and the
> > +KVM_RUN ioctl will return -EINTR. Once that happens, userspace
> > +should pause all the vcpus, then harvest all the dirty pages and
> > +rearm the dirty traps. It can unpause the guest after that.
> 
> Except for the condition above, why is it necessary to pause other VCPUs
> than the one being harvested?

This is a good question.  Paolo can correct me if I'm wrong.

Firstly, I think this should rarely happen if userspace is collecting
the dirty bits from time to time.  If it does happen, we'll need to
call KVM_RESET_DIRTY_RINGS to reset all the rings.  So the question
really becomes: do we want a per-vcpu KVM_RESET_DIRTY_RINGS?

My answer is that it would probably be overkill.  The important thing
is that, no matter what, KVM_RESET_DIRTY_RINGS needs to change the
page tables and kick all vcpus for TLB flushes.  If we must do that,
we'd better do it as rarely as possible.  With per-vcpu ring resets
we'd do N*N vcpu kicks in the bad case (N kicks per per-vcpu ring
reset, with roughly N vcpus).  If we stick to the simple per-vm reset,
it kicks all vcpus for TLB flushing anyway, so it seems easier to
collect all of the rings together and reset them all at once.
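
(As a concrete example: with N = 8 vcpus, the bad case for per-vcpu
resets would be up to 8 * 8 = 64 kicks, versus 8 kicks for a single
per-vm reset.)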

> 
> 
> > diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> > index b19ef421084d..0acee817adfb 100644
> > --- a/arch/x86/kvm/Makefile
> > +++ b/arch/x86/kvm/Makefile
> > @@ -5,7 +5,8 @@ ccflags-y += -Iarch/x86/kvm
> >  KVM := ../../../virt/kvm
> >
> >  kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> > -				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> > +				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
> > +				$(KVM)/dirty_ring.o
> >  kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
> >
> >  kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
> > diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> > new file mode 100644
> > index 000000000000..8335635b7ff7
> > --- /dev/null
> > +++ b/include/linux/kvm_dirty_ring.h
> > @@ -0,0 +1,67 @@
> > +#ifndef KVM_DIRTY_RING_H
> > +#define KVM_DIRTY_RING_H
> > +
> > +/*
> > + * struct kvm_dirty_ring is defined in include/uapi/linux/kvm.h.
> > + *
> > + * dirty_ring:  shared with userspace via mmap. It is the compact list
> > + *              that holds the dirty pages.
> > + * dirty_index: free running counter that points to the next slot in
> > + *              dirty_ring->dirty_gfns  where a new dirty page should go.
> > + * reset_index: free running counter that points to the next dirty page
> > + *              in dirty_ring->dirty_gfns for which dirty trap needs to
> > + *              be reenabled
> > + * size:        size of the compact list, dirty_ring->dirty_gfns
> > + * soft_limit:  when the number of dirty pages in the list reaches this
> > + *              limit, vcpu that owns this ring should exit to userspace
> > + *              to allow userspace to harvest all the dirty pages
> > + * lock:        protects dirty_ring, only in use if this is the global
> > + *              ring
> 
> If that's not used for vcpu rings, maybe move it out of kvm_dirty_ring?

Yeah we can.

> 
> > + *
> > + * The number of dirty pages in the ring is calculated by,
> > + * dirty_index - reset_index
> 
> Nit: the code calls it "used" (in kvm_dirty_ring_used). Maybe find an
> unambiguous terminology. What about "posted", as in
> 
> The number of posted dirty pages, i.e. the number of dirty pages in the
> ring, is calculated as dirty_index - reset_index by function
> kvm_dirty_ring_posted
> 
> (Replace "posted" by any adjective of your liking)

Sure.

(Or maybe I'll just remove these lines to avoid introducing any new
 terminology unless it's really necessary... after all, similar things
 will be mentioned in the documentation and in the code itself.)

> 
> > + *
> > + * kernel increments dirty_ring->indices.avail_index after dirty index
> > + * is incremented. When userspace harvests the dirty pages, it increments
> > + * dirty_ring->indices.fetch_index up to dirty_ring->indices.avail_index.
> > + * When kernel reenables dirty traps for the dirty pages, it increments
> > + * reset_index up to dirty_ring->indices.fetch_index.
> 
> Userspace should not be trusted to be doing this, see below.
> 
> 
> > + *
> > + */
> > +struct kvm_dirty_ring {
> > +	u32 dirty_index;
> > +	u32 reset_index;
> > +	u32 size;
> > +	u32 soft_limit;
> > +	spinlock_t lock;
> > +	struct kvm_dirty_gfn *dirty_gfns;
> > +};
> > +
> > +u32 kvm_dirty_ring_get_rsvd_entries(void);
> > +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring);
> > +
> > +/*
> > + * called with kvm->slots_lock held, returns the number of
> > + * processed pages.
> > + */
> > +int kvm_dirty_ring_reset(struct kvm *kvm,
> > +			 struct kvm_dirty_ring *ring,
> > +			 struct kvm_dirty_ring_indexes *indexes);
> > +
> > +/*
> > + * returns 0: successfully pushed
> > + *         1: successfully pushed, soft limit reached,
> > + *            vcpu should exit to userspace
> > + *         -EBUSY: unable to push, dirty ring full.
> > + */
> > +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> > +			struct kvm_dirty_ring_indexes *indexes,
> > +			u32 slot, u64 offset, bool lock);
> > +
> > +/* for use in vm_operations_struct */
> > +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i);
> 
> Not very clear what 'i' means, seems to be a page offset based on call sites?

I'll rename it to "offset".

> 
> > +
> > +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
> > +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring);
> > +
> > +#endif
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 498a39462ac1..7b747bc9ff3e 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -34,6 +34,7 @@
> >  #include <linux/kvm_types.h>
> >
> >  #include <asm/kvm_host.h>
> > +#include <linux/kvm_dirty_ring.h>
> >
> >  #ifndef KVM_MAX_VCPU_ID
> >  #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
> > @@ -146,6 +147,7 @@ static inline bool is_error_page(struct page *page)
> >  #define KVM_REQ_MMU_RELOAD        (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> >  #define KVM_REQ_PENDING_TIMER     2
> >  #define KVM_REQ_UNHALT            3
> > +#define KVM_REQ_DIRTY_RING_FULL   4
> >  #define KVM_REQUEST_ARCH_BASE     8
> >
> >  #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
> > @@ -321,6 +323,7 @@ struct kvm_vcpu {
> >  	bool ready;
> >  	struct kvm_vcpu_arch arch;
> >  	struct dentry *debugfs_dentry;
> > +	struct kvm_dirty_ring dirty_ring;
> >  };
> >
> >  static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> > @@ -501,6 +504,10 @@ struct kvm {
> >  	struct srcu_struct srcu;
> >  	struct srcu_struct irq_srcu;
> >  	pid_t userspace_pid;
> > +	/* Data structure to be exported by mmap(kvm->fd, 0) */
> > +	struct kvm_vm_run *vm_run;
> > +	u32 dirty_ring_size;
> > +	struct kvm_dirty_ring vm_dirty_ring;
> 
> If you remove the lock from struct kvm_dirty_ring, you could just put it there.

Ok.

> 
> >  };
> >
> >  #define kvm_err(fmt, ...) \
> > @@ -832,6 +839,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> >  					gfn_t gfn_offset,
> >  					unsigned long mask);
> >
> > +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
> > +
> >  int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
> >  				struct kvm_dirty_log *log);
> >  int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> > @@ -1411,4 +1420,28 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
> >  				uintptr_t data, const char *name,
> >  				struct task_struct **thread_ptr);
> >
> > +/*
> > + * This defines how many reserved entries we want to keep before we
> > + * kick the vcpu to the userspace to avoid dirty ring full.  This
> > + * value can be tuned to higher if e.g. PML is enabled on the host.
> > + */
> > +#define  KVM_DIRTY_RING_RSVD_ENTRIES  64
> > +
> > +/* Max number of entries allowed for each kvm dirty ring */
> > +#define  KVM_DIRTY_RING_MAX_ENTRIES  65536
> > +
> > +/*
> > + * Arch needs to define these macro after implementing the dirty ring
> > + * feature.  KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
> > + * starting page offset of the dirty ring structures, while
> > + * KVM_DIRTY_RING_VERSION should be defined as >=1.  By default, this
> > + * feature is off on all archs.
> > + */
> > +#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
> > +#define KVM_DIRTY_LOG_PAGE_OFFSET 0
> > +#endif
> > +#ifndef KVM_DIRTY_RING_VERSION
> > +#define KVM_DIRTY_RING_VERSION 0
> > +#endif
> > +
> >  #endif
> > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > index 1c88e69db3d9..d9d03eea145a 100644
> > --- a/include/linux/kvm_types.h
> > +++ b/include/linux/kvm_types.h
> > @@ -11,6 +11,7 @@ struct kvm_irq_routing_table;
> >  struct kvm_memory_slot;
> >  struct kvm_one_reg;
> >  struct kvm_run;
> > +struct kvm_vm_run;
> >  struct kvm_userspace_memory_region;
> >  struct kvm_vcpu;
> >  struct kvm_vcpu_init;
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index e6f17c8e2dba..0b88d76d6215 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
> >  #define KVM_EXIT_IOAPIC_EOI       26
> >  #define KVM_EXIT_HYPERV           27
> >  #define KVM_EXIT_ARM_NISV         28
> > +#define KVM_EXIT_DIRTY_RING_FULL  29
> >
> >  /* For KVM_EXIT_INTERNAL_ERROR */
> >  /* Emulate instruction failed. */
> > @@ -247,6 +248,11 @@ struct kvm_hyperv_exit {
> >  /* Encounter unexpected vm-exit reason */
> >  #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON	4
> >
> > +struct kvm_dirty_ring_indexes {
> > +	__u32 avail_index; /* set by kernel */
> > +	__u32 fetch_index; /* set by userspace */
> > +};
> > +
> >  /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
> >  struct kvm_run {
> >  	/* in */
> > @@ -421,6 +427,13 @@ struct kvm_run {
> >  		struct kvm_sync_regs regs;
> >  		char padding[SYNC_REGS_SIZE_BYTES];
> >  	} s;
> > +
> > +	struct kvm_dirty_ring_indexes vcpu_ring_indexes;
> > +};
> > +
> > +/* Returned by mmap(kvm->fd, offset=0) */
> > +struct kvm_vm_run {
> > +	struct kvm_dirty_ring_indexes vm_ring_indexes;
> >  };
> >
> >  /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> > @@ -1009,6 +1022,7 @@ struct kvm_ppc_resize_hpt {
> >  #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
> >  #define KVM_CAP_ARM_NISV_TO_USER 177
> >  #define KVM_CAP_ARM_INJECT_EXT_DABT 178
> > +#define KVM_CAP_DIRTY_LOG_RING 179
> >
> >  #ifdef KVM_CAP_IRQ_ROUTING
> >
> > @@ -1472,6 +1486,9 @@ struct kvm_enc_region {
> >  /* Available with KVM_CAP_ARM_SVE */
> >  #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
> >
> > +/* Available with KVM_CAP_DIRTY_LOG_RING */
> > +#define KVM_RESET_DIRTY_RINGS     _IO(KVMIO, 0xc3)
> > +
> >  /* Secure Encrypted Virtualization command */
> >  enum sev_cmd_id {
> >  	/* Guest initialization commands */
> > @@ -1622,4 +1639,23 @@ struct kvm_hyperv_eventfd {
> >  #define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
> >  #define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
> >
> > +/*
> > + * The following are the requirements for supporting dirty log ring
> > + * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
> > + *
> > + * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
> > + *    of kvm_write_* so that the global dirty ring is not filled up
> > + *    too quickly.
> > + * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
> > + *    enabling dirty logging.
> > + * 3. There should not be a separate step to synchronize hardware
> > + *    dirty bitmap with KVM's.
> > + */
> > +
> > +struct kvm_dirty_gfn {
> > +	__u32 pad;
> > +	__u32 slot;
> > +	__u64 offset;
> > +};
> > +
> >  #endif /* __LINUX_KVM_H */
> > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > new file mode 100644
> > index 000000000000..9264891f3c32
> > --- /dev/null
> > +++ b/virt/kvm/dirty_ring.c
> > @@ -0,0 +1,156 @@
> > +#include <linux/kvm_host.h>
> > +#include <linux/kvm.h>
> > +#include <linux/vmalloc.h>
> > +#include <linux/kvm_dirty_ring.h>
> > +
> > +u32 kvm_dirty_ring_get_rsvd_entries(void)
> > +{
> > +	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> > +}
> > +
> > +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > +{
> > +	u32 size = kvm->dirty_ring_size;
> > +
> > +	ring->dirty_gfns = vmalloc(size);
> > +	if (!ring->dirty_gfns)
> > +		return -ENOMEM;
> > +	memset(ring->dirty_gfns, 0, size);
> > +
> > +	ring->size = size / sizeof(struct kvm_dirty_gfn);
> > +	ring->soft_limit =
> > +	    (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -
> > +	    kvm_dirty_ring_get_rsvd_entries();
> 
> Minor, but what about
> 
>        ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries();

Yeah it's better.

> 
> 
> > +	ring->dirty_index = 0;
> > +	ring->reset_index = 0;
> > +	spin_lock_init(&ring->lock);
> > +
> > +	return 0;
> > +}
> > +
> > +int kvm_dirty_ring_reset(struct kvm *kvm,
> > +			 struct kvm_dirty_ring *ring,
> > +			 struct kvm_dirty_ring_indexes *indexes)
> > +{
> > +	u32 cur_slot, next_slot;
> > +	u64 cur_offset, next_offset;
> > +	unsigned long mask;
> > +	u32 fetch;
> > +	int count = 0;
> > +	struct kvm_dirty_gfn *entry;
> > +
> > +	fetch = READ_ONCE(indexes->fetch_index);
> 
> If I understand correctly, if a malicious user-space writes
> ring->reset_index + 1 into fetch_index, the loop below will execute 4
> billion times.
> 
> 
> > +	if (fetch == ring->reset_index)
> > +		return 0;
> 
> To protect against scenario above, I would have something like:
> 
> 	if (fetch - ring->reset_index >= ring->size)
> 		return -EINVAL;

Good point...  Actually I've got this in my latest branch already, but
still thanks for noticing this!

> 
> > +
> > +	entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > +	/*
> > +	 * The ring buffer is shared with userspace, which might mmap
> > +	 * it and concurrently modify slot and offset.  Userspace must
> > +	 * not be trusted!  READ_ONCE prevents the compiler from changing
> > +	 * the values after they've been range-checked (the checks are
> > +	 * in kvm_reset_dirty_gfn).
> > +	 */
> > +	smp_read_barrier_depends();
> > +	cur_slot = READ_ONCE(entry->slot);
> > +	cur_offset = READ_ONCE(entry->offset);
> > +	mask = 1;
> > +	count++;
> > +	ring->reset_index++;

[1]

> > +	while (ring->reset_index != fetch) {
> > +		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > +		smp_read_barrier_depends();
> > +		next_slot = READ_ONCE(entry->slot);
> > +		next_offset = READ_ONCE(entry->offset);
> > +		ring->reset_index++;
> > +		count++;
> > +		/*
> > +		 * Try to coalesce the reset operations when the guest is
> > +		 * scanning pages in the same slot.
> > +		 */
> > +		if (next_slot == cur_slot) {
> > +			int delta = next_offset - cur_offset;
> 
> Since you diff two u64, shouldn't that be an i64 rather than int?

I found there's no i64, so I'm using "long long".

> 
> > +
> > +			if (delta >= 0 && delta < BITS_PER_LONG) {
> > +				mask |= 1ull << delta;
> > +				continue;
> > +			}
> > +
> > +			/* Backwards visit, careful about overflows!  */
> > +			if (delta > -BITS_PER_LONG && delta < 0 &&
> > +			    (mask << -delta >> -delta) == mask) {
> > +				cur_offset = next_offset;
> > +				mask = (mask << -delta) | 1;
> > +				continue;
> > +			}
> > +		}
> > +		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > +		cur_slot = next_slot;
> > +		cur_offset = next_offset;
> > +		mask = 1;
> > +	}
> > +	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> 
> So if you did not coalesce the last one, you call kvm_reset_dirty_gfn
> twice? Something smells weird about this loop ;-) I have a gut feeling
> that it could be done in a single while loop combined with the entry
> test, but I may be wrong.

It should be easy to save a few lines at [1] by introducing a boolean
"first_round".  I don't see an easy way to avoid the kvm_reset_dirty_gfn()
call at the end though...
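For illustration, a minimal, untested sketch of that idea (not the posted
patch; it just folds the pre-loop read at [1] into the loop with a
"first_round" flag and keeps the same coalescing logic, plus the bound
check on fetch_index suggested earlier in the review):

	int kvm_dirty_ring_reset(struct kvm *kvm,
				 struct kvm_dirty_ring *ring,
				 struct kvm_dirty_ring_indexes *indexes)
	{
		u32 cur_slot = 0, next_slot;
		u64 cur_offset = 0, next_offset;
		unsigned long mask = 0;
		bool first_round = true;
		u32 fetch;
		int count = 0;
		struct kvm_dirty_gfn *entry;

		fetch = READ_ONCE(indexes->fetch_index);
		/* Guard against a malicious fetch_index, as suggested in the review */
		if (fetch - ring->reset_index >= ring->size)
			return -EINVAL;

		while (ring->reset_index != fetch) {
			entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
			smp_read_barrier_depends();
			next_slot = READ_ONCE(entry->slot);
			next_offset = READ_ONCE(entry->offset);
			ring->reset_index++;
			count++;
			/* Same coalescing logic as the posted loop */
			if (!first_round && next_slot == cur_slot) {
				long long delta = next_offset - cur_offset;

				if (delta >= 0 && delta < BITS_PER_LONG) {
					mask |= 1ull << delta;
					continue;
				}
				/* Backwards visit, careful about overflows! */
				if (delta > -BITS_PER_LONG && delta < 0 &&
				    (mask << -delta >> -delta) == mask) {
					cur_offset = next_offset;
					mask = (mask << -delta) | 1;
					continue;
				}
			}
			/* Flush the previous batch before starting a new one */
			if (!first_round)
				kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
			cur_slot = next_slot;
			cur_offset = next_offset;
			mask = 1;
			first_round = false;
		}
		/* Flush the final batch; skipped entirely if nothing was harvested */
		if (!first_round)
			kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);

		return count;
	}

The single kvm_reset_dirty_gfn() call at the end is indeed still needed,
to flush the last coalesced batch.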

> 
> 
> > +
> > +	return count;
> > +}
> > +
> > +static inline u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
> > +{
> > +	return ring->dirty_index - ring->reset_index;
> > +}
> > +
> > +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
> > +{
> > +	return kvm_dirty_ring_used(ring) >= ring->size;
> > +}
> > +
> > +/*
> > + * Returns:
> > + *   >0 if we should kick the vcpu out,
> > + *   =0 if the gfn pushed successfully, or,
> > + *   <0 if error (e.g. ring full)
> > + */
> > +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> > +			struct kvm_dirty_ring_indexes *indexes,
> > +			u32 slot, u64 offset, bool lock)
> 
> Obviously, if you go with the suggestion to have a "lock" only in struct
> kvm, then you'd have to pass a lock ptr instead of a bool.

Paolo got a better solution on that.  That "lock" will be dropped.

> 
> > +{
> > +	int ret;
> > +	struct kvm_dirty_gfn *entry;
> > +
> > +	if (lock)
> > +		spin_lock(&ring->lock);
> > +
> > +	if (kvm_dirty_ring_full(ring)) {
> > +		ret = -EBUSY;
> > +		goto out;
> > +	}
> > +
> > +	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> > +	entry->slot = slot;
> > +	entry->offset = offset;
> > +	smp_wmb();
> > +	ring->dirty_index++;
> > +	WRITE_ONCE(indexes->avail_index, ring->dirty_index);
> 
> Following up on comment about having to vmexit other VCPUs above:
> If you have a write barrier for the entry, and then a write once for the
> index, isn't that sufficient to ensure that another CPU will pick up the
> right values in the right order?

I think so.  I've replied above on the RESET issue.

> 
> 
> > +	ret = kvm_dirty_ring_used(ring) >= ring->soft_limit;
> > +	pr_info("%s: slot %u offset %llu used %u\n",
> > +		__func__, slot, offset, kvm_dirty_ring_used(ring));
> > +
> > +out:
> > +	if (lock)
> > +		spin_unlock(&ring->lock);
> > +
> > +	return ret;
> > +}
> > +
> > +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i)
> 
> Still don't like 'i' :-)
> 
> 
> (Stopped my review here for lack of time, decided to share what I had so far)

Thanks for your comments!
Paolo Bonzini Dec. 14, 2019, 7:57 a.m. UTC | #41
On 13/12/19 21:23, Peter Xu wrote:
>> What is the benefit of using u16 for that? That means with 4K pages, you
>> can share at most 256M of dirty memory each time? That seems low to me,
>> especially since it's sufficient to touch one byte in a page to dirty it.
>>
>> Actually, this is not consistent with the definition in the code ;-)
>> So I'll assume it's actually u32.
> Yes it's u32 now.  Actually I believe at least Paolo would prefer u16
> more. :)

It has to be u16, because it overlaps the padding of the first entry.
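(For context, the earlier union-based layout looked roughly like the sketch
below; this is reconstructed from the discussion, so the exact field names
are approximate.  The two u16 indices share storage with the pad field of
the first entry:)

	struct kvm_dirty_gfns {
		union {
			struct {
				__u16 avail_index;	/* written by kernel */
				__u16 fetch_index;	/* written by userspace */
			} indices;
			struct kvm_dirty_gfn dirty_gfns[0];
		};
	};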

Paolo

> I think even u16 would be mostly enough (if you see, the maximum
> allowed value currently is 64K entries only, not a big one).  Again,
> the thing is that the userspace should be collecting the dirty bits,
> so the ring shouldn't reach full easily.  Even if it does, we should
> probably let it stop for a while as explained above.  It'll be
> inefficient only if we set it to a too-small value, imho.
>
Peter Xu Dec. 14, 2019, 4:26 p.m. UTC | #42
On Sat, Dec 14, 2019 at 08:57:26AM +0100, Paolo Bonzini wrote:
> On 13/12/19 21:23, Peter Xu wrote:
> >> What is the benefit of using u16 for that? That means with 4K pages, you
> >> can share at most 256M of dirty memory each time? That seems low to me,
> >> especially since it's sufficient to touch one byte in a page to dirty it.
> >>
> >> Actually, this is not consistent with the definition in the code ;-)
> >> So I'll assume it's actually u32.
> > Yes it's u32 now.  Actually I believe at least Paolo would prefer u16
> > more. :)
> 
> It has to be u16, because it overlaps the padding of the first entry.

Hmm, could you explain?

Note that here what Christophe commented is on dirty_index,
reset_index of "struct kvm_dirty_ring", so imho it could really be
anything we want as long as it can store a u32 (which is the size of
the elements in kvm_dirty_ring_indexes).

If you were instead talking about the previous union definition of
"struct kvm_dirty_gfns" rather than "struct kvm_dirty_ring", iiuc I've
moved those indices out of it and defined kvm_dirty_ring_indexes which
we expose via kvm_run, so we don't have that limitation any more either?
Peter Xu Dec. 15, 2019, 5:21 p.m. UTC | #43
On Tue, Dec 10, 2019 at 06:09:02PM +0100, Paolo Bonzini wrote:
> On 10/12/19 16:52, Peter Xu wrote:
> > On Tue, Dec 10, 2019 at 11:07:31AM +0100, Paolo Bonzini wrote:
> >>> I'm thinking whether I can start
> >>> to use this information in the next post on solving an issue I
> >>> encountered with the waitqueue.
> >>>
> >>> Current waitqueue is still problematic in that it could wait even with
> >>> the mmu lock held when with vcpu context.
> >>
> >> I think the idea of the soft limit is that the waiting just cannot
> >> happen.  That is, the number of dirtied pages _outside_ the guest (guest
> >> accesses are taken care of by PML, and are subtracted from the soft
> >> limit) cannot exceed hard_limit - (soft_limit + pml_size).
> > 
> > So the question go backs to, whether this is guaranteed somehow?  Or
> > do you prefer us to keep the warn_on_once until it triggers then we
> > can analyze (which I doubt..)?
> 
> Yes, I would like to keep the WARN_ON_ONCE just because you never know.
> 
> Of course it would be much better to audit the calls to kvm_write_guest
> and figure out how many could trigger (e.g. two from the operands of an
> emulated instruction, 5 from a nested EPT walk, 1 from a page walk, etc.).

I would say we'd better either audit all the call sites to prove the
ring can never overflow, or keep the waitqueue at least.  The problem is
that if we release KVM with only the WARN_ON_ONCE and later find that it
can be triggered and a full ring can't be avoided, then the interface
and design are broken, and it could even be too late to fix after the
interface is published.

(Actually I was not certain about the previous clear_dirty interface,
 where we introduced a new capability for it.  I'm not sure whether that
 could have been avoided, because after all the initial version was not
 working at all, and we fixed it up without changing the interface.
 However, for this one, if we prove the design wrong in the end, then
 IMHO we must introduce another capability for it, and the interface is
 prone to change too.)

So, with the hope that we could avoid the waitqueue, I checked all the
callers of mark_page_dirty_in_slot().  Since this initial work is only
for x86, I didn't look more into other archs, assuming that can be
done later when it is implemented for other archs (and this will for
sure also cover the common code):

    mark_page_dirty_in_slot calls, per-vm (x86 only)
        __kvm_write_guest_page
            kvm_write_guest_page
                init_rmode_tss
                    vmx_set_tss_addr
                        kvm_vm_ioctl_set_tss_addr [*]
                init_rmode_identity_map
                    vmx_create_vcpu [*]
                vmx_write_pml_buffer
                    kvm_arch_write_log_dirty [&]
                kvm_write_guest
                    kvm_hv_setup_tsc_page
                        kvm_guest_time_update [&]
                    nested_flush_cached_shadow_vmcs12 [&]
                    kvm_write_wall_clock [&]
                    kvm_pv_clock_pairing [&]
                    kvmgt_rw_gpa [?]
                    kvm_write_guest_offset_cached
                        kvm_steal_time_set_preempted [&]
                        kvm_write_guest_cached
                            pv_eoi_put_user [&]
                            kvm_lapic_sync_to_vapic [&]
                            kvm_setup_pvclock_page [&]
                            record_steal_time [&]
                            apf_put_user [&]
                kvm_clear_guest_page
                    init_rmode_tss [*] (see above)
                    init_rmode_identity_map [*] (see above)
                    kvm_clear_guest
                        synic_set_msr
                            kvm_hv_set_msr [&]
        kvm_write_guest_offset_cached [&] (see above)
        mark_page_dirty
            kvm_hv_set_msr_pw [&]

We should only need to look at the leaves of the traces, because
they're where the dirty requests start.  I'm marking all the leaves
with the criteria below so it'll be easier to focus:

Cases with [*]: should not matter much
           [&]: actually with a per-vcpu context in the upper layer
           [?]: uncertain...

I'm a bit amazed after taking these notes, since I found that, besides
those that can probably be ignored (marked as [*]), most of the
remaining per-vm dirty requests actually come with a vcpu context.

Although now that we have kvm_get_running_vcpu() all the [&] cases
should be fine without changing anything, I tend to add another patch in
the next post to convert all the [&] cases to explicitly pass a vcpu
pointer instead of a kvm pointer, to be clear, if no one disagrees;
then we can verify that against kvm_get_running_vcpu().
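For example, once the [&] callers pass a vcpu explicitly, the common helper
could sanity-check it against the running vcpu.  A purely illustrative
sketch (hypothetical signature change, not from the series; the body
mirrors the existing bitmap path):

	static void mark_page_dirty_in_slot(struct kvm_vcpu *vcpu,
					    struct kvm_memory_slot *memslot,
					    gfn_t gfn)
	{
		/* The explicit vcpu must match the vcpu that is running, if any */
		WARN_ON_ONCE(vcpu && vcpu != kvm_get_running_vcpu());

		if (memslot && memslot->dirty_bitmap) {
			unsigned long rel_gfn = gfn - memslot->base_gfn;

			set_bit_le(rel_gfn, memslot->dirty_bitmap);
		}
	}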

So the only uncertainty now is kvmgt_rw_gpa() which is marked as [?].
Could this happen frequently?  I would guess the answer is we don't
know (which means it can).

> 
> > One thing to mention is that for with-vcpu cases, we probably can even
> > stop KVM_RUN immediately as long as either the per-vm or per-vcpu ring
> > reaches the softlimit, then for vcpu case it should be easier to
> > guarantee that.  What I want to know is the rest of cases like ioctls
> > or even something not from the userspace (which I think I should read
> > more later..).
> 
> Which ioctls?  Most ioctls shouldn't dirty memory at all.

init_rmode_tss or init_rmode_identity_map.  But I've marked them as
unimportant because they should only happen once at boot.

> 
> >>> And if we see if the mark_page_dirty_in_slot() is not with a vcpu
> >>> context (e.g. kvm_mmu_page_fault) but with an ioctl context (those
> >>> cases we'll use per-vm dirty ring) then it's probably fine.
> >>>
> >>> My planned solution:
> >>>
> >>> - When kvm_get_running_vcpu() != NULL, we postpone the waitqueue waits
> >>>   until we finished handling this page fault, probably in somewhere
> >>>   around vcpu_enter_guest, so that we can do wait_event() after the
> >>>   mmu lock released
> >>
> >> I think this can cause a race:
> >>
> >> 	vCPU 1			vCPU 2		host
> >> 	---------------------------------------------------------------
> >> 	mark page dirty
> >> 				write to page
> >> 						treat page as not dirty
> >> 	add page to ring
> >>
> >> where vCPU 2 skips the clean-page slow path entirely.
> > 
> > If we're still with the rule in userspace that we first do RESET then
> > collect and send the pages (just like what we've discussed before),
> > then IMHO it's fine to have vcpu2 to skip the slow path?  Because
> > RESET happens at "treat page as not dirty", then if we are sure that
> > we only collect and send pages after that point, then the latest
> > "write to page" data from vcpu2 won't be lost even if vcpu2 is not
> > blocked by vcpu1's ring full?
> 
> Good point, the race would become
> 
>  	vCPU 1			vCPU 2		host
>  	---------------------------------------------------------------
>  	mark page dirty
>  				write to page
> 						reset rings
> 						  wait for mmu lock
>  	add page to ring
> 	release mmu lock
> 						  ...do reset...
> 						  release mmu lock
> 						page is now dirty

Hmm, the page will be dirty after the reset, but is that an issue?

Or, could you help me to identify what I've missed?

> 
> > Maybe we can also consider to let mark_page_dirty_in_slot() return a
> > value, then the upper layer could have a chance to skip the spte
> > update if mark_page_dirty_in_slot() fails to mark the dirty bit, so it
> > can return directly with RET_PF_RETRY.
> 
> I don't think that's possible, most writes won't come from a page fault
> path and cannot retry.

Yep, maybe I should say it in the other way round: we only wait if
kvm_get_running_vcpu() == NULL.  Then in somewhere near
vcpu_enter_guest(), we add a check to wait if per-vcpu ring is full.
Would that work?

Thanks,
Peter Xu Dec. 15, 2019, 5:33 p.m. UTC | #44
On Thu, Dec 12, 2019 at 01:08:14AM +0100, Paolo Bonzini wrote:
> >>> What depends on what here? Looks suspicious ...
> >>
> >> Hmm, I think maybe it can be removed because the entry pointer
> >> reference below should be an ordering constraint already?
> 
> entry->xxx depends on ring->reset_index.

Yes that's true, but...

        entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
        /* barrier? */
        next_slot = READ_ONCE(entry->slot);
        next_offset = READ_ONCE(entry->offset);

... I think entry->xxx depends on entry first, then entry depends on
reset_index.  So it seems fine because all things have a dependency?

> 
> >>> what's the story around locking here? Why is it safe
> >>> not to take the lock sometimes?
> >>
> >> kvm_dirty_ring_push() will be with lock==true only when the per-vm
> >> ring is used.  For per-vcpu ring, because that will only happen with
> >> the vcpu context, then we don't need locks (so kvm_dirty_ring_push()
> >> is called with lock==false).
> 
> FWIW this will be done much more nicely in v2.
> 
> >>>> +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> >>>> +	if (!page) {
> >>>> +		r = -ENOMEM;
> >>>> +		goto out_err_alloc_page;
> >>>> +	}
> >>>> +	kvm->vm_run = page_address(page);
> >>>
> >>> So 4K with just 8 bytes used. Not as bad as 1/2Mbyte for the ring but
> >>> still. What is wrong with just a pointer and calling put_user?
> >>
> >> I want to make it the start point for sharing fields between
> >> user/kernel per-vm.  Just like kvm_run for per-vcpu.
> 
> This page is actually not needed at all.  Userspace can just map at
> KVM_DIRTY_LOG_PAGE_OFFSET, the indices reside there.  You can drop
> kvm_vm_run completely.

I changed it because otherwise we use the padding of one entry, and the
padding of all the other entries becomes wasted memory: we can never
really turn the padding into new fields, since only the 1st entry's
padding overlaps with the indices.  IMHO that could even waste more than 4k.

(for now we only "waste" 4K for the per-vm case; kvm_run is already
 mapped so there's no waste there, not to mention that I still think we
 can potentially use kvm_vm_run in the future)

> 
> >>>> +	} else {
> >>>> +		/*
> >>>> +		 * Put onto per vm ring because no vcpu context.  Kick
> >>>> +		 * vcpu0 if ring is full.
> >>>
> >>> What about tasks on vcpu 0? Do guests realize it's a bad idea to put
> >>> critical tasks there, they will be penalized disproportionally?
> >>
> >> Reasonable question.  So far we can't avoid it because vcpu exit is
> >> the event mechanism to say "hey please collect dirty bits".  Maybe
> >> someway is better than this, but I'll need to rethink all these
> >> over...
> > 
> > Maybe signal an eventfd, and let userspace worry about deciding what to
> > do.
> 
> This has to be done synchronously.  But the vm ring should be used very
> rarely (it's for things like kvmclock updates that write to guest memory
> outside a vCPU), possibly a handful of times in the whole run of the VM.

I've summarized a list of callers that might dirty guest memory in the
other thread; it seems to me that even the kvm clock is using per-vcpu
contexts.

> 
> >>> KVM_DIRTY_RING_MAX_ENTRIES is not part of UAPI.
> >>> So how does userspace know what's legal?
> >>> Do you expect it to just try?
> >>
> >> Yep that's what I thought. :)
> 
> We should return it for KVM_CHECK_EXTENSION.

OK.  I'll drop the versioning.
Paolo Bonzini Dec. 16, 2019, 9:29 a.m. UTC | #45
On 14/12/19 17:26, Peter Xu wrote:
> On Sat, Dec 14, 2019 at 08:57:26AM +0100, Paolo Bonzini wrote:
>> On 13/12/19 21:23, Peter Xu wrote:
>>>> What is the benefit of using u16 for that? That means with 4K pages, you
>>>> can share at most 256M of dirty memory each time? That seems low to me,
>>>> especially since it's sufficient to touch one byte in a page to dirty it.
>>>>
>>>> Actually, this is not consistent with the definition in the code ;-)
>>>> So I'll assume it's actually u32.
>>> Yes it's u32 now.  Actually I believe at least Paolo would prefer u16
>>> more. :)
>>
>> It has to be u16, because it overlaps the padding of the first entry.
> 
> Hmm, could you explain?
> 
> Note that here what Christophe commented is on dirty_index,
> reset_index of "struct kvm_dirty_ring", so imho it could really be
> anything we want as long as it can store a u32 (which is the size of
> the elements in kvm_dirty_ring_indexes).
> 
> If you were instead talking about the previous union definition of
> "struct kvm_dirty_gfns" rather than "struct kvm_dirty_ring", iiuc I've
> moved those indices out of it and defined kvm_dirty_ring_indexes which
> we expose via kvm_run, so we don't have that limitation as well any
> more?

Yeah, I meant that since the size has (had) to be u16 in the union, it
need not be bigger in kvm_dirty_ring.

I don't think having more than 2^16 entries in the *per-CPU* ring buffer
makes sense; lagging in recording dirty memory by more than 256 MiB per
CPU would mean a large pause later on resetting the ring buffers (your
KVM_CLEAR_DIRTY_LOG patches found the sweet spot to be around 1 GiB for
the whole system).

So I liked the union, but if you removed it you might as well align the
producer and consumer indices to 64 bytes so that they are in separate
cache lines.
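For illustration, such an alignment could look roughly like this
(hypothetical layout, not the posted definition):

	struct kvm_dirty_ring_indexes {
		__u32 avail_index;	/* producer, written by kernel */
		__u32 padding1[15];	/* pad to a 64-byte cache line */
		__u32 fetch_index;	/* consumer, written by userspace */
		__u32 padding2[15];	/* pad to a 64-byte cache line */
	};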

Paolo
Michael S. Tsirkin Dec. 16, 2019, 9:47 a.m. UTC | #46
On Sun, Dec 15, 2019 at 12:33:02PM -0500, Peter Xu wrote:
> On Thu, Dec 12, 2019 at 01:08:14AM +0100, Paolo Bonzini wrote:
> > >>> What depends on what here? Looks suspicious ...
> > >>
> > >> Hmm, I think maybe it can be removed because the entry pointer
> > >> reference below should be an ordering constraint already?
> > 
> > entry->xxx depends on ring->reset_index.
> 
> Yes that's true, but...
> 
>         entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
>         /* barrier? */
>         next_slot = READ_ONCE(entry->slot);
>         next_offset = READ_ONCE(entry->offset);
> 
> ... I think entry->xxx depends on entry first, then entry depends on
> reset_index.  So it seems fine because all things have a dependency?

Is reset_index changed from another thread then?
If yes then you want to read reset_index with READ_ONCE.
That includes a dependency barrier.

> > 
> > >>> what's the story around locking here? Why is it safe
> > >>> not to take the lock sometimes?
> > >>
> > >> kvm_dirty_ring_push() will be with lock==true only when the per-vm
> > >> ring is used.  For per-vcpu ring, because that will only happen with
> > >> the vcpu context, then we don't need locks (so kvm_dirty_ring_push()
> > >> is called with lock==false).
> > 
> > FWIW this will be done much more nicely in v2.
> > 
> > >>>> +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> > >>>> +	if (!page) {
> > >>>> +		r = -ENOMEM;
> > >>>> +		goto out_err_alloc_page;
> > >>>> +	}
> > >>>> +	kvm->vm_run = page_address(page);
> > >>>
> > >>> So 4K with just 8 bytes used. Not as bad as 1/2Mbyte for the ring but
> > >>> still. What is wrong with just a pointer and calling put_user?
> > >>
> > >> I want to make it the start point for sharing fields between
> > >> user/kernel per-vm.  Just like kvm_run for per-vcpu.
> > 
> > This page is actually not needed at all.  Userspace can just map at
> > KVM_DIRTY_LOG_PAGE_OFFSET, the indices reside there.  You can drop
> > kvm_vm_run completely.
> 
> I changed it because otherwise we use one entry of the padding, and
> all the rest of paddings are a waste of memory because we can never
> really use the padding as new fields only for the 1st entry which
> overlaps with the indices.  IMHO that could even waste more than 4k.
> 
> (for now we only "waste" 4K for per-vm, kvm_run is already mapped so
>  no waste there, not to say potentially I still think we can use the
>  kvm_vm_run in the future)
> 
> > 
> > >>>> +	} else {
> > >>>> +		/*
> > >>>> +		 * Put onto per vm ring because no vcpu context.  Kick
> > >>>> +		 * vcpu0 if ring is full.
> > >>>
> > >>> What about tasks on vcpu 0? Do guests realize it's a bad idea to put
> > >>> critical tasks there, they will be penalized disproportionally?
> > >>
> > >> Reasonable question.  So far we can't avoid it because vcpu exit is
> > >> the event mechanism to say "hey please collect dirty bits".  Maybe
> > >> someway is better than this, but I'll need to rethink all these
> > >> over...
> > > 
> > > Maybe signal an eventfd, and let userspace worry about deciding what to
> > > do.
> > 
> > This has to be done synchronously.  But the vm ring should be used very
> > rarely (it's for things like kvmclock updates that write to guest memory
> > outside a vCPU), possibly a handful of times in the whole run of the VM.
> 
> I've summarized a list of callers who might dirty guest memory in the
> other thread, it seems to me that even the kvm clock is using per-vcpu
> contexts.
> 
> > 
> > >>> KVM_DIRTY_RING_MAX_ENTRIES is not part of UAPI.
> > >>> So how does userspace know what's legal?
> > >>> Do you expect it to just try?
> > >>
> > >> Yep that's what I thought. :)
> > 
> > We should return it for KVM_CHECK_EXTENSION.
> 
> OK.  I'll drop the versioning.
> 
> -- 
> Peter Xu
Paolo Bonzini Dec. 16, 2019, 10:08 a.m. UTC | #47
[Alex and Kevin: there are doubts below regarding dirty page tracking
from VFIO and mdev devices, which perhaps you can help with]

On 15/12/19 18:21, Peter Xu wrote:
>                 init_rmode_tss
>                     vmx_set_tss_addr
>                         kvm_vm_ioctl_set_tss_addr [*]
>                 init_rmode_identity_map
>                     vmx_create_vcpu [*]

These don't matter because their content is not visible to userspace
(the backing storage is mmap-ed by __x86_set_memory_region).  In fact, d

>                 vmx_write_pml_buffer
>                     kvm_arch_write_log_dirty [&]
>                 kvm_write_guest
>                     kvm_hv_setup_tsc_page
>                         kvm_guest_time_update [&]
>                     nested_flush_cached_shadow_vmcs12 [&]
>                     kvm_write_wall_clock [&]
>                     kvm_pv_clock_pairing [&]
>                     kvmgt_rw_gpa [?]

This then expands (partially) to

intel_gvt_hypervisor_write_gpa
    emulate_csb_update
        emulate_execlist_ctx_schedule_out
            complete_execlist_workload
                complete_current_workload
                     workload_thread
        emulate_execlist_ctx_schedule_in
            prepare_execlist_workload
                prepare_workload
                    dispatch_workload
                        workload_thread

So KVMGT is always writing to GPAs instead of IOVAs and basically
bypassing a guest IOMMU.  So here it would be better if kvmgt was
changed not to use kvm_write_guest (also because I'd probably have nacked
that if I had known :)).

As far as I know, there is some work on live migration with both VFIO
and mdev, and that probably includes some dirty page tracking API.
kvmgt could switch to that API, or there could be VFIO APIs similar to
kvm_write_guest but taking IOVAs instead of GPAs.  Advantage: this would
fix the GPA/IOVA confusion.  Disadvantage: userspace would lose the
tracking of writes from mdev devices.  Kevin, are these writes used in
any way?  Do the calls to intel_gvt_hypervisor_write_gpa cover all
writes from kvmgt vGPUs, or can the hardware write to memory as well
(which would be my guess if I didn't know anything about kvmgt, which I
pretty much don't)?

> We should only need to look at the leaves of the traces because
> they're where the dirty request starts.  I'm marking all the leaves
> with below criteria then it'll be easier to focus:
> 
> Cases with [*]: should not matter much
>            [&]: actually with a per-vcpu context in the upper layer
>            [?]: uncertain...
> 
> I'm a bit amazed after I took these notes, since I found that besides
> those that could probbaly be ignored (marked as [*]), most of the rest
> per-vm dirty requests are actually with a vcpu context.
> 
> Although now because we have kvm_get_running_vcpu() all cases for [&]
> should be fine without changing anything, but I tend to add another
> patch in the next post to convert all the [&] cases explicitly to pass
> vcpu pointer instead of kvm pointer to be clear if no one disagrees,
> then we verify that against kvm_get_running_vcpu().

This is a good idea but remember not to convert those to
kvm_vcpu_write_guest, because you _don't_ want these writes to touch
SMRAM (most of the addresses are OS-controlled rather than
firmware-controlled).

> init_rmode_tss or init_rmode_identity_map.  But I've marked them as
> unimportant because they should only happen once at boot.

We need to check if userspace can add an arbitrary number of entries by
calling KVM_SET_TSS_ADDR repeatedly.  I think it can; we'd have to
forbid multiple calls to KVM_SET_TSS_ADDR which is not a problem in general.
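For illustration, forbidding repeated calls could be as simple as an early
check in vmx_set_tss_addr() (untested sketch assuming the existing
tss_addr field; not part of the series):

	/* Reject a second KVM_SET_TSS_ADDR so userspace cannot re-dirty the area */
	if (to_kvm_vmx(kvm)->tss_addr)
		return -EEXIST;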

>>> If we're still with the rule in userspace that we first do RESET then
>>> collect and send the pages (just like what we've discussed before),
>>> then IMHO it's fine to have vcpu2 to skip the slow path?  Because
>>> RESET happens at "treat page as not dirty", then if we are sure that
>>> we only collect and send pages after that point, then the latest
>>> "write to page" data from vcpu2 won't be lost even if vcpu2 is not
>>> blocked by vcpu1's ring full?
>>
>> Good point, the race would become
>>
>>  	vCPU 1			vCPU 2		host
>>  	---------------------------------------------------------------
>>  	mark page dirty
>>  				write to page
>> 						reset rings
>> 						  wait for mmu lock
>>  	add page to ring
>> 	release mmu lock
>> 						  ...do reset...
>> 						  release mmu lock
>> 						page is now dirty
> 
> Hmm, the page will be dirty after the reset, but is that an issue?
> 
> Or, could you help me to identify what I've missed?

Nothing: the race is always solved in such a way that there's no issue.

>> I don't think that's possible, most writes won't come from a page fault
>> path and cannot retry.
> 
> Yep, maybe I should say it in the other way round: we only wait if
> kvm_get_running_vcpu() == NULL.  Then in somewhere near
> vcpu_enter_guest(), we add a check to wait if per-vcpu ring is full.
> Would that work?

Yes, that should work, especially if we know that kvmgt is the only case
that can wait.  And since:

1) kvmgt doesn't really need dirty page tracking (because VFIO devices
generally don't track dirty pages, and because kvmgt shouldn't be using
kvm_write_guest anyway)

2) the real mode TSS and identity map shouldn't even be tracked, as they
are invisible to userspace

it seems to me that kvm_get_running_vcpu() lets us get rid of the per-VM
ring altogether.
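In that case a push site would only ever use the per-vcpu ring, e.g.
(illustrative sketch reusing the signatures from this patch; the
vcpu->dirty_ring field name is assumed, and slot/offset stand for whatever
the caller has at hand):

	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

	/* With the per-VM ring gone, a dirty GFN must come from a vcpu context */
	if (!WARN_ON_ONCE(!vcpu))
		kvm_dirty_ring_push(&vcpu->dirty_ring,
				    &vcpu->run->vcpu_ring_indexes,
				    slot, offset, false);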

Paolo
Peter Xu Dec. 16, 2019, 3:07 p.m. UTC | #48
On Mon, Dec 16, 2019 at 04:47:36AM -0500, Michael S. Tsirkin wrote:
> On Sun, Dec 15, 2019 at 12:33:02PM -0500, Peter Xu wrote:
> > On Thu, Dec 12, 2019 at 01:08:14AM +0100, Paolo Bonzini wrote:
> > > >>> What depends on what here? Looks suspicious ...
> > > >>
> > > >> Hmm, I think maybe it can be removed because the entry pointer
> > > >> reference below should be an ordering constraint already?
> > > 
> > > entry->xxx depends on ring->reset_index.
> > 
> > Yes that's true, but...
> > 
> >         entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> >         /* barrier? */
> >         next_slot = READ_ONCE(entry->slot);
> >         next_offset = READ_ONCE(entry->offset);
> > 
> > ... I think entry->xxx depends on entry first, then entry depends on
> > reset_index.  So it seems fine because all things have a dependency?
> 
> Is reset_index changed from another thread then?
> If yes then you want to read reset_index with READ_ONCE.
> That includes a dependency barrier.

There're a few readers, but only this function will change it
(kvm_dirty_ring_reset).  Thanks,
Peter Xu Dec. 16, 2019, 3:26 p.m. UTC | #49
On Mon, Dec 16, 2019 at 10:29:36AM +0100, Paolo Bonzini wrote:
> On 14/12/19 17:26, Peter Xu wrote:
> > On Sat, Dec 14, 2019 at 08:57:26AM +0100, Paolo Bonzini wrote:
> >> On 13/12/19 21:23, Peter Xu wrote:
> >>>> What is the benefit of using u16 for that? That means with 4K pages, you
> >>>> can share at most 256M of dirty memory each time? That seems low to me,
> >>>> especially since it's sufficient to touch one byte in a page to dirty it.
> >>>>
> >>>> Actually, this is not consistent with the definition in the code ;-)
> >>>> So I'll assume it's actually u32.
> >>> Yes it's u32 now.  Actually I believe at least Paolo would prefer u16
> >>> more. :)
> >>
> >> It has to be u16, because it overlaps the padding of the first entry.
> > 
> > Hmm, could you explain?
> > 
> > Note that here what Christophe commented is on dirty_index,
> > reset_index of "struct kvm_dirty_ring", so imho it could really be
> > anything we want as long as it can store a u32 (which is the size of
> > the elements in kvm_dirty_ring_indexes).
> > 
> > If you were instead talking about the previous union definition of
> > "struct kvm_dirty_gfns" rather than "struct kvm_dirty_ring", iiuc I've
> > moved those indices out of it and defined kvm_dirty_ring_indexes which
> > we expose via kvm_run, so we don't have that limitation as well any
> > more?
> 
> Yeah, I meant that since the size has (had) to be u16 in the union, it
> need not be bigger in kvm_dirty_ring.
> 
> I don't think having more than 2^16 entries in the *per-CPU* ring buffer
> makes sense; lagging in recording dirty memory by more than 256 MiB per
> CPU would mean a large pause later on resetting the ring buffers (your
> KVM_CLEAR_DIRTY_LOG patches found the sweet spot to be around 1 GiB for
> the whole system).

That's right, 1G could probably be a "common flavor" for guests in
that case.

Though I wanted to use u64 only to prepare better for potential future
changes, as long as it won't hurt much.  Here I'm just afraid 16 bits
might not be big enough for this 64-bit world; at the same time I'd
confess some requirements can be really unimaginable before we hit
them.  I'm trying to forge one here: what if a customer wants to handle
a 4G burst-dirtying workload during a migration (mostly idle guests
besides the burst IOs), while also wanting good responsiveness during
the burst dirtying?  In that case even with a 256MiB ring we'll still
need to pause frequently for harvesting, whereas this case really suits
an 8G ring size.

My example could actually be nonsense; it's just to show that if we can
extend something from u16 to u64 without paying much, then why not. :-)

> 
> So I liked the union, but if you removed it you might as well align the
> producer and consumer indices to 64 bytes so that they are in separate
> cache lines.

Yeh that I can do.  Thanks,
Paolo Bonzini Dec. 16, 2019, 3:31 p.m. UTC | #50
On 16/12/19 16:26, Peter Xu wrote:
> On Mon, Dec 16, 2019 at 10:29:36AM +0100, Paolo Bonzini wrote:
>> On 14/12/19 17:26, Peter Xu wrote:
>>> On Sat, Dec 14, 2019 at 08:57:26AM +0100, Paolo Bonzini wrote:
>>>> On 13/12/19 21:23, Peter Xu wrote:
>>>>>> What is the benefit of using u16 for that? That means with 4K pages, you
>>>>>> can share at most 256M of dirty memory each time? That seems low to me,
>>>>>> especially since it's sufficient to touch one byte in a page to dirty it.
>>>>>>
>>>>>> Actually, this is not consistent with the definition in the code ;-)
>>>>>> So I'll assume it's actually u32.
>>>>> Yes it's u32 now.  Actually I believe at least Paolo would prefer u16
>>>>> more. :)
>>>>
>>>> It has to be u16, because it overlaps the padding of the first entry.
>>>
>>> Hmm, could you explain?
>>>
>>> Note that here what Christophe commented is on dirty_index,
>>> reset_index of "struct kvm_dirty_ring", so imho it could really be
>>> anything we want as long as it can store a u32 (which is the size of
>>> the elements in kvm_dirty_ring_indexes).
>>>
>>> If you were instead talking about the previous union definition of
>>> "struct kvm_dirty_gfns" rather than "struct kvm_dirty_ring", iiuc I've
>>> moved those indices out of it and defined kvm_dirty_ring_indexes which
>>> we expose via kvm_run, so we don't have that limitation as well any
>>> more?
>>
>> Yeah, I meant that since the size has (had) to be u16 in the union, it
>> need not be bigger in kvm_dirty_ring.
>>
>> I don't think having more than 2^16 entries in the *per-CPU* ring buffer
>> makes sense; lagging in recording dirty memory by more than 256 MiB per
>> CPU would mean a large pause later on resetting the ring buffers (your
>> KVM_CLEAR_DIRTY_LOG patches found the sweet spot to be around 1 GiB for
>> the whole system).
> 
> That's right, 1G could probably be a "common flavor" for guests in
> that case.
> 
> Though I wanted to use u64 only because I wanted to prepare even
> better for future potential changes as long as it won't hurt much.

No u64, please.  u32 I can agree with, 16-bit *should* be enough but it
is a bit tight, so let's make it 32-bit if we drop the union idea.

Paolo
Michael S. Tsirkin Dec. 16, 2019, 3:33 p.m. UTC | #51
On Mon, Dec 16, 2019 at 10:07:54AM -0500, Peter Xu wrote:
> On Mon, Dec 16, 2019 at 04:47:36AM -0500, Michael S. Tsirkin wrote:
> > On Sun, Dec 15, 2019 at 12:33:02PM -0500, Peter Xu wrote:
> > > On Thu, Dec 12, 2019 at 01:08:14AM +0100, Paolo Bonzini wrote:
> > > > >>> What depends on what here? Looks suspicious ...
> > > > >>
> > > > >> Hmm, I think maybe it can be removed because the entry pointer
> > > > >> reference below should be an ordering constraint already?
> > > > 
> > > > entry->xxx depends on ring->reset_index.
> > > 
> > > Yes that's true, but...
> > > 
> > >         entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > >         /* barrier? */
> > >         next_slot = READ_ONCE(entry->slot);
> > >         next_offset = READ_ONCE(entry->offset);
> > > 
> > > ... I think entry->xxx depends on entry first, then entry depends on
> > > reset_index.  So it seems fine because all things have a dependency?
> > 
> > Is reset_index changed from another thread then?
> > If yes then you want to read reset_index with READ_ONCE.
> > That includes a dependency barrier.
> 
> There're a few readers, but only this function will change it
> (kvm_dirty_ring_reset).  Thanks,

Then you don't need any barriers in this function.
readers need at least READ_ONCE.

> -- 
> Peter Xu
Peter Xu Dec. 16, 2019, 3:43 p.m. UTC | #52
On Mon, Dec 16, 2019 at 04:31:50PM +0100, Paolo Bonzini wrote:
> No u64, please.  u32 I can agree with, 16-bit *should* be enough but it
> is a bit tight, so let's make it 32-bit if we drop the union idea.

Sure.
Peter Xu Dec. 16, 2019, 3:47 p.m. UTC | #53
On Mon, Dec 16, 2019 at 10:33:42AM -0500, Michael S. Tsirkin wrote:
> On Mon, Dec 16, 2019 at 10:07:54AM -0500, Peter Xu wrote:
> > On Mon, Dec 16, 2019 at 04:47:36AM -0500, Michael S. Tsirkin wrote:
> > > On Sun, Dec 15, 2019 at 12:33:02PM -0500, Peter Xu wrote:
> > > > On Thu, Dec 12, 2019 at 01:08:14AM +0100, Paolo Bonzini wrote:
> > > > > >>> What depends on what here? Looks suspicious ...
> > > > > >>
> > > > > >> Hmm, I think maybe it can be removed because the entry pointer
> > > > > >> reference below should be an ordering constraint already?
> > > > > 
> > > > > entry->xxx depends on ring->reset_index.
> > > > 
> > > > Yes that's true, but...
> > > > 
> > > >         entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > > >         /* barrier? */
> > > >         next_slot = READ_ONCE(entry->slot);
> > > >         next_offset = READ_ONCE(entry->offset);
> > > > 
> > > > ... I think entry->xxx depends on entry first, then entry depends on
> > > > reset_index.  So it seems fine because all things have a dependency?
> > > 
> > > Is reset_index changed from another thread then?
> > > If yes then you want to read reset_index with READ_ONCE.
> > > That includes a dependency barrier.
> > 
> > There're a few readers, but only this function will change it
> > (kvm_dirty_ring_reset).  Thanks,
> 
> Then you don't need any barriers in this function.
> readers need at least READ_ONCE.

In our case even a stale reset_index should not matter much here imho,
because the worst case is that we read an old value and stop pushing to
a ring that is soft-full but has just been reset (so only an extra
userspace exit even when the race happens).  But I agree it's clearer to
use READ_ONCE() on the readers.  Thanks!
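For instance, a reader of reset_index outside kvm_dirty_ring_reset() could
then be written as follows (illustrative sketch only, not the posted code):

	static inline u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
	{
		/* dirty_index and reset_index may be updated concurrently */
		return READ_ONCE(ring->dirty_index) - READ_ONCE(ring->reset_index);
	}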
Peter Xu Dec. 16, 2019, 6:54 p.m. UTC | #54
On Mon, Dec 16, 2019 at 11:08:15AM +0100, Paolo Bonzini wrote:
> > Although now because we have kvm_get_running_vcpu() all cases for [&]
> > should be fine without changing anything, but I tend to add another
> > patch in the next post to convert all the [&] cases explicitly to pass
> > vcpu pointer instead of kvm pointer to be clear if no one disagrees,
> > then we verify that against kvm_get_running_vcpu().
> 
> This is a good idea but remember not to convert those to
> kvm_vcpu_write_guest, because you _don't_ want these writes to touch
> SMRAM (most of the addresses are OS-controlled rather than
> firmware-controlled).

OK.  I think I only need to pass in a vcpu* instead of a kvm* in
kvm_write_guest_page(), just like kvm_vcpu_write_guest(); however we
would still only write to address space id==0 there.

> 
> > init_rmode_tss or init_rmode_identity_map.  But I've marked them as
> > unimportant because they should only happen once at boot.
> 
> We need to check if userspace can add an arbitrary number of entries by
> calling KVM_SET_TSS_ADDR repeatedly.  I think it can; we'd have to
> forbid multiple calls to KVM_SET_TSS_ADDR which is not a problem in general.

Will do that altogether with the series.  I can further change both of
these calls to not track dirty at all, which shouldn't be hard; after
all, userspace doesn't even know about them, as you mentioned below.

Is there anything to explain what KVM_SET_TSS_ADDR is used for?  This
is the thing I found that is closest to useful (from api.txt):

        This ioctl is required on Intel-based hosts.  This is needed
        on Intel hardware because of a quirk in the virtualization
        implementation (see the internals documentation when it pops
        into existence).

So... has it really popped into existence somewhere?  It would be good
at least to know why it does not need to be migrated.

> >> I don't think that's possible, most writes won't come from a page fault
> >> path and cannot retry.
> > 
> > Yep, maybe I should say it in the other way round: we only wait if
> > kvm_get_running_vcpu() == NULL.  Then in somewhere near
> > vcpu_enter_guest(), we add a check to wait if per-vcpu ring is full.
> > Would that work?
> 
> Yes, that should work, especially if we know that kvmgt is the only case
> that can wait.  And since:
> 
> 1) kvmgt doesn't really need dirty page tracking (because VFIO devices
> generally don't track dirty pages, and because kvmgt shouldn't be using
> kvm_write_guest anyway)
> 
> 2) the real mode TSS and identity map shouldn't even be tracked, as they
> are invisible to userspace
> 
> it seems to me that kvm_get_running_vcpu() lets us get rid of the per-VM
> ring altogether.

Yes, it would be perfect if so.
Tian, Kevin Dec. 17, 2019, 2:28 a.m. UTC | #55
> From: Paolo Bonzini
> Sent: Monday, December 16, 2019 6:08 PM
> 
> [Alex and Kevin: there are doubts below regarding dirty page tracking
> from VFIO and mdev devices, which perhaps you can help with]
> 
> On 15/12/19 18:21, Peter Xu wrote:
> >                 init_rmode_tss
> >                     vmx_set_tss_addr
> >                         kvm_vm_ioctl_set_tss_addr [*]
> >                 init_rmode_identity_map
> >                     vmx_create_vcpu [*]
> 
> These don't matter because their content is not visible to userspace
> (the backing storage is mmap-ed by __x86_set_memory_region).  In fact, d
> 
> >                 vmx_write_pml_buffer
> >                     kvm_arch_write_log_dirty [&]
> >                 kvm_write_guest
> >                     kvm_hv_setup_tsc_page
> >                         kvm_guest_time_update [&]
> >                     nested_flush_cached_shadow_vmcs12 [&]
> >                     kvm_write_wall_clock [&]
> >                     kvm_pv_clock_pairing [&]
> >                     kvmgt_rw_gpa [?]
> 
> This then expands (partially) to
> 
> intel_gvt_hypervisor_write_gpa
>     emulate_csb_update
>         emulate_execlist_ctx_schedule_out
>             complete_execlist_workload
>                 complete_current_workload
>                      workload_thread
>         emulate_execlist_ctx_schedule_in
>             prepare_execlist_workload
>                 prepare_workload
>                     dispatch_workload
>                         workload_thread
> 
> So KVMGT is always writing to GPAs instead of IOVAs and basically
> bypassing a guest IOMMU.  So here it would be better if kvmgt was
> changed not use kvm_write_guest (also because I'd probably have nacked
> that if I had known :)).

I agree. 

> 
> As far as I know, there is some work on live migration with both VFIO
> and mdev, and that probably includes some dirty page tracking API.
> kvmgt could switch to that API, or there could be VFIO APIs similar to
> kvm_write_guest but taking IOVAs instead of GPAs.  Advantage: this would
> fix the GPA/IOVA confusion.  Disadvantage: userspace would lose the
> tracking of writes from mdev devices.  Kevin, are these writes used in
> any way?  Do the calls to intel_gvt_hypervisor_write_gpa covers all
> writes from kvmgt vGPUs, or can the hardware write to memory as well
> (which would be my guess if I didn't know anything about kvmgt, which I
> pretty much don't)?

intel_gvt_hypervisor_write_gpa covers all writes due to software mediation.

For hardware updates, the memory needs to be mapped in the IOMMU through
vfio_pin_pages before any DMA happens.  The ongoing dirty tracking effort
in VFIO will treat every page pinned through that API as dirtied.

However, currently VFIO doesn't implement any vfio_read/write_guest
interface yet, and it doesn't make sense to use vfio_pin_pages for
software-dirtied pages, as pinning is unnecessary and heavy, involving
IOMMU invalidation.

Alex, if you are OK with it we'll work on such an interface and move kvmgt
to use it.  After it's accepted, we can also mark pages dirty through this
new interface in Kirti's dirty page tracking series.

Thanks
Kevin

> 
> > We should only need to look at the leaves of the traces because
> > they're where the dirty request starts.  I'm marking all the leaves
> > with below criteria then it'll be easier to focus:
> >
> > Cases with [*]: should not matter much
> >            [&]: actually with a per-vcpu context in the upper layer
> >            [?]: uncertain...
> >
> > I'm a bit amazed after I took these notes, since I found that besides
> > those that could probbaly be ignored (marked as [*]), most of the rest
> > per-vm dirty requests are actually with a vcpu context.
> >
> > Although now because we have kvm_get_running_vcpu() all cases for [&]
> > should be fine without changing anything, but I tend to add another
> > patch in the next post to convert all the [&] cases explicitly to pass
> > vcpu pointer instead of kvm pointer to be clear if no one disagrees,
> > then we verify that against kvm_get_running_vcpu().
> 
> This is a good idea but remember not to convert those to
> kvm_vcpu_write_guest, because you _don't_ want these writes to touch
> SMRAM (most of the addresses are OS-controlled rather than
> firmware-controlled).
> 
> > init_rmode_tss or init_rmode_identity_map.  But I've marked them as
> > unimportant because they should only happen once at boot.
> 
> We need to check if userspace can add an arbitrary number of entries by
> calling KVM_SET_TSS_ADDR repeatedly.  I think it can; we'd have to
> forbid multiple calls to KVM_SET_TSS_ADDR which is not a problem in
> general.
> 
> >>> If we're still with the rule in userspace that we first do RESET then
> >>> collect and send the pages (just like what we've discussed before),
> >>> then IMHO it's fine to have vcpu2 to skip the slow path?  Because
> >>> RESET happens at "treat page as not dirty", then if we are sure that
> >>> we only collect and send pages after that point, then the latest
> >>> "write to page" data from vcpu2 won't be lost even if vcpu2 is not
> >>> blocked by vcpu1's ring full?
> >>
> >> Good point, the race would become
> >>
> >>  	vCPU 1			vCPU 2		host
> >>  	---------------------------------------------------------------
> >>  	mark page dirty
> >>  				write to page
> >> 						reset rings
> >> 						  wait for mmu lock
> >>  	add page to ring
> >> 	release mmu lock
> >> 						  ...do reset...
> >> 						  release mmu lock
> >> 						page is now dirty
> >
> > Hmm, the page will be dirty after the reset, but is that an issue?
> >
> > Or, could you help me to identify what I've missed?
> 
> Nothing: the race is always solved in such a way that there's no issue.
> 
> >> I don't think that's possible, most writes won't come from a page fault
> >> path and cannot retry.
> >
> > Yep, maybe I should say it in the other way round: we only wait if
> > kvm_get_running_vcpu() == NULL.  Then in somewhere near
> > vcpu_enter_guest(), we add a check to wait if per-vcpu ring is full.
> > Would that work?
> 
> Yes, that should work, especially if we know that kvmgt is the only case
> that can wait.  And since:
> 
> 1) kvmgt doesn't really need dirty page tracking (because VFIO devices
> generally don't track dirty pages, and because kvmgt shouldn't be using
> kvm_write_guest anyway)
> 
> 2) the real mode TSS and identity map shouldn't even be tracked, as they
> are invisible to userspace
> 
> it seems to me that kvm_get_running_vcpu() lets us get rid of the per-VM
> ring altogether.
> 
> Paolo
Tian, Kevin Dec. 17, 2019, 5:17 a.m. UTC | #56
> From: Tian, Kevin
> Sent: Tuesday, December 17, 2019 10:29 AM
> 
> > From: Paolo Bonzini
> > Sent: Monday, December 16, 2019 6:08 PM
> >
> > [Alex and Kevin: there are doubts below regarding dirty page tracking
> > from VFIO and mdev devices, which perhaps you can help with]
> >
> > On 15/12/19 18:21, Peter Xu wrote:
> > >                 init_rmode_tss
> > >                     vmx_set_tss_addr
> > >                         kvm_vm_ioctl_set_tss_addr [*]
> > >                 init_rmode_identity_map
> > >                     vmx_create_vcpu [*]
> >
> > These don't matter because their content is not visible to userspace
> > (the backing storage is mmap-ed by __x86_set_memory_region).  In fact, d
> >
> > >                 vmx_write_pml_buffer
> > >                     kvm_arch_write_log_dirty [&]
> > >                 kvm_write_guest
> > >                     kvm_hv_setup_tsc_page
> > >                         kvm_guest_time_update [&]
> > >                     nested_flush_cached_shadow_vmcs12 [&]
> > >                     kvm_write_wall_clock [&]
> > >                     kvm_pv_clock_pairing [&]
> > >                     kvmgt_rw_gpa [?]
> >
> > This then expands (partially) to
> >
> > intel_gvt_hypervisor_write_gpa
> >     emulate_csb_update
> >         emulate_execlist_ctx_schedule_out
> >             complete_execlist_workload
> >                 complete_current_workload
> >                      workload_thread
> >         emulate_execlist_ctx_schedule_in
> >             prepare_execlist_workload
> >                 prepare_workload
> >                     dispatch_workload
> >                         workload_thread
> >
> > So KVMGT is always writing to GPAs instead of IOVAs and basically
> > bypassing a guest IOMMU.  So here it would be better if kvmgt was
> > changed not use kvm_write_guest (also because I'd probably have nacked
> > that if I had known :)).
> 
> I agree.
> 
> >
> > As far as I know, there is some work on live migration with both VFIO
> > and mdev, and that probably includes some dirty page tracking API.
> > kvmgt could switch to that API, or there could be VFIO APIs similar to
> > kvm_write_guest but taking IOVAs instead of GPAs.  Advantage: this would
> > fix the GPA/IOVA confusion.  Disadvantage: userspace would lose the
> > tracking of writes from mdev devices.  Kevin, are these writes used in
> > any way?  Do the calls to intel_gvt_hypervisor_write_gpa covers all
> > writes from kvmgt vGPUs, or can the hardware write to memory as well
> > (which would be my guess if I didn't know anything about kvmgt, which I
> > pretty much don't)?
> 
> intel_gvt_hypervisor_write_gpa covers all writes due to software mediation.
> 
> for hardware updates, it needs be mapped in IOMMU through
> vfio_pin_pages
> before any DMA happens. The ongoing dirty tracking effort in VFIO will take
> every pinned page through that API as dirtied.
> 
> However, currently VFIO doesn't implement any vfio_read/write_guest
> interface yet. and it doesn't make sense to use vfio_pin_pages for software
> dirtied pages, as pin is unnecessary and heavy involving iommu invalidation.

One correction: vfio_pin_pages doesn't involve IOMMU invalidation.  I just
meant that pinning the page is not necessary; we just need a kvm-like
interface based on the hva to access guest memory.

> 
> Alex, if you are OK we'll work on such interface and move kvmgt to use it.
> After it's accepted, we can also mark pages dirty through this new interface
> in Kirti's dirty page tracking series.
>
Yan Zhao Dec. 17, 2019, 5:25 a.m. UTC | #57
On Tue, Dec 17, 2019 at 01:17:29PM +0800, Tian, Kevin wrote:
> > From: Tian, Kevin
> > Sent: Tuesday, December 17, 2019 10:29 AM
> > 
> > > From: Paolo Bonzini
> > > Sent: Monday, December 16, 2019 6:08 PM
> > >
> > > [Alex and Kevin: there are doubts below regarding dirty page tracking
> > > from VFIO and mdev devices, which perhaps you can help with]
> > >
> > > On 15/12/19 18:21, Peter Xu wrote:
> > > >                 init_rmode_tss
> > > >                     vmx_set_tss_addr
> > > >                         kvm_vm_ioctl_set_tss_addr [*]
> > > >                 init_rmode_identity_map
> > > >                     vmx_create_vcpu [*]
> > >
> > > These don't matter because their content is not visible to userspace
> > > (the backing storage is mmap-ed by __x86_set_memory_region).  In fact, d
> > >
> > > >                 vmx_write_pml_buffer
> > > >                     kvm_arch_write_log_dirty [&]
> > > >                 kvm_write_guest
> > > >                     kvm_hv_setup_tsc_page
> > > >                         kvm_guest_time_update [&]
> > > >                     nested_flush_cached_shadow_vmcs12 [&]
> > > >                     kvm_write_wall_clock [&]
> > > >                     kvm_pv_clock_pairing [&]
> > > >                     kvmgt_rw_gpa [?]
> > >
> > > This then expands (partially) to
> > >
> > > intel_gvt_hypervisor_write_gpa
> > >     emulate_csb_update
> > >         emulate_execlist_ctx_schedule_out
> > >             complete_execlist_workload
> > >                 complete_current_workload
> > >                      workload_thread
> > >         emulate_execlist_ctx_schedule_in
> > >             prepare_execlist_workload
> > >                 prepare_workload
> > >                     dispatch_workload
> > >                         workload_thread
> > >
> > > So KVMGT is always writing to GPAs instead of IOVAs and basically
> > > bypassing a guest IOMMU.  So here it would be better if kvmgt was
> > > changed not use kvm_write_guest (also because I'd probably have nacked
> > > that if I had known :)).
> > 
> > I agree.
> > 
> > >
> > > As far as I know, there is some work on live migration with both VFIO
> > > and mdev, and that probably includes some dirty page tracking API.
> > > kvmgt could switch to that API, or there could be VFIO APIs similar to
> > > kvm_write_guest but taking IOVAs instead of GPAs.  Advantage: this would
> > > fix the GPA/IOVA confusion.  Disadvantage: userspace would lose the
> > > tracking of writes from mdev devices.  Kevin, are these writes used in
> > > any way?  Do the calls to intel_gvt_hypervisor_write_gpa covers all
> > > writes from kvmgt vGPUs, or can the hardware write to memory as well
> > > (which would be my guess if I didn't know anything about kvmgt, which I
> > > pretty much don't)?
> > 
> > intel_gvt_hypervisor_write_gpa covers all writes due to software mediation.
> > 
> > for hardware updates, it needs be mapped in IOMMU through
> > vfio_pin_pages
> > before any DMA happens. The ongoing dirty tracking effort in VFIO will take
> > every pinned page through that API as dirtied.
> > 
> > However, currently VFIO doesn't implement any vfio_read/write_guest
> > interface yet. and it doesn't make sense to use vfio_pin_pages for software
> > dirtied pages, as pin is unnecessary and heavy involving iommu invalidation.
> 
> One correction. vfio_pin_pages doesn't involve iommu invalidation. I should
> just mean that pinning the page is not necessary. We just need a kvm-like
> interface based on hva to access.
>
And can we propose to differentiate reads and writes when calling
vfio_pin_pages, e.g. vfio_pin_pages_read and vfio_pin_pages_write?
Otherwise, calling vfio_pin_pages will unnecessarily cause read-only
pages to be marked dirty, and sometimes reading guest pages is a way for
the device model to track dirty pages.

> > 
> > Alex, if you are OK we'll work on such interface and move kvmgt to use it.
> > After it's accepted, we can also mark pages dirty through this new interface
> > in Kirti's dirty page tracking series.
> >
Paolo Bonzini Dec. 17, 2019, 9:01 a.m. UTC | #58
On 16/12/19 19:54, Peter Xu wrote:
> On Mon, Dec 16, 2019 at 11:08:15AM +0100, Paolo Bonzini wrote:
>>> Although now because we have kvm_get_running_vcpu() all cases for [&]
>>> should be fine without changing anything, but I tend to add another
>>> patch in the next post to convert all the [&] cases explicitly to pass
>>> vcpu pointer instead of kvm pointer to be clear if no one disagrees,
>>> then we verify that against kvm_get_running_vcpu().
>>
>> This is a good idea but remember not to convert those to
>> kvm_vcpu_write_guest, because you _don't_ want these writes to touch
>> SMRAM (most of the addresses are OS-controlled rather than
>> firmware-controlled).
> 
> OK.  I think I only need to pass in vcpu* instead of kvm* in
> kvm_write_guest_page() just like kvm_vcpu_write_guest(), however we
> still keep to only write to address space id==0 for that.

No, please pass it all the way down to the [&] functions but not to
kvm_write_guest_page.  Those should keep using vcpu->kvm.

>>> init_rmode_tss or init_rmode_identity_map.  But I've marked them as
>>> unimportant because they should only happen once at boot.
>>
>> We need to check if userspace can add an arbitrary number of entries by
>> calling KVM_SET_TSS_ADDR repeatedly.  I think it can; we'd have to
>> forbid multiple calls to KVM_SET_TSS_ADDR which is not a problem in general.
> 
> Will do that altogether with the series.  I can further change both of
> these calls to not track dirty at all, which shouldn't be hard, after
> all userspace didn't even know them, as you mentioned below.
> 
> Is there anything to explain what KVM_SET_TSS_ADDR is used for?  This
> is the thing I found that is closest to useful (from api.txt):

The best description is probably at https://lwn.net/Articles/658883/:

They are needed for unrestricted_guest=0. Remember that, in that case,
the VM always runs in protected mode and with paging enabled. In order
to emulate real mode you put the guest in a vm86 task, so you need some
place for a TSS and for a page table, and they must be in guest RAM
because the guest's TR and CR3 point to them. They are invisible to the
guest, because the STR and MOV-from-CR instructions are invalid in vm86
mode, but they must be there.

If you don't call KVM_SET_TSS_ADDR you actually get a complaint in
dmesg, and the TR stays at 0. I am not really sure what kind of bad
things can happen with unrestricted_guest=0, probably you just get a VM
Entry failure. The TSS takes 3 pages of memory. An interesting point is
that you actually don't need to set the TR selector to a valid value (as
you would do when running in "normal" vm86 mode), you can simply set the
base and limit registers that are hidden in the processor, and generally
inaccessible except through VMREAD/VMWRITE or system management mode. So
KVM needs to set up a TSS but not a GDT.

For paging, instead, 1 page is enough because we have only 4GB of memory
to address. KVM disables CR4.PAE (Physical Address Extension, aka 8-byte
entries in each page directory or page table) and enables CR4.PSE (Page
Size Extension, aka 4MB huge page support with 4-byte page directory
entries). One page then fits 1024 4-byte page directory entries, each
mapping a 4MB huge page, totaling exactly 4GB. Here, if you don't set it, the
page table is at address 0xFFFBC000. QEMU changes it to 0xFEFFC000 so
that the BIOS can be up to 16MB in size (the default only allows 256k
between 0xFFFC0000 and 0xFFFFFFFF).

The different handling, where only the page table has a default, is
unfortunate, but so goes life...
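
For illustration, filling that one identity-map page boils down to the
loop below; the flag names are the usual x86 ones and pd /
identity_map_page are just stand-ins for the page KVM allocates, so
take this as a sketch rather than the real code:

	u32 *pd = identity_map_page;	/* the single 4K page */
	int i;

	/* 1024 PSE page directory entries; entry i maps the 4MB at i << 22 */
	for (i = 0; i < 1024; i++)
		pd[i] = (i << 22) | _PAGE_PRESENT | _PAGE_RW | _PAGE_PSE;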

> So... has it really popped into existance somewhere?  It would be good
> at least to know why it does not need to be migrated.

It does not need to be migrated just because the contents are constant.

Paolo
Christophe de Dinechin Dec. 17, 2019, 12:16 p.m. UTC | #59
> On 14 Dec 2019, at 08:57, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 13/12/19 21:23, Peter Xu wrote:
>>> What is the benefit of using u16 for that? That means with 4K pages, you
>>> can share at most 256M of dirty memory each time? That seems low to me,
>>> especially since it's sufficient to touch one byte in a page to dirty it.
>>> 
>>> Actually, this is not consistent with the definition in the code ;-)
>>> So I'll assume it's actually u32.
>> Yes it's u32 now.  Actually I believe at least Paolo would prefer u16
>> more. :)
> 
> It has to be u16, because it overlaps the padding of the first entry.

Wow, now that’s subtle.

That definitely needs a union with the padding to make this explicit.

(My guess is you do that to page-align the whole thing and avoid adding a
page just for the counters)

> 
> Paolo
> 
>> I think even u16 would be mostly enough (if you see, the maximum
>> allowed value currently is 64K entries only, not a big one).  Again,
>> the thing is that the userspace should be collecting the dirty bits,
>> so the ring shouldn't reach full easily.  Even if it does, we should
>> probably let it stop for a while as explained above.  It'll be
>> inefficient only if we set it to a too-small value, imho.
>> 
>
Paolo Bonzini Dec. 17, 2019, 12:19 p.m. UTC | #60
On 17/12/19 13:16, Christophe de Dinechin wrote:
> 
> 
>> On 14 Dec 2019, at 08:57, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>> On 13/12/19 21:23, Peter Xu wrote:
>>>> What is the benefit of using u16 for that? That means with 4K pages, you
>>>> can share at most 256M of dirty memory each time? That seems low to me,
>>>> especially since it's sufficient to touch one byte in a page to dirty it.
>>>>
>>>> Actually, this is not consistent with the definition in the code ;-)
>>>> So I'll assume it's actually u32.
>>> Yes it's u32 now.  Actually I believe at least Paolo would prefer u16
>>> more. :)
>>
>> It has to be u16, because it overlaps the padding of the first entry.
> 
> Wow, now that’s subtle.
> 
> That definitely needs a union with the padding to make this explicit.
> 
> (My guess is you do that to page-align the whole thing and avoid adding a
> page just for the counters)

Yes, that was the idea but Peter decided to scrap it. :)

Paolo

>>
>> Paolo
>>
>>> I think even u16 would be mostly enough (if you see, the maximum
>>> allowed value currently is 64K entries only, not a big one).  Again,
>>> the thing is that the userspace should be collecting the dirty bits,
>>> so the ring shouldn't reach full easily.  Even if it does, we should
>>> probably let it stop for a while as explained above.  It'll be
>>> inefficient only if we set it to a too-small value, imho.
>>>
>>
>
Peter Xu Dec. 17, 2019, 3:38 p.m. UTC | #61
On Tue, Dec 17, 2019 at 01:19:05PM +0100, Paolo Bonzini wrote:
> On 17/12/19 13:16, Christophe de Dinechin wrote:
> > 
> > 
> >> On 14 Dec 2019, at 08:57, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>
> >> On 13/12/19 21:23, Peter Xu wrote:
> >>>> What is the benefit of using u16 for that? That means with 4K pages, you
> >>>> can share at most 256M of dirty memory each time? That seems low to me,
> >>>> especially since it's sufficient to touch one byte in a page to dirty it.
> >>>>
> >>>> Actually, this is not consistent with the definition in the code ;-)
> >>>> So I'll assume it's actually u32.
> >>> Yes it's u32 now.  Actually I believe at least Paolo would prefer u16
> >>> more. :)
> >>
> >> It has to be u16, because it overlaps the padding of the first entry.
> > 
> > Wow, now that’s subtle.
> > 
> > That definitely needs a union with the padding to make this explicit.
> > 
> > (My guess is you do that to page-align the whole thing and avoid adding a
> > page just for the counters)

(Just to make sure this is clear... Paolo was talking about the
 previous version.  This version does not have this limitation because
 we don't have that union definition any more)

> 
> Yes, that was the idea but Peter decided to scrap it. :)

There's still time to persuade me to go back to it. :)

(Though, yes, I still like the current solution... if we can get rid of
 the only kvmgt ugliness, we can even throw away the per-vm ring with its
 "extra" 4k page.  Then I suppose it'll be even harder to persuade me :)
Alex Williamson Dec. 17, 2019, 4:18 p.m. UTC | #62
On Tue, 17 Dec 2019 02:28:33 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Paolo Bonzini
> > Sent: Monday, December 16, 2019 6:08 PM
> > 
> > [Alex and Kevin: there are doubts below regarding dirty page tracking
> > from VFIO and mdev devices, which perhaps you can help with]
> > 
> > On 15/12/19 18:21, Peter Xu wrote:  
> > >                 init_rmode_tss
> > >                     vmx_set_tss_addr
> > >                         kvm_vm_ioctl_set_tss_addr [*]
> > >                 init_rmode_identity_map
> > >                     vmx_create_vcpu [*]  
> > 
> > These don't matter because their content is not visible to userspace
> > (the backing storage is mmap-ed by __x86_set_memory_region).  In fact, d
> >   
> > >                 vmx_write_pml_buffer
> > >                     kvm_arch_write_log_dirty [&]
> > >                 kvm_write_guest
> > >                     kvm_hv_setup_tsc_page
> > >                         kvm_guest_time_update [&]
> > >                     nested_flush_cached_shadow_vmcs12 [&]
> > >                     kvm_write_wall_clock [&]
> > >                     kvm_pv_clock_pairing [&]
> > >                     kvmgt_rw_gpa [?]  
> > 
> > This then expands (partially) to
> > 
> > intel_gvt_hypervisor_write_gpa
> >     emulate_csb_update
> >         emulate_execlist_ctx_schedule_out
> >             complete_execlist_workload
> >                 complete_current_workload
> >                      workload_thread
> >         emulate_execlist_ctx_schedule_in
> >             prepare_execlist_workload
> >                 prepare_workload
> >                     dispatch_workload
> >                         workload_thread
> > 
> > So KVMGT is always writing to GPAs instead of IOVAs and basically
> > bypassing a guest IOMMU.  So here it would be better if kvmgt was
> > changed not use kvm_write_guest (also because I'd probably have nacked
> > that if I had known :)).  
> 
> I agree. 
> 
> > 
> > As far as I know, there is some work on live migration with both VFIO
> > and mdev, and that probably includes some dirty page tracking API.
> > kvmgt could switch to that API, or there could be VFIO APIs similar to
> > kvm_write_guest but taking IOVAs instead of GPAs.  Advantage: this would
> > fix the GPA/IOVA confusion.  Disadvantage: userspace would lose the
> > tracking of writes from mdev devices.  Kevin, are these writes used in
> > any way?  Do the calls to intel_gvt_hypervisor_write_gpa covers all
> > writes from kvmgt vGPUs, or can the hardware write to memory as well
> > (which would be my guess if I didn't know anything about kvmgt, which I
> > pretty much don't)?  
> 
> intel_gvt_hypervisor_write_gpa covers all writes due to software mediation.
> 
> for hardware updates, it needs be mapped in IOMMU through vfio_pin_pages 
> before any DMA happens. The ongoing dirty tracking effort in VFIO will take
> every pinned page through that API as dirtied.
> 
> However, currently VFIO doesn't implement any vfio_read/write_guest
> interface yet. and it doesn't make sense to use vfio_pin_pages for software
> dirtied pages, as pin is unnecessary and heavy involving iommu invalidation.
> 
> Alex, if you are OK we'll work on such interface and move kvmgt to use it.
> After it's accepted, we can also mark pages dirty through this new interface
> in Kirti's dirty page tracking series.

I'm not sure what you're asking for; is it an interface for the host
CPU to read/write the memory backing of a mapped IOVA range without
pinning pages?  Something like that would make sense for
an emulation model where a page does not need to be pinned for physical
DMA.  If you're asking more for an interface that understands the
userspace driver is a VM (ie. implied using a _guest postfix on the
function name) and knows about GPA mappings beyond the windows directly
mapped for device access, I'd not look fondly on such a request.
Thanks,

Alex
Peter Xu Dec. 17, 2019, 4:24 p.m. UTC | #63
On Tue, Dec 17, 2019 at 10:01:40AM +0100, Paolo Bonzini wrote:
> On 16/12/19 19:54, Peter Xu wrote:
> > On Mon, Dec 16, 2019 at 11:08:15AM +0100, Paolo Bonzini wrote:
> >>> Although now because we have kvm_get_running_vcpu() all cases for [&]
> >>> should be fine without changing anything, but I tend to add another
> >>> patch in the next post to convert all the [&] cases explicitly to pass
> >>> vcpu pointer instead of kvm pointer to be clear if no one disagrees,
> >>> then we verify that against kvm_get_running_vcpu().
> >>
> >> This is a good idea but remember not to convert those to
> >> kvm_vcpu_write_guest, because you _don't_ want these writes to touch
> >> SMRAM (most of the addresses are OS-controlled rather than
> >> firmware-controlled).
> > 
> > OK.  I think I only need to pass in vcpu* instead of kvm* in
> > kvm_write_guest_page() just like kvm_vcpu_write_guest(), however we
> > still keep to only write to address space id==0 for that.
> 
> No, please pass it all the way down to the [&] functions but not to
> kvm_write_guest_page.  Those should keep using vcpu->kvm.

Actually I even wanted to refactor these helpers.  I mean, we have two
sets of helpers now, kvm_[vcpu]_{read|write}*(), so one set is per-vm,
the other set is per-vcpu.  IIUC the only difference between the two is
whether we consider ((vcpu)->arch.hflags & HF_SMM_MASK) or we just
always write to address space zero.  Could we unify them into a single
set of helpers (I'd just drop the *_vcpu_* names because they're longer
to write) that always takes vcpu* as the first parameter?  Then we add
another parameter "vcpu_smm" to show whether we want to consider the
HF_SMM_MASK flag.

Kvmgt is of course special here because it does not have a vcpu
context, but as we're going to rework that, I'd like to know whether
you agree with the above refactoring, leaving the kvmgt caller aside.
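
Something like this, just to illustrate (not a final signature):

	int kvm_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
				 const void *data, int offset, int len,
				 bool vcpu_smm);

where vcpu_smm=false would always write to address space zero.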

> 
> >>> init_rmode_tss or init_rmode_identity_map.  But I've marked them as
> >>> unimportant because they should only happen once at boot.
> >>
> >> We need to check if userspace can add an arbitrary number of entries by
> >> calling KVM_SET_TSS_ADDR repeatedly.  I think it can; we'd have to
> >> forbid multiple calls to KVM_SET_TSS_ADDR which is not a problem in general.
> > 
> > Will do that altogether with the series.  I can further change both of
> > these calls to not track dirty at all, which shouldn't be hard, after
> > all userspace didn't even know them, as you mentioned below.
> > 
> > Is there anything to explain what KVM_SET_TSS_ADDR is used for?  This
> > is the thing I found that is closest to useful (from api.txt):
> 
> The best description is probably at https://lwn.net/Articles/658883/:
> 
> They are needed for unrestricted_guest=0. Remember that, in that case,
> the VM always runs in protected mode and with paging enabled. In order
> to emulate real mode you put the guest in a vm86 task, so you need some
> place for a TSS and for a page table, and they must be in guest RAM
> because the guest's TR and CR3 points to it. They are invisible to the
> guest, because the STR and MOV-from-CR instructions are invalid in vm86
> mode, but it must be there.
> 
> If you don't call KVM_SET_TSS_ADDR you actually get a complaint in
> dmesg, and the TR stays at 0. I am not really sure what kind of bad
> things can happen with unrestricted_guest=0, probably you just get a VM
> Entry failure. The TSS takes 3 pages of memory. An interesting point is
> that you actually don't need to set the TR selector to a valid value (as
> you would do when running in "normal" vm86 mode), you can simply set the
> base and limit registers that are hidden in the processor, and generally
> inaccessible except through VMREAD/VMWRITE or system management mode. So
> KVM needs to set up a TSS but not a GDT.
> 
> For paging, instead, 1 page is enough because we have only 4GB of memory
> to address. KVM disables CR4.PAE (page address extensions, aka 8-byte
> entries in each page directory or page table) and enables CR4.PSE (page
> size extensions, aka 4MB huge pages support with 4-byte page directory
> entries). One page then fits 1024 4-byte page directory entries, each
> for a 4MB huge pages, totaling exactly 4GB. Here if you don't set it the
> page table is at address 0xFFFBC000. QEMU changes it to 0xFEFFC000 so
> that the BIOS can be up to 16MB in size (the default only allows 256k
> between 0xFFFC0000 and 0xFFFFFFFF).
> 
> The different handling, where only the page table has a default, is
> unfortunate, but so goes life...
> 
> > So... has it really popped into existance somewhere?  It would be good
> > at least to know why it does not need to be migrated.
> 
> It does not need to be migrated just because the contents are constant.

OK, thanks!  IIUC they should likely be all zeros then.

Do you think it's time to add most of these to kvm/api.txt? :)  I can
do that too if you like.
Alex Williamson Dec. 17, 2019, 4:24 p.m. UTC | #64
On Tue, 17 Dec 2019 00:25:02 -0500
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Tue, Dec 17, 2019 at 01:17:29PM +0800, Tian, Kevin wrote:
> > > From: Tian, Kevin
> > > Sent: Tuesday, December 17, 2019 10:29 AM
> > >   
> > > > From: Paolo Bonzini
> > > > Sent: Monday, December 16, 2019 6:08 PM
> > > >
> > > > [Alex and Kevin: there are doubts below regarding dirty page tracking
> > > > from VFIO and mdev devices, which perhaps you can help with]
> > > >
> > > > On 15/12/19 18:21, Peter Xu wrote:  
> > > > >                 init_rmode_tss
> > > > >                     vmx_set_tss_addr
> > > > >                         kvm_vm_ioctl_set_tss_addr [*]
> > > > >                 init_rmode_identity_map
> > > > >                     vmx_create_vcpu [*]  
> > > >
> > > > These don't matter because their content is not visible to userspace
> > > > (the backing storage is mmap-ed by __x86_set_memory_region).  In fact, d
> > > >  
> > > > >                 vmx_write_pml_buffer
> > > > >                     kvm_arch_write_log_dirty [&]
> > > > >                 kvm_write_guest
> > > > >                     kvm_hv_setup_tsc_page
> > > > >                         kvm_guest_time_update [&]
> > > > >                     nested_flush_cached_shadow_vmcs12 [&]
> > > > >                     kvm_write_wall_clock [&]
> > > > >                     kvm_pv_clock_pairing [&]
> > > > >                     kvmgt_rw_gpa [?]  
> > > >
> > > > This then expands (partially) to
> > > >
> > > > intel_gvt_hypervisor_write_gpa
> > > >     emulate_csb_update
> > > >         emulate_execlist_ctx_schedule_out
> > > >             complete_execlist_workload
> > > >                 complete_current_workload
> > > >                      workload_thread
> > > >         emulate_execlist_ctx_schedule_in
> > > >             prepare_execlist_workload
> > > >                 prepare_workload
> > > >                     dispatch_workload
> > > >                         workload_thread
> > > >
> > > > So KVMGT is always writing to GPAs instead of IOVAs and basically
> > > > bypassing a guest IOMMU.  So here it would be better if kvmgt was
> > > > changed not use kvm_write_guest (also because I'd probably have nacked
> > > > that if I had known :)).  
> > > 
> > > I agree.
> > >   
> > > >
> > > > As far as I know, there is some work on live migration with both VFIO
> > > > and mdev, and that probably includes some dirty page tracking API.
> > > > kvmgt could switch to that API, or there could be VFIO APIs similar to
> > > > kvm_write_guest but taking IOVAs instead of GPAs.  Advantage: this would
> > > > fix the GPA/IOVA confusion.  Disadvantage: userspace would lose the
> > > > tracking of writes from mdev devices.  Kevin, are these writes used in
> > > > any way?  Do the calls to intel_gvt_hypervisor_write_gpa covers all
> > > > writes from kvmgt vGPUs, or can the hardware write to memory as well
> > > > (which would be my guess if I didn't know anything about kvmgt, which I
> > > > pretty much don't)?  
> > > 
> > > intel_gvt_hypervisor_write_gpa covers all writes due to software mediation.
> > > 
> > > for hardware updates, it needs be mapped in IOMMU through
> > > vfio_pin_pages
> > > before any DMA happens. The ongoing dirty tracking effort in VFIO will take
> > > every pinned page through that API as dirtied.
> > > 
> > > However, currently VFIO doesn't implement any vfio_read/write_guest
> > > interface yet. and it doesn't make sense to use vfio_pin_pages for software
> > > dirtied pages, as pin is unnecessary and heavy involving iommu invalidation.  
> > 
> > One correction. vfio_pin_pages doesn't involve iommu invalidation. I should
> > just mean that pinning the page is not necessary. We just need a kvm-like
> > interface based on hva to access.
> >  
> And can we propose to differentiate read and write when calling vfio_pin_pages, e.g.
> vfio_pin_pages_read, vfio_pin_pages_write? Otherwise, calling to
> vfio_pin_pages will unnecessarily cause read pages to be dirty and
> sometimes reading guest pages is a way for device model to track dirty
> pages.

Yes, I've discussed this with Kirti.  When devices add more fine-grained
dirty tracking we'll probably need to extend the mdev pinned-pages
interface to let vendor drivers indicate that a pinning is intended to
be read-only, and perhaps also a way to unpin as clean a page that was
pinned as read-write, if the device did not write to it.  So perhaps
vfio_pin_pages_for_read() and vfio_unpin_pages_clean().  Thanks,

Alex
Paolo Bonzini Dec. 17, 2019, 4:28 p.m. UTC | #65
On 17/12/19 17:24, Peter Xu wrote:
>> No, please pass it all the way down to the [&] functions but not to
>> kvm_write_guest_page.  Those should keep using vcpu->kvm.
> Actually I even wanted to refactor these helpers.  I mean, we have two
> sets of helpers now, kvm_[vcpu]_{read|write}*(), so one set is per-vm,
> the other set is per-vcpu.  IIUC the only difference of these two are
> whether we should consider ((vcpu)->arch.hflags & HF_SMM_MASK) or we
> just write to address space zero always.

Right.

> Could we unify them into a
> single set of helper (I'll just drop the *_vcpu_* helpers because it's
> longer when write) but we always pass in vcpu* as the first parameter?
> Then we add another parameter "vcpu_smm" to show whether we want to
> consider the HF_SMM_MASK flag.

You'd have to check through all KVM implementations whether you always
have the vCPU.  Also non-x86 doesn't have address spaces, and by the
time you add ", true" or ", false" it's longer than the "_vcpu_" you
have removed.  So, not a good idea in my opinion. :D

Paolo
Paolo Bonzini Dec. 17, 2019, 4:30 p.m. UTC | #66
On 17/12/19 17:18, Alex Williamson wrote:
>>
>> Alex, if you are OK we'll work on such interface and move kvmgt to use it.
>> After it's accepted, we can also mark pages dirty through this new interface
>> in Kirti's dirty page tracking series.
> I'm not sure what you're asking for, is it an interface for the host
> CPU to read/write the memory backing of a mapped IOVA range without
> pinning pages?  That seems like something like that would make sense for
> an emulation model where a page does not need to be pinned for physical
> DMA.  If you're asking more for an interface that understands the
> userspace driver is a VM (ie. implied using a _guest postfix on the
> function name) and knows about GPA mappings beyond the windows directly
> mapped for device access, I'd not look fondly on such a request.

No, it would definitely be the former, using IOVAs to access guest
memory---kvmgt is currently doing the latter by calling into KVM, and
I'm not really fond of that either.

Paolo
Paolo Bonzini Dec. 17, 2019, 4:31 p.m. UTC | #67
On 17/12/19 16:38, Peter Xu wrote:
> There's still time to persuade me to going back to it. :)
> 
> (Though, yes I still like current solution... if we can get rid of the
>  only kvmgt ugliness, we can even throw away the per-vm ring with its
>  "extra" 4k page.  Then I suppose it'll be even harder to persuade me :)

Actually that's what convinced me in the first place, so let's
absolutely get rid of both the per-VM ring and the union.  Kevin and
Alex have answered and everybody seems to agree.

Paolo
Peter Xu Dec. 17, 2019, 4:42 p.m. UTC | #68
On Tue, Dec 17, 2019 at 05:31:48PM +0100, Paolo Bonzini wrote:
> On 17/12/19 16:38, Peter Xu wrote:
> > There's still time to persuade me to going back to it. :)
> > 
> > (Though, yes I still like current solution... if we can get rid of the
> >  only kvmgt ugliness, we can even throw away the per-vm ring with its
> >  "extra" 4k page.  Then I suppose it'll be even harder to persuade me :)
> 
> Actually that's what convinced me in the first place, so let's
> absolutely get rid of both the per-VM ring and the union.  Kevin and
> Alex have answered and everybody seems to agree.

Yeah that'd be perfect.

However I just noticed something... Note that we still haven't looked
into the non-x86 archs.  It's the same question as when I asked whether
we can unify the kvm[_vcpu]_write() interfaces and you'd like me to
read the non-x86 archs - I think it's time I read them, because it's
still possible that non-x86 archs will still need the per-vm ring...
and that could be another problem if we eventually want to spread the
dirty ring idea outside of x86.
Paolo Bonzini Dec. 17, 2019, 4:48 p.m. UTC | #69
On 17/12/19 17:42, Peter Xu wrote:
> 
> However I just noticed something... Note that we still didn't read
> into non-x86 archs, I think it's the same question as when I asked
> whether we can unify the kvm[_vcpu]_write() interfaces and you'd like
> me to read the non-x86 archs - I think it's time I read them, because
> it's still possible that non-x86 archs will still need the per-vm
> ring... then that could be another problem if we want to at last
> spread the dirty ring idea outside of x86.

We can take a look, but I think based on x86 experience it's okay if we
restrict dirty ring to arches that do no VM-wide accesses.

Paolo
Peter Xu Dec. 17, 2019, 7:41 p.m. UTC | #70
On Tue, Dec 17, 2019 at 05:48:58PM +0100, Paolo Bonzini wrote:
> On 17/12/19 17:42, Peter Xu wrote:
> > 
> > However I just noticed something... Note that we still didn't read
> > into non-x86 archs, I think it's the same question as when I asked
> > whether we can unify the kvm[_vcpu]_write() interfaces and you'd like
> > me to read the non-x86 archs - I think it's time I read them, because
> > it's still possible that non-x86 archs will still need the per-vm
> > ring... then that could be another problem if we want to at last
> > spread the dirty ring idea outside of x86.
> 
> We can take a look, but I think based on x86 experience it's okay if we
> restrict dirty ring to arches that do no VM-wide accesses.

Here it is - a quick update on callers of mark_page_dirty_in_slot().
The same reverse trace, but ignoring all common and x86 code path
(which I covered in the other thread):

==================================

   mark_page_dirty_in_slot (non-x86)
        mark_page_dirty
            kvm_write_guest_page
                kvm_write_guest
                    kvm_write_guest_lock
                        vgic_its_save_ite [?]
                        vgic_its_save_dte [?]
                        vgic_its_save_cte [?]
                        vgic_its_save_collection_table [?]
                        vgic_v3_lpi_sync_pending_status [?]
                        vgic_v3_save_pending_tables [?]
                    kvmppc_rtas_hcall [&]
                    kvmppc_st [&]
                    access_guest [&]
                    put_guest_lc [&]
                    write_guest_lc [&]
                    write_guest_abs [&]
            mark_page_dirty
                _kvm_mips_map_page_fast [&]
                kvm_mips_map_page [&]
                kvmppc_mmu_map_page [&]
                kvmppc_copy_guest
                    kvmppc_h_page_init [&]
                kvmppc_xive_native_vcpu_eq_sync [&]
                adapter_indicators_set [?] (from kvm_set_irq)
                kvm_s390_sync_dirty_log [?]
                unpin_guest_page
                    unpin_blocks [&]
                    unpin_scb [&]

Cases with [*]: should not matter much
           [&]: should be able to change to per-vcpu context
           [?]: uncertain...

==================================

This time we've got 8 leaves with "[?]".

I'm starting with these:

        vgic_its_save_ite [?]
        vgic_its_save_dte [?]
        vgic_its_save_cte [?]
        vgic_its_save_collection_table [?]
        vgic_v3_lpi_sync_pending_status [?]
        vgic_v3_save_pending_tables [?]

These come from ARM specific ioctls like KVM_DEV_ARM_ITS_SAVE_TABLES,
KVM_DEV_ARM_ITS_RESTORE_TABLES, KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES.
IIUC ARM needed these to allow proper migration which indeed does not
have a vcpu context.

(Though I'm a bit curious why ARM didn't simply migrate this
 information explicitly from userspace; instead it seems that the
 tables are dumped into guest RAM and then recovered from there,
 which seems a bit weird)
 
Then it's this:

        adapter_indicators_set [?]

This is s390 specific, which should come from kvm_set_irq.  I'm not
sure whether we can remove the mark_page_dirty() call of this, if it's
applied from another kernel structure (which should be migrated
properly IIUC).  But I might be completely wrong.

        kvm_s390_sync_dirty_log [?]
        
This is also s390 specific, should be collecting from the hardware
PGSTE_UC_BIT bit.  No vcpu context for sure.

(I'd also be glad if anyone could give me a hint why x86 cannot use
 page table dirty bits for dirty tracking, if there's a short answer...)

I think my conclusion so far...

  - for s390 I don't think we even need this dirty ring buffer thing,
    because I think hardware tracking should be more efficient, so we
    don't need to care much about s390 in the dirty ring design either,

  - for ARM, those no-vcpu-context dirty tracking paths probably need
    to be considered, but hopefully that's a very special path so it
    rarely happens.  The bad thing is I didn't dig into how many pages
    will be dirtied when an ARM guest starts to dump all these things,
    so it could be a burst...  If it is, then there's a risk of
    triggering the ring full condition (which we wanted to avoid..)

I'm CCing Eric for ARM, Conny&David for s390, just in case there're
further inputs.

Thanks,
Tian, Kevin Dec. 18, 2019, 12:29 a.m. UTC | #71
> From: Paolo Bonzini <pbonzini@redhat.com>
> Sent: Wednesday, December 18, 2019 12:31 AM
> 
> On 17/12/19 17:18, Alex Williamson wrote:
> >>
> >> Alex, if you are OK we'll work on such interface and move kvmgt to use it.
> >> After it's accepted, we can also mark pages dirty through this new
> interface
> >> in Kirti's dirty page tracking series.
> > I'm not sure what you're asking for, is it an interface for the host
> > CPU to read/write the memory backing of a mapped IOVA range without
> > pinning pages?  That seems like something like that would make sense for
> > an emulation model where a page does not need to be pinned for physical
> > DMA.  If you're asking more for an interface that understands the
> > userspace driver is a VM (ie. implied using a _guest postfix on the
> > function name) and knows about GPA mappings beyond the windows
> directly
> > mapped for device access, I'd not look fondly on such a request.
> 
> No, it would definitely be the former, using IOVAs to access guest
> memory---kvmgt is currently doing the latter by calling into KVM, and
> I'm not really fond of that either.
> 

Exactly.  Let's work on the fix.

Thanks
Kevin
Paolo Bonzini Dec. 18, 2019, 12:33 a.m. UTC | #72
On 17/12/19 20:41, Peter Xu wrote:
> On Tue, Dec 17, 2019 at 05:48:58PM +0100, Paolo Bonzini wrote:
>> On 17/12/19 17:42, Peter Xu wrote:
>>>
>>> However I just noticed something... Note that we still didn't read
>>> into non-x86 archs, I think it's the same question as when I asked
>>> whether we can unify the kvm[_vcpu]_write() interfaces and you'd like
>>> me to read the non-x86 archs - I think it's time I read them, because
>>> it's still possible that non-x86 archs will still need the per-vm
>>> ring... then that could be another problem if we want to at last
>>> spread the dirty ring idea outside of x86.
>>
>> We can take a look, but I think based on x86 experience it's okay if we
>> restrict dirty ring to arches that do no VM-wide accesses.
> 
> Here it is - a quick update on callers of mark_page_dirty_in_slot().
> The same reverse trace, but ignoring all common and x86 code path
> (which I covered in the other thread):
> 
> ==================================
> 
>    mark_page_dirty_in_slot (non-x86)
>         mark_page_dirty
>             kvm_write_guest_page
>                 kvm_write_guest
>                     kvm_write_guest_lock
>                         vgic_its_save_ite [?]
>                         vgic_its_save_dte [?]
>                         vgic_its_save_cte [?]
>                         vgic_its_save_collection_table [?]
>                         vgic_v3_lpi_sync_pending_status [?]
>                         vgic_v3_save_pending_tables [?]
>                     kvmppc_rtas_hcall [&]
>                     kvmppc_st [&]
>                     access_guest [&]
>                     put_guest_lc [&]
>                     write_guest_lc [&]
>                     write_guest_abs [&]
>             mark_page_dirty
>                 _kvm_mips_map_page_fast [&]
>                 kvm_mips_map_page [&]
>                 kvmppc_mmu_map_page [&]
>                 kvmppc_copy_guest
>                     kvmppc_h_page_init [&]
>                 kvmppc_xive_native_vcpu_eq_sync [&]
>                 adapter_indicators_set [?] (from kvm_set_irq)
>                 kvm_s390_sync_dirty_log [?]
>                 unpin_guest_page
>                     unpin_blocks [&]
>                     unpin_scb [&]
> 
> Cases with [*]: should not matter much
>            [&]: should be able to change to per-vcpu context
>            [?]: uncertain...
> 
> ==================================
> 
> This time we've got 8 leaves with "[?]".
> 
> I'm starting with these:
> 
>         vgic_its_save_ite [?]
>         vgic_its_save_dte [?]
>         vgic_its_save_cte [?]
>         vgic_its_save_collection_table [?]
>         vgic_v3_lpi_sync_pending_status [?]
>         vgic_v3_save_pending_tables [?]
> 
> These come from ARM specific ioctls like KVM_DEV_ARM_ITS_SAVE_TABLES,
> KVM_DEV_ARM_ITS_RESTORE_TABLES, KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES.
> IIUC ARM needed these to allow proper migration which indeed does not
> have a vcpu context.
> 
> (Though I'm a bit curious why ARM didn't simply migrate these
>  information explicitly from userspace, instead it seems to me that
>  ARM guests will dump something into guest ram and then tries to
>  recover from that which seems to be a bit weird)
>  
> Then it's this:
> 
>         adapter_indicators_set [?]
> 
> This is s390 specific, which should come from kvm_set_irq.  I'm not
> sure whether we can remove the mark_page_dirty() call of this, if it's
> applied from another kernel structure (which should be migrated
> properly IIUC).  But I might be completely wrong.
> 
>         kvm_s390_sync_dirty_log [?]
>         
> This is also s390 specific, should be collecting from the hardware
> PGSTE_UC_BIT bit.  No vcpu context for sure.
> 
> (I'd be glad too if anyone could hint me why x86 cannot use page table
>  dirty bits for dirty tracking, if there's short answer...)

With PML it is.  Without PML, however, it would be much slower to
synchronize the dirty bitmap from KVM to userspace (one atomic operation
per page instead of one per 64 pages) and even impossible to have the
dirty ring.
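(The factor of 64 is just the dirty bitmap being an array of unsigned
longs: one atomic xchg fetches and clears the dirty bits of 64 pages at
a time.)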

> I think my conclusion so far...
> 
>   - for s390 I don't think we even need this dirty ring buffer thing,
>     because I think hardware trackings should be more efficient, then
>     we don't need to care much on that either from design-wise of
>     dirty ring,

I would be surprised if it's more efficient without something like PML,
but anyway the gist is correct---without write protection-based dirty
page logging, s390 cannot use the dirty page ring buffer.

>   - for ARM, those no-vcpu-context dirty tracking probably needs to be
>     considered, but hopefully that's a very special path so it rarely
>     happen.  The bad thing is I didn't dig how many pages will be
>     dirtied when ARM guest starts to dump all these things so it could
>     be a burst...  If it is, then there's risk to trigger the ring
>     full condition (which we wanted to avoid..)

It says all vCPU locks must be held, so it could just use any vCPU.  I
am not sure what's the upper limit on the number of entries, or even
whether userspace could just dirty those pages itself, or perhaps
whether there could be a different ioctl that gets the pages into
userspace memory (and then if needed userspace can copy them into guest
memory, I don't know why it is designed like that).

Paolo
Peter Xu Dec. 18, 2019, 4:32 p.m. UTC | #73
On Wed, Dec 18, 2019 at 01:33:01AM +0100, Paolo Bonzini wrote:
> On 17/12/19 20:41, Peter Xu wrote:
> > On Tue, Dec 17, 2019 at 05:48:58PM +0100, Paolo Bonzini wrote:
> >> On 17/12/19 17:42, Peter Xu wrote:
> >>>
> >>> However I just noticed something... Note that we still didn't read
> >>> into non-x86 archs, I think it's the same question as when I asked
> >>> whether we can unify the kvm[_vcpu]_write() interfaces and you'd like
> >>> me to read the non-x86 archs - I think it's time I read them, because
> >>> it's still possible that non-x86 archs will still need the per-vm
> >>> ring... then that could be another problem if we want to at last
> >>> spread the dirty ring idea outside of x86.
> >>
> >> We can take a look, but I think based on x86 experience it's okay if we
> >> restrict dirty ring to arches that do no VM-wide accesses.
> > 
> > Here it is - a quick update on callers of mark_page_dirty_in_slot().
> > The same reverse trace, but ignoring all common and x86 code path
> > (which I covered in the other thread):
> > 
> > ==================================
> > 
> >    mark_page_dirty_in_slot (non-x86)
> >         mark_page_dirty
> >             kvm_write_guest_page
> >                 kvm_write_guest
> >                     kvm_write_guest_lock
> >                         vgic_its_save_ite [?]
> >                         vgic_its_save_dte [?]
> >                         vgic_its_save_cte [?]
> >                         vgic_its_save_collection_table [?]
> >                         vgic_v3_lpi_sync_pending_status [?]
> >                         vgic_v3_save_pending_tables [?]
> >                     kvmppc_rtas_hcall [&]
> >                     kvmppc_st [&]
> >                     access_guest [&]
> >                     put_guest_lc [&]
> >                     write_guest_lc [&]
> >                     write_guest_abs [&]
> >             mark_page_dirty
> >                 _kvm_mips_map_page_fast [&]
> >                 kvm_mips_map_page [&]
> >                 kvmppc_mmu_map_page [&]
> >                 kvmppc_copy_guest
> >                     kvmppc_h_page_init [&]
> >                 kvmppc_xive_native_vcpu_eq_sync [&]
> >                 adapter_indicators_set [?] (from kvm_set_irq)
> >                 kvm_s390_sync_dirty_log [?]
> >                 unpin_guest_page
> >                     unpin_blocks [&]
> >                     unpin_scb [&]
> > 
> > Cases with [*]: should not matter much
> >            [&]: should be able to change to per-vcpu context
> >            [?]: uncertain...
> > 
> > ==================================
> > 
> > This time we've got 8 leaves with "[?]".
> > 
> > I'm starting with these:
> > 
> >         vgic_its_save_ite [?]
> >         vgic_its_save_dte [?]
> >         vgic_its_save_cte [?]
> >         vgic_its_save_collection_table [?]
> >         vgic_v3_lpi_sync_pending_status [?]
> >         vgic_v3_save_pending_tables [?]
> > 
> > These come from ARM specific ioctls like KVM_DEV_ARM_ITS_SAVE_TABLES,
> > KVM_DEV_ARM_ITS_RESTORE_TABLES, KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES.
> > IIUC ARM needed these to allow proper migration which indeed does not
> > have a vcpu context.
> > 
> > (Though I'm a bit curious why ARM didn't simply migrate these
> >  information explicitly from userspace, instead it seems to me that
> >  ARM guests will dump something into guest ram and then tries to
> >  recover from that which seems to be a bit weird)
> >  
> > Then it's this:
> > 
> >         adapter_indicators_set [?]
> > 
> > This is s390 specific, which should come from kvm_set_irq.  I'm not
> > sure whether we can remove the mark_page_dirty() call of this, if it's
> > applied from another kernel structure (which should be migrated
> > properly IIUC).  But I might be completely wrong.
> > 
> >         kvm_s390_sync_dirty_log [?]
> >         
> > This is also s390 specific, should be collecting from the hardware
> > PGSTE_UC_BIT bit.  No vcpu context for sure.
> > 
> > (I'd be glad too if anyone could hint me why x86 cannot use page table
> >  dirty bits for dirty tracking, if there's short answer...)
> 
> With PML it is.  Without PML, however, it would be much slower to
> synchronize the dirty bitmap from KVM to userspace (one atomic operation
> per page instead of one per 64 pages) and even impossible to have the
> dirty ring.

Indeed, however I think it'll be faster for hardware to mark a page as
dirty.  So could it be a tradeoff between whether we want the "collection"
to be faster or the "marking page dirty" to be faster?  IMHO "marking page
dirty" could be even more important sometimes because that affects
guest responsiveness (it blocks vcpu execution), while the collection
procedure can happen in parallel with that.

> 
> > I think my conclusion so far...
> > 
> >   - for s390 I don't think we even need this dirty ring buffer thing,
> >     because I think hardware trackings should be more efficient, then
> >     we don't need to care much on that either from design-wise of
> >     dirty ring,
> 
> I would be surprised if it's more efficient without something like PML,
> but anyway the gist is correct---without write protection-based dirty
> page logging, s390 cannot use the dirty page ring buffer.
> 
> >   - for ARM, those no-vcpu-context dirty tracking probably needs to be
> >     considered, but hopefully that's a very special path so it rarely
> >     happen.  The bad thing is I didn't dig how many pages will be
> >     dirtied when ARM guest starts to dump all these things so it could
> >     be a burst...  If it is, then there's risk to trigger the ring
> >     full condition (which we wanted to avoid..)
> 
> It says all vCPU locks must be held, so it could just use any vCPU.  I
> am not sure what's the upper limit on the number of entries, or even
> whether userspace could just dirty those pages itself, or perhaps
> whether there could be a different ioctl that gets the pages into
> userspace memory (and then if needed userspace can copy them into guest
> memory, I don't know why it is designed like that).

Yeah that's true.  I'll see whether Eric has more update on these...

Thanks,
Paolo Bonzini Dec. 18, 2019, 4:41 p.m. UTC | #74
On 18/12/19 17:32, Peter Xu wrote:
>> With PML it is.  Without PML, however, it would be much slower to
>> synchronize the dirty bitmap from KVM to userspace (one atomic operation
>> per page instead of one per 64 pages) and even impossible to have the
>> dirty ring.
>
> Indeed, however I think it'll be faster for hardware to mark page as
> dirty.  So could it be a tradeoff on whether we want the "collection"
> to be faster or "marking page dirty" to be faster?  IMHO "marking page
> dirty" could be even more important sometimes because that affects
> guest responsiveness (blocks vcpu execution), while the collection
> procedure can happen in parallel with that.

The problem is that marking pages dirty that way will be many, many
times slower, because you don't have this

                        if (!dirty_bitmap[i])
                                continue;

and instead you have to scan the whole of the page tables even if only
a handful of bits are set (reading 4K of memory for every 2M of guest
RAM).  This can be quite bad for the TLB too.  It is certainly possible
that it turns out to be faster but I would be quite surprised and, with
PML, that is more or less moot.
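(For scale: with 8-byte PTEs a 4K page-table page holds 512 entries,
each mapping a 4K page, so every 4K of page-table memory scanned covers
only 2M of guest RAM, whereas the bitmap check above skips 64 pages per
zero word.)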

Thanks,

Paolo
Peter Xu Dec. 18, 2019, 9:58 p.m. UTC | #75
On Tue, Dec 17, 2019 at 05:28:54PM +0100, Paolo Bonzini wrote:
> On 17/12/19 17:24, Peter Xu wrote:
> >> No, please pass it all the way down to the [&] functions but not to
> >> kvm_write_guest_page.  Those should keep using vcpu->kvm.
> > Actually I even wanted to refactor these helpers.  I mean, we have two
> > sets of helpers now, kvm_[vcpu]_{read|write}*(), so one set is per-vm,
> > the other set is per-vcpu.  IIUC the only difference of these two are
> > whether we should consider ((vcpu)->arch.hflags & HF_SMM_MASK) or we
> > just write to address space zero always.
> 
> Right.
> 
> > Could we unify them into a
> > single set of helper (I'll just drop the *_vcpu_* helpers because it's
> > longer when write) but we always pass in vcpu* as the first parameter?
> > Then we add another parameter "vcpu_smm" to show whether we want to
> > consider the HF_SMM_MASK flag.
> 
> You'd have to check through all KVM implementations whether you always
> have the vCPU.  Also non-x86 doesn't have address spaces, and by the
> time you add ", true" or ", false" it's longer than the "_vcpu_" you
> have removed.  So, not a good idea in my opinion. :D

Well, now I've changed my mind. :) (considering that we still have
many places that will not have vcpu*...)

I can simply add that "vcpu_smm" parameter to kvm_vcpu_write_*()
without removing the kvm_write_*() helpers.  Then I'll be able to
convert most of the kvm_write_*() (or its family) callers to
kvm_vcpu_write*(..., vcpu_smm=false) calls where proper.

Would that be good?
Sean Christopherson Dec. 18, 2019, 10:24 p.m. UTC | #76
On Wed, Dec 18, 2019 at 04:58:57PM -0500, Peter Xu wrote:
> On Tue, Dec 17, 2019 at 05:28:54PM +0100, Paolo Bonzini wrote:
> > On 17/12/19 17:24, Peter Xu wrote:
> > >> No, please pass it all the way down to the [&] functions but not to
> > >> kvm_write_guest_page.  Those should keep using vcpu->kvm.
> > > Actually I even wanted to refactor these helpers.  I mean, we have two
> > > sets of helpers now, kvm_[vcpu]_{read|write}*(), so one set is per-vm,
> > > the other set is per-vcpu.  IIUC the only difference of these two are
> > > whether we should consider ((vcpu)->arch.hflags & HF_SMM_MASK) or we
> > > just write to address space zero always.
> > 
> > Right.
> > 
> > > Could we unify them into a
> > > single set of helper (I'll just drop the *_vcpu_* helpers because it's
> > > longer when write) but we always pass in vcpu* as the first parameter?
> > > Then we add another parameter "vcpu_smm" to show whether we want to
> > > consider the HF_SMM_MASK flag.
> > 
> > You'd have to check through all KVM implementations whether you always
> > have the vCPU.  Also non-x86 doesn't have address spaces, and by the
> > time you add ", true" or ", false" it's longer than the "_vcpu_" you
> > have removed.  So, not a good idea in my opinion. :D
> 
> Well, now I've changed my mind. :) (considering that we still have
> many places that will not have vcpu*...)
> 
> I can simply add that "vcpu_smm" parameter to kvm_vcpu_write_*()
> without removing the kvm_write_*() helpers.  Then I'll be able to
> convert most of the kvm_write_*() (or its family) callers to
> kvm_vcpu_write*(..., vcpu_smm=false) calls where proper.
> 
> Would that be good?

I've lost track of the problem you're trying to solve, but if you do
something like "vcpu_smm=false", explicitly pass an address space ID
instead of hardcoding x86 specific SMM crud, e.g.

	kvm_vcpu_write*(..., as_id=0);
Paolo Bonzini Dec. 18, 2019, 10:37 p.m. UTC | #77
On 18/12/19 23:24, Sean Christopherson wrote:
> I've lost track of the problem you're trying to solve, but if you do
> something like "vcpu_smm=false", explicitly pass an address space ID
> instead of hardcoding x86 specific SMM crud, e.g.
> 
> 	kvm_vcpu_write*(..., as_id=0);

And the point of having kvm_vcpu_* vs. kvm_write_* was exactly not to
have to hardcode the address space ID.  If anything you could add a
__kvm_vcpu_write_* API that takes vcpu+as_id, but really I'd prefer to
keep kvm_get_running_vcpu() for now and then it can be refactored later.
 There are already way too many memory r/w APIs...

Paolo
Peter Xu Dec. 18, 2019, 10:49 p.m. UTC | #78
On Wed, Dec 18, 2019 at 11:37:31PM +0100, Paolo Bonzini wrote:
> On 18/12/19 23:24, Sean Christopherson wrote:
> > I've lost track of the problem you're trying to solve, but if you do
> > something like "vcpu_smm=false", explicitly pass an address space ID
> > instead of hardcoding x86 specific SMM crud, e.g.
> > 
> > 	kvm_vcpu_write*(..., as_id=0);
> 
> And the point of having kvm_vcpu_* vs. kvm_write_* was exactly to not
> having to hardcode the address space ID.  If anything you could add a
> __kvm_vcpu_write_* API that takes vcpu+as_id, but really I'd prefer to
> keep kvm_get_running_vcpu() for now and then it can be refactored later.
>  There are already way too many memory r/w APIs...

Yeah, actually that's why I wanted to start working on that, just in case
it could help to unify all of them some day (and since we did go a few
steps forward on that when discussing the dirty ring).  But yeah
kvm_get_running_vcpu() for sure works for us already; let's go the
easy way this time.  Thanks,
Peter Xu Dec. 20, 2019, 6:19 p.m. UTC | #79
On Fri, Dec 13, 2019 at 03:23:24PM -0500, Peter Xu wrote:
> > > +If one of the ring buffers is full, the guest will exit to userspace
> > > +with the exit reason set to KVM_EXIT_DIRTY_LOG_FULL, and the
> > > +KVM_RUN ioctl will return -EINTR. Once that happens, userspace
> > > +should pause all the vcpus, then harvest all the dirty pages and
> > > +rearm the dirty traps. It can unpause the guest after that.
> > 
> > Except for the condition above, why is it necessary to pause other VCPUs
> > than the one being harvested?
> 
> This is a good question.  Paolo could correct me if I'm wrong.
> 
> Firstly I think this should rarely happen if the userspace is
> collecting the dirty bits from time to time.  If it happens, we'll
> need to call KVM_RESET_DIRTY_RINGS to reset all the rings.  Then the
> question actually becomes to: Whether we'd like to have per-vcpu
> KVM_RESET_DIRTY_RINGS?

Hmm, rethinking this, I could have erroneously deduced something from
Christophe's question.  Christophe was asking why we kick the other
vcpus, which does not mean that the RESET needs to be per-vcpu.

So now I tend to agree with Christophe that I can't find a reason why
we need to kick all vcpus out.  Even if we need to do TLB flushing for
all vcpus at RESET time, we can simply collect all the rings before
sending the RESET, so that's not really a reason to explicitly kick
them from userspace.  So I plan to remove this sentence in the next
version (which is only a document update).
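
For completeness, here is a minimal sketch of the userspace harvest +
reset loop described in the documentation below.  The struct layouts,
KVM_DIRTY_LOG_PAGE_OFFSET and KVM_RESET_DIRTY_RINGS are taken from this
patch; the mmap plumbing, handle_dirty_gfn() and the lack of error
handling are illustrative only:

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Hypothetical consumer that feeds the migration bitmap/queue. */
	extern void handle_dirty_gfn(uint32_t slot, uint64_t offset);

	/* Per-vcpu ring: kvm_dirty_gfn[] is mmapped from the vcpu fd at
	 * KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE, while the indexes live
	 * inside the mmapped kvm_run (see the api.txt update below). */
	static struct kvm_dirty_gfn *gfns;
	static struct kvm_dirty_ring_indexes *idx;

	static void harvest_and_reset(int vm_fd, uint32_t ring_size)
	{
		/* fetch_index is only written by userspace */
		uint32_t fetch = idx->fetch_index;
		/* avail_index is only written by the kernel */
		uint32_t avail = __atomic_load_n(&idx->avail_index,
						 __ATOMIC_ACQUIRE);

		while (fetch != avail) {
			struct kvm_dirty_gfn *e =
				&gfns[fetch & (ring_size - 1)];

			/* e->slot encodes as_id | slot_id, e->offset is
			 * the gfn offset within that slot */
			handle_dirty_gfn(e->slot, e->offset);
			fetch++;
		}
		__atomic_store_n(&idx->fetch_index, fetch, __ATOMIC_RELEASE);

		/* re-arm the dirty traps up to fetch_index */
		ioctl(vm_fd, KVM_RESET_DIRTY_RINGS);
	}

Kicking the vcpus for a hardware buffer flush and the "reset before
reading the page contents" rule from the documentation still apply
around this loop; they are left out to keep the sketch short.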
diff mbox series

Patch

diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
index 49183add44e7..fa622c9a2eb8 100644
--- a/Documentation/virt/kvm/api.txt
+++ b/Documentation/virt/kvm/api.txt
@@ -231,6 +231,7 @@  Based on their initialization different VMs may have different capabilities.
 It is thus encouraged to use the vm ioctl to query for capabilities (available
 with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
 
+
 4.5 KVM_GET_VCPU_MMAP_SIZE
 
 Capability: basic
@@ -243,6 +244,18 @@  The KVM_RUN ioctl (cf.) communicates with userspace via a shared
 memory region.  This ioctl returns the size of that region.  See the
 KVM_RUN documentation for details.
 
+Besides the size of the KVM_RUN communication region, other areas of
+the VCPU file descriptor can be mmap-ed, including:
+
+- if KVM_CAP_COALESCED_MMIO is available, a page at
+  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
+  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
+  KVM_CAP_COALESCED_MMIO is not documented yet.
+
+- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
+  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
+  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
+
 
 4.6 KVM_SET_MEMORY_REGION
 
@@ -5358,6 +5371,7 @@  CPU when the exception is taken. If this virtual SError is taken to EL1 using
 AArch64, this value will be reported in the ISS field of ESR_ELx.
 
 See KVM_CAP_VCPU_EVENTS for more details.
+
 8.20 KVM_CAP_HYPERV_SEND_IPI
 
 Architectures: x86
@@ -5365,6 +5379,7 @@  Architectures: x86
 This capability indicates that KVM supports paravirtualized Hyper-V IPI send
 hypercalls:
 HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
+
 8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
 
 Architecture: x86
@@ -5378,3 +5393,97 @@  handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
 flush hypercalls by Hyper-V) so userspace should disable KVM identification
 in CPUID and only exposes Hyper-V identification. In this case, guest
 thinks it's running on Hyper-V and only use Hyper-V hypercalls.
+
+8.22 KVM_CAP_DIRTY_LOG_RING
+
+Architectures: x86
+Parameters: args[0] - size of the dirty log ring
+
+KVM is capable of tracking dirty memory using ring buffers that are
+mmaped into userspace; there is one dirty ring per vcpu and one global
+ring per vm.
+
+One dirty ring has the following two major structures:
+
+struct kvm_dirty_ring {
+	u16 dirty_index;
+	u16 reset_index;
+	u32 size;
+	u32 soft_limit;
+	spinlock_t lock;
+	struct kvm_dirty_gfn *dirty_gfns;
+};
+
+struct kvm_dirty_ring_indexes {
+	__u32 avail_index; /* set by kernel */
+	__u32 fetch_index; /* set by userspace */
+};
+
+While each dirty entry is defined as:
+
+struct kvm_dirty_gfn {
+        __u32 pad;
+        __u32 slot; /* as_id | slot_id */
+        __u64 offset;
+};
+
+The fields in kvm_dirty_ring will be only internal to KVM itself,
+while the fields in kvm_dirty_ring_indexes will be exposed to
+userspace to be either read or written.
+
+The two indices in the ring buffer are free running counters.
+
+In pseudocode, processing the ring buffer looks like this:
+
+	idx = load-acquire(&ring->fetch_index);
+	while (idx != ring->avail_index) {
+		struct kvm_dirty_gfn *entry;
+		entry = &ring->dirty_gfns[idx & (size - 1)];
+		...
+
+		idx++;
+	}
+	ring->fetch_index = idx;
+
+Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
+to enable this capability for the new guest and set the size of the
+rings.  It is only allowed before creating any vCPU, and the size of
+the ring must be a power of two.  The larger the ring buffer, the less
+likely the ring is full and the VM is forced to exit to userspace. The
+optimal size depends on the workload, but it is recommended that it be
+at least 64 KiB (4096 entries).
+
+After the capability is enabled, userspace can mmap the global ring
+buffer (kvm_dirty_gfn[], offset KVM_DIRTY_LOG_PAGE_OFFSET) and the
+indexes (kvm_dirty_ring_indexes, offset 0) from the VM file
+descriptor.  The per-vcpu dirty ring is instead mmapped when the vcpu
+is created, similar to the kvm_run struct (kvm_dirty_ring_indexes
+lives inside kvm_run, while kvm_dirty_gfn[] is at offset
+KVM_DIRTY_LOG_PAGE_OFFSET).
+
+Just like for dirty page bitmaps, the buffer tracks writes to
+all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
+set in KVM_SET_USER_MEMORY_REGION.  Once a memory region is registered
+with the flag set, userspace can start harvesting dirty pages from the
+ring buffer.
+
+To harvest the dirty pages, userspace accesses the mmaped ring buffer
+to read the dirty GFNs up to avail_index, and sets the fetch_index
+accordingly.  This can be done when the guest is running or paused,
+and dirty pages need not be collected all at once.  After processing
+one or more entries in the ring buffer, userspace calls the VM ioctl
+KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
+fetch_index and to mark those pages clean.  Therefore, the ioctl
+must be called *before* reading the content of the dirty pages.
+
+However, there is a major difference compared to the
+KVM_GET_DIRTY_LOG interface: when reading the dirty ring from
+userspace, it is still possible that the kernel has not yet flushed
+the hardware dirty buffers into the kernel buffer.  To force that
+flush, one needs to kick the vcpu out for a hardware buffer flush (vmexit).
+
+If one of the ring buffers is full, the guest will exit to userspace
+with the exit reason set to KVM_EXIT_DIRTY_LOG_FULL, and the
+KVM_RUN ioctl will return -EINTR. Once that happens, userspace
+should pause all the vcpus, then harvest all the dirty pages and
+rearm the dirty traps. It can unpause the guest after that.
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index b19ef421084d..0acee817adfb 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -5,7 +5,8 @@  ccflags-y += -Iarch/x86/kvm
 KVM := ../../../virt/kvm
 
 kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
-				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
+				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
+				$(KVM)/dirty_ring.o
 kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
 
 kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
new file mode 100644
index 000000000000..8335635b7ff7
--- /dev/null
+++ b/include/linux/kvm_dirty_ring.h
@@ -0,0 +1,67 @@ 
+#ifndef KVM_DIRTY_RING_H
+#define KVM_DIRTY_RING_H
+
+/*
+ * struct kvm_dirty_ring is defined in include/uapi/linux/kvm.h.
+ *
+ * dirty_gfns:  shared with userspace via mmap. It is the compact array
+ *              that holds the dirty guest frame numbers.
+ * dirty_index: free running counter that points to the next slot in
+ *              dirty_ring->dirty_gfns  where a new dirty page should go.
+ * reset_index: free running counter that points to the next dirty page
+ *              in dirty_ring->dirty_gfns for which dirty trap needs to
+ *              be reenabled
+ * size:        size of the compact list, dirty_ring->dirty_gfns
+ * soft_limit:  when the number of dirty pages in the list reaches this
+ *              limit, vcpu that owns this ring should exit to userspace
+ *              to allow userspace to harvest all the dirty pages
+ * lock:        protects dirty_ring, only in use if this is the global
+ *              ring
+ *
+ * The number of dirty pages in the ring is dirty_index - reset_index.
+ *
+ * The user-visible indexes live in the struct kvm_dirty_ring_indexes
+ * associated with this ring (inside kvm_run for the per-vcpu rings,
+ * inside kvm_vm_run for the per-vm ring).  The kernel increments
+ * avail_index after dirty_index is incremented.  When userspace
+ * harvests the dirty pages, it increments fetch_index up to
+ * avail_index.  When the kernel reenables dirty traps for the dirty
+ * pages, it increments reset_index up to fetch_index.
+ */
+struct kvm_dirty_ring {
+	u32 dirty_index;
+	u32 reset_index;
+	u32 size;
+	u32 soft_limit;
+	spinlock_t lock;
+	struct kvm_dirty_gfn *dirty_gfns;
+};
+
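+/*
+ * kvm_dirty_ring_get_rsvd_entries returns the number of entries kept
+ * in reserve (see soft_limit above); kvm_dirty_ring_alloc allocates
+ * and initializes the gfn array according to kvm->dirty_ring_size.
+ */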
+u32 kvm_dirty_ring_get_rsvd_entries(void);
+int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring);
+
+/*
+ * called with kvm->slots_lock held, returns the number of
+ * processed pages.
+ */
+int kvm_dirty_ring_reset(struct kvm *kvm,
+			 struct kvm_dirty_ring *ring,
+			 struct kvm_dirty_ring_indexes *indexes);
+
+/*
+ * returns 0: successfully pushed
+ *         1: successfully pushed, soft limit reached,
+ *            vcpu should exit to userspace
+ *         -EBUSY: unable to push, dirty ring full.
+ */
+int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
+			struct kvm_dirty_ring_indexes *indexes,
+			u32 slot, u64 offset, bool lock);
+
+/* for use in vm_operations_struct */
+struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i);
+
+void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
+bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring);
+
+#endif
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 498a39462ac1..7b747bc9ff3e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -34,6 +34,7 @@ 
 #include <linux/kvm_types.h>
 
 #include <asm/kvm_host.h>
+#include <linux/kvm_dirty_ring.h>
 
 #ifndef KVM_MAX_VCPU_ID
 #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
@@ -146,6 +147,7 @@  static inline bool is_error_page(struct page *page)
 #define KVM_REQ_MMU_RELOAD        (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_PENDING_TIMER     2
 #define KVM_REQ_UNHALT            3
+#define KVM_REQ_DIRTY_RING_FULL   4
 #define KVM_REQUEST_ARCH_BASE     8
 
 #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
@@ -321,6 +323,7 @@  struct kvm_vcpu {
 	bool ready;
 	struct kvm_vcpu_arch arch;
 	struct dentry *debugfs_dentry;
+	struct kvm_dirty_ring dirty_ring;
 };
 
 static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
@@ -501,6 +504,10 @@  struct kvm {
 	struct srcu_struct srcu;
 	struct srcu_struct irq_srcu;
 	pid_t userspace_pid;
+	/* Data structure to be exported by mmap(kvm->fd, 0) */
+	struct kvm_vm_run *vm_run;
+	u32 dirty_ring_size;
+	struct kvm_dirty_ring vm_dirty_ring;
 };
 
 #define kvm_err(fmt, ...) \
@@ -832,6 +839,8 @@  void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 					gfn_t gfn_offset,
 					unsigned long mask);
 
+void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
+
 int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 				struct kvm_dirty_log *log);
 int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
@@ -1411,4 +1420,28 @@  int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
 				uintptr_t data, const char *name,
 				struct task_struct **thread_ptr);
 
+/*
+ * This defines how many reserved entries we want to keep before we
+ * kick the vcpu out to userspace to avoid the dirty ring becoming
+ * full.  This value can be tuned higher if e.g. PML is enabled on the
+ * host.
+ */
+#define  KVM_DIRTY_RING_RSVD_ENTRIES  64
+
+/* Max number of entries allowed for each kvm dirty ring */
+#define  KVM_DIRTY_RING_MAX_ENTRIES  65536
+
+/*
+ * Arch needs to define these macros after implementing the dirty ring
+ * feature.  KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
+ * starting page offset of the dirty ring structures, while
+ * KVM_DIRTY_RING_VERSION should be defined as >=1.  By default, this
+ * feature is off on all archs.
+ */
+#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
+#define KVM_DIRTY_LOG_PAGE_OFFSET 0
+#endif
+#ifndef KVM_DIRTY_RING_VERSION
+#define KVM_DIRTY_RING_VERSION 0
+#endif
+
 #endif
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 1c88e69db3d9..d9d03eea145a 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -11,6 +11,7 @@  struct kvm_irq_routing_table;
 struct kvm_memory_slot;
 struct kvm_one_reg;
 struct kvm_run;
+struct kvm_vm_run;
 struct kvm_userspace_memory_region;
 struct kvm_vcpu;
 struct kvm_vcpu_init;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index e6f17c8e2dba..0b88d76d6215 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -236,6 +236,7 @@  struct kvm_hyperv_exit {
 #define KVM_EXIT_IOAPIC_EOI       26
 #define KVM_EXIT_HYPERV           27
 #define KVM_EXIT_ARM_NISV         28
+#define KVM_EXIT_DIRTY_RING_FULL  29
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -247,6 +248,11 @@  struct kvm_hyperv_exit {
 /* Encounter unexpected vm-exit reason */
 #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON	4
 
+struct kvm_dirty_ring_indexes {
+	__u32 avail_index; /* set by kernel */
+	__u32 fetch_index; /* set by userspace */
+};
+
 /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
 struct kvm_run {
 	/* in */
@@ -421,6 +427,13 @@  struct kvm_run {
 		struct kvm_sync_regs regs;
 		char padding[SYNC_REGS_SIZE_BYTES];
 	} s;
+
+	struct kvm_dirty_ring_indexes vcpu_ring_indexes;
+};
+
+/* Returned by mmap(kvm->fd, offset=0) */
+struct kvm_vm_run {
+	struct kvm_dirty_ring_indexes vm_ring_indexes;
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
@@ -1009,6 +1022,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
 #define KVM_CAP_ARM_NISV_TO_USER 177
 #define KVM_CAP_ARM_INJECT_EXT_DABT 178
+#define KVM_CAP_DIRTY_LOG_RING 179
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1472,6 +1486,9 @@  struct kvm_enc_region {
 /* Available with KVM_CAP_ARM_SVE */
 #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
 
+/* Available with KVM_CAP_DIRTY_LOG_RING */
+#define KVM_RESET_DIRTY_RINGS     _IO(KVMIO, 0xc3)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
@@ -1622,4 +1639,23 @@  struct kvm_hyperv_eventfd {
 #define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
 #define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
 
+/*
+ * The following are the requirements for supporting dirty log ring
+ * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
+ *
+ * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
+ *    of kvm_write_* so that the global dirty ring is not filled up
+ *    too quickly.
+ * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
+ *    enabling dirty logging.
+ * 3. There should not be a separate step to synchronize hardware
+ *    dirty bitmap with KVM's.
+ */
+
+struct kvm_dirty_gfn {
+	__u32 pad;
+	__u32 slot;
+	__u64 offset;
+};
+
 #endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
new file mode 100644
index 000000000000..9264891f3c32
--- /dev/null
+++ b/virt/kvm/dirty_ring.c
@@ -0,0 +1,156 @@ 
+#include <linux/kvm_host.h>
+#include <linux/kvm.h>
+#include <linux/vmalloc.h>
+#include <linux/kvm_dirty_ring.h>
+
+u32 kvm_dirty_ring_get_rsvd_entries(void)
+{
+	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
+}
+
+int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
+{
+	u32 size = kvm->dirty_ring_size;
+
+	ring->dirty_gfns = vmalloc(size);
+	if (!ring->dirty_gfns)
+		return -ENOMEM;
+	memset(ring->dirty_gfns, 0, size);
+
+	ring->size = size / sizeof(struct kvm_dirty_gfn);
+	ring->soft_limit =
+	    (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -
+	    kvm_dirty_ring_get_rsvd_entries();
+	ring->dirty_index = 0;
+	ring->reset_index = 0;
+	spin_lock_init(&ring->lock);
+
+	return 0;
+}
+
+int kvm_dirty_ring_reset(struct kvm *kvm,
+			 struct kvm_dirty_ring *ring,
+			 struct kvm_dirty_ring_indexes *indexes)
+{
+	u32 cur_slot, next_slot;
+	u64 cur_offset, next_offset;
+	unsigned long mask;
+	u32 fetch;
+	int count = 0;
+	struct kvm_dirty_gfn *entry;
+
+	fetch = READ_ONCE(indexes->fetch_index);
+	if (fetch == ring->reset_index)
+		return 0;
+
+	entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
+	/*
+	 * The ring buffer is shared with userspace, which might mmap
+	 * it and concurrently modify slot and offset.  Userspace must
+	 * not be trusted!  READ_ONCE prevents the compiler from changing
+	 * the values after they've been range-checked (the checks are
+	 * in kvm_reset_dirty_gfn).
+	 */
+	smp_read_barrier_depends();
+	cur_slot = READ_ONCE(entry->slot);
+	cur_offset = READ_ONCE(entry->offset);
+	mask = 1;
+	count++;
+	ring->reset_index++;
+	while (ring->reset_index != fetch) {
+		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
+		smp_read_barrier_depends();
+		next_slot = READ_ONCE(entry->slot);
+		next_offset = READ_ONCE(entry->offset);
+		ring->reset_index++;
+		count++;
+		/*
+		 * Try to coalesce the reset operations when the guest is
+		 * scanning pages in the same slot.
+		 */
+		if (next_slot == cur_slot) {
+			int delta = next_offset - cur_offset;
+
+			if (delta >= 0 && delta < BITS_PER_LONG) {
+				mask |= 1ull << delta;
+				continue;
+			}
+
+			/* Backwards visit, careful about overflows!  */
+			if (delta > -BITS_PER_LONG && delta < 0 &&
+			    (mask << -delta >> -delta) == mask) {
+				cur_offset = next_offset;
+				mask = (mask << -delta) | 1;
+				continue;
+			}
+		}
+		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+		cur_slot = next_slot;
+		cur_offset = next_offset;
+		mask = 1;
+	}
+	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+
+	return count;
+}
+
+static inline u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
+{
+	return ring->dirty_index - ring->reset_index;
+}
+
+bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
+{
+	return kvm_dirty_ring_used(ring) >= ring->size;
+}
+
+/*
+ * Returns:
+ *   >0 if we should kick the vcpu out,
+ *   =0 if the gfn pushed successfully, or,
+ *   <0 if error (e.g. ring full)
+ */
+int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
+			struct kvm_dirty_ring_indexes *indexes,
+			u32 slot, u64 offset, bool lock)
+{
+	int ret;
+	struct kvm_dirty_gfn *entry;
+
+	if (lock)
+		spin_lock(&ring->lock);
+
+	if (kvm_dirty_ring_full(ring)) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
+	entry->slot = slot;
+	entry->offset = offset;
+	smp_wmb();
+	ring->dirty_index++;
+	WRITE_ONCE(indexes->avail_index, ring->dirty_index);
+	ret = kvm_dirty_ring_used(ring) >= ring->soft_limit;
+	pr_debug("%s: slot %u offset %llu used %u\n",
+		 __func__, slot, offset, kvm_dirty_ring_used(ring));
+
+out:
+	if (lock)
+		spin_unlock(&ring->lock);
+
+	return ret;
+}
+
+struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i)
+{
+	return vmalloc_to_page((void *)ring->dirty_gfns + i * PAGE_SIZE);
+}
+
+void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
+{
+	if (ring->dirty_gfns) {
+		vfree(ring->dirty_gfns);
+		ring->dirty_gfns = NULL;
+	}
+}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 681452d288cd..8642c977629b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -64,6 +64,8 @@ 
 #define CREATE_TRACE_POINTS
 #include <trace/events/kvm.h>
 
+#include <linux/kvm_dirty_ring.h>
+
 /* Worst case buffer size needed for holding an integer. */
 #define ITOA_MAX_LEN 12
 
@@ -149,6 +151,10 @@  static void mark_page_dirty_in_slot(struct kvm *kvm,
 				    struct kvm_vcpu *vcpu,
 				    struct kvm_memory_slot *memslot,
 				    gfn_t gfn);
+static void mark_page_dirty_in_ring(struct kvm *kvm,
+				    struct kvm_vcpu *vcpu,
+				    struct kvm_memory_slot *slot,
+				    gfn_t gfn);
 
 __visible bool kvm_rebooting;
 EXPORT_SYMBOL_GPL(kvm_rebooting);
@@ -359,11 +365,22 @@  int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	vcpu->preempted = false;
 	vcpu->ready = false;
 
+	if (kvm->dirty_ring_size) {
+		r = kvm_dirty_ring_alloc(vcpu->kvm, &vcpu->dirty_ring);
+		if (r) {
+			kvm->dirty_ring_size = 0;
+			goto fail_free_run;
+		}
+	}
+
 	r = kvm_arch_vcpu_init(vcpu);
 	if (r < 0)
-		goto fail_free_run;
+		goto fail_free_ring;
 	return 0;
 
+fail_free_ring:
+	if (kvm->dirty_ring_size)
+		kvm_dirty_ring_free(&vcpu->dirty_ring);
 fail_free_run:
 	free_page((unsigned long)vcpu->run);
 fail:
@@ -381,6 +398,8 @@  void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
 	put_pid(rcu_dereference_protected(vcpu->pid, 1));
 	kvm_arch_vcpu_uninit(vcpu);
 	free_page((unsigned long)vcpu->run);
+	if (vcpu->kvm->dirty_ring_size)
+		kvm_dirty_ring_free(&vcpu->dirty_ring);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
 
@@ -690,6 +709,7 @@  static struct kvm *kvm_create_vm(unsigned long type)
 	struct kvm *kvm = kvm_arch_alloc_vm();
 	int r = -ENOMEM;
 	int i;
+	struct page *page;
 
 	if (!kvm)
 		return ERR_PTR(-ENOMEM);
@@ -705,6 +725,14 @@  static struct kvm *kvm_create_vm(unsigned long type)
 
 	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
 
+	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	if (!page) {
+		r = -ENOMEM;
+		goto out_err_alloc_page;
+	}
+	kvm->vm_run = page_address(page);
+	BUILD_BUG_ON(sizeof(struct kvm_vm_run) > PAGE_SIZE);
+
 	if (init_srcu_struct(&kvm->srcu))
 		goto out_err_no_srcu;
 	if (init_srcu_struct(&kvm->irq_srcu))
@@ -775,6 +803,9 @@  static struct kvm *kvm_create_vm(unsigned long type)
 out_err_no_irq_srcu:
 	cleanup_srcu_struct(&kvm->srcu);
 out_err_no_srcu:
+	free_page((unsigned long)page);
+	kvm->vm_run = NULL;
+out_err_alloc_page:
 	kvm_arch_free_vm(kvm);
 	mmdrop(current->mm);
 	return ERR_PTR(r);
@@ -800,6 +831,15 @@  static void kvm_destroy_vm(struct kvm *kvm)
 	int i;
 	struct mm_struct *mm = kvm->mm;
 
+	if (kvm->dirty_ring_size) {
+		kvm_dirty_ring_free(&kvm->vm_dirty_ring);
+	}
+
+	if (kvm->vm_run) {
+		free_page((unsigned long)kvm->vm_run);
+		kvm->vm_run = NULL;
+	}
+
 	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
 	kvm_destroy_vm_debugfs(kvm);
 	kvm_arch_sync_events(kvm);
@@ -2301,7 +2341,7 @@  static void mark_page_dirty_in_slot(struct kvm *kvm,
 {
 	if (memslot && memslot->dirty_bitmap) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
-
+		mark_page_dirty_in_ring(kvm, vcpu, memslot, gfn);
 		set_bit_le(rel_gfn, memslot->dirty_bitmap);
 	}
 }
@@ -2649,6 +2689,13 @@  void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
 
+static bool kvm_fault_in_dirty_ring(struct kvm *kvm, struct vm_fault *vmf)
+{
+	return (vmf->pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
+	    (vmf->pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
+	     kvm->dirty_ring_size / PAGE_SIZE);
+}
+
 static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
 {
 	struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
@@ -2664,6 +2711,10 @@  static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
 	else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
 		page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
 #endif
+	else if (kvm_fault_in_dirty_ring(vcpu->kvm, vmf))
+		page = kvm_dirty_ring_get_page(
+		    &vcpu->dirty_ring,
+		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
 	else
 		return kvm_arch_vcpu_fault(vcpu, vmf);
 	get_page(page);
@@ -3259,12 +3310,162 @@  static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #endif
 	case KVM_CAP_NR_MEMSLOTS:
 		return KVM_USER_MEM_SLOTS;
+	case KVM_CAP_DIRTY_LOG_RING:
+		/* Version will be zero if arch didn't implement it */
+		return KVM_DIRTY_RING_VERSION;
 	default:
 		break;
 	}
 	return kvm_vm_ioctl_check_extension(kvm, arg);
 }
 
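+/*
+ * Push a dirty gfn onto a dirty ring: the vcpu's own ring when there
+ * is a vcpu context, otherwise the per-vm ring.  If the soft limit has
+ * been reached, request an exit to userspace so the ring can be
+ * harvested.
+ */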
+static void mark_page_dirty_in_ring(struct kvm *kvm,
+				    struct kvm_vcpu *vcpu,
+				    struct kvm_memory_slot *slot,
+				    gfn_t gfn)
+{
+	u32 as_id = 0;
+	u64 offset;
+	int ret;
+	struct kvm_dirty_ring *ring;
+	struct kvm_dirty_ring_indexes *indexes;
+	bool is_vm_ring;
+
+	if (!kvm->dirty_ring_size)
+		return;
+
+	offset = gfn - slot->base_gfn;
+
+	if (vcpu) {
+		as_id = kvm_arch_vcpu_memslots_id(vcpu);
+	} else {
+		as_id = 0;
+		vcpu = kvm_get_running_vcpu();
+	}
+
+	if (vcpu) {
+		ring = &vcpu->dirty_ring;
+		indexes = &vcpu->run->vcpu_ring_indexes;
+		is_vm_ring = false;
+	} else {
+		/*
+		 * Put onto per vm ring because no vcpu context.  Kick
+		 * vcpu0 if ring is full.
+		 */
+		vcpu = kvm->vcpus[0];
+		ring = &kvm->vm_dirty_ring;
+		indexes = &kvm->vm_run->vm_ring_indexes;
+		is_vm_ring = true;
+	}
+
+	ret = kvm_dirty_ring_push(ring, indexes,
+				  (as_id << 16)|slot->id, offset,
+				  is_vm_ring);
+	if (ret < 0) {
+		if (is_vm_ring)
+			pr_warn_once("per-vm dirty log overflow\n");
+		else
+			pr_warn_once("vcpu %d dirty log overflow\n",
+				     vcpu->vcpu_id);
+		return;
+	}
+
+	if (ret)
+		kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
+}
+
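+/*
+ * Clear the dirty bitmap bits and re-enable the dirty traps for the
+ * gfns selected by @mask at @offset within the memslot encoded in
+ * @slot ((as_id << 16) | slot id), as harvested from a ring entry.
+ */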
+void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
+{
+	struct kvm_memory_slot *memslot;
+	int as_id, id;
+	u64 tmp;
+
+	as_id = slot >> 16;
+	id = (u16)slot;
+	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+		return;
+
+	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
+	if (offset >= memslot->npages)
+		return;
+
+	spin_lock(&kvm->mmu_lock);
+	/* FIXME: we should use a single AND operation, but there is no
+	 * applicable atomic API.
+	 */
+	tmp = mask;
+	while (tmp) {
+		clear_bit_le(offset + __ffs(tmp), memslot->dirty_bitmap);
+		tmp &= tmp - 1;
+	}
+
+	/* Re-protect with the original mask; the loop consumed its copy. */
+	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
+	spin_unlock(&kvm->mmu_lock);
+}
+
+static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
+{
+	int r;
+
+	/* the size should be a power of 2 */
+	if (!size || (size & (size - 1)))
+		return -EINVAL;
+
+	/* The size must cover at least the reserved entries and one full page */
+	if (size < kvm_dirty_ring_get_rsvd_entries() *
+	    sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
+		return -EINVAL;
+
+	if (size > KVM_DIRTY_RING_MAX_ENTRIES *
+	    sizeof(struct kvm_dirty_gfn))
+		return -E2BIG;
+
+	/* We only allow it to be set once */
+	if (kvm->dirty_ring_size)
+		return -EINVAL;
+
+	mutex_lock(&kvm->lock);
+
+	if (kvm->created_vcpus) {
+		/* We don't allow changing this value after vcpus are created */
+		r = -EINVAL;
+	} else {
+		kvm->dirty_ring_size = size;
+		r = kvm_dirty_ring_alloc(kvm, &kvm->vm_dirty_ring);
+		if (r) {
+			/* Unset dirty ring */
+			kvm->dirty_ring_size = 0;
+		}
+	}
+
+	mutex_unlock(&kvm->lock);
+	return r;
+}
+
+static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+	int cleared = 0;
+
+	if (!kvm->dirty_ring_size)
+		return -EINVAL;
+
+	mutex_lock(&kvm->slots_lock);
+
+	cleared += kvm_dirty_ring_reset(kvm, &kvm->vm_dirty_ring,
+					&kvm->vm_run->vm_ring_indexes);
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring,
+						&vcpu->run->vcpu_ring_indexes);
+
+	mutex_unlock(&kvm->slots_lock);
+
+	if (cleared)
+		kvm_flush_remote_tlbs(kvm);
+
+	return cleared;
+}
+
 int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 						  struct kvm_enable_cap *cap)
 {
@@ -3282,6 +3483,8 @@  static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 		kvm->manual_dirty_log_protect = cap->args[0];
 		return 0;
 #endif
+	case KVM_CAP_DIRTY_LOG_RING:
+		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}
@@ -3469,6 +3672,9 @@  static long kvm_vm_ioctl(struct file *filp,
 	case KVM_CHECK_EXTENSION:
 		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
 		break;
+	case KVM_RESET_DIRTY_RINGS:
+		r = kvm_vm_ioctl_reset_dirty_pages(kvm);
+		break;
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 	}
@@ -3517,9 +3723,39 @@  static long kvm_vm_compat_ioctl(struct file *filp,
 }
 #endif
 
+static vm_fault_t kvm_vm_fault(struct vm_fault *vmf)
+{
+	struct kvm *kvm = vmf->vma->vm_file->private_data;
+	struct page *page = NULL;
+
+	if (vmf->pgoff == 0)
+		page = virt_to_page(kvm->vm_run);
+	else if (kvm_fault_in_dirty_ring(kvm, vmf))
+		page = kvm_dirty_ring_get_page(
+		    &kvm->vm_dirty_ring,
+		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
+	else
+		return VM_FAULT_SIGBUS;
+
+	get_page(page);
+	vmf->page = page;
+	return 0;
+}
+
+static const struct vm_operations_struct kvm_vm_vm_ops = {
+	.fault = kvm_vm_fault,
+};
+
+static int kvm_vm_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	vma->vm_ops = &kvm_vm_vm_ops;
+	return 0;
+}
+
 static struct file_operations kvm_vm_fops = {
 	.release        = kvm_vm_release,
 	.unlocked_ioctl = kvm_vm_ioctl,
+	.mmap           = kvm_vm_mmap,
 	.llseek		= noop_llseek,
 	KVM_COMPAT(kvm_vm_compat_ioctl),
 };