Message ID | 1499863221-16206-6-git-send-email-wei.w.wang@intel.com (mailing list archive) |
---|---|
State | New, archived |
On Wed, Jul 12, 2017 at 08:40:18PM +0800, Wei Wang wrote: > diff --git a/include/linux/virtio.h b/include/linux/virtio.h > index 28b0e96..9f27101 100644 > --- a/include/linux/virtio.h > +++ b/include/linux/virtio.h > @@ -57,8 +57,28 @@ int virtqueue_add_sgs(struct virtqueue *vq, > void *data, > gfp_t gfp); > > +/* A desc with this init id is treated as an invalid desc */ > +#define VIRTQUEUE_DESC_ID_INIT UINT_MAX > +int virtqueue_add_chain_desc(struct virtqueue *_vq, > + uint64_t addr, > + uint32_t len, > + unsigned int *head_id, > + unsigned int *prev_id, > + bool in); > + > +int virtqueue_add_chain(struct virtqueue *_vq, > + unsigned int head, > + bool indirect, > + struct vring_desc *indirect_desc, > + void *data, > + void *ctx); > + > bool virtqueue_kick(struct virtqueue *vq); > > +bool virtqueue_kick_sync(struct virtqueue *vq); > + > +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq); > + > bool virtqueue_kick_prepare(struct virtqueue *vq); > > bool virtqueue_notify(struct virtqueue *vq); I don't much care for this API. It does exactly what balloon needs, but at cost of e.g. transparently busy-waiting. Unlikely to be a good fit for anything else. If you don't like my original _first/_next/_last, you will need to come up with something else.
On 07/12/2017 09:06 PM, Michael S. Tsirkin wrote: > On Wed, Jul 12, 2017 at 08:40:18PM +0800, Wei Wang wrote: >> diff --git a/include/linux/virtio.h b/include/linux/virtio.h >> index 28b0e96..9f27101 100644 >> --- a/include/linux/virtio.h >> +++ b/include/linux/virtio.h >> @@ -57,8 +57,28 @@ int virtqueue_add_sgs(struct virtqueue *vq, >> void *data, >> gfp_t gfp); >> >> +/* A desc with this init id is treated as an invalid desc */ >> +#define VIRTQUEUE_DESC_ID_INIT UINT_MAX >> +int virtqueue_add_chain_desc(struct virtqueue *_vq, >> + uint64_t addr, >> + uint32_t len, >> + unsigned int *head_id, >> + unsigned int *prev_id, >> + bool in); >> + >> +int virtqueue_add_chain(struct virtqueue *_vq, >> + unsigned int head, >> + bool indirect, >> + struct vring_desc *indirect_desc, >> + void *data, >> + void *ctx); >> + >> bool virtqueue_kick(struct virtqueue *vq); >> >> +bool virtqueue_kick_sync(struct virtqueue *vq); >> + >> +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq); >> + >> bool virtqueue_kick_prepare(struct virtqueue *vq); >> >> bool virtqueue_notify(struct virtqueue *vq); > I don't much care for this API. It does exactly what balloon needs, > but at cost of e.g. transparently busy-waiting. Unlikely to be > a good fit for anything else. If you were referring to this API - virtqueue_add_chain_desc(): Busy waiting only happens when the vq is full (i.e. no desc left). If necessary, I think we can add an input parameter like "bool busywaiting", then the caller can decide to simply get a -ENOSPC or busy wait to add when no desc is available. > > If you don't like my original _first/_next/_last, you will > need to come up with something else. I thought the above virtqueue_add_chain_desc() performs the same functionality as _first/next/last, which are used to grab descs from the vq and chain them together. If not, could you please elaborate on the usage of the original proposal? Best, Wei
On Wed, Jul 12, 2017 at 09:29:00PM +0800, Wei Wang wrote: > On 07/12/2017 09:06 PM, Michael S. Tsirkin wrote: > > On Wed, Jul 12, 2017 at 08:40:18PM +0800, Wei Wang wrote: > > > diff --git a/include/linux/virtio.h b/include/linux/virtio.h > > > index 28b0e96..9f27101 100644 > > > --- a/include/linux/virtio.h > > > +++ b/include/linux/virtio.h > > > @@ -57,8 +57,28 @@ int virtqueue_add_sgs(struct virtqueue *vq, > > > void *data, > > > gfp_t gfp); > > > +/* A desc with this init id is treated as an invalid desc */ > > > +#define VIRTQUEUE_DESC_ID_INIT UINT_MAX > > > +int virtqueue_add_chain_desc(struct virtqueue *_vq, > > > + uint64_t addr, > > > + uint32_t len, > > > + unsigned int *head_id, > > > + unsigned int *prev_id, > > > + bool in); > > > + > > > +int virtqueue_add_chain(struct virtqueue *_vq, > > > + unsigned int head, > > > + bool indirect, > > > + struct vring_desc *indirect_desc, > > > + void *data, > > > + void *ctx); > > > + > > > bool virtqueue_kick(struct virtqueue *vq); > > > +bool virtqueue_kick_sync(struct virtqueue *vq); > > > + > > > +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq); > > > + > > > bool virtqueue_kick_prepare(struct virtqueue *vq); > > > bool virtqueue_notify(struct virtqueue *vq); > > I don't much care for this API. It does exactly what balloon needs, > > but at cost of e.g. transparently busy-waiting. Unlikely to be > > a good fit for anything else. > > If you were referring to this API - virtqueue_add_chain_desc(): > > Busy waiting only happens when the vq is full (i.e. no desc left). If > necessary, I think we can add an input parameter like > "bool busywaiting", then the caller can decide to simply get a -ENOSPC > or busy wait to add when no desc is available. I think this just shows this API is too high level. This policy should live in drivers. > > > > If you don't like my original _first/_next/_last, you will > > need to come up with something else. > > I thought the above virtqueue_add_chain_des() performs the same > functionality as _first/next/last, which are used to grab descs from the > vq and chain them together. If not, could you please elaborate the > usage of the original proposal? > > Best, > Wei > So the way I see it, there are several issues: - internal wait - forces multiple APIs like kick/kick_sync note how kick_sync can fail but your code never checks return code - need to re-write the last descriptor - might not work for alternative layouts which always expose descriptors immediately - some kind of iterator type would be nicer instead of maintaining head/prev explicitly As for the use, it would be better to do if (!add_next(vq, ...)) { add_last(vq, ...) kick wait } Using VIRTQUEUE_DESC_ID_INIT seems to avoid a branch in the driver, but in fact it merely puts the branch in the virtio code.
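To make the suggested shape concrete, here is a minimal sketch of the driver-side pattern described above, assuming hypothetical add_next()/add_last() helpers (add_next() returning false once only one descriptor slot remains) and a hypothetical wait_for_ack(); none of these names exist in mainline, and the point is only that the kick-and-wait policy lives in the driver rather than in virtio core:

/*
 * Illustrative sketch only: add_next()/add_last() stand in for the
 * proposed _next/_last helpers, wait_for_ack() for a driver-side
 * completion wait. The "ring full -> kick -> wait" policy is driver
 * code, not virtio core.
 */
struct range { u64 addr; u32 len; };	/* one free-page chunk (illustrative) */

static void report_ranges(struct virtqueue *vq, struct range *r, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		if (!add_next(vq, r[i].addr, r[i].len)) {
			/* only one slot left: close the chain and flush */
			add_last(vq, r[i].addr, r[i].len);
			virtqueue_kick(vq);
			wait_for_ack(vq);
		}
	}
}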
On Wed, Jul 12, 2017 at 08:40:18PM +0800, Wei Wang wrote: > Add a new feature, VIRTIO_BALLOON_F_SG, which enables to > transfer a chunk of ballooned (i.e. inflated/deflated) pages using > scatter-gather lists to the host. > > The implementation of the previous virtio-balloon is not very > efficient, because the balloon pages are transferred to the > host one by one. Here is the breakdown of the time in percentage > spent on each step of the balloon inflating process (inflating > 7GB of an 8GB idle guest). > > 1) allocating pages (6.5%) > 2) sending PFNs to host (68.3%) > 3) address translation (6.1%) > 4) madvise (19%) > > It takes about 4126ms for the inflating process to complete. > The above profiling shows that the bottlenecks are stage 2) > and stage 4). > > This patch optimizes step 2) by transferring pages to the host in > sgs. An sg describes a chunk of guest physically continuous pages. > With this mechanism, step 4) can also be optimized by doing address > translation and madvise() in chunks rather than page by page. > > With this new feature, the above ballooning process takes ~491ms > resulting in an improvement of ~88%. > > TODO: optimize stage 1) by allocating/freeing a chunk of pages > instead of a single page each time. > > Signed-off-by: Wei Wang <wei.w.wang@intel.com> > Signed-off-by: Liang Li <liang.z.li@intel.com> > Suggested-by: Michael S. Tsirkin <mst@redhat.com> > --- > drivers/virtio/virtio_balloon.c | 141 ++++++++++++++++++++++--- > drivers/virtio/virtio_ring.c | 199 +++++++++++++++++++++++++++++++++--- > include/linux/virtio.h | 20 ++++ > include/uapi/linux/virtio_balloon.h | 1 + > 4 files changed, 329 insertions(+), 32 deletions(-) > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c > index f0b3a0b..aa4e7ec 100644 > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -32,6 +32,7 @@ > #include <linux/mm.h> > #include <linux/mount.h> > #include <linux/magic.h> > +#include <linux/xbitmap.h> > > /* > * Balloon device works in 4K page units. So each page is pointed to by > @@ -79,6 +80,9 @@ struct virtio_balloon { > /* Synchronize access/update to this struct virtio_balloon elements */ > struct mutex balloon_lock; > > + /* The xbitmap used to record ballooned pages */ > + struct xb page_xb; > + > /* The array of pfns we tell the Host about. */ > unsigned int num_pfns; > __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX]; > @@ -141,13 +145,71 @@ static void set_page_pfns(struct virtio_balloon *vb, > page_to_balloon_pfn(page) + i); > } > > +/* > + * Send balloon pages in sgs to host. > + * The balloon pages are recorded in the page xbitmap. Each bit in the bitmap > + * corresponds to a page of PAGE_SIZE. The page xbitmap is searched for > + * continuous "1" bits, which correspond to continuous pages, to chunk into > + * sgs. > + * > + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that > + * need to be serached. 
searched > + */ > +static void tell_host_sgs(struct virtio_balloon *vb, > + struct virtqueue *vq, > + unsigned long page_xb_start, > + unsigned long page_xb_end) > +{ > + unsigned int head_id = VIRTQUEUE_DESC_ID_INIT, > + prev_id = VIRTQUEUE_DESC_ID_INIT; > + unsigned long sg_pfn_start, sg_pfn_end; > + uint64_t sg_addr; > + uint32_t sg_size; > + > + sg_pfn_start = page_xb_start; > + while (sg_pfn_start < page_xb_end) { > + sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start, > + page_xb_end, 1); > + if (sg_pfn_start == page_xb_end + 1) > + break; > + sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1, > + page_xb_end, 0); > + sg_addr = sg_pfn_start << PAGE_SHIFT; > + sg_size = (sg_pfn_end - sg_pfn_start) * PAGE_SIZE; There's an issue here - this might not fit in uint32_t. You need to limit sg_pfn_end - something like: /* make sure sg_size below fits in a 32 bit integer */ sg_pfn_end = min(sg_pfn_end, sg_pfn_start + UINT_MAX >> PAGE_SIZE); > + virtqueue_add_chain_desc(vq, sg_addr, sg_size, &head_id, > + &prev_id, 0); > + xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end); > + sg_pfn_start = sg_pfn_end + 1; > + } > + > + if (head_id != VIRTQUEUE_DESC_ID_INIT) { > + virtqueue_add_chain(vq, head_id, 0, NULL, vb, NULL); > + virtqueue_kick_async(vq, vb->acked); > + } > +} > + > +/* Update pfn_max and pfn_min according to the pfn of @page */ > +static inline void update_pfn_range(struct virtio_balloon *vb, > + struct page *page, > + unsigned long *pfn_min, > + unsigned long *pfn_max) > +{ > + unsigned long pfn = page_to_pfn(page); > + > + *pfn_min = min(pfn, *pfn_min); > + *pfn_max = max(pfn, *pfn_max); > +} > + > static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) > { > struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info; > unsigned num_allocated_pages; > + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG); > + unsigned long pfn_max = 0, pfn_min = ULONG_MAX; > > /* We can only do one array worth at a time. */ > - num = min(num, ARRAY_SIZE(vb->pfns)); > + if (!use_sg) > + num = min(num, ARRAY_SIZE(vb->pfns)); > > mutex_lock(&vb->balloon_lock); > for (vb->num_pfns = 0; vb->num_pfns < num; > @@ -162,7 +224,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) > msleep(200); > break; > } > - set_page_pfns(vb, vb->pfns + vb->num_pfns, page); > + if (use_sg) { > + update_pfn_range(vb, page, &pfn_min, &pfn_max); > + xb_set_bit(&vb->page_xb, page_to_pfn(page)); > + } else { > + set_page_pfns(vb, vb->pfns + vb->num_pfns, page); > + } > vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE; > if (!virtio_has_feature(vb->vdev, > VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) > @@ -171,8 +238,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) > > num_allocated_pages = vb->num_pfns; > /* Did we get any? */ > - if (vb->num_pfns != 0) > - tell_host(vb, vb->inflate_vq); > + if (vb->num_pfns != 0) { > + if (use_sg) > + tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max); > + else > + tell_host(vb, vb->inflate_vq); > + } > mutex_unlock(&vb->balloon_lock); > > return num_allocated_pages; > @@ -198,9 +269,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) > struct page *page; > struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info; > LIST_HEAD(pages); > + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG); > + unsigned long pfn_max = 0, pfn_min = ULONG_MAX; > > - /* We can only do one array worth at a time. 
*/ > - num = min(num, ARRAY_SIZE(vb->pfns)); > + /* Traditionally, we can only do one array worth at a time. */ > + if (!use_sg) > + num = min(num, ARRAY_SIZE(vb->pfns)); > > mutex_lock(&vb->balloon_lock); > /* We can't release more pages than taken */ > @@ -210,7 +284,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) > page = balloon_page_dequeue(vb_dev_info); > if (!page) > break; > - set_page_pfns(vb, vb->pfns + vb->num_pfns, page); > + if (use_sg) { > + update_pfn_range(vb, page, &pfn_min, &pfn_max); > + xb_set_bit(&vb->page_xb, page_to_pfn(page)); > + } else { > + set_page_pfns(vb, vb->pfns + vb->num_pfns, page); > + } > list_add(&page->lru, &pages); > vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE; > } > @@ -221,8 +300,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) > * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST); > * is true, we *have* to do it in this order > */ > - if (vb->num_pfns != 0) > - tell_host(vb, vb->deflate_vq); > + if (vb->num_pfns != 0) { > + if (use_sg) > + tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max); > + else > + tell_host(vb, vb->deflate_vq); > + } > release_pages_balloon(vb, &pages); > mutex_unlock(&vb->balloon_lock); > return num_freed_pages; > @@ -441,6 +524,18 @@ static int init_vqs(struct virtio_balloon *vb) > } > > #ifdef CONFIG_BALLOON_COMPACTION > + > +static void tell_host_one_page(struct virtio_balloon *vb, struct virtqueue *vq, > + struct page *page) > +{ > + unsigned int id = VIRTQUEUE_DESC_ID_INIT; > + u64 addr = page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT; > + > + virtqueue_add_chain_desc(vq, addr, PAGE_SIZE, &id, &id, 0); > + virtqueue_add_chain(vq, id, 0, NULL, (void *)addr, NULL); > + virtqueue_kick_async(vq, vb->acked); > +} > + > /* > * virtballoon_migratepage - perform the balloon page migration on behalf of > * a compation thread. 
(called under page lock) > @@ -464,6 +559,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info, > { > struct virtio_balloon *vb = container_of(vb_dev_info, > struct virtio_balloon, vb_dev_info); > + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG); > unsigned long flags; > > /* > @@ -485,16 +581,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info, > vb_dev_info->isolated_pages--; > __count_vm_event(BALLOON_MIGRATE); > spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags); > - vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; > - set_page_pfns(vb, vb->pfns, newpage); > - tell_host(vb, vb->inflate_vq); > - > + if (use_sg) { > + tell_host_one_page(vb, vb->inflate_vq, newpage); > + } else { > + vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; > + set_page_pfns(vb, vb->pfns, newpage); > + tell_host(vb, vb->inflate_vq); > + } > /* balloon's page migration 2nd step -- deflate "page" */ > balloon_page_delete(page); > - vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; > - set_page_pfns(vb, vb->pfns, page); > - tell_host(vb, vb->deflate_vq); > - > + if (use_sg) { > + tell_host_one_page(vb, vb->deflate_vq, page); > + } else { > + vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; > + set_page_pfns(vb, vb->pfns, page); > + tell_host(vb, vb->deflate_vq); > + } > mutex_unlock(&vb->balloon_lock); > > put_page(page); /* balloon reference */ > @@ -553,6 +655,9 @@ static int virtballoon_probe(struct virtio_device *vdev) > if (err) > goto out_free_vb; > > + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG)) > + xb_init(&vb->page_xb); > + > vb->nb.notifier_call = virtballoon_oom_notify; > vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY; > err = register_oom_notifier(&vb->nb); > @@ -618,6 +723,7 @@ static void virtballoon_remove(struct virtio_device *vdev) > cancel_work_sync(&vb->update_balloon_size_work); > cancel_work_sync(&vb->update_balloon_stats_work); > > + xb_empty(&vb->page_xb); > remove_common(vb); > #ifdef CONFIG_BALLOON_COMPACTION > if (vb->vb_dev_info.inode) > @@ -669,6 +775,7 @@ static unsigned int features[] = { > VIRTIO_BALLOON_F_MUST_TELL_HOST, > VIRTIO_BALLOON_F_STATS_VQ, > VIRTIO_BALLOON_F_DEFLATE_ON_OOM, > + VIRTIO_BALLOON_F_SG, > }; > > static struct virtio_driver virtio_balloon_driver = { > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > index 5e1b548..b9d7e10 100644 > --- a/drivers/virtio/virtio_ring.c > +++ b/drivers/virtio/virtio_ring.c > @@ -269,7 +269,7 @@ static inline int virtqueue_add(struct virtqueue *_vq, > struct vring_virtqueue *vq = to_vvq(_vq); > struct scatterlist *sg; > struct vring_desc *desc; > - unsigned int i, n, avail, descs_used, uninitialized_var(prev), err_idx; > + unsigned int i, n, descs_used, uninitialized_var(prev), err_id; > int head; > bool indirect; > > @@ -387,10 +387,68 @@ static inline int virtqueue_add(struct virtqueue *_vq, > else > vq->free_head = i; > > - /* Store token and indirect buffer state. */ > + END_USE(vq); > + > + return virtqueue_add_chain(_vq, head, indirect, desc, data, ctx); > + > +unmap_release: > + err_id = i; > + i = head; > + > + for (n = 0; n < total_sg; n++) { > + if (i == err_id) > + break; > + vring_unmap_one(vq, &desc[i]); > + i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next); > + } > + > + vq->vq.num_free += total_sg; > + > + if (indirect) > + kfree(desc); > + > + END_USE(vq); > + return -EIO; > +} > + > +/** > + * virtqueue_add_chain - expose a chain of buffers to the other end > + * @_vq: the struct virtqueue we're talking about. 
> + * @head: desc id of the chain head. > + * @indirect: set if the chain of descs are indrect descs. > + * @indir_desc: the first indirect desc. > + * @data: the token identifying the chain. > + * @ctx: extra context for the token. > + * > + * Caller must ensure we don't call this with other virtqueue operations > + * at the same time (except where noted). > + * > + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO). > + */ > +int virtqueue_add_chain(struct virtqueue *_vq, > + unsigned int head, > + bool indirect, > + struct vring_desc *indir_desc, > + void *data, > + void *ctx) > +{ > + struct vring_virtqueue *vq = to_vvq(_vq); > + unsigned int avail; > + > + /* The desc chain is empty. */ > + if (head == VIRTQUEUE_DESC_ID_INIT) > + return 0; > + > + START_USE(vq); > + > + if (unlikely(vq->broken)) { > + END_USE(vq); > + return -EIO; > + } > + > vq->desc_state[head].data = data; > if (indirect) > - vq->desc_state[head].indir_desc = desc; > + vq->desc_state[head].indir_desc = indir_desc; > if (ctx) > vq->desc_state[head].indir_desc = ctx; > > @@ -415,26 +473,87 @@ static inline int virtqueue_add(struct virtqueue *_vq, > virtqueue_kick(_vq); > > return 0; > +} > +EXPORT_SYMBOL_GPL(virtqueue_add_chain); > > -unmap_release: > - err_idx = i; > - i = head; > +/** > + * virtqueue_add_chain_desc - add a buffer to a chain using a vring desc > + * @vq: the struct virtqueue we're talking about. > + * @addr: address of the buffer to add. > + * @len: length of the buffer. > + * @head_id: desc id of the chain head. > + * @prev_id: desc id of the previous buffer. > + * @in: set if the buffer is for the device to write. > + * > + * Caller must ensure we don't call this with other virtqueue operations > + * at the same time (except where noted). > + * > + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO). > + */ > +int virtqueue_add_chain_desc(struct virtqueue *_vq, > + uint64_t addr, > + uint32_t len, > + unsigned int *head_id, > + unsigned int *prev_id, > + bool in) > +{ > + struct vring_virtqueue *vq = to_vvq(_vq); > + struct vring_desc *desc = vq->vring.desc; > + uint16_t flags = in ? VRING_DESC_F_WRITE : 0; > + unsigned int i; > > - for (n = 0; n < total_sg; n++) { > - if (i == err_idx) > - break; > - vring_unmap_one(vq, &desc[i]); > - i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next); > + /* Sanity check */ > + if (!_vq || !head_id || !prev_id) > + return -EINVAL; > +retry: > + START_USE(vq); > + if (unlikely(vq->broken)) { > + END_USE(vq); > + return -EIO; > } > > - vq->vq.num_free += total_sg; > + if (vq->vq.num_free < 1) { > + /* > + * If there is no desc avail in the vq, so kick what is > + * already added, and re-start to build a new chain for > + * the passed sg. 
> + */ > + if (likely(*head_id != VIRTQUEUE_DESC_ID_INIT)) { > + END_USE(vq); > + virtqueue_add_chain(_vq, *head_id, 0, NULL, vq, NULL); > + virtqueue_kick_sync(_vq); > + *head_id = VIRTQUEUE_DESC_ID_INIT; > + *prev_id = VIRTQUEUE_DESC_ID_INIT; > + goto retry; > + } else { > + END_USE(vq); > + return -ENOSPC; > + } > + } > > - if (indirect) > - kfree(desc); > + i = vq->free_head; > + flags &= ~VRING_DESC_F_NEXT; > + desc[i].flags = cpu_to_virtio16(_vq->vdev, flags); > + desc[i].addr = cpu_to_virtio64(_vq->vdev, addr); > + desc[i].len = cpu_to_virtio32(_vq->vdev, len); > + > + /* Add the desc to the end of the chain */ > + if (*prev_id != VIRTQUEUE_DESC_ID_INIT) { > + desc[*prev_id].next = cpu_to_virtio16(_vq->vdev, i); > + desc[*prev_id].flags |= cpu_to_virtio16(_vq->vdev, > + VRING_DESC_F_NEXT); > + } > + *prev_id = i; > + if (*head_id == VIRTQUEUE_DESC_ID_INIT) > + *head_id = *prev_id; > > + vq->vq.num_free--; > + vq->free_head = virtio16_to_cpu(_vq->vdev, desc[i].next); > END_USE(vq); > - return -EIO; > + > + return 0; > } > +EXPORT_SYMBOL_GPL(virtqueue_add_chain_desc); > > /** > * virtqueue_add_sgs - expose buffers to other end > @@ -627,6 +746,56 @@ bool virtqueue_kick(struct virtqueue *vq) > } > EXPORT_SYMBOL_GPL(virtqueue_kick); > > +/** > + * virtqueue_kick_sync - update after add_buf and busy wait till update is done > + * @vq: the struct virtqueue > + * > + * After one or more virtqueue_add_* calls, invoke this to kick > + * the other side. Busy wait till the other side is done with the update. > + * > + * Caller must ensure we don't call this with other virtqueue > + * operations at the same time (except where noted). > + * > + * Returns false if kick failed, otherwise true. > + */ > +bool virtqueue_kick_sync(struct virtqueue *vq) > +{ > + u32 len; > + > + if (likely(virtqueue_kick(vq))) { > + while (!virtqueue_get_buf(vq, &len) && > + !virtqueue_is_broken(vq)) > + cpu_relax(); > + return true; > + } > + return false; > +} > +EXPORT_SYMBOL_GPL(virtqueue_kick_sync); > + > +/** > + * virtqueue_kick_async - update after add_buf and blocking till update is done > + * @vq: the struct virtqueue > + * > + * After one or more virtqueue_add_* calls, invoke this to kick > + * the other side. Blocking till the other side is done with the update. > + * > + * Caller must ensure we don't call this with other virtqueue > + * operations at the same time (except where noted). > + * > + * Returns false if kick failed, otherwise true. > + */ > +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq) > +{ > + u32 len; > + > + if (likely(virtqueue_kick(vq))) { > + wait_event(wq, virtqueue_get_buf(vq, &len)); > + return true; > + } > + return false; > +} > +EXPORT_SYMBOL_GPL(virtqueue_kick_async); > + This happens to 1. drop the buf 2. not do the right thing if more than one is in flight which means this API isn't all that useful. Even balloon might benefit from keeping multiple bufs in flight down the road. 
> static void detach_buf(struct vring_virtqueue *vq, unsigned int head, > void **ctx) > { > diff --git a/include/linux/virtio.h b/include/linux/virtio.h > index 28b0e96..9f27101 100644 > --- a/include/linux/virtio.h > +++ b/include/linux/virtio.h > @@ -57,8 +57,28 @@ int virtqueue_add_sgs(struct virtqueue *vq, > void *data, > gfp_t gfp); > > +/* A desc with this init id is treated as an invalid desc */ > +#define VIRTQUEUE_DESC_ID_INIT UINT_MAX > +int virtqueue_add_chain_desc(struct virtqueue *_vq, > + uint64_t addr, > + uint32_t len, > + unsigned int *head_id, > + unsigned int *prev_id, > + bool in); > + > +int virtqueue_add_chain(struct virtqueue *_vq, > + unsigned int head, > + bool indirect, > + struct vring_desc *indirect_desc, > + void *data, > + void *ctx); > + > bool virtqueue_kick(struct virtqueue *vq); > > +bool virtqueue_kick_sync(struct virtqueue *vq); > + > +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq); > + > bool virtqueue_kick_prepare(struct virtqueue *vq); > > bool virtqueue_notify(struct virtqueue *vq); > diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h > index 343d7dd..37780a7 100644 > --- a/include/uapi/linux/virtio_balloon.h > +++ b/include/uapi/linux/virtio_balloon.h > @@ -34,6 +34,7 @@ > #define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */ > #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */ > #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */ > +#define VIRTIO_BALLOON_F_SG 3 /* Use sg instead of PFN lists */ > > /* Size of a PFN in the balloon interface. */ > #define VIRTIO_BALLOON_PFN_SHIFT 12 > -- > 2.7.4
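Two side notes on the review above. First, the clamp suggested for tell_host_sgs() needs parentheses and PAGE_SHIFT: in C, + binds tighter than >>, so the expression as written parses as (sg_pfn_start + UINT_MAX) >> PAGE_SIZE, and the shift count should be PAGE_SHIFT (12), not PAGE_SIZE (4096). A corrected sketch:

/* Cap the chunk so that sg_size = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT
 * still fits in a 32-bit integer; note the parentheses around the shift.
 */
sg_pfn_end = min(sg_pfn_end, sg_pfn_start + (UINT_MAX >> PAGE_SHIFT));

Second, on the point that virtqueue_kick_async() drops the buffer and mishandles multiple in-flight chains: a completion path that drains every used buffer and completes each one by its token would tolerate any number of outstanding chains. A hedged sketch, where balloon_buf_done() is a hypothetical per-buffer completion:

static void balloon_drain_acks(struct virtqueue *vq)
{
	unsigned int len;
	void *token;

	/* Complete every finished buffer rather than waiting for one
	 * specific buffer, so no token is dropped when several chains
	 * are in flight at once.
	 */
	while ((token = virtqueue_get_buf(vq, &len)))
		balloon_buf_done(token);
}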
Hi Wei, [auto build test WARNING on linus/master] [also build test WARNING on v4.12 next-20170712] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Wei-Wang/Virtio-balloon-Enhancement/20170713-074956 config: i386-randconfig-x071-07121639 (attached as .config) compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901 reproduce: # save the attached .config to linux build tree make ARCH=i386 All warnings (new ones prefixed by >>): drivers//virtio/virtio_balloon.c: In function 'tell_host_one_page': >> drivers//virtio/virtio_balloon.c:535:39: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] virtqueue_add_chain(vq, id, 0, NULL, (void *)addr, NULL); ^ vim +535 drivers//virtio/virtio_balloon.c 527 528 static void tell_host_one_page(struct virtio_balloon *vb, struct virtqueue *vq, 529 struct page *page) 530 { 531 unsigned int id = VIRTQUEUE_DESC_ID_INIT; 532 u64 addr = page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT; 533 534 virtqueue_add_chain_desc(vq, addr, PAGE_SIZE, &id, &id, 0); > 535 virtqueue_add_chain(vq, id, 0, NULL, (void *)addr, NULL); 536 virtqueue_kick_async(vq, vb->acked); 537 } 538 --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
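The warning is the usual 64-bit-value-into-32-bit-pointer complaint on i386. A conventional way to silence it is to cast through unsigned long, e.g. (a sketch; note this still truncates addresses above 4GB on 32-bit PAE, so carrying the address in a dedicated token structure would be the more robust fix):

virtqueue_add_chain(vq, id, 0, NULL, (void *)(unsigned long)addr, NULL);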
Hi Wei, [auto build test ERROR on linus/master] [also build test ERROR on v4.12 next-20170712] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Wei-Wang/Virtio-balloon-Enhancement/20170713-074956 config: powerpc-defconfig (attached as .config) compiler: powerpc64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705 reproduce: wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=powerpc All errors (new ones prefixed by >>): >> ERROR: ".xb_set_bit" [drivers/virtio/virtio_balloon.ko] undefined! >> ERROR: ".xb_zero" [drivers/virtio/virtio_balloon.ko] undefined! >> ERROR: ".xb_find_next_bit" [drivers/virtio/virtio_balloon.ko] undefined! --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
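These link errors mean the xbitmap helpers (added earlier in this series) are not exported for module builds, so virtio_balloon.ko cannot link against them. The likely fix, assuming the helpers live in a lib/xbitmap.c as the series suggests, is to export them:

/* lib/xbitmap.c (assumed location): export what virtio_balloon.ko uses */
EXPORT_SYMBOL(xb_set_bit);
EXPORT_SYMBOL(xb_zero);
EXPORT_SYMBOL(xb_find_next_bit);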
On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote: > > So the way I see it, there are several issues: > > - internal wait - forces multiple APIs like kick/kick_sync > note how kick_sync can fail but your code never checks return code > - need to re-write the last descriptor - might not work > for alternative layouts which always expose descriptors > immediately Probably it wasn't clear. Please let me explain the two functions here: 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..): grabs a desc from the vq and inserts it to the chain tail (which is indexed by prev_id, probably better to call it tail_id). Then, the new added desc becomes the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc when it's added to the chain, and set when another desc comes to follow later. 2) virtqueue_add_chain(vq, head_id,..): expose the chain to the other end. So, if people want to add a desc and immediately expose it to the other end, i.e. build a single desc chain, they can just add and expose: virtqueue_add_chain_desc(..); virtqueue_add_chain(..,head_id); Would you see any issues here? > - some kind of iterator type would be nicer instead of > maintaining head/prev explicitly Why would we need to iterate the chain? I think it would be simpler to use a wrapper struct: struct virtqueue_desc_chain { unsigned int head; // head desc id of the chain unsigned int tail; // tail desc id of the chain } The new desc will be put to desc[tail].next, and we don't need to walk from the head desc[head].next when inserting a new desc to the chain, right? > > As for the use, it would be better to do > > if (!add_next(vq, ...)) { > add_last(vq, ...) > kick > wait > } "!add_next(vq, ...)" means that the vq is full? If so, what would add_last() do then? > Using VIRTQUEUE_DESC_ID_INIT seems to avoid a branch in the driver, but > in fact it merely puts the branch in the virtio code. > Actually it wasn't intended to improve performance. It is used to indicate the "init" state of the chain. So, when virtqueue_add_chain_desc(, head_id,..) finds head id=INIT, it will assign the grabbed desc id to &head_id. In some sense, it is equivalent to add_first(). Do you have a different opinion here? Best, Wei
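A minimal sketch of the wrapper Wei proposes, layered over the patch's virtqueue_add_chain_desc(); the chain_init()/chain_append() helpers are illustrative, not part of the patch:

struct virtqueue_desc_chain {
	unsigned int head;	/* head desc id of the chain */
	unsigned int tail;	/* tail desc id of the chain */
};

static void chain_init(struct virtqueue_desc_chain *c)
{
	c->head = VIRTQUEUE_DESC_ID_INIT;
	c->tail = VIRTQUEUE_DESC_ID_INIT;
}

/* Append one buffer: the new desc is linked at desc[c->tail].next,
 * so no walk from c->head is needed.
 */
static int chain_append(struct virtqueue *vq, struct virtqueue_desc_chain *c,
			uint64_t addr, uint32_t len, bool in)
{
	return virtqueue_add_chain_desc(vq, addr, len, &c->head, &c->tail, in);
}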
On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote: > On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote: > > > > So the way I see it, there are several issues: > > > > - internal wait - forces multiple APIs like kick/kick_sync > > note how kick_sync can fail but your code never checks return code > > - need to re-write the last descriptor - might not work > > for alternative layouts which always expose descriptors > > immediately > > Probably it wasn't clear. Please let me explain the two functions here: > > 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..): > grabs a desc from the vq and inserts it to the chain tail (which is indexed > by > prev_id, probably better to call it tail_id). Then, the new added desc > becomes > the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc > when it's > added to the chain, and set when another desc comes to follow later. And this only works if there are multiple rings like avail + descriptor ring. It won't work e.g. with the proposed new layout where writing out a descriptor exposes it immediately. > 2) virtqueue_add_chain(vq, head_id,..): expose the chain to the other end. > > So, if people want to add a desc and immediately expose it to the other end, > i.e. build a single desc chain, they can just add and expose: > > virtqueue_add_chain_desc(..); > virtqueue_add_chain(..,head_id); > > Would you see any issues here? The way the new APIs poll used ring internally. > > > - some kind of iterator type would be nicer instead of > > maintaining head/prev explicitly > > Why would we need to iterate the chain? In your patches prev/tail are iterators - they keep track of where you are in the chain. > I think it would be simpler to use > a wrapper struct: > > struct virtqueue_desc_chain { > unsigned int head; // head desc id of the chain > unsigned int tail; // tail desc id of the chain > } > > The new desc will be put to desc[tail].next, and we don't need to walk > from the head desc[head].next when inserting a new desc to the chain, right? > > > > > > As for the use, it would be better to do > > > > if (!add_next(vq, ...)) { > > add_last(vq, ...) > > kick > > wait > > } > > "!add_next(vq, ...)" means that the vq is full? No - it means there's only 1 entry left for the last descriptor. > If so, what would add_last() > do then? > > > Using VIRTQUEUE_DESC_ID_INIT seems to avoid a branch in the driver, but > > in fact it merely puts the branch in the virtio code. > > > > Actually it wasn't intended to improve performance. It is used to indicate > the "init" state > of the chain. So, when virtqueue_add_chain_desc(, head_id,..) finds head > id=INIT, it will > assign the grabbed desc id to &head_id. In some sense, it is equivalent to > add_first(). > > Do you have a different opinion here? > > Best, > Wei > It is but let's make it explicit here - an API function is better than a special value.
On 07/14/2017 04:19 AM, Michael S. Tsirkin wrote: > On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote: >> On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote: >>> So the way I see it, there are several issues: >>> >>> - internal wait - forces multiple APIs like kick/kick_sync >>> note how kick_sync can fail but your code never checks return code >>> - need to re-write the last descriptor - might not work >>> for alternative layouts which always expose descriptors >>> immediately >> Probably it wasn't clear. Please let me explain the two functions here: >> >> 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..): >> grabs a desc from the vq and inserts it to the chain tail (which is indexed >> by >> prev_id, probably better to call it tail_id). Then, the new added desc >> becomes >> the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc >> when it's >> added to the chain, and set when another desc comes to follow later. > And this only works if there are multiple rings like > avail + descriptor ring. > It won't work e.g. with the proposed new layout where > writing out a descriptor exposes it immediately. I think it can support the 1.1 proposal, too. But before getting into that, I think we first need to deep dive into the implementation and usage of _first/next/last. The usage would need to lock the vq from the first to the end (otherwise, the returned info about the number of available desc in the vq, i.e. num_free, would be invalid): lock(vq); add_first(); add_next(); add_last(); unlock(vq); However, I think the case isn't this simple, since we need to check more things after each add_xx() step. For example, if only one entry is available at the time we start to use the vq, that is, num_free is 0 after add_first(), we wouldn't be able to add_next and add_last. So, it would work like this: start: ...get free page block.. lock(vq) retry: ret = add_first(..,&num_free,); if(ret == -ENOSPC) { goto retry; } else if (!num_free) { add_chain_head(); unlock(vq); kick & wait; goto start; } next_one: ...get free page block.. add_next(..,&num_free,); if (!num_free) { add_chain_head(); unlock(vq); kick & wait; goto start; } if (num_free == 1) { ...get free page block.. add_last(..); unlock(vq); kick & wait; goto start; } else { goto next_one; } The above seems unnecessary to me to have three different APIs. That's the reason to combine them into one virtqueue_add_chain_desc(). -- or, do you have a different thought about using the three APIs? Implementation Reference: struct desc_iterator { unsigned int head; unsigned int tail; }; add_first(*vq, *desc_iterator, *num_free, ..) { if (vq->vq.num_free < 1) return -ENOSPC; get_desc(&desc_id); desc[desc_id].flag &= ~_F_NEXT; desc_iterator->head = desc_id desc_iterator->tail = desc_iterator->head; *num_free = vq->vq.num_free; } add_next(vq, desc_iterator, *num_free,..) { get_desc(&desc_id); desc[desc_id].flag &= ~_F_NEXT; desc[desc_iterator.tail].next = desc_id; desc[desc_iterator->tail].flag |= _F_NEXT; desc_iterator->tail = desc_id; *num_free = vq->vq.num_free; } add_last(vq, desc_iterator,..) { get_desc(&desc_id); desc[desc_id].flag &= ~_F_NEXT; desc[desc_iterator.tail].next = desc_id; desc_iterator->tail = desc_id; add_chain_head(); // put the desc_iterator.head to the ring } Best, Wei
On Fri, Jul 14, 2017 at 03:12:43PM +0800, Wei Wang wrote: > On 07/14/2017 04:19 AM, Michael S. Tsirkin wrote: > > On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote: > > > On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote: > > > > So the way I see it, there are several issues: > > > > > > > > - internal wait - forces multiple APIs like kick/kick_sync > > > > note how kick_sync can fail but your code never checks return code > > > > - need to re-write the last descriptor - might not work > > > > for alternative layouts which always expose descriptors > > > > immediately > > > Probably it wasn't clear. Please let me explain the two functions here: > > > > > > 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..): > > > grabs a desc from the vq and inserts it to the chain tail (which is indexed > > > by > > > prev_id, probably better to call it tail_id). Then, the new added desc > > > becomes > > > the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc > > > when it's > > > added to the chain, and set when another desc comes to follow later. > > And this only works if there are multiple rings like > > avail + descriptor ring. > > It won't work e.g. with the proposed new layout where > > writing out a descriptor exposes it immediately. > > I think it can support the 1.1 proposal, too. But before getting > into that, I think we first need to deep dive into the implementation > and usage of _first/next/last. The usage would need to lock the vq > from the first to the end (otherwise, the returned info about the number > of available desc in the vq, i.e. num_free, would be invalid): > > lock(vq); > add_first(); > add_next(); > add_last(); > unlock(vq); > > However, I think the case isn't this simple, since we need to check more > things > after each add_xx() step. For example, if only one entry is available at the > time > we start to use the vq, that is, num_free is 0 after add_first(), we > wouldn't be > able to add_next and add_last. So, it would work like this: > > start: > ...get free page block.. > lock(vq) > retry: > ret = add_first(..,&num_free,); > if(ret == -ENOSPC) { > goto retry; > } else if (!num_free) { > add_chain_head(); > unlock(vq); > kick & wait; > goto start; > } > next_one: > ...get free page block.. > add_next(..,&num_free,); > if (!num_free) { > add_chain_head(); > unlock(vq); > kick & wait; > goto start; > } if (num_free == 1) { > ...get free page block.. > add_last(..); > unlock(vq); > kick & wait; > goto start; > } else { > goto next_one; > } > > The above seems unnecessary to me to have three different APIs. > That's the reason to combine them into one virtqueue_add_chain_desc(). > > -- or, do you have a different thought about using the three APIs? > > > Implementation Reference: > > struct desc_iterator { > unsigned int head; > unsigned int tail; > }; > > add_first(*vq, *desc_iterator, *num_free, ..) > { > if (vq->vq.num_free < 1) > return -ENOSPC; > get_desc(&desc_id); > desc[desc_id].flag &= ~_F_NEXT; > desc_iterator->head = desc_id > desc_iterator->tail = desc_iterator->head; > *num_free = vq->vq.num_free; > } > > add_next(vq, desc_iterator, *num_free,..) > { > get_desc(&desc_id); > desc[desc_id].flag &= ~_F_NEXT; > desc[desc_iterator.tail].next = desc_id; > desc[desc_iterator->tail].flag |= _F_NEXT; > desc_iterator->tail = desc_id; > *num_free = vq->vq.num_free; > } > > add_last(vq, desc_iterator,..) 
> { > get_desc(&desc_id); > desc[desc_id].flag &= ~_F_NEXT; > desc[desc_iterator.tail].next = desc_id; > desc_iterator->tail = desc_id; > > add_chain_head(); // put the desc_iterator.head to the ring > } > > > Best, > Wei OK I thought this over. While we might need these new APIs in the future, I think that at the moment, there's a way to implement this feature that is significantly simpler. Just add each s/g as a separate input buffer. This needs zero new APIs. I know that follow-up patches need to add a header in front so you might be thinking: how am I going to add this header? The answer is quite simple - add it as a separate out header. Host will be able to distinguish between header and pages by looking at the direction, and - should we want to add IN data to header - additionally size (<4K => header). We will be able to look at extended APIs separately down the road.
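A rough sketch of the zero-new-API scheme described above, using the existing virtqueue_add_outbuf()/virtqueue_add_inbuf() with one-element scatterlists. The balloon_req_hdr struct and report_chunk() are invented for illustration; the direction convention (header OUT, pages IN) follows the description above:

struct balloon_req_hdr {	/* hypothetical out-header */
	__virtio32 cmd;
	__virtio32 id;
};

static int report_chunk(struct virtqueue *vq, struct balloon_req_hdr *hdr,
			void *base, size_t size)
{
	struct scatterlist sg;
	int err;

	/* Driver-written header (OUT); the host tells it apart from the
	 * pages by direction and, if need be, by its sub-4K size.
	 */
	sg_init_one(&sg, hdr, sizeof(*hdr));
	err = virtqueue_add_outbuf(vq, &sg, 1, hdr, GFP_KERNEL);
	if (err)
		return err;

	/* Each page range is its own device-writable (IN) buffer. */
	sg_init_one(&sg, base, size);
	return virtqueue_add_inbuf(vq, &sg, 1, base, GFP_KERNEL);
}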
On 07/23/2017 09:45 AM, Michael S. Tsirkin wrote: > On Fri, Jul 14, 2017 at 03:12:43PM +0800, Wei Wang wrote: >> On 07/14/2017 04:19 AM, Michael S. Tsirkin wrote: >>> On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote: >>>> On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote: >>>>> So the way I see it, there are several issues: >>>>> >>>>> - internal wait - forces multiple APIs like kick/kick_sync >>>>> note how kick_sync can fail but your code never checks return code >>>>> - need to re-write the last descriptor - might not work >>>>> for alternative layouts which always expose descriptors >>>>> immediately >>>> Probably it wasn't clear. Please let me explain the two functions here: >>>> >>>> 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..): >>>> grabs a desc from the vq and inserts it to the chain tail (which is indexed >>>> by >>>> prev_id, probably better to call it tail_id). Then, the new added desc >>>> becomes >>>> the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc >>>> when it's >>>> added to the chain, and set when another desc comes to follow later. >>> And this only works if there are multiple rings like >>> avail + descriptor ring. >>> It won't work e.g. with the proposed new layout where >>> writing out a descriptor exposes it immediately. >> I think it can support the 1.1 proposal, too. But before getting >> into that, I think we first need to deep dive into the implementation >> and usage of _first/next/last. The usage would need to lock the vq >> from the first to the end (otherwise, the returned info about the number >> of available desc in the vq, i.e. num_free, would be invalid): >> >> lock(vq); >> add_first(); >> add_next(); >> add_last(); >> unlock(vq); >> >> However, I think the case isn't this simple, since we need to check more >> things >> after each add_xx() step. For example, if only one entry is available at the >> time >> we start to use the vq, that is, num_free is 0 after add_first(), we >> wouldn't be >> able to add_next and add_last. So, it would work like this: >> >> start: >> ...get free page block.. >> lock(vq) >> retry: >> ret = add_first(..,&num_free,); >> if(ret == -ENOSPC) { >> goto retry; >> } else if (!num_free) { >> add_chain_head(); >> unlock(vq); >> kick & wait; >> goto start; >> } >> next_one: >> ...get free page block.. >> add_next(..,&num_free,); >> if (!num_free) { >> add_chain_head(); >> unlock(vq); >> kick & wait; >> goto start; >> } if (num_free == 1) { >> ...get free page block.. >> add_last(..); >> unlock(vq); >> kick & wait; >> goto start; >> } else { >> goto next_one; >> } >> >> The above seems unnecessary to me to have three different APIs. >> That's the reason to combine them into one virtqueue_add_chain_desc(). >> >> -- or, do you have a different thought about using the three APIs? >> >> >> Implementation Reference: >> >> struct desc_iterator { >> unsigned int head; >> unsigned int tail; >> }; >> >> add_first(*vq, *desc_iterator, *num_free, ..) >> { >> if (vq->vq.num_free < 1) >> return -ENOSPC; >> get_desc(&desc_id); >> desc[desc_id].flag &= ~_F_NEXT; >> desc_iterator->head = desc_id >> desc_iterator->tail = desc_iterator->head; >> *num_free = vq->vq.num_free; >> } >> >> add_next(vq, desc_iterator, *num_free,..) >> { >> get_desc(&desc_id); >> desc[desc_id].flag &= ~_F_NEXT; >> desc[desc_iterator.tail].next = desc_id; >> desc[desc_iterator->tail].flag |= _F_NEXT; >> desc_iterator->tail = desc_id; >> *num_free = vq->vq.num_free; >> } >> >> add_last(vq, desc_iterator,..) 
>> { >> get_desc(&desc_id); >> desc[desc_id].flag &= ~_F_NEXT; >> desc[desc_iterator.tail].next = desc_id; >> desc_iterator->tail = desc_id; >> >> add_chain_head(); // put the desc_iterator.head to the ring >> } >> >> >> Best, >> Wei > OK I thought this over. While we might need these new APIs in > the future, I think that at the moment, there's a way to implement > this feature that is significantly simpler. Just add each s/g > as a separate input buffer. Should it be an output buffer? I think output means from the driver to device (i.e. DMA_TO_DEVICE). > > This needs zero new APIs. > > I know that follow-up patches need to add a header in front > so you might be thinking: how am I going to add this > header? The answer is quite simple - add it as a separate > out header. > > Host will be able to distinguish between header and pages > by looking at the direction, and - should we want to add > IN data to header - additionally size (<4K => header). I think this works fine when the cmdq is only used for reporting the unused pages. It would be an issue if there are other usages (e.g. report memory statistics) interleaving. I think one solution would be to lock the cmdq until a cmd usage is done ((e.g. all the unused pages are reported) ) - in this case, the periodically updated guest memory statistics may be delayed for a while occasionally when live migration starts. Would this be acceptable? If not, probably we can have the cmdq for one usage only. Best, Wei
On Wed, Jul 26, 2017 at 11:48:41AM +0800, Wei Wang wrote: > On 07/23/2017 09:45 AM, Michael S. Tsirkin wrote: > > On Fri, Jul 14, 2017 at 03:12:43PM +0800, Wei Wang wrote: > > > On 07/14/2017 04:19 AM, Michael S. Tsirkin wrote: > > > > On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote: > > > > > On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote: > > > > > > So the way I see it, there are several issues: > > > > > > > > > > > > - internal wait - forces multiple APIs like kick/kick_sync > > > > > > note how kick_sync can fail but your code never checks return code > > > > > > - need to re-write the last descriptor - might not work > > > > > > for alternative layouts which always expose descriptors > > > > > > immediately > > > > > Probably it wasn't clear. Please let me explain the two functions here: > > > > > > > > > > 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..): > > > > > grabs a desc from the vq and inserts it to the chain tail (which is indexed > > > > > by > > > > > prev_id, probably better to call it tail_id). Then, the new added desc > > > > > becomes > > > > > the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc > > > > > when it's > > > > > added to the chain, and set when another desc comes to follow later. > > > > And this only works if there are multiple rings like > > > > avail + descriptor ring. > > > > It won't work e.g. with the proposed new layout where > > > > writing out a descriptor exposes it immediately. > > > I think it can support the 1.1 proposal, too. But before getting > > > into that, I think we first need to deep dive into the implementation > > > and usage of _first/next/last. The usage would need to lock the vq > > > from the first to the end (otherwise, the returned info about the number > > > of available desc in the vq, i.e. num_free, would be invalid): > > > > > > lock(vq); > > > add_first(); > > > add_next(); > > > add_last(); > > > unlock(vq); > > > > > > However, I think the case isn't this simple, since we need to check more > > > things > > > after each add_xx() step. For example, if only one entry is available at the > > > time > > > we start to use the vq, that is, num_free is 0 after add_first(), we > > > wouldn't be > > > able to add_next and add_last. So, it would work like this: > > > > > > start: > > > ...get free page block.. > > > lock(vq) > > > retry: > > > ret = add_first(..,&num_free,); > > > if(ret == -ENOSPC) { > > > goto retry; > > > } else if (!num_free) { > > > add_chain_head(); > > > unlock(vq); > > > kick & wait; > > > goto start; > > > } > > > next_one: > > > ...get free page block.. > > > add_next(..,&num_free,); > > > if (!num_free) { > > > add_chain_head(); > > > unlock(vq); > > > kick & wait; > > > goto start; > > > } if (num_free == 1) { > > > ...get free page block.. > > > add_last(..); > > > unlock(vq); > > > kick & wait; > > > goto start; > > > } else { > > > goto next_one; > > > } > > > > > > The above seems unnecessary to me to have three different APIs. > > > That's the reason to combine them into one virtqueue_add_chain_desc(). > > > > > > -- or, do you have a different thought about using the three APIs? > > > > > > > > > Implementation Reference: > > > > > > struct desc_iterator { > > > unsigned int head; > > > unsigned int tail; > > > }; > > > > > > add_first(*vq, *desc_iterator, *num_free, ..) 
> > > { > > > if (vq->vq.num_free < 1) > > > return -ENOSPC; > > > get_desc(&desc_id); > > > desc[desc_id].flag &= ~_F_NEXT; > > > desc_iterator->head = desc_id > > > desc_iterator->tail = desc_iterator->head; > > > *num_free = vq->vq.num_free; > > > } > > > > > > add_next(vq, desc_iterator, *num_free,..) > > > { > > > get_desc(&desc_id); > > > desc[desc_id].flag &= ~_F_NEXT; > > > desc[desc_iterator.tail].next = desc_id; > > > desc[desc_iterator->tail].flag |= _F_NEXT; > > > desc_iterator->tail = desc_id; > > > *num_free = vq->vq.num_free; > > > } > > > > > > add_last(vq, desc_iterator,..) > > > { > > > get_desc(&desc_id); > > > desc[desc_id].flag &= ~_F_NEXT; > > > desc[desc_iterator.tail].next = desc_id; > > > desc_iterator->tail = desc_id; > > > > > > add_chain_head(); // put the desc_iterator.head to the ring > > > } > > > > > > > > > Best, > > > Wei > > OK I thought this over. While we might need these new APIs in > > the future, I think that at the moment, there's a way to implement > > this feature that is significantly simpler. Just add each s/g > > as a separate input buffer. > > > Should it be an output buffer? Hypervisor overwrites these pages with zeroes. Therefore it is writeable by device: DMA_FROM_DEVICE. > I think output means from the > driver to device (i.e. DMA_TO_DEVICE). This part is correct I believe. > > > > This needs zero new APIs. > > > > I know that follow-up patches need to add a header in front > > so you might be thinking: how am I going to add this > > header? The answer is quite simple - add it as a separate > > out header. > > > > Host will be able to distinguish between header and pages > > by looking at the direction, and - should we want to add > > IN data to header - additionally size (<4K => header). > > > I think this works fine when the cmdq is only used for > reporting the unused pages. > It would be an issue > if there are other usages (e.g. report memory statistics) > interleaving. I think one solution would be to lock the cmdq until > a cmd usage is done ((e.g. all the unused pages are reported) ) - > in this case, the periodically updated guest memory statistics > may be delayed for a while occasionally when live migration starts. > Would this be acceptable? If not, probably we can have the cmdq > for one usage only. > > > Best, > Wei OK I see, I think the issue is that reporting free pages was structured like stats. Let's split it - send pages on e.g. free_vq, get commands on vq shared with stats.
On 07/27/2017 01:02 AM, Michael S. Tsirkin wrote: > On Wed, Jul 26, 2017 at 11:48:41AM +0800, Wei Wang wrote: >> On 07/23/2017 09:45 AM, Michael S. Tsirkin wrote: >>> On Fri, Jul 14, 2017 at 03:12:43PM +0800, Wei Wang wrote: >>>> On 07/14/2017 04:19 AM, Michael S. Tsirkin wrote: >>>>> On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote: >>>>>> On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote: >>>>>>> So the way I see it, there are several issues: >>>>>>> >>>>>>> - internal wait - forces multiple APIs like kick/kick_sync >>>>>>> note how kick_sync can fail but your code never checks return code >>>>>>> - need to re-write the last descriptor - might not work >>>>>>> for alternative layouts which always expose descriptors >>>>>>> immediately >>>>>> Probably it wasn't clear. Please let me explain the two functions here: >>>>>> >>>>>> 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..): >>>>>> grabs a desc from the vq and inserts it to the chain tail (which is indexed >>>>>> by >>>>>> prev_id, probably better to call it tail_id). Then, the new added desc >>>>>> becomes >>>>>> the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc >>>>>> when it's >>>>>> added to the chain, and set when another desc comes to follow later. >>>>> And this only works if there are multiple rings like >>>>> avail + descriptor ring. >>>>> It won't work e.g. with the proposed new layout where >>>>> writing out a descriptor exposes it immediately. >>>> I think it can support the 1.1 proposal, too. But before getting >>>> into that, I think we first need to deep dive into the implementation >>>> and usage of _first/next/last. The usage would need to lock the vq >>>> from the first to the end (otherwise, the returned info about the number >>>> of available desc in the vq, i.e. num_free, would be invalid): >>>> >>>> lock(vq); >>>> add_first(); >>>> add_next(); >>>> add_last(); >>>> unlock(vq); >>>> >>>> However, I think the case isn't this simple, since we need to check more >>>> things >>>> after each add_xx() step. For example, if only one entry is available at the >>>> time >>>> we start to use the vq, that is, num_free is 0 after add_first(), we >>>> wouldn't be >>>> able to add_next and add_last. So, it would work like this: >>>> >>>> start: >>>> ...get free page block.. >>>> lock(vq) >>>> retry: >>>> ret = add_first(..,&num_free,); >>>> if(ret == -ENOSPC) { >>>> goto retry; >>>> } else if (!num_free) { >>>> add_chain_head(); >>>> unlock(vq); >>>> kick & wait; >>>> goto start; >>>> } >>>> next_one: >>>> ...get free page block.. >>>> add_next(..,&num_free,); >>>> if (!num_free) { >>>> add_chain_head(); >>>> unlock(vq); >>>> kick & wait; >>>> goto start; >>>> } if (num_free == 1) { >>>> ...get free page block.. >>>> add_last(..); >>>> unlock(vq); >>>> kick & wait; >>>> goto start; >>>> } else { >>>> goto next_one; >>>> } >>>> >>>> The above seems unnecessary to me to have three different APIs. >>>> That's the reason to combine them into one virtqueue_add_chain_desc(). >>>> >>>> -- or, do you have a different thought about using the three APIs? >>>> >>>> >>>> Implementation Reference: >>>> >>>> struct desc_iterator { >>>> unsigned int head; >>>> unsigned int tail; >>>> }; >>>> >>>> add_first(*vq, *desc_iterator, *num_free, ..) 
>>>> { >>>> if (vq->vq.num_free < 1) >>>> return -ENOSPC; >>>> get_desc(&desc_id); >>>> desc[desc_id].flag &= ~_F_NEXT; >>>> desc_iterator->head = desc_id >>>> desc_iterator->tail = desc_iterator->head; >>>> *num_free = vq->vq.num_free; >>>> } >>>> >>>> add_next(vq, desc_iterator, *num_free,..) >>>> { >>>> get_desc(&desc_id); >>>> desc[desc_id].flag &= ~_F_NEXT; >>>> desc[desc_iterator.tail].next = desc_id; >>>> desc[desc_iterator->tail].flag |= _F_NEXT; >>>> desc_iterator->tail = desc_id; >>>> *num_free = vq->vq.num_free; >>>> } >>>> >>>> add_last(vq, desc_iterator,..) >>>> { >>>> get_desc(&desc_id); >>>> desc[desc_id].flag &= ~_F_NEXT; >>>> desc[desc_iterator.tail].next = desc_id; >>>> desc_iterator->tail = desc_id; >>>> >>>> add_chain_head(); // put the desc_iterator.head to the ring >>>> } >>>> >>>> >>>> Best, >>>> Wei >>> OK I thought this over. While we might need these new APIs in >>> the future, I think that at the moment, there's a way to implement >>> this feature that is significantly simpler. Just add each s/g >>> as a separate input buffer. >> >> Should it be an output buffer? > Hypervisor overwrites these pages with zeroes. Therefore it is > writeable by device: DMA_FROM_DEVICE. Why would the hypervisor need to zero the buffer? I think it may only need to read out the info(base,size). I think it should be like this: the cmd hdr buffer: input, because the hypervisor would write it to send a cmd to the guest the payload buffer: output, for the hypervisor to read the info >> I think output means from the >> driver to device (i.e. DMA_TO_DEVICE). > This part is correct I believe. > >>> This needs zero new APIs. >>> >>> I know that follow-up patches need to add a header in front >>> so you might be thinking: how am I going to add this >>> header? The answer is quite simple - add it as a separate >>> out header. >>> >>> Host will be able to distinguish between header and pages >>> by looking at the direction, and - should we want to add >>> IN data to header - additionally size (<4K => header). >> >> I think this works fine when the cmdq is only used for >> reporting the unused pages. >> It would be an issue >> if there are other usages (e.g. report memory statistics) >> interleaving. I think one solution would be to lock the cmdq until >> a cmd usage is done ((e.g. all the unused pages are reported) ) - >> in this case, the periodically updated guest memory statistics >> may be delayed for a while occasionally when live migration starts. >> Would this be acceptable? If not, probably we can have the cmdq >> for one usage only. >> >> >> Best, >> Wei > OK I see, I think the issue is that reporting free pages > was structured like stats. Let's split it - > send pages on e.g. free_vq, get commands on vq shared with > stats. > Would it be better to have the "report free page" command to be sent through the free_vq? In this case,we will have stats_vq: for the stats usage, which is already there free_vq: for reporting free pages. Best, Wei
On 07/12/2017 08:40 PM, Wei Wang wrote: > Add a new feature, VIRTIO_BALLOON_F_SG, which enables to > transfer a chunk of ballooned (i.e. inflated/deflated) pages using > scatter-gather lists to the host. > > The implementation of the previous virtio-balloon is not very > efficient, because the balloon pages are transferred to the > host one by one. Here is the breakdown of the time in percentage > spent on each step of the balloon inflating process (inflating > 7GB of an 8GB idle guest). > > 1) allocating pages (6.5%) > 2) sending PFNs to host (68.3%) > 3) address translation (6.1%) > 4) madvise (19%) > > It takes about 4126ms for the inflating process to complete. > The above profiling shows that the bottlenecks are stage 2) > and stage 4). > > This patch optimizes step 2) by transferring pages to the host in > sgs. An sg describes a chunk of guest physically continuous pages. > With this mechanism, step 4) can also be optimized by doing address > translation and madvise() in chunks rather than page by page. > > With this new feature, the above ballooning process takes ~491ms > resulting in an improvement of ~88%. > I found a recent mm patch, bb01b64cfab7c22f3848cb73dc0c2b46b8d38499, which zeros all the ballooned pages and is very time consuming. Tests show that the time to balloon 7GB of pages is increased from ~491 ms to 2.8 seconds with the above patch. How about moving the zeroing operation to the hypervisor? In this way, we will have a much faster balloon process. Best, Wei
On Fri, Jul 28, 2017 at 04:25:19PM +0800, Wei Wang wrote: > On 07/12/2017 08:40 PM, Wei Wang wrote: > > Add a new feature, VIRTIO_BALLOON_F_SG, which enables to > > transfer a chunk of ballooned (i.e. inflated/deflated) pages using > > scatter-gather lists to the host. > > > > The implementation of the previous virtio-balloon is not very > > efficient, because the balloon pages are transferred to the > > host one by one. Here is the breakdown of the time in percentage > > spent on each step of the balloon inflating process (inflating > > 7GB of an 8GB idle guest). > > > > 1) allocating pages (6.5%) > > 2) sending PFNs to host (68.3%) > > 3) address translation (6.1%) > > 4) madvise (19%) > > > > It takes about 4126ms for the inflating process to complete. > > The above profiling shows that the bottlenecks are stage 2) > > and stage 4). > > > > This patch optimizes step 2) by transferring pages to the host in > > sgs. An sg describes a chunk of guest physically continuous pages. > > With this mechanism, step 4) can also be optimized by doing address > > translation and madvise() in chunks rather than page by page. > > > > With this new feature, the above ballooning process takes ~491ms > > resulting in an improvement of ~88%. > > > > > I found a recent mm patch, bb01b64cfab7c22f3848cb73dc0c2b46b8d38499 > , zeros all the ballooned pages, which is very time consuming. > > Tests show that the time to balloon 7G pages is increased from ~491 ms to > 2.8 seconds with the above patch. Sounds like it should be reverted. Post a revert pls and we'll discuss. > How about moving the zero operation to the hypervisor? In this way, we > will have a much faster balloon process. > > > Best, > Wei Or in other words hypervisors should not be stupid and should not try to run ksm on DONTNEED pages.
On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote: > > > > OK I thought this over. While we might need these new APIs in > > > > the future, I think that at the moment, there's a way to implement > > > > this feature that is significantly simpler. Just add each s/g > > > > as a separate input buffer. > > > > > > Should it be an output buffer? > > Hypervisor overwrites these pages with zeroes. Therefore it is > > writeable by device: DMA_FROM_DEVICE. > > Why would the hypervisor need to zero the buffer? The page is supplied to hypervisor and can lose the value that is there. That is the definition of writeable by device. > I think it may only > need to read out the info (base, size). And then do what? > I think it should be like this: > the cmd hdr buffer: input, because the hypervisor would write it to > send a cmd to the guest > the payload buffer: output, for the hypervisor to read the info These should be split. We have: 1. request that hypervisor sends to guest, includes ID: input 2. synchronisation header with ID sent by guest: output 3. list of pages: input 2 and 3 must be on the same VQ. 1 can come on any VQ - reusing stats VQ might make sense. > > > I think output means from the > > > driver to device (i.e. DMA_TO_DEVICE). > > This part is correct I believe. > > > > > > This needs zero new APIs. > > > > > > > > I know that follow-up patches need to add a header in front > > > > so you might be thinking: how am I going to add this > > > > header? The answer is quite simple - add it as a separate > > > > out header. > > > > > > > > Host will be able to distinguish between header and pages > > > > by looking at the direction, and - should we want to add > > > > IN data to header - additionally size (<4K => header). > > > > > > I think this works fine when the cmdq is only used for > > > reporting the unused pages. > > > It would be an issue > > > if there are other usages (e.g. report memory statistics) > > > interleaving. I think one solution would be to lock the cmdq until > > > a cmd usage is done (e.g. all the unused pages are reported) - > > > in this case, the periodically updated guest memory statistics > > > may be delayed for a while occasionally when live migration starts. > > > Would this be acceptable? If not, probably we can have the cmdq > > > for one usage only. > > > > > > > > > Best, > > > Wei > > OK I see, I think the issue is that reporting free pages > > was structured like stats. Let's split it - > > send pages on e.g. free_vq, get commands on vq shared with > > stats. > > > > Would it be better to have the "report free page" command sent > through the free_vq? In this case, we will have > stats_vq: for the stats usage, which is already there > free_vq: for reporting free pages. > > Best, > Wei See above. I would get requests on stats vq but report free pages separately on free vq.
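For concreteness, the three buffers described above could be shaped roughly as below, using only the existing virtqueue_add_inbuf()/virtqueue_add_outbuf() API. The struct names and the idea of posting one header per reported chunk are assumptions made for illustration, not part of the series:

/* 1. host -> guest request; device-writeable, so added as an input buffer
 *    (e.g. on the vq shared with stats).
 */
struct free_page_cmd {
	__virtio32 id;			/* request id chosen by the host */
};

/* 2. guest -> host synchronisation header; device-readable: output buffer */
struct free_page_hdr {
	__virtio32 id;			/* echoes the host's request id */
};

static int report_free_chunk(struct virtqueue *vq, struct free_page_hdr *hdr,
			     struct page *page, unsigned int nr_pages)
{
	struct scatterlist sg;
	int err;

	/* 2. the header telling the host which request this answers */
	sg_init_one(&sg, hdr, sizeof(*hdr));
	err = virtqueue_add_outbuf(vq, &sg, 1, hdr, GFP_KERNEL);
	if (err)
		return err;

	/* 3. the free pages themselves, writeable by the device */
	sg_init_one(&sg, page_address(page), nr_pages * PAGE_SIZE);
	return virtqueue_add_inbuf(vq, &sg, 1, page, GFP_KERNEL);
}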
On 07/29/2017 07:08 AM, Michael S. Tsirkin wrote: > On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote: >>>>> OK I thought this over. While we might need these new APIs in >>>>> the future, I think that at the moment, there's a way to implement >>>>> this feature that is significantly simpler. Just add each s/g >>>>> as a separate input buffer. >>>> Should it be an output buffer? >>> Hypervisor overwrites these pages with zeroes. Therefore it is >>> writeable by device: DMA_FROM_DEVICE. >> Why would the hypervisor need to zero the buffer? > The page is supplied to hypervisor and can lose the value that > is there. That is the definition of writeable by device. I think for the free pages, it should be clear that they will be added as output buffers to the device, because (as we discussed) they are just hints, and some of them may be used by the guest after the report_ API is invoked. The device/hypervisor should not use or discard them. For the balloon pages, I kind of agree with the existing implementation (e.g. inside tell_host()), which uses virtqueue_add_outbuf (instead of _add_inbuf()). I think inbuf should be a buffer which will be written by the device and read by the driver. The cmd buffer put on the vq for the device to send commands can be an inbuf, I think. > >> I think it may only >> need to read out the info (base, size). > And then do what? For the free pages, the info will be used to clear the corresponding "1" in the dirty bitmap. For balloon pages, they will be made DONTNEED and given to other host processes to use (the device won't write them, so no need to set "write" when calling virtqueue_map_desc() in the device). > >> I think it should be like this: >> the cmd hdr buffer: input, because the hypervisor would write it to >> send a cmd to the guest >> the payload buffer: output, for the hypervisor to read the info > These should be split. > > We have: > > 1. request that hypervisor sends to guest, includes ID: input > 2. synchronisation header with ID sent by guest: output > 3. list of pages: input > > 2 and 3 must be on the same VQ. 1 can come on any VQ - reusing stats VQ > might make sense. I would prefer to make the above item 1 come on the free page vq, because the existing stat_vq doesn't support the cmd hdr. Now, I think it is also not necessary to move the existing stat_vq implementation to a new implementation under the cmd hdr, because that new support doesn't make a difference for stats; it will still use its stat_vq (the free page vq will be used for reporting free pages only). What do you think? Best, Wei
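For illustration, the hypervisor-side use of the reported info described here would amount to something like the following. The function is hypothetical (the real consumer would be the host's migration code, e.g. in QEMU, not the kernel), and it relies only on the fact that any page the guest writes afterwards is re-dirtied by the dirty log and re-sent:

/* Hypothetical sketch: apply one free-page hint to the migration
 * dirty bitmap, so the hinted pages are skipped rather than copied.
 */
static void apply_free_page_hint(unsigned long *dirty_bitmap,
				 uint64_t addr, uint32_t len)
{
	unsigned long pfn = addr >> PAGE_SHIFT;
	unsigned long nr = len >> PAGE_SHIFT;

	/* clear the corresponding "1"s; page contents are never touched */
	bitmap_clear(dirty_bitmap, pfn, nr);
}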
On Sat, Jul 29, 2017 at 08:47:08PM +0800, Wei Wang wrote: > On 07/29/2017 07:08 AM, Michael S. Tsirkin wrote: > > On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote: > > > > > > OK I thought this over. While we might need these new APIs in > > > > > > the future, I think that at the moment, there's a way to implement > > > > > > this feature that is significantly simpler. Just add each s/g > > > > > > as a separate input buffer. > > > > > Should it be an output buffer? > > > > Hypervisor overwrites these pages with zeroes. Therefore it is > > > > writeable by device: DMA_FROM_DEVICE. > > > Why would the hypervisor need to zero the buffer? > > The page is supplied to hypervisor and can lose the value that > > is there. That is the definition of writeable by device. > > I think for the free pages, it should be clear that they will be added as > output buffers to the device, because (as we discussed) they are just hints, > and some of them may be used by the guest after the report_ API is invoked. > The device/hypervisor should not use or discard them. Discarding contents is exactly what you propose doing if migration is going on, isn't it? > For the balloon pages, I kind of agree with the existing implementation > (e.g. inside tell_host()), which uses virtqueue_add_outbuf (instead of > _add_inbuf()). This is because it does not pass SGs, it passes weirdly formatted PA within the buffer. > I think inbuf should be a buffer which will be written by the device and > read by the > driver. If hypervisor can change it, it's an inbuf. Should not matter whether driver reads it. > The cmd buffer put on the vq for the device to send commands can be > an > inbuf, I think. > > > > > > > I think it may only > > > need to read out the info (base, size). > > And then do what? > > > For the free pages, the info will be used to clear the corresponding "1" in > the dirty bitmap. > For balloon pages, they will be made DONTNEED and given to other host > processes to > use (the device won't write them, so no need to set "write" when calling > virtqueue_map_desc() in > the device). > > > > > > > I think it should be like this: > > > the cmd hdr buffer: input, because the hypervisor would write it to > > > send a cmd to the guest > > > the payload buffer: output, for the hypervisor to read the info > > These should be split. > > > > We have: > > > > 1. request that hypervisor sends to guest, includes ID: input > > 2. synchronisation header with ID sent by guest: output > > 3. list of pages: input > > > > 2 and 3 must be on the same VQ. 1 can come on any VQ - reusing stats VQ > > might make sense. > > I would prefer to make the above item 1 come on the free page vq, > because the existing stat_vq doesn't support the cmd hdr. > Now, I think it is also not necessary to move the existing stat_vq > implementation to > a new implementation under the cmd hdr, because > that new support doesn't make a difference for stats; it will still use its > stat_vq (the free > page vq will be used for reporting free pages only). > > What do you think? > > > Best, > Wei
On Sunday, July 30, 2017 12:23 PM, Michael S. Tsirkin wrote: > On Sat, Jul 29, 2017 at 08:47:08PM +0800, Wei Wang wrote: > > On 07/29/2017 07:08 AM, Michael S. Tsirkin wrote: > > > On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote: > > > > > > > OK I thought this over. While we might need these new APIs > > > > > > > in the future, I think that at the moment, there's a way to > > > > > > > implement this feature that is significantly simpler. Just > > > > > > > add each s/g as a separate input buffer. > > > > > > Should it be an output buffer? > > > > > Hypervisor overwrites these pages with zeroes. Therefore it is > > > > > writeable by device: DMA_FROM_DEVICE. > > > > Why would the hypervisor need to zero the buffer? > > > The page is supplied to hypervisor and can lose the value that is > > > there. That is the definition of writeable by device. > > > > I think for the free pages, it should be clear that they will be added > > as output buffer to the device, because (as we discussed) they are > > just hints, and some of them may be used by the guest after the report_ API is > invoked. > > The device/hypervisor should not use or discard them. > > Discarding contents is exactly what you propose doing if migration is going on, > isn't it? That's actually a different concept. Please let me explain it with this example: The hypervisor receives the hint saying the guest PageX is a free page, but as we know, after that report_ API exits, the guest kernel may take PageX to use, so PageX is not free page any more. At this time, if the hypervisor writes to the page, that would crash the guest. So, I think the cornerstone of this work is that the hypervisor should not touch the reported pages. Best, Wei
On Sun, Jul 30, 2017 at 05:59:17AM +0000, Wang, Wei W wrote: > On Sunday, July 30, 2017 12:23 PM, Michael S. Tsirkin wrote: > > On Sat, Jul 29, 2017 at 08:47:08PM +0800, Wei Wang wrote: > > > On 07/29/2017 07:08 AM, Michael S. Tsirkin wrote: > > > > On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote: > > > > > > > > OK I thought this over. While we might need these new APIs > > > > > > > > in the future, I think that at the moment, there's a way to > > > > > > > > implement this feature that is significantly simpler. Just > > > > > > > > add each s/g as a separate input buffer. > > > > > > > Should it be an output buffer? > > > > > > Hypervisor overwrites these pages with zeroes. Therefore it is > > > > > > writeable by device: DMA_FROM_DEVICE. > > > > > Why would the hypervisor need to zero the buffer? > > > > The page is supplied to hypervisor and can lose the value that is > > > > there. That is the definition of writeable by device. > > > > > > I think for the free pages, it should be clear that they will be added > > > as output buffer to the device, because (as we discussed) they are > > > just hints, and some of them may be used by the guest after the report_ API is > > invoked. > > > The device/hypervisor should not use or discard them. > > > > Discarding contents is exactly what you propose doing if migration is going on, > > isn't it? > > That's actually a different concept. Please let me explain it with this example: > > The hypervisor receives the hint saying the guest PageX is a free page, but as we know, > after that report_ API exits, the guest kernel may take PageX to use, so PageX is not free > page any more. At this time, if the hypervisor writes to the page, that would crash the guest. > So, I think the cornerstone of this work is that the hypervisor should not touch the > reported pages. > > Best, > Wei That's a hypervisor implementation detail. From guest point of view, discarding contents can not be distinguished from writing old contents.
On Sun, Jul 30, 2017 at 07:18:33PM +0300, Michael S. Tsirkin wrote: > On Sun, Jul 30, 2017 at 05:59:17AM +0000, Wang, Wei W wrote: > > On Sunday, July 30, 2017 12:23 PM, Michael S. Tsirkin wrote: > > > On Sat, Jul 29, 2017 at 08:47:08PM +0800, Wei Wang wrote: > > > > On 07/29/2017 07:08 AM, Michael S. Tsirkin wrote: > > > > > On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote: > > > > > > > > > OK I thought this over. While we might need these new APIs > > > > > > > > > in the future, I think that at the moment, there's a way to > > > > > > > > > implement this feature that is significantly simpler. Just > > > > > > > > > add each s/g as a separate input buffer. > > > > > > > > Should it be an output buffer? > > > > > > > Hypervisor overwrites these pages with zeroes. Therefore it is > > > > > > > writeable by device: DMA_FROM_DEVICE. > > > > > > Why would the hypervisor need to zero the buffer? > > > > > The page is supplied to hypervisor and can lose the value that is > > > > > there. That is the definition of writeable by device. > > > > > > > > I think for the free pages, it should be clear that they will be added > > > > as output buffer to the device, because (as we discussed) they are > > > > just hints, and some of them may be used by the guest after the report_ API is > > > invoked. > > > > The device/hypervisor should not use or discard them. > > > > > > Discarding contents is exactly what you propose doing if migration is going on, > > > isn't it? > > > > That's actually a different concept. Please let me explain it with this example: > > > > The hypervisor receives the hint saying the guest PageX is a free page, but as we know, > > after that report_ API exits, the guest kernel may take PageX to use, so PageX is not free > > page any more. At this time, if the hypervisor writes to the page, that would crash the guest. > > So, I think the cornerstone of this work is that the hypervisor should not touch the > > reported pages. > > > > Best, > > Wei > > That's a hypervisor implementation detail. From guest point of view, > discarding contents can not be distinguished from writing old contents. > Besides, ignoring the free page tricks, consider regular ballooning. We map page with DONTNEED then back with WILLNEED. Result is getting a zero page. So at least one of deflate/inflate should be input. I'd say both for symmetry.
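The zero-page result mentioned here is easy to reproduce from userspace; a minimal demonstration on a private anonymous mapping (the case that matters for the balloon):

#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long pg = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, pg, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	assert(p != MAP_FAILED);
	memset(p, 0xaa, pg);		/* the old contents */
	madvise(p, pg, MADV_DONTNEED);	/* inflate: discard the page */
	assert(p[0] == 0);		/* next touch faults in a zero page */
	return 0;
}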
On 07/31/2017 12:20 AM, Michael S. Tsirkin wrote: > On Sun, Jul 30, 2017 at 07:18:33PM +0300, Michael S. Tsirkin wrote: >> On Sun, Jul 30, 2017 at 05:59:17AM +0000, Wang, Wei W wrote: >> That's a hypervisor implementation detail. From guest point of view, >> discarding contents can not be distinguished from writing old contents. >> > Besides, ignoring the free page tricks, consider regular ballooning. > We map page with DONTNEED then back with WILLNEED. Result is > getting a zero page. So at least one of deflate/inflate should be input. > I'd say both for symmetry. > OK, I see the point. Thanks. Best, Wei
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index f0b3a0b..aa4e7ec 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -32,6 +32,7 @@ #include <linux/mm.h> #include <linux/mount.h> #include <linux/magic.h> +#include <linux/xbitmap.h> /* * Balloon device works in 4K page units. So each page is pointed to by @@ -79,6 +80,9 @@ struct virtio_balloon { /* Synchronize access/update to this struct virtio_balloon elements */ struct mutex balloon_lock; + /* The xbitmap used to record ballooned pages */ + struct xb page_xb; + /* The array of pfns we tell the Host about. */ unsigned int num_pfns; __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX]; @@ -141,13 +145,71 @@ static void set_page_pfns(struct virtio_balloon *vb, page_to_balloon_pfn(page) + i); } +/* + * Send balloon pages in sgs to host. + * The balloon pages are recorded in the page xbitmap. Each bit in the bitmap + * corresponds to a page of PAGE_SIZE. The page xbitmap is searched for + * contiguous "1" bits, which correspond to contiguous pages, to chunk into + * sgs. + * + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that + * need to be searched. + */ +static void tell_host_sgs(struct virtio_balloon *vb, + struct virtqueue *vq, + unsigned long page_xb_start, + unsigned long page_xb_end) +{ + unsigned int head_id = VIRTQUEUE_DESC_ID_INIT, + prev_id = VIRTQUEUE_DESC_ID_INIT; + unsigned long sg_pfn_start, sg_pfn_end; + uint64_t sg_addr; + uint32_t sg_size; + + sg_pfn_start = page_xb_start; + while (sg_pfn_start < page_xb_end) { + sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start, + page_xb_end, 1); + if (sg_pfn_start == page_xb_end + 1) + break; + sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1, + page_xb_end, 0); + sg_addr = sg_pfn_start << PAGE_SHIFT; + sg_size = (sg_pfn_end - sg_pfn_start) * PAGE_SIZE; + virtqueue_add_chain_desc(vq, sg_addr, sg_size, &head_id, + &prev_id, 0); + xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end); + sg_pfn_start = sg_pfn_end + 1; + } + + if (head_id != VIRTQUEUE_DESC_ID_INIT) { + virtqueue_add_chain(vq, head_id, 0, NULL, vb, NULL); + virtqueue_kick_async(vq, vb->acked); + } +} + +/* Update pfn_max and pfn_min according to the pfn of @page */ +static inline void update_pfn_range(struct virtio_balloon *vb, + struct page *page, + unsigned long *pfn_min, + unsigned long *pfn_max) +{ + unsigned long pfn = page_to_pfn(page); + + *pfn_min = min(pfn, *pfn_min); + *pfn_max = max(pfn, *pfn_max); +} + static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) { struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info; unsigned num_allocated_pages; + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG); + unsigned long pfn_max = 0, pfn_min = ULONG_MAX; /* We can only do one array worth at a time.
*/ - num = min(num, ARRAY_SIZE(vb->pfns)); + if (!use_sg) + num = min(num, ARRAY_SIZE(vb->pfns)); mutex_lock(&vb->balloon_lock); for (vb->num_pfns = 0; vb->num_pfns < num; @@ -162,7 +224,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) msleep(200); break; } - set_page_pfns(vb, vb->pfns + vb->num_pfns, page); + if (use_sg) { + update_pfn_range(vb, page, &pfn_min, &pfn_max); + xb_set_bit(&vb->page_xb, page_to_pfn(page)); + } else { + set_page_pfns(vb, vb->pfns + vb->num_pfns, page); + } vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE; if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) @@ -171,8 +238,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) num_allocated_pages = vb->num_pfns; /* Did we get any? */ - if (vb->num_pfns != 0) - tell_host(vb, vb->inflate_vq); + if (vb->num_pfns != 0) { + if (use_sg) + tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max); + else + tell_host(vb, vb->inflate_vq); + } mutex_unlock(&vb->balloon_lock); return num_allocated_pages; @@ -198,9 +269,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) struct page *page; struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info; LIST_HEAD(pages); + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG); + unsigned long pfn_max = 0, pfn_min = ULONG_MAX; - /* We can only do one array worth at a time. */ - num = min(num, ARRAY_SIZE(vb->pfns)); + /* Traditionally, we can only do one array worth at a time. */ + if (!use_sg) + num = min(num, ARRAY_SIZE(vb->pfns)); mutex_lock(&vb->balloon_lock); /* We can't release more pages than taken */ @@ -210,7 +284,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) page = balloon_page_dequeue(vb_dev_info); if (!page) break; - set_page_pfns(vb, vb->pfns + vb->num_pfns, page); + if (use_sg) { + update_pfn_range(vb, page, &pfn_min, &pfn_max); + xb_set_bit(&vb->page_xb, page_to_pfn(page)); + } else { + set_page_pfns(vb, vb->pfns + vb->num_pfns, page); + } list_add(&page->lru, &pages); vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE; } @@ -221,8 +300,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST); * is true, we *have* to do it in this order */ - if (vb->num_pfns != 0) - tell_host(vb, vb->deflate_vq); + if (vb->num_pfns != 0) { + if (use_sg) + tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max); + else + tell_host(vb, vb->deflate_vq); + } release_pages_balloon(vb, &pages); mutex_unlock(&vb->balloon_lock); return num_freed_pages; @@ -441,6 +524,18 @@ static int init_vqs(struct virtio_balloon *vb) } #ifdef CONFIG_BALLOON_COMPACTION + +static void tell_host_one_page(struct virtio_balloon *vb, struct virtqueue *vq, + struct page *page) +{ + unsigned int id = VIRTQUEUE_DESC_ID_INIT; + u64 addr = page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT; + + virtqueue_add_chain_desc(vq, addr, PAGE_SIZE, &id, &id, 0); + virtqueue_add_chain(vq, id, 0, NULL, (void *)addr, NULL); + virtqueue_kick_async(vq, vb->acked); +} + /* * virtballoon_migratepage - perform the balloon page migration on behalf of * a compaction thread.
(called under page lock) @@ -464,6 +559,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info, { struct virtio_balloon *vb = container_of(vb_dev_info, struct virtio_balloon, vb_dev_info); + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG); unsigned long flags; /* @@ -485,16 +581,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info, vb_dev_info->isolated_pages--; __count_vm_event(BALLOON_MIGRATE); spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags); - vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; - set_page_pfns(vb, vb->pfns, newpage); - tell_host(vb, vb->inflate_vq); - + if (use_sg) { + tell_host_one_page(vb, vb->inflate_vq, newpage); + } else { + vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; + set_page_pfns(vb, vb->pfns, newpage); + tell_host(vb, vb->inflate_vq); + } /* balloon's page migration 2nd step -- deflate "page" */ balloon_page_delete(page); - vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; - set_page_pfns(vb, vb->pfns, page); - tell_host(vb, vb->deflate_vq); - + if (use_sg) { + tell_host_one_page(vb, vb->deflate_vq, page); + } else { + vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; + set_page_pfns(vb, vb->pfns, page); + tell_host(vb, vb->deflate_vq); + } mutex_unlock(&vb->balloon_lock); put_page(page); /* balloon reference */ @@ -553,6 +655,9 @@ static int virtballoon_probe(struct virtio_device *vdev) if (err) goto out_free_vb; + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG)) + xb_init(&vb->page_xb); + vb->nb.notifier_call = virtballoon_oom_notify; vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY; err = register_oom_notifier(&vb->nb); @@ -618,6 +723,7 @@ static void virtballoon_remove(struct virtio_device *vdev) cancel_work_sync(&vb->update_balloon_size_work); cancel_work_sync(&vb->update_balloon_stats_work); + xb_empty(&vb->page_xb); remove_common(vb); #ifdef CONFIG_BALLOON_COMPACTION if (vb->vb_dev_info.inode) @@ -669,6 +775,7 @@ static unsigned int features[] = { VIRTIO_BALLOON_F_MUST_TELL_HOST, VIRTIO_BALLOON_F_STATS_VQ, VIRTIO_BALLOON_F_DEFLATE_ON_OOM, + VIRTIO_BALLOON_F_SG, }; static struct virtio_driver virtio_balloon_driver = { diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 5e1b548..b9d7e10 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -269,7 +269,7 @@ static inline int virtqueue_add(struct virtqueue *_vq, struct vring_virtqueue *vq = to_vvq(_vq); struct scatterlist *sg; struct vring_desc *desc; - unsigned int i, n, avail, descs_used, uninitialized_var(prev), err_idx; + unsigned int i, n, descs_used, uninitialized_var(prev), err_id; int head; bool indirect; @@ -387,10 +387,68 @@ static inline int virtqueue_add(struct virtqueue *_vq, else vq->free_head = i; - /* Store token and indirect buffer state. */ + END_USE(vq); + + return virtqueue_add_chain(_vq, head, indirect, desc, data, ctx); + +unmap_release: + err_id = i; + i = head; + + for (n = 0; n < total_sg; n++) { + if (i == err_id) + break; + vring_unmap_one(vq, &desc[i]); + i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next); + } + + vq->vq.num_free += total_sg; + + if (indirect) + kfree(desc); + + END_USE(vq); + return -EIO; +} + +/** + * virtqueue_add_chain - expose a chain of buffers to the other end + * @_vq: the struct virtqueue we're talking about. + * @head: desc id of the chain head. + * @indirect: set if the chain consists of indirect descs. + * @indir_desc: the first indirect desc. + * @data: the token identifying the chain. + * @ctx: extra context for the token.
+ * + * Caller must ensure we don't call this with other virtqueue operations + * at the same time (except where noted). + * + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO). + */ +int virtqueue_add_chain(struct virtqueue *_vq, + unsigned int head, + bool indirect, + struct vring_desc *indir_desc, + void *data, + void *ctx) +{ + struct vring_virtqueue *vq = to_vvq(_vq); + unsigned int avail; + + /* The desc chain is empty. */ + if (head == VIRTQUEUE_DESC_ID_INIT) + return 0; + + START_USE(vq); + + if (unlikely(vq->broken)) { + END_USE(vq); + return -EIO; + } + vq->desc_state[head].data = data; if (indirect) - vq->desc_state[head].indir_desc = desc; + vq->desc_state[head].indir_desc = indir_desc; if (ctx) vq->desc_state[head].indir_desc = ctx; @@ -415,26 +473,87 @@ static inline int virtqueue_add(struct virtqueue *_vq, virtqueue_kick(_vq); return 0; +} +EXPORT_SYMBOL_GPL(virtqueue_add_chain); -unmap_release: - err_idx = i; - i = head; +/** + * virtqueue_add_chain_desc - add a buffer to a chain using a vring desc + * @_vq: the struct virtqueue we're talking about. + * @addr: address of the buffer to add. + * @len: length of the buffer. + * @head_id: desc id of the chain head. + * @prev_id: desc id of the previous buffer. + * @in: set if the buffer is for the device to write. + * + * Caller must ensure we don't call this with other virtqueue operations + * at the same time (except where noted). + * + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO). + */ +int virtqueue_add_chain_desc(struct virtqueue *_vq, + uint64_t addr, + uint32_t len, + unsigned int *head_id, + unsigned int *prev_id, + bool in) +{ + struct vring_virtqueue *vq = to_vvq(_vq); + struct vring_desc *desc = vq->vring.desc; + uint16_t flags = in ? VRING_DESC_F_WRITE : 0; + unsigned int i; - for (n = 0; n < total_sg; n++) { - if (i == err_idx) - break; - vring_unmap_one(vq, &desc[i]); - i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next); + /* Sanity check */ + if (!_vq || !head_id || !prev_id) + return -EINVAL; +retry: + START_USE(vq); + if (unlikely(vq->broken)) { + END_USE(vq); + return -EIO; } - vq->vq.num_free += total_sg; + if (vq->vq.num_free < 1) { + /* + * If there is no desc available in the vq, kick what is + * already added, and re-start building a new chain for + * the passed sg.
+ */ + if (likely(*head_id != VIRTQUEUE_DESC_ID_INIT)) { + END_USE(vq); + virtqueue_add_chain(_vq, *head_id, 0, NULL, vq, NULL); + virtqueue_kick_sync(_vq); + *head_id = VIRTQUEUE_DESC_ID_INIT; + *prev_id = VIRTQUEUE_DESC_ID_INIT; + goto retry; + } else { + END_USE(vq); + return -ENOSPC; + } + } - if (indirect) - kfree(desc); + i = vq->free_head; + flags &= ~VRING_DESC_F_NEXT; + desc[i].flags = cpu_to_virtio16(_vq->vdev, flags); + desc[i].addr = cpu_to_virtio64(_vq->vdev, addr); + desc[i].len = cpu_to_virtio32(_vq->vdev, len); + + /* Add the desc to the end of the chain */ + if (*prev_id != VIRTQUEUE_DESC_ID_INIT) { + desc[*prev_id].next = cpu_to_virtio16(_vq->vdev, i); + desc[*prev_id].flags |= cpu_to_virtio16(_vq->vdev, + VRING_DESC_F_NEXT); + } + *prev_id = i; + if (*head_id == VIRTQUEUE_DESC_ID_INIT) + *head_id = *prev_id; + vq->vq.num_free--; + vq->free_head = virtio16_to_cpu(_vq->vdev, desc[i].next); END_USE(vq); - return -EIO; + + return 0; } +EXPORT_SYMBOL_GPL(virtqueue_add_chain_desc); /** * virtqueue_add_sgs - expose buffers to other end @@ -627,6 +746,56 @@ bool virtqueue_kick(struct virtqueue *vq) } EXPORT_SYMBOL_GPL(virtqueue_kick); +/** + * virtqueue_kick_sync - update after add_buf and busy wait till update is done + * @vq: the struct virtqueue + * + * After one or more virtqueue_add_* calls, invoke this to kick + * the other side. Busy wait till the other side is done with the update. + * + * Caller must ensure we don't call this with other virtqueue + * operations at the same time (except where noted). + * + * Returns false if kick failed, otherwise true. + */ +bool virtqueue_kick_sync(struct virtqueue *vq) +{ + u32 len; + + if (likely(virtqueue_kick(vq))) { + while (!virtqueue_get_buf(vq, &len) && + !virtqueue_is_broken(vq)) + cpu_relax(); + return true; + } + return false; +} +EXPORT_SYMBOL_GPL(virtqueue_kick_sync); + +/** + * virtqueue_kick_async - update after add_buf and block till update is done + * @vq: the struct virtqueue + * + * After one or more virtqueue_add_* calls, invoke this to kick + * the other side. Block till the other side is done with the update. + * + * Caller must ensure we don't call this with other virtqueue + * operations at the same time (except where noted). + * + * Returns false if kick failed, otherwise true.
+ */ +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq) +{ + u32 len; + + if (likely(virtqueue_kick(vq))) { + wait_event(wq, virtqueue_get_buf(vq, &len)); + return true; + } + return false; +} +EXPORT_SYMBOL_GPL(virtqueue_kick_async); + static void detach_buf(struct vring_virtqueue *vq, unsigned int head, void **ctx) { diff --git a/include/linux/virtio.h b/include/linux/virtio.h index 28b0e96..9f27101 100644 --- a/include/linux/virtio.h +++ b/include/linux/virtio.h @@ -57,8 +57,28 @@ int virtqueue_add_sgs(struct virtqueue *vq, void *data, gfp_t gfp); +/* A desc with this init id is treated as an invalid desc */ +#define VIRTQUEUE_DESC_ID_INIT UINT_MAX +int virtqueue_add_chain_desc(struct virtqueue *_vq, + uint64_t addr, + uint32_t len, + unsigned int *head_id, + unsigned int *prev_id, + bool in); + +int virtqueue_add_chain(struct virtqueue *_vq, + unsigned int head, + bool indirect, + struct vring_desc *indirect_desc, + void *data, + void *ctx); + bool virtqueue_kick(struct virtqueue *vq); +bool virtqueue_kick_sync(struct virtqueue *vq); + +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq); + bool virtqueue_kick_prepare(struct virtqueue *vq); bool virtqueue_notify(struct virtqueue *vq); diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index 343d7dd..37780a7 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -34,6 +34,7 @@ #define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */ #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */ #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */ +#define VIRTIO_BALLOON_F_SG 3 /* Use sg instead of PFN lists */ /* Size of a PFN in the balloon interface. */ #define VIRTIO_BALLOON_PFN_SHIFT 12
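Putting the pieces of the patch together, a caller of the proposed chaining API follows the pattern below (mirroring tell_host_sgs() above). The addr/len arrays, token, and wait queue are placeholders, and unlike this sketch, real callers should check the return codes, as noted earlier in the thread:

static void report_chunks(struct virtqueue *vq, wait_queue_head_t wq,
			  void *token, uint64_t addr[], uint32_t len[], int n)
{
	unsigned int head = VIRTQUEUE_DESC_ID_INIT;
	unsigned int prev = VIRTQUEUE_DESC_ID_INIT;
	int i;

	for (i = 0; i < n; i++)
		/* in == 0: device-readable; kicks internally if the vq fills up */
		virtqueue_add_chain_desc(vq, addr[i], len[i], &head, &prev, 0);

	if (head != VIRTQUEUE_DESC_ID_INIT) {
		/* expose the whole chain, then sleep until the host consumes it */
		virtqueue_add_chain(vq, head, 0, NULL, token, NULL);
		virtqueue_kick_async(vq, wq);
	}
}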