Message ID | 1499863221-16206-6-git-send-email-wei.w.wang@intel.com (mailing list archive) |
---|---|
State | New, archived |
On Wed, Jul 12, 2017 at 08:40:18PM +0800, Wei Wang wrote: > diff --git a/include/linux/virtio.h b/include/linux/virtio.h > index 28b0e96..9f27101 100644 > --- a/include/linux/virtio.h > +++ b/include/linux/virtio.h > @@ -57,8 +57,28 @@ int virtqueue_add_sgs(struct virtqueue *vq, > void *data, > gfp_t gfp); > > +/* A desc with this init id is treated as an invalid desc */ > +#define VIRTQUEUE_DESC_ID_INIT UINT_MAX > +int virtqueue_add_chain_desc(struct virtqueue *_vq, > + uint64_t addr, > + uint32_t len, > + unsigned int *head_id, > + unsigned int *prev_id, > + bool in); > + > +int virtqueue_add_chain(struct virtqueue *_vq, > + unsigned int head, > + bool indirect, > + struct vring_desc *indirect_desc, > + void *data, > + void *ctx); > + > bool virtqueue_kick(struct virtqueue *vq); > > +bool virtqueue_kick_sync(struct virtqueue *vq); > + > +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq); > + > bool virtqueue_kick_prepare(struct virtqueue *vq); > > bool virtqueue_notify(struct virtqueue *vq); I don't much care for this API. It does exactly what balloon needs, but at cost of e.g. transparently busy-waiting. Unlikely to be a good fit for anything else. If you don't like my original _first/_next/_last, you will need to come up with something else.
On 07/12/2017 09:06 PM, Michael S. Tsirkin wrote: > On Wed, Jul 12, 2017 at 08:40:18PM +0800, Wei Wang wrote: >> diff --git a/include/linux/virtio.h b/include/linux/virtio.h >> index 28b0e96..9f27101 100644 >> --- a/include/linux/virtio.h >> +++ b/include/linux/virtio.h >> @@ -57,8 +57,28 @@ int virtqueue_add_sgs(struct virtqueue *vq, >> void *data, >> gfp_t gfp); >> >> +/* A desc with this init id is treated as an invalid desc */ >> +#define VIRTQUEUE_DESC_ID_INIT UINT_MAX >> +int virtqueue_add_chain_desc(struct virtqueue *_vq, >> + uint64_t addr, >> + uint32_t len, >> + unsigned int *head_id, >> + unsigned int *prev_id, >> + bool in); >> + >> +int virtqueue_add_chain(struct virtqueue *_vq, >> + unsigned int head, >> + bool indirect, >> + struct vring_desc *indirect_desc, >> + void *data, >> + void *ctx); >> + >> bool virtqueue_kick(struct virtqueue *vq); >> >> +bool virtqueue_kick_sync(struct virtqueue *vq); >> + >> +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq); >> + >> bool virtqueue_kick_prepare(struct virtqueue *vq); >> >> bool virtqueue_notify(struct virtqueue *vq); > I don't much care for this API. It does exactly what balloon needs, > but at cost of e.g. transparently busy-waiting. Unlikely to be > a good fit for anything else. If you were referring to this API - virtqueue_add_chain_desc(): Busy waiting only happens when the vq is full (i.e. no desc left). If necessary, I think we can add an input parameter like "bool busywaiting", then the caller can decide to simply get a -ENOSPC or busy wait to add when no desc is available. > > If you don't like my original _first/_next/_last, you will > need to come up with something else. I thought the above virtqueue_add_chain_desc() performs the same functionality as _first/next/last, which are used to grab descs from the vq and chain them together. If not, could you please elaborate on the usage of the original proposal? Best, Wei
On Wed, Jul 12, 2017 at 09:29:00PM +0800, Wei Wang wrote: > On 07/12/2017 09:06 PM, Michael S. Tsirkin wrote: > > On Wed, Jul 12, 2017 at 08:40:18PM +0800, Wei Wang wrote: > > > diff --git a/include/linux/virtio.h b/include/linux/virtio.h > > > index 28b0e96..9f27101 100644 > > > --- a/include/linux/virtio.h > > > +++ b/include/linux/virtio.h > > > @@ -57,8 +57,28 @@ int virtqueue_add_sgs(struct virtqueue *vq, > > > void *data, > > > gfp_t gfp); > > > +/* A desc with this init id is treated as an invalid desc */ > > > +#define VIRTQUEUE_DESC_ID_INIT UINT_MAX > > > +int virtqueue_add_chain_desc(struct virtqueue *_vq, > > > + uint64_t addr, > > > + uint32_t len, > > > + unsigned int *head_id, > > > + unsigned int *prev_id, > > > + bool in); > > > + > > > +int virtqueue_add_chain(struct virtqueue *_vq, > > > + unsigned int head, > > > + bool indirect, > > > + struct vring_desc *indirect_desc, > > > + void *data, > > > + void *ctx); > > > + > > > bool virtqueue_kick(struct virtqueue *vq); > > > +bool virtqueue_kick_sync(struct virtqueue *vq); > > > + > > > +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq); > > > + > > > bool virtqueue_kick_prepare(struct virtqueue *vq); > > > bool virtqueue_notify(struct virtqueue *vq); > > I don't much care for this API. It does exactly what balloon needs, > > but at cost of e.g. transparently busy-waiting. Unlikely to be > > a good fit for anything else. > > If you were referring to this API - virtqueue_add_chain_desc(): > > Busy waiting only happens when the vq is full (i.e. no desc left). If > necessary, I think we can add an input parameter like > "bool busywaiting", then the caller can decide to simply get a -ENOSPC > or busy wait to add when no desc is available. I think this just shows this API is too high level. This policy should live in drivers. > > > > If you don't like my original _first/_next/_last, you will > > need to come up with something else. > > I thought the above virtqueue_add_chain_des() performs the same > functionality as _first/next/last, which are used to grab descs from the > vq and chain them together. If not, could you please elaborate the > usage of the original proposal? > > Best, > Wei > So the way I see it, there are several issues: - internal wait - forces multiple APIs like kick/kick_sync note how kick_sync can fail but your code never checks return code - need to re-write the last descriptor - might not work for alternative layouts which always expose descriptors immediately - some kind of iterator type would be nicer instead of maintaining head/prev explicitly As for the use, it would be better to do if (!add_next(vq, ...)) { add_last(vq, ...) kick wait } Using VIRTQUEUE_DESC_ID_INIT seems to avoid a branch in the driver, but in fact it merely puts the branch in the virtio code.
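To make the suggested shape concrete, here is a minimal sketch of the driver-side pattern described above, assuming hypothetical add_next()/add_last() helpers (add_next() returning false once only one descriptor slot remains) and a hypothetical wait_for_ack(); none of these names exist in mainline, and the point is only that the kick-and-wait policy lives in the driver rather than in virtio core:

/*
 * Illustrative sketch only: add_next()/add_last() stand in for the
 * proposed _next/_last helpers, wait_for_ack() for a driver-side
 * completion wait. The "ring full -> kick -> wait" policy is driver
 * code, not virtio core.
 */
struct range { u64 addr; u32 len; };	/* one free-page chunk (illustrative) */

static void report_ranges(struct virtqueue *vq, struct range *r, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		if (!add_next(vq, r[i].addr, r[i].len)) {
			/* only one slot left: close the chain and flush */
			add_last(vq, r[i].addr, r[i].len);
			virtqueue_kick(vq);
			wait_for_ack(vq);
		}
	}
}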
On Wed, Jul 12, 2017 at 08:40:18PM +0800, Wei Wang wrote: > Add a new feature, VIRTIO_BALLOON_F_SG, which enables to > transfer a chunk of ballooned (i.e. inflated/deflated) pages using > scatter-gather lists to the host. > > The implementation of the previous virtio-balloon is not very > efficient, because the balloon pages are transferred to the > host one by one. Here is the breakdown of the time in percentage > spent on each step of the balloon inflating process (inflating > 7GB of an 8GB idle guest). > > 1) allocating pages (6.5%) > 2) sending PFNs to host (68.3%) > 3) address translation (6.1%) > 4) madvise (19%) > > It takes about 4126ms for the inflating process to complete. > The above profiling shows that the bottlenecks are stage 2) > and stage 4). > > This patch optimizes step 2) by transferring pages to the host in > sgs. An sg describes a chunk of guest physically continuous pages. > With this mechanism, step 4) can also be optimized by doing address > translation and madvise() in chunks rather than page by page. > > With this new feature, the above ballooning process takes ~491ms > resulting in an improvement of ~88%. > > TODO: optimize stage 1) by allocating/freeing a chunk of pages > instead of a single page each time. > > Signed-off-by: Wei Wang <wei.w.wang@intel.com> > Signed-off-by: Liang Li <liang.z.li@intel.com> > Suggested-by: Michael S. Tsirkin <mst@redhat.com> > --- > drivers/virtio/virtio_balloon.c | 141 ++++++++++++++++++++++--- > drivers/virtio/virtio_ring.c | 199 +++++++++++++++++++++++++++++++++--- > include/linux/virtio.h | 20 ++++ > include/uapi/linux/virtio_balloon.h | 1 + > 4 files changed, 329 insertions(+), 32 deletions(-) > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c > index f0b3a0b..aa4e7ec 100644 > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -32,6 +32,7 @@ > #include <linux/mm.h> > #include <linux/mount.h> > #include <linux/magic.h> > +#include <linux/xbitmap.h> > > /* > * Balloon device works in 4K page units. So each page is pointed to by > @@ -79,6 +80,9 @@ struct virtio_balloon { > /* Synchronize access/update to this struct virtio_balloon elements */ > struct mutex balloon_lock; > > + /* The xbitmap used to record ballooned pages */ > + struct xb page_xb; > + > /* The array of pfns we tell the Host about. */ > unsigned int num_pfns; > __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX]; > @@ -141,13 +145,71 @@ static void set_page_pfns(struct virtio_balloon *vb, > page_to_balloon_pfn(page) + i); > } > > +/* > + * Send balloon pages in sgs to host. > + * The balloon pages are recorded in the page xbitmap. Each bit in the bitmap > + * corresponds to a page of PAGE_SIZE. The page xbitmap is searched for > + * continuous "1" bits, which correspond to continuous pages, to chunk into > + * sgs. > + * > + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that > + * need to be serached. 
searched > + */ > +static void tell_host_sgs(struct virtio_balloon *vb, > + struct virtqueue *vq, > + unsigned long page_xb_start, > + unsigned long page_xb_end) > +{ > + unsigned int head_id = VIRTQUEUE_DESC_ID_INIT, > + prev_id = VIRTQUEUE_DESC_ID_INIT; > + unsigned long sg_pfn_start, sg_pfn_end; > + uint64_t sg_addr; > + uint32_t sg_size; > + > + sg_pfn_start = page_xb_start; > + while (sg_pfn_start < page_xb_end) { > + sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start, > + page_xb_end, 1); > + if (sg_pfn_start == page_xb_end + 1) > + break; > + sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1, > + page_xb_end, 0); > + sg_addr = sg_pfn_start << PAGE_SHIFT; > + sg_size = (sg_pfn_end - sg_pfn_start) * PAGE_SIZE; There's an issue here - this might not fit in uint32_t. You need to limit sg_pfn_end - something like: /* make sure sg_size below fits in a 32 bit integer */ sg_pfn_end = min(sg_pfn_end, sg_pfn_start + UINT_MAX >> PAGE_SIZE); > + virtqueue_add_chain_desc(vq, sg_addr, sg_size, &head_id, > + &prev_id, 0); > + xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end); > + sg_pfn_start = sg_pfn_end + 1; > + } > + > + if (head_id != VIRTQUEUE_DESC_ID_INIT) { > + virtqueue_add_chain(vq, head_id, 0, NULL, vb, NULL); > + virtqueue_kick_async(vq, vb->acked); > + } > +} > + > +/* Update pfn_max and pfn_min according to the pfn of @page */ > +static inline void update_pfn_range(struct virtio_balloon *vb, > + struct page *page, > + unsigned long *pfn_min, > + unsigned long *pfn_max) > +{ > + unsigned long pfn = page_to_pfn(page); > + > + *pfn_min = min(pfn, *pfn_min); > + *pfn_max = max(pfn, *pfn_max); > +} > + > static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) > { > struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info; > unsigned num_allocated_pages; > + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG); > + unsigned long pfn_max = 0, pfn_min = ULONG_MAX; > > /* We can only do one array worth at a time. */ > - num = min(num, ARRAY_SIZE(vb->pfns)); > + if (!use_sg) > + num = min(num, ARRAY_SIZE(vb->pfns)); > > mutex_lock(&vb->balloon_lock); > for (vb->num_pfns = 0; vb->num_pfns < num; > @@ -162,7 +224,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) > msleep(200); > break; > } > - set_page_pfns(vb, vb->pfns + vb->num_pfns, page); > + if (use_sg) { > + update_pfn_range(vb, page, &pfn_min, &pfn_max); > + xb_set_bit(&vb->page_xb, page_to_pfn(page)); > + } else { > + set_page_pfns(vb, vb->pfns + vb->num_pfns, page); > + } > vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE; > if (!virtio_has_feature(vb->vdev, > VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) > @@ -171,8 +238,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) > > num_allocated_pages = vb->num_pfns; > /* Did we get any? */ > - if (vb->num_pfns != 0) > - tell_host(vb, vb->inflate_vq); > + if (vb->num_pfns != 0) { > + if (use_sg) > + tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max); > + else > + tell_host(vb, vb->inflate_vq); > + } > mutex_unlock(&vb->balloon_lock); > > return num_allocated_pages; > @@ -198,9 +269,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) > struct page *page; > struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info; > LIST_HEAD(pages); > + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG); > + unsigned long pfn_max = 0, pfn_min = ULONG_MAX; > > - /* We can only do one array worth at a time. 
*/ > - num = min(num, ARRAY_SIZE(vb->pfns)); > + /* Traditionally, we can only do one array worth at a time. */ > + if (!use_sg) > + num = min(num, ARRAY_SIZE(vb->pfns)); > > mutex_lock(&vb->balloon_lock); > /* We can't release more pages than taken */ > @@ -210,7 +284,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) > page = balloon_page_dequeue(vb_dev_info); > if (!page) > break; > - set_page_pfns(vb, vb->pfns + vb->num_pfns, page); > + if (use_sg) { > + update_pfn_range(vb, page, &pfn_min, &pfn_max); > + xb_set_bit(&vb->page_xb, page_to_pfn(page)); > + } else { > + set_page_pfns(vb, vb->pfns + vb->num_pfns, page); > + } > list_add(&page->lru, &pages); > vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE; > } > @@ -221,8 +300,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) > * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST); > * is true, we *have* to do it in this order > */ > - if (vb->num_pfns != 0) > - tell_host(vb, vb->deflate_vq); > + if (vb->num_pfns != 0) { > + if (use_sg) > + tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max); > + else > + tell_host(vb, vb->deflate_vq); > + } > release_pages_balloon(vb, &pages); > mutex_unlock(&vb->balloon_lock); > return num_freed_pages; > @@ -441,6 +524,18 @@ static int init_vqs(struct virtio_balloon *vb) > } > > #ifdef CONFIG_BALLOON_COMPACTION > + > +static void tell_host_one_page(struct virtio_balloon *vb, struct virtqueue *vq, > + struct page *page) > +{ > + unsigned int id = VIRTQUEUE_DESC_ID_INIT; > + u64 addr = page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT; > + > + virtqueue_add_chain_desc(vq, addr, PAGE_SIZE, &id, &id, 0); > + virtqueue_add_chain(vq, id, 0, NULL, (void *)addr, NULL); > + virtqueue_kick_async(vq, vb->acked); > +} > + > /* > * virtballoon_migratepage - perform the balloon page migration on behalf of > * a compation thread. 
(called under page lock) > @@ -464,6 +559,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info, > { > struct virtio_balloon *vb = container_of(vb_dev_info, > struct virtio_balloon, vb_dev_info); > + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG); > unsigned long flags; > > /* > @@ -485,16 +581,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info, > vb_dev_info->isolated_pages--; > __count_vm_event(BALLOON_MIGRATE); > spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags); > - vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; > - set_page_pfns(vb, vb->pfns, newpage); > - tell_host(vb, vb->inflate_vq); > - > + if (use_sg) { > + tell_host_one_page(vb, vb->inflate_vq, newpage); > + } else { > + vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; > + set_page_pfns(vb, vb->pfns, newpage); > + tell_host(vb, vb->inflate_vq); > + } > /* balloon's page migration 2nd step -- deflate "page" */ > balloon_page_delete(page); > - vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; > - set_page_pfns(vb, vb->pfns, page); > - tell_host(vb, vb->deflate_vq); > - > + if (use_sg) { > + tell_host_one_page(vb, vb->deflate_vq, page); > + } else { > + vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; > + set_page_pfns(vb, vb->pfns, page); > + tell_host(vb, vb->deflate_vq); > + } > mutex_unlock(&vb->balloon_lock); > > put_page(page); /* balloon reference */ > @@ -553,6 +655,9 @@ static int virtballoon_probe(struct virtio_device *vdev) > if (err) > goto out_free_vb; > > + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG)) > + xb_init(&vb->page_xb); > + > vb->nb.notifier_call = virtballoon_oom_notify; > vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY; > err = register_oom_notifier(&vb->nb); > @@ -618,6 +723,7 @@ static void virtballoon_remove(struct virtio_device *vdev) > cancel_work_sync(&vb->update_balloon_size_work); > cancel_work_sync(&vb->update_balloon_stats_work); > > + xb_empty(&vb->page_xb); > remove_common(vb); > #ifdef CONFIG_BALLOON_COMPACTION > if (vb->vb_dev_info.inode) > @@ -669,6 +775,7 @@ static unsigned int features[] = { > VIRTIO_BALLOON_F_MUST_TELL_HOST, > VIRTIO_BALLOON_F_STATS_VQ, > VIRTIO_BALLOON_F_DEFLATE_ON_OOM, > + VIRTIO_BALLOON_F_SG, > }; > > static struct virtio_driver virtio_balloon_driver = { > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > index 5e1b548..b9d7e10 100644 > --- a/drivers/virtio/virtio_ring.c > +++ b/drivers/virtio/virtio_ring.c > @@ -269,7 +269,7 @@ static inline int virtqueue_add(struct virtqueue *_vq, > struct vring_virtqueue *vq = to_vvq(_vq); > struct scatterlist *sg; > struct vring_desc *desc; > - unsigned int i, n, avail, descs_used, uninitialized_var(prev), err_idx; > + unsigned int i, n, descs_used, uninitialized_var(prev), err_id; > int head; > bool indirect; > > @@ -387,10 +387,68 @@ static inline int virtqueue_add(struct virtqueue *_vq, > else > vq->free_head = i; > > - /* Store token and indirect buffer state. */ > + END_USE(vq); > + > + return virtqueue_add_chain(_vq, head, indirect, desc, data, ctx); > + > +unmap_release: > + err_id = i; > + i = head; > + > + for (n = 0; n < total_sg; n++) { > + if (i == err_id) > + break; > + vring_unmap_one(vq, &desc[i]); > + i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next); > + } > + > + vq->vq.num_free += total_sg; > + > + if (indirect) > + kfree(desc); > + > + END_USE(vq); > + return -EIO; > +} > + > +/** > + * virtqueue_add_chain - expose a chain of buffers to the other end > + * @_vq: the struct virtqueue we're talking about. 
> + * @head: desc id of the chain head. > + * @indirect: set if the chain of descs are indrect descs. > + * @indir_desc: the first indirect desc. > + * @data: the token identifying the chain. > + * @ctx: extra context for the token. > + * > + * Caller must ensure we don't call this with other virtqueue operations > + * at the same time (except where noted). > + * > + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO). > + */ > +int virtqueue_add_chain(struct virtqueue *_vq, > + unsigned int head, > + bool indirect, > + struct vring_desc *indir_desc, > + void *data, > + void *ctx) > +{ > + struct vring_virtqueue *vq = to_vvq(_vq); > + unsigned int avail; > + > + /* The desc chain is empty. */ > + if (head == VIRTQUEUE_DESC_ID_INIT) > + return 0; > + > + START_USE(vq); > + > + if (unlikely(vq->broken)) { > + END_USE(vq); > + return -EIO; > + } > + > vq->desc_state[head].data = data; > if (indirect) > - vq->desc_state[head].indir_desc = desc; > + vq->desc_state[head].indir_desc = indir_desc; > if (ctx) > vq->desc_state[head].indir_desc = ctx; > > @@ -415,26 +473,87 @@ static inline int virtqueue_add(struct virtqueue *_vq, > virtqueue_kick(_vq); > > return 0; > +} > +EXPORT_SYMBOL_GPL(virtqueue_add_chain); > > -unmap_release: > - err_idx = i; > - i = head; > +/** > + * virtqueue_add_chain_desc - add a buffer to a chain using a vring desc > + * @vq: the struct virtqueue we're talking about. > + * @addr: address of the buffer to add. > + * @len: length of the buffer. > + * @head_id: desc id of the chain head. > + * @prev_id: desc id of the previous buffer. > + * @in: set if the buffer is for the device to write. > + * > + * Caller must ensure we don't call this with other virtqueue operations > + * at the same time (except where noted). > + * > + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO). > + */ > +int virtqueue_add_chain_desc(struct virtqueue *_vq, > + uint64_t addr, > + uint32_t len, > + unsigned int *head_id, > + unsigned int *prev_id, > + bool in) > +{ > + struct vring_virtqueue *vq = to_vvq(_vq); > + struct vring_desc *desc = vq->vring.desc; > + uint16_t flags = in ? VRING_DESC_F_WRITE : 0; > + unsigned int i; > > - for (n = 0; n < total_sg; n++) { > - if (i == err_idx) > - break; > - vring_unmap_one(vq, &desc[i]); > - i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next); > + /* Sanity check */ > + if (!_vq || !head_id || !prev_id) > + return -EINVAL; > +retry: > + START_USE(vq); > + if (unlikely(vq->broken)) { > + END_USE(vq); > + return -EIO; > } > > - vq->vq.num_free += total_sg; > + if (vq->vq.num_free < 1) { > + /* > + * If there is no desc avail in the vq, so kick what is > + * already added, and re-start to build a new chain for > + * the passed sg. 
> + */ > + if (likely(*head_id != VIRTQUEUE_DESC_ID_INIT)) { > + END_USE(vq); > + virtqueue_add_chain(_vq, *head_id, 0, NULL, vq, NULL); > + virtqueue_kick_sync(_vq); > + *head_id = VIRTQUEUE_DESC_ID_INIT; > + *prev_id = VIRTQUEUE_DESC_ID_INIT; > + goto retry; > + } else { > + END_USE(vq); > + return -ENOSPC; > + } > + } > > - if (indirect) > - kfree(desc); > + i = vq->free_head; > + flags &= ~VRING_DESC_F_NEXT; > + desc[i].flags = cpu_to_virtio16(_vq->vdev, flags); > + desc[i].addr = cpu_to_virtio64(_vq->vdev, addr); > + desc[i].len = cpu_to_virtio32(_vq->vdev, len); > + > + /* Add the desc to the end of the chain */ > + if (*prev_id != VIRTQUEUE_DESC_ID_INIT) { > + desc[*prev_id].next = cpu_to_virtio16(_vq->vdev, i); > + desc[*prev_id].flags |= cpu_to_virtio16(_vq->vdev, > + VRING_DESC_F_NEXT); > + } > + *prev_id = i; > + if (*head_id == VIRTQUEUE_DESC_ID_INIT) > + *head_id = *prev_id; > > + vq->vq.num_free--; > + vq->free_head = virtio16_to_cpu(_vq->vdev, desc[i].next); > END_USE(vq); > - return -EIO; > + > + return 0; > } > +EXPORT_SYMBOL_GPL(virtqueue_add_chain_desc); > > /** > * virtqueue_add_sgs - expose buffers to other end > @@ -627,6 +746,56 @@ bool virtqueue_kick(struct virtqueue *vq) > } > EXPORT_SYMBOL_GPL(virtqueue_kick); > > +/** > + * virtqueue_kick_sync - update after add_buf and busy wait till update is done > + * @vq: the struct virtqueue > + * > + * After one or more virtqueue_add_* calls, invoke this to kick > + * the other side. Busy wait till the other side is done with the update. > + * > + * Caller must ensure we don't call this with other virtqueue > + * operations at the same time (except where noted). > + * > + * Returns false if kick failed, otherwise true. > + */ > +bool virtqueue_kick_sync(struct virtqueue *vq) > +{ > + u32 len; > + > + if (likely(virtqueue_kick(vq))) { > + while (!virtqueue_get_buf(vq, &len) && > + !virtqueue_is_broken(vq)) > + cpu_relax(); > + return true; > + } > + return false; > +} > +EXPORT_SYMBOL_GPL(virtqueue_kick_sync); > + > +/** > + * virtqueue_kick_async - update after add_buf and blocking till update is done > + * @vq: the struct virtqueue > + * > + * After one or more virtqueue_add_* calls, invoke this to kick > + * the other side. Blocking till the other side is done with the update. > + * > + * Caller must ensure we don't call this with other virtqueue > + * operations at the same time (except where noted). > + * > + * Returns false if kick failed, otherwise true. > + */ > +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq) > +{ > + u32 len; > + > + if (likely(virtqueue_kick(vq))) { > + wait_event(wq, virtqueue_get_buf(vq, &len)); > + return true; > + } > + return false; > +} > +EXPORT_SYMBOL_GPL(virtqueue_kick_async); > + This happens to 1. drop the buf 2. not do the right thing if more than one is in flight which means this API isn't all that useful. Even balloon might benefit from keeping multiple bufs in flight down the road. 
> static void detach_buf(struct vring_virtqueue *vq, unsigned int head, > void **ctx) > { > diff --git a/include/linux/virtio.h b/include/linux/virtio.h > index 28b0e96..9f27101 100644 > --- a/include/linux/virtio.h > +++ b/include/linux/virtio.h > @@ -57,8 +57,28 @@ int virtqueue_add_sgs(struct virtqueue *vq, > void *data, > gfp_t gfp); > > +/* A desc with this init id is treated as an invalid desc */ > +#define VIRTQUEUE_DESC_ID_INIT UINT_MAX > +int virtqueue_add_chain_desc(struct virtqueue *_vq, > + uint64_t addr, > + uint32_t len, > + unsigned int *head_id, > + unsigned int *prev_id, > + bool in); > + > +int virtqueue_add_chain(struct virtqueue *_vq, > + unsigned int head, > + bool indirect, > + struct vring_desc *indirect_desc, > + void *data, > + void *ctx); > + > bool virtqueue_kick(struct virtqueue *vq); > > +bool virtqueue_kick_sync(struct virtqueue *vq); > + > +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq); > + > bool virtqueue_kick_prepare(struct virtqueue *vq); > > bool virtqueue_notify(struct virtqueue *vq); > diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h > index 343d7dd..37780a7 100644 > --- a/include/uapi/linux/virtio_balloon.h > +++ b/include/uapi/linux/virtio_balloon.h > @@ -34,6 +34,7 @@ > #define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */ > #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */ > #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */ > +#define VIRTIO_BALLOON_F_SG 3 /* Use sg instead of PFN lists */ > > /* Size of a PFN in the balloon interface. */ > #define VIRTIO_BALLOON_PFN_SHIFT 12 > -- > 2.7.4
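Two side notes on the review above. First, the clamp suggested for tell_host_sgs() needs parentheses and PAGE_SHIFT: in C, + binds tighter than >>, so the expression as written parses as (sg_pfn_start + UINT_MAX) >> PAGE_SIZE, and the shift count should be PAGE_SHIFT (12), not PAGE_SIZE (4096). A corrected sketch:

/* Cap the chunk so that sg_size = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT
 * still fits in a 32-bit integer; note the parentheses around the shift.
 */
sg_pfn_end = min(sg_pfn_end, sg_pfn_start + (UINT_MAX >> PAGE_SHIFT));

Second, on the point that virtqueue_kick_async() drops the buffer and mishandles multiple in-flight chains: a completion path that drains every used buffer and completes each one by its token would tolerate any number of outstanding chains. A hedged sketch, where balloon_buf_done() is a hypothetical per-buffer completion:

static void balloon_drain_acks(struct virtqueue *vq)
{
	unsigned int len;
	void *token;

	/* Complete every finished buffer rather than waiting for one
	 * specific buffer, so no token is dropped when several chains
	 * are in flight at once.
	 */
	while ((token = virtqueue_get_buf(vq, &len)))
		balloon_buf_done(token);
}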
Hi Wei, [auto build test WARNING on linus/master] [also build test WARNING on v4.12 next-20170712] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Wei-Wang/Virtio-balloon-Enhancement/20170713-074956 config: i386-randconfig-x071-07121639 (attached as .config) compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901 reproduce: # save the attached .config to linux build tree make ARCH=i386 All warnings (new ones prefixed by >>): drivers//virtio/virtio_balloon.c: In function 'tell_host_one_page': >> drivers//virtio/virtio_balloon.c:535:39: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] virtqueue_add_chain(vq, id, 0, NULL, (void *)addr, NULL); ^ vim +535 drivers//virtio/virtio_balloon.c 527 528 static void tell_host_one_page(struct virtio_balloon *vb, struct virtqueue *vq, 529 struct page *page) 530 { 531 unsigned int id = VIRTQUEUE_DESC_ID_INIT; 532 u64 addr = page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT; 533 534 virtqueue_add_chain_desc(vq, addr, PAGE_SIZE, &id, &id, 0); > 535 virtqueue_add_chain(vq, id, 0, NULL, (void *)addr, NULL); 536 virtqueue_kick_async(vq, vb->acked); 537 } 538 --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
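The warning is the usual 64-bit-value-into-32-bit-pointer complaint on i386. A conventional way to silence it is to cast through unsigned long, e.g. (a sketch; note this still truncates addresses above 4GB on 32-bit PAE, so carrying the address in a dedicated token structure would be the more robust fix):

virtqueue_add_chain(vq, id, 0, NULL, (void *)(unsigned long)addr, NULL);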
Hi Wei, [auto build test ERROR on linus/master] [also build test ERROR on v4.12 next-20170712] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Wei-Wang/Virtio-balloon-Enhancement/20170713-074956 config: powerpc-defconfig (attached as .config) compiler: powerpc64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705 reproduce: wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=powerpc All errors (new ones prefixed by >>): >> ERROR: ".xb_set_bit" [drivers/virtio/virtio_balloon.ko] undefined! >> ERROR: ".xb_zero" [drivers/virtio/virtio_balloon.ko] undefined! >> ERROR: ".xb_find_next_bit" [drivers/virtio/virtio_balloon.ko] undefined! --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
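These link errors mean the xbitmap helpers (added earlier in this series) are not exported for module builds, so virtio_balloon.ko cannot link against them. The likely fix, assuming the helpers live in a lib/xbitmap.c as the series suggests, is to export them:

/* lib/xbitmap.c (assumed location): export what virtio_balloon.ko uses */
EXPORT_SYMBOL(xb_set_bit);
EXPORT_SYMBOL(xb_zero);
EXPORT_SYMBOL(xb_find_next_bit);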
On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote: > > So the way I see it, there are several issues: > > - internal wait - forces multiple APIs like kick/kick_sync > note how kick_sync can fail but your code never checks return code > - need to re-write the last descriptor - might not work > for alternative layouts which always expose descriptors > immediately Probably it wasn't clear. Please let me explain the two functions here: 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..): grabs a desc from the vq and inserts it to the chain tail (which is indexed by prev_id, probably better to call it tail_id). Then, the new added desc becomes the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc when it's added to the chain, and set when another desc comes to follow later. 2) virtqueue_add_chain(vq, head_id,..): expose the chain to the other end. So, if people want to add a desc and immediately expose it to the other end, i.e. build a single desc chain, they can just add and expose: virtqueue_add_chain_desc(..); virtqueue_add_chain(..,head_id); Would you see any issues here? > - some kind of iterator type would be nicer instead of > maintaining head/prev explicitly Why would we need to iterate the chain? I think it would be simpler to use a wrapper struct: struct virtqueue_desc_chain { unsigned int head; // head desc id of the chain unsigned int tail; // tail desc id of the chain } The new desc will be put to desc[tail].next, and we don't need to walk from the head desc[head].next when inserting a new desc to the chain, right? > > As for the use, it would be better to do > > if (!add_next(vq, ...)) { > add_last(vq, ...) > kick > wait > } "!add_next(vq, ...)" means that the vq is full? If so, what would add_last() do then? > Using VIRTQUEUE_DESC_ID_INIT seems to avoid a branch in the driver, but > in fact it merely puts the branch in the virtio code. > Actually it wasn't intended to improve performance. It is used to indicate the "init" state of the chain. So, when virtqueue_add_chain_desc(, head_id,..) finds head id=INIT, it will assign the grabbed desc id to &head_id. In some sense, it is equivalent to add_first(). Do you have a different opinion here? Best, Wei
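A minimal sketch of the wrapper Wei proposes, layered over the patch's virtqueue_add_chain_desc(); the chain_init()/chain_append() helpers are illustrative, not part of the patch:

struct virtqueue_desc_chain {
	unsigned int head;	/* head desc id of the chain */
	unsigned int tail;	/* tail desc id of the chain */
};

static void chain_init(struct virtqueue_desc_chain *c)
{
	c->head = VIRTQUEUE_DESC_ID_INIT;
	c->tail = VIRTQUEUE_DESC_ID_INIT;
}

/* Append one buffer: the new desc is linked at desc[c->tail].next,
 * so no walk from c->head is needed.
 */
static int chain_append(struct virtqueue *vq, struct virtqueue_desc_chain *c,
			uint64_t addr, uint32_t len, bool in)
{
	return virtqueue_add_chain_desc(vq, addr, len, &c->head, &c->tail, in);
}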
On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote: > On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote: > > > > So the way I see it, there are several issues: > > > > - internal wait - forces multiple APIs like kick/kick_sync > > note how kick_sync can fail but your code never checks return code > > - need to re-write the last descriptor - might not work > > for alternative layouts which always expose descriptors > > immediately > > Probably it wasn't clear. Please let me explain the two functions here: > > 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..): > grabs a desc from the vq and inserts it to the chain tail (which is indexed > by > prev_id, probably better to call it tail_id). Then, the new added desc > becomes > the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc > when it's > added to the chain, and set when another desc comes to follow later. And this only works if there are multiple rings like avail + descriptor ring. It won't work e.g. with the proposed new layout where writing out a descriptor exposes it immediately. > 2) virtqueue_add_chain(vq, head_id,..): expose the chain to the other end. > > So, if people want to add a desc and immediately expose it to the other end, > i.e. build a single desc chain, they can just add and expose: > > virtqueue_add_chain_desc(..); > virtqueue_add_chain(..,head_id); > > Would you see any issues here? The way the new APIs poll used ring internally. > > > - some kind of iterator type would be nicer instead of > > maintaining head/prev explicitly > > Why would we need to iterate the chain? In your patches prev/tail are iterators - they keep track of where you are in the chain. > I think it would be simpler to use > a wrapper struct: > > struct virtqueue_desc_chain { > unsigned int head; // head desc id of the chain > unsigned int tail; // tail desc id of the chain > } > > The new desc will be put to desc[tail].next, and we don't need to walk > from the head desc[head].next when inserting a new desc to the chain, right? > > > > > > As for the use, it would be better to do > > > > if (!add_next(vq, ...)) { > > add_last(vq, ...) > > kick > > wait > > } > > "!add_next(vq, ...)" means that the vq is full? No - it means there's only 1 entry left for the last descriptor. > If so, what would add_last() > do then? > > > Using VIRTQUEUE_DESC_ID_INIT seems to avoid a branch in the driver, but > > in fact it merely puts the branch in the virtio code. > > > > Actually it wasn't intended to improve performance. It is used to indicate > the "init" state > of the chain. So, when virtqueue_add_chain_desc(, head_id,..) finds head > id=INIT, it will > assign the grabbed desc id to &head_id. In some sense, it is equivalent to > add_first(). > > Do you have a different opinion here? > > Best, > Wei > It is but let's make it explicit here - an API function is better than a special value.
On 07/14/2017 04:19 AM, Michael S. Tsirkin wrote: > On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote: >> On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote: >>> So the way I see it, there are several issues: >>> >>> - internal wait - forces multiple APIs like kick/kick_sync >>> note how kick_sync can fail but your code never checks return code >>> - need to re-write the last descriptor - might not work >>> for alternative layouts which always expose descriptors >>> immediately >> Probably it wasn't clear. Please let me explain the two functions here: >> >> 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..): >> grabs a desc from the vq and inserts it to the chain tail (which is indexed >> by >> prev_id, probably better to call it tail_id). Then, the new added desc >> becomes >> the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc >> when it's >> added to the chain, and set when another desc comes to follow later. > And this only works if there are multiple rings like > avail + descriptor ring. > It won't work e.g. with the proposed new layout where > writing out a descriptor exposes it immediately. I think it can support the 1.1 proposal, too. But before getting into that, I think we first need to deep dive into the implementation and usage of _first/next/last. The usage would need to lock the vq from the first to the end (otherwise, the returned info about the number of available desc in the vq, i.e. num_free, would be invalid): lock(vq); add_first(); add_next(); add_last(); unlock(vq); However, I think the case isn't this simple, since we need to check more things after each add_xx() step. For example, if only one entry is available at the time we start to use the vq, that is, num_free is 0 after add_first(), we wouldn't be able to add_next and add_last. So, it would work like this: start: ...get free page block.. lock(vq) retry: ret = add_first(..,&num_free,); if(ret == -ENOSPC) { goto retry; } else if (!num_free) { add_chain_head(); unlock(vq); kick & wait; goto start; } next_one: ...get free page block.. add_next(..,&num_free,); if (!num_free) { add_chain_head(); unlock(vq); kick & wait; goto start; } if (num_free == 1) { ...get free page block.. add_last(..); unlock(vq); kick & wait; goto start; } else { goto next_one; } The above seems unnecessary to me to have three different APIs. That's the reason to combine them into one virtqueue_add_chain_desc(). -- or, do you have a different thought about using the three APIs? Implementation Reference: struct desc_iterator { unsigned int head; unsigned int tail; }; add_first(*vq, *desc_iterator, *num_free, ..) { if (vq->vq.num_free < 1) return -ENOSPC; get_desc(&desc_id); desc[desc_id].flag &= ~_F_NEXT; desc_iterator->head = desc_id desc_iterator->tail = desc_iterator->head; *num_free = vq->vq.num_free; } add_next(vq, desc_iterator, *num_free,..) { get_desc(&desc_id); desc[desc_id].flag &= ~_F_NEXT; desc[desc_iterator.tail].next = desc_id; desc[desc_iterator->tail].flag |= _F_NEXT; desc_iterator->tail = desc_id; *num_free = vq->vq.num_free; } add_last(vq, desc_iterator,..) { get_desc(&desc_id); desc[desc_id].flag &= ~_F_NEXT; desc[desc_iterator.tail].next = desc_id; desc_iterator->tail = desc_id; add_chain_head(); // put the desc_iterator.head to the ring } Best, Wei
On Fri, Jul 14, 2017 at 03:12:43PM +0800, Wei Wang wrote: > On 07/14/2017 04:19 AM, Michael S. Tsirkin wrote: > > On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote: > > > On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote: > > > > So the way I see it, there are several issues: > > > > > > > > - internal wait - forces multiple APIs like kick/kick_sync > > > > note how kick_sync can fail but your code never checks return code > > > > - need to re-write the last descriptor - might not work > > > > for alternative layouts which always expose descriptors > > > > immediately > > > Probably it wasn't clear. Please let me explain the two functions here: > > > > > > 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..): > > > grabs a desc from the vq and inserts it to the chain tail (which is indexed > > > by > > > prev_id, probably better to call it tail_id). Then, the new added desc > > > becomes > > > the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc > > > when it's > > > added to the chain, and set when another desc comes to follow later. > > And this only works if there are multiple rings like > > avail + descriptor ring. > > It won't work e.g. with the proposed new layout where > > writing out a descriptor exposes it immediately. > > I think it can support the 1.1 proposal, too. But before getting > into that, I think we first need to deep dive into the implementation > and usage of _first/next/last. The usage would need to lock the vq > from the first to the end (otherwise, the returned info about the number > of available desc in the vq, i.e. num_free, would be invalid): > > lock(vq); > add_first(); > add_next(); > add_last(); > unlock(vq); > > However, I think the case isn't this simple, since we need to check more > things > after each add_xx() step. For example, if only one entry is available at the > time > we start to use the vq, that is, num_free is 0 after add_first(), we > wouldn't be > able to add_next and add_last. So, it would work like this: > > start: > ...get free page block.. > lock(vq) > retry: > ret = add_first(..,&num_free,); > if(ret == -ENOSPC) { > goto retry; > } else if (!num_free) { > add_chain_head(); > unlock(vq); > kick & wait; > goto start; > } > next_one: > ...get free page block.. > add_next(..,&num_free,); > if (!num_free) { > add_chain_head(); > unlock(vq); > kick & wait; > goto start; > } if (num_free == 1) { > ...get free page block.. > add_last(..); > unlock(vq); > kick & wait; > goto start; > } else { > goto next_one; > } > > The above seems unnecessary to me to have three different APIs. > That's the reason to combine them into one virtqueue_add_chain_desc(). > > -- or, do you have a different thought about using the three APIs? > > > Implementation Reference: > > struct desc_iterator { > unsigned int head; > unsigned int tail; > }; > > add_first(*vq, *desc_iterator, *num_free, ..) > { > if (vq->vq.num_free < 1) > return -ENOSPC; > get_desc(&desc_id); > desc[desc_id].flag &= ~_F_NEXT; > desc_iterator->head = desc_id > desc_iterator->tail = desc_iterator->head; > *num_free = vq->vq.num_free; > } > > add_next(vq, desc_iterator, *num_free,..) > { > get_desc(&desc_id); > desc[desc_id].flag &= ~_F_NEXT; > desc[desc_iterator.tail].next = desc_id; > desc[desc_iterator->tail].flag |= _F_NEXT; > desc_iterator->tail = desc_id; > *num_free = vq->vq.num_free; > } > > add_last(vq, desc_iterator,..) 
> { > get_desc(&desc_id); > desc[desc_id].flag &= ~_F_NEXT; > desc[desc_iterator.tail].next = desc_id; > desc_iterator->tail = desc_id; > > add_chain_head(); // put the desc_iterator.head to the ring > } > > > Best, > Wei OK I thought this over. While we might need these new APIs in the future, I think that at the moment, there's a way to implement this feature that is significantly simpler. Just add each s/g as a separate input buffer. This needs zero new APIs. I know that follow-up patches need to add a header in front so you might be thinking: how am I going to add this header? The answer is quite simple - add it as a separate out header. Host will be able to distinguish between header and pages by looking at the direction, and - should we want to add IN data to header - additionally size (<4K => header). We will be able to look at extended APIs separately down the road.
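A rough sketch of the zero-new-API scheme described above, using the existing virtqueue_add_outbuf()/virtqueue_add_inbuf() with one-element scatterlists. The balloon_req_hdr struct and report_chunk() are invented for illustration; the direction convention (header OUT, pages IN) follows the description above:

struct balloon_req_hdr {	/* hypothetical out-header */
	__virtio32 cmd;
	__virtio32 id;
};

static int report_chunk(struct virtqueue *vq, struct balloon_req_hdr *hdr,
			void *base, size_t size)
{
	struct scatterlist sg;
	int err;

	/* Driver-written header (OUT); the host tells it apart from the
	 * pages by direction and, if need be, by its sub-4K size.
	 */
	sg_init_one(&sg, hdr, sizeof(*hdr));
	err = virtqueue_add_outbuf(vq, &sg, 1, hdr, GFP_KERNEL);
	if (err)
		return err;

	/* Each page range is its own device-writable (IN) buffer. */
	sg_init_one(&sg, base, size);
	return virtqueue_add_inbuf(vq, &sg, 1, base, GFP_KERNEL);
}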
On 07/23/2017 09:45 AM, Michael S. Tsirkin wrote: > On Fri, Jul 14, 2017 at 03:12:43PM +0800, Wei Wang wrote: >> On 07/14/2017 04:19 AM, Michael S. Tsirkin wrote: >>> On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote: >>>> On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote: >>>>> So the way I see it, there are several issues: >>>>> >>>>> - internal wait - forces multiple APIs like kick/kick_sync >>>>> note how kick_sync can fail but your code never checks return code >>>>> - need to re-write the last descriptor - might not work >>>>> for alternative layouts which always expose descriptors >>>>> immediately >>>> Probably it wasn't clear. Please let me explain the two functions here: >>>> >>>> 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..): >>>> grabs a desc from the vq and inserts it to the chain tail (which is indexed >>>> by >>>> prev_id, probably better to call it tail_id). Then, the new added desc >>>> becomes >>>> the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc >>>> when it's >>>> added to the chain, and set when another desc comes to follow later. >>> And this only works if there are multiple rings like >>> avail + descriptor ring. >>> It won't work e.g. with the proposed new layout where >>> writing out a descriptor exposes it immediately. >> I think it can support the 1.1 proposal, too. But before getting >> into that, I think we first need to deep dive into the implementation >> and usage of _first/next/last. The usage would need to lock the vq >> from the first to the end (otherwise, the returned info about the number >> of available desc in the vq, i.e. num_free, would be invalid): >> >> lock(vq); >> add_first(); >> add_next(); >> add_last(); >> unlock(vq); >> >> However, I think the case isn't this simple, since we need to check more >> things >> after each add_xx() step. For example, if only one entry is available at the >> time >> we start to use the vq, that is, num_free is 0 after add_first(), we >> wouldn't be >> able to add_next and add_last. So, it would work like this: >> >> start: >> ...get free page block.. >> lock(vq) >> retry: >> ret = add_first(..,&num_free,); >> if(ret == -ENOSPC) { >> goto retry; >> } else if (!num_free) { >> add_chain_head(); >> unlock(vq); >> kick & wait; >> goto start; >> } >> next_one: >> ...get free page block.. >> add_next(..,&num_free,); >> if (!num_free) { >> add_chain_head(); >> unlock(vq); >> kick & wait; >> goto start; >> } if (num_free == 1) { >> ...get free page block.. >> add_last(..); >> unlock(vq); >> kick & wait; >> goto start; >> } else { >> goto next_one; >> } >> >> The above seems unnecessary to me to have three different APIs. >> That's the reason to combine them into one virtqueue_add_chain_desc(). >> >> -- or, do you have a different thought about using the three APIs? >> >> >> Implementation Reference: >> >> struct desc_iterator { >> unsigned int head; >> unsigned int tail; >> }; >> >> add_first(*vq, *desc_iterator, *num_free, ..) >> { >> if (vq->vq.num_free < 1) >> return -ENOSPC; >> get_desc(&desc_id); >> desc[desc_id].flag &= ~_F_NEXT; >> desc_iterator->head = desc_id >> desc_iterator->tail = desc_iterator->head; >> *num_free = vq->vq.num_free; >> } >> >> add_next(vq, desc_iterator, *num_free,..) >> { >> get_desc(&desc_id); >> desc[desc_id].flag &= ~_F_NEXT; >> desc[desc_iterator.tail].next = desc_id; >> desc[desc_iterator->tail].flag |= _F_NEXT; >> desc_iterator->tail = desc_id; >> *num_free = vq->vq.num_free; >> } >> >> add_last(vq, desc_iterator,..) 
>> { >> get_desc(&desc_id); >> desc[desc_id].flag &= ~_F_NEXT; >> desc[desc_iterator.tail].next = desc_id; >> desc_iterator->tail = desc_id; >> >> add_chain_head(); // put the desc_iterator.head to the ring >> } >> >> >> Best, >> Wei > OK I thought this over. While we might need these new APIs in > the future, I think that at the moment, there's a way to implement > this feature that is significantly simpler. Just add each s/g > as a separate input buffer. Should it be an output buffer? I think output means from the driver to device (i.e. DMA_TO_DEVICE). > > This needs zero new APIs. > > I know that follow-up patches need to add a header in front > so you might be thinking: how am I going to add this > header? The answer is quite simple - add it as a separate > out header. > > Host will be able to distinguish between header and pages > by looking at the direction, and - should we want to add > IN data to header - additionally size (<4K => header). I think this works fine when the cmdq is only used for reporting the unused pages. It would be an issue if there are other usages (e.g. report memory statistics) interleaving. I think one solution would be to lock the cmdq until a cmd usage is done ((e.g. all the unused pages are reported) ) - in this case, the periodically updated guest memory statistics may be delayed for a while occasionally when live migration starts. Would this be acceptable? If not, probably we can have the cmdq for one usage only. Best, Wei
On Wed, Jul 26, 2017 at 11:48:41AM +0800, Wei Wang wrote: > On 07/23/2017 09:45 AM, Michael S. Tsirkin wrote: > > On Fri, Jul 14, 2017 at 03:12:43PM +0800, Wei Wang wrote: > > > On 07/14/2017 04:19 AM, Michael S. Tsirkin wrote: > > > > On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote: > > > > > On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote: > > > > > > So the way I see it, there are several issues: > > > > > > > > > > > > - internal wait - forces multiple APIs like kick/kick_sync > > > > > > note how kick_sync can fail but your code never checks return code > > > > > > - need to re-write the last descriptor - might not work > > > > > > for alternative layouts which always expose descriptors > > > > > > immediately > > > > > Probably it wasn't clear. Please let me explain the two functions here: > > > > > > > > > > 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..): > > > > > grabs a desc from the vq and inserts it to the chain tail (which is indexed > > > > > by > > > > > prev_id, probably better to call it tail_id). Then, the new added desc > > > > > becomes > > > > > the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc > > > > > when it's > > > > > added to the chain, and set when another desc comes to follow later. > > > > And this only works if there are multiple rings like > > > > avail + descriptor ring. > > > > It won't work e.g. with the proposed new layout where > > > > writing out a descriptor exposes it immediately. > > > I think it can support the 1.1 proposal, too. But before getting > > > into that, I think we first need to deep dive into the implementation > > > and usage of _first/next/last. The usage would need to lock the vq > > > from the first to the end (otherwise, the returned info about the number > > > of available desc in the vq, i.e. num_free, would be invalid): > > > > > > lock(vq); > > > add_first(); > > > add_next(); > > > add_last(); > > > unlock(vq); > > > > > > However, I think the case isn't this simple, since we need to check more > > > things > > > after each add_xx() step. For example, if only one entry is available at the > > > time > > > we start to use the vq, that is, num_free is 0 after add_first(), we > > > wouldn't be > > > able to add_next and add_last. So, it would work like this: > > > > > > start: > > > ...get free page block.. > > > lock(vq) > > > retry: > > > ret = add_first(..,&num_free,); > > > if(ret == -ENOSPC) { > > > goto retry; > > > } else if (!num_free) { > > > add_chain_head(); > > > unlock(vq); > > > kick & wait; > > > goto start; > > > } > > > next_one: > > > ...get free page block.. > > > add_next(..,&num_free,); > > > if (!num_free) { > > > add_chain_head(); > > > unlock(vq); > > > kick & wait; > > > goto start; > > > } if (num_free == 1) { > > > ...get free page block.. > > > add_last(..); > > > unlock(vq); > > > kick & wait; > > > goto start; > > > } else { > > > goto next_one; > > > } > > > > > > The above seems unnecessary to me to have three different APIs. > > > That's the reason to combine them into one virtqueue_add_chain_desc(). > > > > > > -- or, do you have a different thought about using the three APIs? > > > > > > > > > Implementation Reference: > > > > > > struct desc_iterator { > > > unsigned int head; > > > unsigned int tail; > > > }; > > > > > > add_first(*vq, *desc_iterator, *num_free, ..) 
> > > { > > > if (vq->vq.num_free < 1) > > > return -ENOSPC; > > > get_desc(&desc_id); > > > desc[desc_id].flag &= ~_F_NEXT; > > > desc_iterator->head = desc_id > > > desc_iterator->tail = desc_iterator->head; > > > *num_free = vq->vq.num_free; > > > } > > > > > > add_next(vq, desc_iterator, *num_free,..) > > > { > > > get_desc(&desc_id); > > > desc[desc_id].flag &= ~_F_NEXT; > > > desc[desc_iterator.tail].next = desc_id; > > > desc[desc_iterator->tail].flag |= _F_NEXT; > > > desc_iterator->tail = desc_id; > > > *num_free = vq->vq.num_free; > > > } > > > > > > add_last(vq, desc_iterator,..) > > > { > > > get_desc(&desc_id); > > > desc[desc_id].flag &= ~_F_NEXT; > > > desc[desc_iterator.tail].next = desc_id; > > > desc_iterator->tail = desc_id; > > > > > > add_chain_head(); // put the desc_iterator.head to the ring > > > } > > > > > > > > > Best, > > > Wei > > OK I thought this over. While we might need these new APIs in > > the future, I think that at the moment, there's a way to implement > > this feature that is significantly simpler. Just add each s/g > > as a separate input buffer. > > > Should it be an output buffer? Hypervisor overwrites these pages with zeroes. Therefore it is writeable by device: DMA_FROM_DEVICE. > I think output means from the > driver to device (i.e. DMA_TO_DEVICE). This part is correct I believe. > > > > This needs zero new APIs. > > > > I know that follow-up patches need to add a header in front > > so you might be thinking: how am I going to add this > > header? The answer is quite simple - add it as a separate > > out header. > > > > Host will be able to distinguish between header and pages > > by looking at the direction, and - should we want to add > > IN data to header - additionally size (<4K => header). > > > I think this works fine when the cmdq is only used for > reporting the unused pages. > It would be an issue > if there are other usages (e.g. report memory statistics) > interleaving. I think one solution would be to lock the cmdq until > a cmd usage is done ((e.g. all the unused pages are reported) ) - > in this case, the periodically updated guest memory statistics > may be delayed for a while occasionally when live migration starts. > Would this be acceptable? If not, probably we can have the cmdq > for one usage only. > > > Best, > Wei OK I see, I think the issue is that reporting free pages was structured like stats. Let's split it - send pages on e.g. free_vq, get commands on vq shared with stats.
On 07/27/2017 01:02 AM, Michael S. Tsirkin wrote: > On Wed, Jul 26, 2017 at 11:48:41AM +0800, Wei Wang wrote: >> On 07/23/2017 09:45 AM, Michael S. Tsirkin wrote: >>> On Fri, Jul 14, 2017 at 03:12:43PM +0800, Wei Wang wrote: >>>> On 07/14/2017 04:19 AM, Michael S. Tsirkin wrote: >>>>> On Thu, Jul 13, 2017 at 03:42:35PM +0800, Wei Wang wrote: >>>>>> On 07/12/2017 09:56 PM, Michael S. Tsirkin wrote: >>>>>>> So the way I see it, there are several issues: >>>>>>> >>>>>>> - internal wait - forces multiple APIs like kick/kick_sync >>>>>>> note how kick_sync can fail but your code never checks return code >>>>>>> - need to re-write the last descriptor - might not work >>>>>>> for alternative layouts which always expose descriptors >>>>>>> immediately >>>>>> Probably it wasn't clear. Please let me explain the two functions here: >>>>>> >>>>>> 1) virtqueue_add_chain_desc(vq, head_id, prev_id,..): >>>>>> grabs a desc from the vq and inserts it to the chain tail (which is indexed >>>>>> by >>>>>> prev_id, probably better to call it tail_id). Then, the new added desc >>>>>> becomes >>>>>> the tail (i.e. the last desc). The _F_NEXT flag is cleared for each desc >>>>>> when it's >>>>>> added to the chain, and set when another desc comes to follow later. >>>>> And this only works if there are multiple rings like >>>>> avail + descriptor ring. >>>>> It won't work e.g. with the proposed new layout where >>>>> writing out a descriptor exposes it immediately. >>>> I think it can support the 1.1 proposal, too. But before getting >>>> into that, I think we first need to deep dive into the implementation >>>> and usage of _first/next/last. The usage would need to lock the vq >>>> from the first to the end (otherwise, the returned info about the number >>>> of available desc in the vq, i.e. num_free, would be invalid): >>>> >>>> lock(vq); >>>> add_first(); >>>> add_next(); >>>> add_last(); >>>> unlock(vq); >>>> >>>> However, I think the case isn't this simple, since we need to check more >>>> things >>>> after each add_xx() step. For example, if only one entry is available at the >>>> time >>>> we start to use the vq, that is, num_free is 0 after add_first(), we >>>> wouldn't be >>>> able to add_next and add_last. So, it would work like this: >>>> >>>> start: >>>> ...get free page block.. >>>> lock(vq) >>>> retry: >>>> ret = add_first(..,&num_free,); >>>> if(ret == -ENOSPC) { >>>> goto retry; >>>> } else if (!num_free) { >>>> add_chain_head(); >>>> unlock(vq); >>>> kick & wait; >>>> goto start; >>>> } >>>> next_one: >>>> ...get free page block.. >>>> add_next(..,&num_free,); >>>> if (!num_free) { >>>> add_chain_head(); >>>> unlock(vq); >>>> kick & wait; >>>> goto start; >>>> } if (num_free == 1) { >>>> ...get free page block.. >>>> add_last(..); >>>> unlock(vq); >>>> kick & wait; >>>> goto start; >>>> } else { >>>> goto next_one; >>>> } >>>> >>>> The above seems unnecessary to me to have three different APIs. >>>> That's the reason to combine them into one virtqueue_add_chain_desc(). >>>> >>>> -- or, do you have a different thought about using the three APIs? >>>> >>>> >>>> Implementation Reference: >>>> >>>> struct desc_iterator { >>>> unsigned int head; >>>> unsigned int tail; >>>> }; >>>> >>>> add_first(*vq, *desc_iterator, *num_free, ..) 
>>>> { >>>> if (vq->vq.num_free < 1) >>>> return -ENOSPC; >>>> get_desc(&desc_id); >>>> desc[desc_id].flag &= ~_F_NEXT; >>>> desc_iterator->head = desc_id >>>> desc_iterator->tail = desc_iterator->head; >>>> *num_free = vq->vq.num_free; >>>> } >>>> >>>> add_next(vq, desc_iterator, *num_free,..) >>>> { >>>> get_desc(&desc_id); >>>> desc[desc_id].flag &= ~_F_NEXT; >>>> desc[desc_iterator.tail].next = desc_id; >>>> desc[desc_iterator->tail].flag |= _F_NEXT; >>>> desc_iterator->tail = desc_id; >>>> *num_free = vq->vq.num_free; >>>> } >>>> >>>> add_last(vq, desc_iterator,..) >>>> { >>>> get_desc(&desc_id); >>>> desc[desc_id].flag &= ~_F_NEXT; >>>> desc[desc_iterator.tail].next = desc_id; >>>> desc_iterator->tail = desc_id; >>>> >>>> add_chain_head(); // put the desc_iterator.head to the ring >>>> } >>>> >>>> >>>> Best, >>>> Wei >>> OK I thought this over. While we might need these new APIs in >>> the future, I think that at the moment, there's a way to implement >>> this feature that is significantly simpler. Just add each s/g >>> as a separate input buffer. >> >> Should it be an output buffer? > Hypervisor overwrites these pages with zeroes. Therefore it is > writeable by device: DMA_FROM_DEVICE. Why would the hypervisor need to zero the buffer? I think it may only need to read out the info(base,size). I think it should be like this: the cmd hdr buffer: input, because the hypervisor would write it to send a cmd to the guest the payload buffer: output, for the hypervisor to read the info >> I think output means from the >> driver to device (i.e. DMA_TO_DEVICE). > This part is correct I believe. > >>> This needs zero new APIs. >>> >>> I know that follow-up patches need to add a header in front >>> so you might be thinking: how am I going to add this >>> header? The answer is quite simple - add it as a separate >>> out header. >>> >>> Host will be able to distinguish between header and pages >>> by looking at the direction, and - should we want to add >>> IN data to header - additionally size (<4K => header). >> >> I think this works fine when the cmdq is only used for >> reporting the unused pages. >> It would be an issue >> if there are other usages (e.g. report memory statistics) >> interleaving. I think one solution would be to lock the cmdq until >> a cmd usage is done ((e.g. all the unused pages are reported) ) - >> in this case, the periodically updated guest memory statistics >> may be delayed for a while occasionally when live migration starts. >> Would this be acceptable? If not, probably we can have the cmdq >> for one usage only. >> >> >> Best, >> Wei > OK I see, I think the issue is that reporting free pages > was structured like stats. Let's split it - > send pages on e.g. free_vq, get commands on vq shared with > stats. > Would it be better to have the "report free page" command to be sent through the free_vq? In this case,we will have stats_vq: for the stats usage, which is already there free_vq: for reporting free pages. Best, Wei
On 07/12/2017 08:40 PM, Wei Wang wrote: > Add a new feature, VIRTIO_BALLOON_F_SG, which enables to > transfer a chunk of ballooned (i.e. inflated/deflated) pages using > scatter-gather lists to the host. > > The implementation of the previous virtio-balloon is not very > efficient, because the balloon pages are transferred to the > host one by one. Here is the breakdown of the time in percentage > spent on each step of the balloon inflating process (inflating > 7GB of an 8GB idle guest). > > 1) allocating pages (6.5%) > 2) sending PFNs to host (68.3%) > 3) address translation (6.1%) > 4) madvise (19%) > > It takes about 4126ms for the inflating process to complete. > The above profiling shows that the bottlenecks are stage 2) > and stage 4). > > This patch optimizes step 2) by transferring pages to the host in > sgs. An sg describes a chunk of guest physically continuous pages. > With this mechanism, step 4) can also be optimized by doing address > translation and madvise() in chunks rather than page by page. > > With this new feature, the above ballooning process takes ~491ms > resulting in an improvement of ~88%. > I found a recent mm patch, bb01b64cfab7c22f3848cb73dc0c2b46b8d38499, which zeros all the ballooned pages and is very time consuming. Tests show that the time to balloon 7GB of pages is increased from ~491 ms to 2.8 seconds with the above patch. How about moving the zeroing operation to the hypervisor? In this way, we will have a much faster balloon process. Best, Wei
On Fri, Jul 28, 2017 at 04:25:19PM +0800, Wei Wang wrote: > On 07/12/2017 08:40 PM, Wei Wang wrote: > > Add a new feature, VIRTIO_BALLOON_F_SG, which enables to > > transfer a chunk of ballooned (i.e. inflated/deflated) pages using > > scatter-gather lists to the host. > > > > The implementation of the previous virtio-balloon is not very > > efficient, because the balloon pages are transferred to the > > host one by one. Here is the breakdown of the time in percentage > > spent on each step of the balloon inflating process (inflating > > 7GB of an 8GB idle guest). > > > > 1) allocating pages (6.5%) > > 2) sending PFNs to host (68.3%) > > 3) address translation (6.1%) > > 4) madvise (19%) > > > > It takes about 4126ms for the inflating process to complete. > > The above profiling shows that the bottlenecks are stage 2) > > and stage 4). > > > > This patch optimizes step 2) by transferring pages to the host in > > sgs. An sg describes a chunk of guest physically continuous pages. > > With this mechanism, step 4) can also be optimized by doing address > > translation and madvise() in chunks rather than page by page. > > > > With this new feature, the above ballooning process takes ~491ms > > resulting in an improvement of ~88%. > > > > > I found a recent mm patch, bb01b64cfab7c22f3848cb73dc0c2b46b8d38499 > , zeros all the ballooned pages, which is very time consuming. > > Tests show that the time to balloon 7G pages is increased from ~491 ms to > 2.8 seconds with the above patch. Sounds like it should be reverted. Post a revert pls and we'll discuss. > How about moving the zero operation to the hypervisor? In this way, we > will have a much faster balloon process. > > > Best, > Wei Or in other words hypervisors should not be stupid and should not try to run ksm on DONTNEED pages.
On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote: > > > > OK I thought this over. While we might need these new APIs in > > > > the future, I think that at the moment, there's a way to implement > > > > this feature that is significantly simpler. Just add each s/g > > > > as a separate input buffer. > > > > > > Should it be an output buffer? > > Hypervisor overwrites these pages with zeroes. Therefore it is > > writeable by device: DMA_FROM_DEVICE. > > Why would the hypervisor need to zero the buffer? The page is supplied to hypervisor and can lose the value that is there. That is the definition of writeable by device. > I think it may only > need to read out the info (base, size). And then do what? > I think it should be like this: > the cmd hdr buffer: input, because the hypervisor would write it to > send a cmd to the guest > the payload buffer: output, for the hypervisor to read the info These should be split. We have: 1. request that hypervisor sends to guest, includes ID: input 2. synchronisation header with ID sent by guest: output 3. list of pages: input 2 and 3 must be on the same VQ. 1 can come on any VQ - reusing stats VQ might make sense. > > > I think output means from the > > > driver to device (i.e. DMA_TO_DEVICE). > > This part is correct I believe. > > > > > > This needs zero new APIs. > > > > > > > > I know that follow-up patches need to add a header in front > > > > so you might be thinking: how am I going to add this > > > > header? The answer is quite simple - add it as a separate > > > > out header. > > > > > > > > Host will be able to distinguish between header and pages > > > > by looking at the direction, and - should we want to add > > > > IN data to header - additionally size (<4K => header). > > > > > > I think this works fine when the cmdq is only used for > > > reporting the unused pages. > > > It would be an issue > > > if there are other usages (e.g. report memory statistics) > > > interleaving. I think one solution would be to lock the cmdq until > > > a cmd usage is done (e.g. all the unused pages are reported) - > > > in this case, the periodically updated guest memory statistics > > > may be delayed for a while occasionally when live migration starts. > > > Would this be acceptable? If not, probably we can have the cmdq > > > for one usage only. > > > > > > > > > Best, > > > Wei > > OK I see, I think the issue is that reporting free pages > > was structured like stats. Let's split it - > > send pages on e.g. free_vq, get commands on vq shared with > > stats. > > > > Would it be better to have the "report free page" command sent > through the free_vq? In this case, we will have > stats_vq: for the stats usage, which is already there > free_vq: for reporting free pages. > > Best, > Wei See above. I would get requests on stats vq but report free pages separately on free vq.
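For concreteness, the three buffers described above could be shaped roughly as below, using only the existing virtqueue_add_inbuf()/virtqueue_add_outbuf() API. The struct names and the idea of posting one header per reported chunk are assumptions made for illustration, not part of the series:

/* 1. host -> guest request; device-writeable, so added as an input buffer
 *    (e.g. on the vq shared with stats).
 */
struct free_page_cmd {
	__virtio32 id;			/* request id chosen by the host */
};

/* 2. guest -> host synchronisation header; device-readable: output buffer */
struct free_page_hdr {
	__virtio32 id;			/* echoes the host's request id */
};

static int report_free_chunk(struct virtqueue *vq, struct free_page_hdr *hdr,
			     struct page *page, unsigned int nr_pages)
{
	struct scatterlist sg;
	int err;

	/* 2. the header telling the host which request this answers */
	sg_init_one(&sg, hdr, sizeof(*hdr));
	err = virtqueue_add_outbuf(vq, &sg, 1, hdr, GFP_KERNEL);
	if (err)
		return err;

	/* 3. the free pages themselves, writeable by the device */
	sg_init_one(&sg, page_address(page), nr_pages * PAGE_SIZE);
	return virtqueue_add_inbuf(vq, &sg, 1, page, GFP_KERNEL);
}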
On 07/29/2017 07:08 AM, Michael S. Tsirkin wrote: > On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote: >>>>> OK I thought this over. While we might need these new APIs in >>>>> the future, I think that at the moment, there's a way to implement >>>>> this feature that is significantly simpler. Just add each s/g >>>>> as a separate input buffer. >>>> Should it be an output buffer? >>> Hypervisor overwrites these pages with zeroes. Therefore it is >>> writeable by device: DMA_FROM_DEVICE. >> Why would the hypervisor need to zero the buffer? > The page is supplied to hypervisor and can lose the value that > is there. That is the definition of writeable by device. I think for the free pages, it should be clear that they will be added as output buffers to the device, because (as we discussed) they are just hints, and some of them may be used by the guest after the report_ API is invoked. The device/hypervisor should not use or discard them. For the balloon pages, I kind of agree with the existing implementation (e.g. inside tell_host()), which uses virtqueue_add_outbuf (instead of _add_inbuf()). I think inbuf should be a buffer which will be written by the device and read by the driver. The cmd buffer put on the vq for the device to send commands can be an inbuf, I think. > >> I think it may only >> need to read out the info (base, size). > And then do what? For the free pages, the info will be used to clear the corresponding "1" in the dirty bitmap. For balloon pages, they will be made DONTNEED and given to other host processes to use (the device won't write them, so no need to set "write" when calling virtqueue_map_desc() in the device). > >> I think it should be like this: >> the cmd hdr buffer: input, because the hypervisor would write it to >> send a cmd to the guest >> the payload buffer: output, for the hypervisor to read the info > These should be split. > > We have: > > 1. request that hypervisor sends to guest, includes ID: input > 2. synchronisation header with ID sent by guest: output > 3. list of pages: input > > 2 and 3 must be on the same VQ. 1 can come on any VQ - reusing stats VQ > might make sense. I would prefer to make the above item 1 come on the free page vq, because the existing stat_vq doesn't support the cmd hdr. Now, I think it is also not necessary to move the existing stat_vq implementation to a new implementation under the cmd hdr, because that new support doesn't make a difference for stats; it will still use its stat_vq (the free page vq will be used for reporting free pages only). What do you think? Best, Wei
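For illustration, the hypervisor-side use of the reported info described here would amount to something like the following. The function is hypothetical (the real consumer would be the host's migration code, e.g. in QEMU, not the kernel), and it relies only on the fact that any page the guest writes afterwards is re-dirtied by the dirty log and re-sent:

/* Hypothetical sketch: apply one free-page hint to the migration
 * dirty bitmap, so the hinted pages are skipped rather than copied.
 */
static void apply_free_page_hint(unsigned long *dirty_bitmap,
				 uint64_t addr, uint32_t len)
{
	unsigned long pfn = addr >> PAGE_SHIFT;
	unsigned long nr = len >> PAGE_SHIFT;

	/* clear the corresponding "1"s; page contents are never touched */
	bitmap_clear(dirty_bitmap, pfn, nr);
}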
On Sat, Jul 29, 2017 at 08:47:08PM +0800, Wei Wang wrote: > On 07/29/2017 07:08 AM, Michael S. Tsirkin wrote: > > On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote: > > > > > > OK I thought this over. While we might need these new APIs in > > > > > > the future, I think that at the moment, there's a way to implement > > > > > > this feature that is significantly simpler. Just add each s/g > > > > > > as a separate input buffer. > > > > > Should it be an output buffer? > > > > Hypervisor overwrites these pages with zeroes. Therefore it is > > > > writeable by device: DMA_FROM_DEVICE. > > > Why would the hypervisor need to zero the buffer? > > The page is supplied to hypervisor and can lose the value that > > is there. That is the definition of writeable by device. > > I think for the free pages, it should be clear that they will be added as > output buffers to the device, because (as we discussed) they are just hints, > and some of them may be used by the guest after the report_ API is invoked. > The device/hypervisor should not use or discard them. Discarding contents is exactly what you propose doing if migration is going on, isn't it? > For the balloon pages, I kind of agree with the existing implementation > (e.g. inside tell_host()), which uses virtqueue_add_outbuf (instead of > _add_inbuf()). This is because it does not pass SGs, it passes weirdly formatted PA within the buffer. > I think inbuf should be a buffer which will be written by the device and > read by the > driver. If hypervisor can change it, it's an inbuf. Should not matter whether driver reads it. > The cmd buffer put on the vq for the device to send commands can be > an > inbuf, I think. > > > > > > > I think it may only > > > need to read out the info (base, size). > > And then do what? > > > For the free pages, the info will be used to clear the corresponding "1" in > the dirty bitmap. > For balloon pages, they will be made DONTNEED and given to other host > processes to > use (the device won't write them, so no need to set "write" when calling > virtqueue_map_desc() in > the device). > > > > > > > I think it should be like this: > > > the cmd hdr buffer: input, because the hypervisor would write it to > > > send a cmd to the guest > > > the payload buffer: output, for the hypervisor to read the info > > These should be split. > > > > We have: > > > > 1. request that hypervisor sends to guest, includes ID: input > > 2. synchronisation header with ID sent by guest: output > > 3. list of pages: input > > > > 2 and 3 must be on the same VQ. 1 can come on any VQ - reusing stats VQ > > might make sense. > > I would prefer to make the above item 1 come on the free page vq, > because the existing stat_vq doesn't support the cmd hdr. > Now, I think it is also not necessary to move the existing stat_vq > implementation to > a new implementation under the cmd hdr, because > that new support doesn't make a difference for stats; it will still use its > stat_vq (the free > page vq will be used for reporting free pages only). > > What do you think? > > > Best, > Wei
On Sunday, July 30, 2017 12:23 PM, Michael S. Tsirkin wrote: > On Sat, Jul 29, 2017 at 08:47:08PM +0800, Wei Wang wrote: > > On 07/29/2017 07:08 AM, Michael S. Tsirkin wrote: > > > On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote: > > > > > > > OK I thought this over. While we might need these new APIs > > > > > > > in the future, I think that at the moment, there's a way to > > > > > > > implement this feature that is significantly simpler. Just > > > > > > > add each s/g as a separate input buffer. > > > > > > Should it be an output buffer? > > > > > Hypervisor overwrites these pages with zeroes. Therefore it is > > > > > writeable by device: DMA_FROM_DEVICE. > > > > Why would the hypervisor need to zero the buffer? > > > The page is supplied to hypervisor and can lose the value that is > > > there. That is the definition of writeable by device. > > > > I think for the free pages, it should be clear that they will be added > > as output buffer to the device, because (as we discussed) they are > > just hints, and some of them may be used by the guest after the report_ API is > invoked. > > The device/hypervisor should not use or discard them. > > Discarding contents is exactly what you propose doing if migration is going on, > isn't it? That's actually a different concept. Please let me explain it with this example: The hypervisor receives the hint saying the guest PageX is a free page, but as we know, after that report_ API exits, the guest kernel may take PageX to use, so PageX is not free page any more. At this time, if the hypervisor writes to the page, that would crash the guest. So, I think the cornerstone of this work is that the hypervisor should not touch the reported pages. Best, Wei
On Sun, Jul 30, 2017 at 05:59:17AM +0000, Wang, Wei W wrote: > On Sunday, July 30, 2017 12:23 PM, Michael S. Tsirkin wrote: > > On Sat, Jul 29, 2017 at 08:47:08PM +0800, Wei Wang wrote: > > > On 07/29/2017 07:08 AM, Michael S. Tsirkin wrote: > > > > On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote: > > > > > > > > OK I thought this over. While we might need these new APIs > > > > > > > > in the future, I think that at the moment, there's a way to > > > > > > > > implement this feature that is significantly simpler. Just > > > > > > > > add each s/g as a separate input buffer. > > > > > > > Should it be an output buffer? > > > > > > Hypervisor overwrites these pages with zeroes. Therefore it is > > > > > > writeable by device: DMA_FROM_DEVICE. > > > > > Why would the hypervisor need to zero the buffer? > > > > The page is supplied to hypervisor and can lose the value that is > > > > there. That is the definition of writeable by device. > > > > > > I think for the free pages, it should be clear that they will be added > > > as output buffer to the device, because (as we discussed) they are > > > just hints, and some of them may be used by the guest after the report_ API is > > invoked. > > > The device/hypervisor should not use or discard them. > > > > Discarding contents is exactly what you propose doing if migration is going on, > > isn't it? > > That's actually a different concept. Please let me explain it with this example: > > The hypervisor receives the hint saying the guest PageX is a free page, but as we know, > after that report_ API exits, the guest kernel may take PageX to use, so PageX is not free > page any more. At this time, if the hypervisor writes to the page, that would crash the guest. > So, I think the cornerstone of this work is that the hypervisor should not touch the > reported pages. > > Best, > Wei That's a hypervisor implementation detail. From guest point of view, discarding contents can not be distinguished from writing old contents.
On Sun, Jul 30, 2017 at 07:18:33PM +0300, Michael S. Tsirkin wrote: > On Sun, Jul 30, 2017 at 05:59:17AM +0000, Wang, Wei W wrote: > > On Sunday, July 30, 2017 12:23 PM, Michael S. Tsirkin wrote: > > > On Sat, Jul 29, 2017 at 08:47:08PM +0800, Wei Wang wrote: > > > > On 07/29/2017 07:08 AM, Michael S. Tsirkin wrote: > > > > > On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote: > > > > > > > > > OK I thought this over. While we might need these new APIs > > > > > > > > > in the future, I think that at the moment, there's a way to > > > > > > > > > implement this feature that is significantly simpler. Just > > > > > > > > > add each s/g as a separate input buffer. > > > > > > > > Should it be an output buffer? > > > > > > > Hypervisor overwrites these pages with zeroes. Therefore it is > > > > > > > writeable by device: DMA_FROM_DEVICE. > > > > > > Why would the hypervisor need to zero the buffer? > > > > > The page is supplied to hypervisor and can lose the value that is > > > > > there. That is the definition of writeable by device. > > > > > > > > I think for the free pages, it should be clear that they will be added > > > > as output buffer to the device, because (as we discussed) they are > > > > just hints, and some of them may be used by the guest after the report_ API is > > > invoked. > > > > The device/hypervisor should not use or discard them. > > > > > > Discarding contents is exactly what you propose doing if migration is going on, > > > isn't it? > > > > That's actually a different concept. Please let me explain it with this example: > > > > The hypervisor receives the hint saying the guest PageX is a free page, but as we know, > > after that report_ API exits, the guest kernel may take PageX to use, so PageX is not free > > page any more. At this time, if the hypervisor writes to the page, that would crash the guest. > > So, I think the cornerstone of this work is that the hypervisor should not touch the > > reported pages. > > > > Best, > > Wei > > That's a hypervisor implementation detail. From guest point of view, > discarding contents can not be distinguished from writing old contents. > Besides, ignoring the free page tricks, consider regular ballooning. We map page with DONTNEED then back with WILLNEED. Result is getting a zero page. So at least one of deflate/inflate should be input. I'd say both for symmetry.
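The zero-page result mentioned here is easy to reproduce from userspace; a minimal demonstration on a private anonymous mapping (the case that matters for the balloon):

#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long pg = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, pg, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	assert(p != MAP_FAILED);
	memset(p, 0xaa, pg);		/* the old contents */
	madvise(p, pg, MADV_DONTNEED);	/* inflate: discard the page */
	assert(p[0] == 0);		/* next touch faults in a zero page */
	return 0;
}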
On 07/31/2017 12:20 AM, Michael S. Tsirkin wrote: > On Sun, Jul 30, 2017 at 07:18:33PM +0300, Michael S. Tsirkin wrote: >> On Sun, Jul 30, 2017 at 05:59:17AM +0000, Wang, Wei W wrote: >> That's a hypervisor implementation detail. From guest point of view, >> discarding contents can not be distinguished from writing old contents. >> > Besides, ignoring the free page tricks, consider regular ballooning. > We map page with DONTNEED then back with WILLNEED. Result is > getting a zero page. So at least one of deflate/inflate should be input. > I'd say both for symmetry. > OK, I see the point. Thanks. Best, Wei
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index f0b3a0b..aa4e7ec 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -32,6 +32,7 @@ #include <linux/mm.h> #include <linux/mount.h> #include <linux/magic.h> +#include <linux/xbitmap.h> /* * Balloon device works in 4K page units. So each page is pointed to by @@ -79,6 +80,9 @@ struct virtio_balloon { /* Synchronize access/update to this struct virtio_balloon elements */ struct mutex balloon_lock; + /* The xbitmap used to record ballooned pages */ + struct xb page_xb; + /* The array of pfns we tell the Host about. */ unsigned int num_pfns; __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX]; @@ -141,13 +145,71 @@ static void set_page_pfns(struct virtio_balloon *vb, page_to_balloon_pfn(page) + i); } +/* + * Send balloon pages in sgs to host. + * The balloon pages are recorded in the page xbitmap. Each bit in the bitmap + * corresponds to a page of PAGE_SIZE. The page xbitmap is searched for + * contiguous "1" bits, which correspond to contiguous pages, to chunk into + * sgs. + * + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that + * need to be searched. + */ +static void tell_host_sgs(struct virtio_balloon *vb, + struct virtqueue *vq, + unsigned long page_xb_start, + unsigned long page_xb_end) +{ + unsigned int head_id = VIRTQUEUE_DESC_ID_INIT, + prev_id = VIRTQUEUE_DESC_ID_INIT; + unsigned long sg_pfn_start, sg_pfn_end; + uint64_t sg_addr; + uint32_t sg_size; + + sg_pfn_start = page_xb_start; + while (sg_pfn_start < page_xb_end) { + sg_pfn_start = xb_find_next_bit(&vb->page_xb, sg_pfn_start, + page_xb_end, 1); + if (sg_pfn_start == page_xb_end + 1) + break; + sg_pfn_end = xb_find_next_bit(&vb->page_xb, sg_pfn_start + 1, + page_xb_end, 0); + sg_addr = sg_pfn_start << PAGE_SHIFT; + sg_size = (sg_pfn_end - sg_pfn_start) * PAGE_SIZE; + virtqueue_add_chain_desc(vq, sg_addr, sg_size, &head_id, + &prev_id, 0); + xb_zero(&vb->page_xb, sg_pfn_start, sg_pfn_end); + sg_pfn_start = sg_pfn_end + 1; + } + + if (head_id != VIRTQUEUE_DESC_ID_INIT) { + virtqueue_add_chain(vq, head_id, 0, NULL, vb, NULL); + virtqueue_kick_async(vq, vb->acked); + } +} + +/* Update pfn_max and pfn_min according to the pfn of @page */ +static inline void update_pfn_range(struct virtio_balloon *vb, + struct page *page, + unsigned long *pfn_min, + unsigned long *pfn_max) +{ + unsigned long pfn = page_to_pfn(page); + + *pfn_min = min(pfn, *pfn_min); + *pfn_max = max(pfn, *pfn_max); +} + static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) { struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info; unsigned num_allocated_pages; + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG); + unsigned long pfn_max = 0, pfn_min = ULONG_MAX; /* We can only do one array worth at a time.
*/ - num = min(num, ARRAY_SIZE(vb->pfns)); + if (!use_sg) + num = min(num, ARRAY_SIZE(vb->pfns)); mutex_lock(&vb->balloon_lock); for (vb->num_pfns = 0; vb->num_pfns < num; @@ -162,7 +224,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) msleep(200); break; } - set_page_pfns(vb, vb->pfns + vb->num_pfns, page); + if (use_sg) { + update_pfn_range(vb, page, &pfn_min, &pfn_max); + xb_set_bit(&vb->page_xb, page_to_pfn(page)); + } else { + set_page_pfns(vb, vb->pfns + vb->num_pfns, page); + } vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE; if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) @@ -171,8 +238,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num) num_allocated_pages = vb->num_pfns; /* Did we get any? */ - if (vb->num_pfns != 0) - tell_host(vb, vb->inflate_vq); + if (vb->num_pfns != 0) { + if (use_sg) + tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max); + else + tell_host(vb, vb->inflate_vq); + } mutex_unlock(&vb->balloon_lock); return num_allocated_pages; @@ -198,9 +269,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) struct page *page; struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info; LIST_HEAD(pages); + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG); + unsigned long pfn_max = 0, pfn_min = ULONG_MAX; - /* We can only do one array worth at a time. */ - num = min(num, ARRAY_SIZE(vb->pfns)); + /* Traditionally, we can only do one array worth at a time. */ + if (!use_sg) + num = min(num, ARRAY_SIZE(vb->pfns)); mutex_lock(&vb->balloon_lock); /* We can't release more pages than taken */ @@ -210,7 +284,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) page = balloon_page_dequeue(vb_dev_info); if (!page) break; - set_page_pfns(vb, vb->pfns + vb->num_pfns, page); + if (use_sg) { + update_pfn_range(vb, page, &pfn_min, &pfn_max); + xb_set_bit(&vb->page_xb, page_to_pfn(page)); + } else { + set_page_pfns(vb, vb->pfns + vb->num_pfns, page); + } list_add(&page->lru, &pages); vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE; } @@ -221,8 +300,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num) * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST); * is true, we *have* to do it in this order */ - if (vb->num_pfns != 0) - tell_host(vb, vb->deflate_vq); + if (vb->num_pfns != 0) { + if (use_sg) + tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max); + else + tell_host(vb, vb->deflate_vq); + } release_pages_balloon(vb, &pages); mutex_unlock(&vb->balloon_lock); return num_freed_pages; @@ -441,6 +524,18 @@ static int init_vqs(struct virtio_balloon *vb) } #ifdef CONFIG_BALLOON_COMPACTION + +static void tell_host_one_page(struct virtio_balloon *vb, struct virtqueue *vq, + struct page *page) +{ + unsigned int id = VIRTQUEUE_DESC_ID_INIT; + u64 addr = page_to_pfn(page) << VIRTIO_BALLOON_PFN_SHIFT; + + virtqueue_add_chain_desc(vq, addr, PAGE_SIZE, &id, &id, 0); + virtqueue_add_chain(vq, id, 0, NULL, (void *)addr, NULL); + virtqueue_kick_async(vq, vb->acked); +} + /* * virtballoon_migratepage - perform the balloon page migration on behalf of * a compaction thread.
(called under page lock) @@ -464,6 +559,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info, { struct virtio_balloon *vb = container_of(vb_dev_info, struct virtio_balloon, vb_dev_info); + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG); unsigned long flags; /* @@ -485,16 +581,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info, vb_dev_info->isolated_pages--; __count_vm_event(BALLOON_MIGRATE); spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags); - vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; - set_page_pfns(vb, vb->pfns, newpage); - tell_host(vb, vb->inflate_vq); - + if (use_sg) { + tell_host_one_page(vb, vb->inflate_vq, newpage); + } else { + vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; + set_page_pfns(vb, vb->pfns, newpage); + tell_host(vb, vb->inflate_vq); + } /* balloon's page migration 2nd step -- deflate "page" */ balloon_page_delete(page); - vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; - set_page_pfns(vb, vb->pfns, page); - tell_host(vb, vb->deflate_vq); - + if (use_sg) { + tell_host_one_page(vb, vb->deflate_vq, page); + } else { + vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; + set_page_pfns(vb, vb->pfns, page); + tell_host(vb, vb->deflate_vq); + } mutex_unlock(&vb->balloon_lock); put_page(page); /* balloon reference */ @@ -553,6 +655,9 @@ static int virtballoon_probe(struct virtio_device *vdev) if (err) goto out_free_vb; + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG)) + xb_init(&vb->page_xb); + vb->nb.notifier_call = virtballoon_oom_notify; vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY; err = register_oom_notifier(&vb->nb); @@ -618,6 +723,7 @@ static void virtballoon_remove(struct virtio_device *vdev) cancel_work_sync(&vb->update_balloon_size_work); cancel_work_sync(&vb->update_balloon_stats_work); + xb_empty(&vb->page_xb); remove_common(vb); #ifdef CONFIG_BALLOON_COMPACTION if (vb->vb_dev_info.inode) @@ -669,6 +775,7 @@ static unsigned int features[] = { VIRTIO_BALLOON_F_MUST_TELL_HOST, VIRTIO_BALLOON_F_STATS_VQ, VIRTIO_BALLOON_F_DEFLATE_ON_OOM, + VIRTIO_BALLOON_F_SG, }; static struct virtio_driver virtio_balloon_driver = { diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 5e1b548..b9d7e10 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -269,7 +269,7 @@ static inline int virtqueue_add(struct virtqueue *_vq, struct vring_virtqueue *vq = to_vvq(_vq); struct scatterlist *sg; struct vring_desc *desc; - unsigned int i, n, avail, descs_used, uninitialized_var(prev), err_idx; + unsigned int i, n, descs_used, uninitialized_var(prev), err_id; int head; bool indirect; @@ -387,10 +387,68 @@ static inline int virtqueue_add(struct virtqueue *_vq, else vq->free_head = i; - /* Store token and indirect buffer state. */ + END_USE(vq); + + return virtqueue_add_chain(_vq, head, indirect, desc, data, ctx); + +unmap_release: + err_id = i; + i = head; + + for (n = 0; n < total_sg; n++) { + if (i == err_id) + break; + vring_unmap_one(vq, &desc[i]); + i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next); + } + + vq->vq.num_free += total_sg; + + if (indirect) + kfree(desc); + + END_USE(vq); + return -EIO; +} + +/** + * virtqueue_add_chain - expose a chain of buffers to the other end + * @_vq: the struct virtqueue we're talking about. + * @head: desc id of the chain head. + * @indirect: set if the chain consists of indirect descs. + * @indir_desc: the first indirect desc. + * @data: the token identifying the chain. + * @ctx: extra context for the token.
+ * + * Caller must ensure we don't call this with other virtqueue operations + * at the same time (except where noted). + * + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO). + */ +int virtqueue_add_chain(struct virtqueue *_vq, + unsigned int head, + bool indirect, + struct vring_desc *indir_desc, + void *data, + void *ctx) +{ + struct vring_virtqueue *vq = to_vvq(_vq); + unsigned int avail; + + /* The desc chain is empty. */ + if (head == VIRTQUEUE_DESC_ID_INIT) + return 0; + + START_USE(vq); + + if (unlikely(vq->broken)) { + END_USE(vq); + return -EIO; + } + vq->desc_state[head].data = data; if (indirect) - vq->desc_state[head].indir_desc = desc; + vq->desc_state[head].indir_desc = indir_desc; if (ctx) vq->desc_state[head].indir_desc = ctx; @@ -415,26 +473,87 @@ static inline int virtqueue_add(struct virtqueue *_vq, virtqueue_kick(_vq); return 0; +} +EXPORT_SYMBOL_GPL(virtqueue_add_chain); -unmap_release: - err_idx = i; - i = head; +/** + * virtqueue_add_chain_desc - add a buffer to a chain using a vring desc + * @_vq: the struct virtqueue we're talking about. + * @addr: address of the buffer to add. + * @len: length of the buffer. + * @head_id: desc id of the chain head. + * @prev_id: desc id of the previous buffer. + * @in: set if the buffer is for the device to write. + * + * Caller must ensure we don't call this with other virtqueue operations + * at the same time (except where noted). + * + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO). + */ +int virtqueue_add_chain_desc(struct virtqueue *_vq, + uint64_t addr, + uint32_t len, + unsigned int *head_id, + unsigned int *prev_id, + bool in) +{ + struct vring_virtqueue *vq = to_vvq(_vq); + struct vring_desc *desc = vq->vring.desc; + uint16_t flags = in ? VRING_DESC_F_WRITE : 0; + unsigned int i; - for (n = 0; n < total_sg; n++) { - if (i == err_idx) - break; - vring_unmap_one(vq, &desc[i]); - i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next); + /* Sanity check */ + if (!_vq || !head_id || !prev_id) + return -EINVAL; +retry: + START_USE(vq); + if (unlikely(vq->broken)) { + END_USE(vq); + return -EIO; } - vq->vq.num_free += total_sg; + if (vq->vq.num_free < 1) { + /* + * If there is no desc available in the vq, kick what is + * already added, and re-start building a new chain for + * the passed sg.
+ */ + if (likely(*head_id != VIRTQUEUE_DESC_ID_INIT)) { + END_USE(vq); + virtqueue_add_chain(_vq, *head_id, 0, NULL, vq, NULL); + virtqueue_kick_sync(_vq); + *head_id = VIRTQUEUE_DESC_ID_INIT; + *prev_id = VIRTQUEUE_DESC_ID_INIT; + goto retry; + } else { + END_USE(vq); + return -ENOSPC; + } + } - if (indirect) - kfree(desc); + i = vq->free_head; + flags &= ~VRING_DESC_F_NEXT; + desc[i].flags = cpu_to_virtio16(_vq->vdev, flags); + desc[i].addr = cpu_to_virtio64(_vq->vdev, addr); + desc[i].len = cpu_to_virtio32(_vq->vdev, len); + + /* Add the desc to the end of the chain */ + if (*prev_id != VIRTQUEUE_DESC_ID_INIT) { + desc[*prev_id].next = cpu_to_virtio16(_vq->vdev, i); + desc[*prev_id].flags |= cpu_to_virtio16(_vq->vdev, + VRING_DESC_F_NEXT); + } + *prev_id = i; + if (*head_id == VIRTQUEUE_DESC_ID_INIT) + *head_id = *prev_id; + vq->vq.num_free--; + vq->free_head = virtio16_to_cpu(_vq->vdev, desc[i].next); END_USE(vq); - return -EIO; + + return 0; } +EXPORT_SYMBOL_GPL(virtqueue_add_chain_desc); /** * virtqueue_add_sgs - expose buffers to other end @@ -627,6 +746,56 @@ bool virtqueue_kick(struct virtqueue *vq) } EXPORT_SYMBOL_GPL(virtqueue_kick); +/** + * virtqueue_kick_sync - update after add_buf and busy wait till update is done + * @vq: the struct virtqueue + * + * After one or more virtqueue_add_* calls, invoke this to kick + * the other side. Busy wait till the other side is done with the update. + * + * Caller must ensure we don't call this with other virtqueue + * operations at the same time (except where noted). + * + * Returns false if kick failed, otherwise true. + */ +bool virtqueue_kick_sync(struct virtqueue *vq) +{ + u32 len; + + if (likely(virtqueue_kick(vq))) { + while (!virtqueue_get_buf(vq, &len) && + !virtqueue_is_broken(vq)) + cpu_relax(); + return true; + } + return false; +} +EXPORT_SYMBOL_GPL(virtqueue_kick_sync); + +/** + * virtqueue_kick_async - update after add_buf and block till update is done + * @vq: the struct virtqueue + * + * After one or more virtqueue_add_* calls, invoke this to kick + * the other side. Block till the other side is done with the update. + * + * Caller must ensure we don't call this with other virtqueue + * operations at the same time (except where noted). + * + * Returns false if kick failed, otherwise true.
+ */ +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq) +{ + u32 len; + + if (likely(virtqueue_kick(vq))) { + wait_event(wq, virtqueue_get_buf(vq, &len)); + return true; + } + return false; +} +EXPORT_SYMBOL_GPL(virtqueue_kick_async); + static void detach_buf(struct vring_virtqueue *vq, unsigned int head, void **ctx) { diff --git a/include/linux/virtio.h b/include/linux/virtio.h index 28b0e96..9f27101 100644 --- a/include/linux/virtio.h +++ b/include/linux/virtio.h @@ -57,8 +57,28 @@ int virtqueue_add_sgs(struct virtqueue *vq, void *data, gfp_t gfp); +/* A desc with this init id is treated as an invalid desc */ +#define VIRTQUEUE_DESC_ID_INIT UINT_MAX +int virtqueue_add_chain_desc(struct virtqueue *_vq, + uint64_t addr, + uint32_t len, + unsigned int *head_id, + unsigned int *prev_id, + bool in); + +int virtqueue_add_chain(struct virtqueue *_vq, + unsigned int head, + bool indirect, + struct vring_desc *indirect_desc, + void *data, + void *ctx); + bool virtqueue_kick(struct virtqueue *vq); +bool virtqueue_kick_sync(struct virtqueue *vq); + +bool virtqueue_kick_async(struct virtqueue *vq, wait_queue_head_t wq); + bool virtqueue_kick_prepare(struct virtqueue *vq); bool virtqueue_notify(struct virtqueue *vq); diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index 343d7dd..37780a7 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -34,6 +34,7 @@ #define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */ #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */ #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */ +#define VIRTIO_BALLOON_F_SG 3 /* Use sg instead of PFN lists */ /* Size of a PFN in the balloon interface. */ #define VIRTIO_BALLOON_PFN_SHIFT 12
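Putting the pieces of the patch together, a caller of the proposed chaining API follows the pattern below (mirroring tell_host_sgs() above). The addr/len arrays, token, and wait queue are placeholders, and unlike this sketch, real callers should check the return codes, as noted earlier in the thread:

static void report_chunks(struct virtqueue *vq, wait_queue_head_t wq,
			  void *token, uint64_t addr[], uint32_t len[], int n)
{
	unsigned int head = VIRTQUEUE_DESC_ID_INIT;
	unsigned int prev = VIRTQUEUE_DESC_ID_INIT;
	int i;

	for (i = 0; i < n; i++)
		/* in == 0: device-readable; kicks internally if the vq fills up */
		virtqueue_add_chain_desc(vq, addr[i], len[i], &head, &prev, 0);

	if (head != VIRTQUEUE_DESC_ID_INIT) {
		/* expose the whole chain, then sleep until the host consumes it */
		virtqueue_add_chain(vq, head, 0, NULL, token, NULL);
		virtqueue_kick_async(vq, wq);
	}
}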