diff mbox

[net-next,RFC,V5,4/5] virtio_net: multiqueue support

Message ID 1341484194-8108-5-git-send-email-jasowang@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Jason Wang July 5, 2012, 10:29 a.m. UTC
This patch converts virtio_net to a multi queue device. After negotiated
VIRTIO_NET_F_MULTIQUEUE feature, the virtio device has many tx/rx queue pairs,
and driver could read the number from config space.

The driver expects the number of rx/tx queue paris is equal to the number of
vcpus. To maximize the performance under this per-cpu rx/tx queue pairs, some
optimization were introduced:

- Txq selection is based on the processor id in order to avoid contending a lock
  whose owner may exits to host.
- Since the txq/txq were per-cpu, affinity hint were set to the cpu that owns
  the queue pairs.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c   |  645 ++++++++++++++++++++++++++++++-------------
 include/linux/virtio_net.h |    2 +
 2 files changed, 452 insertions(+), 195 deletions(-)

Comments

Amos Kong July 5, 2012, 8:02 p.m. UTC | #1
On 07/05/2012 06:29 PM, Jason Wang wrote:
> This patch converts virtio_net to a multi queue device. After negotiated
> VIRTIO_NET_F_MULTIQUEUE feature, the virtio device has many tx/rx queue pairs,
> and driver could read the number from config space.
> 
> The driver expects the number of rx/tx queue paris is equal to the number of
> vcpus. To maximize the performance under this per-cpu rx/tx queue pairs, some
> optimization were introduced:
> 
> - Txq selection is based on the processor id in order to avoid contending a lock
>   whose owner may exits to host.
> - Since the txq/txq were per-cpu, affinity hint were set to the cpu that owns
>   the queue pairs.
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---

...

>  
>  static int virtnet_probe(struct virtio_device *vdev)
>  {
> -	int err;
> +	int i, err;
>  	struct net_device *dev;
>  	struct virtnet_info *vi;
> +	u16 num_queues, num_queue_pairs;
> +
> +	/* Find if host supports multiqueue virtio_net device */
> +	err = virtio_config_val(vdev, VIRTIO_NET_F_MULTIQUEUE,
> +				offsetof(struct virtio_net_config,
> +				num_queues), &num_queues);
> +
> +	/* We need atleast 2 queue's */


s/atleast/at least/


> +	if (err || num_queues < 2)
> +		num_queues = 2;
> +	if (num_queues > MAX_QUEUES * 2)
> +		num_queues = MAX_QUEUES;

                num_queues = MAX_QUEUES * 2;

MAX_QUEUES is the limitation of RX or TX.

> +
> +	num_queue_pairs = num_queues / 2;

...
Jason Wang July 6, 2012, 7:45 a.m. UTC | #2
On 07/06/2012 04:02 AM, Amos Kong wrote:
> On 07/05/2012 06:29 PM, Jason Wang wrote:
>> This patch converts virtio_net to a multi queue device. After negotiated
>> VIRTIO_NET_F_MULTIQUEUE feature, the virtio device has many tx/rx queue pairs,
>> and driver could read the number from config space.
>>
>> The driver expects the number of rx/tx queue paris is equal to the number of
>> vcpus. To maximize the performance under this per-cpu rx/tx queue pairs, some
>> optimization were introduced:
>>
>> - Txq selection is based on the processor id in order to avoid contending a lock
>>    whose owner may exits to host.
>> - Since the txq/txq were per-cpu, affinity hint were set to the cpu that owns
>>    the queue pairs.
>>
>> Signed-off-by: Krishna Kumar<krkumar2@in.ibm.com>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>> ---
> ...
>
>>
>>   static int virtnet_probe(struct virtio_device *vdev)
>>   {
>> -	int err;
>> +	int i, err;
>>   	struct net_device *dev;
>>   	struct virtnet_info *vi;
>> +	u16 num_queues, num_queue_pairs;
>> +
>> +	/* Find if host supports multiqueue virtio_net device */
>> +	err = virtio_config_val(vdev, VIRTIO_NET_F_MULTIQUEUE,
>> +				offsetof(struct virtio_net_config,
>> +				num_queues),&num_queues);
>> +
>> +	/* We need atleast 2 queue's */
>
> s/atleast/at least/
>
>
>> +	if (err || num_queues<  2)
>> +		num_queues = 2;
>> +	if (num_queues>  MAX_QUEUES * 2)
>> +		num_queues = MAX_QUEUES;
>                  num_queues = MAX_QUEUES * 2;
>
> MAX_QUEUES is the limitation of RX or TX.

Right, it's a typo, thanks.
>> +
>> +	num_queue_pairs = num_queues / 2;
> ...
>

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin July 20, 2012, 1:40 p.m. UTC | #3
On Thu, Jul 05, 2012 at 06:29:53PM +0800, Jason Wang wrote:
> This patch converts virtio_net to a multi queue device. After negotiated
> VIRTIO_NET_F_MULTIQUEUE feature, the virtio device has many tx/rx queue pairs,
> and driver could read the number from config space.
> 
> The driver expects the number of rx/tx queue paris is equal to the number of
> vcpus. To maximize the performance under this per-cpu rx/tx queue pairs, some
> optimization were introduced:
> 
> - Txq selection is based on the processor id in order to avoid contending a lock
>   whose owner may exits to host.
> - Since the txq/txq were per-cpu, affinity hint were set to the cpu that owns
>   the queue pairs.
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Overall fine. I think it is best to smash the following patch
into this one, so that default behavior does not
jump to mq then back. some comments below: mostly nits, and a minor bug.

If you are worried the patch is too big, it can be split
differently
	- rework to use send_queue/receive_queue structures, no
	  functional changes.
	- add multiqueue

but this is not a must.

> ---
>  drivers/net/virtio_net.c   |  645 ++++++++++++++++++++++++++++++-------------
>  include/linux/virtio_net.h |    2 +
>  2 files changed, 452 insertions(+), 195 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 1db445b..7410187 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -26,6 +26,7 @@
>  #include <linux/scatterlist.h>
>  #include <linux/if_vlan.h>
>  #include <linux/slab.h>
> +#include <linux/interrupt.h>
>  
>  static int napi_weight = 128;
>  module_param(napi_weight, int, 0444);
> @@ -41,6 +42,8 @@ module_param(gso, bool, 0444);
>  #define VIRTNET_SEND_COMMAND_SG_MAX    2
>  #define VIRTNET_DRIVER_VERSION "1.0.0"
>  
> +#define MAX_QUEUES 256
> +
>  struct virtnet_stats {
>  	struct u64_stats_sync tx_syncp;
>  	struct u64_stats_sync rx_syncp;

Would be a bit better not to have artificial limits like that.
Maybe allocate arrays at probe time, then we can
take whatever the device gives us?

> @@ -51,43 +54,69 @@ struct virtnet_stats {
>  	u64 rx_packets;
>  };
>  
> -struct virtnet_info {
> -	struct virtio_device *vdev;
> -	struct virtqueue *rvq, *svq, *cvq;
> -	struct net_device *dev;
> +/* Internal representation of a send virtqueue */
> +struct send_queue {
> +	/* Virtqueue associated with this send _queue */
> +	struct virtqueue *vq;
> +
> +	/* TX: fragments + linear part + virtio header */
> +	struct scatterlist sg[MAX_SKB_FRAGS + 2];
> +};
> +
> +/* Internal representation of a receive virtqueue */
> +struct receive_queue {
> +	/* Virtqueue associated with this receive_queue */
> +	struct virtqueue *vq;
> +
> +	/* Back pointer to the virtnet_info */
> +	struct virtnet_info *vi;
> +
>  	struct napi_struct napi;
> -	unsigned int status;
>  
>  	/* Number of input buffers, and max we've ever had. */
>  	unsigned int num, max;
>  
> +	/* Work struct for refilling if we run low on memory. */
> +	struct delayed_work refill;
> +
> +	/* Chain pages by the private ptr. */
> +	struct page *pages;
> +
> +	/* RX: fragments + linear part + virtio header */
> +	struct scatterlist sg[MAX_SKB_FRAGS + 2];
> +};
> +
> +struct virtnet_info {
> +	u16 num_queue_pairs;		/* # of RX/TX vq pairs */
> +
> +	struct send_queue *sq[MAX_QUEUES] ____cacheline_aligned_in_smp;
> +	struct receive_queue *rq[MAX_QUEUES] ____cacheline_aligned_in_smp;

The assumption is a tx/rx pair is handled on the same cpu, yes?
If yes maybe make it a single array to improve cache locality
a bit?
	struct queue_pair {
		struct send_queue sq;
		struct receive_queue rq;
	};

> +	struct virtqueue *cvq;
> +
> +	struct virtio_device *vdev;
> +	struct net_device *dev;
> +	unsigned int status;
> +
>  	/* I like... big packets and I cannot lie! */
>  	bool big_packets;
>  
>  	/* Host will merge rx buffers for big packets (shake it! shake it!) */
>  	bool mergeable_rx_bufs;
>  
> +	/* Has control virtqueue */
> +	bool has_cvq;
> +

won't checking (cvq != NULL) be enough?

>  	/* enable config space updates */
>  	bool config_enable;
>  
>  	/* Active statistics */
>  	struct virtnet_stats __percpu *stats;
>  
> -	/* Work struct for refilling if we run low on memory. */
> -	struct delayed_work refill;
> -
>  	/* Work struct for config space updates */
>  	struct work_struct config_work;
>  
>  	/* Lock for config space updates */
>  	struct mutex config_lock;
> -
> -	/* Chain pages by the private ptr. */
> -	struct page *pages;
> -
> -	/* fragments + linear part + virtio header */
> -	struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
> -	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
>  };
>  
>  struct skb_vnet_hdr {
> @@ -108,6 +137,22 @@ struct padded_vnet_hdr {
>  	char padding[6];
>  };
>  
> +static inline int txq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
> +{
> +	int ret = virtqueue_get_queue_index(vq);
> +
> +	/* skip ctrl vq */
> +	if (vi->has_cvq)
> +		return (ret - 1) / 2;
> +	else
> +		return ret / 2;
> +}
> +
> +static inline int rxq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
> +{
> +	return virtqueue_get_queue_index(vq) / 2;
> +}
> +
>  static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
>  {
>  	return (struct skb_vnet_hdr *)skb->cb;
> @@ -117,22 +162,22 @@ static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
>   * private is used to chain pages for big packets, put the whole
>   * most recent used list in the beginning for reuse
>   */
> -static void give_pages(struct virtnet_info *vi, struct page *page)
> +static void give_pages(struct receive_queue *rq, struct page *page)
>  {
>  	struct page *end;
>  
>  	/* Find end of list, sew whole thing into vi->pages. */
>  	for (end = page; end->private; end = (struct page *)end->private);
> -	end->private = (unsigned long)vi->pages;
> -	vi->pages = page;
> +	end->private = (unsigned long)rq->pages;
> +	rq->pages = page;
>  }
>  
> -static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
> +static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
>  {
> -	struct page *p = vi->pages;
> +	struct page *p = rq->pages;
>  
>  	if (p) {
> -		vi->pages = (struct page *)p->private;
> +		rq->pages = (struct page *)p->private;
>  		/* clear private here, it is used to chain pages */
>  		p->private = 0;
>  	} else
> @@ -140,15 +185,15 @@ static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
>  	return p;
>  }
>  
> -static void skb_xmit_done(struct virtqueue *svq)
> +static void skb_xmit_done(struct virtqueue *vq)
>  {
> -	struct virtnet_info *vi = svq->vdev->priv;
> +	struct virtnet_info *vi = vq->vdev->priv;
>  
>  	/* Suppress further interrupts. */
> -	virtqueue_disable_cb(svq);
> +	virtqueue_disable_cb(vq);
>  
>  	/* We were probably waiting for more output buffers. */
> -	netif_wake_queue(vi->dev);
> +	netif_wake_subqueue(vi->dev, txq_get_qnum(vi, vq));
>  }
>  
>  static void set_skb_frag(struct sk_buff *skb, struct page *page,
> @@ -167,9 +212,10 @@ static void set_skb_frag(struct sk_buff *skb, struct page *page,
>  }
>  
>  /* Called from bottom half context */
> -static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> +static struct sk_buff *page_to_skb(struct receive_queue *rq,
>  				   struct page *page, unsigned int len)
>  {
> +	struct virtnet_info *vi = rq->vi;
>  	struct sk_buff *skb;
>  	struct skb_vnet_hdr *hdr;
>  	unsigned int copy, hdr_len, offset;
> @@ -225,12 +271,12 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>  	}
>  
>  	if (page)
> -		give_pages(vi, page);
> +		give_pages(rq, page);
>  
>  	return skb;
>  }
>  
> -static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
> +static int receive_mergeable(struct receive_queue *rq, struct sk_buff *skb)
>  {
>  	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
>  	struct page *page;
> @@ -244,7 +290,7 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
>  			skb->dev->stats.rx_length_errors++;
>  			return -EINVAL;
>  		}
> -		page = virtqueue_get_buf(vi->rvq, &len);
> +		page = virtqueue_get_buf(rq->vq, &len);
>  		if (!page) {
>  			pr_debug("%s: rx error: %d buffers missing\n",
>  				 skb->dev->name, hdr->mhdr.num_buffers);
> @@ -257,13 +303,14 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
>  
>  		set_skb_frag(skb, page, 0, &len);
>  
> -		--vi->num;
> +		--rq->num;
>  	}
>  	return 0;
>  }
>  
> -static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
> +static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
>  {
> +	struct net_device *dev = rq->vi->dev;
>  	struct virtnet_info *vi = netdev_priv(dev);
>  	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
>  	struct sk_buff *skb;
> @@ -274,7 +321,7 @@ static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
>  		pr_debug("%s: short packet %i\n", dev->name, len);
>  		dev->stats.rx_length_errors++;
>  		if (vi->mergeable_rx_bufs || vi->big_packets)
> -			give_pages(vi, buf);
> +			give_pages(rq, buf);
>  		else
>  			dev_kfree_skb(buf);
>  		return;
> @@ -286,14 +333,14 @@ static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
>  		skb_trim(skb, len);
>  	} else {
>  		page = buf;
> -		skb = page_to_skb(vi, page, len);
> +		skb = page_to_skb(rq, page, len);
>  		if (unlikely(!skb)) {
>  			dev->stats.rx_dropped++;
> -			give_pages(vi, page);
> +			give_pages(rq, page);
>  			return;
>  		}
>  		if (vi->mergeable_rx_bufs)
> -			if (receive_mergeable(vi, skb)) {
> +			if (receive_mergeable(rq, skb)) {
>  				dev_kfree_skb(skb);
>  				return;
>  			}
> @@ -363,90 +410,91 @@ frame_err:
>  	dev_kfree_skb(skb);
>  }
>  
> -static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
> +static int add_recvbuf_small(struct receive_queue *rq, gfp_t gfp)
>  {
>  	struct sk_buff *skb;
>  	struct skb_vnet_hdr *hdr;
>  	int err;
>  
> -	skb = __netdev_alloc_skb_ip_align(vi->dev, MAX_PACKET_LEN, gfp);
> +	skb = __netdev_alloc_skb_ip_align(rq->vi->dev, MAX_PACKET_LEN, gfp);
>  	if (unlikely(!skb))
>  		return -ENOMEM;
>  
>  	skb_put(skb, MAX_PACKET_LEN);
>  
>  	hdr = skb_vnet_hdr(skb);
> -	sg_set_buf(vi->rx_sg, &hdr->hdr, sizeof hdr->hdr);
> +	sg_set_buf(rq->sg, &hdr->hdr, sizeof hdr->hdr);
> +
> +	skb_to_sgvec(skb, rq->sg + 1, 0, skb->len);
>  
> -	skb_to_sgvec(skb, vi->rx_sg + 1, 0, skb->len);
> +	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 2, skb, gfp);
>  
> -	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 2, skb, gfp);
>  	if (err < 0)
>  		dev_kfree_skb(skb);
>  
>  	return err;
>  }
>  
> -static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
> +static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
>  {
>  	struct page *first, *list = NULL;
>  	char *p;
>  	int i, err, offset;
>  
> -	/* page in vi->rx_sg[MAX_SKB_FRAGS + 1] is list tail */
> +	/* page in rq->sg[MAX_SKB_FRAGS + 1] is list tail */
>  	for (i = MAX_SKB_FRAGS + 1; i > 1; --i) {
> -		first = get_a_page(vi, gfp);
> +		first = get_a_page(rq, gfp);
>  		if (!first) {
>  			if (list)
> -				give_pages(vi, list);
> +				give_pages(rq, list);
>  			return -ENOMEM;
>  		}
> -		sg_set_buf(&vi->rx_sg[i], page_address(first), PAGE_SIZE);
> +		sg_set_buf(&rq->sg[i], page_address(first), PAGE_SIZE);
>  
>  		/* chain new page in list head to match sg */
>  		first->private = (unsigned long)list;
>  		list = first;
>  	}
>  
> -	first = get_a_page(vi, gfp);
> +	first = get_a_page(rq, gfp);
>  	if (!first) {
> -		give_pages(vi, list);
> +		give_pages(rq, list);
>  		return -ENOMEM;
>  	}
>  	p = page_address(first);
>  
> -	/* vi->rx_sg[0], vi->rx_sg[1] share the same page */
> -	/* a separated vi->rx_sg[0] for virtio_net_hdr only due to QEMU bug */
> -	sg_set_buf(&vi->rx_sg[0], p, sizeof(struct virtio_net_hdr));
> +	/* rq->sg[0], rq->sg[1] share the same page */
> +	/* a separated rq->sg[0] for virtio_net_hdr only due to QEMU bug */
> +	sg_set_buf(&rq->sg[0], p, sizeof(struct virtio_net_hdr));
>  
> -	/* vi->rx_sg[1] for data packet, from offset */
> +	/* rq->sg[1] for data packet, from offset */
>  	offset = sizeof(struct padded_vnet_hdr);
> -	sg_set_buf(&vi->rx_sg[1], p + offset, PAGE_SIZE - offset);
> +	sg_set_buf(&rq->sg[1], p + offset, PAGE_SIZE - offset);
>  
>  	/* chain first in list head */
>  	first->private = (unsigned long)list;
> -	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
> +	err = virtqueue_add_buf(rq->vq, rq->sg, 0, MAX_SKB_FRAGS + 2,
>  				first, gfp);
>  	if (err < 0)
> -		give_pages(vi, first);
> +		give_pages(rq, first);
>  
>  	return err;
>  }
>  
> -static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
> +static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
>  {
>  	struct page *page;
>  	int err;
>  
> -	page = get_a_page(vi, gfp);
> +	page = get_a_page(rq, gfp);
>  	if (!page)
>  		return -ENOMEM;
>  
> -	sg_init_one(vi->rx_sg, page_address(page), PAGE_SIZE);
> +	sg_init_one(rq->sg, page_address(page), PAGE_SIZE);
>  
> -	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 1, page, gfp);
> +	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 1, page, gfp);
>  	if (err < 0)
> -		give_pages(vi, page);
> +		give_pages(rq, page);
>  
>  	return err;
>  }
> @@ -458,97 +506,104 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
>   * before we're receiving packets, or from refill_work which is
>   * careful to disable receiving (using napi_disable).
>   */
> -static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
> +static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
>  {
> +	struct virtnet_info *vi = rq->vi;
>  	int err;
>  	bool oom;
>  
>  	do {
>  		if (vi->mergeable_rx_bufs)
> -			err = add_recvbuf_mergeable(vi, gfp);
> +			err = add_recvbuf_mergeable(rq, gfp);
>  		else if (vi->big_packets)
> -			err = add_recvbuf_big(vi, gfp);
> +			err = add_recvbuf_big(rq, gfp);
>  		else
> -			err = add_recvbuf_small(vi, gfp);
> +			err = add_recvbuf_small(rq, gfp);
>  
>  		oom = err == -ENOMEM;
>  		if (err < 0)
>  			break;
> -		++vi->num;
> +		++rq->num;
>  	} while (err > 0);
> -	if (unlikely(vi->num > vi->max))
> -		vi->max = vi->num;
> -	virtqueue_kick(vi->rvq);
> +	if (unlikely(rq->num > rq->max))
> +		rq->max = rq->num;
> +	virtqueue_kick(rq->vq);
>  	return !oom;
>  }
>  
> -static void skb_recv_done(struct virtqueue *rvq)
> +static void skb_recv_done(struct virtqueue *vq)
>  {
> -	struct virtnet_info *vi = rvq->vdev->priv;
> +	struct virtnet_info *vi = vq->vdev->priv;
> +	struct napi_struct *napi = &vi->rq[rxq_get_qnum(vi, vq)]->napi;
> +
>  	/* Schedule NAPI, Suppress further interrupts if successful. */
> -	if (napi_schedule_prep(&vi->napi)) {
> -		virtqueue_disable_cb(rvq);
> -		__napi_schedule(&vi->napi);
> +	if (napi_schedule_prep(napi)) {
> +		virtqueue_disable_cb(vq);
> +		__napi_schedule(napi);
>  	}
>  }
>  
> -static void virtnet_napi_enable(struct virtnet_info *vi)
> +static void virtnet_napi_enable(struct receive_queue *rq)
>  {
> -	napi_enable(&vi->napi);
> +	napi_enable(&rq->napi);
>  
>  	/* If all buffers were filled by other side before we napi_enabled, we
>  	 * won't get another interrupt, so process any outstanding packets
>  	 * now.  virtnet_poll wants re-enable the queue, so we disable here.
>  	 * We synchronize against interrupts via NAPI_STATE_SCHED */
> -	if (napi_schedule_prep(&vi->napi)) {
> -		virtqueue_disable_cb(vi->rvq);
> +	if (napi_schedule_prep(&rq->napi)) {
> +		virtqueue_disable_cb(rq->vq);
>  		local_bh_disable();
> -		__napi_schedule(&vi->napi);
> +		__napi_schedule(&rq->napi);
>  		local_bh_enable();
>  	}
>  }
>  
>  static void refill_work(struct work_struct *work)
>  {
> -	struct virtnet_info *vi;
> +	struct napi_struct *napi;
> +	struct receive_queue *rq;
>  	bool still_empty;
>  
> -	vi = container_of(work, struct virtnet_info, refill.work);
> -	napi_disable(&vi->napi);
> -	still_empty = !try_fill_recv(vi, GFP_KERNEL);
> -	virtnet_napi_enable(vi);
> +	rq = container_of(work, struct receive_queue, refill.work);
> +	napi = &rq->napi;
> +
> +	napi_disable(napi);
> +	still_empty = !try_fill_recv(rq, GFP_KERNEL);
> +	virtnet_napi_enable(rq);
>  
>  	/* In theory, this can happen: if we don't get any buffers in
>  	 * we will *never* try to fill again. */
>  	if (still_empty)
> -		queue_delayed_work(system_nrt_wq, &vi->refill, HZ/2);
> +		queue_delayed_work(system_nrt_wq, &rq->refill, HZ/2);
>  }
>  
>  static int virtnet_poll(struct napi_struct *napi, int budget)
>  {
> -	struct virtnet_info *vi = container_of(napi, struct virtnet_info, napi);
> +	struct receive_queue *rq = container_of(napi, struct receive_queue,
> +						napi);
>  	void *buf;
>  	unsigned int len, received = 0;
>  
>  again:
>  	while (received < budget &&
> -	       (buf = virtqueue_get_buf(vi->rvq, &len)) != NULL) {
> -		receive_buf(vi->dev, buf, len);
> -		--vi->num;
> +	       (buf = virtqueue_get_buf(rq->vq, &len)) != NULL) {
> +		receive_buf(rq, buf, len);
> +		--rq->num;
>  		received++;
>  	}
>  
> -	if (vi->num < vi->max / 2) {
> -		if (!try_fill_recv(vi, GFP_ATOMIC))
> -			queue_delayed_work(system_nrt_wq, &vi->refill, 0);
> +	if (rq->num < rq->max / 2) {
> +		if (!try_fill_recv(rq, GFP_ATOMIC))
> +			queue_delayed_work(system_nrt_wq, &rq->refill, 0);
>  	}
>  
>  	/* Out of packets? */
>  	if (received < budget) {
>  		napi_complete(napi);
> -		if (unlikely(!virtqueue_enable_cb(vi->rvq)) &&
> +		if (unlikely(!virtqueue_enable_cb(rq->vq)) &&
>  		    napi_schedule_prep(napi)) {
> -			virtqueue_disable_cb(vi->rvq);
> +			virtqueue_disable_cb(rq->vq);
>  			__napi_schedule(napi);
>  			goto again;
>  		}
> @@ -557,13 +612,14 @@ again:
>  	return received;
>  }
>  
> -static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
> +static unsigned int free_old_xmit_skbs(struct virtnet_info *vi,
> +				       struct virtqueue *vq)
>  {
>  	struct sk_buff *skb;
>  	unsigned int len, tot_sgs = 0;
>  	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
>  
> -	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> +	while ((skb = virtqueue_get_buf(vq, &len)) != NULL) {
>  		pr_debug("Sent skb %p\n", skb);
>  
>  		u64_stats_update_begin(&stats->tx_syncp);
> @@ -577,7 +633,8 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
>  	return tot_sgs;
>  }
>  
> -static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
> +static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb,
> +		    struct virtqueue *vq, struct scatterlist *sg)
>  {
>  	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
>  	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
> @@ -615,44 +672,47 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
>  
>  	/* Encode metadata header at front. */
>  	if (vi->mergeable_rx_bufs)
> -		sg_set_buf(vi->tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
> +		sg_set_buf(sg, &hdr->mhdr, sizeof hdr->mhdr);
>  	else
> -		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
> +		sg_set_buf(sg, &hdr->hdr, sizeof hdr->hdr);
>  
> -	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
> -	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
> +	hdr->num_sg = skb_to_sgvec(skb, sg + 1, 0, skb->len) + 1;
> +	return virtqueue_add_buf(vq, sg, hdr->num_sg,
>  				 0, skb, GFP_ATOMIC);
>  }
>  
>  static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
> +	int qnum = skb_get_queue_mapping(skb);
> +	struct virtqueue *vq = vi->sq[qnum]->vq;
>  	int capacity;
>  
>  	/* Free up any pending old buffers before queueing new ones. */
> -	free_old_xmit_skbs(vi);
> +	free_old_xmit_skbs(vi, vq);
>  
>  	/* Try to transmit */
> -	capacity = xmit_skb(vi, skb);
> +	capacity = xmit_skb(vi, skb, vq, vi->sq[qnum]->sg);
>  
>  	/* This can happen with OOM and indirect buffers. */
>  	if (unlikely(capacity < 0)) {
>  		if (likely(capacity == -ENOMEM)) {
>  			if (net_ratelimit())
>  				dev_warn(&dev->dev,
> -					 "TX queue failure: out of memory\n");
> +					"TXQ (%d) failure: out of memory\n",
> +					qnum);
>  		} else {
>  			dev->stats.tx_fifo_errors++;
>  			if (net_ratelimit())
>  				dev_warn(&dev->dev,
> -					 "Unexpected TX queue failure: %d\n",
> -					 capacity);
> +					"Unexpected TXQ (%d) failure: %d\n",
> +					qnum, capacity);
>  		}
>  		dev->stats.tx_dropped++;
>  		kfree_skb(skb);
>  		return NETDEV_TX_OK;
>  	}
> -	virtqueue_kick(vi->svq);
> +	virtqueue_kick(vq);
>  
>  	/* Don't wait up for transmitted skbs to be freed. */
>  	skb_orphan(skb);
> @@ -661,13 +721,13 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	/* Apparently nice girls don't return TX_BUSY; stop the queue
>  	 * before it gets out of hand.  Naturally, this wastes entries. */
>  	if (capacity < 2+MAX_SKB_FRAGS) {
> -		netif_stop_queue(dev);
> -		if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
> +		netif_stop_subqueue(dev, qnum);
> +		if (unlikely(!virtqueue_enable_cb_delayed(vq))) {
>  			/* More just got used, free them then recheck. */
> -			capacity += free_old_xmit_skbs(vi);
> +			capacity += free_old_xmit_skbs(vi, vq);
>  			if (capacity >= 2+MAX_SKB_FRAGS) {
> -				netif_start_queue(dev);
> -				virtqueue_disable_cb(vi->svq);
> +				netif_start_subqueue(dev, qnum);
> +				virtqueue_disable_cb(vq);
>  			}
>  		}
>  	}
> @@ -700,7 +760,8 @@ static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
>  	unsigned int start;
>  
>  	for_each_possible_cpu(cpu) {
> -		struct virtnet_stats *stats = per_cpu_ptr(vi->stats, cpu);
> +		struct virtnet_stats __percpu *stats
> +			= per_cpu_ptr(vi->stats, cpu);
>  		u64 tpackets, tbytes, rpackets, rbytes;
>  
>  		do {
> @@ -734,20 +795,26 @@ static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
>  static void virtnet_netpoll(struct net_device *dev)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
> +	int i;
>  
> -	napi_schedule(&vi->napi);
> +	for (i = 0; i < vi->num_queue_pairs; i++)
> +		napi_schedule(&vi->rq[i]->napi);
>  }
>  #endif
>  
>  static int virtnet_open(struct net_device *dev)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
> +	int i;
>  
> -	/* Make sure we have some buffers: if oom use wq. */
> -	if (!try_fill_recv(vi, GFP_KERNEL))
> -		queue_delayed_work(system_nrt_wq, &vi->refill, 0);
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		/* Make sure we have some buffers: if oom use wq. */
> +		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
> +			queue_delayed_work(system_nrt_wq,
> +					   &vi->rq[i]->refill, 0);
> +		virtnet_napi_enable(vi->rq[i]);
> +	}
>  
> -	virtnet_napi_enable(vi);
>  	return 0;
>  }
>  
> @@ -809,10 +876,13 @@ static void virtnet_ack_link_announce(struct virtnet_info *vi)
>  static int virtnet_close(struct net_device *dev)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
> +	int i;
>  
>  	/* Make sure refill_work doesn't re-enable napi! */
> -	cancel_delayed_work_sync(&vi->refill);
> -	napi_disable(&vi->napi);
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		cancel_delayed_work_sync(&vi->rq[i]->refill);
> +		napi_disable(&vi->rq[i]->napi);
> +	}
>  
>  	return 0;
>  }
> @@ -924,11 +994,10 @@ static void virtnet_get_ringparam(struct net_device *dev,
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
>  
> -	ring->rx_max_pending = virtqueue_get_vring_size(vi->rvq);
> -	ring->tx_max_pending = virtqueue_get_vring_size(vi->svq);
> +	ring->rx_max_pending = virtqueue_get_vring_size(vi->rq[0]->vq);
> +	ring->tx_max_pending = virtqueue_get_vring_size(vi->sq[0]->vq);
>  	ring->rx_pending = ring->rx_max_pending;
>  	ring->tx_pending = ring->tx_max_pending;
> -
>  }
>  
>  
> @@ -961,6 +1030,19 @@ static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
>  	return 0;
>  }
>  
> +/* To avoid contending a lock hold by a vcpu who would exit to host, select the
> + * txq based on the processor id.
> + */
> +static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb)
> +{
> +	int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
> +		  smp_processor_id();
> +
> +	while (unlikely(txq >= dev->real_num_tx_queues))
> +		txq -= dev->real_num_tx_queues;
> +	return txq;
> +}
> +
>  static const struct net_device_ops virtnet_netdev = {
>  	.ndo_open            = virtnet_open,
>  	.ndo_stop   	     = virtnet_close,
> @@ -972,6 +1054,7 @@ static const struct net_device_ops virtnet_netdev = {
>  	.ndo_get_stats64     = virtnet_stats,
>  	.ndo_vlan_rx_add_vid = virtnet_vlan_rx_add_vid,
>  	.ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
> +	.ndo_select_queue     = virtnet_select_queue,
>  #ifdef CONFIG_NET_POLL_CONTROLLER
>  	.ndo_poll_controller = virtnet_netpoll,
>  #endif
> @@ -1007,10 +1090,10 @@ static void virtnet_config_changed_work(struct work_struct *work)
>  
>  	if (vi->status & VIRTIO_NET_S_LINK_UP) {
>  		netif_carrier_on(vi->dev);
> -		netif_wake_queue(vi->dev);
> +		netif_tx_wake_all_queues(vi->dev);
>  	} else {
>  		netif_carrier_off(vi->dev);
> -		netif_stop_queue(vi->dev);
> +		netif_tx_stop_all_queues(vi->dev);
>  	}
>  done:
>  	mutex_unlock(&vi->config_lock);
> @@ -1023,41 +1106,217 @@ static void virtnet_config_changed(struct virtio_device *vdev)
>  	queue_work(system_nrt_wq, &vi->config_work);
>  }
>  
> -static int init_vqs(struct virtnet_info *vi)
> +static void free_receive_bufs(struct virtnet_info *vi)
> +{
> +	int i;
> +
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		while (vi->rq[i]->pages)
> +			__free_pages(get_a_page(vi->rq[i], GFP_KERNEL), 0);
> +	}
> +}
> +
> +/* Free memory allocated for send and receive queues */
> +static void virtnet_free_queues(struct virtnet_info *vi)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { skb_recv_done, skb_xmit_done, NULL};
> -	const char *names[] = { "input", "output", "control" };
> -	int nvqs, err;
> +	int i;
>  
> -	/* We expect two virtqueues, receive then send,
> -	 * and optionally control. */
> -	nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ? 3 : 2;
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		kfree(vi->rq[i]);
> +		vi->rq[i] = NULL;
> +		kfree(vi->sq[i]);
> +		vi->sq[i] = NULL;
> +	}
> +}
>  
> -	err = vi->vdev->config->find_vqs(vi->vdev, nvqs, vqs, callbacks, names);
> -	if (err)
> -		return err;
> +static void free_unused_bufs(struct virtnet_info *vi)
> +{
> +	void *buf;
> +	int i;
> +
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		struct virtqueue *vq = vi->sq[i]->vq;
> +
> +		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL)
> +			dev_kfree_skb(buf);
> +	}
> +
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		struct virtqueue *vq = vi->rq[i]->vq;
> +
> +		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
> +			if (vi->mergeable_rx_bufs || vi->big_packets)
> +				give_pages(vi->rq[i], buf);
> +			else
> +				dev_kfree_skb(buf);
> +			--vi->rq[i]->num;
> +		}
> +		BUG_ON(vi->rq[i]->num != 0);
> +	}
> +}
> +
> +static void virtnet_set_affinity(struct virtnet_info *vi, bool set)
> +{
> +	int i;
> +
> +	if (vi->num_queue_pairs == 1)
> +		return;
> +
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		int cpu = set ? i : -1;
> +		virtqueue_set_affinity(vi->rq[i]->vq, cpu);
> +		virtqueue_set_affinity(vi->sq[i]->vq, cpu);
> +	}
> +	return;
> +}
> +
> +static void virtnet_del_vqs(struct virtnet_info *vi)
> +{
> +	struct virtio_device *vdev = vi->vdev;
> +
> +	virtnet_set_affinity(vi, false);
> +
> +	vdev->config->del_vqs(vdev);
> +
> +	virtnet_free_queues(vi);
> +}
> +
> +static int virtnet_find_vqs(struct virtnet_info *vi)
> +{
> +	vq_callback_t **callbacks;
> +	struct virtqueue **vqs;
> +	int ret = -ENOMEM;
> +	int i, total_vqs;
> +	char **names;
>  
> -	vi->rvq = vqs[0];
> -	vi->svq = vqs[1];
> +	/*
> +	 * We expect 1 RX virtqueue followed by 1 TX virtqueue, followd by
> +	 * possible control virtqueue and followed by the same
> +	 * 'vi->num_queue_pairs-1' more times
> +	 */
> +	total_vqs = vi->num_queue_pairs * 2 +
> +		    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ);
> +
> +	/* Allocate space for find_vqs parameters */
> +	vqs = kmalloc(total_vqs * sizeof(*vqs), GFP_KERNEL);
> +	callbacks = kmalloc(total_vqs * sizeof(*callbacks), GFP_KERNEL);
> +	names = kmalloc(total_vqs * sizeof(*names), GFP_KERNEL);

so this needs to be kzalloc otherwise on an error cleanup will
get uninitialized data and crash?

> +	if (!vqs || !callbacks || !names)
> +		goto err;
> +
> +	/* Parameters for control virtqueue, if any */
> +	if (vi->has_cvq) {
> +		callbacks[2] = NULL;
> +		names[2] = "control";
> +	}
> +
> +	/* Allocate/initialize parameters for send/receive virtqueues */
> +	for (i = 0; i < vi->num_queue_pairs * 2; i += 2) {
> +		int j = (i == 0 ? i : i + vi->has_cvq);
> +		callbacks[j] = skb_recv_done;
> +		callbacks[j + 1] = skb_xmit_done;
> +		names[j] = kasprintf(GFP_KERNEL, "input.%d", i / 2);
> +		names[j + 1] = kasprintf(GFP_KERNEL, "output.%d", i / 2);

This needs wrappers. E.g. virtnet_rx_vq(int queue_pair), virtnet_tx_vq(int queue_pair);
Then you would just scan 0 to num_queue_pairs, and i is queue pair
number.

> +	}
>  
> -	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
> +	ret = vi->vdev->config->find_vqs(vi->vdev, total_vqs, vqs, callbacks,
> +					 (const char **)names);
> +	if (ret)
> +		goto err;
> +
> +	if (vi->has_cvq)
>  		vi->cvq = vqs[2];
>  
> -		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
> -			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
> +	for (i = 0; i < vi->num_queue_pairs * 2; i += 2) {
> +		int j = i == 0 ? i : i + vi->has_cvq;
> +		vi->rq[i / 2]->vq = vqs[j];
> +		vi->sq[i / 2]->vq = vqs[j + 1];

Same here.

>  	}
> -	return 0;
> +
> +err:
> +	if (ret && names)

If we are here ret != 0. For names, just add another label, don't
complicate cleanup.

> +		for (i = 0; i < vi->num_queue_pairs * 2; i++)
> +			kfree(names[i]);
> +
> +	kfree(names);
> +	kfree(callbacks);
> +	kfree(vqs);
> +
> +	return ret;
> +}
> +
> +static int virtnet_alloc_queues(struct virtnet_info *vi)
> +{
> +	int ret = -ENOMEM;
> +	int i;
> +
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		vi->rq[i] = kzalloc(sizeof(*vi->rq[i]), GFP_KERNEL);
> +		vi->sq[i] = kzalloc(sizeof(*vi->sq[i]), GFP_KERNEL);
> +		if (!vi->rq[i] || !vi->sq[i])
> +			goto err;
> +	}
> +
> +	ret = 0;
> +
> +	/* setup initial receive and send queue parameters */
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		vi->rq[i]->vi = vi;
> +		vi->rq[i]->pages = NULL;
> +		INIT_DELAYED_WORK(&vi->rq[i]->refill, refill_work);
> +		netif_napi_add(vi->dev, &vi->rq[i]->napi, virtnet_poll,
> +			       napi_weight);
> +
> +		sg_init_table(vi->rq[i]->sg, ARRAY_SIZE(vi->rq[i]->sg));
> +		sg_init_table(vi->sq[i]->sg, ARRAY_SIZE(vi->sq[i]->sg));
> +	}
> +

Add return 0 here, then ret = 0 will not be needed
above and if (ret) below.


> +err:
> +	if (ret)
> +		virtnet_free_queues(vi);
> +
> +	return ret;
> +}
> +
> +static int virtnet_setup_vqs(struct virtnet_info *vi)
> +{
> +	int ret;
> +
> +	/* Allocate send & receive queues */
> +	ret = virtnet_alloc_queues(vi);
> +	if (!ret) {
> +		ret = virtnet_find_vqs(vi);
> +		if (ret)
> +			virtnet_free_queues(vi);
> +		else
> +			virtnet_set_affinity(vi, true);
> +	}
> +
> +	return ret;

Add some labels for error handling, this if nesting is messy.

>  }
>  
>  static int virtnet_probe(struct virtio_device *vdev)
>  {
> -	int err;
> +	int i, err;
>  	struct net_device *dev;
>  	struct virtnet_info *vi;
> +	u16 num_queues, num_queue_pairs;
> +
> +	/* Find if host supports multiqueue virtio_net device */
> +	err = virtio_config_val(vdev, VIRTIO_NET_F_MULTIQUEUE,
> +				offsetof(struct virtio_net_config,
> +				num_queues), &num_queues);
> +
> +	/* We need atleast 2 queue's */

typo

> +	if (err || num_queues < 2)
> +		num_queues = 2;
> +	if (num_queues > MAX_QUEUES * 2)
> +		num_queues = MAX_QUEUES;
> +
> +	num_queue_pairs = num_queues / 2;
>  
>  	/* Allocate ourselves a network device with room for our info */
> -	dev = alloc_etherdev(sizeof(struct virtnet_info));
> +	dev = alloc_etherdev_mq(sizeof(struct virtnet_info), num_queue_pairs);
>  	if (!dev)
>  		return -ENOMEM;
>  
> @@ -1103,22 +1362,18 @@ static int virtnet_probe(struct virtio_device *vdev)
>  
>  	/* Set up our device-specific information */
>  	vi = netdev_priv(dev);
> -	netif_napi_add(dev, &vi->napi, virtnet_poll, napi_weight);
>  	vi->dev = dev;
>  	vi->vdev = vdev;
>  	vdev->priv = vi;
> -	vi->pages = NULL;
>  	vi->stats = alloc_percpu(struct virtnet_stats);
>  	err = -ENOMEM;
>  	if (vi->stats == NULL)
> -		goto free;
> +		goto free_netdev;
>  
> -	INIT_DELAYED_WORK(&vi->refill, refill_work);
>  	mutex_init(&vi->config_lock);
>  	vi->config_enable = true;
>  	INIT_WORK(&vi->config_work, virtnet_config_changed_work);
> -	sg_init_table(vi->rx_sg, ARRAY_SIZE(vi->rx_sg));
> -	sg_init_table(vi->tx_sg, ARRAY_SIZE(vi->tx_sg));
> +	vi->num_queue_pairs = num_queue_pairs;
>  
>  	/* If we can receive ANY GSO packets, we must allocate large ones. */
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
> @@ -1129,9 +1384,17 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
>  		vi->mergeable_rx_bufs = true;
>  
> -	err = init_vqs(vi);
> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
> +		vi->has_cvq = true;
> +

How about we disable multiqueue if there's no cvq?
Will make logic a bit simpler, won't it?

> +	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
> +	err = virtnet_setup_vqs(vi);
>  	if (err)
> -		goto free_stats;
> +		goto free_netdev;
> +
> +	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) &&
> +	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
> +		dev->features |= NETIF_F_HW_VLAN_FILTER;
>  
>  	err = register_netdev(dev);
>  	if (err) {
> @@ -1140,12 +1403,15 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	}
>  
>  	/* Last of all, set up some receive buffers. */
> -	try_fill_recv(vi, GFP_KERNEL);
> -
> -	/* If we didn't even get one input buffer, we're useless. */
> -	if (vi->num == 0) {
> -		err = -ENOMEM;
> -		goto unregister;
> +	for (i = 0; i < num_queue_pairs; i++) {
> +		try_fill_recv(vi->rq[i], GFP_KERNEL);
> +
> +		/* If we didn't even get one input buffer, we're useless. */
> +		if (vi->rq[i]->num == 0) {
> +			free_unused_bufs(vi);
> +			err = -ENOMEM;
> +			goto free_recv_bufs;
> +		}
>  	}
>  
>  	/* Assume link up if device can't report link status,
> @@ -1158,42 +1424,25 @@ static int virtnet_probe(struct virtio_device *vdev)
>  		netif_carrier_on(dev);
>  	}
>  
> -	pr_debug("virtnet: registered device %s\n", dev->name);
> +	pr_debug("virtnet: registered device %s with %d RX and TX vq's\n",
> +		 dev->name, num_queue_pairs);
> +
>  	return 0;
>  
> -unregister:
> +free_recv_bufs:
> +	free_receive_bufs(vi);
>  	unregister_netdev(dev);
> +
>  free_vqs:
> -	vdev->config->del_vqs(vdev);
> -free_stats:
> -	free_percpu(vi->stats);
> -free:
> +	for (i = 0; i < num_queue_pairs; i++)
> +		cancel_delayed_work_sync(&vi->rq[i]->refill);
> +	virtnet_del_vqs(vi);
> +
> +free_netdev:
>  	free_netdev(dev);
>  	return err;
>  }
>  
> -static void free_unused_bufs(struct virtnet_info *vi)
> -{
> -	void *buf;
> -	while (1) {
> -		buf = virtqueue_detach_unused_buf(vi->svq);
> -		if (!buf)
> -			break;
> -		dev_kfree_skb(buf);
> -	}
> -	while (1) {
> -		buf = virtqueue_detach_unused_buf(vi->rvq);
> -		if (!buf)
> -			break;
> -		if (vi->mergeable_rx_bufs || vi->big_packets)
> -			give_pages(vi, buf);
> -		else
> -			dev_kfree_skb(buf);
> -		--vi->num;
> -	}
> -	BUG_ON(vi->num != 0);
> -}
> -
>  static void remove_vq_common(struct virtnet_info *vi)
>  {
>  	vi->vdev->config->reset(vi->vdev);
> @@ -1201,10 +1450,9 @@ static void remove_vq_common(struct virtnet_info *vi)
>  	/* Free unused buffers in both send and recv, if any. */
>  	free_unused_bufs(vi);
>  
> -	vi->vdev->config->del_vqs(vi->vdev);
> +	free_receive_bufs(vi);
>  
> -	while (vi->pages)
> -		__free_pages(get_a_page(vi, GFP_KERNEL), 0);
> +	virtnet_del_vqs(vi);
>  }
>  
>  static void __devexit virtnet_remove(struct virtio_device *vdev)
> @@ -1230,6 +1478,7 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
>  static int virtnet_freeze(struct virtio_device *vdev)
>  {
>  	struct virtnet_info *vi = vdev->priv;
> +	int i;
>  
>  	/* Prevent config work handler from accessing the device */
>  	mutex_lock(&vi->config_lock);
> @@ -1237,10 +1486,13 @@ static int virtnet_freeze(struct virtio_device *vdev)
>  	mutex_unlock(&vi->config_lock);
>  
>  	netif_device_detach(vi->dev);
> -	cancel_delayed_work_sync(&vi->refill);
> +	for (i = 0; i < vi->num_queue_pairs; i++)
> +		cancel_delayed_work_sync(&vi->rq[i]->refill);
>  
>  	if (netif_running(vi->dev))
> -		napi_disable(&vi->napi);
> +		for (i = 0; i < vi->num_queue_pairs; i++)
> +			napi_disable(&vi->rq[i]->napi);
> +
>  
>  	remove_vq_common(vi);
>  
> @@ -1252,19 +1504,22 @@ static int virtnet_freeze(struct virtio_device *vdev)
>  static int virtnet_restore(struct virtio_device *vdev)
>  {
>  	struct virtnet_info *vi = vdev->priv;
> -	int err;
> +	int err, i;
>  
> -	err = init_vqs(vi);
> +	err = virtnet_setup_vqs(vi);
>  	if (err)
>  		return err;
>  
>  	if (netif_running(vi->dev))
> -		virtnet_napi_enable(vi);
> +		for (i = 0; i < vi->num_queue_pairs; i++)
> +			virtnet_napi_enable(vi->rq[i]);
>  
>  	netif_device_attach(vi->dev);
>  
> -	if (!try_fill_recv(vi, GFP_KERNEL))
> -		queue_delayed_work(system_nrt_wq, &vi->refill, 0);
> +	for (i = 0; i < vi->num_queue_pairs; i++)
> +		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
> +			queue_delayed_work(system_nrt_wq,
> +					   &vi->rq[i]->refill, 0);
>  
>  	mutex_lock(&vi->config_lock);
>  	vi->config_enable = true;
> @@ -1287,7 +1542,7 @@ static unsigned int features[] = {
>  	VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
>  	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
>  	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN,
> -	VIRTIO_NET_F_GUEST_ANNOUNCE,
> +	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MULTIQUEUE,
>  };
>  
>  static struct virtio_driver virtio_net_driver = {
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index 1bc7e30..60f09ff 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -61,6 +61,8 @@ struct virtio_net_config {
>  	__u8 mac[6];
>  	/* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
>  	__u16 status;
> +	/* Total number of RX/TX queues */
> +	__u16 num_queues;
>  } __attribute__((packed));
>  
>  /* This is the first element of the scatter-gather list.  If you don't
> -- 
> 1.7.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sasha Levin July 21, 2012, 12:02 p.m. UTC | #4
On 07/20/2012 03:40 PM, Michael S. Tsirkin wrote:
>> -	err = init_vqs(vi);
>> > +	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>> > +		vi->has_cvq = true;
>> > +
> How about we disable multiqueue if there's no cvq?
> Will make logic a bit simpler, won't it?

multiqueues don't really depend on cvq. Does this added complexity really justifies adding an artificial limit?


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Wang July 23, 2012, 5:48 a.m. UTC | #5
On 07/20/2012 09:40 PM, Michael S. Tsirkin wrote:
> On Thu, Jul 05, 2012 at 06:29:53PM +0800, Jason Wang wrote:
>> This patch converts virtio_net to a multi queue device. After negotiated
>> VIRTIO_NET_F_MULTIQUEUE feature, the virtio device has many tx/rx queue pairs,
>> and driver could read the number from config space.
>>
>> The driver expects the number of rx/tx queue paris is equal to the number of
>> vcpus. To maximize the performance under this per-cpu rx/tx queue pairs, some
>> optimization were introduced:
>>
>> - Txq selection is based on the processor id in order to avoid contending a lock
>>    whose owner may exits to host.
>> - Since the txq/txq were per-cpu, affinity hint were set to the cpu that owns
>>    the queue pairs.
>>
>> Signed-off-by: Krishna Kumar<krkumar2@in.ibm.com>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
> Overall fine. I think it is best to smash the following patch
> into this one, so that default behavior does not
> jump to mq then back. some comments below: mostly nits, and a minor bug.

Sure, thanks for the reviewing.
>
> If you are worried the patch is too big, it can be split
> differently
> 	- rework to use send_queue/receive_queue structures, no
> 	  functional changes.
> 	- add multiqueue
>
> but this is not a must.
>
>> ---
>>   drivers/net/virtio_net.c   |  645 ++++++++++++++++++++++++++++++-------------
>>   include/linux/virtio_net.h |    2 +
>>   2 files changed, 452 insertions(+), 195 deletions(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 1db445b..7410187 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -26,6 +26,7 @@
>>   #include<linux/scatterlist.h>
>>   #include<linux/if_vlan.h>
>>   #include<linux/slab.h>
>> +#include<linux/interrupt.h>
>>
>>   static int napi_weight = 128;
>>   module_param(napi_weight, int, 0444);
>> @@ -41,6 +42,8 @@ module_param(gso, bool, 0444);
>>   #define VIRTNET_SEND_COMMAND_SG_MAX    2
>>   #define VIRTNET_DRIVER_VERSION "1.0.0"
>>
>> +#define MAX_QUEUES 256
>> +
>>   struct virtnet_stats {
>>   	struct u64_stats_sync tx_syncp;
>>   	struct u64_stats_sync rx_syncp;
> Would be a bit better not to have artificial limits like that.
> Maybe allocate arrays at probe time, then we can
> take whatever the device gives us?

Sure.
>> @@ -51,43 +54,69 @@ struct virtnet_stats {
>>   	u64 rx_packets;
>>   };
>>
>> -struct virtnet_info {
>> -	struct virtio_device *vdev;
>> -	struct virtqueue *rvq, *svq, *cvq;
>> -	struct net_device *dev;
>> +/* Internal representation of a send virtqueue */
>> +struct send_queue {
>> +	/* Virtqueue associated with this send _queue */
>> +	struct virtqueue *vq;
>> +
>> +	/* TX: fragments + linear part + virtio header */
>> +	struct scatterlist sg[MAX_SKB_FRAGS + 2];
>> +};
>> +
>> +/* Internal representation of a receive virtqueue */
>> +struct receive_queue {
>> +	/* Virtqueue associated with this receive_queue */
>> +	struct virtqueue *vq;
>> +
>> +	/* Back pointer to the virtnet_info */
>> +	struct virtnet_info *vi;
>> +
>>   	struct napi_struct napi;
>> -	unsigned int status;
>>
>>   	/* Number of input buffers, and max we've ever had. */
>>   	unsigned int num, max;
>>
>> +	/* Work struct for refilling if we run low on memory. */
>> +	struct delayed_work refill;
>> +
>> +	/* Chain pages by the private ptr. */
>> +	struct page *pages;
>> +
>> +	/* RX: fragments + linear part + virtio header */
>> +	struct scatterlist sg[MAX_SKB_FRAGS + 2];
>> +};
>> +
>> +struct virtnet_info {
>> +	u16 num_queue_pairs;		/* # of RX/TX vq pairs */
>> +
>> +	struct send_queue *sq[MAX_QUEUES] ____cacheline_aligned_in_smp;
>> +	struct receive_queue *rq[MAX_QUEUES] ____cacheline_aligned_in_smp;
> The assumption is a tx/rx pair is handled on the same cpu, yes?
> If yes maybe make it a single array to improve cache locality
> a bit?
> 	struct queue_pair {
> 		struct send_queue sq;
> 		struct receive_queue rq;
> 	};

Ok.
>> +	struct virtqueue *cvq;
>> +
>> +	struct virtio_device *vdev;
>> +	struct net_device *dev;
>> +	unsigned int status;
>> +
>>   	/* I like... big packets and I cannot lie! */
>>   	bool big_packets;
>>
>>   	/* Host will merge rx buffers for big packets (shake it! shake it!) */
>>   	bool mergeable_rx_bufs;
>>
>> +	/* Has control virtqueue */
>> +	bool has_cvq;
>> +
> won't checking (cvq != NULL) be enough?

Enough, so has_cvq is dupliated with vi->cvq. I will remove it in next 
version.
>
>>   	/* enable config space updates */
>>   	bool config_enable;
>>
>>   	/* Active statistics */
>>   	struct virtnet_stats __percpu *stats;
>>
>> -	/* Work struct for refilling if we run low on memory. */
>> -	struct delayed_work refill;
>> -
>>   	/* Work struct for config space updates */
>>   	struct work_struct config_work;
>>
>>   	/* Lock for config space updates */
>>   	struct mutex config_lock;
>> -
>> -	/* Chain pages by the private ptr. */
>> -	struct page *pages;
>> -
>> -	/* fragments + linear part + virtio header */
>> -	struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
>> -	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
>>   };
>>
>>   struct skb_vnet_hdr {
>> @@ -108,6 +137,22 @@ struct padded_vnet_hdr {
>>   	char padding[6];
>>   };
>>
>> +static inline int txq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
>> +{
>> +	int ret = virtqueue_get_queue_index(vq);
>> +
>> +	/* skip ctrl vq */
>> +	if (vi->has_cvq)
>> +		return (ret - 1) / 2;
>> +	else
>> +		return ret / 2;
>> +}
>> +
>> +static inline int rxq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
>> +{
>> +	return virtqueue_get_queue_index(vq) / 2;
>> +}
>> +
>>   static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
>>   {
>>   	return (struct skb_vnet_hdr *)skb->cb;
>> @@ -117,22 +162,22 @@ static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
>>    * private is used to chain pages for big packets, put the whole
>>    * most recent used list in the beginning for reuse
>>    */
>> -static void give_pages(struct virtnet_info *vi, struct page *page)
>> +static void give_pages(struct receive_queue *rq, struct page *page)
>>   {
>>   	struct page *end;
>>
>>   	/* Find end of list, sew whole thing into vi->pages. */
>>   	for (end = page; end->private; end = (struct page *)end->private);
>> -	end->private = (unsigned long)vi->pages;
>> -	vi->pages = page;
>> +	end->private = (unsigned long)rq->pages;
>> +	rq->pages = page;
>>   }
>>
>> -static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
>> +static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
>>   {
>> -	struct page *p = vi->pages;
>> +	struct page *p = rq->pages;
>>
>>   	if (p) {
>> -		vi->pages = (struct page *)p->private;
>> +		rq->pages = (struct page *)p->private;
>>   		/* clear private here, it is used to chain pages */
>>   		p->private = 0;
>>   	} else
>> @@ -140,15 +185,15 @@ static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
>>   	return p;
>>   }
>>
>> -static void skb_xmit_done(struct virtqueue *svq)
>> +static void skb_xmit_done(struct virtqueue *vq)
>>   {
>> -	struct virtnet_info *vi = svq->vdev->priv;
>> +	struct virtnet_info *vi = vq->vdev->priv;
>>
>>   	/* Suppress further interrupts. */
>> -	virtqueue_disable_cb(svq);
>> +	virtqueue_disable_cb(vq);
>>
>>   	/* We were probably waiting for more output buffers. */
>> -	netif_wake_queue(vi->dev);
>> +	netif_wake_subqueue(vi->dev, txq_get_qnum(vi, vq));
>>   }
>>
>>   static void set_skb_frag(struct sk_buff *skb, struct page *page,
>> @@ -167,9 +212,10 @@ static void set_skb_frag(struct sk_buff *skb, struct page *page,
>>   }
>>
>>   /* Called from bottom half context */
>> -static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>> +static struct sk_buff *page_to_skb(struct receive_queue *rq,
>>   				   struct page *page, unsigned int len)
>>   {
>> +	struct virtnet_info *vi = rq->vi;
>>   	struct sk_buff *skb;
>>   	struct skb_vnet_hdr *hdr;
>>   	unsigned int copy, hdr_len, offset;
>> @@ -225,12 +271,12 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>   	}
>>
>>   	if (page)
>> -		give_pages(vi, page);
>> +		give_pages(rq, page);
>>
>>   	return skb;
>>   }
>>
>> -static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
>> +static int receive_mergeable(struct receive_queue *rq, struct sk_buff *skb)
>>   {
>>   	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
>>   	struct page *page;
>> @@ -244,7 +290,7 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
>>   			skb->dev->stats.rx_length_errors++;
>>   			return -EINVAL;
>>   		}
>> -		page = virtqueue_get_buf(vi->rvq,&len);
>> +		page = virtqueue_get_buf(rq->vq,&len);
>>   		if (!page) {
>>   			pr_debug("%s: rx error: %d buffers missing\n",
>>   				 skb->dev->name, hdr->mhdr.num_buffers);
>> @@ -257,13 +303,14 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
>>
>>   		set_skb_frag(skb, page, 0,&len);
>>
>> -		--vi->num;
>> +		--rq->num;
>>   	}
>>   	return 0;
>>   }
>>
>> -static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
>> +static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
>>   {
>> +	struct net_device *dev = rq->vi->dev;
>>   	struct virtnet_info *vi = netdev_priv(dev);
>>   	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
>>   	struct sk_buff *skb;
>> @@ -274,7 +321,7 @@ static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
>>   		pr_debug("%s: short packet %i\n", dev->name, len);
>>   		dev->stats.rx_length_errors++;
>>   		if (vi->mergeable_rx_bufs || vi->big_packets)
>> -			give_pages(vi, buf);
>> +			give_pages(rq, buf);
>>   		else
>>   			dev_kfree_skb(buf);
>>   		return;
>> @@ -286,14 +333,14 @@ static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
>>   		skb_trim(skb, len);
>>   	} else {
>>   		page = buf;
>> -		skb = page_to_skb(vi, page, len);
>> +		skb = page_to_skb(rq, page, len);
>>   		if (unlikely(!skb)) {
>>   			dev->stats.rx_dropped++;
>> -			give_pages(vi, page);
>> +			give_pages(rq, page);
>>   			return;
>>   		}
>>   		if (vi->mergeable_rx_bufs)
>> -			if (receive_mergeable(vi, skb)) {
>> +			if (receive_mergeable(rq, skb)) {
>>   				dev_kfree_skb(skb);
>>   				return;
>>   			}
>> @@ -363,90 +410,91 @@ frame_err:
>>   	dev_kfree_skb(skb);
>>   }
>>
>> -static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
>> +static int add_recvbuf_small(struct receive_queue *rq, gfp_t gfp)
>>   {
>>   	struct sk_buff *skb;
>>   	struct skb_vnet_hdr *hdr;
>>   	int err;
>>
>> -	skb = __netdev_alloc_skb_ip_align(vi->dev, MAX_PACKET_LEN, gfp);
>> +	skb = __netdev_alloc_skb_ip_align(rq->vi->dev, MAX_PACKET_LEN, gfp);
>>   	if (unlikely(!skb))
>>   		return -ENOMEM;
>>
>>   	skb_put(skb, MAX_PACKET_LEN);
>>
>>   	hdr = skb_vnet_hdr(skb);
>> -	sg_set_buf(vi->rx_sg,&hdr->hdr, sizeof hdr->hdr);
>> +	sg_set_buf(rq->sg,&hdr->hdr, sizeof hdr->hdr);
>> +
>> +	skb_to_sgvec(skb, rq->sg + 1, 0, skb->len);
>>
>> -	skb_to_sgvec(skb, vi->rx_sg + 1, 0, skb->len);
>> +	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 2, skb, gfp);
>>
>> -	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 2, skb, gfp);
>>   	if (err<  0)
>>   		dev_kfree_skb(skb);
>>
>>   	return err;
>>   }
>>
>> -static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
>> +static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
>>   {
>>   	struct page *first, *list = NULL;
>>   	char *p;
>>   	int i, err, offset;
>>
>> -	/* page in vi->rx_sg[MAX_SKB_FRAGS + 1] is list tail */
>> +	/* page in rq->sg[MAX_SKB_FRAGS + 1] is list tail */
>>   	for (i = MAX_SKB_FRAGS + 1; i>  1; --i) {
>> -		first = get_a_page(vi, gfp);
>> +		first = get_a_page(rq, gfp);
>>   		if (!first) {
>>   			if (list)
>> -				give_pages(vi, list);
>> +				give_pages(rq, list);
>>   			return -ENOMEM;
>>   		}
>> -		sg_set_buf(&vi->rx_sg[i], page_address(first), PAGE_SIZE);
>> +		sg_set_buf(&rq->sg[i], page_address(first), PAGE_SIZE);
>>
>>   		/* chain new page in list head to match sg */
>>   		first->private = (unsigned long)list;
>>   		list = first;
>>   	}
>>
>> -	first = get_a_page(vi, gfp);
>> +	first = get_a_page(rq, gfp);
>>   	if (!first) {
>> -		give_pages(vi, list);
>> +		give_pages(rq, list);
>>   		return -ENOMEM;
>>   	}
>>   	p = page_address(first);
>>
>> -	/* vi->rx_sg[0], vi->rx_sg[1] share the same page */
>> -	/* a separated vi->rx_sg[0] for virtio_net_hdr only due to QEMU bug */
>> -	sg_set_buf(&vi->rx_sg[0], p, sizeof(struct virtio_net_hdr));
>> +	/* rq->sg[0], rq->sg[1] share the same page */
>> +	/* a separated rq->sg[0] for virtio_net_hdr only due to QEMU bug */
>> +	sg_set_buf(&rq->sg[0], p, sizeof(struct virtio_net_hdr));
>>
>> -	/* vi->rx_sg[1] for data packet, from offset */
>> +	/* rq->sg[1] for data packet, from offset */
>>   	offset = sizeof(struct padded_vnet_hdr);
>> -	sg_set_buf(&vi->rx_sg[1], p + offset, PAGE_SIZE - offset);
>> +	sg_set_buf(&rq->sg[1], p + offset, PAGE_SIZE - offset);
>>
>>   	/* chain first in list head */
>>   	first->private = (unsigned long)list;
>> -	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
>> +	err = virtqueue_add_buf(rq->vq, rq->sg, 0, MAX_SKB_FRAGS + 2,
>>   				first, gfp);
>>   	if (err<  0)
>> -		give_pages(vi, first);
>> +		give_pages(rq, first);
>>
>>   	return err;
>>   }
>>
>> -static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
>> +static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
>>   {
>>   	struct page *page;
>>   	int err;
>>
>> -	page = get_a_page(vi, gfp);
>> +	page = get_a_page(rq, gfp);
>>   	if (!page)
>>   		return -ENOMEM;
>>
>> -	sg_init_one(vi->rx_sg, page_address(page), PAGE_SIZE);
>> +	sg_init_one(rq->sg, page_address(page), PAGE_SIZE);
>>
>> -	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 1, page, gfp);
>> +	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 1, page, gfp);
>>   	if (err<  0)
>> -		give_pages(vi, page);
>> +		give_pages(rq, page);
>>
>>   	return err;
>>   }
>> @@ -458,97 +506,104 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
>>    * before we're receiving packets, or from refill_work which is
>>    * careful to disable receiving (using napi_disable).
>>    */
>> -static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
>> +static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
>>   {
>> +	struct virtnet_info *vi = rq->vi;
>>   	int err;
>>   	bool oom;
>>
>>   	do {
>>   		if (vi->mergeable_rx_bufs)
>> -			err = add_recvbuf_mergeable(vi, gfp);
>> +			err = add_recvbuf_mergeable(rq, gfp);
>>   		else if (vi->big_packets)
>> -			err = add_recvbuf_big(vi, gfp);
>> +			err = add_recvbuf_big(rq, gfp);
>>   		else
>> -			err = add_recvbuf_small(vi, gfp);
>> +			err = add_recvbuf_small(rq, gfp);
>>
>>   		oom = err == -ENOMEM;
>>   		if (err<  0)
>>   			break;
>> -		++vi->num;
>> +		++rq->num;
>>   	} while (err>  0);
>> -	if (unlikely(vi->num>  vi->max))
>> -		vi->max = vi->num;
>> -	virtqueue_kick(vi->rvq);
>> +	if (unlikely(rq->num>  rq->max))
>> +		rq->max = rq->num;
>> +	virtqueue_kick(rq->vq);
>>   	return !oom;
>>   }
>>
>> -static void skb_recv_done(struct virtqueue *rvq)
>> +static void skb_recv_done(struct virtqueue *vq)
>>   {
>> -	struct virtnet_info *vi = rvq->vdev->priv;
>> +	struct virtnet_info *vi = vq->vdev->priv;
>> +	struct napi_struct *napi =&vi->rq[rxq_get_qnum(vi, vq)]->napi;
>> +
>>   	/* Schedule NAPI, Suppress further interrupts if successful. */
>> -	if (napi_schedule_prep(&vi->napi)) {
>> -		virtqueue_disable_cb(rvq);
>> -		__napi_schedule(&vi->napi);
>> +	if (napi_schedule_prep(napi)) {
>> +		virtqueue_disable_cb(vq);
>> +		__napi_schedule(napi);
>>   	}
>>   }
>>
>> -static void virtnet_napi_enable(struct virtnet_info *vi)
>> +static void virtnet_napi_enable(struct receive_queue *rq)
>>   {
>> -	napi_enable(&vi->napi);
>> +	napi_enable(&rq->napi);
>>
>>   	/* If all buffers were filled by other side before we napi_enabled, we
>>   	 * won't get another interrupt, so process any outstanding packets
>>   	 * now.  virtnet_poll wants re-enable the queue, so we disable here.
>>   	 * We synchronize against interrupts via NAPI_STATE_SCHED */
>> -	if (napi_schedule_prep(&vi->napi)) {
>> -		virtqueue_disable_cb(vi->rvq);
>> +	if (napi_schedule_prep(&rq->napi)) {
>> +		virtqueue_disable_cb(rq->vq);
>>   		local_bh_disable();
>> -		__napi_schedule(&vi->napi);
>> +		__napi_schedule(&rq->napi);
>>   		local_bh_enable();
>>   	}
>>   }
>>
>>   static void refill_work(struct work_struct *work)
>>   {
>> -	struct virtnet_info *vi;
>> +	struct napi_struct *napi;
>> +	struct receive_queue *rq;
>>   	bool still_empty;
>>
>> -	vi = container_of(work, struct virtnet_info, refill.work);
>> -	napi_disable(&vi->napi);
>> -	still_empty = !try_fill_recv(vi, GFP_KERNEL);
>> -	virtnet_napi_enable(vi);
>> +	rq = container_of(work, struct receive_queue, refill.work);
>> +	napi =&rq->napi;
>> +
>> +	napi_disable(napi);
>> +	still_empty = !try_fill_recv(rq, GFP_KERNEL);
>> +	virtnet_napi_enable(rq);
>>
>>   	/* In theory, this can happen: if we don't get any buffers in
>>   	 * we will *never* try to fill again. */
>>   	if (still_empty)
>> -		queue_delayed_work(system_nrt_wq,&vi->refill, HZ/2);
>> +		queue_delayed_work(system_nrt_wq,&rq->refill, HZ/2);
>>   }
>>
>>   static int virtnet_poll(struct napi_struct *napi, int budget)
>>   {
>> -	struct virtnet_info *vi = container_of(napi, struct virtnet_info, napi);
>> +	struct receive_queue *rq = container_of(napi, struct receive_queue,
>> +						napi);
>>   	void *buf;
>>   	unsigned int len, received = 0;
>>
>>   again:
>>   	while (received<  budget&&
>> -	       (buf = virtqueue_get_buf(vi->rvq,&len)) != NULL) {
>> -		receive_buf(vi->dev, buf, len);
>> -		--vi->num;
>> +	       (buf = virtqueue_get_buf(rq->vq,&len)) != NULL) {
>> +		receive_buf(rq, buf, len);
>> +		--rq->num;
>>   		received++;
>>   	}
>>
>> -	if (vi->num<  vi->max / 2) {
>> -		if (!try_fill_recv(vi, GFP_ATOMIC))
>> -			queue_delayed_work(system_nrt_wq,&vi->refill, 0);
>> +	if (rq->num<  rq->max / 2) {
>> +		if (!try_fill_recv(rq, GFP_ATOMIC))
>> +			queue_delayed_work(system_nrt_wq,&rq->refill, 0);
>>   	}
>>
>>   	/* Out of packets? */
>>   	if (received<  budget) {
>>   		napi_complete(napi);
>> -		if (unlikely(!virtqueue_enable_cb(vi->rvq))&&
>> +		if (unlikely(!virtqueue_enable_cb(rq->vq))&&
>>   		napi_schedule_prep(napi)) {
>> -			virtqueue_disable_cb(vi->rvq);
>> +			virtqueue_disable_cb(rq->vq);
>>   			__napi_schedule(napi);
>>   			goto again;
>>   		}
>> @@ -557,13 +612,14 @@ again:
>>   	return received;
>>   }
>>
>> -static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
>> +static unsigned int free_old_xmit_skbs(struct virtnet_info *vi,
>> +				       struct virtqueue *vq)
>>   {
>>   	struct sk_buff *skb;
>>   	unsigned int len, tot_sgs = 0;
>>   	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
>>
>> -	while ((skb = virtqueue_get_buf(vi->svq,&len)) != NULL) {
>> +	while ((skb = virtqueue_get_buf(vq,&len)) != NULL) {
>>   		pr_debug("Sent skb %p\n", skb);
>>
>>   		u64_stats_update_begin(&stats->tx_syncp);
>> @@ -577,7 +633,8 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
>>   	return tot_sgs;
>>   }
>>
>> -static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
>> +static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb,
>> +		    struct virtqueue *vq, struct scatterlist *sg)
>>   {
>>   	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
>>   	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
>> @@ -615,44 +672,47 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
>>
>>   	/* Encode metadata header at front. */
>>   	if (vi->mergeable_rx_bufs)
>> -		sg_set_buf(vi->tx_sg,&hdr->mhdr, sizeof hdr->mhdr);
>> +		sg_set_buf(sg,&hdr->mhdr, sizeof hdr->mhdr);
>>   	else
>> -		sg_set_buf(vi->tx_sg,&hdr->hdr, sizeof hdr->hdr);
>> +		sg_set_buf(sg,&hdr->hdr, sizeof hdr->hdr);
>>
>> -	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
>> -	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
>> +	hdr->num_sg = skb_to_sgvec(skb, sg + 1, 0, skb->len) + 1;
>> +	return virtqueue_add_buf(vq, sg, hdr->num_sg,
>>   				 0, skb, GFP_ATOMIC);
>>   }
>>
>>   static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>>   {
>>   	struct virtnet_info *vi = netdev_priv(dev);
>> +	int qnum = skb_get_queue_mapping(skb);
>> +	struct virtqueue *vq = vi->sq[qnum]->vq;
>>   	int capacity;
>>
>>   	/* Free up any pending old buffers before queueing new ones. */
>> -	free_old_xmit_skbs(vi);
>> +	free_old_xmit_skbs(vi, vq);
>>
>>   	/* Try to transmit */
>> -	capacity = xmit_skb(vi, skb);
>> +	capacity = xmit_skb(vi, skb, vq, vi->sq[qnum]->sg);
>>
>>   	/* This can happen with OOM and indirect buffers. */
>>   	if (unlikely(capacity<  0)) {
>>   		if (likely(capacity == -ENOMEM)) {
>>   			if (net_ratelimit())
>>   				dev_warn(&dev->dev,
>> -					 "TX queue failure: out of memory\n");
>> +					"TXQ (%d) failure: out of memory\n",
>> +					qnum);
>>   		} else {
>>   			dev->stats.tx_fifo_errors++;
>>   			if (net_ratelimit())
>>   				dev_warn(&dev->dev,
>> -					 "Unexpected TX queue failure: %d\n",
>> -					 capacity);
>> +					"Unexpected TXQ (%d) failure: %d\n",
>> +					qnum, capacity);
>>   		}
>>   		dev->stats.tx_dropped++;
>>   		kfree_skb(skb);
>>   		return NETDEV_TX_OK;
>>   	}
>> -	virtqueue_kick(vi->svq);
>> +	virtqueue_kick(vq);
>>
>>   	/* Don't wait up for transmitted skbs to be freed. */
>>   	skb_orphan(skb);
>> @@ -661,13 +721,13 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>>   	/* Apparently nice girls don't return TX_BUSY; stop the queue
>>   	 * before it gets out of hand.  Naturally, this wastes entries. */
>>   	if (capacity<  2+MAX_SKB_FRAGS) {
>> -		netif_stop_queue(dev);
>> -		if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
>> +		netif_stop_subqueue(dev, qnum);
>> +		if (unlikely(!virtqueue_enable_cb_delayed(vq))) {
>>   			/* More just got used, free them then recheck. */
>> -			capacity += free_old_xmit_skbs(vi);
>> +			capacity += free_old_xmit_skbs(vi, vq);
>>   			if (capacity>= 2+MAX_SKB_FRAGS) {
>> -				netif_start_queue(dev);
>> -				virtqueue_disable_cb(vi->svq);
>> +				netif_start_subqueue(dev, qnum);
>> +				virtqueue_disable_cb(vq);
>>   			}
>>   		}
>>   	}
>> @@ -700,7 +760,8 @@ static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
>>   	unsigned int start;
>>
>>   	for_each_possible_cpu(cpu) {
>> -		struct virtnet_stats *stats = per_cpu_ptr(vi->stats, cpu);
>> +		struct virtnet_stats __percpu *stats
>> +			= per_cpu_ptr(vi->stats, cpu);
>>   		u64 tpackets, tbytes, rpackets, rbytes;
>>
>>   		do {
>> @@ -734,20 +795,26 @@ static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
>>   static void virtnet_netpoll(struct net_device *dev)
>>   {
>>   	struct virtnet_info *vi = netdev_priv(dev);
>> +	int i;
>>
>> -	napi_schedule(&vi->napi);
>> +	for (i = 0; i<  vi->num_queue_pairs; i++)
>> +		napi_schedule(&vi->rq[i]->napi);
>>   }
>>   #endif
>>
>>   static int virtnet_open(struct net_device *dev)
>>   {
>>   	struct virtnet_info *vi = netdev_priv(dev);
>> +	int i;
>>
>> -	/* Make sure we have some buffers: if oom use wq. */
>> -	if (!try_fill_recv(vi, GFP_KERNEL))
>> -		queue_delayed_work(system_nrt_wq,&vi->refill, 0);
>> +	for (i = 0; i<  vi->num_queue_pairs; i++) {
>> +		/* Make sure we have some buffers: if oom use wq. */
>> +		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
>> +			queue_delayed_work(system_nrt_wq,
>> +					&vi->rq[i]->refill, 0);
>> +		virtnet_napi_enable(vi->rq[i]);
>> +	}
>>
>> -	virtnet_napi_enable(vi);
>>   	return 0;
>>   }
>>
>> @@ -809,10 +876,13 @@ static void virtnet_ack_link_announce(struct virtnet_info *vi)
>>   static int virtnet_close(struct net_device *dev)
>>   {
>>   	struct virtnet_info *vi = netdev_priv(dev);
>> +	int i;
>>
>>   	/* Make sure refill_work doesn't re-enable napi! */
>> -	cancel_delayed_work_sync(&vi->refill);
>> -	napi_disable(&vi->napi);
>> +	for (i = 0; i<  vi->num_queue_pairs; i++) {
>> +		cancel_delayed_work_sync(&vi->rq[i]->refill);
>> +		napi_disable(&vi->rq[i]->napi);
>> +	}
>>
>>   	return 0;
>>   }
>> @@ -924,11 +994,10 @@ static void virtnet_get_ringparam(struct net_device *dev,
>>   {
>>   	struct virtnet_info *vi = netdev_priv(dev);
>>
>> -	ring->rx_max_pending = virtqueue_get_vring_size(vi->rvq);
>> -	ring->tx_max_pending = virtqueue_get_vring_size(vi->svq);
>> +	ring->rx_max_pending = virtqueue_get_vring_size(vi->rq[0]->vq);
>> +	ring->tx_max_pending = virtqueue_get_vring_size(vi->sq[0]->vq);
>>   	ring->rx_pending = ring->rx_max_pending;
>>   	ring->tx_pending = ring->tx_max_pending;
>> -
>>   }
>>
>>
>> @@ -961,6 +1030,19 @@ static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
>>   	return 0;
>>   }
>>
>> +/* To avoid contending a lock hold by a vcpu who would exit to host, select the
>> + * txq based on the processor id.
>> + */
>> +static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb)
>> +{
>> +	int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
>> +		  smp_processor_id();
>> +
>> +	while (unlikely(txq>= dev->real_num_tx_queues))
>> +		txq -= dev->real_num_tx_queues;
>> +	return txq;
>> +}
>> +
>>   static const struct net_device_ops virtnet_netdev = {
>>   	.ndo_open            = virtnet_open,
>>   	.ndo_stop   	     = virtnet_close,
>> @@ -972,6 +1054,7 @@ static const struct net_device_ops virtnet_netdev = {
>>   	.ndo_get_stats64     = virtnet_stats,
>>   	.ndo_vlan_rx_add_vid = virtnet_vlan_rx_add_vid,
>>   	.ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
>> +	.ndo_select_queue     = virtnet_select_queue,
>>   #ifdef CONFIG_NET_POLL_CONTROLLER
>>   	.ndo_poll_controller = virtnet_netpoll,
>>   #endif
>> @@ -1007,10 +1090,10 @@ static void virtnet_config_changed_work(struct work_struct *work)
>>
>>   	if (vi->status&  VIRTIO_NET_S_LINK_UP) {
>>   		netif_carrier_on(vi->dev);
>> -		netif_wake_queue(vi->dev);
>> +		netif_tx_wake_all_queues(vi->dev);
>>   	} else {
>>   		netif_carrier_off(vi->dev);
>> -		netif_stop_queue(vi->dev);
>> +		netif_tx_stop_all_queues(vi->dev);
>>   	}
>>   done:
>>   	mutex_unlock(&vi->config_lock);
>> @@ -1023,41 +1106,217 @@ static void virtnet_config_changed(struct virtio_device *vdev)
>>   	queue_work(system_nrt_wq,&vi->config_work);
>>   }
>>
>> -static int init_vqs(struct virtnet_info *vi)
>> +static void free_receive_bufs(struct virtnet_info *vi)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i<  vi->num_queue_pairs; i++) {
>> +		while (vi->rq[i]->pages)
>> +			__free_pages(get_a_page(vi->rq[i], GFP_KERNEL), 0);
>> +	}
>> +}
>> +
>> +/* Free memory allocated for send and receive queues */
>> +static void virtnet_free_queues(struct virtnet_info *vi)
>>   {
>> -	struct virtqueue *vqs[3];
>> -	vq_callback_t *callbacks[] = { skb_recv_done, skb_xmit_done, NULL};
>> -	const char *names[] = { "input", "output", "control" };
>> -	int nvqs, err;
>> +	int i;
>>
>> -	/* We expect two virtqueues, receive then send,
>> -	 * and optionally control. */
>> -	nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ? 3 : 2;
>> +	for (i = 0; i<  vi->num_queue_pairs; i++) {
>> +		kfree(vi->rq[i]);
>> +		vi->rq[i] = NULL;
>> +		kfree(vi->sq[i]);
>> +		vi->sq[i] = NULL;
>> +	}
>> +}
>>
>> -	err = vi->vdev->config->find_vqs(vi->vdev, nvqs, vqs, callbacks, names);
>> -	if (err)
>> -		return err;
>> +static void free_unused_bufs(struct virtnet_info *vi)
>> +{
>> +	void *buf;
>> +	int i;
>> +
>> +	for (i = 0; i<  vi->num_queue_pairs; i++) {
>> +		struct virtqueue *vq = vi->sq[i]->vq;
>> +
>> +		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL)
>> +			dev_kfree_skb(buf);
>> +	}
>> +
>> +	for (i = 0; i<  vi->num_queue_pairs; i++) {
>> +		struct virtqueue *vq = vi->rq[i]->vq;
>> +
>> +		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
>> +			if (vi->mergeable_rx_bufs || vi->big_packets)
>> +				give_pages(vi->rq[i], buf);
>> +			else
>> +				dev_kfree_skb(buf);
>> +			--vi->rq[i]->num;
>> +		}
>> +		BUG_ON(vi->rq[i]->num != 0);
>> +	}
>> +}
>> +
>> +static void virtnet_set_affinity(struct virtnet_info *vi, bool set)
>> +{
>> +	int i;
>> +
>> +	if (vi->num_queue_pairs == 1)
>> +		return;
>> +
>> +	for (i = 0; i<  vi->num_queue_pairs; i++) {
>> +		int cpu = set ? i : -1;
>> +		virtqueue_set_affinity(vi->rq[i]->vq, cpu);
>> +		virtqueue_set_affinity(vi->sq[i]->vq, cpu);
>> +	}
>> +	return;
>> +}
>> +
>> +static void virtnet_del_vqs(struct virtnet_info *vi)
>> +{
>> +	struct virtio_device *vdev = vi->vdev;
>> +
>> +	virtnet_set_affinity(vi, false);
>> +
>> +	vdev->config->del_vqs(vdev);
>> +
>> +	virtnet_free_queues(vi);
>> +}
>> +
>> +static int virtnet_find_vqs(struct virtnet_info *vi)
>> +{
>> +	vq_callback_t **callbacks;
>> +	struct virtqueue **vqs;
>> +	int ret = -ENOMEM;
>> +	int i, total_vqs;
>> +	char **names;
>>
>> -	vi->rvq = vqs[0];
>> -	vi->svq = vqs[1];
>> +	/*
>> +	 * We expect 1 RX virtqueue followed by 1 TX virtqueue, followd by
>> +	 * possible control virtqueue and followed by the same
>> +	 * 'vi->num_queue_pairs-1' more times
>> +	 */
>> +	total_vqs = vi->num_queue_pairs * 2 +
>> +		    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ);
>> +
>> +	/* Allocate space for find_vqs parameters */
>> +	vqs = kmalloc(total_vqs * sizeof(*vqs), GFP_KERNEL);
>> +	callbacks = kmalloc(total_vqs * sizeof(*callbacks), GFP_KERNEL);
>> +	names = kmalloc(total_vqs * sizeof(*names), GFP_KERNEL);
> so this needs to be kzalloc otherwise on an error cleanup will
> get uninitialized data and crash?

Yes, will change it to use kzalloc.
>
>> +	if (!vqs || !callbacks || !names)
>> +		goto err;
>> +
>> +	/* Parameters for control virtqueue, if any */
>> +	if (vi->has_cvq) {
>> +		callbacks[2] = NULL;
>> +		names[2] = "control";
>> +	}
>> +
>> +	/* Allocate/initialize parameters for send/receive virtqueues */
>> +	for (i = 0; i<  vi->num_queue_pairs * 2; i += 2) {
>> +		int j = (i == 0 ? i : i + vi->has_cvq);
>> +		callbacks[j] = skb_recv_done;
>> +		callbacks[j + 1] = skb_xmit_done;
>> +		names[j] = kasprintf(GFP_KERNEL, "input.%d", i / 2);
>> +		names[j + 1] = kasprintf(GFP_KERNEL, "output.%d", i / 2);
> This needs wrappers. E.g. virtnet_rx_vq(int queue_pair), virtnet_tx_vq(int queue_pair);
> Then you would just scan 0 to num_queue_pairs, and i is queue pair
> number.

Ok.
>> +	}
>>
>> -	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
>> +	ret = vi->vdev->config->find_vqs(vi->vdev, total_vqs, vqs, callbacks,
>> +					 (const char **)names);
>> +	if (ret)
>> +		goto err;
>> +
>> +	if (vi->has_cvq)
>>   		vi->cvq = vqs[2];
>>
>> -		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
>> -			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
>> +	for (i = 0; i<  vi->num_queue_pairs * 2; i += 2) {
>> +		int j = i == 0 ? i : i + vi->has_cvq;
>> +		vi->rq[i / 2]->vq = vqs[j];
>> +		vi->sq[i / 2]->vq = vqs[j + 1];
> Same here.

Consider the code is really simple, seem no need to use helpers.
>
>>   	}
>> -	return 0;
>> +
>> +err:
>> +	if (ret&&  names)
> If we are here ret != 0. For names, just add another label, don't
> complicate cleanup.

Ok.
>> +		for (i = 0; i<  vi->num_queue_pairs * 2; i++)
>> +			kfree(names[i]);
>> +
>> +	kfree(names);
>> +	kfree(callbacks);
>> +	kfree(vqs);
>> +
>> +	return ret;
>> +}
>> +
>> +static int virtnet_alloc_queues(struct virtnet_info *vi)
>> +{
>> +	int ret = -ENOMEM;
>> +	int i;
>> +
>> +	for (i = 0; i<  vi->num_queue_pairs; i++) {
>> +		vi->rq[i] = kzalloc(sizeof(*vi->rq[i]), GFP_KERNEL);
>> +		vi->sq[i] = kzalloc(sizeof(*vi->sq[i]), GFP_KERNEL);
>> +		if (!vi->rq[i] || !vi->sq[i])
>> +			goto err;
>> +	}
>> +
>> +	ret = 0;
>> +
>> +	/* setup initial receive and send queue parameters */
>> +	for (i = 0; i<  vi->num_queue_pairs; i++) {
>> +		vi->rq[i]->vi = vi;
>> +		vi->rq[i]->pages = NULL;
>> +		INIT_DELAYED_WORK(&vi->rq[i]->refill, refill_work);
>> +		netif_napi_add(vi->dev,&vi->rq[i]->napi, virtnet_poll,
>> +			       napi_weight);
>> +
>> +		sg_init_table(vi->rq[i]->sg, ARRAY_SIZE(vi->rq[i]->sg));
>> +		sg_init_table(vi->sq[i]->sg, ARRAY_SIZE(vi->sq[i]->sg));
>> +	}
>> +
> Add return 0 here, then ret = 0 will not be needed
> above and if (ret) below.
>

Ok.
>> +err:
>> +	if (ret)
>> +		virtnet_free_queues(vi);
>> +
>> +	return ret;
>> +}
>> +
>> +static int virtnet_setup_vqs(struct virtnet_info *vi)
>> +{
>> +	int ret;
>> +
>> +	/* Allocate send&  receive queues */
>> +	ret = virtnet_alloc_queues(vi);
>> +	if (!ret) {
>> +		ret = virtnet_find_vqs(vi);
>> +		if (ret)
>> +			virtnet_free_queues(vi);
>> +		else
>> +			virtnet_set_affinity(vi, true);
>> +	}
>> +
>> +	return ret;
> Add some labels for error handling, this if nesting is messy.

Ok.
>>   }
>>
>>   static int virtnet_probe(struct virtio_device *vdev)
>>   {
>> -	int err;
>> +	int i, err;
>>   	struct net_device *dev;
>>   	struct virtnet_info *vi;
>> +	u16 num_queues, num_queue_pairs;
>> +
>> +	/* Find if host supports multiqueue virtio_net device */
>> +	err = virtio_config_val(vdev, VIRTIO_NET_F_MULTIQUEUE,
>> +				offsetof(struct virtio_net_config,
>> +				num_queues),&num_queues);
>> +
>> +	/* We need atleast 2 queue's */
> typo

Will fix this, thanks.
>> +	if (err || num_queues<  2)
>> +		num_queues = 2;
>> +	if (num_queues>  MAX_QUEUES * 2)
>> +		num_queues = MAX_QUEUES;
>> +
>> +	num_queue_pairs = num_queues / 2;
>>
>>   	/* Allocate ourselves a network device with room for our info */
>> -	dev = alloc_etherdev(sizeof(struct virtnet_info));
>> +	dev = alloc_etherdev_mq(sizeof(struct virtnet_info), num_queue_pairs);
>>   	if (!dev)
>>   		return -ENOMEM;
>>
>> @@ -1103,22 +1362,18 @@ static int virtnet_probe(struct virtio_device *vdev)
>>
>>   	/* Set up our device-specific information */
>>   	vi = netdev_priv(dev);
>> -	netif_napi_add(dev,&vi->napi, virtnet_poll, napi_weight);
>>   	vi->dev = dev;
>>   	vi->vdev = vdev;
>>   	vdev->priv = vi;
>> -	vi->pages = NULL;
>>   	vi->stats = alloc_percpu(struct virtnet_stats);
>>   	err = -ENOMEM;
>>   	if (vi->stats == NULL)
>> -		goto free;
>> +		goto free_netdev;
>>
>> -	INIT_DELAYED_WORK(&vi->refill, refill_work);
>>   	mutex_init(&vi->config_lock);
>>   	vi->config_enable = true;
>>   	INIT_WORK(&vi->config_work, virtnet_config_changed_work);
>> -	sg_init_table(vi->rx_sg, ARRAY_SIZE(vi->rx_sg));
>> -	sg_init_table(vi->tx_sg, ARRAY_SIZE(vi->tx_sg));
>> +	vi->num_queue_pairs = num_queue_pairs;
>>
>>   	/* If we can receive ANY GSO packets, we must allocate large ones. */
>>   	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
>> @@ -1129,9 +1384,17 @@ static int virtnet_probe(struct virtio_device *vdev)
>>   	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
>>   		vi->mergeable_rx_bufs = true;
>>
>> -	err = init_vqs(vi);
>> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>> +		vi->has_cvq = true;
>> +
> How about we disable multiqueue if there's no cvq?
> Will make logic a bit simpler, won't it?

We can, but as you said, just let the logic simpler a bit.
>> +	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
>> +	err = virtnet_setup_vqs(vi);
>>   	if (err)
>> -		goto free_stats;
>> +		goto free_netdev;
>> +
>> +	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)&&
>> +	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
>> +		dev->features |= NETIF_F_HW_VLAN_FILTER;
>>
>>   	err = register_netdev(dev);
>>   	if (err) {
>> @@ -1140,12 +1403,15 @@ static int virtnet_probe(struct virtio_device *vdev)
>>   	}
>>
>>   	/* Last of all, set up some receive buffers. */
>> -	try_fill_recv(vi, GFP_KERNEL);
>> -
>> -	/* If we didn't even get one input buffer, we're useless. */
>> -	if (vi->num == 0) {
>> -		err = -ENOMEM;
>> -		goto unregister;
>> +	for (i = 0; i<  num_queue_pairs; i++) {
>> +		try_fill_recv(vi->rq[i], GFP_KERNEL);
>> +
>> +		/* If we didn't even get one input buffer, we're useless. */
>> +		if (vi->rq[i]->num == 0) {
>> +			free_unused_bufs(vi);
>> +			err = -ENOMEM;
>> +			goto free_recv_bufs;
>> +		}
>>   	}
>>
>>   	/* Assume link up if device can't report link status,
>> @@ -1158,42 +1424,25 @@ static int virtnet_probe(struct virtio_device *vdev)
>>   		netif_carrier_on(dev);
>>   	}
>>
>> -	pr_debug("virtnet: registered device %s\n", dev->name);
>> +	pr_debug("virtnet: registered device %s with %d RX and TX vq's\n",
>> +		 dev->name, num_queue_pairs);
>> +
>>   	return 0;
>>
>> -unregister:
>> +free_recv_bufs:
>> +	free_receive_bufs(vi);
>>   	unregister_netdev(dev);
>> +
>>   free_vqs:
>> -	vdev->config->del_vqs(vdev);
>> -free_stats:
>> -	free_percpu(vi->stats);
>> -free:
>> +	for (i = 0; i<  num_queue_pairs; i++)
>> +		cancel_delayed_work_sync(&vi->rq[i]->refill);
>> +	virtnet_del_vqs(vi);
>> +
>> +free_netdev:
>>   	free_netdev(dev);
>>   	return err;
>>   }
>>
>> -static void free_unused_bufs(struct virtnet_info *vi)
>> -{
>> -	void *buf;
>> -	while (1) {
>> -		buf = virtqueue_detach_unused_buf(vi->svq);
>> -		if (!buf)
>> -			break;
>> -		dev_kfree_skb(buf);
>> -	}
>> -	while (1) {
>> -		buf = virtqueue_detach_unused_buf(vi->rvq);
>> -		if (!buf)
>> -			break;
>> -		if (vi->mergeable_rx_bufs || vi->big_packets)
>> -			give_pages(vi, buf);
>> -		else
>> -			dev_kfree_skb(buf);
>> -		--vi->num;
>> -	}
>> -	BUG_ON(vi->num != 0);
>> -}
>> -
>>   static void remove_vq_common(struct virtnet_info *vi)
>>   {
>>   	vi->vdev->config->reset(vi->vdev);
>> @@ -1201,10 +1450,9 @@ static void remove_vq_common(struct virtnet_info *vi)
>>   	/* Free unused buffers in both send and recv, if any. */
>>   	free_unused_bufs(vi);
>>
>> -	vi->vdev->config->del_vqs(vi->vdev);
>> +	free_receive_bufs(vi);
>>
>> -	while (vi->pages)
>> -		__free_pages(get_a_page(vi, GFP_KERNEL), 0);
>> +	virtnet_del_vqs(vi);
>>   }
>>
>>   static void __devexit virtnet_remove(struct virtio_device *vdev)
>> @@ -1230,6 +1478,7 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
>>   static int virtnet_freeze(struct virtio_device *vdev)
>>   {
>>   	struct virtnet_info *vi = vdev->priv;
>> +	int i;
>>
>>   	/* Prevent config work handler from accessing the device */
>>   	mutex_lock(&vi->config_lock);
>> @@ -1237,10 +1486,13 @@ static int virtnet_freeze(struct virtio_device *vdev)
>>   	mutex_unlock(&vi->config_lock);
>>
>>   	netif_device_detach(vi->dev);
>> -	cancel_delayed_work_sync(&vi->refill);
>> +	for (i = 0; i<  vi->num_queue_pairs; i++)
>> +		cancel_delayed_work_sync(&vi->rq[i]->refill);
>>
>>   	if (netif_running(vi->dev))
>> -		napi_disable(&vi->napi);
>> +		for (i = 0; i<  vi->num_queue_pairs; i++)
>> +			napi_disable(&vi->rq[i]->napi);
>> +
>>
>>   	remove_vq_common(vi);
>>
>> @@ -1252,19 +1504,22 @@ static int virtnet_freeze(struct virtio_device *vdev)
>>   static int virtnet_restore(struct virtio_device *vdev)
>>   {
>>   	struct virtnet_info *vi = vdev->priv;
>> -	int err;
>> +	int err, i;
>>
>> -	err = init_vqs(vi);
>> +	err = virtnet_setup_vqs(vi);
>>   	if (err)
>>   		return err;
>>
>>   	if (netif_running(vi->dev))
>> -		virtnet_napi_enable(vi);
>> +		for (i = 0; i<  vi->num_queue_pairs; i++)
>> +			virtnet_napi_enable(vi->rq[i]);
>>
>>   	netif_device_attach(vi->dev);
>>
>> -	if (!try_fill_recv(vi, GFP_KERNEL))
>> -		queue_delayed_work(system_nrt_wq,&vi->refill, 0);
>> +	for (i = 0; i<  vi->num_queue_pairs; i++)
>> +		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
>> +			queue_delayed_work(system_nrt_wq,
>> +					&vi->rq[i]->refill, 0);
>>
>>   	mutex_lock(&vi->config_lock);
>>   	vi->config_enable = true;
>> @@ -1287,7 +1542,7 @@ static unsigned int features[] = {
>>   	VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
>>   	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
>>   	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN,
>> -	VIRTIO_NET_F_GUEST_ANNOUNCE,
>> +	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MULTIQUEUE,
>>   };
>>
>>   static struct virtio_driver virtio_net_driver = {
>> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
>> index 1bc7e30..60f09ff 100644
>> --- a/include/linux/virtio_net.h
>> +++ b/include/linux/virtio_net.h
>> @@ -61,6 +61,8 @@ struct virtio_net_config {
>>   	__u8 mac[6];
>>   	/* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
>>   	__u16 status;
>> +	/* Total number of RX/TX queues */
>> +	__u16 num_queues;
>>   } __attribute__((packed));
>>
>>   /* This is the first element of the scatter-gather list.  If you don't
>> -- 
>> 1.7.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Wang July 23, 2012, 5:54 a.m. UTC | #6
On 07/21/2012 08:02 PM, Sasha Levin wrote:
> On 07/20/2012 03:40 PM, Michael S. Tsirkin wrote:
>>> -	err = init_vqs(vi);
>>>> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>>>> +		vi->has_cvq = true;
>>>> +
>> How about we disable multiqueue if there's no cvq?
>> Will make logic a bit simpler, won't it?
> multiqueues don't really depend on cvq. Does this added complexity really justifies adding an artificial limit?
>

Yes, it does not depends on cvq. Cvq were just used to negotiate the 
number of queues a guest wishes to use which is really useful (at least 
for now). Since multiqueue can not out-perform for single queue in every 
kinds of workloads or benchmark, so we want to let guest driver use 
single queue by default even when multiqueue were enabled by management 
software and let use to enalbe it through ethtool. So user could not 
feel regression when it switch to use a multiqueue capable driver and 
backend.

So the only difference is the user experiences.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sasha Levin July 23, 2012, 9:28 a.m. UTC | #7
On 07/23/2012 07:54 AM, Jason Wang wrote:
> On 07/21/2012 08:02 PM, Sasha Levin wrote:
>> On 07/20/2012 03:40 PM, Michael S. Tsirkin wrote:
>>>> -    err = init_vqs(vi);
>>>>> +    if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>>>>> +        vi->has_cvq = true;
>>>>> +
>>> How about we disable multiqueue if there's no cvq?
>>> Will make logic a bit simpler, won't it?
>> multiqueues don't really depend on cvq. Does this added complexity really justifies adding an artificial limit?
>>
> 
> Yes, it does not depends on cvq. Cvq were just used to negotiate the number of queues a guest wishes to use which is really useful (at least for now). Since multiqueue can not out-perform for single queue in every kinds of workloads or benchmark, so we want to let guest driver use single queue by default even when multiqueue were enabled by management software and let use to enalbe it through ethtool. So user could not feel regression when it switch to use a multiqueue capable driver and backend.

Why would you limit it to a single vq if the user has specified a different number of vqs (>1) in the virtio-net device config?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin July 29, 2012, 9:44 a.m. UTC | #8
On Sat, Jul 21, 2012 at 02:02:58PM +0200, Sasha Levin wrote:
> On 07/20/2012 03:40 PM, Michael S. Tsirkin wrote:
> >> -	err = init_vqs(vi);
> >> > +	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
> >> > +		vi->has_cvq = true;
> >> > +
> > How about we disable multiqueue if there's no cvq?
> > Will make logic a bit simpler, won't it?
> 
> multiqueues don't really depend on cvq. Does this added complexity really justifies adding an artificial limit?

Well !cvq support is a legacy feature: the reason we support it
in driver is to avoid breaking on old hosts. Adding more code to that
path just doesn't make much sense since old hosts won't have mq.
Michael S. Tsirkin July 29, 2012, 9:50 a.m. UTC | #9
On Mon, Jul 23, 2012 at 01:48:35PM +0800, Jason Wang wrote:
> >>+	}
> >>
> >>-	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
> >>+	ret = vi->vdev->config->find_vqs(vi->vdev, total_vqs, vqs, callbacks,
> >>+					 (const char **)names);
> >>+	if (ret)
> >>+		goto err;
> >>+
> >>+	if (vi->has_cvq)
> >>  		vi->cvq = vqs[2];
> >>
> >>-		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
> >>-			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
> >>+	for (i = 0; i<  vi->num_queue_pairs * 2; i += 2) {
> >>+		int j = i == 0 ? i : i + vi->has_cvq;
> >>+		vi->rq[i / 2]->vq = vqs[j];
> >>+		vi->sq[i / 2]->vq = vqs[j + 1];
> >Same here.
> 
> Consider the code is really simple, seem no need to use helpers.

Well it was not simple to at least one reader :)
The problem is not this logic is complex,
it is that it is spread all over the code.

If we had e.g. vnet_tx_vqn_to_queuenum vnet_tx_queuenum_to_vqn
and same for rx, then the logic would all be
in one place, and have a tidy comment on top explaining
the VQ numbering scheme.
Jason Wang July 30, 2012, 3:26 a.m. UTC | #10
On 07/29/2012 05:44 PM, Michael S. Tsirkin wrote:
> On Sat, Jul 21, 2012 at 02:02:58PM +0200, Sasha Levin wrote:
>> On 07/20/2012 03:40 PM, Michael S. Tsirkin wrote:
>>>> -	err = init_vqs(vi);
>>>>> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>>>>> +		vi->has_cvq = true;
>>>>> +
>>> How about we disable multiqueue if there's no cvq?
>>> Will make logic a bit simpler, won't it?
>> multiqueues don't really depend on cvq. Does this added complexity really justifies adding an artificial limit?
> Well !cvq support is a legacy feature: the reason we support it
> in driver is to avoid breaking on old hosts. Adding more code to that
> path just doesn't make much sense since old hosts won't have mq.
>

After some thought about this, maybe there's no need to the cvq for the 
negotiation if we want support only two modes ( 1 tx/rx queue pair and N 
tx/rx queue pairs). We can do this just through the feature bit negotiation.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Wang July 30, 2012, 3:29 a.m. UTC | #11
On 07/23/2012 05:28 PM, Sasha Levin wrote:
> On 07/23/2012 07:54 AM, Jason Wang wrote:
>> On 07/21/2012 08:02 PM, Sasha Levin wrote:
>>> On 07/20/2012 03:40 PM, Michael S. Tsirkin wrote:
>>>>> -    err = init_vqs(vi);
>>>>>> +    if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>>>>>> +        vi->has_cvq = true;
>>>>>> +
>>>> How about we disable multiqueue if there's no cvq?
>>>> Will make logic a bit simpler, won't it?
>>> multiqueues don't really depend on cvq. Does this added complexity really justifies adding an artificial limit?
>>>
>> Yes, it does not depends on cvq. Cvq were just used to negotiate the number of queues a guest wishes to use which is really useful (at least for now). Since multiqueue can not out-perform for single queue in every kinds of workloads or benchmark, so we want to let guest driver use single queue by default even when multiqueue were enabled by management software and let use to enalbe it through ethtool. So user could not feel regression when it switch to use a multiqueue capable driver and backend.
> Why would you limit it to a single vq if the user has specified a different number of vqs (>1) in the virtio-net device config?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

The only reason is to prevent the user from seeing the regression. The 
performance of small packet sending is wrose than single queue, it tends 
to send more but small packets when multiqueue is enabled. If we make 
multiqueue bahave as good as single queue, we can remove this limit.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Wang July 30, 2012, 5:15 a.m. UTC | #12
On 07/29/2012 05:50 PM, Michael S. Tsirkin wrote:
> On Mon, Jul 23, 2012 at 01:48:35PM +0800, Jason Wang wrote:
>>>> +	}
>>>>
>>>> -	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
>>>> +	ret = vi->vdev->config->find_vqs(vi->vdev, total_vqs, vqs, callbacks,
>>>> +					 (const char **)names);
>>>> +	if (ret)
>>>> +		goto err;
>>>> +
>>>> +	if (vi->has_cvq)
>>>>   		vi->cvq = vqs[2];
>>>>
>>>> -		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
>>>> -			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
>>>> +	for (i = 0; i<   vi->num_queue_pairs * 2; i += 2) {
>>>> +		int j = i == 0 ? i : i + vi->has_cvq;
>>>> +		vi->rq[i / 2]->vq = vqs[j];
>>>> +		vi->sq[i / 2]->vq = vqs[j + 1];
>>> Same here.
>> Consider the code is really simple, seem no need to use helpers.
> Well it was not simple to at least one reader :)
> The problem is not this logic is complex,
> it is that it is spread all over the code.
>
> If we had e.g. vnet_tx_vqn_to_queuenum vnet_tx_queuenum_to_vqn
> and same for rx, then the logic would all be
> in one place, and have a tidy comment on top explaining
> the VQ numbering scheme.
>

Looks reasonable, thanks.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sasha Levin July 30, 2012, 1 p.m. UTC | #13
On 07/29/2012 11:44 AM, Michael S. Tsirkin wrote:
> On Sat, Jul 21, 2012 at 02:02:58PM +0200, Sasha Levin wrote:
>> On 07/20/2012 03:40 PM, Michael S. Tsirkin wrote:
>>>> -	err = init_vqs(vi);
>>>>> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>>>>> +		vi->has_cvq = true;
>>>>> +
>>> How about we disable multiqueue if there's no cvq?
>>> Will make logic a bit simpler, won't it?
>>
>> multiqueues don't really depend on cvq. Does this added complexity really justifies adding an artificial limit?
> 
> Well !cvq support is a legacy feature: the reason we support it
> in driver is to avoid breaking on old hosts. Adding more code to that
> path just doesn't make much sense since old hosts won't have mq.

Is it really a legacy feature? The spec suggests that its an optional queue which is not necessary for the operation of the device.

Which is why we never implemented it in lkvm - we weren't interested in any of the features it provided at that time and we could provide high performance with vhost support even without it.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 1db445b..7410187 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,6 +26,7 @@ 
 #include <linux/scatterlist.h>
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
+#include <linux/interrupt.h>
 
 static int napi_weight = 128;
 module_param(napi_weight, int, 0444);
@@ -41,6 +42,8 @@  module_param(gso, bool, 0444);
 #define VIRTNET_SEND_COMMAND_SG_MAX    2
 #define VIRTNET_DRIVER_VERSION "1.0.0"
 
+#define MAX_QUEUES 256
+
 struct virtnet_stats {
 	struct u64_stats_sync tx_syncp;
 	struct u64_stats_sync rx_syncp;
@@ -51,43 +54,69 @@  struct virtnet_stats {
 	u64 rx_packets;
 };
 
-struct virtnet_info {
-	struct virtio_device *vdev;
-	struct virtqueue *rvq, *svq, *cvq;
-	struct net_device *dev;
+/* Internal representation of a send virtqueue */
+struct send_queue {
+	/* Virtqueue associated with this send _queue */
+	struct virtqueue *vq;
+
+	/* TX: fragments + linear part + virtio header */
+	struct scatterlist sg[MAX_SKB_FRAGS + 2];
+};
+
+/* Internal representation of a receive virtqueue */
+struct receive_queue {
+	/* Virtqueue associated with this receive_queue */
+	struct virtqueue *vq;
+
+	/* Back pointer to the virtnet_info */
+	struct virtnet_info *vi;
+
 	struct napi_struct napi;
-	unsigned int status;
 
 	/* Number of input buffers, and max we've ever had. */
 	unsigned int num, max;
 
+	/* Work struct for refilling if we run low on memory. */
+	struct delayed_work refill;
+
+	/* Chain pages by the private ptr. */
+	struct page *pages;
+
+	/* RX: fragments + linear part + virtio header */
+	struct scatterlist sg[MAX_SKB_FRAGS + 2];
+};
+
+struct virtnet_info {
+	u16 num_queue_pairs;		/* # of RX/TX vq pairs */
+
+	struct send_queue *sq[MAX_QUEUES] ____cacheline_aligned_in_smp;
+	struct receive_queue *rq[MAX_QUEUES] ____cacheline_aligned_in_smp;
+	struct virtqueue *cvq;
+
+	struct virtio_device *vdev;
+	struct net_device *dev;
+	unsigned int status;
+
 	/* I like... big packets and I cannot lie! */
 	bool big_packets;
 
 	/* Host will merge rx buffers for big packets (shake it! shake it!) */
 	bool mergeable_rx_bufs;
 
+	/* Has control virtqueue */
+	bool has_cvq;
+
 	/* enable config space updates */
 	bool config_enable;
 
 	/* Active statistics */
 	struct virtnet_stats __percpu *stats;
 
-	/* Work struct for refilling if we run low on memory. */
-	struct delayed_work refill;
-
 	/* Work struct for config space updates */
 	struct work_struct config_work;
 
 	/* Lock for config space updates */
 	struct mutex config_lock;
-
-	/* Chain pages by the private ptr. */
-	struct page *pages;
-
-	/* fragments + linear part + virtio header */
-	struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
-	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
 };
 
 struct skb_vnet_hdr {
@@ -108,6 +137,22 @@  struct padded_vnet_hdr {
 	char padding[6];
 };
 
+static inline int txq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
+{
+	int ret = virtqueue_get_queue_index(vq);
+
+	/* skip ctrl vq */
+	if (vi->has_cvq)
+		return (ret - 1) / 2;
+	else
+		return ret / 2;
+}
+
+static inline int rxq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
+{
+	return virtqueue_get_queue_index(vq) / 2;
+}
+
 static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
 {
 	return (struct skb_vnet_hdr *)skb->cb;
@@ -117,22 +162,22 @@  static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
  * private is used to chain pages for big packets, put the whole
  * most recent used list in the beginning for reuse
  */
-static void give_pages(struct virtnet_info *vi, struct page *page)
+static void give_pages(struct receive_queue *rq, struct page *page)
 {
 	struct page *end;
 
 	/* Find end of list, sew whole thing into vi->pages. */
 	for (end = page; end->private; end = (struct page *)end->private);
-	end->private = (unsigned long)vi->pages;
-	vi->pages = page;
+	end->private = (unsigned long)rq->pages;
+	rq->pages = page;
 }
 
-static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
+static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
 {
-	struct page *p = vi->pages;
+	struct page *p = rq->pages;
 
 	if (p) {
-		vi->pages = (struct page *)p->private;
+		rq->pages = (struct page *)p->private;
 		/* clear private here, it is used to chain pages */
 		p->private = 0;
 	} else
@@ -140,15 +185,15 @@  static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
 	return p;
 }
 
-static void skb_xmit_done(struct virtqueue *svq)
+static void skb_xmit_done(struct virtqueue *vq)
 {
-	struct virtnet_info *vi = svq->vdev->priv;
+	struct virtnet_info *vi = vq->vdev->priv;
 
 	/* Suppress further interrupts. */
-	virtqueue_disable_cb(svq);
+	virtqueue_disable_cb(vq);
 
 	/* We were probably waiting for more output buffers. */
-	netif_wake_queue(vi->dev);
+	netif_wake_subqueue(vi->dev, txq_get_qnum(vi, vq));
 }
 
 static void set_skb_frag(struct sk_buff *skb, struct page *page,
@@ -167,9 +212,10 @@  static void set_skb_frag(struct sk_buff *skb, struct page *page,
 }
 
 /* Called from bottom half context */
-static struct sk_buff *page_to_skb(struct virtnet_info *vi,
+static struct sk_buff *page_to_skb(struct receive_queue *rq,
 				   struct page *page, unsigned int len)
 {
+	struct virtnet_info *vi = rq->vi;
 	struct sk_buff *skb;
 	struct skb_vnet_hdr *hdr;
 	unsigned int copy, hdr_len, offset;
@@ -225,12 +271,12 @@  static struct sk_buff *page_to_skb(struct virtnet_info *vi,
 	}
 
 	if (page)
-		give_pages(vi, page);
+		give_pages(rq, page);
 
 	return skb;
 }
 
-static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
+static int receive_mergeable(struct receive_queue *rq, struct sk_buff *skb)
 {
 	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
 	struct page *page;
@@ -244,7 +290,7 @@  static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
 			skb->dev->stats.rx_length_errors++;
 			return -EINVAL;
 		}
-		page = virtqueue_get_buf(vi->rvq, &len);
+		page = virtqueue_get_buf(rq->vq, &len);
 		if (!page) {
 			pr_debug("%s: rx error: %d buffers missing\n",
 				 skb->dev->name, hdr->mhdr.num_buffers);
@@ -257,13 +303,14 @@  static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
 
 		set_skb_frag(skb, page, 0, &len);
 
-		--vi->num;
+		--rq->num;
 	}
 	return 0;
 }
 
-static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
+static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 {
+	struct net_device *dev = rq->vi->dev;
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
 	struct sk_buff *skb;
@@ -274,7 +321,7 @@  static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
 		pr_debug("%s: short packet %i\n", dev->name, len);
 		dev->stats.rx_length_errors++;
 		if (vi->mergeable_rx_bufs || vi->big_packets)
-			give_pages(vi, buf);
+			give_pages(rq, buf);
 		else
 			dev_kfree_skb(buf);
 		return;
@@ -286,14 +333,14 @@  static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
 		skb_trim(skb, len);
 	} else {
 		page = buf;
-		skb = page_to_skb(vi, page, len);
+		skb = page_to_skb(rq, page, len);
 		if (unlikely(!skb)) {
 			dev->stats.rx_dropped++;
-			give_pages(vi, page);
+			give_pages(rq, page);
 			return;
 		}
 		if (vi->mergeable_rx_bufs)
-			if (receive_mergeable(vi, skb)) {
+			if (receive_mergeable(rq, skb)) {
 				dev_kfree_skb(skb);
 				return;
 			}
@@ -363,90 +410,91 @@  frame_err:
 	dev_kfree_skb(skb);
 }
 
-static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
+static int add_recvbuf_small(struct receive_queue *rq, gfp_t gfp)
 {
 	struct sk_buff *skb;
 	struct skb_vnet_hdr *hdr;
 	int err;
 
-	skb = __netdev_alloc_skb_ip_align(vi->dev, MAX_PACKET_LEN, gfp);
+	skb = __netdev_alloc_skb_ip_align(rq->vi->dev, MAX_PACKET_LEN, gfp);
 	if (unlikely(!skb))
 		return -ENOMEM;
 
 	skb_put(skb, MAX_PACKET_LEN);
 
 	hdr = skb_vnet_hdr(skb);
-	sg_set_buf(vi->rx_sg, &hdr->hdr, sizeof hdr->hdr);
+	sg_set_buf(rq->sg, &hdr->hdr, sizeof hdr->hdr);
+
+	skb_to_sgvec(skb, rq->sg + 1, 0, skb->len);
 
-	skb_to_sgvec(skb, vi->rx_sg + 1, 0, skb->len);
+	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 2, skb, gfp);
 
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 2, skb, gfp);
 	if (err < 0)
 		dev_kfree_skb(skb);
 
 	return err;
 }
 
-static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
+static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 {
 	struct page *first, *list = NULL;
 	char *p;
 	int i, err, offset;
 
-	/* page in vi->rx_sg[MAX_SKB_FRAGS + 1] is list tail */
+	/* page in rq->sg[MAX_SKB_FRAGS + 1] is list tail */
 	for (i = MAX_SKB_FRAGS + 1; i > 1; --i) {
-		first = get_a_page(vi, gfp);
+		first = get_a_page(rq, gfp);
 		if (!first) {
 			if (list)
-				give_pages(vi, list);
+				give_pages(rq, list);
 			return -ENOMEM;
 		}
-		sg_set_buf(&vi->rx_sg[i], page_address(first), PAGE_SIZE);
+		sg_set_buf(&rq->sg[i], page_address(first), PAGE_SIZE);
 
 		/* chain new page in list head to match sg */
 		first->private = (unsigned long)list;
 		list = first;
 	}
 
-	first = get_a_page(vi, gfp);
+	first = get_a_page(rq, gfp);
 	if (!first) {
-		give_pages(vi, list);
+		give_pages(rq, list);
 		return -ENOMEM;
 	}
 	p = page_address(first);
 
-	/* vi->rx_sg[0], vi->rx_sg[1] share the same page */
-	/* a separated vi->rx_sg[0] for virtio_net_hdr only due to QEMU bug */
-	sg_set_buf(&vi->rx_sg[0], p, sizeof(struct virtio_net_hdr));
+	/* rq->sg[0], rq->sg[1] share the same page */
+	/* a separated rq->sg[0] for virtio_net_hdr only due to QEMU bug */
+	sg_set_buf(&rq->sg[0], p, sizeof(struct virtio_net_hdr));
 
-	/* vi->rx_sg[1] for data packet, from offset */
+	/* rq->sg[1] for data packet, from offset */
 	offset = sizeof(struct padded_vnet_hdr);
-	sg_set_buf(&vi->rx_sg[1], p + offset, PAGE_SIZE - offset);
+	sg_set_buf(&rq->sg[1], p + offset, PAGE_SIZE - offset);
 
 	/* chain first in list head */
 	first->private = (unsigned long)list;
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
+	err = virtqueue_add_buf(rq->vq, rq->sg, 0, MAX_SKB_FRAGS + 2,
 				first, gfp);
 	if (err < 0)
-		give_pages(vi, first);
+		give_pages(rq, first);
 
 	return err;
 }
 
-static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
+static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
 	struct page *page;
 	int err;
 
-	page = get_a_page(vi, gfp);
+	page = get_a_page(rq, gfp);
 	if (!page)
 		return -ENOMEM;
 
-	sg_init_one(vi->rx_sg, page_address(page), PAGE_SIZE);
+	sg_init_one(rq->sg, page_address(page), PAGE_SIZE);
 
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 1, page, gfp);
+	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 1, page, gfp);
 	if (err < 0)
-		give_pages(vi, page);
+		give_pages(rq, page);
 
 	return err;
 }
@@ -458,97 +506,104 @@  static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
  * before we're receiving packets, or from refill_work which is
  * careful to disable receiving (using napi_disable).
  */
-static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
+static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
 {
+	struct virtnet_info *vi = rq->vi;
 	int err;
 	bool oom;
 
 	do {
 		if (vi->mergeable_rx_bufs)
-			err = add_recvbuf_mergeable(vi, gfp);
+			err = add_recvbuf_mergeable(rq, gfp);
 		else if (vi->big_packets)
-			err = add_recvbuf_big(vi, gfp);
+			err = add_recvbuf_big(rq, gfp);
 		else
-			err = add_recvbuf_small(vi, gfp);
+			err = add_recvbuf_small(rq, gfp);
 
 		oom = err == -ENOMEM;
 		if (err < 0)
 			break;
-		++vi->num;
+		++rq->num;
 	} while (err > 0);
-	if (unlikely(vi->num > vi->max))
-		vi->max = vi->num;
-	virtqueue_kick(vi->rvq);
+	if (unlikely(rq->num > rq->max))
+		rq->max = rq->num;
+	virtqueue_kick(rq->vq);
 	return !oom;
 }
 
-static void skb_recv_done(struct virtqueue *rvq)
+static void skb_recv_done(struct virtqueue *vq)
 {
-	struct virtnet_info *vi = rvq->vdev->priv;
+	struct virtnet_info *vi = vq->vdev->priv;
+	struct napi_struct *napi = &vi->rq[rxq_get_qnum(vi, vq)]->napi;
+
 	/* Schedule NAPI, Suppress further interrupts if successful. */
-	if (napi_schedule_prep(&vi->napi)) {
-		virtqueue_disable_cb(rvq);
-		__napi_schedule(&vi->napi);
+	if (napi_schedule_prep(napi)) {
+		virtqueue_disable_cb(vq);
+		__napi_schedule(napi);
 	}
 }
 
-static void virtnet_napi_enable(struct virtnet_info *vi)
+static void virtnet_napi_enable(struct receive_queue *rq)
 {
-	napi_enable(&vi->napi);
+	napi_enable(&rq->napi);
 
 	/* If all buffers were filled by other side before we napi_enabled, we
 	 * won't get another interrupt, so process any outstanding packets
 	 * now.  virtnet_poll wants re-enable the queue, so we disable here.
 	 * We synchronize against interrupts via NAPI_STATE_SCHED */
-	if (napi_schedule_prep(&vi->napi)) {
-		virtqueue_disable_cb(vi->rvq);
+	if (napi_schedule_prep(&rq->napi)) {
+		virtqueue_disable_cb(rq->vq);
 		local_bh_disable();
-		__napi_schedule(&vi->napi);
+		__napi_schedule(&rq->napi);
 		local_bh_enable();
 	}
 }
 
 static void refill_work(struct work_struct *work)
 {
-	struct virtnet_info *vi;
+	struct napi_struct *napi;
+	struct receive_queue *rq;
 	bool still_empty;
 
-	vi = container_of(work, struct virtnet_info, refill.work);
-	napi_disable(&vi->napi);
-	still_empty = !try_fill_recv(vi, GFP_KERNEL);
-	virtnet_napi_enable(vi);
+	rq = container_of(work, struct receive_queue, refill.work);
+	napi = &rq->napi;
+
+	napi_disable(napi);
+	still_empty = !try_fill_recv(rq, GFP_KERNEL);
+	virtnet_napi_enable(rq);
 
 	/* In theory, this can happen: if we don't get any buffers in
 	 * we will *never* try to fill again. */
 	if (still_empty)
-		queue_delayed_work(system_nrt_wq, &vi->refill, HZ/2);
+		queue_delayed_work(system_nrt_wq, &rq->refill, HZ/2);
 }
 
 static int virtnet_poll(struct napi_struct *napi, int budget)
 {
-	struct virtnet_info *vi = container_of(napi, struct virtnet_info, napi);
+	struct receive_queue *rq = container_of(napi, struct receive_queue,
+						napi);
 	void *buf;
 	unsigned int len, received = 0;
 
 again:
 	while (received < budget &&
-	       (buf = virtqueue_get_buf(vi->rvq, &len)) != NULL) {
-		receive_buf(vi->dev, buf, len);
-		--vi->num;
+	       (buf = virtqueue_get_buf(rq->vq, &len)) != NULL) {
+		receive_buf(rq, buf, len);
+		--rq->num;
 		received++;
 	}
 
-	if (vi->num < vi->max / 2) {
-		if (!try_fill_recv(vi, GFP_ATOMIC))
-			queue_delayed_work(system_nrt_wq, &vi->refill, 0);
+	if (rq->num < rq->max / 2) {
+		if (!try_fill_recv(rq, GFP_ATOMIC))
+			queue_delayed_work(system_nrt_wq, &rq->refill, 0);
 	}
 
 	/* Out of packets? */
 	if (received < budget) {
 		napi_complete(napi);
-		if (unlikely(!virtqueue_enable_cb(vi->rvq)) &&
+		if (unlikely(!virtqueue_enable_cb(rq->vq)) &&
 		    napi_schedule_prep(napi)) {
-			virtqueue_disable_cb(vi->rvq);
+			virtqueue_disable_cb(rq->vq);
 			__napi_schedule(napi);
 			goto again;
 		}
@@ -557,13 +612,14 @@  again:
 	return received;
 }
 
-static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
+static unsigned int free_old_xmit_skbs(struct virtnet_info *vi,
+				       struct virtqueue *vq)
 {
 	struct sk_buff *skb;
 	unsigned int len, tot_sgs = 0;
 	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
 
-	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
+	while ((skb = virtqueue_get_buf(vq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
 
 		u64_stats_update_begin(&stats->tx_syncp);
@@ -577,7 +633,8 @@  static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
 	return tot_sgs;
 }
 
-static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
+static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb,
+		    struct virtqueue *vq, struct scatterlist *sg)
 {
 	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
@@ -615,44 +672,47 @@  static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
 
 	/* Encode metadata header at front. */
 	if (vi->mergeable_rx_bufs)
-		sg_set_buf(vi->tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
+		sg_set_buf(sg, &hdr->mhdr, sizeof hdr->mhdr);
 	else
-		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
+		sg_set_buf(sg, &hdr->hdr, sizeof hdr->hdr);
 
-	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
-	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
+	hdr->num_sg = skb_to_sgvec(skb, sg + 1, 0, skb->len) + 1;
+	return virtqueue_add_buf(vq, sg, hdr->num_sg,
 				 0, skb, GFP_ATOMIC);
 }
 
 static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	int qnum = skb_get_queue_mapping(skb);
+	struct virtqueue *vq = vi->sq[qnum]->vq;
 	int capacity;
 
 	/* Free up any pending old buffers before queueing new ones. */
-	free_old_xmit_skbs(vi);
+	free_old_xmit_skbs(vi, vq);
 
 	/* Try to transmit */
-	capacity = xmit_skb(vi, skb);
+	capacity = xmit_skb(vi, skb, vq, vi->sq[qnum]->sg);
 
 	/* This can happen with OOM and indirect buffers. */
 	if (unlikely(capacity < 0)) {
 		if (likely(capacity == -ENOMEM)) {
 			if (net_ratelimit())
 				dev_warn(&dev->dev,
-					 "TX queue failure: out of memory\n");
+					"TXQ (%d) failure: out of memory\n",
+					qnum);
 		} else {
 			dev->stats.tx_fifo_errors++;
 			if (net_ratelimit())
 				dev_warn(&dev->dev,
-					 "Unexpected TX queue failure: %d\n",
-					 capacity);
+					"Unexpected TXQ (%d) failure: %d\n",
+					qnum, capacity);
 		}
 		dev->stats.tx_dropped++;
 		kfree_skb(skb);
 		return NETDEV_TX_OK;
 	}
-	virtqueue_kick(vi->svq);
+	virtqueue_kick(vq);
 
 	/* Don't wait up for transmitted skbs to be freed. */
 	skb_orphan(skb);
@@ -661,13 +721,13 @@  static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (capacity < 2+MAX_SKB_FRAGS) {
-		netif_stop_queue(dev);
-		if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
+		netif_stop_subqueue(dev, qnum);
+		if (unlikely(!virtqueue_enable_cb_delayed(vq))) {
 			/* More just got used, free them then recheck. */
-			capacity += free_old_xmit_skbs(vi);
+			capacity += free_old_xmit_skbs(vi, vq);
 			if (capacity >= 2+MAX_SKB_FRAGS) {
-				netif_start_queue(dev);
-				virtqueue_disable_cb(vi->svq);
+				netif_start_subqueue(dev, qnum);
+				virtqueue_disable_cb(vq);
 			}
 		}
 	}
@@ -700,7 +760,8 @@  static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
 	unsigned int start;
 
 	for_each_possible_cpu(cpu) {
-		struct virtnet_stats *stats = per_cpu_ptr(vi->stats, cpu);
+		struct virtnet_stats __percpu *stats
+			= per_cpu_ptr(vi->stats, cpu);
 		u64 tpackets, tbytes, rpackets, rbytes;
 
 		do {
@@ -734,20 +795,26 @@  static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
 static void virtnet_netpoll(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	int i;
 
-	napi_schedule(&vi->napi);
+	for (i = 0; i < vi->num_queue_pairs; i++)
+		napi_schedule(&vi->rq[i]->napi);
 }
 #endif
 
 static int virtnet_open(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	int i;
 
-	/* Make sure we have some buffers: if oom use wq. */
-	if (!try_fill_recv(vi, GFP_KERNEL))
-		queue_delayed_work(system_nrt_wq, &vi->refill, 0);
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		/* Make sure we have some buffers: if oom use wq. */
+		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
+			queue_delayed_work(system_nrt_wq,
+					   &vi->rq[i]->refill, 0);
+		virtnet_napi_enable(vi->rq[i]);
+	}
 
-	virtnet_napi_enable(vi);
 	return 0;
 }
 
@@ -809,10 +876,13 @@  static void virtnet_ack_link_announce(struct virtnet_info *vi)
 static int virtnet_close(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	int i;
 
 	/* Make sure refill_work doesn't re-enable napi! */
-	cancel_delayed_work_sync(&vi->refill);
-	napi_disable(&vi->napi);
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		cancel_delayed_work_sync(&vi->rq[i]->refill);
+		napi_disable(&vi->rq[i]->napi);
+	}
 
 	return 0;
 }
@@ -924,11 +994,10 @@  static void virtnet_get_ringparam(struct net_device *dev,
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 
-	ring->rx_max_pending = virtqueue_get_vring_size(vi->rvq);
-	ring->tx_max_pending = virtqueue_get_vring_size(vi->svq);
+	ring->rx_max_pending = virtqueue_get_vring_size(vi->rq[0]->vq);
+	ring->tx_max_pending = virtqueue_get_vring_size(vi->sq[0]->vq);
 	ring->rx_pending = ring->rx_max_pending;
 	ring->tx_pending = ring->tx_max_pending;
-
 }
 
 
@@ -961,6 +1030,19 @@  static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
 	return 0;
 }
 
+/* To avoid contending a lock hold by a vcpu who would exit to host, select the
+ * txq based on the processor id.
+ */
+static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb)
+{
+	int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
+		  smp_processor_id();
+
+	while (unlikely(txq >= dev->real_num_tx_queues))
+		txq -= dev->real_num_tx_queues;
+	return txq;
+}
+
 static const struct net_device_ops virtnet_netdev = {
 	.ndo_open            = virtnet_open,
 	.ndo_stop   	     = virtnet_close,
@@ -972,6 +1054,7 @@  static const struct net_device_ops virtnet_netdev = {
 	.ndo_get_stats64     = virtnet_stats,
 	.ndo_vlan_rx_add_vid = virtnet_vlan_rx_add_vid,
 	.ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
+	.ndo_select_queue     = virtnet_select_queue,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_poll_controller = virtnet_netpoll,
 #endif
@@ -1007,10 +1090,10 @@  static void virtnet_config_changed_work(struct work_struct *work)
 
 	if (vi->status & VIRTIO_NET_S_LINK_UP) {
 		netif_carrier_on(vi->dev);
-		netif_wake_queue(vi->dev);
+		netif_tx_wake_all_queues(vi->dev);
 	} else {
 		netif_carrier_off(vi->dev);
-		netif_stop_queue(vi->dev);
+		netif_tx_stop_all_queues(vi->dev);
 	}
 done:
 	mutex_unlock(&vi->config_lock);
@@ -1023,41 +1106,217 @@  static void virtnet_config_changed(struct virtio_device *vdev)
 	queue_work(system_nrt_wq, &vi->config_work);
 }
 
-static int init_vqs(struct virtnet_info *vi)
+static void free_receive_bufs(struct virtnet_info *vi)
+{
+	int i;
+
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		while (vi->rq[i]->pages)
+			__free_pages(get_a_page(vi->rq[i], GFP_KERNEL), 0);
+	}
+}
+
+/* Free memory allocated for send and receive queues */
+static void virtnet_free_queues(struct virtnet_info *vi)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { skb_recv_done, skb_xmit_done, NULL};
-	const char *names[] = { "input", "output", "control" };
-	int nvqs, err;
+	int i;
 
-	/* We expect two virtqueues, receive then send,
-	 * and optionally control. */
-	nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ? 3 : 2;
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		kfree(vi->rq[i]);
+		vi->rq[i] = NULL;
+		kfree(vi->sq[i]);
+		vi->sq[i] = NULL;
+	}
+}
 
-	err = vi->vdev->config->find_vqs(vi->vdev, nvqs, vqs, callbacks, names);
-	if (err)
-		return err;
+static void free_unused_bufs(struct virtnet_info *vi)
+{
+	void *buf;
+	int i;
+
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		struct virtqueue *vq = vi->sq[i]->vq;
+
+		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL)
+			dev_kfree_skb(buf);
+	}
+
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		struct virtqueue *vq = vi->rq[i]->vq;
+
+		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
+			if (vi->mergeable_rx_bufs || vi->big_packets)
+				give_pages(vi->rq[i], buf);
+			else
+				dev_kfree_skb(buf);
+			--vi->rq[i]->num;
+		}
+		BUG_ON(vi->rq[i]->num != 0);
+	}
+}
+
+static void virtnet_set_affinity(struct virtnet_info *vi, bool set)
+{
+	int i;
+
+	if (vi->num_queue_pairs == 1)
+		return;
+
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		int cpu = set ? i : -1;
+		virtqueue_set_affinity(vi->rq[i]->vq, cpu);
+		virtqueue_set_affinity(vi->sq[i]->vq, cpu);
+	}
+	return;
+}
+
+static void virtnet_del_vqs(struct virtnet_info *vi)
+{
+	struct virtio_device *vdev = vi->vdev;
+
+	virtnet_set_affinity(vi, false);
+
+	vdev->config->del_vqs(vdev);
+
+	virtnet_free_queues(vi);
+}
+
+static int virtnet_find_vqs(struct virtnet_info *vi)
+{
+	vq_callback_t **callbacks;
+	struct virtqueue **vqs;
+	int ret = -ENOMEM;
+	int i, total_vqs;
+	char **names;
 
-	vi->rvq = vqs[0];
-	vi->svq = vqs[1];
+	/*
+	 * We expect 1 RX virtqueue followed by 1 TX virtqueue, followd by
+	 * possible control virtqueue and followed by the same
+	 * 'vi->num_queue_pairs-1' more times
+	 */
+	total_vqs = vi->num_queue_pairs * 2 +
+		    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ);
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kmalloc(total_vqs * sizeof(*vqs), GFP_KERNEL);
+	callbacks = kmalloc(total_vqs * sizeof(*callbacks), GFP_KERNEL);
+	names = kmalloc(total_vqs * sizeof(*names), GFP_KERNEL);
+	if (!vqs || !callbacks || !names)
+		goto err;
+
+	/* Parameters for control virtqueue, if any */
+	if (vi->has_cvq) {
+		callbacks[2] = NULL;
+		names[2] = "control";
+	}
+
+	/* Allocate/initialize parameters for send/receive virtqueues */
+	for (i = 0; i < vi->num_queue_pairs * 2; i += 2) {
+		int j = (i == 0 ? i : i + vi->has_cvq);
+		callbacks[j] = skb_recv_done;
+		callbacks[j + 1] = skb_xmit_done;
+		names[j] = kasprintf(GFP_KERNEL, "input.%d", i / 2);
+		names[j + 1] = kasprintf(GFP_KERNEL, "output.%d", i / 2);
+	}
 
-	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
+	ret = vi->vdev->config->find_vqs(vi->vdev, total_vqs, vqs, callbacks,
+					 (const char **)names);
+	if (ret)
+		goto err;
+
+	if (vi->has_cvq)
 		vi->cvq = vqs[2];
 
-		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
-			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
+	for (i = 0; i < vi->num_queue_pairs * 2; i += 2) {
+		int j = i == 0 ? i : i + vi->has_cvq;
+		vi->rq[i / 2]->vq = vqs[j];
+		vi->sq[i / 2]->vq = vqs[j + 1];
 	}
-	return 0;
+
+err:
+	if (ret && names)
+		for (i = 0; i < vi->num_queue_pairs * 2; i++)
+			kfree(names[i]);
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
+
+	return ret;
+}
+
+static int virtnet_alloc_queues(struct virtnet_info *vi)
+{
+	int ret = -ENOMEM;
+	int i;
+
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		vi->rq[i] = kzalloc(sizeof(*vi->rq[i]), GFP_KERNEL);
+		vi->sq[i] = kzalloc(sizeof(*vi->sq[i]), GFP_KERNEL);
+		if (!vi->rq[i] || !vi->sq[i])
+			goto err;
+	}
+
+	ret = 0;
+
+	/* setup initial receive and send queue parameters */
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		vi->rq[i]->vi = vi;
+		vi->rq[i]->pages = NULL;
+		INIT_DELAYED_WORK(&vi->rq[i]->refill, refill_work);
+		netif_napi_add(vi->dev, &vi->rq[i]->napi, virtnet_poll,
+			       napi_weight);
+
+		sg_init_table(vi->rq[i]->sg, ARRAY_SIZE(vi->rq[i]->sg));
+		sg_init_table(vi->sq[i]->sg, ARRAY_SIZE(vi->sq[i]->sg));
+	}
+
+err:
+	if (ret)
+		virtnet_free_queues(vi);
+
+	return ret;
+}
+
+static int virtnet_setup_vqs(struct virtnet_info *vi)
+{
+	int ret;
+
+	/* Allocate send & receive queues */
+	ret = virtnet_alloc_queues(vi);
+	if (!ret) {
+		ret = virtnet_find_vqs(vi);
+		if (ret)
+			virtnet_free_queues(vi);
+		else
+			virtnet_set_affinity(vi, true);
+	}
+
+	return ret;
 }
 
 static int virtnet_probe(struct virtio_device *vdev)
 {
-	int err;
+	int i, err;
 	struct net_device *dev;
 	struct virtnet_info *vi;
+	u16 num_queues, num_queue_pairs;
+
+	/* Find if host supports multiqueue virtio_net device */
+	err = virtio_config_val(vdev, VIRTIO_NET_F_MULTIQUEUE,
+				offsetof(struct virtio_net_config,
+				num_queues), &num_queues);
+
+	/* We need atleast 2 queue's */
+	if (err || num_queues < 2)
+		num_queues = 2;
+	if (num_queues > MAX_QUEUES * 2)
+		num_queues = MAX_QUEUES;
+
+	num_queue_pairs = num_queues / 2;
 
 	/* Allocate ourselves a network device with room for our info */
-	dev = alloc_etherdev(sizeof(struct virtnet_info));
+	dev = alloc_etherdev_mq(sizeof(struct virtnet_info), num_queue_pairs);
 	if (!dev)
 		return -ENOMEM;
 
@@ -1103,22 +1362,18 @@  static int virtnet_probe(struct virtio_device *vdev)
 
 	/* Set up our device-specific information */
 	vi = netdev_priv(dev);
-	netif_napi_add(dev, &vi->napi, virtnet_poll, napi_weight);
 	vi->dev = dev;
 	vi->vdev = vdev;
 	vdev->priv = vi;
-	vi->pages = NULL;
 	vi->stats = alloc_percpu(struct virtnet_stats);
 	err = -ENOMEM;
 	if (vi->stats == NULL)
-		goto free;
+		goto free_netdev;
 
-	INIT_DELAYED_WORK(&vi->refill, refill_work);
 	mutex_init(&vi->config_lock);
 	vi->config_enable = true;
 	INIT_WORK(&vi->config_work, virtnet_config_changed_work);
-	sg_init_table(vi->rx_sg, ARRAY_SIZE(vi->rx_sg));
-	sg_init_table(vi->tx_sg, ARRAY_SIZE(vi->tx_sg));
+	vi->num_queue_pairs = num_queue_pairs;
 
 	/* If we can receive ANY GSO packets, we must allocate large ones. */
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
@@ -1129,9 +1384,17 @@  static int virtnet_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
 		vi->mergeable_rx_bufs = true;
 
-	err = init_vqs(vi);
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
+		vi->has_cvq = true;
+
+	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
+	err = virtnet_setup_vqs(vi);
 	if (err)
-		goto free_stats;
+		goto free_netdev;
+
+	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) &&
+	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
+		dev->features |= NETIF_F_HW_VLAN_FILTER;
 
 	err = register_netdev(dev);
 	if (err) {
@@ -1140,12 +1403,15 @@  static int virtnet_probe(struct virtio_device *vdev)
 	}
 
 	/* Last of all, set up some receive buffers. */
-	try_fill_recv(vi, GFP_KERNEL);
-
-	/* If we didn't even get one input buffer, we're useless. */
-	if (vi->num == 0) {
-		err = -ENOMEM;
-		goto unregister;
+	for (i = 0; i < num_queue_pairs; i++) {
+		try_fill_recv(vi->rq[i], GFP_KERNEL);
+
+		/* If we didn't even get one input buffer, we're useless. */
+		if (vi->rq[i]->num == 0) {
+			free_unused_bufs(vi);
+			err = -ENOMEM;
+			goto free_recv_bufs;
+		}
 	}
 
 	/* Assume link up if device can't report link status,
@@ -1158,42 +1424,25 @@  static int virtnet_probe(struct virtio_device *vdev)
 		netif_carrier_on(dev);
 	}
 
-	pr_debug("virtnet: registered device %s\n", dev->name);
+	pr_debug("virtnet: registered device %s with %d RX and TX vq's\n",
+		 dev->name, num_queue_pairs);
+
 	return 0;
 
-unregister:
+free_recv_bufs:
+	free_receive_bufs(vi);
 	unregister_netdev(dev);
+
 free_vqs:
-	vdev->config->del_vqs(vdev);
-free_stats:
-	free_percpu(vi->stats);
-free:
+	for (i = 0; i < num_queue_pairs; i++)
+		cancel_delayed_work_sync(&vi->rq[i]->refill);
+	virtnet_del_vqs(vi);
+
+free_netdev:
 	free_netdev(dev);
 	return err;
 }
 
-static void free_unused_bufs(struct virtnet_info *vi)
-{
-	void *buf;
-	while (1) {
-		buf = virtqueue_detach_unused_buf(vi->svq);
-		if (!buf)
-			break;
-		dev_kfree_skb(buf);
-	}
-	while (1) {
-		buf = virtqueue_detach_unused_buf(vi->rvq);
-		if (!buf)
-			break;
-		if (vi->mergeable_rx_bufs || vi->big_packets)
-			give_pages(vi, buf);
-		else
-			dev_kfree_skb(buf);
-		--vi->num;
-	}
-	BUG_ON(vi->num != 0);
-}
-
 static void remove_vq_common(struct virtnet_info *vi)
 {
 	vi->vdev->config->reset(vi->vdev);
@@ -1201,10 +1450,9 @@  static void remove_vq_common(struct virtnet_info *vi)
 	/* Free unused buffers in both send and recv, if any. */
 	free_unused_bufs(vi);
 
-	vi->vdev->config->del_vqs(vi->vdev);
+	free_receive_bufs(vi);
 
-	while (vi->pages)
-		__free_pages(get_a_page(vi, GFP_KERNEL), 0);
+	virtnet_del_vqs(vi);
 }
 
 static void __devexit virtnet_remove(struct virtio_device *vdev)
@@ -1230,6 +1478,7 @@  static void __devexit virtnet_remove(struct virtio_device *vdev)
 static int virtnet_freeze(struct virtio_device *vdev)
 {
 	struct virtnet_info *vi = vdev->priv;
+	int i;
 
 	/* Prevent config work handler from accessing the device */
 	mutex_lock(&vi->config_lock);
@@ -1237,10 +1486,13 @@  static int virtnet_freeze(struct virtio_device *vdev)
 	mutex_unlock(&vi->config_lock);
 
 	netif_device_detach(vi->dev);
-	cancel_delayed_work_sync(&vi->refill);
+	for (i = 0; i < vi->num_queue_pairs; i++)
+		cancel_delayed_work_sync(&vi->rq[i]->refill);
 
 	if (netif_running(vi->dev))
-		napi_disable(&vi->napi);
+		for (i = 0; i < vi->num_queue_pairs; i++)
+			napi_disable(&vi->rq[i]->napi);
+
 
 	remove_vq_common(vi);
 
@@ -1252,19 +1504,22 @@  static int virtnet_freeze(struct virtio_device *vdev)
 static int virtnet_restore(struct virtio_device *vdev)
 {
 	struct virtnet_info *vi = vdev->priv;
-	int err;
+	int err, i;
 
-	err = init_vqs(vi);
+	err = virtnet_setup_vqs(vi);
 	if (err)
 		return err;
 
 	if (netif_running(vi->dev))
-		virtnet_napi_enable(vi);
+		for (i = 0; i < vi->num_queue_pairs; i++)
+			virtnet_napi_enable(vi->rq[i]);
 
 	netif_device_attach(vi->dev);
 
-	if (!try_fill_recv(vi, GFP_KERNEL))
-		queue_delayed_work(system_nrt_wq, &vi->refill, 0);
+	for (i = 0; i < vi->num_queue_pairs; i++)
+		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
+			queue_delayed_work(system_nrt_wq,
+					   &vi->rq[i]->refill, 0);
 
 	mutex_lock(&vi->config_lock);
 	vi->config_enable = true;
@@ -1287,7 +1542,7 @@  static unsigned int features[] = {
 	VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
 	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
 	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN,
-	VIRTIO_NET_F_GUEST_ANNOUNCE,
+	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MULTIQUEUE,
 };
 
 static struct virtio_driver virtio_net_driver = {
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 1bc7e30..60f09ff 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -61,6 +61,8 @@  struct virtio_net_config {
 	__u8 mac[6];
 	/* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
 	__u16 status;
+	/* Total number of RX/TX queues */
+	__u16 num_queues;
 } __attribute__((packed));
 
 /* This is the first element of the scatter-gather list.  If you don't