
[V5,12/12] net: netvsc: Add Isolation VM support for netvsc driver

Message ID 20210914133916.1440931-13-ltykernel@gmail.com (mailing list archive)
State Not Applicable
Headers show
Series x86/Hyper-V: Add Hyper-V Isolation VM support

Commit Message

Tianyu Lan Sept. 14, 2021, 1:39 p.m. UTC
From: Tianyu Lan <Tianyu.Lan@microsoft.com>

In Isolation VM, all memory shared with the host needs to be marked visible
to the host via a hypercall. vmbus_establish_gpadl() has already done this
for the netvsc rx/tx ring buffers. The page buffers used by
vmbus_sendpacket_pagebuffer() still need to be handled. Use the DMA API to
map/unmap this memory when sending/receiving packets; the Hyper-V swiotlb
bounce buffer DMA address will be returned. The swiotlb bounce buffer has
already been marked visible to the host during boot.

Allocate the rx/tx ring buffers via alloc_pages() in Isolation VM and map
these pages via vmap(). After calling vmbus_establish_gpadl(), which marks
these pages visible to the host, unmap them to release the virtual addresses
that map the physical addresses below shared_gpa_boundary, and remap them in
the extra address space via vmap_pfn().

Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
---
Changes since v4:
	* Allocate the rx/tx ring buffers via alloc_pages() in Isolation VM.
	* Map pages after calling vmbus_establish_gpadl().
	* Call dma_set_min_align_mask() for the netvsc driver.

Changes since v3:
	* Add a comment to explain why dma_map_sg() is not used.
	* Fix some error handling.
---
 drivers/net/hyperv/hyperv_net.h   |   7 +
 drivers/net/hyperv/netvsc.c       | 287 +++++++++++++++++++++++++++++-
 drivers/net/hyperv/netvsc_drv.c   |   1 +
 drivers/net/hyperv/rndis_filter.c |   2 +
 include/linux/hyperv.h            |   5 +
 5 files changed, 296 insertions(+), 6 deletions(-)

Comments

Haiyang Zhang Sept. 14, 2021, 3:49 p.m. UTC | #1
> -----Original Message-----
> From: Tianyu Lan <ltykernel@gmail.com>
> Sent: Tuesday, September 14, 2021 9:39 AM
> To: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; Stephen Hemminger <sthemmin@microsoft.com>;
> wei.liu@kernel.org; Dexuan Cui <decui@microsoft.com>; tglx@linutronix.de;
> mingo@redhat.com; bp@alien8.de; x86@kernel.org; hpa@zytor.com;
> dave.hansen@linux.intel.com; luto@kernel.org; peterz@infradead.org;
> konrad.wilk@oracle.com; boris.ostrovsky@oracle.com; jgross@suse.com;
> sstabellini@kernel.org; joro@8bytes.org; will@kernel.org;
> davem@davemloft.net; kuba@kernel.org; jejb@linux.ibm.com;
> martin.petersen@oracle.com; gregkh@linuxfoundation.org; arnd@arndb.de;
> hch@lst.de; m.szyprowski@samsung.com; robin.murphy@arm.com;
> brijesh.singh@amd.com; Tianyu Lan <Tianyu.Lan@microsoft.com>;
> thomas.lendacky@amd.com; pgonda@google.com; akpm@linux-foundation.org;
> kirill.shutemov@linux.intel.com; rppt@kernel.org; sfr@canb.auug.org.au;
> aneesh.kumar@linux.ibm.com; saravanand@fb.com;
> krish.sadhukhan@oracle.com; xen-devel@lists.xenproject.org;
> tj@kernel.org; rientjes@google.com; Michael Kelley
> <mikelley@microsoft.com>
> Cc: iommu@lists.linux-foundation.org; linux-arch@vger.kernel.org; linux-
> hyperv@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> scsi@vger.kernel.org; netdev@vger.kernel.org; vkuznets
> <vkuznets@redhat.com>; parri.andrea@gmail.com; dave.hansen@intel.com
> Subject: [PATCH V5 12/12] net: netvsc: Add Isolation VM support for
> netvsc driver
> 
> From: Tianyu Lan <Tianyu.Lan@microsoft.com>
> 
> In Isolation VM, all shared memory with host needs to mark visible
> to host via hvcall. vmbus_establish_gpadl() has already done it for
> netvsc rx/tx ring buffer. The page buffer used by vmbus_sendpacket_
> pagebuffer() stills need to be handled. Use DMA API to map/umap
> these memory during sending/receiving packet and Hyper-V swiotlb
> bounce buffer dma address will be returned. The swiotlb bounce buffer
> has been masked to be visible to host during boot up.
> 
> Allocate rx/tx ring buffer via alloc_pages() in Isolation VM and map
> these pages via vmap(). After calling vmbus_establish_gpadl() which
> marks these pages visible to host, unmap these pages to release the
> virtual address mapped with physical address below shared_gpa_boundary
> and map them in the extra address space via vmap_pfn().
> 
> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
> ---
> Change since v4:
> 	* Allocate rx/tx ring buffer via alloc_pages() in Isolation VM
> 	* Map pages after calling vmbus_establish_gpadl().
> 	* set dma_set_min_align_mask for netvsc driver.
> 
> Change since v3:
> 	* Add comment to explain why not to use dma_map_sg()
> 	* Fix some error handle.
> ---

Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>

Thank you!
Michael Kelley (LINUX) Sept. 15, 2021, 4:21 p.m. UTC | #2
From: Tianyu Lan <ltykernel@gmail.com>  Sent: Tuesday, September 14, 2021 6:39 AM
> 
> In Isolation VM, all shared memory with host needs to mark visible
> to host via hvcall. vmbus_establish_gpadl() has already done it for
> netvsc rx/tx ring buffer. The page buffer used by vmbus_sendpacket_
> pagebuffer() stills need to be handled. Use DMA API to map/umap
> these memory during sending/receiving packet and Hyper-V swiotlb
> bounce buffer dma address will be returned. The swiotlb bounce buffer
> has been masked to be visible to host during boot up.
> 
> Allocate rx/tx ring buffer via alloc_pages() in Isolation VM and map
> these pages via vmap(). After calling vmbus_establish_gpadl() which
> marks these pages visible to host, unmap these pages to release the
> virtual address mapped with physical address below shared_gpa_boundary
> and map them in the extra address space via vmap_pfn().
> 
> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
> ---
> Change since v4:
> 	* Allocate rx/tx ring buffer via alloc_pages() in Isolation VM
> 	* Map pages after calling vmbus_establish_gpadl().
> 	* set dma_set_min_align_mask for netvsc driver.
> 
> Change since v3:
> 	* Add comment to explain why not to use dma_map_sg()
> 	* Fix some error handle.
> ---
>  drivers/net/hyperv/hyperv_net.h   |   7 +
>  drivers/net/hyperv/netvsc.c       | 287 +++++++++++++++++++++++++++++-
>  drivers/net/hyperv/netvsc_drv.c   |   1 +
>  drivers/net/hyperv/rndis_filter.c |   2 +
>  include/linux/hyperv.h            |   5 +
>  5 files changed, 296 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index 315278a7cf88..87e8c74398a5 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -164,6 +164,7 @@ struct hv_netvsc_packet {
>  	u32 total_bytes;
>  	u32 send_buf_index;
>  	u32 total_data_buflen;
> +	struct hv_dma_range *dma_range;
>  };
> 
>  #define NETVSC_HASH_KEYLEN 40
> @@ -1074,6 +1075,8 @@ struct netvsc_device {
> 
>  	/* Receive buffer allocated by us but manages by NetVSP */
>  	void *recv_buf;
> +	struct page **recv_pages;
> +	u32 recv_page_count;
>  	u32 recv_buf_size; /* allocated bytes */
>  	struct vmbus_gpadl recv_buf_gpadl_handle;
>  	u32 recv_section_cnt;
> @@ -1082,6 +1085,8 @@ struct netvsc_device {
> 
>  	/* Send buffer allocated by us */
>  	void *send_buf;
> +	struct page **send_pages;
> +	u32 send_page_count;
>  	u32 send_buf_size;
>  	struct vmbus_gpadl send_buf_gpadl_handle;
>  	u32 send_section_cnt;
> @@ -1731,4 +1736,6 @@ struct rndis_message {
>  #define RETRY_US_HI	10000
>  #define RETRY_MAX	2000	/* >10 sec */
> 
> +void netvsc_dma_unmap(struct hv_device *hv_dev,
> +		      struct hv_netvsc_packet *packet);
>  #endif /* _HYPERV_NET_H */
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index 1f87e570ed2b..7d5254bf043e 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -20,6 +20,7 @@
>  #include <linux/vmalloc.h>
>  #include <linux/rtnetlink.h>
>  #include <linux/prefetch.h>
> +#include <linux/gfp.h>
> 
>  #include <asm/sync_bitops.h>
>  #include <asm/mshyperv.h>
> @@ -150,11 +151,33 @@ static void free_netvsc_device(struct rcu_head *head)
>  {
>  	struct netvsc_device *nvdev
>  		= container_of(head, struct netvsc_device, rcu);
> +	unsigned int alloc_unit;
>  	int i;
> 
>  	kfree(nvdev->extension);
> -	vfree(nvdev->recv_buf);
> -	vfree(nvdev->send_buf);
> +
> +	if (nvdev->recv_pages) {
> +		alloc_unit = (nvdev->recv_buf_size /
> +			nvdev->recv_page_count) >> PAGE_SHIFT;
> +
> +		vunmap(nvdev->recv_buf);
> +		for (i = 0; i < nvdev->recv_page_count; i++)
> +			__free_pages(nvdev->recv_pages[i], alloc_unit);
> +	} else {
> +		vfree(nvdev->recv_buf);
> +	}
> +
> +	if (nvdev->send_pages) {
> +		alloc_unit = (nvdev->send_buf_size /
> +			nvdev->send_page_count) >> PAGE_SHIFT;
> +
> +		vunmap(nvdev->send_buf);
> +		for (i = 0; i < nvdev->send_page_count; i++)
> +			__free_pages(nvdev->send_pages[i], alloc_unit);
> +	} else {
> +		vfree(nvdev->send_buf);
> +	}
> +
>  	kfree(nvdev->send_section_map);
> 
>  	for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
> @@ -330,6 +353,108 @@ int netvsc_alloc_recv_comp_ring(struct netvsc_device *net_device, u32 q_idx)
>  	return nvchan->mrc.slots ? 0 : -ENOMEM;
>  }
> 
> +void *netvsc_alloc_pages(struct page ***pages_array, unsigned int *array_len,
> +			 unsigned long size)
> +{
> +	struct page *page, **pages, **vmap_pages;
> +	unsigned long pg_count = size >> PAGE_SHIFT;
> +	int alloc_unit = MAX_ORDER_NR_PAGES;
> +	int i, j, vmap_page_index = 0;
> +	void *vaddr;
> +
> +	if (pg_count < alloc_unit)
> +		alloc_unit = 1;
> +
> +	/* vmap() accepts page array with PAGE_SIZE as unit while try to
> +	 * allocate high order pages here in order to save page array space.
> +	 * vmap_pages[] is used as input parameter of vmap(). pages[] is to
> +	 * store allocated pages and map them later.
> +	 */
> +	vmap_pages = kmalloc_array(pg_count, sizeof(*vmap_pages), GFP_KERNEL);
> +	if (!vmap_pages)
> +		return NULL;
> +
> +retry:
> +	*array_len = pg_count / alloc_unit;
> +	pages = kmalloc_array(*array_len, sizeof(*pages), GFP_KERNEL);
> +	if (!pages)
> +		goto cleanup;
> +
> +	for (i = 0; i < *array_len; i++) {
> +		page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
> +				   get_order(alloc_unit << PAGE_SHIFT));
> +		if (!page) {
> +			/* Try allocating small pages if high order pages are not available. */
> +			if (alloc_unit == 1) {
> +				goto cleanup;
> +			} else {

The "else" clause isn't really needed because of the goto cleanup above.  Then
the indentation of the code below could be reduced by one level.
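
Something like this, perhaps (just a sketch of the restructuring; the
statements are copied unchanged from the hunk above, only the control
flow changes):

		if (!page) {
			/* Try single pages if high order pages are not available. */
			if (alloc_unit == 1)
				goto cleanup;

			memset(vmap_pages, 0,
			       sizeof(*vmap_pages) * vmap_page_index);
			vmap_page_index = 0;

			for (j = 0; j < i; j++)
				__free_pages(pages[j], alloc_unit);

			kfree(pages);
			alloc_unit = 1;
			goto retry;
		}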

> +				memset(vmap_pages, 0,
> +				       sizeof(*vmap_pages) * vmap_page_index);
> +				vmap_page_index = 0;
> +
> +				for (j = 0; j < i; j++)
> +					__free_pages(pages[j], alloc_unit);
> +
> +				kfree(pages);
> +				alloc_unit = 1;

This is the case where a large enough contiguous physical memory chunk could
not be found.  But rather than dropping all the way down to single pages,
would it make sense to try something smaller, but not 1?  For example,
cut the alloc_unit in half and try again.  But I'm not sure of all the implications.
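
For example (hypothetical, untested), the fallback could halve the chunk size
on each retry instead of dropping straight to order 0:

			/* Retry with a smaller chunk rather than single pages. */
			alloc_unit = max(alloc_unit / 2, 1);
			goto retry;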

> +				goto retry;
> +			}
> +		}
> +
> +		pages[i] = page;
> +		for (j = 0; j < alloc_unit; j++)
> +			vmap_pages[vmap_page_index++] = page++;
> +	}
> +
> +	vaddr = vmap(vmap_pages, vmap_page_index, VM_MAP, PAGE_KERNEL);
> +	kfree(vmap_pages);
> +
> +	*pages_array = pages;
> +	return vaddr;
> +
> +cleanup:
> +	for (j = 0; j < i; j++)
> +		__free_pages(pages[i], alloc_unit);
> +
> +	kfree(pages);
> +	kfree(vmap_pages);
> +	return NULL;
> +}
> +
> +static void *netvsc_map_pages(struct page **pages, int count, int alloc_unit)
> +{
> +	int pg_count = count * alloc_unit;
> +	struct page *page;
> +	unsigned long *pfns;
> +	int pfn_index = 0;
> +	void *vaddr;
> +	int i, j;
> +
> +	if (!pages)
> +		return NULL;
> +
> +	pfns = kcalloc(pg_count, sizeof(*pfns), GFP_KERNEL);
> +	if (!pfns)
> +		return NULL;
> +
> +	for (i = 0; i < count; i++) {
> +		page = pages[i];
> +		if (!page) {
> +			pr_warn("page is not available %d.\n", i);
> +			return NULL;
> +		}
> +
> +		for (j = 0; j < alloc_unit; j++) {
> +			pfns[pfn_index++] = page_to_pfn(page++) +
> +				(ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);
> +		}
> +	}
> +
> +	vaddr = vmap_pfn(pfns, pg_count, PAGE_KERNEL_IO);
> +	kfree(pfns);
> +	return vaddr;
> +}
> +

I think you are proposing this approach to allocating memory for the send
and receive buffers so that you can avoid having two virtual mappings for
the memory, per comments from Christoph Hellwig.  But overall, the approach
seems a bit complex and I wonder if it is worth it.  If allocating large contiguous
chunks of physical memory is successful, then there is some memory savings
in that the data structures needed to keep track of the physical pages are
smaller than the equivalent page tables might be.  But if you have to revert
to allocating individual pages, then the memory savings is reduced.

Ultimately, the list of actual PFNs has to be kept somewhere.  Another approach
would be to do the reverse of what hv_map_memory() from the v4 patch
series does.  I.e., you could do virt_to_phys() on each virtual address that
maps above VTOM, and subtract out the shared_gpa_boundary to get the
list of actual PFNs that need to be freed.   This way you don't have two copies
of the list of PFNs -- one with and one without the shared_gpa_boundary added.
But it comes at the cost of additional code so that may not be a great idea.

I think what you have here works, and I don't have a clearly better solution
at the moment except perhaps to revert to the v4 solution and just have two
virtual mappings.  I'll keep thinking about it.  Maybe Christoph has other
thoughts.

>  static int netvsc_init_buf(struct hv_device *device,
>  			   struct netvsc_device *net_device,
>  			   const struct netvsc_device_info *device_info)
> @@ -337,7 +462,7 @@ static int netvsc_init_buf(struct hv_device *device,
>  	struct nvsp_1_message_send_receive_buffer_complete *resp;
>  	struct net_device *ndev = hv_get_drvdata(device);
>  	struct nvsp_message *init_packet;
> -	unsigned int buf_size;
> +	unsigned int buf_size, alloc_unit;
>  	size_t map_words;
>  	int i, ret = 0;
> 
> @@ -350,7 +475,14 @@ static int netvsc_init_buf(struct hv_device *device,
>  		buf_size = min_t(unsigned int, buf_size,
>  				 NETVSC_RECEIVE_BUFFER_SIZE_LEGACY);
> 
> -	net_device->recv_buf = vzalloc(buf_size);
> +	if (hv_isolation_type_snp())
> +		net_device->recv_buf =
> +			netvsc_alloc_pages(&net_device->recv_pages,
> +					   &net_device->recv_page_count,
> +					   buf_size);
> +	else
> +		net_device->recv_buf = vzalloc(buf_size);
> +

I wonder if it is necessary to have two different code paths here.  The
allocating and freeing of the send and receive buffers is not perf
sensitive, and it seems like netvsc_alloc_pages() could be used
regardless of whether SNP Isolation is in effect.  To my thinking,
one code path is better than two code paths unless there's a
compelling reason to have two.
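
For instance (sketch only, not what the patch does), the allocation could be
made unconditional:

	net_device->recv_buf = netvsc_alloc_pages(&net_device->recv_pages,
						  &net_device->recv_page_count,
						  buf_size);

and presumably only the later vunmap()/netvsc_map_pages() remap step would
need to stay behind the hv_isolation_type_snp() check.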

>  	if (!net_device->recv_buf) {
>  		netdev_err(ndev,
>  			   "unable to allocate receive buffer of size %u\n",
> @@ -375,6 +507,27 @@ static int netvsc_init_buf(struct hv_device *device,
>  		goto cleanup;
>  	}
> 
> +	if (hv_isolation_type_snp()) {
> +		alloc_unit = (buf_size / net_device->recv_page_count)
> +				>> PAGE_SHIFT;
> +
> +		/* Unmap previous virtual address and map pages in the extra
> +		 * address space(above shared gpa boundary) in Isolation VM.
> +		 */
> +		vunmap(net_device->recv_buf);
> +		net_device->recv_buf =
> +			netvsc_map_pages(net_device->recv_pages,
> +					 net_device->recv_page_count,
> +					 alloc_unit);
> +		if (!net_device->recv_buf) {
> +			netdev_err(ndev,
> +				   "unable to allocate receive buffer of size %u\n",
> +				   buf_size);
> +			ret = -ENOMEM;
> +			goto cleanup;
> +		}
> +	}
> +
>  	/* Notify the NetVsp of the gpadl handle */
>  	init_packet = &net_device->channel_init_pkt;
>  	memset(init_packet, 0, sizeof(struct nvsp_message));
> @@ -456,13 +609,21 @@ static int netvsc_init_buf(struct hv_device *device,
>  	buf_size = device_info->send_sections * device_info->send_section_size;
>  	buf_size = round_up(buf_size, PAGE_SIZE);
> 
> -	net_device->send_buf = vzalloc(buf_size);
> +	if (hv_isolation_type_snp())
> +		net_device->send_buf =
> +			netvsc_alloc_pages(&net_device->send_pages,
> +					   &net_device->send_page_count,
> +					   buf_size);
> +	else
> +		net_device->send_buf = vzalloc(buf_size);
> +
>  	if (!net_device->send_buf) {
>  		netdev_err(ndev, "unable to allocate send buffer of size %u\n",
>  			   buf_size);
>  		ret = -ENOMEM;
>  		goto cleanup;
>  	}
> +
>  	net_device->send_buf_size = buf_size;
> 
>  	/* Establish the gpadl handle for this buffer on this
> @@ -478,6 +639,27 @@ static int netvsc_init_buf(struct hv_device *device,
>  		goto cleanup;
>  	}
> 
> +	if (hv_isolation_type_snp()) {
> +		alloc_unit = (buf_size / net_device->send_page_count)
> +				>> PAGE_SHIFT;
> +
> +		/* Unmap previous virtual address and map pages in the extra
> +		 * address space(above shared gpa boundary) in Isolation VM.
> +		 */
> +		vunmap(net_device->send_buf);
> +		net_device->send_buf =
> +			netvsc_map_pages(net_device->send_pages,
> +					 net_device->send_page_count,
> +					 alloc_unit);
> +		if (!net_device->send_buf) {
> +			netdev_err(ndev,
> +				   "unable to allocate receive buffer of size %u\n",
> +				   buf_size);
> +			ret = -ENOMEM;
> +			goto cleanup;
> +		}
> +	}
> +
>  	/* Notify the NetVsp of the gpadl handle */
>  	init_packet = &net_device->channel_init_pkt;
>  	memset(init_packet, 0, sizeof(struct nvsp_message));
> @@ -768,7 +950,7 @@ static void netvsc_send_tx_complete(struct net_device *ndev,
> 
>  	/* Notify the layer above us */
>  	if (likely(skb)) {
> -		const struct hv_netvsc_packet *packet
> +		struct hv_netvsc_packet *packet
>  			= (struct hv_netvsc_packet *)skb->cb;
>  		u32 send_index = packet->send_buf_index;
>  		struct netvsc_stats *tx_stats;
> @@ -784,6 +966,7 @@ static void netvsc_send_tx_complete(struct net_device *ndev,
>  		tx_stats->bytes += packet->total_bytes;
>  		u64_stats_update_end(&tx_stats->syncp);
> 
> +		netvsc_dma_unmap(ndev_ctx->device_ctx, packet);
>  		napi_consume_skb(skb, budget);
>  	}
> 
> @@ -948,6 +1131,87 @@ static void netvsc_copy_to_send_buf(struct netvsc_device *net_device,
>  		memset(dest, 0, padding);
>  }
> 
> +void netvsc_dma_unmap(struct hv_device *hv_dev,
> +		      struct hv_netvsc_packet *packet)
> +{
> +	u32 page_count = packet->cp_partial ?
> +		packet->page_buf_cnt - packet->rmsg_pgcnt :
> +		packet->page_buf_cnt;
> +	int i;
> +
> +	if (!hv_is_isolation_supported())
> +		return;
> +
> +	if (!packet->dma_range)
> +		return;
> +
> +	for (i = 0; i < page_count; i++)
> +		dma_unmap_single(&hv_dev->device, packet->dma_range[i].dma,
> +				 packet->dma_range[i].mapping_size,
> +				 DMA_TO_DEVICE);
> +
> +	kfree(packet->dma_range);
> +}
> +
> +/* netvsc_dma_map - Map swiotlb bounce buffer with data page of
> + * packet sent by vmbus_sendpacket_pagebuffer() in the Isolation
> + * VM.
> + *
> + * In isolation VM, netvsc send buffer has been marked visible to
> + * host and so the data copied to send buffer doesn't need to use
> + * bounce buffer. The data pages handled by vmbus_sendpacket_pagebuffer()
> + * may not be copied to send buffer and so these pages need to be
> + * mapped with swiotlb bounce buffer. netvsc_dma_map() is to do
> + * that. The pfns in the struct hv_page_buffer need to be converted
> + * to bounce buffer's pfn. The loop here is necessary because the
> + * entries in the page buffer array are not necessarily full
> + * pages of data.  Each entry in the array has a separate offset and
> + * len that may be non-zero, even for entries in the middle of the
> + * array.  And the entries are not physically contiguous.  So each
> + * entry must be individually mapped rather than as a contiguous unit.
> + * So not use dma_map_sg() here.
> + */
> +static int netvsc_dma_map(struct hv_device *hv_dev,
> +		   struct hv_netvsc_packet *packet,
> +		   struct hv_page_buffer *pb)
> +{
> +	u32 page_count =  packet->cp_partial ?
> +		packet->page_buf_cnt - packet->rmsg_pgcnt :
> +		packet->page_buf_cnt;
> +	dma_addr_t dma;
> +	int i;
> +
> +	if (!hv_is_isolation_supported())
> +		return 0;
> +
> +	packet->dma_range = kcalloc(page_count,
> +				    sizeof(*packet->dma_range),
> +				    GFP_KERNEL);
> +	if (!packet->dma_range)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < page_count; i++) {
> +		char *src = phys_to_virt((pb[i].pfn << HV_HYP_PAGE_SHIFT)
> +					 + pb[i].offset);
> +		u32 len = pb[i].len;
> +
> +		dma = dma_map_single(&hv_dev->device, src, len,
> +				     DMA_TO_DEVICE);
> +		if (dma_mapping_error(&hv_dev->device, dma)) {
> +			kfree(packet->dma_range);
> +			return -ENOMEM;
> +		}
> +
> +		packet->dma_range[i].dma = dma;
> +		packet->dma_range[i].mapping_size = len;
> +		pb[i].pfn = dma >> HV_HYP_PAGE_SHIFT;
> +		pb[i].offset = offset_in_hvpage(dma);

With the DMA min align mask now being set, the offset within
the Hyper-V page won't be changed by dma_map_single().  So I
think the above statement can be removed.

> +		pb[i].len = len;

A few lines above, the value of "len" is set from pb[i].len.  Neither
"len" nor "i" is changed in the loop, so this statement can also be
removed.
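
With both statements dropped, the loop body would presumably reduce to
something like this (sketch):

	for (i = 0; i < page_count; i++) {
		char *src = phys_to_virt((pb[i].pfn << HV_HYP_PAGE_SHIFT)
					 + pb[i].offset);
		u32 len = pb[i].len;

		dma = dma_map_single(&hv_dev->device, src, len,
				     DMA_TO_DEVICE);
		if (dma_mapping_error(&hv_dev->device, dma)) {
			kfree(packet->dma_range);
			return -ENOMEM;
		}

		packet->dma_range[i].dma = dma;
		packet->dma_range[i].mapping_size = len;
		/* pb[i].offset and pb[i].len stay as-is; only the pfn now
		 * points at the swiotlb bounce buffer.
		 */
		pb[i].pfn = dma >> HV_HYP_PAGE_SHIFT;
	}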

> +	}
> +
> +	return 0;
> +}
> +
>  static inline int netvsc_send_pkt(
>  	struct hv_device *device,
>  	struct hv_netvsc_packet *packet,
> @@ -988,14 +1252,24 @@ static inline int netvsc_send_pkt(
> 
>  	trace_nvsp_send_pkt(ndev, out_channel, rpkt);
> 
> +	packet->dma_range = NULL;
>  	if (packet->page_buf_cnt) {
>  		if (packet->cp_partial)
>  			pb += packet->rmsg_pgcnt;
> 
> +		ret = netvsc_dma_map(ndev_ctx->device_ctx, packet, pb);
> +		if (ret) {
> +			ret = -EAGAIN;
> +			goto exit;
> +		}
> +
>  		ret = vmbus_sendpacket_pagebuffer(out_channel,
>  						  pb, packet->page_buf_cnt,
>  						  &nvmsg, sizeof(nvmsg),
>  						  req_id);
> +
> +		if (ret)
> +			netvsc_dma_unmap(ndev_ctx->device_ctx, packet);
>  	} else {
>  		ret = vmbus_sendpacket(out_channel,
>  				       &nvmsg, sizeof(nvmsg),
> @@ -1003,6 +1277,7 @@ static inline int netvsc_send_pkt(
>  				       VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
>  	}
> 
> +exit:
>  	if (ret == 0) {
>  		atomic_inc_return(&nvchan->queue_sends);
> 
> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index 382bebc2420d..c3dc884b31e3 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -2577,6 +2577,7 @@ static int netvsc_probe(struct hv_device *dev,
>  	list_add(&net_device_ctx->list, &netvsc_dev_list);
>  	rtnl_unlock();
> 
> +	dma_set_min_align_mask(&dev->device, HV_HYP_PAGE_SIZE - 1);
>  	netvsc_devinfo_put(device_info);
>  	return 0;
> 
> diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
> index f6c9c2a670f9..448fcc325ed7 100644
> --- a/drivers/net/hyperv/rndis_filter.c
> +++ b/drivers/net/hyperv/rndis_filter.c
> @@ -361,6 +361,8 @@ static void rndis_filter_receive_response(struct net_device *ndev,
>  			}
>  		}
> 
> +		netvsc_dma_unmap(((struct net_device_context *)
> +			netdev_priv(ndev))->device_ctx, &request->pkt);
>  		complete(&request->wait_event);
>  	} else {
>  		netdev_err(ndev,
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index c94c534a944e..81e58dd582dc 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -1597,6 +1597,11 @@ struct hyperv_service_callback {
>  	void (*callback)(void *context);
>  };
> 
> +struct hv_dma_range {
> +	dma_addr_t dma;
> +	u32 mapping_size;
> +};
> +
>  #define MAX_SRV_VER	0x7ffffff
>  extern bool vmbus_prep_negotiate_resp(struct icmsg_hdr *icmsghdrp, u8 *buf, u32 buflen,
>  				const int *fw_version, int fw_vercnt,
> --
> 2.25.1
Haiyang Zhang Sept. 15, 2021, 4:46 p.m. UTC | #3
> -----Original Message-----
> From: Michael Kelley <mikelley@microsoft.com>
> Sent: Wednesday, September 15, 2021 12:22 PM
> To: Tianyu Lan <ltykernel@gmail.com>; KY Srinivasan <kys@microsoft.com>;

> > +				memset(vmap_pages, 0,
> > +				       sizeof(*vmap_pages) * vmap_page_index);
> > +				vmap_page_index = 0;
> > +
> > +				for (j = 0; j < i; j++)
> > +					__free_pages(pages[j], alloc_unit);
> > +
> > +				kfree(pages);
> > +				alloc_unit = 1;
> 
> This is the case where a large enough contiguous physical memory chunk
> could not be found.  But rather than dropping all the way down to single
> pages, would it make sense to try something smaller, but not 1?  For
> example, cut the alloc_unit in half and try again.  But I'm not sure of
> all the implications.

I had the same question. But perhaps gradually decrementing takes too much
time?

> 
> > +				goto retry;
> > +			}
> > +		}
> > +
> > +		pages[i] = page;
> > +		for (j = 0; j < alloc_unit; j++)
> > +			vmap_pages[vmap_page_index++] = page++;
> > +	}
> > +
> > +	vaddr = vmap(vmap_pages, vmap_page_index, VM_MAP, PAGE_KERNEL);
> > +	kfree(vmap_pages);
> > +
> > +	*pages_array = pages;
> > +	return vaddr;
> > +
> > +cleanup:
> > +	for (j = 0; j < i; j++)
> > +		__free_pages(pages[i], alloc_unit);
> > +
> > +	kfree(pages);
> > +	kfree(vmap_pages);
> > +	return NULL;
> > +}
> > +
> > +static void *netvsc_map_pages(struct page **pages, int count, int
> > +alloc_unit) {
> > +	int pg_count = count * alloc_unit;
> > +	struct page *page;
> > +	unsigned long *pfns;
> > +	int pfn_index = 0;
> > +	void *vaddr;
> > +	int i, j;
> > +
> > +	if (!pages)
> > +		return NULL;
> > +
> > +	pfns = kcalloc(pg_count, sizeof(*pfns), GFP_KERNEL);
> > +	if (!pfns)
> > +		return NULL;
> > +
> > +	for (i = 0; i < count; i++) {
> > +		page = pages[i];
> > +		if (!page) {
> > +			pr_warn("page is not available %d.\n", i);
> > +			return NULL;
> > +		}
> > +
> > +		for (j = 0; j < alloc_unit; j++) {
> > +			pfns[pfn_index++] = page_to_pfn(page++) +
> > +				(ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);
> > +		}
> > +	}
> > +
> > +	vaddr = vmap_pfn(pfns, pg_count, PAGE_KERNEL_IO);
> > +	kfree(pfns);
> > +	return vaddr;
> > +}
> > +
> 
> I think you are proposing this approach to allocating memory for the
> send and receive buffers so that you can avoid having two virtual
> mappings for the memory, per comments from Christop Hellwig.  But
> overall, the approach seems a bit complex and I wonder if it is worth it.
> If allocating large contiguous chunks of physical memory is successful,
> then there is some memory savings in that the data structures needed to
> keep track of the physical pages is smaller than the equivalent page
> tables might be.  But if you have to revert to allocating individual
> pages, then the memory savings is reduced.
> 
> Ultimately, the list of actual PFNs has to be kept somewhere.  Another
> approach would be to do the reverse of what hv_map_memory() from the v4
> patch series does.  I.e., you could do virt_to_phys() on each virtual
> address that maps above VTOM, and subtract out the shared_gpa_boundary
> to get the
> list of actual PFNs that need to be freed.   This way you don't have two
> copies
> of the list of PFNs -- one with and one without the shared_gpa_boundary
> added.
> But it comes at the cost of additional code so that may not be a great
> idea.
> 
> I think what you have here works, and I don't have a clearly better
> solution at the moment except perhaps to revert to the v4 solution and
> just have two virtual mappings.  I'll keep thinking about it.  Maybe
> Christop has other thoughts.
> 
> >  static int netvsc_init_buf(struct hv_device *device,
> >  			   struct netvsc_device *net_device,
> >  			   const struct netvsc_device_info *device_info) @@ -
> 337,7 +462,7
> > @@ static int netvsc_init_buf(struct hv_device *device,
> >  	struct nvsp_1_message_send_receive_buffer_complete *resp;
> >  	struct net_device *ndev = hv_get_drvdata(device);
> >  	struct nvsp_message *init_packet;
> > -	unsigned int buf_size;
> > +	unsigned int buf_size, alloc_unit;
> >  	size_t map_words;
> >  	int i, ret = 0;
> >
> > @@ -350,7 +475,14 @@ static int netvsc_init_buf(struct hv_device
> *device,
> >  		buf_size = min_t(unsigned int, buf_size,
> >  				 NETVSC_RECEIVE_BUFFER_SIZE_LEGACY);
> >
> > -	net_device->recv_buf = vzalloc(buf_size);
> > +	if (hv_isolation_type_snp())
> > +		net_device->recv_buf =
> > +			netvsc_alloc_pages(&net_device->recv_pages,
> > +					   &net_device->recv_page_count,
> > +					   buf_size);
> > +	else
> > +		net_device->recv_buf = vzalloc(buf_size);
> > +
> 
> I wonder if it is necessary to have two different code paths here.  The
> allocating and freeing of the send and receive buffers is not perf
> sensitive, and it seems like netvsc_alloc_pages() could be used
> regardless of whether SNP Isolation is in effect.  To my thinking, one
> code path is better than two code paths unless there's a compelling
> reason to have two.

I still prefer keeping the simple vzalloc() for non-isolated VMs, because a
simpler code path usually means more robust code.
I don't know how much time difference there is between the two, but in some
cases we really care about boot time.
Also, in the multi-vPort case for MANA, we potentially support hundreds of
vPorts, and there will be the same number of synthetic NICs associated with
them. So even a small difference in initialization time may add up.

Thanks,
- Haiyang
Tianyu Lan Sept. 16, 2021, 1:56 p.m. UTC | #4
On 9/16/2021 12:46 AM, Haiyang Zhang wrote:
>>> +				memset(vmap_pages, 0,
>>> +				       sizeof(*vmap_pages) * vmap_page_index);
>>> +				vmap_page_index = 0;
>>> +
>>> +				for (j = 0; j < i; j++)
>>> +					__free_pages(pages[j], alloc_unit);
>>> +
>>> +				kfree(pages);
>>> +				alloc_unit = 1;
>> This is the case where a large enough contiguous physical memory chunk
>> could not be found.  But rather than dropping all the way down to single
>> pages, would it make sense to try something smaller, but not 1?  For
>> example, cut the alloc_unit in half and try again.  But I'm not sure of
>> all the implications.
> I had the same question. But probably gradually decrementing uses too much
> time?
> 

This version is meant to propose the solution. We can optimize it to try
progressively smaller sizes, down to a single page, if this is the right
direction.
Tianyu Lan Sept. 16, 2021, 2:43 p.m. UTC | #5
On 9/16/2021 12:21 AM, Michael Kelley wrote:
> I think you are proposing this approach to allocating memory for the send
> and receive buffers so that you can avoid having two virtual mappings for
> the memory, per comments from Christop Hellwig.  But overall, the approach
> seems a bit complex and I wonder if it is worth it.  If allocating large contiguous
> chunks of physical memory is successful, then there is some memory savings
> in that the data structures needed to keep track of the physical pages is
> smaller than the equivalent page tables might be.  But if you have to revert
> to allocating individual pages, then the memory savings is reduced.
> 

Yes, this version follows the idea from Christoph in the previous
discussion (https://lkml.org/lkml/2021/9/2/112).
This patch shows the implementation so we can check whether this is the
right direction.

> Ultimately, the list of actual PFNs has to be kept somewhere.  Another approach
> would be to do the reverse of what hv_map_memory() from the v4 patch
> series does.  I.e., you could do virt_to_phys() on each virtual address that
> maps above VTOM, and subtract out the shared_gpa_boundary to get the
> list of actual PFNs that need to be freed.

virt_to_phys() doesn't work for virtual addresses returned by
vmap()/vmap_pfn() (just as it doesn't work for a VA returned by
vmalloc()). The PFNs above vTOM have no struct page backing, and
vmap_pfn() populates the PFN directly into the PTE (please see
vmap_pfn_apply()). So it's not easy to convert the VA back to a PA.

>   This way you don't have two copies
> of the list of PFNs -- one with and one without the shared_gpa_boundary added.
> But it comes at the cost of additional code so that may not be a great idea.
> 
> I think what you have here works, and I don't have a clearly better solution
> at the moment except perhaps to revert to the v4 solution and just have two
> virtual mappings.  I'll keep thinking about it.  Maybe Christop has other
> thoughts.
Tianyu Lan Sept. 22, 2021, 10:34 a.m. UTC | #6
Hi Christoph:
     This patch follows your proposal in the previous discussion.
Could you have a look?
     "use vmap_pfn as in the current series.  But in that case I think
     we should get rid of the other mapping created by vmalloc.  I
     though a bit about finding a way to apply the offset in vmalloc
     itself, but I think it would be too invasive to the normal fast
     path.  So the other sub-option would be to allocate the pages
     manually (maybe even using high order allocations to reduce TLB
     pressure) and then remap them(https://lkml.org/lkml/2021/9/2/112)

Otherwise, I merged your previous change for swiotlb into patch 9,
“x86/Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM”.
Your previous change is here:
http://git.infradead.org/users/hch/misc.git/commit/8248f295928aded3364a1e54a4e0022e93d3610c
Please have a look.


Thanks.


On 9/16/2021 12:21 AM, Michael Kelley wrote:
> From: Tianyu Lan <ltykernel@gmail.com>  Sent: Tuesday, September 14, 2021 6:39 AM
>>
>> In Isolation VM, all shared memory with host needs to mark visible
>> to host via hvcall. vmbus_establish_gpadl() has already done it for
>> netvsc rx/tx ring buffer. The page buffer used by vmbus_sendpacket_
>> pagebuffer() stills need to be handled. Use DMA API to map/umap
>> these memory during sending/receiving packet and Hyper-V swiotlb
>> bounce buffer dma address will be returned. The swiotlb bounce buffer
>> has been masked to be visible to host during boot up.
>>
>> Allocate rx/tx ring buffer via alloc_pages() in Isolation VM and map
>> these pages via vmap(). After calling vmbus_establish_gpadl() which
>> marks these pages visible to host, unmap these pages to release the
>> virtual address mapped with physical address below shared_gpa_boundary
>> and map them in the extra address space via vmap_pfn().
>>
>> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
>> ---
>> Change since v4:
>> 	* Allocate rx/tx ring buffer via alloc_pages() in Isolation VM
>> 	* Map pages after calling vmbus_establish_gpadl().
>> 	* set dma_set_min_align_mask for netvsc driver.
>>
>> Change since v3:
>> 	* Add comment to explain why not to use dma_map_sg()
>> 	* Fix some error handle.
>> ---
>>   drivers/net/hyperv/hyperv_net.h   |   7 +
>>   drivers/net/hyperv/netvsc.c       | 287 +++++++++++++++++++++++++++++-
>>   drivers/net/hyperv/netvsc_drv.c   |   1 +
>>   drivers/net/hyperv/rndis_filter.c |   2 +
>>   include/linux/hyperv.h            |   5 +
>>   5 files changed, 296 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
>> index 315278a7cf88..87e8c74398a5 100644
>> --- a/drivers/net/hyperv/hyperv_net.h
>> +++ b/drivers/net/hyperv/hyperv_net.h
>> @@ -164,6 +164,7 @@ struct hv_netvsc_packet {
>>   	u32 total_bytes;
>>   	u32 send_buf_index;
>>   	u32 total_data_buflen;
>> +	struct hv_dma_range *dma_range;
>>   };
>>
>>   #define NETVSC_HASH_KEYLEN 40
>> @@ -1074,6 +1075,8 @@ struct netvsc_device {
>>
>>   	/* Receive buffer allocated by us but manages by NetVSP */
>>   	void *recv_buf;
>> +	struct page **recv_pages;
>> +	u32 recv_page_count;
>>   	u32 recv_buf_size; /* allocated bytes */
>>   	struct vmbus_gpadl recv_buf_gpadl_handle;
>>   	u32 recv_section_cnt;
>> @@ -1082,6 +1085,8 @@ struct netvsc_device {
>>
>>   	/* Send buffer allocated by us */
>>   	void *send_buf;
>> +	struct page **send_pages;
>> +	u32 send_page_count;
>>   	u32 send_buf_size;
>>   	struct vmbus_gpadl send_buf_gpadl_handle;
>>   	u32 send_section_cnt;
>> @@ -1731,4 +1736,6 @@ struct rndis_message {
>>   #define RETRY_US_HI	10000
>>   #define RETRY_MAX	2000	/* >10 sec */
>>
>> +void netvsc_dma_unmap(struct hv_device *hv_dev,
>> +		      struct hv_netvsc_packet *packet);
>>   #endif /* _HYPERV_NET_H */
>> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
>> index 1f87e570ed2b..7d5254bf043e 100644
>> --- a/drivers/net/hyperv/netvsc.c
>> +++ b/drivers/net/hyperv/netvsc.c
>> @@ -20,6 +20,7 @@
>>   #include <linux/vmalloc.h>
>>   #include <linux/rtnetlink.h>
>>   #include <linux/prefetch.h>
>> +#include <linux/gfp.h>
>>
>>   #include <asm/sync_bitops.h>
>>   #include <asm/mshyperv.h>
>> @@ -150,11 +151,33 @@ static void free_netvsc_device(struct rcu_head *head)
>>   {
>>   	struct netvsc_device *nvdev
>>   		= container_of(head, struct netvsc_device, rcu);
>> +	unsigned int alloc_unit;
>>   	int i;
>>
>>   	kfree(nvdev->extension);
>> -	vfree(nvdev->recv_buf);
>> -	vfree(nvdev->send_buf);
>> +
>> +	if (nvdev->recv_pages) {
>> +		alloc_unit = (nvdev->recv_buf_size /
>> +			nvdev->recv_page_count) >> PAGE_SHIFT;
>> +
>> +		vunmap(nvdev->recv_buf);
>> +		for (i = 0; i < nvdev->recv_page_count; i++)
>> +			__free_pages(nvdev->recv_pages[i], alloc_unit);
>> +	} else {
>> +		vfree(nvdev->recv_buf);
>> +	}
>> +
>> +	if (nvdev->send_pages) {
>> +		alloc_unit = (nvdev->send_buf_size /
>> +			nvdev->send_page_count) >> PAGE_SHIFT;
>> +
>> +		vunmap(nvdev->send_buf);
>> +		for (i = 0; i < nvdev->send_page_count; i++)
>> +			__free_pages(nvdev->send_pages[i], alloc_unit);
>> +	} else {
>> +		vfree(nvdev->send_buf);
>> +	}
>> +
>>   	kfree(nvdev->send_section_map);
>>
>>   	for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
>> @@ -330,6 +353,108 @@ int netvsc_alloc_recv_comp_ring(struct netvsc_device *net_device, u32 q_idx)
>>   	return nvchan->mrc.slots ? 0 : -ENOMEM;
>>   }
>>
>> +void *netvsc_alloc_pages(struct page ***pages_array, unsigned int *array_len,
>> +			 unsigned long size)
>> +{
>> +	struct page *page, **pages, **vmap_pages;
>> +	unsigned long pg_count = size >> PAGE_SHIFT;
>> +	int alloc_unit = MAX_ORDER_NR_PAGES;
>> +	int i, j, vmap_page_index = 0;
>> +	void *vaddr;
>> +
>> +	if (pg_count < alloc_unit)
>> +		alloc_unit = 1;
>> +
>> +	/* vmap() accepts page array with PAGE_SIZE as unit while try to
>> +	 * allocate high order pages here in order to save page array space.
>> +	 * vmap_pages[] is used as input parameter of vmap(). pages[] is to
>> +	 * store allocated pages and map them later.
>> +	 */
>> +	vmap_pages = kmalloc_array(pg_count, sizeof(*vmap_pages), GFP_KERNEL);
>> +	if (!vmap_pages)
>> +		return NULL;
>> +
>> +retry:
>> +	*array_len = pg_count / alloc_unit;
>> +	pages = kmalloc_array(*array_len, sizeof(*pages), GFP_KERNEL);
>> +	if (!pages)
>> +		goto cleanup;
>> +
>> +	for (i = 0; i < *array_len; i++) {
>> +		page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
>> +				   get_order(alloc_unit << PAGE_SHIFT));
>> +		if (!page) {
>> +			/* Try allocating small pages if high order pages are not available. */
>> +			if (alloc_unit == 1) {
>> +				goto cleanup;
>> +			} else {
> 
> The "else" clause isn't really needed because of the goto cleanup above.  Then
> the indentation of the code below could be reduced by one level.
> 
>> +				memset(vmap_pages, 0,
>> +				       sizeof(*vmap_pages) * vmap_page_index);
>> +				vmap_page_index = 0;
>> +
>> +				for (j = 0; j < i; j++)
>> +					__free_pages(pages[j], alloc_unit);
>> +
>> +				kfree(pages);
>> +				alloc_unit = 1;
> 
> This is the case where a large enough contiguous physical memory chunk could
> not be found.  But rather than dropping all the way down to single pages,
> would it make sense to try something smaller, but not 1?  For example,
> cut the alloc_unit in half and try again.  But I'm not sure of all the implications.
> 
>> +				goto retry;
>> +			}
>> +		}
>> +
>> +		pages[i] = page;
>> +		for (j = 0; j < alloc_unit; j++)
>> +			vmap_pages[vmap_page_index++] = page++;
>> +	}
>> +
>> +	vaddr = vmap(vmap_pages, vmap_page_index, VM_MAP, PAGE_KERNEL);
>> +	kfree(vmap_pages);
>> +
>> +	*pages_array = pages;
>> +	return vaddr;
>> +
>> +cleanup:
>> +	for (j = 0; j < i; j++)
>> +		__free_pages(pages[i], alloc_unit);
>> +
>> +	kfree(pages);
>> +	kfree(vmap_pages);
>> +	return NULL;
>> +}
>> +
>> +static void *netvsc_map_pages(struct page **pages, int count, int alloc_unit)
>> +{
>> +	int pg_count = count * alloc_unit;
>> +	struct page *page;
>> +	unsigned long *pfns;
>> +	int pfn_index = 0;
>> +	void *vaddr;
>> +	int i, j;
>> +
>> +	if (!pages)
>> +		return NULL;
>> +
>> +	pfns = kcalloc(pg_count, sizeof(*pfns), GFP_KERNEL);
>> +	if (!pfns)
>> +		return NULL;
>> +
>> +	for (i = 0; i < count; i++) {
>> +		page = pages[i];
>> +		if (!page) {
>> +			pr_warn("page is not available %d.\n", i);
>> +			return NULL;
>> +		}
>> +
>> +		for (j = 0; j < alloc_unit; j++) {
>> +			pfns[pfn_index++] = page_to_pfn(page++) +
>> +				(ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);
>> +		}
>> +	}
>> +
>> +	vaddr = vmap_pfn(pfns, pg_count, PAGE_KERNEL_IO);
>> +	kfree(pfns);
>> +	return vaddr;
>> +}
>> +
> 
> I think you are proposing this approach to allocating memory for the send
> and receive buffers so that you can avoid having two virtual mappings for
> the memory, per comments from Christop Hellwig.  But overall, the approach
> seems a bit complex and I wonder if it is worth it.  If allocating large contiguous
> chunks of physical memory is successful, then there is some memory savings
> in that the data structures needed to keep track of the physical pages is
> smaller than the equivalent page tables might be.  But if you have to revert
> to allocating individual pages, then the memory savings is reduced.
> 
> Ultimately, the list of actual PFNs has to be kept somewhere.  Another approach
> would be to do the reverse of what hv_map_memory() from the v4 patch
> series does.  I.e., you could do virt_to_phys() on each virtual address that
> maps above VTOM, and subtract out the shared_gpa_boundary to get the
> list of actual PFNs that need to be freed.   This way you don't have two copies
> of the list of PFNs -- one with and one without the shared_gpa_boundary added.
> But it comes at the cost of additional code so that may not be a great idea.
> 
> I think what you have here works, and I don't have a clearly better solution
> at the moment except perhaps to revert to the v4 solution and just have two
> virtual mappings.  I'll keep thinking about it.  Maybe Christop has other
> thoughts.
> 
>>   static int netvsc_init_buf(struct hv_device *device,
>>   			   struct netvsc_device *net_device,
>>   			   const struct netvsc_device_info *device_info)
>> @@ -337,7 +462,7 @@ static int netvsc_init_buf(struct hv_device *device,
>>   	struct nvsp_1_message_send_receive_buffer_complete *resp;
>>   	struct net_device *ndev = hv_get_drvdata(device);
>>   	struct nvsp_message *init_packet;
>> -	unsigned int buf_size;
>> +	unsigned int buf_size, alloc_unit;
>>   	size_t map_words;
>>   	int i, ret = 0;
>>
>> @@ -350,7 +475,14 @@ static int netvsc_init_buf(struct hv_device *device,
>>   		buf_size = min_t(unsigned int, buf_size,
>>   				 NETVSC_RECEIVE_BUFFER_SIZE_LEGACY);
>>
>> -	net_device->recv_buf = vzalloc(buf_size);
>> +	if (hv_isolation_type_snp())
>> +		net_device->recv_buf =
>> +			netvsc_alloc_pages(&net_device->recv_pages,
>> +					   &net_device->recv_page_count,
>> +					   buf_size);
>> +	else
>> +		net_device->recv_buf = vzalloc(buf_size);
>> +
> 
> I wonder if it is necessary to have two different code paths here.  The
> allocating and freeing of the send and receive buffers is not perf
> sensitive, and it seems like netvsc_alloc_pages() could be used
> regardless of whether SNP Isolation is in effect.  To my thinking,
> one code path is better than two code paths unless there's a
> compelling reason to have two.
> 
>>   	if (!net_device->recv_buf) {
>>   		netdev_err(ndev,
>>   			   "unable to allocate receive buffer of size %u\n",
>> @@ -375,6 +507,27 @@ static int netvsc_init_buf(struct hv_device *device,
>>   		goto cleanup;
>>   	}
>>
>> +	if (hv_isolation_type_snp()) {
>> +		alloc_unit = (buf_size / net_device->recv_page_count)
>> +				>> PAGE_SHIFT;
>> +
>> +		/* Unmap previous virtual address and map pages in the extra
>> +		 * address space(above shared gpa boundary) in Isolation VM.
>> +		 */
>> +		vunmap(net_device->recv_buf);
>> +		net_device->recv_buf =
>> +			netvsc_map_pages(net_device->recv_pages,
>> +					 net_device->recv_page_count,
>> +					 alloc_unit);
>> +		if (!net_device->recv_buf) {
>> +			netdev_err(ndev,
>> +				   "unable to allocate receive buffer of size %u\n",
>> +				   buf_size);
>> +			ret = -ENOMEM;
>> +			goto cleanup;
>> +		}
>> +	}
>> +
>>   	/* Notify the NetVsp of the gpadl handle */
>>   	init_packet = &net_device->channel_init_pkt;
>>   	memset(init_packet, 0, sizeof(struct nvsp_message));
>> @@ -456,13 +609,21 @@ static int netvsc_init_buf(struct hv_device *device,
>>   	buf_size = device_info->send_sections * device_info->send_section_size;
>>   	buf_size = round_up(buf_size, PAGE_SIZE);
>>
>> -	net_device->send_buf = vzalloc(buf_size);
>> +	if (hv_isolation_type_snp())
>> +		net_device->send_buf =
>> +			netvsc_alloc_pages(&net_device->send_pages,
>> +					   &net_device->send_page_count,
>> +					   buf_size);
>> +	else
>> +		net_device->send_buf = vzalloc(buf_size);
>> +
>>   	if (!net_device->send_buf) {
>>   		netdev_err(ndev, "unable to allocate send buffer of size %u\n",
>>   			   buf_size);
>>   		ret = -ENOMEM;
>>   		goto cleanup;
>>   	}
>> +
>>   	net_device->send_buf_size = buf_size;
>>
>>   	/* Establish the gpadl handle for this buffer on this
>> @@ -478,6 +639,27 @@ static int netvsc_init_buf(struct hv_device *device,
>>   		goto cleanup;
>>   	}
>>
>> +	if (hv_isolation_type_snp()) {
>> +		alloc_unit = (buf_size / net_device->send_page_count)
>> +				>> PAGE_SHIFT;
>> +
>> +		/* Unmap previous virtual address and map pages in the extra
>> +		 * address space(above shared gpa boundary) in Isolation VM.
>> +		 */
>> +		vunmap(net_device->send_buf);
>> +		net_device->send_buf =
>> +			netvsc_map_pages(net_device->send_pages,
>> +					 net_device->send_page_count,
>> +					 alloc_unit);
>> +		if (!net_device->send_buf) {
>> +			netdev_err(ndev,
>> +				   "unable to allocate receive buffer of size %u\n",
>> +				   buf_size);
>> +			ret = -ENOMEM;
>> +			goto cleanup;
>> +		}
>> +	}
>> +
>>   	/* Notify the NetVsp of the gpadl handle */
>>   	init_packet = &net_device->channel_init_pkt;
>>   	memset(init_packet, 0, sizeof(struct nvsp_message));
>> @@ -768,7 +950,7 @@ static void netvsc_send_tx_complete(struct net_device *ndev,
>>
>>   	/* Notify the layer above us */
>>   	if (likely(skb)) {
>> -		const struct hv_netvsc_packet *packet
>> +		struct hv_netvsc_packet *packet
>>   			= (struct hv_netvsc_packet *)skb->cb;
>>   		u32 send_index = packet->send_buf_index;
>>   		struct netvsc_stats *tx_stats;
>> @@ -784,6 +966,7 @@ static void netvsc_send_tx_complete(struct net_device *ndev,
>>   		tx_stats->bytes += packet->total_bytes;
>>   		u64_stats_update_end(&tx_stats->syncp);
>>
>> +		netvsc_dma_unmap(ndev_ctx->device_ctx, packet);
>>   		napi_consume_skb(skb, budget);
>>   	}
>>
>> @@ -948,6 +1131,87 @@ static void netvsc_copy_to_send_buf(struct netvsc_device *net_device,
>>   		memset(dest, 0, padding);
>>   }
>>
>> +void netvsc_dma_unmap(struct hv_device *hv_dev,
>> +		      struct hv_netvsc_packet *packet)
>> +{
>> +	u32 page_count = packet->cp_partial ?
>> +		packet->page_buf_cnt - packet->rmsg_pgcnt :
>> +		packet->page_buf_cnt;
>> +	int i;
>> +
>> +	if (!hv_is_isolation_supported())
>> +		return;
>> +
>> +	if (!packet->dma_range)
>> +		return;
>> +
>> +	for (i = 0; i < page_count; i++)
>> +		dma_unmap_single(&hv_dev->device, packet->dma_range[i].dma,
>> +				 packet->dma_range[i].mapping_size,
>> +				 DMA_TO_DEVICE);
>> +
>> +	kfree(packet->dma_range);
>> +}
>> +
>> +/* netvsc_dma_map - Map swiotlb bounce buffer with data page of
>> + * packet sent by vmbus_sendpacket_pagebuffer() in the Isolation
>> + * VM.
>> + *
>> + * In isolation VM, netvsc send buffer has been marked visible to
>> + * host and so the data copied to send buffer doesn't need to use
>> + * bounce buffer. The data pages handled by vmbus_sendpacket_pagebuffer()
>> + * may not be copied to send buffer and so these pages need to be
>> + * mapped with swiotlb bounce buffer. netvsc_dma_map() is to do
>> + * that. The pfns in the struct hv_page_buffer need to be converted
>> + * to bounce buffer's pfn. The loop here is necessary because the
>> + * entries in the page buffer array are not necessarily full
>> + * pages of data.  Each entry in the array has a separate offset and
>> + * len that may be non-zero, even for entries in the middle of the
>> + * array.  And the entries are not physically contiguous.  So each
>> + * entry must be individually mapped rather than as a contiguous unit.
>> + * So not use dma_map_sg() here.
>> + */
>> +static int netvsc_dma_map(struct hv_device *hv_dev,
>> +		   struct hv_netvsc_packet *packet,
>> +		   struct hv_page_buffer *pb)
>> +{
>> +	u32 page_count =  packet->cp_partial ?
>> +		packet->page_buf_cnt - packet->rmsg_pgcnt :
>> +		packet->page_buf_cnt;
>> +	dma_addr_t dma;
>> +	int i;
>> +
>> +	if (!hv_is_isolation_supported())
>> +		return 0;
>> +
>> +	packet->dma_range = kcalloc(page_count,
>> +				    sizeof(*packet->dma_range),
>> +				    GFP_KERNEL);
>> +	if (!packet->dma_range)
>> +		return -ENOMEM;
>> +
>> +	for (i = 0; i < page_count; i++) {
>> +		char *src = phys_to_virt((pb[i].pfn << HV_HYP_PAGE_SHIFT)
>> +					 + pb[i].offset);
>> +		u32 len = pb[i].len;
>> +
>> +		dma = dma_map_single(&hv_dev->device, src, len,
>> +				     DMA_TO_DEVICE);
>> +		if (dma_mapping_error(&hv_dev->device, dma)) {
>> +			kfree(packet->dma_range);
>> +			return -ENOMEM;
>> +		}
>> +
>> +		packet->dma_range[i].dma = dma;
>> +		packet->dma_range[i].mapping_size = len;
>> +		pb[i].pfn = dma >> HV_HYP_PAGE_SHIFT;
>> +		pb[i].offset = offset_in_hvpage(dma);
> 
> With the DMA min align mask now being set, the offset within
> the Hyper-V page won't be changed by dma_map_single().  So I
> think the above statement can be removed.
> 
>> +		pb[i].len = len;
> 
> A few lines above, the value of "len" is set from pb[i].len.  Neither
> "len" nor "i" is changed in the loop, so this statement can also be
> removed.
> 
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>   static inline int netvsc_send_pkt(
>>   	struct hv_device *device,
>>   	struct hv_netvsc_packet *packet,
>> @@ -988,14 +1252,24 @@ static inline int netvsc_send_pkt(
>>
>>   	trace_nvsp_send_pkt(ndev, out_channel, rpkt);
>>
>> +	packet->dma_range = NULL;
>>   	if (packet->page_buf_cnt) {
>>   		if (packet->cp_partial)
>>   			pb += packet->rmsg_pgcnt;
>>
>> +		ret = netvsc_dma_map(ndev_ctx->device_ctx, packet, pb);
>> +		if (ret) {
>> +			ret = -EAGAIN;
>> +			goto exit;
>> +		}
>> +
>>   		ret = vmbus_sendpacket_pagebuffer(out_channel,
>>   						  pb, packet->page_buf_cnt,
>>   						  &nvmsg, sizeof(nvmsg),
>>   						  req_id);
>> +
>> +		if (ret)
>> +			netvsc_dma_unmap(ndev_ctx->device_ctx, packet);
>>   	} else {
>>   		ret = vmbus_sendpacket(out_channel,
>>   				       &nvmsg, sizeof(nvmsg),
>> @@ -1003,6 +1277,7 @@ static inline int netvsc_send_pkt(
>>   				       VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
>>   	}
>>
>> +exit:
>>   	if (ret == 0) {
>>   		atomic_inc_return(&nvchan->queue_sends);
>>
>> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
>> index 382bebc2420d..c3dc884b31e3 100644
>> --- a/drivers/net/hyperv/netvsc_drv.c
>> +++ b/drivers/net/hyperv/netvsc_drv.c
>> @@ -2577,6 +2577,7 @@ static int netvsc_probe(struct hv_device *dev,
>>   	list_add(&net_device_ctx->list, &netvsc_dev_list);
>>   	rtnl_unlock();
>>
>> +	dma_set_min_align_mask(&dev->device, HV_HYP_PAGE_SIZE - 1);
>>   	netvsc_devinfo_put(device_info);
>>   	return 0;
>>
>> diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
>> index f6c9c2a670f9..448fcc325ed7 100644
>> --- a/drivers/net/hyperv/rndis_filter.c
>> +++ b/drivers/net/hyperv/rndis_filter.c
>> @@ -361,6 +361,8 @@ static void rndis_filter_receive_response(struct net_device *ndev,
>>   			}
>>   		}
>>
>> +		netvsc_dma_unmap(((struct net_device_context *)
>> +			netdev_priv(ndev))->device_ctx, &request->pkt);
>>   		complete(&request->wait_event);
>>   	} else {
>>   		netdev_err(ndev,
>> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
>> index c94c534a944e..81e58dd582dc 100644
>> --- a/include/linux/hyperv.h
>> +++ b/include/linux/hyperv.h
>> @@ -1597,6 +1597,11 @@ struct hyperv_service_callback {
>>   	void (*callback)(void *context);
>>   };
>>
>> +struct hv_dma_range {
>> +	dma_addr_t dma;
>> +	u32 mapping_size;
>> +};
>> +
>>   #define MAX_SRV_VER	0x7ffffff
>>   extern bool vmbus_prep_negotiate_resp(struct icmsg_hdr *icmsghdrp, u8 *buf, u32 buflen,
>>   				const int *fw_version, int fw_vercnt,
>> --
>> 2.25.1
>
Tianyu Lan Sept. 27, 2021, 2:26 p.m. UTC | #7
Hi Christoph:
     Gentle ping. The swiotlb and shared-memory mapping changes in this
patchset need your review. Could you have a look?

Thanks.

On 9/22/2021 6:34 PM, Tianyu Lan wrote:
> Hi Christoph:
>      This patch follows your purposal in the previous discussion.
> Could you have a look?
>      "use vmap_pfn as in the current series.  But in that case I think
>      we should get rid of the other mapping created by vmalloc.  I
>      though a bit about finding a way to apply the offset in vmalloc
>      itself, but I think it would be too invasive to the normal fast
>      path.  So the other sub-option would be to allocate the pages
>      manually (maybe even using high order allocations to reduce TLB
>      pressure) and then remap them(https://lkml.org/lkml/2021/9/2/112)
> 
> Otherwise, I merge your previous change for swiotlb into patch 9
> “x86/Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM”
> You previous change 
> link.(http://git.infradead.org/users/hch/misc.git/commit/8248f295928aded3364a1e54a4e0022e93d3610c) 
> Please have a look.
> 
> 
> Thanks.
> 
> 
> On 9/16/2021 12:21 AM, Michael Kelley wrote:
>> From: Tianyu Lan <ltykernel@gmail.com>  Sent: Tuesday, September 14, 
>> 2021 6:39 AM
>>>
>>> In Isolation VM, all shared memory with host needs to mark visible
>>> to host via hvcall. vmbus_establish_gpadl() has already done it for
>>> netvsc rx/tx ring buffer. The page buffer used by vmbus_sendpacket_
>>> pagebuffer() stills need to be handled. Use DMA API to map/umap
>>> these memory during sending/receiving packet and Hyper-V swiotlb
>>> bounce buffer dma address will be returned. The swiotlb bounce buffer
>>> has been masked to be visible to host during boot up.
>>>
>>> Allocate rx/tx ring buffer via alloc_pages() in Isolation VM and map
>>> these pages via vmap(). After calling vmbus_establish_gpadl() which
>>> marks these pages visible to host, unmap these pages to release the
>>> virtual address mapped with physical address below shared_gpa_boundary
>>> and map them in the extra address space via vmap_pfn().
>>>
>>> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
>>> ---
>>> Change since v4:
>>>     * Allocate rx/tx ring buffer via alloc_pages() in Isolation VM
>>>     * Map pages after calling vmbus_establish_gpadl().
>>>     * set dma_set_min_align_mask for netvsc driver.
>>>
>>> Change since v3:
>>>     * Add comment to explain why not to use dma_map_sg()
>>>     * Fix some error handle.
>>> ---
>>>   drivers/net/hyperv/hyperv_net.h   |   7 +
>>>   drivers/net/hyperv/netvsc.c       | 287 +++++++++++++++++++++++++++++-
>>>   drivers/net/hyperv/netvsc_drv.c   |   1 +
>>>   drivers/net/hyperv/rndis_filter.c |   2 +
>>>   include/linux/hyperv.h            |   5 +
>>>   5 files changed, 296 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/net/hyperv/hyperv_net.h 
>>> b/drivers/net/hyperv/hyperv_net.h
>>> index 315278a7cf88..87e8c74398a5 100644
>>> --- a/drivers/net/hyperv/hyperv_net.h
>>> +++ b/drivers/net/hyperv/hyperv_net.h
>>> @@ -164,6 +164,7 @@ struct hv_netvsc_packet {
>>>       u32 total_bytes;
>>>       u32 send_buf_index;
>>>       u32 total_data_buflen;
>>> +    struct hv_dma_range *dma_range;
>>>   };
>>>
>>>   #define NETVSC_HASH_KEYLEN 40
>>> @@ -1074,6 +1075,8 @@ struct netvsc_device {
>>>
>>>       /* Receive buffer allocated by us but manages by NetVSP */
>>>       void *recv_buf;
>>> +    struct page **recv_pages;
>>> +    u32 recv_page_count;
>>>       u32 recv_buf_size; /* allocated bytes */
>>>       struct vmbus_gpadl recv_buf_gpadl_handle;
>>>       u32 recv_section_cnt;
>>> @@ -1082,6 +1085,8 @@ struct netvsc_device {
>>>
>>>       /* Send buffer allocated by us */
>>>       void *send_buf;
>>> +    struct page **send_pages;
>>> +    u32 send_page_count;
>>>       u32 send_buf_size;
>>>       struct vmbus_gpadl send_buf_gpadl_handle;
>>>       u32 send_section_cnt;
>>> @@ -1731,4 +1736,6 @@ struct rndis_message {
>>>   #define RETRY_US_HI    10000
>>>   #define RETRY_MAX    2000    /* >10 sec */
>>>
>>> +void netvsc_dma_unmap(struct hv_device *hv_dev,
>>> +              struct hv_netvsc_packet *packet);
>>>   #endif /* _HYPERV_NET_H */
>>> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
>>> index 1f87e570ed2b..7d5254bf043e 100644
>>> --- a/drivers/net/hyperv/netvsc.c
>>> +++ b/drivers/net/hyperv/netvsc.c
>>> @@ -20,6 +20,7 @@
>>>   #include <linux/vmalloc.h>
>>>   #include <linux/rtnetlink.h>
>>>   #include <linux/prefetch.h>
>>> +#include <linux/gfp.h>
>>>
>>>   #include <asm/sync_bitops.h>
>>>   #include <asm/mshyperv.h>
>>> @@ -150,11 +151,33 @@ static void free_netvsc_device(struct rcu_head 
>>> *head)
>>>   {
>>>       struct netvsc_device *nvdev
>>>           = container_of(head, struct netvsc_device, rcu);
>>> +    unsigned int alloc_unit;
>>>       int i;
>>>
>>>       kfree(nvdev->extension);
>>> -    vfree(nvdev->recv_buf);
>>> -    vfree(nvdev->send_buf);
>>> +
>>> +    if (nvdev->recv_pages) {
>>> +        alloc_unit = (nvdev->recv_buf_size /
>>> +            nvdev->recv_page_count) >> PAGE_SHIFT;
>>> +
>>> +        vunmap(nvdev->recv_buf);
>>> +        for (i = 0; i < nvdev->recv_page_count; i++)
>>> +            __free_pages(nvdev->recv_pages[i], alloc_unit);
>>> +    } else {
>>> +        vfree(nvdev->recv_buf);
>>> +    }
>>> +
>>> +    if (nvdev->send_pages) {
>>> +        alloc_unit = (nvdev->send_buf_size /
>>> +            nvdev->send_page_count) >> PAGE_SHIFT;
>>> +
>>> +        vunmap(nvdev->send_buf);
>>> +        for (i = 0; i < nvdev->send_page_count; i++)
>>> +            __free_pages(nvdev->send_pages[i], alloc_unit);
>>> +    } else {
>>> +        vfree(nvdev->send_buf);
>>> +    }
>>> +
>>>       kfree(nvdev->send_section_map);
>>>
>>>       for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
>>> @@ -330,6 +353,108 @@ int netvsc_alloc_recv_comp_ring(struct 
>>> netvsc_device *net_device, u32 q_idx)
>>>       return nvchan->mrc.slots ? 0 : -ENOMEM;
>>>   }
>>>
>>> +void *netvsc_alloc_pages(struct page ***pages_array, unsigned int 
>>> *array_len,
>>> +             unsigned long size)
>>> +{
>>> +    struct page *page, **pages, **vmap_pages;
>>> +    unsigned long pg_count = size >> PAGE_SHIFT;
>>> +    int alloc_unit = MAX_ORDER_NR_PAGES;
>>> +    int i, j, vmap_page_index = 0;
>>> +    void *vaddr;
>>> +
>>> +    if (pg_count < alloc_unit)
>>> +        alloc_unit = 1;
>>> +
>>> +    /* vmap() accepts page array with PAGE_SIZE as unit while try to
>>> +     * allocate high order pages here in order to save page array 
>>> space.
>>> +     * vmap_pages[] is used as input parameter of vmap(). pages[] is to
>>> +     * store allocated pages and map them later.
>>> +     */
>>> +    vmap_pages = kmalloc_array(pg_count, sizeof(*vmap_pages), 
>>> GFP_KERNEL);
>>> +    if (!vmap_pages)
>>> +        return NULL;
>>> +
>>> +retry:
>>> +    *array_len = pg_count / alloc_unit;
>>> +    pages = kmalloc_array(*array_len, sizeof(*pages), GFP_KERNEL);
>>> +    if (!pages)
>>> +        goto cleanup;
>>> +
>>> +    for (i = 0; i < *array_len; i++) {
>>> +        page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
>>> +                   get_order(alloc_unit << PAGE_SHIFT));
>>> +        if (!page) {
>>> +            /* Try allocating small pages if high order pages are 
>>> not available. */
>>> +            if (alloc_unit == 1) {
>>> +                goto cleanup;
>>> +            } else {
>>
>> The "else" clause isn't really needed because of the goto cleanup 
>> above.  Then
>> the indentation of the code below could be reduced by one level.
>>
>>> +                memset(vmap_pages, 0,
>>> +                       sizeof(*vmap_pages) * vmap_page_index);
>>> +                vmap_page_index = 0;
>>> +
>>> +                for (j = 0; j < i; j++)
>>> +                    __free_pages(pages[j], alloc_unit);
>>> +
>>> +                kfree(pages);
>>> +                alloc_unit = 1;
>>
>> This is the case where a large enough contiguous physical memory chunk 
>> could
>> not be found.  But rather than dropping all the way down to single pages,
>> would it make sense to try something smaller, but not 1?  For example,
>> cut the alloc_unit in half and try again.  But I'm not sure of all the 
>> implications.
>>
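
A minimal sketch of that halving fallback, for illustration only: the helper name alloc_chunked() and its signature are made up here, it assumes pg_count stays a multiple of the chunk size, and only alloc_pages()/__free_pages() and MAX_ORDER_NR_PAGES are taken from the patch context.

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/slab.h>

/* Allocate pg_count pages as high-order chunks, halving the chunk size
 * on failure instead of dropping straight to single pages.
 */
static struct page **alloc_chunked(unsigned long pg_count,
				   unsigned int *chunk_pages,
				   unsigned int *nr_chunks)
{
	unsigned int unit = min_t(unsigned long, pg_count, MAX_ORDER_NR_PAGES);
	struct page **pages;
	unsigned int i, n;

retry:
	n = pg_count / unit;
	pages = kcalloc(n, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return NULL;

	for (i = 0; i < n; i++) {
		pages[i] = alloc_pages(GFP_KERNEL | __GFP_ZERO,
				       get_order(unit << PAGE_SHIFT));
		if (!pages[i]) {
			while (i--)
				__free_pages(pages[i],
					     get_order(unit << PAGE_SHIFT));
			kfree(pages);
			if (unit == 1)
				return NULL;
			unit /= 2;	/* cut the chunk size in half and retry */
			goto retry;
		}
	}

	*chunk_pages = unit;
	*nr_chunks = n;
	return pages;
}
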
>>> +                goto retry;
>>> +            }
>>> +        }
>>> +
>>> +        pages[i] = page;
>>> +        for (j = 0; j < alloc_unit; j++)
>>> +            vmap_pages[vmap_page_index++] = page++;
>>> +    }
>>> +
>>> +    vaddr = vmap(vmap_pages, vmap_page_index, VM_MAP, PAGE_KERNEL);
>>> +    kfree(vmap_pages);
>>> +
>>> +    *pages_array = pages;
>>> +    return vaddr;
>>> +
>>> +cleanup:
>>> +    for (j = 0; j < i; j++)
>>> +        __free_pages(pages[i], alloc_unit);
>>> +
>>> +    kfree(pages);
>>> +    kfree(vmap_pages);
>>> +    return NULL;
>>> +}
>>> +
>>> +static void *netvsc_map_pages(struct page **pages, int count, int 
>>> alloc_unit)
>>> +{
>>> +    int pg_count = count * alloc_unit;
>>> +    struct page *page;
>>> +    unsigned long *pfns;
>>> +    int pfn_index = 0;
>>> +    void *vaddr;
>>> +    int i, j;
>>> +
>>> +    if (!pages)
>>> +        return NULL;
>>> +
>>> +    pfns = kcalloc(pg_count, sizeof(*pfns), GFP_KERNEL);
>>> +    if (!pfns)
>>> +        return NULL;
>>> +
>>> +    for (i = 0; i < count; i++) {
>>> +        page = pages[i];
>>> +        if (!page) {
>>> +            pr_warn("page is not available %d.\n", i);
>>> +            return NULL;
>>> +        }
>>> +
>>> +        for (j = 0; j < alloc_unit; j++) {
>>> +            pfns[pfn_index++] = page_to_pfn(page++) +
>>> +                (ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);
>>> +        }
>>> +    }
>>> +
>>> +    vaddr = vmap_pfn(pfns, pg_count, PAGE_KERNEL_IO);
>>> +    kfree(pfns);
>>> +    return vaddr;
>>> +}
>>> +
>>
>> I think you are proposing this approach to allocating memory for the send
>> and receive buffers so that you can avoid having two virtual mappings for
>> the memory, per comments from Christoph Hellwig.  But overall, the 
>> approach
>> seems a bit complex and I wonder if it is worth it.  If allocating 
>> large contiguous
>> chunks of physical memory is successful, then there is some memory 
>> savings
>> in that the data structures needed to keep track of the physical pages is
>> smaller than the equivalent page tables might be.  But if you have to 
>> revert
>> to allocating individual pages, then the memory savings is reduced.
>>
>> Ultimately, the list of actual PFNs has to be kept somewhere.  Another 
>> approach
>> would be to do the reverse of what hv_map_memory() from the v4 patch
>> series does.  I.e., you could do virt_to_phys() on each virtual 
>> address that
>> maps above VTOM, and subtract out the shared_gpa_boundary to get the
>> list of actual PFNs that need to be freed.   This way you don't have 
>> two copies
>> of the list of PFNs -- one with and one without the 
>> shared_gpa_boundary added.
>> But it comes at the cost of additional code so that may not be a great 
>> idea.
>>
>> I think what you have here works, and I don't have a clearly better 
>> solution
>> at the moment except perhaps to revert to the v4 solution and just 
>> have two
>> virtual mappings.  I'll keep thinking about it.  Maybe Christoph has other
>> thoughts.
>>
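
A rough sketch of that reverse mapping, purely as an illustration of the idea: free_vtom_buf() is a made-up name, and it assumes vmalloc_to_pfn() yields the correct PFN for a vmap_pfn() area whose PFNs sit above vTOM and have no struct page, which would need to be confirmed.

#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <asm/mshyperv.h>

/* Recover the real PFNs from the vTOM mapping itself: walk the buffer
 * chunk by chunk, subtract shared_gpa_boundary, and free the pages,
 * so no separate PFN/page list has to be kept around.
 */
static void free_vtom_buf(void *vaddr, unsigned long size, unsigned int order)
{
	unsigned long boundary_pfn = ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT;
	unsigned long step = PAGE_SIZE << order;
	unsigned long off;

	for (off = 0; off < size; off += step) {
		unsigned long pfn = vmalloc_to_pfn(vaddr + off) - boundary_pfn;

		/* Pages are freed just before the mapping is torn down. */
		__free_pages(pfn_to_page(pfn), order);
	}

	vunmap(vaddr);
}
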
>>>   static int netvsc_init_buf(struct hv_device *device,
>>>                  struct netvsc_device *net_device,
>>>                  const struct netvsc_device_info *device_info)
>>> @@ -337,7 +462,7 @@ static int netvsc_init_buf(struct hv_device *device,
>>>       struct nvsp_1_message_send_receive_buffer_complete *resp;
>>>       struct net_device *ndev = hv_get_drvdata(device);
>>>       struct nvsp_message *init_packet;
>>> -    unsigned int buf_size;
>>> +    unsigned int buf_size, alloc_unit;
>>>       size_t map_words;
>>>       int i, ret = 0;
>>>
>>> @@ -350,7 +475,14 @@ static int netvsc_init_buf(struct hv_device 
>>> *device,
>>>           buf_size = min_t(unsigned int, buf_size,
>>>                    NETVSC_RECEIVE_BUFFER_SIZE_LEGACY);
>>>
>>> -    net_device->recv_buf = vzalloc(buf_size);
>>> +    if (hv_isolation_type_snp())
>>> +        net_device->recv_buf =
>>> +            netvsc_alloc_pages(&net_device->recv_pages,
>>> +                       &net_device->recv_page_count,
>>> +                       buf_size);
>>> +    else
>>> +        net_device->recv_buf = vzalloc(buf_size);
>>> +
>>
>> I wonder if it is necessary to have two different code paths here.  The
>> allocating and freeing of the send and receive buffers is not perf
>> sensitive, and it seems like netvsc_alloc_pages() could be used
>> regardless of whether SNP Isolation is in effect.  To my thinking,
>> one code path is better than two code paths unless there's a
>> compelling reason to have two.
>>
>>>       if (!net_device->recv_buf) {
>>>           netdev_err(ndev,
>>>                  "unable to allocate receive buffer of size %u\n",
>>> @@ -375,6 +507,27 @@ static int netvsc_init_buf(struct hv_device 
>>> *device,
>>>           goto cleanup;
>>>       }
>>>
>>> +    if (hv_isolation_type_snp()) {
>>> +        alloc_unit = (buf_size / net_device->recv_page_count)
>>> +                >> PAGE_SHIFT;
>>> +
>>> +        /* Unmap previous virtual address and map pages in the extra
>>> +         * address space(above shared gpa boundary) in Isolation VM.
>>> +         */
>>> +        vunmap(net_device->recv_buf);
>>> +        net_device->recv_buf =
>>> +            netvsc_map_pages(net_device->recv_pages,
>>> +                     net_device->recv_page_count,
>>> +                     alloc_unit);
>>> +        if (!net_device->recv_buf) {
>>> +            netdev_err(ndev,
>>> +                   "unable to allocate receive buffer of size %u\n",
>>> +                   buf_size);
>>> +            ret = -ENOMEM;
>>> +            goto cleanup;
>>> +        }
>>> +    }
>>> +
>>>       /* Notify the NetVsp of the gpadl handle */
>>>       init_packet = &net_device->channel_init_pkt;
>>>       memset(init_packet, 0, sizeof(struct nvsp_message));
>>> @@ -456,13 +609,21 @@ static int netvsc_init_buf(struct hv_device 
>>> *device,
>>>       buf_size = device_info->send_sections * 
>>> device_info->send_section_size;
>>>       buf_size = round_up(buf_size, PAGE_SIZE);
>>>
>>> -    net_device->send_buf = vzalloc(buf_size);
>>> +    if (hv_isolation_type_snp())
>>> +        net_device->send_buf =
>>> +            netvsc_alloc_pages(&net_device->send_pages,
>>> +                       &net_device->send_page_count,
>>> +                       buf_size);
>>> +    else
>>> +        net_device->send_buf = vzalloc(buf_size);
>>> +
>>>       if (!net_device->send_buf) {
>>>           netdev_err(ndev, "unable to allocate send buffer of size 
>>> %u\n",
>>>                  buf_size);
>>>           ret = -ENOMEM;
>>>           goto cleanup;
>>>       }
>>> +
>>>       net_device->send_buf_size = buf_size;
>>>
>>>       /* Establish the gpadl handle for this buffer on this
>>> @@ -478,6 +639,27 @@ static int netvsc_init_buf(struct hv_device 
>>> *device,
>>>           goto cleanup;
>>>       }
>>>
>>> +    if (hv_isolation_type_snp()) {
>>> +        alloc_unit = (buf_size / net_device->send_page_count)
>>> +                >> PAGE_SHIFT;
>>> +
>>> +        /* Unmap previous virtual address and map pages in the extra
>>> +         * address space(above shared gpa boundary) in Isolation VM.
>>> +         */
>>> +        vunmap(net_device->send_buf);
>>> +        net_device->send_buf =
>>> +            netvsc_map_pages(net_device->send_pages,
>>> +                     net_device->send_page_count,
>>> +                     alloc_unit);
>>> +        if (!net_device->send_buf) {
>>> +            netdev_err(ndev,
>>> +                   "unable to allocate receive buffer of size %u\n",
>>> +                   buf_size);
>>> +            ret = -ENOMEM;
>>> +            goto cleanup;
>>> +        }
>>> +    }
>>> +
>>>       /* Notify the NetVsp of the gpadl handle */
>>>       init_packet = &net_device->channel_init_pkt;
>>>       memset(init_packet, 0, sizeof(struct nvsp_message));
>>> @@ -768,7 +950,7 @@ static void netvsc_send_tx_complete(struct 
>>> net_device *ndev,
>>>
>>>       /* Notify the layer above us */
>>>       if (likely(skb)) {
>>> -        const struct hv_netvsc_packet *packet
>>> +        struct hv_netvsc_packet *packet
>>>               = (struct hv_netvsc_packet *)skb->cb;
>>>           u32 send_index = packet->send_buf_index;
>>>           struct netvsc_stats *tx_stats;
>>> @@ -784,6 +966,7 @@ static void netvsc_send_tx_complete(struct 
>>> net_device *ndev,
>>>           tx_stats->bytes += packet->total_bytes;
>>>           u64_stats_update_end(&tx_stats->syncp);
>>>
>>> +        netvsc_dma_unmap(ndev_ctx->device_ctx, packet);
>>>           napi_consume_skb(skb, budget);
>>>       }
>>>
>>> @@ -948,6 +1131,87 @@ static void netvsc_copy_to_send_buf(struct 
>>> netvsc_device *net_device,
>>>           memset(dest, 0, padding);
>>>   }
>>>
>>> +void netvsc_dma_unmap(struct hv_device *hv_dev,
>>> +              struct hv_netvsc_packet *packet)
>>> +{
>>> +    u32 page_count = packet->cp_partial ?
>>> +        packet->page_buf_cnt - packet->rmsg_pgcnt :
>>> +        packet->page_buf_cnt;
>>> +    int i;
>>> +
>>> +    if (!hv_is_isolation_supported())
>>> +        return;
>>> +
>>> +    if (!packet->dma_range)
>>> +        return;
>>> +
>>> +    for (i = 0; i < page_count; i++)
>>> +        dma_unmap_single(&hv_dev->device, packet->dma_range[i].dma,
>>> +                 packet->dma_range[i].mapping_size,
>>> +                 DMA_TO_DEVICE);
>>> +
>>> +    kfree(packet->dma_range);
>>> +}
>>> +
>>> +/* netvsc_dma_map - Map swiotlb bounce buffer with data page of
>>> + * packet sent by vmbus_sendpacket_pagebuffer() in the Isolation
>>> + * VM.
>>> + *
>>> + * In isolation VM, netvsc send buffer has been marked visible to
>>> + * host and so the data copied to send buffer doesn't need to use
>>> + * bounce buffer. The data pages handled by 
>>> vmbus_sendpacket_pagebuffer()
>>> + * may not be copied to send buffer and so these pages need to be
>>> + * mapped with swiotlb bounce buffer. netvsc_dma_map() is to do
>>> + * that. The pfns in the struct hv_page_buffer need to be converted
>>> + * to bounce buffer's pfn. The loop here is necessary because the
>>> + * entries in the page buffer array are not necessarily full
>>> + * pages of data.  Each entry in the array has a separate offset and
>>> + * len that may be non-zero, even for entries in the middle of the
>>> + * array.  And the entries are not physically contiguous.  So each
>>> + * entry must be individually mapped rather than as a contiguous unit.
>>> + * So not use dma_map_sg() here.
>>> + */
>>> +static int netvsc_dma_map(struct hv_device *hv_dev,
>>> +           struct hv_netvsc_packet *packet,
>>> +           struct hv_page_buffer *pb)
>>> +{
>>> +    u32 page_count =  packet->cp_partial ?
>>> +        packet->page_buf_cnt - packet->rmsg_pgcnt :
>>> +        packet->page_buf_cnt;
>>> +    dma_addr_t dma;
>>> +    int i;
>>> +
>>> +    if (!hv_is_isolation_supported())
>>> +        return 0;
>>> +
>>> +    packet->dma_range = kcalloc(page_count,
>>> +                    sizeof(*packet->dma_range),
>>> +                    GFP_KERNEL);
>>> +    if (!packet->dma_range)
>>> +        return -ENOMEM;
>>> +
>>> +    for (i = 0; i < page_count; i++) {
>>> +        char *src = phys_to_virt((pb[i].pfn << HV_HYP_PAGE_SHIFT)
>>> +                     + pb[i].offset);
>>> +        u32 len = pb[i].len;
>>> +
>>> +        dma = dma_map_single(&hv_dev->device, src, len,
>>> +                     DMA_TO_DEVICE);
>>> +        if (dma_mapping_error(&hv_dev->device, dma)) {
>>> +            kfree(packet->dma_range);
>>> +            return -ENOMEM;
>>> +        }
>>> +
>>> +        packet->dma_range[i].dma = dma;
>>> +        packet->dma_range[i].mapping_size = len;
>>> +        pb[i].pfn = dma >> HV_HYP_PAGE_SHIFT;
>>> +        pb[i].offset = offset_in_hvpage(dma);
>>
>> With the DMA min align mask now being set, the offset within
>> the Hyper-V page won't be changed by dma_map_single().  So I
>> think the above statement can be removed.
>>
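
For reference, a small sketch of the invariant being relied on here; the wrapper name map_fragment_checked() is made up, only dma_set_min_align_mask()/offset_in_hvpage() come from the patch context.

#include <linux/dma-mapping.h>
#include <linux/io.h>
#include <asm/mshyperv.h>

/* With dma_set_min_align_mask(dev, HV_HYP_PAGE_SIZE - 1) set at probe
 * time, swiotlb picks a bounce slot that preserves the offset within a
 * Hyper-V page, so the offset computed from the original buffer stays
 * valid for the returned dma address.
 */
static dma_addr_t map_fragment_checked(struct device *dev, void *src, u32 len)
{
	dma_addr_t dma = dma_map_single(dev, src, len, DMA_TO_DEVICE);

	if (!dma_mapping_error(dev, dma))
		WARN_ON(offset_in_hvpage(dma) !=
			offset_in_hvpage(virt_to_phys(src)));

	return dma;
}
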
>>> +        pb[i].len = len;
>>
>> A few lines above, the value of "len" is set from pb[i].len.  Neither
>> "len" nor "i" is changed in the loop, so this statement can also be
>> removed.
>>
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>>   static inline int netvsc_send_pkt(
>>>       struct hv_device *device,
>>>       struct hv_netvsc_packet *packet,
>>> @@ -988,14 +1252,24 @@ static inline int netvsc_send_pkt(
>>>
>>>       trace_nvsp_send_pkt(ndev, out_channel, rpkt);
>>>
>>> +    packet->dma_range = NULL;
>>>       if (packet->page_buf_cnt) {
>>>           if (packet->cp_partial)
>>>               pb += packet->rmsg_pgcnt;
>>>
>>> +        ret = netvsc_dma_map(ndev_ctx->device_ctx, packet, pb);
>>> +        if (ret) {
>>> +            ret = -EAGAIN;
>>> +            goto exit;
>>> +        }
>>> +
>>>           ret = vmbus_sendpacket_pagebuffer(out_channel,
>>>                             pb, packet->page_buf_cnt,
>>>                             &nvmsg, sizeof(nvmsg),
>>>                             req_id);
>>> +
>>> +        if (ret)
>>> +            netvsc_dma_unmap(ndev_ctx->device_ctx, packet);
>>>       } else {
>>>           ret = vmbus_sendpacket(out_channel,
>>>                          &nvmsg, sizeof(nvmsg),
>>> @@ -1003,6 +1277,7 @@ static inline int netvsc_send_pkt(
>>>                          VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
>>>       }
>>>
>>> +exit:
>>>       if (ret == 0) {
>>>           atomic_inc_return(&nvchan->queue_sends);
>>>
>>> diff --git a/drivers/net/hyperv/netvsc_drv.c 
>>> b/drivers/net/hyperv/netvsc_drv.c
>>> index 382bebc2420d..c3dc884b31e3 100644
>>> --- a/drivers/net/hyperv/netvsc_drv.c
>>> +++ b/drivers/net/hyperv/netvsc_drv.c
>>> @@ -2577,6 +2577,7 @@ static int netvsc_probe(struct hv_device *dev,
>>>       list_add(&net_device_ctx->list, &netvsc_dev_list);
>>>       rtnl_unlock();
>>>
>>> +    dma_set_min_align_mask(&dev->device, HV_HYP_PAGE_SIZE - 1);
>>>       netvsc_devinfo_put(device_info);
>>>       return 0;
>>>
>>> diff --git a/drivers/net/hyperv/rndis_filter.c 
>>> b/drivers/net/hyperv/rndis_filter.c
>>> index f6c9c2a670f9..448fcc325ed7 100644
>>> --- a/drivers/net/hyperv/rndis_filter.c
>>> +++ b/drivers/net/hyperv/rndis_filter.c
>>> @@ -361,6 +361,8 @@ static void rndis_filter_receive_response(struct 
>>> net_device *ndev,
>>>               }
>>>           }
>>>
>>> +        netvsc_dma_unmap(((struct net_device_context *)
>>> +            netdev_priv(ndev))->device_ctx, &request->pkt);
>>>           complete(&request->wait_event);
>>>       } else {
>>>           netdev_err(ndev,
>>> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
>>> index c94c534a944e..81e58dd582dc 100644
>>> --- a/include/linux/hyperv.h
>>> +++ b/include/linux/hyperv.h
>>> @@ -1597,6 +1597,11 @@ struct hyperv_service_callback {
>>>       void (*callback)(void *context);
>>>   };
>>>
>>> +struct hv_dma_range {
>>> +    dma_addr_t dma;
>>> +    u32 mapping_size;
>>> +};
>>> +
>>>   #define MAX_SRV_VER    0x7ffffff
>>>   extern bool vmbus_prep_negotiate_resp(struct icmsg_hdr *icmsghdrp, 
>>> u8 *buf, u32 buflen,
>>>                   const int *fw_version, int fw_vercnt,
>>> -- 
>>> 2.25.1
>>
Christoph Hellwig Sept. 28, 2021, 5:39 a.m. UTC | #8
On Mon, Sep 27, 2021 at 10:26:43PM +0800, Tianyu Lan wrote:
> Hi Christoph:
>     Gentle ping. The swiotlb and shared memory mapping changes in this
> patchset need your review. Could you have a look?

I'm a little too busy for a review of such a huge patchset right now.
That being said, here are my comments from a very quick review:

 - the bare memremap usage in swiotlb looks strange and I'd
   definitely expect a well-documented wrapper.
 - given that we can now hand out swiotlb memory for coherent mappings
   we need to carefully audit what happens when this memremapped
   memory gets mmapped or used through dma_get_sgtable
 - the netvsc changes I'm not happy with at all.  A large part of it
   is that the driver already has a bad structure, but this series
   is making it significantly worse.  We'll need to find a way
   to use the proper dma mapping abstractions here.  One option
   if you want to stick to the double vmapped buffer would be something
    like using dma_alloc_noncontiguous plus a variant of
   dma_vmap_noncontiguous that takes the shared_gpa_boundary into
   account.
Tianyu Lan Sept. 28, 2021, 9:23 a.m. UTC | #9
On 9/28/2021 1:39 PM, Christoph Hellwig wrote:
> On Mon, Sep 27, 2021 at 10:26:43PM +0800, Tianyu Lan wrote:
>> Hi Christoph:
>>      Gentle ping. The swiotlb and shared memory mapping changes in this
>> patchset need your review. Could you have a look?
>
> I'm a little too busy for a review of such a huge patchset right now.
> That being said, here are my comments from a very quick review:
Hi Christoph:
       Thanks for your comments. Most patches in the series are Hyper-V
changes. I will split the patchset to make it easier to review.


> 
>   - the bare memremap usage in swiotlb looks strange and I'd
>     definitely expect a well-documented wrapper.

OK. Should the wrapper be in the DMA code? How about the dma_map_decrypted()
introduced in V4?
https://lkml.org/lkml/2021/8/27/605

>   - given that we can now hand out swiotlb memory for coherent mappings
>     we need to carefully audit what happens when this memremapped
>     memory gets mmapped or used through dma_get_sgtable

OK. I will check that.

>   - the netvsc changes I'm not happy with at all.  A large part of it
>     is that the driver already has a bad structure, but this series
>     is making it significantly worse.  We'll need to find a way
>     to use the proper dma mapping abstractions here.  One option
>     if you want to stick to the double vmapped buffer would be something
>     like using dma_alloc_noncontiguous plus a variant of
>     dma_vmap_noncontiguous that takes the shared_gpa_boundary into
>     account.
> 

OK. I will do that.
Christoph Hellwig Sept. 30, 2021, 5:48 a.m. UTC | #10
On Tue, Sep 28, 2021 at 05:23:31PM +0800, Tianyu Lan wrote:
>>
>>   - the bare memremap usage in swiotlb looks strange and I'd
>>     definitely expect a well-documented wrapper.
>
> OK. Should the wrapper be in the DMA code? How about the dma_map_decrypted()
> introduced in V4?

As mentioned then, the name is a pretty bad choice as it touches the dma_map*
namespace that it is not related to.  I suspect just a little helper
in the swiotlb code that explains how it is used might be enough for now.
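
Something along these lines, purely as a sketch of the kind of helper being asked for; the name swiotlb_mem_remap() and the unencrypted_base parameter are assumptions, only memremap()/MEMREMAP_WB are existing interfaces.

#include <linux/io.h>

/* Remap the swiotlb pool through the host-visible alias rather than
 * the direct map.  In an Isolation VM the bounce buffers have been
 * marked visible to the host, so they must be accessed above the
 * shared GPA boundary.
 */
static void *swiotlb_mem_remap(phys_addr_t start, size_t bytes,
			       u64 unencrypted_base)
{
	return memremap(start + unencrypted_base, bytes, MEMREMAP_WB);
}
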
diff mbox series

Patch

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 315278a7cf88..87e8c74398a5 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -164,6 +164,7 @@  struct hv_netvsc_packet {
 	u32 total_bytes;
 	u32 send_buf_index;
 	u32 total_data_buflen;
+	struct hv_dma_range *dma_range;
 };
 
 #define NETVSC_HASH_KEYLEN 40
@@ -1074,6 +1075,8 @@  struct netvsc_device {
 
 	/* Receive buffer allocated by us but manages by NetVSP */
 	void *recv_buf;
+	struct page **recv_pages;
+	u32 recv_page_count;
 	u32 recv_buf_size; /* allocated bytes */
 	struct vmbus_gpadl recv_buf_gpadl_handle;
 	u32 recv_section_cnt;
@@ -1082,6 +1085,8 @@  struct netvsc_device {
 
 	/* Send buffer allocated by us */
 	void *send_buf;
+	struct page **send_pages;
+	u32 send_page_count;
 	u32 send_buf_size;
 	struct vmbus_gpadl send_buf_gpadl_handle;
 	u32 send_section_cnt;
@@ -1731,4 +1736,6 @@  struct rndis_message {
 #define RETRY_US_HI	10000
 #define RETRY_MAX	2000	/* >10 sec */
 
+void netvsc_dma_unmap(struct hv_device *hv_dev,
+		      struct hv_netvsc_packet *packet);
 #endif /* _HYPERV_NET_H */
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 1f87e570ed2b..7d5254bf043e 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -20,6 +20,7 @@ 
 #include <linux/vmalloc.h>
 #include <linux/rtnetlink.h>
 #include <linux/prefetch.h>
+#include <linux/gfp.h>
 
 #include <asm/sync_bitops.h>
 #include <asm/mshyperv.h>
@@ -150,11 +151,33 @@  static void free_netvsc_device(struct rcu_head *head)
 {
 	struct netvsc_device *nvdev
 		= container_of(head, struct netvsc_device, rcu);
+	unsigned int alloc_unit;
 	int i;
 
 	kfree(nvdev->extension);
-	vfree(nvdev->recv_buf);
-	vfree(nvdev->send_buf);
+
+	if (nvdev->recv_pages) {
+		alloc_unit = (nvdev->recv_buf_size /
+			nvdev->recv_page_count) >> PAGE_SHIFT;
+
+		vunmap(nvdev->recv_buf);
+		for (i = 0; i < nvdev->recv_page_count; i++)
+			__free_pages(nvdev->recv_pages[i], alloc_unit);
+	} else {
+		vfree(nvdev->recv_buf);
+	}
+
+	if (nvdev->send_pages) {
+		alloc_unit = (nvdev->send_buf_size /
+			nvdev->send_page_count) >> PAGE_SHIFT;
+
+		vunmap(nvdev->send_buf);
+		for (i = 0; i < nvdev->send_page_count; i++)
+			__free_pages(nvdev->send_pages[i], alloc_unit);
+	} else {
+		vfree(nvdev->send_buf);
+	}
+
 	kfree(nvdev->send_section_map);
 
 	for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
@@ -330,6 +353,108 @@  int netvsc_alloc_recv_comp_ring(struct netvsc_device *net_device, u32 q_idx)
 	return nvchan->mrc.slots ? 0 : -ENOMEM;
 }
 
+void *netvsc_alloc_pages(struct page ***pages_array, unsigned int *array_len,
+			 unsigned long size)
+{
+	struct page *page, **pages, **vmap_pages;
+	unsigned long pg_count = size >> PAGE_SHIFT;
+	int alloc_unit = MAX_ORDER_NR_PAGES;
+	int i, j, vmap_page_index = 0;
+	void *vaddr;
+
+	if (pg_count < alloc_unit)
+		alloc_unit = 1;
+
+	/* vmap() accepts page array with PAGE_SIZE as unit while try to
+	 * allocate high order pages here in order to save page array space.
+	 * vmap_pages[] is used as input parameter of vmap(). pages[] is to
+	 * store allocated pages and map them later.
+	 */
+	vmap_pages = kmalloc_array(pg_count, sizeof(*vmap_pages), GFP_KERNEL);
+	if (!vmap_pages)
+		return NULL;
+
+retry:
+	*array_len = pg_count / alloc_unit;
+	pages = kmalloc_array(*array_len, sizeof(*pages), GFP_KERNEL);
+	if (!pages)
+		goto cleanup;
+
+	for (i = 0; i < *array_len; i++) {
+		page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
+				   get_order(alloc_unit << PAGE_SHIFT));
+		if (!page) {
+			/* Try allocating small pages if high order pages are not available. */
+			if (alloc_unit == 1) {
+				goto cleanup;
+			} else {
+				memset(vmap_pages, 0,
+				       sizeof(*vmap_pages) * vmap_page_index);
+				vmap_page_index = 0;
+
+				for (j = 0; j < i; j++)
+					__free_pages(pages[j], alloc_unit);
+
+				kfree(pages);
+				alloc_unit = 1;
+				goto retry;
+			}
+		}
+
+		pages[i] = page;
+		for (j = 0; j < alloc_unit; j++)
+			vmap_pages[vmap_page_index++] = page++;
+	}
+
+	vaddr = vmap(vmap_pages, vmap_page_index, VM_MAP, PAGE_KERNEL);
+	kfree(vmap_pages);
+
+	*pages_array = pages;
+	return vaddr;
+
+cleanup:
+	for (j = 0; j < i; j++)
+		__free_pages(pages[i], alloc_unit);
+
+	kfree(pages);
+	kfree(vmap_pages);
+	return NULL;
+}
+
+static void *netvsc_map_pages(struct page **pages, int count, int alloc_unit)
+{
+	int pg_count = count * alloc_unit;
+	struct page *page;
+	unsigned long *pfns;
+	int pfn_index = 0;
+	void *vaddr;
+	int i, j;
+
+	if (!pages)
+		return NULL;
+
+	pfns = kcalloc(pg_count, sizeof(*pfns), GFP_KERNEL);
+	if (!pfns)
+		return NULL;
+
+	for (i = 0; i < count; i++) {
+		page = pages[i];
+		if (!page) {
+			pr_warn("page is not available %d.\n", i);
+			return NULL;
+		}
+
+		for (j = 0; j < alloc_unit; j++) {
+			pfns[pfn_index++] = page_to_pfn(page++) +
+				(ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);
+		}
+	}
+
+	vaddr = vmap_pfn(pfns, pg_count, PAGE_KERNEL_IO);
+	kfree(pfns);
+	return vaddr;
+}
+
 static int netvsc_init_buf(struct hv_device *device,
 			   struct netvsc_device *net_device,
 			   const struct netvsc_device_info *device_info)
@@ -337,7 +462,7 @@  static int netvsc_init_buf(struct hv_device *device,
 	struct nvsp_1_message_send_receive_buffer_complete *resp;
 	struct net_device *ndev = hv_get_drvdata(device);
 	struct nvsp_message *init_packet;
-	unsigned int buf_size;
+	unsigned int buf_size, alloc_unit;
 	size_t map_words;
 	int i, ret = 0;
 
@@ -350,7 +475,14 @@  static int netvsc_init_buf(struct hv_device *device,
 		buf_size = min_t(unsigned int, buf_size,
 				 NETVSC_RECEIVE_BUFFER_SIZE_LEGACY);
 
-	net_device->recv_buf = vzalloc(buf_size);
+	if (hv_isolation_type_snp())
+		net_device->recv_buf =
+			netvsc_alloc_pages(&net_device->recv_pages,
+					   &net_device->recv_page_count,
+					   buf_size);
+	else
+		net_device->recv_buf = vzalloc(buf_size);
+
 	if (!net_device->recv_buf) {
 		netdev_err(ndev,
 			   "unable to allocate receive buffer of size %u\n",
@@ -375,6 +507,27 @@  static int netvsc_init_buf(struct hv_device *device,
 		goto cleanup;
 	}
 
+	if (hv_isolation_type_snp()) {
+		alloc_unit = (buf_size / net_device->recv_page_count)
+				>> PAGE_SHIFT;
+
+		/* Unmap previous virtual address and map pages in the extra
+		 * address space(above shared gpa boundary) in Isolation VM.
+		 */
+		vunmap(net_device->recv_buf);
+		net_device->recv_buf =
+			netvsc_map_pages(net_device->recv_pages,
+					 net_device->recv_page_count,
+					 alloc_unit);
+		if (!net_device->recv_buf) {
+			netdev_err(ndev,
+				   "unable to allocate receive buffer of size %u\n",
+				   buf_size);
+			ret = -ENOMEM;
+			goto cleanup;
+		}
+	}
+
 	/* Notify the NetVsp of the gpadl handle */
 	init_packet = &net_device->channel_init_pkt;
 	memset(init_packet, 0, sizeof(struct nvsp_message));
@@ -456,13 +609,21 @@  static int netvsc_init_buf(struct hv_device *device,
 	buf_size = device_info->send_sections * device_info->send_section_size;
 	buf_size = round_up(buf_size, PAGE_SIZE);
 
-	net_device->send_buf = vzalloc(buf_size);
+	if (hv_isolation_type_snp())
+		net_device->send_buf =
+			netvsc_alloc_pages(&net_device->send_pages,
+					   &net_device->send_page_count,
+					   buf_size);
+	else
+		net_device->send_buf = vzalloc(buf_size);
+
 	if (!net_device->send_buf) {
 		netdev_err(ndev, "unable to allocate send buffer of size %u\n",
 			   buf_size);
 		ret = -ENOMEM;
 		goto cleanup;
 	}
+
 	net_device->send_buf_size = buf_size;
 
 	/* Establish the gpadl handle for this buffer on this
@@ -478,6 +639,27 @@  static int netvsc_init_buf(struct hv_device *device,
 		goto cleanup;
 	}
 
+	if (hv_isolation_type_snp()) {
+		alloc_unit = (buf_size / net_device->send_page_count)
+				>> PAGE_SHIFT;
+
+		/* Unmap previous virtual address and map pages in the extra
+		 * address space(above shared gpa boundary) in Isolation VM.
+		 */
+		vunmap(net_device->send_buf);
+		net_device->send_buf =
+			netvsc_map_pages(net_device->send_pages,
+					 net_device->send_page_count,
+					 alloc_unit);
+		if (!net_device->send_buf) {
+			netdev_err(ndev,
+				   "unable to allocate receive buffer of size %u\n",
+				   buf_size);
+			ret = -ENOMEM;
+			goto cleanup;
+		}
+	}
+
 	/* Notify the NetVsp of the gpadl handle */
 	init_packet = &net_device->channel_init_pkt;
 	memset(init_packet, 0, sizeof(struct nvsp_message));
@@ -768,7 +950,7 @@  static void netvsc_send_tx_complete(struct net_device *ndev,
 
 	/* Notify the layer above us */
 	if (likely(skb)) {
-		const struct hv_netvsc_packet *packet
+		struct hv_netvsc_packet *packet
 			= (struct hv_netvsc_packet *)skb->cb;
 		u32 send_index = packet->send_buf_index;
 		struct netvsc_stats *tx_stats;
@@ -784,6 +966,7 @@  static void netvsc_send_tx_complete(struct net_device *ndev,
 		tx_stats->bytes += packet->total_bytes;
 		u64_stats_update_end(&tx_stats->syncp);
 
+		netvsc_dma_unmap(ndev_ctx->device_ctx, packet);
 		napi_consume_skb(skb, budget);
 	}
 
@@ -948,6 +1131,87 @@  static void netvsc_copy_to_send_buf(struct netvsc_device *net_device,
 		memset(dest, 0, padding);
 }
 
+void netvsc_dma_unmap(struct hv_device *hv_dev,
+		      struct hv_netvsc_packet *packet)
+{
+	u32 page_count = packet->cp_partial ?
+		packet->page_buf_cnt - packet->rmsg_pgcnt :
+		packet->page_buf_cnt;
+	int i;
+
+	if (!hv_is_isolation_supported())
+		return;
+
+	if (!packet->dma_range)
+		return;
+
+	for (i = 0; i < page_count; i++)
+		dma_unmap_single(&hv_dev->device, packet->dma_range[i].dma,
+				 packet->dma_range[i].mapping_size,
+				 DMA_TO_DEVICE);
+
+	kfree(packet->dma_range);
+}
+
+/* netvsc_dma_map - Map swiotlb bounce buffer with data page of
+ * packet sent by vmbus_sendpacket_pagebuffer() in the Isolation
+ * VM.
+ *
+ * In isolation VM, netvsc send buffer has been marked visible to
+ * host and so the data copied to send buffer doesn't need to use
+ * bounce buffer. The data pages handled by vmbus_sendpacket_pagebuffer()
+ * may not be copied to send buffer and so these pages need to be
+ * mapped with swiotlb bounce buffer. netvsc_dma_map() is to do
+ * that. The pfns in the struct hv_page_buffer need to be converted
+ * to bounce buffer's pfn. The loop here is necessary because the
+ * entries in the page buffer array are not necessarily full
+ * pages of data.  Each entry in the array has a separate offset and
+ * len that may be non-zero, even for entries in the middle of the
+ * array.  And the entries are not physically contiguous.  So each
+ * entry must be individually mapped rather than as a contiguous unit.
+ * So not use dma_map_sg() here.
+ */
+static int netvsc_dma_map(struct hv_device *hv_dev,
+		   struct hv_netvsc_packet *packet,
+		   struct hv_page_buffer *pb)
+{
+	u32 page_count =  packet->cp_partial ?
+		packet->page_buf_cnt - packet->rmsg_pgcnt :
+		packet->page_buf_cnt;
+	dma_addr_t dma;
+	int i;
+
+	if (!hv_is_isolation_supported())
+		return 0;
+
+	packet->dma_range = kcalloc(page_count,
+				    sizeof(*packet->dma_range),
+				    GFP_KERNEL);
+	if (!packet->dma_range)
+		return -ENOMEM;
+
+	for (i = 0; i < page_count; i++) {
+		char *src = phys_to_virt((pb[i].pfn << HV_HYP_PAGE_SHIFT)
+					 + pb[i].offset);
+		u32 len = pb[i].len;
+
+		dma = dma_map_single(&hv_dev->device, src, len,
+				     DMA_TO_DEVICE);
+		if (dma_mapping_error(&hv_dev->device, dma)) {
+			kfree(packet->dma_range);
+			return -ENOMEM;
+		}
+
+		packet->dma_range[i].dma = dma;
+		packet->dma_range[i].mapping_size = len;
+		pb[i].pfn = dma >> HV_HYP_PAGE_SHIFT;
+		pb[i].offset = offset_in_hvpage(dma);
+		pb[i].len = len;
+	}
+
+	return 0;
+}
+
 static inline int netvsc_send_pkt(
 	struct hv_device *device,
 	struct hv_netvsc_packet *packet,
@@ -988,14 +1252,24 @@  static inline int netvsc_send_pkt(
 
 	trace_nvsp_send_pkt(ndev, out_channel, rpkt);
 
+	packet->dma_range = NULL;
 	if (packet->page_buf_cnt) {
 		if (packet->cp_partial)
 			pb += packet->rmsg_pgcnt;
 
+		ret = netvsc_dma_map(ndev_ctx->device_ctx, packet, pb);
+		if (ret) {
+			ret = -EAGAIN;
+			goto exit;
+		}
+
 		ret = vmbus_sendpacket_pagebuffer(out_channel,
 						  pb, packet->page_buf_cnt,
 						  &nvmsg, sizeof(nvmsg),
 						  req_id);
+
+		if (ret)
+			netvsc_dma_unmap(ndev_ctx->device_ctx, packet);
 	} else {
 		ret = vmbus_sendpacket(out_channel,
 				       &nvmsg, sizeof(nvmsg),
@@ -1003,6 +1277,7 @@  static inline int netvsc_send_pkt(
 				       VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
 	}
 
+exit:
 	if (ret == 0) {
 		atomic_inc_return(&nvchan->queue_sends);
 
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 382bebc2420d..c3dc884b31e3 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2577,6 +2577,7 @@  static int netvsc_probe(struct hv_device *dev,
 	list_add(&net_device_ctx->list, &netvsc_dev_list);
 	rtnl_unlock();
 
+	dma_set_min_align_mask(&dev->device, HV_HYP_PAGE_SIZE - 1);
 	netvsc_devinfo_put(device_info);
 	return 0;
 
diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
index f6c9c2a670f9..448fcc325ed7 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -361,6 +361,8 @@  static void rndis_filter_receive_response(struct net_device *ndev,
 			}
 		}
 
+		netvsc_dma_unmap(((struct net_device_context *)
+			netdev_priv(ndev))->device_ctx, &request->pkt);
 		complete(&request->wait_event);
 	} else {
 		netdev_err(ndev,
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index c94c534a944e..81e58dd582dc 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1597,6 +1597,11 @@  struct hyperv_service_callback {
 	void (*callback)(void *context);
 };
 
+struct hv_dma_range {
+	dma_addr_t dma;
+	u32 mapping_size;
+};
+
 #define MAX_SRV_VER	0x7ffffff
 extern bool vmbus_prep_negotiate_resp(struct icmsg_hdr *icmsghdrp, u8 *buf, u32 buflen,
 				const int *fw_version, int fw_vercnt,