diff mbox series

[4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains

Message ID 20240507062138.20465-1-yan.y.zhao@intel.com (mailing list archive)
State New
Headers show
Series Enforce CPU cache flush for non-coherent device assignment | expand

Commit Message

Yan Zhao May 7, 2024, 6:21 a.m. UTC
Flush CPU cache on DMA pages before mapping them into the first
non-coherent domain (domain that does not enforce cache coherency, i.e. CPU
caches are not force-snooped) and after unmapping them from the last
domain.

Devices attached to non-coherent domains can execute non-coherent DMAs
(DMAs that lack CPU cache snooping) to access physical memory with CPU
caches bypassed.

Such a scenario could be exploited by a malicious guest, allowing them to
access stale host data in memory rather than the data initialized by the
host (e.g., zeros) in the cache, thus posing a risk of information leakage
attack.

Furthermore, the host kernel (e.g. a ksm thread) might encounter
inconsistent data between the CPU cache and memory (left by a malicious
guest) after a page is unpinned for DMA but before it's recycled.

Therefore, it is required to flush the CPU cache before a page is
accessible to non-coherent DMAs and after the page is inaccessible to
non-coherent DMAs.

However, the CPU cache is not flushed immediately when the page is unmapped
from the last non-coherent domain. Instead, the flushing is performed
lazily, right before the page is unpinned.
Take the following example to illustrate the process. The CPU cache is
flushed right before step 2 and step 5.
1. A page is mapped into a coherent domain.
2. The page is mapped into a non-coherent domain.
3. The page is unmapped from the non-coherent domain e.g.due to hot-unplug.
4. The page is unmapped from the coherent domain.
5. The page is unpinned.

Reasons for adopting this lazily flushing design include:
- There're several unmap paths and only one unpin path. Lazily flush before
  unpin wipes out the inconsistency between cache and physical memory
  before a page is globally visible and produces code that is simpler, more
  maintainable and easier to backport.
- Avoid dividing a large unmap range into several smaller ones or
  allocating additional memory to hold IOVA to HPA relationship.

Reported-by: Jason Gunthorpe <jgg@nvidia.com>
Closes: https://lore.kernel.org/lkml/20240109002220.GA439767@nvidia.com
Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 drivers/vfio/vfio_iommu_type1.c | 51 +++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

Comments

Alex Williamson May 9, 2024, 6:10 p.m. UTC | #1
On Tue,  7 May 2024 14:21:38 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> Flush CPU cache on DMA pages before mapping them into the first
> non-coherent domain (domain that does not enforce cache coherency, i.e. CPU
> caches are not force-snooped) and after unmapping them from the last
> domain.
> 
> Devices attached to non-coherent domains can execute non-coherent DMAs
> (DMAs that lack CPU cache snooping) to access physical memory with CPU
> caches bypassed.
> 
> Such a scenario could be exploited by a malicious guest, allowing them to
> access stale host data in memory rather than the data initialized by the
> host (e.g., zeros) in the cache, thus posing a risk of information leakage
> attack.
> 
> Furthermore, the host kernel (e.g. a ksm thread) might encounter
> inconsistent data between the CPU cache and memory (left by a malicious
> guest) after a page is unpinned for DMA but before it's recycled.
> 
> Therefore, it is required to flush the CPU cache before a page is
> accessible to non-coherent DMAs and after the page is inaccessible to
> non-coherent DMAs.
> 
> However, the CPU cache is not flushed immediately when the page is unmapped
> from the last non-coherent domain. Instead, the flushing is performed
> lazily, right before the page is unpinned.
> Take the following example to illustrate the process. The CPU cache is
> flushed right before step 2 and step 5.
> 1. A page is mapped into a coherent domain.
> 2. The page is mapped into a non-coherent domain.
> 3. The page is unmapped from the non-coherent domain e.g.due to hot-unplug.
> 4. The page is unmapped from the coherent domain.
> 5. The page is unpinned.
> 
> Reasons for adopting this lazily flushing design include:
> - There're several unmap paths and only one unpin path. Lazily flush before
>   unpin wipes out the inconsistency between cache and physical memory
>   before a page is globally visible and produces code that is simpler, more
>   maintainable and easier to backport.
> - Avoid dividing a large unmap range into several smaller ones or
>   allocating additional memory to hold IOVA to HPA relationship.
> 
> Reported-by: Jason Gunthorpe <jgg@nvidia.com>
> Closes: https://lore.kernel.org/lkml/20240109002220.GA439767@nvidia.com
> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 51 +++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index b5c15fe8f9fc..ce873f4220bf 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -74,6 +74,7 @@ struct vfio_iommu {
>  	bool			v2;
>  	bool			nesting;
>  	bool			dirty_page_tracking;
> +	bool			has_noncoherent_domain;
>  	struct list_head	emulated_iommu_groups;
>  };
>  
> @@ -99,6 +100,7 @@ struct vfio_dma {
>  	unsigned long		*bitmap;
>  	struct mm_struct	*mm;
>  	size_t			locked_vm;
> +	bool			cache_flush_required; /* For noncoherent domain */

Poor packing, minimally this should be grouped with the other bools in
the structure, longer term they should likely all be converted to
bit fields.

>  };
>  
>  struct vfio_batch {
> @@ -716,6 +718,9 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
>  	long unlocked = 0, locked = 0;
>  	long i;
>  
> +	if (dma->cache_flush_required)
> +		arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, npage << PAGE_SHIFT);
> +
>  	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
>  		if (put_pfn(pfn++, dma->prot)) {
>  			unlocked++;
> @@ -1099,6 +1104,8 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
>  					    &iotlb_gather);
>  	}
>  
> +	dma->cache_flush_required = false;
> +
>  	if (do_accounting) {
>  		vfio_lock_acct(dma, -unlocked, true);
>  		return 0;
> @@ -1120,6 +1127,21 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  	iommu->dma_avail++;
>  }
>  
> +static void vfio_update_noncoherent_domain_state(struct vfio_iommu *iommu)
> +{
> +	struct vfio_domain *domain;
> +	bool has_noncoherent = false;
> +
> +	list_for_each_entry(domain, &iommu->domain_list, next) {
> +		if (domain->enforce_cache_coherency)
> +			continue;
> +
> +		has_noncoherent = true;
> +		break;
> +	}
> +	iommu->has_noncoherent_domain = has_noncoherent;
> +}

This should be merged with vfio_domains_have_enforce_cache_coherency()
and the VFIO_DMA_CC_IOMMU extension (if we keep it, see below).

> +
>  static void vfio_update_pgsize_bitmap(struct vfio_iommu *iommu)
>  {
>  	struct vfio_domain *domain;
> @@ -1455,6 +1477,12 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
>  
>  	vfio_batch_init(&batch);
>  
> +	/*
> +	 * Record necessity to flush CPU cache to make sure CPU cache is flushed
> +	 * for both pin & map and unmap & unpin (for unwind) paths.
> +	 */
> +	dma->cache_flush_required = iommu->has_noncoherent_domain;
> +
>  	while (size) {
>  		/* Pin a contiguous chunk of memory */
>  		npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
> @@ -1466,6 +1494,10 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
>  			break;
>  		}
>  
> +		if (dma->cache_flush_required)
> +			arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT,
> +						npage << PAGE_SHIFT);
> +
>  		/* Map it! */
>  		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
>  				     dma->prot);
> @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  	for (; n; n = rb_next(n)) {
>  		struct vfio_dma *dma;
>  		dma_addr_t iova;
> +		bool cache_flush_required;
>  
>  		dma = rb_entry(n, struct vfio_dma, node);
>  		iova = dma->iova;
> +		cache_flush_required = !domain->enforce_cache_coherency &&
> +				       !dma->cache_flush_required;
> +		if (cache_flush_required)
> +			dma->cache_flush_required = true;

The variable name here isn't accurate and the logic is confusing.  If
the domain does not enforce coherency and the mapping is not tagged as
requiring a cache flush, then we need to mark the mapping as requiring
a cache flush.  So the variable state is something more akin to
set_cache_flush_required.  But all we're saving with this is a
redundant set if the mapping is already tagged as requiring a cache
flush, so it could really be simplified to:

		dma->cache_flush_required = !domain->enforce_cache_coherency;

It might add more clarity to just name the mapping flag
dma->mapped_noncoherent.

>  
>  		while (iova < dma->iova + dma->size) {
>  			phys_addr_t phys;
> @@ -1737,6 +1774,9 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  				size = npage << PAGE_SHIFT;
>  			}
>  
> +			if (cache_flush_required)
> +				arch_clean_nonsnoop_dma(phys, size);
> +

I agree with others as well that this arch callback should be named
something relative to the cache-flush/write-back operation that it
actually performs instead of the overall reason for us requiring it.

>  			ret = iommu_map(domain->domain, iova, phys, size,
>  					dma->prot | IOMMU_CACHE,
>  					GFP_KERNEL_ACCOUNT);
> @@ -1801,6 +1841,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  			vfio_unpin_pages_remote(dma, iova, phys >> PAGE_SHIFT,
>  						size >> PAGE_SHIFT, true);
>  		}
> +		dma->cache_flush_required = false;
>  	}
>  
>  	vfio_batch_fini(&batch);
> @@ -1828,6 +1869,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
>  	if (!pages)
>  		return;
>  
> +	if (!domain->enforce_cache_coherency)
> +		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
> +
>  	list_for_each_entry(region, regions, list) {
>  		start = ALIGN(region->start, PAGE_SIZE * 2);
>  		if (start >= region->end || (region->end - start < PAGE_SIZE * 2))
> @@ -1847,6 +1891,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
>  		break;
>  	}
>  
> +	if (!domain->enforce_cache_coherency)
> +		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
> +

Seems like this use case isn't subject to the unmap aspect since these
are kernel allocated and freed pages rather than userspace pages.
There's not an "ongoing use of the page" concern.

The window of opportunity for a device to discover and exploit the
mapping side issue appears almost impossibly small.

>  	__free_pages(pages, order);
>  }
>  
> @@ -2308,6 +2355,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  
>  	list_add(&domain->next, &iommu->domain_list);
>  	vfio_update_pgsize_bitmap(iommu);
> +	if (!domain->enforce_cache_coherency)
> +		vfio_update_noncoherent_domain_state(iommu);

Why isn't this simply:

	if (!domain->enforce_cache_coherency)
		iommu->has_noncoherent_domain = true;

Or maybe:

	if (!domain->enforce_cache_coherency)
		iommu->noncoherent_domains++;

>  done:
>  	/* Delete the old one and insert new iova list */
>  	vfio_iommu_iova_insert_copy(iommu, &iova_copy);
> @@ -2508,6 +2557,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
>  			}
>  			iommu_domain_free(domain->domain);
>  			list_del(&domain->next);
> +			if (!domain->enforce_cache_coherency)
> +				vfio_update_noncoherent_domain_state(iommu);

If we were to just track the number of noncoherent domains, this could
simply be iommu->noncoherent_domains-- and VFIO_DMA_CC_DMA could be:

	return iommu->noncoherent_domains ? 1 : 0;

Maybe there should be wrappers for list_add() and list_del() relative
to the iommu domain list to make it just be a counter.  Thanks,

Alex

>  			kfree(domain);
>  			vfio_iommu_aper_expand(iommu, &iova_copy);
>  			vfio_update_pgsize_bitmap(iommu);
Yan Zhao May 10, 2024, 10:31 a.m. UTC | #2
On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote:
> On Tue,  7 May 2024 14:21:38 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
... 
> >  drivers/vfio/vfio_iommu_type1.c | 51 +++++++++++++++++++++++++++++++++
> >  1 file changed, 51 insertions(+)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index b5c15fe8f9fc..ce873f4220bf 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -74,6 +74,7 @@ struct vfio_iommu {
> >  	bool			v2;
> >  	bool			nesting;
> >  	bool			dirty_page_tracking;
> > +	bool			has_noncoherent_domain;
> >  	struct list_head	emulated_iommu_groups;
> >  };
> >  
> > @@ -99,6 +100,7 @@ struct vfio_dma {
> >  	unsigned long		*bitmap;
> >  	struct mm_struct	*mm;
> >  	size_t			locked_vm;
> > +	bool			cache_flush_required; /* For noncoherent domain */
> 
> Poor packing, minimally this should be grouped with the other bools in
> the structure, longer term they should likely all be converted to
> bit fields.
Yes. Will do!

> 
> >  };
> >  
> >  struct vfio_batch {
> > @@ -716,6 +718,9 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> >  	long unlocked = 0, locked = 0;
> >  	long i;
> >  
> > +	if (dma->cache_flush_required)
> > +		arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, npage << PAGE_SHIFT);
> > +
> >  	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> >  		if (put_pfn(pfn++, dma->prot)) {
> >  			unlocked++;
> > @@ -1099,6 +1104,8 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> >  					    &iotlb_gather);
> >  	}
> >  
> > +	dma->cache_flush_required = false;
> > +
> >  	if (do_accounting) {
> >  		vfio_lock_acct(dma, -unlocked, true);
> >  		return 0;
> > @@ -1120,6 +1127,21 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> >  	iommu->dma_avail++;
> >  }
> >  
> > +static void vfio_update_noncoherent_domain_state(struct vfio_iommu *iommu)
> > +{
> > +	struct vfio_domain *domain;
> > +	bool has_noncoherent = false;
> > +
> > +	list_for_each_entry(domain, &iommu->domain_list, next) {
> > +		if (domain->enforce_cache_coherency)
> > +			continue;
> > +
> > +		has_noncoherent = true;
> > +		break;
> > +	}
> > +	iommu->has_noncoherent_domain = has_noncoherent;
> > +}
> 
> This should be merged with vfio_domains_have_enforce_cache_coherency()
> and the VFIO_DMA_CC_IOMMU extension (if we keep it, see below).
Will convert it to a counter and do the merge.
Thanks for pointing it out!

> 
> > +
> >  static void vfio_update_pgsize_bitmap(struct vfio_iommu *iommu)
> >  {
> >  	struct vfio_domain *domain;
> > @@ -1455,6 +1477,12 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> >  
> >  	vfio_batch_init(&batch);
> >  
> > +	/*
> > +	 * Record necessity to flush CPU cache to make sure CPU cache is flushed
> > +	 * for both pin & map and unmap & unpin (for unwind) paths.
> > +	 */
> > +	dma->cache_flush_required = iommu->has_noncoherent_domain;
> > +
> >  	while (size) {
> >  		/* Pin a contiguous chunk of memory */
> >  		npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
> > @@ -1466,6 +1494,10 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> >  			break;
> >  		}
> >  
> > +		if (dma->cache_flush_required)
> > +			arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT,
> > +						npage << PAGE_SHIFT);
> > +
> >  		/* Map it! */
> >  		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
> >  				     dma->prot);
> > @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> >  	for (; n; n = rb_next(n)) {
> >  		struct vfio_dma *dma;
> >  		dma_addr_t iova;
> > +		bool cache_flush_required;
> >  
> >  		dma = rb_entry(n, struct vfio_dma, node);
> >  		iova = dma->iova;
> > +		cache_flush_required = !domain->enforce_cache_coherency &&
> > +				       !dma->cache_flush_required;
> > +		if (cache_flush_required)
> > +			dma->cache_flush_required = true;
> 
> The variable name here isn't accurate and the logic is confusing.  If
> the domain does not enforce coherency and the mapping is not tagged as
> requiring a cache flush, then we need to mark the mapping as requiring
> a cache flush.  So the variable state is something more akin to
> set_cache_flush_required.  But all we're saving with this is a
> redundant set if the mapping is already tagged as requiring a cache
> flush, so it could really be simplified to:
> 
> 		dma->cache_flush_required = !domain->enforce_cache_coherency;
Sorry about the confusion.

If dma->cache_flush_required is set to true by a domain not enforcing cache
coherency, we hope it will not be reset to false by a later attaching to domain 
enforcing cache coherency due to the lazily flushing design.

> It might add more clarity to just name the mapping flag
> dma->mapped_noncoherent.

The dma->cache_flush_required is to mark whether pages in a vfio_dma requires
cache flush in the subsequence mapping into the first non-coherent domain
and page unpinning.
So, mapped_noncoherent may not be accurate.
Do you think it's better to put a comment for explanation? 

struct vfio_dma {
        ...    
        bool                    iommu_mapped;
        bool                    lock_cap;       /* capable(CAP_IPC_LOCK) */
        bool                    vaddr_invalid;
        /*
         *  Mark whether it is required to flush CPU caches when mapping pages
         *  of the vfio_dma to the first non-coherent domain and when unpinning
         *  pages of the vfio_dma
         */
        bool                    cache_flush_required;
        ...    
};
> 
> >  
> >  		while (iova < dma->iova + dma->size) {
> >  			phys_addr_t phys;
> > @@ -1737,6 +1774,9 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> >  				size = npage << PAGE_SHIFT;
> >  			}
> >  
> > +			if (cache_flush_required)
> > +				arch_clean_nonsnoop_dma(phys, size);
> > +
> 
> I agree with others as well that this arch callback should be named
> something relative to the cache-flush/write-back operation that it
> actually performs instead of the overall reason for us requiring it.
>
Ok. If there are no objections, I'll rename it to arch_flush_cache_phys() as
suggested by Kevin.

> >  			ret = iommu_map(domain->domain, iova, phys, size,
> >  					dma->prot | IOMMU_CACHE,
> >  					GFP_KERNEL_ACCOUNT);
> > @@ -1801,6 +1841,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> >  			vfio_unpin_pages_remote(dma, iova, phys >> PAGE_SHIFT,
> >  						size >> PAGE_SHIFT, true);
> >  		}
> > +		dma->cache_flush_required = false;
> >  	}
> >  
> >  	vfio_batch_fini(&batch);
> > @@ -1828,6 +1869,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
> >  	if (!pages)
> >  		return;
> >  
> > +	if (!domain->enforce_cache_coherency)
> > +		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
> > +
> >  	list_for_each_entry(region, regions, list) {
> >  		start = ALIGN(region->start, PAGE_SIZE * 2);
> >  		if (start >= region->end || (region->end - start < PAGE_SIZE * 2))
> > @@ -1847,6 +1891,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
> >  		break;
> >  	}
> >  
> > +	if (!domain->enforce_cache_coherency)
> > +		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
> > +
> 
> Seems like this use case isn't subject to the unmap aspect since these
> are kernel allocated and freed pages rather than userspace pages.
> There's not an "ongoing use of the page" concern.
> 
> The window of opportunity for a device to discover and exploit the
> mapping side issue appears almost impossibly small.
>
The concern is for a malicious device attempting DMAs automatically.
Do you think this concern is valid?
As there're only extra flushes for 4 pages, what about keeping it for safety?

> >  	__free_pages(pages, order);
> >  }
> >  
> > @@ -2308,6 +2355,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >  
> >  	list_add(&domain->next, &iommu->domain_list);
> >  	vfio_update_pgsize_bitmap(iommu);
> > +	if (!domain->enforce_cache_coherency)
> > +		vfio_update_noncoherent_domain_state(iommu);
> 
> Why isn't this simply:
> 
> 	if (!domain->enforce_cache_coherency)
> 		iommu->has_noncoherent_domain = true;
Yes, it's simpler during attach.

> Or maybe:
> 
> 	if (!domain->enforce_cache_coherency)
> 		iommu->noncoherent_domains++;
Yes, this counter is better.
I previously thought a bool can save some space.

> >  done:
> >  	/* Delete the old one and insert new iova list */
> >  	vfio_iommu_iova_insert_copy(iommu, &iova_copy);
> > @@ -2508,6 +2557,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
> >  			}
> >  			iommu_domain_free(domain->domain);
> >  			list_del(&domain->next);
> > +			if (!domain->enforce_cache_coherency)
> > +				vfio_update_noncoherent_domain_state(iommu);
> 
> If we were to just track the number of noncoherent domains, this could
> simply be iommu->noncoherent_domains-- and VFIO_DMA_CC_DMA could be:
> 
> 	return iommu->noncoherent_domains ? 1 : 0;
> 
> Maybe there should be wrappers for list_add() and list_del() relative
> to the iommu domain list to make it just be a counter.  Thanks,

Do you think we can skip the "iommu->noncoherent_domains--" in
vfio_iommu_type1_release() when iommu is about to be freed.

Asking that is also because it's hard for me to find a good name for the wrapper
around list_del().  :)

It follows vfio_release_domain() in vfio_iommu_type1_release(), but not in
vfio_iommu_type1_detach_group().

> 
> 
> >  			kfree(domain);
> >  			vfio_iommu_aper_expand(iommu, &iova_copy);
> >  			vfio_update_pgsize_bitmap(iommu);
>
Alex Williamson May 10, 2024, 4:57 p.m. UTC | #3
On Fri, 10 May 2024 18:31:13 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote:
> > On Tue,  7 May 2024 14:21:38 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:  
> ... 
> > >  drivers/vfio/vfio_iommu_type1.c | 51 +++++++++++++++++++++++++++++++++
> > >  1 file changed, 51 insertions(+)
> > > 
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > index b5c15fe8f9fc..ce873f4220bf 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -74,6 +74,7 @@ struct vfio_iommu {
> > >  	bool			v2;
> > >  	bool			nesting;
> > >  	bool			dirty_page_tracking;
> > > +	bool			has_noncoherent_domain;
> > >  	struct list_head	emulated_iommu_groups;
> > >  };
> > >  
> > > @@ -99,6 +100,7 @@ struct vfio_dma {
> > >  	unsigned long		*bitmap;
> > >  	struct mm_struct	*mm;
> > >  	size_t			locked_vm;
> > > +	bool			cache_flush_required; /* For noncoherent domain */  
> > 
> > Poor packing, minimally this should be grouped with the other bools in
> > the structure, longer term they should likely all be converted to
> > bit fields.  
> Yes. Will do!
> 
> >   
> > >  };
> > >  
> > >  struct vfio_batch {
> > > @@ -716,6 +718,9 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> > >  	long unlocked = 0, locked = 0;
> > >  	long i;
> > >  
> > > +	if (dma->cache_flush_required)
> > > +		arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, npage << PAGE_SHIFT);
> > > +
> > >  	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> > >  		if (put_pfn(pfn++, dma->prot)) {
> > >  			unlocked++;
> > > @@ -1099,6 +1104,8 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > >  					    &iotlb_gather);
> > >  	}
> > >  
> > > +	dma->cache_flush_required = false;
> > > +
> > >  	if (do_accounting) {
> > >  		vfio_lock_acct(dma, -unlocked, true);
> > >  		return 0;
> > > @@ -1120,6 +1127,21 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> > >  	iommu->dma_avail++;
> > >  }
> > >  
> > > +static void vfio_update_noncoherent_domain_state(struct vfio_iommu *iommu)
> > > +{
> > > +	struct vfio_domain *domain;
> > > +	bool has_noncoherent = false;
> > > +
> > > +	list_for_each_entry(domain, &iommu->domain_list, next) {
> > > +		if (domain->enforce_cache_coherency)
> > > +			continue;
> > > +
> > > +		has_noncoherent = true;
> > > +		break;
> > > +	}
> > > +	iommu->has_noncoherent_domain = has_noncoherent;
> > > +}  
> > 
> > This should be merged with vfio_domains_have_enforce_cache_coherency()
> > and the VFIO_DMA_CC_IOMMU extension (if we keep it, see below).  
> Will convert it to a counter and do the merge.
> Thanks for pointing it out!
> 
> >   
> > > +
> > >  static void vfio_update_pgsize_bitmap(struct vfio_iommu *iommu)
> > >  {
> > >  	struct vfio_domain *domain;
> > > @@ -1455,6 +1477,12 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > >  
> > >  	vfio_batch_init(&batch);
> > >  
> > > +	/*
> > > +	 * Record necessity to flush CPU cache to make sure CPU cache is flushed
> > > +	 * for both pin & map and unmap & unpin (for unwind) paths.
> > > +	 */
> > > +	dma->cache_flush_required = iommu->has_noncoherent_domain;
> > > +
> > >  	while (size) {
> > >  		/* Pin a contiguous chunk of memory */
> > >  		npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
> > > @@ -1466,6 +1494,10 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > >  			break;
> > >  		}
> > >  
> > > +		if (dma->cache_flush_required)
> > > +			arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT,
> > > +						npage << PAGE_SHIFT);
> > > +
> > >  		/* Map it! */
> > >  		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
> > >  				     dma->prot);
> > > @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > >  	for (; n; n = rb_next(n)) {
> > >  		struct vfio_dma *dma;
> > >  		dma_addr_t iova;
> > > +		bool cache_flush_required;
> > >  
> > >  		dma = rb_entry(n, struct vfio_dma, node);
> > >  		iova = dma->iova;
> > > +		cache_flush_required = !domain->enforce_cache_coherency &&
> > > +				       !dma->cache_flush_required;
> > > +		if (cache_flush_required)
> > > +			dma->cache_flush_required = true;  
> > 
> > The variable name here isn't accurate and the logic is confusing.  If
> > the domain does not enforce coherency and the mapping is not tagged as
> > requiring a cache flush, then we need to mark the mapping as requiring
> > a cache flush.  So the variable state is something more akin to
> > set_cache_flush_required.  But all we're saving with this is a
> > redundant set if the mapping is already tagged as requiring a cache
> > flush, so it could really be simplified to:
> > 
> > 		dma->cache_flush_required = !domain->enforce_cache_coherency;  
> Sorry about the confusion.
> 
> If dma->cache_flush_required is set to true by a domain not enforcing cache
> coherency, we hope it will not be reset to false by a later attaching to domain 
> enforcing cache coherency due to the lazily flushing design.

Right, ok, the vfio_dma objects are shared between domains so we never
want to set 'dma->cache_flush_required = false' due to the addition of a
'domain->enforce_cache_coherent == true'.  So this could be:

	if (!dma->cache_flush_required)
		dma->cache_flush_required = !domain->enforce_cache_coherency;

> > It might add more clarity to just name the mapping flag
> > dma->mapped_noncoherent.  
> 
> The dma->cache_flush_required is to mark whether pages in a vfio_dma requires
> cache flush in the subsequence mapping into the first non-coherent domain
> and page unpinning.

How do we arrive at a sequence where we have dma->cache_flush_required
that isn't the result of being mapped into a domain with
!domain->enforce_cache_coherency?

It seems to me that we only get 'dma->cache_flush_required == true' as
a result of being mapped into a 'domain->enforce_cache_coherency ==
false' domain.  In that case the flush-on-map is handled at the time
we're setting dma->cache_flush_required and what we're actually
tracking with the flag is that the dma object has been mapped into a
noncoherent domain.

> So, mapped_noncoherent may not be accurate.
> Do you think it's better to put a comment for explanation? 
> 
> struct vfio_dma {
>         ...    
>         bool                    iommu_mapped;
>         bool                    lock_cap;       /* capable(CAP_IPC_LOCK) */
>         bool                    vaddr_invalid;
>         /*
>          *  Mark whether it is required to flush CPU caches when mapping pages
>          *  of the vfio_dma to the first non-coherent domain and when unpinning
>          *  pages of the vfio_dma
>          */
>         bool                    cache_flush_required;
>         ...    
> };
> >   
> > >  
> > >  		while (iova < dma->iova + dma->size) {
> > >  			phys_addr_t phys;
> > > @@ -1737,6 +1774,9 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > >  				size = npage << PAGE_SHIFT;
> > >  			}
> > >  
> > > +			if (cache_flush_required)
> > > +				arch_clean_nonsnoop_dma(phys, size);
> > > +  
> > 
> > I agree with others as well that this arch callback should be named
> > something relative to the cache-flush/write-back operation that it
> > actually performs instead of the overall reason for us requiring it.
> >  
> Ok. If there are no objections, I'll rename it to arch_flush_cache_phys() as
> suggested by Kevin.

Yes, better.

> > >  			ret = iommu_map(domain->domain, iova, phys, size,
> > >  					dma->prot | IOMMU_CACHE,
> > >  					GFP_KERNEL_ACCOUNT);
> > > @@ -1801,6 +1841,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > >  			vfio_unpin_pages_remote(dma, iova, phys >> PAGE_SHIFT,
> > >  						size >> PAGE_SHIFT, true);
> > >  		}
> > > +		dma->cache_flush_required = false;
> > >  	}
> > >  
> > >  	vfio_batch_fini(&batch);
> > > @@ -1828,6 +1869,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
> > >  	if (!pages)
> > >  		return;
> > >  
> > > +	if (!domain->enforce_cache_coherency)
> > > +		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
> > > +
> > >  	list_for_each_entry(region, regions, list) {
> > >  		start = ALIGN(region->start, PAGE_SIZE * 2);
> > >  		if (start >= region->end || (region->end - start < PAGE_SIZE * 2))
> > > @@ -1847,6 +1891,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
> > >  		break;
> > >  	}
> > >  
> > > +	if (!domain->enforce_cache_coherency)
> > > +		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
> > > +  
> > 
> > Seems like this use case isn't subject to the unmap aspect since these
> > are kernel allocated and freed pages rather than userspace pages.
> > There's not an "ongoing use of the page" concern.
> > 
> > The window of opportunity for a device to discover and exploit the
> > mapping side issue appears almost impossibly small.
> >  
> The concern is for a malicious device attempting DMAs automatically.
> Do you think this concern is valid?
> As there're only extra flushes for 4 pages, what about keeping it for safety?

Userspace doesn't know anything about these mappings, so to exploit
them the device would somehow need to discover and interact with the
mapping in the split second that the mapping exists, without exposing
itself with mapping faults at the IOMMU.

I don't mind keeping the flush before map so that infinitesimal gap
where previous data in physical memory exposed to the device is closed,
but I have a much harder time seeing that the flush on unmap to
synchronize physical memory is required.

For example, the potential KSM use case doesn't exist since the pages
are not owned by the user.  Any subsequent use of the pages would be
subject to the same condition we assumed after allocation, where the
physical data may be inconsistent with the cached data.  It's easy to
flush 2 pages, but I think it obscures the function of the flush if we
can't articulate the value in this case.


> > >  	__free_pages(pages, order);
> > >  }
> > >  
> > > @@ -2308,6 +2355,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> > >  
> > >  	list_add(&domain->next, &iommu->domain_list);
> > >  	vfio_update_pgsize_bitmap(iommu);
> > > +	if (!domain->enforce_cache_coherency)
> > > +		vfio_update_noncoherent_domain_state(iommu);  
> > 
> > Why isn't this simply:
> > 
> > 	if (!domain->enforce_cache_coherency)
> > 		iommu->has_noncoherent_domain = true;  
> Yes, it's simpler during attach.
> 
> > Or maybe:
> > 
> > 	if (!domain->enforce_cache_coherency)
> > 		iommu->noncoherent_domains++;  
> Yes, this counter is better.
> I previously thought a bool can save some space.
> 
> > >  done:
> > >  	/* Delete the old one and insert new iova list */
> > >  	vfio_iommu_iova_insert_copy(iommu, &iova_copy);
> > > @@ -2508,6 +2557,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
> > >  			}
> > >  			iommu_domain_free(domain->domain);
> > >  			list_del(&domain->next);
> > > +			if (!domain->enforce_cache_coherency)
> > > +				vfio_update_noncoherent_domain_state(iommu);  
> > 
> > If we were to just track the number of noncoherent domains, this could
> > simply be iommu->noncoherent_domains-- and VFIO_DMA_CC_DMA could be:
> > 
> > 	return iommu->noncoherent_domains ? 1 : 0;
> > 
> > Maybe there should be wrappers for list_add() and list_del() relative
> > to the iommu domain list to make it just be a counter.  Thanks,  
> 
> Do you think we can skip the "iommu->noncoherent_domains--" in
> vfio_iommu_type1_release() when iommu is about to be freed.
> 
> Asking that is also because it's hard for me to find a good name for the wrapper
> around list_del().  :)

vfio_iommu_link_domain(), vfio_iommu_unlink_domain()?

> 
> It follows vfio_release_domain() in vfio_iommu_type1_release(), but not in
> vfio_iommu_type1_detach_group().

I'm not sure I understand the concern here, detach_group is performed
under the iommu->lock where the value of iommu->noncohernet_domains is
only guaranteed while this lock is held.  In the release callback the
iommu->lock is not held, but we have no external users at this point.
It's not strictly required that we decrement each domain, but it's also
not a bad sanity test that iommu->noncoherent_domains should be zero
after unlinking the domains.  Thanks,

Alex
 
> > >  			kfree(domain);
> > >  			vfio_iommu_aper_expand(iommu,
> > > &iova_copy); vfio_update_pgsize_bitmap(iommu);  
> >   
>
Yan Zhao May 13, 2024, 7:11 a.m. UTC | #4
On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote:
> On Fri, 10 May 2024 18:31:13 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote:
> > > On Tue,  7 May 2024 14:21:38 +0800
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:  
> > ... 
> > > >  drivers/vfio/vfio_iommu_type1.c | 51 +++++++++++++++++++++++++++++++++
> > > >  1 file changed, 51 insertions(+)
> > > > 
> > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > index b5c15fe8f9fc..ce873f4220bf 100644
> > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > @@ -74,6 +74,7 @@ struct vfio_iommu {
> > > >  	bool			v2;
> > > >  	bool			nesting;
> > > >  	bool			dirty_page_tracking;
> > > > +	bool			has_noncoherent_domain;
> > > >  	struct list_head	emulated_iommu_groups;
> > > >  };
> > > >  
> > > > @@ -99,6 +100,7 @@ struct vfio_dma {
> > > >  	unsigned long		*bitmap;
> > > >  	struct mm_struct	*mm;
> > > >  	size_t			locked_vm;
> > > > +	bool			cache_flush_required; /* For noncoherent domain */  
> > > 
> > > Poor packing, minimally this should be grouped with the other bools in
> > > the structure, longer term they should likely all be converted to
> > > bit fields.  
> > Yes. Will do!
> > 
> > >   
> > > >  };
> > > >  
> > > >  struct vfio_batch {
> > > > @@ -716,6 +718,9 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> > > >  	long unlocked = 0, locked = 0;
> > > >  	long i;
> > > >  
> > > > +	if (dma->cache_flush_required)
> > > > +		arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, npage << PAGE_SHIFT);
> > > > +
> > > >  	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> > > >  		if (put_pfn(pfn++, dma->prot)) {
> > > >  			unlocked++;
> > > > @@ -1099,6 +1104,8 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > > >  					    &iotlb_gather);
> > > >  	}
> > > >  
> > > > +	dma->cache_flush_required = false;
> > > > +
> > > >  	if (do_accounting) {
> > > >  		vfio_lock_acct(dma, -unlocked, true);
> > > >  		return 0;
> > > > @@ -1120,6 +1127,21 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> > > >  	iommu->dma_avail++;
> > > >  }
> > > >  
> > > > +static void vfio_update_noncoherent_domain_state(struct vfio_iommu *iommu)
> > > > +{
> > > > +	struct vfio_domain *domain;
> > > > +	bool has_noncoherent = false;
> > > > +
> > > > +	list_for_each_entry(domain, &iommu->domain_list, next) {
> > > > +		if (domain->enforce_cache_coherency)
> > > > +			continue;
> > > > +
> > > > +		has_noncoherent = true;
> > > > +		break;
> > > > +	}
> > > > +	iommu->has_noncoherent_domain = has_noncoherent;
> > > > +}  
> > > 
> > > This should be merged with vfio_domains_have_enforce_cache_coherency()
> > > and the VFIO_DMA_CC_IOMMU extension (if we keep it, see below).  
> > Will convert it to a counter and do the merge.
> > Thanks for pointing it out!
> > 
> > >   
> > > > +
> > > >  static void vfio_update_pgsize_bitmap(struct vfio_iommu *iommu)
> > > >  {
> > > >  	struct vfio_domain *domain;
> > > > @@ -1455,6 +1477,12 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > > >  
> > > >  	vfio_batch_init(&batch);
> > > >  
> > > > +	/*
> > > > +	 * Record necessity to flush CPU cache to make sure CPU cache is flushed
> > > > +	 * for both pin & map and unmap & unpin (for unwind) paths.
> > > > +	 */
> > > > +	dma->cache_flush_required = iommu->has_noncoherent_domain;
> > > > +
> > > >  	while (size) {
> > > >  		/* Pin a contiguous chunk of memory */
> > > >  		npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
> > > > @@ -1466,6 +1494,10 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > > >  			break;
> > > >  		}
> > > >  
> > > > +		if (dma->cache_flush_required)
> > > > +			arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT,
> > > > +						npage << PAGE_SHIFT);
> > > > +
> > > >  		/* Map it! */
> > > >  		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
> > > >  				     dma->prot);
> > > > @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > > >  	for (; n; n = rb_next(n)) {
> > > >  		struct vfio_dma *dma;
> > > >  		dma_addr_t iova;
> > > > +		bool cache_flush_required;
> > > >  
> > > >  		dma = rb_entry(n, struct vfio_dma, node);
> > > >  		iova = dma->iova;
> > > > +		cache_flush_required = !domain->enforce_cache_coherency &&
> > > > +				       !dma->cache_flush_required;
> > > > +		if (cache_flush_required)
> > > > +			dma->cache_flush_required = true;  
> > > 
> > > The variable name here isn't accurate and the logic is confusing.  If
> > > the domain does not enforce coherency and the mapping is not tagged as
> > > requiring a cache flush, then we need to mark the mapping as requiring
> > > a cache flush.  So the variable state is something more akin to
> > > set_cache_flush_required.  But all we're saving with this is a
> > > redundant set if the mapping is already tagged as requiring a cache
> > > flush, so it could really be simplified to:
> > > 
> > > 		dma->cache_flush_required = !domain->enforce_cache_coherency;  
> > Sorry about the confusion.
> > 
> > If dma->cache_flush_required is set to true by a domain not enforcing cache
> > coherency, we hope it will not be reset to false by a later attaching to domain 
> > enforcing cache coherency due to the lazily flushing design.
> 
> Right, ok, the vfio_dma objects are shared between domains so we never
> want to set 'dma->cache_flush_required = false' due to the addition of a
> 'domain->enforce_cache_coherent == true'.  So this could be:
> 
> 	if (!dma->cache_flush_required)
> 		dma->cache_flush_required = !domain->enforce_cache_coherency;

Though this code is easier for understanding, it leads to unnecessary setting of
dma->cache_flush_required to false, given domain->enforce_cache_coherency is
true at the most time.

> > > It might add more clarity to just name the mapping flag
> > > dma->mapped_noncoherent.  
> > 
> > The dma->cache_flush_required is to mark whether pages in a vfio_dma requires
> > cache flush in the subsequence mapping into the first non-coherent domain
> > and page unpinning.
> 
> How do we arrive at a sequence where we have dma->cache_flush_required
> that isn't the result of being mapped into a domain with
> !domain->enforce_cache_coherency?
Hmm, dma->cache_flush_required IS the result of being mapped into a domain with
!domain->enforce_cache_coherency.
My concern only arrives from the actual code sequence, i.e.
dma->cache_flush_required is set to true before the actual mapping.

If we rename it to dma->mapped_noncoherent and only set it to true after the
actual successful mapping, it would lead to more code to handle flushing for the
unwind case.
Currently, flush for unwind is handled centrally in vfio_unpin_pages_remote()
by checking dma->cache_flush_required, which is true even before a full
successful mapping, so we won't miss flush on any pages that are mapped into a
non-coherent domain in a short window.

> 
> It seems to me that we only get 'dma->cache_flush_required == true' as
> a result of being mapped into a 'domain->enforce_cache_coherency ==
> false' domain.  In that case the flush-on-map is handled at the time
> we're setting dma->cache_flush_required and what we're actually
> tracking with the flag is that the dma object has been mapped into a
> noncoherent domain.
> 
> > So, mapped_noncoherent may not be accurate.
> > Do you think it's better to put a comment for explanation? 
> > 
> > struct vfio_dma {
> >         ...    
> >         bool                    iommu_mapped;
> >         bool                    lock_cap;       /* capable(CAP_IPC_LOCK) */
> >         bool                    vaddr_invalid;
> >         /*
> >          *  Mark whether it is required to flush CPU caches when mapping pages
> >          *  of the vfio_dma to the first non-coherent domain and when unpinning
> >          *  pages of the vfio_dma
> >          */
> >         bool                    cache_flush_required;
> >         ...    
> > };
> > >   
> > > >  
> > > >  		while (iova < dma->iova + dma->size) {
> > > >  			phys_addr_t phys;
> > > > @@ -1737,6 +1774,9 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > > >  				size = npage << PAGE_SHIFT;
> > > >  			}
> > > >  
> > > > +			if (cache_flush_required)
> > > > +				arch_clean_nonsnoop_dma(phys, size);
> > > > +  
> > > 
> > > I agree with others as well that this arch callback should be named
> > > something relative to the cache-flush/write-back operation that it
> > > actually performs instead of the overall reason for us requiring it.
> > >  
> > Ok. If there are no objections, I'll rename it to arch_flush_cache_phys() as
> > suggested by Kevin.
> 
> Yes, better.
> 
> > > >  			ret = iommu_map(domain->domain, iova, phys, size,
> > > >  					dma->prot | IOMMU_CACHE,
> > > >  					GFP_KERNEL_ACCOUNT);
> > > > @@ -1801,6 +1841,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > > >  			vfio_unpin_pages_remote(dma, iova, phys >> PAGE_SHIFT,
> > > >  						size >> PAGE_SHIFT, true);
> > > >  		}
> > > > +		dma->cache_flush_required = false;
> > > >  	}
> > > >  
> > > >  	vfio_batch_fini(&batch);
> > > > @@ -1828,6 +1869,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
> > > >  	if (!pages)
> > > >  		return;
> > > >  
> > > > +	if (!domain->enforce_cache_coherency)
> > > > +		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
> > > > +
> > > >  	list_for_each_entry(region, regions, list) {
> > > >  		start = ALIGN(region->start, PAGE_SIZE * 2);
> > > >  		if (start >= region->end || (region->end - start < PAGE_SIZE * 2))
> > > > @@ -1847,6 +1891,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
> > > >  		break;
> > > >  	}
> > > >  
> > > > +	if (!domain->enforce_cache_coherency)
> > > > +		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
> > > > +  
> > > 
> > > Seems like this use case isn't subject to the unmap aspect since these
> > > are kernel allocated and freed pages rather than userspace pages.
> > > There's not an "ongoing use of the page" concern.
> > > 
> > > The window of opportunity for a device to discover and exploit the
> > > mapping side issue appears almost impossibly small.
> > >  
> > The concern is for a malicious device attempting DMAs automatically.
> > Do you think this concern is valid?
> > As there're only extra flushes for 4 pages, what about keeping it for safety?
> 
> Userspace doesn't know anything about these mappings, so to exploit
> them the device would somehow need to discover and interact with the
> mapping in the split second that the mapping exists, without exposing
> itself with mapping faults at the IOMMU.
> 
> I don't mind keeping the flush before map so that infinitesimal gap
> where previous data in physical memory exposed to the device is closed,
> but I have a much harder time seeing that the flush on unmap to
> synchronize physical memory is required.
> 
> For example, the potential KSM use case doesn't exist since the pages
> are not owned by the user.  Any subsequent use of the pages would be
> subject to the same condition we assumed after allocation, where the
> physical data may be inconsistent with the cached data.  It's easy to
> flush 2 pages, but I think it obscures the function of the flush if we
> can't articulate the value in this case.
>
I agree the second flush is not necessary if we are confident that functions in
between the two flushes do not and will not touch the page in CPU side.
However, can we guarantee this? For instance, is it possible for some IOMMU
driver to read/write the page for some quirks? (Or is it just a totally
paranoid?)
If that's not impossible, then ensuring cache and memory coherency before
page reclaiming is better?

> 
> > > >  	__free_pages(pages, order);
> > > >  }
> > > >  
> > > > @@ -2308,6 +2355,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> > > >  
> > > >  	list_add(&domain->next, &iommu->domain_list);
> > > >  	vfio_update_pgsize_bitmap(iommu);
> > > > +	if (!domain->enforce_cache_coherency)
> > > > +		vfio_update_noncoherent_domain_state(iommu);  
> > > 
> > > Why isn't this simply:
> > > 
> > > 	if (!domain->enforce_cache_coherency)
> > > 		iommu->has_noncoherent_domain = true;  
> > Yes, it's simpler during attach.
> > 
> > > Or maybe:
> > > 
> > > 	if (!domain->enforce_cache_coherency)
> > > 		iommu->noncoherent_domains++;  
> > Yes, this counter is better.
> > I previously thought a bool can save some space.
> > 
> > > >  done:
> > > >  	/* Delete the old one and insert new iova list */
> > > >  	vfio_iommu_iova_insert_copy(iommu, &iova_copy);
> > > > @@ -2508,6 +2557,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
> > > >  			}
> > > >  			iommu_domain_free(domain->domain);
> > > >  			list_del(&domain->next);
> > > > +			if (!domain->enforce_cache_coherency)
> > > > +				vfio_update_noncoherent_domain_state(iommu);  
> > > 
> > > If we were to just track the number of noncoherent domains, this could
> > > simply be iommu->noncoherent_domains-- and VFIO_DMA_CC_DMA could be:
> > > 
> > > 	return iommu->noncoherent_domains ? 1 : 0;
> > > 
> > > Maybe there should be wrappers for list_add() and list_del() relative
> > > to the iommu domain list to make it just be a counter.  Thanks,  
> > 
> > Do you think we can skip the "iommu->noncoherent_domains--" in
> > vfio_iommu_type1_release() when iommu is about to be freed.
> > 
> > Asking that is also because it's hard for me to find a good name for the wrapper
> > around list_del().  :)
> 
> vfio_iommu_link_domain(), vfio_iommu_unlink_domain()?

Ah, this is a good name!

> > 
> > It follows vfio_release_domain() in vfio_iommu_type1_release(), but not in
> > vfio_iommu_type1_detach_group().
> 
> I'm not sure I understand the concern here, detach_group is performed
> under the iommu->lock where the value of iommu->noncohernet_domains is
> only guaranteed while this lock is held.  In the release callback the
> iommu->lock is not held, but we have no external users at this point.
> It's not strictly required that we decrement each domain, but it's also
> not a bad sanity test that iommu->noncoherent_domains should be zero
> after unlinking the domains.  Thanks,
I previously thought I couldn't find a name for a domain operation that's
called after vfio_release_domain(), and I couldn't merge list_del() into
vfio_release_domain() given it's not in vfio_iommu_type1_detach_group().

But vfio_iommu_unlink_domain() is a good one.
I'll rename list_del() to vfio_iommu_unlink_domain().

Thanks!
Yan
Tian, Kevin May 16, 2024, 7:53 a.m. UTC | #5
> From: Zhao, Yan Y <yan.y.zhao@intel.com>
> Sent: Monday, May 13, 2024 3:11 PM
> On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote:
> > On Fri, 10 May 2024 18:31:13 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > The dma->cache_flush_required is to mark whether pages in a vfio_dma
> requires
> > > cache flush in the subsequence mapping into the first non-coherent
> domain
> > > and page unpinning.
> >
> > How do we arrive at a sequence where we have dma-
> >cache_flush_required
> > that isn't the result of being mapped into a domain with
> > !domain->enforce_cache_coherency?
> Hmm, dma->cache_flush_required IS the result of being mapped into a
> domain with
> !domain->enforce_cache_coherency.
> My concern only arrives from the actual code sequence, i.e.
> dma->cache_flush_required is set to true before the actual mapping.
> 
> If we rename it to dma->mapped_noncoherent and only set it to true after
> the
> actual successful mapping, it would lead to more code to handle flushing for
> the
> unwind case.
> Currently, flush for unwind is handled centrally in vfio_unpin_pages_remote()
> by checking dma->cache_flush_required, which is true even before a full
> successful mapping, so we won't miss flush on any pages that are mapped
> into a
> non-coherent domain in a short window.
> 

What about storing a vfio_iommu pointer in vfio_dma? Or pass an extra
parameter to vfio_unpin_pages_remote()...
Tian, Kevin May 16, 2024, 8:34 a.m. UTC | #6
> From: Zhao, Yan Y <yan.y.zhao@intel.com>
> Sent: Monday, May 13, 2024 3:11 PM
> On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote:
> > On Fri, 10 May 2024 18:31:13 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > > On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote:
> > > > On Tue,  7 May 2024 14:21:38 +0800
> > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > @@ -1847,6 +1891,9 @@ static void vfio_test_domain_fgsp(struct
> vfio_domain *domain, struct list_head *
> > > > >  		break;
> > > > >  	}
> > > > >
> > > > > +	if (!domain->enforce_cache_coherency)
> > > > > +		arch_clean_nonsnoop_dma(page_to_phys(pages),
> PAGE_SIZE * 2);
> > > > > +
> > > >
> > > > Seems like this use case isn't subject to the unmap aspect since these
> > > > are kernel allocated and freed pages rather than userspace pages.
> > > > There's not an "ongoing use of the page" concern.
> > > >
> > > > The window of opportunity for a device to discover and exploit the
> > > > mapping side issue appears almost impossibly small.
> > > >
> > > The concern is for a malicious device attempting DMAs automatically.
> > > Do you think this concern is valid?
> > > As there're only extra flushes for 4 pages, what about keeping it for safety?
> >
> > Userspace doesn't know anything about these mappings, so to exploit
> > them the device would somehow need to discover and interact with the
> > mapping in the split second that the mapping exists, without exposing
> > itself with mapping faults at the IOMMU.

Userspace could guess the attacking ranges based on code, e.g. currently
the code just tries to use the 1st available IOVA region which likely starts
at address 0.

and mapping faults don't stop the attack. Just some after-the-fact hint
revealing the possibility of being attacked. 
Alex Williamson May 16, 2024, 8:31 p.m. UTC | #7
On Thu, 16 May 2024 08:34:20 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Zhao, Yan Y <yan.y.zhao@intel.com>
> > Sent: Monday, May 13, 2024 3:11 PM
> > On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote:  
> > > On Fri, 10 May 2024 18:31:13 +0800
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >  
> > > > On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote:  
> > > > > On Tue,  7 May 2024 14:21:38 +0800
> > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:  
> > > > > > @@ -1847,6 +1891,9 @@ static void vfio_test_domain_fgsp(struct  
> > vfio_domain *domain, struct list_head *  
> > > > > >  		break;
> > > > > >  	}
> > > > > >
> > > > > > +	if (!domain->enforce_cache_coherency)
> > > > > > +		arch_clean_nonsnoop_dma(page_to_phys(pages),  
> > PAGE_SIZE * 2);  
> > > > > > +  
> > > > >
> > > > > Seems like this use case isn't subject to the unmap aspect since these
> > > > > are kernel allocated and freed pages rather than userspace pages.
> > > > > There's not an "ongoing use of the page" concern.
> > > > >
> > > > > The window of opportunity for a device to discover and exploit the
> > > > > mapping side issue appears almost impossibly small.
> > > > >  
> > > > The concern is for a malicious device attempting DMAs automatically.
> > > > Do you think this concern is valid?
> > > > As there're only extra flushes for 4 pages, what about keeping it for safety?  
> > >
> > > Userspace doesn't know anything about these mappings, so to exploit
> > > them the device would somehow need to discover and interact with the
> > > mapping in the split second that the mapping exists, without exposing
> > > itself with mapping faults at the IOMMU.  
> 
> Userspace could guess the attacking ranges based on code, e.g. currently
> the code just tries to use the 1st available IOVA region which likely starts
> at address 0.
> 
> and mapping faults don't stop the attack. Just some after-the-fact hint
> revealing the possibility of being attacked. 
Alex Williamson May 16, 2024, 8:50 p.m. UTC | #8
On Mon, 13 May 2024 15:11:28 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote:
> > On Fri, 10 May 2024 18:31:13 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote:  
> > > > On Tue,  7 May 2024 14:21:38 +0800
> > > > Yan Zhao <yan.y.zhao@intel.com> wrote:    
> > > ...   
> > > > >  drivers/vfio/vfio_iommu_type1.c | 51 +++++++++++++++++++++++++++++++++
> > > > >  1 file changed, 51 insertions(+)
> > > > > 
> > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > > index b5c15fe8f9fc..ce873f4220bf 100644
> > > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > > @@ -74,6 +74,7 @@ struct vfio_iommu {
> > > > >  	bool			v2;
> > > > >  	bool			nesting;
> > > > >  	bool			dirty_page_tracking;
> > > > > +	bool			has_noncoherent_domain;
> > > > >  	struct list_head	emulated_iommu_groups;
> > > > >  };
> > > > >  
> > > > > @@ -99,6 +100,7 @@ struct vfio_dma {
> > > > >  	unsigned long		*bitmap;
> > > > >  	struct mm_struct	*mm;
> > > > >  	size_t			locked_vm;
> > > > > +	bool			cache_flush_required; /* For noncoherent domain */    
> > > > 
> > > > Poor packing, minimally this should be grouped with the other bools in
> > > > the structure, longer term they should likely all be converted to
> > > > bit fields.    
> > > Yes. Will do!
> > >   
> > > >     
> > > > >  };
> > > > >  
> > > > >  struct vfio_batch {
> > > > > @@ -716,6 +718,9 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> > > > >  	long unlocked = 0, locked = 0;
> > > > >  	long i;
> > > > >  
> > > > > +	if (dma->cache_flush_required)
> > > > > +		arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, npage << PAGE_SHIFT);
> > > > > +
> > > > >  	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> > > > >  		if (put_pfn(pfn++, dma->prot)) {
> > > > >  			unlocked++;
> > > > > @@ -1099,6 +1104,8 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > > > >  					    &iotlb_gather);
> > > > >  	}
> > > > >  
> > > > > +	dma->cache_flush_required = false;
> > > > > +
> > > > >  	if (do_accounting) {
> > > > >  		vfio_lock_acct(dma, -unlocked, true);
> > > > >  		return 0;
> > > > > @@ -1120,6 +1127,21 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> > > > >  	iommu->dma_avail++;
> > > > >  }
> > > > >  
> > > > > +static void vfio_update_noncoherent_domain_state(struct vfio_iommu *iommu)
> > > > > +{
> > > > > +	struct vfio_domain *domain;
> > > > > +	bool has_noncoherent = false;
> > > > > +
> > > > > +	list_for_each_entry(domain, &iommu->domain_list, next) {
> > > > > +		if (domain->enforce_cache_coherency)
> > > > > +			continue;
> > > > > +
> > > > > +		has_noncoherent = true;
> > > > > +		break;
> > > > > +	}
> > > > > +	iommu->has_noncoherent_domain = has_noncoherent;
> > > > > +}    
> > > > 
> > > > This should be merged with vfio_domains_have_enforce_cache_coherency()
> > > > and the VFIO_DMA_CC_IOMMU extension (if we keep it, see below).    
> > > Will convert it to a counter and do the merge.
> > > Thanks for pointing it out!
> > >   
> > > >     
> > > > > +
> > > > >  static void vfio_update_pgsize_bitmap(struct vfio_iommu *iommu)
> > > > >  {
> > > > >  	struct vfio_domain *domain;
> > > > > @@ -1455,6 +1477,12 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > > > >  
> > > > >  	vfio_batch_init(&batch);
> > > > >  
> > > > > +	/*
> > > > > +	 * Record necessity to flush CPU cache to make sure CPU cache is flushed
> > > > > +	 * for both pin & map and unmap & unpin (for unwind) paths.
> > > > > +	 */
> > > > > +	dma->cache_flush_required = iommu->has_noncoherent_domain;
> > > > > +
> > > > >  	while (size) {
> > > > >  		/* Pin a contiguous chunk of memory */
> > > > >  		npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
> > > > > @@ -1466,6 +1494,10 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > > > >  			break;
> > > > >  		}
> > > > >  
> > > > > +		if (dma->cache_flush_required)
> > > > > +			arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT,
> > > > > +						npage << PAGE_SHIFT);
> > > > > +
> > > > >  		/* Map it! */
> > > > >  		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
> > > > >  				     dma->prot);
> > > > > @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > > > >  	for (; n; n = rb_next(n)) {
> > > > >  		struct vfio_dma *dma;
> > > > >  		dma_addr_t iova;
> > > > > +		bool cache_flush_required;
> > > > >  
> > > > >  		dma = rb_entry(n, struct vfio_dma, node);
> > > > >  		iova = dma->iova;
> > > > > +		cache_flush_required = !domain->enforce_cache_coherency &&
> > > > > +				       !dma->cache_flush_required;
> > > > > +		if (cache_flush_required)
> > > > > +			dma->cache_flush_required = true;    
> > > > 
> > > > The variable name here isn't accurate and the logic is confusing.  If
> > > > the domain does not enforce coherency and the mapping is not tagged as
> > > > requiring a cache flush, then we need to mark the mapping as requiring
> > > > a cache flush.  So the variable state is something more akin to
> > > > set_cache_flush_required.  But all we're saving with this is a
> > > > redundant set if the mapping is already tagged as requiring a cache
> > > > flush, so it could really be simplified to:
> > > > 
> > > > 		dma->cache_flush_required = !domain->enforce_cache_coherency;    
> > > Sorry about the confusion.
> > > 
> > > If dma->cache_flush_required is set to true by a domain not enforcing cache
> > > coherency, we hope it will not be reset to false by a later attaching to domain 
> > > enforcing cache coherency due to the lazily flushing design.  
> > 
> > Right, ok, the vfio_dma objects are shared between domains so we never
> > want to set 'dma->cache_flush_required = false' due to the addition of a
> > 'domain->enforce_cache_coherent == true'.  So this could be:
> > 
> > 	if (!dma->cache_flush_required)
> > 		dma->cache_flush_required = !domain->enforce_cache_coherency;  
> 
> Though this code is easier for understanding, it leads to unnecessary setting of
> dma->cache_flush_required to false, given domain->enforce_cache_coherency is
> true at the most time.

I don't really see that as an issue, but the variable name originally
chosen above, cache_flush_required, also doesn't convey that it's only
attempting to set the value if it wasn't previously set and is now
required by a noncoherent domain.

> > > > It might add more clarity to just name the mapping flag
> > > > dma->mapped_noncoherent.    
> > > 
> > > The dma->cache_flush_required is to mark whether pages in a vfio_dma requires
> > > cache flush in the subsequence mapping into the first non-coherent domain
> > > and page unpinning.  
> > 
> > How do we arrive at a sequence where we have dma->cache_flush_required
> > that isn't the result of being mapped into a domain with
> > !domain->enforce_cache_coherency?  
> Hmm, dma->cache_flush_required IS the result of being mapped into a domain with
> !domain->enforce_cache_coherency.
> My concern only arrives from the actual code sequence, i.e.
> dma->cache_flush_required is set to true before the actual mapping.
> 
> If we rename it to dma->mapped_noncoherent and only set it to true after the
> actual successful mapping, it would lead to more code to handle flushing for the
> unwind case.
> Currently, flush for unwind is handled centrally in vfio_unpin_pages_remote()
> by checking dma->cache_flush_required, which is true even before a full
> successful mapping, so we won't miss flush on any pages that are mapped into a
> non-coherent domain in a short window.

I don't think we need to be so literal that "mapped_noncoherent" can
only be set after the vfio_dma is fully mapped to a noncoherent domain,
but also we can come up with other names for the flag.  Perhaps
"is_noncoherent".  My suggestion was more from the perspective of what
does the flag represent rather than what we intend to do as a result of
the flag being set.  Thanks,

Alex
Yan Zhao May 17, 2024, 3:11 a.m. UTC | #9
On Thu, May 16, 2024 at 02:50:09PM -0600, Alex Williamson wrote:
> On Mon, 13 May 2024 15:11:28 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote:
> > > On Fri, 10 May 2024 18:31:13 +0800
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >   
> > > > On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote:  
> > > > > On Tue,  7 May 2024 14:21:38 +0800
> > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:    
...   
> > > > > > @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > > > > >  	for (; n; n = rb_next(n)) {
> > > > > >  		struct vfio_dma *dma;
> > > > > >  		dma_addr_t iova;
> > > > > > +		bool cache_flush_required;
> > > > > >  
> > > > > >  		dma = rb_entry(n, struct vfio_dma, node);
> > > > > >  		iova = dma->iova;
> > > > > > +		cache_flush_required = !domain->enforce_cache_coherency &&
> > > > > > +				       !dma->cache_flush_required;
> > > > > > +		if (cache_flush_required)
> > > > > > +			dma->cache_flush_required = true;    
> > > > > 
> > > > > The variable name here isn't accurate and the logic is confusing.  If
> > > > > the domain does not enforce coherency and the mapping is not tagged as
> > > > > requiring a cache flush, then we need to mark the mapping as requiring
> > > > > a cache flush.  So the variable state is something more akin to
> > > > > set_cache_flush_required.  But all we're saving with this is a
> > > > > redundant set if the mapping is already tagged as requiring a cache
> > > > > flush, so it could really be simplified to:
> > > > > 
> > > > > 		dma->cache_flush_required = !domain->enforce_cache_coherency;    
> > > > Sorry about the confusion.
> > > > 
> > > > If dma->cache_flush_required is set to true by a domain not enforcing cache
> > > > coherency, we hope it will not be reset to false by a later attaching to domain 
> > > > enforcing cache coherency due to the lazily flushing design.  
> > > 
> > > Right, ok, the vfio_dma objects are shared between domains so we never
> > > want to set 'dma->cache_flush_required = false' due to the addition of a
> > > 'domain->enforce_cache_coherent == true'.  So this could be:
> > > 
> > > 	if (!dma->cache_flush_required)
> > > 		dma->cache_flush_required = !domain->enforce_cache_coherency;  
> > 
> > Though this code is easier for understanding, it leads to unnecessary setting of
> > dma->cache_flush_required to false, given domain->enforce_cache_coherency is
> > true at the most time.
> 
> I don't really see that as an issue, but the variable name originally
> chosen above, cache_flush_required, also doesn't convey that it's only
> attempting to set the value if it wasn't previously set and is now
> required by a noncoherent domain.
Agreed, the old name is too vague.
What about update_to_noncoherent_required?
Then in vfio_iommu_replay(), it's like

update_to_noncoherent_required = !domain->enforce_cache_coherency && !dma->is_noncoherent;
if (update_to_noncoherent_required)
         dma->is_noncoherent = true;

...
if (update_to_noncoherent_required)
	arch_flush_cache_phys((phys, size);
> 
> > > > > It might add more clarity to just name the mapping flag
> > > > > dma->mapped_noncoherent.    
> > > > 
> > > > The dma->cache_flush_required is to mark whether pages in a vfio_dma requires
> > > > cache flush in the subsequence mapping into the first non-coherent domain
> > > > and page unpinning.  
> > > 
> > > How do we arrive at a sequence where we have dma->cache_flush_required
> > > that isn't the result of being mapped into a domain with
> > > !domain->enforce_cache_coherency?  
> > Hmm, dma->cache_flush_required IS the result of being mapped into a domain with
> > !domain->enforce_cache_coherency.
> > My concern only arrives from the actual code sequence, i.e.
> > dma->cache_flush_required is set to true before the actual mapping.
> > 
> > If we rename it to dma->mapped_noncoherent and only set it to true after the
> > actual successful mapping, it would lead to more code to handle flushing for the
> > unwind case.
> > Currently, flush for unwind is handled centrally in vfio_unpin_pages_remote()
> > by checking dma->cache_flush_required, which is true even before a full
> > successful mapping, so we won't miss flush on any pages that are mapped into a
> > non-coherent domain in a short window.
> 
> I don't think we need to be so literal that "mapped_noncoherent" can
> only be set after the vfio_dma is fully mapped to a noncoherent domain,
> but also we can come up with other names for the flag.  Perhaps
> "is_noncoherent".  My suggestion was more from the perspective of what
> does the flag represent rather than what we intend to do as a result of
> the flag being set.  Thanks, 
Makes sense!
I like the name "is_noncoherent" :)
Alex Williamson May 17, 2024, 4:44 a.m. UTC | #10
On Fri, 17 May 2024 11:11:48 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Thu, May 16, 2024 at 02:50:09PM -0600, Alex Williamson wrote:
> > On Mon, 13 May 2024 15:11:28 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote:  
> > > > On Fri, 10 May 2024 18:31:13 +0800
> > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >     
> > > > > On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote:    
> > > > > > On Tue,  7 May 2024 14:21:38 +0800
> > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:      
> ...   
> > > > > > > @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > > > > > >  	for (; n; n = rb_next(n)) {
> > > > > > >  		struct vfio_dma *dma;
> > > > > > >  		dma_addr_t iova;
> > > > > > > +		bool cache_flush_required;
> > > > > > >  
> > > > > > >  		dma = rb_entry(n, struct vfio_dma, node);
> > > > > > >  		iova = dma->iova;
> > > > > > > +		cache_flush_required = !domain->enforce_cache_coherency &&
> > > > > > > +				       !dma->cache_flush_required;
> > > > > > > +		if (cache_flush_required)
> > > > > > > +			dma->cache_flush_required = true;      
> > > > > > 
> > > > > > The variable name here isn't accurate and the logic is confusing.  If
> > > > > > the domain does not enforce coherency and the mapping is not tagged as
> > > > > > requiring a cache flush, then we need to mark the mapping as requiring
> > > > > > a cache flush.  So the variable state is something more akin to
> > > > > > set_cache_flush_required.  But all we're saving with this is a
> > > > > > redundant set if the mapping is already tagged as requiring a cache
> > > > > > flush, so it could really be simplified to:
> > > > > > 
> > > > > > 		dma->cache_flush_required = !domain->enforce_cache_coherency;      
> > > > > Sorry about the confusion.
> > > > > 
> > > > > If dma->cache_flush_required is set to true by a domain not enforcing cache
> > > > > coherency, we hope it will not be reset to false by a later attaching to domain 
> > > > > enforcing cache coherency due to the lazily flushing design.    
> > > > 
> > > > Right, ok, the vfio_dma objects are shared between domains so we never
> > > > want to set 'dma->cache_flush_required = false' due to the addition of a
> > > > 'domain->enforce_cache_coherent == true'.  So this could be:
> > > > 
> > > > 	if (!dma->cache_flush_required)
> > > > 		dma->cache_flush_required = !domain->enforce_cache_coherency;    
> > > 
> > > Though this code is easier for understanding, it leads to unnecessary setting of
> > > dma->cache_flush_required to false, given domain->enforce_cache_coherency is
> > > true at the most time.  
> > 
> > I don't really see that as an issue, but the variable name originally
> > chosen above, cache_flush_required, also doesn't convey that it's only
> > attempting to set the value if it wasn't previously set and is now
> > required by a noncoherent domain.  
> Agreed, the old name is too vague.
> What about update_to_noncoherent_required?

set_noncoherent?  Thanks,

Alex

> Then in vfio_iommu_replay(), it's like
> 
> update_to_noncoherent_required = !domain->enforce_cache_coherency && !dma->is_noncoherent;
> if (update_to_noncoherent_required)
>          dma->is_noncoherent = true;
> 
> ...
> if (update_to_noncoherent_required)
> 	arch_flush_cache_phys((phys, size);
> >   
> > > > > > It might add more clarity to just name the mapping flag
> > > > > > dma->mapped_noncoherent.      
> > > > > 
> > > > > The dma->cache_flush_required is to mark whether pages in a vfio_dma requires
> > > > > cache flush in the subsequence mapping into the first non-coherent domain
> > > > > and page unpinning.    
> > > > 
> > > > How do we arrive at a sequence where we have dma->cache_flush_required
> > > > that isn't the result of being mapped into a domain with
> > > > !domain->enforce_cache_coherency?    
> > > Hmm, dma->cache_flush_required IS the result of being mapped into a domain with
> > > !domain->enforce_cache_coherency.
> > > My concern only arrives from the actual code sequence, i.e.
> > > dma->cache_flush_required is set to true before the actual mapping.
> > > 
> > > If we rename it to dma->mapped_noncoherent and only set it to true after the
> > > actual successful mapping, it would lead to more code to handle flushing for the
> > > unwind case.
> > > Currently, flush for unwind is handled centrally in vfio_unpin_pages_remote()
> > > by checking dma->cache_flush_required, which is true even before a full
> > > successful mapping, so we won't miss flush on any pages that are mapped into a
> > > non-coherent domain in a short window.  
> > 
> > I don't think we need to be so literal that "mapped_noncoherent" can
> > only be set after the vfio_dma is fully mapped to a noncoherent domain,
> > but also we can come up with other names for the flag.  Perhaps
> > "is_noncoherent".  My suggestion was more from the perspective of what
> > does the flag represent rather than what we intend to do as a result of
> > the flag being set.  Thanks,   
> Makes sense!
> I like the name "is_noncoherent" :)
>
Yan Zhao May 17, 2024, 5 a.m. UTC | #11
On Thu, May 16, 2024 at 10:44:42PM -0600, Alex Williamson wrote:
> On Fri, 17 May 2024 11:11:48 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Thu, May 16, 2024 at 02:50:09PM -0600, Alex Williamson wrote:
> > > On Mon, 13 May 2024 15:11:28 +0800
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >   
> > > > On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote:  
> > > > > On Fri, 10 May 2024 18:31:13 +0800
> > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >     
> > > > > > On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote:    
> > > > > > > On Tue,  7 May 2024 14:21:38 +0800
> > > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:      
> > ...   
> > > > > > > > @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > > > > > > >  	for (; n; n = rb_next(n)) {
> > > > > > > >  		struct vfio_dma *dma;
> > > > > > > >  		dma_addr_t iova;
> > > > > > > > +		bool cache_flush_required;
> > > > > > > >  
> > > > > > > >  		dma = rb_entry(n, struct vfio_dma, node);
> > > > > > > >  		iova = dma->iova;
> > > > > > > > +		cache_flush_required = !domain->enforce_cache_coherency &&
> > > > > > > > +				       !dma->cache_flush_required;
> > > > > > > > +		if (cache_flush_required)
> > > > > > > > +			dma->cache_flush_required = true;      
> > > > > > > 
> > > > > > > The variable name here isn't accurate and the logic is confusing.  If
> > > > > > > the domain does not enforce coherency and the mapping is not tagged as
> > > > > > > requiring a cache flush, then we need to mark the mapping as requiring
> > > > > > > a cache flush.  So the variable state is something more akin to
> > > > > > > set_cache_flush_required.  But all we're saving with this is a
> > > > > > > redundant set if the mapping is already tagged as requiring a cache
> > > > > > > flush, so it could really be simplified to:
> > > > > > > 
> > > > > > > 		dma->cache_flush_required = !domain->enforce_cache_coherency;      
> > > > > > Sorry about the confusion.
> > > > > > 
> > > > > > If dma->cache_flush_required is set to true by a domain not enforcing cache
> > > > > > coherency, we hope it will not be reset to false by a later attaching to domain 
> > > > > > enforcing cache coherency due to the lazily flushing design.    
> > > > > 
> > > > > Right, ok, the vfio_dma objects are shared between domains so we never
> > > > > want to set 'dma->cache_flush_required = false' due to the addition of a
> > > > > 'domain->enforce_cache_coherent == true'.  So this could be:
> > > > > 
> > > > > 	if (!dma->cache_flush_required)
> > > > > 		dma->cache_flush_required = !domain->enforce_cache_coherency;    
> > > > 
> > > > Though this code is easier for understanding, it leads to unnecessary setting of
> > > > dma->cache_flush_required to false, given domain->enforce_cache_coherency is
> > > > true at the most time.  
> > > 
> > > I don't really see that as an issue, but the variable name originally
> > > chosen above, cache_flush_required, also doesn't convey that it's only
> > > attempting to set the value if it wasn't previously set and is now
> > > required by a noncoherent domain.  
> > Agreed, the old name is too vague.
> > What about update_to_noncoherent_required?
> 
> set_noncoherent?  Thanks,
> 
Concise!

> 
> > Then in vfio_iommu_replay(), it's like
> > 
> > update_to_noncoherent_required = !domain->enforce_cache_coherency && !dma->is_noncoherent;
> > if (update_to_noncoherent_required)
> >          dma->is_noncoherent = true;
> > 
> > ...
> > if (update_to_noncoherent_required)
> > 	arch_flush_cache_phys((phys, size);
> > >   
> > > > > > > It might add more clarity to just name the mapping flag
> > > > > > > dma->mapped_noncoherent.      
> > > > > > 
> > > > > > The dma->cache_flush_required is to mark whether pages in a vfio_dma requires
> > > > > > cache flush in the subsequence mapping into the first non-coherent domain
> > > > > > and page unpinning.    
> > > > > 
> > > > > How do we arrive at a sequence where we have dma->cache_flush_required
> > > > > that isn't the result of being mapped into a domain with
> > > > > !domain->enforce_cache_coherency?    
> > > > Hmm, dma->cache_flush_required IS the result of being mapped into a domain with
> > > > !domain->enforce_cache_coherency.
> > > > My concern only arrives from the actual code sequence, i.e.
> > > > dma->cache_flush_required is set to true before the actual mapping.
> > > > 
> > > > If we rename it to dma->mapped_noncoherent and only set it to true after the
> > > > actual successful mapping, it would lead to more code to handle flushing for the
> > > > unwind case.
> > > > Currently, flush for unwind is handled centrally in vfio_unpin_pages_remote()
> > > > by checking dma->cache_flush_required, which is true even before a full
> > > > successful mapping, so we won't miss flush on any pages that are mapped into a
> > > > non-coherent domain in a short window.  
> > > 
> > > I don't think we need to be so literal that "mapped_noncoherent" can
> > > only be set after the vfio_dma is fully mapped to a noncoherent domain,
> > > but also we can come up with other names for the flag.  Perhaps
> > > "is_noncoherent".  My suggestion was more from the perspective of what
> > > does the flag represent rather than what we intend to do as a result of
> > > the flag being set.  Thanks,   
> > Makes sense!
> > I like the name "is_noncoherent" :)
> > 
>
Jason Gunthorpe May 17, 2024, 5:11 p.m. UTC | #12
On Thu, May 16, 2024 at 02:31:59PM -0600, Alex Williamson wrote:

> Yes, exactly.  Zero'ing the page would obviously reestablish the
> coherency, but the page could be reallocated without being zero'd and as
> you describe the owner of that page could then get inconsistent
> results.  

I think if we care about the performance of this stuff enough to try
and remove flushes we'd be better off figuring out how to disable no
snoop in PCI config space and trust the device not to use it and avoid
these flushes.

iommu enforcement is nice, but at least ARM has been assuming that the
PCI config space bit is sufficient.

Intel/AMD are probably fine here as they will only flush for weird GPU
cases, but I expect ARM is going to be unhappy.

Jason
diff mbox series

Patch

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index b5c15fe8f9fc..ce873f4220bf 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -74,6 +74,7 @@  struct vfio_iommu {
 	bool			v2;
 	bool			nesting;
 	bool			dirty_page_tracking;
+	bool			has_noncoherent_domain;
 	struct list_head	emulated_iommu_groups;
 };
 
@@ -99,6 +100,7 @@  struct vfio_dma {
 	unsigned long		*bitmap;
 	struct mm_struct	*mm;
 	size_t			locked_vm;
+	bool			cache_flush_required; /* For noncoherent domain */
 };
 
 struct vfio_batch {
@@ -716,6 +718,9 @@  static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
 	long unlocked = 0, locked = 0;
 	long i;
 
+	if (dma->cache_flush_required)
+		arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, npage << PAGE_SHIFT);
+
 	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
 		if (put_pfn(pfn++, dma->prot)) {
 			unlocked++;
@@ -1099,6 +1104,8 @@  static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
 					    &iotlb_gather);
 	}
 
+	dma->cache_flush_required = false;
+
 	if (do_accounting) {
 		vfio_lock_acct(dma, -unlocked, true);
 		return 0;
@@ -1120,6 +1127,21 @@  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 	iommu->dma_avail++;
 }
 
+static void vfio_update_noncoherent_domain_state(struct vfio_iommu *iommu)
+{
+	struct vfio_domain *domain;
+	bool has_noncoherent = false;
+
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		if (domain->enforce_cache_coherency)
+			continue;
+
+		has_noncoherent = true;
+		break;
+	}
+	iommu->has_noncoherent_domain = has_noncoherent;
+}
+
 static void vfio_update_pgsize_bitmap(struct vfio_iommu *iommu)
 {
 	struct vfio_domain *domain;
@@ -1455,6 +1477,12 @@  static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
 
 	vfio_batch_init(&batch);
 
+	/*
+	 * Record necessity to flush CPU cache to make sure CPU cache is flushed
+	 * for both pin & map and unmap & unpin (for unwind) paths.
+	 */
+	dma->cache_flush_required = iommu->has_noncoherent_domain;
+
 	while (size) {
 		/* Pin a contiguous chunk of memory */
 		npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
@@ -1466,6 +1494,10 @@  static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
 			break;
 		}
 
+		if (dma->cache_flush_required)
+			arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT,
+						npage << PAGE_SHIFT);
+
 		/* Map it! */
 		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
 				     dma->prot);
@@ -1683,9 +1715,14 @@  static int vfio_iommu_replay(struct vfio_iommu *iommu,
 	for (; n; n = rb_next(n)) {
 		struct vfio_dma *dma;
 		dma_addr_t iova;
+		bool cache_flush_required;
 
 		dma = rb_entry(n, struct vfio_dma, node);
 		iova = dma->iova;
+		cache_flush_required = !domain->enforce_cache_coherency &&
+				       !dma->cache_flush_required;
+		if (cache_flush_required)
+			dma->cache_flush_required = true;
 
 		while (iova < dma->iova + dma->size) {
 			phys_addr_t phys;
@@ -1737,6 +1774,9 @@  static int vfio_iommu_replay(struct vfio_iommu *iommu,
 				size = npage << PAGE_SHIFT;
 			}
 
+			if (cache_flush_required)
+				arch_clean_nonsnoop_dma(phys, size);
+
 			ret = iommu_map(domain->domain, iova, phys, size,
 					dma->prot | IOMMU_CACHE,
 					GFP_KERNEL_ACCOUNT);
@@ -1801,6 +1841,7 @@  static int vfio_iommu_replay(struct vfio_iommu *iommu,
 			vfio_unpin_pages_remote(dma, iova, phys >> PAGE_SHIFT,
 						size >> PAGE_SHIFT, true);
 		}
+		dma->cache_flush_required = false;
 	}
 
 	vfio_batch_fini(&batch);
@@ -1828,6 +1869,9 @@  static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
 	if (!pages)
 		return;
 
+	if (!domain->enforce_cache_coherency)
+		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
+
 	list_for_each_entry(region, regions, list) {
 		start = ALIGN(region->start, PAGE_SIZE * 2);
 		if (start >= region->end || (region->end - start < PAGE_SIZE * 2))
@@ -1847,6 +1891,9 @@  static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
 		break;
 	}
 
+	if (!domain->enforce_cache_coherency)
+		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
+
 	__free_pages(pages, order);
 }
 
@@ -2308,6 +2355,8 @@  static int vfio_iommu_type1_attach_group(void *iommu_data,
 
 	list_add(&domain->next, &iommu->domain_list);
 	vfio_update_pgsize_bitmap(iommu);
+	if (!domain->enforce_cache_coherency)
+		vfio_update_noncoherent_domain_state(iommu);
 done:
 	/* Delete the old one and insert new iova list */
 	vfio_iommu_iova_insert_copy(iommu, &iova_copy);
@@ -2508,6 +2557,8 @@  static void vfio_iommu_type1_detach_group(void *iommu_data,
 			}
 			iommu_domain_free(domain->domain);
 			list_del(&domain->next);
+			if (!domain->enforce_cache_coherency)
+				vfio_update_noncoherent_domain_state(iommu);
 			kfree(domain);
 			vfio_iommu_aper_expand(iommu, &iova_copy);
 			vfio_update_pgsize_bitmap(iommu);