diff mbox series

[v3,4/7] swiotlb: if swiotlb is full, fall back to a transient memory pool

Message ID 34c2a1ba721a7bc496128aac5e20724e4077f1ab.1687859323.git.petr.tesarik.ext@huawei.com (mailing list archive)
State Superseded
Headers show
Series Allow dynamic allocation of software IO TLB bounce buffers | expand

Commit Message

Petr Tesarik June 27, 2023, 9:54 a.m. UTC
From: Petr Tesarik <petr.tesarik.ext@huawei.com>

Try to allocate a transient memory pool if no suitable slots can be found,
except when allocating from a restricted pool. The transient pool is just
enough big for this one bounce buffer. It is inserted into a per-device
list of transient memory pools, and it is freed again when the bounce
buffer is unmapped.

Transient memory pools are kept in an RCU list. A memory barrier is
required after adding a new entry, because any address within a transient
buffer must be immediately recognized as belonging to the SWIOTLB, even if
it is passed to another CPU.

Deletion does not require any synchronization beyond RCU ordering
guarantees. After a buffer is unmapped, its physical addresses may no
longer be passed to the DMA API, so the memory range of the corresponding
stale entry in the RCU list never matches. If the memory range gets
allocated again, then it happens only after a RCU quiescent state.

Since bounce buffers can now be allocated from different pools, add a
parameter to swiotlb_alloc_pool() to let the caller know which memory pool
is used. Add swiotlb_find_pool() to find the memory pool corresponding to
an address. This function is now also used by is_swiotlb_buffer(), because
a simple boundary check is no longer sufficient.

The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
simplified and enhanced to use coherent memory pools if needed.

Note that this is not the most efficient way to provide a bounce buffer,
but when a DMA buffer can't be mapped, something may (and will) actually
break. At that point it is better to make an allocation, even if it may be
an expensive operation.

Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
---
 include/linux/device.h      |   4 +
 include/linux/dma-mapping.h |   2 +
 include/linux/swiotlb.h     |  13 +-
 kernel/dma/direct.c         |   2 +-
 kernel/dma/swiotlb.c        | 265 ++++++++++++++++++++++++++++++++++--
 5 files changed, 272 insertions(+), 14 deletions(-)

Comments

Michael Kelley (LINUX) July 6, 2023, 3:50 a.m. UTC | #1
From: Petr Tesarik <petrtesarik@huaweicloud.com> Sent: Tuesday, June 27, 2023 2:54 AM
> 
> Try to allocate a transient memory pool if no suitable slots can be found,
> except when allocating from a restricted pool. The transient pool is just
> enough big for this one bounce buffer. It is inserted into a per-device
> list of transient memory pools, and it is freed again when the bounce
> buffer is unmapped.
> 
> Transient memory pools are kept in an RCU list. A memory barrier is
> required after adding a new entry, because any address within a transient
> buffer must be immediately recognized as belonging to the SWIOTLB, even if
> it is passed to another CPU.
> 
> Deletion does not require any synchronization beyond RCU ordering
> guarantees. After a buffer is unmapped, its physical addresses may no
> longer be passed to the DMA API, so the memory range of the corresponding
> stale entry in the RCU list never matches. If the memory range gets
> allocated again, then it happens only after a RCU quiescent state.
> 
> Since bounce buffers can now be allocated from different pools, add a
> parameter to swiotlb_alloc_pool() to let the caller know which memory pool
> is used. Add swiotlb_find_pool() to find the memory pool corresponding to
> an address. This function is now also used by is_swiotlb_buffer(), because
> a simple boundary check is no longer sufficient.
> 
> The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
> simplified and enhanced to use coherent memory pools if needed.
> 
> Note that this is not the most efficient way to provide a bounce buffer,
> but when a DMA buffer can't be mapped, something may (and will) actually
> break. At that point it is better to make an allocation, even if it may be
> an expensive operation.

I continue to think about swiotlb memory management from the standpoint
of CoCo VMs that may be quite large with high network and storage loads.
These VMs are often running mission-critical workloads that can't tolerate
a bounce buffer allocation failure.  To prevent such failures, the swiotlb
memory size must be overly large, which wastes memory.

Your new approach helps by using the coherent memory pools as an overflow
space.   But in a lot of ways, it only pushes the problem around.  As you
noted in your cover letter, reducing the initial size of the swiotlb might
require increasing the size of the coherent pools.

What might be really useful is to pend bounce buffer requests while the
new worker thread is adding more swiotlb pools.  Of course, requests made
from interrupt level can't be pended, but at least in my experience with large
CoCo VMs, storage I/O is the biggest consumer of bounce buffers.  A lot
(most?) storage requests make the swiotlb_map() call in a context where
it is OK to pend.   If the coherent pool overflow space is could be used only
for swiotlb_map() calls that can't pend, it's more likely to be sufficient to
bridge the gap until new pools are added.

Could swiotlb code detect if it's OK to pend, and then pend a bounce
buffer request until the worker thread adds a new pool?  Even an overly
conversative check would help reduce pressure on the coherent pools
as overflow space.

Michael

> 
> Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
> ---
>  include/linux/device.h      |   4 +
>  include/linux/dma-mapping.h |   2 +
>  include/linux/swiotlb.h     |  13 +-
>  kernel/dma/direct.c         |   2 +-
>  kernel/dma/swiotlb.c        | 265 ++++++++++++++++++++++++++++++++++--
>  5 files changed, 272 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/device.h b/include/linux/device.h
> index 83081aa99e6a..a1ee4c5924b8 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -510,6 +510,8 @@ struct device_physical_location {
>   * @dma_mem:	Internal for coherent mem override.
>   * @cma_area:	Contiguous memory area for dma allocations
>   * @dma_io_tlb_mem: Software IO TLB allocator.  Not for driver use.
> + * @dma_io_tlb_pools:	List of transient swiotlb memory pools.
> + * @dma_io_tlb_lock:	Protects changes to the list of active pools.
>   * @archdata:	For arch-specific additions.
>   * @of_node:	Associated device tree node.
>   * @fwnode:	Associated device node supplied by platform firmware.
> @@ -615,6 +617,8 @@ struct device {
>  #endif
>  #ifdef CONFIG_SWIOTLB
>  	struct io_tlb_mem *dma_io_tlb_mem;
> +	struct list_head dma_io_tlb_pools;
> +	spinlock_t dma_io_tlb_lock;
>  #endif
>  	/* arch specific additions */
>  	struct dev_archdata	archdata;
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 0ee20b764000..c36c5a546787 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -417,6 +417,8 @@ static inline void dma_sync_sgtable_for_device(struct device
> *dev,
>  #define dma_get_sgtable(d, t, v, h, s) dma_get_sgtable_attrs(d, t, v, h, s, 0)
>  #define dma_mmap_coherent(d, v, c, h, s) dma_mmap_attrs(d, v, c, h, s, 0)
> 
> +bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size);
> +
>  static inline void *dma_alloc_coherent(struct device *dev, size_t size,
>  		dma_addr_t *dma_handle, gfp_t gfp)
>  {
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 0aa6868cb68c..ae1688438850 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -63,6 +63,7 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
> 
>  /**
>   * struct io_tlb_pool - IO TLB memory pool descriptor
> + * @node:	Member of the IO TLB memory pool list.
>   * @start:	The start address of the swiotlb memory pool. Used to do a quick
>   *		range check to see if the memory was in fact allocated by this
>   *		API.
> @@ -77,22 +78,27 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t
> phys,
>   *		see setup_io_tlb_npages().
>   * @used:	The number of used IO TLB slots.
>   * @late_alloc:	%true if allocated using the page allocator.
> + * @transient:  %true if transient memory pool.
>   * @nareas:	Number of areas in the pool.
>   * @area_nslabs: Number of slots in each area.
>   * @areas:	Array of memory area descriptors.
>   * @slots:	Array of slot descriptors.
> + * @rcu:	RCU head for swiotlb_dyn_free().
>   */
>  struct io_tlb_pool {
> +	struct list_head node;
>  	phys_addr_t start;
>  	phys_addr_t end;
>  	void *vaddr;
>  	unsigned long nslabs;
>  	unsigned long used;
>  	bool late_alloc;
> +	bool transient;
>  	unsigned int nareas;
>  	unsigned int area_nslabs;
>  	struct io_tlb_area *areas;
>  	struct io_tlb_slot *slots;
> +	struct rcu_head rcu;
>  };
> 
>  /**
> @@ -120,6 +126,8 @@ struct io_tlb_mem {
>  #endif
>  };
> 
> +struct io_tlb_pool *swiotlb_find_pool(struct device *dev, phys_addr_t paddr);
> +
>  /**
>   * is_swiotlb_buffer() - check if a physical address belongs to a swiotlb
>   * @dev:        Device which has mapped the buffer.
> @@ -133,9 +141,8 @@ struct io_tlb_mem {
>   */
>  static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
>  {
> -	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> -
> -	return mem && paddr >= mem->pool->start && paddr < mem->pool->end;
> +	return dev->dma_io_tlb_mem &&
> +		!!swiotlb_find_pool(dev, paddr);
>  }
> 
>  static inline bool is_swiotlb_force_bounce(struct device *dev)
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 5595d1d5cdcc..820561cab38d 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -66,7 +66,7 @@ static gfp_t dma_direct_optimal_gfp_mask(struct device *dev,
> u64 *phys_limit)
>  	return 0;
>  }
> 
> -static bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size)
> +bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size)
>  {
>  	dma_addr_t dma_addr = phys_to_dma_direct(dev, phys);
> 
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 4c5de91bda57..06b4fa7c2e9b 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -35,6 +35,7 @@
>  #include <linux/memblock.h>
>  #include <linux/mm.h>
>  #include <linux/pfn.h>
> +#include <linux/rculist.h>
>  #include <linux/scatterlist.h>
>  #include <linux/set_memory.h>
>  #include <linux/spinlock.h>
> @@ -500,6 +501,157 @@ void __init swiotlb_exit(void)
>  	memset(mem, 0, sizeof(*mem));
>  }
> 
> +/**
> + * alloc_dma_pages() - allocate pages to be used for DMA
> + * @gfp:	GFP flags for the allocation.
> + * @bytes:	Size of the buffer.
> + *
> + * Allocate pages from the buddy allocator. If successful, make the allocated
> + * pages decrypted that they can be used for DMA.
> + *
> + * Return: Decrypted pages, or %NULL on failure.
> + */
> +static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes)
> +{
> +	unsigned int order = get_order(bytes);
> +	struct page *page;
> +	void *vaddr;
> +
> +	page = alloc_pages(gfp, order);
> +	if (!page)
> +		return NULL;
> +
> +	vaddr = page_address(page);
> +	if (set_memory_decrypted((unsigned long)vaddr, PFN_UP(bytes))) {
> +		__free_pages(page, order);
> +		return NULL;
> +	}
> +
> +	return page;
> +}
> +
> +/**
> + * swiotlb_alloc_tlb() - allocate a dynamic IO TLB buffer
> + * @dev:	Device for which a memory pool is allocated.
> + * @bytes:	Size of the buffer.
> + * @phys_limit:	Maximum allowed physical address of the buffer.
> + * @gfp:	GFP flags for the allocation.
> + *
> + * Return: Allocated pages, or %NULL on allocation failure.
> + */
> +static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
> +		u64 phys_limit, gfp_t gfp)
> +{
> +	struct page *page;
> +
> +	/*
> +	 * Allocate from the atomic pools if memory is encrypted and
> +	 * the allocation is atomic, because decrypting may block.
> +	 */
> +	if (dev && force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp)) {
> +		void *vaddr;
> +
> +		return IS_ENABLED(CONFIG_DMA_COHERENT_POOL)
> +			? dma_alloc_from_pool(dev, bytes, &vaddr, gfp,
> +					      dma_coherent_ok)
> +			: NULL;
> +	}
> +
> +	gfp &= ~GFP_ZONEMASK;
> +	if (phys_limit <= DMA_BIT_MASK(zone_dma_bits))
> +		gfp |= __GFP_DMA;
> +	else if (phys_limit <= DMA_BIT_MASK(32))
> +		gfp |= __GFP_DMA32;
> +
> +	while ((page = alloc_dma_pages(gfp, bytes)) &&
> +	       page_to_phys(page) + bytes - 1 > phys_limit) {
> +		/* allocated, but too high */
> +		__free_pages(page, get_order(bytes));
> +
> +		if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
> +		    phys_limit < DMA_BIT_MASK(64) &&
> +		    !(gfp & (__GFP_DMA32 | __GFP_DMA)))
> +			gfp |= __GFP_DMA32;
> +		else if (IS_ENABLED(CONFIG_ZONE_DMA) &&
> +			 !(gfp & __GFP_DMA))
> +			gfp = (gfp & ~__GFP_DMA32) | __GFP_DMA;
> +		else
> +			return NULL;
> +	}
> +
> +	return page;
> +}
> +
> +/**
> + * swiotlb_free_tlb() - free a dynamically allocated IO TLB buffer
> + * @vaddr:	Virtual address of the buffer.
> + * @bytes:	Size of the buffer.
> + */
> +static void swiotlb_free_tlb(void *vaddr, size_t bytes)
> +{
> +	if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
> +	    dma_free_from_pool(NULL, vaddr, bytes))
> +		return;
> +
> +	/* Intentional leak if pages cannot be encrypted again. */
> +	if (!set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
> +		__free_pages(virt_to_page(vaddr), get_order(bytes));
> +}
> +
> +/**
> + * swiotlb_alloc_pool() - allocate a new IO TLB memory pool
> + * @dev:	Device for which a memory pool is allocated.
> + * @nslabs:	Desired number of slabs.
> + * @phys_limit:	Maximum DMA buffer physical address.
> + * @gfp:	GFP flags for the allocations.
> + *
> + * Allocate and initialize a new IO TLB memory pool.
> + *
> + * Return: New memory pool, or %NULL on allocation failure.
> + */
> +static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
> +		unsigned int nslabs, u64 phys_limit, gfp_t gfp)
> +{
> +	struct io_tlb_pool *pool;
> +	struct page *tlb;
> +	size_t pool_size;
> +	size_t tlb_size;
> +
> +	pool_size = sizeof(*pool) + array_size(sizeof(*pool->areas), 1) +
> +		array_size(sizeof(*pool->slots), nslabs);
> +	pool = kzalloc(pool_size, gfp);
> +	if (!pool)
> +		goto error;
> +	pool->areas = (void *)pool + sizeof(*pool);
> +	pool->slots = (void *)pool->areas + sizeof(*pool->areas);
> +
> +	tlb_size = nslabs << IO_TLB_SHIFT;
> +	tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, gfp);
> +	if (!tlb)
> +		goto error_tlb;
> +
> +	swiotlb_init_io_tlb_pool(pool, page_to_phys(tlb), nslabs, true, 1);
> +	return pool;
> +
> +error_tlb:
> +	kfree(pool);
> +error:
> +	return NULL;
> +}
> +
> +/**
> + * swiotlb_dyn_free() - RCU callback to free a memory pool
> + * @rcu:	RCU head in the corresponding struct io_tlb_pool.
> + */
> +static void swiotlb_dyn_free(struct rcu_head *rcu)
> +{
> +	struct io_tlb_pool *pool = container_of(rcu, struct io_tlb_pool, rcu);
> +	size_t tlb_size = pool->end - pool->start;
> +
> +	swiotlb_free_tlb(pool->vaddr, tlb_size);
> +	kfree(pool);
> +}
> +
>  /**
>   * swiotlb_dev_init() - initialize swiotlb fields in &struct device
>   * @dev:	Device to be initialized.
> @@ -507,6 +659,56 @@ void __init swiotlb_exit(void)
>  void swiotlb_dev_init(struct device *dev)
>  {
>  	dev->dma_io_tlb_mem = &io_tlb_default_mem;
> +	INIT_LIST_HEAD(&dev->dma_io_tlb_pools);
> +	spin_lock_init(&dev->dma_io_tlb_lock);
> +}
> +
> +/**
> + * swiotlb_find_pool() - find the IO TLB pool for a physical address
> + * @dev:        Device which has mapped the DMA buffer.
> + * @paddr:      Physical address within the DMA buffer.
> + *
> + * Find the IO TLB memory pool descriptor which contains the given physical
> + * address, if any.
> + *
> + * Return: Memory pool which contains @paddr, or %NULL if none.
> + */
> +struct io_tlb_pool *swiotlb_find_pool(struct device *dev, phys_addr_t paddr)
> +{
> +	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> +	struct io_tlb_pool *pool = mem->pool;
> +
> +	if (paddr >= pool->start && paddr < pool->end)
> +		return pool;
> +
> +	/* Pairs with smp_wmb() in swiotlb_find_slots(). */
> +	smp_rmb();
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(pool, &dev->dma_io_tlb_pools, node) {
> +		if (paddr >= pool->start && paddr < pool->end)
> +			goto out;
> +	}
> +	pool = NULL;
> +out:
> +	rcu_read_unlock();
> +	return pool;
> +}
> +
> +/**
> + * swiotlb_del_pool() - remove an IO TLB pool from a device
> + * @dev:	Owning device.
> + * @pool:	Memory pool to be removed.
> + */
> +static void swiotlb_del_pool(struct device *dev, struct io_tlb_pool *pool)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&dev->dma_io_tlb_lock, flags);
> +	list_del_rcu(&pool->node);
> +	spin_unlock_irqrestore(&dev->dma_io_tlb_lock, flags);
> +
> +	call_rcu(&pool->rcu, swiotlb_dyn_free);
>  }
> 
>  /*
> @@ -523,7 +725,7 @@ static unsigned int swiotlb_align_offset(struct device *dev, u64
> addr)
>  static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size,
>  			   enum dma_data_direction dir)
>  {
> -	struct io_tlb_pool *mem = dev->dma_io_tlb_mem->pool;
> +	struct io_tlb_pool *mem = swiotlb_find_pool(dev, tlb_addr);
>  	int index = (tlb_addr - mem->start) >> IO_TLB_SHIFT;
>  	phys_addr_t orig_addr = mem->slots[index].orig_addr;
>  	size_t alloc_size = mem->slots[index].alloc_size;
> @@ -796,6 +998,7 @@ static int pool_find_slots(struct device *dev, struct io_tlb_pool
> *pool,
>   * @alloc_size: Total requested size of the bounce buffer,
>   *		including initial alignment padding.
>   * @alloc_align_mask:	Required alignment of the allocated buffer.
> + * @retpool:	Used memory pool, updated on return.
>   *
>   * Search through the whole software IO TLB to find a sequence of slots that
>   * match the allocation constraints.
> @@ -803,10 +1006,46 @@ static int pool_find_slots(struct device *dev, struct
> io_tlb_pool *pool,
>   * Return: Index of the first allocated slot, or -1 on error.
>   */
>  static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
> -		size_t alloc_size, unsigned int alloc_align_mask)
> +		size_t alloc_size, unsigned int alloc_align_mask,
> +		struct io_tlb_pool **retpool)
>  {
> -	return pool_find_slots(dev, dev->dma_io_tlb_mem->pool, orig_addr,
> -			       alloc_size, alloc_align_mask);
> +	struct io_tlb_pool *pool;
> +	unsigned long flags;
> +	u64 phys_limit;
> +	int index;
> +
> +	pool = dev->dma_io_tlb_mem->pool;
> +	index = pool_find_slots(dev, pool, orig_addr,
> +				alloc_size, alloc_align_mask);
> +	if (index >= 0)
> +		goto found;
> +
> +	if (dev->dma_io_tlb_mem->for_alloc)
> +		return -1;
> +
> +	phys_limit = min_not_zero(dev->coherent_dma_mask, dev->bus_dma_limit);
> +	pool = swiotlb_alloc_pool(dev, nr_slots(alloc_size), phys_limit,
> +				  GFP_NOWAIT | __GFP_NOWARN);
> +	if (!pool)
> +		return -1;
> +
> +	index = pool_find_slots(dev, pool, orig_addr,
> +				alloc_size, alloc_align_mask);
> +	if (index < 0) {
> +		swiotlb_dyn_free(&pool->rcu);
> +		return -1;
> +	}
> +
> +	pool->transient = true;
> +	spin_lock_irqsave(&dev->dma_io_tlb_lock, flags);
> +	list_add_rcu(&pool->node, &dev->dma_io_tlb_pools);
> +	spin_unlock_irqrestore(&dev->dma_io_tlb_lock, flags);
> +
> +	/* Pairs with smp_rmb() in swiotlb_find_pool(). */
> +	smp_wmb();
> +found:
> +	*retpool = pool;
> +	return index;
>  }
> 
>  /**
> @@ -869,7 +1108,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev,
> phys_addr_t orig_addr,
>  	}
> 
>  	index = swiotlb_find_slots(dev, orig_addr,
> -				   alloc_size + offset, alloc_align_mask);
> +				   alloc_size + offset, alloc_align_mask, &pool);
>  	if (index == -1) {
>  		if (!(attrs & DMA_ATTR_NO_WARN))
>  			dev_warn_ratelimited(dev,
> @@ -883,7 +1122,6 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev,
> phys_addr_t orig_addr,
>  	 * This is needed when we sync the memory.  Then we sync the buffer if
>  	 * needed.
>  	 */
> -	pool = mem->pool;
>  	for (i = 0; i < nr_slots(alloc_size + offset); i++)
>  		pool->slots[index + i].orig_addr = slot_addr(orig_addr, i);
>  	tlb_addr = slot_addr(pool->start, index) + offset;
> @@ -900,7 +1138,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev,
> phys_addr_t orig_addr,
> 
>  static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
>  {
> -	struct io_tlb_pool *mem = dev->dma_io_tlb_mem->pool;
> +	struct io_tlb_pool *mem = swiotlb_find_pool(dev, tlb_addr);
>  	unsigned long flags;
>  	unsigned int offset = swiotlb_align_offset(dev, tlb_addr);
>  	int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
> @@ -954,6 +1192,8 @@ void swiotlb_tbl_unmap_single(struct device *dev,
> phys_addr_t tlb_addr,
>  			      size_t mapping_size, enum dma_data_direction dir,
>  			      unsigned long attrs)
>  {
> +	struct io_tlb_pool *pool;
> +
>  	/*
>  	 * First, sync the memory before unmapping the entry
>  	 */
> @@ -961,7 +1201,13 @@ void swiotlb_tbl_unmap_single(struct device *dev,
> phys_addr_t tlb_addr,
>  	    (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
>  		swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_FROM_DEVICE);
> 
> -	swiotlb_release_slots(dev, tlb_addr);
> +	pool = swiotlb_find_pool(dev, tlb_addr);
> +	if (pool->transient) {
> +		dec_used(dev->dma_io_tlb_mem, pool->nslabs);
> +		swiotlb_del_pool(dev, pool);
> +	} else {
> +		swiotlb_release_slots(dev, tlb_addr);
> +	}
>  }
> 
>  void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
> @@ -1145,11 +1391,10 @@ struct page *swiotlb_alloc(struct device *dev, size_t size)
>  	if (!mem)
>  		return NULL;
> 
> -	index = swiotlb_find_slots(dev, 0, size, 0);
> +	index = swiotlb_find_slots(dev, 0, size, 0, &pool);
>  	if (index == -1)
>  		return NULL;
> 
> -	pool = mem->pool;
>  	tlb_addr = slot_addr(pool->start, index);
> 
>  	return pfn_to_page(PFN_DOWN(tlb_addr));
> --
> 2.25.1
Greg KH July 6, 2023, 8:07 a.m. UTC | #2
On Thu, Jul 06, 2023 at 03:50:55AM +0000, Michael Kelley (LINUX) wrote:
> From: Petr Tesarik <petrtesarik@huaweicloud.com> Sent: Tuesday, June 27, 2023 2:54 AM
> > 
> > Try to allocate a transient memory pool if no suitable slots can be found,
> > except when allocating from a restricted pool. The transient pool is just
> > enough big for this one bounce buffer. It is inserted into a per-device
> > list of transient memory pools, and it is freed again when the bounce
> > buffer is unmapped.
> > 
> > Transient memory pools are kept in an RCU list. A memory barrier is
> > required after adding a new entry, because any address within a transient
> > buffer must be immediately recognized as belonging to the SWIOTLB, even if
> > it is passed to another CPU.
> > 
> > Deletion does not require any synchronization beyond RCU ordering
> > guarantees. After a buffer is unmapped, its physical addresses may no
> > longer be passed to the DMA API, so the memory range of the corresponding
> > stale entry in the RCU list never matches. If the memory range gets
> > allocated again, then it happens only after a RCU quiescent state.
> > 
> > Since bounce buffers can now be allocated from different pools, add a
> > parameter to swiotlb_alloc_pool() to let the caller know which memory pool
> > is used. Add swiotlb_find_pool() to find the memory pool corresponding to
> > an address. This function is now also used by is_swiotlb_buffer(), because
> > a simple boundary check is no longer sufficient.
> > 
> > The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
> > simplified and enhanced to use coherent memory pools if needed.
> > 
> > Note that this is not the most efficient way to provide a bounce buffer,
> > but when a DMA buffer can't be mapped, something may (and will) actually
> > break. At that point it is better to make an allocation, even if it may be
> > an expensive operation.
> 
> I continue to think about swiotlb memory management from the standpoint
> of CoCo VMs that may be quite large with high network and storage loads.
> These VMs are often running mission-critical workloads that can't tolerate
> a bounce buffer allocation failure.  To prevent such failures, the swiotlb
> memory size must be overly large, which wastes memory.

If "mission critical workloads" are in a vm that allowes overcommit and
no control over other vms in that same system, then you have worse
problems, sorry.

Just don't do that.

thanks,

greg k-h
Michael Kelley (LINUX) July 6, 2023, 2:22 p.m. UTC | #3
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Sent: Thursday, July 6, 2023 1:07 AM
> 
> On Thu, Jul 06, 2023 at 03:50:55AM +0000, Michael Kelley (LINUX) wrote:
> > From: Petr Tesarik <petrtesarik@huaweicloud.com> Sent: Tuesday, June 27, 2023
> 2:54 AM
> > >
> > > Try to allocate a transient memory pool if no suitable slots can be found,
> > > except when allocating from a restricted pool. The transient pool is just
> > > enough big for this one bounce buffer. It is inserted into a per-device
> > > list of transient memory pools, and it is freed again when the bounce
> > > buffer is unmapped.
> > >
> > > Transient memory pools are kept in an RCU list. A memory barrier is
> > > required after adding a new entry, because any address within a transient
> > > buffer must be immediately recognized as belonging to the SWIOTLB, even if
> > > it is passed to another CPU.
> > >
> > > Deletion does not require any synchronization beyond RCU ordering
> > > guarantees. After a buffer is unmapped, its physical addresses may no
> > > longer be passed to the DMA API, so the memory range of the corresponding
> > > stale entry in the RCU list never matches. If the memory range gets
> > > allocated again, then it happens only after a RCU quiescent state.
> > >
> > > Since bounce buffers can now be allocated from different pools, add a
> > > parameter to swiotlb_alloc_pool() to let the caller know which memory pool
> > > is used. Add swiotlb_find_pool() to find the memory pool corresponding to
> > > an address. This function is now also used by is_swiotlb_buffer(), because
> > > a simple boundary check is no longer sufficient.
> > >
> > > The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
> > > simplified and enhanced to use coherent memory pools if needed.
> > >
> > > Note that this is not the most efficient way to provide a bounce buffer,
> > > but when a DMA buffer can't be mapped, something may (and will) actually
> > > break. At that point it is better to make an allocation, even if it may be
> > > an expensive operation.
> >
> > I continue to think about swiotlb memory management from the standpoint
> > of CoCo VMs that may be quite large with high network and storage loads.
> > These VMs are often running mission-critical workloads that can't tolerate
> > a bounce buffer allocation failure.  To prevent such failures, the swiotlb
> > memory size must be overly large, which wastes memory.
> 
> If "mission critical workloads" are in a vm that allowes overcommit and
> no control over other vms in that same system, then you have worse
> problems, sorry.
> 
> Just don't do that.
> 

No, the cases I'm concerned about don't involve memory overcommit.

CoCo VMs must use swiotlb bounce buffers to do DMA I/O.  Current swiotlb
code in the Linux guest allocates a configurable, but fixed, amount of guest
memory at boot time for this purpose.  But it's hard to know how much
swiotlb bounce buffer memory will be needed to handle peak I/O loads.
This patch set does dynamic allocation of swiotlb bounce buffer memory,
which can help avoid needing to configure an overly large fixed size at boot.

Michael
Greg KH July 7, 2023, 9:29 a.m. UTC | #4
On Thu, Jul 06, 2023 at 02:22:50PM +0000, Michael Kelley (LINUX) wrote:
> From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Sent: Thursday, July 6, 2023 1:07 AM
> > 
> > On Thu, Jul 06, 2023 at 03:50:55AM +0000, Michael Kelley (LINUX) wrote:
> > > From: Petr Tesarik <petrtesarik@huaweicloud.com> Sent: Tuesday, June 27, 2023
> > 2:54 AM
> > > >
> > > > Try to allocate a transient memory pool if no suitable slots can be found,
> > > > except when allocating from a restricted pool. The transient pool is just
> > > > enough big for this one bounce buffer. It is inserted into a per-device
> > > > list of transient memory pools, and it is freed again when the bounce
> > > > buffer is unmapped.
> > > >
> > > > Transient memory pools are kept in an RCU list. A memory barrier is
> > > > required after adding a new entry, because any address within a transient
> > > > buffer must be immediately recognized as belonging to the SWIOTLB, even if
> > > > it is passed to another CPU.
> > > >
> > > > Deletion does not require any synchronization beyond RCU ordering
> > > > guarantees. After a buffer is unmapped, its physical addresses may no
> > > > longer be passed to the DMA API, so the memory range of the corresponding
> > > > stale entry in the RCU list never matches. If the memory range gets
> > > > allocated again, then it happens only after a RCU quiescent state.
> > > >
> > > > Since bounce buffers can now be allocated from different pools, add a
> > > > parameter to swiotlb_alloc_pool() to let the caller know which memory pool
> > > > is used. Add swiotlb_find_pool() to find the memory pool corresponding to
> > > > an address. This function is now also used by is_swiotlb_buffer(), because
> > > > a simple boundary check is no longer sufficient.
> > > >
> > > > The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
> > > > simplified and enhanced to use coherent memory pools if needed.
> > > >
> > > > Note that this is not the most efficient way to provide a bounce buffer,
> > > > but when a DMA buffer can't be mapped, something may (and will) actually
> > > > break. At that point it is better to make an allocation, even if it may be
> > > > an expensive operation.
> > >
> > > I continue to think about swiotlb memory management from the standpoint
> > > of CoCo VMs that may be quite large with high network and storage loads.
> > > These VMs are often running mission-critical workloads that can't tolerate
> > > a bounce buffer allocation failure.  To prevent such failures, the swiotlb
> > > memory size must be overly large, which wastes memory.
> > 
> > If "mission critical workloads" are in a vm that allowes overcommit and
> > no control over other vms in that same system, then you have worse
> > problems, sorry.
> > 
> > Just don't do that.
> > 
> 
> No, the cases I'm concerned about don't involve memory overcommit.
> 
> CoCo VMs must use swiotlb bounce buffers to do DMA I/O.  Current swiotlb
> code in the Linux guest allocates a configurable, but fixed, amount of guest
> memory at boot time for this purpose.  But it's hard to know how much
> swiotlb bounce buffer memory will be needed to handle peak I/O loads.
> This patch set does dynamic allocation of swiotlb bounce buffer memory,
> which can help avoid needing to configure an overly large fixed size at boot.

But, as you point out, memory allocation can fail at runtime, so how can
you "guarantee" that this will work properly anymore if you are going to
make it dynamic?

confused,

greg k-h
Petr Tesařík July 7, 2023, 10:22 a.m. UTC | #5
On Fri, 7 Jul 2023 10:29:00 +0100
Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:

> On Thu, Jul 06, 2023 at 02:22:50PM +0000, Michael Kelley (LINUX) wrote:
> > From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Sent: Thursday, July 6, 2023 1:07 AM  
> > > 
> > > On Thu, Jul 06, 2023 at 03:50:55AM +0000, Michael Kelley (LINUX) wrote:  
> > > > From: Petr Tesarik <petrtesarik@huaweicloud.com> Sent: Tuesday, June 27, 2023  
> > > 2:54 AM  
> > > > >
> > > > > Try to allocate a transient memory pool if no suitable slots can be found,
> > > > > except when allocating from a restricted pool. The transient pool is just
> > > > > enough big for this one bounce buffer. It is inserted into a per-device
> > > > > list of transient memory pools, and it is freed again when the bounce
> > > > > buffer is unmapped.
> > > > >
> > > > > Transient memory pools are kept in an RCU list. A memory barrier is
> > > > > required after adding a new entry, because any address within a transient
> > > > > buffer must be immediately recognized as belonging to the SWIOTLB, even if
> > > > > it is passed to another CPU.
> > > > >
> > > > > Deletion does not require any synchronization beyond RCU ordering
> > > > > guarantees. After a buffer is unmapped, its physical addresses may no
> > > > > longer be passed to the DMA API, so the memory range of the corresponding
> > > > > stale entry in the RCU list never matches. If the memory range gets
> > > > > allocated again, then it happens only after a RCU quiescent state.
> > > > >
> > > > > Since bounce buffers can now be allocated from different pools, add a
> > > > > parameter to swiotlb_alloc_pool() to let the caller know which memory pool
> > > > > is used. Add swiotlb_find_pool() to find the memory pool corresponding to
> > > > > an address. This function is now also used by is_swiotlb_buffer(), because
> > > > > a simple boundary check is no longer sufficient.
> > > > >
> > > > > The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
> > > > > simplified and enhanced to use coherent memory pools if needed.
> > > > >
> > > > > Note that this is not the most efficient way to provide a bounce buffer,
> > > > > but when a DMA buffer can't be mapped, something may (and will) actually
> > > > > break. At that point it is better to make an allocation, even if it may be
> > > > > an expensive operation.  
> > > >
> > > > I continue to think about swiotlb memory management from the standpoint
> > > > of CoCo VMs that may be quite large with high network and storage loads.
> > > > These VMs are often running mission-critical workloads that can't tolerate
> > > > a bounce buffer allocation failure.  To prevent such failures, the swiotlb
> > > > memory size must be overly large, which wastes memory.  
> > > 
> > > If "mission critical workloads" are in a vm that allowes overcommit and
> > > no control over other vms in that same system, then you have worse
> > > problems, sorry.
> > > 
> > > Just don't do that.
> > >   
> > 
> > No, the cases I'm concerned about don't involve memory overcommit.
> > 
> > CoCo VMs must use swiotlb bounce buffers to do DMA I/O.  Current swiotlb
> > code in the Linux guest allocates a configurable, but fixed, amount of guest
> > memory at boot time for this purpose.  But it's hard to know how much
> > swiotlb bounce buffer memory will be needed to handle peak I/O loads.
> > This patch set does dynamic allocation of swiotlb bounce buffer memory,
> > which can help avoid needing to configure an overly large fixed size at boot.  
> 
> But, as you point out, memory allocation can fail at runtime, so how can
> you "guarantee" that this will work properly anymore if you are going to
> make it dynamic?

In general, there is no guarantee, of course, because bounce buffers
may be requested from interrupt context. I believe Michael is looking
for the SWIOTLB_MAY_SLEEP flag that was introduced in my v2 series, so
new pools can be allocated with GFP_KERNEL instead of GFP_NOWAIT if
possible, and then there is no need to dip into the coherent pool.

Well, I have deliberately removed all complexities from my v3 series,
but I have more WIP local topic branches in my local repo:

- allow blocking allocations if possible
- allocate a new pool before existing pools are full
- free unused memory pools

I can make a bigger series, or I can send another series as RFC if this
is desired. ATM I don't feel confident enough that my v3 series will be
accepted without major changes, so I haven't invested time into
finalizing the other topic branches.

@Michael: If you know that my plan is to introduce blocking allocations
with a follow-up patch series, is the present approach acceptable?

Petr T
Michael Kelley (LINUX) July 8, 2023, 3:18 p.m. UTC | #6
From: Petr Tesařík <petr@tesarici.cz> Sent: Friday, July 7, 2023 3:22 AM
> 
> On Fri, 7 Jul 2023 10:29:00 +0100
> Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
> 
> > On Thu, Jul 06, 2023 at 02:22:50PM +0000, Michael Kelley (LINUX) wrote:
> > > From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Sent: Thursday, July 6,
> 2023 1:07 AM
> > > >
> > > > On Thu, Jul 06, 2023 at 03:50:55AM +0000, Michael Kelley (LINUX) wrote:
> > > > > From: Petr Tesarik <petrtesarik@huaweicloud.com> Sent: Tuesday, June 27, 2023
> > > > 2:54 AM
> > > > > >
> > > > > > Try to allocate a transient memory pool if no suitable slots can be found,
> > > > > > except when allocating from a restricted pool. The transient pool is just
> > > > > > enough big for this one bounce buffer. It is inserted into a per-device
> > > > > > list of transient memory pools, and it is freed again when the bounce
> > > > > > buffer is unmapped.
> > > > > >
> > > > > > Transient memory pools are kept in an RCU list. A memory barrier is
> > > > > > required after adding a new entry, because any address within a transient
> > > > > > buffer must be immediately recognized as belonging to the SWIOTLB, even if
> > > > > > it is passed to another CPU.
> > > > > >
> > > > > > Deletion does not require any synchronization beyond RCU ordering
> > > > > > guarantees. After a buffer is unmapped, its physical addresses may no
> > > > > > longer be passed to the DMA API, so the memory range of the corresponding
> > > > > > stale entry in the RCU list never matches. If the memory range gets
> > > > > > allocated again, then it happens only after a RCU quiescent state.
> > > > > >
> > > > > > Since bounce buffers can now be allocated from different pools, add a
> > > > > > parameter to swiotlb_alloc_pool() to let the caller know which memory pool
> > > > > > is used. Add swiotlb_find_pool() to find the memory pool corresponding to
> > > > > > an address. This function is now also used by is_swiotlb_buffer(), because
> > > > > > a simple boundary check is no longer sufficient.
> > > > > >
> > > > > > The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
> > > > > > simplified and enhanced to use coherent memory pools if needed.
> > > > > >
> > > > > > Note that this is not the most efficient way to provide a bounce buffer,
> > > > > > but when a DMA buffer can't be mapped, something may (and will) actually
> > > > > > break. At that point it is better to make an allocation, even if it may be
> > > > > > an expensive operation.
> > > > >
> > > > > I continue to think about swiotlb memory management from the standpoint
> > > > > of CoCo VMs that may be quite large with high network and storage loads.
> > > > > These VMs are often running mission-critical workloads that can't tolerate
> > > > > a bounce buffer allocation failure.  To prevent such failures, the swiotlb
> > > > > memory size must be overly large, which wastes memory.
> > > >
> > > > If "mission critical workloads" are in a vm that allowes overcommit and
> > > > no control over other vms in that same system, then you have worse
> > > > problems, sorry.
> > > >
> > > > Just don't do that.
> > > >
> > >
> > > No, the cases I'm concerned about don't involve memory overcommit.
> > >
> > > CoCo VMs must use swiotlb bounce buffers to do DMA I/O.  Current swiotlb
> > > code in the Linux guest allocates a configurable, but fixed, amount of guest
> > > memory at boot time for this purpose.  But it's hard to know how much
> > > swiotlb bounce buffer memory will be needed to handle peak I/O loads.
> > > This patch set does dynamic allocation of swiotlb bounce buffer memory,
> > > which can help avoid needing to configure an overly large fixed size at boot.
> >
> > But, as you point out, memory allocation can fail at runtime, so how can
> > you "guarantee" that this will work properly anymore if you are going to
> > make it dynamic?
> 
> In general, there is no guarantee, of course, because bounce buffers
> may be requested from interrupt context. I believe Michael is looking
> for the SWIOTLB_MAY_SLEEP flag that was introduced in my v2 series, so
> new pools can be allocated with GFP_KERNEL instead of GFP_NOWAIT if
> possible, and then there is no need to dip into the coherent pool.
> 
> Well, I have deliberately removed all complexities from my v3 series,
> but I have more WIP local topic branches in my local repo:
> 
> - allow blocking allocations if possible
> - allocate a new pool before existing pools are full
> - free unused memory pools
> 
> I can make a bigger series, or I can send another series as RFC if this
> is desired. ATM I don't feel confident enough that my v3 series will be
> accepted without major changes, so I haven't invested time into
> finalizing the other topic branches.
> 
> @Michael: If you know that my plan is to introduce blocking allocations
> with a follow-up patch series, is the present approach acceptable?
> 

Yes, I think the present approach is acceptable as a first step.  But
let me elaborate a bit on my thinking.

I was originally wondering if it is possible for swiotlb_map() to detect
whether it is called from a context that allows sleeping, without the use
of SWIOTLB_MAY_SLEEP.   This would get the benefits without having to
explicitly update drivers to add the flag.  But maybe that's too risky.  For
the CoCo VM scenario that I'm most interested in, being a VM implicitly
reduces the set of drivers that are being used, and so it's not that hard
to add the flag in the key drivers that generate most of the bounce
buffer traffic.

Then I was thinking about a slightly different usage for the flag than what
you implemented in v2 of the series.   In the case where swiotlb_map()
can't allocate slots because of the swiotlb pool being full (or mostly full),
kick the background thread (if it is not already awake) to allocate a
dynamic pool and grow the total size of the swiotlb.  Then if
SWIOTLB_MAY_SLEEP is *not* set, allocate a transient pool just as you
have implemented in this v3 of the series.  But if SWIOTLB_MAY_SLEEP
*is* set, swiotlb_map() should sleep until the background thread has
completed the memory allocation and grown the size of the swiotlb.
After the sleep, retry the slot allocation.  Maybe what I'm describing
is what you mean by "allow blocking allocations".  :-)

This approach effectively throttles incoming swiotlb requests when space
is exhausted, and gives the dynamic sizing mechanism a chance to catch
up in an efficient fashion.  Limiting transient pools to requests that can't
sleep will reduce the likelihood of exhausting the coherent memory
pools.  And as you mentioned above, kicking the background thread at the
90% full mark (or some such heuristic) also helps the dynamic sizing
mechanism keep up with demand.

Michael
Petr Tesařík July 10, 2023, 9:36 a.m. UTC | #7
On Sat, 8 Jul 2023 15:18:32 +0000
"Michael Kelley (LINUX)" <mikelley@microsoft.com> wrote:

> From: Petr Tesařík <petr@tesarici.cz> Sent: Friday, July 7, 2023 3:22 AM
> > 
> > On Fri, 7 Jul 2023 10:29:00 +0100
> > Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
> >   
> > > On Thu, Jul 06, 2023 at 02:22:50PM +0000, Michael Kelley (LINUX) wrote:  
> > > > From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Sent: Thursday, July 6,  
> > 2023 1:07 AM  
> > > > >
> > > > > On Thu, Jul 06, 2023 at 03:50:55AM +0000, Michael Kelley (LINUX) wrote:  
> > > > > > From: Petr Tesarik <petrtesarik@huaweicloud.com> Sent: Tuesday, June 27, 2023  
> > > > > 2:54 AM  
> > > > > > >
> > > > > > > Try to allocate a transient memory pool if no suitable slots can be found,
> > > > > > > except when allocating from a restricted pool. The transient pool is just
> > > > > > > enough big for this one bounce buffer. It is inserted into a per-device
> > > > > > > list of transient memory pools, and it is freed again when the bounce
> > > > > > > buffer is unmapped.
> > > > > > >
> > > > > > > Transient memory pools are kept in an RCU list. A memory barrier is
> > > > > > > required after adding a new entry, because any address within a transient
> > > > > > > buffer must be immediately recognized as belonging to the SWIOTLB, even if
> > > > > > > it is passed to another CPU.
> > > > > > >
> > > > > > > Deletion does not require any synchronization beyond RCU ordering
> > > > > > > guarantees. After a buffer is unmapped, its physical addresses may no
> > > > > > > longer be passed to the DMA API, so the memory range of the corresponding
> > > > > > > stale entry in the RCU list never matches. If the memory range gets
> > > > > > > allocated again, then it happens only after a RCU quiescent state.
> > > > > > >
> > > > > > > Since bounce buffers can now be allocated from different pools, add a
> > > > > > > parameter to swiotlb_alloc_pool() to let the caller know which memory pool
> > > > > > > is used. Add swiotlb_find_pool() to find the memory pool corresponding to
> > > > > > > an address. This function is now also used by is_swiotlb_buffer(), because
> > > > > > > a simple boundary check is no longer sufficient.
> > > > > > >
> > > > > > > The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
> > > > > > > simplified and enhanced to use coherent memory pools if needed.
> > > > > > >
> > > > > > > Note that this is not the most efficient way to provide a bounce buffer,
> > > > > > > but when a DMA buffer can't be mapped, something may (and will) actually
> > > > > > > break. At that point it is better to make an allocation, even if it may be
> > > > > > > an expensive operation.  
> > > > > >
> > > > > > I continue to think about swiotlb memory management from the standpoint
> > > > > > of CoCo VMs that may be quite large with high network and storage loads.
> > > > > > These VMs are often running mission-critical workloads that can't tolerate
> > > > > > a bounce buffer allocation failure.  To prevent such failures, the swiotlb
> > > > > > memory size must be overly large, which wastes memory.  
> > > > >
> > > > > If "mission critical workloads" are in a vm that allowes overcommit and
> > > > > no control over other vms in that same system, then you have worse
> > > > > problems, sorry.
> > > > >
> > > > > Just don't do that.
> > > > >  
> > > >
> > > > No, the cases I'm concerned about don't involve memory overcommit.
> > > >
> > > > CoCo VMs must use swiotlb bounce buffers to do DMA I/O.  Current swiotlb
> > > > code in the Linux guest allocates a configurable, but fixed, amount of guest
> > > > memory at boot time for this purpose.  But it's hard to know how much
> > > > swiotlb bounce buffer memory will be needed to handle peak I/O loads.
> > > > This patch set does dynamic allocation of swiotlb bounce buffer memory,
> > > > which can help avoid needing to configure an overly large fixed size at boot.  
> > >
> > > But, as you point out, memory allocation can fail at runtime, so how can
> > > you "guarantee" that this will work properly anymore if you are going to
> > > make it dynamic?  
> > 
> > In general, there is no guarantee, of course, because bounce buffers
> > may be requested from interrupt context. I believe Michael is looking
> > for the SWIOTLB_MAY_SLEEP flag that was introduced in my v2 series, so
> > new pools can be allocated with GFP_KERNEL instead of GFP_NOWAIT if
> > possible, and then there is no need to dip into the coherent pool.
> > 
> > Well, I have deliberately removed all complexities from my v3 series,
> > but I have more WIP local topic branches in my local repo:
> > 
> > - allow blocking allocations if possible
> > - allocate a new pool before existing pools are full
> > - free unused memory pools
> > 
> > I can make a bigger series, or I can send another series as RFC if this
> > is desired. ATM I don't feel confident enough that my v3 series will be
> > accepted without major changes, so I haven't invested time into
> > finalizing the other topic branches.
> > 
> > @Michael: If you know that my plan is to introduce blocking allocations
> > with a follow-up patch series, is the present approach acceptable?
> >   
> 
> Yes, I think the present approach is acceptable as a first step.  But
> let me elaborate a bit on my thinking.
> 
> I was originally wondering if it is possible for swiotlb_map() to detect
> whether it is called from a context that allows sleeping, without the use
> of SWIOTLB_MAY_SLEEP.   This would get the benefits without having to
> explicitly update drivers to add the flag.  But maybe that's too risky.

This is a recurring topic and it has been discussed several times in
the mailing lists. If you ask me, the best answer is this one by Andrew
Morton, albeit a bit dated:

https://lore.kernel.org/lkml/20080320201723.b87b3732.akpm@linux-foundation.org/

> For
> the CoCo VM scenario that I'm most interested in, being a VM implicitly
> reduces the set of drivers that are being used, and so it's not that hard
> to add the flag in the key drivers that generate most of the bounce
> buffer traffic.

Yes, that's my thinking as well.

> Then I was thinking about a slightly different usage for the flag than what
> you implemented in v2 of the series.   In the case where swiotlb_map()
> can't allocate slots because of the swiotlb pool being full (or mostly full),
> kick the background thread (if it is not already awake) to allocate a
> dynamic pool and grow the total size of the swiotlb.  Then if
> SWIOTLB_MAY_SLEEP is *not* set, allocate a transient pool just as you
> have implemented in this v3 of the series.  But if SWIOTLB_MAY_SLEEP
> *is* set, swiotlb_map() should sleep until the background thread has
> completed the memory allocation and grown the size of the swiotlb.
> After the sleep, retry the slot allocation.  Maybe what I'm describing
> is what you mean by "allow blocking allocations".  :-)

Not really, but I like the idea. After all, the only reason to have
transient pools is when something is needed immediately while the
background allocation is running.

> This approach effectively throttles incoming swiotlb requests when space
> is exhausted, and gives the dynamic sizing mechanism a chance to catch
> up in an efficient fashion.  Limiting transient pools to requests that can't
> sleep will reduce the likelihood of exhausting the coherent memory
> pools.  And as you mentioned above, kicking the background thread at the
> 90% full mark (or some such heuristic) also helps the dynamic sizing
> mechanism keep up with demand.

FWIW I did some testing, and my systems were not able to survive a
sudden I/O peak without transient pools, no matter how low I set the
threshold for kicking a background. OTOH I always tested with the
smallest possible SWIOTLB (256 KiB * rounded up number of CPUs, e.g. 16
MiB on my VM with 48 CPUs). Other sizes may lead to different results.

As a matter of fact, the size of the initial SWIOTLB memory pool and the
size(s) of additional pool(s) sound like interesting tunable parameters
that I haven't explored in depth yet.

Petr T
Michael Kelley (LINUX) July 11, 2023, 3:54 p.m. UTC | #8
From: Petr Tesařík <petr@tesarici.cz> Sent: Monday, July 10, 2023 2:36 AM
> 
> On Sat, 8 Jul 2023 15:18:32 +0000
> "Michael Kelley (LINUX)" <mikelley@microsoft.com> wrote:
> 
> > From: Petr Tesařík <petr@tesarici.cz> Sent: Friday, July 7, 2023 3:22 AM
> > >
> > > On Fri, 7 Jul 2023 10:29:00 +0100
> > > Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
> > >
> > > > On Thu, Jul 06, 2023 at 02:22:50PM +0000, Michael Kelley (LINUX) wrote:
> > > > > From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Sent: Thursday, July 6,
> > > 2023 1:07 AM
> > > > > >
> > > > > > On Thu, Jul 06, 2023 at 03:50:55AM +0000, Michael Kelley (LINUX) wrote:
> > > > > > > From: Petr Tesarik <petrtesarik@huaweicloud.com> Sent: Tuesday, June 27, 2023
> > > > > > 2:54 AM
> > > > > > > >
> > > > > > > > Try to allocate a transient memory pool if no suitable slots can be found,
> > > > > > > > except when allocating from a restricted pool. The transient pool is just
> > > > > > > > enough big for this one bounce buffer. It is inserted into a per-device
> > > > > > > > list of transient memory pools, and it is freed again when the bounce
> > > > > > > > buffer is unmapped.
> > > > > > > >
> > > > > > > > Transient memory pools are kept in an RCU list. A memory barrier is
> > > > > > > > required after adding a new entry, because any address within a transient
> > > > > > > > buffer must be immediately recognized as belonging to the SWIOTLB, even if
> > > > > > > > it is passed to another CPU.
> > > > > > > >
> > > > > > > > Deletion does not require any synchronization beyond RCU ordering
> > > > > > > > guarantees. After a buffer is unmapped, its physical addresses may no
> > > > > > > > longer be passed to the DMA API, so the memory range of the corresponding
> > > > > > > > stale entry in the RCU list never matches. If the memory range gets
> > > > > > > > allocated again, then it happens only after a RCU quiescent state.
> > > > > > > >
> > > > > > > > Since bounce buffers can now be allocated from different pools, add a
> > > > > > > > parameter to swiotlb_alloc_pool() to let the caller know which memory pool
> > > > > > > > is used. Add swiotlb_find_pool() to find the memory pool corresponding to
> > > > > > > > an address. This function is now also used by is_swiotlb_buffer(), because
> > > > > > > > a simple boundary check is no longer sufficient.
> > > > > > > >
> > > > > > > > The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
> > > > > > > > simplified and enhanced to use coherent memory pools if needed.
> > > > > > > >
> > > > > > > > Note that this is not the most efficient way to provide a bounce buffer,
> > > > > > > > but when a DMA buffer can't be mapped, something may (and will) actually
> > > > > > > > break. At that point it is better to make an allocation, even if it may be
> > > > > > > > an expensive operation.
> > > > > > >
> > > > > > > I continue to think about swiotlb memory management from the standpoint
> > > > > > > of CoCo VMs that may be quite large with high network and storage loads.
> > > > > > > These VMs are often running mission-critical workloads that can't tolerate
> > > > > > > a bounce buffer allocation failure.  To prevent such failures, the swiotlb
> > > > > > > memory size must be overly large, which wastes memory.
> > > > > >
> > > > > > If "mission critical workloads" are in a vm that allowes overcommit and
> > > > > > no control over other vms in that same system, then you have worse
> > > > > > problems, sorry.
> > > > > >
> > > > > > Just don't do that.
> > > > > >
> > > > >
> > > > > No, the cases I'm concerned about don't involve memory overcommit.
> > > > >
> > > > > CoCo VMs must use swiotlb bounce buffers to do DMA I/O.  Current swiotlb
> > > > > code in the Linux guest allocates a configurable, but fixed, amount of guest
> > > > > memory at boot time for this purpose.  But it's hard to know how much
> > > > > swiotlb bounce buffer memory will be needed to handle peak I/O loads.
> > > > > This patch set does dynamic allocation of swiotlb bounce buffer memory,
> > > > > which can help avoid needing to configure an overly large fixed size at boot.
> > > >
> > > > But, as you point out, memory allocation can fail at runtime, so how can
> > > > you "guarantee" that this will work properly anymore if you are going to
> > > > make it dynamic?
> > >
> > > In general, there is no guarantee, of course, because bounce buffers
> > > may be requested from interrupt context. I believe Michael is looking
> > > for the SWIOTLB_MAY_SLEEP flag that was introduced in my v2 series, so
> > > new pools can be allocated with GFP_KERNEL instead of GFP_NOWAIT if
> > > possible, and then there is no need to dip into the coherent pool.
> > >
> > > Well, I have deliberately removed all complexities from my v3 series,
> > > but I have more WIP local topic branches in my local repo:
> > >
> > > - allow blocking allocations if possible
> > > - allocate a new pool before existing pools are full
> > > - free unused memory pools
> > >
> > > I can make a bigger series, or I can send another series as RFC if this
> > > is desired. ATM I don't feel confident enough that my v3 series will be
> > > accepted without major changes, so I haven't invested time into
> > > finalizing the other topic branches.
> > >
> > > @Michael: If you know that my plan is to introduce blocking allocations
> > > with a follow-up patch series, is the present approach acceptable?
> > >
> >
> > Yes, I think the present approach is acceptable as a first step.  But
> > let me elaborate a bit on my thinking.
> >
> > I was originally wondering if it is possible for swiotlb_map() to detect
> > whether it is called from a context that allows sleeping, without the use
> > of SWIOTLB_MAY_SLEEP.   This would get the benefits without having to
> > explicitly update drivers to add the flag.  But maybe that's too risky.
> 
> This is a recurring topic and it has been discussed several times in
> the mailing lists. If you ask me, the best answer is this one by Andrew
> Morton, albeit a bit dated:
> 
> https://lore.kernel.org/lkml/20080320201723.b87b3732.akpm@linux-foundation.org/

Thanks.  That's useful context.

> 
> > For
> > the CoCo VM scenario that I'm most interested in, being a VM implicitly
> > reduces the set of drivers that are being used, and so it's not that hard
> > to add the flag in the key drivers that generate most of the bounce
> > buffer traffic.
> 
> Yes, that's my thinking as well.
> 
> > Then I was thinking about a slightly different usage for the flag than what
> > you implemented in v2 of the series.   In the case where swiotlb_map()
> > can't allocate slots because of the swiotlb pool being full (or mostly full),
> > kick the background thread (if it is not already awake) to allocate a
> > dynamic pool and grow the total size of the swiotlb.  Then if
> > SWIOTLB_MAY_SLEEP is *not* set, allocate a transient pool just as you
> > have implemented in this v3 of the series.  But if SWIOTLB_MAY_SLEEP
> > *is* set, swiotlb_map() should sleep until the background thread has
> > completed the memory allocation and grown the size of the swiotlb.
> > After the sleep, retry the slot allocation.  Maybe what I'm describing
> > is what you mean by "allow blocking allocations".  :-)
> 
> Not really, but I like the idea. After all, the only reason to have
> transient pools is when something is needed immediately while the
> background allocation is running.

You can also take the thinking one step further:  For bounce buffer
requests that allow blocking, you could decide not to grow the pool
above a specified maximum.  If the max has been reached and space
is not available, sleep until space is released by some other in-progress
request.  This could be a valid way to handle peak demand while
capping the memory allocated to the bounce buffer pool.  There
would be a latency hit because of the waiting, but that could
be a valid tradeoff for rare peaks.  Of course, for requests that can't
block, you'd still need to allocate a transient pool.

Michael

> 
> > This approach effectively throttles incoming swiotlb requests when space
> > is exhausted, and gives the dynamic sizing mechanism a chance to catch
> > up in an efficient fashion.  Limiting transient pools to requests that can't
> > sleep will reduce the likelihood of exhausting the coherent memory
> > pools.  And as you mentioned above, kicking the background thread at the
> > 90% full mark (or some such heuristic) also helps the dynamic sizing
> > mechanism keep up with demand.
> 
> FWIW I did some testing, and my systems were not able to survive a
> sudden I/O peak without transient pools, no matter how low I set the
> threshold for kicking a background. OTOH I always tested with the
> smallest possible SWIOTLB (256 KiB * rounded up number of CPUs, e.g. 16
> MiB on my VM with 48 CPUs). Other sizes may lead to different results.
> 
> As a matter of fact, the size of the initial SWIOTLB memory pool and the
> size(s) of additional pool(s) sound like interesting tunable parameters
> that I haven't explored in depth yet.
> 
> Petr T
diff mbox series

Patch

diff --git a/include/linux/device.h b/include/linux/device.h
index 83081aa99e6a..a1ee4c5924b8 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -510,6 +510,8 @@  struct device_physical_location {
  * @dma_mem:	Internal for coherent mem override.
  * @cma_area:	Contiguous memory area for dma allocations
  * @dma_io_tlb_mem: Software IO TLB allocator.  Not for driver use.
+ * @dma_io_tlb_pools:	List of transient swiotlb memory pools.
+ * @dma_io_tlb_lock:	Protects changes to the list of active pools.
  * @archdata:	For arch-specific additions.
  * @of_node:	Associated device tree node.
  * @fwnode:	Associated device node supplied by platform firmware.
@@ -615,6 +617,8 @@  struct device {
 #endif
 #ifdef CONFIG_SWIOTLB
 	struct io_tlb_mem *dma_io_tlb_mem;
+	struct list_head dma_io_tlb_pools;
+	spinlock_t dma_io_tlb_lock;
 #endif
 	/* arch specific additions */
 	struct dev_archdata	archdata;
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 0ee20b764000..c36c5a546787 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -417,6 +417,8 @@  static inline void dma_sync_sgtable_for_device(struct device *dev,
 #define dma_get_sgtable(d, t, v, h, s) dma_get_sgtable_attrs(d, t, v, h, s, 0)
 #define dma_mmap_coherent(d, v, c, h, s) dma_mmap_attrs(d, v, c, h, s, 0)
 
+bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size);
+
 static inline void *dma_alloc_coherent(struct device *dev, size_t size,
 		dma_addr_t *dma_handle, gfp_t gfp)
 {
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 0aa6868cb68c..ae1688438850 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -63,6 +63,7 @@  dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
 
 /**
  * struct io_tlb_pool - IO TLB memory pool descriptor
+ * @node:	Member of the IO TLB memory pool list.
  * @start:	The start address of the swiotlb memory pool. Used to do a quick
  *		range check to see if the memory was in fact allocated by this
  *		API.
@@ -77,22 +78,27 @@  dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
  *		see setup_io_tlb_npages().
  * @used:	The number of used IO TLB slots.
  * @late_alloc:	%true if allocated using the page allocator.
+ * @transient:  %true if transient memory pool.
  * @nareas:	Number of areas in the pool.
  * @area_nslabs: Number of slots in each area.
  * @areas:	Array of memory area descriptors.
  * @slots:	Array of slot descriptors.
+ * @rcu:	RCU head for swiotlb_dyn_free().
  */
 struct io_tlb_pool {
+	struct list_head node;
 	phys_addr_t start;
 	phys_addr_t end;
 	void *vaddr;
 	unsigned long nslabs;
 	unsigned long used;
 	bool late_alloc;
+	bool transient;
 	unsigned int nareas;
 	unsigned int area_nslabs;
 	struct io_tlb_area *areas;
 	struct io_tlb_slot *slots;
+	struct rcu_head rcu;
 };
 
 /**
@@ -120,6 +126,8 @@  struct io_tlb_mem {
 #endif
 };
 
+struct io_tlb_pool *swiotlb_find_pool(struct device *dev, phys_addr_t paddr);
+
 /**
  * is_swiotlb_buffer() - check if a physical address belongs to a swiotlb
  * @dev:        Device which has mapped the buffer.
@@ -133,9 +141,8 @@  struct io_tlb_mem {
  */
 static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
 {
-	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
-
-	return mem && paddr >= mem->pool->start && paddr < mem->pool->end;
+	return dev->dma_io_tlb_mem &&
+		!!swiotlb_find_pool(dev, paddr);
 }
 
 static inline bool is_swiotlb_force_bounce(struct device *dev)
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 5595d1d5cdcc..820561cab38d 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -66,7 +66,7 @@  static gfp_t dma_direct_optimal_gfp_mask(struct device *dev, u64 *phys_limit)
 	return 0;
 }
 
-static bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size)
+bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size)
 {
 	dma_addr_t dma_addr = phys_to_dma_direct(dev, phys);
 
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 4c5de91bda57..06b4fa7c2e9b 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -35,6 +35,7 @@ 
 #include <linux/memblock.h>
 #include <linux/mm.h>
 #include <linux/pfn.h>
+#include <linux/rculist.h>
 #include <linux/scatterlist.h>
 #include <linux/set_memory.h>
 #include <linux/spinlock.h>
@@ -500,6 +501,157 @@  void __init swiotlb_exit(void)
 	memset(mem, 0, sizeof(*mem));
 }
 
+/**
+ * alloc_dma_pages() - allocate pages to be used for DMA
+ * @gfp:	GFP flags for the allocation.
+ * @bytes:	Size of the buffer.
+ *
+ * Allocate pages from the buddy allocator. If successful, make the allocated
+ * pages decrypted that they can be used for DMA.
+ *
+ * Return: Decrypted pages, or %NULL on failure.
+ */
+static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes)
+{
+	unsigned int order = get_order(bytes);
+	struct page *page;
+	void *vaddr;
+
+	page = alloc_pages(gfp, order);
+	if (!page)
+		return NULL;
+
+	vaddr = page_address(page);
+	if (set_memory_decrypted((unsigned long)vaddr, PFN_UP(bytes))) {
+		__free_pages(page, order);
+		return NULL;
+	}
+
+	return page;
+}
+
+/**
+ * swiotlb_alloc_tlb() - allocate a dynamic IO TLB buffer
+ * @dev:	Device for which a memory pool is allocated.
+ * @bytes:	Size of the buffer.
+ * @phys_limit:	Maximum allowed physical address of the buffer.
+ * @gfp:	GFP flags for the allocation.
+ *
+ * Return: Allocated pages, or %NULL on allocation failure.
+ */
+static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
+		u64 phys_limit, gfp_t gfp)
+{
+	struct page *page;
+
+	/*
+	 * Allocate from the atomic pools if memory is encrypted and
+	 * the allocation is atomic, because decrypting may block.
+	 */
+	if (dev && force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp)) {
+		void *vaddr;
+
+		return IS_ENABLED(CONFIG_DMA_COHERENT_POOL)
+			? dma_alloc_from_pool(dev, bytes, &vaddr, gfp,
+					      dma_coherent_ok)
+			: NULL;
+	}
+
+	gfp &= ~GFP_ZONEMASK;
+	if (phys_limit <= DMA_BIT_MASK(zone_dma_bits))
+		gfp |= __GFP_DMA;
+	else if (phys_limit <= DMA_BIT_MASK(32))
+		gfp |= __GFP_DMA32;
+
+	while ((page = alloc_dma_pages(gfp, bytes)) &&
+	       page_to_phys(page) + bytes - 1 > phys_limit) {
+		/* allocated, but too high */
+		__free_pages(page, get_order(bytes));
+
+		if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
+		    phys_limit < DMA_BIT_MASK(64) &&
+		    !(gfp & (__GFP_DMA32 | __GFP_DMA)))
+			gfp |= __GFP_DMA32;
+		else if (IS_ENABLED(CONFIG_ZONE_DMA) &&
+			 !(gfp & __GFP_DMA))
+			gfp = (gfp & ~__GFP_DMA32) | __GFP_DMA;
+		else
+			return NULL;
+	}
+
+	return page;
+}
+
+/**
+ * swiotlb_free_tlb() - free a dynamically allocated IO TLB buffer
+ * @vaddr:	Virtual address of the buffer.
+ * @bytes:	Size of the buffer.
+ */
+static void swiotlb_free_tlb(void *vaddr, size_t bytes)
+{
+	if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
+	    dma_free_from_pool(NULL, vaddr, bytes))
+		return;
+
+	/* Intentional leak if pages cannot be encrypted again. */
+	if (!set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
+		__free_pages(virt_to_page(vaddr), get_order(bytes));
+}
+
+/**
+ * swiotlb_alloc_pool() - allocate a new IO TLB memory pool
+ * @dev:	Device for which a memory pool is allocated.
+ * @nslabs:	Desired number of slabs.
+ * @phys_limit:	Maximum DMA buffer physical address.
+ * @gfp:	GFP flags for the allocations.
+ *
+ * Allocate and initialize a new IO TLB memory pool.
+ *
+ * Return: New memory pool, or %NULL on allocation failure.
+ */
+static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
+		unsigned int nslabs, u64 phys_limit, gfp_t gfp)
+{
+	struct io_tlb_pool *pool;
+	struct page *tlb;
+	size_t pool_size;
+	size_t tlb_size;
+
+	pool_size = sizeof(*pool) + array_size(sizeof(*pool->areas), 1) +
+		array_size(sizeof(*pool->slots), nslabs);
+	pool = kzalloc(pool_size, gfp);
+	if (!pool)
+		goto error;
+	pool->areas = (void *)pool + sizeof(*pool);
+	pool->slots = (void *)pool->areas + sizeof(*pool->areas);
+
+	tlb_size = nslabs << IO_TLB_SHIFT;
+	tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, gfp);
+	if (!tlb)
+		goto error_tlb;
+
+	swiotlb_init_io_tlb_pool(pool, page_to_phys(tlb), nslabs, true, 1);
+	return pool;
+
+error_tlb:
+	kfree(pool);
+error:
+	return NULL;
+}
+
+/**
+ * swiotlb_dyn_free() - RCU callback to free a memory pool
+ * @rcu:	RCU head in the corresponding struct io_tlb_pool.
+ */
+static void swiotlb_dyn_free(struct rcu_head *rcu)
+{
+	struct io_tlb_pool *pool = container_of(rcu, struct io_tlb_pool, rcu);
+	size_t tlb_size = pool->end - pool->start;
+
+	swiotlb_free_tlb(pool->vaddr, tlb_size);
+	kfree(pool);
+}
+
 /**
  * swiotlb_dev_init() - initialize swiotlb fields in &struct device
  * @dev:	Device to be initialized.
@@ -507,6 +659,56 @@  void __init swiotlb_exit(void)
 void swiotlb_dev_init(struct device *dev)
 {
 	dev->dma_io_tlb_mem = &io_tlb_default_mem;
+	INIT_LIST_HEAD(&dev->dma_io_tlb_pools);
+	spin_lock_init(&dev->dma_io_tlb_lock);
+}
+
+/**
+ * swiotlb_find_pool() - find the IO TLB pool for a physical address
+ * @dev:        Device which has mapped the DMA buffer.
+ * @paddr:      Physical address within the DMA buffer.
+ *
+ * Find the IO TLB memory pool descriptor which contains the given physical
+ * address, if any.
+ *
+ * Return: Memory pool which contains @paddr, or %NULL if none.
+ */
+struct io_tlb_pool *swiotlb_find_pool(struct device *dev, phys_addr_t paddr)
+{
+	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+	struct io_tlb_pool *pool = mem->pool;
+
+	if (paddr >= pool->start && paddr < pool->end)
+		return pool;
+
+	/* Pairs with smp_wmb() in swiotlb_find_slots(). */
+	smp_rmb();
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(pool, &dev->dma_io_tlb_pools, node) {
+		if (paddr >= pool->start && paddr < pool->end)
+			goto out;
+	}
+	pool = NULL;
+out:
+	rcu_read_unlock();
+	return pool;
+}
+
+/**
+ * swiotlb_del_pool() - remove an IO TLB pool from a device
+ * @dev:	Owning device.
+ * @pool:	Memory pool to be removed.
+ */
+static void swiotlb_del_pool(struct device *dev, struct io_tlb_pool *pool)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&dev->dma_io_tlb_lock, flags);
+	list_del_rcu(&pool->node);
+	spin_unlock_irqrestore(&dev->dma_io_tlb_lock, flags);
+
+	call_rcu(&pool->rcu, swiotlb_dyn_free);
 }
 
 /*
@@ -523,7 +725,7 @@  static unsigned int swiotlb_align_offset(struct device *dev, u64 addr)
 static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size,
 			   enum dma_data_direction dir)
 {
-	struct io_tlb_pool *mem = dev->dma_io_tlb_mem->pool;
+	struct io_tlb_pool *mem = swiotlb_find_pool(dev, tlb_addr);
 	int index = (tlb_addr - mem->start) >> IO_TLB_SHIFT;
 	phys_addr_t orig_addr = mem->slots[index].orig_addr;
 	size_t alloc_size = mem->slots[index].alloc_size;
@@ -796,6 +998,7 @@  static int pool_find_slots(struct device *dev, struct io_tlb_pool *pool,
  * @alloc_size: Total requested size of the bounce buffer,
  *		including initial alignment padding.
  * @alloc_align_mask:	Required alignment of the allocated buffer.
+ * @retpool:	Used memory pool, updated on return.
  *
  * Search through the whole software IO TLB to find a sequence of slots that
  * match the allocation constraints.
@@ -803,10 +1006,46 @@  static int pool_find_slots(struct device *dev, struct io_tlb_pool *pool,
  * Return: Index of the first allocated slot, or -1 on error.
  */
 static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
-		size_t alloc_size, unsigned int alloc_align_mask)
+		size_t alloc_size, unsigned int alloc_align_mask,
+		struct io_tlb_pool **retpool)
 {
-	return pool_find_slots(dev, dev->dma_io_tlb_mem->pool, orig_addr,
-			       alloc_size, alloc_align_mask);
+	struct io_tlb_pool *pool;
+	unsigned long flags;
+	u64 phys_limit;
+	int index;
+
+	pool = dev->dma_io_tlb_mem->pool;
+	index = pool_find_slots(dev, pool, orig_addr,
+				alloc_size, alloc_align_mask);
+	if (index >= 0)
+		goto found;
+
+	if (dev->dma_io_tlb_mem->for_alloc)
+		return -1;
+
+	phys_limit = min_not_zero(dev->coherent_dma_mask, dev->bus_dma_limit);
+	pool = swiotlb_alloc_pool(dev, nr_slots(alloc_size), phys_limit,
+				  GFP_NOWAIT | __GFP_NOWARN);
+	if (!pool)
+		return -1;
+
+	index = pool_find_slots(dev, pool, orig_addr,
+				alloc_size, alloc_align_mask);
+	if (index < 0) {
+		swiotlb_dyn_free(&pool->rcu);
+		return -1;
+	}
+
+	pool->transient = true;
+	spin_lock_irqsave(&dev->dma_io_tlb_lock, flags);
+	list_add_rcu(&pool->node, &dev->dma_io_tlb_pools);
+	spin_unlock_irqrestore(&dev->dma_io_tlb_lock, flags);
+
+	/* Pairs with smp_rmb() in swiotlb_find_pool(). */
+	smp_wmb();
+found:
+	*retpool = pool;
+	return index;
 }
 
 /**
@@ -869,7 +1108,7 @@  phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 	}
 
 	index = swiotlb_find_slots(dev, orig_addr,
-				   alloc_size + offset, alloc_align_mask);
+				   alloc_size + offset, alloc_align_mask, &pool);
 	if (index == -1) {
 		if (!(attrs & DMA_ATTR_NO_WARN))
 			dev_warn_ratelimited(dev,
@@ -883,7 +1122,6 @@  phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 	 * This is needed when we sync the memory.  Then we sync the buffer if
 	 * needed.
 	 */
-	pool = mem->pool;
 	for (i = 0; i < nr_slots(alloc_size + offset); i++)
 		pool->slots[index + i].orig_addr = slot_addr(orig_addr, i);
 	tlb_addr = slot_addr(pool->start, index) + offset;
@@ -900,7 +1138,7 @@  phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 
 static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
 {
-	struct io_tlb_pool *mem = dev->dma_io_tlb_mem->pool;
+	struct io_tlb_pool *mem = swiotlb_find_pool(dev, tlb_addr);
 	unsigned long flags;
 	unsigned int offset = swiotlb_align_offset(dev, tlb_addr);
 	int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
@@ -954,6 +1192,8 @@  void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
 			      size_t mapping_size, enum dma_data_direction dir,
 			      unsigned long attrs)
 {
+	struct io_tlb_pool *pool;
+
 	/*
 	 * First, sync the memory before unmapping the entry
 	 */
@@ -961,7 +1201,13 @@  void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
 	    (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
 		swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_FROM_DEVICE);
 
-	swiotlb_release_slots(dev, tlb_addr);
+	pool = swiotlb_find_pool(dev, tlb_addr);
+	if (pool->transient) {
+		dec_used(dev->dma_io_tlb_mem, pool->nslabs);
+		swiotlb_del_pool(dev, pool);
+	} else {
+		swiotlb_release_slots(dev, tlb_addr);
+	}
 }
 
 void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
@@ -1145,11 +1391,10 @@  struct page *swiotlb_alloc(struct device *dev, size_t size)
 	if (!mem)
 		return NULL;
 
-	index = swiotlb_find_slots(dev, 0, size, 0);
+	index = swiotlb_find_slots(dev, 0, size, 0, &pool);
 	if (index == -1)
 		return NULL;
 
-	pool = mem->pool;
 	tlb_addr = slot_addr(pool->start, index);
 
 	return pfn_to_page(PFN_DOWN(tlb_addr));