[RFC,0/7] Introduce swiotlb throttling

Message ID	20240822183718.1234-1-mhklinux@outlook.com (mailing list archive)
Headers	show Received: from mail-pl1-f172.google.com (mail-pl1-f172.google.com [209.85.214.172]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 238DE41C63; Thu, 22 Aug 2024 18:37:45 +0000 (UTC) From: mhkelley58@gmail.com To: kbusch@kernel.org, axboe@kernel.dk, sagi@grimberg.me, James.Bottomley@HansenPartnership.com, martin.petersen@oracle.com, kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, robin.murphy@arm.com, hch@lst.de, m.szyprowski@samsung.com, petr@tesarici.cz, iommu@lists.linux.dev, linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org, linux-scsi@vger.kernel.org, linux-hyperv@vger.kernel.org, linux-coco@lists.linux.dev Subject: [RFC 0/7] Introduce swiotlb throttling Date: Thu, 22 Aug 2024 11:37:11 -0700 Message-Id: <20240822183718.1234-1-mhklinux@outlook.com> Reply-To: mhklinux@outlook.com Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	Introduce swiotlb throttling \| expand [RFC,0/7] Introduce swiotlb throttling [RFC,1/7] swiotlb: Introduce swiotlb throttling [RFC,2/7] dma: Handle swiotlb throttling for SGLs [RFC,3/7] dma: Add function for drivers to know if allowing blocking is useful [RFC,4/7] scsi_lib_dma: Add _attrs variant of scsi_dma_map() [RFC,5/7] scsi: storvsc: Enable swiotlb throttling [RFC,6/7] nvme: Move BLK_MQ_F_BLOCKING indicator to struct nvme_ctrl [RFC,7/7] nvme: Enable swiotlb throttling for NVMe PCI devices

Michael Kelley Aug. 22, 2024, 6:37 p.m. UTC

From: Michael Kelley <mhklinux@outlook.com>

Background
==========
Linux device drivers may make DMA map/unmap calls in contexts that
cannot block, such as in an interrupt handler. Consequently, when a
DMA map call must use a bounce buffer, the allocation of swiotlb
memory must always succeed immediately. If swiotlb memory is
exhausted, the DMA map call cannot wait for memory to be released. The
call fails, which usually results in an I/O error.

Bounce buffers are usually used infrequently for a few corner cases,
so the default swiotlb memory allocation of 64 MiB is more than
sufficient to avoid running out and causing errors. However, recently
introduced Confidential Computing (CoCo) VMs must use bounce buffers
for all DMA I/O because the VM's memory is encrypted. In CoCo VMs
a new heuristic allocates ~6% of the VM's memory, up to 1 GiB, for
swiotlb memory. This large allocation reduces the likelihood of a
spike in usage causing DMA map failures. Unfortunately for most
workloads, this insurance against spikes comes at the cost of
potentially "wasting" hundreds of MiB's of the VM's memory, as swiotlb
memory can't be used for other purposes.

Approach
========
The goal is to significantly reduce the amount of memory reserved as
swiotlb memory in CoCo VMs, while not unduly increasing the risk of
DMA map failures due to memory exhaustion.

To reach this goal, this patch set introduces the concept of swiotlb
throttling, which can delay swiotlb allocation requests when swiotlb
memory usage is high. This approach depends on the fact that some
DMA map requests are made from contexts where it's OK to block.
Throttling such requests is acceptable to spread out a spike in usage.

Because it's not possible to detect at runtime whether a DMA map call
is made in a context that can block, the calls in key device drivers
must be updated with a MAY_BLOCK attribute, if appropriate. When this
attribute is set and swiotlb memory usage is above a threshold, the
swiotlb allocation code can serialize swiotlb memory usage to help
ensure that it is not exhausted.

In general, storage device drivers can take advantage of the MAY_BLOCK
option, while network device drivers cannot. The Linux block layer
already allows storage requests to block when the BLK_MQ_F_BLOCKING
flag is present on the request queue. In a CoCo VM environment,
relatively few device types are used for storage devices, and updating
these drivers is feasible. This patch set updates the NVMe driver and
the Hyper-V storvsc synthetic storage driver. A few other drivers
might also need to be updated to handle the key CoCo VM storage
devices.

Because network drivers generally cannot use swiotlb throttling, it is
still possible for swiotlb memory to become exhausted. But blunting
the maximum swiotlb memory used by storage devices can significantly
reduce the peak usage, and a smaller amount of swiotlb memory can be
allocated in a CoCo VM. Also, usage by storage drivers is likely to
overall be larger than for network drivers, especially when large
numbers of disk devices are in use, each with many I/O requests in-
flight.

swiotlb throttling does not affect the context requirements of DMA
unmap calls. These always complete without blocking, even if the
corresponding DMA map call was throttled.

Patches
=======
Patches 1 and 2 implement the core of swiotlb throttling. They define
DMA attribute flag DMA_ATTR_MAY_BLOCK that device drivers use to
indicate that a DMA map call is allowed to block, and therefore can be
throttled. They update swiotlb_tbl_map_single() to detect this flag and
implement the throttling. Similarly, swiotlb_tbl_unmap_single() is
updated to handle a previously throttled request that has now freed
its swiotlb memory.

Patch 3 adds the dma_recommend_may_block() call that device drivers
can use to know if there's benefit in using the MAY_BLOCK option on
DMA map calls. If not in a CoCo VM, this call returns "false" because
swiotlb is not being used for all DMA I/O. This allows the driver to
set the BLK_MQ_F_BLOCKING flag on blk-mq request queues only when
there is benefit.

Patch 4 updates the SCSI-specific DMA map calls to add a "_attrs"
variant to allow passing the MAY_BLOCK attribute.

Patch 5 adds the MAY_BLOCK option to the Hyper-V storvsc driver, which
is used for storage in CoCo VMs in the Azure public cloud.

Patches 6 and 7 add the MAY_BLOCK option to the NVMe PCI host driver.

Discussion
==========
* Since swiotlb isn't visible to device drivers, I've specifically
named the DMA attribute as MAY_BLOCK instead of MAY_THROTTLE or
something swiotlb specific. While this patch set consumes MAY_BLOCK
only on the DMA direct path to do throttling in the swiotlb code,
there might be other uses in the future outside of CoCo VMs, or
perhaps on the IOMMU path.

* The swiotlb throttling code in this patch set throttles by
serializing the use of swiotlb memory when usage is above a designated
threshold: i.e., only one new swiotlb request is allowed to proceed at
a time. When the corresponding unmap is done to release its swiotlb
memory, the next request is allowed to proceed. This serialization is
global and without knowledge of swiotlb areas. From a storage I/O
performance standpoint, the serialization is a bit restrictive, but
the code isn't trying to optimize for being above the threshold. If a
workload regularly runs above the threshold, the size of the swiotlb
memory should be increased.

* Except for knowing how much swiotlb memory is currently allocated,
throttle accounting is done without locking or atomic operations. For
example, multiple requests could proceed in parallel when usage is
just under the threshold, putting usage above the threshold by the
aggregate size of the parallel requests. The threshold must already be
set relatively conservatively because of drivers that can't enable
throttling, so this slop in the accounting shouldn't be a problem.
It's better than the potential bottleneck of a globally synchronized
reservation mechanism.

* In a CoCo VM, mapping a scatter/gather list makes an independent
swiotlb request for each entry. Throttling each independent request
wouldn't really work, so the code throttles only the first SGL entry.
Once that entry passes any throttle, subsequent entries in the SGL
proceed without throttling. When the SGL is unmapped, entries 1 thru
N-1 are unmapped first, then entry 0 is unmapped, allowing the next
serialized request to proceed.

Open Topics
===========
1. swiotlb allocations from Xen and the IOMMU code don't make use of
throttling. This could be added if beneficial.

2. The throttling values are currently exposed and adjustable in
/sys/kernel/debug/swiotlb. Should any of this be moved so it is
visible even without CONFIG_DEBUG_FS?

3. I have not changed the current heuristic for the swiotlb memory
size in CoCo VMs. It's not clear to me how to link this to whether the
key storage drivers have been updated to allow throttling. For now,
the benefit of reduced swiotlb memory size must be realized using the
swiotlb= kernel boot line option.

4. I need to update the swiotlb documentation to describe throttling.

This patch set is built against linux-next-20240816.

Michael Kelley (7):
  swiotlb: Introduce swiotlb throttling
  dma: Handle swiotlb throttling for SGLs
  dma: Add function for drivers to know if allowing blocking is useful
  scsi_lib_dma: Add _attrs variant of scsi_dma_map()
  scsi: storvsc: Enable swiotlb throttling
  nvme: Move BLK_MQ_F_BLOCKING indicator to struct nvme_ctrl
  nvme: Enable swiotlb throttling for NVMe PCI devices

 drivers/nvme/host/core.c    |   4 +-
 drivers/nvme/host/nvme.h    |   2 +-
 drivers/nvme/host/pci.c     |  18 ++++--
 drivers/nvme/host/tcp.c     |   3 +-
 drivers/scsi/scsi_lib_dma.c |  13 ++--
 drivers/scsi/storvsc_drv.c  |   9 ++-
 include/linux/dma-mapping.h |  13 ++++
 include/linux/swiotlb.h     |  15 ++++-
 include/scsi/scsi_cmnd.h    |   7 ++-
 kernel/dma/Kconfig          |  13 ++++
 kernel/dma/direct.c         |  41 +++++++++++--
 kernel/dma/direct.h         |   1 +
 kernel/dma/mapping.c        |  10 ++++
 kernel/dma/swiotlb.c        | 114 ++++++++++++++++++++++++++++++++----
 14 files changed, 227 insertions(+), 36 deletions(-)

Bart Van Assche Aug. 22, 2024, 7:29 p.m. UTC | #1

On 8/22/24 11:37 AM, mhkelley58@gmail.com wrote:
> Linux device drivers may make DMA map/unmap calls in contexts that
> cannot block, such as in an interrupt handler.

Although I really appreciate your work, what alternatives have been
considered? How many drivers perform DMA mapping from atomic context?
Would it be feasible to modify these drivers such that DMA mapping
always happens in a context in which sleeping is allowed?

Thanks,

Bart.

Michael Kelley Aug. 23, 2024, 2:20 a.m. UTC | #2

From: Bart Van Assche <bvanassche@acm.org> Sent: Thursday, August 22, 2024 12:29 PM
> 
> On 8/22/24 11:37 AM, mhkelley58@gmail.com wrote:
> > Linux device drivers may make DMA map/unmap calls in contexts that
> > cannot block, such as in an interrupt handler.
> 
> Although I really appreciate your work, what alternatives have been
> considered? How many drivers perform DMA mapping from atomic context?
> Would it be feasible to modify these drivers such that DMA mapping
> always happens in a context in which sleeping is allowed?
> 

I had assumed that allowing DMA mapping from interrupt context is a
long-time fundamental requirement that can't be changed.  It's been
allowed at least for the past 20 years, as Linus added this statement to
kernel documentation in 2005:

   The streaming DMA mapping routines can be called from interrupt context.

But I don't have any idea how many drivers actually do that. There are
roughly 1700 call sites in kernel code/drivers that call one of the
dma_map_*() variants, so looking through them all doesn't seem
feasible. From the limited samples I looked at, block device drivers
typically do not call dma_map_*() from interrupt context, though they
do call dma_unmap_*(). Network drivers _do_ call dma_map_*()
from interrupt context, and that seems likely to be an artifact of the
generic networking framework that the drivers fit into. I haven't looked
at any other device types. 

Christoph Hellwig, or anyone else who knows the history and current
reality better than I do, please jump in. :-)

Michael

Petr Tesařík Aug. 23, 2024, 5:46 a.m. UTC | #3

On Fri, 23 Aug 2024 02:20:41 +0000
Michael Kelley <mhklinux@outlook.com> wrote:

> From: Bart Van Assche <bvanassche@acm.org> Sent: Thursday, August 22, 2024 12:29 PM
> > 
> > On 8/22/24 11:37 AM, mhkelley58@gmail.com wrote:  
> > > Linux device drivers may make DMA map/unmap calls in contexts that
> > > cannot block, such as in an interrupt handler.  
> > 
> > Although I really appreciate your work, what alternatives have been
> > considered? How many drivers perform DMA mapping from atomic context?
> > Would it be feasible to modify these drivers such that DMA mapping
> > always happens in a context in which sleeping is allowed?
> >   
> 
> I had assumed that allowing DMA mapping from interrupt context is a
> long-time fundamental requirement that can't be changed.  It's been
> allowed at least for the past 20 years, as Linus added this statement to
> kernel documentation in 2005:
> 
>    The streaming DMA mapping routines can be called from interrupt context.
> 
> But I don't have any idea how many drivers actually do that. There are
> roughly 1700 call sites in kernel code/drivers that call one of the
> dma_map_*() variants, so looking through them all doesn't seem
> feasible.

Besides, calls from interrupt context are not the only calls which are
not allowed to schedule (e.g. lock nesting comes to mind). Even if we
agreed to make DMA mapping routines blocking, I believe the easiest way
would be to start adding DMA_ATTR_MAY_BLOCK until it would be used by
all drivers. ;-)

But most importantly, if streaming DMA could block, there would be no
need for a SWIOTLB, because you could simply allocate a bounce buffer
from the buddy allocator when it's needed.

Petr T

Petr Tesařík Aug. 23, 2024, 6:44 a.m. UTC | #4

Hi all,

upfront, I've had more time to consider this idea, because Michael
kindly shared it with me back in February.

On Thu, 22 Aug 2024 11:37:11 -0700
mhkelley58@gmail.com wrote:

> From: Michael Kelley <mhklinux@outlook.com>
> 
> Background
> ==========
> Linux device drivers may make DMA map/unmap calls in contexts that
> cannot block, such as in an interrupt handler. Consequently, when a
> DMA map call must use a bounce buffer, the allocation of swiotlb
> memory must always succeed immediately. If swiotlb memory is
> exhausted, the DMA map call cannot wait for memory to be released. The
> call fails, which usually results in an I/O error.

FTR most I/O errors are recoverable, but the recovery usually takes
a lot of time. Plus the errors are logged and usually treated as
important by monitoring software. In short, I agree it's a poor choice.

> Bounce buffers are usually used infrequently for a few corner cases,
> so the default swiotlb memory allocation of 64 MiB is more than
> sufficient to avoid running out and causing errors. However, recently
> introduced Confidential Computing (CoCo) VMs must use bounce buffers
> for all DMA I/O because the VM's memory is encrypted. In CoCo VMs
> a new heuristic allocates ~6% of the VM's memory, up to 1 GiB, for
> swiotlb memory. This large allocation reduces the likelihood of a
> spike in usage causing DMA map failures. Unfortunately for most
> workloads, this insurance against spikes comes at the cost of
> potentially "wasting" hundreds of MiB's of the VM's memory, as swiotlb
> memory can't be used for other purposes.

It may be worth mentioning that page encryption state can be changed by
a hypercall, but that's a costly (and non-atomic) operation. It's much
faster to copy the data to a page which is already unencrypted (a
bounce buffer).

> Approach
> ========
> The goal is to significantly reduce the amount of memory reserved as
> swiotlb memory in CoCo VMs, while not unduly increasing the risk of
> DMA map failures due to memory exhaustion.
> 
> To reach this goal, this patch set introduces the concept of swiotlb
> throttling, which can delay swiotlb allocation requests when swiotlb
> memory usage is high. This approach depends on the fact that some
> DMA map requests are made from contexts where it's OK to block.
> Throttling such requests is acceptable to spread out a spike in usage.
> 
> Because it's not possible to detect at runtime whether a DMA map call
> is made in a context that can block, the calls in key device drivers
> must be updated with a MAY_BLOCK attribute, if appropriate.

Before somebody asks, the general agreement for decades has been that
there should be no global state indicating whether the kernel is in
atomic context. Instead, if a function needs to know, it should take an
explicit parameter.

IOW this MAY_BLOCK attribute follows an unquestioned kernel design
pattern.

> When this
> attribute is set and swiotlb memory usage is above a threshold, the
> swiotlb allocation code can serialize swiotlb memory usage to help
> ensure that it is not exhausted.
> 
> In general, storage device drivers can take advantage of the MAY_BLOCK
> option, while network device drivers cannot. The Linux block layer
> already allows storage requests to block when the BLK_MQ_F_BLOCKING
> flag is present on the request queue. In a CoCo VM environment,
> relatively few device types are used for storage devices, and updating
> these drivers is feasible. This patch set updates the NVMe driver and
> the Hyper-V storvsc synthetic storage driver. A few other drivers
> might also need to be updated to handle the key CoCo VM storage
> devices.
> 
> Because network drivers generally cannot use swiotlb throttling, it is
> still possible for swiotlb memory to become exhausted. But blunting
> the maximum swiotlb memory used by storage devices can significantly
> reduce the peak usage, and a smaller amount of swiotlb memory can be
> allocated in a CoCo VM. Also, usage by storage drivers is likely to
> overall be larger than for network drivers, especially when large
> numbers of disk devices are in use, each with many I/O requests in-
> flight.

The system can also handle network packet loss much better than I/O
errors, mainly because lost packets have always been part of normal
operation, unlike I/O errors. After all, that's why we unmount all
filesystems on removable media before physically unplugging (or
ejecting) them.

> swiotlb throttling does not affect the context requirements of DMA
> unmap calls. These always complete without blocking, even if the
> corresponding DMA map call was throttled.
> 
> Patches
> =======
> Patches 1 and 2 implement the core of swiotlb throttling. They define
> DMA attribute flag DMA_ATTR_MAY_BLOCK that device drivers use to
> indicate that a DMA map call is allowed to block, and therefore can be
> throttled. They update swiotlb_tbl_map_single() to detect this flag and
> implement the throttling. Similarly, swiotlb_tbl_unmap_single() is
> updated to handle a previously throttled request that has now freed
> its swiotlb memory.
> 
> Patch 3 adds the dma_recommend_may_block() call that device drivers
> can use to know if there's benefit in using the MAY_BLOCK option on
> DMA map calls. If not in a CoCo VM, this call returns "false" because
> swiotlb is not being used for all DMA I/O. This allows the driver to
> set the BLK_MQ_F_BLOCKING flag on blk-mq request queues only when
> there is benefit.
> 
> Patch 4 updates the SCSI-specific DMA map calls to add a "_attrs"
> variant to allow passing the MAY_BLOCK attribute.
> 
> Patch 5 adds the MAY_BLOCK option to the Hyper-V storvsc driver, which
> is used for storage in CoCo VMs in the Azure public cloud.
> 
> Patches 6 and 7 add the MAY_BLOCK option to the NVMe PCI host driver.
> 
> Discussion
> ==========
> * Since swiotlb isn't visible to device drivers, I've specifically
> named the DMA attribute as MAY_BLOCK instead of MAY_THROTTLE or
> something swiotlb specific. While this patch set consumes MAY_BLOCK
> only on the DMA direct path to do throttling in the swiotlb code,
> there might be other uses in the future outside of CoCo VMs, or
> perhaps on the IOMMU path.

I once introduced a similar flag and called it MAY_SLEEP. I chose
MAY_SLEEP, because there is already a might_sleep() annotation, but I
don't have a strong opinion unless your semantics is supposed to be
different from might_sleep(). If it is, then I strongly prefer
MAY_BLOCK to prevent confusing the two.

> * The swiotlb throttling code in this patch set throttles by
> serializing the use of swiotlb memory when usage is above a designated
> threshold: i.e., only one new swiotlb request is allowed to proceed at
> a time. When the corresponding unmap is done to release its swiotlb
> memory, the next request is allowed to proceed. This serialization is
> global and without knowledge of swiotlb areas. From a storage I/O
> performance standpoint, the serialization is a bit restrictive, but
> the code isn't trying to optimize for being above the threshold. If a
> workload regularly runs above the threshold, the size of the swiotlb
> memory should be increased.

With CONFIG_SWIOTLB_DYNAMIC, this could happen automatically in the
future. But let's get the basic functionality first.

> * Except for knowing how much swiotlb memory is currently allocated,
> throttle accounting is done without locking or atomic operations. For
> example, multiple requests could proceed in parallel when usage is
> just under the threshold, putting usage above the threshold by the
> aggregate size of the parallel requests. The threshold must already be
> set relatively conservatively because of drivers that can't enable
> throttling, so this slop in the accounting shouldn't be a problem.
> It's better than the potential bottleneck of a globally synchronized
> reservation mechanism.

Agreed.

> * In a CoCo VM, mapping a scatter/gather list makes an independent
> swiotlb request for each entry. Throttling each independent request
> wouldn't really work, so the code throttles only the first SGL entry.
> Once that entry passes any throttle, subsequent entries in the SGL
> proceed without throttling. When the SGL is unmapped, entries 1 thru
> N-1 are unmapped first, then entry 0 is unmapped, allowing the next
> serialized request to proceed.
> 
> Open Topics
> ===========
> 1. swiotlb allocations from Xen and the IOMMU code don't make use of
> throttling. This could be added if beneficial.
> 
> 2. The throttling values are currently exposed and adjustable in
> /sys/kernel/debug/swiotlb. Should any of this be moved so it is
> visible even without CONFIG_DEBUG_FS?

Yes. It should be possible to control the thresholds through sysctl.

> 3. I have not changed the current heuristic for the swiotlb memory
> size in CoCo VMs. It's not clear to me how to link this to whether the
> key storage drivers have been updated to allow throttling. For now,
> the benefit of reduced swiotlb memory size must be realized using the
> swiotlb= kernel boot line option.

This sounds fine for now.

> 4. I need to update the swiotlb documentation to describe throttling.
> 
> This patch set is built against linux-next-20240816.

OK, I'm going try it out.

Thank you for making this happen!

Petr T

Michael Kelley Aug. 23, 2024, 8:40 p.m. UTC | #5

From: Petr Tesařík <petr@tesarici.cz> Sent: Thursday, August 22, 2024 11:45 PM
> 
> Hi all,
> 
> upfront, I've had more time to consider this idea, because Michael
> kindly shared it with me back in February.
> 
> On Thu, 22 Aug 2024 11:37:11 -0700
> mhkelley58@gmail.com wrote:
> 
> > From: Michael Kelley <mhklinux@outlook.com>
> >
> > Background
> > ==========
> > Linux device drivers may make DMA map/unmap calls in contexts that
> > cannot block, such as in an interrupt handler. Consequently, when a
> > DMA map call must use a bounce buffer, the allocation of swiotlb
> > memory must always succeed immediately. If swiotlb memory is
> > exhausted, the DMA map call cannot wait for memory to be released. The
> > call fails, which usually results in an I/O error.
> 
> FTR most I/O errors are recoverable, but the recovery usually takes
> a lot of time. Plus the errors are logged and usually treated as
> important by monitoring software. In short, I agree it's a poor choice.
> 
> > Bounce buffers are usually used infrequently for a few corner cases,
> > so the default swiotlb memory allocation of 64 MiB is more than
> > sufficient to avoid running out and causing errors. However, recently
> > introduced Confidential Computing (CoCo) VMs must use bounce buffers
> > for all DMA I/O because the VM's memory is encrypted. In CoCo VMs
> > a new heuristic allocates ~6% of the VM's memory, up to 1 GiB, for
> > swiotlb memory. This large allocation reduces the likelihood of a
> > spike in usage causing DMA map failures. Unfortunately for most
> > workloads, this insurance against spikes comes at the cost of
> > potentially "wasting" hundreds of MiB's of the VM's memory, as swiotlb
> > memory can't be used for other purposes.
> 
> It may be worth mentioning that page encryption state can be changed by
> a hypercall, but that's a costly (and non-atomic) operation. It's much
> faster to copy the data to a page which is already unencrypted (a
> bounce buffer).
> 
> > Approach
> > ========
> > The goal is to significantly reduce the amount of memory reserved as
> > swiotlb memory in CoCo VMs, while not unduly increasing the risk of
> > DMA map failures due to memory exhaustion.
> >
> > To reach this goal, this patch set introduces the concept of swiotlb
> > throttling, which can delay swiotlb allocation requests when swiotlb
> > memory usage is high. This approach depends on the fact that some
> > DMA map requests are made from contexts where it's OK to block.
> > Throttling such requests is acceptable to spread out a spike in usage.
> >
> > Because it's not possible to detect at runtime whether a DMA map call
> > is made in a context that can block, the calls in key device drivers
> > must be updated with a MAY_BLOCK attribute, if appropriate.
> 
> Before somebody asks, the general agreement for decades has been that
> there should be no global state indicating whether the kernel is in
> atomic context. Instead, if a function needs to know, it should take an
> explicit parameter.
> 
> IOW this MAY_BLOCK attribute follows an unquestioned kernel design
> pattern.
> 
> > When this
> > attribute is set and swiotlb memory usage is above a threshold, the
> > swiotlb allocation code can serialize swiotlb memory usage to help
> > ensure that it is not exhausted.
> >
> > In general, storage device drivers can take advantage of the MAY_BLOCK
> > option, while network device drivers cannot. The Linux block layer
> > already allows storage requests to block when the BLK_MQ_F_BLOCKING
> > flag is present on the request queue. In a CoCo VM environment,
> > relatively few device types are used for storage devices, and updating
> > these drivers is feasible. This patch set updates the NVMe driver and
> > the Hyper-V storvsc synthetic storage driver. A few other drivers
> > might also need to be updated to handle the key CoCo VM storage
> > devices.
> >
> > Because network drivers generally cannot use swiotlb throttling, it is
> > still possible for swiotlb memory to become exhausted. But blunting
> > the maximum swiotlb memory used by storage devices can significantly
> > reduce the peak usage, and a smaller amount of swiotlb memory can be
> > allocated in a CoCo VM. Also, usage by storage drivers is likely to
> > overall be larger than for network drivers, especially when large
> > numbers of disk devices are in use, each with many I/O requests in-
> > flight.
> 
> The system can also handle network packet loss much better than I/O
> errors, mainly because lost packets have always been part of normal
> operation, unlike I/O errors. After all, that's why we unmount all
> filesystems on removable media before physically unplugging (or
> ejecting) them.
> 
> > swiotlb throttling does not affect the context requirements of DMA
> > unmap calls. These always complete without blocking, even if the
> > corresponding DMA map call was throttled.
> >
> > Patches
> > =======
> > Patches 1 and 2 implement the core of swiotlb throttling. They define
> > DMA attribute flag DMA_ATTR_MAY_BLOCK that device drivers use to
> > indicate that a DMA map call is allowed to block, and therefore can be
> > throttled. They update swiotlb_tbl_map_single() to detect this flag and
> > implement the throttling. Similarly, swiotlb_tbl_unmap_single() is
> > updated to handle a previously throttled request that has now freed
> > its swiotlb memory.
> >
> > Patch 3 adds the dma_recommend_may_block() call that device drivers
> > can use to know if there's benefit in using the MAY_BLOCK option on
> > DMA map calls. If not in a CoCo VM, this call returns "false" because
> > swiotlb is not being used for all DMA I/O. This allows the driver to
> > set the BLK_MQ_F_BLOCKING flag on blk-mq request queues only when
> > there is benefit.
> >
> > Patch 4 updates the SCSI-specific DMA map calls to add a "_attrs"
> > variant to allow passing the MAY_BLOCK attribute.
> >
> > Patch 5 adds the MAY_BLOCK option to the Hyper-V storvsc driver, which
> > is used for storage in CoCo VMs in the Azure public cloud.
> >
> > Patches 6 and 7 add the MAY_BLOCK option to the NVMe PCI host driver.
> >
> > Discussion
> > ==========
> > * Since swiotlb isn't visible to device drivers, I've specifically
> > named the DMA attribute as MAY_BLOCK instead of MAY_THROTTLE or
> > something swiotlb specific. While this patch set consumes MAY_BLOCK
> > only on the DMA direct path to do throttling in the swiotlb code,
> > there might be other uses in the future outside of CoCo VMs, or
> > perhaps on the IOMMU path.
> 
> I once introduced a similar flag and called it MAY_SLEEP. I chose
> MAY_SLEEP, because there is already a might_sleep() annotation, but I
> don't have a strong opinion unless your semantics is supposed to be
> different from might_sleep(). If it is, then I strongly prefer
> MAY_BLOCK to prevent confusing the two.

My intent is that the semantics are the same as might_sleep(). I
vacillated between MAY_SLEEP and MAY_BLOCK. The kernel seems
to treat "sleep" and "block" as equivalent, because blk-mq has
the BLK_MQ_F_BLOCKING flag, and SCSI has the 
queuecommand_may_block flag that is translated to
BLK_MQ_F_BLOCKING. So I settled on MAY_BLOCK, but as you
point out, that's inconsistent with might_sleep(). Either way will
be inconsistent somewhere, and I don't have a preference.

> 
> > * The swiotlb throttling code in this patch set throttles by
> > serializing the use of swiotlb memory when usage is above a designated
> > threshold: i.e., only one new swiotlb request is allowed to proceed at
> > a time. When the corresponding unmap is done to release its swiotlb
> > memory, the next request is allowed to proceed. This serialization is
> > global and without knowledge of swiotlb areas. From a storage I/O
> > performance standpoint, the serialization is a bit restrictive, but
> > the code isn't trying to optimize for being above the threshold. If a
> > workload regularly runs above the threshold, the size of the swiotlb
> > memory should be increased.
> 
> With CONFIG_SWIOTLB_DYNAMIC, this could happen automatically in the
> future. But let's get the basic functionality first.
> 
> > * Except for knowing how much swiotlb memory is currently allocated,
> > throttle accounting is done without locking or atomic operations. For
> > example, multiple requests could proceed in parallel when usage is
> > just under the threshold, putting usage above the threshold by the
> > aggregate size of the parallel requests. The threshold must already be
> > set relatively conservatively because of drivers that can't enable
> > throttling, so this slop in the accounting shouldn't be a problem.
> > It's better than the potential bottleneck of a globally synchronized
> > reservation mechanism.
> 
> Agreed.
> 
> > * In a CoCo VM, mapping a scatter/gather list makes an independent
> > swiotlb request for each entry. Throttling each independent request
> > wouldn't really work, so the code throttles only the first SGL entry.
> > Once that entry passes any throttle, subsequent entries in the SGL
> > proceed without throttling. When the SGL is unmapped, entries 1 thru
> > N-1 are unmapped first, then entry 0 is unmapped, allowing the next
> > serialized request to proceed.
> >
> > Open Topics
> > ===========
> > 1. swiotlb allocations from Xen and the IOMMU code don't make use of
> > throttling. This could be added if beneficial.
> >
> > 2. The throttling values are currently exposed and adjustable in
> > /sys/kernel/debug/swiotlb. Should any of this be moved so it is
> > visible even without CONFIG_DEBUG_FS?
> 
> Yes. It should be possible to control the thresholds through sysctl.

Good point.  I was thinking about creating /sys/kernel/swiotlb, but
sysctl is better.

Michael

> 
> > 3. I have not changed the current heuristic for the swiotlb memory
> > size in CoCo VMs. It's not clear to me how to link this to whether the
> > key storage drivers have been updated to allow throttling. For now,
> > the benefit of reduced swiotlb memory size must be realized using the
> > swiotlb= kernel boot line option.
> 
> This sounds fine for now.
> 
> > 4. I need to update the swiotlb documentation to describe throttling.
> >
> > This patch set is built against linux-next-20240816.
> 
> OK, I'm going try it out.
> 
> Thank you for making this happen!
> 
> Petr T

Christoph Hellwig Aug. 24, 2024, 8:05 a.m. UTC | #6

On Fri, Aug 23, 2024 at 02:20:41AM +0000, Michael Kelley wrote:
> Christoph Hellwig, or anyone else who knows the history and current
> reality better than I do, please jump in. :-)

It's not just interrupt context, but any context that does not allow
blocking.  There is plenty of that as seen by the moving of nvme
to specifically request a blocking context for I/O submission in this
path.

That being said there are probably more contexts that can block than
those that can't, so allowing for that option is a good thing.

Christoph Hellwig Aug. 24, 2024, 8:16 a.m. UTC | #7

On Thu, Aug 22, 2024 at 11:37:11AM -0700, mhkelley58@gmail.com wrote:
> Because it's not possible to detect at runtime whether a DMA map call
> is made in a context that can block, the calls in key device drivers
> must be updated with a MAY_BLOCK attribute, if appropriate. When this
> attribute is set and swiotlb memory usage is above a threshold, the
> swiotlb allocation code can serialize swiotlb memory usage to help
> ensure that it is not exhausted.

One thing I've been doing for a while but haven't gotten to due to
my lack of semantic patching skills is that we really want to split
the few flags useful for dma_map* from DMA_ATTR_* which largely
only applies to dma_alloc.

Only DMA_ATTR_WEAK_ORDERING (if we can't just kill it entirely)
and for now DMA_ATTR_NO_WARN is used for both.

DMA_ATTR_SKIP_CPU_SYNC and your new SLEEP/BLOCK attribute is only
useful for mapping, and the rest is for allocation only.

So I'd love to move to a DMA_MAP_* namespace for the mapping flags
before adding more on potentially widely used ones.

With a little grace period we can then also phase out DMA_ATTR_NO_WARN
for allocations, as the gfp_t can control that much better.

> In general, storage device drivers can take advantage of the MAY_BLOCK
> option, while network device drivers cannot. The Linux block layer
> already allows storage requests to block when the BLK_MQ_F_BLOCKING
> flag is present on the request queue.

Note that this also in general involves changes to the block drivers
to set that flag, which is a bit annoying, but I guess there is not
easy way around it without paying the price for the BLK_MQ_F_BLOCKING
overhead everywhere.

Petr Tesařík Aug. 24, 2024, 8:05 p.m. UTC | #8

On Fri, 23 Aug 2024 20:40:16 +0000
Michael Kelley <mhklinux@outlook.com> wrote:

> From: Petr Tesařík <petr@tesarici.cz> Sent: Thursday, August 22, 2024 11:45 PM
>[...]
> > > Discussion
> > > ==========
> > > * Since swiotlb isn't visible to device drivers, I've specifically
> > > named the DMA attribute as MAY_BLOCK instead of MAY_THROTTLE or
> > > something swiotlb specific. While this patch set consumes MAY_BLOCK
> > > only on the DMA direct path to do throttling in the swiotlb code,
> > > there might be other uses in the future outside of CoCo VMs, or
> > > perhaps on the IOMMU path.  
> > 
> > I once introduced a similar flag and called it MAY_SLEEP. I chose
> > MAY_SLEEP, because there is already a might_sleep() annotation, but I
> > don't have a strong opinion unless your semantics is supposed to be
> > different from might_sleep(). If it is, then I strongly prefer
> > MAY_BLOCK to prevent confusing the two.  
> 
> My intent is that the semantics are the same as might_sleep(). I
> vacillated between MAY_SLEEP and MAY_BLOCK. The kernel seems
> to treat "sleep" and "block" as equivalent, because blk-mq has
> the BLK_MQ_F_BLOCKING flag, and SCSI has the 
> queuecommand_may_block flag that is translated to
> BLK_MQ_F_BLOCKING. So I settled on MAY_BLOCK, but as you
> point out, that's inconsistent with might_sleep(). Either way will
> be inconsistent somewhere, and I don't have a preference.

Fair enough. Let's stay with MAY_BLOCK then, so you don't have to
change it everywhere.

>[...]
> > > Open Topics
> > > ===========
> > > 1. swiotlb allocations from Xen and the IOMMU code don't make use
> > > of throttling. This could be added if beneficial.
> > >
> > > 2. The throttling values are currently exposed and adjustable in
> > > /sys/kernel/debug/swiotlb. Should any of this be moved so it is
> > > visible even without CONFIG_DEBUG_FS?  
> > 
> > Yes. It should be possible to control the thresholds through
> > sysctl.  
> 
> Good point.  I was thinking about creating /sys/kernel/swiotlb, but
> sysctl is better.

That still leaves the question where it should go.

Under /proc/sys/kernel? Or should we make a /proc/sys/kernel/dma
subdirectory to make room for more dma-related controls?

Petr T

Michael Kelley Aug. 26, 2024, 3:27 p.m. UTC | #9

From: Christoph Hellwig <hch@lst.de> Sent: Saturday, August 24, 2024 1:16 AM
> 
> On Thu, Aug 22, 2024 at 11:37:11AM -0700, mhkelley58@gmail.com wrote:
> > Because it's not possible to detect at runtime whether a DMA map call
> > is made in a context that can block, the calls in key device drivers
> > must be updated with a MAY_BLOCK attribute, if appropriate. When this
> > attribute is set and swiotlb memory usage is above a threshold, the
> > swiotlb allocation code can serialize swiotlb memory usage to help
> > ensure that it is not exhausted.
> 
> One thing I've been doing for a while but haven't gotten to due to
> my lack of semantic patching skills is that we really want to split
> the few flags useful for dma_map* from DMA_ATTR_* which largely
> only applies to dma_alloc.
> 
> Only DMA_ATTR_WEAK_ORDERING (if we can't just kill it entirely)
> and for now DMA_ATTR_NO_WARN is used for both.
> 
> DMA_ATTR_SKIP_CPU_SYNC and your new SLEEP/BLOCK attribute is only
> useful for mapping, and the rest is for allocation only.
> 
> So I'd love to move to a DMA_MAP_* namespace for the mapping flags
> before adding more on potentially widely used ones.

OK, this makes sense to me. The DMA_ATTR_* symbols are currently
defined as just values that are not part of an enum or any other higher
level abstraction, and the "attrs" parameter to the dma_* functions is
just "unsigned long". Are you thinking that the separate namespace is
based only on the symbolic name (i.e., DMA_MAP_* vs DMA_ATTR_*),
with the values being disjoint? That seems straightforward to me.
Changing the "attrs" parameter to an enum is a much bigger change ....

For a transition period we can have both DMA_ATTR_SKIP_CPU_SYNC
and DMA_MAP_SKIP_CPU_SYNC, and then work to change all
occurrences of the former to the latter.

I'll have to look more closely at WEAK_ORDERING and NO_WARN.

There are also a couple of places where DMA_ATTR_NO_KERNEL_MAPPING
is used for dma_map_* calls, but those are clearly bogus since that
attribute is never tested in the map path.

> 
> With a little grace period we can then also phase out DMA_ATTR_NO_WARN
> for allocations, as the gfp_t can control that much better.
> 
> > In general, storage device drivers can take advantage of the MAY_BLOCK
> > option, while network device drivers cannot. The Linux block layer
> > already allows storage requests to block when the BLK_MQ_F_BLOCKING
> > flag is present on the request queue.
> 
> Note that this also in general involves changes to the block drivers
> to set that flag, which is a bit annoying, but I guess there is not
> easy way around it without paying the price for the BLK_MQ_F_BLOCKING
> overhead everywhere.

Agreed. I assumed there was some cost to BLK_MQ_F_BLOCKING since
the default is !BLK_MQ_F_BLOCKING, but I don't really know what
that is. Do you have a short summary, just for my education?

Michael

Michael Kelley Aug. 26, 2024, 4:24 p.m. UTC | #10

From: Petr Tesařík <petr@tesarici.cz> Sent: Saturday, August 24, 2024 1:06 PM
> 
> On Fri, 23 Aug 2024 20:40:16 +0000
> Michael Kelley <mhklinux@outlook.com> wrote:
> 
> > From: Petr Tesařík <petr@tesarici.cz> Sent: Thursday, August 22, 2024 11:45 PM
> >[...]
> > > > Discussion
> > > > ==========
> > > > * Since swiotlb isn't visible to device drivers, I've specifically
> > > > named the DMA attribute as MAY_BLOCK instead of MAY_THROTTLE or
> > > > something swiotlb specific. While this patch set consumes MAY_BLOCK
> > > > only on the DMA direct path to do throttling in the swiotlb code,
> > > > there might be other uses in the future outside of CoCo VMs, or
> > > > perhaps on the IOMMU path.
> > >
> > > I once introduced a similar flag and called it MAY_SLEEP. I chose
> > > MAY_SLEEP, because there is already a might_sleep() annotation, but I
> > > don't have a strong opinion unless your semantics is supposed to be
> > > different from might_sleep(). If it is, then I strongly prefer
> > > MAY_BLOCK to prevent confusing the two.
> >
> > My intent is that the semantics are the same as might_sleep(). I
> > vacillated between MAY_SLEEP and MAY_BLOCK. The kernel seems
> > to treat "sleep" and "block" as equivalent, because blk-mq has
> > the BLK_MQ_F_BLOCKING flag, and SCSI has the
> > queuecommand_may_block flag that is translated to
> > BLK_MQ_F_BLOCKING. So I settled on MAY_BLOCK, but as you
> > point out, that's inconsistent with might_sleep(). Either way will
> > be inconsistent somewhere, and I don't have a preference.
> 
> Fair enough. Let's stay with MAY_BLOCK then, so you don't have to
> change it everywhere.
> 
> >[...]
> > > > Open Topics
> > > > ===========
> > > > 1. swiotlb allocations from Xen and the IOMMU code don't make use
> > > > of throttling. This could be added if beneficial.
> > > >
> > > > 2. The throttling values are currently exposed and adjustable in
> > > > /sys/kernel/debug/swiotlb. Should any of this be moved so it is
> > > > visible even without CONFIG_DEBUG_FS?
> > >
> > > Yes. It should be possible to control the thresholds through
> > > sysctl.
> >
> > Good point.  I was thinking about creating /sys/kernel/swiotlb, but
> > sysctl is better.
> 
> That still leaves the question where it should go.
> 
> Under /proc/sys/kernel? Or should we make a /proc/sys/kernel/dma
> subdirectory to make room for more dma-related controls?

I would be good with /proc/sys/kernel/swiotlb (or "dma"). There
are only two entries (high_throttle and low_throttle), but just
dumping everything directly in /proc/sys/kernel doesn't seem like
a good long-term approach.  Even though there are currently a lot
of direct entries in /proc/sys/kernel, that may be historical, and not
changeable due to backwards compatibility requirements.

Michael

Michael

Petr Tesařík Aug. 26, 2024, 7:28 p.m. UTC | #11

On Mon, 26 Aug 2024 16:24:53 +0000
Michael Kelley <mhklinux@outlook.com> wrote:

> From: Petr Tesařík <petr@tesarici.cz> Sent: Saturday, August 24, 2024 1:06 PM
> > 
> > On Fri, 23 Aug 2024 20:40:16 +0000
> > Michael Kelley <mhklinux@outlook.com> wrote:
> >   
> > > From: Petr Tesařík <petr@tesarici.cz> Sent: Thursday, August 22, 2024 11:45 PM
> > >[...]  
> > > > > Discussion
> > > > > ==========
> > > > > * Since swiotlb isn't visible to device drivers, I've specifically
> > > > > named the DMA attribute as MAY_BLOCK instead of MAY_THROTTLE or
> > > > > something swiotlb specific. While this patch set consumes MAY_BLOCK
> > > > > only on the DMA direct path to do throttling in the swiotlb code,
> > > > > there might be other uses in the future outside of CoCo VMs, or
> > > > > perhaps on the IOMMU path.  
> > > >
> > > > I once introduced a similar flag and called it MAY_SLEEP. I chose
> > > > MAY_SLEEP, because there is already a might_sleep() annotation, but I
> > > > don't have a strong opinion unless your semantics is supposed to be
> > > > different from might_sleep(). If it is, then I strongly prefer
> > > > MAY_BLOCK to prevent confusing the two.  
> > >
> > > My intent is that the semantics are the same as might_sleep(). I
> > > vacillated between MAY_SLEEP and MAY_BLOCK. The kernel seems
> > > to treat "sleep" and "block" as equivalent, because blk-mq has
> > > the BLK_MQ_F_BLOCKING flag, and SCSI has the
> > > queuecommand_may_block flag that is translated to
> > > BLK_MQ_F_BLOCKING. So I settled on MAY_BLOCK, but as you
> > > point out, that's inconsistent with might_sleep(). Either way will
> > > be inconsistent somewhere, and I don't have a preference.  
> > 
> > Fair enough. Let's stay with MAY_BLOCK then, so you don't have to
> > change it everywhere.
> >   
> > >[...]  
> > > > > Open Topics
> > > > > ===========
> > > > > 1. swiotlb allocations from Xen and the IOMMU code don't make use
> > > > > of throttling. This could be added if beneficial.
> > > > >
> > > > > 2. The throttling values are currently exposed and adjustable in
> > > > > /sys/kernel/debug/swiotlb. Should any of this be moved so it is
> > > > > visible even without CONFIG_DEBUG_FS?  
> > > >
> > > > Yes. It should be possible to control the thresholds through
> > > > sysctl.  
> > >
> > > Good point.  I was thinking about creating /sys/kernel/swiotlb, but
> > > sysctl is better.  
> > 
> > That still leaves the question where it should go.
> > 
> > Under /proc/sys/kernel? Or should we make a /proc/sys/kernel/dma
> > subdirectory to make room for more dma-related controls?  
> 
> I would be good with /proc/sys/kernel/swiotlb (or "dma"). There
> are only two entries (high_throttle and low_throttle), but just
> dumping everything directly in /proc/sys/kernel doesn't seem like
> a good long-term approach.  Even though there are currently a lot
> of direct entries in /proc/sys/kernel, that may be historical, and not
> changeable due to backwards compatibility requirements.

I think SWIOTLB is a bit too narrow. How many controls would we add
under /proc/sys/kernel/swiotlb? The chances seem higher if we call it
/proc/sys/kernel/dma/swiotlb_{low,high}_throttle, and it follows the
paths in source code (which are subject to change any time, however).
Anyway, I don't want to get into bikeshedding; I'm fine with whatever
you send in the end. :-)

BTW those entries directly under /proc/sys/kernel are not all
historical. The io_uring_* controls were added just last year, see
commit 76d3ccecfa18.

Petr T

Michael Kelley Aug. 27, 2024, 12:26 a.m. UTC | #12

From: Petr Tesařík <petr@tesarici.cz> Sent: Monday, August 26, 2024 12:28 PM
> 
> On Mon, 26 Aug 2024 16:24:53 +0000
> Michael Kelley <mhklinux@outlook.com> wrote:
> 
> > From: Petr Tesařík <petr@tesarici.cz> Sent: Saturday, August 24, 2024 1:06 PM
> > >
> > > On Fri, 23 Aug 2024 20:40:16 +0000
> > > Michael Kelley <mhklinux@outlook.com> wrote:
> > >
> > > > From: Petr Tesařík <petr@tesarici.cz> Sent: Thursday, August 22, 2024 11:45 PM
> > > >[...]
> > > > > > Discussion
> > > > > > ==========
> > > > > > * Since swiotlb isn't visible to device drivers, I've specifically
> > > > > > named the DMA attribute as MAY_BLOCK instead of MAY_THROTTLE or
> > > > > > something swiotlb specific. While this patch set consumes MAY_BLOCK
> > > > > > only on the DMA direct path to do throttling in the swiotlb code,
> > > > > > there might be other uses in the future outside of CoCo VMs, or
> > > > > > perhaps on the IOMMU path.
> > > > >
> > > > > I once introduced a similar flag and called it MAY_SLEEP. I chose
> > > > > MAY_SLEEP, because there is already a might_sleep() annotation, but I
> > > > > don't have a strong opinion unless your semantics is supposed to be
> > > > > different from might_sleep(). If it is, then I strongly prefer
> > > > > MAY_BLOCK to prevent confusing the two.
> > > >
> > > > My intent is that the semantics are the same as might_sleep(). I
> > > > vacillated between MAY_SLEEP and MAY_BLOCK. The kernel seems
> > > > to treat "sleep" and "block" as equivalent, because blk-mq has
> > > > the BLK_MQ_F_BLOCKING flag, and SCSI has the
> > > > queuecommand_may_block flag that is translated to
> > > > BLK_MQ_F_BLOCKING. So I settled on MAY_BLOCK, but as you
> > > > point out, that's inconsistent with might_sleep(). Either way will
> > > > be inconsistent somewhere, and I don't have a preference.
> > >
> > > Fair enough. Let's stay with MAY_BLOCK then, so you don't have to
> > > change it everywhere.
> > >
> > > >[...]
> > > > > > Open Topics
> > > > > > ===========
> > > > > > 1. swiotlb allocations from Xen and the IOMMU code don't make use
> > > > > > of throttling. This could be added if beneficial.
> > > > > >
> > > > > > 2. The throttling values are currently exposed and adjustable in
> > > > > > /sys/kernel/debug/swiotlb. Should any of this be moved so it is
> > > > > > visible even without CONFIG_DEBUG_FS?
> > > > >
> > > > > Yes. It should be possible to control the thresholds through
> > > > > sysctl.
> > > >
> > > > Good point.  I was thinking about creating /sys/kernel/swiotlb, but
> > > > sysctl is better.
> > >
> > > That still leaves the question where it should go.
> > >
> > > Under /proc/sys/kernel? Or should we make a /proc/sys/kernel/dma
> > > subdirectory to make room for more dma-related controls?
> >
> > I would be good with /proc/sys/kernel/swiotlb (or "dma"). There
> > are only two entries (high_throttle and low_throttle), but just
> > dumping everything directly in /proc/sys/kernel doesn't seem like
> > a good long-term approach.  Even though there are currently a lot
> > of direct entries in /proc/sys/kernel, that may be historical, and not
> > changeable due to backwards compatibility requirements.
> 
> I think SWIOTLB is a bit too narrow. How many controls would we add
> under /proc/sys/kernel/swiotlb? The chances seem higher if we call it
> /proc/sys/kernel/dma/swiotlb_{low,high}_throttle, and it follows the
> paths in source code (which are subject to change any time, however).
> Anyway, I don't want to get into bikeshedding; I'm fine with whatever
> you send in the end. :-)
> 
> BTW those entries directly under /proc/sys/kernel are not all
> historical. The io_uring_* controls were added just last year, see
> commit 76d3ccecfa18.
> 

Note that there could be multiple instances of the throttle values, since
a DMA restricted pool has its own struct io_tlb_mem that is separate
from the default. I wrote the code so that throttling is independently
applied to a restricted pool as well, though I haven't tested it.

So the typical case is that we'll have high and low throttle values for the
default swiotlb pool, but we could also have high and low throttle
values for any restricted pools.

Maybe the /proc pathnames would need to be:

   /proc/sys/kernel/dma/swiotlb_default/high_throttle
   /proc/sys/kernel/dma/swiotlb_default/low_throttle
   /proc/sys/kernel/dma/swiotlb_<rpoolname>/high_throttle
   /proc/sys/kernel/dma/swiotlb_<rpoolname>/low_throttle

Or we could throw all the throttles directly into the "dma" directory,
though that makes for fairly long names in lieu of a deeper directory
structure:

   /proc/sys/kernel/dma/default_swiotlb_high_throttle
   /proc/sys/kernel/dma/default_swiotlb_low_throttle
   /proc/sys/kernel/dma/<rpoolname>_swiotlb_high_throttle
   /proc/sys/kernel/dma/<rpoolname_>swiotlb_low_throttle

Thoughts?

Michael

Christoph Hellwig Aug. 27, 2024, 7:14 a.m. UTC | #13

On Mon, Aug 26, 2024 at 03:27:30PM +0000, Michael Kelley wrote:
> OK, this makes sense to me. The DMA_ATTR_* symbols are currently
> defined as just values that are not part of an enum or any other higher
> level abstraction, and the "attrs" parameter to the dma_* functions is
> just "unsigned long". Are you thinking that the separate namespace is
> based only on the symbolic name (i.e., DMA_MAP_* vs DMA_ATTR_*),
> with the values being disjoint? That seems straightforward to me.

Yes. Although initially I'd just keep ATTR for the allocation and then
maybe do a scripted run to convert it.

> Changing the "attrs" parameter to an enum is a much bigger change ....

I don't think an enum makes much sense as we have bits defined.  A
__bitwise type would be nice, but not required.

> For a transition period we can have both DMA_ATTR_SKIP_CPU_SYNC
> and DMA_MAP_SKIP_CPU_SYNC, and then work to change all
> occurrences of the former to the latter.
> 
> I'll have to look more closely at WEAK_ORDERING and NO_WARN.
> 
> There are also a couple of places where DMA_ATTR_NO_KERNEL_MAPPING
> is used for dma_map_* calls, but those are clearly bogus since that
> attribute is never tested in the map path.

Yeah, these kinds of bogus things is what I'd like to kill..


> > Note that this also in general involves changes to the block drivers
> > to set that flag, which is a bit annoying, but I guess there is not
> > easy way around it without paying the price for the BLK_MQ_F_BLOCKING
> > overhead everywhere.
> 
> Agreed. I assumed there was some cost to BLK_MQ_F_BLOCKING since
> the default is !BLK_MQ_F_BLOCKING, but I don't really know what
> that is. Do you have a short summary, just for my education?

I think the biggest issue is that synchronize_srcu is pretty damn
expensive, but there's also a whole bunch of places that unconditionally
defer to the workqueue.

Petr Tesařík Aug. 27, 2024, 8 a.m. UTC | #14

On Tue, 27 Aug 2024 00:26:36 +0000
Michael Kelley <mhklinux@outlook.com> wrote:

> From: Petr Tesařík <petr@tesarici.cz> Sent: Monday, August 26, 2024 12:28 PM
> > 
> > On Mon, 26 Aug 2024 16:24:53 +0000
> > Michael Kelley <mhklinux@outlook.com> wrote:
> >   
> > > From: Petr Tesařík <petr@tesarici.cz> Sent: Saturday, August 24, 2024 1:06 PM  
> > > >
> > > > On Fri, 23 Aug 2024 20:40:16 +0000
> > > > Michael Kelley <mhklinux@outlook.com> wrote:
> > > >  
> > > > > From: Petr Tesařík <petr@tesarici.cz> Sent: Thursday, August 22, 2024 11:45 PM
> > > > >[...]  
> > > > > > > Discussion
> > > > > > > ==========
> > > > > > > * Since swiotlb isn't visible to device drivers, I've specifically
> > > > > > > named the DMA attribute as MAY_BLOCK instead of MAY_THROTTLE or
> > > > > > > something swiotlb specific. While this patch set consumes MAY_BLOCK
> > > > > > > only on the DMA direct path to do throttling in the swiotlb code,
> > > > > > > there might be other uses in the future outside of CoCo VMs, or
> > > > > > > perhaps on the IOMMU path.  
> > > > > >
> > > > > > I once introduced a similar flag and called it MAY_SLEEP. I chose
> > > > > > MAY_SLEEP, because there is already a might_sleep() annotation, but I
> > > > > > don't have a strong opinion unless your semantics is supposed to be
> > > > > > different from might_sleep(). If it is, then I strongly prefer
> > > > > > MAY_BLOCK to prevent confusing the two.  
> > > > >
> > > > > My intent is that the semantics are the same as might_sleep(). I
> > > > > vacillated between MAY_SLEEP and MAY_BLOCK. The kernel seems
> > > > > to treat "sleep" and "block" as equivalent, because blk-mq has
> > > > > the BLK_MQ_F_BLOCKING flag, and SCSI has the
> > > > > queuecommand_may_block flag that is translated to
> > > > > BLK_MQ_F_BLOCKING. So I settled on MAY_BLOCK, but as you
> > > > > point out, that's inconsistent with might_sleep(). Either way will
> > > > > be inconsistent somewhere, and I don't have a preference.  
> > > >
> > > > Fair enough. Let's stay with MAY_BLOCK then, so you don't have to
> > > > change it everywhere.
> > > >  
> > > > >[...]  
> > > > > > > Open Topics
> > > > > > > ===========
> > > > > > > 1. swiotlb allocations from Xen and the IOMMU code don't make use
> > > > > > > of throttling. This could be added if beneficial.
> > > > > > >
> > > > > > > 2. The throttling values are currently exposed and adjustable in
> > > > > > > /sys/kernel/debug/swiotlb. Should any of this be moved so it is
> > > > > > > visible even without CONFIG_DEBUG_FS?  
> > > > > >
> > > > > > Yes. It should be possible to control the thresholds through
> > > > > > sysctl.  
> > > > >
> > > > > Good point.  I was thinking about creating /sys/kernel/swiotlb, but
> > > > > sysctl is better.  
> > > >
> > > > That still leaves the question where it should go.
> > > >
> > > > Under /proc/sys/kernel? Or should we make a /proc/sys/kernel/dma
> > > > subdirectory to make room for more dma-related controls?  
> > >
> > > I would be good with /proc/sys/kernel/swiotlb (or "dma"). There
> > > are only two entries (high_throttle and low_throttle), but just
> > > dumping everything directly in /proc/sys/kernel doesn't seem like
> > > a good long-term approach.  Even though there are currently a lot
> > > of direct entries in /proc/sys/kernel, that may be historical, and not
> > > changeable due to backwards compatibility requirements.  
> > 
> > I think SWIOTLB is a bit too narrow. How many controls would we add
> > under /proc/sys/kernel/swiotlb? The chances seem higher if we call it
> > /proc/sys/kernel/dma/swiotlb_{low,high}_throttle, and it follows the
> > paths in source code (which are subject to change any time, however).
> > Anyway, I don't want to get into bikeshedding; I'm fine with whatever
> > you send in the end. :-)
> > 
> > BTW those entries directly under /proc/sys/kernel are not all
> > historical. The io_uring_* controls were added just last year, see
> > commit 76d3ccecfa18.
> >   
> 
> Note that there could be multiple instances of the throttle values, since
> a DMA restricted pool has its own struct io_tlb_mem that is separate
> from the default. I wrote the code so that throttling is independently
> applied to a restricted pool as well, though I haven't tested it.

Good point. I didn't think about it.

> So the typical case is that we'll have high and low throttle values for the
> default swiotlb pool, but we could also have high and low throttle
> values for any restricted pools.
> 
> Maybe the /proc pathnames would need to be:
> 
>    /proc/sys/kernel/dma/swiotlb_default/high_throttle
>    /proc/sys/kernel/dma/swiotlb_default/low_throttle
>    /proc/sys/kernel/dma/swiotlb_<rpoolname>/high_throttle
>    /proc/sys/kernel/dma/swiotlb_<rpoolname>/low_throttle

If a subdirectory is needed anyway, then we may ditch the dma
directory idea and place swiotlb subdirectories directly under
/proc/sys/kernel.

> Or we could throw all the throttles directly into the "dma" directory,
> though that makes for fairly long names in lieu of a deeper directory
> structure:
> 
>    /proc/sys/kernel/dma/default_swiotlb_high_throttle
>    /proc/sys/kernel/dma/default_swiotlb_low_throttle
>    /proc/sys/kernel/dma/<rpoolname>_swiotlb_high_throttle
>    /proc/sys/kernel/dma/<rpoolname_>swiotlb_low_throttle
> 
> Thoughts?

I have already said I don't care much as long as the naming and/or
placement is not downright confusing. If the default values are
adjusted, they will end up in a config file under /etc/sysctl.d, and
admins will copy&paste it from Stack Exchange.

I mean, you're probably the most interested person on the planet, so
make a choice, and we'll adapt. ;-)

Petr T

Robin Murphy Aug. 28, 2024, 12:02 p.m. UTC | #15

On 2024-08-22 7:37 pm, mhkelley58@gmail.com wrote:
> From: Michael Kelley <mhklinux@outlook.com>
> 
> Background
> ==========
> Linux device drivers may make DMA map/unmap calls in contexts that
> cannot block, such as in an interrupt handler. Consequently, when a
> DMA map call must use a bounce buffer, the allocation of swiotlb
> memory must always succeed immediately. If swiotlb memory is
> exhausted, the DMA map call cannot wait for memory to be released. The
> call fails, which usually results in an I/O error.
> 
> Bounce buffers are usually used infrequently for a few corner cases,
> so the default swiotlb memory allocation of 64 MiB is more than
> sufficient to avoid running out and causing errors. However, recently
> introduced Confidential Computing (CoCo) VMs must use bounce buffers
> for all DMA I/O because the VM's memory is encrypted. In CoCo VMs
> a new heuristic allocates ~6% of the VM's memory, up to 1 GiB, for
> swiotlb memory. This large allocation reduces the likelihood of a
> spike in usage causing DMA map failures. Unfortunately for most
> workloads, this insurance against spikes comes at the cost of
> potentially "wasting" hundreds of MiB's of the VM's memory, as swiotlb
> memory can't be used for other purposes.
> 
> Approach
> ========
> The goal is to significantly reduce the amount of memory reserved as
> swiotlb memory in CoCo VMs, while not unduly increasing the risk of
> DMA map failures due to memory exhaustion.

Isn't that fundamentally the same thing that SWIOTLB_DYNAMIC was already 
meant to address? Of course the implementation of that is still young 
and has plenty of scope to be made more effective, and some of the ideas 
here could very much help with that, but I'm struggling a little to see 
what's really beneficial about having a completely disjoint mechanism 
for sitting around doing nothing in the precise circumstances where it 
would seem most possible to allocate a transient buffer and get on with it.

Thanks,
Robin.

> To reach this goal, this patch set introduces the concept of swiotlb
> throttling, which can delay swiotlb allocation requests when swiotlb
> memory usage is high. This approach depends on the fact that some
> DMA map requests are made from contexts where it's OK to block.
> Throttling such requests is acceptable to spread out a spike in usage.
> 
> Because it's not possible to detect at runtime whether a DMA map call
> is made in a context that can block, the calls in key device drivers
> must be updated with a MAY_BLOCK attribute, if appropriate. When this
> attribute is set and swiotlb memory usage is above a threshold, the
> swiotlb allocation code can serialize swiotlb memory usage to help
> ensure that it is not exhausted.
> 
> In general, storage device drivers can take advantage of the MAY_BLOCK
> option, while network device drivers cannot. The Linux block layer
> already allows storage requests to block when the BLK_MQ_F_BLOCKING
> flag is present on the request queue. In a CoCo VM environment,
> relatively few device types are used for storage devices, and updating
> these drivers is feasible. This patch set updates the NVMe driver and
> the Hyper-V storvsc synthetic storage driver. A few other drivers
> might also need to be updated to handle the key CoCo VM storage
> devices.
> 
> Because network drivers generally cannot use swiotlb throttling, it is
> still possible for swiotlb memory to become exhausted. But blunting
> the maximum swiotlb memory used by storage devices can significantly
> reduce the peak usage, and a smaller amount of swiotlb memory can be
> allocated in a CoCo VM. Also, usage by storage drivers is likely to
> overall be larger than for network drivers, especially when large
> numbers of disk devices are in use, each with many I/O requests in-
> flight.
> 
> swiotlb throttling does not affect the context requirements of DMA
> unmap calls. These always complete without blocking, even if the
> corresponding DMA map call was throttled.
> 
> Patches
> =======
> Patches 1 and 2 implement the core of swiotlb throttling. They define
> DMA attribute flag DMA_ATTR_MAY_BLOCK that device drivers use to
> indicate that a DMA map call is allowed to block, and therefore can be
> throttled. They update swiotlb_tbl_map_single() to detect this flag and
> implement the throttling. Similarly, swiotlb_tbl_unmap_single() is
> updated to handle a previously throttled request that has now freed
> its swiotlb memory.
> 
> Patch 3 adds the dma_recommend_may_block() call that device drivers
> can use to know if there's benefit in using the MAY_BLOCK option on
> DMA map calls. If not in a CoCo VM, this call returns "false" because
> swiotlb is not being used for all DMA I/O. This allows the driver to
> set the BLK_MQ_F_BLOCKING flag on blk-mq request queues only when
> there is benefit.
> 
> Patch 4 updates the SCSI-specific DMA map calls to add a "_attrs"
> variant to allow passing the MAY_BLOCK attribute.
> 
> Patch 5 adds the MAY_BLOCK option to the Hyper-V storvsc driver, which
> is used for storage in CoCo VMs in the Azure public cloud.
> 
> Patches 6 and 7 add the MAY_BLOCK option to the NVMe PCI host driver.
> 
> Discussion
> ==========
> * Since swiotlb isn't visible to device drivers, I've specifically
> named the DMA attribute as MAY_BLOCK instead of MAY_THROTTLE or
> something swiotlb specific. While this patch set consumes MAY_BLOCK
> only on the DMA direct path to do throttling in the swiotlb code,
> there might be other uses in the future outside of CoCo VMs, or
> perhaps on the IOMMU path.
> 
> * The swiotlb throttling code in this patch set throttles by
> serializing the use of swiotlb memory when usage is above a designated
> threshold: i.e., only one new swiotlb request is allowed to proceed at
> a time. When the corresponding unmap is done to release its swiotlb
> memory, the next request is allowed to proceed. This serialization is
> global and without knowledge of swiotlb areas. From a storage I/O
> performance standpoint, the serialization is a bit restrictive, but
> the code isn't trying to optimize for being above the threshold. If a
> workload regularly runs above the threshold, the size of the swiotlb
> memory should be increased.
> 
> * Except for knowing how much swiotlb memory is currently allocated,
> throttle accounting is done without locking or atomic operations. For
> example, multiple requests could proceed in parallel when usage is
> just under the threshold, putting usage above the threshold by the
> aggregate size of the parallel requests. The threshold must already be
> set relatively conservatively because of drivers that can't enable
> throttling, so this slop in the accounting shouldn't be a problem.
> It's better than the potential bottleneck of a globally synchronized
> reservation mechanism.
> 
> * In a CoCo VM, mapping a scatter/gather list makes an independent
> swiotlb request for each entry. Throttling each independent request
> wouldn't really work, so the code throttles only the first SGL entry.
> Once that entry passes any throttle, subsequent entries in the SGL
> proceed without throttling. When the SGL is unmapped, entries 1 thru
> N-1 are unmapped first, then entry 0 is unmapped, allowing the next
> serialized request to proceed.
> 
> Open Topics
> ===========
> 1. swiotlb allocations from Xen and the IOMMU code don't make use of
> throttling. This could be added if beneficial.
> 
> 2. The throttling values are currently exposed and adjustable in
> /sys/kernel/debug/swiotlb. Should any of this be moved so it is
> visible even without CONFIG_DEBUG_FS?
> 
> 3. I have not changed the current heuristic for the swiotlb memory
> size in CoCo VMs. It's not clear to me how to link this to whether the
> key storage drivers have been updated to allow throttling. For now,
> the benefit of reduced swiotlb memory size must be realized using the
> swiotlb= kernel boot line option.
> 
> 4. I need to update the swiotlb documentation to describe throttling.
> 
> This patch set is built against linux-next-20240816.
> 
> Michael Kelley (7):
>    swiotlb: Introduce swiotlb throttling
>    dma: Handle swiotlb throttling for SGLs
>    dma: Add function for drivers to know if allowing blocking is useful
>    scsi_lib_dma: Add _attrs variant of scsi_dma_map()
>    scsi: storvsc: Enable swiotlb throttling
>    nvme: Move BLK_MQ_F_BLOCKING indicator to struct nvme_ctrl
>    nvme: Enable swiotlb throttling for NVMe PCI devices
> 
>   drivers/nvme/host/core.c    |   4 +-
>   drivers/nvme/host/nvme.h    |   2 +-
>   drivers/nvme/host/pci.c     |  18 ++++--
>   drivers/nvme/host/tcp.c     |   3 +-
>   drivers/scsi/scsi_lib_dma.c |  13 ++--
>   drivers/scsi/storvsc_drv.c  |   9 ++-
>   include/linux/dma-mapping.h |  13 ++++
>   include/linux/swiotlb.h     |  15 ++++-
>   include/scsi/scsi_cmnd.h    |   7 ++-
>   kernel/dma/Kconfig          |  13 ++++
>   kernel/dma/direct.c         |  41 +++++++++++--
>   kernel/dma/direct.h         |   1 +
>   kernel/dma/mapping.c        |  10 ++++
>   kernel/dma/swiotlb.c        | 114 ++++++++++++++++++++++++++++++++----
>   14 files changed, 227 insertions(+), 36 deletions(-)
>

Petr Tesařík Aug. 28, 2024, 1:03 p.m. UTC | #16

On Wed, 28 Aug 2024 13:02:31 +0100
Robin Murphy <robin.murphy@arm.com> wrote:

> On 2024-08-22 7:37 pm, mhkelley58@gmail.com wrote:
> > From: Michael Kelley <mhklinux@outlook.com>
> > 
> > Background
> > ==========
> > Linux device drivers may make DMA map/unmap calls in contexts that
> > cannot block, such as in an interrupt handler. Consequently, when a
> > DMA map call must use a bounce buffer, the allocation of swiotlb
> > memory must always succeed immediately. If swiotlb memory is
> > exhausted, the DMA map call cannot wait for memory to be released. The
> > call fails, which usually results in an I/O error.
> > 
> > Bounce buffers are usually used infrequently for a few corner cases,
> > so the default swiotlb memory allocation of 64 MiB is more than
> > sufficient to avoid running out and causing errors. However, recently
> > introduced Confidential Computing (CoCo) VMs must use bounce buffers
> > for all DMA I/O because the VM's memory is encrypted. In CoCo VMs
> > a new heuristic allocates ~6% of the VM's memory, up to 1 GiB, for
> > swiotlb memory. This large allocation reduces the likelihood of a
> > spike in usage causing DMA map failures. Unfortunately for most
> > workloads, this insurance against spikes comes at the cost of
> > potentially "wasting" hundreds of MiB's of the VM's memory, as swiotlb
> > memory can't be used for other purposes.
> > 
> > Approach
> > ========
> > The goal is to significantly reduce the amount of memory reserved as
> > swiotlb memory in CoCo VMs, while not unduly increasing the risk of
> > DMA map failures due to memory exhaustion.  
> 
> Isn't that fundamentally the same thing that SWIOTLB_DYNAMIC was already 
> meant to address? Of course the implementation of that is still young 
> and has plenty of scope to be made more effective, and some of the ideas 
> here could very much help with that, but I'm struggling a little to see 
> what's really beneficial about having a completely disjoint mechanism 
> for sitting around doing nothing in the precise circumstances where it 
> would seem most possible to allocate a transient buffer and get on with it.

This question can be probably best answered by Michael, but let me give
my understanding of the differences. First the similarity: Yes, one
of the key new concepts is that swiotlb allocation may block, and I
introduced a similar attribute in one of my dynamic SWIOTLB patches; it
was later dropped, but dynamic SWIOTLB would still benefit from it.

More importantly, dynamic SWIOTLB may deplete memory following an I/O
spike. I do have some ideas how memory could be returned back to the
allocator, but the code is not ready (unlike this patch series).
Moreover, it may still be a better idea to throttle the devices
instead, because returning DMA'able memory is not always cheap. In a
CoCo VM, this memory must be re-encrypted, and that requires a
hypercall that I'm told is expensive.

In short, IIUC it is faster in a CoCo VM to delay some requests a bit
than to grow the swiotlb.

Michael, please add your insights.

Petr T

> > To reach this goal, this patch set introduces the concept of swiotlb
> > throttling, which can delay swiotlb allocation requests when swiotlb
> > memory usage is high. This approach depends on the fact that some
> > DMA map requests are made from contexts where it's OK to block.
> > Throttling such requests is acceptable to spread out a spike in usage.
> > 
> > Because it's not possible to detect at runtime whether a DMA map call
> > is made in a context that can block, the calls in key device drivers
> > must be updated with a MAY_BLOCK attribute, if appropriate. When this
> > attribute is set and swiotlb memory usage is above a threshold, the
> > swiotlb allocation code can serialize swiotlb memory usage to help
> > ensure that it is not exhausted.
> > 
> > In general, storage device drivers can take advantage of the MAY_BLOCK
> > option, while network device drivers cannot. The Linux block layer
> > already allows storage requests to block when the BLK_MQ_F_BLOCKING
> > flag is present on the request queue. In a CoCo VM environment,
> > relatively few device types are used for storage devices, and updating
> > these drivers is feasible. This patch set updates the NVMe driver and
> > the Hyper-V storvsc synthetic storage driver. A few other drivers
> > might also need to be updated to handle the key CoCo VM storage
> > devices.
> > 
> > Because network drivers generally cannot use swiotlb throttling, it is
> > still possible for swiotlb memory to become exhausted. But blunting
> > the maximum swiotlb memory used by storage devices can significantly
> > reduce the peak usage, and a smaller amount of swiotlb memory can be
> > allocated in a CoCo VM. Also, usage by storage drivers is likely to
> > overall be larger than for network drivers, especially when large
> > numbers of disk devices are in use, each with many I/O requests in-
> > flight.
> > 
> > swiotlb throttling does not affect the context requirements of DMA
> > unmap calls. These always complete without blocking, even if the
> > corresponding DMA map call was throttled.
> > 
> > Patches
> > =======
> > Patches 1 and 2 implement the core of swiotlb throttling. They define
> > DMA attribute flag DMA_ATTR_MAY_BLOCK that device drivers use to
> > indicate that a DMA map call is allowed to block, and therefore can be
> > throttled. They update swiotlb_tbl_map_single() to detect this flag and
> > implement the throttling. Similarly, swiotlb_tbl_unmap_single() is
> > updated to handle a previously throttled request that has now freed
> > its swiotlb memory.
> > 
> > Patch 3 adds the dma_recommend_may_block() call that device drivers
> > can use to know if there's benefit in using the MAY_BLOCK option on
> > DMA map calls. If not in a CoCo VM, this call returns "false" because
> > swiotlb is not being used for all DMA I/O. This allows the driver to
> > set the BLK_MQ_F_BLOCKING flag on blk-mq request queues only when
> > there is benefit.
> > 
> > Patch 4 updates the SCSI-specific DMA map calls to add a "_attrs"
> > variant to allow passing the MAY_BLOCK attribute.
> > 
> > Patch 5 adds the MAY_BLOCK option to the Hyper-V storvsc driver, which
> > is used for storage in CoCo VMs in the Azure public cloud.
> > 
> > Patches 6 and 7 add the MAY_BLOCK option to the NVMe PCI host driver.
> > 
> > Discussion
> > ==========
> > * Since swiotlb isn't visible to device drivers, I've specifically
> > named the DMA attribute as MAY_BLOCK instead of MAY_THROTTLE or
> > something swiotlb specific. While this patch set consumes MAY_BLOCK
> > only on the DMA direct path to do throttling in the swiotlb code,
> > there might be other uses in the future outside of CoCo VMs, or
> > perhaps on the IOMMU path.
> > 
> > * The swiotlb throttling code in this patch set throttles by
> > serializing the use of swiotlb memory when usage is above a designated
> > threshold: i.e., only one new swiotlb request is allowed to proceed at
> > a time. When the corresponding unmap is done to release its swiotlb
> > memory, the next request is allowed to proceed. This serialization is
> > global and without knowledge of swiotlb areas. From a storage I/O
> > performance standpoint, the serialization is a bit restrictive, but
> > the code isn't trying to optimize for being above the threshold. If a
> > workload regularly runs above the threshold, the size of the swiotlb
> > memory should be increased.
> > 
> > * Except for knowing how much swiotlb memory is currently allocated,
> > throttle accounting is done without locking or atomic operations. For
> > example, multiple requests could proceed in parallel when usage is
> > just under the threshold, putting usage above the threshold by the
> > aggregate size of the parallel requests. The threshold must already be
> > set relatively conservatively because of drivers that can't enable
> > throttling, so this slop in the accounting shouldn't be a problem.
> > It's better than the potential bottleneck of a globally synchronized
> > reservation mechanism.
> > 
> > * In a CoCo VM, mapping a scatter/gather list makes an independent
> > swiotlb request for each entry. Throttling each independent request
> > wouldn't really work, so the code throttles only the first SGL entry.
> > Once that entry passes any throttle, subsequent entries in the SGL
> > proceed without throttling. When the SGL is unmapped, entries 1 thru
> > N-1 are unmapped first, then entry 0 is unmapped, allowing the next
> > serialized request to proceed.
> > 
> > Open Topics
> > ===========
> > 1. swiotlb allocations from Xen and the IOMMU code don't make use of
> > throttling. This could be added if beneficial.
> > 
> > 2. The throttling values are currently exposed and adjustable in
> > /sys/kernel/debug/swiotlb. Should any of this be moved so it is
> > visible even without CONFIG_DEBUG_FS?
> > 
> > 3. I have not changed the current heuristic for the swiotlb memory
> > size in CoCo VMs. It's not clear to me how to link this to whether the
> > key storage drivers have been updated to allow throttling. For now,
> > the benefit of reduced swiotlb memory size must be realized using the
> > swiotlb= kernel boot line option.
> > 
> > 4. I need to update the swiotlb documentation to describe throttling.
> > 
> > This patch set is built against linux-next-20240816.
> > 
> > Michael Kelley (7):
> >    swiotlb: Introduce swiotlb throttling
> >    dma: Handle swiotlb throttling for SGLs
> >    dma: Add function for drivers to know if allowing blocking is useful
> >    scsi_lib_dma: Add _attrs variant of scsi_dma_map()
> >    scsi: storvsc: Enable swiotlb throttling
> >    nvme: Move BLK_MQ_F_BLOCKING indicator to struct nvme_ctrl
> >    nvme: Enable swiotlb throttling for NVMe PCI devices
> > 
> >   drivers/nvme/host/core.c    |   4 +-
> >   drivers/nvme/host/nvme.h    |   2 +-
> >   drivers/nvme/host/pci.c     |  18 ++++--
> >   drivers/nvme/host/tcp.c     |   3 +-
> >   drivers/scsi/scsi_lib_dma.c |  13 ++--
> >   drivers/scsi/storvsc_drv.c  |   9 ++-
> >   include/linux/dma-mapping.h |  13 ++++
> >   include/linux/swiotlb.h     |  15 ++++-
> >   include/scsi/scsi_cmnd.h    |   7 ++-
> >   kernel/dma/Kconfig          |  13 ++++
> >   kernel/dma/direct.c         |  41 +++++++++++--
> >   kernel/dma/direct.h         |   1 +
> >   kernel/dma/mapping.c        |  10 ++++
> >   kernel/dma/swiotlb.c        | 114 ++++++++++++++++++++++++++++++++----
> >   14 files changed, 227 insertions(+), 36 deletions(-)
> >

Michael Kelley Aug. 28, 2024, 4:30 p.m. UTC | #17

From: Petr Tesařík <petr@tesarici.cz> Sent: Wednesday, August 28, 2024 6:04 AM
> 
> On Wed, 28 Aug 2024 13:02:31 +0100
> Robin Murphy <robin.murphy@arm.com> wrote:
> 
> > On 2024-08-22 7:37 pm, mhkelley58@gmail.com wrote:
> > > From: Michael Kelley <mhklinux@outlook.com>
> > >
> > > Background
> > > ==========
> > > Linux device drivers may make DMA map/unmap calls in contexts that
> > > cannot block, such as in an interrupt handler. Consequently, when a
> > > DMA map call must use a bounce buffer, the allocation of swiotlb
> > > memory must always succeed immediately. If swiotlb memory is
> > > exhausted, the DMA map call cannot wait for memory to be released. The
> > > call fails, which usually results in an I/O error.
> > >
> > > Bounce buffers are usually used infrequently for a few corner cases,
> > > so the default swiotlb memory allocation of 64 MiB is more than
> > > sufficient to avoid running out and causing errors. However, recently
> > > introduced Confidential Computing (CoCo) VMs must use bounce buffers
> > > for all DMA I/O because the VM's memory is encrypted. In CoCo VMs
> > > a new heuristic allocates ~6% of the VM's memory, up to 1 GiB, for
> > > swiotlb memory. This large allocation reduces the likelihood of a
> > > spike in usage causing DMA map failures. Unfortunately for most
> > > workloads, this insurance against spikes comes at the cost of
> > > potentially "wasting" hundreds of MiB's of the VM's memory, as swiotlb
> > > memory can't be used for other purposes.
> > >
> > > Approach
> > > ========
> > > The goal is to significantly reduce the amount of memory reserved as
> > > swiotlb memory in CoCo VMs, while not unduly increasing the risk of
> > > DMA map failures due to memory exhaustion.
> >
> > Isn't that fundamentally the same thing that SWIOTLB_DYNAMIC was already
> > meant to address? Of course the implementation of that is still young
> > and has plenty of scope to be made more effective, and some of the ideas
> > here could very much help with that, but I'm struggling a little to see
> > what's really beneficial about having a completely disjoint mechanism
> > for sitting around doing nothing in the precise circumstances where it
> > would seem most possible to allocate a transient buffer and get on with it.
> 
> This question can be probably best answered by Michael, but let me give
> my understanding of the differences. First the similarity: Yes, one
> of the key new concepts is that swiotlb allocation may block, and I
> introduced a similar attribute in one of my dynamic SWIOTLB patches; it
> was later dropped, but dynamic SWIOTLB would still benefit from it.
> 
> More importantly, dynamic SWIOTLB may deplete memory following an I/O
> spike. I do have some ideas how memory could be returned back to the
> allocator, but the code is not ready (unlike this patch series).
> Moreover, it may still be a better idea to throttle the devices
> instead, because returning DMA'able memory is not always cheap. In a
> CoCo VM, this memory must be re-encrypted, and that requires a
> hypercall that I'm told is expensive.
> 
> In short, IIUC it is faster in a CoCo VM to delay some requests a bit
> than to grow the swiotlb.
> 
> Michael, please add your insights.
> 
> Petr T
> 

The other limitation of SWIOTLB_DYNAMIC is that growing swiotlb
memory requires large chunks of physically contiguous memory,
which may be impossible to get after a system has been running a
while. With a major rework of swiotlb memory allocation code, it might
be possible to get by with a piecewise assembly of smaller contiguous
memory chunks, but getting many smaller chunks could also be
challenging.

Growing swiotlb memory also must be done as a background async
operation if the DMA map operation can't block. So transient buffers
are needed, which must be encrypted and decrypted on every round
trip in a CoCo VM. The transient buffer memory comes from the
atomic pool, which typically isn't that large and could itself become
exhausted. So we're somewhat playing whack-a-mole on the memory
allocation problem.

We discussed the limitations of SWIOTLB_DYNAMIC in large CoCo VMs
at the time SWIOTLB_DYNAMIC was being developed, and I think there
was general agreement that throttling would be better for the CoCo
VM scenario.

Broadly, throttling DMA map requests seems like a fundamentally more
robust approach than growing swiotlb memory. And starting down
the path of allowing designated DMA map requests to block might have
broader benefits as well, perhaps on the IOMMU path.

These points are all arguable, and your point about having two somewhat
overlapping mechanisms is valid. Between the two, my personal viewpoint
is that throttling is the better approach, but I'm probably biased by my
background in the CoCo VM world. Petr and others may see the tradeoffs
differently.

Michael

Petr Tesařík Aug. 28, 2024, 4:41 p.m. UTC | #18

On Wed, 28 Aug 2024 16:30:04 +0000
Michael Kelley <mhklinux@outlook.com> wrote:

> From: Petr Tesařík <petr@tesarici.cz> Sent: Wednesday, August 28, 2024 6:04 AM
> > 
> > On Wed, 28 Aug 2024 13:02:31 +0100
> > Robin Murphy <robin.murphy@arm.com> wrote:
> >   
> > > On 2024-08-22 7:37 pm, mhkelley58@gmail.com wrote:  
> > > > From: Michael Kelley <mhklinux@outlook.com>
> > > >
> > > > Background
> > > > ==========
> > > > Linux device drivers may make DMA map/unmap calls in contexts that
> > > > cannot block, such as in an interrupt handler. Consequently, when a
> > > > DMA map call must use a bounce buffer, the allocation of swiotlb
> > > > memory must always succeed immediately. If swiotlb memory is
> > > > exhausted, the DMA map call cannot wait for memory to be released. The
> > > > call fails, which usually results in an I/O error.
> > > >
> > > > Bounce buffers are usually used infrequently for a few corner cases,
> > > > so the default swiotlb memory allocation of 64 MiB is more than
> > > > sufficient to avoid running out and causing errors. However, recently
> > > > introduced Confidential Computing (CoCo) VMs must use bounce buffers
> > > > for all DMA I/O because the VM's memory is encrypted. In CoCo VMs
> > > > a new heuristic allocates ~6% of the VM's memory, up to 1 GiB, for
> > > > swiotlb memory. This large allocation reduces the likelihood of a
> > > > spike in usage causing DMA map failures. Unfortunately for most
> > > > workloads, this insurance against spikes comes at the cost of
> > > > potentially "wasting" hundreds of MiB's of the VM's memory, as swiotlb
> > > > memory can't be used for other purposes.
> > > >
> > > > Approach
> > > > ========
> > > > The goal is to significantly reduce the amount of memory reserved as
> > > > swiotlb memory in CoCo VMs, while not unduly increasing the risk of
> > > > DMA map failures due to memory exhaustion.  
> > >
> > > Isn't that fundamentally the same thing that SWIOTLB_DYNAMIC was already
> > > meant to address? Of course the implementation of that is still young
> > > and has plenty of scope to be made more effective, and some of the ideas
> > > here could very much help with that, but I'm struggling a little to see
> > > what's really beneficial about having a completely disjoint mechanism
> > > for sitting around doing nothing in the precise circumstances where it
> > > would seem most possible to allocate a transient buffer and get on with it.  
> > 
> > This question can be probably best answered by Michael, but let me give
> > my understanding of the differences. First the similarity: Yes, one
> > of the key new concepts is that swiotlb allocation may block, and I
> > introduced a similar attribute in one of my dynamic SWIOTLB patches; it
> > was later dropped, but dynamic SWIOTLB would still benefit from it.
> > 
> > More importantly, dynamic SWIOTLB may deplete memory following an I/O
> > spike. I do have some ideas how memory could be returned back to the
> > allocator, but the code is not ready (unlike this patch series).
> > Moreover, it may still be a better idea to throttle the devices
> > instead, because returning DMA'able memory is not always cheap. In a
> > CoCo VM, this memory must be re-encrypted, and that requires a
> > hypercall that I'm told is expensive.
> > 
> > In short, IIUC it is faster in a CoCo VM to delay some requests a bit
> > than to grow the swiotlb.
> > 
> > Michael, please add your insights.
> > 
> > Petr T
> >   
> 
> The other limitation of SWIOTLB_DYNAMIC is that growing swiotlb
> memory requires large chunks of physically contiguous memory,
> which may be impossible to get after a system has been running a
> while. With a major rework of swiotlb memory allocation code, it might
> be possible to get by with a piecewise assembly of smaller contiguous
> memory chunks, but getting many smaller chunks could also be
> challenging.
> 
> Growing swiotlb memory also must be done as a background async
> operation if the DMA map operation can't block. So transient buffers
> are needed, which must be encrypted and decrypted on every round
> trip in a CoCo VM. The transient buffer memory comes from the
> atomic pool, which typically isn't that large and could itself become
> exhausted. So we're somewhat playing whack-a-mole on the memory
> allocation problem.

Note that this situation can be somewhat improved with the
SWIOTLB_ATTR_MAY_BLOCK flag, because a new SWIOTLB chunk can then be
allocated immediately, removing the need to allocate a transient pool
from the atomic pool.

> We discussed the limitations of SWIOTLB_DYNAMIC in large CoCo VMs
> at the time SWIOTLB_DYNAMIC was being developed, and I think there
> was general agreement that throttling would be better for the CoCo
> VM scenario.
> 
> Broadly, throttling DMA map requests seems like a fundamentally more
> robust approach than growing swiotlb memory. And starting down
> the path of allowing designated DMA map requests to block might have
> broader benefits as well, perhaps on the IOMMU path.
> 
> These points are all arguable, and your point about having two somewhat
> overlapping mechanisms is valid. Between the two, my personal viewpoint
> is that throttling is the better approach, but I'm probably biased by my
> background in the CoCo VM world. Petr and others may see the tradeoffs
> differently.

For CoCo VMs, throttling indeed seems to be better. Embedded devices
seem to benefit more from growing the swiotlb on demand.

As usual, YMMV.

Petr T

Robin Murphy Aug. 28, 2024, 7:50 p.m. UTC | #19

On 2024-08-28 2:03 pm, Petr Tesařík wrote:
> On Wed, 28 Aug 2024 13:02:31 +0100
> Robin Murphy <robin.murphy@arm.com> wrote:
> 
>> On 2024-08-22 7:37 pm, mhkelley58@gmail.com wrote:
>>> From: Michael Kelley <mhklinux@outlook.com>
>>>
>>> Background
>>> ==========
>>> Linux device drivers may make DMA map/unmap calls in contexts that
>>> cannot block, such as in an interrupt handler. Consequently, when a
>>> DMA map call must use a bounce buffer, the allocation of swiotlb
>>> memory must always succeed immediately. If swiotlb memory is
>>> exhausted, the DMA map call cannot wait for memory to be released. The
>>> call fails, which usually results in an I/O error.
>>>
>>> Bounce buffers are usually used infrequently for a few corner cases,
>>> so the default swiotlb memory allocation of 64 MiB is more than
>>> sufficient to avoid running out and causing errors. However, recently
>>> introduced Confidential Computing (CoCo) VMs must use bounce buffers
>>> for all DMA I/O because the VM's memory is encrypted. In CoCo VMs
>>> a new heuristic allocates ~6% of the VM's memory, up to 1 GiB, for
>>> swiotlb memory. This large allocation reduces the likelihood of a
>>> spike in usage causing DMA map failures. Unfortunately for most
>>> workloads, this insurance against spikes comes at the cost of
>>> potentially "wasting" hundreds of MiB's of the VM's memory, as swiotlb
>>> memory can't be used for other purposes.
>>>
>>> Approach
>>> ========
>>> The goal is to significantly reduce the amount of memory reserved as
>>> swiotlb memory in CoCo VMs, while not unduly increasing the risk of
>>> DMA map failures due to memory exhaustion.
>>
>> Isn't that fundamentally the same thing that SWIOTLB_DYNAMIC was already
>> meant to address? Of course the implementation of that is still young
>> and has plenty of scope to be made more effective, and some of the ideas
>> here could very much help with that, but I'm struggling a little to see
>> what's really beneficial about having a completely disjoint mechanism
>> for sitting around doing nothing in the precise circumstances where it
>> would seem most possible to allocate a transient buffer and get on with it.
> 
> This question can be probably best answered by Michael, but let me give
> my understanding of the differences. First the similarity: Yes, one
> of the key new concepts is that swiotlb allocation may block, and I
> introduced a similar attribute in one of my dynamic SWIOTLB patches; it
> was later dropped, but dynamic SWIOTLB would still benefit from it.
> 
> More importantly, dynamic SWIOTLB may deplete memory following an I/O
> spike. I do have some ideas how memory could be returned back to the
> allocator, but the code is not ready (unlike this patch series).
> Moreover, it may still be a better idea to throttle the devices
> instead, because returning DMA'able memory is not always cheap. In a
> CoCo VM, this memory must be re-encrypted, and that requires a
> hypercall that I'm told is expensive.

Sure, making a hypercall in order to progress is expensive relative to 
being able to progress without doing that, but waiting on a lock for an 
unbounded time in the hope that other drivers might release their DMA 
mappings soon represents a potentially unbounded expense, since it 
doesn't even carry any promise of progress at all - oops userspace just 
filled up SWIOTLB with a misguided dma-buf import and now the OS has 
livelocked on stalled I/O threads fighting to retry :(

As soon as we start tracking thresholds etc. then that should equally 
put us in the position to be able to manage the lifecycle of both 
dynamic and transient pools more effectively - larger allocations which 
can be reused by multiple mappings until the I/O load drops again could 
amortise that initial cost quite a bit.

Furthermore I'm not entirely convinced that the rationale for throttling 
being beneficial is even all that sound. Serialising requests doesn't 
make them somehow use less memory, it just makes them use it... 
serially. If a single CPU is capable of queueing enough requests at once 
to fill the SWIOTLB, this is going to do absolutely nothing; if two CPUs 
are capable of queueing enough requests together to fill the SWIOTLB, 
making them take slightly longer to do so doesn't inherently mean 
anything more than reaching the same outcome more slowly. At worst, if a 
thread is blocked from polling for completion and releasing a bunch of 
mappings of already-finished descriptors because it's stuck on an unfair 
lock trying to get one last one submitted, then throttling has actively 
harmed the situation.

AFAICS this is dependent on rather particular assumptions of driver 
behaviour in terms of DMA mapping patterns and interrupts, plus the 
overall I/O workload shape, and it's not clear to me how well that 
really generalises.

> In short, IIUC it is faster in a CoCo VM to delay some requests a bit
> than to grow the swiotlb.

I'm not necessarily disputing that for the cases where the assumptions 
do hold, it's still more a question of why those two things should be 
separate and largely incompatible (I've only skimmed the patches here, 
but my impression is that it doesn't look like they'd play all that 
nicely together if both enabled). To me it would make far more sense for 
this to be a tuneable policy of a more holistic SWIOTLB_DYNAMIC itself, 
i.e. blockable calls can opportunistically wait for free space up to a 
well-defined timeout, but then also fall back to synchronously 
allocating a new pool in order to assure a definite outcome of success 
or system-is-dying-level failure.

Thanks,
Robin.

Michael Kelley Aug. 30, 2024, 3:58 a.m. UTC | #20

From: Robin Murphy <robin.murphy@arm.com> Sent: Wednesday, August 28, 2024 12:50 PM
> 
> On 2024-08-28 2:03 pm, Petr Tesařík wrote:
> > On Wed, 28 Aug 2024 13:02:31 +0100
> > Robin Murphy <robin.murphy@arm.com> wrote:
> >
> >> On 2024-08-22 7:37 pm, mhkelley58@gmail.com wrote:
> >>> From: Michael Kelley <mhklinux@outlook.com>
> >>>
> >>> Background
> >>> ==========
> >>> Linux device drivers may make DMA map/unmap calls in contexts that
> >>> cannot block, such as in an interrupt handler. Consequently, when a
> >>> DMA map call must use a bounce buffer, the allocation of swiotlb
> >>> memory must always succeed immediately. If swiotlb memory is
> >>> exhausted, the DMA map call cannot wait for memory to be released. The
> >>> call fails, which usually results in an I/O error.
> >>>
> >>> Bounce buffers are usually used infrequently for a few corner cases,
> >>> so the default swiotlb memory allocation of 64 MiB is more than
> >>> sufficient to avoid running out and causing errors. However, recently
> >>> introduced Confidential Computing (CoCo) VMs must use bounce buffers
> >>> for all DMA I/O because the VM's memory is encrypted. In CoCo VMs
> >>> a new heuristic allocates ~6% of the VM's memory, up to 1 GiB, for
> >>> swiotlb memory. This large allocation reduces the likelihood of a
> >>> spike in usage causing DMA map failures. Unfortunately for most
> >>> workloads, this insurance against spikes comes at the cost of
> >>> potentially "wasting" hundreds of MiB's of the VM's memory, as swiotlb
> >>> memory can't be used for other purposes.
> >>>
> >>> Approach
> >>> ========
> >>> The goal is to significantly reduce the amount of memory reserved as
> >>> swiotlb memory in CoCo VMs, while not unduly increasing the risk of
> >>> DMA map failures due to memory exhaustion.
> >>
> >> Isn't that fundamentally the same thing that SWIOTLB_DYNAMIC was already
> >> meant to address? Of course the implementation of that is still young
> >> and has plenty of scope to be made more effective, and some of the ideas
> >> here could very much help with that, but I'm struggling a little to see
> >> what's really beneficial about having a completely disjoint mechanism
> >> for sitting around doing nothing in the precise circumstances where it
> >> would seem most possible to allocate a transient buffer and get on with it.
> >
> > This question can be probably best answered by Michael, but let me give
> > my understanding of the differences. First the similarity: Yes, one
> > of the key new concepts is that swiotlb allocation may block, and I
> > introduced a similar attribute in one of my dynamic SWIOTLB patches; it
> > was later dropped, but dynamic SWIOTLB would still benefit from it.
> >
> > More importantly, dynamic SWIOTLB may deplete memory following an I/O
> > spike. I do have some ideas how memory could be returned back to the
> > allocator, but the code is not ready (unlike this patch series).
> > Moreover, it may still be a better idea to throttle the devices
> > instead, because returning DMA'able memory is not always cheap. In a
> > CoCo VM, this memory must be re-encrypted, and that requires a
> > hypercall that I'm told is expensive.
> 
> Sure, making a hypercall in order to progress is expensive relative to
> being able to progress without doing that, but waiting on a lock for an
> unbounded time in the hope that other drivers might release their DMA
> mappings soon represents a potentially unbounded expense, since it
> doesn't even carry any promise of progress at all 

FWIW, the implementation in this patch set guarantees forward
progress for throttled requests as long as drivers that use MAY_BLOCK
are well-behaved.

> - oops userspace just
> filled up SWIOTLB with a misguided dma-buf import and now the OS has
> livelocked on stalled I/O threads fighting to retry :(
> 
> As soon as we start tracking thresholds etc. then that should equally
> put us in the position to be able to manage the lifecycle of both
> dynamic and transient pools more effectively - larger allocations which
> can be reused by multiple mappings until the I/O load drops again could
> amortise that initial cost quite a bit.

I'm not understanding what you envision here. Could you elaborate?
With the current implementation of SWIOTLB_DYNAMIC, dynamic
pools are already allocated with size MAX_PAGE_ORDER (or smaller
if that size isn't available). That size really isn't big enough in CoCo
VMs with more than 16 vCPUs since we want to split the allocation
into per-CPU areas. To fix this, we would need to support swiotlb
pools that are stitched together from multiple contiguous physical
memory ranges. That probably could be done, but I don't see how
it's related to thresholds.

> 
> Furthermore I'm not entirely convinced that the rationale for throttling
> being beneficial is even all that sound. Serialising requests doesn't
> make them somehow use less memory, it just makes them use it...
> serially. If a single CPU is capable of queueing enough requests at once
> to fill the SWIOTLB, this is going to do absolutely nothing; if two CPUs
> are capable of queueing enough requests together to fill the SWIOTLB,
> making them take slightly longer to do so doesn't inherently mean
> anything more than reaching the same outcome more slowly.

I don't get your point. My intent with throttling is that it caps the
system-wide high-water mark for swiotlb memory usage, without
causing I/O errors due to DMA map failures. Without
SWIOTLB_DYNAMIC, the original boot-time allocation size is the limit
for swiotlb memory usage, and DMA map fails if the system-wide
high-water mark tries to rise above that limit. With SWIOTLB_DYNAMIC,
the current code continues to allocate additional system memory and
turn it into swiotlb memory, with no limit. There probably *should*
be a limit, even for SWIOTLB_DYNAMIC.

I've run "fio" loads with and without throttling as implemented in this
patch set. Without SWIOTLB_DYNAMIC and no throttling, it's pretty
easy to reach the limit and get I/O errors due to DMA map failure. With
throttling and the same "fio" load, the usage high-water mark stays
near the throttling threshold, with no I/O errors. The limit should be
set large enough for a workload to operate below the throttling
threshold. But if the threshold is exceeded, throttling should avoid a
big failure due to DMA map failures.

My mental model here is somewhat like blk-mq tags. There's a fixed
number allocated with the storage controller. Block I/O requests must
get a tag, and if one isn't available, the requesting thread is pended
until one becomes available. The fixed number of tags is the limit, but
the requestor doesn't get an error if a tag isn't available -- it just
waits. The fixed number of tags necessarily imposes a kind of
resource limit on block I/O requests, rather than just always allocating
an additional tag if there's a request that can't get an existing tag. I think
the same model makes sense for swiotlb memory when the device
driver can support it.

> At worst, if a
> thread is blocked from polling for completion and releasing a bunch of
> mappings of already-finished descriptors because it's stuck on an unfair
> lock trying to get one last one submitted, then throttling has actively
> harmed the situation.

OK, yes, I can understand there might be an issue with a driver (like
NVMe) that supports polling. I'll look at that more closely and see
if there is.

> 
> AFAICS this is dependent on rather particular assumptions of driver
> behaviour in terms of DMA mapping patterns and interrupts, plus the
> overall I/O workload shape, and it's not clear to me how well that
> really generalises.
> 
> > In short, IIUC it is faster in a CoCo VM to delay some requests a bit
> > than to grow the swiotlb.
> 
> I'm not necessarily disputing that for the cases where the assumptions
> do hold, it's still more a question of why those two things should be
> separate and largely incompatible (I've only skimmed the patches here,
> but my impression is that it doesn't look like they'd play all that
> nicely together if both enabled).

As I've mulled over your comments the past day, I'm not sure the two
things really are incompatible or even overlapping. To me it seems like
SWIOTLB_DYNAMIC is about whether the swiotlb memory is pre-allocated
at boot time, or allocated as needed. But SWIOTLB_DYNAMIC currently
doesn't have a limit to how much it will allocate, and it probably should.
(Though SWIOTLB_DYNAMIC has a limit imposed upon it if there's isn't
enough contiguous physical memory to grow the swiotlb pool.) Given a
limit, both the pre-allocate case and allocate-as-needed case have the
same question about what to do when the limit is reached. In both cases,
we're generally forced to set the limit pretty high, because DMA map
failures occur if you broach the limit. Throttling is about dealing with
the limit in a better way when permitted by the driver. That in turn
allows setting a tighter limit and not having to overprovision the
swiotlb memory.

> To me it would make far more sense for
> this to be a tuneable policy of a more holistic SWIOTLB_DYNAMIC itself,
> i.e. blockable calls can opportunistically wait for free space up to a
> well-defined timeout, but then also fall back to synchronously
> allocating a new pool in order to assure a definite outcome of success
> or system-is-dying-level failure.

Yes, I can see value in the kind of interaction you describe before any
limits are approached. But to me the primary question is dealing better
with the limit.

Michael 

> 
> Thanks,
> Robin.

[RFC,0/7] Introduce swiotlb throttling

Message

Comments