Message ID: 20220805162444.3985535-1-kbusch@fb.com (mailing list archive)
Series: dma mapping optimisations
I finally found some time to look over this, and I have some very high
level problems with the implementation:

 - The DMA streaming ops (dma_map_*) are only intended for short term
   mappings, not for long living ones that can lead to starvation.
   Especially with swiotlb, but also with some IOMMUs. I think this
   needs to be changed to an API that can allocate DMA mapped memory
   using dma_alloc_noncoherent/dma_alloc_noncontiguous for the device
   and then give the user access to that instead.

 - I really do not like the special casing in the bio. Did you try to
   just stash away the DMA mapping information in an efficient lookup
   data structure (e.g. rhashtable, but details might need analysis and
   benchmarking) and thus touch very little code outside of the driver
   I/O path and the place that performs the mapping?

 - The design seems to ignore DMA ownership. Every time data needs to
   be transferred to and from the device, ownership has to change
   hands; take a look at Documentation/core-api/dma-api.rst and
   Documentation/core-api/dma-api-howto.rst.

As for the multiple devices discussion: mapping memory for multiple
devices is possible, but nontrivial to get right, mostly due to the
ownership rules. So unless we have a really good reason I'd suggest not
supporting this initially.
On Tue, Aug 09, 2022 at 08:46:13AM +0200, Christoph Hellwig wrote:
> - the design seems to ignore DMA ownership. Every time data in
> transfered data needs to be transferred to and from the device,
> take a look at Documentation/core-api/dma-api.rst and
> Documentation/core-api/dma-api-howto.rst.

I have this doing the appropriate dma_sync_single_for_{device,cpu} if we
aren't using coherent memory. Is there more to ownership beyond that?
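For readers following along, the ownership rule from Documentation/core-api/dma-api-howto.rst can be modeled in userspace: with a long-lived streaming mapping, the CPU may only touch the buffer between dma_sync_single_for_cpu() and dma_sync_single_for_device(), and the device only in the opposite window. The sync functions below are local stand-ins (note the `_model` suffix) that just track the current owner; they are not the kernel API.

```c
#include <assert.h>

enum dma_owner { OWNER_DEVICE, OWNER_CPU };

/* A freshly mapped streaming buffer belongs to the device. */
static enum dma_owner owner = OWNER_DEVICE;

static void dma_sync_single_for_cpu_model(void)
{
	assert(owner == OWNER_DEVICE);  /* CPU must not already own it */
	owner = OWNER_CPU;
}

static void dma_sync_single_for_device_model(void)
{
	assert(owner == OWNER_CPU);     /* device must not already own it */
	owner = OWNER_DEVICE;
}

/* One read I/O round trip on a long-lived premapped buffer. */
static void do_read_io(void)
{
	/* ... device DMAs into the buffer while it owns it ... */
	dma_sync_single_for_cpu_model();    /* completion: CPU may read now */
	/* ... CPU consumes the data here ... */
	dma_sync_single_for_device_model(); /* hand back for the next I/O */
}
```

The asserts encode why this is sufficient for a single device: each sync is a strict ownership handoff, so as long as every I/O brackets CPU access this way, there is no window where both sides touch the buffer.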
On Tue, Aug 09, 2022 at 08:46:13AM +0200, Christoph Hellwig wrote:
> - the DMA streaming ops (dma_map_*) are only intended for short term
> mappings, not for long living ones that can lead to starvation.
> Especially with swiotlb, but also with some IOMMUs. I think this
> needs to be changed to an API that can allocate DMA mapped memory
> using dma_alloc_noncoherent/dma_alloc_noncontigious for the device
> and then give access to that to the user instead

I am a bit reluctant to require a new memory allocator to use the
feature. I'm also generally not concerned about the odd resource
constrained IOMMU. User space drivers pre-map all their memory resources
up front, and this is essentially the same concept.

For swiotlb, though, we can error out the mapping if the requested
memory uses swiotlb with the device: the driver's .dma_map() can return
ENOTSUPP if is_swiotlb_buffer() is true. Would that be more acceptable?
On Tue, Aug 09, 2022 at 08:18:45AM -0600, Keith Busch wrote:
> On Tue, Aug 09, 2022 at 08:46:13AM +0200, Christoph Hellwig wrote:
> > - the design seems to ignore DMA ownership. Every time data in
> > transfered data needs to be transferred to and from the device,
> > take a look at Documentation/core-api/dma-api.rst and
> > Documentation/core-api/dma-api-howto.rst.
>
> I have this doing appropriate dma_sync_single_for_{device,cpu} if we aren't
> using coherent memory. Is there more to ownership beyond that?

As long as we only ever support a single device that is fine, but if we
want to support multiple devices it gets complicated. Sorry, this should
have been integrated with my mumblings on the multiple device mappings,
as the statement is false without that context.
On Tue, Aug 09, 2022 at 10:46:04AM -0600, Keith Busch wrote:
> I am a bit reluctant to require a new memory allocator to use the feature. I'm
> also generally not concerned about the odd resource constrained IOMMU. User
> space drivers pre-map all their memory resources up front, and this is
> essentially the same concept.

Yes, it really isn't an issue for the modern iommus that support vfio,
but it is a limitation of the dma-api. The old iommus aren't really the
issue, but..

> For swiotlb, though, we can error out the mapping if the requested memory uses
> swiotlb with the device: the driver's .dma_map() can return ENOTSUPP if
> is_swiotlb_buffer() is true. Would that be more acceptable?

No, is_swiotlb_buffer and similar are not exported APIs. More
importantly, with the various secure hypervisor schemes, swiotlb use is
unfortunately massively increasing these days. On those systems all
streaming mappings use swiotlb, and the only way to get any kind of
half-decent I/O performance would be the "special" premapped allocator,
which is another reason why I'd like to see it.
On Tue, Aug 09, 2022 at 08:41:37PM +0200, Christoph Hellwig wrote:
> On Tue, Aug 09, 2022 at 10:46:04AM -0600, Keith Busch wrote:
> > For swiotlb, though, we can error out the mapping if the requested memory uses
> > swiotlb with the device: the driver's .dma_map() can return ENOTSUPP if
> > is_swiotlb_buffer() is true. Would that be more acceptable?
>
> No, is_swiotlb_buffer and similar are not exported APIs.

The functions are implemented under 'include/linux/', indistinguishable
from exported APIs. I think I understand why they are there, but they
look the same as exported functions from a driver perspective.

> More importantly with the various secure hypervisor schemes swiotlb is
> unfortunately actually massively increasing these days. On those systems all
> streaming mappings use swiotlb. And the only way to get any kind of
> half-decent I/O performance would be the "special" premapped allocator, which
> is another reason why I'd like to see it.

Perhaps I'm being daft, but I'm totally missing why I should care if
swiotlb leverages this feature. If you're using that, you've traded
performance for security or compatibility already. If this idea can be
used to make it perform better, then great, but that shouldn't be the
reason to hold this up, IMO.

This optimization needs to be easy to reach if we expect anyone to use
it. Working with arbitrary user addresses with minimal additions to the
user ABI was deliberate. If you want a special allocator, we can always
add one later; this series doesn't affect that.

If this has the potential to starve system resources, though, I can
constrain it to specific users like CAP_SYS_ADMIN, or maybe only memory
allocated from hugetlbfs. Or perhaps a more complicated scheme of
shuffling dma mapping resources on demand, if that is an improvement
over the status quo.
On Wed, Aug 10, 2022 at 12:05:05PM -0600, Keith Busch wrote:
> The functions are implemented under 'include/linux/', indistinguishable from
> exported APIs. I think I understand why they are there, but they look the same
> as exported functions from a driver perspective.

swiotlb.h is not a driver API. There are two leftovers used by the drm
code I'm trying to get fixed up, but in general the DMA API is the
interface and swiotlb is just an implementation detail.

> Perhaps I'm being daft, but I'm totally missing why I should care if swiotlb
> leverages this feature. If you're using that, you've traded performance for
> security or compatibility already. If this idea can be used to make it perform
> better, then great, but that shouldn't be the reason to hold this up IMO.

We first need to make sure that everything actually works on swiotlb, or
any other implementation that properly implements the DMA API.

And the fact that I/O performance currently sucks and we can fix it on
the trusted hypervisor is an important consideration. At least as
important as micro-optimizing performance a little more on setups not
using them. So not taking care of both in one go seems rather silly for
a feature that is in its current form pretty intrusive and thus needs a
really good justification.

> This optimization needs to be easy to reach if we expect anyone to use it.
> Working with arbitrary user addresses with minimal additions to the user ABI
> was deliberate. If you want a special allocator, we can always add one later;
> this series doesn't affect that.
>
> If this has potential to starve system resource though, I can constrain it to
> specific users like CAP_SYS_ADMIN, or maybe only memory allocated from
> hugetlbfs. Or perhaps a more complicated scheme of shuffling dma mapping
> resources on demand if that is an improvement over the status quo.

Or just not bother with it at all.
Because with all those limits it really does not seem to be worth adding
an entirely new type of bio payload to the block layer and a lot of
boilerplate to drivers.
On Thu, Aug 11, 2022 at 09:22:32AM +0200, Christoph Hellwig wrote:
> On Wed, Aug 10, 2022 at 12:05:05PM -0600, Keith Busch wrote:
> > The functions are implemented under 'include/linux/', indistinguishable from
> > exported APIs. I think I understand why they are there, but they look the same
> > as exported functions from a driver perspective.
>
> swiotlb.h is not a driver API. There's two leftovers used by the drm
> code I'm trying to get fixed up, but in general the DMA API is the
> interface and swiotlb is just an implementation detail.
>
> > Perhaps I'm being daft, but I'm totally missing why I should care if swiotlb
> > leverages this feature. If you're using that, you've traded performance for
> > security or compatibility already. If this idea can be used to make it perform
> > better, then great, but that shouldn't be the reason to hold this up IMO.
>
> We firstly need to make sure that everything actually works on swiotlb, or
> any other implementation that properly implements the DMA API.
>
> And the fact that I/O performance currently sucks and we can fix it on
> the trusted hypervisor is an important consideration. At least as
> importantant as micro-optimizing performance a little more on setups
> not using them. So not taking care of both in one go seems rather silly
> for a feature that is in its current form pretty intrusive and thus needs
> a really good justification.

Sorry for the delayed response; I had some trouble with my test setup.

Okay, I will restart developing this with swiotlb in mind. In the
meantime, I wanted to share some results with this series, because I'm
thinking this might be past the threshold where we can drop the "micro-"
prefix on these optimisations.
The most significant data points are these:

 * submission latency stays the same regardless of the transfer size or
   depth
 * IOPs are always equal or better (usually better), with up to 50%
   reduced cpu cost

Based on this, I do think this type of optimisation is worth having
something like a new bio type. I know this introduces some complications
in the io-path, but it is pretty minimal and doesn't add any size
penalties to common structs for drivers that don't use them.

Test details:

  fio with ioengine=io_uring
  'none': using __user void*
  'bvec': using buf registered with IORING_REGISTER_BUFFERS
  'dma':  using buf registered with IORING_REGISTER_MAP_BUFFERS (new)
  intel_iommu=on

Results: (submission latency [slat] in nano-seconds)

Q-Depth 1:

  Size | Premap |   IOPs  |  slat  | sys-cpu%
  .....|........|.........|........|.........
    4k |  none  |  41.4k  |  2126  |  16.47%
       |  bvec  |  43.8k  |  1843  |  15.79%
       |  dma   |  46.8k  |  1504  |  14.94%
   16k |  none  |  33.3k  |  3279  |  17.78%
       |  bvec  |  33.9k  |  2607  |  14.59%
       |  dma   |  40.2k  |  1490  |  12.57%
   64k |  none  |  18.7k  |  6778  |  18.22%
       |  bvec  |  20.0k  |  4626  |  13.80%
       |  dma   |  22.6k  |  1586  |   7.58%

Q-Depth 16:

  Size | Premap |   IOPs  |  slat  | sys-cpu%
  .....|........|.........|........|.........
    4k |  none  |   207k  |  3657  |  72.81%
       |  bvec  |   219k  |  3369  |  71.55%
       |  dma   |   310k  |  2237  |  60.16%
   16k |  none  |   164k  |  5024  |  78.38%
       |  bvec  |   177k  |  4553  |  76.29%
       |  dma   |   186k  |  1880  |  43.56%
   64k |  none  |  46.7k  |  4424  |  30.51%
       |  bvec  |  46.7k  |  4389  |  29.42%
       |  dma   |  46.7k  |  1574  |  15.61%
From: Keith Busch <kbusch@kernel.org>

Changes since v2:

  Fixed incorrect io_uring io_fixed_file index validity checks: this
  should have been validating the file_ptr (Ammar)

  Various micro-optimizations: move up dma in iov type checks, skip
  iov_iter_advance on async IO (Jens)

  NVMe driver cleanups splitting the fast and slow paths

  NVMe driver prp list setup fixes when using the slow path

Summary:

A user address undergoes various representations for a typical read or
write command. Each consumes memory and CPU cycles. When the backing
storage is NVMe, the sequence looks something like the following:

  __user void *
  struct iov_iter
  struct pages[]
  struct bio_vec[]
  struct scatterlist[]
  __le64[]

Applications will often use the same buffer for many IOs, so these
potentially costly per-IO transformations to reach the exact same
hardware descriptor can be skipped.

The io_uring interface already provides a way for users to register
buffers to get to 'struct bio_vec[]'. That still leaves the scatterlist
needed for the repeated dma_map_sg(), then the transform to nvme's PRP
list format.

This series takes the registered buffers a step further. A block driver
can implement a new .dma_map() callback to reach the hardware's DMA
mapped address format, and return a cookie so a user can reference it
later for any given IO. When used, the block stack can skip significant
amounts of code, improving CPU utilization and IOPs.

The implementation is currently limited to mapping a registered buffer
to a single io_uring fixed file.
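To make the final `__le64[]` stage of the sequence concrete: an NVMe PRP list is essentially the 4KiB-aligned page addresses covering the buffer, where only the first entry may carry an in-page offset. The sketch below is a simplified userspace model of that arithmetic (no PRP-list chaining, no endian conversion, hypothetical function name), not the actual nvme-pci code from this series.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define NVME_PAGE_SIZE  4096u
#define NVME_PAGE_MASK  (NVME_PAGE_SIZE - 1)

/*
 * Fill prps[] with the PRP entries describing [dma_addr, dma_addr + len);
 * returns the number of entries written (at most 'max', max >= 1 assumed).
 */
static size_t build_prp_entries(uint64_t dma_addr, uint32_t len,
				uint64_t *prps, size_t max)
{
	/* bytes covered by the first entry, which keeps its page offset */
	uint32_t first = NVME_PAGE_SIZE - (uint32_t)(dma_addr & NVME_PAGE_MASK);
	size_t n = 0;

	prps[n++] = dma_addr;
	if (len <= first)
		return n;
	len -= first;

	/* every following entry is page aligned */
	dma_addr = (dma_addr & ~(uint64_t)NVME_PAGE_MASK) + NVME_PAGE_SIZE;
	while (len && n < max) {
		prps[n++] = dma_addr;
		dma_addr += NVME_PAGE_SIZE;
		len = len > NVME_PAGE_SIZE ? len - NVME_PAGE_SIZE : 0;
	}
	return n;
}
```

The point of the series is that for a premapped registered buffer this list (like the scatterlist and dma_map_sg() before it) only needs to be computed once rather than per IO.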
Keith Busch (7):
  blk-mq: add ops to dma map bvec
  file: add ops to dma map bvec
  iov_iter: introduce type for preregistered dma tags
  block: add dma tag bio type
  io_uring: introduce file slot release helper
  io_uring: add support for dma pre-mapping
  nvme-pci: implement dma_map support

 block/bdev.c                   |  20 +++
 block/bio.c                    |  24 ++-
 block/blk-merge.c              |  19 ++
 block/fops.c                   |  24 ++-
 drivers/nvme/host/pci.c        | 314 +++++++++++++++++++++++++++++++--
 fs/file.c                      |  15 ++
 include/linux/bio.h            |  22 ++-
 include/linux/blk-mq.h         |  24 +++
 include/linux/blk_types.h      |   6 +-
 include/linux/blkdev.h         |  16 ++
 include/linux/fs.h             |  20 +++
 include/linux/io_uring_types.h |   2 +
 include/linux/uio.h            |   9 +
 include/uapi/linux/io_uring.h  |  12 ++
 io_uring/filetable.c           |  34 ++--
 io_uring/filetable.h           |  10 +-
 io_uring/io_uring.c            | 139 +++++++++++++++
 io_uring/net.c                 |   2 +-
 io_uring/rsrc.c                |  27 +--
 io_uring/rsrc.h                |  10 +-
 io_uring/rw.c                  |   2 +-
 lib/iov_iter.c                 |  27 ++-
 22 files changed, 724 insertions(+), 54 deletions(-)