[PATCHv3,0/7] dma mapping optimisations

Message ID 20220805162444.3985535-1-kbusch@fb.com

Keith Busch Aug. 5, 2022, 4:24 p.m. UTC
From: Keith Busch <kbusch@kernel.org>

Changes since v2:

  Fixed incorrect io_uring io_fixed_file index validity checks: this
  should have been validating the file_ptr (Ammar)

  Various micro-optimizations: move up dma in iov type checks, skip
  iov_iter_advance on async IO (Jens).

  NVMe driver cleanups splitting the fast and slow paths.

  NVMe driver prp list setup fixes when using the slow path.

Summary:

A user address undergoes various representations for a typical read or
write command. Each consumes memory and CPU cycles. When the backing
storage is NVMe, the sequence looks something like the following:

  __user void *
  struct iov_iter
  struct pages[]
  struct bio_vec[]
  struct scatterlist[]
  __le64[]

Applications often reuse the same buffer across many IOs, so these
potentially costly per-IO transformations to reach the exact same
hardware descriptor can be skipped.

The io_uring interface already provides a way for users to register
buffers to get to 'struct bio_vec[]'. That still leaves building the
scatterlist for the repeated dma_map_sg(), and then transforming to
NVMe's PRP list format.

This series takes the registered buffers a step further. A block driver
can implement a new .dma_map() callback to reach the hardware's DMA
mapped address format, and return a cookie so a user can reference it
later for any given IO. When used, the block stack can skip significant
amounts of code, improving CPU utilization and IOPs.

The implementation is currently limited to mapping a registered buffer
to a single io_uring fixed file.
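To make the proposal concrete, a hypothetical sketch of the callback shape
(the names follow the patch titles below; the exact signatures are
assumptions, not copied from the patches):

```c
/*
 * Hypothetical sketch only -- signatures assumed from this cover
 * letter, not taken verbatim from the patches.  A driver maps a
 * registered bio_vec[] once and hands back an opaque dma tag;
 * later IOs carry the tag so the driver can skip dma_map_sg()
 * and the scatterlist/PRP rebuild entirely.
 */
struct blk_dma_map_ops_sketch {
	/* map once at buffer registration, return an opaque cookie */
	void *(*dma_map)(struct request_queue *q,
			 struct bio_vec *bvecs, int nr_vecs);
	/* release the mapping when the buffer is unregistered */
	void (*dma_unmap)(struct request_queue *q, void *dma_tag);
};
```

In the series itself the cookie travels with the IO via the new iov_iter and
bio types, so the fast path reduces to a tag lookup instead of a full page
walk and dma_map_sg().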

Keith Busch (7):
  blk-mq: add ops to dma map bvec
  file: add ops to dma map bvec
  iov_iter: introduce type for preregistered dma tags
  block: add dma tag bio type
  io_uring: introduce file slot release helper
  io_uring: add support for dma pre-mapping
  nvme-pci: implement dma_map support

 block/bdev.c                   |  20 +++
 block/bio.c                    |  24 ++-
 block/blk-merge.c              |  19 ++
 block/fops.c                   |  24 ++-
 drivers/nvme/host/pci.c        | 314 +++++++++++++++++++++++++++++++--
 fs/file.c                      |  15 ++
 include/linux/bio.h            |  22 ++-
 include/linux/blk-mq.h         |  24 +++
 include/linux/blk_types.h      |   6 +-
 include/linux/blkdev.h         |  16 ++
 include/linux/fs.h             |  20 +++
 include/linux/io_uring_types.h |   2 +
 include/linux/uio.h            |   9 +
 include/uapi/linux/io_uring.h  |  12 ++
 io_uring/filetable.c           |  34 ++--
 io_uring/filetable.h           |  10 +-
 io_uring/io_uring.c            | 139 +++++++++++++++
 io_uring/net.c                 |   2 +-
 io_uring/rsrc.c                |  27 +--
 io_uring/rsrc.h                |  10 +-
 io_uring/rw.c                  |   2 +-
 lib/iov_iter.c                 |  27 ++-
 22 files changed, 724 insertions(+), 54 deletions(-)

Comments

Christoph Hellwig Aug. 9, 2022, 6:46 a.m. UTC | #1
I finally found some time to look over this, and I have some
very high level problems with the implementations:

 - the DMA streaming ops (dma_map_*) are only intended for short term
   mappings, not for long living ones that can lead to starvation.
   Especially with swiotlb, but also with some IOMMUs.  I think this
   needs to be changed to an API that can allocate DMA mapped memory
   using dma_alloc_noncoherent/dma_alloc_noncontigious for the device
   and then give access to that to the user instead
 - I really do not like the special casing in the bio.  Did you try
   to just stash away the DMA mapping information in an efficient
   lookup data structure (e.g. rhashtable, but details might need
   analysis and benchmarking) and thus touch very little code outside
   of the driver I/O path and the place that performs the mapping?
 - the design seems to ignore DMA ownership.  Every time data is
   transferred to and from the device, DMA ownership needs to be
   transferred as well; take a look at Documentation/core-api/dma-api.rst
   and Documentation/core-api/dma-api-howto.rst.

As for the multiple devices discussion:  mapping memory for multiple
devices is possible, but nontrivial to get right, mostly due to the
ownership.  So unless we have a really good reason I'd suggest not
supporting this initially.
Keith Busch Aug. 9, 2022, 2:18 p.m. UTC | #2
On Tue, Aug 09, 2022 at 08:46:13AM +0200, Christoph Hellwig wrote:
>  - the design seems to ignore DMA ownership.  Every time data is
>    transferred to and from the device, DMA ownership needs to be
>    transferred as well; take a look at Documentation/core-api/dma-api.rst
>    and Documentation/core-api/dma-api-howto.rst.

I have this doing appropriate dma_sync_single_for_{device,cpu} if we aren't
using coherent memory. Is there more to ownership beyond that?
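For reference, the ownership handoff under discussion looks roughly like this
for a long-lived streaming mapping (a sketch per
Documentation/core-api/dma-api-howto.rst; error handling and
dma_mapping_error() checks omitted):

```c
/* map once at buffer registration */
dma_addr_t dma = dma_map_single(dev, buf, len, DMA_BIDIRECTIONAL);

/* CPU has written buf; hand ownership to the device */
dma_sync_single_for_device(dev, dma, len, DMA_BIDIRECTIONAL);
/* ... device DMA runs ... */
/* take ownership back before the CPU touches buf again */
dma_sync_single_for_cpu(dev, dma, len, DMA_BIDIRECTIONAL);

/* the sync pair repeats per IO; the unmap happens only at
 * buffer unregistration */
dma_unmap_single(dev, dma, len, DMA_BIDIRECTIONAL);
```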
Keith Busch Aug. 9, 2022, 4:46 p.m. UTC | #3
On Tue, Aug 09, 2022 at 08:46:13AM +0200, Christoph Hellwig wrote:
>  - the DMA streaming ops (dma_map_*) are only intended for short term
>    mappings, not for long living ones that can lead to starvation.
>    Especially with swiotlb, but also with some IOMMUs.  I think this
>    needs to be changed to an API that can allocate DMA mapped memory
>    using dma_alloc_noncoherent/dma_alloc_noncontigious for the device
>    and then give access to that to the user instead

I am a bit reluctant to require a new memory allocator to use the feature. I'm
also generally not concerned about the odd resource constrained IOMMU. User
space drivers pre-map all their memory resources up front, and this is
essentially the same concept.

For swiotlb, though, we can error out the mapping if the requested memory uses
swiotlb with the device: the driver's .dma_map() can return ENOTSUPP if
is_swiotlb_buffer() is true. Would that be more acceptable?
Christoph Hellwig Aug. 9, 2022, 6:39 p.m. UTC | #4
On Tue, Aug 09, 2022 at 08:18:45AM -0600, Keith Busch wrote:
> On Tue, Aug 09, 2022 at 08:46:13AM +0200, Christoph Hellwig wrote:
> >  - the design seems to ignore DMA ownership.  Every time data is
> >    transferred to and from the device, DMA ownership needs to be
> >    transferred as well; take a look at Documentation/core-api/dma-api.rst
> >    and Documentation/core-api/dma-api-howto.rst.
> 
> I have this doing appropriate dma_sync_single_for_{device,cpu} if we aren't
> using coherent memory. Is there more to ownership beyond that?

As long as we only ever support a single device that is fine, but
if we want to support multiple devices it gets complicated.  Sorry,
this should have been integrated with the mumblings on the multiple
device mappings, as the statement is false without that.
Christoph Hellwig Aug. 9, 2022, 6:41 p.m. UTC | #5
On Tue, Aug 09, 2022 at 10:46:04AM -0600, Keith Busch wrote:
> I am a bit reluctant to require a new memory allocator to use the feature. I'm
> also generally not concerned about the odd resource constrained IOMMU. User
> space drivers pre-map all their memory resources up front, and this is
> essentially the same concept.

Yes, it really isn't an issue for the modern iommus that support vfio,
but it is a limitation on the dma-api.  The old iommus aren't really the
issue, but..

> For swiotlb, though, we can error out the mapping if the requested memory uses
> swiotlb with the device: the driver's .dma_map() can return ENOTSUPP if
> is_swiotlb_buffer() is true. Would that be more acceptable?

No, is_swiotlb_buffer and similar are not exported APIs.  More importantly,
with the various secure hypervisor schemes, swiotlb use is unfortunately
increasing massively these days.  On those systems all streaming
mappings use swiotlb, and the only way to get any kind of half-decent
I/O performance would be the "special" premapped allocator, which is
another reason why I'd like to see it.
Keith Busch Aug. 10, 2022, 6:05 p.m. UTC | #6
On Tue, Aug 09, 2022 at 08:41:37PM +0200, Christoph Hellwig wrote:
> On Tue, Aug 09, 2022 at 10:46:04AM -0600, Keith Busch wrote:
> 
> > For swiotlb, though, we can error out the mapping if the requested memory uses
> > swiotlb with the device: the driver's .dma_map() can return ENOTSUPP if
> > is_swiotlb_buffer() is true. Would that be more acceptable?
> 
> No, is_swiotlb_buffer and similar are not exported APIs.

The functions are implemented under 'include/linux/', indistinguishable from
exported APIs. I think I understand why they are there, but they look the same
as exported functions from a driver perspective.

> More importantly, with the various secure hypervisor schemes, swiotlb use is
> unfortunately increasing massively these days.  On those systems all
> streaming mappings use swiotlb.  And the only way to get any kind of
> half-decent I/O performance would be the "special" premapped allocator, which
> is another reason why I'd like to see it.

Perhaps I'm being daft, but I'm totally missing why I should care if swiotlb
leverages this feature. If you're using that, you've traded performance for
security or compatibility already. If this idea can be used to make it perform
better, then great, but that shouldn't be the reason to hold this up IMO.

This optimization needs to be easy to reach if we expect anyone to use it.
Working with arbitrary user addresses with minimal additions to the user ABI
was deliberate. If you want a special allocator, we can always add one later;
this series doesn't affect that.

If this has the potential to starve system resources, though, I can constrain
it to specific users like CAP_SYS_ADMIN, or maybe only to memory allocated
from hugetlbfs. Or perhaps a more complicated scheme of shuffling dma mapping
resources on demand, if that is an improvement over the status quo.
Christoph Hellwig Aug. 11, 2022, 7:22 a.m. UTC | #7
On Wed, Aug 10, 2022 at 12:05:05PM -0600, Keith Busch wrote:
> The functions are implemented under 'include/linux/', indistinguishable from
> exported APIs. I think I understand why they are there, but they look the same
> as exported functions from a driver perspective.

swiotlb.h is not a driver API.  There's two leftovers used by the drm
code I'm trying to get fixed up, but in general the DMA API is the
interface and swiotlb is just an implementation detail.

> Perhaps I'm being daft, but I'm totally missing why I should care if swiotlb
> leverages this feature. If you're using that, you've traded performance for
> security or compatibility already. If this idea can be used to make it perform
> better, then great, but that shouldn't be the reason to hold this up IMO.

We firstly need to make sure that everything actually works on swiotlb, or
any other implementation that properly implements the DMA API.

And the fact that I/O performance currently sucks and we can fix it on
the trusted hypervisor is an important consideration.  At least as
important as micro-optimizing performance a little more on setups
not using them.  So not taking care of both in one go seems rather silly
for a feature that is in its current form pretty intrusive and thus needs
a really good justification.

> This optimization needs to be easy to reach if we expect anyone to use it.
> Working with arbitrary user addresses with minimal additions to the user ABI
> was deliberate. If you want a special allocator, we can always add one later;
> this series doesn't affect that.
> 
> If this has potential to starve system resource though, I can constrain it to
> specific users like CAP_SYS_ADMIN, or maybe only memory allocated from
> hugetlbfs. Or perhaps a more complicated scheme of shuffling dma mapping
> resources on demand if that is an improvement over the status quo.

Or just not bother with it at all.  Because with all those limits it
really does not seem to be worth adding an entirely new type of bio
payload to the block layer and a lot of boilerplate to drivers.
Keith Busch Aug. 31, 2022, 9:19 p.m. UTC | #8
On Thu, Aug 11, 2022 at 09:22:32AM +0200, Christoph Hellwig wrote:
> On Wed, Aug 10, 2022 at 12:05:05PM -0600, Keith Busch wrote:
> > The functions are implemented under 'include/linux/', indistinguishable from
> > exported APIs. I think I understand why they are there, but they look the same
> > as exported functions from a driver perspective.
> 
> swiotlb.h is not a driver API.  There's two leftovers used by the drm
> code I'm trying to get fixed up, but in general the DMA API is the
> interface and swiotlb is just an implementation detail.
> 
> > Perhaps I'm being daft, but I'm totally missing why I should care if swiotlb
> > leverages this feature. If you're using that, you've traded performance for
> > security or compatibility already. If this idea can be used to make it perform
> > better, then great, but that shouldn't be the reason to hold this up IMO.
> 
> We firstly need to make sure that everything actually works on swiotlb, or
> any other implementation that properly implements the DMA API.
> 
> And the fact that I/O performance currently sucks and we can fix it on
> the trusted hypervisor is an important consideration.  At least as
> important as micro-optimizing performance a little more on setups
> not using them.  So not taking care of both in one go seems rather silly
> for a feature that is in its current form pretty intrusive and thus needs
> a really good justification.

Sorry for the delayed response; I had some trouble with my test setup.

Okay, I will restart developing this with swiotlb in mind.

In the mean time, I wanted to share some results with this series because I'm
thinking this might be past the threshold for when we can drop the "micro-"
prefix on optimisations.

The most significant data points are these:

  * submission latency stays the same regardless of the transfer size or depth
  * IOPs are always equal or better (usually better), with up to 50% reduced
    CPU cost

Based on this, I do think this type of optimisation is worth having
something like a new bio type. I know this introduces some complications
in the io-path, but they are pretty minimal and add no size penalties to
common structs for drivers that don't use them.

Test details:

  fio with ioengine=io_uring
    'none': using __user void*
    'bvec': using buf registered with IORING_REGISTER_BUFFERS
    'dma': using buf registered with IORING_REGISTER_MAP_BUFFERS (new)

  intel_iommu=on
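
For reproduction, the 'bvec' case corresponds to a job file along these lines
(the 'dma' case would additionally need an fio engine option wired up to the
new IORING_REGISTER_MAP_BUFFERS opcode, which is an assumption here and not
part of released fio):

```ini
[randread]
ioengine=io_uring
; registers the IO buffers via IORING_REGISTER_BUFFERS
fixedbufs=1
direct=1
rw=randread
bs=4k
iodepth=16
filename=/dev/nvme0n1
```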

Results:

(submission latency [slat] in nanoseconds)
Q-Depth 1:

 Size |  Premap  |   IOPs  |  slat  | sys-cpu%
 .....|..........|.........|........|.........
 4k   |    none  |  41.4k  |  2126  |   16.47%
      |    bvec  |  43.8k  |  1843  |   15.79%
      |     dma  |  46.8k  |  1504  |   14.94%
 16k  |    none  |  33.3k  |  3279  |   17.78%
      |    bvec  |  33.9k  |  2607  |   14.59%
      |     dma  |  40.2k  |  1490  |   12.57%
 64k  |    none  |  18.7k  |  6778  |   18.22%
      |    bvec  |  20.0k  |  4626  |   13.80%
      |     dma  |  22.6k  |  1586  |    7.58%

Q-Depth 16:

 Size |  Premap  |   IOPs  |  slat  | sys-cpu%
 .....|..........|.........|........|.........
 4k   |    none  |   207k  |  3657  |   72.81%
      |    bvec  |   219k  |  3369  |   71.55%
      |     dma  |   310k  |  2237  |   60.16%
 16k  |    none  |   164k  |  5024  |   78.38%
      |    bvec  |   177k  |  4553  |   76.29%
      |     dma  |   186k  |  1880  |   43.56%
 64k  |    none  |  46.7k  |  4424  |   30.51%
      |    bvec  |  46.7k  |  4389  |   29.42%
      |     dma  |  46.7k  |  1574  |   15.61%