mbox series

[RFC,v2,00/11] Add support to dma_map_sg for P2PDMA

Message ID 20210311233142.7900-1-logang@deltatee.com (mailing list archive)
Headers show
Series Add support to dma_map_sg for P2PDMA | expand

Message

Logan Gunthorpe March 11, 2021, 11:31 p.m. UTC
Hi,

This is a rework of the first half of my RFC for doing P2PDMA in userspace
with O_DIRECT[1].

The largest issue with that series was the gross way of flagging P2PDMA
SGL segments. This RFC proposes a different approach, (suggested by
Dan Williams[2]) which uses the third bit in the page_link field of the
SGL.

This approach is a lot less hacky but comes at the cost of adding a
CONFIG_64BIT dependency to CONFIG_PCI_P2PDMA and using up the last
scarce bit in the page_link. For our purposes, a 64BIT restriction is
acceptable but it's not clear if this is ok for all usecases hoping
to make use of P2PDMA.

Matthew Wilcox has already suggested (off-list) that this is the wrong
approach, preferring a new dma mapping operation and an SGL replacement. I
don't disagree that something along those lines would be a better long
term solution, but it involves overcoming a lot of challenges to get
there. Creating a new mapping operation still means adding support to more
than 25 dma_map_ops implementations (many of which are on obscure
architectures) or creating a redundant path to fallback with dma_map_sg()
for every driver that uses the new operation. This RFC is an approach
that doesn't require overcoming these blocks.

Any alternative ideas or feedback is welcome.

These patches are based on v5.12-rc2 and a git branch is available here:

  https://github.com/sbates130272/linux-p2pmem/  p2pdma_dma_map_ops_rfc

A branch with the patches from the previous RFC that add userspace
O_DIRECT support is available at the same URL with the name
"p2pdma_dma_map_ops_rfc+user" (however, none of the issues with those
extra patches from the feedback of the last posting have been fixed).

Thanks,

Logan

[1] https://lore.kernel.org/linux-block/20201106170036.18713-1-logang@deltatee.com/
[2] https://lore.kernel.org/linux-block/CAPcyv4ifGcrdOtUt8qr7pmFhmecGHqGVre9G0RorGczCGVECQQ@mail.gmail.com/

--

Logan Gunthorpe (11):
  PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn()
  PCI/P2PDMA: Avoid pci_get_slot() which sleeps
  PCI/P2PDMA: Attempt to set map_type if it has not been set
  PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and
    pci_p2pdma_bus_offset()
  lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL
  dma-direct: Support PCI P2PDMA pages in dma-direct map_sg
  dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support
  iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg
  block: Add BLK_STS_P2PDMA
  nvme-pci: Check DMA ops when indicating support for PCI P2PDMA
  nvme-pci: Convert to using dma_map_sg for p2pdma pages

 block/blk-core.c            |  2 +
 drivers/iommu/dma-iommu.c   | 63 +++++++++++++++++++++-----
 drivers/nvme/host/core.c    |  3 +-
 drivers/nvme/host/nvme.h    |  2 +-
 drivers/nvme/host/pci.c     | 38 +++++++---------
 drivers/pci/Kconfig         |  2 +-
 drivers/pci/p2pdma.c        | 89 +++++++++++++++++++++++++++++++------
 include/linux/blk_types.h   |  7 +++
 include/linux/dma-map-ops.h |  3 ++
 include/linux/dma-mapping.h |  5 +++
 include/linux/pci-p2pdma.h  | 11 +++++
 include/linux/scatterlist.h | 49 ++++++++++++++++++--
 kernel/dma/direct.c         | 35 +++++++++++++--
 kernel/dma/mapping.c        | 21 +++++++--
 14 files changed, 271 insertions(+), 59 deletions(-)


base-commit: a38fd8748464831584a19438cbb3082b5a2dab15
--
2.20.1

Comments

Robin Murphy March 12, 2021, 3:51 p.m. UTC | #1
On 2021-03-11 23:31, Logan Gunthorpe wrote:
> Hi,
> 
> This is a rework of the first half of my RFC for doing P2PDMA in userspace
> with O_DIRECT[1].
> 
> The largest issue with that series was the gross way of flagging P2PDMA
> SGL segments. This RFC proposes a different approach, (suggested by
> Dan Williams[2]) which uses the third bit in the page_link field of the
> SGL.
> 
> This approach is a lot less hacky but comes at the cost of adding a
> CONFIG_64BIT dependency to CONFIG_PCI_P2PDMA and using up the last
> scarce bit in the page_link. For our purposes, a 64BIT restriction is
> acceptable but it's not clear if this is ok for all usecases hoping
> to make use of P2PDMA.
> 
> Matthew Wilcox has already suggested (off-list) that this is the wrong
> approach, preferring a new dma mapping operation and an SGL replacement. I
> don't disagree that something along those lines would be a better long
> term solution, but it involves overcoming a lot of challenges to get
> there. Creating a new mapping operation still means adding support to more
> than 25 dma_map_ops implementations (many of which are on obscure
> architectures) or creating a redundant path to fallback with dma_map_sg()
> for every driver that uses the new operation. This RFC is an approach
> that doesn't require overcoming these blocks.

I don't really follow that argument - you're only adding support to two 
implementations with the awkward flag, so why would using a dedicated 
operation instead be any different? Whatever callers need to do if 
dma_pci_p2pdma_supported() says no, they could equally do if 
dma_map_p2p_sg() (or whatever) returns -ENXIO, no?

We don't try to multiplex .map_resource through .map_page, so there 
doesn't seem to be any good reason to force that complexity on .map_sg 
either. And having a distinct API from the outset should make it a lot 
easier to transition to better "list of P2P memory regions" data 
structures in future without rewriting the whole world. As it is, there 
are potential benefits in a P2P interface which can define its own 
behaviour - for instance if threw out the notion of segment merging it 
could save a load of bother by just maintaining the direct correlation 
between pages and DMA addresses.

Robin.

> Any alternative ideas or feedback is welcome.
> 
> These patches are based on v5.12-rc2 and a git branch is available here:
> 
>    https://github.com/sbates130272/linux-p2pmem/  p2pdma_dma_map_ops_rfc
> 
> A branch with the patches from the previous RFC that add userspace
> O_DIRECT support is available at the same URL with the name
> "p2pdma_dma_map_ops_rfc+user" (however, none of the issues with those
> extra patches from the feedback of the last posting have been fixed).
> 
> Thanks,
> 
> Logan
> 
> [1] https://lore.kernel.org/linux-block/20201106170036.18713-1-logang@deltatee.com/
> [2] https://lore.kernel.org/linux-block/CAPcyv4ifGcrdOtUt8qr7pmFhmecGHqGVre9G0RorGczCGVECQQ@mail.gmail.com/
> 
> --
> 
> Logan Gunthorpe (11):
>    PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn()
>    PCI/P2PDMA: Avoid pci_get_slot() which sleeps
>    PCI/P2PDMA: Attempt to set map_type if it has not been set
>    PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and
>      pci_p2pdma_bus_offset()
>    lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL
>    dma-direct: Support PCI P2PDMA pages in dma-direct map_sg
>    dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support
>    iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg
>    block: Add BLK_STS_P2PDMA
>    nvme-pci: Check DMA ops when indicating support for PCI P2PDMA
>    nvme-pci: Convert to using dma_map_sg for p2pdma pages
> 
>   block/blk-core.c            |  2 +
>   drivers/iommu/dma-iommu.c   | 63 +++++++++++++++++++++-----
>   drivers/nvme/host/core.c    |  3 +-
>   drivers/nvme/host/nvme.h    |  2 +-
>   drivers/nvme/host/pci.c     | 38 +++++++---------
>   drivers/pci/Kconfig         |  2 +-
>   drivers/pci/p2pdma.c        | 89 +++++++++++++++++++++++++++++++------
>   include/linux/blk_types.h   |  7 +++
>   include/linux/dma-map-ops.h |  3 ++
>   include/linux/dma-mapping.h |  5 +++
>   include/linux/pci-p2pdma.h  | 11 +++++
>   include/linux/scatterlist.h | 49 ++++++++++++++++++--
>   kernel/dma/direct.c         | 35 +++++++++++++--
>   kernel/dma/mapping.c        | 21 +++++++--
>   14 files changed, 271 insertions(+), 59 deletions(-)
> 
> 
> base-commit: a38fd8748464831584a19438cbb3082b5a2dab15
> --
> 2.20.1
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>
Logan Gunthorpe March 12, 2021, 4:18 p.m. UTC | #2
On 2021-03-12 8:51 a.m., Robin Murphy wrote:
> On 2021-03-11 23:31, Logan Gunthorpe wrote:
>> Hi,
>>
>> This is a rework of the first half of my RFC for doing P2PDMA in
>> userspace
>> with O_DIRECT[1].
>>
>> The largest issue with that series was the gross way of flagging P2PDMA
>> SGL segments. This RFC proposes a different approach, (suggested by
>> Dan Williams[2]) which uses the third bit in the page_link field of the
>> SGL.
>>
>> This approach is a lot less hacky but comes at the cost of adding a
>> CONFIG_64BIT dependency to CONFIG_PCI_P2PDMA and using up the last
>> scarce bit in the page_link. For our purposes, a 64BIT restriction is
>> acceptable but it's not clear if this is ok for all usecases hoping
>> to make use of P2PDMA.
>>
>> Matthew Wilcox has already suggested (off-list) that this is the wrong
>> approach, preferring a new dma mapping operation and an SGL
>> replacement. I
>> don't disagree that something along those lines would be a better long
>> term solution, but it involves overcoming a lot of challenges to get
>> there. Creating a new mapping operation still means adding support to
>> more
>> than 25 dma_map_ops implementations (many of which are on obscure
>> architectures) or creating a redundant path to fallback with dma_map_sg()
>> for every driver that uses the new operation. This RFC is an approach
>> that doesn't require overcoming these blocks.
> 
> I don't really follow that argument - you're only adding support to two
> implementations with the awkward flag, so why would using a dedicated
> operation instead be any different? Whatever callers need to do if
> dma_pci_p2pdma_supported() says no, they could equally do if
> dma_map_p2p_sg() (or whatever) returns -ENXIO, no?

The thing is if the dma_map_sg doesn't support P2PDMA then P2PDMA
transactions cannot be done, but regular transactions can still go
through as they always did.

But replacing dma_map_sg() with dma_map_new() affects all operations,
P2PDMA or otherwise. If dma_map_new() isn't supported it can't simply
not support P2PDMA; it has to maintain a fallback path to dma_map_sg().
Given that the inputs and outputs for dma_map_new() will be completely
different data structures this will be quite a lot of similar paths
required in the driver. (ie mapping a bvec to the input struct and the
output struct to hardware requirements) If a bug crops up in the old
dma_map_sg(), developers might not notice it for some time seeing it
won't be used on the most popular architectures.

Logan
Robin Murphy March 12, 2021, 5:46 p.m. UTC | #3
On 2021-03-12 16:18, Logan Gunthorpe wrote:
> 
> 
> On 2021-03-12 8:51 a.m., Robin Murphy wrote:
>> On 2021-03-11 23:31, Logan Gunthorpe wrote:
>>> Hi,
>>>
>>> This is a rework of the first half of my RFC for doing P2PDMA in
>>> userspace
>>> with O_DIRECT[1].
>>>
>>> The largest issue with that series was the gross way of flagging P2PDMA
>>> SGL segments. This RFC proposes a different approach, (suggested by
>>> Dan Williams[2]) which uses the third bit in the page_link field of the
>>> SGL.
>>>
>>> This approach is a lot less hacky but comes at the cost of adding a
>>> CONFIG_64BIT dependency to CONFIG_PCI_P2PDMA and using up the last
>>> scarce bit in the page_link. For our purposes, a 64BIT restriction is
>>> acceptable but it's not clear if this is ok for all usecases hoping
>>> to make use of P2PDMA.
>>>
>>> Matthew Wilcox has already suggested (off-list) that this is the wrong
>>> approach, preferring a new dma mapping operation and an SGL
>>> replacement. I
>>> don't disagree that something along those lines would be a better long
>>> term solution, but it involves overcoming a lot of challenges to get
>>> there. Creating a new mapping operation still means adding support to
>>> more
>>> than 25 dma_map_ops implementations (many of which are on obscure
>>> architectures) or creating a redundant path to fallback with dma_map_sg()
>>> for every driver that uses the new operation. This RFC is an approach
>>> that doesn't require overcoming these blocks.
>>
>> I don't really follow that argument - you're only adding support to two
>> implementations with the awkward flag, so why would using a dedicated
>> operation instead be any different? Whatever callers need to do if
>> dma_pci_p2pdma_supported() says no, they could equally do if
>> dma_map_p2p_sg() (or whatever) returns -ENXIO, no?
> 
> The thing is if the dma_map_sg doesn't support P2PDMA then P2PDMA
> transactions cannot be done, but regular transactions can still go
> through as they always did.
> 
> But replacing dma_map_sg() with dma_map_new() affects all operations,
> P2PDMA or otherwise. If dma_map_new() isn't supported it can't simply
> not support P2PDMA; it has to maintain a fallback path to dma_map_sg().

But AFAICS the equivalent fallback path still has to exist either way. 
My impression so far is that callers would end up looking something like 
this:

	if (dma_pci_p2pdma_supported()) {
		if (dma_map_sg(...) < 0)
			//do non-p2p fallback due to p2p failure
	} else {
		//do non-p2p fallback due to lack of support
	}

at which point, simply:

	if (dma_map_sg_p2p(...) < 0)
		//do non-p2p fallback either way

seems entirely reasonable. What am I missing?

Let's not pretend that overloading an existing API means we can start 
feeding P2P pages into any old subsystem/driver without further changes 
- there already *are* at least some that retry ad infinitum if DMA 
mapping fails (the USB layer springs to mind...) and thus wouldn't 
handle the PCI_P2PDMA_MAP_NOT_SUPPORTED case acceptably.

> Given that the inputs and outputs for dma_map_new() will be completely
> different data structures this will be quite a lot of similar paths
> required in the driver. (ie mapping a bvec to the input struct and the
> output struct to hardware requirements) If a bug crops up in the old
> dma_map_sg(), developers might not notice it for some time seeing it
> won't be used on the most popular architectures.

Huh? I'm specifically suggesting a new interface that takes the *same* 
data structure (at least to begin with), but just gives us more 
flexibility in terms of introducing p2p-aware behaviour somewhat more 
safely. Yes, we already know that we ultimately want something better 
than scatterlists for representing things like this and dma-buf imports, 
but that hardly has to happen overnight.

Robin.
Logan Gunthorpe March 12, 2021, 6:24 p.m. UTC | #4
On 2021-03-12 10:46 a.m., Robin Murphy wrote:
> On 2021-03-12 16:18, Logan Gunthorpe wrote:
>>
>>
>> On 2021-03-12 8:51 a.m., Robin Murphy wrote:
>>> On 2021-03-11 23:31, Logan Gunthorpe wrote:
>>>> Hi,
>>>>
>>>> This is a rework of the first half of my RFC for doing P2PDMA in
>>>> userspace
>>>> with O_DIRECT[1].
>>>>
>>>> The largest issue with that series was the gross way of flagging P2PDMA
>>>> SGL segments. This RFC proposes a different approach, (suggested by
>>>> Dan Williams[2]) which uses the third bit in the page_link field of the
>>>> SGL.
>>>>
>>>> This approach is a lot less hacky but comes at the cost of adding a
>>>> CONFIG_64BIT dependency to CONFIG_PCI_P2PDMA and using up the last
>>>> scarce bit in the page_link. For our purposes, a 64BIT restriction is
>>>> acceptable but it's not clear if this is ok for all usecases hoping
>>>> to make use of P2PDMA.
>>>>
>>>> Matthew Wilcox has already suggested (off-list) that this is the wrong
>>>> approach, preferring a new dma mapping operation and an SGL
>>>> replacement. I
>>>> don't disagree that something along those lines would be a better long
>>>> term solution, but it involves overcoming a lot of challenges to get
>>>> there. Creating a new mapping operation still means adding support to
>>>> more
>>>> than 25 dma_map_ops implementations (many of which are on obscure
>>>> architectures) or creating a redundant path to fallback with
>>>> dma_map_sg()
>>>> for every driver that uses the new operation. This RFC is an approach
>>>> that doesn't require overcoming these blocks.
>>>
>>> I don't really follow that argument - you're only adding support to two
>>> implementations with the awkward flag, so why would using a dedicated
>>> operation instead be any different? Whatever callers need to do if
>>> dma_pci_p2pdma_supported() says no, they could equally do if
>>> dma_map_p2p_sg() (or whatever) returns -ENXIO, no?
>>
>> The thing is if the dma_map_sg doesn't support P2PDMA then P2PDMA
>> transactions cannot be done, but regular transactions can still go
>> through as they always did.
>>
>> But replacing dma_map_sg() with dma_map_new() affects all operations,
>> P2PDMA or otherwise. If dma_map_new() isn't supported it can't simply
>> not support P2PDMA; it has to maintain a fallback path to dma_map_sg().
> 
> But AFAICS the equivalent fallback path still has to exist either way.
> My impression so far is that callers would end up looking something like
> this:
> 
>     if (dma_pci_p2pdma_supported()) {
>         if (dma_map_sg(...) < 0)
>             //do non-p2p fallback due to p2p failure
>     } else {
>         //do non-p2p fallback due to lack of support
>     }
> 
> at which point, simply:
> 
>     if (dma_map_sg_p2p(...) < 0)
>         //do non-p2p fallback either way
> 
> seems entirely reasonable. What am I missing?

No, that's not how it works. The dma_pci_p2pdma_supported() flag gates
whether P2PDMA pages will be used at a much higher level. Currently with
NVMe, if P2PDMA is supported, it sets the QUEUE_FLAG_PCI_P2PDMA on the
block queue. This is then tested with blk_queue_pci_p2pdma() before any
P2PDMA transaction with the block device is started.

In the following patches that could add userspace support,
blk_queue_pci_p2pdma() is used to add a flag to GUP which allows P2PDMA
pages into the driver via O_DIRECT.

There is no fallback path on the dma_map_sg() operation if p2pdma is not
supported. dma_map_sg() is always used. The code simply prevents pages
from getting there in the first place.

A new DMA operation would have to be:

if (dma_has_new_operation()) {
     // create input for dma_map_new_op()
     // map with dma_map_new_op()
     // parse output of dma_map_new_op()
} else {
     // create SGL
     dma_map_sg()
     // convert SGL to hardware map
}

And this if statement has nothing to do with p2pdma support or not.

> 
> Let's not pretend that overloading an existing API means we can start
> feeding P2P pages into any old subsystem/driver without further changes
> - there already *are* at least some that retry ad infinitum if DMA
> mapping fails (the USB layer springs to mind...) and thus wouldn't
> handle the PCI_P2PDMA_MAP_NOT_SUPPORTED case acceptably.

Yes, nobody is pretending that at all. We are being very careful with
P2PDMA pages and we don't want them to get into any driver that isn't
explicitly written to expect them. With the code in the current kernel
this is all very explicit and done ahead of time (with
QUEUE_FLAG_PCI_P2PDMA and similar). Once the pages get into userspace,
GUP will reject them unless the driver getting the pages specifically
sets a flag indicating support for them.

>> Given that the inputs and outputs for dma_map_new() will be completely
>> different data structures this will be quite a lot of similar paths
>> required in the driver. (ie mapping a bvec to the input struct and the
>> output struct to hardware requirements) If a bug crops up in the old
>> dma_map_sg(), developers might not notice it for some time seeing it
>> won't be used on the most popular architectures.
> 
> Huh? I'm specifically suggesting a new interface that takes the *same*
> data structure (at least to begin with), but just gives us more
> flexibility in terms of introducing p2p-aware behaviour somewhat more
> safely. Yes, we already know that we ultimately want something better
> than scatterlists for representing things like this and dma-buf imports,
> but that hardly has to happen overnight.

Ok, well that's not what Matthew had suggested off list. He specifically
was suggesting new data structures. And yes, that isn't going to happen
overnight.

If we have a dma_map_sg() and a dma_map_sg_p2p() that are identical
except dma_map_sg_p2p() supports p2pdma memory and can return -1 then
that doesn't sound so bad to me. So dma_map_sg() would simply be call
dma_map_sg_p2p() and add the BUG() on the return code. Though I really
don't see the benefit of the extra annotations. I don't think it really
adds any value. The tricky thing in the API is the SGL which needs to
flag segments for P2PDMA and the new function doesn't solve that.

Logan