diff mbox

[v3,01/11] PCI/P2PDMA: Support peer-to-peer memory

Message ID 20180312193525.2855-2-logang@deltatee.com (mailing list archive)
State New, archived
Headers show

Commit Message

Logan Gunthorpe March 12, 2018, 7:35 p.m. UTC
Some PCI devices may have memory mapped in a BAR space that's
intended for use in peer-to-peer transactions. In order to enable
such transactions the memory must be registered with ZONE_DEVICE pages
so it can be used by DMA interfaces in existing drivers.

Add an interface for other subsystems to find and allocate chunks of P2P
memory as necessary to facilitate transfers between two PCI peers:

int pci_p2pdma_add_client();
struct pci_dev *pci_p2pmem_find();
void *pci_alloc_p2pmem();

The new interface requires a driver to collect a list of client devices
involved in the transaction with the pci_p2pmem_add_client*() functions
then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
this is done the list is bound to the memory and the calling driver is
free to add and remove clients as necessary (adding incompatible clients
will fail). With a suitable p2pmem device, memory can then be
allocated with pci_alloc_p2pmem() for use in DMA transactions.

Depending on hardware, using peer-to-peer memory may reduce the bandwidth
of the transfer but would significantly reduce pressure on system memory.
This may be desirable in many cases: for example a system could be designed
with a small CPU connected to a PCI switch by a small number of lanes
which would maximize the number of lanes available to connect to NVME
devices.

The code is designed to only utilize the p2pmem device if all the devices
involved in a transfer are behind the same PCI switch. This is because
we have no way of knowing whether peer-to-peer routing between PCIe
Root Ports is supported (PCIe r4.0, sec 1.3.1). Additionally, the
benefits of P2P transfers that go through the RC is limited to only
reducing DRAM usage and, in some cases, coding convienence.

This commit includes significant rework and feedback from Christoph
Hellwig.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/Kconfig        |  16 ++
 drivers/pci/Makefile       |   1 +
 drivers/pci/p2pdma.c       | 679 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/memremap.h   |  18 ++
 include/linux/pci-p2pdma.h | 101 +++++++
 include/linux/pci.h        |   4 +
 6 files changed, 819 insertions(+)
 create mode 100644 drivers/pci/p2pdma.c
 create mode 100644 include/linux/pci-p2pdma.h

Comments

Sinan Kaya March 13, 2018, 3:28 a.m. UTC | #1
On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
> +int pci_p2pdma_add_client(struct list_head *head, struct device *dev)

It feels like code tried to be a generic p2pdma provider first. Then got
converted to PCI, yet all dev parameters are still struct device.

Maybe, dev parameter should also be struct pci_dev so that you can get rid of
all to_pci_dev() calls in this code including find_parent_pci_dev() function.

Regarding the switch business, It is amazing how much trouble you went into
limit this functionality into very specific hardware.

I thought that we reached to an agreement that code would not impose
any limits on what user wants.

What happened to all the emails we exchanged?

I understand you are coming from what you tested. Considering that you
are singing up for providing a generic PCI functionality into the kernel,
why don't you just blacklist the products that you had problems with and
yet still allow other architectures to use your code with their root ports?
Logan Gunthorpe March 13, 2018, 4:43 p.m. UTC | #2
On 12/03/18 09:28 PM, Sinan Kaya wrote:
> On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
> Regarding the switch business, It is amazing how much trouble you went into
> limit this functionality into very specific hardware.
> 
> I thought that we reached to an agreement that code would not impose
> any limits on what user wants.
> 
> What happened to all the emails we exchanged?

It turns out that root ports that support P2P are far less common than 
anyone thought. So it will likely have to be a white list. Nobody else 
seems keen on allowing the user to enable this on hardware that doesn't 
work. The easiest solution is still limiting it to using a switch. From 
there, if someone wants to start creating a white-list then that's 
probably the way forward to support root ports.

And there's also the ACS problem which means if you want to use P2P on 
the root ports you'll have to disable ACS on the entire system. (Or 
preferably, the IOMMU groups need to get more sophisticated to allow for 
dynamic changes).

Additionally, once you allow for root ports you may find the IOMMU 
getting in the way.

So there are great deal more issues to sort out if you don't restrict to 
devices behind switches.

Logan
Sinan Kaya March 13, 2018, 5:49 p.m. UTC | #3
On 3/13/2018 12:43 PM, Logan Gunthorpe wrote:
> 
> 
> On 12/03/18 09:28 PM, Sinan Kaya wrote:
>> On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
>> Regarding the switch business, It is amazing how much trouble you went into
>> limit this functionality into very specific hardware.
>>
>> I thought that we reached to an agreement that code would not impose
>> any limits on what user wants.
>>
>> What happened to all the emails we exchanged?
> 
> It turns out that root ports that support P2P are far less common than anyone thought. So it will likely have to be a white list. Nobody else seems keen on allowing the user to enable this on hardware that doesn't work. The easiest solution is still limiting it to using a switch. From there, if someone wants to start creating a white-list then that's probably the way forward to support root ports.

I thought only few root ports had problem. Thanks for clarifying that.

> 
> And there's also the ACS problem which means if you want to use P2P on the root ports you'll have to disable ACS on the entire system. (Or preferably, the IOMMU groups need to get more sophisticated to allow for dynamic changes).
> 

Do you think you can keep a pointer to the parent bridge instead of querying it
via get_upstream_bridge_port() here so that we can reuse your
pci_p2pdma_disable_acs() in the future.

+int pci_p2pdma_disable_acs(struct pci_dev *pdev)
+{
+	struct pci_dev *up;
+	int pos;
+	u16 ctrl;
+
+	up = get_upstream_bridge_port(pdev);
+	if (!up)
+		return 0;

Some future device would only deal with pci_p2pdma_add_client(() for whitelisting
instead of changing all of your code.

We should at least limit the usage of get_upstream_bridge_port() family of functions
to probe time only.

> Additionally, once you allow for root ports you may find the IOMMU getting in the way.

We can tell iommu to do one to one translation by passing iommu.passthrough=1 to kernel
command line to have identical behavior to your switch case.

> 
> So there are great deal more issues to sort out if you don't restrict to devices behind switches.
> 
> Logan
>
Logan Gunthorpe March 13, 2018, 6:40 p.m. UTC | #4
On 12/03/18 09:28 PM, Sinan Kaya wrote:
> Maybe, dev parameter should also be struct pci_dev so that you can get rid of
> all to_pci_dev() calls in this code including find_parent_pci_dev() function.

No, this was mentioned in v2. find_parent_pci_dev is necessary because 
the calling drivers aren't likely to have a bunch of struct pci_dev's 
for all the devices they are working with lying around. It's a much 
nicer from an API stand point to take struct devs and not worth it just 
to have a PCI API only taking struct pci_devs.

Logan
Logan Gunthorpe March 13, 2018, 6:44 p.m. UTC | #5
On 13/03/18 11:49 AM, Sinan Kaya wrote:
>> And there's also the ACS problem which means if you want to use P2P on the root ports you'll have to disable ACS on the entire system. (Or preferably, the IOMMU groups need to get more sophisticated to allow for dynamic changes).
>>
> 
> Do you think you can keep a pointer to the parent bridge instead of querying it
> via get_upstream_bridge_port() here so that we can reuse your
> pci_p2pdma_disable_acs() in the future.

Keep a pointer where? pci_p2pdma_disable_acs() and 
pci_p2pdma_add_client() are used in completely different cases on 
completely different devices. There really is no overlap and no obvious 
place to store the port pointer (except in the struct pci_dev itself, in 
which case why not just call the function again).

> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
> +{
> +	struct pci_dev *up;
> +	int pos;
> +	u16 ctrl;
> +
> +	up = get_upstream_bridge_port(pdev);
> +	if (!up)
> +		return 0;
> 
> Some future device would only deal with pci_p2pdma_add_client(() for whitelisting
> instead of changing all of your code.

That's a problem for whoever writes the future code.

> We should at least limit the usage of get_upstream_bridge_port() family of functions
> to probe time only.

Why?

> We can tell iommu to do one to one translation by passing iommu.passthrough=1 to kernel
> command line to have identical behavior to your switch case.

Well, someone will need to write code for all available IOMMUs to 
support this. That's a very big project.

Logan
Sinan Kaya March 13, 2018, 7:10 p.m. UTC | #6
On 3/13/2018 2:44 PM, Logan Gunthorpe wrote:
>> Do you think you can keep a pointer to the parent bridge instead of querying it
>> via get_upstream_bridge_port() here so that we can reuse your
>> pci_p2pdma_disable_acs() in the future.
> 
> Keep a pointer where? pci_p2pdma_disable_acs() and pci_p2pdma_add_client() are used in completely different cases on completely different devices. There really is no overlap and no obvious place to store the port pointer (except in the struct pci_dev itself, in which case why not just call the function again).

I was thinking of this for the pci_p2pdma_add_client() case for the
parent pointer.

+struct pci_p2pdma_client {
+	struct list_head list;
+	struct pci_dev *client;
+	struct pci_dev *provider;
+};

You are right second code seems to be there to disable ACS altogether
when this kernel configuration is selected.

It doesn't have any relationship to the client code.

But then, Why bother searching for the switch at all?

Even if the switch is there, there is no guarantee that it is currently
being used for P2P.

It seems that we are going with the assumption that enabling this config
option implies you want P2P, then we can simplify this code as well.
Logan Gunthorpe March 13, 2018, 7:19 p.m. UTC | #7
On 13/03/18 01:10 PM, Sinan Kaya wrote:
> I was thinking of this for the pci_p2pdma_add_client() case for the
> parent pointer.
> 
> +struct pci_p2pdma_client {
> +	struct list_head list;
> +	struct pci_dev *client;
> +	struct pci_dev *provider;
> +};

Yeah, that structure only exists in a list owned by the client and we
only check the upstream bridge once per entry so I don't see the point.

> But then, Why bother searching for the switch at all?

Huh? We have to make sure all the client and provider devices are behind
the same switch. How can we do that without "searching" for the switch?

In the ACS case, we only disable ACS on downstream ports of switches. No
sense disabling it globally as that's worse from an isolation point of
view and not worth it given we require all P2P transactions to be behind
a switch.

> Even if the switch is there, there is no guarantee that it is currently
> being used for P2P.

IOMMU groups are set at boot time and, at present, there's no way to
dynamically change ACS bits without messing up the groups. So switches
not used for P2P will not have ACS enabled when CONFIG_PCI_P2PDMA is set
and I don't know of any good solution to that. Please see the ACS
discussions in v1 and v2.

> It seems that we are going with the assumption that enabling this config
> option implies you want P2P, then we can simplify this code as well.

How so?

Logan
Sinan Kaya March 13, 2018, 7:53 p.m. UTC | #8
On 3/13/2018 3:19 PM, Logan Gunthorpe wrote:
> 
> 
> On 13/03/18 01:10 PM, Sinan Kaya wrote:
>> I was thinking of this for the pci_p2pdma_add_client() case for the
>> parent pointer.
>>
>> +struct pci_p2pdma_client {
>> +	struct list_head list;
>> +	struct pci_dev *client;
>> +	struct pci_dev *provider;
>> +};
> 
> Yeah, that structure only exists in a list owned by the client and we
> only check the upstream bridge once per entry so I don't see the point.
> 
>> But then, Why bother searching for the switch at all?
> 
> Huh? We have to make sure all the client and provider devices are behind
> the same switch. How can we do that without "searching" for the switch?
> 

Sorry, I was thinking of ACS case you described below. The only thing code
cares is if the device is behind a switch or not at this moment.

> In the ACS case, we only disable ACS on downstream ports of switches. No
> sense disabling it globally as that's worse from an isolation point of
> view and not worth it given we require all P2P transactions to be behind
> a switch.

I agree disabling globally would be bad. Somebody can always say I have
ten switches on my system. I want to do peer-to-peer on one switch only. Now,
this change weakened security for the other switches that I had no intention
with doing P2P.

Isn't this a problem?

Can we specify the BDF of the downstream device we want P2P with during boot via
kernel command line?

> 
>> Even if the switch is there, there is no guarantee that it is currently
>> being used for P2P.
> 
> IOMMU groups are set at boot time and, at present, there's no way to
> dynamically change ACS bits without messing up the groups. So switches
> not used for P2P will not have ACS enabled when CONFIG_PCI_P2PDMA is set
> and I don't know of any good solution to that. Please see the ACS
> discussions in v1 and v2.

Given the implementation limitations, this might be OK as a short-term
solution. 

It depends on if Alex is comfortable with this.

> 
>> It seems that we are going with the assumption that enabling this config
>> option implies you want P2P, then we can simplify this code as well.
> 
> How so?
> 
> Logan
> 
> 
>
Logan Gunthorpe March 13, 2018, 8:46 p.m. UTC | #9
On 13/03/18 01:53 PM, Sinan Kaya wrote:
> I agree disabling globally would be bad. Somebody can always say I have
> ten switches on my system. I want to do peer-to-peer on one switch only. Now,
> this change weakened security for the other switches that I had no intention
> with doing P2P.
> 
> Isn't this a problem?

Well, if it's a problem for someone they'll have to solve it. We're
targeting JBOFs that have no use for ACS / IOMMU groups at all.

> Can we specify the BDF of the downstream device we want P2P with during boot via
> kernel command line?

That's a painful configuration burden. And then things might stop
working if you change your topology at all and now have to change boot
parameters.

Logan
Sinan Kaya March 13, 2018, 9:22 p.m. UTC | #10
On 3/13/2018 4:46 PM, Logan Gunthorpe wrote:
> 
> 
> On 13/03/18 01:53 PM, Sinan Kaya wrote:
>> I agree disabling globally would be bad. Somebody can always say I have
>> ten switches on my system. I want to do peer-to-peer on one switch only. Now,
>> this change weakened security for the other switches that I had no intention
>> with doing P2P.
>>
>> Isn't this a problem?
> 
> Well, if it's a problem for someone they'll have to solve it. We're
> targeting JBOFs that have no use for ACS / IOMMU groups at all.
> 
>> Can we specify the BDF of the downstream device we want P2P with during boot via
>> kernel command line?
> 
> That's a painful configuration burden. And then things might stop
> working if you change your topology at all and now have to change boot
> parameters.
> 

It sounds like you have very tight hardware expectations for this to work
at this moment. You also don't want to generalize this code for others and
address the shortcomings.

To get you going, you should limit this change to the switch products that you have
validated via white-listing PCI vendor/device ids. Please do not enable this feature
for all other PCI devices or by default.

I think your code qualifies as a virus until this issue is resolved (so NAK). 

Another option is for your CONFIG to depend on BROKEN/EXPERT.

You are delivering a general purpose P2P code with a lot of holes in it and
expecting people to jump through it.

Turning security off by default is also not acceptable. Linux requires ACS support
even though you don't care about it for your particular application.

I'd hate ACS to be broken due to some operating system enabling your CONFIG option.
Logan Gunthorpe March 13, 2018, 10 p.m. UTC | #11
On 13/03/18 03:22 PM, Sinan Kaya wrote:
> It sounds like you have very tight hardware expectations for this to work
> at this moment. You also don't want to generalize this code for others and
> address the shortcomings.

No, that's the way the community has pushed this work. Our original work
was very general and we were told it was unacceptable to put the onus on
the user and have things break if the hardware doesn't support it. I
think that's a reasonable requirement. So the hardware use-cases were
wittled down to the ones we can be confident about and support with
reasonable changes to the kernel today.

> To get you going, you should limit this change to the switch products that you have
> validated via white-listing PCI vendor/device ids. Please do not enable this feature
> for all other PCI devices or by default.

This doesn't seem necessary. We know that all PCIe switches available
today support P2P and we are highly confident that any switch that would
ever be produced would support P2P. As long as devices are behind a
switch you don't need any white listing. This is what the current patch
set implements. If you want to start including root ports then you will
need a white list (and solve all the other problems I mentioned earlier).

> I think your code qualifies as a virus until this issue is resolved (so NAK). 

That seems a bit hyperbolic... "a virus"??!... please keep things
constructive.

> You are delivering a general purpose P2P code with a lot of holes in it and
> expecting people to jump through it.
No, the code prevents users from screwing it up. It just requires a
switch in the hardware which is hardly a high bar to jump through
(provided you are putting some design thought into your hardware). And
given it requires semi-custom hardware today, it's not something that
needs to be on by default in any distributor kernel.

> Turning security off by default is also not acceptable. Linux requires ACS support
> even though you don't care about it for your particular application.

That's not true. Linux does not require ACS support. In fact it's
already off by default until you specifically turn on the IOMMU. (Which
is not always the most obvious thing to enable.) And the only thing it
really supports is enforcing isolation between VMs that are using
different pieces of hardware in the system.

> I'd hate ACS to be broken due to some operating system enabling your CONFIG option.

ACS isn't "broken" by enabling the config option. It just makes the
IOMMU groups and isolation less granular. (ie. devices behind a switch
will be in the same group and not isolated from each-other).

Logan
Sinan Kaya March 13, 2018, 10:29 p.m. UTC | #12
On 3/13/2018 6:00 PM, Logan Gunthorpe wrote:
> 
> 
> On 13/03/18 03:22 PM, Sinan Kaya wrote:
>> It sounds like you have very tight hardware expectations for this to work
>> at this moment. You also don't want to generalize this code for others and
>> address the shortcomings.
> 
> No, that's the way the community has pushed this work. Our original work
> was very general and we were told it was unacceptable to put the onus on
> the user and have things break if the hardware doesn't support it. I
> think that's a reasonable requirement. So the hardware use-cases were
> wittled down to the ones we can be confident about and support with
> reasonable changes to the kernel today.

If hardware doesn't support it, blacklisting should have been the right
path and I still think that you should remove all switch business from the code.
I did not hear enough justification for having a switch requirement
for P2P.

You are also saying that root ports have issues not because of functionality but
because of performance. 

If you know what is bad, create a list and keep adding it. You can't assume
universally that all root ports are broken/ have bad performance.

> 
>> To get you going, you should limit this change to the switch products that you have
>> validated via white-listing PCI vendor/device ids. Please do not enable this feature
>> for all other PCI devices or by default.
> 
> This doesn't seem necessary. We know that all PCIe switches available
> today support P2P and we are highly confident that any switch that would
> ever be produced would support P2P. As long as devices are behind a
> switch you don't need any white listing. This is what the current patch
> set implements. If you want to start including root ports then you will
> need a white list (and solve all the other problems I mentioned earlier).

What if I come up with a very cheap/crappy switch (something like used in data
mining)?

What guarantees that P2P will work with this device? You are making an
assumption here that all switches have good performance.

How is that any different from good switch vs. bad switch and good root
port vs. bad root port?

If it is universally broken, why don't you list the things that work?

> 
>> I think your code qualifies as a virus until this issue is resolved (so NAK). 
> 
> That seems a bit hyperbolic... "a virus"??!... please keep things
> constructive.
> 

Sorry, this was harsh. I'm taking "virus" word back. I apologize. 
But, I hold onto my NAK opinion. 

I have been doing my best to provide feedback. It feels like you are throwing
them over the wall to be honest.

You keep implying "not my problem".

> > I agree disabling globally would be bad. Somebody can always say I have
> > ten switches on my system. I want to do peer-to-peer on one switch only. Now,
> > this change weakened security for the other switches that I had no intention
> > with doing P2P.
> >
> > Isn't this a problem?
> 
> Well, if it's a problem for someone they'll have to solve it. We're
> targeting JBOFs that have no use for ACS / IOMMU groups at all.

IMO, you (not somebody) should address this one way or the other before this
series land in upstream.

>> You are delivering a general purpose P2P code with a lot of holes in it and
>> expecting people to jump through it.
> No, the code prevents users from screwing it up. It just requires a
> switch in the hardware which is hardly a high bar to jump through
> (provided you are putting some design thought into your hardware). And
> given it requires semi-custom hardware today, it's not something that
> needs to be on by default in any distributor kernel.
> 
>> Turning security off by default is also not acceptable. Linux requires ACS support
>> even though you don't care about it for your particular application.
> 
> That's not true. Linux does not require ACS support. In fact it's
> already off by default until you specifically turn on the IOMMU. (Which
> is not always the most obvious thing to enable.) And the only thing it
> really supports is enforcing isolation between VMs that are using
> different pieces of hardware in the system.

Another assumption: There are other architectures like ARM64 where IOMMU
is enabled by default even if you don't use VMs for security reasons.
IOMMU blocks stray transactions.

> 
>> I'd hate ACS to be broken due to some operating system enabling your CONFIG option.
> 
> ACS isn't "broken" by enabling the config option. It just makes the
> IOMMU groups and isolation less granular. (ie. devices behind a switch
> will be in the same group and not isolated from each-other).

Didn't the ACS behavior change suddenly for no good reason when we enabled
your code even though I might not be using the P2P but I happen to have
a kernel with P2P config option?
Stephen Bates March 13, 2018, 10:31 p.m. UTC | #13
>> It sounds like you have very tight hardware expectations for this to work
>> at this moment. You also don't want to generalize this code for others and
>> address the shortcomings.
>  No, that's the way the community has pushed this work

Hi Sinan

Thanks for all the input. As Logan has pointed out the switch requirement is something that has evolved over time based on input from the community. You are more than welcome to have an opinion on this (and you have made that opinion clear ;-)). Over time the patchset may evolve from its current requirements but right now we are aligned with the feedback from the community.

Cheers

Stephen
Stephen Bates March 13, 2018, 10:45 p.m. UTC | #14
Hi Sinan

>    If hardware doesn't support it, blacklisting should have been the right
>    path and I still think that you should remove all switch business from the code.
>    I did not hear enough justification for having a switch requirement
>    for P2P.
  
We disagree. As does the community based on feedback on previous version of this patch set.
  
> You are also saying that root ports have issues not because of functionality but
> because of performance. 

I have seen both.  Some systems fail in entirety to route TLPs across route ports. Others do route but with poor performance. Having a requirement for a switch is therefore a reasonable and sensible minimum criteria for enabling p2p.
    
> If you know what is bad, create a list and keep adding it. You can't assume
> universally that all root ports are broken/ have bad performance.

I actually wanted to do a blacklist originally but that idea was not accepted by the community. 
    
>    What if I come up with a very cheap/crappy switch (something like used in data
>    mining)?
> What guarantees that P2P will work with this device? You are making an
> assumption here that all switches have good performance.

A switch must support P2P or it is in violation of the PCIe specification.
    
>    I have been doing my best to provide feedback. It feels like you are throwing
>    them over the wall to be honest.

I think one issue Sinan is that a lot of your ideas have been discussed in previous versions of the series. So you are asking us to make changes which, in some cases, we know are not acceptable to others in the community. The input is welcome but it flies in the face of other input we have received. 

However if other developers feel strongly about the blacklist/whitelist and switch criteria please speak up! ;-)

Stephen
Logan Gunthorpe March 13, 2018, 10:48 p.m. UTC | #15
On 13/03/18 04:29 PM, Sinan Kaya wrote:
> If hardware doesn't support it, blacklisting should have been the right
> path and I still think that you should remove all switch business from the code.
> I did not hear enough justification for having a switch requirement
> for P2P.

I disagree.

> You are also saying that root ports have issues not because of functionality but
> because of performance. 

No... performance can also be an issue but the main justification for
this is that many root ports do not support P2P at all and can fail in
different ways (usually just dropping the TLPs).

> What if I come up with a very cheap/crappy switch (something like used in data
> mining)?

Good luck. That's not how hardware is designed. PCIe switches that have
any hope to compete with the existing market will need features like
NTB, non-transparent ports, etc... and that implies a certain symmetry
(ie there isn't a special host port because there may be more than one
and it may move around) which implies that packets will always be able
to forward between each ports which implies P2P will work.

> I have been doing my best to provide feedback. It feels like you are throwing
> them over the wall to be honest.
> 
> You keep implying "not my problem".

Well, the fact of the matter is that extending this in all the ways
people like you want face huge problems on all sides. These are not
trivial issues and holding back work that works for our problem because
it doesn't solve your problem is IMO just going to grind development in
this area to a halt. We have to have something we can agree on which is
safe to start building on. The building can't just emerge fully formed
in one go.

P2P proposal go back a long time and have never gotten anywhere because
there are limitations and people want it to do things that are hard but
don't want to contribute the work to solving those problems.

>> Well, if it's a problem for someone they'll have to solve it. We're
>> targeting JBOFs that have no use for ACS / IOMMU groups at all.
> 
> IMO, you (not somebody) should address this one way or the other before this
> series land in upstream.

The real way to address this (as I've mentioned before) is with some way
of doing ACS and iomem groups dynamically. But this is a huge project in
itself and is never going to be part of the P2P patchset.

> Another assumption: There are other architectures like ARM64 where IOMMU
> is enabled by default even if you don't use VMs for security reasons.
> IOMMU blocks stray transactions.

True, but it doesn't change my point: ACS is not a requirement for Linux
many many systems do not have it on at all or by default.

> Didn't the ACS behavior change suddenly for no good reason when we enabled
> your code even though I might not be using the P2P but I happen to have
> a kernel with P2P config option?

Well no, presumably you made a conscious choice to turn the config
option on and build a custom kernel for your box. That doesn't seem very
sudden and the reason is that the two concepts are very much at odds
with each other: you can't have isolation and still have transactions
between devices.

Logan
Bjorn Helgaas March 13, 2018, 11:08 p.m. UTC | #16
On Tue, Mar 13, 2018 at 10:31:55PM +0000, Stephen  Bates wrote:
> >> It sounds like you have very tight hardware expectations for this to work
> >> at this moment. You also don't want to generalize this code for others and
> >> address the shortcomings.
> >  No, that's the way the community has pushed this work
> 
> Hi Sinan
> 
> Thanks for all the input. As Logan has pointed out the switch
> requirement is something that has evolved over time based on input
> from the community. You are more than welcome to have an opinion on
> this (and you have made that opinion clear ;-)). Over time the
> patchset may evolve from its current requirements but right now we
> are aligned with the feedback from the community.

This part of the community hasn't been convinced of the need to have
two bridges, e.g., both an Upstream Port and a Downstream Port, or two
conventional PCI bridges, above the peers.

Every PCI-to-PCI bridge is required to support routing transactions
between devices on its secondary side.  Therefore, I think it is
sufficient to verify that the potential peers share a single common
upstream bridge.  This could be a conventional PCI bridge, a Switch
Downstream Port, or a Root Port.

I've seen the response that peers directly below a Root Port could not
DMA to each other through the Root Port because of the "route to self"
issue, and I'm not disputing that.  But enforcing a requirement for
two upstream bridges introduces a weird restriction on conventional
PCI topologies, makes the code hard to read, and I don't think it's
necessary.

If it *is* necessary because Root Ports and devices below them behave
differently than in conventional PCI, I think you should include a
reference to the relevant section of the spec and check directly for a
Root Port.  I would prefer that over trying to exclude Root Ports by
looking for two upstream bridges.

Bjorn
Sinan Kaya March 13, 2018, 11:19 p.m. UTC | #17
On 3/13/2018 6:48 PM, Logan Gunthorpe wrote:
> 
> 
> On 13/03/18 04:29 PM, Sinan Kaya wrote:
>> If hardware doesn't support it, blacklisting should have been the right
>> path and I still think that you should remove all switch business from the code.
>> I did not hear enough justification for having a switch requirement
>> for P2P.
> 
> I disagree.
> 
>> You are also saying that root ports have issues not because of functionality but
>> because of performance. 
> 
> No... performance can also be an issue but the main justification for
> this is that many root ports do not support P2P at all and can fail in
> different ways (usually just dropping the TLPs).
> 
>> What if I come up with a very cheap/crappy switch (something like used in data
>> mining)?
> 
> Good luck. That's not how hardware is designed. PCIe switches that have
> any hope to compete with the existing market will need features like
> NTB, non-transparent ports, etc... and that implies a certain symmetry
> (ie there isn't a special host port because there may be more than one
> and it may move around) which implies that packets will always be able
> to forward between each ports which implies P2P will work.
> 

It is still a switch it can move packets but, maybe it can move data at
100kbps speed. 

What prevents that switch from trying P2P and having a bad user experience?

If everything is so broken, I was suggesting to at least list the switches
you have tested.

What's the problem with this?

Why do you want to assume that all switches are good and all root ports are
bad?

>> I have been doing my best to provide feedback. It feels like you are throwing
>> them over the wall to be honest.
>>
>> You keep implying "not my problem".
> 
> Well, the fact of the matter is that extending this in all the ways
> people like you want face huge problems on all sides. These are not
> trivial issues and holding back work that works for our problem because
> it doesn't solve your problem is IMO just going to grind development in
> this area to a halt. We have to have something we can agree on which is
> safe to start building on. The building can't just emerge fully formed
> in one go.
> 

What if the design is so canned that you can't change anything? 

I have been asking things like getting rid of switch search in ACS
enablement towards achieving generic P2P. You seem to be pushing back.
You said yourself P2P and isolation doesn't go together at this point
but you also care about isolation for other devices that are not doing
P2P.

Confusing...

> P2P proposal go back a long time and have never gotten anywhere because
> there are limitations and people want it to do things that are hard but
> don't want to contribute the work to solving those problems.
> 
>>> Well, if it's a problem for someone they'll have to solve it. We're
>>> targeting JBOFs that have no use for ACS / IOMMU groups at all.
>>
>> IMO, you (not somebody) should address this one way or the other before this
>> series land in upstream.
> 
> The real way to address this (as I've mentioned before) is with some way
> of doing ACS and iomem groups dynamically. But this is a huge project in
> itself and is never going to be part of the P2P patchset.

fair enough.

> 
>> Another assumption: There are other architectures like ARM64 where IOMMU
>> is enabled by default even if you don't use VMs for security reasons.
>> IOMMU blocks stray transactions.
> 
> True, but it doesn't change my point: ACS is not a requirement for Linux
> many many systems do not have it on at all or by default.

I don't think so.

It is not a requirement for you but it is a requirement for me (ARM64 guy).
Linux happens to run on multiple architectures. One exception invalidates your
point.

> 
>> Didn't the ACS behavior change suddenly for no good reason when we enabled
>> your code even though I might not be using the P2P but I happen to have
>> a kernel with P2P config option?
> 

If you are assuming that your kernel option should not be used by general
distributions like Ubuntu/redhat etc. and requires a kernel compilation,
creating a dependency to EXPERT is the right way to do. 

Distributions assume that no damage is done by enabling PCI bus options
under normal circumstances.

> Well no, presumably you made a conscious choice to turn the config
> option on and build a custom kernel for your box. That doesn't seem very
> sudden and the reason is that the two concepts are very much at odds
> with each other: you can't have isolation and still have transactions
> between devices.
> 
> Logan
> 
>
Logan Gunthorpe March 13, 2018, 11:21 p.m. UTC | #18
On 13/03/18 05:08 PM, Bjorn Helgaas wrote:
> On Tue, Mar 13, 2018 at 10:31:55PM +0000, Stephen  Bates wrote:
> If it *is* necessary because Root Ports and devices below them behave
> differently than in conventional PCI, I think you should include a
> reference to the relevant section of the spec and check directly for a
> Root Port.  I would prefer that over trying to exclude Root Ports by
> looking for two upstream bridges.

Well we've established that we don't want to allow root ports.

But we need to, at a minimum, do two pci_upstream_bridge() calls...

Remember, we need to check that a number of devices are behind the same
switch. So we need to find a _common_ upstream port for several devices.
Excluding the multifunction device case (which I don't think is
applicable for reasons we've discussed before), this will *never* be the
first upstream port for a given device. We need to find the upstream
port of the switch and ensure all devices share that same port. If a
device is behind a switch, it's pci_upstream_bridge() is the downstream
switch port which is unique to that device. So a peer device would have
a different pci_upstream_bridge() port but share the same
pci_upstream_bridge(pci_upstream_bridge()) port (assuming they are on
the same switch).

The alternative, is to walk the entire tree of upstream ports and check
that all devices have a common port somewhere in the tree that isn't the
root complex. Arguably this is the right way to do it and would support
a network of switches but is a fair bit more complex to code.

Logan
Logan Gunthorpe March 13, 2018, 11:45 p.m. UTC | #19
On 13/03/18 05:19 PM, Sinan Kaya wrote:
> It is still a switch it can move packets but, maybe it can move data at
> 100kbps speed. 

As Stephen pointed out, it's a requirement of the PCIe spec that a
switch supports P2P. If you want to sell a switch that does P2P with bad
performance then that's on you to deal with.

> What prevents that switch from trying P2P and having a bad user experience?

The fact that no one would buy such a switch as it would have a bad user
experience with regular PCI transfers. A P2P TLP is not special on a
switch as it's routed in just the same way as any others. There'd also
be no cost gain in making such a broken-by-design device.

> If everything is so broken, I was suggesting to at least list the switches
> you have tested.
> 
> What's the problem with this?

More complexity for zero gain.

> Why do you want to assume that all switches are good and all root ports are
> bad?

Because the assumption that all switches are good is more than
reasonable and simplifies the code and maintenance significantly. No one
wants to maintain a white list when they don't have to.

> What if the design is so canned that you can't change anything? 

Based on the feedback we've got so far and the developers that have
participated in getting it to where it is, it is not "canned".

> I have been asking things like getting rid of switch search in ACS
> enablement towards achieving generic P2P. You seem to be pushing back.
> You said yourself P2P and isolation doesn't go together at this point
> but you also care about isolation for other devices that are not doing
> P2P.

P2P and isolation will never be compatible at any point. They are two
opposite concepts. So we could just disable isolation across the whole
system and for our purposes that would be fine. But it's relatively
simple to limit this and only disable it behind switches. So why
wouldn't we? It enables use cases like having an isolated card on the
root complex used in a VM while having P2P on cards behind switches. I
personally have no interest in doing this but I also have no reason to
prevent it with my code.

> It is not a requirement for you but it is a requirement for me (ARM64 guy).
> Linux happens to run on multiple architectures. One exception invalidates your
> point.

It is not a requirement of an architecture or people that use a specific
architecture. It is a requirement of the use-case and you have not said
any use case or how we could do better. If you're running VMs that
require isolation you will need to be *very* careful if you also want to
do P2P between cards which requires no isolation. But, based on my
understanding, most people will want to do one or the other -- not both.
If you want to do P2P you enable the P2P config option, if you want
isolation you don't.

> If you are assuming that your kernel option should not be used by general
> distributions like Ubuntu/redhat etc. and requires a kernel compilation,
> creating a dependency to EXPERT is the right way to do. 

I don't have a problem adding EXPERT as a dependency. We can do that for
v4. I'd rather hope that distros actually read and understand the
kconfig help text before enabling an off-by-default option. But maybe
I'm asking too much.

Logan
Bjorn Helgaas March 14, 2018, 2:56 a.m. UTC | #20
On Tue, Mar 13, 2018 at 05:21:20PM -0600, Logan Gunthorpe wrote:
> On 13/03/18 05:08 PM, Bjorn Helgaas wrote:
> > On Tue, Mar 13, 2018 at 10:31:55PM +0000, Stephen  Bates wrote:
> > If it *is* necessary because Root Ports and devices below them behave
> > differently than in conventional PCI, I think you should include a
> > reference to the relevant section of the spec and check directly for a
> > Root Port.  I would prefer that over trying to exclude Root Ports by
> > looking for two upstream bridges.
> 
> Well we've established that we don't want to allow root ports.

I assume you want to exclude Root Ports because of multi-function
devices and the "route to self" error.  I was hoping for a reference
to that so I could learn more about it.

While I was looking for it, I found sec 6.12.1.2 (PCIe r4.0), "ACS
Functions in SR-IOV Capable and Multi-Function Devices", which seems
relevant.  It talks about "peer-to-peer Requests (between Functions of
the device)".  Thay says to me that multi-function devices can DMA
between themselves.

> But we need to, at a minimum, do two pci_upstream_bridge() calls...
> 
> Remember, we need to check that a number of devices are behind the same
> switch. So we need to find a _common_ upstream port for several devices.

I agree that peers need to have a common upstream bridge.  I think
you're saying peers need to have *two* common upstream bridges.  If I
understand correctly, requiring two common bridges is a way to ensure
that peers directly below Root Ports don't try to DMA to each other.

So I guess the first order of business is to nail down whether peers
below a Root Port are prohibited from DMAing to each other.  My
assumption, based on 6.12.1.2 and the fact that I haven't yet found
a prohibition, is that they can.

> Excluding the multifunction device case (which I don't think is
> applicable for reasons we've discussed before), this will *never* be the
> first upstream port for a given device.

If you're looking for a common upstream bridge, you don't need to make
assumptions about whether the hierarchy is conventional PCI or PCIe or
how many levels are in the hierarchy.

You already have upstream_bridges_match(), which takes two pci_devs.
I think it should walk up the PCI hierarchy from the first device,
checking whether the bridge at each level is also a parent of the
second device.

Bjorn
David Laight March 14, 2018, 12:16 p.m. UTC | #21
From: Logan Gunthorpe
> Sent: 13 March 2018 23:46
...
> As Stephen pointed out, it's a requirement of the PCIe spec that a
> switch supports P2P. If you want to sell a switch that does P2P with bad
> performance then that's on you to deal with.

That surprises me (unless I missed something last time I read the spec).
While P2P writes are relatively easy to handle, reads and any other TLP that
require acks are a completely different proposition.
There are no additional fields that can be set in the read TLP and will be
reflected back in the ack(s) than can be used to route the acks back to the
correct initiator.

I'm pretty sure that to support P2P reads a switch would have to save
the received read TLP and (possibly later on) issue read TLP of its own
for the required data.
I'm not even sure it is easy to interleave the P2P reads with those
coming from the root.
That requires a potentially infinite queue of pending requests.

Some x86 root ports support P2P writes (maybe with a bios option).
It would be a shame not to be able to do P2P writes on such systems
even though P2P reads won't work.

(We looked at using P2P transfers for some data, but in the end used
a different scheme.
For our use case P2P writes were enough.
An alternative would be to access the same host memory buffer from
two different devices - but there isn't an API that lets you do that.)

	David
Stephen Bates March 14, 2018, 2:05 p.m. UTC | #22
>I assume you want to exclude Root Ports because of multi-function
>  devices and the "route to self" error.  I was hoping for a reference
>  to that so I could learn more about it.

Apologies Bjorn. This slipped through my net. I will try and get you a reference for RTS in the next couple of days.
    
> While I was looking for it, I found sec 6.12.1.2 (PCIe r4.0), "ACS
> Functions in SR-IOV Capable and Multi-Function Devices", which seems
> relevant.  It talks about "peer-to-peer Requests (between Functions of
> the device)".  Thay says to me that multi-function devices can DMA
> between themselves.
    
I will go take a look. Appreciate the link.

Stephen
Logan Gunthorpe March 14, 2018, 4:17 p.m. UTC | #23
On 13/03/18 08:56 PM, Bjorn Helgaas wrote:
> I assume you want to exclude Root Ports because of multi-function
> devices and the "route to self" error.  I was hoping for a reference
> to that so I could learn more about it.

I haven't been able to find where in the spec it forbids route to self.
But I was told this by developers who work know switches. Hopefully
Stephen can find the reference.

But it's a bit of a moot point. Devices can DMA to themselves if they
are designed to do so. For example, some NVMe cards can read and write
their own CMB for certain types of DMA request. There is a register in
the spec (CMBSZ) which specifies which types of requests are supported.
(See 3.1.12 in NVMe 1.3a).

> I agree that peers need to have a common upstream bridge.  I think
> you're saying peers need to have *two* common upstream bridges.  If I
> understand correctly, requiring two common bridges is a way to ensure
> that peers directly below Root Ports don't try to DMA to each other.

No, I don't get where you think we need to have two common upstream
bridges. I'm not sure when such a case would ever happen. But you seem
to understand based on what you wrote below.

> So I guess the first order of business is to nail down whether peers
> below a Root Port are prohibited from DMAing to each other.  My
> assumption, based on 6.12.1.2 and the fact that I haven't yet found
> a prohibition, is that they can.

If you have a multifunction device designed to DMA to itself below a
root port, it can. But determining this is on a device by device basis,
just as determining whether a root complex can do peer to peer is on a
per device basis. So I'd say we don't want to allow it by default and
let someone who has such a device figure out what's necessary if and
when one comes along.

> You already have upstream_bridges_match(), which takes two pci_devs.
> I think it should walk up the PCI hierarchy from the first device,
> checking whether the bridge at each level is also a parent of the
> second device.

Yes, this is what I meant when I said walking the entire tree. I've been
kicking the can down the road on implementing this as getting ref
counting right and testing it is going to be quite tricky. The single
switch approach we implemented now is just a simplification which works
for a single switch. But I guess we can look at implementing it this way
for v4.

Logan
Logan Gunthorpe March 14, 2018, 4:23 p.m. UTC | #24
On 14/03/18 06:16 AM, David Laight wrote:
> That surprises me (unless I missed something last time I read the spec).
> While P2P writes are relatively easy to handle, reads and any other TLP that
> require acks are a completely different proposition.
> There are no additional fields that can be set in the read TLP and will be
> reflected back in the ack(s) than can be used to route the acks back to the
> correct initiator.
>
> I'm pretty sure that to support P2P reads a switch would have to save
> the received read TLP and (possibly later on) issue read TLP of its own
> for the required data.
> I'm not even sure it is easy to interleave the P2P reads with those
> coming from the root.
> That requires a potentially infinite queue of pending requests.

This is wrong. A completion is a TLP just like any other and makes use
of the Destination ID field in the header to route it back to the
original requester.

> Some x86 root ports support P2P writes (maybe with a bios option).
> It would be a shame not to be able to do P2P writes on such systems
> even though P2P reads won't work.

Yes, and this has been discussed many times. It won't be changing in the
near term.

Logan
Bjorn Helgaas March 14, 2018, 6:51 p.m. UTC | #25
On Wed, Mar 14, 2018 at 10:17:34AM -0600, Logan Gunthorpe wrote:
> On 13/03/18 08:56 PM, Bjorn Helgaas wrote:
> > I agree that peers need to have a common upstream bridge.  I think
> > you're saying peers need to have *two* common upstream bridges.  If I
> > understand correctly, requiring two common bridges is a way to ensure
> > that peers directly below Root Ports don't try to DMA to each other.
> 
> No, I don't get where you think we need to have two common upstream
> bridges. I'm not sure when such a case would ever happen. But you seem
> to understand based on what you wrote below.

Sorry, I phrased that wrong.  You don't require two common upstream
bridges; you require two upstream bridges, with the upper one being
common, i.e.,

  static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
  {
    struct pci_dev *up1, *up2;

    up1 = pci_dev_get(pci_upstream_bridge(pdev));
    up2 = pci_dev_get(pci_upstream_bridge(up1));
    return up2;
  }

So if you're starting with pdev, up1 is the immediately upstream
bridge and up2 is the second upstream bridge.  If this is PCIe, up1
may be a Root Port and there is no up2, or up1 and up2 are in a
switch.

This is more restrictive than the spec requires.  As long as there is
a single common upstream bridge, peer-to-peer DMA should work.  In
fact, in conventional PCI, I think the upstream bridge could even be
the host bridge (not a PCI-to-PCI bridge).

You are focused on PCIe systems, and in those systems, most topologies
do have an upstream switch, which means two upstream bridges.  I'm
trying to remove that assumption because I don't think there's a
requirement for it in the spec.  Enforcing this assumption complicates
the code and makes it harder to understand because the reader says
"huh, I know peer-to-peer DMA should work inside any PCI hierarchy*,
so why do we need these two bridges?"

[*] For conventional PCI, this means anything below the same host
bridge.  Two devices on a conventional PCI root bus should be able to
DMA to each other, even though there's no PCI-to-PCI bridge above
them.  For PCIe, it means a "hierarchy domain" as used in PCIe r4.0,
sec 1.3.1, i.e., anything below the same Root Port.

> > So I guess the first order of business is to nail down whether peers
> > below a Root Port are prohibited from DMAing to each other.  My
> > assumption, based on 6.12.1.2 and the fact that I haven't yet found
> > a prohibition, is that they can.
> 
> If you have a multifunction device designed to DMA to itself below a
> root port, it can. But determining this is on a device by device basis,
> just as determining whether a root complex can do peer to peer is on a
> per device basis. So I'd say we don't want to allow it by default and
> let someone who has such a device figure out what's necessary if and
> when one comes along.

It's not the job of this infrastructure to answer the device-dependent
question of whether DMA initiators or targets support peer-to-peer
DMA.

All we want to do here is figure out whether the PCI topology supports
it, using the mechanisms guaranteed by the spec.  We can derive that
from the basic rules about how PCI bridges work, i.e., from the
PCI-to-PCI Bridge spec r1.2, sec 4.3:

  A bridge forwards PCI memory transactions from its primary interface
  to its secondary interface (downstream) if a memory address is in
  the range defined by the Memory Base and Memory Limit registers
  (when the base is less than or equal to the limit) as illustrated in
  Figure 4-3. Conversely, a memory transaction on the secondary
  interface that is within this address range will not be forwarded
  upstream to the primary interface. Any memory transactions on the
  secondary interface that are outside this address range will be
  forwarded upstream to the primary interface (provided they are not
  in the address range defined by the prefetchable memory address
  range registers).

This works for either PCI or PCIe.  The only wrinkle PCIe adds is that
the very top of the hierarchy is a Root Port, and we can't rely on it
to route traffic to other Root Ports.  I also doubt Root Complex
Integrated Endpoints can participate in peer-to-peer DMA.

Thanks for your patience in working through all this.  I know it
sometimes feels like being bounced around in all directions.  It's
just a normal consequence of trying to add complex functionality to an
already complex system, with interest and expertise spread unevenly
across a crowd of people.

Bjorn
Logan Gunthorpe March 14, 2018, 7:03 p.m. UTC | #26
On 14/03/18 12:51 PM, Bjorn Helgaas wrote:
> You are focused on PCIe systems, and in those systems, most topologies
> do have an upstream switch, which means two upstream bridges.  I'm
> trying to remove that assumption because I don't think there's a
> requirement for it in the spec.  Enforcing this assumption complicates
> the code and makes it harder to understand because the reader says
> "huh, I know peer-to-peer DMA should work inside any PCI hierarchy*,
> so why do we need these two bridges?"

Yes, as I've said, we focused on being behind a single PCIe Switch
because it's easier and vaguely safer (we *know* switches will work but
other types of topology we have to assume will work based on the spec).
Also, I have my doubts that anyone will ever have a use for this with
non-PCIe devices.

A switch shows up as two or more virtual bridges (per the PCIe v4 Spec
1.3.3) which explains the existing get_upstream_bridge_port() function.

In any case, we'll look at generalizing this by looking for a common
upstream port in the next revision of the patch set.

Logan
Dan Williams March 14, 2018, 7:28 p.m. UTC | #27
On Wed, Mar 14, 2018 at 12:03 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 14/03/18 12:51 PM, Bjorn Helgaas wrote:
>> You are focused on PCIe systems, and in those systems, most topologies
>> do have an upstream switch, which means two upstream bridges.  I'm
>> trying to remove that assumption because I don't think there's a
>> requirement for it in the spec.  Enforcing this assumption complicates
>> the code and makes it harder to understand because the reader says
>> "huh, I know peer-to-peer DMA should work inside any PCI hierarchy*,
>> so why do we need these two bridges?"
>
> Yes, as I've said, we focused on being behind a single PCIe Switch
> because it's easier and vaguely safer (we *know* switches will work but
> other types of topology we have to assume will work based on the spec).
> Also, I have my doubts that anyone will ever have a use for this with
> non-PCIe devices.

P2P over PCI/PCI-X is quite common in devices like raid controllers.
It would be useful if those configurations were not left behind so
that Linux could feasibly deploy offload code to a controller in the
PCI domain.
Logan Gunthorpe March 14, 2018, 7:30 p.m. UTC | #28
On 14/03/18 01:28 PM, Dan Williams wrote:
> P2P over PCI/PCI-X is quite common in devices like raid controllers.
> It would be useful if those configurations were not left behind so
> that Linux could feasibly deploy offload code to a controller in the
> PCI domain.

Thanks for the note. Neat. In the end nothing is getting left behind
it's just work for someone to add support. Even if I wasn't already
going to make the change I mentioned it all fits in the architecture and
APIs quite easily.

Logan
Stephen Bates March 14, 2018, 7:34 p.m. UTC | #29
> P2P over PCI/PCI-X is quite common in devices like raid controllers.

Hi Dan 

Do you mean between PCIe devices below the RAID controller? Isn't it pretty novel to be able to support PCIe EPs below a RAID controller (as opposed to SCSI based devices)?

> It would be useful if those configurations were not left behind so
> that Linux could feasibly deploy offload code to a controller in the
> PCI domain.
   
Agreed. I think this would be great. Kind of like the XCOPY framework that was proposed a while back for SCSI devices [1] but updated to also include NVMe devices. That is definitely a use case we would like this framework to support.

Stephen
 
[1] https://lwn.net/Articles/592094/
Martin K. Petersen March 15, 2018, 4 a.m. UTC | #30
Stephen,

>> It would be useful if those configurations were not left behind so
>> that Linux could feasibly deploy offload code to a controller in the
>> PCI domain.
>    
> Agreed. I think this would be great. Kind of like the XCOPY framework
> that was proposed a while back for SCSI devices [1] but updated to also
> include NVMe devices. That is definitely a use case we would like this
> framework to support.

I'm on my umpteenth rewrite of the block/SCSI offload code. It is not as
protocol-agnostic as I would like in the block layer facing downwards.
It has proven quite hard to reconcile token-based and EXTENDED COPY
semantics along with the desire to support stacking. But from an
application/filesystem perspective everything looks the same regardless
of the intricacies of the device. Nothing is preventing us from
supporting other protocols...
Dan Williams March 15, 2018, 4:30 a.m. UTC | #31
On Wed, Mar 14, 2018 at 12:34 PM, Stephen  Bates <sbates@raithlin.com> wrote:
>> P2P over PCI/PCI-X is quite common in devices like raid controllers.
>
> Hi Dan
>
> Do you mean between PCIe devices below the RAID controller? Isn't it pretty novel to be able to support PCIe EPs below a RAID controller (as opposed to SCSI based devices)?

I'm thinking of the classic I/O offload card where there's an NTB to
an internal PCI bus that has a storage controller and raid offload
engines.
Stephen Bates March 22, 2018, 10:57 p.m. UTC | #32
>  I've seen the response that peers directly below a Root Port could not
> DMA to each other through the Root Port because of the "route to self"
> issue, and I'm not disputing that.  

Bjorn 

You asked me for a reference to RTS in the PCIe specification. As luck would have it I ended up in an Irish bar with Peter Onufryk this week at OCP Summit. We discussed the topic. It is not explicitly referred to as "Route to Self" and it's certainly not explicit (or obvious) but r6.2.8.1 of the PCIe 4.0 specification discusses error conditions for virtual PCI bridges. One of these conditions (given in the very first bullet in that section) applies to a request that is destined for the same port it came in on. When this occurs the request must be terminated as a UR.

Stephen
Bjorn Helgaas March 23, 2018, 9:50 p.m. UTC | #33
On Thu, Mar 22, 2018 at 10:57:32PM +0000, Stephen  Bates wrote:
> >  I've seen the response that peers directly below a Root Port could not
> > DMA to each other through the Root Port because of the "route to self"
> > issue, and I'm not disputing that.  
> 
> Bjorn 
> 
> You asked me for a reference to RTS in the PCIe specification. As
> luck would have it I ended up in an Irish bar with Peter Onufryk
> this week at OCP Summit. We discussed the topic. It is not
> explicitly referred to as "Route to Self" and it's certainly not
> explicit (or obvious) but r6.2.8.1 of the PCIe 4.0 specification
> discusses error conditions for virtual PCI bridges. One of these
> conditions (given in the very first bullet in that section) applies
> to a request that is destined for the same port it came in on. When
> this occurs the request must be terminated as a UR.

Thanks for that reference!

I suspect figure 10-3 in sec 10.1.1 might also be relevant, although
it's buried in the ATS section.  It shows internal routing between
functions of a multifunction device.  That suggests that the functions
should be able to DMA to each other without those transactions ever
appearing on the link.

Popping way up the stack, my original point was that I'm trying to
remove restrictions on what devices can participate in peer-to-peer
DMA.  I think it's fairly clear that in conventional PCI, any devices
in the same PCI hierarchy, i.e., below the same host-to-PCI bridge,
should be able to DMA to each other.

The routing behavior of PCIe is supposed to be compatible with
conventional PCI, and I would argue that this effectively requires
multi-function PCIe devices to have the internal routing required to
avoid the route-to-self issue.

Bjorn
Logan Gunthorpe March 23, 2018, 9:59 p.m. UTC | #34
On 23/03/18 03:50 PM, Bjorn Helgaas wrote:
> Popping way up the stack, my original point was that I'm trying to
> remove restrictions on what devices can participate in peer-to-peer
> DMA.  I think it's fairly clear that in conventional PCI, any devices
> in the same PCI hierarchy, i.e., below the same host-to-PCI bridge,
> should be able to DMA to each other.

Yup, we are working on this.

> The routing behavior of PCIe is supposed to be compatible with
> conventional PCI, and I would argue that this effectively requires
> multi-function PCIe devices to have the internal routing required to
> avoid the route-to-self issue.

That would be very nice but many devices do not support the internal
route. We've had to work around this in the past and as I mentioned
earlier that NVMe devices have a flag indicating support. However, if a
device wants to be involved in P2P it must support it and we can exclude
devices that don't support it by simply not enabling their drivers.

Logan
Bjorn Helgaas March 24, 2018, 3:49 a.m. UTC | #35
On Fri, Mar 23, 2018 at 03:59:14PM -0600, Logan Gunthorpe wrote:
> On 23/03/18 03:50 PM, Bjorn Helgaas wrote:
> > Popping way up the stack, my original point was that I'm trying to
> > remove restrictions on what devices can participate in
> > peer-to-peer DMA.  I think it's fairly clear that in conventional
> > PCI, any devices in the same PCI hierarchy, i.e., below the same
> > host-to-PCI bridge, should be able to DMA to each other.
> 
> Yup, we are working on this.
> 
> > The routing behavior of PCIe is supposed to be compatible with
> > conventional PCI, and I would argue that this effectively requires
> > multi-function PCIe devices to have the internal routing required
> > to avoid the route-to-self issue.
> 
> That would be very nice but many devices do not support the internal
> route. We've had to work around this in the past and as I mentioned
> earlier that NVMe devices have a flag indicating support. However,
> if a device wants to be involved in P2P it must support it and we
> can exclude devices that don't support it by simply not enabling
> their drivers.

Do you think these devices that don't support internal DMA between
functions are within spec, or should we handle them as exceptions,
e.g., via quirks?

If NVMe defines a flag indicating peer-to-peer support, that would
suggest to me that these devices are within spec.

I looked up the CMBSZ register you mentioned (NVMe 1.3a, sec 3.1.12).
You must be referring to the WDS, RDS, LISTS, CQS, and SQS bits.  If
WDS is set, the controller supports having Write-related data and
metadata in the Controller Memory Buffer.  That would mean the driver
could put certain queues in controller memory instead of in host
memory.  The controller could then read the queue from its own
internal memory rather than issuing a PCIe transaction to read it from
host memory.

That makes sense to me, but I don't see the connection to
peer-to-peer.  There's no multi-function device in this picture, so
it's not about internal DMA between functions.

WDS, etc., tell us about capabilities of the controller.  If WDS is
set, the CPU (or a peer PCIe device) can write things to controller
memory.  If it is clear, neither the CPU nor a peer device can put
things there.  So it doesn't seem to tell us anything about
peer-to-peer specifically.  It looks like information needed by the
NVMe driver, but not by the PCI core.

Bjorn
Stephen Bates March 24, 2018, 3:28 p.m. UTC | #36
> That would be very nice but many devices do not support the internal
> route. 

But Logan in the NVMe case we are discussing movement within a single function (i.e. from a NVMe namespace to a NVMe CMB on the same function). Bjorn is discussing movement between two functions (PFs or VFs) in the same PCIe EP. In the case of multi-function endpoints I think the standard requires those devices to support internal DMAs for transfers between those functions (but does not require it within a function).

So I think the summary is:

1. There is no requirement for a single function to support internal DMAs but in the case of NVMe we do have a protocol specific way for a NVMe function to indicate it supports via the CMB BAR. Other protocols may also have such methods but I am not aware of them at this time.

2. For multi-function end-points I think it is a requirement that DMAs *between* functions are supported via an internal path but this can be over-ridden by ACS when supported in the EP.

3. For multi-function end-points there is no requirement to support internal DMA within each individual function (i.e. a la point 1 but extended to each function in a MF device). 

Based on my review of the specification I concur with Bjorn that p2pdma between functions in a MF end-point should be assured to be supported via the standard. However if the p2pdma involves only a single function in a MF device then we can only support NVMe CMBs for now. Let's review and see what the options are for supporting this in the next respin.

Stephen
Jonathan Cameron March 26, 2018, 11:11 a.m. UTC | #37
On Tue, 13 Mar 2018 10:43:55 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:

> On 12/03/18 09:28 PM, Sinan Kaya wrote:
> > On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
> > Regarding the switch business, It is amazing how much trouble you went into
> > limit this functionality into very specific hardware.
> > 
> > I thought that we reached to an agreement that code would not impose
> > any limits on what user wants.
> > 
> > What happened to all the emails we exchanged?  
> 
> It turns out that root ports that support P2P are far less common than 
> anyone thought. So it will likely have to be a white list.

This came as a bit of a surprise to our PCIe architect.

His follow up was whether it was worth raising an ECR for the PCIe spec
to add a capability bit to allow this to be discovered.  This might
long term avoid the need to maintain the white list for new devices.

So is it worth having a long term solution for making this discoverable?

Jonathan

> Nobody else 
> seems keen on allowing the user to enable this on hardware that doesn't 
> work. The easiest solution is still limiting it to using a switch. From 
> there, if someone wants to start creating a white-list then that's 
> probably the way forward to support root ports.
> 
> And there's also the ACS problem which means if you want to use P2P on 
> the root ports you'll have to disable ACS on the entire system. (Or 
> preferably, the IOMMU groups need to get more sophisticated to allow for 
> dynamic changes).
> 
> Additionally, once you allow for root ports you may find the IOMMU 
> getting in the way.
> 
> So there are great deal more issues to sort out if you don't restrict to 
> devices behind switches.
> 
> Logan
Bjorn Helgaas March 26, 2018, 2:01 p.m. UTC | #38
On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote:
> On Tue, 13 Mar 2018 10:43:55 -0600
> Logan Gunthorpe <logang@deltatee.com> wrote:
> > It turns out that root ports that support P2P are far less common than 
> > anyone thought. So it will likely have to be a white list.
> 
> This came as a bit of a surprise to our PCIe architect.
> 
> His follow up was whether it was worth raising an ECR for the PCIe spec
> to add a capability bit to allow this to be discovered.  This might
> long term avoid the need to maintain the white list for new devices.
> 
> So is it worth having a long term solution for making this discoverable?

It was surprising to me that there's no architected way to discover
this.  It seems like such an obvious thing that I guess I assumed the
omission was intentional, i.e., maybe there's something that makes it
impractical, but it would be worth at least asking somebody in the
SIG.  It seems like for root ports in the same root complex, at least,
there could be a bit somewhere in the root port or the RCRB (which
Linux doesn't support yet).
Logan Gunthorpe March 26, 2018, 3:43 p.m. UTC | #39
On 24/03/18 09:28 AM, Stephen  Bates wrote:
> 1. There is no requirement for a single function to support internal DMAs but in the case of NVMe we do have a protocol specific way for a NVMe function to indicate it supports via the CMB BAR. Other protocols may also have such methods but I am not aware of them at this time.
> 
> 2. For multi-function end-points I think it is a requirement that DMAs *between* functions are supported via an internal path but this can be over-ridden by ACS when supported in the EP.
> 
> 3. For multi-function end-points there is no requirement to support internal DMA within each individual function (i.e. a la point 1 but extended to each function in a MF device). 

Yeah, that make sense. I wasn't making a huge distinction between single
and multi-function devices because from an implementation point of view
it's not that different.

Logan
Logan Gunthorpe March 26, 2018, 3:46 p.m. UTC | #40
On 26/03/18 08:01 AM, Bjorn Helgaas wrote:
> On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote:
>> On Tue, 13 Mar 2018 10:43:55 -0600
>> Logan Gunthorpe <logang@deltatee.com> wrote:
>>> It turns out that root ports that support P2P are far less common than 
>>> anyone thought. So it will likely have to be a white list.
>>
>> This came as a bit of a surprise to our PCIe architect.
>>
>> His follow up was whether it was worth raising an ECR for the PCIe spec
>> to add a capability bit to allow this to be discovered.  This might
>> long term avoid the need to maintain the white list for new devices.
>>
>> So is it worth having a long term solution for making this discoverable?
> 
> It was surprising to me that there's no architected way to discover
> this.  It seems like such an obvious thing that I guess I assumed the
> omission was intentional, i.e., maybe there's something that makes it
> impractical, but it would be worth at least asking somebody in the
> SIG.  It seems like for root ports in the same root complex, at least,
> there could be a bit somewhere in the root port or the RCRB (which
> Linux doesn't support yet).

Yes, I agree. It would be a good long term solution to have this bit in
the spec. That would avoid us needing to create a white list for new
hardware. However, I expect it would be years before we can rely on it
so someone may yet implement that white list.

Logan
Jason Gunthorpe March 26, 2018, 4:41 p.m. UTC | #41
On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote:
> On Tue, 13 Mar 2018 10:43:55 -0600
> Logan Gunthorpe <logang@deltatee.com> wrote:
> 
> > On 12/03/18 09:28 PM, Sinan Kaya wrote:
> > > On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
> > > Regarding the switch business, It is amazing how much trouble you went into
> > > limit this functionality into very specific hardware.
> > > 
> > > I thought that we reached to an agreement that code would not impose
> > > any limits on what user wants.
> > > 
> > > What happened to all the emails we exchanged?  
> > 
> > It turns out that root ports that support P2P are far less common than 
> > anyone thought. So it will likely have to be a white list.
> 
> This came as a bit of a surprise to our PCIe architect.

I don't think it is a hardware problem.

I know Mellanox and Nvidia have been doing p2p on Intel root complexes
for something like 5-6 years now.. I don't have the details, but it
does seem to work.

I have heard some chips give poor performance..

Also AMD GPU SLI uses P2P these days, so this isn't exactly a niche
feature in Intel/AMD land.

I think the main issue here is that there is some BIOS involvement to
set things up properly. Eg in GPU land motherboards certify for
'crossfire' support.

> His follow up was whether it was worth raising an ECR for the PCIe spec
> to add a capability bit to allow this to be discovered.  This might
> long term avoid the need to maintain the white list for new devices.

If it is primarily a BIOS issue then it should be an ACPI thing, right?

Jason
Logan Gunthorpe March 26, 2018, 5:30 p.m. UTC | #42
On 26/03/18 10:41 AM, Jason Gunthorpe wrote:
> On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote:
>> On Tue, 13 Mar 2018 10:43:55 -0600
>> Logan Gunthorpe <logang@deltatee.com> wrote:
>>
>>> On 12/03/18 09:28 PM, Sinan Kaya wrote:
>>>> On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
>>>> Regarding the switch business, It is amazing how much trouble you went into
>>>> limit this functionality into very specific hardware.
>>>>
>>>> I thought that we reached to an agreement that code would not impose
>>>> any limits on what user wants.
>>>>
>>>> What happened to all the emails we exchanged?  
>>>
>>> It turns out that root ports that support P2P are far less common than 
>>> anyone thought. So it will likely have to be a white list.
>>
>> This came as a bit of a surprise to our PCIe architect.
> 
> I don't think it is a hardware problem.

The latest and greatest Power9 CPUs still explicitly do not support
this. And, if I recall correctly, the ARM64 device we played with did
not either -- but I suspect that will differ depending on vendor.

The latest Intel devices of course do support it, but go back 3-4 years
or so and the performance is pretty much unusable (at least for the
purposes of what this patchset implements). Even older CPUs did not
support it.

We haven't done any testing on AMD devices but I assume they are similar
to Intel.

In any case, even if every CPU manufactured today supported it well,
there are still older devices out there without support that we need to
ensure are handled properly.

Logan
Jason Gunthorpe March 26, 2018, 7:35 p.m. UTC | #43
On Mon, Mar 26, 2018 at 11:30:38AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 26/03/18 10:41 AM, Jason Gunthorpe wrote:
> > On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote:
> >> On Tue, 13 Mar 2018 10:43:55 -0600
> >> Logan Gunthorpe <logang@deltatee.com> wrote:
> >>
> >>> On 12/03/18 09:28 PM, Sinan Kaya wrote:
> >>>> On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
> >>>> Regarding the switch business, It is amazing how much trouble you went into
> >>>> limit this functionality into very specific hardware.
> >>>>
> >>>> I thought that we reached to an agreement that code would not impose
> >>>> any limits on what user wants.
> >>>>
> >>>> What happened to all the emails we exchanged?  
> >>>
> >>> It turns out that root ports that support P2P are far less common than 
> >>> anyone thought. So it will likely have to be a white list.
> >>
> >> This came as a bit of a surprise to our PCIe architect.
> > 
> > I don't think it is a hardware problem.
> 
> The latest and greatest Power9 CPUs still explicitly do not support
> this.

I think this is another case of the HW can do it but the SW support is
missing. IOMMU configuration and maybe firmware too, for instance.

If I recall I saw a presentation that Coral was expected to use P2P
between the network and GPU.

> And, if I recall correctly, the ARM64 device we played with did
> not either -- but I suspect that will differ depending on vendor.

Wouldn't surprise me at all to see broken implementations in
ARM64.. But even there it needs IOMMU enablement to work at all if I
recall.

Bascially, this is probably not a HW problem that needs a HW bit, but
a OS/firmware problem to do all the enablement..

Jason
Logan Gunthorpe March 26, 2018, 8:42 p.m. UTC | #44
On 26/03/18 01:35 PM, Jason Gunthorpe wrote:
> I think this is another case of the HW can do it but the SW support is
> missing. IOMMU configuration and maybe firmware too, for instance.

Nope, not sure how you can make this leap. We've been specifically told
that peer-to-peer PCIe DMA is not supported on P9. We've also been told
there is a very limited (ie. store only), low performance mode designed
for poking mailboxes but it is disabled by default and must be enable
with some kind of special firmware calls.

Logan
Jonathan Cameron March 27, 2018, 8:47 a.m. UTC | #45
On Mon, 26 Mar 2018 09:46:24 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:

> On 26/03/18 08:01 AM, Bjorn Helgaas wrote:
> > On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote:  
> >> On Tue, 13 Mar 2018 10:43:55 -0600
> >> Logan Gunthorpe <logang@deltatee.com> wrote:  
> >>> It turns out that root ports that support P2P are far less common than 
> >>> anyone thought. So it will likely have to be a white list.  
> >>
> >> This came as a bit of a surprise to our PCIe architect.
> >>
> >> His follow up was whether it was worth raising an ECR for the PCIe spec
> >> to add a capability bit to allow this to be discovered.  This might
> >> long term avoid the need to maintain the white list for new devices.
> >>
> >> So is it worth having a long term solution for making this discoverable?  
> > 
> > It was surprising to me that there's no architected way to discover
> > this.  It seems like such an obvious thing that I guess I assumed the
> > omission was intentional, i.e., maybe there's something that makes it
> > impractical, but it would be worth at least asking somebody in the
> > SIG.  It seems like for root ports in the same root complex, at least,
> > there could be a bit somewhere in the root port or the RCRB (which
> > Linux doesn't support yet).  
> 
> Yes, I agree. It would be a good long term solution to have this bit in
> the spec. That would avoid us needing to create a white list for new
> hardware. However, I expect it would be years before we can rely on it
> so someone may yet implement that white list.
> 
> Logan

I'll see if I can get our PCI SIG people to follow this through and see if
it is just an omission or as Bjorn suggested, there is some reason we
aren't thinking of that makes it hard.

Agreed that it is a somewhat tangential question to what to do with current
hardware, but let's get the ball rolling if possible.

Jonathan
Logan Gunthorpe March 27, 2018, 3:37 p.m. UTC | #46
On 27/03/18 02:47 AM, Jonathan Cameron wrote:
> I'll see if I can get our PCI SIG people to follow this through and see if
> it is just an omission or as Bjorn suggested, there is some reason we
> aren't thinking of that makes it hard.

That would be great! Thanks!

Logan
Stephen Bates April 13, 2018, 9:56 p.m. UTC | #47
>  I'll see if I can get our PCI SIG people to follow this through 

Hi Jonathan

Can you let me know if this moves forward within PCI-SIG? I would like to track it. I can see this being doable between Root Ports that reside in the same Root Complex but might become more challenging to standardize for RPs that reside in different RCs in the same (potentially multi-socket) system. I know in the past we have seem MemWr TLPS cross the QPI bus in Intel systems but I am sure that is not something that would work in all systems and must fall outside the remit of PCI-SIG ;-).

I agree such a capability bit would be very useful but it's going to be quite some time before we can rely on hardware being available that supports it.

Stephen
diff mbox

Patch

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 34b56a8f8480..d59f6f5ddfcd 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -124,6 +124,22 @@  config PCI_PASID
 
 	  If unsure, say N.
 
+config PCI_P2PDMA
+	bool "PCI peer-to-peer transfer support"
+	depends on ZONE_DEVICE
+	select GENERIC_ALLOCATOR
+	help
+	  Enableѕ drivers to do PCI peer-to-peer transactions to and from
+	  BARs that are exposed in other devices that are the part of
+	  the hierarchy where peer-to-peer DMA is guaranteed by the PCI
+	  specification to work (ie. anything below a single PCI bridge).
+
+	  Many PCIe root complexes do not support P2P transactions and
+	  it's hard to tell which support it at all, so at this time you
+	  will need a PCIe switch.
+
+	  If unsure, say N.
+
 config PCI_LABEL
 	def_bool y if (DMI || ACPI)
 	depends on PCI
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 941970936840..45e0ff6f3213 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -26,6 +26,7 @@  obj-$(CONFIG_PCI_MSI) += msi.o
 
 obj-$(CONFIG_PCI_ATS) += ats.o
 obj-$(CONFIG_PCI_IOV) += iov.o
+obj-$(CONFIG_PCI_P2PDMA) += p2pdma.o
 
 #
 # ACPI Related PCI FW Functions
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
new file mode 100644
index 000000000000..0ee917381dce
--- /dev/null
+++ b/drivers/pci/p2pdma.c
@@ -0,0 +1,679 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2018, Logan Gunthorpe
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig
+ * Copyright (c) 2018, Eideticom Inc.
+ *
+ */
+
+#include <linux/pci-p2pdma.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/genalloc.h>
+#include <linux/memremap.h>
+#include <linux/percpu-refcount.h>
+#include <linux/random.h>
+
+struct pci_p2pdma {
+	struct percpu_ref devmap_ref;
+	struct completion devmap_ref_done;
+	struct gen_pool *pool;
+	bool p2pmem_published;
+};
+
+static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
+{
+	struct pci_p2pdma *p2p =
+		container_of(ref, struct pci_p2pdma, devmap_ref);
+
+	complete_all(&p2p->devmap_ref_done);
+}
+
+static void pci_p2pdma_percpu_kill(void *data)
+{
+	struct percpu_ref *ref = data;
+
+	if (percpu_ref_is_dying(ref))
+		return;
+
+	percpu_ref_kill(ref);
+}
+
+static void pci_p2pdma_release(void *data)
+{
+	struct pci_dev *pdev = data;
+
+	if (!pdev->p2pdma)
+		return;
+
+	wait_for_completion(&pdev->p2pdma->devmap_ref_done);
+	percpu_ref_exit(&pdev->p2pdma->devmap_ref);
+
+	gen_pool_destroy(pdev->p2pdma->pool);
+	pdev->p2pdma = NULL;
+}
+
+static int pci_p2pdma_setup(struct pci_dev *pdev)
+{
+	int error = -ENOMEM;
+	struct pci_p2pdma *p2p;
+
+	p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL);
+	if (!p2p)
+		return -ENOMEM;
+
+	p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
+	if (!p2p->pool)
+		goto out;
+
+	init_completion(&p2p->devmap_ref_done);
+	error = percpu_ref_init(&p2p->devmap_ref,
+			pci_p2pdma_percpu_release, 0, GFP_KERNEL);
+	if (error)
+		goto out_pool_destroy;
+
+	percpu_ref_switch_to_atomic_sync(&p2p->devmap_ref);
+
+	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+	if (error)
+		goto out_pool_destroy;
+
+	pdev->p2pdma = p2p;
+
+	return 0;
+
+out_pool_destroy:
+	gen_pool_destroy(p2p->pool);
+out:
+	devm_kfree(&pdev->dev, p2p);
+	return error;
+}
+
+/**
+ * pci_p2pdma_add_resource - add memory for use as p2p memory
+ * @pdev: the device to add the memory to
+ * @bar: PCI BAR to add
+ * @size: size of the memory to add, may be zero to use the whole BAR
+ * @offset: offset into the PCI BAR
+ *
+ * The memory will be given ZONE_DEVICE struct pages so that it may
+ * be used with any DMA request.
+ */
+int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
+			    u64 offset)
+{
+	struct dev_pagemap *pgmap;
+	void *addr;
+	int error;
+
+	if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
+		return -EINVAL;
+
+	if (offset >= pci_resource_len(pdev, bar))
+		return -EINVAL;
+
+	if (!size)
+		size = pci_resource_len(pdev, bar) - offset;
+
+	if (size + offset > pci_resource_len(pdev, bar))
+		return -EINVAL;
+
+	if (!pdev->p2pdma) {
+		error = pci_p2pdma_setup(pdev);
+		if (error)
+			return error;
+	}
+
+	pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
+	if (!pgmap)
+		return -ENOMEM;
+
+	pgmap->res.start = pci_resource_start(pdev, bar) + offset;
+	pgmap->res.end = pgmap->res.start + size - 1;
+	pgmap->res.flags = pci_resource_flags(pdev, bar);
+	pgmap->ref = &pdev->p2pdma->devmap_ref;
+	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+
+	addr = devm_memremap_pages(&pdev->dev, pgmap);
+	if (IS_ERR(addr)) {
+		error = PTR_ERR(addr);
+		goto pgmap_free;
+	}
+
+	error = gen_pool_add_virt(pdev->p2pdma->pool, (unsigned long)addr,
+			pci_bus_address(pdev, bar) + offset,
+			resource_size(&pgmap->res), dev_to_node(&pdev->dev));
+	if (error)
+		goto pgmap_free;
+
+	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
+					  &pdev->p2pdma->devmap_ref);
+	if (error)
+		goto pgmap_free;
+
+	pci_info(pdev, "added peer-to-peer DMA memory %pR\n",
+		 &pgmap->res);
+
+	return 0;
+
+pgmap_free:
+	devres_free(pgmap);
+	return error;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource);
+
+static struct pci_dev *find_parent_pci_dev(struct device *dev)
+{
+	struct device *parent;
+
+	dev = get_device(dev);
+
+	while (dev) {
+		if (dev_is_pci(dev))
+			return to_pci_dev(dev);
+
+		parent = get_device(dev->parent);
+		put_device(dev);
+		dev = parent;
+	}
+
+	return NULL;
+}
+
+/*
+ * If a device is behind a switch, we try to find the upstream bridge
+ * port of the switch. This requires two calls to pci_upstream_bridge():
+ * one for the upstream port on the switch, one on the upstream port
+ * for the next level in the hierarchy. Because of this, devices connected
+ * to the root port will be rejected.
+ */
+static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
+{
+	struct pci_dev *up1, *up2;
+
+	if (!pdev)
+		return NULL;
+
+	up1 = pci_dev_get(pci_upstream_bridge(pdev));
+	if (!up1)
+		return NULL;
+
+	up2 = pci_dev_get(pci_upstream_bridge(up1));
+	pci_dev_put(up1);
+
+	return up2;
+}
+
+/*
+ * This function checks if two PCI devices are behind the same switch.
+ * (ie. they share the same second upstream port as returned by
+ *  get_upstream_bridge_port().)
+ *
+ * Future work could expand this to handle hierarchies of switches
+ * so any devices whose traffic can be routed without going through
+ * the root complex could be used. For now, we limit it to just one
+ * level of switch.
+ *
+ * This function returns the "distance" between the devices. 0 meaning
+ * they are the same device, 1 meaning they are behind the same switch.
+ * If they are not behind the same switch, -1 is returned.
+ */
+static int __upstream_bridges_match(struct pci_dev *upstream,
+				    struct pci_dev *client)
+{
+	struct pci_dev *dma_up;
+	int ret = 1;
+
+	dma_up = get_upstream_bridge_port(client);
+
+	if (!dma_up) {
+		dev_dbg(&client->dev, "not a PCI device behind a bridge\n");
+		ret = -1;
+		goto out;
+	}
+
+	if (upstream != dma_up) {
+		dev_dbg(&client->dev,
+			"does not reside on the same upstream bridge\n");
+		ret = -1;
+		goto out;
+	}
+
+out:
+	pci_dev_put(dma_up);
+	return ret;
+}
+
+static int upstream_bridges_match(struct pci_dev *provider,
+				  struct pci_dev *client)
+{
+	struct pci_dev *upstream;
+	int ret;
+
+	if (provider == client)
+		return 0;
+
+	upstream = get_upstream_bridge_port(provider);
+	if (!upstream) {
+		pci_warn(provider, "not behind a PCI bridge\n");
+		return -1;
+	}
+
+	ret = __upstream_bridges_match(upstream, client);
+
+	pci_dev_put(upstream);
+
+	return ret;
+}
+
+struct pci_p2pdma_client {
+	struct list_head list;
+	struct pci_dev *client;
+	struct pci_dev *provider;
+};
+
+/**
+ * pci_p2pdma_add_client - allocate a new element in a client device list
+ * @head: list head of p2pdma clients
+ * @dev: device to add to the list
+ *
+ * This adds @dev to a list of clients used by a p2pdma device.
+ * This list should be passed to pci_p2pmem_find(). Once pci_p2pmem_find() has
+ * been called successfully, the list will be bound to a specific p2pdma
+ * device and new clients can only be added to the list if they are
+ * supported by that p2pdma device.
+ *
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2p functions can be called concurrently
+ * on that list.
+ *
+ * Returns 0 if the client was successfully added.
+ */
+int pci_p2pdma_add_client(struct list_head *head, struct device *dev)
+{
+	struct pci_p2pdma_client *item, *new_item;
+	struct pci_dev *provider = NULL;
+	struct pci_dev *client;
+	int ret;
+
+	if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops) {
+		dev_warn(dev,
+			 "cannot be used for peer-to-peer DMA because the driver makes use of dma_virt_ops\n");
+		return -ENODEV;
+	}
+
+
+	client = find_parent_pci_dev(dev);
+	if (!client) {
+		dev_warn(dev,
+			 "cannot be used for peer-to-peer DMA as it is not a PCI device\n");
+		return -ENODEV;
+	}
+
+	item = list_first_entry_or_null(head, struct pci_p2pdma_client, list);
+	if (item && item->provider) {
+		provider = item->provider;
+
+		if (upstream_bridges_match(provider, client) < 0) {
+			ret = -EXDEV;
+			goto put_client;
+		}
+	}
+
+	new_item = kzalloc(sizeof(*new_item), GFP_KERNEL);
+	if (!new_item) {
+		ret = -ENOMEM;
+		goto put_client;
+	}
+
+	new_item->client = client;
+	new_item->provider = pci_dev_get(provider);
+
+	list_add_tail(&new_item->list, head);
+
+	return 0;
+
+put_client:
+	pci_dev_put(client);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_add_client);
+
+static void pci_p2pdma_client_free(struct pci_p2pdma_client *item)
+{
+	list_del(&item->list);
+	pci_dev_put(item->client);
+	pci_dev_put(item->provider);
+	kfree(item);
+}
+
+/**
+ * pci_p2pdma_remove_client - remove and free a new p2pdma client
+ * @head: list head of p2pdma clients
+ * @dev: device to remove from the list
+ *
+ * This removes @dev from a list of clients used by a p2pdma device.
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2p functions can be called concurrently
+ * on that list.
+ */
+void pci_p2pdma_remove_client(struct list_head *head, struct device *dev)
+{
+	struct pci_p2pdma_client *pos, *tmp;
+	struct pci_dev *pdev;
+
+	pdev = find_parent_pci_dev(dev);
+	if (!pdev)
+		return;
+
+	list_for_each_entry_safe(pos, tmp, head, list) {
+		if (pos->client != pdev)
+			continue;
+
+		pci_p2pdma_client_free(pos);
+	}
+
+	pci_dev_put(pdev);
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_remove_client);
+
+/**
+ * pci_p2pdma_client_list_free - free an entire list of p2pdma clients
+ * @head: list head of p2pdma clients
+ *
+ * This removes all devices in a list of clients used by a p2pdma device.
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2pdma functions can be called concurrently
+ * on that list.
+ */
+void pci_p2pdma_client_list_free(struct list_head *head)
+{
+	struct pci_p2pdma_client *pos, *tmp;
+
+	list_for_each_entry_safe(pos, tmp, head, list)
+		pci_p2pdma_client_free(pos);
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_client_list_free);
+
+/**
+ * pci_p2pdma_distance - Determive the cumulative distance between
+ *	a p2pdma provider and the clients in use.
+ * @provider: p2pdma provider to check against the client list
+ * @clients: list of devices to check (NULL-terminated)
+ *
+ * Returns -1 if any of the clients are not compatible (behind the same
+ * switch as the provider), otherwise returns a positive number where
+ * the lower number is the preferrable choice. (If there's one client
+ * that's the same as the provider it will return 0, which is best choice).
+ *
+ * For now, "compatible" means the provider and the clients are all behind
+ * the same switch. This cuts out cases that may work but is safest for the
+ * user. Future work can expand this to cases with nested switches.
+ */
+int pci_p2pdma_distance(struct pci_dev *provider, struct list_head *clients)
+{
+	struct pci_p2pdma_client *pos;
+	struct pci_dev *upstream;
+	int ret;
+	int distance = 0;
+
+	upstream = get_upstream_bridge_port(provider);
+	if (!upstream) {
+		pci_warn(provider, "not behind a PCI bridge\n");
+		return false;
+	}
+
+	list_for_each_entry(pos, clients, list) {
+		if (pos->client == provider)
+			continue;
+
+		ret = __upstream_bridges_match(upstream, pos->client);
+		if (ret < 0)
+			goto no_match;
+
+		distance += ret;
+	}
+
+	ret = distance;
+
+no_match:
+	pci_dev_put(upstream);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_distance);
+
+/**
+ * pci_p2pdma_assign_provider - Check compatibily (as per pci_p2pdma_distance)
+ *	and assign a provider to a list of clients
+ * @provider: p2pdma provider to assign to the client list
+ * @clients: list of devices to check (NULL-terminated)
+ *
+ * Returns false if any of the clients are not compatible, true if the
+ * provider was successfully assigned to the clients.
+ */
+bool pci_p2pdma_assign_provider(struct pci_dev *provider,
+				struct list_head *clients)
+{
+	struct pci_p2pdma_client *pos;
+
+	if (pci_p2pdma_distance(provider, clients) < 0)
+		return false;
+
+	list_for_each_entry(pos, clients, list)
+		pos->provider = provider;
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_assign_provider);
+
+/**
+ * pci_has_p2pmem - check if a given PCI device has published any p2pmem
+ * @pdev: PCI device to check
+ */
+bool pci_has_p2pmem(struct pci_dev *pdev)
+{
+	return pdev->p2pdma && pdev->p2pdma->p2pmem_published;
+}
+EXPORT_SYMBOL_GPL(pci_has_p2pmem);
+
+/**
+ * pci_p2pmem_find - find a peer-to-peer DMA memory device compatible with
+ *	the specified list of clients and shortest distance (as determined
+ *	by pci_p2pmem_dma())
+ * @clients: list of devices to check (NULL-terminated)
+ *
+ * If multiple devices are behind the same switch, the one "closest" to the
+ * client devices in use will be chosen first. (So if one of the providers are
+ * the same as one of the clients, that provider will be used ahead of any
+ * other providers that are unrelated). If multiple providers are an equal
+ * distance away, one will be chosen at random.
+ *
+ * Returns a pointer to the PCI device with a reference taken (use pci_dev_put
+ * to return the reference) or NULL if no compatible device is found. The
+ * found provider will also be assigned to the client list.
+ */
+struct pci_dev *pci_p2pmem_find(struct list_head *clients)
+{
+	struct pci_dev *pdev = NULL;
+	struct pci_p2pdma_client *pos;
+	int distance;
+	int closest_distance = INT_MAX;
+	struct pci_dev **closest_pdevs;
+	int ties = 0;
+	const int max_ties = PAGE_SIZE / sizeof(*closest_pdevs);
+	int i;
+
+	closest_pdevs = kmalloc(PAGE_SIZE, GFP_KERNEL);
+
+	while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev))) {
+		if (!pci_has_p2pmem(pdev))
+			continue;
+
+		distance = pci_p2pdma_distance(pdev, clients);
+		if (distance < 0 || distance > closest_distance)
+			continue;
+
+		if (distance == closest_distance && ties >= max_ties)
+			continue;
+
+		if (distance < closest_distance) {
+			for (i = 0; i < ties; i++)
+				pci_dev_put(closest_pdevs[i]);
+
+			ties = 0;
+			closest_distance = distance;
+		}
+
+		closest_pdevs[ties++] = pci_dev_get(pdev);
+	}
+
+	if (ties)
+		pdev = pci_dev_get(closest_pdevs[prandom_u32_max(ties)]);
+
+	for (i = 0; i < ties; i++)
+		pci_dev_put(closest_pdevs[i]);
+
+	if (pdev)
+		list_for_each_entry(pos, clients, list)
+			pos->provider = pdev;
+
+	kfree(closest_pdevs);
+	return pdev;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_find);
+
+/**
+ * pci_alloc_p2p_mem - allocate peer-to-peer DMA memory
+ * @pdev: the device to allocate memory from
+ * @size: number of bytes to allocate
+ *
+ * Returns the allocated memory or NULL on error.
+ */
+void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
+{
+	void *ret;
+
+	if (unlikely(!pdev->p2pdma))
+		return NULL;
+
+	if (unlikely(!percpu_ref_tryget_live(&pdev->p2pdma->devmap_ref)))
+		return NULL;
+
+	ret = (void *)gen_pool_alloc(pdev->p2pdma->pool, size);
+
+	if (unlikely(!ret))
+		percpu_ref_put(&pdev->p2pdma->devmap_ref);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_alloc_p2pmem);
+
+/**
+ * pci_free_p2pmem - allocate peer-to-peer DMA memory
+ * @pdev: the device the memory was allocated from
+ * @addr: address of the memory that was allocated
+ * @size: number of bytes that was allocated
+ */
+void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)
+{
+	gen_pool_free(pdev->p2pdma->pool, (uintptr_t)addr, size);
+	percpu_ref_put(&pdev->p2pdma->devmap_ref);
+}
+EXPORT_SYMBOL_GPL(pci_free_p2pmem);
+
+/**
+ * pci_virt_to_bus - return the PCI bus address for a given virtual
+ *	address obtained with pci_alloc_p2pmem()
+ * @pdev: the device the memory was allocated from
+ * @addr: address of the memory that was allocated
+ */
+pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)
+{
+	if (!addr)
+		return 0;
+	if (!pdev->p2pdma)
+		return 0;
+
+	/*
+	 * Note: when we added the memory to the pool we used the PCI
+	 * bus address as the physical address. So gen_pool_virt_to_phys()
+	 * actually returns the bus address despite the misleading name.
+	 */
+	return gen_pool_virt_to_phys(pdev->p2pdma->pool, (unsigned long)addr);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_virt_to_bus);
+
+/**
+ * pci_p2pmem_alloc_sgl - allocate peer-to-peer DMA memory in a scatterlist
+ * @pdev: the device to allocate memory from
+ * @sgl: the allocated scatterlist
+ * @nents: the number of SG entries in the list
+ * @length: number of bytes to allocate
+ *
+ * Returns 0 on success
+ */
+int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
+			 unsigned int *nents, u32 length)
+{
+	struct scatterlist *sg;
+	void *addr;
+
+	sg = kzalloc(sizeof(*sg), GFP_KERNEL);
+	if (!sg)
+		return -ENOMEM;
+
+	sg_init_table(sg, 1);
+
+	addr = pci_alloc_p2pmem(pdev, length);
+	if (!addr)
+		goto out_free_sg;
+
+	sg_set_buf(sg, addr, length);
+	*sgl = sg;
+	*nents = 1;
+	return 0;
+
+out_free_sg:
+	kfree(sg);
+	return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_alloc_sgl);
+
+/**
+ * pci_p2pmem_free_sgl - free a scatterlist allocated by pci_p2pmem_alloc_sgl()
+ * @pdev: the device to allocate memory from
+ * @sgl: the allocated scatterlist
+ * @nents: the number of SG entries in the list
+ */
+void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
+			 unsigned int nents)
+{
+	struct scatterlist *sg;
+	int count;
+
+	if (!sgl || !nents)
+		return;
+
+	for_each_sg(sgl, sg, nents, count)
+		pci_free_p2pmem(pdev, sg_virt(sg), sg->length);
+	kfree(sgl);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_free_sgl);
+
+/**
+ * pci_p2pmem_publish - publish the peer-to-peer DMA memory for use by
+ *	other devices with pci_p2pmem_find()
+ * @pdev: the device with peer-to-peer DMA memory to publish
+ * @publish: set to true to publish the memory, false to unpublish it
+ */
+void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
+{
+	if (publish && !pdev->p2pdma)
+		return;
+
+	pdev->p2pdma->p2pmem_published = publish;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7b4899c06f49..9e907c338a44 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -53,11 +53,16 @@  struct vmem_altmap {
  * driver can hotplug the device memory using ZONE_DEVICE and with that memory
  * type. Any page of a process can be migrated to such memory. However no one
  * should be allow to pin such memory so that it can always be evicted.
+ *
+ * MEMORY_DEVICE_PCI_P2PDMA:
+ * Device memory residing in a PCI BAR intended for use with Peer-to-Peer
+ * transactions.
  */
 enum memory_type {
 	MEMORY_DEVICE_HOST = 0,
 	MEMORY_DEVICE_PRIVATE,
 	MEMORY_DEVICE_PUBLIC,
+	MEMORY_DEVICE_PCI_P2PDMA,
 };
 
 /*
@@ -161,6 +166,19 @@  static inline void vmem_altmap_free(struct vmem_altmap *altmap,
 }
 #endif /* CONFIG_ZONE_DEVICE */
 
+#ifdef CONFIG_PCI_P2PDMA
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
+}
+#else /* CONFIG_PCI_P2PDMA */
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+	return false;
+}
+#endif /* CONFIG_PCI_P2PDMA */
+
 #if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
 static inline bool is_device_private_page(const struct page *page)
 {
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
new file mode 100644
index 000000000000..1f7856ff098b
--- /dev/null
+++ b/include/linux/pci-p2pdma.h
@@ -0,0 +1,101 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2018, Logan Gunthorpe
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig
+ * Copyright (c) 2018, Eideticom Inc.
+ *
+ */
+
+#ifndef _LINUX_PCI_P2PDMA_H
+#define _LINUX_PCI_P2PDMA_H
+
+#include <linux/pci.h>
+
+struct block_device;
+struct scatterlist;
+
+#ifdef CONFIG_PCI_P2PDMA
+int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
+		u64 offset);
+int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
+void pci_p2pdma_remove_client(struct list_head *head, struct device *dev);
+void pci_p2pdma_client_list_free(struct list_head *head);
+int pci_p2pdma_distance(struct pci_dev *provider, struct list_head *clients);
+bool pci_p2pdma_assign_provider(struct pci_dev *provider,
+				struct list_head *clients);
+bool pci_has_p2pmem(struct pci_dev *pdev);
+struct pci_dev *pci_p2pmem_find(struct list_head *clients);
+void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size);
+void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size);
+pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr);
+int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
+		unsigned int *nents, u32 length);
+void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
+		unsigned int nents);
+void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+#else /* CONFIG_PCI_P2PDMA */
+static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
+		size_t size, u64 offset)
+{
+	return 0;
+}
+static inline int pci_p2pdma_add_client(struct list_head *head,
+		struct device *dev)
+{
+	return 0;
+}
+static inline void pci_p2pdma_remove_client(struct list_head *head,
+		struct device *dev)
+{
+}
+static inline void pci_p2pdma_client_list_free(struct list_head *head)
+{
+}
+static inline int pci_p2pdma_distance(struct pci_dev *provider,
+				      struct list_head *clients)
+{
+	return -1;
+}
+static inline bool pci_p2pdma_assign_provider(struct pci_dev *provider,
+					      struct list_head *clients)
+{
+	return false;
+}
+static inline bool pci_has_p2pmem(struct pci_dev *pdev)
+{
+	return false;
+}
+static inline struct pci_dev *pci_p2pmem_find(struct list_head *clients)
+{
+	return NULL;
+}
+static inline void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
+{
+	return NULL;
+}
+static inline void pci_free_p2pmem(struct pci_dev *pdev, void *addr,
+		size_t size)
+{
+}
+static inline pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev,
+						    void *addr)
+{
+	return 0;
+}
+static inline int pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
+		struct scatterlist **sgl, unsigned int *nents, u32 length)
+{
+	return -ENODEV;
+}
+static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
+		struct scatterlist *sgl, unsigned int nents)
+{
+}
+static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
+{
+}
+#endif /* CONFIG_PCI_P2PDMA */
+#endif /* _LINUX_PCI_P2P_H */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 024a1beda008..437e42615896 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -276,6 +276,7 @@  struct pcie_link_state;
 struct pci_vpd;
 struct pci_sriov;
 struct pci_ats;
+struct pci_p2pdma;
 
 /* The pci_dev structure describes PCI devices */
 struct pci_dev {
@@ -429,6 +430,9 @@  struct pci_dev {
 #ifdef CONFIG_PCI_PASID
 	u16		pasid_features;
 #endif
+#ifdef CONFIG_PCI_P2PDMA
+	struct pci_p2pdma *p2pdma;
+#endif
 	phys_addr_t	rom;		/* Physical address if not from BAR */
 	size_t		romlen;		/* Length if not from BAR */
 	char		*driver_override; /* Driver name to force a match */