[2/8] PCI: Add pci_find_common_upstream_dev()

Message ID 20180325110000.2238-2-christian.koenig@amd.com (mailing list archive)
State New, archived

Commit Message

Christian König March 25, 2018, 10:59 a.m. UTC
From: "wdavis@nvidia.com" <wdavis@nvidia.com>

Add an interface to find the first device which is upstream of both
devices.

Signed-off-by: Will Davis <wdavis@nvidia.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/pci/search.c | 24 ++++++++++++++++++++++++
 include/linux/pci.h  |  2 ++
 2 files changed, 26 insertions(+)

Comments

Christoph Hellwig March 28, 2018, 12:38 p.m. UTC | #1
On Sun, Mar 25, 2018 at 12:59:54PM +0200, Christian König wrote:
> From: "wdavis@nvidia.com" <wdavis@nvidia.com>
> 
> Add an interface to find the first device which is upstream of both
> devices.

Please work with Logan and base this on top of the outstanding peer
to peer patchset.
Christian König March 28, 2018, 3:07 p.m. UTC | #2
Am 28.03.2018 um 14:38 schrieb Christoph Hellwig:
> On Sun, Mar 25, 2018 at 12:59:54PM +0200, Christian König wrote:
>> From: "wdavis@nvidia.com" <wdavis@nvidia.com>
>>
>> Add an interface to find the first device which is upstream of both
>> devices.
> Please work with Logan and base this on top of the outstanding peer
> to peer patchset.

Can you point me to that? The last code I could find about that was from 
2015.

Thanks,
Christian.
Logan Gunthorpe March 28, 2018, 3:47 p.m. UTC | #3
On 28/03/18 09:07 AM, Christian König wrote:
> Am 28.03.2018 um 14:38 schrieb Christoph Hellwig:
>> On Sun, Mar 25, 2018 at 12:59:54PM +0200, Christian König wrote:
>>> From: "wdavis@nvidia.com" <wdavis@nvidia.com>
>>>
>>> Add an interface to find the first device which is upstream of both
>>> devices.
>> Please work with Logan and base this on top of the outstanding peer
>> to peer patchset.
> 
> Can you point me to that? The last code I could find about that was from 
> 2015.

The latest posted series is here:

https://lkml.org/lkml/2018/3/12/830

However, we've made some significant changes to the area that's similar
to what you are doing. You can find the latest unposted version here:

https://github.com/sbates130272/linux-p2pmem/tree/pci-p2p-v4-pre2

Specifically this function would be of interest to you:

https://github.com/sbates130272/linux-p2pmem/blob/0e9468ae2a5a5198513dd12990151e09105f0351/drivers/pci/p2pdma.c#L239

However, the difference between what we are doing and what you are doing
is that we are interested in the distance through the common upstream
device, whereas you appear to be finding the actual common device.

Thanks,

Logan
Christian König March 28, 2018, 4:02 p.m. UTC | #4
Am 28.03.2018 um 17:47 schrieb Logan Gunthorpe:
>
> On 28/03/18 09:07 AM, Christian König wrote:
>> Am 28.03.2018 um 14:38 schrieb Christoph Hellwig:
>>> On Sun, Mar 25, 2018 at 12:59:54PM +0200, Christian König wrote:
>>>> From: "wdavis@nvidia.com" <wdavis@nvidia.com>
>>>>
>>>> Add an interface to find the first device which is upstream of both
>>>> devices.
>>> Please work with Logan and base this on top of the outstanding peer
>>> to peer patchset.
>> Can you point me to that? The last code I could find about that was from
>> 2015.
> The latest posted series is here:
>
> https://lkml.org/lkml/2018/3/12/830
>
> However, we've made some significant changes to the area that's similar
> to what you are doing. You can find lasted un-posted here:
>
> https://github.com/sbates130272/linux-p2pmem/tree/pci-p2p-v4-pre2
>
> Specifically this function would be of interest to you:
>
> https://github.com/sbates130272/linux-p2pmem/blob/0e9468ae2a5a5198513dd12990151e09105f0351/drivers/pci/p2pdma.c#L239
>
> However, the difference between what we are doing is that we are
> interested in the distance through the common upstream device and you
> appear to be finding the actual common device.

Yeah, that looks very similar to what I picked up from the older 
patches, going to read up on that after my vacation.

Just in general, why are you interested in the "distance" between the devices?

And BTW: At least for writes, Peer 2 Peer transactions between 
different root complexes working is actually more common than the other 
way around.

So I'm a bit torn between using a blacklist or a whitelist. A whitelist 
is certainly the more conservative approach, but it could get a bit long.

Thanks,
Christian.

>
> Thanks,
>
> Logan
Logan Gunthorpe March 28, 2018, 4:25 p.m. UTC | #5
On 28/03/18 10:02 AM, Christian König wrote:
> Yeah, that looks very similar to what I picked up from the older 
> patches, going to read up on that after my vacation.

Yeah, I was just reading through your patchset and there are a lot of
similarities. Though, I'm not sure what you're trying to accomplish as I
could not find a cover letter and it seems to only enable one driver. Is
it meant to enable DMA transactions only between two AMD GPUs?

I also don't see where you've taken into account the PCI bus address. On
some architectures this is not the same as the CPU physical address.

> Just in general why are you interested in the "distance" of the devices?

We've taken a general approach where some drivers may provide p2p memory
(ie. an NVMe card or an RDMA NIC) and other drivers make use of it (ie.
the NVMe-of driver). The orchestrator driver needs to find the most
applicable provider device for a transaction in a situation that may
have multiple providers and multiple clients. So the most applicable
provider is the one that's closest ("distance"-wise) to all the clients
for the P2P transaction.
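
Roughly, and with a hypothetical helper name since the real distance
function lives in the unposted tree, the orchestrator logic looks
something like this:

#include <linux/kernel.h>
#include <linux/pci.h>

/* Sketch only: pick the provider with the smallest total "distance" to
 * all clients. p2p_upstream_distance() is a stand-in for the helper in
 * the p2pdma tree, not an upstream API. */
static struct pci_dev *pick_p2p_provider(struct pci_dev **providers, int np,
                                         struct pci_dev **clients, int nc)
{
    struct pci_dev *best = NULL;
    int best_dist = INT_MAX;
    int i, j;

    for (i = 0; i < np; i++) {
        int dist = 0;

        for (j = 0; j < nc; j++) {
            int d = p2p_upstream_distance(providers[i], clients[j]);

            if (d < 0) {            /* no usable P2P path */
                dist = INT_MAX;
                break;
            }
            dist += d;
        }
        if (dist < best_dist) {
            best_dist = dist;
            best = providers[i];
        }
    }
    return best;
}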

> And BTW: At least for writes that Peer 2 Peer transactions between 
> different root complexes work is actually more common than the other way 
> around.

Maybe on x86 with hardware made in the last few years. But on PowerPC,
ARM64, and likely a lot more, the chance of support is *much* less. Also,
hardware that only supports P2P stores is hardly full support and is
insufficient for our needs.

> So I'm a bit torn between using a blacklist or a whitelist. A whitelist 
> is certainly more conservative approach, but that could get a bit long.

I think a whitelist approach is correct. Given old hardware and other
architectures, a black list is going to be too long and too difficult to
comprehensively populate.

Logan
Christian König March 28, 2018, 6:28 p.m. UTC | #6
Am 28.03.2018 um 18:25 schrieb Logan Gunthorpe:
>
> On 28/03/18 10:02 AM, Christian König wrote:
>> Yeah, that looks very similar to what I picked up from the older
>> patches, going to read up on that after my vacation.
> Yeah, I was just reading through your patchset and there are a lot of
> similarities. Though, I'm not sure what you're trying to accomplish as I
> could not find a cover letter and it seems to only enable one driver.

Yeah, it was the last day before my Easter vacation and I wanted it out 
the door.

> Is it meant to enable DMA transactions only between two AMD GPUs?

Not really, DMA-buf is a general framework for sharing buffers between 
device drivers.

It is widely used in the GFX stack on laptops with both Intel+AMD, 
Intel+NVIDIA or AMD+AMD graphics devices.

In addition to that, ARM uses it quite extensively for their GFX stacks 
because they have the rendering and display devices separated.

I'm just using amdgpu as blueprint because I'm the co-maintainer of it 
and know it mostly inside out.

> I also don't see where you've taken into account the PCI bus address. On
> some architectures this is not the same as the CPU physical address.

The resource addresses are translated using dma_map_resource(). As far 
as I know that should be sufficient to offload all the architecture 
specific stuff to the DMA subsystem.
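
For reference, a minimal sketch of what that call looks like from the
importing driver's side (direction and error handling are use-case
specific; this is not the exact amdgpu code):

#include <linux/dma-mapping.h>

/* Sketch: map a peer's BAR region for DMA by "dev". "phys" is the CPU
 * physical address of the exported BAR range, "len" its size. */
static int example_map_peer_bar(struct device *dev, phys_addr_t phys,
                                size_t len)
{
    dma_addr_t addr;

    addr = dma_map_resource(dev, phys, len, DMA_BIDIRECTIONAL, 0);
    if (dma_mapping_error(dev, addr))
        return -EIO;    /* arch/IOMMU could not map the resource */

    /* ... hand "addr" to the DMA engine and wait for completion ... */

    dma_unmap_resource(dev, addr, len, DMA_BIDIRECTIONAL, 0);
    return 0;
}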

>
>> Just in general why are you interested in the "distance" of the devices?
> We've taken a general approach where some drivers may provide p2p memory
> (ie. an NVMe card or an RDMA NIC) and other drivers make use of it (ie.
> the NVMe-of driver). The orchestrator driver needs to find the most
> applicable provider device for a transaction in a situation that may
> have multiple providers and multiple clients. So the most applicable
> provider is the one that's closest ("distance"-wise) to all the clients
> for the P2P transaction.

That seems to make sense.

>
>> And BTW: At least for writes that Peer 2 Peer transactions between
>> different root complexes work is actually more common than the other way
>> around.
> Maybe on x86 with hardware made in the last few years. But on PowerPC,
> ARM64, and likely a lot more the chance of support is *much* less. Also,
> hardware that only supports P2P stores is hardly full support and is
> insufficient for our needs.

Yeah, but not for ours. See, if you want to do real peer 2 peer you need 
to take both the operation as well as the direction into account.

For example when you can do writes between A and B that doesn't mean 
that writes between B and A work. And reads are generally less likely to 
work than writes. etc...

Since the use case I'm targeting is GFX or GFX+V4L (or GFX+NIC in 
the future), I really need to handle all such cases as well.

>
>> So I'm a bit torn between using a blacklist or a whitelist. A whitelist
>> is certainly more conservative approach, but that could get a bit long.
> I think a whitelist approach is correct. Given old hardware and other
> architectures, a black list is going to be too long and too difficult to
> comprehensively populate.

Yeah, it would certainly be better if we had something in the root 
complex capabilities. But you're right that a whitelist sounds like the 
less painful way.

Regards,
Christian.


>
> Logan
Logan Gunthorpe March 28, 2018, 6:57 p.m. UTC | #7
On 28/03/18 12:28 PM, Christian König wrote:
> I'm just using amdgpu as blueprint because I'm the co-maintainer of it 
> and know it mostly inside out.

Ah, I see.

> The resource addresses are translated using dma_map_resource(). As far 
> as I know that should be sufficient to offload all the architecture 
> specific stuff to the DMA subsystem.

It's not. The dma_map infrastructure currently has no concept of
peer-to-peer mappings and is designed for system memory only. No
architecture I'm aware of will translate PCI CPU addresses into PCI Bus
addresses which is necessary for any transfer that doesn't go through
the root complex (though on arches like x86 the CPU and Bus address
happen to be the same). There's a lot of people that would like to see
this change but it's likely going to be a long road before it does.

Furthermore, one of the reasons our patch-set avoids going through the
root complex at all is that IOMMU drivers will need to be made aware
that it is operating on P2P memory and do arch-specific things
accordingly. There will also need to be flags that indicate whether a
given IOMMU driver supports this. None of this work is done or easy.

> Yeah, but not for ours. See if you want to do real peer 2 peer you need 
> to keep both the operation as well as the direction into account.

Not sure what you are saying here... I'm pretty sure we are doing "real"
peer 2 peer...

> For example when you can do writes between A and B that doesn't mean 
> that writes between B and A work. And reads are generally less likely to 
> work than writes. etc...

If both devices are behind a switch then the PCI spec guarantees that A
can both read and write B and vice versa. Only once you involve root
complexes do you have this problem. Ie. you have unknown support which
may be no support, or partial support (stores but not loads); or
sometimes bad performance; or a combination of both... and you need some
way to figure out all this mess and that is hard. Whoever tries to
implement a white list will have to sort all this out.

Logan
Christian König March 28, 2018, 7:44 p.m. UTC | #8
Am 28.03.2018 um 20:57 schrieb Logan Gunthorpe:
>
> On 28/03/18 12:28 PM, Christian König wrote:
>> I'm just using amdgpu as blueprint because I'm the co-maintainer of it
>> and know it mostly inside out.
> Ah, I see.
>
>> The resource addresses are translated using dma_map_resource(). As far
>> as I know that should be sufficient to offload all the architecture
>> specific stuff to the DMA subsystem.
> It's not. The dma_map infrastructure currently has no concept of
> peer-to-peer mappings and is designed for system memory only. No
> architecture I'm aware of will translate PCI CPU addresses into PCI Bus
> addresses which is necessary for any transfer that doesn't go through
> the root complex (though on arches like x86 the CPU and Bus address
> happen to be the same). There's a lot of people that would like to see
> this change but it's likely going to be a long road before it does.

Well, isn't that exactly what dma_map_resource() is good for? As far as 
I can see it makes sure the IOMMU is aware of the access route and 
translates a CPU address into a PCI bus address.

> Furthermore, one of the reasons our patch-set avoids going through the
> root complex at all is that IOMMU drivers will need to be made aware
> that it is operating on P2P memory and do arch-specific things
> accordingly. There will also need to be flags that indicate whether a
> given IOMMU driver supports this. None of this work is done or easy.

I'm using that with the AMD IOMMU driver and at least there it works 
perfectly fine.

>> Yeah, but not for ours. See if you want to do real peer 2 peer you need
>> to keep both the operation as well as the direction into account.
> Not sure what you are saying here... I'm pretty sure we are doing "real"
> peer 2 peer...
>
>> For example when you can do writes between A and B that doesn't mean
>> that writes between B and A work. And reads are generally less likely to
>> work than writes. etc...
> If both devices are behind a switch then the PCI spec guarantees that A
> can both read and write B and vice versa.

Sorry to say that, but I know a whole bunch of PCI devices which 
horribly ignore that.

For example all AMD APUs fall under that category...

> Only once you involve root
> complexes do you have this problem. Ie. you have unknown support which
> may be no support, or partial support (stores but not loads); or
> sometimes bad performance; or a combination of both... and you need some
> way to figure out all this mess and that is hard. Whoever tries to
> implement a white list will have to sort all this out.

Yes, exactly and unfortunately it looks like I'm the poor guy who needs 
to do this :)

Regards,
Christian.

>
> Logan
Logan Gunthorpe March 28, 2018, 7:53 p.m. UTC | #9
On 28/03/18 01:44 PM, Christian König wrote:
> Well, isn't that exactly what dma_map_resource() is good for? As far as 
> I can see it makes sure IOMMU is aware of the access route and 
> translates a CPU address into a PCI Bus address.

> I'm using that with the AMD IOMMU driver and at least there it works 
> perfectly fine.

Yes, it would be nice, but no arch has implemented this yet. We are just
lucky in the x86 case because that arch is simple and doesn't need to do
anything for P2P (partially due to the Bus and CPU addresses being the
same). But in the general case, you can't rely on it.

>>> Yeah, but not for ours. See if you want to do real peer 2 peer you need
>>> to keep both the operation as well as the direction into account.
>> Not sure what you are saying here... I'm pretty sure we are doing "real"
>> peer 2 peer...
>>
>>> For example when you can do writes between A and B that doesn't mean
>>> that writes between B and A work. And reads are generally less likely to
>>> work than writes. etc...
>> If both devices are behind a switch then the PCI spec guarantees that A
>> can both read and write B and vice versa.
> 
> Sorry to say that, but I know a whole bunch of PCI devices which 
> horrible ignores that.

Can you elaborate? As far as the device is concerned it shouldn't know
whether a request comes from a peer or from the host. If it does do
crazy stuff like that it's well out of spec. It's up to the switch (or
root complex if good support exists) to route the request to the device
and it's the root complex that tends to be what drops the load requests
which causes the asymmetries.

Logan
Christian König March 29, 2018, 11:44 a.m. UTC | #10
Am 28.03.2018 um 21:53 schrieb Logan Gunthorpe:
>
> On 28/03/18 01:44 PM, Christian König wrote:
>> Well, isn't that exactly what dma_map_resource() is good for? As far as
>> I can see it makes sure IOMMU is aware of the access route and
>> translates a CPU address into a PCI Bus address.
>> I'm using that with the AMD IOMMU driver and at least there it works
>> perfectly fine.
> Yes, it would be nice, but no arch has implemented this yet. We are just
> lucky in the x86 case because that arch is simple and doesn't need to do
> anything for P2P (partially due to the Bus and CPU addresses being the
> same). But in the general case, you can't rely on it.

Well, that an arch hasn't implemented it doesn't mean that we don't have 
the right interface to do it.

>>>> Yeah, but not for ours. See if you want to do real peer 2 peer you need
>>>> to keep both the operation as well as the direction into account.
>>> Not sure what you are saying here... I'm pretty sure we are doing "real"
>>> peer 2 peer...
>>>
>>>> For example when you can do writes between A and B that doesn't mean
>>>> that writes between B and A work. And reads are generally less likely to
>>>> work than writes. etc...
>>> If both devices are behind a switch then the PCI spec guarantees that A
>>> can both read and write B and vice versa.
>> Sorry to say that, but I know a whole bunch of PCI devices which
>> horrible ignores that.
> Can you elaborate? As far as the device is concerned it shouldn't know
> whether a request comes from a peer or from the host. If it does do
> crazy stuff like that it's well out of spec. It's up to the switch (or
> root complex if good support exists) to route the request to the device
> and it's the root complex that tends to be what drops the load requests
> which causes the asymmetries.

Devices integrated in the CPU usually only "claim" to be PCIe devices. 
In reality their memory request paths go directly through the integrated 
north bridge. The reason for this is simple: better throughput/latency.

That is hidden from the software, for example the BIOS just allocates 
address space for the BARs as if it's a normal PCIe device.

The only crux is that when you then do peer2peer your requests simply go 
into nirvana and are not handled by anything, because the BARs are only 
visible from the CPU side of the northbridge.

Regards,
Christian.

>
> Logan
Alex Deucher March 29, 2018, 2:37 p.m. UTC | #11
Sorry, didn't mean to drop the lists here. re-adding.

On Wed, Mar 28, 2018 at 4:05 PM, Alex Deucher <alexdeucher@gmail.com> wrote:
> On Wed, Mar 28, 2018 at 3:53 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>>
>>
>> On 28/03/18 01:44 PM, Christian König wrote:
>>> Well, isn't that exactly what dma_map_resource() is good for? As far as
>>> I can see it makes sure IOMMU is aware of the access route and
>>> translates a CPU address into a PCI Bus address.
>>
>>> I'm using that with the AMD IOMMU driver and at least there it works
>>> perfectly fine.
>>
>> Yes, it would be nice, but no arch has implemented this yet. We are just
>> lucky in the x86 case because that arch is simple and doesn't need to do
>> anything for P2P (partially due to the Bus and CPU addresses being the
>> same). But in the general case, you can't rely on it.
>
> Could we do something for the arches where it works?  I feel like peer
> to peer has dragged out for years because everyone is trying to boil
> the ocean for all arches.  There are a huge number of use cases for
> peer to peer on these "simple" architectures which actually represent
> a good deal of the users that want this.
>
> Alex
>
>>
>>>>> Yeah, but not for ours. See if you want to do real peer 2 peer you need
>>>>> to keep both the operation as well as the direction into account.
>>>> Not sure what you are saying here... I'm pretty sure we are doing "real"
>>>> peer 2 peer...
>>>>
>>>>> For example when you can do writes between A and B that doesn't mean
>>>>> that writes between B and A work. And reads are generally less likely to
>>>>> work than writes. etc...
>>>> If both devices are behind a switch then the PCI spec guarantees that A
>>>> can both read and write B and vice versa.
>>>
>>> Sorry to say that, but I know a whole bunch of PCI devices which
>>> horrible ignores that.
>>
>> Can you elaborate? As far as the device is concerned it shouldn't know
>> whether a request comes from a peer or from the host. If it does do
>> crazy stuff like that it's well out of spec. It's up to the switch (or
>> root complex if good support exists) to route the request to the device
>> and it's the root complex that tends to be what drops the load requests
>> which causes the asymmetries.
>>
>> Logan
Logan Gunthorpe March 29, 2018, 3:45 p.m. UTC | #12
On 29/03/18 05:44 AM, Christian König wrote:
> Am 28.03.2018 um 21:53 schrieb Logan Gunthorpe:
>>
>> On 28/03/18 01:44 PM, Christian König wrote:
>>> Well, isn't that exactly what dma_map_resource() is good for? As far as
>>> I can see it makes sure IOMMU is aware of the access route and
>>> translates a CPU address into a PCI Bus address.
>>> I'm using that with the AMD IOMMU driver and at least there it works
>>> perfectly fine.
>> Yes, it would be nice, but no arch has implemented this yet. We are just
>> lucky in the x86 case because that arch is simple and doesn't need to do
>> anything for P2P (partially due to the Bus and CPU addresses being the
>> same). But in the general case, you can't rely on it.
> 
> Well, that an arch hasn't implemented it doesn't mean that we don't have 
> the right interface to do it.

Yes, but right now we don't have a performant way to check if we are
doing P2P or not in the dma_map_X() wrappers. And this is necessary to
check if the DMA ops in use support it or not. We can't have the
dma_map_X() functions do the wrong thing because they don't support it yet.

> Devices integrated in the CPU usually only "claim" to be PCIe devices. 
> In reality their memory request path go directly through the integrated 
> north bridge. The reason for this is simple better throughput/latency.

These are just more reasons why our patchset restricts to devices behind
a switch. And more mess for someone to deal with if they need to relax
that restriction.

Logan
Christian König March 29, 2018, 4:10 p.m. UTC | #13
Am 29.03.2018 um 17:45 schrieb Logan Gunthorpe:
>
> On 29/03/18 05:44 AM, Christian König wrote:
>> Am 28.03.2018 um 21:53 schrieb Logan Gunthorpe:
>>> On 28/03/18 01:44 PM, Christian König wrote:
>>>> Well, isn't that exactly what dma_map_resource() is good for? As far as
>>>> I can see it makes sure IOMMU is aware of the access route and
>>>> translates a CPU address into a PCI Bus address.
>>>> I'm using that with the AMD IOMMU driver and at least there it works
>>>> perfectly fine.
>>> Yes, it would be nice, but no arch has implemented this yet. We are just
>>> lucky in the x86 case because that arch is simple and doesn't need to do
>>> anything for P2P (partially due to the Bus and CPU addresses being the
>>> same). But in the general case, you can't rely on it.
>> Well, that an arch hasn't implemented it doesn't mean that we don't have
>> the right interface to do it.
> Yes, but right now we don't have a performant way to check if we are
> doing P2P or not in the dma_map_X() wrappers.

Why not? I mean the dma_map_resource() function is for P2P while other 
dma_map_* functions are only for system memory.

> And this is necessary to
> check if the DMA ops in use support it or not. We can't have the
> dma_map_X() functions do the wrong thing because they don't support it yet.

Well, that sounds like we should just return an error from 
dma_map_resource() when an architecture doesn't support P2P yet, as Alex 
suggested.

>> Devices integrated in the CPU usually only "claim" to be PCIe devices.
>> In reality their memory request path go directly through the integrated
>> north bridge. The reason for this is simple better throughput/latency.
> These are just more reasons why our patchset restricts to devices behind
> a switch. And more mess for someone to deal with if they need to relax
> that restriction.

You don't seem to understand the implications: The devices do have a 
common upstream bridge! In other words your code would currently claim 
that P2P is supported, but in practice it doesn't work.

You need to include both drivers which participate in the P2P 
transaction to make sure that both support this and give them the 
opportunity to chicken out, and in the case of AMD APUs even redirect the 
request to another location (e.g. participate in the DMA translation).

DMA-buf fortunately seems to handle all this already, that's why we 
chose it as the base for our implementation.

Regards,
Christian.
Logan Gunthorpe March 29, 2018, 4:25 p.m. UTC | #14
On 29/03/18 10:10 AM, Christian König wrote:
> Why not? I mean the dma_map_resource() function is for P2P while other 
> dma_map_* functions are only for system memory.

Oh, hmm, I wasn't aware dma_map_resource was exclusively for mapping
P2P. Though it's a bit odd, seeing as we've been working under the
assumption that PCI P2P is different as it has to translate the PCI bus
address, whereas P2P for devices on other buses is a big unknown.

>> And this is necessary to
>> check if the DMA ops in use support it or not. We can't have the
>> dma_map_X() functions do the wrong thing because they don't support it yet.
> 
> Well that sounds like we should just return an error from 
> dma_map_resources() when an architecture doesn't support P2P yet as Alex 
> suggested.

Yes, well except in our patch-set we can't easily use
dma_map_resource() as we either have SGLs to deal with or we need to
create whole new interfaces to a number of subsystems.

> You don't seem to understand the implications: The devices do have a 
> common upstream bridge! In other words your code would currently claim 
> that P2P is supported, but in practice it doesn't work.

Do they? They don't on any of the Intel machines I'm looking at. The
previous version of the patchset not only required a common upstream
bridge but two layers of upstream bridges on both devices which would
effectively limit transfers to PCIe switches only. But Bjorn did not
like this.

> You need to include both drivers which participate in the P2P 
> transaction to make sure that both supports this and give them 
> opportunity to chicken out and in the case of AMD APUs even redirect the 
> request to another location (e.g. participate in the DMA translation).

I don't think it's the driver's responsibility to reject P2P. The
topology is what governs support or not. The discussions we had with
Bjorn settled on if the devices are all behind the same bridge they can
communicate with each other. This is essentially guaranteed by the PCI spec.

> DMA-buf fortunately seems to handle all this already, that's why we 
> choose it as base for our implementation.

Well, unfortunately DMA-buf doesn't help for the drivers we are working
with as neither the block layer nor the RDMA subsystem have any
interfaces for it.

Logan
Christian König March 29, 2018, 6:15 p.m. UTC | #15
Am 29.03.2018 um 18:25 schrieb Logan Gunthorpe:
>
> On 29/03/18 10:10 AM, Christian König wrote:
>> Why not? I mean the dma_map_resource() function is for P2P while other
>> dma_map_* functions are only for system memory.
> Oh, hmm, I wasn't aware dma_map_resource was exclusively for mapping
> P2P. Though it's a bit odd seeing we've been working under the
> assumption that PCI P2P is different as it has to translate the PCI bus
> address. Where as P2P for devices on other buses is a big unknown.

Yeah, completely agree. On my TODO list (but rather far down) is 
actually supporting P2P with USB devices.

And no, I don't have the slightest idea how to do this at the moment.

>>> And this is necessary to
>>> check if the DMA ops in use support it or not. We can't have the
>>> dma_map_X() functions do the wrong thing because they don't support it yet.
>> Well that sounds like we should just return an error from
>> dma_map_resources() when an architecture doesn't support P2P yet as Alex
>> suggested.
> Yes, well except in our patch-set we can't easily use
> dma_map_resources() as we either have SGLs to deal with or we need to
> create whole new interfaces to a number of subsystems.

Agree as well. I was also in clear favor of extending the SGLs to have a 
flag for this instead of the dma_map_resource() interface, but for some 
reason that didn't make it into the kernel.

>> You don't seem to understand the implications: The devices do have a
>> common upstream bridge! In other words your code would currently claim
>> that P2P is supported, but in practice it doesn't work.
> Do they? They don't on any of the Intel machines I'm looking at. The
> previous version of the patchset not only required a common upstream
> bridge but two layers of upstream bridges on both devices which would
> effectively limit transfers to PCIe switches only. But Bjorn did not
> like this.

At least to me that sounds like a good idea, it would at least disable 
(the incorrect) auto detection of P2P for such devices.

>> You need to include both drivers which participate in the P2P
>> transaction to make sure that both supports this and give them
>> opportunity to chicken out and in the case of AMD APUs even redirect the
>> request to another location (e.g. participate in the DMA translation).
> I don't think it's the drivers responsibility to reject P2P . The
> topology is what governs support or not. The discussions we had with
> Bjorn settled on if the devices are all behind the same bridge they can
> communicate with each other. This is essentially guaranteed by the PCI spec.

Well, it is not only about rejecting P2P; see, the devices I need to worry 
about are essentially part of the CPU. Their resources look like a PCI BAR 
to the BIOS and OS, but are actually backed by stolen system memory.

So as crazy as it sounds, what you get is an operation which starts as 
P2P, but then the GPU driver sees it and says: Hey, please don't write 
that to my PCIe BAR, but rather to system memory location X.

>> DMA-buf fortunately seems to handle all this already, that's why we
>> choose it as base for our implementation.
> Well, unfortunately DMA-buf doesn't help for the drivers we are working
> with as neither the block layer nor the RDMA subsystem have any
> interfaces for it.

A fact that gives me quite some sleepless nights as well. I think we 
sooner or later need to extend those interfaces to work with DMA-bufs as 
well.

I will try to give your patch set a review when I'm back from vacation 
and rebase my DMA-buf work on top of that.

Regards,
Christian.

>
> Logan
Jerome Glisse March 30, 2018, 1:58 a.m. UTC | #16
On Thu, Mar 29, 2018 at 10:25:52AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 29/03/18 10:10 AM, Christian König wrote:
> > Why not? I mean the dma_map_resource() function is for P2P while other 
> > dma_map_* functions are only for system memory.
> 
> Oh, hmm, I wasn't aware dma_map_resource was exclusively for mapping
> P2P. Though it's a bit odd seeing we've been working under the
> assumption that PCI P2P is different as it has to translate the PCI bus
> address. Where as P2P for devices on other buses is a big unknown.

dma_map_resource() is the right API (though its current implementation
is filled with x86 assumptions). So I would argue that an arch can decide to
implement it or simply return a DMA error address which triggers the fallback
path in the caller (at least for GPU drivers). An SG variant can be added
on top.
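
In other words, something along these lines on the caller side; the
bounce path here is only a hypothetical placeholder for whatever
driver-specific fallback exists:

#include <linux/dma-mapping.h>

/* Sketch: try the P2P mapping first and fall back to system memory if
 * the arch rejects it. fallback_through_sysmem() is hypothetical. */
static dma_addr_t map_peer_or_fallback(struct device *dev,
                                       phys_addr_t bar_phys, size_t len)
{
    dma_addr_t addr;

    addr = dma_map_resource(dev, bar_phys, len, DMA_TO_DEVICE, 0);
    if (!dma_mapping_error(dev, addr))
        return addr;            /* P2P path works on this arch */

    /* Arch/IOMMU refused: bounce through system memory instead. */
    return fallback_through_sysmem(dev, bar_phys, len);
}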

dma_map_resource() is the right API because it has all the necessary
information. It uses the CPU physical address as the common address
"language": with the CPU physical address of a PCIe BAR to map to another
device, you can find the corresponding bus address from the IOMMU code
(a NOP on x86). So dma_map_resource() knows both the source device which
exports its PCIe BAR and the destination device.


So I don't think Christian needs to base his patchset on yours. This
is orthogonal. Christian is using an existing upstream API; if it is
broken on some arch it is up to that arch to fix it, or return an error
if it is not fixable. In fact I would assume that you would need to
base your patchset on top of dma_map_resource() too ...

Moreover I doubt Christian wants to have struct page for this. For
nouveau there will be HMM struct pages and this would conflict.

AFAICT you need struct page because the API you are trying to expose
to user space relies on the "regular" filesystem/RDMA umem APIs which all
assume struct page. GPU drivers do not have this requirement and it
should not be imposed on them.


So from my point of view Christian's patchset is good as it is, modulo a
better commit message.


Bottom line, I think we need a common PCIe helper for P2P and the current
dma_map_resource() is the right kind from my POV. What you are doing with
struct page is specific to your sub-system and should not be imposed on
others.

Cheers,
Jérôme
Christoph Hellwig March 30, 2018, 6:33 a.m. UTC | #17
On Thu, Mar 29, 2018 at 09:58:54PM -0400, Jerome Glisse wrote:
> dma_map_resource() is the right API (thought its current implementation
> is fill with x86 assumptions). So i would argue that arch can decide to
> implement it or simply return dma error address which trigger fallback
> path into the caller (at least for GPU drivers). SG variant can be added
> on top.

It isn't in general.  It doesn't integrate with scatterlists (see my
comment to page one), and it doesn't integrate with all the subsystems
that also need a kernel virtual address.
Jerome Glisse March 30, 2018, 3:25 p.m. UTC | #18
On Thu, Mar 29, 2018 at 11:33:34PM -0700, Christoph Hellwig wrote:
> On Thu, Mar 29, 2018 at 09:58:54PM -0400, Jerome Glisse wrote:
> > dma_map_resource() is the right API (thought its current implementation
> > is fill with x86 assumptions). So i would argue that arch can decide to
> > implement it or simply return dma error address which trigger fallback
> > path into the caller (at least for GPU drivers). SG variant can be added
> > on top.
> 
> It isn't in general.  It doesn't integrate with scatterlists (see my
> comment to page one), and it doesn't integrate with all the subsystems
> that also need a kernel virtual address.

IIRC an SG variant was proposed at the time dma_map_resource() was added;
dunno why it did not make it (again from memory, I think it was because
it grows the scatterlist struct size). My point is more that people who
need the SG variant should work on adding it. People who do not should not
be stopped by the lack of it. There is something there upstream; it does
not make sense to not use it. The dma-buf infrastructure is useful to
many drivers.

If you do not want to share the underlying logic of dma_map_resource(),
fine (i.e. not sharing code inside drivers/iommu/*). I thought it would
be a good thing to share code ...

Cheers,
Jérôme
Logan Gunthorpe March 30, 2018, 6:46 p.m. UTC | #19
On 29/03/18 07:58 PM, Jerome Glisse wrote:
> On Thu, Mar 29, 2018 at 10:25:52AM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 29/03/18 10:10 AM, Christian König wrote:
>>> Why not? I mean the dma_map_resource() function is for P2P while other 
>>> dma_map_* functions are only for system memory.
>>
>> Oh, hmm, I wasn't aware dma_map_resource was exclusively for mapping
>> P2P. Though it's a bit odd seeing we've been working under the
>> assumption that PCI P2P is different as it has to translate the PCI bus
>> address. Where as P2P for devices on other buses is a big unknown.
> 
> dma_map_resource() is the right API (thought its current implementation
> is fill with x86 assumptions). So i would argue that arch can decide to
> implement it or simply return dma error address which trigger fallback
> path into the caller (at least for GPU drivers). SG variant can be added
> on top.
> 
> dma_map_resource() is the right API because it has all the necessary
> informations. It use the CPU physical address as the common address
> "language" with CPU physical address of PCIE bar to map to another
> device you can find the corresponding bus address from the IOMMU code
> (NOP on x86). So dma_map_resource() knows both the source device which
> export its PCIE bar and the destination devices.

Well, as it is today, it doesn't look very sane to me. The default is to
just return the physical address if the architecture doesn't support it.
So if someone tries this on an arch that hasn't added itself to return
an error they're very likely going to just end up DMAing to an invalid
address and losing the data or causing a machine check.

Furthermore, the API does not have all the information it needs to do
sane things. A phys_addr_t doesn't really tell you anything about the
memory behind it or what needs to be done with it. For example, on some
arm64 implementations if the physical address points to a PCI BAR and
that BAR is behind a switch with the DMA device then the address must be
converted to the PCI BUS address. On the other hand, if it's a physical
address of a device in an SOC it might need to  be handled in a
completely different way. And right now all the implementations I can
find seem to just assume that phys_addr_t points to regular memory and
can be treated as such.
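
To make the distinction concrete, the same BAR already has two different
addresses exposed by existing helpers; which one a peer's DMA engine must
be given is exactly the per-arch question here (illustrative only):

#include <linux/pci.h>

/* Sketch: print the CPU physical and the PCI bus address of a BAR. On
 * x86 they are typically identical; on other arches they can differ. */
static void show_bar_addresses(struct pci_dev *pdev, int bar)
{
    phys_addr_t cpu_addr = pci_resource_start(pdev, bar);
    pci_bus_addr_t bus_addr = pci_bus_address(pdev, bar);

    dev_info(&pdev->dev, "BAR%d: cpu %pa, bus %#llx\n",
             bar, &cpu_addr, (unsigned long long)bus_addr);
}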

This is one of the reasons that, based on feedback, our work went from
being general P2P with any device to being restricted to only P2P with
PCI devices. The dream that we can just grab the physical address of any
device and use it in a DMA request is simply not realistic.

Logan
Jerome Glisse March 30, 2018, 7:45 p.m. UTC | #20
On Fri, Mar 30, 2018 at 12:46:42PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 29/03/18 07:58 PM, Jerome Glisse wrote:
> > On Thu, Mar 29, 2018 at 10:25:52AM -0600, Logan Gunthorpe wrote:
> >>
> >>
> >> On 29/03/18 10:10 AM, Christian König wrote:
> >>> Why not? I mean the dma_map_resource() function is for P2P while other 
> >>> dma_map_* functions are only for system memory.
> >>
> >> Oh, hmm, I wasn't aware dma_map_resource was exclusively for mapping
> >> P2P. Though it's a bit odd seeing we've been working under the
> >> assumption that PCI P2P is different as it has to translate the PCI bus
> >> address. Where as P2P for devices on other buses is a big unknown.
> > 
> > dma_map_resource() is the right API (thought its current implementation
> > is fill with x86 assumptions). So i would argue that arch can decide to
> > implement it or simply return dma error address which trigger fallback
> > path into the caller (at least for GPU drivers). SG variant can be added
> > on top.
> > 
> > dma_map_resource() is the right API because it has all the necessary
> > informations. It use the CPU physical address as the common address
> > "language" with CPU physical address of PCIE bar to map to another
> > device you can find the corresponding bus address from the IOMMU code
> > (NOP on x86). So dma_map_resource() knows both the source device which
> > export its PCIE bar and the destination devices.
> 
> Well, as it is today, it doesn't look very sane to me. The default is to
> just return the physical address if the architecture doesn't support it.
> So if someone tries this on an arch that hasn't added itself to return
> an error they're very likely going to just end up DMAing to an invalid
> address and loosing the data or causing a machine check.

Looking at upstream code it seems that the x86 bits never made it upstream
and thus what is now upstream is only for ARM. See [1] for the x86 code. Dunno
what happened, I was convinced it got merged. So yes, the current code is
broken on x86. CCing Joerg, maybe he remembers what happened there.

[1] https://lwn.net/Articles/646605/

> 
> Furthermore, the API does not have all the information it needs to do
> sane things. A phys_addr_t doesn't really tell you anything about the
> memory behind it or what needs to be done with it. For example, on some
> arm64 implementations if the physical address points to a PCI BAR and
> that BAR is behind a switch with the DMA device then the address must be
> converted to the PCI BUS address. On the other hand, if it's a physical
> address of a device in an SOC it might need to  be handled in a
> completely different way. And right now all the implementations I can
> find seem to just assume that phys_addr_t points to regular memory and
> can be treated as such.

Given it is currently only used by ARM folks, it appears to at least work
for them (tm) :) Note that Christian is doing this in a PCIe-only context,
and again dma_map_resource() can easily figure out whether the address is
PCIe or something else. Note that the exporter exports the CPU bus
address. So again dma_map_resource() has all the information it will ever
need; if the peer to peer is fundamentally un-doable it can return a DMA
error and it is up to the caller to handle this, just like GPU code does.

Do you claim that dma_map_resource() is missing any information?


> 
> This is one of the reasons that, based on feedback, our work went from
> being general P2P with any device to being restricted to only P2P with
> PCI devices. The dream that we can just grab the physical address of any
> device and use it in a DMA request is simply not realistic.

I agree and understand that, but for platforms where such a feature makes
sense this will work. For me it is PowerPC and x86, and given PowerPC has
CAPI, which has far more advanced features when it comes to peer to peer, I
don't see something more basic not working. On x86, Intel is a bit of a lone
wolf, dunno if they are going to support this use case pro-actively. AMD
definitely will.

If you feel like dma_map_resource() can be interpreted too broadly, stricter
phrasing/wording can be added to it so people better understand its
limitations and gotchas.

Cheers,
Jérôme
Logan Gunthorpe April 2, 2018, 5:02 p.m. UTC | #21
On 30/03/18 01:45 PM, Jerome Glisse wrote:
> Looking at upstream code it seems that the x86 bits never made it upstream
> and thus what is now upstream is only for ARM. See [1] for x86 code. Dunno
> what happen, i was convince it got merge. So yes current code is broken on
> x86. ccing Joerg maybe he remembers what happened there.
> 
> [1] https://lwn.net/Articles/646605/

That looks like a significant improvement over what's upstream. But it's
three years old and looks to have been abandoned. I think I agree with
Bjorn's comments in that if it's going to be a general P2P API it will
need the device the resource belongs to in addition to the device that's
initiating the DMA request.

> Given it is currently only used by ARM folks it appear to at least work
> for them (tm) :) Note that Christian is doing this in PCIE only context
> and again dma_map_resource can easily figure that out if the address is
> a PCIE or something else. Note that the exporter export the CPU bus
> address. So again dma_map_resource has all the informations it will ever
> need, if the peer to peer is fundamentaly un-doable it can return dma
> error and it is up to the caller to handle this, just like GPU code do.
> 
> Do you claim that dma_map_resource is missing any information ?

Yes, that's what I said. All the existing ARM code appears to use it for
platform devices only. I suspect platform P2P is relatively trivial to
support on ARM. I think it's a big difference from using it for PCI P2P
or general P2P on any bus.

> I agree and understand that but for platform where such feature make sense
> this will work. For me it is PowerPC and x86 and given PowerPC has CAPI
> which has far more advance feature when it comes to peer to peer, i don't
> see something more basic not working. On x86, Intel is a bit of lone wolf,
> dunno if they gonna support this usecase pro-actively. AMD definitly will.

Well PowerPC doesn't even support P2P between root ports. And I fail to
see how CAPI applies unless/until we get this memory mapped into
userspace and the mappings need to be dynamically managed. Seems like
that's a long way away.

Logan
Jerome Glisse April 2, 2018, 5:20 p.m. UTC | #22
On Mon, Apr 02, 2018 at 11:02:10AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 30/03/18 01:45 PM, Jerome Glisse wrote:
> > Looking at upstream code it seems that the x86 bits never made it upstream
> > and thus what is now upstream is only for ARM. See [1] for x86 code. Dunno
> > what happen, i was convince it got merge. So yes current code is broken on
> > x86. ccing Joerg maybe he remembers what happened there.
> > 
> > [1] https://lwn.net/Articles/646605/
> 
> That looks like a significant improvement over what's upstream. But it's
> three years old and looks to have been abandoned. I think I agree with
> Bjorn's comments in that if it's going to be a general P2P API it will
> need the device the resource belongs to in addition to the device that's
> initiating the DMA request.

The point I have been trying to get across is that you do have this
information with dma_map_resource(): you know the device to which you
are trying to map (the dev argument to dma_map_resource()) and you can
easily get the device to which the memory belongs, because you have the
CPU physical address of the memory and hence you can look up the resource
and get the device from that.


> > Given it is currently only used by ARM folks it appear to at least work
> > for them (tm) :) Note that Christian is doing this in PCIE only context
> > and again dma_map_resource can easily figure that out if the address is
> > a PCIE or something else. Note that the exporter export the CPU bus
> > address. So again dma_map_resource has all the informations it will ever
> > need, if the peer to peer is fundamentaly un-doable it can return dma
> > error and it is up to the caller to handle this, just like GPU code do.
> > 
> > Do you claim that dma_map_resource is missing any information ?
> 
> Yes, that's what I said. All the existing ARM code appears to use it for
> platform devices only. I suspect platform P2P is relatively trivial to
> support on ARM. I think it's a big difference from using it for PCI P2P
> or general P2P on any bus.
> 

It does have all we need for PCIe when using the DMA API and not the SG one.
The SG issue IIRC is that it assumes struct page ... See above for device
lookup.

> > I agree and understand that but for platform where such feature make sense
> > this will work. For me it is PowerPC and x86 and given PowerPC has CAPI
> > which has far more advance feature when it comes to peer to peer, i don't
> > see something more basic not working. On x86, Intel is a bit of lone wolf,
> > dunno if they gonna support this usecase pro-actively. AMD definitly will.
> 
> Well PowerPC doesn't even support P2P between root ports. And I fail to
> see how CAPI applies unless/until we get this memory mapped into
> userspace and the mappings need to be dynamically managed. Seems like
> that's a long way away.

IIRC CAPI makes P2P mandatory, but maybe that is with NVLink. We can ask
the PowerPC folks to confirm. Note CAPI is Power8 and newer AFAICT.

Mapping to userspace has nothing to do with this. I am talking at the
hardware level. How things are exposed to userspace is a completely different
problem that does not have a one-solution-fits-all answer. For GPUs you want
this to be under the total control of the GPU drivers. For storage like
persistent memory, you might want to expose it to userspace more directly ...

Cheers,
Jérôme
Logan Gunthorpe April 2, 2018, 5:37 p.m. UTC | #23
On 02/04/18 11:20 AM, Jerome Glisse wrote:
> The point i have been trying to get accross is that you do have this
> information with dma_map_resource() you know the device to which you
> are trying to map (dev argument to dma_map_resource()) and you can
> easily get the device to which the memory belongs because you have the
> CPU physical address of the memory hence you can lookup the resource
> and get the device from that.

How do you go from a physical address to a struct device generally and
in a performant manner?

> IIRC CAPI make P2P mandatory but maybe this is with NVLink. We can ask
> the PowerPC folks to confirm. Note CAPI is Power8 and newer AFAICT.

PowerPC folks recently told us specifically that Power9 does not support
P2P between PCI root ports. I've said this many times. CAPI has nothing
to do with it.

> Mapping to userspace have nothing to do here. I am talking at hardware
> level. How thing are expose to userspace is a completely different
> problems that do not have one solution fit all. For GPU you want this
> to be under total control of GPU drivers. For storage like persistent
> memory, you might want to expose it userspace more directly ...

My understanding (and I worked on this a while ago) is that CAPI
hardware manages memory maps typically for userspace memory. When a
userspace program changes its mapping, the CAPI hardware is updated so
that hardware is coherent with the user address space and it is safe to
DMA to any address without having to pin memory. (This is very similar
to ODP in RNICs.) This is *really* nice but doesn't solve *any* of the
problems we've been discussing. Moreover, many developers want to keep
P2P in-kernel, for the time being, where the problem of pinning memory
does not exist.

Logan
Jerome Glisse April 2, 2018, 7:16 p.m. UTC | #24
On Mon, Apr 02, 2018 at 11:37:07AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 02/04/18 11:20 AM, Jerome Glisse wrote:
> > The point i have been trying to get accross is that you do have this
> > information with dma_map_resource() you know the device to which you
> > are trying to map (dev argument to dma_map_resource()) and you can
> > easily get the device to which the memory belongs because you have the
> > CPU physical address of the memory hence you can lookup the resource
> > and get the device from that.
> 
> How do you go from a physical address to a struct device generally and
> in a performant manner?

There isn't a good API at the moment AFAIK; the closest things would be
lookup_resource() or region_intersects(), but a more appropriate one can
easily be added, and the code to walk down the tree is readily available.
Moreover this can be optimized like vma lookups are, even more so as
resources are seldom added, so the read side (finding a resource) can be
heavily favored over the write side (adding/registering a new resource).

> 
> > IIRC CAPI make P2P mandatory but maybe this is with NVLink. We can ask
> > the PowerPC folks to confirm. Note CAPI is Power8 and newer AFAICT.
> 
> PowerPC folks recently told us specifically that Power9 does not support
> P2P between PCI root ports. I've said this many times. CAPI has nothing
> to do with it.

I need to check CAPI, I must have confused that with NVLink, which is also
on some PowerPC archs.

> 
> > Mapping to userspace have nothing to do here. I am talking at hardware
> > level. How thing are expose to userspace is a completely different
> > problems that do not have one solution fit all. For GPU you want this
> > to be under total control of GPU drivers. For storage like persistent
> > memory, you might want to expose it userspace more directly ...
> 
> My understanding (and I worked on this a while ago) is that CAPI
> hardware manages memory maps typically for userspace memory. When a
> userspace program changes it's mapping, the CAPI hardware is updated so
> that hardware is coherent with the user address space and it is safe to
> DMA to any address without having to pin memory. (This is very similar
> to ODP in RNICs.) This is *really* nice but doesn't solve *any* of the
> problems we've been discussing. Moreover, many developers want to keep
> P2P in-kernel, for the time being, where the problem of pinning memory
> does not exist.

What you describe is the ATS (Address Translation Service)/PASID (Process
Address Space IDentifier) part of CAPI, which has also been available
for years on the AMD x86 platform (AMD IOMMU-v2), though it is barely ever
used. The interesting aspect of CAPI is its cache coherency protocol between
devices and CPUs. This goes in both directions: the usual device access to
system memory can be cache coherent with CPU access and participate in the
cache coherency protocol (a bit further than PCIe snooping), but in the
other direction the CPU access to device memory can also be cache coherent,
which is not the case in PCIe.

This cache coherency between CPU and device is what made me assume that
CAPI must have peer-to-peer support, as peers must be able to talk to each
other for cache coherency purposes. But maybe all cache coherency
arbitration goes through a central directory, alleviating the peer-to-peer
requirement.

Anyway, like you said, this does not matter for the discussion.
dma_map_resource() can just be stubbed out on platforms that do not support
this and they would not allow it. If it gets used on other platforms and
shows enough advantages that users start asking for it, then maybe those
platforms will pay attention to the hardware requirement.

Note that with mmu_notifier there isn't any need to pin stuff (even
without any special hardware capabilities), as long as you can preempt
what is happening on your hardware to update its page table.

Cheers,
Jérôme
Logan Gunthorpe April 2, 2018, 7:32 p.m. UTC | #25
On 02/04/18 01:16 PM, Jerome Glisse wrote:
> There isn't good API at the moment AFAIK, closest thing would either be
> lookup_resource() or region_intersects(), but a more appropriate one can
> easily be added, code to walk down the tree is readily available. More-
> over this can be optimize like vma lookup are, even more as resource are
> seldomly added so read side (finding a resource) can be heavily favor
> over write side (adding|registering a new resource).

So someone needs to create a highly optimized tree that registers all
physical addresses on the system and maps them to devices? That seems a
long way from being realized. I'd hardly characterize that as "easily".
If we can pass both devices to the API I'd suspect it would be preferred
over the complicated tree. This, of course, depends on what users of the
API need.

> cache coherency protocol (bit further than PCIE snoop). But also the
> other direction the CPU access to device memory can also be cache coherent,
> which is not the case in PCIE.

I was not aware that CAPI allows PCI device memory to be cache coherent.
That sounds like it would be very tricky...

> Note that with mmu_notifier there isn't any need to pin stuff (even
> without any special hardware capabilities), as long as you can preempt
> what is happening on your hardware to update its page table.

I've been told there's a lot of dislike of the mmu_notifier interface.
And being able to preempt what's happening on hardware, generally, is
not trivial. But, yes, this is essentially how ODP works.

Logan
Jerome Glisse April 2, 2018, 7:45 p.m. UTC | #26
On Mon, Apr 02, 2018 at 01:32:37PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 02/04/18 01:16 PM, Jerome Glisse wrote:
> > There isn't good API at the moment AFAIK, closest thing would either be
> > lookup_resource() or region_intersects(), but a more appropriate one can
> > easily be added, code to walk down the tree is readily available. More-
> > over this can be optimize like vma lookup are, even more as resource are
> > seldomly added so read side (finding a resource) can be heavily favor
> > over write side (adding|registering a new resource).
> 
> So someone needs to create a highly optimized tree that registers all
> physical address on the system and maps them to devices? That seems a
> long way from being realized. I'd hardly characterize that as "easily".
> If we can pass both devices to the API I'd suspect it would be preferred
> over the complicated tree. This, of course, depends on what users of the
> API need.

This tree already exists, it is all there upstream, see kernel/resource.c.
What is missing is something that takes a single address and returns the
device struct. There is a function that takes a range, region_intersects(),
and one that takes the start address, lookup_resource(). It isn't hard to
see that, using roughly the same code as region_intersects(), a helper
that returns the device for a resource can be added.

And yes, currently this does not have a pointer back to the device that
owns the resource, but this can be added. It wasn't needed until now.

It can later be optimized if device lookup shows up as a bottleneck in a perf
profile.
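
A naive, unoptimized sketch of that kind of lookup, restricted to PCI
BARs (the resource-tree based helper discussed above does not exist yet,
so this just brute-forces it):

#include <linux/pci.h>

/* Sketch: find the PCI device whose memory BAR contains "addr". Takes a
 * reference on the returned device; the caller must pci_dev_put() it. */
static struct pci_dev *pci_dev_owning_phys_addr(phys_addr_t addr)
{
    struct pci_dev *pdev = NULL;
    int bar;

    while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev))) {
        for (bar = 0; bar < 6; bar++) {
            if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
                continue;
            if (addr >= pci_resource_start(pdev, bar) &&
                addr <= pci_resource_end(pdev, bar))
                return pdev;
        }
    }
    return NULL;
}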


> 
> > cache coherency protocol (bit further than PCIE snoop). But also the
> > other direction the CPU access to device memory can also be cache coherent,
> > which is not the case in PCIE.
> 
> I was not aware that CAPI allows PCI device memory to be cache coherent.
> That sounds like it would be very tricky...

And yet CAPI, CCIX, Gen-Z, NVLink, ... are all interconnects that aim at
achieving this cache coherency between multiple devices and CPUs.

Jérôme

Patch

diff --git a/drivers/pci/search.c b/drivers/pci/search.c
index bc1e023f1353..446648f4238b 100644
--- a/drivers/pci/search.c
+++ b/drivers/pci/search.c
@@ -393,3 +393,27 @@  int pci_dev_present(const struct pci_device_id *ids)
 	return 0;
 }
 EXPORT_SYMBOL(pci_dev_present);
+
+/**
+ * pci_find_common_upstream_dev - Returns the first common upstream device
+ * @dev: the PCI device from which the bus hierarchy walk will start
+ * @other: the PCI device whose bus number will be used for the search
+ *
+ * Walks up the bus hierarchy from the @dev device, looking for the first bus
+ * which the @other device is downstream of. Returns %NULL if the devices do
+ * not share any upstream devices.
+ */
+struct pci_dev *pci_find_common_upstream_dev(struct pci_dev *dev,
+					     struct pci_dev *other)
+{
+	struct pci_dev *pdev = dev;
+
+	while (pdev != NULL) {
+		if ((other->bus->number >= pdev->bus->busn_res.start) &&
+		    (other->bus->number <= pdev->bus->busn_res.end))
+			return pdev;
+		pdev = pci_upstream_bridge(pdev);
+	}
+
+	return NULL;
+}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 024a1beda008..0d29f5cdcb04 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -956,6 +956,8 @@  static inline struct pci_dev *pci_get_bus_and_slot(unsigned int bus,
 }
 struct pci_dev *pci_get_class(unsigned int class, struct pci_dev *from);
 int pci_dev_present(const struct pci_device_id *ids);
+struct pci_dev *pci_find_common_upstream_dev(struct pci_dev *from,
+					     struct pci_dev *to);
 
 int pci_bus_read_config_byte(struct pci_bus *bus, unsigned int devfn,
 			     int where, u8 *val);
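
An untested usage sketch of the new helper; the caller and its policy are
hypothetical and not part of this patch:

/* Sketch: only attempt a P2P transfer if the two devices share some
 * upstream PCI device (e.g. sit below a common switch or bridge). */
static bool p2p_try_path(struct pci_dev *src, struct pci_dev *dst)
{
    struct pci_dev *common = pci_find_common_upstream_dev(src, dst);

    if (!common)
        return false;   /* no shared upstream device found */

    dev_info(&src->dev, "common upstream device: %s\n", pci_name(common));
    return true;
}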