[4/6] PCI/P2PDMA: Remove the DMA_VIRT_OPS hacks

Message ID 20201105074205.1690638-5-hch@lst.de (mailing list archive)
State Superseded
Delegated to: Jason Gunthorpe
Series [1/6] RDMA/sw: don't allow drivers using dma_virt_ops on highmem configs

Commit Message

Christoph Hellwig Nov. 5, 2020, 7:42 a.m. UTC
Now that all users of dma_virt_ops are gone we can remove the workaround
for it in the PCI peer-to-peer code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/p2pdma.c | 20 --------------------
 1 file changed, 20 deletions(-)

Comments

Jason Gunthorpe Nov. 5, 2020, 2:34 p.m. UTC | #1
On Thu, Nov 05, 2020 at 08:42:03AM +0100, Christoph Hellwig wrote:
> Now that all users of dma_virt_ops are gone we can remove the workaround
> for it in the PCI peer-to-peer code.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  drivers/pci/p2pdma.c | 20 --------------------
>  1 file changed, 20 deletions(-)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index de1c331dbed43f..b07018af53876c 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -556,15 +556,6 @@ int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients,
>  		return -1;
>  
>  	for (i = 0; i < num_clients; i++) {
> -#ifdef CONFIG_DMA_VIRT_OPS
> -		if (clients[i]->dma_ops == &dma_virt_ops) {
> -			if (verbose)
> -				dev_warn(clients[i],
> -					 "cannot be used for peer-to-peer DMA because the driver makes use of dma_virt_ops\n");
> -			return -1;
> -		}
> -#endif
> -
>  		pci_client = find_parent_pci_dev(clients[i]);
>  		if (!pci_client) {
>  			if (verbose)
> @@ -837,17 +828,6 @@ static int __pci_p2pdma_map_sg(struct pci_p2pdma_pagemap *p2p_pgmap,
>  	phys_addr_t paddr;
>  	int i;
>  
> -	/*
> -	 * p2pdma mappings are not compatible with devices that use
> -	 * dma_virt_ops. If the upper layers do the right thing
> -	 * this should never happen because it will be prevented
> -	 * by the check in pci_p2pdma_distance_many()
> -	 */
> -#ifdef CONFIG_DMA_VIRT_OPS
> -	if (WARN_ON_ONCE(dev->dma_ops == &dma_virt_ops))
> -		return 0;
> -#endif

The check is removed here, but I didn't see a matching check added to
the IB side? Something like:

static int rdma_rw_map_sg(struct ib_device *dev, struct scatterlist *sg,
			  u32 sg_cnt, enum dma_data_direction dir)
{
	if (is_pci_p2pdma_page(sg_page(sg))) {
		if (ib_uses_virt_dma(dev))
			return 0;
		return pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
	}
	return ib_dma_map_sg(dev, sg, sg_cnt, dir);
}
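
(For context: ib_uses_virt_dma() is the helper introduced earlier in
this series; it is roughly

static inline bool ib_uses_virt_dma(struct ib_device *dev)
{
	return IS_ENABLED(CONFIG_INFINIBAND_VIRT_DMA) && !dev->dma_device;
}

i.e. a device counts as "virt DMA" when it has no real DMA device
behind it.)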

I think the change to rdma_rw_unmap_sg() should probably be dropped in
favour of the above?

Jason
Christoph Hellwig Nov. 5, 2020, 5:08 p.m. UTC | #2
On Thu, Nov 05, 2020 at 10:34:18AM -0400, Jason Gunthorpe wrote:
> The check is removed here, but I didn't see a matching check added to
> the IB side? Something like:
> 
> static int rdma_rw_map_sg(struct ib_device *dev, struct scatterlist *sg,
> 			  u32 sg_cnt, enum dma_data_direction dir)
> {
> 	if (is_pci_p2pdma_page(sg_page(sg))) {
> 		if (ib_uses_virt_dma(dev))
> 			return 0;
> 		return pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
> 	}
> 	return ib_dma_map_sg(dev, sg, sg_cnt, dir);
> }

We should never get a P2P page into rdma_rw_map_sg or the other ib_dma*
routines for the software drivers, as their struct devices don't connect
to a PCI device underneath, and thus no valid P2P distance can be
returned.  That being said, IFF we want to implement P2P for those
we'd need something like the above check, except that we still need
to call ib_dma_map_sg, i.e.:

static int rdma_rw_map_sg(struct ib_device *dev, struct scatterlist *sg,
 			  u32 sg_cnt, enum dma_data_direction dir)
{
	if (is_pci_p2pdma_page(sg_page(sg)) && !ib_uses_virt_dma(dev))
		return pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
	return ib_dma_map_sg(dev, sg, sg_cnt, dir);
}
Jason Gunthorpe Nov. 5, 2020, 5:23 p.m. UTC | #3
On Thu, Nov 05, 2020 at 06:08:16PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 05, 2020 at 10:34:18AM -0400, Jason Gunthorpe wrote:
> > The check is removed here, but I didn't see a matching check added to
> > the IB side? Something like:
> > 
> > static int rdma_rw_map_sg(struct ib_device *dev, struct scatterlist *sg,
> > 			  u32 sg_cnt, enum dma_data_direction dir)
> > {
> > 	if (is_pci_p2pdma_page(sg_page(sg))) {
> > 		if (ib_uses_virt_dma(dev))
> > 			return 0;
> > 		return pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
> > 	}
> > 	return ib_dma_map_sg(dev, sg, sg_cnt, dir);
> > }
> 
> We should never get a P2P page into rdma_rw_map_sg or the other ib_dma*
> routines for the software drivers, as their struct devices don't connect
> to a PCI device underneath, and thus no valid P2P distance can be
> returned.

But that depends on the calling driver doing this properly, and we
don't expose an API to get the PCI device of the struct ib_device
.. how does nvme even work here?

If we can't get here then why did you add the check to the unmap side?

Why did this code in p2pdma exist at all?

> That being said, IFF we want to implement P2P for those we'd need
> something like the above check, except that we still need to call
> ib_dma_map_sg, i.e.:
> 
> static int rdma_rw_map_sg(struct ib_device *dev, struct scatterlist *sg,
>  			  u32 sg_cnt, enum dma_data_direction dir)
> {
> 	if (is_pci_p2pdma_page(sg_page(sg)) && !ib_uses_virt_dma(dev))
> 		return pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
> 	return ib_dma_map_sg(dev, sg, sg_cnt, dir);
> }

The SW drivers can't handle PCI pages at all; they are going to try to
memcpy them or otherwise access them as normal memory, not __iomem, so
we really do need to prevent P2P pages from going into them.
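
(Illustrative sketch only, not actual rxe/siw code: the SW drivers end
up reading payload through the kernel mapping of each SG page, along
the lines of

static void sw_copy_from_sg(void *dest, struct scatterlist *sg)
{
	/* valid only for regular lowmem pages, never for P2P BAR pages */
	void *vaddr = page_address(sg_page(sg)) + sg->offset;

	memcpy(dest, vaddr, sg->length);
}

which is not a valid way to access P2P BAR memory that must be treated
as __iomem.)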

Jason
Christoph Hellwig Nov. 5, 2020, 5:29 p.m. UTC | #4
On Thu, Nov 05, 2020 at 01:23:57PM -0400, Jason Gunthorpe wrote:
> But that depends on the calling driver doing this properly, and we
> don't expose an API to get the PCI device of the struct ib_device
> .. how does nvme even work here?

The PCI p2pdma APIs walk the parent chains of a struct device until
they find a PCI device.  And the ib_device eventually ends up there.
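
(For reference, the walk in question is find_parent_pci_dev() in
drivers/pci/p2pdma.c, roughly:

static struct pci_dev *find_parent_pci_dev(struct device *dev)
{
	struct device *parent;

	dev = get_device(dev);
	while (dev) {
		if (dev_is_pci(dev))
			return to_pci_dev(dev);
		parent = get_device(dev->parent);
		put_device(dev);
		dev = parent;
	}
	return NULL;
}

It keeps following ->parent until dev_is_pci() matches, so anything
parented under a PCI device resolves to that PCI device.)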

> 
> If we can't get here then why did you add the check to the unmap side?

Because I added them to the map and unmap side, but forgot to commit
the map side.  Mostly to be prepared for the case where we could
end up there.  And thinking out loud, I actually need to double-check
whether that is true for rdmavt as well.  It certainly is for rxe and
siw, as I checked on a live system.

> The SW drivers can't handle PCI pages at all; they are going to try to
> memcpy them or otherwise access them as normal memory, not __iomem, so
> we really do need to prevent P2P pages from going into them.

Ok, let's prevent it for now.  And if someone wants to do it there
they have to do all the work.
Jason Gunthorpe Nov. 5, 2020, 5:39 p.m. UTC | #5
On Thu, Nov 05, 2020 at 06:29:21PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 05, 2020 at 01:23:57PM -0400, Jason Gunthorpe wrote:
> > But that depends on the calling driver doing this properly, and we
> > don't expose an API to get the PCI device of the struct ib_device
> > .. how does nvme even work here?
> 
> The PCI p2pdma APIs walk the parent chains of a struct device until
> they find a PCI device.  And the ib_device eventually ends up there.

Hmm. This works for real devices like mlx5, but it means the three SW
devices will also resolve to a real PCI device that is not the DMA
device.

If nvme wants to do something like this it should walk from the
ibdev->dma_device, which these patches make NULL for the SW devices.
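
(Hypothetically — ib_p2p_client() is an illustrative name here, not an
existing API — that would look like:

static struct device *ib_p2p_client(struct ib_device *ibdev)
{
	/* NULL for the SW devices once this series lands */
	return ibdev->dma_device;
}

with the result passed to pci_p2pdma_distance_many() instead of
&ibdev->dev.)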

eg rxe is like:

$ sudo rdma link add rxe0 type rxe netdev eth1

lrwxrwxrwx 1 root root 0 Nov  5 17:34 /sys/class/infiniband/rxe0/device -> ../../../0000:00:09.0/

I think this is a bug; these virtual devices should have NULL
parents...

> > If we can't get here then why did you add the check to the unmap side?
> 
> Because I added them to the map and unmap side, but forgot to commit
> the map side.  Mostly to be prepared for the case where we could
> end up there.  And thinking out loud, I actually need to double-check
> whether that is true for rdmavt as well.  It certainly is for rxe and
> siw, as I checked on a live system.

rdmavt parents itself to the HFI/QIB PCI device, so the walk above
should also find a real PCI device.

> > The SW drivers can't handle PCI pages at all; they are going to try to
> > memcpy them or otherwise access them as normal memory, not __iomem, so
> > we really do need to prevent P2P pages from going into them.
> 
> Ok, let's prevent it for now.  And if someone wants to do it there
> they have to do all the work.

Yes, that is the safest - just block the SW devices from ever touching
P2P pages.

Jason
Christoph Hellwig Nov. 5, 2020, 5:43 p.m. UTC | #6
On Thu, Nov 05, 2020 at 01:39:30PM -0400, Jason Gunthorpe wrote:
> Hmm. This works for real devices like mlx5, but it means the three SW
> devices will also resolve to a real PCI device that is not the DMA
> device.

Does it?  When I followed the links on my system rxe was a virtual
device without a physical parent.  Weird.
Jason Gunthorpe Nov. 5, 2020, 5:56 p.m. UTC | #7
On Thu, Nov 05, 2020 at 06:43:06PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 05, 2020 at 01:39:30PM -0400, Jason Gunthorpe wrote:
> > Hmm. This works for real devices like mlx5, but it means the three SW
> > devices will also resolve to a real PCI device that is not the DMA
> > device.
> 
> Does it?  When I followed the links on my system rxe was a virtual
> device without a physical parent.  Weird.

Yes, Bob might also have seen it oddly show up as virtual.. No idea why.

This seems like the right thing to do; it looks like virtual is only
possible if the netdev is virtual too..

diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 575e1a4ec82121..2b4238cdeab953 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -20,18 +20,6 @@
 
 static struct rxe_recv_sockets recv_sockets;
 
-struct device *rxe_dma_device(struct rxe_dev *rxe)
-{
-	struct net_device *ndev;
-
-	ndev = rxe->ndev;
-
-	if (is_vlan_dev(ndev))
-		ndev = vlan_dev_real_dev(ndev);
-
-	return ndev->dev.parent;
-}
-
 int rxe_mcast_add(struct rxe_dev *rxe, union ib_gid *mgid)
 {
 	int err;
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index 209c7b3fab97a2..0cc4116d9a1fa6 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -1134,7 +1134,6 @@ int rxe_register_device(struct rxe_dev *rxe, const char *ibdev_name)
 	dev->node_type = RDMA_NODE_IB_CA;
 	dev->phys_port_cnt = 1;
 	dev->num_comp_vectors = num_possible_cpus();
-	dev->dev.parent = rxe_dma_device(rxe);
 	dev->local_dma_lkey = 0;
 	addrconf_addr_eui48((unsigned char *)&dev->node_guid,
 			    rxe->ndev->dev_addr);

Patch

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index de1c331dbed43f..b07018af53876c 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -556,15 +556,6 @@ int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients,
 		return -1;
 
 	for (i = 0; i < num_clients; i++) {
-#ifdef CONFIG_DMA_VIRT_OPS
-		if (clients[i]->dma_ops == &dma_virt_ops) {
-			if (verbose)
-				dev_warn(clients[i],
-					 "cannot be used for peer-to-peer DMA because the driver makes use of dma_virt_ops\n");
-			return -1;
-		}
-#endif
-
 		pci_client = find_parent_pci_dev(clients[i]);
 		if (!pci_client) {
 			if (verbose)
@@ -837,17 +828,6 @@ static int __pci_p2pdma_map_sg(struct pci_p2pdma_pagemap *p2p_pgmap,
 	phys_addr_t paddr;
 	int i;
 
-	/*
-	 * p2pdma mappings are not compatible with devices that use
-	 * dma_virt_ops. If the upper layers do the right thing
-	 * this should never happen because it will be prevented
-	 * by the check in pci_p2pdma_distance_many()
-	 */
-#ifdef CONFIG_DMA_VIRT_OPS
-	if (WARN_ON_ONCE(dev->dma_ops == &dma_virt_ops))
-		return 0;
-#endif
-
 	for_each_sg(sg, s, nents, i) {
 		paddr = sg_phys(s);