
[v5,09/14] PCI: portdrv: Suppress kernel DMA ownership auto-claiming

Message ID: 20220104015644.2294354-10-baolu.lu@linux.intel.com (mailing list archive)
State: Superseded
Delegated to: Bjorn Helgaas
Series: Fix BUG_ON in vfio_iommu_group_notifier()

Commit Message

Baolu Lu Jan. 4, 2022, 1:56 a.m. UTC
If a switch lacks ACS P2P Request Redirect, a device below the switch can
bypass the IOMMU and DMA directly to other devices below the switch, so
all the downstream devices must be in the same IOMMU group as the switch
itself. The existing vfio framework allows the portdrv driver to be bound
to the bridge while its downstream devices are assigned to user space.
pci_dma_configure() marks the iommu_group as containing only devices
with kernel drivers that manage DMA. Avoid this default behavior for the
portdrv driver in order to stay compatible with the current vfio policy.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/pci/pcie/portdrv_pci.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Bjorn Helgaas Jan. 4, 2022, 5:06 p.m. UTC | #1
On Tue, Jan 04, 2022 at 09:56:39AM +0800, Lu Baolu wrote:
> If a switch lacks ACS P2P Request Redirect, a device below the switch can
> bypass the IOMMU and DMA directly to other devices below the switch, so
> all the downstream devices must be in the same IOMMU group as the switch
> itself.

Help me think through what's going on here.  IIUC, we put devices in
the same IOMMU group when they can interfere with each other in any
way (DMA, config access, etc).

(We said "DMA" above, but I guess this would also apply to config
requests, right?)

*This* patch doesn't check for any ACS features.  Can you connect the
dots for me?  I guess the presence or absence of P2P Request Redirect
determines the size of the IOMMU group.  And the following says
something about what is allowed in the group?  And .no_kernel_api_dma
allows an exception to the general rule?

> The existing vfio framework allows the portdrv driver to be bound
> to the bridge while its downstream devices are assigned to user space.

I.e., the existing VFIO framework allows a switch to be in the same
IOMMU group as the devices below it, even though the switch has a
kernel driver and the other devices may have userspace drivers?

Is this a function of VFIO design or of the IOMMU driver?

> pci_dma_configure() marks the iommu_group as containing only devices
> with kernel drivers that manage DMA. Avoid this default behavior for the
> portdrv driver in order to stay compatible with the current vfio policy.

I assume "IOMMU group" means the same as "iommu_group"; maybe we can
use one of them consistently?

> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Suggested-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
>  drivers/pci/pcie/portdrv_pci.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
> index 35eca6277a96..2116f821c005 100644
> --- a/drivers/pci/pcie/portdrv_pci.c
> +++ b/drivers/pci/pcie/portdrv_pci.c
> @@ -202,6 +202,8 @@ static struct pci_driver pcie_portdriver = {
>  
>  	.err_handler	= &pcie_portdrv_err_handler,
>  
> +	.no_kernel_api_dma = true,
> +
>  	.driver.pm	= PCIE_PORTDRV_PM_OPS,
>  };
>  
> -- 
> 2.25.1
>
Jason Gunthorpe Jan. 4, 2022, 7:26 p.m. UTC | #2
On Tue, Jan 04, 2022 at 11:06:31AM -0600, Bjorn Helgaas wrote:

> > The existing vfio framework allows the portdrv driver to be bound
> > to the bridge while its downstream devices are assigned to user space.
> 
> I.e., the existing VFIO framework allows a switch to be in the same
> IOMMU group as the devices below it, even though the switch has a
> kernel driver and the other devices may have userspace drivers?

Yes, this patch exists to maintain current VFIO behavior which has this
same check.

I believe the basis for VFIO doing this is that these devices
cannot do DMA, so they don't care about the DMA API or the group->domain,
and do not expose MMIO memory, so they don't care about the P2P attack.

A comment in the code to this effect would be good, IMHO.
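Something like this above the new flag, perhaps (wording is only a
suggestion):

	/*
	 * Port services don't do DMA through the DMA API and don't rely
	 * on the group's DMA domain, so don't claim kernel DMA ownership
	 * of the group. This preserves the long-standing VFIO policy of
	 * letting bridges share a group with user-space-owned devices.
	 */
	.no_kernel_api_dma = true,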

Jason
Bjorn Helgaas Jan. 4, 2022, 7:51 p.m. UTC | #3
On Tue, Jan 04, 2022 at 03:26:14PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 04, 2022 at 11:06:31AM -0600, Bjorn Helgaas wrote:
> 
> > > The existing vfio framework allows the portdrv driver to be bound
> > > to the bridge while its downstream devices are assigned to user space.
> > 
> > I.e., the existing VFIO framework allows a switch to be in the same
> > IOMMU group as the devices below it, even though the switch has a
> > kernel driver and the other devices may have userspace drivers?
> 
> Yes, this patch exists to maintain current VFIO behavior which has this
> same check.
> 
> I believe the basis for VFIO doing this is that these devices
> cannot do DMA, so they don't care about the DMA API or the group->domain,
> and do not expose MMIO memory, so they don't care about the P2P attack.

"These devices" means bridges, right?  Not sure why we wouldn't care
about the P2P attack.

PCIe switches use MSI or MSI-X for hotplug, PME, etc, so they do DMA
for that.  Is that not relevant here?

Is there something that *prohibits* a bridge from having
device-specific functionality including DMA?

I know some bridges have device-specific BARs for performance counters
and the like.
Jason Gunthorpe Jan. 5, 2022, 12:35 a.m. UTC | #4
On Tue, Jan 04, 2022 at 01:51:30PM -0600, Bjorn Helgaas wrote:
> On Tue, Jan 04, 2022 at 03:26:14PM -0400, Jason Gunthorpe wrote:
> > On Tue, Jan 04, 2022 at 11:06:31AM -0600, Bjorn Helgaas wrote:
> > 
> > > > The existing vfio framework allows the portdrv driver to be bound
> > > > to the bridge while its downstream devices are assigned to user space.
> > > 
> > > I.e., the existing VFIO framework allows a switch to be in the same
> > > IOMMU group as the devices below it, even though the switch has a
> > > kernel driver and the other devices may have userspace drivers?
> > 
> > Yes, this patch exists to maintain current VFIO behavior which has this
> > same check.
> > 
> > I believe the basis for VFIO doing this is that these devices
> > cannot do DMA, so they don't care about the DMA API or the group->domain,
> > and do not expose MMIO memory, so they don't care about the P2P attack.
> 
> "These devices" means bridges, right?  Not sure why we wouldn't care
> about the P2P attack.

Yes, bridges. I said "I believe" because VFIO was changed to ignore
bridges a long time ago and the security rationale was a bit unclear in
the commit.

See 5f096b14d421 ("vfio: Whitelist PCI bridges")

> PCIe switches use MSI or MSI-X for hotplug, PME, etc, so they do DMA
> for that.  Is that not relevant here?

Alex's comment in the above commit mentions interrupts, but I think
the answer today is that it does not matter.

For platforms like x86 that keep interrupts and DMA separate it
works fine.

For platforms that commingle the iommu_domain and interrupts (do some
exist?) we expect the platform to do whatever is necessary at
iommu_domain attach time to ensure interrupts continue to
work. (AFAICT at least)

In other words, there is no API requirement that a driver call
iommu_device_use_dma_api() before it can use request_irq().

So the main concern should be P2P attacks on bridge MMIO registers.

> Is there something that *prohibits* a bridge from having
> device-specific functionality including DMA?

I'm not sure, I think not, but the 'Bus Master Enable' language does
have a different definition for Type 1.

However, it doesn't matter: the question here is not what the device
HW is capable of, but what the kernel driver does. The portdrv does not
use the DMA API, so that alone is half the requirement to skip the
iommu_device_use_dma_api() call.

Some future device-specific bridge driver that is able to issue the
device-specific MMIOs and use the DMA API should be coded to use
iommu_device_use_dma_api(), probably by using a different
device_driver for the bridge.
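Purely hypothetical illustration (the driver name, ID table, and
callbacks below are made up): such a driver would simply leave the new
flag unset, so the default pci_dma_configure() path claims kernel DMA
ownership for it:

/*
 * Hypothetical driver for a bridge with device-specific MMIO and DMA.
 * Because it uses the DMA API, it must not set .no_kernel_api_dma;
 * pci_dma_configure() then calls iommu_device_use_dma_api() as usual.
 */
static struct pci_driver bridge_counters_driver = {
	.name		= "bridge-counters",
	.id_table	= bridge_counters_ids,
	.probe		= bridge_counters_probe,
	.remove		= bridge_counters_remove,
	/* .no_kernel_api_dma intentionally left false (the default) */
};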

> I know some bridges have device-specific BARs for performance counters
> and the like.

Yep.

IMHO it is probably not so great as-is.

However, this patch is just porting the status quo of commit 5f09 into
this new framework.

What this new framework does allow is for portdrv to police itself, e.g.
it could check whether there is an MMIO BAR and call
iommu_device_use_dma_api() out of caution, something like the sketch
below. It could also have an allow-list of devices where hostile writes
to the MMIO are known to be harmless.
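Untested sketch only (pcie_port_has_mmio_bar() is a made-up helper,
nothing like it exists in portdrv today):

static bool pcie_port_has_mmio_bar(struct pci_dev *pdev)
{
	int i;

	/* Any memory BAR makes the port a potential P2P target */
	for (i = 0; i < PCI_STD_NUM_BARS; i++)
		if (pci_resource_flags(pdev, i) & IORESOURCE_MEM)
			return true;

	return false;
}

/*
 * pcie_portdrv_probe() could then call iommu_device_use_dma_api() for
 * such ports instead of relying on .no_kernel_api_dma alone.
 */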

Without knowing more about what motivated 5f09 it is hard to say,
other than that it has been like this for 7 years and nobody has
complained yet :)

Jason
Baolu Lu Jan. 6, 2022, 4:12 a.m. UTC | #5
Hi Bjorn,

On 1/5/22 1:06 AM, Bjorn Helgaas wrote:
> On Tue, Jan 04, 2022 at 09:56:39AM +0800, Lu Baolu wrote:
>> If a switch lacks ACS P2P Request Redirect, a device below the switch can
>> bypass the IOMMU and DMA directly to other devices below the switch, so
>> all the downstream devices must be in the same IOMMU group as the switch
>> itself.
> Help me think through what's going on here.  IIUC, we put devices in
> the same IOMMU group when they can interfere with each other in any
> way (DMA, config access, etc).
> 
> (We said "DMA" above, but I guess this would also apply to config
> requests, right?)

I am not sure whether devices could interfere with each other through
config space access. The IOMMU hardware only protects and isolates DMA
accesses, which is what allows user space to control DMA directly. The
config accesses will always be intercepted by VFIO. Hence, I don't see
a problem.

> 
> *This* patch doesn't check for any ACS features.  Can you connect the
> dots for me?  I guess the presence or absence of P2P Request Redirect
> determines the size of the IOMMU group.  And the following says

It's done in iommu core (drivers/iommu/iommu.c):

/*
 * Use standard PCI bus topology, isolation features, and DMA alias quirks
 * to find or create an IOMMU group for a device.
 */
struct iommu_group *pci_device_group(struct device *dev)
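The ACS walk that sizes the group is just below that (simplified
excerpt; REQ_ACS_FLAGS includes PCI_ACS_RR, i.e. P2P Request Redirect,
so a path without it keeps pulling the upstream bridges into the same
group):

	/*
	 * Walk upstream until the path is protected by ACS; every
	 * bridge crossed on the way joins (or provides) the group.
	 */
	for (bus = pdev->bus; !pci_is_root_bus(bus); bus = bus->parent) {
		if (!bus->self)
			continue;

		if (pci_acs_path_enabled(bus->self, NULL, REQ_ACS_FLAGS))
			break;

		pdev = bus->self;

		group = iommu_group_get(&pdev->dev);
		if (group)
			return group;
	}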


> something about what is allowed in the group?  And .no_kernel_api_dma
> allows an exception to the general rule?
> 

Yes.
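And the exception itself is taken in pci_dma_configure(). Roughly
(simplified sketch based on the earlier patch in this series; assuming
iommu_device_use_dma_api() takes the struct device):

static int pci_dma_configure(struct device *dev)
{
	struct pci_driver *driver = to_pci_driver(dev->driver);
	int ret = 0;

	/* ... OF/ACPI DMA configuration elided ... */

	/*
	 * Only drivers that do DMA through the DMA API claim kernel
	 * DMA ownership of the group; portdrv opts out via the flag.
	 */
	if (!ret && !driver->no_kernel_api_dma)
		ret = iommu_device_use_dma_api(dev);

	return ret;
}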

>> pci_dma_configure() marks the iommu_group as containing only devices
>> with kernel drivers that manage DMA. Avoid this default behavior for the
>> portdrv driver in order to stay compatible with the current vfio policy.
> I assume "IOMMU group" means the same as "iommu_group"; maybe we can
> use one of them consistently?

Sure.

Best regards,
baolu
Bjorn Helgaas Jan. 6, 2022, 6:32 p.m. UTC | #6
On Thu, Jan 06, 2022 at 12:12:35PM +0800, Lu Baolu wrote:
> On 1/5/22 1:06 AM, Bjorn Helgaas wrote:
> > On Tue, Jan 04, 2022 at 09:56:39AM +0800, Lu Baolu wrote:
> > > If a switch lacks ACS P2P Request Redirect, a device below the switch can
> > > bypass the IOMMU and DMA directly to other devices below the switch, so
> > > all the downstream devices must be in the same IOMMU group as the switch
> > > itself.
> > Help me think through what's going on here.  IIUC, we put devices in
> > the same IOMMU group when they can interfere with each other in any
> > way (DMA, config access, etc).
> > 
> > (We said "DMA" above, but I guess this would also apply to config
> > requests, right?)
> 
> I am not sure whether devices could interfere with each other through
> config space access. The IOMMU hardware only protects and isolates DMA
> accesses, which is what allows user space to control DMA directly. The
> config accesses will always be intercepted by VFIO. Hence, I don't see
> a problem.

I was wondering about config accesses generated by an endpoint, e.g.,
an endpoint doing config writes to a peer or the upstream bridge.

But I think that is prohibited by spec - PCIe r5.0, sec 7.3.3, says
"Propagation of Configuration Requests from Downstream to Upstream as
well as peer-to-peer are not supported" and "Configuration Requests
are initiated only by the Host Bridge, including those passed through
the SFI CAM mechanism."

Bjorn
Baolu Lu Jan. 7, 2022, 1:53 a.m. UTC | #7
On 1/7/22 2:32 AM, Bjorn Helgaas wrote:
> On Thu, Jan 06, 2022 at 12:12:35PM +0800, Lu Baolu wrote:
>> On 1/5/22 1:06 AM, Bjorn Helgaas wrote:
>>> On Tue, Jan 04, 2022 at 09:56:39AM +0800, Lu Baolu wrote:
>>>> If a switch lacks ACS P2P Request Redirect, a device below the switch can
>>>> bypass the IOMMU and DMA directly to other devices below the switch, so
>>>> all the downstream devices must be in the same IOMMU group as the switch
>>>> itself.
>>> Help me think through what's going on here.  IIUC, we put devices in
>>> the same IOMMU group when they can interfere with each other in any
>>> way (DMA, config access, etc).
>>>
>>> (We said "DMA" above, but I guess this would also apply to config
>>> requests, right?)
>>
>> I am not sure whether devices could interfere with each other through
>> config space access. The IOMMU hardware only protects and isolates DMA
>> accesses, which is what allows user space to control DMA directly. The
>> config accesses will always be intercepted by VFIO. Hence, I don't see
>> a problem.
> 
> I was wondering about config accesses generated by an endpoint, e.g.,
> an endpoint doing config writes to a peer or the upstream bridge.
> 
> But I think that is prohibited by spec - PCIe r5.0, sec 7.3.3, says
> "Propagation of Configuration Requests from Downstream to Upstream as
> well as peer-to-peer are not supported" and "Configuration Requests
> are initiated only by the Host Bridge, including those passed through
> the SFI CAM mechanism."

That's clear. Thank you for the clarification.

> 
> Bjorn
> 

Best regards,
baolu

Patch

diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
index 35eca6277a96..2116f821c005 100644
--- a/drivers/pci/pcie/portdrv_pci.c
+++ b/drivers/pci/pcie/portdrv_pci.c
@@ -202,6 +202,8 @@ static struct pci_driver pcie_portdriver = {
 
 	.err_handler	= &pcie_portdrv_err_handler,
 
+	.no_kernel_api_dma = true,
+
 	.driver.pm	= PCIE_PORTDRV_PM_OPS,
 };