Message ID | 20240625153150.159310-1-vidyas@nvidia.com (mailing list archive) |
---|---|
State | Accepted |
Delegated to: | Bjorn Helgaas |
Headers | show |
Series | [V4] PCI: Extend ACS configurability | expand |
On Tue, Jun 25, 2024 at 09:01:50PM +0530, Vidya Sagar wrote: > Add a kernel command-line option 'config_acs' to directly control all the > ACS bits for specific devices, which allows the operator to setup the > right level of isolation to achieve the desired P2P configuration. An example wouldn't hurt, here and in kernel-parameters.txt. > ACS offers a range of security choices controlling how traffic is > allowed to go directly between two devices. Some popular choices: > - Full prevention > - Translated requests can be direct, with various options > - Asymmetric direct traffic, A can reach B but not the reverse > - All traffic can be direct > Along with some other less common ones for special topologies. I'm wondering whether it would make more sense to let users choose between those "higher-level" options, instead of giving direct access to bits (and thus risking users to choose an incorrect setting). Also, would it be possible to automatically change ACS settings when enabling or disabling P2PDMA? The representation chosen here (as a command line option) seems questionable: We're going to add more user-controllable options going forward. E.g., when introducing IDE, we'll have to let user space choose whether encryption should be enabled for certain PCIe devices. That's because encryption isn't for free, so can't be enabled opportunistically. (The number of crypto engines on a CPU is limited and enabling encryption consumes energy.) What about exposing such user configurable settings with sysctl? The networking subsystem has per-interface sysctl settings, we could have per-PCI-device settings. So just like this... net.ipv4.conf.default.arp_accept = 0 net.ipv4.conf.eth0.arp_accept = 0 net.ipv4.conf.eth1.arp_accept = 0 ... we could have... pci.0000:03:00.0.acs = full_prevention pci.0000:03:00.0.ide = 1 pci.0000:03:01.0.acs = all_traffic pci.0000:03:01.0.ide = 0 This isn't hard to do, just call register_sysctl() for each device on enumeration and unregister_sysctl_table() on pci_destroy_dev(). Thanks, Lukas
On Tue, Jun 25, 2024 at 06:26:00PM +0200, Lukas Wunner wrote: > On Tue, Jun 25, 2024 at 09:01:50PM +0530, Vidya Sagar wrote: > > Add a kernel command-line option 'config_acs' to directly control all the > > ACS bits for specific devices, which allows the operator to setup the > > right level of isolation to achieve the desired P2P configuration. > > An example wouldn't hurt, here and in kernel-parameters.txt. > > > > ACS offers a range of security choices controlling how traffic is > > allowed to go directly between two devices. Some popular choices: > > - Full prevention > > - Translated requests can be direct, with various options > > - Asymmetric direct traffic, A can reach B but not the reverse > > - All traffic can be direct > > Along with some other less common ones for special topologies. > > I'm wondering whether it would make more sense to let users choose > between those "higher-level" options, instead of giving direct access > to bits (and thus risking users to choose an incorrect setting). It doesn't seem like the kernel has enough information to do that, or at least describing enough information in the command line would be more complex than this. > Also, would it be possible to automatically change ACS settings > when enabling or disabling P2PDMA? No, as the commit said the ACS settings are required at early boot before iommu_groups are formed. They cannot be changed dynamically with today's kernel. > The representation chosen here (as a command line option) seems > questionable: > > We're going to add more user-controllable options going forward. > E.g., when introducing IDE, we'll have to let user space choose > whether encryption should be enabled for certain PCIe devices. > That's because encryption isn't for free, so can't be enabled > opportunistically. (The number of crypto engines on a CPU is > limited and enabling encryption consumes energy.) But that isn't part of ACS, so what is wrong with having ACS its own configurable and other PCI functions can do what is appropriate for them? I do encourage people to avoid using the kernel command line. ACS is forced into that because of the iommu_group issue. > What about exposing such user configurable settings with sysctl? I think sysctl is mostly deprecated in favour of sysfs. An ide file in the sysfs to control the IDE stuff makes alot of sense to me. Jason
On Tue, Jun 25, 2024 at 06:26:00PM +0200, Lukas Wunner wrote: > On Tue, Jun 25, 2024 at 09:01:50PM +0530, Vidya Sagar wrote: > > Add a kernel command-line option 'config_acs' to directly control all the > > ACS bits for specific devices, which allows the operator to setup the > > right level of isolation to achieve the desired P2P configuration. > > An example wouldn't hurt, here and in kernel-parameters.txt. > > > > ACS offers a range of security choices controlling how traffic is > > allowed to go directly between two devices. Some popular choices: > > - Full prevention > > - Translated requests can be direct, with various options > > - Asymmetric direct traffic, A can reach B but not the reverse > > - All traffic can be direct > > Along with some other less common ones for special topologies. > > I'm wondering whether it would make more sense to let users choose > between those "higher-level" options, instead of giving direct access > to bits (and thus risking users to choose an incorrect setting). IMHO, with "higher-level" options will be much more complex to use than simple ACS bits matrix. In any case, the user who will use this feature will need to read PCI spec before. In PCI v6, 13 bits are used for ACS with 8192 possible combinations and it is unlikely to find small set of "definitions" that will fit all cases. Thanks
> From: Vidya Sagar <vidyas@nvidia.com> > Sent: Tuesday, June 25, 2024 11:32 PM > > PCIe ACS settings control the level of isolation and the possible P2P > paths between devices. With greater isolation the kernel will create > smaller iommu_groups and with less isolation there is more HW that > can achieve P2P transfers. From a virtualization perspective all > devices in the same iommu_group must be assigned to the same VM as > they lack security isolation. > It'll be helpful to also call out the impact of losing other features (e.g. PASID) with less isolation: pci_enable_pasid() { ... if (!pci_acs_path_enabled(pdev, NULL, PCI_ACS_RR | PCI_ACS_UF)) return -EINVAL; ... }
On Wed, Jun 26, 2024 at 07:40:31AM +0000, Tian, Kevin wrote: > > From: Vidya Sagar <vidyas@nvidia.com> > > Sent: Tuesday, June 25, 2024 11:32 PM > > > > PCIe ACS settings control the level of isolation and the possible P2P > > paths between devices. With greater isolation the kernel will create > > smaller iommu_groups and with less isolation there is more HW that > > can achieve P2P transfers. From a virtualization perspective all > > devices in the same iommu_group must be assigned to the same VM as > > they lack security isolation. > > > > It'll be helpful to also call out the impact of losing other features (e.g. PASID) > with less isolation: > > pci_enable_pasid() > { > ... > if (!pci_acs_path_enabled(pdev, NULL, PCI_ACS_RR | PCI_ACS_UF)) > return -EINVAL; > ... > } Yeah, that is one of the considerations that might go into using asymmetric ACS.. Jason
On Tue, Jun 25, 2024 at 09:01:50PM +0530, Vidya Sagar wrote: > PCIe ACS settings control the level of isolation and the possible P2P > paths between devices. With greater isolation the kernel will create > smaller iommu_groups and with less isolation there is more HW that > can achieve P2P transfers. From a virtualization perspective all > devices in the same iommu_group must be assigned to the same VM as > they lack security isolation. > > There is no way for the kernel to automatically know the correct > ACS settings for any given system and workload. Existing command line > options (ex:- disable_acs_redir) allow only for large scale change, > disabling all isolation, but this is not sufficient for more complex cases. > > Add a kernel command-line option 'config_acs' to directly control all the > ACS bits for specific devices, which allows the operator to setup the > right level of isolation to achieve the desired P2P configuration. > The definition is future proof, when new ACS bits are added to the spec > the open syntax can be extended. > > ACS needs to be setup early in the kernel boot as the ACS settings > effect how iommu_groups are formed. iommu_group formation is a one > time event during initial device discovery, changing ACS bits after > kernel boot can result in an inaccurate view of the iommu_groups > compared to the current isolation configuration. > > ACS applies to PCIe Downstream Ports and multi-function devices. > The default ACS settings are strict and deny any direct traffic > between two functions. This results in the smallest iommu_group the > HW can support. Frequently these values result in slow or > non-working P2PDMA. > > ACS offers a range of security choices controlling how traffic is > allowed to go directly between two devices. Some popular choices: > - Full prevention > - Translated requests can be direct, with various options > - Asymmetric direct traffic, A can reach B but not the reverse > - All traffic can be direct > Along with some other less common ones for special topologies. > > The intention is that this option would be used with expert knowledge > of the HW capability and workload to achieve the desired > configuration. > > Signed-off-by: Vidya Sagar <vidyas@nvidia.com> > --- > v4: > * Changed commit message (Courtesy: Jason) to provide more details > > v3: > * Fixed a documentation issue reported by kernel test bot > > v2: > * Refactored the code as per Jason's suggestion > > .../admin-guide/kernel-parameters.txt | 22 +++ > drivers/pci/pci.c | 148 +++++++++++------- > 2 files changed, 112 insertions(+), 58 deletions(-) Bjorn? Jason
On Tue, Jun 25, 2024 at 09:01:50PM +0530, Vidya Sagar wrote: > PCIe ACS settings control the level of isolation and the possible P2P > paths between devices. With greater isolation the kernel will create > smaller iommu_groups and with less isolation there is more HW that > can achieve P2P transfers. From a virtualization perspective all > devices in the same iommu_group must be assigned to the same VM as > they lack security isolation. > > There is no way for the kernel to automatically know the correct > ACS settings for any given system and workload. Existing command line > options (ex:- disable_acs_redir) allow only for large scale change, > disabling all isolation, but this is not sufficient for more complex cases. > > Add a kernel command-line option 'config_acs' to directly control all the > ACS bits for specific devices, which allows the operator to setup the > right level of isolation to achieve the desired P2P configuration. > The definition is future proof, when new ACS bits are added to the spec > the open syntax can be extended. > > ACS needs to be setup early in the kernel boot as the ACS settings > effect how iommu_groups are formed. iommu_group formation is a one > time event during initial device discovery, changing ACS bits after > kernel boot can result in an inaccurate view of the iommu_groups > compared to the current isolation configuration. > > ACS applies to PCIe Downstream Ports and multi-function devices. > The default ACS settings are strict and deny any direct traffic > between two functions. This results in the smallest iommu_group the > HW can support. Frequently these values result in slow or > non-working P2PDMA. > > ACS offers a range of security choices controlling how traffic is > allowed to go directly between two devices. Some popular choices: > - Full prevention > - Translated requests can be direct, with various options > - Asymmetric direct traffic, A can reach B but not the reverse > - All traffic can be direct > Along with some other less common ones for special topologies. > > The intention is that this option would be used with expert knowledge > of the HW capability and workload to achieve the desired > configuration. > > Signed-off-by: Vidya Sagar <vidyas@nvidia.com> Applied with the tweaks below to pci/acs for v6.11, thanks! I added an example to the doc; please check it to see if I interpreted the doc correctly. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 42d0f6fd40d0..b2057241ea6c 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4621,24 +4621,34 @@ may put more devices in an IOMMU group. config_acs= Format: - =<ACS flags>@<pci_dev>[; ...] + <ACS flags>@<pci_dev>[; ...] Specify one or more PCI devices (in the format specified above) optionally prepended with flags and separated by semicolons. The respective - capabilities will be enabled, disabled or unchanged - based on what is specified in flags. - ACS Flags is defined as follows - bit-0 : ACS Source Validation - bit-1 : ACS Translation Blocking - bit-2 : ACS P2P Request Redirect - bit-3 : ACS P2P Completion Redirect - bit-4 : ACS Upstream Forwarding - bit-5 : ACS P2P Egress Control - bit-6 : ACS Direct Translated P2P - Each bit can be marked as - ‘0‘ – force disabled - ‘1’ – force enabled - ‘x’ – unchanged. + capabilities will be enabled, disabled or + unchanged based on what is specified in + flags. + + ACS Flags is defined as follows: + bit-0 : ACS Source Validation + bit-1 : ACS Translation Blocking + bit-2 : ACS P2P Request Redirect + bit-3 : ACS P2P Completion Redirect + bit-4 : ACS Upstream Forwarding + bit-5 : ACS P2P Egress Control + bit-6 : ACS Direct Translated P2P + Each bit can be marked as: + '0' – force disabled + '1' – force enabled + 'x' – unchanged + For example, + pci=config_acs=10x + would configure all devices that support + ACS to enable P2P Request Redirect, disable + Translation Blocking, and leave Source + Validation unchanged from whatever power-up + or firmware set it to. + Note: this may remove isolation between devices and may put more devices in an IOMMU group. force_floating [S390] Force usage of floating interrupts. diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 1afe650ce338..45d93101a08b 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1006,7 +1006,7 @@ static void __pci_config_acs(struct pci_dev *dev, struct pci_acs *caps, ret = pci_dev_str_match(dev, p, &p); if (ret < 0) { - pr_info_once("PCI: Can't parse acs command line parameter\n"); + pr_info_once("PCI: Can't parse ACS command line parameter\n"); break; } else if (ret == 1) { /* Found a match */ @@ -1026,14 +1026,14 @@ static void __pci_config_acs(struct pci_dev *dev, struct pci_acs *caps, if (!pci_dev_specific_disable_acs_redir(dev)) return; - pci_dbg(dev, "ACS mask = 0x%X\n", mask); - pci_dbg(dev, "ACS flags = 0x%X\n", flags); + pci_dbg(dev, "ACS mask = %#06x\n", mask); + pci_dbg(dev, "ACS flags = %#06x\n", flags); /* If mask is 0 then we copy the bit from the firmware setting. */ caps->ctrl = (caps->ctrl & ~mask) | (caps->fw_ctrl & mask); caps->ctrl |= flags; - pci_info(dev, "Configured ACS to 0x%x\n", caps->ctrl); + pci_info(dev, "Configured ACS to %#06x\n", caps->ctrl); } /**
On 25. 06. 24, 17:31, Vidya Sagar wrote: > PCIe ACS settings control the level of isolation and the possible P2P > paths between devices. With greater isolation the kernel will create > smaller iommu_groups and with less isolation there is more HW that > can achieve P2P transfers. From a virtualization perspective all > devices in the same iommu_group must be assigned to the same VM as > they lack security isolation. > > There is no way for the kernel to automatically know the correct > ACS settings for any given system and workload. Existing command line > options (ex:- disable_acs_redir) allow only for large scale change, > disabling all isolation, but this is not sufficient for more complex cases. > > Add a kernel command-line option 'config_acs' to directly control all the > ACS bits for specific devices, which allows the operator to setup the > right level of isolation to achieve the desired P2P configuration. > The definition is future proof, when new ACS bits are added to the spec > the open syntax can be extended. > > ACS needs to be setup early in the kernel boot as the ACS settings > effect how iommu_groups are formed. iommu_group formation is a one > time event during initial device discovery, changing ACS bits after > kernel boot can result in an inaccurate view of the iommu_groups > compared to the current isolation configuration. > > ACS applies to PCIe Downstream Ports and multi-function devices. > The default ACS settings are strict and deny any direct traffic > between two functions. This results in the smallest iommu_group the > HW can support. Frequently these values result in slow or > non-working P2PDMA. > > ACS offers a range of security choices controlling how traffic is > allowed to go directly between two devices. Some popular choices: > - Full prevention > - Translated requests can be direct, with various options > - Asymmetric direct traffic, A can reach B but not the reverse > - All traffic can be direct > Along with some other less common ones for special topologies. > > The intention is that this option would be used with expert knowledge > of the HW capability and workload to achieve the desired > configuration. Hi, this breaks ACS on some platforms (in 6.11). See: https://bugzilla.suse.com/show_bug.cgi?id=1229019 When starting a KVM: > "Error starting domain: internal error: QEMU unexpectedly closed the monitor (vm=‘win10’): qxl_send_events: spice-server bug: guest stopped, ignoring > 2024-08-08T01:45:51.824474Z qemu-system-x86_64: -device {“driver”:“vfio-pci”,“host”:“0000:0b:00.0”,“id”:“hostdev0”,“bus”:“pci.2”,“addr”:“0x0”}: vfio 0000:0b:00.0: group 83 is not viable > Please ensure all devices within the iommu_group are bound to their vfio bus driver." We backported the commit to 6.4, there they see: > The good kernel 6.4.0-150600.23.14 log has this: > > [ 0.618918] pci 0000:00:1c.0: Intel PCH root port ACS workaround enabled > [ 0.619153] pci 0000:00:1c.4: Intel PCH root port ACS workaround enabled > > the bad one 6.4.0-150600.23.22 does not. But the same difference is with 6.10 vs 6.11. Any ideas? It looks like workarounds are not applied anymore. > Signed-off-by: Vidya Sagar <vidyas@nvidia.com> > --- > v4: > * Changed commit message (Courtesy: Jason) to provide more details > > v3: > * Fixed a documentation issue reported by kernel test bot > > v2: > * Refactored the code as per Jason's suggestion > > .../admin-guide/kernel-parameters.txt | 22 +++ > drivers/pci/pci.c | 148 +++++++++++------- > 2 files changed, 112 insertions(+), 58 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index 500cfa776225..42d0f6fd40d0 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -4619,6 +4619,28 @@ > bridges without forcing it upstream. Note: > this removes isolation between devices and > may put more devices in an IOMMU group. > + config_acs= > + Format: > + =<ACS flags>@<pci_dev>[; ...] > + Specify one or more PCI devices (in the format > + specified above) optionally prepended with flags > + and separated by semicolons. The respective > + capabilities will be enabled, disabled or unchanged > + based on what is specified in flags. > + ACS Flags is defined as follows > + bit-0 : ACS Source Validation > + bit-1 : ACS Translation Blocking > + bit-2 : ACS P2P Request Redirect > + bit-3 : ACS P2P Completion Redirect > + bit-4 : ACS Upstream Forwarding > + bit-5 : ACS P2P Egress Control > + bit-6 : ACS Direct Translated P2P > + Each bit can be marked as > + ‘0‘ – force disabled > + ‘1’ – force enabled > + ‘x’ – unchanged. > + Note: this may remove isolation between devices > + and may put more devices in an IOMMU group. > force_floating [S390] Force usage of floating interrupts. > nomio [S390] Do not use MIO instructions. > norid [S390] ignore the RID field and force use of > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index 55078c70a05b..6661932afe59 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -946,30 +946,67 @@ void pci_request_acs(void) > } > > static const char *disable_acs_redir_param; > +static const char *config_acs_param; > > -/** > - * pci_disable_acs_redir - disable ACS redirect capabilities > - * @dev: the PCI device > - * > - * For only devices specified in the disable_acs_redir parameter. > - */ > -static void pci_disable_acs_redir(struct pci_dev *dev) > +struct pci_acs { > + u16 cap; > + u16 ctrl; > + u16 fw_ctrl; > +}; > + > +static void __pci_config_acs(struct pci_dev *dev, struct pci_acs *caps, > + const char *p, u16 mask, u16 flags) > { > + char *delimit; > int ret = 0; > - const char *p; > - int pos; > - u16 ctrl; > > - if (!disable_acs_redir_param) > + if (!p) > return; > > - p = disable_acs_redir_param; > while (*p) { > + if (!mask) { > + /* Check for ACS flags */ > + delimit = strstr(p, "@"); > + if (delimit) { > + int end; > + u32 shift = 0; > + > + end = delimit - p - 1; > + > + while (end > -1) { > + if (*(p + end) == '0') { > + mask |= 1 << shift; > + shift++; > + end--; > + } else if (*(p + end) == '1') { > + mask |= 1 << shift; > + flags |= 1 << shift; > + shift++; > + end--; > + } else if ((*(p + end) == 'x') || (*(p + end) == 'X')) { > + shift++; > + end--; > + } else { > + pci_err(dev, "Invalid ACS flags... Ignoring\n"); > + return; > + } > + } > + p = delimit + 1; > + } else { > + pci_err(dev, "ACS Flags missing\n"); > + return; > + } > + } > + > + if (mask & ~(PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR | > + PCI_ACS_UF | PCI_ACS_EC | PCI_ACS_DT)) { > + pci_err(dev, "Invalid ACS flags specified\n"); > + return; > + } > + > ret = pci_dev_str_match(dev, p, &p); > if (ret < 0) { > - pr_info_once("PCI: Can't parse disable_acs_redir parameter: %s\n", > - disable_acs_redir_param); > - > + pr_info_once("PCI: Can't parse acs command line parameter\n"); > break; > } else if (ret == 1) { > /* Found a match */ > @@ -989,56 +1026,38 @@ static void pci_disable_acs_redir(struct pci_dev *dev) > if (!pci_dev_specific_disable_acs_redir(dev)) > return; > > - pos = dev->acs_cap; > - if (!pos) { > - pci_warn(dev, "cannot disable ACS redirect for this hardware as it does not have ACS capabilities\n"); > - return; > - } > - > - pci_read_config_word(dev, pos + PCI_ACS_CTRL, &ctrl); > + pci_dbg(dev, "ACS mask = 0x%X\n", mask); > + pci_dbg(dev, "ACS flags = 0x%X\n", flags); > > - /* P2P Request & Completion Redirect */ > - ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_EC); > + /* If mask is 0 then we copy the bit from the firmware setting. */ > + caps->ctrl = (caps->ctrl & ~mask) | (caps->fw_ctrl & mask); > + caps->ctrl |= flags; > > - pci_write_config_word(dev, pos + PCI_ACS_CTRL, ctrl); > - > - pci_info(dev, "disabled ACS redirect\n"); > + pci_info(dev, "Configured ACS to 0x%x\n", caps->ctrl); > } > > /** > * pci_std_enable_acs - enable ACS on devices using standard ACS capabilities > * @dev: the PCI device > + * @caps: default ACS controls > */ > -static void pci_std_enable_acs(struct pci_dev *dev) > +static void pci_std_enable_acs(struct pci_dev *dev, struct pci_acs *caps) > { > - int pos; > - u16 cap; > - u16 ctrl; > - > - pos = dev->acs_cap; > - if (!pos) > - return; > - > - pci_read_config_word(dev, pos + PCI_ACS_CAP, &cap); > - pci_read_config_word(dev, pos + PCI_ACS_CTRL, &ctrl); > - > /* Source Validation */ > - ctrl |= (cap & PCI_ACS_SV); > + caps->ctrl |= (caps->cap & PCI_ACS_SV); > > /* P2P Request Redirect */ > - ctrl |= (cap & PCI_ACS_RR); > + caps->ctrl |= (caps->cap & PCI_ACS_RR); > > /* P2P Completion Redirect */ > - ctrl |= (cap & PCI_ACS_CR); > + caps->ctrl |= (caps->cap & PCI_ACS_CR); > > /* Upstream Forwarding */ > - ctrl |= (cap & PCI_ACS_UF); > + caps->ctrl |= (caps->cap & PCI_ACS_UF); > > /* Enable Translation Blocking for external devices and noats */ > if (pci_ats_disabled() || dev->external_facing || dev->untrusted) > - ctrl |= (cap & PCI_ACS_TB); > - > - pci_write_config_word(dev, pos + PCI_ACS_CTRL, ctrl); > + caps->ctrl |= (caps->cap & PCI_ACS_TB); > } > > /** > @@ -1047,23 +1066,33 @@ static void pci_std_enable_acs(struct pci_dev *dev) > */ > static void pci_enable_acs(struct pci_dev *dev) > { > - if (!pci_acs_enable) > - goto disable_acs_redir; > + struct pci_acs caps; > + int pos; > + > + pos = dev->acs_cap; > + if (!pos) > + return; > > - if (!pci_dev_specific_enable_acs(dev)) > - goto disable_acs_redir; > + pci_read_config_word(dev, pos + PCI_ACS_CAP, &caps.cap); > + pci_read_config_word(dev, pos + PCI_ACS_CTRL, &caps.ctrl); > + caps.fw_ctrl = caps.ctrl; > > - pci_std_enable_acs(dev); > + /* If an iommu is present we start with kernel default caps */ > + if (pci_acs_enable) { > + if (pci_dev_specific_enable_acs(dev)) > + pci_std_enable_acs(dev, &caps); > + } > > -disable_acs_redir: > /* > - * Note: pci_disable_acs_redir() must be called even if ACS was not > - * enabled by the kernel because it may have been enabled by > - * platform firmware. So if we are told to disable it, we should > - * always disable it after setting the kernel's default > - * preferences. > + * Always apply caps from the command line, even if there is no iommu. > + * Trust that the admin has a reason to change the ACS settings. > */ > - pci_disable_acs_redir(dev); > + __pci_config_acs(dev, &caps, disable_acs_redir_param, > + PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_EC, > + ~(PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_EC)); > + __pci_config_acs(dev, &caps, config_acs_param, 0, 0); > + > + pci_write_config_word(dev, pos + PCI_ACS_CTRL, caps.ctrl); > } > > /** > @@ -6740,6 +6769,8 @@ static int __init pci_setup(char *str) > pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS); > } else if (!strncmp(str, "disable_acs_redir=", 18)) { > disable_acs_redir_param = str + 18; > + } else if (!strncmp(str, "config_acs=", 11)) { > + config_acs_param = str + 11; > } else { > pr_err("PCI: Unknown option `%s'\n", str); > } > @@ -6764,6 +6795,7 @@ static int __init pci_realloc_setup_params(void) > resource_alignment_param = kstrdup(resource_alignment_param, > GFP_KERNEL); > disable_acs_redir_param = kstrdup(disable_acs_redir_param, GFP_KERNEL); > + config_acs_param = kstrdup(config_acs_param, GFP_KERNEL); > > return 0; > }
On 25. 09. 24, 7:06, Jiri Slaby wrote: >> @@ -1047,23 +1066,33 @@ static void pci_std_enable_acs(struct pci_dev >> *dev) >> */ >> static void pci_enable_acs(struct pci_dev *dev) >> { >> - if (!pci_acs_enable) >> - goto disable_acs_redir; >> + struct pci_acs caps; >> + int pos; >> + >> + pos = dev->acs_cap; >> + if (!pos) >> + return; >> - if (!pci_dev_specific_enable_acs(dev)) >> - goto disable_acs_redir; >> + pci_read_config_word(dev, pos + PCI_ACS_CAP, &caps.cap); >> + pci_read_config_word(dev, pos + PCI_ACS_CTRL, &caps.ctrl); >> + caps.fw_ctrl = caps.ctrl; >> - pci_std_enable_acs(dev); >> + /* If an iommu is present we start with kernel default caps */ >> + if (pci_acs_enable) { AFAIU pci_acs_enable is set from iommus' code via pci_request_acs(). Which is much later than when bridges are initialized here, right? >> + if (pci_dev_specific_enable_acs(dev)) >> + pci_std_enable_acs(dev, &caps); So this is never called, IMO. >> + } thanks,
On 25. 09. 24, 7:29, Jiri Slaby wrote: > On 25. 09. 24, 7:06, Jiri Slaby wrote: >>> @@ -1047,23 +1066,33 @@ static void pci_std_enable_acs(struct pci_dev >>> *dev) >>> */ >>> static void pci_enable_acs(struct pci_dev *dev) >>> { >>> - if (!pci_acs_enable) >>> - goto disable_acs_redir; >>> + struct pci_acs caps; >>> + int pos; >>> + >>> + pos = dev->acs_cap; >>> + if (!pos) >>> + return; Ignore the previous post. The bridge has no ACS (see lspci below). So it used to be enabled by pci_quirk_enable_intel_pch_acs() by another registers. But the "if (!pos)" does not let it run now. I am not sure how to fix this as we cannot have "caps" from these quirks, so that whole idea of __pci_config_acs() is nonworking for these quirks. 00:1c.0 PCI bridge: Intel Corporation C610/X99 series chipset PCI Express Root Port #1 (rev d5) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 36 NUMA node: 0 Bus: primary=00, secondary=08, subordinate=08, sec-latency=0 I/O behind bridge: 00001000-00001fff [size=4K] Memory behind bridge: f3600000-f37fffff [size=2M] Prefetchable memory behind bridge: 00000000f3800000-00000000f39fffff [size=2M] Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR- BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B- PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn- Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00 DevCap: MaxPayload 128 bytes, PhantFunc 0 ExtTag- RBE+ DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 128 bytes, MaxReadReq 128 bytes DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend- LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp- LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x0, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+ Slot #0, PowerLimit 0.000W; Interlock- NoCompl+ SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg- Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock- SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet- Interlock- Changed: MRL- PresDet- LinkState- RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible- RootCap: CRSVisible- RootSta: PME ReqID 0000, PMEStatus- PMEPending- DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR+, OBFF Via WAKE# ARIFwd- AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd- AtomicOpsCtl: ReqEn- EgressBlck- LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1- EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest- Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit- Address: fee00358 Data: 0000 Capabilities: [90] Subsystem: Dell Device 0618 Capabilities: [a0] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Kernel driver in use: pcieport >>> - if (!pci_dev_specific_enable_acs(dev)) >>> - goto disable_acs_redir; >>> + pci_read_config_word(dev, pos + PCI_ACS_CAP, &caps.cap); >>> + pci_read_config_word(dev, pos + PCI_ACS_CTRL, &caps.ctrl); >>> + caps.fw_ctrl = caps.ctrl; >>> - pci_std_enable_acs(dev); >>> + /* If an iommu is present we start with kernel default caps */ >>> + if (pci_acs_enable) { > > AFAIU pci_acs_enable is set from iommus' code via pci_request_acs(). > Which is much later than when bridges are initialized here, right?
On Wed, Sep 25, 2024 at 07:49:59AM +0200, Jiri Slaby wrote: > On 25. 09. 24, 7:29, Jiri Slaby wrote: > > On 25. 09. 24, 7:06, Jiri Slaby wrote: > > > > @@ -1047,23 +1066,33 @@ static void pci_std_enable_acs(struct > > > > pci_dev *dev) > > > > */ > > > > static void pci_enable_acs(struct pci_dev *dev) > > > > { > > > > - if (!pci_acs_enable) > > > > - goto disable_acs_redir; > > > > + struct pci_acs caps; > > > > + int pos; > > > > + > > > > + pos = dev->acs_cap; > > > > + if (!pos) > > > > + return; > > Ignore the previous post. > > The bridge has no ACS (see lspci below). So it used to be enabled by > pci_quirk_enable_intel_pch_acs() by another registers. Er, Ok, so it overrides the whole thing with pci_dev_specific_acs_enabled() too.. > I am not sure how to fix this as we cannot have "caps" from these quirks, so > that whole idea of __pci_config_acs() is nonworking for these quirks. We just need to allow the quirk to run before we try to do anything with the cap, which has probably always been a NOP anyhow. Maybe like this? diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 7d85c04fbba2ae..225a6cd2e9ca3b 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1067,8 +1067,15 @@ static void pci_std_enable_acs(struct pci_dev *dev, struct pci_acs *caps) static void pci_enable_acs(struct pci_dev *dev) { struct pci_acs caps; + bool enable_acs = false; int pos; + /* If an iommu is present we start with kernel default caps */ + if (pci_acs_enable) { + if (pci_dev_specific_enable_acs(dev)) + enable_acs = true; + } + pos = dev->acs_cap; if (!pos) return; @@ -1077,11 +1084,8 @@ static void pci_enable_acs(struct pci_dev *dev) pci_read_config_word(dev, pos + PCI_ACS_CTRL, &caps.ctrl); caps.fw_ctrl = caps.ctrl; - /* If an iommu is present we start with kernel default caps */ - if (pci_acs_enable) { - if (pci_dev_specific_enable_acs(dev)) - pci_std_enable_acs(dev, &caps); - } + if (enable_acs) + pci_std_enable_acs(dev, &caps); /* * Always apply caps from the command line, even if there is no iommu.
On Tue, 2024-10-01 at 16:33 -0300, Jason Gunthorpe wrote: > On Wed, Sep 25, 2024 at 07:49:59AM +0200, Jiri Slaby wrote: > > On 25. 09. 24, 7:29, Jiri Slaby wrote: > > > On 25. 09. 24, 7:06, Jiri Slaby wrote: > > > > > @@ -1047,23 +1066,33 @@ static void pci_std_enable_acs(struct > > > > > pci_dev *dev) > > > > > */ > > > > > static void pci_enable_acs(struct pci_dev *dev) > > > > > { > > > > > - if (!pci_acs_enable) > > > > > - goto disable_acs_redir; > > > > > + struct pci_acs caps; > > > > > + int pos; > > > > > + > > > > > + pos = dev->acs_cap; > > > > > + if (!pos) > > > > > + return; > > > > Ignore the previous post. > > > > The bridge has no ACS (see lspci below). So it used to be enabled > > by > > pci_quirk_enable_intel_pch_acs() by another registers. > > Er, Ok, so it overrides the whole thing with > pci_dev_specific_acs_enabled() too.. > > > I am not sure how to fix this as we cannot have "caps" from these > > quirks, so > > that whole idea of __pci_config_acs() is nonworking for these > > quirks. > > We just need to allow the quirk to run before we try to do anything > with the cap, which has probably always been a NOP anyhow. > > Maybe like this? > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index 7d85c04fbba2ae..225a6cd2e9ca3b 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -1067,8 +1067,15 @@ static void pci_std_enable_acs(struct pci_dev > *dev, struct pci_acs *caps) > static void pci_enable_acs(struct pci_dev *dev) > { > struct pci_acs caps; > + bool enable_acs = false; > int pos; > > + /* If an iommu is present we start with kernel default caps > */ > + if (pci_acs_enable) { > + if (pci_dev_specific_enable_acs(dev)) > + enable_acs = true; > + } > + > pos = dev->acs_cap; > if (!pos) > return; > @@ -1077,11 +1084,8 @@ static void pci_enable_acs(struct pci_dev > *dev) > pci_read_config_word(dev, pos + PCI_ACS_CTRL, &caps.ctrl); > caps.fw_ctrl = caps.ctrl; > > - /* If an iommu is present we start with kernel default caps > */ > - if (pci_acs_enable) { > - if (pci_dev_specific_enable_acs(dev)) > - pci_std_enable_acs(dev, &caps); > - } > + if (enable_acs) > + pci_std_enable_acs(dev, &caps); > > /* > * Always apply caps from the command line, even if there is > no iommu. Hi, I just ran into this issue (fewer iommu groups starting with 6.11). Both reverting the original patch or applying your suggestion worked for me. Thanks Steffen
On Wed, Sep 25, 2024 at 07:06:41AM +0200, Jiri Slaby wrote: > On 25. 06. 24, 17:31, Vidya Sagar wrote: > > PCIe ACS settings control the level of isolation and the possible P2P > > paths between devices. With greater isolation the kernel will create > > smaller iommu_groups and with less isolation there is more HW that > > can achieve P2P transfers. From a virtualization perspective all > > devices in the same iommu_group must be assigned to the same VM as > > they lack security isolation. > > > > There is no way for the kernel to automatically know the correct > > ACS settings for any given system and workload. Existing command line > > options (ex:- disable_acs_redir) allow only for large scale change, > > disabling all isolation, but this is not sufficient for more complex cases. > > > > Add a kernel command-line option 'config_acs' to directly control all the > > ACS bits for specific devices, which allows the operator to setup the > > right level of isolation to achieve the desired P2P configuration. > > The definition is future proof, when new ACS bits are added to the spec > > the open syntax can be extended. > > > > ACS needs to be setup early in the kernel boot as the ACS settings > > effect how iommu_groups are formed. iommu_group formation is a one > > time event during initial device discovery, changing ACS bits after > > kernel boot can result in an inaccurate view of the iommu_groups > > compared to the current isolation configuration. > > > > ACS applies to PCIe Downstream Ports and multi-function devices. > > The default ACS settings are strict and deny any direct traffic > > between two functions. This results in the smallest iommu_group the > > HW can support. Frequently these values result in slow or > > non-working P2PDMA. > > > > ACS offers a range of security choices controlling how traffic is > > allowed to go directly between two devices. Some popular choices: > > - Full prevention > > - Translated requests can be direct, with various options > > - Asymmetric direct traffic, A can reach B but not the reverse > > - All traffic can be direct > > Along with some other less common ones for special topologies. > > > > The intention is that this option would be used with expert knowledge > > of the HW capability and workload to achieve the desired > > configuration. > > Hi, > > this breaks ACS on some platforms (in 6.11). See: > https://bugzilla.suse.com/show_bug.cgi?id=1229019 #regzbot introduced: 47c8846a ("PCI: Extend ACS configurability") #regzbot link: https://bugzilla.suse.com/show_bug.cgi?id=1229019
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 500cfa776225..42d0f6fd40d0 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4619,6 +4619,28 @@ bridges without forcing it upstream. Note: this removes isolation between devices and may put more devices in an IOMMU group. + config_acs= + Format: + =<ACS flags>@<pci_dev>[; ...] + Specify one or more PCI devices (in the format + specified above) optionally prepended with flags + and separated by semicolons. The respective + capabilities will be enabled, disabled or unchanged + based on what is specified in flags. + ACS Flags is defined as follows + bit-0 : ACS Source Validation + bit-1 : ACS Translation Blocking + bit-2 : ACS P2P Request Redirect + bit-3 : ACS P2P Completion Redirect + bit-4 : ACS Upstream Forwarding + bit-5 : ACS P2P Egress Control + bit-6 : ACS Direct Translated P2P + Each bit can be marked as + ‘0‘ – force disabled + ‘1’ – force enabled + ‘x’ – unchanged. + Note: this may remove isolation between devices + and may put more devices in an IOMMU group. force_floating [S390] Force usage of floating interrupts. nomio [S390] Do not use MIO instructions. norid [S390] ignore the RID field and force use of diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 55078c70a05b..6661932afe59 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -946,30 +946,67 @@ void pci_request_acs(void) } static const char *disable_acs_redir_param; +static const char *config_acs_param; -/** - * pci_disable_acs_redir - disable ACS redirect capabilities - * @dev: the PCI device - * - * For only devices specified in the disable_acs_redir parameter. - */ -static void pci_disable_acs_redir(struct pci_dev *dev) +struct pci_acs { + u16 cap; + u16 ctrl; + u16 fw_ctrl; +}; + +static void __pci_config_acs(struct pci_dev *dev, struct pci_acs *caps, + const char *p, u16 mask, u16 flags) { + char *delimit; int ret = 0; - const char *p; - int pos; - u16 ctrl; - if (!disable_acs_redir_param) + if (!p) return; - p = disable_acs_redir_param; while (*p) { + if (!mask) { + /* Check for ACS flags */ + delimit = strstr(p, "@"); + if (delimit) { + int end; + u32 shift = 0; + + end = delimit - p - 1; + + while (end > -1) { + if (*(p + end) == '0') { + mask |= 1 << shift; + shift++; + end--; + } else if (*(p + end) == '1') { + mask |= 1 << shift; + flags |= 1 << shift; + shift++; + end--; + } else if ((*(p + end) == 'x') || (*(p + end) == 'X')) { + shift++; + end--; + } else { + pci_err(dev, "Invalid ACS flags... Ignoring\n"); + return; + } + } + p = delimit + 1; + } else { + pci_err(dev, "ACS Flags missing\n"); + return; + } + } + + if (mask & ~(PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR | + PCI_ACS_UF | PCI_ACS_EC | PCI_ACS_DT)) { + pci_err(dev, "Invalid ACS flags specified\n"); + return; + } + ret = pci_dev_str_match(dev, p, &p); if (ret < 0) { - pr_info_once("PCI: Can't parse disable_acs_redir parameter: %s\n", - disable_acs_redir_param); - + pr_info_once("PCI: Can't parse acs command line parameter\n"); break; } else if (ret == 1) { /* Found a match */ @@ -989,56 +1026,38 @@ static void pci_disable_acs_redir(struct pci_dev *dev) if (!pci_dev_specific_disable_acs_redir(dev)) return; - pos = dev->acs_cap; - if (!pos) { - pci_warn(dev, "cannot disable ACS redirect for this hardware as it does not have ACS capabilities\n"); - return; - } - - pci_read_config_word(dev, pos + PCI_ACS_CTRL, &ctrl); + pci_dbg(dev, "ACS mask = 0x%X\n", mask); + pci_dbg(dev, "ACS flags = 0x%X\n", flags); - /* P2P Request & Completion Redirect */ - ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_EC); + /* If mask is 0 then we copy the bit from the firmware setting. */ + caps->ctrl = (caps->ctrl & ~mask) | (caps->fw_ctrl & mask); + caps->ctrl |= flags; - pci_write_config_word(dev, pos + PCI_ACS_CTRL, ctrl); - - pci_info(dev, "disabled ACS redirect\n"); + pci_info(dev, "Configured ACS to 0x%x\n", caps->ctrl); } /** * pci_std_enable_acs - enable ACS on devices using standard ACS capabilities * @dev: the PCI device + * @caps: default ACS controls */ -static void pci_std_enable_acs(struct pci_dev *dev) +static void pci_std_enable_acs(struct pci_dev *dev, struct pci_acs *caps) { - int pos; - u16 cap; - u16 ctrl; - - pos = dev->acs_cap; - if (!pos) - return; - - pci_read_config_word(dev, pos + PCI_ACS_CAP, &cap); - pci_read_config_word(dev, pos + PCI_ACS_CTRL, &ctrl); - /* Source Validation */ - ctrl |= (cap & PCI_ACS_SV); + caps->ctrl |= (caps->cap & PCI_ACS_SV); /* P2P Request Redirect */ - ctrl |= (cap & PCI_ACS_RR); + caps->ctrl |= (caps->cap & PCI_ACS_RR); /* P2P Completion Redirect */ - ctrl |= (cap & PCI_ACS_CR); + caps->ctrl |= (caps->cap & PCI_ACS_CR); /* Upstream Forwarding */ - ctrl |= (cap & PCI_ACS_UF); + caps->ctrl |= (caps->cap & PCI_ACS_UF); /* Enable Translation Blocking for external devices and noats */ if (pci_ats_disabled() || dev->external_facing || dev->untrusted) - ctrl |= (cap & PCI_ACS_TB); - - pci_write_config_word(dev, pos + PCI_ACS_CTRL, ctrl); + caps->ctrl |= (caps->cap & PCI_ACS_TB); } /** @@ -1047,23 +1066,33 @@ static void pci_std_enable_acs(struct pci_dev *dev) */ static void pci_enable_acs(struct pci_dev *dev) { - if (!pci_acs_enable) - goto disable_acs_redir; + struct pci_acs caps; + int pos; + + pos = dev->acs_cap; + if (!pos) + return; - if (!pci_dev_specific_enable_acs(dev)) - goto disable_acs_redir; + pci_read_config_word(dev, pos + PCI_ACS_CAP, &caps.cap); + pci_read_config_word(dev, pos + PCI_ACS_CTRL, &caps.ctrl); + caps.fw_ctrl = caps.ctrl; - pci_std_enable_acs(dev); + /* If an iommu is present we start with kernel default caps */ + if (pci_acs_enable) { + if (pci_dev_specific_enable_acs(dev)) + pci_std_enable_acs(dev, &caps); + } -disable_acs_redir: /* - * Note: pci_disable_acs_redir() must be called even if ACS was not - * enabled by the kernel because it may have been enabled by - * platform firmware. So if we are told to disable it, we should - * always disable it after setting the kernel's default - * preferences. + * Always apply caps from the command line, even if there is no iommu. + * Trust that the admin has a reason to change the ACS settings. */ - pci_disable_acs_redir(dev); + __pci_config_acs(dev, &caps, disable_acs_redir_param, + PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_EC, + ~(PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_EC)); + __pci_config_acs(dev, &caps, config_acs_param, 0, 0); + + pci_write_config_word(dev, pos + PCI_ACS_CTRL, caps.ctrl); } /** @@ -6740,6 +6769,8 @@ static int __init pci_setup(char *str) pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS); } else if (!strncmp(str, "disable_acs_redir=", 18)) { disable_acs_redir_param = str + 18; + } else if (!strncmp(str, "config_acs=", 11)) { + config_acs_param = str + 11; } else { pr_err("PCI: Unknown option `%s'\n", str); } @@ -6764,6 +6795,7 @@ static int __init pci_realloc_setup_params(void) resource_alignment_param = kstrdup(resource_alignment_param, GFP_KERNEL); disable_acs_redir_param = kstrdup(disable_acs_redir_param, GFP_KERNEL); + config_acs_param = kstrdup(config_acs_param, GFP_KERNEL); return 0; }
PCIe ACS settings control the level of isolation and the possible P2P paths between devices. With greater isolation the kernel will create smaller iommu_groups and with less isolation there is more HW that can achieve P2P transfers. From a virtualization perspective all devices in the same iommu_group must be assigned to the same VM as they lack security isolation. There is no way for the kernel to automatically know the correct ACS settings for any given system and workload. Existing command line options (ex:- disable_acs_redir) allow only for large scale change, disabling all isolation, but this is not sufficient for more complex cases. Add a kernel command-line option 'config_acs' to directly control all the ACS bits for specific devices, which allows the operator to setup the right level of isolation to achieve the desired P2P configuration. The definition is future proof, when new ACS bits are added to the spec the open syntax can be extended. ACS needs to be setup early in the kernel boot as the ACS settings effect how iommu_groups are formed. iommu_group formation is a one time event during initial device discovery, changing ACS bits after kernel boot can result in an inaccurate view of the iommu_groups compared to the current isolation configuration. ACS applies to PCIe Downstream Ports and multi-function devices. The default ACS settings are strict and deny any direct traffic between two functions. This results in the smallest iommu_group the HW can support. Frequently these values result in slow or non-working P2PDMA. ACS offers a range of security choices controlling how traffic is allowed to go directly between two devices. Some popular choices: - Full prevention - Translated requests can be direct, with various options - Asymmetric direct traffic, A can reach B but not the reverse - All traffic can be direct Along with some other less common ones for special topologies. The intention is that this option would be used with expert knowledge of the HW capability and workload to achieve the desired configuration. Signed-off-by: Vidya Sagar <vidyas@nvidia.com> --- v4: * Changed commit message (Courtesy: Jason) to provide more details v3: * Fixed a documentation issue reported by kernel test bot v2: * Refactored the code as per Jason's suggestion .../admin-guide/kernel-parameters.txt | 22 +++ drivers/pci/pci.c | 148 +++++++++++------- 2 files changed, 112 insertions(+), 58 deletions(-)