diff mbox series

arm64: PCI: Validate the node before setting node id for root bus

Message ID 1600770804-116365-1-git-send-email-baolin.wang@linux.alibaba.com (mailing list archive)
State New, archived
Headers show
Series arm64: PCI: Validate the node before setting node id for root bus | expand

Commit Message

Baolin Wang Sept. 22, 2020, 10:33 a.m. UTC
If the BIOS disabled the NUMA configuration, but did not change the
proximity domain description in the SRAT table, so the PCI root bus
device may get a incorrect node id by acpi_get_node().

Thus better to add a numa node validation before setting numa node
for the PCI root bus, like pci_acpi_root_get_node() does for X86
architecture.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 arch/arm64/kernel/pci.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Will Deacon Sept. 28, 2020, 2 p.m. UTC | #1
[+ Lorenzo]

On Tue, Sep 22, 2020 at 06:33:24PM +0800, Baolin Wang wrote:
> If the BIOS disabled the NUMA configuration, but did not change the
> proximity domain description in the SRAT table, so the PCI root bus
> device may get a incorrect node id by acpi_get_node().

How "incorrect" are we talking here? What actually goes wrong? At some
point, we have to trust what the firmware is telling us.

> Thus better to add a numa node validation before setting numa node
> for the PCI root bus, like pci_acpi_root_get_node() does for X86
> architecture.
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  arch/arm64/kernel/pci.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
> index 1006ed2..24fe2bd 100644
> --- a/arch/arm64/kernel/pci.c
> +++ b/arch/arm64/kernel/pci.c
> @@ -86,9 +86,13 @@ int pcibios_root_bridge_prepare(struct pci_host_bridge *bridge)
>  		struct pci_config_window *cfg = bridge->bus->sysdata;
>  		struct acpi_device *adev = to_acpi_device(cfg->parent);
>  		struct device *bus_dev = &bridge->bus->dev;
> +		int node = acpi_get_node(acpi_device_handle(adev));
> +
> +		if (node != NUMA_NO_NODE && !node_online(node))
> +			node = NUMA_NO_NODE;

Hmm. afaict, acpi_get_node() tries quite hard to return a valid node when
it gets back NUMA_NO_NODE in acpi_map_pxm_to_node(). Seems like we're
undoing all of that here, which worries me because NUMA_NO_NODE is a bit
of a loaded gun if you interpret it as a valid node.

Anyway, I defer to Lorenzo on this.

Will
Baolin Wang Sept. 28, 2020, 2:49 p.m. UTC | #2
On Mon, Sep 28, 2020 at 03:00:55PM +0100, Will Deacon wrote:
> [+ Lorenzo]
> 
> On Tue, Sep 22, 2020 at 06:33:24PM +0800, Baolin Wang wrote:
> > If the BIOS disabled the NUMA configuration, but did not change the
> > proximity domain description in the SRAT table, so the PCI root bus
> > device may get a incorrect node id by acpi_get_node().
> 
> How "incorrect" are we talking here? What actually goes wrong? At some
> point, we have to trust what the firmware is telling us.

What I mean is, if we disable the NUMA from BIOS, but we did not change
the PXM for the PCI devices, so the PCI devices can still get a numa
node id from acpi_get_node(). For example, we can still get the numa
node id = 1 in this case from acpi_get_node(), but the numa_nodes_parsed
is empty, which means the node id 1 is invalid. We should add a
validation for the node id when setting the root bus node id.

> 
> > Thus better to add a numa node validation before setting numa node
> > for the PCI root bus, like pci_acpi_root_get_node() does for X86
> > architecture.
> > 
> > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > ---
> >  arch/arm64/kernel/pci.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
> > index 1006ed2..24fe2bd 100644
> > --- a/arch/arm64/kernel/pci.c
> > +++ b/arch/arm64/kernel/pci.c
> > @@ -86,9 +86,13 @@ int pcibios_root_bridge_prepare(struct pci_host_bridge *bridge)
> >  		struct pci_config_window *cfg = bridge->bus->sysdata;
> >  		struct acpi_device *adev = to_acpi_device(cfg->parent);
> >  		struct device *bus_dev = &bridge->bus->dev;
> > +		int node = acpi_get_node(acpi_device_handle(adev));
> > +
> > +		if (node != NUMA_NO_NODE && !node_online(node))
> > +			node = NUMA_NO_NODE;
> 
> Hmm. afaict, acpi_get_node() tries quite hard to return a valid node when
> it gets back NUMA_NO_NODE in acpi_map_pxm_to_node(). Seems like we're
> undoing all of that here, which worries me because NUMA_NO_NODE is a bit
> of a loaded gun if you interpret it as a valid node.

I did not treate NUMA_NO_NODE as a valid node, I just add a validation
to validate if it is a valid node before setting. See my previous comments,
hopes I make things clear. Thanks.

> 
> Anyway, I defer to Lorenzo on this.
> 
> Will
Lorenzo Pieralisi Sept. 28, 2020, 3:23 p.m. UTC | #3
On Mon, Sep 28, 2020 at 10:49:57PM +0800, Baolin Wang wrote:
> On Mon, Sep 28, 2020 at 03:00:55PM +0100, Will Deacon wrote:
> > [+ Lorenzo]
> > 
> > On Tue, Sep 22, 2020 at 06:33:24PM +0800, Baolin Wang wrote:
> > > If the BIOS disabled the NUMA configuration, but did not change the
> > > proximity domain description in the SRAT table, so the PCI root bus
> > > device may get a incorrect node id by acpi_get_node().
> > 
> > How "incorrect" are we talking here? What actually goes wrong? At some
> > point, we have to trust what the firmware is telling us.
> 
> What I mean is, if we disable the NUMA from BIOS

Please define what this means ie are you removing SRAT from ACPI static
tables ?

> but we did not change the PXM for the PCI devices, 

If a _PXM maps to a proximity domain that is not described in the SRAT
your firmware is buggy.

> so the PCI devices can still get a numa node id from acpi_get_node().
> For example, we can still get the numa node id = 1 in this case from
> acpi_get_node(), but the numa_nodes_parsed is empty, which means the
> node id 1 is invalid.  We should add a validation for the node id when
> setting the root bus node id.

The kernel is not a firmware validation test suite, so fix the firmware
please.

Having said that, please provide a trace log of the issue this is
causing, if any.

Thanks,
Lorenzo
Baolin Wang Sept. 29, 2020, 3:41 p.m. UTC | #4
Hi,

在 2020/9/28 23:23, Lorenzo Pieralisi 写道:
> On Mon, Sep 28, 2020 at 10:49:57PM +0800, Baolin Wang wrote:
>> On Mon, Sep 28, 2020 at 03:00:55PM +0100, Will Deacon wrote:
>>> [+ Lorenzo]
>>>
>>> On Tue, Sep 22, 2020 at 06:33:24PM +0800, Baolin Wang wrote:
>>>> If the BIOS disabled the NUMA configuration, but did not change the
>>>> proximity domain description in the SRAT table, so the PCI root bus
>>>> device may get a incorrect node id by acpi_get_node().
>>>
>>> How "incorrect" are we talking here? What actually goes wrong? At some
>>> point, we have to trust what the firmware is telling us.
>>
>> What I mean is, if we disable the NUMA from BIOS
> 
> Please define what this means ie are you removing SRAT from ACPI static
> tables ?

Yes.

> 
>> but we did not change the PXM for the PCI devices,
> 
> If a _PXM maps to a proximity domain that is not described in the SRAT
> your firmware is buggy.

Sorry for confusing, that's not what I mean. When the BIOS disable the 
NUMA (remove the SRAT table), but the PCI devices' _PXM description is 
still available, which means we can still get the pxm from 
acpi_evaluate_integer() in this case.

So we can get below inconsistent log on ARM platform:
"No NUMA configuration found
PCI_bus 0000:00 on NUMA node 0
...
PCI_bus 0000:e3 on NUMA node 1"

On X86, the pci_acpi_root_get_node() will validate the node before 
setting the node id for root bus. So I think we can add this validation 
for ARM platform. Or anything else I missed?

> 
>> so the PCI devices can still get a numa node id from acpi_get_node().
>> For example, we can still get the numa node id = 1 in this case from
>> acpi_get_node(), but the numa_nodes_parsed is empty, which means the
>> node id 1 is invalid.  We should add a validation for the node id when
>> setting the root bus node id.
> 
> The kernel is not a firmware validation test suite, so fix the firmware
> please.
> 
> Having said that, please provide a trace log of the issue this is
> causing, if any.

See above.
Lorenzo Pieralisi Oct. 1, 2020, 8:55 a.m. UTC | #5
On Tue, Sep 29, 2020 at 11:41:29PM +0800, Baolin Wang wrote:
> Hi,
> 
> 锟斤拷 2020/9/28 23:23, Lorenzo Pieralisi 写锟斤拷:
> > On Mon, Sep 28, 2020 at 10:49:57PM +0800, Baolin Wang wrote:
> > > On Mon, Sep 28, 2020 at 03:00:55PM +0100, Will Deacon wrote:
> > > > [+ Lorenzo]
> > > > 
> > > > On Tue, Sep 22, 2020 at 06:33:24PM +0800, Baolin Wang wrote:
> > > > > If the BIOS disabled the NUMA configuration, but did not change the
> > > > > proximity domain description in the SRAT table, so the PCI root bus
> > > > > device may get a incorrect node id by acpi_get_node().
> > > > 
> > > > How "incorrect" are we talking here? What actually goes wrong? At some
> > > > point, we have to trust what the firmware is telling us.
> > > 
> > > What I mean is, if we disable the NUMA from BIOS
> > 
> > Please define what this means ie are you removing SRAT from ACPI static
> > tables ?
> 
> Yes.
> 
> > 
> > > but we did not change the PXM for the PCI devices,
> > 
> > If a _PXM maps to a proximity domain that is not described in the SRAT
> > your firmware is buggy.
> 
> Sorry for confusing, that's not what I mean. When the BIOS disable the NUMA
> (remove the SRAT table), but the PCI devices' _PXM description is still
> available, which means we can still get the pxm from acpi_evaluate_integer()
> in this case.

There should not be a _PXM object if the SRAT is not available, that's
a firmware bug.

> So we can get below inconsistent log on ARM platform:
> "No NUMA configuration found
> PCI_bus 0000:00 on NUMA node 0
> ...
> PCI_bus 0000:e3 on NUMA node 1"
> 
> On X86, the pci_acpi_root_get_node() will validate the node before setting
> the node id for root bus. So I think we can add this validation for ARM
> platform. Or anything else I missed?

We are not adding checks because x86 does it, it is certainly to paper
over a firmware bug that you hopefully still have a chance to fix,
let's do that instead of adding code that is not necessary.

Lorenzo
Baolin Wang Oct. 3, 2020, 9:35 a.m. UTC | #6
> On Tue, Sep 29, 2020 at 11:41:29PM +0800, Baolin Wang wrote:
>> Hi,
>>
>> 锟斤拷 2020/9/28 23:23, Lorenzo Pieralisi 写锟斤拷:
>>> On Mon, Sep 28, 2020 at 10:49:57PM +0800, Baolin Wang wrote:
>>>> On Mon, Sep 28, 2020 at 03:00:55PM +0100, Will Deacon wrote:
>>>>> [+ Lorenzo]
>>>>>
>>>>> On Tue, Sep 22, 2020 at 06:33:24PM +0800, Baolin Wang wrote:
>>>>>> If the BIOS disabled the NUMA configuration, but did not change the
>>>>>> proximity domain description in the SRAT table, so the PCI root bus
>>>>>> device may get a incorrect node id by acpi_get_node().
>>>>>
>>>>> How "incorrect" are we talking here? What actually goes wrong? At some
>>>>> point, we have to trust what the firmware is telling us.
>>>>
>>>> What I mean is, if we disable the NUMA from BIOS
>>>
>>> Please define what this means ie are you removing SRAT from ACPI static
>>> tables ?
>>
>> Yes.
>>
>>>
>>>> but we did not change the PXM for the PCI devices,
>>>
>>> If a _PXM maps to a proximity domain that is not described in the SRAT
>>> your firmware is buggy.
>>
>> Sorry for confusing, that's not what I mean. When the BIOS disable the NUMA
>> (remove the SRAT table), but the PCI devices' _PXM description is still
>> available, which means we can still get the pxm from acpi_evaluate_integer()
>> in this case.
> 
> There should not be a _PXM object if the SRAT is not available, that's
> a firmware bug.
> 
>> So we can get below inconsistent log on ARM platform:
>> "No NUMA configuration found
>> PCI_bus 0000:00 on NUMA node 0
>> ...
>> PCI_bus 0000:e3 on NUMA node 1"
>>
>> On X86, the pci_acpi_root_get_node() will validate the node before setting
>> the node id for root bus. So I think we can add this validation for ARM
>> platform. Or anything else I missed?
> 
> We are not adding checks because x86 does it, it is certainly to paper
> over a firmware bug that you hopefully still have a chance to fix,
> let's do that instead of adding code that is not necessary.

Thanks for your input, and I will check this issue with our firmware 
colleagues again.
Baolin Wang Nov. 9, 2020, 12:27 p.m. UTC | #7
Hi Lorenzo,

> 
>> On Tue, Sep 29, 2020 at 11:41:29PM +0800, Baolin Wang wrote:
>>> Hi,
>>>
>>> 锟斤拷 2020/9/28 23:23, Lorenzo Pieralisi 写锟斤拷:
>>>> On Mon, Sep 28, 2020 at 10:49:57PM +0800, Baolin Wang wrote:
>>>>> On Mon, Sep 28, 2020 at 03:00:55PM +0100, Will Deacon wrote:
>>>>>> [+ Lorenzo]
>>>>>>
>>>>>> On Tue, Sep 22, 2020 at 06:33:24PM +0800, Baolin Wang wrote:
>>>>>>> If the BIOS disabled the NUMA configuration, but did not change the
>>>>>>> proximity domain description in the SRAT table, so the PCI root bus
>>>>>>> device may get a incorrect node id by acpi_get_node().
>>>>>>
>>>>>> How "incorrect" are we talking here? What actually goes wrong? At 
>>>>>> some
>>>>>> point, we have to trust what the firmware is telling us.
>>>>>
>>>>> What I mean is, if we disable the NUMA from BIOS
>>>>
>>>> Please define what this means ie are you removing SRAT from ACPI static
>>>> tables ?
>>>
>>> Yes.
>>>
>>>>
>>>>> but we did not change the PXM for the PCI devices,
>>>>
>>>> If a _PXM maps to a proximity domain that is not described in the SRAT
>>>> your firmware is buggy.
>>>
>>> Sorry for confusing, that's not what I mean. When the BIOS disable 
>>> the NUMA
>>> (remove the SRAT table), but the PCI devices' _PXM description is still
>>> available, which means we can still get the pxm from 
>>> acpi_evaluate_integer()
>>> in this case.
>>
>> There should not be a _PXM object if the SRAT is not available, that's
>> a firmware bug.
>>
>>> So we can get below inconsistent log on ARM platform:
>>> "No NUMA configuration found
>>> PCI_bus 0000:00 on NUMA node 0
>>> ...
>>> PCI_bus 0000:e3 on NUMA node 1"
>>>
>>> On X86, the pci_acpi_root_get_node() will validate the node before 
>>> setting
>>> the node id for root bus. So I think we can add this validation for ARM
>>> platform. Or anything else I missed?
>>
>> We are not adding checks because x86 does it, it is certainly to paper
>> over a firmware bug that you hopefully still have a chance to fix,
>> let's do that instead of adding code that is not necessary.
> 
> Thanks for your input, and I will check this issue with our firmware 
> colleagues again.

Sorry for late reply.

I did some investigation for this issue. I am sorry I made some 
misleading description in the commit message. The issue is, when we
want to disable the NUMA from firmware, we usually just remove the SRAT 
table from the BIOS. But the devices' proximity domain information is 
still remain in the ACPI tables.

For example, the IORT table still contains the proximity domain 
information for the SMMU devices, so in this case, the SMMU devices 
still can get incorrect NUMA nodes if we remove the SRAT table. But
the SMMU devices will validate the numa node in 
arm_smmu_v3_set_proximity() to avoid this issue.

static int  __init arm_smmu_v3_set_proximity(struct device *dev,
					      struct acpi_iort_node *node)
{
	struct acpi_iort_smmu_v3 *smmu;

	smmu = (struct acpi_iort_smmu_v3 *)node->node_data;
	if (smmu->flags & ACPI_IORT_SMMU_V3_PXM_VALID) {
		int dev_node = pxm_to_node(smmu->pxm);

		if (dev_node != NUMA_NO_NODE && !node_online(dev_node))
			return -EINVAL;

		set_dev_node(dev, dev_node);
		pr_info("SMMU-v3[%llx] Mapped to Proximity domain %d\n",
			smmu->base_address,
			smmu->pxm);
	}
	return 0;
}

So similar with SMMU devices, the DSDT table will still contain the PCI 
root host devices' proximity domain though we already remove the SRAT 
table. So I think we still need this validation in 
pcibios_root_bridge_prepare() to avoid this issue like other devices did.

I can update the commit message in next version if you think this is 
reasonable. Thanks.
Lorenzo Pieralisi Nov. 12, 2020, 5:05 p.m. UTC | #8
[+Jonathan]

On Mon, Nov 09, 2020 at 08:27:09PM +0800, Baolin Wang wrote:

[...]

> I did some investigation for this issue. I am sorry I made some
> misleading description in the commit message. The issue is, when we
> want to disable the NUMA from firmware, we usually just remove the SRAT
> table from the BIOS. But the devices' proximity domain information is
> still remain in the ACPI tables.

I understand and it should not.

> For example, the IORT table still contains the proximity domain
> information for the SMMU devices, so in this case, the SMMU devices still
> can get incorrect NUMA nodes if we remove the SRAT table. But
> the SMMU devices will validate the numa node in
> arm_smmu_v3_set_proximity() to avoid this issue.
> 
> static int  __init arm_smmu_v3_set_proximity(struct device *dev,
> 					      struct acpi_iort_node *node)
> {
> 	struct acpi_iort_smmu_v3 *smmu;
> 
> 	smmu = (struct acpi_iort_smmu_v3 *)node->node_data;
> 	if (smmu->flags & ACPI_IORT_SMMU_V3_PXM_VALID) {
> 		int dev_node = pxm_to_node(smmu->pxm);
> 
> 		if (dev_node != NUMA_NO_NODE && !node_online(dev_node))
> 			return -EINVAL;
> 
> 		set_dev_node(dev, dev_node);
> 		pr_info("SMMU-v3[%llx] Mapped to Proximity domain %d\n",
> 			smmu->base_address,
> 			smmu->pxm);
> 	}
> 	return 0;
> }
> 
> So similar with SMMU devices, the DSDT table will still contain the PCI
> root host devices' proximity domain though we already remove the SRAT
> table. So I think we still need this validation in
> pcibios_root_bridge_prepare() to avoid this issue like other devices did.
No. The right thing to do is to fix the PXM handling and that's what
Jonathan did:

https://lore.kernel.org/linux-mm/20200818142430.1156547-2-Jonathan.Cameron@huawei.com

Can you try booting with v5.10-rc* and report back the *full* boot log
please ?

> I can update the commit message in next version if you think this is
> reasonable. Thanks.

See above.

Thanks,
Lorenzo
diff mbox series

Patch

diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
index 1006ed2..24fe2bd 100644
--- a/arch/arm64/kernel/pci.c
+++ b/arch/arm64/kernel/pci.c
@@ -86,9 +86,13 @@  int pcibios_root_bridge_prepare(struct pci_host_bridge *bridge)
 		struct pci_config_window *cfg = bridge->bus->sysdata;
 		struct acpi_device *adev = to_acpi_device(cfg->parent);
 		struct device *bus_dev = &bridge->bus->dev;
+		int node = acpi_get_node(acpi_device_handle(adev));
+
+		if (node != NUMA_NO_NODE && !node_online(node))
+			node = NUMA_NO_NODE;
 
 		ACPI_COMPANION_SET(&bridge->dev, adev);
-		set_dev_node(bus_dev, acpi_get_node(acpi_device_handle(adev)));
+		set_dev_node(bus_dev, node);
 	}
 
 	return 0;