diff mbox series

[V2] x86: Fix an issue with invalid ACPI NUMA config

Message ID 20181211094737.71554-1-Jonathan.Cameron@huawei.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show
Series [V2] x86: Fix an issue with invalid ACPI NUMA config | expand

Commit Message

Jonathan Cameron Dec. 11, 2018, 9:47 a.m. UTC
The addition of support to read the NUMA node for a PCI card specified by
_PXM resulted in Martin's system not booting.   Looking at the ACPI tables
it seems that there are _PXM entries for the root ports, but no SRAT table.

The absence of the SRAT table results in dummy_numa_init() being called.
However, unlike on arm64, this doesn't result in numa_off being set.

When the PCI code later comes along and calls acpi_get_node() for any PCI
card below the root port, it navigates up the ACPI tree until it finds the
_PXM value in the root port. This value is then passed to
acpi_map_pxm_to_node().

As numa_off has not been set on x86 it tries to allocate a NUMA node, from
the unused set, without setting up all the infrastructure that would
normally accompany such a call.  We have not identified exactly which driver
is causing the subsequent hang for Martin.

If numa_off had been set, as it is in the equivalent flow on arm64, then
acpi_map_pxm_to_node() would return NUMA_NO_NODE, which is what we want to
happen.
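
For reference, the mechanism is roughly as follows (a simplified paraphrase of
acpi_map_pxm_to_node() in drivers/acpi/numa.c; the real code differs in detail):

  /*
   * Simplified paraphrase, not the exact kernel code: without numa_off an
   * unknown proximity domain gets a brand-new node number allocated from
   * the unused set; with numa_off set we bail out to NUMA_NO_NODE instead.
   */
  int acpi_map_pxm_to_node(int pxm)
  {
          int node;

          if (pxm < 0 || pxm >= MAX_PXM_DOMAINS || numa_off)
                  return NUMA_NO_NODE;

          node = pxm_to_node_map[pxm];
          if (node == NUMA_NO_NODE) {
                  node = first_unset_node(nodes_found_map);
                  __acpi_map_pxm_to_node(pxm, node);
                  node_set(node, nodes_found_map);
          }

          return node;
  }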

It is invalid under the ACPI spec to specify new NUMA nodes using _PXM if
they have no presence in SRAT. Thus the simplest fix is to set numa_off when
NUMA support is disabled due to an invalid SRAT (here not present at all).

I do not have easy access to appropriate x86 NUMA systems so would
appreciate some testing of this one!

Known problem board setups:

AMD Ryzen Threadripper 2950X on ASROCK X399 TAICHI
MSI X399 SLI PLUS (probably - not confirmed yet)

The PCI patch has been reverted, so this fix is not critical.

Reported-by: Martin Hundebøll <martin@geanix.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Fixes: bad7dcd94f39 ("ACPI/PCI: Pay attention to device-specific _PXM node values")
---
Changes since V1:
* Update commit message as suggested by Bjorn Helgaas.
* No functional changes.

 arch/x86/mm/numa.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Dave Hansen Dec. 11, 2018, 6:19 p.m. UTC | #1
On 12/11/18 1:47 AM, Jonathan Cameron wrote:
> When the PCI code later comes along and calls acpi_get_node() for any PCI
> card below the root port, it navigates up the ACPI tree until it finds the
> _PXM value in the root port. This value is then passed to
> acpi_map_pxm_to_node().
> 
> As numa_off has not been set on x86 it tries to allocate a NUMA node, from
> the unused set, without setting up all the infrastructure that would
> normally accompany such a call. 

FWIW, this _sounds_ like the real problem here.  We're allowing an
allocation to proceed without some infrastructure that we require.
Shouldn't we be detecting that this infrastructure is not in place and
warn about *it* at least?

I'm a bit worried that this is just papering over an unknown error to
make a hang go away.  It seems a bit too far away from the root cause.
Jonathan Cameron Dec. 12, 2018, 9:39 a.m. UTC | #2
On Tue, 11 Dec 2018 10:19:49 -0800
Dave Hansen <dave.hansen@intel.com> wrote:

> On 12/11/18 1:47 AM, Jonathan Cameron wrote:
> > When the PCI code later comes along and calls acpi_get_node() for any PCI
> > card below the root port, it navigates up the ACPI tree until it finds the
> > _PXM value in the root port. This value is then passed to
> > acpi_map_pxm_to_node().
> > 
> > As numa_off has not been set on x86 it tries to allocate a NUMA node, from
> > the unused set, without setting up all the infrastructure that would
> > normally accompany such a call.   
> 
> FWIW, this _sounds_ like the real problem here.  We're allowing an
> allocation to proceed without some infrastructure that we require.
> Shouldn't we be detecting that this infrastructure is not in place and
> warn about *it* at least?
> 
> I'm a bit worried that this is just papering over an unknown error to
> make a hang go away.  It seems a bit too far away from the root cause.

I'm not totally convinced.  We are warning about it on the two lines just
above the top of this patch.

"No NUMA configuration found"
"Faking a node at [mem....]"

We are falling back to the exact same code paths as if you had deliberately
turned off NUMA at the command line, with messages stating that this is the
case.  That approach seems to be safe and consistent.

Now there is a potential corner case here where I agree with you that it
may make sense to 'also' add protection in the acpi_map_pxm_to_node() path:
the case where we do have a valid NUMA configuration and along comes a new
device with a node outside of those that are defined.
(Note there is a change coming in the next ACPI revision precisely to work
around a case in which this validly happens, when the OS sees some new
features and doesn't know what to do with them - it still relies on the
ACPI tables having the right magic in them for the fallback to work - more
on that when the spec is out...)

One option would be to (in addition to this patch) add a new version of
acpi_get_node() that will only give you a node that actually exists,
and an error otherwise, allowing code to fall back to NUMA_NO_NODE.

Other than the error we might be able to use acpi_map_pxm_to_online_node
for this, or call both acpi_map_pxm_to_node and acpi_map_pxm_to_online_node
and compare the answers to verify we are getting the node we want?
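
As a very rough sketch of the kind of helper I mean (untested, names invented):

  /*
   * Sketch only: hand back a node from the ACPI hierarchy only if that
   * node was actually described and brought up; otherwise NUMA_NO_NODE.
   */
  static int acpi_get_existing_node(acpi_handle handle)
  {
          int node = acpi_get_node(handle);

          if (node != NUMA_NO_NODE && !node_online(node))
                  return NUMA_NO_NODE;

          return node;
  }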

Jonathan
Bjorn Helgaas Dec. 20, 2018, 3:12 p.m. UTC | #3
On Wed, Dec 12, 2018 at 09:39:14AM +0000, Jonathan Cameron wrote:
> On Tue, 11 Dec 2018 10:19:49 -0800
> Dave Hansen <dave.hansen@intel.com> wrote:
> 
> > On 12/11/18 1:47 AM, Jonathan Cameron wrote:
> > > When the PCI code later comes along and calls acpi_get_node() for any PCI
> > > card below the root port, it navigates up the ACPI tree until it finds the
> > > _PXM value in the root port. This value is then passed to
> > > acpi_map_pxm_to_node().
> > > 
> > > As numa_off has not been set on x86 it tries to allocate a NUMA node, from
> > > the unused set, without setting up all the infrastructure that would
> > > normally accompany such a call.   
> > 
> > FWIW, this _sounds_ like the real problem here.  We're allowing an
> > allocation to proceed without some infrastructure that we require.
> > Shouldn't we be detecting that this infrastructure is not in place and
> > warn about *it* at least?
> > 
> > I'm a bit worried that this is just papering over an unknown error to
> > make a hang go away.  It seems a bit too far away from the root cause.
> 
> I'm not totally convinced.  We are warning about it on the two lines just off the
> top of this patch.
> 
> "No NUMA configuration found"
> "Faking a node at [mem....]"
> 
> We are falling back to the exact same code paths as if you had deliberately
> turned off NUMA at the command line with messages stating that is the case.
> That approach seems to be safe and is consistent.
> 
> Now there is a potential corner here where I agree with you that it may
> make sense to 'also' add protections in the acpi_map_pxm_to_node() path
> which is that where we do have a valid NUMA configuration and along comes
> a new device with a node outside of those that are defined,
> (note there is a change coming in next ACPI precisely to work around a case
> that causes this to validly happen when the OS sees some new features and
> doesn't know what to do with them - it still relies on the ACPI tables
> having the right magic in them though for the fallback to work - more
> on that when the spec is out...).
> 
> One option would be to (in addition to this patch) add a new version of
> acpi_get_node that will only give you a node that actually exists
> and an error otherwise, allowing code to fall back to NUMA_NO_NODE.
> 
> Other than the error we might be able to use acpi_map_pxm_to_online_node
> for this, or call both acpi_map_pxm_to_node and acpi_map_pxm_to_online_node
> and compare the answers to verify we are getting the node we want?

Where are we at with this?  It'd be nice to resolve it for v4.21, but
it's a little out of my comfort zone, so I don't want to apply it
unless there's clear consensus that this is the right fix.

Bjorn
Dave Hansen Dec. 20, 2018, 5:13 p.m. UTC | #4
On 12/20/18 7:12 AM, Bjorn Helgaas wrote:
>> Other than the error we might be able to use acpi_map_pxm_to_online_node
>> for this, or call both acpi_map_pxm_to_node and acpi_map_pxm_to_online_node
>> and compare the answers to verify we are getting the node we want?
> Where are we at with this?  It'd be nice to resolve it for v4.21, but
> it's a little out of my comfort zone, so I don't want to apply it
> unless there's clear consensus that this is the right fix.

I still think the fix in this patch sweeps the problem under the rug too
much.  But, it just might be the best single fix for backports, for
instance.
Bjorn Helgaas Dec. 20, 2018, 7:57 p.m. UTC | #5
On Thu, Dec 20, 2018 at 09:13:12AM -0800, Dave Hansen wrote:
> On 12/20/18 7:12 AM, Bjorn Helgaas wrote:
> >> Other than the error we might be able to use acpi_map_pxm_to_online_node
> >> for this, or call both acpi_map_pxm_to_node and acpi_map_pxm_to_online_node
> >> and compare the answers to verify we are getting the node we want?
> > Where are we at with this?  It'd be nice to resolve it for v4.21, but
> > it's a little out of my comfort zone, so I don't want to apply it
> > unless there's clear consensus that this is the right fix.
> 
> I still think the fix in this patch sweeps the problem under the rug too
> much.  But, it just might be the best single fix for backports, for
> instance.

Sounds like we should first find the best fix, then worry about how to
backport it.  So I think we have a little more noodling to do, and
I'll defer this for now.

Bjorn
Jonathan Cameron Jan. 28, 2019, 11:31 a.m. UTC | #6
On Thu, 20 Dec 2018 13:57:14 -0600
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Thu, Dec 20, 2018 at 09:13:12AM -0800, Dave Hansen wrote:
> > On 12/20/18 7:12 AM, Bjorn Helgaas wrote:  
> > >> Other than the error we might be able to use acpi_map_pxm_to_online_node
> > >> for this, or call both acpi_map_pxm_to_node and acpi_map_pxm_to_online_node
> > >> and compare the answers to verify we are getting the node we want?  
> > > Where are we at with this?  It'd be nice to resolve it for v4.21, but
> > > it's a little out of my comfort zone, so I don't want to apply it
> > > unless there's clear consensus that this is the right fix.  
> > 
> > I still think the fix in this patch sweeps the problem under the rug too
> > much.  But, it just might be the best single fix for backports, for
> > instance.  
> 
> Sounds like we should first find the best fix, then worry about how to
> backport it.  So I think we have a little more noodling to do, and
> I'll defer this for now.
> 
> Bjorn

Hi All,

I'd definitely appreciate some guidance on what the 'right' fix is.
We are starting to get real performance issues reported as a result of not
being able to use this patch on mainline.

5-10% performance drop on some networking benchmarks.

As a brief summary (having added linux-mm / linux-acpi) the issue is:

1) ACPI allows _PXM to be applied to pci devices (including root ports for
   example, but any device is fine).
2) Due to the ordering of when the fw node was set for PCI devices, this wasn't
   taking effect. Easy to solve by just adding the NUMA node, if provided, in
   pci_acpi_setup() (which is late enough) - see the sketch after this list.
3) A patch to fix that was applied to the PCIe tree
  https://patchwork.kernel.org/patch/10597777/
   but we got non-booting regressions on some Threadripper platforms.
   That turned out to be because they don't have an SRAT, but do have _PXM entries.
  (i.e. broken firmware).  Naturally Bjorn reverted this very quickly!
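
For reference, the reverted change in 2)/3) was roughly of the following shape
(a simplified sketch from memory, not the exact patch):

  /* Simplified sketch from memory, not the exact reverted patch. */
  static void pci_acpi_setup(struct device *dev)
  {
          struct acpi_device *adev = ACPI_COMPANION(dev);
          int node;

          if (!adev)
                  return;

          /* Honour a _PXM on the device or an ancestor such as the root port. */
          node = acpi_get_node(adev->handle);
          if (node != NUMA_NO_NODE)
                  set_dev_node(dev, node);

          /* ... rest of pci_acpi_setup() unchanged ... */
  }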

I proposed this fix which was to do the same as on Arm and clearly mark numa as
off when SRAT isn't present on an ACPI system.
https://elixir.bootlin.com/linux/latest/source/arch/arm64/mm/numa.c#L460
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/numa.c#L688

Dave's response was that we needed to fix the underlying issue of trying to
allocate from non-existent NUMA nodes.

Whilst I agree with that in principle (having managed to provide tables doing
exactly that during development a few times!), I'm not sure the path to doing so is
clear and so this has been stalled for a few months.  There is to my mind
still a strong argument, even with such protection in place, that we
should still be short cutting it so that you get the same paths if you deliberately
disable numa, and if you have no SRAT and hence can't have NUMA.

So given I have some 'mild for now' screaming going on, I'd definitely
appreciate input on how to move forward!

There are lots of places this could be worked around, e.g. we could sanity
check in the acpi_get_pxm call.  I'm not sure what side effects that would have
and also what cases it wouldn't cover.

Thanks,

Jonathan
Bjorn Helgaas Jan. 28, 2019, 11:13 p.m. UTC | #7
On Mon, Jan 28, 2019 at 11:31:08AM +0000, Jonathan Cameron wrote:
> On Thu, 20 Dec 2018 13:57:14 -0600
> Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Thu, Dec 20, 2018 at 09:13:12AM -0800, Dave Hansen wrote:
> > > On 12/20/18 7:12 AM, Bjorn Helgaas wrote:  
> > > >> Other than the error we might be able to use acpi_map_pxm_to_online_node
> > > >> for this, or call both acpi_map_pxm_to_node and acpi_map_pxm_to_online_node
> > > >> and compare the answers to verify we are getting the node we want?  
> > > > Where are we at with this?  It'd be nice to resolve it for v4.21, but
> > > > it's a little out of my comfort zone, so I don't want to apply it
> > > > unless there's clear consensus that this is the right fix.  
> > > 
> > > I still think the fix in this patch sweeps the problem under the rug too
> > > much.  But, it just might be the best single fix for backports, for
> > > instance.  
> > 
> > Sounds like we should first find the best fix, then worry about how to
> > backport it.  So I think we have a little more noodling to do, and
> > I'll defer this for now.
> > 
> > Bjorn
> 
> Hi All,
> 
> I'd definitely appreciate some guidance on what the 'right' fix is.
> We are starting to get real performance issues reported as a result of not
> being able to use this patch on mainline.
> 
> 5-10% performance drop on some networking benchmarks.

I guess the performance drop must be from calling kmalloc_node() with
the wrong node number because we currently ignore _PXM for the NIC?
And to get that performance back, you need both the previous patch to
pay attention to _PXM (https://lore.kernel.org/linux-pci/1527768879-88161-2-git-send-email-xiexiuqi@huawei.com)
and this patch (to set "numa_off=1" to avoid the regression the _PXM
patch by itself would cause)?

> As a brief summary (having added linux-mm / linux-acpi) the issue is:
> 
> 1) ACPI allows _PXM to be applied to pci devices (including root ports for
>    example, but any device is fine).
> 2) Due to the ordering of when the fw node was set for PCI devices this wasn't
>    taking effect. Easy to solve by just adding the numa node if provided in
>    pci_acpi_setup (which is late enough)
> 3) A patch to fix that was applied to the PCIe tree
>   https://patchwork.kernel.org/patch/10597777/
>    but we got non booting regressions on some threadripper platforms.
>    That turned out to be because they don't have SRAT, but do have PXM entries.
>   (i.e. broken firmware).  Naturally Bjorn reverted this very quickly!

Here's the beginning of the current thread, for anybody coming in
late: https://lore.kernel.org/linux-pci/20181211094737.71554-1-Jonathan.Cameron@huawei.com.

The current patch proposes setting "numa_off=1" in the x86 version of
dummy_numa_init(), on the assumption (from the changelog) that:

  It is invalid under the ACPI spec to specify new NUMA nodes using
  _PXM if they have no presence in SRAT.

Do you have a reference for this?  I looked and couldn't find a clear
statement in the spec to that effect.  The _PXM description (ACPI
v6.2, sec 6.1.14) says that two devices with the same _PXM value are
in the same proximity domain, but it doesn't seem to require an SRAT.

But I guess it doesn't really matter whether it's invalid; that
situation exists in the field, so we have to handle it gracefully.

Martin reported the regression from 3) above and attached useful logs,
which unfortunately aren't in the archives because the mailing list rejects
attachments.  To preserve them, I opened https://bugzilla.kernel.org/show_bug.cgi?id=202443
and attached the logs there.

> I proposed this fix which was to do the same as on Arm and clearly
> mark numa as off when SRAT isn't present on an ACPI system.
> https://elixir.bootlin.com/linux/latest/source/arch/arm64/mm/numa.c#L460
> https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/numa.c#L688

There are several threads we could pull on while untangling this.

We use dummy_numa_init() when we don't have static NUMA info from ACPI
SRAT or DT.  On arm64 (but not x86), it sets numa_off=1 when we don't
have that static info.  I think neither should set numa_off=1 because
we should allow for future information, e.g., from _PXM.

I think acpi_numa_init() is being a little too aggressive when it
returns failure if it finds no SRAT or if it finds an SRAT with no
ACPI_SRAT_TYPE_MEMORY_AFFINITY entries.

Also from your changelog:

  When the PCI code later comes along and calls acpi_get_node() for
  any PCI card below the root port, it navigates up the ACPI tree
  until it finds the _PXM value in the root port. This value is then
  passed to acpi_map_pxm_to_node().

  As numa_off has not been set on x86 it tries to allocate a NUMA
  node, from the unused set, without setting up all the infrastructure
  that would normally accompany such a call.  We have not identified
  exactly which driver is causing the subsequent hang for Martin.

So the problem seems to be that when we get the _PXM value (in the
acpi_get_node() path), there's some infrastructure we don't set up?
I'm not sure what exactly this is -- I see that when we have an SRAT,
acpi_numa_memory_affinity() does a little more, but nothing that
would account for a problem if we call acpi_map_pxm_to_node() without
an SRAT.

Maybe it results in an issue when we call kmalloc_node() using this
_PXM value that SRAT didn't tell us about?  If so, that's reminiscent
of these earlier discussions about kmalloc_node() returning something
useless if the requested node is not online:

  https://lkml.kernel.org/r/1527768879-88161-2-git-send-email-xiexiuqi@huawei.com
  https://lore.kernel.org/linux-arm-kernel/20180801173132.19739-1-punit.agrawal@arm.com/

As far as I know, that was never really resolved.  The immediate
problem of passing an invalid node number to kmalloc_node() was
avoided by using kmalloc() instead.

> Dave's response was that we needed to fix the underlying issue of
> trying to allocate from non existent NUMA nodes.

Oops, sorry for telling you what you obviously already know!  I guess
I didn't internalize this sentence before writing the above.

Bottom line, I totally agree that it would be better to fix the
underlying issue without trying to avoid it by disabling NUMA.

> Whilst I agree with that in principle (having managed to provide
> tables doing exactly that during development a few times!), I'm not
> sure the path to doing so is clear and so this has been stalled for
> a few months.  There is to my mind still a strong argument, even
> with such protection in place, that we should still be short cutting
> it so that you get the same paths if you deliberately disable numa,
> and if you have no SRAT and hence can't have NUMA.

I guess we need to resolve the question of whether NUMA without SRAT
is possible.

> So given I have some 'mild for now' screaming going on, I'd
> definitely appreciate input on how to move forward!
> 
> There are lots of places this could be worked around, e.g. we could
> sanity check in the acpi_get_pxm call.  I'm not sure what side
> effects that would have and also what cases it wouldn't cover.
> 
> Thanks,
> 
> Jonathan
Jonathan Cameron Jan. 29, 2019, 9:51 a.m. UTC | #8
On Mon, 28 Jan 2019 17:13:22 -0600
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Mon, Jan 28, 2019 at 11:31:08AM +0000, Jonathan Cameron wrote:
> > On Thu, 20 Dec 2018 13:57:14 -0600
> > Bjorn Helgaas <helgaas@kernel.org> wrote:  
> > > On Thu, Dec 20, 2018 at 09:13:12AM -0800, Dave Hansen wrote:  
> > > > On 12/20/18 7:12 AM, Bjorn Helgaas wrote:    
> > > > >> Other than the error we might be able to use acpi_map_pxm_to_online_node
> > > > >> for this, or call both acpi_map_pxm_to_node and acpi_map_pxm_to_online_node
> > > > >> and compare the answers to verify we are getting the node we want?    
> > > > > Where are we at with this?  It'd be nice to resolve it for v4.21, but
> > > > > it's a little out of my comfort zone, so I don't want to apply it
> > > > > unless there's clear consensus that this is the right fix.    
> > > > 
> > > > I still think the fix in this patch sweeps the problem under the rug too
> > > > much.  But, it just might be the best single fix for backports, for
> > > > instance.    
> > > 
> > > Sounds like we should first find the best fix, then worry about how to
> > > backport it.  So I think we have a little more noodling to do, and
> > > I'll defer this for now.
> > > 
> > > Bjorn  
> > 
> > Hi All,
> > 
> > I'd definitely appreciate some guidance on what the 'right' fix is.
> > We are starting to get real performance issues reported as a result of not
> > being able to use this patch on mainline.
> > 
> > 5-10% performance drop on some networking benchmarks.  
> 
> I guess the performance drop must be from calling kmalloc_node() with
> the wrong node number because we currently ignore _PXM for the NIC?
> And to get that performance back, you need both the previous patch to
> pay attention to _PXM (https://lore.kernel.org/linux-pci/1527768879-88161-2-git-send-email-xiexiuqi@huawei.com)
> and this patch (to set "numa_off=1" to avoid the regression the _PXM
> patch by itself would cause)?

Exactly.

> 
> > As a brief summary (having added linux-mm / linux-acpi) the issue is:
> > 
> > 1) ACPI allows _PXM to be applied to pci devices (including root ports for
> >    example, but any device is fine).
> > 2) Due to the ordering of when the fw node was set for PCI devices this wasn't
> >    taking effect. Easy to solve by just adding the numa node if provided in
> >    pci_acpi_setup (which is late enough)
> > 3) A patch to fix that was applied to the PCIe tree
> >   https://patchwork.kernel.org/patch/10597777/
> >    but we got non booting regressions on some threadripper platforms.
> >    That turned out to be because they don't have SRAT, but do have PXM entries.
> >   (i.e. broken firmware).  Naturally Bjorn reverted this very quickly!  
> 
> Here's the beginning of the current thread, for anybody coming in
> late: https://lore.kernel.org/linux-pci/20181211094737.71554-1-Jonathan.Cameron@huawei.com).
> 
> The current patch proposes setting "numa_off=1" in the x86 version of
> dummy_numa_init(), on the assumption (from the changelog) that:
> 
>   It is invalid under the ACPI spec to specify new NUMA nodes using
>   _PXM if they have no presence in SRAT.
> 
> Do you have a reference for this?  I looked and couldn't find a clear
> statement in the spec to that effect.  The _PXM description (ACPI
> v6.2, sec 6.1.14) says that two devices with the same _PXM value are
> in the same proximity domain, but it doesn't seem to require an SRAT.

No comment (feel free to guess why). *sigh*

> 
> But I guess it doesn't really matter whether it's invalid; that
> situation exists in the field, so we have to handle it gracefully.
> 
> Martin reported the regression from 3) above and attached useful logs,
> which unfortunately aren't in the archives because the mailing list rejects
> attachments.  To preserve them, I opened https://bugzilla.kernel.org/show_bug.cgi?id=202443
> and attached the logs there.

Cool. Thanks for doing that.

> 
> > I proposed this fix which was to do the same as on Arm and clearly
> > mark numa as off when SRAT isn't present on an ACPI system.
> > https://elixir.bootlin.com/linux/latest/source/arch/arm64/mm/numa.c#L460
> > https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/numa.c#L688  
> 
> There are several threads we could pull on while untangling this.
> 
> We use dummy_numa_init() when we don't have static NUMA info from ACPI
> SRAT or DT.  On arm64 (but not x86), it sets numa_off=1 when we don't
> have that static info.  I think neither should set numa_off=1 because
> we should allow for future information, e.g., from _PXM.
> 
> I think acpi_numa_init() is being a little too aggressive when it
> returns failure if it finds no SRAT or if it finds an SRAT with no
> ACPI_SRAT_TYPE_MEMORY_AFFINITY entries.
> 
> Also from your changelog:
> 
>   When the PCI code later comes along and calls acpi_get_node() for
>   any PCI card below the root port, it navigates up the ACPI tree
>   until it finds the _PXM value in the root port. This value is then
>   passed to acpi_map_pxm_to_node().
> 
>   As numa_off has not been set on x86 it tries to allocate a NUMA
>   node, from the unused set, without setting up all the infrastructure
>   that would normally accompany such a call.  We have not identified
>   exactly which driver is causing the subsequent hang for Martin.
> 
> So the problem seems to be that when we get the _PXM value (in the
> acpi_get_node() path), there's some infrastructure we don't set up?
> I'm not sure what exactly this is -- I see that when we have an SRAT,
> acpi_numa_memory_affinity() does a little more, but nothing that
> would account for a problem if we call acpi_map_pxm_to_node() without
> an SRAT.
> 
> Maybe it results in an issue when we call kmalloc_node() using this
> _PXM value that SRAT didn't tell us about?  If so, that's reminiscent
> of these earlier discussions about kmalloc_node() returning something
> useless if the requested node is not online:
> 
>   https://lkml.kernel.org/r/1527768879-88161-2-git-send-email-xiexiuqi@huawei.com
>   https://lore.kernel.org/linux-arm-kernel/20180801173132.19739-1-punit.agrawal@arm.com/
> 
> As far as I know, that was never really resolved.  The immediate
> problem of using passing an invalid node number to kmalloc_node() was
> avoided by using kmalloc() instead.

Yes, that's definitely still a problem (or was last time I checked)

> 
> > Dave's response was that we needed to fix the underlying issue of
> > trying to allocate from non existent NUMA nodes.  
> 
> Oops, sorry for telling you what you obviously already know!  I guess
> I didn't internalize this sentence before writing the above.

Not to worry, your description was a lot better than mine! Thanks.

> 
> Bottom line, I totally agree that it would be better to fix the
> underlying issue without trying to avoid it by disabling NUMA.

I don't agree on this point.  I think two layers make sense.

If there is no NUMA description in DT or ACPI, why not just stop anything
from using it at all?  The firmware has basically declared there is no
point, so why not save a bit of complexity (and use an existing, tested code
path) by setting numa_off?

However, if there is a NUMA description, but with bugs, then we should
protect in depth.  A simple example being that we declare 2 nodes, but
then use _PXM for a third. I've done that by accident and it blows up
in a nasty fashion (not done it for a while, but probably still true).

Given the DSDT is only parsed long after SRAT, we can just check on _PXM
queries.  Or I suppose we could do a verification parse of all _PXM
entries and put out some warnings if they don't match SRAT entries?
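
Very roughly, the verification-parse idea could look something like this (an
untested sketch; details would need checking):

  /* Untested sketch: warn for any _PXM with no SRAT-defined proximity domain. */
  static acpi_status check_pxm_cb(acpi_handle handle, u32 level,
                                  void *context, void **ret)
  {
          unsigned long long pxm;

          if (ACPI_SUCCESS(acpi_evaluate_integer(handle, "_PXM", NULL, &pxm)) &&
              pxm_to_node((int)pxm) == NUMA_NO_NODE)
                  pr_warn_once("_PXM %llu has no matching SRAT entry\n", pxm);

          return AE_OK;
  }

  /* ...run once after SRAT parsing: acpi_get_devices(NULL, check_pxm_cb, NULL, NULL); */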

> 
> > Whilst I agree with that in principle (having managed to provide
> > tables doing exactly that during development a few times!), I'm not
> > sure the path to doing so is clear and so this has been stalled for
> > a few months.  There is to my mind still a strong argument, even
> > with such protection in place, that we should still be short cutting
> > it so that you get the same paths if you deliberately disable numa,
> > and if you have no SRAT and hence can't have NUMA.  
> 
> I guess we need to resolve the question of whether NUMA without SRAT
> is possible.

It's certainly unclear whether it has any meaning.  If we allow for
the fact that the intent of ACPI was never to allow this (and a bit
of history checking verified this as best as anyone can remember),
then what do we do with the few platforms that do use _PXM to refer to
nodes that haven't been defined?

Note we have never actually supported them, as we weren't using the
values provided, so there is no regression if we simply rule them
invalid.  It's also unclear whether this was ever intentional for
these platforms, rather than something that got through compliance tests
because no one was using it.

Thanks for your detailed insight and help!

Jonathan

> 
> > So given I have some 'mild for now' screaming going on, I'd
> > definitely appreciate input on how to move forward!
> > 
> > There are lots of places this could be worked around, e.g. we could
> > sanity check in the acpi_get_pxm call.  I'm not sure what side
> > effects that would have and also what cases it wouldn't cover.
> > 
> > Thanks,
> > 
> > Jonathan
Bjorn Helgaas Jan. 29, 2019, 7:05 p.m. UTC | #9
On Tue, Jan 29, 2019 at 09:51:05AM +0000, Jonathan Cameron wrote:
> On Mon, 28 Jan 2019 17:13:22 -0600
> Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Mon, Jan 28, 2019 at 11:31:08AM +0000, Jonathan Cameron wrote:
> > > On Thu, 20 Dec 2018 13:57:14 -0600
> > > Bjorn Helgaas <helgaas@kernel.org> wrote:  
> > > > On Thu, Dec 20, 2018 at 09:13:12AM -0800, Dave Hansen wrote:  
> > > > > On 12/20/18 7:12 AM, Bjorn Helgaas wrote:    

> > The current patch proposes setting "numa_off=1" in the x86 version of
> > dummy_numa_init(), on the assumption (from the changelog) that:
> > 
> >   It is invalid under the ACPI spec to specify new NUMA nodes using
> >   _PXM if they have no presence in SRAT.
> > 
> > Do you have a reference for this?  I looked and couldn't find a clear
> > statement in the spec to that effect.  The _PXM description (ACPI
> > v6.2, sec 6.1.14) says that two devices with the same _PXM value are
> > in the same proximity domain, but it doesn't seem to require an SRAT.
> 
> No comment (feel free to guess why). *sigh*

Secret interpretations of the spec are out of bounds.  But I think
it's a waste of time to argue about whether _PXM without SRAT is
valid.  Systems like that exist, and I think it's possible to do
something sensible with them.

> > Maybe it results in an issue when we call kmalloc_node() using this
> > _PXM value that SRAT didn't tell us about?  If so, that's reminiscent
> > of these earlier discussions about kmalloc_node() returning something
> > useless if the requested node is not online:
> > 
> >   https://lkml.kernel.org/r/1527768879-88161-2-git-send-email-xiexiuqi@huawei.com
> >   https://lore.kernel.org/linux-arm-kernel/20180801173132.19739-1-punit.agrawal@arm.com/
> > 
> > As far as I know, that was never really resolved.  The immediate
> > problem of using passing an invalid node number to kmalloc_node() was
> > avoided by using kmalloc() instead.
> 
> Yes, that's definitely still a problem (or was last time I checked)
> 
> > > Dave's response was that we needed to fix the underlying issue of
> > > trying to allocate from non existent NUMA nodes.  

> > Bottom line, I totally agree that it would be better to fix the
> > underlying issue without trying to avoid it by disabling NUMA.
> 
> I don't agree on this point.  I think two layers make sense.
> 
> If there is no NUMA description in DT or ACPI, why not just stop anything
> from using it at all?  The firmware has basically declared there is no
> point, why not save a bit of complexity (and use an existing tested code
> path) but setting numa_off?

Firmware with a _PXM does have a NUMA description.

> However, if there is NUMA description, but with bugs then we should
> protect in depth.  A simple example being that we declare 2 nodes, but
> then use _PXM for a third. I've done that by accident and blows up
> in a nasty fashion (not done it for a while, but probably still true).
> 
> Given DSDT is only parsed long after SRAT we can just check on _PXM
> queries.  Or I suppose we could do a verification parse for all _PXM
> entries and put out some warnings if they don't match SRAT entries?

I'm assuming the crash happens when we call kmalloc_node() with a node
not mentioned in SRAT.  I think that's just sub-optimal implementation
in kmalloc_node().

We *could* fail the allocation and return a NULL pointer, but I think
even that is excessive.  I think we should simply fall back to
kmalloc().  We could print a one-time warning if that's useful.

If kmalloc_node() for an unknown node fell back to kmalloc(), would
anything else be required?
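
Purely as an illustration of the idea (not a real patch):

  /* Illustration only: fall back to an unconstrained allocation for a node
   * that was never set up, rather than tripping over it.
   */
  static inline void *kmalloc_node_fallback(size_t size, gfp_t flags, int node)
  {
          if (node != NUMA_NO_NODE && !node_online(node)) {
                  pr_warn_once("kmalloc_node: unknown node %d, using any node\n",
                               node);
                  node = NUMA_NO_NODE;
          }

          return kmalloc_node(size, flags, node);
  }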

> > > Whilst I agree with that in principle (having managed to provide
> > > tables doing exactly that during development a few times!), I'm not
> > > sure the path to doing so is clear and so this has been stalled for
> > > a few months.  There is to my mind still a strong argument, even
> > > with such protection in place, that we should still be short cutting
> > > it so that you get the same paths if you deliberately disable numa,
> > > and if you have no SRAT and hence can't have NUMA.  
> > 
> > I guess we need to resolve the question of whether NUMA without SRAT
> > is possible.
> 
> It's certainly unclear of whether it has any meaning.  If we allow for
> the fact that the intent of ACPI was never to allow this (and a bit
> of history checking verified this as best as anyone can remember),
> then what do we do with the few platforms that do use _PXM to nodes that
> haven't been defined?

We *could* ignore any _PXM that mentions a proximity domain not
mentioned by an SRAT.  That seems a little heavy-handed because it
means every possible proximity domain must be described up front in
the SRAT, which limits the flexibility of hot-adding entire nodes
(CPU/memory/IO).

But I think it's possible to make sense of a _PXM that adds a
proximity domain not mentioned in an SRAT, e.g., if a new memory
device and a new I/O device supply the same _PXM value, we can assume
they're close together.  If a new I/O device has a previously unknown
_PXM, we may not be able to allocate memory near it, but we should at
least be able to allocate from a default zone.

Bjorn
Jonathan Cameron Jan. 29, 2019, 7:45 p.m. UTC | #10
On Tue, 29 Jan 2019 13:05:56 -0600
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Tue, Jan 29, 2019 at 09:51:05AM +0000, Jonathan Cameron wrote:
> > On Mon, 28 Jan 2019 17:13:22 -0600
> > Bjorn Helgaas <helgaas@kernel.org> wrote:  
> > > On Mon, Jan 28, 2019 at 11:31:08AM +0000, Jonathan Cameron wrote:  
> > > > On Thu, 20 Dec 2018 13:57:14 -0600
> > > > Bjorn Helgaas <helgaas@kernel.org> wrote:    
> > > > > On Thu, Dec 20, 2018 at 09:13:12AM -0800, Dave Hansen wrote:    
> > > > > > On 12/20/18 7:12 AM, Bjorn Helgaas wrote:      
> 
> > > The current patch proposes setting "numa_off=1" in the x86 version of
> > > dummy_numa_init(), on the assumption (from the changelog) that:
> > > 
> > >   It is invalid under the ACPI spec to specify new NUMA nodes using
> > >   _PXM if they have no presence in SRAT.
> > > 
> > > Do you have a reference for this?  I looked and couldn't find a clear
> > > statement in the spec to that effect.  The _PXM description (ACPI
> > > v6.2, sec 6.1.14) says that two devices with the same _PXM value are
> > > in the same proximity domain, but it doesn't seem to require an SRAT.  
> > 
> > No comment (feel free to guess why). *sigh*  
> 
> Secret interpretations of the spec are out of bounds.  But I think
> it's a waste of time to argue about whether _PXM without SRAT is
> valid.  Systems like that exist, and I think it's possible to do
> something sensible with them.
> 
> > > Maybe it results in an issue when we call kmalloc_node() using this
> > > _PXM value that SRAT didn't tell us about?  If so, that's reminiscent
> > > of these earlier discussions about kmalloc_node() returning something
> > > useless if the requested node is not online:
> > > 
> > >   https://lkml.kernel.org/r/1527768879-88161-2-git-send-email-xiexiuqi@huawei.com
> > >   https://lore.kernel.org/linux-arm-kernel/20180801173132.19739-1-punit.agrawal@arm.com/
> > > 
> > > As far as I know, that was never really resolved.  The immediate
> > > problem of using passing an invalid node number to kmalloc_node() was
> > > avoided by using kmalloc() instead.  
> > 
> > Yes, that's definitely still a problem (or was last time I checked)
> >   
> > > > Dave's response was that we needed to fix the underlying issue of
> > > > trying to allocate from non existent NUMA nodes.    
> 
> > > Bottom line, I totally agree that it would be better to fix the
> > > underlying issue without trying to avoid it by disabling NUMA.  
> > 
> > I don't agree on this point.  I think two layers make sense.
> > 
> > If there is no NUMA description in DT or ACPI, why not just stop anything
> > from using it at all?  The firmware has basically declared there is no
> > point, why not save a bit of complexity (and use an existing tested code
> > path) but setting numa_off?  
> 
> Firmware with a _PXM does have a NUMA description.

Most of the meaning is lost.  It applies some grouping but gives no info
on the relative distance between that group and anywhere else.
So perhaps 'some' description.

> 
> > However, if there is NUMA description, but with bugs then we should
> > protect in depth.  A simple example being that we declare 2 nodes, but
> > then use _PXM for a third. I've done that by accident and blows up
> > in a nasty fashion (not done it for a while, but probably still true).
> > 
> > Given DSDT is only parsed long after SRAT we can just check on _PXM
> > queries.  Or I suppose we could do a verification parse for all _PXM
> > entries and put out some warnings if they don't match SRAT entries?  
> 
> I'm assuming the crash happens when we call kmalloc_node() with a node
> not mentioned in SRAT.  I think that's just sub-optimal implementation
> in kmalloc_node().
> 
> We *could* fail the allocation and return a NULL pointer, but I think
> even that is excessive.  I think we should simply fall back to
> kmalloc().  We could print a one-time warning if that's useful.
> 
> If kmalloc_node() for an unknown node fell back to kmalloc(), would
> anything else be required?

It will deal with that case, but it may not be the only one.
I think there are interrupt-related issues as well, but I will have to check.

> 
> > > > Whilst I agree with that in principle (having managed to provide
> > > > tables doing exactly that during development a few times!), I'm not
> > > > sure the path to doing so is clear and so this has been stalled for
> > > > a few months.  There is to my mind still a strong argument, even
> > > > with such protection in place, that we should still be short cutting
> > > > it so that you get the same paths if you deliberately disable numa,
> > > > and if you have no SRAT and hence can't have NUMA.    
> > > 
> > > I guess we need to resolve the question of whether NUMA without SRAT
> > > is possible.  
> > 
> > It's certainly unclear of whether it has any meaning.  If we allow for
> > the fact that the intent of ACPI was never to allow this (and a bit
> > of history checking verified this as best as anyone can remember),
> > then what do we do with the few platforms that do use _PXM to nodes that
> > haven't been defined?  
> 
> We *could* ignore any _PXM that mentions a proximity domain not
> mentioned by an SRAT.  That seems a little heavy-handed because it
> means every possible proximity domain must be described up front in
> the SRAT, which limits the flexibility of hot-adding entire nodes
> (CPU/memory/IO).
> 
> But I think it's possible to make sense of a _PXM that adds a
> proximity domain not mentioned in an SRAT, e.g., if a new memory
> device and a new I/O device supply the same _PXM value, we can assume
> they're close together.  If a new I/O device has a previously unknown
> _PXM, we may not be able to allocate memory near it, but we should at
> least be able to allocate from a default zone.

I would like to know if this is real before we support it though.
We have a known platform that does it.  As I understand it, that platform
might as well not bother, as it doesn't have memory in those nodes.

I'll be honest though, I'm happy with fixing it the hard way and
dropping the numa_off = 1 for arm if that is the consensus.

Jonathan

> 
> Bjorn
Bjorn Helgaas Jan. 29, 2019, 9:10 p.m. UTC | #11
On Tue, Jan 29, 2019 at 07:45:34PM +0000, Jonathan Cameron wrote:
> On Tue, 29 Jan 2019 13:05:56 -0600
> Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Tue, Jan 29, 2019 at 09:51:05AM +0000, Jonathan Cameron wrote:

> > > However, if there is NUMA description, but with bugs then we should
> > > protect in depth.  A simple example being that we declare 2 nodes, but
> > > then use _PXM for a third. I've done that by accident and blows up
> > > in a nasty fashion (not done it for a while, but probably still true).
> > > 
> > > Given DSDT is only parsed long after SRAT we can just check on _PXM
> > > queries.  Or I suppose we could do a verification parse for all _PXM
> > > entries and put out some warnings if they don't match SRAT entries?  
> > 
> > I'm assuming the crash happens when we call kmalloc_node() with a node
> > not mentioned in SRAT.  I think that's just sub-optimal implementation
> > in kmalloc_node().
> > 
> > We *could* fail the allocation and return a NULL pointer, but I think
> > even that is excessive.  I think we should simply fall back to
> > kmalloc().  We could print a one-time warning if that's useful.
> > 
> > If kmalloc_node() for an unknown node fell back to kmalloc(), would
> > anything else be required?
> 
> It will deal with that case, but it may not be the only one.  I
> think there are interrupt related issues as well, but will have to
> check.

Sounds like a valid concern.  Also, kmalloc() in general looks like a
performance path, so maybe it would be better to address this on the
other end, i.e., by ensuring that dev->numa_node always contains
something valid for kmalloc(), interrupts, etc.

Maybe set_dev_node() could be made smarter along that line?
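
Something along these lines, purely as a hypothetical sketch:

  /* Hypothetical sketch: refuse to record a node the platform never described. */
  static inline void set_dev_node_checked(struct device *dev, int node)
  {
          if (node != NUMA_NO_NODE && !node_possible(node)) {
                  dev_warn(dev, "ignoring invalid NUMA node %d\n", node);
                  node = NUMA_NO_NODE;
          }

          set_dev_node(dev, node);
  }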

Bjorn
Jonathan Cameron Feb. 7, 2019, 10:12 a.m. UTC | #12
On Tue, 29 Jan 2019 13:05:56 -0600
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Tue, Jan 29, 2019 at 09:51:05AM +0000, Jonathan Cameron wrote:
> > On Mon, 28 Jan 2019 17:13:22 -0600
> > Bjorn Helgaas <helgaas@kernel.org> wrote:  
> > > On Mon, Jan 28, 2019 at 11:31:08AM +0000, Jonathan Cameron wrote:  
> > > > On Thu, 20 Dec 2018 13:57:14 -0600
> > > > Bjorn Helgaas <helgaas@kernel.org> wrote:    
> > > > > On Thu, Dec 20, 2018 at 09:13:12AM -0800, Dave Hansen wrote:    
> > > > > > On 12/20/18 7:12 AM, Bjorn Helgaas wrote:      
> 
> > > The current patch proposes setting "numa_off=1" in the x86 version of
> > > dummy_numa_init(), on the assumption (from the changelog) that:
> > > 
> > >   It is invalid under the ACPI spec to specify new NUMA nodes using
> > >   _PXM if they have no presence in SRAT.
> > > 
> > > Do you have a reference for this?  I looked and couldn't find a clear
> > > statement in the spec to that effect.  The _PXM description (ACPI
> > > v6.2, sec 6.1.14) says that two devices with the same _PXM value are
> > > in the same proximity domain, but it doesn't seem to require an SRAT.  
> > 
> > No comment (feel free to guess why). *sigh*  
> 
> Secret interpretations of the spec are out of bounds.  But I think
> it's a waste of time to argue about whether _PXM without SRAT is
> valid.  Systems like that exist, and I think it's possible to do
> something sensible with them.

Now less secret :)

https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf

Specifically 
6.2B Errata 1951 _PXM Clarifications

Adds lots of statements including:

(5.2.16)
Note: SRAT is the place where proximity domains are defined, and _PXM provides
a mechanism to associate a device object (and its children) to an SRAT-defined
proximity domain. 

6.2.14 _PXM (Proximity)
This optional object is used to describe proximity domain associations within a
machine. _PXM evaluates to an integer that identifies a device as belonging
to a Proximity Domain defined in the System Resource Affinity Table (SRAT).

Obviously this doesn't necessarily change the fact there 'might' be
a platform out there with tables written against earlier ACPI specs that
does 'deliberately' provide _PXM entries that don't match entries in SRAT.
What it does mean is that going forwards we "shouldn't" see any new ones.

Note that the use case that was conjectured below is now accounted for with
the new Generic Initiator Domains (5.2.16.6).  There is some juggling done via
an _OSC bit to ensure that firmware can 'adjust' its _PXM entries to account
for whether or not these Generic Initiator domains are supported by the OS.
I'll clean up my patches for that and post soon (if no one beats me to it!)

One thing I will note, though, is that I'm not going to propose we drop the
numa_off = true line in the arm code, given there aren't any arm platforms
known to have _PXM values not matching entries in SRAT, and we now have a
spec that says it isn't right to do it anyway.

Jonathan

> 
> > > Maybe it results in an issue when we call kmalloc_node() using this
> > > _PXM value that SRAT didn't tell us about?  If so, that's reminiscent
> > > of these earlier discussions about kmalloc_node() returning something
> > > useless if the requested node is not online:
> > > 
> > >   https://lkml.kernel.org/r/1527768879-88161-2-git-send-email-xiexiuqi@huawei.com
> > >   https://lore.kernel.org/linux-arm-kernel/20180801173132.19739-1-punit.agrawal@arm.com/
> > > 
> > > As far as I know, that was never really resolved.  The immediate
> > > problem of using passing an invalid node number to kmalloc_node() was
> > > avoided by using kmalloc() instead.  
> > 
> > Yes, that's definitely still a problem (or was last time I checked)
> >   
> > > > Dave's response was that we needed to fix the underlying issue of
> > > > trying to allocate from non existent NUMA nodes.    
> 
> > > Bottom line, I totally agree that it would be better to fix the
> > > underlying issue without trying to avoid it by disabling NUMA.  
> > 
> > I don't agree on this point.  I think two layers make sense.
> > 
> > If there is no NUMA description in DT or ACPI, why not just stop anything
> > from using it at all?  The firmware has basically declared there is no
> > point, why not save a bit of complexity (and use an existing tested code
> > path) but setting numa_off?  
> 
> Firmware with a _PXM does have a NUMA description.
> 
> > However, if there is NUMA description, but with bugs then we should
> > protect in depth.  A simple example being that we declare 2 nodes, but
> > then use _PXM for a third. I've done that by accident and blows up
> > in a nasty fashion (not done it for a while, but probably still true).
> > 
> > Given DSDT is only parsed long after SRAT we can just check on _PXM
> > queries.  Or I suppose we could do a verification parse for all _PXM
> > entries and put out some warnings if they don't match SRAT entries?  
> 
> I'm assuming the crash happens when we call kmalloc_node() with a node
> not mentioned in SRAT.  I think that's just sub-optimal implementation
> in kmalloc_node().
> 
> We *could* fail the allocation and return a NULL pointer, but I think
> even that is excessive.  I think we should simply fall back to
> kmalloc().  We could print a one-time warning if that's useful.
> 
> If kmalloc_node() for an unknown node fell back to kmalloc(), would
> anything else be required?
> 
> > > > Whilst I agree with that in principle (having managed to provide
> > > > tables doing exactly that during development a few times!), I'm not
> > > > sure the path to doing so is clear and so this has been stalled for
> > > > a few months.  There is to my mind still a strong argument, even
> > > > with such protection in place, that we should still be short cutting
> > > > it so that you get the same paths if you deliberately disable numa,
> > > > and if you have no SRAT and hence can't have NUMA.    
> > > 
> > > I guess we need to resolve the question of whether NUMA without SRAT
> > > is possible.  
> > 
> > It's certainly unclear of whether it has any meaning.  If we allow for
> > the fact that the intent of ACPI was never to allow this (and a bit
> > of history checking verified this as best as anyone can remember),
> > then what do we do with the few platforms that do use _PXM to nodes that
> > haven't been defined?  
> 
> We *could* ignore any _PXM that mentions a proximity domain not
> mentioned by an SRAT.  That seems a little heavy-handed because it
> means every possible proximity domain must be described up front in
> the SRAT, which limits the flexibility of hot-adding entire nodes
> (CPU/memory/IO).
> 
> But I think it's possible to make sense of a _PXM that adds a
> proximity domain not mentioned in an SRAT, e.g., if a new memory
> device and a new I/O device supply the same _PXM value, we can assume
> they're close together.  If a new I/O device has a previously unknown
> _PXM, we may not be able to allocate memory near it, but we should at
> least be able to allocate from a default zone.
> 
> Bjorn

Patch

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1308f5408bf7..ce1182f953ff 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -695,6 +695,8 @@  static int __init dummy_numa_init(void)
 	node_set(0, numa_nodes_parsed);
 	numa_add_memblk(0, 0, PFN_PHYS(max_pfn));
 
+	numa_off = true;
+
 	return 0;
 }