Message ID | 20181211094737.71554-1-Jonathan.Cameron@huawei.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Series | [V2] x86: Fix an issue with invalid ACPI NUMA config |
On 12/11/18 1:47 AM, Jonathan Cameron wrote:
> When the PCI code later comes along and calls acpi_get_node() for any PCI
> card below the root port, it navigates up the ACPI tree until it finds the
> _PXM value in the root port. This value is then passed to
> acpi_map_pxm_to_node().
>
> As numa_off has not been set on x86 it tries to allocate a NUMA node, from
> the unused set, without setting up all the infrastructure that would
> normally accompany such a call.

FWIW, this _sounds_ like the real problem here. We're allowing an
allocation to proceed without some infrastructure that we require.
Shouldn't we be detecting that this infrastructure is not in place and
warn about *it* at least?

I'm a bit worried that this is just papering over an unknown error to
make a hang go away. It seems a bit too far away from the root cause.
On Tue, 11 Dec 2018 10:19:49 -0800 Dave Hansen <dave.hansen@intel.com> wrote:
> On 12/11/18 1:47 AM, Jonathan Cameron wrote:
> > When the PCI code later comes along and calls acpi_get_node() for any PCI
> > card below the root port, it navigates up the ACPI tree until it finds the
> > _PXM value in the root port. This value is then passed to
> > acpi_map_pxm_to_node().
> >
> > As numa_off has not been set on x86 it tries to allocate a NUMA node, from
> > the unused set, without setting up all the infrastructure that would
> > normally accompany such a call.
>
> FWIW, this _sounds_ like the real problem here. We're allowing an
> allocation to proceed without some infrastructure that we require.
> Shouldn't we be detecting that this infrastructure is not in place and
> warn about *it* at least?
>
> I'm a bit worried that this is just papering over an unknown error to
> make a hang go away. It seems a bit too far away from the root cause.

I'm not totally convinced. We are warning about it on the two lines just
off the top of this patch.

"No NUMA configuration found"
"Faking a node at [mem....]"

We are falling back to the exact same code paths as if you had deliberately
turned off NUMA at the command line, with messages stating that is the case.
That approach seems to be safe and is consistent.

Now there is a potential corner case here where I agree with you that it may
make sense to 'also' add protections in the acpi_map_pxm_to_node() path:
we do have a valid NUMA configuration, and along comes a new device with a
node outside of those that are defined. (Note there is a change coming in
the next ACPI spec precisely to work around a case that causes this to
validly happen when the OS sees some new features and doesn't know what to
do with them. It still relies on the ACPI tables having the right magic in
them for the fallback to work. More on that when the spec is out...)
One option would be to (in addition to this patch) add a new version of
acpi_get_node() that will only give you a node that actually exists,
and an error otherwise, allowing code to fall back to NO_NODE.

Other than the error we might be able to use acpi_map_pxm_to_online_node()
for this, or call both acpi_map_pxm_to_node() and acpi_map_pxm_to_online_node()
and compare the answers to verify we are getting the node we want?

Jonathan
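
The checked lookup suggested above can be sketched in plain userspace C. This is only a toy model under stated assumptions: every name below (`sketch_node_online`, `sketch_pxm_to_node`, `sketch_get_valid_node`, the `SKETCH_*` constants) is hypothetical and none of it is the kernel's actual code or API. It just shows the shape of "return the node only if it really exists, otherwise report no node".

```c
#include <stdbool.h>

#define SKETCH_MAX_NODES 8      /* illustrative limit, not a kernel constant */
#define SKETCH_NO_NODE   (-1)   /* stands in for the "no node" value */

/* Hypothetical model state: nodes that the firmware tables brought online. */
bool sketch_node_online[SKETCH_MAX_NODES];

/* Hypothetical raw lookup: maps a proximity domain to a node id without
 * checking whether that node was ever described or brought online. */
int sketch_pxm_to_node(int pxm)
{
    if (pxm < 0 || pxm >= SKETCH_MAX_NODES)
        return SKETCH_NO_NODE;
    return pxm;                 /* identity mapping keeps the sketch small */
}

/* Checked lookup: only hand back a node that is actually online. */
int sketch_get_valid_node(int pxm)
{
    int node = sketch_pxm_to_node(pxm);

    if (node == SKETCH_NO_NODE || !sketch_node_online[node])
        return SKETCH_NO_NODE;
    return node;
}
```

With only node 0 marked online, a device whose _PXM names domain 3 would get `SKETCH_NO_NODE` back from the checked variant, and the caller could then fall back to node-agnostic behaviour.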
On Wed, Dec 12, 2018 at 09:39:14AM +0000, Jonathan Cameron wrote:
> One option would be to (in addition to this patch) add a new version of
> acpi_get_node that will only give you a node that actually exists
> and an error otherwise allowing code to fall back to NO_NODE.
>
> Other than the error we might be able to use acpi_map_pxm_to_online_node
> for this, or call both acpi_map_pxm_to_node and acpi_map_pxm_to_online_node
> and compare the answers to verify we are getting the node we want?

Where are we at with this? It'd be nice to resolve it for v4.21, but
it's a little out of my comfort zone, so I don't want to apply it
unless there's clear consensus that this is the right fix.

Bjorn
On 12/20/18 7:12 AM, Bjorn Helgaas wrote:
>> Other than the error we might be able to use acpi_map_pxm_to_online_node
>> for this, or call both acpi_map_pxm_to_node and acpi_map_pxm_to_online_node
>> and compare the answers to verify we are getting the node we want?
> Where are we at with this? It'd be nice to resolve it for v4.21, but
> it's a little out of my comfort zone, so I don't want to apply it
> unless there's clear consensus that this is the right fix.

I still think the fix in this patch sweeps the problem under the rug too
much. But, it just might be the best single fix for backports, for
instance.
On Thu, Dec 20, 2018 at 09:13:12AM -0800, Dave Hansen wrote:
> On 12/20/18 7:12 AM, Bjorn Helgaas wrote:
> >> Other than the error we might be able to use acpi_map_pxm_to_online_node
> >> for this, or call both acpi_map_pxm_to_node and acpi_map_pxm_to_online_node
> >> and compare the answers to verify we are getting the node we want?
> > Where are we at with this? It'd be nice to resolve it for v4.21, but
> > it's a little out of my comfort zone, so I don't want to apply it
> > unless there's clear consensus that this is the right fix.
>
> I still think the fix in this patch sweeps the problem under the rug too
> much. But, it just might be the best single fix for backports, for
> instance.

Sounds like we should first find the best fix, then worry about how to
backport it. So I think we have a little more noodling to do, and
I'll defer this for now.

Bjorn
On Thu, 20 Dec 2018 13:57:14 -0600 Bjorn Helgaas <helgaas@kernel.org> wrote:
> On Thu, Dec 20, 2018 at 09:13:12AM -0800, Dave Hansen wrote:
> > On 12/20/18 7:12 AM, Bjorn Helgaas wrote:
> > >> Other than the error we might be able to use acpi_map_pxm_to_online_node
> > >> for this, or call both acpi_map_pxm_to_node and acpi_map_pxm_to_online_node
> > >> and compare the answers to verify we are getting the node we want?
> > > Where are we at with this? It'd be nice to resolve it for v4.21, but
> > > it's a little out of my comfort zone, so I don't want to apply it
> > > unless there's clear consensus that this is the right fix.
> >
> > I still think the fix in this patch sweeps the problem under the rug too
> > much. But, it just might be the best single fix for backports, for
> > instance.
>
> Sounds like we should first find the best fix, then worry about how to
> backport it. So I think we have a little more noodling to do, and
> I'll defer this for now.
>
> Bjorn

Hi All,

I'd definitely appreciate some guidance on what the 'right' fix is.
We are starting to get real performance issues reported as a result of not
being able to use this patch on mainline.

5-10% performance drop on some networking benchmarks.

As a brief summary (having added linux-mm / linux-acpi) the issue is:

1) ACPI allows _PXM to be applied to PCI devices (including root ports for
   example, but any device is fine).
2) Due to the ordering of when the fw node was set for PCI devices this wasn't
   taking effect. Easy to solve by just adding the NUMA node if provided in
   pci_acpi_setup (which is late enough).
3) A patch to fix that was applied to the PCIe tree
   https://patchwork.kernel.org/patch/10597777/
   but we got non-booting regressions on some Threadripper platforms.
   That turned out to be because they don't have SRAT, but do have PXM entries
   (i.e. broken firmware). Naturally Bjorn reverted this very quickly!

I proposed this fix, which was to do the same as on Arm and clearly
mark NUMA as off when SRAT isn't present on an ACPI system.
https://elixir.bootlin.com/linux/latest/source/arch/arm64/mm/numa.c#L460
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/numa.c#L688

Dave's response was that we needed to fix the underlying issue of
trying to allocate from non-existent NUMA nodes.

Whilst I agree with that in principle (having managed to provide
tables doing exactly that during development a few times!), I'm not
sure the path to doing so is clear and so this has been stalled for
a few months. There is to my mind still a strong argument, even
with such protection in place, that we should still be short-cutting
it so that you get the same paths if you deliberately disable NUMA,
and if you have no SRAT and hence can't have NUMA.

So given I have some 'mild for now' screaming going on, I'd
definitely appreciate input on how to move forward!

There are lots of places this could be worked around, e.g. we could
sanity check in the acpi_get_pxm call. I'm not sure what side
effects that would have and also what cases it wouldn't cover.

Thanks,

Jonathan
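
To make the proposed short cut concrete, here is a small userspace model of the two behaviours being compared. It is a sketch under loud assumptions: all `model_*` names and constants are invented for illustration, the real `dummy_numa_init()` and `acpi_map_pxm_to_node()` do considerably more, and the only point shown is that a "NUMA off" flag makes the lookup refuse to create nodes that the tables never described.

```c
#include <stdbool.h>

#define MODEL_MAX_PXM 16        /* illustrative limit, not a kernel constant */
#define MODEL_NO_NODE (-1)

bool model_numa_off;                 /* models the numa_off flag */
int  model_pxm_map[MODEL_MAX_PXM];   /* pxm -> node, MODEL_NO_NODE if unmapped */
int  model_next_unused_node;         /* next node id that is still unused */

/* Models the fallback taken when no NUMA description (no SRAT) was found:
 * a single faked node 0, optionally with NUMA marked as off. */
void model_dummy_numa_init(bool mark_numa_off)
{
    for (int i = 0; i < MODEL_MAX_PXM; i++)
        model_pxm_map[i] = MODEL_NO_NODE;
    model_next_unused_node = 1;      /* node 0 is the single faked node */
    model_numa_off = mark_numa_off;
}

/* Models a pxm lookup that allocates an unused node on first sight,
 * unless NUMA has been marked off. */
int model_map_pxm_to_node(int pxm)
{
    if (pxm < 0 || pxm >= MODEL_MAX_PXM || model_numa_off)
        return MODEL_NO_NODE;
    if (model_pxm_map[pxm] == MODEL_NO_NODE)
        model_pxm_map[pxm] = model_next_unused_node++;
    return model_pxm_map[pxm];
}
```

In this model, calling `model_dummy_numa_init(false)` and then looking up a _PXM of 1 hands out a fresh node that nothing else set up, which is the situation described for x86. Calling `model_dummy_numa_init(true)` makes the same lookup return `MODEL_NO_NODE`, matching what happens when NUMA is disabled deliberately.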
On Mon, Jan 28, 2019 at 11:31:08AM +0000, Jonathan Cameron wrote:
> I'd definitely appreciate some guidance on what the 'right' fix is.
> We are starting to get real performance issues reported as a result of not
> being able to use this patch on mainline.
>
> 5-10% performance drop on some networking benchmarks.

I guess the performance drop must be from calling kmalloc_node() with
the wrong node number because we currently ignore _PXM for the NIC?
And to get that performance back, you need both the previous patch to
pay attention to _PXM (https://lore.kernel.org/linux-pci/1527768879-88161-2-git-send-email-xiexiuqi@huawei.com)
and this patch (to set "numa_off=1" to avoid the regression the _PXM
patch by itself would cause)?

> As a brief summary (having added linux-mm / linux-acpi) the issue is:
>
> 1) ACPI allows _PXM to be applied to pci devices (including root ports for
>    example, but any device is fine).
> 2) Due to the ordering of when the fw node was set for PCI devices this wasn't
>    taking effect. Easy to solve by just adding the numa node if provided in
>    pci_acpi_setup (which is late enough)
> 3) A patch to fix that was applied to the PCIe tree
>    https://patchwork.kernel.org/patch/10597777/
>    but we got non booting regressions on some threadripper platforms.
>    That turned out to be because they don't have SRAT, but do have PXM entries.
>    (i.e. broken firmware). Naturally Bjorn reverted this very quickly!

Here's the beginning of the current thread, for anybody coming in
late: https://lore.kernel.org/linux-pci/20181211094737.71554-1-Jonathan.Cameron@huawei.com

The current patch proposes setting "numa_off=1" in the x86 version of
dummy_numa_init(), on the assumption (from the changelog) that:

  It is invalid under the ACPI spec to specify new NUMA nodes using
  _PXM if they have no presence in SRAT.

Do you have a reference for this? I looked and couldn't find a clear
statement in the spec to that effect. The _PXM description (ACPI
v6.2, sec 6.1.14) says that two devices with the same _PXM value are
in the same proximity domain, but it doesn't seem to require an SRAT.

But I guess it doesn't really matter whether it's invalid; that
situation exists in the field, so we have to handle it gracefully.

Martin reported the regression from 3) above and attached useful logs,
which unfortunately aren't in the archives because the mailing list rejects
attachments. To preserve them, I opened https://bugzilla.kernel.org/show_bug.cgi?id=202443
and attached the logs there.

> I proposed this fix which was to do the same as on Arm and clearly
> mark numa as off when SRAT isn't present on an ACPI system.
> https://elixir.bootlin.com/linux/latest/source/arch/arm64/mm/numa.c#L460
> https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/numa.c#L688

There are several threads we could pull on while untangling this.

We use dummy_numa_init() when we don't have static NUMA info from ACPI
SRAT or DT. On arm64 (but not x86), it sets numa_off=1 when we don't
have that static info. I think neither should set numa_off=1 because
we should allow for future information, e.g., from _PXM.

I think acpi_numa_init() is being a little too aggressive when it
returns failure if it finds no SRAT or if it finds an SRAT with no
ACPI_SRAT_TYPE_MEMORY_AFFINITY entries.

Also from your changelog:

  When the PCI code later comes along and calls acpi_get_node() for
  any PCI card below the root port, it navigates up the ACPI tree
  until it finds the _PXM value in the root port. This value is then
  passed to acpi_map_pxm_to_node().

  As numa_off has not been set on x86 it tries to allocate a NUMA
  node, from the unused set, without setting up all the infrastructure
  that would normally accompany such a call. We have not identified
  exactly which driver is causing the subsequent hang for Martin.

So the problem seems to be that when we get the _PXM value (in the
acpi_get_node() path), there's some infrastructure we don't set up?
I'm not sure what exactly this is -- I see that when we have an SRAT,
acpi_numa_memory_affinity() does a little more, but nothing that
would account for a problem if we call acpi_map_pxm_to_node() without
an SRAT.

Maybe it results in an issue when we call kmalloc_node() using this
_PXM value that SRAT didn't tell us about? If so, that's reminiscent
of these earlier discussions about kmalloc_node() returning something
useless if the requested node is not online:

  https://lkml.kernel.org/r/1527768879-88161-2-git-send-email-xiexiuqi@huawei.com
  https://lore.kernel.org/linux-arm-kernel/20180801173132.19739-1-punit.agrawal@arm.com/

As far as I know, that was never really resolved. The immediate
problem of passing an invalid node number to kmalloc_node() was
avoided by using kmalloc() instead.

> Dave's response was that we needed to fix the underlying issue of
> trying to allocate from non existent NUMA nodes.

Oops, sorry for telling you what you obviously already know! I guess
I didn't internalize this sentence before writing the above.

Bottom line, I totally agree that it would be better to fix the
underlying issue without trying to avoid it by disabling NUMA.

> Whilst I agree with that in principle (having managed to provide
> tables doing exactly that during development a few times!), I'm not
> sure the path to doing so is clear and so this has been stalled for
> a few months. There is to my mind still a strong argument, even
> with such protection in place, that we should still be short cutting
> it so that you get the same paths if you deliberately disable numa,
> and if you have no SRAT and hence can't have NUMA.

I guess we need to resolve the question of whether NUMA without SRAT
is possible.

> So given I have some 'mild for now' screaming going on, I'd
> definitely appreciate input on how to move forward!
>
> There are lots of places this could be worked around, e.g. we could
> sanity check in the acpi_get_pxm call. I'm not sure what side
> effects that would have and also what cases it wouldn't cover.
>
> Thanks,
>
> Jonathan
On Mon, 28 Jan 2019 17:13:22 -0600 Bjorn Helgaas <helgaas@kernel.org> wrote:
> I guess the performance drop must be from calling kmalloc_node() with
> the wrong node number because we currently ignore _PXM for the NIC?
> And to get that performance back, you need both the previous patch to
> pay attention to _PXM (https://lore.kernel.org/linux-pci/1527768879-88161-2-git-send-email-xiexiuqi@huawei.com)
> and this patch (to set "numa_off=1" to avoid the regression the _PXM
> patch by itself would cause)?

Exactly.

> The current patch proposes setting "numa_off=1" in the x86 version of
> dummy_numa_init(), on the assumption (from the changelog) that:
>
>   It is invalid under the ACPI spec to specify new NUMA nodes using
>   _PXM if they have no presence in SRAT.
>
> Do you have a reference for this? I looked and couldn't find a clear
> statement in the spec to that effect. The _PXM description (ACPI
> v6.2, sec 6.1.14) says that two devices with the same _PXM value are
> in the same proximity domain, but it doesn't seem to require an SRAT.

No comment (feel free to guess why). *sigh*

> But I guess it doesn't really matter whether it's invalid; that
> situation exists in the field, so we have to handle it gracefully.
>
> Martin reported the regression from 3) above and attached useful logs,
> which unfortunately aren't in the archives because the mailing list rejects
> attachments. To preserve them, I opened https://bugzilla.kernel.org/show_bug.cgi?id=202443
> and attached the logs there.

Cool. Thanks for doing that.

> Maybe it results in an issue when we call kmalloc_node() using this
> _PXM value that SRAT didn't tell us about? If so, that's reminiscent
> of these earlier discussions about kmalloc_node() returning something
> useless if the requested node is not online:
>
>   https://lkml.kernel.org/r/1527768879-88161-2-git-send-email-xiexiuqi@huawei.com
>   https://lore.kernel.org/linux-arm-kernel/20180801173132.19739-1-punit.agrawal@arm.com/
>
> As far as I know, that was never really resolved. The immediate
> problem of passing an invalid node number to kmalloc_node() was
> avoided by using kmalloc() instead.

Yes, that's definitely still a problem (or was last time I checked).

> Oops, sorry for telling you what you obviously already know! I guess
> I didn't internalize this sentence before writing the above.

Not to worry, your description was a lot better than mine! Thanks.

> Bottom line, I totally agree that it would be better to fix the
> underlying issue without trying to avoid it by disabling NUMA.

I don't agree on this point. I think two layers make sense.

If there is no NUMA description in DT or ACPI, why not just stop anything
from using it at all? The firmware has basically declared there is no
point, so why not save a bit of complexity (and use an existing tested code
path) by setting numa_off?

However, if there is a NUMA description, but with bugs, then we should
protect in depth. A simple example being that we declare 2 nodes, but
then use _PXM for a third. I've done that by accident and it blows up
in a nasty fashion (not done it for a while, but probably still true).

Given DSDT is only parsed long after SRAT we can just check on _PXM
queries. Or I suppose we could do a verification parse for all _PXM
entries and put out some warnings if they don't match SRAT entries?

> I guess we need to resolve the question of whether NUMA without SRAT
> is possible.

It's certainly unclear whether it has any meaning. If we allow for
the fact that the intent of ACPI was never to allow this (and a bit
of history checking verified this as best as anyone can remember),
then what do we do with the few platforms that do use _PXM for nodes that
haven't been defined? Note we have never actually supported them as we
weren't using the values provided, so there is no regression if we simply
rule them as not valid.

It's also unclear that it was ever intentional for these platforms,
rather than something that got through compliance tests because no one
was using it.

Thanks for your detailed insight and help!

Jonathan
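
The "verification parse" idea from the message above (compare every _PXM value against what SRAT declared and warn on mismatches) reduces to a set-membership check. The following is a minimal userspace sketch, not kernel code: `pxm_in_srat` and `count_undeclared_pxm` are hypothetical helper names, and plain integer arrays stand in for the parsed SRAT and DSDT contents.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if proximity domain `pxm` appears among the SRAT-declared ones. */
bool pxm_in_srat(const int *srat_pxm, size_t n_srat, int pxm)
{
    for (size_t i = 0; i < n_srat; i++)
        if (srat_pxm[i] == pxm)
            return true;
    return false;
}

/* Counts device _PXM values that name a domain the SRAT never declared.
 * A real check would emit a warning per mismatch instead of just counting. */
size_t count_undeclared_pxm(const int *srat_pxm, size_t n_srat,
                            const int *dev_pxm, size_t n_dev)
{
    size_t bad = 0;

    for (size_t i = 0; i < n_dev; i++)
        if (!pxm_in_srat(srat_pxm, n_srat, dev_pxm[i]))
            bad++;
    return bad;
}
```

The "declare 2 nodes, then use _PXM for a third" example corresponds to an SRAT set of {0, 1} and device values {0, 1, 2}, which yields one mismatch. Passing `n_srat == 0` models the no-SRAT firmware case, where every device _PXM is undeclared.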
On Tue, Jan 29, 2019 at 09:51:05AM +0000, Jonathan Cameron wrote:
> On Mon, 28 Jan 2019 17:13:22 -0600
> Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Mon, Jan 28, 2019 at 11:31:08AM +0000, Jonathan Cameron wrote:

> > The current patch proposes setting "numa_off=1" in the x86 version of
> > dummy_numa_init(), on the assumption (from the changelog) that:
> >
> >   It is invalid under the ACPI spec to specify new NUMA nodes using
> >   _PXM if they have no presence in SRAT.
> >
> > Do you have a reference for this? I looked and couldn't find a clear
> > statement in the spec to that effect. The _PXM description (ACPI
> > v6.2, sec 6.1.14) says that two devices with the same _PXM value are
> > in the same proximity domain, but it doesn't seem to require an SRAT.
>
> No comment (feel free to guess why). *sigh*

Secret interpretations of the spec are out of bounds. But I think
it's a waste of time to argue about whether _PXM without SRAT is
valid. Systems like that exist, and I think it's possible to do
something sensible with them.

> > Maybe it results in an issue when we call kmalloc_node() using this
> > _PXM value that SRAT didn't tell us about? If so, that's reminiscent
> > of these earlier discussions about kmalloc_node() returning something
> > useless if the requested node is not online:
> >
> >   https://lkml.kernel.org/r/1527768879-88161-2-git-send-email-xiexiuqi@huawei.com
> >   https://lore.kernel.org/linux-arm-kernel/20180801173132.19739-1-punit.agrawal@arm.com/
> >
> > As far as I know, that was never really resolved. The immediate
> > problem of passing an invalid node number to kmalloc_node() was
> > avoided by using kmalloc() instead.
>
> Yes, that's definitely still a problem (or was last time I checked)
>
> > > Dave's response was that we needed to fix the underlying issue of
> > > trying to allocate from non existent NUMA nodes.
>
> > Bottom line, I totally agree that it would be better to fix the
> > underlying issue without trying to avoid it by disabling NUMA.
>
> I don't agree on this point. I think two layers make sense.
>
> If there is no NUMA description in DT or ACPI, why not just stop anything
> from using it at all? The firmware has basically declared there is no
> point, why not save a bit of complexity (and use an existing tested code
> path) by setting numa_off?

Firmware with a _PXM does have a NUMA description.

> However, if there is NUMA description, but with bugs then we should
> protect in depth. A simple example being that we declare 2 nodes, but
> then use _PXM for a third. I've done that by accident and blows up
> in a nasty fashion (not done it for a while, but probably still true).
>
> Given DSDT is only parsed long after SRAT we can just check on _PXM
> queries. Or I suppose we could do a verification parse for all _PXM
> entries and put out some warnings if they don't match SRAT entries?

I'm assuming the crash happens when we call kmalloc_node() with a node
not mentioned in SRAT. I think that's just a sub-optimal implementation
in kmalloc_node().

We *could* fail the allocation and return a NULL pointer, but I think
even that is excessive. I think we should simply fall back to
kmalloc(). We could print a one-time warning if that's useful.

If kmalloc_node() for an unknown node fell back to kmalloc(), would
anything else be required?

> > > Whilst I agree with that in principle (having managed to provide
> > > tables doing exactly that during development a few times!), I'm not
> > > sure the path to doing so is clear and so this has been stalled for
> > > a few months. There is to my mind still a strong argument, even
> > > with such protection in place, that we should still be short cutting
> > > it so that you get the same paths if you deliberately disable numa,
> > > and if you have no SRAT and hence can't have NUMA.
> >
> > I guess we need to resolve the question of whether NUMA without SRAT
> > is possible.
>
> It's certainly unclear of whether it has any meaning. If we allow for
> the fact that the intent of ACPI was never to allow this (and a bit
> of history checking verified this as best as anyone can remember),
> then what do we do with the few platforms that do use _PXM to nodes that
> haven't been defined?

We *could* ignore any _PXM that mentions a proximity domain not
mentioned by an SRAT. That seems a little heavy-handed because it
means every possible proximity domain must be described up front in
the SRAT, which limits the flexibility of hot-adding entire nodes
(CPU/memory/IO).

But I think it's possible to make sense of a _PXM that adds a
proximity domain not mentioned in an SRAT, e.g., if a new memory
device and a new I/O device supply the same _PXM value, we can assume
they're close together. If a new I/O device has a previously unknown
_PXM, we may not be able to allocate memory near it, but we should at
least be able to allocate from a default zone.

Bjorn
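
The fallback behaviour proposed above (an allocation for an unknown node quietly uses the default path, with a one-time warning) can be illustrated in userspace. This is a sketch only, with assumptions stated plainly: `fb_alloc_node` and the other `fb_*` names are hypothetical, `malloc()` stands in for the real allocator, and nothing here reflects how `kmalloc_node()` is actually implemented.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define FB_MAX_NODES    4       /* illustrative limit, not a kernel constant */
#define FB_DEFAULT_NODE 0

bool fb_node_online[FB_MAX_NODES] = { true };   /* only node 0 is online */
int  fb_last_node = -1;      /* node that served the most recent request */
int  fb_warnings;            /* number of warnings printed so far */

/* Allocates "near" the requested node if it is known, otherwise falls
 * back to the default node and warns once. */
void *fb_alloc_node(size_t size, int node)
{
    if (node < 0 || node >= FB_MAX_NODES || !fb_node_online[node]) {
        if (fb_warnings == 0) {
            fprintf(stderr, "node %d unknown, using default node\n", node);
            fb_warnings++;
        }
        node = FB_DEFAULT_NODE;
    }
    fb_last_node = node;
    return malloc(size);     /* stands in for the real per-node allocator */
}
```

A request for node 3 in this model still succeeds, is served from the default node, and prints the warning only the first time. Whether this alone would be enough in the kernel is exactly the open question in the thread.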
On Tue, 29 Jan 2019 13:05:56 -0600
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Tue, Jan 29, 2019 at 09:51:05AM +0000, Jonathan Cameron wrote:
> > On Mon, 28 Jan 2019 17:13:22 -0600
> > Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Mon, Jan 28, 2019 at 11:31:08AM +0000, Jonathan Cameron wrote:
> > > > On Thu, 20 Dec 2018 13:57:14 -0600
> > > > Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > On Thu, Dec 20, 2018 at 09:13:12AM -0800, Dave Hansen wrote:
> > > > > > On 12/20/18 7:12 AM, Bjorn Helgaas wrote:
>
> > > The current patch proposes setting "numa_off=1" in the x86 version of
> > > dummy_numa_init(), on the assumption (from the changelog) that:
> > >
> > > It is invalid under the ACPI spec to specify new NUMA nodes using
> > > _PXM if they have no presence in SRAT.
> > >
> > > Do you have a reference for this? I looked and couldn't find a clear
> > > statement in the spec to that effect. The _PXM description (ACPI
> > > v6.2, sec 6.1.14) says that two devices with the same _PXM value are
> > > in the same proximity domain, but it doesn't seem to require an SRAT.
> >
> > No comment (feel free to guess why). *sigh*
>
> Secret interpretations of the spec are out of bounds. But I think
> it's a waste of time to argue about whether _PXM without SRAT is
> valid. Systems like that exist, and I think it's possible to do
> something sensible with them.
>
> > > Maybe it results in an issue when we call kmalloc_node() using this
> > > _PXM value that SRAT didn't tell us about? If so, that's reminiscent
> > > of these earlier discussions about kmalloc_node() returning something
> > > useless if the requested node is not online:
> > >
> > > https://lkml.kernel.org/r/1527768879-88161-2-git-send-email-xiexiuqi@huawei.com
> > > https://lore.kernel.org/linux-arm-kernel/20180801173132.19739-1-punit.agrawal@arm.com/
> > >
> > > As far as I know, that was never really resolved. The immediate
> > > problem of using passing an invalid node number to kmalloc_node() was
> > > avoided by using kmalloc() instead.
> >
> > Yes, that's definitely still a problem (or was last time I checked)
> >
> > > > Dave's response was that we needed to fix the underlying issue of
> > > > trying to allocate from non existent NUMA nodes.
>
> > > Bottom line, I totally agree that it would be better to fix the
> > > underlying issue without trying to avoid it by disabling NUMA.
> >
> > I don't agree on this point. I think two layers make sense.
> >
> > If there is no NUMA description in DT or ACPI, why not just stop anything
> > from using it at all? The firmware has basically declared there is no
> > point, why not save a bit of complexity (and use an existing tested code
> > path) but setting numa_off?
>
> Firmware with a _PXM does have a NUMA description.

Most of the meaning is lost. It applies some grouping but gives no info
on the relative distance between that and anywhere else. So perhaps
'some' description.

>
> > However, if there is NUMA description, but with bugs then we should
> > protect in depth. A simple example being that we declare 2 nodes, but
> > then use _PXM for a third. I've done that by accident and blows up
> > in a nasty fashion (not done it for a while, but probably still true).
> >
> > Given DSDT is only parsed long after SRAT we can just check on _PXM
> > queries. Or I suppose we could do a verification parse for all _PXM
> > entries and put out some warnings if they don't match SRAT entries?
>
> I'm assuming the crash happens when we call kmalloc_node() with a node
> not mentioned in SRAT. I think that's just sub-optimal implementation
> in kmalloc_node().
>
> We *could* fail the allocation and return a NULL pointer, but I think
> even that is excessive. I think we should simply fall back to
> kmalloc(). We could print a one-time warning if that's useful.
>
> If kmalloc_node() for an unknown node fell back to kmalloc(), would
> anything else be required?

It will deal with that case, but it may not be the only one. I think
there are interrupt-related issues as well, but I will have to check.

>
> > > > Whilst I agree with that in principle (having managed to provide
> > > > tables doing exactly that during development a few times!), I'm not
> > > > sure the path to doing so is clear and so this has been stalled for
> > > > a few months. There is to my mind still a strong argument, even
> > > > with such protection in place, that we should still be short cutting
> > > > it so that you get the same paths if you deliberately disable numa,
> > > > and if you have no SRAT and hence can't have NUMA.
> > >
> > > I guess we need to resolve the question of whether NUMA without SRAT
> > > is possible.
> >
> > It's certainly unclear of whether it has any meaning. If we allow for
> > the fact that the intent of ACPI was never to allow this (and a bit
> > of history checking verified this as best as anyone can remember),
> > then what do we do with the few platforms that do use _PXM to nodes that
> > haven't been defined?
>
> We *could* ignore any _PXM that mentions a proximity domain not
> mentioned by an SRAT. That seems a little heavy-handed because it
> means every possible proximity domain must be described up front in
> the SRAT, which limits the flexibility of hot-adding entire nodes
> (CPU/memory/IO).
>
> But I think it's possible to make sense of a _PXM that adds a
> proximity domain not mentioned in an SRAT, e.g., if a new memory
> device and a new I/O device supply the same _PXM value, we can assume
> they're close together. If a new I/O device has a previously unknown
> _PXM, we may not be able to allocate memory near it, but we should at
> least be able to allocate from a default zone.

I would like to know if this is real before we support it though.
We have a known platform that does it. As I understand it, that platform
might as well not bother, as it doesn't have memory in those nodes.

I'll be honest though: I'm happy with fixing it the hard way and dropping
the numa_off = 1 for arm if that is the consensus.

Jonathan

>
> Bjorn
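The "verification parse" idea quoted above (walk every _PXM value and warn when it names a proximity domain the SRAT never defined) could look roughly like the userspace sketch below. Nothing here is kernel code: the `sketch_*` names and the plain arrays standing in for SRAT and _PXM data are assumptions made purely for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: is this proximity domain one of those the
 * (modelled) SRAT defined? srat_domains is a plain array standing in
 * for whatever the real SRAT parse recorded.
 */
static bool sketch_pxm_in_srat(const unsigned int *srat_domains,
			       size_t nr_domains, unsigned int pxm)
{
	size_t i;

	for (i = 0; i < nr_domains; i++)
		if (srat_domains[i] == pxm)
			return true;
	return false;
}

/*
 * Hypothetical verification pass: count the _PXM values that refer to
 * a domain the SRAT never mentioned. A real implementation would
 * presumably warn here instead of just counting.
 */
static size_t sketch_count_unknown_pxm(const unsigned int *srat_domains,
				       size_t nr_domains,
				       const unsigned int *pxm_values,
				       size_t nr_pxm)
{
	size_t i, unknown = 0;

	for (i = 0; i < nr_pxm; i++)
		if (!sketch_pxm_in_srat(srat_domains, nr_domains,
					pxm_values[i]))
			unknown++;
	return unknown;
}
```

With no SRAT at all (zero defined domains), every _PXM value counts as unknown, which matches the case that started this thread.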
On Tue, Jan 29, 2019 at 07:45:34PM +0000, Jonathan Cameron wrote:
> On Tue, 29 Jan 2019 13:05:56 -0600
> Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Tue, Jan 29, 2019 at 09:51:05AM +0000, Jonathan Cameron wrote:
> > > However, if there is NUMA description, but with bugs then we should
> > > protect in depth. A simple example being that we declare 2 nodes, but
> > > then use _PXM for a third. I've done that by accident and blows up
> > > in a nasty fashion (not done it for a while, but probably still true).
> > >
> > > Given DSDT is only parsed long after SRAT we can just check on _PXM
> > > queries. Or I suppose we could do a verification parse for all _PXM
> > > entries and put out some warnings if they don't match SRAT entries?
> >
> > I'm assuming the crash happens when we call kmalloc_node() with a node
> > not mentioned in SRAT. I think that's just sub-optimal implementation
> > in kmalloc_node().
> >
> > We *could* fail the allocation and return a NULL pointer, but I think
> > even that is excessive. I think we should simply fall back to
> > kmalloc(). We could print a one-time warning if that's useful.
> >
> > If kmalloc_node() for an unknown node fell back to kmalloc(), would
> > anything else be required?
>
> It will deal with that case, but it may not be the only one. I
> think there are interrupt related issues as well, but will have to
> check.

Sounds like a valid concern. Also, kmalloc() in general looks like a
performance path, so maybe it would be better to address this on the
other end, i.e., by ensuring that dev->numa_node always contains
something valid for kmalloc(), interrupts, etc. Maybe set_dev_node()
could be made smarter along that line?

Bjorn
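One way to read the "make set_dev_node() smarter" suggestion is to sanitise the node once, at the point where it is stored, so later users never see a node that is not online. The sketch below is a userspace model under that assumption only. `sketch_device`, `sketch_sanitize_node()` and `sketch_set_dev_node()` are hypothetical names, the `online` array stands in for the kernel's online-node state, and this is not the kernel's actual `set_dev_node()`.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NO_NODE (-1)	/* stands in for NUMA_NO_NODE */

/* Hypothetical stand-in for the one struct device field of interest. */
struct sketch_device {
	int numa_node;
};

/*
 * Return the node unchanged if it is in range and online, otherwise
 * fall back to "no node" so callers take their default path.
 */
static int sketch_sanitize_node(int node, const bool *online,
				size_t nr_nodes)
{
	if (node < 0 || (size_t)node >= nr_nodes || !online[node])
		return SKETCH_NO_NODE;
	return node;
}

/* Store only a sanitised node in the device. */
static void sketch_set_dev_node(struct sketch_device *dev, int node,
				const bool *online, size_t nr_nodes)
{
	dev->numa_node = sketch_sanitize_node(node, online, nr_nodes);
}
```

Doing the check at store time keeps it out of the allocation fast path, which is the trade-off raised in the mail above.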
On Tue, 29 Jan 2019 13:05:56 -0600
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Tue, Jan 29, 2019 at 09:51:05AM +0000, Jonathan Cameron wrote:
> > On Mon, 28 Jan 2019 17:13:22 -0600
> > Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Mon, Jan 28, 2019 at 11:31:08AM +0000, Jonathan Cameron wrote:
> > > > On Thu, 20 Dec 2018 13:57:14 -0600
> > > > Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > On Thu, Dec 20, 2018 at 09:13:12AM -0800, Dave Hansen wrote:
> > > > > > On 12/20/18 7:12 AM, Bjorn Helgaas wrote:
>
> > > The current patch proposes setting "numa_off=1" in the x86 version of
> > > dummy_numa_init(), on the assumption (from the changelog) that:
> > >
> > > It is invalid under the ACPI spec to specify new NUMA nodes using
> > > _PXM if they have no presence in SRAT.
> > >
> > > Do you have a reference for this? I looked and couldn't find a clear
> > > statement in the spec to that effect. The _PXM description (ACPI
> > > v6.2, sec 6.1.14) says that two devices with the same _PXM value are
> > > in the same proximity domain, but it doesn't seem to require an SRAT.
> >
> > No comment (feel free to guess why). *sigh*
>
> Secret interpretations of the spec are out of bounds. But I think
> it's a waste of time to argue about whether _PXM without SRAT is
> valid. Systems like that exist, and I think it's possible to do
> something sensible with them.

Now less secret :)
https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf

Specifically 6.2B Errata 1951 _PXM Clarifications

Adds lots of statements including:

(5.2.16)
Note: SRAT is the place where proximity domains are defined, and _PXM
provides a mechanism to associate a device object (and its children) to
an SRAT-defined proximity domain.

6.2.14 _PXM (Proximity)
This optional object is used to describe proximity domain associations
within a machine. _PXM evaluates to an integer that identifies a device
as belonging to a Proximity Domain defined in the System Resource
Affinity Table (SRAT).

Obviously this doesn't necessarily change the fact there 'might' be a
platform out there with tables written against earlier ACPI specs that
does 'deliberately' provide _PXM entries that don't match entries in
SRAT. What it does mean is that going forwards we "shouldn't" see any
new ones.

Note that the use case that was conjectured below is now accounted for
with the new Generic Initiator Domains (5.2.16.6). There is some juggling
done via an OSC bit to ensure that firmware can 'adjust' its _PXM entries
to account for whether or not these Generic Initiator domains are
supported by the OS.

I'll clean up my patches for that and post soon (if no one beats me to it!)

One thing I will note though, is I'm not going to propose we drop the
numa_off = true line in the arm code, given there aren't any arm platforms
known to have _PXMs not matching entries in SRAT and now we have a spec
that says it isn't right to do it anyway.

Jonathan

>
> > > Maybe it results in an issue when we call kmalloc_node() using this
> > > _PXM value that SRAT didn't tell us about? If so, that's reminiscent
> > > of these earlier discussions about kmalloc_node() returning something
> > > useless if the requested node is not online:
> > >
> > > https://lkml.kernel.org/r/1527768879-88161-2-git-send-email-xiexiuqi@huawei.com
> > > https://lore.kernel.org/linux-arm-kernel/20180801173132.19739-1-punit.agrawal@arm.com/
> > >
> > > As far as I know, that was never really resolved. The immediate
> > > problem of using passing an invalid node number to kmalloc_node() was
> > > avoided by using kmalloc() instead.
> >
> > Yes, that's definitely still a problem (or was last time I checked)
> >
> > > > Dave's response was that we needed to fix the underlying issue of
> > > > trying to allocate from non existent NUMA nodes.
>
> > > Bottom line, I totally agree that it would be better to fix the
> > > underlying issue without trying to avoid it by disabling NUMA.
> >
> > I don't agree on this point. I think two layers make sense.
> >
> > If there is no NUMA description in DT or ACPI, why not just stop anything
> > from using it at all? The firmware has basically declared there is no
> > point, why not save a bit of complexity (and use an existing tested code
> > path) but setting numa_off?
>
> Firmware with a _PXM does have a NUMA description.
>
> > However, if there is NUMA description, but with bugs then we should
> > protect in depth. A simple example being that we declare 2 nodes, but
> > then use _PXM for a third. I've done that by accident and blows up
> > in a nasty fashion (not done it for a while, but probably still true).
> >
> > Given DSDT is only parsed long after SRAT we can just check on _PXM
> > queries. Or I suppose we could do a verification parse for all _PXM
> > entries and put out some warnings if they don't match SRAT entries?
>
> I'm assuming the crash happens when we call kmalloc_node() with a node
> not mentioned in SRAT. I think that's just sub-optimal implementation
> in kmalloc_node().
>
> We *could* fail the allocation and return a NULL pointer, but I think
> even that is excessive. I think we should simply fall back to
> kmalloc(). We could print a one-time warning if that's useful.
>
> If kmalloc_node() for an unknown node fell back to kmalloc(), would
> anything else be required?
>
> > > > Whilst I agree with that in principle (having managed to provide
> > > > tables doing exactly that during development a few times!), I'm not
> > > > sure the path to doing so is clear and so this has been stalled for
> > > > a few months. There is to my mind still a strong argument, even
> > > > with such protection in place, that we should still be short cutting
> > > > it so that you get the same paths if you deliberately disable numa,
> > > > and if you have no SRAT and hence can't have NUMA.
> > >
> > > I guess we need to resolve the question of whether NUMA without SRAT
> > > is possible.
> >
> > It's certainly unclear of whether it has any meaning. If we allow for
> > the fact that the intent of ACPI was never to allow this (and a bit
> > of history checking verified this as best as anyone can remember),
> > then what do we do with the few platforms that do use _PXM to nodes that
> > haven't been defined?
>
> We *could* ignore any _PXM that mentions a proximity domain not
> mentioned by an SRAT. That seems a little heavy-handed because it
> means every possible proximity domain must be described up front in
> the SRAT, which limits the flexibility of hot-adding entire nodes
> (CPU/memory/IO).
>
> But I think it's possible to make sense of a _PXM that adds a
> proximity domain not mentioned in an SRAT, e.g., if a new memory
> device and a new I/O device supply the same _PXM value, we can assume
> they're close together. If a new I/O device has a previously unknown
> _PXM, we may not be able to allocate memory near it, but we should at
> least be able to allocate from a default zone.
>
> Bjorn
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1308f5408bf7..ce1182f953ff 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -695,6 +695,8 @@ static int __init dummy_numa_init(void)
 
 	node_set(0, numa_nodes_parsed);
 	numa_add_memblk(0, 0, PFN_PHYS(max_pfn));
 
+	numa_off = true;
+
 	return 0;
 }
The addition of support to read the NUMA node for a PCI card specified
by _PXM resulted in Martin's system not booting. Looking at the ACPI
tables it seems that there are _PXM entries for the root ports, but no
SRAT table.

The absence of the SRAT table results in dummy_numa_init() being called.
However, unlike on arm64, this doesn't result in numa_off being set.

When the PCI code later comes along and calls acpi_get_node() for any PCI
card below the root port, it navigates up the ACPI tree until it finds the
_PXM value in the root port. This value is then passed to
acpi_map_pxm_to_node().

As numa_off has not been set on x86 it tries to allocate a NUMA node, from
the unused set, without setting up all the infrastructure that would
normally accompany such a call. We have not identified exactly which
driver is causing the subsequent hang for Martin.

If numa_off had been set, as it is in the equivalent flow on arm64, then
acpi_map_pxm_to_node() would return NUMA_NO_NODE, which is what we want
to happen.

It is invalid under the ACPI spec to specify new NUMA nodes using
_PXM if they have no presence in SRAT. Thus the simplest fix is to set
numa_off when NUMA support is disabled due to an invalid SRAT
(here not present at all).

I do not have easy access to appropriate x86 NUMA systems so would
appreciate some testing of this one!

Known problem board setups:
AMD Ryzen Threadripper 2950X on ASROCK X399 TAICHI
MSI X399 SLI PLUS (probably - not confirmed yet)

The PCI patch has been reverted, so this fix is not critical.

Reported-by: Martin Hundebøll <martin@geanix.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Fixes: bad7dcd94f39 ("ACPI/PCI: Pay attention to device-specific _PXM node values")
---
Changes since V1:
* Update commit message as suggested by Bjorn Helgaas.
* No functional changes.

 arch/x86/mm/numa.c | 2 ++
 1 file changed, 2 insertions(+)
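The flow described in this changelog can be illustrated with a small userspace model. It is only a sketch, not the kernel implementation: the struct, the `model_*` names and the tiny limits are invented for illustration, and it assumes that the fake node 0 is the only node online (and already taken) after the dummy init.

```c
#include <stdbool.h>

#define MODEL_MAX_PXM	8	/* hypothetical small limits */
#define MODEL_MAX_NODES	4
#define MODEL_NO_NODE	(-1)	/* stands in for NUMA_NO_NODE */

struct numa_model {
	bool numa_off;				/* mirrors the numa_off flag */
	int pxm_to_node[MODEL_MAX_PXM];		/* MODEL_NO_NODE if unmapped */
	bool node_found[MODEL_MAX_NODES];	/* nodes already handed out */
	bool node_online[MODEL_MAX_NODES];	/* nodes that are set up */
};

/* Model of the no-SRAT fallback: one fake node 0, nothing else set up. */
static void model_dummy_init(struct numa_model *m, bool set_numa_off)
{
	int i;

	for (i = 0; i < MODEL_MAX_PXM; i++)
		m->pxm_to_node[i] = MODEL_NO_NODE;
	for (i = 0; i < MODEL_MAX_NODES; i++) {
		m->node_found[i] = (i == 0);
		m->node_online[i] = (i == 0);
	}
	m->numa_off = set_numa_off;
}

/* Model of mapping a _PXM value to a node number. */
static int model_map_pxm_to_node(struct numa_model *m, int pxm)
{
	int node;

	/* With numa_off set, every lookup yields "no node". */
	if (pxm < 0 || pxm >= MODEL_MAX_PXM || m->numa_off)
		return MODEL_NO_NODE;

	if (m->pxm_to_node[pxm] != MODEL_NO_NODE)
		return m->pxm_to_node[pxm];

	/* Otherwise hand out the first unused node number. */
	for (node = 0; node < MODEL_MAX_NODES; node++) {
		if (!m->node_found[node]) {
			m->node_found[node] = true;
			m->pxm_to_node[pxm] = node;
			/* node_online[node] is deliberately left untouched. */
			return node;
		}
	}
	return MODEL_NO_NODE;
}
```

In this model, calling `model_map_pxm_to_node()` after `model_dummy_init(&m, false)` hands back a node number that was never brought online, whereas after `model_dummy_init(&m, true)` the same call returns `MODEL_NO_NODE`, which is the behaviour the patch aims for on x86.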