diff mbox series

[08/37] xen/x86: add detection of discontinous node memory range

Message ID 20210923120236.3692135-9-wei.chen@arm.com (mailing list archive)
State New, archived
Headers show
Series Add device tree based NUMA support to Arm | expand

Commit Message

Wei Chen Sept. 23, 2021, 12:02 p.m. UTC
One NUMA node may contain several memory blocks. In the current Xen
code, Xen maintains a memory range for each node to cover all of its
memory blocks. But here comes the problem: if the gap between two of a
node's memory blocks contains memory blocks that don't belong to this
node (remote memory blocks), this node's memory range will be expanded
to cover those remote memory blocks as well.

Having one node's memory range contain other nodes' memory is obviously
not reasonable. It means the current NUMA code can only support nodes
with continuous memory blocks. However, on a physical machine, the
addresses of multiple nodes can be interleaved.

So in this patch, we add code to detect discontinuous memory blocks
for one node. NUMA initialization will fail and error messages will be
printed when Xen detects such a hardware configuration.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/x86/srat.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

Comments

Stefano Stabellini Sept. 24, 2021, 12:25 a.m. UTC | #1
CC'ing x86 maintainers

On Thu, 23 Sep 2021, Wei Chen wrote:
> One NUMA node may contain several memory blocks. In current Xen
> code, Xen will maintain a node memory range for each node to cover
> all its memory blocks. But here comes the problem, in the gap of
> one node's two memory blocks, if there are some memory blocks don't
> belong to this node (remote memory blocks). This node's memory range
> will be expanded to cover these remote memory blocks.
> 
> One node's memory range contains othe nodes' memory, this is obviously
> not very reasonable. This means current NUMA code only can support
> node has continous memory blocks. However, on a physical machine, the
> addresses of multiple nodes can be interleaved.
> 
> So in this patch, we add code to detect discontinous memory blocks
> for one node. NUMA initializtion will be failed and error messages
> will be printed when Xen detect such hardware configuration.

At least on ARM, it is not just memory that can be interleaved, but also
MMIO regions. For instance:

node0 bank0 0-0x1000000
MMIO 0x1000000-0x1002000
Hole 0x1002000-0x2000000
node0 bank1 0x2000000-0x3000000

So I am not familiar with the SRAT format, but I think on ARM the check
would look different: we would just look for multiple memory ranges
under a device_type = "memory" node of a NUMA node in device tree.
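
For illustration only, here is a minimal, self-contained sketch of the
containment check the patch performs, written over a hypothetical array
of banks already parsed from firmware tables (SRAT or device tree). The
struct bank / nr_banks representation is made up for this example and
is not the actual Xen code:

#include <stdbool.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

typedef uint64_t paddr_t;

struct bank {
    unsigned int nid;      /* NUMA node id */
    paddr_t start, end;    /* [start, end) physical address range */
};

static bool node_span_is_clean(const struct bank *banks,
                               unsigned int nr_banks, unsigned int nid)
{
    paddr_t lo = UINT64_MAX, hi = 0;
    unsigned int i;

    /* Compute the overall span (min start / max end) of this node's banks. */
    for ( i = 0; i < nr_banks; i++ )
        if ( banks[i].nid == nid )
        {
            if ( banks[i].start < lo )
                lo = banks[i].start;
            if ( banks[i].end > hi )
                hi = banks[i].end;
        }

    /*
     * Reject the layout if that span fully contains another node's bank.
     * Plain overlaps are assumed to have been rejected earlier, as the
     * patch relies on conflicting_memblks() for that.
     */
    for ( i = 0; i < nr_banks; i++ )
        if ( banks[i].nid != nid &&
             lo < banks[i].start && banks[i].end < hi )
        {
            printf("node %u span %#"PRIx64"-%#"PRIx64
                   " contains node %u bank %#"PRIx64"-%#"PRIx64"\n",
                   nid, lo, hi, banks[i].nid, banks[i].start, banks[i].end);
            return false;
        }

    return true;
}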



> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/x86/srat.c | 36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
> 
> diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
> index 7d20d7f222..2f08fa4660 100644
> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -271,6 +271,36 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
>  		       pxm, pa->apic_id, node);
>  }
>  
> +/*
> + * Check to see if there are other nodes within this node's range.
> + * We just need to check full contains situation. Because overlaps
> + * have been checked before by conflicting_memblks.
> + */
> +static bool __init is_node_memory_continuous(nodeid_t nid,
> +    paddr_t start, paddr_t end)
> +{
> +	nodeid_t i;
> +
> +	struct node *nd = &nodes[nid];
> +	for_each_node_mask(i, memory_nodes_parsed)
> +	{
> +		/* Skip itself */
> +		if (i == nid)
> +			continue;
> +
> +		nd = &nodes[i];
> +		if (start < nd->start && nd->end < end)
> +		{
> +			printk(KERN_ERR
> +			       "NODE %u: (%"PRIpaddr"-%"PRIpaddr") intertwine with NODE %u (%"PRIpaddr"-%"PRIpaddr")\n",
> +			       nid, start, end, i, nd->start, nd->end);
> +			return false;
> +		}
> +	}
> +
> +	return true;
> +}
> +
>  /* Callback for parsing of the Proximity Domain <-> Memory Area mappings */
>  void __init
>  acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
> @@ -344,6 +374,12 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
>  				nd->start = start;
>  			if (nd->end < end)
>  				nd->end = end;
> +
> +			/* Check whether this range contains memory for other nodes */
> +			if (!is_node_memory_continuous(node, nd->start, nd->end)) {
> +				bad_srat();
> +				return;
> +			}
>  		}
>  	}
>  	printk(KERN_INFO "SRAT: Node %u PXM %u %"PRIpaddr"-%"PRIpaddr"%s\n",
> -- 
> 2.25.1
>
Wei Chen Sept. 24, 2021, 4:28 a.m. UTC | #2
> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: 2021年9月24日 8:26
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> Bertrand Marquis <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> andrew.cooper3@citrix.com; roger.pau@citrix.com; wl@xen.org
> Subject: Re: [PATCH 08/37] xen/x86: add detection of discontinous node
> memory range
> 
> CC'ing x86 maintainers
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > One NUMA node may contain several memory blocks. In current Xen
> > code, Xen will maintain a node memory range for each node to cover
> > all its memory blocks. But here comes the problem, in the gap of
> > one node's two memory blocks, if there are some memory blocks don't
> > belong to this node (remote memory blocks). This node's memory range
> > will be expanded to cover these remote memory blocks.
> >
> > One node's memory range contains othe nodes' memory, this is obviously
> > not very reasonable. This means current NUMA code only can support
> > node has continous memory blocks. However, on a physical machine, the
> > addresses of multiple nodes can be interleaved.
> >
> > So in this patch, we add code to detect discontinous memory blocks
> > for one node. NUMA initializtion will be failed and error messages
> > will be printed when Xen detect such hardware configuration.
> 
> At least on ARM, it is not just memory that can be interleaved, but also
> MMIO regions. For instance:
> 
> node0 bank0 0-0x1000000
> MMIO 0x1000000-0x1002000
> Hole 0x1002000-0x2000000
> node0 bank1 0x2000000-0x3000000
> 
> So I am not familiar with the SRAT format, but I think on ARM the check
> would look different: we would just look for multiple memory ranges
> under a device_type = "memory" node of a NUMA node in device tree.
> 
> 

Should I include/refine the above message in the commit log?

> 
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > ---
> >  xen/arch/x86/srat.c | 36 ++++++++++++++++++++++++++++++++++++
> >  1 file changed, 36 insertions(+)
> >
> > diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
> > index 7d20d7f222..2f08fa4660 100644
> > --- a/xen/arch/x86/srat.c
> > +++ b/xen/arch/x86/srat.c
> > @@ -271,6 +271,36 @@ acpi_numa_processor_affinity_init(const struct
> acpi_srat_cpu_affinity *pa)
> >  		       pxm, pa->apic_id, node);
> >  }
> >
> > +/*
> > + * Check to see if there are other nodes within this node's range.
> > + * We just need to check full contains situation. Because overlaps
> > + * have been checked before by conflicting_memblks.
> > + */
> > +static bool __init is_node_memory_continuous(nodeid_t nid,
> > +    paddr_t start, paddr_t end)
> > +{
> > +	nodeid_t i;
> > +
> > +	struct node *nd = &nodes[nid];
> > +	for_each_node_mask(i, memory_nodes_parsed)
> > +	{
> > +		/* Skip itself */
> > +		if (i == nid)
> > +			continue;
> > +
> > +		nd = &nodes[i];
> > +		if (start < nd->start && nd->end < end)
> > +		{
> > +			printk(KERN_ERR
> > +			       "NODE %u: (%"PRIpaddr"-%"PRIpaddr") intertwine
> with NODE %u (%"PRIpaddr"-%"PRIpaddr")\n",
> > +			       nid, start, end, i, nd->start, nd->end);
> > +			return false;
> > +		}
> > +	}
> > +
> > +	return true;
> > +}
> > +
> >  /* Callback for parsing of the Proximity Domain <-> Memory Area
> mappings */
> >  void __init
> >  acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
> > @@ -344,6 +374,12 @@ acpi_numa_memory_affinity_init(const struct
> acpi_srat_mem_affinity *ma)
> >  				nd->start = start;
> >  			if (nd->end < end)
> >  				nd->end = end;
> > +
> > +			/* Check whether this range contains memory for other
> nodes */
> > +			if (!is_node_memory_continuous(node, nd->start, nd->end))
> {
> > +				bad_srat();
> > +				return;
> > +			}
> >  		}
> >  	}
> >  	printk(KERN_INFO "SRAT: Node %u PXM %u %"PRIpaddr"-%"PRIpaddr"%s\n",
> > --
> > 2.25.1
> >
Stefano Stabellini Sept. 24, 2021, 7:52 p.m. UTC | #3
On Fri, 24 Sep 2021, Wei Chen wrote:
> > -----Original Message-----
> > From: Stefano Stabellini <sstabellini@kernel.org>
> > Sent: 2021年9月24日 8:26
> > To: Wei Chen <Wei.Chen@arm.com>
> > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> > Bertrand Marquis <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> > andrew.cooper3@citrix.com; roger.pau@citrix.com; wl@xen.org
> > Subject: Re: [PATCH 08/37] xen/x86: add detection of discontinous node
> > memory range
> > 
> > CC'ing x86 maintainers
> > 
> > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > One NUMA node may contain several memory blocks. In current Xen
> > > code, Xen will maintain a node memory range for each node to cover
> > > all its memory blocks. But here comes the problem, in the gap of
> > > one node's two memory blocks, if there are some memory blocks don't
> > > belong to this node (remote memory blocks). This node's memory range
> > > will be expanded to cover these remote memory blocks.
> > >
> > > One node's memory range contains othe nodes' memory, this is obviously
> > > not very reasonable. This means current NUMA code only can support
> > > node has continous memory blocks. However, on a physical machine, the
> > > addresses of multiple nodes can be interleaved.
> > >
> > > So in this patch, we add code to detect discontinous memory blocks
> > > for one node. NUMA initializtion will be failed and error messages
> > > will be printed when Xen detect such hardware configuration.
> > 
> > At least on ARM, it is not just memory that can be interleaved, but also
> > MMIO regions. For instance:
> > 
> > node0 bank0 0-0x1000000
> > MMIO 0x1000000-0x1002000
> > Hole 0x1002000-0x2000000
> > node0 bank1 0x2000000-0x3000000
> > 
> > So I am not familiar with the SRAT format, but I think on ARM the check
> > would look different: we would just look for multiple memory ranges
> > under a device_type = "memory" node of a NUMA node in device tree.
> > 
> > 
> 
> Should I need to include/refine above message to commit log?

Let me ask you a question first.

With the NUMA implementation of this patch series, can we deal with
cases where each node has multiple memory banks, not interleaved?
As an example:

node0: 0x0        - 0x10000000
MMIO : 0x10000000 - 0x20000000
node0: 0x20000000 - 0x30000000
MMIO : 0x30000000 - 0x50000000
node1: 0x50000000 - 0x60000000
MMIO : 0x60000000 - 0x80000000
node2: 0x80000000 - 0x90000000


I assume we can deal with this case simply by setting node0 memory to
0x0-0x30000000 even if there is actually something else, a device, that
doesn't belong to node0 in between the two node0 banks?

Is it only other nodes' memory interleaved that cause issues? In other
words, only the following is a problematic scenario?

node0: 0x0        - 0x10000000
MMIO : 0x10000000 - 0x20000000
node1: 0x20000000 - 0x30000000
MMIO : 0x30000000 - 0x50000000
node0: 0x50000000 - 0x60000000

Because node1 is in between the two ranges of node0?


I am asking these questions because it is certainly possible to have
multiple memory ranges for each NUMA node in device tree, either by
specifying multiple ranges with a single "reg" property, or by
specifying multiple memory nodes with the same numa-node-id.
Wei Chen Sept. 26, 2021, 10:11 a.m. UTC | #4
Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: 2021年9月25日 3:53
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> <Bertrand.Marquis@arm.com>; jbeulich@suse.com; andrew.cooper3@citrix.com;
> roger.pau@citrix.com; wl@xen.org
> Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> memory range
> 
> On Fri, 24 Sep 2021, Wei Chen wrote:
> > > -----Original Message-----
> > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > Sent: 2021年9月24日 8:26
> > > To: Wei Chen <Wei.Chen@arm.com>
> > > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org;
> julien@xen.org;
> > > Bertrand Marquis <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> > > andrew.cooper3@citrix.com; roger.pau@citrix.com; wl@xen.org
> > > Subject: Re: [PATCH 08/37] xen/x86: add detection of discontinous node
> > > memory range
> > >
> > > CC'ing x86 maintainers
> > >
> > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > One NUMA node may contain several memory blocks. In current Xen
> > > > code, Xen will maintain a node memory range for each node to cover
> > > > all its memory blocks. But here comes the problem, in the gap of
> > > > one node's two memory blocks, if there are some memory blocks don't
> > > > belong to this node (remote memory blocks). This node's memory range
> > > > will be expanded to cover these remote memory blocks.
> > > >
> > > > One node's memory range contains othe nodes' memory, this is
> obviously
> > > > not very reasonable. This means current NUMA code only can support
> > > > node has continous memory blocks. However, on a physical machine,
> the
> > > > addresses of multiple nodes can be interleaved.
> > > >
> > > > So in this patch, we add code to detect discontinous memory blocks
> > > > for one node. NUMA initializtion will be failed and error messages
> > > > will be printed when Xen detect such hardware configuration.
> > >
> > > At least on ARM, it is not just memory that can be interleaved, but
> also
> > > MMIO regions. For instance:
> > >
> > > node0 bank0 0-0x1000000
> > > MMIO 0x1000000-0x1002000
> > > Hole 0x1002000-0x2000000
> > > node0 bank1 0x2000000-0x3000000
> > >
> > > So I am not familiar with the SRAT format, but I think on ARM the
> check
> > > would look different: we would just look for multiple memory ranges
> > > under a device_type = "memory" node of a NUMA node in device tree.
> > >
> > >
> >
> > Should I need to include/refine above message to commit log?
> 
> Let me ask you a question first.
> 
> With the NUMA implementation of this patch series, can we deal with
> cases where each node has multiple memory banks, not interleaved?

Yes.

> An an example:
> 
> node0: 0x0        - 0x10000000
> MMIO : 0x10000000 - 0x20000000
> node0: 0x20000000 - 0x30000000
> MMIO : 0x30000000 - 0x50000000
> node1: 0x50000000 - 0x60000000
> MMIO : 0x60000000 - 0x80000000
> node2: 0x80000000 - 0x90000000
> 
> 
> I assume we can deal with this case simply by setting node0 memory to
> 0x0-0x30000000 even if there is actually something else, a device, that
> doesn't belong to node0 in between the two node0 banks?

While this configuration is rare in SoC design, it is not impossible.

> 
> Is it only other nodes' memory interleaved that cause issues? In other
> words, only the following is a problematic scenario?
> 
> node0: 0x0        - 0x10000000
> MMIO : 0x10000000 - 0x20000000
> node1: 0x20000000 - 0x30000000
> MMIO : 0x30000000 - 0x50000000
> node0: 0x50000000 - 0x60000000
> 
> Because node1 is in between the two ranges of node0?
> 

But only device_type = "memory" ranges are added to the allocator.
For MMIO there are two cases:
1. The MMIO region doesn't have a NUMA id property.
2. The MMIO region has a NUMA id property, just like some PCIe controllers.
   But we don't need to handle these kinds of MMIO devices in memory
   block parsing, because we don't need to allocate memory from these
   MMIO ranges. And for accessing them, we would need a NUMA-aware PCIe
   controller driver or generic NUMA-aware MMIO access APIs.
   (A sketch of this filtering follows below.)
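
A rough sketch of that filtering, assuming a hypothetical fw_range
structure for entries parsed out of the firmware tables; this is not
the real Xen device tree API, just an illustration of the rule above:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t paddr_t;

/* Hypothetical firmware-table entry, made up for this example. */
struct fw_range {
    const char *device_type;   /* "memory" for RAM banks, NULL/other for MMIO */
    bool has_numa_id;          /* some MMIO (e.g. PCIe controllers) has one */
    unsigned int numa_id;
    paddr_t start, size;
};

/*
 * Only device_type = "memory" ranges feed the NUMA memory block parsing.
 * MMIO is skipped even when it carries a numa-node-id, because Xen never
 * allocates memory from those ranges.
 */
static bool counts_for_numa_memory(const struct fw_range *r)
{
    return r->device_type && !strcmp(r->device_type, "memory");
}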

> 
> I am asking these questions because it is certainly possible to have
> multiple memory ranges for each NUMA node in device tree, either by
> specifying multiple ranges with a single "reg" property, or by
> specifying multiple memory nodes with the same numa-node-id.
Stefano Stabellini Sept. 27, 2021, 3:13 a.m. UTC | #5
On Sun, 26 Sep 2021, Wei Chen wrote:
> > -----Original Message-----
> > From: Stefano Stabellini <sstabellini@kernel.org>
> > Sent: 2021年9月25日 3:53
> > To: Wei Chen <Wei.Chen@arm.com>
> > Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> > devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > <Bertrand.Marquis@arm.com>; jbeulich@suse.com; andrew.cooper3@citrix.com;
> > roger.pau@citrix.com; wl@xen.org
> > Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> > memory range
> > 
> > On Fri, 24 Sep 2021, Wei Chen wrote:
> > > > -----Original Message-----
> > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > Sent: 2021年9月24日 8:26
> > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org;
> > julien@xen.org;
> > > > Bertrand Marquis <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> > > > andrew.cooper3@citrix.com; roger.pau@citrix.com; wl@xen.org
> > > > Subject: Re: [PATCH 08/37] xen/x86: add detection of discontinous node
> > > > memory range
> > > >
> > > > CC'ing x86 maintainers
> > > >
> > > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > > One NUMA node may contain several memory blocks. In current Xen
> > > > > code, Xen will maintain a node memory range for each node to cover
> > > > > all its memory blocks. But here comes the problem, in the gap of
> > > > > one node's two memory blocks, if there are some memory blocks don't
> > > > > belong to this node (remote memory blocks). This node's memory range
> > > > > will be expanded to cover these remote memory blocks.
> > > > >
> > > > > One node's memory range contains othe nodes' memory, this is
> > obviously
> > > > > not very reasonable. This means current NUMA code only can support
> > > > > node has continous memory blocks. However, on a physical machine,
> > the
> > > > > addresses of multiple nodes can be interleaved.
> > > > >
> > > > > So in this patch, we add code to detect discontinous memory blocks
> > > > > for one node. NUMA initializtion will be failed and error messages
> > > > > will be printed when Xen detect such hardware configuration.
> > > >
> > > > At least on ARM, it is not just memory that can be interleaved, but
> > also
> > > > MMIO regions. For instance:
> > > >
> > > > node0 bank0 0-0x1000000
> > > > MMIO 0x1000000-0x1002000
> > > > Hole 0x1002000-0x2000000
> > > > node0 bank1 0x2000000-0x3000000
> > > >
> > > > So I am not familiar with the SRAT format, but I think on ARM the
> > check
> > > > would look different: we would just look for multiple memory ranges
> > > > under a device_type = "memory" node of a NUMA node in device tree.
> > > >
> > > >
> > >
> > > Should I need to include/refine above message to commit log?
> > 
> > Let me ask you a question first.
> > 
> > With the NUMA implementation of this patch series, can we deal with
> > cases where each node has multiple memory banks, not interleaved?
> 
> Yes.
> 
> > An an example:
> > 
> > node0: 0x0        - 0x10000000
> > MMIO : 0x10000000 - 0x20000000
> > node0: 0x20000000 - 0x30000000
> > MMIO : 0x30000000 - 0x50000000
> > node1: 0x50000000 - 0x60000000
> > MMIO : 0x60000000 - 0x80000000
> > node2: 0x80000000 - 0x90000000
> > 
> > 
> > I assume we can deal with this case simply by setting node0 memory to
> > 0x0-0x30000000 even if there is actually something else, a device, that
> > doesn't belong to node0 in between the two node0 banks?
> 
> While this configuration is rare in SoC design, but it is not impossible. 

Definitely, I have seen it before.


> > Is it only other nodes' memory interleaved that cause issues? In other
> > words, only the following is a problematic scenario?
> > 
> > node0: 0x0        - 0x10000000
> > MMIO : 0x10000000 - 0x20000000
> > node1: 0x20000000 - 0x30000000
> > MMIO : 0x30000000 - 0x50000000
> > node0: 0x50000000 - 0x60000000
> > 
> > Because node1 is in between the two ranges of node0?
> > 
> 
> But only device_type="memory" can be added to allocation.
> For mmio there are two cases:
> 1. mmio doesn't have NUMA id property.
> 2. mmio has NUMA id property, just like some PCIe controllers.
>    But we don’t need to handle these kinds of MMIO devices
>    in memory block parsing. Because we don't need to allocate
>    memory from these mmio ranges. And for accessing, we need
>    a NUMA-aware PCIe controller driver or a generic NUMA-aware
>    MMIO accessing APIs.

Yes, I am not too worried about devices with a NUMA id property because
they are less common and this series doesn't handle them at all, right?
I imagine they would be treated like any other device without NUMA
awareness.

I am thinking about the case where the memory of each NUMA node is made
of multiple banks. I understand that this patch adds an explicit check
for cases where these banks are interleaving, however there are many
other cases where NUMA memory nodes are *not* interleaving but they are
still made of multiple discontinuous banks, like in the two examples
above.

My question is whether this patch series in its current form can handle
the two cases above correctly. If so, I am wondering how it works given
that we only have a single "start" and "size" parameter per node.

On the other hand if this series cannot handle the two cases above, my
question is whether it would fail explicitly or not. The new
check is_node_memory_continuous doesn't seem to be able to catch them.


> > I am asking these questions because it is certainly possible to have
> > multiple memory ranges for each NUMA node in device tree, either by
> > specifying multiple ranges with a single "reg" property, or by
> > specifying multiple memory nodes with the same numa-node-id.
> 
> 
>
Stefano Stabellini Sept. 27, 2021, 5:05 a.m. UTC | #6
On Sun, 26 Sep 2021, Stefano Stabellini wrote:
> On Sun, 26 Sep 2021, Wei Chen wrote:
> > > -----Original Message-----
> > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > Sent: 2021年9月25日 3:53
> > > To: Wei Chen <Wei.Chen@arm.com>
> > > Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> > > devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > > <Bertrand.Marquis@arm.com>; jbeulich@suse.com; andrew.cooper3@citrix.com;
> > > roger.pau@citrix.com; wl@xen.org
> > > Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> > > memory range
> > > 
> > > On Fri, 24 Sep 2021, Wei Chen wrote:
> > > > > -----Original Message-----
> > > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > > Sent: 2021年9月24日 8:26
> > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org;
> > > julien@xen.org;
> > > > > Bertrand Marquis <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> > > > > andrew.cooper3@citrix.com; roger.pau@citrix.com; wl@xen.org
> > > > > Subject: Re: [PATCH 08/37] xen/x86: add detection of discontinous node
> > > > > memory range
> > > > >
> > > > > CC'ing x86 maintainers
> > > > >
> > > > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > > > One NUMA node may contain several memory blocks. In current Xen
> > > > > > code, Xen will maintain a node memory range for each node to cover
> > > > > > all its memory blocks. But here comes the problem, in the gap of
> > > > > > one node's two memory blocks, if there are some memory blocks don't
> > > > > > belong to this node (remote memory blocks). This node's memory range
> > > > > > will be expanded to cover these remote memory blocks.
> > > > > >
> > > > > > One node's memory range contains othe nodes' memory, this is
> > > obviously
> > > > > > not very reasonable. This means current NUMA code only can support
> > > > > > node has continous memory blocks. However, on a physical machine,
> > > the
> > > > > > addresses of multiple nodes can be interleaved.
> > > > > >
> > > > > > So in this patch, we add code to detect discontinous memory blocks
> > > > > > for one node. NUMA initializtion will be failed and error messages
> > > > > > will be printed when Xen detect such hardware configuration.
> > > > >
> > > > > At least on ARM, it is not just memory that can be interleaved, but
> > > also
> > > > > MMIO regions. For instance:
> > > > >
> > > > > node0 bank0 0-0x1000000
> > > > > MMIO 0x1000000-0x1002000
> > > > > Hole 0x1002000-0x2000000
> > > > > node0 bank1 0x2000000-0x3000000
> > > > >
> > > > > So I am not familiar with the SRAT format, but I think on ARM the
> > > check
> > > > > would look different: we would just look for multiple memory ranges
> > > > > under a device_type = "memory" node of a NUMA node in device tree.
> > > > >
> > > > >
> > > >
> > > > Should I need to include/refine above message to commit log?
> > > 
> > > Let me ask you a question first.
> > > 
> > > With the NUMA implementation of this patch series, can we deal with
> > > cases where each node has multiple memory banks, not interleaved?
> > 
> > Yes.
> > 
> > > An an example:
> > > 
> > > node0: 0x0        - 0x10000000
> > > MMIO : 0x10000000 - 0x20000000
> > > node0: 0x20000000 - 0x30000000
> > > MMIO : 0x30000000 - 0x50000000
> > > node1: 0x50000000 - 0x60000000
> > > MMIO : 0x60000000 - 0x80000000
> > > node2: 0x80000000 - 0x90000000
> > > 
> > > 
> > > I assume we can deal with this case simply by setting node0 memory to
> > > 0x0-0x30000000 even if there is actually something else, a device, that
> > > doesn't belong to node0 in between the two node0 banks?
> > 
> > While this configuration is rare in SoC design, but it is not impossible. 
> 
> Definitely, I have seen it before.
> 
> 
> > > Is it only other nodes' memory interleaved that cause issues? In other
> > > words, only the following is a problematic scenario?
> > > 
> > > node0: 0x0        - 0x10000000
> > > MMIO : 0x10000000 - 0x20000000
> > > node1: 0x20000000 - 0x30000000
> > > MMIO : 0x30000000 - 0x50000000
> > > node0: 0x50000000 - 0x60000000
> > > 
> > > Because node1 is in between the two ranges of node0?
> > > 
> > 
> > But only device_type="memory" can be added to allocation.
> > For mmio there are two cases:
> > 1. mmio doesn't have NUMA id property.
> > 2. mmio has NUMA id property, just like some PCIe controllers.
> >    But we don’t need to handle these kinds of MMIO devices
> >    in memory block parsing. Because we don't need to allocate
> >    memory from these mmio ranges. And for accessing, we need
> >    a NUMA-aware PCIe controller driver or a generic NUMA-aware
> >    MMIO accessing APIs.
> 
> Yes, I am not too worried about devices with a NUMA id property because
> they are less common and this series doesn't handle them at all, right?
> I imagine they would be treated like any other device without NUMA
> awareness.
> 
> I am thinking about the case where the memory of each NUMA node is made
> of multiple banks. I understand that this patch adds an explicit check
> for cases where these banks are interleaving, however there are many
> other cases where NUMA memory nodes are *not* interleaving but they are
> still made of multiple discontinuous banks, like in the two example
> above.
> 
> My question is whether this patch series in its current form can handle
> the two cases above correctly. If so, I am wondering how it works given
> that we only have a single "start" and "size" parameter per node.
> 
> On the other hand if this series cannot handle the two cases above, my
> question is whether it would fail explicitly or not. The new
> check is_node_memory_continuous doesn't seem to be able to catch them.


Looking at numa_update_node_memblks, it is clear that the code is meant
to increase the range of each numa node to cover even MMIO regions in
between memory banks. Also see the comment at the top of the file:

 * Assumes all memory regions belonging to a single proximity domain
 * are in one chunk. Holes between them will be included in the node.

So if there are multiple banks for each node, start and end are
stretched to cover the holes between them, and it works as long as
memory banks of different NUMA nodes don't interleave.
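
Condensed into a sketch (this mirrors the hunk quoted earlier in the
thread; it is not the verbatim Xen code):

#include <stdint.h>

typedef uint64_t paddr_t;

/* One [start, end) span per NUMA node, as in xen/arch/x86/srat.c (simplified). */
struct node {
    paddr_t start, end;
};

/*
 * Every parsed bank only widens the node's single span, so holes (MMIO,
 * unused address space) between a node's banks are silently absorbed.
 * The layout is only rejected when the widened span ends up containing
 * another node's memory, which is what is_node_memory_continuous() checks.
 */
static void account_bank(struct node *nd, paddr_t start, paddr_t end)
{
    if ( nd->start > start )
        nd->start = start;
    if ( nd->end < end )
        nd->end = end;
}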

I would appreciate it if you could add an in-code comment to explain
this on top of numa_update_node_memblks.

Have you had a chance to test this? If not it would be fantastic if you
could give it a quick test to make sure it works as intended: for
instance by creating multiple memory banks for each NUMA node by
splitting a real bank into two smaller banks with a hole in between in the
device tree, just for the sake of testing.
Wei Chen Sept. 27, 2021, 9:50 a.m. UTC | #7
Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: 2021年9月27日 13:05
> To: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Wei Chen <Wei.Chen@arm.com>; xen-devel@lists.xenproject.org;
> julien@xen.org; Bertrand Marquis <Bertrand.Marquis@arm.com>;
> jbeulich@suse.com; andrew.cooper3@citrix.com; roger.pau@citrix.com;
> wl@xen.org
> Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> memory range
> 
> On Sun, 26 Sep 2021, Stefano Stabellini wrote:
> > On Sun, 26 Sep 2021, Wei Chen wrote:
> > > > -----Original Message-----
> > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > Sent: 2021年9月25日 3:53
> > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> > > > devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > > > <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> andrew.cooper3@citrix.com;
> > > > roger.pau@citrix.com; wl@xen.org
> > > > Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous
> node
> > > > memory range
> > > >
> > > > On Fri, 24 Sep 2021, Wei Chen wrote:
> > > > > > -----Original Message-----
> > > > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > > > Sent: 2021年9月24日 8:26
> > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org;
> > > > julien@xen.org;
> > > > > > Bertrand Marquis <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> > > > > > andrew.cooper3@citrix.com; roger.pau@citrix.com; wl@xen.org
> > > > > > Subject: Re: [PATCH 08/37] xen/x86: add detection of
> discontinous node
> > > > > > memory range
> > > > > >
> > > > > > CC'ing x86 maintainers
> > > > > >
> > > > > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > > > > One NUMA node may contain several memory blocks. In current
> Xen
> > > > > > > code, Xen will maintain a node memory range for each node to
> cover
> > > > > > > all its memory blocks. But here comes the problem, in the gap
> of
> > > > > > > one node's two memory blocks, if there are some memory blocks
> don't
> > > > > > > belong to this node (remote memory blocks). This node's memory
> range
> > > > > > > will be expanded to cover these remote memory blocks.
> > > > > > >
> > > > > > > One node's memory range contains othe nodes' memory, this is
> > > > obviously
> > > > > > > not very reasonable. This means current NUMA code only can
> support
> > > > > > > node has continous memory blocks. However, on a physical
> machine,
> > > > the
> > > > > > > addresses of multiple nodes can be interleaved.
> > > > > > >
> > > > > > > So in this patch, we add code to detect discontinous memory
> blocks
> > > > > > > for one node. NUMA initializtion will be failed and error
> messages
> > > > > > > will be printed when Xen detect such hardware configuration.
> > > > > >
> > > > > > At least on ARM, it is not just memory that can be interleaved,
> but
> > > > also
> > > > > > MMIO regions. For instance:
> > > > > >
> > > > > > node0 bank0 0-0x1000000
> > > > > > MMIO 0x1000000-0x1002000
> > > > > > Hole 0x1002000-0x2000000
> > > > > > node0 bank1 0x2000000-0x3000000
> > > > > >
> > > > > > So I am not familiar with the SRAT format, but I think on ARM
> the
> > > > check
> > > > > > would look different: we would just look for multiple memory
> ranges
> > > > > > under a device_type = "memory" node of a NUMA node in device
> tree.
> > > > > >
> > > > > >
> > > > >
> > > > > Should I need to include/refine above message to commit log?
> > > >
> > > > Let me ask you a question first.
> > > >
> > > > With the NUMA implementation of this patch series, can we deal with
> > > > cases where each node has multiple memory banks, not interleaved?
> > >
> > > Yes.
> > >
> > > > An an example:
> > > >
> > > > node0: 0x0        - 0x10000000
> > > > MMIO : 0x10000000 - 0x20000000
> > > > node0: 0x20000000 - 0x30000000
> > > > MMIO : 0x30000000 - 0x50000000
> > > > node1: 0x50000000 - 0x60000000
> > > > MMIO : 0x60000000 - 0x80000000
> > > > node2: 0x80000000 - 0x90000000
> > > >
> > > >
> > > > I assume we can deal with this case simply by setting node0 memory
> to
> > > > 0x0-0x30000000 even if there is actually something else, a device,
> that
> > > > doesn't belong to node0 in between the two node0 banks?
> > >
> > > While this configuration is rare in SoC design, but it is not
> impossible.
> >
> > Definitely, I have seen it before.
> >
> >
> > > > Is it only other nodes' memory interleaved that cause issues? In
> other
> > > > words, only the following is a problematic scenario?
> > > >
> > > > node0: 0x0        - 0x10000000
> > > > MMIO : 0x10000000 - 0x20000000
> > > > node1: 0x20000000 - 0x30000000
> > > > MMIO : 0x30000000 - 0x50000000
> > > > node0: 0x50000000 - 0x60000000
> > > >
> > > > Because node1 is in between the two ranges of node0?
> > > >
> > >
> > > But only device_type="memory" can be added to allocation.
> > > For mmio there are two cases:
> > > 1. mmio doesn't have NUMA id property.
> > > 2. mmio has NUMA id property, just like some PCIe controllers.
> > >    But we don’t need to handle these kinds of MMIO devices
> > >    in memory block parsing. Because we don't need to allocate
> > >    memory from these mmio ranges. And for accessing, we need
> > >    a NUMA-aware PCIe controller driver or a generic NUMA-aware
> > >    MMIO accessing APIs.
> >
> > Yes, I am not too worried about devices with a NUMA id property because
> > they are less common and this series doesn't handle them at all, right?
> > I imagine they would be treated like any other device without NUMA
> > awareness.
> >
> > I am thinking about the case where the memory of each NUMA node is made
> > of multiple banks. I understand that this patch adds an explicit check
> > for cases where these banks are interleaving, however there are many
> > other cases where NUMA memory nodes are *not* interleaving but they are
> > still made of multiple discontinuous banks, like in the two example
> > above.
> >
> > My question is whether this patch series in its current form can handle
> > the two cases above correctly. If so, I am wondering how it works given
> > that we only have a single "start" and "size" parameter per node.
> >
> > On the other hand if this series cannot handle the two cases above, my
> > question is whether it would fail explicitly or not. The new
> > check is_node_memory_continuous doesn't seem to be able to catch them.
> 
> 
> Looking at numa_update_node_memblks, it is clear that the code is meant
> to increase the range of each numa node to cover even MMIO regions in
> between memory banks. Also see the comment at the top of the file:
> 
>  * Assumes all memory regions belonging to a single proximity domain
>  * are in one chunk. Holes between them will be included in the node.
> 
> So if there are multiple banks for each node, start and end are
> stretched to cover the holes between them, and it works as long as
> memory banks of different NUMA nodes don't interleave.
> 
> I would appreciate if you could add an in-code comment to explain this
> on top of numa_update_node_memblk.

Yes, I will do it.

> 
> Have you had a chance to test this? If not it would be fantastic if you
> could give it a quick test to make sure it works as intended: for
> instance by creating multiple memory banks for each NUMA node by
> splitting an real bank into two smaller banks with a hole in between in
> device tree, just for the sake of testing.

Yes, I have created some fake NUMA nodes in the FVP device tree to test it.
The intertwining of the nodes' address ranges is detected:

(XEN) SRAT: Node 0 0000000080000000-00000000ff000000
(XEN) SRAT: Node 1 0000000880000000-00000008c0000000
(XEN) NODE 0: (0000000080000000-00000008d0000000) intertwine with NODE 1 (0000000880000000-00000008c0000000)
Stefano Stabellini Sept. 27, 2021, 5:19 p.m. UTC | #8
On Mon, 27 Sep 2021, Wei Chen wrote:
> > -----Original Message-----
> > From: Stefano Stabellini <sstabellini@kernel.org>
> > Sent: 2021年9月27日 13:05
> > To: Stefano Stabellini <sstabellini@kernel.org>
> > Cc: Wei Chen <Wei.Chen@arm.com>; xen-devel@lists.xenproject.org;
> > julien@xen.org; Bertrand Marquis <Bertrand.Marquis@arm.com>;
> > jbeulich@suse.com; andrew.cooper3@citrix.com; roger.pau@citrix.com;
> > wl@xen.org
> > Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> > memory range
> > 
> > On Sun, 26 Sep 2021, Stefano Stabellini wrote:
> > > On Sun, 26 Sep 2021, Wei Chen wrote:
> > > > > -----Original Message-----
> > > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > > Sent: 2021年9月25日 3:53
> > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> > > > > devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > > > > <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> > andrew.cooper3@citrix.com;
> > > > > roger.pau@citrix.com; wl@xen.org
> > > > > Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous
> > node
> > > > > memory range
> > > > >
> > > > > On Fri, 24 Sep 2021, Wei Chen wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > > > > Sent: 2021年9月24日 8:26
> > > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org;
> > > > > julien@xen.org;
> > > > > > > Bertrand Marquis <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> > > > > > > andrew.cooper3@citrix.com; roger.pau@citrix.com; wl@xen.org
> > > > > > > Subject: Re: [PATCH 08/37] xen/x86: add detection of
> > discontinous node
> > > > > > > memory range
> > > > > > >
> > > > > > > CC'ing x86 maintainers
> > > > > > >
> > > > > > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > > > > > One NUMA node may contain several memory blocks. In current
> > Xen
> > > > > > > > code, Xen will maintain a node memory range for each node to
> > cover
> > > > > > > > all its memory blocks. But here comes the problem, in the gap
> > of
> > > > > > > > one node's two memory blocks, if there are some memory blocks
> > don't
> > > > > > > > belong to this node (remote memory blocks). This node's memory
> > range
> > > > > > > > will be expanded to cover these remote memory blocks.
> > > > > > > >
> > > > > > > > One node's memory range contains othe nodes' memory, this is
> > > > > obviously
> > > > > > > > not very reasonable. This means current NUMA code only can
> > support
> > > > > > > > node has continous memory blocks. However, on a physical
> > machine,
> > > > > the
> > > > > > > > addresses of multiple nodes can be interleaved.
> > > > > > > >
> > > > > > > > So in this patch, we add code to detect discontinous memory
> > blocks
> > > > > > > > for one node. NUMA initializtion will be failed and error
> > messages
> > > > > > > > will be printed when Xen detect such hardware configuration.
> > > > > > >
> > > > > > > At least on ARM, it is not just memory that can be interleaved,
> > but
> > > > > also
> > > > > > > MMIO regions. For instance:
> > > > > > >
> > > > > > > node0 bank0 0-0x1000000
> > > > > > > MMIO 0x1000000-0x1002000
> > > > > > > Hole 0x1002000-0x2000000
> > > > > > > node0 bank1 0x2000000-0x3000000
> > > > > > >
> > > > > > > So I am not familiar with the SRAT format, but I think on ARM
> > the
> > > > > check
> > > > > > > would look different: we would just look for multiple memory
> > ranges
> > > > > > > under a device_type = "memory" node of a NUMA node in device
> > tree.
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > Should I need to include/refine above message to commit log?
> > > > >
> > > > > Let me ask you a question first.
> > > > >
> > > > > With the NUMA implementation of this patch series, can we deal with
> > > > > cases where each node has multiple memory banks, not interleaved?
> > > >
> > > > Yes.
> > > >
> > > > > An an example:
> > > > >
> > > > > node0: 0x0        - 0x10000000
> > > > > MMIO : 0x10000000 - 0x20000000
> > > > > node0: 0x20000000 - 0x30000000
> > > > > MMIO : 0x30000000 - 0x50000000
> > > > > node1: 0x50000000 - 0x60000000
> > > > > MMIO : 0x60000000 - 0x80000000
> > > > > node2: 0x80000000 - 0x90000000
> > > > >
> > > > >
> > > > > I assume we can deal with this case simply by setting node0 memory
> > to
> > > > > 0x0-0x30000000 even if there is actually something else, a device,
> > that
> > > > > doesn't belong to node0 in between the two node0 banks?
> > > >
> > > > While this configuration is rare in SoC design, but it is not
> > impossible.
> > >
> > > Definitely, I have seen it before.
> > >
> > >
> > > > > Is it only other nodes' memory interleaved that cause issues? In
> > other
> > > > > words, only the following is a problematic scenario?
> > > > >
> > > > > node0: 0x0        - 0x10000000
> > > > > MMIO : 0x10000000 - 0x20000000
> > > > > node1: 0x20000000 - 0x30000000
> > > > > MMIO : 0x30000000 - 0x50000000
> > > > > node0: 0x50000000 - 0x60000000
> > > > >
> > > > > Because node1 is in between the two ranges of node0?
> > > > >
> > > >
> > > > But only device_type="memory" can be added to allocation.
> > > > For mmio there are two cases:
> > > > 1. mmio doesn't have NUMA id property.
> > > > 2. mmio has NUMA id property, just like some PCIe controllers.
> > > >    But we don’t need to handle these kinds of MMIO devices
> > > >    in memory block parsing. Because we don't need to allocate
> > > >    memory from these mmio ranges. And for accessing, we need
> > > >    a NUMA-aware PCIe controller driver or a generic NUMA-aware
> > > >    MMIO accessing APIs.
> > >
> > > Yes, I am not too worried about devices with a NUMA id property because
> > > they are less common and this series doesn't handle them at all, right?
> > > I imagine they would be treated like any other device without NUMA
> > > awareness.
> > >
> > > I am thinking about the case where the memory of each NUMA node is made
> > > of multiple banks. I understand that this patch adds an explicit check
> > > for cases where these banks are interleaving, however there are many
> > > other cases where NUMA memory nodes are *not* interleaving but they are
> > > still made of multiple discontinuous banks, like in the two example
> > > above.
> > >
> > > My question is whether this patch series in its current form can handle
> > > the two cases above correctly. If so, I am wondering how it works given
> > > that we only have a single "start" and "size" parameter per node.
> > >
> > > On the other hand if this series cannot handle the two cases above, my
> > > question is whether it would fail explicitly or not. The new
> > > check is_node_memory_continuous doesn't seem to be able to catch them.
> > 
> > 
> > Looking at numa_update_node_memblks, it is clear that the code is meant
> > to increase the range of each numa node to cover even MMIO regions in
> > between memory banks. Also see the comment at the top of the file:
> > 
> >  * Assumes all memory regions belonging to a single proximity domain
> >  * are in one chunk. Holes between them will be included in the node.
> > 
> > So if there are multiple banks for each node, start and end are
> > stretched to cover the holes between them, and it works as long as
> > memory banks of different NUMA nodes don't interleave.
> > 
> > I would appreciate if you could add an in-code comment to explain this
> > on top of numa_update_node_memblk.
> 
> Yes, I will do it.
 
Thank you


> > Have you had a chance to test this? If not it would be fantastic if you
> > could give it a quick test to make sure it works as intended: for
> > instance by creating multiple memory banks for each NUMA node by
> > splitting an real bank into two smaller banks with a hole in between in
> > device tree, just for the sake of testing.
> 
> Yes, I have created some fake NUMA nodes in FVP device tree to test it.
> The intertwine of nodes' address can be detected.
> 
> (XEN) SRAT: Node 0 0000000080000000-00000000ff000000
> (XEN) SRAT: Node 1 0000000880000000-00000008c0000000
> (XEN) NODE 0: (0000000080000000-00000008d0000000) intertwine with NODE 1 (0000000880000000-00000008c0000000)

Great, thanks! And what if there are multiple non-contiguous memory banks
per node, but *not* intertwined? Does that all work correctly as
expected?
Wei Chen Sept. 28, 2021, 4:41 a.m. UTC | #9
Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: 2021年9月28日 1:19
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> <Bertrand.Marquis@arm.com>; jbeulich@suse.com; andrew.cooper3@citrix.com;
> roger.pau@citrix.com; wl@xen.org
> Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> memory range
> 
> On Mon, 27 Sep 2021, Wei Chen wrote:
> > > -----Original Message-----
> > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > Sent: 2021年9月27日 13:05
> > > To: Stefano Stabellini <sstabellini@kernel.org>
> > > Cc: Wei Chen <Wei.Chen@arm.com>; xen-devel@lists.xenproject.org;
> > > julien@xen.org; Bertrand Marquis <Bertrand.Marquis@arm.com>;
> > > jbeulich@suse.com; andrew.cooper3@citrix.com; roger.pau@citrix.com;
> > > wl@xen.org
> > > Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> > > memory range
> > >
> > > On Sun, 26 Sep 2021, Stefano Stabellini wrote:
> > > > On Sun, 26 Sep 2021, Wei Chen wrote:
> > > > > > -----Original Message-----
> > > > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > > > Sent: 2021年9月25日 3:53
> > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> > > > > > devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > > > > > <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> > > andrew.cooper3@citrix.com;
> > > > > > roger.pau@citrix.com; wl@xen.org
> > > > > > Subject: RE: [PATCH 08/37] xen/x86: add detection of
> discontinous
> > > node
> > > > > > memory range
> > > > > >
> > > > > > On Fri, 24 Sep 2021, Wei Chen wrote:
> > > > > > > > -----Original Message-----
> > > > > > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > > > > > Sent: 2021年9月24日 8:26
> > > > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > > > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org;
> > > > > > julien@xen.org;
> > > > > > > > Bertrand Marquis <Bertrand.Marquis@arm.com>;
> jbeulich@suse.com;
> > > > > > > > andrew.cooper3@citrix.com; roger.pau@citrix.com; wl@xen.org
> > > > > > > > Subject: Re: [PATCH 08/37] xen/x86: add detection of
> > > discontinous node
> > > > > > > > memory range
> > > > > > > >
> > > > > > > > CC'ing x86 maintainers
> > > > > > > >
> > > > > > > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > > > > > > One NUMA node may contain several memory blocks. In
> current
> > > Xen
> > > > > > > > > code, Xen will maintain a node memory range for each node
> to
> > > cover
> > > > > > > > > all its memory blocks. But here comes the problem, in the
> gap
> > > of
> > > > > > > > > one node's two memory blocks, if there are some memory
> blocks
> > > don't
> > > > > > > > > belong to this node (remote memory blocks). This node's
> memory
> > > range
> > > > > > > > > will be expanded to cover these remote memory blocks.
> > > > > > > > >
> > > > > > > > > One node's memory range contains othe nodes' memory, this
> is
> > > > > > obviously
> > > > > > > > > not very reasonable. This means current NUMA code only can
> > > support
> > > > > > > > > node has continous memory blocks. However, on a physical
> > > machine,
> > > > > > the
> > > > > > > > > addresses of multiple nodes can be interleaved.
> > > > > > > > >
> > > > > > > > > So in this patch, we add code to detect discontinous
> memory
> > > blocks
> > > > > > > > > for one node. NUMA initializtion will be failed and error
> > > messages
> > > > > > > > > will be printed when Xen detect such hardware
> configuration.
> > > > > > > >
> > > > > > > > At least on ARM, it is not just memory that can be
> interleaved,
> > > but
> > > > > > also
> > > > > > > > MMIO regions. For instance:
> > > > > > > >
> > > > > > > > node0 bank0 0-0x1000000
> > > > > > > > MMIO 0x1000000-0x1002000
> > > > > > > > Hole 0x1002000-0x2000000
> > > > > > > > node0 bank1 0x2000000-0x3000000
> > > > > > > >
> > > > > > > > So I am not familiar with the SRAT format, but I think on
> ARM
> > > the
> > > > > > check
> > > > > > > > would look different: we would just look for multiple memory
> > > ranges
> > > > > > > > under a device_type = "memory" node of a NUMA node in device
> > > tree.
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > > Should I need to include/refine above message to commit log?
> > > > > >
> > > > > > Let me ask you a question first.
> > > > > >
> > > > > > With the NUMA implementation of this patch series, can we deal
> with
> > > > > > cases where each node has multiple memory banks, not interleaved?
> > > > >
> > > > > Yes.
> > > > >
> > > > > > An an example:
> > > > > >
> > > > > > node0: 0x0        - 0x10000000
> > > > > > MMIO : 0x10000000 - 0x20000000
> > > > > > node0: 0x20000000 - 0x30000000
> > > > > > MMIO : 0x30000000 - 0x50000000
> > > > > > node1: 0x50000000 - 0x60000000
> > > > > > MMIO : 0x60000000 - 0x80000000
> > > > > > node2: 0x80000000 - 0x90000000
> > > > > >
> > > > > >
> > > > > > I assume we can deal with this case simply by setting node0
> memory
> > > to
> > > > > > 0x0-0x30000000 even if there is actually something else, a
> device,
> > > that
> > > > > > doesn't belong to node0 in between the two node0 banks?
> > > > >
> > > > > While this configuration is rare in SoC design, but it is not
> > > impossible.
> > > >
> > > > Definitely, I have seen it before.
> > > >
> > > >
> > > > > > Is it only other nodes' memory interleaved that cause issues? In
> > > other
> > > > > > words, only the following is a problematic scenario?
> > > > > >
> > > > > > node0: 0x0        - 0x10000000
> > > > > > MMIO : 0x10000000 - 0x20000000
> > > > > > node1: 0x20000000 - 0x30000000
> > > > > > MMIO : 0x30000000 - 0x50000000
> > > > > > node0: 0x50000000 - 0x60000000
> > > > > >
> > > > > > Because node1 is in between the two ranges of node0?
> > > > > >
> > > > >
> > > > > But only device_type="memory" can be added to allocation.
> > > > > For mmio there are two cases:
> > > > > 1. mmio doesn't have NUMA id property.
> > > > > 2. mmio has NUMA id property, just like some PCIe controllers.
> > > > >    But we don’t need to handle these kinds of MMIO devices
> > > > >    in memory block parsing. Because we don't need to allocate
> > > > >    memory from these mmio ranges. And for accessing, we need
> > > > >    a NUMA-aware PCIe controller driver or a generic NUMA-aware
> > > > >    MMIO accessing APIs.
> > > >
> > > > Yes, I am not too worried about devices with a NUMA id property
> because
> > > > they are less common and this series doesn't handle them at all,
> right?
> > > > I imagine they would be treated like any other device without NUMA
> > > > awareness.
> > > >
> > > > I am thinking about the case where the memory of each NUMA node is
> made
> > > > of multiple banks. I understand that this patch adds an explicit
> check
> > > > for cases where these banks are interleaving, however there are many
> > > > other cases where NUMA memory nodes are *not* interleaving but they
> are
> > > > still made of multiple discontinuous banks, like in the two example
> > > > above.
> > > >
> > > > My question is whether this patch series in its current form can
> handle
> > > > the two cases above correctly. If so, I am wondering how it works
> given
> > > > that we only have a single "start" and "size" parameter per node.
> > > >
> > > > On the other hand if this series cannot handle the two cases above,
> my
> > > > question is whether it would fail explicitly or not. The new
> > > > check is_node_memory_continuous doesn't seem to be able to catch
> them.
> > >
> > >
> > > Looking at numa_update_node_memblks, it is clear that the code is
> meant
> > > to increase the range of each numa node to cover even MMIO regions in
> > > between memory banks. Also see the comment at the top of the file:
> > >
> > >  * Assumes all memory regions belonging to a single proximity domain
> > >  * are in one chunk. Holes between them will be included in the node.
> > >
> > > So if there are multiple banks for each node, start and end are
> > > stretched to cover the holes between them, and it works as long as
> > > memory banks of different NUMA nodes don't interleave.
> > >
> > > I would appreciate if you could add an in-code comment to explain this
> > > on top of numa_update_node_memblk.
> >
> > Yes, I will do it.
> 
> Thank you
> 
> 
> > > Have you had a chance to test this? If not it would be fantastic if
> you
> > > could give it a quick test to make sure it works as intended: for
> > > instance by creating multiple memory banks for each NUMA node by
> > > splitting an real bank into two smaller banks with a hole in between
> in
> > > device tree, just for the sake of testing.
> >
> > Yes, I have created some fake NUMA nodes in FVP device tree to test it.
> > The intertwine of nodes' address can be detected.
> >
> > (XEN) SRAT: Node 0 0000000080000000-00000000ff000000
> > (XEN) SRAT: Node 1 0000000880000000-00000008c0000000
> > (XEN) NODE 0: (0000000080000000-00000008d0000000) intertwine with NODE 1
> (0000000880000000-00000008c0000000)
> 
> Great thanks! And what if there are multiple non-contiguous memory banks
> per node, but *not* intertwined. Does that all work correctly as
> expected?

Yes, I am using a device tree setting like this:

    memory@80000000 {
        device_type = "memory";
        reg = <0x0 0x80000000 0x0 0x80000000>;
        numa-node-id = <0>;
    };

    memory@880000000 {
        device_type = "memory";
        reg = <0x8 0x80000000 0x0 0x8000000>;
        numa-node-id = <0>;
    };

    memory@890000000 {
        device_type = "memory";
        reg = <0x8 0x90000000 0x0 0x8000000>;
        numa-node-id = <0>;
    };

    memory@8A0000000 {
        device_type = "memory";
        reg = <0x8 0xA0000000 0x0 0x8000000>;
        numa-node-id = <0>;
    };

    memory@8B0000000 {
        device_type = "memory";
        reg = <0x8 0xB0000000 0x0 0x8000000>;
        numa-node-id = <0>;
    };

    memory@8C0000000 {
        device_type = "memory";
        reg = <0x8 0xC0000000 0x0 0x8000000>;
        numa-node-id = <1>;
    };

    memory@8D0000000 {
        device_type = "memory";
        reg = <0x8 0xD0000000 0x0 0x8000000>;
        numa-node-id = <1>;
    };

    memory@8E0000000 {
        device_type = "memory";
        reg = <0x8 0xE0000000 0x0 0x8000000>;
        numa-node-id = <1>;
    };

    memory@8F0000000 {
        device_type = "memory";
        reg = <0x8 0xF0000000 0x0 0x8000000>;
        numa-node-id = <1>;
    };

And in Xen we got the output:

(XEN) DT: NUMA node 0 processor parsed
(XEN) DT: NUMA node 0 processor parsed
(XEN) DT: NUMA node 1 processor parsed
(XEN) DT: NUMA node 1 processor parsed
(XEN) SRAT: Node 0 0000000080000000-00000000ff000000
(XEN) SRAT: Node 0 0000000880000000-0000000888000000
(XEN) SRAT: Node 0 0000000890000000-0000000898000000
(XEN) SRAT: Node 0 00000008a0000000-00000008a8000000
(XEN) SRAT: Node 0 00000008b0000000-00000008b8000000
(XEN) SRAT: Node 1 00000008c0000000-00000008c8000000
(XEN) SRAT: Node 1 00000008d0000000-00000008d8000000
(XEN) SRAT: Node 1 00000008e0000000-00000008e8000000
(XEN) SRAT: Node 1 00000008f0000000-00000008f8000000
(XEN) NUMA: parsing numa-distance-map
(XEN) NUMA: distance: NODE#0->NODE#0:10
(XEN) NUMA: distance: NODE#0->NODE#1:20
(XEN) NUMA: distance: NODE#1->NODE#1:10
(XEN) NUMA: Using 16 for the hash shift.
(XEN) Domain heap initialised
(XEN) Booting using Device Tree

Dom0 boots successfully, and xl info shows:
xl info
host                   : X-Dom0
release                : 5.12.0
version                : #20 SMP PREEMPT Wed Jul 28 13:41:28 CST 2021
machine                : aarch64
nr_cpus                : 4
max_cpu_id             : 3
nr_nodes               : 2
cores_per_socket       : 1
threads_per_core       : 1

Using the Xen debug console to dump NUMA info, we got:

(XEN) 'u' pressed -> dumping numa info (now = 13229372281010)
(XEN) NODE0 start->524288 size->8617984 free->388741
(XEN) NODE1 start->9175040 size->229376 free->106460
(XEN) CPU0...1 -> NODE0
(XEN) CPU2...3 -> NODE1
(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 262144):
(XEN)     Node 0: 262144
(XEN)     Node 1: 0
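
(For reference, the start/size figures in this dump appear to be 4KiB
page-frame numbers: NODE0 start 524288 * 4096 = 0x80000000, which matches the
base of the first node-0 bank, and NODE1 start 9175040 * 4096 = 0x8c0000000,
matching the first node-1 bank.)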
Stefano Stabellini Sept. 28, 2021, 4:59 a.m. UTC | #10
On Tue, 28 Sep 2021, Wei Chen wrote:
> Hi Stefano,
> 
> > -----Original Message-----
> > From: Stefano Stabellini <sstabellini@kernel.org>
> > Sent: 2021年9月28日 1:19
> > To: Wei Chen <Wei.Chen@arm.com>
> > Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> > devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > <Bertrand.Marquis@arm.com>; jbeulich@suse.com; andrew.cooper3@citrix.com;
> > roger.pau@citrix.com; wl@xen.org
> > Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> > memory range
> > 
> > On Mon, 27 Sep 2021, Wei Chen wrote:
> > > > -----Original Message-----
> > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > Sent: 2021年9月27日 13:05
> > > > To: Stefano Stabellini <sstabellini@kernel.org>
> > > > Cc: Wei Chen <Wei.Chen@arm.com>; xen-devel@lists.xenproject.org;
> > > > julien@xen.org; Bertrand Marquis <Bertrand.Marquis@arm.com>;
> > > > jbeulich@suse.com; andrew.cooper3@citrix.com; roger.pau@citrix.com;
> > > > wl@xen.org
> > > > Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> > > > memory range
> > > >
> > > > On Sun, 26 Sep 2021, Stefano Stabellini wrote:
> > > > > On Sun, 26 Sep 2021, Wei Chen wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > > > > Sent: 2021年9月25日 3:53
> > > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > > Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> > > > > > > devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > > > > > > <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> > > > andrew.cooper3@citrix.com;
> > > > > > > roger.pau@citrix.com; wl@xen.org
> > > > > > > Subject: RE: [PATCH 08/37] xen/x86: add detection of
> > discontinous
> > > > node
> > > > > > > memory range
> > > > > > >
> > > > > > > On Fri, 24 Sep 2021, Wei Chen wrote:
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > > > > > > Sent: 2021年9月24日 8:26
> > > > > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > > > > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org;
> > > > > > > julien@xen.org;
> > > > > > > > > Bertrand Marquis <Bertrand.Marquis@arm.com>;
> > jbeulich@suse.com;
> > > > > > > > > andrew.cooper3@citrix.com; roger.pau@citrix.com; wl@xen.org
> > > > > > > > > Subject: Re: [PATCH 08/37] xen/x86: add detection of
> > > > discontinous node
> > > > > > > > > memory range
> > > > > > > > >
> > > > > > > > > CC'ing x86 maintainers
> > > > > > > > >
> > > > > > > > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > > > > > > > One NUMA node may contain several memory blocks. In
> > current
> > > > Xen
> > > > > > > > > > code, Xen will maintain a node memory range for each node
> > to
> > > > cover
> > > > > > > > > > all its memory blocks. But here comes the problem, in the
> > gap
> > > > of
> > > > > > > > > > one node's two memory blocks, if there are some memory
> > blocks
> > > > don't
> > > > > > > > > > belong to this node (remote memory blocks). This node's
> > memory
> > > > range
> > > > > > > > > > will be expanded to cover these remote memory blocks.
> > > > > > > > > >
> > > > > > > > > > One node's memory range contains othe nodes' memory, this
> > is
> > > > > > > obviously
> > > > > > > > > > not very reasonable. This means current NUMA code only can
> > > > support
> > > > > > > > > > node has continous memory blocks. However, on a physical
> > > > machine,
> > > > > > > the
> > > > > > > > > > addresses of multiple nodes can be interleaved.
> > > > > > > > > >
> > > > > > > > > > So in this patch, we add code to detect discontinous
> > memory
> > > > blocks
> > > > > > > > > > for one node. NUMA initializtion will be failed and error
> > > > messages
> > > > > > > > > > will be printed when Xen detect such hardware
> > configuration.
> > > > > > > > >
> > > > > > > > > At least on ARM, it is not just memory that can be
> > interleaved,
> > > > but
> > > > > > > also
> > > > > > > > > MMIO regions. For instance:
> > > > > > > > >
> > > > > > > > > node0 bank0 0-0x1000000
> > > > > > > > > MMIO 0x1000000-0x1002000
> > > > > > > > > Hole 0x1002000-0x2000000
> > > > > > > > > node0 bank1 0x2000000-0x3000000
> > > > > > > > >
> > > > > > > > > So I am not familiar with the SRAT format, but I think on
> > ARM
> > > > the
> > > > > > > check
> > > > > > > > > would look different: we would just look for multiple memory
> > > > ranges
> > > > > > > > > under a device_type = "memory" node of a NUMA node in device
> > > > tree.
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > > Should I need to include/refine above message to commit log?
> > > > > > >
> > > > > > > Let me ask you a question first.
> > > > > > >
> > > > > > > With the NUMA implementation of this patch series, can we deal
> > with
> > > > > > > cases where each node has multiple memory banks, not interleaved?
> > > > > >
> > > > > > Yes.
> > > > > >
> > > > > > > An an example:
> > > > > > >
> > > > > > > node0: 0x0        - 0x10000000
> > > > > > > MMIO : 0x10000000 - 0x20000000
> > > > > > > node0: 0x20000000 - 0x30000000
> > > > > > > MMIO : 0x30000000 - 0x50000000
> > > > > > > node1: 0x50000000 - 0x60000000
> > > > > > > MMIO : 0x60000000 - 0x80000000
> > > > > > > node2: 0x80000000 - 0x90000000
> > > > > > >
> > > > > > >
> > > > > > > I assume we can deal with this case simply by setting node0
> > memory
> > > > to
> > > > > > > 0x0-0x30000000 even if there is actually something else, a
> > device,
> > > > that
> > > > > > > doesn't belong to node0 in between the two node0 banks?
> > > > > >
> > > > > > While this configuration is rare in SoC design, but it is not
> > > > impossible.
> > > > >
> > > > > Definitely, I have seen it before.
> > > > >
> > > > >
> > > > > > > Is it only other nodes' memory interleaved that cause issues? In
> > > > other
> > > > > > > words, only the following is a problematic scenario?
> > > > > > >
> > > > > > > node0: 0x0        - 0x10000000
> > > > > > > MMIO : 0x10000000 - 0x20000000
> > > > > > > node1: 0x20000000 - 0x30000000
> > > > > > > MMIO : 0x30000000 - 0x50000000
> > > > > > > node0: 0x50000000 - 0x60000000
> > > > > > >
> > > > > > > Because node1 is in between the two ranges of node0?
> > > > > > >
> > > > > >
> > > > > > But only device_type="memory" can be added to allocation.
> > > > > > For mmio there are two cases:
> > > > > > 1. mmio doesn't have NUMA id property.
> > > > > > 2. mmio has NUMA id property, just like some PCIe controllers.
> > > > > >    But we don’t need to handle these kinds of MMIO devices
> > > > > >    in memory block parsing. Because we don't need to allocate
> > > > > >    memory from these mmio ranges. And for accessing, we need
> > > > > >    a NUMA-aware PCIe controller driver or a generic NUMA-aware
> > > > > >    MMIO accessing APIs.
> > > > >
> > > > > Yes, I am not too worried about devices with a NUMA id property
> > because
> > > > > they are less common and this series doesn't handle them at all,
> > right?
> > > > > I imagine they would be treated like any other device without NUMA
> > > > > awareness.
> > > > >
> > > > > I am thinking about the case where the memory of each NUMA node is
> > made
> > > > > of multiple banks. I understand that this patch adds an explicit
> > check
> > > > > for cases where these banks are interleaving, however there are many
> > > > > other cases where NUMA memory nodes are *not* interleaving but they
> > are
> > > > > still made of multiple discontinuous banks, like in the two example
> > > > > above.
> > > > >
> > > > > My question is whether this patch series in its current form can
> > handle
> > > > > the two cases above correctly. If so, I am wondering how it works
> > given
> > > > > that we only have a single "start" and "size" parameter per node.
> > > > >
> > > > > On the other hand if this series cannot handle the two cases above,
> > my
> > > > > question is whether it would fail explicitly or not. The new
> > > > > check is_node_memory_continuous doesn't seem to be able to catch
> > them.
> > > >
> > > >
> > > > Looking at numa_update_node_memblks, it is clear that the code is
> > meant
> > > > to increase the range of each numa node to cover even MMIO regions in
> > > > between memory banks. Also see the comment at the top of the file:
> > > >
> > > >  * Assumes all memory regions belonging to a single proximity domain
> > > >  * are in one chunk. Holes between them will be included in the node.
> > > >
> > > > So if there are multiple banks for each node, start and end are
> > > > stretched to cover the holes between them, and it works as long as
> > > > memory banks of different NUMA nodes don't interleave.
> > > >
> > > > I would appreciate if you could add an in-code comment to explain this
> > > > on top of numa_update_node_memblk.
> > >
> > > Yes, I will do it.
> > 
> > Thank you
> > 
> > 
> > > > Have you had a chance to test this? If not it would be fantastic if
> > you
> > > > could give it a quick test to make sure it works as intended: for
> > > > instance by creating multiple memory banks for each NUMA node by
> > > > splitting an real bank into two smaller banks with a hole in between
> > in
> > > > device tree, just for the sake of testing.
> > >
> > > Yes, I have created some fake NUMA nodes in FVP device tree to test it.
> > > The intertwine of nodes' address can be detected.
> > >
> > > (XEN) SRAT: Node 0 0000000080000000-00000000ff000000
> > > (XEN) SRAT: Node 1 0000000880000000-00000008c0000000
> > > (XEN) NODE 0: (0000000080000000-00000008d0000000) intertwine with NODE 1
> > (0000000880000000-00000008c0000000)
> > 
> > Great thanks! And what if there are multiple non-contiguous memory banks
> > per node, but *not* intertwined. Does that all work correctly as
> > expected?
> 
> Yes, I am using a device tree setting like this:

Perfect! Thank you!


>     memory@80000000 {
>         device_type = "memory";
>         reg = <0x0 0x80000000 0x0 0x80000000>;
>         numa-node-id = <0>;
>     };
> 
>     memory@880000000 {
>         device_type = "memory";
>         reg = <0x8 0x80000000 0x0 0x8000000>;
>         numa-node-id = <0>;
>     };
> 
>     memory@890000000 {
>         device_type = "memory";
>         reg = <0x8 0x90000000 0x0 0x8000000>;
>         numa-node-id = <0>;
>     };
> 
>     memory@8A0000000 {
>         device_type = "memory";
>         reg = <0x8 0xA0000000 0x0 0x8000000>;
>         numa-node-id = <0>;
>     };
> 
>     memory@8B0000000 {
>         device_type = "memory";
>         reg = <0x8 0xB0000000 0x0 0x8000000>;
>         numa-node-id = <0>;
>     };
> 
>     memory@8C0000000 {
>         device_type = "memory";
>         reg = <0x8 0xC0000000 0x0 0x8000000>;
>         numa-node-id = <1>;
>     };
> 
>     memory@8D0000000 {
>         device_type = "memory";
>         reg = <0x8 0xD0000000 0x0 0x8000000>;
>         numa-node-id = <1>;
>     };
> 
>     memory@8E0000000 {
>         device_type = "memory";
>         reg = <0x8 0xE0000000 0x0 0x8000000>;
>         numa-node-id = <1>;
>     };
> 
>     memory@8F0000000 {
>         device_type = "memory";
>         reg = <0x8 0xF0000000 0x0 0x8000000>;
>         numa-node-id = <1>;
>     };
> 
> And in Xen we got the output:
> 
> (XEN) DT: NUMA node 0 processor parsed
> (XEN) DT: NUMA node 0 processor parsed
> (XEN) DT: NUMA node 1 processor parsed
> (XEN) DT: NUMA node 1 processor parsed
> (XEN) SRAT: Node 0 0000000080000000-00000000ff000000
> (XEN) SRAT: Node 0 0000000880000000-0000000888000000
> (XEN) SRAT: Node 0 0000000890000000-0000000898000000
> (XEN) SRAT: Node 0 00000008a0000000-00000008a8000000
> (XEN) SRAT: Node 0 00000008b0000000-00000008b8000000
> (XEN) SRAT: Node 1 00000008c0000000-00000008c8000000
> (XEN) SRAT: Node 1 00000008d0000000-00000008d8000000
> (XEN) SRAT: Node 1 00000008e0000000-00000008e8000000
> (XEN) SRAT: Node 1 00000008f0000000-00000008f8000000
> (XEN) NUMA: parsing numa-distance-map
> (XEN) NUMA: distance: NODE#0->NODE#0:10
> (XEN) NUMA: distance: NODE#0->NODE#1:20
> (XEN) NUMA: distance: NODE#1->NODE#1:10
> (XEN) NUMA: Using 16 for the hash shift.
> (XEN) Domain heap initialised
> (XEN) Booting using Device Tree
> 
> Dom0 boots successfully, and xl info shows:
> xl info
> host                   : X-Dom0
> release                : 5.12.0
> version                : #20 SMP PREEMPT Wed Jul 28 13:41:28 CST 2021
> machine                : aarch64
> nr_cpus                : 4
> max_cpu_id             : 3
> nr_nodes               : 2
> cores_per_socket       : 1
> threads_per_core       : 1
> 
> Using the Xen debug console to dump NUMA info, we got:
> 
> (XEN) 'u' pressed -> dumping numa info (now = 13229372281010)
> (XEN) NODE0 start->524288 size->8617984 free->388741
> (XEN) NODE1 start->9175040 size->229376 free->106460
> (XEN) CPU0...1 -> NODE0
> (XEN) CPU2...3 -> NODE1
> (XEN) Memory location of each domain:
> (XEN) Domain 0 (total: 262144):
> (XEN)     Node 0: 262144
> (XEN)     Node 1: 0
> 
>
Jan Beulich Jan. 18, 2022, 4:13 p.m. UTC | #11
On 23.09.2021 14:02, Wei Chen wrote:
> One NUMA node may contain several memory blocks. In current Xen
> code, Xen will maintain a node memory range for each node to cover
> all its memory blocks. But here comes the problem, in the gap of
> one node's two memory blocks, if there are some memory blocks don't
> belong to this node (remote memory blocks). This node's memory range
> will be expanded to cover these remote memory blocks.
> 
> One node's memory range contains othe nodes' memory, this is obviously
> not very reasonable. This means current NUMA code only can support
> node has continous memory blocks. However, on a physical machine, the
> addresses of multiple nodes can be interleaved.
> 
> So in this patch, we add code to detect discontinous memory blocks
> for one node. NUMA initializtion will be failed and error messages
> will be printed when Xen detect such hardware configuration.

Luckily what you actually check for isn't as strict as "discontinuous":
What you're after is no interleaving of memory. A single node can still
have multiple discontiguous ranges (and that'll often be the case on
x86). Please adjust description and function name accordingly.

> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -271,6 +271,36 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
>  		       pxm, pa->apic_id, node);
>  }
>  
> +/*
> + * Check to see if there are other nodes within this node's range.
> + * We just need to check full contains situation. Because overlaps
> + * have been checked before by conflicting_memblks.
> + */
> +static bool __init is_node_memory_continuous(nodeid_t nid,
> +    paddr_t start, paddr_t end)

This indentation style demands indenting like ...

> +{
> +	nodeid_t i;

... variable declarations at function scope, i.e. in a Linux-style
file by a tab.

> +
> +	struct node *nd = &nodes[nid];
> +	for_each_node_mask(i, memory_nodes_parsed)

Please move the blank line to be between declarations and statements.

Also please make nd pointer-to const.

> +	{

In a Linux-style file opening braces do not belong on their own lines.

> +		/* Skip itself */
> +		if (i == nid)
> +			continue;
> +
> +		nd = &nodes[i];
> +		if (start < nd->start && nd->end < end)
> +		{
> +			printk(KERN_ERR
> +			       "NODE %u: (%"PRIpaddr"-%"PRIpaddr") intertwine with NODE %u (%"PRIpaddr"-%"PRIpaddr")\n",

s/intertwine/interleaves/ ?

> @@ -344,6 +374,12 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
>  				nd->start = start;
>  			if (nd->end < end)
>  				nd->end = end;
> +
> +			/* Check whether this range contains memory for other nodes */
> +			if (!is_node_memory_continuous(node, nd->start, nd->end)) {
> +				bad_srat();
> +				return;
> +			}

I think this check would better come before nodes[] gets updated? Otoh
bad_srat() will zap everything anyway ...

Jan
Wei Chen Jan. 19, 2022, 7:33 a.m. UTC | #12
Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 2022年1月19日 0:13
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Bertrand Marquis <Bertrand.Marquis@arm.com>; xen-
> devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org
> Subject: Re: [PATCH 08/37] xen/x86: add detection of discontinous node
> memory range
> 
> On 23.09.2021 14:02, Wei Chen wrote:
> > One NUMA node may contain several memory blocks. In current Xen
> > code, Xen will maintain a node memory range for each node to cover
> > all its memory blocks. But here comes the problem, in the gap of
> > one node's two memory blocks, if there are some memory blocks don't
> > belong to this node (remote memory blocks). This node's memory range
> > will be expanded to cover these remote memory blocks.
> >
> > One node's memory range contains othe nodes' memory, this is obviously
> > not very reasonable. This means current NUMA code only can support
> > node has continous memory blocks. However, on a physical machine, the
> > addresses of multiple nodes can be interleaved.
> >

I will adjust above paragraph to:
... This means current NUMA code only can support node has no interlaced
memory blocks. ...

> > So in this patch, we add code to detect discontinous memory blocks
> > for one node. NUMA initializtion will be failed and error messages
> > will be printed when Xen detect such hardware configuration.

I will adjust above paragraph to:
So in this patch, we add code to detect interleave of different nodes'
memory blocks. NUMA initialization will be ...

> 
> Luckily what you actually check for isn't as strict as "discontinuous":
> What you're after is no interleaving of memory. A single node can still
> have multiple discontiguous ranges (and that'll often be the case on
> x86). Please adjust description and function name accordingly.
> 

Yes, we're checking that there is no interlaced memory among nodes. Within one
node's memory range, the memory blocks can still be discontiguous.

I will rename the subject to:
"add detection of interlaced memory for different nodes"
and I will rename is_node_memory_continuous to:
node_without_interleave_memory.
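
As a rough sketch only (naming and message wording may still change in the
next version), the renamed helper could look something like below, with the
style fixes from the rest of this reply folded in:

static bool __init node_without_interleave_memory(nodeid_t nid,
						  paddr_t start, paddr_t end)
{
	nodeid_t i;

	for_each_node_mask(i, memory_nodes_parsed) {
		const struct node *nd = &nodes[i];

		/* Skip the node being checked. */
		if (i == nid)
			continue;

		/*
		 * Only full containment needs checking here; overlaps are
		 * already rejected by conflicting_memblks().
		 */
		if (start < nd->start && nd->end < end) {
			printk(KERN_ERR
			       "NODE %u: (%"PRIpaddr"-%"PRIpaddr") interleaves with NODE %u (%"PRIpaddr"-%"PRIpaddr")\n",
			       nid, start, end, i, nd->start, nd->end);
			return false;
		}
	}

	return true;
}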

> > --- a/xen/arch/x86/srat.c
> > +++ b/xen/arch/x86/srat.c
> > @@ -271,6 +271,36 @@ acpi_numa_processor_affinity_init(const struct
> acpi_srat_cpu_affinity *pa)
> >  		       pxm, pa->apic_id, node);
> >  }
> >
> > +/*
> > + * Check to see if there are other nodes within this node's range.
> > + * We just need to check full contains situation. Because overlaps
> > + * have been checked before by conflicting_memblks.
> > + */
> > +static bool __init is_node_memory_continuous(nodeid_t nid,
> > +    paddr_t start, paddr_t end)
> 
> This indentation style demands indenting like ...
> 

Ok.

> > +{
> > +	nodeid_t i;
> 
> ... variable declarations at function scope, i.e. in a Linux-style
> file by a tab.
> 
> > +
> > +	struct node *nd = &nodes[nid];
> > +	for_each_node_mask(i, memory_nodes_parsed)
> 
> Please move the blank line to be between declarations and statements.
> 
> Also please make nd pointer-to const.

Ok.

> 
> > +	{
> 
> In a Linux-style file opening braces do not belong on their own lines.
> 

Ok.

> > +		/* Skip itself */
> > +		if (i == nid)
> > +			continue;
> > +
> > +		nd = &nodes[i];
> > +		if (start < nd->start && nd->end < end)
> > +		{
> > +			printk(KERN_ERR
> > +			       "NODE %u: (%"PRIpaddr"-%"PRIpaddr") intertwine
> with NODE %u (%"PRIpaddr"-%"PRIpaddr")\n",
> 
> s/intertwine/interleaves/ ?

Yes, interleaves. I will fix it.

> 
> > @@ -344,6 +374,12 @@ acpi_numa_memory_affinity_init(const struct
> acpi_srat_mem_affinity *ma)
> >  				nd->start = start;
> >  			if (nd->end < end)
> >  				nd->end = end;
> > +
> > +			/* Check whether this range contains memory for other
> nodes */
> > +			if (!is_node_memory_continuous(node, nd->start, nd->end))
> {
> > +				bad_srat();
> > +				return;
> > +			}
> 
> I think this check would better come before nodes[] gets updated? Otoh
> bad_srat() will zap everything anyway ...

Yes, when I wrote this check, I considered that when the check fails,
bad_srat() makes NUMA initialization fail, so the values in nodes[] never
take effect. That is why I didn't adjust the order. But if bad_srat() is
changed in the future and nodes[] is used in a fallback path, this would
take more effort to debug. So I agree with you, and I will update the
order in the next version.
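
Just as a sketch of that reordering (not the final patch), the hunk could
stage the extended range and only commit it to nodes[] after the check
passes, using the renamed helper from above:

			/*
			 * Sketch only: in a real patch new_start/new_end
			 * would be declared at the top of the function.
			 */
			paddr_t new_start = nd->start, new_end = nd->end;

			if (start < new_start)
				new_start = start;
			if (new_end < end)
				new_end = end;

			/* Reject an extension that would swallow another node's memory. */
			if (!node_without_interleave_memory(node, new_start, new_end)) {
				bad_srat();
				return;
			}

			nd->start = new_start;
			nd->end = new_end;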

Thanks,
Wei Chen

> 
> Jan
Jan Beulich Jan. 19, 2022, 8:01 a.m. UTC | #13
On 19.01.2022 08:33, Wei Chen wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 2022年1月19日 0:13
>>
>> On 23.09.2021 14:02, Wei Chen wrote:
>>> One NUMA node may contain several memory blocks. In current Xen
>>> code, Xen will maintain a node memory range for each node to cover
>>> all its memory blocks. But here comes the problem, in the gap of
>>> one node's two memory blocks, if there are some memory blocks don't
>>> belong to this node (remote memory blocks). This node's memory range
>>> will be expanded to cover these remote memory blocks.
>>>
>>> One node's memory range contains othe nodes' memory, this is obviously
>>> not very reasonable. This means current NUMA code only can support
>>> node has continous memory blocks. However, on a physical machine, the
>>> addresses of multiple nodes can be interleaved.
>>>
> 
> I will adjust above paragraph to:
> ... This means current NUMA code only can support node has no interlaced
> memory blocks. ...
> 
>>> So in this patch, we add code to detect discontinous memory blocks
>>> for one node. NUMA initializtion will be failed and error messages
>>> will be printed when Xen detect such hardware configuration.
> 
> I will adjust above paragraph to:
> So in this patch, we add code to detect interleave of different nodes'
> memory blocks. NUMA initialization will be ...

Taking just this part of your reply (the issue continues later), may I
ask that you use a consistent term throughout this single patch? Mixing
"interlace" and "interleave" like you do may make people wonder whether
the two are intended to express slightly different aspects. Personally,
as per my suggestion, I'd prefer "interleave", but I'm not a native
speaker.

Jan
Wei Chen Jan. 19, 2022, 8:24 a.m. UTC | #14
Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 2022年1月19日 16:01
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Bertrand Marquis <Bertrand.Marquis@arm.com>; xen-
> devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org
> Subject: Re: [PATCH 08/37] xen/x86: add detection of discontinous node
> memory range
> 
> On 19.01.2022 08:33, Wei Chen wrote:
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: 2022年1月19日 0:13
> >>
> >> On 23.09.2021 14:02, Wei Chen wrote:
> >>> One NUMA node may contain several memory blocks. In current Xen
> >>> code, Xen will maintain a node memory range for each node to cover
> >>> all its memory blocks. But here comes the problem, in the gap of
> >>> one node's two memory blocks, if there are some memory blocks don't
> >>> belong to this node (remote memory blocks). This node's memory range
> >>> will be expanded to cover these remote memory blocks.
> >>>
> >>> One node's memory range contains othe nodes' memory, this is obviously
> >>> not very reasonable. This means current NUMA code only can support
> >>> node has continous memory blocks. However, on a physical machine, the
> >>> addresses of multiple nodes can be interleaved.
> >>>
> >
> > I will adjust above paragraph to:
> > ... This means current NUMA code only can support node has no interlaced
> > memory blocks. ...
> >
> >>> So in this patch, we add code to detect discontinous memory blocks
> >>> for one node. NUMA initializtion will be failed and error messages
> >>> will be printed when Xen detect such hardware configuration.
> >
> > I will adjust above paragraph to:
> > So in this patch, we add code to detect interleave of different nodes'
> > memory blocks. NUMA initialization will be ...
> 
> Taking just this part of your reply (the issue continues later), may I
> ask that you use a consistent term throughout this single patch? Mixing
> "interlace" and "interleave" like you do may make people wonder whether
> the two are intended to express slightly different aspects. Personally,
> as per my suggestion, I'd prefer "interleave", but I'm not a native
> speaker.
> 

Sorry, I am not a native speaker either. I had checked a dictionary for
"interlaced" before I used it: https://www.merriam-webster.com/thesaurus/interlaced

Obviously I'm probably using it incorrectly and making the text harder to
understand, so I will use "interleave" in my patches.

Thanks,
Wei Chen


> Jan
diff mbox series

Patch

diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 7d20d7f222..2f08fa4660 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -271,6 +271,36 @@  acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
 		       pxm, pa->apic_id, node);
 }
 
+/*
+ * Check to see if there are other nodes within this node's range.
+ * We just need to check full contains situation. Because overlaps
+ * have been checked before by conflicting_memblks.
+ */
+static bool __init is_node_memory_continuous(nodeid_t nid,
+    paddr_t start, paddr_t end)
+{
+	nodeid_t i;
+
+	struct node *nd = &nodes[nid];
+	for_each_node_mask(i, memory_nodes_parsed)
+	{
+		/* Skip itself */
+		if (i == nid)
+			continue;
+
+		nd = &nodes[i];
+		if (start < nd->start && nd->end < end)
+		{
+			printk(KERN_ERR
+			       "NODE %u: (%"PRIpaddr"-%"PRIpaddr") intertwine with NODE %u (%"PRIpaddr"-%"PRIpaddr")\n",
+			       nid, start, end, i, nd->start, nd->end);
+			return false;
+		}
+	}
+
+	return true;
+}
+
 /* Callback for parsing of the Proximity Domain <-> Memory Area mappings */
 void __init
 acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
@@ -344,6 +374,12 @@  acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 				nd->start = start;
 			if (nd->end < end)
 				nd->end = end;
+
+			/* Check whether this range contains memory for other nodes */
+			if (!is_node_memory_continuous(node, nd->start, nd->end)) {
+				bad_srat();
+				return;
+			}
 		}
 	}
 	printk(KERN_INFO "SRAT: Node %u PXM %u %"PRIpaddr"-%"PRIpaddr"%s\n",