diff mbox series

hw/arm/boot: Use NUMA node ID in memory node name

Message ID 20210601073004.106490-1-gshan@redhat.com (mailing list archive)
State New, archived
Headers show
Series hw/arm/boot: Use NUMA node ID in memory node name | expand

Commit Message

Gavin Shan June 1, 2021, 7:30 a.m. UTC
We possibly populate empty nodes where memory isn't included and might
be hot added at late time. The FDT memory nodes can't be created due
to conflicts on their names if multiple empty nodes are specified.
For example, the VM fails to start with the following error messages.

  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
  -accel kvm -machine virt,gic-version=host                        \
  -cpu host -smp 4,sockets=2,cores=2,threads=1 -m 1024M,maxmem=64G \
  -object memory-backend-ram,id=mem0,size=512M                     \
  -object memory-backend-ram,id=mem1,size=512M                     \
  -numa node,nodeid=0,cpus=0-1,memdev=mem0                         \
  -numa node,nodeid=1,cpus=2-3,memdev=mem1                         \
  -numa node,nodeid=2                                              \
  -numa node,nodeid=3                                              \
    :
  -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes

  qemu-system-aarch64: FDT: Failed to create subnode /memory@80000000: \
                       FDT_ERR_EXISTS

This fixes the issue by using NUMA node ID or zero in the memory node
name to avoid the conflicting memory node names. With this applied, the
VM can boot successfully with above command lines.

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 hw/arm/boot.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Andrew Jones June 1, 2021, 7:50 a.m. UTC | #1
On Tue, Jun 01, 2021 at 03:30:04PM +0800, Gavin Shan wrote:
> We possibly populate empty nodes where memory isn't included and might
> be hot added at late time. The FDT memory nodes can't be created due
> to conflicts on their names if multiple empty nodes are specified.
> For example, the VM fails to start with the following error messages.
> 
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
>   -accel kvm -machine virt,gic-version=host                        \
>   -cpu host -smp 4,sockets=2,cores=2,threads=1 -m 1024M,maxmem=64G \
>   -object memory-backend-ram,id=mem0,size=512M                     \
>   -object memory-backend-ram,id=mem1,size=512M                     \
>   -numa node,nodeid=0,cpus=0-1,memdev=mem0                         \
>   -numa node,nodeid=1,cpus=2-3,memdev=mem1                         \
>   -numa node,nodeid=2                                              \
>   -numa node,nodeid=3                                              \
>     :
>   -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
> 
>   qemu-system-aarch64: FDT: Failed to create subnode /memory@80000000: \
>                        FDT_ERR_EXISTS
> 
> This fixes the issue by using NUMA node ID or zero in the memory node
> name to avoid the conflicting memory node names. With this applied, the
> VM can boot successfully with above command lines.
> 
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  hw/arm/boot.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/arm/boot.c b/hw/arm/boot.c
> index d7b059225e..3169bdf595 100644
> --- a/hw/arm/boot.c
> +++ b/hw/arm/boot.c
> @@ -432,7 +432,12 @@ static int fdt_add_memory_node(void *fdt, uint32_t acells, hwaddr mem_base,
>      char *nodename;
>      int ret;
>  
> -    nodename = g_strdup_printf("/memory@%" PRIx64, mem_base);
> +    if (numa_node_id >= 0) {
> +        nodename = g_strdup_printf("/memory@%d", numa_node_id);
> +    } else {
> +        nodename = g_strdup("/memory@0");
> +    }
> +
>      qemu_fdt_add_subnode(fdt, nodename);
>      qemu_fdt_setprop_string(fdt, nodename, "device_type", "memory");
>      ret = qemu_fdt_setprop_sized_cells(fdt, nodename, "reg", acells, mem_base,
> -- 
> 2.23.0
>

Hi Gavin,

Is it conventional to use the unit-address like this? If so, can you point
out where that's documented? If it's not conventional, then we shouldn't
do it. And then I'm not sure what we should do in this case. Here's a
couple links I found, but they don't really help...

https://devicetree-specification.readthedocs.io/en/latest/chapter2-devicetree-basics.html#sect-node-names
https://devicetree-specification.readthedocs.io/en/latest/chapter3-devicenodes.html#memory-node

Thanks,
drew
Gavin Shan June 2, 2021, 1:09 a.m. UTC | #2
Hi Drew,

On 6/1/21 5:50 PM, Andrew Jones wrote:
> On Tue, Jun 01, 2021 at 03:30:04PM +0800, Gavin Shan wrote:
>> We possibly populate empty nodes where memory isn't included and might
>> be hot added at late time. The FDT memory nodes can't be created due
>> to conflicts on their names if multiple empty nodes are specified.
>> For example, the VM fails to start with the following error messages.
>>
>>    /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
>>    -accel kvm -machine virt,gic-version=host                        \
>>    -cpu host -smp 4,sockets=2,cores=2,threads=1 -m 1024M,maxmem=64G \
>>    -object memory-backend-ram,id=mem0,size=512M                     \
>>    -object memory-backend-ram,id=mem1,size=512M                     \
>>    -numa node,nodeid=0,cpus=0-1,memdev=mem0                         \
>>    -numa node,nodeid=1,cpus=2-3,memdev=mem1                         \
>>    -numa node,nodeid=2                                              \
>>    -numa node,nodeid=3                                              \
>>      :
>>    -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
>>
>>    qemu-system-aarch64: FDT: Failed to create subnode /memory@80000000: \
>>                         FDT_ERR_EXISTS
>>
>> This fixes the issue by using NUMA node ID or zero in the memory node
>> name to avoid the conflicting memory node names. With this applied, the
>> VM can boot successfully with above command lines.
>>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   hw/arm/boot.c | 7 ++++++-
>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/arm/boot.c b/hw/arm/boot.c
>> index d7b059225e..3169bdf595 100644
>> --- a/hw/arm/boot.c
>> +++ b/hw/arm/boot.c
>> @@ -432,7 +432,12 @@ static int fdt_add_memory_node(void *fdt, uint32_t acells, hwaddr mem_base,
>>       char *nodename;
>>       int ret;
>>   
>> -    nodename = g_strdup_printf("/memory@%" PRIx64, mem_base);
>> +    if (numa_node_id >= 0) {
>> +        nodename = g_strdup_printf("/memory@%d", numa_node_id);
>> +    } else {
>> +        nodename = g_strdup("/memory@0");
>> +    }
>> +
>>       qemu_fdt_add_subnode(fdt, nodename);
>>       qemu_fdt_setprop_string(fdt, nodename, "device_type", "memory");
>>       ret = qemu_fdt_setprop_sized_cells(fdt, nodename, "reg", acells, mem_base,

[...]

> 
> Is it conventional to use the unit-address like this? If so, can you point
> out where that's documented? If it's not conventional, then we shouldn't
> do it. And then I'm not sure what we should do in this case. Here's a
> couple links I found, but they don't really help...
> 
> https://devicetree-specification.readthedocs.io/en/latest/chapter2-devicetree-basics.html#sect-node-names
> https://devicetree-specification.readthedocs.io/en/latest/chapter3-devicenodes.html#memory-node
> 

As stated in the document (section 2.2.1.1). It's conventional to take the first
address of 'reg' property as unit-address, but it's not mandatory as I understand:

(1) In section 2.2.1.1, the bus can specify additional format to unit-address.
(2) The device node name isn't used to identify the device node in Linux kernel.
     They are actually identified by 'device_type' property.
     (drivers/of/fdt.c::early_init_dt_scan_memory())

I think it's still nice to include the unit-address in meory node's name. For the
conflicting nodes, we add more suffix like below. I can update the code in v2 if
it's preferred way to go.

    memory@0
    memory@0-0                               # For empty NUMA node
    memory@0-1                               # For empty NUMA node
    memory@80000000
    memory@80000000-0                        # For empty NUMA node
    memory@80000000-1                        # For empty NUMA node

---

https://devicetree-specification.readthedocs.io/en/latest/chapter2-devicetree-basics.html#sect-node-names

The unit-address must match the first address specified in the reg property of the node.
If the node has no reg property, the @unit-address must be omitted and the node-name
alone differentiates the node from other nodes at the same level in the tree. The binding
for a particular bus may specify additional, more specific requirements for the format
of reg and the unit-address.

Thanks,
Gavin
Andrew Jones June 2, 2021, 11:36 a.m. UTC | #3
On Wed, Jun 02, 2021 at 11:09:32AM +1000, Gavin Shan wrote:
> Hi Drew,
> 
> On 6/1/21 5:50 PM, Andrew Jones wrote:
> > On Tue, Jun 01, 2021 at 03:30:04PM +0800, Gavin Shan wrote:
> > > We possibly populate empty nodes where memory isn't included and might
> > > be hot added at late time. The FDT memory nodes can't be created due
> > > to conflicts on their names if multiple empty nodes are specified.
> > > For example, the VM fails to start with the following error messages.
> > > 
> > >    /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
> > >    -accel kvm -machine virt,gic-version=host                        \
> > >    -cpu host -smp 4,sockets=2,cores=2,threads=1 -m 1024M,maxmem=64G \
> > >    -object memory-backend-ram,id=mem0,size=512M                     \
> > >    -object memory-backend-ram,id=mem1,size=512M                     \
> > >    -numa node,nodeid=0,cpus=0-1,memdev=mem0                         \
> > >    -numa node,nodeid=1,cpus=2-3,memdev=mem1                         \
> > >    -numa node,nodeid=2                                              \
> > >    -numa node,nodeid=3                                              \
> > >      :
> > >    -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
> > > 
> > >    qemu-system-aarch64: FDT: Failed to create subnode /memory@80000000: \
> > >                         FDT_ERR_EXISTS
> > > 
> > > This fixes the issue by using NUMA node ID or zero in the memory node
> > > name to avoid the conflicting memory node names. With this applied, the
> > > VM can boot successfully with above command lines.
> > > 
> > > Signed-off-by: Gavin Shan <gshan@redhat.com>
> > > ---
> > >   hw/arm/boot.c | 7 ++++++-
> > >   1 file changed, 6 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/hw/arm/boot.c b/hw/arm/boot.c
> > > index d7b059225e..3169bdf595 100644
> > > --- a/hw/arm/boot.c
> > > +++ b/hw/arm/boot.c
> > > @@ -432,7 +432,12 @@ static int fdt_add_memory_node(void *fdt, uint32_t acells, hwaddr mem_base,
> > >       char *nodename;
> > >       int ret;
> > > -    nodename = g_strdup_printf("/memory@%" PRIx64, mem_base);
> > > +    if (numa_node_id >= 0) {
> > > +        nodename = g_strdup_printf("/memory@%d", numa_node_id);
> > > +    } else {
> > > +        nodename = g_strdup("/memory@0");
> > > +    }
> > > +
> > >       qemu_fdt_add_subnode(fdt, nodename);
> > >       qemu_fdt_setprop_string(fdt, nodename, "device_type", "memory");
> > >       ret = qemu_fdt_setprop_sized_cells(fdt, nodename, "reg", acells, mem_base,
> 
> [...]
> 
> > 
> > Is it conventional to use the unit-address like this? If so, can you point
> > out where that's documented? If it's not conventional, then we shouldn't
> > do it. And then I'm not sure what we should do in this case. Here's a
> > couple links I found, but they don't really help...
> > 
> > https://devicetree-specification.readthedocs.io/en/latest/chapter2-devicetree-basics.html#sect-node-names
> > https://devicetree-specification.readthedocs.io/en/latest/chapter3-devicenodes.html#memory-node
> > 
> 
> As stated in the document (section 2.2.1.1). It's conventional to take the first
> address of 'reg' property as unit-address, but it's not mandatory as I understand:
> 
> (1) In section 2.2.1.1, the bus can specify additional format to unit-address.

The bus type we're using is the physical memory map. Does it allow this
use of the unit-address? I still haven't seen that documented anywhere.

> (2) The device node name isn't used to identify the device node in Linux kernel.
>     They are actually identified by 'device_type' property.
>     (drivers/of/fdt.c::early_init_dt_scan_memory())

This doesn't matter. We can't change DT node name formats just because
they may not impact Linux. We need to follow the DT standard and the Linux
convention.

> 
> I think it's still nice to include the unit-address in meory node's name. For the
> conflicting nodes, we add more suffix like below. I can update the code in v2 if
> it's preferred way to go.

I don't think we should change the semantics of the unit address at all,
not without either a) finding a document that says our use is OK or b)
getting it documented in the Linux kernel first.

Thanks,
drew

> 
>    memory@0
>    memory@0-0                               # For empty NUMA node
>    memory@0-1                               # For empty NUMA node
>    memory@80000000
>    memory@80000000-0                        # For empty NUMA node
>    memory@80000000-1                        # For empty NUMA node
> 
> ---
> 
> https://devicetree-specification.readthedocs.io/en/latest/chapter2-devicetree-basics.html#sect-node-names
> 
> The unit-address must match the first address specified in the reg property of the node.
> If the node has no reg property, the @unit-address must be omitted and the node-name
> alone differentiates the node from other nodes at the same level in the tree. The binding
> for a particular bus may specify additional, more specific requirements for the format
> of reg and the unit-address.
>
Gavin Shan June 3, 2021, 4:48 a.m. UTC | #4
Hi Drew,

On 6/2/21 9:36 PM, Andrew Jones wrote:
> On Wed, Jun 02, 2021 at 11:09:32AM +1000, Gavin Shan wrote:
>> On 6/1/21 5:50 PM, Andrew Jones wrote:
>>> On Tue, Jun 01, 2021 at 03:30:04PM +0800, Gavin Shan wrote:
>>>> We possibly populate empty nodes where memory isn't included and might
>>>> be hot added at late time. The FDT memory nodes can't be created due
>>>> to conflicts on their names if multiple empty nodes are specified.
>>>> For example, the VM fails to start with the following error messages.
>>>>
>>>>     /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
>>>>     -accel kvm -machine virt,gic-version=host                        \
>>>>     -cpu host -smp 4,sockets=2,cores=2,threads=1 -m 1024M,maxmem=64G \
>>>>     -object memory-backend-ram,id=mem0,size=512M                     \
>>>>     -object memory-backend-ram,id=mem1,size=512M                     \
>>>>     -numa node,nodeid=0,cpus=0-1,memdev=mem0                         \
>>>>     -numa node,nodeid=1,cpus=2-3,memdev=mem1                         \
>>>>     -numa node,nodeid=2                                              \
>>>>     -numa node,nodeid=3                                              \
>>>>       :
>>>>     -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
>>>>
>>>>     qemu-system-aarch64: FDT: Failed to create subnode /memory@80000000: \
>>>>                          FDT_ERR_EXISTS
>>>>
>>>> This fixes the issue by using NUMA node ID or zero in the memory node
>>>> name to avoid the conflicting memory node names. With this applied, the
>>>> VM can boot successfully with above command lines.
>>>>
>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>> ---
>>>>    hw/arm/boot.c | 7 ++++++-
>>>>    1 file changed, 6 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/hw/arm/boot.c b/hw/arm/boot.c
>>>> index d7b059225e..3169bdf595 100644
>>>> --- a/hw/arm/boot.c
>>>> +++ b/hw/arm/boot.c
>>>> @@ -432,7 +432,12 @@ static int fdt_add_memory_node(void *fdt, uint32_t acells, hwaddr mem_base,
>>>>        char *nodename;
>>>>        int ret;
>>>> -    nodename = g_strdup_printf("/memory@%" PRIx64, mem_base);
>>>> +    if (numa_node_id >= 0) {
>>>> +        nodename = g_strdup_printf("/memory@%d", numa_node_id);
>>>> +    } else {
>>>> +        nodename = g_strdup("/memory@0");
>>>> +    }
>>>> +
>>>>        qemu_fdt_add_subnode(fdt, nodename);
>>>>        qemu_fdt_setprop_string(fdt, nodename, "device_type", "memory");
>>>>        ret = qemu_fdt_setprop_sized_cells(fdt, nodename, "reg", acells, mem_base,
>>
>> [...]
>>
>>>
>>> Is it conventional to use the unit-address like this? If so, can you point
>>> out where that's documented? If it's not conventional, then we shouldn't
>>> do it. And then I'm not sure what we should do in this case. Here's a
>>> couple links I found, but they don't really help...
>>>
>>> https://devicetree-specification.readthedocs.io/en/latest/chapter2-devicetree-basics.html#sect-node-names
>>> https://devicetree-specification.readthedocs.io/en/latest/chapter3-devicenodes.html#memory-node
>>>
>>
>> As stated in the document (section 2.2.1.1). It's conventional to take the first
>> address of 'reg' property as unit-address, but it's not mandatory as I understand:
>>
>> (1) In section 2.2.1.1, the bus can specify additional format to unit-address.
> 
> The bus type we're using is the physical memory map. Does it allow this
> use of the unit-address? I still haven't seen that documented anywhere.
> 
>> (2) The device node name isn't used to identify the device node in Linux kernel.
>>      They are actually identified by 'device_type' property.
>>      (drivers/of/fdt.c::early_init_dt_scan_memory())
> 
> This doesn't matter. We can't change DT node name formats just because
> they may not impact Linux. We need to follow the DT standard and the Linux
> convention.
> 
>>
>> I think it's still nice to include the unit-address in meory node's name. For the
>> conflicting nodes, we add more suffix like below. I can update the code in v2 if
>> it's preferred way to go.
> 
> I don't think we should change the semantics of the unit address at all,
> not without either a) finding a document that says our use is OK or b)
> getting it documented in the Linux kernel first.
> 

[...]

I've sent one separate mail to check with Rob Herring. Hopefully he have
ideas as he is maintaining linux FDT subsystem. You have been included to
that thread. I didn't find something meaningful to this thread after doing
some google search either.

Yes, I agree with you we need to follow the specification strictly. It seems
it's uncertain about the 'physical memory map' bus binding requirements.

Thanks,
Gavin
Andrew Jones June 22, 2021, 7:13 a.m. UTC | #5
On Tue, Jun 22, 2021 at 06:53:41PM +1000, Gavin Shan wrote:
> Hi Drew,
> 
> On 6/3/21 2:48 PM, Gavin Shan wrote:
> > On 6/2/21 9:36 PM, Andrew Jones wrote:
> > > On Wed, Jun 02, 2021 at 11:09:32AM +1000, Gavin Shan wrote:
> > > > On 6/1/21 5:50 PM, Andrew Jones wrote:
> > > > > On Tue, Jun 01, 2021 at 03:30:04PM +0800, Gavin Shan wrote:
> > > > > > We possibly populate empty nodes where memory isn't included and might
> > > > > > be hot added at late time. The FDT memory nodes can't be created due
> > > > > > to conflicts on their names if multiple empty nodes are specified.
> > > > > > For example, the VM fails to start with the following error messages.
> > > > > > 
> > > > > >     /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
> > > > > >     -accel kvm -machine virt,gic-version=host                        \
> > > > > >     -cpu host -smp 4,sockets=2,cores=2,threads=1 -m 1024M,maxmem=64G \
> > > > > >     -object memory-backend-ram,id=mem0,size=512M                     \
> > > > > >     -object memory-backend-ram,id=mem1,size=512M                     \
> > > > > >     -numa node,nodeid=0,cpus=0-1,memdev=mem0                         \
> > > > > >     -numa node,nodeid=1,cpus=2-3,memdev=mem1                         \
> > > > > >     -numa node,nodeid=2                                              \
> > > > > >     -numa node,nodeid=3                                              \
> > > > > >       :
> > > > > >     -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
> > > > > > 
> > > > > >     qemu-system-aarch64: FDT: Failed to create subnode /memory@80000000: \
> > > > > >                          FDT_ERR_EXISTS
> > > > > > 
> > > > > > This fixes the issue by using NUMA node ID or zero in the memory node
> > > > > > name to avoid the conflicting memory node names. With this applied, the
> > > > > > VM can boot successfully with above command lines.
> > > > > > 
> > > > > > Signed-off-by: Gavin Shan <gshan@redhat.com>
> > > > > > ---
> > > > > >    hw/arm/boot.c | 7 ++++++-
> > > > > >    1 file changed, 6 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/hw/arm/boot.c b/hw/arm/boot.c
> > > > > > index d7b059225e..3169bdf595 100644
> > > > > > --- a/hw/arm/boot.c
> > > > > > +++ b/hw/arm/boot.c
> > > > > > @@ -432,7 +432,12 @@ static int fdt_add_memory_node(void *fdt, uint32_t acells, hwaddr mem_base,
> > > > > >        char *nodename;
> > > > > >        int ret;
> > > > > > -    nodename = g_strdup_printf("/memory@%" PRIx64, mem_base);
> > > > > > +    if (numa_node_id >= 0) {
> > > > > > +        nodename = g_strdup_printf("/memory@%d", numa_node_id);
> > > > > > +    } else {
> > > > > > +        nodename = g_strdup("/memory@0");
> > > > > > +    }
> > > > > > +
> > > > > >        qemu_fdt_add_subnode(fdt, nodename);
> > > > > >        qemu_fdt_setprop_string(fdt, nodename, "device_type", "memory");
> > > > > >        ret = qemu_fdt_setprop_sized_cells(fdt, nodename, "reg", acells, mem_base,
> 
> [...]
> 
> > 
> > I've sent one separate mail to check with Rob Herring. Hopefully he have
> > ideas as he is maintaining linux FDT subsystem. You have been included to
> > that thread. I didn't find something meaningful to this thread after doing
> > some google search either.
> > 
> > Yes, I agree with you we need to follow the specification strictly. It seems
> > it's uncertain about the 'physical memory map' bus binding requirements.
> > 
> 
> I didn't get expected answers from device-tree experts.

What response did you get? Can you please provide a link to the discussion?

> After rethinking about it,
> I plan to fix this like this way, but please let me know if it sounds sensible
> to you.
> 
> The idea is to assign a (not overlapped) dummy base address to each memory
> node in the device-tree. The dummy is (last_valid_memory_address + NUMA ID).
> The 'length' of the 'reg' property in the device-tree nodes, corresponding
> to empty NUMA nodes, is still zero. This ensures the nodes are still invalid
> until memory is added to these nodes.
> 
> I had the temporary patch for the implementation. It works fine and VM can
> boot up successfully.

I would like to be sure that the kernel developers for NUMA, memory
hotplug, and devicetree specifications are all happy with the approach
before adding it to QEMU.

Thanks,
drew
Gavin Shan June 22, 2021, 8:53 a.m. UTC | #6
Hi Drew,

On 6/3/21 2:48 PM, Gavin Shan wrote:
> On 6/2/21 9:36 PM, Andrew Jones wrote:
>> On Wed, Jun 02, 2021 at 11:09:32AM +1000, Gavin Shan wrote:
>>> On 6/1/21 5:50 PM, Andrew Jones wrote:
>>>> On Tue, Jun 01, 2021 at 03:30:04PM +0800, Gavin Shan wrote:
>>>>> We possibly populate empty nodes where memory isn't included and might
>>>>> be hot added at late time. The FDT memory nodes can't be created due
>>>>> to conflicts on their names if multiple empty nodes are specified.
>>>>> For example, the VM fails to start with the following error messages.
>>>>>
>>>>>     /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
>>>>>     -accel kvm -machine virt,gic-version=host                        \
>>>>>     -cpu host -smp 4,sockets=2,cores=2,threads=1 -m 1024M,maxmem=64G \
>>>>>     -object memory-backend-ram,id=mem0,size=512M                     \
>>>>>     -object memory-backend-ram,id=mem1,size=512M                     \
>>>>>     -numa node,nodeid=0,cpus=0-1,memdev=mem0                         \
>>>>>     -numa node,nodeid=1,cpus=2-3,memdev=mem1                         \
>>>>>     -numa node,nodeid=2                                              \
>>>>>     -numa node,nodeid=3                                              \
>>>>>       :
>>>>>     -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
>>>>>
>>>>>     qemu-system-aarch64: FDT: Failed to create subnode /memory@80000000: \
>>>>>                          FDT_ERR_EXISTS
>>>>>
>>>>> This fixes the issue by using NUMA node ID or zero in the memory node
>>>>> name to avoid the conflicting memory node names. With this applied, the
>>>>> VM can boot successfully with above command lines.
>>>>>
>>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>>> ---
>>>>>    hw/arm/boot.c | 7 ++++++-
>>>>>    1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/hw/arm/boot.c b/hw/arm/boot.c
>>>>> index d7b059225e..3169bdf595 100644
>>>>> --- a/hw/arm/boot.c
>>>>> +++ b/hw/arm/boot.c
>>>>> @@ -432,7 +432,12 @@ static int fdt_add_memory_node(void *fdt, uint32_t acells, hwaddr mem_base,
>>>>>        char *nodename;
>>>>>        int ret;
>>>>> -    nodename = g_strdup_printf("/memory@%" PRIx64, mem_base);
>>>>> +    if (numa_node_id >= 0) {
>>>>> +        nodename = g_strdup_printf("/memory@%d", numa_node_id);
>>>>> +    } else {
>>>>> +        nodename = g_strdup("/memory@0");
>>>>> +    }
>>>>> +
>>>>>        qemu_fdt_add_subnode(fdt, nodename);
>>>>>        qemu_fdt_setprop_string(fdt, nodename, "device_type", "memory");
>>>>>        ret = qemu_fdt_setprop_sized_cells(fdt, nodename, "reg", acells, mem_base,

[...]

> 
> I've sent one separate mail to check with Rob Herring. Hopefully he have
> ideas as he is maintaining linux FDT subsystem. You have been included to
> that thread. I didn't find something meaningful to this thread after doing
> some google search either.
> 
> Yes, I agree with you we need to follow the specification strictly. It seems
> it's uncertain about the 'physical memory map' bus binding requirements.
> 

I didn't get expected answers from device-tree experts. After rethinking about it,
I plan to fix this like this way, but please let me know if it sounds sensible
to you.

The idea is to assign a (not overlapped) dummy base address to each memory
node in the device-tree. The dummy is (last_valid_memory_address + NUMA ID).
The 'length' of the 'reg' property in the device-tree nodes, corresponding
to empty NUMA nodes, is still zero. This ensures the nodes are still invalid
until memory is added to these nodes.

I had the temporary patch for the implementation. It works fine and VM can
boot up successfully.

Thanks,
Gavin
Gavin Shan June 23, 2021, 4:43 a.m. UTC | #7
Hi Drew,

On 6/22/21 5:13 PM, Andrew Jones wrote:
> On Tue, Jun 22, 2021 at 06:53:41PM +1000, Gavin Shan wrote:
>> On 6/3/21 2:48 PM, Gavin Shan wrote:
>>> On 6/2/21 9:36 PM, Andrew Jones wrote:
>>>> On Wed, Jun 02, 2021 at 11:09:32AM +1000, Gavin Shan wrote:
>>>>> On 6/1/21 5:50 PM, Andrew Jones wrote:
>>>>>> On Tue, Jun 01, 2021 at 03:30:04PM +0800, Gavin Shan wrote:
>>>>>>> We possibly populate empty nodes where memory isn't included and might
>>>>>>> be hot added at late time. The FDT memory nodes can't be created due
>>>>>>> to conflicts on their names if multiple empty nodes are specified.
>>>>>>> For example, the VM fails to start with the following error messages.
>>>>>>>
>>>>>>>      /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
>>>>>>>      -accel kvm -machine virt,gic-version=host                        \
>>>>>>>      -cpu host -smp 4,sockets=2,cores=2,threads=1 -m 1024M,maxmem=64G \
>>>>>>>      -object memory-backend-ram,id=mem0,size=512M                     \
>>>>>>>      -object memory-backend-ram,id=mem1,size=512M                     \
>>>>>>>      -numa node,nodeid=0,cpus=0-1,memdev=mem0                         \
>>>>>>>      -numa node,nodeid=1,cpus=2-3,memdev=mem1                         \
>>>>>>>      -numa node,nodeid=2                                              \
>>>>>>>      -numa node,nodeid=3                                              \
>>>>>>>        :
>>>>>>>      -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
>>>>>>>
>>>>>>>      qemu-system-aarch64: FDT: Failed to create subnode /memory@80000000: \
>>>>>>>                           FDT_ERR_EXISTS
>>>>>>>
>>>>>>> This fixes the issue by using NUMA node ID or zero in the memory node
>>>>>>> name to avoid the conflicting memory node names. With this applied, the
>>>>>>> VM can boot successfully with above command lines.
>>>>>>>
>>>>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>>>>> ---
>>>>>>>     hw/arm/boot.c | 7 ++++++-
>>>>>>>     1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/hw/arm/boot.c b/hw/arm/boot.c
>>>>>>> index d7b059225e..3169bdf595 100644
>>>>>>> --- a/hw/arm/boot.c
>>>>>>> +++ b/hw/arm/boot.c
>>>>>>> @@ -432,7 +432,12 @@ static int fdt_add_memory_node(void *fdt, uint32_t acells, hwaddr mem_base,
>>>>>>>         char *nodename;
>>>>>>>         int ret;
>>>>>>> -    nodename = g_strdup_printf("/memory@%" PRIx64, mem_base);
>>>>>>> +    if (numa_node_id >= 0) {
>>>>>>> +        nodename = g_strdup_printf("/memory@%d", numa_node_id);
>>>>>>> +    } else {
>>>>>>> +        nodename = g_strdup("/memory@0");
>>>>>>> +    }
>>>>>>> +
>>>>>>>         qemu_fdt_add_subnode(fdt, nodename);
>>>>>>>         qemu_fdt_setprop_string(fdt, nodename, "device_type", "memory");
>>>>>>>         ret = qemu_fdt_setprop_sized_cells(fdt, nodename, "reg", acells, mem_base,
>>
>> [...]
>>
>>>
>>> I've sent one separate mail to check with Rob Herring. Hopefully he have
>>> ideas as he is maintaining linux FDT subsystem. You have been included to
>>> that thread. I didn't find something meaningful to this thread after doing
>>> some google search either.
>>>
>>> Yes, I agree with you we need to follow the specification strictly. It seems
>>> it's uncertain about the 'physical memory map' bus binding requirements.
>>>
>>
>> I didn't get expected answers from device-tree experts.
> 
> What response did you get? Can you please provide a link to the discussion?
> 

Sorry about the confusion. I meant "no response" by "expected answers". Here
is the mail sent to Rob before, no reply so far:

https://lkml.org/lkml/2021/6/2/1446

>> After rethinking about it,
>> I plan to fix this like this way, but please let me know if it sounds sensible
>> to you.
>>
>> The idea is to assign a (not overlapped) dummy base address to each memory
>> node in the device-tree. The dummy is (last_valid_memory_address + NUMA ID).
>> The 'length' of the 'reg' property in the device-tree nodes, corresponding
>> to empty NUMA nodes, is still zero. This ensures the nodes are still invalid
>> until memory is added to these nodes.
>>
>> I had the temporary patch for the implementation. It works fine and VM can
>> boot up successfully.
> 
> I would like to be sure that the kernel developers for NUMA, memory
> hotplug, and devicetree specifications are all happy with the approach
> before adding it to QEMU.
> 

As I understood, it won't break anything from perspectives of NUMA
and device-tree specification. First of all, NUMA cares the so-called
distance map and 'numa-node-id' property in the individual device-tree
nodes. The device-tree specification doesn't say 'length' in 'reg'
property of memory node can't be zero. Also, the linux device-tree
implementation has the check on 'length', the memory block won't be
added if it's zero.

Documentation/devicetree/bindings/numa.txt has more details about
the required device-tree NUMA properties.

I'm not sure about memory hotplug. I tried memory hot add and it seems
working finely. Memory hot-add/remove are working without issues:

/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
-accel kvm -machine virt,gic-version=host               \
-cpu host -smp 4,sockets=2,cores=2,threads=1            \
-m 4096M,slots=16,maxmem=64G                            \
-object memory-backend-ram,id=mem0,size=2048M           \
-object memory-backend-ram,id=mem1,size=2048M           \
-numa node,nodeid=0,cpus=0-1,memdev=mem0                \
-numa node,nodeid=1,cpus=2-3,memdev=mem1                \
-numa node,nodeid=2                                     \
-numa node,nodeid=3
    :

Memory hot-add
===============

# cat /proc/meminfo  | grep MemTotal
MemTotal:        4027472 kB
# cat /sys/devices/system/node/node2/meminfo | grep MemTotal
Node 2 MemTotal:       0 kB

(qemu) object_add memory-backend-ram,id=hp-mem0,size=1G
(qemu) device_add pc-dimm,id=dimm0,memdev=hp-mem0,node=3

# cat /proc/meminfo  | grep MemTotal
MemTotal:        5076048 kB
# cat /sys/devices/system/node/node2/meminfo | grep MemTotal
Node 2 MemTotal: 1048576 kB

Memory hot-remove
=================

(qemu) device_del dimm0
(qemu) object_del hp-mem0

# cat /proc/meminfo  | grep MemTotal
MemTotal:        4027472 kB
# cat /sys/devices/system/node/node2/meminfo | grep MemTotal
cat: can't open '/sys/devices/system/node/node2/meminfo': No such file or directory

After this point, the memory can be added again without issues with
"object_add" and "device_add". So it sounds reasonable to remove
the empty NUMA node during memory hot-remove.

Thanks,
Gavin
Andrew Jones June 23, 2021, 8:07 a.m. UTC | #8
On Wed, Jun 23, 2021 at 02:43:49PM +1000, Gavin Shan wrote:
> Hi Drew,
> 
> On 6/22/21 5:13 PM, Andrew Jones wrote:
> > On Tue, Jun 22, 2021 at 06:53:41PM +1000, Gavin Shan wrote:
> > > On 6/3/21 2:48 PM, Gavin Shan wrote:
> > > > On 6/2/21 9:36 PM, Andrew Jones wrote:
> > > > > On Wed, Jun 02, 2021 at 11:09:32AM +1000, Gavin Shan wrote:
> > > > > > On 6/1/21 5:50 PM, Andrew Jones wrote:
> > > > > > > On Tue, Jun 01, 2021 at 03:30:04PM +0800, Gavin Shan wrote:
> > > > > > > > We possibly populate empty nodes where memory isn't included and might
> > > > > > > > be hot added at late time. The FDT memory nodes can't be created due
> > > > > > > > to conflicts on their names if multiple empty nodes are specified.
> > > > > > > > For example, the VM fails to start with the following error messages.
> > > > > > > > 
> > > > > > > >      /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
> > > > > > > >      -accel kvm -machine virt,gic-version=host                        \
> > > > > > > >      -cpu host -smp 4,sockets=2,cores=2,threads=1 -m 1024M,maxmem=64G \
> > > > > > > >      -object memory-backend-ram,id=mem0,size=512M                     \
> > > > > > > >      -object memory-backend-ram,id=mem1,size=512M                     \
> > > > > > > >      -numa node,nodeid=0,cpus=0-1,memdev=mem0                         \
> > > > > > > >      -numa node,nodeid=1,cpus=2-3,memdev=mem1                         \
> > > > > > > >      -numa node,nodeid=2                                              \
> > > > > > > >      -numa node,nodeid=3                                              \
> > > > > > > >        :
> > > > > > > >      -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
> > > > > > > > 
> > > > > > > >      qemu-system-aarch64: FDT: Failed to create subnode /memory@80000000: \
> > > > > > > >                           FDT_ERR_EXISTS
> > > > > > > > 
> > > > > > > > This fixes the issue by using NUMA node ID or zero in the memory node
> > > > > > > > name to avoid the conflicting memory node names. With this applied, the
> > > > > > > > VM can boot successfully with above command lines.
> > > > > > > > 
> > > > > > > > Signed-off-by: Gavin Shan <gshan@redhat.com>
> > > > > > > > ---
> > > > > > > >     hw/arm/boot.c | 7 ++++++-
> > > > > > > >     1 file changed, 6 insertions(+), 1 deletion(-)
> > > > > > > > 
> > > > > > > > diff --git a/hw/arm/boot.c b/hw/arm/boot.c
> > > > > > > > index d7b059225e..3169bdf595 100644
> > > > > > > > --- a/hw/arm/boot.c
> > > > > > > > +++ b/hw/arm/boot.c
> > > > > > > > @@ -432,7 +432,12 @@ static int fdt_add_memory_node(void *fdt, uint32_t acells, hwaddr mem_base,
> > > > > > > >         char *nodename;
> > > > > > > >         int ret;
> > > > > > > > -    nodename = g_strdup_printf("/memory@%" PRIx64, mem_base);
> > > > > > > > +    if (numa_node_id >= 0) {
> > > > > > > > +        nodename = g_strdup_printf("/memory@%d", numa_node_id);
> > > > > > > > +    } else {
> > > > > > > > +        nodename = g_strdup("/memory@0");
> > > > > > > > +    }
> > > > > > > > +
> > > > > > > >         qemu_fdt_add_subnode(fdt, nodename);
> > > > > > > >         qemu_fdt_setprop_string(fdt, nodename, "device_type", "memory");
> > > > > > > >         ret = qemu_fdt_setprop_sized_cells(fdt, nodename, "reg", acells, mem_base,
> > > 
> > > [...]
> > > 
> > > > 
> > > > I've sent one separate mail to check with Rob Herring. Hopefully he have
> > > > ideas as he is maintaining linux FDT subsystem. You have been included to
> > > > that thread. I didn't find something meaningful to this thread after doing
> > > > some google search either.
> > > > 
> > > > Yes, I agree with you we need to follow the specification strictly. It seems
> > > > it's uncertain about the 'physical memory map' bus binding requirements.
> > > > 
> > > 
> > > I didn't get expected answers from device-tree experts.
> > 
> > What response did you get? Can you please provide a link to the discussion?
> > 
> 
> Sorry about the confusion. I meant "no response" by "expected answers". Here
> is the mail sent to Rob before, no reply so far:
> 
> https://lkml.org/lkml/2021/6/2/1446
> 
> > > After rethinking about it,
> > > I plan to fix this like this way, but please let me know if it sounds sensible
> > > to you.
> > > 
> > > The idea is to assign a (not overlapped) dummy base address to each memory
> > > node in the device-tree. The dummy is (last_valid_memory_address + NUMA ID).
> > > The 'length' of the 'reg' property in the device-tree nodes, corresponding
> > > to empty NUMA nodes, is still zero. This ensures the nodes are still invalid
> > > until memory is added to these nodes.

Since we don't know of any other established way to do this, then
proposing a solution makes sense. However, QEMU isn't the place to put it
first. Please write a Documentation/devicetree/bindings/ document and post
the patch to the kernel. If it gets accepted, then we can implement what
you've documented there.

Hopefully you'll get more interest and activity on your patch than on your
inquiry.

Thanks,
drew

> > > 
> > > I had the temporary patch for the implementation. It works fine and VM can
> > > boot up successfully.
> > 
> > I would like to be sure that the kernel developers for NUMA, memory
> > hotplug, and devicetree specifications are all happy with the approach
> > before adding it to QEMU.
> > 
> 
> As I understood, it won't break anything from perspectives of NUMA
> and device-tree specification. First of all, NUMA cares the so-called
> distance map and 'numa-node-id' property in the individual device-tree
> nodes. The device-tree specification doesn't say 'length' in 'reg'
> property of memory node can't be zero. Also, the linux device-tree
> implementation has the check on 'length', the memory block won't be
> added if it's zero.
> 
> Documentation/devicetree/bindings/numa.txt has more details about
> the required device-tree NUMA properties.
> 
> I'm not sure about memory hotplug. I tried memory hot add and it seems
> working finely. Memory hot-add/remove are working without issues:
> 
> /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
> -accel kvm -machine virt,gic-version=host               \
> -cpu host -smp 4,sockets=2,cores=2,threads=1            \
> -m 4096M,slots=16,maxmem=64G                            \
> -object memory-backend-ram,id=mem0,size=2048M           \
> -object memory-backend-ram,id=mem1,size=2048M           \
> -numa node,nodeid=0,cpus=0-1,memdev=mem0                \
> -numa node,nodeid=1,cpus=2-3,memdev=mem1                \
> -numa node,nodeid=2                                     \
> -numa node,nodeid=3
>    :
> 
> Memory hot-add
> ===============
> 
> # cat /proc/meminfo  | grep MemTotal
> MemTotal:        4027472 kB
> # cat /sys/devices/system/node/node2/meminfo | grep MemTotal
> Node 2 MemTotal:       0 kB
> 
> (qemu) object_add memory-backend-ram,id=hp-mem0,size=1G
> (qemu) device_add pc-dimm,id=dimm0,memdev=hp-mem0,node=3
> 
> # cat /proc/meminfo  | grep MemTotal
> MemTotal:        5076048 kB
> # cat /sys/devices/system/node/node2/meminfo | grep MemTotal
> Node 2 MemTotal: 1048576 kB
> 
> Memory hot-remove
> =================
> 
> (qemu) device_del dimm0
> (qemu) object_del hp-mem0
> 
> # cat /proc/meminfo  | grep MemTotal
> MemTotal:        4027472 kB
> # cat /sys/devices/system/node/node2/meminfo | grep MemTotal
> cat: can't open '/sys/devices/system/node/node2/meminfo': No such file or directory
> 
> After this point, the memory can be added again without issues with
> "object_add" and "device_add". So it sounds reasonable to remove
> the empty NUMA node during memory hot-remove.
> 
> Thanks,
> Gavin
>
Andrew Jones June 23, 2021, 8:16 a.m. UTC | #9
On Wed, Jun 23, 2021 at 10:07:36AM +0200, Andrew Jones wrote:
> On Wed, Jun 23, 2021 at 02:43:49PM +1000, Gavin Shan wrote:
> > Hi Drew,
> > 
> > On 6/22/21 5:13 PM, Andrew Jones wrote:
> > > On Tue, Jun 22, 2021 at 06:53:41PM +1000, Gavin Shan wrote:
> > > > On 6/3/21 2:48 PM, Gavin Shan wrote:
> > > > > On 6/2/21 9:36 PM, Andrew Jones wrote:
> > > > > > On Wed, Jun 02, 2021 at 11:09:32AM +1000, Gavin Shan wrote:
> > > > > > > On 6/1/21 5:50 PM, Andrew Jones wrote:
> > > > > > > > On Tue, Jun 01, 2021 at 03:30:04PM +0800, Gavin Shan wrote:
> > > > > > > > > We possibly populate empty nodes where memory isn't included and might
> > > > > > > > > be hot added at late time. The FDT memory nodes can't be created due
> > > > > > > > > to conflicts on their names if multiple empty nodes are specified.
> > > > > > > > > For example, the VM fails to start with the following error messages.
> > > > > > > > > 
> > > > > > > > >      /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
> > > > > > > > >      -accel kvm -machine virt,gic-version=host                        \
> > > > > > > > >      -cpu host -smp 4,sockets=2,cores=2,threads=1 -m 1024M,maxmem=64G \
> > > > > > > > >      -object memory-backend-ram,id=mem0,size=512M                     \
> > > > > > > > >      -object memory-backend-ram,id=mem1,size=512M                     \
> > > > > > > > >      -numa node,nodeid=0,cpus=0-1,memdev=mem0                         \
> > > > > > > > >      -numa node,nodeid=1,cpus=2-3,memdev=mem1                         \
> > > > > > > > >      -numa node,nodeid=2                                              \
> > > > > > > > >      -numa node,nodeid=3                                              \
> > > > > > > > >        :
> > > > > > > > >      -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
> > > > > > > > > 
> > > > > > > > >      qemu-system-aarch64: FDT: Failed to create subnode /memory@80000000: \
> > > > > > > > >                           FDT_ERR_EXISTS
> > > > > > > > > 
> > > > > > > > > This fixes the issue by using NUMA node ID or zero in the memory node
> > > > > > > > > name to avoid the conflicting memory node names. With this applied, the
> > > > > > > > > VM can boot successfully with above command lines.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Gavin Shan <gshan@redhat.com>
> > > > > > > > > ---
> > > > > > > > >     hw/arm/boot.c | 7 ++++++-
> > > > > > > > >     1 file changed, 6 insertions(+), 1 deletion(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/hw/arm/boot.c b/hw/arm/boot.c
> > > > > > > > > index d7b059225e..3169bdf595 100644
> > > > > > > > > --- a/hw/arm/boot.c
> > > > > > > > > +++ b/hw/arm/boot.c
> > > > > > > > > @@ -432,7 +432,12 @@ static int fdt_add_memory_node(void *fdt, uint32_t acells, hwaddr mem_base,
> > > > > > > > >         char *nodename;
> > > > > > > > >         int ret;
> > > > > > > > > -    nodename = g_strdup_printf("/memory@%" PRIx64, mem_base);
> > > > > > > > > +    if (numa_node_id >= 0) {
> > > > > > > > > +        nodename = g_strdup_printf("/memory@%d", numa_node_id);
> > > > > > > > > +    } else {
> > > > > > > > > +        nodename = g_strdup("/memory@0");
> > > > > > > > > +    }
> > > > > > > > > +
> > > > > > > > >         qemu_fdt_add_subnode(fdt, nodename);
> > > > > > > > >         qemu_fdt_setprop_string(fdt, nodename, "device_type", "memory");
> > > > > > > > >         ret = qemu_fdt_setprop_sized_cells(fdt, nodename, "reg", acells, mem_base,
> > > > 
> > > > [...]
> > > > 
> > > > > 
> > > > > I've sent one separate mail to check with Rob Herring. Hopefully he have
> > > > > ideas as he is maintaining linux FDT subsystem. You have been included to
> > > > > that thread. I didn't find something meaningful to this thread after doing
> > > > > some google search either.
> > > > > 
> > > > > Yes, I agree with you we need to follow the specification strictly. It seems
> > > > > it's uncertain about the 'physical memory map' bus binding requirements.
> > > > > 
> > > > 
> > > > I didn't get expected answers from device-tree experts.
> > > 
> > > What response did you get? Can you please provide a link to the discussion?
> > > 
> > 
> > Sorry about the confusion. I meant "no response" by "expected answers". Here
> > is the mail sent to Rob before, no reply so far:
> > 
> > https://lkml.org/lkml/2021/6/2/1446
> > 
> > > > After rethinking about it,
> > > > I plan to fix this like this way, but please let me know if it sounds sensible
> > > > to you.
> > > > 
> > > > The idea is to assign a (not overlapped) dummy base address to each memory
> > > > node in the device-tree. The dummy is (last_valid_memory_address + NUMA ID).
> > > > The 'length' of the 'reg' property in the device-tree nodes, corresponding
> > > > to empty NUMA nodes, is still zero. This ensures the nodes are still invalid
> > > > until memory is added to these nodes.
> 
> Since we don't know of any other established way to do this, then
> proposing a solution makes sense. However, QEMU isn't the place to put it
> first. Please write a Documentation/devicetree/bindings/ document and post
> the patch to the kernel. If it gets accepted, then we can implement what
> you've documented there.
> 
> Hopefully you'll get more interest and activity on your patch than on your
> inquiry.

Also, before posting the documentation, you may want to confirm that this
configuration is valid for real hardware. If not, then the solution to
this QEMU bug is to not allow this configuration.

Thanks,
drew

> 
> Thanks,
> drew
> 
> > > > 
> > > > I had the temporary patch for the implementation. It works fine and VM can
> > > > boot up successfully.
> > > 
> > > I would like to be sure that the kernel developers for NUMA, memory
> > > hotplug, and devicetree specifications are all happy with the approach
> > > before adding it to QEMU.
> > > 
> > 
> > As I understood, it won't break anything from perspectives of NUMA
> > and device-tree specification. First of all, NUMA cares the so-called
> > distance map and 'numa-node-id' property in the individual device-tree
> > nodes. The device-tree specification doesn't say 'length' in 'reg'
> > property of memory node can't be zero. Also, the linux device-tree
> > implementation has the check on 'length', the memory block won't be
> > added if it's zero.
> > 
> > Documentation/devicetree/bindings/numa.txt has more details about
> > the required device-tree NUMA properties.
> > 
> > I'm not sure about memory hotplug. I tried memory hot add and it seems
> > working finely. Memory hot-add/remove are working without issues:
> > 
> > /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
> > -accel kvm -machine virt,gic-version=host               \
> > -cpu host -smp 4,sockets=2,cores=2,threads=1            \
> > -m 4096M,slots=16,maxmem=64G                            \
> > -object memory-backend-ram,id=mem0,size=2048M           \
> > -object memory-backend-ram,id=mem1,size=2048M           \
> > -numa node,nodeid=0,cpus=0-1,memdev=mem0                \
> > -numa node,nodeid=1,cpus=2-3,memdev=mem1                \
> > -numa node,nodeid=2                                     \
> > -numa node,nodeid=3
> >    :
> > 
> > Memory hot-add
> > ===============
> > 
> > # cat /proc/meminfo  | grep MemTotal
> > MemTotal:        4027472 kB
> > # cat /sys/devices/system/node/node2/meminfo | grep MemTotal
> > Node 2 MemTotal:       0 kB
> > 
> > (qemu) object_add memory-backend-ram,id=hp-mem0,size=1G
> > (qemu) device_add pc-dimm,id=dimm0,memdev=hp-mem0,node=3
> > 
> > # cat /proc/meminfo  | grep MemTotal
> > MemTotal:        5076048 kB
> > # cat /sys/devices/system/node/node2/meminfo | grep MemTotal
> > Node 2 MemTotal: 1048576 kB
> > 
> > Memory hot-remove
> > =================
> > 
> > (qemu) device_del dimm0
> > (qemu) object_del hp-mem0
> > 
> > # cat /proc/meminfo  | grep MemTotal
> > MemTotal:        4027472 kB
> > # cat /sys/devices/system/node/node2/meminfo | grep MemTotal
> > cat: can't open '/sys/devices/system/node/node2/meminfo': No such file or directory
> > 
> > After this point, the memory can be added again without issues with
> > "object_add" and "device_add". So it sounds reasonable to remove
> > the empty NUMA node during memory hot-remove.
> > 
> > Thanks,
> > Gavin
> >
Gavin Shan June 24, 2021, 3:43 a.m. UTC | #10
On 6/23/21 6:16 PM, Andrew Jones wrote:
> On Wed, Jun 23, 2021 at 10:07:36AM +0200, Andrew Jones wrote:
>> On Wed, Jun 23, 2021 at 02:43:49PM +1000, Gavin Shan wrote:
>>> On 6/22/21 5:13 PM, Andrew Jones wrote:
>>>> On Tue, Jun 22, 2021 at 06:53:41PM +1000, Gavin Shan wrote:
>>>>> On 6/3/21 2:48 PM, Gavin Shan wrote:
>>>>>> On 6/2/21 9:36 PM, Andrew Jones wrote:
>>>>>>> On Wed, Jun 02, 2021 at 11:09:32AM +1000, Gavin Shan wrote:
>>>>>>>> On 6/1/21 5:50 PM, Andrew Jones wrote:
>>>>>>>>> On Tue, Jun 01, 2021 at 03:30:04PM +0800, Gavin Shan wrote:

[...]

>>>>> After rethinking about it,
>>>>> I plan to fix this like this way, but please let me know if it sounds sensible
>>>>> to you.
>>>>>
>>>>> The idea is to assign a (not overlapped) dummy base address to each memory
>>>>> node in the device-tree. The dummy is (last_valid_memory_address + NUMA ID).
>>>>> The 'length' of the 'reg' property in the device-tree nodes, corresponding
>>>>> to empty NUMA nodes, is still zero. This ensures the nodes are still invalid
>>>>> until memory is added to these nodes.
>>
>> Since we don't know of any other established way to do this, then
>> proposing a solution makes sense. However, QEMU isn't the place to put it
>> first. Please write a Documentation/devicetree/bindings/ document and post
>> the patch to the kernel. If it gets accepted, then we can implement what
>> you've documented there.
>>
>> Hopefully you'll get more interest and activity on your patch than on your
>> inquiry.
> 
> Also, before posting the documentation, you may want to confirm that this
> configuration is valid for real hardware. If not, then the solution to
> this QEMU bug is to not allow this configuration.
> 

Thanks for the nice suggestion, Drew. I've posted one patch and you've
copied. The real hardware could contain empty NUMA node. For example,
all memory DIMM are removed from that node and it becomes empty :)

Lets see what feedback will be provided by device-tree guys. If they
think empty NUMA nodes are allowed. Then I will repost the patch to
use the dummy base address to avoid breaking device-tree specification
at least.

Thanks,
Gavin
diff mbox series

Patch

diff --git a/hw/arm/boot.c b/hw/arm/boot.c
index d7b059225e..3169bdf595 100644
--- a/hw/arm/boot.c
+++ b/hw/arm/boot.c
@@ -432,7 +432,12 @@  static int fdt_add_memory_node(void *fdt, uint32_t acells, hwaddr mem_base,
     char *nodename;
     int ret;
 
-    nodename = g_strdup_printf("/memory@%" PRIx64, mem_base);
+    if (numa_node_id >= 0) {
+        nodename = g_strdup_printf("/memory@%d", numa_node_id);
+    } else {
+        nodename = g_strdup("/memory@0");
+    }
+
     qemu_fdt_add_subnode(fdt, nodename);
     qemu_fdt_setprop_string(fdt, nodename, "device_type", "memory");
     ret = qemu_fdt_setprop_sized_cells(fdt, nodename, "reg", acells, mem_base,