
[RFC,v4,4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM

Message ID 20220527122528.129445-5-aneesh.kumar@linux.ibm.com (mailing list archive)
State New
Series mm/demotion: Memory tiers and demotion

Commit Message

Aneesh Kumar K.V May 27, 2022, 12:25 p.m. UTC
From: Jagdish Gediya <jvgediya@linux.ibm.com>

By default, all nodes are assigned to DEFAULT_MEMORY_TIER, which
is memory tier 1, the tier designated for nodes with DRAM, so it
is not the right tier for dax devices.

Set the dax kmem device node's tier to MEMORY_TIER_PMEM. In the future,
support should be added to distinguish dax devices that should not be
in MEMORY_TIER_PMEM so that the right memory tier can be set for them.

Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 drivers/dax/kmem.c | 4 ++++
 mm/migrate.c       | 2 ++
 2 files changed, 6 insertions(+)
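
For context on the tier naming used throughout this thread: per the cover letter and the discussion below, the series identifies tiers by index and orders them by rank, with DRAM nodes landing in tier 1 by default. A minimal sketch of that mapping follows; only MEMORY_TIER_PMEM and DEFAULT_MEMORY_TIER appear in the patch itself, so the remaining macro name and the exact values are inferred from this thread, not copied from the series.

/* Illustrative sketch of the tier layout implied by this thread */
#define MEMORY_TIER_PMEM	2	/* "memtier2", rank MEMORY_RANK_PMEM */
#define MEMORY_TIER_DRAM	1	/* "memtier1", where DRAM nodes live;
					 * this macro name is inferred, not
					 * taken from the series */
#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM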

Comments

Bharata B Rao June 1, 2022, 6:29 a.m. UTC | #1
On 5/27/2022 5:55 PM, Aneesh Kumar K.V wrote:
> From: Jagdish Gediya <jvgediya@linux.ibm.com>
> 
> By default, all nodes are assigned to DEFAULT_MEMORY_TIER which
> is memory tier 1 which is designated for nodes with DRAM, so it
> is not the right tier for dax devices.
> 
> Set dax kmem device node's tier to MEMORY_TIER_PMEM, In future,
> support should be added to distinguish the dax-devices which should
> not be MEMORY_TIER_PMEM and right memory tier should be set for them.
> 
> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  drivers/dax/kmem.c | 4 ++++
>  mm/migrate.c       | 2 ++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index a37622060fff..991782aa2448 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -11,6 +11,7 @@
>  #include <linux/fs.h>
>  #include <linux/mm.h>
>  #include <linux/mman.h>
> +#include <linux/migrate.h>
>  #include "dax-private.h"
>  #include "bus.h"
>  
> @@ -147,6 +148,9 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  
>  	dev_set_drvdata(dev, data);
>  
> +#ifdef CONFIG_TIERED_MEMORY
> +	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
> +#endif

I was experimenting with this patchset and found this behaviour.
Here's what I did:

Boot a KVM guest with a vNVDIMM device, which ends up with the device_dax
driver by default.

Use it as RAM by binding it to the dax kmem driver. It now appears as
RAM with a new NUMA node that is put into memtier1 (the existing tier
where DRAM already exists).

I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
that expected to happen automatically when a node with a dax kmem
device comes up?

Regards,
Bharata.
Aneesh Kumar K.V June 1, 2022, 1:49 p.m. UTC | #2
On 6/1/22 11:59 AM, Bharata B Rao wrote:
> On 5/27/2022 5:55 PM, Aneesh Kumar K.V wrote:
>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>>
>> By default, all nodes are assigned to DEFAULT_MEMORY_TIER which
>> is memory tier 1 which is designated for nodes with DRAM, so it
>> is not the right tier for dax devices.
>>
>> Set dax kmem device node's tier to MEMORY_TIER_PMEM, In future,
>> support should be added to distinguish the dax-devices which should
>> not be MEMORY_TIER_PMEM and right memory tier should be set for them.
>>
>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>   drivers/dax/kmem.c | 4 ++++
>>   mm/migrate.c       | 2 ++
>>   2 files changed, 6 insertions(+)
>>
>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>> index a37622060fff..991782aa2448 100644
>> --- a/drivers/dax/kmem.c
>> +++ b/drivers/dax/kmem.c
>> @@ -11,6 +11,7 @@
>>   #include <linux/fs.h>
>>   #include <linux/mm.h>
>>   #include <linux/mman.h>
>> +#include <linux/migrate.h>
>>   #include "dax-private.h"
>>   #include "bus.h"
>>   
>> @@ -147,6 +148,9 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>   
>>   	dev_set_drvdata(dev, data);
>>   
>> +#ifdef CONFIG_TIERED_MEMORY
>> +	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
>> +#endif
> 
> I was experimenting with this patchset and found this behaviour.
> Here's what I did:
> 
> Boot a KVM guest with vNVDIMM device which ends up with device_dax
> driver by default.
> 
> Use it as RAM by binding it to dax kmem driver. It now appears as
> RAM with a new NUMA node that is put to memtier1 (the existing tier
> where DRAM already exists)
> 

That should have placed it in memtier2.

> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
> that expected to happen automatically when a node with dax kmem
> device comes up?
> 

This can happen if we added the same NUMA node to memtier1 before the
dax kmem driver initialized the pmem memory. Can you check, before the
above node_set_memory_tier_rank(), whether the specific NUMA node is
already part of any memory tier?
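
One way to check (a minimal debugging sketch; the lookup helper name below is a placeholder for whatever getter the series provides, not a confirmed API) would be to print the node's tier right before the explicit assignment in dev_dax_kmem_probe():

#ifdef CONFIG_TIERED_MEMORY
	/* node_get_memory_tier_id() is a placeholder name for the series' lookup helper */
	pr_info("kmem: node %d is in memtier%d before the explicit assignment\n",
		numa_node, node_get_memory_tier_id(numa_node));
	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
#endif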

Thank you for testing the patchset.
-aneesh
Bharata B Rao June 2, 2022, 6:36 a.m. UTC | #3
On 6/1/2022 7:19 PM, Aneesh Kumar K V wrote:
> On 6/1/22 11:59 AM, Bharata B Rao wrote:
>> I was experimenting with this patchset and found this behaviour.
>> Here's what I did:
>>
>> Boot a KVM guest with vNVDIMM device which ends up with device_dax
>> driver by default.
>>
>> Use it as RAM by binding it to dax kmem driver. It now appears as
>> RAM with a new NUMA node that is put to memtier1 (the existing tier
>> where DRAM already exists)
>>
> 
> That should have placed it in memtier2.
> 
>> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
>> that expected to happen automatically when a node with dax kmem
>> device comes up?
>>
> 
> This can happen if we have added the same NUMA node to memtier1 before dax kmem driver initialized the pmem memory. Can you check before the above node_set_memory_tier_rank() whether the specific NUMA node is already part of any memory tier?

When we reach node_set_memory_tier_rank(), node1 (the node with the pmem device)
is already part of memtier1, whose nodelist shows 0-1.

Regards,
Bharata.
Aneesh Kumar K.V June 3, 2022, 9:04 a.m. UTC | #4
On 6/2/22 12:06 PM, Bharata B Rao wrote:
> On 6/1/2022 7:19 PM, Aneesh Kumar K V wrote:
>> On 6/1/22 11:59 AM, Bharata B Rao wrote:
>>> I was experimenting with this patchset and found this behaviour.
>>> Here's what I did:
>>>
>>> Boot a KVM guest with vNVDIMM device which ends up with device_dax
>>> driver by default.
>>>
>>> Use it as RAM by binding it to dax kmem driver. It now appears as
>>> RAM with a new NUMA node that is put to memtier1 (the existing tier
>>> where DRAM already exists)
>>>
>>
>> That should have placed it in memtier2.
>>
>>> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
>>> that expected to happen automatically when a node with dax kmem
>>> device comes up?
>>>
>>
>> This can happen if we have added the same NUMA node to memtier1 before dax kmem driver initialized the pmem memory. Can you check before the above node_set_memory_tier_rank() whether the specific NUMA node is already part of any memory tier?
> 
> When we reach node_set_memory_tier_rank(), node1 (that has the pmem device)
> is already part of memtier1 whose nodelist shows 0-1.
> 

Can you find out which code path added node1 to memtier1? Do you have 
regular memory also appearing on node1?

-aneesh
Bharata B Rao June 6, 2022, 10:11 a.m. UTC | #5
On 6/3/2022 2:34 PM, Aneesh Kumar K V wrote:
> On 6/2/22 12:06 PM, Bharata B Rao wrote:
>> On 6/1/2022 7:19 PM, Aneesh Kumar K V wrote:
>>> On 6/1/22 11:59 AM, Bharata B Rao wrote:
>>>> I was experimenting with this patchset and found this behaviour.
>>>> Here's what I did:
>>>>
>>>> Boot a KVM guest with vNVDIMM device which ends up with device_dax
>>>> driver by default.
>>>>
>>>> Use it as RAM by binding it to dax kmem driver. It now appears as
>>>> RAM with a new NUMA node that is put to memtier1 (the existing tier
>>>> where DRAM already exists)
>>>>
>>>
>>> That should have placed it in memtier2.
>>>
>>>> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
>>>> that expected to happen automatically when a node with dax kmem
>>>> device comes up?
>>>>
>>>
>>> This can happen if we have added the same NUMA node to memtier1 before dax kmem driver initialized the pmem memory. Can you check before the above node_set_memory_tier_rank() whether the specific NUMA node is already part of any memory tier?
>>
>> When we reach node_set_memory_tier_rank(), node1 (that has the pmem device)
>> is already part of memtier1 whose nodelist shows 0-1.
>>
> 
> can you find out which code path added node1 to memtier1?

 node_set_memory_tier_rank+0x63/0x80 
 migrate_on_reclaim_callback+0x40/0x4d 
 blocking_notifier_call_chain+0x68/0x90 
 memory_notify+0x1b/0x20 
 online_pages+0x257/0x2f0 
 memory_subsys_online+0x99/0x150 
 device_online+0x65/0x90 
 online_memory_block+0x1b/0x20 
 walk_memory_blocks+0x85/0xc0 
 ? generic_online_page+0x40/0x40 
 add_memory_resource+0x1fa/0x2d0 
 add_memory_driver_managed+0x80/0xc0 
 dev_dax_kmem_probe+0x1af/0x250 
 dax_bus_probe+0x6e/0xa0

After this, the explicit call to node_set_memory_tier_rank(numa_node, MEMORY_RANK_PMEM)
from dev_dax_kmem_probe() finds that the memtier is already set.
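
For readers following the trace, the sketch below is a rough reconstruction of what the series' MEM_ONLINE handling presumably does, inferred from the call chain above rather than copied from the patches (the rank constant name is assumed, and the MEM_OFFLINE/demotion-order parts of the callback are omitted). It shows how a newly onlined node lands in the default DRAM tier before dev_dax_kmem_probe() gets a chance to claim it:

static int migrate_on_reclaim_callback(struct notifier_block *self,
				       unsigned long action, void *_arg)
{
	struct memory_notify *arg = _arg;

	if (action == MEM_ONLINE)
		/*
		 * No tier has been recorded for this node yet, so it is
		 * placed in the default (DRAM) tier. MEMORY_RANK_DRAM is
		 * an assumed name for that default rank.
		 */
		node_set_memory_tier_rank(arg->status_change_nid,
					  MEMORY_RANK_DRAM);
	return NOTIFY_OK;
}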

> Do you have regular memory also appearing on node1?

No, regular memory is on Node0.

Regards,
Bharata.
Aneesh Kumar K.V June 6, 2022, 10:16 a.m. UTC | #6
On 6/6/22 3:41 PM, Bharata B Rao wrote:
> On 6/3/2022 2:34 PM, Aneesh Kumar K V wrote:
>> On 6/2/22 12:06 PM, Bharata B Rao wrote:
>>> On 6/1/2022 7:19 PM, Aneesh Kumar K V wrote:
>>>> On 6/1/22 11:59 AM, Bharata B Rao wrote:
>>>>> I was experimenting with this patchset and found this behaviour.
>>>>> Here's what I did:
>>>>>
>>>>> Boot a KVM guest with vNVDIMM device which ends up with device_dax
>>>>> driver by default.
>>>>>
>>>>> Use it as RAM by binding it to dax kmem driver. It now appears as
>>>>> RAM with a new NUMA node that is put to memtier1 (the existing tier
>>>>> where DRAM already exists)
>>>>>
>>>>
>>>> That should have placed it in memtier2.
>>>>
>>>>> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
>>>>> that expected to happen automatically when a node with dax kmem
>>>>> device comes up?
>>>>>
>>>>
>>>> This can happen if we have added the same NUMA node to memtier1 before dax kmem driver initialized the pmem memory. Can you check before the above node_set_memory_tier_rank() whether the specific NUMA node is already part of any memory tier?
>>>
>>> When we reach node_set_memory_tier_rank(), node1 (that has the pmem device)
>>> is already part of memtier1 whose nodelist shows 0-1.
>>>
>>
>> can you find out which code path added node1 to memtier1?
> 
>   node_set_memory_tier_rank+0x63/0x80
>   migrate_on_reclaim_callback+0x40/0x4d
>   blocking_notifier_call_chain+0x68/0x90
>   memory_notify+0x1b/0x20
>   online_pages+0x257/0x2f0
>   memory_subsys_online+0x99/0x150
>   device_online+0x65/0x90
>   online_memory_block+0x1b/0x20
>   walk_memory_blocks+0x85/0xc0
>   ? generic_online_page+0x40/0x40
>   add_memory_resource+0x1fa/0x2d0
>   add_memory_driver_managed+0x80/0xc0
>   dev_dax_kmem_probe+0x1af/0x250
>   dax_bus_probe+0x6e/0xa0
> 
> After this the explicit call to node_set_memory_tier_rank(numa_node, MEMORY_RANK_PMEM)
> from dev_dax_kmem_probe() finds that the memtier is already set.
> 
>> Do you have regular memory also appearing on node1?
> 
> No, regular memory is on Node0.
> 

Thanks for the stack trace. I was setting up KVM on my laptop to 
test this. We should move node_set_memory_tier() earlier. You had automatic 
onlining enabled on memory hotplug:

	/* online pages if requested */
	if (mhp_default_online_type != MMOP_OFFLINE)
		walk_memory_blocks(start, size, NULL, online_memory_block);


which caused the memory to be onlined before we could do node_set_memory_tier(). 
That is a bug on my side. I will send you a change after testing.

-aneesh
Aneesh Kumar K.V June 6, 2022, 11:54 a.m. UTC | #7
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> On 6/6/22 3:41 PM, Bharata B Rao wrote:
>> On 6/3/2022 2:34 PM, Aneesh Kumar K V wrote:
>>> On 6/2/22 12:06 PM, Bharata B Rao wrote:
>>>> On 6/1/2022 7:19 PM, Aneesh Kumar K V wrote:
>>>>> On 6/1/22 11:59 AM, Bharata B Rao wrote:
>>>>>> I was experimenting with this patchset and found this behaviour.
>>>>>> Here's what I did:
>>>>>>
>>>>>> Boot a KVM guest with vNVDIMM device which ends up with device_dax
>>>>>> driver by default.
>>>>>>
>>>>>> Use it as RAM by binding it to dax kmem driver. It now appears as
>>>>>> RAM with a new NUMA node that is put to memtier1 (the existing tier
>>>>>> where DRAM already exists)
>>>>>>
>>>>>
>>>>> That should have placed it in memtier2.
>>>>>
>>>>>> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
>>>>>> that expected to happen automatically when a node with dax kmem
>>>>>> device comes up?
>>>>>>
>>>>>
>>>>> This can happen if we have added the same NUMA node to memtier1 before dax kmem driver initialized the pmem memory. Can you check before the above node_set_memory_tier_rank() whether the specific NUMA node is already part of any memory tier?
>>>>
>>>> When we reach node_set_memory_tier_rank(), node1 (that has the pmem device)
>>>> is already part of memtier1 whose nodelist shows 0-1.
>>>>
>>>
>>> can you find out which code path added node1 to memtier1?
>> 
>>   node_set_memory_tier_rank+0x63/0x80
>>   migrate_on_reclaim_callback+0x40/0x4d
>>   blocking_notifier_call_chain+0x68/0x90
>>   memory_notify+0x1b/0x20
>>   online_pages+0x257/0x2f0
>>   memory_subsys_online+0x99/0x150
>>   device_online+0x65/0x90
>>   online_memory_block+0x1b/0x20
>>   walk_memory_blocks+0x85/0xc0
>>   ? generic_online_page+0x40/0x40
>>   add_memory_resource+0x1fa/0x2d0
>>   add_memory_driver_managed+0x80/0xc0
>>   dev_dax_kmem_probe+0x1af/0x250
>>   dax_bus_probe+0x6e/0xa0
>> 
>> After this the explicit call to node_set_memory_tier_rank(numa_node, MEMORY_RANK_PMEM)
>> from dev_dax_kmem_probe() finds that the memtier is already set.
>> 
>>> Do you have regular memory also appearing on node1?
>> 
>> No, regular memory is on Node0.
>> 
>
> Thanks for the stack trace. I was getting the kvm setup on my laptop to 
> test this. We should move node_set_mem_tier() early. You had automatic 
> online on memory hotplug
>
> 	/* online pages if requested */
> 	if (mhp_default_online_type != MMOP_OFFLINE)
> 		walk_memory_blocks(start, size, NULL, online_memory_block);
>
>
> which caused memory to be onlined before we could do node_set_mem_tier. 
> That is a bug on my side. Will send you a change after testing .
>
Can you try this change?

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 7a11c387fbbc..905609260dda 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -94,6 +94,17 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 		goto err_reg_mgid;
 	data->mgid = rc;
 
+	/*
+	 * This gets called before the node is brought online. That
+	 * is because, depending on the value of mhp_default_online_type,
+	 * the kernel will online the memory as part of the hotplug
+	 * operation. Add the new memory tier before we try to bring
+	 * the memory blocks online; otherwise the new node will get
+	 * added to the default memory tier via the hotplug callbacks.
+	 */
+#ifdef CONFIG_TIERED_MEMORY
+	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
+#endif
 	for (i = 0; i < dev_dax->nr_range; i++) {
 		struct resource *res;
 		struct range range;
@@ -148,9 +159,6 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 
 	dev_set_drvdata(dev, data);
 
-#ifdef CONFIG_TIERED_MEMORY
-	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
-#endif
 	return 0;
 
 err_request_mem:
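
To spell out the resulting ordering, here is a rough flow sketch of the probe path with the hunk above applied; it is reconstructed from the diffs and the earlier stack trace, not taken from the actual code:

/*
 * dev_dax_kmem_probe(), with the change above:
 *
 *   node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);   // tier recorded first
 *   add_memory_driver_managed(...)
 *     -> online_pages()          // when mhp_default_online_type != MMOP_OFFLINE
 *        -> memory_notify(MEM_ONLINE)
 *           -> migrate_on_reclaim_callback()
 *              // presumably finds the node already assigned to a tier and
 *              // leaves it alone, which is why the reordering helps
 */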
Bharata B Rao June 6, 2022, 12:09 p.m. UTC | #8
On 6/6/2022 5:24 PM, Aneesh Kumar K.V wrote:
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>
> Can you try this change?
> 
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 7a11c387fbbc..905609260dda 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -94,6 +94,17 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  		goto err_reg_mgid;
>  	data->mgid = rc;
>  
> +	/*
> +	 * This get called before the node is brought online. That
> +	 * is because depending on the value of mhp_default_online_type
> +	 * the kernel will online the memory along with hotplug
> +	 * operation. Add the new memory tier before we try to bring
> +	 * memory blocks online. Otherwise new node will get added to
> +	 * the default memory tier via hotplug callbacks.
> +	 */
> +#ifdef CONFIG_TIERED_MEMORY
> +	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
> +#endif
>  	for (i = 0; i < dev_dax->nr_range; i++) {
>  		struct resource *res;
>  		struct range range;
> @@ -148,9 +159,6 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  
>  	dev_set_drvdata(dev, data);
>  
> -#ifdef CONFIG_TIERED_MEMORY
> -	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
> -#endif
>  	return 0;
>  
>  err_request_mem:

Yes, this fixes the issue for me. Thanks.

Regards,
Bharata.
Aneesh Kumar K.V June 6, 2022, 1 p.m. UTC | #9
On 6/6/22 5:39 PM, Bharata B Rao wrote:
> On 6/6/2022 5:24 PM, Aneesh Kumar K.V wrote:
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>>
>> Can you try this change?
>>
>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>> index 7a11c387fbbc..905609260dda 100644
>> --- a/drivers/dax/kmem.c
>> +++ b/drivers/dax/kmem.c
>> @@ -94,6 +94,17 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>   		goto err_reg_mgid;
>>   	data->mgid = rc;
>>   
>> +	/*
>> +	 * This get called before the node is brought online. That
>> +	 * is because depending on the value of mhp_default_online_type
>> +	 * the kernel will online the memory along with hotplug
>> +	 * operation. Add the new memory tier before we try to bring
>> +	 * memory blocks online. Otherwise new node will get added to
>> +	 * the default memory tier via hotplug callbacks.
>> +	 */
>> +#ifdef CONFIG_TIERED_MEMORY
>> +	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
>> +#endif
>>   	for (i = 0; i < dev_dax->nr_range; i++) {
>>   		struct resource *res;
>>   		struct range range;
>> @@ -148,9 +159,6 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>   
>>   	dev_set_drvdata(dev, data);
>>   
>> -#ifdef CONFIG_TIERED_MEMORY
>> -	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
>> -#endif
>>   	return 0;
>>   
>>   err_request_mem:
> 
> Yes, this fixes the issue for me. Thanks.
> 

I might put the change below in instead of the one above. In the end, I guess it 
is better to add a NUMA node to a memory tier after the node is brought 
online rather than before, even though with the current code it shouldn't 
matter much.

modified   drivers/dax/kmem.c
@@ -147,9 +147,15 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
  	}

  	dev_set_drvdata(dev, data);
-
+	/*
+	 * node_reset_memory_tier is used here to force an update of the
+	 * NUMA node's memory tier. Depending on the value of
+	 * mhp_default_online_type, the kernel will online the memory
+	 * blocks as part of the hotplug operation above, which can result
+	 * in the dax kmem NUMA node getting added to the default memory tier.
+	 */
  #ifdef CONFIG_TIERED_MEMORY
-	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
+	node_reset_memory_tier(numa_node, MEMORY_TIER_PMEM);
  #endif
  	return 0;
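
The difference between the two helpers, as I read the thread, is roughly the following (a minimal sketch of the assumed semantics, not the series code; __node_clear_memory_tier() is a placeholder name):

/*
 * node_set_memory_tier():   assigns a tier only if the node has none yet,
 *                           so it is effectively a no-op once the hotplug
 *                           callback has already placed the node.
 * node_reset_memory_tier(): forces the move even if a tier is already set.
 */
int node_reset_memory_tier(int node, int tier)
{
	__node_clear_memory_tier(node);		/* placeholder helper name */
	return node_set_memory_tier(node, tier);
}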

Patch

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a37622060fff..991782aa2448 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -11,6 +11,7 @@ 
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
+#include <linux/migrate.h>
 #include "dax-private.h"
 #include "bus.h"
 
@@ -147,6 +148,9 @@  static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 
 	dev_set_drvdata(dev, data);
 
+#ifdef CONFIG_TIERED_MEMORY
+	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
+#endif
 	return 0;
 
 err_request_mem:
diff --git a/mm/migrate.c b/mm/migrate.c
index d819a64db5b1..59d8558dd2ee 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2418,6 +2418,8 @@  int node_set_memory_tier(int node, int tier)
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(node_set_memory_tier);
+
 
 /**
  * next_demotion_node() - Get the next node in the demotion path