diff mbox

[RFC,1/1] memremap: devm_memremap_pages has wrong nid

Message ID CAPcyv4jCg58-Je4hW=O4TT9ESeJTV-GFEUidLD9n60-z9w9V2Q@mail.gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Dan Williams Nov. 11, 2015, 9:46 p.m. UTC
On Wed, Nov 11, 2015 at 1:16 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
>
> The pmem dev as received in devm_memremap_pages() does not have
> its dev->numa_node properly initialized yet.
>
> What I see is that all pmem devices will call arch_add_memory
> with nid==0 regardless of the real nid the memory is actually
> on. Needless to say that after that all the NUMA page and zone
> members (at page->flags) come out wrong.
>
> If I do the below code it will all work properly.
>
> RFC because probably we want to fix dev->numa_node before the
> call to devm_memremap_pages.

Let's just do that instead.  I.e. in the case of NFIT numa node should
already be set, and in the case of the memmap=ss!nn or e820-type-12 we
can set the numa node like this:


Thanks for pointing out memory_add_physaddr_to_nid(). I did not know
that existed.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Boaz Harrosh Nov. 11, 2015, 10:05 p.m. UTC | #1
On 11/11/2015 11:46 PM, Dan Williams wrote:
> On Wed, Nov 11, 2015 at 1:16 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
>>
>> The pmem dev as received in devm_memremap_pages() does not have
>> its dev->numa_node properly initialized yet.
>>
>> What I see is that all pmem devices will call arch_add_memory
>> with nid==0 regardless of the real nid the memory is actually
>> on. Needless to say that after that all the NUMA page and zone
>> members (at page->flags) come out wrong.
>>
>> If I do the below code it will all work properly.
>>
>> RFC because probably we want to fix dev->numa_node before the
>> call to devm_memremap_pages.
> 
> Let's just do that instead.  I.e. in the case of NFIT numa node should
> already be set, and in the case of the memmap=ss!nn or e820-type-12 we
> can set the numa node like this:
> 
> diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c
> index 8282db2ef99e..e40df8fedf73 100644
> --- a/drivers/nvdimm/e820.c
> +++ b/drivers/nvdimm/e820.c
> @@ -48,7 +48,7 @@ static int e820_pmem_probe(struct platform_device *pdev)
>                memset(&ndr_desc, 0, sizeof(ndr_desc));
>                ndr_desc.res = p;
>                ndr_desc.attr_groups = e820_pmem_region_attribute_groups;
> -               ndr_desc.numa_node = NUMA_NO_NODE;
> +               ndr_desc.numa_node = memory_add_physaddr_to_nid(p->start);
>                set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
>                if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
>                        goto err;
> 
> Thanks for pointing out memory_add_physaddr_to_nid(). I did not know
> that existed.
> 

Thanks.

Its kind of late here now, please let me test this tomorrow and come back
to you, but it sounds good, thanks

Boaz

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kani, Toshi Nov. 13, 2015, 4 p.m. UTC | #2
On Wed, 2015-11-11 at 13:46 -0800, Dan Williams wrote:
> On Wed, Nov 11, 2015 at 1:16 PM, Boaz Harrosh <boaz@plexistor.com> wrote:
> > 
> > The pmem dev as received in devm_memremap_pages() does not have
> > its dev->numa_node properly initialized yet.
> > 
> > What I see is that all pmem devices will call arch_add_memory
> > with nid==0 regardless of the real nid the memory is actually
> > on. Needless to say that after that all the NUMA page and zone
> > members (at page->flags) come out wrong.
> > 
> > If I do the below code it will all work properly.
> > 
> > RFC because probably we want to fix dev->numa_node before the
> > call to devm_memremap_pages.
> 
> Let's just do that instead.  I.e. in the case of NFIT numa node should
> already be set, and in the case of the memmap=ss!nn or e820-type-12 we
> can set the numa node like this:

Agreed.  memory_add_physaddr_to_nid() uses the SRAT info, which does not work
with the NFIT case. 

Thanks,
-Toshi 


> diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c
> index 8282db2ef99e..e40df8fedf73 100644
> --- a/drivers/nvdimm/e820.c
> +++ b/drivers/nvdimm/e820.c
> @@ -48,7 +48,7 @@ static int e820_pmem_probe(struct platform_device *pdev)
>                memset(&ndr_desc, 0, sizeof(ndr_desc));
>                ndr_desc.res = p;
>                ndr_desc.attr_groups = e820_pmem_region_attribute_groups;
> -               ndr_desc.numa_node = NUMA_NO_NODE;
> +               ndr_desc.numa_node = memory_add_physaddr_to_nid(p->start);
>                set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
>                if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
>                        goto err;
> 
> Thanks for pointing out memory_add_physaddr_to_nid(). I did not know
> that existed.



--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Boaz Harrosh Nov. 15, 2015, 9:17 a.m. UTC | #3
On 11/13/2015 06:00 PM, Toshi Kani wrote:
<>
> 
> Agreed.  memory_add_physaddr_to_nid() uses the SRAT info, which does not work
> with the NFIT case. 
> 

Thanks Toshi, I did not know that NFIT would not work. (As I already ranted NFIT
is hard to find)

Would it be hard to fix? I mean the way it is today NvDIMM is always put at the
*end* of the NUMA address range, so all the NUMA boundaries (start) are there, all
we need is to make sure max_pfn is advanced behind the last NvDIMM range.

(Ok and we might have a slight problem with an NFIT only Node, where there
 is no volatile memory at all)

I think it is worth fixing there are surprising places this might be used.
I know that it works with type-12 and emulated pmem.

(Once I set up my NFIT QEMU I'll see what I can find)

> Thanks,
> -Toshi 
> 

Thanks
Boaz

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kani, Toshi Nov. 16, 2015, 5:19 p.m. UTC | #4
On Sun, 2015-11-15 at 11:17 +0200, Boaz Harrosh wrote:
> On 11/13/2015 06:00 PM, Toshi Kani wrote:
> <>
> > 
> > Agreed.  memory_add_physaddr_to_nid() uses the SRAT info, which does not 
> > work with the NFIT case.
> > 
> 
> Thanks Toshi, I did not know that NFIT would not work. (As I already ranted 
> NFIT is hard to find)
> 
> Would it be hard to fix? I mean the way it is today NvDIMM is always put at 
> the *end* of the NUMA address range, so all the NUMA boundaries (start) are 
> there, all we need is to make sure max_pfn is advanced behind the last NvDIMM
> range.

I understand that both memory_add_physaddr_to_nid() and max_pfn cover NVDIMM
ranges on platforms with legacy E820_PRAM (12), which differs from the NFIT
case.  I agree that such difference is not desirable.

NFIT FW I have does not put NVDIMM ranges into SRAT, but ACPI spec is not very
clear about it [1].  We currently consider NVDIMM as device memory, not regular
memory.  So, this depends on how we define the "memory" info.

As for max_pfn, yes, it may make sense to cover NVDIMM when ZONE_DEVICE is used.

> (Ok and we might have a slight problem with an NFIT only Node, where there
>  is no volatile memory at all)

The NFIT driver sets it to the closest on-line node (i.e. where regular memory
resides) in this case.

> I think it is worth fixing there are surprising places this might be used.
> I know that it works with type-12 and emulated pmem.
> 
> (Once I set up my NFIT QEMU I'll see what I can find)

Thanks,
-Toshi

[1] https://lkml.org/lkml/2015/9/1/484

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Boaz Harrosh Nov. 17, 2015, 1:15 p.m. UTC | #5
On 11/16/2015 07:19 PM, Toshi Kani wrote:
<>
>> (Ok and we might have a slight problem with an NFIT only Node, where there
>>  is no volatile memory at all)
> 
> The NFIT driver sets it to the closest on-line node (i.e. where regular memory
> resides) in this case.
> 

This is a real problem then. I know it used to be the same with type-12 pmem
and we had real trouble with it on systems performance. Eventually one of your
patches fixed it.

We should fix this for NFIT, yes specially with ZONE_DEVICE and page support.
It will hurt performance a lot.

Eventually I'll get to it, again when NFIT becomes more relevant to us.

<>
> 
> Thanks,
> -Toshi
> 

Thanks Toshi
Boaz


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c
index 8282db2ef99e..e40df8fedf73 100644
--- a/drivers/nvdimm/e820.c
+++ b/drivers/nvdimm/e820.c
@@ -48,7 +48,7 @@  static int e820_pmem_probe(struct platform_device *pdev)
               memset(&ndr_desc, 0, sizeof(ndr_desc));
               ndr_desc.res = p;
               ndr_desc.attr_groups = e820_pmem_region_attribute_groups;
-               ndr_desc.numa_node = NUMA_NO_NODE;
+               ndr_desc.numa_node = memory_add_physaddr_to_nid(p->start);
               set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
               if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
                       goto err;