Message ID | CAPcyv4jCg58-Je4hW=O4TT9ESeJTV-GFEUidLD9n60-z9w9V2Q@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 11/11/2015 11:46 PM, Dan Williams wrote: > On Wed, Nov 11, 2015 at 1:16 PM, Boaz Harrosh <boaz@plexistor.com> wrote: >> >> The pmem dev as received in devm_memremap_pages() does not have >> its dev->numa_node properly initialized yet. >> >> What I see is that all pmem devices will call arch_add_memory >> with nid==0 regardless of the real nid the memory is actually >> on. Needless to say that after that all the NUMA page and zone >> members (at page->flags) come out wrong. >> >> If I do the below code it will all work properly. >> >> RFC because probably we want to fix dev->numa_node before the >> call to devm_memremap_pages. > > Let's just do that instead. I.e. in the case of NFIT numa node should > already be set, and in the case of the memmap=ss!nn or e820-type-12 we > can set the numa node like this: > > diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c > index 8282db2ef99e..e40df8fedf73 100644 > --- a/drivers/nvdimm/e820.c > +++ b/drivers/nvdimm/e820.c > @@ -48,7 +48,7 @@ static int e820_pmem_probe(struct platform_device *pdev) > memset(&ndr_desc, 0, sizeof(ndr_desc)); > ndr_desc.res = p; > ndr_desc.attr_groups = e820_pmem_region_attribute_groups; > - ndr_desc.numa_node = NUMA_NO_NODE; > + ndr_desc.numa_node = memory_add_physaddr_to_nid(p->start); > set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags); > if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc)) > goto err; > > Thanks for pointing out memory_add_physaddr_to_nid(). I did not know > that existed. > Thanks. Its kind of late here now, please let me test this tomorrow and come back to you, but it sounds good, thanks Boaz -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2015-11-11 at 13:46 -0800, Dan Williams wrote: > On Wed, Nov 11, 2015 at 1:16 PM, Boaz Harrosh <boaz@plexistor.com> wrote: > > > > The pmem dev as received in devm_memremap_pages() does not have > > its dev->numa_node properly initialized yet. > > > > What I see is that all pmem devices will call arch_add_memory > > with nid==0 regardless of the real nid the memory is actually > > on. Needless to say that after that all the NUMA page and zone > > members (at page->flags) come out wrong. > > > > If I do the below code it will all work properly. > > > > RFC because probably we want to fix dev->numa_node before the > > call to devm_memremap_pages. > > Let's just do that instead. I.e. in the case of NFIT numa node should > already be set, and in the case of the memmap=ss!nn or e820-type-12 we > can set the numa node like this: Agreed. memory_add_physaddr_to_nid() uses the SRAT info, which does not work with the NFIT case. Thanks, -Toshi > diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c > index 8282db2ef99e..e40df8fedf73 100644 > --- a/drivers/nvdimm/e820.c > +++ b/drivers/nvdimm/e820.c > @@ -48,7 +48,7 @@ static int e820_pmem_probe(struct platform_device *pdev) > memset(&ndr_desc, 0, sizeof(ndr_desc)); > ndr_desc.res = p; > ndr_desc.attr_groups = e820_pmem_region_attribute_groups; > - ndr_desc.numa_node = NUMA_NO_NODE; > + ndr_desc.numa_node = memory_add_physaddr_to_nid(p->start); > set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags); > if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc)) > goto err; > > Thanks for pointing out memory_add_physaddr_to_nid(). I did not know > that existed. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 11/13/2015 06:00 PM, Toshi Kani wrote: <> > > Agreed. memory_add_physaddr_to_nid() uses the SRAT info, which does not work > with the NFIT case. > Thanks Toshi, I did not know that NFIT would not work. (As I already ranted NFIT is hard to find) Would it be hard to fix? I mean the way it is today NvDIMM is always put at the *end* of the NUMA address range, so all the NUMA boundaries (start) are there, all we need is to make sure max_pfn is advanced behind the last NvDIMM range. (Ok and we might have a slight problem with an NFIT only Node, where there is no volatile memory at all) I think it is worth fixing there are surprising places this might be used. I know that it works with type-12 and emulated pmem. (Once I set up my NFIT QEMU I'll see what I can find) > Thanks, > -Toshi > Thanks Boaz -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, 2015-11-15 at 11:17 +0200, Boaz Harrosh wrote: > On 11/13/2015 06:00 PM, Toshi Kani wrote: > <> > > > > Agreed. memory_add_physaddr_to_nid() uses the SRAT info, which does not > > work with the NFIT case. > > > > Thanks Toshi, I did not know that NFIT would not work. (As I already ranted > NFIT is hard to find) > > Would it be hard to fix? I mean the way it is today NvDIMM is always put at > the *end* of the NUMA address range, so all the NUMA boundaries (start) are > there, all we need is to make sure max_pfn is advanced behind the last NvDIMM > range. I understand that both memory_add_physaddr_to_nid() and max_pfn cover NVDIMM ranges on platforms with legacy E820_PRAM (12), which differs from the NFIT case. I agree that such difference is not desirable. NFIT FW I have does not put NVDIMM ranges into SRAT, but ACPI spec is not very clear about it [1]. We currently consider NVDIMM as device memory, not regular memory. So, this depends on how we define the "memory" info. As for max_pfn, yes, it may make sense to cover NVDIMM when ZONE_DEVICE is used. > (Ok and we might have a slight problem with an NFIT only Node, where there > is no volatile memory at all) The NFIT driver sets it to the closest on-line node (i.e. where regular memory resides) in this case. > I think it is worth fixing there are surprising places this might be used. > I know that it works with type-12 and emulated pmem. > > (Once I set up my NFIT QEMU I'll see what I can find) Thanks, -Toshi [1] https://lkml.org/lkml/2015/9/1/484 -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 11/16/2015 07:19 PM, Toshi Kani wrote: <> >> (Ok and we might have a slight problem with an NFIT only Node, where there >> is no volatile memory at all) > > The NFIT driver sets it to the closest on-line node (i.e. where regular memory > resides) in this case. > This is a real problem then. I know it used to be the same with type-12 pmem and we had real trouble with it on systems performance. Eventually one of your patches fixed it. We should fix this for NFIT, yes specially with ZONE_DEVICE and page support. It will hurt performance a lot. Eventually I'll get to it, again when NFIT becomes more relevant to us. <> > > Thanks, > -Toshi > Thanks Toshi Boaz -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c index 8282db2ef99e..e40df8fedf73 100644 --- a/drivers/nvdimm/e820.c +++ b/drivers/nvdimm/e820.c @@ -48,7 +48,7 @@ static int e820_pmem_probe(struct platform_device *pdev) memset(&ndr_desc, 0, sizeof(ndr_desc)); ndr_desc.res = p; ndr_desc.attr_groups = e820_pmem_region_attribute_groups; - ndr_desc.numa_node = NUMA_NO_NODE; + ndr_desc.numa_node = memory_add_physaddr_to_nid(p->start); set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags); if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc)) goto err;