diff mbox series

[v2,1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

Message ID 20200707055917.143653-2-justin.he@arm.com (mailing list archive)
State New, archived
Headers show
Series Fix and enable pmem as RAM device on arm64 | expand

Commit Message

Jia He July 7, 2020, 5:59 a.m. UTC
This exports memory_add_physaddr_to_nid() for module driver to use.

memory_add_physaddr_to_nid() is a fallback option to get the nid in case
NUMA_NO_NID is detected.

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Jia He <justin.he@arm.com>
---
 arch/arm64/mm/numa.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

David Hildenbrand July 7, 2020, 11:35 a.m. UTC | #1
On 07.07.20 07:59, Jia He wrote:
> This exports memory_add_physaddr_to_nid() for module driver to use.
> 
> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> NUMA_NO_NID is detected.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Jia He <justin.he@arm.com>
> ---
>  arch/arm64/mm/numa.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> index aafcee3e3f7e..7eeb31740248 100644
> --- a/arch/arm64/mm/numa.c
> +++ b/arch/arm64/mm/numa.c
> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
>  
>  /*
>   * We hope that we will be hotplugging memory on nodes we already know about,
> - * such that acpi_get_node() succeeds and we never fall back to this...
> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
>   */
>  int memory_add_physaddr_to_nid(u64 addr)
>  {
> -	pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
>  	return 0;
>  }
> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> 
We could turn that into a pr_info() instead, but the effect is visible
to user space (e.g., which memory blocks belong to which node in sysfs),
so this can be debugged easily on demand.

Reviewed-by: David Hildenbrand <david@redhat.com>
Michal Hocko July 7, 2020, 11:54 a.m. UTC | #2
On Tue 07-07-20 13:59:15, Jia He wrote:
> This exports memory_add_physaddr_to_nid() for module driver to use.
> 
> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> NUMA_NO_NID is detected.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Jia He <justin.he@arm.com>
> ---
>  arch/arm64/mm/numa.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> index aafcee3e3f7e..7eeb31740248 100644
> --- a/arch/arm64/mm/numa.c
> +++ b/arch/arm64/mm/numa.c
> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
>  
>  /*
>   * We hope that we will be hotplugging memory on nodes we already know about,
> - * such that acpi_get_node() succeeds and we never fall back to this...
> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
>   */
>  int memory_add_physaddr_to_nid(u64 addr)
>  {
> -	pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
>  	return 0;
>  }
> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);

Does it make sense to export a noop function? Wouldn't make more sense
to simply make it static inline somewhere in a header? I haven't checked
whether there is an easy way to do that sanely bu this just hit my eyes.
Mike Rapoport July 7, 2020, 12:13 p.m. UTC | #3
On Tue, Jul 07, 2020 at 01:54:54PM +0200, Michal Hocko wrote:
> On Tue 07-07-20 13:59:15, Jia He wrote:
> > This exports memory_add_physaddr_to_nid() for module driver to use.
> > 
> > memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> > NUMA_NO_NID is detected.
> > 
> > Suggested-by: David Hildenbrand <david@redhat.com>
> > Signed-off-by: Jia He <justin.he@arm.com>
> > ---
> >  arch/arm64/mm/numa.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > index aafcee3e3f7e..7eeb31740248 100644
> > --- a/arch/arm64/mm/numa.c
> > +++ b/arch/arm64/mm/numa.c
> > @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> >  
> >  /*
> >   * We hope that we will be hotplugging memory on nodes we already know about,
> > - * such that acpi_get_node() succeeds and we never fall back to this...
> > + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
> > + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
> >   */
> >  int memory_add_physaddr_to_nid(u64 addr)
> >  {
> > -	pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
> >  	return 0;
> >  }
> > +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> 
> Does it make sense to export a noop function? Wouldn't make more sense
> to simply make it static inline somewhere in a header? I haven't checked
> whether there is an easy way to do that sanely bu this just hit my eyes.

We'll need to either add a CONFIG_ option or arch specific callback to
make both non-empty (x86, powerpc, ia64) and empty (arm64, sh)
implementations coexist ...

> -- 
> Michal Hocko
> SUSE Labs
David Hildenbrand July 7, 2020, 12:26 p.m. UTC | #4
On 07.07.20 14:13, Mike Rapoport wrote:
> On Tue, Jul 07, 2020 at 01:54:54PM +0200, Michal Hocko wrote:
>> On Tue 07-07-20 13:59:15, Jia He wrote:
>>> This exports memory_add_physaddr_to_nid() for module driver to use.
>>>
>>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
>>> NUMA_NO_NID is detected.
>>>
>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>> Signed-off-by: Jia He <justin.he@arm.com>
>>> ---
>>>  arch/arm64/mm/numa.c | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>> index aafcee3e3f7e..7eeb31740248 100644
>>> --- a/arch/arm64/mm/numa.c
>>> +++ b/arch/arm64/mm/numa.c
>>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
>>>  
>>>  /*
>>>   * We hope that we will be hotplugging memory on nodes we already know about,
>>> - * such that acpi_get_node() succeeds and we never fall back to this...
>>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
>>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
>>>   */
>>>  int memory_add_physaddr_to_nid(u64 addr)
>>>  {
>>> -	pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
>>>  	return 0;
>>>  }
>>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>>
>> Does it make sense to export a noop function? Wouldn't make more sense
>> to simply make it static inline somewhere in a header? I haven't checked
>> whether there is an easy way to do that sanely bu this just hit my eyes.
> 
> We'll need to either add a CONFIG_ option or arch specific callback to
> make both non-empty (x86, powerpc, ia64) and empty (arm64, sh)
> implementations coexist ...

Note: I have a similar dummy (return 0) patch for s390x lying around here.
Mike Rapoport July 7, 2020, 6 p.m. UTC | #5
On Tue, Jul 07, 2020 at 02:26:08PM +0200, David Hildenbrand wrote:
> On 07.07.20 14:13, Mike Rapoport wrote:
> > On Tue, Jul 07, 2020 at 01:54:54PM +0200, Michal Hocko wrote:
> >> On Tue 07-07-20 13:59:15, Jia He wrote:
> >>> This exports memory_add_physaddr_to_nid() for module driver to use.
> >>>
> >>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> >>> NUMA_NO_NID is detected.
> >>>
> >>> Suggested-by: David Hildenbrand <david@redhat.com>
> >>> Signed-off-by: Jia He <justin.he@arm.com>
> >>> ---
> >>>  arch/arm64/mm/numa.c | 5 +++--
> >>>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> >>> index aafcee3e3f7e..7eeb31740248 100644
> >>> --- a/arch/arm64/mm/numa.c
> >>> +++ b/arch/arm64/mm/numa.c
> >>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> >>>  
> >>>  /*
> >>>   * We hope that we will be hotplugging memory on nodes we already know about,
> >>> - * such that acpi_get_node() succeeds and we never fall back to this...
> >>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
> >>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
> >>>   */
> >>>  int memory_add_physaddr_to_nid(u64 addr)
> >>>  {
> >>> -	pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
> >>>  	return 0;
> >>>  }
> >>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> >>
> >> Does it make sense to export a noop function? Wouldn't make more sense
> >> to simply make it static inline somewhere in a header? I haven't checked
> >> whether there is an easy way to do that sanely bu this just hit my eyes.
> > 
> > We'll need to either add a CONFIG_ option or arch specific callback to
> > make both non-empty (x86, powerpc, ia64) and empty (arm64, sh)
> > implementations coexist ...
> 
> Note: I have a similar dummy (return 0) patch for s390x lying around here.

Then we'll call it a tie - 3:3 ;-)
 
> -- 
> Thanks,
> 
> David / dhildenb
>
Dan Williams July 7, 2020, 10:05 p.m. UTC | #6
On Tue, Jul 7, 2020 at 11:01 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Tue, Jul 07, 2020 at 02:26:08PM +0200, David Hildenbrand wrote:
> > On 07.07.20 14:13, Mike Rapoport wrote:
> > > On Tue, Jul 07, 2020 at 01:54:54PM +0200, Michal Hocko wrote:
> > >> On Tue 07-07-20 13:59:15, Jia He wrote:
> > >>> This exports memory_add_physaddr_to_nid() for module driver to use.
> > >>>
> > >>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> > >>> NUMA_NO_NID is detected.
> > >>>
> > >>> Suggested-by: David Hildenbrand <david@redhat.com>
> > >>> Signed-off-by: Jia He <justin.he@arm.com>
> > >>> ---
> > >>>  arch/arm64/mm/numa.c | 5 +++--
> > >>>  1 file changed, 3 insertions(+), 2 deletions(-)
> > >>>
> > >>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > >>> index aafcee3e3f7e..7eeb31740248 100644
> > >>> --- a/arch/arm64/mm/numa.c
> > >>> +++ b/arch/arm64/mm/numa.c
> > >>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> > >>>
> > >>>  /*
> > >>>   * We hope that we will be hotplugging memory on nodes we already know about,
> > >>> - * such that acpi_get_node() succeeds and we never fall back to this...
> > >>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
> > >>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
> > >>>   */
> > >>>  int memory_add_physaddr_to_nid(u64 addr)
> > >>>  {
> > >>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
> > >>>   return 0;
> > >>>  }
> > >>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> > >>
> > >> Does it make sense to export a noop function? Wouldn't make more sense
> > >> to simply make it static inline somewhere in a header? I haven't checked
> > >> whether there is an easy way to do that sanely bu this just hit my eyes.
> > >
> > > We'll need to either add a CONFIG_ option or arch specific callback to
> > > make both non-empty (x86, powerpc, ia64) and empty (arm64, sh)
> > > implementations coexist ...
> >
> > Note: I have a similar dummy (return 0) patch for s390x lying around here.
>
> Then we'll call it a tie - 3:3 ;-)

So I'd be happy to jump on the train of people wanting to export the
ARM stub for this (and add a new ARM stub for phys_to_target_node()),
but Will did have a plausibly better idea that I have been meaning to
circle back to:

http://lore.kernel.org/r/20200325111039.GA32109@willie-the-truck

...i.e. iterate over node data to do the lookup. This would seem to
work generically for multiple archs unless I am missing something?
Jia He July 8, 2020, 2:20 a.m. UTC | #7
Hi Michal and David

> -----Original Message-----
> From: Michal Hocko <mhocko@kernel.org>
> Sent: Tuesday, July 7, 2020 7:55 PM
> To: Justin He <Justin.He@arm.com>
> Cc: Catalin Marinas <Catalin.Marinas@arm.com>; Will Deacon
> <will@kernel.org>; Dan Williams <dan.j.williams@intel.com>; Vishal Verma
> <vishal.l.verma@intel.com>; Dave Jiang <dave.jiang@intel.com>; Andrew
> Morton <akpm@linux-foundation.org>; Mike Rapoport <rppt@linux.ibm.com>;
> Baoquan He <bhe@redhat.com>; Chuhong Yuan <hslester96@gmail.com>; linux-
> arm-kernel@lists.infradead.org; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; linux-nvdimm@lists.01.org; Kaly Xin <Kaly.Xin@arm.com>
> Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid
> as EXPORT_SYMBOL_GPL
> 
> On Tue 07-07-20 13:59:15, Jia He wrote:
> > This exports memory_add_physaddr_to_nid() for module driver to use.
> >
> > memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> > NUMA_NO_NID is detected.
> >
> > Suggested-by: David Hildenbrand <david@redhat.com>
> > Signed-off-by: Jia He <justin.he@arm.com>
> > ---
> >  arch/arm64/mm/numa.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > index aafcee3e3f7e..7eeb31740248 100644
> > --- a/arch/arm64/mm/numa.c
> > +++ b/arch/arm64/mm/numa.c
> > @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> >
> >  /*
> >   * We hope that we will be hotplugging memory on nodes we already know
> about,
> > - * such that acpi_get_node() succeeds and we never fall back to this...
> > + * such that acpi_get_node() succeeds. But when SRAT is not present,
> the node
> > + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback
> option.
> >   */
> >  int memory_add_physaddr_to_nid(u64 addr)
> >  {
> > -	pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n",
> addr);
> >  	return 0;
> >  }
> > +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> 
> Does it make sense to export a noop function? Wouldn't make more sense
> to simply make it static inline somewhere in a header? I haven't checked
> whether there is an easy way to do that sanely bu this just hit my eyes.

Okay, I can make a change in memory_hotplug.h, sth like:
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -149,13 +149,13 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
              struct mhp_params *params);
 #endif /* ARCH_HAS_ADD_PAGES */
 
-#ifdef CONFIG_NUMA
-extern int memory_add_physaddr_to_nid(u64 start);
-#else
+#if !defined(CONFIG_NUMA) || !defined(memory_add_physaddr_to_nid)
 static inline int memory_add_physaddr_to_nid(u64 start)
 {
        return 0;
 }
+#else
+extern int memory_add_physaddr_to_nid(u64 start);
 #endif

And then check the memory_add_physaddr_to_nid() helper on all arches,
if it is noop(return 0), I can simply remove it.
if it is not noop, after the helper, 
#define memory_add_physaddr_to_nid

What do you think of this proposal?

--
Cheers,
Justin (Jia He)
Dan Williams July 8, 2020, 3:56 a.m. UTC | #8
On Tue, Jul 7, 2020 at 7:20 PM Justin He <Justin.He@arm.com> wrote:
>
> Hi Michal and David
>
> > -----Original Message-----
> > From: Michal Hocko <mhocko@kernel.org>
> > Sent: Tuesday, July 7, 2020 7:55 PM
> > To: Justin He <Justin.He@arm.com>
> > Cc: Catalin Marinas <Catalin.Marinas@arm.com>; Will Deacon
> > <will@kernel.org>; Dan Williams <dan.j.williams@intel.com>; Vishal Verma
> > <vishal.l.verma@intel.com>; Dave Jiang <dave.jiang@intel.com>; Andrew
> > Morton <akpm@linux-foundation.org>; Mike Rapoport <rppt@linux.ibm.com>;
> > Baoquan He <bhe@redhat.com>; Chuhong Yuan <hslester96@gmail.com>; linux-
> > arm-kernel@lists.infradead.org; linux-kernel@vger.kernel.org; linux-
> > mm@kvack.org; linux-nvdimm@lists.01.org; Kaly Xin <Kaly.Xin@arm.com>
> > Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid
> > as EXPORT_SYMBOL_GPL
> >
> > On Tue 07-07-20 13:59:15, Jia He wrote:
> > > This exports memory_add_physaddr_to_nid() for module driver to use.
> > >
> > > memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> > > NUMA_NO_NID is detected.
> > >
> > > Suggested-by: David Hildenbrand <david@redhat.com>
> > > Signed-off-by: Jia He <justin.he@arm.com>
> > > ---
> > >  arch/arm64/mm/numa.c | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > > index aafcee3e3f7e..7eeb31740248 100644
> > > --- a/arch/arm64/mm/numa.c
> > > +++ b/arch/arm64/mm/numa.c
> > > @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> > >
> > >  /*
> > >   * We hope that we will be hotplugging memory on nodes we already know
> > about,
> > > - * such that acpi_get_node() succeeds and we never fall back to this...
> > > + * such that acpi_get_node() succeeds. But when SRAT is not present,
> > the node
> > > + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback
> > option.
> > >   */
> > >  int memory_add_physaddr_to_nid(u64 addr)
> > >  {
> > > -   pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n",
> > addr);
> > >     return 0;
> > >  }
> > > +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> >
> > Does it make sense to export a noop function? Wouldn't make more sense
> > to simply make it static inline somewhere in a header? I haven't checked
> > whether there is an easy way to do that sanely bu this just hit my eyes.
>
> Okay, I can make a change in memory_hotplug.h, sth like:
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -149,13 +149,13 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
>               struct mhp_params *params);
>  #endif /* ARCH_HAS_ADD_PAGES */
>
> -#ifdef CONFIG_NUMA
> -extern int memory_add_physaddr_to_nid(u64 start);
> -#else
> +#if !defined(CONFIG_NUMA) || !defined(memory_add_physaddr_to_nid)
>  static inline int memory_add_physaddr_to_nid(u64 start)
>  {
>         return 0;
>  }
> +#else
> +extern int memory_add_physaddr_to_nid(u64 start);
>  #endif
>
> And then check the memory_add_physaddr_to_nid() helper on all arches,
> if it is noop(return 0), I can simply remove it.
> if it is not noop, after the helper,
> #define memory_add_physaddr_to_nid
>
> What do you think of this proposal?

Especially for architectures that use memblock info for numa info
(which seems to be everyone except x86) why not implement a generic
memory_add_physaddr_to_nid() that does:

int memory_add_physaddr_to_nid(u64 addr)
{
        unsigned long start_pfn, end_pfn, pfn = PHYS_PFN(addr);
        int nid;

        for_each_online_node(nid) {
                get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
                if (pfn >= start_pfn && pfn <= end_pfn)
                        return nid;
        }
        return NUMA_NO_NODE;
}
Jia He July 8, 2020, 4:08 a.m. UTC | #9
Hi Dan

> -----Original Message-----
> From: Dan Williams <dan.j.williams@intel.com>
> Sent: Wednesday, July 8, 2020 11:57 AM
> To: Justin He <Justin.He@arm.com>
> Cc: Michal Hocko <mhocko@kernel.org>; David Hildenbrand <david@redhat.com>;
> Catalin Marinas <Catalin.Marinas@arm.com>; Will Deacon <will@kernel.org>;
> Vishal Verma <vishal.l.verma@intel.com>; Dave Jiang <dave.jiang@intel.com>;
> Andrew Morton <akpm@linux-foundation.org>; Mike Rapoport
> <rppt@linux.ibm.com>; Baoquan He <bhe@redhat.com>; Chuhong Yuan
> <hslester96@gmail.com>; linux-arm-kernel@lists.infradead.org; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; linux-nvdimm@lists.01.org;
> Kaly Xin <Kaly.Xin@arm.com>
> Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid
> as EXPORT_SYMBOL_GPL
> 
> On Tue, Jul 7, 2020 at 7:20 PM Justin He <Justin.He@arm.com> wrote:
> >
> > Hi Michal and David
> >
> > > -----Original Message-----
> > > From: Michal Hocko <mhocko@kernel.org>
> > > Sent: Tuesday, July 7, 2020 7:55 PM
> > > To: Justin He <Justin.He@arm.com>
> > > Cc: Catalin Marinas <Catalin.Marinas@arm.com>; Will Deacon
> > > <will@kernel.org>; Dan Williams <dan.j.williams@intel.com>; Vishal
> Verma
> > > <vishal.l.verma@intel.com>; Dave Jiang <dave.jiang@intel.com>; Andrew
> > > Morton <akpm@linux-foundation.org>; Mike Rapoport <rppt@linux.ibm.com>;
> > > Baoquan He <bhe@redhat.com>; Chuhong Yuan <hslester96@gmail.com>;
> linux-
> > > arm-kernel@lists.infradead.org; linux-kernel@vger.kernel.org; linux-
> > > mm@kvack.org; linux-nvdimm@lists.01.org; Kaly Xin <Kaly.Xin@arm.com>
> > > Subject: Re: [PATCH v2 1/3] arm64/numa: export
> memory_add_physaddr_to_nid
> > > as EXPORT_SYMBOL_GPL
> > >
> > > On Tue 07-07-20 13:59:15, Jia He wrote:
> > > > This exports memory_add_physaddr_to_nid() for module driver to use.
> > > >
> > > > memory_add_physaddr_to_nid() is a fallback option to get the nid in
> case
> > > > NUMA_NO_NID is detected.
> > > >
> > > > Suggested-by: David Hildenbrand <david@redhat.com>
> > > > Signed-off-by: Jia He <justin.he@arm.com>
> > > > ---
> > > >  arch/arm64/mm/numa.c | 5 +++--
> > > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > > > index aafcee3e3f7e..7eeb31740248 100644
> > > > --- a/arch/arm64/mm/numa.c
> > > > +++ b/arch/arm64/mm/numa.c
> > > > @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> > > >
> > > >  /*
> > > >   * We hope that we will be hotplugging memory on nodes we already
> know
> > > about,
> > > > - * such that acpi_get_node() succeeds and we never fall back to
> this...
> > > > + * such that acpi_get_node() succeeds. But when SRAT is not present,
> > > the node
> > > > + * id may be probed as NUMA_NO_NODE by acpi, Here provide a
> fallback
> > > option.
> > > >   */
> > > >  int memory_add_physaddr_to_nid(u64 addr)
> > > >  {
> > > > -   pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n",
> > > addr);
> > > >     return 0;
> > > >  }
> > > > +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> > >
> > > Does it make sense to export a noop function? Wouldn't make more sense
> > > to simply make it static inline somewhere in a header? I haven't
> checked
> > > whether there is an easy way to do that sanely bu this just hit my
> eyes.
> >
> > Okay, I can make a change in memory_hotplug.h, sth like:
> > --- a/include/linux/memory_hotplug.h
> > +++ b/include/linux/memory_hotplug.h
> > @@ -149,13 +149,13 @@ int add_pages(int nid, unsigned long start_pfn,
> unsigned long nr_pages,
> >               struct mhp_params *params);
> >  #endif /* ARCH_HAS_ADD_PAGES */
> >
> > -#ifdef CONFIG_NUMA
> > -extern int memory_add_physaddr_to_nid(u64 start);
> > -#else
> > +#if !defined(CONFIG_NUMA) || !defined(memory_add_physaddr_to_nid)
> >  static inline int memory_add_physaddr_to_nid(u64 start)
> >  {
> >         return 0;
> >  }
> > +#else
> > +extern int memory_add_physaddr_to_nid(u64 start);
> >  #endif
> >
> > And then check the memory_add_physaddr_to_nid() helper on all arches,
> > if it is noop(return 0), I can simply remove it.
> > if it is not noop, after the helper,
> > #define memory_add_physaddr_to_nid
> >
> > What do you think of this proposal?
> 
> Especially for architectures that use memblock info for numa info
> (which seems to be everyone except x86) why not implement a generic
> memory_add_physaddr_to_nid() that does:
> 
> int memory_add_physaddr_to_nid(u64 addr)
> {
>         unsigned long start_pfn, end_pfn, pfn = PHYS_PFN(addr);
>         int nid;
> 
>         for_each_online_node(nid) {
>                 get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
>                 if (pfn >= start_pfn && pfn <= end_pfn)
>                         return nid;
>         }
>         return NUMA_NO_NODE;
> }

Thanks for your suggestion,
Could I wrap the codes and let memory_add_physaddr_to_nid simply invoke
phys_to_target_node()? 


--
Cheers,
Justin (Jia He)
Dan Williams July 8, 2020, 4:27 a.m. UTC | #10
On Tue, Jul 7, 2020 at 9:08 PM Justin He <Justin.He@arm.com> wrote:
[..]
> > Especially for architectures that use memblock info for numa info
> > (which seems to be everyone except x86) why not implement a generic
> > memory_add_physaddr_to_nid() that does:
> >
> > int memory_add_physaddr_to_nid(u64 addr)
> > {
> >         unsigned long start_pfn, end_pfn, pfn = PHYS_PFN(addr);
> >         int nid;
> >
> >         for_each_online_node(nid) {
> >                 get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> >                 if (pfn >= start_pfn && pfn <= end_pfn)
> >                         return nid;
> >         }
> >         return NUMA_NO_NODE;
> > }
>
> Thanks for your suggestion,
> Could I wrap the codes and let memory_add_physaddr_to_nid simply invoke
> phys_to_target_node()?

I think it needs to be the reverse. phys_to_target_node() should call
memory_add_physaddr_to_nid() by default, but fall back to searching
reserved memory address ranges in memblock. See phys_to_target_node()
in arch/x86/mm/numa.c. That one uses numa_meminfo instead of memblock,
but the principle is the same i.e. that a target node may not be
represented in memblock.memory, but memblock.reserved. I'm working on
a patch to provide a function similar to get_pfn_range_for_nid() that
operates on reserved memory.
Mike Rapoport July 8, 2020, 5:27 a.m. UTC | #11
On Tue, Jul 07, 2020 at 03:05:48PM -0700, Dan Williams wrote:
> On Tue, Jul 7, 2020 at 11:01 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
> >
> > On Tue, Jul 07, 2020 at 02:26:08PM +0200, David Hildenbrand wrote:
> > > On 07.07.20 14:13, Mike Rapoport wrote:
> > > > On Tue, Jul 07, 2020 at 01:54:54PM +0200, Michal Hocko wrote:
> > > >> On Tue 07-07-20 13:59:15, Jia He wrote:
> > > >>> This exports memory_add_physaddr_to_nid() for module driver to use.
> > > >>>
> > > >>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> > > >>> NUMA_NO_NID is detected.
> > > >>>
> > > >>> Suggested-by: David Hildenbrand <david@redhat.com>
> > > >>> Signed-off-by: Jia He <justin.he@arm.com>
> > > >>> ---
> > > >>>  arch/arm64/mm/numa.c | 5 +++--
> > > >>>  1 file changed, 3 insertions(+), 2 deletions(-)
> > > >>>
> > > >>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > > >>> index aafcee3e3f7e..7eeb31740248 100644
> > > >>> --- a/arch/arm64/mm/numa.c
> > > >>> +++ b/arch/arm64/mm/numa.c
> > > >>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> > > >>>
> > > >>>  /*
> > > >>>   * We hope that we will be hotplugging memory on nodes we already know about,
> > > >>> - * such that acpi_get_node() succeeds and we never fall back to this...
> > > >>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
> > > >>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
> > > >>>   */
> > > >>>  int memory_add_physaddr_to_nid(u64 addr)
> > > >>>  {
> > > >>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
> > > >>>   return 0;
> > > >>>  }
> > > >>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> > > >>
> > > >> Does it make sense to export a noop function? Wouldn't make more sense
> > > >> to simply make it static inline somewhere in a header? I haven't checked
> > > >> whether there is an easy way to do that sanely bu this just hit my eyes.
> > > >
> > > > We'll need to either add a CONFIG_ option or arch specific callback to
> > > > make both non-empty (x86, powerpc, ia64) and empty (arm64, sh)
> > > > implementations coexist ...
> > >
> > > Note: I have a similar dummy (return 0) patch for s390x lying around here.
> >
> > Then we'll call it a tie - 3:3 ;-)
> 
> So I'd be happy to jump on the train of people wanting to export the
> ARM stub for this (and add a new ARM stub for phys_to_target_node()),
> but Will did have a plausibly better idea that I have been meaning to
> circle back to:
> 
> http://lore.kernel.org/r/20200325111039.GA32109@willie-the-truck
> 
> ...i.e. iterate over node data to do the lookup. This would seem to
> work generically for multiple archs unless I am missing something?

I think it would work on arm64, power and, most propbably on s390
(David?), but not on x86. x86 does not have reserved memory in pgdat,
it's never memblock_add()'ed (see e820__memblock_setup()).

I've suggested to add E820_*_RESERVED to memblock.memory a while ago
[1], but apparently there are systems that cannot tolerate OS mappings
of the BIOS reserved areas.

[1] https://lore.kernel.org/lkml/20200522142053.GW1059226@linux.ibm.com/
Mike Rapoport July 8, 2020, 5:32 a.m. UTC | #12
On Tue, Jul 07, 2020 at 08:56:36PM -0700, Dan Williams wrote:
> On Tue, Jul 7, 2020 at 7:20 PM Justin He <Justin.He@arm.com> wrote:
> >
> > Hi Michal and David
> >
> > > -----Original Message-----
> > > From: Michal Hocko <mhocko@kernel.org>
> > > Sent: Tuesday, July 7, 2020 7:55 PM
> > > To: Justin He <Justin.He@arm.com>
> > > Cc: Catalin Marinas <Catalin.Marinas@arm.com>; Will Deacon
> > > <will@kernel.org>; Dan Williams <dan.j.williams@intel.com>; Vishal Verma
> > > <vishal.l.verma@intel.com>; Dave Jiang <dave.jiang@intel.com>; Andrew
> > > Morton <akpm@linux-foundation.org>; Mike Rapoport <rppt@linux.ibm.com>;
> > > Baoquan He <bhe@redhat.com>; Chuhong Yuan <hslester96@gmail.com>; linux-
> > > arm-kernel@lists.infradead.org; linux-kernel@vger.kernel.org; linux-
> > > mm@kvack.org; linux-nvdimm@lists.01.org; Kaly Xin <Kaly.Xin@arm.com>
> > > Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid
> > > as EXPORT_SYMBOL_GPL
> > >
> > > On Tue 07-07-20 13:59:15, Jia He wrote:
> > > > This exports memory_add_physaddr_to_nid() for module driver to use.
> > > >
> > > > memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> > > > NUMA_NO_NID is detected.
> > > >
> > > > Suggested-by: David Hildenbrand <david@redhat.com>
> > > > Signed-off-by: Jia He <justin.he@arm.com>
> > > > ---
> > > >  arch/arm64/mm/numa.c | 5 +++--
> > > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > > > index aafcee3e3f7e..7eeb31740248 100644
> > > > --- a/arch/arm64/mm/numa.c
> > > > +++ b/arch/arm64/mm/numa.c
> > > > @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> > > >
> > > >  /*
> > > >   * We hope that we will be hotplugging memory on nodes we already know
> > > about,
> > > > - * such that acpi_get_node() succeeds and we never fall back to this...
> > > > + * such that acpi_get_node() succeeds. But when SRAT is not present,
> > > the node
> > > > + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback
> > > option.
> > > >   */
> > > >  int memory_add_physaddr_to_nid(u64 addr)
> > > >  {
> > > > -   pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n",
> > > addr);
> > > >     return 0;
> > > >  }
> > > > +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> > >
> > > Does it make sense to export a noop function? Wouldn't make more sense
> > > to simply make it static inline somewhere in a header? I haven't checked
> > > whether there is an easy way to do that sanely bu this just hit my eyes.
> >
> > Okay, I can make a change in memory_hotplug.h, sth like:
> > --- a/include/linux/memory_hotplug.h
> > +++ b/include/linux/memory_hotplug.h
> > @@ -149,13 +149,13 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
> >               struct mhp_params *params);
> >  #endif /* ARCH_HAS_ADD_PAGES */
> >
> > -#ifdef CONFIG_NUMA
> > -extern int memory_add_physaddr_to_nid(u64 start);
> > -#else
> > +#if !defined(CONFIG_NUMA) || !defined(memory_add_physaddr_to_nid)
> >  static inline int memory_add_physaddr_to_nid(u64 start)
> >  {
> >         return 0;
> >  }
> > +#else
> > +extern int memory_add_physaddr_to_nid(u64 start);
> >  #endif
> >
> > And then check the memory_add_physaddr_to_nid() helper on all arches,
> > if it is noop(return 0), I can simply remove it.
> > if it is not noop, after the helper,
> > #define memory_add_physaddr_to_nid
> >
> > What do you think of this proposal?
> 
> Especially for architectures that use memblock info for numa info
> (which seems to be everyone except x86) why not implement a generic
> memory_add_physaddr_to_nid() that does:

That would be only arm64.

> int memory_add_physaddr_to_nid(u64 addr)
> {
>         unsigned long start_pfn, end_pfn, pfn = PHYS_PFN(addr);
>         int nid;
> 
>         for_each_online_node(nid) {
>                 get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
>                 if (pfn >= start_pfn && pfn <= end_pfn)
>                         return nid;
>         }
>         return NUMA_NO_NODE;
> }
Dan Williams July 8, 2020, 5:48 a.m. UTC | #13
On Tue, Jul 7, 2020 at 10:33 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Tue, Jul 07, 2020 at 08:56:36PM -0700, Dan Williams wrote:
> > On Tue, Jul 7, 2020 at 7:20 PM Justin He <Justin.He@arm.com> wrote:
> > >
> > > Hi Michal and David
> > >
> > > > -----Original Message-----
> > > > From: Michal Hocko <mhocko@kernel.org>
> > > > Sent: Tuesday, July 7, 2020 7:55 PM
> > > > To: Justin He <Justin.He@arm.com>
> > > > Cc: Catalin Marinas <Catalin.Marinas@arm.com>; Will Deacon
> > > > <will@kernel.org>; Dan Williams <dan.j.williams@intel.com>; Vishal Verma
> > > > <vishal.l.verma@intel.com>; Dave Jiang <dave.jiang@intel.com>; Andrew
> > > > Morton <akpm@linux-foundation.org>; Mike Rapoport <rppt@linux.ibm.com>;
> > > > Baoquan He <bhe@redhat.com>; Chuhong Yuan <hslester96@gmail.com>; linux-
> > > > arm-kernel@lists.infradead.org; linux-kernel@vger.kernel.org; linux-
> > > > mm@kvack.org; linux-nvdimm@lists.01.org; Kaly Xin <Kaly.Xin@arm.com>
> > > > Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid
> > > > as EXPORT_SYMBOL_GPL
> > > >
> > > > On Tue 07-07-20 13:59:15, Jia He wrote:
> > > > > This exports memory_add_physaddr_to_nid() for module driver to use.
> > > > >
> > > > > memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> > > > > NUMA_NO_NID is detected.
> > > > >
> > > > > Suggested-by: David Hildenbrand <david@redhat.com>
> > > > > Signed-off-by: Jia He <justin.he@arm.com>
> > > > > ---
> > > > >  arch/arm64/mm/numa.c | 5 +++--
> > > > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > > > > index aafcee3e3f7e..7eeb31740248 100644
> > > > > --- a/arch/arm64/mm/numa.c
> > > > > +++ b/arch/arm64/mm/numa.c
> > > > > @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> > > > >
> > > > >  /*
> > > > >   * We hope that we will be hotplugging memory on nodes we already know
> > > > about,
> > > > > - * such that acpi_get_node() succeeds and we never fall back to this...
> > > > > + * such that acpi_get_node() succeeds. But when SRAT is not present,
> > > > the node
> > > > > + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback
> > > > option.
> > > > >   */
> > > > >  int memory_add_physaddr_to_nid(u64 addr)
> > > > >  {
> > > > > -   pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n",
> > > > addr);
> > > > >     return 0;
> > > > >  }
> > > > > +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> > > >
> > > > Does it make sense to export a noop function? Wouldn't make more sense
> > > > to simply make it static inline somewhere in a header? I haven't checked
> > > > whether there is an easy way to do that sanely bu this just hit my eyes.
> > >
> > > Okay, I can make a change in memory_hotplug.h, sth like:
> > > --- a/include/linux/memory_hotplug.h
> > > +++ b/include/linux/memory_hotplug.h
> > > @@ -149,13 +149,13 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
> > >               struct mhp_params *params);
> > >  #endif /* ARCH_HAS_ADD_PAGES */
> > >
> > > -#ifdef CONFIG_NUMA
> > > -extern int memory_add_physaddr_to_nid(u64 start);
> > > -#else
> > > +#if !defined(CONFIG_NUMA) || !defined(memory_add_physaddr_to_nid)
> > >  static inline int memory_add_physaddr_to_nid(u64 start)
> > >  {
> > >         return 0;
> > >  }
> > > +#else
> > > +extern int memory_add_physaddr_to_nid(u64 start);
> > >  #endif
> > >
> > > And then check the memory_add_physaddr_to_nid() helper on all arches,
> > > if it is noop(return 0), I can simply remove it.
> > > if it is not noop, after the helper,
> > > #define memory_add_physaddr_to_nid
> > >
> > > What do you think of this proposal?
> >
> > Especially for architectures that use memblock info for numa info
> > (which seems to be everyone except x86) why not implement a generic
> > memory_add_physaddr_to_nid() that does:
>
> That would be only arm64.
>

Darn, I saw ARCH_KEEP_MEMBLOCK and had delusions of grandeur that it
could solve my numa api woes. At least for x86 the problem is already
solved with reserved numa_meminfo, but now I'm trying to write generic
drivers that use those apis and finding these gaps on other archs.
Mike Rapoport July 8, 2020, 6:19 a.m. UTC | #14
On Tue, Jul 07, 2020 at 10:48:19PM -0700, Dan Williams wrote:
> On Tue, Jul 7, 2020 at 10:33 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> >
> > On Tue, Jul 07, 2020 at 08:56:36PM -0700, Dan Williams wrote:
> > > On Tue, Jul 7, 2020 at 7:20 PM Justin He <Justin.He@arm.com> wrote:
> > > >
> > > > Hi Michal and David
> > > >
> > > > > -----Original Message-----
> > > > > From: Michal Hocko <mhocko@kernel.org>
> > > > > Sent: Tuesday, July 7, 2020 7:55 PM
> > > > > To: Justin He <Justin.He@arm.com>
> > > > > Cc: Catalin Marinas <Catalin.Marinas@arm.com>; Will Deacon
> > > > > <will@kernel.org>; Dan Williams <dan.j.williams@intel.com>; Vishal Verma
> > > > > <vishal.l.verma@intel.com>; Dave Jiang <dave.jiang@intel.com>; Andrew
> > > > > Morton <akpm@linux-foundation.org>; Mike Rapoport <rppt@linux.ibm.com>;
> > > > > Baoquan He <bhe@redhat.com>; Chuhong Yuan <hslester96@gmail.com>; linux-
> > > > > arm-kernel@lists.infradead.org; linux-kernel@vger.kernel.org; linux-
> > > > > mm@kvack.org; linux-nvdimm@lists.01.org; Kaly Xin <Kaly.Xin@arm.com>
> > > > > Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid
> > > > > as EXPORT_SYMBOL_GPL
> > > > >
> > > > > On Tue 07-07-20 13:59:15, Jia He wrote:
> > > > > > This exports memory_add_physaddr_to_nid() for module driver to use.
> > > > > >
> > > > > > memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> > > > > > NUMA_NO_NID is detected.
> > > > > >
> > > > > > Suggested-by: David Hildenbrand <david@redhat.com>
> > > > > > Signed-off-by: Jia He <justin.he@arm.com>
> > > > > > ---
> > > > > >  arch/arm64/mm/numa.c | 5 +++--
> > > > > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > > > > > index aafcee3e3f7e..7eeb31740248 100644
> > > > > > --- a/arch/arm64/mm/numa.c
> > > > > > +++ b/arch/arm64/mm/numa.c
> > > > > > @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> > > > > >
> > > > > >  /*
> > > > > >   * We hope that we will be hotplugging memory on nodes we already know
> > > > > about,
> > > > > > - * such that acpi_get_node() succeeds and we never fall back to this...
> > > > > > + * such that acpi_get_node() succeeds. But when SRAT is not present,
> > > > > the node
> > > > > > + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback
> > > > > option.
> > > > > >   */
> > > > > >  int memory_add_physaddr_to_nid(u64 addr)
> > > > > >  {
> > > > > > -   pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n",
> > > > > addr);
> > > > > >     return 0;
> > > > > >  }
> > > > > > +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> > > > >
> > > > > Does it make sense to export a noop function? Wouldn't make more sense
> > > > > to simply make it static inline somewhere in a header? I haven't checked
> > > > > whether there is an easy way to do that sanely bu this just hit my eyes.
> > > >
> > > > Okay, I can make a change in memory_hotplug.h, sth like:
> > > > --- a/include/linux/memory_hotplug.h
> > > > +++ b/include/linux/memory_hotplug.h
> > > > @@ -149,13 +149,13 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
> > > >               struct mhp_params *params);
> > > >  #endif /* ARCH_HAS_ADD_PAGES */
> > > >
> > > > -#ifdef CONFIG_NUMA
> > > > -extern int memory_add_physaddr_to_nid(u64 start);
> > > > -#else
> > > > +#if !defined(CONFIG_NUMA) || !defined(memory_add_physaddr_to_nid)
> > > >  static inline int memory_add_physaddr_to_nid(u64 start)
> > > >  {
> > > >         return 0;
> > > >  }
> > > > +#else
> > > > +extern int memory_add_physaddr_to_nid(u64 start);
> > > >  #endif
> > > >
> > > > And then check the memory_add_physaddr_to_nid() helper on all arches,
> > > > if it is noop(return 0), I can simply remove it.
> > > > if it is not noop, after the helper,
> > > > #define memory_add_physaddr_to_nid
> > > >
> > > > What do you think of this proposal?
> > >
> > > Especially for architectures that use memblock info for numa info
> > > (which seems to be everyone except x86) why not implement a generic
> > > memory_add_physaddr_to_nid() that does:
> >
> > That would be only arm64.
> >
> 
> Darn, I saw ARCH_KEEP_MEMBLOCK and had delusions of grandeur that it
> could solve my numa api woes. At least for x86 the problem is already
> solved with reserved numa_meminfo, but now I'm trying to write generic
> drivers that use those apis and finding these gaps on other archs.

I'm not sure if x86's numa_meminfo is a part of the solution or a part
of the problem ;-)
Anyway, this all indeed messy and there is a lot to improve there.
Mike Rapoport July 8, 2020, 6:22 a.m. UTC | #15
On Tue, Jul 07, 2020 at 09:27:43PM -0700, Dan Williams wrote:
> On Tue, Jul 7, 2020 at 9:08 PM Justin He <Justin.He@arm.com> wrote:
> [..]
> > > Especially for architectures that use memblock info for numa info
> > > (which seems to be everyone except x86) why not implement a generic
> > > memory_add_physaddr_to_nid() that does:
> > >
> > > int memory_add_physaddr_to_nid(u64 addr)
> > > {
> > >         unsigned long start_pfn, end_pfn, pfn = PHYS_PFN(addr);
> > >         int nid;
> > >
> > >         for_each_online_node(nid) {
> > >                 get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> > >                 if (pfn >= start_pfn && pfn <= end_pfn)
> > >                         return nid;
> > >         }
> > >         return NUMA_NO_NODE;
> > > }
> >
> > Thanks for your suggestion,
> > Could I wrap the codes and let memory_add_physaddr_to_nid simply invoke
> > phys_to_target_node()?
> 
> I think it needs to be the reverse. phys_to_target_node() should call
> memory_add_physaddr_to_nid() by default, but fall back to searching
> reserved memory address ranges in memblock. See phys_to_target_node()
> in arch/x86/mm/numa.c. That one uses numa_meminfo instead of memblock,
> but the principle is the same i.e. that a target node may not be
> represented in memblock.memory, but memblock.reserved. I'm working on
> a patch to provide a function similar to get_pfn_range_for_nid() that
> operates on reserved memory.

Do we really need yet another memblock iterator?
I think only x86 has memory that is not in memblock.memory but only in
memblock.reserved.
Dan Williams July 8, 2020, 6:44 a.m. UTC | #16
On Tue, Jul 7, 2020 at 11:20 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
[..]
> > Darn, I saw ARCH_KEEP_MEMBLOCK and had delusions of grandeur that it
> > could solve my numa api woes. At least for x86 the problem is already
> > solved with reserved numa_meminfo, but now I'm trying to write generic
> > drivers that use those apis and finding these gaps on other archs.
>
> I'm not sure if x86's numa_meminfo is a part of the solution or a part
> of the problem ;-)

More the latter, but hopefully it can remain an exception and not the rule.
Dan Williams July 8, 2020, 6:53 a.m. UTC | #17
On Tue, Jul 7, 2020 at 11:22 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
[..]
> > > Thanks for your suggestion,
> > > Could I wrap the codes and let memory_add_physaddr_to_nid simply invoke
> > > phys_to_target_node()?
> >
> > I think it needs to be the reverse. phys_to_target_node() should call
> > memory_add_physaddr_to_nid() by default, but fall back to searching
> > reserved memory address ranges in memblock. See phys_to_target_node()
> > in arch/x86/mm/numa.c. That one uses numa_meminfo instead of memblock,
> > but the principle is the same i.e. that a target node may not be
> > represented in memblock.memory, but memblock.reserved. I'm working on
> > a patch to provide a function similar to get_pfn_range_for_nid() that
> > operates on reserved memory.
>
> Do we really need yet another memblock iterator?
> I think only x86 has memory that is not in memblock.memory but only in
> memblock.reserved.

Well, that's what led me here. EFI has introduced a memory attribute
called "EFI Special Purpose Memory". I mapped it to a new Linux
concept called Soft Reserved memory (commit b617c5266eed "efi: Common
enable/disable infrastructure for EFI soft reservation"). The driver I
want to claim that memory, device-dax, wants to be able to look up
numa information for an address range that is marked reserved in
memblock. The device-dax facility has the ability to either let
userspace map a device, or assign the memory backing that device to
the page allocator. In both scenarios the driver needs numa info to
either populate the 'numa_node' property of the device in sysfs, or to
pass an node-id to add_memory_resource() when it is hot-plugged.

I was thwarted by the lack of phys_to_target_node() on arm64, and
rather than add another stub like memory_add_physaddr_to_nid() I
wanted to see if it could be solved properly / generically with
memblock data.
Jia He July 8, 2020, 6:56 a.m. UTC | #18
Hi Dan

> -----Original Message-----
> From: Dan Williams <dan.j.williams@intel.com>
> Sent: Wednesday, July 8, 2020 1:48 PM
> To: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Justin He <Justin.He@arm.com>; Michal Hocko <mhocko@kernel.org>; David
> Hildenbrand <david@redhat.com>; Catalin Marinas <Catalin.Marinas@arm.com>;
> Will Deacon <will@kernel.org>; Vishal Verma <vishal.l.verma@intel.com>;
> Dave Jiang <dave.jiang@intel.com>; Andrew Morton <akpm@linux-
> foundation.org>; Baoquan He <bhe@redhat.com>; Chuhong Yuan
> <hslester96@gmail.com>; linux-arm-kernel@lists.infradead.org; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; linux-nvdimm@lists.01.org;
> Kaly Xin <Kaly.Xin@arm.com>
> Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid
> as EXPORT_SYMBOL_GPL
> 
> On Tue, Jul 7, 2020 at 10:33 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> >
> > On Tue, Jul 07, 2020 at 08:56:36PM -0700, Dan Williams wrote:
> > > On Tue, Jul 7, 2020 at 7:20 PM Justin He <Justin.He@arm.com> wrote:
> > > >
> > > > Hi Michal and David
> > > >
> > > > > -----Original Message-----
> > > > > From: Michal Hocko <mhocko@kernel.org>
> > > > > Sent: Tuesday, July 7, 2020 7:55 PM
> > > > > To: Justin He <Justin.He@arm.com>
> > > > > Cc: Catalin Marinas <Catalin.Marinas@arm.com>; Will Deacon
> > > > > <will@kernel.org>; Dan Williams <dan.j.williams@intel.com>; Vishal
> Verma
> > > > > <vishal.l.verma@intel.com>; Dave Jiang <dave.jiang@intel.com>;
> Andrew
> > > > > Morton <akpm@linux-foundation.org>; Mike Rapoport
> <rppt@linux.ibm.com>;
> > > > > Baoquan He <bhe@redhat.com>; Chuhong Yuan <hslester96@gmail.com>;
> linux-
> > > > > arm-kernel@lists.infradead.org; linux-kernel@vger.kernel.org;
> linux-
> > > > > mm@kvack.org; linux-nvdimm@lists.01.org; Kaly Xin
> <Kaly.Xin@arm.com>
> > > > > Subject: Re: [PATCH v2 1/3] arm64/numa: export
> memory_add_physaddr_to_nid
> > > > > as EXPORT_SYMBOL_GPL
> > > > >
> > > > > On Tue 07-07-20 13:59:15, Jia He wrote:
> > > > > > This exports memory_add_physaddr_to_nid() for module driver to
> use.
> > > > > >
> > > > > > memory_add_physaddr_to_nid() is a fallback option to get the nid
> in case
> > > > > > NUMA_NO_NID is detected.
> > > > > >
> > > > > > Suggested-by: David Hildenbrand <david@redhat.com>
> > > > > > Signed-off-by: Jia He <justin.he@arm.com>
> > > > > > ---
> > > > > >  arch/arm64/mm/numa.c | 5 +++--
> > > > > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > > > > > index aafcee3e3f7e..7eeb31740248 100644
> > > > > > --- a/arch/arm64/mm/numa.c
> > > > > > +++ b/arch/arm64/mm/numa.c
> > > > > > @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> > > > > >
> > > > > >  /*
> > > > > >   * We hope that we will be hotplugging memory on nodes we
> already know
> > > > > about,
> > > > > > - * such that acpi_get_node() succeeds and we never fall back to
> this...
> > > > > > + * such that acpi_get_node() succeeds. But when SRAT is not
> present,
> > > > > the node
> > > > > > + * id may be probed as NUMA_NO_NODE by acpi, Here provide a
> fallback
> > > > > option.
> > > > > >   */
> > > > > >  int memory_add_physaddr_to_nid(u64 addr)
> > > > > >  {
> > > > > > -   pr_warn("Unknown node for memory at 0x%llx, assuming node
> 0\n",
> > > > > addr);
> > > > > >     return 0;
> > > > > >  }
> > > > > > +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> > > > >
> > > > > Does it make sense to export a noop function? Wouldn't make more
> sense
> > > > > to simply make it static inline somewhere in a header? I haven't
> checked
> > > > > whether there is an easy way to do that sanely bu this just hit my
> eyes.
> > > >
> > > > Okay, I can make a change in memory_hotplug.h, sth like:
> > > > --- a/include/linux/memory_hotplug.h
> > > > +++ b/include/linux/memory_hotplug.h
> > > > @@ -149,13 +149,13 @@ int add_pages(int nid, unsigned long start_pfn,
> unsigned long nr_pages,
> > > >               struct mhp_params *params);
> > > >  #endif /* ARCH_HAS_ADD_PAGES */
> > > >
> > > > -#ifdef CONFIG_NUMA
> > > > -extern int memory_add_physaddr_to_nid(u64 start);
> > > > -#else
> > > > +#if !defined(CONFIG_NUMA) || !defined(memory_add_physaddr_to_nid)
> > > >  static inline int memory_add_physaddr_to_nid(u64 start)
> > > >  {
> > > >         return 0;
> > > >  }
> > > > +#else
> > > > +extern int memory_add_physaddr_to_nid(u64 start);
> > > >  #endif
> > > >
> > > > And then check the memory_add_physaddr_to_nid() helper on all arches,
> > > > if it is noop(return 0), I can simply remove it.
> > > > if it is not noop, after the helper,
> > > > #define memory_add_physaddr_to_nid
> > > >
> > > > What do you think of this proposal?
> > >
> > > Especially for architectures that use memblock info for numa info
> > > (which seems to be everyone except x86) why not implement a generic
> > > memory_add_physaddr_to_nid() that does:
> >
> > That would be only arm64.
> >
> 
> Darn, I saw ARCH_KEEP_MEMBLOCK and had delusions of grandeur that it
> could solve my numa api woes. At least for x86 the problem is already
> solved with reserved numa_meminfo, but now I'm trying to write generic
> drivers that use those apis and finding these gaps on other archs.

Even on arm64, there is a dependency issue in dax_pmem kmem case.
If dax pmem uses memory_add_physaddr_to_nid() to decide which node that
memblock should add into, get_pfn_range_for_nid() might not have
the correct memblock info at that time. That is, get_pfn_range_for_nid()
can't get the correct memblock info before add_memory()

So IMO, memory_add_physaddr_to_nid() still have to implement as noop on
arm64 (return 0) together with sh,s390x? Powerpc, x86,ia64 can use their
own implementation. And phys_to_target_node() can use your suggested(
for_each_online_node() ...)

What do you think of it? Thanks

--
Cheers,
Justin (Jia He)
David Hildenbrand July 8, 2020, 6:59 a.m. UTC | #19
On 08.07.20 08:22, Mike Rapoport wrote:
> On Tue, Jul 07, 2020 at 09:27:43PM -0700, Dan Williams wrote:
>> On Tue, Jul 7, 2020 at 9:08 PM Justin He <Justin.He@arm.com> wrote:
>> [..]
>>>> Especially for architectures that use memblock info for numa info
>>>> (which seems to be everyone except x86) why not implement a generic
>>>> memory_add_physaddr_to_nid() that does:
>>>>
>>>> int memory_add_physaddr_to_nid(u64 addr)
>>>> {
>>>>         unsigned long start_pfn, end_pfn, pfn = PHYS_PFN(addr);
>>>>         int nid;
>>>>
>>>>         for_each_online_node(nid) {
>>>>                 get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
>>>>                 if (pfn >= start_pfn && pfn <= end_pfn)
>>>>                         return nid;
>>>>         }
>>>>         return NUMA_NO_NODE;
>>>> }
>>>
>>> Thanks for your suggestion,
>>> Could I wrap the codes and let memory_add_physaddr_to_nid simply invoke
>>> phys_to_target_node()?
>>
>> I think it needs to be the reverse. phys_to_target_node() should call
>> memory_add_physaddr_to_nid() by default, but fall back to searching
>> reserved memory address ranges in memblock. See phys_to_target_node()
>> in arch/x86/mm/numa.c. That one uses numa_meminfo instead of memblock,
>> but the principle is the same i.e. that a target node may not be
>> represented in memblock.memory, but memblock.reserved. I'm working on
>> a patch to provide a function similar to get_pfn_range_for_nid() that
>> operates on reserved memory.
> 
> Do we really need yet another memblock iterator?
> I think only x86 has memory that is not in memblock.memory but only in
> memblock.reserved.

Reading about abusing the memblock allcoator once again in memory
hotplug paths makes me shiver.
David Hildenbrand July 8, 2020, 7 a.m. UTC | #20
On 08.07.20 08:56, Justin He wrote:
> Hi Dan
> 
>> -----Original Message-----
>> From: Dan Williams <dan.j.williams@intel.com>
>> Sent: Wednesday, July 8, 2020 1:48 PM
>> To: Mike Rapoport <rppt@linux.ibm.com>
>> Cc: Justin He <Justin.He@arm.com>; Michal Hocko <mhocko@kernel.org>; David
>> Hildenbrand <david@redhat.com>; Catalin Marinas <Catalin.Marinas@arm.com>;
>> Will Deacon <will@kernel.org>; Vishal Verma <vishal.l.verma@intel.com>;
>> Dave Jiang <dave.jiang@intel.com>; Andrew Morton <akpm@linux-
>> foundation.org>; Baoquan He <bhe@redhat.com>; Chuhong Yuan
>> <hslester96@gmail.com>; linux-arm-kernel@lists.infradead.org; linux-
>> kernel@vger.kernel.org; linux-mm@kvack.org; linux-nvdimm@lists.01.org;
>> Kaly Xin <Kaly.Xin@arm.com>
>> Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid
>> as EXPORT_SYMBOL_GPL
>>
>> On Tue, Jul 7, 2020 at 10:33 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
>>>
>>> On Tue, Jul 07, 2020 at 08:56:36PM -0700, Dan Williams wrote:
>>>> On Tue, Jul 7, 2020 at 7:20 PM Justin He <Justin.He@arm.com> wrote:
>>>>>
>>>>> Hi Michal and David
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Michal Hocko <mhocko@kernel.org>
>>>>>> Sent: Tuesday, July 7, 2020 7:55 PM
>>>>>> To: Justin He <Justin.He@arm.com>
>>>>>> Cc: Catalin Marinas <Catalin.Marinas@arm.com>; Will Deacon
>>>>>> <will@kernel.org>; Dan Williams <dan.j.williams@intel.com>; Vishal
>> Verma
>>>>>> <vishal.l.verma@intel.com>; Dave Jiang <dave.jiang@intel.com>;
>> Andrew
>>>>>> Morton <akpm@linux-foundation.org>; Mike Rapoport
>> <rppt@linux.ibm.com>;
>>>>>> Baoquan He <bhe@redhat.com>; Chuhong Yuan <hslester96@gmail.com>;
>> linux-
>>>>>> arm-kernel@lists.infradead.org; linux-kernel@vger.kernel.org;
>> linux-
>>>>>> mm@kvack.org; linux-nvdimm@lists.01.org; Kaly Xin
>> <Kaly.Xin@arm.com>
>>>>>> Subject: Re: [PATCH v2 1/3] arm64/numa: export
>> memory_add_physaddr_to_nid
>>>>>> as EXPORT_SYMBOL_GPL
>>>>>>
>>>>>> On Tue 07-07-20 13:59:15, Jia He wrote:
>>>>>>> This exports memory_add_physaddr_to_nid() for module driver to
>> use.
>>>>>>>
>>>>>>> memory_add_physaddr_to_nid() is a fallback option to get the nid
>> in case
>>>>>>> NUMA_NO_NID is detected.
>>>>>>>
>>>>>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>>>>>> Signed-off-by: Jia He <justin.he@arm.com>
>>>>>>> ---
>>>>>>>  arch/arm64/mm/numa.c | 5 +++--
>>>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>>>>>> index aafcee3e3f7e..7eeb31740248 100644
>>>>>>> --- a/arch/arm64/mm/numa.c
>>>>>>> +++ b/arch/arm64/mm/numa.c
>>>>>>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
>>>>>>>
>>>>>>>  /*
>>>>>>>   * We hope that we will be hotplugging memory on nodes we
>> already know
>>>>>> about,
>>>>>>> - * such that acpi_get_node() succeeds and we never fall back to
>> this...
>>>>>>> + * such that acpi_get_node() succeeds. But when SRAT is not
>> present,
>>>>>> the node
>>>>>>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a
>> fallback
>>>>>> option.
>>>>>>>   */
>>>>>>>  int memory_add_physaddr_to_nid(u64 addr)
>>>>>>>  {
>>>>>>> -   pr_warn("Unknown node for memory at 0x%llx, assuming node
>> 0\n",
>>>>>> addr);
>>>>>>>     return 0;
>>>>>>>  }
>>>>>>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>>>>>>
>>>>>> Does it make sense to export a noop function? Wouldn't make more
>> sense
>>>>>> to simply make it static inline somewhere in a header? I haven't
>> checked
>>>>>> whether there is an easy way to do that sanely bu this just hit my
>> eyes.
>>>>>
>>>>> Okay, I can make a change in memory_hotplug.h, sth like:
>>>>> --- a/include/linux/memory_hotplug.h
>>>>> +++ b/include/linux/memory_hotplug.h
>>>>> @@ -149,13 +149,13 @@ int add_pages(int nid, unsigned long start_pfn,
>> unsigned long nr_pages,
>>>>>               struct mhp_params *params);
>>>>>  #endif /* ARCH_HAS_ADD_PAGES */
>>>>>
>>>>> -#ifdef CONFIG_NUMA
>>>>> -extern int memory_add_physaddr_to_nid(u64 start);
>>>>> -#else
>>>>> +#if !defined(CONFIG_NUMA) || !defined(memory_add_physaddr_to_nid)
>>>>>  static inline int memory_add_physaddr_to_nid(u64 start)
>>>>>  {
>>>>>         return 0;
>>>>>  }
>>>>> +#else
>>>>> +extern int memory_add_physaddr_to_nid(u64 start);
>>>>>  #endif
>>>>>
>>>>> And then check the memory_add_physaddr_to_nid() helper on all arches,
>>>>> if it is noop(return 0), I can simply remove it.
>>>>> if it is not noop, after the helper,
>>>>> #define memory_add_physaddr_to_nid
>>>>>
>>>>> What do you think of this proposal?
>>>>
>>>> Especially for architectures that use memblock info for numa info
>>>> (which seems to be everyone except x86) why not implement a generic
>>>> memory_add_physaddr_to_nid() that does:
>>>
>>> That would be only arm64.
>>>
>>
>> Darn, I saw ARCH_KEEP_MEMBLOCK and had delusions of grandeur that it
>> could solve my numa api woes. At least for x86 the problem is already
>> solved with reserved numa_meminfo, but now I'm trying to write generic
>> drivers that use those apis and finding these gaps on other archs.
> 
> Even on arm64, there is a dependency issue in dax_pmem kmem case.
> If dax pmem uses memory_add_physaddr_to_nid() to decide which node that
> memblock should add into, get_pfn_range_for_nid() might not have
> the correct memblock info at that time. That is, get_pfn_range_for_nid()
> can't get the correct memblock info before add_memory()
> 
> So IMO, memory_add_physaddr_to_nid() still have to implement as noop on
> arm64 (return 0) together with sh,s390x? Powerpc, x86,ia64 can use their
> own implementation. And phys_to_target_node() can use your suggested(
> for_each_online_node() ...)
> 
> What do you think of it? Thanks

You are trying to fix the "we only have one dummy node" AFAIU, so what
you propose here is certainly good enough for now.
Dan Williams July 8, 2020, 7:04 a.m. UTC | #21
On Tue, Jul 7, 2020 at 11:59 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 08.07.20 08:22, Mike Rapoport wrote:
> > On Tue, Jul 07, 2020 at 09:27:43PM -0700, Dan Williams wrote:
> >> On Tue, Jul 7, 2020 at 9:08 PM Justin He <Justin.He@arm.com> wrote:
> >> [..]
> >>>> Especially for architectures that use memblock info for numa info
> >>>> (which seems to be everyone except x86) why not implement a generic
> >>>> memory_add_physaddr_to_nid() that does:
> >>>>
> >>>> int memory_add_physaddr_to_nid(u64 addr)
> >>>> {
> >>>>         unsigned long start_pfn, end_pfn, pfn = PHYS_PFN(addr);
> >>>>         int nid;
> >>>>
> >>>>         for_each_online_node(nid) {
> >>>>                 get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> >>>>                 if (pfn >= start_pfn && pfn <= end_pfn)
> >>>>                         return nid;
> >>>>         }
> >>>>         return NUMA_NO_NODE;
> >>>> }
> >>>
> >>> Thanks for your suggestion,
> >>> Could I wrap the codes and let memory_add_physaddr_to_nid simply invoke
> >>> phys_to_target_node()?
> >>
> >> I think it needs to be the reverse. phys_to_target_node() should call
> >> memory_add_physaddr_to_nid() by default, but fall back to searching
> >> reserved memory address ranges in memblock. See phys_to_target_node()
> >> in arch/x86/mm/numa.c. That one uses numa_meminfo instead of memblock,
> >> but the principle is the same i.e. that a target node may not be
> >> represented in memblock.memory, but memblock.reserved. I'm working on
> >> a patch to provide a function similar to get_pfn_range_for_nid() that
> >> operates on reserved memory.
> >
> > Do we really need yet another memblock iterator?
> > I think only x86 has memory that is not in memblock.memory but only in
> > memblock.reserved.
>
> Reading about abusing the memblock allcoator once again in memory
> hotplug paths makes me shiver.

Technical reasoning please?

arm64 numa information is established from memblock data. It seems
counterproductive to ignore that fact if we're already touching
memory_add_physaddr_to_nid() and have a use case for a driver to call
it.
David Hildenbrand July 8, 2020, 7:16 a.m. UTC | #22
On 08.07.20 09:04, Dan Williams wrote:
> On Tue, Jul 7, 2020 at 11:59 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 08.07.20 08:22, Mike Rapoport wrote:
>>> On Tue, Jul 07, 2020 at 09:27:43PM -0700, Dan Williams wrote:
>>>> On Tue, Jul 7, 2020 at 9:08 PM Justin He <Justin.He@arm.com> wrote:
>>>> [..]
>>>>>> Especially for architectures that use memblock info for numa info
>>>>>> (which seems to be everyone except x86) why not implement a generic
>>>>>> memory_add_physaddr_to_nid() that does:
>>>>>>
>>>>>> int memory_add_physaddr_to_nid(u64 addr)
>>>>>> {
>>>>>>         unsigned long start_pfn, end_pfn, pfn = PHYS_PFN(addr);
>>>>>>         int nid;
>>>>>>
>>>>>>         for_each_online_node(nid) {
>>>>>>                 get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
>>>>>>                 if (pfn >= start_pfn && pfn <= end_pfn)
>>>>>>                         return nid;
>>>>>>         }
>>>>>>         return NUMA_NO_NODE;
>>>>>> }
>>>>>
>>>>> Thanks for your suggestion,
>>>>> Could I wrap the codes and let memory_add_physaddr_to_nid simply invoke
>>>>> phys_to_target_node()?
>>>>
>>>> I think it needs to be the reverse. phys_to_target_node() should call
>>>> memory_add_physaddr_to_nid() by default, but fall back to searching
>>>> reserved memory address ranges in memblock. See phys_to_target_node()
>>>> in arch/x86/mm/numa.c. That one uses numa_meminfo instead of memblock,
>>>> but the principle is the same i.e. that a target node may not be
>>>> represented in memblock.memory, but memblock.reserved. I'm working on
>>>> a patch to provide a function similar to get_pfn_range_for_nid() that
>>>> operates on reserved memory.
>>>
>>> Do we really need yet another memblock iterator?
>>> I think only x86 has memory that is not in memblock.memory but only in
>>> memblock.reserved.
>>
>> Reading about abusing the memblock allcoator once again in memory
>> hotplug paths makes me shiver.
> 
> Technical reasoning please?

ARCH_KEEP_MEMBLOCK is (AFAIK) only a hack for arm64 to implement
pfn_valid(), because they zap out individual pages corresponding to
memory holes of full sections.

I am not a friend of adding more post-init code to rely on memblock
data. It just makes it harder to eventually get rid of ARCH_KEEP_MEMBLOCK.

> 
> arm64 numa information is established from memblock data. It seems
> counterproductive to ignore that fact if we're already touching
> memory_add_physaddr_to_nid() and have a use case for a driver to call
> it.

... and we are trying to handle the "only a single dummy node" case
(patch #2), or what am I missing? What is there to optimize currently?
David Hildenbrand July 8, 2020, 7:21 a.m. UTC | #23
On 08.07.20 07:27, Mike Rapoport wrote:
> On Tue, Jul 07, 2020 at 03:05:48PM -0700, Dan Williams wrote:
>> On Tue, Jul 7, 2020 at 11:01 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
>>>
>>> On Tue, Jul 07, 2020 at 02:26:08PM +0200, David Hildenbrand wrote:
>>>> On 07.07.20 14:13, Mike Rapoport wrote:
>>>>> On Tue, Jul 07, 2020 at 01:54:54PM +0200, Michal Hocko wrote:
>>>>>> On Tue 07-07-20 13:59:15, Jia He wrote:
>>>>>>> This exports memory_add_physaddr_to_nid() for module driver to use.
>>>>>>>
>>>>>>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
>>>>>>> NUMA_NO_NID is detected.
>>>>>>>
>>>>>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>>>>>> Signed-off-by: Jia He <justin.he@arm.com>
>>>>>>> ---
>>>>>>>  arch/arm64/mm/numa.c | 5 +++--
>>>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>>>>>> index aafcee3e3f7e..7eeb31740248 100644
>>>>>>> --- a/arch/arm64/mm/numa.c
>>>>>>> +++ b/arch/arm64/mm/numa.c
>>>>>>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
>>>>>>>
>>>>>>>  /*
>>>>>>>   * We hope that we will be hotplugging memory on nodes we already know about,
>>>>>>> - * such that acpi_get_node() succeeds and we never fall back to this...
>>>>>>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
>>>>>>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
>>>>>>>   */
>>>>>>>  int memory_add_physaddr_to_nid(u64 addr)
>>>>>>>  {
>>>>>>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
>>>>>>>   return 0;
>>>>>>>  }
>>>>>>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>>>>>>
>>>>>> Does it make sense to export a noop function? Wouldn't make more sense
>>>>>> to simply make it static inline somewhere in a header? I haven't checked
>>>>>> whether there is an easy way to do that sanely bu this just hit my eyes.
>>>>>
>>>>> We'll need to either add a CONFIG_ option or arch specific callback to
>>>>> make both non-empty (x86, powerpc, ia64) and empty (arm64, sh)
>>>>> implementations coexist ...
>>>>
>>>> Note: I have a similar dummy (return 0) patch for s390x lying around here.
>>>
>>> Then we'll call it a tie - 3:3 ;-)
>>
>> So I'd be happy to jump on the train of people wanting to export the
>> ARM stub for this (and add a new ARM stub for phys_to_target_node()),
>> but Will did have a plausibly better idea that I have been meaning to
>> circle back to:
>>
>> http://lore.kernel.org/r/20200325111039.GA32109@willie-the-truck
>>
>> ...i.e. iterate over node data to do the lookup. This would seem to
>> work generically for multiple archs unless I am missing something?

IIRC, only memory assigned to/onlined to a ZONE is represented in the
pgdat node span. E.g., not offline memory blocks.

Esp., when hotplugging + onlining consecutive memory, there won't really
be any intersections in most cases if I am not wrong. It would not be
"intersection" but rather "closest fit".

With overlapping nodes it's even more unclear. Which one to pick?

> 
> I think it would work on arm64, power and, most propbably on s390

With only a single dummy node I guess it should work (searching when
there is only a single node does not make too much sense).

> (David?), but not on x86. x86 does not have reserved memory in pgdat,
> it's never memblock_add()'ed (see e820__memblock_setup()).

Can you enlighten me why that is relevant for the memory hotplug path?
(or is it just a general comment to make the function as accurate as
possible for all addresses?)
Mike Rapoport July 8, 2020, 7:38 a.m. UTC | #24
On Wed, Jul 08, 2020 at 09:21:25AM +0200, David Hildenbrand wrote:
> On 08.07.20 07:27, Mike Rapoport wrote:
> > On Tue, Jul 07, 2020 at 03:05:48PM -0700, Dan Williams wrote:
> >> On Tue, Jul 7, 2020 at 11:01 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
> >>>
> >>> On Tue, Jul 07, 2020 at 02:26:08PM +0200, David Hildenbrand wrote:
> >>>> On 07.07.20 14:13, Mike Rapoport wrote:
> >>>>> On Tue, Jul 07, 2020 at 01:54:54PM +0200, Michal Hocko wrote:
> >>>>>> On Tue 07-07-20 13:59:15, Jia He wrote:
> >>>>>>> This exports memory_add_physaddr_to_nid() for module driver to use.
> >>>>>>>
> >>>>>>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> >>>>>>> NUMA_NO_NID is detected.
> >>>>>>>
> >>>>>>> Suggested-by: David Hildenbrand <david@redhat.com>
> >>>>>>> Signed-off-by: Jia He <justin.he@arm.com>
> >>>>>>> ---
> >>>>>>>  arch/arm64/mm/numa.c | 5 +++--
> >>>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> >>>>>>> index aafcee3e3f7e..7eeb31740248 100644
> >>>>>>> --- a/arch/arm64/mm/numa.c
> >>>>>>> +++ b/arch/arm64/mm/numa.c
> >>>>>>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> >>>>>>>
> >>>>>>>  /*
> >>>>>>>   * We hope that we will be hotplugging memory on nodes we already know about,
> >>>>>>> - * such that acpi_get_node() succeeds and we never fall back to this...
> >>>>>>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
> >>>>>>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
> >>>>>>>   */
> >>>>>>>  int memory_add_physaddr_to_nid(u64 addr)
> >>>>>>>  {
> >>>>>>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
> >>>>>>>   return 0;
> >>>>>>>  }
> >>>>>>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> >>>>>>
> >>>>>> Does it make sense to export a noop function? Wouldn't make more sense
> >>>>>> to simply make it static inline somewhere in a header? I haven't checked
> >>>>>> whether there is an easy way to do that sanely bu this just hit my eyes.
> >>>>>
> >>>>> We'll need to either add a CONFIG_ option or arch specific callback to
> >>>>> make both non-empty (x86, powerpc, ia64) and empty (arm64, sh)
> >>>>> implementations coexist ...
> >>>>
> >>>> Note: I have a similar dummy (return 0) patch for s390x lying around here.
> >>>
> >>> Then we'll call it a tie - 3:3 ;-)
> >>
> >> So I'd be happy to jump on the train of people wanting to export the
> >> ARM stub for this (and add a new ARM stub for phys_to_target_node()),
> >> but Will did have a plausibly better idea that I have been meaning to
> >> circle back to:
> >>
> >> http://lore.kernel.org/r/20200325111039.GA32109@willie-the-truck
> >>
> >> ...i.e. iterate over node data to do the lookup. This would seem to
> >> work generically for multiple archs unless I am missing something?
> 
> IIRC, only memory assigned to/onlined to a ZONE is represented in the
> pgdat node span. E.g., not offline memory blocks.
> 
> Esp., when hotplugging + onlining consecutive memory, there won't really
> be any intersections in most cases if I am not wrong. It would not be
> "intersection" but rather "closest fit".
> 
> With overlapping nodes it's even more unclear. Which one to pick?
> 
> > 
> > I think it would work on arm64, power and, most propbably on s390
> 
> With only a single dummy node I guess it should work (searching when
> there is only a single node does not make too much sense).
> 
> > (David?), but not on x86. x86 does not have reserved memory in pgdat,
> > it's never memblock_add()'ed (see e820__memblock_setup()).
> 
> Can you enlighten me why that is relevant for the memory hotplug path?
> (or is it just a general comment to make the function as accurate as
> possible for all addresses?)

phys_to_target_node() on x86 falls back to numa_reserved_meminfo which
holds memory that is never listed in a node.

> -- 
> Thanks,
> 
> David / dhildenb
>
David Hildenbrand July 8, 2020, 7:40 a.m. UTC | #25
On 08.07.20 09:38, Mike Rapoport wrote:
> On Wed, Jul 08, 2020 at 09:21:25AM +0200, David Hildenbrand wrote:
>> On 08.07.20 07:27, Mike Rapoport wrote:
>>> On Tue, Jul 07, 2020 at 03:05:48PM -0700, Dan Williams wrote:
>>>> On Tue, Jul 7, 2020 at 11:01 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
>>>>>
>>>>> On Tue, Jul 07, 2020 at 02:26:08PM +0200, David Hildenbrand wrote:
>>>>>> On 07.07.20 14:13, Mike Rapoport wrote:
>>>>>>> On Tue, Jul 07, 2020 at 01:54:54PM +0200, Michal Hocko wrote:
>>>>>>>> On Tue 07-07-20 13:59:15, Jia He wrote:
>>>>>>>>> This exports memory_add_physaddr_to_nid() for module driver to use.
>>>>>>>>>
>>>>>>>>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
>>>>>>>>> NUMA_NO_NID is detected.
>>>>>>>>>
>>>>>>>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>>>>>>>> Signed-off-by: Jia He <justin.he@arm.com>
>>>>>>>>> ---
>>>>>>>>>  arch/arm64/mm/numa.c | 5 +++--
>>>>>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>>>>>>>> index aafcee3e3f7e..7eeb31740248 100644
>>>>>>>>> --- a/arch/arm64/mm/numa.c
>>>>>>>>> +++ b/arch/arm64/mm/numa.c
>>>>>>>>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
>>>>>>>>>
>>>>>>>>>  /*
>>>>>>>>>   * We hope that we will be hotplugging memory on nodes we already know about,
>>>>>>>>> - * such that acpi_get_node() succeeds and we never fall back to this...
>>>>>>>>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
>>>>>>>>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
>>>>>>>>>   */
>>>>>>>>>  int memory_add_physaddr_to_nid(u64 addr)
>>>>>>>>>  {
>>>>>>>>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
>>>>>>>>>   return 0;
>>>>>>>>>  }
>>>>>>>>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>>>>>>>>
>>>>>>>> Does it make sense to export a noop function? Wouldn't make more sense
>>>>>>>> to simply make it static inline somewhere in a header? I haven't checked
>>>>>>>> whether there is an easy way to do that sanely bu this just hit my eyes.
>>>>>>>
>>>>>>> We'll need to either add a CONFIG_ option or arch specific callback to
>>>>>>> make both non-empty (x86, powerpc, ia64) and empty (arm64, sh)
>>>>>>> implementations coexist ...
>>>>>>
>>>>>> Note: I have a similar dummy (return 0) patch for s390x lying around here.
>>>>>
>>>>> Then we'll call it a tie - 3:3 ;-)
>>>>
>>>> So I'd be happy to jump on the train of people wanting to export the
>>>> ARM stub for this (and add a new ARM stub for phys_to_target_node()),
>>>> but Will did have a plausibly better idea that I have been meaning to
>>>> circle back to:
>>>>
>>>> http://lore.kernel.org/r/20200325111039.GA32109@willie-the-truck
>>>>
>>>> ...i.e. iterate over node data to do the lookup. This would seem to
>>>> work generically for multiple archs unless I am missing something?
>>
>> IIRC, only memory assigned to/onlined to a ZONE is represented in the
>> pgdat node span. E.g., not offline memory blocks.
>>
>> Esp., when hotplugging + onlining consecutive memory, there won't really
>> be any intersections in most cases if I am not wrong. It would not be
>> "intersection" but rather "closest fit".
>>
>> With overlapping nodes it's even more unclear. Which one to pick?
>>
>>>
>>> I think it would work on arm64, power and, most propbably on s390
>>
>> With only a single dummy node I guess it should work (searching when
>> there is only a single node does not make too much sense).
>>
>>> (David?), but not on x86. x86 does not have reserved memory in pgdat,
>>> it's never memblock_add()'ed (see e820__memblock_setup()).
>>
>> Can you enlighten me why that is relevant for the memory hotplug path?
>> (or is it just a general comment to make the function as accurate as
>> possible for all addresses?)
> 
> phys_to_target_node() on x86 falls back to numa_reserved_meminfo which
> holds memory that is never listed in a node.
> 

Ah, I see - thanks.
Mike Rapoport July 8, 2020, 7:43 a.m. UTC | #26
On Wed, Jul 08, 2020 at 09:16:01AM +0200, David Hildenbrand wrote:
> On 08.07.20 09:04, Dan Williams wrote:
> > On Tue, Jul 7, 2020 at 11:59 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 08.07.20 08:22, Mike Rapoport wrote:
> >>> On Tue, Jul 07, 2020 at 09:27:43PM -0700, Dan Williams wrote:
> >>>> On Tue, Jul 7, 2020 at 9:08 PM Justin He <Justin.He@arm.com> wrote:
> >>>> [..]
> >>>>>> Especially for architectures that use memblock info for numa info
> >>>>>> (which seems to be everyone except x86) why not implement a generic
> >>>>>> memory_add_physaddr_to_nid() that does:
> >>>>>>
> >>>>>> int memory_add_physaddr_to_nid(u64 addr)
> >>>>>> {
> >>>>>>         unsigned long start_pfn, end_pfn, pfn = PHYS_PFN(addr);
> >>>>>>         int nid;
> >>>>>>
> >>>>>>         for_each_online_node(nid) {
> >>>>>>                 get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> >>>>>>                 if (pfn >= start_pfn && pfn <= end_pfn)
> >>>>>>                         return nid;
> >>>>>>         }
> >>>>>>         return NUMA_NO_NODE;
> >>>>>> }
> >>>>>
> >>>>> Thanks for your suggestion,
> >>>>> Could I wrap the codes and let memory_add_physaddr_to_nid simply invoke
> >>>>> phys_to_target_node()?
> >>>>
> >>>> I think it needs to be the reverse. phys_to_target_node() should call
> >>>> memory_add_physaddr_to_nid() by default, but fall back to searching
> >>>> reserved memory address ranges in memblock. See phys_to_target_node()
> >>>> in arch/x86/mm/numa.c. That one uses numa_meminfo instead of memblock,
> >>>> but the principle is the same i.e. that a target node may not be
> >>>> represented in memblock.memory, but memblock.reserved. I'm working on
> >>>> a patch to provide a function similar to get_pfn_range_for_nid() that
> >>>> operates on reserved memory.
> >>>
> >>> Do we really need yet another memblock iterator?
> >>> I think only x86 has memory that is not in memblock.memory but only in
> >>> memblock.reserved.
> >>
> >> Reading about abusing the memblock allcoator once again in memory
> >> hotplug paths makes me shiver.
> > 
> > Technical reasoning please?
> 
> ARCH_KEEP_MEMBLOCK is (AFAIK) only a hack for arm64 to implement
> pfn_valid(), because they zap out individual pages corresponding to
> memory holes of full sections.
> 
> I am not a friend of adding more post-init code to rely on memblock
> data. It just makes it harder to eventually get rid of ARCH_KEEP_MEMBLOCK.

The most heavy user of memblock in post-init code is powerpc. It won't
be easy to get rid of it there.

> > arm64 numa information is established from memblock data. It seems
> > counterproductive to ignore that fact if we're already touching
> > memory_add_physaddr_to_nid() and have a use case for a driver to call
> > it.
> 
> ... and we are trying to handle the "only a single dummy node" case
> (patch #2), or what am I missing? What is there to optimize currently?
> 
> -- 
> Thanks,
> 
> David / dhildenb
>
Dan Williams July 8, 2020, 7:50 a.m. UTC | #27
On Wed, Jul 8, 2020 at 12:22 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 08.07.20 07:27, Mike Rapoport wrote:
> > On Tue, Jul 07, 2020 at 03:05:48PM -0700, Dan Williams wrote:
> >> On Tue, Jul 7, 2020 at 11:01 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
> >>>
> >>> On Tue, Jul 07, 2020 at 02:26:08PM +0200, David Hildenbrand wrote:
> >>>> On 07.07.20 14:13, Mike Rapoport wrote:
> >>>>> On Tue, Jul 07, 2020 at 01:54:54PM +0200, Michal Hocko wrote:
> >>>>>> On Tue 07-07-20 13:59:15, Jia He wrote:
> >>>>>>> This exports memory_add_physaddr_to_nid() for module driver to use.
> >>>>>>>
> >>>>>>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> >>>>>>> NUMA_NO_NID is detected.
> >>>>>>>
> >>>>>>> Suggested-by: David Hildenbrand <david@redhat.com>
> >>>>>>> Signed-off-by: Jia He <justin.he@arm.com>
> >>>>>>> ---
> >>>>>>>  arch/arm64/mm/numa.c | 5 +++--
> >>>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> >>>>>>> index aafcee3e3f7e..7eeb31740248 100644
> >>>>>>> --- a/arch/arm64/mm/numa.c
> >>>>>>> +++ b/arch/arm64/mm/numa.c
> >>>>>>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> >>>>>>>
> >>>>>>>  /*
> >>>>>>>   * We hope that we will be hotplugging memory on nodes we already know about,
> >>>>>>> - * such that acpi_get_node() succeeds and we never fall back to this...
> >>>>>>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
> >>>>>>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
> >>>>>>>   */
> >>>>>>>  int memory_add_physaddr_to_nid(u64 addr)
> >>>>>>>  {
> >>>>>>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
> >>>>>>>   return 0;
> >>>>>>>  }
> >>>>>>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> >>>>>>
> >>>>>> Does it make sense to export a noop function? Wouldn't make more sense
> >>>>>> to simply make it static inline somewhere in a header? I haven't checked
> >>>>>> whether there is an easy way to do that sanely bu this just hit my eyes.
> >>>>>
> >>>>> We'll need to either add a CONFIG_ option or arch specific callback to
> >>>>> make both non-empty (x86, powerpc, ia64) and empty (arm64, sh)
> >>>>> implementations coexist ...
> >>>>
> >>>> Note: I have a similar dummy (return 0) patch for s390x lying around here.
> >>>
> >>> Then we'll call it a tie - 3:3 ;-)
> >>
> >> So I'd be happy to jump on the train of people wanting to export the
> >> ARM stub for this (and add a new ARM stub for phys_to_target_node()),
> >> but Will did have a plausibly better idea that I have been meaning to
> >> circle back to:
> >>
> >> http://lore.kernel.org/r/20200325111039.GA32109@willie-the-truck
> >>
> >> ...i.e. iterate over node data to do the lookup. This would seem to
> >> work generically for multiple archs unless I am missing something?
>
> IIRC, only memory assigned to/onlined to a ZONE is represented in the
> pgdat node span. E.g., not offline memory blocks.

So this dovetails somewhat with Will's idea. What if we populated
node_data for "offline" ranges? I started there, but then saw
ARCH_KEEP_MEMBLOCK and thought it would be safer to just teach
phys_to_target_node() to use that rather than update other code paths
to expect node_data might not always reflect online data.

> Esp., when hotplugging + onlining consecutive memory, there won't really
> be any intersections in most cases if I am not wrong. It would not be
> "intersection" but rather "closest fit".
>
> With overlapping nodes it's even more unclear. Which one to pick?

In the overlap case you get what you get. Some signal is better than
the noise of a dummy function. The consequences of picking the wrong
node might be that the kernel can't properly associate a memory range
to its performance data tables in firmware, but then again firmware
messed up with an overlapping node definition in the first instance.
David Hildenbrand July 8, 2020, 8:26 a.m. UTC | #28
On 08.07.20 09:50, Dan Williams wrote:
> On Wed, Jul 8, 2020 at 12:22 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 08.07.20 07:27, Mike Rapoport wrote:
>>> On Tue, Jul 07, 2020 at 03:05:48PM -0700, Dan Williams wrote:
>>>> On Tue, Jul 7, 2020 at 11:01 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
>>>>>
>>>>> On Tue, Jul 07, 2020 at 02:26:08PM +0200, David Hildenbrand wrote:
>>>>>> On 07.07.20 14:13, Mike Rapoport wrote:
>>>>>>> On Tue, Jul 07, 2020 at 01:54:54PM +0200, Michal Hocko wrote:
>>>>>>>> On Tue 07-07-20 13:59:15, Jia He wrote:
>>>>>>>>> This exports memory_add_physaddr_to_nid() for module driver to use.
>>>>>>>>>
>>>>>>>>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
>>>>>>>>> NUMA_NO_NID is detected.
>>>>>>>>>
>>>>>>>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>>>>>>>> Signed-off-by: Jia He <justin.he@arm.com>
>>>>>>>>> ---
>>>>>>>>>  arch/arm64/mm/numa.c | 5 +++--
>>>>>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>>>>>>>> index aafcee3e3f7e..7eeb31740248 100644
>>>>>>>>> --- a/arch/arm64/mm/numa.c
>>>>>>>>> +++ b/arch/arm64/mm/numa.c
>>>>>>>>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
>>>>>>>>>
>>>>>>>>>  /*
>>>>>>>>>   * We hope that we will be hotplugging memory on nodes we already know about,
>>>>>>>>> - * such that acpi_get_node() succeeds and we never fall back to this...
>>>>>>>>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
>>>>>>>>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
>>>>>>>>>   */
>>>>>>>>>  int memory_add_physaddr_to_nid(u64 addr)
>>>>>>>>>  {
>>>>>>>>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
>>>>>>>>>   return 0;
>>>>>>>>>  }
>>>>>>>>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>>>>>>>>
>>>>>>>> Does it make sense to export a noop function? Wouldn't make more sense
>>>>>>>> to simply make it static inline somewhere in a header? I haven't checked
>>>>>>>> whether there is an easy way to do that sanely bu this just hit my eyes.
>>>>>>>
>>>>>>> We'll need to either add a CONFIG_ option or arch specific callback to
>>>>>>> make both non-empty (x86, powerpc, ia64) and empty (arm64, sh)
>>>>>>> implementations coexist ...
>>>>>>
>>>>>> Note: I have a similar dummy (return 0) patch for s390x lying around here.
>>>>>
>>>>> Then we'll call it a tie - 3:3 ;-)
>>>>
>>>> So I'd be happy to jump on the train of people wanting to export the
>>>> ARM stub for this (and add a new ARM stub for phys_to_target_node()),
>>>> but Will did have a plausibly better idea that I have been meaning to
>>>> circle back to:
>>>>
>>>> http://lore.kernel.org/r/20200325111039.GA32109@willie-the-truck
>>>>
>>>> ...i.e. iterate over node data to do the lookup. This would seem to
>>>> work generically for multiple archs unless I am missing something?
>>
>> IIRC, only memory assigned to/onlined to a ZONE is represented in the
>> pgdat node span. E.g., not offline memory blocks.
> 
> So this dovetails somewhat with Will's idea. What if we populated
> node_data for "offline" ranges? I started there, but then saw
> ARCH_KEEP_MEMBLOCK and thought it would be safer to just teach
> phys_to_target_node() to use that rather than update other code paths
> to expect node_data might not always reflect online data.

We currently need a somewhat-accurate pgdat node span to detect when to
offline a node. See try_offline_node(). This works fairly reliable.

Shrinking the node span is currently fairly easy for !ZONE_DEVICE
memory, because we can rely on pfn_to_online_page() + pfn_to_nid(pfn).
See e.g., find_biggest_section_pfn().

If we glue growing/shrinking the node span to adding/removing of memory
(instead of e.g., onlining/offlining), we can no longer base shrinking
on memmap data. We would have to get the information ("how far can I
shrink the node span, is it empty?") from somewhere else. E.g.,
for_each_memory_block() - but that one does not cover ZONE_DEVICE. And
there are memory blocks which cover multiple nodes, in which case we
only store one of them ... unreliable.

This certainly needs more thought :/

> 
>> Esp., when hotplugging + onlining consecutive memory, there won't really
>> be any intersections in most cases if I am not wrong. It would not be
>> "intersection" but rather "closest fit".
>>
>> With overlapping nodes it's even more unclear. Which one to pick?
> 
> In the overlap case you get what you get. Some signal is better than
> the noise of a dummy function. The consequences of picking the wrong
> node might be that the kernel can't properly associate a memory range
> to its performance data tables in firmware, but then again firmware
> messed up with an overlapping node definition in the first instance.

I'd be curious if what we are trying to optimize here is actually worth
optimizing. IOW, is there a well-known scenario where the dummy value on
arm64 would be problematic and is worth the effort?

I mean, in all performance relevant setups (ignoring
hv_balloon/xen-balloon/prove_store(), which also use
memory_add_physaddr_to_nid()), we should have a proper PXM/node
specified by the hardware on memory hotadd. The fallback of
memory_add_physaddr_to_nid() is not relevant in these scenarios.
Mike Rapoport July 8, 2020, 8:39 a.m. UTC | #29
On Wed, Jul 08, 2020 at 10:26:41AM +0200, David Hildenbrand wrote:
> On 08.07.20 09:50, Dan Williams wrote:
> > On Wed, Jul 8, 2020 at 12:22 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >>>>>>>> On Tue 07-07-20 13:59:15, Jia He wrote:
> >>>>>>>>> This exports memory_add_physaddr_to_nid() for module driver to use.
> >>>>>>>>>
> >>>>>>>>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> >>>>>>>>> NUMA_NO_NID is detected.
> >>>>>>>>>
> >>>>>>>>> Suggested-by: David Hildenbrand <david@redhat.com>
> >>>>>>>>> Signed-off-by: Jia He <justin.he@arm.com>
> >>>>>>>>> ---
> >>>>>>>>>  arch/arm64/mm/numa.c | 5 +++--
> >>>>>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>>>>>>
> >>>>>>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> >>>>>>>>> index aafcee3e3f7e..7eeb31740248 100644
> >>>>>>>>> --- a/arch/arm64/mm/numa.c
> >>>>>>>>> +++ b/arch/arm64/mm/numa.c
> >>>>>>>>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> >>>>>>>>>
> >>>>>>>>>  /*
> >>>>>>>>>   * We hope that we will be hotplugging memory on nodes we already know about,
> >>>>>>>>> - * such that acpi_get_node() succeeds and we never fall back to this...
> >>>>>>>>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
> >>>>>>>>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
> >>>>>>>>>   */
> >>>>>>>>>  int memory_add_physaddr_to_nid(u64 addr)
> >>>>>>>>>  {
> >>>>>>>>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
> >>>>>>>>>   return 0;
> >>>>>>>>>  }
> >>>>>>>>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> >>>>>>>>
> >>>>>>>> Does it make sense to export a noop function? Wouldn't make more sense
> >>>>>>>> to simply make it static inline somewhere in a header? I haven't checked
> >>>>>>>> whether there is an easy way to do that sanely bu this just hit my eyes.

> I'd be curious if what we are trying to optimize here is actually worth
> optimizing. IOW, is there a well-known scenario where the dummy value on
> arm64 would be problematic and is worth the effort?

Well, it started with Michal's comment above that EXPORT_SYMBOL_GPL()
for a stub might be an overkill.

I think Jia's suggestion [1] with addition of a comment that explains
why and when the stub will be used, can work for both
memory_add_physaddr_to_nid() and phys_to_target_node().

But on more theoretical/fundmanetal level, I think we lack a generic
abstraction similar to e.g. x86 'struct numa_meminfo' that serves as
translaton of firmware supplied information into data that can be used
by the generic mm without need to reimplement it for each and every
arch.

[1] https://lore.kernel.org/lkml/AM6PR08MB406907F9F2B13DA6DC893AD9F7670@AM6PR08MB4069.eurprd08.prod.outlook.com

> I mean, in all performance relevant setups (ignoring
> hv_balloon/xen-balloon/prove_store(), which also use
> memory_add_physaddr_to_nid()), we should have a proper PXM/node
> specified by the hardware on memory hotadd. The fallback of
> memory_add_physaddr_to_nid() is not relevant in these scenarios.
> 
> -- 
> Thanks,
> 
> David / dhildenb
>
David Hildenbrand July 8, 2020, 8:45 a.m. UTC | #30
On 08.07.20 10:39, Mike Rapoport wrote:
> On Wed, Jul 08, 2020 at 10:26:41AM +0200, David Hildenbrand wrote:
>> On 08.07.20 09:50, Dan Williams wrote:
>>> On Wed, Jul 8, 2020 at 12:22 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>>>>>>>> On Tue 07-07-20 13:59:15, Jia He wrote:
>>>>>>>>>>> This exports memory_add_physaddr_to_nid() for module driver to use.
>>>>>>>>>>>
>>>>>>>>>>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
>>>>>>>>>>> NUMA_NO_NID is detected.
>>>>>>>>>>>
>>>>>>>>>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>>>>>>>>>> Signed-off-by: Jia He <justin.he@arm.com>
>>>>>>>>>>> ---
>>>>>>>>>>>  arch/arm64/mm/numa.c | 5 +++--
>>>>>>>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>>>>>>>>>> index aafcee3e3f7e..7eeb31740248 100644
>>>>>>>>>>> --- a/arch/arm64/mm/numa.c
>>>>>>>>>>> +++ b/arch/arm64/mm/numa.c
>>>>>>>>>>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
>>>>>>>>>>>
>>>>>>>>>>>  /*
>>>>>>>>>>>   * We hope that we will be hotplugging memory on nodes we already know about,
>>>>>>>>>>> - * such that acpi_get_node() succeeds and we never fall back to this...
>>>>>>>>>>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
>>>>>>>>>>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
>>>>>>>>>>>   */
>>>>>>>>>>>  int memory_add_physaddr_to_nid(u64 addr)
>>>>>>>>>>>  {
>>>>>>>>>>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
>>>>>>>>>>>   return 0;
>>>>>>>>>>>  }
>>>>>>>>>>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>>>>>>>>>>
>>>>>>>>>> Does it make sense to export a noop function? Wouldn't make more sense
>>>>>>>>>> to simply make it static inline somewhere in a header? I haven't checked
>>>>>>>>>> whether there is an easy way to do that sanely bu this just hit my eyes.
> 
>> I'd be curious if what we are trying to optimize here is actually worth
>> optimizing. IOW, is there a well-known scenario where the dummy value on
>> arm64 would be problematic and is worth the effort?
> 
> Well, it started with Michal's comment above that EXPORT_SYMBOL_GPL()
> for a stub might be an overkill.
> 
> I think Jia's suggestion [1] with addition of a comment that explains
> why and when the stub will be used, can work for both
> memory_add_physaddr_to_nid() and phys_to_target_node().

Agreed.

> 
> But on more theoretical/fundmanetal level, I think we lack a generic
> abstraction similar to e.g. x86 'struct numa_meminfo' that serves as
> translaton of firmware supplied information into data that can be used
> by the generic mm without need to reimplement it for each and every
> arch.

Right. As I expressed, I am not a friend of using memblock for that, and
the pgdat node span is tricky.

Maybe abstracting that x86 concept is possible in some way (and we could
restrict the information to boot-time properties, so we don't have to
mess with memory hot(un)plug - just as done for numa_meminfo AFAIKS).
Mike Rapoport July 8, 2020, 9:15 a.m. UTC | #31
On Wed, Jul 08, 2020 at 10:45:17AM +0200, David Hildenbrand wrote:
> On 08.07.20 10:39, Mike Rapoport wrote:
> > On Wed, Jul 08, 2020 at 10:26:41AM +0200, David Hildenbrand wrote:
> >> On 08.07.20 09:50, Dan Williams wrote:
> >>> On Wed, Jul 8, 2020 at 12:22 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>>>>>>>> On Tue 07-07-20 13:59:15, Jia He wrote:
> >>>>>>>>>>> This exports memory_add_physaddr_to_nid() for module driver to use.
> >>>>>>>>>>>
> >>>>>>>>>>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> >>>>>>>>>>> NUMA_NO_NID is detected.
> >>>>>>>>>>>
> >>>>>>>>>>> Suggested-by: David Hildenbrand <david@redhat.com>
> >>>>>>>>>>> Signed-off-by: Jia He <justin.he@arm.com>
> >>>>>>>>>>> ---
> >>>>>>>>>>>  arch/arm64/mm/numa.c | 5 +++--
> >>>>>>>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> >>>>>>>>>>> index aafcee3e3f7e..7eeb31740248 100644
> >>>>>>>>>>> --- a/arch/arm64/mm/numa.c
> >>>>>>>>>>> +++ b/arch/arm64/mm/numa.c
> >>>>>>>>>>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> >>>>>>>>>>>
> >>>>>>>>>>>  /*
> >>>>>>>>>>>   * We hope that we will be hotplugging memory on nodes we already know about,
> >>>>>>>>>>> - * such that acpi_get_node() succeeds and we never fall back to this...
> >>>>>>>>>>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
> >>>>>>>>>>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
> >>>>>>>>>>>   */
> >>>>>>>>>>>  int memory_add_physaddr_to_nid(u64 addr)
> >>>>>>>>>>>  {
> >>>>>>>>>>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
> >>>>>>>>>>>   return 0;
> >>>>>>>>>>>  }
> >>>>>>>>>>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> >>>>>>>>>>
> >>>>>>>>>> Does it make sense to export a noop function? Wouldn't make more sense
> >>>>>>>>>> to simply make it static inline somewhere in a header? I haven't checked
> >>>>>>>>>> whether there is an easy way to do that sanely bu this just hit my eyes.
> > 
> >> I'd be curious if what we are trying to optimize here is actually worth
> >> optimizing. IOW, is there a well-known scenario where the dummy value on
> >> arm64 would be problematic and is worth the effort?
> > 
> > Well, it started with Michal's comment above that EXPORT_SYMBOL_GPL()
> > for a stub might be an overkill.
> > 
> > I think Jia's suggestion [1] with addition of a comment that explains
> > why and when the stub will be used, can work for both
> > memory_add_physaddr_to_nid() and phys_to_target_node().
> 
> Agreed.
> 
> > 
> > But on more theoretical/fundmanetal level, I think we lack a generic
> > abstraction similar to e.g. x86 'struct numa_meminfo' that serves as
> > translaton of firmware supplied information into data that can be used
> > by the generic mm without need to reimplement it for each and every
> > arch.
> 
> Right. As I expressed, I am not a friend of using memblock for that, and
> the pgdat node span is tricky.
>
> Maybe abstracting that x86 concept is possible in some way (and we could
> restrict the information to boot-time properties, so we don't have to
> mess with memory hot(un)plug - just as done for numa_meminfo AFAIKS).

I agree with pgdat part and disagree about memblock. It already has
non-init physmap, why won't we add memblock.memory to the mix? ;-)

Now, seriously, memblock already has all the necessary information about
the coldplug memory for several architectures. x86 being an exception
because for some reason the reserved memory is not considered memory
there. The infrastructure for quiering and iterating memory regions is
already there. We just need to leave out the irrelevant parts, like
memblock.reserved and allocation funcions.

Otherwise we'll add yet another 'struct { start, end }', a horde of
covnersion and re-initialization functions that will do more or less the
same things as current memblock APIs.

> -- 
> Thanks,
> 
> David / dhildenb
> 
>
David Hildenbrand July 8, 2020, 9:25 a.m. UTC | #32
On 08.07.20 11:15, Mike Rapoport wrote:
> On Wed, Jul 08, 2020 at 10:45:17AM +0200, David Hildenbrand wrote:
>> On 08.07.20 10:39, Mike Rapoport wrote:
>>> On Wed, Jul 08, 2020 at 10:26:41AM +0200, David Hildenbrand wrote:
>>>> On 08.07.20 09:50, Dan Williams wrote:
>>>>> On Wed, Jul 8, 2020 at 12:22 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>>
>>>>>>>>>>>> On Tue 07-07-20 13:59:15, Jia He wrote:
>>>>>>>>>>>>> This exports memory_add_physaddr_to_nid() for module driver to use.
>>>>>>>>>>>>>
>>>>>>>>>>>>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
>>>>>>>>>>>>> NUMA_NO_NID is detected.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>> Signed-off-by: Jia He <justin.he@arm.com>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>  arch/arm64/mm/numa.c | 5 +++--
>>>>>>>>>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>>>>>>>>>>>> index aafcee3e3f7e..7eeb31740248 100644
>>>>>>>>>>>>> --- a/arch/arm64/mm/numa.c
>>>>>>>>>>>>> +++ b/arch/arm64/mm/numa.c
>>>>>>>>>>>>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
>>>>>>>>>>>>>
>>>>>>>>>>>>>  /*
>>>>>>>>>>>>>   * We hope that we will be hotplugging memory on nodes we already know about,
>>>>>>>>>>>>> - * such that acpi_get_node() succeeds and we never fall back to this...
>>>>>>>>>>>>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
>>>>>>>>>>>>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
>>>>>>>>>>>>>   */
>>>>>>>>>>>>>  int memory_add_physaddr_to_nid(u64 addr)
>>>>>>>>>>>>>  {
>>>>>>>>>>>>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
>>>>>>>>>>>>>   return 0;
>>>>>>>>>>>>>  }
>>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>>>>>>>>>>>>
>>>>>>>>>>>> Does it make sense to export a noop function? Wouldn't make more sense
>>>>>>>>>>>> to simply make it static inline somewhere in a header? I haven't checked
>>>>>>>>>>>> whether there is an easy way to do that sanely bu this just hit my eyes.
>>>
>>>> I'd be curious if what we are trying to optimize here is actually worth
>>>> optimizing. IOW, is there a well-known scenario where the dummy value on
>>>> arm64 would be problematic and is worth the effort?
>>>
>>> Well, it started with Michal's comment above that EXPORT_SYMBOL_GPL()
>>> for a stub might be an overkill.
>>>
>>> I think Jia's suggestion [1] with addition of a comment that explains
>>> why and when the stub will be used, can work for both
>>> memory_add_physaddr_to_nid() and phys_to_target_node().
>>
>> Agreed.
>>
>>>
>>> But on more theoretical/fundmanetal level, I think we lack a generic
>>> abstraction similar to e.g. x86 'struct numa_meminfo' that serves as
>>> translaton of firmware supplied information into data that can be used
>>> by the generic mm without need to reimplement it for each and every
>>> arch.
>>
>> Right. As I expressed, I am not a friend of using memblock for that, and
>> the pgdat node span is tricky.
>>
>> Maybe abstracting that x86 concept is possible in some way (and we could
>> restrict the information to boot-time properties, so we don't have to
>> mess with memory hot(un)plug - just as done for numa_meminfo AFAIKS).
> 
> I agree with pgdat part and disagree about memblock. It already has
> non-init physmap, why won't we add memblock.memory to the mix? ;-)

Can we generalize and tweak physmap to contain node info? That's all we
need, no? (the special mem= parameter handling should not matter for our
use case, where "physmap" and "memory" would differ)

> 
> Now, seriously, memblock already has all the necessary information about
> the coldplug memory for several architectures. x86 being an exception
> because for some reason the reserved memory is not considered memory
> there. The infrastructure for quiering and iterating memory regions is
> already there. We just need to leave out the irrelevant parts, like
> memblock.reserved and allocation funcions.

I *really* don't want to mess with memblocks on memory hot(un)plug on
x86 and s390x (+other architectures in the future). I also thought about
stopping to create memblocks for hotplugged memory on arm64, by tweaking
pfn_valid() to query memblocks only for early sections.

If "physmem" is not an option, can we at least introduce something like
ARCH_UPDTAE_MEMBLOCK_ON_HOTPLUG to avoid doing that on x86 and s390x for
now (and later maybe for others)?
Mike Rapoport July 8, 2020, 9:45 a.m. UTC | #33
On Wed, Jul 08, 2020 at 11:25:36AM +0200, David Hildenbrand wrote:
> On 08.07.20 11:15, Mike Rapoport wrote:
> >>>>>>
> >>> But on more theoretical/fundmanetal level, I think we lack a generic
> >>> abstraction similar to e.g. x86 'struct numa_meminfo' that serves as
> >>> translaton of firmware supplied information into data that can be used
> >>> by the generic mm without need to reimplement it for each and every
> >>> arch.
> >>
> >> Right. As I expressed, I am not a friend of using memblock for that, and
> >> the pgdat node span is tricky.
> >>
> >> Maybe abstracting that x86 concept is possible in some way (and we could
> >> restrict the information to boot-time properties, so we don't have to
> >> mess with memory hot(un)plug - just as done for numa_meminfo AFAIKS).
> > 
> > I agree with pgdat part and disagree about memblock. It already has
> > non-init physmap, why won't we add memblock.memory to the mix? ;-)
> 
> Can we generalize and tweak physmap to contain node info? That's all we
> need, no? (the special mem= parameter handling should not matter for our
> use case, where "physmap" and "memory" would differ)

TBH, I have only random vague thoughts at the moment. This might be an
option. But then we need to enable physmap on !s390, right?
 
> > Now, seriously, memblock already has all the necessary information about
> > the coldplug memory for several architectures. x86 being an exception
> > because for some reason the reserved memory is not considered memory
> > there. The infrastructure for quiering and iterating memory regions is
> > already there. We just need to leave out the irrelevant parts, like
> > memblock.reserved and allocation funcions.
> 
> I *really* don't want to mess with memblocks on memory hot(un)plug on
> x86 and s390x (+other architectures in the future). I also thought about
> stopping to create memblocks for hotplugged memory on arm64, by tweaking
> pfn_valid() to query memblocks only for early sections.
>
> If "physmem" is not an option, can we at least introduce something like
> ARCH_UPDTAE_MEMBLOCK_ON_HOTPLUG to avoid doing that on x86 and s390x for
> now (and later maybe for others)?

I have to do more memory hotplug howework to answer that ;-)

My general point is that we don't have to reinvent the wheel to have
coldplug memory representation, it's already there. We just need a way
to use it properly.

> -- 
> Thanks,
> 
> David / dhildenb
>
David Hildenbrand July 8, 2020, 10:04 a.m. UTC | #34
On 08.07.20 11:45, Mike Rapoport wrote:
> On Wed, Jul 08, 2020 at 11:25:36AM +0200, David Hildenbrand wrote:
>> On 08.07.20 11:15, Mike Rapoport wrote:
>>>>>>>>
>>>>> But on more theoretical/fundmanetal level, I think we lack a generic
>>>>> abstraction similar to e.g. x86 'struct numa_meminfo' that serves as
>>>>> translaton of firmware supplied information into data that can be used
>>>>> by the generic mm without need to reimplement it for each and every
>>>>> arch.
>>>>
>>>> Right. As I expressed, I am not a friend of using memblock for that, and
>>>> the pgdat node span is tricky.
>>>>
>>>> Maybe abstracting that x86 concept is possible in some way (and we could
>>>> restrict the information to boot-time properties, so we don't have to
>>>> mess with memory hot(un)plug - just as done for numa_meminfo AFAIKS).
>>>
>>> I agree with pgdat part and disagree about memblock. It already has
>>> non-init physmap, why won't we add memblock.memory to the mix? ;-)
>>
>> Can we generalize and tweak physmap to contain node info? That's all we
>> need, no? (the special mem= parameter handling should not matter for our
>> use case, where "physmap" and "memory" would differ)
> 
> TBH, I have only random vague thoughts at the moment. This might be an
> option. But then we need to enable physmap on !s390, right?

Yes, looks like it.

>  
>>> Now, seriously, memblock already has all the necessary information about
>>> the coldplug memory for several architectures. x86 being an exception
>>> because for some reason the reserved memory is not considered memory
>>> there. The infrastructure for quiering and iterating memory regions is
>>> already there. We just need to leave out the irrelevant parts, like
>>> memblock.reserved and allocation funcions.
>>
>> I *really* don't want to mess with memblocks on memory hot(un)plug on
>> x86 and s390x (+other architectures in the future). I also thought about
>> stopping to create memblocks for hotplugged memory on arm64, by tweaking
>> pfn_valid() to query memblocks only for early sections.
>>
>> If "physmem" is not an option, can we at least introduce something like
>> ARCH_UPDTAE_MEMBLOCK_ON_HOTPLUG to avoid doing that on x86 and s390x for
>> now (and later maybe for others)?
> 
> I have to do more memory hotplug howework to answer that ;-)
> 
> My general point is that we don't have to reinvent the wheel to have
> coldplug memory representation, it's already there. We just need a way
> to use it properly.

Yes, I tend to agree. Details to be clarified :)
Dan Williams July 8, 2020, 3:50 p.m. UTC | #35
On Wed, Jul 8, 2020 at 3:04 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 08.07.20 11:45, Mike Rapoport wrote:
> > On Wed, Jul 08, 2020 at 11:25:36AM +0200, David Hildenbrand wrote:
> >> On 08.07.20 11:15, Mike Rapoport wrote:
> >>>>>>>>
> >>>>> But on more theoretical/fundmanetal level, I think we lack a generic
> >>>>> abstraction similar to e.g. x86 'struct numa_meminfo' that serves as
> >>>>> translaton of firmware supplied information into data that can be used
> >>>>> by the generic mm without need to reimplement it for each and every
> >>>>> arch.
> >>>>
> >>>> Right. As I expressed, I am not a friend of using memblock for that, and
> >>>> the pgdat node span is tricky.
> >>>>
> >>>> Maybe abstracting that x86 concept is possible in some way (and we could
> >>>> restrict the information to boot-time properties, so we don't have to
> >>>> mess with memory hot(un)plug - just as done for numa_meminfo AFAIKS).
> >>>
> >>> I agree with pgdat part and disagree about memblock. It already has
> >>> non-init physmap, why won't we add memblock.memory to the mix? ;-)
> >>
> >> Can we generalize and tweak physmap to contain node info? That's all we
> >> need, no? (the special mem= parameter handling should not matter for our
> >> use case, where "physmap" and "memory" would differ)
> >
> > TBH, I have only random vague thoughts at the moment. This might be an
> > option. But then we need to enable physmap on !s390, right?
>
> Yes, looks like it.
>
> >
> >>> Now, seriously, memblock already has all the necessary information about
> >>> the coldplug memory for several architectures. x86 being an exception
> >>> because for some reason the reserved memory is not considered memory
> >>> there. The infrastructure for quiering and iterating memory regions is
> >>> already there. We just need to leave out the irrelevant parts, like
> >>> memblock.reserved and allocation funcions.
> >>
> >> I *really* don't want to mess with memblocks on memory hot(un)plug on
> >> x86 and s390x (+other architectures in the future). I also thought about
> >> stopping to create memblocks for hotplugged memory on arm64, by tweaking
> >> pfn_valid() to query memblocks only for early sections.
> >>
> >> If "physmem" is not an option, can we at least introduce something like
> >> ARCH_UPDTAE_MEMBLOCK_ON_HOTPLUG to avoid doing that on x86 and s390x for
> >> now (and later maybe for others)?
> >
> > I have to do more memory hotplug howework to answer that ;-)
> >
> > My general point is that we don't have to reinvent the wheel to have
> > coldplug memory representation, it's already there. We just need a way
> > to use it properly.
>
> Yes, I tend to agree. Details to be clarified :)

I'm not quite understanding the concern, or requirement about
"updating memblock" in the hotplug path. The routines
memory_add_physaddr_to_nid() and phys_to_target_node() are helpers to
interrogate platform-firmware numa info through a common abstraction.
They place no burden on the memory hotplug code they're just used to
see if a hot-added range lies within an existing node span when
platform-firmware otherwise fails to communicate a node. x86 can
continue to back those helpers with numa_meminfo, arm64 can use a
generic memblock implementation and other archs can follow the arm64
example if they want better numa answers for drivers.
David Hildenbrand July 8, 2020, 4:10 p.m. UTC | #36
On 08.07.20 17:50, Dan Williams wrote:
> On Wed, Jul 8, 2020 at 3:04 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 08.07.20 11:45, Mike Rapoport wrote:
>>> On Wed, Jul 08, 2020 at 11:25:36AM +0200, David Hildenbrand wrote:
>>>> On 08.07.20 11:15, Mike Rapoport wrote:
>>>>>>>>>>
>>>>>>> But on more theoretical/fundmanetal level, I think we lack a generic
>>>>>>> abstraction similar to e.g. x86 'struct numa_meminfo' that serves as
>>>>>>> translaton of firmware supplied information into data that can be used
>>>>>>> by the generic mm without need to reimplement it for each and every
>>>>>>> arch.
>>>>>>
>>>>>> Right. As I expressed, I am not a friend of using memblock for that, and
>>>>>> the pgdat node span is tricky.
>>>>>>
>>>>>> Maybe abstracting that x86 concept is possible in some way (and we could
>>>>>> restrict the information to boot-time properties, so we don't have to
>>>>>> mess with memory hot(un)plug - just as done for numa_meminfo AFAIKS).
>>>>>
>>>>> I agree with pgdat part and disagree about memblock. It already has
>>>>> non-init physmap, why won't we add memblock.memory to the mix? ;-)
>>>>
>>>> Can we generalize and tweak physmap to contain node info? That's all we
>>>> need, no? (the special mem= parameter handling should not matter for our
>>>> use case, where "physmap" and "memory" would differ)
>>>
>>> TBH, I have only random vague thoughts at the moment. This might be an
>>> option. But then we need to enable physmap on !s390, right?
>>
>> Yes, looks like it.
>>
>>>
>>>>> Now, seriously, memblock already has all the necessary information about
>>>>> the coldplug memory for several architectures. x86 being an exception
>>>>> because for some reason the reserved memory is not considered memory
>>>>> there. The infrastructure for quiering and iterating memory regions is
>>>>> already there. We just need to leave out the irrelevant parts, like
>>>>> memblock.reserved and allocation funcions.
>>>>
>>>> I *really* don't want to mess with memblocks on memory hot(un)plug on
>>>> x86 and s390x (+other architectures in the future). I also thought about
>>>> stopping to create memblocks for hotplugged memory on arm64, by tweaking
>>>> pfn_valid() to query memblocks only for early sections.
>>>>
>>>> If "physmem" is not an option, can we at least introduce something like
>>>> ARCH_UPDTAE_MEMBLOCK_ON_HOTPLUG to avoid doing that on x86 and s390x for
>>>> now (and later maybe for others)?
>>>
>>> I have to do more memory hotplug howework to answer that ;-)
>>>
>>> My general point is that we don't have to reinvent the wheel to have
>>> coldplug memory representation, it's already there. We just need a way
>>> to use it properly.
>>
>> Yes, I tend to agree. Details to be clarified :)
> 
> I'm not quite understanding the concern, or requirement about
> "updating memblock" in the hotplug path. The routines
> memory_add_physaddr_to_nid() and phys_to_target_node() are helpers to
> interrogate platform-firmware numa info through a common abstraction.
> They place no burden on the memory hotplug code they're just used to
> see if a hot-added range lies within an existing node span when
> platform-firmware otherwise fails to communicate a node. x86 can
> continue to back those helpers with numa_meminfo, arm64 can use a
> generic memblock implementation and other archs can follow the arm64
> example if they want better numa answers for drivers.
> 

See memblock_add_node()/memblock_remove() in mm/memory_hotplug.c. I
don't want that code be reactivated for x86/s390x. That's all I am saying.
Mike Rapoport July 8, 2020, 4:47 p.m. UTC | #37
On Wed, Jul 08, 2020 at 06:10:19PM +0200, David Hildenbrand wrote:
> On 08.07.20 17:50, Dan Williams wrote:
> > On Wed, Jul 8, 2020 at 3:04 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> > 
> > I'm not quite understanding the concern, or requirement about
> > "updating memblock" in the hotplug path. The routines
> > memory_add_physaddr_to_nid() and phys_to_target_node() are helpers to
> > interrogate platform-firmware numa info through a common abstraction.
> > They place no burden on the memory hotplug code they're just used to
> > see if a hot-added range lies within an existing node span when
> > platform-firmware otherwise fails to communicate a node. x86 can
> > continue to back those helpers with numa_meminfo, arm64 can use a
> > generic memblock implementation and other archs can follow the arm64
> > example if they want better numa answers for drivers.
> > 
> 
> See memblock_add_node()/memblock_remove() in mm/memory_hotplug.c. I
> don't want that code be reactivated for x86/s390x. That's all I am saying.

And these have actual meaning only on arm64 because powerpc does not
rely on memblock for memory hot(un)plug, AFAIU.

Anyway, at the moment we can use memblock on hotplug path only on arm64
and I don't think its the path worth exploring.

> -- 
> Thanks,
> 
> David / dhildenb
>
diff mbox series

Patch

diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index aafcee3e3f7e..7eeb31740248 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -464,10 +464,11 @@  void __init arm64_numa_init(void)
 
 /*
  * We hope that we will be hotplugging memory on nodes we already know about,
- * such that acpi_get_node() succeeds and we never fall back to this...
+ * such that acpi_get_node() succeeds. But when SRAT is not present, the node
+ * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
  */
 int memory_add_physaddr_to_nid(u64 addr)
 {
-	pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
 	return 0;
 }
+EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);