
[PATCHv3,1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr

Message ID 1545966002-3075-2-git-send-email-kernelfans@gmail.com (mailing list archive)
State New, archived
Series: mm/memblock: reuse memblock bottom-up allocation style

Commit Message

Pingfan Liu Dec. 28, 2018, 3 a.m. UTC
The bottom-up allocation style was introduced to cope with movable_node,
where the lower limit of allocation starts from the kernel's end, due to
the lack of memory hotplug info at that early time. But if the hotplug
info has been obtained later, the lower limit can be extended to 0.
'kexec -c' prefers to reuse this style to allocate memory at a lower
address, since if the reserved region is beyond 4G, extra memory
(16M by default) is required for swiotlb.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Daniel Vacek <neelx@redhat.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Stefan Agner <stefan@agner.ch>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: yinghai@kernel.org
Cc: vgoyal@redhat.com
Cc: linux-kernel@vger.kernel.org
---
 drivers/acpi/numa.c      |  4 ++++
 include/linux/memblock.h |  1 +
 mm/memblock.c            | 58 +++++++++++++++++++++++++++++-------------------
 3 files changed, 40 insertions(+), 23 deletions(-)

Comments

Mike Rapoport Dec. 31, 2018, 8:40 a.m. UTC | #1
On Fri, Dec 28, 2018 at 11:00:01AM +0800, Pingfan Liu wrote:
> The bottom-up allocation style was introduced to cope with movable_node,
> where the lower limit of allocation starts from the kernel's end, due to
> the lack of memory hotplug info at that early time. But if the hotplug
> info has been obtained later, the lower limit can be extended to 0.
> 'kexec -c' prefers to reuse this style to allocate memory at a lower
> address, since if the reserved region is beyond 4G, extra memory
> (16M by default) is required for swiotlb.

I fail to understand why the availability of memory hotplug information
would allow extending the lower limit of bottom-up memblock allocations
below the kernel. The memory in the physical range [0, kernel_start) can be
allocated as soon as the kernel memory is reserved.

The extents of the memory node hosting the kernel image can be used to
limit memblock allocations from that particular node, even in top-down mode.
 
> [...]
> 
> diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
> index 2746994..3eea4e4 100644
> --- a/drivers/acpi/numa.c
> +++ b/drivers/acpi/numa.c
> @@ -462,6 +462,10 @@ int __init acpi_numa_init(void)
> 
>  		cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
>  					    acpi_parse_memory_affinity, 0);
> +
> +#if defined(CONFIG_X86) || defined(CONFIG_ARM64)
> +		mark_mem_hotplug_parsed();
> +#endif
>  	}
> 
>  	/* SLIT: System Locality Information Table */
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index aee299a..d89ed9e 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -125,6 +125,7 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
>  void memblock_trim_memory(phys_addr_t align);
>  bool memblock_overlaps_region(struct memblock_type *type,
>  			      phys_addr_t base, phys_addr_t size);
> +void mark_mem_hotplug_parsed(void);
>  int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
>  int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
>  int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 81ae63c..a3f5e46 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -231,6 +231,12 @@ __memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
>  	return 0;
>  }
> 
> +static bool mem_hotmovable_parsed __initdata_memblock;
> +void __init_memblock mark_mem_hotplug_parsed(void)
> +{
> +	mem_hotmovable_parsed = true;
> +}
> +
>  /**
>   * memblock_find_in_range_node - find free area in given range and node
>   * @size: size of free area to find
> @@ -259,7 +265,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
>  					phys_addr_t end, int nid,
>  					enum memblock_flags flags)
>  {
> -	phys_addr_t kernel_end, ret;
> +	phys_addr_t kernel_end, ret = 0;
> 
>  	/* pump up @end */
>  	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
> @@ -270,34 +276,40 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
>  	end = max(start, end);
>  	kernel_end = __pa_symbol(_end);
> 
> -	/*
> -	 * try bottom-up allocation only when bottom-up mode
> -	 * is set and @end is above the kernel image.
> -	 */
> -	if (memblock_bottom_up() && end > kernel_end) {
> -		phys_addr_t bottom_up_start;
> +	if (memblock_bottom_up()) {
> +		phys_addr_t bottom_up_start = start;
> 
> -		/* make sure we will allocate above the kernel */
> -		bottom_up_start = max(start, kernel_end);
> -
> -		/* ok, try bottom-up allocation first */
> -		ret = __memblock_find_range_bottom_up(bottom_up_start, end,
> -						      size, align, nid, flags);
> -		if (ret)
> +		if (mem_hotmovable_parsed) {
> +			ret = __memblock_find_range_bottom_up(
> +				bottom_up_start, end, size, align, nid,
> +				flags);
>  			return ret;
> 
>  		/*
> -		 * we always limit bottom-up allocation above the kernel,
> -		 * but top-down allocation doesn't have the limit, so
> -		 * retrying top-down allocation may succeed when bottom-up
> -		 * allocation failed.
> -		 *
> -		 * bottom-up allocation is expected to be fail very rarely,
> -		 * so we use WARN_ONCE() here to see the stack trace if
> -		 * fail happens.
> +		 * if mem hotplug info is not parsed yet, try bottom-up
> +		 * allocation with @end above the kernel image.
>  		 */
> -		WARN_ONCE(IS_ENABLED(CONFIG_MEMORY_HOTREMOVE),
> +		} else if (!mem_hotmovable_parsed && end > kernel_end) {
> +			/* make sure we will allocate above the kernel */
> +			bottom_up_start = max(start, kernel_end);
> +			ret = __memblock_find_range_bottom_up(
> +				bottom_up_start, end, size, align, nid,
> +				flags);
> +			if (ret)
> +				return ret;
> +			/*
> +			 * we always limit bottom-up allocation above the
> +			 * kernel, but top-down allocation doesn't have
> +			 * the limit, so retrying top-down allocation may
> +			 * succeed when bottom-up allocation failed.
> +			 *
> +			 * bottom-up allocation is expected to be fail
> +			 * very rarely, so we use WARN_ONCE() here to see
> +			 * the stack trace if fail happens.
> +			 */
> +			WARN_ONCE(IS_ENABLED(CONFIG_MEMORY_HOTREMOVE),
>  			  "memblock: bottom-up allocation failed, memory hotremove may be affected\n");
> +		}
>  	}
> 
>  	return __memblock_find_range_top_down(start, end, size, align, nid,
> -- 
> 2.7.4
>
Pingfan Liu Jan. 2, 2019, 6:47 a.m. UTC | #2
On Mon, Dec 31, 2018 at 4:40 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Fri, Dec 28, 2018 at 11:00:01AM +0800, Pingfan Liu wrote:
> > [...]
>
> I fail to understand why the availability of memory hotplug information
> would allow extending the lower limit of bottom-up memblock allocations
> below the kernel. The memory in the physical range [0, kernel_start) can be
> allocated as soon as the kernel memory is reserved.
>
Yes, [0, kernel_start) can be allocated at this time by some functions,
e.g. memblock_reserve(). But there is a trick. Functions like
memblock_find_in_range() do hotplug attribute checking: they
check the hotplug attribute in __next_mem_range():

	if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
		continue;

so the movable memory can be safely skipped.
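
For reference, that check sits inside the memblock range iterator. A
simplified excerpt in the spirit of __next_mem_range() in mm/memblock.c
of that era (most of the iterator machinery is elided here; see the real
source for the full logic):

	/* Simplified excerpt modeled on __next_mem_range() in
	 * mm/memblock.c (circa v4.20); the range-intersection
	 * machinery is elided.
	 */
	for (; idx_a < type_a->cnt; idx_a++) {
		struct memblock_region *m = &type_a->regions[idx_a];

		/* only memory regions are associated with nodes, check it */
		if (nid != NUMA_NO_NODE && nid != memblock_get_region_node(m))
			continue;

		/* skip hotpluggable memory regions if needed */
		if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
			continue;

		/* ... intersect @m with the free ranges and yield ... */
	}

Note that the skip only does something once memblock_mark_hotplug() has
been called for the movable regions, i.e. after the SRAT has been parsed.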

Thanks for your kind review.

Regards,
Pingfan

> The extents of the memory node hosting the kernel image can be used to
> limit memblock allocations from that particular node, even in top-down mode.
>
> > [...]
Mike Rapoport Jan. 2, 2019, 9:27 a.m. UTC | #3
On Wed, Jan 02, 2019 at 02:47:34PM +0800, Pingfan Liu wrote:
> On Mon, Dec 31, 2018 at 4:40 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> >
> > On Fri, Dec 28, 2018 at 11:00:01AM +0800, Pingfan Liu wrote:
> > > [...]
> >
> > I fail to understand why the availability of memory hotplug information
> > would allow extending the lower limit of bottom-up memblock allocations
> > below the kernel. The memory in the physical range [0, kernel_start) can be
> > allocated as soon as the kernel memory is reserved.
> >
> Yes, [0, kernel_start) can be allocated at this time by some functions,
> e.g. memblock_reserve(). But there is a trick. Functions like
> memblock_find_in_range() do hotplug attribute checking: they
> check the hotplug attribute in __next_mem_range():
>
> 	if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
> 		continue;
>
> so the movable memory can be safely skipped.

I still don't see the connection between allocating memory below
kernel_start and the hotplug info.

The check for 'end > kernel_end' in

	if (memblock_bottom_up() && end > kernel_end)

does not protect against allocation in a hotpluggable area.
If memblock_find_in_range() is called before hotplug info is parsed, it
can return a range in a hotpluggable area.

The point I'd like to clarify is why allocating memory in the range [0,
kernel_start) cannot be done before hotplug info is available, and why it
is safe to allocate that memory afterwards.
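
To make the timing hazard concrete, here is a minimal userspace model
(purely hypothetical regions and flags, not kernel code): until the
hotplug info is parsed, no region carries the hotpluggable flag, so a
bottom-up search cannot avoid memory that will later turn out to be
movable.

	#include <stdbool.h>
	#include <stdio.h>

	/* Toy model: the skip-hotpluggable check is a no-op until the
	 * (SRAT) hotplug info has been parsed and the flags set.
	 */
	struct region { unsigned long base, size; bool hotpluggable; };

	static struct region mem[] = {
		{ 0x00100000UL, 0x00100000UL, false }, /* node0: only 1M free */
		{ 0x40000000UL, 0x10000000UL, false }, /* node1: movable later */
	};

	static unsigned long find_bottom_up(unsigned long size)
	{
		for (unsigned int i = 0; i < 2; i++) {
			if (mem[i].hotpluggable) /* only skips once flag is set */
				continue;
			if (mem[i].size >= size)
				return mem[i].base;
		}
		return 0;
	}

	int main(void)
	{
		/* before parsing: the search lands in node1's movable memory */
		printf("pre-parse:  %#lx\n", find_bottom_up(0x200000UL));
		/* after parsing: node1 is marked hotpluggable and skipped */
		mem[1].hotpluggable = true;
		printf("post-parse: %#lx\n", find_bottom_up(0x200000UL));
		return 0;
	}

Run as-is this prints 0x40000000 before the flag is set, and 0 (i.e. the
caller must fall back to top-down) afterwards.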

> Thanks for your kind review.
> 
> Regards,
> Pingfan
> 
> > The extents of the memory node hosting the kernel image can be used to
> > limit memblock allocations from that particular node, even in top-down mode.
> >
> > > [...]
Baoquan He Jan. 2, 2019, 10:18 a.m. UTC | #4
On 01/02/19 at 11:27am, Mike Rapoport wrote:
> On Wed, Jan 02, 2019 at 02:47:34PM +0800, Pingfan Liu wrote:
> > On Mon, Dec 31, 2018 at 4:40 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> > >
> > > On Fri, Dec 28, 2018 at 11:00:01AM +0800, Pingfan Liu wrote:
> > > > [...]
> > >
> > > I fail to understand why the availability of memory hotplug information
> > > would allow extending the lower limit of bottom-up memblock allocations
> > > below the kernel. The memory in the physical range [0, kernel_start) can be
> > > allocated as soon as the kernel memory is reserved.
> > >
> > Yes, [0, kernel_start) can be allocated at this time by some functions,
> > e.g. memblock_reserve(). But there is a trick. Functions like
> > memblock_find_in_range() do hotplug attribute checking: they
> > check the hotplug attribute in __next_mem_range():
> >
> > 	if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
> > 		continue;
> >
> > so the movable memory can be safely skipped.
> 
> I still don't see the connection between allocating memory below
> kernel_start and the hotplug info.
>
> The check for 'end > kernel_end' in
>
> 	if (memblock_bottom_up() && end > kernel_end)
>
> does not protect against allocation in a hotpluggable area.
> If memblock_find_in_range() is called before hotplug info is parsed, it
> can return a range in a hotpluggable area.
>
> The point I'd like to clarify is why allocating memory in the range [0,
> kernel_start) cannot be done before hotplug info is available, and why it
> is safe to allocate that memory afterwards.

Well, I think that's because we have KASLR. Before KASLR was introduced,
the kernel was put at a low, fixed physical address. Allocating memblock
bottom-up after the kernel makes sure that kernel data ends up in the
same node as the kernel text itself before SRAT is parsed. Meanwhile,
[0, kernel_start) is a very small range; on x86 it was 16 MB, which can
easily be used up.

But now, with KASLR enabled by default, this bottom-up-after-kernel-text
allocation has a potential issue. E.g. if we have node0 (including a
normal zone) and node1 (including a movable zone), and KASLR puts the
kernel at the top of node0, the next memblock allocation before SRAT is
parsed will stamp into the movable zone of node1, and consequently
hotplug no longer works well. I had considered this issue previously,
but haven't thought of a way to fix it.
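
A toy calculation of that scenario (all addresses made up, not taken
from a real machine):

	#include <stdio.h>

	/* node0 is [0, 64G) and normal; node1 is [64G, 128G) and movable.
	 * If KASLR lands the kernel so that kernel_end sits 16M below the
	 * node boundary, a pre-SRAT bottom-up allocation starting at
	 * kernel_end can spill across into node1.
	 */
	int main(void)
	{
		unsigned long long node0_end  = 64ULL << 30;           /* 64G */
		unsigned long long kernel_end = node0_end - (16ULL << 20);
		unsigned long long alloc_size = 32ULL << 20;           /* 32M */

		/* bottom-up start as computed in memblock_find_in_range_node() */
		unsigned long long bottom_up_start = kernel_end;

		if (bottom_up_start + alloc_size > node0_end)
			printf("spills %llu MiB into node1's movable zone\n",
			       (bottom_up_start + alloc_size - node0_end) >> 20);
		return 0;
	}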

That said, it's not related to this patch. As for this patchset, I
didn't check it carefully in the v2 post, and acked it. In fact the
current way is not good; Pingfan should call
__memblock_find_range_bottom_up() directly for crashkernel reserving.
Reasons are:
1) SRAT parsing is done, so the system has restored the top-down way of
doing memblock allocations.
2) we do need to find a range bottom-up if the user specifies
crashkernel=xxM (without an explicit base address).
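
A hedged sketch of what that suggestion could look like (hypothetical
helper, not the actual patch; __memblock_find_range_bottom_up() is
static in mm/memblock.c at this point, so it would first have to be
exposed):

	/* Hypothetical sketch of the suggestion above: search bottom-up
	 * for the crashkernel region directly, instead of flipping the
	 * global memblock allocation direction. Assumes
	 * __memblock_find_range_bottom_up() has been made callable here.
	 */
	static phys_addr_t __init reserve_crashkernel_bottom_up(phys_addr_t size,
								phys_addr_t align)
	{
		phys_addr_t base;

		/* prefer a range below 4G so no extra ~16M swiotlb
		 * reservation is needed (see the commit message) */
		base = __memblock_find_range_bottom_up(0, SZ_4G, size, align,
						       NUMA_NO_NODE,
						       MEMBLOCK_NONE);
		if (base && !memblock_reserve(base, size))
			return base;
		return 0;
	}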

Thanks
Baoquan

> 
> > Thanks for your kind review.
> > 
> > Regards,
> > Pingfan
> > 
> > > The extents of the memory node hosting the kernel image can be used to
> > > limit memblock allocations from that particular node, even in top-down mode.
> > >
> > > > [...]
Mike Rapoport Jan. 2, 2019, 5:05 p.m. UTC | #5
(added Tejun)

On Wed, Jan 02, 2019 at 06:18:04PM +0800, Baoquan He wrote:
> On 01/02/19 at 11:27am, Mike Rapoport wrote:
> > On Wed, Jan 02, 2019 at 02:47:34PM +0800, Pingfan Liu wrote:
> > > On Mon, Dec 31, 2018 at 4:40 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> > > >
> > > > On Fri, Dec 28, 2018 at 11:00:01AM +0800, Pingfan Liu wrote:
> > > > > [...]
> > > >
> > > > I fail to understand why the availability of memory hotplug information
> > > > would allow extending the lower limit of bottom-up memblock allocations
> > > > below the kernel. The memory in the physical range [0, kernel_start) can be
> > > > allocated as soon as the kernel memory is reserved.
> > > >
> > > Yes, [0, kernel_start) can be allocated at this time by some functions,
> > > e.g. memblock_reserve(). But there is a trick. Functions like
> > > memblock_find_in_range() do hotplug attribute checking: they
> > > check the hotplug attribute in __next_mem_range():
> > >
> > > 	if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
> > > 		continue;
> > >
> > > so the movable memory can be safely skipped.
> > 
> > I still don't see the connection between allocating memory below
> > kernel_start and the hotplug info.
> >
> > The check for 'end > kernel_end' in
> >
> > 	if (memblock_bottom_up() && end > kernel_end)
> >
> > does not protect against allocation in a hotpluggable area.
> > If memblock_find_in_range() is called before hotplug info is parsed, it
> > can return a range in a hotpluggable area.
> >
> > The point I'd like to clarify is why allocating memory in the range [0,
> > kernel_start) cannot be done before hotplug info is available, and why it
> > is safe to allocate that memory afterwards.
> 
> Well, I think that's because we have KASLR. Before KASLR was introduced,
> the kernel was put at a low, fixed physical address. Allocating memblock
> bottom-up after the kernel makes sure that kernel data ends up in the
> same node as the kernel text itself before SRAT is parsed. Meanwhile,
> [0, kernel_start) is a very small range; on x86 it was 16 MB, which can
> easily be used up.
>
> But now, with KASLR enabled by default, this bottom-up-after-kernel-text
> allocation has a potential issue. E.g. if we have node0 (including a
> normal zone) and node1 (including a movable zone), and KASLR puts the
> kernel at the top of node0, the next memblock allocation before SRAT is
> parsed will stamp into the movable zone of node1, and consequently
> hotplug no longer works well. I had considered this issue previously,
> but haven't thought of a way to fix it.
 
I agree that currently the bottom-up allocation after the kernel text has
issues with KASLR. But these issues are not necessarily related to memory
hotplug. Even with a single memory node, a bottom-up allocation will fail
if KASLR puts the kernel near the end of node0.

What I am trying to understand is whether there is a fundamental reason to
prevent allocations from [0, kernel_start)?

Maybe Tejun can recall why he suggested starting bottom-up allocations
from kernel_end.
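
For orientation, the restriction under discussion is this single clamp in
memblock_find_in_range_node(), from the pre-patch code quoted above:

	/* make sure we will allocate above the kernel */
	bottom_up_start = max(start, kernel_end);

and the question is whether the max() against kernel_end is really needed
once the kernel image itself has been reserved in memblock.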

> That said, it's not related to this patch. As for this patchset, I
> didn't check it carefully in the v2 post, and acked it. In fact the
> current way is not good; Pingfan should call
> __memblock_find_range_bottom_up() directly for crashkernel reserving.
> Reasons are:
> 1) SRAT parsing is done, so the system has restored the top-down way of
> doing memblock allocations.
> 2) we do need to find a range bottom-up if the user specifies
> crashkernel=xxM (without an explicit base address).
> 
> Thanks
> Baoquan
> 
> > 
> > > Thanks for your kind review.
> > > 
> > > Regards,
> > > Pingfan
> > > 
> > > > The extents of the memory node hosting the kernel image can be used to
> > > > limit memblock allocations from that particular node, even in top-down mode.
> > > >
> > > > > [...]
Tejun Heo Jan. 3, 2019, 6:47 p.m. UTC | #6
Hello,

On Wed, Jan 02, 2019 at 07:05:38PM +0200, Mike Rapoport wrote:
> I agree that currently the bottom-up allocation after the kernel text has
> issues with KASLR. But these issues are not necessarily related to memory
> hotplug. Even with a single memory node, a bottom-up allocation will fail
> if KASLR puts the kernel near the end of node0.
>
> What I am trying to understand is whether there is a fundamental reason to
> prevent allocations from [0, kernel_start)?
>
> Maybe Tejun can recall why he suggested starting bottom-up allocations
> from kernel_end.

That's from 79442ed189ac ("mm/memblock.c: introduce bottom-up
allocation mode"). I wasn't involved in that patch, so I have no idea
why the restrictions were added, but FWIW they don't seem necessary to me.

Thanks.
Pingfan Liu Jan. 4, 2019, 5:59 a.m. UTC | #7
On Wed, Jan 2, 2019 at 6:18 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 01/02/19 at 11:27am, Mike Rapoport wrote:
> > On Wed, Jan 02, 2019 at 02:47:34PM +0800, Pingfan Liu wrote:
> > > On Mon, Dec 31, 2018 at 4:40 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> > > >
> > > > On Fri, Dec 28, 2018 at 11:00:01AM +0800, Pingfan Liu wrote:
> > > > > [...]
> > > >
> > > > I fail to understand why the availability of memory hotplug information
> > > > would allow extending the lower limit of bottom-up memblock allocations
> > > > below the kernel. The memory in the physical range [0, kernel_start) can be
> > > > allocated as soon as the kernel memory is reserved.
> > > >
> > > Yes, [0, kernel_start) can be allocated at this time by some functions,
> > > e.g. memblock_reserve(). But there is a trick. Functions like
> > > memblock_find_in_range() do hotplug attribute checking: they
> > > check the hotplug attribute in __next_mem_range():
> > >
> > > 	if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
> > > 		continue;
> > >
> > > so the movable memory can be safely skipped.
> >
> > I still don't see the connection between allocating memory below
> > kernel_start and the hotplug info.
> >
> > The check for 'end > kernel_end' in
> >
> > 	if (memblock_bottom_up() && end > kernel_end)
> >
> > does not protect against allocation in a hotpluggable area.
> > If memblock_find_in_range() is called before hotplug info is parsed, it
> > can return a range in a hotpluggable area.
> >
> > The point I'd like to clarify is why allocating memory in the range [0,
> > kernel_start) cannot be done before hotplug info is available, and why it
> > is safe to allocate that memory afterwards.
>
> Well, I think that's because we have KASLR. Before KASLR was introduced,
> the kernel was put at a low, fixed physical address. Allocating memblock
> bottom-up after the kernel makes sure that kernel data ends up in the
> same node as the kernel text itself before SRAT is parsed. Meanwhile,
> [0, kernel_start) is a very small range; on x86 it was 16 MB, which can
> easily be used up.
>
> But now, with KASLR enabled by default, this bottom-up-after-kernel-text
> allocation has a potential issue. E.g. if we have node0 (including a
> normal zone) and node1 (including a movable zone), and KASLR puts the
> kernel at the top of node0, the next memblock allocation before SRAT is
> parsed will stamp into the movable zone of node1, and consequently
> hotplug no longer works well. I had considered this issue previously,
> but haven't thought of a way to fix it.
>
> That said, it's not related to this patch. As for this patchset, I
> didn't check it carefully in the v2 post, and acked it. In fact the
> current way is not good; Pingfan should call
> __memblock_find_range_bottom_up() directly
> for crashkernel reserving. Reasons are:

Good suggestion, thanks. I will send out V4.

Regards,
Pingfan
> 1) SRAT parsing is done, so the system has restored the top-down way of
> doing memblock allocations.
> 2) we do need to find a range bottom-up if the user specifies
> crashkernel=xxM (without an explicit base address).
>
> Thanks
> Baoquan
>
> >
> > > Thanks for your kind review.
> > >
> > > Regards,
> > > Pingfan
> > >
> > > > The extents of the memory node hosting the kernel image can be used to
> > > > limit memblock allocations from that particular node, even in top-down mode.
> > > >
> > > > > [...]
Mike Rapoport Jan. 4, 2019, 3:09 p.m. UTC | #8
On Thu, Jan 03, 2019 at 10:47:06AM -0800, Tejun Heo wrote:
> Hello,
> 
> On Wed, Jan 02, 2019 at 07:05:38PM +0200, Mike Rapoport wrote:
> > I agree that currently the bottom-up allocation after the kernel text has
> > issues with KASLR. But these issues are not necessarily related to
> > memory hotplug. Even with a single memory node, a bottom-up allocation will
> > fail if KASLR puts the kernel near the end of node0.
> > 
> > What I am trying to understand is whether there is a fundamental reason to
> > prevent allocations from [0, kernel_start).
> > 
> > Maybe Tejun can recall why he suggested starting bottom-up allocations from
> > kernel_end.
> 
> That's from 79442ed189ac ("mm/memblock.c: introduce bottom-up
> allocation mode").  I wasn't involved in that patch, so no idea why
> the restrictions were added, but FWIW it doesn't seem necessary to me.

I should have added the reference [1] in the first place :)
Thanks!

[1] https://lore.kernel.org/lkml/20130904192215.GG26609@mtj.dyndns.org/
 
> Thanks.
> 
> -- 
> tejun
>
Mike Rapoport Jan. 4, 2019, 4:20 p.m. UTC | #9
On Fri, Jan 04, 2019 at 01:59:46PM +0800, Pingfan Liu wrote:
> On Wed, Jan 2, 2019 at 6:18 PM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 01/02/19 at 11:27am, Mike Rapoport wrote:
> > > On Wed, Jan 02, 2019 at 02:47:34PM +0800, Pingfan Liu wrote:
> > > > On Mon, Dec 31, 2018 at 4:40 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> > > > >
> > > > > On Fri, Dec 28, 2018 at 11:00:01AM +0800, Pingfan Liu wrote:
> > > > > > The bottom-up allocation style is introduced to cope with movable_node,
> > > > > > where the limit inferior of allocation starts from kernel's end, due to
> > > > > > lack of knowledge of memory hotplug info at this early time. But if later,
> > > > > > hotplug info has been got, the limit inferior can be extend to 0.
> > > > > > 'kexec -c' prefers to reuse this style to alloc mem at lower address,
> > > > > > since if the reserved region is beyond 4G, then it requires extra mem
> > > > > > (default is 16M) for swiotlb.
> > > > >
> > > > > I fail to understand why the availability of memory hotplug information
> > > > > would allow to extend the lower limit of bottom-up memblock allocations
> > > > > below the kernel. The memory in the physical range [0, kernel_start) can be
> > > > > allocated as soon as the kernel memory is reserved.
> > > > >
> > > > Yes, [0, kernel_start) can be allocated at this time by some functions,
> > > > e.g. memblock_reserve(). But there is a trick: functions like
> > > > memblock_find_in_range() do hotplug attribute checking. They check the
> > > > hotmovable attribute in __next_mem_range():
> > > >
> > > >     if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
> > > >             continue;
> > > >
> > > > So the movable memory can be safely skipped.
> > >
> > > I still don't see the connection between allocating memory below
> > > kernel_start and the hotplug info.
> > >
> > > The check for 'end > kernel_end' in
> > >
> > >       if (memblock_bottom_up() && end > kernel_end)
> > >
> > > does not protect against allocation in a hotpluggable area.
> > > If memblock_find_in_range() is called before hotplug info is parsed, it can
> > > return a range in a hotpluggable area.
> > >
> > > The point I'd like to clarify is why allocating memory in the range [0,
> > > kernel_start) cannot be done before hotplug info is available, and why it is
> > > safe to allocate that memory afterwards.
> >
> > Well, I think that's because we have KASLR. Before KASLR was introduced,
> > the kernel was put at a low and fixed physical address. Allocating memblock
> > bottom-up after the kernel makes sure that kernel data ends up in the same
> > node as the kernel text itself before SRAT is parsed. Meanwhile,
> > [0, kernel_start) is a very small range, e.g. on x86 it was 16 MB, which
> > could easily be used up.
> >
> > But now, with KASLR enabled by default, this bottom-up allocation after the
> > kernel text has a potential issue. E.g. if we have node0 (including a normal
> > zone) and node1 (including a movable zone), and KASLR puts the kernel at the
> > top of node0, the next memblock allocation before SRAT is parsed will stomp
> > into the movable zone of node1, and consequently hotplug no longer works
> > well. I had considered this issue previously, but haven't thought of a way
> > to fix it.
> >
> > Anyway, that's not related to this patch. As for this patchset, I didn't
> > check it carefully in the v2 post, and acked it. In fact the current way is
> > not good; Pingfan should call __memblock_find_range_bottom_up() directly
> > for crashkernel reserving. The reasons are:
> 
> Good suggestion, thanks. I will send out V4.

I think we can simply remove the restriction of allocating above the kernel
in memblock_find_in_range_node().
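
Roughly something like the below (untested, just to illustrate the idea):

	if (memblock_bottom_up()) {
		/* search the whole [start, end), with no kernel_end clamp */
		ret = __memblock_find_range_bottom_up(start, end, size,
						      align, nid, flags);
		if (ret)
			return ret;
	}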
 
> Regards,
> Pingfan
> > 1) Once SRAT parsing is done, the system restores the top-down way of doing
> > memblock allocation.
> > 2) We do need to find the range bottom-up if the user specifies crashkernel=xxM
> > (without an explicit base address).
> >
> > Thanks
> > Baoquan
> >
> > >
> > > > Thanks for your kind review.
> > > >
> > > > Regards,
> > > > Pingfan
> > > >
> > > > > The extents of the memory node hosting the kernel image can be used to
> > > > > limit memblok allocations from that particular node, even in top-down mode.
> > > > >
> > > > > > [...]
> > > > >
> > > > > --
> > > > > Sincerely yours,
> > > > > Mike.
> > > > >
> > > >
> > >
> > > --
> > > Sincerely yours,
> > > Mike.
> > >
>
Baoquan He Jan. 5, 2019, 3:44 a.m. UTC | #10
On 01/04/19 at 05:09pm, Mike Rapoport wrote:
> On Thu, Jan 03, 2019 at 10:47:06AM -0800, Tejun Heo wrote:
> > Hello,
> > 
> > On Wed, Jan 02, 2019 at 07:05:38PM +0200, Mike Rapoport wrote:
> > > I agree that currently the bottom-up allocation after the kernel text has
> > > issues with KASLR. But these issues are not necessarily related to
> > > memory hotplug. Even with a single memory node, a bottom-up allocation will
> > > fail if KASLR puts the kernel near the end of node0.
> > > 
> > > What I am trying to understand is whether there is a fundamental reason to
> > > prevent allocations from [0, kernel_start).
> > > 
> > > Maybe Tejun can recall why he suggested starting bottom-up allocations from
> > > kernel_end.
> > 
> > That's from 79442ed189ac ("mm/memblock.c: introduce bottom-up
> > allocation mode").  I wasn't involved in that patch, so no idea why
> > the restrictions were added, but FWIW it doesn't seem necessary to me.
> 
> I should have added the reference [1] in the first place :)
> Thanks!
> 
> [1] https://lore.kernel.org/lkml/20130904192215.GG26609@mtj.dyndns.org/

From my understanding, we may not be able to discard the bottom-up
method in the current kernel. It's related to the hotplug feature when the
'movable_node' kernel parameter is specified. With 'movable_node', the
system relies on reading hotplug information from firmware; on x86 that's
the ACPI SRAT table. In the current system, we allocate memblock regions
top-down by default. However, before that hotplug information is retrieved,
there are several places that do memblock allocation, and top-down memblock
allocation would break the hotplug feature since it can place kernel data
in a movable zone, which is usually the last node on a bare metal system.

This bottom-up way is taken on many ARCHes, and it works well on systems
where KASLR is not enabled. Below is the search result in the current Linux
kernel; we can see that all ARCHes have this mechanism, except
arm/arm64. But for now only arm64/mips/x86 have KASLR.

W/o KASLR, allocating memblock regions above the kernel end when hotplug
info is not parsed yet looks very reasonable, since the kernel is usually
put at a low address, e.g. on x86 it's 16M. My thought is that we need to do
memblock allocation around the kernel before hotplug info is parsed. That
is, for a system w/o KASLR we keep the current bottom-up way; for a system
with KASLR we should allocate memblock regions top-down just below the
kernel start.
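
On x86, the distinction could perhaps be wired up via kaslr_enabled(); a
rough sketch (untested, and the current_limit trick is only my assumption):

	/* untested sketch: choose the early allocation window */
	if (!kaslr_enabled()) {
		/* kernel sits at a low, fixed address: keep bottom-up */
		memblock_set_bottom_up(true);
	} else {
		/* KASLR moved the kernel: stay top-down, but below it */
		memblock_set_current_limit(__pa_symbol(_text));
	}

The limit would of course have to be lifted again once SRAT has been parsed.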

This issue must break hotplug. It is only covered up because bare metal
systems currently need to add 'nokaslr' to disable KASLR while another bug
fix is still under discussion:

 [PATCH v14 0/5] x86/boot/KASLR: Parse ACPI table and limit KASLR to choosing immovable memory
lkml.kernel.org/r/20181214093013.13370-1-fanc.fnst@cn.fujitsu.com

[~ ]$ git grep memblock_set_bottom_up
arch/alpha/kernel/setup.c:      memblock_set_bottom_up(true);
arch/m68k/mm/motorola.c:        memblock_set_bottom_up(true);
arch/mips/kernel/setup.c:       memblock_set_bottom_up(true);
arch/mips/kernel/traps.c:       memblock_set_bottom_up(false);
arch/nds32/kernel/setup.c:      memblock_set_bottom_up(true);
arch/powerpc/kernel/paca.c:             memblock_set_bottom_up(true);
arch/powerpc/kernel/paca.c:             memblock_set_bottom_up(false);
arch/s390/kernel/setup.c:       memblock_set_bottom_up(true);
arch/s390/kernel/setup.c:       memblock_set_bottom_up(false);
arch/sparc/mm/init_32.c:        memblock_set_bottom_up(true);
arch/x86/kernel/setup.c:                memblock_set_bottom_up(true);
arch/x86/mm/numa.c:     memblock_set_bottom_up(false);
include/linux/memblock.h:static inline void __init memblock_set_bottom_up(bool enable)
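
For reference, the x86 hook above sits in setup_arch(), roughly like the
below (paraphrased from arch/x86/kernel/setup.c, so take the details as
approximate):

	/*
	 * Kernel pages cannot be migrated, so when 'movable_node' is in
	 * effect, switch memblock to bottom-up before early allocations
	 * happen, keeping them near the kernel image.
	 */
	if (movable_node_is_enabled())
		memblock_set_bottom_up(true);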
Mike Rapoport Jan. 6, 2019, 6:27 a.m. UTC | #11
On Sat, Jan 05, 2019 at 11:44:50AM +0800, Baoquan He wrote:
> On 01/04/19 at 05:09pm, Mike Rapoport wrote:
> > On Thu, Jan 03, 2019 at 10:47:06AM -0800, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Wed, Jan 02, 2019 at 07:05:38PM +0200, Mike Rapoport wrote:
> > > > I agree that currently the bottom-up allocation after the kernel text has
> > > > issues with KASLR. But these issues are not necessarily related to
> > > > memory hotplug. Even with a single memory node, a bottom-up allocation will
> > > > fail if KASLR puts the kernel near the end of node0.
> > > > 
> > > > What I am trying to understand is whether there is a fundamental reason to
> > > > prevent allocations from [0, kernel_start).
> > > > 
> > > > Maybe Tejun can recall why he suggested starting bottom-up allocations from
> > > > kernel_end.
> > > 
> > > That's from 79442ed189ac ("mm/memblock.c: introduce bottom-up
> > > allocation mode").  I wasn't involved in that patch, so no idea why
> > > the restrictions were added, but FWIW it doesn't seem necessary to me.
> > 
> > I should have added the reference [1] in the first place :)
> > Thanks!
> > 
> > [1] https://lore.kernel.org/lkml/20130904192215.GG26609@mtj.dyndns.org/
> 
> From my understanding, we may not be able to discard the bottom-up
> method in the current kernel. It's related to the hotplug feature when the
> 'movable_node' kernel parameter is specified. With 'movable_node', the
> system relies on reading hotplug information from firmware; on x86 that's
> the ACPI SRAT table. In the current system, we allocate memblock regions
> top-down by default. However, before that hotplug information is retrieved,
> there are several places that do memblock allocation, and top-down memblock
> allocation would break the hotplug feature since it can place kernel data
> in a movable zone, which is usually the last node on a bare metal system.

I do not suggest discarding the bottom-up method; I merely suggest allowing
it to use [0, kernel_start).
 
> > This bottom-up way is taken on many ARCHes, and it works well on systems
> > where KASLR is not enabled. Below is the search result in the current Linux
> > kernel; we can see that all ARCHes have this mechanism, except
> > arm/arm64. But for now only arm64/mips/x86 have KASLR.
> > 
> > W/o KASLR, allocating memblock regions above the kernel end when hotplug
> > info is not parsed yet looks very reasonable, since the kernel is usually
> > put at a low address, e.g. on x86 it's 16M. My thought is that we need to do
> > memblock allocation around the kernel before hotplug info is parsed. That
> > is, for a system w/o KASLR we keep the current bottom-up way; for a system
> > with KASLR we should allocate memblock regions top-down just below the
> > kernel start.

I completely agree. I was thinking about making
memblock_find_in_range_node() do something like:

if (memblock_bottom_up()) {
	bottom_up_start = max(start, kernel_end);

	ret = __memblock_find_range_bottom_up(bottom_up_start, end,
					      size, align, nid, flags);
	if (ret)
		return ret;

	bottom_up_start = max(start, 0);
	end = kernel_start;

	ret = __memblock_find_range_top_down(bottom_up_start, end,
					     size, align, nid, flags);
	if (ret)
		return ret;
}
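
(kernel_start here would presumably be __pa_symbol(_text), by analogy with
the existing kernel_end = __pa_symbol(_end); and if both attempts fail, we
still fall through to the final top-down search at the end of the function.)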

 
> This issue must break hotplug. It is only covered up because bare metal
> systems currently need to add 'nokaslr' to disable KASLR while another bug
> fix is still under discussion:
> 
>  [PATCH v14 0/5] x86/boot/KASLR: Parse ACPI table and limit KASLR to choosing immovable memory
> lkml.kernel.org/r/20181214093013.13370-1-fanc.fnst@cn.fujitsu.com
> 
> [~ ]$ git grep memblock_set_bottom_up
> arch/alpha/kernel/setup.c:      memblock_set_bottom_up(true);
> arch/m68k/mm/motorola.c:        memblock_set_bottom_up(true);
> arch/mips/kernel/setup.c:       memblock_set_bottom_up(true);
> arch/mips/kernel/traps.c:       memblock_set_bottom_up(false);
> arch/nds32/kernel/setup.c:      memblock_set_bottom_up(true);
> arch/powerpc/kernel/paca.c:             memblock_set_bottom_up(true);
> arch/powerpc/kernel/paca.c:             memblock_set_bottom_up(false);
> arch/s390/kernel/setup.c:       memblock_set_bottom_up(true);
> arch/s390/kernel/setup.c:       memblock_set_bottom_up(false);
> arch/sparc/mm/init_32.c:        memblock_set_bottom_up(true);
> arch/x86/kernel/setup.c:                memblock_set_bottom_up(true);
> arch/x86/mm/numa.c:     memblock_set_bottom_up(false);
> include/linux/memblock.h:static inline void __init memblock_set_bottom_up(bool enable)
>
Pingfan Liu Jan. 7, 2019, 8:37 a.m. UTC | #12
I sent out a series, "[RFC PATCH 0/4] x86_64/mm: remove bottom-up
allocation style by pushing forward the parsing of mem hotplug info"
(https://lore.kernel.org/lkml/1546849485-27933-1-git-send-email-kernelfans@gmail.com/T/#t).
Please comment if you are interested.

Thanks,
Pingfan

On Fri, Jan 4, 2019 at 2:47 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Jan 02, 2019 at 07:05:38PM +0200, Mike Rapoport wrote:
> > I agree that currently the bottom-up allocation after the kernel text has
> > issues with KASLR. But these issues are not necessarily related to
> > memory hotplug. Even with a single memory node, a bottom-up allocation will
> > fail if KASLR puts the kernel near the end of node0.
> >
> > What I am trying to understand is whether there is a fundamental reason to
> > prevent allocations from [0, kernel_start).
> >
> > Maybe Tejun can recall why he suggested starting bottom-up allocations from
> > kernel_end.
>
> That's from 79442ed189ac ("mm/memblock.c: introduce bottom-up
> allocation mode").  I wasn't involved in that patch, so no idea why
> the restrictions were added, but FWIW it doesn't seem necessary to me.
>
> Thanks.
>
> --
> tejun
Baoquan He Jan. 8, 2019, 8:50 a.m. UTC | #13
On 01/06/19 at 08:27am, Mike Rapoport wrote:
> I do not suggest discarding the bottom-up method; I merely suggest allowing
> it to use [0, kernel_start).

Sorry for the late reply.

I misunderstood it, sorry.

> > This bottom-up way is taken on many ARCHes, and it works well on systems
> > where KASLR is not enabled. Below is the search result in the current Linux
> > kernel; we can see that all ARCHes have this mechanism, except
> > arm/arm64. But for now only arm64/mips/x86 have KASLR.
> > 
> > W/o KASLR, allocating memblock regions above the kernel end when hotplug
> > info is not parsed yet looks very reasonable, since the kernel is usually
> > put at a low address, e.g. on x86 it's 16M. My thought is that we need to do
> > memblock allocation around the kernel before hotplug info is parsed. That
> > is, for a system w/o KASLR we keep the current bottom-up way; for a system
> > with KASLR we should allocate memblock regions top-down just below the
> > kernel start.
> 
> I completely agree. I was thinking about making
> memblock_find_in_range_node() do something like:
> 
> if (memblock_bottom_up()) {
> 	bottom_up_start = max(start, kernel_end);

In this way, if start < kernel_end, it will still succeed in finding a
region bottom-up above the kernel end.
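
For example, if the caller passes start == 0 while the kernel ends at 16M,
bottom_up_start becomes 16M, so the first, bottom-up attempt still searches
above the kernel; only the second, top-down attempt would cover [0, 16M).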

I am still reading the code. I just noticed Pingfan sent an RFC patchset to
move SRAT parsing earlier; I'm not sure whether he has tested it on a NUMA
system with ACPI. I doubt that really works.

Thanks
Baoquan

> 	ret = __memblock_find_range_bottom_up(bottom_up_start, end,
> 					      size, align, nid, flags);



> 	if (ret)
> 		return ret;
> 
> 	bottom_up_start = max(start, 0);
> 	end = kernel_start;
> 
> 	ret = __memblock_find_range_top_down(bottom_up_start, end,
> 					     size, align, nid, flags);
> 	if (ret)
> 		return ret;
> }
diff mbox series

Patch

diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 2746994..3eea4e4 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -462,6 +462,10 @@  int __init acpi_numa_init(void)
 
 		cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
 					    acpi_parse_memory_affinity, 0);
+
+#if defined(CONFIG_X86) || defined(CONFIG_ARM64)
+		mark_mem_hotplug_parsed();
+#endif
 	}
 
 	/* SLIT: System Locality Information Table */
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index aee299a..d89ed9e 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -125,6 +125,7 @@  int memblock_reserve(phys_addr_t base, phys_addr_t size);
 void memblock_trim_memory(phys_addr_t align);
 bool memblock_overlaps_region(struct memblock_type *type,
 			      phys_addr_t base, phys_addr_t size);
+void mark_mem_hotplug_parsed(void);
 int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
 int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
 int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
diff --git a/mm/memblock.c b/mm/memblock.c
index 81ae63c..a3f5e46 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -231,6 +231,12 @@  __memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
 	return 0;
 }
 
+static bool mem_hotmovable_parsed __initdata_memblock;
+void __init_memblock mark_mem_hotplug_parsed(void)
+{
+	mem_hotmovable_parsed = true;
+}
+
 /**
  * memblock_find_in_range_node - find free area in given range and node
  * @size: size of free area to find
@@ -259,7 +265,7 @@  phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
 					phys_addr_t end, int nid,
 					enum memblock_flags flags)
 {
-	phys_addr_t kernel_end, ret;
+	phys_addr_t kernel_end, ret = 0;
 
 	/* pump up @end */
 	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
@@ -270,34 +276,40 @@  phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
 	end = max(start, end);
 	kernel_end = __pa_symbol(_end);
 
-	/*
-	 * try bottom-up allocation only when bottom-up mode
-	 * is set and @end is above the kernel image.
-	 */
-	if (memblock_bottom_up() && end > kernel_end) {
-		phys_addr_t bottom_up_start;
+	if (memblock_bottom_up()) {
+		phys_addr_t bottom_up_start = start;
 
-		/* make sure we will allocate above the kernel */
-		bottom_up_start = max(start, kernel_end);
-
-		/* ok, try bottom-up allocation first */
-		ret = __memblock_find_range_bottom_up(bottom_up_start, end,
-						      size, align, nid, flags);
-		if (ret)
+		if (mem_hotmovable_parsed) {
+			ret = __memblock_find_range_bottom_up(
+				bottom_up_start, end, size, align, nid,
+				flags);
 			return ret;
 
 		/*
-		 * we always limit bottom-up allocation above the kernel,
-		 * but top-down allocation doesn't have the limit, so
-		 * retrying top-down allocation may succeed when bottom-up
-		 * allocation failed.
-		 *
-		 * bottom-up allocation is expected to be fail very rarely,
-		 * so we use WARN_ONCE() here to see the stack trace if
-		 * fail happens.
+		 * if mem hotplug info is not parsed yet, try bottom-up
+		 * allocation with @end above the kernel image.
 		 */
-		WARN_ONCE(IS_ENABLED(CONFIG_MEMORY_HOTREMOVE),
+		} else if (!mem_hotmovable_parsed && end > kernel_end) {
+			/* make sure we will allocate above the kernel */
+			bottom_up_start = max(start, kernel_end);
+			ret = __memblock_find_range_bottom_up(
+				bottom_up_start, end, size, align, nid,
+				flags);
+			if (ret)
+				return ret;
+			/*
+			 * we always limit bottom-up allocation above the
+			 * kernel, but top-down allocation doesn't have
+			 * the limit, so retrying top-down allocation may
+			 * succeed when bottom-up allocation failed.
+			 *
+			 * bottom-up allocation is expected to fail
+			 * very rarely, so we use WARN_ONCE() here to see
+			 * the stack trace if a failure happens.
+			 */
+			WARN_ONCE(IS_ENABLED(CONFIG_MEMORY_HOTREMOVE),
 			  "memblock: bottom-up allocation failed, memory hotremove may be affected\n");
+		}
 	}
 
 	return __memblock_find_range_top_down(start, end, size, align, nid,