
[V7,3/3] arm64/mm: Enable memory hot remove

Message ID 1567503958-25831-4-git-send-email-anshuman.khandual@arm.com (mailing list archive)
State New, archived
Series arm64/mm: Enable memory hot remove

Commit Message

Anshuman Khandual Sept. 3, 2019, 9:45 a.m. UTC
The arch code for hot-remove must tear down portions of the linear map and
vmemmap corresponding to memory being removed. In both cases the page
tables mapping these regions must be freed, and when sparse vmemmap is in
use the memory backing the vmemmap must also be freed.

This patch adds a new remove_pagetable() helper which can be used to tear
down either region, and calls it from vmemmap_free() and
__remove_pgd_mapping(). The sparse_vmap argument determines whether the
backing memory will be freed.

remove_pagetable() makes two distinct passes over the kernel page table.
In the first pass it unmaps each mapped leaf entry, invalidates the
applicable TLB entries and, if required (vmemmap), frees the backing
memory. In the second pass it looks for empty page table sections whose
page table pages can be unmapped, TLB invalidated and freed.

While freeing an intermediate level page table page, bail out if any of
its entries are still valid. This can happen for a partially filled
kernel page table, either from a previously failed memory hot add
attempt or when removing an address range which does not span the
entire page table page.

The vmemmap region may share levels of table with the vmalloc region.
Hot remove freeing page table pages can therefore conflict with a
concurrent vmalloc() walking the kernel page table. This conflict cannot
simply be solved by taking the init_mm ptl because of the existing
locking scheme in vmalloc(). Hence, unlike for the linear mapping, skip
freeing page table pages while tearing down the vmemmap mapping when the
vmalloc and vmemmap ranges overlap.

While here, update arch_add_memory() to handle __add_pages() failures by
unmapping the just-added kernel linear mapping. Finally, enable memory
hot remove on arm64 platforms by default with
ARCH_ENABLE_MEMORY_HOTREMOVE.

This implementation is broadly inspired by the kernel page table tear
down procedure on the x86 architecture.

Acked-by: Steve Capper <steve.capper@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/Kconfig              |   3 +
 arch/arm64/include/asm/memory.h |   1 +
 arch/arm64/mm/mmu.c             | 338 +++++++++++++++++++++++++++++++-
 3 files changed, 333 insertions(+), 9 deletions(-)

Comments

Catalin Marinas Sept. 10, 2019, 4:17 p.m. UTC | #1
On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote:
> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>  void vmemmap_free(unsigned long start, unsigned long end,
>  		struct vmem_altmap *altmap)
>  {
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +	/*
> +	 * FIXME: We should have called remove_pagetable(start, end, true).
> +	 * vmemmap and vmalloc virtual range might share intermediate kernel
> +	 * page table entries. Removing vmemmap range page table pages here
> +	 * can potentially conflict with a concurrent vmalloc() allocation.
> +	 *
> +	 * This is primarily because vmalloc() does not take init_mm ptl for
> +	 * the entire page table walk and it's modification. Instead it just
> +	 * takes the lock while allocating and installing page table pages
> +	 * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
> +	 * entry via memory hot remove can cause vmalloc() kernel page table
> +	 * walk pointers to be invalid on the fly which can cause corruption
> +	 * or worst, a crash.
> +	 *
> +	 * So free_empty_tables() gets called where vmalloc and vmemmap range
> +	 * do not overlap at any intermediate level kernel page table entry.
> +	 */
> +	unmap_hotplug_range(start, end, true);
> +	if (!vmalloc_vmemmap_overlap)
> +		free_empty_tables(start, end);
> +#endif
>  }
>  #endif	/* CONFIG_SPARSEMEM_VMEMMAP */

I wonder whether we could simply ignore the vmemmap freeing altogether,
just leave it around and not unmap it. This way, we could call
unmap_kernel_range() for removing the linear map and we save some code.

For the linear map, I think we use just above 2MB of tables for 1GB of
memory mapped (worst case with 4KB pages we need 512 pte pages). For
vmemmap we'd use slightly above 2MB for a 64GB hotplugged memory. Do we
expect such memory to be re-plugged again in the same range? If we do,
then I shouldn't even bother with removing the vmemmap.
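
(Rough arithmetic behind those figures, assuming 4KB pages and a
64-byte struct page:

	linear map, 1GB:  each pte page maps 512 * 4KB = 2MB, so 1GB of
	                  memory needs 512 pte pages = 2MB of tables, plus
	                  a pmd page and part of a pud page.
	vmemmap, 64GB:    64GB / 4KB = 16M struct pages * 64B = 1GB of
	                  vmemmap, which mapped at 4KB granularity again
	                  needs ~512 pte pages, i.e. slightly above 2MB.)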

I don't fully understand the use-case for memory hotremove, so any
additional info would be useful to make a decision here.
David Hildenbrand Sept. 11, 2019, 10:31 a.m. UTC | #2
On 10.09.19 18:17, Catalin Marinas wrote:
> On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote:
>> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>  void vmemmap_free(unsigned long start, unsigned long end,
>>  		struct vmem_altmap *altmap)
>>  {
>> +#ifdef CONFIG_MEMORY_HOTPLUG
>> +	/*
>> +	 * FIXME: We should have called remove_pagetable(start, end, true).
>> +	 * vmemmap and vmalloc virtual range might share intermediate kernel
>> +	 * page table entries. Removing vmemmap range page table pages here
>> +	 * can potentially conflict with a concurrent vmalloc() allocation.
>> +	 *
>> +	 * This is primarily because vmalloc() does not take init_mm ptl for
>> +	 * the entire page table walk and it's modification. Instead it just
>> +	 * takes the lock while allocating and installing page table pages
>> +	 * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
>> +	 * entry via memory hot remove can cause vmalloc() kernel page table
>> +	 * walk pointers to be invalid on the fly which can cause corruption
>> +	 * or worst, a crash.
>> +	 *
>> +	 * So free_empty_tables() gets called where vmalloc and vmemmap range
>> +	 * do not overlap at any intermediate level kernel page table entry.
>> +	 */
>> +	unmap_hotplug_range(start, end, true);
>> +	if (!vmalloc_vmemmap_overlap)
>> +		free_empty_tables(start, end);
>> +#endif
>>  }
>>  #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
> 
> I wonder whether we could simply ignore the vmemmap freeing altogether,
> just leave it around and not unmap it. This way, we could call
> unmap_kernel_range() for removing the linear map and we save some code.
> 
> For the linear map, I think we use just above 2MB of tables for 1GB of
> memory mapped (worst case with 4KB pages we need 512 pte pages). For
> vmemmap we'd use slightly above 2MB for a 64GB hotplugged memory. Do we
> expect such memory to be re-plugged again in the same range? If we do,
> then I shouldn't even bother with removing the vmmemmap.
> 

FWIW, I think we should do it cleanly.

> I don't fully understand the use-case for memory hotremove, so any
> additional info would be useful to make a decision here.
> 

Especially in virtual environments, hot remove will be relevant. For
physical environments I have no idea how important that is for ARM.
Anshuman Khandual Sept. 12, 2019, 4:28 a.m. UTC | #3
On 09/10/2019 09:47 PM, Catalin Marinas wrote:
> On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote:
>> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>  void vmemmap_free(unsigned long start, unsigned long end,
>>  		struct vmem_altmap *altmap)
>>  {
>> +#ifdef CONFIG_MEMORY_HOTPLUG
>> +	/*
>> +	 * FIXME: We should have called remove_pagetable(start, end, true).
>> +	 * vmemmap and vmalloc virtual range might share intermediate kernel
>> +	 * page table entries. Removing vmemmap range page table pages here
>> +	 * can potentially conflict with a concurrent vmalloc() allocation.
>> +	 *
>> +	 * This is primarily because vmalloc() does not take init_mm ptl for
>> +	 * the entire page table walk and it's modification. Instead it just
>> +	 * takes the lock while allocating and installing page table pages
>> +	 * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
>> +	 * entry via memory hot remove can cause vmalloc() kernel page table
>> +	 * walk pointers to be invalid on the fly which can cause corruption
>> +	 * or worst, a crash.
>> +	 *
>> +	 * So free_empty_tables() gets called where vmalloc and vmemmap range
>> +	 * do not overlap at any intermediate level kernel page table entry.
>> +	 */
>> +	unmap_hotplug_range(start, end, true);
>> +	if (!vmalloc_vmemmap_overlap)
>> +		free_empty_tables(start, end);
>> +#endif
>>  }
>>  #endif	/* CONFIG_SPARSEMEM_VMEMMAP */

Hello Catalin,

> 
> I wonder whether we could simply ignore the vmemmap freeing altogether,
> just leave it around and not unmap it. This way, we could call

This would have been an option (even if we ignore for a moment that it
might not be the cleanest possible method) if present-day memory hot
remove scenarios involved just system RAM of comparable sizes.

But persistent memory, which will be plugged in as ZONE_DEVICE, might
ask for a vmem_altmap based vmemmap mapping where the backing memory
comes from the persistent memory range itself, not from existing system
RAM. IIRC altmap support was originally added because the amount of
persistent memory on a system might be an order of magnitude higher than
that of regular system RAM. A normal memory hot add (without altmap)
would then consume a great deal of system RAM just for the persistent
memory range's vmemmap mapping. To avoid such a scenario, altmap was
created to allocate the vmemmap backing memory from the device memory
range itself.

In such cases the vmemmap must be unmapped and its backing memory freed
up for the complete removal of a persistent memory range which
originally requested altmap based vmemmap backing.

Just as a reference, the upcoming series which enables altmap support on
arm64 tries to allocate the vmemmap backing memory from the device range
itself during memory hot add and to free it up during memory hot remove.
That will not be possible if memory hot remove does not actually free up
the vmemmap backing storage.

https://patchwork.kernel.org/project/linux-mm/list/?series=139299

> unmap_kernel_range() for removing the linear map and we save some code.
> 
> For the linear map, I think we use just above 2MB of tables for 1GB of
> memory mapped (worst case with 4KB pages we need 512 pte pages). For
> vmemmap we'd use slightly above 2MB for a 64GB hotplugged memory. Do we

You are right, the amount of memory required for kernel page table pages
depends on the mapping page size and the size of the range to be mapped.
But as explained below there are hot remove situations where these
ranges will remain unused forever afterwards, so some of these pages
(even empty ones) might remain unused for good.

> expect such memory to be re-plugged again in the same range? If we do,
> then I shouldn't even bother with removing the vmmemmap.
> 
> I don't fully understand the use-case for memory hotremove, so any
> additional info would be useful to make a decision here.

Sure, these are some of the scenarios I could recollect.

Physical Environment:

A. Physical DIMM replacement

Platform detects memory errors and initiates a DIMM replacement.

- Hot remove selected DIMM with errors
- Hot add a new DIMM in its place on the same slot

In normal circumstances, the new DIMM will require the same linear
and vmemmap mapping. In such cases hot-remove could just unmap the
linear mapping, leave everything else alone and be done with it.
Though I am not sure whether it's a good idea to leave behind
accessible struct pages which correspond to non-present pfns.

B. Physical DIMM movement

Platform can detect errors on a DIMM slot itself and initiate a
DIMM movement into a different empty slot.

- Hot remove selected memory DIMM from defective slot
- Hot add same memory DIMM into a different available empty slot

The physical address range for the DIMM has now changed, so it will
require different linear and vmemmap mappings than it had originally.
Hence during hot remove we should not only unmap the linear and vmemmap
mappings but also free up all associated resources, as this physical
memory range is never going to be available again because the slot
has gone bad permanently.

C. Physical DIMM hot-remove

Platform just initiates hot-remove of a DIMM and reduces available
memory as instructed by the administrator.

- Hot remove a selected DIMM

This memory might never come back again, or might come back on a
different slot. Without that certainty, it is always better to unmap
both linear and vmemmap mappings and free up all associated resources.

D. Changing NUMA affinity

After performance analysis, the administrator, through the platform,
initiates a DIMM hot-remove from a given node and a DIMM hot-add to
another node to achieve better NUMA affinity.

- Hot remove a selected DIMM from node N0
- Hot add selected DIMM to another node N1

Here both linear and vmemmap ranges will change after the movement and
there is uncertainty regarding whether the now empty physical range on
node N0 will ever get populated again. Without that certainty, it is
always better to unmap both linear and vmemmap mappings and free up all
associated resources.

Virtual Environment:

1. Memory hot-remove can simply be initiated by the admin from the host
in order to reduce the total physical memory entitlement of a guest,
reflecting changed hosting contracts etc. The memory might never come
back again, and in such cases hot-remove should be clean, freeing all
associated resources.

2. Memory hot-remove on the guest can be initiated from the host after
detecting memory errors on the backing physical DIMM. Memory hot-remove
on the guest will be followed by memory hot-remove on the host itself.
The replacement DIMM can be on the same slot, taking over the same host
physical address range as before, but the guest might get back its
memory either on the same guest physical range as previously or on some
other range.

3. Changing NUMA binding for a guest on the host might require guest
PFN realignment with respect to guest nodes as well.

Persistent Memory:

As mentioned previously, persistent memory has special vmemmap mapping
requirements through vmem_altmap, which requires freeing up backing
memory from its own range for it to be completely removed.

Device memory (FPGA cards, GPU cards, Network cards etc):

In the future, some of this coherent device memory might be plugged into
ZONE_DEVICE and managed through drivers. It might be attached to the
system via upcoming interfaces like CCIX. The managing drivers might
need to offline the device memory range in order to service some high
priority error, re-init and plug it back on a different physical range
due to CCIX link errors or some other constraints.

The point I am trying to make here is that there are many such possible
combinations of events with respect to memory hot-remove in both physical
and virtual environments for system RAM, persistent memory and other
coherent device memory. Leaving behind kernel page table pages, or even
struct pages, for a physical range that is unavailable (possibly forever)
might be problematic. IMHO it is better to do this as cleanly as
possible.
Anshuman Khandual Sept. 12, 2019, 8:37 a.m. UTC | #4
On 09/12/2019 09:58 AM, Anshuman Khandual wrote:
> 
> On 09/10/2019 09:47 PM, Catalin Marinas wrote:
>> On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote:
>>> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>>  void vmemmap_free(unsigned long start, unsigned long end,
>>>  		struct vmem_altmap *altmap)
>>>  {
>>> +#ifdef CONFIG_MEMORY_HOTPLUG
>>> +	/*
>>> +	 * FIXME: We should have called remove_pagetable(start, end, true).
>>> +	 * vmemmap and vmalloc virtual range might share intermediate kernel
>>> +	 * page table entries. Removing vmemmap range page table pages here
>>> +	 * can potentially conflict with a concurrent vmalloc() allocation.
>>> +	 *
>>> +	 * This is primarily because vmalloc() does not take init_mm ptl for
>>> +	 * the entire page table walk and it's modification. Instead it just
>>> +	 * takes the lock while allocating and installing page table pages
>>> +	 * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
>>> +	 * entry via memory hot remove can cause vmalloc() kernel page table
>>> +	 * walk pointers to be invalid on the fly which can cause corruption
>>> +	 * or worst, a crash.
>>> +	 *
>>> +	 * So free_empty_tables() gets called where vmalloc and vmemmap range
>>> +	 * do not overlap at any intermediate level kernel page table entry.
>>> +	 */
>>> +	unmap_hotplug_range(start, end, true);
>>> +	if (!vmalloc_vmemmap_overlap)
>>> +		free_empty_tables(start, end);
>>> +#endif
>>>  }
>>>  #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
> Hello Catalin,
> 
>> I wonder whether we could simply ignore the vmemmap freeing altogether,
>> just leave it around and not unmap it. This way, we could call
> This would have been an option (even if we just ignore for a moment that
> it might not be the cleanest possible method) if present memory hot remove
> scenarios involved just system RAM of comparable sizes.
> 
> But with persistent memory which will be plugged in as ZONE_DEVICE might
> ask for a vmem_atlamp based vmemmap mapping where the backing memory comes
> from the persistent memory range itself not from existing system RAM. IIRC
> altmap support was originally added because the amount persistent memory on
> a system might be order of magnitude higher than that of regular system RAM.
> During normal memory hot add (without altmap) would have caused great deal
> of consumption from system RAM just for persistent memory range's vmemmap
> mapping. In order to avoid such a scenario altmap was created to allocate
> vmemmap mapping backing memory from the device memory range itself.
> 
> In such cases vmemmap must be unmapped and it's backing memory freed up for
> the complete removal of persistent memory which originally requested for
> altmap based vmemmap backing.
> 
> Just as a reference, the upcoming series which enables altmap support on
> arm64 tries to allocate vmemmap mapping backing memory from the device range
> itself during memory hot add and free them up during memory hot remove. Those
> methods will not be possible if memory hot-remove does not really free up
> vmemmap backing storage.
> 
> https://patchwork.kernel.org/project/linux-mm/list/?series=139299
> 

Just to add here: there is ongoing work which will enable allocating the
vmemmap from the hot-added range itself even for normal system RAM. So
this will not remain specific to ZONE_DEVICE based device/persistent
memory for long.

https://lore.kernel.org/lkml/20190725160207.19579-1-osalvador@suse.de/
Catalin Marinas Sept. 12, 2019, 8:15 p.m. UTC | #5
Hi Anshuman,

Thanks for the details on the need for removing the page tables and
vmemmap backing. Some comments on the code below.

On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote:
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -60,6 +60,14 @@ static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_bss __maybe_unused;
>  
>  static DEFINE_SPINLOCK(swapper_pgdir_lock);
>  
> +/*
> + * This represents if vmalloc and vmemmap address range overlap with
> + * each other on an intermediate level kernel page table entry which
> + * in turn helps in deciding whether empty kernel page table pages
> + * if any can be freed during memory hotplug operation.
> + */
> +static bool vmalloc_vmemmap_overlap;

I'd say just move the static find_vmalloc_vmemmap_overlap() function
here; the compiler should be smart enough to figure out that it's just
a build-time constant.
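
A minimal sketch of that suggestion, reusing the gap check from the
patch below (the helper name and the reliance on constant folding are
assumptions, not the posted code):

static bool vmalloc_vmemmap_overlap(void)
{
	unsigned long gap_start, gap_end;

	/* Determine the gap between the disjoint vmalloc and vmemmap ranges */
	if (VMEMMAP_START > VMALLOC_END) {
		gap_start = VMALLOC_END;
		gap_end = VMEMMAP_START;
	} else {
		gap_start = VMEMMAP_END;
		gap_end = VMALLOC_START;
	}

	/*
	 * If both gap edges fall into the same PGD entry, the two ranges
	 * share a pgd-level table. With constant VMALLOC and VMEMMAP
	 * bounds this should fold to a compile-time 0 or 1.
	 */
	return (gap_start & PGDIR_MASK) == (gap_end & PGDIR_MASK);
}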

> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>  void vmemmap_free(unsigned long start, unsigned long end,
>  		struct vmem_altmap *altmap)
>  {
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +	/*
> +	 * FIXME: We should have called remove_pagetable(start, end, true).
> +	 * vmemmap and vmalloc virtual range might share intermediate kernel
> +	 * page table entries. Removing vmemmap range page table pages here
> +	 * can potentially conflict with a concurrent vmalloc() allocation.
> +	 *
> +	 * This is primarily because vmalloc() does not take init_mm ptl for
> +	 * the entire page table walk and it's modification. Instead it just
> +	 * takes the lock while allocating and installing page table pages
> +	 * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
> +	 * entry via memory hot remove can cause vmalloc() kernel page table
> +	 * walk pointers to be invalid on the fly which can cause corruption
> +	 * or worst, a crash.
> +	 *
> +	 * So free_empty_tables() gets called where vmalloc and vmemmap range
> +	 * do not overlap at any intermediate level kernel page table entry.
> +	 */
> +	unmap_hotplug_range(start, end, true);
> +	if (!vmalloc_vmemmap_overlap)
> +		free_empty_tables(start, end);
> +#endif
>  }

So, I see the risk with overlapping and I guess for some kernel
configurations (PAGE_SIZE == 64K) we may not be able to avoid it. If we
can, that's great, otherwise could we rewrite the above functions to
handle floor and ceiling similar to free_pgd_range()? (I wonder how this
function works if you called it on init_mm and kernel address range). By
having the vmemmap start/end information it avoids freeing partially
filled page table pages.

Another question: could we do the page table and the actual vmemmap
pages freeing in a single pass (sorry if this has been discussed
before)?

> @@ -1048,10 +1322,18 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> +static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
> +{
> +	unsigned long end = start + size;
> +
> +	WARN_ON(pgdir != init_mm.pgd);
> +	remove_pagetable(start, end, false);
> +}

I think the point I've made previously still stands: you only call
remove_pagetable() with sparse_vmap == false in this patch. Just remove
the extra argument and call unmap_hotplug_range() with sparse_vmap ==
false directly in remove_pagetable().
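
That is, a minimal sketch of the suggested simplification:

static void remove_pagetable(unsigned long start, unsigned long end)
{
	/* sparse_vmap is always false here: linear map backing pages are not freed */
	unmap_hotplug_range(start, end, false);
	free_empty_tables(start, end);
}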
Anshuman Khandual Sept. 13, 2019, 5:58 a.m. UTC | #6
On 09/13/2019 01:45 AM, Catalin Marinas wrote:
> Hi Anshuman,
> 
> Thanks for the details on the need for removing the page tables and
> vmemmap backing. Some comments on the code below.
> 
> On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote:
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -60,6 +60,14 @@ static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_bss __maybe_unused;
>>  
>>  static DEFINE_SPINLOCK(swapper_pgdir_lock);
>>  
>> +/*
>> + * This represents if vmalloc and vmemmap address range overlap with
>> + * each other on an intermediate level kernel page table entry which
>> + * in turn helps in deciding whether empty kernel page table pages
>> + * if any can be freed during memory hotplug operation.
>> + */
>> +static bool vmalloc_vmemmap_overlap;
> 
> I'd say just move the static find_vmalloc_vmemmap_overlap() function
> here, the compiler should be sufficiently smart enough to figure out
> that it's just a build-time constant.

Sure, will do.

> 
>> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>  void vmemmap_free(unsigned long start, unsigned long end,
>>  		struct vmem_altmap *altmap)
>>  {
>> +#ifdef CONFIG_MEMORY_HOTPLUG
>> +	/*
>> +	 * FIXME: We should have called remove_pagetable(start, end, true).
>> +	 * vmemmap and vmalloc virtual range might share intermediate kernel
>> +	 * page table entries. Removing vmemmap range page table pages here
>> +	 * can potentially conflict with a concurrent vmalloc() allocation.
>> +	 *
>> +	 * This is primarily because vmalloc() does not take init_mm ptl for
>> +	 * the entire page table walk and it's modification. Instead it just
>> +	 * takes the lock while allocating and installing page table pages
>> +	 * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
>> +	 * entry via memory hot remove can cause vmalloc() kernel page table
>> +	 * walk pointers to be invalid on the fly which can cause corruption
>> +	 * or worst, a crash.
>> +	 *
>> +	 * So free_empty_tables() gets called where vmalloc and vmemmap range
>> +	 * do not overlap at any intermediate level kernel page table entry.
>> +	 */
>> +	unmap_hotplug_range(start, end, true);
>> +	if (!vmalloc_vmemmap_overlap)
>> +		free_empty_tables(start, end);
>> +#endif
>>  }
> 
> So, I see the risk with overlapping and I guess for some kernel
> configurations (PAGE_SIZE == 64K) we may not be able to avoid it. If we

I did not see the 64K config options having this overlap, do you suspect
they might? After the 52 bit KVA series has been merged, the following
configurations have the vmalloc-vmemmap range overlap problem:

- 4K  page size with 48 bit VA space
- 16K page size with 48 bit VA space

> can, that's great, otherwise could we rewrite the above functions to
> handle floor and ceiling similar to free_pgd_range()? (I wonder how this
> function works if you called it on init_mm and kernel address range). By

Hmm, never tried that. Are you wondering if this can be used directly?
There are two distinct elements which make it very specific to user page
tables: mmu_gather based TLB tracking and mm->pgtable_bytes accounting
with mm_dec_nr_pxx().

> having the vmemmap start/end information it avoids freeing partially
> filled page table pages.

Did you mean page table pages which can partially overlap with vmalloc?

The problem (race) is not the inability to deal with a partially filled
table. We can handle that correctly as explained below [1]. The problem
is the inadequate kernel page table locking during vmalloc(), which
might be accessing intermediate kernel page table pointers that are
concurrently being freed via free_empty_tables(). Hence we cannot free
any page table page which can ever hold entries from the vmalloc() range.

Though I am not completely sure whether I really understood the
suggestion above with respect to the floor-ceiling mechanism as in
free_pgd_range(). Are you suggesting that we should only attempt to free
up those vmemmap range page table pages which *definitely* could never
overlap with vmalloc, by working on a modified vmemmap range instead
(i.e. cut down with floor/ceiling to avoid the vmalloc range at each
level)? This could be one restrictive version of free_empty_tables(),
called in case there is an overlap, so we would maintain two versions of
free_empty_tables(). Please correct me if any of the above assumptions
or understanding is wrong.

But yes, with this we should be able to free up some empty page table
pages which are being left out in the current proposal when the overlap
happens.

[1] Skipping partially filled page tables

All free_pXX_table() functions take care to avoid freeing partially
filled page table pages, whether they represent or ever represented a
linear, vmemmap or vmalloc mapping in init_mm. They go over each
individual entry in a given page table, making sure that each of them
checks as pXX_none(), before freeing the entire page table page.

The walk itself, though, is restricted by the address range in question:

free_empty_tables(start, end)
	free_empty_pud_table(pgdp, addr, next);
		free_empty_pmd_table(pudp, addr, next);
			free_empty_pte_table(pmdp, addr, next);

Page table pages being examined on the way might contain entries which
once represented addresses beyond the vmemmap range being removed. But
that's a good thing IMHO. It can accommodate vmemmap tear down from a
previous hot remove of an adjacent range whose tables might not have
been freed last time.

pudp = pud_offset(pgdp, 0UL);
pmdp = pmd_offset(pudp, 0UL);
ptep = pte_offset_kernel(pmdp, 0UL);

pxx_none() makes sure that in such cases freeing of the page table page
is skipped. But yes, even though it is more thorough, it might attempt to
free page table pages which contain entries not belonging to the range
being removed.
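
For example, free_pte_table() in this patch scans the whole table page,
starting from entry 0, before freeing it:

	ptep = pte_offset_kernel(pmdp, 0UL);
	for (i = 0; i < PTRS_PER_PTE; i++) {
		if (!pte_none(READ_ONCE(ptep[i])))
			return;
	}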

> 
> Another question: could we do the page table and the actual vmemmap
> pages freeing in a single pass (sorry if this has been discussed
> before)?

We could, and some earlier versions (up to V5) of the series did that in
fact. Mark Rutland had suggested doing this in two passes. Some extracts
from the previous discussion:

https://lkml.org/lkml/2019/5/30/1159

-----------------------
Looking at this some more, I don't think this is quite right, and tI
think that structure of the free_*() and remove_*() functions makes this
unnecessarily hard to follow. We should aim for this to be obviously
correct.

The x86 code is the best template to follow here. As mentioned
previously, I'm fairly certain it's not entirely correct (e.g. due to
missing TLB maintenance), and we've already diverged a fair amount in
fixing up obvious issues, so we shouldn't aim to mirror it.

I think that the structure of unmap_region() is closer to what we want
here -- do one pass to unmap leaf entries (and freeing the associated
memory if unmapping the vmemmap), then do a second pass cleaning up any
empty tables.
----------------------

Apart from the fact that two passes over the page table are cleaner and
give us more granular and modular infrastructure for later purposes, it
is also a necessity for dealing with the vmalloc-vmemmap overlap:
free_empty_tables(), which is the second pass, can be skipped cleanly
when the overlap is detected.

> 
>> @@ -1048,10 +1322,18 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
>>  }
>>  
>>  #ifdef CONFIG_MEMORY_HOTPLUG
>> +static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
>> +{
>> +	unsigned long end = start + size;
>> +
>> +	WARN_ON(pgdir != init_mm.pgd);
>> +	remove_pagetable(start, end, false);
>> +}
> 
> I think the point I've made previously still stands: you only call
> remove_pagetable() with sparse_vmap == false in this patch. Just remove
> the extra argument and call unmap_hotplug_range() with sparse_vmap ==
> false directly in remove_pagetable().

Sure, will do. The original function signature was left unchanged in the
hope that at a later point it could be called with "sparse_vmap == true",
as mentioned by the comment in vmemmap_free(). Will change the comment as
well.
Catalin Marinas Sept. 13, 2019, 10:09 a.m. UTC | #7
On Fri, Sep 13, 2019 at 11:28:01AM +0530, Anshuman Khandual wrote:
> On 09/13/2019 01:45 AM, Catalin Marinas wrote:
> > On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote:
> >> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> >>  void vmemmap_free(unsigned long start, unsigned long end,
> >>  		struct vmem_altmap *altmap)
> >>  {
> >> +#ifdef CONFIG_MEMORY_HOTPLUG
> >> +	/*
> >> +	 * FIXME: We should have called remove_pagetable(start, end, true).
> >> +	 * vmemmap and vmalloc virtual range might share intermediate kernel
> >> +	 * page table entries. Removing vmemmap range page table pages here
> >> +	 * can potentially conflict with a concurrent vmalloc() allocation.
> >> +	 *
> >> +	 * This is primarily because vmalloc() does not take init_mm ptl for
> >> +	 * the entire page table walk and it's modification. Instead it just
> >> +	 * takes the lock while allocating and installing page table pages
> >> +	 * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
> >> +	 * entry via memory hot remove can cause vmalloc() kernel page table
> >> +	 * walk pointers to be invalid on the fly which can cause corruption
> >> +	 * or worst, a crash.
> >> +	 *
> >> +	 * So free_empty_tables() gets called where vmalloc and vmemmap range
> >> +	 * do not overlap at any intermediate level kernel page table entry.
> >> +	 */
> >> +	unmap_hotplug_range(start, end, true);
> >> +	if (!vmalloc_vmemmap_overlap)
> >> +		free_empty_tables(start, end);
> >> +#endif
> >>  }
> > 
> > So, I see the risk with overlapping and I guess for some kernel
> > configurations (PAGE_SIZE == 64K) we may not be able to avoid it. If we
> 
> Did not see 64K config options to have overlap, do you suspect they might ?
> After the 52 bit KVA series has been merged, following configurations have
> the vmalloc-vmemmap range overlap problem.
> 
> - 4K  page size with 48 bit VA space
> - 16K page size with 48 bit VA space

OK. I haven't checked, so it was just a guess that 64K has this problem
since the pgd entry coverage is fairly large.

> > can, that's great, otherwise could we rewrite the above functions to
> > handle floor and ceiling similar to free_pgd_range()? (I wonder how this
> > function works if you called it on init_mm and kernel address range). By
> 
> Hmm, never tried that. Are you wondering if this can be used directly ?
> There are two distinct elements which make it very specific to user page
> tables, mmu_gather based TLB tracking and mm->pgtable_bytes accounting
> with mm_dec_nr_pxx().

Ah, I missed the mm_dec_nr_*(). So I don't think it would work directly.
We could, however, use the same approach for kernel page tables.

> > having the vmemmap start/end information it avoids freeing partially
> > filled page table pages.
> 
> Did you mean page table pages which can partially overlap with vmalloc ?

Overlapping with the vmalloc range, not necessarily with a mapped
vmalloc area.

> The problem (race) is not because of the inability to deal with partially
> filled table. We can handle that correctly as explained below [1]. The
> problem is with inadequate kernel page table locking during vmalloc()
> which might be accessing intermediate kernel page table pointers which is
> being freed with free_empty_tables() concurrently. Hence we cannot free
> any page table page which can ever have entries from vmalloc() range.

The way you deal with the partially filled table in this patch is to
avoid freeing if there is a non-empty entry (!p*d_none()). This is what
causes the race with vmalloc. If you simply avoid freeing a pmd page,
for example, if the range floor/ceiling is not aligned to PUD_SIZE,
irrespective of whether the other entries are empty or not, you
shouldn't have this problem. You do free the pte page if the range is
aligned to PMD_SIZE but in this case it wouldn't overlap with the
vmalloc space. That's how free_pgd_range() works.

We may have some pgtable pages not freed at both ends of the range
(maximum 6 in total) but I don't really see this an issue. They could be
reused if something else gets mapped in that range.

> Though not completely sure, whether I really understood the suggestion above
> with respect to the floor-ceiling mechanism as in free_pgd_range(). Are you
> suggesting that we should only attempt to free up those vmemmap range page
> table pages which *definitely* could never overlap with vmalloc by working
> on a modified (i.e cut down with floor-ceiling while avoiding vmalloc range
> at each level) vmemmap range instead ?

You can ignore the overlap check altogether, only free the page tables
with floor/ceiling set to the start/size passed to arch_remove_memory()
and vmemmap_free().

> This can be one restrictive version of the function
> free_empty_tables() called in case there is an overlap. So we will
> maintain two versions for free_empty_tables(). Please correct me if
> any the above assumptions or understanding is wrong.

I'd rather have a single version of free_empty_tables(). As I said
above, the only downside is that a partially filled pgtable page would
not be freed even though the other entries are empty.

> But yes, with this we should be able to free up some possible empty page
> table pages which were being left out in the current proposal when overlap
> happens.
> 
> [1] Skipping partially filled page tables
> 
> All free_pXX_table() functions take care in avoiding freeing partially filled
> page table pages whether they represent or ever represented linear or vmemmap
> or vmalloc mapping in init_mm. They go over each individual entry in a given
> page table making sure that each of them checks as pXX_none() before freeing
> the entire page table page.

Yes but that's what's causing the race with a vmalloc trying to create
such entries.

> > Another question: could we do the page table and the actual vmemmap
> > pages freeing in a single pass (sorry if this has been discussed
> > before)?
> 
> We could and some initial versions (till V5) of the series had that in fact.
> Initially Mark Rutland had suggested to do this in two passes. Some extracts
> from the previous discussion.
> 
> https://lkml.org/lkml/2019/5/30/1159
> 
> -----------------------
> Looking at this some more, I don't think this is quite right, and tI
> think that structure of the free_*() and remove_*() functions makes this
> unnecessarily hard to follow. We should aim for this to be obviously
> correct.
> 
> The x86 code is the best template to follow here. As mentioned
> previously, I'm fairly certain it's not entirely correct (e.g. due to
> missing TLB maintenance), and we've already diverged a fair amount in
> fixing up obvious issues, so we shouldn't aim to mirror it.
> 
> I think that the structure of unmap_region() is closer to what we want
> here -- do one pass to unmap leaf entries (and freeing the associated
> memory if unmapping the vmemmap), then do a second pass cleaning up any
> empty tables.
> ----------------------
> 
> Apart from the fact that two passes over the page table is cleaner and gives
> us more granular and modular infrastructure to use for later purposes, it is
> also a necessity in dealing with vmalloc-vmemmap overlap. free_empty_tables()
> which is the second pass, can be skipped cleanly when overlap is detected.

I'm fine with two passes for unmap and pgtable free for the time being
and if they look fairly similar in a future version, we can think of
merging them. But for now, stick with two passes. The unmapping pass in
this patchset seems fine to me (though I haven't looked in detail).

There is also a race with ptdump that I haven't looked into.
Anshuman Khandual Sept. 17, 2019, 4:36 a.m. UTC | #8
On 09/13/2019 03:39 PM, Catalin Marinas wrote:
> On Fri, Sep 13, 2019 at 11:28:01AM +0530, Anshuman Khandual wrote:
>> On 09/13/2019 01:45 AM, Catalin Marinas wrote:
>>> On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote:
>>>> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>>>  void vmemmap_free(unsigned long start, unsigned long end,
>>>>  		struct vmem_altmap *altmap)
>>>>  {
>>>> +#ifdef CONFIG_MEMORY_HOTPLUG
>>>> +	/*
>>>> +	 * FIXME: We should have called remove_pagetable(start, end, true).
>>>> +	 * vmemmap and vmalloc virtual range might share intermediate kernel
>>>> +	 * page table entries. Removing vmemmap range page table pages here
>>>> +	 * can potentially conflict with a concurrent vmalloc() allocation.
>>>> +	 *
>>>> +	 * This is primarily because vmalloc() does not take init_mm ptl for
>>>> +	 * the entire page table walk and it's modification. Instead it just
>>>> +	 * takes the lock while allocating and installing page table pages
>>>> +	 * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
>>>> +	 * entry via memory hot remove can cause vmalloc() kernel page table
>>>> +	 * walk pointers to be invalid on the fly which can cause corruption
>>>> +	 * or worst, a crash.
>>>> +	 *
>>>> +	 * So free_empty_tables() gets called where vmalloc and vmemmap range
>>>> +	 * do not overlap at any intermediate level kernel page table entry.
>>>> +	 */
>>>> +	unmap_hotplug_range(start, end, true);
>>>> +	if (!vmalloc_vmemmap_overlap)
>>>> +		free_empty_tables(start, end);
>>>> +#endif
>>>>  }
>>>
>>> So, I see the risk with overlapping and I guess for some kernel
>>> configurations (PAGE_SIZE == 64K) we may not be able to avoid it. If we
>>
>> Did not see 64K config options to have overlap, do you suspect they might ?
>> After the 52 bit KVA series has been merged, following configurations have
>> the vmalloc-vmemmap range overlap problem.
>>
>> - 4K  page size with 48 bit VA space
>> - 16K page size with 48 bit VA space
> 
> OK. I haven't checked, so it was just a guess that 64K has this problem
> since the pgd entry coverage is fairly large.
> 
>>> can, that's great, otherwise could we rewrite the above functions to
>>> handle floor and ceiling similar to free_pgd_range()? (I wonder how this
>>> function works if you called it on init_mm and kernel address range). By
>>
>> Hmm, never tried that. Are you wondering if this can be used directly ?
>> There are two distinct elements which make it very specific to user page
>> tables, mmu_gather based TLB tracking and mm->pgtable_bytes accounting
>> with mm_dec_nr_pxx().
> 
> Ah, I missed the mm_dec_nr_*(). So I don't think it would work directly.
> We could, however, use the same approach for kernel page tables.

Right.

> 
>>> having the vmemmap start/end information it avoids freeing partially
>>> filled page table pages.
>>
>> Did you mean page table pages which can partially overlap with vmalloc ?
> 
> Overlapping with the vmalloc range, not necessarily with a mapped
> vmalloc area.
> 
>> The problem (race) is not because of the inability to deal with partially
>> filled table. We can handle that correctly as explained below [1]. The
>> problem is with inadequate kernel page table locking during vmalloc()
>> which might be accessing intermediate kernel page table pointers which is
>> being freed with free_empty_tables() concurrently. Hence we cannot free
>> any page table page which can ever have entries from vmalloc() range.
> 
> The way you deal with the partially filled table in this patch is to
> avoid freeing if there is a non-empty entry (!p*d_none()). This is what
> causes the race with vmalloc. If you simply avoid freeing a pmd page,
> for example, if the range floor/ceiling is not aligned to PUD_SIZE,
> irrespective of whether the other entries are empty or not, you
> shouldn't have this problem. You do free the pte page if the range is

Right, the floor/ceiling alignment check should abort the process before
scanning for non-empty entries.

> aligned to PMD_SIZE but in this case it wouldn't overlap with the
> vmalloc space. That's how free_pgd_range() works.

Like free_pgd_range(), page table pages can be freed at lower levels when
they have the right alignment wrt floor/ceiling. I will change all existing
free_pxx_table() functions to accommodate floor/ceiling alignment checks to
achieve this.
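
A minimal sketch of such a check at the pte-table level (the extra
floor/ceiling arguments and the exact alignment test are assumptions
about the next version, not the code posted here):

static void free_pte_table(pmd_t *pmdp, unsigned long addr,
			   unsigned long floor, unsigned long ceiling)
{
	unsigned long start = addr & PMD_MASK;
	struct page *page;
	pte_t *ptep;
	int i;

	/*
	 * Only consider freeing the pte page if the enclosing
	 * [floor, ceiling] range fully covers the PMD_SIZE region this
	 * table maps; otherwise some other mapping could still own
	 * entries in it.
	 */
	if (start < floor || start + PMD_SIZE > ceiling)
		return;

	/* Then make sure the table is actually empty */
	ptep = pte_offset_kernel(pmdp, 0UL);
	for (i = 0; i < PTRS_PER_PTE; i++) {
		if (!pte_none(READ_ONCE(ptep[i])))
			return;
	}

	page = pmd_page(READ_ONCE(*pmdp));
	pmd_clear(pmdp);
	__flush_tlb_kernel_pgtable(addr);
	free_hotplug_pgtable_page(page);
}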

> 
> We may have some pgtable pages not freed at both ends of the range
> (maximum 6 in total) but I don't really see this an issue. They could be
> reused if something else gets mapped in that range.

I assume that the maximum of 6 pages comes from

(floor edge + ceiling edge) * (PTE table + PMD table + PUD table) = 2 * 3


> 
>> Though not completely sure, whether I really understood the suggestion above
>> with respect to the floor-ceiling mechanism as in free_pgd_range(). Are you
>> suggesting that we should only attempt to free up those vmemmap range page
>> table pages which *definitely* could never overlap with vmalloc by working
>> on a modified (i.e cut down with floor-ceiling while avoiding vmalloc range
>> at each level) vmemmap range instead ?
> 
> You can ignore the overlap check altogether, only free the page tables
> with floor/ceiling set to the start/size passed to arch_remove_memory()
> and vmemmap_free().

Wondering if it would be better to use [VMEMMAP_START - VMEMMAP_END] and
[PAGE_OFFSET - PAGE_END] as floor/ceiling with vmemmap_free() and
arch_remove_memory() respectively. Not only is it safe to free all page
table pages which span these maximum possible mapping ranges, it also
reduces the risk of alignment related wastage.
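
That is, assuming free_empty_tables() grows floor/ceiling arguments, the
call sites would look roughly like (a sketch, not the posted code):

	/* vmemmap_free() */
	free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END);

	/* __remove_pgd_mapping(), i.e. the linear map */
	free_empty_tables(start, end, PAGE_OFFSET, PAGE_END);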

> 
>> This can be one restrictive version of the function
>> free_empty_tables() called in case there is an overlap. So we will
>> maintain two versions for free_empty_tables(). Please correct me if
>> any the above assumptions or understanding is wrong.
> 
> I'd rather have a single version of free_empty_tables(). As I said
> above, the only downside is that a partially filled pgtable page would
> not be freed even though the other entries are empty.

Sure. Also, practically, the limitation will apply only to the vmemmap
mapping and not to the linear mapping, where the chances of an overlap
are negligible as it covers half of the kernel virtual address space.

> 
>> But yes, with this we should be able to free up some possible empty page
>> table pages which were being left out in the current proposal when overlap
>> happens.
>>
>> [1] Skipping partially filled page tables
>>
>> All free_pXX_table() functions take care in avoiding freeing partially filled
>> page table pages whether they represent or ever represented linear or vmemmap
>> or vmalloc mapping in init_mm. They go over each individual entry in a given
>> page table making sure that each of them checks as pXX_none() before freeing
>> the entire page table page.
> 
> Yes but that's what's causing the race with a vmalloc trying to create
> such entries.

free_pxx_table() needs to check the floor/ceiling alignment first and
then make sure that the page table page is completely empty before
freeing it.

> 
>>> Another question: could we do the page table and the actual vmemmap
>>> pages freeing in a single pass (sorry if this has been discussed
>>> before)?
>>
>> We could and some initial versions (till V5) of the series had that in fact.
>> Initially Mark Rutland had suggested to do this in two passes. Some extracts
>> from the previous discussion.
>>
>> https://lkml.org/lkml/2019/5/30/1159
>>
>> -----------------------
>> Looking at this some more, I don't think this is quite right, and tI
>> think that structure of the free_*() and remove_*() functions makes this
>> unnecessarily hard to follow. We should aim for this to be obviously
>> correct.
>>
>> The x86 code is the best template to follow here. As mentioned
>> previously, I'm fairly certain it's not entirely correct (e.g. due to
>> missing TLB maintenance), and we've already diverged a fair amount in
>> fixing up obvious issues, so we shouldn't aim to mirror it.
>>
>> I think that the structure of unmap_region() is closer to what we want
>> here -- do one pass to unmap leaf entries (and freeing the associated
>> memory if unmapping the vmemmap), then do a second pass cleaning up any
>> empty tables.
>> ----------------------
>>
>> Apart from the fact that two passes over the page table is cleaner and gives
>> us more granular and modular infrastructure to use for later purposes, it is
>> also a necessity in dealing with vmalloc-vmemmap overlap. free_empty_tables()
>> which is the second pass, can be skipped cleanly when overlap is detected.
> 
> I'm fine with two passes for unmap and pgtable free for the time being
> and if they look fairly similar in a feature version, we can think of
> merging them. But for now, stick with two passes. The unmapping one in
> this patchset I think seems fine (though I haven't looked in detail).

Sure.

> 
> There is also a race with ptdump that I haven't looked into.

The second patch in the series deals with that.
Catalin Marinas Sept. 17, 2019, 3:08 p.m. UTC | #9
On Tue, Sep 17, 2019 at 10:06:11AM +0530, Anshuman Khandual wrote:
> On 09/13/2019 03:39 PM, Catalin Marinas wrote:
> > On Fri, Sep 13, 2019 at 11:28:01AM +0530, Anshuman Khandual wrote:
> >> The problem (race) is not because of the inability to deal with partially
> >> filled table. We can handle that correctly as explained below [1]. The
> >> problem is with inadequate kernel page table locking during vmalloc()
> >> which might be accessing intermediate kernel page table pointers which is
> >> being freed with free_empty_tables() concurrently. Hence we cannot free
> >> any page table page which can ever have entries from vmalloc() range.
> > 
> > The way you deal with the partially filled table in this patch is to
> > avoid freeing if there is a non-empty entry (!p*d_none()). This is what
> > causes the race with vmalloc. If you simply avoid freeing a pmd page,
> > for example, if the range floor/ceiling is not aligned to PUD_SIZE,
> > irrespective of whether the other entries are empty or not, you
> > shouldn't have this problem. You do free the pte page if the range is
[...]
> > We may have some pgtable pages not freed at both ends of the range
> > (maximum 6 in total) but I don't really see this an issue. They could be
> > reused if something else gets mapped in that range.
> 
> I assume that the number 6 for maximum page possibility came from
> 
> (floor edge + ceiling edge) * (PTE table + PMD table + PUD table)

Yes.

> >> Though not completely sure, whether I really understood the suggestion above
> >> with respect to the floor-ceiling mechanism as in free_pgd_range(). Are you
> >> suggesting that we should only attempt to free up those vmemmap range page
> >> table pages which *definitely* could never overlap with vmalloc by working
> >> on a modified (i.e cut down with floor-ceiling while avoiding vmalloc range
> >> at each level) vmemmap range instead ?
> > 
> > You can ignore the overlap check altogether, only free the page tables
> > with floor/ceiling set to the start/size passed to arch_remove_memory()
> > and vmemmap_free().
> 
> Wondering if it will be better to use [VMEMMAP_START - VMEMMAP_END] and
> [PAGE_OFFSET - PAGE_END] as floor/ceiling respectively with vmemmap_free()
> and arch_remove_memory(). Not only it is safe to free all page table pages
> which span over these maximum possible mapping range but also it reduces
> the risk for alignment related wastage.

That's indeed better. You pass the floor/ceiling as the enclosing range
and start/end as the actual range to unmap. We avoid the potential
"leak" around the edges when they fall within the floor/ceiling range (I
think that's close to what free_pgd_range() does).

> >> This can be one restrictive version of the function
> >> free_empty_tables() called in case there is an overlap. So we will
> >> maintain two versions for free_empty_tables(). Please correct me if
> >> any the above assumptions or understanding is wrong.
> > 
> > I'd rather have a single version of free_empty_tables(). As I said
> > above, the only downside is that a partially filled pgtable page would
> > not be freed even though the other entries are empty.
> 
> Sure. Also practically the limitation will be applicable only for vmemmap
> mapping but not for linear mappings where the chances of overlap might be
> negligible as it covers half kernel virtual address space.

If you have a common set of functions, it doesn't hurt to pass the
correct floor/ceiling in both cases.

Patch

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6481964b6425..0079820af308 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -274,6 +274,9 @@  config ZONE_DMA32
 config ARCH_ENABLE_MEMORY_HOTPLUG
 	def_bool y
 
+config ARCH_ENABLE_MEMORY_HOTREMOVE
+	def_bool y
+
 config SMP
 	def_bool y
 
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index b61b50bf68b1..615dcd08acfa 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -54,6 +54,7 @@ 
 #define MODULES_VADDR		(BPF_JIT_REGION_END)
 #define MODULES_VSIZE		(SZ_128M)
 #define VMEMMAP_START		(-VMEMMAP_SIZE - SZ_2M)
+#define VMEMMAP_END		(VMEMMAP_START + VMEMMAP_SIZE)
 #define PCI_IO_END		(VMEMMAP_START - SZ_2M)
 #define PCI_IO_START		(PCI_IO_END - PCI_IO_SIZE)
 #define FIXADDR_TOP		(PCI_IO_START - SZ_2M)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index ee1bf416368d..980d761a7327 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -60,6 +60,14 @@  static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_bss __maybe_unused;
 
 static DEFINE_SPINLOCK(swapper_pgdir_lock);
 
+/*
+ * This represents if vmalloc and vmemmap address range overlap with
+ * each other on an intermediate level kernel page table entry which
+ * in turn helps in deciding whether empty kernel page table pages
+ * if any can be freed during memory hotplug operation.
+ */
+static bool vmalloc_vmemmap_overlap;
+
 void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd)
 {
 	pgd_t *fixmap_pgdp;
@@ -723,6 +731,250 @@  int kern_addr_valid(unsigned long addr)
 
 	return pfn_valid(pte_pfn(pte));
 }
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void free_hotplug_page_range(struct page *page, size_t size)
+{
+	WARN_ON(!page || PageReserved(page));
+	free_pages((unsigned long)page_address(page), get_order(size));
+}
+
+static void free_hotplug_pgtable_page(struct page *page)
+{
+	free_hotplug_page_range(page, PAGE_SIZE);
+}
+
+static void free_pte_table(pmd_t *pmdp, unsigned long addr)
+{
+	struct page *page;
+	pte_t *ptep;
+	int i;
+
+	ptep = pte_offset_kernel(pmdp, 0UL);
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		if (!pte_none(READ_ONCE(ptep[i])))
+			return;
+	}
+
+	page = pmd_page(READ_ONCE(*pmdp));
+	pmd_clear(pmdp);
+	__flush_tlb_kernel_pgtable(addr);
+	free_hotplug_pgtable_page(page);
+}
+
+static void free_pmd_table(pud_t *pudp, unsigned long addr)
+{
+	struct page *page;
+	pmd_t *pmdp;
+	int i;
+
+	if (CONFIG_PGTABLE_LEVELS <= 2)
+		return;
+
+	pmdp = pmd_offset(pudp, 0UL);
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		if (!pmd_none(READ_ONCE(pmdp[i])))
+			return;
+	}
+
+	page = pud_page(READ_ONCE(*pudp));
+	pud_clear(pudp);
+	__flush_tlb_kernel_pgtable(addr);
+	free_hotplug_pgtable_page(page);
+}
+
+static void free_pud_table(pgd_t *pgdp, unsigned long addr)
+{
+	struct page *page;
+	pud_t *pudp;
+	int i;
+
+	if (CONFIG_PGTABLE_LEVELS <= 3)
+		return;
+
+	pudp = pud_offset(pgdp, 0UL);
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		if (!pud_none(READ_ONCE(pudp[i])))
+			return;
+	}
+
+	page = pgd_page(READ_ONCE(*pgdp));
+	pgd_clear(pgdp);
+	__flush_tlb_kernel_pgtable(addr);
+	free_hotplug_pgtable_page(page);
+}
+
+static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
+				    unsigned long end, bool sparse_vmap)
+{
+	struct page *page;
+	pte_t *ptep, pte;
+
+	do {
+		ptep = pte_offset_kernel(pmdp, addr);
+		pte = READ_ONCE(*ptep);
+		if (pte_none(pte))
+			continue;
+
+		WARN_ON(!pte_present(pte));
+		page = sparse_vmap ? pte_page(pte) : NULL;
+		pte_clear(&init_mm, addr, ptep);
+		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+		if (sparse_vmap)
+			free_hotplug_page_range(page, PAGE_SIZE);
+	} while (addr += PAGE_SIZE, addr < end);
+}
+
+static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
+				    unsigned long end, bool sparse_vmap)
+{
+	unsigned long next;
+	struct page *page;
+	pmd_t *pmdp, pmd;
+
+	do {
+		next = pmd_addr_end(addr, end);
+		pmdp = pmd_offset(pudp, addr);
+		pmd = READ_ONCE(*pmdp);
+		if (pmd_none(pmd))
+			continue;
+
+		WARN_ON(!pmd_present(pmd));
+		if (pmd_sect(pmd)) {
+			page = sparse_vmap ? pmd_page(pmd) : NULL;
+			pmd_clear(pmdp);
+			flush_tlb_kernel_range(addr, next);
+			if (sparse_vmap)
+				free_hotplug_page_range(page, PMD_SIZE);
+			continue;
+		}
+		WARN_ON(!pmd_table(pmd));
+		unmap_hotplug_pte_range(pmdp, addr, next, sparse_vmap);
+	} while (addr = next, addr < end);
+}
+
+static void unmap_hotplug_pud_range(pgd_t *pgdp, unsigned long addr,
+				    unsigned long end, bool sparse_vmap)
+{
+	unsigned long next;
+	struct page *page;
+	pud_t *pudp, pud;
+
+	do {
+		next = pud_addr_end(addr, end);
+		pudp = pud_offset(pgdp, addr);
+		pud = READ_ONCE(*pudp);
+		if (pud_none(pud))
+			continue;
+
+		WARN_ON(!pud_present(pud));
+		if (pud_sect(pud)) {
+			page = sparse_vmap ? pud_page(pud) : NULL;
+			pud_clear(pudp);
+			flush_tlb_kernel_range(addr, next);
+			if (sparse_vmap)
+				free_hotplug_page_range(page, PUD_SIZE);
+			continue;
+		}
+		WARN_ON(!pud_table(pud));
+		unmap_hotplug_pmd_range(pudp, addr, next, sparse_vmap);
+	} while (addr = next, addr < end);
+}
+
+static void unmap_hotplug_range(unsigned long addr, unsigned long end,
+				bool sparse_vmap)
+{
+	unsigned long next;
+	pgd_t *pgdp, pgd;
+
+	do {
+		next = pgd_addr_end(addr, end);
+		pgdp = pgd_offset_k(addr);
+		pgd = READ_ONCE(*pgdp);
+		if (pgd_none(pgd))
+			continue;
+
+		WARN_ON(!pgd_present(pgd));
+		unmap_hotplug_pud_range(pgdp, addr, next, sparse_vmap);
+	} while (addr = next, addr < end);
+}
+
+static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
+				 unsigned long end)
+{
+	pte_t *ptep, pte;
+
+	do {
+		ptep = pte_offset_kernel(pmdp, addr);
+		pte = READ_ONCE(*ptep);
+		WARN_ON(!pte_none(pte));
+	} while (addr += PAGE_SIZE, addr < end);
+}
+
+static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
+				 unsigned long end)
+{
+	unsigned long next;
+	pmd_t *pmdp, pmd;
+
+	do {
+		next = pmd_addr_end(addr, end);
+		pmdp = pmd_offset(pudp, addr);
+		pmd = READ_ONCE(*pmdp);
+		if (pmd_none(pmd))
+			continue;
+
+		WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd));
+		free_empty_pte_table(pmdp, addr, next);
+		free_pte_table(pmdp, addr);
+	} while (addr = next, addr < end);
+}
+
+static void free_empty_pud_table(pgd_t *pgdp, unsigned long addr,
+				 unsigned long end)
+{
+	unsigned long next;
+	pud_t *pudp, pud;
+
+	do {
+		next = pud_addr_end(addr, end);
+		pudp = pud_offset(pgdp, addr);
+		pud = READ_ONCE(*pudp);
+		if (pud_none(pud))
+			continue;
+
+		WARN_ON(!pud_present(pud) || !pud_table(pud) || pud_sect(pud));
+		free_empty_pmd_table(pudp, addr, next);
+		free_pmd_table(pudp, addr);
+	} while (addr = next, addr < end);
+}
+
+static void free_empty_tables(unsigned long addr, unsigned long end)
+{
+	unsigned long next;
+	pgd_t *pgdp, pgd;
+
+	do {
+		next = pgd_addr_end(addr, end);
+		pgdp = pgd_offset_k(addr);
+		pgd = READ_ONCE(*pgdp);
+		if (pgd_none(pgd))
+			continue;
+
+		WARN_ON(!pgd_present(pgd));
+		free_empty_pud_table(pgdp, addr, next);
+		free_pud_table(pgdp, addr);
+	} while (addr = next, addr < end);
+}
+
+static void remove_pagetable(unsigned long start, unsigned long end,
+			     bool sparse_vmap)
+{
+	unmap_hotplug_range(start, end, sparse_vmap);
+	free_empty_tables(start, end);
+}
+#endif
+
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 #if !ARM64_SWAPPER_USES_SECTION_MAPS
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
@@ -770,6 +1022,28 @@  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 void vmemmap_free(unsigned long start, unsigned long end,
 		struct vmem_altmap *altmap)
 {
+#ifdef CONFIG_MEMORY_HOTPLUG
+	/*
+	 * FIXME: We should have called remove_pagetable(start, end, true).
+	 * vmemmap and vmalloc virtual range might share intermediate kernel
+	 * page table entries. Removing vmemmap range page table pages here
+	 * can potentially conflict with a concurrent vmalloc() allocation.
+	 *
+	 * This is primarily because vmalloc() does not take init_mm ptl for
+	 * the entire page table walk and it's modification. Instead it just
+	 * takes the lock while allocating and installing page table pages
+	 * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
+	 * entry via memory hot remove can cause vmalloc() kernel page table
+	 * walk pointers to be invalid on the fly which can cause corruption
+	 * or worst, a crash.
+	 *
+	 * So free_empty_tables() gets called where vmalloc and vmemmap range
+	 * do not overlap at any intermediate level kernel page table entry.
+	 */
+	unmap_hotplug_range(start, end, true);
+	if (!vmalloc_vmemmap_overlap)
+		free_empty_tables(start, end);
+#endif
 }
 #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
 
@@ -1048,10 +1322,18 @@  int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
+static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
+{
+	unsigned long end = start + size;
+
+	WARN_ON(pgdir != init_mm.pgd);
+	remove_pagetable(start, end, false);
+}
+
 int arch_add_memory(int nid, u64 start, u64 size,
 			struct mhp_restrictions *restrictions)
 {
-	int flags = 0;
+	int ret, flags = 0;
 
 	if (rodata_full || debug_pagealloc_enabled())
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
@@ -1059,9 +1341,14 @@  int arch_add_memory(int nid, u64 start, u64 size,
 	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
 			     size, PAGE_KERNEL, __pgd_pgtable_alloc, flags);
 
-	return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
+	ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
 			   restrictions);
+	if (ret)
+		__remove_pgd_mapping(swapper_pg_dir,
+				     __phys_to_virt(start), size);
+	return ret;
 }
+
 void arch_remove_memory(int nid, u64 start, u64 size,
 			struct vmem_altmap *altmap)
 {
@@ -1069,14 +1356,47 @@  void arch_remove_memory(int nid, u64 start, u64 size,
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 
-	/*
-	 * FIXME: Cleanup page tables (also in arch_add_memory() in case
-	 * adding fails). Until then, this function should only be used
-	 * during memory hotplug (adding memory), not for memory
-	 * unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be
-	 * unlocked yet.
-	 */
 	zone = page_zone(pfn_to_page(start_pfn));
 	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	__remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size);
+}
+
+static int __init find_vmalloc_vmemmap_overlap(void)
+{
+	unsigned long gap_start, gap_end;
+
+	/*
+	 * vmalloc and vmemmap address ranges should be linearly
+	 * increasing and non-overlapping before the gap between
+	 * them can be measured.
+	 */
+	BUILD_BUG_ON(VMALLOC_END <= VMALLOC_START);
+	BUILD_BUG_ON(VMEMMAP_END <= VMEMMAP_START);
+	BUILD_BUG_ON((VMEMMAP_START >= VMALLOC_START)
+			&& (VMEMMAP_START <= VMALLOC_END));
+	BUILD_BUG_ON((VMEMMAP_END >= VMALLOC_START)
+			&& (VMEMMAP_END <= VMALLOC_END));
+
+	/*
+	 * vmalloc and vmemmap can only be non-overlapping disjoint
+	 * ranges. So [gap_start..gap_end] is determined which will
+	 * represent the gap between the above mentioned ranges.
+	 */
+	if (VMEMMAP_START > VMALLOC_END) {
+		gap_start = VMALLOC_END;
+		gap_end = VMEMMAP_START;
+	} else {
+		gap_start = VMEMMAP_END;
+		gap_end = VMALLOC_START;
+	}
+
+	/*
+	 * Race condition during memory hot-remove exists when the
+	 * gap edges share same PGD entry in kernel page table.
+	 */
+	if ((gap_start & PGDIR_MASK) == (gap_end & PGDIR_MASK))
+		vmalloc_vmemmap_overlap = true;
+	return 0;
 }
+early_initcall(find_vmalloc_vmemmap_overlap);
 #endif