[V9,2/2] arm64/mm: Enable memory hot remove

Message ID 1570609308-15697-3-git-send-email-anshuman.khandual@arm.com
State New
Series
  • arm64/mm: Enable memory hot remove

Commit Message

Anshuman Khandual Oct. 9, 2019, 8:21 a.m. UTC
The arch code for hot-remove must tear down portions of the linear map and
vmemmap corresponding to memory being removed. In both cases the page
tables mapping these regions must be freed, and when sparse vmemmap is in
use the memory backing the vmemmap must also be freed.

This patch adds unmap_hotplug_range() and free_empty_tables() helpers which
can be used to tear down either region, and calls them from vmemmap_free()
and __remove_pgd_mapping(). The free_mapped argument determines whether the
backing memory will be freed.

It makes two distinct passes over the kernel page table. In the first pass,
unmap_hotplug_range() unmaps each mapped leaf entry, invalidates the
applicable TLB entries and frees the backing memory if required (vmemmap).
In the second pass, free_empty_tables() looks for empty page table sections
whose page table page can be unmapped, TLB invalidated and freed.

While freeing intermediate level page table pages, bail out if any of their
entries are still valid. This can happen for a partially filled kernel page
table, either from a previously attempted failed memory hot add or while
removing an address range which does not span the entire page table page
range.
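
As a rough illustration of the two passes and the bail-out rule described
above, here is a user-space toy model. It is only a sketch: the sketch_table
type, the 512-entry size and both function names are assumptions made for
this example and do not come from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ENTRIES 512	/* assumed entries per table page */

/* Toy stand-in for one page table page; 0 means "no mapping". */
struct sketch_table {
	uint64_t entry[SKETCH_ENTRIES];
	bool freed;
};

/* Pass 1: clear every entry with index in [first, last). */
static void sketch_unmap(struct sketch_table *t, size_t first, size_t last)
{
	for (size_t i = first; i < last && i < SKETCH_ENTRIES; i++)
		t->entry[i] = 0;
}

/* Pass 2: free the table page only if no entry is still valid. */
static bool sketch_free_if_empty(struct sketch_table *t)
{
	for (size_t i = 0; i < SKETCH_ENTRIES; i++) {
		if (t->entry[i] != 0)
			return false;	/* still in use: bail out */
	}
	t->freed = true;
	return true;
}
```

Removing only part of the range covered by a table leaves the remaining
entries valid, so the second pass refuses to free the page in that case.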

The vmemmap region may share levels of table with the vmalloc region.
There can be conflicts between hot remove freeing page table pages and a
concurrent vmalloc() walking the kernel page table. This conflict cannot
be solved just by taking the init_mm ptl because of the existing locking
scheme in vmalloc(). So free_empty_tables() implements a floor and ceiling
method, borrowed from the user page table teardown in free_pgd_range(),
which skips freeing a page table page if the intermediate address range is
not aligned or the maximum floor-ceiling range might not own the entire
page table page.
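
For illustration, the floor/ceiling test can be modelled in user space
roughly as below. This is a sketch patterned on the alignment checks in
free_pgd_range(), not necessarily the helper added by this patch; the name
range_owns_table() and the 2 MiB SKETCH_TABLE_SPAN (one PTE table with 4K
pages) are assumptions made for the example.

```c
#include <stdbool.h>

/* Assumed span of one page table page: 2 MiB (PTE table, 4K pages). */
#define SKETCH_TABLE_SPAN 0x200000UL
#define SKETCH_TABLE_MASK (~(SKETCH_TABLE_SPAN - 1))

/*
 * Return true if the table page covering [start, end) lies entirely
 * within [floor, ceiling), so no neighbouring region can share it.
 * A ceiling of 0 means "no upper limit".
 */
static bool range_owns_table(unsigned long start, unsigned long end,
			     unsigned long floor, unsigned long ceiling)
{
	/* Round down to the start of the table's span. */
	start &= SKETCH_TABLE_MASK;
	if (start < floor)
		return false;

	if (ceiling) {
		ceiling &= SKETCH_TABLE_MASK;
		if (!ceiling)
			return false;
	}

	/* The "- 1" lets a ceiling of 0 wrap around to the maximum. */
	if (end - 1 > ceiling - 1)
		return false;

	return true;
}
```

A true result only says the range owns the table page; the caller would
still need to confirm that every entry in it is empty before freeing it.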

While here update arch_add_memory() to handle __add_pages() failures by
just unmapping recently added kernel linear mapping. Now enable memory hot
remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.

This implementation is broadly inspired by the kernel page table teardown
procedure on the x86 architecture and the user page table teardown method.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/Kconfig              |   3 +
 arch/arm64/include/asm/memory.h |   1 +
 arch/arm64/mm/mmu.c             | 273 ++++++++++++++++++++++++++++++--
 3 files changed, 268 insertions(+), 9 deletions(-)

Comments

Catalin Marinas Oct. 10, 2019, 11:34 a.m. UTC | #1
Hi Anshuman,

On Wed, Oct 09, 2019 at 01:51:48PM +0530, Anshuman Khandual wrote:
> +static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
> +				    unsigned long end, bool free_mapped)
> +{
> +	unsigned long next;
> +	pmd_t *pmdp, pmd;
> +
> +	do {
> +		next = pmd_addr_end(addr, end);
> +		pmdp = pmd_offset(pudp, addr);
> +		pmd = READ_ONCE(*pmdp);
> +		if (pmd_none(pmd))
> +			continue;
> +
> +		WARN_ON(!pmd_present(pmd));
> +		if (pmd_sect(pmd)) {
> +			pmd_clear(pmdp);
> +			flush_tlb_kernel_range(addr, next);

The range here could be a whole PMD_SIZE. Since we are invalidating a
single block entry, one TLBI should be sufficient:

			flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
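
To put a number on the difference, here is a back-of-the-envelope sketch.
It assumes 4K pages, a 2 MiB PMD block, and a range flush that issues one
invalidation per page-sized step; the helper name and the SKETCH_* constants
are made up for this example and are not kernel definitions.

```c
#define SKETCH_PAGE_SIZE (1UL << 12)	/* assumed 4K pages */
#define SKETCH_PMD_SIZE  (1UL << 21)	/* assumed 2 MiB PMD block */

/* Number of per-page invalidations needed to cover [start, end). */
static unsigned long sketch_tlbi_ops(unsigned long start, unsigned long end)
{
	return (end - start + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}
```

Under those assumptions, flushing [addr, addr + SKETCH_PMD_SIZE) costs 512
operations, while [addr, addr + SKETCH_PAGE_SIZE) costs one.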

> +			if (free_mapped)
> +				free_hotplug_page_range(pmd_page(pmd),
> +							PMD_SIZE);
> +			continue;
> +		}
> +		WARN_ON(!pmd_table(pmd));
> +		unmap_hotplug_pte_range(pmdp, addr, next, free_mapped);
> +	} while (addr = next, addr < end);
> +}
> +
> +static void unmap_hotplug_pud_range(pgd_t *pgdp, unsigned long addr,
> +				    unsigned long end, bool free_mapped)
> +{
> +	unsigned long next;
> +	pud_t *pudp, pud;
> +
> +	do {
> +		next = pud_addr_end(addr, end);
> +		pudp = pud_offset(pgdp, addr);
> +		pud = READ_ONCE(*pudp);
> +		if (pud_none(pud))
> +			continue;
> +
> +		WARN_ON(!pud_present(pud));
> +		if (pud_sect(pud)) {
> +			pud_clear(pudp);
> +			flush_tlb_kernel_range(addr, next);
			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
			flush_tlb_kernel_range(addr, addr + PAGE_SIZE);

[...]
> +static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
> +				 unsigned long end, unsigned long floor,
> +				 unsigned long ceiling)
> +{
> +	pte_t *ptep, pte;
> +	unsigned long i, start = addr;
> +
> +	do {
> +		ptep = pte_offset_kernel(pmdp, addr);
> +		pte = READ_ONCE(*ptep);
> +		WARN_ON(!pte_none(pte));
> +	} while (addr += PAGE_SIZE, addr < end);

So this loop is just a sanity check (pte clearing having been done by
the unmap loops). That's fine, maybe a comment for future reference.

> +
> +	if (!pgtable_range_aligned(start, end, floor, ceiling, PMD_MASK))
> +		return;
> +
> +	ptep = pte_offset_kernel(pmdp, 0UL);
> +	for (i = 0; i < PTRS_PER_PTE; i++) {
> +		if (!pte_none(READ_ONCE(ptep[i])))
> +			return;
> +	}

We could do with a comment for this loop along the lines of:

	Check whether we can free the pte page if the rest of the
	entries are empty. Overlap with other regions have been handled
	by the floor/ceiling check.

Apart from the comments above, the rest of the patch looks fine. Once
fixed:

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

Mark Rutland mentioned at some point that, as a preparatory patch to
this series, we'd need to make sure we don't hot-remove memory already
given to the kernel at boot. Any plans here?

Thanks.
Anshuman Khandual Oct. 11, 2019, 2:56 a.m. UTC | #2
On 10/10/2019 05:04 PM, Catalin Marinas wrote:
> Hi Anshuman,
> 
> On Wed, Oct 09, 2019 at 01:51:48PM +0530, Anshuman Khandual wrote:
>> +static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
>> +				    unsigned long end, bool free_mapped)
>> +{
>> +	unsigned long next;
>> +	pmd_t *pmdp, pmd;
>> +
>> +	do {
>> +		next = pmd_addr_end(addr, end);
>> +		pmdp = pmd_offset(pudp, addr);
>> +		pmd = READ_ONCE(*pmdp);
>> +		if (pmd_none(pmd))
>> +			continue;
>> +
>> +		WARN_ON(!pmd_present(pmd));
>> +		if (pmd_sect(pmd)) {
>> +			pmd_clear(pmdp);
>> +			flush_tlb_kernel_range(addr, next);
> 
> The range here could be a whole PMD_SIZE. Since we are invalidating a
> single block entry, one TLBI should be sufficient:
> 
> 			flush_tlb_kernel_range(addr, addr + PAGE_SIZE);

Sure, will change.

> 
>> +			if (free_mapped)
>> +				free_hotplug_page_range(pmd_page(pmd),
>> +							PMD_SIZE);
>> +			continue;
>> +		}
>> +		WARN_ON(!pmd_table(pmd));
>> +		unmap_hotplug_pte_range(pmdp, addr, next, free_mapped);
>> +	} while (addr = next, addr < end);
>> +}
>> +
>> +static void unmap_hotplug_pud_range(pgd_t *pgdp, unsigned long addr,
>> +				    unsigned long end, bool free_mapped)
>> +{
>> +	unsigned long next;
>> +	pud_t *pudp, pud;
>> +
>> +	do {
>> +		next = pud_addr_end(addr, end);
>> +		pudp = pud_offset(pgdp, addr);
>> +		pud = READ_ONCE(*pudp);
>> +		if (pud_none(pud))
>> +			continue;
>> +
>> +		WARN_ON(!pud_present(pud));
>> +		if (pud_sect(pud)) {
>> +			pud_clear(pudp);
>> +			flush_tlb_kernel_range(addr, next);
> 			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 			flush_tlb_kernel_range(addr, addr + PAGE_SIZE);

Will change.

> 
> [...]
>> +static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
>> +				 unsigned long end, unsigned long floor,
>> +				 unsigned long ceiling)
>> +{
>> +	pte_t *ptep, pte;
>> +	unsigned long i, start = addr;
>> +
>> +	do {
>> +		ptep = pte_offset_kernel(pmdp, addr);
>> +		pte = READ_ONCE(*ptep);
>> +		WARN_ON(!pte_none(pte));
>> +	} while (addr += PAGE_SIZE, addr < end);
> 
> So this loop is just a sanity check (pte clearing having been done by
> the unmap loops). That's fine, maybe a comment for future reference.

Sure, will add.
> 
>> +
>> +	if (!pgtable_range_aligned(start, end, floor, ceiling, PMD_MASK))
>> +		return;
>> +
>> +	ptep = pte_offset_kernel(pmdp, 0UL);
>> +	for (i = 0; i < PTRS_PER_PTE; i++) {
>> +		if (!pte_none(READ_ONCE(ptep[i])))
>> +			return;
>> +	}
> 
> We could do with a comment for this loop along the lines of:
> 
> 	Check whether we can free the pte page if the rest of the
> 	entries are empty. Overlap with other regions have been handled
> 	by the floor/ceiling check.

Sure, will add.

> 
> Apart from the comments above, the rest of the patch looks fine. Once
> fixed:
> 
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> 
> Mark Rutland mentioned at some point that, as a preparatory patch to
> this series, we'd need to make sure we don't hot-remove memory already
> given to the kernel at boot. Any plans here?

Hmm, this series just enables platform memory hot remove as required by the
generic memory hotplug framework. The path here is triggered either from
remove_memory() or __remove_memory(), which take physical memory range
arguments like (nid, start, size) and do the rest. arch_remove_memory()
should never be required to test the given memory range for anything,
including being part of the boot memory.

IIUC boot memory added to the system with memblock_add() loses all its
identity after the system is up and running. In order to reject any attempt
to hot remove boot memory, the platform needs to remember all the memory that
came early in the boot and then scan through it during arch_remove_memory().

Ideally, it is the responsibility of [_]remove_memory() callers like ACPI
driver, DAX etc to make sure they never attempt to hot remove a memory
range, which never got hot added by them in the first place. Also, unlike
/sys/devices/system/memory/probe there is no 'unprobe' interface where the
user can just trigger boot memory removal. Hence, unless there is a bug in
ACPI, DAX or other callers, there should never be any attempt to hot remove
boot memory in the first place.

> 
> Thanks.
>
Catalin Marinas Oct. 18, 2019, 9:48 a.m. UTC | #3
On Fri, Oct 11, 2019 at 08:26:32AM +0530, Anshuman Khandual wrote:
> On 10/10/2019 05:04 PM, Catalin Marinas wrote:
> > Mark Rutland mentioned at some point that, as a preparatory patch to
> > this series, we'd need to make sure we don't hot-remove memory already
> > given to the kernel at boot. Any plans here?
> 
> Hmm, this series just enables platform memory hot remove as required from
> generic memory hotplug framework. The path here is triggered either from
> remove_memory() or __remove_memory() which takes physical memory range
> arguments like (nid, start, size) and do the needful. arch_remove_memory()
> should never be required to test given memory range for anything including
> being part of the boot memory.

Assuming arch_remove_memory() doesn't (cannot) check, is there a risk on
arm64 that, for example, one removes memory available at boot and then
kexecs a new kernel? Does the kexec tool present the new kernel with the
original memory map?

I can see x86 has CONFIG_FIRMWARE_MEMMAP suggesting that it is used by
kexec. try_remove_memory() calls firmware_map_remove() so maybe they
solve this problem differently.

Correspondingly, after an arch_add_memory(), do we want a kexec kernel
to access it? x86 seems to use the firmware_map_add_hotplug() mechanism.

Adding James as well for additional comments on kexec scenarios.

> IIUC boot memory added to system with memblock_add() lose all it's identity
> after the system is up and running. In order to reject any attempt to hot
> remove boot memory, platform needs to remember all those memory that came
> early in the boot and then scan through it during arch_remove_memory().
> 
> Ideally, it is the responsibility of [_]remove_memory() callers like ACPI
> driver, DAX etc to make sure they never attempt to hot remove a memory
> range, which never got hot added by them in the first place. Also, unlike
> /sys/devices/system/memory/probe there is no 'unprobe' interface where the
> user can just trigger boot memory removal. Hence, unless there is a bug in
> ACPI, DAX or other callers, there should never be any attempt to hot remove
> boot memory in the first place.

That's fine if these callers give such guarantees. I just want to make
sure someone checked all the possible scenarios for memory hot-remove.
Anshuman Khandual Oct. 21, 2019, 9:53 a.m. UTC | #4
On 10/18/2019 03:18 PM, Catalin Marinas wrote:
> On Fri, Oct 11, 2019 at 08:26:32AM +0530, Anshuman Khandual wrote:
>> On 10/10/2019 05:04 PM, Catalin Marinas wrote:
>>> Mark Rutland mentioned at some point that, as a preparatory patch to
>>> this series, we'd need to make sure we don't hot-remove memory already
>>> given to the kernel at boot. Any plans here?
>>
>> Hmm, this series just enables platform memory hot remove as required from
>> generic memory hotplug framework. The path here is triggered either from
>> remove_memory() or __remove_memory() which takes physical memory range
>> arguments like (nid, start, size) and do the needful. arch_remove_memory()
>> should never be required to test given memory range for anything including
>> being part of the boot memory.
> 
> Assuming arch_remove_memory() doesn't (cannot) check, is there a risk on

The platform can definitely enumerate boot memory ranges. But checking for it
in arch_remove_memory(), which deals with the actual procedural details, might
not be ideal IMHO. Refusing a requested removal attempt should have been done
further up in the call chain. This will require making generic hotplug reject
any removal request which falls within enumerated boot memory. AFAICS there is
currently no generic way to remember which memory came as part of the boot
process. Probably a new MEMBLOCK flag will do.
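
To make the flag idea concrete, here is a user-space sketch of how a
per-region "boot" flag could be used to refuse a removal request. Everything
in it is hypothetical: SKETCH_FLAG_BOOT, struct sketch_region and both
helpers are invented for this example and are not existing memblock
definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_FLAG_BOOT 0x1u	/* hypothetical "present at boot" flag */

/* Simplified stand-in for a memblock-style region. */
struct sketch_region {
	uint64_t base;
	uint64_t size;
	unsigned int flags;
};

/* True if [start, start + size) intersects the region. */
static bool sketch_overlaps(const struct sketch_region *r,
			    uint64_t start, uint64_t size)
{
	return start < r->base + r->size && r->base < start + size;
}

/* Refuse removal if the range touches any region flagged as boot memory. */
static bool sketch_can_remove(const struct sketch_region *regions, size_t n,
			      uint64_t start, uint64_t size)
{
	for (size_t i = 0; i < n; i++) {
		if ((regions[i].flags & SKETCH_FLAG_BOOT) &&
		    sketch_overlaps(&regions[i], start, size))
			return false;
	}
	return true;
}
```

A check of this shape would sit in the generic hotplug path, before
arch_remove_memory() is ever reached.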

> arm64 that, for example, one removes memory available at boot and then
> kexecs a new kernel? Does the kexec tool present the new kernel with the
> original memory map?

I don't know, probably James can help here. But as I had mentioned earlier,
the callers of remove_memory() should be able to control that. ACPI should
definitely be aware of which ranges were part of boot memory and refrain
from removing any subset, if the platform is known to have problems with
any subsequent kexec operation because of the way the boot memory map gets
used.

Though I am not very familiar with kexec internals, it should inherit the
memory state at a given point in time, accommodating all previous memory hot
add and remove operations. As an example cloud environment scenario, memory
resources might have increased or decreased during a guest's lifetime, so
when the guest needs a new OS image, why should it not have all the
memory? I don't know if it's feasible for the guest to expect previous hot
add or remove operations to be replayed after the kexec.

There is another fundamental question here. Is there a notion of a minimum
subset of boot memory which cannot be hot removed no matter what? If yes,
how is that conveyed to the kernel currently?

The point is that all these need to be established between ACPI, EFI and
kernel. AFAICS this problem is for MM subsystem (including the platform
part of it) to solve instead.

> 
> I can see x86 has CONFIG_FIRMWARE_MEMMAP suggesting that it is used by
> kexec. try_remove_memory() calls firmware_map_remove() so maybe they
> solve this problem differently.
> 
> Correspondingly, after an arch_add_memory(), do we want a kexec kernel
> to access it? x86 seems to use the firmware_map_add_hotplug() mechanism.

Hmm, kexec could use it instead on arm64 as well ?

> 
> Adding James as well for additional comments on kexec scenarios.
> 
>> IIUC boot memory added to system with memblock_add() lose all it's identity
>> after the system is up and running. In order to reject any attempt to hot
>> remove boot memory, platform needs to remember all those memory that came
>> early in the boot and then scan through it during arch_remove_memory().
>>
>> Ideally, it is the responsibility of [_]remove_memory() callers like ACPI
>> driver, DAX etc to make sure they never attempt to hot remove a memory
>> range, which never got hot added by them in the first place. Also, unlike
>> /sys/devices/system/memory/probe there is no 'unprobe' interface where the
>> user can just trigger boot memory removal. Hence, unless there is a bug in
>> ACPI, DAX or other callers, there should never be any attempt to hot remove
>> boot memory in the first place.
> 
> That's fine if these callers give such guarantees. I just want to make
> sure someone checked all the possible scenarios for memory hot-remove.

remove_memory() is a destructive call but without any user interface. So that
leaves only callers in the kernel, which definitely need to know what exactly
they intend to do. I don't see how this is any different from numerous other
interfaces which can just mess up the memory subsystem if not used appropriately.

There is another reason why the boot memory will be prevented from being hot
removed. Generally (unless marked as hotpluggable in the SRAT table) boot memory
will never become ZONE_MOVABLE, so it cannot be isolated and migrated, making
it impossible to hot remove.

Just wanted to add one thing about MEMBLOCK_HOTPLUG regions which might have
come during boot after parsing ACPI SRAT table's ACPI_SRAT_MEM_HOT_PLUGGABLE.
Corresponding memblock regions are marked with MEMBLOCK_HOTPLUG till buddy
allocator has been initialized. These flags get cleared entirely on the system
during memblock_free_all() and those areas eventually become ZONE_MOVABLE.

Even though those ZONE_MOVABLE memory block devices can be hot removed after
being isolated and offlined first, a remove_memory() caller is required to
trigger actual hot removal. AFAICS apart from ACPI or another firmware driver,
there won't be any other remove_memory() caller which will attempt to remove
boot memory.

Going forward, in case we want to support hot-remove of hot pluggable
regions present at boot (i.e. MEMBLOCK_HOTPLUG from SRAT), we will have to
re-introduce the reserved page freeing sequence in free_hotplug_page_range()
which was dropped back in V3 per Mark. The current implementation does a
WARN_ON() in such cases because it should never happen.

https://lkml.org/lkml/2019/4/17/782

ZONE_DEVICE callers of arch_add_memory() and arch_remove_memory() are
straightforward (memremap_pages and memunmap_pages), where the address range is
contained in 'struct dev_pagemap', reducing the chances of an error which could
hot-remove boot memory.
Anshuman Khandual Oct. 21, 2019, 9:55 a.m. UTC | #5
On 10/21/2019 03:23 PM, Anshuman Khandual wrote:
> 
> On 10/18/2019 03:18 PM, Catalin Marinas wrote:
>> On Fri, Oct 11, 2019 at 08:26:32AM +0530, Anshuman Khandual wrote:
>>> On 10/10/2019 05:04 PM, Catalin Marinas wrote:
>>>> Mark Rutland mentioned at some point that, as a preparatory patch to
>>>> this series, we'd need to make sure we don't hot-remove memory already
>>>> given to the kernel at boot. Any plans here?
>>> Hmm, this series just enables platform memory hot remove as required from
>>> generic memory hotplug framework. The path here is triggered either from
>>> remove_memory() or __remove_memory() which takes physical memory range
>>> arguments like (nid, start, size) and do the needful. arch_remove_memory()
>>> should never be required to test given memory range for anything including
>>> being part of the boot memory.
>> Assuming arch_remove_memory() doesn't (cannot) check, is there a risk on
> Platform can definitely enumerate boot memory ranges. But checking on it in
> arch_remove_memory() which deals with actual procedural details might not be
> ideal IMHO. Refusing a requested removal attempt should have been done up in
> the call chain. This will require making generic hot plug reject any removal
> request which falls within enumerated boot memory. IFAICS currently there is
> no generic way to remember which memory came as part of the boot process.
> Probably be a new MEMBLOCK flag will do.
> 
>> arm64 that, for example, one removes memory available at boot and then
>> kexecs a new kernel? Does the kexec tool present the new kernel with the
>> original memory map?
> I dont know, probably James can help here. But as I had mentioned earlier,
> the callers of remove_memory() should be able to control that. ACPI should
> definitely be aware about which ranges were part of boot memory and refrain
> from removing any subset, if the platform is known to have problems with
> any subsequent kexec operation because the way boot memory map get used.
> 
> Though I am not much aware about kexec internals, it should inherit the
> memory state at given point in time accommodating all previous memory hot
> and remove operations. As an example cloud environment scenario, memory
> resources might have increased or decreased during a guest lifetime, so
> when the guest needs to have new OS image why should not it have all the
> memory ? I dont know if it's feasible for the guest to expect previous hot
> add or remove operations to be played again after the kexec.
> 
> There is another fundamental question here. Is there a notion of a minimum
> subset of boot memory which cannot be hot removed no matter what ? If yes,
> how that is being conveyed to the kernel currently ?
> 
> The point is that all these need to be established between ACPI, EFI and
> kernel. AFAICS this problem is for MM subsystem (including the platform

s/is for/is not for/          ^^^^^^^^^^
James Morse Oct. 25, 2019, 5:09 p.m. UTC | #6
Hi guys,

On 21/10/2019 10:53, Anshuman Khandual wrote:
> On 10/18/2019 03:18 PM, Catalin Marinas wrote:
>> On Fri, Oct 11, 2019 at 08:26:32AM +0530, Anshuman Khandual wrote:
>>> On 10/10/2019 05:04 PM, Catalin Marinas wrote:
>>>> Mark Rutland mentioned at some point that, as a preparatory patch to
>>>> this series, we'd need to make sure we don't hot-remove memory already
>>>> given to the kernel at boot. Any plans here?
>>>
>>> Hmm, this series just enables platform memory hot remove as required from
>>> generic memory hotplug framework. The path here is triggered either from
>>> remove_memory() or __remove_memory() which takes physical memory range
>>> arguments like (nid, start, size) and do the needful. arch_remove_memory()
>>> should never be required to test given memory range for anything including
>>> being part of the boot memory.
>>
>> Assuming arch_remove_memory() doesn't (cannot) check, is there a risk on
> 
> Platform can definitely enumerate boot memory ranges. But checking on it in
> arch_remove_memory() which deals with actual procedural details might not be
> ideal IMHO. Refusing a requested removal attempt should have been done up in
> the call chain. This will require making generic hot plug reject any removal
> request which falls within enumerated boot memory. IFAICS currently there is
> no generic way to remember which memory came as part of the boot process.
> Probably be a new MEMBLOCK flag will do.

Memblock flags are fun because they have to be provided to the walkers like
for_each_mem_range().

Unless hot remove is a hot path, it should be enough to check against the UEFI memory map
or DT memory node. (we already have helpers to query the attributes from the memory map at
runtime, so it is still available).


>> arm64 that, for example, one removes memory available at boot and then
>> kexecs a new kernel? Does the kexec tool present the new kernel with the
>> original memory map?
> I dont know, probably James can help here. But as I had mentioned earlier,
> the callers of remove_memory() should be able to control that. ACPI should
> definitely be aware about which ranges were part of boot memory and refrain
> from removing any subset, if the platform is known to have problems with
> any subsequent kexec operation because the way boot memory map get used.
> 
> Though I am not much aware about kexec internals, it should inherit the
> memory state at given point in time

It does, but t = first-boot


> accommodating all previous memory hot and remove operations.

This would imply we rewrite the tables we get from firmware as the facts about the
platform change ... that way madness lies!

ACPI doesn't describe memory, the UEFI memory map does. You may be using the UEFI memory
map on either a DT or ACPI system. If you don't have UEFI, you're using the DT memory-node.

Linux passes on exactly what it had at boot through kexec. We don't rewrite the tables.
Memory is either described in DT, or in the UEFI memory map that was left in memory by the
EFI stub. Linux remembers where the UEFI memory map is through kexec using the additional
entries in the DT chosen node that were put there by the EFI stub.


The bootloader (including the EFI stub) needs to know what memory is removable. Certain
allocations can't move once they have been made:
 * The kernel's randomised physical address should not be in removable memory. With UEFI,
   the EFI stub does this.
 * Firmware structures like the DT or ACPI tables should not be in removable memory.
   Neither should reservations for runtime use, like the RAS CPER regions, or the UEFI
   runtime services.
 * The EFI stub should not allocate the authoritative copy of the memory map in removable
   memory. (we have runtime helpers to lookup the attributes. we pass the boot-time memory
   map to the next OS via kexec).
 * During paging_init() we allocate memory for swapper_pg_dir. This isn't something we can
   easily move around.

It's not just software!:
 * The GIC ITS property/pending (?) tables should not be in removable memory.


The simplest thing to do here is decree that all memory present at boot, is non-removable.
Firmware may need to trim the memory available to UEFI to the minimum needed to boot the
system, we can hot-add the rest of it once we're up and running.


> As an example cloud environment scenario, memory
> resources might have increased or decreased during a guest lifetime, so
> when the guest needs to have new OS image why should not it have all the
> memory ? I dont know if it's feasible for the guest to expect previous hot
> add or remove operations to be played again after the kexec.

Firmware can't know that we kexec'd, so it can't replay the operations.

I think we need a way of determining whether a particular block of removable memory is
present or not. If we do this during boot, then kexec works in the same way as a normal boot.


> There is another fundamental question here. Is there a notion of a minimum
> subset of boot memory which cannot be hot removed no matter what ? If yes,
> how that is being conveyed to the kernel currently ?

Yes. The UEFI memory map.

See drivers/firmware/efi/libstub/fdt.c::exit_boot_func()
The EFI stub calls efi_get_virtmap() to get the running memory map, then stores it in the DT
with update_fdt_memmap().

The memory described at this stage may not be removed as allocations from the EFI stub
can't be moved. The biggest of these, is the kernel, which relocates itself to a random
physical address during the EFI stub.

See drivers/firmware/efi/libstub/arm64-stub.c::handle_kernel_image()
The memcpy() is at the end.


> The point is that all these need to be established between ACPI, EFI and
> kernel. AFAICS this problem is for MM subsystem (including the platform
> part of it) to solve instead.

>> I can see x86 has CONFIG_FIRMWARE_MEMMAP suggesting that it is used by
>> kexec. try_remove_memory() calls firmware_map_remove() so maybe they
>> solve this problem differently.
>>
>> Correspondingly, after an arch_add_memory(), do we want a kexec kernel
>> to access it? x86 seems to use the firmware_map_add_hotplug() mechanism.
> 
> Hmm, kexec could use it instead on arm64 as well ?

Mmm, a linux specific description of the platform that we have to keep over kexec.

How do we describe this if we kexec something that isn't linux? How do we tell a version
of linux that doesn't support hotplug not to overwrite it?

It would be better if we had something in ACPI to tell us at runtime whether a hot
pluggable range of memory was populated.

(I haven't looked to see whether ACPI can already do this)



Thanks,

James
Anshuman Khandual Oct. 28, 2019, 8:25 a.m. UTC | #7
On 10/25/2019 10:39 PM, James Morse wrote:
> Hi guys,
> 
> On 21/10/2019 10:53, Anshuman Khandual wrote:
>> On 10/18/2019 03:18 PM, Catalin Marinas wrote:
>>> On Fri, Oct 11, 2019 at 08:26:32AM +0530, Anshuman Khandual wrote:
>>>> On 10/10/2019 05:04 PM, Catalin Marinas wrote:
>>>>> Mark Rutland mentioned at some point that, as a preparatory patch to
>>>>> this series, we'd need to make sure we don't hot-remove memory already
>>>>> given to the kernel at boot. Any plans here?
>>>>
>>>> Hmm, this series just enables platform memory hot remove as required from
>>>> generic memory hotplug framework. The path here is triggered either from
>>>> remove_memory() or __remove_memory() which takes physical memory range
>>>> arguments like (nid, start, size) and do the needful. arch_remove_memory()
>>>> should never be required to test given memory range for anything including
>>>> being part of the boot memory.
>>>
>>> Assuming arch_remove_memory() doesn't (cannot) check, is there a risk on
>>
>> Platform can definitely enumerate boot memory ranges. But checking on it in
>> arch_remove_memory() which deals with actual procedural details might not be
>> ideal IMHO. Refusing a requested removal attempt should have been done up in
>> the call chain. This will require making generic hot plug reject any removal
>> request which falls within enumerated boot memory. IFAICS currently there is
>> no generic way to remember which memory came as part of the boot process.
>> Probably be a new MEMBLOCK flag will do.
> 
> Memblock flags are fun because they have to be provided to the walkers like
> for_each_mem_range().

Yes, it will require some code changes but nevertheless, it can properly track
early boot time added memory and differentiate it from runtime added memory. I
am not saying we will have to go in this direction but it will be one of the
viable generic ways to enumerate boot memory. IIUC the other existing method is
through firmware memory map.

> 
> Unless hot remove is a hot path, it should be enough to check against the UEFI memory map
> or DT memory node. (we already have helpers to query the attributes from the memory map at
> runtime, so it is still available).

The only problem is that, unlike memblock or the firmware memory map, it does
not get updated after each memory hot add or remove operation.

> 
> 
>>> arm64 that, for example, one removes memory available at boot and then
>>> kexecs a new kernel? Does the kexec tool present the new kernel with the
>>> original memory map?
>> I dont know, probably James can help here. But as I had mentioned earlier,
>> the callers of remove_memory() should be able to control that. ACPI should
>> definitely be aware about which ranges were part of boot memory and refrain
>> from removing any subset, if the platform is known to have problems with
>> any subsequent kexec operation because the way boot memory map get used.
>>
>> Though I am not much aware about kexec internals, it should inherit the
>> memory state at given point in time
> 
> It does, but t = first-boot
> 
> 
>> accommodating all previous memory hot and remove operations.
> 
> This would imply we rewrite the tables we get from firmware as the facts about the
> platform change ... that way madness lies!

OR the firmware itself can rewrite its own table in memory after performing
memory hotplug operations. But let's take a step back. The basic question here
would be "What should the new kexec kernel get in terms of memory resources".

There can be two options

1. Memory state as available at runtime

	- Seems logical unless kexec kernel has to be associated with boot resources
	- Requires tracking the changes some where, either in kernel or firmware
	- If kernel has to track this in a generic way, there are some options 

		- memblock lists all accessible memory but does not identify boot memory
		- firmware memmap

	- If kexec needs resource enumeration from firmware table, then it requires re-write

		- Either in memory hotplug path as with existing firmware memory map
		- OR firmware updates the table itself after driving memory hotplug operation

2. Memory state as available at boot

	- If boot memory has been removed

		- Platform must guarantee it is still accessible to the next kexec kernel
		- This is highly unlikely because

			- DIMM might be taken out e.g. resource trimming, error handling
			- DIMM might be reassigned to other guests in virtualization

	- If new memory added

		- Not problematic, new kexec kernel will not use this additional memory

			- Just resource wastage from platform perspective

>
> ACPI doesn't describe memory, the UEFI memory map does. You may be using the UEFI memory
> map on either a DT or ACPI system. If you don't have UEFI, you're using the DT memory-node.

Is there any reason why firmware cannot update these tables (UEFI memory map or DT memory
node) after driving a memory hotplug operation?

> 
> Linux passes on exactly what it had at boot through kexec. We don't rewrite the tables.
> Memory is either described in DT, or in the UEFI memory map that was left in memory by the
> EFI stub. Linux remembers where the UEFI memory map is through kexec using the additional
> entries in the DT chosen node that were put there by the EFI stub.

Okay, but again the point is, if hot-plug is driven by firmware, why can't it just update
the resource table? What is the rationale for having values in there which do not correctly
represent the entire addressable memory range anymore?

> 
> 
> The bootloader (including the EFI stub) needs to know what memory is removable. Certain
> allocations can't move once they have been made:
>  * The kernel's randomised physical address should not be in removable memory. With UEFI,
>    the EFI stub does this.
>  * Firmware structures like the DT or ACPI tables should not be in removable memory.
>    Neither should reservations for runtime use, like the RAS CPER regions, or the UEFI
>    runtime services.
>  * The EFI stub should not allocate the authoritative copy of the memory map in removable
>    memory. (we have runtime helpers to lookup the attributes. we pass the boot-time memory
>    map to the next OS via kexec).
>  * During paging_init() we allocate memory for swapper_pg_dir. This isn't something we can
> easily move around.
> 
> Its not just software!:
>  * The GIC ITS property/pending (?) tables should not be in removable memory.

All these are kernel allocations and they are always protected, either by being in
non-movable zones or by having PG_reserved set. AFAICS these pages cannot be isolated,
migrated or removed. So the protection comes from the kernel because of their zone
classification or allocation method (memblock_reserve() etc).

> 
> 
> The simplest thing to do here is decree that all memory present at boot, is non-removable.
> Firmware may need to trim the memory available to UEFI to the minimum needed to boot the
> system, we can hot-add the rest of it once we're up and running.

That will be a more fundamental change in the way memory is handled during boot. AFAICS,
we don't have to go in that direction.

> 
> 
>> As an example cloud environment scenario, memory
>> resources might have increased or decreased during a guest lifetime, so
>> when the guest needs to have new OS image why should not it have all the
>> memory ? I dont know if it's feasible for the guest to expect previous hot
>> add or remove operations to be played again after the kexec.
> 
> Firmware can't know that we kexec'd, so it can't replay the operations.

Okay.

> 
> I think we need a way of determining whether a particular block of removable memory is
> present or not. If we do this during boot, then kexec works in the same way as a normal boot.
> 
> 
>> There is another fundamental question here. Is there a notion of a minimum
>> subset of boot memory which cannot be hot removed no matter what ? If yes,
>> how that is being conveyed to the kernel currently ?
> 
> Yes. The UEFI memory map.
> 
> See drivers/firmware/efi/libstub/fdt.c::exit_boot_func()
> the EFI stub calls efi_get_virtmap() to get the running memory map, then stores in the DT
> with update_fdt_memmap().
> 
> The memory described at this stage may not be removed as allocations from the EFI stub
> can't be moved. The biggest of these, is the kernel, which relocates itself to a random
> physical address during the EFI stub.
> 
> See drivers/firmware/efi/libstub/arm64-stub.c::handle_kernel_image()
> The memcpy() is at the end.

But all these should be protected because of their current allocation and usage
(i.e. ZONE_NORMAL, PG_reserved etc).

> 
> 
>> The point is that all these need to be established between ACPI, EFI and
>> kernel. AFAICS this problem is for MM subsystem (including the platform
>> part of it) to solve instead.
> 
>>> I can see x86 has CONFIG_FIRMWARE_MEMMAP suggesting that it is used by
>>> kexec. try_remove_memory() calls firmware_map_remove() so maybe they
>>> solve this problem differently.
>>>
>>> Correspondingly, after an arch_add_memory(), do we want a kexec kernel
>>> to access it? x86 seems to use the firmware_map_add_hotplug() mechanism.
>>
>> Hmm, kexec could use it instead on arm64 as well ?
> 
> Mmm, a linux specific description of the platform that we have to keep over kexec.

As we mentioned before, if kexec needs to inherit runtime memory resources
instead of boot time ones, then changed memory resources will have to be tracked
either in Linux or in firmware, which kexec can refer to.

> 
> How do we describe this if we kexec something that isn't linux? How do we tell a version

The kexec tool reads the memory resource enumeration from the first running
kernel and presents that to the new kexec kernel. It can read whatever format it
wants (UEFI memory map, DT, memblock, firmware memory map) but presents it in a
way which the kexec kernel can use (UEFI memory map, DT). Both formats need not
be the same.

> of linux that doesn't support hotplug not to overwrite it?

If the running kernel does not support hotplug, nothing updates the memory map.
It remains unchanged from boot. If the kexec kernel does not support hotplug, it
is irrelevant because it will just consume memory as presented by the first
running kernel, which remains the same throughout its runtime.

> 
> It would be better if we had something in ACPI to tell us at runtime whether a hot
> pluggable range of memory was populated.
> 
> (I haven't looked to see whether ACPI can already do this)

There is a mechanism in ACPI for this, i.e. ACPI_SRAT_MEM_HOT_PLUGGABLE.

Let's re-evaluate the situation here from scratch. Memory can be classified as
boot and runtime because it impacts the way in which kernel allocations and zone
initializations are treated. Boot memory deals with kernel allocations before
zone init happens, whereas runtime memory might choose which zone to get into
right away.

(1) Boot memory

	- Non-movable

		- Normal memblocks
		- All kernel allocations come here
		- Become ZONE_NORMAL/DMA/DMA32 at runtime

			- Never removable because isolation and hence migration impossible
			- Removal protection because of the zone classification

	- Movable	(ACPI_SRAT_MEM_HOT_PLUGGABLE)

		- Memblock will be marked with MEMBLOCK_HOTPLUG
		- Memblock allocations are avoided where possible (reversing the memblock order etc)
		- Become ZONE_MOVABLE at runtime

			- Removable  [1]

(2) Runtime memory


	- Removable
		- Can become ZONE_NORMAL

			- Never removable because isolation and hence migration impossible
			- Removal protection because of the zone classification

		- Can become ZONE_MOVABLE

			- Removable [2]


We don't have to worry about non-movable boot memory as it is protected
against removal because of its zone and subsequent usage. Hence even if
the firmware attempts to remove it (which it should not), it cannot be removed.
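
To make the classification above concrete, here is a minimal userspace sketch
(not kernel code). All names in it (mem_origin, mem_zone, mem_range,
range_removable(), range_needs_kexec_tracking()) are hypothetical and only model
the reasoning: removability follows from the zone, and the only case that
affects kexec is removable boot memory, marked [1] in the list above. The
non-movable zones (NORMAL/DMA/DMA32) are lumped together as one value.

```c
#include <stdbool.h>

/* Hypothetical userspace model of the classification; not kernel code. */
enum mem_origin { MEM_BOOT, MEM_RUNTIME };

/* MEM_ZONE_NORMAL stands in for all non-movable zones (NORMAL/DMA/DMA32). */
enum mem_zone { MEM_ZONE_NORMAL, MEM_ZONE_MOVABLE };

struct mem_range {
	enum mem_origin origin;	/* present at boot or hot added later */
	enum mem_zone zone;	/* zone the range ended up in */
};

/* Only movable memory can be isolated and migrated, hence removed. */
bool range_removable(const struct mem_range *r)
{
	return r->zone == MEM_ZONE_MOVABLE;
}

/* Removable boot memory is the one case that matters for kexec. */
bool range_needs_kexec_tracking(const struct mem_range *r)
{
	return r->origin == MEM_BOOT && range_removable(r);
}
```

With this model, a boot range in ZONE_MOVABLE is both removable and relevant to
kexec, while a runtime range in ZONE_MOVABLE is removable without any kexec
concern.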

[1] Remove hotpluggable boot memory

Firmware should only attempt to remove memory which was tagged as hotpluggable
(ACPI_SRAT_MEM_HOT_PLUGGABLE) in the SRAT table. The entire discussion here is
about how to handle only this particular situation, though DT based systems
might also have similar concerns if they support hotpluggable boot memory.

a) Platform decides not to support removal of hotpluggable memory

The platform can have a policy either not to provide hotpluggable memory regions
or not to attempt removing them even if provided. This is purely a firmware-centric
platform solution that requires no changes to the current memory hot-remove code.

b) Platform decides to support removal of hotpluggable memory

As mentioned before, free_hotplug_page_range() will have to handle PG_reserved
pages while freeing. But kexec also needs to understand that part of the boot
memory is inaccessible now and should not be passed to the next kernel. This will
require tracking boot memory changes in the UEFI memory map, DT nodes, memblock
or the firmware memory map, updated either by the kernel or by firmware itself.
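
As an illustration of what such tracking amounts to, here is a minimal
userspace sketch. It is not the firmware memory map implementation and not a
proposal for it; struct mem_map, map_add(), map_remove() and map_total() are
hypothetical names. It only shows the bookkeeping: when a range is hot removed,
the enumeration handed to the next kernel has to be trimmed or split
accordingly.

```c
#include <stddef.h>
#include <stdint.h>

#define MAP_MAX 16

/* Hypothetical memory enumeration: half-open ranges [start, end). */
struct map_entry {
	uint64_t start;
	uint64_t end;
};

struct mem_map {
	struct map_entry e[MAP_MAX];
	size_t nr;
};

/* Record a range; returns 0 on success, -1 if invalid or the map is full. */
int map_add(struct mem_map *m, uint64_t start, uint64_t end)
{
	if (m->nr == MAP_MAX || start >= end)
		return -1;
	m->e[m->nr].start = start;
	m->e[m->nr].end = end;
	m->nr++;
	return 0;
}

/* Drop [start, end) from the map, trimming or splitting entries. */
int map_remove(struct mem_map *m, uint64_t start, uint64_t end)
{
	size_t i = 0;

	while (i < m->nr) {
		struct map_entry *e = &m->e[i];

		if (end <= e->start || start >= e->end) {
			i++;			/* no overlap */
		} else if (start <= e->start && end >= e->end) {
			*e = m->e[--m->nr];	/* fully covered: drop, recheck slot */
		} else if (start > e->start && end < e->end) {
			if (m->nr == MAP_MAX)	/* hole in the middle: split */
				return -1;
			m->e[m->nr].start = end;
			m->e[m->nr].end = e->end;
			m->nr++;
			e->end = start;
			i++;
		} else if (start <= e->start) {
			e->start = end;		/* trim the front */
			i++;
		} else {
			e->end = start;		/* trim the back */
			i++;
		}
	}
	return 0;
}

/* Total bytes still described by the map. */
uint64_t map_total(const struct mem_map *m)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < m->nr; i++)
		sum += m->e[i].end - m->e[i].start;
	return sum;
}
```

For example, removing a 256M range from the middle of a 1G boot range leaves
two entries covering 768M in total, which is what the next kernel should see.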

[2] Remove hotpluggable runtime memory

There are no similar problems here. If runtime added memory is never going to
be part of the kexec enumeration, it does not even matter whether this memory
is removed or not.

- Anshuman

> 
> 
> 
> Thanks,
> 
> James
> 
>
Anshuman Khandual Nov. 4, 2019, 3:57 a.m. UTC | #8
On 10/28/2019 01:55 PM, Anshuman Khandual wrote:
> There is a mechanism in ACPI for this i.e ACPI_SRAT_MEM_HOT_PLUGGABLE.
> 
> Lets re-evaluate the situation here from scratch. Memory can be classified as
> boot and runtime because it impacts the way in which kernel allocations, zone
> initializations are treated. Boot memory deals with kernel allocation before
> zone init happens where as runtime memory might choose which zone to get into
> right away.
> 
> (1) Boot memory
> 
> 	- Non-movable
> 
> 		- Normal memblocks
> 		- All kernel allocations come here
> 		- Become ZONE_NORMAL/DMA/DMA32 at runtime
> 
> 			- Never removable because isolation and hence migration impossible
> 			- Removal protection because of the zone classification
> 
> 	- Movable	(ACPI_SRAT_MEM_HOT_PLUGGABLE)
> 
> 		- Memblock will be marked with MEMBLOCK_HOTPLUG
> 		- Memblock allocations tried to be avoided (reversing the memblock order etc)
> 		- Become ZONE_MOVABLE at runtime
> 
> 			- Removable  [1]

There is another way in which boot memory can be created as ZONE_MOVABLE
irrespective of whether the firmware (ACPI/OF) had asked for it or not.
This is achieved with the "kernelcore" or "movablecore" kernel command line
options, where the administrator explicitly asks for sections of memory
to be converted to ZONE_MOVABLE. This creates some of the memory block
devices in /sys/devices/system/memory as removable (ZONE_MOVABLE). IIUC
this is mutually exclusive with removable boot memory creation through the
"movable_node" kernel command line option with firmware tagged hot
pluggable memory sections (ACPI_SRAT_MEM_HOT_PLUGGABLE).

Details here: mm/page_alloc.c find_zone_movable_pfns_for_nodes()

Now, this boils down to whether firmware will ever request removal of boot
memory sections which were never tagged as hotpluggable by the firmware
during boot. I wonder whether tagging portions of boot memory as ZONE_MOVABLE
might have any other use case if they are never to be hot removed. Will
continue looking into ACPI/OF memory hotplug scenarios.

Patch

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 950a56b71ff0..c9a3c029c55f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -273,6 +273,9 @@  config ZONE_DMA32
 config ARCH_ENABLE_MEMORY_HOTPLUG
 	def_bool y
 
+config ARCH_ENABLE_MEMORY_HOTREMOVE
+	def_bool y
+
 config SMP
 	def_bool y
 
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index b61b50bf68b1..615dcd08acfa 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -54,6 +54,7 @@ 
 #define MODULES_VADDR		(BPF_JIT_REGION_END)
 #define MODULES_VSIZE		(SZ_128M)
 #define VMEMMAP_START		(-VMEMMAP_SIZE - SZ_2M)
+#define VMEMMAP_END		(VMEMMAP_START + VMEMMAP_SIZE)
 #define PCI_IO_END		(VMEMMAP_START - SZ_2M)
 #define PCI_IO_START		(PCI_IO_END - PCI_IO_SIZE)
 #define FIXADDR_TOP		(PCI_IO_START - SZ_2M)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d10247fab0fd..cb0411b1e735 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -725,6 +725,245 @@  int kern_addr_valid(unsigned long addr)
 
 	return pfn_valid(pte_pfn(pte));
 }
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void free_hotplug_page_range(struct page *page, size_t size)
+{
+	WARN_ON(PageReserved(page));
+	free_pages((unsigned long)page_address(page), get_order(size));
+}
+
+static void free_hotplug_pgtable_page(struct page *page)
+{
+	free_hotplug_page_range(page, PAGE_SIZE);
+}
+
+static bool pgtable_range_aligned(unsigned long start, unsigned long end,
+				  unsigned long floor, unsigned long ceiling,
+				  unsigned long mask)
+{
+	start &= mask;
+	if (start < floor)
+		return false;
+
+	if (ceiling) {
+		ceiling &= mask;
+		if (!ceiling)
+			return false;
+	}
+
+	if (end - 1 > ceiling - 1)
+		return false;
+	return true;
+}
+
+static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
+				    unsigned long end, bool free_mapped)
+{
+	pte_t *ptep, pte;
+
+	do {
+		ptep = pte_offset_kernel(pmdp, addr);
+		pte = READ_ONCE(*ptep);
+		if (pte_none(pte))
+			continue;
+
+		WARN_ON(!pte_present(pte));
+		pte_clear(&init_mm, addr, ptep);
+		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+		if (free_mapped)
+			free_hotplug_page_range(pte_page(pte), PAGE_SIZE);
+	} while (addr += PAGE_SIZE, addr < end);
+}
+
+static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
+				    unsigned long end, bool free_mapped)
+{
+	unsigned long next;
+	pmd_t *pmdp, pmd;
+
+	do {
+		next = pmd_addr_end(addr, end);
+		pmdp = pmd_offset(pudp, addr);
+		pmd = READ_ONCE(*pmdp);
+		if (pmd_none(pmd))
+			continue;
+
+		WARN_ON(!pmd_present(pmd));
+		if (pmd_sect(pmd)) {
+			pmd_clear(pmdp);
+			flush_tlb_kernel_range(addr, next);
+			if (free_mapped)
+				free_hotplug_page_range(pmd_page(pmd),
+							PMD_SIZE);
+			continue;
+		}
+		WARN_ON(!pmd_table(pmd));
+		unmap_hotplug_pte_range(pmdp, addr, next, free_mapped);
+	} while (addr = next, addr < end);
+}
+
+static void unmap_hotplug_pud_range(pgd_t *pgdp, unsigned long addr,
+				    unsigned long end, bool free_mapped)
+{
+	unsigned long next;
+	pud_t *pudp, pud;
+
+	do {
+		next = pud_addr_end(addr, end);
+		pudp = pud_offset(pgdp, addr);
+		pud = READ_ONCE(*pudp);
+		if (pud_none(pud))
+			continue;
+
+		WARN_ON(!pud_present(pud));
+		if (pud_sect(pud)) {
+			pud_clear(pudp);
+			flush_tlb_kernel_range(addr, next);
+			if (free_mapped)
+				free_hotplug_page_range(pud_page(pud),
+							PUD_SIZE);
+			continue;
+		}
+		WARN_ON(!pud_table(pud));
+		unmap_hotplug_pmd_range(pudp, addr, next, free_mapped);
+	} while (addr = next, addr < end);
+}
+
+static void unmap_hotplug_range(unsigned long addr, unsigned long end,
+				bool free_mapped)
+{
+	unsigned long next;
+	pgd_t *pgdp, pgd;
+
+	do {
+		next = pgd_addr_end(addr, end);
+		pgdp = pgd_offset_k(addr);
+		pgd = READ_ONCE(*pgdp);
+		if (pgd_none(pgd))
+			continue;
+
+		WARN_ON(!pgd_present(pgd));
+		unmap_hotplug_pud_range(pgdp, addr, next, free_mapped);
+	} while (addr = next, addr < end);
+}
+
+static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
+				 unsigned long end, unsigned long floor,
+				 unsigned long ceiling)
+{
+	pte_t *ptep, pte;
+	unsigned long i, start = addr;
+
+	do {
+		ptep = pte_offset_kernel(pmdp, addr);
+		pte = READ_ONCE(*ptep);
+		WARN_ON(!pte_none(pte));
+	} while (addr += PAGE_SIZE, addr < end);
+
+	if (!pgtable_range_aligned(start, end, floor, ceiling, PMD_MASK))
+		return;
+
+	ptep = pte_offset_kernel(pmdp, 0UL);
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		if (!pte_none(READ_ONCE(ptep[i])))
+			return;
+	}
+
+	pmd_clear(pmdp);
+	__flush_tlb_kernel_pgtable(start);
+	free_hotplug_pgtable_page(virt_to_page(ptep));
+}
+
+static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
+				 unsigned long end, unsigned long floor,
+				 unsigned long ceiling)
+{
+	pmd_t *pmdp, pmd;
+	unsigned long i, next, start = addr;
+
+	do {
+		next = pmd_addr_end(addr, end);
+		pmdp = pmd_offset(pudp, addr);
+		pmd = READ_ONCE(*pmdp);
+		if (pmd_none(pmd))
+			continue;
+
+		WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd));
+		free_empty_pte_table(pmdp, addr, next, floor, ceiling);
+	} while (addr = next, addr < end);
+
+	if (CONFIG_PGTABLE_LEVELS <= 2)
+		return;
+
+	if (!pgtable_range_aligned(start, end, floor, ceiling, PUD_MASK))
+		return;
+
+	pmdp = pmd_offset(pudp, 0UL);
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		if (!pmd_none(READ_ONCE(pmdp[i])))
+			return;
+	}
+
+	pud_clear(pudp);
+	__flush_tlb_kernel_pgtable(start);
+	free_hotplug_pgtable_page(virt_to_page(pmdp));
+}
+
+static void free_empty_pud_table(pgd_t *pgdp, unsigned long addr,
+				 unsigned long end, unsigned long floor,
+				 unsigned long ceiling)
+{
+	pud_t *pudp, pud;
+	unsigned long i, next, start = addr;
+
+	do {
+		next = pud_addr_end(addr, end);
+		pudp = pud_offset(pgdp, addr);
+		pud = READ_ONCE(*pudp);
+		if (pud_none(pud))
+			continue;
+
+		WARN_ON(!pud_present(pud) || !pud_table(pud) || pud_sect(pud));
+		free_empty_pmd_table(pudp, addr, next, floor, ceiling);
+	} while (addr = next, addr < end);
+
+	if (CONFIG_PGTABLE_LEVELS <= 3)
+		return;
+
+	if (!pgtable_range_aligned(start, end, floor, ceiling, PGDIR_MASK))
+		return;
+
+	pudp = pud_offset(pgdp, 0UL);
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		if (!pud_none(READ_ONCE(pudp[i])))
+			return;
+	}
+
+	pgd_clear(pgdp);
+	__flush_tlb_kernel_pgtable(start);
+	free_hotplug_pgtable_page(virt_to_page(pudp));
+}
+
+static void free_empty_tables(unsigned long addr, unsigned long end,
+			      unsigned long floor, unsigned long ceiling)
+{
+	unsigned long next;
+	pgd_t *pgdp, pgd;
+
+	do {
+		next = pgd_addr_end(addr, end);
+		pgdp = pgd_offset_k(addr);
+		pgd = READ_ONCE(*pgdp);
+		if (pgd_none(pgd))
+			continue;
+
+		WARN_ON(!pgd_present(pgd));
+		free_empty_pud_table(pgdp, addr, next, floor, ceiling);
+	} while (addr = next, addr < end);
+}
+#endif
+
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 #if !ARM64_SWAPPER_USES_SECTION_MAPS
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
@@ -772,6 +1011,12 @@  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 void vmemmap_free(unsigned long start, unsigned long end,
 		struct vmem_altmap *altmap)
 {
+#ifdef CONFIG_MEMORY_HOTPLUG
+	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
+
+	unmap_hotplug_range(start, end, true);
+	free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END);
+#endif
 }
 #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
 
@@ -1050,10 +1295,21 @@  int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
+static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
+{
+	unsigned long end = start + size;
+
+	WARN_ON(pgdir != init_mm.pgd);
+	WARN_ON((start < PAGE_OFFSET) || (end > PAGE_END));
+
+	unmap_hotplug_range(start, end, false);
+	free_empty_tables(start, end, PAGE_OFFSET, PAGE_END);
+}
+
 int arch_add_memory(int nid, u64 start, u64 size,
 			struct mhp_restrictions *restrictions)
 {
-	int flags = 0;
+	int ret, flags = 0;
 
 	if (rodata_full || debug_pagealloc_enabled())
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
@@ -1061,22 +1317,21 @@  int arch_add_memory(int nid, u64 start, u64 size,
 	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
 			     size, PAGE_KERNEL, __pgd_pgtable_alloc, flags);
 
-	return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
+	ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
 			   restrictions);
+	if (ret)
+		__remove_pgd_mapping(swapper_pg_dir,
+				     __phys_to_virt(start), size);
+	return ret;
 }
+
 void arch_remove_memory(int nid, u64 start, u64 size,
 			struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
-	/*
-	 * FIXME: Cleanup page tables (also in arch_add_memory() in case
-	 * adding fails). Until then, this function should only be used
-	 * during memory hotplug (adding memory), not for memory
-	 * unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be
-	 * unlocked yet.
-	 */
 	__remove_pages(start_pfn, nr_pages, altmap);
+	__remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size);
 }
 #endif
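
The floor and ceiling check in pgtable_range_aligned() from the patch can be
tried out in userspace. The sketch below copies the logic from the mmu.c hunk
with plain unsigned long arguments; the function name range_aligned() is
hypothetical. It returns true only when the table page covering the range lies
entirely between floor and ceiling, so that freeing it cannot affect a
neighbouring region.

```c
#include <stdbool.h>

/*
 * Userspace copy of the pgtable_range_aligned() logic from the patch.
 * mask selects the span covered by one table page (PMD_MASK, PUD_MASK
 * or PGDIR_MASK in the kernel). A ceiling of 0 means "no upper limit".
 */
bool range_aligned(unsigned long start, unsigned long end,
		   unsigned long floor, unsigned long ceiling,
		   unsigned long mask)
{
	/* Round start down to the table boundary; it must not cross floor. */
	start &= mask;
	if (start < floor)
		return false;

	/* Round ceiling down too; a ceiling inside the first span is unusable. */
	if (ceiling) {
		ceiling &= mask;
		if (!ceiling)
			return false;
	}

	/* The "- 1" makes a ceiling of 0 wrap around to the maximum value. */
	if (end - 1 > ceiling - 1)
		return false;
	return true;
}
```

Assuming a 2 MiB span per table page (mask ~0x1fffffUL, a made-up value for
illustration), a range starting at 0x40001000 with floor 0x40100000 is rejected
because the table page also covers addresses below the floor, while the range
0x40200000-0x40400000 with floor 0x40000000 and ceiling 0x80000000 is accepted.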