
[RFC,0/3] Allocate memmap from hotadded memory (per device)

Message ID 20201022125835.26396-1-osalvador@suse.de (mailing list archive)

Message

Oscar Salvador Oct. 22, 2020, 12:58 p.m. UTC
This patchset would be the next version of [1], but a lot has changed
in the meantime, so I figured I would just make another RFC.

After some discussions with David off the list, we agreed that it would be
easier as a starter to only support memmap from hotadded memory if the hotadded range
spans a single memory device.

The reason behind this is that, at any given time, a memory_block is either online or
offline, and so are the pages within it.
That means that operations like pfn_to_online_page() always return the
right thing.

But that would not be the case if we support spanning multiple devices with
the infrastructure we have at the moment.
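
For reference, pfn_to_online_page() only decides at section granularity today;
a simplified sketch of the check it performs (not the exact macro, the real
helper lives in include/linux/memory_hotplug.h) looks like this:

	/*
	 * Simplified sketch of pfn_to_online_page(): a pfn only resolves to
	 * a struct page when the section backing it is marked online.
	 */
	static struct page *pfn_to_online_page_sketch(unsigned long pfn)
	{
		unsigned long nr = pfn_to_section_nr(pfn);

		if (nr < NR_MEM_SECTIONS && online_section_nr(nr) &&
		    pfn_valid_within(pfn))
			return pfn_to_page(pfn);

		return NULL;
	}

As long as the hot-added range is a single memory device, the vmemmap pages and
the pages they describe always share the same online/offline state, so the
check above keeps giving a consistent answer.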

We have two options to support spanning multiple memory devices (which is the
final goal of this work):

 1) We play with the sub-section bitmap, so although a section might be offline,
    a pfn_to_online_page() call on a vmemmap page will still give us the right value.
    I was tempted to explore this, but I am leaning more towards #2.

 2) Do some work towards flexible-sized memory devices.
    The way I see it, a memory_block device would be as big as the hot-added range
    so we could have memory_blocks of 1GB, 512MB, 64GB, all depending on the size
    of the device to be added to the system.
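
(Purely to illustrate #2 - nothing like this exists yet and the field names
below are made up - a flexible-sized memory_block would have to carry its own
span instead of deriving it from memory_block_size_bytes():)

	/* Hypothetical sketch only: today every memory_block covers exactly
	 * memory_block_size_bytes(); a flexible-sized one would record its
	 * own range so it can match whatever was hot-added. */
	struct memory_block {
		unsigned long start_pfn;	/* first pfn of the hot-added range */
		unsigned long nr_pages;		/* 512MB, 1GB, 64GB, ... in pages */
		unsigned long state;		/* MEM_ONLINE / MEM_OFFLINE / ... */
		int online_type;
		int nid;
		struct device dev;
	};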

I am adding some of David's notes here:

"  Case 1: add_memory() spans a single memory device

          The memory can be either online/offline, and thereby, all sections
          online/offline. Nobody should be touching the vmemmap (even during
          add_memory() - except when poisoning vmemmap, which should most probably
          work as well, if not we can work around that).


  Case 2: add_memory() spans multiple memory devices

          Option 1: As we discussed, only cover full sections with the vmemmap. "Bad" thing
          is that the memory devices holding the vmemmap might be offline to user space,
          but still contain relevant data ... bad for kexec-tools when creating memory to
          dump via kdump. Won't properly work.

  Option 2: Later extend option 1 to use sub-section online bitmap.

  Option 3: Convert to case 1. Michal proposed allowing flexible-sized memory devices.
            Will require some work, but would be the cleanest IMHO.


          So maybe starting with case 1 is best for now, and later extending it via Case2.3 -
          which would simply be reworking memory devices."
"

" 1. It can happen that pfn_online() for a vmemmap page returns either
    true or false, depending on the state of the section. It could be that
    the memory block holding the vmemmap is offline while another memory
    block making use of it is online.

    I guess this isn't bad (I assume it is similar for the altmap), however
    it could be that makedumpfile will exclude the vmemmap from dumps (as it
    will usually only dump pages in sections marked online if I am not wrong
    - maybe it special cases vmemmaps already). Also, could be that it is
    not saved/restored during hibernation. We'll have to verify."


It goes without saying that the patchset is not 100% complete.
It is missing:

 - a way to disable memmap_on_memory (either sysfs or boot_time cmd)
 - atm, arch_add_memory for s390 screams if an altmap is passed.
   I am still thinking of a way to nicely handle that.
   Maybe a function in s390 that sets memmap_on_memory to false, and
   stuff that check into the support_memmap_on_memory() function.
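
Something along these lines is what I have in mind for both points
(hypothetical sketch only, the names are not final):

	/* Hypothetical sketch, not part of the series yet: a single knob that
	 * an arch (e.g. s390) or the admin (sysfs/cmdline) could clear. */
	static bool memmap_on_memory __read_mostly = true;

	static bool support_memmap_on_memory(unsigned long size)
	{
		if (!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP) || !memmap_on_memory)
			return false;

		/* For now, only ranges spanning a single memory block. */
		return size == memory_block_size_bytes();
	}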


Original cover letter:

----
The primary goal of this patchset is to reduce memory overhead of the
hot-added memory (at least for SPARSEMEM_VMEMMAP memory model).
The current way we populate the memmap (struct page array) has three main drawbacks:

a) it consumes additional memory until the hotadded memory itself is
   onlined,
b) the memmap might end up on a different NUMA node, which is especially
   true for the movable_node configuration, and
c) due to fragmentation we might end up populating the memmap with base
   pages.

One way to mitigate all these issues is to simply allocate the memmap array
(which is the largest memory footprint of physical memory hotplug)
from the hot-added memory itself. The SPARSEMEM_VMEMMAP memory model allows
us to map any pfn range, so the memory doesn't need to be online to be
usable for the array. See patch 3 for more details.
This feature is only usable when CONFIG_SPARSEMEM_VMEMMAP is set.

[Overall design]:

Implementation-wise, we reuse the vmem_altmap infrastructure to override
the default allocator used by vmemmap_populate(). Once the memmap is
allocated, we need a way to mark the altmap pfns used for the allocation.
If the MHP_MEMMAP_ON_MEMORY flag was passed, we set up the layout of the
altmap structure in add_memory_resource(), and then we call
mark_vmemmap_pages() to mark the vmemmap pages.
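
Condensed, the relevant part of add_memory_resource() ends up looking roughly
like this (see patch 3 for the real code; variable names here are only
indicative):

	struct vmem_altmap mhp_altmap = {
		.base_pfn = PHYS_PFN(start),
		/* whole hot-added range is available to the altmap allocator;
		 * vmemmap_populate() will carve the memmap out of its head */
		.free = PHYS_PFN(size),
	};
	struct mhp_params params = { .pgprot = PAGE_KERNEL };

	if (mhp_flags & MHP_MEMMAP_ON_MEMORY)
		params.altmap = &mhp_altmap;

	ret = arch_add_memory(nid, start, size, &params);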

memory_block gained a new field called nr_vmemmap_pages.
This plays well for two reasons:

 1) {offline/online}_pages know the difference between start_pfn and
    valid_start_pfn, which is start_pfn + nr_vmemmap_pages.
    In this way, all isolation/migration/initialization operations are
    done on the right range of memory, without the vmemmap pages getting
    involved. This allows for much cleaner handling (see the sketch below
    this list).

 2) In try_remove_memory, we construct a new vmem_altmap struct with the
    right info, so we end up calling vmem_altmap_free() instead of
    free_pagetable() when removing the memory.
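
To make 1) a bit more concrete, this is roughly what the adjustment looks like
when onlining a block (illustrative sketch only; nr_vmemmap_pages is the new
memory_block field mentioned above):

	/* Skip the memmap sitting at the start of the block and only
	 * online/offline the remainder. */
	unsigned long valid_start_pfn = start_pfn + mem->nr_vmemmap_pages;
	unsigned long valid_nr_pages  = nr_pages - mem->nr_vmemmap_pages;

	/* isolation/migration/initialization only ever touch
	 * [valid_start_pfn, valid_start_pfn + valid_nr_pages) */
	ret = online_pages(valid_start_pfn, valid_nr_pages,
			   mem->online_type, mem->nid);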

Oscar Salvador (3):
  mm,memory_hotplug: Introduce MHP_MEMMAP_ON_MEMORY
  mm: Introduce a new Vmemmap page-type
  mm,memory_hotplug: Allocate memmap from the added memory range

 drivers/acpi/acpi_memhotplug.c |   2 +-
 drivers/base/memory.c          |  21 +++--
 include/linux/memory.h         |   3 +-
 include/linux/memory_hotplug.h |  27 ++++++-
 include/linux/memremap.h       |   2 +-
 include/linux/mm.h             |   6 ++
 include/linux/mm_types.h       |   5 ++
 include/linux/page-flags.h     |   6 ++
 mm/memory_hotplug.c            | 139 +++++++++++++++++++++++++++------
 mm/memremap.c                  |   5 +-
 mm/page_alloc.c                |   7 ++
 mm/sparse.c                    |  25 ++++++
 12 files changed, 205 insertions(+), 43 deletions(-)

Comments

David Hildenbrand Oct. 22, 2020, 1:01 p.m. UTC | #1
> It goes without saying that the patchset is not 100% complete.
> It is missing:
> 
>  - a way to disable memmap_on_memory (either sysfs or boot_time cmd)
>  - atm, arch_add_memory for s390 screams if an altmap is passed.
>    I am still thinking of a way to nicely handle that.
>    Maybe a function in s390 that sets memmap_on_memory to false, and
>    stuff that check into the support_memmap_on_memory() function.

Or simply implement altmap support ... shouldn't be too hard :)
Oscar Salvador Oct. 27, 2020, 3:40 p.m. UTC | #2
On Thu, Oct 22, 2020 at 03:01:44PM +0200, David Hildenbrand wrote:
> > It goes without saying that the patchset is not 100% complete.
> > It is missing:
> > 
> >  - a way to disable memmap_on_memory (either sysfs or boot_time cmd)
> >  - atm, arch_add_memory for s390 screams if an altmap is passed.
> >    I am still thinking of a way to nicely handle that.
> >    Maybe a function in s390 that sets memmap_on_memory to false, and
> >    stuff that check into the support_memmap_on_memory() function.
> 
> Or simply implement altmap support ... shouldn't be too hard :)

Yeah, I guess so, but first I would like to have everything else settled.
So, gentle ping :-)
David Hildenbrand Oct. 27, 2020, 3:44 p.m. UTC | #3
On 27.10.20 16:40, Oscar Salvador wrote:
> On Thu, Oct 22, 2020 at 03:01:44PM +0200, David Hildenbrand wrote:
>>> It goes without saying that the patchset is not 100% complete.
>>> It is missing:
>>>
>>>   - a way to disable memmap_on_memory (either sysfs or boot_time cmd)
>>>   - atm, arch_add_memory for s390 screams if an altmap is passed.
>>>     I am still thinking of a way to nicely handle that.
>>>     Maybe a function in s390 that sets memmap_on_memory to false, and
>>>     stuff that check into the support_memmap_on_memory() function.
>>
>> Or simply implement altmap support ... shouldn't be too hard :)
> 
> Yeah, I guess so, but first I would like to have everything else settled.
> So, gentle ping :-)
> 

I'm planning on looking into patch #2/3 later this or next week (this 
week is open source summit / KVM Forum).

One thing to look into right now is how to make this fly with the 
vmemmap optimizations for hugetlb pages.

https://lkml.kernel.org/r/20201026145114.59424-1-songmuchun@bytedance.com
Oscar Salvador Oct. 27, 2020, 3:58 p.m. UTC | #4
On Tue, Oct 27, 2020 at 04:44:33PM +0100, David Hildenbrand wrote:
> I'm planning on looking into patch #2/3 later this or next week (this week
> is open source summit / KVM Forum).

Sure, appreciate the time ;-)

> 
> One thing to look into right now is how to make this fly with the vmemmap
> optimizations for hugetlb pages.
> 
> https://lkml.kernel.org/r/20201026145114.59424-1-songmuchun@bytedance.com

I was about to have a look at that series either way, but good that you mentioned it.
Mike Kravetz Oct. 28, 2020, 6:47 p.m. UTC | #5
On 10/27/20 8:58 AM, Oscar Salvador wrote:
> On Tue, Oct 27, 2020 at 04:44:33PM +0100, David Hildenbrand wrote:
>> I'm planning on looking into patch #2/3 later this or next week (this week
>> is open source summit / KVM Forum).
> 
> Sure, appreciate the time ;-)
> 
>>
>> One thing to look into right now is how to make this fly with the vmemmap
>> optimizations for hugetlb pages.
>>
>> https://lkml.kernel.org/r/20201026145114.59424-1-songmuchun@bytedance.com
> 
> I was about to have a look at that series either way, but good that you mentioned it.
> 

More eyes on that series would be appreciated.

That series will dynamically free and allocate memmap pages as hugetlb
pages are allocated or freed.  I haven't looked through this series, but
my first thought is that we would need to ensure those allocs/frees are
directed to the device.  Not sure if there are interfaces for that.
David Hildenbrand Oct. 29, 2020, 7:49 a.m. UTC | #6
On 28.10.20 19:47, Mike Kravetz wrote:
> On 10/27/20 8:58 AM, Oscar Salvador wrote:
>> On Tue, Oct 27, 2020 at 04:44:33PM +0100, David Hildenbrand wrote:
>>> I'm planning on looking into patch #2/3 later this or next week (this week
>>> is open source summit / KVM Forum).
>>
>> Sure, appreciate the time ;-)
>>
>>>
>>> One thing to look into right now is how to make this fly with the vmemmap
>>> optimizations for hugetlb pages.
>>>
>>> https://lkml.kernel.org/r/20201026145114.59424-1-songmuchun@bytedance.com
>>
>> I was about to have a look at that series either way, but good that you mentioned it.
>>
> 
> More eyes on that series would be appreciated.
> 
> That series will dynamically free and allocate memmap pages as hugetlb
> pages are allocated or freed.  I haven't looked through this series, but
> my first thought is that we would need to ensure those allocs/frees are
> directed to the device.  Not sure if there are interfaces for that.

Directing to the device might be part of the solution, but does not have 
to be. You really want to free the pages to the OS in the end, otherwise 
you lose the whole benefit of the vmemmap optimization.

You would want to actually free the pages (making sure whatever 
generic_online_page() does was done to these special vmemmap pages). But 
then, you cannot simply skip the first X pages of a memory block when 
offlining; you can only skip the ones that are still vmemmap pages 
(e.g., marked via page type), and have to isolate/migrate off the 
no-longer-vmemmap pages.
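
As rough pseudocode of what that means - not code from this series, and
assuming the page-type test from patch #2 ends up being called
PageVmemmap() - offlining would have to do something like:

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		struct page *page = pfn_to_page(pfn);

		/* still backing some memmap: safe to skip */
		if (PageVmemmap(page))
			continue;

		/* formerly-vmemmap (already freed) or ordinary page:
		 * has to be isolated and migrated off as usual */
	}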