Message ID: 20190328134320.13232-1-osalvador@suse.de (mailing list archive)
Series: mm,memory_hotplug: allocate memmap from hotadded memory
On 28.03.19 14:43, Oscar Salvador wrote: > Hi, > > since last two RFCs were almost unnoticed (thanks David for the feedback), > I decided to re-work some parts to make it more simple and give it a more > testing, and drop the RFC, to see if it gets more attention. > I also added David's feedback, so now all users of add_memory/__add_memory/ > add_memory_resource can specify whether they want to use this feature or not. Terrific, I will also definetly try to make use of that in the next virito-mem prototype (looks like I'll finally have time to look into it again). > I also fixed some compilation issues when CONFIG_SPARSEMEM_VMEMMAP is not set. > > [Testing] > > Testing has been carried out on the following platforms: > > - x86_64 (small and big memblocks) > - powerpc > - arm64 (Huawei's fellows) > > I plan to test it on Xen and Hyper-V, but for now those two will not be > using this feature, and neither DAX/pmem. I think doing it step by step is the right approach. Less likely to break stuff. > > Of course, if this does not find any strong objection, my next step is to > work on enabling this on Xen/Hyper-V. > > [Coverletter] > > This is another step to make the memory hotplug more usable. The primary > goal of this patchset is to reduce memory overhead of the hot added > memory (at least for SPARSE_VMEMMAP memory model). The current way we use > to populate memmap (struct page array) has two main drawbacks: > > a) it consumes an additional memory until the hotadded memory itself is > onlined and > b) memmap might end up on a different numa node which is especially true > for movable_node configuration. > > a) is problem especially for memory hotplug based memory "ballooning" > solutions when the delay between physical memory hotplug and the > onlining can lead to OOM and that led to introduction of hacks like auto > onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining > policy for the newly added memory")). > > b) can have performance drawbacks. > > I have also seen hot-add operations failing on archs because they > were running out of order-x pages. > E.g On powerpc, in certain configurations, we use order-8 pages, > and given 64KB base pagesize, that is 16MB. > If we run out of those, we just fail the operation and we cannot add > more memory. > We could fallback to base pages as x86_64 does, but we can do better. > > One way to mitigate all these issues is to simply allocate memmap array > (which is the largest memory footprint of the physical memory hotplug) > from the hotadded memory itself. VMEMMAP memory model allows us to map > any pfn range so the memory doesn't need to be online to be usable > for the array. See patch 3 for more details. In short I am reusing an > existing vmem_altmap which wants to achieve the same thing for nvdim > device memory. > > There is also one potential drawback, though. If somebody uses memory > hotplug for 1G (gigantic) hugetlb pages then this scheme will not work > for them obviously because each memory block will contain reserved > area. Large x86 machines will use 2G memblocks so at least one 1G page > will be available but this is still not 2G... > > If that is a problem, we can always configure a fallback strategy to > use the current scheme. > > Since this only works when CONFIG_VMEMMAP_ENABLED is set, > we do check for it before setting the flag that allows use > to use the feature, no matter if the user wanted it. > > [Overall design]: > > Let us say we hot-add 2GB of memory on a x86_64 (memblock size = 128M). 
> That is: > > - 16 sections > - 524288 pages > - 8192 vmemmap pages (out of those 524288. We spend 512 pages for each section) > > The range of pages is: 0xffffea0004000000 - 0xffffea0006000000 > The vmemmap range is: 0xffffea0004000000 - 0xffffea0004080000 > > 0xffffea0004000000 is the head vmemmap page (first page), while all the others > are "tails". > > We keep the following information in it: > > - Head page: > - head->_refcount: number of sections > - head->private : number of vmemmap pages > - Tail page: > - tail->freelist : pointer to the head > > This is done because it eases the work in cases where we have to compute the > number of vmemmap pages to know how much do we have to skip etc, and to keep > the right accounting to present_pages. > > When we want to hot-remove the range, we need to be careful because the first > pages of that range, are used for the memmap maping, so if we remove those > first, we would blow up while accessing the others later on. > For that reason we keep the number of sections in head->_refcount, to know how > much do we have to defer the free up. > > Since in a hot-remove operation, sections are being removed sequentially, the > approach taken here is that every time we hit free_section_memmap(), we decrease > the refcount of the head. > When it reaches 0, we know that we hit the last section, so we call > vmemmap_free() for the whole memory-range in backwards, so we make sure that > the pages used for the mapping will be latest to be freed up. > > Vmemmap pages are charged to spanned/present_paged, but not to manages_pages. > I guess one important thing to mention is that it is no longer possible to remove memory in a different granularity it was added. I slightly remember that ACPI code sometimes "reuses" parts of already added memory. We would have to validate that this can indeed not be an issue. drivers/acpi/acpi_memhotplug.c: result = __add_memory(node, info->start_addr, info->length); if (result && result != -EEXIST) continue; What would happen when removing this dimm (->remove_memory()) Also have a look at arch/powerpc/platforms/powernv/memtrace.c I consider it evil code. It will simply try to offline+unplug *some* memory it finds in *some granularity*. Not sure if this might be problematic- Would there be any "safety net" for adding/removing memory in different granularities?
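To sanity-check the numbers in the cover letter above, the small userspace sketch below (plain C, not kernel code) recomputes them under the stated assumptions: 4K base pages, 128M sections/memblocks, and an assumed sizeof(struct page) of 64 bytes.

```c
/* Recompute the cover letter's example: 2GB hot-added on x86_64. */
#include <stdio.h>

int main(void)
{
	unsigned long long hotadd  = 2ULL << 30;	/* 2GB                          */
	unsigned long long section = 128ULL << 20;	/* 128M memblocks               */
	unsigned long long page_sz = 4096;		/* 4K base pages                */
	unsigned long long pg_desc = 64;		/* sizeof(struct page), assumed */

	unsigned long long sections      = hotadd / section;
	unsigned long long pages         = hotadd / page_sz;
	unsigned long long memmap_bytes  = pages * pg_desc;
	unsigned long long vmemmap_pages = memmap_bytes / page_sz;

	printf("sections:            %llu\n", sections);		 /* 16     */
	printf("pages:               %llu\n", pages);			 /* 524288 */
	printf("vmemmap pages:       %llu\n", vmemmap_pages);		 /* 8192   */
	printf("vmemmap per section: %llu\n", vmemmap_pages / sections); /* 512    */
	return 0;
}
```

In other words, the struct page array costs 64/4096, i.e. about 1.6% of the hot-added range, which matches the 512 pages per 128M section quoted above.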
On 28.03.19 16:09, David Hildenbrand wrote: > On 28.03.19 14:43, Oscar Salvador wrote: >> Hi, >> >> since last two RFCs were almost unnoticed (thanks David for the feedback), >> I decided to re-work some parts to make it more simple and give it a more >> testing, and drop the RFC, to see if it gets more attention. >> I also added David's feedback, so now all users of add_memory/__add_memory/ >> add_memory_resource can specify whether they want to use this feature or not. > > Terrific, I will also definetly try to make use of that in the next > virito-mem prototype (looks like I'll finally have time to look into it > again). > >> I also fixed some compilation issues when CONFIG_SPARSEMEM_VMEMMAP is not set. >> >> [Testing] >> >> Testing has been carried out on the following platforms: >> >> - x86_64 (small and big memblocks) >> - powerpc >> - arm64 (Huawei's fellows) >> >> I plan to test it on Xen and Hyper-V, but for now those two will not be >> using this feature, and neither DAX/pmem. > > I think doing it step by step is the right approach. Less likely to > break stuff. > >> >> Of course, if this does not find any strong objection, my next step is to >> work on enabling this on Xen/Hyper-V. >> >> [Coverletter] >> >> This is another step to make the memory hotplug more usable. The primary >> goal of this patchset is to reduce memory overhead of the hot added >> memory (at least for SPARSE_VMEMMAP memory model). The current way we use >> to populate memmap (struct page array) has two main drawbacks: >> >> a) it consumes an additional memory until the hotadded memory itself is >> onlined and >> b) memmap might end up on a different numa node which is especially true >> for movable_node configuration. >> >> a) is problem especially for memory hotplug based memory "ballooning" >> solutions when the delay between physical memory hotplug and the >> onlining can lead to OOM and that led to introduction of hacks like auto >> onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining >> policy for the newly added memory")). >> >> b) can have performance drawbacks. >> >> I have also seen hot-add operations failing on archs because they >> were running out of order-x pages. >> E.g On powerpc, in certain configurations, we use order-8 pages, >> and given 64KB base pagesize, that is 16MB. >> If we run out of those, we just fail the operation and we cannot add >> more memory. >> We could fallback to base pages as x86_64 does, but we can do better. >> >> One way to mitigate all these issues is to simply allocate memmap array >> (which is the largest memory footprint of the physical memory hotplug) >> from the hotadded memory itself. VMEMMAP memory model allows us to map >> any pfn range so the memory doesn't need to be online to be usable >> for the array. See patch 3 for more details. In short I am reusing an >> existing vmem_altmap which wants to achieve the same thing for nvdim >> device memory. >> >> There is also one potential drawback, though. If somebody uses memory >> hotplug for 1G (gigantic) hugetlb pages then this scheme will not work >> for them obviously because each memory block will contain reserved >> area. Large x86 machines will use 2G memblocks so at least one 1G page >> will be available but this is still not 2G... >> >> If that is a problem, we can always configure a fallback strategy to >> use the current scheme. 
>> >> Since this only works when CONFIG_VMEMMAP_ENABLED is set, >> we do check for it before setting the flag that allows use >> to use the feature, no matter if the user wanted it. >> >> [Overall design]: >> >> Let us say we hot-add 2GB of memory on a x86_64 (memblock size = 128M). >> That is: >> >> - 16 sections >> - 524288 pages >> - 8192 vmemmap pages (out of those 524288. We spend 512 pages for each section) >> >> The range of pages is: 0xffffea0004000000 - 0xffffea0006000000 >> The vmemmap range is: 0xffffea0004000000 - 0xffffea0004080000 >> >> 0xffffea0004000000 is the head vmemmap page (first page), while all the others >> are "tails". >> >> We keep the following information in it: >> >> - Head page: >> - head->_refcount: number of sections >> - head->private : number of vmemmap pages >> - Tail page: >> - tail->freelist : pointer to the head >> >> This is done because it eases the work in cases where we have to compute the >> number of vmemmap pages to know how much do we have to skip etc, and to keep >> the right accounting to present_pages. >> >> When we want to hot-remove the range, we need to be careful because the first >> pages of that range, are used for the memmap maping, so if we remove those >> first, we would blow up while accessing the others later on. >> For that reason we keep the number of sections in head->_refcount, to know how >> much do we have to defer the free up. >> >> Since in a hot-remove operation, sections are being removed sequentially, the >> approach taken here is that every time we hit free_section_memmap(), we decrease >> the refcount of the head. >> When it reaches 0, we know that we hit the last section, so we call >> vmemmap_free() for the whole memory-range in backwards, so we make sure that >> the pages used for the mapping will be latest to be freed up. >> >> Vmemmap pages are charged to spanned/present_paged, but not to manages_pages. >> > > I guess one important thing to mention is that it is no longer possible > to remove memory in a different granularity it was added. I slightly > remember that ACPI code sometimes "reuses" parts of already added > memory. We would have to validate that this can indeed not be an issue. > > drivers/acpi/acpi_memhotplug.c: > > result = __add_memory(node, info->start_addr, info->length); > if (result && result != -EEXIST) > continue; > > What would happen when removing this dimm (->remove_memory()) > > > Also have a look at > > arch/powerpc/platforms/powernv/memtrace.c > > I consider it evil code. It will simply try to offline+unplug *some* > memory it finds in *some granularity*. Not sure if this might be > problematic- > > Would there be any "safety net" for adding/removing memory in different > granularities? > Correct me if I am wrong. I think I was confused - vmemmap data is still allocated *per memory block*, not for the whole added memory, correct?
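The deferred-free scheme described in the quoted cover letter (head->_refcount holds the number of sections, tails point back to the head, and the vmemmap pages are only torn down after the last section is removed) can be modelled with the self-contained sketch below. The struct is a stand-in for the repurposed struct page fields, not the kernel's definition, and the three-section range is just for brevity.

```c
#include <stdio.h>

/* Stand-in for the struct page fields the series repurposes. */
struct fake_page {
	unsigned long refcount;		/* head->_refcount: number of sections    */
	unsigned long nr_vmemmap;	/* head->private: number of vmemmap pages */
	struct fake_page *freelist;	/* tail->freelist: pointer to the head    */
};

static void vmemmap_free_range(struct fake_page *head)
{
	printf("last section gone: freeing %lu vmemmap pages, head last\n",
	       head->nr_vmemmap);
}

/* Called once per removed section; the actual free is deferred. */
static void free_section_memmap(struct fake_page *page)
{
	struct fake_page *head = page->freelist ? page->freelist : page;

	if (--head->refcount == 0)
		vmemmap_free_range(head);
	else
		printf("deferring: %lu section(s) still mapped\n",
		       head->refcount);
}

int main(void)
{
	/* one head plus two tails, i.e. a three-section hot-added range */
	struct fake_page vmemmap[3] = {
		{ .refcount = 3, .nr_vmemmap = 3 * 512 },
		{ .freelist = &vmemmap[0] },
		{ .freelist = &vmemmap[0] },
	};

	free_section_memmap(&vmemmap[2]);
	free_section_memmap(&vmemmap[1]);
	free_section_memmap(&vmemmap[0]);	/* frees the whole mapping */
	return 0;
}
```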
On Thu, Mar 28, 2019 at 04:09:06PM +0100, David Hildenbrand wrote: > On 28.03.19 14:43, Oscar Salvador wrote: > > Hi, > > > > since last two RFCs were almost unnoticed (thanks David for the feedback), > > I decided to re-work some parts to make it more simple and give it a more > > testing, and drop the RFC, to see if it gets more attention. > > I also added David's feedback, so now all users of add_memory/__add_memory/ > > add_memory_resource can specify whether they want to use this feature or not. > > Terrific, I will also definetly try to make use of that in the next > virito-mem prototype (looks like I'll finally have time to look into it > again). Great, I would like to see how this works there :-). > I guess one important thing to mention is that it is no longer possible > to remove memory in a different granularity it was added. I slightly > remember that ACPI code sometimes "reuses" parts of already added > memory. We would have to validate that this can indeed not be an issue. > > drivers/acpi/acpi_memhotplug.c: > > result = __add_memory(node, info->start_addr, info->length); > if (result && result != -EEXIST) > continue; > > What would happen when removing this dimm (->remove_memory()) Yeah, I see the point. Well, we are safe here because the vmemmap data is being allocated in every call to __add_memory/add_memory/add_memory_resource. E.g: * Being memblock granularity 128M # object_add memory-backend-ram,id=ram0,size=256M # device_add pc-dimm,id=dimm0,memdev=ram0,node=1 I am not sure how ACPI gets to split the DIMM in memory resources (aka mem_device->res_list), but it does not really matter. For each mem_device->res_list item, we will make a call to __add_memory, which will allocate the vmemmap data for __that__ item, we do not care about the others. And when removing the DIMM, acpi_memory_remove_memory will make a call to __remove_memory() for each mem_device->res_list item, and that will take care of free up the vmemmap data. Now, with all my tests, ACPI always considered a DIMM a single memory resource, but maybe under different circumstances it gets to split it in different mem resources. But it does not really matter, as vmemmap data is being created and isolated in every call to __add_memory. > Also have a look at > > arch/powerpc/platforms/powernv/memtrace.c > > I consider it evil code. It will simply try to offline+unplug *some* > memory it finds in *some granularity*. Not sure if this might be > problematic- Heh, memtrace from powerpc ^^, I saw some oddities coming from there, but with my code though because I did not get to test that in concret. But I am interested to see if it can trigger something, so I will be testing that the next days. > Would there be any "safety net" for adding/removing memory in different > granularities? Uhm, I do not think we need it, or at least I cannot think of a case where this could cause trouble with the current design. Can you think of any? Thanks David ;-)
On Thu, Mar 28, 2019 at 04:31:44PM +0100, David Hildenbrand wrote: > Correct me if I am wrong. I think I was confused - vmemmap data is still > allocated *per memory block*, not for the whole added memory, correct? No, vmemap data is allocated per memory-resource added. In case a DIMM, would be a DIMM, in case a qemu memory-device, would be that memory-device. That is counting that ACPI does not split the DIMM/memory-device in several memory resources. If that happens, then acpi_memory_enable_device() calls __add_memory for every memory-resource, which means that the vmemmap data will be allocated per memory-resource. I did not see this happening though, and I am not sure under which circumstances can happen (I have to study the ACPI code a bit more). The problem with allocating vmemmap data per memblock, is the fragmentation. Let us say you do the following: * memblock granularity 128M (qemu) object_add memory-backend-ram,id=ram0,size=256M (qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1 This will create two memblocks (2 sections), and if we allocate the vmemmap data for each corresponding section within it section(memblock), you only get 126M contiguous memory. So, the taken approach is to allocate the vmemmap data corresponging to the whole DIMM/memory-device/memory-resource from the beginning of its memory. In the example from above, the vmemmap data for both sections is allocated from the beginning of the first section: memmap array takes 2MB per section, so 512 pfns. If we add 2 sections: [ pfn#0 ] \ [ ... ] | vmemmap used for memmap array [pfn#1023 ] / [pfn#1024 ] \ [ ... ] | used as normal memory [pfn#65536] / So, out of 256M, we get 252M to use as a real memory, as 4M will be used for building the memmap array. Actually, it can happen that depending on how big a DIMM/memory-device is, the first/s memblock is fully used for the memmap array (of course, this can only be seen when adding a huge DIMM/memory-device).
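To put numbers on the layout just described, and on the note that a big enough DIMM ends up with its first memblock(s) fully used for the memmap, here is a runnable sketch under the same assumptions as before: 4K pages, 128M memblocks, and an assumed 64-byte struct page.

```c
#include <stdio.h>

#define PAGE_SIZE	4096ULL
#define STRUCT_PAGE	64ULL			/* assumed sizeof(struct page) */
#define MEMBLOCK_SIZE	(128ULL << 20)

static void report(const char *name, unsigned long long size)
{
	unsigned long long pages         = size / PAGE_SIZE;
	unsigned long long memmap_bytes  = pages * STRUCT_PAGE;
	unsigned long long vmemmap_pages = memmap_bytes / PAGE_SIZE;

	printf("%s: pfn#0 .. pfn#%llu hold the memmap array, %lluM usable,\n"
	       "          memmap fills %llu whole memblock(s)\n",
	       name, vmemmap_pages - 1, (size - memmap_bytes) >> 20,
	       memmap_bytes / MEMBLOCK_SIZE);
}

int main(void)
{
	report("256M DIMM", 256ULL << 20);	/* 4M memmap, 252M usable    */
	report("  2T DIMM", 2ULL << 40);	/* 32G memmap: 256 memblocks */
	return 0;
}
```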
> Great, I would like to see how this works there :-). > >> I guess one important thing to mention is that it is no longer possible >> to remove memory in a different granularity it was added. I slightly >> remember that ACPI code sometimes "reuses" parts of already added >> memory. We would have to validate that this can indeed not be an issue. >> >> drivers/acpi/acpi_memhotplug.c: >> >> result = __add_memory(node, info->start_addr, info->length); >> if (result && result != -EEXIST) >> continue; >> >> What would happen when removing this dimm (->remove_memory()) > > Yeah, I see the point. > Well, we are safe here because the vmemmap data is being allocated in > every call to __add_memory/add_memory/add_memory_resource. > > E.g: > > * Being memblock granularity 128M > > # object_add memory-backend-ram,id=ram0,size=256M > # device_add pc-dimm,id=dimm0,memdev=ram0,node=1 So, this should result in one __add_memory() call with 256MB, creating two memory block devices (128MB). I *assume* (haven't looked at the details yet, sorry), that you will allocate vmmap for (and on!) each of these two 128MB sections/memblocks, correct? > > I am not sure how ACPI gets to split the DIMM in memory resources > (aka mem_device->res_list), but it does not really matter. > For each mem_device->res_list item, we will make a call to __add_memory, > which will allocate the vmemmap data for __that__ item, we do not care > about the others. > > And when removing the DIMM, acpi_memory_remove_memory will make a call to > __remove_memory() for each mem_device->res_list item, and that will take > care of free up the vmemmap data. Ah okay, that makes sense. > > Now, with all my tests, ACPI always considered a DIMM a single memory resource, > but maybe under different circumstances it gets to split it in different mem > resources. > But it does not really matter, as vmemmap data is being created and isolated in > every call to __add_memory. Yes, as long as the calls to add_memory matches remove_memory, we are totally fine. I am wondering if that could not be the case. A simplified example: A DIMM overlaps with some other system ram, as detected and added during boot. When detecting the dimm, __add_memory() returns -EEXIST. Now, wehn unplugging the dimm, we call remove_memory(), but only remove the DIMM part. I wonder how/if something like that can happen and how the system would react. I guess I'll have to do some more ACPI code reading to find out how this -EEXIST case can come to life. > >> Also have a look at >> >> arch/powerpc/platforms/powernv/memtrace.c >> >> I consider it evil code. It will simply try to offline+unplug *some* >> memory it finds in *some granularity*. Not sure if this might be >> problematic- > > Heh, memtrace from powerpc ^^, I saw some oddities coming from there, but > with my code though because I did not get to test that in concret. > But I am interested to see if it can trigger something, so I will be testing > that the next days. > >> Would there be any "safety net" for adding/removing memory in different >> granularities? > > Uhm, I do not think we need it, or at least I cannot think of a case where this > could cause trouble with the current design. > Can you think of any? Nope, as long as it works (especially no change to what we had before), no safety net needed :) I was just curious if add_memory() followed by remove_memory() used to work before and if you patches might change that behavior. Thanks! Will try to look into the details soon! > > Thanks David ;-) >
On 29.03.19 09:45, Oscar Salvador wrote: > On Thu, Mar 28, 2019 at 04:31:44PM +0100, David Hildenbrand wrote: >> Correct me if I am wrong. I think I was confused - vmemmap data is still >> allocated *per memory block*, not for the whole added memory, correct? > > No, vmemap data is allocated per memory-resource added. > In case a DIMM, would be a DIMM, in case a qemu memory-device, would be that > memory-device. > That is counting that ACPI does not split the DIMM/memory-device in several memory > resources. > If that happens, then acpi_memory_enable_device() calls __add_memory for every > memory-resource, which means that the vmemmap data will be allocated per > memory-resource. > I did not see this happening though, and I am not sure under which circumstances > can happen (I have to study the ACPI code a bit more). > > The problem with allocating vmemmap data per memblock, is the fragmentation. > Let us say you do the following: > > * memblock granularity 128M > > (qemu) object_add memory-backend-ram,id=ram0,size=256M > (qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1 > > This will create two memblocks (2 sections), and if we allocate the vmemmap > data for each corresponding section within it section(memblock), you only get > 126M contiguous memory. Oh okay, so actually the way I guessed it would be now. While this makes totally sense, I'll have to look how it is currently handled, meaning if there is a change. I somewhat remembering that delayed struct pages initialization would initialize vmmap per section, not per memory resource. But as I work on 10 things differently, my mind sometimes seems to forget stuff in order to replace it with random nonsense. Will look into the details to not have to ask too many dumb questions. > > So, the taken approach is to allocate the vmemmap data corresponging to the > whole DIMM/memory-device/memory-resource from the beginning of its memory. > > In the example from above, the vmemmap data for both sections is allocated from > the beginning of the first section: > > memmap array takes 2MB per section, so 512 pfns. > If we add 2 sections: > > [ pfn#0 ] \ > [ ... ] | vmemmap used for memmap array > [pfn#1023 ] / > > [pfn#1024 ] \ > [ ... ] | used as normal memory > [pfn#65536] / > > So, out of 256M, we get 252M to use as a real memory, as 4M will be used for > building the memmap array. > > Actually, it can happen that depending on how big a DIMM/memory-device is, > the first/s memblock is fully used for the memmap array (of course, this > can only be seen when adding a huge DIMM/memory-device). > Just stating here, that with your code, add_memory() and remove_memory() always have to be called in the same granularity. Will have to see if that implies a change.
On 29.03.19 09:56, David Hildenbrand wrote: > On 29.03.19 09:45, Oscar Salvador wrote: >> On Thu, Mar 28, 2019 at 04:31:44PM +0100, David Hildenbrand wrote: >>> Correct me if I am wrong. I think I was confused - vmemmap data is still >>> allocated *per memory block*, not for the whole added memory, correct? >> >> No, vmemap data is allocated per memory-resource added. >> In case a DIMM, would be a DIMM, in case a qemu memory-device, would be that >> memory-device. >> That is counting that ACPI does not split the DIMM/memory-device in several memory >> resources. >> If that happens, then acpi_memory_enable_device() calls __add_memory for every >> memory-resource, which means that the vmemmap data will be allocated per >> memory-resource. >> I did not see this happening though, and I am not sure under which circumstances >> can happen (I have to study the ACPI code a bit more). >> >> The problem with allocating vmemmap data per memblock, is the fragmentation. >> Let us say you do the following: >> >> * memblock granularity 128M >> >> (qemu) object_add memory-backend-ram,id=ram0,size=256M >> (qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1 >> >> This will create two memblocks (2 sections), and if we allocate the vmemmap >> data for each corresponding section within it section(memblock), you only get >> 126M contiguous memory. > > Oh okay, so actually the way I guessed it would be now. > > While this makes totally sense, I'll have to look how it is currently > handled, meaning if there is a change. I somewhat remembering that > delayed struct pages initialization would initialize vmmap per section, > not per memory resource. > > But as I work on 10 things differently, my mind sometimes seems to > forget stuff in order to replace it with random nonsense. Will look into > the details to not have to ask too many dumb questions. s/differently/concurrently/ See, nonsense ;)
On Fri, Mar 29, 2019 at 09:56:37AM +0100, David Hildenbrand wrote: > Oh okay, so actually the way I guessed it would be now. > > While this makes totally sense, I'll have to look how it is currently > handled, meaning if there is a change. I somewhat remembering that > delayed struct pages initialization would initialize vmmap per section, > not per memory resource. Uhm, the memmap array for each section is built early during boot. We actually do not care about deferred struct pages initialization there. What we do is: - We go through all memblock regions marked as memory - We mark the sections within those regions present - We initialize those sections and build the corresponding memmap array The thing is that sparse_init_nid() allocates/reserves a buffer big enough to allocate the memmap array for all those sections, and for each memmap array to need to allocate, we consume it from that buffer, using contigous memory. Have a look at: - sparse_memory_present_with_active_regions() - sparse_init() - sparse_init_nid - sparse_buffer_init > But as I work on 10 things differently, my mind sometimes seems to > forget stuff in order to replace it with random nonsense. Will look into > the details to not have to ask too many dumb questions. > > > > > So, the taken approach is to allocate the vmemmap data corresponging to the > > whole DIMM/memory-device/memory-resource from the beginning of its memory. > > > > In the example from above, the vmemmap data for both sections is allocated from > > the beginning of the first section: > > > > memmap array takes 2MB per section, so 512 pfns. > > If we add 2 sections: > > > > [ pfn#0 ] \ > > [ ... ] | vmemmap used for memmap array > > [pfn#1023 ] / > > > > [pfn#1024 ] \ > > [ ... ] | used as normal memory > > [pfn#65536] / > > > > So, out of 256M, we get 252M to use as a real memory, as 4M will be used for > > building the memmap array. > > > > Actually, it can happen that depending on how big a DIMM/memory-device is, > > the first/s memblock is fully used for the memmap array (of course, this > > can only be seen when adding a huge DIMM/memory-device). > > > > Just stating here, that with your code, add_memory() and remove_memory() > always have to be called in the same granularity. Will have to see if > that implies a change. Well, I only tested it in such scenario yes, but I think that ACPI code enforces that somehow. I will take a closer look though.
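A rough userspace model of the boot-time pattern described above: one large per-node buffer is reserved up front and each present section's memmap is carved out of it contiguously, with a fallback allocation if the buffer runs dry. The function names mirror mm/sparse.c, but the bodies are simplified stand-ins (the real sparse_buffer_init() also takes a node id and uses memblock, not malloc).

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SECTION_MEMMAP	(2UL << 20)	/* 2M of struct page per 128M section */

static char *sparsemap_buf, *sparsemap_buf_end;

/* One big contiguous reservation covering all present sections of a node. */
static void sparse_buffer_init(unsigned long size)
{
	sparsemap_buf = malloc(size);
	sparsemap_buf_end = sparsemap_buf ? sparsemap_buf + size : NULL;
}

/* Hand out one section's memmap from the buffer; fall back when exhausted. */
static void *sparse_buffer_alloc(unsigned long size)
{
	void *ptr = NULL;

	if (sparsemap_buf && sparsemap_buf + size <= sparsemap_buf_end) {
		ptr = sparsemap_buf;
		sparsemap_buf += size;
	}
	return ptr ? ptr : malloc(size);	/* fallback path */
}

int main(void)
{
	unsigned long nr_sections = 16;		/* e.g. 2G of boot memory */

	sparse_buffer_init(nr_sections * SECTION_MEMMAP);
	for (unsigned long i = 0; i < nr_sections; i++) {
		void *memmap = sparse_buffer_alloc(SECTION_MEMMAP);

		memset(memmap, 0, SECTION_MEMMAP);	/* pretend to init struct pages */
		printf("section %2lu: memmap at %p\n", i, memmap);
	}
	return 0;
}
```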
On Fri 29-03-19 09:45:47, Oscar Salvador wrote: [...] > * memblock granularity 128M > > (qemu) object_add memory-backend-ram,id=ram0,size=256M > (qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1 > > This will create two memblocks (2 sections), and if we allocate the vmemmap > data for each corresponding section within it section(memblock), you only get > 126M contiguous memory. > > So, the taken approach is to allocate the vmemmap data corresponging to the > whole DIMM/memory-device/memory-resource from the beginning of its memory. > > In the example from above, the vmemmap data for both sections is allocated from > the beginning of the first section: > > memmap array takes 2MB per section, so 512 pfns. > If we add 2 sections: > > [ pfn#0 ] \ > [ ... ] | vmemmap used for memmap array > [pfn#1023 ] / > > [pfn#1024 ] \ > [ ... ] | used as normal memory > [pfn#65536] / > > So, out of 256M, we get 252M to use as a real memory, as 4M will be used for > building the memmap array. Having a larger contiguous area is definitely nice to have but you also have to consider the other side of the thing. If we have a movable memblock with unmovable memory then we are breaking the movable property. So there should be some flexibility for caller to tell whether to allocate on per device or per memblock. Or we need something to move memmaps during the hotremove.
On 3/28/19 6:43 AM, Oscar Salvador wrote: > Hi, > > since last two RFCs were almost unnoticed (thanks David for the feedback), > I decided to re-work some parts to make it more simple and give it a more > testing, and drop the RFC, to see if it gets more attention. > I also added David's feedback, so now all users of add_memory/__add_memory/ > add_memory_resource can specify whether they want to use this feature or not. > I also fixed some compilation issues when CONFIG_SPARSEMEM_VMEMMAP is not set. > Hi Oscar, say, what tree and/or commit does this series apply to? I'm having some trouble finding the right place. Sorry for the easy question, I did try quite a few trees already... thanks,
On Fri, Mar 29, 2019 at 03:23:00PM -0700, John Hubbard wrote: > On 3/28/19 6:43 AM, Oscar Salvador wrote: > > Hi, > > > > since last two RFCs were almost unnoticed (thanks David for the feedback), > > I decided to re-work some parts to make it more simple and give it a more > > testing, and drop the RFC, to see if it gets more attention. > > I also added David's feedback, so now all users of add_memory/__add_memory/ > > add_memory_resource can specify whether they want to use this feature or not. > > I also fixed some compilation issues when CONFIG_SPARSEMEM_VMEMMAP is not set. > > > > Hi Oscar, say, what tree and/or commit does this series apply to? I'm having some > trouble finding the right place. Sorry for the easy question, I did try quite > a few trees already... Hi John, I somehow forgot to mention it in the cover-letter, sorry. This patchsed is based on v5.1-rc2-31-gece06d4a8149 + the following fixes on top: * https://patchwork.kernel.org/patch/10862609/ * https://patchwork.kernel.org/patch/10862611/ * https://patchwork.kernel.org/patch/10853049/
On Fri, Mar 29, 2019 at 02:42:43PM +0100, Michal Hocko wrote: > Having a larger contiguous area is definitely nice to have but you also > have to consider the other side of the thing. If we have a movable > memblock with unmovable memory then we are breaking the movable > property. So there should be some flexibility for caller to tell whether > to allocate on per device or per memblock. Or we need something to move > memmaps during the hotremove. By movable memblock you mean a memblock whose pages can be migrated over when this memblock is offlined, right?
On Mon 01-04-19 09:59:36, Oscar Salvador wrote: > On Fri, Mar 29, 2019 at 02:42:43PM +0100, Michal Hocko wrote: > > Having a larger contiguous area is definitely nice to have but you also > > have to consider the other side of the thing. If we have a movable > > memblock with unmovable memory then we are breaking the movable > > property. So there should be some flexibility for caller to tell whether > > to allocate on per device or per memblock. Or we need something to move > > memmaps during the hotremove. > > By movable memblock you mean a memblock whose pages can be migrated over when > this memblock is offlined, right? I am mostly thinking about movable_node kernel parameter which makes newly hotpluged memory go into ZONE_MOVABLE and people do use that to make sure such a memory can be later hotremoved.
On Mon, Apr 01, 2019 at 01:53:06PM +0200, Michal Hocko wrote: > On Mon 01-04-19 09:59:36, Oscar Salvador wrote: > > On Fri, Mar 29, 2019 at 02:42:43PM +0100, Michal Hocko wrote: > > > Having a larger contiguous area is definitely nice to have but you also > > > have to consider the other side of the thing. If we have a movable > > > memblock with unmovable memory then we are breaking the movable > > > property. So there should be some flexibility for caller to tell whether > > > to allocate on per device or per memblock. Or we need something to move > > > memmaps during the hotremove. > > > > By movable memblock you mean a memblock whose pages can be migrated over when > > this memblock is offlined, right? > > I am mostly thinking about movable_node kernel parameter which makes > newly hotpluged memory go into ZONE_MOVABLE and people do use that to > make sure such a memory can be later hotremoved. Uhm, I might be missing your point, but hot-added memory that makes use of vmemmap pages can be hot-removed as any other memory. Vmemmap pages do not account as unmovable memory, they just stick around until all sections they referred to have been removed, and then, we proceed with removing them. So, to put it in another way: vmemmap pages are left in the system until the whole memory device (DIMM, virt mem-device or whatever) is completely hot-removed.
On 02.04.19 10:28, Oscar Salvador wrote: > On Mon, Apr 01, 2019 at 01:53:06PM +0200, Michal Hocko wrote: >> On Mon 01-04-19 09:59:36, Oscar Salvador wrote: >>> On Fri, Mar 29, 2019 at 02:42:43PM +0100, Michal Hocko wrote: >>>> Having a larger contiguous area is definitely nice to have but you also >>>> have to consider the other side of the thing. If we have a movable >>>> memblock with unmovable memory then we are breaking the movable >>>> property. So there should be some flexibility for caller to tell whether >>>> to allocate on per device or per memblock. Or we need something to move >>>> memmaps during the hotremove. >>> >>> By movable memblock you mean a memblock whose pages can be migrated over when >>> this memblock is offlined, right? >> >> I am mostly thinking about movable_node kernel parameter which makes >> newly hotpluged memory go into ZONE_MOVABLE and people do use that to >> make sure such a memory can be later hotremoved. > > Uhm, I might be missing your point, but hot-added memory that makes use of > vmemmap pages can be hot-removed as any other memory. > > Vmemmap pages do not account as unmovable memory, they just stick around > until all sections they referred to have been removed, and then, we proceed > with removing them. > So, to put it in another way: vmemmap pages are left in the system until the > whole memory device (DIMM, virt mem-device or whatever) is completely > hot-removed. Indeed, separate memblocks can be offlined, but vmemmap is removed along with remove_memory().
On Tue 02-04-19 10:28:15, Oscar Salvador wrote: > On Mon, Apr 01, 2019 at 01:53:06PM +0200, Michal Hocko wrote: > > On Mon 01-04-19 09:59:36, Oscar Salvador wrote: > > > On Fri, Mar 29, 2019 at 02:42:43PM +0100, Michal Hocko wrote: > > > > Having a larger contiguous area is definitely nice to have but you also > > > > have to consider the other side of the thing. If we have a movable > > > > memblock with unmovable memory then we are breaking the movable > > > > property. So there should be some flexibility for caller to tell whether > > > > to allocate on per device or per memblock. Or we need something to move > > > > memmaps during the hotremove. > > > > > > By movable memblock you mean a memblock whose pages can be migrated over when > > > this memblock is offlined, right? > > > > I am mostly thinking about movable_node kernel parameter which makes > > newly hotpluged memory go into ZONE_MOVABLE and people do use that to > > make sure such a memory can be later hotremoved. > > Uhm, I might be missing your point, but hot-added memory that makes use of > vmemmap pages can be hot-removed as any other memory. > > Vmemmap pages do not account as unmovable memory, they just stick around > until all sections they referred to have been removed, and then, we proceed > with removing them. > So, to put it in another way: vmemmap pages are left in the system until the > whole memory device (DIMM, virt mem-device or whatever) is completely > hot-removed. So what is going to happen when you hotadd two memblocks. The first one holds memmaps and then you want to hotremove (not just offline) it?
On Tue, Apr 02, 2019 at 02:48:45PM +0200, Michal Hocko wrote: > So what is going to happen when you hotadd two memblocks. The first one > holds memmaps and then you want to hotremove (not just offline) it? If you hot-add two memblocks, this means that either: a) you hot-add a 256MB-memory-device (128MB per memblock) b) you hot-add two 128MB-memory-device Either way, hot-removing only works for memory-device as a whole, so there is no problem. Vmemmaps are created per hot-added operations, this means that vmemmaps will be created for the hot-added range. And since hot-add/hot-remove operations works with the same granularity, there is no problem. E.g: # (qemu) object_add memory-backend-ram,id=ram0,size=128M # (qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1 # (qemu) object_add memory-backend-ram,id=ram1,size=512M # (qemu) device_add pc-dimm,id=dimm1,memdev=ram1,node=1 # (qemu) object_add memory-backend-ram,id=ram2,size=1G # (qemu) device_add pc-dimm,id=dimm2,memdev=ram2,node=1 These are three hot-add operations. Each hot-add operation will use vmemmaps to hold the memmap for its hot-added sections.
On Wed 03-04-19 10:01:16, Oscar Salvador wrote: > On Tue, Apr 02, 2019 at 02:48:45PM +0200, Michal Hocko wrote: > > So what is going to happen when you hotadd two memblocks. The first one > > holds memmaps and then you want to hotremove (not just offline) it? > > If you hot-add two memblocks, this means that either: > > a) you hot-add a 256MB-memory-device (128MB per memblock) > b) you hot-add two 128MB-memory-device > > Either way, hot-removing only works for memory-device as a whole, so > there is no problem. > > Vmemmaps are created per hot-added operations, this means that > vmemmaps will be created for the hot-added range. > And since hot-add/hot-remove operations works with the same granularity, > there is no problem. What does prevent calling somebody arch_add_memory for a range spanning multiple memblocks from a driver directly. In other words aren't you making assumptions about a future usage based on the qemu usecase?
On 03.04.19 10:12, Michal Hocko wrote: > On Wed 03-04-19 10:01:16, Oscar Salvador wrote: >> On Tue, Apr 02, 2019 at 02:48:45PM +0200, Michal Hocko wrote: >>> So what is going to happen when you hotadd two memblocks. The first one >>> holds memmaps and then you want to hotremove (not just offline) it? >> >> If you hot-add two memblocks, this means that either: >> >> a) you hot-add a 256MB-memory-device (128MB per memblock) >> b) you hot-add two 128MB-memory-device >> >> Either way, hot-removing only works for memory-device as a whole, so >> there is no problem. >> >> Vmemmaps are created per hot-added operations, this means that >> vmemmaps will be created for the hot-added range. >> And since hot-add/hot-remove operations works with the same granularity, >> there is no problem. > > What does prevent calling somebody arch_add_memory for a range spanning > multiple memblocks from a driver directly. In other words aren't you To drivers, we only expose add_memory() and friends. And I think this is a good idea. > making assumptions about a future usage based on the qemu usecase? > As I noted, we only have an issue if add add_memory() and remove_memory() is called with different granularity. I gave two examples where this might not be the case, but we will have to look int the details.
On Wed, Apr 03, 2019 at 10:12:32AM +0200, Michal Hocko wrote: > What does prevent calling somebody arch_add_memory for a range spanning > multiple memblocks from a driver directly. In other words aren't you > making assumptions about a future usage based on the qemu usecase? Well, right now they cannot as it is not exported. But if we want to do it in the future, then yes, I would have to be more careful because I made the assumption that hot-add/hot-remove are working with the same granularity, which is the case right now. Given said this, I think that something like you said before, giving the option to the caller to specify whether it wants vmemmaps per the whole hot-added range or per memblock is a reasonable thing to do. That way, there will not be a problem working with different granularities in hot-add/hot-remove operations and we would be on safe side.
On 03.04.19 10:34, Oscar Salvador wrote: > On Wed, Apr 03, 2019 at 10:12:32AM +0200, Michal Hocko wrote: >> What does prevent calling somebody arch_add_memory for a range spanning >> multiple memblocks from a driver directly. In other words aren't you >> making assumptions about a future usage based on the qemu usecase? > > Well, right now they cannot as it is not exported. > But if we want to do it in the future, then yes, I would have to > be more careful because I made the assumption that hot-add/hot-remove > are working with the same granularity, which is the case right now. > > Given said this, I think that something like you said before, giving > the option to the caller to specify whether it wants vmemmaps per the > whole hot-added range or per memblock is a reasonable thing to do. > That way, there will not be a problem working with different granularities > in hot-add/hot-remove operations and we would be on safe side. There might still be an issue if the person adding memory might be somebody else removing memory. I am not yet sure if we should even allow add_memory/remove_memory with different granularity. But as I noted, ACPI and powernv.
On Wed 03-04-19 10:17:26, David Hildenbrand wrote: > On 03.04.19 10:12, Michal Hocko wrote: > > On Wed 03-04-19 10:01:16, Oscar Salvador wrote: > >> On Tue, Apr 02, 2019 at 02:48:45PM +0200, Michal Hocko wrote: > >>> So what is going to happen when you hotadd two memblocks. The first one > >>> holds memmaps and then you want to hotremove (not just offline) it? > >> > >> If you hot-add two memblocks, this means that either: > >> > >> a) you hot-add a 256MB-memory-device (128MB per memblock) > >> b) you hot-add two 128MB-memory-device > >> > >> Either way, hot-removing only works for memory-device as a whole, so > >> there is no problem. > >> > >> Vmemmaps are created per hot-added operations, this means that > >> vmemmaps will be created for the hot-added range. > >> And since hot-add/hot-remove operations works with the same granularity, > >> there is no problem. > > > > What does prevent calling somebody arch_add_memory for a range spanning > > multiple memblocks from a driver directly. In other words aren't you > > To drivers, we only expose add_memory() and friends. And I think this is > a good idea. > > > making assumptions about a future usage based on the qemu usecase? > > > > As I noted, we only have an issue if add add_memory() and > remove_memory() is called with different granularity. I gave two > examples where this might not be the case, but we will have to look int > the details. It seems natural that the DIMM will be hot remove all at once because you cannot hot remove a half of the DIMM, right? But I can envision that people might want to hotremove a faulty part of a really large DIMM because they would like to save some resources. With different users asking for the hotplug functionality, I do not think we want to make such a strong assumption as hotremove will have the same granularity as hotadd. That being said it should be the caller of the hotplug code to tell the vmemmap allocation strategy. For starter, I would only pack vmemmaps for "regular" kernel zone memory. Movable zones should be more careful. We can always re-evaluate later when there is a strong demand for huge pages on movable zones but this is not the case now because those pages are not really movable in practice.
On 03.04.19 10:37, Michal Hocko wrote: > On Wed 03-04-19 10:17:26, David Hildenbrand wrote: >> On 03.04.19 10:12, Michal Hocko wrote: >>> On Wed 03-04-19 10:01:16, Oscar Salvador wrote: >>>> On Tue, Apr 02, 2019 at 02:48:45PM +0200, Michal Hocko wrote: >>>>> So what is going to happen when you hotadd two memblocks. The first one >>>>> holds memmaps and then you want to hotremove (not just offline) it? >>>> >>>> If you hot-add two memblocks, this means that either: >>>> >>>> a) you hot-add a 256MB-memory-device (128MB per memblock) >>>> b) you hot-add two 128MB-memory-device >>>> >>>> Either way, hot-removing only works for memory-device as a whole, so >>>> there is no problem. >>>> >>>> Vmemmaps are created per hot-added operations, this means that >>>> vmemmaps will be created for the hot-added range. >>>> And since hot-add/hot-remove operations works with the same granularity, >>>> there is no problem. >>> >>> What does prevent calling somebody arch_add_memory for a range spanning >>> multiple memblocks from a driver directly. In other words aren't you >> >> To drivers, we only expose add_memory() and friends. And I think this is >> a good idea. >> >>> making assumptions about a future usage based on the qemu usecase? >>> >> >> As I noted, we only have an issue if add add_memory() and >> remove_memory() is called with different granularity. I gave two >> examples where this might not be the case, but we will have to look int >> the details. > > It seems natural that the DIMM will be hot remove all at once because > you cannot hot remove a half of the DIMM, right? But I can envision that > people might want to hotremove a faulty part of a really large DIMM > because they would like to save some resources. Even for virtio-mem, something like that would be useful. But I could try to live without it :) Add a lot of memory in one go when starting up (add_memory()) - much faster than doing individual remove_memory() calls. When removing memory, as soon as all parts of a memblock are offline, remove only the memblock to save memory (remove_memory()). There, I would need to allocate it per memblock. > > With different users asking for the hotplug functionality, I do not > think we want to make such a strong assumption as hotremove will have > the same granularity as hotadd. > Then we have to make sure it works for all use cases. > > That being said it should be the caller of the hotplug code to tell > the vmemmap allocation strategy. For starter, I would only pack vmemmaps > for "regular" kernel zone memory. Movable zones should be more careful. > We can always re-evaluate later when there is a strong demand for huge > pages on movable zones but this is not the case now because those pages > are not really movable in practice. Remains the issue with potential different user trying to remove memory it didn't add in some other granularity. We then really have to identify and isolate that case.
On Wed 03-04-19 10:41:35, David Hildenbrand wrote: > On 03.04.19 10:37, Michal Hocko wrote: [...] > > That being said it should be the caller of the hotplug code to tell > > the vmemmap allocation strategy. For starter, I would only pack vmemmaps > > for "regular" kernel zone memory. Movable zones should be more careful. > > We can always re-evaluate later when there is a strong demand for huge > > pages on movable zones but this is not the case now because those pages > > are not really movable in practice. > > Remains the issue with potential different user trying to remove memory > it didn't add in some other granularity. We then really have to identify > and isolate that case. Can you give an example of a sensible usecase that would require this?
On Wed, Apr 03, 2019 at 10:41:35AM +0200, David Hildenbrand wrote: > > That being said it should be the caller of the hotplug code to tell > > the vmemmap allocation strategy. For starter, I would only pack vmemmaps > > for "regular" kernel zone memory. Movable zones should be more careful. > > We can always re-evaluate later when there is a strong demand for huge > > pages on movable zones but this is not the case now because those pages > > are not really movable in practice. > > Remains the issue with potential different user trying to remove memory > it didn't add in some other granularity. We then really have to identify > and isolate that case. If we let the caller specify whether it wants vmemmaps per memblock or range, I would trust that caller to do the correct thing and specify one thing or another depending on what it wants to do in the future. So, say a driver adds 512MB memory and it specifies that it wants vmemmaps per memblock because later on it will like to hot-remove in chunks of 128MB.
On 03.04.19 10:49, Michal Hocko wrote: > On Wed 03-04-19 10:41:35, David Hildenbrand wrote: >> On 03.04.19 10:37, Michal Hocko wrote: > [...] >>> That being said it should be the caller of the hotplug code to tell >>> the vmemmap allocation strategy. For starter, I would only pack vmemmaps >>> for "regular" kernel zone memory. Movable zones should be more careful. >>> We can always re-evaluate later when there is a strong demand for huge >>> pages on movable zones but this is not the case now because those pages >>> are not really movable in practice. >> >> Remains the issue with potential different user trying to remove memory >> it didn't add in some other granularity. We then really have to identify >> and isolate that case. > > Can you give an example of a sensible usecase that would require this? > The two cases I mentioned are 1. arch/powerpc/platforms/powernv/memtrace.c: memtrace_alloc_node() AFAIKS, memory that wasn't added by memtrace is tried to be offlined + removed. "Remove memory in memory block size chunks so that iomem resources are always split to the same size and we never try to remove memory that spans two iomem resources" 2. drivers/acpi/acpi_memhotplug.c:acpi_memory_enable_device() We might hit "__add_memory() == -EEXIST" and continue. When removing the devices, __remove_memory() is called. I am still to find out if that could imply removing in a different granularity than added.
On 03.04.19 10:50, Oscar Salvador wrote: > On Wed, Apr 03, 2019 at 10:41:35AM +0200, David Hildenbrand wrote: >>> That being said it should be the caller of the hotplug code to tell >>> the vmemmap allocation strategy. For starter, I would only pack vmemmaps >>> for "regular" kernel zone memory. Movable zones should be more careful. >>> We can always re-evaluate later when there is a strong demand for huge >>> pages on movable zones but this is not the case now because those pages >>> are not really movable in practice. >> >> Remains the issue with potential different user trying to remove memory >> it didn't add in some other granularity. We then really have to identify >> and isolate that case. > > If we let the caller specify whether it wants vmemmaps per memblock or range, > I would trust that caller to do the correct thing and specify one thing or > another depending on what it wants to do in the future. > > So, say a driver adds 512MB memory and it specifies that it wants vmemmaps per > memblock because later on it will like to hot-remove in chunks of 128MB. > I am talking about the memtrace and ACPI thing. Otherwise I agree, trust the user iff the user is the same person adding/removing memory.
On Wed, Apr 03, 2019 at 10:37:57AM +0200, Michal Hocko wrote: > That being said it should be the caller of the hotplug code to tell > the vmemmap allocation strategy. For starter, I would only pack vmemmaps > for "regular" kernel zone memory. Movable zones should be more careful. > We can always re-evaluate later when there is a strong demand for huge > pages on movable zones but this is not the case now because those pages > are not really movable in practice. I agree that makes sense to let the caller specify if it wants to allocate vmemmaps per memblock or per memory-range, so we are more flexible when it comes to granularity in hot-add/hot-remove operations. But the thing is that the zones are picked at onling stage, while vmemmaps are created at hot-add stage, so I am not sure we can define the strategy depending on the zone.
On Wed 03-04-19 11:40:54, Oscar Salvador wrote: > On Wed, Apr 03, 2019 at 10:37:57AM +0200, Michal Hocko wrote: > > That being said it should be the caller of the hotplug code to tell > > the vmemmap allocation strategy. For starter, I would only pack vmemmaps > > for "regular" kernel zone memory. Movable zones should be more careful. > > We can always re-evaluate later when there is a strong demand for huge > > pages on movable zones but this is not the case now because those pages > > are not really movable in practice. > > I agree that makes sense to let the caller specify if it wants to allocate > vmemmaps per memblock or per memory-range, so we are more flexible when it > comes to granularity in hot-add/hot-remove operations. And just to be more specific. The api shouldn't really care about this implementation detail. So ideally the caller of add_pages just picks up the proper allocator and the rest is completely transparent. > But the thing is that the zones are picked at onling stage, while > vmemmaps are created at hot-add stage, so I am not sure we can define > the strategy depending on the zone. THis is a good point. We do have means to tell a default zone for the regular memory hotplug so this can be reused based on the pfn range (default_zone_for_pfn).
On 4/3/19 11:40 AM, Oscar Salvador wrote: > On Wed, Apr 03, 2019 at 10:37:57AM +0200, Michal Hocko wrote: >> That being said it should be the caller of the hotplug code to tell >> the vmemmap allocation strategy. For starter, I would only pack vmemmaps >> for "regular" kernel zone memory. Movable zones should be more careful. >> We can always re-evaluate later when there is a strong demand for huge >> pages on movable zones but this is not the case now because those pages >> are not really movable in practice. > > I agree that makes sense to let the caller specify if it wants to allocate > vmemmaps per memblock or per memory-range, so we are more flexible when it > comes to granularity in hot-add/hot-remove operations. Please also add possibility to not allocate from hotadded memory (i.e. allocate from node 0 as it is done now). I know about some MCDRAM users that want to expose as much as possible to userspace, and not even occupy those ~4% for memmap. > But the thing is that the zones are picked at onling stage, while > vmemmaps are created at hot-add stage, so I am not sure we can define > the strategy depending on the zone. >
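Pulling the options discussed in this thread together, the caller-visible knob could conceptually look like the sketch below. This is purely illustrative: neither the enum nor the function exists in the series as posted, and the names are assumptions, not the patchset's API.

```c
#include <stdio.h>

/* Hypothetical strategies reflecting the requests made in this thread. */
enum memmap_strategy {
	MEMMAP_ON_NODE,		/* today's behaviour: allocate the memmap from
				   already-present memory so every hot-added
				   page stays usable (the MCDRAM request)     */
	MEMMAP_PER_MEMBLOCK,	/* one vmemmap chunk per 128M memblock, so the
				   range can be removed memblock by memblock  */
	MEMMAP_PER_RANGE,	/* one vmemmap chunk at the start of the whole
				   hot-added range, as this series does       */
};

/* Illustrative stand-in for an add_memory()-style entry point. */
static int add_memory_sketch(int nid, unsigned long long start,
			     unsigned long long size,
			     enum memmap_strategy strategy)
{
	static const char * const how[] = {
		"memmap from node", "memmap per memblock", "memmap per range",
	};

	printf("node %d: add [%#llx-%#llx), %s\n",
	       nid, start, start + size, how[strategy]);
	return 0;
}

int main(void)
{
	/* a user that later removes 128M chunks would pick PER_MEMBLOCK */
	add_memory_sketch(1, 0x100000000ULL, 512ULL << 20, MEMMAP_PER_MEMBLOCK);
	/* an MCDRAM-style user wants every hot-added page exposed */
	add_memory_sketch(1, 0x140000000ULL, 16ULL << 30, MEMMAP_ON_NODE);
	return 0;
}
```

Whether such a knob should be exposed at all is Michal's earlier point: the hotplug core could also pick the strategy itself, for example packing the vmemmap only when the range is destined for a kernel zone rather than ZONE_MOVABLE.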