Message ID | 20250320015551.2157511-10-changyuanl@google.com
---|---
State | New
Series | kexec: introduce Kexec HandOver (KHO)
On Wed, Mar 19, 2025 at 06:55:44PM -0700, Changyuan Lyu wrote: > +/** > + * kho_preserve_folio - preserve a folio across KHO. > + * @folio: folio to preserve > + * > + * Records that the entire folio is preserved across KHO. The order > + * will be preserved as well. > + * > + * Return: 0 on success, error code on failure > + */ > +int kho_preserve_folio(struct folio *folio) > +{ > + unsigned long pfn = folio_pfn(folio); > + unsigned int order = folio_order(folio); > + int err; > + > + if (!kho_enable) > + return -EOPNOTSUPP; > + > + down_read(&kho_out.tree_lock); > + if (kho_out.fdt) { What is the lock and fdt test for? I'm getting the feeling that probably kho_preserve_folio() and the like should accept some kind of 'struct kho_serialization *' and then we don't need this to prove we are within a valid serialization window. It could pass the pointer through the notifiers. The global variables in this series are sort of ugly.. We want this to be fast, so try hard to avoid a lock.. > +void *kho_restore_phys(phys_addr_t phys, size_t size) > +{ > + unsigned long start_pfn, end_pfn, pfn; > + void *va = __va(phys); > + > + start_pfn = PFN_DOWN(phys); > + end_pfn = PFN_UP(phys + size); > + > + for (pfn = start_pfn; pfn < end_pfn; pfn++) { > + struct page *page = pfn_to_online_page(pfn); > + > + if (!page) > + return NULL; > + kho_restore_page(page); > + } > + > + return va; > +} > +EXPORT_SYMBOL_GPL(kho_restore_phys); What do you imagine this is used for? I'm not sure what value there is in returning a void *. How does the caller "free" this? > +#define KHOSER_PTR(type) \ > + union { \ > + phys_addr_t phys; \ > + type ptr; \ > + } > +#define KHOSER_STORE_PTR(dest, val) \ > + ({ \ > + (dest).phys = virt_to_phys(val); \ > + typecheck(typeof((dest).ptr), val); \ > + }) > +#define KHOSER_LOAD_PTR(src) \ > + ((src).phys ? (typeof((src).ptr))(phys_to_virt((src).phys)) : NULL) I had imagined these macros would be in a header and usable by drivers that also want to use structs to carry information. > +static void deserialize_bitmap(unsigned int order, > + struct khoser_mem_bitmap_ptr *elm) > +{ > + struct kho_mem_phys_bits *bitmap = KHOSER_LOAD_PTR(elm->bitmap); > + unsigned long bit; > + > + for_each_set_bit(bit, bitmap->preserve, PRESERVE_BITS) { > + int sz = 1 << (order + PAGE_SHIFT); > + phys_addr_t phys = > + elm->phys_start + (bit << (order + PAGE_SHIFT)); > + struct page *page = phys_to_page(phys); > + > + memblock_reserve(phys, sz); > + memblock_reserved_mark_noinit(phys, sz); Mike asked about this earlier, is it worth combining runs of set bits to increase sz? Or is this sort of temporary pending something better that doesn't rely on memblock_reserve? > + page->private = order; Can't just set the page order directly? Why use private? > @@ -829,6 +1305,10 @@ static __init int kho_init(void) > > kho_out.root.name = ""; ? > err = kho_add_string_prop(&kho_out.root, "compatible", "kho-v1"); > + err |= kho_add_prop(&kho_out.preserved_memory, "metadata", > + &kho_out.first_chunk_phys, sizeof(phys_addr_t)); "metadata" doesn't feel like a great name, a better one is needed.. Please also document the FDT schema thoroughly! There should be YAML files just like in the normal DT case defining all of this. This level of documentation and stability was one of the selling reasons why FDT is being used here! Jason
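To illustrate the point about reusing the KHOSER_* helpers, here is a minimal sketch of how a driver could carry a struct across kexec with them if they lived in a shared header. Only KHOSER_PTR, KHOSER_STORE_PTR and KHOSER_LOAD_PTR come from the quoted hunk; the driver-side names (my_dev_state, my_dev_ring, the two helpers) are made up for illustration.

```c
/*
 * Sketch only: a driver-owned structure handed over via KHO. The pointer
 * is stored as a phys_addr_t, which is what stays meaningful across the
 * kexec; virtual addresses do not.
 */
struct my_dev_state {
	u64 generation;
	KHOSER_PTR(struct my_dev_ring *) ring;
};

static void my_dev_serialize(struct my_dev_state *state,
			     struct my_dev_ring *ring)
{
	/* Record the physical address so the next kernel can find the ring. */
	KHOSER_STORE_PTR(state->ring, ring);
}

static struct my_dev_ring *my_dev_deserialize(struct my_dev_state *state)
{
	/* Map back to a usable virtual pointer, or NULL if nothing was stored. */
	return KHOSER_LOAD_PTR(state->ring);
}
```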
On Fri, Mar 21, 2025 at 10:46:29AM -0300, Jason Gunthorpe wrote: > On Wed, Mar 19, 2025 at 06:55:44PM -0700, Changyuan Lyu wrote: > > > > +static void deserialize_bitmap(unsigned int order, > > + struct khoser_mem_bitmap_ptr *elm) > > +{ > > + struct kho_mem_phys_bits *bitmap = KHOSER_LOAD_PTR(elm->bitmap); > > + unsigned long bit; > > + > > + for_each_set_bit(bit, bitmap->preserve, PRESERVE_BITS) { > > + int sz = 1 << (order + PAGE_SHIFT); > > + phys_addr_t phys = > > + elm->phys_start + (bit << (order + PAGE_SHIFT)); > > + struct page *page = phys_to_page(phys); > > + > > + memblock_reserve(phys, sz); > > + memblock_reserved_mark_noinit(phys, sz); > > Mike asked about this earlier, is it work combining runs of set bits > to increase sz? Or is this sort of temporary pending something better > that doesn't rely on memblock_reserve? This hunk actually came from me. I decided to keep it simple for now and check what are the alternatives, like moving away from memblock_reserve(), adding a maple_tree or even something else. > > + page->private = order; > > Can't just set the page order directly? Why use private? Setting the order means recreating the folio the way prep_compound_page() does. I think it's better to postpone it until the folio is requested. This way it might run after SMP is enabled. Besides, when we start allocating folios separately from struct page, initializing it here would be a real issue. > Jason
On Sat, Mar 22, 2025 at 03:12:26PM -0400, Mike Rapoport wrote: > This hunk actually came from me. I decided to keep it simple for now and > check what are the alternatives, like moving away from memblock_reserve(), > adding a maple_tree or even something else. Okay, makes sense to me > > > + page->private = order; > > > > Can't just set the page order directly? Why use private? > > Setting the order means recreating the folio the way prep_compound_page() > does. I think it's better to postpone it until the folio is requested. This > way it might run after SMP is enabled. I see, that makes sense, but also it could still use page->order.. > Besides, when we start allocating > folios separately from struct page, initializing it here would be a real > issue. Yes, but also we wouldn't have page->private to make it work.. Somehow anything we want to carry over would have to become encoded in the memdesc directly. I think this supports my remark someplace else that any user of this that wants to preserve per-page data should do it on its own somehow as an add-on-top? Jason
On Fri, Mar 21, 2025 at 10:46:29 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote: > On Wed, Mar 19, 2025 at 06:55:44PM -0700, Changyuan Lyu wrote: > > +/** > > + * kho_preserve_folio - preserve a folio across KHO. > > + * @folio: folio to preserve > > + * > > + * Records that the entire folio is preserved across KHO. The order > > + * will be preserved as well. > > + * > > + * Return: 0 on success, error code on failure > > + */ > > +int kho_preserve_folio(struct folio *folio) > > +{ > > + unsigned long pfn = folio_pfn(folio); > > + unsigned int order = folio_order(folio); > > + int err; > > + > > + if (!kho_enable) > > + return -EOPNOTSUPP; > > + > > + down_read(&kho_out.tree_lock); > > + if (kho_out.fdt) { > > What is the lock and fdt test for? It is to avoid the competition between the following 2 operations, - converting the hashtables and mem traker to FDT, - adding new data to hashtable/mem tracker. Please also see function kho_finalize() in the previous patch "kexec: add Kexec HandOver (KHO) generation helpers" [1]. The function kho_finalize() iterates over all the hashtables and the mem tracker. We want to make sure that during the iterations, no new data is added to the hashtables and mem tracker. Also if FDT is generated, the mem tracker then has been serialized to linked pages, so we return -EBUSY to prevent more data from being added to the mem tracker. > I'm getting the feeling that probably kho_preserve_folio() and the > like should accept some kind of > 'struct kho_serialization *' and then we don't need this to prove we > are within a valid serialization window. It could pass the pointer > through the notifiers If we use notifiers, callbacks have to be done serially. > The global variables in this series are sort of ugly.. > > We want this to be fast, so try hard to avoid a lock.. In most cases we only need read lock. Different KHO users can adding data into their own subnodes in parallel. We only need a write lock if - 2 KHO users register subnodes to the KHO root node at the same time - KHO root tree is about to be converted to FDT. > > +void *kho_restore_phys(phys_addr_t phys, size_t size) > > +{ > > + unsigned long start_pfn, end_pfn, pfn; > > + void *va = __va(phys); > > + > > + start_pfn = PFN_DOWN(phys); > > + end_pfn = PFN_UP(phys + size); > > + > > + for (pfn = start_pfn; pfn < end_pfn; pfn++) { > > + struct page *page = pfn_to_online_page(pfn); > > + > > + if (!page) > > + return NULL; > > + kho_restore_page(page); > > + } > > + > > + return va; > > +} > > +EXPORT_SYMBOL_GPL(kho_restore_phys); > > What do you imagine this is used for? I'm not sure what value there is > in returning a void *? How does the caller "free" this? This function is also from Mike :) I suppose some KHO users may still preserve memory using memory ranges (instead of folio). In the restoring stage they need a helper to setup the pages of reserved memory ranges. A void * is returned so the KHO user can access the memory contents through the virtual address. I guess the caller can free the ranges by free_pages()? It makes sense to return nothing and let caller to call `__va` if they want. Then the function signature looks more symmetric to `kho_preserve_phys`. > > +#define KHOSER_PTR(type) \ > > + union { \ > > + phys_addr_t phys; \ > > + type ptr; \ > > + } > > +#define KHOSER_STORE_PTR(dest, val) \ > > + ({ \ > > + (dest).phys = virt_to_phys(val); \ > > + typecheck(typeof((dest).ptr), val); \ > > + }) > > +#define KHOSER_LOAD_PTR(src) \ > > + ((src).phys ? 
(typeof((src).ptr))(phys_to_virt((src).phys)) : NULL) > > I had imagined these macros would be in a header and usably by drivers > that also want to use structs to carry information. > OK I will move them to the header file in the next version. > > [...] > > @@ -829,6 +1305,10 @@ static __init int kho_init(void) > > > > kho_out.root.name = ""; > > ? Set the root node name to an empty string since fdt_begin_node calls strlen on the node name. It is equivalent to `err = fdt_begin_node(fdt, "")` in kho_serialize() of Mike's V4 patch [2]. > > err = kho_add_string_prop(&kho_out.root, "compatible", "kho-v1"); > > + err |= kho_add_prop(&kho_out.preserved_memory, "metadata", > > + &kho_out.first_chunk_phys, sizeof(phys_addr_t)); > > metedata doesn't fee like a great a better name.. > > Please also document all the FDT schema thoroughly! > > There should be yaml files just like in the normal DT case defining > all of this. This level of documentation and stability was one of the > selling reasons why FDT is being used here! YAML files were dropped because we think it may take a while for our schema to be near stable. So we start from some simple plain text. We can add some prop and node docs (that are considered stable at this point) back to YAML in the next version. [1] https://lore.kernel.org/all/20250320015551.2157511-8-changyuanl@google.com/ [2] https://lore.kernel.org/all/20250206132754.2596694-6-rppt@kernel.org/ Best, Changyuan
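A rough sketch of the more symmetric shape described above, where the restore helper only re-initializes the struct pages and the caller derives the mapping itself. This is illustrative only; it is not what the posted patch does, and keeping an int error return is an assumption on top of the "return nothing" idea.

```c
/* Hypothetical signature, mirroring kho_preserve_phys(). */
int kho_restore_phys(phys_addr_t phys, size_t size);

static void *my_caller_restore(phys_addr_t phys, size_t size)
{
	int err;

	err = kho_restore_phys(phys, size);	/* fix up struct pages only */
	if (err)
		return NULL;

	return __va(phys);			/* caller derives the mapping */
}
```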
On Sun, Mar 23, 2025 at 03:55:52PM -0300, Jason Gunthorpe wrote: > On Sat, Mar 22, 2025 at 03:12:26PM -0400, Mike Rapoport wrote: > > > > > + page->private = order; > > > > > > Can't just set the page order directly? Why use private? > > > > Setting the order means recreating the folio the way prep_compound_page() > > does. I think it's better to postpone it until the folio is requested. This > > way it might run after SMP is enabled. > > I see, that makes sense, but also it could stil use page->order.. But there's no page->order :) > > Besides, when we start allocating > > folios separately from struct page, initializing it here would be a real > > issue. > > Yes, but also we wouldn't have page->private to make it work.. Somehow > anything we want to carry over would have to become encoded in the > memdesc directly. This is a problem to solve in 2026 :) The January update for State of Page [1] talks about reasonable goal to shrink struct page to (approximately): struct page { unsigned long flags; union { struct list_head buddy_list; struct list_head pcp_list; struct { unsigned long memdesc; int _refcount; }; }; union { unsigned long private; struct { int _folio_mapcount; }; }; }; [1] https://lore.kernel.org/linux-mm/Z37pxbkHPbLYnDKn@casper.infradead.org/ > Jason
On Mon, Mar 24, 2025 at 02:18:34PM -0400, Mike Rapoport wrote: > On Sun, Mar 23, 2025 at 03:55:52PM -0300, Jason Gunthorpe wrote: > > On Sat, Mar 22, 2025 at 03:12:26PM -0400, Mike Rapoport wrote: > > > > > > > + page->private = order; > > > > > > > > Can't just set the page order directly? Why use private? > > > > > > Setting the order means recreating the folio the way prep_compound_page() > > > does. I think it's better to postpone it until the folio is requested. This > > > way it might run after SMP is enabled. > > > > I see, that makes sense, but also it could stil use page->order.. > > But there's no page->order :) I mean this: static inline unsigned int folio_order(const struct folio *folio) { if (!folio_test_large(folio)) return 0; return folio->_flags_1 & 0xff; } > > Yes, but also we wouldn't have page->private to make it work.. Somehow > > anything we want to carry over would have to become encoded in the > > memdesc directly. > > This is a problem to solve in 2026 :) Yes :) Jason
On Sun, Mar 23, 2025 at 12:07:58PM -0700, Changyuan Lyu wrote: > > > + down_read(&kho_out.tree_lock); > > > + if (kho_out.fdt) { > > > > What is the lock and fdt test for? > > It is to avoid the competition between the following 2 operations, > - converting the hashtables and mem traker to FDT, > - adding new data to hashtable/mem tracker. I think you should strive to prevent this by code construction at a higher level. Do not lock each preserve; lock entire object serialization operations. For instance, if we do recursive FDT, then you'd lock the call that builds a single FDT page for a single object. > In most cases we only need read lock. Different KHO users can adding > data into their own subnodes in parallel. Read locks like this are still quite slow in parallel systems; there is a lot of slow cacheline bouncing, as taking a read lock still has to write to the lock memory. > > What do you imagine this is used for? I'm not sure what value there is > > in returning a void *? How does the caller "free" this? > > This function is also from Mike :) > > I suppose some KHO users may still > preserve memory using memory ranges (instead of folio). I don't know what that would be, but the folio scheme is all about preserving memory from the page buddy allocator; I don't know what this is for or how it would be used. IMHO split this into its own patch and include it in the series that would use it. > I guess the caller can free the ranges by free_pages()? The folios were not set up right, so no.. And if this is the case then you'd just get the struct page and convert it to a void * with some helper function, not implement a whole new function... > > There should be yaml files just like in the normal DT case defining > > all of this. This level of documentation and stability was one of the > > selling reasons why FDT is being used here! > > YAML files were dropped because we think it may take a while for our > schema to be near stable. So we start from some simple plain text. We > can add some prop and node docs (that are considered stable at this point) > back to YAML in the next version. You need to do something to document what is going on here and show the full schema with some explanation. It is hard to grasp the full intention just from the C code. Jason
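For reference, one possible shape of the 'struct kho_serialization *' interface being hinted at here. Everything below is a sketch: the type, the reworked signatures, and the assumption that the KHO core would pass the context to notifier callbacks via the data argument. Holding a valid context would itself prove the serialization window is open, so the per-call lock/fdt check disappears.

```c
/*
 * Sketch of a context-based API. The KHO core would create the context
 * when serialization starts and destroy it when serialization finishes.
 */
struct kho_serialization;	/* opaque, owned by the KHO core */

int kho_preserve_folio(struct kho_serialization *ser, struct folio *folio);
int kho_preserve_phys(struct kho_serialization *ser,
		      phys_addr_t phys, size_t size);

static struct folio *my_user_folio;	/* some KHO user's state */

static int my_user_kho_notifier(struct notifier_block *nb,
				unsigned long action, void *data)
{
	/* Assumed calling convention: context arrives as @data. */
	struct kho_serialization *ser = data;

	/* Action checks omitted in this sketch. */
	return notifier_from_errno(kho_preserve_folio(ser, my_user_folio));
}
```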
On Mon, Mar 24, 2025 at 05:07:36PM -0300, Jason Gunthorpe wrote: > On Mon, Mar 24, 2025 at 02:18:34PM -0400, Mike Rapoport wrote: > > On Sun, Mar 23, 2025 at 03:55:52PM -0300, Jason Gunthorpe wrote: > > > On Sat, Mar 22, 2025 at 03:12:26PM -0400, Mike Rapoport wrote: > > > > > > > > > + page->private = order; > > > > > > > > > > Can't just set the page order directly? Why use private? > > > > > > > > Setting the order means recreating the folio the way prep_compound_page() > > > > does. I think it's better to postpone it until the folio is requested. This > > > > way it might run after SMP is enabled. > > > > > > I see, that makes sense, but also it could stil use page->order.. > > > > But there's no page->order :) > > I mean this: > > static inline unsigned int folio_order(const struct folio *folio) > { > if (!folio_test_large(folio)) > return 0; > return folio->_flags_1 & 0xff; > } I don't think it's better than page->private, KHO will need to prep_compound_page() anyway so these will be overwritten there. And I don't remember, but having those set before prep_compound_page() might trigger VM_BUG_ON_PGFLAGS(). > Jason
Hi Changyuan, On Wed, Mar 19 2025, Changyuan Lyu wrote: > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org> > > Introduce APIs allowing KHO users to preserve memory across kexec and > get access to that memory after boot of the kexeced kernel > > kho_preserve_folio() - record a folio to be preserved over kexec > kho_restore_folio() - recreates the folio from the preserved memory > kho_preserve_phys() - record physically contiguous range to be > preserved over kexec. > kho_restore_phys() - recreates order-0 pages corresponding to the > preserved physical range > > The memory preservations are tracked by two levels of xarrays to manage > chunks of per-order 512 byte bitmaps. For instance the entire 1G order > of a 1TB x86 system would fit inside a single 512 byte bitmap. For > order 0 allocations each bitmap will cover 16M of address space. Thus, > for 16G of memory at most 512K of bitmap memory will be needed for order 0. > > At serialization time all bitmaps are recorded in a linked list of pages > for the next kernel to process and the physical address of the list is > recorded in KHO FDT. Why build the xarray only to transform it down to bitmaps when you can build the bitmaps from the get go? This would end up wasting both time and memory. At least from this patch, I don't really see much else being done with the xarray apart from setting bits in the bitmap. Of course, with the current linked list structure, this cannot work. But I don't see why we need to have it. I think having a page-table like structure would be better -- only instead of having PTEs at the lowest levels, you have the bitmap. Just like page tables, each table is page-size. So each page at the lowest level can have 4k * 8 == 32768 bits. This maps to 128 MiB of 4k pages. The next level will be pointers to the level 1 table, just like in page tables. So we get 4096 / 8 == 512 pointers. Each level 2 table maps to 64 GiB of memory. Similarly, level 3 table maps to 32 TiB and level 4 to 16 PiB. Now, __kho_preserve() can just find or allocate the table entry for the PFN and set its bit. Similar work has to be done when doing the xarray access as well, so this should have roughly the same performance. When doing KHO, we just need to record the base address of the table and we are done. This saves us from doing the expensive copying/transformation of data in the critical path. I don't see any obvious downsides compared to the current format. The serialized state might end up taking slightly more memory due to upper level tables, but it should still be much less than having two representations of the same information exist simultaneously. > > The next kernel then processes that list, reserves the memory ranges and > later, when a user requests a folio or a physical range, KHO restores > corresponding memory map entries. > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> > Co-developed-by: Changyuan Lyu <changyuanl@google.com> > Signed-off-by: Changyuan Lyu <changyuanl@google.com> [...] 
> +static void deserialize_bitmap(unsigned int order, > + struct khoser_mem_bitmap_ptr *elm) > +{ > + struct kho_mem_phys_bits *bitmap = KHOSER_LOAD_PTR(elm->bitmap); > + unsigned long bit; > + > + for_each_set_bit(bit, bitmap->preserve, PRESERVE_BITS) { > + int sz = 1 << (order + PAGE_SHIFT); > + phys_addr_t phys = > + elm->phys_start + (bit << (order + PAGE_SHIFT)); > + struct page *page = phys_to_page(phys); > + > + memblock_reserve(phys, sz); > + memblock_reserved_mark_noinit(phys, sz); Why waste time and memory building the reserved ranges? We already have all the information in the serialized bitmaps, and memblock is already only allocating from scratch. So we should not need this at all, and instead simply skip these pages in memblock_free_pages(). With the page-table like format I mentioned above, this should be very easy since you can find out whether a page is reserved or not in O(1) time. > + page->private = order; > + } > +} > + > +static void __init kho_mem_deserialize(void) > +{ > + struct khoser_mem_chunk *chunk; > + struct kho_in_node preserved_mem; > + const phys_addr_t *mem; > + int err; > + u32 len; > + > + err = kho_get_node(NULL, "preserved-memory", &preserved_mem); > + if (err) { > + pr_err("no preserved-memory node: %d\n", err); > + return; > + } > + > + mem = kho_get_prop(&preserved_mem, "metadata", &len); > + if (!mem || len != sizeof(*mem)) { > + pr_err("failed to get preserved memory bitmaps\n"); > + return; > + } > + > + chunk = *mem ? phys_to_virt(*mem) : NULL; > + while (chunk) { > + unsigned int i; > + > + memblock_reserve(virt_to_phys(chunk), sizeof(*chunk)); > + > + for (i = 0; i != chunk->hdr.num_elms; i++) > + deserialize_bitmap(chunk->hdr.order, > + &chunk->bitmaps[i]); > + chunk = KHOSER_LOAD_PTR(chunk->hdr.next); > + } > +} > + > /* Helper functions for KHO state tree */ > > struct kho_prop { [...]
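To make the proposed layout concrete, here is a rough sketch of the index math for the bottom two of the four levels described above (4 KiB table pages, order-0 tracking), plus the O(1) "was this page preserved?" test that a memblock_free_pages()-style path could use. The names are invented; the real layout is whatever a follow-up patch defines.

```c
#define KHO_BITS_PER_L1		(PAGE_SIZE * BITS_PER_BYTE)	/* 32768 bits -> 128 MiB */
#define KHO_PTRS_PER_TABLE	(PAGE_SIZE / sizeof(void *))	/* 512 entries */

/* Bit within the L1 bitmap that covers this order-0 pfn. */
static inline unsigned int kho_l1_index(unsigned long pfn)
{
	return pfn & (KHO_BITS_PER_L1 - 1);
}

/* Slot within the L2 table that points at the right L1 bitmap. */
static inline unsigned int kho_l2_index(unsigned long pfn)
{
	return (pfn / KHO_BITS_PER_L1) & (KHO_PTRS_PER_TABLE - 1);
}

/* O(1) preserved-page test, given the L2 table covering this pfn. */
static bool kho_pfn_preserved(unsigned long **l2, unsigned long pfn)
{
	unsigned long *l1 = l2[kho_l2_index(pfn)];

	return l1 && test_bit(kho_l1_index(pfn), l1);
}
```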
On Thu, Mar 27, 2025 at 10:03:17AM +0000, Pratyush Yadav wrote: > Of course, with the current linked list structure, this cannot work. But > I don't see why we need to have it. I think having a page-table like > structure would be better -- only instead of having PTEs at the lowest > levels, you have the bitmap. Yes, but there is a trade off here of what I could write in 30 mins and what is maximally possible :) The xarray is providing a page table implementation in a library form. I think this whole thing can be optimized, especially the memblock_reserve side, but the idea here is to get started and once we have some data on what the actual preservation workload is then someone can optimize this. Otherwise we are going to be spending months just polishing this one patch without any actual data on where the performance issues and hot spots actually are. Jason
On Thu, Mar 27 2025, Jason Gunthorpe wrote: > On Thu, Mar 27, 2025 at 10:03:17AM +0000, Pratyush Yadav wrote: > >> Of course, with the current linked list structure, this cannot work. But >> I don't see why we need to have it. I think having a page-table like >> structure would be better -- only instead of having PTEs at the lowest >> levels, you have the bitmap. > > Yes, but there is a trade off here of what I could write in 30 mins > and what is maximally possible :) The xarray is providing a page table > implementation in a library form. > > I think this whole thing can be optimized, especially the > memblock_reserve side, but the idea here is to get started and once we > have some data on what the actual preservation workload is then > someone can optimize this. > > Otherwise we are going to be spending months just polishing this one > patch without any actual data on where the performance issues and hot > spots actually are. The memblock_reserve side we can optimize later, I agree. But the memory preservation format is ABI and I think that is worth spending a little more time on. And I don't think it should be that much more complex than the current format. I want to hack around with it, so I'll give it a try over the next few days and see what I can come up with.
On Thu, Mar 27, 2025 at 05:28:40PM +0000, Pratyush Yadav wrote: > > Otherwise we are going to be spending months just polishing this one > > patch without any actual data on where the performance issues and hot > > spots actually are. > > The memblock_reserve side we can optimize later, I agree. But the memory > preservation format is ABI I think the agreement was that nothing is ABI at this point.. > and I think that is worth spending a little > more time on. And I don't think it should be that much more complex than > the current format. Maybe! Jason
Hi Pratyush, Thanks for the suggestions! On Thu, Mar 27, 2025 at 17:28:40 +0000, Pratyush Yadav <ptyadav@amazon.de> wrote: > On Thu, Mar 27 2025, Jason Gunthorpe wrote: > > > On Thu, Mar 27, 2025 at 10:03:17AM +0000, Pratyush Yadav wrote: > > > >> Of course, with the current linked list structure, this cannot work. But > >> I don't see why we need to have it. I think having a page-table like > >> structure would be better -- only instead of having PTEs at the lowest > >> levels, you have the bitmap. > > > > Yes, but there is a trade off here of what I could write in 30 mins > > and what is maximally possible :) The xarray is providing a page table > > implementation in a library form. > > > > I think this whole thing can be optimized, especially the > > memblock_reserve side, but the idea here is to get started and once we > > have some data on what the actual preservation workload is then > > someone can optimize this. > > > > Otherwise we are going to be spending months just polishing this one > > patch without any actual data on where the performance issues and hot > > spots actually are. > > The memblock_reserve side we can optimize later, I agree. But the memory > preservation format is ABI and I think that is worth spending a little > more time on. And I don't think it should be that much more complex than > the current format. > > I want to hack around with it, so I'll give it a try over the next few > days and see what I can come up with. I agree with Jason that "nothing is ABI at this point" and it will take some time for KHO to stabilize. On the other hand, if you have already come up with something working and simple, we can include it in the next version. (Sorry for the late reply, I was traveling.) Best, Changyuan
Hi, On Wed, Apr 02 2025, Changyuan Lyu wrote: > Hi Pratyush, Thanks for suggestions! > > On Thu, Mar 27, 2025 at 17:28:40 +0000, Pratyush Yadav <ptyadav@amazon.de> wrote: >> On Thu, Mar 27 2025, Jason Gunthorpe wrote: >> >> > On Thu, Mar 27, 2025 at 10:03:17AM +0000, Pratyush Yadav wrote: >> > >> >> Of course, with the current linked list structure, this cannot work. But >> >> I don't see why we need to have it. I think having a page-table like >> >> structure would be better -- only instead of having PTEs at the lowest >> >> levels, you have the bitmap. >> > >> > Yes, but there is a trade off here of what I could write in 30 mins >> > and what is maximally possible :) The xarray is providing a page table >> > implementation in a library form. >> > >> > I think this whole thing can be optimized, especially the >> > memblock_reserve side, but the idea here is to get started and once we >> > have some data on what the actual preservation workload is then >> > someone can optimize this. >> > >> > Otherwise we are going to be spending months just polishing this one >> > patch without any actual data on where the performance issues and hot >> > spots actually are. >> >> The memblock_reserve side we can optimize later, I agree. But the memory >> preservation format is ABI and I think that is worth spending a little >> more time on. And I don't think it should be that much more complex than >> the current format. >> >> I want to hack around with it, so I'll give it a try over the next few >> days and see what I can come up with. > > I agree with Jason that "nothing is ABI at this > point" and it will take some time for KHO to stabilize. > > On the other hand if you have already came up with something working and > simple, we can include it in the next version. I already have something that works with zero-order pages. I am currently implementing support for other orders. It is almost done, but I need to test it and do a performance comparison with the current patch. Will post something soon!
On Wed, Apr 2, 2025 at 12:47 PM Pratyush Yadav <ptyadav@amazon.de> wrote: > > Hi, > > On Wed, Apr 02 2025, Changyuan Lyu wrote: > > > Hi Pratyush, Thanks for suggestions! > > > > On Thu, Mar 27, 2025 at 17:28:40 +0000, Pratyush Yadav <ptyadav@amazon.de> wrote: > >> On Thu, Mar 27 2025, Jason Gunthorpe wrote: > >> > >> > On Thu, Mar 27, 2025 at 10:03:17AM +0000, Pratyush Yadav wrote: > >> > > >> >> Of course, with the current linked list structure, this cannot work. But > >> >> I don't see why we need to have it. I think having a page-table like > >> >> structure would be better -- only instead of having PTEs at the lowest > >> >> levels, you have the bitmap. > >> > > >> > Yes, but there is a trade off here of what I could write in 30 mins > >> > and what is maximally possible :) The xarray is providing a page table > >> > implementation in a library form. > >> > > >> > I think this whole thing can be optimized, especially the > >> > memblock_reserve side, but the idea here is to get started and once we > >> > have some data on what the actual preservation workload is then > >> > someone can optimize this. > >> > > >> > Otherwise we are going to be spending months just polishing this one > >> > patch without any actual data on where the performance issues and hot > >> > spots actually are. > >> > >> The memblock_reserve side we can optimize later, I agree. But the memory > >> preservation format is ABI and I think that is worth spending a little > >> more time on. And I don't think it should be that much more complex than > >> the current format. > >> > >> I want to hack around with it, so I'll give it a try over the next few > >> days and see what I can come up with. > > > > I agree with Jason that "nothing is ABI at this > > point" and it will take some time for KHO to stabilize. > > > > On the other hand if you have already came up with something working and > > simple, we can include it in the next version. > > I already have something that works with zero-order pages. I am > currently implementing support for other orders. It is almost done, but > I need to test it and do a performance comparison with the current > patch. Will post something soon! Hi Pratyush, Just to clarify, how soon? We are about to post v6 for KHO, with all other comments in this thread addressed. Thanks, Pasha > > -- > Regards, > Pratyush Yadav
On Wed, Apr 02 2025, Pasha Tatashin wrote: > On Wed, Apr 2, 2025 at 12:47 PM Pratyush Yadav <ptyadav@amazon.de> wrote: >> >> Hi, >> >> On Wed, Apr 02 2025, Changyuan Lyu wrote: >> >> > Hi Pratyush, Thanks for suggestions! >> > >> > On Thu, Mar 27, 2025 at 17:28:40 +0000, Pratyush Yadav <ptyadav@amazon.de> wrote: [...] >> >> >> >> The memblock_reserve side we can optimize later, I agree. But the memory >> >> preservation format is ABI and I think that is worth spending a little >> >> more time on. And I don't think it should be that much more complex than >> >> the current format. >> >> >> >> I want to hack around with it, so I'll give it a try over the next few >> >> days and see what I can come up with. >> > >> > I agree with Jason that "nothing is ABI at this >> > point" and it will take some time for KHO to stabilize. >> > >> > On the other hand if you have already came up with something working and >> > simple, we can include it in the next version. >> >> I already have something that works with zero-order pages. I am >> currently implementing support for other orders. It is almost done, but >> I need to test it and do a performance comparison with the current >> patch. Will post something soon! > > Hi Pratyush, > > Just to clarify, how soon? We are about to post v6 for KHO, with all > other comments in this thread addressed. I have it working, but I need to clean up the code a bit and test it better. So hopefully end of this week or early next week.
Hi Changyuan, On Wed, Mar 19 2025, Changyuan Lyu wrote: > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org> > > Introduce APIs allowing KHO users to preserve memory across kexec and > get access to that memory after boot of the kexeced kernel > > kho_preserve_folio() - record a folio to be preserved over kexec > kho_restore_folio() - recreates the folio from the preserved memory > kho_preserve_phys() - record physically contiguous range to be > preserved over kexec. > kho_restore_phys() - recreates order-0 pages corresponding to the > preserved physical range > > The memory preservations are tracked by two levels of xarrays to manage > chunks of per-order 512 byte bitmaps. For instance the entire 1G order > of a 1TB x86 system would fit inside a single 512 byte bitmap. For > order 0 allocations each bitmap will cover 16M of address space. Thus, > for 16G of memory at most 512K of bitmap memory will be needed for order 0. > > At serialization time all bitmaps are recorded in a linked list of pages > for the next kernel to process and the physical address of the list is > recorded in KHO FDT. > > The next kernel then processes that list, reserves the memory ranges and > later, when a user requests a folio or a physical range, KHO restores > corresponding memory map entries. > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> > Co-developed-by: Changyuan Lyu <changyuanl@google.com> > Signed-off-by: Changyuan Lyu <changyuanl@google.com> > --- > include/linux/kexec_handover.h | 38 +++ > kernel/kexec_handover.c | 486 ++++++++++++++++++++++++++++++++- > 2 files changed, 522 insertions(+), 2 deletions(-) [...] > +int kho_preserve_phys(phys_addr_t phys, size_t size) > +{ > + unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size); > + unsigned int order = ilog2(end_pfn - pfn); This caught my eye when playing around with the code. It does not put any limit on the order, so it can exceed NR_PAGE_ORDERS. Also, when initializing the page after KHO, we pass the order directly to prep_compound_page() without sanity checking it. The next kernel might not support all the orders the current one supports. Perhaps something to fix? > + unsigned long failed_pfn; > + int err = 0; > + > + if (!kho_enable) > + return -EOPNOTSUPP; > + > + down_read(&kho_out.tree_lock); > + if (kho_out.fdt) { > + err = -EBUSY; > + goto unlock; > + } > + > + for (; pfn < end_pfn; > + pfn += (1 << order), order = ilog2(end_pfn - pfn)) { > + err = __kho_preserve(&kho_mem_track, pfn, order); > + if (err) { > + failed_pfn = pfn; > + break; > + } > + } [... > +struct folio *kho_restore_folio(phys_addr_t phys) > +{ > + struct page *page = pfn_to_online_page(PHYS_PFN(phys)); > + unsigned long order = page->private; > + > + if (!page) > + return NULL; > + > + order = page->private; > + if (order) > + prep_compound_page(page, order); > + else > + kho_restore_page(page); > + > + return page_folio(page); > +} [...]
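One possible shape of the fix suggested here, as a sketch only: clamp the computed order on the preserve side and refuse unknown orders on the restore side. NR_PAGE_ORDERS and most of the restore path come from the quoted code; the clamping helper is hypothetical. Note that this version also reads page->private only after the NULL check, which the quoted kho_restore_folio() misses.

```c
/* Preserve side: never record a chunk larger than the buddy allocator handles. */
static unsigned int kho_clamp_order(unsigned long nr_pfns)
{
	return min_t(unsigned int, ilog2(nr_pfns), NR_PAGE_ORDERS - 1);
}

/* Restore side: validate the recorded order before trusting it. */
struct folio *kho_restore_folio(phys_addr_t phys)
{
	struct page *page = pfn_to_online_page(PHYS_PFN(phys));
	unsigned long order;

	if (!page)
		return NULL;

	order = page->private;
	if (order >= NR_PAGE_ORDERS)	/* written by an incompatible kernel */
		return NULL;
	if (order)
		prep_compound_page(page, order);
	else
		kho_restore_page(page);

	return page_folio(page);
}
```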
On Wed, Apr 02, 2025 at 07:16:27PM +0000, Pratyush Yadav wrote: > > +int kho_preserve_phys(phys_addr_t phys, size_t size) > > +{ > > + unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size); > > + unsigned int order = ilog2(end_pfn - pfn); > > This caught my eye when playing around with the code. It does not put > any limit on the order, so it can exceed NR_PAGE_ORDERS. Also, when > initializing the page after KHO, we pass the order directly to > prep_compound_page() without sanity checking it. The next kernel might > not support all the orders the current one supports. Perhaps something > to fix? IMHO we should delete the phys functions until we get a user of them Jason
On Wed, Apr 02, 2025 at 07:16:27PM +0000, Pratyush Yadav wrote: > Hi Changyuan, > > On Wed, Mar 19 2025, Changyuan Lyu wrote: > > > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org> > > > > Introduce APIs allowing KHO users to preserve memory across kexec and > > get access to that memory after boot of the kexeced kernel > > > > kho_preserve_folio() - record a folio to be preserved over kexec > > kho_restore_folio() - recreates the folio from the preserved memory > > kho_preserve_phys() - record physically contiguous range to be > > preserved over kexec. > > kho_restore_phys() - recreates order-0 pages corresponding to the > > preserved physical range > > > > The memory preservations are tracked by two levels of xarrays to manage > > chunks of per-order 512 byte bitmaps. For instance the entire 1G order > > of a 1TB x86 system would fit inside a single 512 byte bitmap. For > > order 0 allocations each bitmap will cover 16M of address space. Thus, > > for 16G of memory at most 512K of bitmap memory will be needed for order 0. > > > > At serialization time all bitmaps are recorded in a linked list of pages > > for the next kernel to process and the physical address of the list is > > recorded in KHO FDT. > > > > The next kernel then processes that list, reserves the memory ranges and > > later, when a user requests a folio or a physical range, KHO restores > > corresponding memory map entries. > > > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> > > Co-developed-by: Changyuan Lyu <changyuanl@google.com> > > Signed-off-by: Changyuan Lyu <changyuanl@google.com> > > --- > > include/linux/kexec_handover.h | 38 +++ > > kernel/kexec_handover.c | 486 ++++++++++++++++++++++++++++++++- > > 2 files changed, 522 insertions(+), 2 deletions(-) > [...] > > +int kho_preserve_phys(phys_addr_t phys, size_t size) > > +{ > > + unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size); > > + unsigned int order = ilog2(end_pfn - pfn); > > This caught my eye when playing around with the code. It does not put > any limit on the order, so it can exceed NR_PAGE_ORDERS. Also, when I don't see a problem with this > initializing the page after KHO, we pass the order directly to > prep_compound_page() without sanity checking it. The next kernel might > not support all the orders the current one supports. Perhaps something > to fix? And this needs to be fixed and we should refuse to create folios larger than MAX_ORDER. > > + unsigned long failed_pfn; > > + int err = 0; > > + > > + if (!kho_enable) > > + return -EOPNOTSUPP; > > + > > + down_read(&kho_out.tree_lock); > > + if (kho_out.fdt) { > > + err = -EBUSY; > > + goto unlock; > > + } > > + > > + for (; pfn < end_pfn; > > + pfn += (1 << order), order = ilog2(end_pfn - pfn)) { > > + err = __kho_preserve(&kho_mem_track, pfn, order); > > + if (err) { > > + failed_pfn = pfn; > > + break; > > + } > > + } > [... > > +struct folio *kho_restore_folio(phys_addr_t phys) > > +{ > > + struct page *page = pfn_to_online_page(PHYS_PFN(phys)); > > + unsigned long order = page->private; > > + > > + if (!page) > > + return NULL; > > + > > + order = page->private; > > + if (order) > > + prep_compound_page(page, order); > > + else > > + kho_restore_page(page); > > + > > + return page_folio(page); > > +} > [...] > > -- > Regards, > Pratyush Yadav
On Thu, Apr 03, 2025 at 08:42:09AM -0300, Jason Gunthorpe wrote: > On Wed, Apr 02, 2025 at 07:16:27PM +0000, Pratyush Yadav wrote: > > > +int kho_preserve_phys(phys_addr_t phys, size_t size) > > > +{ > > > + unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size); > > > + unsigned int order = ilog2(end_pfn - pfn); > > > > This caught my eye when playing around with the code. It does not put > > any limit on the order, so it can exceed NR_PAGE_ORDERS. Also, when > > initializing the page after KHO, we pass the order directly to > > prep_compound_page() without sanity checking it. The next kernel might > > not support all the orders the current one supports. Perhaps something > > to fix? > > IMHO we should delete the phys functions until we get a user of them The only user of memory tracker in this series uses kho_preserve_phys() > Jason
On Thu, Apr 03, 2025 at 04:58:27PM +0300, Mike Rapoport wrote: > On Thu, Apr 03, 2025 at 08:42:09AM -0300, Jason Gunthorpe wrote: > > On Wed, Apr 02, 2025 at 07:16:27PM +0000, Pratyush Yadav wrote: > > > > +int kho_preserve_phys(phys_addr_t phys, size_t size) > > > > +{ > > > > + unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size); > > > > + unsigned int order = ilog2(end_pfn - pfn); > > > > > > This caught my eye when playing around with the code. It does not put > > > any limit on the order, so it can exceed NR_PAGE_ORDERS. Also, when > > > initializing the page after KHO, we pass the order directly to > > > prep_compound_page() without sanity checking it. The next kernel might > > > not support all the orders the current one supports. Perhaps something > > > to fix? > > > > IMHO we should delete the phys functions until we get a user of them > > The only user of memory tracker in this series uses kho_preserve_phys() But it really shouldn't. The reserved memory is a completely different mechanism from buddy allocator preservation. It doesn't even call kho_restore_phys() on those pages, it just feeds the ranges directly to: + reserved_mem_add(*p_start, size, name); The bitmaps should be understood as preserving memory from the buddy allocator only. IMHO it should not call kho_preserve_phys() at all. Jason
Hi all, The below patch implements the table-based memory preservation mechanism I suggested. It is a replacement for this patch. Instead of using an xarray of bitmaps and converting them into a linked list of bitmaps at serialization time, it tracks preserved pages in a page-table-like format that needs no extra work when serializing. This results in noticeably better performance when preserving a large number of pages. To compare performance, I allocated 48 GiB of memory and preserved it using KHO. Below is the time taken to make the reservations, and then serialize that to FDT. Linked list: 577ms +- 0.7% (6 samples) Table: 469ms +- 0.6% (6 samples) From this, we can see that the table is almost 19% faster. This test was done with only one thread, but since it is possible to make reservations in parallel, the performance would increase even more -- especially since the linked list serialization cannot be parallelized easily. In terms of memory usage, I could not collect reliable data, but I don't think there should be a significant difference between the two approaches since the bitmaps are the same density, and the only difference would be the extra metadata (chunks vs upper level tables). Memory usage for tables can be further optimized if needed by collapsing full tables. That is, if all bits in an L1 table are set, we can just not allocate a page for it, and instead set a flag in the L2 descriptor. The patch currently has a limitation where it does not free any of the empty tables after an unpreserve operation. But Changyuan's patch also doesn't do it, so at least it is not any worse off. In terms of code size, I believe both are roughly the same. This patch is 609 lines compared to Changyuan's 522, many of which come from the longer comment. When working on this patch, I realized that kho_mem_deserialize() is currently _very_ slow. It takes over 2 seconds to make memblock reservations for 48 GiB of 0-order pages. I suppose this can later be optimized by teaching memblock_free_all() to skip preserved pages instead of making memblock reservations. Regards, Pratyush Yadav ---- 8< ---- From 40c1274052709e4d102cc9fe55fa94272f827283 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" <rppt@kernel.org> Date: Wed, 19 Mar 2025 18:55:44 -0700 Subject: [PATCH] kexec: enable KHO support for memory preservation Introduce APIs allowing KHO users to preserve memory across kexec and get access to that memory after boot of the kexeced kernel kho_preserve_folio() - record a folio to be preserved over kexec kho_restore_folio() - recreates the folio from the preserved memory kho_preserve_phys() - record physically contiguous range to be preserved over kexec. kho_restore_phys() - recreates order-0 pages corresponding to the preserved physical range The memory preservations are tracked by using 4 levels of tables, similar to page tables, except at the lowest level, a bitmap is present instead of PTEs. Each page order has its own separate table. A set bit in the bitmap represents a page of the table's order. The tables are named simply by their level, with the highest being level 4 (L4), the next being level 3 (L3) and so on. Assuming 0-order 4K pages, an L1 table will have a total of 4096 * 8 == 32768 bits. This maps to 128 MiB of memory. L2 and above tables will consist of pointers to lower level tables, so each level will have 4096 / 8 == 512 pointers. This means each level 2 table maps to 64 GiB of memory, each level 3 table maps to 32 TiB of memory, and each level 4 table maps to 16 PiB of memory.
More information on the table format can be found in the comment in the patch. At serialization time, all that needs to be done is to record the top-level table descriptors for each order. The next kernel can use those to find the tables, walk the them, reserve the memory ranges, and later when a user requests a folio or a physical range, KHO restores corresponding memory map entries. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Co-developed-by: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Changyuan Lyu <changyuanl@google.com> [ptyadav@amazon.de: table based preserved page tracking] Co-developed-by: Pratyush Yadav <ptyadav@amazon.de> Signed-off-by: Pratyush Yadav <ptyadav@amazon.de> --- include/linux/kexec_handover.h | 38 +++ kernel/kexec_handover.c | 573 ++++++++++++++++++++++++++++++++- 2 files changed, 609 insertions(+), 2 deletions(-) diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h index c665ff6cd728a..d52a7b500f4ce 100644 --- a/include/linux/kexec_handover.h +++ b/include/linux/kexec_handover.h @@ -5,6 +5,7 @@ #include <linux/types.h> #include <linux/hashtable.h> #include <linux/notifier.h> +#include <linux/mm_types.h> struct kho_scratch { phys_addr_t addr; @@ -54,6 +55,13 @@ int kho_add_string_prop(struct kho_node *node, const char *key, int register_kho_notifier(struct notifier_block *nb); int unregister_kho_notifier(struct notifier_block *nb); +int kho_preserve_folio(struct folio *folio); +int kho_unpreserve_folio(struct folio *folio); +int kho_preserve_phys(phys_addr_t phys, size_t size); +int kho_unpreserve_phys(phys_addr_t phys, size_t size); +struct folio *kho_restore_folio(phys_addr_t phys); +void *kho_restore_phys(phys_addr_t phys, size_t size); + void kho_memory_init(void); void kho_populate(phys_addr_t handover_fdt_phys, phys_addr_t scratch_phys, @@ -118,6 +126,36 @@ static inline int unregister_kho_notifier(struct notifier_block *nb) return -EOPNOTSUPP; } +static inline int kho_preserve_folio(struct folio *folio) +{ + return -EOPNOTSUPP; +} + +static inline int kho_unpreserve_folio(struct folio *folio) +{ + return -EOPNOTSUPP; +} + +static inline int kho_preserve_phys(phys_addr_t phys, size_t size) +{ + return -EOPNOTSUPP; +} + +static inline int kho_unpreserve_phys(phys_addr_t phys, size_t size) +{ + return -EOPNOTSUPP; +} + +static inline struct folio *kho_restore_folio(phys_addr_t phys) +{ + return NULL; +} + +static inline void *kho_restore_phys(phys_addr_t phys, size_t size) +{ + return NULL; +} + static inline void kho_memory_init(void) { } diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c index 6ebad2f023f95..16f10bd06de0a 100644 --- a/kernel/kexec_handover.c +++ b/kernel/kexec_handover.c @@ -3,6 +3,7 @@ * kexec_handover.c - kexec handover metadata processing * Copyright (C) 2023 Alexander Graf <graf@amazon.com> * Copyright (C) 2025 Microsoft Corporation, Mike Rapoport <rppt@kernel.org> + * Copyright (C) 2025 Amazon.com Inc. or its affiliates, Pratyush Yadav <ptyadav@amazon.de> * Copyright (C) 2024 Google LLC */ @@ -62,6 +63,10 @@ struct kho_out { struct rw_semaphore tree_lock; struct kho_node root; + /* Array containing the L4 table descriptors for each order. 
*/ + unsigned long mem_tables[NR_PAGE_ORDERS]; + struct kho_node preserved_memory; + void *fdt; u64 fdt_max; }; @@ -70,6 +75,7 @@ static struct kho_out kho_out = { .chain_head = BLOCKING_NOTIFIER_INIT(kho_out.chain_head), .tree_lock = __RWSEM_INITIALIZER(kho_out.tree_lock), .root = KHO_NODE_INIT, + .preserved_memory = KHO_NODE_INIT, .fdt_max = 10 * SZ_1M, }; @@ -237,6 +243,559 @@ int kho_node_check_compatible(const struct kho_in_node *node, } EXPORT_SYMBOL_GPL(kho_node_check_compatible); +/* + * Keep track of memory that is to be preserved across KHO. + * + * The memory is tracked by using 4 levels of tables, similar to page tables, + * except at the lowest level, a bitmap is present instead of PTEs. Each page + * order has its own separate table. A set bit in the bitmap represents a page + * of the table's order. The tables are named simply by their level, with the + * highest being level 4 (L4), the next being level 3 (L3) and so on. + * + * The table hierarchy can be seen with the below diagram. + * + * +----+ + * | L4 | + * +----+ + * | + * | +----+ + * +-->| L3 | + * +----+ + * | + * | +----+ + * +-->| L2 | + * +----+ + * | + * | +----+ + * +-->| L1 | + * +----+ + * + * Assuming 0-order 4K pages, a L1 table will have a total of 4096 * 8 == 23768 + * bits. This maps to 128 MiB of memory. L2 and above tables will consist of + * pointers to lower level tables, so each level will have 4096 / 8 == 512 + * pointers. This means Each level 2 table maps to 64 GiB of memory, each level + * 3 table maps to 32 TiB of memory, and each level 4 table maps to 16 PiB of + * memory. + * + * The below diagram shows how the address is split into the different levels + * for 0-order 4K pages: + * + * 63:54 53:45 44:36 35:27 26:12 11:0 + * +----------+--------+--------+--------+--------------+-----------+ + * | Ignored | L4 | L3 | L2 | L1 | Page off | + * +----------+--------+--------+--------+--------------+-----------+ + * + * For higher order pages, the bits for each level get shifted left by the + * order. + * + * Each table except L1 contains a descriptor for the next level table. For 4K + * pages, the below diagram shows the format of the descriptor: + * + * 63:12 11:0 + * +----------+--------+--------+--------+--------------+-----------+ + * | Pointer to next level table | Reserved | + * +----------+--------+--------+--------+--------------+-----------+ + * + * The reserved bits must be zero, but can be used for flags in later versions. + */ + +typedef unsigned long khomem_desc_t; +typedef int (*khomem_walk_fn_t)(unsigned long phys, unsigned int order, void *arg); + +#define PTRS_PER_LEVEL (PAGE_SIZE / sizeof(unsigned long)) +#define KHOMEM_L1_BITS (PAGE_SIZE * BITS_PER_BYTE) +#define KHOMEM_L1_MASK ((1 << ilog2(KHOMEM_L1_BITS)) - 1) +#define KHOMEM_L1_SHIFT (PAGE_SHIFT) +#define KHOMEM_L2_SHIFT (KHOMEM_L1_SHIFT + ilog2(KHOMEM_L1_BITS)) +#define KHOMEM_L3_SHIFT (KHOMEM_L2_SHIFT + ilog2(PTRS_PER_LEVEL)) +#define KHOMEM_L4_SHIFT (KHOMEM_L3_SHIFT + ilog2(PTRS_PER_LEVEL)) +#define KHOMEM_PFN_MASK PAGE_MASK + +static unsigned int khomem_level_shifts[] = { + [1] = KHOMEM_L1_SHIFT, + [2] = KHOMEM_L2_SHIFT, + [3] = KHOMEM_L3_SHIFT, + [4] = KHOMEM_L4_SHIFT, +}; + +static inline unsigned long khomem_table_index(unsigned long address, + unsigned int level, + unsigned int order) +{ + unsigned long mask = level == 1 ? KHOMEM_L1_MASK : (PTRS_PER_LEVEL - 1); + /* Avoid undefined behaviour in case shift is too big. 
*/ + int shift = min_t(int, khomem_level_shifts[level] + order, BITS_PER_LONG); + + return (address >> shift) & mask; +} + +static inline khomem_desc_t *khomem_table(khomem_desc_t desc) +{ + return __va(desc & KHOMEM_PFN_MASK); +} + +static inline khomem_desc_t *khomem_table_offset(khomem_desc_t *table, + unsigned long address, + unsigned int level, + unsigned int order) +{ + return khomem_table(*table) + khomem_table_index(address, level, order); +} + +static inline bool khomem_desc_none(khomem_desc_t desc) +{ + return !(desc & KHOMEM_PFN_MASK); +} + +static inline void khomem_bitmap_preserve(khomem_desc_t *desc, + unsigned long address, + unsigned int order) +{ + /* set_bit() is atomic, so no need for locking. */ + set_bit(khomem_table_index(address, 1, order), khomem_table(*desc)); +} + +static inline void khomem_bitmap_unpreserve(khomem_desc_t *desc, + unsigned long address, + unsigned int order) +{ + /* clear_bit() is atomic, so no need for locking. */ + clear_bit(khomem_table_index(address, 1, order), khomem_table(*desc)); +} + +static inline khomem_desc_t khomem_mkdesc(void *table) +{ + return virt_to_phys(table) & KHOMEM_PFN_MASK; +} + +static int __khomem_table_alloc(khomem_desc_t *desc) +{ + if (khomem_desc_none(*desc)) { + khomem_desc_t *table, val; + + table = (khomem_desc_t *)get_zeroed_page(GFP_KERNEL); + if (!table) + return -ENOMEM; + + val = khomem_mkdesc(table); + if (cmpxchg(desc, 0, val)) + /* Someone else already allocated it. */ + free_page((unsigned long)table); + } + + return 0; +} + +static khomem_desc_t *khomem_table_alloc(khomem_desc_t *desc, + unsigned long address, + unsigned int level, + unsigned int order) +{ + if (__khomem_table_alloc(desc)) + return NULL; + + return khomem_table_offset(desc, address, level, order); +} + +static int khomem_preserve(khomem_desc_t *l4, unsigned long pfn, + unsigned int order) +{ + unsigned long address = PFN_PHYS(pfn); + khomem_desc_t *l4p, *l3p, *l2p; + int ret; + + l4p = khomem_table_alloc(l4, address, 4, order); + if (!l4p) + return -ENOMEM; + + l3p = khomem_table_alloc(l4p, address, 3, order); + if (!l3p) + return -ENOMEM; + + l2p = khomem_table_alloc(l3p, address, 2, order); + if (!l2p) + return -ENOMEM; + + /* + * L1 table is handled different since it is a bitmap not a table of + * descriptors. So offsetting into it directly does not work. + */ + ret = __khomem_table_alloc(l2p); + if (ret) + return ret; + + khomem_bitmap_preserve(l2p, address, order); + return 0; +} + +/* TODO: Clean up empty tables eventually. 
*/ +static void khomem_unpreserve(khomem_desc_t *l4, unsigned long pfn, + unsigned int order) +{ + unsigned long address = PFN_PHYS(pfn); + khomem_desc_t *l4p, *l3p, *l2p; + + if (khomem_desc_none(*l4)) + return; + + l4p = khomem_table_offset(l4, address, 4, order); + if (khomem_desc_none(*l4p)) + return; + + l3p = khomem_table_offset(l4p, address, 3, order); + if (khomem_desc_none(*l3p)) + return; + + l2p = khomem_table_offset(l3p, address, 2, order); + if (khomem_desc_none(*l2p)) + return; + + khomem_bitmap_unpreserve(l2p, address, order); +} + +static int khomem_walk_l1(unsigned long *table, unsigned long addr, + unsigned int order, khomem_walk_fn_t fn, void *arg) +{ + int ret, i; + + for_each_set_bit(i, table, KHOMEM_L1_BITS) { + ret = fn(addr + (i * PAGE_SIZE), order, arg); + if (ret) + return ret; + } + + return 0; +} + +static int __khomem_walk_table(khomem_desc_t *base, unsigned int level, + unsigned long addr, unsigned int order, + khomem_walk_fn_t fn, void *arg) +{ + unsigned long block = (1UL << (khomem_level_shifts[level] + order)); + khomem_desc_t *cur; + int ret; + + if (level == 1) + return khomem_walk_l1(base, addr, order, fn, arg); + + for (cur = base; cur < base + PTRS_PER_LEVEL; cur++, addr += block) { + if (!khomem_desc_none(*cur)) { + ret = __khomem_walk_table(khomem_table(*cur), level - 1, + addr, order, fn, arg); + if (ret) + return ret; + } + } + + return 0; +} + +static int khomem_walk_preserved(khomem_desc_t *l4, unsigned int order, + khomem_walk_fn_t fn, void *arg) +{ + if (khomem_desc_none(*l4)) + return 0; + + return __khomem_walk_table(khomem_table(*l4), 4, 0, order, fn, arg); +} + +struct kho_mem_track { + /* Points to L4 KHOMEM descriptor, each order gets its own table. */ + struct xarray orders; +}; + +static struct kho_mem_track kho_mem_track; + +static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz) +{ + void *elm, *res; + + elm = xa_load(xa, index); + if (elm) + return elm; + + elm = kzalloc(sz, GFP_KERNEL); + if (!elm) + return ERR_PTR(-ENOMEM); + + res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL); + if (xa_is_err(res)) + res = ERR_PTR(xa_err(res)); + + if (res) { + kfree(elm); + return res; + } + + return elm; +} + +static void __kho_unpreserve(struct kho_mem_track *tracker, unsigned long pfn, + unsigned int order) +{ + khomem_desc_t *l4; + + l4 = xa_load(&tracker->orders, order); + if (!l4) + return; + + khomem_unpreserve(l4, pfn, order); +} + +static int __kho_preserve(struct kho_mem_track *tracker, unsigned long pfn, + unsigned int order) +{ + khomem_desc_t *l4; + + might_sleep(); + + l4 = xa_load_or_alloc(&tracker->orders, order, sizeof(*l4)); + if (IS_ERR(l4)) + return PTR_ERR(l4); + + khomem_preserve(l4, pfn, order); + + return 0; +} + +/** + * kho_preserve_folio - preserve a folio across KHO. + * @folio: folio to preserve + * + * Records that the entire folio is preserved across KHO. The order + * will be preserved as well. 
+ * + * Return: 0 on success, error code on failure + */ +int kho_preserve_folio(struct folio *folio) +{ + unsigned long pfn = folio_pfn(folio); + unsigned int order = folio_order(folio); + int err; + + if (!kho_enable) + return -EOPNOTSUPP; + + down_read(&kho_out.tree_lock); + if (kho_out.fdt) { + err = -EBUSY; + goto unlock; + } + + err = __kho_preserve(&kho_mem_track, pfn, order); + +unlock: + up_read(&kho_out.tree_lock); + + return err; +} +EXPORT_SYMBOL_GPL(kho_preserve_folio); + +/** + * kho_unpreserve_folio - unpreserve a folio + * @folio: folio to unpreserve + * + * Remove the record of a folio previously preserved by kho_preserve_folio(). + * + * Return: 0 on success, error code on failure + */ +int kho_unpreserve_folio(struct folio *folio) +{ + unsigned long pfn = folio_pfn(folio); + unsigned int order = folio_order(folio); + int err = 0; + + down_read(&kho_out.tree_lock); + if (kho_out.fdt) { + err = -EBUSY; + goto unlock; + } + + __kho_unpreserve(&kho_mem_track, pfn, order); + +unlock: + up_read(&kho_out.tree_lock); + + return err; +} +EXPORT_SYMBOL_GPL(kho_unpreserve_folio); + +/** + * kho_preserve_phys - preserve a physically contiguous range across KHO. + * @phys: physical address of the range + * @size: size of the range + * + * Records that the entire range from @phys to @phys + @size is preserved + * across KHO. + * + * Return: 0 on success, error code on failure + */ +int kho_preserve_phys(phys_addr_t phys, size_t size) +{ + unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size); + unsigned int order = ilog2(end_pfn - pfn); + unsigned long failed_pfn; + int err = 0; + + if (!kho_enable) + return -EOPNOTSUPP; + + down_read(&kho_out.tree_lock); + if (kho_out.fdt) { + err = -EBUSY; + goto unlock; + } + + for (; pfn < end_pfn; + pfn += (1 << order), order = ilog2(end_pfn - pfn)) { + err = __kho_preserve(&kho_mem_track, pfn, order); + if (err) { + failed_pfn = pfn; + break; + } + } + + if (err) + for (pfn = PHYS_PFN(phys); pfn < failed_pfn; + pfn += (1 << order), order = ilog2(end_pfn - pfn)) + __kho_unpreserve(&kho_mem_track, pfn, order); + +unlock: + up_read(&kho_out.tree_lock); + + return err; +} +EXPORT_SYMBOL_GPL(kho_preserve_phys); + +/** + * kho_unpreserve_phys - unpreserve a physically contiguous range + * @phys: physical address of the range + * @size: size of the range + * + * Remove the record of a range previously preserved by kho_preserve_phys(). 
+ * + * Return: 0 on success, error code on failure + */ +int kho_unpreserve_phys(phys_addr_t phys, size_t size) +{ + unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size); + unsigned int order = ilog2(end_pfn - pfn); + int err = 0; + + down_read(&kho_out.tree_lock); + if (kho_out.fdt) { + err = -EBUSY; + goto unlock; + } + + for (; pfn < end_pfn; pfn += (1 << order), order = ilog2(end_pfn - pfn)) + __kho_unpreserve(&kho_mem_track, pfn, order); + +unlock: + up_read(&kho_out.tree_lock); + + return err; +} +EXPORT_SYMBOL_GPL(kho_unpreserve_phys); + +/* almost as free_reserved_page(), just don't free the page */ +static void kho_restore_page(struct page *page) +{ + ClearPageReserved(page); + init_page_count(page); + adjust_managed_page_count(page, 1); +} + +struct folio *kho_restore_folio(phys_addr_t phys) +{ + struct page *page = pfn_to_online_page(PHYS_PFN(phys)); + unsigned long order = page->private; + + if (!page) + return NULL; + + order = page->private; + if (order) + prep_compound_page(page, order); + else + kho_restore_page(page); + + return page_folio(page); +} +EXPORT_SYMBOL_GPL(kho_restore_folio); + +void *kho_restore_phys(phys_addr_t phys, size_t size) +{ + unsigned long start_pfn, end_pfn, pfn; + void *va = __va(phys); + + start_pfn = PFN_DOWN(phys); + end_pfn = PFN_UP(phys + size); + + for (pfn = start_pfn; pfn < end_pfn; pfn++) { + struct page *page = pfn_to_online_page(pfn); + + if (!page) + return NULL; + kho_restore_page(page); + } + + return va; +} +EXPORT_SYMBOL_GPL(kho_restore_phys); + +static void kho_mem_serialize(void) +{ + struct kho_mem_track *tracker = &kho_mem_track; + khomem_desc_t *desc; + unsigned long order; + + xa_for_each(&tracker->orders, order, desc) { + if (WARN_ON(order >= NR_PAGE_ORDERS)) + break; + kho_out.mem_tables[order] = *desc; + } +} + +static int kho_mem_deser_walk_fn(unsigned long phys, unsigned int order, + void *arg) +{ + struct page *page = phys_to_page(phys); + unsigned long sz = 1UL << (PAGE_SHIFT + order); + + memblock_reserve(phys, sz); + memblock_reserved_mark_noinit(phys, sz); + page->private = order; + + return 0; +} + +static void __init kho_mem_deserialize(void) +{ + struct kho_in_node preserved_mem; + const unsigned long *tables; + unsigned int nr_tables; + int err, order; + u32 len; + + err = kho_get_node(NULL, "preserved-memory", &preserved_mem); + if (err) { + pr_err("no preserved-memory node: %d\n", err); + return; + } + + tables = kho_get_prop(&preserved_mem, "metadata", &len); + if (!tables || len % sizeof(*tables)) { + pr_err("failed to get preserved memory table\n"); + return; + } + + nr_tables = min_t(unsigned int, len / sizeof(*tables), NR_PAGE_ORDERS); + for (order = 0; order < nr_tables; order++) + khomem_walk_preserved((khomem_desc_t *)&tables[order], order, + kho_mem_deser_walk_fn, NULL); +} + /* Helper functions for KHO state tree */ struct kho_prop { @@ -542,6 +1101,8 @@ static int kho_unfreeze(void) kho_out.fdt = NULL; up_write(&kho_out.tree_lock); + memset(kho_out.mem_tables, 0, sizeof(kho_out.mem_tables)); + if (fdt) kvfree(fdt); @@ -648,6 +1209,8 @@ static int kho_finalize(void) kho_out.fdt = fdt; up_write(&kho_out.tree_lock); + kho_mem_serialize(); + err = kho_convert_tree(fdt, kho_out.fdt_max); unfreeze: @@ -829,6 +1392,10 @@ static __init int kho_init(void) kho_out.root.name = ""; err = kho_add_string_prop(&kho_out.root, "compatible", "kho-v1"); + err |= kho_add_prop(&kho_out.preserved_memory, "metadata", + kho_out.mem_tables, sizeof(kho_out.mem_tables)); + err |= kho_add_node(&kho_out.root, 
"preserved-memory", + &kho_out.preserved_memory); if (err) goto err_free_scratch; @@ -1079,10 +1646,12 @@ static void __init kho_release_scratch(void) void __init kho_memory_init(void) { - if (!kho_get_fdt()) + if (!kho_get_fdt()) { kho_reserve_scratch(); - else + } else { + kho_mem_deserialize(); kho_release_scratch(); + } } void __init kho_populate(phys_addr_t handover_fdt_phys,
On Thu, Apr 03, 2025 at 03:50:04PM +0000, Pratyush Yadav wrote: > The patch currently has a limitation where it does not free any of the > empty tables after a unpreserve operation. But Changyuan's patch also > doesn't do it so at least it is not any worse off. We do we even have unpreserve? Just discard the entire KHO operation in a bulk. > When working on this patch, I realized that kho_mem_deserialize() is > currently _very_ slow. It takes over 2 seconds to make memblock > reservations for 48 GiB of 0-order pages. I suppose this can later be > optimized by teaching memblock_free_all() to skip preserved pages > instead of making memblock reservations. Yes, this was my prior point of not having actual data to know what the actual hot spots are.. This saves a few ms on an operation that takes over 2 seconds :) > +typedef unsigned long khomem_desc_t; This should be more like: union { void *table; phys_addr_t table_phys; }; Since we are not using the low bits right now and it is alot cheaper to convert from va to phys only once during the final step. __va is not exactly fast. > +#define PTRS_PER_LEVEL (PAGE_SIZE / sizeof(unsigned long)) > +#define KHOMEM_L1_BITS (PAGE_SIZE * BITS_PER_BYTE) > +#define KHOMEM_L1_MASK ((1 << ilog2(KHOMEM_L1_BITS)) - 1) > +#define KHOMEM_L1_SHIFT (PAGE_SHIFT) > +#define KHOMEM_L2_SHIFT (KHOMEM_L1_SHIFT + ilog2(KHOMEM_L1_BITS)) > +#define KHOMEM_L3_SHIFT (KHOMEM_L2_SHIFT + ilog2(PTRS_PER_LEVEL)) > +#define KHOMEM_L4_SHIFT (KHOMEM_L3_SHIFT + ilog2(PTRS_PER_LEVEL)) > +#define KHOMEM_PFN_MASK PAGE_MASK This all works better if you just use GENMASK and FIELD_GET > +static int __khomem_table_alloc(khomem_desc_t *desc) > +{ > + if (khomem_desc_none(*desc)) { Needs READ_ONCE > +struct kho_mem_track { > + /* Points to L4 KHOMEM descriptor, each order gets its own table. */ > + struct xarray orders; > +}; I think it would be easy to add a 5th level and just use bits 63:57 as a 6 bit order. Then you don't need all this stuff either. > +int kho_preserve_folio(struct folio *folio) > +{ > + unsigned long pfn = folio_pfn(folio); > + unsigned int order = folio_order(folio); > + int err; > + > + if (!kho_enable) > + return -EOPNOTSUPP; > + > + down_read(&kho_out.tree_lock); This lock still needs to go away > +static void kho_mem_serialize(void) > +{ > + struct kho_mem_track *tracker = &kho_mem_track; > + khomem_desc_t *desc; > + unsigned long order; > + > + xa_for_each(&tracker->orders, order, desc) { > + if (WARN_ON(order >= NR_PAGE_ORDERS)) > + break; > + kho_out.mem_tables[order] = *desc; Missing the virt_to_phys? > + nr_tables = min_t(unsigned int, len / sizeof(*tables), NR_PAGE_ORDERS); > + for (order = 0; order < nr_tables; order++) > + khomem_walk_preserved((khomem_desc_t *)&tables[order], order, Missing phys_to_virt Please dont' remove the KHOSER stuff, and do use it with proper structs and types. It is part of keeping this stuff understandable. Jason
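(A minimal sketch of the descriptor union suggested above, with hypothetical names: the tracker keeps virtual pointers while it is live and converts them to physical addresses in one extra pass just before kexec, so __va() is not needed on every walk step.)

        /* Sketch only: descriptor holding a virtual pointer during tracking,
         * converted to a physical address once at serialization time.
         * Hypothetical names, loosely based on the khomem helpers above.
         */
        typedef union {
                void *table;                    /* valid while the tracker is live */
                phys_addr_t table_phys;         /* written just before kexec */
        } khomem_desc_t;

        static inline void *khomem_table(khomem_desc_t desc)
        {
                return desc.table;              /* no __va() on every walk step */
        }

        /* One extra walk over the live tables flips every descriptor: */
        static inline void khomem_desc_finalize(khomem_desc_t *desc)
        {
                desc->table_phys = virt_to_phys(desc->table);
        }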
On Thu, Apr 03 2025, Jason Gunthorpe wrote: > On Thu, Apr 03, 2025 at 03:50:04PM +0000, Pratyush Yadav wrote: > >> The patch currently has a limitation where it does not free any of the >> empty tables after a unpreserve operation. But Changyuan's patch also >> doesn't do it so at least it is not any worse off. > > We do we even have unpreserve? Just discard the entire KHO operation > in a bulk. Yeah, I guess that makes sense. > >> When working on this patch, I realized that kho_mem_deserialize() is >> currently _very_ slow. It takes over 2 seconds to make memblock >> reservations for 48 GiB of 0-order pages. I suppose this can later be >> optimized by teaching memblock_free_all() to skip preserved pages >> instead of making memblock reservations. > > Yes, this was my prior point of not having actual data to know what > the actual hot spots are.. This saves a few ms on an operation that > takes over 2 seconds :) Yes, you're right. But for 2.5 days of work it isn't too shabby :-) And I think this will help make the 2 seconds much smaller as well later down the line since we can now find out if a given page is reserved in a few operations, and do it in parallel. > >> +typedef unsigned long khomem_desc_t; > > This should be more like: > > union { > void *table; > phys_addr_t table_phys; > }; > > Since we are not using the low bits right now and it is alot cheaper > to convert from va to phys only once during the final step. __va is > not exactly fast. The descriptor is used on _every_ level of the table, not just the top. So if we use virtual addresses, at serialize time we would have to walk the whole table and covert all addresses to physical. And __va() does not seem to be doing too much. On x86, it expands to: #define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET)) and on ARM64 to: #define __va(x) ((void *)__phys_to_virt((phys_addr_t)(x))) #define __phys_to_virt(x) ((unsigned long)((x) - PHYS_OFFSET) | PAGE_OFFSET) So only some addition and bitwise or. Should be fast enough I reckon. Maybe walking the table once is faster than calculating va every time, but that walking would happen in the blackout time, and need more data on whether that optimization is worth it. > >> +#define PTRS_PER_LEVEL (PAGE_SIZE / sizeof(unsigned long)) >> +#define KHOMEM_L1_BITS (PAGE_SIZE * BITS_PER_BYTE) >> +#define KHOMEM_L1_MASK ((1 << ilog2(KHOMEM_L1_BITS)) - 1) >> +#define KHOMEM_L1_SHIFT (PAGE_SHIFT) >> +#define KHOMEM_L2_SHIFT (KHOMEM_L1_SHIFT + ilog2(KHOMEM_L1_BITS)) >> +#define KHOMEM_L3_SHIFT (KHOMEM_L2_SHIFT + ilog2(PTRS_PER_LEVEL)) >> +#define KHOMEM_L4_SHIFT (KHOMEM_L3_SHIFT + ilog2(PTRS_PER_LEVEL)) >> +#define KHOMEM_PFN_MASK PAGE_MASK > > This all works better if you just use GENMASK and FIELD_GET I suppose yes. Though the masks need to be shifted by page order so need to be careful. Will take a look. > >> +static int __khomem_table_alloc(khomem_desc_t *desc) >> +{ >> + if (khomem_desc_none(*desc)) { > > Needs READ_ONCE ACK, will add. > >> +struct kho_mem_track { >> + /* Points to L4 KHOMEM descriptor, each order gets its own table. */ >> + struct xarray orders; >> +}; > > I think it would be easy to add a 5th level and just use bits 63:57 as > a 6 bit order. Then you don't need all this stuff either. I am guessing you mean to store the order in the table descriptor itself, instead of having a different table for each order. I don't think that would work since say you have a level 1 table spanning 128 MiB. 
You can have pages of different orders in that 128 MiB, and have no way of knowing which is which. To have all orders in one table, we would need more than one bit per page at the lowest level. Though now that I think of it, it is probably much simpler to just use khomem_desc_t orders[NR_PAGE_ORDERS] instead of the xarray. > >> +int kho_preserve_folio(struct folio *folio) >> +{ >> + unsigned long pfn = folio_pfn(folio); >> + unsigned int order = folio_order(folio); >> + int err; >> + >> + if (!kho_enable) >> + return -EOPNOTSUPP; >> + >> + down_read(&kho_out.tree_lock); > > This lock still needs to go away Agree. I hope Changyuan's next version fixes it. I didn't really touch any of these functions. > >> +static void kho_mem_serialize(void) >> +{ >> + struct kho_mem_track *tracker = &kho_mem_track; >> + khomem_desc_t *desc; >> + unsigned long order; >> + >> + xa_for_each(&tracker->orders, order, desc) { >> + if (WARN_ON(order >= NR_PAGE_ORDERS)) >> + break; >> + kho_out.mem_tables[order] = *desc; > > Missing the virt_to_phys? Nope. This isn't storing the pointer to the descriptor, but the _value_ of the descriptor -- so it already contains the physical address of the level 4 table. > >> + nr_tables = min_t(unsigned int, len / sizeof(*tables), NR_PAGE_ORDERS); >> + for (order = 0; order < nr_tables; order++) >> + khomem_walk_preserved((khomem_desc_t *)&tables[order], order, > > Missing phys_to_virt Same as above. tables contains the _values_ of the descriptors which already have physical addresses which we turn into virtual ones in khomem_table(). > > Please dont' remove the KHOSER stuff, and do use it with proper > structs and types. It is part of keeping this stuff understandable. I didn't see any need for KHOSER stuff here to be honest. The only time we deal with KHO pointers is with table addresses, and that is already well abstracted in khomem_mkdesc() (I suppose that can require a khomem_desc_t * instead of a void *, but beyond that it is quite easy to understand IMO).
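(A sketch of the orders[NR_PAGE_ORDERS] variant mentioned above, reusing the khomem helpers from the earlier patch; hypothetical, not posted code:)

        /* Sketch only: a fixed array of per-order L4 roots instead of the
         * xarray, reusing khomem_preserve() from the patch earlier in the
         * thread.
         */
        struct kho_mem_track {
                /* One L4 root per order; an all-zero descriptor means "empty". */
                khomem_desc_t orders[NR_PAGE_ORDERS];
        };

        static int __kho_preserve(struct kho_mem_track *tracker, unsigned long pfn,
                                  unsigned int order)
        {
                if (WARN_ON(order >= NR_PAGE_ORDERS))
                        return -EINVAL;

                /* No xa_load_or_alloc() step: the root slot always exists. */
                return khomem_preserve(&tracker->orders[order], pfn, order);
        }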
Hi Jason, On Thu, Apr 03, 2025 at 11:24:38AM -0300, Jason Gunthorpe wrote: > On Thu, Apr 03, 2025 at 04:58:27PM +0300, Mike Rapoport wrote: > > On Thu, Apr 03, 2025 at 08:42:09AM -0300, Jason Gunthorpe wrote: > > > On Wed, Apr 02, 2025 at 07:16:27PM +0000, Pratyush Yadav wrote: > > > > > +int kho_preserve_phys(phys_addr_t phys, size_t size) > > > > > +{ > > > > > + unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size); > > > > > + unsigned int order = ilog2(end_pfn - pfn); > > > > > > > > This caught my eye when playing around with the code. It does not put > > > > any limit on the order, so it can exceed NR_PAGE_ORDERS. Also, when > > > > initializing the page after KHO, we pass the order directly to > > > > prep_compound_page() without sanity checking it. The next kernel might > > > > not support all the orders the current one supports. Perhaps something > > > > to fix? > > > > > > IMHO we should delete the phys functions until we get a user of them > > > > The only user of memory tracker in this series uses kho_preserve_phys() > > But it really shouldn't. The reserved memory is a completely different > mechanism than buddy allocator preservation. It doesn't even call > kho_restore_phys() those pages, it just feeds the ranges directly to: > > + reserved_mem_add(*p_start, size, name); > > The bitmaps should be understood as preserving memory from the buddy > allocator only. > > IMHO it should not call kho_preserve_phys() at all. Do you mean that for preserving large physical ranges we need something entirely different? Then we don't need the bitmaps at this point, as we don't have any users for kho_preserve_folio() and we should not worry ourself with orders and restoration of high order folios until then ;-) Now more seriously, I considered the bitmaps and sparse xarrays as good initial implementation of memory preservation that can do both physical ranges now and folios later when we'll need them. It might not be the optimal solution in the long run but we don't have enough data right now to do any optimizations for real. Preserving huge amounts of order-0 pages does not seem to me a representative test case at all. The xarrays + bitmaps do have the limitation that we cannot store any information about the folio except its order and if we are anyway need something else to preserve physical ranges, I suggest starting with preserving ranges and then adding optimizations for the folio case. As I've mentioned earlier, maple tree is perfect for tracking ranges, it simpler than other alternatives, and at allows storing information about a range and easy and efficient coalescing for adjacent ranges with matching properties. The maple tree based memory tracker is less memory efficient than bitmap if we count how many data is required to preserve gigabytes of distinct order-0 pages, but I don't think this is the right thing to measure at least until we have some real data about how KHO is used. Here's something that implements preservation of ranges (compile tested only) and adding folios with their orders and maybe other information would be quite easy. /* * Keep track of memory that is to be preserved across KHO. 
* * For simplicity use a maple tree that conveniently stores ranges and * allows adding BITS_PER_XA_VALUE of metadata to each range */ struct kho_mem_track { struct maple_tree ranges; }; static struct kho_mem_track kho_mem_track; typedef unsigned long kho_range_desc_t; static int __kho_preserve(struct kho_mem_track *tracker, unsigned long addr, size_t size, kho_range_desc_t info) { struct maple_tree *ranges = &tracker->ranges; MA_STATE(mas, ranges, addr - 1, addr + size + 1); unsigned long lower, upper; void *area = NULL; lower = addr; upper = addr + size - 1; might_sleep(); area = mas_walk(&mas); if (area && mas.last == addr - 1) lower = mas.index; area = mas_next(&mas, ULONG_MAX); if (area && mas.index == addr + size) upper = mas.last; mas_set_range(&mas, lower, upper); return mas_store_gfp(&mas, xa_mk_value(info), GFP_KERNEL); } /** * kho_preserve_phys - preserve a physically contiguous range across KHO. * @phys: physical address of the range * @size: size of the range * * Records that the entire range from @phys to @phys + @size is preserved * across KHO. * * Return: 0 on success, error code on failure */ int kho_preserve_phys(phys_addr_t phys, size_t size) { return __kho_preserve(&kho_mem_track, phys, size, 0); } EXPORT_SYMBOL_GPL(kho_preserve_phys); #define KHOSER_PTR(type) union {phys_addr_t phys; type ptr;} #define KHOSER_STORE_PTR(dest, val) \ ({ \ (dest).phys = virt_to_phys(val); \ typecheck(typeof((dest).ptr), val); \ }) #define KHOSER_LOAD_PTR(src) ((src).phys ? (typeof((src).ptr))(phys_to_virt((src).phys)): NULL) struct khoser_mem_range { phys_addr_t start; phys_addr_t size; unsigned long data; }; struct khoser_mem_chunk_hdr { KHOSER_PTR(struct khoser_mem_chunk *) next; unsigned long num_ranges; }; #define KHOSER_RANGES_SIZE \ ((PAGE_SIZE - sizeof(struct khoser_mem_chunk_hdr)) / \ sizeof(struct khoser_mem_range)) struct khoser_mem_chunk { struct khoser_mem_chunk_hdr hdr; struct khoser_mem_range ranges[KHOSER_RANGES_SIZE]; }; static int new_chunk(struct khoser_mem_chunk **cur_chunk) { struct khoser_mem_chunk *chunk; chunk = kzalloc(sizeof(*chunk), GFP_KERNEL); if (!chunk) return -ENOMEM; if (*cur_chunk) KHOSER_STORE_PTR((*cur_chunk)->hdr.next, chunk); *cur_chunk = chunk; return 0; } /* * Record all the ranges in a linked list of pages for the next kernel to * process. Each chunk holds an array of ranges. The maple_tree is used to store * them in a tree while building up the data structure, but the KHO successor * kernel only needs to process them once in order. * * All of this memory is normal kmalloc() memory and is not marked for * preservation. The successor kernel will remain isolated to the scratch space * until it completes processing this list. Once processed all the memory * storing these ranges will be marked as free. 
*/ static int kho_mem_serialize(phys_addr_t *fdt_value) { struct kho_mem_track *tracker = &kho_mem_track; struct maple_tree *ranges = &tracker->ranges; struct khoser_mem_chunk *first_chunk = NULL; struct khoser_mem_chunk *chunk = NULL; MA_STATE(mas, ranges, 0, ULONG_MAX); void *entry; int err; mas_for_each(&mas, entry, ULONG_MAX) { size_t size = mas.last - mas.index + 1; struct khoser_mem_range *range; if (!chunk || chunk->hdr.num_ranges == ARRAY_SIZE(chunk->ranges)) { err = new_chunk(&chunk); if (err) goto err_free; if (!first_chunk) first_chunk = chunk; } range = &chunk->ranges[chunk->hdr.num_ranges]; range->start = mas.index; range->size = size; range->data = xa_to_value(entry); chunk->hdr.num_ranges++; } *fdt_value = first_chunk ? virt_to_phys(first_chunk) : 0; return 0; err_free: chunk = first_chunk; while (chunk) { struct khoser_mem_chunk *tmp = chunk; chunk = KHOSER_LOAD_PTR(chunk->hdr.next); kfree(tmp); } return err; } static void __init deserialize_range(struct khoser_mem_range *range) { memblock_reserved_mark_noinit(range->start, range->size); memblock_reserve(range->start, range->size); } static void __init kho_mem_deserialize(void) { const void *fdt = kho_get_fdt(); struct khoser_mem_chunk *chunk; const phys_addr_t *mem; int len, node; if (!fdt) return; node = fdt_path_offset(fdt, "/preserved-memory"); if (node < 0) { pr_err("no preserved-memory node: %d\n", node); return; } mem = fdt_getprop(fdt, node, "metadata", &len); if (!mem || len != sizeof(*mem)) { pr_err("failed to get preserved memory metadata\n"); return; } chunk = *mem ? phys_to_virt(*mem) : NULL; while (chunk) { unsigned int i; memblock_reserve(virt_to_phys(chunk), sizeof(*chunk)); for (i = 0; i != chunk->hdr.num_ranges; i++) deserialize_range(&chunk->ranges[i]); chunk = KHOSER_LOAD_PTR(chunk->hdr.next); } }
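(For illustration, how the coalescing in __kho_preserve() above is meant to behave; a hypothetical call site, assuming the sketch works as posted:)

        /* Hypothetical usage: two adjacent preservations end up as one range,
         * so kho_mem_serialize() emits a single struct khoser_mem_range even
         * when contiguous memory is preserved piecemeal.
         */
        static int __init kho_range_coalesce_demo(void)
        {
                int err;

                err = kho_preserve_phys(0x100000000ULL, SZ_2M);     /* [4G, 4G+2M) */
                if (err)
                        return err;

                err = kho_preserve_phys(0x100200000ULL, SZ_2M);     /* [4G+2M, 4G+4M) */
                if (err)
                        return err;

                /* The maple tree now holds one entry covering [4G, 4G+4M). */
                return 0;
        }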
On Fri, Apr 04, 2025 at 12:54:25PM +0300, Mike Rapoport wrote: > > IMHO it should not call kho_preserve_phys() at all. > > Do you mean that for preserving large physical ranges we need something > entirely different? If they don't use the buddy allocator, then yes? > Then we don't need the bitmaps at this point, as we don't have any users > for kho_preserve_folio() and we should not worry ourself with orders and > restoration of high order folios until then ;-) Arguably yes :\ Maybe change the reserved regions code to put the region list in a folio and preserve the folio instead of using FDT as a "demo" for the functionality. > The xarrays + bitmaps do have the limitation that we cannot store any > information about the folio except its order and if we are anyway need > something else to preserve physical ranges, I suggest starting with > preserving ranges and then adding optimizations for the folio case. Why? What is the use case for physical ranges that isn't handled entirely by reserved_mem_add()? We know what the future use case is for the folio preservation, all the drivers and the iommu are going to rely on this. > Here's something that implements preservation of ranges (compile tested > only) and adding folios with their orders and maybe other information would > be quite easy. But folios and their orders is the *whole point*, again I don't see any use case for preserving ranges, beyond it being a way to optimize the memblock reserve path. But that path should be fixed up to just use the bitmap directly.. Jason
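(A rough sketch of the "region list in a preserved folio" idea floated above. All structures and helpers below are hypothetical; only reserved_mem_add() is the helper referenced earlier in the thread.)

        /* Sketch only: hand the reserved-region list over as data inside a
         * preserved folio rather than as FDT properties.
         */
        struct kho_rmem_entry {
                phys_addr_t start;
                phys_addr_t size;
                char name[32];
        };

        struct kho_rmem_list {
                u32 count;
                struct kho_rmem_entry entries[];
        };

        /* Outgoing kernel: the list lives inside an ordinary folio. */
        static int kho_rmem_hand_over(struct folio *folio)
        {
                return kho_preserve_folio(folio);
        }

        /* Incoming kernel, before the buddy allocator is up: the data is read
         * through the direct map; the folio itself can be restored or freed
         * later.
         */
        static void __init kho_rmem_restore(phys_addr_t list_phys)
        {
                struct kho_rmem_list *list = phys_to_virt(list_phys);
                u32 i;

                for (i = 0; i < list->count; i++)
                        reserved_mem_add(list->entries[i].start,
                                         list->entries[i].size,
                                         list->entries[i].name);
        }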
On Thu, Apr 03, 2025 at 05:37:06PM +0000, Pratyush Yadav wrote: > And I think this will help make the 2 seconds much smaller as well later > down the line since we can now find out if a given page is reserved in a > few operations, and do it in parallel. Yes, most certainly > > This should be more like: > > > > union { > > void *table; > > phys_addr_t table_phys; > > }; > > > > Since we are not using the low bits right now and it is alot cheaper > > to convert from va to phys only once during the final step. __va is > > not exactly fast. > > The descriptor is used on _every_ level of the table, not just the > top. Yes > So if we use virtual addresses, at serialize time we would have to walk > the whole table and covert all addresses to physical. Yes > And __va() does > not seem to be doing too much. On x86, it expands to: > > #define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET)) > > and on ARM64 to: > > #define __va(x) ((void *)__phys_to_virt((phys_addr_t)(x))) > #define __phys_to_virt(x) ((unsigned long)((x) - PHYS_OFFSET) | PAGE_OFFSET) Hmm, I was sure sparsemem added a bunch of stuff to this path, maybe I'm thinking of page_to_phys > >> +struct kho_mem_track { > >> + /* Points to L4 KHOMEM descriptor, each order gets its own table. */ > >> + struct xarray orders; > >> +}; > > > > I think it would be easy to add a 5th level and just use bits 63:57 as > > a 6 bit order. Then you don't need all this stuff either. > > I am guessing you mean to store the order in the table descriptor > itself, instead of having a different table for each order. Not quite, I mean to index the per-order sub trees by using the high order bits. You still end up with N seperate bitmap trees, but instead of using an xarray to hold their top pointers you hold them in a 5th level. > Though now that I think of it, it is probably much simpler to just use > khomem_desc_t orders[NR_PAGE_ORDERS] instead of the xarray. Which is basically this, but encoding the index to orders in the address Jason
On Fri, Apr 04, 2025 at 09:47:29AM -0300, Jason Gunthorpe wrote: > On Fri, Apr 04, 2025 at 12:54:25PM +0300, Mike Rapoport wrote: > > > IMHO it should not call kho_preserve_phys() at all. > > > > Do you mean that for preserving large physical ranges we need something > > entirely different? > > If they don't use the buddy allocator, then yes? > > > Then we don't need the bitmaps at this point, as we don't have any users > > for kho_preserve_folio() and we should not worry ourself with orders and > > restoration of high order folios until then ;-) > > Arguably yes :\ > > Maybe change the reserved regions code to put the region list in a > folio and preserve the folio instead of using FDT as a "demo" for the > functionality. Folios are not available when we restore reserved regions, this just won't work. > > The xarrays + bitmaps do have the limitation that we cannot store any > > information about the folio except its order and if we are anyway need > > something else to preserve physical ranges, I suggest starting with > > preserving ranges and then adding optimizations for the folio case. > > Why? What is the use case for physical ranges that isn't handled > entirely by reserved_mem_add()? > > We know what the future use case is for the folio preservation, all > the drivers and the iommu are going to rely on this. We don't know how much of the preservation will be based on folios. Most drivers do not use folios and for preserving memfd* and hugetlb we'd need to have some dance around that memory anyway. So I think kho_preserve_folio() would be a part of the fdbox or whatever that functionality will be called. > > Here's something that implements preservation of ranges (compile tested > > only) and adding folios with their orders and maybe other information would > > be quite easy. > > But folios and their orders is the *whole point*, again I don't see > any use case for preserving ranges, beyond it being a way to optimize > the memblock reserve path. But that path should be fixed up to just > use the bitmap directly.. Are they? The purpose of basic KHO is to make sure the memory we want to preserve is not trampled over. Preserving folios with their orders means we need to make sure the memory range of the folio is preserved and we carry additional information to actually recreate the folio object, in case it is needed and in case it is possible. Hugetlb, for instance, has its own way of initializing folios and just keeping the order won't be enough for that. As for the optimizations of the memblock reserve path, currently it is what hurts the most in my and Pratyush's experiments. They are not very representative, but still, preserving lots of pages/folios spread all over would have its toll on the mm initialization. And I don't think invasive changes to how buddy and memory map initialization work are the best way to move forward and optimize that. Quite possibly we'd want to be able to minimize the amount of *ranges* that we preserve. So from the three alternatives we have now (xarrays + bitmaps, tables + bitmaps and maple tree for ranges) maple tree seems to be the simplest, and efficient enough, to start with. Preserving folio orders with it is really straightforward and until we see some real data of how the entire KHO machinery is used, I'd prefer simple over anything else. > Jason
On Fri, Apr 04, 2025 at 04:53:13PM +0300, Mike Rapoport wrote: > > Maybe change the reserved regions code to put the region list in a > > folio and preserve the folio instead of using FDT as a "demo" for the > > functionality. > > Folios are not available when we restore reserved regions, this just won't > work. You don't need the folio at that point, you just need the data in the page. The folio would be freed after starting up the buddy allocator. > > We know what the future use case is for the folio preservation, all > > the drivers and the iommu are going to rely on this. > > We don't know how much of the preservation will be based on folios. I think almost all of it. Where else does memory come from for drivers? > Most drivers do not use folios Yes they do, either through kmalloc or through alloc_page/etc. "folio" here is just some generic word meaning memory from the buddy allocator. The big question on my mind is if we need a way to preserve slab objects as well.. > and for preserving memfd* and hugetlb we'd need to have some dance > around that memory anyway. memfd is all folios - what do you mean? hugetlb is moving toward folios.. eg guestmemfd is supposed to be taking the hugetlb special stuff and turning it into folios. > So I think kho_preserve_folio() would be a part of the fdbox or > whatever that functionality will be called. It is part of KHO. Preserving the folios has to be sequenced with starting the buddy allocator, and that is KHO's entire responsibility. I could see something like preserving slab being in a different layer, built on preserving folios. > Are they? > The purpose of basic KHO is to make sure the memory we want to preserve is > not trampled over. Preserving folios with their orders means we need to > make sure memory range of the folio is preserved and we carry additional > information to actually recreate the folio object, in case it is needed and > in case it is possible. Hughetlb, for instance has its own way initializing > folios and just keeping the order won't be enough for that. I expect many things will need a side-car datastructure to record that additional meta-data. hugetlb can start with folios, then switch them over to its non-folio stuff based on its metadata. The point is the basic low level KHO mechanism is simple folios - memory from the buddy allocator with an neutral struct folio that the caller can then customize to its own memory descriptor type on restore. Eventually restore would allocate a caller specific memdesc and it wouldn't be "folios" at all. We just don't have the right words yet to describe this. > As for the optimizations of memblock reserve path, currently it what hurts > the most in my and Pratyush experiments. They are not very representative, > but still, preserving lots of pages/folios spread all over would have it's > toll on the mm initialization. > And I don't think invasive changes to how > buddy and memory map initialization are the best way to move forward and > optimize that. I'm pretty sure this is going to be the best performance path, but I have no idea how invasive it would be to the buddy alloactor to make it work. > Quite possibly we'd want to be able to minimize amount of *ranges* > that we preserve. I'm not sure, that seems backwards to me, we really don't want to have KHO mem zones! So I think optimizing for, and thinking about ranges doesn't make sense. 
The big ranges will arise naturally because things like hugetlb reservations should all be contiguous and the resulting folios should all be allocated for the VM and also all be contiguous. So vast, vast amounts of memory will be high order and contiguous. > Preserving folio orders with it is really straightforward and until we see > some real data of how the entire KHO machinery is used, I'd prefer simple > over anything else. A maple tree may not even work as it has a very high bound on memory usage if the preservation workload is small and fragmented. This is why I didn't want to use a list of ranges in the first place. It also doesn't work so well if you need to preserve the order too :\ Until we know the workload(s) and have costed how much memory the maple tree version will use, I don't think it is a good general starting point. Jason
On Fri, Apr 04 2025, Jason Gunthorpe wrote: > On Thu, Apr 03, 2025 at 05:37:06PM +0000, Pratyush Yadav wrote: > [...] >> > This should be more like: >> > >> > union { >> > void *table; >> > phys_addr_t table_phys; >> > }; >> > >> > Since we are not using the low bits right now and it is alot cheaper >> > to convert from va to phys only once during the final step. __va is >> > not exactly fast. >> >> The descriptor is used on _every_ level of the table, not just the >> top. > > Yes > >> So if we use virtual addresses, at serialize time we would have to walk >> the whole table and covert all addresses to physical. > > Yes > >> And __va() does >> not seem to be doing too much. On x86, it expands to: >> >> #define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET)) >> >> and on ARM64 to: >> >> #define __va(x) ((void *)__phys_to_virt((phys_addr_t)(x))) >> #define __phys_to_virt(x) ((unsigned long)((x) - PHYS_OFFSET) | PAGE_OFFSET) > > Hmm, I was sure sparsemem added a bunch of stuff to this path, maybe > I'm thinking of page_to_phys Yep, page_to_phys for sparsemem is somewhat expensive, but __va() seems to be fine. #define __page_to_pfn(pg) \ ({ const struct page *__pg = (pg); \ int __sec = page_to_section(__pg); \ (unsigned long)(__pg - __section_mem_map_addr(__nr_to_section(__sec))); \ }) > >> >> +struct kho_mem_track { >> >> + /* Points to L4 KHOMEM descriptor, each order gets its own table. */ >> >> + struct xarray orders; >> >> +}; >> > >> > I think it would be easy to add a 5th level and just use bits 63:57 as >> > a 6 bit order. Then you don't need all this stuff either. >> >> I am guessing you mean to store the order in the table descriptor >> itself, instead of having a different table for each order. > > Not quite, I mean to index the per-order sub trees by using the high > order bits. You still end up with N seperate bitmap trees, but instead > of using an xarray to hold their top pointers you hold them in a 5th > level. > >> Though now that I think of it, it is probably much simpler to just use >> khomem_desc_t orders[NR_PAGE_ORDERS] instead of the xarray. > > Which is basically this, but encoding the index to orders in the address I think this is way easier to wrap your head around compared to trying to encode orders in address. That doesn't even work that well since address has pretty much no connection to order. So a page at address say 0x100000 might be any order. So we would have to encode the order into the address, and then decode it when doing table operations. Just having a separate table indexed separately by orders is way simpler.
diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h index c665ff6cd728..d52a7b500f4c 100644 --- a/include/linux/kexec_handover.h +++ b/include/linux/kexec_handover.h @@ -5,6 +5,7 @@ #include <linux/types.h> #include <linux/hashtable.h> #include <linux/notifier.h> +#include <linux/mm_types.h> struct kho_scratch { phys_addr_t addr; @@ -54,6 +55,13 @@ int kho_add_string_prop(struct kho_node *node, const char *key, int register_kho_notifier(struct notifier_block *nb); int unregister_kho_notifier(struct notifier_block *nb); +int kho_preserve_folio(struct folio *folio); +int kho_unpreserve_folio(struct folio *folio); +int kho_preserve_phys(phys_addr_t phys, size_t size); +int kho_unpreserve_phys(phys_addr_t phys, size_t size); +struct folio *kho_restore_folio(phys_addr_t phys); +void *kho_restore_phys(phys_addr_t phys, size_t size); + void kho_memory_init(void); void kho_populate(phys_addr_t handover_fdt_phys, phys_addr_t scratch_phys, @@ -118,6 +126,36 @@ static inline int unregister_kho_notifier(struct notifier_block *nb) return -EOPNOTSUPP; } +static inline int kho_preserve_folio(struct folio *folio) +{ + return -EOPNOTSUPP; +} + +static inline int kho_unpreserve_folio(struct folio *folio) +{ + return -EOPNOTSUPP; +} + +static inline int kho_preserve_phys(phys_addr_t phys, size_t size) +{ + return -EOPNOTSUPP; +} + +static inline int kho_unpreserve_phys(phys_addr_t phys, size_t size) +{ + return -EOPNOTSUPP; +} + +static inline struct folio *kho_restore_folio(phys_addr_t phys) +{ + return NULL; +} + +static inline void *kho_restore_phys(phys_addr_t phys, size_t size) +{ + return NULL; +} + static inline void kho_memory_init(void) { } diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c index 6ebad2f023f9..592563c21369 100644 --- a/kernel/kexec_handover.c +++ b/kernel/kexec_handover.c @@ -62,6 +62,13 @@ struct kho_out { struct rw_semaphore tree_lock; struct kho_node root; + /** + * Physical address of the first struct khoser_mem_chunk containing + * serialized data from struct kho_mem_track. + */ + phys_addr_t first_chunk_phys; + struct kho_node preserved_memory; + void *fdt; u64 fdt_max; }; @@ -70,6 +77,7 @@ static struct kho_out kho_out = { .chain_head = BLOCKING_NOTIFIER_INIT(kho_out.chain_head), .tree_lock = __RWSEM_INITIALIZER(kho_out.tree_lock), .root = KHO_NODE_INIT, + .preserved_memory = KHO_NODE_INIT, .fdt_max = 10 * SZ_1M, }; @@ -237,6 +245,461 @@ int kho_node_check_compatible(const struct kho_in_node *node, } EXPORT_SYMBOL_GPL(kho_node_check_compatible); +/* + * Keep track of memory that is to be preserved across KHO. + * + * The serializing side uses two levels of xarrays to manage chunks of per-order + * 512 byte bitmaps. For instance the entire 1G order of a 1TB system would fit + * inside a single 512 byte bitmap. For order 0 allocations each bitmap will + * cover 16M of address space. Thus, for 16G of memory at most 512K + * of bitmap memory will be needed for order 0. + * + * This approach is fully incremental, as the serialization progresses folios + * can continue be aggregated to the tracker. The final step, immediately prior + * to kexec would serialize the xarray information into a linked list for the + * successor kernel to parse. + */ + +#define PRESERVE_BITS (512 * 8) + +struct kho_mem_phys_bits { + DECLARE_BITMAP(preserve, PRESERVE_BITS); +}; + +struct kho_mem_phys { + /* + * Points to kho_mem_phys_bits, a sparse bitmap array. Each bit is sized + * to order. 
+ */ + struct xarray phys_bits; +}; + +struct kho_mem_track { + /* Points to kho_mem_phys, each order gets its own bitmap tree */ + struct xarray orders; +}; + +static struct kho_mem_track kho_mem_track; + +static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz) +{ + void *elm, *res; + + elm = xa_load(xa, index); + if (elm) + return elm; + + elm = kzalloc(sz, GFP_KERNEL); + if (!elm) + return ERR_PTR(-ENOMEM); + + res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL); + if (xa_is_err(res)) + res = ERR_PTR(xa_err(res)); + + if (res) { + kfree(elm); + return res; + } + + return elm; +} + +static void __kho_unpreserve(struct kho_mem_track *tracker, unsigned long pfn, + unsigned int order) +{ + struct kho_mem_phys_bits *bits; + struct kho_mem_phys *physxa; + unsigned long pfn_hi = pfn >> order; + + physxa = xa_load(&tracker->orders, order); + if (!physxa) + return; + + bits = xa_load(&physxa->phys_bits, pfn_hi / PRESERVE_BITS); + if (!bits) + return; + + clear_bit(pfn_hi % PRESERVE_BITS, bits->preserve); +} + +static int __kho_preserve(struct kho_mem_track *tracker, unsigned long pfn, + unsigned int order) +{ + struct kho_mem_phys_bits *bits; + struct kho_mem_phys *physxa; + unsigned long pfn_hi = pfn >> order; + + might_sleep(); + + physxa = xa_load_or_alloc(&tracker->orders, order, sizeof(*physxa)); + if (IS_ERR(physxa)) + return PTR_ERR(physxa); + + bits = xa_load_or_alloc(&physxa->phys_bits, pfn_hi / PRESERVE_BITS, + sizeof(*bits)); + if (IS_ERR(bits)) + return PTR_ERR(bits); + + set_bit(pfn_hi % PRESERVE_BITS, bits->preserve); + + return 0; +} + +/** + * kho_preserve_folio - preserve a folio across KHO. + * @folio: folio to preserve + * + * Records that the entire folio is preserved across KHO. The order + * will be preserved as well. + * + * Return: 0 on success, error code on failure + */ +int kho_preserve_folio(struct folio *folio) +{ + unsigned long pfn = folio_pfn(folio); + unsigned int order = folio_order(folio); + int err; + + if (!kho_enable) + return -EOPNOTSUPP; + + down_read(&kho_out.tree_lock); + if (kho_out.fdt) { + err = -EBUSY; + goto unlock; + } + + err = __kho_preserve(&kho_mem_track, pfn, order); + +unlock: + up_read(&kho_out.tree_lock); + + return err; +} +EXPORT_SYMBOL_GPL(kho_preserve_folio); + +/** + * kho_unpreserve_folio - unpreserve a folio + * @folio: folio to unpreserve + * + * Remove the record of a folio previously preserved by kho_preserve_folio(). + * + * Return: 0 on success, error code on failure + */ +int kho_unpreserve_folio(struct folio *folio) +{ + unsigned long pfn = folio_pfn(folio); + unsigned int order = folio_order(folio); + int err = 0; + + down_read(&kho_out.tree_lock); + if (kho_out.fdt) { + err = -EBUSY; + goto unlock; + } + + __kho_unpreserve(&kho_mem_track, pfn, order); + +unlock: + up_read(&kho_out.tree_lock); + + return err; +} +EXPORT_SYMBOL_GPL(kho_unpreserve_folio); + +/** + * kho_preserve_phys - preserve a physically contiguous range across KHO. + * @phys: physical address of the range + * @size: size of the range + * + * Records that the entire range from @phys to @phys + @size is preserved + * across KHO. 
+ * + * Return: 0 on success, error code on failure + */ +int kho_preserve_phys(phys_addr_t phys, size_t size) +{ + unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size); + unsigned int order = ilog2(end_pfn - pfn); + unsigned long failed_pfn; + int err = 0; + + if (!kho_enable) + return -EOPNOTSUPP; + + down_read(&kho_out.tree_lock); + if (kho_out.fdt) { + err = -EBUSY; + goto unlock; + } + + for (; pfn < end_pfn; + pfn += (1 << order), order = ilog2(end_pfn - pfn)) { + err = __kho_preserve(&kho_mem_track, pfn, order); + if (err) { + failed_pfn = pfn; + break; + } + } + + if (err) + for (pfn = PHYS_PFN(phys); pfn < failed_pfn; + pfn += (1 << order), order = ilog2(end_pfn - pfn)) + __kho_unpreserve(&kho_mem_track, pfn, order); + +unlock: + up_read(&kho_out.tree_lock); + + return err; +} +EXPORT_SYMBOL_GPL(kho_preserve_phys); + +/** + * kho_unpreserve_phys - unpreserve a physically contiguous range + * @phys: physical address of the range + * @size: size of the range + * + * Remove the record of a range previously preserved by kho_preserve_phys(). + * + * Return: 0 on success, error code on failure + */ +int kho_unpreserve_phys(phys_addr_t phys, size_t size) +{ + unsigned long pfn = PHYS_PFN(phys), end_pfn = PHYS_PFN(phys + size); + unsigned int order = ilog2(end_pfn - pfn); + int err = 0; + + down_read(&kho_out.tree_lock); + if (kho_out.fdt) { + err = -EBUSY; + goto unlock; + } + + for (; pfn < end_pfn; pfn += (1 << order), order = ilog2(end_pfn - pfn)) + __kho_unpreserve(&kho_mem_track, pfn, order); + +unlock: + up_read(&kho_out.tree_lock); + + return err; +} +EXPORT_SYMBOL_GPL(kho_unpreserve_phys); + +/* almost as free_reserved_page(), just don't free the page */ +static void kho_restore_page(struct page *page) +{ + ClearPageReserved(page); + init_page_count(page); + adjust_managed_page_count(page, 1); +} + +struct folio *kho_restore_folio(phys_addr_t phys) +{ + struct page *page = pfn_to_online_page(PHYS_PFN(phys)); + unsigned long order = page->private; + + if (!page) + return NULL; + + order = page->private; + if (order) + prep_compound_page(page, order); + else + kho_restore_page(page); + + return page_folio(page); +} +EXPORT_SYMBOL_GPL(kho_restore_folio); + +void *kho_restore_phys(phys_addr_t phys, size_t size) +{ + unsigned long start_pfn, end_pfn, pfn; + void *va = __va(phys); + + start_pfn = PFN_DOWN(phys); + end_pfn = PFN_UP(phys + size); + + for (pfn = start_pfn; pfn < end_pfn; pfn++) { + struct page *page = pfn_to_online_page(pfn); + + if (!page) + return NULL; + kho_restore_page(page); + } + + return va; +} +EXPORT_SYMBOL_GPL(kho_restore_phys); + +#define KHOSER_PTR(type) \ + union { \ + phys_addr_t phys; \ + type ptr; \ + } +#define KHOSER_STORE_PTR(dest, val) \ + ({ \ + (dest).phys = virt_to_phys(val); \ + typecheck(typeof((dest).ptr), val); \ + }) +#define KHOSER_LOAD_PTR(src) \ + ((src).phys ? 
(typeof((src).ptr))(phys_to_virt((src).phys)) : NULL) + +struct khoser_mem_bitmap_ptr { + phys_addr_t phys_start; + KHOSER_PTR(struct kho_mem_phys_bits *) bitmap; +}; + +struct khoser_mem_chunk; + +struct khoser_mem_chunk_hdr { + KHOSER_PTR(struct khoser_mem_chunk *) next; + unsigned int order; + unsigned int num_elms; +}; + +#define KHOSER_BITMAP_SIZE \ + ((PAGE_SIZE - sizeof(struct khoser_mem_chunk_hdr)) / \ + sizeof(struct khoser_mem_bitmap_ptr)) + +struct khoser_mem_chunk { + struct khoser_mem_chunk_hdr hdr; + struct khoser_mem_bitmap_ptr bitmaps[KHOSER_BITMAP_SIZE]; +}; +static_assert(sizeof(struct khoser_mem_chunk) == PAGE_SIZE); + +static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk, + unsigned long order) +{ + struct khoser_mem_chunk *chunk; + + chunk = (struct khoser_mem_chunk *)get_zeroed_page(GFP_KERNEL); + if (!chunk) + return NULL; + chunk->hdr.order = order; + if (cur_chunk) + KHOSER_STORE_PTR(cur_chunk->hdr.next, chunk); + return chunk; +} + +static void kho_mem_ser_free(struct khoser_mem_chunk *first_chunk) +{ + struct khoser_mem_chunk *chunk = first_chunk; + + while (chunk) { + unsigned long chunk_page = (unsigned long)chunk; + + chunk = KHOSER_LOAD_PTR(chunk->hdr.next); + free_page(chunk_page); + } +} + +/* + * Record all the bitmaps in a linked list of pages for the next kernel to + * process. Each chunk holds bitmaps of the same order and each block of bitmaps + * starts at a given physical address. This allows the bitmaps to be sparse. The + * xarray is used to store them in a tree while building up the data structure, + * but the KHO successor kernel only needs to process them once in order. + * + * All of this memory is normal kmalloc() memory and is not marked for + * preservation. The successor kernel will remain isolated to the scratch space + * until it completes processing this list. Once processed all the memory + * storing these ranges will be marked as free. 
+ */ +static struct khoser_mem_chunk *kho_mem_serialize(void) +{ + struct kho_mem_track *tracker = &kho_mem_track; + struct khoser_mem_chunk *first_chunk = NULL; + struct khoser_mem_chunk *chunk = NULL; + struct kho_mem_phys *physxa; + unsigned long order; + + xa_for_each(&tracker->orders, order, physxa) { + struct kho_mem_phys_bits *bits; + unsigned long phys; + + chunk = new_chunk(chunk, order); + if (!chunk) + goto err_free; + + if (!first_chunk) + first_chunk = chunk; + + xa_for_each(&physxa->phys_bits, phys, bits) { + struct khoser_mem_bitmap_ptr *elm; + + if (chunk->hdr.num_elms == ARRAY_SIZE(chunk->bitmaps)) { + chunk = new_chunk(chunk, order); + if (!chunk) + goto err_free; + } + + elm = &chunk->bitmaps[chunk->hdr.num_elms]; + chunk->hdr.num_elms++; + elm->phys_start = (phys * PRESERVE_BITS) + << (order + PAGE_SHIFT); + KHOSER_STORE_PTR(elm->bitmap, bits); + } + } + + return first_chunk; + +err_free: + kho_mem_ser_free(first_chunk); + return ERR_PTR(-ENOMEM); +} + +static void deserialize_bitmap(unsigned int order, + struct khoser_mem_bitmap_ptr *elm) +{ + struct kho_mem_phys_bits *bitmap = KHOSER_LOAD_PTR(elm->bitmap); + unsigned long bit; + + for_each_set_bit(bit, bitmap->preserve, PRESERVE_BITS) { + int sz = 1 << (order + PAGE_SHIFT); + phys_addr_t phys = + elm->phys_start + (bit << (order + PAGE_SHIFT)); + struct page *page = phys_to_page(phys); + + memblock_reserve(phys, sz); + memblock_reserved_mark_noinit(phys, sz); + page->private = order; + } +} + +static void __init kho_mem_deserialize(void) +{ + struct khoser_mem_chunk *chunk; + struct kho_in_node preserved_mem; + const phys_addr_t *mem; + int err; + u32 len; + + err = kho_get_node(NULL, "preserved-memory", &preserved_mem); + if (err) { + pr_err("no preserved-memory node: %d\n", err); + return; + } + + mem = kho_get_prop(&preserved_mem, "metadata", &len); + if (!mem || len != sizeof(*mem)) { + pr_err("failed to get preserved memory bitmaps\n"); + return; + } + + chunk = *mem ? phys_to_virt(*mem) : NULL; + while (chunk) { + unsigned int i; + + memblock_reserve(virt_to_phys(chunk), sizeof(*chunk)); + + for (i = 0; i != chunk->hdr.num_elms; i++) + deserialize_bitmap(chunk->hdr.order, + &chunk->bitmaps[i]); + chunk = KHOSER_LOAD_PTR(chunk->hdr.next); + } +} + /* Helper functions for KHO state tree */ struct kho_prop { @@ -545,6 +1008,11 @@ static int kho_unfreeze(void) if (fdt) kvfree(fdt); + if (kho_out.first_chunk_phys) { + kho_mem_ser_free(phys_to_virt(kho_out.first_chunk_phys)); + kho_out.first_chunk_phys = 0; + } + err = blocking_notifier_call_chain(&kho_out.chain_head, KEXEC_KHO_UNFREEZE, NULL); err = notifier_to_errno(err); @@ -633,6 +1101,7 @@ static int kho_finalize(void) { int err = 0; void *fdt; + struct khoser_mem_chunk *first_chunk; fdt = kvmalloc(kho_out.fdt_max, GFP_KERNEL); if (!fdt) @@ -648,6 +1117,13 @@ static int kho_finalize(void) kho_out.fdt = fdt; up_write(&kho_out.tree_lock); + first_chunk = kho_mem_serialize(); + if (IS_ERR(first_chunk)) { + err = PTR_ERR(first_chunk); + goto unfreeze; + } + kho_out.first_chunk_phys = first_chunk ? 
virt_to_phys(first_chunk) : 0; + err = kho_convert_tree(fdt, kho_out.fdt_max); unfreeze: @@ -829,6 +1305,10 @@ static __init int kho_init(void) kho_out.root.name = ""; err = kho_add_string_prop(&kho_out.root, "compatible", "kho-v1"); + err |= kho_add_prop(&kho_out.preserved_memory, "metadata", + &kho_out.first_chunk_phys, sizeof(phys_addr_t)); + err |= kho_add_node(&kho_out.root, "preserved-memory", + &kho_out.preserved_memory); if (err) goto err_free_scratch; @@ -1079,10 +1559,12 @@ static void __init kho_release_scratch(void) void __init kho_memory_init(void) { - if (!kho_get_fdt()) + if (!kho_get_fdt()) { kho_reserve_scratch(); - else + } else { + kho_mem_deserialize(); kho_release_scratch(); + } } void __init kho_populate(phys_addr_t handover_fdt_phys,
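(A hypothetical example of how a driver might use the interface added by this patch; kho_preserve_folio() must run before KHO is finalized or it returns -EBUSY, and how the physical address reaches the next kernel -- e.g. via the driver's own KHO FDT node -- is elided here:)

        /* Hypothetical driver usage of kho_preserve_folio()/kho_restore_folio(). */
        static struct folio *drv_state_folio;

        static int drv_save_state(void)
        {
                int err;

                drv_state_folio = folio_alloc(GFP_KERNEL, 2);   /* order 2: 16 KiB with 4 KiB pages */
                if (!drv_state_folio)
                        return -ENOMEM;

                /* ... fill folio_address(drv_state_folio) with state to carry over ... */

                err = kho_preserve_folio(drv_state_folio);
                if (err)
                        folio_put(drv_state_folio);
                return err;
        }

        /* Next kernel: turn the preserved physical address back into a folio. */
        static int drv_restore_state(phys_addr_t phys)
        {
                struct folio *folio = kho_restore_folio(phys);

                if (!folio)
                        return -ENOENT;

                /* Same order-2 folio, contents as the previous kernel left them. */
                /* ... consume folio_address(folio), then folio_put() when done ... */
                return 0;
        }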