Message ID | 20201025101555.3057-1-rppt@kernel.org (mailing list archive)
---|---
Series | arch, mm: improve robustness of direct map manipulation
On Sun, 2020-10-25 at 12:15 +0200, Mike Rapoport wrote: > Indeed, for architectures that define CONFIG_ARCH_HAS_SET_DIRECT_MAP > it is > possible that __kernel_map_pages() would fail, but since this > function is > void, the failure will go unnoticed. Could you elaborate on how this could happen? Do you mean during runtime today or if something new was introduced?
On Mon, Oct 26, 2020 at 01:13:52AM +0000, Edgecombe, Rick P wrote: > On Sun, 2020-10-25 at 12:15 +0200, Mike Rapoport wrote: > > Indeed, for architectures that define CONFIG_ARCH_HAS_SET_DIRECT_MAP > > it is > > possible that __kernel_map_pages() would fail, but since this > > function is > > void, the failure will go unnoticed. > > Could you elaborate on how this could happen? Do you mean during > runtime today or if something new was introduced? A failure in __kernel_map_pages() may happen today. For instance, on x86 if the kernel is built with DEBUG_PAGEALLOC. __kernel_map_pages(page, 1, 0); will need to split, say, 2M page and during the split an allocation of page table could fail. Currently, the only user of __kernel_map_pages() outside DEBUG_PAGEALLOC is hibernation, but I think it would be safer to entirely prevent usage of __kernel_map_pages() when DEBUG_PAGEALLOC=n.
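For readers following along, here is why the split Mike mentions can fail. This is a hypothetical sketch, not the real x86 change_page_attr() code: unmapping a single 4K page that currently sits inside a 2M direct-map PMD first requires splitting the PMD into 512 PTEs, and that split needs a freshly allocated page-table page.

```c
#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Hypothetical sketch only.  The point: a void __kernel_map_pages()
 * cannot report the -ENOMEM that a PMD split may hit.
 */
static int sketch_unmap_4k_inside_2m(pmd_t *pmd, unsigned long addr)
{
	if (pmd_large(*pmd)) {
		pte_t *new_ptes = (pte_t *)get_zeroed_page(GFP_KERNEL);

		if (!new_ptes)
			return -ENOMEM;	/* split failed, the page stays mapped */

		/* ... populate 512 PTEs mirroring the old 2M mapping ... */
		/* ... install new_ptes into the PMD ... */
	}

	/* ... clear _PAGE_PRESENT in the 4K PTE covering addr ... */
	return 0;
}
```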
On Mon, 2020-10-26 at 11:05 +0200, Mike Rapoport wrote: > On Mon, Oct 26, 2020 at 01:13:52AM +0000, Edgecombe, Rick P wrote: > > On Sun, 2020-10-25 at 12:15 +0200, Mike Rapoport wrote: > > > Indeed, for architectures that define > > > CONFIG_ARCH_HAS_SET_DIRECT_MAP > > > it is > > > possible that __kernel_map_pages() would fail, but since this > > > function is > > > void, the failure will go unnoticed. > > > > Could you elaborate on how this could happen? Do you mean during > > runtime today or if something new was introduced? > > A failure in__kernel_map_pages() may happen today. For instance, on > x86 > if the kernel is built with DEBUG_PAGEALLOC. > > __kernel_map_pages(page, 1, 0); > > will need to split, say, 2M page and during the split an allocation > of > page table could fail. On x86 at least, DEBUG_PAGEALLOC expects to never have to break a page on the direct map and even disables locking in cpa because it assumes this. If this is happening somehow anyway then we should probably fix that. Even if it's a debug feature, it will not be as useful if it is causing its own crashes. I'm still wondering if there is something I'm missing here. It seems like you are saying there is a bug in some arch's, so let's add a WARN in cross-arch code to log it as it crashes. A warn and making things clearer seem like good ideas, but if there is a bug we should fix it. The code around the callers still functionally assume re-mapping can't fail. > Currently, the only user of __kernel_map_pages() outside > DEBUG_PAGEALLOC > is hibernation, but I think it would be safer to entirely prevent > usage > of __kernel_map_pages() when DEBUG_PAGEALLOC=n. I totally agree it's error prone FWIW. On x86, my mental model of how it is supposed to work is: If a page is 4k and NP it cannot fail to be remapped. set_direct_map_invalid_noflush() should result in 4k NP pages, and DEBUG_PAGEALLOC should result in all 4k pages on the direct map. Are you seeing this violated or do I have wrong assumptions? Beyond whatever you are seeing, for the latter case of new things getting introduced to an interface with hidden dependencies... Another edge case could be a new caller to set_memory_np() could result in large NP pages. None of the callers today should cause this AFAICT, but it's not great to rely on the callers to know these details.
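To make the mental model above concrete: restoring a page that is already mapped by a 4K not-present PTE only rewrites that PTE in place, so there is nothing to allocate and nothing that can fail. An x86-flavoured illustration (not the actual cpa code; the helper name is invented for the example):

```c
#include <linux/mm.h>
#include <asm/pgtable.h>
#include <asm/tlbflush.h>

/* Illustration only: re-mapping a 4K NP page is just a PTE update. */
static void sketch_remap_4k_np(pte_t *ptep, unsigned long addr)
{
	/* The 2M->4K split already happened when the page was made NP. */
	set_pte(ptep, __pte(pte_val(*ptep) | _PAGE_PRESENT));
	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
}
```

The edge case described above is the other direction: a future set_memory_np() caller working on a 2M-aligned, 2M-sized range could leave a large not-present mapping, and bringing back a single 4K page inside it would again require a split that can fail.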
On Mon, Oct 26, 2020 at 06:05:30PM +0000, Edgecombe, Rick P wrote: > On Mon, 2020-10-26 at 11:05 +0200, Mike Rapoport wrote: > > On Mon, Oct 26, 2020 at 01:13:52AM +0000, Edgecombe, Rick P wrote: > > > On Sun, 2020-10-25 at 12:15 +0200, Mike Rapoport wrote: > > > > Indeed, for architectures that define > > > > CONFIG_ARCH_HAS_SET_DIRECT_MAP > > > > it is > > > > possible that __kernel_map_pages() would fail, but since this > > > > function is > > > > void, the failure will go unnoticed. > > > > > > Could you elaborate on how this could happen? Do you mean during > > > runtime today or if something new was introduced? > > > > A failure in__kernel_map_pages() may happen today. For instance, on > > x86 > > if the kernel is built with DEBUG_PAGEALLOC. > > > > __kernel_map_pages(page, 1, 0); > > > > will need to split, say, 2M page and during the split an allocation > > of > > page table could fail. > > On x86 at least, DEBUG_PAGEALLOC expects to never have to break a page > on the direct map and even disables locking in cpa because it assumes > this. If this is happening somehow anyway then we should probably fix > that. Even if it's a debug feature, it will not be as useful if it is > causing its own crashes. > > I'm still wondering if there is something I'm missing here. It seems > like you are saying there is a bug in some arch's, so let's add a WARN > in cross-arch code to log it as it crashes. A warn and making things > clearer seem like good ideas, but if there is a bug we should fix it. > The code around the callers still functionally assume re-mapping can't > fail. Oh, I've meant x86 kernel *without* DEBUG_PAGEALLOC, and indeed the call that unmaps pages back in safe_copy_page will just reset a 4K page to NP because whatever made it NP at the first place already did the split. Still, on arm64 with DEBUG_PAGEALLOC=n there is a possibility of a race between map/unmap dance in __vunmap() and safe_copy_page() that may cause access to unmapped memory: __vunmap() vm_remove_mappings() set_direct_map_invalid() safe_copy_page() __kernel_map_pages() return do_copy_page() -> fault This is a theoretical bug, but it is still not nice :) > > Currently, the only user of __kernel_map_pages() outside > > DEBUG_PAGEALLOC > > is hibernation, but I think it would be safer to entirely prevent > > usage > > of __kernel_map_pages() when DEBUG_PAGEALLOC=n. > > I totally agree it's error prone FWIW. On x86, my mental model of how > it is supposed to work is: If a page is 4k and NP it cannot fail to be > remapped. set_direct_map_invalid_noflush() should result in 4k NP > pages, and DEBUG_PAGEALLOC should result in all 4k pages on the direct > map. Are you seeing this violated or do I have wrong assumptions? You are right, there is a set of assumptions about the remapping of the direct map pages that make it all work, at least on x86. But this is very subtle and it's not easy to wrap one's head around this. That's why putting __kernel_map_pages() out of "common" use and keep it only for DEBUG_PAGEALLOC would make things clearer. > Beyond whatever you are seeing, for the latter case of new things > getting introduced to an interface with hidden dependencies... Another > edge case could be a new caller to set_memory_np() could result in > large NP pages. None of the callers today should cause this AFAICT, but > it's not great to rely on the callers to know these details. A caller of set_memory_*() or set_direct_map_*() should expect a failure and be ready for that. 
So adding a WARN to safe_copy_page() is the first step in that direction :)
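For context, the hibernation helper being discussed looks roughly like this before the series (a simplified paraphrase of kernel/power/snapshot.c, surrounding #ifdefs and error handling omitted). With DEBUG_PAGEALLOC=n the kernel_map_pages() calls are no-ops, which is why the do_copy_page() in the trace above can fault on a page that __vunmap() has just removed from the direct map:

```c
#include <linux/mm.h>

/* Simplified paraphrase of the pre-series safe_copy_page(). */
static void safe_copy_page(void *dst, struct page *s_page)
{
	if (kernel_page_present(s_page)) {
		do_copy_page(dst, page_address(s_page));
	} else {
		/* No-op when DEBUG_PAGEALLOC is disabled at build/run time. */
		kernel_map_pages(s_page, 1, 1);
		do_copy_page(dst, page_address(s_page));
		kernel_map_pages(s_page, 1, 0);
	}
}
```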
On 27.10.20 09:38, Mike Rapoport wrote: > On Mon, Oct 26, 2020 at 06:05:30PM +0000, Edgecombe, Rick P wrote: >> On Mon, 2020-10-26 at 11:05 +0200, Mike Rapoport wrote: >>> On Mon, Oct 26, 2020 at 01:13:52AM +0000, Edgecombe, Rick P wrote: >>>> On Sun, 2020-10-25 at 12:15 +0200, Mike Rapoport wrote: >>>>> Indeed, for architectures that define >>>>> CONFIG_ARCH_HAS_SET_DIRECT_MAP >>>>> it is >>>>> possible that __kernel_map_pages() would fail, but since this >>>>> function is >>>>> void, the failure will go unnoticed. >>>> >>>> Could you elaborate on how this could happen? Do you mean during >>>> runtime today or if something new was introduced? >>> >>> A failure in__kernel_map_pages() may happen today. For instance, on >>> x86 >>> if the kernel is built with DEBUG_PAGEALLOC. >>> >>> __kernel_map_pages(page, 1, 0); >>> >>> will need to split, say, 2M page and during the split an allocation >>> of >>> page table could fail. >> >> On x86 at least, DEBUG_PAGEALLOC expects to never have to break a page >> on the direct map and even disables locking in cpa because it assumes >> this. If this is happening somehow anyway then we should probably fix >> that. Even if it's a debug feature, it will not be as useful if it is >> causing its own crashes. >> >> I'm still wondering if there is something I'm missing here. It seems >> like you are saying there is a bug in some arch's, so let's add a WARN >> in cross-arch code to log it as it crashes. A warn and making things >> clearer seem like good ideas, but if there is a bug we should fix it. >> The code around the callers still functionally assume re-mapping can't >> fail. > > Oh, I've meant x86 kernel *without* DEBUG_PAGEALLOC, and indeed the call > that unmaps pages back in safe_copy_page will just reset a 4K page to > NP because whatever made it NP at the first place already did the split. > > Still, on arm64 with DEBUG_PAGEALLOC=n there is a possibility of a race > between map/unmap dance in __vunmap() and safe_copy_page() that may > cause access to unmapped memory: > > __vunmap() > vm_remove_mappings() > set_direct_map_invalid() > safe_copy_page() > __kernel_map_pages() > return > do_copy_page() -> fault > > This is a theoretical bug, but it is still not nice :) > >>> Currently, the only user of __kernel_map_pages() outside >>> DEBUG_PAGEALLOC >>> is hibernation, but I think it would be safer to entirely prevent >>> usage >>> of __kernel_map_pages() when DEBUG_PAGEALLOC=n. >> >> I totally agree it's error prone FWIW. On x86, my mental model of how >> it is supposed to work is: If a page is 4k and NP it cannot fail to be >> remapped. set_direct_map_invalid_noflush() should result in 4k NP >> pages, and DEBUG_PAGEALLOC should result in all 4k pages on the direct >> map. Are you seeing this violated or do I have wrong assumptions? > > You are right, there is a set of assumptions about the remapping of the > direct map pages that make it all work, at least on x86. > But this is very subtle and it's not easy to wrap one's head around > this. > > That's why putting __kernel_map_pages() out of "common" use and > keep it only for DEBUG_PAGEALLOC would make things clearer. > >> Beyond whatever you are seeing, for the latter case of new things >> getting introduced to an interface with hidden dependencies... Another >> edge case could be a new caller to set_memory_np() could result in >> large NP pages. None of the callers today should cause this AFAICT, but >> it's not great to rely on the callers to know these details. 
> > A caller of set_memory_*() or set_direct_map_*() should expect a failure > and be ready for that. So adding a WARN to safe_copy_page() is the first > step in that direction :) > I am probably missing something important, but why are we saving/restoring the content of pages that were explicitly removed from the identity mapping such that nobody will access them? Pages that are not allocated should contain garbage or be zero (init_on_free). That should be easy to handle without ever reading the page content. The other user seems to be vm_remove_mappings(), where we only *temporarily* remove the mapping - while hibernating, that code shouldn't be active anymore I guess - or we could protect it from happening. As I expressed in another mail, secretmem pages should rather not be saved when hibernating - hibernation should rather be disabled. What am I missing?
On Tue, Oct 27, 2020 at 09:46:35AM +0100, David Hildenbrand wrote: > On 27.10.20 09:38, Mike Rapoport wrote: > > On Mon, Oct 26, 2020 at 06:05:30PM +0000, Edgecombe, Rick P wrote: > > > On Mon, 2020-10-26 at 11:05 +0200, Mike Rapoport wrote: > > > > On Mon, Oct 26, 2020 at 01:13:52AM +0000, Edgecombe, Rick P wrote: > > > > > On Sun, 2020-10-25 at 12:15 +0200, Mike Rapoport wrote: > > > > > > Indeed, for architectures that define > > > > > > CONFIG_ARCH_HAS_SET_DIRECT_MAP > > > > > > it is > > > > > > possible that __kernel_map_pages() would fail, but since this > > > > > > function is > > > > > > void, the failure will go unnoticed. > > > > > > > > > > Could you elaborate on how this could happen? Do you mean during > > > > > runtime today or if something new was introduced? > > > > > > > > A failure in__kernel_map_pages() may happen today. For instance, on > > > > x86 > > > > if the kernel is built with DEBUG_PAGEALLOC. > > > > > > > > __kernel_map_pages(page, 1, 0); > > > > > > > > will need to split, say, 2M page and during the split an allocation > > > > of > > > > page table could fail. > > > > > > On x86 at least, DEBUG_PAGEALLOC expects to never have to break a page > > > on the direct map and even disables locking in cpa because it assumes > > > this. If this is happening somehow anyway then we should probably fix > > > that. Even if it's a debug feature, it will not be as useful if it is > > > causing its own crashes. > > > > > > I'm still wondering if there is something I'm missing here. It seems > > > like you are saying there is a bug in some arch's, so let's add a WARN > > > in cross-arch code to log it as it crashes. A warn and making things > > > clearer seem like good ideas, but if there is a bug we should fix it. > > > The code around the callers still functionally assume re-mapping can't > > > fail. > > > > Oh, I've meant x86 kernel *without* DEBUG_PAGEALLOC, and indeed the call > > that unmaps pages back in safe_copy_page will just reset a 4K page to > > NP because whatever made it NP at the first place already did the split. > > > > Still, on arm64 with DEBUG_PAGEALLOC=n there is a possibility of a race > > between map/unmap dance in __vunmap() and safe_copy_page() that may > > cause access to unmapped memory: > > > > __vunmap() > > vm_remove_mappings() > > set_direct_map_invalid() > > safe_copy_page() > > __kernel_map_pages() > > return > > do_copy_page() -> fault > > > > This is a theoretical bug, but it is still not nice :) > > > > > > Currently, the only user of __kernel_map_pages() outside > > > > DEBUG_PAGEALLOC > > > > is hibernation, but I think it would be safer to entirely prevent > > > > usage > > > > of __kernel_map_pages() when DEBUG_PAGEALLOC=n. > > > > > > I totally agree it's error prone FWIW. On x86, my mental model of how > > > it is supposed to work is: If a page is 4k and NP it cannot fail to be > > > remapped. set_direct_map_invalid_noflush() should result in 4k NP > > > pages, and DEBUG_PAGEALLOC should result in all 4k pages on the direct > > > map. Are you seeing this violated or do I have wrong assumptions? > > > > You are right, there is a set of assumptions about the remapping of the > > direct map pages that make it all work, at least on x86. > > But this is very subtle and it's not easy to wrap one's head around > > this. > > > > That's why putting __kernel_map_pages() out of "common" use and > > keep it only for DEBUG_PAGEALLOC would make things clearer. 
> > > > > Beyond whatever you are seeing, for the latter case of new things > > > getting introduced to an interface with hidden dependencies... Another > > > edge case could be a new caller to set_memory_np() could result in > > > large NP pages. None of the callers today should cause this AFAICT, but > > > it's not great to rely on the callers to know these details. > > A caller of set_memory_*() or set_direct_map_*() should expect a failure > > and be ready for that. So adding a WARN to safe_copy_page() is the first > > step in that direction :) > > > > I am probably missing something important, but why are we saving/restoring > the content of pages that were explicitly removed from the identity mapping > such that nobody will access them? > > Pages that are not allocated should contain garbage or be zero > (init_on_free). That should be easy to handle without ever reading the page > content. I'm not familiar with hibernation to say anything smart here, but the help text of DEBUG_PAGEALLOC in Kconfig says: ... this option cannot be enabled in combination with hibernation as that would result in incorrect warnings of memory corruption after a resume because free pages are not saved to the suspend image. Probably you are right and free pages need to be handled differently, but it does not seem the case now. > The other user seems to be vm_remove_mappings(), where we only *temporarily* > remove the mapping - while hibernating, that code shouldn't be active > anymore I guess - or we could protect it from happening. Hmm, I _think_ vm_remove_mappings() shouldn't be active while hibernating, but I'm not 100% sure. > As I expressed in another mail, secretmem pages should rather not be saved > when hibernating - hibernation should be rather be disabled. Agree. > What am I missing? I think I miscommunicated the purpose of this set, which was to hide __kernel_map_pages() under DEBUG_PAGEALLOC and make hibernation use set_direct_map_*() explicitly without major rework of free pages handling during hibernation. Does it help? > -- > Thanks, > > David / dhildenb >
On 27.10.20 10:47, Mike Rapoport wrote: > On Tue, Oct 27, 2020 at 09:46:35AM +0100, David Hildenbrand wrote: >> On 27.10.20 09:38, Mike Rapoport wrote: >>> On Mon, Oct 26, 2020 at 06:05:30PM +0000, Edgecombe, Rick P wrote: >>>> On Mon, 2020-10-26 at 11:05 +0200, Mike Rapoport wrote: >>>>> On Mon, Oct 26, 2020 at 01:13:52AM +0000, Edgecombe, Rick P wrote: >>>>>> On Sun, 2020-10-25 at 12:15 +0200, Mike Rapoport wrote: >>>>>>> Indeed, for architectures that define >>>>>>> CONFIG_ARCH_HAS_SET_DIRECT_MAP >>>>>>> it is >>>>>>> possible that __kernel_map_pages() would fail, but since this >>>>>>> function is >>>>>>> void, the failure will go unnoticed. >>>>>> >>>>>> Could you elaborate on how this could happen? Do you mean during >>>>>> runtime today or if something new was introduced? >>>>> >>>>> A failure in__kernel_map_pages() may happen today. For instance, on >>>>> x86 >>>>> if the kernel is built with DEBUG_PAGEALLOC. >>>>> >>>>> __kernel_map_pages(page, 1, 0); >>>>> >>>>> will need to split, say, 2M page and during the split an allocation >>>>> of >>>>> page table could fail. >>>> >>>> On x86 at least, DEBUG_PAGEALLOC expects to never have to break a page >>>> on the direct map and even disables locking in cpa because it assumes >>>> this. If this is happening somehow anyway then we should probably fix >>>> that. Even if it's a debug feature, it will not be as useful if it is >>>> causing its own crashes. >>>> >>>> I'm still wondering if there is something I'm missing here. It seems >>>> like you are saying there is a bug in some arch's, so let's add a WARN >>>> in cross-arch code to log it as it crashes. A warn and making things >>>> clearer seem like good ideas, but if there is a bug we should fix it. >>>> The code around the callers still functionally assume re-mapping can't >>>> fail. >>> >>> Oh, I've meant x86 kernel *without* DEBUG_PAGEALLOC, and indeed the call >>> that unmaps pages back in safe_copy_page will just reset a 4K page to >>> NP because whatever made it NP at the first place already did the split. >>> >>> Still, on arm64 with DEBUG_PAGEALLOC=n there is a possibility of a race >>> between map/unmap dance in __vunmap() and safe_copy_page() that may >>> cause access to unmapped memory: >>> >>> __vunmap() >>> vm_remove_mappings() >>> set_direct_map_invalid() >>> safe_copy_page() >>> __kernel_map_pages() >>> return >>> do_copy_page() -> fault >>> >>> This is a theoretical bug, but it is still not nice :) >>> >>>>> Currently, the only user of __kernel_map_pages() outside >>>>> DEBUG_PAGEALLOC >>>>> is hibernation, but I think it would be safer to entirely prevent >>>>> usage >>>>> of __kernel_map_pages() when DEBUG_PAGEALLOC=n. >>>> >>>> I totally agree it's error prone FWIW. On x86, my mental model of how >>>> it is supposed to work is: If a page is 4k and NP it cannot fail to be >>>> remapped. set_direct_map_invalid_noflush() should result in 4k NP >>>> pages, and DEBUG_PAGEALLOC should result in all 4k pages on the direct >>>> map. Are you seeing this violated or do I have wrong assumptions? >>> >>> You are right, there is a set of assumptions about the remapping of the >>> direct map pages that make it all work, at least on x86. >>> But this is very subtle and it's not easy to wrap one's head around >>> this. >>> >>> That's why putting __kernel_map_pages() out of "common" use and >>> keep it only for DEBUG_PAGEALLOC would make things clearer. 
>>> >>>> Beyond whatever you are seeing, for the latter case of new things >>>> getting introduced to an interface with hidden dependencies... Another >>>> edge case could be a new caller to set_memory_np() could result in >>>> large NP pages. None of the callers today should cause this AFAICT, but >>>> it's not great to rely on the callers to know these details. >>> A caller of set_memory_*() or set_direct_map_*() should expect a failure >>> and be ready for that. So adding a WARN to safe_copy_page() is the first >>> step in that direction :) >>> >> >> I am probably missing something important, but why are we saving/restoring >> the content of pages that were explicitly removed from the identity mapping >> such that nobody will access them? >> >> Pages that are not allocated should contain garbage or be zero >> (init_on_free). That should be easy to handle without ever reading the page >> content. > > I'm not familiar with hibernation to say anything smart here, but the > help text of DEBUG_PAGEALLOC in Kconfig says: > > ... this option cannot be enabled in combination with > hibernation as that would result in incorrect warnings of memory > corruption after a resume because free pages are not saved to > the suspend image. > > Probably you are right and free pages need to be handled differently, > but it does not seem the case now. > >> The other user seems to be vm_remove_mappings(), where we only *temporarily* >> remove the mapping - while hibernating, that code shouldn't be active >> anymore I guess - or we could protect it from happening. > > Hmm, I _think_ vm_remove_mappings() shouldn't be active while > hibernating, but I'm not 100% sure. > >> As I expressed in another mail, secretmem pages should rather not be saved >> when hibernating - hibernation should be rather be disabled. > > Agree. > >> What am I missing? > > I think I miscommunicated the purpose of this set, which was to hide > __kernel_map_pages() under DEBUG_PAGEALLOC and make hibernation use > set_direct_map_*() explictly without major rework of free pages handling > during hibernation. > > Does it help? > Heh, as always, once you touch questionable code, people will beg for proper cleanups instead :)
On Tue, Oct 27, 2020 at 09:46:35AM +0100, David Hildenbrand wrote: > On 27.10.20 09:38, Mike Rapoport wrote: > > On Mon, Oct 26, 2020 at 06:05:30PM +0000, Edgecombe, Rick P wrote: > > > > > Beyond whatever you are seeing, for the latter case of new things > > > getting introduced to an interface with hidden dependencies... Another > > > edge case could be a new caller to set_memory_np() could result in > > > large NP pages. None of the callers today should cause this AFAICT, but > > > it's not great to rely on the callers to know these details. > > A caller of set_memory_*() or set_direct_map_*() should expect a failure > > and be ready for that. So adding a WARN to safe_copy_page() is the first > > step in that direction :) > > > > I am probably missing something important, but why are we saving/restoring > the content of pages that were explicitly removed from the identity mapping > such that nobody will access them? Actually, we should not be saving/restoring free pages during hibernation as there are several calls to mark_free_pages() that should exclude the free pages from the snapshot. I've tried to find why the fix that maps/unmaps a page to save it was required at the first place, but I could not find bug reports. The closest I've got is an email from Rafael that asked to update "hibernate: handle DEBUG_PAGEALLOC" patch: https://lore.kernel.org/linux-pm/200802200133.44098.rjw@sisk.pl/ Could it be that safe_copy_page() tries to workaround a non-existent problem?
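For reference, the exclusion Mike refers to happens in mark_free_pages() (mm/page_alloc.c), which walks the buddy free lists and flags the pages so that the snapshot code skips them. A heavily simplified sketch of that effect, assuming the swsusp_set_page_free() helper and ignoring locking and the forbidden-pages handling:

```c
#include <linux/mmzone.h>
#include <linux/suspend.h>

/* Heavily simplified sketch of what mark_free_pages() achieves. */
static void sketch_mark_free_pages(struct zone *zone)
{
	unsigned int order, type;
	struct page *page;

	for_each_migratetype_order(order, type) {
		list_for_each_entry(page,
				&zone->free_area[order].free_list[type], lru) {
			unsigned long i;

			for (i = 0; i < (1UL << order); i++)
				swsusp_set_page_free(page + i);
		}
	}
}
```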
On 28.10.20 12:09, Mike Rapoport wrote: > On Tue, Oct 27, 2020 at 09:46:35AM +0100, David Hildenbrand wrote: >> On 27.10.20 09:38, Mike Rapoport wrote: >>> On Mon, Oct 26, 2020 at 06:05:30PM +0000, Edgecombe, Rick P wrote: >>> >>>> Beyond whatever you are seeing, for the latter case of new things >>>> getting introduced to an interface with hidden dependencies... Another >>>> edge case could be a new caller to set_memory_np() could result in >>>> large NP pages. None of the callers today should cause this AFAICT, but >>>> it's not great to rely on the callers to know these details. > >>> A caller of set_memory_*() or set_direct_map_*() should expect a failure >>> and be ready for that. So adding a WARN to safe_copy_page() is the first >>> step in that direction :) >>> >> >> I am probably missing something important, but why are we saving/restoring >> the content of pages that were explicitly removed from the identity mapping >> such that nobody will access them? > > Actually, we should not be saving/restoring free pages during > hibernation as there are several calls to mark_free_pages() that should > exclude the free pages from the snapshot. I've tried to find why the fix > that maps/unmaps a page to save it was required at the first place, but > I could not find bug reports. > > The closest I've got is an email from Rafael that asked to update > "hibernate: handle DEBUG_PAGEALLOC" patch: > > https://lore.kernel.org/linux-pm/200802200133.44098.rjw@sisk.pl/ > > Could it be that safe_copy_page() tries to workaround a non-existent > problem? > Could be! Also see https://lkml.kernel.org/r/38de5bb0-5559-d069-0ce0-daec66ef2746@suse.cz which restores free page content based on more kernel parameters, not based on the original content.
On Tue, Oct 27, 2020 at 10:38:16AM +0200, Mike Rapoport wrote: > On Mon, Oct 26, 2020 at 06:05:30PM +0000, Edgecombe, Rick P wrote: > > On Mon, 2020-10-26 at 11:05 +0200, Mike Rapoport wrote: > > > On Mon, Oct 26, 2020 at 01:13:52AM +0000, Edgecombe, Rick P wrote: > > > > On Sun, 2020-10-25 at 12:15 +0200, Mike Rapoport wrote: > > > > > Indeed, for architectures that define > > > > > CONFIG_ARCH_HAS_SET_DIRECT_MAP > > > > > it is > > > > > possible that __kernel_map_pages() would fail, but since this > > > > > function is > > > > > void, the failure will go unnoticed. > > > > > > > > Could you elaborate on how this could happen? Do you mean during > > > > runtime today or if something new was introduced? > > > > > > A failure in__kernel_map_pages() may happen today. For instance, on > > > x86 > > > if the kernel is built with DEBUG_PAGEALLOC. > > > > > > __kernel_map_pages(page, 1, 0); > > > > > > will need to split, say, 2M page and during the split an allocation > > > of > > > page table could fail. > > > > On x86 at least, DEBUG_PAGEALLOC expects to never have to break a page > > on the direct map and even disables locking in cpa because it assumes > > this. If this is happening somehow anyway then we should probably fix > > that. Even if it's a debug feature, it will not be as useful if it is > > causing its own crashes. > > > > I'm still wondering if there is something I'm missing here. It seems > > like you are saying there is a bug in some arch's, so let's add a WARN > > in cross-arch code to log it as it crashes. A warn and making things > > clearer seem like good ideas, but if there is a bug we should fix it. > > The code around the callers still functionally assume re-mapping can't > > fail. > > Oh, I've meant x86 kernel *without* DEBUG_PAGEALLOC, and indeed the call > that unmaps pages back in safe_copy_page will just reset a 4K page to > NP because whatever made it NP at the first place already did the split. > > Still, on arm64 with DEBUG_PAGEALLOC=n there is a possibility of a race > between map/unmap dance in __vunmap() and safe_copy_page() that may > cause access to unmapped memory: > > __vunmap() > vm_remove_mappings() > set_direct_map_invalid() > safe_copy_page() > __kernel_map_pages() > return > do_copy_page() -> fault > > This is a theoretical bug, but it is still not nice :) Just to clarify: this patch series fixes this problem, right? Will
On Wed, Oct 28, 2020 at 11:20:12AM +0000, Will Deacon wrote: > On Tue, Oct 27, 2020 at 10:38:16AM +0200, Mike Rapoport wrote: > > On Mon, Oct 26, 2020 at 06:05:30PM +0000, Edgecombe, Rick P wrote: > > > On Mon, 2020-10-26 at 11:05 +0200, Mike Rapoport wrote: > > > > On Mon, Oct 26, 2020 at 01:13:52AM +0000, Edgecombe, Rick P wrote: > > > > > On Sun, 2020-10-25 at 12:15 +0200, Mike Rapoport wrote: > > > > > > Indeed, for architectures that define > > > > > > CONFIG_ARCH_HAS_SET_DIRECT_MAP > > > > > > it is > > > > > > possible that __kernel_map_pages() would fail, but since this > > > > > > function is > > > > > > void, the failure will go unnoticed. > > > > > > > > > > Could you elaborate on how this could happen? Do you mean during > > > > > runtime today or if something new was introduced? > > > > > > > > A failure in__kernel_map_pages() may happen today. For instance, on > > > > x86 > > > > if the kernel is built with DEBUG_PAGEALLOC. > > > > > > > > __kernel_map_pages(page, 1, 0); > > > > > > > > will need to split, say, 2M page and during the split an allocation > > > > of > > > > page table could fail. > > > > > > On x86 at least, DEBUG_PAGEALLOC expects to never have to break a page > > > on the direct map and even disables locking in cpa because it assumes > > > this. If this is happening somehow anyway then we should probably fix > > > that. Even if it's a debug feature, it will not be as useful if it is > > > causing its own crashes. > > > > > > I'm still wondering if there is something I'm missing here. It seems > > > like you are saying there is a bug in some arch's, so let's add a WARN > > > in cross-arch code to log it as it crashes. A warn and making things > > > clearer seem like good ideas, but if there is a bug we should fix it. > > > The code around the callers still functionally assume re-mapping can't > > > fail. > > > > Oh, I've meant x86 kernel *without* DEBUG_PAGEALLOC, and indeed the call > > that unmaps pages back in safe_copy_page will just reset a 4K page to > > NP because whatever made it NP at the first place already did the split. > > > > Still, on arm64 with DEBUG_PAGEALLOC=n there is a possibility of a race > > between map/unmap dance in __vunmap() and safe_copy_page() that may > > cause access to unmapped memory: > > > > __vunmap() > > vm_remove_mappings() > > set_direct_map_invalid() > > safe_copy_page() > > __kernel_map_pages() > > return > > do_copy_page() -> fault > > > > This is a theoretical bug, but it is still not nice :) > > Just to clarify: this patch series fixes this problem, right? Yes. > Will
On Wed, Oct 28, 2020 at 12:17:35PM +0100, David Hildenbrand wrote: > On 28.10.20 12:09, Mike Rapoport wrote: > > On Tue, Oct 27, 2020 at 09:46:35AM +0100, David Hildenbrand wrote: > > > On 27.10.20 09:38, Mike Rapoport wrote: > > > > On Mon, Oct 26, 2020 at 06:05:30PM +0000, Edgecombe, Rick P wrote: > > > > > > > > > Beyond whatever you are seeing, for the latter case of new things > > > > > getting introduced to an interface with hidden dependencies... Another > > > > > edge case could be a new caller to set_memory_np() could result in > > > > > large NP pages. None of the callers today should cause this AFAICT, but > > > > > it's not great to rely on the callers to know these details. > > > > > > A caller of set_memory_*() or set_direct_map_*() should expect a failure > > > > and be ready for that. So adding a WARN to safe_copy_page() is the first > > > > step in that direction :) > > > > > > > > > > I am probably missing something important, but why are we saving/restoring > > > the content of pages that were explicitly removed from the identity mapping > > > such that nobody will access them? > > > > Actually, we should not be saving/restoring free pages during > > hibernation as there are several calls to mark_free_pages() that should > > exclude the free pages from the snapshot. I've tried to find why the fix > > that maps/unmaps a page to save it was required at the first place, but > > I could not find bug reports. > > > > The closest I've got is an email from Rafael that asked to update > > "hibernate: handle DEBUG_PAGEALLOC" patch: > > > > https://lore.kernel.org/linux-pm/200802200133.44098.rjw@sisk.pl/ > > > > Could it be that safe_copy_page() tries to workaround a non-existent > > problem? > > > > Clould be! Also see > > https://lkml.kernel.org/r/38de5bb0-5559-d069-0ce0-daec66ef2746@suse.cz > > which restores free page content based on more kernel parameters, not based > on the original content. Ah, after looking at it now I've run a kernel with DEBUG_PAGEALLOC=y and CONFIG_INIT_ON_FREE_DEFAULT_ON=y and restore crashed nicely.

[ 27.210093] PM: Image successfully loaded
[ 27.226709] Disabling non-boot CPUs ...
[ 27.231208] smpboot: CPU 1 is now offline
[ 27.363926] kvm-clock: cpu 0, msr 5c889001, primary cpu clock, resume
[ 27.363995] BUG: unable to handle page fault for address: ffff9f7a40108000
[ 27.367996] #PF: supervisor write access in kernel mode
[ 27.369558] #PF: error_code(0x0002) - not-present page
[ 27.371098] PGD 5ca01067 P4D 5ca01067 PUD 5ca02067 PMD 5ca03067 PTE 800ffffff fef7060
[ 27.373421] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC PTI
[ 27.374905] CPU: 0 PID: 1200 Comm: bash Not tainted 5.10.0-rc1 #5
[ 27.376700] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14 .0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 27.379879] RIP: 0010:clear_page_rep+0x7/0x10
[ 27.381218] Code: e8 be 88 75 00 44 89 e2 48 89 ee 48 89 df e8 60 ff ff ff c6 03 00 5b 5d 41 5c c3 cc cc cc cc cc cc cc cc b9 00 02 00 00 31 c0 <f3> 48 ab c3 0f 1f 44 00 00 31 c0 b9 40 00 00 00 66 0f 1f 84 00 00
[ 27.386457] RSP: 0018:ffffb6838046be08 EFLAGS: 00010046
[ 27.388011] RAX: 0000000000000000 RBX: ffff9f7a487c0ec0 RCX: 0000000000000200
[ 27.390082] RDX: ffff9f7a4c788000 RSI: 0000000000000000 RDI: ffff9f7a40108000
[ 27.392138] RBP: ffffffff8629c860 R08: 0000000000000000 R09: 0000000000000007
[ 27.394205] R10: 0000000000000004 R11: ffffb6838046bbf8 R12: 0000000000000000
[ 27.396271] R13: ffff9f7a419a62a0 R14: 0000000000000005 R15: ffff9f7a484f4da0
[ 27.398334] FS: 00007fe0c3f6a700(0000) GS:ffff9f7abf800000(0000) knlGS:0000000000000000
[ 27.400717] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 27.402432] CR2: ffff9f7a40108000 CR3: 000000000859a001 CR4: 0000000000060ef0
[ 27.404485] Call Trace:
[ 27.405326] clear_free_pages+0xf5/0x150
[ 27.406568] hibernation_snapshot+0x390/0x3d0
[ 27.407908] hibernate+0xdb/0x240
[ 27.408978] state_store+0xd7/0xe0
[ 27.410078] kernfs_fop_write+0x10e/0x1a0
[ 27.411333] vfs_write+0xbb/0x210
[ 27.412423] ksys_write+0x9c/0xd0
[ 27.413488] do_syscall_64+0x33/0x40
[ 27.414636] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 27.416150] RIP: 0033:0x7fe0c364e380 66 0f 1f 44 00 00 83 3d c9 23 2d 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 fe dd 01 00 48 89 04 24
[ 27.422500] RSP: 002b:00007ffeb64bd0c8 EFLAGS: 00000246 ORIG_RAX: 00000000000 00001
[ 27.424724] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fe0c364e380
[ 27.426761] RDX: 0000000000000005 RSI: 0000000001eb6408 RDI: 0000000000000001
[ 27.428791] RBP: 0000000001eb6408 R08: 00007fe0c391d780 R09: 00007fe0c3f6a700
[ 27.430863] R10: 0000000000000004 R11: 0000000000000246 R12: 0000000000000005
[ 27.432920] R13: 0000000000000001 R14: 00007fe0c391c620 R15: 0000000000000000
[ 27.434989] Modules linked in:
[ 27.436004] CR2: ffff9f7a40108000
[ 27.437075] ---[ end trace 424c466bcd2bfcad ]---
> -- > Thanks, > > David / dhildenb >
On Wed, 2020-10-28 at 13:09 +0200, Mike Rapoport wrote: > On Tue, Oct 27, 2020 at 09:46:35AM +0100, David Hildenbrand wrote: > > On 27.10.20 09:38, Mike Rapoport wrote: > > > On Mon, Oct 26, 2020 at 06:05:30PM +0000, Edgecombe, Rick P > > > wrote: > > > > > > > Beyond whatever you are seeing, for the latter case of new > > > > things > > > > getting introduced to an interface with hidden dependencies... > > > > Another > > > > edge case could be a new caller to set_memory_np() could result > > > > in > > > > large NP pages. None of the callers today should cause this > > > > AFAICT, but > > > > it's not great to rely on the callers to know these details. > > > A caller of set_memory_*() or set_direct_map_*() should expect a > > > failure > > > and be ready for that. So adding a WARN to safe_copy_page() is > > > the first > > > step in that direction :) > > > > > > > I am probably missing something important, but why are we > > saving/restoring > > the content of pages that were explicitly removed from the identity > > mapping > > such that nobody will access them? > > Actually, we should not be saving/restoring free pages during > hibernation as there are several calls to mark_free_pages() that > should > exclude the free pages from the snapshot. I've tried to find why the > fix > that maps/unmaps a page to save it was required at the first place, > but > I could not find bug reports. > > The closest I've got is an email from Rafael that asked to update > "hibernate: handle DEBUG_PAGEALLOC" patch: > > https://lore.kernel.org/linux-pm/200802200133.44098.rjw@sisk.pl/ > > Could it be that safe_copy_page() tries to workaround a non-existent > problem? It looks like inside page_alloc.c it unmaps the page before it actually frees it, so to hibernate it could look like the page is still allocated even though it's unmapped? Maybe that small window is what it cared about initially. There is also now the vmalloc case, which I am actually working on expanding. So I think the re-mapping logic is needed.
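A sketch of the window Rick is pointing at (illustrative only, not the actual mm/page_alloc.c free path): with page-allocation debugging, the page is removed from the direct map while it is being freed, slightly before it reaches the buddy free lists, so a snapshot taken in between could see an "allocated" page that is already unmapped.

```c
#include <linux/mm.h>

/* Illustrative ordering only; checks, poisoning and locking omitted. */
static void sketch_free_path(struct page *page, unsigned int order)
{
	/* 1. DEBUG_PAGEALLOC unmaps the page from the direct map here... */
	kernel_map_pages(page, 1 << order, 0);

	/*
	 * 2. ...and only afterwards is the page linked into the buddy free
	 * lists.  Between 1 and 2 the page is not yet "free" as far as
	 * mark_free_pages() is concerned, yet it is already not-present,
	 * so safe_copy_page() must be able to map it back before copying.
	 */
}
```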
On Wed, 2020-10-28 at 13:30 +0200, Mike Rapoport wrote: > On Wed, Oct 28, 2020 at 11:20:12AM +0000, Will Deacon wrote: > > On Tue, Oct 27, 2020 at 10:38:16AM +0200, Mike Rapoport wrote: > > > On Mon, Oct 26, 2020 at 06:05:30PM +0000, Edgecombe, Rick P > > > wrote: > > > > On Mon, 2020-10-26 at 11:05 +0200, Mike Rapoport wrote: > > > > > On Mon, Oct 26, 2020 at 01:13:52AM +0000, Edgecombe, Rick P > > > > > wrote: > > > > > > On Sun, 2020-10-25 at 12:15 +0200, Mike Rapoport wrote: > > > > > > > Indeed, for architectures that define > > > > > > > CONFIG_ARCH_HAS_SET_DIRECT_MAP > > > > > > > it is > > > > > > > possible that __kernel_map_pages() would fail, but since > > > > > > > this > > > > > > > function is > > > > > > > void, the failure will go unnoticed. > > > > > > > > > > > > Could you elaborate on how this could happen? Do you mean > > > > > > during > > > > > > runtime today or if something new was introduced? > > > > > > > > > > A failure in__kernel_map_pages() may happen today. For > > > > > instance, on > > > > > x86 > > > > > if the kernel is built with DEBUG_PAGEALLOC. > > > > > > > > > > __kernel_map_pages(page, 1, 0); > > > > > > > > > > will need to split, say, 2M page and during the split an > > > > > allocation > > > > > of > > > > > page table could fail. > > > > > > > > On x86 at least, DEBUG_PAGEALLOC expects to never have to break > > > > a page > > > > on the direct map and even disables locking in cpa because it > > > > assumes > > > > this. If this is happening somehow anyway then we should > > > > probably fix > > > > that. Even if it's a debug feature, it will not be as useful if > > > > it is > > > > causing its own crashes. > > > > > > > > I'm still wondering if there is something I'm missing here. It > > > > seems > > > > like you are saying there is a bug in some arch's, so let's add > > > > a WARN > > > > in cross-arch code to log it as it crashes. A warn and making > > > > things > > > > clearer seem like good ideas, but if there is a bug we should > > > > fix it. > > > > The code around the callers still functionally assume re- > > > > mapping can't > > > > fail. > > > > > > Oh, I've meant x86 kernel *without* DEBUG_PAGEALLOC, and indeed > > > the call > > > that unmaps pages back in safe_copy_page will just reset a 4K > > > page to > > > NP because whatever made it NP at the first place already did the > > > split. > > > > > > Still, on arm64 with DEBUG_PAGEALLOC=n there is a possibility of > > > a race > > > between map/unmap dance in __vunmap() and safe_copy_page() that > > > may > > > cause access to unmapped memory: > > > > > > __vunmap() > > > vm_remove_mappings() > > > set_direct_map_invalid() > > > safe_copy_page() > > > __kernel_map_pages() > > > return > > > do_copy_page() -> fault > > > > > > This is a theoretical bug, but it is still not nice :) > > > > > > > Just to clarify: this patch series fixes this problem, right? > > Yes. >

Well, now I'm confused again.

As David pointed, __vunmap() should not be executing simultaneously with the hibernate operation because hibernate can't snapshot while data it needs to save is still updating. If a thread was paused when a page was in an "invalid" state, it should be remapped by hibernate before the copy.

To level set, before reading this mail, my takeaways from the discussions on potential hibernate/debug page alloc problems were:

Potential RISC-V issue:
Doesn't have hibernate support

Potential ARM issue:
The logic around when its cpa determines pages might be unmapped looks correct for current callers.

Potential x86 page break issue:
Seems to be ok for now, but a new set_memory_np() caller could violate assumptions in hibernate.

Non-obvious thorny logic:
General agreement it would be good to separate dependencies.

Behavior of V1 of this patchset:
No functional change other than addition of a warn in hibernate.

So "does this fix the problem", "yes" leaves me a bit confused... Not saying there couldn't be any problems, especially due to the thorniness and cross arch stride, but what is it exactly and how does this series fix it?
On Wed, Oct 28, 2020 at 09:03:31PM +0000, Edgecombe, Rick P wrote: > > On Wed, Oct 28, 2020 at 11:20:12AM +0000, Will Deacon wrote: > > > On Tue, Oct 27, 2020 at 10:38:16AM +0200, Mike Rapoport wrote: > > > > > > > > This is a theoretical bug, but it is still not nice :) > > > > > > > > > > Just to clarify: this patch series fixes this problem, right? > > > > Yes. > > > > Well, now I'm confused again. > > As David pointed, __vunmap() should not be executing simultaneously > with the hibernate operation because hibernate can't snapshot while > data it needs to save is still updating. If a thread was paused when a > page was in an "invalid" state, it should be remapped by hibernate > before the copy. > > To level set, before reading this mail, my takeaways from the > discussions on potential hibernate/debug page alloc problems were: > > Potential RISC-V issue: > Doesn't have hibernate support > > Potential ARM issue: > The logic around when it's cpa determines pages might be unmapped looks > correct for current callers. > > Potential x86 page break issue: > Seems to be ok for now, but a new set_memory_np() caller could violate > assumptions in hibernate. > > Non-obvious thorny logic: > General agreement it would be good to separate dependencies. > > Behavior of V1 of this patchset: > No functional change other than addition of a warn in hibernate.

There is a change that adds explicit use of set_direct_map() to hibernate. Currently, in case of arm64 with DEBUG_PAGEALLOC=n, if a thread was paused when a page was in an "invalid" state, hibernate will access unmapped data because __kernel_map_pages() will bail out. After the change set_direct_map_default_noflush() would be used and the page will get mapped before copy.

> So "does this fix the problem", "yes" leaves me a bit confused... Not saying there couldn't be any problems, especially due to the thorniness and cross arch stride, but what is it exactly and how does this series fix it?

This series' goal was primarily to separate dependencies and make it clearer what DEBUG_PAGEALLOC and what SET_DIRECT_MAP are. As it turned out, there is also some lack of consistency between architectures that implement either of these, so I tried to improve this as well.

Honestly, I don't know if a thread can be paused at the time __vunmap() left invalid pages, but it could, and there is an issue on arm64 with DEBUG_PAGEALLOC=n; this set fixes it.

__vunmap()
    vm_remove_mappings()
        set_direct_map_invalid()
                                    /* thread is frozen */
                                    safe_copy_page()
                                        __kernel_map_pages()
                                            if (!debug_pagealloc())
                                                return
                                        do_copy_page() -> fault
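A minimal sketch of the behaviour Mike describes after the series, assuming the names mentioned in this thread and in the cover letter (a hibernate-side wrapper, set_direct_map_default_noflush(), and the new debug_pagealloc_map_pages() helper, whose exact signature is assumed here); the actual patch may structure this differently:

```c
#include <linux/mm.h>
#include <linux/printk.h>
#include <linux/set_memory.h>

/*
 * Sketch only, not the literal patch: on ARCH_HAS_SET_DIRECT_MAP
 * architectures the page is re-mapped explicitly, so a frozen
 * __vunmap() can no longer leave hibernation copying from an
 * unmapped page; with DEBUG_PAGEALLOC the old behaviour is kept.
 */
static void sketch_hibernate_map_page(struct page *page)
{
	if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
		if (set_direct_map_default_noflush(page))
			pr_warn_once("hibernation: failed to remap page in the direct map\n");
	} else {
		debug_pagealloc_map_pages(page, 1, 1);	/* assumed signature */
	}
}
```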
On 25.10.20 11:15, Mike Rapoport wrote: > From: Mike Rapoport <rppt@linux.ibm.com> > > Hi, > > During recent discussion about KVM protected memory, David raised a concern > about usage of __kernel_map_pages() outside of DEBUG_PAGEALLOC scope [1]. > > Indeed, for architectures that define CONFIG_ARCH_HAS_SET_DIRECT_MAP it is > possible that __kernel_map_pages() would fail, but since this function is > void, the failure will go unnoticed. > > Moreover, there's lack of consistency of __kernel_map_pages() semantics > across architectures as some guard this function with > #ifdef DEBUG_PAGEALLOC, some refuse to update the direct map if page > allocation debugging is disabled at run time and some allow modifying the > direct map regardless of DEBUG_PAGEALLOC settings. > > This set straightens this out by restoring dependency of > __kernel_map_pages() on DEBUG_PAGEALLOC and updating the call sites > accordingly. > So, I was primarily wondering if we really have to touch direct mappings in hibernation code, or if we can avoid doing that. I was wondering if we cannot simply do something like kmap() when trying to access a !mapped page. Similar to reading old-os memory after kexec when in kdump. Just a thought.
On Thu, 2020-10-29 at 10:12 +0200, Mike Rapoport wrote: > This series goal was primarily to separate dependincies and make it > clearer what DEBUG_PAGEALLOC and what SET_DIRECT_MAP are. As it > turned > out, there is also some lack of consistency between architectures > that > implement either of this so I tried to improve this as well. > > Honestly, I don't know if a thread can be paused at the time > __vunmap() > left invalid pages, but it could, there is an issue on arm64 with > DEBUG_PAGEALLOC=n and this set fixes it. Ah, ok. So from this and the other thread, this is about the logic in arm's cpa for when it will try the un/map operations. I think the logic actually works currently. And this series introduces a problem on ARM similar to the one you are saying preexists. I put the details in the other thread.
From: Mike Rapoport <rppt@linux.ibm.com>

Hi,

During recent discussion about KVM protected memory, David raised a concern about usage of __kernel_map_pages() outside of DEBUG_PAGEALLOC scope [1].

Indeed, for architectures that define CONFIG_ARCH_HAS_SET_DIRECT_MAP it is possible that __kernel_map_pages() would fail, but since this function is void, the failure will go unnoticed.

Moreover, there's lack of consistency of __kernel_map_pages() semantics across architectures as some guard this function with #ifdef DEBUG_PAGEALLOC, some refuse to update the direct map if page allocation debugging is disabled at run time and some allow modifying the direct map regardless of DEBUG_PAGEALLOC settings.

This set straightens this out by restoring dependency of __kernel_map_pages() on DEBUG_PAGEALLOC and updating the call sites accordingly.

[1] https://lore.kernel.org/lkml/2759b4bf-e1e3-d006-7d86-78a40348269d@redhat.com

Mike Rapoport (4):
  mm: introduce debug_pagealloc_map_pages() helper
  PM: hibernate: improve robustness of mapping pages in the direct map
  arch, mm: restore dependency of __kernel_map_pages() of DEBUG_PAGEALLOC
  arch, mm: make kernel_page_present() always available

 arch/Kconfig | 3 +++
 arch/arm64/Kconfig | 4 +---
 arch/arm64/include/asm/cacheflush.h | 1 +
 arch/arm64/mm/pageattr.c | 6 +++--
 arch/powerpc/Kconfig | 5 +----
 arch/riscv/Kconfig | 4 +---
 arch/riscv/include/asm/pgtable.h | 2 --
 arch/riscv/include/asm/set_memory.h | 1 +
 arch/riscv/mm/pageattr.c | 31 +++++++++++++++++++++++++
 arch/s390/Kconfig | 4 +---
 arch/sparc/Kconfig | 4 +---
 arch/x86/Kconfig | 4 +---
 arch/x86/include/asm/set_memory.h | 1 +
 arch/x86/mm/pat/set_memory.c | 4 ++--
 include/linux/mm.h | 35 +++++++++++++----------------
 include/linux/set_memory.h | 5 +++++
 kernel/power/snapshot.c | 24 ++++++++++++++++++--
 mm/memory_hotplug.c | 3 +--
 mm/page_alloc.c | 6 ++---
 mm/slab.c | 8 +++----
 20 files changed, 97 insertions(+), 58 deletions(-)
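Patch 1 itself is not shown in this archive view; judging only from its title and the cover letter, one plausible shape for the new helper (the real signature and placement may differ) is a thin wrapper that keeps the DEBUG_PAGEALLOC run-time check in a single place, so that callers such as the page allocator stop calling __kernel_map_pages() directly and __kernel_map_pages() can again depend on DEBUG_PAGEALLOC:

```c
#include <linux/mm.h>

/* Guessed from the patch title only; the real helper may differ. */
static inline void debug_pagealloc_map_pages(struct page *page,
					     int numpages, int enable)
{
	if (debug_pagealloc_enabled_static())
		__kernel_map_pages(page, numpages, enable);
}
```

A call site such as the page allocator would then presumably use debug_pagealloc_map_pages(page, 1 << order, 0) when freeing pages and the same call with 1 when allocating them.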