Message ID | 20250212141648.599661-1-chenhuacai@loongson.cn (mailing list archive)
---|---
State | New
Series | mm/slab: Initialise random_kmalloc_seed after initcalls
On Wed, Feb 12, 2025 at 11:17 PM Huacai Chen <chenhuacai@loongson.cn> wrote:
>
> Hibernation assumes the memory layout after resume be the same as that
> before sleep, but CONFIG_RANDOM_KMALLOC_CACHES breaks this assumption.

[Let's also Cc the SLAB ALLOCATOR folks from the MAINTAINERS file]

Could you please elaborate on what you mean by hibernation assuming
'the memory layout' after resume is the same as that before sleep?

I don't understand how updating random_kmalloc_seed breaks resuming from
hibernation. Changing random_kmalloc_seed affects which kmalloc caches
newly allocated objects come from, but it should not affect objects that
were already allocated (before hibernation).

> At least on LoongArch and ARM64 we observed failures of resuming from
> hibernation (on LoongArch non-boot CPUs fail to bringup, on ARM64 some
> devices are unusable).

Did you have any chance to reproduce it on x86_64?

> software_resume_initcall(), the function which resume the target kernel
> is a initcall function. So, move the random_kmalloc_seed initialisation
> after all initcalls.
>
> Cc: stable@vger.kernel.org
> Fixes: 3c6152940584290668 ("Randomized slab caches for kmalloc()")
> Reported-by: Yuli Wang <wangyuli@uniontech.com>
> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
> ---
>  init/main.c      | 3 +++
>  mm/slab_common.c | 3 ---
>  2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/init/main.c b/init/main.c
> index 2a1757826397..1362957bdbe4 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -1458,6 +1458,9 @@ static int __ref kernel_init(void *unused)
>  	/* need to finish all async __init code before freeing the memory */
>  	async_synchronize_full();
>
> +#ifdef CONFIG_RANDOM_KMALLOC_CACHES
> +	random_kmalloc_seed = get_random_u64();
> +#endif

It doesn't seem appropriate to put slab code in kernel_init().

Additionally, it introduces a dependency that the code must be executed
after all late_initcalls, which sounds like introducing yet another
type of initcall.

>  	system_state = SYSTEM_FREEING_INITMEM;
>  	kprobe_free_init_mem();
>  	ftrace_free_init_mem();
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 4030907b6b7d..23e324aee218 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -971,9 +971,6 @@ void __init create_kmalloc_caches(void)
>  		for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++)
>  			new_kmalloc_cache(i, type);
>  	}
> -#ifdef CONFIG_RANDOM_KMALLOC_CACHES
> -	random_kmalloc_seed = get_random_u64();
> -#endif

I have no idea how hibernation and resume work, but let me ask here:
Can we simply skip or defer updating random_kmalloc_seed when the system is
resuming from hibernation? (probably system_state represents this?)

>  	/* Kmalloc array is now usable */
>  	slab_state = UP;

--
Harry
Hi, Harry,

On Wed, Feb 12, 2025 at 11:39 PM Harry (Hyeonggon) Yoo
<42.hyeyoo@gmail.com> wrote:
>
> On Wed, Feb 12, 2025 at 11:17 PM Huacai Chen <chenhuacai@loongson.cn> wrote:
> >
> > Hibernation assumes the memory layout after resume be the same as that
> > before sleep, but CONFIG_RANDOM_KMALLOC_CACHES breaks this assumption.
>
> [Let's also Cc the SLAB ALLOCATOR folks from the MAINTAINERS file]
>
> Could you please elaborate on what you mean by hibernation assuming
> 'the memory layout' after resume is the same as that before sleep?
>
> I don't understand how updating random_kmalloc_seed breaks resuming from
> hibernation. Changing random_kmalloc_seed affects which kmalloc caches
> newly allocated objects come from, but it should not affect objects that
> were already allocated (before hibernation).

When resuming, the booting kernel has to switch to the target kernel.
If the memory holding the switch code (in the booting kernel) is live
data of the target kernel, the switch code may be overwritten.

For LoongArch there is an additional problem: regular kernel functions
use absolute addresses to call exception handlers, which means the
exception handlers must be at the same address in the booting kernel
and the target kernel.

> > At least on LoongArch and ARM64 we observed failures of resuming from
> > hibernation (on LoongArch non-boot CPUs fail to bringup, on ARM64 some
> > devices are unusable).
>
> Did you have any chance to reproduce it on x86_64?

I haven't reproduced it on x86_64, but I have heard that x86_32 has
problems.

> > software_resume_initcall(), the function which resume the target kernel
> > is a initcall function. So, move the random_kmalloc_seed initialisation
> > after all initcalls.

[...]

> > +#ifdef CONFIG_RANDOM_KMALLOC_CACHES
> > +	random_kmalloc_seed = get_random_u64();
> > +#endif
>
> It doesn't seem appropriate to put slab code in kernel_init().
>
> Additionally, it introduces a dependency that the code must be executed
> after all late_initcalls, which sounds like introducing yet another
> type of initcall.

What about introducing a function in slab_common.c to initialize the
kmalloc seed, and then calling it from kernel_init()? I don't have a
better solution than this.

[...]

> I have no idea how hibernation and resume work, but let me ask here:
> Can we simply skip or defer updating random_kmalloc_seed when the system is
> resuming from hibernation? (probably system_state represents this?)

Do you mean something like below? It does work (it is my original
solution), but this patch is simpler.

diff --git a/include/linux/slab.h b/include/linux/slab.h
index b35e2db7eb0e..42fb91650b13 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -614,14 +614,20 @@ static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags, unsigne
 	 * The most common case is KMALLOC_NORMAL, so test for it
 	 * with a single branch for all the relevant flags.
 	 */
-	if (likely((flags & KMALLOC_NOT_NORMAL_BITS) == 0))
+	if (likely((flags & KMALLOC_NOT_NORMAL_BITS) == 0)) {
 #ifdef CONFIG_RANDOM_KMALLOC_CACHES
+		unsigned long random_seed = 0;
+
+		if (system_state > SYSTEM_SCHEDULING)
+			random_seed = random_kmalloc_seed;
+
 		/* RANDOM_KMALLOC_CACHES_NR (=15) copies + the KMALLOC_NORMAL */
-		return KMALLOC_RANDOM_START + hash_64(caller ^ random_kmalloc_seed,
+		return KMALLOC_RANDOM_START + hash_64(caller ^ random_seed,
 			ilog2(RANDOM_KMALLOC_CACHES_NR + 1));
 #else
 		return KMALLOC_NORMAL;
 #endif
+	}


Huacai

> >  	/* Kmalloc array is now usable */
> >  	slab_state = UP;
>
> --
> Harry
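The gating in the diff above can be modelled in a few lines of userspace C. This is a sketch only: the enum values mirror the kernel's system_state stages, while the seed value and the printing are made up for illustration.

```c
/* Userspace model of the quoted workaround -- not kernel code. Until
 * boot has progressed far enough, the effective seed is 0, so early
 * allocations (e.g. idle task stacks) land in predictable caches. */
#include <stdio.h>
#include <stdint.h>

enum system_state { SYSTEM_BOOTING, SYSTEM_SCHEDULING, SYSTEM_RUNNING };

static enum system_state system_state = SYSTEM_BOOTING;
static uint64_t random_kmalloc_seed = 0xdeadbeefcafef00dULL; /* example value */

static uint64_t effective_seed(void)
{
	/* Fall back to a fixed seed of 0 during early boot. */
	return system_state > SYSTEM_SCHEDULING ? random_kmalloc_seed : 0;
}

int main(void)
{
	printf("effective seed during early boot: %#jx\n",
	       (uintmax_t)effective_seed());
	system_state = SYSTEM_RUNNING;	/* late boot: randomization kicks in */
	printf("effective seed after boot:        %#jx\n",
	       (uintmax_t)effective_seed());
	return 0;
}
```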
On Thu, Feb 13, 2025 at 11:20:22AM +0800, Huacai Chen wrote:
> Hi, Harry,
>
> On Wed, Feb 12, 2025 at 11:39 PM Harry (Hyeonggon) Yoo
> <42.hyeyoo@gmail.com> wrote:
> >
> > Could you please elaborate on what you mean by hibernation assuming
> > 'the memory layout' after resume is the same as that before sleep?
> >
> > I don't understand how updating random_kmalloc_seed breaks resuming from
> > hibernation. Changing random_kmalloc_seed affects which kmalloc caches
> > newly allocated objects come from, but it should not affect objects that
> > were already allocated (before hibernation).
>
> When resuming, the booting kernel has to switch to the target kernel.
> If the memory holding the switch code (in the booting kernel) is live
> data of the target kernel, the switch code may be overwritten.

Hmm... I'm still missing some pieces.
How is the kernel binary overwritten when slab allocations are randomized?

Also, I'm not sure if it's even safe to assume that the memory layout is the
same across boots. But I'm not an expert on swsusp anyway...

It'd be really helpful for linux-pm folks to clarify 1) what the
(architecture-independent) assumptions are for swsusp to work, and
2) how architectures dealt with other randomization features like kASLR...

> For LoongArch there is an additional problem: regular kernel functions
> use absolute addresses to call exception handlers, which means the
> exception handlers must be at the same address in the booting kernel
> and the target kernel.
On Fri, Feb 14, 2025 at 5:33 PM Harry (Hyeonggon) Yoo
<42.hyeyoo@gmail.com> wrote:
>
> On Thu, Feb 13, 2025 at 11:20:22AM +0800, Huacai Chen wrote:
> > When resuming, the booting kernel has to switch to the target kernel.
> > If the memory holding the switch code (in the booting kernel) is live
> > data of the target kernel, the switch code may be overwritten.
>
> Hmm... I'm still missing some pieces.
> How is the kernel binary overwritten when slab allocations are randomized?
>
> Also, I'm not sure if it's even safe to assume that the memory layout is the
> same across boots. But I'm not an expert on swsusp anyway...
>
> It'd be really helpful for linux-pm folks to clarify 1) what the
> (architecture-independent) assumptions are for swsusp to work, and
> 2) how architectures dealt with other randomization features like kASLR...

I'm sorry to confuse you. Binary overwriting is indeed caused by
kASLR, so at least on LoongArch we should disable kASLR for
hibernation.

Random kmalloc is another story; on LoongArch it breaks smpboot when
resuming. The details are:
1. LoongArch uses the kmalloc() family to allocate the idle_task's
stack/thread_info and other data structures.
2. If random kmalloc is enabled, the idle_task's stack in the booting
kernel may be other things in the target kernel.
3. When CPU0 executes the switch code, other CPUs are executing
idle_task, and their stacks may be corrupted by the switch code.

So in experiments we can fix hibernation just by moving the
random_kmalloc_seed initialization after smp_init(). But obviously,
moving it after all initcalls is harmless and safer.


Huacai

> > For LoongArch there is an additional problem: regular kernel functions
> > use absolute addresses to call exception handlers, which means the
> > exception handlers must be at the same address in the booting kernel
> > and the target kernel.
>
> --
> Harry
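The failure mode Huacai describes can be illustrated with a small userspace model of the cache selection quoted earlier from kmalloc_type(). The hash_64() helper mirrors the kernel's 64-bit hash; the caller address and both seeds below are arbitrary example values, not real kernel state.

```c
/* Illustrative userspace model -- not kernel code. GOLDEN_RATIO_64 and
 * the hash match include/linux/hash.h; everything else is made up. */
#include <stdio.h>
#include <stdint.h>

#define GOLDEN_RATIO_64 0x61c8864680b583ebULL
#define RANDOM_KMALLOC_CACHES_NR 15	/* 15 extra copies + KMALLOC_NORMAL */

static unsigned int hash_64(uint64_t val, unsigned int bits)
{
	return (unsigned int)((val * GOLDEN_RATIO_64) >> (64 - bits));
}

int main(void)
{
	uint64_t caller = 0x9000000001234560ULL;	/* hypothetical _RET_IP_ */
	uint64_t seed_target  = 0x1122334455667788ULL;	/* hibernated kernel's seed */
	uint64_t seed_booting = 0x8899aabbccddeeffULL;	/* booting kernel's seed */
	unsigned int bits = 4;	/* ilog2(RANDOM_KMALLOC_CACHES_NR + 1) */

	/* The same allocation site picks different caches across boots, so
	 * an idle task stack in one boot can overlap unrelated slab objects
	 * of the other. */
	printf("cache index in target kernel:  %u\n",
	       hash_64(caller ^ seed_target, bits));
	printf("cache index in booting kernel: %u\n",
	       hash_64(caller ^ seed_booting, bits));
	return 0;
}
```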
On Fri, Feb 14, 2025 at 06:02:52PM +0800, Huacai Chen wrote:
> On Fri, Feb 14, 2025 at 5:33 PM Harry (Hyeonggon) Yoo
> <42.hyeyoo@gmail.com> wrote:
> >
> > [...]
> >
> > It'd be really helpful for linux-pm folks to clarify 1) what the
> > (architecture-independent) assumptions are for swsusp to work, and
> > 2) how architectures dealt with other randomization features like kASLR...

[+Cc a few more people who worked on slab hardening]

> I'm sorry to confuse you. Binary overwriting is indeed caused by
> kASLR, so at least on LoongArch we should disable kASLR for
> hibernation.

Understood.

> Random kmalloc is another story; on LoongArch it breaks smpboot when
> resuming. The details are:
> 1. LoongArch uses the kmalloc() family to allocate the idle_task's
> stack/thread_info and other data structures.
> 2. If random kmalloc is enabled, the idle_task's stack in the booting
> kernel may be other things in the target kernel.

Slab hardening features try so hard to prevent such predictability.
For example, SLAB_FREELIST_RANDOM could also randomize the address
kmalloc objects are allocated at.

Rather than hacking CONFIG_RANDOM_KMALLOC_CACHES like this, we could
have a single option to disable slab hardening features that make
the address unpredictable.

It'd be nice to have something like ARCH_SUPPORTS_SLAB_RANDOM which
some hardening features depend on. And then let some arches conditionally
not select ARCH_SUPPORTS_SLAB_RANDOM if hibernation is enabled
(at the cost of less hardening)?
On Fri, Feb 14, 2025 at 8:45 PM Harry (Hyeonggon) Yoo
<42.hyeyoo@gmail.com> wrote:
>
> [...]
>
> Slab hardening features try so hard to prevent such predictability.
> For example, SLAB_FREELIST_RANDOM could also randomize the address
> kmalloc objects are allocated at.
>
> Rather than hacking CONFIG_RANDOM_KMALLOC_CACHES like this, we could
> have a single option to disable slab hardening features that make
> the address unpredictable.
>
> It'd be nice to have something like ARCH_SUPPORTS_SLAB_RANDOM which
> some hardening features depend on. And then let some arches conditionally
> not select ARCH_SUPPORTS_SLAB_RANDOM if hibernation is enabled
> (at the cost of less hardening)?

This is not good: my patch doesn't disable RANDOM for hibernation, it
just delays the initialization. When the system is running, all
randomization is still usable.

For SLAB_FREELIST_RANDOM, I found that it doesn't break hibernation
(at least on LoongArch). The reason is:
1. When I said "data overwritten" before, it doesn't mean that every
byte mustn't be overwritten; only some important parts matter.
2. On LoongArch, the important parts include: the switch code, the
exception handlers, and the idle_task's stack/thread_info.
3. The switch code and exception handlers are protected by automatically
disabling kASLR from arch-specific code; the idle_task's
stack/thread_info is protected by delaying the random seed (this patch).

Why doesn't SLAB_FREELIST_RANDOM corrupt the idle_task's
stack/thread_info? Because the scope of randomization of
SLAB_FREELIST_RANDOM is significantly smaller than that of
RANDOM_KMALLOC_CACHES. When RANDOM_KMALLOC_CACHES is enabled,
CPU1's idle task stack from the booting kernel may be CPU2's idle task
stack from the target kernel, and CPU2's idle task stack from the
booting kernel may be CPU1's idle task stack from the target kernel,
but an idle task's stack from the booting kernel won't be other things
from the target kernel (and won't be overwritten by switching kernels).


Huacai
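A conceptual sketch of the scope difference claimed above, assuming SLAB_FREELIST_RANDOM only shuffles the allocation order within one cache's slabs. The shuffle below is illustrative; it is not the kernel's actual implementation, and the object count is an assumption.

```c
/* Illustrative userspace model -- not kernel code. Freelist
 * randomization permutes which slot inside a slab an allocation gets,
 * so an object stays within its own cache's pages; only the order
 * changes. RANDOM_KMALLOC_CACHES, by contrast, can move the same
 * allocation into a completely different cache (see the earlier
 * hash_64() example). */
#include <stdio.h>
#include <stdlib.h>

#define OBJS_PER_SLAB 8	/* assumption: 8 objects in one slab */

int main(void)
{
	unsigned int slot[OBJS_PER_SLAB];
	unsigned int i;

	srand(42);	/* stand-in for the kernel's per-cache random state */
	for (i = 0; i < OBJS_PER_SLAB; i++)
		slot[i] = i;

	/* Fisher-Yates shuffle of the slab's free slots */
	for (i = OBJS_PER_SLAB - 1; i > 0; i--) {
		unsigned int j = rand() % (i + 1);
		unsigned int tmp = slot[i];
		slot[i] = slot[j];
		slot[j] = tmp;
	}

	for (i = 0; i < OBJS_PER_SLAB; i++)
		printf("allocation %u -> slot %u of the same slab\n", i, slot[i]);
	return 0;
}
```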
On Sat, Feb 15, 2025 at 05:53:29PM +0800, Huacai Chen wrote:
> On Fri, Feb 14, 2025 at 8:45 PM Harry (Hyeonggon) Yoo
> <42.hyeyoo@gmail.com> wrote:
> >
> > [...]
> >
> > It'd be nice to have something like ARCH_SUPPORTS_SLAB_RANDOM which
> > some hardening features depend on. And then let some arches conditionally
> > not select ARCH_SUPPORTS_SLAB_RANDOM if hibernation is enabled
> > (at the cost of less hardening)?
>
> This is not good: my patch doesn't disable RANDOM for hibernation, it
> just delays the initialization. When the system is running, all
> randomization is still usable.

I think at least we need a rule (like ARCH_SUPPORTS_SLAB_RANDOM)
for slab hardening features that prevents breaking hibernation
in the future. Without rules, introducing new hardening features could
break hibernation again.

But I'm not yet convinced it's worth the complexity of hacking slab
hardening features (for security) just to make hibernation work on
some arches, which have already disabled kASLR anyway...

> For SLAB_FREELIST_RANDOM, I found that it doesn't break hibernation
> (at least on LoongArch). The reason is:
> 1. When I said "data overwritten" before, it doesn't mean that every
> byte mustn't be overwritten; only some important parts matter.
> 2. On LoongArch, the important parts include: the switch code, the
> exception handlers, and the idle_task's stack/thread_info.
> 3. The switch code and exception handlers are protected by automatically
> disabling kASLR from arch-specific code; the idle_task's
> stack/thread_info is protected by delaying the random seed (this patch).
>
> Why doesn't SLAB_FREELIST_RANDOM corrupt the idle_task's
> stack/thread_info? Because the scope of randomization of
> SLAB_FREELIST_RANDOM is significantly smaller than that of
> RANDOM_KMALLOC_CACHES. When RANDOM_KMALLOC_CACHES is enabled,

You mean when SLAB_FREELIST_RANDOM is enabled?
Assuming that...

> CPU1's idle task stack from the booting kernel may be CPU2's idle task
> stack from the target kernel, and CPU2's idle task stack from the
> booting kernel may be CPU1's idle task stack from the target kernel,

What happens if that's not the case?

> but an idle task's stack from the booting kernel won't be other things
> from the target kernel (and won't be overwritten by switching kernels).

What guarantees that it won't be overwritten?
To me it seems to be a fragile assumption that could be broken.

Am I missing something?
On Sat, Feb 15, 2025 at 10:05 PM Harry (Hyeonggon) Yoo
<42.hyeyoo@gmail.com> wrote:
>
> [...]
>
> I think at least we need a rule (like ARCH_SUPPORTS_SLAB_RANDOM)
> for slab hardening features that prevents breaking hibernation
> in the future. Without rules, introducing new hardening features could
> break hibernation again.

I don't think so. Even on LoongArch, hibernation and the various RANDOM
infrastructure can co-exist (with some adjustment). Even though
hibernation cannot be used together with kASLR, we don't disable kASLR
at build time but at run time. This means we can choose hibernation or
kASLR with a single kernel image.

> But I'm not yet convinced it's worth the complexity of hacking slab
> hardening features (for security) just to make hibernation work on
> some arches, which have already disabled kASLR anyway...

I don't think this patch is an ugly hack; it is a proper adjustment.
Random kmalloc can still work with hibernation. Delaying the seed
initialization only means that random kmalloc doesn't work (it falls
back to predictable kmalloc) for a very short time during the boot
phase. And again, kASLR is disabled at run time rather than build time.

> > For SLAB_FREELIST_RANDOM, I found that it doesn't break hibernation
> > (at least on LoongArch). [...]
> > When RANDOM_KMALLOC_CACHES is enabled,
>
> You mean when SLAB_FREELIST_RANDOM is enabled?
> Assuming that...

Yes.

> > CPU1's idle task stack from the booting kernel may be CPU2's idle task
> > stack from the target kernel, and CPU2's idle task stack from the
> > booting kernel may be CPU1's idle task stack from the target kernel,
>
> What happens if that's not the case?

SLAB means "objects with the same type", right? So it is probably the
case. Yes, there is a very, very low possibility that it is not the
case, but... In theory x86_64 also has a low possibility that the idle
task's stack or other metadata gets overwritten, so should we also
disable random kmalloc for x86_64?

On the other hand, if we really need to handle this theoretical
possibility with SLAB_FREELIST_RANDOM now, we can simply move
init_freelist_randomization() after all initcalls, too.


Huacai

> > but an idle task's stack from the booting kernel won't be other things
> > from the target kernel (and won't be overwritten by switching kernels).
>
> What guarantees that it won't be overwritten?
> To me it seems to be a fragile assumption that could be broken.
>
> Am I missing something?
>
> --
> Harry
Hi all,

This patch helps resolve a wake-up issue after hibernation on Lenovo
M90f desktops (with Phytium D3000 processor). The kernel stack trace
when reproducing this issue is as follows:

[   67.503075] task:jbd2/nvme0n1p4- state:D stack:0 pid:363 ppid:2 flags:0x00000008
[   67.511418] Call trace:
[   67.513852]  __switch_to+0xe4/0x128
[   67.517333]  __schedule+0x214/0x588
[   67.520809]  schedule+0x5c/0xc8
[   67.523938]  __bio_queue_enter+0x144/0x240
[   67.528024]  blk_mq_submit_bio+0x2b4/0x608
[   67.532109]  __submit_bio+0x90/0x150
[   67.535672]  submit_bio_noacct_nocheck+0x104/0x1c0
[   67.540451]  submit_bio_noacct+0x12c/0x548
[   67.544535]  submit_bio+0x4c/0xd8
[   67.547838]  submit_bh_wbc+0x148/0x1d0
[   67.551576]  write_dirty_buffer+0x7c/0x158
[   67.555660]  flush_descriptor.part.0+0x6c/0xd8
[   67.560093]  jbd2_journal_write_revoke_records+0x130/0x1a8
[   67.565567]  jbd2_journal_commit_transaction+0x380/0x18a0
[   67.570954]  kjournald2+0xe4/0x370
[   67.574344]  kthread+0xe4/0xf0
[   67.577388]  ret_from_fork+0x10/0x20
[   67.580952] task:systemd-journal state:D stack:0 pid:449 ppid:1 flags:0x00000804
[   67.589290] Call trace:
[   67.591724]  __switch_to+0xe4/0x128
[   67.595200]  __schedule+0x214/0x588
[   67.598677]  schedule+0x5c/0xc8
[   67.601806]  io_schedule+0x48/0x78
[   67.605195]  bit_wait_io+0x24/0x90
[   67.608585]  __wait_on_bit+0x7c/0xe0
[   67.612148]  out_of_line_wait_on_bit+0x94/0xd0
[   67.616580]  do_get_write_access+0x31c/0x508
[   67.620838]  jbd2_journal_get_write_access+0x94/0xe8
[   67.625791]  __ext4_journal_get_write_access+0x7c/0x1c0
[   67.631004]  ext4_reserve_inode_write+0xb4/0x118
[   67.635609]  __ext4_mark_inode_dirty+0x74/0x280
[   67.640127]  ext4_dirty_inode+0x70/0xa0
[   67.643951]  __mark_inode_dirty+0x60/0x428
[   67.648035]  generic_update_time+0x58/0x78
[   67.652119]  file_update_time+0xb4/0xc0
[   67.655942]  ext4_page_mkwrite+0xb4/0x5d8
[   67.659939]  do_page_mkwrite+0x64/0xf0
[   67.663677]  do_wp_page+0x198/0x4a8
[   67.667153]  __handle_mm_fault+0x438/0x450
[   67.671237]  handle_mm_fault+0xac/0x288
[   67.675061]  do_page_fault+0x278/0x4a8
[   67.678799]  do_mem_abort+0x50/0xb0
[   67.682276]  el0_da+0x3c/0xe8
[   67.685233]  el0t_64_sync_handler+0xc0/0x138
[   67.689492]  el0t_64_sync+0x1ac/0x1b0
[   67.693142] task:jbd2/nvme0n1p5- state:D stack:0 pid:582 ppid:2 flags:0x00000008
[   67.701480] Call trace:
[   67.703914]  __switch_to+0xe4/0x128
[   67.707390]  __schedule+0x214/0x588
[   67.710866]  schedule+0x5c/0xc8
[   67.713996]  __bio_queue_enter+0x144/0x240
[   67.718080]  blk_mq_submit_bio+0x2b4/0x608
[   67.722164]  __submit_bio+0x90/0x150
[   67.725728]  submit_bio_noacct_nocheck+0x104/0x1c0
[   67.730507]  submit_bio_noacct+0x12c/0x548
[   67.734592]  submit_bio+0x4c/0xd8
[   67.737895]  submit_bh_wbc+0x148/0x1d0
[   67.741632]  submit_bh+0x20/0x38
[   67.744848]  jbd2_journal_commit_transaction+0x564/0x18a0
[   67.750234]  kjournald2+0xe4/0x370
[   67.753624]  kthread+0xe4/0xf0
[   67.756666]  ret_from_fork+0x10/0x20
[   67.760234] task:fprintd state:D stack:0 pid:630 ppid:1 flags:0x00000804
[   67.768572] Call trace:
[   67.771005]  __switch_to+0xe4/0x128
[   67.774482]  __schedule+0x214/0x588
[   67.777958]  schedule+0x5c/0xc8
[   67.781087]  __bio_queue_enter+0x144/0x240
[   67.785172]  blk_mq_submit_bio+0x2b4/0x608
[   67.789257]  __submit_bio+0x90/0x150
[   67.792820]  submit_bio_noacct_nocheck+0x104/0x1c0
[   67.797599]  submit_bio_noacct+0x12c/0x548
[   67.801684]  submit_bio+0x4c/0xd8
[   67.804987]  ext4_mpage_readpages+0x1a4/0x780
[   67.809331]  ext4_readahead+0x44/0x60
[   67.812982]  read_pages+0xa4/0x318
[   67.816372]  page_cache_ra_unbounded+0x164/0x1f8
[   67.820977]  page_cache_ra_order+0x94/0x360
[   67.825148]  filemap_fault+0x478/0xaa0
[   67.828885]  __do_fault+0x48/0x220
[   67.832274]  do_read_fault+0x108/0x1f0
[   67.836011]  do_pte_missing+0x15c/0x1e8
[   67.839835]  __handle_mm_fault+0x204/0x450
[   67.843919]  handle_mm_fault+0xac/0x288
[   67.847742]  do_page_fault+0x278/0x4a8
[   67.851480]  do_translation_fault+0x5c/0x90
[   67.855652]  do_mem_abort+0x50/0xb0
[   67.859128]  el0_ia+0x68/0x148
[   67.862171]  el0t_64_sync_handler+0xd8/0x138
[   67.866429]  el0t_64_sync+0x1ac/0x1b0
[   67.870092] task:deepin-anything state:D stack:0 pid:871 ppid:1 flags:0x00000004
[   67.878429] Call trace:
[   67.880862]  __switch_to+0xe4/0x128
[   67.884339]  __schedule+0x214/0x588
[   67.887815]  schedule+0x5c/0xc8
[   67.890944]  io_schedule+0x48/0x78
[   67.894333]  bit_wait_io+0x24/0x90
[   67.897723]  __wait_on_bit+0x7c/0xe0
[   67.901286]  out_of_line_wait_on_bit+0x94/0xd0
[   67.905717]  do_get_write_access+0x31c/0x508
[   67.909975]  jbd2_journal_get_write_access+0x94/0xe8
[   67.914928]  __ext4_journal_get_write_access+0x7c/0x1c0
[   67.920140]  ext4_reserve_inode_write+0xb4/0x118
[   67.924745]  __ext4_mark_inode_dirty+0x74/0x280
[   67.929263]  ext4_dirty_inode+0x70/0xa0
[   67.933086]  __mark_inode_dirty+0x60/0x428
[   67.937170]  generic_update_time+0x58/0x78
[   67.941254]  file_modified+0xe4/0xf0
[   67.944816]  ext4_buffered_write_iter+0x64/0x148
[   67.949422]  ext4_file_write_iter+0x44/0x90
[   67.953593]  vfs_write+0x218/0x3d0
[   67.956984]  ksys_write+0x80/0x128
[   67.960374]  __arm64_sys_write+0x28/0x40
[   67.964285]  invoke_syscall+0x54/0x138
[   67.968023]  el0_svc_common.constprop.0+0x4c/0x100
[   67.972803]  do_el0_svc+0x28/0x40
[   67.976106]  el0_svc+0x44/0x128
[   67.979236]  el0t_64_sync_handler+0x108/0x138
[   67.983581]  el0t_64_sync+0x1ac/0x1b0
[   67.987243] task:wireplumber state:D stack:0 pid:1792 ppid:1764 flags:0x00000804
[   67.995581] Call trace:
[   67.998014]  __switch_to+0xe4/0x128
[   68.001490]  __schedule+0x214/0x588
[   68.004966]  schedule+0x5c/0xc8
[   68.008096]  __bio_queue_enter+0x144/0x240
[   68.012180]  blk_mq_submit_bio+0x2b4/0x608
[   68.016264]  __submit_bio+0x90/0x150
[   68.019828]  submit_bio_noacct_nocheck+0x104/0x1c0
[   68.024607]  submit_bio_noacct+0x12c/0x548
[   68.028691]  submit_bio+0x4c/0xd8
[   68.031994]  ext4_io_submit+0x3c/0x70
[   68.035644]  ext4_do_writepages+0x41c/0x6b0
[   68.039814]  ext4_writepages+0xac/0x148
[   68.043638]  do_writepages+0x98/0x238
[   68.047287]  filemap_fdatawrite_wbc+0xa0/0x110
[   68.051719]  __filemap_fdatawrite_range+0x70/0xb0
[   68.056411]  file_write_and_wait_range+0x78/0xf8
[   68.061016]  ext4_sync_file+0x80/0x398
[   68.064753]  vfs_fsync_range+0x40/0x98
[   68.068491]  do_fsync+0x4c/0xa8
[   68.071621]  __arm64_sys_fsync+0x24/0x40
[   68.075532]  invoke_syscall+0x54/0x138
[   68.079269]  el0_svc_common.constprop.0+0xd4/0x100
[   68.084048]  do_el0_svc+0x28/0x40
[   68.087352]  el0_svc+0x44/0x128
[   68.090481]  el0t_64_sync_handler+0x108/0x138
[   68.094827]  el0t_64_sync+0x1ac/0x1b0
[   68.098481] task:startdde state:D stack:0 pid:2105 ppid:1764 flags:0x00000005
[   68.106819] Call trace:
[   68.109252]  __switch_to+0xe4/0x128
[   68.112729]  __schedule+0x214/0x588
[   68.116205]  schedule+0x5c/0xc8
[   68.119334]  __bio_queue_enter+0x144/0x240
[   68.123418]  blk_mq_submit_bio+0x2b4/0x608
[   68.127502]  __submit_bio+0x90/0x150
[   68.131065]  submit_bio_noacct_nocheck+0x104/0x1c0
[   68.135844]  submit_bio_noacct+0x12c/0x548
[   68.139928]  submit_bio+0x4c/0xd8
[   68.143232]  ext4_mpage_readpages+0x1a4/0x780
[   68.147576]  ext4_readahead+0x44/0x60
[   68.151226]  read_pages+0xa4/0x318
[   68.154616]  page_cache_ra_unbounded+0x130/0x1f8
[   68.159220]  page_cache_ra_order+0x94/0x360
[   68.163391]  filemap_fault+0x478/0xaa0
[   68.167128]  __do_fault+0x48/0x220
[   68.170518]  do_read_fault+0x108/0x1f0
[   68.174255]  do_pte_missing+0x15c/0x1e8
[   68.178078]  __handle_mm_fault+0x204/0x450
[   68.182162]  handle_mm_fault+0xac/0x288
[   68.185986]  do_page_fault+0x278/0x4a8
[   68.189724]  do_translation_fault+0x5c/0x90
[   68.193896]  do_mem_abort+0x50/0xb0
[   68.197372]  el0_da+0x3c/0xe8
[   68.200328]  el0t_64_sync_handler+0xc0/0x138
[   68.204586]  el0t_64_sync+0x1ac/0x1b0
[   68.208242] task:dde-file-manage state:D stack:0 pid:2202 ppid:1764 flags:0x00000004
[   68.216579] Call trace:
[   68.219012]  __switch_to+0xe4/0x128
[   68.222489]  __schedule+0x214/0x588
[   68.225965]  schedule+0x5c/0xc8
[   68.229094]  __bio_queue_enter+0x144/0x240
[   68.233178]  blk_mq_submit_bio+0x2b4/0x608
[   68.237263]  __submit_bio+0x90/0x150
[   68.240826]  submit_bio_noacct_nocheck+0x104/0x1c0
[   68.245605]  submit_bio_noacct+0x12c/0x548
[   68.249689]  submit_bio+0x4c/0xd8
[   68.252992]  ext4_io_submit+0x3c/0x70
[   68.256641]  ext4_do_writepages+0x41c/0x6b0
[   68.260812]  ext4_writepages+0xac/0x148
[   68.264635]  do_writepages+0x98/0x238
[   68.268285]  filemap_fdatawrite_wbc+0xa0/0x110
[   68.272716]  __filemap_fdatawrite_range+0x70/0xb0
[   68.277408]  filemap_flush+0x28/0x40
[   68.280971]  ext4_alloc_da_blocks+0x38/0x90
[   68.285141]  ext4_release_file+0x88/0xf0
[   68.289052]  __fput+0xe0/0x2a0
[   68.292096]  __fput_sync+0x5c/0x78
[   68.295486]  __arm64_sys_close+0x44/0x98
[   68.299397]  invoke_syscall+0x54/0x138
[   68.303135]  el0_svc_common.constprop.0+0x4c/0x100
[   68.307914]  do_el0_svc+0x28/0x40
[   68.311217]  el0_svc+0x44/0x128
[   68.314347]  el0t_64_sync_handler+0x108/0x138
[   68.318692]  el0t_64_sync+0x1ac/0x1b0
[   68.322347] task:dde-session-dae state:D stack:0 pid:2247 ppid:1764 flags:0x00000004
[   68.330684] Call trace:
[   68.333118]  __switch_to+0xe4/0x128
[   68.336594]  __schedule+0x214/0x588
[   68.340070]  schedule+0x5c/0xc8
[   68.343199]  __bio_queue_enter+0x144/0x240
[   68.347284]  blk_mq_submit_bio+0x2b4/0x608
[   68.351368]  __submit_bio+0x90/0x150
[   68.354932]  submit_bio_noacct_nocheck+0x104/0x1c0
[   68.359710]  submit_bio_noacct+0x12c/0x548
[   68.363794]  submit_bio+0x4c/0xd8
[   68.367097]  submit_bh_wbc+0x148/0x1d0
[   68.370833]  submit_bh+0x20/0x38
[   68.374049]  __ext4_read_bh+0x58/0xe8
[   68.377700]  ext4_read_bh+0x4c/0xf0
[   68.381176]  ext4_read_bh_lock+0x60/0xd8
[   68.385088]  ext4_bread+0x88/0xf8
[   68.388390]  __ext4_read_dirblock+0x68/0x3d8
[   68.392649]  htree_dirblock_to_tree+0x98/0x348
[   68.397081]  ext4_htree_fill_tree+0xe8/0x378
[   68.401339]  ext4_dx_readdir+0xf4/0x380
[   68.405162]  ext4_readdir+0x53c/0x6e8
[   68.408812]  iterate_dir+0xa4/0x190
[   68.412288]  __arm64_sys_getdents64+0x7c/0x170
[   68.416720]  invoke_syscall+0x54/0x138
[   68.420457]  el0_svc_common.constprop.0+0x4c/0x100
[   68.425236]  do_el0_svc+0x28/0x40
[   68.428540]  el0_svc+0x44/0x128
[   68.431669]  el0t_64_sync_handler+0x108/0x138
[   68.436014]  el0t_64_sync+0x1ac/0x1b0
[   68.439669] task:dde-lock state:D stack:0 pid:2259 ppid:1764 flags:0x00000004
[   68.448007] Call trace:
[   68.450440]  __switch_to+0xe4/0x128
[   68.453916]  __schedule+0x214/0x588
[   68.457392]  schedule+0x5c/0xc8
[   68.460522]  io_schedule+0x48/0x78
[   68.463911]  bit_wait_io+0x24/0x90
[   68.467300]  __wait_on_bit+0x7c/0xe0
[   68.470864]  out_of_line_wait_on_bit+0x94/0xd0
[   68.475295]  do_get_write_access+0x31c/0x508
[   68.479553]  jbd2_journal_get_write_access+0x94/0xe8
[   68.484505]  __ext4_journal_get_write_access+0x7c/0x1c0
[   68.489717]  ext4_reserve_inode_write+0xb4/0x118
[   68.494322]  __ext4_mark_inode_dirty+0x74/0x280
[   68.498840]  ext4_dirty_inode+0x70/0xa0
[   68.502664]  __mark_inode_dirty+0x60/0x428
[   68.506748]  generic_update_time+0x58/0x78
[   68.510831]  file_modified+0xe4/0xf0
[   68.514393]  ext4_buffered_write_iter+0x64/0x148
[   68.518998]  ext4_file_write_iter+0x44/0x90
[   68.523169]  vfs_write+0x218/0x3d0
[   68.526559]  ksys_write+0x80/0x128
[   68.529949]  __arm64_sys_write+0x28/0x40
[   68.533860]  invoke_syscall+0x54/0x138
[   68.537598]  el0_svc_common.constprop.0+0x4c/0x100
[   68.542377]  do_el0_svc+0x28/0x40
[   68.545680]  el0_svc+0x44/0x128
[   68.548810]  el0t_64_sync_handler+0x108/0x138
[   68.553155]  el0t_64_sync+0x1ac/0x1b0
[   68.556837] task:dde-blackwidget state:D stack:0 pid:3059 ppid:1764 flags:0x00000005
[   68.565174] Call trace:
[   68.567607]  __switch_to+0xe4/0x128
[   68.571084]  __schedule+0x214/0x588
[   68.574560]  schedule+0x5c/0xc8
[   68.577689]  __bio_queue_enter+0x144/0x240
[   68.581773]  blk_mq_submit_bio+0x2b4/0x608
[   68.585858]  __submit_bio+0x90/0x150
[   68.589421]  submit_bio_noacct_nocheck+0x104/0x1c0
[   68.594200]  submit_bio_noacct+0x12c/0x548
[   68.598284]  submit_bio+0x4c/0xd8
[   68.601587]  ext4_mpage_readpages+0x1a4/0x780
[   68.605931]  ext4_readahead+0x44/0x60
[   68.609582]  read_pages+0xa4/0x318
[   68.612972]  page_cache_ra_unbounded+0x164/0x1f8
[   68.617576]  page_cache_ra_order+0x94/0x360
[   68.621747]  filemap_fault+0x478/0xaa0
[   68.625484]  __do_fault+0x48/0x220
[   68.628873]  do_read_fault+0x108/0x1f0
[   68.632610]  do_pte_missing+0x15c/0x1e8
[   68.636434]  __handle_mm_fault+0x204/0x450
[   68.640518]  handle_mm_fault+0xac/0x288
[   68.644341]  do_page_fault+0x278/0x4a8
[   68.648078]  do_translation_fault+0x5c/0x90
[   68.652250]  do_mem_abort+0x50/0xb0
[   68.655726]  el0_ia+0x68/0x148
[   68.658769]  el0t_64_sync_handler+0xd8/0x138
[   68.663027]  el0t_64_sync+0x1ac/0x1b0
[   68.666677] task:kworker/u16:8 state:D stack:0 pid:3071 ppid:2 flags:0x00000008
[   68.675015] Workqueue: nvme-reset-wq nvme_reset_work
[   68.679969] Call trace:
[   68.682403]  __switch_to+0xe4/0x128
[   68.685880]  __schedule+0x214/0x588
[   68.689356]  schedule+0x5c/0xc8
[   68.692485]  schedule_timeout+0x180/0x198
[   68.696483]  io_schedule_timeout+0x54/0x78
[   68.700567]  __wait_for_common+0xd8/0x290
[   68.704565]  wait_for_completion_io+0x2c/0x48
[   68.708909]  blk_execute_rq+0xd4/0x178
[   68.712646]  __nvme_submit_sync_cmd+0xa4/0x180
[   68.717079]  nvme_identify_ctrl+0xdc/0x140
[   68.721164]  nvme_init_identify+0x4c/0x598
[   68.725248]  nvme_init_ctrl_finish+0x94/0x300
[   68.729592]  nvme_reset_work+0xa0/0x2e8
[   68.733415]  process_one_work+0x160/0x3c8
[   68.737413]  worker_thread+0x2e4/0x508
[   68.741150]  kthread+0xe4/0xf0
[   68.744193]  ret_from_fork+0x10/0x20
[   68.747764] task:kworker/u16:66 state:D stack:0 pid:3129 ppid:2 flags:0x00000008
[   68.756102] Workqueue: writeback wb_workfn (flush-259:0)
[   68.761403] Call trace:
[   68.763836]  __switch_to+0xe4/0x128
[   68.767313]  __schedule+0x214/0x588
[   68.770789]  schedule+0x5c/0xc8
[   68.773918]  __bio_queue_enter+0x144/0x240
[   68.778002]  blk_mq_submit_bio+0x2b4/0x608
[   68.782087]  __submit_bio+0x90/0x150
[   68.785650]  submit_bio_noacct_nocheck+0x104/0x1c0
[   68.790429]  submit_bio_noacct+0x12c/0x548
[   68.794513]  submit_bio+0x4c/0xd8
[   68.797817]  ext4_io_submit+0x3c/0x70
[   68.801466]  ext4_do_writepages+0x2ec/0x6b0
[   68.805637]  ext4_writepages+0xac/0x148
[   68.809460]  do_writepages+0x98/0x238
[   68.813110]  __writeback_single_inode+0x4c/0x408
[   68.817714]  writeback_sb_inodes+0x22c/0x5d8
[   68.821972]  __writeback_inodes_wb+0x6c/0x160
[   68.826316]  wb_writeback+0x2ec/0x3e0
[   68.829966]  wb_do_writeback+0x2cc/0x408
[   68.833877]  wb_workfn+0x88/0x2b8
[   68.837180]  process_one_work+0x160/0x3c8
[   68.841178]  worker_thread+0x2e4/0x508
[   68.844916]  kthread+0xe4/0xf0
[   68.847958]  ret_from_fork+0x10/0x20
[   68.851522] task:kworker/u16:69 state:D stack:0 pid:3132 ppid:2 flags:0x00000008
[   68.859859] Workqueue: writeback wb_workfn (flush-259:0)
[   68.865160] Call trace:
[   68.867593]  __switch_to+0xe4/0x128
[   68.871069]  __schedule+0x214/0x588
[   68.874545]  schedule+0x5c/0xc8
[   68.877674]  io_schedule+0x48/0x78
[   68.881064]  bit_wait_io+0x24/0x90
[   68.884453]  __wait_on_bit+0x7c/0xe0
[   68.888017]  out_of_line_wait_on_bit+0x94/0xd0
[   68.892448]  do_get_write_access+0x31c/0x508
[   68.896706]  jbd2_journal_get_write_access+0x94/0xe8
[   68.901659]  __ext4_journal_get_write_access+0x7c/0x1c0
[   68.906870]  ext4_mb_mark_diskspace_used+0x148/0x3e8
[   68.911823]  ext4_mb_new_blocks+0x198/0x7a0
[   68.915994]  ext4_ext_map_blocks+0x620/0x828
[   68.920252]  ext4_map_blocks+0x1b0/0x5b8
[   68.924162]  mpage_map_one_extent+0x80/0x1a0
[   68.928419]  mpage_map_and_submit_extent+0x8c/0x338
[   68.933284]  ext4_do_writepages+0x5bc/0x6b0
[   68.937454]  ext4_writepages+0xac/0x148
[   68.941278]  do_writepages+0x98/0x238
[   68.944928]  __writeback_single_inode+0x4c/0x408
[   68.949532]  writeback_sb_inodes+0x22c/0x5d8
[   68.953790]  __writeback_inodes_wb+0x6c/0x160
[   68.958134]  wb_writeback+0x2ec/0x3e0
[   68.961784]  wb_do_writeback+0x2cc/0x408
[   68.965694]  wb_workfn+0x88/0x2b8
[   68.968997]  process_one_work+0x160/0x3c8
[   68.972995]  worker_thread+0x2e4/0x508
[   68.976732]  kthread+0xe4/0xf0
[   68.979774]  ret_from_fork+0x10/0x20
[   68.983340] task:(oothd.sh) state:D stack:0 pid:3150 ppid:1 flags:0x00000804
[   68.991677] Call trace:
[   68.994110]  __switch_to+0xe4/0x128
[   68.997587]  __schedule+0x214/0x588
[   69.001063]  schedule+0x5c/0xc8
[   69.004192]  __bio_queue_enter+0x144/0x240
[   69.008276]  blk_mq_submit_bio+0x2b4/0x608
[   69.012361]  __submit_bio+0x90/0x150
[   69.015924]  submit_bio_noacct_nocheck+0x104/0x1c0
[   69.020703]  submit_bio_noacct+0x12c/0x548
[   69.024787]  submit_bio+0x4c/0xd8
[   69.028090]  ext4_mpage_readpages+0x1a4/0x780
[   69.032434]  ext4_readahead+0x44/0x60
[   69.036085]  read_pages+0xa4/0x318
[   69.039475]  page_cache_ra_unbounded+0x164/0x1f8
[   69.044080]  page_cache_ra_order+0x94/0x360
[   69.048250]  ondemand_readahead+0x16c/0x2e0
[   69.052421]  page_cache_sync_ra+0xa4/0xc8
[   69.056418]  filemap_get_pages+0xc8/0x3c0
[   69.060416]  filemap_read+0xec/0x398
[   69.063979]  generic_file_read_iter+0x50/0x160
[   69.068410]  ext4_file_read_iter+0x60/0x240
[   69.072581]  __kernel_read+0xdc/0x280
[   69.076232]  kernel_read+0x74/0xc8
[   69.079621]  search_binary_handler+0x60/0x320
[   69.083967]  exec_binprm+0x5c/0x1c0
[   69.087444]  bprm_execve.part.0+0x1e8/0x270
[   69.091614]  bprm_execve+0x68/0xa8
[   69.095004]  do_execveat_common.isra.0+0x1a8/0x250
[   69.099783]  __arm64_sys_execve+0x4c/0x70
[   69.103781]  invoke_syscall+0x54/0x138
[   69.107518]  el0_svc_common.constprop.0+0xd4/0x100
[   69.112297]  do_el0_svc+0x28/0x40
[   69.115601]  el0_svc+0x44/0x128
[   69.118730]  el0t_64_sync_handler+0x108/0x138
[   69.123075]  el0t_64_sync+0x1ac/0x1b0
[   69.126725] task:QThread state:D stack:0 pid:3174 ppid:1486 flags:0x00000005
[   69.135063] Call trace:
[   69.137496]  __switch_to+0xe4/0x128
[   69.140973]  __schedule+0x214/0x588
[   69.144449]  schedule+0x5c/0xc8
[   69.147578]  __bio_queue_enter+0x144/0x240
[   69.151663]  blk_mq_submit_bio+0x2b4/0x608
[   69.155748]  __submit_bio+0x90/0x150
[   69.159312]  submit_bio_noacct_nocheck+0x104/0x1c0
[   69.164090]  submit_bio_noacct+0x12c/0x548
[   69.168175]  submit_bio+0x4c/0xd8
[   69.171478]  ext4_mpage_readpages+0x1a4/0x780
[   69.175822]  ext4_readahead+0x44/0x60
[   69.179472]  read_pages+0xa4/0x318
[   69.182861]  page_cache_ra_unbounded+0x164/0x1f8
[   69.187466]  page_cache_ra_order+0x94/0x360
[   69.191637]  ondemand_readahead+0x16c/0x2e0
[   69.195808]  page_cache_sync_ra+0xa4/0xc8
[   69.199805]  filemap_get_pages+0xc8/0x3c0
[   69.203802]  filemap_read+0xec/0x398
[   69.207365]  generic_file_read_iter+0x50/0x160
[   69.211796]  ext4_file_read_iter+0x60/0x240
[   69.215967]  __kernel_read+0xdc/0x280
[   69.219617]  kernel_read+0x74/0xc8
[   69.223007]  search_binary_handler+0x60/0x320
[   69.227352]  exec_binprm+0x5c/0x1c0
[   69.230829]  bprm_execve.part.0+0x1e8/0x270
[   69.235000]  bprm_execve+0x68/0xa8
[   69.238389]  do_execveat_common.isra.0+0x1a8/0x250
[   69.243168]  __arm64_sys_execve+0x4c/0x70
[   69.247166]  invoke_syscall+0x54/0x138
[   69.250904]  el0_svc_common.constprop.0+0x4c/0x100
[   69.255684]  do_el0_svc+0x28/0x40
[   69.258987]  el0_svc+0x44/0x128
[   69.262117]  el0t_64_sync_handler+0x108/0x138
[   69.266462]  el0t_64_sync+0x1ac/0x1b0
[   69.270111] task:(sd-rmrf) state:D stack:0 pid:3177 ppid:1 flags:0x00000004
[   69.278449] Call trace:
[   69.280882]  __switch_to+0xe4/0x128
[   69.284358]  __schedule+0x214/0x588
[   69.287835]  schedule+0x5c/0xc8
[   69.290964]  io_schedule+0x48/0x78
[   69.294353]  bit_wait_io+0x24/0x90
[   69.297743]  __wait_on_bit+0x7c/0xe0
[   69.301306]  out_of_line_wait_on_bit+0x94/0xd0
[   69.305738]  do_get_write_access+0x31c/0x508
[   69.309996]  jbd2_journal_get_write_access+0x94/0xe8
[   69.314948]  __ext4_journal_get_write_access+0x7c/0x1c0
[   69.320160]  ext4_orphan_add+0x158/0x398
[   69.324071]  ext4_rmdir+0x26c/0x3d8
[   69.327547]  vfs_rmdir+0x98/0x250
[   69.330849]  do_rmdir+0x184/0x1b8
[   69.334152]  __arm64_sys_unlinkat+0x74/0xa0
[   69.338322]  invoke_syscall+0x54/0x138
[   69.342060]  el0_svc_common.constprop.0+0x4c/0x100
[   69.346840]  do_el0_svc+0x28/0x40
[   69.350143]  el0_svc+0x44/0x128
[   69.353273]  el0t_64_sync_handler+0x108/0x138
[   69.357617]  el0t_64_sync+0x1ac/0x1b0
[  101.463556] nvme nvme0: I/O 17 QID 0 timeout, completion polled
[  101.463558] nvme nvme0: Shutdown timeout set to 10 seconds

[FAILED] Failed to start nmbd.service - Samba NMB Daemon.
On 2/16/25 06:08, Huacai Chen wrote:
> On Sat, Feb 15, 2025 at 10:05 PM Harry (Hyeonggon) Yoo
> <42.hyeyoo@gmail.com> wrote:
>>
>> You mean when SLAB_FREELIST_RANDOM is enabled?
>> Assuming that...
>
> Yes.
>
>>> CPU1's idle task stack from the booting kernel may be CPU2's idle task
>>> stack from the target kernel, and CPU2's idle task stack from the
>>> booting kernel may be CPU1's idle task stack from the target kernel,

The freelist can have more objects than the number of cpus, and then it's
not just exchanging stacks between cpus but also with whatever else
allocates from the same slab.

>> What happens if that's not the case?
>
> SLAB means "objects with the same type", right? So it is probably the
> case.

kmalloc() is not objects of the same type, just size.

> Yes, there is a very, very low possibility that it is not the case,
> but...
>
> In theory x86_64 also has a low possibility that the idle task's stack
> or other metadata gets overwritten, so should we also disable random
> kmalloc for x86_64?

Does x86 really have such a possibility? Can you explain in more detail?

> On the other hand, if we really need to handle this theoretical
> possibility with SLAB_FREELIST_RANDOM now, we can simply move
> init_freelist_randomization() after all initcalls, too.

I think a more robust approach would be to make sure any allocations
critical for hibernation/resume are static?

Are there really multiple CPUs' idle task stacks involved? Aren't the
critical paths single cpu only? I'd assume when multiple cpus might be
running at that phase, all bets about determinism are off? So there
could be a statically allocated stack for the single cpu?

Note I don't know much about how hibernation works. But it seems fragile
to rely on rather complex allocators (slab over page allocators) to give
me exactly the same addresses across boots, and hack around their
randomization features to help with that, if there's an alternative to
use static allocations for the critical pieces of the hibernation/resume
code.
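A minimal sketch of the static-allocation direction suggested here, assuming a single resume-critical CPU. The names, size, and usage are hypothetical; nothing like this exists in the kernel.

```c
/* Hypothetical sketch only: reserve the resume-critical stack in the
 * kernel image rather than via kmalloc(), so its address is fixed by
 * the image layout and identical across boots, regardless of slab
 * randomization. RESUME_STACK_SIZE and resume_stack are made-up names. */
#define RESUME_STACK_SIZE 16384	/* assumption: matches THREAD_SIZE */

static unsigned char resume_stack[RESUME_STACK_SIZE]
	__attribute__((aligned(RESUME_STACK_SIZE)));

/* The arch resume path would point the CPU's stack pointer at
 * resume_stack + RESUME_STACK_SIZE instead of a kmalloc()ed stack. */
```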
On Fri, Feb 14, 2025 at 09:44:59PM +0900, Harry (Hyeonggon) Yoo wrote:
> On Fri, Feb 14, 2025 at 06:02:52PM +0800, Huacai Chen wrote:
> > On Fri, Feb 14, 2025 at 5:33 PM Harry (Hyeonggon) Yoo
> > <42.hyeyoo@gmail.com> wrote:
> > > On Thu, Feb 13, 2025 at 11:20:22AM +0800, Huacai Chen wrote:
> > > > Hi, Harry,
> > > >
> > > > On Wed, Feb 12, 2025 at 11:39 PM Harry (Hyeonggon) Yoo
> > > > <42.hyeyoo@gmail.com> wrote:
> > > > > On Wed, Feb 12, 2025 at 11:17 PM Huacai Chen <chenhuacai@loongson.cn> wrote:
> > > > > >
> > > > > > Hibernation assumes the memory layout after resume be the same as that
> > > > > > before sleep, but CONFIG_RANDOM_KMALLOC_CACHES breaks this assumption.
> > > > >
> > > > > Could you please elaborate what do you mean by
> > > > > hibernation assumes 'the memory layout' after resume be the same as that
> > > > > before sleep?
> > > > >
> > > > > I don't understand how updating random_kmalloc_seed breaks resuming from
> > > > > hibernation. Changing random_kmalloc_seed affects which kmalloc caches
> > > > > newly allocated objects are from, but it should not affect the objects
> > > > > that are already allocated (before hibernation).
> > > >
> > > > When resuming, the booting kernel should switch to the target kernel,
> > > > if the address of switch code (from the booting kernel) is the
> > > > effective data of the target kernel, then the switch code may be
> > > > overwritten.
> > >
> > > Hmm... I'm still missing some pieces.
> > > How is the kernel binary overwritten when slab allocations are randomized?
> > >
> > > Also, I'm not sure if it's even safe to assume that the memory layout is
> > > the same across boots. But I'm not an expert on swsusp anyway...
> > >
> > > It'd be really helpful for linux-pm folks to clarify 1) what the
> > > (architecture-independent) assumptions are for swsusp to work, and
> > > 2) how architectures dealt with other randomization features like kASLR...
>
> [+Cc few more people that worked on slab hardening]
>
> > I'm sorry to confuse you. Binary overwriting is indeed caused by
> > kASLR, so at least on LoongArch we should disable kASLR for
> > hibernation.
>
> Understood.
>
> > Random kmalloc is another story, on LoongArch it breaks smpboot when
> > resuming, the details are:
> > 1, LoongArch uses kmalloc() family to allocate idle_task's
> > stack/thread_info and other data structures.
> > 2, If random kmalloc is enabled, idle_task's stack in the booting
> > kernel may be other things in the target kernel.
>
> Slab hardening features try so hard to prevent such predictability.
> For example, SLAB_FREELIST_RANDOM could also randomize the address
> kmalloc objects are allocated at.
>
> Rather than hacking CONFIG_RANDOM_KMALLOC_CACHES like this, we could
> have a single option to disable slab hardening features that make
> the address unpredictable.
>
> It'd be nice to have something like ARCH_SUPPORTS_SLAB_RANDOM which
> some hardening features depend on. And then let some arches conditionally
> not select ARCH_SUPPORTS_SLAB_RANDOM if hibernation's enabled
> (at the cost of less hardening)?

I find this whole thread confusing. :) Hibernation should already do
whatever it needs to to get out of the way of the kernel it is restoring
to memory. The random locations shouldn't matter at all: they're all
stored in the image.

I am not a hibernation expert, but my understanding is that the "resume"
kernel moves itself out of the way to restore the KASLR-ed hibernation
image and puts everything back exactly as it was. Randomization should
not matter at all: it's simply "put everything back where it was".

Yes, the tricky part is the "move itself out of the way", but that's
required for any kernel that supports being relocatable (a prerequisite
for KASLR), and KASLR is just an aggressive form of "the relocatable
kernel might be anywhere", beyond just different boot loaders putting it
in a handful of different potential offsets.
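For context on the mechanism being debated: CONFIG_RANDOM_KMALLOC_CACHES
creates several copies of each kmalloc size class and picks one per call
site by hashing the caller address with the boot-time seed. A simplified
sketch of that selection, modelled on kmalloc_type() in
include/linux/slab.h (the constant and helper names here are illustrative,
and details may differ between kernel versions):

#include <linux/hash.h>
#include <linux/log2.h>

/* Upstream uses 16 copies per size class, i.e. NR + 1 with NR = 15. */
#define SKETCH_RANDOM_CACHES_NR	15

extern unsigned long random_kmalloc_seed;

/*
 * Hash the call-site address together with the boot-time seed: a given
 * kmalloc() call site always maps to the same cache copy within one
 * boot, but typically to a different copy on the next boot.
 */
static inline unsigned int sketch_pick_cache(unsigned long caller)
{
	return hash_64(caller ^ random_kmalloc_seed,
		       ilog2(SKETCH_RANDOM_CACHES_NR + 1));
}

This illustrates both points above: re-seeding changes only where future
allocations land (Harry's observation), while a booting kernel seeded
differently from the hibernated kernel places the "same" early
allocations in different caches, and hence at different addresses.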
On Wed, Feb 19, 2025 at 6:25 PM Kees Cook <kees@kernel.org> wrote:
>
> On Fri, Feb 14, 2025 at 09:44:59PM +0900, Harry (Hyeonggon) Yoo wrote:
> [...]
>
> I find this whole thread confusing. :) Hibernation should already do
> whatever it needs to to get out of the way of the kernel it is restoring
> to memory. The random locations shouldn't matter at all: they're all
> stored in the image.
>
> I am not a hibernation expert, but my understanding is that the "resume"
> kernel moves itself out of the way to restore the KASLR-ed hibernation
> image and puts everything back exactly as it was. Randomization should
> not matter at all: it's simply "put everything back where it was".

Exactly.

> Yes, the tricky part is the "move itself out of the way", but that's
> required for any kernel that supports being relocatable (a prerequisite
> for KASLR), and KASLR is just an aggressive form of "the relocatable
> kernel might be anywhere", beyond just different boot loaders putting it
> in a handful of different potential offsets.

Right. Thanks!
diff --git a/init/main.c b/init/main.c
index 2a1757826397..1362957bdbe4 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1458,6 +1458,9 @@ static int __ref kernel_init(void *unused)
 	/* need to finish all async __init code before freeing the memory */
 	async_synchronize_full();
 
+#ifdef CONFIG_RANDOM_KMALLOC_CACHES
+	random_kmalloc_seed = get_random_u64();
+#endif
 	system_state = SYSTEM_FREEING_INITMEM;
 	kprobe_free_init_mem();
 	ftrace_free_init_mem();
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 4030907b6b7d..23e324aee218 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -971,9 +971,6 @@ void __init create_kmalloc_caches(void)
 		for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++)
 			new_kmalloc_cache(i, type);
 	}
-#ifdef CONFIG_RANDOM_KMALLOC_CACHES
-	random_kmalloc_seed = get_random_u64();
-#endif
 	/* Kmalloc array is now usable */
 	slab_state = UP;
Hibernation assumes the memory layout after resume to be the same as that
before sleep, but CONFIG_RANDOM_KMALLOC_CACHES breaks this assumption. At
least on LoongArch and ARM64 we observed failures of resuming from
hibernation (on LoongArch non-boot CPUs fail to come up, on ARM64 some
devices are unusable).

software_resume_initcall(), the function which resumes the target kernel,
is an initcall function. So, move the random_kmalloc_seed initialisation
after all initcalls.

Cc: stable@vger.kernel.org
Fixes: 3c6152940584290668 ("Randomized slab caches for kmalloc()")
Reported-by: Yuli Wang <wangyuli@uniontech.com>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
---
 init/main.c      | 3 +++
 mm/slab_common.c | 3 ---
 2 files changed, 3 insertions(+), 3 deletions(-)
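To make the ordering the patch relies on explicit, here is a rough sketch
of the resulting boot sequence, condensed from init/main.c with the hunk
above applied. The comments are an interpretation of the thread, assuming
software_resume_initcall() is registered as a late initcall as in
kernel/power/hibernate.c; they are not kernel documentation:

static int __ref kernel_init(void *unused)
{
	/*
	 * Runs all initcall levels, including the late initcall
	 * software_resume_initcall(). If a hibernation image is found,
	 * it is restored there and control never returns to this
	 * booting kernel.
	 */
	kernel_init_freeable();

	/* need to finish all async __init code before freeing the memory */
	async_synchronize_full();

#ifdef CONFIG_RANDOM_KMALLOC_CACHES
	/*
	 * Re-seed only after resume has had its chance: up to this
	 * point random_kmalloc_seed keeps its static initial value, so
	 * the booting kernel's early kmalloc layout (idle task stacks
	 * included) is deterministic and matches what an image kernel
	 * with the same patch had at the same stage.
	 */
	random_kmalloc_seed = get_random_u64();
#endif
	/* ... */
}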