[RESEND,v2] modules: wait do_free_init correctly

Message ID 20240129020304.1981372-1-changbin.du@huawei.com (mailing list archive)
State New

Commit Message

Changbin Du Jan. 29, 2024, 2:03 a.m. UTC
The commit 1a7b7d922081 ("modules: Use vmalloc special flag") moved
do_free_init() into a global workqueue instead of call_rcu(). So now
rcu_barrier() cannot ensure that do_free_init() has completed. We should
wait for it via flush_work().

Without this fix, we could still encounter false positive reports in
W+X checking, and the rcu synchronization is unnecessary.

Fixes: 1a7b7d922081 ("modules: Use vmalloc special flag")
Signed-off-by: Changbin Du <changbin.du@huawei.com>
Cc: Xiaoyi Su <suxiaoyi@huawei.com>

---
v2: fix compilation failure with CONFIG_MODULES=n, found by 0-DAY.
---
 include/linux/moduleloader.h | 8 ++++++++
 init/main.c                  | 5 +++--
 kernel/module/main.c         | 5 +++++
 3 files changed, 16 insertions(+), 2 deletions(-)

Comments

Luis Chamberlain Jan. 29, 2024, 5:53 p.m. UTC | #1
On Mon, Jan 29, 2024 at 10:03:04AM +0800, Changbin Du wrote:
> The commit 1a7b7d922081 ("modules: Use vmalloc special flag") moves
> do_free_init() into a global workqueue instead of call_rcu(). So now
> rcu_barrier() can not ensure that do_free_init has completed. We should
> wait it via flush_work().
> 
> Without this fix, we still could encounter false positive reports in
> W+X checking, and rcu synchronization is unnecessary.

You didn't answer my question, which should be documented in the commit log.

Does this mean we never freed modules init because of this? If so then
your commit log should clearly explain that. It should also explain that
if true (you have to verify) then it means we were no longer saving
the memory we wished to save, and that is important for distributions
which do want to save anything on memory. You may want to do a general
estimate on how much that means these days on any desktop / server.

  Luis
Changbin Du Jan. 30, 2024, 1:40 a.m. UTC | #2
On Mon, Jan 29, 2024 at 09:53:58AM -0800, Luis Chamberlain wrote:
> On Mon, Jan 29, 2024 at 10:03:04AM +0800, Changbin Du wrote:
> > The commit 1a7b7d922081 ("modules: Use vmalloc special flag") moves
> > do_free_init() into a global workqueue instead of call_rcu(). So now
> > rcu_barrier() can not ensure that do_free_init has completed. We should
> > wait it via flush_work().
> > 
> > Without this fix, we still could encounter false positive reports in
> > W+X checking, and rcu synchronization is unnecessary.
> 
> You didn't answer my question, which should be documented in the commit log.
> 
> Does this mean we never freed modules init because of this? If so then
> your commit log should clearly explain that. It should also explain that
> if true (you have to verify) then it means we were no longer saving
> the memory we wished to save, and that is important for distributions
> which do want to save anything on memory. You may want to do a general
> estimate on how much that means these days on any desktop / server.
>
Actually, I have explained it in the commit message. It's not about saving
memory. The synchronization here only ensures that the module init memory has
been freed before the W+X check runs. The problem is that the current
implementation is wrong: rcu_barrier() cannot guarantee that, so we can
encounter false positive reports. The module init memory is still freed either
way; it's just a timing issue.

>   Luis
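
To make the distinction above concrete, here is a minimal single-threaded userspace sketch. It is not kernel code. It models RCU callbacks and the workqueue as two separate pending lists, and shows that draining one does nothing for an item queued on the other. All model_* names are hypothetical; only do_free_init(), rcu_barrier() and flush_work() come from this thread, and the real primitives are asynchronous and far more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_QUEUE_MAX 8

typedef void (*model_fn)(void *arg);

/* Hypothetical stand-in for a list of pending deferred calls. */
struct model_queue {
	model_fn fn[MODEL_QUEUE_MAX];
	void *arg[MODEL_QUEUE_MAX];
	size_t len;
};

static void model_enqueue(struct model_queue *q, model_fn fn, void *arg)
{
	assert(q->len < MODEL_QUEUE_MAX);
	q->fn[q->len] = fn;
	q->arg[q->len] = arg;
	q->len++;
}

/* Run everything pending on this one queue, then empty it. */
static void model_drain(struct model_queue *q)
{
	for (size_t i = 0; i < q->len; i++)
		q->fn[i](q->arg[i]);
	q->len = 0;
}

/* Models do_free_init(): marks the init region as freed. */
static void model_do_free_init(void *arg)
{
	*(bool *)arg = true;
}

/*
 * Returns what a W+X check would observe: true if the init region
 * was already freed by the time the check runs.
 */
bool model_freed_before_check(bool wait_on_workqueue)
{
	struct model_queue rcu_callbacks = { .len = 0 };
	struct model_queue workqueue = { .len = 0 };
	bool init_freed = false;

	/* Since 1a7b7d922081 the free is queued as work, not as an RCU callback. */
	model_enqueue(&workqueue, model_do_free_init, &init_freed);

	if (wait_on_workqueue)
		model_drain(&workqueue);	/* stands in for flush_work() */
	else
		model_drain(&rcu_callbacks);	/* stands in for rcu_barrier() */

	return init_freed;
}
```

In this model, model_freed_before_check(false) returns false because the RCU list never held the free, while model_freed_before_check(true) returns true. In the real kernel the work still runs eventually, which is why the symptom is a timing-dependent false positive rather than a leak.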
Luis Chamberlain Jan. 30, 2024, 2:21 p.m. UTC | #3
On Tue, Jan 30, 2024 at 09:40:38AM +0800, Changbin Du wrote:
> On Mon, Jan 29, 2024 at 09:53:58AM -0800, Luis Chamberlain wrote:
> > On Mon, Jan 29, 2024 at 10:03:04AM +0800, Changbin Du wrote:
> > > The commit 1a7b7d922081 ("modules: Use vmalloc special flag") moves
> > > do_free_init() into a global workqueue instead of call_rcu(). So now
> > > rcu_barrier() can not ensure that do_free_init has completed. We should
> > > wait it via flush_work().
> > > 
> > > Without this fix, we still could encounter false positive reports in
> > > W+X checking, and rcu synchronization is unnecessary.
> > 
> > You didn't answer my question, which should be documented in the commit log.
> > 
> > Does this mean we never freed modules init because of this? If so then
> > your commit log should clearly explain that. It should also explain that
> > if true (you have to verify) then it means we were no longer saving
> > the memory we wished to save, and that is important for distributions
> > which do want to save anything on memory. You may want to do a general
> > estimate on how much that means these days on any desktop / server.
>
> Actually, I have explained it in commit msg. It's not about saving memory. The
> synchronization here is just to ensure the module init's been freed before
> doing W+X checking. The problem is that the current implementation is wrong,
> rcu_barrier() cannot guarantee that. So we can encounter false positive reports.
> But anyway, the module init will be freed, and it's just a timing related issue.

Your description here is better than the commit log.

  Luis
Eric Chanudet Feb. 15, 2024, 2:18 p.m. UTC | #4
On Tue, Jan 30, 2024 at 06:21:03AM -0800, Luis Chamberlain wrote:
> On Tue, Jan 30, 2024 at 09:40:38AM +0800, Changbin Du wrote:
> > On Mon, Jan 29, 2024 at 09:53:58AM -0800, Luis Chamberlain wrote:
> > > On Mon, Jan 29, 2024 at 10:03:04AM +0800, Changbin Du wrote:
> > > > The commit 1a7b7d922081 ("modules: Use vmalloc special flag") moves
> > > > do_free_init() into a global workqueue instead of call_rcu(). So now
> > > > rcu_barrier() can not ensure that do_free_init has completed. We should
> > > > wait it via flush_work().
> > > > 
> > > > Without this fix, we still could encounter false positive reports in
> > > > W+X checking, and rcu synchronization is unnecessary.

The comment in do_init_module(), just before
schedule_work(&init_free_wq), mentioning rcu_barrier(), should be
amended as well.

> > > 
> > > You didn't answer my question, which should be documented in the commit log.
> > > 
> > > Does this mean we never freed modules init because of this? If so then
> > > your commit log should clearly explain that. It should also explain that
> > > if true (you have to verify) then it means we were no longer saving
> > > the memory we wished to save, and that is important for distributions
> > > which do want to save anything on memory. You may want to do a general
> > > estimate on how much that means these days on any desktop / server.
> >
> > Actually, I have explained it in commit msg. It's not about saving memory. The
> > synchronization here is just to ensure the module init's been freed before
> > doing W+X checking. The problem is that the current implementation is wrong,
> > rcu_barrier() cannot guarantee that. So we can encounter false positive reports.
> > But anyway, the module init will be freed, and it's just a timing related issue.
> 
> Your description here is better than the commit log.

I saw this problem using a PREEMPT_RT kernel as well. Setting DEBUG_WX=n
still shows a significant delay due to rcu_barrier():
  [    0.291444] Freeing unused kernel memory: 5568K
  [    0.402442] Run /sbin/init as init process

The same delay is shorter using linux-next, but still noticeable
(DEBUG_WX=n):
  [    0.384362] Freeing unused kernel memory: 14080K
  [    0.413423] Run /sbin/init as init process

Matching trace_event=rcu:rcu_barrier trace:
         systemd-1       [002] .....     0.384391: rcu_barrier: rcu_preempt Begin cpu -1 remaining 0 # 4
         systemd-1       [002] d..1.     0.384394: rcu_barrier: rcu_preempt Inc1 cpu -1 remaining 0 # 1
         systemd-1       [002] .....     0.384395: rcu_barrier: rcu_preempt NQ cpu 0 remaining 2 # 1
          <idle>-0       [001] d.h2.     0.384407: rcu_barrier: rcu_preempt IRQ cpu -1 remaining 2 # 1
         systemd-1       [002] .....     0.384408: rcu_barrier: rcu_preempt OnlineQ cpu 1 remaining 3 # 1
         systemd-1       [002] .....     0.384409: rcu_barrier: rcu_preempt NQ cpu 2 remaining 3 # 1
          <idle>-0       [003] d.h2.     0.384416: rcu_barrier: rcu_preempt IRQ cpu -1 remaining 3 # 1
         systemd-1       [002] .....     0.384418: rcu_barrier: rcu_preempt OnlineQ cpu 3 remaining 4 # 1
          <idle>-0       [004] d.h2.     0.384428: rcu_barrier: rcu_preempt IRQ cpu -1 remaining 4 # 1
         systemd-1       [002] .....     0.384430: rcu_barrier: rcu_preempt OnlineQ cpu 4 remaining 5 # 1
          <idle>-0       [005] d.h2.     0.384438: rcu_barrier: rcu_preempt IRQ cpu -1 remaining 5 # 1
         systemd-1       [002] .....     0.384441: rcu_barrier: rcu_preempt OnlineQ cpu 5 remaining 6 # 1
          <idle>-0       [006] d.h2.     0.384450: rcu_barrier: rcu_preempt IRQ cpu -1 remaining 6 # 1
         systemd-1       [002] .....     0.384452: rcu_barrier: rcu_preempt OnlineQ cpu 6 remaining 7 # 1
          <idle>-0       [007] d.h2.     0.384461: rcu_barrier: rcu_preempt IRQ cpu -1 remaining 7 # 1
         systemd-1       [002] .....     0.384463: rcu_barrier: rcu_preempt OnlineQ cpu 7 remaining 8 # 1
          <idle>-0       [004] ..s1.     0.385339: rcu_barrier: rcu_preempt CB cpu -1 remaining 5 # 1
          <idle>-0       [007] ..s1.     0.397335: rcu_barrier: rcu_preempt CB cpu -1 remaining 4 # 1
          <idle>-0       [003] ..s1.     0.397337: rcu_barrier: rcu_preempt CB cpu -1 remaining 3 # 1
          <idle>-0       [005] ..s1.     0.401336: rcu_barrier: rcu_preempt CB cpu -1 remaining 2 # 1
          <idle>-0       [006] ..s1.     0.401336: rcu_barrier: rcu_preempt CB cpu -1 remaining 1 # 1
          <idle>-0       [001] .Ns1.     0.413338: rcu_barrier: rcu_preempt LastCB cpu -1 remaining 0 # 1
         systemd-1       [002] .....     0.413351: rcu_barrier: rcu_preempt Inc2 cpu -1 remaining 0 # 1

With this patch the delay is no longer there:
  [    0.377662] Freeing unused kernel memory: 14080K
  [    0.377767] Run /sbin/init as init process
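
For reference, the gaps between each pair of log lines above can be computed with a small helper. This is only a sketch: the function names are hypothetical, and it assumes the usual dmesg stamp format of seconds followed by exactly six fractional digits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Parse a "[    0.291444] ..." prefix into microseconds since boot.
 * Returns -1 if the line does not start with such a stamp.
 * Assumes the fractional part has exactly six digits.
 */
long dmesg_stamp_us(const char *line)
{
	long sec, usec;

	if (sscanf(line, " [ %ld.%6ld]", &sec, &usec) != 2)
		return -1;
	return sec * 1000000L + usec;
}

/* Microseconds elapsed between two log lines, or -1 on a parse error. */
long dmesg_gap_us(const char *earlier, const char *later)
{
	long a, b;

	assert(earlier != NULL && later != NULL);
	a = dmesg_stamp_us(earlier);
	b = dmesg_stamp_us(later);
	if (a < 0 || b < 0)
		return -1;
	return b - a;
}
```

Applied to the three pairs of lines above, this gives 110998 us (about 111 ms) for the first kernel, 29061 us (about 29 ms) for linux-next, and 105 us with the patch.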

AFAIU, for the race to happen, module_alloc() needs to create a W+X
mapping (neither x86 nor arm64 does) and debug_checkwx() has to happen
before module_enable_nx() in complete_formation(). I haven't managed to
reproduce it so far.

Best,
Changbin Du Feb. 17, 2024, 8:10 a.m. UTC | #5
On Thu, Feb 15, 2024 at 09:18:09AM -0500, Eric Chanudet wrote:
> On Tue, Jan 30, 2024 at 06:21:03AM -0800, Luis Chamberlain wrote:
> > On Tue, Jan 30, 2024 at 09:40:38AM +0800, Changbin Du wrote:
> > > On Mon, Jan 29, 2024 at 09:53:58AM -0800, Luis Chamberlain wrote:
> > > > On Mon, Jan 29, 2024 at 10:03:04AM +0800, Changbin Du wrote:
> > > > > The commit 1a7b7d922081 ("modules: Use vmalloc special flag") moves
> > > > > do_free_init() into a global workqueue instead of call_rcu(). So now
> > > > > rcu_barrier() can not ensure that do_free_init has completed. We should
> > > > > wait it via flush_work().
> > > > > 
> > > > > Without this fix, we still could encounter false positive reports in
> > > > > W+X checking, and rcu synchronization is unnecessary.
> 
> The comment in do_init_module(), just before
> schedule_work(&init_free_wq), mentioning rcu_barrier(), should be
> amended as well.
>
Yes, I'll update it as well.

> > > > 
> > > > You didn't answer my question, which should be documented in the commit log.
> > > > 
> > > > Does this mean we never freed modules init because of this? If so then
> > > > your commit log should clearly explain that. It should also explain that
> > > > if true (you have to verify) then it means we were no longer saving
> > > > the memory we wished to save, and that is important for distributions
> > > > which do want to save anything on memory. You may want to do a general
> > > > estimate on how much that means these days on any desktop / server.
> > >
> > > Actually, I have explained it in commit msg. It's not about saving memory. The
> > > synchronization here is just to ensure the module init's been freed before
> > > doing W+X checking. The problem is that the current implementation is wrong,
> > > rcu_barrier() cannot guarantee that. So we can encounter false positive reports.
> > > But anyway, the module init will be freed, and it's just a timing related issue.
> > 
> > Your description here is better than the commit log.
> 
> I saw this problem using a PREEMPT_RT kernel as well. Setting DEBUG_WX=n
> still shows a significant delay due to rcu_barrier():
>   [    0.291444] Freeing unused kernel memory: 5568K
>   [    0.402442] Run /sbin/init as init process
> 
> The same delay is shorter using linux-next, but still noticeable
> (DEBUG_WX=n):
>   [    0.384362] Freeing unused kernel memory: 14080K
>   [    0.413423] Run /sbin/init as init process
> 
> Matching trace_event=rcu:rcu_barrier trace:
>          systemd-1       [002] .....     0.384391: rcu_barrier: rcu_preempt Begin cpu -1 remaining 0 # 4
>          systemd-1       [002] d..1.     0.384394: rcu_barrier: rcu_preempt Inc1 cpu -1 remaining 0 # 1
>          systemd-1       [002] .....     0.384395: rcu_barrier: rcu_preempt NQ cpu 0 remaining 2 # 1
>           <idle>-0       [001] d.h2.     0.384407: rcu_barrier: rcu_preempt IRQ cpu -1 remaining 2 # 1
>          systemd-1       [002] .....     0.384408: rcu_barrier: rcu_preempt OnlineQ cpu 1 remaining 3 # 1
>          systemd-1       [002] .....     0.384409: rcu_barrier: rcu_preempt NQ cpu 2 remaining 3 # 1
>           <idle>-0       [003] d.h2.     0.384416: rcu_barrier: rcu_preempt IRQ cpu -1 remaining 3 # 1
>          systemd-1       [002] .....     0.384418: rcu_barrier: rcu_preempt OnlineQ cpu 3 remaining 4 # 1
>           <idle>-0       [004] d.h2.     0.384428: rcu_barrier: rcu_preempt IRQ cpu -1 remaining 4 # 1
>          systemd-1       [002] .....     0.384430: rcu_barrier: rcu_preempt OnlineQ cpu 4 remaining 5 # 1
>           <idle>-0       [005] d.h2.     0.384438: rcu_barrier: rcu_preempt IRQ cpu -1 remaining 5 # 1
>          systemd-1       [002] .....     0.384441: rcu_barrier: rcu_preempt OnlineQ cpu 5 remaining 6 # 1
>           <idle>-0       [006] d.h2.     0.384450: rcu_barrier: rcu_preempt IRQ cpu -1 remaining 6 # 1
>          systemd-1       [002] .....     0.384452: rcu_barrier: rcu_preempt OnlineQ cpu 6 remaining 7 # 1
>           <idle>-0       [007] d.h2.     0.384461: rcu_barrier: rcu_preempt IRQ cpu -1 remaining 7 # 1
>          systemd-1       [002] .....     0.384463: rcu_barrier: rcu_preempt OnlineQ cpu 7 remaining 8 # 1
>           <idle>-0       [004] ..s1.     0.385339: rcu_barrier: rcu_preempt CB cpu -1 remaining 5 # 1
>           <idle>-0       [007] ..s1.     0.397335: rcu_barrier: rcu_preempt CB cpu -1 remaining 4 # 1
>           <idle>-0       [003] ..s1.     0.397337: rcu_barrier: rcu_preempt CB cpu -1 remaining 3 # 1
>           <idle>-0       [005] ..s1.     0.401336: rcu_barrier: rcu_preempt CB cpu -1 remaining 2 # 1
>           <idle>-0       [006] ..s1.     0.401336: rcu_barrier: rcu_preempt CB cpu -1 remaining 1 # 1
>           <idle>-0       [001] .Ns1.     0.413338: rcu_barrier: rcu_preempt LastCB cpu -1 remaining 0 # 1
>          systemd-1       [002] .....     0.413351: rcu_barrier: rcu_preempt Inc2 cpu -1 remaining 0 # 1
> 
> With this patch the delay is no longer there:
>   [    0.377662] Freeing unused kernel memory: 14080K
>   [    0.377767] Run /sbin/init as init process
> 
Thanks for the info. We see a similar delay in our scenario. I'll add your
test data to the commit message.

> AFAIU, for the race to happen, module_alloc() needs to create a W+X
> mapping (neither x86 nor arm64 does) and debug_checkwx() has to happen
> before module_enable_nx() in complete_formation(). I haven't managed to
> reproduce it so far.
> 
> Best,
> 
> -- 
> Eric Chanudet
>

Patch

diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
index 001b2ce83832..89b1e0ed9811 100644
--- a/include/linux/moduleloader.h
+++ b/include/linux/moduleloader.h
@@ -115,6 +115,14 @@  int module_finalize(const Elf_Ehdr *hdr,
 		    const Elf_Shdr *sechdrs,
 		    struct module *mod);
 
+#ifdef CONFIG_MODULES
+void flush_module_init_free_work(void);
+#else
+static inline void flush_module_init_free_work(void)
+{
+}
+#endif
+
 /* Any cleanup needed when module leaves. */
 void module_arch_cleanup(struct module *mod);
 
diff --git a/init/main.c b/init/main.c
index e24b0780fdff..f0b7e21ac67f 100644
--- a/init/main.c
+++ b/init/main.c
@@ -99,6 +99,7 @@ 
 #include <linux/init_syscalls.h>
 #include <linux/stackdepot.h>
 #include <linux/randomize_kstack.h>
+#include <linux/moduleloader.h>
 #include <net/net_namespace.h>
 
 #include <asm/io.h>
@@ -1402,11 +1403,11 @@  static void mark_readonly(void)
 	if (rodata_enabled) {
 		/*
 		 * load_module() results in W+X mappings, which are cleaned
-		 * up with call_rcu().  Let's make sure that queued work is
+		 * up with init_free_wq. Let's make sure that queued work is
 		 * flushed so that we don't hit false positives looking for
 		 * insecure pages which are W+X.
 		 */
-		rcu_barrier();
+		flush_module_init_free_work();
 		mark_rodata_ro();
 		rodata_test();
 	} else
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 36681911c05a..ea66b5c2a2a1 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2489,6 +2489,11 @@  static void do_free_init(struct work_struct *w)
 	}
 }
 
+void flush_module_init_free_work(void)
+{
+	flush_work(&init_free_wq);
+}
+
 #undef MODULE_PARAM_PREFIX
 #define MODULE_PARAM_PREFIX "module."
 /* Default value for module->async_probe_requested */