Message ID | 20210330044208.8305-1-ilya.lipnitskiy@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v3] mm: fix race by making init_zero_pfn() early_initcall |
Hi Ilya,

On 2021/3/30 12:42 PM, Ilya Lipnitskiy wrote:
> There are code paths that rely on zero_pfn to be fully initialized
> before core_initcall. For example, wq_sysfs_init() is a core_initcall
> function that eventually results in a call to kernel_execve, which
> causes a page fault with a subsequent mmput. If zero_pfn is not
> initialized by then it may not get cleaned up properly and result in an
> error:
>   BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1
>
> Here is an analysis of the race as seen on a MIPS device. On this
> particular MT7621 device (Ubiquiti ER-X), zero_pfn is PFN 0 until
> initialized, at which point it becomes PFN 5120:
>   1. wq_sysfs_init calls into kobject_uevent_env at core_initcall:
>        [<80340dc8>] kobject_uevent_env+0x7e4/0x7ec
>        [<8033f8b8>] kset_register+0x68/0x88
>        [<803cf824>] bus_register+0xdc/0x34c
>        [<803cfac8>] subsys_virtual_register+0x34/0x78
>        [<8086afb0>] wq_sysfs_init+0x1c/0x4c
>        [<80001648>] do_one_initcall+0x50/0x1a8
>        [<8086503c>] kernel_init_freeable+0x230/0x2c8
>        [<8066bca0>] kernel_init+0x10/0x100
>        [<80003038>] ret_from_kernel_thread+0x14/0x1c
>
>   2. kobject_uevent_env() calls call_usermodehelper_exec() which executes
>      kernel_execve asynchronously.
>
>   3. Memory allocations in kernel_execve cause a page fault, bumping the
>      MM reference counter:
>        [<8015adb4>] add_mm_counter_fast+0xb4/0xc0
>        [<80160d58>] handle_mm_fault+0x6e4/0xea0
>        [<80158aa4>] __get_user_pages.part.78+0x190/0x37c
>        [<8015992c>] __get_user_pages_remote+0x128/0x360
>        [<801a6d9c>] get_arg_page+0x34/0xa0
>        [<801a7394>] copy_string_kernel+0x194/0x2a4
>        [<801a880c>] kernel_execve+0x11c/0x298
>        [<800420f4>] call_usermodehelper_exec_async+0x114/0x194
>
>   4. In case zero_pfn has not been initialized yet, zap_pte_range does
>      not decrement the MM_ANONPAGES RSS counter and the BUG message is
>      triggered shortly afterwards when __mmdrop checks the ref counters:
>        [<800285e8>] __mmdrop+0x98/0x1d0
>        [<801a6de8>] free_bprm+0x44/0x118
>        [<801a86a8>] kernel_execve+0x160/0x1d8
>        [<800420f4>] call_usermodehelper_exec_async+0x114/0x194
>        [<80003198>] ret_from_kernel_thread+0x14/0x1c
>
> To avoid races such as described above, initialize init_zero_pfn at
> early_initcall level. Depending on the architecture, ZERO_PAGE is either
> constant or gets initialized even earlier, at paging_init, so there is
> no issue with initializing zero_pfn earlier.
>
> Discussion: https://lkml.kernel.org/r/CALCv0x2YqOXEAy2Q=hafjhHCtTHVodChv1qpM=niAXOpqEbt7w@mail.gmail.com
>
> Signed-off-by: Ilya Lipnitskiy <ilya.lipnitskiy@gmail.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: stable@vger.kernel.org
> ---
>  mm/memory.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Tested-by: 周琰杰 (Zhou Yanjie) <zhouyanjie@wanyeetech.com> # on CU1000-Neo/X1000E and CU1830-Neo/X1830

> diff --git a/mm/memory.c b/mm/memory.c
> index 5c3b29d3af66..e66b11ac1659 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -166,7 +166,7 @@ static int __init init_zero_pfn(void)
>  	zero_pfn = page_to_pfn(ZERO_PAGE(0));
>  	return 0;
>  }
> -core_initcall(init_zero_pfn);
> +early_initcall(init_zero_pfn);
>
>  void mm_trace_rss_stat(struct mm_struct *mm, int member, long count)
>  {