diff mbox series

[v2] mm/hugetlb: Fix a race between hugetlb sysctl handlers

Message ID 20200828031146.43035-1-songmuchun@bytedance.com (mailing list archive)
State New, archived
Headers show
Series [v2] mm/hugetlb: Fix a race between hugetlb sysctl handlers | expand

Commit Message

Muchun Song Aug. 28, 2020, 3:11 a.m. UTC
There is a race between the assignment of `table->data` and write value
to the pointer of `table->data` in the __do_proc_doulongvec_minmax() on
the other thread.

CPU0:                                 CPU1:
                                      proc_sys_write
hugetlb_sysctl_handler                  proc_sys_call_handler
hugetlb_sysctl_handler_common             hugetlb_sysctl_handler
  table->data = &tmp;                       hugetlb_sysctl_handler_common
                                              table->data = &tmp;
    proc_doulongvec_minmax
      do_proc_doulongvec_minmax           sysctl_head_finish
        __do_proc_doulongvec_minmax         unuse_table
          i = table->data;
          *i = val;  // corrupt CPU1's stack

Fix this by duplicating the `table`, and only update the duplicate of
it. And introduce a helper of proc_hugetlb_doulongvec_minmax() to
simplify the code.

The following oops was seen:

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor instruction fetch in kernel mode
    #PF: error_code(0x0010) - not-present page
    Code: Bad RIP value.
    ...
    Call Trace:
     ? set_max_huge_pages+0x3da/0x4f0
     ? alloc_pool_huge_page+0x150/0x150
     ? proc_doulongvec_minmax+0x46/0x60
     ? hugetlb_sysctl_handler_common+0x1c7/0x200
     ? nr_hugepages_store+0x20/0x20
     ? copy_fd_bitmaps+0x170/0x170
     ? hugetlb_sysctl_handler+0x1e/0x20
     ? proc_sys_call_handler+0x2f1/0x300
     ? unregister_sysctl_table+0xb0/0xb0
     ? __fd_install+0x78/0x100
     ? proc_sys_write+0x14/0x20
     ? __vfs_write+0x4d/0x90
     ? vfs_write+0xef/0x240
     ? ksys_write+0xc0/0x160
     ? __ia32_sys_read+0x50/0x50
     ? __close_fd+0x129/0x150
     ? __x64_sys_write+0x43/0x50
     ? do_syscall_64+0x6c/0x200
     ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
chagelogs in v2:
 1. Add more details about how the race happened to the commit message.
 2. Remove unnecessary assignment of table->maxlen.

 mm/hugetlb.c | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

Comments

Mike Kravetz Aug. 28, 2020, 2:46 p.m. UTC | #1
On 8/27/20 8:11 PM, Muchun Song wrote:
> There is a race between the assignment of `table->data` and write value
> to the pointer of `table->data` in the __do_proc_doulongvec_minmax() on
> the other thread.
> 
> CPU0:                                 CPU1:
>                                       proc_sys_write
> hugetlb_sysctl_handler                  proc_sys_call_handler
> hugetlb_sysctl_handler_common             hugetlb_sysctl_handler
>   table->data = &tmp;                       hugetlb_sysctl_handler_common
>                                               table->data = &tmp;
>     proc_doulongvec_minmax
>       do_proc_doulongvec_minmax           sysctl_head_finish
>         __do_proc_doulongvec_minmax         unuse_table
>           i = table->data;
>           *i = val;  // corrupt CPU1's stack
> 
> Fix this by duplicating the `table`, and only update the duplicate of
> it. And introduce a helper of proc_hugetlb_doulongvec_minmax() to
> simplify the code.
> 
> The following oops was seen:
> 
>     BUG: kernel NULL pointer dereference, address: 0000000000000000
>     #PF: supervisor instruction fetch in kernel mode
>     #PF: error_code(0x0010) - not-present page
>     Code: Bad RIP value.
>     ...
>     Call Trace:
>      ? set_max_huge_pages+0x3da/0x4f0
>      ? alloc_pool_huge_page+0x150/0x150
>      ? proc_doulongvec_minmax+0x46/0x60
>      ? hugetlb_sysctl_handler_common+0x1c7/0x200
>      ? nr_hugepages_store+0x20/0x20
>      ? copy_fd_bitmaps+0x170/0x170
>      ? hugetlb_sysctl_handler+0x1e/0x20
>      ? proc_sys_call_handler+0x2f1/0x300
>      ? unregister_sysctl_table+0xb0/0xb0
>      ? __fd_install+0x78/0x100
>      ? proc_sys_write+0x14/0x20
>      ? __vfs_write+0x4d/0x90
>      ? vfs_write+0xef/0x240
>      ? ksys_write+0xc0/0x160
>      ? __ia32_sys_read+0x50/0x50
>      ? __close_fd+0x129/0x150
>      ? __x64_sys_write+0x43/0x50
>      ? do_syscall_64+0x6c/0x200
>      ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

Thank you!

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Andi Kleen Aug. 28, 2020, 2:59 p.m. UTC | #2
> Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")

I believe the Fixes line is still not correct. The original patch
didn't have that problem. Please identify which patch added
the problematic code.

-Andi
Mike Kravetz Aug. 28, 2020, 5:14 p.m. UTC | #3
On 8/28/20 7:59 AM, Andi Kleen wrote:
>> Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
> 
> I believe the Fixes line is still not correct. The original patch
> didn't have that problem. Please identify which patch added
> the problematic code.

Hi Andi,

I agree with Muchun's assessment that the issue was caused by e5ff215941d5.
Why?

Commit e5ff215941d5 changed initialization of the .data field in the
ctl_table structure for hugetlb_sysctl_handler.  This was required because
the commit introduced multiple hstates.  As a result, it can not be known
until runtime the value for .data.  This code was added to
hugetlb_sysctl_handler to set the value at runtime.

@@ -1037,8 +1109,19 @@ int hugetlb_sysctl_handler(struct ctl_table *table, int write,
                           struct file *file, void __user *buffer,
                           size_t *length, loff_t *ppos)
 {
+       struct hstate *h = &default_hstate;
+       unsigned long tmp;
+
+       if (!write)
+               tmp = h->max_huge_pages;
+
+       table->data = &tmp;
+       table->maxlen = sizeof(unsigned long);

The problem is that two threads running in parallel can modify table->data
in the global data structure without any synchronization.  Worse yet, is
that that value is local to their stacks.  That was the root cause of the
issue addressed by Muchun's patch.

Does that analysis make sense?  Or, are we missing something.
Andi Kleen Aug. 28, 2020, 8:30 p.m. UTC | #4
On Fri, Aug 28, 2020 at 10:14:16AM -0700, Mike Kravetz wrote:
> On 8/28/20 7:59 AM, Andi Kleen wrote:
> >> Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
> > 
> > I believe the Fixes line is still not correct. The original patch
> > didn't have that problem. Please identify which patch added
> > the problematic code.
> 
> Hi Andi,
> 
> I agree with Muchun's assessment that the issue was caused by e5ff215941d5.
> Why?

Yes when checking the code again I agree. It was introduced with that
patch. Patches look ok to me now.

-Andi
diff mbox series

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a301c2d672bf..4c2a2620eeed 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3454,6 +3454,22 @@  static unsigned int allowed_mems_nr(struct hstate *h)
 }
 
 #ifdef CONFIG_SYSCTL
+static int proc_hugetlb_doulongvec_minmax(struct ctl_table *table, int write,
+					  void *buffer, size_t *length,
+					  loff_t *ppos, unsigned long *out)
+{
+	struct ctl_table dup_table;
+
+	/*
+	 * In order to avoid races with __do_proc_doulongvec_minmax(), we
+	 * can duplicate the @table and alter the duplicate of it.
+	 */
+	dup_table = *table;
+	dup_table.data = out;
+
+	return proc_doulongvec_minmax(&dup_table, write, buffer, length, ppos);
+}
+
 static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
 			 struct ctl_table *table, int write,
 			 void *buffer, size_t *length, loff_t *ppos)
@@ -3465,9 +3481,8 @@  static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
 	if (!hugepages_supported())
 		return -EOPNOTSUPP;
 
-	table->data = &tmp;
-	table->maxlen = sizeof(unsigned long);
-	ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
+	ret = proc_hugetlb_doulongvec_minmax(table, write, buffer, length, ppos,
+					     &tmp);
 	if (ret)
 		goto out;
 
@@ -3510,9 +3525,8 @@  int hugetlb_overcommit_handler(struct ctl_table *table, int write,
 	if (write && hstate_is_gigantic(h))
 		return -EINVAL;
 
-	table->data = &tmp;
-	table->maxlen = sizeof(unsigned long);
-	ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
+	ret = proc_hugetlb_doulongvec_minmax(table, write, buffer, length, ppos,
+					     &tmp);
 	if (ret)
 		goto out;