Message ID: 20250127143201.45453-1-d.dulov@aladdin.ru (mailing list archive)
State: New
Series: mm/vma: Fix hugetlb accounting error in copy_vma()
Thanks for the report.

On Mon, Jan 27, 2025 at 05:32:01PM +0300, Daniil Dulov wrote:
> In copy_vma() allocation of maple tree nodes may fail. Since page accounting
> takes place at the close() operation for hugetlb, it is called at the error
> path against the new_vma to account pages of the vma that was not successfully
> copied and that shares the page_counter with the original vma. Then, when the
> process is being terminated, vm_ops->close() is called once again against the
> original vma, which results in a page_counter underflow.

This seems like a bug in hugetlb.

I really hate the solution here, it's hacky and assumes only these fields are
meaningful for 'close twice' scenarios.

We now use vma_close(), which assigns vma->vm_ops to vma_dummy_vm_ops, meaning
no further close() invocations can occur.

If hugetlb is _still_ choosing to internally invoke this, it seems like it
should have some if (vma->vm_ops == hugetlb_vm_ops) { ... } check first? That
way it'll account for the closing twice issue.

Can you easily repro in order to check a solution like that fixes your
problem? I don't see why it shouldn't.

> page_counter underflow: -1024 nr_pages=1024
> WARNING: CPU: 1 PID: 1086 at mm/page_counter.c:55 page_counter_cancel+0xd6/0x130 mm/page_counter.c:55
> Modules linked in:
> CPU: 1 PID: 1086 Comm: syz-executor200 Not tainted 6.1.108-syzkaller-00078-g9ce77c16947b #0
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> Call Trace:
>  <TASK>
>  page_counter_uncharge+0x2e/0x70 mm/page_counter.c:158
>  hugetlb_cgroup_uncharge_counter+0xd2/0x420 mm/hugetlb_cgroup.c:430
>  hugetlb_vm_op_close+0x435/0x700 mm/hugetlb.c:4886
>  remove_vma+0x84/0x130 mm/mmap.c:140
>  exit_mmap+0x32f/0x7a0 mm/mmap.c:3249
>  __mmput+0x11e/0x430 kernel/fork.c:1199
>  mmput+0x61/0x70 kernel/fork.c:1221
>  exit_mm kernel/exit.c:565 [inline]
>  do_exit+0xa4a/0x2790 kernel/exit.c:858
>  do_group_exit+0xd0/0x2a0 kernel/exit.c:1021
>  __do_sys_exit_group kernel/exit.c:1032 [inline]
>  __se_sys_exit_group kernel/exit.c:1030 [inline]
>  __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:1030
>  do_syscall_x64 arch/x86/entry/common.c:51 [inline]
>  do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81
>  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>  </TASK>
>
> Since there is no sense in vm accounting for a bad copy of vma, set vm_start
> to be equal vm_end and vm_pgoff to be equal 0. Previously, a similar issue
> has been fixed in __split_vma() in the same way [1].
>
> [1]: https://lore.kernel.org/all/20220719201523.3561958-1-Liam.Howlett@oracle.com/T/

Understood that we do this elsewhere, I think equally we should not do this
there either! :)

> Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
>
> Fixes: d4af56c5c7c6 ("mm: start tracking VMAs with maple tree")
> Cc: stable@vger.kernel.com
> Signed-off-by: Daniil Dulov <d.dulov@aladdin.ru>
> ---
>  mm/vma.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/mm/vma.c b/mm/vma.c
> index bb2119e5a0d0..dbc68b7cd0ec 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -1772,6 +1772,9 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>  		return new_vma;
>
>  out_vma_link:
> +	/* Avoid vm accounting in close() operation */
> +	new_vma->vm_start = new_vma->vm_end;
> +	new_vma->vm_pgoff = 0;
>  	vma_close(new_vma);
>
>  	if (new_vma->vm_file)
> --
> 2.34.1
Thank you for the feedback! I'll check the suggested solution ASAP.

-----Original Message-----
From: Lorenzo Stoakes [mailto:lorenzo.stoakes@oracle.com]
Sent: Monday, January 27, 2025 5:47 PM
To: Дулов Даниил <D.Dulov@aladdin.ru>
Cc: Andrew Morton <akpm@linux-foundation.org>; Liam R. Howlett <Liam.Howlett@oracle.com>; Vlastimil Babka <vbabka@suse.cz>; Jann Horn <jannh@google.com>; Matthew Wilcox (Oracle) <willy@infradead.org>; linux-mm@kvack.org; linux-kernel@vger.kernel.org; lvc-project@linuxtesting.org; stable@vger.kernel.com
Subject: Re: [PATCH] mm/vma: Fix hugetlb accounting error in copy_vma()
+Cc Mina Almasry and Mike Kravetz

On Mon, 27. Jan 14:46, Lorenzo Stoakes wrote:
> Thanks for the report.
>
> On Mon, Jan 27, 2025 at 05:32:01PM +0300, Daniil Dulov wrote:
> > In copy_vma() allocation of maple tree nodes may fail. Since page accounting
> > takes place at the close() operation for hugetlb, it is called at the error
> > path against the new_vma to account pages of the vma that was not successfully
> > copied and that shares the page_counter with the original vma. Then, when the
> > process is being terminated, vm_ops->close() is called once again against the
> > original vma, which results in a page_counter underflow.
>
> This seems like a bug in hugetlb.
>
> I really hate the solution here, it's hacky and assumes only these fields are
> meaningful for 'close twice' scenarios.
>
> We now use vma_close(), which assigns vma->vm_ops to vma_dummy_vm_ops, meaning
> no further close() invocations can occur.

Does the "close twice" scenario exactly mean ->close() is called twice
for the same object of struct vm_area_struct?

For the observed case I think that's not true. ->close() is called for
two different objects of type vm_area_struct - the first time for the
new_vma on the error path of copy_vma(), the second time for the original
vma. It turns out that these objects share the same reservation map
holding page_counters at this point in time.

> If hugetlb is _still_ choosing to internally invoke this, it seems like it
> should have some if (vma->vm_ops == hugetlb_vm_ops) { ... } check first? That
> way it'll account for the closing twice issue.
>
> Can you easily repro in order to check a solution like that fixes your problem?
> I don't see why it shouldn't.

Seems that wouldn't fix the problem (again, two different vma objects).

There's presumably no obvious place in hugetlb internals where this may
be fixed elegantly, at a quick glance. But yep, it does look like a bug
for hugetlb to care about. Perhaps somehow defer the reservation map
copying?
On Mon, Jan 27, 2025 at 8:45 AM Fedor Pchelkin <pchelkin@ispras.ru> wrote:
>
> +Cc Mina Almasry and Mike Kravetz
>

+Cc current Hugetlb maintainer as well.
diff --git a/mm/vma.c b/mm/vma.c
index bb2119e5a0d0..dbc68b7cd0ec 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -1772,6 +1772,9 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 		return new_vma;

 out_vma_link:
+	/* Avoid vm accounting in close() operation */
+	new_vma->vm_start = new_vma->vm_end;
+	new_vma->vm_pgoff = 0;
 	vma_close(new_vma);

 	if (new_vma->vm_file)
In copy_vma() allocation of maple tree nodes may fail. Since page accounting
takes place at the close() operation for hugetlb, it is called at the error
path against the new_vma to account pages of the vma that was not successfully
copied and that shares the page_counter with the original vma. Then, when the
process is being terminated, vm_ops->close() is called once again against the
original vma, which results in a page_counter underflow.

page_counter underflow: -1024 nr_pages=1024
WARNING: CPU: 1 PID: 1086 at mm/page_counter.c:55 page_counter_cancel+0xd6/0x130 mm/page_counter.c:55
Modules linked in:
CPU: 1 PID: 1086 Comm: syz-executor200 Not tainted 6.1.108-syzkaller-00078-g9ce77c16947b #0
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Call Trace:
 <TASK>
 page_counter_uncharge+0x2e/0x70 mm/page_counter.c:158
 hugetlb_cgroup_uncharge_counter+0xd2/0x420 mm/hugetlb_cgroup.c:430
 hugetlb_vm_op_close+0x435/0x700 mm/hugetlb.c:4886
 remove_vma+0x84/0x130 mm/mmap.c:140
 exit_mmap+0x32f/0x7a0 mm/mmap.c:3249
 __mmput+0x11e/0x430 kernel/fork.c:1199
 mmput+0x61/0x70 kernel/fork.c:1221
 exit_mm kernel/exit.c:565 [inline]
 do_exit+0xa4a/0x2790 kernel/exit.c:858
 do_group_exit+0xd0/0x2a0 kernel/exit.c:1021
 __do_sys_exit_group kernel/exit.c:1032 [inline]
 __se_sys_exit_group kernel/exit.c:1030 [inline]
 __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:1030
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8
 </TASK>

Since there is no sense in vm accounting for a bad copy of vma, set vm_start
to be equal vm_end and vm_pgoff to be equal 0. Previously, a similar issue
has been fixed in __split_vma() in the same way [1].

[1]: https://lore.kernel.org/all/20220719201523.3561958-1-Liam.Howlett@oracle.com/T/

Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Fixes: d4af56c5c7c6 ("mm: start tracking VMAs with maple tree")
Cc: stable@vger.kernel.com
Signed-off-by: Daniil Dulov <d.dulov@aladdin.ru>
---
 mm/vma.c | 3 +++
 1 file changed, 3 insertions(+)