
resv_huge_page underflow with userfaultfd test

Message ID CAHS8izNvQq5rawUYnv9H-=in+noafJjP75Ci7Q5-77A419uYeQ@mail.gmail.com (mailing list archive)
State New, archived
Series: resv_huge_page underflow with userfaultfd test

Commit Message

Mina Almasry May 7, 2021, 9:21 p.m. UTC
Hi folks,

I ran into a bug that I'm not sure how to solve so I'm wondering if
anyone has suggestions on what the issue could be and how to
investigate. I added the WARN_ON_ONCE() here to catch instances of
resv_huge_pages underflowing:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 629aa4c2259c..7d763eed650f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1165,7 +1165,21 @@  static struct page *dequeue_huge_page_vma(struct hstate *h,
        page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
        if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
                SetHPageRestoreReserve(page);
+               WARN_ON_ONCE(!h->resv_huge_pages);
                h->resv_huge_pages--;
        }

And ran the userfaultfd selftests like so:

echo 1024 > /proc/sys/vm/nr_hugepages
mkdir -p /mnt/huge
mount -t hugetlbfs none /mnt/huge
./tools/testing/selftests/vm/userfaultfd hugetlb_shared 1024 200 /mnt/huge/userfaultfd_test

And ran into this warning, indicating that the test does trigger an underflow:

[   11.163403] ------------[ cut here ]------------
[   11.163404] WARNING: CPU: 0 PID: 237 at mm/hugetlb.c:1178 alloc_huge_page+0x558/0x5a0
[   11.163413] Modules linked in:
[   11.163419] CPU: 0 PID: 237 Comm: userfaultfd Not tainted 5.12.0-dbg-DEV #135
[   11.163424] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.14.0-2 04/01/2014
[   11.163429] RIP: 0010:alloc_huge_page+0x558/0x5a0
[   11.163432] Code: b0 00 0f 85 3d ff ff ff e9 2a ff ff ff be 01 00
00 00 48 89 df e8 18 e7 ff ff 48 f7 d8 4c 89 ef 48 89 c6 e8 da d7 ff
ff eb 8c <0f> 0b 4d 8b 85 c0 00 00 00 e9 95 fd ff ff e8 35 59 84 00 4c
897
[   11.163434] RSP: 0018:ffff94bb0073fc80 EFLAGS: 00010046
[   11.163436] RAX: 0000000000000080 RBX: 0000000000000000 RCX: 5fa252c406a76700
[   11.163438] RDX: c0000000ffff7fff RSI: 0000000000000004 RDI:
0000000000017ffd
[   11.163439] RBP: ffff94bb0073fcf8 R08: 0000000000000000 R09:
ffffffff9813ba70
[   11.163440] R10: 00000000ffff7fff R11: 0000000000000000 R12:
ffff8ac7800558c8
[   11.163442] R13: ffffffff993f8880 R14: 00007f0dfa200000 R15:
ffffed85453e0000
[   11.163443] FS:  00007f0d731fc700(0000) GS:ffff8acba9400000(0000)
knlGS:0000000000000000
[   11.163445] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.163448] CR2: 00007f0e65e00028 CR3: 0000000108d50003 CR4:
0000000000370ef0
[   11.163452] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[   11.163453] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[   11.163455] Call Trace:
[   11.163468]  hugetlb_mcopy_atomic_pte+0xcb/0x450
[   11.163477]  mcopy_atomic+0xa08/0xd60
[   11.163480]  ? __might_fault+0x56/0x80
[   11.163493]  userfaultfd_ioctl+0xb18/0xd60
[   11.163502]  __se_sys_ioctl+0x77/0xc0
[   11.163507]  __x64_sys_ioctl+0x1d/0x20
[   11.163510]  do_syscall_64+0x3f/0x80
[   11.163515]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   11.163519] RIP: 0033:0x45ec87
[   11.163531] Code: 3c 1c 48 f7 d8 49 39 c4 72 b8 e8 64 63 03 00 85
c0 78 bd 48 83 c4 08 4c 89 e0 5b 41 5c c3 0f 1f 44 00 00 b8 10 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89
018
[   11.163532] RSP: 002b:00007f0d731fc248 EFLAGS: 00000206 ORIG_RAX:
0000000000000010
[   11.163534] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000045ec87
[   11.163536] RDX: 00007f0d731fc290 RSI: 00000000c028aa03 RDI: 0000000000000004
[   11.163537] RBP: 00007f0d731fc270 R08: 00000000004022b3 R09: 00007f0d731fc700
[   11.163538] R10: 00007f0d731fc9d0 R11: 0000000000000206 R12: 00007fff610cd82e
[   11.163539] R13: 00007fff610cd82f R14: 00007f0d731fc400 R15: 0000000001002000
[   11.163549] irq event stamp: 722
[   11.163550] hardirqs last  enabled at (721): [<ffffffff967cd41b>]
kmem_cache_alloc_trace+0x1db/0x370
[   11.163558] hardirqs last disabled at (722): [<ffffffff9700c052>]
_raw_spin_lock_irq+0x32/0x80
[   11.163560] softirqs last  enabled at (130): [<ffffffff9654e0d6>]
__irq_exit_rcu+0xf6/0x100
[   11.163564] softirqs last disabled at (125): [<ffffffff9654e0d6>]
__irq_exit_rcu+0xf6/0x100
[   11.163567] ---[ end trace 358ac5c76c211ea1 ]---

Debugging further, I find that resv_huge_pages underflows by 1
temporarily multiple times during the test run, but a
__free_huge_page() is always subsequently called that brings it
back to 0. resv_huge_pages is always 0 at the end of the test. I
initially looked at this because I suspected a problem in the
resv_huge_pages accounting, but the accounting itself seems fine: it
correctly decrements resv_huge_pages when a page is allocated from the
reservation and correctly increments it back up when that page is
freed.

I'm not that familiar with the userfaultfd/hugetlb code so I was
hoping to solicit some suggestions for what the issue could be. Things
I've tried so far:

- Adding code that prevents resv_huge_pages from underflowing causes
the test to fail, so it seems the calling code in this test actually
expects to be able to temporarily allocate 1 more page than the VMA
has reserved, which seems like it may be a bug?
- Modifying hugetlb_mcopy_atomic_pte() to not use reserved pages also
causes the test to fail. Doing that and overprovisioning
/proc/sys/vm/nr_hugepages makes the test pass again, but I'm not
sure that's right (I'm not familiar with the code).
- The failure reproduces as far back as 5.11, so it doesn't seem
to be related to any recent changes.

Thanks in advance!

Comments

Mike Kravetz May 11, 2021, 12:33 a.m. UTC | #1
On 5/7/21 2:21 PM, Mina Almasry wrote:
> [... original report quoted in full; snipped ...]

Hi Mina,

I am fairly confident the issue is with hugetlb_mcopy_atomic_pte.  It
does not detect/handle the case where a page cache page already exists
when in MCOPY_ATOMIC_NORMAL mode.  If you add a printk/warning after the
failure of huge_add_to_page_cache, these will generally correspond to
the underflow.  From a reservation POV, if the page exists in the cache
the reservation was already consumed.  The call to alloc_huge_page will
'consume' another reservation which can lead to the underflow.  As you
noted, this underflow gets cleaned up in the error path.  However, we
should prevent it from happening as we do not want anyone making
decisions on that underflow value.

hugetlb_mcopy_atomic_pte should check for a page in the cache and if it
exists use it in MCOPY_ATOMIC_NORMAL.  This code is quite tricky and my
first simple attempt at this did not work.  I am happy to continue
working on this.  However, if you or anyone else want to jump in and fix
feel free.
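
For reference, a minimal sketch of the diagnostic Mike suggests above: warning when
huge_add_to_page_cache() fails inside hugetlb_mcopy_atomic_pte(). The exact placement
and the error label are illustrative, not the actual mm/hugetlb.c code:

	if (vm_shared) {
		ret = huge_add_to_page_cache(page, mapping, idx);
		if (ret) {
			/*
			 * A page cache page already exists at this offset, so its
			 * reservation was consumed once already; the page allocated
			 * above consumed a second reservation for the same offset,
			 * which is the transient underflow being chased here.
			 */
			pr_warn("hugetlb_mcopy_atomic_pte: page already in cache at idx %lu\n", idx);
			goto out_release;	/* illustrative error path */
		}
	}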
Mina Almasry May 11, 2021, 12:52 a.m. UTC | #2
On Mon, May 10, 2021 at 5:33 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> [... previous messages quoted in full; snipped ...]

Thank you! This is the guidance I was looking for. I'll definitely
jump in and try to fix it, but I'm still learning the code, so I won't
be terribly quick about it, unfortunately. I will nevertheless start
looking into this and take a stab at it.

Thanks so much!

> --
> Mike Kravetz
Axel Rasmussen May 11, 2021, 6:45 a.m. UTC | #3
Thanks for the investigation, Mike!

Mina, since hugetlb_mcopy_atomic_pte is specific to userfaultfd, I'm
happy to take a deeper look at it this week as well.

For context, we have seen the WARN_ON Mina described trigger in
production before, but were never able to reproduce it. The
userfaultfd self test turns out to reproduce it reliably, so the
thinking up to this point was that it just happened to reproduce some
non-userfaultfd-specific issue. But from Mike's description, it seems
this bug is very specific to userfaultfd after all. :)

On Mon, May 10, 2021 at 5:52 PM Mina Almasry <almasrymina@google.com> wrote:
> [... previous messages quoted in full; snipped ...]
Mina Almasry May 11, 2021, 7:08 a.m. UTC | #4
On Mon, May 10, 2021 at 11:46 PM Axel Rasmussen
<axelrasmussen@google.com> wrote:
>
> Thanks for the investigation, Mike!
>
> Mina, since hugetlb_mcopy_atomic_pte is specific to userfaultfd, I'm
> happy to take a deeper look at it this week as well.
>
> For context, we have seen the WARN_ON Mina described trigger in
> production before, but were never able to reproduce it. The
> userfaultfd self test turns out to reproduce it reliably, so the
> thinking up to this point was that it just happened to reproduce some
> non-userfaultfd-specific issue. But from Mike's description, it seems
> this bug is very specific to userfaultfd after all. :)
>

That's fine, I'll take an initial look at least. If I end up getting
roped into userfaultfd stuff way over my head, I'll solicit your help
or punt it over to you. Thanks!
Mike Kravetz May 11, 2021, 4:38 p.m. UTC | #5
On 5/10/21 11:45 PM, Axel Rasmussen wrote:
> Thanks for the investigation, Mike!
> 
> Mina, since hugetlb_mcopy_atomic_pte is specific to userfaultfd, I'm
> happy to take a deeper look at it this week as well.
> 
> For context, we have seen the WARN_ON Mina described trigger in
> production before, but were never able to reproduce it. The
> userfaultfd self test turns out to reproduce it reliably, so the
> thinking up to this point was that it just happened to reproduce some
> non-userfaultfd-specific issue. But from Mike's description, it seems
> this bug is very specific to userfaultfd after all. :)

Certainly, this case is userfaultfd specific.

However, I too recall seeing 'transient' underflows in the past.  Pretty
sure this was not userfaultfd specific.  Specifically, when working on
commit 22146c3ce989 "hugetlbfs: dirty pages as they are added to pagecache"
I recall seeing transient underflow.  After fixing the issue in 22146c3ce989,
I could not reproduce transient underflows and stopped looking for the
cause.  We added code to a production kernel in an attempt to catch
the issue:
https://github.com/oracle/linux-uek/commit/bd697676290d91762ef2bf79832f653b44e6f83b#diff-fb6066ca63d9afdc2e3660c85e2dcc04cf31b37900ca9df2a1019ee8fa80dce0

I'll try running some non-userfaultfd specific tests to see if I can
reproduce.
Mina Almasry May 11, 2021, 7:08 p.m. UTC | #6
On Tue, May 11, 2021 at 9:38 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 5/10/21 11:45 PM, Axel Rasmussen wrote:
> > Thanks for the investigation, Mike!
> >
> > Mina, since hugetlb_mcopy_atomic_pte is specific to userfaultfd, I'm
> > happy to take a deeper look at it this week as well.
> >
> > For context, we have seen the WARN_ON Mina described trigger in
> > production before, but were never able to reproduce it. The
> > userfaultfd self test turns out to reproduce it reliably, so the
> > thinking up to this point was that it just happened to reproduce some
> > non-userfaultfd-specific issue. But from Mike's description, it seems
> > this bug is very specific to userfaultfd after all. :)
>
> Certainly, this case is userfaultfd specific.
>
> However, I too recall seeing 'transient' underflows in the past.  Pretty
> sure this was not userfaultfd specific.  Specifically, when working on
> commit 22146c3ce989 "hugetlbfs: dirty pages as they are added to pagecache"
> I recall seeing transient underflow.  After fixing the issue in 22146c3ce989,
> I could not reproduce transient underflows and stopped looking for the
> cause.  We added code to a production kernel in an attempt to catch
> the issue:
> https://github.com/oracle/linux-uek/commit/bd697676290d91762ef2bf79832f653b44e6f83b#diff-fb6066ca63d9afdc2e3660c85e2dcc04cf31b37900ca9df2a1019ee8fa80dce0
>

We also have local changes that detect resv_huge_pages underflowing,
and they trigger in production every few months, even though we don't
use userfaultfd in production. We haven't been able to reproduce these
issues though, and we run an older version of the kernel, so I haven't
had anything useful to share until now.

I'm (quite tentatively, to be honest) considering proposing adding some
of these WARN_ON_ONCE() checks to the kernel after I fix this issue and
soliciting reproducers from the community to fix these underflows. I'm
hesitant because I'm not confident I can fix the ensuing bug reports
in a timely manner and the volume of reports may be large. But we can
cross that bridge when this issue is figured out :)

> I'll try running some non-userfaultfd specific tests to see if I can
> reproduce.
> --
> Mike Kravetz
Mike Kravetz May 12, 2021, 2:25 a.m. UTC | #7
On 5/10/21 5:33 PM, Mike Kravetz wrote:
> On 5/7/21 2:21 PM, Mina Almasry wrote:
>> I ran into a bug that I'm not sure how to solve so I'm wondering if
>> anyone has suggestions on what the issue could be and how to
>> investigate. I added the WARN_ON_ONCE() here to catch instances of
>> resv_huge_pages underflowing:
>>
> 
> I am fairly confident the issue is with hugetlb_mcopy_atomic_pte.  It
> does not detect/handle the case where a page cache page already exists
> when in MCOPY_ATOMIC_NORMAL mode.  If you add a printk/warning after the
> failure of huge_add_to_page_cache, these will generally correspond to
> the underflow.  From a reservation POV, if the page exists in the cache
> the reservation was already consumed.  The call to alloc_huge_page will
> 'consume' another reservation which can lead to the underflow.  As you
> noted, this underflow gets cleaned up in the error path.  However, we
> should prevent it from happening as we do not want anyone making
> decisions on that underflow value.
> 
> hugetlb_mcopy_atomic_pte should check for a page in the cache and if it
> exists use it in MCOPY_ATOMIC_NORMAL.  This code is quite tricky and my
> first simple attempt at this did not work.  I am happy to continue
> working on this.  However, if you or anyone else want to jump in and fix
> feel free.

I looked at this a bit more today and am not exactly sure of the
expected behavior.  The situation is:
- UFFDIO_COPY is called for hugetlb mapping
  - the dest address is in a shared mapping
  - there is a page in the cache associated with the address in the
    shared mapping

Currently, the code will fail when trying to update the page cache as
the entry already exists.  The shm code appears to do the same.

Quick question.  Is this the expected behavior?  Or, would you expect
the UFFDIO_COPY to update the page in the page cache, and then resolve
the fault/update the pte?
Peter Xu May 12, 2021, 2:35 a.m. UTC | #8
Mike,

On Tue, May 11, 2021 at 07:25:39PM -0700, Mike Kravetz wrote:
> I looked at this a bit more today and am not exactly sure of the
> expected behavior.  The situation is:
> - UFFDIO_COPY is called for hugetlb mapping
>   - the dest address is in a shared mapping
>   - there is a page in the cache associated with the address in the
>     shared mapping
> 
> Currently, the code will fail when trying to update the page cache as
> the entry already exists.  The shm code appears to do the same.
> 
> Quick question.  Is this the expected behavior?  Or, would you expect
> the UFFDIO_COPY to update the page in the page cache, and then resolve
> the fault/update the pte?

AFAICT that's the expected behavior, and it needs to be like that to avoid
silent data corruption (if the page cache page exists, the page is not
"missing" at all, so it is not suitable for UFFDIO_COPY, which is only used
for the uffd page-missing case).  Thanks,
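
To illustrate the semantics Peter describes from the monitor side: a UFFDIO_COPY
that finds the destination page already present fails with EEXIST, and a userspace
monitor typically treats that as "someone else already resolved this page". A
minimal sketch (the policy on EEXIST is an assumption about the monitor, not
something defined by this thread):

	#include <errno.h>
	#include <sys/ioctl.h>
	#include <linux/userfaultfd.h>

	/* Resolve one missing-page fault at dst by copying len bytes from src.
	 * uffd is assumed to be a userfaultfd registered for missing faults. */
	static int resolve_missing_page(int uffd, unsigned long dst,
					unsigned long src, unsigned long len)
	{
		struct uffdio_copy copy = {
			.dst = dst, .src = src, .len = len, .mode = 0,
		};

		if (ioctl(uffd, UFFDIO_COPY, &copy) == 0)
			return 0;	/* page installed, faulting threads woken */

		if (errno == EEXIST) {
			/*
			 * A page already exists at dst, e.g. another thread won
			 * the race; there is nothing to copy.  Some monitors issue
			 * an explicit UFFDIO_WAKE here to be safe.
			 */
			return 0;
		}
		return -errno;
	}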
Mike Kravetz May 12, 2021, 3:06 a.m. UTC | #9
On 5/11/21 7:35 PM, Peter Xu wrote:
> Mike,
> 
> On Tue, May 11, 2021 at 07:25:39PM -0700, Mike Kravetz wrote:
>> I looked at this a bit more today and am not exactly sure of the
>> expected behavior.  The situation is:
>> - UFFDIO_COPY is called for hugetlb mapping
>>   - the dest address is in a shared mapping
>>   - there is a page in the cache associated with the address in the
>>     shared mapping
>>
>> Currently, the code will fail when trying to update the page cache as
>> the entry already exists.  The shm code appears to do the same.
>>
>> Quick question.  Is this the expected behavior?  Or, would you expect
>> the UFFDIO_COPY to update the page in the page cache, and then resolve
>> the fault/update the pte?
> 
> AFAICT that's the expected behavior, and it needs to be like that to avoid
> silent data corruption (if the page cache page exists, the page is not
> "missing" at all, so it is not suitable for UFFDIO_COPY, which is only used
> for the uffd page-missing case).  Thanks,

Thanks Peter.

Making it fail in that case is pretty straightforward.

BTW, in this case the page was not in the cache at the time of the
original fault, which is why it is being handled as missing by userfaultfd.
By the time the UFFDIO_COPY is requested, a page has been added to the cache.
Mina Almasry May 12, 2021, 7:44 a.m. UTC | #10
On Tue, May 11, 2021 at 11:58 PM Mina Almasry <almasrymina@google.com> wrote:
>
> When hugetlb_mcopy_atomic_pte() is called with:
> - mode==MCOPY_ATOMIC_NORMAL and,
> - we already have a page in the page cache corresponding to the
> associated address,
>
> We will allocate a huge page from the reserves, and then fail to insert it
> into the cache and return -EEXIST.
>
> In this case, we need to return -EEXIST without allocating a new page as
> the page already exists in the cache. Allocating the extra page causes
> the resv_huge_pages to underflow temporarily until the extra page is
> freed.
>
> Also, add the warning so that future similar instances of resv_huge_pages
> underflowing will be caught.
>
> Also, minor drive-by cleanups to this code path:
> - pagep is an out param never set by calling code, so delete code
> assuming there could be a valid page in there.
> - use hugetlbfs_pagecache_page() instead of repeating its
> implementation.
>
> Tested using:
> ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 1024 200 \
> /mnt/huge
>
> Test passes, and dmesg shows no underflow warnings.
>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Peter Xu <peterx@redhat.com>
>
> ---
>  mm/hugetlb.c | 33 ++++++++++++++++++++-------------
>  1 file changed, 20 insertions(+), 13 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 629aa4c2259c..40f4ad1bca29 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1165,6 +1165,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
>         page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
>         if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
>                 SetHPageRestoreReserve(page);
> +               WARN_ON_ONCE(!h->resv_huge_pages);
>                 h->resv_huge_pages--;
>         }
>
> @@ -4868,30 +4869,39 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>                             struct page **pagep)
>  {
>         bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
> -       struct address_space *mapping;
> -       pgoff_t idx;
> +       struct hstate *h = hstate_vma(dst_vma);
> +       struct address_space *mapping = dst_vma->vm_file->f_mapping;
> +       pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
>         unsigned long size;
>         int vm_shared = dst_vma->vm_flags & VM_SHARED;
> -       struct hstate *h = hstate_vma(dst_vma);
>         pte_t _dst_pte;
>         spinlock_t *ptl;
> -       int ret;
> +       int ret = -ENOMEM;
>         struct page *page;
>         int writable;
>
> -       mapping = dst_vma->vm_file->f_mapping;
> -       idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> +       /* Out parameter. */
> +       WARN_ON(*pagep);
>
>         if (is_continue) {
>                 ret = -EFAULT;
> -               page = find_lock_page(mapping, idx);
> +               page = hugetlbfs_pagecache_page(h, dst_vma, dst_addr);
>                 if (!page)
>                         goto out;
> -       } else if (!*pagep) {
> -               ret = -ENOMEM;
> +       } else {
> +               /* If a page already exists, then it's UFFDIO_COPY for
> +                * a non-missing case. Return -EEXIST.
> +                */
> +               if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
> +                       ret = -EEXIST;
> +                       goto out;
> +               }
> +
>                 page = alloc_huge_page(dst_vma, dst_addr, 0);
> -               if (IS_ERR(page))
> +               if (IS_ERR(page)) {
> +                       ret = -ENOMEM;
>                         goto out;
> +               }
>
>                 ret = copy_huge_page_from_user(page,
>                                                 (const void __user *) src_addr,
> @@ -4904,9 +4914,6 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>                         /* don't free the page */
>                         goto out;
>                 }
> -       } else {
> -               page = *pagep;
> -               *pagep = NULL;
>         }
>
>         /*
> --
> 2.31.1.607.g51e8a6a459-goog

I just realized I missed CCing Andrew and the mailing lists on this
patch's review. I'll collect review comments from folks and send a v2
to the correct recipients and mailing lists.
Mina Almasry May 12, 2021, 7:42 p.m. UTC | #11
On Wed, May 12, 2021 at 10:22 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 5/12/21 8:53 AM, Axel Rasmussen wrote:
> > On Tue, May 11, 2021 at 11:58 PM Mina Almasry <almasrymina@google.com> wrote:
> >>
> >> When hugetlb_mcopy_atomic_pte() is called with:
> >> - mode==MCOPY_ATOMIC_NORMAL and,
> >> - we already have a page in the page cache corresponding to the
> >> associated address,
> >>
> >> We will allocate a huge page from the reserves, and then fail to insert it
> >> into the cache and return -EEXIST.
> >>
> >> In this case, we need to return -EEXIST without allocating a new page as
> >> the page already exists in the cache. Allocating the extra page causes
> >> the resv_huge_pages to underflow temporarily until the extra page is
> >> freed.
> >>
> >> Also, add the warning so that future similar instances of resv_huge_pages
> >> underflowing will be caught.
> >>
> >> Also, minor drive-by cleanups to this code path:
> >> - pagep is an out param never set by calling code, so delete code
> >> assuming there could be a valid page in there.
> >> - use hugetlbfs_pagecache_page() instead of repeating its
> >> implementation.
> >>
> >> Tested using:
> >> ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 1024 200 \
> >> /mnt/huge
> >>
> >> Test passes, and dmesg shows no underflow warnings.
> >>
> >> Signed-off-by: Mina Almasry <almasrymina@google.com>
> >> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> >> Cc: Axel Rasmussen <axelrasmussen@google.com>
> >> Cc: Peter Xu <peterx@redhat.com>
> >>
> >> ---
> >>  mm/hugetlb.c | 33 ++++++++++++++++++++-------------
> >>  1 file changed, 20 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index 629aa4c2259c..40f4ad1bca29 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -1165,6 +1165,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
> >>         page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
> >>         if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
> >>                 SetHPageRestoreReserve(page);
> >> +               WARN_ON_ONCE(!h->resv_huge_pages);
>
> This warning catches underflow in a relatively specific case.  In a
> previous email, you mentioned that you have seem underflow on production
> systems several times.  Was it this specific case?  I am also assuming
> that the underflow you saw was transient and corrected itself.  The
> value did not remain negative?
>
> As mentioned above, this warning only catches the specific case where
> resv_huge_pages goes negative in this routine.  If we believe this is
> possible, then there are likely more cases where resv_huge_pages is simply
> decremented when it should not.  For example: resv_huge_pages temporarily
> goes to 2034 from 2035 when it should not.  Also, there are several
> other places where resv_huge_pages could incorrectly be modified and go
> negative.
>

My only motivation for adding this particular warning is to make sure
this particular issue remains fixed and doesn't get re-introduced in
the future. If that's not too useful then I can remove it, no problem,
I'm not too attached to it or anything.

> I would prefer not to add this warning unless you have seen this
> specific case in production or some other environments.  If so, then
> please add the specifics.  I am not opposed to adding warnings or code to
> detect underflow or other accounting issues.  We just need to make sure
> they are likely to provide useful data.
>

I've actually looked at all the resv_huge_pages underflow issues we
have internally, and upon a closer look I find that they are all on
kernels so old they don't have 1b1edc140dc7 ("hugetlbfs: dirty pages
as they are added to pagecache") or any of the other patches that
fixed resv_huge_pages issues recently. I can't seem to find instances
new enough that they would be useful to look at, so I take back what I
said earlier. If any underflow issues pop up on our newer kernels I'll
bring this up again, but for now, it seems it's just this issue
related to userfaultfd. Sorry for the noise :(

> >>                 h->resv_huge_pages--;
> >>         }
> >>
> >> @@ -4868,30 +4869,39 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> >>                             struct page **pagep)
> >>  {
> >>         bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
> >> -       struct address_space *mapping;
> >> -       pgoff_t idx;
> >> +       struct hstate *h = hstate_vma(dst_vma);
> >> +       struct address_space *mapping = dst_vma->vm_file->f_mapping;
> >> +       pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> >>         unsigned long size;
> >>         int vm_shared = dst_vma->vm_flags & VM_SHARED;
> >> -       struct hstate *h = hstate_vma(dst_vma);
> >>         pte_t _dst_pte;
> >>         spinlock_t *ptl;
> >> -       int ret;
> >> +       int ret = -ENOMEM;
> >>         struct page *page;
> >>         int writable;
> >>
> >> -       mapping = dst_vma->vm_file->f_mapping;
> >> -       idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> >> +       /* Out parameter. */
> >> +       WARN_ON(*pagep);
> >
> > I don't think this warning works, because we do set *pagep, in the
> > copy_huge_page_from_user failure case. In that case, the following
> > happens:
> >
> > 1. We set *pagep, and return immediately.
> > 2. Our caller notices this particular error, drops mmap_lock, and then
> > calls us again with *pagep set.
> >
> > In this path, we're supposed to just re-use this existing *pagep
> > instead of allocating a second new page.
> >
> > I think this also means we need to keep the "else" case where *pagep
> > is set below.
> >
>
> +1 to Peter's comment.
>

Gah, sorry about that. I'll fix in v2.

> --
> Mike Kravetz
Peter Xu May 12, 2021, 8:14 p.m. UTC | #12
Mina,

On Wed, May 12, 2021 at 12:42:32PM -0700, Mina Almasry wrote:
> > >> @@ -4868,30 +4869,39 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > >>                             struct page **pagep)
> > >>  {
> > >>         bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
> > >> -       struct address_space *mapping;
> > >> -       pgoff_t idx;
> > >> +       struct hstate *h = hstate_vma(dst_vma);
> > >> +       struct address_space *mapping = dst_vma->vm_file->f_mapping;
> > >> +       pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> > >>         unsigned long size;
> > >>         int vm_shared = dst_vma->vm_flags & VM_SHARED;
> > >> -       struct hstate *h = hstate_vma(dst_vma);
> > >>         pte_t _dst_pte;
> > >>         spinlock_t *ptl;
> > >> -       int ret;
> > >> +       int ret = -ENOMEM;
> > >>         struct page *page;
> > >>         int writable;
> > >>
> > >> -       mapping = dst_vma->vm_file->f_mapping;
> > >> -       idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> > >> +       /* Out parameter. */
> > >> +       WARN_ON(*pagep);
> > >
> > > I don't think this warning works, because we do set *pagep, in the
> > > copy_huge_page_from_user failure case. In that case, the following
> > > happens:
> > >
> > > 1. We set *pagep, and return immediately.
> > > 2. Our caller notices this particular error, drops mmap_lock, and then
> > > calls us again with *pagep set.
> > >
> > > In this path, we're supposed to just re-use this existing *pagep
> > > instead of allocating a second new page.
> > >
> > > I think this also means we need to keep the "else" case where *pagep
> > > is set below.
> > >
> >
> > +1 to Peter's comment.
> >
> 
> Gah, sorry about that. I'll fix in v2.

I have a question regarding v1: how do you guarantee huge_add_to_page_cache()
won't fail again even if checked before page alloc?  Say, what if the page
cache got inserted after hugetlbfs_pagecache_present() (which is newly added in
your v1) but before huge_add_to_page_cache()?

I also have a feeling that we've been trying to work around something else, but
I can't tell yet as I'll probably need to read a bit more/better on how hugetlb
does the accounting and also on reservations.

Thanks,
Mike Kravetz May 12, 2021, 9:31 p.m. UTC | #13
On 5/12/21 1:14 PM, Peter Xu wrote:
> On Wed, May 12, 2021 at 12:42:32PM -0700, Mina Almasry wrote:
>>>>> @@ -4868,30 +4869,39 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>>>>> +       WARN_ON(*pagep);
>>>>
>>>> I don't think this warning works, because we do set *pagep, in the
>>>> copy_huge_page_from_user failure case. In that case, the following
>>>> happens:
>>>>
>>>> 1. We set *pagep, and return immediately.
>>>> 2. Our caller notices this particular error, drops mmap_lock, and then
>>>> calls us again with *pagep set.
>>>>
>>>> In this path, we're supposed to just re-use this existing *pagep
>>>> instead of allocating a second new page.
>>>>
>>>> I think this also means we need to keep the "else" case where *pagep
>>>> is set below.
>>>>
>>>
>>> +1 to Peter's comment.
>>>

Apologies to Axel (and Peter) as that comment was from Axel.

>>
>> Gah, sorry about that. I'll fix in v2.
> 
> I have a question regarding v1: how do you guarantee huge_add_to_page_cache()
> won't fail again even if checked before page alloc?  Say, what if the page
> cache got inserted after hugetlbfs_pagecache_present() (which is newly added in
> your v1) but before huge_add_to_page_cache()?

In the caller (__mcopy_atomic_hugetlb) we obtain the hugetlb fault mutex
before calling this routine.  This should prevent changes to the cache
while in the routine.

However, things get complicated in the case where copy_huge_page_from_user
fails.  In this case, we will return to the caller which will drop mmap_lock
and the hugetlb fault mutex before doing the copy.  After dropping the
mutex, someone could populate the cache.  This would result in the same
situation where two reserves are 'temporarily' consumed for the same
mapping offset.  By the time we get to the second call to
hugetlb_mcopy_atomic_pte where the previously allocated page is passed
in, it is too late.
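
To make the window described above concrete, a rough sketch of the retry flow as
discussed in this thread (not the actual __mcopy_atomic_hugetlb() code; locking
calls and parameter lists are elided or simplified):

	/* First attempt: mmap_lock and the hugetlb fault mutex are held. */
	err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
				       dst_addr, src_addr, mode, &page);
	if (err == -ENOENT) {
		/*
		 * The atomic copy from user memory failed; the freshly
		 * allocated page was handed back through the page pointer.
		 * Drop the fault mutex and mmap_lock and redo the copy with
		 * page faults allowed.
		 */
		err = copy_huge_page_from_user(page, ...,
					       /* allow_pagefault */ true);
		/*
		 * Race window: while the locks were dropped, another task may
		 * have added a page to the cache for this offset.  Retaking
		 * the locks and calling hugetlb_mcopy_atomic_pte() again with
		 * the previously allocated page then consumes a second
		 * reservation for the same offset, i.e. the transient underflow.
		 */
	}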
Mina Almasry May 12, 2021, 9:52 p.m. UTC | #14
On Wed, May 12, 2021 at 2:31 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 5/12/21 1:14 PM, Peter Xu wrote:
> > [... earlier discussion quoted above; snipped ...]
> >
> > I have a question regarding v1: how do you guarantee huge_add_to_page_cache()
> > won't fail again even if checked before page alloc?  Say, what if the page
> > cache got inserted after hugetlbfs_pagecache_present() (which is newly added in
> > your v1) but before huge_add_to_page_cache()?
>
> In the caller (__mcopy_atomic_hugetlb) we obtain the hugetlb fault mutex
> before calling this routine.  This should prevent changes to the cache
> while in the routine.
>
> However, things get complicated in the case where copy_huge_page_from_user
> fails.  In this case, we will return to the caller which will drop mmap_lock
> and the hugetlb fault mutex before doing the copy.  After dropping the
> mutex, someone could populate the cache.  This would result in the same
> situation where two reserves are 'temporarily' consumed for the same
> mapping offset.  By the time we get to the second call to
> hugetlb_mcopy_atomic_pte where the previously allocated page is passed
> in, it is too late.
>

Thanks. I tried locally allocating a page, adding it into the
cache, and *then* copying its contents (dropping the lock if the copy
fails). That also has the test passing, but I'm not sure whether I'm
starting a fire somewhere else by having a page in the cache with
uninitialized contents. The only other code that checks the cache seems
to be the hugetlb_fault/hugetlb_cow code. I'm reading that code to try
to understand whether I'm breaking it by doing this.

> --
> Mike Kravetz

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 629aa4c2259c..7d763eed650f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1165,7 +1165,21 @@  static struct page *dequeue_huge_page_vma(struct hstate *h,
        page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
        if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
                SetHPageRestoreReserve(page);
+               WARN_ON_ONCE(!h->resv_huge_pages);
                h->resv_huge_pages--;
        }

And ran the userfaultfd selftests like so:

echo 1024 > /proc/sys/vm/nr_hugepages
mkdir -p /mnt/huge
mount -t hugetlbfs none /mnt/huge
./tools/testing/selftests/vm/userfaultfd hugetlb_shared 1024 200 /mnt/huge/userfaultfd_test

And ran into this warning, indicating that the test does trigger an underflow:

[   11.163403] ------------[ cut here ]------------
[   11.163404] WARNING: CPU: 0 PID: 237 at mm/hugetlb.c:1178
alloc_huge_page+0x558/0x5a0