mm/memfd: reserve hugetlb folios before allocation

Message ID

20250107072517.2089633-1-vivek.kasireddy@intel.com (mailing list archive)

State

New

Headers

From: Vivek Kasireddy <vivek.kasireddy@intel.com>
To: linux-mm@kvack.org
Cc: Vivek Kasireddy <vivek.kasireddy@intel.com>,
	syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com,
	Steve Sistare <steven.sistare@oracle.com>,
	Muchun Song <muchun.song@linux.dev>,
	David Hildenbrand <david@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: [PATCH] mm/memfd: reserve hugetlb folios before allocation
Date: Mon,  6 Jan 2025 23:25:17 -0800
Message-ID: <20250107072517.2089633-1-vivek.kasireddy@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series

mm/memfd: reserve hugetlb folios before allocation

Commit Message

Kasireddy, Vivek Jan. 7, 2025, 7:25 a.m. UTC

There are cases when we try to pin a folio but discover that it has
not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
but there is a chance that we might encounter a crash/failure
(VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
at that instant. This issue was reported by syzbot:

kernel BUG at mm/hugetlb.c:2403!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted
6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89
f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66
90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
FS:  00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
 memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
 udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
 udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
 udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
 udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:906 [inline]
 __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Therefore, to avoid this situation and fix this issue, we just need
to make a reservation before we try to allocate the folio. While at
it, also remove the VM_BUG_ON() as there is no need to crash the
system in this scenario and instead we could just fail the allocation.

Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak")
Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Cc: Steve Sistare <steven.sistare@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 mm/hugetlb.c | 9 ++++++---
 mm/memfd.c   | 5 +++++
 2 files changed, 11 insertions(+), 3 deletions(-)
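
The accounting described in the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct hstate_model and the model_*() helpers are hypothetical names invented for illustration, not kernel APIs, and they track only two counters, whereas the real code also handles subpools, reservation maps and locking.

```c
#include <stdbool.h>

/* Toy stand-in for the two hstate counters discussed in this thread. */
struct hstate_model {
	unsigned long free_huge_pages;
	unsigned long resv_huge_pages;
};

/* Hypothetical reserve step: succeeds only if an unreserved free page exists. */
static bool model_reserve(struct hstate_model *h)
{
	if (h->free_huge_pages <= h->resv_huge_pages)
		return false;
	h->resv_huge_pages++;
	return true;
}

/* Hypothetical undo of model_reserve() for the failure path. */
static void model_unreserve(struct hstate_model *h)
{
	if (h->resv_huge_pages)
		h->resv_huge_pages--;
}

/*
 * Models the patched allocation helper: with no active reservation it
 * fails the allocation instead of asserting.
 */
static bool model_alloc_reserved(struct hstate_model *h)
{
	if (!h->resv_huge_pages || !h->free_huge_pages)
		return false;
	h->free_huge_pages--;
	h->resv_huge_pages--;
	return true;
}

/* Models the patched caller: reserve first, allocate, undo on failure. */
static int model_memfd_alloc(struct hstate_model *h)
{
	if (!model_reserve(h))
		return -1;
	if (model_alloc_reserved(h))
		return 0;
	model_unreserve(h);
	return -1;
}
```

Starting from one free page and no reservations, calling model_alloc_reserved() directly fails, while model_memfd_alloc() reserves first and then succeeds. That mirrors the ordering the patch proposes.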

Comments

David Hildenbrand Jan. 7, 2025, 9:36 a.m. UTC | #1

On 07.01.25 08:25, Vivek Kasireddy wrote:
> There are cases when we try to pin a folio but discover that it has
> not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
> but there is a chance that we might encounter a crash/failure
> (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
> at that instant. This issue was reported by syzbot:
> 
> kernel BUG at mm/hugetlb.c:2403!
> Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted
> 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
> RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
> Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89
> f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66
> 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
> RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
> RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
> RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
> RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
> R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
> R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
> FS:  00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000)
> knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>   <TASK>
>   memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
>   memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
>   udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
>   udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
>   udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
>   udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
>   vfs_ioctl fs/ioctl.c:51 [inline]
>   __do_sys_ioctl fs/ioctl.c:906 [inline]
>   __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
>   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
>   entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> Therefore, to avoid this situation and fix this issue, we just need
> to make a reservation before we try to allocate the folio. While at
> it, also remove the VM_BUG_ON() as there is no need to crash the
> system in this scenario and instead we could just fail the allocation.
> 
> Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak")
> Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com
> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> Cc: Steve Sistare <steven.sistare@oracle.com>
> Cc: Muchun Song <muchun.song@linux.dev>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>   mm/hugetlb.c | 9 ++++++---
>   mm/memfd.c   | 5 +++++
>   2 files changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c498874a7170..e46c461210a4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2397,12 +2397,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
>   	struct folio *folio;
>   
>   	spin_lock_irq(&hugetlb_lock);
> +	if (!h->resv_huge_pages) {

Should this be a "if (WARN_ON_ONCE(!h->resv_huge_pages)) {", because the 
"_reserve" in the function indicates that this is indeed something that 
must never happen?

> +		spin_unlock_irq(&hugetlb_lock);
> +		return NULL;
> +	}
> +
>   	folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, preferred_nid,
>   					       nmask);
> -	if (folio) {
> -		VM_BUG_ON(!h->resv_huge_pages);
> +	if (folio)
>   		h->resv_huge_pages--;
> -	}
>   
>   	spin_unlock_irq(&hugetlb_lock);
>   	return folio;
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 35a370d75c9a..a3012c444285 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
>   		gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
>   		idx >>= huge_page_order(h);
>   
> +		if (!hugetlb_reserve_pages(file_inode(memfd),
> +					   idx, idx + 1, NULL, 0))
> +			return ERR_PTR(-ENOMEM);
> +
>   		folio = alloc_hugetlb_folio_reserve(h,
>   						    numa_node_id(),
>   						    NULL,
> @@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
>   			folio_unlock(folio);
>   			return folio;
>   		}
> +		hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1);
>   		return ERR_PTR(-ENOMEM);

Staring at hugetlb_reserve_pages() I assume this will also work as 
expected if already reserved.
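
The warn-once idea suggested above could look roughly like the following userspace sketch. Everything here is an assumption made for illustration: warn_on_once() is a hand-written stand-in for the kernel's WARN_ON_ONCE() macro, and struct hstate_model and take_reserved_page() are hypothetical names, not kernel code.

```c
#include <stdbool.h>
#include <stdio.h>

/* Counts how many warnings were actually printed (for demonstration). */
static int warn_count;

/*
 * Hypothetical stand-in for WARN_ON_ONCE(): reports the condition the
 * first time it is true, and always returns the condition to the caller.
 */
static bool warn_on_once(bool cond, bool *warned, const char *what)
{
	if (cond && !*warned) {
		*warned = true;
		warn_count++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/* Toy stand-in for the two hstate counters. */
struct hstate_model {
	unsigned long free_huge_pages;
	unsigned long resv_huge_pages;
};

/* Fails with a one-time warning when called without a reservation. */
static bool take_reserved_page(struct hstate_model *h)
{
	static bool warned;

	if (warn_on_once(!h->resv_huge_pages, &warned,
			 "allocation attempted without a reservation"))
		return false;
	if (!h->free_huge_pages)
		return false;
	h->free_huge_pages--;
	h->resv_huge_pages--;
	return true;
}
```

Repeated calls with resv_huge_pages == 0 all fail but log only once, so a caller that skips the reservation step is flagged without flooding the log or crashing.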

Steven Sistare Jan. 7, 2025, 5:12 p.m. UTC | #2

On 1/7/2025 2:25 AM, Vivek Kasireddy wrote:
> There are cases when we try to pin a folio but discover that it has
> not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
> but there is a chance that we might encounter a crash/failure
> (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
> at that instant. This issue was reported by syzbot:
> 
> kernel BUG at mm/hugetlb.c:2403!
> Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted
> 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
> RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
> Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89
> f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66
> 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
> RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
> RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
> RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
> RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
> R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
> R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
> FS:  00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000)
> knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>   <TASK>
>   memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
>   memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
>   udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
>   udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
>   udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
>   udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
>   vfs_ioctl fs/ioctl.c:51 [inline]
>   __do_sys_ioctl fs/ioctl.c:906 [inline]
>   __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
>   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
>   entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> Therefore, to avoid this situation and fix this issue, we just need
> to make a reservation before we try to allocate the folio. While at
> it, also remove the VM_BUG_ON() as there is no need to crash the
> system in this scenario and instead we could just fail the allocation.
> 
> Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak")
> Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com
> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> Cc: Steve Sistare <steven.sistare@oracle.com>
> Cc: Muchun Song <muchun.song@linux.dev>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>   mm/hugetlb.c | 9 ++++++---
>   mm/memfd.c   | 5 +++++
>   2 files changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c498874a7170..e46c461210a4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2397,12 +2397,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
>   	struct folio *folio;
>   
>   	spin_lock_irq(&hugetlb_lock);
> +	if (!h->resv_huge_pages) {
> +		spin_unlock_irq(&hugetlb_lock);
> +		return NULL;
> +	}

This should be the entire fix, plus deleting the VM_BUG_ON.  See below.

> +
>   	folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, preferred_nid,
>   					       nmask);
> -	if (folio) {
> -		VM_BUG_ON(!h->resv_huge_pages);
> +	if (folio)
>   		h->resv_huge_pages--;
> -	}
>   
>   	spin_unlock_irq(&hugetlb_lock);
>   	return folio;
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 35a370d75c9a..a3012c444285 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
>   		gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
>   		idx >>= huge_page_order(h);
>   
> +		if (!hugetlb_reserve_pages(file_inode(memfd),
> +					   idx, idx + 1, NULL, 0))
> +			return ERR_PTR(-ENOMEM);

I believe it is wrong to force a reservation here.  Pages should have already been
reserved at this point, e.g. by calls from hugetlbfs_file_mmap or hugetlb_file_setup.
syzkaller has forced its way down this path without calling those prerequisites,
doing weird stuff as it should.

To fix, I suggest you simply fix alloc_hugetlb_folio_reserve as above.

- Steve

> +
>   		folio = alloc_hugetlb_folio_reserve(h,
>   						    numa_node_id(),
>   						    NULL,
> @@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
>   			folio_unlock(folio);
>   			return folio;
>   		}
> +		hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1);
>   		return ERR_PTR(-ENOMEM);
>   	}
>   #endif

Kasireddy, Vivek Jan. 8, 2025, 6:59 a.m. UTC | #3

Hi David,

> > There are cases when we try to pin a folio but discover that it has
> > not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
> > but there is a chance that we might encounter a crash/failure
> > (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
> > at that instant. This issue was reported by syzbot:
> >
> > kernel BUG at mm/hugetlb.c:2403!
> > Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> > CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted
> > 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
> > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> > 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
> > RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
> > Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89
> > f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66
> > 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
> > RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
> > RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
> > RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
> > RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
> > R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
> > R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
> > FS:  00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000)
> > knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> >   <TASK>
> >   memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
> >   memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
> >   udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
> >   udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
> >   udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
> >   udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
> >   vfs_ioctl fs/ioctl.c:51 [inline]
> >   __do_sys_ioctl fs/ioctl.c:906 [inline]
> >   __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
> >   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> >   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
> >   entry_SYSCALL_64_after_hwframe+0x77/0x7f
> >
> > Therefore, to avoid this situation and fix this issue, we just need
> > to make a reservation before we try to allocate the folio. While at
> > it, also remove the VM_BUG_ON() as there is no need to crash the
> > system in this scenario and instead we could just fail the allocation.
> >
> > Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages
> leak")
> > Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com
> > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> > Cc: Steve Sistare <steven.sistare@oracle.com>
> > Cc: Muchun Song <muchun.song@linux.dev>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > ---
> >   mm/hugetlb.c | 9 ++++++---
> >   mm/memfd.c   | 5 +++++
> >   2 files changed, 11 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index c498874a7170..e46c461210a4 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2397,12 +2397,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct
> hstate *h, int preferred_nid,
> >   	struct folio *folio;
> >
> >   	spin_lock_irq(&hugetlb_lock);
> > +	if (!h->resv_huge_pages) {
> 
> Should this be a "if (WARN_ON_ONCE(!h->resv_huge_pages)) {", because the
> "_reserve" in the function indicates that this is indeed something that
> must never happen?
Yes, adding a WARN_ON_ONCE makes sense in this case. I'll add it in v2.

> 
> > +		spin_unlock_irq(&hugetlb_lock);
> > +		return NULL;
> > +	}
> > +
> >   	folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
> preferred_nid,
> >   					       nmask);
> > -	if (folio) {
> > -		VM_BUG_ON(!h->resv_huge_pages);
> > +	if (folio)
> >   		h->resv_huge_pages--;
> > -	}
> >
> >   	spin_unlock_irq(&hugetlb_lock);
> >   	return folio;
> > diff --git a/mm/memfd.c b/mm/memfd.c
> > index 35a370d75c9a..a3012c444285 100644
> > --- a/mm/memfd.c
> > +++ b/mm/memfd.c
> > @@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file *memfd,
> pgoff_t idx)
> >   		gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
> >   		idx >>= huge_page_order(h);
> >
> > +		if (!hugetlb_reserve_pages(file_inode(memfd),
> > +					   idx, idx + 1, NULL, 0))
> > +			return ERR_PTR(-ENOMEM);
> > +
> >   		folio = alloc_hugetlb_folio_reserve(h,
> >   						    numa_node_id(),
> >   						    NULL,
> > @@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file *memfd,
> pgoff_t idx)
> >   			folio_unlock(folio);
> >   			return folio;
> >   		}
> > +		hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1);
> >   		return ERR_PTR(-ENOMEM);
> 
> Staring at hugetlb_reserve_pages() I assume this will also work as
> expected if already reserved.
IIUC, if a range is already reserved and hugetlb_reserve_pages() gets
called for the same range, it mostly becomes a no-op, as region_chg()
and region_add() return 0 (based on my light testing).

Thanks,
Vivek

> 
> 
> --
> Cheers,
> 
> David / dhildenb
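
The "no-op when already reserved" behavior described in this reply can be pictured with a simplified model. It is a sketch only: the kernel tracks reservations as a list of regions in a reservation map, whereas this hypothetical struct resv_map_model uses a fixed-size array of flags, and model_reserve_range() is an illustrative name rather than a kernel function.

```c
#include <stdbool.h>

#define MODEL_MAX_PAGES 64

/* Toy per-file reservation map: one flag per huge page index. */
struct resv_map_model {
	bool reserved[MODEL_MAX_PAGES];
};

/*
 * Reserve every index in [from, to). Returns how many indices were newly
 * reserved, or -1 for an invalid range. Indices that were already
 * reserved are left alone, so repeating a call adds nothing.
 */
static long model_reserve_range(struct resv_map_model *m, long from, long to)
{
	long added = 0;

	if (from < 0 || to > MODEL_MAX_PAGES || from > to)
		return -1;
	for (long i = from; i < to; i++) {
		if (!m->reserved[i]) {
			m->reserved[i] = true;
			added++;
		}
	}
	return added;
}
```

In this model, reserving [idx, idx + 1) twice adds one page the first time and zero the second, and a later reservation over an overlapping range only accounts for the indices not yet covered.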

Kasireddy, Vivek Jan. 8, 2025, 7:24 a.m. UTC | #4

Hi Steve,

> > There are cases when we try to pin a folio but discover that it has
> > not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
> > but there is a chance that we might encounter a crash/failure
> > (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
> > at that instant. This issue was reported by syzbot:
> >
> > kernel BUG at mm/hugetlb.c:2403!
> > Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> > CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted
> > 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
> > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> > 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
> > RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
> > Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89
> > f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66
> > 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
> > RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
> > RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
> > RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
> > RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
> > R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
> > R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
> > FS:  00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000)
> > knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> >   <TASK>
> >   memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
> >   memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
> >   udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
> >   udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
> >   udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
> >   udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
> >   vfs_ioctl fs/ioctl.c:51 [inline]
> >   __do_sys_ioctl fs/ioctl.c:906 [inline]
> >   __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
> >   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> >   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
> >   entry_SYSCALL_64_after_hwframe+0x77/0x7f
> >
> > Therefore, to avoid this situation and fix this issue, we just need
> > to make a reservation before we try to allocate the folio. While at
> > it, also remove the VM_BUG_ON() as there is no need to crash the
> > system in this scenario and instead we could just fail the allocation.
> >
> > Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages
> leak")
> > Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com
> > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> > Cc: Steve Sistare <steven.sistare@oracle.com>
> > Cc: Muchun Song <muchun.song@linux.dev>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > ---
> >   mm/hugetlb.c | 9 ++++++---
> >   mm/memfd.c   | 5 +++++
> >   2 files changed, 11 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index c498874a7170..e46c461210a4 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2397,12 +2397,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct
> hstate *h, int preferred_nid,
> >   	struct folio *folio;
> >
> >   	spin_lock_irq(&hugetlb_lock);
> > +	if (!h->resv_huge_pages) {
> > +		spin_unlock_irq(&hugetlb_lock);
> > +		return NULL;
> > +	}
> 
> This should be the entire fix, plus deleting the VM_BUG_ON.  See below.
> 
> > +
> >   	folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
> preferred_nid,
> >   					       nmask);
> > -	if (folio) {
> > -		VM_BUG_ON(!h->resv_huge_pages);
> > +	if (folio)
> >   		h->resv_huge_pages--;
> > -	}
> >
> >   	spin_unlock_irq(&hugetlb_lock);
> >   	return folio;
> > diff --git a/mm/memfd.c b/mm/memfd.c
> > index 35a370d75c9a..a3012c444285 100644
> > --- a/mm/memfd.c
> > +++ b/mm/memfd.c
> > @@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file *memfd,
> pgoff_t idx)
> >   		gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
> >   		idx >>= huge_page_order(h);
> >
> > +		if (!hugetlb_reserve_pages(file_inode(memfd),
> > +					   idx, idx + 1, NULL, 0))
> > +			return ERR_PTR(-ENOMEM);
> 
> I believe it is wrong to force a reservation here. 
Is there any particular reason why you believe a reservation here would be wrong?
AFAICS, at the moment, we are not doing any region/subpool accounting before
our folio allocation and this gets flagged in the form of elevated resv_huge_pages
value (hugetlb_acct_memory() elevates it based on the return value of region_del())
when hugetlb_unreserve_pages() eventually gets called. 

> Pages should have already been
> reserved at this point, e.g. by calls from hugetlbfs_file_mmap or hugetlb_file_setup.
hugetlb_file_setup() does not reserve any pages as it passes in VM_NORESERVE.
And, the case we are trying to address is exactly when hugetlbfs_file_mmap() does
not get called before pinning. So, when hugetlbfs_file_mmap() does eventually
get called, I don't see any problem if it calls hugetlb_reserve_pages() again for the
same range or overlapping ranges.

> syzkaller has forced its way down this path without calling those prerequisites,
> doing weird stuff as it should.
This issue is not very hard to reproduce. If we have free_huge_pages > 0 and
resv_huge_pages = 0, and then we call memfd_pin_folios() before mmap()/
hugetlbfs_file_mmap() we can easily encounter this issue. Furthermore, we
should be able to allocate a folio in this scenario (as available_huge_pages > 0),
which we would not be able to do if we don't call hugetlb_reserve_pages().
Note that hugetlb_reserve_pages() actually elevates resv_huge_pages in
this case and kind of gives a go-ahead for the allocation.

I have used a slightly modified udmabuf selftest to reproduce this issue which
I'll send out as part of v2.

Thanks,
Vivek

> 
> To fix, I suggest you simply fix alloc_hugetlb_folio_reserve as above.
> 
> - Steve
> 
> > +
> >   		folio = alloc_hugetlb_folio_reserve(h,
> >   						    numa_node_id(),
> >   						    NULL,
> > @@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file *memfd,
> pgoff_t idx)
> >   			folio_unlock(folio);
> >   			return folio;
> >   		}
> > +		hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1);
> >   		return ERR_PTR(-ENOMEM);
> >   	}
> >   #endif

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c498874a7170..e46c461210a4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2397,12 +2397,15 @@  struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
 	struct folio *folio;
 
 	spin_lock_irq(&hugetlb_lock);
+	if (!h->resv_huge_pages) {
+		spin_unlock_irq(&hugetlb_lock);
+		return NULL;
+	}
+
 	folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, preferred_nid,
 					       nmask);
-	if (folio) {
-		VM_BUG_ON(!h->resv_huge_pages);
+	if (folio)
 		h->resv_huge_pages--;
-	}
 
 	spin_unlock_irq(&hugetlb_lock);
 	return folio;
diff --git a/mm/memfd.c b/mm/memfd.c
index 35a370d75c9a..a3012c444285 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -85,6 +85,10 @@  struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
 		gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
 		idx >>= huge_page_order(h);
 
+		if (!hugetlb_reserve_pages(file_inode(memfd),
+					   idx, idx + 1, NULL, 0))
+			return ERR_PTR(-ENOMEM);
+
 		folio = alloc_hugetlb_folio_reserve(h,
 						    numa_node_id(),
 						    NULL,
@@ -100,6 +104,7 @@  struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
 			folio_unlock(folio);
 			return folio;
 		}
+		hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1);
 		return ERR_PTR(-ENOMEM);
 	}
 #endif
