[v2] mm: vma: skip anonymous vma when inserting vma to file rmap tree

Message ID 20250312221521.1255690-1-yang@os.amperecomputing.com (mailing list archive)
State New

Commit Message

Yang Shi March 12, 2025, 10:15 p.m. UTC
LKP reported 800% performance improvement for small-allocs benchmark
from vm-scalability [1] with patch ("/dev/zero: make private mapping
full anonymous mapping") [2], but the patch was nack'ed since it changes
the output of smaps somewhat.

Profiling shows that one of the major sources of the performance
improvement is reduced contention on i_mmap_rwsem.

The small-allocs benchmark creates a lot of 40K memory maps by
mmap'ing private /dev/zero, then triggers page faults on the mappings.
When a private mapping of /dev/zero is created, an anonymous VMA is
created, but it has a valid vm_file.  The kernel basically assumes
anonymous VMAs have a NULL vm_file; for example, mmap inserts a VMA
into the file rmap tree if vm_file is not NULL.  So the private
/dev/zero mapping is inserted into the file rmap tree, which results in
contention on i_mmap_rwsem.  But it is actually an anonymous VMA, so it
is pointless to insert it into the file rmap tree.

Skip anonymous VMA for this case.  Over 400% performance improvement was
reported [3].

This is not on par with the 800% improvement from the original patch,
because the page fault handler needs to access some members of struct
file if vm_file is not NULL, for example f_mode and f_mapping.  They are
in the same cacheline as the file refcount.  When mmap'ing a file, the
file refcount is inc'ed and dec'ed, which causes bad false sharing.
Further debugging showed that checking whether the VMA is anonymous
can alleviate the problem.  But I'm not sure whether that is the best
way to handle it; maybe we should consider shuffling the layout of
struct file.

However, it seems rare for real-life applications to create that many
maps by mmap'ing private /dev/zero while sharing the same struct file,
so the false sharing problem may not be that bad.  The i_mmap_rwsem
contention problem seems more real, since all private /dev/zero
mappings, even from different applications, share the same struct
address_space and therefore the same i_mmap_rwsem.  Inserting an
anonymous VMA into the file rmap tree is also broken behavior, so it is
worth fixing from this perspective too.

[1] https://lore.kernel.org/linux-mm/202501281038.617c6b60-lkp@intel.com/
[2] https://lore.kernel.org/linux-mm/20250113223033.4054534-1-yang@os.amperecomputing.com/
[3] https://lore.kernel.org/linux-mm/Z6RshwXCWhAGoMOK@xsang-OptiPlex-9020/#t

Reported-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
v2:
   * Added the comments in code suggested by Lorenzo
   * Collected R-b from Lorenzo

 mm/vma.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

Comments

Vasily Gorbik March 12, 2025, 11:55 p.m. UTC | #1
On Wed, Mar 12, 2025 at 03:15:21PM -0700, Yang Shi wrote:
> LKP reported 800% performance improvement for small-allocs benchmark
> from vm-scalability [1] with patch ("/dev/zero: make private mapping
> full anonymous mapping") [2], but the patch was nack'ed since it changes
> the output of smaps somewhat.
...
> ---
> v2:
>    * Added the comments in code suggested by Lorenzo
>    * Collected R-b from Lorenze
> 
>  mm/vma.c | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)

Hi Yang,

Replying to v2, as the code is the same as v1 in linux-next:

The LTP test "mmap10" consistently triggers a kernel NULL pointer
dereference with this change, at least on x86 and s390. Reverting just
this single patch from linux-next fixes the issue.

LTP: starting mmap10
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 800000010d22a067 P4D 800000010d22a067 PUD 11ff09067 PMD 0 
Oops: Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 5 UID: 0 PID: 1719 Comm: mmap10 Not tainted 6.14.0-rc6-next-20250312 #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
RIP: 0010:__rb_insert_augmented+0x2b/0x1d0
Code: 0f 1e fa 48 89 f8 48 8b 3f 48 85 ff 0f 84 a4 01 00 00 41 55 49 89 f5 41 54 49 89 d4 55 53 48 8b 1f f6 c3 01 0f 85 e1 00 00 00 <48> 8b 53 08 48 39 fa 74 67 48 85 d2 74 09 f6 02 01 0f 84 a0 00 00
RSP: 0018:ffffc90002b47cc8 EFLAGS: 00010246
RAX: ffff8881143ab788 RBX: 0000000000000000 RCX: 00000000000009ff
RDX: ffffffff814ad5d0 RSI: ffff888100bb5060 RDI: ffff8881143ab088
RBP: ffff8881053af8c0 R08: ffff8881143ab700 R09: 00007ff6433f2000
R10: 00007ff6433f2000 R11: ffff8881143ab000 R12: ffffffff814ad5d0
R13: ffff888100bb5060 R14: ffff8881143ab700 R15: ffff8881143ab000
FS:  00007ff643df1740(0000) GS:ffff8882b45bf000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000011b042000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 ? __die_body.cold+0x19/0x2b
 ? page_fault_oops+0xc4/0x1f0
 ? search_extable+0x26/0x30
 ? search_module_extables+0x3f/0x60
 ? exc_page_fault+0x6b/0x150
 ? asm_exc_page_fault+0x26/0x30
 ? __pfx_vma_interval_tree_augment_rotate+0x10/0x10
 ? __pfx_vma_interval_tree_augment_rotate+0x10/0x10
 ? __rb_insert_augmented+0x2b/0x1d0
 copy_mm+0x48a/0x8c0
 copy_process+0xf98/0x1930
 kernel_clone+0xb7/0x3b0
 __do_sys_clone+0x65/0x90
 do_syscall_64+0x9e/0x1a0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff643eb2b00
Code: 31 c0 31 d2 31 f6 bf 11 00 20 01 48 89 e5 53 48 83 ec 08 64 48 8b 04 25 10 00 00 00 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 89 c3 85 c0 75 31 64 48 8b 04 25 10 00 00
RSP: 002b:00007ffdac219010 EFLAGS: 00000202 ORIG_RAX: 0000000000000038
RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007ff643eb2b00
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
RBP: 00007ffdac219020 R08: 0000000000000000 R09: 0000000000000000
R10: 00007ff643df1a10 R11: 0000000000000202 R12: 0000000000000001
R13: 0000000000000000 R14: 00007ff644036000 R15: 0000000000000000
 </TASK>
Modules linked in:
CR2: 0000000000000008
---[ end trace 0000000000000000 ]---
RIP: 0010:__rb_insert_augmented+0x2b/0x1d0
Code: 0f 1e fa 48 89 f8 48 8b 3f 48 85 ff 0f 84 a4 01 00 00 41 55 49 89 f5 41 54 49 89 d4 55 53 48 8b 1f f6 c3 01 0f 85 e1 00 00 00 <48> 8b 53 08 48 39 fa 74 67 48 85 d2 74 09 f6 02 01 0f 84 a0 00 00
RSP: 0018:ffffc90002b47cc8 EFLAGS: 00010246
RAX: ffff8881143ab788 RBX: 0000000000000000 RCX: 00000000000009ff
RDX: ffffffff814ad5d0 RSI: ffff888100bb5060 RDI: ffff8881143ab088
RBP: ffff8881053af8c0 R08: ffff8881143ab700 R09: 00007ff6433f2000
R10: 00007ff6433f2000 R11: ffff8881143ab000 R12: ffffffff814ad5d0
R13: ffff888100bb5060 R14: ffff8881143ab700 R15: ffff8881143ab000
FS:  00007ff643df1740(0000) GS:ffff8882b45bf000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000011b042000 CR4: 00000000000006f0



LTP: starting mmap10
Unable to handle kernel pointer dereference in virtual kernel address space
Failing address: 0000000000000000 TEID: 0000000000000483
Fault in home space mode while using kernel ASCE.
AS:000000000247c007 R3:00000001ffffc007 S:00000001ffffb801 P:000000000000013d
Oops: 0004 ilc:3 [#1] SMP
Modules linked in:
CPU: 0 UID: 0 PID: 665 Comm: mmap10 Not tainted 6.14.0-rc6-next-20250312 #16
Hardware name: IBM 3931 A01 704 (KVM/Linux)
Krnl PSW : 0704c00180000000 000003ffe0ee0440 (__rb_insert_augmented+0x60/0x210)
           R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
Krnl GPRS: 00000000009ff000 0000000000000000 000000008e5f7508 0000000084a7ed08
           00000000000009fe 0000000000000000 0000000000000000 0000037fe06c7b68
           00000000801d0e90 000003ffe04158d0 0000000084a7ed08 0000000000000000
           000003ffbb700000 00000000801d0e48 000003ffe0ee057c 0000037fe06c7a40
Krnl Code: 000003ffe0ee0430: e31030080004        lg      %r1,8(%r3)
           000003ffe0ee0436: ec1200888064        cgrj    %r1,%r2,8,000003ffe0ee0546
          #000003ffe0ee043c: b90400a3            lgr     %r10,%r3
          >000003ffe0ee0440: e310b0100024        stg     %r1,16(%r11)
           000003ffe0ee0446: e3b030080024        stg     %r11,8(%r3)
           000003ffe0ee044c: ec180009007c        cgij    %r1,0,8,000003ffe0ee045e
           000003ffe0ee0452: ec2b000100d9        aghik   %r2,%r11,1
           000003ffe0ee0458: e32010000024        stg     %r2,0(%r1)
Call Trace:
 [<000003ffe0ee0440>] __rb_insert_augmented+0x60/0x210
 [<000003ffe016d6c4>] dup_mmap+0x424/0x8c0
 [<000003ffe016dc62>] copy_mm+0x102/0x1c0
 [<000003ffe016e8ae>] copy_process+0x7ce/0x12b0
 [<000003ffe016f458>] kernel_clone+0x68/0x380
 [<000003ffe016f84a>] __do_sys_clone+0x5a/0x70
 [<000003ffe016faa0>] __s390x_sys_clone+0x40/0x50
 [<000003ffe011c9b6>] do_syscall.constprop.0+0x116/0x140
 [<000003ffe0ef1d64>] __do_syscall+0xd4/0x1c0
 [<000003ffe0efd044>] system_call+0x74/0x98
Last Breaking-Event-Address:
 [<000003ffe0ee058a>] __rb_insert_augmented+0x1aa/0x210
Kernel panic - not syncing: Fatal exception: panic_on_oops
Yang Shi March 13, 2025, 3:04 a.m. UTC | #2
On 3/12/25 4:55 PM, Vasily Gorbik wrote:
> On Wed, Mar 12, 2025 at 03:15:21PM -0700, Yang Shi wrote:
>> LKP reported 800% performance improvement for small-allocs benchmark
>> from vm-scalability [1] with patch ("/dev/zero: make private mapping
>> full anonymous mapping") [2], but the patch was nack'ed since it changes
>> the output of smaps somewhat.
> ...
>> ---
>> v2:
>>     * Added the comments in code suggested by Lorenzo
>>     * Collected R-b from Lorenze
>>
>>   mm/vma.c | 18 ++++++++++++++++--
>>   1 file changed, 16 insertions(+), 2 deletions(-)
> Hi Yang,
>
> Replying to v2, as the code is the same as v1 in linux-next:
>
> The LTP test "mmap10" consistently triggers a kernel NULL pointer
> dereference with this change, at least on x86 and s390. Reverting just
> this single patch from linux-next fixes the issue.

Hi Vasily,

Thanks for the report. It is because dup_mmap() inserts the VMA into the 
file rmap tree by checking whether vma->vm_file is NULL or not. This 
splat can be fixed by skipping anonymous VMAs, but that would expose a 
more severe problem: the struct file refcount may become imbalanced. The 
refcount is inc'ed in mmap, then inc'ed again by fork(), and it is dec'ed 
on unmap or process exit. If we skip the refcount inc in fork, we need to 
skip the refcount dec in unmap too, but there is still one refcount from 
mmap.

Can we dec the refcount in mmap once we see that it is actually an 
anonymous VMA? Unfortunately, no. If the refcount reaches 0, the struct 
file will be freed. We will run into UAF when looking up smaps IIUC. It 
may point to anything.

Lorenzo,

This problem seems more complicated than I thought at first. Making it a 
real anonymous VMA (vm_file is NULL) may still be the best option. But 
we need to figure out how to keep smaps compatible.

Andrew,

Can you please drop this patch from your tree?

Thanks,
Yang

Lorenzo Stoakes March 13, 2025, 5:16 a.m. UTC | #3
On Wed, Mar 12, 2025 at 08:04:23PM -0700, Yang Shi wrote:
>
>
> On 3/12/25 4:55 PM, Vasily Gorbik wrote:
> > On Wed, Mar 12, 2025 at 03:15:21PM -0700, Yang Shi wrote:
> > > LKP reported 800% performance improvement for small-allocs benchmark
> > > from vm-scalability [1] with patch ("/dev/zero: make private mapping
> > > full anonymous mapping") [2], but the patch was nack'ed since it changes
> > > the output of smaps somewhat.
> > ...
> > > ---
> > > v2:
> > >     * Added the comments in code suggested by Lorenzo
> > >     * Collected R-b from Lorenze
> > >
> > >   mm/vma.c | 18 ++++++++++++++++--
> > >   1 file changed, 16 insertions(+), 2 deletions(-)
> > Hi Yang,
> >
> > Replying to v2, as the code is the same as v1 in linux-next:
> >
> > The LTP test "mmap10" consistently triggers a kernel NULL pointer
> > dereference with this change, at least on x86 and s390. Reverting just
> > this single patch from linux-next fixes the issue.
>
> Hi Vasily,
>
> Thanks for the report. It is because dup_mmap() inserts the VMA into file
> rmap by checking whether vma->vm_file is NULL or not. This splat can be
> killed by skipping anonymous vma, but this actually will expose a more
> severe problem. The struct file refcount may be imbalance. The refcount is
> inc'ed in mmap, then inc'ed again by fork(), it is dec'ed when unmap or
> process exit. If we skip refcount inc in fork, we need skip refcount dec in
> unmap too, but there is still one refcount from mmap.
>
> Can we dec refcount in mmap if we see it is anonymous vma finally?
> Unfortunately, no. If the refcount reaches 0, the struct file will be freed.
> We will run into UAF when looking up smaps IIUC. It may point to anything.
>
> Lorenzo,
>
> This problem seems more complicated than what I thought in the first place.
> Making it is a real anonymous vma (vm_file is NULL) may be still the best
> option. But we need figure out how we can keep compatible smaps.

Ugh lord. I am not in favour of this for reasons aforementioned, and I _really_
don't want to special case this any more than we already do...

Let me think a bit about this also.

Maybe if you're at LSF we can chat about it there?

Thanks!

>
> Andrew,
>
> Can you please drop this patch from your tree?
>
> Thanks,
> Yang
>
Yang Shi March 13, 2025, 5:42 p.m. UTC | #4
On 3/12/25 10:16 PM, Lorenzo Stoakes wrote:
> On Wed, Mar 12, 2025 at 08:04:23PM -0700, Yang Shi wrote:
>>
>> On 3/12/25 4:55 PM, Vasily Gorbik wrote:
>>> On Wed, Mar 12, 2025 at 03:15:21PM -0700, Yang Shi wrote:
>>>> LKP reported 800% performance improvement for small-allocs benchmark
>>>> from vm-scalability [1] with patch ("/dev/zero: make private mapping
>>>> full anonymous mapping") [2], but the patch was nack'ed since it changes
>>>> the output of smaps somewhat.
>>> ...
>>>> ---
>>>> v2:
>>>>      * Added the comments in code suggested by Lorenzo
>>>>      * Collected R-b from Lorenze
>>>>
>>>>    mm/vma.c | 18 ++++++++++++++++--
>>>>    1 file changed, 16 insertions(+), 2 deletions(-)
>>> Hi Yang,
>>>
>>> Replying to v2, as the code is the same as v1 in linux-next:
>>>
>>> The LTP test "mmap10" consistently triggers a kernel NULL pointer
>>> dereference with this change, at least on x86 and s390. Reverting just
>>> this single patch from linux-next fixes the issue.
>> Hi Vasily,
>>
>> Thanks for the report. It is because dup_mmap() inserts the VMA into file
>> rmap by checking whether vma->vm_file is NULL or not. This splat can be
>> killed by skipping anonymous vma, but this actually will expose a more
>> severe problem. The struct file refcount may be imbalance. The refcount is
>> inc'ed in mmap, then inc'ed again by fork(), it is dec'ed when unmap or
>> process exit. If we skip refcount inc in fork, we need skip refcount dec in
>> unmap too, but there is still one refcount from mmap.
>>
>> Can we dec refcount in mmap if we see it is anonymous vma finally?
>> Unfortunately, no. If the refcount reaches 0, the struct file will be freed.
>> We will run into UAF when looking up smaps IIUC. It may point to anything.
>>
>> Lorenzo,
>>
>> This problem seems more complicated than what I thought in the first place.
>> Making it is a real anonymous vma (vm_file is NULL) may be still the best
>> option. But we need figure out how we can keep compatible smaps.
> Ugh lord. I am not in favour of this for reasons aforementioned, and I _really_
> don't want to special case this any more than we already do...

Yeah, understood. I meant we should find a way to keep smaps unchanged 
or compatible.

>
> Let me think a bit about this also.
>
> Maybe if you're at LSF we can chat about it there?

Unfortunately I can't make it this year. Have fun!

Thanks,
Yang

>
> Thanks!
>
>> Andrew,
>>
>> Can you please drop this patch from your tree?
>>
>> Thanks,
>> Yang
>>

Patch

diff --git a/mm/vma.c b/mm/vma.c
index c7abef5177cc..2fe99d181cfd 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -1648,6 +1648,10 @@  static void unlink_file_vma_batch_process(struct unlink_vma_file_batch *vb)
 void unlink_file_vma_batch_add(struct unlink_vma_file_batch *vb,
 			       struct vm_area_struct *vma)
 {
+	/* Rare, but e.g. /dev/zero sets vma->vm_file on an anon VMA */
+	if (vma_is_anonymous(vma))
+		return;
+
 	if (vma->vm_file == NULL)
 		return;
 
@@ -1671,8 +1675,13 @@  void unlink_file_vma_batch_final(struct unlink_vma_file_batch *vb)
  */
 void unlink_file_vma(struct vm_area_struct *vma)
 {
-	struct file *file = vma->vm_file;
+	struct file *file;
+
+	/* Rare, but e.g. /dev/zero sets vma->vm_file on an anon VMA */
+	if (vma_is_anonymous(vma))
+		return;
 
+	file = vma->vm_file;
 	if (file) {
 		struct address_space *mapping = file->f_mapping;
 
@@ -1684,9 +1693,14 @@  void unlink_file_vma(struct vm_area_struct *vma)
 
 void vma_link_file(struct vm_area_struct *vma)
 {
-	struct file *file = vma->vm_file;
+	struct file *file;
 	struct address_space *mapping;
 
+	/* Rare, but e.g. /dev/zero sets vma->vm_file on an anon VMA */
+	if (vma_is_anonymous(vma))
+		return;
+
+	file = vma->vm_file;
 	if (file) {
 		mapping = file->f_mapping;
 		i_mmap_lock_write(mapping);