
[v7,0/6] Per-VMA lock support for swap and userfaults

Message ID 20230630211957.1341547-1-surenb@google.com (mailing list archive)

Message

Suren Baghdasaryan June 30, 2023, 9:19 p.m. UTC
When per-VMA locks were introduced in [1], several types of page faults
would still fall back to mmap_lock to keep the patchset simple. Among them
are swap and userfault pages. The main reason for skipping those cases was
the fact that mmap_lock could be dropped while handling these faults,
which required additional logic to be implemented.
Implement the mechanism to allow per-VMA locks to be dropped for these
cases.
First, change handle_mm_fault to drop per-VMA locks when returning
VM_FAULT_RETRY or VM_FAULT_COMPLETED, to be consistent with the way
mmap_lock is handled. Then change folio_lock_or_retry to accept vm_fault
and return vm_fault_t, which simplifies later patches. Finally, allow swap
and uffd page faults to be handled under per-VMA locks by dropping the
per-VMA lock and retrying, the same way it is done under mmap_lock.
Naturally, once the VMA lock is dropped, that VMA should be assumed
unstable and can't be used.
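
To make the new calling convention concrete, here is a minimal sketch of
how an arch fault handler is expected to interact with handle_mm_fault()
after patch 3/6. The wrapper name below is made up for illustration; the
architectures open-code this logic in their fault handlers:

#include <linux/mm.h>

static vm_fault_t do_vma_locked_fault(struct mm_struct *mm,
                                      unsigned long address,
                                      unsigned int flags,
                                      struct pt_regs *regs)
{
        struct vm_area_struct *vma;
        vm_fault_t fault;

        vma = lock_vma_under_rcu(mm, address);
        if (!vma)
                return VM_FAULT_RETRY; /* caller falls back to mmap_lock */

        fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
        /*
         * VM_FAULT_RETRY and VM_FAULT_COMPLETED mean handle_mm_fault()
         * already dropped the per-VMA lock and the VMA is unstable from
         * that point on, so release it only for other return values.
         */
        if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
                vma_end_read(vma);

        return fault;
}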

Changes since v6 posted at [2]
- 4/6 replaced the ternary operation in folio_lock_or_retry,
per Matthew Wilcox
- 4/6 changed return code description for __folio_lock_or_retry
per Matthew Wilcox

Note: patch 3/6 will cause a trivial merge conflict in arch/arm64/mm/fault.c
when applied over the mm-unstable branch, due to a patch from the ARM64
tree [3] which is missing in mm-unstable.

[1] https://lore.kernel.org/all/20230227173632.3292573-1-surenb@google.com/
[2] https://lore.kernel.org/all/20230630020436.1066016-1-surenb@google.com/
[3] https://lore.kernel.org/all/20230524131305.2808-1-jszhang@kernel.org/

Suren Baghdasaryan (6):
  swap: remove remnants of polling from read_swap_cache_async
  mm: add missing VM_FAULT_RESULT_TRACE name for VM_FAULT_COMPLETED
  mm: drop per-VMA lock when returning VM_FAULT_RETRY or
    VM_FAULT_COMPLETED
  mm: change folio_lock_or_retry to use vm_fault directly
  mm: handle swap page faults under per-VMA lock
  mm: handle userfaults under VMA lock

 arch/arm64/mm/fault.c    |  3 ++-
 arch/powerpc/mm/fault.c  |  3 ++-
 arch/s390/mm/fault.c     |  3 ++-
 arch/x86/mm/fault.c      |  3 ++-
 fs/userfaultfd.c         | 34 ++++++++++++----------------
 include/linux/mm.h       | 37 ++++++++++++++++++++++++++++++
 include/linux/mm_types.h |  3 ++-
 include/linux/pagemap.h  | 11 +++++----
 mm/filemap.c             | 37 +++++++++++++++---------------
 mm/madvise.c             |  4 ++--
 mm/memory.c              | 49 ++++++++++++++++++++++------------------
 mm/swap.h                |  1 -
 mm/swap_state.c          | 12 ++++------
 13 files changed, 120 insertions(+), 80 deletions(-)

Comments

Andrew Morton July 2, 2023, 5:50 p.m. UTC | #1
On Fri, 30 Jun 2023 14:19:51 -0700 Suren Baghdasaryan <surenb@google.com> wrote:

> When per-VMA locks were introduced in [1] several types of page faults
> would still fall back to mmap_lock to keep the patchset simple. Among them
> are swap and userfault pages. The main reason for skipping those cases was
> the fact that mmap_lock could be dropped while handling these faults and
> that required additional logic to be implemented.
> Implement the mechanism to allow per-VMA locks to be dropped for these
> cases.
> First, change handle_mm_fault to drop per-VMA locks when returning
> VM_FAULT_RETRY or VM_FAULT_COMPLETED to be consistent with the way
> mmap_lock is handled. Then change folio_lock_or_retry to accept vm_fault
> and return vm_fault_t which simplifies later patches. Finally allow swap
> and uffd page faults to be handled under per-VMA locks by dropping per-VMA
> and retrying, the same way it's done under mmap_lock.
> Naturally, once VMA lock is dropped that VMA should be assumed unstable
> and can't be used.

Is there any measurable performance benefit from this?
Suren Baghdasaryan July 3, 2023, 3:27 p.m. UTC | #2
On Sun, Jul 2, 2023 at 5:50 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 30 Jun 2023 14:19:51 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > When per-VMA locks were introduced in [1] several types of page faults
> > would still fall back to mmap_lock to keep the patchset simple. Among them
> > are swap and userfault pages. The main reason for skipping those cases was
> > the fact that mmap_lock could be dropped while handling these faults and
> > that required additional logic to be implemented.
> > Implement the mechanism to allow per-VMA locks to be dropped for these
> > cases.
> > First, change handle_mm_fault to drop per-VMA locks when returning
> > VM_FAULT_RETRY or VM_FAULT_COMPLETED to be consistent with the way
> > mmap_lock is handled. Then change folio_lock_or_retry to accept vm_fault
> > and return vm_fault_t which simplifies later patches. Finally allow swap
> > and uffd page faults to be handled under per-VMA locks by dropping per-VMA
> > and retrying, the same way it's done under mmap_lock.
> > Naturally, once VMA lock is dropped that VMA should be assumed unstable
> > and can't be used.
>
> Is there any measurable performance benefit from this?

Good point. I haven't measured it but assume it will have the same
effect as for other page fault cases handled under per-VMA locks
(mmap_lock contention reduction). I'll try to create a test to measure
the effects.
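
For illustration, a rough userspace sketch of such a measurement: several
threads hammer page faults while another thread repeatedly takes mmap_lock
for write via mmap()/munmap(). This is only a sketch of the general
approach, not the actual test; exercising the swap path specifically would
also need memory pressure:

#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define NR_FAULT_THREADS 8
#define REGION_SIZE (64UL << 20)        /* 64 MiB per fault thread */
#define PAGE 4096UL
#define SECONDS 10

static volatile int stop;
static unsigned long faults[NR_FAULT_THREADS];

/* Repeatedly fault a private anonymous region in and drop it again. */
static void *fault_worker(void *arg)
{
        long id = (long)arg;
        char *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return NULL;
        while (!stop) {
                for (unsigned long off = 0; off < REGION_SIZE; off += PAGE) {
                        p[off] = 1;             /* one minor fault per page */
                        faults[id]++;
                }
                madvise(p, REGION_SIZE, MADV_DONTNEED);
        }
        munmap(p, REGION_SIZE);
        return NULL;
}

/* Keep taking mmap_lock for write to create contention with the faults. */
static void *mmap_worker(void *arg)
{
        (void)arg;
        while (!stop) {
                void *p = mmap(NULL, PAGE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p != MAP_FAILED)
                        munmap(p, PAGE);
        }
        return NULL;
}

int main(void)
{
        pthread_t tids[NR_FAULT_THREADS], mapper;
        unsigned long total = 0;

        for (long i = 0; i < NR_FAULT_THREADS; i++)
                pthread_create(&tids[i], NULL, fault_worker, (void *)i);
        pthread_create(&mapper, NULL, mmap_worker, NULL);

        sleep(SECONDS);
        stop = 1;

        for (long i = 0; i < NR_FAULT_THREADS; i++) {
                pthread_join(tids[i], NULL);
                total += faults[i];
        }
        pthread_join(mapper, NULL);

        printf("%lu faults in %d seconds (%lu faults/sec)\n",
               total, SECONDS, total / SECONDS);
        return 0;
}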
Thanks,
Suren.
David Hildenbrand Aug. 9, 2023, 7:48 a.m. UTC | #3
On 30.06.23 23:19, Suren Baghdasaryan wrote:
> When per-VMA locks were introduced in [1] several types of page faults
> would still fall back to mmap_lock to keep the patchset simple. Among them
> are swap and userfault pages. The main reason for skipping those cases was
> the fact that mmap_lock could be dropped while handling these faults and
> that required additional logic to be implemented.
> Implement the mechanism to allow per-VMA locks to be dropped for these
> cases.
> First, change handle_mm_fault to drop per-VMA locks when returning
> VM_FAULT_RETRY or VM_FAULT_COMPLETED to be consistent with the way
> mmap_lock is handled. Then change folio_lock_or_retry to accept vm_fault
> and return vm_fault_t which simplifies later patches. Finally allow swap
> and uffd page faults to be handled under per-VMA locks by dropping per-VMA
> and retrying, the same way it's done under mmap_lock.
> Naturally, once VMA lock is dropped that VMA should be assumed unstable
> and can't be used.
> 
> Changes since v6 posted at [2]
> - 4/6 replaced the ternary operation in folio_lock_or_retry,
> per Matthew Wilcox
> - 4/6 changed return code description for __folio_lock_or_retry
> per Matthew Wilcox
> 
> Note: patch 3/6 will cause a trivial merge conflict in arch/arm64/mm/fault.c
> when applied over mm-unstable branch due to a patch from ARM64 tree [3]
> which is missing in mm-unstable.
> 
> [1] https://lore.kernel.org/all/20230227173632.3292573-1-surenb@google.com/
> [2] https://lore.kernel.org/all/20230630020436.1066016-1-surenb@google.com/
> [3] https://lore.kernel.org/all/20230524131305.2808-1-jszhang@kernel.org/
> 
> Suren Baghdasaryan (6):
>    swap: remove remnants of polling from read_swap_cache_async
>    mm: add missing VM_FAULT_RESULT_TRACE name for VM_FAULT_COMPLETED
>    mm: drop per-VMA lock when returning VM_FAULT_RETRY or
>      VM_FAULT_COMPLETED
>    mm: change folio_lock_or_retry to use vm_fault directly
>    mm: handle swap page faults under per-VMA lock
>    mm: handle userfaults under VMA lock

On mm/mm-unstable, running the selftests, I get:

Testing sigbus-wp on shmem... [  383.215804] mm ffff9666078e5280 task_size 140737488351232
[  383.215804] get_unmapped_area ffffffffad03b980
[  383.215804] mmap_base 140378441285632 mmap_legacy_base 47254353883136
[  383.215804] pgd ffff966608960000 mm_users 1 mm_count 6 pgtables_bytes 126976 map_count 28
[  383.215804] hiwater_rss 6183 hiwater_vm 8aa7 total_vm 8aa7 locked_vm 0
[  383.215804] pinned_vm 0 data_vm 844 exec_vm 1a4 stack_vm 21
[  383.215804] start_code 402000 end_code 408f09 start_data 40ce10 end_data 40d500
[  383.215804] start_brk 17fe000 brk 1830000 start_stack 7ffecbbe08e0
[  383.215804] arg_start 7ffecbbe1c6f arg_end 7ffecbbe1c81 env_start 7ffecbbe1c81 env_end 7ffecbbe1fe6
[  383.215804] binfmt ffffffffaf3efe40 flags 80000cd
[  383.215804] ioctx_table 0000000000000000
[  383.215804] owner ffff96660d4a4000 exe_file ffff966285501a00
[  383.215804] notifier_subscriptions 0000000000000000
[  383.215804] numa_next_scan 4295050919 numa_scan_offset 0 numa_scan_seq 0
[  383.215804] tlb_flush_pending 0
[  383.215804] def_flags: 0x0()
[  383.236255] ------------[ cut here ]------------
[  383.237537] kernel BUG at include/linux/mmap_lock.h:66!
[  383.238897] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[  383.240114] CPU: 37 PID: 1482 Comm: uffd-unit-tests Not tainted 6.5.0-rc4+ #68
[  383.242513] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014
[  383.244936] RIP: 0010:find_vma+0x3a/0x40
[  383.246200] Code: 48 89 34 24 48 85 c0 74 1c 48 83 c7 40 48 c7 c2 ff ff ff ff 48 89 e6 e8 a4 29 ba 00
[  383.251084] RSP: 0000:ffffae3745b6beb0 EFLAGS: 00010282
[  383.252781] RAX: 0000000000000314 RBX: ffff9666078e5280 RCX: 0000000000000000
[  383.255073] RDX: 0000000000000001 RSI: ffffffffae8f69c3 RDI: 00000000ffffffff
[  383.257352] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffae3745b6bc48
[  383.259369] R10: 0000000000000003 R11: ffff9669fff46fe8 R12: 0000000044401028
[  383.261570] R13: ffff9666078e5338 R14: ffffae3745b6bf58 R15: 0000000000000400
[  383.263499] FS:  00007fac671c5740(0000) GS:ffff9669efbc0000(0000) knlGS:0000000000000000
[  383.265483] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  383.266847] CR2: 0000000044401028 CR3: 0000000488960006 CR4: 0000000000770ee0
[  383.268532] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  383.270206] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  383.271905] PKRU: 55555554
[  383.272593] Call Trace:
[  383.273215]  <TASK>
[  383.273774]  ? die+0x32/0x80
[  383.274510]  ? do_trap+0xd6/0x100
[  383.275326]  ? find_vma+0x3a/0x40
[  383.276152]  ? do_error_trap+0x6a/0x90
[  383.277072]  ? find_vma+0x3a/0x40
[  383.277899]  ? exc_invalid_op+0x4c/0x60
[  383.278846]  ? find_vma+0x3a/0x40
[  383.279675]  ? asm_exc_invalid_op+0x16/0x20
[  383.280698]  ? find_vma+0x3a/0x40
[  383.281527]  lock_mm_and_find_vma+0x3f/0x270
[  383.282570]  do_user_addr_fault+0x1e4/0x660
[  383.283591]  exc_page_fault+0x73/0x170
[  383.284509]  asm_exc_page_fault+0x22/0x30
[  383.285486] RIP: 0033:0x404428
[  383.286265] Code: 48 89 85 18 ff ff ff e9 dc 00 00 00 48 8b 15 9f 92 00 00 48 8b 05 80 92 00 00 48 03
[  383.290566] RSP: 002b:00007ffecbbe05c0 EFLAGS: 00010206
[  383.291814] RAX: 0000000044401028 RBX: 00007ffecbbe08e8 RCX: 00007fac66e93c18
[  383.293502] RDX: 0000000044400000 RSI: 0000000000000001 RDI: 0000000000000000
[  383.295175] RBP: 00007ffecbbe06c0 R08: 00007ffecbbe05c0 R09: 00007ffecbbe06c0
[  383.296857] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
[  383.298533] R13: 00007ffecbbe08f8 R14: 000000000040ce18 R15: 00007fac67206000
[  383.300203]  </TASK>
[  383.300775] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_commong
[  383.309661] ---[ end trace 0000000000000000 ]---
[  383.310795] RIP: 0010:find_vma+0x3a/0x40
[  383.311771] Code: 48 89 34 24 48 85 c0 74 1c 48 83 c7 40 48 c7 c2 ff ff ff ff 48 89 e6 e8 a4 29 ba 00
[  383.316081] RSP: 0000:ffffae3745b6beb0 EFLAGS: 00010282
[  383.317346] RAX: 0000000000000314 RBX: ffff9666078e5280 RCX: 0000000000000000
[  383.319050] RDX: 0000000000000001 RSI: ffffffffae8f69c3 RDI: 00000000ffffffff
[  383.320767] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffae3745b6bc48
[  383.322468] R10: 0000000000000003 R11: ffff9669fff46fe8 R12: 0000000044401028
[  383.324164] R13: ffff9666078e5338 R14: ffffae3745b6bf58 R15: 0000000000000400
[  383.325870] FS:  00007fac671c5740(0000) GS:ffff9669efbc0000(0000) knlGS:0000000000000000
[  383.327795] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  383.329177] CR2: 0000000044401028 CR3: 0000000488960006 CR4: 0000000000770ee0
[  383.330885] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  383.332592] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  383.334287] PKRU: 55555554


Which ends up being

VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);

I did not check if this is also the case on mainline, and if this series is responsible.
Suren Baghdasaryan Aug. 9, 2023, 2:28 p.m. UTC | #4
On Wed, Aug 9, 2023 at 12:49 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 30.06.23 23:19, Suren Baghdasaryan wrote:
> > When per-VMA locks were introduced in [1] several types of page faults
> > would still fall back to mmap_lock to keep the patchset simple. Among them
> > are swap and userfault pages. The main reason for skipping those cases was
> > the fact that mmap_lock could be dropped while handling these faults and
> > that required additional logic to be implemented.
> > Implement the mechanism to allow per-VMA locks to be dropped for these
> > cases.
> > First, change handle_mm_fault to drop per-VMA locks when returning
> > VM_FAULT_RETRY or VM_FAULT_COMPLETED to be consistent with the way
> > mmap_lock is handled. Then change folio_lock_or_retry to accept vm_fault
> > and return vm_fault_t which simplifies later patches. Finally allow swap
> > and uffd page faults to be handled under per-VMA locks by dropping per-VMA
> > and retrying, the same way it's done under mmap_lock.
> > Naturally, once VMA lock is dropped that VMA should be assumed unstable
> > and can't be used.
> >
> > Changes since v6 posted at [2]
> > - 4/6 replaced the ternary operation in folio_lock_or_retry,
> > per Matthew Wilcox
> > - 4/6 changed return code description for __folio_lock_or_retry
> > per Matthew Wilcox
> >
> > Note: patch 3/6 will cause a trivial merge conflict in arch/arm64/mm/fault.c
> > when applied over mm-unstable branch due to a patch from ARM64 tree [3]
> > which is missing in mm-unstable.
> >
> > [1] https://lore.kernel.org/all/20230227173632.3292573-1-surenb@google.com/
> > [2] https://lore.kernel.org/all/20230630020436.1066016-1-surenb@google.com/
> > [3] https://lore.kernel.org/all/20230524131305.2808-1-jszhang@kernel.org/
> >
> > Suren Baghdasaryan (6):
> >    swap: remove remnants of polling from read_swap_cache_async
> >    mm: add missing VM_FAULT_RESULT_TRACE name for VM_FAULT_COMPLETED
> >    mm: drop per-VMA lock when returning VM_FAULT_RETRY or
> >      VM_FAULT_COMPLETED
> >    mm: change folio_lock_or_retry to use vm_fault directly
> >    mm: handle swap page faults under per-VMA lock
> >    mm: handle userfaults under VMA lock
>
> On mm/mm-unstable I get running the selftests:
>
> Testing sigbus-wp on shmem... [  383.215804] mm ffff9666078e5280 task_size 140737488351232
> [  383.215804] get_unmapped_area ffffffffad03b980
> [  383.215804] mmap_base 140378441285632 mmap_legacy_base 47254353883136
> [  383.215804] pgd ffff966608960000 mm_users 1 mm_count 6 pgtables_bytes 126976 map_count 28
> [  383.215804] hiwater_rss 6183 hiwater_vm 8aa7 total_vm 8aa7 locked_vm 0
> [  383.215804] pinned_vm 0 data_vm 844 exec_vm 1a4 stack_vm 21
> [  383.215804] start_code 402000 end_code 408f09 start_data 40ce10 end_data 40d500
> [  383.215804] start_brk 17fe000 brk 1830000 start_stack 7ffecbbe08e0
> [  383.215804] arg_start 7ffecbbe1c6f arg_end 7ffecbbe1c81 env_start 7ffecbbe1c81 env_end 7ffecbbe1fe6
> [  383.215804] binfmt ffffffffaf3efe40 flags 80000cd
> [  383.215804] ioctx_table 0000000000000000
> [  383.215804] owner ffff96660d4a4000 exe_file ffff966285501a00
> [  383.215804] notifier_subscriptions 0000000000000000
> [  383.215804] numa_next_scan 4295050919 numa_scan_offset 0 numa_scan_seq 0
> [  383.215804] tlb_flush_pending 0
> [  383.215804] def_flags: 0x0()
> [  383.236255] ------------[ cut here ]------------
> [  383.237537] kernel BUG at include/linux/mmap_lock.h:66!
> [  383.238897] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> [  383.240114] CPU: 37 PID: 1482 Comm: uffd-unit-tests Not tainted 6.5.0-rc4+ #68
> [  383.242513] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014
> [  383.244936] RIP: 0010:find_vma+0x3a/0x40
> [  383.246200] Code: 48 89 34 24 48 85 c0 74 1c 48 83 c7 40 48 c7 c2 ff ff ff ff 48 89 e6 e8 a4 29 ba 00
> [  383.251084] RSP: 0000:ffffae3745b6beb0 EFLAGS: 00010282
> [  383.252781] RAX: 0000000000000314 RBX: ffff9666078e5280 RCX: 0000000000000000
> [  383.255073] RDX: 0000000000000001 RSI: ffffffffae8f69c3 RDI: 00000000ffffffff
> [  383.257352] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffae3745b6bc48
> [  383.259369] R10: 0000000000000003 R11: ffff9669fff46fe8 R12: 0000000044401028
> [  383.261570] R13: ffff9666078e5338 R14: ffffae3745b6bf58 R15: 0000000000000400
> [  383.263499] FS:  00007fac671c5740(0000) GS:ffff9669efbc0000(0000) knlGS:0000000000000000
> [  383.265483] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  383.266847] CR2: 0000000044401028 CR3: 0000000488960006 CR4: 0000000000770ee0
> [  383.268532] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  383.270206] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  383.271905] PKRU: 55555554
> [  383.272593] Call Trace:
> [  383.273215]  <TASK>
> [  383.273774]  ? die+0x32/0x80
> [  383.274510]  ? do_trap+0xd6/0x100
> [  383.275326]  ? find_vma+0x3a/0x40
> [  383.276152]  ? do_error_trap+0x6a/0x90
> [  383.277072]  ? find_vma+0x3a/0x40
> [  383.277899]  ? exc_invalid_op+0x4c/0x60
> [  383.278846]  ? find_vma+0x3a/0x40
> [  383.279675]  ? asm_exc_invalid_op+0x16/0x20
> [  383.280698]  ? find_vma+0x3a/0x40
> [  383.281527]  lock_mm_and_find_vma+0x3f/0x270
> [  383.282570]  do_user_addr_fault+0x1e4/0x660
> [  383.283591]  exc_page_fault+0x73/0x170
> [  383.284509]  asm_exc_page_fault+0x22/0x30
> [  383.285486] RIP: 0033:0x404428
> [  383.286265] Code: 48 89 85 18 ff ff ff e9 dc 00 00 00 48 8b 15 9f 92 00 00 48 8b 05 80 92 00 00 48 03
> [  383.290566] RSP: 002b:00007ffecbbe05c0 EFLAGS: 00010206
> [  383.291814] RAX: 0000000044401028 RBX: 00007ffecbbe08e8 RCX: 00007fac66e93c18
> [  383.293502] RDX: 0000000044400000 RSI: 0000000000000001 RDI: 0000000000000000
> [  383.295175] RBP: 00007ffecbbe06c0 R08: 00007ffecbbe05c0 R09: 00007ffecbbe06c0
> [  383.296857] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
> [  383.298533] R13: 00007ffecbbe08f8 R14: 000000000040ce18 R15: 00007fac67206000
> [  383.300203]  </TASK>
> [  383.300775] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_commong
> [  383.309661] ---[ end trace 0000000000000000 ]---
> [  383.310795] RIP: 0010:find_vma+0x3a/0x40
> [  383.311771] Code: 48 89 34 24 48 85 c0 74 1c 48 83 c7 40 48 c7 c2 ff ff ff ff 48 89 e6 e8 a4 29 ba 00
> [  383.316081] RSP: 0000:ffffae3745b6beb0 EFLAGS: 00010282
> [  383.317346] RAX: 0000000000000314 RBX: ffff9666078e5280 RCX: 0000000000000000
> [  383.319050] RDX: 0000000000000001 RSI: ffffffffae8f69c3 RDI: 00000000ffffffff
> [  383.320767] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffae3745b6bc48
> [  383.322468] R10: 0000000000000003 R11: ffff9669fff46fe8 R12: 0000000044401028
> [  383.324164] R13: ffff9666078e5338 R14: ffffae3745b6bf58 R15: 0000000000000400
> [  383.325870] FS:  00007fac671c5740(0000) GS:ffff9669efbc0000(0000) knlGS:0000000000000000
> [  383.327795] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  383.329177] CR2: 0000000044401028 CR3: 0000000488960006 CR4: 0000000000770ee0
> [  383.330885] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  383.332592] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  383.334287] PKRU: 55555554
>
>
> Which ends up being
>
> VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
>
> I did not check if this is also the case on mainline, and if this series is responsible.

Thanks for reporting! I'm checking it now.

>
> --
> Cheers,
>
> David / dhildenb
>
Suren Baghdasaryan Aug. 9, 2023, 3:22 p.m. UTC | #5
On Wed, Aug 9, 2023 at 7:28 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Aug 9, 2023 at 12:49 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 30.06.23 23:19, Suren Baghdasaryan wrote:
> > > When per-VMA locks were introduced in [1] several types of page faults
> > > would still fall back to mmap_lock to keep the patchset simple. Among them
> > > are swap and userfault pages. The main reason for skipping those cases was
> > > the fact that mmap_lock could be dropped while handling these faults and
> > > that required additional logic to be implemented.
> > > Implement the mechanism to allow per-VMA locks to be dropped for these
> > > cases.
> > > First, change handle_mm_fault to drop per-VMA locks when returning
> > > VM_FAULT_RETRY or VM_FAULT_COMPLETED to be consistent with the way
> > > mmap_lock is handled. Then change folio_lock_or_retry to accept vm_fault
> > > and return vm_fault_t which simplifies later patches. Finally allow swap
> > > and uffd page faults to be handled under per-VMA locks by dropping per-VMA
> > > and retrying, the same way it's done under mmap_lock.
> > > Naturally, once VMA lock is dropped that VMA should be assumed unstable
> > > and can't be used.
> > >
> > > Changes since v6 posted at [2]
> > > - 4/6 replaced the ternary operation in folio_lock_or_retry,
> > > per Matthew Wilcox
> > > - 4/6 changed return code description for __folio_lock_or_retry
> > > per Matthew Wilcox
> > >
> > > Note: patch 3/6 will cause a trivial merge conflict in arch/arm64/mm/fault.c
> > > when applied over mm-unstable branch due to a patch from ARM64 tree [3]
> > > which is missing in mm-unstable.
> > >
> > > [1] https://lore.kernel.org/all/20230227173632.3292573-1-surenb@google.com/
> > > [2] https://lore.kernel.org/all/20230630020436.1066016-1-surenb@google.com/
> > > [3] https://lore.kernel.org/all/20230524131305.2808-1-jszhang@kernel.org/
> > >
> > > Suren Baghdasaryan (6):
> > >    swap: remove remnants of polling from read_swap_cache_async
> > >    mm: add missing VM_FAULT_RESULT_TRACE name for VM_FAULT_COMPLETED
> > >    mm: drop per-VMA lock when returning VM_FAULT_RETRY or
> > >      VM_FAULT_COMPLETED
> > >    mm: change folio_lock_or_retry to use vm_fault directly
> > >    mm: handle swap page faults under per-VMA lock
> > >    mm: handle userfaults under VMA lock
> >
> > On mm/mm-unstable I get running the selftests:
> >
> > Testing sigbus-wp on shmem... [  383.215804] mm ffff9666078e5280 task_size 140737488351232
> > [  383.215804] get_unmapped_area ffffffffad03b980
> > [  383.215804] mmap_base 140378441285632 mmap_legacy_base 47254353883136
> > [  383.215804] pgd ffff966608960000 mm_users 1 mm_count 6 pgtables_bytes 126976 map_count 28
> > [  383.215804] hiwater_rss 6183 hiwater_vm 8aa7 total_vm 8aa7 locked_vm 0
> > [  383.215804] pinned_vm 0 data_vm 844 exec_vm 1a4 stack_vm 21
> > [  383.215804] start_code 402000 end_code 408f09 start_data 40ce10 end_data 40d500
> > [  383.215804] start_brk 17fe000 brk 1830000 start_stack 7ffecbbe08e0
> > [  383.215804] arg_start 7ffecbbe1c6f arg_end 7ffecbbe1c81 env_start 7ffecbbe1c81 env_end 7ffecbbe1fe6
> > [  383.215804] binfmt ffffffffaf3efe40 flags 80000cd
> > [  383.215804] ioctx_table 0000000000000000
> > [  383.215804] owner ffff96660d4a4000 exe_file ffff966285501a00
> > [  383.215804] notifier_subscriptions 0000000000000000
> > [  383.215804] numa_next_scan 4295050919 numa_scan_offset 0 numa_scan_seq 0
> > [  383.215804] tlb_flush_pending 0
> > [  383.215804] def_flags: 0x0()
> > [  383.236255] ------------[ cut here ]------------
> > [  383.237537] kernel BUG at include/linux/mmap_lock.h:66!
> > [  383.238897] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> > [  383.240114] CPU: 37 PID: 1482 Comm: uffd-unit-tests Not tainted 6.5.0-rc4+ #68
> > [  383.242513] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014
> > [  383.244936] RIP: 0010:find_vma+0x3a/0x40
> > [  383.246200] Code: 48 89 34 24 48 85 c0 74 1c 48 83 c7 40 48 c7 c2 ff ff ff ff 48 89 e6 e8 a4 29 ba 00
> > [  383.251084] RSP: 0000:ffffae3745b6beb0 EFLAGS: 00010282
> > [  383.252781] RAX: 0000000000000314 RBX: ffff9666078e5280 RCX: 0000000000000000
> > [  383.255073] RDX: 0000000000000001 RSI: ffffffffae8f69c3 RDI: 00000000ffffffff
> > [  383.257352] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffae3745b6bc48
> > [  383.259369] R10: 0000000000000003 R11: ffff9669fff46fe8 R12: 0000000044401028
> > [  383.261570] R13: ffff9666078e5338 R14: ffffae3745b6bf58 R15: 0000000000000400
> > [  383.263499] FS:  00007fac671c5740(0000) GS:ffff9669efbc0000(0000) knlGS:0000000000000000
> > [  383.265483] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  383.266847] CR2: 0000000044401028 CR3: 0000000488960006 CR4: 0000000000770ee0
> > [  383.268532] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [  383.270206] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [  383.271905] PKRU: 55555554
> > [  383.272593] Call Trace:
> > [  383.273215]  <TASK>
> > [  383.273774]  ? die+0x32/0x80
> > [  383.274510]  ? do_trap+0xd6/0x100
> > [  383.275326]  ? find_vma+0x3a/0x40
> > [  383.276152]  ? do_error_trap+0x6a/0x90
> > [  383.277072]  ? find_vma+0x3a/0x40
> > [  383.277899]  ? exc_invalid_op+0x4c/0x60
> > [  383.278846]  ? find_vma+0x3a/0x40
> > [  383.279675]  ? asm_exc_invalid_op+0x16/0x20
> > [  383.280698]  ? find_vma+0x3a/0x40
> > [  383.281527]  lock_mm_and_find_vma+0x3f/0x270
> > [  383.282570]  do_user_addr_fault+0x1e4/0x660
> > [  383.283591]  exc_page_fault+0x73/0x170
> > [  383.284509]  asm_exc_page_fault+0x22/0x30
> > [  383.285486] RIP: 0033:0x404428
> > [  383.286265] Code: 48 89 85 18 ff ff ff e9 dc 00 00 00 48 8b 15 9f 92 00 00 48 8b 05 80 92 00 00 48 03
> > [  383.290566] RSP: 002b:00007ffecbbe05c0 EFLAGS: 00010206
> > [  383.291814] RAX: 0000000044401028 RBX: 00007ffecbbe08e8 RCX: 00007fac66e93c18
> > [  383.293502] RDX: 0000000044400000 RSI: 0000000000000001 RDI: 0000000000000000
> > [  383.295175] RBP: 00007ffecbbe06c0 R08: 00007ffecbbe05c0 R09: 00007ffecbbe06c0
> > [  383.296857] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
> > [  383.298533] R13: 00007ffecbbe08f8 R14: 000000000040ce18 R15: 00007fac67206000
> > [  383.300203]  </TASK>
> > [  383.300775] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_commong
> > [  383.309661] ---[ end trace 0000000000000000 ]---
> > [  383.310795] RIP: 0010:find_vma+0x3a/0x40
> > [  383.311771] Code: 48 89 34 24 48 85 c0 74 1c 48 83 c7 40 48 c7 c2 ff ff ff ff 48 89 e6 e8 a4 29 ba 00
> > [  383.316081] RSP: 0000:ffffae3745b6beb0 EFLAGS: 00010282
> > [  383.317346] RAX: 0000000000000314 RBX: ffff9666078e5280 RCX: 0000000000000000
> > [  383.319050] RDX: 0000000000000001 RSI: ffffffffae8f69c3 RDI: 00000000ffffffff
> > [  383.320767] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffae3745b6bc48
> > [  383.322468] R10: 0000000000000003 R11: ffff9669fff46fe8 R12: 0000000044401028
> > [  383.324164] R13: ffff9666078e5338 R14: ffffae3745b6bf58 R15: 0000000000000400
> > [  383.325870] FS:  00007fac671c5740(0000) GS:ffff9669efbc0000(0000) knlGS:0000000000000000
> > [  383.327795] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  383.329177] CR2: 0000000044401028 CR3: 0000000488960006 CR4: 0000000000770ee0
> > [  383.330885] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [  383.332592] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [  383.334287] PKRU: 55555554
> >
> >
> > Which ends up being
> >
> > VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> >
> > I did not check if this is also the case on mainline, and if this series is responsible.
>
> Thanks for reporting! I'm checking it now.

Hmm. From the code it's not obvious how lock_mm_and_find_vma() ends up
calling find_vma() without mmap_lock after successfully completing
get_mmap_lock_carefully(). lock_mm_and_find_vma+0x3f/0x270 points to
the first invocation of find_vma(), so this is not even the lock
upgrade path... I'll try to reproduce this issue and dig up more but
from the information I have so far this issue does not seem to be
related to this series.
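
For context, a paraphrased outline of that helper (simplified, not a
verbatim copy; the stack-expansion tail is elided), which shows why a
find_vma() call without mmap_lock looks impossible here:

struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
                                            unsigned long addr,
                                            struct pt_regs *regs)
{
        struct vm_area_struct *vma;

        /* Takes mmap_lock for read, or fails. */
        if (!get_mmap_lock_carefully(mm, regs))
                return NULL;

        /* The first find_vma() call, the one the oops points at. */
        vma = find_vma(mm, addr);
        if (likely(vma && vma->vm_start <= addr))
                return vma;

        /*
         * Otherwise try to expand a VM_GROWSDOWN mapping, possibly
         * upgrading to the write lock (elided here).
         */
        mmap_read_unlock(mm);
        return NULL;
}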

>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
Suren Baghdasaryan Aug. 9, 2023, 5:17 p.m. UTC | #6
On Wed, Aug 9, 2023 at 8:22 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Aug 9, 2023 at 7:28 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Wed, Aug 9, 2023 at 12:49 AM David Hildenbrand <david@redhat.com> wrote:
> > >
> > > On 30.06.23 23:19, Suren Baghdasaryan wrote:
> > > > When per-VMA locks were introduced in [1] several types of page faults
> > > > would still fall back to mmap_lock to keep the patchset simple. Among them
> > > > are swap and userfault pages. The main reason for skipping those cases was
> > > > the fact that mmap_lock could be dropped while handling these faults and
> > > > that required additional logic to be implemented.
> > > > Implement the mechanism to allow per-VMA locks to be dropped for these
> > > > cases.
> > > > First, change handle_mm_fault to drop per-VMA locks when returning
> > > > VM_FAULT_RETRY or VM_FAULT_COMPLETED to be consistent with the way
> > > > mmap_lock is handled. Then change folio_lock_or_retry to accept vm_fault
> > > > and return vm_fault_t which simplifies later patches. Finally allow swap
> > > > and uffd page faults to be handled under per-VMA locks by dropping per-VMA
> > > > and retrying, the same way it's done under mmap_lock.
> > > > Naturally, once VMA lock is dropped that VMA should be assumed unstable
> > > > and can't be used.
> > > >
> > > > Changes since v6 posted at [2]
> > > > - 4/6 replaced the ternary operation in folio_lock_or_retry,
> > > > per Matthew Wilcox
> > > > - 4/6 changed return code description for __folio_lock_or_retry
> > > > per Matthew Wilcox
> > > >
> > > > Note: patch 3/6 will cause a trivial merge conflict in arch/arm64/mm/fault.c
> > > > when applied over mm-unstable branch due to a patch from ARM64 tree [3]
> > > > which is missing in mm-unstable.
> > > >
> > > > [1] https://lore.kernel.org/all/20230227173632.3292573-1-surenb@google.com/
> > > > [2] https://lore.kernel.org/all/20230630020436.1066016-1-surenb@google.com/
> > > > [3] https://lore.kernel.org/all/20230524131305.2808-1-jszhang@kernel.org/
> > > >
> > > > Suren Baghdasaryan (6):
> > > >    swap: remove remnants of polling from read_swap_cache_async
> > > >    mm: add missing VM_FAULT_RESULT_TRACE name for VM_FAULT_COMPLETED
> > > >    mm: drop per-VMA lock when returning VM_FAULT_RETRY or
> > > >      VM_FAULT_COMPLETED
> > > >    mm: change folio_lock_or_retry to use vm_fault directly
> > > >    mm: handle swap page faults under per-VMA lock
> > > >    mm: handle userfaults under VMA lock
> > >
> > > On mm/mm-unstable I get running the selftests:
> > >
> > > Testing sigbus-wp on shmem... [  383.215804] mm ffff9666078e5280 task_size 140737488351232
> > > [  383.215804] get_unmapped_area ffffffffad03b980
> > > [  383.215804] mmap_base 140378441285632 mmap_legacy_base 47254353883136
> > > [  383.215804] pgd ffff966608960000 mm_users 1 mm_count 6 pgtables_bytes 126976 map_count 28
> > > [  383.215804] hiwater_rss 6183 hiwater_vm 8aa7 total_vm 8aa7 locked_vm 0
> > > [  383.215804] pinned_vm 0 data_vm 844 exec_vm 1a4 stack_vm 21
> > > [  383.215804] start_code 402000 end_code 408f09 start_data 40ce10 end_data 40d500
> > > [  383.215804] start_brk 17fe000 brk 1830000 start_stack 7ffecbbe08e0
> > > [  383.215804] arg_start 7ffecbbe1c6f arg_end 7ffecbbe1c81 env_start 7ffecbbe1c81 env_end 7ffecbbe1fe6
> > > [  383.215804] binfmt ffffffffaf3efe40 flags 80000cd
> > > [  383.215804] ioctx_table 0000000000000000
> > > [  383.215804] owner ffff96660d4a4000 exe_file ffff966285501a00
> > > [  383.215804] notifier_subscriptions 0000000000000000
> > > [  383.215804] numa_next_scan 4295050919 numa_scan_offset 0 numa_scan_seq 0
> > > [  383.215804] tlb_flush_pending 0
> > > [  383.215804] def_flags: 0x0()
> > > [  383.236255] ------------[ cut here ]------------
> > > [  383.237537] kernel BUG at include/linux/mmap_lock.h:66!
> > > [  383.238897] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> > > [  383.240114] CPU: 37 PID: 1482 Comm: uffd-unit-tests Not tainted 6.5.0-rc4+ #68
> > > [  383.242513] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014
> > > [  383.244936] RIP: 0010:find_vma+0x3a/0x40
> > > [  383.246200] Code: 48 89 34 24 48 85 c0 74 1c 48 83 c7 40 48 c7 c2 ff ff ff ff 48 89 e6 e8 a4 29 ba 00
> > > [  383.251084] RSP: 0000:ffffae3745b6beb0 EFLAGS: 00010282
> > > [  383.252781] RAX: 0000000000000314 RBX: ffff9666078e5280 RCX: 0000000000000000
> > > [  383.255073] RDX: 0000000000000001 RSI: ffffffffae8f69c3 RDI: 00000000ffffffff
> > > [  383.257352] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffae3745b6bc48
> > > [  383.259369] R10: 0000000000000003 R11: ffff9669fff46fe8 R12: 0000000044401028
> > > [  383.261570] R13: ffff9666078e5338 R14: ffffae3745b6bf58 R15: 0000000000000400
> > > [  383.263499] FS:  00007fac671c5740(0000) GS:ffff9669efbc0000(0000) knlGS:0000000000000000
> > > [  383.265483] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [  383.266847] CR2: 0000000044401028 CR3: 0000000488960006 CR4: 0000000000770ee0
> > > [  383.268532] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [  383.270206] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > [  383.271905] PKRU: 55555554
> > > [  383.272593] Call Trace:
> > > [  383.273215]  <TASK>
> > > [  383.273774]  ? die+0x32/0x80
> > > [  383.274510]  ? do_trap+0xd6/0x100
> > > [  383.275326]  ? find_vma+0x3a/0x40
> > > [  383.276152]  ? do_error_trap+0x6a/0x90
> > > [  383.277072]  ? find_vma+0x3a/0x40
> > > [  383.277899]  ? exc_invalid_op+0x4c/0x60
> > > [  383.278846]  ? find_vma+0x3a/0x40
> > > [  383.279675]  ? asm_exc_invalid_op+0x16/0x20
> > > [  383.280698]  ? find_vma+0x3a/0x40
> > > [  383.281527]  lock_mm_and_find_vma+0x3f/0x270
> > > [  383.282570]  do_user_addr_fault+0x1e4/0x660
> > > [  383.283591]  exc_page_fault+0x73/0x170
> > > [  383.284509]  asm_exc_page_fault+0x22/0x30
> > > [  383.285486] RIP: 0033:0x404428
> > > [  383.286265] Code: 48 89 85 18 ff ff ff e9 dc 00 00 00 48 8b 15 9f 92 00 00 48 8b 05 80 92 00 00 48 03
> > > [  383.290566] RSP: 002b:00007ffecbbe05c0 EFLAGS: 00010206
> > > [  383.291814] RAX: 0000000044401028 RBX: 00007ffecbbe08e8 RCX: 00007fac66e93c18
> > > [  383.293502] RDX: 0000000044400000 RSI: 0000000000000001 RDI: 0000000000000000
> > > [  383.295175] RBP: 00007ffecbbe06c0 R08: 00007ffecbbe05c0 R09: 00007ffecbbe06c0
> > > [  383.296857] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
> > > [  383.298533] R13: 00007ffecbbe08f8 R14: 000000000040ce18 R15: 00007fac67206000
> > > [  383.300203]  </TASK>
> > > [  383.300775] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_commong
> > > [  383.309661] ---[ end trace 0000000000000000 ]---
> > > [  383.310795] RIP: 0010:find_vma+0x3a/0x40
> > > [  383.311771] Code: 48 89 34 24 48 85 c0 74 1c 48 83 c7 40 48 c7 c2 ff ff ff ff 48 89 e6 e8 a4 29 ba 00
> > > [  383.316081] RSP: 0000:ffffae3745b6beb0 EFLAGS: 00010282
> > > [  383.317346] RAX: 0000000000000314 RBX: ffff9666078e5280 RCX: 0000000000000000
> > > [  383.319050] RDX: 0000000000000001 RSI: ffffffffae8f69c3 RDI: 00000000ffffffff
> > > [  383.320767] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffae3745b6bc48
> > > [  383.322468] R10: 0000000000000003 R11: ffff9669fff46fe8 R12: 0000000044401028
> > > [  383.324164] R13: ffff9666078e5338 R14: ffffae3745b6bf58 R15: 0000000000000400
> > > [  383.325870] FS:  00007fac671c5740(0000) GS:ffff9669efbc0000(0000) knlGS:0000000000000000
> > > [  383.327795] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [  383.329177] CR2: 0000000044401028 CR3: 0000000488960006 CR4: 0000000000770ee0
> > > [  383.330885] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [  383.332592] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > [  383.334287] PKRU: 55555554
> > >
> > >
> > > Which ends up being
> > >
> > > VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> > >
> > > I did not check if this is also the case on mainline, and if this series is responsible.
> >
> > Thanks for reporting! I'm checking it now.
>
> Hmm. From the code it's not obvious how lock_mm_and_find_vma() ends up
> calling find_vma() without mmap_lock after successfully completing
> get_mmap_lock_carefully(). lock_mm_and_find_vma+0x3f/0x270 points to
> the first invocation of find_vma(), so this is not even the lock
> upgrade path... I'll try to reproduce this issue and dig up more but
> from the information I have so far this issue does not seem to be
> related to this series.

This is really weird. I added mmap_assert_locked(mm) calls into
get_mmap_lock_carefully() right after we acquire mmap_lock read lock
and one of them triggers right after successful
mmap_read_lock_killable(). Here is my modified version of
get_mmap_lock_carefully():

static inline bool get_mmap_lock_carefully(struct mm_struct *mm,
struct pt_regs *regs) {
     /* Even if this succeeds, make it clear we might have slept */
     if (likely(mmap_read_trylock(mm))) {
         might_sleep();
         mmap_assert_locked(mm);
         return true;
     }
     if (regs && !user_mode(regs)) {
         unsigned long ip = instruction_pointer(regs);
         if (!search_exception_tables(ip))
             return false;
     }
     if (!mmap_read_lock_killable(mm)) {
         mmap_assert_locked(mm);                     <---- generates a BUG
         return true;
     }
     return false;
}

AFAICT the conditions for mmap_read_trylock() and
mmap_read_lock_killable() are checked correctly. Am I missing
something?

>
> >
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >
David Hildenbrand Aug. 9, 2023, 6:04 p.m. UTC | #7
>>>> Which ends up being
>>>>
>>>> VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
>>>>
>>>> I did not check if this is also the case on mainline, and if this series is responsible.
>>>
>>> Thanks for reporting! I'm checking it now.
>>
>> Hmm. From the code it's not obvious how lock_mm_and_find_vma() ends up
>> calling find_vma() without mmap_lock after successfully completing
>> get_mmap_lock_carefully(). lock_mm_and_find_vma+0x3f/0x270 points to
>> the first invocation of find_vma(), so this is not even the lock
>> upgrade path... I'll try to reproduce this issue and dig up more but
>> from the information I have so far this issue does not seem to be
>> related to this series.

I just checked on mainline and it does not fail there.

> 
> This is really weird. I added mmap_assert_locked(mm) calls into
> get_mmap_lock_carefully() right after we acquire mmap_lock read lock
> and one of them triggers right after successful
> mmap_read_lock_killable(). Here is my modified version of
> get_mmap_lock_carefully():
> 
> static inline bool get_mmap_lock_carefully(struct mm_struct *mm,
> struct pt_regs *regs) {
>       /* Even if this succeeds, make it clear we might have slept */
>       if (likely(mmap_read_trylock(mm))) {
>           might_sleep();
>           mmap_assert_locked(mm);
>           return true;
>       }
>       if (regs && !user_mode(regs)) {
>           unsigned long ip = instruction_pointer(regs);
>           if (!search_exception_tables(ip))
>               return false;
>       }
>       if (!mmap_read_lock_killable(mm)) {
>           mmap_assert_locked(mm);                     <---- generates a BUG
>           return true;
>       }
>       return false;
> }

Ehm, that's indeed weird.

> 
> AFAIKT conditions for mmap_read_trylock() and
> mmap_read_lock_killable() are checked correctly. Am I missing
> something?

Weirdly enough, it only triggers during that specific uffd test, right?
Suren Baghdasaryan Aug. 9, 2023, 6:08 p.m. UTC | #8
On Wed, Aug 9, 2023 at 11:04 AM David Hildenbrand <david@redhat.com> wrote:
>
> >>>> Which ends up being
> >>>>
> >>>> VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> >>>>
> >>>> I did not check if this is also the case on mainline, and if this series is responsible.
> >>>
> >>> Thanks for reporting! I'm checking it now.
> >>
> >> Hmm. From the code it's not obvious how lock_mm_and_find_vma() ends up
> >> calling find_vma() without mmap_lock after successfully completing
> >> get_mmap_lock_carefully(). lock_mm_and_find_vma+0x3f/0x270 points to
> >> the first invocation of find_vma(), so this is not even the lock
> >> upgrade path... I'll try to reproduce this issue and dig up more but
> >> from the information I have so far this issue does not seem to be
> >> related to this series.
>
> I just checked on mainline and it does not fail there.
>
> >
> > This is really weird. I added mmap_assert_locked(mm) calls into
> > get_mmap_lock_carefully() right after we acquire mmap_lock read lock
> > and one of them triggers right after successful
> > mmap_read_lock_killable(). Here is my modified version of
> > get_mmap_lock_carefully():
> >
> > static inline bool get_mmap_lock_carefully(struct mm_struct *mm,
> > struct pt_regs *regs) {
> >       /* Even if this succeeds, make it clear we might have slept */
> >       if (likely(mmap_read_trylock(mm))) {
> >           might_sleep();
> >           mmap_assert_locked(mm);
> >           return true;
> >       }
> >       if (regs && !user_mode(regs)) {
> >           unsigned long ip = instruction_pointer(regs);
> >           if (!search_exception_tables(ip))
> >               return false;
> >       }
> >       if (!mmap_read_lock_killable(mm)) {
> >           mmap_assert_locked(mm);                     <---- generates a BUG
> >           return true;
> >       }
> >       return false;
> > }
>
> Ehm, that's indeed weird.
>
> >
> > AFAIKT conditions for mmap_read_trylock() and
> > mmap_read_lock_killable() are checked correctly. Am I missing
> > something?
>
> Weirdly enough, it only triggers during that specific uffd test, right?

Yes, uffd-unit-tests. I even ran it separately to ensure it's not some
fallout from a previous test, and I'm able to reproduce this
consistently.

>
> --
> Cheers,
>
> David / dhildenb
>
Suren Baghdasaryan Aug. 9, 2023, 6:31 p.m. UTC | #9
On Wed, Aug 9, 2023 at 11:08 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Aug 9, 2023 at 11:04 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > >>>> Which ends up being
> > >>>>
> > >>>> VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> > >>>>
> > >>>> I did not check if this is also the case on mainline, and if this series is responsible.
> > >>>
> > >>> Thanks for reporting! I'm checking it now.
> > >>
> > >> Hmm. From the code it's not obvious how lock_mm_and_find_vma() ends up
> > >> calling find_vma() without mmap_lock after successfully completing
> > >> get_mmap_lock_carefully(). lock_mm_and_find_vma+0x3f/0x270 points to
> > >> the first invocation of find_vma(), so this is not even the lock
> > >> upgrade path... I'll try to reproduce this issue and dig up more but
> > >> from the information I have so far this issue does not seem to be
> > >> related to this series.
> >
> > I just checked on mainline and it does not fail there.

Thanks. Just to eliminate the possibility, I'll try reverting my
patchset in mm-unstable and will try the test again. Will do that in
the evening once I'm home.

> >
> > >
> > > This is really weird. I added mmap_assert_locked(mm) calls into
> > > get_mmap_lock_carefully() right after we acquire mmap_lock read lock
> > > and one of them triggers right after successful
> > > mmap_read_lock_killable(). Here is my modified version of
> > > get_mmap_lock_carefully():
> > >
> > > static inline bool get_mmap_lock_carefully(struct mm_struct *mm,
> > > struct pt_regs *regs) {
> > >       /* Even if this succeeds, make it clear we might have slept */
> > >       if (likely(mmap_read_trylock(mm))) {
> > >           might_sleep();
> > >           mmap_assert_locked(mm);
> > >           return true;
> > >       }
> > >       if (regs && !user_mode(regs)) {
> > >           unsigned long ip = instruction_pointer(regs);
> > >           if (!search_exception_tables(ip))
> > >               return false;
> > >       }
> > >       if (!mmap_read_lock_killable(mm)) {
> > >           mmap_assert_locked(mm);                     <---- generates a BUG
> > >           return true;
> > >       }
> > >       return false;
> > > }
> >
> > Ehm, that's indeed weird.
> >
> > >
> > > AFAIKT conditions for mmap_read_trylock() and
> > > mmap_read_lock_killable() are checked correctly. Am I missing
> > > something?
> >
> > Weirdly enough, it only triggers during that specific uffd test, right?
>
> Yes, uffd-unit-tests. I even ran it separately to ensure it's not some
> fallback from a previous test and I'm able to reproduce this
> consistently.
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
Suren Baghdasaryan Aug. 10, 2023, 5:29 a.m. UTC | #10
On Wed, Aug 9, 2023 at 11:31 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Aug 9, 2023 at 11:08 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Wed, Aug 9, 2023 at 11:04 AM David Hildenbrand <david@redhat.com> wrote:
> > >
> > > >>>> Which ends up being
> > > >>>>
> > > >>>> VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> > > >>>>
> > > >>>> I did not check if this is also the case on mainline, and if this series is responsible.
> > > >>>
> > > >>> Thanks for reporting! I'm checking it now.
> > > >>
> > > >> Hmm. From the code it's not obvious how lock_mm_and_find_vma() ends up
> > > >> calling find_vma() without mmap_lock after successfully completing
> > > >> get_mmap_lock_carefully(). lock_mm_and_find_vma+0x3f/0x270 points to
> > > >> the first invocation of find_vma(), so this is not even the lock
> > > >> upgrade path... I'll try to reproduce this issue and dig up more but
> > > >> from the information I have so far this issue does not seem to be
> > > >> related to this series.
> > >
> > > I just checked on mainline and it does not fail there.
>
> Thanks. Just to eliminate the possibility, I'll try reverting my
> patchset in mm-unstable and will try the test again. Will do that in
> the evening once I'm home.
>
> > >
> > > >
> > > > This is really weird. I added mmap_assert_locked(mm) calls into
> > > > get_mmap_lock_carefully() right after we acquire mmap_lock read lock
> > > > and one of them triggers right after successful
> > > > mmap_read_lock_killable(). Here is my modified version of
> > > > get_mmap_lock_carefully():
> > > >
> > > > static inline bool get_mmap_lock_carefully(struct mm_struct *mm,
> > > > struct pt_regs *regs) {
> > > >       /* Even if this succeeds, make it clear we might have slept */
> > > >       if (likely(mmap_read_trylock(mm))) {
> > > >           might_sleep();
> > > >           mmap_assert_locked(mm);
> > > >           return true;
> > > >       }
> > > >       if (regs && !user_mode(regs)) {
> > > >           unsigned long ip = instruction_pointer(regs);
> > > >           if (!search_exception_tables(ip))
> > > >               return false;
> > > >       }
> > > >       if (!mmap_read_lock_killable(mm)) {
> > > >           mmap_assert_locked(mm);                     <---- generates a BUG
> > > >           return true;
> > > >       }
> > > >       return false;
> > > > }
> > >
> > > Ehm, that's indeed weird.
> > >
> > > >
> > > > AFAIKT conditions for mmap_read_trylock() and
> > > > mmap_read_lock_killable() are checked correctly. Am I missing
> > > > something?
> > >
> > > Weirdly enough, it only triggers during that specific uffd test, right?
> >
> > Yes, uffd-unit-tests. I even ran it separately to ensure it's not some
> > fallback from a previous test and I'm able to reproduce this
> > consistently.

Yeah, it is somehow related to per-vma locking. Unfortunately I can't
reproduce the issue on my VM, so I have to use my host and bisection
is slow. I think I'll get to the bottom of this tomorrow.

> >
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >
Suren Baghdasaryan Aug. 10, 2023, 6:24 a.m. UTC | #11
On Wed, Aug 9, 2023 at 10:29 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Aug 9, 2023 at 11:31 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Wed, Aug 9, 2023 at 11:08 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Wed, Aug 9, 2023 at 11:04 AM David Hildenbrand <david@redhat.com> wrote:
> > > >
> > > > >>>> Which ends up being
> > > > >>>>
> > > > >>>> VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> > > > >>>>
> > > > >>>> I did not check if this is also the case on mainline, and if this series is responsible.
> > > > >>>
> > > > >>> Thanks for reporting! I'm checking it now.
> > > > >>
> > > > >> Hmm. From the code it's not obvious how lock_mm_and_find_vma() ends up
> > > > >> calling find_vma() without mmap_lock after successfully completing
> > > > >> get_mmap_lock_carefully(). lock_mm_and_find_vma+0x3f/0x270 points to
> > > > >> the first invocation of find_vma(), so this is not even the lock
> > > > >> upgrade path... I'll try to reproduce this issue and dig up more but
> > > > >> from the information I have so far this issue does not seem to be
> > > > >> related to this series.
> > > >
> > > > I just checked on mainline and it does not fail there.
> >
> > Thanks. Just to eliminate the possibility, I'll try reverting my
> > patchset in mm-unstable and will try the test again. Will do that in
> > the evening once I'm home.
> >
> > > >
> > > > >
> > > > > This is really weird. I added mmap_assert_locked(mm) calls into
> > > > > get_mmap_lock_carefully() right after we acquire mmap_lock read lock
> > > > > and one of them triggers right after successful
> > > > > mmap_read_lock_killable(). Here is my modified version of
> > > > > get_mmap_lock_carefully():
> > > > >
> > > > > static inline bool get_mmap_lock_carefully(struct mm_struct *mm,
> > > > > struct pt_regs *regs) {
> > > > >       /* Even if this succeeds, make it clear we might have slept */
> > > > >       if (likely(mmap_read_trylock(mm))) {
> > > > >           might_sleep();
> > > > >           mmap_assert_locked(mm);
> > > > >           return true;
> > > > >       }
> > > > >       if (regs && !user_mode(regs)) {
> > > > >           unsigned long ip = instruction_pointer(regs);
> > > > >           if (!search_exception_tables(ip))
> > > > >               return false;
> > > > >       }
> > > > >       if (!mmap_read_lock_killable(mm)) {
> > > > >           mmap_assert_locked(mm);                     <---- generates a BUG
> > > > >           return true;
> > > > >       }
> > > > >       return false;
> > > > > }
> > > >
> > > > Ehm, that's indeed weird.
> > > >
> > > > >
> > > > > AFAIKT conditions for mmap_read_trylock() and
> > > > > mmap_read_lock_killable() are checked correctly. Am I missing
> > > > > something?
> > > >
> > > > Weirdly enough, it only triggers during that specific uffd test, right?
> > >
> > > Yes, uffd-unit-tests. I even ran it separately to ensure it's not some
> > > fallback from a previous test and I'm able to reproduce this
> > > consistently.
>
> Yeah, it is somehow related to per-vma locking. Unfortunately I can't
> reproduce the issue on my VM, so I have to use my host and bisection
> is slow. I think I'll get to the bottom of this tomorrow.

Ok, I think I found the issue. wp_page_shared() ->
fault_dirty_shared_page() can drop mmap_lock (see the comment saying
"Drop the mmap_lock before waiting on IO, if we can..."), therefore we
have to ensure we are not doing this under a per-VMA lock.
I think what happens is that this path races with another page fault
which took mmap_lock for read. fault_dirty_shared_page() releases that
lock, which was taken by the other page-faulting thread, and that thread
triggers the assertion when it finds that the lock it just took was
released from under it.
The following crude change fixed the issue for me but there might be a
more granular way to deal with this:

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3293,18 +3293,18 @@ static vm_fault_t wp_page_shared(struct
vm_fault *vmf, struct folio *folio)
         struct vm_area_struct *vma = vmf->vma;
         vm_fault_t ret = 0;

+        if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+                pte_unmap_unlock(vmf->pte, vmf->ptl);
+                vma_end_read(vmf->vma);
+                return VM_FAULT_RETRY;
+        }
+
         folio_get(folio);

         if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
                 vm_fault_t tmp;

                 pte_unmap_unlock(vmf->pte, vmf->ptl);
-                if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
-                        folio_put(folio);
-                        vma_end_read(vmf->vma);
-                        return VM_FAULT_RETRY;
-                }
-
                 tmp = do_page_mkwrite(vmf, folio);
                 if (unlikely(!tmp || (tmp &
                                       (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {


Matthew, please check if this fix is valid and if there might be a
better way. I think the issue was introduced by 88e2667632d4 ("mm:
handle faults that merely update the accessed bit under the VMA lock").
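
For reference, a from-memory sketch (not a verbatim kernel excerpt) of the
helper that fault_dirty_shared_page() uses to drop the lock:

static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
                                                    struct file *fpin)
{
        unsigned int flags = vmf->flags;

        if (fpin)
                return fpin;

        /*
         * Pin the file and drop the mmap_lock only when a retry is
         * allowed and the caller is willing to wait.
         */
        if (fault_flag_allow_retry_first(flags) &&
            !(flags & FAULT_FLAG_RETRY_NOWAIT)) {
                fpin = get_file(vmf->vma->vm_file);
                mmap_read_unlock(vmf->vma->vm_mm);
        }
        return fpin;
}

Under FAULT_FLAG_VMA_LOCK this thread holds no mmap_lock at all, so that
mmap_read_unlock() releases whichever other thread currently holds it.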
Thanks,
Suren.



>
> > >
> > > >
> > > > --
> > > > Cheers,
> > > >
> > > > David / dhildenb
> > > >
David Hildenbrand Aug. 10, 2023, 7:40 a.m. UTC | #12
On 10.08.23 08:24, Suren Baghdasaryan wrote:
> On Wed, Aug 9, 2023 at 10:29 PM Suren Baghdasaryan <surenb@google.com> wrote:
>>
>> On Wed, Aug 9, 2023 at 11:31 AM Suren Baghdasaryan <surenb@google.com> wrote:
>>>
>>> On Wed, Aug 9, 2023 at 11:08 AM Suren Baghdasaryan <surenb@google.com> wrote:
>>>>
>>>> On Wed, Aug 9, 2023 at 11:04 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>>>>>> Which ends up being
>>>>>>>>>
>>>>>>>>> VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
>>>>>>>>>
>>>>>>>>> I did not check if this is also the case on mainline, and if this series is responsible.
>>>>>>>>
>>>>>>>> Thanks for reporting! I'm checking it now.
>>>>>>>
>>>>>>> Hmm. From the code it's not obvious how lock_mm_and_find_vma() ends up
>>>>>>> calling find_vma() without mmap_lock after successfully completing
>>>>>>> get_mmap_lock_carefully(). lock_mm_and_find_vma+0x3f/0x270 points to
>>>>>>> the first invocation of find_vma(), so this is not even the lock
>>>>>>> upgrade path... I'll try to reproduce this issue and dig up more but
>>>>>>> from the information I have so far this issue does not seem to be
>>>>>>> related to this series.
>>>>>
>>>>> I just checked on mainline and it does not fail there.
>>>
>>> Thanks. Just to eliminate the possibility, I'll try reverting my
>>> patchset in mm-unstable and will try the test again. Will do that in
>>> the evening once I'm home.
>>>
>>>>>
>>>>>>
>>>>>> This is really weird. I added mmap_assert_locked(mm) calls into
>>>>>> get_mmap_lock_carefully() right after we acquire mmap_lock read lock
>>>>>> and one of them triggers right after successful
>>>>>> mmap_read_lock_killable(). Here is my modified version of
>>>>>> get_mmap_lock_carefully():
>>>>>>
>>>>>> static inline bool get_mmap_lock_carefully(struct mm_struct *mm,
>>>>>> struct pt_regs *regs) {
>>>>>>        /* Even if this succeeds, make it clear we might have slept */
>>>>>>        if (likely(mmap_read_trylock(mm))) {
>>>>>>            might_sleep();
>>>>>>            mmap_assert_locked(mm);
>>>>>>            return true;
>>>>>>        }
>>>>>>        if (regs && !user_mode(regs)) {
>>>>>>            unsigned long ip = instruction_pointer(regs);
>>>>>>            if (!search_exception_tables(ip))
>>>>>>                return false;
>>>>>>        }
>>>>>>        if (!mmap_read_lock_killable(mm)) {
>>>>>>            mmap_assert_locked(mm);                     <---- generates a BUG
>>>>>>            return true;
>>>>>>        }
>>>>>>        return false;
>>>>>> }
>>>>>
>>>>> Ehm, that's indeed weird.
>>>>>
>>>>>>
>>>>>> AFAIKT conditions for mmap_read_trylock() and
>>>>>> mmap_read_lock_killable() are checked correctly. Am I missing
>>>>>> something?
>>>>>
>>>>> Weirdly enough, it only triggers during that specific uffd test, right?
>>>>
>>>> Yes, uffd-unit-tests. I even ran it separately to ensure it's not some
>>>> fallback from a previous test and I'm able to reproduce this
>>>> consistently.
>>
>> Yeah, it is somehow related to per-vma locking. Unfortunately I can't
>> reproduce the issue on my VM, so I have to use my host and bisection
>> is slow. I think I'll get to the bottom of this tomorrow.
> 
> Ok, I think I found the issue. 

Nice!

> wp_page_shared() ->
> fault_dirty_shared_page() can drop mmap_lock (see the comment saying
> "Drop the mmap_lock before waiting on IO, if we can...", therefore we
> have to ensure we are not doing this under per-VMA lock.
> I think what happens is that this path is racing with another page
> fault which took mmap_lock for read. fault_dirty_shared_page()
> releases this lock which was taken by another page faulting thread and
> that thread generates an assertion when it finds out the lock it just
> took got released from under it.
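
(For context: the unlock in question happens via maybe_unlock_mmap_for_io().
A condensed sketch of that mm/internal.h helper as it stood is below; it is
slightly abridged, but the key point is that it always drops mmap_lock,
regardless of whether the fault is actually being handled under a per-VMA
lock.)

static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
						    struct file *fpin)
{
	unsigned int flags = vmf->flags;

	if (fpin)
		return fpin;

	/* Pin the file and drop the lock only on the first attempt,
	 * and only when the caller is allowed to retry the fault. */
	if (fault_flag_allow_retry_first(flags) &&
	    !(flags & FAULT_FLAG_RETRY_NOWAIT)) {
		fpin = get_file(vmf->vma->vm_file);
		/* Assumes mmap_lock is what the fault holds; under
		 * FAULT_FLAG_VMA_LOCK this can release a read hold the
		 * current task never took. */
		mmap_read_unlock(vmf->vma->vm_mm);
	}
	return fpin;
}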

I wonder if we could detect that someone releases the mmap lock that was 
not taken by that person, to bail out early at the right place when 
debugging such issues. Only with certain config knobs enabled, of course.

> The following crude change fixed the issue for me but there might be a
> more granular way to deal with this:
> 
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3293,18 +3293,18 @@ static vm_fault_t wp_page_shared(struct
> vm_fault *vmf, struct folio *folio)
>           struct vm_area_struct *vma = vmf->vma;
>           vm_fault_t ret = 0;
> 
> +        if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> +                pte_unmap_unlock(vmf->pte, vmf->ptl);
> +                vma_end_read(vmf->vma);
> +                return VM_FAULT_RETRY;
> +        }
> +

I won't lie: all of these locking checks are a bit hard to get and 
possibly even harder to maintain.

Maybe better mmap unlock sanity checks as spelled out above might help 
improve part of the situation.


And maybe some comments regarding the placement might help as well ;)
Suren Baghdasaryan Aug. 10, 2023, 8:20 p.m. UTC | #13
On Thu, Aug 10, 2023 at 12:41 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 10.08.23 08:24, Suren Baghdasaryan wrote:
> > On Wed, Aug 9, 2023 at 10:29 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >>
> >> On Wed, Aug 9, 2023 at 11:31 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >>>
> >>> On Wed, Aug 9, 2023 at 11:08 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >>>>
> >>>> On Wed, Aug 9, 2023 at 11:04 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>>
> >>>>>>>>> Which ends up being
> >>>>>>>>>
> >>>>>>>>> VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> >>>>>>>>>
> >>>>>>>>> I did not check if this is also the case on mainline, and if this series is responsible.
> >>>>>>>>
> >>>>>>>> Thanks for reporting! I'm checking it now.
> >>>>>>>
> >>>>>>> Hmm. From the code it's not obvious how lock_mm_and_find_vma() ends up
> >>>>>>> calling find_vma() without mmap_lock after successfully completing
> >>>>>>> get_mmap_lock_carefully(). lock_mm_and_find_vma+0x3f/0x270 points to
> >>>>>>> the first invocation of find_vma(), so this is not even the lock
> >>>>>>> upgrade path... I'll try to reproduce this issue and dig up more but
> >>>>>>> from the information I have so far this issue does not seem to be
> >>>>>>> related to this series.
> >>>>>
> >>>>> I just checked on mainline and it does not fail there.
> >>>
> >>> Thanks. Just to eliminate the possibility, I'll try reverting my
> >>> patchset in mm-unstable and will try the test again. Will do that in
> >>> the evening once I'm home.
> >>>
> >>>>>
> >>>>>>
> >>>>>> This is really weird. I added mmap_assert_locked(mm) calls into
> >>>>>> get_mmap_lock_carefully() right after we acquire mmap_lock read lock
> >>>>>> and one of them triggers right after successful
> >>>>>> mmap_read_lock_killable(). Here is my modified version of
> >>>>>> get_mmap_lock_carefully():
> >>>>>>
> >>>>>> static inline bool get_mmap_lock_carefully(struct mm_struct *mm,
> >>>>>> struct pt_regs *regs) {
> >>>>>>        /* Even if this succeeds, make it clear we might have slept */
> >>>>>>        if (likely(mmap_read_trylock(mm))) {
> >>>>>>            might_sleep();
> >>>>>>            mmap_assert_locked(mm);
> >>>>>>            return true;
> >>>>>>        }
> >>>>>>        if (regs && !user_mode(regs)) {
> >>>>>>            unsigned long ip = instruction_pointer(regs);
> >>>>>>            if (!search_exception_tables(ip))
> >>>>>>                return false;
> >>>>>>        }
> >>>>>>        if (!mmap_read_lock_killable(mm)) {
> >>>>>>            mmap_assert_locked(mm);                     <---- generates a BUG
> >>>>>>            return true;
> >>>>>>        }
> >>>>>>        return false;
> >>>>>> }
> >>>>>
> >>>>> Ehm, that's indeed weird.
> >>>>>
> >>>>>>
> >>>>>> AFAICT the conditions for mmap_read_trylock() and
> >>>>>> mmap_read_lock_killable() are checked correctly. Am I missing
> >>>>>> something?
> >>>>>
> >>>>> Weirdly enough, it only triggers during that specific uffd test, right?
> >>>>
> >>>> Yes, uffd-unit-tests. I even ran it separately to ensure it's not some
> >>>> fallback from a previous test and I'm able to reproduce this
> >>>> consistently.
> >>
> >> Yeah, it is somehow related to per-vma locking. Unfortunately I can't
> >> reproduce the issue on my VM, so I have to use my host and bisection
> >> is slow. I think I'll get to the bottom of this tomorrow.
> >
> > Ok, I think I found the issue.
>
> Nice!
>
> > wp_page_shared() ->
> > fault_dirty_shared_page() can drop mmap_lock (see the comment saying
> > "Drop the mmap_lock before waiting on IO, if we can...", therefore we
> > have to ensure we are not doing this under per-VMA lock.
> > I think what happens is that this path is racing with another page
> > fault which took mmap_lock for read. fault_dirty_shared_page()
> > releases this lock which was taken by another page faulting thread and
> > that thread generates an assertion when it finds out the lock it just
> > took got released from under it.
>
> I wonder if we could detect that someone releases the mmap lock that was
> not taken by that person, to bail out early at the right place when
> debugging such issues. Only with certain config knobs enabled, of course.

I think that's doable. If we add a task_struct.mmap_locked = RDLOCK |
WRLOCK | NONE field, it could be set when a task takes the mmap_lock and
reset and checked when it's released. Lockdep would also catch this if
the release code did not race with another page-faulting task (in that
case it would show up as releasing a lock that was never taken).
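
(A minimal sketch of that idea with entirely hypothetical names; this is
not part of the series, only an illustration of the debug check being
discussed:)

/* Hypothetical debug aid, gated behind a config knob. */
#ifdef CONFIG_DEBUG_MMAP_LOCK_OWNER
enum mmap_lock_state { MMAP_LOCKED_NONE, MMAP_LOCKED_RD, MMAP_LOCKED_WR };

static inline void mmap_lock_mark_acquired(enum mmap_lock_state how)
{
	/* mmap_locked would be a new field in task_struct. */
	current->mmap_locked = how;
}

static inline void mmap_lock_check_release(enum mmap_lock_state how)
{
	/* Complain if this task releases a lock it never took. */
	WARN_ON_ONCE(current->mmap_locked != how);
	current->mmap_locked = MMAP_LOCKED_NONE;
}
#endif

The read/write mmap_lock acquire paths would call the first helper and the
corresponding unlock paths the second, so a task releasing an mmap_lock
held by somebody else would warn at the point of the bogus unlock rather
than at the later assertion.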

>
> > The following crude change fixed the issue for me but there might be a
> > more granular way to deal with this:
> >
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3293,18 +3293,18 @@ static vm_fault_t wp_page_shared(struct
> > vm_fault *vmf, struct folio *folio)
> >           struct vm_area_struct *vma = vmf->vma;
> >           vm_fault_t ret = 0;
> >
> > +        if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > +                pte_unmap_unlock(vmf->pte, vmf->ptl);
> > +                vma_end_read(vmf->vma);
> > +                return VM_FAULT_RETRY;
> > +        }
> > +
>
> I won't lie: all of these locking checks are a bit hard to get and
> possibly even harder to maintain.
>
> Maybe better mmap unlock sanity checks as spelled out above might help
> improve part of the situation.
>
>
> And maybe some comments regarding the placement might help as well ;)

I think comments explaining why we bail out would help. I had them in
some but probably not all of the places. Once the code stabilizes I'll
review the result and add more explanatory comments.
Thanks,
Suren.

>
> --
> Cheers,
>
> David / dhildenb
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Matthew Wilcox Aug. 10, 2023, 10 p.m. UTC | #14
On Thu, Aug 10, 2023 at 09:40:57AM +0200, David Hildenbrand wrote:
> I won't lie: all of these locking checks are a bit hard to get and possibly
> even harder to maintain.
> 
> Maybe better mmap unlock sanity checks as spelled out above might help
> improve part of the situation.
> 
> 
> And maybe some comments regarding the placement might help as well ;)

The placement was obvious; we can't call into drivers under the vma
lock.  Not until we've audited all of them.  I haven't yet had the
chance to figure out exactly what is being fixed with this patch ...
give me a few minutes.
Matthew Wilcox Aug. 10, 2023, 10:16 p.m. UTC | #15
On Thu, Aug 10, 2023 at 06:24:15AM +0000, Suren Baghdasaryan wrote:
> Ok, I think I found the issue.  wp_page_shared() ->
> fault_dirty_shared_page() can drop mmap_lock (see the comment saying
> "Drop the mmap_lock before waiting on IO, if we can...", therefore we
> have to ensure we are not doing this under per-VMA lock.

... or we could change maybe_unlock_mmap_for_io() the same way
that we changed folio_lock_or_retry():

+++ b/mm/internal.h
@@ -706,7 +706,7 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
        if (fault_flag_allow_retry_first(flags) &&
            !(flags & FAULT_FLAG_RETRY_NOWAIT)) {
                fpin = get_file(vmf->vma->vm_file);
-               mmap_read_unlock(vmf->vma->vm_mm);
+               release_fault_lock(vmf);
        }
        return fpin;
 }

What do you think?
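
(For reference, release_fault_lock() was added earlier in this series; as
far as I can tell it boils down to roughly the following, so the hunk above
would make maybe_unlock_mmap_for_io() drop whichever lock the fault is
actually being handled under.)

static inline void release_fault_lock(struct vm_fault *vmf)
{
	/* Drop the per-VMA lock if that is what this fault holds,
	 * otherwise fall back to dropping mmap_lock as before. */
	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
		vma_end_read(vmf->vma);
	else
		mmap_read_unlock(vmf->vma->vm_mm);
}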

> I think what happens is that this path is racing with another page
> fault which took mmap_lock for read. fault_dirty_shared_page()
> releases this lock which was taken by another page faulting thread and
> that thread generates an assertion when it finds out the lock it just
> took got released from under it.

I'm confused that our debugging didn't catch this earlier.  lockdep
should always catch this.
Suren Baghdasaryan Aug. 10, 2023, 11:29 p.m. UTC | #16
On Thu, Aug 10, 2023 at 3:16 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Aug 10, 2023 at 06:24:15AM +0000, Suren Baghdasaryan wrote:
> > Ok, I think I found the issue.  wp_page_shared() ->
> > fault_dirty_shared_page() can drop mmap_lock (see the comment saying
> > "Drop the mmap_lock before waiting on IO, if we can...", therefore we
> > have to ensure we are not doing this under per-VMA lock.
>
> ... or we could change maybe_unlock_mmap_for_io() the same way
> that we changed folio_lock_or_retry():
>
> +++ b/mm/internal.h
> @@ -706,7 +706,7 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
>         if (fault_flag_allow_retry_first(flags) &&
>             !(flags & FAULT_FLAG_RETRY_NOWAIT)) {
>                 fpin = get_file(vmf->vma->vm_file);
> -               mmap_read_unlock(vmf->vma->vm_mm);
> +               release_fault_lock(vmf);
>         }
>         return fpin;
>  }
>
> What do you think?

This is very tempting... Let me try that and see if anything explodes,
but yes, this would be ideal.


>
> > I think what happens is that this path is racing with another page
> > fault which took mmap_lock for read. fault_dirty_shared_page()
> > releases this lock which was taken by another page faulting thread and
> > that thread generates an assertion when it finds out the lock it just
> > took got released from under it.
>
> I'm confused that our debugging didn't catch this earlier.  lockdep
> should always catch this.

Maybe this condition is rare enough?
Suren Baghdasaryan Aug. 10, 2023, 11:43 p.m. UTC | #17
On Thu, Aug 10, 2023 at 4:29 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Aug 10, 2023 at 3:16 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Thu, Aug 10, 2023 at 06:24:15AM +0000, Suren Baghdasaryan wrote:
> > > Ok, I think I found the issue.  wp_page_shared() ->
> > > fault_dirty_shared_page() can drop mmap_lock (see the comment saying
> > > "Drop the mmap_lock before waiting on IO, if we can...", therefore we
> > > have to ensure we are not doing this under per-VMA lock.
> >
> > ... or we could change maybe_unlock_mmap_for_io() the same way
> > that we changed folio_lock_or_retry():
> >
> > +++ b/mm/internal.h
> > @@ -706,7 +706,7 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
> >         if (fault_flag_allow_retry_first(flags) &&
> >             !(flags & FAULT_FLAG_RETRY_NOWAIT)) {
> >                 fpin = get_file(vmf->vma->vm_file);
> > -               mmap_read_unlock(vmf->vma->vm_mm);
> > +               release_fault_lock(vmf);
> >         }
> >         return fpin;
> >  }
> >
> > What do you think?
>
> This is very tempting... Let me try that and see if anything explodes,
> but yes, this would be ideal.

Ok, so far it looks good; the problem is not reproducible. I'll run some
more exhaustive testing today.

>
>
> >
> > > I think what happens is that this path is racing with another page
> > > fault which took mmap_lock for read. fault_dirty_shared_page()
> > > releases this lock which was taken by another page faulting thread and
> > > that thread generates an assertion when it finds out the lock it just
> > > took got released from under it.
> >
> > I'm confused that our debugging didn't catch this earlier.  lockdep
> > should always catch this.
>
> Maybe this condition is rare enough?
Suren Baghdasaryan Aug. 11, 2023, 6:13 a.m. UTC | #18
On Thu, Aug 10, 2023 at 4:43 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Aug 10, 2023 at 4:29 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Thu, Aug 10, 2023 at 3:16 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Thu, Aug 10, 2023 at 06:24:15AM +0000, Suren Baghdasaryan wrote:
> > > > Ok, I think I found the issue.  wp_page_shared() ->
> > > > fault_dirty_shared_page() can drop mmap_lock (see the comment saying
> > > > "Drop the mmap_lock before waiting on IO, if we can...", therefore we
> > > > have to ensure we are not doing this under per-VMA lock.
> > >
> > > ... or we could change maybe_unlock_mmap_for_io() the same way
> > > that we changed folio_lock_or_retry():
> > >
> > > +++ b/mm/internal.h
> > > @@ -706,7 +706,7 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
> > >         if (fault_flag_allow_retry_first(flags) &&
> > >             !(flags & FAULT_FLAG_RETRY_NOWAIT)) {
> > >                 fpin = get_file(vmf->vma->vm_file);
> > > -               mmap_read_unlock(vmf->vma->vm_mm);
> > > +               release_fault_lock(vmf);
> > >         }
> > >         return fpin;
> > >  }
> > >
> > > What do you think?
> >
> > This is very tempting... Let me try that and see if anything explodes,
> > but yes, this would be ideal.
>
> Ok, so far looks good, the problem is not reproducible. I'll run some
> more exhaustive testing today.

So far it works without a glitch. Matthew, I think it's fine. If you
post a fixup, please add my Tested-by.
Thanks,
Suren.

>
> >
> >
> > >
> > > > I think what happens is that this path is racing with another page
> > > > fault which took mmap_lock for read. fault_dirty_shared_page()
> > > > releases this lock which was taken by another page faulting thread and
> > > > that thread generates an assertion when it finds out the lock it just
> > > > took got released from under it.
> > >
> > > I'm confused that our debugging didn't catch this earlier.  lockdep
> > > should always catch this.
> >
> > Maybe this condition is rare enough?