Message ID | 20230404135850.3673404-2-willy@infradead.org (mailing list archive)
---|---
State | New
Series | Avoid the mmap lock for fault-around
On Tue, Apr 4, 2023 at 6:59 AM Matthew Wilcox (Oracle) <willy@infradead.org> wrote:
>
> The fault path will immediately fail in handle_mm_fault(), so this
> is the minimal step which allows the per-VMA lock to be taken on
> file-backed VMAs. There may be a small performance reduction as a
> little unnecessary work will be done on each page fault. See later
> patches for the improvement.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/memory.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index fdaec7772fff..f726f85f0081 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5223,6 +5223,9 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>                                             flags & FAULT_FLAG_REMOTE))
>                 return VM_FAULT_SIGSEGV;
>
> +       if ((flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vma))
> +               return VM_FAULT_RETRY;
> +

There are count_vm_event(PGFAULT) and count_memcg_event_mm(vma->vm_mm,
PGFAULT) earlier in this function. Returning here and retrying I think
will double-count this page fault. Returning before this accounting
should fix this issue.

>         /*
>          * Enable the memcg OOM handling for faults triggered in user
>          * space. Kernel faults are handled more gracefully.
> @@ -5275,12 +5278,8 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>         if (!vma)
>                 goto inval;
>
> -       /* Only anonymous vmas are supported for now */
> -       if (!vma_is_anonymous(vma))
> -               goto inval;
> -
>         /* find_mergeable_anon_vma uses adjacent vmas which are not locked */
> -       if (!vma->anon_vma)
> +       if (vma_is_anonymous(vma) && !vma->anon_vma)
>                 goto inval;
>
>         if (!vma_start_read(vma))
> --
> 2.39.2
>
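A minimal sketch of what Suren is suggesting here -- failing the per-VMA-lock attempt before the PGFAULT counters are bumped, so the retry under mmap_lock is counted only once -- might look like the following. The exact placement of the check inside handle_mm_fault() is an assumption for illustration, not code from the series:

    vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
                               unsigned int flags, struct pt_regs *regs)
    {
            vm_fault_t ret;

            __set_current_state(TASK_RUNNING);

            /*
             * Hypothetical early exit: the caller will retry the same fault
             * under mmap_lock and re-enter handle_mm_fault(), so counting
             * PGFAULT only after this check avoids counting it twice.
             */
            if ((flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vma))
                    return VM_FAULT_RETRY;

            count_vm_event(PGFAULT);
            count_memcg_event_mm(vma->vm_mm, PGFAULT);
            /* ... rest of handle_mm_fault() unchanged ... */
    }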
On Fri, Apr 07, 2023 at 10:54:00AM -0700, Suren Baghdasaryan wrote:
> On Tue, Apr 4, 2023 at 6:59 AM Matthew Wilcox (Oracle)
> <willy@infradead.org> wrote:
> >
> > The fault path will immediately fail in handle_mm_fault(), so this
> > is the minimal step which allows the per-VMA lock to be taken on
> > file-backed VMAs. There may be a small performance reduction as a
> > little unnecessary work will be done on each page fault. See later
> > patches for the improvement.
> >
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > ---
> >  mm/memory.c | 9 ++++-----
> >  1 file changed, 4 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index fdaec7772fff..f726f85f0081 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5223,6 +5223,9 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> >                                             flags & FAULT_FLAG_REMOTE))
> >                 return VM_FAULT_SIGSEGV;
> >
> > +       if ((flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vma))
> > +               return VM_FAULT_RETRY;
> > +
>
> There are count_vm_event(PGFAULT) and count_memcg_event_mm(vma->vm_mm,
> PGFAULT) earlier in this function. Returning here and retrying I think
> will double-count this page fault. Returning before this accounting
> should fix this issue.

You're right, but this will be an issue with later patches in the series
anyway as we move the check further and further down the call-chain.
For that matter, it's an issue in do_swap_page() right now, isn't it?
I suppose we don't care too much because it's the rare case where we go
into do_swap_page() and so the stats are "correct enough".
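The do_swap_page() case Matthew refers to is the existing per-VMA-lock bail-out in that function, which also returns VM_FAULT_RETRY after the PGFAULT counters have already been incremented. Roughly, as a simplified sketch rather than a verbatim quote of the function:

    vm_fault_t do_swap_page(struct vm_fault *vmf)
    {
            struct vm_area_struct *vma = vmf->vma;
            vm_fault_t ret = 0;
            /* ... */
            /* Swap-in is not yet handled under the per-VMA lock: retry. */
            if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
                    ret = VM_FAULT_RETRY;
                    goto out;
            }
            /* ... */
    out:
            return ret;
    }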
On Fri, Apr 7, 2023 at 1:12 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Apr 07, 2023 at 10:54:00AM -0700, Suren Baghdasaryan wrote:
> > On Tue, Apr 4, 2023 at 6:59 AM Matthew Wilcox (Oracle)
> > <willy@infradead.org> wrote:
> > >
> > > The fault path will immediately fail in handle_mm_fault(), so this
> > > is the minimal step which allows the per-VMA lock to be taken on
> > > file-backed VMAs. There may be a small performance reduction as a
> > > little unnecessary work will be done on each page fault. See later
> > > patches for the improvement.
> > >
> > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > > ---
> > >  mm/memory.c | 9 ++++-----
> > >  1 file changed, 4 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index fdaec7772fff..f726f85f0081 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -5223,6 +5223,9 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> > >                                             flags & FAULT_FLAG_REMOTE))
> > >                 return VM_FAULT_SIGSEGV;
> > >
> > > +       if ((flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vma))
> > > +               return VM_FAULT_RETRY;
> > > +
> >
> > There are count_vm_event(PGFAULT) and count_memcg_event_mm(vma->vm_mm,
> > PGFAULT) earlier in this function. Returning here and retrying I think
> > will double-count this page fault. Returning before this accounting
> > should fix this issue.
>
> You're right, but this will be an issue with later patches in the series
> anyway as we move the check further and further down the call-chain.
> For that matter, it's an issue in do_swap_page() right now, isn't it?
> I suppose we don't care too much because it's the rare case where we go
> into do_swap_page() and so the stats are "correct enough".

True. do_swap_page() has the same issue. Can we move these
count_vm_event() calls to the end of handle_mm_fault():

vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
                           unsigned int flags, struct pt_regs *regs)
{
        vm_fault_t ret;

        __set_current_state(TASK_RUNNING);

-       count_vm_event(PGFAULT);
-       count_memcg_event_mm(vma->vm_mm, PGFAULT);

        ret = sanitize_fault_flags(vma, &flags);
        if (ret)
-               return ret;
+               goto out;
...
        mm_account_fault(regs, address, flags, ret);
+out:
+       if (ret != VM_FAULT_RETRY) {
+               count_vm_event(PGFAULT);
+               count_memcg_event_mm(vma->vm_mm, PGFAULT);
+       }

        return ret;
}
?
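In plain (non-diff) form, the shape being proposed above would be roughly the following. This is a sketch only, with the middle of the function elided and the retry test kept exactly as written in the pseudo-diff; the bitmask concern with that test is raised in the next reply:

    vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
                               unsigned int flags, struct pt_regs *regs)
    {
            vm_fault_t ret;

            __set_current_state(TASK_RUNNING);

            ret = sanitize_fault_flags(vma, &flags);
            if (ret)
                    goto out;

            /* ... the existing fault handling, unchanged ... */

            mm_account_fault(regs, address, flags, ret);
    out:
            /* Count the fault only once it will not be retried. */
            if (ret != VM_FAULT_RETRY) {
                    count_vm_event(PGFAULT);
                    count_memcg_event_mm(vma->vm_mm, PGFAULT);
            }
            return ret;
    }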
On Fri, Apr 07, 2023 at 01:26:08PM -0700, Suren Baghdasaryan wrote:
> True. do_swap_page() has the same issue. Can we move these
> count_vm_event() calls to the end of handle_mm_fault():

I was going to suggest something similar, so count that as an
Acked-by. This will change the accounting for the current retry
situation, where we drop the mmap_lock in filemap_fault(), initiate I/O
and return RETRY. I think that's probably a good change though; I don't
see why applications should have their fault counters incremented twice
for that situation.

>         mm_account_fault(regs, address, flags, ret);
> +out:
> +       if (ret != VM_FAULT_RETRY) {
> +               count_vm_event(PGFAULT);
> +               count_memcg_event_mm(vma->vm_mm, PGFAULT);
> +       }

Nit: this is a bitmask, so we should probably have:

        if (!(ret & VM_FAULT_RETRY)) {

in case somebody's ORed it with VM_FAULT_MAJOR or something.

Hm, I wonder if we're handling that correctly; if we need to start I/O,
do we preserve VM_FAULT_MAJOR returned from the first attempt? Not
in a position where I can look at the code right now.
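Concretely, the equality test goes wrong as soon as another bit is set alongside VM_FAULT_RETRY; a small illustration (the combined value is just an example):

    vm_fault_t ret = VM_FAULT_RETRY | VM_FAULT_MAJOR;

    if (ret != VM_FAULT_RETRY) {
            /* Taken: the fault would be counted even though it will be retried. */
    }
    if (!(ret & VM_FAULT_RETRY)) {
            /* Not taken: counting is correctly deferred to the retry. */
    }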
On Fri, Apr 7, 2023 at 2:36 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Apr 07, 2023 at 01:26:08PM -0700, Suren Baghdasaryan wrote:
> > True. do_swap_page() has the same issue. Can we move these
> > count_vm_event() calls to the end of handle_mm_fault():
>
> I was going to suggest something similar, so count that as an
> Acked-by. This will change the accounting for the current retry
> situation, where we drop the mmap_lock in filemap_fault(), initiate I/O
> and return RETRY. I think that's probably a good change though; I don't
> see why applications should have their fault counters incremented twice
> for that situation.
>
> >         mm_account_fault(regs, address, flags, ret);
> > +out:
> > +       if (ret != VM_FAULT_RETRY) {
> > +               count_vm_event(PGFAULT);
> > +               count_memcg_event_mm(vma->vm_mm, PGFAULT);
> > +       }
>
> Nit: this is a bitmask, so we should probably have:
>
>         if (!(ret & VM_FAULT_RETRY)) {
>
> in case somebody's ORed it with VM_FAULT_MAJOR or something.
>
> Hm, I wonder if we're handling that correctly; if we need to start I/O,
> do we preserve VM_FAULT_MAJOR returned from the first attempt? Not
> in a position where I can look at the code right now.

Interesting question. IIUC mm_account_fault() is the place where
VM_FAULT_MAJOR is used to perform minor/major fault accounting.
However it has an early exit:

        /*
         * We don't do accounting for some specific faults:
         *
         * - Unsuccessful faults (e.g. when the address wasn't valid). That
         *   includes arch_vma_access_permitted() failing before reaching here.
         *   So this is not a "this many hardware page faults" counter. We
         *   should use the hw profiling for that.
         *
         * - Incomplete faults (VM_FAULT_RETRY). They will only be counted
         *   once they're completed.
         */
        if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY))
                return;

So, VM_FAULT_RETRY is ignored in the accounting path.
For the current do_swap_page (the only user returning VM_FAULT_RETRY
this late) it's fine because we did not really do anything. However
with your series and with my current attempts to drop the vma lock, do
swapin_readahead() and return VM_FAULT_RETRY the situation changes. We
need to register such (VM_FAULT_MAJOR | VM_FAULT_RETRY) case as a
major fault, otherwise after retry it will be accounted as only a
minor fault. Accounting the retry as a minor fault after the original
major fault is technically also double-counting but probably not as
big of a problem as missing the major fault accounting.
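One possible way to express what Suren describes -- keeping the major-fault accounting even when the first attempt returns VM_FAULT_RETRY -- could be a narrower early exit along these lines. This is purely an illustrative sketch of the idea, not the fix that was ultimately posted:

    if (ret & VM_FAULT_ERROR)
            return;
    /*
     * Hypothetical: a (VM_FAULT_MAJOR | VM_FAULT_RETRY) return still gets
     * accounted as a major fault, accepting that the later retry may then
     * be counted again as a minor fault.
     */
    if ((ret & VM_FAULT_RETRY) && !(ret & VM_FAULT_MAJOR))
            return;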
On Fri, Apr 7, 2023 at 3:40 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Apr 7, 2023 at 2:36 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Apr 07, 2023 at 01:26:08PM -0700, Suren Baghdasaryan wrote:
> > > True. do_swap_page() has the same issue. Can we move these
> > > count_vm_event() calls to the end of handle_mm_fault():
> >
> > I was going to suggest something similar, so count that as an
> > Acked-by. This will change the accounting for the current retry
> > situation, where we drop the mmap_lock in filemap_fault(), initiate I/O
> > and return RETRY. I think that's probably a good change though; I don't
> > see why applications should have their fault counters incremented twice
> > for that situation.
> >
> > >         mm_account_fault(regs, address, flags, ret);
> > > +out:
> > > +       if (ret != VM_FAULT_RETRY) {
> > > +               count_vm_event(PGFAULT);
> > > +               count_memcg_event_mm(vma->vm_mm, PGFAULT);
> > > +       }
> >
> > Nit: this is a bitmask, so we should probably have:
> >
> >         if (!(ret & VM_FAULT_RETRY)) {
> >
> > in case somebody's ORed it with VM_FAULT_MAJOR or something.
> >
> > Hm, I wonder if we're handling that correctly; if we need to start I/O,
> > do we preserve VM_FAULT_MAJOR returned from the first attempt? Not
> > in a position where I can look at the code right now.
>
> Interesting question. IIUC mm_account_fault() is the place where
> VM_FAULT_MAJOR is used to perform minor/major fault accounting.
> However it has an early exit:
>
>         /*
>          * We don't do accounting for some specific faults:
>          *
>          * - Unsuccessful faults (e.g. when the address wasn't valid). That
>          *   includes arch_vma_access_permitted() failing before reaching here.
>          *   So this is not a "this many hardware page faults" counter. We
>          *   should use the hw profiling for that.
>          *
>          * - Incomplete faults (VM_FAULT_RETRY). They will only be counted
>          *   once they're completed.
>          */
>         if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY))
>                 return;
>
> So, VM_FAULT_RETRY is ignored in the accounting path.
> For the current do_swap_page (the only user returning VM_FAULT_RETRY
> this late) it's fine because we did not really do anything. However
> with your series and with my current attempts to drop the vma lock, do
> swapin_readahead() and return VM_FAULT_RETRY the situation changes. We
> need to register such (VM_FAULT_MAJOR | VM_FAULT_RETRY) case as a
> major fault, otherwise after retry it will be accounted as only a
> minor fault. Accounting the retry as a minor fault after the original
> major fault is technically also double-counting but probably not as
> big of a problem as missing the major fault accounting.

Actually, looks like in most cases where VM_FAULT_MAJOR is returned we
also do count_vm_event(PGMAJFAULT) and
count_memcg_event_mm(vma->vm_mm, PGMAJFAULT).
mm_account_fault() modifies only current->{maj|min}_flt and
perf_sw_event().
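The pattern Suren is pointing at is the one used on the swap-in path, where the PGMAJFAULT event counters are bumped at the point VM_FAULT_MAJOR is set, independently of mm_account_fault() (simplified; the surrounding do_swap_page() context is omitted):

    /* Had to read the page from swap area: major fault. */
    ret = VM_FAULT_MAJOR;
    count_vm_event(PGMAJFAULT);
    count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);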
On Fri, Apr 7, 2023 at 2:36 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Apr 07, 2023 at 01:26:08PM -0700, Suren Baghdasaryan wrote:
> > True. do_swap_page() has the same issue. Can we move these
> > count_vm_event() calls to the end of handle_mm_fault():
>
> I was going to suggest something similar, so count that as an
> Acked-by. This will change the accounting for the current retry
> situation, where we drop the mmap_lock in filemap_fault(), initiate I/O
> and return RETRY. I think that's probably a good change though; I don't
> see why applications should have their fault counters incremented twice
> for that situation.
>
> >         mm_account_fault(regs, address, flags, ret);
> > +out:
> > +       if (ret != VM_FAULT_RETRY) {
> > +               count_vm_event(PGFAULT);
> > +               count_memcg_event_mm(vma->vm_mm, PGFAULT);
> > +       }
>
> Nit: this is a bitmask, so we should probably have:
>
>         if (!(ret & VM_FAULT_RETRY)) {
>
> in case somebody's ORed it with VM_FAULT_MAJOR or something.

The fix is posted at
https://lore.kernel.org/all/20230414175444.1837474-1-surenb@google.com/

>
> Hm, I wonder if we're handling that correctly; if we need to start I/O,
> do we preserve VM_FAULT_MAJOR returned from the first attempt? Not
> in a position where I can look at the code right now.
diff --git a/mm/memory.c b/mm/memory.c
index fdaec7772fff..f726f85f0081 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5223,6 +5223,9 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
                                            flags & FAULT_FLAG_REMOTE))
                return VM_FAULT_SIGSEGV;

+       if ((flags & FAULT_FLAG_VMA_LOCK) && !vma_is_anonymous(vma))
+               return VM_FAULT_RETRY;
+
        /*
         * Enable the memcg OOM handling for faults triggered in user
         * space. Kernel faults are handled more gracefully.
@@ -5275,12 +5278,8 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
        if (!vma)
                goto inval;

-       /* Only anonymous vmas are supported for now */
-       if (!vma_is_anonymous(vma))
-               goto inval;
-
        /* find_mergeable_anon_vma uses adjacent vmas which are not locked */
-       if (!vma->anon_vma)
+       if (vma_is_anonymous(vma) && !vma->anon_vma)
                goto inval;

        if (!vma_start_read(vma))
The fault path will immediately fail in handle_mm_fault(), so this
is the minimal step which allows the per-VMA lock to be taken on
file-backed VMAs. There may be a small performance reduction as a
little unnecessary work will be done on each page fault. See later
patches for the improvement.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/memory.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)
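For context on how that immediate failure is consumed: the per-VMA-lock fast path in the arch fault handlers treats VM_FAULT_RETRY as "fall back to the mmap_lock path", roughly as in the sketch below. This is a simplified rendering of that pattern; names and ordering are approximations, not quoted code:

    vma = lock_vma_under_rcu(mm, address);
    if (!vma)
            goto lock_mmap;

    fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
    vma_end_read(vma);

    if (!(fault & VM_FAULT_RETRY))
            goto done;              /* handled entirely under the per-VMA lock */

    lock_mmap:
    /* With this patch, file-backed VMAs land here and are retried the old way. */
    mmap_read_lock(mm);
    /* ... existing mmap_lock-based fault handling ... */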