
[1/3] mm: handle swap page faults under VMA lock if page is uncontended

Message ID 20230501175025.36233-1-surenb@google.com (mailing list archive)
State New

Commit Message

Suren Baghdasaryan May 1, 2023, 5:50 p.m. UTC
When a page fault is handled under VMA lock protection, all swap page
faults are currently retried with mmap_lock because the
folio_lock_or_retry() implementation has to drop and reacquire
mmap_lock if the folio could not be locked immediately.
Instead of retrying all swap page faults, retry only when folio
locking fails.
Drivers implementing ops->migrate_to_ram might still rely on mmap_lock,
therefore fall back to mmap_lock in this case.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/filemap.c |  6 ++++++
 mm/memory.c  | 14 +++++++++-----
 2 files changed, 15 insertions(+), 5 deletions(-)

Comments

Matthew Wilcox May 2, 2023, 2:02 a.m. UTC | #1
On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> +++ b/mm/memory.c
> @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	if (!pte_unmap_same(vmf))
>  		goto out;
>  
> -	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> -		ret = VM_FAULT_RETRY;
> -		goto out;
> -	}
> -
>  	entry = pte_to_swp_entry(vmf->orig_pte);
>  	if (unlikely(non_swap_entry(entry))) {
>  		if (is_migration_entry(entry)) {

You're missing the necessary fallback in the (!folio) case.
swap_readpage() is synchronous and will sleep.
Matthew Wilcox May 2, 2023, 3:22 a.m. UTC | #2
On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> > > +++ b/mm/memory.c
> > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >       if (!pte_unmap_same(vmf))
> > >               goto out;
> > >
> > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > -             ret = VM_FAULT_RETRY;
> > > -             goto out;
> > > -     }
> > > -
> > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > >       if (unlikely(non_swap_entry(entry))) {
> > >               if (is_migration_entry(entry)) {
> >
> > You're missing the necessary fallback in the (!folio) case.
> > swap_readpage() is synchronous and will sleep.
> 
> True, but is it unsafe to do that under VMA lock and has to be done
> under mmap_lock?

... you were the one arguing that we didn't want to wait for I/O with
the VMA lock held?
Suren Baghdasaryan May 2, 2023, 5:04 a.m. UTC | #3
On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> > > > +++ b/mm/memory.c
> > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > >       if (!pte_unmap_same(vmf))
> > > >               goto out;
> > > >
> > > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > -             ret = VM_FAULT_RETRY;
> > > > -             goto out;
> > > > -     }
> > > > -
> > > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > > >       if (unlikely(non_swap_entry(entry))) {
> > > >               if (is_migration_entry(entry)) {
> > >
> > > You're missing the necessary fallback in the (!folio) case.
> > > swap_readpage() is synchronous and will sleep.
> >
> > True, but is it unsafe to do that under VMA lock and has to be done
> > under mmap_lock?
>
> ... you were the one arguing that we didn't want to wait for I/O with
> the VMA lock held?

Well, that discussion was about waiting in folio_lock_or_retry() with
the lock being held. I argued against it because currently we drop
the mmap_lock before waiting, so if we don't drop the VMA lock we would
be changing the current behavior, which might introduce new
regressions. In the case of swap_readpage and swapin_readahead we
already wait with mmap_lock held, so waiting with the VMA lock held does
not introduce new problems (unless there is a need to hold mmap_lock).

That said, you are absolutely correct that this situation can be
improved by dropping the lock in these cases too. I just didn't want
to attack everything at once. I believe after we agree on the approach
implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com
for dropping the VMA lock before waiting, these cases can be added
more easily. Does that make sense?
Matthew Wilcox May 2, 2023, 3:03 p.m. UTC | #4
On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote:
> On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> > > > > +++ b/mm/memory.c
> > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > >       if (!pte_unmap_same(vmf))
> > > > >               goto out;
> > > > >
> > > > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > > -             ret = VM_FAULT_RETRY;
> > > > > -             goto out;
> > > > > -     }
> > > > > -
> > > > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > > > >       if (unlikely(non_swap_entry(entry))) {
> > > > >               if (is_migration_entry(entry)) {
> > > >
> > > > You're missing the necessary fallback in the (!folio) case.
> > > > swap_readpage() is synchronous and will sleep.
> > >
> > > True, but is it unsafe to do that under VMA lock and has to be done
> > > under mmap_lock?
> >
> > ... you were the one arguing that we didn't want to wait for I/O with
> > the VMA lock held?
> 
> Well, that discussion was about waiting in folio_lock_or_retry() with
> the lock being held. I argued against it because currently we drop
> mmap_lock lock before waiting, so if we don't drop VMA lock we would
> be changing the current behavior which might introduce new
> regressions. In the case of swap_readpage and swapin_readahead we
> already wait with mmap_lock held, so waiting with VMA lock held does
> not introduce new problems (unless there is a need to hold mmap_lock).
> 
> That said, you are absolutely correct that this situation can be
> improved by dropping the lock in these cases too. I just didn't want
> to attack everything at once. I believe after we agree on the approach
> implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com
> for dropping the VMA lock before waiting, these cases can be added
> easier. Does that make sense?

OK, I looked at this path some more, and I think we're fine.  This
path is only taken for SWP_SYNCHRONOUS_IO, which is only set for
QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms
(both btt and pmem).  So the answer is that we don't sleep in this
path, and there's no need to drop the lock.
Suren Baghdasaryan May 2, 2023, 4:36 p.m. UTC | #5
On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote:
> > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> > > > > > +++ b/mm/memory.c
> > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > > >       if (!pte_unmap_same(vmf))
> > > > > >               goto out;
> > > > > >
> > > > > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > > > -             ret = VM_FAULT_RETRY;
> > > > > > -             goto out;
> > > > > > -     }
> > > > > > -
> > > > > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > > > > >       if (unlikely(non_swap_entry(entry))) {
> > > > > >               if (is_migration_entry(entry)) {
> > > > >
> > > > > You're missing the necessary fallback in the (!folio) case.
> > > > > swap_readpage() is synchronous and will sleep.
> > > >
> > > > True, but is it unsafe to do that under VMA lock and has to be done
> > > > under mmap_lock?
> > >
> > > ... you were the one arguing that we didn't want to wait for I/O with
> > > the VMA lock held?
> >
> > Well, that discussion was about waiting in folio_lock_or_retry() with
> > the lock being held. I argued against it because currently we drop
> > mmap_lock lock before waiting, so if we don't drop VMA lock we would
> > be changing the current behavior which might introduce new
> > regressions. In the case of swap_readpage and swapin_readahead we
> > already wait with mmap_lock held, so waiting with VMA lock held does
> > not introduce new problems (unless there is a need to hold mmap_lock).
> >
> > That said, you are absolutely correct that this situation can be
> > improved by dropping the lock in these cases too. I just didn't want
> > to attack everything at once. I believe after we agree on the approach
> > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com
> > for dropping the VMA lock before waiting, these cases can be added
> > easier. Does that make sense?
>
> OK, I looked at this path some more, and I think we're fine.  This
> patch is only called for SWP_SYNCHRONOUS_IO which is only set for
> QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms
> (both btt and pmem).  So the answer is that we don't sleep in this
> path, and there's no need to drop the lock.

Yes but swapin_readahead does sleep, so I'll have to handle that case
too after this.
Matthew Wilcox May 2, 2023, 10:31 p.m. UTC | #6
On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote:
> On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote:
> > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> > > > > > > +++ b/mm/memory.c
> > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > > > >       if (!pte_unmap_same(vmf))
> > > > > > >               goto out;
> > > > > > >
> > > > > > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > > > > -             ret = VM_FAULT_RETRY;
> > > > > > > -             goto out;
> > > > > > > -     }
> > > > > > > -
> > > > > > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > > > > > >       if (unlikely(non_swap_entry(entry))) {
> > > > > > >               if (is_migration_entry(entry)) {
> > > > > >
> > > > > > You're missing the necessary fallback in the (!folio) case.
> > > > > > swap_readpage() is synchronous and will sleep.
> > > > >
> > > > > True, but is it unsafe to do that under VMA lock and has to be done
> > > > > under mmap_lock?
> > > >
> > > > ... you were the one arguing that we didn't want to wait for I/O with
> > > > the VMA lock held?
> > >
> > > Well, that discussion was about waiting in folio_lock_or_retry() with
> > > the lock being held. I argued against it because currently we drop
> > > mmap_lock lock before waiting, so if we don't drop VMA lock we would
> > > be changing the current behavior which might introduce new
> > > regressions. In the case of swap_readpage and swapin_readahead we
> > > already wait with mmap_lock held, so waiting with VMA lock held does
> > > not introduce new problems (unless there is a need to hold mmap_lock).
> > >
> > > That said, you are absolutely correct that this situation can be
> > > improved by dropping the lock in these cases too. I just didn't want
> > > to attack everything at once. I believe after we agree on the approach
> > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com
> > > for dropping the VMA lock before waiting, these cases can be added
> > > easier. Does that make sense?
> >
> > OK, I looked at this path some more, and I think we're fine.  This
> > patch is only called for SWP_SYNCHRONOUS_IO which is only set for
> > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms
> > (both btt and pmem).  So the answer is that we don't sleep in this
> > path, and there's no need to drop the lock.
> 
> Yes but swapin_readahead does sleep, so I'll have to handle that case
> too after this.

Sleeping is OK, we do that in pXd_alloc()!  Do we block on I/O anywhere
in swapin_readahead()?  It all looks like async I/O to me.
Suren Baghdasaryan May 2, 2023, 11:04 p.m. UTC | #7
On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote:
> > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote:
> > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > >
> > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> > > > > > > > +++ b/mm/memory.c
> > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > > > > >       if (!pte_unmap_same(vmf))
> > > > > > > >               goto out;
> > > > > > > >
> > > > > > > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > > > > > -             ret = VM_FAULT_RETRY;
> > > > > > > > -             goto out;
> > > > > > > > -     }
> > > > > > > > -
> > > > > > > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > > > > > > >       if (unlikely(non_swap_entry(entry))) {
> > > > > > > >               if (is_migration_entry(entry)) {
> > > > > > >
> > > > > > > You're missing the necessary fallback in the (!folio) case.
> > > > > > > swap_readpage() is synchronous and will sleep.
> > > > > >
> > > > > > True, but is it unsafe to do that under VMA lock and has to be done
> > > > > > under mmap_lock?
> > > > >
> > > > > ... you were the one arguing that we didn't want to wait for I/O with
> > > > > the VMA lock held?
> > > >
> > > > Well, that discussion was about waiting in folio_lock_or_retry() with
> > > > the lock being held. I argued against it because currently we drop
> > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would
> > > > be changing the current behavior which might introduce new
> > > > regressions. In the case of swap_readpage and swapin_readahead we
> > > > already wait with mmap_lock held, so waiting with VMA lock held does
> > > > not introduce new problems (unless there is a need to hold mmap_lock).
> > > >
> > > > That said, you are absolutely correct that this situation can be
> > > > improved by dropping the lock in these cases too. I just didn't want
> > > > to attack everything at once. I believe after we agree on the approach
> > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com
> > > > for dropping the VMA lock before waiting, these cases can be added
> > > > easier. Does that make sense?
> > >
> > > OK, I looked at this path some more, and I think we're fine.  This
> > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for
> > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms
> > > (both btt and pmem).  So the answer is that we don't sleep in this
> > > path, and there's no need to drop the lock.
> >
> > Yes but swapin_readahead does sleep, so I'll have to handle that case
> > too after this.
>
> Sleeping is OK, we do that in pXd_alloc()!  Do we block on I/O anywhere
> in swapin_readahead()?  It all looks like async I/O to me.

Hmm. I thought that we have synchronous I/O in the following paths:
    swapin_readahead()->swap_cluster_readahead()->swap_readpage()
    swapin_readahead()->swap_vma_readahead()->swap_readpage()
but just noticed that in both cases swap_readpage() is called with the
synchronous parameter being false. So you are probably right here...
Does that mean swapin_readahead() might return a page which does not
have its content swapped-in yet?
Matthew Wilcox May 2, 2023, 11:40 p.m. UTC | #8
On Tue, May 02, 2023 at 04:04:59PM -0700, Suren Baghdasaryan wrote:
> On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote:
> > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote:
> > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > > >
> > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> > > > > > > > > +++ b/mm/memory.c
> > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > > > > > >       if (!pte_unmap_same(vmf))
> > > > > > > > >               goto out;
> > > > > > > > >
> > > > > > > > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > > > > > > -             ret = VM_FAULT_RETRY;
> > > > > > > > > -             goto out;
> > > > > > > > > -     }
> > > > > > > > > -
> > > > > > > > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > > > > > > > >       if (unlikely(non_swap_entry(entry))) {
> > > > > > > > >               if (is_migration_entry(entry)) {
> > > > > > > >
> > > > > > > > You're missing the necessary fallback in the (!folio) case.
> > > > > > > > swap_readpage() is synchronous and will sleep.
> > > > > > >
> > > > > > > True, but is it unsafe to do that under VMA lock and has to be done
> > > > > > > under mmap_lock?
> > > > > >
> > > > > > ... you were the one arguing that we didn't want to wait for I/O with
> > > > > > the VMA lock held?
> > > > >
> > > > > Well, that discussion was about waiting in folio_lock_or_retry() with
> > > > > the lock being held. I argued against it because currently we drop
> > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would
> > > > > be changing the current behavior which might introduce new
> > > > > regressions. In the case of swap_readpage and swapin_readahead we
> > > > > already wait with mmap_lock held, so waiting with VMA lock held does
> > > > > not introduce new problems (unless there is a need to hold mmap_lock).
> > > > >
> > > > > That said, you are absolutely correct that this situation can be
> > > > > improved by dropping the lock in these cases too. I just didn't want
> > > > > to attack everything at once. I believe after we agree on the approach
> > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com
> > > > > for dropping the VMA lock before waiting, these cases can be added
> > > > > easier. Does that make sense?
> > > >
> > > > OK, I looked at this path some more, and I think we're fine.  This
> > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for
> > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms
> > > > (both btt and pmem).  So the answer is that we don't sleep in this
> > > > path, and there's no need to drop the lock.
> > >
> > > Yes but swapin_readahead does sleep, so I'll have to handle that case
> > > too after this.
> >
> > Sleeping is OK, we do that in pXd_alloc()!  Do we block on I/O anywhere
> > in swapin_readahead()?  It all looks like async I/O to me.
> 
> Hmm. I thought that we have synchronous I/O in the following paths:
>     swapin_readahead()->swap_cluster_readahead()->swap_readpage()
>     swapin_readahead()->swap_vma_readahead()->swap_readpage()
> but just noticed that in both cases swap_readpage() is called with the
> synchronous parameter being false. So you are probably right here...
> Does that mean swapin_readahead() might return a page which does not
> have its content swapped-in yet?

That's my understanding.  In that case it's !uptodate and still locked.
The folio_lock_or_retry() will wait for the read to complete unless
we've told it we'd rather retry.
Suren Baghdasaryan May 3, 2023, 1:05 a.m. UTC | #9
On Tue, May 2, 2023 at 4:40 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, May 02, 2023 at 04:04:59PM -0700, Suren Baghdasaryan wrote:
> > On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote:
> > > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote:
> > > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > >
> > > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> > > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> > > > > > > > > > +++ b/mm/memory.c
> > > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > > > > > > >       if (!pte_unmap_same(vmf))
> > > > > > > > > >               goto out;
> > > > > > > > > >
> > > > > > > > > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > > > > > > > -             ret = VM_FAULT_RETRY;
> > > > > > > > > > -             goto out;
> > > > > > > > > > -     }
> > > > > > > > > > -
> > > > > > > > > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > > > > > > > > >       if (unlikely(non_swap_entry(entry))) {
> > > > > > > > > >               if (is_migration_entry(entry)) {
> > > > > > > > >
> > > > > > > > > You're missing the necessary fallback in the (!folio) case.
> > > > > > > > > swap_readpage() is synchronous and will sleep.
> > > > > > > >
> > > > > > > > True, but is it unsafe to do that under VMA lock and has to be done
> > > > > > > > under mmap_lock?
> > > > > > >
> > > > > > > ... you were the one arguing that we didn't want to wait for I/O with
> > > > > > > the VMA lock held?
> > > > > >
> > > > > > Well, that discussion was about waiting in folio_lock_or_retry() with
> > > > > > the lock being held. I argued against it because currently we drop
> > > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would
> > > > > > be changing the current behavior which might introduce new
> > > > > > regressions. In the case of swap_readpage and swapin_readahead we
> > > > > > already wait with mmap_lock held, so waiting with VMA lock held does
> > > > > > not introduce new problems (unless there is a need to hold mmap_lock).
> > > > > >
> > > > > > That said, you are absolutely correct that this situation can be
> > > > > > improved by dropping the lock in these cases too. I just didn't want
> > > > > > to attack everything at once. I believe after we agree on the approach
> > > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com
> > > > > > for dropping the VMA lock before waiting, these cases can be added
> > > > > > easier. Does that make sense?
> > > > >
> > > > > OK, I looked at this path some more, and I think we're fine.  This
> > > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for
> > > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms
> > > > > (both btt and pmem).  So the answer is that we don't sleep in this
> > > > > path, and there's no need to drop the lock.
> > > >
> > > > Yes but swapin_readahead does sleep, so I'll have to handle that case
> > > > too after this.
> > >
> > > Sleeping is OK, we do that in pXd_alloc()!  Do we block on I/O anywhere
> > > in swapin_readahead()?  It all looks like async I/O to me.
> >
> > Hmm. I thought that we have synchronous I/O in the following paths:
> >     swapin_readahead()->swap_cluster_readahead()->swap_readpage()
> >     swapin_readahead()->swap_vma_readahead()->swap_readpage()
> > but just noticed that in both cases swap_readpage() is called with the
> > synchronous parameter being false. So you are probably right here...
> > Does that mean swapin_readahead() might return a page which does not
> > have its content swapped-in yet?
>
> That's my understanding.  In that case it's !uptodate and still locked.
> The folio_lock_or_retry() will wait for the read to complete unless
> we've told it we'd rather retry.

Ok, and we already drop the locks in folio_lock_or_retry() when
needed. Sounds like we cover this case with this patchset?
Yosry Ahmed May 3, 2023, 8:34 a.m. UTC | #10
On Tue, May 2, 2023 at 4:05 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote:
> > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote:
> > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > > >
> > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> > > > > > > > > +++ b/mm/memory.c
> > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > > > > > >       if (!pte_unmap_same(vmf))
> > > > > > > > >               goto out;
> > > > > > > > >
> > > > > > > > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > > > > > > -             ret = VM_FAULT_RETRY;
> > > > > > > > > -             goto out;
> > > > > > > > > -     }
> > > > > > > > > -
> > > > > > > > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > > > > > > > >       if (unlikely(non_swap_entry(entry))) {
> > > > > > > > >               if (is_migration_entry(entry)) {
> > > > > > > >
> > > > > > > > You're missing the necessary fallback in the (!folio) case.
> > > > > > > > swap_readpage() is synchronous and will sleep.
> > > > > > >
> > > > > > > True, but is it unsafe to do that under VMA lock and has to be done
> > > > > > > under mmap_lock?
> > > > > >
> > > > > > ... you were the one arguing that we didn't want to wait for I/O with
> > > > > > the VMA lock held?
> > > > >
> > > > > Well, that discussion was about waiting in folio_lock_or_retry() with
> > > > > the lock being held. I argued against it because currently we drop
> > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would
> > > > > be changing the current behavior which might introduce new
> > > > > regressions. In the case of swap_readpage and swapin_readahead we
> > > > > already wait with mmap_lock held, so waiting with VMA lock held does
> > > > > not introduce new problems (unless there is a need to hold mmap_lock).
> > > > >
> > > > > That said, you are absolutely correct that this situation can be
> > > > > improved by dropping the lock in these cases too. I just didn't want
> > > > > to attack everything at once. I believe after we agree on the approach
> > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com
> > > > > for dropping the VMA lock before waiting, these cases can be added
> > > > > easier. Does that make sense?
> > > >
> > > > OK, I looked at this path some more, and I think we're fine.  This
> > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for
> > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms
> > > > (both btt and pmem).  So the answer is that we don't sleep in this
> > > > path, and there's no need to drop the lock.
> > >
> > > Yes but swapin_readahead does sleep, so I'll have to handle that case
> > > too after this.
> >
> > Sleeping is OK, we do that in pXd_alloc()!  Do we block on I/O anywhere
> > in swapin_readahead()?  It all looks like async I/O to me.
>
> Hmm. I thought that we have synchronous I/O in the following paths:
>     swapin_readahead()->swap_cluster_readahead()->swap_readpage()
>     swapin_readahead()->swap_vma_readahead()->swap_readpage()
> but just noticed that in both cases swap_readpage() is called with the
> synchronous parameter being false. So you are probably right here...

In both swap_cluster_readahead() and swap_vma_readahead() it looks
like if the readahead window is 1 (aka we are not reading ahead), then
we jump to directly calling read_swap_cache_async() passing do_poll =
true, which means we may end up calling swap_readpage() passing
synchronous = true.

I am not familiar with readahead heuristics, so I am not sure how
common this is, but it's something to think about.

> Does that mean swapin_readahead() might return a page which does not
> have its content swapped-in yet?
>
Suren Baghdasaryan May 3, 2023, 7:57 p.m. UTC | #11
On Wed, May 3, 2023 at 1:34 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, May 2, 2023 at 4:05 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote:
> > > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote:
> > > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > >
> > > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote:
> > > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote:
> > > > > > > > > > +++ b/mm/memory.c
> > > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > > > > > > >       if (!pte_unmap_same(vmf))
> > > > > > > > > >               goto out;
> > > > > > > > > >
> > > > > > > > > > -     if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > > > > > > > > -             ret = VM_FAULT_RETRY;
> > > > > > > > > > -             goto out;
> > > > > > > > > > -     }
> > > > > > > > > > -
> > > > > > > > > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > > > > > > > > >       if (unlikely(non_swap_entry(entry))) {
> > > > > > > > > >               if (is_migration_entry(entry)) {
> > > > > > > > >
> > > > > > > > > You're missing the necessary fallback in the (!folio) case.
> > > > > > > > > swap_readpage() is synchronous and will sleep.
> > > > > > > >
> > > > > > > > True, but is it unsafe to do that under VMA lock and has to be done
> > > > > > > > under mmap_lock?
> > > > > > >
> > > > > > > ... you were the one arguing that we didn't want to wait for I/O with
> > > > > > > the VMA lock held?
> > > > > >
> > > > > > Well, that discussion was about waiting in folio_lock_or_retry() with
> > > > > > the lock being held. I argued against it because currently we drop
> > > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would
> > > > > > be changing the current behavior which might introduce new
> > > > > > regressions. In the case of swap_readpage and swapin_readahead we
> > > > > > already wait with mmap_lock held, so waiting with VMA lock held does
> > > > > > not introduce new problems (unless there is a need to hold mmap_lock).
> > > > > >
> > > > > > That said, you are absolutely correct that this situation can be
> > > > > > improved by dropping the lock in these cases too. I just didn't want
> > > > > > to attack everything at once. I believe after we agree on the approach
> > > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com
> > > > > > for dropping the VMA lock before waiting, these cases can be added
> > > > > > easier. Does that make sense?
> > > > >
> > > > > OK, I looked at this path some more, and I think we're fine.  This
> > > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for
> > > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms
> > > > > (both btt and pmem).  So the answer is that we don't sleep in this
> > > > > path, and there's no need to drop the lock.
> > > >
> > > > Yes but swapin_readahead does sleep, so I'll have to handle that case
> > > > too after this.
> > >
> > > Sleeping is OK, we do that in pXd_alloc()!  Do we block on I/O anywhere
> > > in swapin_readahead()?  It all looks like async I/O to me.
> >
> > Hmm. I thought that we have synchronous I/O in the following paths:
> >     swapin_readahead()->swap_cluster_readahead()->swap_readpage()
> >     swapin_readahead()->swap_vma_readahead()->swap_readpage()
> > but just noticed that in both cases swap_readpage() is called with the
> > synchronous parameter being false. So you are probably right here...
>
> In both swap_cluster_readahead() and swap_vma_readahead() it looks
> like if the readahead window is 1 (aka we are not reading ahead), then
> we jump to directly calling read_swap_cache_async() passing do_poll =
> true, which means we may end up calling swap_readpage() passing
> synchronous = true.
>
> I am not familiar with readahead heuristics, so I am not sure how
> common this is, but it's something to think about.

Uh, you are correct. If this branch is common, we could use the same
"drop the lock and retry" pattern inside read_swap_cache_async(). That
would be quite easy to implement.
Thanks for checking on it!

>
> > Does that mean swapin_readahead() might return a page which does not
> > have its content swapped-in yet?
> >
Yosry Ahmed May 3, 2023, 8:57 p.m. UTC | #12
On Wed, May 3, 2023 at 12:57 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Uh, you are correct. If this branch is common, we could use the same
> "drop the lock and retry" pattern inside read_swap_cache_async(). That
> would be quite easy to implement.
> Thanks for checking on it!


I am honestly not sure how common this is.

+Ying who might have a better idea.


>
>
> >
> > > Does that mean swapin_readahead() might return a page which does not
> > > have its content swapped-in yet?
> > >
Huang, Ying May 5, 2023, 5:02 a.m. UTC | #13
Yosry Ahmed <yosryahmed@google.com> writes:

>
> I am honestly not sure how common this is.
>
> +Ying who might have a better idea.

Checked the code and related history.  It seems that we can just pass
"synchronous = false" to swap_readpage() in read_swap_cache_async().
"synchronous = true" was introduced in commit 23955622ff8d ("swap: add
block io poll in swapin path") to reduce swap read latency for block
devices that can be polled.  But in commit 9650b453a3d4 ("block: ignore
RWF_HIPRI hint for sync dio"), the polling is deleted.  So, we don't
need to pass "synchronous = true" to swap_readpage() during
swapin_readahead(), because we will wait for the IO to complete in
folio_lock_or_retry().
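
That is, roughly the following (an untested sketch against a 6.3-era
mm/swap_state.c; the surrounding code may differ):

```diff
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ read_swap_cache_async()
 	if (page_was_allocated)
-		swap_readpage(retpage, do_poll, plug);
+		/* polling is gone since 9650b453a3d4; let
+		 * folio_lock_or_retry() wait for the read instead */
+		swap_readpage(retpage, false, plug);
```

With that, the do_poll parameter has no remaining users and could be
dropped from the readahead callers as well.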

Best Regards,
Huang, Ying

>>
>>
>> >
>> > > Does that mean swapin_readahead() might return a page which does not
>> > > have its content swapped-in yet?
>> > >
Suren Baghdasaryan May 5, 2023, 10:30 p.m. UTC | #14
On Thu, May 4, 2023 at 10:03 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Checked the code and related history.  It seems that we can just pass
> "synchronous = false" to swap_readpage() in read_swap_cache_async().
> "synchronous = true" was introduced in commit 23955622ff8d ("swap: add
> block io poll in swapin path") to reduce swap read latency for block
> devices that can be polled.  But in commit 9650b453a3d4 ("block: ignore
> RWF_HIPRI hint for sync dio"), the polling is deleted.  So, we don't
> need to pass "synchronous = true" to swap_readpage() during
> swapin_readahead(), because we will wait the IO to complete in
> folio_lock_or_retry().

Thanks for investigating, Ying! It sounds like we can make some
simplifications here. I'll double-check and if I don't find anything
else, will change to "synchronous = false" in the next version of the
patchset.


Patch

diff --git a/mm/filemap.c b/mm/filemap.c
index a34abfe8c654..84f39114d4de 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1706,6 +1706,8 @@  static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
  *     mmap_lock has been released (mmap_read_unlock(), unless flags had both
  *     FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
  *     which case mmap_lock is still held.
+ *     If flags had FAULT_FLAG_VMA_LOCK set, meaning the operation is performed
+ *     with VMA lock only, the VMA lock is still held.
  *
  * If neither ALLOW_RETRY nor KILLABLE are set, will always return true
  * with the folio locked and the mmap_lock unperturbed.
@@ -1713,6 +1715,10 @@  static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
 bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
 			 unsigned int flags)
 {
+	/* Can't do this if not holding mmap_lock */
+	if (flags & FAULT_FLAG_VMA_LOCK)
+		return false;
+
 	if (fault_flag_allow_retry_first(flags)) {
 		/*
 		 * CAUTION! In this case, mmap_lock is not released
diff --git a/mm/memory.c b/mm/memory.c
index f69fbc251198..41f45819a923 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3711,11 +3711,6 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!pte_unmap_same(vmf))
 		goto out;
 
-	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
-		ret = VM_FAULT_RETRY;
-		goto out;
-	}
-
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
@@ -3725,6 +3720,15 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 			vmf->page = pfn_swap_entry_to_page(entry);
 			ret = remove_device_exclusive_entry(vmf);
 		} else if (is_device_private_entry(entry)) {
+			if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+				/*
+				 * migrate_to_ram is not yet ready to operate
+				 * under VMA lock.
+				 */
+				ret |= VM_FAULT_RETRY;
+				goto out;
+			}
+
 			vmf->page = pfn_swap_entry_to_page(entry);
 			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
 					vmf->address, &vmf->ptl);