[v5,00/12] gfs2: Fix mmap + page fault deadlocks

Message ID 20210803191818.993968-1-agruenba@redhat.com

Message

Andreas Gruenbacher Aug. 3, 2021, 7:18 p.m. UTC
Hi all,

here's another update on top of v5.14-rc4.  There seems to be a bug in
get_user_pages_fast when called with FOLL_FAST_ONLY; please see below.

Changes:

 * Change fault_in_pages_{readable,writeable} to return the number of
   bytes that should be accessible instead of failing outright when
   part of the requested region cannot be faulted in.  Give
   iov_iter_fault_in_readable the same semantics (see the sketch below).

 * Add fault_in_iov_iter_writeable for safely faulting in pages for
   writing without modifying the pages.
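
As a rough sketch, the new calling convention looks like this
(illustrative only; under the v5 semantics, the return value is the
number of bytes that should be accessible, so 0 means that nothing
could be faulted in):

    size_t accessible;

    /* fault in the iterator's backing memory for reading */
    accessible = fault_in_iov_iter_readable(i, len);
    if (accessible == 0)
            return -EFAULT; /* nothing could be faulted in */
    /* otherwise, retry the operation with at most that many bytes */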


With this patch queue, fstest generic/208 (aio-dio-invalidate-failure.c)
endlessly spins in gfs2_file_direct_write.  It looks as if there's a bug
in get_user_pages_fast when called with FOLL_FAST_ONLY:

 (1) The test case performs an aio write into a 32 MB buffer.

 (2) The buffer is initially not in memory, so when iomap_dio_rw() ->
     ... -> bio_iov_iter_get_pages() is called with the iter->noio flag
     set, we get to get_user_pages_fast() with FOLL_FAST_ONLY set.
     get_user_pages_fast() returns 0, which causes
     bio_iov_iter_get_pages to return -EFAULT.

 (3) Then gfs2_file_direct_write faults in the entire buffer with
     fault_in_iov_iter_readable(), which succeeds.

 (4) With the buffer in memory, we retry the iomap_dio_rw() ->
     ... -> bio_iov_iter_get_pages() -> ... -> get_user_pages_fast().
     This should succeed now, but get_user_pages_fast() still returns 0.

 (5) Thus we end up in step (3) again.

The buffered writes generic/208 performs are unrelated to this hang.
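
Schematically, the retry logic in gfs2_file_direct_write amounts to
the following loop (heavily simplified sketch using the pre-series
iomap_dio_rw signature; not the actual patch code):

    do {
            /* steps (2) and (4): direct I/O, page faults disabled */
            ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0);
            if (ret != -EFAULT)
                    break;
            /* step (3): nonzero means some bytes are accessible */
    } while (fault_in_iov_iter_readable(from, iov_iter_count(from)));

With get_user_pages_fast() persistently returning 0, the fault-in
keeps succeeding, so this loop never terminates.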


Apart from the generic/208 hang, gfs2 still needs a better strategy for
faulting in more reasonable chunks of memory at a time and for resuming
requests after faulting in pages.  We've got some of the pieces in place
for safely allowing that, but more work remains to be done.


For immediate consideration by Al Viro:

  iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value


For immediate consideration by Paul Mackerras:

  powerpc/kvm: Fix kvm_use_magic_page


Thanks,
Andreas


Andreas Gruenbacher (12):
  iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
  powerpc/kvm: Fix kvm_use_magic_page
  Turn fault_in_pages_{readable,writeable} into
    fault_in_{readable,writeable}
  Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
  iov_iter: Introduce fault_in_iov_iter_writeable
  gfs2: Add wrapper for iomap_file_buffered_write
  gfs2: Fix mmap + page fault deadlocks for buffered I/O
  iomap: Fix iomap_dio_rw return value for user copies
  iomap: Support restarting direct I/O requests after user copy failures
  iomap: Add done_before argument to iomap_dio_rw
  iov_iter: Introduce noio flag to disable page faults
  gfs2: Fix mmap + page fault deadlocks for direct I/O

 arch/powerpc/kernel/kvm.c           |   3 +-
 arch/powerpc/kernel/signal_32.c     |   4 +-
 arch/powerpc/kernel/signal_64.c     |   2 +-
 arch/x86/kernel/fpu/signal.c        |   8 +-
 drivers/gpu/drm/armada/armada_gem.c |   7 +-
 fs/btrfs/file.c                     |   8 +-
 fs/btrfs/ioctl.c                    |   7 +-
 fs/ext4/file.c                      |   5 +-
 fs/f2fs/file.c                      |   6 +-
 fs/fuse/file.c                      |   2 +-
 fs/gfs2/file.c                      |  95 ++++++++++++++++++++---
 fs/iomap/buffered-io.c              |   2 +-
 fs/iomap/direct-io.c                |  28 +++++--
 fs/ntfs/file.c                      |   2 +-
 fs/xfs/xfs_file.c                   |   6 +-
 fs/zonefs/super.c                   |   4 +-
 include/linux/iomap.h               |  11 ++-
 include/linux/pagemap.h             |  58 +-------------
 include/linux/uio.h                 |   4 +-
 lib/iov_iter.c                      | 107 ++++++++++++++++++++------
 mm/filemap.c                        |   4 +-
 mm/gup.c                            | 113 ++++++++++++++++++++++++++++
 22 files changed, 360 insertions(+), 126 deletions(-)

Comments

Linus Torvalds Aug. 3, 2021, 7:45 p.m. UTC | #1
On Tue, Aug 3, 2021 at 12:18 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
>
> With this patch queue, fstest generic/208 (aio-dio-invalidate-failure.c)
> endlessly spins in gfs2_file_direct_write.  It looks as if there's a bug
> in get_user_pages_fast when called with FOLL_FAST_ONLY:
>
>  (1) The test case performs an aio write into a 32 MB buffer.
>
>  (2) The buffer is initially not in memory, so when iomap_dio_rw() ->
>      ... -> bio_iov_iter_get_pages() is called with the iter->noio flag
>      set, we get to get_user_pages_fast() with FOLL_FAST_ONLY set.
>      get_user_pages_fast() returns 0, which causes
>      bio_iov_iter_get_pages to return -EFAULT.
>
>  (3) Then gfs2_file_direct_write faults in the entire buffer with
>      fault_in_iov_iter_readable(), which succeeds.
>
>  (4) With the buffer in memory, we retry the iomap_dio_rw() ->
>      ... -> bio_iov_iter_get_pages() -> ... -> get_user_pages_fast().
>      This should succeed now, but get_user_pages_fast() still returns 0.

Hmm. Have you tried to figure out why that "still returns 0" happens?

One option - for debugging only - would be to introduce a new flag to
get_user_pages_fast() that says "print out reason if failed" and make
the retry (but not the original one) have that flag set.

There are a couple of things of note when it comes to "get_user_pages_fast()":

 (a) some architectures don't even enable it

 (b) it can be very picky about the page table contents, and wants the
accessed bit to already be set (or the dirty bit, in the case of a
write).

but (a) shouldn't be an issue on any common platform and (b) shouldn't
be an issue with  fault_in_iov_iter_readable() that actually does a
__get_user() so it will access through the page tables.
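
Roughly, the readable side boils down to touching one byte per page
(abridged sketch of fault_in_readable):

    volatile char c;

    while (uaddr < end) {
            if (unlikely(__get_user(c, uaddr)))
                    break;  /* stop at the first bad page */
            uaddr += PAGE_SIZE;
    }

so the accessed bit gets set the normal way.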

(It might be more of an issue with fault_in_iov_iter_writeable() due to
walking the page tables by hand - if we don't do the proper
access/dirty setting, I could see get_user_pages_fast() failing).

Anyway, for reason (a) I do think that eventually we should probably
introduce FOLL_NOFAULT, and allow the full "slow" page table walk -
just not calling down to handle_mm_fault() if it fails.
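
Something like this in the slow path (hypothetical sketch; no such
flag exists at this point):

    /* in faultin_page() or equivalent: */
    if (*flags & FOLL_NOFAULT)
            return -EFAULT; /* walk the page tables, never fault */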

But (a) should be a non-issue in your test environment, and so it
would be interesting to hear what it is that fails. Because scanning
through the patches, they all _look_ fine to me (apart from the one
comment about return values, which is more about being consistent with
copy_to/from_user() and making the code simpler - not about
correctness)

                       Linus
Andreas Gruenbacher Aug. 16, 2021, 7:14 p.m. UTC | #2
On Tue, Aug 3, 2021 at 9:45 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Aug 3, 2021 at 12:18 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > With this patch queue, fstest generic/208 (aio-dio-invalidate-failure.c)
> > endlessly spins in gfs2_file_direct_write.  It looks as if there's a bug
> > in get_user_pages_fast when called with FOLL_FAST_ONLY:
> >
> >  (1) The test case performs an aio write into a 32 MB buffer.
> >
> >  (2) The buffer is initially not in memory, so when iomap_dio_rw() ->
> >      ... -> bio_iov_iter_get_pages() is called with the iter->noio flag
> >      set, we get to get_user_pages_fast() with FOLL_FAST_ONLY set.
> >      get_user_pages_fast() returns 0, which causes
> >      bio_iov_iter_get_pages to return -EFAULT.
> >
> >  (3) Then gfs2_file_direct_write faults in the entire buffer with
> >      fault_in_iov_iter_readable(), which succeeds.
> >
> >  (4) With the buffer in memory, we retry the iomap_dio_rw() ->
> >      ... -> bio_iov_iter_get_pages() -> ... -> get_user_pages_fast().
> >      This should succeed now, but get_user_pages_fast() still returns 0.
>
> Hmm. Have you tried to figure out why that "still returns 0" happens?

The call stack is:

gup_pte_range
gup_pmd_range
gup_pud_range
gup_p4d_range
gup_pgd_range
lockless_pages_from_mm
internal_get_user_pages_fast
get_user_pages_fast
iov_iter_get_pages
__bio_iov_iter_get_pages
bio_iov_iter_get_pages
iomap_dio_bio_actor
iomap_dio_actor
iomap_apply
iomap_dio_rw
gfs2_file_direct_write

In gup_pte_range, pte_special(pte) is true and so we return 0.
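
The check in question (mm/gup.c:gup_pte_range(), abridged):

    if (pte_special(pte))
            goto pte_unmap; /* fast path bails out, caller sees 0 */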

> One option - for debugging only - would be to introduce a new flag to
> get_user_pages_fast() that says "print out reason if failed" and make
> the retry (but not the original one) have that flag set.
>
> There are a couple of things of note when it comes to "get_user_pages_fast()":
>
>  (a) some architectures don't even enable it
>
>  (b) it can be very picky about the page table contents, and wants the
> accessed bit to already be set (or the dirty bit, in the case of a
> write).
>
> but (a) shouldn't be an issue on any common platform and (b) shouldn't
> be an issue with  fault_in_iov_iter_readable() that actually does a
> __get_user() so it will access through the page tables.
>
> (It might be more of an issue with fault_in_iov_iter_writeable() due to
> walking the page tables by hand - if we don't do the proper
> access/dirty setting, I could see get_user_pages_fast() failing).
>
> Anyway, for reason (a) I do think that eventually we should probably
> introduce FOLL_NOFAULT, and allow the full "slow" page table walk -
> just not calling down to handle_mm_fault() if it fails.
>
> But (a) should be a non-issue in your test environment, and so it
> would be interesting to hear what it is that fails. Because scanning
> through the patches, they all _look_ fine to me (apart from the one
> comment about return values, which is more about being consistent with
> copy_to/from_user() and making the code simpler - not about
> correctness)
>
>                        Linus
>

Thanks,
Andreas
Linus Torvalds Aug. 18, 2021, 9:49 p.m. UTC | #3
[ Sorry for the delay, I was on the road and this fell through the cracks ]

On Mon, Aug 16, 2021 at 12:14 PM Andreas Gruenbacher
<agruenba@redhat.com> wrote:
>
> On Tue, Aug 3, 2021 at 9:45 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Hmm. Have you tried to figure out why that "still returns 0" happens?
>
> The call stack is:
>
> gup_pte_range
> gup_pmd_range
> gup_pud_range
> gup_p4d_range
> gup_pgd_range
> lockless_pages_from_mm
> internal_get_user_pages_fast
> get_user_pages_fast
> iov_iter_get_pages
> __bio_iov_iter_get_pages
> bio_iov_iter_get_pages
> iomap_dio_bio_actor
> iomap_dio_actor
> iomap_apply
> iomap_dio_rw
> gfs2_file_direct_write
>
> In gup_pte_range, pte_special(pte) is true and so we return 0.

Ok, so that is indeed something that the fast-case can't handle,
because some of the special code wants to have the mmap_lock so that it
can look at the vma flags (eg "vm_normal_page()" and friends).

That said, some of these cases even the full GUP won't ever handle,
simply because a mapping doesn't necessarily even _have_ a 'struct
page' associated with it if it's a VM_IO mapping.

So it turns out that you can't just always do
fault_in_iov_iter_readable() and then assume that you can do
iov_iter_get_pages() and repeat until successful.

We could certainly make get_user_pages_fast() handle a few more cases,
but I get the feeling that we need to have separate error cases for
EFAULT - no page exists - and the "page exists, but cannot be mapped
as a 'struct page'" case.

I also do still think that even regardless of that, we want to just
add a FOLL_NOFAULT flag that just disables calling handle_mm_fault(),
and then you can use the regular get_user_pages().

That at least gives us the full _normal_ page handling stuff.

                   Linus
Andreas Gruenbacher Aug. 19, 2021, 7:40 p.m. UTC | #4
On Wed, Aug 18, 2021 at 11:49 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> [ Sorry for the delay, I was on the road and this fell through the cracks ]

No harm done, I was busy enough implementing your previous suggestions.

> On Mon, Aug 16, 2021 at 12:14 PM Andreas Gruenbacher
> <agruenba@redhat.com> wrote:
> >
> > On Tue, Aug 3, 2021 at 9:45 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > Hmm. Have you tried to figure out why that "still returns 0" happens?
> >
> > The call stack is:
> >
> > gup_pte_range
> > gup_pmd_range
> > gup_pud_range
> > gup_p4d_range
> > gup_pgd_range
> > lockless_pages_from_mm
> > internal_get_user_pages_fast
> > get_user_pages_fast
> > iov_iter_get_pages
> > __bio_iov_iter_get_pages
> > bio_iov_iter_get_pages
> > iomap_dio_bio_actor
> > iomap_dio_actor
> > iomap_apply
> > iomap_dio_rw
> > gfs2_file_direct_write
> >
> > In gup_pte_range, pte_special(pte) is true and so we return 0.
>
> Ok, so that is indeed something that the fast-case can't handle,
> because some of the special code wants to have the mmap_lock so that it
> can look at the vma flags (eg "vm_normal_page()" and friends).
>
> That said, some of these cases even the full GUP won't ever handle,
> simply because a mapping doesn't necessarily even _have_ a 'struct
> page' associated with it if it's a VM_IO mapping.
>
> So it turns out that you can't just always do
> fault_in_iov_iter_readable() and then assume that you can do
> iov_iter_get_pages() and repeat until successful.
>
> We could certainly make get_user_pages_fast() handle a few more cases,
> but I get the feeling that we need to have separate error cases for
> EFAULT - no page exists - and the "page exists, but cannot be mapped
> as a 'struct page'" case.

Hmm, what if GUP is made to skip VM_IO vmas without adding anything to
the pages array? That would match fault_in_iov_iter_writeable, which
is modeled after __mm_populate and which skips VM_IO and VM_PFNMAP
vmas.
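
(The skip in __mm_populate is essentially:

    if (vma->vm_flags & (VM_IO | VM_PFNMAP))
            continue;       /* leave these vmas alone */

and fault_in_iov_iter_writeable does the equivalent.)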

The other strategy I've added is to scale back the page fault window
to a single page if faulting in multiple pages didn't help, and to
give up if the I/O operation still fails after that. So pathological
cases won't loop indefinitely anymore, at least.
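
Roughly (illustrative sketch only; do_io stands in for the actual I/O
path, which isn't spelled out here):

    size_t window;
    int retries = 0;

    for (;;) {
            ret = do_io(iocb, from);        /* hypothetical I/O step */
            if (ret != -EFAULT || retries++ >= 2)
                    break;
            /* first retry: the whole buffer; second: one page */
            window = retries == 1 ? iov_iter_count(from) : PAGE_SIZE;
            if (!fault_in_iov_iter_readable(from, window))
                    break;  /* no progress at all */
    }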

> I also do still think that even regardless of that, we want to just
> add a FOLL_NOFAULT flag that just disables calling handle_mm_fault(),
> and then you can use the regular get_user_pages().
>
> That at least gives us the full _normal_ page handling stuff.

And it does fix the generic/208 failure.

Thanks,
Andreas
Linus Torvalds Aug. 19, 2021, 8:14 p.m. UTC | #5
On Thu, Aug 19, 2021 at 12:41 PM Andreas Gruenbacher
<agruenba@redhat.com> wrote:
>
> Hmm, what if GUP is made to skip VM_IO vmas without adding anything to
> the pages array? That would match fault_in_iov_iter_writeable, which
> is modeled after __mm_populate and which skips VM_IO and VM_PFNMAP
> vmas.

I don't understand what you mean... GUP already skips VM_IO (and
VM_PFNMAP) pages. It just returns EFAULT.

We could make it return another error. We already have DAX and
FOLL_LONGTERM returning -EOPNOTSUPP.

Of course, I think some code ends up always just returning "number of
pages looked up" and might return 0 for "no pages" rather than the
error for the first page.

So we may end up having interfaces that then lose that explanatory
error code, but I didn't check.

But we couldn't make it just say "skip them and try later addresses",
if that is what you meant. THAT makes no sense - that would just make
GUP look up some other address than what was asked for.

> > I also do still think that even regardless of that, we want to just
> > add a FOLL_NOFAULT flag that just disables calling handle_mm_fault(),
> > and then you can use the regular get_user_pages().
> >
> > That at least gives us the full _normal_ page handling stuff.
>
> And it does fix the generic/208 failure.

Good. So I think the approach is usable, even if we might have corner
cases left.

So I think the remaining issue is exactly things like VM_IO and
VM_PFNMAP. Do the fstests have test-cases for things like this? It
_is_ quite specialized, so it might be a good idea to have that.

Of course, doing direct-IO from special memory regions with zerocopy
might be something special people actually want to do. But I think
we've had that VM_IO flag testing there basically forever, so I don't
think it has ever worked (for some definition of "ever").

            Linus
Andreas Gruenbacher Aug. 19, 2021, 9:39 p.m. UTC | #6
On Thu, Aug 19, 2021 at 10:14 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Aug 19, 2021 at 12:41 PM Andreas Gruenbacher
> <agruenba@redhat.com> wrote:
> >
> > Hmm, what if GUP is made to skip VM_IO vmas without adding anything to
> > the pages array? That would match fault_in_iov_iter_writeable, which
> > is modeled after __mm_populate and which skips VM_IO and VM_PFNMAP
> > vmas.
>
> I don't understand what you mean.. GUP already skips VM_IO (and
> VM_PFNMAP) pages. It just returns EFAULT.
>
> We could make it return another error. We already have DAX and
> FOLL_LONGTERM returning -EOPNOTSUPP.
>
> Of course, I think some code ends up always just returning "number of
> pages looked up" and might return 0 for "no pages" rather than the
> error for the first page.
>
> So we may end up having interfaces that then lose that explanation
> error code, but I didn't check.
>
> But we couldn't make it just say "skip them and try later addresses",
> if that is what you meant. THAT makes no sense - that would just make
> GUP look up some other address than what was asked for.

get_user_pages has start and nr_pages arguments, which specify an
address range from start to start + nr_pages * PAGE_SIZE. If pages !=
NULL, it adds a pointer to that array for each PAGE_SIZE subpage. I
was thinking of skipping over VM_IO vmas in that process, so when the
range starts in a mappable vma, runs into a VM_IO vma, and ends in a
mappable vma, the pages in the pages array would be discontiguous;
they would only cover the mappable vmas. But that would make it
difficult to make sense of what's in the pages array. So scratch that
idea.
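
For reference, the get_user_pages prototype being discussed (as of
v5.14):

    long get_user_pages(unsigned long start, unsigned long nr_pages,
                        unsigned int gup_flags, struct page **pages,
                        struct vm_area_struct **vmas);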

> > > I also do still think that even regardless of that, we want to just
> > > add a FOLL_NOFAULT flag that just disables calling handle_mm_fault(),
> > > and then you can use the regular get_user_pages().
> > >
> > > That at least gives us the full _normal_ page handling stuff.
> >
> > And it does fix the generic/208 failure.
>
> Good. So I think the approach is usable, even if we might have corner
> cases left.
>
> So I think the remaining issue is exactly things like VM_IO and
> VM_PFNMAP. Do the fstests have test-cases for things like this? It
> _is_ quite specialized, it might be a good idea to have that.
>
> Of course, doing direct-IO from special memory regions with zerocopy
> might be something special people actually want to do. But I think
> we've had that VM_IO flag testing there basically forever, so I don't
> think it has ever worked (for some definition of "ever").

The v6 patch queue should handle those cases acceptably well for now,
but I don't think we have tests covering that at all.

Thanks,
Andreas