mbox series

[RFC,v3,0/9] mm: Use DIO for swap and fix NFS swapfiles

Message ID 163250387273.2330363.13240781819520072222.stgit@warthog.procyon.org.uk (mailing list archive)
Headers show
Series mm: Use DIO for swap and fix NFS swapfiles | expand

Message

David Howells Sept. 24, 2021, 5:17 p.m. UTC
Hi Willy, Trond, Christoph,

Here's v3 of a change to make reads and writes from the swapfile use async
DIO, adding a new ->swap_rw() address_space method, rather than readpage()
or direct_IO(), as requested by Willy.  This allows NFS to bypass the write
checks that prevent swapfiles from working, plus a bunch of other checks
that may or may not be necessary.

Whilst trying to make this work, I found that NFS's support for swapfiles
seems to have been non-functional since Aug 2019 (I think), so the first
patch fixes that.  Question is: do we actually *want* to keep this
functionality, given that it seems that no one's tested it with an upstream
kernel in the last couple of years?

There are additional patches to get rid of noop_direct_IO and replace it
with a feature bitmask, to make btrfs, ext4, xfs and raw blockdevs use the
new ->swap_rw method and thence remove the direct BIO submission paths from
swap.

I kept the IOCB_SWAP flag, using it to enable REQ_SWAP.  I'm not sure if
that's necessary, but it seems accounting related.

The synchronous DIO I/O code on NFS, raw blockdev, ext4 swapfile and xfs
swapfile all seem to work fine.  Btrfs refuses to swapon because the file
might be CoW'd.  I've tried doing "chattr +C", but that didn't help.

The async DIO paths fail spectacularly (from I/O errors to ATA failure
messages on the test disk using a normal swapspace); NFS just hangs.

My patches can be found here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=swap-dio

I tested this using the procedure and program outlined in the NFS patch.

I also encountered occasional instances of the following warning with NFS, so
I'm wondering if there's a scheduling problem somewhere:

BUG: workqueue lockup - pool cpus=0-3 flags=0x5 nice=0 stuck for 34s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
  pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
    in-flight: 1565:fill_page_cache_func
workqueue events_highpri: flags=0x10
  pwq 3: cpus=1 node=0 flags=0x1 nice=-20 active=1/256 refcnt=2
    in-flight: 1547:fill_page_cache_func
  pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2
    in-flight: 1811:fill_page_cache_func
workqueue events_unbound: flags=0x2
  pwq 8: cpus=0-3 flags=0x5 nice=0 active=3/512 refcnt=5
    pending: fsnotify_connector_destroy_workfn, fsnotify_mark_destroy_workfn, cleanup_offline_cgwbs_workfn
workqueue events_power_efficient: flags=0x82
  pwq 8: cpus=0-3 flags=0x5 nice=0 active=4/256 refcnt=6
    pending: neigh_periodic_work, neigh_periodic_work, check_lifetime, do_cache_clean
workqueue writeback: flags=0x4a
  pwq 8: cpus=0-3 flags=0x5 nice=0 active=1/256 refcnt=4
    in-flight: 433(RESCUER):wb_workfn
workqueue rpciod: flags=0xa
  pwq 8: cpus=0-3 flags=0x5 nice=0 active=38/256 refcnt=40
    in-flight: 7:rpc_async_schedule, 1609:rpc_async_schedule, 1610:rpc_async_schedule, 912:rpc_async_schedule, 1613:rpc_async_schedule, 1631:rpc_async_schedule, 34:rpc_async_schedule, 44:rpc_async_schedule
    pending: rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule
workqueue ext4-rsv-conversion: flags=0x2000a
pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=59s workers=2 idle: 6
pool 3: cpus=1 node=0 flags=0x1 nice=-20 hung=43s workers=2 manager: 20
pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 498 29
pool 8: cpus=0-3 flags=0x5 nice=0 hung=34s workers=9 manager: 1623
pool 9: cpus=0-3 flags=0x5 nice=-20 hung=0s workers=2 manager: 5224 idle: 859

Note that this is due to DIO writes to NFS only, as far as I can tell, and
that no reads had happened yet.

Changes:
========
ver #3:
   - Introduced a new ->swap_rw() method.
   - Added feature support flags to the address_space_operations struct and
     got rid of the checks for ->direct_() and noop_direct_IO() and
     similar.
   - Implemented swap_rw for nfs, adjusting the direct I/O code paths.
   - Implemented swap_rw for blockdev, btrfs, ext4 and xfs.
   - Got rid of the return value on swap_readpage() as it's never checked.

ver #2:
   - Remove the callback param to __swap_writepage() as it's invariant.
   - Allocate the kiocb on the stack in sync mode.
   - Do an async DIO write if WB_SYNC_ALL isn't set.
   - Try to remove the BIO submission paths.

David

Link: https://lore.kernel.org/r/162876946134.3068428.15475611190876694695.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162879971699.3306668.8977537647318498651.stgit@warthog.procyon.org.uk/ # v2
---
David Howells (9):
      mm: Remove the callback func argument from __swap_writepage()
      mm: Add 'supports' field to the address_space_operations to list features
      mm: Make swap_readpage() void
      Introduce IOCB_SWAP kiocb flag to trigger REQ_SWAP
      mm: Make swap_readpage() for SWP_FS_OPS use ->swap_rw() not ->readpage()
      mm: Make __swap_writepage() do async DIO if asked for it
      nfs: Fix write to swapfile failure due to generic_write_checks()
      block, btrfs, ext4, xfs: Implement swap_rw
      mm: Remove swap BIO paths and only use DIO paths


 Documentation/filesystems/vfs.rst |   8 +
 block/fops.c                      |   2 +
 drivers/block/loop.c              |   6 +-
 fs/9p/vfs_addr.c                  |   1 +
 fs/affs/file.c                    |   1 +
 fs/btrfs/inode.c                  |  14 +-
 fs/ceph/addr.c                    |  13 +-
 fs/cifs/file.c                    |  21 +-
 fs/direct-io.c                    |   2 +
 fs/erofs/data.c                   |   2 +-
 fs/exfat/inode.c                  |   1 +
 fs/ext2/inode.c                   |   4 +-
 fs/ext4/inode.c                   |  17 +-
 fs/f2fs/data.c                    |   1 +
 fs/fat/inode.c                    |   1 +
 fs/fcntl.c                        |   2 +-
 fs/fuse/dax.c                     |   2 +-
 fs/fuse/file.c                    |   1 +
 fs/gfs2/aops.c                    |   2 +-
 fs/hfs/inode.c                    |   1 +
 fs/hfsplus/inode.c                |   1 +
 fs/jfs/inode.c                    |   1 +
 fs/libfs.c                        |  12 -
 fs/nfs/direct.c                   |  28 +--
 fs/nfs/file.c                     |  15 +-
 fs/nilfs2/inode.c                 |   1 +
 fs/ntfs3/inode.c                  |   1 +
 fs/ocfs2/aops.c                   |   1 +
 fs/open.c                         |   3 +-
 fs/orangefs/inode.c               |   1 +
 fs/overlayfs/file.c               |   2 +-
 fs/overlayfs/inode.c              |   3 +-
 fs/reiserfs/inode.c               |   1 +
 fs/udf/file.c                     |   1 +
 fs/udf/inode.c                    |   1 +
 fs/xfs/xfs_aops.c                 |  13 +-
 fs/zonefs/super.c                 |   2 +-
 include/linux/bio.h               |   2 +
 include/linux/fs.h                |   7 +-
 include/linux/nfs_fs.h            |   2 +-
 include/linux/swap.h              |   2 +-
 mm/page_io.c                      | 356 +++++++++++++++---------------
 mm/swapfile.c                     |   4 +-
 43 files changed, 275 insertions(+), 287 deletions(-)

Comments

Dave Chinner Sept. 25, 2021, 11:42 p.m. UTC | #1
On Fri, Sep 24, 2021 at 06:17:52PM +0100, David Howells wrote:
> 
> Hi Willy, Trond, Christoph,
> 
> Here's v3 of a change to make reads and writes from the swapfile use async
> DIO, adding a new ->swap_rw() address_space method, rather than readpage()
> or direct_IO(), as requested by Willy.  This allows NFS to bypass the write
> checks that prevent swapfiles from working, plus a bunch of other checks
> that may or may not be necessary.
> 
> Whilst trying to make this work, I found that NFS's support for swapfiles
> seems to have been non-functional since Aug 2019 (I think), so the first
> patch fixes that.  Question is: do we actually *want* to keep this
> functionality, given that it seems that no one's tested it with an upstream
> kernel in the last couple of years?
> 
> There are additional patches to get rid of noop_direct_IO and replace it
> with a feature bitmask, to make btrfs, ext4, xfs and raw blockdevs use the
> new ->swap_rw method and thence remove the direct BIO submission paths from
> swap.
> 
> I kept the IOCB_SWAP flag, using it to enable REQ_SWAP.  I'm not sure if
> that's necessary, but it seems accounting related.
> 
> The synchronous DIO I/O code on NFS, raw blockdev, ext4 swapfile and xfs
> swapfile all seem to work fine.  Btrfs refuses to swapon because the file
> might be CoW'd.  I've tried doing "chattr +C", but that didn't help.

Ok, so if the filesystem is doing block mapping in the IO path now,
why does the swap file still need to map the file into a private
block mapping now?  i.e all the work that iomap_swapfile_activate()
does for filesystems like XFS and ext4 - it's this completely
redundant now that we are doing block mapping during swap file IO
via iomap_dio_rw()?

Actually, that path does all the "can we use this file as a swap
file" checking. So the extent iteration can't go away, just the swap
file mapping part (iomap_swapfile_add_extent()). This is necessary
to ensure there aren't any holes in the file, and we still need that
because the DIO write path will allocate into holes, which leads
me to my main concern here.

Using the DIO path opens up the possibility that the filesystem
could want to run transactions are part of the DIO. Right now we
support unwritten extents for swap files (so they don't have to be
written to allocate the backing store before activation) and that
means we'll be doing DIO to unwritten extents. IO completion of a
DIO write to an unwritten extent will run a transaction to convert
that extent to written. A similar problem with sparse files exists,
because allocation of blocks can be done from the DIO path, and that
requires transactions. File extension is another potential
transaction path we open up by using DIO writes dor swap.

The problem is that a transaction run in swap IO context will will
deadlock the filesystem. Either through the unbound memory demand of
metadata modification, or from needing log space that can't be freed
up because the metadata IO that will free the log space is waiting
on memory allocation that is waiting on swap IO...

I think some more thought needs to be put into controlling the
behaviour/semantics of the DIO path so that it can be safely used
by swap IO, because it's not a direct 1:1 behavioural mapping with
existing DIO and there are potential deadlock vectors we need to
avoid.

Cheers,

Dave.
Matthew Wilcox Sept. 26, 2021, 3:10 a.m. UTC | #2
On Sun, Sep 26, 2021 at 09:42:43AM +1000, Dave Chinner wrote:
> Ok, so if the filesystem is doing block mapping in the IO path now,
> why does the swap file still need to map the file into a private
> block mapping now?  i.e all the work that iomap_swapfile_activate()
> does for filesystems like XFS and ext4 - it's this completely
> redundant now that we are doing block mapping during swap file IO
> via iomap_dio_rw()?

Hi Dave,

Thanks for bringing up all these points.  I think they all deserve to go
into the documentation as "things to consider" for people implementing
->swap_rw for their filesystem.

Something I don't think David perhaps made sufficiently clear is that
regular DIO from userspace gets handled by ->read_iter and ->write_iter.
This ->swap_rw op is used exclusive for, as the name suggests, swap DIO.
So filesystems don't have to handle swap DIO and regular DIO the same
way, and can split the allocation work between ->swap_activate and the
iomap callback as they see fit (as long as they can guarantee the lack
of deadlocks under memory pressure).

There are several advantages to using the DIO infrastructure for
swap:

 - unify block & net swap paths
 - allow filesystems to _see_ swap IOs instead of being bypassed
 - get rid of the swap extent rbtree
 - allow writing compound pages to swap files instead of splitting
   them
 - allow ->readpage to be synchronous for better error reporting
 - remove page_file_mapping() and page_file_offset()

I suspect there are several problems with this patchset, but I'm not
likely to have a chance to read it closely for a few days.  If you
have time to give the XFS parts a good look, that would be fantastic.
Dave Chinner Sept. 26, 2021, 10:36 p.m. UTC | #3
On Sun, Sep 26, 2021 at 04:10:43AM +0100, Matthew Wilcox wrote:
> On Sun, Sep 26, 2021 at 09:42:43AM +1000, Dave Chinner wrote:
> > Ok, so if the filesystem is doing block mapping in the IO path now,
> > why does the swap file still need to map the file into a private
> > block mapping now?  i.e all the work that iomap_swapfile_activate()
> > does for filesystems like XFS and ext4 - it's this completely
> > redundant now that we are doing block mapping during swap file IO
> > via iomap_dio_rw()?
> 
> Hi Dave,
> 
> Thanks for bringing up all these points.  I think they all deserve to go
> into the documentation as "things to consider" for people implementing
> ->swap_rw for their filesystem.
> 
> Something I don't think David perhaps made sufficiently clear is that
> regular DIO from userspace gets handled by ->read_iter and ->write_iter.
> This ->swap_rw op is used exclusive for, as the name suggests, swap DIO.
> So filesystems don't have to handle swap DIO and regular DIO the same
> way, and can split the allocation work between ->swap_activate and the
> iomap callback as they see fit (as long as they can guarantee the lack
> of deadlocks under memory pressure).

I understand this completely.

The point is that the implementation of ->swap_rw is to call
iomap_dio_rw() with the same ops as the normal DIO read/write path
uses. IOWs, apart from the IOCB_SWAP flag, there is no practical
difference between the "swap DIO" and "normal DIO" I/O paths.

> There are several advantages to using the DIO infrastructure for
> swap:
> 
>  - unify block & net swap paths
>  - allow filesystems to _see_ swap IOs instead of being bypassed
>  - get rid of the swap extent rbtree
>  - allow writing compound pages to swap files instead of splitting
>    them
>  - allow ->readpage to be synchronous for better error reporting
>  - remove page_file_mapping() and page_file_offset()
> 
> I suspect there are several problems with this patchset, but I'm not
> likely to have a chance to read it closely for a few days.  If you
> have time to give the XFS parts a good look, that would be fantastic.

That's what I've already done, and all the questions I've raised are
from asking a simple question: what happens if a transaction is
required to complete the iomap_dio_rw() swap write operation?

I mean, this is similar to the problems with IOCB_NOWAIT - we're
supposed to return -EAGAIN if we might block during IO submission,
and one of those situations we have to consider is "do we need to
run a transaction". If we get it wrong (and we do!), then the worst
thing that happens is that there is a long latency for IO
submission. It's a minor performance issue, not the end of the
world.

The difference with IOCB_SWAP is that "don't do transactions during
iomap_dio_rw()" is a _hard requirement_ on both IO submission and
completion. That means, from now and forever, we will have to
guarantee a path through iomap_dio_rw() that will never run
transactions on an IO. That requirement needs to be enforced in
every block mapping callback into each filesystem, as this is
something the iomap infrastructure cannot enforce. Hence we'll have
to plumb IOCB_SWAP into a new IOMAP_SWAP iterator flag to pass to
the ->iomap_begin() DIO methods to ensure they do the right thing.

And then the question becomes: what happens if the filesystem cannot
do the right thing? Can the swap code handle an error? e.g. the
first thing that xfs_direct_write_iomap_begin() and
xfs_read_iomap_begin() do is check if the filesystem is shut down
and returns -EIO in that case. IOWs, we've now got normal filesystem
"reject all IO" corruption protection mechanisms in play. Using
iomap_dio_rw() as it stands means that _all swapfile IO will fail_
if the filesystem shuts down.

Right now the swap file IO can keep going blissfully unaware of the
filesystem failure status. The open swapfile will prevent the
filesystem from being unmounted. Hence to unmount the shutdown
filesystem to correct the problem, first the swap file has to be
turned off, which means we have a fail-safe behaviour. Using the
iomap_dio_rw() path means that swapfile IO _can and will fail_.

AFAICT, swap IO errors are pretty much thrown away by the mm code;
the swap_writepage() return value is ignored or placed on the swap
cache address space and ignored. And it looks like the new read path
just sets PageError() and leaves it to callers to detect and deal
with a swapin failure because swap_readpage() is now void...

So it seems like there's a whole new set of failure cases using the
DIO path introduces into the swap IO path that haven't been
considered here. I can't see why we wouldn't be able to solve them,
but these considerations lead me to think that use of the DIO is
based on an incorrect assumption - DIO is not a "simple low level
IO" interface.

Hence I suspect that we'd be much better off with a new
iomap_swap_rw() implementation that just does what swap needs
without any of the complexity of the DIO API. Internally iomap can
share what it needs to share with the DIO path, but at this point
I'm not sure we should be overloading the iomap_dio_rw() path with
the semantics required by swap.

e.g. we limit iomap_swap_rw() to only accept written or unwritten
block mappings within file size on inodes with clean metadata (i.e.
pure overwrite to guarantee no modification transactions), and then
the fs provided ->iomap_begin callback can ignore shutdown state,
elide inode level locking, do read-only mappings, etc without adding
extra overhead to the existing DIO code path...

Cheers,

Dave.
David Sterba Sept. 27, 2021, 8:07 p.m. UTC | #4
On Fri, Sep 24, 2021 at 06:17:52PM +0100, David Howells wrote:
> 
> Hi Willy, Trond, Christoph,
> 
> Here's v3 of a change to make reads and writes from the swapfile use async
> DIO, adding a new ->swap_rw() address_space method, rather than readpage()
> or direct_IO(), as requested by Willy.  This allows NFS to bypass the write
> checks that prevent swapfiles from working, plus a bunch of other checks
> that may or may not be necessary.
> 
> Whilst trying to make this work, I found that NFS's support for swapfiles
> seems to have been non-functional since Aug 2019 (I think), so the first
> patch fixes that.  Question is: do we actually *want* to keep this
> functionality, given that it seems that no one's tested it with an upstream
> kernel in the last couple of years?
> 
> There are additional patches to get rid of noop_direct_IO and replace it
> with a feature bitmask, to make btrfs, ext4, xfs and raw blockdevs use the
> new ->swap_rw method and thence remove the direct BIO submission paths from
> swap.
> 
> I kept the IOCB_SWAP flag, using it to enable REQ_SWAP.  I'm not sure if
> that's necessary, but it seems accounting related.
> 
> The synchronous DIO I/O code on NFS, raw blockdev, ext4 swapfile and xfs
> swapfile all seem to work fine.  Btrfs refuses to swapon because the file
> might be CoW'd.  I've tried doing "chattr +C", but that didn't help.

There was probably some step missing. The file must not have holes, so
either do 'dd' to the right size or use fallocate (which is recommended
in manual page btrfs(5) SWAPFILE SUPPORT). There are some fstests
exercising swapfile (grep -l _format_swapfile tests/generic/*) so you
could try that without having to set up the swapfile manually.
NeilBrown Sept. 28, 2021, 3:11 a.m. UTC | #5
On Sat, 25 Sep 2021, David Howells wrote:
> Whilst trying to make this work, I found that NFS's support for swapfiles
> seems to have been non-functional since Aug 2019 (I think), so the first
> patch fixes that.  Question is: do we actually *want* to keep this
> functionality, given that it seems that no one's tested it with an upstream
> kernel in the last couple of years?

SUSE definitely want to keep this functionality.  We have customers
using it.
I agree it would be good if it was being tested somewhere....

Thanks,
NeilBrown
David Howells Sept. 29, 2021, 3:45 p.m. UTC | #6
David Sterba <dsterba@suse.cz> wrote:

> > There are additional patches to get rid of noop_direct_IO and replace it
> > with a feature bitmask, to make btrfs, ext4, xfs and raw blockdevs use the
> > new ->swap_rw method and thence remove the direct BIO submission paths from
> > swap.
> > 
> > I kept the IOCB_SWAP flag, using it to enable REQ_SWAP.  I'm not sure if
> > that's necessary, but it seems accounting related.
>
> There was probably some step missing. The file must not have holes, so
> either do 'dd' to the right size or use fallocate (which is recommended
> in manual page btrfs(5) SWAPFILE SUPPORT). There are some fstests
> exercising swapfile (grep -l _format_swapfile tests/generic/*) so you
> could try that without having to set up the swapfile manually.

Yeah.  As advised elsewhere, I removed the file and recreated it, doing the
chattr before extending the file.  At that point swapon worked.  It didn't
work though, and various userspace programs started dying.  I'm guessing my
btrfs_swap_rw() is wrong somehow.

David
Steve French Sept. 30, 2021, 3:54 p.m. UTC | #7
On Mon, Sep 27, 2021 at 10:12 PM NeilBrown <neilb@suse.de> wrote:
>
> On Sat, 25 Sep 2021, David Howells wrote:
> > Whilst trying to make this work, I found that NFS's support for swapfiles
> > seems to have been non-functional since Aug 2019 (I think), so the first
> > patch fixes that.  Question is: do we actually *want* to keep this
> > functionality, given that it seems that no one's tested it with an upstream
> > kernel in the last couple of years?
>
> SUSE definitely want to keep this functionality.  We have customers
> using it.
> I agree it would be good if it was being tested somewhere....
>

I am trying to work through the testing of swap over SMB3 mounts
since there are use cases where you need to expand the swap
space to remote storage and so this requirement comes up.  The main difficulty
I run into is forgetting to mount with the mount options (to store mode bits)
(so swap file has the right permissions) and debugging some of the
xfstests relating to swap can be a little confusing.