mbox series

[RFC,0/13] fs: generic filesystem shutdown handling

Message ID 20240807180706.30713-1-jack@suse.cz (mailing list archive)
Headers show
Series fs: generic filesystem shutdown handling | expand

Message

Jan Kara Aug. 7, 2024, 6:29 p.m. UTC
Hello,

this patch series implements generic handling of filesystem shutdown. The idea
is very simple: Have a superblock flag, which when set, will make VFS refuse
modifications to the filesystem. The patch series consists of several parts.
Patches 1-6 cleanup handling of SB_I_ flags which is currently messy (different
flags seem to have different locks protecting them although they are modified
by plain stores). Patches 7-12 gradually convert code to be able to handle
errors from sb_start_write() / sb_start_pagefault(). Patch 13 then shows how
filesystems can use this generic flag. Additionally, we could remove some
shutdown checks from within ext4 code and rely on checks in VFS but I didn't
want to complicate the series with ext4 specific things.

Also, as Dave suggested, we can lift *_IOC_{SHUTDOWN|GOINGDOWN} ioctl handling
to VFS (currently in 5 filesystems) and just call new ->shutdown op for
the filesystem abort handling itself. But that is kind of independent thing
and this series is long enough as is.

So what do people think?

								Honza

Comments

Dave Chinner Aug. 7, 2024, 11:18 p.m. UTC | #1
On Wed, Aug 07, 2024 at 08:29:45PM +0200, Jan Kara wrote:
> Hello,
> 
> this patch series implements generic handling of filesystem shutdown. The idea
> is very simple: Have a superblock flag, which when set, will make VFS refuse
> modifications to the filesystem. The patch series consists of several parts.
> Patches 1-6 cleanup handling of SB_I_ flags which is currently messy (different
> flags seem to have different locks protecting them although they are modified
> by plain stores). Patches 7-12 gradually convert code to be able to handle
> errors from sb_start_write() / sb_start_pagefault(). Patch 13 then shows how
> filesystems can use this generic flag. Additionally, we could remove some
> shutdown checks from within ext4 code and rely on checks in VFS but I didn't
> want to complicate the series with ext4 specific things.

Overall this looks good. Two things that I noticed that we should
nail down before anything else:

1. The original definition of a 'shutdown filesystem' (i.e. from the
XFS origins) is that a shutdown filesystem must *never* do -physical
IO- after the shutdown is initiated. This is a protection mechanism
for the underlying storage to prevent potential propagation of
problems in the storage media once a serious issue has been
detected. (e.g. suspect physical media can be made worse by
continually trying to read it.) It also allows the block device to
go away and we won't try to access issue new IO to it once the
->shutdown call has been complete.

IOWs, XFS implements a "no new IO after shutdown" architecture, and
this is also largely what ext4 implements as well.

However, this isn't what this generic shutdown infrastructure
implements. It only prevents new user modifications from being
started - it is effectively a "instant RO" mechanism rather than an
"instant no more IO" architecture.

Hence we have an impedence mismatch between existing shutdown
implementations that currently return -EIO on shutdown for all
operations (both read and write) and this generic implementation
which returns -EROFS only for write operations.

Hence the proposed generic shutdown model doesn't really solve the
inconsistent shutdown behaviour problem across filesystems - it just
adds a new inconsistency between existing filesystem shutdown
implementations and the generic infrastructure.

2. On shutdown, this patchset returns -EROFS.

As per #1, returning -EROFS on shutdown will be a significant change
of behaviour for some filesystems as they currently return -EIO when
the filesystem is shut down.

I don't think -EROFS is right, because existing shutdown behaviour
also impacts read-only operations and will return -EIO for them,
too.

I think the error returned by a shutdown filesystem should always be
consistent and that really means -EIO needs to be returned rather
than -EROFS.

However, given this is new generic infrastructure, we can define a
new error like -ESHUTDOWN (to reuse an existing errno) or even a
new errno like -EFSSHUTDOWN for this, document it man pages and then
convert all the existing filesystem shutdown checks to return this
error instead of -EIO...

> Also, as Dave suggested, we can lift *_IOC_{SHUTDOWN|GOINGDOWN} ioctl handling
> to VFS (currently in 5 filesystems) and just call new ->shutdown op for
> the filesystem abort handling itself. But that is kind of independent thing
> and this series is long enough as is.

Agreed - that can be done separately once we've sorted out the
little details of what a shutdown filesystem actually means and how
that gets reported consistently to userspace...

-Dave.
Jan Kara Aug. 8, 2024, 2:32 p.m. UTC | #2
On Thu 08-08-24 09:18:51, Dave Chinner wrote:
> On Wed, Aug 07, 2024 at 08:29:45PM +0200, Jan Kara wrote:
> > Hello,
> > 
> > this patch series implements generic handling of filesystem shutdown. The idea
> > is very simple: Have a superblock flag, which when set, will make VFS refuse
> > modifications to the filesystem. The patch series consists of several parts.
> > Patches 1-6 cleanup handling of SB_I_ flags which is currently messy (different
> > flags seem to have different locks protecting them although they are modified
> > by plain stores). Patches 7-12 gradually convert code to be able to handle
> > errors from sb_start_write() / sb_start_pagefault(). Patch 13 then shows how
> > filesystems can use this generic flag. Additionally, we could remove some
> > shutdown checks from within ext4 code and rely on checks in VFS but I didn't
> > want to complicate the series with ext4 specific things.
> 
> Overall this looks good. Two things that I noticed that we should
> nail down before anything else:
> 
> 1. The original definition of a 'shutdown filesystem' (i.e. from the
> XFS origins) is that a shutdown filesystem must *never* do -physical
> IO- after the shutdown is initiated. This is a protection mechanism
> for the underlying storage to prevent potential propagation of
> problems in the storage media once a serious issue has been
> detected. (e.g. suspect physical media can be made worse by
> continually trying to read it.) It also allows the block device to
> go away and we won't try to access issue new IO to it once the
> ->shutdown call has been complete.
> 
> IOWs, XFS implements a "no new IO after shutdown" architecture, and
> this is also largely what ext4 implements as well.

Thanks for sharing this. I wasn't aware that "no new IO after shutdown" is
the goal. I knew this is required for modifications but I wasn't sure how
strict this was for writes.

> However, this isn't what this generic shutdown infrastructure
> implements. It only prevents new user modifications from being
> started - it is effectively a "instant RO" mechanism rather than an
> "instant no more IO" architecture.
> 
> Hence we have an impedence mismatch between existing shutdown
> implementations that currently return -EIO on shutdown for all
> operations (both read and write) and this generic implementation
> which returns -EROFS only for write operations.
> 
> Hence the proposed generic shutdown model doesn't really solve the
> inconsistent shutdown behaviour problem across filesystems - it just
> adds a new inconsistency between existing filesystem shutdown
> implementations and the generic infrastructure.

OK, understood. I also agree it would be good to keep this no-IO semantics
when implementing the generic solution. I'm just pondering how to achieve
that in a maintainable way. For the write path what I've done looks like
the least painful way. For the read path the simplest is probably to still
return whatever is in cache and just do the check + error return somewhere
down in the call stack just before calling into filesystem. It is easy
enough to stop things like ->read_folio, ->readahead, or ->lookup. But how
about things like ->evict_inode or ->release?  They can trigger IO but
allowing inode reclaim on shutdown fs is desirable I'd say. Similarly for
things like ->remount_fs or ->put_super. So avoiding IO from operations
like these would rely on fs implementation anyway.

> 2. On shutdown, this patchset returns -EROFS.
> 
> As per #1, returning -EROFS on shutdown will be a significant change
> of behaviour for some filesystems as they currently return -EIO when
> the filesystem is shut down.
> 
> I don't think -EROFS is right, because existing shutdown behaviour
> also impacts read-only operations and will return -EIO for them,
> too.
> 
> I think the error returned by a shutdown filesystem should always be
> consistent and that really means -EIO needs to be returned rather
> than -EROFS.
> 
> However, given this is new generic infrastructure, we can define a
> new error like -ESHUTDOWN (to reuse an existing errno) or even a
> new errno like -EFSSHUTDOWN for this, document it man pages and then
> convert all the existing filesystem shutdown checks to return this
> error instead of -EIO...

Right, -EROFS isn't really good return value when we refuse also reads. I
think -EIO is fine. -ESHUTDOWN would be ok but the standard message ("Cannot
send after transport endpoint shutdown") whould be IMO confusing to users.
I was also thinking about -EFSCORRUPTED (alias -EUCLEAN) which already has
some precedens in the filesystem space but -EIO is probably better.

								Honza
Darrick J. Wong Aug. 8, 2024, 2:51 p.m. UTC | #3
On Thu, Aug 08, 2024 at 09:18:51AM +1000, Dave Chinner wrote:
> On Wed, Aug 07, 2024 at 08:29:45PM +0200, Jan Kara wrote:
> > Hello,
> > 
> > this patch series implements generic handling of filesystem shutdown. The idea
> > is very simple: Have a superblock flag, which when set, will make VFS refuse
> > modifications to the filesystem. The patch series consists of several parts.
> > Patches 1-6 cleanup handling of SB_I_ flags which is currently messy (different
> > flags seem to have different locks protecting them although they are modified
> > by plain stores). Patches 7-12 gradually convert code to be able to handle
> > errors from sb_start_write() / sb_start_pagefault(). Patch 13 then shows how
> > filesystems can use this generic flag. Additionally, we could remove some
> > shutdown checks from within ext4 code and rely on checks in VFS but I didn't
> > want to complicate the series with ext4 specific things.
> 
> Overall this looks good. Two things that I noticed that we should
> nail down before anything else:
> 
> 1. The original definition of a 'shutdown filesystem' (i.e. from the
> XFS origins) is that a shutdown filesystem must *never* do -physical
> IO- after the shutdown is initiated. This is a protection mechanism
> for the underlying storage to prevent potential propagation of
> problems in the storage media once a serious issue has been
> detected. (e.g. suspect physical media can be made worse by
> continually trying to read it.) It also allows the block device to
> go away and we won't try to access issue new IO to it once the
> ->shutdown call has been complete.
> 
> IOWs, XFS implements a "no new IO after shutdown" architecture, and
> this is also largely what ext4 implements as well.

I don't think it quite does -- for EXT4_GOING_FLAGS_DEFAULT, it sets the
shutdown flag, but it doesn't actually abort the journal.  I think
that's an implementation bug since XFS /does/ shut down the log.

But looking at XFS_FSOP_GOING_FLAGS_DEFAULT, I also notice that if the
bdev_freeze fails, it returns 0 and the fs isn't shut down.  ext4, otoh,
actually does pass bdev_freeze's errno along.  I think ext4's behavior
is the correct one, right?

> However, this isn't what this generic shutdown infrastructure
> implements. It only prevents new user modifications from being
> started - it is effectively a "instant RO" mechanism rather than an
> "instant no more IO" architecture.

I thought pagefaults are still allowed on a shutdown xfs?  Curiously I
don't see a prohibition on write faults, but iirc we still allowed read
faults so that a shutdown on the rootfs doesn't immediately crash the
whole machine?

> Hence we have an impedence mismatch between existing shutdown
> implementations that currently return -EIO on shutdown for all
> operations (both read and write) and this generic implementation
> which returns -EROFS only for write operations.
> 
> Hence the proposed generic shutdown model doesn't really solve the
> inconsistent shutdown behaviour problem across filesystems - it just
> adds a new inconsistency between existing filesystem shutdown
> implementations and the generic infrastructure.
> 
> 2. On shutdown, this patchset returns -EROFS.
> 
> As per #1, returning -EROFS on shutdown will be a significant change
> of behaviour for some filesystems as they currently return -EIO when
> the filesystem is shut down.
> 
> I don't think -EROFS is right, because existing shutdown behaviour
> also impacts read-only operations and will return -EIO for them,
> too.
> 
> I think the error returned by a shutdown filesystem should always be
> consistent and that really means -EIO needs to be returned rather
> than -EROFS.
> 
> However, given this is new generic infrastructure, we can define a
> new error like -ESHUTDOWN (to reuse an existing errno) or even a
> new errno like -EFSSHUTDOWN for this, document it man pages and then
> convert all the existing filesystem shutdown checks to return this
> error instead of -EIO...

Agree.

> > Also, as Dave suggested, we can lift *_IOC_{SHUTDOWN|GOINGDOWN} ioctl handling
> > to VFS (currently in 5 filesystems) and just call new ->shutdown op for
> > the filesystem abort handling itself. But that is kind of independent thing
> > and this series is long enough as is.
> 
> Agreed - that can be done separately once we've sorted out the
> little details of what a shutdown filesystem actually means and how
> that gets reported consistently to userspace...

I would define it as:

No more writes to the filesystem or its underlying storage; file IO
and stat* calls return ESHUTDOWN; read faults still allowed.

I like the idea of hoisting this to the vfs and defining how one figures
out if the fs is shut down; thank you for working on this, Jan.

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner Aug. 9, 2024, 2:30 a.m. UTC | #4
On Thu, Aug 08, 2024 at 07:51:41AM -0700, Darrick J. Wong wrote:
> On Thu, Aug 08, 2024 at 09:18:51AM +1000, Dave Chinner wrote:
> > On Wed, Aug 07, 2024 at 08:29:45PM +0200, Jan Kara wrote:
> > > Hello,
> > > 
> > > this patch series implements generic handling of filesystem shutdown. The idea
> > > is very simple: Have a superblock flag, which when set, will make VFS refuse
> > > modifications to the filesystem. The patch series consists of several parts.
> > > Patches 1-6 cleanup handling of SB_I_ flags which is currently messy (different
> > > flags seem to have different locks protecting them although they are modified
> > > by plain stores). Patches 7-12 gradually convert code to be able to handle
> > > errors from sb_start_write() / sb_start_pagefault(). Patch 13 then shows how
> > > filesystems can use this generic flag. Additionally, we could remove some
> > > shutdown checks from within ext4 code and rely on checks in VFS but I didn't
> > > want to complicate the series with ext4 specific things.
> > 
> > Overall this looks good. Two things that I noticed that we should
> > nail down before anything else:
> > 
> > 1. The original definition of a 'shutdown filesystem' (i.e. from the
> > XFS origins) is that a shutdown filesystem must *never* do -physical
> > IO- after the shutdown is initiated. This is a protection mechanism
> > for the underlying storage to prevent potential propagation of
> > problems in the storage media once a serious issue has been
> > detected. (e.g. suspect physical media can be made worse by
> > continually trying to read it.) It also allows the block device to
> > go away and we won't try to access issue new IO to it once the
> > ->shutdown call has been complete.
> > 
> > IOWs, XFS implements a "no new IO after shutdown" architecture, and
> > this is also largely what ext4 implements as well.
> 
> I don't think it quite does -- for EXT4_GOING_FLAGS_DEFAULT, it sets the
> shutdown flag, but it doesn't actually abort the journal. I think
> that's an implementation bug since XFS /does/ shut down the log.
>
> But looking at XFS_FSOP_GOING_FLAGS_DEFAULT, I also notice that if the
> bdev_freeze fails, it returns 0 and the fs isn't shut down.  ext4, otoh,
> actually does pass bdev_freeze's errno along.  I think ext4's behavior
> is the correct one, right?

Yes, there are inconsistencies in how different filesystems
implement user-driven shutdown operations, but Jan has specifically
left addressing those sorts of inconsistencies in ioctl/->shutdown
implementations for a later patch set. I agree with that approach -
let's first focus on defining a generic model for how shutdown
filesystems should behave once they are shut down. Once we have the
model defined, then we can worry about making filesystems shutdown
mechanisms behave consistently within that model..

> > However, this isn't what this generic shutdown infrastructure
> > implements. It only prevents new user modifications from being
> > started - it is effectively a "instant RO" mechanism rather than an
> > "instant no more IO" architecture.
> 
> I thought pagefaults are still allowed on a shutdown xfs?  Curiously I
> don't see a prohibition on write faults, but iirc we still allowed read
> faults so that a shutdown on the rootfs doesn't immediately crash the
> whole machine?

Yes, page faults are allowed on a shutdown XFS filesystem right up
to the point where they need to do IO on a page cache miss.  Then
the IO request hits the block mapping code (xfs_bmapi_read()), sees
the shutdown state and the read IO fails. The result of this is
SIGBUS for your executable.

IOWs, if the executable is cached, it will keep running after a
shutdown. If it's not cached, then it's game over already.

> > > Also, as Dave suggested, we can lift *_IOC_{SHUTDOWN|GOINGDOWN} ioctl handling
> > > to VFS (currently in 5 filesystems) and just call new ->shutdown op for
> > > the filesystem abort handling itself. But that is kind of independent thing
> > > and this series is long enough as is.
> > 
> > Agreed - that can be done separately once we've sorted out the
> > little details of what a shutdown filesystem actually means and how
> > that gets reported consistently to userspace...
> 
> I would define it as:
> 
> No more writes to the filesystem or its underlying storage; file IO
> and stat* calls return ESHUTDOWN; read faults still allowed.

We need to operations in terms of whether we allow physical IO or
not. We currently don't allow physical IO from read faults on XFs,
so...

-Dave.
Christian Brauner Aug. 13, 2024, 12:46 p.m. UTC | #5
> things like ->remount_fs or ->put_super. So avoiding IO from operations
> like these would rely on fs implementation anyway.

I think that needs to remain filesystem specific and I don't think this
needs to hold up the patchset.

> Right, -EROFS isn't really good return value when we refuse also reads. I
> think -EIO is fine. -ESHUTDOWN would be ok but the standard message ("Cannot
> send after transport endpoint shutdown") whould be IMO confusing to users.
> I was also thinking about -EFSCORRUPTED (alias -EUCLEAN) which already has
> some precedens in the filesystem space but -EIO is probably better.

EIO isn't great but I agree that it's the best we probably have for now.
Dave Chinner Aug. 14, 2024, 12:09 a.m. UTC | #6
On Thu, Aug 08, 2024 at 04:32:22PM +0200, Jan Kara wrote:
> On Thu 08-08-24 09:18:51, Dave Chinner wrote:
> > On Wed, Aug 07, 2024 at 08:29:45PM +0200, Jan Kara wrote:
> > > Hello,
> > > 
> > > this patch series implements generic handling of filesystem shutdown. The idea
> > > is very simple: Have a superblock flag, which when set, will make VFS refuse
> > > modifications to the filesystem. The patch series consists of several parts.
> > > Patches 1-6 cleanup handling of SB_I_ flags which is currently messy (different
> > > flags seem to have different locks protecting them although they are modified
> > > by plain stores). Patches 7-12 gradually convert code to be able to handle
> > > errors from sb_start_write() / sb_start_pagefault(). Patch 13 then shows how
> > > filesystems can use this generic flag. Additionally, we could remove some
> > > shutdown checks from within ext4 code and rely on checks in VFS but I didn't
> > > want to complicate the series with ext4 specific things.
> > 
> > Overall this looks good. Two things that I noticed that we should
> > nail down before anything else:
> > 
> > 1. The original definition of a 'shutdown filesystem' (i.e. from the
> > XFS origins) is that a shutdown filesystem must *never* do -physical
> > IO- after the shutdown is initiated. This is a protection mechanism
> > for the underlying storage to prevent potential propagation of
> > problems in the storage media once a serious issue has been
> > detected. (e.g. suspect physical media can be made worse by
> > continually trying to read it.) It also allows the block device to
> > go away and we won't try to access issue new IO to it once the
> > ->shutdown call has been complete.
> > 
> > IOWs, XFS implements a "no new IO after shutdown" architecture, and
> > this is also largely what ext4 implements as well.
> 
> Thanks for sharing this. I wasn't aware that "no new IO after shutdown" is
> the goal. I knew this is required for modifications but I wasn't sure how
> strict this was for writes.
> 
> > However, this isn't what this generic shutdown infrastructure
> > implements. It only prevents new user modifications from being
> > started - it is effectively a "instant RO" mechanism rather than an
> > "instant no more IO" architecture.
> > 
> > Hence we have an impedence mismatch between existing shutdown
> > implementations that currently return -EIO on shutdown for all
> > operations (both read and write) and this generic implementation
> > which returns -EROFS only for write operations.
> > 
> > Hence the proposed generic shutdown model doesn't really solve the
> > inconsistent shutdown behaviour problem across filesystems - it just
> > adds a new inconsistency between existing filesystem shutdown
> > implementations and the generic infrastructure.
> 
> OK, understood. I also agree it would be good to keep this no-IO semantics
> when implementing the generic solution. I'm just pondering how to achieve
> that in a maintainable way. For the write path what I've done looks like
> the least painful way. For the read path the simplest is probably to still
> return whatever is in cache and just do the check + error return somewhere
> down in the call stack just before calling into filesystem. It is easy
> enough to stop things like ->read_folio, ->readahead, or ->lookup. But how
> about things like ->evict_inode or ->release?

If the filesystem is shut down, inode eviction or releasing a FD
should not be doing any IO at all - they should just be releasing
in-memory resources and freeing the objects being released. e.g. we
don't process unlinked inodes when the filesystem is shut down; they
remain as unlinked on disk and recovery gets to clean up the mess.
i.e.  we process all inodes as if they were clean, linked inodes and
just tear down the in-memory structures attached to the inode.

i.e. shutdown isn't concerned about keeping anything consistent
either in memory or on disk - it's concerned only about releasing
in-memory resources such that the filesystem can be unmounted
without doing any IO at all.  e.g. ext4_evict_inode() needs to treat
all unlinked inodes as if they are bad when the filesystem is shut
down. XFS does this (see the shutdown check in
xfs_inode_needs_inactive()) and every filesystem that does unlinked
inode processing in inode eviction will need similar modifications.

Yes, this means a "shutdown means no IO" model cannot be exclusively
implemented at the VFS - it will need things like filesystems with
customised inode eviction callouts to handle these cases themselves.

> They can trigger IO but
> allowing inode reclaim on shutdown fs is desirable I'd say. Similarly for
> things like ->remount_fs or ->put_super. So avoiding IO from operations
> like these would rely on fs implementation anyway.

remounts need to follow the fundamental rule of shutdowns: you can't
change the state of a shutdown filesystem -at all- because any
operation on a shutdown filesystem should be immediately failed. The
only thing you can reliably do once a filesystem is shut down is
unmount it.  IOWs, the VFS should return -EIO when a remount is
requested on a shutdown filesystem, and the filesystem code then
doesn't have to care.

As for ->put_super(), this should act as if the filesystem is clean
when the filesystem is shut down as everything that is dirty in
memory will never get cleaned. IOWs, once shutdown has been set,
dirty state should be completely ignored by everything and so object
release/eviction should tear everything down regardless of it's
state.

Supporting a "no-IO shutdown" model properly will require filesystem
specific changes to handle, but that's really implementation details
more than anything else. What we need to do first is define
and document exactly what shutdown means and how the VFS and
filesystems need to operate when that bit is set. Then we have a
clear framework from which we can consistently answer "what should
filesystem X do in this situation" issues that arise...

-Dave.