
[v2,5/7] fs: notify superblocks of backing-device death

Message ID 20151125183724.12508.73971.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State New, archived

Commit Message

Dan Williams Nov. 25, 2015, 6:37 p.m. UTC
Set SB_I_BDIDEAD when a block device is no longer available to service
requests.  This will be used in several places where an fs should give
up early because the block device is gone.  Letting the fs continue on
as if the block device is still present can lead to long latencies
waiting for an fs to detect the loss of its backing device, trigger
crashes, or generate misleading warnings.

Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/genhd.c      |    2 ++
 fs/block_dev.c     |   17 +++++++++++++++++
 include/linux/fs.h |    2 ++
 3 files changed, 21 insertions(+)


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Dave Chinner Nov. 25, 2015, 9:50 p.m. UTC | #1
On Wed, Nov 25, 2015 at 10:37:24AM -0800, Dan Williams wrote:
> Set SB_I_BDIDEAD when a block device is no longer available to service
> requests.  This will be used in several places where an fs should give
> up early because the block device is gone.  Letting the fs continue on
> as if the block device is still present can lead to long latencies
> waiting for an fs to detect the loss of its backing device, trigger
> crashes, or generate misleading warnings.
> 
> Cc: Jan Kara <jack@suse.com>
> Cc: Jens Axboe <axboe@fb.com>
> Suggested-by: Dave Chinner <david@fromorbit.com>

This isn't what I suggested. :/

.....

> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 1dd416bf72f7..d0233d643d33 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1795,6 +1795,23 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
>  }
>  EXPORT_SYMBOL(__invalidate_device);
>  
> +void kill_bdev_super(struct gendisk *disk, int partno)
> +{
> +	struct block_device *bdev = bdget_disk(disk, partno);
> +	struct super_block *sb;
> +
> +	if (!bdev)
> +		return;
> +	sb = get_super(bdev);
> +	if (!sb)
> +		goto out;
> +
> +	sb->s_iflags |= SB_I_BDI_DEAD;
> +	drop_super(sb);
> + out:
> +	bdput(bdev);
> +}

That's not a notification to the filesystem - that's a status flag
the filesystem has to explicitly check for *on every operation*. We
already have checks like these throughout filesystems, but they are
filesystem specific and need to propagate into fs-specific
subsystems that have knowledge of VFS level superblocks.

To that end, what I actually suggested was a callback - something
like a function in the super operations structure so that the
filesystem can take *immediate action* when the block device is
dying. i.e.

void kill_bdev_super(struct gendisk *disk, int partno)
{
	struct block_device *bdev = bdget_disk(disk, partno);
	struct super_block *sb;

	if (!bdev)
		return;
	sb = get_super(bdev);
	if (!sb)
		goto out;

	if (sb->s_ops->shutdown)
		sb->s_ops->shutdown(sb);

	drop_super(sb);
 out:
	bdput(bdev);
}

and then we implement ->shutdown something like this in XFS:

xfs_fs_shutdown(struct super_block *sb)
{
	xfs_force_shutdown(XFS_M(sb), SHUTDOWN_DEVICE_REQ);
}

and so we immediately notify the entire filesystem that a shutdown
state has been entered and the appropriate actions are immediately
taken.

Cheers,

Dave.
Dan Williams Nov. 25, 2015, 10:09 p.m. UTC | #2
On Wed, Nov 25, 2015 at 1:50 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Nov 25, 2015 at 10:37:24AM -0800, Dan Williams wrote:
>> Set SB_I_BDIDEAD when a block device is no longer available to service
>> requests.  This will be used in several places where an fs should give
>> up early because the block device is gone.  Letting the fs continue on
>> as if the block device is still present can lead to long latencies
>> waiting for an fs to detect the loss of its backing device, trigger
>> crashes, or generate misleading warnings.
>>
>> Cc: Jan Kara <jack@suse.com>
>> Cc: Jens Axboe <axboe@fb.com>
>> Suggested-by: Dave Chinner <david@fromorbit.com>
>
> This isn't what I suggested. :/
>
> .....
>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 1dd416bf72f7..d0233d643d33 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -1795,6 +1795,23 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
>>  }
>>  EXPORT_SYMBOL(__invalidate_device);
>>
>> +void kill_bdev_super(struct gendisk *disk, int partno)
>> +{
>> +     struct block_device *bdev = bdget_disk(disk, partno);
>> +     struct super_block *sb;
>> +
>> +     if (!bdev)
>> +             return;
>> +     sb = get_super(bdev);
>> +     if (!sb)
>> +             goto out;
>> +
>> +     sb->s_iflags |= SB_I_BDI_DEAD;
>> +     drop_super(sb);
>> + out:
>> +     bdput(bdev);
>> +}
>
> That's not a notification to the filesystem - that's a status flag
> the filesystem has to explicitly check for *on every operation*. We
> already have checks like these throughout filesystems, but they are
> filesystem specific and need to propagate into fs-specific
> subsystems that have knowledge of VFS level superblocks.
>
> To that end, what I actually suggested was a callback - something
> like a function in the super operations structure so that the
> filesystem can take *immediate action* when the block device is
> dying. i.e.
>
> void kill_bdev_super(struct gendisk *disk, int partno)
> {
>         struct block_device *bdev = bdget_disk(disk, partno);
>         struct super_block *sb;
>
>         if (!bdev)
>                 return;
>         sb = get_super(bdev);
>         if (!sb)
>                 goto out;
>
>         if (sb->s_ops->shutdown)
>                 sb->s_ops->shutdown(sb);
>
>         drop_super(sb);
>  out:
>         bdput(bdev);
> }
>
> and then we implement ->shutdown something like this in XFS:
>
> xfs_fs_shutdown(struct super_block *sb)
> {
>         xfs_force_shutdown(XFS_M(sb), SHUTDOWN_DEVICE_REQ);
> }
>
> and so we immediately notify the entire filesystem that a shutdown
> state has been entered and the appropriate actions are immediately
> taken.
>

That sounds good in theory.  However, in the case of xfs it is already
calling xfs_force_shutdown(), but that does not prevent it from
continuing to wait indefinitely at umount.  For the ext4
mark_inode_dirty() warning, we're triggering the error in generic code.

None of this fixes the problem of dax mappings leaking past block
device remove.  That can be done generically without needing per-fs
code.

Solving the "zombie filesystem after block device down" problem is
incremental to this patch set.
Dave Chinner Nov. 26, 2015, 6:27 a.m. UTC | #3
On Wed, Nov 25, 2015 at 02:09:34PM -0800, Dan Williams wrote:
> On Wed, Nov 25, 2015 at 1:50 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, Nov 25, 2015 at 10:37:24AM -0800, Dan Williams wrote:
> >> Set SB_I_BDIDEAD when a block device is no longer available to service
> >> requests.  This will be used in several places where an fs should give
> >> up early because the block device is gone.  Letting the fs continue on
> >> as if the block device is still present can lead to long latencies
> >> waiting for an fs to detect the loss of its backing device, trigger
> >> crashes, or generate misleading warnings.
> >>
> >> Cc: Jan Kara <jack@suse.com>
> >> Cc: Jens Axboe <axboe@fb.com>
> >> Suggested-by: Dave Chinner <david@fromorbit.com>
> >
> > This isn't what I suggested. :/
> >
> > .....
> >
> >> diff --git a/fs/block_dev.c b/fs/block_dev.c
> >> index 1dd416bf72f7..d0233d643d33 100644
> >> --- a/fs/block_dev.c
> >> +++ b/fs/block_dev.c
> >> @@ -1795,6 +1795,23 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
> >>  }
> >>  EXPORT_SYMBOL(__invalidate_device);
> >>
> >> +void kill_bdev_super(struct gendisk *disk, int partno)
> >> +{
> >> +     struct block_device *bdev = bdget_disk(disk, partno);
> >> +     struct super_block *sb;
> >> +
> >> +     if (!bdev)
> >> +             return;
> >> +     sb = get_super(bdev);
> >> +     if (!sb)
> >> +             goto out;
> >> +
> >> +     sb->s_iflags |= SB_I_BDI_DEAD;
> >> +     drop_super(sb);
> >> + out:
> >> +     bdput(bdev);
> >> +}
> >
> > That's not a notification to the filesystem - that's a status flag
> > the filesystem has to explicitly check for *on every operation*. We
> > already have checks like these throughout filesystems, but they are
> > filesystem specific and need to propagate into fs-specific
> > subsystems that have knowledge of VFS level superblocks.
> >
> > To that end, what I actually suggested was a callback - something
> > like a function in the super operations structure so that the
> > filesystem can take *immediate action* when the block device is
> > dying. i.e.
> >
> > void kill_bdev_super(struct gendisk *disk, int partno)
> > {
> >         struct block_device *bdev = bdget_disk(disk, partno);
> >         struct super_block *sb;
> >
> >         if (!bdev)
> >                 return;
> >         sb = get_super(bdev);
> >         if (!sb)
> >                 goto out;
> >
> >         if (sb->s_ops->shutdown)
> >                 sb->s_ops->shutdown(sb);
> >
> >         drop_super(sb);
> >  out:
> >         bdput(bdev);
> > }
> >
> > and then we implement ->shutdown something like this in XFS:
> >
> > xfs_fs_shutdown(struct super_block *sb)
> > {
> >         xfs_force_shutdown(XFS_M(sb), SHUTDOWN_DEVICE_REQ);
> > }
> >
> > and so we immediately notify the entire filesystem that a shutdown
> > state has been entered and the appropriate actions are immediately
> > taken.
> >
> 
> That sounds good in theory.  However, in the case of xfs it is already
> calling xfs_force_shutdown(),

Where? If XFS does not do any metadata IO, then it won't shut the
filesystem down. We almost always allocate/map blocks without doing
any IO, which means we cannot guarantee erroring out page faults
during get_blocks until a shutdown has been triggered by other
means....

> but that does not prevent it from
> continuing to wait indefinitely at umount.

Which is a bug we need to fix - I don't see how a shutdown
implementation problem is at all relevant to triggering shutdowns in
a prompt manner...

> For the ext4
> mark_inode_dirty() warning, we're triggering the error in generic code.

So? XFS doesn't use that generic code, but we have filesystem
specific issues that we need to handle sanely.

> None of this fixes the problem of dax mappings leaking past block
> device remove.  That can be done generically without needing per-fs
> code.

No, the shutdown is intended to close the race between the device
being removed, the mappings being invalidated and the filesystem
racing to create new mappings during page faults because it doesn't
know the device has been unplugged until it does IO that gets some
error in an unrecoverable path...

Cheers,

Dave.
Dan Williams Nov. 26, 2015, 7:11 a.m. UTC | #4
On Wed, Nov 25, 2015 at 10:27 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Nov 25, 2015 at 02:09:34PM -0800, Dan Williams wrote:
>> On Wed, Nov 25, 2015 at 1:50 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Wed, Nov 25, 2015 at 10:37:24AM -0800, Dan Williams wrote:
[..]
>> That sounds good in theory.  However, in the case of xfs it is already
>> calling xfs_force_shutdown(),
>
> Where?

In the trace I included in the cover letter.

Testing this patch set reveals that xfs needs more XFS_FORCED_SHUTDOWN
checks, especially in the unmount path.  Currently we deadlock here on
umount after block device removal:

 INFO: task umount:2187 blocked for more than 120 seconds.
       Tainted: G           O    4.4.0-rc2+ #1953
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 umount          D ffff8800d2fbfd70     0  2187   2095 0x00000080
  ffff8800d2fbfd70 ffffffff81f94f98 ffff88031fc97bd8 ffff88030af5ad80
  ffff8800db71db00 ffff8800d2fc0000 ffff8800db8dbde0 ffff8800d93b6708
  ffff8800d93b6760 ffff8800d93b66d8 ffff8800d2fbfd88 ffffffff818f0695
 Call Trace:
  [<ffffffff818f0695>] schedule+0x35/0x80
  [<ffffffffa01e134e>] xfs_ail_push_all_sync+0xbe/0x110 [xfs]
  [<ffffffff810ecc30>] ? wait_woken+0x80/0x80
  [<ffffffffa01c8d91>] xfs_unmountfs+0x81/0x1b0 [xfs]
  [<ffffffffa01c991b>] ? xfs_mru_cache_destroy+0x6b/0x90 [xfs]
  [<ffffffffa01cbf30>] xfs_fs_put_super+0x30/0x90 [xfs]
  [<ffffffff81247eca>] generic_shutdown_super+0x6a/0xf0

Earlier in this trace xfs has already performed:

 XFS (pmem0m): xfs_do_force_shutdown(0x2) called from line 1197 of
file fs/xfs/xfs_log.c.

...but xfs_log_work_queue() continues to run periodically.

> If XFS does not do any metadata IO, then it won't shut the
> filesystem down. We almost always allocate/map blocks without doing
> any IO, which means we cannot guarantee erroring out page faults
> during get_blocks until a shutdown has been triggered by other
> means....
>
>> but that does not prevent it from
>> continuing to wait indefinitely at umount.
>
> Which is a bug we need to fix - I don't see how a shutdown
> implementation problem is at all relevant to triggering shutdowns in
> a prompt manner...
>
>> For the ext4
>> mark_inode_dirty() warning, we're triggering the error in generic code.
>
> So? XFS doesn't use that generic code, but we have filesystem
> specific issues that we need to handle sanely.
>
>> None of this fixes the problem of dax mappings leaking past block
>> device remove.  That can be done generically without needing per-fs
>> code.
>
> No, the shutdown is intended to close the race between the device
> being removed, the mappings being invalidated and the filesystem
> racing to create new mappings during page faults because it doesn't
> know the device has been unplugged until it does IO that gets some
> error in an unrecoverable path...
>

So you want me to grow a ->sb_shutdown() method for each fs that
supports dax only to call unmap_mapping_range on each dax inode when
common code could have already walked the inode list.  Also separately
teach each fs to start returning errors on get_blocks() after shutdown
even though fs/dax.c can figure it out from the return value from
blk_queue_enter?  I'm failing to see the point.

That is of course separate from the real need for an ->sb_shutdown()
to tell the fs that the device is gone for other filesystem specific
operations.
Dave Chinner Dec. 1, 2015, 4:03 a.m. UTC | #5
On Wed, Nov 25, 2015 at 11:11:00PM -0800, Dan Williams wrote:
> On Wed, Nov 25, 2015 at 10:27 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, Nov 25, 2015 at 02:09:34PM -0800, Dan Williams wrote:
> >> On Wed, Nov 25, 2015 at 1:50 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Wed, Nov 25, 2015 at 10:37:24AM -0800, Dan Williams wrote:
> [..]
> >> That sounds good in theory.  However, in the case of xfs it is already
> >> calling xfs_force_shutdown(),
> >
> > Where?
> 
> In the trace I included in the cover letter.
> 
> Testing this patch set reveals that xfs needs more XFS_FORCED_SHUTDOWN
> checks, especially in the unmount path.  Currently we deadlock here on
> umount after block device removal:

What's the test case?

From what you've said, it appears to be another manifestation where
we are trying to recover from non-fatal IO errors that are being
reported by the block device, but that is short-cutting the path
that normally does shutdown detection on metadata buffer IO
completion.

Filesystem error handling could be a lot simpler if we didn't
have to try to guess what EIO from the block device actually means.
If device unplug triggered a shutdown, we wouldn't have to care
about retrying in many cases where we do right now, because shutdown
would then have clear priority over all other errors we need to
handle. Right now we just have to guess, and there's a long history
of corner cases where we have guessed wrong....

I have patches to fix this particular manifestation, but until we
have notification that devices have been unplugged and are not
coming back, we're going to continue to struggle to get this right
and hence have recurring bugs when users pull USB drives out from
under active filesystems.....


> >> None of this fixes the problem of dax mappings leaking past block
> >> device remove.  That can be done generically without needing per-fs
> >> code.
> >
> > No, the shutdown is intended to close the race between the device
> > being removed, the mappings being invalidated and the filesytem
> > racing creating new mappings during page faults because it doesn't
> > know the device has been unplugged until it does IO that gets some
> > error in an unrecoverable path...
> 
> So you want me to grow a ->sb_shutdown() method for each fs that
> supports dax only to call unmap_mapping_range on each dax inode when
> common code could have already walked the inode list.

No, I don't want you to implement some whacky, dax only
->sb_shutdown method. I want a notification method to be implemented
so that each filesystem can take the necessary action it requires
when the underlying block device goes away.

Yes, that *filesystem specific method* might involve invalidating
all the dirty inodes in the filesystem, and if there are DAX
mappings in the filesystem then invalidating them, too, but that's
something the filesystem needs to take care of because *all
filesystem shutdowns must do this*.

Do you see the distinction there? This isn't just about device
unplug - there are lots of different triggers for a filesystem
shutdown. However, regardless of the shutdown trigger, we have to
take the same action. That is, we have to prevent all pending and
future IO from being issued to the block device, *even if the block
device is still OK*.

For example, if we detect a free space corruption during allocation,
it is not safe to trust *any active mapping* because we can't be
sure we haven't handed out the same block to multiple owners. Hence
on such a filesystem shutdown, we have to prevent any new DAX
mapping from occurring and invalidate all existing mappings as we
cannot allow userspace to modify any data or metadata until we've
resolved the corruption situation.

The simple fact is that a /filesystem/ shutdown needs to do DAX
mapping invalidation regardless of whether the block device has been
unplugged or not. This is not a case of "this only happens when we
unplug the device", this is a user data protection mechanism that we
use to prevent corruption propagation once it has been detected. A
device unplug is just one type of "corruption" that can occur.

Hence putting all this special invalidation stuff into the block
devices is simply duplicating functionality we need to implement in
the filesystem layers.  Regardless of this, it should be done by the
filesystem layers because that is where all the information necessary
to determine what needs invalidation is stored.

So, from the block device perspective, the *only thing it needs to
do* is notify the filesystem that it needs to shut down, and the
filesystem should then take care of everything else. Device unplug
is not some special snowflake that needs to be treated differently
from all other types of "it's dead, Jim" filesystem errors - from
the FS perspective it's the dead simple case because there are no
serious consequences if we screw up. i.e. if the FS gets it wrong
and IO is still issued after the shutdown, the IO is going to be
errored out rather than corrupting something on disk that would have
otherwise been OK....

> Also separately
> teach each fs to start returning errors on get_blocks() after shutdown
> even though fs/dax.c can figure it out from the return value from
> blk_queue_enter?

They will already return errors from get_blocks.

STATIC int
__xfs_get_blocks(
        struct inode            *inode,
        sector_t                iblock,
        struct buffer_head      *bh_result,
        int                     create,
        bool                    direct,
        bool                    dax_fault)
{
        struct xfs_inode        *ip = XFS_I(inode);
        struct xfs_mount        *mp = ip->i_mount;
        xfs_fileoff_t           offset_fsb, end_fsb;
        int                     error = 0;
        int                     lockmode = 0;
        struct xfs_bmbt_irec    imap;
        int                     nimaps = 1;
        xfs_off_t               offset;
        ssize_t                 size;
        int                     new = 0;

        if (XFS_FORCED_SHUTDOWN(mp))
                return -EIO;
.....

Hmmm? First thing XFS does in get_blocks() is check for shutdown.
In fact, the first thing that every major subsystem entry point
does in XFS is check for shutdown. Let's look at a typical
allocation get_blocks call in XFS:

 xfs_get_blocks
   __xfs_get_blocks - has shutdown check
    xfs_bmapi_read - shutdown check is second after error injection
    xfs_iomap_write_direct
      xfs_trans_reserve - has shutdown check
      xfs_bmapi_write - has shutdown check
      xfs_trans_commit - has shutdown check

IOWs, there are shutdown checks all through the get_blocks callbacks
already, so it's hardly a case of you having to go and add support
to any filesystem for this...

Cheers,
Dave.
Dan Williams Dec. 1, 2015, 4:20 a.m. UTC | #6
On Mon, Nov 30, 2015 at 8:03 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Nov 25, 2015 at 11:11:00PM -0800, Dan Williams wrote:
>> On Wed, Nov 25, 2015 at 10:27 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Wed, Nov 25, 2015 at 02:09:34PM -0800, Dan Williams wrote:
>> >> On Wed, Nov 25, 2015 at 1:50 PM, Dave Chinner <david@fromorbit.com> wrote:
>> >> > On Wed, Nov 25, 2015 at 10:37:24AM -0800, Dan Williams wrote:
>> [..]
>> >> That sounds good in theory.  However, in the case of xfs it is already
>> >> calling xfs_force_shutdown(),
>> >
>> > Where?
>>
>> In the trace I included in the cover letter.
>>
>> Testing this patch set reveals that xfs needs more XFS_FORCED_SHUTDOWN
>> checks, especially in the unmount path.  Currently we deadlock here on
>> umount after block device removal:
>
> What's the test case?
>
> From what you've said, it appears to be another manifestation where
> we are trying to recover from non-fatal IO errors that are being
> reported by the block device, but that is short-cutting the path
> that normally does shutdown detection on metadata buffer IO
> completion.
>
> Filesystem error handling could be a lot simpler if we didn't
> have to try to guess what EIO from the block device actually means.
> If device unplug triggered a shutdown, we wouldn't have to care
> about retrying in many cases where we do right now, because shutdown
> would then have clear priority over all other errors we need to
> handle. Right now we just have to guess, and there's a long history
> of corner cases where we have guessed wrong....
>
> I have patches to fix this particular manifestation, but until we
> have notification that devices have been unplugged and are not
> coming back, we're going to continue to struggle to get this right
> and hence have recurring bugs when users pull USB drives out from
> under active filesystems.....
>
>
>> >> None of this fixes the problem of dax mappings leaking past block
>> >> device remove.  That can be done generically without needing per-fs
>> >> code.
>> >
>> > No, the shutdown is intended to close the race between the device
>> > being removed, the mappings being invalidated and the filesystem
>> > racing to create new mappings during page faults because it doesn't
>> > know the device has been unplugged until it does IO that gets some
>> > error in an unrecoverable path...
>>
>> So you want me to grow a ->sb_shutdown() method for each fs that
>> supports dax only to call unmap_mapping_range on each dax inode when
>> common code could have already walked the inode list.
>
> No, I don't want you to implement some whacky, dax only
> ->sb_shutdown method. I want a notification method to be implemented
> so that each filesystem can take the necessary action it requires
> when the underlying block device goes away.
>
> Yes, that *filesystem specific method* might involve invalidating
> all the dirty inodes in the filesystem, and if there are DAX
> mappings in the filesystem then invalidating them, too, but that's
> something the filesystem needs to take care of because *all
> filesystem shutdowns must do this*.
>
> Do you see the distinction there? This isn't just about device
> unplug - there are lots of different triggers for a filesystem
> shutdown. However, regardless of the shutdown trigger, we have to
> take the same action. That is, we have to prevent all pending and
> future IO from being issued to the block device, *even if the block
> device is still OK*.

Ah, that finally got through to me.

I'll refactor this to be a ->shutdown notification with a generic
unmap that a filesystem can optionally call at its own discretion.
I now see that the generic functionality is just common code that an
fs might optionally need/use; it is counterproductive for it to be
triggered privately by the block layer, and only from an event like
del_gendisk.  We need it in potentially all fs shutdown paths.

Patch

diff --git a/block/genhd.c b/block/genhd.c
index e5cafa51567c..8a743cb81fb4 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -648,10 +648,12 @@  void del_gendisk(struct gendisk *disk)
 	while ((part = disk_part_iter_next(&piter))) {
 		invalidate_partition(disk, part->partno);
 		delete_partition(disk, part->partno);
+		kill_bdev_super(disk, part->partno);
 	}
 	disk_part_iter_exit(&piter);
 
 	invalidate_partition(disk, 0);
+	kill_bdev_super(disk, 0);
 	set_capacity(disk, 0);
 	disk->flags &= ~GENHD_FL_UP;
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1dd416bf72f7..d0233d643d33 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1795,6 +1795,23 @@  int __invalidate_device(struct block_device *bdev, bool kill_dirty)
 }
 EXPORT_SYMBOL(__invalidate_device);
 
+void kill_bdev_super(struct gendisk *disk, int partno)
+{
+	struct block_device *bdev = bdget_disk(disk, partno);
+	struct super_block *sb;
+
+	if (!bdev)
+		return;
+	sb = get_super(bdev);
+	if (!sb)
+		goto out;
+
+	sb->s_iflags |= SB_I_BDI_DEAD;
+	drop_super(sb);
+ out:
+	bdput(bdev);
+}
+
 void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 {
 	struct inode *inode, *old_inode = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3aa514254161..76925e322e87 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1254,6 +1254,7 @@  struct mm_struct;
 /* sb->s_iflags */
 #define SB_I_CGROUPWB	0x00000001	/* cgroup-aware writeback enabled */
 #define SB_I_NOEXEC	0x00000002	/* Ignore executables on this fs */
+#define SB_I_BDI_DEAD	0x00000004	/* Give up, backing device is dead */
 
 /* Possible states of 'frozen' field */
 enum {
@@ -2390,6 +2391,7 @@  extern int revalidate_disk(struct gendisk *);
 extern int check_disk_change(struct block_device *);
 extern int __invalidate_device(struct block_device *, bool);
 extern int invalidate_partition(struct gendisk *, int);
+extern void kill_bdev_super(struct gendisk *, int);
 #endif
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
 					pgoff_t start, pgoff_t end);