diff mbox

[resend,1/3] block, fs: reliably communicate bdev end-of-life

Message ID 20160104182005.24118.50361.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Dan Williams Jan. 4, 2016, 6:20 p.m. UTC
Historically we have waited for filesystem specific heuristics to
attempt to guess when a block device is gone.  Sometimes this works, but
in other cases the system can hang waiting for the fs to trigger its
shutdown protocol.

The initial motivation for this investigation was to prevent DAX
mappings (direct mmap access to persistent memory) from leaking past the
lifetime of the hosting block device.  However, Dave points out that
these shutdown operations are needed in other scenarios.  Quoting Dave:

    For example, if we detect a free space corruption during allocation,
    it is not safe to trust *any active mapping* because we can't trust
    that we having handed out the same block to multiple owners. Hence
    on such a filesystem shutdown, we have to prevent any new DAX
    mapping from occurring and invalidate all existing mappings as we
    cannot allow userspace to modify any data or metadata until we've
    resolved the corruption situation.

The current block device shutdown sequence of del_gendisk +
blk_cleanup_queue is problematic.  We want to tell the fs after
blk_cleanup_queue that there is no possibility of recovery, but by that
time we have deleted partitions and lost the ability to find all the
super-blocks on a block device.

Introduce del_gendisk_queue to trigger ->quiesce() and ->bdi_gone()
notifications to all the filesystems hosted on the disk.  Where
->quiesce() are 'shutdown' operations while the bdev may still be alive,
and ->bdi_gone() is a set of actions to take after the backing device
is known to be permanently dead.

generic_quiesce_super and generic_bdi_gone, are the default operations
when a filesystem does not implement ->quiesce(), ->bdi_gone().  They
invalidate inodes and unmap DAX-inodes respectively.  For now only
->bdi_gone() has an associated super operation as xfs will implement
->bdi_gone() in a later patch.

Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/genhd.c                |   87 +++++++++++++++++++++++++++++++++++-------
 drivers/block/brd.c          |    3 -
 drivers/nvdimm/pmem.c        |    3 -
 drivers/s390/block/dcssblk.c |    6 +--
 fs/block_dev.c               |   78 ++++++++++++++++++++++++++++++++------
 include/linux/fs.h           |    2 +
 include/linux/genhd.h        |    1 
 7 files changed, 147 insertions(+), 33 deletions(-)

Comments

Dave Chinner Jan. 5, 2016, 3:51 a.m. UTC | #1
On Mon, Jan 04, 2016 at 10:20:05AM -0800, Dan Williams wrote:
> Historically we have waited for filesystem specific heuristics to
> attempt to guess when a block device is gone.  Sometimes this works, but
> in other cases the system can hang waiting for the fs to trigger its
> shutdown protocol.
> 
> The initial motivation for this investigation was to prevent DAX
> mappings (direct mmap access to persistent memory) from leaking past the
> lifetime of the hosting block device.  However, Dave points out that
> these shutdown operations are needed in other scenarios.  Quoting Dave:
> 
>     For example, if we detect a free space corruption during allocation,
>     it is not safe to trust *any active mapping* because we can't trust
>     that we having handed out the same block to multiple owners. Hence
>     on such a filesystem shutdown, we have to prevent any new DAX
>     mapping from occurring and invalidate all existing mappings as we
>     cannot allow userspace to modify any data or metadata until we've
>     resolved the corruption situation.
> 
> The current block device shutdown sequence of del_gendisk +
> blk_cleanup_queue is problematic.  We want to tell the fs after
> blk_cleanup_queue that there is no possibility of recovery, but by that
> time we have deleted partitions and lost the ability to find all the
> super-blocks on a block device.
> 
> Introduce del_gendisk_queue to trigger ->quiesce() and ->bdi_gone()

I don't see anything that introduces a ->quiesce() method.

> notifications to all the filesystems hosted on the disk.  Where
> ->quiesce() are 'shutdown' operations while the bdev may still be alive,

So you are saying "quiesce == invalidation", which is in conflict
with what we typically think quiesce means. i.e.  "Quiesce" is what
we do when doing orderly writeback of all outstanding dirty objects
in a filesystem - what we do during freeze, remount-ro, and unmount.
e.g. See the functions xfs_log_quiesce() and xfs_attr_quiesce()

> and ->bdi_gone() is a set of actions to take after the backing device
> is known to be permanently dead.

In which case, bdi_gone == shutdown.

Operation methods should be named after what they do, not what their
calling context is. i.e. these are "invalidate" and "shutdown"
superblock operations, not "quiesce" and "bdi_gone".

> generic_quiesce_super and generic_bdi_gone, are the default operations
> when a filesystem does not implement ->quiesce(), ->bdi_gone().  They
> invalidate inodes and unmap DAX-inodes respectively.  For now only
> ->bdi_gone() has an associated super operation as xfs will implement
> ->bdi_gone() in a later patch.

I don't quite understand what the point of factoring
__invalidate_device() like this is - it's not used by anyone, so it
seems completely unnecessary to me.

And really, that points out that there are multiple changes in this
patch set that should be done separately.  The rework of
del_gendisk() into del_gendisk_start/del_gendisk_end should be the
first patch.  The del_gendisk/blk_cleanup_queue to
blk_cleanup_queue() combining should be the second part, as it
builds on the factoring of del_gendisk(). Then, if we really need to
keep it, the factoring to introduce generic_quiesce_super. And
finally, the patch that allows the shutdown callout can be
introduced.

[snip]
>  
> +static void generic_bdi_gone(struct super_block *sb)
> +{
> +	struct inode *inode, *_inode = NULL;
> +
> +	spin_lock(&sb->s_inode_list_lock);
> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +		spin_lock(&inode->i_lock);
> +		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> +				|| !IS_DAX(inode)) {
> +			spin_unlock(&inode->i_lock);
> +			continue;
> +		}
> +		__iget(inode);
> +		spin_unlock(&inode->i_lock);
> +		spin_unlock(&sb->s_inode_list_lock);
> +
> +		unmap_mapping_range(inode->i_mapping, 0, 0, 1);
> +		iput(_inode);
> +		_inode = inode;
> +		cond_resched();
> +
> +		spin_lock(&sb->s_inode_list_lock);
> +	}
> +	spin_unlock(&sb->s_inode_list_lock);
> +	iput(_inode);
> +}

This belongs in fs/inode.c, right next to invalidate_inodes(), and
with a name that describes what it does, not the context in which
it is called. e.g. unmap_dax_inodes().

> +void shutdown_partition(struct gendisk *disk, int partno)
> +{
> +	struct block_device *bdev = bdget_disk(disk, partno);
> +	struct super_block *sb = get_super(bdev);
> +
> +	if (!bdev)
> +		return;

Null pointer deref. Certainly a landmine waiting for someone to
tread on.

> +	if (!sb) {
> +		bdput(bdev);
> +		return;
> +	}
> +
> +	if (sb->s_op->bdi_gone)
> +		sb->s_op->bdi_gone(sb);
> +	else
> +		generic_bdi_gone(sb);

	if (sb->s_op->shutdown)
		sb->s_op->shutdown(sb);
	else
		unmap_dax_inodes(sb);

name things correctly, and the code documents itself.
Dan Williams Jan. 5, 2016, 4:25 a.m. UTC | #2
On Mon, Jan 4, 2016 at 7:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Jan 04, 2016 at 10:20:05AM -0800, Dan Williams wrote:
>> Historically we have waited for filesystem specific heuristics to
>> attempt to guess when a block device is gone.  Sometimes this works, but
>> in other cases the system can hang waiting for the fs to trigger its
>> shutdown protocol.
>>
>> The initial motivation for this investigation was to prevent DAX
>> mappings (direct mmap access to persistent memory) from leaking past the
>> lifetime of the hosting block device.  However, Dave points out that
>> these shutdown operations are needed in other scenarios.  Quoting Dave:
>>
>>     For example, if we detect a free space corruption during allocation,
>>     it is not safe to trust *any active mapping* because we can't trust
>>     that we having handed out the same block to multiple owners. Hence
>>     on such a filesystem shutdown, we have to prevent any new DAX
>>     mapping from occurring and invalidate all existing mappings as we
>>     cannot allow userspace to modify any data or metadata until we've
>>     resolved the corruption situation.
>>
>> The current block device shutdown sequence of del_gendisk +
>> blk_cleanup_queue is problematic.  We want to tell the fs after
>> blk_cleanup_queue that there is no possibility of recovery, but by that
>> time we have deleted partitions and lost the ability to find all the
>> super-blocks on a block device.
>>
>> Introduce del_gendisk_queue to trigger ->quiesce() and ->bdi_gone()
>
> I don't see anything that introduces a ->quiesce() method.
>

Right, we only have generic_quiesce_super() as no fs had a need to do
anything but the generic actions.

>> notifications to all the filesystems hosted on the disk.  Where
>> ->quiesce() are 'shutdown' operations while the bdev may still be alive,
>
> So you are saying "quiesce == invalidation", which is in conflict
> with what we typically think quiesce means. i.e.  "Quiesce" is what
> we do when doing orderly writeback of all outstanding dirty objects
> in a filesystem - what we do during freeze, remount-ro, and unmount.
> e.g. See the functions xfs_log_quiesce() and xfs_attr_quiesce()

Fair enough, I should have called this one invalidate_super.

>
>> and ->bdi_gone() is a set of actions to take after the backing device
>> is known to be permanently dead.
>
> In which case, bdi_gone == shutdown.
>

True, and following this logic I think the existing
generic_shutdown_super() should be renamed generic_kill_super() to
match the fs actions, but see below...

> Operation methods should be named after what they do, not what their
> calling context is. i.e. these are "invalidate" and "shutdown"
> superblock operations, not "quiesce" and "bdi_gone".

I was running out of colors to paint the bike shed given
generic_shutdown_super() was already in use.  Also, since
generic_shutdown_super() has had its current meaning since forever I
didn't think it was worth the risk to change the meaning of such a
long standing symbol.  Other ideas, generic_stop_super?

>> generic_quiesce_super and generic_bdi_gone, are the default operations
>> when a filesystem does not implement ->quiesce(), ->bdi_gone().  They
>> invalidate inodes and unmap DAX-inodes respectively.  For now only
>> ->bdi_gone() has an associated super operation as xfs will implement
>> ->bdi_gone() in a later patch.
>
> I don't quite understand what the point of factoring
> __invalidate_device() like this is - it's not used by anyone, so it
> seems completely unnecessary to me.
>

You're right, without a need for an fs to intercept the 'invalidation'
event there's no need to do this refactor.

> And really, that points out that there are multiple changes in this
> patch set that should be done separately.  The rework of
> del_gendisk() into del_gendisk_start/del_gendisk_end should be the
> first patch.  The del_gendisk/blk_cleanup_queue to
> blk_cleanup_queue() combining should be the second part, as it
> builds on the factoring of del_gendisk(). Then, if we really need to
> keep it, the factoring to introduce generic_quiesce_super. And
> finally, the patch that allows the shutdown callout can be
> introduced.

Ok, will split.

>
> [snip]
>>
>> +static void generic_bdi_gone(struct super_block *sb)
>> +{
>> +     struct inode *inode, *_inode = NULL;
>> +
>> +     spin_lock(&sb->s_inode_list_lock);
>> +     list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
>> +             spin_lock(&inode->i_lock);
>> +             if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
>> +                             || !IS_DAX(inode)) {
>> +                     spin_unlock(&inode->i_lock);
>> +                     continue;
>> +             }
>> +             __iget(inode);
>> +             spin_unlock(&inode->i_lock);
>> +             spin_unlock(&sb->s_inode_list_lock);
>> +
>> +             unmap_mapping_range(inode->i_mapping, 0, 0, 1);
>> +             iput(_inode);
>> +             _inode = inode;
>> +             cond_resched();
>> +
>> +             spin_lock(&sb->s_inode_list_lock);
>> +     }
>> +     spin_unlock(&sb->s_inode_list_lock);
>> +     iput(_inode);
>> +}
>
> This belongs in fs/inode.c, right next to invalidate_inodes(), and
> with a name that describes what it does, not the context in which
> it is called. e.g. unmap_dax_inodes().
>

Sounds good.

>> +void shutdown_partition(struct gendisk *disk, int partno)
>> +{
>> +     struct block_device *bdev = bdget_disk(disk, partno);
>> +     struct super_block *sb = get_super(bdev);
>> +
>> +     if (!bdev)
>> +             return;
>
> Null pointer deref. Certainly a landmine waiting for someone to
> tread on.
>

Nope, get_super() checks for a NULL argument.

>> +     if (!sb) {
>> +             bdput(bdev);
>> +             return;
>> +     }
>> +
>> +     if (sb->s_op->bdi_gone)
>> +             sb->s_op->bdi_gone(sb);
>> +     else
>> +             generic_bdi_gone(sb);
>
>         if (sb->s_op->shutdown)
>                 sb->s_op->shutdown(sb);
>         else
>                 unmap_dax_inodes(sb);
>
> name things correctly, and the code documents itself.

How about 'stop' or 'halt' instead of 'shutdown' to preserve the
historical meaning of generic_shutdown_super?
Dave Chinner Jan. 5, 2016, 10:32 p.m. UTC | #3
On Mon, Jan 04, 2016 at 08:25:16PM -0800, Dan Williams wrote:
> On Mon, Jan 4, 2016 at 7:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Jan 04, 2016 at 10:20:05AM -0800, Dan Williams wrote:
> >> Historically we have waited for filesystem specific heuristics to
> >> attempt to guess when a block device is gone.  Sometimes this works, but
> >> in other cases the system can hang waiting for the fs to trigger its
> >> shutdown protocol.
....
> True, and following this logic I think the existing
> generic_shutdown_super() should be renamed generic_kill_super() to
> match the fs actions, but see below...
> 
> > Operation methods should be named after what they do, not what their
> > calling context is. i.e. these are "invalidate" and "shutdown"
> > superblock operations, not "quiesce" and "bdi_gone".
> 
> I was running out of colors to paint the bike shed given
> generic_shutdown_super() was already in use.  Also, since

Ah, yeah, i forgot about that function. To many different shades of
purple are already in use. :/

> >> +void shutdown_partition(struct gendisk *disk, int partno)
> >> +{
> >> +     struct block_device *bdev = bdget_disk(disk, partno);
> >> +     struct super_block *sb = get_super(bdev);
> >> +
> >> +     if (!bdev)
> >> +             return;
> >
> > Null pointer deref. Certainly a landmine waiting for someone to
> > tread on.
> >
> 
> Nope, get_super() checks for a NULL argument.

Hence my comment about it being a landmine. If get_super() is ever
changed, this code will explode on us when we least expect it. If I
have to go read other code in a completely different file just to
determine the code is actually safe, then I consider that bad code.
Code that is slightly more verbose but is obviously correct and
balanced is much, much easier to understand and maintain....

> >> +     if (!sb) {
> >> +             bdput(bdev);
> >> +             return;
> >> +     }
> >> +
> >> +     if (sb->s_op->bdi_gone)
> >> +             sb->s_op->bdi_gone(sb);
> >> +     else
> >> +             generic_bdi_gone(sb);
> >
> >         if (sb->s_op->shutdown)
> >                 sb->s_op->shutdown(sb);
> >         else
> >                 unmap_dax_inodes(sb);
> >
> > name things correctly, and the code documents itself.
> 
> How about 'stop' or 'halt' instead of 'shutdown' to preserve the
> historical meaning of generic_shutdown_super?

It's hard chosing a different color that is appropriate. :/

Because it's a brute-force behavioural override, I think that needs
to be obvious from the op name. So something like force_stop or
force_failure seems best to me.

In fact, now that I've thought/repaeated/written force_failure a
couple of times, it seems quite appropriate here, as what we want to
be able to do is force the filesystem into a (permanent) failure
state.....

Cheers,

Dave.
Al Viro Jan. 9, 2016, 7:54 a.m. UTC | #4
On Mon, Jan 04, 2016 at 10:20:05AM -0800, Dan Williams wrote:
> Historically we have waited for filesystem specific heuristics to
> attempt to guess when a block device is gone.  Sometimes this works, but
> in other cases the system can hang waiting for the fs to trigger its
> shutdown protocol.
> 
> The initial motivation for this investigation was to prevent DAX
> mappings (direct mmap access to persistent memory) from leaking past the
> lifetime of the hosting block device.  However, Dave points out that
> these shutdown operations are needed in other scenarios.  Quoting Dave:
> 
>     For example, if we detect a free space corruption during allocation,
>     it is not safe to trust *any active mapping* because we can't trust
>     that we having handed out the same block to multiple owners. Hence
>     on such a filesystem shutdown, we have to prevent any new DAX
>     mapping from occurring and invalidate all existing mappings as we
>     cannot allow userspace to modify any data or metadata until we've
>     resolved the corruption situation.
> 
> The current block device shutdown sequence of del_gendisk +
> blk_cleanup_queue is problematic.  We want to tell the fs after
> blk_cleanup_queue that there is no possibility of recovery, but by that
> time we have deleted partitions and lost the ability to find all the
> super-blocks on a block device.
> 
> Introduce del_gendisk_queue to trigger ->quiesce() and ->bdi_gone()
> notifications to all the filesystems hosted on the disk.  Where
> ->quiesce() are 'shutdown' operations while the bdev may still be alive,
> and ->bdi_gone() is a set of actions to take after the backing device
> is known to be permanently dead.

	Would you mind explaining what the hell is _the_ backing device
of a filesystem?  What does that translate into in case of e.g. btrfs
spanning several disks?  Or ext4 with journal on a different device, for
that matter?

	If anything, I would argue that filesystem is out of place here -
general situation is "IO on X may require IO on device Y and X needs to do
something when Y goes away".  Consider e.g. /dev/loop backed by a device
that went away.  Or by a file on fs that has run down the curtain and joined
the bleedin choir invisible.  With another fs partially hosted by that
loopback device.  Or by RAID0 containing said device.

	You are given Y and attempt to locate the affected X.  _Then_
you assume that X is a filesystem and has "something to be done" independent
from the role Y played for it, so you can pick that action from superblock
method.

	IMO you are placing the burden in the wrong place.  _Recepient_
knows what it depends upon and what should be done for each source of
trouble.  So make it recepient's responsibility to request notifications.
At which point the superblock method goes away, along with the requirement
to handle all sources of trouble the same way, etc.

	What's more, things like RAID5 (also interested in knowing when
a component has been ripped out) might or might not decide to propagate
the event further - after all, that's exactly the point of redundancy.

	I'd look into something along the lines of notifier chain per
gendisk, with potential victims registering a callback when they decide
that from now on such and such device might screw them over...
Dan Williams Jan. 9, 2016, 2:17 p.m. UTC | #5
On Fri, Jan 8, 2016 at 11:54 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Jan 04, 2016 at 10:20:05AM -0800, Dan Williams wrote:
[..]
>         Would you mind explaining what the hell is _the_ backing device
> of a filesystem?  What does that translate into in case of e.g. btrfs
> spanning several disks?  Or ext4 with journal on a different device, for
> that matter?
>
>         If anything, I would argue that filesystem is out of place here -
> general situation is "IO on X may require IO on device Y and X needs to do
> something when Y goes away".  Consider e.g. /dev/loop backed by a device
> that went away.  Or by a file on fs that has run down the curtain and joined
> the bleedin choir invisible.  With another fs partially hosted by that
> loopback device.  Or by RAID0 containing said device.
>
>         You are given Y and attempt to locate the affected X.  _Then_
> you assume that X is a filesystem and has "something to be done" independent
> from the role Y played for it, so you can pick that action from superblock
> method.
>
>         IMO you are placing the burden in the wrong place.  _Recepient_
> knows what it depends upon and what should be done for each source of
> trouble.  So make it recepient's responsibility to request notifications.
> At which point the superblock method goes away, along with the requirement
> to handle all sources of trouble the same way, etc.
>
>         What's more, things like RAID5 (also interested in knowing when
> a component has been ripped out) might or might not decide to propagate
> the event further - after all, that's exactly the point of redundancy.
>
>         I'd look into something along the lines of notifier chain per
> gendisk, with potential victims registering a callback when they decide
> that from now on such and such device might screw them over...

Makes sense.  I'll drop this series for now and come back after
re-working it use notifiers.
Hannes Reinecke Jan. 11, 2016, 7:15 a.m. UTC | #6
On 01/09/2016 03:17 PM, Dan Williams wrote:
> On Fri, Jan 8, 2016 at 11:54 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Mon, Jan 04, 2016 at 10:20:05AM -0800, Dan Williams wrote:
> [..]
>>          Would you mind explaining what the hell is _the_ backing device
>> of a filesystem?  What does that translate into in case of e.g. btrfs
>> spanning several disks?  Or ext4 with journal on a different device, for
>> that matter?
>>
>>          If anything, I would argue that filesystem is out of place here -
>> general situation is "IO on X may require IO on device Y and X needs to do
>> something when Y goes away".  Consider e.g. /dev/loop backed by a device
>> that went away.  Or by a file on fs that has run down the curtain and joined
>> the bleedin choir invisible.  With another fs partially hosted by that
>> loopback device.  Or by RAID0 containing said device.
>>
>>          You are given Y and attempt to locate the affected X.  _Then_
>> you assume that X is a filesystem and has "something to be done" independent
>> from the role Y played for it, so you can pick that action from superblock
>> method.
>>
>>          IMO you are placing the burden in the wrong place.  _Recepient_
>> knows what it depends upon and what should be done for each source of
>> trouble.  So make it recepient's responsibility to request notifications.
>> At which point the superblock method goes away, along with the requirement
>> to handle all sources of trouble the same way, etc.
>>
>>          What's more, things like RAID5 (also interested in knowing when
>> a component has been ripped out) might or might not decide to propagate
>> the event further - after all, that's exactly the point of redundancy.
>>
>>          I'd look into something along the lines of notifier chain per
>> gendisk, with potential victims registering a callback when they decide
>> that from now on such and such device might screw them over...
>
> Makes sense.  I'll drop this series for now and come back after
> re-working it use notifiers.

Yes please. I need a similar thing for communicating device changes 
(resizing, topology changes), so I'd be very much interested in them.

And while you're at it, maybe we can fold the block device event 
handling into that, too.

Cheers,

Hannes
Hannes Reinecke Jan. 11, 2016, 3:24 p.m. UTC | #7
On 01/09/2016 08:54 AM, Al Viro wrote:
> On Mon, Jan 04, 2016 at 10:20:05AM -0800, Dan Williams wrote:
>> Historically we have waited for filesystem specific heuristics to
>> attempt to guess when a block device is gone.  Sometimes this works, but
>> in other cases the system can hang waiting for the fs to trigger its
>> shutdown protocol.
>>
>> The initial motivation for this investigation was to prevent DAX
>> mappings (direct mmap access to persistent memory) from leaking past the
>> lifetime of the hosting block device.  However, Dave points out that
>> these shutdown operations are needed in other scenarios.  Quoting Dave:
>>
>>      For example, if we detect a free space corruption during allocation,
>>      it is not safe to trust *any active mapping* because we can't trust
>>      that we having handed out the same block to multiple owners. Hence
>>      on such a filesystem shutdown, we have to prevent any new DAX
>>      mapping from occurring and invalidate all existing mappings as we
>>      cannot allow userspace to modify any data or metadata until we've
>>      resolved the corruption situation.
>>
>> The current block device shutdown sequence of del_gendisk +
>> blk_cleanup_queue is problematic.  We want to tell the fs after
>> blk_cleanup_queue that there is no possibility of recovery, but by that
>> time we have deleted partitions and lost the ability to find all the
>> super-blocks on a block device.
>>
>> Introduce del_gendisk_queue to trigger ->quiesce() and ->bdi_gone()
>> notifications to all the filesystems hosted on the disk.  Where
>> ->quiesce() are 'shutdown' operations while the bdev may still be alive,
>> and ->bdi_gone() is a set of actions to take after the backing device
>> is known to be permanently dead.
>
> 	Would you mind explaining what the hell is _the_ backing device
> of a filesystem?  What does that translate into in case of e.g. btrfs
> spanning several disks?  Or ext4 with journal on a different device, for
> that matter?
>
> 	If anything, I would argue that filesystem is out of place here -
> general situation is "IO on X may require IO on device Y and X needs to do
> something when Y goes away".  Consider e.g. /dev/loop backed by a device
> that went away.  Or by a file on fs that has run down the curtain and joined
> the bleedin choir invisible.  With another fs partially hosted by that
> loopback device.  Or by RAID0 containing said device.
>
> 	You are given Y and attempt to locate the affected X.  _Then_
> you assume that X is a filesystem and has "something to be done" independent
> from the role Y played for it, so you can pick that action from superblock
> method.
>
> 	IMO you are placing the burden in the wrong place.  _Recepient_
> knows what it depends upon and what should be done for each source of
> trouble.  So make it recepient's responsibility to request notifications.
> At which point the superblock method goes away, along with the requirement
> to handle all sources of trouble the same way, etc.
>
> 	What's more, things like RAID5 (also interested in knowing when
> a component has been ripped out) might or might not decide to propagate
> the event further - after all, that's exactly the point of redundancy.
>
> 	I'd look into something along the lines of notifier chain per
> gendisk, with potential victims registering a callback when they decide
> that from now on such and such device might screw them over...

Fully support this. I was planning on something similar to transport 
device changes (resizing, topology change etc).

And it might even be an idea to convert the block device events to a 
notifier chain, too.

Dan, can you keep me in the loop here?
Thanks.

Cheers,

Hannes
Dan Williams Jan. 11, 2016, 3:55 p.m. UTC | #8
On Mon, Jan 11, 2016 at 7:24 AM, Hannes Reinecke <hare@suse.de> wrote:
> On 01/09/2016 08:54 AM, Al Viro wrote:
>>
>> On Mon, Jan 04, 2016 at 10:20:05AM -0800, Dan Williams wrote:
>>>
>>> Historically we have waited for filesystem specific heuristics to
>>> attempt to guess when a block device is gone.  Sometimes this works, but
>>> in other cases the system can hang waiting for the fs to trigger its
>>> shutdown protocol.
>>>
>>> The initial motivation for this investigation was to prevent DAX
>>> mappings (direct mmap access to persistent memory) from leaking past the
>>> lifetime of the hosting block device.  However, Dave points out that
>>> these shutdown operations are needed in other scenarios.  Quoting Dave:
>>>
>>>      For example, if we detect a free space corruption during allocation,
>>>      it is not safe to trust *any active mapping* because we can't trust
>>>      that we having handed out the same block to multiple owners. Hence
>>>      on such a filesystem shutdown, we have to prevent any new DAX
>>>      mapping from occurring and invalidate all existing mappings as we
>>>      cannot allow userspace to modify any data or metadata until we've
>>>      resolved the corruption situation.
>>>
>>> The current block device shutdown sequence of del_gendisk +
>>> blk_cleanup_queue is problematic.  We want to tell the fs after
>>> blk_cleanup_queue that there is no possibility of recovery, but by that
>>> time we have deleted partitions and lost the ability to find all the
>>> super-blocks on a block device.
>>>
>>> Introduce del_gendisk_queue to trigger ->quiesce() and ->bdi_gone()
>>> notifications to all the filesystems hosted on the disk.  Where
>>> ->quiesce() are 'shutdown' operations while the bdev may still be alive,
>>> and ->bdi_gone() is a set of actions to take after the backing device
>>> is known to be permanently dead.
>>
>>
>>         Would you mind explaining what the hell is _the_ backing device
>> of a filesystem?  What does that translate into in case of e.g. btrfs
>> spanning several disks?  Or ext4 with journal on a different device, for
>> that matter?
>>
>>         If anything, I would argue that filesystem is out of place here -
>> general situation is "IO on X may require IO on device Y and X needs to do
>> something when Y goes away".  Consider e.g. /dev/loop backed by a device
>> that went away.  Or by a file on fs that has run down the curtain and
>> joined
>> the bleedin choir invisible.  With another fs partially hosted by that
>> loopback device.  Or by RAID0 containing said device.
>>
>>         You are given Y and attempt to locate the affected X.  _Then_
>> you assume that X is a filesystem and has "something to be done"
>> independent
>> from the role Y played for it, so you can pick that action from superblock
>> method.
>>
>>         IMO you are placing the burden in the wrong place.  _Recepient_
>> knows what it depends upon and what should be done for each source of
>> trouble.  So make it recepient's responsibility to request notifications.
>> At which point the superblock method goes away, along with the requirement
>> to handle all sources of trouble the same way, etc.
>>
>>         What's more, things like RAID5 (also interested in knowing when
>> a component has been ripped out) might or might not decide to propagate
>> the event further - after all, that's exactly the point of redundancy.
>>
>>         I'd look into something along the lines of notifier chain per
>> gendisk, with potential victims registering a callback when they decide
>> that from now on such and such device might screw them over...
>
>
> Fully support this. I was planning on something similar to transport device
> changes (resizing, topology change etc).
>
> And it might even be an idea to convert the block device events to a
> notifier chain, too.
>
> Dan, can you keep me in the loop here?

Yes, will do.
diff mbox

Patch

diff --git a/block/genhd.c b/block/genhd.c
index e5cafa51567c..967fcfd63c98 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -634,24 +634,14 @@  void add_disk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(add_disk);
 
-void del_gendisk(struct gendisk *disk)
+static void del_gendisk_start(struct gendisk *disk)
 {
-	struct disk_part_iter piter;
-	struct hd_struct *part;
-
 	blk_integrity_del(disk);
 	disk_del_events(disk);
+}
 
-	/* invalidate stuff */
-	disk_part_iter_init(&piter, disk,
-			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
-	while ((part = disk_part_iter_next(&piter))) {
-		invalidate_partition(disk, part->partno);
-		delete_partition(disk, part->partno);
-	}
-	disk_part_iter_exit(&piter);
-
-	invalidate_partition(disk, 0);
+static void del_gendisk_end(struct gendisk *disk)
+{
 	set_capacity(disk, 0);
 	disk->flags &= ~GENHD_FL_UP;
 
@@ -670,9 +660,78 @@  void del_gendisk(struct gendisk *disk)
 	pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);
 	device_del(disk_to_dev(disk));
 }
+
+#define for_each_part(part, piter) \
+	for (part = disk_part_iter_next(piter); part; \
+			part = disk_part_iter_next(piter))
+void del_gendisk(struct gendisk *disk)
+{
+	struct disk_part_iter piter;
+	struct hd_struct *part;
+
+	del_gendisk_start(disk);
+
+	/* invalidate stuff */
+	disk_part_iter_init(&piter, disk,
+			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
+	for_each_part(part, &piter) {
+		invalidate_partition(disk, part->partno);
+		delete_partition(disk, part->partno);
+	}
+	disk_part_iter_exit(&piter);
+	invalidate_partition(disk, 0);
+
+	del_gendisk_end(disk);
+}
 EXPORT_SYMBOL(del_gendisk);
 
 /**
+ * del_gendisk_queue - combined del_gendisk + blk_cleanup_queue
+ * @disk: disk to delete, invalidate, unmap, and end i/o
+ *
+ * This alternative for open coded calls to:
+ *     del_gendisk()
+ *     blk_cleanup_queue()
+ * ...is for notifying the filesystem that the block device has gone
+ * away.  This distinction is important for triggering a filesystem to
+ * abort its error recovery and for (DAX) capable devices.  DAX bypasses
+ * page cache and mappings go directly to storage media.  When such a
+ * disk is removed the pfn backing a mapping may be invalid or removed
+ * from the system.  Upon return accessing DAX mappings of this disk
+ * will trigger SIGBUS.
+ */
+void del_gendisk_queue(struct gendisk *disk)
+{
+	struct disk_part_iter piter;
+	struct hd_struct *part;
+
+	del_gendisk_start(disk);
+
+	/* pass1 sync fs + evict idle inodes */
+	disk_part_iter_init(&piter, disk,
+			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
+	for_each_part(part, &piter)
+		invalidate_partition(disk, part->partno);
+	disk_part_iter_exit(&piter);
+	invalidate_partition(disk, 0);
+
+	blk_cleanup_queue(disk->queue);
+
+	/* pass2 the queue is dead, trigger fs shutdown via ->bdi_gone() */
+	disk_part_iter_init(&piter, disk,
+			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
+	for_each_part(part, &piter) {
+		shutdown_partition(disk, part->partno);
+		delete_partition(disk, part->partno);
+	}
+	disk_part_iter_exit(&piter);
+	shutdown_partition(disk, 0);
+
+	del_gendisk_end(disk);
+}
+EXPORT_SYMBOL(del_gendisk_queue);
+
+/**
  * get_gendisk - get partitioning information for a given device
  * @devt: device to get partitioning information for
  * @partno: returned partition index
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index a5880f4ab40e..f149dd57bba3 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -532,7 +532,6 @@  out:
 static void brd_free(struct brd_device *brd)
 {
 	put_disk(brd->brd_disk);
-	blk_cleanup_queue(brd->brd_queue);
 	brd_free_pages(brd);
 	kfree(brd);
 }
@@ -560,7 +559,7 @@  out:
 static void brd_del_one(struct brd_device *brd)
 {
 	list_del(&brd->brd_list);
-	del_gendisk(brd->brd_disk);
+	del_gendisk_queue(brd->brd_disk);
 	brd_free(brd);
 }
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 8ee79893d2f5..6dd06e9d34b0 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -158,9 +158,8 @@  static void pmem_detach_disk(struct pmem_device *pmem)
 	if (!pmem->pmem_disk)
 		return;
 
-	del_gendisk(pmem->pmem_disk);
+	del_gendisk_queue(pmem->pmem_disk);
 	put_disk(pmem->pmem_disk);
-	blk_cleanup_queue(pmem->pmem_queue);
 }
 
 static int pmem_attach_disk(struct device *dev,
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 94a8f4ab57bc..0c3c968b57d9 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -388,8 +388,7 @@  removeseg:
 	}
 	list_del(&dev_info->lh);
 
-	del_gendisk(dev_info->gd);
-	blk_cleanup_queue(dev_info->dcssblk_queue);
+	del_gendisk_queue(dev_info->gd);
 	dev_info->gd->queue = NULL;
 	put_disk(dev_info->gd);
 	up_write(&dcssblk_devices_sem);
@@ -751,8 +750,7 @@  dcssblk_remove_store(struct device *dev, struct device_attribute *attr, const ch
 	}
 
 	list_del(&dev_info->lh);
-	del_gendisk(dev_info->gd);
-	blk_cleanup_queue(dev_info->dcssblk_queue);
+	del_gendisk_queue(dev_info->gd);
 	dev_info->gd->queue = NULL;
 	put_disk(dev_info->gd);
 	device_unregister(&dev_info->dev);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 44d4a1e9244e..739e43a37e64 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1778,27 +1778,83 @@  fail:
 }
 EXPORT_SYMBOL(lookup_bdev);
 
+static int generic_quiesce_super(struct super_block *sb, bool kill_dirty)
+{
+	/*
+	 * no need to lock the super, get_super holds the read mutex so
+	 * the filesystem cannot go away under us (->put_super runs with
+	 * the write lock held).
+	 */
+	shrink_dcache_sb(sb);
+	return invalidate_inodes(sb, kill_dirty);
+}
+
 int __invalidate_device(struct block_device *bdev, bool kill_dirty)
 {
 	struct super_block *sb = get_super(bdev);
 	int res = 0;
 
-	if (sb) {
-		/*
-		 * no need to lock the super, get_super holds the
-		 * read mutex so the filesystem cannot go away
-		 * under us (->put_super runs with the write lock
-		 * hold).
-		 */
-		shrink_dcache_sb(sb);
-		res = invalidate_inodes(sb, kill_dirty);
-		drop_super(sb);
-	}
+	if (!sb)
+		goto out;
+
+	res = generic_quiesce_super(sb, kill_dirty);
+	drop_super(sb);
+ out:
 	invalidate_bdev(bdev);
+
 	return res;
 }
 EXPORT_SYMBOL(__invalidate_device);
 
+static void generic_bdi_gone(struct super_block *sb)
+{
+	struct inode *inode, *_inode = NULL;
+
+	spin_lock(&sb->s_inode_list_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		spin_lock(&inode->i_lock);
+		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+				|| !IS_DAX(inode)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+		__iget(inode);
+		spin_unlock(&inode->i_lock);
+		spin_unlock(&sb->s_inode_list_lock);
+
+		unmap_mapping_range(inode->i_mapping, 0, 0, 1);
+		iput(_inode);
+		_inode = inode;
+		cond_resched();
+
+		spin_lock(&sb->s_inode_list_lock);
+	}
+	spin_unlock(&sb->s_inode_list_lock);
+	iput(_inode);
+}
+
+void shutdown_partition(struct gendisk *disk, int partno)
+{
+	struct block_device *bdev = bdget_disk(disk, partno);
+	struct super_block *sb = get_super(bdev);
+
+	if (!bdev)
+		return;
+
+	if (!sb) {
+		bdput(bdev);
+		return;
+	}
+
+	if (sb->s_op->bdi_gone)
+		sb->s_op->bdi_gone(sb);
+	else
+		generic_bdi_gone(sb);
+
+	drop_super(sb);
+	bdput(bdev);
+}
+
 void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 {
 	struct inode *inode, *old_inode = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3aa514254161..0e201ed38045 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1713,6 +1713,7 @@  struct super_operations {
 				  struct shrink_control *);
 	long (*free_cached_objects)(struct super_block *,
 				    struct shrink_control *);
+	void (*bdi_gone)(struct super_block *);
 };
 
 /*
@@ -2390,6 +2391,7 @@  extern int revalidate_disk(struct gendisk *);
 extern int check_disk_change(struct block_device *);
 extern int __invalidate_device(struct block_device *, bool);
 extern int invalidate_partition(struct gendisk *, int);
+extern void shutdown_partition(struct gendisk *, int);
 #endif
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
 					pgoff_t start, pgoff_t end);
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 847cc1d91634..028cf15a8a57 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -431,6 +431,7 @@  extern void part_round_stats(int cpu, struct hd_struct *part);
 /* block/genhd.c */
 extern void add_disk(struct gendisk *disk);
 extern void del_gendisk(struct gendisk *gp);
+extern void del_gendisk_queue(struct gendisk *disk);
 extern struct gendisk *get_gendisk(dev_t dev, int *partno);
 extern struct block_device *bdget_disk(struct gendisk *disk, int partno);