
[02/11] block: Fix race of bdev open with gendisk shutdown

Message ID 20170313151410.5586-3-jack@suse.cz (mailing list archive)
State New, archived

Commit Message

Jan Kara March 13, 2017, 3:14 p.m. UTC
blkdev_open() may race with gendisk shutdown in two different ways.
Either del_gendisk() has already unhashed the block device inode (and thus
bd_acquire() will end up creating a new block device inode), yet
get_gendisk() will still return the gendisk that is being destroyed.
Or the bdev returned by bd_acquire() will get unhashed and the gendisk
destroyed before we get to get_gendisk(), and get_gendisk() will return a
new gendisk that got allocated for a device that reused the device number.

In both cases this will result in possible inconsistencies between
bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
destroyed and device number reused, in the second case immediately).

Fix the problem by checking whether the gendisk is still alive and the
inode hashed when associating the bdev inode with it and its bdi. That way
we are sure that we will not associate a bdev inode with a disk that got
past blk_unregister_region() in del_gendisk() (and whose device number can
thus get reused). Similarly, we will not associate a bdev that was once
associated with a gendisk that is going away (and whose corresponding bdev
inode will thus get unhashed in del_gendisk()) with a new gendisk that is
just reusing the device number.

Also add a warning that will tell us about unexpected inconsistencies
between bdi associated with the bdev inode and bdi associated with the
disk.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Comments

Tejun Heo March 21, 2017, 4:19 p.m. UTC | #1
Hello, Jan.

On Mon, Mar 13, 2017 at 04:14:01PM +0100, Jan Kara wrote:
> blkdev_open() may race with gendisk shutdown in two different ways.
> Either del_gendisk() has already unhashed block device inode (and thus
> bd_acquire() will end up creating new block device inode) however
> get_gendisk() will still return the gendisk that is being destroyed.
> Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed
> before we get to get_gendisk() and get_gendisk() will return new gendisk
> that got allocated for device that reused the device number.
> 
> In both cases this will result in possible inconsistencies between
> bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
> destroyed and device number reused, in the second case immediately).
> 
> Fix the problem by checking whether the gendisk is still alive and inode
> hashed when associating bdev inode with it and its bdi. That way we are
> sure that we will not associate bdev inode with disk that got past
> blk_unregister_region() in del_gendisk() (and thus device number can get
> reused). Similarly, we will not associate bdev that was once associated
> with gendisk that is going away (and thus the corresponding bdev inode
> will get unhashed in del_gendisk()) with a new gendisk that is just
> reusing the device number.
> 
> Also add a warning that will tell us about unexpected inconsistencies
> between bdi associated with the bdev inode and bdi associated with the
> disk.

Hmm... let's see if I got this straight.

It used to be that a blockdev was essentially stateless while nobody
had it open.  It acquired all its actual associations during the
initial open and lost them on the last release.  The immediate
release semantics got us into trouble because upper layers had nothing
to serve as the proper sever point when the underlying block device
goes away.

So, we decided that bdi should serve that purpose, which makes perfect
sense as it's what upper layers talk to when they wanna reach the
block device, so we decoupled its lifetime from the request_queue and
implemented the sever there.

Now that bdis are persistent, we can solve the bdev-access-while-not-open
problem by making the bdev point to the respective bdi from the beginning
until it's released, which means that bdevs are now stateful objects
which are associated with specific bdis and thus request_queues.

Because there are multiple ways that these objects are looked up and
handled, now we can get into situations where the request_queue
(gendisk) we look up during open may not match the bdi that a bdev is
associated with, and this patch solves that problem by detecting the
conditions where these mismatches would take place.  More
specifically, we don't want to be using a dead bdev during open.

If I misunderstood something, please let me know.

> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 53e2389ae4d4..5ec8750f5332 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1560,7 +1560,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>  		if (!partno) {
>  			ret = -ENXIO;
>  			bdev->bd_part = disk_get_part(disk, partno);
> -			if (!bdev->bd_part)
> +			if (!(disk->flags & GENHD_FL_UP) || !bdev->bd_part ||
> +			    inode_unhashed(bdev->bd_inode))
>  				goto out_clear;

This adds both GENHD_FL_UP and unhashed testing; however, I'm not sure
whether the former is meaningful here.  GENHD flag updates are not
synchronized when seen from outside the party which is updating it.
The actual synchronization against dead device should happen inside
disk->fops->open().

>  			ret = 0;
> @@ -1614,7 +1615,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>  			bdev->bd_contains = whole;
>  			bdev->bd_part = disk_get_part(disk, partno);
>  			if (!(disk->flags & GENHD_FL_UP) ||
> -			    !bdev->bd_part || !bdev->bd_part->nr_sects) {
> +			    !bdev->bd_part || !bdev->bd_part->nr_sects ||
> +			    inode_unhashed(bdev->bd_inode)) {
>  				ret = -ENXIO;
>  				goto out_clear;
>  			}

This hunk is different.  As the device is already open, we're just
incrementing the ref before granting another open without consulting
the driver, so granting a new ref can't break anything.  The block
layer is just doing a best-effort thing to not give out a new ref on a
dead device with the UP test, which is completely fine.

> @@ -1623,6 +1625,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>  
>  		if (bdev->bd_bdi == &noop_backing_dev_info)
>  			bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
> +		else
> +			WARN_ON_ONCE(bdev->bd_bdi !=
> +				     disk->queue->backing_dev_info);

While I can't think of cases where this wouldn't work, it seems a bit
roundabout to me.

* A bdi is the object which knows which request_queue it's associated
  with and when that association dies.

* bdev is permanently associated with a bdi.

So, it's kinda weird to look up the request_queue again on each open
and then verify that the bd_bdi and request_queue match.  I think it
would make more sense to skip the disk lookup and just go through the
bdi to determine the associated request_queue, as that's the primary
nexus object tying everything up now.

That said, it's totally possible that that would take too much
restructuring to implement right now with everything else, but if we
think that that's the right long-term direction, I think it would make
more sense to just test whether bdev->bd_bdi matches the disk right
after looking up the disk and fail the open if not.  That's the
ultimate condition we wanna avoid after all, and it would also ease
replacing it with going through the bdi instead of looking up again.

Thanks.
Jan Kara March 23, 2017, 12:21 a.m. UTC | #2
Hello Tejun,

> On Mon, Mar 13, 2017 at 04:14:01PM +0100, Jan Kara wrote:
> > blkdev_open() may race with gendisk shutdown in two different ways.
> > Either del_gendisk() has already unhashed block device inode (and thus
> > bd_acquire() will end up creating new block device inode) however
> > get_gendisk() will still return the gendisk that is being destroyed.
> > Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed
> > before we get to get_gendisk() and get_gendisk() will return new gendisk
> > that got allocated for device that reused the device number.
> > 
> > In both cases this will result in possible inconsistencies between
> > bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
> > destroyed and device number reused, in the second case immediately).
> > 
> > Fix the problem by checking whether the gendisk is still alive and inode
> > hashed when associating bdev inode with it and its bdi. That way we are
> > sure that we will not associate bdev inode with disk that got past
> > blk_unregister_region() in del_gendisk() (and thus device number can get
> > reused). Similarly, we will not associate bdev that was once associated
> > with gendisk that is going away (and thus the corresponding bdev inode
> > will get unhashed in del_gendisk()) with a new gendisk that is just
> > reusing the device number.
> > 
> > Also add a warning that will tell us about unexpected inconsistencies
> > between bdi associated with the bdev inode and bdi associated with the
> > disk.
> 
> Hmm... let's see if I got this straight.
> 
> It used to be that blockdevs are essentially stateless while nobody
> has it open.  It acquires all its actual associations during the
> initial open and loses them on the last release.  The immediate
> release semantics got us into trouble because upper layers had nothing
> to serve as the proper sever point when the underlying block device
> goes away.
> 
> So, we decided that bdi should serve that purpose, which makes perfect
> sense as it's what upper layers talk to when they wanna reach to the
> block device, so we decoupled its lifetime from the request_queue and
> implements sever there.
> 
> Now that bdis are persistent, we can solve bdev-access-while-not-open
> problem by making the bdev point to the respective bdi from the beginning
> until it's released, which means that bdevs are now stateful objects
> which are associated with specific bdis and thus request_queues.

Yes, perfect summary.

> Because there are multiple ways that these objects are looked up and
> handled, now we can get into situations where the request_queue
> (gendisk) we look up during open may not match the bdi that a bdev is
> associated with, and this patch solves that problem by detecting the
> conditions where these mismatches would take place.  More
> specifically, we don't want to be using a dead bdev during open.

Yes.

> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index 53e2389ae4d4..5ec8750f5332 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -1560,7 +1560,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
> >  		if (!partno) {
> >  			ret = -ENXIO;
> >  			bdev->bd_part = disk_get_part(disk, partno);
> > -			if (!bdev->bd_part)
> > +			if (!(disk->flags & GENHD_FL_UP) || !bdev->bd_part ||
> > +			    inode_unhashed(bdev->bd_inode))
> >  				goto out_clear;
> 
> This adds both GENHD_FL_UP and unhashed testing; however, I'm not sure
> whether the former is meaningful here.  GENHD flag updates are not
> synchronized when seen from outside the party which is updating it.
> The actual synchronization against dead device should happen inside
> disk->fops->open().

Hum, now that I look at the code again, it seems my patch is still racy:
if we get a fresh inode (lookup_bdev() already allocated a new inode and
returned it to us) but get_gendisk() returned a gendisk that is shutting
down, we may still see GENHD_FL_UP set and thus associate the new inode
with a gendisk that is being destroyed.

> >  			ret = 0;
> > @@ -1614,7 +1615,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
> >  			bdev->bd_contains = whole;
> >  			bdev->bd_part = disk_get_part(disk, partno);
> >  			if (!(disk->flags & GENHD_FL_UP) ||
> > -			    !bdev->bd_part || !bdev->bd_part->nr_sects) {
> > +			    !bdev->bd_part || !bdev->bd_part->nr_sects ||
> > +			    inode_unhashed(bdev->bd_inode)) {
> >  				ret = -ENXIO;
> >  				goto out_clear;
> >  			}
> 
> Which is different from here.  As the device is already open, we're
> just incrementing the ref before granting another open without
> consulting the driver.  As the device is open already, whether
> granting a new ref or not can't break anything.  Block layer is just
> doing a best effort thing to not give out new ref on a dead one with
> the UP test, which is completely fine.

Yeah.

> > @@ -1623,6 +1625,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
> >  
> >  		if (bdev->bd_bdi == &noop_backing_dev_info)
> >  			bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
> > +		else
> > +			WARN_ON_ONCE(bdev->bd_bdi !=
> > +				     disk->queue->backing_dev_info);
> 
> While I can't think of cases where this wouldn't work, it seems a bit
> roundabout to me.
> 
> * A bdi is the object which knows which request_queue it's associated
>   and when that association dies.

So a bdi is associated with a request_queue, but it does not have any
direct pointer or reference to it; it is the request_queue that points to
the bdi. However, you are right that the bdi_unregister() call is telling
the bdi that the device is going away.

> * bdev is permanently associated with a bdi.

Yes.

> So, it's kinda weird to look up the request_queue again on each open
> and then verify that the bd_bdi and request_queue match.  I think it
> would make more sense to skip disk look up and just go through bdi to
> determine the associated request_queue as that's the primary nexus
> object tying everything up now.

So I was thinking about this as well. The problem is that you cannot
really keep a reference to the request_queue (or gendisk) from the block
device inode or the bdi, as that could unexpectedly pin the driver in
memory after the user thinks everything should be cleaned up. So I think
we really do have to establish the connection between the block device
inode and the request_queue / gendisk on opening of the block device and
drop the reference when the block device is closed.

> That said, it's totally possible that that would take too much
> restructuring to implement right now with everything else, but if we
> think that that's the right long term direction, I think it would make
> more sense to just test whether bdev->bd_bdi matches the disk right
> after looking up the disk and fail the open if not.  That's the
> ultimate condition we wanna avoid after all and it also would ease
> replacing it with going through bdi instead of looking up again.

The problem with this is that you could still associate a fresh block
device inode with a bdi corresponding to a gendisk that is going away (and
that has already gone through bdev_unhash_inode()). Such a block device
inode may then stay associated with that bdi even after the device number
gets reused, and that will falsely cause the check for a matching bdi to
fail ever after.

So since I'm not actually able to trigger any of these races in my testing,
I guess I'll just drop this patch (it isn't needed by anything else in the
series) and let the rest of the series be merged. When I come up with some
better solution to this problem, I'll send a fix separately. Thanks for the
review!

								Honza
Tejun Heo March 23, 2017, 3:38 p.m. UTC | #3
Hello, Jan.

On Thu, Mar 23, 2017 at 01:21:05AM +0100, Jan Kara wrote:
> > * A bdi is the object which knows which request_queue it's associated
> >   and when that association dies.
> 
> So bdi is associated with a request_queue but it does not have any
> direct pointer or reference to it. It is the request_queue that points to
> bdi. However you are right that bdi_unregister() call is telling bdi that
> the device is going away.
> 
> > * bdev is permanently associated with a bdi.
> 
> Yes.
> 
> > So, it's kinda weird to look up the request_queue again on each open
> > and then verify that the bd_bdi and request_queue match.  I think it
> > would make more sense to skip disk look up and just go through bdi to
> > determine the associated request_queue as that's the primary nexus
> > object tying everything up now.
> 
> So I was thinking about this as well. The problem is that you cannot really
> keep a reference to request_queue (or gendisk) from block device inode or
> bdi as that could unexpectedly pin the driver in memory after user thinks
> everything should be cleaned up. So I think we really do have to establish
> the connection between block device inode and request_queue / gendisk on
> opening of the block device and drop the reference when the block device is
> closed.

Hmmm... but for the bdi to be able to serve as the sever point, it must
know when it gets disassociated (del_gendisk) and be able to safely
clear its pointer and reject further access to the request_queue.  I
think we're trying to shift that synchronization to the open path, but
we have to do that synchronization somewhere anyway, and it'll likely
be more straightforward to do it at the sever point.

Thanks.
Hou Tao Nov. 17, 2017, 6:51 a.m. UTC | #4
Hi Jan,

On 2017/3/13 23:14, Jan Kara wrote:
> blkdev_open() may race with gendisk shutdown in two different ways.
> Either del_gendisk() has already unhashed block device inode (and thus
> bd_acquire() will end up creating new block device inode) however
> get_gendisk() will still return the gendisk that is being destroyed.
> Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed
> before we get to get_gendisk() and get_gendisk() will return new gendisk
> that got allocated for device that reused the device number.
> 
> In both cases this will result in possible inconsistencies between
> bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
> destroyed and device number reused, in the second case immediately).

> Fix the problem by checking whether the gendisk is still alive and inode
> hashed when associating bdev inode with it and its bdi. That way we are
> sure that we will not associate bdev inode with disk that got past
> blk_unregister_region() in del_gendisk() (and thus device number can get
> reused). Similarly, we will not associate bdev that was once associated
> with gendisk that is going away (and thus the corresponding bdev inode
> will get unhashed in del_gendisk()) with a new gendisk that is just
> reusing the device number.

I have run some extended tests based on Omar Sandoval's script [1] these
days, and found two new problems related to the race between bdev open and
gendisk shutdown.

The first problem lies in __blkdev_get(), and will cause a use-after-free
or, worse, an oops. It happens because there may be two instances of
gendisk related to one bdev. When the process which owns the newer gendisk
completes the invocation of __blkdev_get() first, the other process, which
owns the older gendisk, will put the last reference of the older gendisk
and cause a use-after-free.

The following call sequences illustrate the problem:

unhash the bdev_inode of /dev/sda

process A (blkdev_open()):
	bd_acquire() returns new bdev_inode of /dev/sda
	get_gendisk() returns gendisk (v1 gendisk)

remove gendisk from bdev_map
device number is reused

process B (blkdev_open()):
	bd_acquire() returns new bdev_inode of /dev/sda
	get_gendisk() returns a new gendisk (v2 gendisk)
	increase bdev->bd_openers

process A (blkdev_open()):
	find bdev->bd_openers != 0
	put_disk(disk) // this is the last reference to v1 gendisk
	disk_unblock_events(disk) // use-after-free occurs

The problem can be fixed by an extra check showed in the following snippet:

static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
{
	if (!bdev->bd_openers) {
		/* ...... */
	} else {
		if (bdev->bd_disk != disk) {
			ret = -ENXIO;
			goto out_unlock_bdev;
		}
	}
}

We can also simplify the check in your patch by moving
blk_unregister_region() before bdev_unhash_inode(): whenever we get a new
bdev_inode, the gendisk returned by get_gendisk() will then be either NULL
or a new gendisk. The only check we need to do in __blkdev_get() is
checking the hash status of the bdev_inode after we get a gendisk from
get_gendisk(). The check can be done like the following snippet:

static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
{
	/* ...... */
	ret = -ENXIO;
	disk = get_gendisk(bdev->bd_dev, &partno);
	if (!disk)
		goto out;
	owner = disk->fops->owner;

	if (inode_unhashed(bdev->bd_inode))
		goto out_unmatched;
}

The second problem lies in bd_start_claiming(), and will trigger the
BUG_ON assert in blkdev_get(): BUG_ON(!bd_may_claim(bdev, whole, holder)).
It occurs when testing the exclusive open and close of the disk and its
partitions.

The cause of the problem is that a bdev_inode of a disk partition
corresponds to two instances of the bdev_inode of the whole disk
simultaneously. When one pair of the partition inode and disk inode
completes the claiming, the other pair will stumble over the BUG_ON
assert in blkdev_get() because bdev->bd_holder is no longer NULL.

The following sequences illustrate the problem:

unhash the bdev_inode of /dev/sda2

process A (blkdev_open()):
	bd_acquire() returns a new bdev_inode for /dev/sda2
	bd_start_claiming() returns bdev_inode of /dev/sda

process B (blkdev_open()):
	bd_acquire() returns the new bdev_inode for /dev/sda2

unhash the bdev_inode of /dev/sda
remove gendisk from bdev_map
device number is reused

process B (blkdev_open()):
	bd_start_claming() returns a new bdev_inode for /dev/sda

process A (blkdev_open()):
	__blkdev_get() returns successfully
	finish claiming  // bdev->bd_holder = holder

process B (blkdev_open()):
	__blkdev_get() returns successfully
	trigger BUG_ON(!bd_may_claim(bdev, whole, holder))

And the problem can be fixed by adding more checks in bd_start_claiming():

static struct block_device *bd_start_claiming(struct block_device *bdev,
                                              void *holder)
{
	/* ...... */
        disk = get_gendisk(bdev->bd_dev, &partno);
        if (!disk)
                return ERR_PTR(-ENXIO);

        if (inode_unhashed(bdev->bd_inode)) {
                module_put(disk->fops->owner);
                put_disk(disk);
                return ERR_PTR(-ENXIO);
        }

	/* ...... */
        if (!whole)
                return ERR_PTR(-ENOMEM);

        if (inode_unhashed(bdev->bd_inode) ||
                inode_unhashed(whole->bd_inode)) {
                bdput(whole);
                return ERR_PTR(-ENXIO);
        }
}

Apart from adding more and more checks, are there better solutions to the
race problem? I have thought about managing the device number with a
ref-counter, so the device number will not be reused until the last
references of the bdev_inode and gendisk are released, and I am trying to
figure out a way to get/put a reference on the device number when we
get/put the block_device.

So, any suggestions or thoughts?

Regards,
Tao

---
1: https://marc.info/?l=linux-block&m=148554717109098&w=2


> Also add a warning that will tell us about unexpected inconsistencies
> between bdi associated with the bdev inode and bdi associated with the
> disk.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/block_dev.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 53e2389ae4d4..5ec8750f5332 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1560,7 +1560,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>  		if (!partno) {
>  			ret = -ENXIO;
>  			bdev->bd_part = disk_get_part(disk, partno);
> -			if (!bdev->bd_part)
> +			if (!(disk->flags & GENHD_FL_UP) || !bdev->bd_part ||
> +			    inode_unhashed(bdev->bd_inode))
>  				goto out_clear;
>  
>  			ret = 0;
> @@ -1614,7 +1615,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>  			bdev->bd_contains = whole;
>  			bdev->bd_part = disk_get_part(disk, partno);
>  			if (!(disk->flags & GENHD_FL_UP) ||
> -			    !bdev->bd_part || !bdev->bd_part->nr_sects) {
> +			    !bdev->bd_part || !bdev->bd_part->nr_sects ||
> +			    inode_unhashed(bdev->bd_inode)) {
>  				ret = -ENXIO;
>  				goto out_clear;
>  			}
> @@ -1623,6 +1625,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>  
>  		if (bdev->bd_bdi == &noop_backing_dev_info)
>  			bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
> +		else
> +			WARN_ON_ONCE(bdev->bd_bdi !=
> +				     disk->queue->backing_dev_info);
>  	} else {
>  		if (bdev->bd_contains == bdev) {
>  			ret = 0;
>
Jan Kara Nov. 20, 2017, 4:43 p.m. UTC | #5
Hi Tao!

On Fri 17-11-17 14:51:18, Hou Tao wrote:
> On 2017/3/13 23:14, Jan Kara wrote:
> > blkdev_open() may race with gendisk shutdown in two different ways.
> > Either del_gendisk() has already unhashed block device inode (and thus
> > bd_acquire() will end up creating new block device inode) however
> > get_gendisk() will still return the gendisk that is being destroyed.
> > Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed
> > before we get to get_gendisk() and get_gendisk() will return new gendisk
> > that got allocated for device that reused the device number.
> > 
> > In both cases this will result in possible inconsistencies between
> > bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
> > destroyed and device number reused, in the second case immediately).
> 
> > Fix the problem by checking whether the gendisk is still alive and inode
> > hashed when associating bdev inode with it and its bdi. That way we are
> > sure that we will not associate bdev inode with disk that got past
> > blk_unregister_region() in del_gendisk() (and thus device number can get
> > reused). Similarly, we will not associate bdev that was once associated
> > with gendisk that is going away (and thus the corresponding bdev inode
> > will get unhashed in del_gendisk()) with a new gendisk that is just
> > reusing the device number.
> 
> I have run some extended tests based on Omar Sandoval's script [1] these days,
> and found two new problems related with the race between bdev open and gendisk
> shutdown.
> 
> The first problem lies in __blkdev_get(), and will cause use-after-free, or
> worse oops. It happens because there may be two instances of gendisk related
> with one bdev. When the process which owns the newer gendisk completes the
> invocation of __blkdev_get() firstly, the other process which owns the older
> gendisk will put the last reference of the older gendisk and cause use-after-free.
> 
> The following call sequences illustrate the problem:
> 
> unhash the bdev_inode of /dev/sda
> 
> process A (blkdev_open()):
> 	bd_acquire() returns new bdev_inode of /dev/sda
> 	get_gendisk() returns gendisk (v1 gendisk)
> 
> remove gendisk from bdev_map
> device number is reused

^^ this is through blk_unregister_region() in del_gendisk(), isn't it?

> process B (blkdev_open()):
> 	bd_acquire() returns new bdev_inode of /dev/sda
> 	get_gendisk() returns a new gendisk (v2 gendisk)
> 	increase bdev->bd_openers

So this needs to be a bit more complex - bd_acquire() must happen before
the bdev_unhash_inode() in del_gendisk() above, and get_gendisk() must
happen after the blk_unregister_region() from del_gendisk(). But yes,
this seems possible.

> process A (blkdev_open()):
> 	find bdev->bd_openers != 0
> 	put_disk(disk) // this is the last reference to v1 gendisk
> 	disk_unblock_events(disk) // use-after-free occurs
> 
> The problem can be fixed by an extra check showed in the following snippet:
> 
> static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
> {
> 	if (!bdev->bd_openers) {
> 	} else {
> 		if (bdev->bd_disk != disk) {
> 			ret = -ENXIO;
> 			goto out_unlock_bdev;
> 		}
> 	}
> }
>
> And we also can simplify the check in your patch by moving
> blk_unregister_region() before bdev_unhash_inode(), so whenever we get a
> new bdev_inode, the gendisk returned by get_gendisk() will either be NULL
> or a new gendisk.

I don't think we can do that. If we call blk_unregister_region() before
bdev_unhash_inode(), the device number can get reused while the bdev inode
is still visible. So you can get bdev inode v1 associated with gendisk v2,
and an open following the unhashing of bdev inode v1 will get bdev inode
v2 associated with gendisk v2. So you have two different bdev inodes
associated with the same gendisk, and that will cause data integrity
issues.

But what you suggest above is probably a reasonable fix. Will you submit a
proper patch please?

> And the only check we need to do in __blkdev_get() is
> checking the hash status of the bdev_inode after we get a gendisk from
> get_gendisk(). The check can be done like the following snippet:
> 
> static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
> {
> 	/* ...... */
> 	ret = -ENXIO;
> 	disk = get_gendisk(bdev->bd_dev, &partno);
> 	if (!disk)
> 		goto out;
> 	owner = disk->fops->owner;
> 
> 	if (inode_unhashed(bdev->bd_inode))
> 		goto out_unmatched;
> }
> 
> The second problem lies in bd_start_claiming(), and will trigger the
> BUG_ON assert in blkdev_get: BUG_ON(!bd_may_claim(bdev, whole, holder)).
> It occurs when testing the exclusive open and close of the disk and its
> partitions.
> 
> The cause of the problem is that a bdev_inode of a disk partition
> corresponds with two instances of bdev_inode of the whole disk
> simultaneously. When one pair of the partition inode and disk inode
> completes the claiming, the other pair will be stumbled by the BUG_ON
> assert in blkdev_get() because bdev->bd_holder is no longer NULL.
> 
> The following sequences illustrate the problem:
> 
> unhash the bdev_inode of /dev/sda2
> 
> process A (blkdev_open()):
> 	bd_acquire() returns a new bdev_inode for /dev/sda2
> 	bd_start_claiming() returns bdev_inode of /dev/sda
> 
> process B (blkdev_open()):
> 	bd_acquire() returns the new bdev_inode for /dev/sda2
> 
> unhash the bdev_inode of /dev/sda
> remove gendisk from bdev_map
> device number is reused
> 
> process B (blkdev_open()):
> 	bd_start_claming() returns a new bdev_inode for /dev/sda
> 
> process A (blkdev_open()):
> 	__blkdev_get() returns successfully
> 	finish claiming  // bdev->bd_holder = holder
> 
> process B (blkdev_open()):
> 	__blkdev_get() returns successfully
> 	trigger BUG_ON(!bd_may_claim(bdev, whole, holder))

Ah, good spotting. Essentially the whole->bd_claiming protection for the
partition bdev does not work because we've got two different 'whole' bdev
pointers.

> And the problem can be fixed by adding more checks in bd_start_claiming():
> 
> static struct block_device *bd_start_claiming(struct block_device *bdev,
>                                               void *holder)
> {
> 	/* ...... */
>         disk = get_gendisk(bdev->bd_dev, &partno);
>         if (!disk)
>                 return ERR_PTR(-ENXIO);
> 
>         if (inode_unhashed(bdev->bd_inode)) {
>                 module_put(disk->fops->owner);
>                 put_disk(disk);
>                 return ERR_PTR(-ENXIO);
>         }
> 
> 	/* ...... */
>         if (!whole)
>                 return ERR_PTR(-ENOMEM);
> 
>         if (inode_unhashed(bdev->bd_inode) ||
>                 inode_unhashed(whole->bd_inode)) {
>                 bdput(whole);
>                 return ERR_PTR(-ENXIO);
>         }
> }
> 
> Except adding more and more checks, are there better solutions to the
> race problem ?  I have thought about managing the device number by
> ref-counter, so the device number will not be reused until the last
> reference of bdev_inode and gendisk are released, and i am trying to
> figure out a way to get/put a reference of the device number when get/put
> the block_device.
> 
> So any suggests and thoughts ?

Yeah, these races are nasty. I will think about whether there's some
reasonably easy way to fix them or whether we'll have to go via some other
route, like blocking the device number as you suggest. In either case,
thanks for the report and the analysis!

								Honza
Hou Tao Nov. 24, 2017, 3:12 a.m. UTC | #6
Hi Jan,

On 2017/11/21 0:43, Jan Kara wrote:
> Hi Tao!
> 
> On Fri 17-11-17 14:51:18, Hou Tao wrote:
>> On 2017/3/13 23:14, Jan Kara wrote:
>>> blkdev_open() may race with gendisk shutdown in two different ways.
>>> Either del_gendisk() has already unhashed block device inode (and thus
>>> bd_acquire() will end up creating new block device inode) however
>>> get_gendisk() will still return the gendisk that is being destroyed.
>>> Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed
>>> before we get to get_gendisk() and get_gendisk() will return new gendisk
>>> that got allocated for device that reused the device number.
>>>
>>> In both cases this will result in possible inconsistencies between
>>> bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
>>> destroyed and device number reused, in the second case immediately).
>>
>>> Fix the problem by checking whether the gendisk is still alive and inode
>>> hashed when associating bdev inode with it and its bdi. That way we are
>>> sure that we will not associate bdev inode with disk that got past
>>> blk_unregister_region() in del_gendisk() (and thus device number can get
>>> reused). Similarly, we will not associate bdev that was once associated
>>> with gendisk that is going away (and thus the corresponding bdev inode
>>> will get unhashed in del_gendisk()) with a new gendisk that is just
>>> reusing the device number.
>>
>> I have run some extended tests based on Omar Sandoval's script [1] these days,
>> and found two new problems related with the race between bdev open and gendisk
>> shutdown.
>>
>> The first problem lies in __blkdev_get(), and will cause a use-after-free
>> or, worse, an oops. It happens because there may be two instances of
>> gendisk related to one bdev. When the process which owns the newer gendisk
>> completes the invocation of __blkdev_get() first, the other process, which
>> owns the older gendisk, will put the last reference of the older gendisk
>> and cause a use-after-free.
>>
>> The following call sequences illustrate the problem:
>>
>> unhash the bdev_inode of /dev/sda
>>
>> process A (blkdev_open()):
>> 	bd_acquire() returns new bdev_inode of /dev/sda
>> 	get_gendisk() returns gendisk (v1 gendisk)
>>
>> remove gendisk from bdev_map
>> device number is reused
> 
> ^^ this is through blk_unregister_region() in del_gendisk(), isn't it?

Yes, both the unhash of inode and the removal of gendisk from bdev_map
happen in del_gendisk().

>> process B (blkdev_open()):
>> 	bd_acquire() returns new bdev_inode of /dev/sda
>> 	get_gendisk() returns a new gendisk (v2 gendisk)
>> 	increase bdev->bd_openers
> 
> So this needs to be a bit more complex - bd_acquire() must happen before
> the bdev_unhash_inode() in del_gendisk() above, and get_gendisk() must
> happen after the blk_unregister_region() from del_gendisk(). But yes,
> this seems possible.
> 
>> process A (blkdev_open()):
>> 	find bdev->bd_openers != 0
>> 	put_disk(disk) // this is the last reference to v1 gendisk
>> 	disk_unblock_events(disk) // use-after-free occurs
>>
>> The problem can be fixed by an extra check showed in the following snippet:
>>
>> static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>> {
>> 	if (!bdev->bd_openers) {
>> 	} else {
>> 		if (bdev->bd_disk != disk) {
>> 			ret = -ENXIO;
>> 			goto out_unlock_bdev;
>> 		}
>> 	}
>> }
>>
>> And we can also simplify the check in your patch by moving
>> blk_unregister_region() before bdev_unhash_inode(), so whenever we get a
>> new bdev_inode, the gendisk returned by get_gendisk() will either be NULL
>> or a new gendisk.
> 
> I don't think we can do that. If we call blk_unregister_region() before
> bdev_unhash_inode(), the device number can get reused while bdev inode is
> still visible. So you could get bdev inode v1 associated with gendisk
> v2. And an open following the unhashing of bdev inode v1 will get bdev
> inode v2 associated with gendisk v2. So you have two different bdev inodes
> associated with the same gendisk and that will cause data integrity issues.

Even if we do not move blk_unregister_region(), the data integrity issue
still exists: bd_acquire() is invoked before bdev_unhash_inode() and
get_gendisk() is invoked after blk_unregister_region(). The move is meant to
simplify the consistency check (the version check between bdev_inode and
gendisk), so the added check above would be unnecessary, because whenever
bd_acquire() returns a new bdev_inode, the gendisk returned by get_gendisk()
will also be a new gendisk or NULL. Anyway, it's just another workaround, so
it's OK if we do not do that.

> But what you suggest above is probably a reasonable fix. Will you submit a
> proper patch please?

Uh, it's also an incomplete fix for the race problem. Anyhow, I will do it.

>> And the only check we need to do in __blkdev_get() is
>> checking the hash status of the bdev_inode after we get a gendisk from
>> get_gendisk(). The check can be done like the following snippet:
>>
>> static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>> {
>> 	/* ...... */
>> 	ret = -ENXIO;
>> 	disk = get_gendisk(bdev->bd_dev, &partno);
>> 	if (!disk)
>> 		goto out;
>> 	owner = disk->fops->owner;
>>
>> 	if (inode_unhashed(bdev->bd_inode))
>> 		goto out_unmatched;
>> }
>>
>> The second problem lies in bd_start_claiming(), and will trigger the
>> BUG_ON assert in blkdev_get: BUG_ON(!bd_may_claim(bdev, whole, holder)).
>> It occurs when testing the exclusive open and close of the disk and its
>> partitions.
>>
>> The cause of the problem is that a bdev_inode of a disk partition
>> corresponds to two instances of bdev_inode of the whole disk
>> simultaneously. When one pair of the partition inode and disk inode
>> completes the claiming, the other pair will trip over the BUG_ON
>> assert in blkdev_get() because bdev->bd_holder is no longer NULL.
>>
>> The following sequences illustrate the problem:
>>
>> unhash the bdev_inode of /dev/sda2
>>
>> process A (blkdev_open()):
>> 	bd_acquire() returns a new bdev_inode for /dev/sda2
>> 	bd_start_claiming() returns bdev_inode of /dev/sda
>>
>> process B (blkdev_open()):
>> 	bd_acquire() returns the new bdev_inode for /dev/sda2
>>
>> unhash the bdev_inode of /dev/sda
>> remove gendisk from bdev_map
>> device number is reused
>>
>> process B (blkdev_open()):
>> 	bd_start_claiming() returns a new bdev_inode for /dev/sda
>>
>> process A (blkdev_open()):
>> 	__blkdev_get() returns successfully
>> 	finish claiming  // bdev->bd_holder = holder
>>
>> process B (blkdev_open()):
>> 	__blkdev_get() returns successfully
>> 	trigger BUG_ON(!bd_may_claim(bdev, whole, holder))
> 
> Ah, good spotting. Essentially the whole->bd_claiming protection for
> partition bdev does not work because we've got two different 'whole' bdev
> pointers.

Yes, perfect summary.

>> And the problem can be fixed by adding more checks in bd_start_claiming():
>>
>> static struct block_device *bd_start_claiming(struct block_device *bdev,
>>                                               void *holder)
>> {
>> 	/* ...... */
>>         disk = get_gendisk(bdev->bd_dev, &partno);
>>         if (!disk)
>>                 return ERR_PTR(-ENXIO);
>>
>>         if (inode_unhashed(bdev->bd_inode)) {
>>                 module_put(disk->fops->owner);
>>                 put_disk(disk);
>>                 return ERR_PTR(-ENXIO);
>>         }
>>
>> 	/* ...... */
>>         if (!whole)
>>                 return ERR_PTR(-ENOMEM);
>>
>>         if (inode_unhashed(bdev->bd_inode) ||
>>                 inode_unhashed(whole->bd_inode)) {
>>                 bdput(whole);
>>                 return ERR_PTR(-ENXIO);
>>         }
>> }
>>
>> Except adding more and more checks, are there better solutions to the
>> race problem ?  I have thought about managing the device number by
>> ref-counter, so the device number will not be reused until the last
>> reference of bdev_inode and gendisk are released, and i am trying to
>> figure out a way to get/put a reference of the device number when get/put
>> the block_device.
>>
>> So, any suggestions or thoughts?
> 
> Yeah, these races are nasty. I will think about whether there's some reasonably
> easy way to fix them or whether we'll have to go via some other route
> like blocking the device number as you suggest. In either case thanks for
> the report and the analysis!

Look forward to review them.

Regards,
Tao

> 								Honza
>
Patch

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 53e2389ae4d4..5ec8750f5332 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1560,7 +1560,8 @@  static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 		if (!partno) {
 			ret = -ENXIO;
 			bdev->bd_part = disk_get_part(disk, partno);
-			if (!bdev->bd_part)
+			if (!(disk->flags & GENHD_FL_UP) || !bdev->bd_part ||
+			    inode_unhashed(bdev->bd_inode))
 				goto out_clear;
 
 			ret = 0;
@@ -1614,7 +1615,8 @@  static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 			bdev->bd_contains = whole;
 			bdev->bd_part = disk_get_part(disk, partno);
 			if (!(disk->flags & GENHD_FL_UP) ||
-			    !bdev->bd_part || !bdev->bd_part->nr_sects) {
+			    !bdev->bd_part || !bdev->bd_part->nr_sects ||
+			    inode_unhashed(bdev->bd_inode)) {
 				ret = -ENXIO;
 				goto out_clear;
 			}
@@ -1623,6 +1625,9 @@  static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 
 		if (bdev->bd_bdi == &noop_backing_dev_info)
 			bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
+		else
+			WARN_ON_ONCE(bdev->bd_bdi !=
+				     disk->queue->backing_dev_info);
 	} else {
 		if (bdev->bd_contains == bdev) {
 			ret = 0;