diff mbox

[-stable] block: destroy bdi before blockdev is unregistered.

Message ID 20150429072530.39d38b00@notabene.brown (mailing list archive)
State Not Applicable, archived
Delegated to: Mike Snitzer
Headers show

Commit Message

NeilBrown April 28, 2015, 9:25 p.m. UTC
On Tue, 28 Apr 2015 12:41:18 -0400 Mike Snitzer <snitzer@redhat.com> wrote:

> On Mon, Apr 27, 2015 at 12:12 AM, NeilBrown <neilb@suse.de> wrote:
> >
> > Because of the peculiar way that md devices are created (automatically
> > when the device node is opened), a new device can be created and
> > registered immediately after the
> >         blk_unregister_region(disk_devt(disk), disk->minors);
> > call in del_gendisk().
> >
> > Therefore it is important that all visible artifacts of the previous
> > device are removed before this call.  In particular, the 'bdi'.
> >
> > Since:
> > commit c4db59d31e39ea067c32163ac961e9c80198fd37
> > Author: Christoph Hellwig <hch@lst.de>
> >     fs: don't reassign dirty inodes to default_backing_dev_info
> >
> > moved the
> >    device_unregister(bdi->dev);
> > call from bdi_unregister() to bdi_destroy() it has been quite easy to
> > lose a race and have a new (e.g.) "md127" be created after the
> > blk_unregister_region() call and before bdi_destroy() is ultimately
> > called by the final 'put_disk', which must come after del_gendisk().
> >
> > The new device finds that the bdi name is already registered in sysfs
> > and complains
> >
> >> [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70()
> >> [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127'
> >
> > We can fix this by moving the bdi_destroy() call out of
> > blk_release_queue() (which can happen very late when a refcount
> > reaches zero) and into blk_cleanup_queue() - which happens exactly when the md
> > device driver calls it.
> >
> > Then it is only necessary for md to call blk_cleanup_queue() before
> > del_gendisk().  As loop.c devices are also created on demand by
> > opening the device node, we make the same change there.
> >
> > Fixes: c4db59d31e39ea067c32163ac961e9c80198fd37
> > Reported-by: Azat Khuzhin <a3at.mail@gmail.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: stable@vger.kernel.org (v4.0)
> > Signed-off-by: NeilBrown <neilb@suse.de>
> >
> > --
> > Hi Jens,
> >  if you could check this and forward on to Linus I'd really appreciate it.
> >
> > Thanks,
> > NeilBrown
> >
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index fd154b94447a..7871603f0a29 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -552,6 +552,8 @@ void blk_cleanup_queue(struct request_queue *q)
> >                 q->queue_lock = &q->__queue_lock;
> >         spin_unlock_irq(lock);
> >
> > +       bdi_destroy(&q->backing_dev_info);
> > +
> >         /* @q is and will stay empty, shutdown and put */
> >         blk_put_queue(q);
> >  }
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index faaf36ade7eb..2b8fd302f677 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -522,8 +522,6 @@ static void blk_release_queue(struct kobject *kobj)
> >
> >         blk_trace_shutdown(q);
> >
> > -       bdi_destroy(&q->backing_dev_info);
> > -
> >         ida_simple_remove(&blk_queue_ida, q->id);
> >         call_rcu(&q->rcu_head, blk_free_queue_rcu);
> >  }
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index ae3fcb4199e9..d7173cb1ea76 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -1620,8 +1620,8 @@ out:
> >
> >  static void loop_remove(struct loop_device *lo)
> >  {
> > -       del_gendisk(lo->lo_disk);
> >         blk_cleanup_queue(lo->lo_queue);
> > +       del_gendisk(lo->lo_disk);
> >         blk_mq_free_tag_set(&lo->tag_set);
> >         put_disk(lo->lo_disk);
> >         kfree(lo);
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index d4f31e195e26..593a02476c78 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -4818,12 +4818,12 @@ static void md_free(struct kobject *ko)
> >         if (mddev->sysfs_state)
> >                 sysfs_put(mddev->sysfs_state);
> >
> > +       if (mddev->queue)
> > +               blk_cleanup_queue(mddev->queue);
> >         if (mddev->gendisk) {
> >                 del_gendisk(mddev->gendisk);
> >                 put_disk(mddev->gendisk);
> >         }
> > -       if (mddev->queue)
> > -               blk_cleanup_queue(mddev->queue);
> >
> >         kfree(mddev);
> >  }
> 
> I've taken this patch into consideration relative to DM, please see:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=5dac86fb79697e93297edb6a316b429236d00137
> 
> Point is in my private snitzer/wip branch DM now calls
> blk_cleanup_queue() before del_gendisk().
> 
> With your patch:
> 1) blk_cleanup_queue -> bdi_destroy -> bdi->dev = NULL;
> 2) del_gendisk -> bdi_unregister -> WARN_ON_ONCE(!bdi->dev)
> 
> So, in testing with DM I'm hitting bdi_unregister's WARN_ON(), are you
> seeing this WARN_ON?

Hmmm.  Yes I am.  I wonder how I missed that.  Thanks!

As bdi_set_min_ratio doesn't touch bdi->dev, there seems to be no need for
the test, or the warning.

I wonder if it would make sense to move the bdi_set_min_ratio() call to
bdi_destroy, and discard bdi_unregister??
There is a comment which suggests bdi_unregister might be of use later, but
it might be best to have a clean slate in which to add whatever might be
needed??

NeilBrown
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Comments

Christoph Hellwig April 29, 2015, 1:35 p.m. UTC | #1
On Wed, Apr 29, 2015 at 07:25:30AM +1000, NeilBrown wrote:
> As bdi_set_min_ratio doesn't touch bdi->dev, there seems to be no need for
> the test, or the warning.
> 
> I wonder if it would make sense to move the bdi_set_min_ratio() call to
> bdi_destroy, and discard bdi_unregister??
> There is a comment which suggests bdi_unregister might be of use later, but
> it might be best to have a clean slate in which to add whatever might be
> needed??

This seems fine to me from the block dev point of view.  I don't really
understand the bdi_min_ratio logic, but Peter might have a better idea.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Peter Zijlstra April 29, 2015, 4:02 p.m. UTC | #2
On Wed, Apr 29, 2015 at 03:35:12PM +0200, Christoph Hellwig wrote:
> On Wed, Apr 29, 2015 at 07:25:30AM +1000, NeilBrown wrote:
> > As bdi_set_min_ratio doesn't touch bdi->dev, there seems to be no need for
> > the test, or the warning.
> > 
> > I wonder if it would make sense to move the bdi_set_min_ratio() call to
> > bdi_destroy, and discard bdi_unregister??
> > There is a comment which suggests bdi_unregister might be of use later, but
> > it might be best to have a clean slate in which to add whatever might be
> > needed??
> 
> This seems fine to me from the block dev point of view.  I don't really
> understand the bdi_min_ratio logic, but Peter might have a better idea.

Ah, that was a bit of digging, I've not looked at that in ages :-)

So if you look at bdi_dirty_limit()'s comment:

 * The bdi's share of dirty limit will be adapting to its throughput and
 * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.

So the min_ratio is a minimum guaranteed fraction of the total
throughput.

Now the problem before commit ccb6108f5b0b ("mm/backing-dev.c: reset bdi
min_ratio in bdi_unregister()") was that since bdi_set_min_ratio()
keeps a global sum of bdi->min_ratio, you need to subtract from said
global sum when taking the BDI away. Otherwise we loose/leak a fraction
of the total throughput available (to the other BDIs).

Which is what that bdi_set_min_ratio(bdi, 0) in unregister does. It
resets the min_ratio for the bdi being taken out and frees up the min
allocated bandwidth for the others.

So I think moving that do destroy would be fine; assuming the delay
between unregister and destroy is typically 'short'. Because without
that you can 'leak' this min ratio for extended periods which means the
bandwidth is unavailable for other BDIs.

Does that make sense?

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
NeilBrown April 30, 2015, 12:06 a.m. UTC | #3
On Wed, 29 Apr 2015 18:02:58 +0200 Peter Zijlstra <peterz@infradead.org>
wrote:

> On Wed, Apr 29, 2015 at 03:35:12PM +0200, Christoph Hellwig wrote:
> > On Wed, Apr 29, 2015 at 07:25:30AM +1000, NeilBrown wrote:
> > > As bdi_set_min_ratio doesn't touch bdi->dev, there seems to be no need for
> > > the test, or the warning.
> > > 
> > > I wonder if it would make sense to move the bdi_set_min_ratio() call to
> > > bdi_destroy, and discard bdi_unregister??
> > > There is a comment which suggests bdi_unregister might be of use later, but
> > > it might be best to have a clean slate in which to add whatever might be
> > > needed??
> > 
> > This seems fine to me from the block dev point of view.  I don't really
> > understand the bdi_min_ratio logic, but Peter might have a better idea.
> 
> Ah, that was a bit of digging, I've not looked at that in ages :-)
> 
> So if you look at bdi_dirty_limit()'s comment:
> 
>  * The bdi's share of dirty limit will be adapting to its throughput and
>  * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
> 
> So the min_ratio is a minimum guaranteed fraction of the total
> throughput.
> 
> Now the problem before commit ccb6108f5b0b ("mm/backing-dev.c: reset bdi
> min_ratio in bdi_unregister()") was that since bdi_set_min_ratio()
> keeps a global sum of bdi->min_ratio, you need to subtract from said
> global sum when taking the BDI away. Otherwise we loose/leak a fraction
> of the total throughput available (to the other BDIs).
> 
> Which is what that bdi_set_min_ratio(bdi, 0) in unregister does. It
> resets the min_ratio for the bdi being taken out and frees up the min
> allocated bandwidth for the others.
> 
> So I think moving that do destroy would be fine; assuming the delay
> between unregister and destroy is typically 'short'. Because without
> that you can 'leak' this min ratio for extended periods which means the
> bandwidth is unavailable for other BDIs.
> 
> Does that make sense?

Your assessment is almost exactly what I had come up with, so it definitely
makes sense :-)
'destroy' does come very shortly after 'unregister' (and immediately before
'blk_put_queue' which actually frees the struct).  However the driving force
for this patch was a desire to move blk_cleanup_queue(), which calls
'destroy', earlier.  So the net result is that bdi_set_min_ratio will be
called slightly sooner.

Thanks,
NeilBrown
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
diff mbox

Patch

diff --git a/block/genhd.c b/block/genhd.c
index e351fc521053..1d4435478e8a 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -657,7 +657,6 @@  void del_gendisk(struct gendisk *disk)
 	disk->flags &= ~GENHD_FL_UP;
 
 	sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
-	bdi_unregister(&disk->queue->backing_dev_info);
 	blk_unregister_queue(disk);
 	blk_unregister_region(disk_devt(disk), disk->minors);
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index aff923ae8c4b..d87d8eced064 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -116,7 +116,6 @@  __printf(3, 4)
 int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 		const char *fmt, ...);
 int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
-void bdi_unregister(struct backing_dev_info *bdi);
 int __must_check bdi_setup_and_register(struct backing_dev_info *, char *);
 void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
 			enum wb_reason reason);
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 880dd7437172..c178d13d6f4c 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -250,7 +250,6 @@  DEFINE_EVENT(writeback_class, name, \
 DEFINE_WRITEBACK_EVENT(writeback_nowork);
 DEFINE_WRITEBACK_EVENT(writeback_wake_background);
 DEFINE_WRITEBACK_EVENT(writeback_bdi_register);
-DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
 
 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 6dc4580df2af..000e7b3b9896 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -359,23 +359,6 @@  static void bdi_wb_shutdown(struct backing_dev_info *bdi)
 	flush_delayed_work(&bdi->wb.dwork);
 }
 
-/*
- * Called when the device behind @bdi has been removed or ejected.
- *
- * We can't really do much here except for reducing the dirty ratio at
- * the moment.  In the future we should be able to set a flag so that
- * the filesystem can handle errors at mark_inode_dirty time instead
- * of only at writeback time.
- */
-void bdi_unregister(struct backing_dev_info *bdi)
-{
-	if (WARN_ON_ONCE(!bdi->dev))
-		return;
-
-	bdi_set_min_ratio(bdi, 0);
-}
-EXPORT_SYMBOL(bdi_unregister);
-
 static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
 {
 	memset(wb, 0, sizeof(*wb));
@@ -443,6 +426,7 @@  void bdi_destroy(struct backing_dev_info *bdi)
 	int i;
 
 	bdi_wb_shutdown(bdi);
+	bdi_set_min_ratio(bdi, 0);
 
 	WARN_ON(!list_empty(&bdi->work_list));
 	WARN_ON(delayed_work_pending(&bdi->wb.dwork));