diff mbox series

[4/5] mm/swapfile: refcount block and queue before using blkcg_schedule_throttle()

Message ID 20200414041902.16769-5-mcgrof@kernel.org
State New, archived
Headers show
Series blktrace: fix use after free | expand

Commit Message

Luis Chamberlain April 14, 2020, 4:19 a.m. UTC
block devices are refcounted so to ensure once its final user goes away it
can be cleaned up by the lower layers properly. The block device's
request_queue structure is also refcounted, however, if the last
blk_put_queue() is called under atomic context the block layer has
to defer removal.

By refcounting the block device during the use of blkcg_schedule_throttle(),
we ensure ensure two things:

1) the block device remains available during the call
2) we ensure avoid having to deal with the fact we're using the
   request_queue structure in atomic context, since the last
   blk_put_queue() will be called upon disk_release(), *after*
   our own bdput().

This means this code path is *not* going to remove the request_queue
structure, as we are ensuring some later upper layer disk_release()
will be the one to release the request_queue structure for us.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: yu kuai <yukuai3@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/swapfile.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

Comments

Christoph Hellwig April 14, 2020, 3:44 p.m. UTC | #1
On Tue, Apr 14, 2020 at 04:19:01AM +0000, Luis Chamberlain wrote:
> block devices are refcounted so to ensure once its final user goes away it
> can be cleaned up by the lower layers properly. The block device's
> request_queue structure is also refcounted, however, if the last
> blk_put_queue() is called under atomic context the block layer has
> to defer removal.
> 
> By refcounting the block device during the use of blkcg_schedule_throttle(),
> we ensure ensure two things:
> 
> 1) the block device remains available during the call
> 2) we ensure avoid having to deal with the fact we're using the
>    request_queue structure in atomic context, since the last
>    blk_put_queue() will be called upon disk_release(), *after*
>    our own bdput().
> 
> This means this code path is *not* going to remove the request_queue
> structure, as we are ensuring some later upper layer disk_release()
> will be the one to release the request_queue structure for us.
> 
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Omar Sandoval <osandov@fb.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Nicolai Stange <nstange@suse.de>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: yu kuai <yukuai3@huawei.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  mm/swapfile.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 6659ab563448..9285ff6030ca 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3753,6 +3753,7 @@ static void free_swap_count_continuations(struct swap_info_struct *si)
>  void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
>  				  gfp_t gfp_mask)
>  {
> +	struct block_device *bdev;
>  	struct swap_info_struct *si, *next;
>  	if (!(gfp_mask & __GFP_IO) || !memcg)
>  		return;
> @@ -3771,8 +3772,17 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
>  	plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
>  				  avail_lists[node]) {
>  		if (si->bdev) {
> -			blkcg_schedule_throttle(bdev_get_queue(si->bdev),
> -						true);
> +			bdev = bdgrab(si->bdev);
> +			if (!bdev)
> +				continue;
> +			/*
> +			 * By adding our own bdgrab() we ensure the queue
> +			 * sticks around until disk_release(), and so we ensure
> +			 * our release of the request_queue does not happen in
> +			 * atomic context.
> +			 */
> +			blkcg_schedule_throttle(bdev_get_queue(bdev), true);
> +			bdput(bdev);

I don't understand the atomic part of the comment.  How does
bdgrab/bdput help us there?
Luis Chamberlain April 15, 2020, 5:42 a.m. UTC | #2
On Tue, Apr 14, 2020 at 08:44:47AM -0700, Christoph Hellwig wrote:
> On Tue, Apr 14, 2020 at 04:19:01AM +0000, Luis Chamberlain wrote:
> > block devices are refcounted so to ensure once its final user goes away it
> > can be cleaned up by the lower layers properly. The block device's
> > request_queue structure is also refcounted, however, if the last
> > blk_put_queue() is called under atomic context the block layer has
> > to defer removal.
> > 
> > By refcounting the block device during the use of blkcg_schedule_throttle(),
> > we ensure ensure two things:
> > 
> > 1) the block device remains available during the call
> > 2) we ensure avoid having to deal with the fact we're using the
> >    request_queue structure in atomic context, since the last
> >    blk_put_queue() will be called upon disk_release(), *after*
> >    our own bdput().
> > 
> > This means this code path is *not* going to remove the request_queue
> > structure, as we are ensuring some later upper layer disk_release()
> > will be the one to release the request_queue structure for us.
> > 
> > Cc: Bart Van Assche <bvanassche@acm.org>
> > Cc: Omar Sandoval <osandov@fb.com>
> > Cc: Hannes Reinecke <hare@suse.com>
> > Cc: Nicolai Stange <nstange@suse.de>
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: yu kuai <yukuai3@huawei.com>
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > ---
> >  mm/swapfile.c | 14 ++++++++++++--
> >  1 file changed, 12 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 6659ab563448..9285ff6030ca 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -3753,6 +3753,7 @@ static void free_swap_count_continuations(struct swap_info_struct *si)
> >  void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
> >  				  gfp_t gfp_mask)
> >  {
> > +	struct block_device *bdev;
> >  	struct swap_info_struct *si, *next;
> >  	if (!(gfp_mask & __GFP_IO) || !memcg)
> >  		return;
> > @@ -3771,8 +3772,17 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
> >  	plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
> >  				  avail_lists[node]) {
> >  		if (si->bdev) {
> > -			blkcg_schedule_throttle(bdev_get_queue(si->bdev),
> > -						true);
> > +			bdev = bdgrab(si->bdev);
> > +			if (!bdev)
> > +				continue;
> > +			/*
> > +			 * By adding our own bdgrab() we ensure the queue
> > +			 * sticks around until disk_release(), and so we ensure
> > +			 * our release of the request_queue does not happen in
> > +			 * atomic context.
> > +			 */
> > +			blkcg_schedule_throttle(bdev_get_queue(bdev), true);
> > +			bdput(bdev);
> 
> I don't understand the atomic part of the comment.  How does
> bdgrab/bdput help us there?

The commit log above did a better job at explaining this in terms of our
goal to use the request_queue and how this use would prevent the risk of
releasing the request_queue, which could sleep.

If its not clear still, at least why we'd escape the sleep potential
of the request_queue, we can just see its up to the disk_release()
to call the last blk_put_queue():

static void __device_add_disk(struct device *parent, struct gendisk disk,      
			      const struct attribute_group **groups,            
			      bool register_queue)                              
{   
	...
        /*                                                                      
	 * Take an extra ref on queue which will be put on disk_release()
	 * so that it sticks around as long as @disk is there.                  
	 */                                                                     
	WARN_ON_ONCE(!blk_get_queue(disk->queue));

	disk_add_events(disk);
	blk_integrity_add(disk);
}

static void disk_release(struct device *dev)                                    
{                                                                               
	struct gendisk *disk = dev_to_disk(dev);

	blk_free_devt(dev->devt);
	disk_release_events(disk);
	kfree(disk->random);
	disk_replace_part_tbl(disk, NULL);
	hd_free_part(&disk->part0);
	if (disk->queue)
		blk_put_queue(disk->queue);
	kfree(disk);
}     

I admit that all this however it did a poor job at explaining why
bdgrab()/bdput() was safe in atomic context other than the implicit
reasoning that we already do that elsewhere in atomic context.

bdgrab() specifically was added to be able to refcount a block device in
atomic context via commit dddac6a7b445 ("("PM / Hibernate: Replace bdget
call with simple atomic_inc of i_count"). In its latest incarnation we
have:

/**
 * bdgrab -- Grab a reference to an already referenced block device
 * @bdev:       Block device to grab a reference to.                            
 */
struct block_device *bdgrab(struct block_device *bdev)
{
	ihold(bdev->bd_inode);                                                  
	return bdev;                                                            
}                                                                               
EXPORT_SYMBOL(bdgrab); 

And this in turn:

/*                                                                              
 * get additional reference to inode; caller must already hold one.
 */
void ihold(struct inode *inode)                                                 
{                                                                               
	WARN_ON(atomic_inc_return(&inode->i_count) < 2);                        
}                                                                               
EXPORT_SYMBOL(ihold);

However... I'd eventure to say we don't have tribal knowledge documented
about why bdput() is safe in atomic context when used with bdgrab(),
including the commit log which added it. So the only thing backing its
safety is that we already use this combo in atomic context, and if its
incorrect, other areas would be incorrect as well.

But looking underneath the hood, I see at the end of __blkdev_get():

static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)  
{
	...
	disk = bdev_get_gendisk(bdev, &partno);
	...
        /* only one opener holds refs to the module and disk */                 
	if (!first_open)                                                        
		put_disk_and_module(disk); 
	...
}

So ihold() seems like a way to ensure the caller of bdgrab() won't be
this first opener. If only one opener holds the ref to the disk and it
was not us, we musn't be the one to decrease it, and if the disk is
held, it should mean the block device should be refcounted by it as
well. More review on this later part is appreciated though.

  Luis
Christoph Hellwig April 15, 2020, 7:27 a.m. UTC | #3
On Wed, Apr 15, 2020 at 05:42:34AM +0000, Luis Chamberlain wrote:
> > I don't understand the atomic part of the comment.  How does
> > bdgrab/bdput help us there?
> 
> The commit log above did a better job at explaining this in terms of our
> goal to use the request_queue and how this use would prevent the risk of
> releasing the request_queue, which could sleep.

So bdput eventually does and iput, but what leads to an out of context
offload?

But anyway, isn't the original problem better solved by simply not
releasing the queue from atomic context to start with?  There isn't
really any good reason we keep holding the spinlock once we have a
reference on the queue, so something like this (not even compile tested)
should do the work:

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index c5dc833212e1..45faa851f789 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1673,6 +1673,17 @@ void blkcg_maybe_throttle_current(void)
 	blk_put_queue(q);
 }
 
+/* consumes a reference on q */
+void __blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay)
+{
+	if (current->throttle_queue)
+		blk_put_queue(current->throttle_queue);
+	current->throttle_queue = q;
+	if (use_memdelay)
+		current->use_memdelay = use_memdelay;
+	set_notify_resume(current);
+}
+
 /**
  * blkcg_schedule_throttle - this task needs to check for throttling
  * @q: the request queue IO was submitted on
@@ -1694,16 +1705,8 @@ void blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay)
 {
 	if (unlikely(current->flags & PF_KTHREAD))
 		return;
-
-	if (!blk_get_queue(q))
-		return;
-
-	if (current->throttle_queue)
-		blk_put_queue(current->throttle_queue);
-	current->throttle_queue = q;
-	if (use_memdelay)
-		current->use_memdelay = use_memdelay;
-	set_notify_resume(current);
+	if (blk_get_queue(q))
+		__blkcg_schedule_throttle(q, use_memdelay);
 }
 
 /**
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 35f8ffe92b70..68440cb3ea9e 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -679,6 +679,7 @@ static inline void blkcg_clear_delay(struct blkcg_gq *blkg)
 
 void blkcg_add_delay(struct blkcg_gq *blkg, u64 now, u64 delta);
 void blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay);
+void __blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay);
 void blkcg_maybe_throttle_current(void);
 #else	/* CONFIG_BLK_CGROUP */
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5871a2aa86a5..4c6aa59ee593 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3749,9 +3749,10 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
 				  gfp_t gfp_mask)
 {
 	struct swap_info_struct *si, *next;
+	struct request_queue *q = NULL;
+
 	if (!(gfp_mask & __GFP_IO) || !memcg)
 		return;
-
 	if (!blk_cgroup_congested())
 		return;
 
@@ -3761,17 +3762,21 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
 	 */
 	if (current->throttle_queue)
 		return;
+	if (unlikely(current->flags & PF_KTHREAD))
+		return;
 
 	spin_lock(&swap_avail_lock);
 	plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
 				  avail_lists[node]) {
 		if (si->bdev) {
-			blkcg_schedule_throttle(bdev_get_queue(si->bdev),
-						true);
+			if (blk_get_queue(dev_get_queue(si->bdev)))
+				q = dev_get_queue(si->bdev);
 			break;
 		}
 	}
 	spin_unlock(&swap_avail_lock);
+	if (q)
+		__blkcg_schedule_throttle(q, true);
 }
 #endif
Christoph Hellwig April 15, 2020, 7:34 a.m. UTC | #4
On Wed, Apr 15, 2020 at 12:27:12AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 15, 2020 at 05:42:34AM +0000, Luis Chamberlain wrote:
> > > I don't understand the atomic part of the comment.  How does
> > > bdgrab/bdput help us there?
> > 
> > The commit log above did a better job at explaining this in terms of our
> > goal to use the request_queue and how this use would prevent the risk of
> > releasing the request_queue, which could sleep.
> 
> So bdput eventually does and iput, but what leads to an out of context
> offload?
> 
> But anyway, isn't the original problem better solved by simply not
> releasing the queue from atomic context to start with?  There isn't
> really any good reason we keep holding the spinlock once we have a
> reference on the queue, so something like this (not even compile tested)
> should do the work:

Actually - mem_cgroup_throttle_swaprate already checks for a non-NULL
current->throttle_queue above, so we should never even call
blk_put_queue here.  Was this found by code inspection, or was there
a real report?

In the latter case we need to figure out what protects >throttle_queue,
as the way blkcg_schedule_throttle first put the reference and only then
assign a value to it already looks like it introduces a tiny race
window.

Otherwise just open coding the applicable part ofblkcg_schedule_throttle
in mem_cgroup_throttle_swaprate seems easiest:

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5871a2aa86a5..e16051ef074c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3761,15 +3761,20 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
 	 */
 	if (current->throttle_queue)
 		return;
+	if (unlikely(current->flags & PF_KTHREAD))
+		return;
 
 	spin_lock(&swap_avail_lock);
 	plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
 				  avail_lists[node]) {
-		if (si->bdev) {
-			blkcg_schedule_throttle(bdev_get_queue(si->bdev),
-						true);
-			break;
+		if (!si->bdev)
+			continue;
+		if (blk_get_queue(dev_get_queue(si->bdev))) {
+			current->throttle_queue = dev_get_queue(si->bdev);
+			current->use_memdelay = true;
+			set_notify_resume(current);
 		}
+		break;
 	}
 	spin_unlock(&swap_avail_lock);
 }
Luis Chamberlain April 15, 2020, 1:19 p.m. UTC | #5
On Wed, Apr 15, 2020 at 12:34:43AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 15, 2020 at 12:27:12AM -0700, Christoph Hellwig wrote:
> > On Wed, Apr 15, 2020 at 05:42:34AM +0000, Luis Chamberlain wrote:
> > > > I don't understand the atomic part of the comment.  How does
> > > > bdgrab/bdput help us there?
> > > 
> > > The commit log above did a better job at explaining this in terms of our
> > > goal to use the request_queue and how this use would prevent the risk of
> > > releasing the request_queue, which could sleep.
> > 
> > So bdput eventually does and iput, but what leads to an out of context
> > offload?
> > 
> > But anyway, isn't the original problem better solved by simply not
> > releasing the queue from atomic context to start with?  There isn't
> > really any good reason we keep holding the spinlock once we have a
> > reference on the queue, so something like this (not even compile tested)
> > should do the work:
> 
> Actually - mem_cgroup_throttle_swaprate already checks for a non-NULL
> current->throttle_queue above, so we should never even call
> blk_put_queue here.  Was this found by code inspection, or was there
> a real report?

No but report, this code path came up as a possible use of
blk_put_queue() which already exists in atomic context. So yes, it
already uses blk_put_queue(), but how are we *sure* its not going to be
the last one? Because if it is that mean the release will happen in
atomic context, defeating the goal of the last patch in this series.

Using bdgrab() however ensures that during the lifecycle of this path,
blk_put_queue() won't be the last if used after, but instead we are
sure it will be upon disk release.

In fact, with bdgrab() isn't the blk_get_queue() on
blkcg_schedule_throttle() no longer needed?

> In the latter case we need to figure out what protects >throttle_queue,
> as the way blkcg_schedule_throttle first put the reference and only then
> assign a value to it already looks like it introduces a tiny race
> window.
> 
> Otherwise just open coding the applicable part of blkcg_schedule_throttle
> in mem_cgroup_throttle_swaprate seems easiest:
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 5871a2aa86a5..e16051ef074c 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3761,15 +3761,20 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
>  	 */
>  	if (current->throttle_queue)
>  		return;
> +	if (unlikely(current->flags & PF_KTHREAD))
> +		return;
>  
>  	spin_lock(&swap_avail_lock);
>  	plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
>  				  avail_lists[node]) {
> -		if (si->bdev) {
> -			blkcg_schedule_throttle(bdev_get_queue(si->bdev),
> -						true);
> -			break;
> +		if (!si->bdev)
> +			continue;
> +		if (blk_get_queue(dev_get_queue(si->bdev))) {
> +			current->throttle_queue = dev_get_queue(si->bdev);
> +			current->use_memdelay = true;
> +			set_notify_resume(current);
>  		}
> +		break;
>  	}
>  	spin_unlock(&swap_avail_lock);
>  }

Sorry, its not clear to me  who calls the respective blk_put_queue()
here?

  Luis
Christoph Hellwig April 16, 2020, 6:10 a.m. UTC | #6
On Wed, Apr 15, 2020 at 01:19:15PM +0000, Luis Chamberlain wrote:
> >  	if (current->throttle_queue)
> >  		return;
> > +	if (unlikely(current->flags & PF_KTHREAD))
> > +		return;
> >  
> >  	spin_lock(&swap_avail_lock);
> >  	plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
> >  				  avail_lists[node]) {
> > -		if (si->bdev) {
> > -			blkcg_schedule_throttle(bdev_get_queue(si->bdev),
> > -						true);
> > -			break;
> > +		if (!si->bdev)
> > +			continue;
> > +		if (blk_get_queue(dev_get_queue(si->bdev))) {
> > +			current->throttle_queue = dev_get_queue(si->bdev);
> > +			current->use_memdelay = true;
> > +			set_notify_resume(current);
> >  		}
> > +		break;
> >  	}
> >  	spin_unlock(&swap_avail_lock);
> >  }
> 
> Sorry, its not clear to me  who calls the respective blk_put_queue()
> here?

If you look at blkcg_schedule_throttle, it only puts the queue that
was in current->throttle_queue.  But mem_cgroup_throttle_swaprate
exits early when current->throttle_queue is non-zero (first two lines
quote above).  So when called from mem_cgroup_throttle_swaprate,
blkcg_schedule_throttle should never actually put a queue.  Open
coding the few relevant lines from blkcg_schedule_throttle in
mem_cgroup_throttle_swaprate makes that obvious.
Ming Lei April 16, 2020, 6:22 a.m. UTC | #7
On Tue, Apr 14, 2020 at 04:19:01AM +0000, Luis Chamberlain wrote:
> block devices are refcounted so to ensure once its final user goes away it
> can be cleaned up by the lower layers properly. The block device's
> request_queue structure is also refcounted, however, if the last
> blk_put_queue() is called under atomic context the block layer has
> to defer removal.
> 
> By refcounting the block device during the use of blkcg_schedule_throttle(),
> we ensure ensure two things:
> 
> 1) the block device remains available during the call
> 2) we ensure avoid having to deal with the fact we're using the
>    request_queue structure in atomic context, since the last
>    blk_put_queue() will be called upon disk_release(), *after*
>    our own bdput().
> 
> This means this code path is *not* going to remove the request_queue
> structure, as we are ensuring some later upper layer disk_release()
> will be the one to release the request_queue structure for us.
> 
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Omar Sandoval <osandov@fb.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Nicolai Stange <nstange@suse.de>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: yu kuai <yukuai3@huawei.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  mm/swapfile.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 6659ab563448..9285ff6030ca 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3753,6 +3753,7 @@ static void free_swap_count_continuations(struct swap_info_struct *si)
>  void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
>  				  gfp_t gfp_mask)
>  {
> +	struct block_device *bdev;
>  	struct swap_info_struct *si, *next;
>  	if (!(gfp_mask & __GFP_IO) || !memcg)
>  		return;
> @@ -3771,8 +3772,17 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
>  	plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
>  				  avail_lists[node]) {
>  		if (si->bdev) {
> -			blkcg_schedule_throttle(bdev_get_queue(si->bdev),
> -						true);
> +			bdev = bdgrab(si->bdev);

When swapon, the block_device has been opened in claim_swapfile(),
so no need to worry about the queue being gone here.


Thanks,
Ming
Luis Chamberlain April 16, 2020, 6:25 a.m. UTC | #8
On Thu, Apr 16, 2020 at 02:22:22PM +0800, Ming Lei wrote:
> On Tue, Apr 14, 2020 at 04:19:01AM +0000, Luis Chamberlain wrote:
> > block devices are refcounted so to ensure once its final user goes away it
> > can be cleaned up by the lower layers properly. The block device's
> > request_queue structure is also refcounted, however, if the last
> > blk_put_queue() is called under atomic context the block layer has
> > to defer removal.
> > 
> > By refcounting the block device during the use of blkcg_schedule_throttle(),
> > we ensure ensure two things:
> > 
> > 1) the block device remains available during the call
> > 2) we ensure avoid having to deal with the fact we're using the
> >    request_queue structure in atomic context, since the last
> >    blk_put_queue() will be called upon disk_release(), *after*
> >    our own bdput().
> > 
> > This means this code path is *not* going to remove the request_queue
> > structure, as we are ensuring some later upper layer disk_release()
> > will be the one to release the request_queue structure for us.
> > 
> > Cc: Bart Van Assche <bvanassche@acm.org>
> > Cc: Omar Sandoval <osandov@fb.com>
> > Cc: Hannes Reinecke <hare@suse.com>
> > Cc: Nicolai Stange <nstange@suse.de>
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: yu kuai <yukuai3@huawei.com>
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > ---
> >  mm/swapfile.c | 14 ++++++++++++--
> >  1 file changed, 12 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 6659ab563448..9285ff6030ca 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -3753,6 +3753,7 @@ static void free_swap_count_continuations(struct swap_info_struct *si)
> >  void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
> >  				  gfp_t gfp_mask)
> >  {
> > +	struct block_device *bdev;
> >  	struct swap_info_struct *si, *next;
> >  	if (!(gfp_mask & __GFP_IO) || !memcg)
> >  		return;
> > @@ -3771,8 +3772,17 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
> >  	plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
> >  				  avail_lists[node]) {
> >  		if (si->bdev) {
> > -			blkcg_schedule_throttle(bdev_get_queue(si->bdev),
> > -						true);
> > +			bdev = bdgrab(si->bdev);
> 
> When swapon, the block_device has been opened in claim_swapfile(),
> so no need to worry about the queue being gone here.

Thanks, so why bdev_get_queue() before?

   Luis
Ming Lei April 16, 2020, 6:34 a.m. UTC | #9
On Thu, Apr 16, 2020 at 06:25:32AM +0000, Luis Chamberlain wrote:
> On Thu, Apr 16, 2020 at 02:22:22PM +0800, Ming Lei wrote:
> > On Tue, Apr 14, 2020 at 04:19:01AM +0000, Luis Chamberlain wrote:
> > > block devices are refcounted so to ensure once its final user goes away it
> > > can be cleaned up by the lower layers properly. The block device's
> > > request_queue structure is also refcounted, however, if the last
> > > blk_put_queue() is called under atomic context the block layer has
> > > to defer removal.
> > > 
> > > By refcounting the block device during the use of blkcg_schedule_throttle(),
> > > we ensure ensure two things:
> > > 
> > > 1) the block device remains available during the call
> > > 2) we ensure avoid having to deal with the fact we're using the
> > >    request_queue structure in atomic context, since the last
> > >    blk_put_queue() will be called upon disk_release(), *after*
> > >    our own bdput().
> > > 
> > > This means this code path is *not* going to remove the request_queue
> > > structure, as we are ensuring some later upper layer disk_release()
> > > will be the one to release the request_queue structure for us.
> > > 
> > > Cc: Bart Van Assche <bvanassche@acm.org>
> > > Cc: Omar Sandoval <osandov@fb.com>
> > > Cc: Hannes Reinecke <hare@suse.com>
> > > Cc: Nicolai Stange <nstange@suse.de>
> > > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > > Cc: Michal Hocko <mhocko@kernel.org>
> > > Cc: yu kuai <yukuai3@huawei.com>
> > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > > ---
> > >  mm/swapfile.c | 14 ++++++++++++--
> > >  1 file changed, 12 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 6659ab563448..9285ff6030ca 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -3753,6 +3753,7 @@ static void free_swap_count_continuations(struct swap_info_struct *si)
> > >  void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
> > >  				  gfp_t gfp_mask)
> > >  {
> > > +	struct block_device *bdev;
> > >  	struct swap_info_struct *si, *next;
> > >  	if (!(gfp_mask & __GFP_IO) || !memcg)
> > >  		return;
> > > @@ -3771,8 +3772,17 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
> > >  	plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
> > >  				  avail_lists[node]) {
> > >  		if (si->bdev) {
> > > -			blkcg_schedule_throttle(bdev_get_queue(si->bdev),
> > > -						true);
> > > +			bdev = bdgrab(si->bdev);
> > 
> > When swapon, the block_device has been opened in claim_swapfile(),
> > so no need to worry about the queue being gone here.
> 
> Thanks, so why bdev_get_queue() before?

bdev_get_queue() returns the request queue associated with the
the block device, and it is just that blkcg_schedule_throttle() needs
it.

Maybe I misunderstood your question, if yes, please explain it in
a bit detail.

Thanks,
Ming
diff mbox series

Patch

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6659ab563448..9285ff6030ca 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3753,6 +3753,7 @@  static void free_swap_count_continuations(struct swap_info_struct *si)
 void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
 				  gfp_t gfp_mask)
 {
+	struct block_device *bdev;
 	struct swap_info_struct *si, *next;
 	if (!(gfp_mask & __GFP_IO) || !memcg)
 		return;
@@ -3771,8 +3772,17 @@  void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
 	plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
 				  avail_lists[node]) {
 		if (si->bdev) {
-			blkcg_schedule_throttle(bdev_get_queue(si->bdev),
-						true);
+			bdev = bdgrab(si->bdev);
+			if (!bdev)
+				continue;
+			/*
+			 * By adding our own bdgrab() we ensure the queue
+			 * sticks around until disk_release(), and so we ensure
+			 * our release of the request_queue does not happen in
+			 * atomic context.
+			 */
+			blkcg_schedule_throttle(bdev_get_queue(bdev), true);
+			bdput(bdev);
 			break;
 		}
 	}