diff mbox

[v6,1/2] blktrace: Fix potentail deadlock between delete & sysfs ops

Message ID 1505760831-7747-2-git-send-email-longman@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Waiman Long Sept. 18, 2017, 6:53 p.m. UTC
The lockdep code had reported the following unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(s_active#228);
                               lock(&bdev->bd_mutex/1);
                               lock(s_active#228);
  lock(&bdev->bd_mutex);

 *** DEADLOCK ***

The deadlock may happen when one task (CPU1) is trying to delete a
partition in a block device and another task (CPU0) is accessing
tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that
partition.

The s_active isn't an actual lock. It is a reference count (kn->count)
on the sysfs (kernfs) file. Removal of a sysfs file, however, require
a wait until all the references are gone. The reference count is
treated like a rwsem using lockdep instrumentation code.

The fact that a thread is in the sysfs callback method or in the
ioctl call means there is a reference to the opended sysfs or device
file. That should prevent the underlying block structure from being
removed.

Instead of using bd_mutex in the block_device structure, the other
bd_fsfreeze_mutex mutex in the block_device structure is now overloaded
to protect against concurrent blktrace data access in the blktrace.c
file. There is no point in adding one more mutex to the block_device
structure just for blktrace.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/fs.h      |  2 +-
 kernel/trace/blktrace.c | 26 ++++++++++++++++++++------
 2 files changed, 21 insertions(+), 7 deletions(-)

Comments

Christoph Hellwig Sept. 19, 2017, 12:01 a.m. UTC | #1
Taking a look at this it seems like using a lock in struct block_device
isn't the right thing to do anyway - all the action is on fields in
struct blk_trace, so having a lock inside that would make a lot more
sense.

It would also help to document what exactly we're actually protecting.
Waiman Long Sept. 19, 2017, 12:49 p.m. UTC | #2
On 09/18/2017 08:01 PM, Christoph Hellwig wrote:
> Taking a look at this it seems like using a lock in struct block_device
> isn't the right thing to do anyway - all the action is on fields in
> struct blk_trace, so having a lock inside that would make a lot more
> sense.
>
> It would also help to document what exactly we're actually protecting.

I think I documented in the patch that the lock has to protect changes
in the blktrace structure as well as the allocation and destruction of
it. Because of that, it can't be put inside the blktrace structure. The
original code use the bd_mutex of the block_device structure. I just
change the code to use another bd_fsfreeze_mutex in the same structure.

In an earlier patch version, I used a global blktrace mutex. This was
deemed to be not scalable enough and so I now use the bd_fsfreeze_mutex
instead.

Cheers,
Longman
Christoph Hellwig Sept. 19, 2017, 2:38 p.m. UTC | #3
On Tue, Sep 19, 2017 at 08:49:12AM -0400, Waiman Long wrote:
> On 09/18/2017 08:01 PM, Christoph Hellwig wrote:
> > Taking a look at this it seems like using a lock in struct block_device
> > isn't the right thing to do anyway - all the action is on fields in
> > struct blk_trace, so having a lock inside that would make a lot more
> > sense.
> >
> > It would also help to document what exactly we're actually protecting.
> 
> I think I documented in the patch that the lock has to protect changes
> in the blktrace structure as well as the allocation and destruction of
> it. Because of that, it can't be put inside the blktrace structure. The
> original code use the bd_mutex of the block_device structure. I just
> change the code to use another bd_fsfreeze_mutex in the same structure.

Either way it has absolutely nothing to do with struct block_device,
given that struct blk_trace hangs off the request_queue.

Reusing a mutex just because it happens to live in a structure also
generally is a bad idea if the protected data is in no way related.
Waiman Long Sept. 19, 2017, 3:58 p.m. UTC | #4
On 09/19/2017 10:38 AM, Christoph Hellwig wrote:
> On Tue, Sep 19, 2017 at 08:49:12AM -0400, Waiman Long wrote:
>> On 09/18/2017 08:01 PM, Christoph Hellwig wrote:
>>> Taking a look at this it seems like using a lock in struct block_device
>>> isn't the right thing to do anyway - all the action is on fields in
>>> struct blk_trace, so having a lock inside that would make a lot more
>>> sense.
>>>
>>> It would also help to document what exactly we're actually protecting.
>> I think I documented in the patch that the lock has to protect changes
>> in the blktrace structure as well as the allocation and destruction of
>> it. Because of that, it can't be put inside the blktrace structure. The
>> original code use the bd_mutex of the block_device structure. I just
>> change the code to use another bd_fsfreeze_mutex in the same structure.
> Either way it has absolutely nothing to do with struct block_device,
> given that struct blk_trace hangs off the request_queue.
>
> Reusing a mutex just because it happens to live in a structure also
> generally is a bad idea if the protected data is in no way related.

I was trying not to add a new mutex to a structure just for blktrace as
it is an optional feature that is enabled only if the
CONFIG_BLK_DEV_IO_TRACE config option is defined and it will only need
to be taken occasionally.

As filesystem freeze looks orthogonal to blktrace and the mutex also
looks to be used sparingly, I think it is a good match to overload it to
control blktrace as well.

I could modify the patch to use a mutex in the request_queue structure.
The current sysfs_lock mutex has about 74 references. So I am not
totally sure if it is safe to reuse. So the only option is to add
something like

#ifdef CONFIG_BLK_DEV_IO_TRACE
struct mutex blktrace_mutex;
#endif

to the request_queue structure. That structure is large enough that
adding a mutex won't increase the size much percentage-wise.

I would like to keep the current patch as is as I don't see any problem
with it. However, I can revise the patch as discussed above if you guys
prefer that alternative.

Cheers,
Longman
Christoph Hellwig Sept. 19, 2017, 8:41 p.m. UTC | #5
On Tue, Sep 19, 2017 at 11:58:34AM -0400, Waiman Long wrote:
> I was trying not to add a new mutex to a structure just for blktrace as
> it is an optional feature that is enabled only if the
> CONFIG_BLK_DEV_IO_TRACE config option is defined and it will only need
> to be taken occasionally.

So?  Make the lock optional, too.

> As filesystem freeze looks orthogonal to blktrace and the mutex also
> looks to be used sparingly, I think it is a good match to overload it to
> control blktrace as well.

If the functionally is orthogonal that is a reason not to share a lock,
not to the contrary.

> I could modify the patch to use a mutex in the request_queue structure.
> The current sysfs_lock mutex has about 74 references. So I am not
> totally sure if it is safe to reuse. So the only option is to add
> something like

> 
> #ifdef CONFIG_BLK_DEV_IO_TRACE
> struct mutex blktrace_mutex;
> #endif
> 
> to the request_queue structure. That structure is large enough that
> adding a mutex won't increase the size much percentage-wise.

Call it blk_trace mutex and move it right next to the blk_trace
structure:

ifdef CONFIG_BLK_DEV_IO_TRACE
        struct blk_trace        *blk_trace;
	struct mutex		blk_trace_mutex;
#endif

which makes it completely obvious to any read what you are protecting
with it.
Steven Rostedt Sept. 19, 2017, 9:58 p.m. UTC | #6
On Tue, 19 Sep 2017 13:41:35 -0700
Christoph Hellwig <hch@infradead.org> wrote:

> Call it blk_trace mutex and move it right next to the blk_trace
> structure:
> 
> ifdef CONFIG_BLK_DEV_IO_TRACE
>         struct blk_trace        *blk_trace;
> 	struct mutex		blk_trace_mutex;
> #endif
> 
> which makes it completely obvious to any read what you are protecting
> with it.

As a 1000ft away bystander, this appears to be the most logical
solution.

-- Steve
diff mbox

Patch

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 339e737..330b572 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -448,7 +448,7 @@  struct block_device {
 
 	/* The counter of freeze processes */
 	int			bd_fsfreeze_count;
-	/* Mutex for freeze */
+	/* Mutex for freeze and blktrace */
 	struct mutex		bd_fsfreeze_mutex;
 } __randomize_layout;
 
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 2a685b4..7cd5d1d 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -648,6 +648,20 @@  int blk_trace_startstop(struct request_queue *q, int start)
 }
 EXPORT_SYMBOL_GPL(blk_trace_startstop);
 
+/*
+ * When reading or writing the blktrace sysfs files, the references to the
+ * opened sysfs or device files should prevent the underlying block device
+ * from being removed. So no further delete protection is really needed.
+ *
+ * Protection from multiple readers and writers accessing blktrace data
+ * concurrently is still required. The bd_mutex was used for this purpose.
+ * That could lead to deadlock with concurrent block device deletion and
+ * sysfs access. Instead, the block device bd_fsfreeze_mutex is now
+ * overloaded for blktrace data protection. Like freeze/thaw, blktrace
+ * functionality is not frequently used. There is no point in adding
+ * one more mutex to the block_device structure just for blktrace.
+ */
+
 /**
  * blk_trace_ioctl: - handle the ioctls associated with tracing
  * @bdev:	the block device
@@ -665,7 +679,7 @@  int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
 	if (!q)
 		return -ENXIO;
 
-	mutex_lock(&bdev->bd_mutex);
+	mutex_lock(&bdev->bd_fsfreeze_mutex);
 
 	switch (cmd) {
 	case BLKTRACESETUP:
@@ -691,7 +705,7 @@  int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
 		break;
 	}
 
-	mutex_unlock(&bdev->bd_mutex);
+	mutex_unlock(&bdev->bd_fsfreeze_mutex);
 	return ret;
 }
 
@@ -1727,7 +1741,7 @@  static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
 	if (q == NULL)
 		goto out_bdput;
 
-	mutex_lock(&bdev->bd_mutex);
+	mutex_lock(&bdev->bd_fsfreeze_mutex);
 
 	if (attr == &dev_attr_enable) {
 		ret = sprintf(buf, "%u\n", !!q->blk_trace);
@@ -1746,7 +1760,7 @@  static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
 		ret = sprintf(buf, "%llu\n", q->blk_trace->end_lba);
 
 out_unlock_bdev:
-	mutex_unlock(&bdev->bd_mutex);
+	mutex_unlock(&bdev->bd_fsfreeze_mutex);
 out_bdput:
 	bdput(bdev);
 out:
@@ -1788,7 +1802,7 @@  static ssize_t sysfs_blk_trace_attr_store(struct device *dev,
 	if (q == NULL)
 		goto out_bdput;
 
-	mutex_lock(&bdev->bd_mutex);
+	mutex_lock(&bdev->bd_fsfreeze_mutex);
 
 	if (attr == &dev_attr_enable) {
 		if (value)
@@ -1814,7 +1828,7 @@  static ssize_t sysfs_blk_trace_attr_store(struct device *dev,
 	}
 
 out_unlock_bdev:
-	mutex_unlock(&bdev->bd_mutex);
+	mutex_unlock(&bdev->bd_fsfreeze_mutex);
 out_bdput:
 	bdput(bdev);
 out: