diff mbox series

[RFC,2/3] blktrace: fix debugfs use after free

Message ID 20200402000002.7442-3-mcgrof@kernel.org
State New, archived
Headers show
Series block: address blktrace use-after-free | expand

Commit Message

Luis Chamberlain April 2, 2020, midnight UTC
On commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
Omar fixed the original blktrace code for multiqueue use. This however
left in place a possible crash, if you happen to abuse blktrace in a way
it was not intended.

Namely, if you loop adding a device, setup the blktrace with BLKTRACESETUP,
forget to BLKTRACETEARDOWN, and then just remove the device you end up
with a panic:

[  107.193134] debugfs: Directory 'loop0' with parent 'block' already present!
[  107.254615] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[  107.258785] #PF: supervisor write access in kernel mode
[  107.262035] #PF: error_code(0x0002) - not-present page
[  107.264106] PGD 0 P4D 0
[  107.264404] Oops: 0002 [#1] SMP NOPTI
[  107.264803] CPU: 8 PID: 674 Comm: kworker/8:2 Tainted: G            E 5.6.0-rc7-next-20200327 #1
[  107.265712] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[  107.266553] Workqueue: events __blk_release_queue
[  107.267051] RIP: 0010:down_write+0x15/0x40
[  107.267488] Code: eb ca e8 ee a5 8d ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 55 00 75 0f  65 48 8b 04 25 c0 8b 01 00 48 89 45 08 5d
[  107.269300] RSP: 0018:ffff9927c06efda8 EFLAGS: 00010246
[  107.269841] RAX: 0000000000000000 RBX: ffff8be7e73b0600 RCX: ffffff8100000000
[  107.270559] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
[  107.271281] RBP: 00000000000000a0 R08: ffff8be7ebc80fa8 R09: ffff8be7ebc80fa8
[  107.272001] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  107.272722] R13: ffff8be7efc30400 R14: ffff8be7e0571200 R15: 00000000000000a0
[  107.273475] FS:  0000000000000000(0000) GS:ffff8be7efc00000(0000) knlGS:0000000000000000
[  107.274346] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  107.274968] CR2: 00000000000000a0 CR3: 000000042abee003 CR4: 0000000000360ee0
[  107.275710] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  107.276465] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  107.277214] Call Trace:
[  107.277532]  simple_recursive_removal+0x4e/0x2e0
[  107.278049]  ? debugfs_remove+0x60/0x60
[  107.278493]  debugfs_remove+0x40/0x60
[  107.278922]  blk_trace_free+0xd/0x50
[  107.279339]  __blk_trace_remove+0x27/0x40
[  107.279797]  blk_trace_shutdown+0x30/0x40
[  107.280256]  __blk_release_queue+0xab/0x110
[  107.280734]  process_one_work+0x1b4/0x380
[  107.281194]  worker_thread+0x50/0x3c0
[  107.281622]  kthread+0xf9/0x130
[  107.281994]  ? process_one_work+0x380/0x380
[  107.282467]  ? kthread_park+0x90/0x90
[  107.282895]  ret_from_fork+0x1f/0x40
[  107.283316] Modules linked in: loop(E) <etc>
[  107.288562] CR2: 00000000000000a0
[  107.288957] ---[ end trace b885d243d441bbce ]---

This splat happens to be very similar to the one reported via
kernel.org korg#205713, only that korg#205713 was for v4.19.83
and the above now includes the simple_recursive_removal() introduced
via commit a3d1e7eb5abe ("simple_recursive_removal(): kernel-side rm
-rf for ramfs-style filesystems") merged on v5.6.

korg#205713 then was used to create CVE-2019-19770 and claims that
the bug is in a use-after-free in the debugfs core code. The
implications of this being a generic UAF on debugfs would be
much more severe, as it would imply parent dentries can sometimes
not be possitive, which is something claim is not possible.

It turns out that the issue actually is a mis-use of debugfs for
the multiqueue case, and the fragile nature of how we free the
directory used to keep track of blktrace debugfs files. Omar's
commit assumed the parent directory would be kept with
debugfs_lookup() but this is not the case, only the dentry is
kept around. We also special-case a solution for multiqueue
given that for multiqueue code we always instantiate the debugfs
directory for the request queue. We were leaving it only to chance,
if someone happens to use blktrace, on single queue block devices
for the respective debugfs directory be created.

We can fix the UAF by simply using a debugfs directory which is
always created for singlequeue and multiqueue block devices. This
simplifies the code considerably, with the only penalty now being
that we're always creating the request queue directory debugfs
directory for the block device on singlequeue block devices.

The UAF then is not a core debugfs issue, but instead a mis-use of
debugfs, and this issue can only be triggered if you are root, and
mis-use blktrace.

This issue can be reproduced with break-blktrace [2] using:

  break-blktrace -c 10 -d

This patch fixes this issue. Note that there is also another
respective UAF but from the ioctl path [3], this should also fix
that issue.

This patch then also contends the severity of CVE-2019-19770 as
this issue is only possible using root to shoot yourself in the
foot by also misuing blktrace.

[0] https://bugzilla.kernel.org/show_bug.cgi?id=205713
[1] https://nvd.nist.gov/vuln/detail/CVE-2019-19770
[2] https://github.com/mcgrof/break-blktrace
[3] https://lore.kernel.org/lkml/000000000000ec635b059f752700@google.com/

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Reported-by: syzbot+603294af2d01acfdd6da@syzkaller.appspotmail.com
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/blk-debugfs.c          | 12 ++++++++++++
 block/blk-mq-debugfs.c       |  5 -----
 block/blk-sysfs.c            |  3 +++
 block/blk.h                  | 10 ++++++++++
 include/linux/blktrace_api.h |  1 -
 kernel/trace/blktrace.c      | 19 ++++++++-----------
 6 files changed, 33 insertions(+), 17 deletions(-)

Comments

Eric Sandeen April 2, 2020, 1:57 a.m. UTC | #1
On 4/1/20 7:00 PM, Luis Chamberlain wrote:
> On commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
> Omar fixed the original blktrace code for multiqueue use. This however
> left in place a possible crash, if you happen to abuse blktrace in a way
> it was not intended.
> 
> Namely, if you loop adding a device, setup the blktrace with BLKTRACESETUP,
> forget to BLKTRACETEARDOWN, and then just remove the device you end up
> with a panic:

Weird, I swear I tested that and didn't hit it, but ...
 

> This issue can be reproduced with break-blktrace [2] using:
> 
>   break-blktrace -c 10 -d

+ -s, right?
 
> This patch fixes this issue. Note that there is also another
> respective UAF but from the ioctl path [3], this should also fix
> that issue.
> 
> This patch then also contends the severity of CVE-2019-19770 as
> this issue is only possible using root to shoot yourself in the
> foot by also misuing blktrace.
> 
> [0] https://bugzilla.kernel.org/show_bug.cgi?id=205713
> [1] https://nvd.nist.gov/vuln/detail/CVE-2019-19770
> [2] https://github.com/mcgrof/break-blktrace

I verified that this does reproduce the exact same KASAN splat on
kernel 4.19.83 as reported in the original bug, thanks!

-Eric
Luis Chamberlain April 2, 2020, 4:14 p.m. UTC | #2
On Wed, Apr 01, 2020 at 08:57:42PM -0500, Eric Sandeen wrote:
> On 4/1/20 7:00 PM, Luis Chamberlain wrote:
> > On commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
> > Omar fixed the original blktrace code for multiqueue use. This however
> > left in place a possible crash, if you happen to abuse blktrace in a way
> > it was not intended.
> > 
> > Namely, if you loop adding a device, setup the blktrace with BLKTRACESETUP,
> > forget to BLKTRACETEARDOWN, and then just remove the device you end up
> > with a panic:
> 
> Weird, I swear I tested that and didn't hit it, but ...

The real issue was also the dangling block device, a loop device proves
easy to test that. I'll note that another way to test this as well would
be to have a virtual machine with a block devices that goes in and out
via whatever virsh shenanigans to make a block device appear / disappear.

> > This issue can be reproduced with break-blktrace [2] using:
> > 
> >   break-blktrace -c 10 -d
> 
> + -s, right?

Yeap, sorry about that.

> > This patch fixes this issue. Note that there is also another
> > respective UAF but from the ioctl path [3], this should also fix
> > that issue.
> > 
> > This patch then also contends the severity of CVE-2019-19770 as
> > this issue is only possible using root to shoot yourself in the
> > foot by also misuing blktrace.
> > 
> > [0] https://bugzilla.kernel.org/show_bug.cgi?id=205713
> > [1] https://nvd.nist.gov/vuln/detail/CVE-2019-19770
> > [2] https://github.com/mcgrof/break-blktrace
> 
> I verified that this does reproduce the exact same KASAN splat on
> kernel 4.19.83 as reported in the original bug, thanks!

I just codified what Nicolai suspected, we should thank him :)

  Luis
Bart Van Assche April 5, 2020, 3:39 a.m. UTC | #3
On 2020-04-01 17:00, Luis Chamberlain wrote:
> korg#205713 then was used to create CVE-2019-19770 and claims that
> the bug is in a use-after-free in the debugfs core code. The
> implications of this being a generic UAF on debugfs would be
> much more severe, as it would imply parent dentries can sometimes
> not be possitive, which is something claim is not possible.
         ^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
         positive?  is there perhaps a word missing here?

> It turns out that the issue actually is a mis-use of debugfs for
> the multiqueue case, and the fragile nature of how we free the
> directory used to keep track of blktrace debugfs files. Omar's
> commit assumed the parent directory would be kept with
> debugfs_lookup() but this is not the case, only the dentry is
> kept around. We also special-case a solution for multiqueue
> given that for multiqueue code we always instantiate the debugfs
> directory for the request queue. We were leaving it only to chance,
> if someone happens to use blktrace, on single queue block devices
> for the respective debugfs directory be created.

Since the legacy block layer is gone, the above explanation may have to
be rephrased.

> We can fix the UAF by simply using a debugfs directory which is
> always created for singlequeue and multiqueue block devices. This
> simplifies the code considerably, with the only penalty now being
> that we're always creating the request queue directory debugfs
> directory for the block device on singlequeue block devices.

Same comment here - the legacy block layer is gone. I think that today
all block drivers are either request-based and multiqueue or so-called
make_request drivers. See also the output of git grep -nHw
blk_alloc_queue for examples of the latter category.

> This patch then also contends the severity of CVE-2019-19770 as
> this issue is only possible using root to shoot yourself in the
> foot by also misuing blktrace.
               ^^^^^^^
               misusing?

> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index b3f2ba483992..bda9378eab90 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -823,9 +823,6 @@ void blk_mq_debugfs_register(struct request_queue *q)
>  	struct blk_mq_hw_ctx *hctx;
>  	int i;
>  
> -	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> -					    blk_debugfs_root);
> -
>  	debugfs_create_files(q->debugfs_dir, q, blk_mq_debugfs_queue_attrs);
>  
>  	/*

[ ... ]

>  static void blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx,
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index fca9b158f4a0..20f20b0fa0b9 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -895,6 +895,7 @@ static void __blk_release_queue(struct work_struct *work)
>  
>  	blk_trace_shutdown(q);
>  
> +	blk_q_debugfs_unregister(q);
>  	if (queue_is_mq(q))
>  		blk_mq_debugfs_unregister(q);

Does this patch change the behavior of the block layer from only
registering a debugfs directory for request-based block devices to
registering a debugfs directory for request-based and make_request based
block devices? Is that behavior change an intended behavior change?

Thanks,

Bart.
Eric Sandeen April 6, 2020, 1:27 a.m. UTC | #4
On 4/4/20 10:39 PM, Bart Van Assche wrote:
> On 2020-04-01 17:00, Luis Chamberlain wrote:
>> korg#205713 then was used to create CVE-2019-19770 and claims that
>> the bug is in a use-after-free in the debugfs core code. The
>> implications of this being a generic UAF on debugfs would be
>> much more severe, as it would imply parent dentries can sometimes
>> not be possitive, which is something claim is not possible.
>          ^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>          positive?  is there perhaps a word missing here?
> 
>> It turns out that the issue actually is a mis-use of debugfs for
>> the multiqueue case, and the fragile nature of how we free the
>> directory used to keep track of blktrace debugfs files. Omar's
>> commit assumed the parent directory would be kept with
>> debugfs_lookup() but this is not the case, only the dentry is
>> kept around. We also special-case a solution for multiqueue
>> given that for multiqueue code we always instantiate the debugfs
>> directory for the request queue. We were leaving it only to chance,
>> if someone happens to use blktrace, on single queue block devices
>> for the respective debugfs directory be created.
> 
> Since the legacy block layer is gone, the above explanation may have to
> be rephrased.
> 
>> We can fix the UAF by simply using a debugfs directory which is
>> always created for singlequeue and multiqueue block devices. This
>> simplifies the code considerably, with the only penalty now being
>> that we're always creating the request queue directory debugfs
>> directory for the block device on singlequeue block devices.
> 
> Same comment here - the legacy block layer is gone. I think that today
> all block drivers are either request-based and multiqueue or so-called
> make_request drivers. See also the output of git grep -nHw
> blk_alloc_queue for examples of the latter category.
> 
>> This patch then also contends the severity of CVE-2019-19770 as
>> this issue is only possible using root to shoot yourself in the
>> foot by also misuing blktrace.
>                ^^^^^^^
>                misusing?

Honestly I think the whole "misusing blktrace" narrative is not relevant
here; the kernel has to deal with whatever ioctls it receives, right.

The thing I can't figure out from reading the change log is

1) what the root cause of the problem is, and
2) how this patch fixes it?

>> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
>> index b3f2ba483992..bda9378eab90 100644
>> --- a/block/blk-mq-debugfs.c
>> +++ b/block/blk-mq-debugfs.c
>> @@ -823,9 +823,6 @@ void blk_mq_debugfs_register(struct request_queue *q)
>>  	struct blk_mq_hw_ctx *hctx;
>>  	int i;
>>  
>> -	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
>> -					    blk_debugfs_root);
>> -
>>  	debugfs_create_files(q->debugfs_dir, q, blk_mq_debugfs_queue_attrs);
>>  
>>  	/*
> 
> [ ... ]
> 
>>  static void blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx,
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index fca9b158f4a0..20f20b0fa0b9 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -895,6 +895,7 @@ static void __blk_release_queue(struct work_struct *work)
>>  
>>  	blk_trace_shutdown(q);
>>  
>> +	blk_q_debugfs_unregister(q);
>>  	if (queue_is_mq(q))
>>  		blk_mq_debugfs_unregister(q);
> 
> Does this patch change the behavior of the block layer from only
> registering a debugfs directory for request-based block devices to
> registering a debugfs directory for request-based and make_request based
> block devices? Is that behavior change an intended behavior change?

Seems to be:

"This simplifies the code considerably, with the only penalty now being
that we're always creating the request queue directory debugfs
directory for the block device on singlequeue block devices."

?

-Eric
Bart Van Assche April 6, 2020, 4:25 a.m. UTC | #5
On 2020-04-05 18:27, Eric Sandeen wrote:
> The thing I can't figure out from reading the change log is
> 
> 1) what the root cause of the problem is, and
> 2) how this patch fixes it?

I think that the root cause is that do_blk_trace_setup() uses
debugfs_lookup() and that debugfs_lookup() may return a pointer
associated with a previous incarnation of the block device.
Additionally, I think the following changes fix that problem by using
q->debugfs_dir in the blktrace code instead of debugfs_lookup():

[ ... ]
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -311,7 +311,6 @@ static void blk_trace_free(struct blk_trace *bt)
 	debugfs_remove(bt->msg_file);
 	debugfs_remove(bt->dropped_file);
 	relay_close(bt->rchan);
-	debugfs_remove(bt->dir);
 	free_percpu(bt->sequence);
 	free_percpu(bt->msg_data);
 	kfree(bt);
[ ... ]
@@ -509,21 +510,19 @@ static int do_blk_trace_setup(struct request_queue
*q, char *name, dev_t dev,

 	ret = -ENOENT;

-	dir = debugfs_lookup(buts->name, blk_debugfs_root);
-	if (!dir)
-		bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
-
 	bt->dev = dev;
 	atomic_set(&bt->dropped, 0);
 	INIT_LIST_HEAD(&bt->running_list);

 	ret = -EIO;
-	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
+	bt->dropped_file = debugfs_create_file("dropped", 0444,
+					       q->debugfs_dir, bt,
 					       &blk_dropped_fops);
[ ... ]

Bart.
Nicolai Stange April 6, 2020, 9:18 a.m. UTC | #6
Bart Van Assche <bvanassche@acm.org> writes:

> On 2020-04-05 18:27, Eric Sandeen wrote:
>> The thing I can't figure out from reading the change log is
>> 
>> 1) what the root cause of the problem is, and
>> 2) how this patch fixes it?
>
> I think that the root cause is that do_blk_trace_setup() uses
> debugfs_lookup() and that debugfs_lookup() may return a pointer
> associated with a previous incarnation of the block device.

That's correct, the debugfs_lookup() can find a previous incarnation's
dir of the same name which is about to get removed from a not yet
schedule work.

I.e. something like the following is possible:

  LOOP_CTL_DEL(loop0) /* schedule __blk_release_queue() work_struct */
  LOOP_CTL_ADD(loop0) /* debugfs_create_dir() from
		       * blk_mq_debugfs_register() fails with EEXIST
                       */
  BLKTRACE_SETUP(loop0) /* debugfs_lookup() finds the directory about to
			 * get deleted and blktrace files will be created
			 * thereunder.
			 */

  The work_struct gets scheduled and the debugfs dir debugfs_remove()ed
  recursively, which includes the blktrace files just created. blktrace's
  dentry pointers are now dangling and there will be a UAF when it
  attempts to delete those again.

Luis' patch [2/3] fixes the issue of the debugfs_lookup() from
blk_mq_debugfs_register() potentially returning an existing directory
associated with a previous block device incarnation of the same name and
thus, fixes the UAF.


However, the problem that the debugfs_create_dir() from
blk_mq_debugfs_register() in a sequence of
  LOOP_CTL_DEL(loop0)
  LOOP_CTL_ADD(loop0)
could silently fail still remains. The RFC patch [3/3] from Luis
attempts to address this issue by folding the delayed
__blk_release_queue() work back into blk_release_queue(), the release
handler associated with the queue kobject and executed from the final
blk_queue_put(). However, that's still no full solution, because the
kobject release handler can run asynchronously, long after
blk_unregister_queue() has returned (c.f. also
CONFIG_DEBUG_KOBJECT_RELEASE).

Note that I proposed this change here (internally) as a potential
cleanup, because I missed the kobject_del() from blk_unregister_queue()
and *wrongly* concluded that blk_queue_put() must be allowed to sleep
nowadays. However, that kobject_del() is in place and moreover, the
analysis requested by Bart (c.f. [1] in this thread) revealed that there
are indeed a couple of sites calling blk_queue_put() from atomic
context.

So I'd suggest to drop patch [3/3] from this series and modify this
patch [2/3] here to move the blk_q_debugfs_unregister(q) invocation from
__blk_release_queue() to blk_unregister_queue() instead.


> Additionally, I think the following changes fix that problem by using
> q->debugfs_dir in the blktrace code instead of debugfs_lookup():

That would fix the UAF, but !queue_is_mq() queues wouldn't get a debugfs
directory created for them by blktrace anymore?


Thanks,

Nicolai

[1] https://lkml.kernel.org/r/87o8saj62m.fsf@suse.de


> [ ... ]
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -311,7 +311,6 @@ static void blk_trace_free(struct blk_trace *bt)
>  	debugfs_remove(bt->msg_file);
>  	debugfs_remove(bt->dropped_file);
>  	relay_close(bt->rchan);
> -	debugfs_remove(bt->dir);
>  	free_percpu(bt->sequence);
>  	free_percpu(bt->msg_data);
>  	kfree(bt);
> [ ... ]
> @@ -509,21 +510,19 @@ static int do_blk_trace_setup(struct request_queue
> *q, char *name, dev_t dev,
>
>  	ret = -ENOENT;
>
> -	dir = debugfs_lookup(buts->name, blk_debugfs_root);
> -	if (!dir)
> -		bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
> -
>  	bt->dev = dev;
>  	atomic_set(&bt->dropped, 0);
>  	INIT_LIST_HEAD(&bt->running_list);
>
>  	ret = -EIO;
> -	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
> +	bt->dropped_file = debugfs_create_file("dropped", 0444,
> +					       q->debugfs_dir, bt,
>  					       &blk_dropped_fops);
> [ ... ]
>
> Bart.
Eric Sandeen April 6, 2020, 2:29 p.m. UTC | #7
On 4/5/20 11:25 PM, Bart Van Assche wrote:
> On 2020-04-05 18:27, Eric Sandeen wrote:
>> The thing I can't figure out from reading the change log is
>>
>> 1) what the root cause of the problem is, and
>> 2) how this patch fixes it?
> 
> I think that the root cause is that do_blk_trace_setup() uses
> debugfs_lookup() and that debugfs_lookup() may return a pointer
> associated with a previous incarnation of the block device.
> Additionally, I think the following changes fix that problem by using
> q->debugfs_dir in the blktrace code instead of debugfs_lookup():

Yep, I gathered that from reading the patch, was just hoping for a commit log
that makes it clear.

> [ ... ]
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -311,7 +311,6 @@ static void blk_trace_free(struct blk_trace *bt)
>  	debugfs_remove(bt->msg_file);
>  	debugfs_remove(bt->dropped_file);
>  	relay_close(bt->rchan);
> -	debugfs_remove(bt->dir);
>  	free_percpu(bt->sequence);
>  	free_percpu(bt->msg_data);
>  	kfree(bt);
> [ ... ]
> @@ -509,21 +510,19 @@ static int do_blk_trace_setup(struct request_queue
> *q, char *name, dev_t dev,
> 
>  	ret = -ENOENT;
> 
> -	dir = debugfs_lookup(buts->name, blk_debugfs_root);
> -	if (!dir)
> -		bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
> -
>  	bt->dev = dev;
>  	atomic_set(&bt->dropped, 0);
>  	INIT_LIST_HEAD(&bt->running_list);
> 
>  	ret = -EIO;
> -	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
> +	bt->dropped_file = debugfs_create_file("dropped", 0444,
> +					       q->debugfs_dir, bt,
>  					       &blk_dropped_fops);

One thing I'm not sure about, the block_trace *bt still points to a dentry that
could get torn down via debugfs_remove_recursive when the queue is released, right,
but could later be sent to blk_trace_free again?  And yet this does seem to fix the
use after free in my testing, so I must be missing something.

Thanks,
-Eric
Luis Chamberlain April 6, 2020, 3:14 p.m. UTC | #8
On Sat, Apr 04, 2020 at 08:39:47PM -0700, Bart Van Assche wrote:
> On 2020-04-01 17:00, Luis Chamberlain wrote:
> > korg#205713 then was used to create CVE-2019-19770 and claims that
> > the bug is in a use-after-free in the debugfs core code. The
> > implications of this being a generic UAF on debugfs would be
> > much more severe, as it would imply parent dentries can sometimes
> > not be possitive, which is something claim is not possible.
>          ^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>          positive?  is there perhaps a word missing here?

Sorry yeah, this was supposed to say:

it would imply parent dentries can sometimes not be positive, which
is just not possible.

> > It turns out that the issue actually is a mis-use of debugfs for
> > the multiqueue case, and the fragile nature of how we free the
> > directory used to keep track of blktrace debugfs files. Omar's
> > commit assumed the parent directory would be kept with
> > debugfs_lookup() but this is not the case, only the dentry is
> > kept around. We also special-case a solution for multiqueue
> > given that for multiqueue code we always instantiate the debugfs
> > directory for the request queue. We were leaving it only to chance,
> > if someone happens to use blktrace, on single queue block devices
> > for the respective debugfs directory be created.
> 
> Since the legacy block layer is gone, the above explanation may have to
> be rephrased.

Will do.

> > We can fix the UAF by simply using a debugfs directory which is
> > always created for singlequeue and multiqueue block devices. This
> > simplifies the code considerably, with the only penalty now being
> > that we're always creating the request queue directory debugfs
> > directory for the block device on singlequeue block devices.
> 
> Same comment here - the legacy block layer is gone. I think that today
> all block drivers are either request-based and multiqueue or so-called
> make_request drivers. See also the output of git grep -nHw
> blk_alloc_queue for examples of the latter category.

Will adjust.

> > This patch then also contends the severity of CVE-2019-19770 as
> > this issue is only possible using root to shoot yourself in the
> > foot by also misuing blktrace.
>                ^^^^^^^
>                misusing?
> 
> > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> > index b3f2ba483992..bda9378eab90 100644
> > --- a/block/blk-mq-debugfs.c
> > +++ b/block/blk-mq-debugfs.c
> > @@ -823,9 +823,6 @@ void blk_mq_debugfs_register(struct request_queue *q)
> >  	struct blk_mq_hw_ctx *hctx;
> >  	int i;
> >  
> > -	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
> > -					    blk_debugfs_root);
> > -
> >  	debugfs_create_files(q->debugfs_dir, q, blk_mq_debugfs_queue_attrs);
> >  
> >  	/*
> 
> [ ... ]
> 
> >  static void blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx,
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index fca9b158f4a0..20f20b0fa0b9 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -895,6 +895,7 @@ static void __blk_release_queue(struct work_struct *work)
> >  
> >  	blk_trace_shutdown(q);
> >  
> > +	blk_q_debugfs_unregister(q);
> >  	if (queue_is_mq(q))
> >  		blk_mq_debugfs_unregister(q);
> 
> Does this patch change the behavior of the block layer from only
> registering a debugfs directory for request-based block devices to
> registering a debugfs directory for request-based and make_request based
> block devices? Is that behavior change an intended behavior change?

Yes, specifically this was already done, however for request-based block
devices this was done upon init, and for make_request based devices this
was only instantiated *iff* blktrace was used at least once. It is
actually a bit difficult to see the later, given the rq->debugfs_dir was
not used per se for make_request based block devices, but instead
the debugfs_create_dir(buts->name, blk_debugfs_root) call was made
directly, which happens to end up being the same directory as
debugfs_create_dir(kobject_name(q->kobj.parent), blk_debugfs_root)
called on block/blk-mq-debugfs.c.

This changes the block layer so that the rq->debugfs_dir is always created
now if debugfs is enabled.

Note that blktrace already depends on debugfs. What was missing in this
patch too was this hunk:

--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -569,8 +569,10 @@ struct request_queue {
	struct list_head        tag_set_list;
	struct bio_set          bio_split;

-#ifdef CONFIG_BLK_DEBUG_FS
+#ifdef CONFIG_DEBUG_FS
	struct dentry           *debugfs_dir;
+#endif
+#ifdef CONFIG_BLK_DEBUG_FS
	struct dentry
	*sched_debugfs_dir;
	struct dentry
	*rqos_debugfs_dir;
#endif
Luis Chamberlain April 6, 2020, 3:19 p.m. UTC | #9
On Mon, Apr 06, 2020 at 11:18:13AM +0200, Nicolai Stange wrote:
> Bart Van Assche <bvanassche@acm.org> writes:

> So I'd suggest to drop patch [3/3] from this series and modify this
> patch [2/3] here to move the blk_q_debugfs_unregister(q) invocation from
> __blk_release_queue() to blk_unregister_queue() instead.

I'll take a stab.

> > Additionally, I think the following changes fix that problem by using
> > q->debugfs_dir in the blktrace code instead of debugfs_lookup():
> 
> That would fix the UAF, but !queue_is_mq() queues wouldn't get a debugfs
> directory created for them by blktrace anymore?

It would, it would just be done early on init as well, and it would now be
shared with the queue_is_mq() case.

  Luis
Luis Chamberlain April 7, 2020, 8:09 a.m. UTC | #10
On Mon, Apr 06, 2020 at 09:29:40AM -0500, Eric Sandeen wrote:
> On 4/5/20 11:25 PM, Bart Van Assche wrote:
> > On 2020-04-05 18:27, Eric Sandeen wrote:
> >> The thing I can't figure out from reading the change log is
> >>
> >> 1) what the root cause of the problem is, and
> >> 2) how this patch fixes it?
> > 
> > I think that the root cause is that do_blk_trace_setup() uses
> > debugfs_lookup() and that debugfs_lookup() may return a pointer
> > associated with a previous incarnation of the block device.
> > Additionally, I think the following changes fix that problem by using
> > q->debugfs_dir in the blktrace code instead of debugfs_lookup():
> 
> Yep, I gathered that from reading the patch, was just hoping for a commit log
> that makes it clear.

Will try to be a bit more clear.

> > [ ... ]
> > --- a/kernel/trace/blktrace.c
> > +++ b/kernel/trace/blktrace.c
> > @@ -311,7 +311,6 @@ static void blk_trace_free(struct blk_trace *bt)
> >  	debugfs_remove(bt->msg_file);
> >  	debugfs_remove(bt->dropped_file);
> >  	relay_close(bt->rchan);
> > -	debugfs_remove(bt->dir);
> >  	free_percpu(bt->sequence);
> >  	free_percpu(bt->msg_data);
> >  	kfree(bt);
> > [ ... ]
> > @@ -509,21 +510,19 @@ static int do_blk_trace_setup(struct request_queue
> > *q, char *name, dev_t dev,
> > 
> >  	ret = -ENOENT;
> > 
> > -	dir = debugfs_lookup(buts->name, blk_debugfs_root);
> > -	if (!dir)
> > -		bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
> > -
> >  	bt->dev = dev;
> >  	atomic_set(&bt->dropped, 0);
> >  	INIT_LIST_HEAD(&bt->running_list);
> > 
> >  	ret = -EIO;
> > -	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
> > +	bt->dropped_file = debugfs_create_file("dropped", 0444,
> > +					       q->debugfs_dir, bt,
> >  					       &blk_dropped_fops);
> 
> One thing I'm not sure about, the block_trace *bt still points to a dentry that
> could get torn down via debugfs_remove_recursive when the queue is released, right,

Yes, but let us be specific, what points to a dentry is the directory,
and then the files liked "dropped" above. In fact, as it turns out, the
relay can also be torn down for us through a debugfs_remove_recursive().

> but could later be sent to blk_trace_free again? 

This *should* not be possible, because the dentries for the blktrace
should not be removed prior to the request_queue being removed, because
the blktrace dentries are protected via a request_queue mutex,
q->blk_trace_mutex.

The blk_trace_free() free should therefore happen way early, before
the request_queue is removed. The only dentry left is the directory
on the q->debugfs_dir, and this should only be removed only later when
the request_queue is removed.

But, this does not happen today. For completeness below is a trace
of a real crash but with some extra information added, such as the
get_cpu() a call runs on, and any other interesting thing which we
might care for at the point a call is made.

Also remember debugfs_remove_recursive() is just a define to debugfs_remove().

I used:

	./break-blktrace -c 10 -d -s

So the teardown of the blktrace is skipped purposely.

The gist of it, is that upon the second loop, the LOOP_CTL_DEL for
loop0 schedules the __blk_release_queue() work, but it at least
gets to call blk_trace_free(), but in the meantime LOOP_CTL_ADD()
from the second loop gets called, without the prior work from the
LOOP_CTL_DEL having completed yet. By the time BLKTRACE_SETUP gets
called, the debugfs_lookup() is used, the directory is used (not
created). Then the old LOOP_CTL_DEL from the same second loop still
has to complete, and so it tries, and as you will see from the trace
below it calls debugfs_remove_recursive(q->debugfs_dir) prior to
the second loops' BLKTRACE_SETUP finishing. The dentries for the
files created for blktrace would be removed. Upon the third loop's
LOOP_CTL_DEL we end up trying to free these dentries again.

load loopback module                                                            
[   13.603371] == blk_mq_debugfs_register(12) start                             
[   13.604040] == blk_mq_debugfs_register(12) q->debugfs_dir created            
[   13.604934] == blk_mq_debugfs_register(12) end                               
[   13.627382] == blk_mq_debugfs_register(12) start                             
[   13.628041] == blk_mq_debugfs_register(12) q->debugfs_dir created            
[   13.629240] == blk_mq_debugfs_register(12) end                               
[   13.651667] == blk_mq_debugfs_register(12) start                             
[   13.652836] == blk_mq_debugfs_register(12) q->debugfs_dir created            
[   13.655107] == blk_mq_debugfs_register(12) end                               
[   13.684917] == blk_mq_debugfs_register(12) start                             
[   13.687876] == blk_mq_debugfs_register(12) q->debugfs_dir created            
[   13.691588] == blk_mq_debugfs_register(13) end                               
[   13.707320] == blk_mq_debugfs_register(13) start                             
[   13.707863] == blk_mq_debugfs_register(13) q->debugfs_dir created            
[   13.708856] == blk_mq_debugfs_register(13) end                               
[   13.735623] == blk_mq_debugfs_register(13) start                             
[   13.736656] == blk_mq_debugfs_register(13) q->debugfs_dir created            
[   13.738411] == blk_mq_debugfs_register(13) end                               
[   13.763326] == blk_mq_debugfs_register(13) start                             
[   13.763972] == blk_mq_debugfs_register(13) q->debugfs_dir created            
[   13.765167] == blk_mq_debugfs_register(13) end                               
[   13.779510] == blk_mq_debugfs_register(13) start                             
[   13.780522] == blk_mq_debugfs_register(13) q->debugfs_dir created            
[   13.782338] == blk_mq_debugfs_register(13) end                               
[   13.783521] loop: module loaded         

LOOP_CTL_DEL(loop0) #1                                                          
[   13.803550] = __blk_release_queue(4) start                                   
[   13.807772] == blk_trace_shutdown(4) start                                   
[   13.810749] == blk_trace_shutdown(4) end                                     
[   13.813437] = __blk_release_queue(4) calling blk_mq_debugfs_unregister()     
[   13.817593] ==== blk_mq_debugfs_unregister(4) begin                          
[   13.817621] ==== blk_mq_debugfs_unregister(4) debugfs_remove_recursive(q->debugfs_dir)
[   13.821203] ==== blk_mq_debugfs_unregister(4) end q->debugfs_dir is NULL     
[   13.826166] = __blk_release_queue(4) blk_mq_debugfs_unregister() end         
[   13.832992] = __blk_release_queue(4) end

LOOP_CTL_ADD(loop0) #1                                                          
[   13.843742] == blk_mq_debugfs_register(7) start                              
[   13.845569] == blk_mq_debugfs_register(7) q->debugfs_dir created             
[   13.848628] == blk_mq_debugfs_register(7) end

BLKTRACE_SETUP(loop0) #1                                                        
[   13.850924] == blk_trace_ioctl(7, BLKTRACESETUP) start                       
[   13.852852] === do_blk_trace_setup(7) start                                  
[   13.854580] === do_blk_trace_setup(7) creating directory                     
[   13.856620] === do_blk_trace_setup(7) using what debugfs_lookup() gave       
[   13.860635] === do_blk_trace_setup(7) end with ret: 0                        
[   13.862615] == blk_trace_ioctl(7, BLKTRACESETUP) end

LOOP_CTL_DEL(loop0) #2                                                          
[   13.883304] = __blk_release_queue(7) start                                   
[   13.885324] == blk_trace_shutdown(7) start                                   
[   13.887197] == blk_trace_shutdown(7) calling __blk_trace_remove()            
[   13.889807] == __blk_trace_remove(7) start                                   
[   13.891669] === blk_trace_cleanup(7) start                                   
[   13.911656] ====== blk_trace_free(7) start  

LOOP_CTL_ADD(loop0) #2                                                          
[   13.912709] == blk_mq_debugfs_register(2) start

---> From LOOP_CTL_DEL(loop0) #2                                                
[   13.915887] ====== blk_trace_free(7) end 

---> From LOOP_CTL_ADD(loop0) #2                                                
[   13.918359] debugfs: Directory 'loop0' with parent 'block' already present!  
[   13.926433] == blk_mq_debugfs_register(2) q->debugfs_dir created             
[   13.930373] == blk_mq_debugfs_register(2) end

BLKTRACE_SETUP(loop0) #2                                                        
[   13.933961] == blk_trace_ioctl(2, BLKTRACESETUP) start                       
[   13.936758] === do_blk_trace_setup(2) start                                  
[   13.938944] === do_blk_trace_setup(2) creating directory                     
[   13.941029] === do_blk_trace_setup(2) using what debugfs_lookup() gave

---> From LOOP_CTL_DEL(loop0) #2                                                
[   13.971046] === blk_trace_cleanup(7) end                                     
[   13.973175] == __blk_trace_remove(7) end                                     
[   13.975352] == blk_trace_shutdown(7) end                                     
[   13.977415] = __blk_release_queue(7) calling blk_mq_debugfs_unregister()     
[   13.980645] ==== blk_mq_debugfs_unregister(7) begin                          
[   13.980696] ==== blk_mq_debugfs_unregister(7) debugfs_remove_recursive(q->debugfs_dir)
[   13.983118] ==== blk_mq_debugfs_unregister(7) end q->debugfs_dir is NULL     
[   13.986945] = __blk_release_queue(7) blk_mq_debugfs_unregister() end         
[   13.993155] = __blk_release_queue(7) end 

---> From BLKTRACE_SETUP(loop0) #2                                              
[   13.995928] === do_blk_trace_setup(2) end with ret: 0                        
[   13.997623] == blk_trace_ioctl(2, BLKTRACESETUP) end   

LOOP_CTL_DEL(loop0) #3                                                          
[   14.035119] = __blk_release_queue(2) start                                   
[   14.036925] == blk_trace_shutdown(2) start                                   
[   14.038518] == blk_trace_shutdown(2) calling __blk_trace_remove()            
[   14.040829] == __blk_trace_remove(2) start                                   
[   14.042413] === blk_trace_cleanup(2) start 

LOOP_CTL_ADD(loop0) #3                                                          
[   14.072522] == blk_mq_debugfs_register(6) start 

---> From LOOP_CTL_DEL(loop0) #3                                                
[   14.075151] ====== blk_trace_free(2) start 

---> From LOOP_CTL_ADD(loop0) #3                                                
[   14.075882] == blk_mq_debugfs_register(6) q->debugfs_dir created  

---> From LOOP_CTL_DEL(loop0) #3                                                
[   14.078624] BUG: kernel NULL pointer dereference, address: 00000000000000a0  
[   14.084332] == blk_mq_debugfs_register(6) end                                
[   14.086971] #PF: supervisor write access in kernel mode                      
[   14.086974] #PF: error_code(0x0002) - not-present page                       
[   14.086977] PGD 0 P4D 0                                                      
[   14.086984] Oops: 0002 [#1] SMP NOPTI                                        
[   14.086990] CPU: 2 PID: 287 Comm: kworker/2:2 Tainted: G            E 5.6.0-next-20200403+ #54
[   14.086991] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[   14.087002] Workqueue: events __blk_release_queue                            
[   14.087011] RIP: 0010:down_write+0x15/0x40                                   
[   14.090300] == blk_trace_ioctl(6, BLKTRACESETUP) start                       
[   14.093277] Code: eb ca e8 3e 34 8d ff cc cc cc cc cc cc cc cc cc cc
cc cc cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00 00
00 <f0> 48 0f b1 55 00 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89 45 08 5d
[   14.093280] RSP: 0018:ffffc28a00533da8 EFLAGS: 00010246                      
[   14.093284] RAX: 0000000000000000 RBX: ffff9f7a24d07980 RCX: ffffff8100000000
[   14.093286] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
[   14.093287] RBP: 00000000000000a0 R08: 0000000000000000 R09: 0000000000000019
[   14.093289] R10: 0000000000000774 R11: 0000000000000000 R12: 0000000000000000
[   14.093291] R13: ffff9f7a2fab0400 R14: ffff9f7a21dd1140 R15: 00000000000000a0
[   14.093294] FS:  0000000000000000(0000) GS:ffff9f7a2fa80000(0000) knlGS:0000000000000000
[   14.093296] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033                
[   14.093298] CR2: 00000000000000a0 CR3: 00000004293d2003 CR4: 0000000000360ee0
[   14.093307] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   14.093308] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   14.093310] Call Trace:                                                      
[   14.093324]  simple_recursive_removal+0x4e/0x2e0                             
[   14.093330]  ? debugfs_remove+0x60/0x60                                      
[   14.093334]  debugfs_remove+0x40/0x60                                        
[   14.093339]  blk_trace_free+0x20/0x70                                        
[   14.093346]  __blk_trace_remove+0x54/0x90                                    
[   14.096704] === do_blk_trace_setup(6) start                                  
[   14.098534]  blk_trace_shutdown+0x74/0x80                                    
[   14.100958] === do_blk_trace_setup(6) creating directory                     
[   14.104575]  __blk_release_queue+0xbe/0x160                                  
[   14.104580]  process_one_work+0x1b4/0x380                                    
[   14.104585]  worker_thread+0x50/0x3c0                                        
[   14.104589]  kthread+0xf9/0x130            
[   14.104593]  ? process_one_work+0x380/0x380                                  
[   14.104596]  ? kthread_park+0x90/0x90                                        
[   14.104599]  ret_from_fork+0x1f/0x40                                         
[   14.104603] Modules linked in: loop(E) xfs(E) libcrc32c(E)
crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) joydev(E)
serio_raw(E) aesni_intel(E) glue_helper(E) virtio_balloon(E) evdev(E)
crypto_simd(E) pcspkr(E) cryptd(E) i6300esb(E) button(E) ip_tables(E)
x_tables(E) autofs4(E) ext4(E) crc32c_generic(E) crc16(E) mbcache(E)
jbd2(E) virtio_net(E) net_failover(E) failover(E) virtio_blk(E)
ata_generic(E) uhci_hcd(E) ata_piix(E) ehci_hcd(E) nvme(E) libata(E)
crc32c_intel(E) usbcore(E) psmouse(E) nvme_core(E) virtio_pci(E)
scsi_mod(E) virtio_ring(E) t10_pi(E) virtio(E) i2c_piix4(E) floppy(E)
[   14.107400] === do_blk_trace_setup(6) using what debugfs_lookup()
gave       
[   14.108939] CR2: 00000000000000a0                                            
[   14.110589] === do_blk_trace_setup(6) end with ret: 0                        
[   14.111592] ---[ end trace 7a783b33b9614db9 ]---

> And yet this does seem to fix the
> use after free in my testing, so I must be missing something.

The fragility is in the use of the debugfs_lookup(), and since that
doesn't protect the files we create underneath it. We also dput()
right away, and even if we didn't, if you try to fix that you'll
soon see that this is just a hot mess with the debugfs directories
used.

It is much cleaner to have one directory which we don't have to
worry about for its life, the complexities around special casing
it for mq or not is what gets this thing to be sloppy.

One thing we could do to avoid conflicts with removal and a setup
is:

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 15086227592f..e3d4809a2964 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -701,6 +701,9 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
 	if (!q)
 		return -ENXIO;
 
+	if (work_pending(&q->release_work))
+		return -ENXIO;
+
 	mutex_lock(&q->blk_trace_mutex);
 
 	switch (cmd) {
Luis Chamberlain April 7, 2020, 8:15 a.m. UTC | #11
On Mon, Apr 6, 2020 at 9:19 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Mon, Apr 06, 2020 at 11:18:13AM +0200, Nicolai Stange wrote:
> > Bart Van Assche <bvanassche@acm.org> writes:
>
> > So I'd suggest to drop patch [3/3] from this series and modify this
> > patch [2/3] here to move the blk_q_debugfs_unregister(q) invocation from
> > __blk_release_queue() to blk_unregister_queue() instead.
>
> I'll take a stab.

That didn't work. I'll look for alternatives.

  Luis
diff mbox series

Patch

diff --git a/block/blk-debugfs.c b/block/blk-debugfs.c
index 634dea4b1507..a8b343e758e4 100644
--- a/block/blk-debugfs.c
+++ b/block/blk-debugfs.c
@@ -13,3 +13,15 @@  void blk_debugfs_register(void)
 {
 	blk_debugfs_root = debugfs_create_dir("block", NULL);
 }
+
+void blk_q_debugfs_register(struct request_queue *q)
+{
+	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
+					    blk_debugfs_root);
+}
+
+void blk_q_debugfs_unregister(struct request_queue *q)
+{
+	debugfs_remove_recursive(q->debugfs_dir);
+	q->debugfs_dir = NULL;
+}
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index b3f2ba483992..bda9378eab90 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -823,9 +823,6 @@  void blk_mq_debugfs_register(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
-	q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
-					    blk_debugfs_root);
-
 	debugfs_create_files(q->debugfs_dir, q, blk_mq_debugfs_queue_attrs);
 
 	/*
@@ -856,9 +853,7 @@  void blk_mq_debugfs_register(struct request_queue *q)
 
 void blk_mq_debugfs_unregister(struct request_queue *q)
 {
-	debugfs_remove_recursive(q->debugfs_dir);
 	q->sched_debugfs_dir = NULL;
-	q->debugfs_dir = NULL;
 }
 
 static void blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fca9b158f4a0..20f20b0fa0b9 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -895,6 +895,7 @@  static void __blk_release_queue(struct work_struct *work)
 
 	blk_trace_shutdown(q);
 
+	blk_q_debugfs_unregister(q);
 	if (queue_is_mq(q))
 		blk_mq_debugfs_unregister(q);
 
@@ -975,6 +976,8 @@  int blk_register_queue(struct gendisk *disk)
 		goto unlock;
 	}
 
+	blk_q_debugfs_register(q);
+
 	if (queue_is_mq(q)) {
 		__blk_mq_register_dev(dev, q);
 		blk_mq_debugfs_register(q);
diff --git a/block/blk.h b/block/blk.h
index 86a66b614f08..b86123a2d74f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -489,10 +489,20 @@  int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 		bool *same_page);
 #ifdef CONFIG_DEBUG_FS
 void blk_debugfs_register(void);
+void blk_q_debugfs_register(struct request_queue *q);
+void blk_q_debugfs_unregister(struct request_queue *q);
 #else
 static inline void blk_debugfs_register(void)
 {
 }
+
+static inline void blk_q_debugfs_register(struct request_queue *q)
+{
+}
+
+static inline void blk_q_debugfs_unregister(struct request_queue *q)
+{
+}
 #endif /* CONFIG_DEBUG_FS */
 
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blktrace_api.h b/include/linux/blktrace_api.h
index 3b6ff5902edc..eb6db276e293 100644
--- a/include/linux/blktrace_api.h
+++ b/include/linux/blktrace_api.h
@@ -22,7 +22,6 @@  struct blk_trace {
 	u64 end_lba;
 	u32 pid;
 	u32 dev;
-	struct dentry *dir;
 	struct dentry *dropped_file;
 	struct dentry *msg_file;
 	struct list_head running_list;
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index ca39dc3230cb..15086227592f 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -311,7 +311,6 @@  static void blk_trace_free(struct blk_trace *bt)
 	debugfs_remove(bt->msg_file);
 	debugfs_remove(bt->dropped_file);
 	relay_close(bt->rchan);
-	debugfs_remove(bt->dir);
 	free_percpu(bt->sequence);
 	free_percpu(bt->msg_data);
 	kfree(bt);
@@ -476,7 +475,6 @@  static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 			      struct blk_user_trace_setup *buts)
 {
 	struct blk_trace *bt = NULL;
-	struct dentry *dir = NULL;
 	int ret;
 
 	if (!buts->buf_size || !buts->buf_nr)
@@ -485,6 +483,9 @@  static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 	if (!blk_debugfs_root)
 		return -ENOENT;
 
+	if (!q->debugfs_dir)
+		return -ENOENT;
+
 	strncpy(buts->name, name, BLKTRACE_BDEV_SIZE);
 	buts->name[BLKTRACE_BDEV_SIZE - 1] = '\0';
 
@@ -509,21 +510,19 @@  static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 
 	ret = -ENOENT;
 
-	dir = debugfs_lookup(buts->name, blk_debugfs_root);
-	if (!dir)
-		bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
-
 	bt->dev = dev;
 	atomic_set(&bt->dropped, 0);
 	INIT_LIST_HEAD(&bt->running_list);
 
 	ret = -EIO;
-	bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt,
+	bt->dropped_file = debugfs_create_file("dropped", 0444,
+					       q->debugfs_dir, bt,
 					       &blk_dropped_fops);
 
-	bt->msg_file = debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops);
+	bt->msg_file = debugfs_create_file("msg", 0222, q->debugfs_dir,
+					   bt, &blk_msg_fops);
 
-	bt->rchan = relay_open("trace", dir, buts->buf_size,
+	bt->rchan = relay_open("trace", q->debugfs_dir, buts->buf_size,
 				buts->buf_nr, &blk_relay_callbacks, bt);
 	if (!bt->rchan)
 		goto err;
@@ -551,8 +550,6 @@  static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 
 	ret = 0;
 err:
-	if (dir && !bt->dir)
-		dput(dir);
 	if (ret)
 		blk_trace_free(bt);
 	return ret;