From patchwork Tue Apr 14 04:18:58 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Luis Chamberlain X-Patchwork-Id: 11486743 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id EE3A8174A for ; Tue, 14 Apr 2020 04:19:31 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id D524F20737 for ; Tue, 14 Apr 2020 04:19:31 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1586837971; bh=WK0Djy2DbNnX7XgN1voKF0kaelb+5aAIBEsPUY369d0=; h=From:To:Cc:Subject:Date:In-Reply-To:References:List-ID:From; b=ovHnRoKjrribIWssHumaUqUMIz8xZLr8NfBFeCcbo/8EVc0rd3Vga48iuvw9Hrx6A 2kEE7Vdas1Rdw2NcqTPQVHoy2SV7x7L1qNrS7hLulz/nWSQch9bY8PKpyjtydQpV4Q poUqTdZgatNk+n7Y568DnjVJ0gkA6wcsdSs7DD2E= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2405271AbgDNET0 (ORCPT ); Tue, 14 Apr 2020 00:19:26 -0400 Received: from mail-pj1-f67.google.com ([209.85.216.67]:53327 "EHLO mail-pj1-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2405226AbgDNETH (ORCPT ); Tue, 14 Apr 2020 00:19:07 -0400 Received: by mail-pj1-f67.google.com with SMTP id cl8so3568836pjb.3; Mon, 13 Apr 2020 21:19:07 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=GOqFz/6SfoYIxdIaKsZbQKuFZCTl+d+PmNjeRcAeYWw=; b=FYGcpJmOjbezTz22wbul39ZMILTTgMIifA26EiIiCom8QQ29f9m6alYVnFvqjfJAoY INLniVquV5SH0p1m9/a1AH7UnUPnswOH1KkV8g5je0f0XhWr7hauReCWdqc60hH3LKrO wVk5y7yZR9waZ/WBRI4NvNAjkFF8fvARHx6dBY4wPhgMauTDVohc/HPpqaCrvmpEgu8l LUSC1NgFTytqUPhdKht89a1GH4ZotbalWcI6By/7MoLoCbYKgslQga0S87tLRdWny3bT 
LmSRN9XOkgNwkuM5zRqPOeJ/gskm4hcIxrTBf35vfPzb7kGouwsfR0t5FlP3IMcZsIXR ZcmA==
X-Gm-Message-State: AGi0PubkN3PSd8iKUdhJ1v70Y1vj2sTiruOk9ZEXA/5l3Mo/GsjyWF8w
 DNzPo9d4AdR6zDxBAnnisgk=
X-Google-Smtp-Source: APiQypIp2fwAykCaBVO3uyfeOBBSTJASxml0yVcrV1SDyvghPVLTQxgkubQP+6PzT2wAG+Ag3J1d9w==
X-Received: by 2002:a17:902:c814:: with SMTP id u20mr1357265plx.85.1586837946598;
 Mon, 13 Apr 2020 21:19:06 -0700 (PDT)
Received: from 42.do-not-panic.com (42.do-not-panic.com. [157.230.128.187])
 by smtp.gmail.com with ESMTPSA id x7sm7854126pjg.26.2020.04.13.21.19.04
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Mon, 13 Apr 2020 21:19:04 -0700 (PDT)
Received: by 42.do-not-panic.com (Postfix, from userid 1000)
 id D6D8B40605; Tue, 14 Apr 2020 04:19:03 +0000 (UTC)
From: Luis Chamberlain
To: axboe@kernel.dk, viro@zeniv.linux.org.uk, bvanassche@acm.org,
 gregkh@linuxfoundation.org, rostedt@goodmis.org, mingo@redhat.com,
 jack@suse.cz, ming.lei@redhat.com, nstange@suse.de,
 akpm@linux-foundation.org
Cc: mhocko@suse.com, yukuai3@huawei.com, linux-block@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Luis Chamberlain, Omar Sandoval,
 Hannes Reinecke, Michal Hocko
Subject: [PATCH 1/5] block: move main block debugfs initialization to its own file
Date: Tue, 14 Apr 2020 04:18:58 +0000
Message-Id: <20200414041902.16769-2-mcgrof@kernel.org>
X-Mailer: git-send-email 2.23.0.rc1
In-Reply-To: <20200414041902.16769-1-mcgrof@kernel.org>
References: <20200414041902.16769-1-mcgrof@kernel.org>
MIME-Version: 1.0
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-fsdevel@vger.kernel.org

make_request-based drivers and request-based drivers share some
debugfs code. Moving this code into its own file makes it easier to
expand and audit.

This patch contains no functional changes.

Cc: Bart Van Assche Cc: Omar Sandoval Cc: Hannes Reinecke Cc: Nicolai Stange Cc: Greg Kroah-Hartman Cc: Michal Hocko Cc: yu kuai Signed-off-by: Luis Chamberlain Reviewed-by: Greg Kroah-Hartman Reviewed-by: Bart Van Assche --- block/Makefile | 1 + block/blk-core.c | 9 +-------- block/blk-debugfs.c | 15 +++++++++++++++ block/blk.h | 7 +++++++ 4 files changed, 24 insertions(+), 8 deletions(-) create mode 100644 block/blk-debugfs.c diff --git a/block/Makefile b/block/Makefile index 206b96e9387f..1d3ab20505d8 100644 --- a/block/Makefile +++ b/block/Makefile @@ -10,6 +10,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-sysfs.o \ blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \ genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o +obj-$(CONFIG_DEBUG_FS) += blk-debugfs.o obj-$(CONFIG_BOUNCE) += bounce.o obj-$(CONFIG_BLK_SCSI_REQUEST) += scsi_ioctl.o obj-$(CONFIG_BLK_DEV_BSG) += bsg.o diff --git a/block/blk-core.c b/block/blk-core.c index 7e4a1da0715e..5aaae7a1b338 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -48,10 +48,6 @@ #include "blk-pm.h" #include "blk-rq-qos.h" -#ifdef CONFIG_DEBUG_FS -struct dentry *blk_debugfs_root; -#endif - EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap); EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap); EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_complete); @@ -1796,10 +1792,7 @@ int __init blk_dev_init(void) blk_requestq_cachep = kmem_cache_create("request_queue", sizeof(struct request_queue), 0, SLAB_PANIC, NULL); - -#ifdef CONFIG_DEBUG_FS - blk_debugfs_root = debugfs_create_dir("block", NULL); -#endif + blk_debugfs_register(); return 0; } diff --git a/block/blk-debugfs.c b/block/blk-debugfs.c new file mode 100644 index 000000000000..19091e1effc0 --- /dev/null +++ b/block/blk-debugfs.c @@ -0,0 +1,15 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* + * Shared request-based / make_request-based functionality + */ +#include +#include +#include + +struct dentry *blk_debugfs_root; + +void blk_debugfs_register(void) +{ + 
blk_debugfs_root = debugfs_create_dir("block", NULL); +} diff --git a/block/blk.h b/block/blk.h index 0a94ec68af32..86a66b614f08 100644 --- a/block/blk.h +++ b/block/blk.h @@ -487,5 +487,12 @@ struct request_queue *__blk_alloc_queue(int node_id); int __bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page *page, unsigned int len, unsigned int offset, bool *same_page); +#ifdef CONFIG_DEBUG_FS +void blk_debugfs_register(void); +#else +static inline void blk_debugfs_register(void) +{ +} +#endif /* CONFIG_DEBUG_FS */ #endif /* BLK_INTERNAL_H */ From patchwork Tue Apr 14 04:18:59 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Luis Chamberlain X-Patchwork-Id: 11486751 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 7E994174A for ; Tue, 14 Apr 2020 04:19:53 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4CBAE2078A for ; Tue, 14 Apr 2020 04:19:53 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1586837993; bh=yxftA0nzxi9V/xsSjdsM6OOO11A8XnGmL9ArG7uDRVc=; h=From:To:Cc:Subject:Date:In-Reply-To:References:List-ID:From; b=TX6MXf3Red5YIrB+ZbpazxauACGvWtdj0Xl7x5RcsD42kUgjc/oWjS5ujo9HAQWjl nygD82+42iNDq2rbh0mQNjOMTiiH1VYyp05PFUWW+LFYMrnyax4lqW7Ax1fex5hIHe XBcEfZ26tIeVFwtU1ZdNkKusjA9clsUifDx3MpB0= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2405255AbgDNETY (ORCPT ); Tue, 14 Apr 2020 00:19:24 -0400 Received: from mail-pl1-f195.google.com ([209.85.214.195]:41637 "EHLO mail-pl1-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2405232AbgDNETL (ORCPT ); Tue, 14 Apr 2020 00:19:11 -0400 Received: by mail-pl1-f195.google.com with SMTP id d24so4198516pll.8; Mon, 13 Apr 2020 21:19:09 -0700 (PDT) 
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=CNy/Hws67SddpPuSgwxX521J0KQbIlyC2nR+XSopWLE=; b=LuXY9QGJ98dtlTWSI8Zd+9aec1UCbREB5hmwk5MATvCBuE7IToRGyhG10zpbGAqibJ q07olxSekyBYpwWU/DK2U7+5YjmS1SBmh4xUyElF931AYIEMLyHEWbqoVWQaKEDdJy10 QwGu/gvDYvb6fzhB+tsy11UcqeTEF2ofvQFgPRlr6KfZIf4F5pT/kTHANbs1EmnE6SvV ynOHFdAqP5Z47erge+EYRszBq5kDFdgvM2vh5sj8XeJCYwuCxQeDxQVB5h6ncE8tveY6 iCpeqF8oKqkOnMxd5ZFK8xPs0LPBA6X/snOt3Anv5lmae0SS4rO4dqQboSbCF6vB9Khn e33g== X-Gm-Message-State: AGi0PublXdlmAxJmcKMyJFc1Vnm264chxaCnZF14CGrZOz0QEwu4CYiV 70al+7EaENQENOvVYuP5Dec= X-Google-Smtp-Source: APiQypKq8+jtv0Q/NV2LrpyBDYSsHPfFT/whqczgR8y9i6f99WhF7FOL1jIK86Fj0VjeJlArZIpYJg== X-Received: by 2002:a17:902:7c06:: with SMTP id x6mr8893137pll.178.1586837948804; Mon, 13 Apr 2020 21:19:08 -0700 (PDT) Received: from 42.do-not-panic.com (42.do-not-panic.com. [157.230.128.187]) by smtp.gmail.com with ESMTPSA id f9sm10771723pjt.45.2020.04.13.21.19.04 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 13 Apr 2020 21:19:04 -0700 (PDT) Received: by 42.do-not-panic.com (Postfix, from userid 1000) id EABDC41381; Tue, 14 Apr 2020 04:19:03 +0000 (UTC) From: Luis Chamberlain To: axboe@kernel.dk, viro@zeniv.linux.org.uk, bvanassche@acm.org, gregkh@linuxfoundation.org, rostedt@goodmis.org, mingo@redhat.com, jack@suse.cz, ming.lei@redhat.com, nstange@suse.de, akpm@linux-foundation.org Cc: mhocko@suse.com, yukuai3@huawei.com, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Luis Chamberlain , Omar Sandoval , Hannes Reinecke , Michal Hocko , syzbot+603294af2d01acfdd6da@syzkaller.appspotmail.com Subject: [PATCH 2/5] blktrace: fix debugfs use after free Date: Tue, 14 Apr 2020 04:18:59 +0000 Message-Id: <20200414041902.16769-3-mcgrof@kernel.org> X-Mailer: git-send-email 2.23.0.rc1 
In-Reply-To: <20200414041902.16769-1-mcgrof@kernel.org>
References: <20200414041902.16769-1-mcgrof@kernel.org>
MIME-Version: 1.0
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-fsdevel@vger.kernel.org

In commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory"),
merged in v4.12, Omar fixed the original blktrace code for request-based
drivers (multiqueue). This, however, left in place a possible crash if you
happen to abuse blktrace in a way that was not intended.

Namely, if you loop adding a device, set up blktrace with BLKTRACESETUP,
forget to BLKTRACETEARDOWN, and then just remove the device, you end up
with a panic:

[ 107.193134] debugfs: Directory 'loop0' with parent 'block' already present!
[ 107.254615] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[ 107.258785] #PF: supervisor write access in kernel mode
[ 107.262035] #PF: error_code(0x0002) - not-present page
[ 107.264106] PGD 0 P4D 0
[ 107.264404] Oops: 0002 [#1] SMP NOPTI
[ 107.264803] CPU: 8 PID: 674 Comm: kworker/8:2 Tainted: G E 5.6.0-rc7-next-20200327 #1
[ 107.265712] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[ 107.266553] Workqueue: events __blk_release_queue
[ 107.267051] RIP: 0010:down_write+0x15/0x40
[ 107.267488] Code: eb ca e8 ee a5 8d ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00 00 00 48 0f b1 55 00 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89 45 08 5d
[ 107.269300] RSP: 0018:ffff9927c06efda8 EFLAGS: 00010246
[ 107.269841] RAX: 0000000000000000 RBX: ffff8be7e73b0600 RCX: ffffff8100000000
[ 107.270559] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
[ 107.271281] RBP: 00000000000000a0 R08: ffff8be7ebc80fa8 R09: ffff8be7ebc80fa8
[ 107.272001] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 107.272722] R13: ffff8be7efc30400 R14: ffff8be7e0571200 R15: 00000000000000a0
[ 107.273475] FS: 0000000000000000(0000) GS:ffff8be7efc00000(0000) knlGS:0000000000000000
[ 107.274346] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 107.274968] CR2: 00000000000000a0 CR3: 000000042abee003 CR4: 0000000000360ee0
[ 107.275710] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 107.276465] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 107.277214] Call Trace:
[ 107.277532] simple_recursive_removal+0x4e/0x2e0
[ 107.278049] ? debugfs_remove+0x60/0x60
[ 107.278493] debugfs_remove+0x40/0x60
[ 107.278922] blk_trace_free+0xd/0x50
[ 107.279339] __blk_trace_remove+0x27/0x40
[ 107.279797] blk_trace_shutdown+0x30/0x40
[ 107.280256] __blk_release_queue+0xab/0x110
[ 107.280734] process_one_work+0x1b4/0x380
[ 107.281194] worker_thread+0x50/0x3c0
[ 107.281622] kthread+0xf9/0x130
[ 107.281994] ? process_one_work+0x380/0x380
[ 107.282467] ? kthread_park+0x90/0x90
[ 107.282895] ret_from_fork+0x1f/0x40
[ 107.283316] Modules linked in: loop(E)
[ 107.288562] CR2: 00000000000000a0
[ 107.288957] ---[ end trace b885d243d441bbce ]---

This splat is very similar to the one reported via kernel.org
korg#205713 [0], except that korg#205713 was for v4.19.83 and the above
now includes simple_recursive_removal(), introduced via commit
a3d1e7eb5abe ("simple_recursive_removal(): kernel-side rm -rf for
ramfs-style filesystems") merged in v5.6.

korg#205713 was then used to create CVE-2019-19770 [1], which claims that
the bug is a use-after-free in the debugfs core code. The implications of
this being a generic UAF in debugfs would be much more severe, as it would
imply that parent dentries can sometimes not be positive, which we hold is
not possible by design.

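For illustration only: this commit message attributes the crash to debugfs_lookup() returning a same-named directory left behind by a previous incarnation of the device, whose removal is still pending. The userspace sketch below models just that lifetime mismatch. It is a toy under stated assumptions, not kernel code: every model_* name is hypothetical, and it lets two directories share a name, which real debugfs refuses (it logs the "already present" warning instead). It contrasts choosing the trace directory by name with using a directory owned by the current queue, which is roughly what switching blktrace to q->debugfs_dir achieves.

```c
/*
 * Userspace model only. All model_* names are hypothetical and none of
 * this is kernel API; it imitates a name-based directory lookup racing
 * with a deferred removal.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_DIRS 8

struct model_dir {
	const char *name;
	int owner;	/* queue incarnation that created the directory */
	bool live;	/* false once the directory has been removed */
};

static struct model_dir model_dirs[MODEL_MAX_DIRS];
static size_t model_nr_dirs;

static void model_reset(void)
{
	memset(model_dirs, 0, sizeof(model_dirs));
	model_nr_dirs = 0;
}

/* Create a directory on behalf of one queue incarnation. */
static struct model_dir *model_create(const char *name, int owner)
{
	struct model_dir *d;

	assert(model_nr_dirs < MODEL_MAX_DIRS);
	d = &model_dirs[model_nr_dirs++];
	d->name = name;
	d->owner = owner;
	d->live = true;
	return d;
}

/* Name-based lookup: first live directory with a matching name. */
static struct model_dir *model_lookup(const char *name)
{
	for (size_t i = 0; i < model_nr_dirs; i++)
		if (model_dirs[i].live && strcmp(model_dirs[i].name, name) == 0)
			return &model_dirs[i];
	return NULL;
}

static void model_remove(struct model_dir *d)
{
	d->live = false;
}

/*
 * Lookup scheme: the tracer of incarnation 2 picks its directory by name
 * and is handed the directory of incarnation 1, whose removal is pending.
 */
bool model_trace_dir_survives_lookup(void)
{
	struct model_dir *old, *trace_dir;

	model_reset();
	old = model_create("loop0", 1);
	/* Device deleted here; removal of 'old' is deferred, not yet run. */
	trace_dir = model_lookup("loop0");
	model_remove(old);		/* the deferred removal finally runs */
	return trace_dir->live;		/* tracer is left with a dead directory */
}

/*
 * Owned scheme: incarnation 2 has its own directory and the tracer uses
 * that pointer, so the removal belonging to incarnation 1 cannot touch it.
 */
bool model_trace_dir_survives_owned(void)
{
	struct model_dir *old, *mine, *trace_dir;

	model_reset();
	old = model_create("loop0", 1);
	mine = model_create("loop0", 2);	/* duplicate name: model-only shortcut */
	trace_dir = mine;
	model_remove(old);
	return trace_dir->live;
}
```

Calling both functions from a small main() is enough to see the difference: under the lookup scheme the tracer's directory does not survive the deferred removal, while under the owned scheme it does.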
Below is the splat explained in a bit more detail, showing what is
happening in userspace and in the kernel, along with the CPU on which the
code runs:

load loopback module
[ 13.603371] == blk_mq_debugfs_register(12) start
[ 13.604040] == blk_mq_debugfs_register(12) q->debugfs_dir created
[ 13.604934] == blk_mq_debugfs_register(12) end
[ 13.627382] == blk_mq_debugfs_register(12) start
[ 13.628041] == blk_mq_debugfs_register(12) q->debugfs_dir created
[ 13.629240] == blk_mq_debugfs_register(12) end
[ 13.651667] == blk_mq_debugfs_register(12) start
[ 13.652836] == blk_mq_debugfs_register(12) q->debugfs_dir created
[ 13.655107] == blk_mq_debugfs_register(12) end
[ 13.684917] == blk_mq_debugfs_register(12) start
[ 13.687876] == blk_mq_debugfs_register(12) q->debugfs_dir created
[ 13.691588] == blk_mq_debugfs_register(13) end
[ 13.707320] == blk_mq_debugfs_register(13) start
[ 13.707863] == blk_mq_debugfs_register(13) q->debugfs_dir created
[ 13.708856] == blk_mq_debugfs_register(13) end
[ 13.735623] == blk_mq_debugfs_register(13) start
[ 13.736656] == blk_mq_debugfs_register(13) q->debugfs_dir created
[ 13.738411] == blk_mq_debugfs_register(13) end
[ 13.763326] == blk_mq_debugfs_register(13) start
[ 13.763972] == blk_mq_debugfs_register(13) q->debugfs_dir created
[ 13.765167] == blk_mq_debugfs_register(13) end
[ 13.779510] == blk_mq_debugfs_register(13) start
[ 13.780522] == blk_mq_debugfs_register(13) q->debugfs_dir created
[ 13.782338] == blk_mq_debugfs_register(13) end
[ 13.783521] loop: module loaded

LOOP_CTL_DEL(loop0) #1
[ 13.803550] = __blk_release_queue(4) start
[ 13.807772] == blk_trace_shutdown(4) start
[ 13.810749] == blk_trace_shutdown(4) end
[ 13.813437] = __blk_release_queue(4) calling blk_mq_debugfs_unregister()
[ 13.817593] ==== blk_mq_debugfs_unregister(4) begin
[ 13.817621] ==== blk_mq_debugfs_unregister(4) debugfs_remove_recursive(q->debugfs_dir)
[ 13.821203] ==== blk_mq_debugfs_unregister(4) end q->debugfs_dir is NULL
[ 13.826166] = __blk_release_queue(4) blk_mq_debugfs_unregister() end
[ 13.832992] = __blk_release_queue(4) end

LOOP_CTL_ADD(loop0) #1
[ 13.843742] == blk_mq_debugfs_register(7) start
[ 13.845569] == blk_mq_debugfs_register(7) q->debugfs_dir created
[ 13.848628] == blk_mq_debugfs_register(7) end

BLKTRACE_SETUP(loop0) #1
[ 13.850924] == blk_trace_ioctl(7, BLKTRACESETUP) start
[ 13.852852] === do_blk_trace_setup(7) start
[ 13.854580] === do_blk_trace_setup(7) creating directory
[ 13.856620] === do_blk_trace_setup(7) using what debugfs_lookup() gave
[ 13.860635] === do_blk_trace_setup(7) end with ret: 0
[ 13.862615] == blk_trace_ioctl(7, BLKTRACESETUP) end

LOOP_CTL_DEL(loop0) #2
[ 13.883304] = __blk_release_queue(7) start
[ 13.885324] == blk_trace_shutdown(7) start
[ 13.887197] == blk_trace_shutdown(7) calling __blk_trace_remove()
[ 13.889807] == __blk_trace_remove(7) start
[ 13.891669] === blk_trace_cleanup(7) start
[ 13.911656] ====== blk_trace_free(7) start

LOOP_CTL_ADD(loop0) #2
[ 13.912709] == blk_mq_debugfs_register(2) start

---> From LOOP_CTL_DEL(loop0) #2
[ 13.915887] ====== blk_trace_free(7) end

---> From LOOP_CTL_ADD(loop0) #2
[ 13.918359] debugfs: Directory 'loop0' with parent 'block' already present!
[ 13.926433] == blk_mq_debugfs_register(2) q->debugfs_dir created
[ 13.930373] == blk_mq_debugfs_register(2) end

BLKTRACE_SETUP(loop0) #2
[ 13.933961] == blk_trace_ioctl(2, BLKTRACESETUP) start
[ 13.936758] === do_blk_trace_setup(2) start
[ 13.938944] === do_blk_trace_setup(2) creating directory
[ 13.941029] === do_blk_trace_setup(2) using what debugfs_lookup() gave

---> From LOOP_CTL_DEL(loop0) #2
[ 13.971046] === blk_trace_cleanup(7) end
[ 13.973175] == __blk_trace_remove(7) end
[ 13.975352] == blk_trace_shutdown(7) end
[ 13.977415] = __blk_release_queue(7) calling blk_mq_debugfs_unregister()
[ 13.980645] ==== blk_mq_debugfs_unregister(7) begin
[ 13.980696] ==== blk_mq_debugfs_unregister(7) debugfs_remove_recursive(q->debugfs_dir)
[ 13.983118] ==== blk_mq_debugfs_unregister(7) end q->debugfs_dir is NULL
[ 13.986945] = __blk_release_queue(7) blk_mq_debugfs_unregister() end
[ 13.993155] = __blk_release_queue(7) end

---> From BLKTRACE_SETUP(loop0) #2
[ 13.995928] === do_blk_trace_setup(2) end with ret: 0
[ 13.997623] == blk_trace_ioctl(2, BLKTRACESETUP) end

LOOP_CTL_DEL(loop0) #3
[ 14.035119] = __blk_release_queue(2) start
[ 14.036925] == blk_trace_shutdown(2) start
[ 14.038518] == blk_trace_shutdown(2) calling __blk_trace_remove()
[ 14.040829] == __blk_trace_remove(2) start
[ 14.042413] === blk_trace_cleanup(2) start

LOOP_CTL_ADD(loop0) #3
[ 14.072522] == blk_mq_debugfs_register(6) start

---> From LOOP_CTL_DEL(loop0) #3
[ 14.075151] ====== blk_trace_free(2) start

---> From LOOP_CTL_ADD(loop0) #3
[ 14.075882] == blk_mq_debugfs_register(6) q->debugfs_dir created

---> From LOOP_CTL_DEL(loop0) #3
[ 14.078624] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[ 14.084332] == blk_mq_debugfs_register(6) end
[ 14.086971] #PF: supervisor write access in kernel mode
[ 14.086974] #PF: error_code(0x0002) - not-present page
[ 14.086977] PGD 0 P4D 0
[ 14.086984] Oops: 0002 [#1] SMP NOPTI
[ 14.086990] CPU: 2 PID: 287 Comm: kworker/2:2 Tainted: G E 5.6.0-next-20200403+ #54
[ 14.086991] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[ 14.087002] Workqueue: events __blk_release_queue
[ 14.087011] RIP: 0010:down_write+0x15/0x40
[ 14.090300] == blk_trace_ioctl(6, BLKTRACESETUP) start
[ 14.093277] Code: eb ca e8 3e 34 8d ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00 00 00 48 0f b1 55 00 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89 45 08 5d
[ 14.093280] RSP: 0018:ffffc28a00533da8 EFLAGS: 00010246
[ 14.093284] RAX: 0000000000000000 RBX: ffff9f7a24d07980 RCX: ffffff8100000000
[ 14.093286] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
[ 14.093287] RBP: 00000000000000a0 R08: 0000000000000000 R09: 0000000000000019
[ 14.093289] R10: 0000000000000774 R11: 0000000000000000 R12: 0000000000000000
[ 14.093291] R13: ffff9f7a2fab0400 R14: ffff9f7a21dd1140 R15: 00000000000000a0
[ 14.093294] FS: 0000000000000000(0000) GS:ffff9f7a2fa80000(0000) knlGS:0000000000000000
[ 14.093296] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 14.093298] CR2: 00000000000000a0 CR3: 00000004293d2003 CR4: 0000000000360ee0
[ 14.093307] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 14.093308] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 14.093310] Call Trace:
[ 14.093324] simple_recursive_removal+0x4e/0x2e0
[ 14.093330] ? debugfs_remove+0x60/0x60
[ 14.093334] debugfs_remove+0x40/0x60
[ 14.093339] blk_trace_free+0x20/0x70
[ 14.093346] __blk_trace_remove+0x54/0x90
[ 14.096704] === do_blk_trace_setup(6) start
[ 14.098534] blk_trace_shutdown+0x74/0x80
[ 14.100958] === do_blk_trace_setup(6) creating directory
[ 14.104575] __blk_release_queue+0xbe/0x160
[ 14.104580] process_one_work+0x1b4/0x380
[ 14.104585] worker_thread+0x50/0x3c0
[ 14.104589] kthread+0xf9/0x130
[ 14.104593] ? process_one_work+0x380/0x380
[ 14.104596] ? kthread_park+0x90/0x90
[ 14.104599] ret_from_fork+0x1f/0x40
[ 14.104603] Modules linked in: loop(E) xfs(E) libcrc32c(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) joydev(E) serio_raw(E) aesni_intel(E) glue_helper(E) virtio_balloon(E) evdev(E) crypto_simd(E) pcspkr(E) cryptd(E) i6300esb(E) button(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc32c_generic(E) crc16(E) mbcache(E) jbd2(E) virtio_net(E) net_failover(E) failover(E) virtio_blk(E) ata_generic(E) uhci_hcd(E) ata_piix(E) ehci_hcd(E) nvme(E) libata(E) crc32c_intel(E) usbcore(E) psmouse(E) nvme_core(E) virtio_pci(E) scsi_mod(E) virtio_ring(E) t10_pi(E) virtio(E) i2c_piix4(E) floppy(E)
[ 14.107400] === do_blk_trace_setup(6) using what debugfs_lookup() gave
[ 14.108939] CR2: 00000000000000a0
[ 14.110589] === do_blk_trace_setup(6) end with ret: 0
[ 14.111592] ---[ end trace 7a783b33b9614db9 ]---

The root cause of this issue is that debugfs_lookup() can find a previous
incarnation's directory of the same name, one which is about to be removed
by deferred work that has not yet run. We can fix the UAF by simply using a
debugfs directory which, moving forward, will always be accessible if
debugfs is enabled; this way, it is always allocated and available for both
request-based (multiqueue) block drivers and make_request-based block
drivers.

This simplifies the code considerably, with the only penalty now being
that we're always creating the request queue debugfs directory for the
request-based block device drivers.

The UAF, then, is not a core debugfs issue but a misuse of debugfs, and it
can only be triggered if you are root and misuse blktrace.

This issue can be reproduced with break-blktrace [2] using:

  break-blktrace -c 10 -d -s

This patch fixes this issue. Note that there is also a related UAF
reachable from the ioctl path [3]; this patch should fix that issue as
well.

This patch then also disputes the severity of CVE-2019-19770, as this
issue is only possible by being root and using blktrace.
It is not a core debugfs issue. [0] https://bugzilla.kernel.org/show_bug.cgi?id=205713 [1] https://nvd.nist.gov/vuln/detail/CVE-2019-19770 [2] https://github.com/mcgrof/break-blktrace [3] https://lore.kernel.org/lkml/000000000000ec635b059f752700@google.com/ Cc: Bart Van Assche Cc: Omar Sandoval Cc: Hannes Reinecke Cc: Nicolai Stange Cc: Greg Kroah-Hartman Cc: Michal Hocko Cc: yu kuai Reported-by: syzbot+603294af2d01acfdd6da@syzkaller.appspotmail.com Fixes: 6ac93117ab00 ("blktrace: use existing disk debugfs directory") Signed-off-by: Luis Chamberlain Reviewed-by: Greg Kroah-Hartman Reviewed-by: Christoph Hellwig Reviewed-by: Bart Van Assche --- block/blk-debugfs.c | 12 ++++++++++++ block/blk-mq-debugfs.c | 5 ----- block/blk-sysfs.c | 3 +++ block/blk.h | 10 ++++++++++ include/linux/blkdev.h | 5 ++++- include/linux/blktrace_api.h | 1 - kernel/trace/blktrace.c | 19 ++++++++----------- 7 files changed, 37 insertions(+), 18 deletions(-) diff --git a/block/blk-debugfs.c b/block/blk-debugfs.c index 19091e1effc0..db982688cf46 100644 --- a/block/blk-debugfs.c +++ b/block/blk-debugfs.c @@ -13,3 +13,15 @@ void blk_debugfs_register(void) { blk_debugfs_root = debugfs_create_dir("block", NULL); } + +void blk_queue_debugfs_register(struct request_queue *q) +{ + q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent), + blk_debugfs_root); +} + +void blk_queue_debugfs_unregister(struct request_queue *q) +{ + debugfs_remove_recursive(q->debugfs_dir); + q->debugfs_dir = NULL; +} diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c index b3f2ba483992..bda9378eab90 100644 --- a/block/blk-mq-debugfs.c +++ b/block/blk-mq-debugfs.c @@ -823,9 +823,6 @@ void blk_mq_debugfs_register(struct request_queue *q) struct blk_mq_hw_ctx *hctx; int i; - q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent), - blk_debugfs_root); - debugfs_create_files(q->debugfs_dir, q, blk_mq_debugfs_queue_attrs); /* @@ -856,9 +853,7 @@ void blk_mq_debugfs_register(struct request_queue *q) 
void blk_mq_debugfs_unregister(struct request_queue *q) { - debugfs_remove_recursive(q->debugfs_dir); q->sched_debugfs_dir = NULL; - q->debugfs_dir = NULL; } static void blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx, diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index fca9b158f4a0..0285d67e1e4c 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -895,6 +895,7 @@ static void __blk_release_queue(struct work_struct *work) blk_trace_shutdown(q); + blk_queue_debugfs_unregister(q); if (queue_is_mq(q)) blk_mq_debugfs_unregister(q); @@ -975,6 +976,8 @@ int blk_register_queue(struct gendisk *disk) goto unlock; } + blk_queue_debugfs_register(q); + if (queue_is_mq(q)) { __blk_mq_register_dev(dev, q); blk_mq_debugfs_register(q); diff --git a/block/blk.h b/block/blk.h index 86a66b614f08..813b8513fc1a 100644 --- a/block/blk.h +++ b/block/blk.h @@ -489,10 +489,20 @@ int __bio_add_pc_page(struct request_queue *q, struct bio *bio, bool *same_page); #ifdef CONFIG_DEBUG_FS void blk_debugfs_register(void); +void blk_queue_debugfs_register(struct request_queue *q); +void blk_queue_debugfs_unregister(struct request_queue *q); #else static inline void blk_debugfs_register(void) { } + +static inline void blk_queue_debugfs_register(struct request_queue *q) +{ +} + +static inline void blk_queue_debugfs_unregister(struct request_queue *q) +{ +} #endif /* CONFIG_DEBUG_FS */ #endif /* BLK_INTERNAL_H */ diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 32868fbedc9e..cc43c8e6516c 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -569,8 +569,11 @@ struct request_queue { struct list_head tag_set_list; struct bio_set bio_split; -#ifdef CONFIG_BLK_DEBUG_FS +#ifdef CONFIG_DEBUG_FS + /* Used by block/blk-*debugfs.c and kernel/trace/blktrace.c */ struct dentry *debugfs_dir; +#endif +#ifdef CONFIG_BLK_DEBUG_FS struct dentry *sched_debugfs_dir; struct dentry *rqos_debugfs_dir; #endif diff --git a/include/linux/blktrace_api.h 
b/include/linux/blktrace_api.h index 3b6ff5902edc..eb6db276e293 100644 --- a/include/linux/blktrace_api.h +++ b/include/linux/blktrace_api.h @@ -22,7 +22,6 @@ struct blk_trace { u64 end_lba; u32 pid; u32 dev; - struct dentry *dir; struct dentry *dropped_file; struct dentry *msg_file; struct list_head running_list; diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c index ca39dc3230cb..15086227592f 100644 --- a/kernel/trace/blktrace.c +++ b/kernel/trace/blktrace.c @@ -311,7 +311,6 @@ static void blk_trace_free(struct blk_trace *bt) debugfs_remove(bt->msg_file); debugfs_remove(bt->dropped_file); relay_close(bt->rchan); - debugfs_remove(bt->dir); free_percpu(bt->sequence); free_percpu(bt->msg_data); kfree(bt); @@ -476,7 +475,6 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev, struct blk_user_trace_setup *buts) { struct blk_trace *bt = NULL; - struct dentry *dir = NULL; int ret; if (!buts->buf_size || !buts->buf_nr) @@ -485,6 +483,9 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev, if (!blk_debugfs_root) return -ENOENT; + if (!q->debugfs_dir) + return -ENOENT; + strncpy(buts->name, name, BLKTRACE_BDEV_SIZE); buts->name[BLKTRACE_BDEV_SIZE - 1] = '\0'; @@ -509,21 +510,19 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev, ret = -ENOENT; - dir = debugfs_lookup(buts->name, blk_debugfs_root); - if (!dir) - bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root); - bt->dev = dev; atomic_set(&bt->dropped, 0); INIT_LIST_HEAD(&bt->running_list); ret = -EIO; - bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt, + bt->dropped_file = debugfs_create_file("dropped", 0444, + q->debugfs_dir, bt, &blk_dropped_fops); - bt->msg_file = debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops); + bt->msg_file = debugfs_create_file("msg", 0222, q->debugfs_dir, + bt, &blk_msg_fops); - bt->rchan = relay_open("trace", dir, buts->buf_size, + bt->rchan = relay_open("trace", 
q->debugfs_dir, buts->buf_size, buts->buf_nr, &blk_relay_callbacks, bt); if (!bt->rchan) goto err; @@ -551,8 +550,6 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev, ret = 0; err: - if (dir && !bt->dir) - dput(dir); if (ret) blk_trace_free(bt); return ret; From patchwork Tue Apr 14 04:19:00 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Luis Chamberlain X-Patchwork-Id: 11486745 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 01F1F174A for ; Tue, 14 Apr 2020 04:19:50 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id DD22020737 for ; Tue, 14 Apr 2020 04:19:49 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1586837989; bh=hY7cXaqKZkHpkAUvflNVAu1tv2xIynJSqL2ycOlCRbI=; h=From:To:Cc:Subject:Date:In-Reply-To:References:List-ID:From; b=cmO+7Fskt7lQaUk1vMFWfdfRWlbVISKbAEkkQySEdi/B9qYVoMc2UzSEWXjX8ze9n 4Cucei8At3EgItsi7GVoatudQOyr4xhQVHkXf0eNJTLVn9C285V+nxJZOyPqg9w8R1 XSSQyJ35waS/IgI/ijAGXzNX8F8AWGhIsXz1tdg4= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2405267AbgDNETZ (ORCPT ); Tue, 14 Apr 2020 00:19:25 -0400 Received: from mail-pl1-f194.google.com ([209.85.214.194]:39451 "EHLO mail-pl1-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2405229AbgDNETI (ORCPT ); Tue, 14 Apr 2020 00:19:08 -0400 Received: by mail-pl1-f194.google.com with SMTP id k18so4199635pll.6; Mon, 13 Apr 2020 21:19:08 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=NmUdDstD3GuaW3cnQoQAeoBXgYPgT0T/dDdxzI7Fmbk=; 
b=LzmablTY1kDvE+nUb5xx4QvGVP+7JaFXK0ZSvAC+5eFkTO5hr4sEq+c0tkSMlJOQBM hFDAmE9Msl0jotMUmumN75q12nfDeOc35/jLUOC/G/s0wEtbCah4NQrCSjbizydoWbNE ltW92IwA+LYdnWMuOIdLnjg3HtTZmaoIh9GP8OxTtMXtuuNwPTDOkIYjP3Xb0S636Gh5 +EAugLWzeEg4nNyxVNnnbyG357fBOGU5o5ESxdkheCL2TdRH2AUFBXS1rTOJEyO1Zohq m5/AWdkqA+NM4XF02ZB2t6K+OvXpBGfwfgBhiB74rAXD4oNHvkTs0UTGBSSRHfTJtXOk SH/g== X-Gm-Message-State: AGi0PuZ/PE8pM1PAf9fPvfRMcWUc3Vwq+WeG+R+510D/KPrEUHIulUKa 7zbjIKWAM0n2yjVPWEFizEI= X-Google-Smtp-Source: APiQypIsVkYIkhxCV56xtOZaGPp9eSaAdDaWC4rfpGfLdLU2ldMd52BYaibw40sfHKHEpTAKk4IcLg== X-Received: by 2002:a17:90a:82:: with SMTP id a2mr27017334pja.47.1586837947821; Mon, 13 Apr 2020 21:19:07 -0700 (PDT) Received: from 42.do-not-panic.com (42.do-not-panic.com. [157.230.128.187]) by smtp.gmail.com with ESMTPSA id j1sm3878077pgk.23.2020.04.13.21.19.04 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 13 Apr 2020 21:19:04 -0700 (PDT) Received: by 42.do-not-panic.com (Postfix, from userid 1000) id 0A76D418C0; Tue, 14 Apr 2020 04:19:04 +0000 (UTC) From: Luis Chamberlain To: axboe@kernel.dk, viro@zeniv.linux.org.uk, bvanassche@acm.org, gregkh@linuxfoundation.org, rostedt@goodmis.org, mingo@redhat.com, jack@suse.cz, ming.lei@redhat.com, nstange@suse.de, akpm@linux-foundation.org Cc: mhocko@suse.com, yukuai3@huawei.com, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Luis Chamberlain , Omar Sandoval , Hannes Reinecke , Michal Hocko Subject: [PATCH 3/5] blktrace: refcount the request_queue during ioctl Date: Tue, 14 Apr 2020 04:19:00 +0000 Message-Id: <20200414041902.16769-4-mcgrof@kernel.org> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20200414041902.16769-1-mcgrof@kernel.org> References: <20200414041902.16769-1-mcgrof@kernel.org> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Ensure that the request_queue is refcounted during its full ioctl 
cycle. This avoids possible races against removal, given blk_get_queue() also checks to ensure the queue is not dying. This small race is possible if you defer removal of the request_queue and userspace fires off an ioctl for the device in the meantime. Cc: Bart Van Assche Cc: Omar Sandoval Cc: Hannes Reinecke Cc: Nicolai Stange Cc: Greg Kroah-Hartman Cc: Michal Hocko Cc: yu kuai Reviewed-by: Bart Van Assche Signed-off-by: Luis Chamberlain --- kernel/trace/blktrace.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c index 15086227592f..17e144d15779 100644 --- a/kernel/trace/blktrace.c +++ b/kernel/trace/blktrace.c @@ -701,6 +701,9 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg) if (!q) return -ENXIO; + if (!blk_get_queue(q)) + return -ENXIO; + mutex_lock(&q->blk_trace_mutex); switch (cmd) { @@ -729,6 +732,9 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg) } mutex_unlock(&q->blk_trace_mutex); + + blk_put_queue(q); + return ret; } From patchwork Tue Apr 14 04:19:01 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Luis Chamberlain X-Patchwork-Id: 11486749 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 3F855186E for ; Tue, 14 Apr 2020 04:19:51 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 26AFD2078A for ; Tue, 14 Apr 2020 04:19:51 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1586837991; bh=35H3eCm4LRFYTA74DGCmPAOYtjpvsQRPTtdRCaSvfzI=; h=From:To:Cc:Subject:Date:In-Reply-To:References:List-ID:From; b=JeijaUNFEGYq5objL0Nt9KwdH3Fnam1nBnWrsMzK4ILZ/j1Qbcmq0QiQhu7sQzSma K7AQPw9sCxIt4n5dxGCEOR0xvbUcNpJcDqeW4zkkf5GPQgEqLBZyDbuXm1qTiPNWfn 
Utv3JW8k6YJRVSDlfz9HS74fC3dpD75C1Fg6EMOM= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2405261AbgDNETZ (ORCPT ); Tue, 14 Apr 2020 00:19:25 -0400 Received: from mail-pg1-f196.google.com ([209.85.215.196]:39409 "EHLO mail-pg1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2405233AbgDNETK (ORCPT ); Tue, 14 Apr 2020 00:19:10 -0400 Received: by mail-pg1-f196.google.com with SMTP id g32so5456017pgb.6; Mon, 13 Apr 2020 21:19:10 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=y7GCMwZ+500s2UeOg6j11+a/fStJc20VLvn3M0bWMqk=; b=EkynL942Np0ynFvupyxP8+dtig/dz107SvF2dke+wp8Eys1WQF9p0GpIEJ5KqgyOGS FwWfIywhE7yPHcJShqHpl/CXX+gz4GThF3vh69hKPxG2KaFHNYJfa7hy0L8OdcnLYuFX Jr40/vrMWP/RJ07ucVIB4FgSoS4JsZjsrtwOw4rqi/JtGsyFxZ0ky4zemwwHvX3WCVAu oKAjsRbcHVOB00cl9ZRBIWQa71Ga2EE/sdkZSEFYpOw0fEEYKgF2/TDuWiiaHmu4LW1N gxONSH1PcIhxWsxcu+Da0X6edcnuR85mZbbT8rEFJwh17eCblZr96fdKM7i/FX85NNCq DPkQ== X-Gm-Message-State: AGi0Publ4IUI+3MUXeeDh17nqT8rAUyFwwiEu0C4fpzLTjHKEhr8zcnA rVdvUaOb6HWsDqfS1pTaQKs= X-Google-Smtp-Source: APiQypIhm3o5ig0Vl42dVHSMw8R//BDfhQcbCw+CHlsRTU54HdyPnw18cejKvzwRFkPw0xmS9mvYAQ== X-Received: by 2002:aa7:9207:: with SMTP id 7mr20846200pfo.178.1586837949770; Mon, 13 Apr 2020 21:19:09 -0700 (PDT) Received: from 42.do-not-panic.com (42.do-not-panic.com. 
[157.230.128.187]) by smtp.gmail.com with ESMTPSA id a1sm9983484pfl.188.2020.04.13.21.19.04 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 13 Apr 2020 21:19:06 -0700 (PDT) Received: by 42.do-not-panic.com (Postfix, from userid 1000) id 1DA65419AC; Tue, 14 Apr 2020 04:19:04 +0000 (UTC) From: Luis Chamberlain To: axboe@kernel.dk, viro@zeniv.linux.org.uk, bvanassche@acm.org, gregkh@linuxfoundation.org, rostedt@goodmis.org, mingo@redhat.com, jack@suse.cz, ming.lei@redhat.com, nstange@suse.de, akpm@linux-foundation.org Cc: mhocko@suse.com, yukuai3@huawei.com, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Luis Chamberlain , Omar Sandoval , Hannes Reinecke , Michal Hocko Subject: [PATCH 4/5] mm/swapfile: refcount block and queue before using blkcg_schedule_throttle() Date: Tue, 14 Apr 2020 04:19:01 +0000 Message-Id: <20200414041902.16769-5-mcgrof@kernel.org> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20200414041902.16769-1-mcgrof@kernel.org> References: <20200414041902.16769-1-mcgrof@kernel.org> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

Block devices are refcounted so that once the final user goes away the device can be cleaned up properly by the lower layers. The block device's request_queue structure is also refcounted; however, if the last blk_put_queue() is called in atomic context, the block layer has to defer removal. By refcounting the block device during the use of blkcg_schedule_throttle(), we ensure two things:

1) the block device remains available during the call
2) we avoid having to deal with the fact that we're using the request_queue structure in atomic context, since the last blk_put_queue() will be called upon disk_release(), *after* our own bdput().
This means this code path is *not* going to remove the request_queue structure, as we are ensuring some later upper layer disk_release() will be the one to release the request_queue structure for us. Cc: Bart Van Assche Cc: Omar Sandoval Cc: Hannes Reinecke Cc: Nicolai Stange Cc: Greg Kroah-Hartman Cc: Michal Hocko Cc: yu kuai Signed-off-by: Luis Chamberlain --- mm/swapfile.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 6659ab563448..9285ff6030ca 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3753,6 +3753,7 @@ static void free_swap_count_continuations(struct swap_info_struct *si) void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node, gfp_t gfp_mask) { + struct block_device *bdev; struct swap_info_struct *si, *next; if (!(gfp_mask & __GFP_IO) || !memcg) return; @@ -3771,8 +3772,17 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node, plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) { if (si->bdev) { - blkcg_schedule_throttle(bdev_get_queue(si->bdev), - true); + bdev = bdgrab(si->bdev); + if (!bdev) + continue; + /* + * By adding our own bdgrab() we ensure the queue + * sticks around until disk_release(), and so we ensure + * our release of the request_queue does not happen in + * atomic context. 
+ */ + blkcg_schedule_throttle(bdev_get_queue(bdev), true); + bdput(bdev); break; } } From patchwork Tue Apr 14 04:19:02 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Luis Chamberlain X-Patchwork-Id: 11486759 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id AFB55186E for ; Tue, 14 Apr 2020 04:19:57 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 988E12085B for ; Tue, 14 Apr 2020 04:19:57 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1586837997; bh=xBmhkAHTTgQMqBOpnEPowNyWTGR4WhGaURiaiNDipB4=; h=From:To:Cc:Subject:Date:In-Reply-To:References:List-ID:From; b=YZg0zgXNlXlKqsMtBOixKFUV3E3HD8hOBCMNPWnjARybOVqoU1xErIgfRt50Av9TD GKBA0jQsPn2UHP5gaccDERr/qg210cvppANKDqo14MDV4WxVS7H3LwbSVhxxvrfFQ6 vGGlJ2VmsXyfUZJzlrZ5R1DzHDBL+XsZfljeCJjY= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2405249AbgDNETX (ORCPT ); Tue, 14 Apr 2020 00:19:23 -0400 Received: from mail-pl1-f196.google.com ([209.85.214.196]:35289 "EHLO mail-pl1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2405223AbgDNETL (ORCPT ); Tue, 14 Apr 2020 00:19:11 -0400 Received: by mail-pl1-f196.google.com with SMTP id y12so3899772pll.2; Mon, 13 Apr 2020 21:19:11 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=AAvWimIn06L/ckTvHYJdjtyNjcqrQOPPMI8XJSOmgvI=; b=TIfxM8D5oRdCqOxeW1d9FUPdVVUcj7Zy044/kNNsybCmbboWxmLksiV+i6kkxjGbg2 Op6uamCE+2oMyM/XwXSpiujQPMMApKgivdhBBemZ+meEmymc1SApQRhRxi4v4dacqLIc Ki7RJRnh9Y0mF70AJRHl4k8g4HemcgteFZiUk2qx2GDoKaCHcGm6c09QU3NHBq84CS4T 
cK7g5KOehHFMQ2z7E92BfUmXYx9uk7NOTghfYqpdUoX/BV+gQhQyEezEtBnjR3CfaF16 2iIlP7l6Ze6lIaQQbr6H5SgCji4PFLX3W33J8FjAli3802C/mmBvWYcfXgaOmArflygQ zdCw== X-Gm-Message-State: AGi0PuYBMTsMaGmOVakyo4xalvtKA7oWIZ7g7VomMZ/RkxQ2JTdF4Ewn IZw6h1VONSIrZ2fVx7PAQHzWQWHJzSs= X-Google-Smtp-Source: APiQypIIV5QwziKUNQLFlwjUY+5TxLvww5Fasa/CK24h/7LqZeND5GJga3hHm43a7pojb0NsTeIXBg== X-Received: by 2002:a17:90a:a602:: with SMTP id c2mr25226081pjq.135.1586837950844; Mon, 13 Apr 2020 21:19:10 -0700 (PDT) Received: from 42.do-not-panic.com (42.do-not-panic.com. [157.230.128.187]) by smtp.gmail.com with ESMTPSA id u18sm10164611pfl.40.2020.04.13.21.19.06 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 13 Apr 2020 21:19:08 -0700 (PDT) Received: by 42.do-not-panic.com (Postfix, from userid 1000) id 35461419C3; Tue, 14 Apr 2020 04:19:04 +0000 (UTC) From: Luis Chamberlain To: axboe@kernel.dk, viro@zeniv.linux.org.uk, bvanassche@acm.org, gregkh@linuxfoundation.org, rostedt@goodmis.org, mingo@redhat.com, jack@suse.cz, ming.lei@redhat.com, nstange@suse.de, akpm@linux-foundation.org Cc: mhocko@suse.com, yukuai3@huawei.com, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Luis Chamberlain , Omar Sandoval , Hannes Reinecke , Michal Hocko Subject: [PATCH 5/5] block: revert back to synchronous request_queue removal Date: Tue, 14 Apr 2020 04:19:02 +0000 Message-Id: <20200414041902.16769-6-mcgrof@kernel.org> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20200414041902.16769-1-mcgrof@kernel.org> References: <20200414041902.16769-1-mcgrof@kernel.org> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Commit dc9edc44de6c ("block: Fix a blk_exit_rl() regression") merged on v4.12 moved the work behind blk_release_queue() into a workqueue after a splat floated around which indicated some work on blk_release_queue() could sleep in blk_exit_rl(). 
This splat would be possible when a driver called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue() as its final call) from an atomic context.

blk_put_queue() decrements the refcount for the request_queue kobject, and upon reaching 0 blk_release_queue() is called. Although blk_exit_rl() has since been removed by commit db6d9952356 ("block: remove request_list code"), we reserve the right to sleep within blk_release_queue() context. If you see no other way and *have* to be in atomic context when your driver calls the last blk_put_queue(), you can always increase your block device's reference count with bdgrab(), as this can be done in atomic context, and the request_queue removal is then left to the upper layers later. We now document this bit of tribal knowledge as well, and adjust the kdoc format a bit.

We revert back to synchronous request_queue removal because asynchronous removal creates a regression in the userspace interaction expected with several drivers. One example is the loopback driver: when userspace issues an ioctl to remove a device, then upon a successful return one expects the device to be removed. Moving to asynchronous request_queue removal could have broken many scripts which relied on the removal having completed if there was no error.

Asynchronous request_queue removal has, however, helped us find other bugs. In the future we can test what could break with this arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE.
Cc: Bart Van Assche Cc: Omar Sandoval Cc: Hannes Reinecke Cc: Nicolai Stange Cc: Greg Kroah-Hartman Cc: Michal Hocko Cc: yu kuai Suggested-by: Nicolai Stange Fixes: dc9edc44de6c ("block: Fix a blk_exit_rl() regression") Signed-off-by: Luis Chamberlain Reviewed-by: Ming Lei --- block/blk-core.c | 19 ++++++++++++++++++- block/blk-sysfs.c | 38 +++++++++++++++++--------------------- include/linux/blkdev.h | 2 -- 3 files changed, 35 insertions(+), 24 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 5aaae7a1b338..8346c7c59ee6 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -301,6 +301,17 @@ void blk_clear_pm_only(struct request_queue *q) } EXPORT_SYMBOL_GPL(blk_clear_pm_only); +/** + * blk_put_queue - decrement the request_queue refcount + * + * Decrements the refcount to the request_queue kobject, when this reaches + * 0 we'll have blk_release_queue() called. You should avoid calling + * this function in atomic context but if you really have to ensure you + * first refcount the block device with bdgrab() / bdput() so that the + * last decrement happens in blk_cleanup_queue(). + * + * @q: the request_queue structure to decrement the refcount for + */ void blk_put_queue(struct request_queue *q) { kobject_put(&q->kobj); @@ -328,10 +339,16 @@ EXPORT_SYMBOL_GPL(blk_set_queue_dying); /** * blk_cleanup_queue - shutdown a request queue - * @q: request queue to shutdown * * Mark @q DYING, drain all pending requests, mark @q DEAD, destroy and * put it. All future requests will be failed immediately with -ENODEV. + * + * You should not call this function in atomic context. If you need to + * refcount a request_queue in atomic context, instead refcount the + * block device with bdgrab() / bdput(). 
+ * + * @q: request queue to shutdown + * */ void blk_cleanup_queue(struct request_queue *q) { diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 0285d67e1e4c..859911191ebc 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -860,22 +860,27 @@ static void blk_exit_queue(struct request_queue *q) bdi_put(q->backing_dev_info); } - /** - * __blk_release_queue - release a request queue - * @work: pointer to the release_work member of the request queue to be released + * blk_release_queue - release a request queue + * + * This function is called as part of the process when a block device is being + * unregistered. Releasing a request queue starts with blk_cleanup_queue(), + * which set the appropriate flags and then calls blk_put_queue() as the last + * step. blk_put_queue() decrements the reference counter of the request queue + * and once the reference counter reaches zero, this function is called to + * release all allocated resources of the request queue. * - * Description: - * This function is called when a block device is being unregistered. The - * process of releasing a request queue starts with blk_cleanup_queue, which - * set the appropriate flags and then calls blk_put_queue, that decrements - * the reference counter of the request queue. Once the reference counter - * of the request queue reaches zero, blk_release_queue is called to release - * all allocated resources of the request queue. + * This function can sleep, and so we must ensure that the very last + * blk_put_queue() is never called from atomic context. 
+ * + * @kobj: pointer to a kobject, who's container is a request_queue */ -static void __blk_release_queue(struct work_struct *work) +static void blk_release_queue(struct kobject *kobj) { - struct request_queue *q = container_of(work, typeof(*q), release_work); + struct request_queue *q = + container_of(kobj, struct request_queue, kobj); + + might_sleep(); if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags)) blk_stat_remove_callback(q, q->poll_cb); @@ -905,15 +910,6 @@ static void __blk_release_queue(struct work_struct *work) call_rcu(&q->rcu_head, blk_free_queue_rcu); } -static void blk_release_queue(struct kobject *kobj) -{ - struct request_queue *q = - container_of(kobj, struct request_queue, kobj); - - INIT_WORK(&q->release_work, __blk_release_queue); - schedule_work(&q->release_work); -} - static const struct sysfs_ops queue_sysfs_ops = { .show = queue_attr_show, .store = queue_attr_store, diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index cc43c8e6516c..81f7ddb1587e 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -582,8 +582,6 @@ struct request_queue { size_t cmd_size; - struct work_struct release_work; - #define BLK_MAX_WRITE_HINTS 5 u64 write_hints[BLK_MAX_WRITE_HINTS]; };