diff mbox series

[v3] md: fix duplicate filename for rdev

Message ID 20230523012727.3042247-1-yukuai1@huaweicloud.com (mailing list archive)
State New, archived
Headers show
Series [v3] md: fix duplicate filename for rdev | expand

Commit Message

Yu Kuai May 23, 2023, 1:27 a.m. UTC
From: Yu Kuai <yukuai3@huawei.com>

Commit 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device
from an md array via sysfs") delays the deletion of rdev; however, this
introduces a window in which the rdev can be added again before the
deletion completes, and sysfs will complain about a duplicate filename.

Follow-up patches tried to fix this problem by flushing the workqueue;
however, flush_rdev_wq() is just dead code, given the sequence in
md_kick_rdev_from_array():

1) list_del_rcu(&rdev->same_set);
2) synchronize_rcu();
3) queue_work(md_rdev_misc_wq, &rdev->del_work);

So in flush_rdev_wq(), if the rdev is found in the list, work_pending()
can never return true; conversely, once the work is queued, the rdev can
no longer be found in the list.
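The ordering argument can be illustrated with a few lines of userspace C. This is a sketch only: the two flags stand in for "rdev is on the list" and "del_work is queued", not the real kernel structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the ordering in md_kick_rdev_from_array().
 * Illustrative sketch; names mirror the kernel code but nothing here
 * is the actual implementation. */
struct rdev_model {
    bool on_list;      /* rdev still reachable via the rdev list */
    bool work_queued;  /* del_work queued on md_rdev_misc_wq */
};

/* Steps 1-3 from the commit message: the rdev leaves the list strictly
 * before its del_work is queued (synchronize_rcu() sits in between). */
static void md_kick_rdev_from_array_model(struct rdev_model *r)
{
    r->on_list = false;     /* 1) list_del_rcu() */
    /* 2) synchronize_rcu() */
    r->work_queued = true;  /* 3) queue_work() */
}

/* flush_rdev_wq() only flushes when both conditions hold at once,
 * which the ordering above makes impossible -- hence the dead code. */
static bool flush_rdev_wq_would_flush(const struct rdev_model *r)
{
    return r->on_list && r->work_queued;
}
```

At no point in the sequence can an observer find the rdev on the list while its work is pending, so the flush condition is never satisfied.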

flush_rdev_wq() could be replaced by flush_workqueue() directly; however,
this approach is not good:
- the workqueue is global, so synchronizing across all raid disks is
  unnecessary.
- flush_workqueue() can't be called under 'reconfig_mutex', and there is
  still a small window between flush_workqueue() and mddev_lock() in which
  other contexts can queue new work, so the problem is not solved
  completely.

sysfs already has APIs that allow an attribute to delete itself from its
own store method; these APIs, specifically
sysfs_break/unbreak_active_protection(), are used to support deleting the
rdev synchronously. Therefore, the above commit can be reverted, and the
sysfs duplicate filename can be avoided.

A new mdadm regression test is proposed as well([1]).

[1] https://lore.kernel.org/linux-raid/20230428062845.1975462-1-yukuai1@huaweicloud.com/
Fixes: 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device from an md array via sysfs")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
Changes in v3:
 - fix some typos in commit message
 - remove the unused 'md_rdev_misc_wq' that was missed in v2
Changes in v2:
 - rebase from the latest md-next branch
 drivers/md/md.c | 85 +++++++++++++++++++++++++------------------------
 drivers/md/md.h |  8 +++++
 2 files changed, 51 insertions(+), 42 deletions(-)

Comments

Song Liu May 23, 2023, 6:05 p.m. UTC | #1
On Mon, May 22, 2023 at 6:30 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> Commit 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device
> from an md array via sysfs") delays the deletion of rdev, however, this
> introduces a window that rdev can be added again while the deletion is
> not done yet, and sysfs will complain about duplicate filename.
>
> Follow up patches try to fix this problem by flushing workqueue, however,
> flush_rdev_wq() is just dead code, the progress in
> md_kick_rdev_from_array():
>
> 1) list_del_rcu(&rdev->same_set);
> 2) synchronize_rcu();
> 3) queue_work(md_rdev_misc_wq, &rdev->del_work);
>
> So in flush_rdev_wq(), if rdev is found in the list, work_pending() can
> never pass, in the meantime, if work is queued, then rdev can never be
> found in the list.
>
> flush_rdev_wq() can be replaced by flush_workqueue() directly, however,
> this approach is not good:
> - the workqueue is global, this synchronization for all raid disks is
>   not necessary.
> - flush_workqueue can't be called under 'reconfig_mutex', there is still
>   a small window between flush_workqueue() and mddev_lock() that other
>   contexts can queue new work, hence the problem is not solved completely.
>
> sysfs already has apis to support delete itself through writer, and
> these apis, specifically sysfs_break/unbreak_active_protection(), is used
> to support deleting rdev synchronously. Therefore, the above commit can be
> reverted, and sysfs duplicate filename can be avoided.
>
> A new mdadm regression test is proposed as well([1]).
>
> [1] https://lore.kernel.org/linux-raid/20230428062845.1975462-1-yukuai1@huaweicloud.com/
> Fixes: 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device from an md array via sysfs")
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>

Thanks for the fix! I made the following changes and applied it
to md-next:

1. remove md_rdev->del_work, which is not used any more;
2. change list_empty_careful to list_empty protected by the mutex, as
   list_empty_careful doesn't seem safe here.

Please let me know if either change doesn't make sense.

Thanks,
Song
Yu Kuai May 24, 2023, 1:33 a.m. UTC | #2
Hi,

在 2023/05/24 2:05, Song Liu 写道:
> On Mon, May 22, 2023 at 6:30 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Commit 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device
>> from an md array via sysfs") delays the deletion of rdev, however, this
>> introduces a window that rdev can be added again while the deletion is
>> not done yet, and sysfs will complain about duplicate filename.
>>
>> Follow up patches try to fix this problem by flushing workqueue, however,
>> flush_rdev_wq() is just dead code, the progress in
>> md_kick_rdev_from_array():
>>
>> 1) list_del_rcu(&rdev->same_set);
>> 2) synchronize_rcu();
>> 3) queue_work(md_rdev_misc_wq, &rdev->del_work);
>>
>> So in flush_rdev_wq(), if rdev is found in the list, work_pending() can
>> never pass, in the meantime, if work is queued, then rdev can never be
>> found in the list.
>>
>> flush_rdev_wq() can be replaced by flush_workqueue() directly, however,
>> this approach is not good:
>> - the workqueue is global, this synchronization for all raid disks is
>>    not necessary.
>> - flush_workqueue can't be called under 'reconfig_mutex', there is still
>>    a small window between flush_workqueue() and mddev_lock() that other
>>    contexts can queue new work, hence the problem is not solved completely.
>>
>> sysfs already has apis to support delete itself through writer, and
>> these apis, specifically sysfs_break/unbreak_active_protection(), is used
>> to support deleting rdev synchronously. Therefore, the above commit can be
>> reverted, and sysfs duplicate filename can be avoided.
>>
>> A new mdadm regression test is proposed as well([1]).
>>
>> [1] https://lore.kernel.org/linux-raid/20230428062845.1975462-1-yukuai1@huaweicloud.com/
>> Fixes: 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device from an md array via sysfs")
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> 
> Thanks for the fix! I made the following changes and applied it
> to md-next:
> 
> 1. remove md_rdev->del_work, which is not used any more;
> 2. change list_empty_safe to list_empty protected by the mutex, as
>     list_empty_safe doesn't seem safe here.
Yes, this makes sense; we must make sure the caller won't see a stale
rdev through sysfs:

t1: remove rdev			t2:
mutex_lock(reconfig_mutex)
list_add
mutex_unlock(reconfig_mutex)
				mutex_lock(reconfig_mutex)
				mutex_unlock(reconfig_mutex)
				mutex_lock(delete_mutex)
				list_del_init
list_empty_careful
  -> list is empty now, so we return; the caller will think the rdev is
  removed. However, since export_rdev() is not called yet, adding this
  rdev again will fail.
				kobject_del

Holding the mutex is safe, and I think performance should be fine because
removing an rdev is not a hot path. If we don't want to hold a new mutex
in the hot path, perhaps list_empty_careful() can work with the following
change: remove the rdev from the list after its sysfs entries are removed:

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 296885798a2b..84dce5822f91 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -758,8 +758,8 @@ static void md_free_rdev(struct mddev *mddev)

         mutex_lock(&mddev->delete_mutex);
         list_for_each_entry_safe(rdev, tmp, &mddev->deleting, same_set) {
-               list_del_init(&rdev->same_set);
                 kobject_del(&rdev->kobj);
+               list_del_init(&rdev->same_set);
                 export_rdev(rdev);
         }
         mutex_unlock(&mddev->delete_mutex);
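The effect of reordering the two teardown steps can be sketched with a small userspace model (hypothetical names; it only records what an unsynchronized observer could see between the two steps):

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the window described above. Illustrative only. */
struct rdev_state {
    bool on_list;      /* linked on mddev->deleting */
    bool sysfs_entry;  /* kobject still registered */
};

/* Original order: list_del_init() before kobject_del(). */
static struct rdev_state mid_state_list_del_first(struct rdev_state s)
{
    s.on_list = false;           /* list_del_init() */
    struct rdev_state mid = s;   /* list_empty_careful() may run here */
    s.sysfs_entry = false;       /* kobject_del() */
    return mid;
}

/* Reordered: kobject_del() before list_del_init(). */
static struct rdev_state mid_state_kobject_del_first(struct rdev_state s)
{
    s.sysfs_entry = false;       /* kobject_del() */
    struct rdev_state mid = s;   /* observer still sees a non-empty list */
    s.on_list = false;           /* list_del_init() */
    return mid;
}
```

With the original order, the intermediate state has an empty list but a live sysfs entry (the stale window); with the reordered version, the list only empties once the kobject is already gone.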

> 
> Please let me know if either change doesn't make sense.
> 
> Thanks,
> Song
> .
>
Song Liu June 20, 2023, 9:38 p.m. UTC | #3
On Tue, May 23, 2023 at 6:33 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2023/05/24 2:05, Song Liu 写道:
> > On Mon, May 22, 2023 at 6:30 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>
> >> From: Yu Kuai <yukuai3@huawei.com>
> >>
> >> Commit 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device
> >> from an md array via sysfs") delays the deletion of rdev, however, this
> >> introduces a window that rdev can be added again while the deletion is
> >> not done yet, and sysfs will complain about duplicate filename.
> >>
> >> Follow up patches try to fix this problem by flushing workqueue, however,
> >> flush_rdev_wq() is just dead code, the progress in
> >> md_kick_rdev_from_array():
> >>
> >> 1) list_del_rcu(&rdev->same_set);
> >> 2) synchronize_rcu();
> >> 3) queue_work(md_rdev_misc_wq, &rdev->del_work);
> >>
> >> So in flush_rdev_wq(), if rdev is found in the list, work_pending() can
> >> never pass, in the meantime, if work is queued, then rdev can never be
> >> found in the list.
> >>
> >> flush_rdev_wq() can be replaced by flush_workqueue() directly, however,
> >> this approach is not good:
> >> - the workqueue is global, this synchronization for all raid disks is
> >>    not necessary.
> >> - flush_workqueue can't be called under 'reconfig_mutex', there is still
> >>    a small window between flush_workqueue() and mddev_lock() that other
> >>    contexts can queue new work, hence the problem is not solved completely.
> >>
> >> sysfs already has apis to support delete itself through writer, and
> >> these apis, specifically sysfs_break/unbreak_active_protection(), is used
> >> to support deleting rdev synchronously. Therefore, the above commit can be
> >> reverted, and sysfs duplicate filename can be avoided.
> >>
> >> A new mdadm regression test is proposed as well([1]).
> >>
> >> [1] https://lore.kernel.org/linux-raid/20230428062845.1975462-1-yukuai1@huaweicloud.com/
> >> Fixes: 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device from an md array via sysfs")
> >> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> >
> > Thanks for the fix! I made the following changes and applied it
> > to md-next:
> >
> > 1. remove md_rdev->del_work, which is not used any more;
> > 2. change list_empty_safe to list_empty protected by the mutex, as
> >     list_empty_safe doesn't seem safe here.

Hmm.. it appears that I missed a circular locking dependency, reported by
the mdadm test 21raid5cache (delete_mutex and open_mutex).

Please take a look at this.

Thanks,
Song

[  239.802277] ======================================================
[  239.803650] WARNING: possible circular locking dependency detected
[  239.804929] 6.4.0-rc2+ #772 Not tainted
[  239.805569] ------------------------------------------------------
[  239.806568] kworker/20:1/222 is trying to acquire lock:
[  239.807406] ffff88815335b3f0 (&mddev->delete_mutex){+.+.}-{3:3},
at: mddev_unlock+0xe0/0x2d0
[  239.808653]
               but task is already holding lock:
[  239.809481] ffffc9000246fe00
((work_completion)(&log->disable_writeback_work)){+.+.}-{0:0}, at:
process_one_work+0x462/0xa50
[  239.811049]
               which lock already depends on the new lock.

[  239.812187]
               the existing dependency chain (in reverse order) is:
[  239.813230]
               -> #3
((work_completion)(&log->disable_writeback_work)){+.+.}-{0:0}:
[  239.814468]        __flush_work+0xdb/0x690
[  239.815068]        r5l_exit_log+0x59/0xc0
[  239.815649]        free_conf+0x34/0x320
[  239.816243]        raid5_free+0x11/0x40
[  239.816788]        __md_stop+0x9f/0x140
[  239.817336]        do_md_stop+0x2af/0xaf0
[  239.817901]        md_ioctl+0xb34/0x1e30
[  239.818469]        blkdev_ioctl+0x2bf/0x3d0
[  239.819079]        __x64_sys_ioctl+0xbe/0x100
[  239.819701]        do_syscall_64+0x3a/0x90
[  239.820294]        entry_SYSCALL_64_after_hwframe+0x72/0xdc
[  239.821076]
               -> #2 (&mddev->open_mutex){+.+.}-{3:3}:
[  239.821971]        __mutex_lock+0x11d/0x13f0
[  239.822592]        md_open+0xad/0x180
[  239.822937] kobject: 'md0' (ffff88811c8e5498): kobject_uevent_env
[  239.823113]        blkdev_get_whole+0x50/0x120
[  239.824216] kobject: 'md0' (ffff88811c8e5498): fill_kobj_path: path
= '/devices/virtual/block/md0'
[  239.824725]        blkdev_get_by_dev+0x309/0x4f0
[  239.826747]        blkdev_open+0x8a/0x110
[  239.827319]        do_dentry_open+0x2a5/0x7b0
[  239.827939]        path_openat+0xcee/0x1070
[  239.828545]        do_filp_open+0x148/0x1d0
[  239.829137]        do_sys_openat2+0x2ec/0x470
[  239.829791]        do_sys_open+0x8a/0xd0
[  239.830352]        do_syscall_64+0x3a/0x90
[  239.830935]        entry_SYSCALL_64_after_hwframe+0x72/0xdc
[  239.831728]
               -> #1 (&disk->open_mutex){+.+.}-{3:3}:
[  239.832615]        __mutex_lock+0x11d/0x13f0
[  239.833231]        blkdev_put+0x65/0x350
[  239.833785]        export_rdev.isra.63+0x72/0xe0
[  239.834456]        mddev_unlock+0x1b1/0x2d0
[  239.835043]        md_ioctl+0x96c/0x1e30
[  239.835616]        blkdev_ioctl+0x2bf/0x3d0
[  239.836213]        __x64_sys_ioctl+0xbe/0x100
[  239.836828]        do_syscall_64+0x3a/0x90
[  239.837414]        entry_SYSCALL_64_after_hwframe+0x72/0xdc
[  239.838225]
               -> #0 (&mddev->delete_mutex){+.+.}-{3:3}:
[  239.839145]        __lock_acquire+0x1e42/0x34b0
[  239.839797]        lock_acquire+0x161/0x3d0
[  239.840395]        __mutex_lock+0x11d/0x13f0
[  239.840999]        mddev_unlock+0xe0/0x2d0
[  239.841582]        r5c_disable_writeback_async+0x261/0x270
[  239.842354]        process_one_work+0x55f/0xa50
[  239.842996]        worker_thread+0x69/0x660
[  239.843595]        kthread+0x1a1/0x1e0
[  239.844131]        ret_from_fork+0x2c/0x50
[  239.844712]
               other info that might help us debug this:

[  239.845813] Chain exists of:
                 &mddev->delete_mutex --> &mddev->open_mutex -->
(work_completion)(&log->disable_writeback_work)

[  239.847776]  Possible unsafe locking scenario:

[  239.848623]        CPU0                    CPU1
[  239.849262]        ----                    ----
[  239.849927]   lock((work_completion)(&log->disable_writeback_work));
[  239.850831]                                lock(&mddev->open_mutex);
[  239.851719]
lock((work_completion)(&log->disable_writeback_work));
[  239.852945]   lock(&mddev->delete_mutex);
[  239.853517]
                *** DEADLOCK ***

[  239.854345] 2 locks held by kworker/20:1/222:
[  239.854960]  #0: ffff8881000b2748
((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x462/0xa50
[  239.856287]  #1: ffffc9000246fe00
((work_completion)(&log->disable_writeback_work)){+.+.}-{0:0}, at:
process_one_work+0x462/0xa50
[  239.857939]
               stack backtrace:
[  239.858567] CPU: 20 PID: 222 Comm: kworker/20:1 Not tainted 6.4.0-rc2+ #772
[  239.859533] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[  239.861135] Workqueue: events r5c_disable_writeback_async
[  239.861903] Call Trace:
[  239.862281]  <TASK>
[  239.862606]  dump_stack_lvl+0x46/0x80
[  239.863144]  check_noncircular+0x1ff/0x240
[  239.863737]  ? __pfx_check_noncircular+0x10/0x10
[  239.864401]  ? mark_lock.part.45+0x11a/0x1350
[  239.865019]  ? add_chain_block+0x23b/0x310
[  239.865624]  __lock_acquire+0x1e42/0x34b0
[  239.866202]  ? select_task_rq_fair+0x2b0/0x1e50
[  239.866845]  ? __pfx___lock_acquire+0x10/0x10
[  239.867473]  ? ttwu_queue_wakelist+0x1cc/0x1f0
[  239.868107]  ? __smp_call_single_queue+0x137/0x2a0
[  239.868816]  ? __default_send_IPI_dest_field+0x2b/0xa0
[  239.869548]  ? __lock_acquire+0xa5d/0x34b0
[  239.870131]  lock_acquire+0x161/0x3d0
[  239.870660]  ? mddev_unlock+0xe0/0x2d0
[  239.871216]  ? __pfx_lock_acquire+0x10/0x10
[  239.871808]  ? lock_is_held_type+0xd8/0x130
[  239.872405]  __mutex_lock+0x11d/0x13f0
[  239.872936]  ? mddev_unlock+0xe0/0x2d0
[  239.873474]  ? mddev_unlock+0xe0/0x2d0
[  239.874006]  ? __mutex_unlock_slowpath+0x12c/0x410
[  239.874695]  ? __pfx___mutex_lock+0x10/0x10
[  239.875297]  ? __pfx_rcu_read_lock_held+0x10/0x10
[  239.875986]  ? mddev_unlock+0xe0/0x2d0
[  239.876530]  mddev_unlock+0xe0/0x2d0
[  239.877044]  r5c_disable_writeback_async+0x261/0x270
[  239.877753]  ? __pfx_r5c_disable_writeback_async+0x10/0x10
[  239.878531]  ? __switch_to+0x2d8/0x770
[  239.879066]  ? __pfx_autoremove_wake_function+0x10/0x10
[  239.879813]  process_one_work+0x55f/0xa50
[  239.880398]  ? __pfx_process_one_work+0x10/0x10
[  239.881073]  ? _raw_spin_lock_irq+0x5c/0x90
[  239.881675]  worker_thread+0x69/0x660
[  239.882206]  ? __kthread_parkme+0xe4/0x100
[  239.882786]  ? __pfx_worker_thread+0x10/0x10
[  239.883395]  kthread+0x1a1/0x1e0
[  239.883863]  ? __pfx_kthread+0x10/0x10
[  239.884406]  ret_from_fork+0x2c/0x50
[  239.884925]  </TASK>
Yu Kuai June 21, 2023, 2:25 a.m. UTC | #4
Hi,

在 2023/06/21 5:38, Song Liu 写道:
> On Tue, May 23, 2023 at 6:33 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> 在 2023/05/24 2:05, Song Liu 写道:
>>> On Mon, May 22, 2023 at 6:30 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>>>
>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>
>>>> Commit 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device
>>>> from an md array via sysfs") delays the deletion of rdev, however, this
>>>> introduces a window that rdev can be added again while the deletion is
>>>> not done yet, and sysfs will complain about duplicate filename.
>>>>
>>>> Follow up patches try to fix this problem by flushing workqueue, however,
>>>> flush_rdev_wq() is just dead code, the progress in
>>>> md_kick_rdev_from_array():
>>>>
>>>> 1) list_del_rcu(&rdev->same_set);
>>>> 2) synchronize_rcu();
>>>> 3) queue_work(md_rdev_misc_wq, &rdev->del_work);
>>>>
>>>> So in flush_rdev_wq(), if rdev is found in the list, work_pending() can
>>>> never pass, in the meantime, if work is queued, then rdev can never be
>>>> found in the list.
>>>>
>>>> flush_rdev_wq() can be replaced by flush_workqueue() directly, however,
>>>> this approach is not good:
>>>> - the workqueue is global, this synchronization for all raid disks is
>>>>     not necessary.
>>>> - flush_workqueue can't be called under 'reconfig_mutex', there is still
>>>>     a small window between flush_workqueue() and mddev_lock() that other
>>>>     contexts can queue new work, hence the problem is not solved completely.
>>>>
>>>> sysfs already has apis to support delete itself through writer, and
>>>> these apis, specifically sysfs_break/unbreak_active_protection(), is used
>>>> to support deleting rdev synchronously. Therefore, the above commit can be
>>>> reverted, and sysfs duplicate filename can be avoided.
>>>>
>>>> A new mdadm regression test is proposed as well([1]).
>>>>
>>>> [1] https://lore.kernel.org/linux-raid/20230428062845.1975462-1-yukuai1@huaweicloud.com/
>>>> Fixes: 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device from an md array via sysfs")
>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>>
>>> Thanks for the fix! I made the following changes and applied it
>>> to md-next:
>>>
>>> 1. remove md_rdev->del_work, which is not used any more;
>>> 2. change list_empty_safe to list_empty protected by the mutex, as
>>>      list_empty_safe doesn't seem safe here.
> 
> Hmm.. it appears that I missed a circular locking dependency with mdadm
> test 21raid5cache (delete_mutex and open_mutex).
> 
> Please take a look at this.
> 

Thanks. I think the following patch can fix this, but running this test
keeps failing in my VM. Song, does this test pass for you?

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1be64ab76f5c..62b7d2d0e873 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -766,17 +766,29 @@ static void md_free_rdev(struct mddev *mddev)
  {
         struct md_rdev *rdev;
         struct md_rdev *tmp;
+       LIST_HEAD(delete);

-       mutex_lock(&mddev->delete_mutex);
-       if (list_empty(&mddev->deleting))
-               goto out;
+retry:
+       if (list_empty_careful(&mddev->deleting))
+               return;
+
+       if (!mutex_trylock(&mddev->delete_mutex))
+               goto retry;

-       list_for_each_entry_safe(rdev, tmp, &mddev->deleting, same_set) {
+       /*
+        * Move 'mddev->deleting' onto a local list first, because
+        * export_rdev() can trigger mddev_unlock(), and grabbing
+        * 'delete_mutex' again would deadlock. This also makes sure that
+        * concurrent md_free_rdev() calls return directly until every
+        * rdev is deleted.
+        */
+       list_splice_init(&mddev->deleting, &delete);
+
+       list_for_each_entry_safe(rdev, tmp, &delete, same_set) {
                 list_del_init(&rdev->same_set);
                 kobject_del(&rdev->kobj);
                 export_rdev(rdev, mddev);
         }
-out:
+
         mutex_unlock(&mddev->delete_mutex);
  }
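For reference, list_splice() moves entries from its first argument onto its second, so draining 'mddev->deleting' into the local 'delete' list needs list_splice_init(&mddev->deleting, &delete). A minimal userspace sketch of the list primitives (not the kernel's list.h) demonstrates the direction:

```c
#include <assert.h>

/* Minimal userspace re-implementation of the kernel's circular
 * doubly-linked list, just enough to show list_splice_init() semantics.
 * A sketch for illustration, not the real <linux/list.h>. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

static int list_empty(const struct list_head *h) { return h->next == h; }

/* Moves all entries FROM @list ONTO @head, leaving @list empty. */
static void list_splice_init(struct list_head *list, struct list_head *head)
{
    if (!list_empty(list)) {
        struct list_head *first = list->next, *last = list->prev;

        first->prev = head;
        last->next = head->next;
        head->next->prev = last;
        head->next = first;
        INIT_LIST_HEAD(list);
    }
}
```

Splicing an empty local list onto a populated one is a no-op for the populated list, which is why the argument order matters here.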



> Thanks,
> Song
> 
Yu Kuai June 21, 2023, 3:46 a.m. UTC | #5
Hi,

在 2023/06/21 10:25, Yu Kuai 写道:
> Hi,
> 
> 在 2023/06/21 5:38, Song Liu 写道:
>> On Tue, May 23, 2023 at 6:33 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>>
>>> Hi,
>>>
>>> 在 2023/05/24 2:05, Song Liu 写道:
>>>> On Mon, May 22, 2023 at 6:30 PM Yu Kuai <yukuai1@huaweicloud.com> 
>>>> wrote:
>>>>>
>>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>>
>>>>> Commit 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a 
>>>>> device
>>>>> from an md array via sysfs") delays the deletion of rdev, however, 
>>>>> this
>>>>> introduces a window that rdev can be added again while the deletion is
>>>>> not done yet, and sysfs will complain about duplicate filename.
>>>>>
>>>>> Follow up patches try to fix this problem by flushing workqueue, 
>>>>> however,
>>>>> flush_rdev_wq() is just dead code, the progress in
>>>>> md_kick_rdev_from_array():
>>>>>
>>>>> 1) list_del_rcu(&rdev->same_set);
>>>>> 2) synchronize_rcu();
>>>>> 3) queue_work(md_rdev_misc_wq, &rdev->del_work);
>>>>>
>>>>> So in flush_rdev_wq(), if rdev is found in the list, work_pending() 
>>>>> can
>>>>> never pass, in the meantime, if work is queued, then rdev can never be
>>>>> found in the list.
>>>>>
>>>>> flush_rdev_wq() can be replaced by flush_workqueue() directly, 
>>>>> however,
>>>>> this approach is not good:
>>>>> - the workqueue is global, this synchronization for all raid disks is
>>>>>     not necessary.
>>>>> - flush_workqueue can't be called under 'reconfig_mutex', there is 
>>>>> still
>>>>>     a small window between flush_workqueue() and mddev_lock() that 
>>>>> other
>>>>>     contexts can queue new work, hence the problem is not solved 
>>>>> completely.
>>>>>
>>>>> sysfs already has apis to support delete itself through writer, and
>>>>> these apis, specifically sysfs_break/unbreak_active_protection(), 
>>>>> is used
>>>>> to support deleting rdev synchronously. Therefore, the above commit 
>>>>> can be
>>>>> reverted, and sysfs duplicate filename can be avoided.
>>>>>
>>>>> A new mdadm regression test is proposed as well([1]).
>>>>>
>>>>> [1] 
>>>>> https://lore.kernel.org/linux-raid/20230428062845.1975462-1-yukuai1@huaweicloud.com/ 
>>>>>
>>>>> Fixes: 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a 
>>>>> device from an md array via sysfs")
>>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>>>
>>>> Thanks for the fix! I made the following changes and applied it
>>>> to md-next:
>>>>
>>>> 1. remove md_rdev->del_work, which is not used any more;
>>>> 2. change list_empty_safe to list_empty protected by the mutex, as
>>>>      list_empty_safe doesn't seem safe here.
>>
>> Hmm.. it appears that I missed a circular locking dependency with mdadm
>> test 21raid5cache (delete_mutex and open_mutex).
>>
>> Please take a look at this.
>>
> 
> Thanks, I think following patch can fix this, but run this test keep
> failing in my VM. Song, can you pass this test?

I just figured out a better solution, so please ignore this patch. I'll
send a fix soon.

Thanks,
Kuai
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 1be64ab76f5c..62b7d2d0e873 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -766,17 +766,29 @@ static void md_free_rdev(struct mddev *mddev)
>   {
>          struct md_rdev *rdev;
>          struct md_rdev *tmp;
> +       LIST_HEAD(delete);
> 
> -       mutex_lock(&mddev->delete_mutex);
> -       if (list_empty(&mddev->deleting))
> -               goto out;
> +retry:
> +       if (list_empty_careful(&mddev->deleting))
> +               return;
> +
> +       if (!mutex_trylock(&mddev->delete_mutex))
> +               goto retry;
> 
> -       list_for_each_entry_safe(rdev, tmp, &mddev->deleting, same_set) {
> +       /*
> +        * Move 'mddev->deleting' onto a local list first, because
> +        * export_rdev() can trigger mddev_unlock(), and grabbing
> +        * 'delete_mutex' again would deadlock. This also makes sure that
> +        * concurrent md_free_rdev() calls return directly until every
> +        * rdev is deleted.
> +        */
> +       list_splice_init(&mddev->deleting, &delete);
> +
> +       list_for_each_entry_safe(rdev, tmp, &delete, same_set) {
>                  list_del_init(&rdev->same_set);
>                  kobject_del(&rdev->kobj);
>                  export_rdev(rdev, mddev);
>          }
> -out:
> +
>          mutex_unlock(&mddev->delete_mutex);
>   }
> 
> 
> 
>> Thanks,
>> Song
>>
>> [  239.828545]        do_filp_open+0x148/0x1d0
>> [  239.829137]        do_sys_openat2+0x2ec/0x470
>> [  239.829791]        do_sys_open+0x8a/0xd0
>> [  239.830352]        do_syscall_64+0x3a/0x90
>> [  239.830935]        entry_SYSCALL_64_after_hwframe+0x72/0xdc
>> [  239.831728]
>>                 -> #1 (&disk->open_mutex){+.+.}-{3:3}:
>> [  239.832615]        __mutex_lock+0x11d/0x13f0
>> [  239.833231]        blkdev_put+0x65/0x350
>> [  239.833785]        export_rdev.isra.63+0x72/0xe0
>> [  239.834456]        mddev_unlock+0x1b1/0x2d0
>> [  239.835043]        md_ioctl+0x96c/0x1e30
>> [  239.835616]        blkdev_ioctl+0x2bf/0x3d0
>> [  239.836213]        __x64_sys_ioctl+0xbe/0x100
>> [  239.836828]        do_syscall_64+0x3a/0x90
>> [  239.837414]        entry_SYSCALL_64_after_hwframe+0x72/0xdc
>> [  239.838225]
>>                 -> #0 (&mddev->delete_mutex){+.+.}-{3:3}:
>> [  239.839145]        __lock_acquire+0x1e42/0x34b0
>> [  239.839797]        lock_acquire+0x161/0x3d0
>> [  239.840395]        __mutex_lock+0x11d/0x13f0
>> [  239.840999]        mddev_unlock+0xe0/0x2d0
>> [  239.841582]        r5c_disable_writeback_async+0x261/0x270
>> [  239.842354]        process_one_work+0x55f/0xa50
>> [  239.842996]        worker_thread+0x69/0x660
>> [  239.843595]        kthread+0x1a1/0x1e0
>> [  239.844131]        ret_from_fork+0x2c/0x50
>> [  239.844712]
>>                 other info that might help us debug this:
>>
>> [  239.845813] Chain exists of:
>>                   &mddev->delete_mutex --> &mddev->open_mutex -->
>> (work_completion)(&log->disable_writeback_work)
>>
>> [  239.847776]  Possible unsafe locking scenario:
>>
>> [  239.848623]        CPU0                    CPU1
>> [  239.849262]        ----                    ----
>> [  239.849927]   lock((work_completion)(&log->disable_writeback_work));
>> [  239.850831]                                lock(&mddev->open_mutex);
>> [  239.851719]
>> lock((work_completion)(&log->disable_writeback_work));
>> [  239.852945]   lock(&mddev->delete_mutex);
>> [  239.853517]
>>                  *** DEADLOCK ***
>>
>> [  239.854345] 2 locks held by kworker/20:1/222:
>> [  239.854960]  #0: ffff8881000b2748
>> ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x462/0xa50
>> [  239.856287]  #1: ffffc9000246fe00
>> ((work_completion)(&log->disable_writeback_work)){+.+.}-{0:0}, at:
>> process_one_work+0x462/0xa50
>> [  239.857939]
>>                 stack backtrace:
>> [  239.858567] CPU: 20 PID: 222 Comm: kworker/20:1 Not tainted 
>> 6.4.0-rc2+ #772
>> [  239.859533] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>> BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
>> [  239.861135] Workqueue: events r5c_disable_writeback_async
>> [  239.861903] Call Trace:
>> [  239.862281]  <TASK>
>> [  239.862606]  dump_stack_lvl+0x46/0x80
>> [  239.863144]  check_noncircular+0x1ff/0x240
>> [  239.863737]  ? __pfx_check_noncircular+0x10/0x10
>> [  239.864401]  ? mark_lock.part.45+0x11a/0x1350
>> [  239.865019]  ? add_chain_block+0x23b/0x310
>> [  239.865624]  __lock_acquire+0x1e42/0x34b0
>> [  239.866202]  ? select_task_rq_fair+0x2b0/0x1e50
>> [  239.866845]  ? __pfx___lock_acquire+0x10/0x10
>> [  239.867473]  ? ttwu_queue_wakelist+0x1cc/0x1f0
>> [  239.868107]  ? __smp_call_single_queue+0x137/0x2a0
>> [  239.868816]  ? __default_send_IPI_dest_field+0x2b/0xa0
>> [  239.869548]  ? __lock_acquire+0xa5d/0x34b0
>> [  239.870131]  lock_acquire+0x161/0x3d0
>> [  239.870660]  ? mddev_unlock+0xe0/0x2d0
>> [  239.871216]  ? __pfx_lock_acquire+0x10/0x10
>> [  239.871808]  ? lock_is_held_type+0xd8/0x130
>> [  239.872405]  __mutex_lock+0x11d/0x13f0
>> [  239.872936]  ? mddev_unlock+0xe0/0x2d0
>> [  239.873474]  ? mddev_unlock+0xe0/0x2d0
>> [  239.874006]  ? __mutex_unlock_slowpath+0x12c/0x410
>> [  239.874695]  ? __pfx___mutex_lock+0x10/0x10
>> [  239.875297]  ? __pfx_rcu_read_lock_held+0x10/0x10
>> [  239.875986]  ? mddev_unlock+0xe0/0x2d0
>> [  239.876530]  mddev_unlock+0xe0/0x2d0
>> [  239.877044]  r5c_disable_writeback_async+0x261/0x270
>> [  239.877753]  ? __pfx_r5c_disable_writeback_async+0x10/0x10
>> [  239.878531]  ? __switch_to+0x2d8/0x770
>> [  239.879066]  ? __pfx_autoremove_wake_function+0x10/0x10
>> [  239.879813]  process_one_work+0x55f/0xa50
>> [  239.880398]  ? __pfx_process_one_work+0x10/0x10
>> [  239.881073]  ? _raw_spin_lock_irq+0x5c/0x90
>> [  239.881675]  worker_thread+0x69/0x660
>> [  239.882206]  ? __kthread_parkme+0xe4/0x100
>> [  239.882786]  ? __pfx_worker_thread+0x10/0x10
>> [  239.883395]  kthread+0x1a1/0x1e0
>> [  239.883863]  ? __pfx_kthread+0x10/0x10
>> [  239.884406]  ret_from_fork+0x2c/0x50
>> [  239.884925]  </TASK>
>> .
>>
> 
> .
>
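The chain in the splat above is a plain ordering cycle across four locks. As a toy illustration of how lockdep flags it (a hedged sketch, not lockdep's real algorithm), one can record "B was taken while A was held" edges and check whether a new acquisition closes a cycle; the lock ids below mirror the report:

```c
#include <assert.h>

/* Lock ids: 0 = disable_writeback_work, 1 = mddev->open_mutex,
 * 2 = disk->open_mutex, 3 = mddev->delete_mutex. */
#define MAXL 4
static int edge[MAXL][MAXL];    /* edge[a][b]: b was taken with a held */

static void held_while_taking(int a, int b) { edge[a][b] = 1; }

static int reachable(int from, int to, int seen[MAXL])
{
	if (seen[from])
		return 0;
	seen[from] = 1;
	if (edge[from][to])
		return 1;
	for (int k = 0; k < MAXL; k++)
		if (edge[from][k] && reachable(k, to, seen))
			return 1;
	return 0;
}

/* Taking 'want' while holding 'held' closes a cycle iff 'want' already
 * reaches 'held' through the recorded ordering edges. */
static int would_deadlock(int held, int want)
{
	int seen[MAXL] = {0};

	return reachable(want, held, seen);
}
```

Feeding in edges #1..#3 from the splat (delete_mutex -> disk->open_mutex -> mddev->open_mutex -> work), step #0 (taking delete_mutex with the work item held) is reported as a cycle.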

Patch

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 7455bc9d8498..a0a2be43aabf 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -87,11 +87,11 @@  static struct module *md_cluster_mod;
 static DECLARE_WAIT_QUEUE_HEAD(resync_wait);
 static struct workqueue_struct *md_wq;
 static struct workqueue_struct *md_misc_wq;
-static struct workqueue_struct *md_rdev_misc_wq;
 
 static int remove_and_add_spares(struct mddev *mddev,
 				 struct md_rdev *this);
 static void mddev_detach(struct mddev *mddev);
+static void export_rdev(struct md_rdev *rdev);
 
 /*
  * Default number of read corrections we'll attempt on an rdev
@@ -643,9 +643,11 @@  void mddev_init(struct mddev *mddev)
 {
 	mutex_init(&mddev->open_mutex);
 	mutex_init(&mddev->reconfig_mutex);
+	mutex_init(&mddev->delete_mutex);
 	mutex_init(&mddev->bitmap_info.mutex);
 	INIT_LIST_HEAD(&mddev->disks);
 	INIT_LIST_HEAD(&mddev->all_mddevs);
+	INIT_LIST_HEAD(&mddev->deleting);
 	timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
 	atomic_set(&mddev->active, 1);
 	atomic_set(&mddev->openers, 0);
@@ -747,6 +749,23 @@  static void mddev_free(struct mddev *mddev)
 
 static const struct attribute_group md_redundancy_group;
 
+static void md_free_rdev(struct mddev *mddev)
+{
+	struct md_rdev *rdev;
+	struct md_rdev *tmp;
+
+	if (list_empty_careful(&mddev->deleting))
+		return;
+
+	mutex_lock(&mddev->delete_mutex);
+	list_for_each_entry_safe(rdev, tmp, &mddev->deleting, same_set) {
+		list_del_init(&rdev->same_set);
+		kobject_del(&rdev->kobj);
+		export_rdev(rdev);
+	}
+	mutex_unlock(&mddev->delete_mutex);
+}
+
 void mddev_unlock(struct mddev *mddev)
 {
 	if (mddev->to_remove) {
@@ -788,6 +807,8 @@  void mddev_unlock(struct mddev *mddev)
 	} else
 		mutex_unlock(&mddev->reconfig_mutex);
 
+	md_free_rdev(mddev);
+
 	/* As we've dropped the mutex we need a spinlock to
 	 * make sure the thread doesn't disappear
 	 */
@@ -2428,13 +2449,6 @@  static int bind_rdev_to_array(struct md_rdev *rdev, struct mddev *mddev)
 	return err;
 }
 
-static void rdev_delayed_delete(struct work_struct *ws)
-{
-	struct md_rdev *rdev = container_of(ws, struct md_rdev, del_work);
-	kobject_del(&rdev->kobj);
-	kobject_put(&rdev->kobj);
-}
-
 void md_autodetect_dev(dev_t dev);
 
 static void export_rdev(struct md_rdev *rdev)
@@ -2452,6 +2466,8 @@  static void export_rdev(struct md_rdev *rdev)
 
 static void md_kick_rdev_from_array(struct md_rdev *rdev)
 {
+	struct mddev *mddev = rdev->mddev;
+
 	bd_unlink_disk_holder(rdev->bdev, rdev->mddev->gendisk);
 	list_del_rcu(&rdev->same_set);
 	pr_debug("md: unbind<%pg>\n", rdev->bdev);
@@ -2465,15 +2481,17 @@  static void md_kick_rdev_from_array(struct md_rdev *rdev)
 	rdev->sysfs_unack_badblocks = NULL;
 	rdev->sysfs_badblocks = NULL;
 	rdev->badblocks.count = 0;
-	/* We need to delay this, otherwise we can deadlock when
-	 * writing to 'remove' to "dev/state".  We also need
-	 * to delay it due to rcu usage.
-	 */
+
 	synchronize_rcu();
-	INIT_WORK(&rdev->del_work, rdev_delayed_delete);
-	kobject_get(&rdev->kobj);
-	queue_work(md_rdev_misc_wq, &rdev->del_work);
-	export_rdev(rdev);
+
+	/*
+	 * kobject_del() will wait for all in progress writers to be done, where
+	 * reconfig_mutex is held, hence it can't be called under
+	 * reconfig_mutex and it's delayed to mddev_unlock().
+	 */
+	mutex_lock(&mddev->delete_mutex);
+	list_add(&rdev->same_set, &mddev->deleting);
+	mutex_unlock(&mddev->delete_mutex);
 }
 
 static void export_array(struct mddev *mddev)
@@ -3541,6 +3559,7 @@  rdev_attr_store(struct kobject *kobj, struct attribute *attr,
 {
 	struct rdev_sysfs_entry *entry = container_of(attr, struct rdev_sysfs_entry, attr);
 	struct md_rdev *rdev = container_of(kobj, struct md_rdev, kobj);
+	struct kernfs_node *kn = NULL;
 	ssize_t rv;
 	struct mddev *mddev = rdev->mddev;
 
@@ -3548,6 +3567,10 @@  rdev_attr_store(struct kobject *kobj, struct attribute *attr,
 		return -EIO;
 	if (!capable(CAP_SYS_ADMIN))
 		return -EACCES;
+
+	if (entry->store == state_store && cmd_match(page, "remove"))
+		kn = sysfs_break_active_protection(kobj, attr);
+
 	rv = mddev ? mddev_lock(mddev) : -ENODEV;
 	if (!rv) {
 		if (rdev->mddev == NULL)
@@ -3556,6 +3579,10 @@  rdev_attr_store(struct kobject *kobj, struct attribute *attr,
 			rv = entry->store(rdev, page, length);
 		mddev_unlock(mddev);
 	}
+
+	if (kn)
+		sysfs_unbreak_active_protection(kn);
+
 	return rv;
 }
 
@@ -4479,20 +4506,6 @@  null_show(struct mddev *mddev, char *page)
 	return -EINVAL;
 }
 
-/* need to ensure rdev_delayed_delete() has completed */
-static void flush_rdev_wq(struct mddev *mddev)
-{
-	struct md_rdev *rdev;
-
-	rcu_read_lock();
-	rdev_for_each_rcu(rdev, mddev)
-		if (work_pending(&rdev->del_work)) {
-			flush_workqueue(md_rdev_misc_wq);
-			break;
-		}
-	rcu_read_unlock();
-}
-
 static ssize_t
 new_dev_store(struct mddev *mddev, const char *buf, size_t len)
 {
@@ -4520,7 +4533,6 @@  new_dev_store(struct mddev *mddev, const char *buf, size_t len)
 	    minor != MINOR(dev))
 		return -EOVERFLOW;
 
-	flush_rdev_wq(mddev);
 	err = mddev_lock(mddev);
 	if (err)
 		return err;
@@ -5590,7 +5602,6 @@  struct mddev *md_alloc(dev_t dev, char *name)
 	 * removed (mddev_delayed_delete).
 	 */
 	flush_workqueue(md_misc_wq);
-	flush_workqueue(md_rdev_misc_wq);
 
 	mutex_lock(&disks_mutex);
 	mddev = mddev_alloc(dev);
@@ -7553,9 +7564,6 @@  static int md_ioctl(struct block_device *bdev, fmode_t mode,
 
 	}
 
-	if (cmd == ADD_NEW_DISK || cmd == HOT_ADD_DISK)
-		flush_rdev_wq(mddev);
-
 	if (cmd == HOT_REMOVE_DISK)
 		/* need to ensure recovery thread has run */
 		wait_event_interruptible_timeout(mddev->sb_wait,
@@ -9618,10 +9626,6 @@  static int __init md_init(void)
 	if (!md_misc_wq)
 		goto err_misc_wq;
 
-	md_rdev_misc_wq = alloc_workqueue("md_rdev_misc", 0, 0);
-	if (!md_rdev_misc_wq)
-		goto err_rdev_misc_wq;
-
 	ret = __register_blkdev(MD_MAJOR, "md", md_probe);
 	if (ret < 0)
 		goto err_md;
@@ -9640,8 +9644,6 @@  static int __init md_init(void)
 err_mdp:
 	unregister_blkdev(MD_MAJOR, "md");
 err_md:
-	destroy_workqueue(md_rdev_misc_wq);
-err_rdev_misc_wq:
 	destroy_workqueue(md_misc_wq);
 err_misc_wq:
 	destroy_workqueue(md_wq);
@@ -9937,7 +9939,6 @@  static __exit void md_exit(void)
 	}
 	spin_unlock(&all_mddevs_lock);
 
-	destroy_workqueue(md_rdev_misc_wq);
 	destroy_workqueue(md_misc_wq);
 	destroy_workqueue(md_wq);
 }
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 1eec65cf783c..4d191db831da 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -531,6 +531,14 @@  struct mddev {
 	unsigned int			good_device_nr;	/* good device num within cluster raid */
 	unsigned int			noio_flag; /* for memalloc scope API */
 
+	/*
+	 * Temporarily store rdev that will be finally removed when
+	 * reconfig_mutex is unlocked.
+	 */
+	struct list_head		deleting;
+	/* Protect the deleting list */
+	struct mutex			delete_mutex;
+
 	bool	has_superblocks:1;
 	bool	fail_last_dev:1;
 	bool	serialize_policy:1;
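Taken together, the patch replaces the per-rdev delayed work with a per-mddev `deleting` list that md_kick_rdev_from_array() fills under `delete_mutex` and that mddev_unlock() drains only after `reconfig_mutex` has been dropped, so kobject_del() never runs under `reconfig_mutex`. A rough userspace model of that unlock-then-drain ordering (hypothetical names, a counter in place of the list, pthreads in place of the kernel mutex API):

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t reconfig_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t delete_mutex   = PTHREAD_MUTEX_INITIALIZER;
static int pending;                  /* stand-in for mddev->deleting */
static int reconfig_held;            /* tracked only for the model */
static int deleted_under_reconfig;   /* would flag the old deadlock */

/* Called with reconfig_mutex held: only park the rdev for later. */
static void kick_rdev(void)
{
	pthread_mutex_lock(&delete_mutex);
	pending++;
	pthread_mutex_unlock(&delete_mutex);
}

/* Drop reconfig_mutex first, then drain; returns entries drained. */
static int mddev_unlock(void)
{
	int drained = 0;

	reconfig_held = 0;
	pthread_mutex_unlock(&reconfig_mutex);

	pthread_mutex_lock(&delete_mutex);
	while (pending) {
		if (reconfig_held)
			deleted_under_reconfig = 1;
		pending--;
		drained++;   /* kobject_del()/export_rdev() happen here */
	}
	pthread_mutex_unlock(&delete_mutex);
	return drained;
}
```

The sysfs side (`sysfs_break_active_protection()` before taking `reconfig_mutex` in rdev_attr_store()) is what lets a "remove" write delete its own kobject without this drain deadlocking against an in-flight sysfs writer; that part has no meaningful userspace analog and is not modeled here.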