Message ID | vsag6vp4jokp2k5fkoqb5flklghpakxmglr75vpzgkmzejc47u@ih2255x374rp (mailing list archive) |
---|---|
State | Rejected, archived |
Series | md/raid1: freeze block layer queue during reshape |
Hi,

On 2023/07/02 18:04, Xueshi Hu wrote:
> When a raid device is reshaped, an in-flight bio may reference outdated
> r1conf::raid_disks and r1bio::poolinfo. This can trigger a bug in three
> possible paths:
>
> 1. In function "raid1d". If a bio fails to submit, it will be resent to
> raid1d for retrying the submission, which increases r1conf::nr_queued.
> If the reshape happens, the in-flight bio cannot be freed normally as
> the old mempool has been destroyed.
> 2. In raid1_write_request. If one raw device is blocked, the kernel will
> allow the barrier and wait for the raw device to become ready, which
> makes a raid reshape possible. Then, the local variable "disks" set
> before the label "retry_write" is outdated. Additionally, the kernel
> cannot reuse the old r1bio.
> 3. In raid_end_bio_io. The kernel must free the r1bio first and then
> allow the barrier.
>
> By freezing the queue, we can ensure that there are no in-flight bios
> during reshape. This prevents a bio from referencing the outdated
> r1conf::raid_disks or r1bio::poolinfo.

I didn't look into the details of the problem you described, but even if
the problem exists, freezing the queue can't help at all:
blk_mq_freeze_queue() for a bio-based device can't guarantee that there
are no in-flight bios.

Thanks,
Kuai

> Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com>
> ---
>  drivers/md/raid1.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> [...]
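For context, blk_mq_freeze_queue() only drains the queue's q_usage_counter. A much-simplified paraphrase of that logic (not literal kernel source; locking, freeze depth and the mq-specific details are omitted) is:

```c
/* Much-simplified paraphrase of queue freezing, not literal kernel source. */
void blk_mq_freeze_queue(struct request_queue *q)
{
	/* Stop new submitters from getting a queue reference ... */
	percpu_ref_kill(&q->q_usage_counter);
	/* ... then wait until every existing reference has been dropped. */
	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
}
```

Whether "q_usage_counter drained" implies "no in-flight I/O" therefore depends on how long a submitter holds the reference, and that is exactly where rq-based and bio-based devices differ.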
On Mon, Jul 03, 2023 at 09:44:03AM +0800, Yu Kuai wrote:
> Hi,
>
> On 2023/07/02 18:04, Xueshi Hu wrote:
> [...]
>
> I didn't look into the details of the problem you described, but even if
> the problem exists, freezing the queue can't help at all:
> blk_mq_freeze_queue() for a bio-based device can't guarantee that there
> are no in-flight bios.
>
> Thanks,
> Kuai
> [...]

With the bash script below, it's easy to trigger the bug:
1. First, start fio to issue requests to the raid device.
2. Set one of the raw devices' state to "blocked".
3. Reshape the raid device, then clear the flag with "-blocked".

```
parted -s /dev/sda -- mklabel gpt
parted /dev/sda -- mkpart primary 0G 1G
parted -s /dev/sdc -- mklabel gpt
parted /dev/sdc -- mkpart primary 0G 1G

yes | mdadm --create /dev/md10 --level=mirror \
        --force --raid-devices=2 /dev/sda1 /dev/sdc1
mdadm --wait /dev/md10

nohup fio fio.job &

device_num=2
for ((i = 0; i <= 100000; i = i + 1)); do
        sleep 1
        echo "blocked" >/sys/devices/virtual/block/md10/md/dev-sda1/state
        if [[ $((i % 2)) -eq 0 ]]; then
                device_num=2
        else
                device_num=1800
        fi
        mdadm --grow --force --raid-devices=$device_num /dev/md10
        sleep 1
        echo "-blocked" >/sys/devices/virtual/block/md10/md/dev-sda1/state
done
```

The fio configuration, file fio.job:
```
[global]
ioengine=libaio
bs=4k
numjobs=1
iodepth=128
direct=1
rate=1M,1M

[md10]
time_based
runtime=-1
rw=randwrite
filename=/dev/md10
```

The kernel crashed when trying to free the r1bio:

[ 116.977805] ? __die+0x23/0x70
[ 116.977962] ? page_fault_oops+0x181/0x470
[ 116.978148] ? exc_page_fault+0x71/0x180
[ 116.978331] ? asm_exc_page_fault+0x26/0x30
[ 116.978523] ? bio_put+0xe/0x130
[ 116.978672] raid_end_bio_io+0xa1/0xd0
[ 116.978854] raid1_end_write_request+0x111/0x350
[ 116.979063] blk_update_request+0x114/0x480
[ 116.979253] ? __ata_sff_port_intr+0x9c/0x160
[ 116.979452] scsi_end_request+0x27/0x1c0
[ 116.979633] scsi_io_completion+0x5a/0x6a0
[ 116.979822] blk_complete_reqs+0x3d/0x50
[ 116.980000] __do_softirq+0x113/0x3aa
[ 116.980169] irq_exit_rcu+0x8e/0xb0
[ 116.980334] common_interrupt+0x86/0xa0
[ 116.980508] </IRQ>
[ 116.980606] <TASK>
[ 116.980704] asm_common_interrupt+0x26/0x40
[ 116.980897] RIP: 0010:default_idle+0xf/0x20

As far as I know, when a request is allocated,
request_queue::q_usage_counter is increased. When the I/O finishes,
request_queue::q_usage_counter is decreased; taking the nvme driver as
an example:

    nvme_complete_batch()
      blk_mq_end_request_batch()
        blk_mq_flush_tag_batch()
          percpu_ref_put_many(&q->q_usage_counter, nr_tags);

So, when blk_mq_freeze_queue() returns successfully, every in-flight
I/O has come back from the hardware, and new requests are blocked as
well.

Thanks,
Hu
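For an rq-based device this reasoning does hold: the reference that the completion path puts back is the one taken when the request was set up, so it is held for the whole lifetime of the I/O. A condensed sketch of that pairing (paraphrased; the real get/put go through blk_queue_enter() and blk_queue_exit()):

```c
/*
 * Condensed sketch of the q_usage_counter lifetime for an rq-based
 * device (paraphrased, not literal kernel source).
 */
static void rq_based_lifetime(struct request_queue *q)
{
	struct request *rq;

	/* get: blk_mq_alloc_request() -> blk_queue_enter() ->
	 *      percpu_ref_tryget_live(&q->q_usage_counter) */
	rq = blk_mq_alloc_request(q, REQ_OP_WRITE, 0);
	if (IS_ERR(rq))
		return;

	/* ... the driver executes the request, the hardware completes it ... */

	/* put: blk_mq_end_request() -> blk_mq_free_request() ->
	 *      blk_queue_exit() -> percpu_ref_put(&q->q_usage_counter) */
	blk_mq_end_request(rq, BLK_STS_OK);
}
```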
Hi,

On 2023/07/03 17:47, Xueshi Hu wrote:
> On Mon, Jul 03, 2023 at 09:44:03AM +0800, Yu Kuai wrote:
>> [...]
>> blk_mq_freeze_queue() for a bio-based device can't guarantee that there
>> are no in-flight bios.
> [...]
>
> The kernel crashed when trying to free the r1bio:
>
> [ 116.978523] ? bio_put+0xe/0x130
> [ 116.978672] raid_end_bio_io+0xa1/0xd0
> [ 116.978854] raid1_end_write_request+0x111/0x350
> [...]

This looks like freeze_array() doesn't work as expected.

> As far as I know, when a request is allocated,
> request_queue::q_usage_counter is increased. When the I/O finishes,
> request_queue::q_usage_counter is decreased; taking the nvme driver as
> an example:
>
>     nvme_complete_batch()
>       blk_mq_end_request_batch()
>         blk_mq_flush_tag_batch()
>           percpu_ref_put_many(&q->q_usage_counter, nr_tags);
>
> So, when blk_mq_freeze_queue() returns successfully, every in-flight
> I/O has come back from the hardware, and new requests are blocked as
> well.

This only works for rq-based devices, not bio-based devices.

Thanks,
Kuai

> Thanks,
> Hu
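A tiny user-space model of the distinction being made here (purely illustrative; the two counters merely stand in for q->q_usage_counter and for the I/O that raid1_reshape() actually has to worry about):

```c
#include <stdio.h>

static int q_usage_counter;   /* stands in for q->q_usage_counter          */
static int ios_in_flight;     /* what raid1_reshape() actually cares about */

/* rq-based: the queue reference is held from dispatch until completion. */
static void rq_based_io(void)
{
	q_usage_counter++;    /* request allocated (blk_queue_enter)       */
	ios_in_flight++;
	/* ... the hardware completes the request ... */
	ios_in_flight--;
	q_usage_counter--;    /* request completed (blk_queue_exit)        */
}

/* bio-based: the reference only spans the ->submit_bio() call itself. */
static void bio_based_io(void)
{
	q_usage_counter++;    /* taken before ->submit_bio()               */
	ios_in_flight++;      /* raid1 queues the bio internally           */
	q_usage_counter--;    /* dropped as soon as ->submit_bio() returns */
	/* the bio completes later, long after the reference is gone */
}

int main(void)
{
	rq_based_io();
	printf("rq-based : q_usage_counter=%d, in flight=%d\n",
	       q_usage_counter, ios_in_flight);

	bio_based_io();
	printf("bio-based: q_usage_counter=%d, in flight=%d\n",
	       q_usage_counter, ios_in_flight);

	/*
	 * Freezing only waits for q_usage_counter to reach zero, so in the
	 * bio-based case it can complete while an I/O is still in flight.
	 */
	return 0;
}
```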
Hi,

On 2023/07/03 19:19, Yu Kuai wrote:
> Hi,
>
> On 2023/07/03 17:47, Xueshi Hu wrote:
>> [...]
>> The kernel crashed when trying to free the r1bio:
>>
>> [ 116.978523] ? bio_put+0xe/0x130
>> [ 116.978672] raid_end_bio_io+0xa1/0xd0
>> [ 116.978854] raid1_end_write_request+0x111/0x350
>> [...]
>
> This looks like freeze_array() doesn't work as expected.

After a quick look, I think the problem is that before waiting for the
blocked rdev in raid1_write_request(), allow_barrier() is called, hence
freeze_array() won't wait for this r1_bio; if the old bio pool is then
freed by raid1_reshape(), this problem is triggered.

I think the right fix might be freeing the old r1_bio and allocating a
new r1_bio in this case.

Thanks,
Kuai

> [...]
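A rough sketch of the direction suggested here, for the blocked-rdev branch of raid1_write_request() (an illustration of the idea only, not an actual patch: free_r1bio(), allow_barrier(), md_wait_for_blocked_rdev(), wait_barrier() and alloc_r1bio() are existing md/raid1 helpers, but the exact signatures and the surrounding rdev bookkeeping may differ):

```c
	if (unlikely(blocked_rdev)) {
		/* drop the per-rdev references taken so far (elided) */

		/*
		 * Free the r1_bio *before* lifting the barrier, so that a
		 * concurrent raid1_reshape() never sees an r1_bio that
		 * points at the old mempool or the old raid_disks count.
		 */
		free_r1bio(r1_bio);
		allow_barrier(conf, bio->bi_iter.bi_sector);

		md_wait_for_blocked_rdev(blocked_rdev, mddev);

		/*
		 * Re-take the barrier and restart with a fresh r1_bio,
		 * allocated from whatever pool is current after a possible
		 * reshape.
		 */
		wait_barrier(conf, bio->bi_iter.bi_sector, false);
		r1_bio = alloc_r1bio(mddev, bio);
		goto retry_write;
	}
```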
On Mon, Jul 03, 2023 at 07:19:35PM +0800, Yu Kuai wrote:
> Hi,
>
> On 2023/07/03 17:47, Xueshi Hu wrote:
>> [...]
>> As far as I know, when a request is allocated,
>> request_queue::q_usage_counter is increased. When the I/O finishes,
>> request_queue::q_usage_counter is decreased; taking the nvme driver as
>> an example:
>>
>>     nvme_complete_batch()
>>       blk_mq_end_request_batch()
>>         blk_mq_flush_tag_batch()
>>           percpu_ref_put_many(&q->q_usage_counter, nr_tags);
>>
>> So, when blk_mq_freeze_queue() returns successfully, every in-flight
>> I/O has come back from the hardware, and new requests are blocked as
>> well.
>
> This only works for rq-based devices, not bio-based devices.

I finally get your point: raid is a bio-based device[^1], and its I/O
path diverges from that of rq-based devices at __submit_bio(). A
bio-based device calls ->submit_bio() and puts the refcount immediately,
without waiting for the in-flight I/O to come back.

Thank you for catching my mistake.

[^1]: https://lwn.net/Articles/736534/

> Thanks,
> Kuai

Thanks,
Hu
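For completeness, the divergence point named above, the bio-based branch of __submit_bio(), looks roughly like this (simplified paraphrase; the exact shape varies between kernel versions):

```c
/* Simplified paraphrase of the bio-based branch of __submit_bio(). */
static void __submit_bio(struct bio *bio)
{
	struct gendisk *disk = bio->bi_bdev->bd_disk;

	if (bio_queue_enter(bio))	/* q_usage_counter++ */
		return;
	disk->fops->submit_bio(bio);	/* md_submit_bio(), which ends up
					 * in raid1_make_request() */
	blk_queue_exit(disk->queue);	/* q_usage_counter--, while the bio
					 * may still be in flight inside raid1 */
}
```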
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index dd25832eb045..d8d6825d0af6 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -3247,6 +3247,7 @@ static int raid1_reshape(struct mddev *mddev)
 	unsigned long flags;
 	int d, d2;
 	int ret;
+	struct request_queue *q = mddev->queue;
 
 	memset(&newpool, 0, sizeof(newpool));
 	memset(&oldpool, 0, sizeof(oldpool));
@@ -3296,6 +3297,7 @@ static int raid1_reshape(struct mddev *mddev)
 		return -ENOMEM;
 	}
 
+	blk_mq_freeze_queue(q);
 	freeze_array(conf, 0);
 
 	/* ok, everything is stopped */
@@ -3333,6 +3335,7 @@ static int raid1_reshape(struct mddev *mddev)
 	md_wakeup_thread(mddev->thread);
 
 	mempool_exit(&oldpool);
+	blk_mq_unfreeze_queue(q);
 	return 0;
 }
When a raid device is reshaped, an in-flight bio may reference outdated
r1conf::raid_disks and r1bio::poolinfo. This can trigger a bug in three
possible paths:

1. In function "raid1d". If a bio fails to submit, it will be resent to
raid1d for retrying the submission, which increases r1conf::nr_queued.
If the reshape happens, the in-flight bio cannot be freed normally as
the old mempool has been destroyed.
2. In raid1_write_request. If one raw device is blocked, the kernel will
allow the barrier and wait for the raw device to become ready, which
makes a raid reshape possible. Then, the local variable "disks" set
before the label "retry_write" is outdated. Additionally, the kernel
cannot reuse the old r1bio.
3. In raid_end_bio_io. The kernel must free the r1bio first and then
allow the barrier.

By freezing the queue, we can ensure that there are no in-flight bios
during reshape. This prevents a bio from referencing the outdated
r1conf::raid_disks or r1bio::poolinfo.

Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com>
---
 drivers/md/raid1.c | 3 +++
 1 file changed, 3 insertions(+)