[md-6.9,00/10] md/raid1: refactor read_balance() and some minor fix

Message ID 20240222075806.1816400-1-yukuai1@huaweicloud.com (mailing list archive)

Message

Yu Kuai Feb. 22, 2024, 7:57 a.m. UTC
From: Yu Kuai <yukuai3@huawei.com>

The original idea is that Paul wants to optimize raid1 read
performance ([1]). However, we think the original code for
read_balance() is already quite complex, and we don't want to add more
complexity. Hence we decided to refactor read_balance() first, to make
the code cleaner and easier for follow-up work.

Before this patchset, read_balance() has many local variables and many
branches, and it tries to consider all the scenarios in one iteration.
The idea of this patchset is to divide the logic into 4 different steps:

1) If resync is in progress, find the first usable disk, patch 5;
Otherwise:
2) Loop through all disks, skipping slow disks and disks with bad
blocks, and choose the best disk, patch 10. If no disk is found:
3) Look for disks with bad blocks and choose the one with the most
readable sectors, patch 8. If no disk is found:
4) Choose the first slow disk with no bad blocks, or the slow disk with
the most readable sectors, patch 7.

Note that step 3) and step 4) are rarely taken code paths, so
performance does not need to be considered there.
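
To make the intended structure concrete, below is a rough sketch of the
control flow read_balance() is expected to end up with after the series.
This is illustrative only: raid1_should_read_first(), read_first_rdev(),
choose_bb_rdev() and choose_slow_rdev() are taken from the patch titles,
while choose_best_rdev() and update_read_sectors() are placeholder names
for the helpers of patch 10 and patch 9, and the real signatures in
drivers/md/raid1.c may differ:

/* Illustrative sketch only -- not the exact code from the patches. */
static int read_balance(struct r1conf *conf, struct r1bio *r1_bio,
			int *max_sectors)
{
	int disk;

	/* Step 1: resync/recovery in progress, pick the first usable disk. */
	if (raid1_should_read_first(conf->mddev, r1_bio->sector,
				    r1_bio->sectors))
		return read_first_rdev(conf, r1_bio, max_sectors);

	/* Step 2: the common path, pick the best disk. */
	disk = choose_best_rdev(conf, r1_bio);
	if (disk >= 0) {
		*max_sectors = r1_bio->sectors;
		update_read_sectors(conf, disk, r1_bio->sector,
				    r1_bio->sectors);
		return disk;
	}

	/*
	 * Step 3: no clean disk was found, pick the disk with bad blocks
	 * that can serve the most sectors of this read.
	 */
	disk = choose_bb_rdev(conf, r1_bio, max_sectors);
	if (disk >= 0)
		return disk;

	/* Step 4: last resort, fall back to a slow (write-mostly) disk. */
	return choose_slow_rdev(conf, r1_bio, max_sectors);
}

Each helper owns exactly one of the steps above, which is what allows
patches 5-10 to move the corresponding logic out of the old monolithic
loop one piece at a time.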

After this patchset, we'll continue to optimize read_balance() in
step 2), specifically how to choose the best rdev to read from.

[1] https://lore.kernel.org/all/20240102125115.129261-1-paul.e.luse@linux.intel.com/

Yu Kuai (10):
  md: add a new helper rdev_has_badblock()
  md: record nonrot rdevs while adding/removing rdevs to conf
  md/raid1: fix choose next idle in read_balance()
  md/raid1-10: add a helper raid1_check_read_range()
  md/raid1-10: factor out a new helper raid1_should_read_first()
  md/raid1: factor out read_first_rdev() from read_balance()
  md/raid1: factor out choose_slow_rdev() from read_balance()
  md/raid1: factor out choose_bb_rdev() from read_balance()
  md/raid1: factor out the code to manage sequential IO
  md/raid1: factor out helpers to choose the best rdev from
    read_balance()

 drivers/md/md.c       |  28 ++-
 drivers/md/md.h       |  12 ++
 drivers/md/raid1-10.c |  69 +++++++
 drivers/md/raid1.c    | 454 ++++++++++++++++++++++++------------------
 drivers/md/raid10.c   |  66 ++----
 drivers/md/raid5.c    |  35 ++--
 6 files changed, 402 insertions(+), 262 deletions(-)

Comments

Paul Menzel Feb. 22, 2024, 8:40 a.m. UTC | #1
Dear Kuai, dear Paul,


Thank you for your work. Some nits and a request for benchmarks below.


On 22.02.24 at 08:57, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> The orignial idea is that Paul want to optimize raid1 read

original

> performance([1]), however, we think that the orignial code for

original

> read_balance() is quite complex, and we don't want to add more
> complexity. Hence we decide to refactor read_balance() first, to make
> code cleaner and easier for follow up.
> 
> Before this patchset, read_balance() has many local variables and many
> braches, it want to consider all the scenarios in one iteration. The

branches

> idea of this patch is to devide them into 4 different steps:

divide

> 1) If resync is in progress, find the first usable disk, patch 5;
> Otherwise:
> 2) Loop through all disks and skipping slow disks and disks with bad
> blocks, choose the best disk, patch 10. If no disk is found:
> 3) Look for disks with bad blocks and choose the one with most number of
> sectors, patch 8. If no disk is found:
> 4) Choose first found slow disk with no bad blocks, or slow disk with
> most number of sectors, patch 7.
> 
> Note that step 3) and step 4) are super code path, and performance
> should not be considered.
> 
> And after this patchset, we'll continue to optimize read_balance for
> step 2), specifically how to choose the best rdev to read.

Is there a change in performance with the current patch set? Is raid1 
well enough covered by the test suite?


Kind regards,

Paul


> [1] https://lore.kernel.org/all/20240102125115.129261-1-paul.e.luse@linux.intel.com/
> 
> Yu Kuai (10):
>    md: add a new helper rdev_has_badblock()
>    md: record nonrot rdevs while adding/removing rdevs to conf
>    md/raid1: fix choose next idle in read_balance()
>    md/raid1-10: add a helper raid1_check_read_range()
>    md/raid1-10: factor out a new helper raid1_should_read_first()
>    md/raid1: factor out read_first_rdev() from read_balance()
>    md/raid1: factor out choose_slow_rdev() from read_balance()
>    md/raid1: factor out choose_bb_rdev() from read_balance()
>    md/raid1: factor out the code to manage sequential IO
>    md/raid1: factor out helpers to choose the best rdev from
>      read_balance()
> 
>   drivers/md/md.c       |  28 ++-
>   drivers/md/md.h       |  12 ++
>   drivers/md/raid1-10.c |  69 +++++++
>   drivers/md/raid1.c    | 454 ++++++++++++++++++++++++------------------
>   drivers/md/raid10.c   |  66 ++----
>   drivers/md/raid5.c    |  35 ++--
>   6 files changed, 402 insertions(+), 262 deletions(-)
Yu Kuai Feb. 22, 2024, 9:08 a.m. UTC | #2
Hi,

On 2024/02/22 16:40, Paul Menzel wrote:
> Is there a change in performance with the current patch set? Is raid1 
> well enough covered by the test suite?

Yes, there is no performance degradation, and the mdadm tests passed.
Paul Luse also ran an fio mixed workload with data integrity checks and
it passed.

Thanks,
Kuai
Luse, Paul E Feb. 22, 2024, 1:04 p.m. UTC | #3
> On Feb 22, 2024, at 2:08 AM, Yu Kuai <yukuai1@huaweicloud.com> wrote:
> 
> Hi,
> 
> On 2024/02/22 16:40, Paul Menzel wrote:
>> Is there a change in performance with the current patch set? Is raid1 well enough covered by the test suite?
> 
> Yes, there is no performance degradation, and mdadm tests passed. And
> Paul Luse also ran fio mixed workload w/data integrity and it passes.
> 
> Thanks,
> Kuai
> 

Kuai is correct. In my original perf improvement patch I included lots of results.  For this set, where we just refactored, I checked performance to assure we didn't go downhill, but didn't save the results, as the deltas were in the noise.  After this series lands we will look at introducing performance improvements again, and at that time results from a full performance sweep will be included.

For data integrity, 1- and 2-disk mirrors were run overnight w/fio and a crc32 check - no issues.
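
For reference, that kind of data-integrity run can be expressed as an fio job roughly like the one below. This is only an illustrative sketch, not the exact job file that was used; the device path, runtime, queue depth and read/write mix are placeholders:

# Illustrative fio job: mixed random I/O with CRC32 verification against
# an md raid1 device. All values below are placeholders, not the exact
# settings used for the overnight runs described above.
[global]
filename=/dev/md0
direct=1
ioengine=libaio
iodepth=16
bs=4k
time_based=1
runtime=8h

[raid1-mixed-verify]
rw=randrw
rwmixread=70
verify=crc32
verify_fatal=1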

Assuring that other code paths execute as they did before was a little trickier without a unit test framework, but for those cases I did modify/un-modify the code several times to follow various code paths and assure they're working as expected (i.e. bad blocks, etc.).

-Paul
Paul Menzel Feb. 22, 2024, 3:30 p.m. UTC | #4
Dear Paul, dear Kuai,


On 22.02.24 at 14:04, Luse, Paul E wrote:

>> On Feb 22, 2024, at 2:08 AM, Yu Kuai <yukuai1@huaweicloud.com> wrote:

>> On 2024/02/22 16:40, Paul Menzel wrote:
>>> Is there a change in performance with the current patch set? Is
>>> raid1 well enough covered by the test suite?
>> 
>> Yes, there is no performance degradation, and mdadm tests passed.
>> And Paul Luse also ran fio mixed workload w/data integrity and it
>> passes.
> 
> Kuai is correct, in my original perf improvement patch I included
> lots of results.  For this set where we just refactored I checked
> performance to assure we didn't go downhill but didn't save the
> results as deltas were in the noise.  After this series lands we will
> look at introducing performance improvements again and at that time
> results from a full performance sweep will be included.
> 
> For data integrity, 1 and 2 disk mirrors were run overnight w/fio and
> crc32 check - no issues.
> 
> To assure other code paths execute as they did before was a little
> trickier without a unit test framework but for those cases I did
> modify/un-modify the code several times to follow various code paths
> and assure they're working as expected (ie bad blocks, etc)
Thank you very much for the elaborate response.

In our infrastructure, we often notice things improve, but we sometimes 
also have the “feeling” that things get worse. As IO is so complex, I 
always find it helpful to note down exactly the test setup and the tests 
that were run. So thank you for responding.


Kind regards,

Paul
Song Liu Feb. 27, 2024, 12:27 a.m. UTC | #5
Hi Kuai and Paul,

On Thu, Feb 22, 2024 at 12:03 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> The original idea is that Paul wants to optimize raid1 read
> performance ([1]). However, we think the original code for
> read_balance() is already quite complex, and we don't want to add more
> complexity. Hence we decided to refactor read_balance() first, to make
> the code cleaner and easier for follow-up work.
>
> Before this patchset, read_balance() has many local variables and many
> branches, and it tries to consider all the scenarios in one iteration.
> The idea of this patchset is to divide the logic into 4 different steps:
>
> 1) If resync is in progress, find the first usable disk, patch 5;
> Otherwise:
> 2) Loop through all disks, skipping slow disks and disks with bad
> blocks, and choose the best disk, patch 10. If no disk is found:
> 3) Look for disks with bad blocks and choose the one with the most
> readable sectors, patch 8. If no disk is found:
> 4) Choose the first slow disk with no bad blocks, or the slow disk with
> the most readable sectors, patch 7.

Thanks for your great work on this set. It looks good.

Please address the feedback from folks and send v2. We can still get this
into the 6.9 merge window.

Thanks,
Song