[-next,0/3] md/raid10: reduce lock contention for io

Message ID 20220829131502.165356-1-yukuai1@huaweicloud.com (mailing list archive)

Message

Yu Kuai Aug. 29, 2022, 1:14 p.m. UTC
From: Yu Kuai <yukuai3@huawei.com>

Patch 1 fixes a small problem found by code review.
Patch 2 avoids holding resync_lock in the fast path.
Patch 3 avoids holding the lock for wake_up() in the fast path.
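
For reference, the fast path in patch 2 follows the usual seqlock retry
pattern: take a read-side snapshot, account the I/O optimistically, and
only fall back to the locked slow path if a barrier raced with us. A
minimal sketch of the idea (simplified; see patch 2 for the actual code):

static bool wait_barrier_nolock(struct r10conf *conf)
{
	unsigned int seq = read_seqbegin(&conf->resync_lock);

	if (READ_ONCE(conf->barrier))
		return false;

	/* optimistically count the I/O before validating the snapshot */
	atomic_inc(&conf->nr_pending);
	if (!read_seqretry(&conf->resync_lock, seq))
		return true;

	/* raise_barrier() raced with us: undo and take the slow path */
	if (atomic_dec_and_test(&conf->nr_pending))
		wake_up(&conf->wait_barrier);
	return false;
}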

Test environment:

Architecture: aarch64
CPU: Huawei Kunpeng 920, with four NUMA nodes

Raid10 initialize:
mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

Test cmd:
fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread

Test result:
before this patchset:	2.9 GiB/s
after this patchset:	6.6 GiB/s

Please note that on the Kunpeng 920, memory access latency across
nodes is much worse than within the local node; on other architectures
the performance improvement might not be as significant.

Yu Kuai (3):
  md/raid10: fix improper BUG_ON() in raise_barrier()
  md/raid10: convert resync_lock to use seqlock
  md/raid10: prevent unnecessary calls to wake_up() in fast path

 drivers/md/raid10.c | 88 +++++++++++++++++++++++++++++----------------
 drivers/md/raid10.h |  2 +-
 2 files changed, 59 insertions(+), 31 deletions(-)
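
Patch 3 is the common "skip wake_up() when nobody is waiting"
optimization: wake_up() takes the waitqueue spinlock even when the queue
is empty, so checking for sleepers first avoids that in the fast path.
A rough sketch (helper name illustrative; see patch 3 for the actual
code):

static inline void wake_up_barrier(struct r10conf *conf)
{
	/* only enter wake_up(), and its lock, if someone may be asleep */
	if (wq_has_sleeper(&conf->wait_barrier))
		wake_up(&conf->wait_barrier);
}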

Comments

Guoqing Jiang Aug. 29, 2022, 1:40 p.m. UTC | #1
On 8/29/22 9:14 PM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> Patch 1 fixes a small problem found by code review.
> Patch 2 avoids holding resync_lock in the fast path.
> Patch 3 avoids holding the lock for wake_up() in the fast path.
>
> Test environment:
>
> Architecture: aarch64
> CPU: Huawei Kunpeng 920, with four NUMA nodes
>
> Raid10 initialize:
> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>
> Test cmd:
> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
>
> Test result:
> before this patchset:	2.9 GiB/s
> after this patchset:	6.6 GiB/s

Impressive! Please also run the mdadm test suite to check for regressions.

> Please note that on the Kunpeng 920, memory access latency across
> nodes is much worse than within the local node; on other architectures
> the performance improvement might not be as significant.

By any chance, could someone try this on x86-64?

Thanks,
Guoqing
Paul Menzel Aug. 29, 2022, 1:58 p.m. UTC | #2
Dear Yu,


Thank you for your patches.

On 29.08.22 at 15:14, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Patch 1 fixes a small problem found by code review.
> Patch 2 avoids holding resync_lock in the fast path.
> Patch 3 avoids holding the lock for wake_up() in the fast path.
> 
> Test environment:
> 
> Architecture: aarch64
> CPU: Huawei Kunpeng 920, with four NUMA nodes
> 
> Raid10 initialize:
> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
> 
> Test cmd:
> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
> 
> Test result:
> before this patchset:	2.9 GiB/s
> after this patchset:	6.6 GiB/s

Could you please give more details about the test setup, like the drives 
used?

Did you use some tools like ftrace to figure out the bottleneck?

> Please note that on the Kunpeng 920, memory access latency across
> nodes is much worse than within the local node; on other architectures
> the performance improvement might not be as significant.
> 
> Yu Kuai (3):
>    md/raid10: fix improper BUG_ON() in raise_barrier()
>    md/raid10: convert resync_lock to use seqlock
>    md/raid10: prevent unnecessary calls to wake_up() in fast path
> 
>   drivers/md/raid10.c | 88 +++++++++++++++++++++++++++++----------------
>   drivers/md/raid10.h |  2 +-
>   2 files changed, 59 insertions(+), 31 deletions(-)


Kind regards,

Paul
Yu Kuai Aug. 30, 2022, 1:09 a.m. UTC | #3
Hi, Paul!

On 2022/08/29 21:58, Paul Menzel wrote:
> Dear Yu,
> 
> 
> Thank you for your patches.
> 
> On 29.08.22 at 15:14, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Patch 1 fixes a small problem found by code review.
>> Patch 2 avoids holding resync_lock in the fast path.
>> Patch 3 avoids holding the lock for wake_up() in the fast path.
>>
>> Test environment:
>>
>> Architecture: aarch64
>> CPU: Huawei Kunpeng 920, with four NUMA nodes
>>
>> Raid10 initialize:
>> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 
>> /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>>
>> Test cmd:
>> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 
>> -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 
>> -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
>>
>> Test result:
>> before this patchset:    2.9 GiB/s
>> after this patchset:    6.6 GiB/s
> 
> Could you please give more details about the test setup, like the drives 
> used?

The test setup is described above; four NVMe disks are used.
> 
> Did you use some tools like ftrace to figure out the bottleneck?

Yes, I'm sure the bottleneck is spin_lock(): threads from multiple
nodes try to grab the same lock. By the way, if I bind the threads to
a single node, performance also improves to 6.6 GiB/s without this
patchset.
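
For reference, such binding can be done with numactl (node 0 is just an
example; the fio options are the same as in the cover letter):

numactl --cpunodebind=0 --membind=0 fio -name=0 -ioengine=libaio ...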

Thanks,
Kuai
> 
>> Please note that on the Kunpeng 920, memory access latency across
>> nodes is much worse than within the local node; on other architectures
>> the performance improvement might not be as significant.
>>
>> Yu Kuai (3):
>>    md/raid10: fix improper BUG_ON() in raise_barrier()
>>    md/raid10: convert resync_lock to use seqlock
>>    md/raid10: prevent unnecessary calls to wake_up() in fast path
>>
>>   drivers/md/raid10.c | 88 +++++++++++++++++++++++++++++----------------
>>   drivers/md/raid10.h |  2 +-
>>   2 files changed, 59 insertions(+), 31 deletions(-)
> 
> 
> Kind regards,
> 
> Paul
Yu Kuai Aug. 31, 2022, 11:55 a.m. UTC | #4
Hi!

On 2022/08/29 21:40, Guoqing Jiang wrote:
> 
> 
> On 8/29/22 9:14 PM, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Patch 1 fixes a small problem found by code review.
>> Patch 2 avoids holding resync_lock in the fast path.
>> Patch 3 avoids holding the lock for wake_up() in the fast path.
>>
>> Test environment:
>>
>> Architecture: aarch64
>> CPU: Huawei Kunpeng 920, with four NUMA nodes
>>
>> Raid10 initialize:
>> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 
>> /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>>
>> Test cmd:
>> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 
>> -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 
>> -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
>>
>> Test result:
>> before this patchset:    2.9 GiB/s
>> after this patchset:    6.6 GiB/s
> 
> Impressive! Please also run the mdadm test suite to check for regressions.
> 
>> Please note that on the Kunpeng 920, memory access latency across
>> nodes is much worse than within the local node; on other architectures
>> the performance improvement might not be as significant.
> 
> By any chance, could someone try this on x86-64?

I tested on an Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz:

before this patchset: 7.0 GiB/s
after this patchset:  9.3 GiB/s

Thanks,
Kuai
> 
> Thanks,
> Guoqing
Paul Menzel Aug. 31, 2022, 11:59 a.m. UTC | #5
Dear Yu,


On 30.08.22 at 03:09, Yu Kuai wrote:

> On 2022/08/29 21:58, Paul Menzel wrote:

>> On 29.08.22 at 15:14, Yu Kuai wrote:
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> Patch 1 fixes a small problem found by code review.
>>> Patch 2 avoids holding resync_lock in the fast path.
>>> Patch 3 avoids holding the lock for wake_up() in the fast path.
>>>
>>> Test environment:
>>>
>>> Architecture: aarch64
>>> CPU: Huawei Kunpeng 920, with four NUMA nodes
>>>
>>> Raid10 initialize:
>>> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 
>>> /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>>>
>>> Test cmd:
>>> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 
>>> -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 
>>> -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
>>>
>>> Test result:
>>> before this patchset:    2.9 GiB/s
>>> after this patchset:    6.6 GiB/s
>>
>> Could you please give more details about the test setup, like the 
>> drives used?
> 
> The test setup is described above; four NVMe disks are used.

I was wondering about the drive model, to be able to reproduce the test.

>> Did you use some tools like ftrace to figure out the bottleneck?
> 
> Yes, I'm sure the bottleneck is spin_lock(): threads from multiple
> nodes try to grab the same lock. By the way, if I bind the threads to
> a single node, performance also improves to 6.6 GiB/s without this
> patchset.

Interesting. Maybe you could add all that to the commit message of the 
second patch.


Kind regards,

Paul


>>> Please note that on the Kunpeng 920, memory access latency across
>>> nodes is much worse than within the local node; on other architectures
>>> the performance improvement might not be as significant.
>>>
>>> Yu Kuai (3):
>>>    md/raid10: fix improper BUG_ON() in raise_barrier()
>>>    md/raid10: convert resync_lock to use seqlock
>>>    md/raid10: prevent unnecessary calls to wake_up() in fast path
>>>
>>>   drivers/md/raid10.c | 88 +++++++++++++++++++++++++++++----------------
>>>   drivers/md/raid10.h |  2 +-
>>>   2 files changed, 59 insertions(+), 31 deletions(-)
Yu Kuai Aug. 31, 2022, 12:07 p.m. UTC | #6
Hi, Paul!

On 2022/08/31 19:59, Paul Menzel wrote:
> Dear Yu,
> 
> 
> On 30.08.22 at 03:09, Yu Kuai wrote:
> 
>> On 2022/08/29 21:58, Paul Menzel wrote:
> 
>>> On 29.08.22 at 15:14, Yu Kuai wrote:
>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>
>>>> Patch 1 fixes a small problem found by code review.
>>>> Patch 2 avoids holding resync_lock in the fast path.
>>>> Patch 3 avoids holding the lock for wake_up() in the fast path.
>>>>
>>>> Test environment:
>>>>
>>>> Architecture: aarch64
>>>> CPU: Huawei Kunpeng 920, with four NUMA nodes
>>>>
>>>> Raid10 initialize:
>>>> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 
>>>> /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>>>>
>>>> Test cmd:
>>>> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 
>>>> -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 
>>>> -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
>>>>
>>>> Test result:
>>>> before this patchset:    2.9 GiB/s
>>>> after this patchset:    6.6 GiB/s
>>>
>>> Could you please give more details about the test setup, like the 
>>> drives used?
>>
>> The test setup is described above; four NVMe disks are used.
> 
> I was wondering about the model to be able to reproduce it.
> 
>>> Did you use some tools like ftrace to figure out the bottleneck?
>>
>> Yes, I'm sure the bottleneck is spin_lock(): threads from multiple
>> nodes try to grab the same lock. By the way, if I bind the threads to
>> a single node, performance also improves to 6.6 GiB/s without this
>> patchset.
> 
> Interesting. Maybe you could add all that to the commit message of the 
> second patch.

Of course, I will do that in the next version.

Thanks,
Kuai
> 
> 
> Kind regards,
> 
> Paul
> 
> 
>>>> Please note that on the Kunpeng 920, memory access latency across
>>>> nodes is much worse than within the local node; on other architectures
>>>> the performance improvement might not be as significant.
>>>>
>>>> Yu Kuai (3):
>>>>    md/raid10: fix improper BUG_ON() in raise_barrier()
>>>>    md/raid10: convert resync_lock to use seqlock
>>>>    md/raid10: prevent unnecessary calls to wake_up() in fast path
>>>>
>>>>   drivers/md/raid10.c | 88 +++++++++++++++++++++++++++++----------------
>>>>   drivers/md/raid10.h |  2 +-
>>>>   2 files changed, 59 insertions(+), 31 deletions(-)
Song Liu Aug. 31, 2022, 6 p.m. UTC | #7
On Mon, Aug 29, 2022 at 6:03 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> Patch 1 fixes a small problem found by code review.
> Patch 2 avoids holding resync_lock in the fast path.
> Patch 3 avoids holding the lock for wake_up() in the fast path.
>
> Test environment:
>
> Architecture: aarch64
> CPU: Huawei Kunpeng 920, with four NUMA nodes
>
> Raid10 initialize:
> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>
> Test cmd:
> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
>
> Test result:
> before this patchset:   2.9 GiB/s
> after this patchset:    6.6 GiB/s

Nice work! Applied to md-next.

Thanks,
Song
Yu Kuai Sept. 3, 2022, 6:08 a.m. UTC | #8
Hi, Song

On 2022/09/01 2:00, Song Liu wrote:
> On Mon, Aug 29, 2022 at 6:03 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Patch 1 fixes a small problem found by code review.
>> Patch 2 avoids holding resync_lock in the fast path.
>> Patch 3 avoids holding the lock for wake_up() in the fast path.
>>
>> Test environment:
>>
>> Architecture: aarch64
>> CPU: Huawei Kunpeng 920, with four NUMA nodes
>>
>> Raid10 initialize:
>> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>>
>> Test cmd:
>> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
>>
>> Test result:
>> before this patchset:   2.9 GiB/s
>> after this patchset:    6.6 GiB/s
> 
> Nice work! Applied to md-next.

Can you drop this version? There are some things to improve; I can send
a new version.

Thanks,
Kuai
> 
> Thanks,
> Song
Song Liu Sept. 9, 2022, 2:45 p.m. UTC | #9
On Fri, Sep 2, 2022 at 11:08 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi, Song
>
> On 2022/09/01 2:00, Song Liu wrote:
> > On Mon, Aug 29, 2022 at 6:03 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>
> >> From: Yu Kuai <yukuai3@huawei.com>
> >>
> >> Patch 1 fixes a small problem found by code review.
> >> Patch 2 avoids holding resync_lock in the fast path.
> >> Patch 3 avoids holding the lock for wake_up() in the fast path.
> >>
> >> Test environment:
> >>
> >> Architecture: aarch64
> >> CPU: Huawei Kunpeng 920, with four NUMA nodes
> >>
> >> Raid10 initialize:
> >> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
> >>
> >> Test cmd:
> >> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
> >>
> >> Test result:
> >> before this patchset:   2.9 GiB/s
> >> after this patchset:    6.6 GiB/s
> >
> > Nice work! Applied to md-next.
>
> Can you drop this version? There are some things to improve; I can send
> a new version.

Sure, I will drop it from md-next.

Song