Message ID | 20220829131502.165356-1-yukuai1@huaweicloud.com (mailing list archive) |
---|---|
Series | md/raid10: reduce lock contention for io |
On 8/29/22 9:14 PM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> patch 1 fixes a small problem found by code review.
> patch 2 avoids holding resync_lock in the fast path.
> patch 3 avoids holding the lock in wake_up() in the fast path.
>
> Test environment:
>
> Architecture: aarch64
> CPU: Huawei KUNPENG 920, four NUMA nodes
>
> Raid10 initialize:
> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>
> Test cmd:
> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
>
> Test result:
> before this patchset: 2.9 GiB/s
> after this patchset: 6.6 GiB/s

Impressive! Please also run the mdadm test suite to avoid regressions.

> Please note that on the Kunpeng 920, cross-node memory access latency
> is much worse than local-node latency, so on other architectures the
> performance improvement might not be as significant.

By any chance, can someone try this on x86-64?

Thanks,
Guoqing
Dear Yu,


Thank you for your patches.

On 29.08.22 at 15:14, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> patch 1 fixes a small problem found by code review.
> patch 2 avoids holding resync_lock in the fast path.
> patch 3 avoids holding the lock in wake_up() in the fast path.
>
> Test environment:
>
> Architecture: aarch64
> CPU: Huawei KUNPENG 920, four NUMA nodes
>
> Raid10 initialize:
> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4
> /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>
> Test cmd:
> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1
> -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0
> -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
>
> Test result:
> before this patchset: 2.9 GiB/s
> after this patchset: 6.6 GiB/s

Could you please give more details about the test setup, like the
drives used?

Did you use some tools like ftrace to figure out the bottleneck?

> Please note that on the Kunpeng 920, cross-node memory access latency
> is much worse than local-node latency, so on other architectures the
> performance improvement might not be as significant.
>
> Yu Kuai (3):
>   md/raid10: fix improper BUG_ON() in raise_barrier()
>   md/raid10: convert resync_lock to use seqlock
>   md/raid10: prevent unnecessary calls to wake_up() in fast path
>
>  drivers/md/raid10.c | 88 +++++++++++++++++++++++++++++----------------
>  drivers/md/raid10.h |  2 +-
>  2 files changed, 59 insertions(+), 31 deletions(-)


Kind regards,

Paul
Hi, Paul!

On 2022/08/29 21:58, Paul Menzel wrote:
> Dear Yu,
>
> Thank you for your patches.
>
> On 29.08.22 at 15:14, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> patch 1 fixes a small problem found by code review.
>> patch 2 avoids holding resync_lock in the fast path.
>> patch 3 avoids holding the lock in wake_up() in the fast path.
>>
>> Test environment:
>>
>> Architecture: aarch64
>> CPU: Huawei KUNPENG 920, four NUMA nodes
>>
>> Raid10 initialize:
>> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4
>> /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>>
>> Test cmd:
>> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1
>> -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0
>> -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
>>
>> Test result:
>> before this patchset: 2.9 GiB/s
>> after this patchset: 6.6 GiB/s
>
> Could you please give more details about the test setup, like the
> drives used?

The test setup is described above; four NVMe disks are used.

> Did you use some tools like ftrace to figure out the bottleneck?

Yes, I'm sure the bottleneck is the spin_lock(); specifically, threads
from multiple nodes try to grab the same lock. By the way, if I bind the
threads to the same node, performance also improves to 6.6 GiB/s without
this patchset.

Thanks,
Kuai

>> Please note that on the Kunpeng 920, cross-node memory access latency
>> is much worse than local-node latency, so on other architectures the
>> performance improvement might not be as significant.
>>
>> Yu Kuai (3):
>>   md/raid10: fix improper BUG_ON() in raise_barrier()
>>   md/raid10: convert resync_lock to use seqlock
>>   md/raid10: prevent unnecessary calls to wake_up() in fast path
>>
>>  drivers/md/raid10.c | 88 +++++++++++++++++++++++++++++----------------
>>  drivers/md/raid10.h |  2 +-
>>  2 files changed, 59 insertions(+), 31 deletions(-)
>
> Kind regards,
>
> Paul
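For readers who have not used seqlocks before, here is a minimal, hypothetical C sketch of the idea behind patch 2 ("md/raid10: convert resync_lock to use seqlock"). The struct and function names are made up for illustration, and initialisation, the slow-path wait, and barrier draining are omitted; this is not the actual drivers/md/raid10.c change. The point is only that the submission fast path can check the barrier state on the lock-free read side of a seqlock_t and retry if a writer raced with it, instead of every CPU bouncing one spinlock cache line across NUMA nodes.

#include <linux/seqlock.h>
#include <linux/atomic.h>

struct barrier_state {
        seqlock_t       lock;           /* replaces spinlock_t resync_lock;
                                           seqlock_init() at setup time    */
        bool            barrier;        /* true while resync is running    */
        atomic_t        nr_pending;     /* in-flight regular I/O           */
};

/* Fast path: returns true if I/O may proceed without taking the lock. */
static bool enter_io_fast(struct barrier_state *bs)
{
        unsigned int seq = read_seqbegin(&bs->lock);

        if (READ_ONCE(bs->barrier))
                return false;                   /* fall back to slow path */

        atomic_inc(&bs->nr_pending);
        /* Order the increment before re-checking the sequence. */
        smp_mb__after_atomic();
        if (!read_seqretry(&bs->lock, seq))
                return true;                    /* no writer raced with us */

        /* A writer changed the state meanwhile: undo and take slow path. */
        atomic_dec(&bs->nr_pending);
        return false;
}

/* Slow path (e.g. resync): writers still serialize on the write side. */
static void raise_barrier(struct barrier_state *bs)
{
        write_seqlock_irq(&bs->lock);
        WRITE_ONCE(bs->barrier, true);
        write_sequnlock_irq(&bs->lock);
        /* ... then wait for nr_pending to drain, as before ... */
}

Under this scheme readers that only check state never contend on the lock; a raised barrier bumps the sequence count, so any reader that raced with it simply backs off to the slow path.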
Hi!

On 2022/08/29 21:40, Guoqing Jiang wrote:
>
> On 8/29/22 9:14 PM, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> patch 1 fixes a small problem found by code review.
>> patch 2 avoids holding resync_lock in the fast path.
>> patch 3 avoids holding the lock in wake_up() in the fast path.
>>
>> Test environment:
>>
>> Architecture: aarch64
>> CPU: Huawei KUNPENG 920, four NUMA nodes
>>
>> Raid10 initialize:
>> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4
>> /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>>
>> Test cmd:
>> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1
>> -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0
>> -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
>>
>> Test result:
>> before this patchset: 2.9 GiB/s
>> after this patchset: 6.6 GiB/s
>
> Impressive! Please also run the mdadm test suite to avoid regressions.
>
>> Please note that on the Kunpeng 920, cross-node memory access latency
>> is much worse than local-node latency, so on other architectures the
>> performance improvement might not be as significant.
>
> By any chance, can someone try this on x86-64?

I tested with an Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz:

before this patchset: 7.0 GiB/s
after this patchset:  9.3 GiB/s

Thanks,
Kuai

> Thanks,
> Guoqing
Dear Yu,


On 30.08.22 at 03:09, Yu Kuai wrote:
> On 2022/08/29 21:58, Paul Menzel wrote:
>> On 29.08.22 at 15:14, Yu Kuai wrote:
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> patch 1 fixes a small problem found by code review.
>>> patch 2 avoids holding resync_lock in the fast path.
>>> patch 3 avoids holding the lock in wake_up() in the fast path.
>>>
>>> Test environment:
>>>
>>> Architecture: aarch64
>>> CPU: Huawei KUNPENG 920, four NUMA nodes
>>>
>>> Raid10 initialize:
>>> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4
>>> /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>>>
>>> Test cmd:
>>> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1
>>> -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0
>>> -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
>>>
>>> Test result:
>>> before this patchset: 2.9 GiB/s
>>> after this patchset: 6.6 GiB/s
>>
>> Could you please give more details about the test setup, like the
>> drives used?
>
> The test setup is described above; four NVMe disks are used.

I was wondering about the model, to be able to reproduce it.

>> Did you use some tools like ftrace to figure out the bottleneck?
>
> Yes, I'm sure the bottleneck is the spin_lock(); specifically, threads
> from multiple nodes try to grab the same lock. By the way, if I bind
> the threads to the same node, performance also improves to 6.6 GiB/s
> without this patchset.

Interesting. Maybe you could add all that to the commit message of the
second patch.


Kind regards,

Paul

>>> Please note that on the Kunpeng 920, cross-node memory access latency
>>> is much worse than local-node latency, so on other architectures the
>>> performance improvement might not be as significant.
>>>
>>> Yu Kuai (3):
>>>   md/raid10: fix improper BUG_ON() in raise_barrier()
>>>   md/raid10: convert resync_lock to use seqlock
>>>   md/raid10: prevent unnecessary calls to wake_up() in fast path
>>>
>>>  drivers/md/raid10.c | 88 +++++++++++++++++++++++++++++----------------
>>>  drivers/md/raid10.h |  2 +-
>>>  2 files changed, 59 insertions(+), 31 deletions(-)
Hi, Paul!

On 2022/08/31 19:59, Paul Menzel wrote:
> Dear Yu,
>
>
> On 30.08.22 at 03:09, Yu Kuai wrote:
>
>> On 2022/08/29 21:58, Paul Menzel wrote:
>
>>> On 29.08.22 at 15:14, Yu Kuai wrote:
>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>
>>>> patch 1 fixes a small problem found by code review.
>>>> patch 2 avoids holding resync_lock in the fast path.
>>>> patch 3 avoids holding the lock in wake_up() in the fast path.
>>>>
>>>> Test environment:
>>>>
>>>> Architecture: aarch64
>>>> CPU: Huawei KUNPENG 920, four NUMA nodes
>>>>
>>>> Raid10 initialize:
>>>> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4
>>>> /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>>>>
>>>> Test cmd:
>>>> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1
>>>> -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0
>>>> -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
>>>>
>>>> Test result:
>>>> before this patchset: 2.9 GiB/s
>>>> after this patchset: 6.6 GiB/s
>>>
>>> Could you please give more details about the test setup, like the
>>> drives used?
>>
>> The test setup is described above; four NVMe disks are used.
>
> I was wondering about the model, to be able to reproduce it.
>
>>> Did you use some tools like ftrace to figure out the bottleneck?
>>
>> Yes, I'm sure the bottleneck is the spin_lock(); specifically, threads
>> from multiple nodes try to grab the same lock. By the way, if I bind
>> the threads to the same node, performance also improves to 6.6 GiB/s
>> without this patchset.
>
> Interesting. Maybe you could add all that to the commit message of the
> second patch.

Of course, I will do that in the next version.

Thanks,
Kuai

>
> Kind regards,
>
> Paul
>
>
>>>> Please note that on the Kunpeng 920, cross-node memory access
>>>> latency is much worse than local-node latency, so on other
>>>> architectures the performance improvement might not be as
>>>> significant.
>>>>
>>>> Yu Kuai (3):
>>>>   md/raid10: fix improper BUG_ON() in raise_barrier()
>>>>   md/raid10: convert resync_lock to use seqlock
>>>>   md/raid10: prevent unnecessary calls to wake_up() in fast path
>>>>
>>>>  drivers/md/raid10.c | 88 +++++++++++++++++++++++++++++----------------
>>>>  drivers/md/raid10.h |  2 +-
>>>>  2 files changed, 59 insertions(+), 31 deletions(-)
On Mon, Aug 29, 2022 at 6:03 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> patch 1 fixes a small problem found by code review.
> patch 2 avoids holding resync_lock in the fast path.
> patch 3 avoids holding the lock in wake_up() in the fast path.
>
> Test environment:
>
> Architecture: aarch64
> CPU: Huawei KUNPENG 920, four NUMA nodes
>
> Raid10 initialize:
> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>
> Test cmd:
> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
>
> Test result:
> before this patchset: 2.9 GiB/s
> after this patchset: 6.6 GiB/s

Nice work! Applied to md-next.

Thanks,
Song
Hi, Song

On 2022/09/01 2:00, Song Liu wrote:
> On Mon, Aug 29, 2022 at 6:03 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> patch 1 fixes a small problem found by code review.
>> patch 2 avoids holding resync_lock in the fast path.
>> patch 3 avoids holding the lock in wake_up() in the fast path.
>>
>> Test environment:
>>
>> Architecture: aarch64
>> CPU: Huawei KUNPENG 920, four NUMA nodes
>>
>> Raid10 initialize:
>> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>>
>> Test cmd:
>> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
>>
>> Test result:
>> before this patchset: 2.9 GiB/s
>> after this patchset: 6.6 GiB/s
>
> Nice work! Applied to md-next.

Can you drop this version? There are some things to improve; I can send
a new version.

Thanks,
Kuai

>
> Thanks,
> Song
On Fri, Sep 2, 2022 at 11:08 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi, Song
>
> On 2022/09/01 2:00, Song Liu wrote:
> > On Mon, Aug 29, 2022 at 6:03 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>
> >> From: Yu Kuai <yukuai3@huawei.com>
> >>
> >> patch 1 fixes a small problem found by code review.
> >> patch 2 avoids holding resync_lock in the fast path.
> >> patch 3 avoids holding the lock in wake_up() in the fast path.
> >>
> >> Test environment:
> >>
> >> Architecture: aarch64
> >> CPU: Huawei KUNPENG 920, four NUMA nodes
> >>
> >> Raid10 initialize:
> >> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
> >>
> >> Test cmd:
> >> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
> >>
> >> Test result:
> >> before this patchset: 2.9 GiB/s
> >> after this patchset: 6.6 GiB/s
> >
> > Nice work! Applied to md-next.
>
> Can you drop this version? There are some things to improve; I can send
> a new version.

Sure, I will drop it from md-next.

Song
From: Yu Kuai <yukuai3@huawei.com>

patch 1 fixes a small problem found by code review.
patch 2 avoids holding resync_lock in the fast path.
patch 3 avoids holding the lock in wake_up() in the fast path.

Test environment:

Architecture: aarch64
CPU: Huawei KUNPENG 920, four NUMA nodes

Raid10 initialize:
mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

Test cmd:
fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread

Test result:
before this patchset: 2.9 GiB/s
after this patchset: 6.6 GiB/s

Please note that on the Kunpeng 920, cross-node memory access latency is
much worse than local-node latency, so on other architectures the
performance improvement might not be as significant.

Yu Kuai (3):
  md/raid10: fix improper BUG_ON() in raise_barrier()
  md/raid10: convert resync_lock to use seqlock
  md/raid10: prevent unnecessary calls to wake_up() in fast path

 drivers/md/raid10.c | 88 +++++++++++++++++++++++++++++----------------
 drivers/md/raid10.h |  2 +-
 2 files changed, 59 insertions(+), 31 deletions(-)
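As a closing illustration of the idea behind patch 3 ("md/raid10: prevent unnecessary calls to wake_up() in fast path"): wake_up() always takes the wait queue's internal spinlock, even when nobody is sleeping on it. One common way to avoid that cost in the no-waiter case, sketched below with hypothetical names (this is a sketch of the general technique, not the actual raid10 change), is to check wq_has_sleeper() first; the memory barrier inside wq_has_sleeper() pairs with the one the waiter issues when it queues itself, so a task that is already preparing to sleep is not missed.

#include <linux/wait.h>
#include <linux/atomic.h>

static DECLARE_WAIT_QUEUE_HEAD(wait_barrier_wq);
static atomic_t nr_pending = ATOMIC_INIT(0);

/*
 * Called at the end of every normal I/O.  The waiter side (not shown)
 * uses wait_event() on wait_barrier_wq, so it adds itself to the queue
 * before re-checking nr_pending.
 */
static void allow_barrier(void)
{
        atomic_dec(&nr_pending);

        /* Only pay for the waitqueue lock when someone is actually waiting. */
        if (wq_has_sleeper(&wait_barrier_wq))
                wake_up(&wait_barrier_wq);
}

In the common case (no resync running, nobody waiting on the barrier) the fast path then touches only nr_pending and never acquires the wait queue's lock.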