diff mbox series

[PATCHv2] md: Replace md_thread's wait queue with the swait variant

Message ID 20240305105419.21077-1-jinpu.wang@ionos.com (mailing list archive)
State Superseded, archived
Headers show
Series [PATCHv2] md: Replace md_thread's wait queue with the swait variant | expand

Commit Message

Jinpu Wang March 5, 2024, 10:54 a.m. UTC
From: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com>

Replace md_thread's wait_event()/wake_up() related calls with their
simple swait~ variants to improve performance as well as memory and
stack footprint.

Use the IDLE state for the worker thread put to sleep instead of
misusing the INTERRUPTIBLE state combined with flushing signals
just for not contributing to system's cpu load-average stats.

Also check for sleeping thread before attempting its wake_up in
md_wakeup_thread() for avoiding unnecessary spinlock contention.

With this patch (backported) on a kernel 6.1, the IOPS improved
by around 4% with raid1 and/or raid5, in high IO load scenarios.

Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
---
v2: fix a type error for thread
 drivers/md/md.c | 23 +++++++----------------
 drivers/md/md.h |  2 +-
 2 files changed, 8 insertions(+), 17 deletions(-)

Comments

Paul Menzel March 5, 2024, noon UTC | #1
Dear Jack, dear Florian-Ewald,


Thank you for your patch.

Am 05.03.24 um 11:54 schrieb Jack Wang:
> From: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com>
> 
> Replace md_thread's wait_event()/wake_up() related calls with their
> simple swait~ variants to improve performance as well as memory and
> stack footprint.
> 
> Use the IDLE state for the worker thread put to sleep instead of
> misusing the INTERRUPTIBLE state combined with flushing signals
> just for not contributing to system's cpu load-average stats.
> 
> Also check for sleeping thread before attempting its wake_up in
> md_wakeup_thread() for avoiding unnecessary spinlock contention.
> 
> With this patch (backported) on a kernel 6.1, the IOPS improved
> by around 4% with raid1 and/or raid5, in high IO load scenarios.

It’d be great if you elaborated on your test setup.

Should this have?

Fixes: 93588e2284b6 ("[PATCH] md: make md threads interruptible again")

I ask, because the simple waitqueue (swait) implementation was only 
introduced in 2016, so 11 years later than the mentioned commit.

> Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com>
> Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
> ---
> v2: fix a type error for thread
>   drivers/md/md.c | 23 +++++++----------------
>   drivers/md/md.h |  2 +-
>   2 files changed, 8 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 48ae2b1cb57a..14d12ee4b06b 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7970,22 +7970,12 @@ static int md_thread(void *arg)
>   	 * many dirty RAID5 blocks.
>   	 */
>   
> -	allow_signal(SIGKILL);
>   	while (!kthread_should_stop()) {
>   
> -		/* We need to wait INTERRUPTIBLE so that
> -		 * we don't add to the load-average.
> -		 * That means we need to be sure no signals are
> -		 * pending
> -		 */
> -		if (signal_pending(current))
> -			flush_signals(current);
> -
> -		wait_event_interruptible_timeout
> -			(thread->wqueue,
> -			 test_bit(THREAD_WAKEUP, &thread->flags)
> -			 || kthread_should_stop() || kthread_should_park(),
> -			 thread->timeout);
> +		swait_event_idle_timeout_exclusive(thread->wqueue,
> +			test_bit(THREAD_WAKEUP, &thread->flags) ||
> +			kthread_should_stop() || kthread_should_park(),
> +			thread->timeout);
>   
>   		clear_bit(THREAD_WAKEUP, &thread->flags);
>   		if (kthread_should_park())
> @@ -8017,7 +8007,8 @@ void md_wakeup_thread(struct md_thread __rcu *thread)
>   	if (t) {
>   		pr_debug("md: waking up MD thread %s.\n", t->tsk->comm);
>   		set_bit(THREAD_WAKEUP, &t->flags);
> -		wake_up(&t->wqueue);
> +		if (swq_has_sleeper(&t->wqueue))
> +			swake_up_one(&t->wqueue);
>   	}
>   	rcu_read_unlock();
>   }
> @@ -8032,7 +8023,7 @@ struct md_thread *md_register_thread(void (*run) (struct md_thread *),
>   	if (!thread)
>   		return NULL;
>   
> -	init_waitqueue_head(&thread->wqueue);
> +	init_swait_queue_head(&thread->wqueue);
>   
>   	thread->run = run;
>   	thread->mddev = mddev;
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index b2076a165c10..1dc01bc5c038 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -716,7 +716,7 @@ static inline void sysfs_unlink_rdev(struct mddev *mddev, struct md_rdev *rdev)
>   struct md_thread {
>   	void			(*run) (struct md_thread *thread);
>   	struct mddev		*mddev;
> -	wait_queue_head_t	wqueue;
> +	struct swait_queue_head	wqueue;
>   	unsigned long		flags;
>   	struct task_struct	*tsk;
>   	unsigned long		timeout;

Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>


Kind regards,

Paul
Florian-Ewald Müller March 5, 2024, 12:25 p.m. UTC | #2
Hi Paul,

Thank you for your answer.

Regarding the INTERRUPTIBLE wait:
- it was introduced that time only for _not_ contributing to load-average.
- now obsolete, see also explanation in our patch.

Regarding our tests:
- we used 100 logical volumes (and more) doing full-blown random rd/wr
IO (fio) to the same md-raid1/raid5 device.

Best regards,


On Tue, Mar 5, 2024 at 1:01 PM Paul Menzel <pmenzel@molgen.mpg.de> wrote:
>
> Dear Jack, dear Florian-Ewald,
>
>
> Thank you for your patch.
>
> Am 05.03.24 um 11:54 schrieb Jack Wang:
> > From: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com>
> >
> > Replace md_thread's wait_event()/wake_up() related calls with their
> > simple swait~ variants to improve performance as well as memory and
> > stack footprint.
> >
> > Use the IDLE state for the worker thread put to sleep instead of
> > misusing the INTERRUPTIBLE state combined with flushing signals
> > just for not contributing to system's cpu load-average stats.
> >
> > Also check for sleeping thread before attempting its wake_up in
> > md_wakeup_thread() for avoiding unnecessary spinlock contention.
> >
> > With this patch (backported) on a kernel 6.1, the IOPS improved
> > by around 4% with raid1 and/or raid5, in high IO load scenarios.
>
> It’d be great if you elaborated on your test setup.
>
> Should this have?
>
> Fixes: 93588e2284b6 ("[PATCH] md: make md threads interruptible again")
>
> I ask, because the simple waitqueue (swait) implementation was only
> introduced in 2016, so 11 years later than the mentioned commit.
>
> > Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com>
> > Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
> > ---
> > v2: fix a type error for thread
> >   drivers/md/md.c | 23 +++++++----------------
> >   drivers/md/md.h |  2 +-
> >   2 files changed, 8 insertions(+), 17 deletions(-)
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 48ae2b1cb57a..14d12ee4b06b 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -7970,22 +7970,12 @@ static int md_thread(void *arg)
> >        * many dirty RAID5 blocks.
> >        */
> >
> > -     allow_signal(SIGKILL);
> >       while (!kthread_should_stop()) {
> >
> > -             /* We need to wait INTERRUPTIBLE so that
> > -              * we don't add to the load-average.
> > -              * That means we need to be sure no signals are
> > -              * pending
> > -              */
> > -             if (signal_pending(current))
> > -                     flush_signals(current);
> > -
> > -             wait_event_interruptible_timeout
> > -                     (thread->wqueue,
> > -                      test_bit(THREAD_WAKEUP, &thread->flags)
> > -                      || kthread_should_stop() || kthread_should_park(),
> > -                      thread->timeout);
> > +             swait_event_idle_timeout_exclusive(thread->wqueue,
> > +                     test_bit(THREAD_WAKEUP, &thread->flags) ||
> > +                     kthread_should_stop() || kthread_should_park(),
> > +                     thread->timeout);
> >
> >               clear_bit(THREAD_WAKEUP, &thread->flags);
> >               if (kthread_should_park())
> > @@ -8017,7 +8007,8 @@ void md_wakeup_thread(struct md_thread __rcu *thread)
> >       if (t) {
> >               pr_debug("md: waking up MD thread %s.\n", t->tsk->comm);
> >               set_bit(THREAD_WAKEUP, &t->flags);
> > -             wake_up(&t->wqueue);
> > +             if (swq_has_sleeper(&t->wqueue))
> > +                     swake_up_one(&t->wqueue);
> >       }
> >       rcu_read_unlock();
> >   }
> > @@ -8032,7 +8023,7 @@ struct md_thread *md_register_thread(void (*run) (struct md_thread *),
> >       if (!thread)
> >               return NULL;
> >
> > -     init_waitqueue_head(&thread->wqueue);
> > +     init_swait_queue_head(&thread->wqueue);
> >
> >       thread->run = run;
> >       thread->mddev = mddev;
> > diff --git a/drivers/md/md.h b/drivers/md/md.h
> > index b2076a165c10..1dc01bc5c038 100644
> > --- a/drivers/md/md.h
> > +++ b/drivers/md/md.h
> > @@ -716,7 +716,7 @@ static inline void sysfs_unlink_rdev(struct mddev *mddev, struct md_rdev *rdev)
> >   struct md_thread {
> >       void                    (*run) (struct md_thread *thread);
> >       struct mddev            *mddev;
> > -     wait_queue_head_t       wqueue;
> > +     struct swait_queue_head wqueue;
> >       unsigned long           flags;
> >       struct task_struct      *tsk;
> >       unsigned long           timeout;
>
> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
>
>
> Kind regards,
>
> Paul
Paul Menzel March 5, 2024, 2:10 p.m. UTC | #3
Dear Florian-Ewald,


Thank you for your quick reply.

Am 05.03.24 um 13:25 schrieb Florian-Ewald Müller:

> Regarding the INTERRUPTIBLE wait:
> - it was introduced that time only for _not_ contributing to load-average.
> - now obsolete, see also explanation in our patch.

Yes, I understood that. My question was regarding, if this should be 
backported to the stable series.

> Regarding our tests:
> - we used 100 logical volumes (and more) doing full-blown random rd/wr
> IO (fio) to the same md-raid1/raid5 device.

Do you have a script to create these volumes and running the fio test, 
so it’s easy to reproduce?

Was it a spinning disk or SSD device?


Kind regards,

Paul
Florian-Ewald Müller March 5, 2024, 2:52 p.m. UTC | #4
Hi Paul,

Regarding the IDLE state waiting:
- yes, this could/should be backported to any version which already
has swait_event_idle() macro(s) - and so also TASK_IDLE - defined.

About our fio driven tests
- we used fio "ini" files similar to:

[global]
description=Random Read/Write Access Pattern
bssplit=512/8:1k/4:2k/4:4k/30:8k/20:16k/10:32k/8:64k/12:128k/4
cpus_allowed=0-31
cpus_allowed_policy=split
numa_mem_policy=local
fadvise_hint=0
rw=randrw
direct=1
size=100%
random_distribution=zipf:1.2
percentage_random=80
ioengine=libaio
iodepth=128
numjobs=1
gtod_reduce=1
group_reporting
ramp_time=60
runtime=300
time_based

[job0]
filename=/dev/disk/by-name/vol0

[job1]
filename=/dev/disk/by-name/vol1

... and so on for >= 100 logical volumes.

Best regards,

On Tue, Mar 5, 2024 at 3:14 PM Paul Menzel <pmenzel@molgen.mpg.de> wrote:
>
> Dear Florian-Ewald,
>
>
> Thank you for your quick reply.
>
> Am 05.03.24 um 13:25 schrieb Florian-Ewald Müller:
>
> > Regarding the INTERRUPTIBLE wait:
> > - it was introduced that time only for _not_ contributing to load-average.
> > - now obsolete, see also explanation in our patch.
>
> Yes, I understood that. My question was regarding, if this should be
> backported to the stable series.
>
> > Regarding our tests:
> > - we used 100 logical volumes (and more) doing full-blown random rd/wr
> > IO (fio) to the same md-raid1/raid5 device.
>
> Do you have a script to create these volumes and running the fio test,
> so it’s easy to reproduce?
>
> Was it a spinning disk or SSD device?
>
>
> Kind regards,
>
> Paul
Florian-Ewald Müller March 5, 2024, 2:53 p.m. UTC | #5
PS: backend devices were/are SSDs.

On Tue, Mar 5, 2024 at 3:52 PM Florian-Ewald Müller
<florian-ewald.mueller@ionos.com> wrote:
>
> Hi Paul,
>
> Regarding the IDLE state waiting:
> - yes, this could/should be backported to any version which already
> has swait_event_idle() macro(s) - and so also TASK_IDLE - defined.
>
> About our fio driven tests
> - we used fio "ini" files similar to:
>
> [global]
> description=Random Read/Write Access Pattern
> bssplit=512/8:1k/4:2k/4:4k/30:8k/20:16k/10:32k/8:64k/12:128k/4
> cpus_allowed=0-31
> cpus_allowed_policy=split
> numa_mem_policy=local
> fadvise_hint=0
> rw=randrw
> direct=1
> size=100%
> random_distribution=zipf:1.2
> percentage_random=80
> ioengine=libaio
> iodepth=128
> numjobs=1
> gtod_reduce=1
> group_reporting
> ramp_time=60
> runtime=300
> time_based
>
> [job0]
> filename=/dev/disk/by-name/vol0
>
> [job1]
> filename=/dev/disk/by-name/vol1
>
> ... and so on for >= 100 logical volumes.
>
> Best regards,
>
> On Tue, Mar 5, 2024 at 3:14 PM Paul Menzel <pmenzel@molgen.mpg.de> wrote:
> >
> > Dear Florian-Ewald,
> >
> >
> > Thank you for your quick reply.
> >
> > Am 05.03.24 um 13:25 schrieb Florian-Ewald Müller:
> >
> > > Regarding the INTERRUPTIBLE wait:
> > > - it was introduced that time only for _not_ contributing to load-average.
> > > - now obsolete, see also explanation in our patch.
> >
> > Yes, I understood that. My question was regarding, if this should be
> > backported to the stable series.
> >
> > > Regarding our tests:
> > > - we used 100 logical volumes (and more) doing full-blown random rd/wr
> > > IO (fio) to the same md-raid1/raid5 device.
> >
> > Do you have a script to create these volumes and running the fio test,
> > so it’s easy to reproduce?
> >
> > Was it a spinning disk or SSD device?
> >
> >
> > Kind regards,
> >
> > Paul
diff mbox series

Patch

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 48ae2b1cb57a..14d12ee4b06b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7970,22 +7970,12 @@  static int md_thread(void *arg)
 	 * many dirty RAID5 blocks.
 	 */
 
-	allow_signal(SIGKILL);
 	while (!kthread_should_stop()) {
 
-		/* We need to wait INTERRUPTIBLE so that
-		 * we don't add to the load-average.
-		 * That means we need to be sure no signals are
-		 * pending
-		 */
-		if (signal_pending(current))
-			flush_signals(current);
-
-		wait_event_interruptible_timeout
-			(thread->wqueue,
-			 test_bit(THREAD_WAKEUP, &thread->flags)
-			 || kthread_should_stop() || kthread_should_park(),
-			 thread->timeout);
+		swait_event_idle_timeout_exclusive(thread->wqueue,
+			test_bit(THREAD_WAKEUP, &thread->flags) ||
+			kthread_should_stop() || kthread_should_park(),
+			thread->timeout);
 
 		clear_bit(THREAD_WAKEUP, &thread->flags);
 		if (kthread_should_park())
@@ -8017,7 +8007,8 @@  void md_wakeup_thread(struct md_thread __rcu *thread)
 	if (t) {
 		pr_debug("md: waking up MD thread %s.\n", t->tsk->comm);
 		set_bit(THREAD_WAKEUP, &t->flags);
-		wake_up(&t->wqueue);
+		if (swq_has_sleeper(&t->wqueue))
+			swake_up_one(&t->wqueue);
 	}
 	rcu_read_unlock();
 }
@@ -8032,7 +8023,7 @@  struct md_thread *md_register_thread(void (*run) (struct md_thread *),
 	if (!thread)
 		return NULL;
 
-	init_waitqueue_head(&thread->wqueue);
+	init_swait_queue_head(&thread->wqueue);
 
 	thread->run = run;
 	thread->mddev = mddev;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index b2076a165c10..1dc01bc5c038 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -716,7 +716,7 @@  static inline void sysfs_unlink_rdev(struct mddev *mddev, struct md_rdev *rdev)
 struct md_thread {
 	void			(*run) (struct md_thread *thread);
 	struct mddev		*mddev;
-	wait_queue_head_t	wqueue;
+	struct swait_queue_head	wqueue;
 	unsigned long		flags;
 	struct task_struct	*tsk;
 	unsigned long		timeout;