diff mbox series

mdadm: stop using 'idle' for sysfs api "sync_action" to wake up sync_thread

Message ID 20240111120505.4135257-1-yukuai1@huaweicloud.com (mailing list archive)
State Changes Requested
Headers show
Series mdadm: stop using 'idle' for sysfs api "sync_action" to wake up sync_thread | expand

Commit Message

Yu Kuai Jan. 11, 2024, 12:05 p.m. UTC
From: Yu Kuai <yukuai3@huawei.com>

Echo 'idle' to "sync_action" is supposed to stop sync_thread while new
sync_thread can still start. However, currently this behaviour is not
correct, echo 'idle' will actually try to stop sync_thread and then
start a new sync_thread. And mdadm relies on this wrong behaviour in
some places.

In kernel, if resync is not done yet, then recovery/reshape/check/repair
can't not start in the first place, and if resync is done, echo 'resync'
behaves the same as echo 'idle' for now.

Hence replace echo 'idle' with echo 'resync/reshape' when trying to
continue frozed sync_thread. There should be no functional changes and
prevent regressions after fixing that echo 'idle' will start new
sync_thread in kernel.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 Grow.c   | 2 +-
 Manage.c | 8 ++++----
 msg.c    | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

Comments

Mariusz Tkaczyk Jan. 12, 2024, 11:44 a.m. UTC | #1
On Thu, 11 Jan 2024 20:05:05 +0800
Yu Kuai <yukuai1@huaweicloud.com> wrote:

> From: Yu Kuai <yukuai3@huawei.com>
> 
> Echo 'idle' to "sync_action" is supposed to stop sync_thread while new
> sync_thread can still start. However, currently this behaviour is not
> correct, echo 'idle' will actually try to stop sync_thread and then
> start a new sync_thread. And mdadm relies on this wrong behaviour in
> some places.
> 
> In kernel, if resync is not done yet, then recovery/reshape/check/repair
> can't not start in the first place, and if resync is done, echo 'resync'
> behaves the same as echo 'idle' for now.

Hi Kuai,
From the last part I understand that in case of resync/reshape frozen thread is
unblocked, not restarted.

I miss some explanation about that here. So far I understand is:

"Setting "resync" or "reshape" allow to continue frozen sync_thread instead
restarting it. Setting "resync" if resync is done, has same effect as "idle" so
it is safe."

Please describe setting "reshape", I can see that you use it in one place, I
think that with reshape we need to be more careful but you are the expert here,
maybe it is same as "resync"?

> 
> Hence replace echo 'idle' with echo 'resync/reshape' when trying to
> continue frozed sync_thread. There should be no functional changes and
> prevent regressions after fixing that echo 'idle' will start new
> sync_thread in kernel.

Ok, so this is kind of preparing for kernel fix. Got it.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---

I think that I understand purpose of the change. You are trying to avoid thread
restarting if not needed and remove reference to incorrect "idle" usage of
mdadm.
Unfortunately, the changes you need to make have strong reference to kernel
implementation. It requires to well describe them because blame is volatile.

I would like to propose separate enum to not rely on kernel states naming, some
proposals:

/* So far I understand write "resync" for both cases */
SYNC_ACTION_RESYNC_START
SYNC_ACTION_RESYNC_CONTINUE

/* So far I understand write "reshape" for both cases *
SYNC_ACTION_RESHAPE_START
SYNC_ACTION_RESHAPE_CONTINUE

/* Highlight known bug in comment and use "resync"? /*
SYNC_ACTION_IDLE
/* If needed? */
SYNC_ACTION_ABORT

It needs to be handled by proper function which will have comments describing
what is written to kernel and why. In userspace, I need more user/reader
friendly code.
I want to know what we exactly requested from kernel. In some cases we would
expect to restart thread is some other cases just to continue frozen one. I
would like to know what was a purpose of request in the particular case even if
now the same action is used behind.

Let me know what you think.

Thanks,
Mariusz
Yu Kuai Jan. 14, 2024, 3:11 a.m. UTC | #2
Hi,

在 2024/01/12 19:44, Mariusz Tkaczyk 写道:
> On Thu, 11 Jan 2024 20:05:05 +0800
> Yu Kuai <yukuai1@huaweicloud.com> wrote:
> 
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Echo 'idle' to "sync_action" is supposed to stop sync_thread while new
>> sync_thread can still start. However, currently this behaviour is not
>> correct, echo 'idle' will actually try to stop sync_thread and then
>> start a new sync_thread. And mdadm relies on this wrong behaviour in
>> some places.
>>
>> In kernel, if resync is not done yet, then recovery/reshape/check/repair
>> can't not start in the first place, and if resync is done, echo 'resync'
>> behaves the same as echo 'idle' for now.
> 
> Hi Kuai,
>>From the last part I understand that in case of resync/reshape frozen thread is
> unblocked, not restarted.
> 
> I miss some explanation about that here. So far I understand is:
> 
> "Setting "resync" or "reshape" allow to continue frozen sync_thread instead
> restarting it. Setting "resync" if resync is done, has same effect as "idle" so
> it is safe."
> 
> Please describe setting "reshape", I can see that you use it in one place, I
> think that with reshape we need to be more careful but you are the expert here,
> maybe it is same as "resync"?

The only place to use "reshape" is that reshape is frozed before, and
echo "reshape" can continue the interrupted reshape. Of coures, echo
"idle/resync" can also continue the interrupted reshape.
> 
>>
>> Hence replace echo 'idle' with echo 'resync/reshape' when trying to
>> continue frozed sync_thread. There should be no functional changes and
>> prevent regressions after fixing that echo 'idle' will start new
>> sync_thread in kernel.
> 
> Ok, so this is kind of preparing for kernel fix. Got it.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
> 
> I think that I understand purpose of the change. You are trying to avoid thread
> restarting if not needed and remove reference to incorrect "idle" usage of
> mdadm.
> Unfortunately, the changes you need to make have strong reference to kernel
> implementation. It requires to well describe them because blame is volatile.
> 
> I would like to propose separate enum to not rely on kernel states naming, some
> proposals:
> 
> /* So far I understand write "resync" for both cases */
> SYNC_ACTION_RESYNC_START
> SYNC_ACTION_RESYNC_CONTINUE
> 
> /* So far I understand write "reshape" for both cases *
> SYNC_ACTION_RESHAPE_START
> SYNC_ACTION_RESHAPE_CONTINUE
> 
> /* Highlight known bug in comment and use "resync"? /*
> SYNC_ACTION_IDLE
> /* If needed? */
> SYNC_ACTION_ABORT

The enum sounds good, and I thought about enum in kerenl as well,
currently resync/recovery/reshape/... status is determined by
combination of flags, which is hard to follow.
> 
> It needs to be handled by proper function which will have comments describing
> what is written to kernel and why. In userspace, I need more user/reader
> friendly code.
> I want to know what we exactly requested from kernel. In some cases we would
> expect to restart thread is some other cases just to continue frozen one. I
> would like to know what was a purpose of request in the particular case even if
> now the same action is used behind.

Sounds like a good plan, and looks like the first thing to do is to sort
out all the places to use sysfs api 'sync_action' in mdadm, then we'll
know to define the new helper(I just grep the case 'idle' for now).

Thanks,
Kuai

> 
> Let me know what you think.
> 
> Thanks,
> Mariusz
> 
> .
>
diff mbox series

Patch

diff --git a/Grow.c b/Grow.c
index 8fa97875..05498c6f 100644
--- a/Grow.c
+++ b/Grow.c
@@ -843,7 +843,7 @@  static void unfreeze(struct supertype *st)
 		if (sra &&
 		    sysfs_get_str(sra, NULL, "sync_action", buf, 20) > 0 &&
 		    strcmp(buf, "frozen\n") == 0)
-			sysfs_set_str(sra, NULL, "sync_action", "idle");
+			sysfs_set_str(sra, NULL, "sync_action", "reshape");
 		sysfs_free(sra);
 	}
 }
diff --git a/Manage.c b/Manage.c
index 91532266..91483112 100644
--- a/Manage.c
+++ b/Manage.c
@@ -344,7 +344,7 @@  int Manage_stop(char *devname, int fd, int verbose, int will_retry)
 			backwards = 1;
 		if (sysfs_get_ll(mdi, NULL, "reshape_position", &position) != 0) {
 			/* reshape must have finished now */
-			sysfs_set_str(mdi, NULL, "sync_action", "idle");
+			sysfs_set_str(mdi, NULL, "sync_action", "resync");
 			goto done;
 		}
 		sysfs_get_two(mdi, NULL, "chunk_size", &chunk1, &chunk2);
@@ -386,7 +386,7 @@  int Manage_stop(char *devname, int fd, int verbose, int will_retry)
 		 * the reshape monitor */
 		if (sync_max < old_sync_max)
 			sysfs_set_num(mdi, NULL, "sync_max", sync_max);
-		sysfs_set_str(mdi, NULL, "sync_action", "idle");
+		sysfs_set_str(mdi, NULL, "sync_action", "resync");
 
 		/* That should have set things going again.  Now we
 		 * wait a little while (3 second max) for sync_completed
@@ -1717,7 +1717,7 @@  int Manage_subdevs(char *devname, int fd,
 	}
 	free(tst);
 	if (frozen > 0)
-		sysfs_set_str(&info, NULL, "sync_action","idle");
+		sysfs_set_str(&info, NULL, "sync_action", "resync");
 	if (test && count == 0)
 		return 2;
 	return 0;
@@ -1725,7 +1725,7 @@  int Manage_subdevs(char *devname, int fd,
 abort:
 	free(tst);
 	if (frozen > 0)
-		sysfs_set_str(&info, NULL, "sync_action","idle");
+		sysfs_set_str(&info, NULL, "sync_action", "resync");
 	return !test && busy ? 2 : 1;
 }
 
diff --git a/msg.c b/msg.c
index 45cd4504..d9f08ebf 100644
--- a/msg.c
+++ b/msg.c
@@ -252,7 +252,7 @@  int unblock_subarray(struct mdinfo *sra, const int unfreeze)
 	    sysfs_set_str(sra, NULL, "metadata_version", buf) ||
 	    (unfreeze &&
 	     sysfs_attribute_available(sra, NULL, "sync_action") &&
-	     sysfs_set_str(sra, NULL, "sync_action", "idle")))
+	     sysfs_set_str(sra, NULL, "sync_action", "resync")))
 		rc = -1;
 	return rc;
 }