[1/2] md-cluster: fix hanging issue while a new disk adding

Message ID	20240612021911.11043-1-heming.zhao@suse.com (mailing list archive)
State	Changes Requested
Headers	show Received: from mail-lj1-f180.google.com (mail-lj1-f180.google.com [209.85.208.180]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 41B2B3BB27 for <linux-raid@vger.kernel.org>; Wed, 12 Jun 2024 02:19:22 +0000 (UTC) From: Heming Zhao <heming.zhao@suse.com> To: song@kernel.org, yukuai1@huaweicloud.com, xni@redhat.com Cc: Heming Zhao <heming.zhao@suse.com>, glass.su@suse.com, linux-raid@vger.kernel.org Subject: [PATCH 1/2] md-cluster: fix hanging issue while a new disk adding Date: Wed, 12 Jun 2024 10:19:10 +0800 Message-Id: <20240612021911.11043-1-heming.zhao@suse.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[1/2] md-cluster: fix hanging issue while a new disk adding \| expand [1/2] md-cluster: fix hanging issue while a new disk adding [2/2] md-cluster: fix no recovery job when adding/re-adding a disk

Context	Check	Description
mdraidci/vmtest-md-6_11-PR	success	PR summary
mdraidci/vmtest-md-6_11-VM_Test-0	success	Logs for build-kernel

heming.zhao@suse.com June 12, 2024, 2:19 a.m. UTC

The commit 1bbe254e4336 ("md-cluster: check for timeout while a
new disk adding") is correct in terms of code syntax but not
suite real clustered code logic.

When a timeout occurs while adding a new disk, if recv_daemon()
bypasses the unlock for ack_lockres:CR, another node will be waiting
to grab EX lock. This will cause the cluster to hang indefinitely.

How to fix:

1. In dlm_lock_sync(), change the wait behaviour from forever to a
   timeout, This could avoid the hanging issue when another node
   fails to handle cluster msg. Another result of this change is
   that if another node receives an unknown msg (e.g. a new msg_type),
   the old code will hang, whereas the new code will timeout and fail.
   This could help cluster_md handle new msg_type from different
   nodes with different kernel/module versions (e.g. The user only
   updates one leg's kernel and monitors the stability of the new
   kernel).
2. The old code for __sendmsg() always returns 0 (success) under the
   design (must successfully unlock ->message_lockres). This commit
   makes this function return an error number when an error occurs.

Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding")
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Su Yue <glass.su@suse.com>
---
 drivers/md/md-cluster.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

heming.zhao@suse.com June 24, 2024, 1:55 a.m. UTC | #1

Hello Song & Kuai,

Xiao ni told me that he has been quite busy recently and cannot review
the code. Do you have time to review my code?

btw,
The patches has been passed the 60 loops of clustermd_tests [1]. because
the kernel md layer code changes, the clustermd_tests scripts also need
to be updated. I will send the clustermd_tests patch when the kernel
layer code passes review.

[1]: https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/clustermd_tests

Thanks,
Heming

On 6/12/24 10:19, Heming Zhao wrote:
> The commit 1bbe254e4336 ("md-cluster: check for timeout while a
> new disk adding") is correct in terms of code syntax but not
> suite real clustered code logic.
> 
> When a timeout occurs while adding a new disk, if recv_daemon()
> bypasses the unlock for ack_lockres:CR, another node will be waiting
> to grab EX lock. This will cause the cluster to hang indefinitely.
> 
> How to fix:
> 
> 1. In dlm_lock_sync(), change the wait behaviour from forever to a
>     timeout, This could avoid the hanging issue when another node
>     fails to handle cluster msg. Another result of this change is
>     that if another node receives an unknown msg (e.g. a new msg_type),
>     the old code will hang, whereas the new code will timeout and fail.
>     This could help cluster_md handle new msg_type from different
>     nodes with different kernel/module versions (e.g. The user only
>     updates one leg's kernel and monitors the stability of the new
>     kernel).
> 2. The old code for __sendmsg() always returns 0 (success) under the
>     design (must successfully unlock ->message_lockres). This commit
>     makes this function return an error number when an error occurs.
> 
> Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding")
> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
> Reviewed-by: Su Yue <glass.su@suse.com>
> ---
>   drivers/md/md-cluster.c | 14 ++++++++++++--
>   1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
> index 8e36a0feec09..27eaaf9fef94 100644
> --- a/drivers/md/md-cluster.c
> +++ b/drivers/md/md-cluster.c
> @@ -130,8 +130,13 @@ static int dlm_lock_sync(struct dlm_lock_resource *res, int mode)
>   			0, sync_ast, res, res->bast);
>   	if (ret)
>   		return ret;
> -	wait_event(res->sync_locking, res->sync_locking_done);
> +	ret = wait_event_timeout(res->sync_locking, res->sync_locking_done,
> +				60 * HZ);
>   	res->sync_locking_done = false;
> +	if (!ret) {
> +		pr_err("locking DLM '%s' timeout!\n", res->name);
> +		return -EBUSY;
> +	}
>   	if (res->lksb.sb_status == 0)
>   		res->mode = mode;
>   	return res->lksb.sb_status;
> @@ -744,12 +749,14 @@ static void unlock_comm(struct md_cluster_info *cinfo)
>   static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>   {
>   	int error;
> +	int ret = 0;
>   	int slot = cinfo->slot_number - 1;
>   
>   	cmsg->slot = cpu_to_le32(slot);
>   	/*get EX on Message*/
>   	error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX);
>   	if (error) {
> +		ret = error;
>   		pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n", error);
>   		goto failed_message;
>   	}
> @@ -759,6 +766,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>   	/*down-convert EX to CW on Message*/
>   	error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CW);
>   	if (error) {
> +		ret = error;
>   		pr_err("md-cluster: failed to convert EX to CW on MESSAGE(%d)\n",
>   				error);
>   		goto failed_ack;
> @@ -767,6 +775,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>   	/*up-convert CR to EX on Ack*/
>   	error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX);
>   	if (error) {
> +		ret = error;
>   		pr_err("md-cluster: failed to convert CR to EX on ACK(%d)\n",
>   				error);
>   		goto failed_ack;
> @@ -775,6 +784,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>   	/*down-convert EX to CR on Ack*/
>   	error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR);
>   	if (error) {
> +		ret = error;
>   		pr_err("md-cluster: failed to convert EX to CR on ACK(%d)\n",
>   				error);
>   		goto failed_ack;
> @@ -789,7 +799,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>   		goto failed_ack;
>   	}
>   failed_message:
> -	return error;
> +	return ret;
>   }
>   
>   static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg,

Yu Kuai June 24, 2024, 2:37 a.m. UTC | #2

Hi,

在 2024/06/24 9:55, Heming Zhao 写道:
> Hello Song & Kuai,
> 
> Xiao ni told me that he has been quite busy recently and cannot review
> the code. Do you have time to review my code?
> 
> btw,
> The patches has been passed the 60 loops of clustermd_tests [1]. because
> the kernel md layer code changes, the clustermd_tests scripts also need
> to be updated. I will send the clustermd_tests patch when the kernel
> layer code passes review.

The tests will be quite important, since I'm not familiar with cluster
code here. Of coure I'll find sometime to review the code.

Thanks,
Kuai

> 
> [1]: 
> https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/clustermd_tests
> 
> Thanks,
> Heming
> 
> On 6/12/24 10:19, Heming Zhao wrote:
>> The commit 1bbe254e4336 ("md-cluster: check for timeout while a
>> new disk adding") is correct in terms of code syntax but not
>> suite real clustered code logic.
>>
>> When a timeout occurs while adding a new disk, if recv_daemon()
>> bypasses the unlock for ack_lockres:CR, another node will be waiting
>> to grab EX lock. This will cause the cluster to hang indefinitely.
>>
>> How to fix:
>>
>> 1. In dlm_lock_sync(), change the wait behaviour from forever to a
>>     timeout, This could avoid the hanging issue when another node
>>     fails to handle cluster msg. Another result of this change is
>>     that if another node receives an unknown msg (e.g. a new msg_type),
>>     the old code will hang, whereas the new code will timeout and fail.
>>     This could help cluster_md handle new msg_type from different
>>     nodes with different kernel/module versions (e.g. The user only
>>     updates one leg's kernel and monitors the stability of the new
>>     kernel).
>> 2. The old code for __sendmsg() always returns 0 (success) under the
>>     design (must successfully unlock ->message_lockres). This commit
>>     makes this function return an error number when an error occurs.
>>
>> Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk 
>> adding")
>> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
>> Reviewed-by: Su Yue <glass.su@suse.com>
>> ---
>>   drivers/md/md-cluster.c | 14 ++++++++++++--
>>   1 file changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
>> index 8e36a0feec09..27eaaf9fef94 100644
>> --- a/drivers/md/md-cluster.c
>> +++ b/drivers/md/md-cluster.c
>> @@ -130,8 +130,13 @@ static int dlm_lock_sync(struct dlm_lock_resource 
>> *res, int mode)
>>               0, sync_ast, res, res->bast);
>>       if (ret)
>>           return ret;
>> -    wait_event(res->sync_locking, res->sync_locking_done);
>> +    ret = wait_event_timeout(res->sync_locking, res->sync_locking_done,
>> +                60 * HZ);
>>       res->sync_locking_done = false;
>> +    if (!ret) {
>> +        pr_err("locking DLM '%s' timeout!\n", res->name);
>> +        return -EBUSY;
>> +    }
>>       if (res->lksb.sb_status == 0)
>>           res->mode = mode;
>>       return res->lksb.sb_status;
>> @@ -744,12 +749,14 @@ static void unlock_comm(struct md_cluster_info 
>> *cinfo)
>>   static int __sendmsg(struct md_cluster_info *cinfo, struct 
>> cluster_msg *cmsg)
>>   {
>>       int error;
>> +    int ret = 0;
>>       int slot = cinfo->slot_number - 1;
>>       cmsg->slot = cpu_to_le32(slot);
>>       /*get EX on Message*/
>>       error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX);
>>       if (error) {
>> +        ret = error;
>>           pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n", 
>> error);
>>           goto failed_message;
>>       }
>> @@ -759,6 +766,7 @@ static int __sendmsg(struct md_cluster_info 
>> *cinfo, struct cluster_msg *cmsg)
>>       /*down-convert EX to CW on Message*/
>>       error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CW);
>>       if (error) {
>> +        ret = error;
>>           pr_err("md-cluster: failed to convert EX to CW on 
>> MESSAGE(%d)\n",
>>                   error);
>>           goto failed_ack;
>> @@ -767,6 +775,7 @@ static int __sendmsg(struct md_cluster_info 
>> *cinfo, struct cluster_msg *cmsg)
>>       /*up-convert CR to EX on Ack*/
>>       error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX);
>>       if (error) {
>> +        ret = error;
>>           pr_err("md-cluster: failed to convert CR to EX on ACK(%d)\n",
>>                   error);
>>           goto failed_ack;
>> @@ -775,6 +784,7 @@ static int __sendmsg(struct md_cluster_info 
>> *cinfo, struct cluster_msg *cmsg)
>>       /*down-convert EX to CR on Ack*/
>>       error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR);
>>       if (error) {
>> +        ret = error;
>>           pr_err("md-cluster: failed to convert EX to CR on ACK(%d)\n",
>>                   error);
>>           goto failed_ack;
>> @@ -789,7 +799,7 @@ static int __sendmsg(struct md_cluster_info 
>> *cinfo, struct cluster_msg *cmsg)
>>           goto failed_ack;
>>       }
>>   failed_message:
>> -    return error;
>> +    return ret;
>>   }
>>   static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg 
>> *cmsg,
> 
> 
> .
>

heming.zhao@suse.com June 25, 2024, 2:49 a.m. UTC | #3

On 6/24/24 10:37, Yu Kuai wrote:
> Hi,
> 
> 在 2024/06/24 9:55, Heming Zhao 写道:
>> Hello Song & Kuai,
>>
>> Xiao ni told me that he has been quite busy recently and cannot review
>> the code. Do you have time to review my code?
>>
>> btw,
>> The patches has been passed the 60 loops of clustermd_tests [1]. because
>> the kernel md layer code changes, the clustermd_tests scripts also need
>> to be updated. I will send the clustermd_tests patch when the kernel
>> layer code passes review.
> 
> The tests will be quite important, since I'm not familiar with cluster
> code here. Of coure I'll find sometime to review the code.
> 
> Thanks,
> Kuai
> 

Thanks for your reply. I post the set up for HA env here, if you have any
question please feel free to ask me.

--------
# How to set up clustered md env

(I use opensuse tumbleweed to test)

1. cook the patches

[PATCH 1/2] mdadm/clustermd_tests: add some apis in func.sh to support test to run without error
[PATCH 2/2] mdadm/clustermd_tests: adjust test cases to support md module changes
- https://lore.kernel.org/linux-raid/20240625021019.8732-1-heming.zhao@suse.com/T/#t

2. edit mdadm.git/clustermd_tests/cluster_conf

2.1 edit NODE[12], e.g:

```
NODE1=192.168.1.100
NODE2=192.168.1.101
```

2.2 edit 'devlist=', e.g:

```
devlist=/dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
```

in my env, each sd[abcdef] size is 300MB

3. set up cluster env

download ISO:
https://mirrors.tuna.tsinghua.edu.cn/opensuse/tumbleweed/iso/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20240621-Media.iso

install two VMs. add 6 shared disk (each 300MB), the backend is raw file.

4. set up ha

4.1 install softwares

because tumbleweed doesn't support "zypper -t pattern" mode to install HA stack softwares.

install HA stack command from opensuse:
  zypper in crmsh pacemaker corosync libcsync-plugin-sftp libvirt-client

4.2 set up ha

ref: https://documentation.suse.com/sle-ha/15-SP5/html/SLE-HA-all/article-installation.html

set up vm machine hostname:
node1: hostnamectl set-hostname tw-md1
node2: hostnamectl set-hostname tw-md2

edit /etc/hosts:
192.168.111.100 tw-md1
192.168.111.101 tw-md2

copy to another node:
scp -O /etc/hosts tw-md2:/etc/


on node1:
crm cluster init -S -y -i <node1-ip>

on node2:
crm cluster join -c <node1-ip>

on node1:
crm config edit (<== edit config as following)

```
INFO: "config" is accepted as "configure"
node 1: tw-md1
node 2: tw-md2
primitive dlm ocf:pacemaker:controld \
         op monitor interval=60s timeout=60s \
         op stop timeout=100s interval=0s \
         op monitor interval=30s timeout=90s
primitive stonith-libvirt stonith:external/libvirt \
         params hostlist="tb-md1,tb-md2" hypervisor_uri="qemu+tcp://<host-ip>/system" pcmk_delay_max=30s
group base-group dlm
clone base-clone base-group \
         meta interleave=true
property cib-bootstrap-options: \
         have-watchdog=true \
         dc-version="2.1.7+20240530.09c4d6d2e-1.2-2.1.7+20240530.09c4d6d2e" \
         cluster-infrastructure=corosync \
         cluster-name=hacluster \
         stonith-timeout=71 \
         stonith-enabled=true \
         no-quorum-policy=freeze
rsc_defaults build-resource-defaults: \
         resource-stickiness=1 \
         priority=1
```

note:
- hypervisor_uri="qemu+tcp://<host-ip>/system", the <host-ip> should be
   adjusted according to the real env.
- use 'crm config show' to show/check the config

check HA status:

crm status full (<== should show no error, see below)

```
Node List:
   * Node tw-md1: online:
     * Resources:
       * stonith-libvirt (stonith:external/libvirt):      Started
       * dlm     (ocf:pacemaker:controld):        Started
   * Node tw-md2: online:
     * Resources:
       * dlm     (ocf:pacemaker:controld):        Started
```

5. run test

all:
./test --testdir=clustermd_tests --save-logs --logdir=./logs --keep-going

single test:
./test --testdir=clustermd_tests --save-logs --logdir=./logs --keep-going --tests=02r1_Manage_add

Thanks,
Heming

>>
>> [1]: https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/clustermd_tests
>>
>> Thanks,
>> Heming
>>
>> On 6/12/24 10:19, Heming Zhao wrote:
>>> The commit 1bbe254e4336 ("md-cluster: check for timeout while a
>>> new disk adding") is correct in terms of code syntax but not
>>> suite real clustered code logic.
>>>
>>> When a timeout occurs while adding a new disk, if recv_daemon()
>>> bypasses the unlock for ack_lockres:CR, another node will be waiting
>>> to grab EX lock. This will cause the cluster to hang indefinitely.
>>>
>>> How to fix:
>>>
>>> 1. In dlm_lock_sync(), change the wait behaviour from forever to a
>>>     timeout, This could avoid the hanging issue when another node
>>>     fails to handle cluster msg. Another result of this change is
>>>     that if another node receives an unknown msg (e.g. a new msg_type),
>>>     the old code will hang, whereas the new code will timeout and fail.
>>>     This could help cluster_md handle new msg_type from different
>>>     nodes with different kernel/module versions (e.g. The user only
>>>     updates one leg's kernel and monitors the stability of the new
>>>     kernel).
>>> 2. The old code for __sendmsg() always returns 0 (success) under the
>>>     design (must successfully unlock ->message_lockres). This commit
>>>     makes this function return an error number when an error occurs.
>>>
>>> Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding")
>>> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
>>> Reviewed-by: Su Yue <glass.su@suse.com>
>>> ---
>>>   drivers/md/md-cluster.c | 14 ++++++++++++--
>>>   1 file changed, 12 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
>>> index 8e36a0feec09..27eaaf9fef94 100644
>>> --- a/drivers/md/md-cluster.c
>>> +++ b/drivers/md/md-cluster.c
>>> @@ -130,8 +130,13 @@ static int dlm_lock_sync(struct dlm_lock_resource *res, int mode)
>>>               0, sync_ast, res, res->bast);
>>>       if (ret)
>>>           return ret;
>>> -    wait_event(res->sync_locking, res->sync_locking_done);
>>> +    ret = wait_event_timeout(res->sync_locking, res->sync_locking_done,
>>> +                60 * HZ);
>>>       res->sync_locking_done = false;
>>> +    if (!ret) {
>>> +        pr_err("locking DLM '%s' timeout!\n", res->name);
>>> +        return -EBUSY;
>>> +    }
>>>       if (res->lksb.sb_status == 0)
>>>           res->mode = mode;
>>>       return res->lksb.sb_status;
>>> @@ -744,12 +749,14 @@ static void unlock_comm(struct md_cluster_info *cinfo)
>>>   static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>>>   {
>>>       int error;
>>> +    int ret = 0;
>>>       int slot = cinfo->slot_number - 1;
>>>       cmsg->slot = cpu_to_le32(slot);
>>>       /*get EX on Message*/
>>>       error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX);
>>>       if (error) {
>>> +        ret = error;
>>>           pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n", error);
>>>           goto failed_message;
>>>       }
>>> @@ -759,6 +766,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>>>       /*down-convert EX to CW on Message*/
>>>       error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CW);
>>>       if (error) {
>>> +        ret = error;
>>>           pr_err("md-cluster: failed to convert EX to CW on MESSAGE(%d)\n",
>>>                   error);
>>>           goto failed_ack;
>>> @@ -767,6 +775,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>>>       /*up-convert CR to EX on Ack*/
>>>       error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX);
>>>       if (error) {
>>> +        ret = error;
>>>           pr_err("md-cluster: failed to convert CR to EX on ACK(%d)\n",
>>>                   error);
>>>           goto failed_ack;
>>> @@ -775,6 +784,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>>>       /*down-convert EX to CR on Ack*/
>>>       error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR);
>>>       if (error) {
>>> +        ret = error;
>>>           pr_err("md-cluster: failed to convert EX to CR on ACK(%d)\n",
>>>                   error);
>>>           goto failed_ack;
>>> @@ -789,7 +799,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>>>           goto failed_ack;
>>>       }
>>>   failed_message:
>>> -    return error;
>>> +    return ret;
>>>   }
>>>   static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg,
>>
>>
>> .
>>
>

Yu Kuai June 27, 2024, 12:52 p.m. UTC | #4

Hi,

在 2024/06/12 10:19, Heming Zhao 写道:
> The commit 1bbe254e4336 ("md-cluster: check for timeout while a
> new disk adding") is correct in terms of code syntax but not
> suite real clustered code logic.
> 
> When a timeout occurs while adding a new disk, if recv_daemon()
> bypasses the unlock for ack_lockres:CR, another node will be waiting
> to grab EX lock. This will cause the cluster to hang indefinitely.
> 
> How to fix:
> 
> 1. In dlm_lock_sync(), change the wait behaviour from forever to a
>     timeout, This could avoid the hanging issue when another node
>     fails to handle cluster msg. Another result of this change is
>     that if another node receives an unknown msg (e.g. a new msg_type),
>     the old code will hang, whereas the new code will timeout and fail.
>     This could help cluster_md handle new msg_type from different
>     nodes with different kernel/module versions (e.g. The user only
>     updates one leg's kernel and monitors the stability of the new
>     kernel).
> 2. The old code for __sendmsg() always returns 0 (success) under the
>     design (must successfully unlock ->message_lockres). This commit
>     makes this function return an error number when an error occurs.
> 
> Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding")
> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
> Reviewed-by: Su Yue <glass.su@suse.com>
> ---
>   drivers/md/md-cluster.c | 14 ++++++++++++--
>   1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
> index 8e36a0feec09..27eaaf9fef94 100644
> --- a/drivers/md/md-cluster.c
> +++ b/drivers/md/md-cluster.c
> @@ -130,8 +130,13 @@ static int dlm_lock_sync(struct dlm_lock_resource *res, int mode)
>   			0, sync_ast, res, res->bast);
>   	if (ret)
>   		return ret;
> -	wait_event(res->sync_locking, res->sync_locking_done);
> +	ret = wait_event_timeout(res->sync_locking, res->sync_locking_done,
> +				60 * HZ);

Let's not use magic number directly, it's better to define a marco. BTW,
60s looks too long for me.
>   	res->sync_locking_done = false;

And I tried to find, if setting this value on failure is ok. However,
I'm lost and I really don't know. Can you explain this?
> +	if (!ret) {
> +		pr_err("locking DLM '%s' timeout!\n", res->name);
> +		return -EBUSY;
> +	}
>   	if (res->lksb.sb_status == 0)
>   		res->mode = mode;
>   	return res->lksb.sb_status;
> @@ -744,12 +749,14 @@ static void unlock_comm(struct md_cluster_info *cinfo)
>   static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>   {
>   	int error;
> +	int ret = 0;
>   	int slot = cinfo->slot_number - 1;
>   
>   	cmsg->slot = cpu_to_le32(slot);
>   	/*get EX on Message*/
>   	error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX);
>   	if (error) {
> +		ret = error;

You can return error directly in this branch.
>   		pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n", error);
>   		goto failed_message;
>   	}
> @@ -759,6 +766,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>   	/*down-convert EX to CW on Message*/
>   	error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CW);
>   	if (error) {
> +		ret = error;
>   		pr_err("md-cluster: failed to convert EX to CW on MESSAGE(%d)\n",
>   				error);
>   		goto failed_ack;
> @@ -767,6 +775,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>   	/*up-convert CR to EX on Ack*/
>   	error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX);
>   	if (error) {
> +		ret = error;
>   		pr_err("md-cluster: failed to convert CR to EX on ACK(%d)\n",
>   				error);
>   		goto failed_ack;
> @@ -775,6 +784,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>   	/*down-convert EX to CR on Ack*/
>   	error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR);
>   	if (error) {
> +		ret = error;
>   		pr_err("md-cluster: failed to convert EX to CR on ACK(%d)\n",
>   				error);
>   		goto failed_ack;
> @@ -789,7 +799,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>   		goto failed_ack;
>   	}
>   failed_message:
> -	return error;
> +	return ret;

And I'll suggest just to change dlm_unlock_sync(), not to change all the
other places.

Thanks,
Kuai

>   }
>   
>   static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg,
>

heming.zhao@suse.com June 28, 2024, 12:32 p.m. UTC | #5

On 6/27/24 20:52, Yu Kuai wrote:
> Hi,
> 
> 在 2024/06/12 10:19, Heming Zhao 写道:
>> The commit 1bbe254e4336 ("md-cluster: check for timeout while a
>> new disk adding") is correct in terms of code syntax but not
>> suite real clustered code logic.
>>
>> When a timeout occurs while adding a new disk, if recv_daemon()
>> bypasses the unlock for ack_lockres:CR, another node will be waiting
>> to grab EX lock. This will cause the cluster to hang indefinitely.
>>
>> How to fix:
>>
>> 1. In dlm_lock_sync(), change the wait behaviour from forever to a
>>     timeout, This could avoid the hanging issue when another node
>>     fails to handle cluster msg. Another result of this change is
>>     that if another node receives an unknown msg (e.g. a new msg_type),
>>     the old code will hang, whereas the new code will timeout and fail.
>>     This could help cluster_md handle new msg_type from different
>>     nodes with different kernel/module versions (e.g. The user only
>>     updates one leg's kernel and monitors the stability of the new
>>     kernel).
>> 2. The old code for __sendmsg() always returns 0 (success) under the
>>     design (must successfully unlock ->message_lockres). This commit
>>     makes this function return an error number when an error occurs.
>>
>> Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding")
>> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
>> Reviewed-by: Su Yue <glass.su@suse.com>
>> ---
>>   drivers/md/md-cluster.c | 14 ++++++++++++--
>>   1 file changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
>> index 8e36a0feec09..27eaaf9fef94 100644
>> --- a/drivers/md/md-cluster.c
>> +++ b/drivers/md/md-cluster.c
>> @@ -130,8 +130,13 @@ static int dlm_lock_sync(struct dlm_lock_resource *res, int mode)
>>               0, sync_ast, res, res->bast);
>>       if (ret)
>>           return ret;
>> -    wait_event(res->sync_locking, res->sync_locking_done);
>> +    ret = wait_event_timeout(res->sync_locking, res->sync_locking_done,
>> +                60 * HZ);
> 
> Let's not use magic number directly, it's better to define a marco. BTW,
> 60s looks too long for me.

got it, will create a define:
#define WAIT_DLM_LOCK_TIMEOUT 30 * HZ

In my view, the shortest time should be 30s. because there is a clustered env.
Node A is waiting for node B to release the lock.
We should consider:
- network traffic (node A and B are not in the same build)
- another node's udev event handling time: NEW_DEV_TIMEOUT 5000

>>       res->sync_locking_done = false;
> 
> And I tried to find, if setting this value on failure is ok. However,
> I'm lost and I really don't know. Can you explain this?

This code logic is the same as dlm_lock_sync_interruptible(). We can
see that regardless of success or failure, '->sync_locking_done' is
set to false in dlm_lock_sync_interruptible().

>> +    if (!ret) {
>> +        pr_err("locking DLM '%s' timeout!\n", res->name);
>> +        return -EBUSY;
>> +    }
>>       if (res->lksb.sb_status == 0)
>>           res->mode = mode;
>>       return res->lksb.sb_status;
>> @@ -744,12 +749,14 @@ static void unlock_comm(struct md_cluster_info *cinfo)
>>   static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>>   {
>>       int error;
>> +    int ret = 0;
>>       int slot = cinfo->slot_number - 1;
>>       cmsg->slot = cpu_to_le32(slot);
>>       /*get EX on Message*/
>>       error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX);
>>       if (error) {
>> +        ret = error;
> 
> You can return error directly in this branch.

OK

>>           pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n", error);
>>           goto failed_message;
>>       }
>> @@ -759,6 +766,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>>       /*down-convert EX to CW on Message*/
>>       error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CW);
>>       if (error) {
>> +        ret = error;
>>           pr_err("md-cluster: failed to convert EX to CW on MESSAGE(%d)\n",
>>                   error);
>>           goto failed_ack;
>> @@ -767,6 +775,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>>       /*up-convert CR to EX on Ack*/
>>       error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX);
>>       if (error) {
>> +        ret = error;
>>           pr_err("md-cluster: failed to convert CR to EX on ACK(%d)\n",
>>                   error);
>>           goto failed_ack;
>> @@ -775,6 +784,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>>       /*down-convert EX to CR on Ack*/
>>       error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR);
>>       if (error) {
>> +        ret = error;
>>           pr_err("md-cluster: failed to convert EX to CR on ACK(%d)\n",
>>                   error);
>>           goto failed_ack;
>> @@ -789,7 +799,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>>           goto failed_ack;
>>       }
>>   failed_message:
>> -    return error;
>> +    return ret;
> 
> And I'll suggest just to change dlm_unlock_sync(), not to change all the
> other places.
> 

I have a different viewpoint, the clustermd code has been running for about 10
years, and no bugs have been reported from SUSE customers for about 1 year. I am
inclined to keep the current code style. If we change dlm_unlock_sync(), many
places will need to be rewritten, which may introduce new bugs. From the callers
of this func (__sendmsg()), which handle the return value, we know it's
definitely wrong because the return value of __sendmsg() is always true.

-Heming

Yu Kuai June 29, 2024, 1:42 a.m. UTC | #6

Hi,

在 2024/06/28 20:32, Heming Zhao 写道:
> On 6/27/24 20:52, Yu Kuai wrote:
>> Hi,
>>
>> 在 2024/06/12 10:19, Heming Zhao 写道:
>>> The commit 1bbe254e4336 ("md-cluster: check for timeout while a
>>> new disk adding") is correct in terms of code syntax but not
>>> suite real clustered code logic.
>>>
>>> When a timeout occurs while adding a new disk, if recv_daemon()
>>> bypasses the unlock for ack_lockres:CR, another node will be waiting
>>> to grab EX lock. This will cause the cluster to hang indefinitely.
>>>
>>> How to fix:
>>>
>>> 1. In dlm_lock_sync(), change the wait behaviour from forever to a
>>>     timeout, This could avoid the hanging issue when another node
>>>     fails to handle cluster msg. Another result of this change is
>>>     that if another node receives an unknown msg (e.g. a new msg_type),
>>>     the old code will hang, whereas the new code will timeout and fail.
>>>     This could help cluster_md handle new msg_type from different
>>>     nodes with different kernel/module versions (e.g. The user only
>>>     updates one leg's kernel and monitors the stability of the new
>>>     kernel).
>>> 2. The old code for __sendmsg() always returns 0 (success) under the
>>>     design (must successfully unlock ->message_lockres). This commit
>>>     makes this function return an error number when an error occurs.
>>>
>>> Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk 
>>> adding")
>>> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
>>> Reviewed-by: Su Yue <glass.su@suse.com>
>>> ---
>>>   drivers/md/md-cluster.c | 14 ++++++++++++--
>>>   1 file changed, 12 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
>>> index 8e36a0feec09..27eaaf9fef94 100644
>>> --- a/drivers/md/md-cluster.c
>>> +++ b/drivers/md/md-cluster.c
>>> @@ -130,8 +130,13 @@ static int dlm_lock_sync(struct 
>>> dlm_lock_resource *res, int mode)
>>>               0, sync_ast, res, res->bast);
>>>       if (ret)
>>>           return ret;
>>> -    wait_event(res->sync_locking, res->sync_locking_done);
>>> +    ret = wait_event_timeout(res->sync_locking, res->sync_locking_done,
>>> +                60 * HZ);
>>
>> Let's not use magic number directly, it's better to define a marco. BTW,
>> 60s looks too long for me.
> 
> got it, will create a define:
> #define WAIT_DLM_LOCK_TIMEOUT 30 * HZ
> 
> In my view, the shortest time should be 30s. because there is a 
> clustered env.
> Node A is waiting for node B to release the lock.
> We should consider:
> - network traffic (node A and B are not in the same build)
> - another node's udev event handling time: NEW_DEV_TIMEOUT 5000
> 
>>>       res->sync_locking_done = false;
>>
>> And I tried to find, if setting this value on failure is ok. However,
>> I'm lost and I really don't know. Can you explain this?
> 
> This code logic is the same as dlm_lock_sync_interruptible(). We can
> see that regardless of success or failure, '->sync_locking_done' is
> set to false in dlm_lock_sync_interruptible().
> 
>>> +    if (!ret) {
>>> +        pr_err("locking DLM '%s' timeout!\n", res->name);
>>> +        return -EBUSY;
>>> +    }
>>>       if (res->lksb.sb_status == 0)
>>>           res->mode = mode;
>>>       return res->lksb.sb_status;
>>> @@ -744,12 +749,14 @@ static void unlock_comm(struct md_cluster_info 
>>> *cinfo)
>>>   static int __sendmsg(struct md_cluster_info *cinfo, struct 
>>> cluster_msg *cmsg)
>>>   {
>>>       int error;
>>> +    int ret = 0;
>>>       int slot = cinfo->slot_number - 1;
>>>       cmsg->slot = cpu_to_le32(slot);
>>>       /*get EX on Message*/
>>>       error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX);
>>>       if (error) {
>>> +        ret = error;
>>
>> You can return error directly in this branch.
> 
> OK
> 
>>>           pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n", 
>>> error);
>>>           goto failed_message;
>>>       }
>>> @@ -759,6 +766,7 @@ static int __sendmsg(struct md_cluster_info 
>>> *cinfo, struct cluster_msg *cmsg)
>>>       /*down-convert EX to CW on Message*/
>>>       error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CW);
>>>       if (error) {
>>> +        ret = error;
>>>           pr_err("md-cluster: failed to convert EX to CW on 
>>> MESSAGE(%d)\n",
>>>                   error);
>>>           goto failed_ack;
>>> @@ -767,6 +775,7 @@ static int __sendmsg(struct md_cluster_info 
>>> *cinfo, struct cluster_msg *cmsg)
>>>       /*up-convert CR to EX on Ack*/
>>>       error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX);
>>>       if (error) {
>>> +        ret = error;
>>>           pr_err("md-cluster: failed to convert CR to EX on ACK(%d)\n",
>>>                   error);
>>>           goto failed_ack;
>>> @@ -775,6 +784,7 @@ static int __sendmsg(struct md_cluster_info 
>>> *cinfo, struct cluster_msg *cmsg)
>>>       /*down-convert EX to CR on Ack*/
>>>       error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR);
>>>       if (error) {
>>> +        ret = error;
>>>           pr_err("md-cluster: failed to convert EX to CR on ACK(%d)\n",
>>>                   error);
>>>           goto failed_ack;
>>> @@ -789,7 +799,7 @@ static int __sendmsg(struct md_cluster_info 
>>> *cinfo, struct cluster_msg *cmsg)
>>>           goto failed_ack;
>>>       }
>>>   failed_message:
>>> -    return error;
>>> +    return ret;
>>
>> And I'll suggest just to change dlm_unlock_sync(), not to change all the
>> other places.
>>
> 
> I have a different viewpoint, the clustermd code has been running for 
> about 10
> years, and no bugs have been reported from SUSE customers for about 1 
> year. I am
> inclined to keep the current code style. If we change dlm_unlock_sync(), 
> many
> places will need to be rewritten, which may introduce new bugs. From the 
> callers
> of this func (__sendmsg()), which handle the return value, we know it's
> definitely wrong because the return value of __sendmsg() is always true.

That's not what I mean... Let me show you the code:

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index eb9bbf12c8d8..11a3c9960a22 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -744,6 +744,7 @@ static void unlock_comm(struct md_cluster_info *cinfo)
  static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg 
*cmsg)
  {
         int error;
+       int unlock_error;
         int slot = cinfo->slot_number - 1;

         cmsg->slot = cpu_to_le32(slot);
@@ -781,13 +782,9 @@ static int __sendmsg(struct md_cluster_info *cinfo, 
struct cluster_msg *cmsg)
         }

  failed_ack:
-       error = dlm_unlock_sync(cinfo->message_lockres);
-       if (unlikely(error != 0)) {
+       while ((unlock_error = dlm_unlock_sync(cinfo->message_lockres)))
                 pr_err("md-cluster: failed convert to NL on MESSAGE(%d)\n",
-                       error);
-               /* in case the message can't be released due to some 
reason */
-               goto failed_ack;
-       }
+                       unlock_error);
  failed_message:
         return error;

Thanks,
Kuai

> 
> -Heming
> .
>

heming.zhao@suse.com June 29, 2024, 11:58 a.m. UTC | #7

On 6/29/24 09:42, Yu Kuai wrote:
> Hi,
> 
> 在 2024/06/28 20:32, Heming Zhao 写道:
>> On 6/27/24 20:52, Yu Kuai wrote:
>>> Hi,
>>>
>>> 在 2024/06/12 10:19, Heming Zhao 写道:
>>>> The commit 1bbe254e4336 ("md-cluster: check for timeout while a
>>>> new disk adding") is correct in terms of code syntax but not
>>>> suite real clustered code logic.
>>>>
>>>> When a timeout occurs while adding a new disk, if recv_daemon()
>>>> bypasses the unlock for ack_lockres:CR, another node will be waiting
>>>> to grab EX lock. This will cause the cluster to hang indefinitely.
>>>>
>>>> How to fix:
>>>>
>>>> 1. In dlm_lock_sync(), change the wait behaviour from forever to a
>>>>     timeout, This could avoid the hanging issue when another node
>>>>     fails to handle cluster msg. Another result of this change is
>>>>     that if another node receives an unknown msg (e.g. a new msg_type),
>>>>     the old code will hang, whereas the new code will timeout and fail.
>>>>     This could help cluster_md handle new msg_type from different
>>>>     nodes with different kernel/module versions (e.g. The user only
>>>>     updates one leg's kernel and monitors the stability of the new
>>>>     kernel).
>>>> 2. The old code for __sendmsg() always returns 0 (success) under the
>>>>     design (must successfully unlock ->message_lockres). This commit
>>>>     makes this function return an error number when an error occurs.
>>>>
>>>> Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding")
>>>> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
>>>> Reviewed-by: Su Yue <glass.su@suse.com>
>>>> ---
>>>>   drivers/md/md-cluster.c | 14 ++++++++++++--
>>>>   1 file changed, 12 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
>>>> index 8e36a0feec09..27eaaf9fef94 100644
>>>> --- a/drivers/md/md-cluster.c
>>>> +++ b/drivers/md/md-cluster.c
>>>> @@ -130,8 +130,13 @@ static int dlm_lock_sync(struct dlm_lock_resource *res, int mode)
>>>>               0, sync_ast, res, res->bast);
>>>>       if (ret)
>>>>           return ret;
>>>> -    wait_event(res->sync_locking, res->sync_locking_done);
>>>> +    ret = wait_event_timeout(res->sync_locking, res->sync_locking_done,
>>>> +                60 * HZ);
>>>
>>> Let's not use magic number directly, it's better to define a marco. BTW,
>>> 60s looks too long for me.
>>
>> got it, will create a define:
>> #define WAIT_DLM_LOCK_TIMEOUT 30 * HZ
>>
>> In my view, the shortest time should be 30s. because there is a clustered env.
>> Node A is waiting for node B to release the lock.
>> We should consider:
>> - network traffic (node A and B are not in the same build)
>> - another node's udev event handling time: NEW_DEV_TIMEOUT 5000
>>
>>>>       res->sync_locking_done = false;
>>>
>>> And I tried to find, if setting this value on failure is ok. However,
>>> I'm lost and I really don't know. Can you explain this?
>>
>> This code logic is the same as dlm_lock_sync_interruptible(). We can
>> see that regardless of success or failure, '->sync_locking_done' is
>> set to false in dlm_lock_sync_interruptible().
>>
>>>> +    if (!ret) {
>>>> +        pr_err("locking DLM '%s' timeout!\n", res->name);
>>>> +        return -EBUSY;
>>>> +    }
>>>>       if (res->lksb.sb_status == 0)
>>>>           res->mode = mode;
>>>>       return res->lksb.sb_status;
>>>> @@ -744,12 +749,14 @@ static void unlock_comm(struct md_cluster_info *cinfo)
>>>>   static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>>>>   {
>>>>       int error;
>>>> +    int ret = 0;
>>>>       int slot = cinfo->slot_number - 1;
>>>>       cmsg->slot = cpu_to_le32(slot);
>>>>       /*get EX on Message*/
>>>>       error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX);
>>>>       if (error) {
>>>> +        ret = error;
>>>
>>> You can return error directly in this branch.
>>
>> OK
>>
>>>>           pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n", error);
>>>>           goto failed_message;
>>>>       }
>>>> @@ -759,6 +766,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>>>>       /*down-convert EX to CW on Message*/
>>>>       error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CW);
>>>>       if (error) {
>>>> +        ret = error;
>>>>           pr_err("md-cluster: failed to convert EX to CW on MESSAGE(%d)\n",
>>>>                   error);
>>>>           goto failed_ack;
>>>> @@ -767,6 +775,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>>>>       /*up-convert CR to EX on Ack*/
>>>>       error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX);
>>>>       if (error) {
>>>> +        ret = error;
>>>>           pr_err("md-cluster: failed to convert CR to EX on ACK(%d)\n",
>>>>                   error);
>>>>           goto failed_ack;
>>>> @@ -775,6 +784,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>>>>       /*down-convert EX to CR on Ack*/
>>>>       error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR);
>>>>       if (error) {
>>>> +        ret = error;
>>>>           pr_err("md-cluster: failed to convert EX to CR on ACK(%d)\n",
>>>>                   error);
>>>>           goto failed_ack;
>>>> @@ -789,7 +799,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>>>>           goto failed_ack;
>>>>       }
>>>>   failed_message:
>>>> -    return error;
>>>> +    return ret;
>>>
>>> And I'll suggest just to change dlm_unlock_sync(), not to change all the
>>> other places.
>>>
>>
>> I have a different viewpoint, the clustermd code has been running for about 10
>> years, and no bugs have been reported from SUSE customers for about 1 year. I am
>> inclined to keep the current code style. If we change dlm_unlock_sync(), many
>> places will need to be rewritten, which may introduce new bugs. From the callers
>> of this func (__sendmsg()), which handle the return value, we know it's
>> definitely wrong because the return value of __sendmsg() is always true.
> 
> That's not what I mean... Let me show you the code:
> 
> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
> index eb9bbf12c8d8..11a3c9960a22 100644
> --- a/drivers/md/md-cluster.c
> +++ b/drivers/md/md-cluster.c
> @@ -744,6 +744,7 @@ static void unlock_comm(struct md_cluster_info *cinfo)
>   static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>   {
>          int error;
> +       int unlock_error;
>          int slot = cinfo->slot_number - 1;
> 
>          cmsg->slot = cpu_to_le32(slot);
> @@ -781,13 +782,9 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>          }
> 
>   failed_ack:
> -       error = dlm_unlock_sync(cinfo->message_lockres);
> -       if (unlikely(error != 0)) {
> +       while ((unlock_error = dlm_unlock_sync(cinfo->message_lockres)))
>                  pr_err("md-cluster: failed convert to NL on MESSAGE(%d)\n",
> -                       error);
> -               /* in case the message can't be released due to some reason */
> -               goto failed_ack;
> -       }
> +                       unlock_error);
>   failed_message:
>          return error;
> 

Your idea/code is better. I will use them in v2 patch.

Thanks,
Heming

[1/2] md-cluster: fix hanging issue while a new disk adding

Checks

Commit Message

Comments

Patch