Message ID | 20240612021911.11043-1-heming.zhao@suse.com (mailing list archive) |
---|---|
State | Changes Requested |
Headers | show |
Series | [1/2] md-cluster: fix hanging issue while a new disk adding | expand |
Context | Check | Description |
---|---|---|
mdraidci/vmtest-md-6_11-PR | success | PR summary |
mdraidci/vmtest-md-6_11-VM_Test-0 | success | Logs for build-kernel |
Hello Song & Kuai, Xiao ni told me that he has been quite busy recently and cannot review the code. Do you have time to review my code? btw, The patches has been passed the 60 loops of clustermd_tests [1]. because the kernel md layer code changes, the clustermd_tests scripts also need to be updated. I will send the clustermd_tests patch when the kernel layer code passes review. [1]: https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/clustermd_tests Thanks, Heming On 6/12/24 10:19, Heming Zhao wrote: > The commit 1bbe254e4336 ("md-cluster: check for timeout while a > new disk adding") is correct in terms of code syntax but not > suite real clustered code logic. > > When a timeout occurs while adding a new disk, if recv_daemon() > bypasses the unlock for ack_lockres:CR, another node will be waiting > to grab EX lock. This will cause the cluster to hang indefinitely. > > How to fix: > > 1. In dlm_lock_sync(), change the wait behaviour from forever to a > timeout, This could avoid the hanging issue when another node > fails to handle cluster msg. Another result of this change is > that if another node receives an unknown msg (e.g. a new msg_type), > the old code will hang, whereas the new code will timeout and fail. > This could help cluster_md handle new msg_type from different > nodes with different kernel/module versions (e.g. The user only > updates one leg's kernel and monitors the stability of the new > kernel). > 2. The old code for __sendmsg() always returns 0 (success) under the > design (must successfully unlock ->message_lockres). This commit > makes this function return an error number when an error occurs. > > Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding") > Signed-off-by: Heming Zhao <heming.zhao@suse.com> > Reviewed-by: Su Yue <glass.su@suse.com> > --- > drivers/md/md-cluster.c | 14 ++++++++++++-- > 1 file changed, 12 insertions(+), 2 deletions(-) > > diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c > index 8e36a0feec09..27eaaf9fef94 100644 > --- a/drivers/md/md-cluster.c > +++ b/drivers/md/md-cluster.c > @@ -130,8 +130,13 @@ static int dlm_lock_sync(struct dlm_lock_resource *res, int mode) > 0, sync_ast, res, res->bast); > if (ret) > return ret; > - wait_event(res->sync_locking, res->sync_locking_done); > + ret = wait_event_timeout(res->sync_locking, res->sync_locking_done, > + 60 * HZ); > res->sync_locking_done = false; > + if (!ret) { > + pr_err("locking DLM '%s' timeout!\n", res->name); > + return -EBUSY; > + } > if (res->lksb.sb_status == 0) > res->mode = mode; > return res->lksb.sb_status; > @@ -744,12 +749,14 @@ static void unlock_comm(struct md_cluster_info *cinfo) > static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) > { > int error; > + int ret = 0; > int slot = cinfo->slot_number - 1; > > cmsg->slot = cpu_to_le32(slot); > /*get EX on Message*/ > error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX); > if (error) { > + ret = error; > pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n", error); > goto failed_message; > } > @@ -759,6 +766,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) > /*down-convert EX to CW on Message*/ > error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CW); > if (error) { > + ret = error; > pr_err("md-cluster: failed to convert EX to CW on MESSAGE(%d)\n", > error); > goto failed_ack; > @@ -767,6 +775,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) > /*up-convert CR to EX on Ack*/ > error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX); > if (error) { > + ret = error; > pr_err("md-cluster: failed to convert CR to EX on ACK(%d)\n", > error); > goto failed_ack; > @@ -775,6 +784,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) > /*down-convert EX to CR on Ack*/ > error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR); > if (error) { > + ret = error; > pr_err("md-cluster: failed to convert EX to CR on ACK(%d)\n", > error); > goto failed_ack; > @@ -789,7 +799,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) > goto failed_ack; > } > failed_message: > - return error; > + return ret; > } > > static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg,
Hi, 在 2024/06/24 9:55, Heming Zhao 写道: > Hello Song & Kuai, > > Xiao ni told me that he has been quite busy recently and cannot review > the code. Do you have time to review my code? > > btw, > The patches has been passed the 60 loops of clustermd_tests [1]. because > the kernel md layer code changes, the clustermd_tests scripts also need > to be updated. I will send the clustermd_tests patch when the kernel > layer code passes review. The tests will be quite important, since I'm not familiar with cluster code here. Of coure I'll find sometime to review the code. Thanks, Kuai > > [1]: > https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/clustermd_tests > > Thanks, > Heming > > On 6/12/24 10:19, Heming Zhao wrote: >> The commit 1bbe254e4336 ("md-cluster: check for timeout while a >> new disk adding") is correct in terms of code syntax but not >> suite real clustered code logic. >> >> When a timeout occurs while adding a new disk, if recv_daemon() >> bypasses the unlock for ack_lockres:CR, another node will be waiting >> to grab EX lock. This will cause the cluster to hang indefinitely. >> >> How to fix: >> >> 1. In dlm_lock_sync(), change the wait behaviour from forever to a >> timeout, This could avoid the hanging issue when another node >> fails to handle cluster msg. Another result of this change is >> that if another node receives an unknown msg (e.g. a new msg_type), >> the old code will hang, whereas the new code will timeout and fail. >> This could help cluster_md handle new msg_type from different >> nodes with different kernel/module versions (e.g. The user only >> updates one leg's kernel and monitors the stability of the new >> kernel). >> 2. The old code for __sendmsg() always returns 0 (success) under the >> design (must successfully unlock ->message_lockres). This commit >> makes this function return an error number when an error occurs. >> >> Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk >> adding") >> Signed-off-by: Heming Zhao <heming.zhao@suse.com> >> Reviewed-by: Su Yue <glass.su@suse.com> >> --- >> drivers/md/md-cluster.c | 14 ++++++++++++-- >> 1 file changed, 12 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c >> index 8e36a0feec09..27eaaf9fef94 100644 >> --- a/drivers/md/md-cluster.c >> +++ b/drivers/md/md-cluster.c >> @@ -130,8 +130,13 @@ static int dlm_lock_sync(struct dlm_lock_resource >> *res, int mode) >> 0, sync_ast, res, res->bast); >> if (ret) >> return ret; >> - wait_event(res->sync_locking, res->sync_locking_done); >> + ret = wait_event_timeout(res->sync_locking, res->sync_locking_done, >> + 60 * HZ); >> res->sync_locking_done = false; >> + if (!ret) { >> + pr_err("locking DLM '%s' timeout!\n", res->name); >> + return -EBUSY; >> + } >> if (res->lksb.sb_status == 0) >> res->mode = mode; >> return res->lksb.sb_status; >> @@ -744,12 +749,14 @@ static void unlock_comm(struct md_cluster_info >> *cinfo) >> static int __sendmsg(struct md_cluster_info *cinfo, struct >> cluster_msg *cmsg) >> { >> int error; >> + int ret = 0; >> int slot = cinfo->slot_number - 1; >> cmsg->slot = cpu_to_le32(slot); >> /*get EX on Message*/ >> error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX); >> if (error) { >> + ret = error; >> pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n", >> error); >> goto failed_message; >> } >> @@ -759,6 +766,7 @@ static int __sendmsg(struct md_cluster_info >> *cinfo, struct cluster_msg *cmsg) >> /*down-convert EX to CW on Message*/ >> error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CW); >> if (error) { >> + ret = error; >> pr_err("md-cluster: failed to convert EX to CW on >> MESSAGE(%d)\n", >> error); >> goto failed_ack; >> @@ -767,6 +775,7 @@ static int __sendmsg(struct md_cluster_info >> *cinfo, struct cluster_msg *cmsg) >> /*up-convert CR to EX on Ack*/ >> error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX); >> if (error) { >> + ret = error; >> pr_err("md-cluster: failed to convert CR to EX on ACK(%d)\n", >> error); >> goto failed_ack; >> @@ -775,6 +784,7 @@ static int __sendmsg(struct md_cluster_info >> *cinfo, struct cluster_msg *cmsg) >> /*down-convert EX to CR on Ack*/ >> error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR); >> if (error) { >> + ret = error; >> pr_err("md-cluster: failed to convert EX to CR on ACK(%d)\n", >> error); >> goto failed_ack; >> @@ -789,7 +799,7 @@ static int __sendmsg(struct md_cluster_info >> *cinfo, struct cluster_msg *cmsg) >> goto failed_ack; >> } >> failed_message: >> - return error; >> + return ret; >> } >> static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg >> *cmsg, > > > . >
On 6/24/24 10:37, Yu Kuai wrote: > Hi, > > 在 2024/06/24 9:55, Heming Zhao 写道: >> Hello Song & Kuai, >> >> Xiao ni told me that he has been quite busy recently and cannot review >> the code. Do you have time to review my code? >> >> btw, >> The patches has been passed the 60 loops of clustermd_tests [1]. because >> the kernel md layer code changes, the clustermd_tests scripts also need >> to be updated. I will send the clustermd_tests patch when the kernel >> layer code passes review. > > The tests will be quite important, since I'm not familiar with cluster > code here. Of coure I'll find sometime to review the code. > > Thanks, > Kuai > Thanks for your reply. I post the set up for HA env here, if you have any question please feel free to ask me. -------- # How to set up clustered md env (I use opensuse tumbleweed to test) 1. cook the patches [PATCH 1/2] mdadm/clustermd_tests: add some apis in func.sh to support test to run without error [PATCH 2/2] mdadm/clustermd_tests: adjust test cases to support md module changes - https://lore.kernel.org/linux-raid/20240625021019.8732-1-heming.zhao@suse.com/T/#t 2. edit mdadm.git/clustermd_tests/cluster_conf 2.1 edit NODE[12], e.g: ``` NODE1=192.168.1.100 NODE2=192.168.1.101 ``` 2.2 edit 'devlist=', e.g: ``` devlist=/dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf ``` in my env, each sd[abcdef] size is 300MB 3. set up cluster env download ISO: https://mirrors.tuna.tsinghua.edu.cn/opensuse/tumbleweed/iso/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20240621-Media.iso install two VMs. add 6 shared disk (each 300MB), the backend is raw file. 4. set up ha 4.1 install softwares because tumbleweed doesn't support "zypper -t pattern" mode to install HA stack softwares. install HA stack command from opensuse: zypper in crmsh pacemaker corosync libcsync-plugin-sftp libvirt-client 4.2 set up ha ref: https://documentation.suse.com/sle-ha/15-SP5/html/SLE-HA-all/article-installation.html set up vm machine hostname: node1: hostnamectl set-hostname tw-md1 node2: hostnamectl set-hostname tw-md2 edit /etc/hosts: 192.168.111.100 tw-md1 192.168.111.101 tw-md2 copy to another node: scp -O /etc/hosts tw-md2:/etc/ on node1: crm cluster init -S -y -i <node1-ip> on node2: crm cluster join -c <node1-ip> on node1: crm config edit (<== edit config as following) ``` INFO: "config" is accepted as "configure" node 1: tw-md1 node 2: tw-md2 primitive dlm ocf:pacemaker:controld \ op monitor interval=60s timeout=60s \ op stop timeout=100s interval=0s \ op monitor interval=30s timeout=90s primitive stonith-libvirt stonith:external/libvirt \ params hostlist="tb-md1,tb-md2" hypervisor_uri="qemu+tcp://<host-ip>/system" pcmk_delay_max=30s group base-group dlm clone base-clone base-group \ meta interleave=true property cib-bootstrap-options: \ have-watchdog=true \ dc-version="2.1.7+20240530.09c4d6d2e-1.2-2.1.7+20240530.09c4d6d2e" \ cluster-infrastructure=corosync \ cluster-name=hacluster \ stonith-timeout=71 \ stonith-enabled=true \ no-quorum-policy=freeze rsc_defaults build-resource-defaults: \ resource-stickiness=1 \ priority=1 ``` note: - hypervisor_uri="qemu+tcp://<host-ip>/system", the <host-ip> should be adjusted according to the real env. - use 'crm config show' to show/check the config check HA status: crm status full (<== should show no error, see below) ``` Node List: * Node tw-md1: online: * Resources: * stonith-libvirt (stonith:external/libvirt): Started * dlm (ocf:pacemaker:controld): Started * Node tw-md2: online: * Resources: * dlm (ocf:pacemaker:controld): Started ``` 5. run test all: ./test --testdir=clustermd_tests --save-logs --logdir=./logs --keep-going single test: ./test --testdir=clustermd_tests --save-logs --logdir=./logs --keep-going --tests=02r1_Manage_add Thanks, Heming >> >> [1]: https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/clustermd_tests >> >> Thanks, >> Heming >> >> On 6/12/24 10:19, Heming Zhao wrote: >>> The commit 1bbe254e4336 ("md-cluster: check for timeout while a >>> new disk adding") is correct in terms of code syntax but not >>> suite real clustered code logic. >>> >>> When a timeout occurs while adding a new disk, if recv_daemon() >>> bypasses the unlock for ack_lockres:CR, another node will be waiting >>> to grab EX lock. This will cause the cluster to hang indefinitely. >>> >>> How to fix: >>> >>> 1. In dlm_lock_sync(), change the wait behaviour from forever to a >>> timeout, This could avoid the hanging issue when another node >>> fails to handle cluster msg. Another result of this change is >>> that if another node receives an unknown msg (e.g. a new msg_type), >>> the old code will hang, whereas the new code will timeout and fail. >>> This could help cluster_md handle new msg_type from different >>> nodes with different kernel/module versions (e.g. The user only >>> updates one leg's kernel and monitors the stability of the new >>> kernel). >>> 2. The old code for __sendmsg() always returns 0 (success) under the >>> design (must successfully unlock ->message_lockres). This commit >>> makes this function return an error number when an error occurs. >>> >>> Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding") >>> Signed-off-by: Heming Zhao <heming.zhao@suse.com> >>> Reviewed-by: Su Yue <glass.su@suse.com> >>> --- >>> drivers/md/md-cluster.c | 14 ++++++++++++-- >>> 1 file changed, 12 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c >>> index 8e36a0feec09..27eaaf9fef94 100644 >>> --- a/drivers/md/md-cluster.c >>> +++ b/drivers/md/md-cluster.c >>> @@ -130,8 +130,13 @@ static int dlm_lock_sync(struct dlm_lock_resource *res, int mode) >>> 0, sync_ast, res, res->bast); >>> if (ret) >>> return ret; >>> - wait_event(res->sync_locking, res->sync_locking_done); >>> + ret = wait_event_timeout(res->sync_locking, res->sync_locking_done, >>> + 60 * HZ); >>> res->sync_locking_done = false; >>> + if (!ret) { >>> + pr_err("locking DLM '%s' timeout!\n", res->name); >>> + return -EBUSY; >>> + } >>> if (res->lksb.sb_status == 0) >>> res->mode = mode; >>> return res->lksb.sb_status; >>> @@ -744,12 +749,14 @@ static void unlock_comm(struct md_cluster_info *cinfo) >>> static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) >>> { >>> int error; >>> + int ret = 0; >>> int slot = cinfo->slot_number - 1; >>> cmsg->slot = cpu_to_le32(slot); >>> /*get EX on Message*/ >>> error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX); >>> if (error) { >>> + ret = error; >>> pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n", error); >>> goto failed_message; >>> } >>> @@ -759,6 +766,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) >>> /*down-convert EX to CW on Message*/ >>> error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CW); >>> if (error) { >>> + ret = error; >>> pr_err("md-cluster: failed to convert EX to CW on MESSAGE(%d)\n", >>> error); >>> goto failed_ack; >>> @@ -767,6 +775,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) >>> /*up-convert CR to EX on Ack*/ >>> error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX); >>> if (error) { >>> + ret = error; >>> pr_err("md-cluster: failed to convert CR to EX on ACK(%d)\n", >>> error); >>> goto failed_ack; >>> @@ -775,6 +784,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) >>> /*down-convert EX to CR on Ack*/ >>> error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR); >>> if (error) { >>> + ret = error; >>> pr_err("md-cluster: failed to convert EX to CR on ACK(%d)\n", >>> error); >>> goto failed_ack; >>> @@ -789,7 +799,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) >>> goto failed_ack; >>> } >>> failed_message: >>> - return error; >>> + return ret; >>> } >>> static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg, >> >> >> . >> >
Hi, 在 2024/06/12 10:19, Heming Zhao 写道: > The commit 1bbe254e4336 ("md-cluster: check for timeout while a > new disk adding") is correct in terms of code syntax but not > suite real clustered code logic. > > When a timeout occurs while adding a new disk, if recv_daemon() > bypasses the unlock for ack_lockres:CR, another node will be waiting > to grab EX lock. This will cause the cluster to hang indefinitely. > > How to fix: > > 1. In dlm_lock_sync(), change the wait behaviour from forever to a > timeout, This could avoid the hanging issue when another node > fails to handle cluster msg. Another result of this change is > that if another node receives an unknown msg (e.g. a new msg_type), > the old code will hang, whereas the new code will timeout and fail. > This could help cluster_md handle new msg_type from different > nodes with different kernel/module versions (e.g. The user only > updates one leg's kernel and monitors the stability of the new > kernel). > 2. The old code for __sendmsg() always returns 0 (success) under the > design (must successfully unlock ->message_lockres). This commit > makes this function return an error number when an error occurs. > > Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding") > Signed-off-by: Heming Zhao <heming.zhao@suse.com> > Reviewed-by: Su Yue <glass.su@suse.com> > --- > drivers/md/md-cluster.c | 14 ++++++++++++-- > 1 file changed, 12 insertions(+), 2 deletions(-) > > diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c > index 8e36a0feec09..27eaaf9fef94 100644 > --- a/drivers/md/md-cluster.c > +++ b/drivers/md/md-cluster.c > @@ -130,8 +130,13 @@ static int dlm_lock_sync(struct dlm_lock_resource *res, int mode) > 0, sync_ast, res, res->bast); > if (ret) > return ret; > - wait_event(res->sync_locking, res->sync_locking_done); > + ret = wait_event_timeout(res->sync_locking, res->sync_locking_done, > + 60 * HZ); Let's not use magic number directly, it's better to define a marco. BTW, 60s looks too long for me. > res->sync_locking_done = false; And I tried to find, if setting this value on failure is ok. However, I'm lost and I really don't know. Can you explain this? > + if (!ret) { > + pr_err("locking DLM '%s' timeout!\n", res->name); > + return -EBUSY; > + } > if (res->lksb.sb_status == 0) > res->mode = mode; > return res->lksb.sb_status; > @@ -744,12 +749,14 @@ static void unlock_comm(struct md_cluster_info *cinfo) > static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) > { > int error; > + int ret = 0; > int slot = cinfo->slot_number - 1; > > cmsg->slot = cpu_to_le32(slot); > /*get EX on Message*/ > error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX); > if (error) { > + ret = error; You can return error directly in this branch. > pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n", error); > goto failed_message; > } > @@ -759,6 +766,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) > /*down-convert EX to CW on Message*/ > error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CW); > if (error) { > + ret = error; > pr_err("md-cluster: failed to convert EX to CW on MESSAGE(%d)\n", > error); > goto failed_ack; > @@ -767,6 +775,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) > /*up-convert CR to EX on Ack*/ > error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX); > if (error) { > + ret = error; > pr_err("md-cluster: failed to convert CR to EX on ACK(%d)\n", > error); > goto failed_ack; > @@ -775,6 +784,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) > /*down-convert EX to CR on Ack*/ > error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR); > if (error) { > + ret = error; > pr_err("md-cluster: failed to convert EX to CR on ACK(%d)\n", > error); > goto failed_ack; > @@ -789,7 +799,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) > goto failed_ack; > } > failed_message: > - return error; > + return ret; And I'll suggest just to change dlm_unlock_sync(), not to change all the other places. Thanks, Kuai > } > > static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg, >
On 6/27/24 20:52, Yu Kuai wrote: > Hi, > > 在 2024/06/12 10:19, Heming Zhao 写道: >> The commit 1bbe254e4336 ("md-cluster: check for timeout while a >> new disk adding") is correct in terms of code syntax but not >> suite real clustered code logic. >> >> When a timeout occurs while adding a new disk, if recv_daemon() >> bypasses the unlock for ack_lockres:CR, another node will be waiting >> to grab EX lock. This will cause the cluster to hang indefinitely. >> >> How to fix: >> >> 1. In dlm_lock_sync(), change the wait behaviour from forever to a >> timeout, This could avoid the hanging issue when another node >> fails to handle cluster msg. Another result of this change is >> that if another node receives an unknown msg (e.g. a new msg_type), >> the old code will hang, whereas the new code will timeout and fail. >> This could help cluster_md handle new msg_type from different >> nodes with different kernel/module versions (e.g. The user only >> updates one leg's kernel and monitors the stability of the new >> kernel). >> 2. The old code for __sendmsg() always returns 0 (success) under the >> design (must successfully unlock ->message_lockres). This commit >> makes this function return an error number when an error occurs. >> >> Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding") >> Signed-off-by: Heming Zhao <heming.zhao@suse.com> >> Reviewed-by: Su Yue <glass.su@suse.com> >> --- >> drivers/md/md-cluster.c | 14 ++++++++++++-- >> 1 file changed, 12 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c >> index 8e36a0feec09..27eaaf9fef94 100644 >> --- a/drivers/md/md-cluster.c >> +++ b/drivers/md/md-cluster.c >> @@ -130,8 +130,13 @@ static int dlm_lock_sync(struct dlm_lock_resource *res, int mode) >> 0, sync_ast, res, res->bast); >> if (ret) >> return ret; >> - wait_event(res->sync_locking, res->sync_locking_done); >> + ret = wait_event_timeout(res->sync_locking, res->sync_locking_done, >> + 60 * HZ); > > Let's not use magic number directly, it's better to define a marco. BTW, > 60s looks too long for me. got it, will create a define: #define WAIT_DLM_LOCK_TIMEOUT 30 * HZ In my view, the shortest time should be 30s. because there is a clustered env. Node A is waiting for node B to release the lock. We should consider: - network traffic (node A and B are not in the same build) - another node's udev event handling time: NEW_DEV_TIMEOUT 5000 >> res->sync_locking_done = false; > > And I tried to find, if setting this value on failure is ok. However, > I'm lost and I really don't know. Can you explain this? This code logic is the same as dlm_lock_sync_interruptible(). We can see that regardless of success or failure, '->sync_locking_done' is set to false in dlm_lock_sync_interruptible(). >> + if (!ret) { >> + pr_err("locking DLM '%s' timeout!\n", res->name); >> + return -EBUSY; >> + } >> if (res->lksb.sb_status == 0) >> res->mode = mode; >> return res->lksb.sb_status; >> @@ -744,12 +749,14 @@ static void unlock_comm(struct md_cluster_info *cinfo) >> static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) >> { >> int error; >> + int ret = 0; >> int slot = cinfo->slot_number - 1; >> cmsg->slot = cpu_to_le32(slot); >> /*get EX on Message*/ >> error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX); >> if (error) { >> + ret = error; > > You can return error directly in this branch. OK >> pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n", error); >> goto failed_message; >> } >> @@ -759,6 +766,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) >> /*down-convert EX to CW on Message*/ >> error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CW); >> if (error) { >> + ret = error; >> pr_err("md-cluster: failed to convert EX to CW on MESSAGE(%d)\n", >> error); >> goto failed_ack; >> @@ -767,6 +775,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) >> /*up-convert CR to EX on Ack*/ >> error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX); >> if (error) { >> + ret = error; >> pr_err("md-cluster: failed to convert CR to EX on ACK(%d)\n", >> error); >> goto failed_ack; >> @@ -775,6 +784,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) >> /*down-convert EX to CR on Ack*/ >> error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR); >> if (error) { >> + ret = error; >> pr_err("md-cluster: failed to convert EX to CR on ACK(%d)\n", >> error); >> goto failed_ack; >> @@ -789,7 +799,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) >> goto failed_ack; >> } >> failed_message: >> - return error; >> + return ret; > > And I'll suggest just to change dlm_unlock_sync(), not to change all the > other places. > I have a different viewpoint, the clustermd code has been running for about 10 years, and no bugs have been reported from SUSE customers for about 1 year. I am inclined to keep the current code style. If we change dlm_unlock_sync(), many places will need to be rewritten, which may introduce new bugs. From the callers of this func (__sendmsg()), which handle the return value, we know it's definitely wrong because the return value of __sendmsg() is always true. -Heming
Hi, 在 2024/06/28 20:32, Heming Zhao 写道: > On 6/27/24 20:52, Yu Kuai wrote: >> Hi, >> >> 在 2024/06/12 10:19, Heming Zhao 写道: >>> The commit 1bbe254e4336 ("md-cluster: check for timeout while a >>> new disk adding") is correct in terms of code syntax but not >>> suite real clustered code logic. >>> >>> When a timeout occurs while adding a new disk, if recv_daemon() >>> bypasses the unlock for ack_lockres:CR, another node will be waiting >>> to grab EX lock. This will cause the cluster to hang indefinitely. >>> >>> How to fix: >>> >>> 1. In dlm_lock_sync(), change the wait behaviour from forever to a >>> timeout, This could avoid the hanging issue when another node >>> fails to handle cluster msg. Another result of this change is >>> that if another node receives an unknown msg (e.g. a new msg_type), >>> the old code will hang, whereas the new code will timeout and fail. >>> This could help cluster_md handle new msg_type from different >>> nodes with different kernel/module versions (e.g. The user only >>> updates one leg's kernel and monitors the stability of the new >>> kernel). >>> 2. The old code for __sendmsg() always returns 0 (success) under the >>> design (must successfully unlock ->message_lockres). This commit >>> makes this function return an error number when an error occurs. >>> >>> Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk >>> adding") >>> Signed-off-by: Heming Zhao <heming.zhao@suse.com> >>> Reviewed-by: Su Yue <glass.su@suse.com> >>> --- >>> drivers/md/md-cluster.c | 14 ++++++++++++-- >>> 1 file changed, 12 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c >>> index 8e36a0feec09..27eaaf9fef94 100644 >>> --- a/drivers/md/md-cluster.c >>> +++ b/drivers/md/md-cluster.c >>> @@ -130,8 +130,13 @@ static int dlm_lock_sync(struct >>> dlm_lock_resource *res, int mode) >>> 0, sync_ast, res, res->bast); >>> if (ret) >>> return ret; >>> - wait_event(res->sync_locking, res->sync_locking_done); >>> + ret = wait_event_timeout(res->sync_locking, res->sync_locking_done, >>> + 60 * HZ); >> >> Let's not use magic number directly, it's better to define a marco. BTW, >> 60s looks too long for me. > > got it, will create a define: > #define WAIT_DLM_LOCK_TIMEOUT 30 * HZ > > In my view, the shortest time should be 30s. because there is a > clustered env. > Node A is waiting for node B to release the lock. > We should consider: > - network traffic (node A and B are not in the same build) > - another node's udev event handling time: NEW_DEV_TIMEOUT 5000 > >>> res->sync_locking_done = false; >> >> And I tried to find, if setting this value on failure is ok. However, >> I'm lost and I really don't know. Can you explain this? > > This code logic is the same as dlm_lock_sync_interruptible(). We can > see that regardless of success or failure, '->sync_locking_done' is > set to false in dlm_lock_sync_interruptible(). > >>> + if (!ret) { >>> + pr_err("locking DLM '%s' timeout!\n", res->name); >>> + return -EBUSY; >>> + } >>> if (res->lksb.sb_status == 0) >>> res->mode = mode; >>> return res->lksb.sb_status; >>> @@ -744,12 +749,14 @@ static void unlock_comm(struct md_cluster_info >>> *cinfo) >>> static int __sendmsg(struct md_cluster_info *cinfo, struct >>> cluster_msg *cmsg) >>> { >>> int error; >>> + int ret = 0; >>> int slot = cinfo->slot_number - 1; >>> cmsg->slot = cpu_to_le32(slot); >>> /*get EX on Message*/ >>> error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX); >>> if (error) { >>> + ret = error; >> >> You can return error directly in this branch. > > OK > >>> pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n", >>> error); >>> goto failed_message; >>> } >>> @@ -759,6 +766,7 @@ static int __sendmsg(struct md_cluster_info >>> *cinfo, struct cluster_msg *cmsg) >>> /*down-convert EX to CW on Message*/ >>> error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CW); >>> if (error) { >>> + ret = error; >>> pr_err("md-cluster: failed to convert EX to CW on >>> MESSAGE(%d)\n", >>> error); >>> goto failed_ack; >>> @@ -767,6 +775,7 @@ static int __sendmsg(struct md_cluster_info >>> *cinfo, struct cluster_msg *cmsg) >>> /*up-convert CR to EX on Ack*/ >>> error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX); >>> if (error) { >>> + ret = error; >>> pr_err("md-cluster: failed to convert CR to EX on ACK(%d)\n", >>> error); >>> goto failed_ack; >>> @@ -775,6 +784,7 @@ static int __sendmsg(struct md_cluster_info >>> *cinfo, struct cluster_msg *cmsg) >>> /*down-convert EX to CR on Ack*/ >>> error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR); >>> if (error) { >>> + ret = error; >>> pr_err("md-cluster: failed to convert EX to CR on ACK(%d)\n", >>> error); >>> goto failed_ack; >>> @@ -789,7 +799,7 @@ static int __sendmsg(struct md_cluster_info >>> *cinfo, struct cluster_msg *cmsg) >>> goto failed_ack; >>> } >>> failed_message: >>> - return error; >>> + return ret; >> >> And I'll suggest just to change dlm_unlock_sync(), not to change all the >> other places. >> > > I have a different viewpoint, the clustermd code has been running for > about 10 > years, and no bugs have been reported from SUSE customers for about 1 > year. I am > inclined to keep the current code style. If we change dlm_unlock_sync(), > many > places will need to be rewritten, which may introduce new bugs. From the > callers > of this func (__sendmsg()), which handle the return value, we know it's > definitely wrong because the return value of __sendmsg() is always true. That's not what I mean... Let me show you the code: diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index eb9bbf12c8d8..11a3c9960a22 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -744,6 +744,7 @@ static void unlock_comm(struct md_cluster_info *cinfo) static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) { int error; + int unlock_error; int slot = cinfo->slot_number - 1; cmsg->slot = cpu_to_le32(slot); @@ -781,13 +782,9 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) } failed_ack: - error = dlm_unlock_sync(cinfo->message_lockres); - if (unlikely(error != 0)) { + while ((unlock_error = dlm_unlock_sync(cinfo->message_lockres))) pr_err("md-cluster: failed convert to NL on MESSAGE(%d)\n", - error); - /* in case the message can't be released due to some reason */ - goto failed_ack; - } + unlock_error); failed_message: return error; Thanks, Kuai > > -Heming > . >
On 6/29/24 09:42, Yu Kuai wrote: > Hi, > > 在 2024/06/28 20:32, Heming Zhao 写道: >> On 6/27/24 20:52, Yu Kuai wrote: >>> Hi, >>> >>> 在 2024/06/12 10:19, Heming Zhao 写道: >>>> The commit 1bbe254e4336 ("md-cluster: check for timeout while a >>>> new disk adding") is correct in terms of code syntax but not >>>> suite real clustered code logic. >>>> >>>> When a timeout occurs while adding a new disk, if recv_daemon() >>>> bypasses the unlock for ack_lockres:CR, another node will be waiting >>>> to grab EX lock. This will cause the cluster to hang indefinitely. >>>> >>>> How to fix: >>>> >>>> 1. In dlm_lock_sync(), change the wait behaviour from forever to a >>>> timeout, This could avoid the hanging issue when another node >>>> fails to handle cluster msg. Another result of this change is >>>> that if another node receives an unknown msg (e.g. a new msg_type), >>>> the old code will hang, whereas the new code will timeout and fail. >>>> This could help cluster_md handle new msg_type from different >>>> nodes with different kernel/module versions (e.g. The user only >>>> updates one leg's kernel and monitors the stability of the new >>>> kernel). >>>> 2. The old code for __sendmsg() always returns 0 (success) under the >>>> design (must successfully unlock ->message_lockres). This commit >>>> makes this function return an error number when an error occurs. >>>> >>>> Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding") >>>> Signed-off-by: Heming Zhao <heming.zhao@suse.com> >>>> Reviewed-by: Su Yue <glass.su@suse.com> >>>> --- >>>> drivers/md/md-cluster.c | 14 ++++++++++++-- >>>> 1 file changed, 12 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c >>>> index 8e36a0feec09..27eaaf9fef94 100644 >>>> --- a/drivers/md/md-cluster.c >>>> +++ b/drivers/md/md-cluster.c >>>> @@ -130,8 +130,13 @@ static int dlm_lock_sync(struct dlm_lock_resource *res, int mode) >>>> 0, sync_ast, res, res->bast); >>>> if (ret) >>>> return ret; >>>> - wait_event(res->sync_locking, res->sync_locking_done); >>>> + ret = wait_event_timeout(res->sync_locking, res->sync_locking_done, >>>> + 60 * HZ); >>> >>> Let's not use magic number directly, it's better to define a marco. BTW, >>> 60s looks too long for me. >> >> got it, will create a define: >> #define WAIT_DLM_LOCK_TIMEOUT 30 * HZ >> >> In my view, the shortest time should be 30s. because there is a clustered env. >> Node A is waiting for node B to release the lock. >> We should consider: >> - network traffic (node A and B are not in the same build) >> - another node's udev event handling time: NEW_DEV_TIMEOUT 5000 >> >>>> res->sync_locking_done = false; >>> >>> And I tried to find, if setting this value on failure is ok. However, >>> I'm lost and I really don't know. Can you explain this? >> >> This code logic is the same as dlm_lock_sync_interruptible(). We can >> see that regardless of success or failure, '->sync_locking_done' is >> set to false in dlm_lock_sync_interruptible(). >> >>>> + if (!ret) { >>>> + pr_err("locking DLM '%s' timeout!\n", res->name); >>>> + return -EBUSY; >>>> + } >>>> if (res->lksb.sb_status == 0) >>>> res->mode = mode; >>>> return res->lksb.sb_status; >>>> @@ -744,12 +749,14 @@ static void unlock_comm(struct md_cluster_info *cinfo) >>>> static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) >>>> { >>>> int error; >>>> + int ret = 0; >>>> int slot = cinfo->slot_number - 1; >>>> cmsg->slot = cpu_to_le32(slot); >>>> /*get EX on Message*/ >>>> error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX); >>>> if (error) { >>>> + ret = error; >>> >>> You can return error directly in this branch. >> >> OK >> >>>> pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n", error); >>>> goto failed_message; >>>> } >>>> @@ -759,6 +766,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) >>>> /*down-convert EX to CW on Message*/ >>>> error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CW); >>>> if (error) { >>>> + ret = error; >>>> pr_err("md-cluster: failed to convert EX to CW on MESSAGE(%d)\n", >>>> error); >>>> goto failed_ack; >>>> @@ -767,6 +775,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) >>>> /*up-convert CR to EX on Ack*/ >>>> error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX); >>>> if (error) { >>>> + ret = error; >>>> pr_err("md-cluster: failed to convert CR to EX on ACK(%d)\n", >>>> error); >>>> goto failed_ack; >>>> @@ -775,6 +784,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) >>>> /*down-convert EX to CR on Ack*/ >>>> error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR); >>>> if (error) { >>>> + ret = error; >>>> pr_err("md-cluster: failed to convert EX to CR on ACK(%d)\n", >>>> error); >>>> goto failed_ack; >>>> @@ -789,7 +799,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) >>>> goto failed_ack; >>>> } >>>> failed_message: >>>> - return error; >>>> + return ret; >>> >>> And I'll suggest just to change dlm_unlock_sync(), not to change all the >>> other places. >>> >> >> I have a different viewpoint, the clustermd code has been running for about 10 >> years, and no bugs have been reported from SUSE customers for about 1 year. I am >> inclined to keep the current code style. If we change dlm_unlock_sync(), many >> places will need to be rewritten, which may introduce new bugs. From the callers >> of this func (__sendmsg()), which handle the return value, we know it's >> definitely wrong because the return value of __sendmsg() is always true. > > That's not what I mean... Let me show you the code: > > diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c > index eb9bbf12c8d8..11a3c9960a22 100644 > --- a/drivers/md/md-cluster.c > +++ b/drivers/md/md-cluster.c > @@ -744,6 +744,7 @@ static void unlock_comm(struct md_cluster_info *cinfo) > static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) > { > int error; > + int unlock_error; > int slot = cinfo->slot_number - 1; > > cmsg->slot = cpu_to_le32(slot); > @@ -781,13 +782,9 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) > } > > failed_ack: > - error = dlm_unlock_sync(cinfo->message_lockres); > - if (unlikely(error != 0)) { > + while ((unlock_error = dlm_unlock_sync(cinfo->message_lockres))) > pr_err("md-cluster: failed convert to NL on MESSAGE(%d)\n", > - error); > - /* in case the message can't be released due to some reason */ > - goto failed_ack; > - } > + unlock_error); > failed_message: > return error; > Your idea/code is better. I will use them in v2 patch. Thanks, Heming
diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index 8e36a0feec09..27eaaf9fef94 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -130,8 +130,13 @@ static int dlm_lock_sync(struct dlm_lock_resource *res, int mode) 0, sync_ast, res, res->bast); if (ret) return ret; - wait_event(res->sync_locking, res->sync_locking_done); + ret = wait_event_timeout(res->sync_locking, res->sync_locking_done, + 60 * HZ); res->sync_locking_done = false; + if (!ret) { + pr_err("locking DLM '%s' timeout!\n", res->name); + return -EBUSY; + } if (res->lksb.sb_status == 0) res->mode = mode; return res->lksb.sb_status; @@ -744,12 +749,14 @@ static void unlock_comm(struct md_cluster_info *cinfo) static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) { int error; + int ret = 0; int slot = cinfo->slot_number - 1; cmsg->slot = cpu_to_le32(slot); /*get EX on Message*/ error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX); if (error) { + ret = error; pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n", error); goto failed_message; } @@ -759,6 +766,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) /*down-convert EX to CW on Message*/ error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CW); if (error) { + ret = error; pr_err("md-cluster: failed to convert EX to CW on MESSAGE(%d)\n", error); goto failed_ack; @@ -767,6 +775,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) /*up-convert CR to EX on Ack*/ error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX); if (error) { + ret = error; pr_err("md-cluster: failed to convert CR to EX on ACK(%d)\n", error); goto failed_ack; @@ -775,6 +784,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) /*down-convert EX to CR on Ack*/ error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR); if (error) { + ret = error; pr_err("md-cluster: failed to convert EX to CR on ACK(%d)\n", error); goto failed_ack; @@ -789,7 +799,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) goto failed_ack; } failed_message: - return error; + return ret; } static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg,