Message ID | 1532010659-4657-1-git-send-email-xiubli@redhat.com (mailing list archive) |
---|---|
State | Changes Requested |
Headers | show |
On 07/19/2018 09:30 AM, xiubli@redhat.com wrote: > From: Xiubo Li <xiubli@redhat.com> > > The logs are: > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000040 > IP: [<ffffffffc072b9a9>] tcmu_reset_ring_store+0x149/0x240 [target_core_user] > PGD 800000000e254067 PUD e255067 PMD 0 > Oops: 0002 [#1] SMP > [...] > CPU: 0 PID: 36077 Comm: tcmu-runner Kdump: loaded Not tainted 3.10.0-924.el7.test.x86_64 #1 It is not clear to me how you hit this. It's not a RHEL only bug is it? Are you sure you are hitting it during device removal? If we are calling tcmu_reset_ring_store isn't there a refcount on the item? So we cannot call tcmu_destroy_device -> tcmu_dev_kref_release until after that has been released, so they would never run at the same time. And, if userspace were to try to open/write to that reset file after the rmdir has been done on the se_device then configfs detects this and fails the open. I think we can hit your bug during initialization. We do not setup the mb_addr until the device is configured, but the configfs files are exported to userspace at device creation time. So between those times, userspace could be writing to the reset file and hitting this bug. Is that maybe what you hit? If that is the bug, we need to grab the cmdr_lock in tcmu_configure_device when we are setting up the mb_addr and in the failure path in that function. In tcmu_reset_ring then I think we also need a check for a NULL mb_addr. > diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c > index 847707a..8d7274e 100644 > --- a/drivers/target/target_core_user.c > +++ b/drivers/target/target_core_user.c > @@ -1587,16 +1587,16 @@ static void tcmu_dev_kref_release(struct kref *kref) > bool all_expired = true; > int i; > > - vfree(udev->mb_addr); > - udev->mb_addr = NULL; > - > spin_lock_bh(&timed_out_udevs_lock); > if (!list_empty(&udev->timedout_entry)) > list_del(&udev->timedout_entry); > spin_unlock_bh(&timed_out_udevs_lock); > > - /* Upper layer should drain all requests before calling this */ > mutex_lock(&udev->cmdr_lock); > + vfree(udev->mb_addr); > + udev->mb_addr = NULL; > + > + /* Upper layer should drain all requests before calling this */ > idr_for_each_entry(&udev->commands, cmd, i) { > if (tcmu_check_and_free_pending_cmd(cmd) != 0) > all_expired = false; >
On 07/19/2018 07:34 PM, Xiubo Li wrote: > On 2018/7/19 23:49, Mike Christie wrote: >> On 07/19/2018 09:30 AM, xiubli@redhat.com wrote: >>> From: Xiubo Li <xiubli@redhat.com> >>> >>> The logs are: >>> >>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000040 >>> IP: [<ffffffffc072b9a9>] tcmu_reset_ring_store+0x149/0x240 [target_core_user] >>> PGD 800000000e254067 PUD e255067 PMD 0 >>> Oops: 0002 [#1] SMP >>> [...] >>> CPU: 0 PID: 36077 Comm: tcmu-runner Kdump: loaded Not tainted 3.10.0-924.el7.test.x86_64 #1 >> >> It is not clear to me how you hit this. It's not a RHEL only bug is it? >> Are you sure you are hitting it during device removal? > Yes, not only the RHEL bug, for the RHEL we have sync the related code > to the upstream. Before as I remembered I have hit one sudden crash by > using the SCSI repo code when testing the tcmu-runner PR#402: > tcmu-runner: add reset netlink support. > > The SCSI repo is : > git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git > > Since the kdump service was not working for the upstream code in my > setups that time, so could only get very simple logs. This time I was > using Prasanna's script and RHEL code with the netlink and UIO crash > fixing patches to reproduce this. The test script please see the attachment. > >> If we are calling tcmu_reset_ring_store isn't there a refcount on the >> item? So we cannot call tcmu_destroy_device -> tcmu_dev_kref_release >> until after that has been released, so they would never run at the same >> time. And, if userspace were to try to open/write to that reset file >> after the rmdir has been done on the se_device then configfs detects >> this and fails the open. > With the tcmu-runner PR#402, when starting tcmu-runner the daemon will do : > 1, block netlink > 2, reset netlink events > 3, block device --> reset ring --> open device --> unblock device If the device was only created and not yet enabled you will hit the crash here. If in your test you kill runner while a enable is in process then this would happen. > 4, unblock netlink > > > Currently there may has one bug in systemd service, because we hit one > odd issue by using the test script: > When executing 'systemctl restart tcmu-runner', this process seems > waiting one signal from somewhere by stucking itself, if we do nothing > It will wait for about 6 ~20 minutes(the time comes from our test > result) to recovery from that. And we can execute 'systemctl restart > gluster-blockd' in another terminal, which could stop the stuck state > above immediately. > > [root@localhost ~]# ps -aux | grep -e tcmu -e > target > root 2021 0.0 0.0 0 0 ? S< 15:09 0:00 > [target_completi] > root 2386 0.6 0.5 226232 11096 ? Ds 15:11 0:00 > /usr/bin/python /usr/bin/targetctl clear > root 2400 0.0 0.0 134816 1316 pts/0 S+ 15:11 0:00 > systemctl restart tcmu-runner root 2415 > 0.0 0.0 112704 988 pts/1 R+ 15:11 0:00 grep --color=auto -e > tcmu -e target > [root@localhost ~]# cat > /proc/2400/stack > > [<ffffffff9d0548f5>] > poll_schedule_timeout+0x55/0xb0 > > [<ffffffff9d05607d>] > do_sys_poll+0x48d/0x590 > > [<ffffffff9d0564f3>] > SyS_ppoll+0x1d3/0x1f0 > > [<ffffffff9d56cdef>] > system_call_fastpath+0x1c/0x21 > > > And with IOs going on, when killing the tcmu-runner process, the Are you just doing a test to kill runner at various times or is your restart of your service doing this while the targetcli clear operation is running? If you re killing it at various times, how do you know you are not killing it during a enable operation and hitting this bug there? If you are killing on purpose as part of your restart operation why are you killing it when the targetcli clear operation needs the backends to complete their IO as part of the tpg disablement process. > 'targetctl clear' process will stuck in [1] first. Then we restart the > tcmu-runner, which will reset ring with accessing the udev->mb_addr, > currently we still could access the > /sys/kernel/config/target/core/user_XX/*. And while the ring reset is > still going on, the stucked 'targetctl clear' process will be woken up > and then tries to delete the device which will kfree(udev->mb_addr), > here any accessing to /sys/kernel/config/target/core/user_XX/* will stuck. The rtslib clear operation deletes the target first and that disables the tpg which waits for outstanding commands. It then finishes tearing down the objects under the target, and then the backend devices are deleted. If targetcli clear is unhung from the tpg disable operation then tcmu-runner has restarted and has started to clean up the kernel state. So runner has reset the ring and failed running IO right (this is how the tpg disable operation will see all outstanding commands completed and procceed)? We then proceed and can do a rmdir on the tcmu device. So I am saying the only way tpg disablement has completed when there is outstanding IO is if reset ring has completed first. Are you certain you have not killed a runner in the middle of a enable/configure event at some point. Then runner has restarted and killed that nl operation leaving the uncongigured device. It then runs reset ring which fails outstanding commands and accessed the unsetup mb_addr and hits the crash here. When the outstanding commands have completed the rtslib operation is also unhung from the tpg disable completes, and we proceed to the backend device deletion. > > There are two stuck case: > [1], If there has any uncompleted IO or cmd, 'targetctl clear' process > will stuck in iscsit_free_session() --> > iscsit_cause_connection_reinstatement() first, then after all the IOs or > cmds completed, it will stuck in [2]. > > [2], If there has no uncompleted IOs or cmds, 'targetctl clear' process > will stuck in tcmu_destroy_device() --> tcmu_netlink_event() --> > wait_for_completion(). > > > >> I think we can hit your bug during initialization. We do not setup the >> mb_addr until the device is configured, but the configfs files are >> exported to userspace at device creation time. So between those times, >> userspace could be writing to the reset file and hitting this bug. Is >> that maybe what you hit? >> >> If that is the bug, we need to grab the cmdr_lock in >> tcmu_configure_device when we are setting up the mb_addr and in the >> failure path in that function. In tcmu_reset_ring then I think we also >> need a check for a NULL mb_addr. >> > I checked all the related code and find that all the mb_addr accessing > are protect by the cmdr_lock except in tcmu_dev_kref_release() after the > device initialized. > > Today I have reproduced this again. > If you can reproduce it can you debug it a little more to figure out what is going on exactly? If you cannot get a kdump, then at least stick in a printk in tcmu_dev_kref_release and tcmu_reset_ring to confirm that the release is run before the reset. Also stick in a printk to the tcmu_configure_device function to verify it is always succeeding. The problem is that if you can hit the bug you describe where configfs is accessing the device after the device has had a rmdir done on it then a lot of the configfs code will have a possible bug. For the se_device tcmu_dev_kref_release is only run when the se_device's configfs item's refcount has gone to zero (target_core_dev_release -> target_free_device -> target_free_device -> tcmu_free_device), so the se_device/tcmu_device could also be freed at any time while configfs is operation is running if you are right and your patch is not going to help that. > Thanks, > BRs > Xiubo > >> >>> diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c >>> index 847707a..8d7274e 100644 >>> --- a/drivers/target/target_core_user.c >>> +++ b/drivers/target/target_core_user.c >>> @@ -1587,16 +1587,16 @@ static void tcmu_dev_kref_release(struct kref *kref) >>> bool all_expired = true; >>> int i; >>> >>> - vfree(udev->mb_addr); >>> - udev->mb_addr = NULL; >>> - >>> spin_lock_bh(&timed_out_udevs_lock); >>> if (!list_empty(&udev->timedout_entry)) >>> list_del(&udev->timedout_entry); >>> spin_unlock_bh(&timed_out_udevs_lock); >>> >>> - /* Upper layer should drain all requests before calling this */ >>> mutex_lock(&udev->cmdr_lock); >>> + vfree(udev->mb_addr); >>> + udev->mb_addr = NULL; >>> + >>> + /* Upper layer should drain all requests before calling this */ >>> idr_for_each_entry(&udev->commands, cmd, i) { >>> if (tcmu_check_and_free_pending_cmd(cmd) != 0) >>> all_expired = false; >>> >> -- >> To unsubscribe from this list: send the line "unsubscribe target-devel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > >
On 2018/7/20 23:13, Mike Christie wrote: > On 07/19/2018 07:34 PM, Xiubo Li wrote: >> On 2018/7/19 23:49, Mike Christie wrote: >>> On 07/19/2018 09:30 AM, xiubli@redhat.com wrote: >>>> From: Xiubo Li <xiubli@redhat.com> >>>> >>>> The logs are: >>>> >>>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000040 >>>> IP: [<ffffffffc072b9a9>] tcmu_reset_ring_store+0x149/0x240 [target_core_user] >>>> PGD 800000000e254067 PUD e255067 PMD 0 >>>> Oops: 0002 [#1] SMP >>>> [...] >>>> CPU: 0 PID: 36077 Comm: tcmu-runner Kdump: loaded Not tainted 3.10.0-924.el7.test.x86_64 #1 >>> It is not clear to me how you hit this. It's not a RHEL only bug is it? >>> Are you sure you are hitting it during device removal? >> Yes, not only the RHEL bug, for the RHEL we have sync the related code >> to the upstream. Before as I remembered I have hit one sudden crash by >> using the SCSI repo code when testing the tcmu-runner PR#402: >> tcmu-runner: add reset netlink support. >> >> The SCSI repo is : >> git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git >> >> Since the kdump service was not working for the upstream code in my >> setups that time, so could only get very simple logs. This time I was >> using Prasanna's script and RHEL code with the netlink and UIO crash >> fixing patches to reproduce this. The test script please see the attachment. >> >>> If we are calling tcmu_reset_ring_store isn't there a refcount on the >>> item? So we cannot call tcmu_destroy_device -> tcmu_dev_kref_release >>> until after that has been released, so they would never run at the same >>> time. And, if userspace were to try to open/write to that reset file >>> after the rmdir has been done on the se_device then configfs detects >>> this and fails the open. >> With the tcmu-runner PR#402, when starting tcmu-runner the daemon will do : >> 1, block netlink >> 2, reset netlink events >> 3, block device --> reset ring --> open device --> unblock device > If the device was only created and not yet enabled you will hit the > crash here. > > If in your test you kill runner while a enable is in process then this > would happen. As you mentioned it is possible only created then runner is killed. >> 4, unblock netlink >> >> >> Currently there may has one bug in systemd service, because we hit one >> odd issue by using the test script: >> When executing 'systemctl restart tcmu-runner', this process seems >> waiting one signal from somewhere by stucking itself, if we do nothing >> It will wait for about 6 ~20 minutes(the time comes from our test >> result) to recovery from that. And we can execute 'systemctl restart >> gluster-blockd' in another terminal, which could stop the stuck state >> above immediately. >> >> [root@localhost ~]# ps -aux | grep -e tcmu -e >> target >> root 2021 0.0 0.0 0 0 ? S< 15:09 0:00 >> [target_completi] >> root 2386 0.6 0.5 226232 11096 ? Ds 15:11 0:00 >> /usr/bin/python /usr/bin/targetctl clear >> root 2400 0.0 0.0 134816 1316 pts/0 S+ 15:11 0:00 >> systemctl restart tcmu-runner root 2415 >> 0.0 0.0 112704 988 pts/1 R+ 15:11 0:00 grep --color=auto -e >> tcmu -e target >> [root@localhost ~]# cat >> /proc/2400/stack >> >> [<ffffffff9d0548f5>] >> poll_schedule_timeout+0x55/0xb0 >> >> [<ffffffff9d05607d>] >> do_sys_poll+0x48d/0x590 >> >> [<ffffffff9d0564f3>] >> SyS_ppoll+0x1d3/0x1f0 >> >> [<ffffffff9d56cdef>] >> system_call_fastpath+0x1c/0x21 >> >> >> And with IOs going on, when killing the tcmu-runner process, the > Are you just doing a test to kill runner at various times or is your > restart of your service doing this while the targetcli clear operation > is running? There are 3 related services, and here '-->' means will Bindto and depend to gluster-blockd.service --> gluster-block-target.service --> tcmu-runner service I was running the script as I have attached in last mail and in another terminal running the 'systemctl restart gluster-blockd' cmd when this crash is hit. Because as I mentioned without the 'systemctl restart gluster-blockd' in another terminal, the script will be stuck in 'systemctl restart tcmu-runner' for a very long time. When resarting the gluster-blockd, it will start the tcmu-runner and gluster-block-target first. And it is possible that the targetcli clear operation is stuck somewhere then the runner is restarted. > If you re killing it at various times, how do you know you are not > killing it during a enable operation and hitting this bug there? It is very possible. > If you are killing on purpose as part of your restart operation why are > you killing it when the targetcli clear operation needs the backends to > complete their IO as part of the tpg disablement process. Just after runner service is killed then the gluster-block-target service will be killed too, and it will invoke the 'targetctl clear' when stopping. We are testing the netlink reset and the ring reset feature, so the script is killing the runner daemon to emulate the D state cases. And the script also emulating the crash case of the tcmu-runner daemon for some reasons. > >> 'targetctl clear' process will stuck in [1] first. Then we restart the >> tcmu-runner, which will reset ring with accessing the udev->mb_addr, >> currently we still could access the >> /sys/kernel/config/target/core/user_XX/*. And while the ring reset is >> still going on, the stucked 'targetctl clear' process will be woken up >> and then tries to delete the device which will kfree(udev->mb_addr), >> here any accessing to /sys/kernel/config/target/core/user_XX/* will stuck. > The rtslib clear operation deletes the target first and that disables > the tpg which waits for outstanding commands. It then finishes tearing > down the objects under the target, and then the backend devices are deleted. > > If targetcli clear is unhung from the tpg disable operation then > tcmu-runner has restarted and has started to clean up the kernel state. > So runner has reset the ring and failed running IO right (this is how > the tpg disable operation will see all outstanding commands completed > and procceed)? We then proceed and can do a rmdir on the tcmu device. So > I am saying the only way tpg disablement has completed when there is > outstanding IO is if reset ring has completed first. > > Are you certain you have not killed a runner in the middle of a > enable/configure event at some point. No, I am not sure, because the reason I mentioned above, it is very possible the tcmu-runner is killed in the middle of a enable/configure event. > Then runner has restarted and > killed that nl operation leaving the uncongigured device. It then runs > reset ring which fails outstanding commands and accessed the unsetup > mb_addr and hits the crash here. When the outstanding commands have > completed the rtslib operation is also unhung from the tpg disable > completes, and we proceed to the backend device deletion. > >> There are two stuck case: >> [1], If there has any uncompleted IO or cmd, 'targetctl clear' process >> will stuck in iscsit_free_session() --> >> iscsit_cause_connection_reinstatement() first, then after all the IOs or >> cmds completed, it will stuck in [2]. >> >> [2], If there has no uncompleted IOs or cmds, 'targetctl clear' process >> will stuck in tcmu_destroy_device() --> tcmu_netlink_event() --> >> wait_for_completion(). >> >> >> >>> I think we can hit your bug during initialization. We do not setup the >>> mb_addr until the device is configured, but the configfs files are >>> exported to userspace at device creation time. So between those times, >>> userspace could be writing to the reset file and hitting this bug. Is >>> that maybe what you hit? >>> >>> If that is the bug, we need to grab the cmdr_lock in >>> tcmu_configure_device when we are setting up the mb_addr and in the >>> failure path in that function. In tcmu_reset_ring then I think we also >>> need a check for a NULL mb_addr. >>> >> I checked all the related code and find that all the mb_addr accessing >> are protect by the cmdr_lock except in tcmu_dev_kref_release() after the >> device initialized. >> >> Today I have reproduced this again. >> > If you can reproduce it can you debug it a little more to figure out > what is going on exactly? If you cannot get a kdump, then at least stick > in a printk in tcmu_dev_kref_release and tcmu_reset_ring to confirm that > the release is run before the reset. Also stick in a printk to the > tcmu_configure_device function to verify it is always succeeding. > > The problem is that if you can hit the bug you describe where configfs > is accessing the device after the device has had a rmdir done on it then > a lot of the configfs code will have a possible bug. For the se_device > tcmu_dev_kref_release is only run when the se_device's configfs item's > refcount has gone to zero (target_core_dev_release -> target_free_device > -> target_free_device -> tcmu_free_device), so the se_device/tcmu_device > could also be freed at any time while configfs is operation is running > if you are right and your patch is not going to help that. Sure, I will debug this with some printk messages. But not sure could reproduce it in time, it may take days or weeks. Because the first time to hit this is many weeks ago and this week twice. And if you are agree with current fixing, let's push this first and I will post the V2 with some other fixing. Then I will continue to reproduce this by adding some printk messages to make sure there has no problem with the configfs code, if it's really has bugs, that means the three times crash may have different root causes. Then let's fix it ? Thanks Mike. BRs > >> Thanks, >> BRs >> Xiubo >> >>>> diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c >>>> index 847707a..8d7274e 100644 >>>> --- a/drivers/target/target_core_user.c >>>> +++ b/drivers/target/target_core_user.c >>>> @@ -1587,16 +1587,16 @@ static void tcmu_dev_kref_release(struct kref *kref) >>>> bool all_expired = true; >>>> int i; >>>> >>>> - vfree(udev->mb_addr); >>>> - udev->mb_addr = NULL; >>>> - >>>> spin_lock_bh(&timed_out_udevs_lock); >>>> if (!list_empty(&udev->timedout_entry)) >>>> list_del(&udev->timedout_entry); >>>> spin_unlock_bh(&timed_out_udevs_lock); >>>> >>>> - /* Upper layer should drain all requests before calling this */ >>>> mutex_lock(&udev->cmdr_lock); >>>> + vfree(udev->mb_addr); >>>> + udev->mb_addr = NULL; >>>> + >>>> + /* Upper layer should drain all requests before calling this */ >>>> idr_for_each_entry(&udev->commands, cmd, i) { >>>> if (tcmu_check_and_free_pending_cmd(cmd) != 0) >>>> all_expired = false; >>>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe target-devel" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>
diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c index 847707a..8d7274e 100644 --- a/drivers/target/target_core_user.c +++ b/drivers/target/target_core_user.c @@ -1587,16 +1587,16 @@ static void tcmu_dev_kref_release(struct kref *kref) bool all_expired = true; int i; - vfree(udev->mb_addr); - udev->mb_addr = NULL; - spin_lock_bh(&timed_out_udevs_lock); if (!list_empty(&udev->timedout_entry)) list_del(&udev->timedout_entry); spin_unlock_bh(&timed_out_udevs_lock); - /* Upper layer should drain all requests before calling this */ mutex_lock(&udev->cmdr_lock); + vfree(udev->mb_addr); + udev->mb_addr = NULL; + + /* Upper layer should drain all requests before calling this */ idr_for_each_entry(&udev->commands, cmd, i) { if (tcmu_check_and_free_pending_cmd(cmd) != 0) all_expired = false;