tcmu: fix crash for dereferencing the released udev->mb_addr memory
diff mbox

Message ID 1532010659-4657-1-git-send-email-xiubli@redhat.com
State New, archived
Headers show

Commit Message

Xiubo Li July 19, 2018, 2:30 p.m. UTC
From: Xiubo Li <xiubli@redhat.com>

The logs are:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
IP: [<ffffffffc072b9a9>] tcmu_reset_ring_store+0x149/0x240 [target_core_user]
PGD 800000000e254067 PUD e255067 PMD 0
Oops: 0002 [#1] SMP
[...]
CPU: 0 PID: 36077 Comm: tcmu-runner Kdump: loaded Not tainted 3.10.0-924.el7.test.x86_64 #1
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
task: ffff922db95ab0c0 ti: ffff922d9f8d4000 task.ti: ffff922d9f8d4000
RIP: 0010:[<ffffffffc072b9a9>]  [<ffffffffc072b9a9>] tcmu_reset_ring_store+0x149/0x240 [target_core_user]
RSP: 0018:ffff922d9f8d7e30  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000001000 RCX: 00000000c0000100
RDX: ffff922d9f8d5fd8 RSI: 0000000000000000 RDI: ffff922d4b91f440
RBP: ffff922d9f8d7e70 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: ffff922d4b91e550
R13: ffff922d4b91f3e8 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f70467d7880(0000) GS:ffff922dbb600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000040 CR3: 000000000a2b0000 CR4: 00000000003607f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffffac0cafaf>] configfs_write_file+0x11f/0x160

From the crash tools:

crash> bt
PID: 36077  TASK: ffff922db95ab0c0  CPU: 0   COMMAND: "tcmu-runner"
 #0 [ffff922d9f8d7ac8] machine_kexec at ffffffffabe62d8a
 #1 [ffff922d9f8d7b28] __crash_kexec at ffffffffabf1bb02
 #2 [ffff922d9f8d7bf8] crash_kexec at ffffffffabf1bbf0
 #3 [ffff922d9f8d7c10] oops_end at ffffffffac564798
 #4 [ffff922d9f8d7c38] no_context at ffffffffac552b3b
 #5 [ffff922d9f8d7c88] __bad_area_nosemaphore at ffffffffac552bd2
 #6 [ffff922d9f8d7cd8] bad_area_nosemaphore at ffffffffac552d43
 #7 [ffff922d9f8d7ce8] __do_page_fault at ffffffffac567750
 #8 [ffff922d9f8d7d50] do_page_fault at ffffffffac567945
 #9 [ffff922d9f8d7d80] page_fault at ffffffffac563788
    [exception RIP: tcmu_reset_ring_store+329]
    RIP: ffffffffc072b9a9  RSP: ffff922d9f8d7e30  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: 0000000000001000  RCX: 00000000c0000100
    RDX: ffff922d9f8d5fd8  RSI: 0000000000000000  RDI: ffff922d4b91f440
    RBP: ffff922d9f8d7e70   R8: 0000000000000000   R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff922d4b91e550
    R13: ffff922d4b91f3e8  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    RIP: 00007f70458a074d  RSP: 00007ffcf788dc50  RFLAGS: 00000293
    RAX: 0000000000000001  RBX: 0000000000000000  RCX: ffffffffffffffff
    RDX: 0000000000000002  RSI: 00007ffcf788dce0  RDI: 0000000000000007
    RBP: 00007ffcf788dcc0   R8: 0000000000000000   R9: 00007f7044cd10fd
    R10: 00007ffcf788e720  R11: 0000000000000293  R12: 0000000000407c80
    R13: 00007ffcf788f170  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

We can see that the IP is tcmu_reset_ring_store+329.

crash> dis tcmu_reset_ring_store
[...]
0xffffffffc072b955 <tcmu_reset_ring_store+245>: callq  0xffffffffac019810 <kmem_cache_free>
0xffffffffc072b95a <tcmu_reset_ring_store+250>: jmpq   0xffffffffc072b8d8 <tcmu_reset_ring_store+120>
0xffffffffc072b95f <tcmu_reset_ring_store+255>: nop
[...]
0xffffffffc072b98e <tcmu_reset_ring_store+302>: jne    0xffffffffc072b988 <tcmu_reset_ring_store+296>
0xffffffffc072b990 <tcmu_reset_ring_store+304>: testb  $0x4,0x3efb(%rip)        # 0xffffffffc072f892
0xffffffffc072b997 <tcmu_reset_ring_store+311>: jne    0xffffffffc072ba6e <tcmu_reset_ring_store+526>
0xffffffffc072b99d <tcmu_reset_ring_store+317>: movl   $0x0,0xe74(%r12)
0xffffffffc072b9a9 <tcmu_reset_ring_store+329>: movl   $0x0,0x40(%r14)
0xffffffffc072b9b1 <tcmu_reset_ring_store+337>: movl   $0x0,0xc(%r14)
0xffffffffc072b9b9 <tcmu_reset_ring_store+345>: nopl   0x0(%rax)
0xffffffffc072b9c0 <tcmu_reset_ring_store+352>: sub    $0x1000,%rbx
0xffffffffc072b9c7 <tcmu_reset_ring_store+359>: jne    0xffffffffc072b9c0 <tcmu_reset_ring_store+352>
0xffffffffc072b9c9 <tcmu_reset_ring_store+361>: lea    0xf18(%r12),%rdi
0xffffffffc072b9d1 <tcmu_reset_ring_store+369>: callq  0xffffffffabea8b00 <del_timer>
[...]

And the related target_core_user.c C code for line:
0xffffffffc072b9a9 <tcmu_reset_ring_store+329>: movl   $0x0,0x40(%r14)
is "mb->cmd_tail = 0;"

Signed-off-by: Xiubo Li <xiubli@redhat.com>
---
 drivers/target/target_core_user.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Comments

Mike Christie July 19, 2018, 3:49 p.m. UTC | #1
On 07/19/2018 09:30 AM, xiubli@redhat.com wrote:
> From: Xiubo Li <xiubli@redhat.com>
> 
> The logs are:
> 
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
> IP: [<ffffffffc072b9a9>] tcmu_reset_ring_store+0x149/0x240 [target_core_user]
> PGD 800000000e254067 PUD e255067 PMD 0
> Oops: 0002 [#1] SMP
> [...]
> CPU: 0 PID: 36077 Comm: tcmu-runner Kdump: loaded Not tainted 3.10.0-924.el7.test.x86_64 #1


It is not clear to me how you hit this. It's not a RHEL only bug is it?
Are you sure you are hitting it during device removal?

If we are calling tcmu_reset_ring_store isn't there a refcount on the
item? So we cannot call tcmu_destroy_device -> tcmu_dev_kref_release
until after that has been released, so they would never run at the same
time. And, if userspace were to try to open/write to that reset file
after the rmdir has been done on the se_device then configfs detects
this and fails the open.

I think we can hit your bug during initialization. We do not setup the
mb_addr until the device is configured, but the configfs files are
exported to userspace at device creation time. So between those times,
userspace could be writing to the reset file and hitting this bug. Is
that maybe what you hit?

If that is the bug, we need to grab the cmdr_lock in
tcmu_configure_device when we are setting up the mb_addr and in the
failure path in that function. In tcmu_reset_ring then I think we also
need a check for a NULL mb_addr.



> diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c
> index 847707a..8d7274e 100644
> --- a/drivers/target/target_core_user.c
> +++ b/drivers/target/target_core_user.c
> @@ -1587,16 +1587,16 @@ static void tcmu_dev_kref_release(struct kref *kref)
>  	bool all_expired = true;
>  	int i;
>  
> -	vfree(udev->mb_addr);
> -	udev->mb_addr = NULL;
> -
>  	spin_lock_bh(&timed_out_udevs_lock);
>  	if (!list_empty(&udev->timedout_entry))
>  		list_del(&udev->timedout_entry);
>  	spin_unlock_bh(&timed_out_udevs_lock);
>  
> -	/* Upper layer should drain all requests before calling this */
>  	mutex_lock(&udev->cmdr_lock);
> +	vfree(udev->mb_addr);
> +	udev->mb_addr = NULL;
> +
> +	/* Upper layer should drain all requests before calling this */
>  	idr_for_each_entry(&udev->commands, cmd, i) {
>  		if (tcmu_check_and_free_pending_cmd(cmd) != 0)
>  			all_expired = false;
> 

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mike Christie July 20, 2018, 3:13 p.m. UTC | #2
On 07/19/2018 07:34 PM, Xiubo Li wrote:
> On 2018/7/19 23:49, Mike Christie wrote:
>> On 07/19/2018 09:30 AM, xiubli@redhat.com wrote:
>>> From: Xiubo Li <xiubli@redhat.com>
>>>
>>> The logs are:
>>>
>>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
>>> IP: [<ffffffffc072b9a9>] tcmu_reset_ring_store+0x149/0x240 [target_core_user]
>>> PGD 800000000e254067 PUD e255067 PMD 0
>>> Oops: 0002 [#1] SMP
>>> [...]
>>> CPU: 0 PID: 36077 Comm: tcmu-runner Kdump: loaded Not tainted 3.10.0-924.el7.test.x86_64 #1
>>
>> It is not clear to me how you hit this. It's not a RHEL only bug is it?
>> Are you sure you are hitting it during device removal?
> Yes, not only the RHEL bug, for the RHEL we have sync the related code
> to the upstream. Before as I remembered I have hit one sudden crash by
> using the SCSI repo code when testing the tcmu-runner PR#402:
> tcmu-runner: add reset netlink support.
> 
> The SCSI repo is :
> git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git
> 
> Since the kdump service was not working for the upstream code in my
> setups that time, so could only get very simple logs. This time I was
> using Prasanna's script and RHEL code with the netlink and UIO crash
> fixing patches to reproduce this. The test script please see the attachment.
> 
>> If we are calling tcmu_reset_ring_store isn't there a refcount on the
>> item? So we cannot call tcmu_destroy_device -> tcmu_dev_kref_release
>> until after that has been released, so they would never run at the same
>> time. And, if userspace were to try to open/write to that reset file
>> after the rmdir has been done on the se_device then configfs detects
>> this and fails the open.
> With the tcmu-runner PR#402, when starting tcmu-runner the daemon will do :
> 1, block netlink
> 2, reset netlink events
> 3, block device --> reset ring --> open device --> unblock device

If the device was only created and not yet enabled you will hit the
crash here.

If in your test you kill runner while a enable is in process then this
would happen.

> 4, unblock netlink
> 
> 
> Currently there may has one bug in systemd service, because we hit one
> odd issue by using the test script:
> When executing 'systemctl restart tcmu-runner', this process seems
> waiting one signal from somewhere by stucking itself, if we do nothing
> It will wait for about 6 ~20 minutes(the time comes from our test
> result) to recovery from that. And we can execute 'systemctl restart
> gluster-blockd' in another terminal, which could stop the stuck state
> above immediately.
> 
> [root@localhost ~]# ps -aux | grep -e tcmu -e
> target                                                                 
> root      2021  0.0  0.0      0     0 ?        S<   15:09   0:00
> [target_completi]                                   
> root      2386  0.6  0.5 226232 11096 ?        Ds   15:11   0:00
> /usr/bin/python /usr/bin/targetctl clear            
> root      2400  0.0  0.0 134816  1316 pts/0    S+   15:11   0:00
> systemctl restart tcmu-runner                        root      2415 
> 0.0  0.0 112704   988 pts/1    R+   15:11   0:00 grep --color=auto -e
> tcmu -e target
> [root@localhost ~]# cat
> /proc/2400/stack                                                                             
> 
> [<ffffffff9d0548f5>]
> poll_schedule_timeout+0x55/0xb0                                                                 
> 
> [<ffffffff9d05607d>]
> do_sys_poll+0x48d/0x590                                                                         
> 
> [<ffffffff9d0564f3>]
> SyS_ppoll+0x1d3/0x1f0                                                                           
> 
> [<ffffffff9d56cdef>]
> system_call_fastpath+0x1c/0x21                                                                  
> 
> 
> And with IOs going on, when killing the tcmu-runner process, the

Are you just doing a test to kill runner at various times or is your
restart of your service doing this while the targetcli clear operation
is running?

If you re killing it at various times, how do you know you are not
killing it during a enable operation and hitting this bug there?

If you are killing on purpose as part of your restart operation why are
you killing it when the targetcli clear operation needs the backends to
complete their IO as part of the tpg disablement process.


> 'targetctl clear' process will stuck in [1] first. Then we restart the
> tcmu-runner, which will reset ring with accessing the udev->mb_addr,
> currently we still could access the 
> /sys/kernel/config/target/core/user_XX/*. And while the ring reset is
> still going on, the stucked 'targetctl clear' process will be woken up
> and then tries to delete the device which will kfree(udev->mb_addr),
> here any accessing to  /sys/kernel/config/target/core/user_XX/* will stuck.

The rtslib clear operation deletes the target first and that disables
the tpg which waits for outstanding commands. It then finishes tearing
down the objects under the target, and then the backend devices are deleted.

If targetcli clear is unhung from the tpg disable operation then
tcmu-runner has restarted and has started to clean up the kernel state.
So runner has reset the ring and failed running IO right (this is how
the tpg disable operation will see all outstanding commands completed
and procceed)? We then proceed and can do a rmdir on the tcmu device. So
I am saying the only way tpg disablement has completed when there is
outstanding IO is if reset ring has completed first.

Are you certain you have not killed a runner in the middle of a
enable/configure event at some point. Then runner has restarted and
killed that nl operation leaving the uncongigured device. It then runs
reset ring which fails outstanding commands and accessed the unsetup
mb_addr and hits the crash here. When the outstanding commands have
completed the rtslib operation is also unhung from the tpg disable
completes, and we proceed to the backend device deletion.

> 
> There are two stuck case:
> [1], If there has any uncompleted IO or cmd, 'targetctl clear' process
> will stuck in iscsit_free_session() -->
> iscsit_cause_connection_reinstatement() first, then after all the IOs or
> cmds completed, it will stuck in [2].
> 
> [2], If there has no uncompleted IOs or cmds, 'targetctl clear' process
> will stuck in tcmu_destroy_device() --> tcmu_netlink_event() -->
> wait_for_completion().
> 
>  
> 
>> I think we can hit your bug during initialization. We do not setup the
>> mb_addr until the device is configured, but the configfs files are
>> exported to userspace at device creation time. So between those times,
>> userspace could be writing to the reset file and hitting this bug. Is
>> that maybe what you hit?
>>
>> If that is the bug, we need to grab the cmdr_lock in
>> tcmu_configure_device when we are setting up the mb_addr and in the
>> failure path in that function. In tcmu_reset_ring then I think we also
>> need a check for a NULL mb_addr.
>>
> I checked all the related code and find that all the mb_addr accessing
> are protect by the cmdr_lock except in tcmu_dev_kref_release() after the
> device initialized.
> 
> Today I have reproduced this again.
> 

If you can reproduce it can you debug it a little more to figure out
what is going on exactly? If you cannot get a kdump, then at least stick
in a printk in tcmu_dev_kref_release and tcmu_reset_ring to confirm that
the release is run before the reset. Also stick in a printk to the
tcmu_configure_device function to verify it is always succeeding.

The problem is that if you can hit the bug you describe where configfs
is accessing the device after the device has had a rmdir done on it then
a lot of the configfs code will have a possible bug. For the se_device
tcmu_dev_kref_release is only run when the se_device's configfs item's
refcount has gone to zero (target_core_dev_release -> target_free_device
-> target_free_device -> tcmu_free_device), so the se_device/tcmu_device
could also be freed at any time while configfs is operation is running
if you are right and your patch is not going to help that.


> Thanks,
> BRs
> Xiubo
> 
>>
>>> diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c
>>> index 847707a..8d7274e 100644
>>> --- a/drivers/target/target_core_user.c
>>> +++ b/drivers/target/target_core_user.c
>>> @@ -1587,16 +1587,16 @@ static void tcmu_dev_kref_release(struct kref *kref)
>>>  	bool all_expired = true;
>>>  	int i;
>>>  
>>> -	vfree(udev->mb_addr);
>>> -	udev->mb_addr = NULL;
>>> -
>>>  	spin_lock_bh(&timed_out_udevs_lock);
>>>  	if (!list_empty(&udev->timedout_entry))
>>>  		list_del(&udev->timedout_entry);
>>>  	spin_unlock_bh(&timed_out_udevs_lock);
>>>  
>>> -	/* Upper layer should drain all requests before calling this */
>>>  	mutex_lock(&udev->cmdr_lock);
>>> +	vfree(udev->mb_addr);
>>> +	udev->mb_addr = NULL;
>>> +
>>> +	/* Upper layer should drain all requests before calling this */
>>>  	idr_for_each_entry(&udev->commands, cmd, i) {
>>>  		if (tcmu_check_and_free_pending_cmd(cmd) != 0)
>>>  			all_expired = false;
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe target-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Xiubo Li July 21, 2018, 2:21 a.m. UTC | #3
On 2018/7/20 23:13, Mike Christie wrote:
> On 07/19/2018 07:34 PM, Xiubo Li wrote:
>> On 2018/7/19 23:49, Mike Christie wrote:
>>> On 07/19/2018 09:30 AM, xiubli@redhat.com wrote:
>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>
>>>> The logs are:
>>>>
>>>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
>>>> IP: [<ffffffffc072b9a9>] tcmu_reset_ring_store+0x149/0x240 [target_core_user]
>>>> PGD 800000000e254067 PUD e255067 PMD 0
>>>> Oops: 0002 [#1] SMP
>>>> [...]
>>>> CPU: 0 PID: 36077 Comm: tcmu-runner Kdump: loaded Not tainted 3.10.0-924.el7.test.x86_64 #1
>>> It is not clear to me how you hit this. It's not a RHEL only bug is it?
>>> Are you sure you are hitting it during device removal?
>> Yes, not only the RHEL bug, for the RHEL we have sync the related code
>> to the upstream. Before as I remembered I have hit one sudden crash by
>> using the SCSI repo code when testing the tcmu-runner PR#402:
>> tcmu-runner: add reset netlink support.
>>
>> The SCSI repo is :
>> git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git
>>
>> Since the kdump service was not working for the upstream code in my
>> setups that time, so could only get very simple logs. This time I was
>> using Prasanna's script and RHEL code with the netlink and UIO crash
>> fixing patches to reproduce this. The test script please see the attachment.
>>
>>> If we are calling tcmu_reset_ring_store isn't there a refcount on the
>>> item? So we cannot call tcmu_destroy_device -> tcmu_dev_kref_release
>>> until after that has been released, so they would never run at the same
>>> time. And, if userspace were to try to open/write to that reset file
>>> after the rmdir has been done on the se_device then configfs detects
>>> this and fails the open.
>> With the tcmu-runner PR#402, when starting tcmu-runner the daemon will do :
>> 1, block netlink
>> 2, reset netlink events
>> 3, block device --> reset ring --> open device --> unblock device
> If the device was only created and not yet enabled you will hit the
> crash here.
>
> If in your test you kill runner while a enable is in process then this
> would happen.

As you mentioned it is possible only created then runner is killed.

>> 4, unblock netlink
>>
>>
>> Currently there may has one bug in systemd service, because we hit one
>> odd issue by using the test script:
>> When executing 'systemctl restart tcmu-runner', this process seems
>> waiting one signal from somewhere by stucking itself, if we do nothing
>> It will wait for about 6 ~20 minutes(the time comes from our test
>> result) to recovery from that. And we can execute 'systemctl restart
>> gluster-blockd' in another terminal, which could stop the stuck state
>> above immediately.
>>
>> [root@localhost ~]# ps -aux | grep -e tcmu -e
>> target
>> root      2021  0.0  0.0      0     0 ?        S<   15:09   0:00
>> [target_completi]
>> root      2386  0.6  0.5 226232 11096 ?        Ds   15:11   0:00
>> /usr/bin/python /usr/bin/targetctl clear
>> root      2400  0.0  0.0 134816  1316 pts/0    S+   15:11   0:00
>> systemctl restart tcmu-runner                        root      2415
>> 0.0  0.0 112704   988 pts/1    R+   15:11   0:00 grep --color=auto -e
>> tcmu -e target
>> [root@localhost ~]# cat
>> /proc/2400/stack
>>
>> [<ffffffff9d0548f5>]
>> poll_schedule_timeout+0x55/0xb0
>>
>> [<ffffffff9d05607d>]
>> do_sys_poll+0x48d/0x590
>>
>> [<ffffffff9d0564f3>]
>> SyS_ppoll+0x1d3/0x1f0
>>
>> [<ffffffff9d56cdef>]
>> system_call_fastpath+0x1c/0x21
>>
>>
>> And with IOs going on, when killing the tcmu-runner process, the
> Are you just doing a test to kill runner at various times or is your
> restart of your service doing this while the targetcli clear operation
> is running?
There are 3 related services, and here '-->' means will Bindto and depend to
gluster-blockd.service --> gluster-block-target.service --> tcmu-runner 
service

I was running the script as I have attached in last mail and in another 
terminal running the 'systemctl restart gluster-blockd' cmd when this 
crash is hit. Because as I mentioned without the 'systemctl restart 
gluster-blockd' in another terminal, the script will be stuck in 
'systemctl restart tcmu-runner' for a very long time.

When resarting the gluster-blockd, it will start the tcmu-runner and 
gluster-block-target first.

And it is possible that the targetcli clear operation is stuck somewhere 
then the runner is restarted.

> If you re killing it at various times, how do you know you are not
> killing it during a enable operation and hitting this bug there?
It is very possible.


> If you are killing on purpose as part of your restart operation why are
> you killing it when the targetcli clear operation needs the backends to
> complete their IO as part of the tpg disablement process.

Just after runner service is killed then the gluster-block-target 
service will be killed too, and it will invoke the 'targetctl clear' 
when stopping.

We are testing the netlink reset and the ring reset feature, so the 
script is killing the runner daemon to emulate the D state cases. And 
the script also emulating the crash case of the tcmu-runner daemon for 
some reasons.

>
>> 'targetctl clear' process will stuck in [1] first. Then we restart the
>> tcmu-runner, which will reset ring with accessing the udev->mb_addr,
>> currently we still could access the
>> /sys/kernel/config/target/core/user_XX/*. And while the ring reset is
>> still going on, the stucked 'targetctl clear' process will be woken up
>> and then tries to delete the device which will kfree(udev->mb_addr),
>> here any accessing to  /sys/kernel/config/target/core/user_XX/* will stuck.
> The rtslib clear operation deletes the target first and that disables
> the tpg which waits for outstanding commands. It then finishes tearing
> down the objects under the target, and then the backend devices are deleted.
>
> If targetcli clear is unhung from the tpg disable operation then
> tcmu-runner has restarted and has started to clean up the kernel state.
> So runner has reset the ring and failed running IO right (this is how
> the tpg disable operation will see all outstanding commands completed
> and procceed)? We then proceed and can do a rmdir on the tcmu device. So
> I am saying the only way tpg disablement has completed when there is
> outstanding IO is if reset ring has completed first.
>
> Are you certain you have not killed a runner in the middle of a
> enable/configure event at some point.
No, I am not sure, because the reason I mentioned above, it is very 
possible the tcmu-runner is killed in the middle of a enable/configure 
event.

> Then runner has restarted and
> killed that nl operation leaving the uncongigured device. It then runs
> reset ring which fails outstanding commands and accessed the unsetup
> mb_addr and hits the crash here. When the outstanding commands have
> completed the rtslib operation is also unhung from the tpg disable
> completes, and we proceed to the backend device deletion.
>
>> There are two stuck case:
>> [1], If there has any uncompleted IO or cmd, 'targetctl clear' process
>> will stuck in iscsit_free_session() -->
>> iscsit_cause_connection_reinstatement() first, then after all the IOs or
>> cmds completed, it will stuck in [2].
>>
>> [2], If there has no uncompleted IOs or cmds, 'targetctl clear' process
>> will stuck in tcmu_destroy_device() --> tcmu_netlink_event() -->
>> wait_for_completion().
>>
>>   
>>
>>> I think we can hit your bug during initialization. We do not setup the
>>> mb_addr until the device is configured, but the configfs files are
>>> exported to userspace at device creation time. So between those times,
>>> userspace could be writing to the reset file and hitting this bug. Is
>>> that maybe what you hit?
>>>
>>> If that is the bug, we need to grab the cmdr_lock in
>>> tcmu_configure_device when we are setting up the mb_addr and in the
>>> failure path in that function. In tcmu_reset_ring then I think we also
>>> need a check for a NULL mb_addr.
>>>
>> I checked all the related code and find that all the mb_addr accessing
>> are protect by the cmdr_lock except in tcmu_dev_kref_release() after the
>> device initialized.
>>
>> Today I have reproduced this again.
>>
> If you can reproduce it can you debug it a little more to figure out
> what is going on exactly? If you cannot get a kdump, then at least stick
> in a printk in tcmu_dev_kref_release and tcmu_reset_ring to confirm that
> the release is run before the reset. Also stick in a printk to the
> tcmu_configure_device function to verify it is always succeeding.
>
> The problem is that if you can hit the bug you describe where configfs
> is accessing the device after the device has had a rmdir done on it then
> a lot of the configfs code will have a possible bug. For the se_device
> tcmu_dev_kref_release is only run when the se_device's configfs item's
> refcount has gone to zero (target_core_dev_release -> target_free_device
> -> target_free_device -> tcmu_free_device), so the se_device/tcmu_device
> could also be freed at any time while configfs is operation is running
> if you are right and your patch is not going to help that.
Sure, I will debug this with some printk messages. But not sure could 
reproduce it in time, it may take days or weeks.

Because the first time to hit this is many weeks ago and this week twice.

And if you are agree with current fixing, let's push this first and I 
will post the V2 with some other fixing.
Then I will continue to reproduce this by adding some printk messages to 
make sure there has no problem with the configfs code, if it's really 
has bugs, that means the three times crash may have different root 
causes. Then let's fix it ?

Thanks Mike.
BRs

>
>> Thanks,
>> BRs
>> Xiubo
>>
>>>> diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c
>>>> index 847707a..8d7274e 100644
>>>> --- a/drivers/target/target_core_user.c
>>>> +++ b/drivers/target/target_core_user.c
>>>> @@ -1587,16 +1587,16 @@ static void tcmu_dev_kref_release(struct kref *kref)
>>>>   	bool all_expired = true;
>>>>   	int i;
>>>>   
>>>> -	vfree(udev->mb_addr);
>>>> -	udev->mb_addr = NULL;
>>>> -
>>>>   	spin_lock_bh(&timed_out_udevs_lock);
>>>>   	if (!list_empty(&udev->timedout_entry))
>>>>   		list_del(&udev->timedout_entry);
>>>>   	spin_unlock_bh(&timed_out_udevs_lock);
>>>>   
>>>> -	/* Upper layer should drain all requests before calling this */
>>>>   	mutex_lock(&udev->cmdr_lock);
>>>> +	vfree(udev->mb_addr);
>>>> +	udev->mb_addr = NULL;
>>>> +
>>>> +	/* Upper layer should drain all requests before calling this */
>>>>   	idr_for_each_entry(&udev->commands, cmd, i) {
>>>>   		if (tcmu_check_and_free_pending_cmd(cmd) != 0)
>>>>   			all_expired = false;
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe target-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c
index 847707a..8d7274e 100644
--- a/drivers/target/target_core_user.c
+++ b/drivers/target/target_core_user.c
@@ -1587,16 +1587,16 @@  static void tcmu_dev_kref_release(struct kref *kref)
 	bool all_expired = true;
 	int i;
 
-	vfree(udev->mb_addr);
-	udev->mb_addr = NULL;
-
 	spin_lock_bh(&timed_out_udevs_lock);
 	if (!list_empty(&udev->timedout_entry))
 		list_del(&udev->timedout_entry);
 	spin_unlock_bh(&timed_out_udevs_lock);
 
-	/* Upper layer should drain all requests before calling this */
 	mutex_lock(&udev->cmdr_lock);
+	vfree(udev->mb_addr);
+	udev->mb_addr = NULL;
+
+	/* Upper layer should drain all requests before calling this */
 	idr_for_each_entry(&udev->commands, cmd, i) {
 		if (tcmu_check_and_free_pending_cmd(cmd) != 0)
 			all_expired = false;