libsas: flush pending destruct work in sas_unregister_domain_devices()

Message ID 20171128002445.16594-1-xiyou.wangcong@gmail.com (mailing list archive)
State Not Applicable

Commit Message

Cong Wang Nov. 28, 2017, 12:24 a.m. UTC
We saw dozens of the following kernel warning:

 WARNING: CPU: 0 PID: 705 at fs/sysfs/group.c:224 sysfs_remove_group+0x54/0x88()
 sysfs group ffffffff81ab7670 not found for kobject '6:0:3:0'
 Modules linked in: cpufreq_ondemand x86_pkg_temp_thermal coretemp kvm_intel kvm microcode raid0 iTCO_wdt iTCO_vendor_support sb_edac edac_core lpc_ich mfd_core ioatdma i2c_i801 shpchp wmi hed acpi_cpufreq lp parport tcp_diag inet_diag ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel igb ptp pps_core i2c_algo_bit i2c_core crc32c_intel isci libsas scsi_transport_sas dca ipv6
 CPU: 0 PID: 705 Comm: kworker/u240:0 Not tainted 4.1.35.el7.x86_64 #1
 Hardware name: WIWYNN Lyra/JD/S2600GZ, BIOS SE5C600.86B.02.03.2004.030620151456 03/06/2015
 Workqueue: scsi_wq_6 sas_destruct_devices [libsas]
  0000000000000000 ffff88056c393ba8 ffffffff81544a6d ffff88056c393bf8
  0000000000000009 ffff88056c393be8 ffffffff81069b4c ffff88081790d078
  ffffffff811dad37 0000000000000000 ffffffff81ab7670 ffff88081b29dc10
 Call Trace:
  [<ffffffff81544a6d>] dump_stack+0x4d/0x63
  [<ffffffff81069b4c>] warn_slowpath_common+0xa1/0xbb
  [<ffffffff811dad37>] ? sysfs_remove_group+0x54/0x88
  [<ffffffff81069bac>] warn_slowpath_fmt+0x46/0x48
  [<ffffffff811d77ad>] ? kernfs_find_and_get_ns+0x4d/0x58
  [<ffffffff811dad37>] sysfs_remove_group+0x54/0x88
  [<ffffffff81387835>] dpm_sysfs_remove+0x50/0x55
  [<ffffffff8137de7c>] device_del+0x47/0x1ec
  [<ffffffff815482f7>] ? mutex_unlock+0x16/0x18
  [<ffffffff8137e069>] device_unregister+0x48/0x54
  [<ffffffff8128eb82>] bsg_unregister_queue+0x5f/0x86
  [<ffffffff813aac83>] __scsi_remove_device+0x3a/0xc3
  [<ffffffff813aad32>] scsi_remove_device+0x26/0x33
  [<ffffffff813aaea2>] scsi_remove_target+0x134/0x19b
  [<ffffffffa0078725>] sas_rphy_remove+0x2c/0x72 [scsi_transport_sas]
  [<ffffffffa007877e>] sas_rphy_delete+0x13/0x1f [scsi_transport_sas]
  [<ffffffffa008817c>] sas_destruct_devices+0x58/0x79 [libsas]
  [<ffffffff8107cca1>] process_one_work+0x19b/0x2d1
  [<ffffffff8107d38e>] worker_thread+0x1dd/0x2bb
  [<ffffffff8107d1b1>] ? cancel_delayed_work+0x72/0x72
  [<ffffffff8108165a>] kthread+0xa5/0xad
  [<ffffffff81080000>] ? task_work_add+0xd/0x53
  [<ffffffff810815b5>] ? __kthread_parkme+0x61/0x61
  [<ffffffff8154a492>] ret_from_fork+0x42/0x70
  [<ffffffff810815b5>] ? __kthread_parkme+0x61/0x61

It looks like we don't wait for the sas destruct work properly
on the teardown path: sas_deform_port(), at least, calls
sas_unregister_domain_devices(), which schedules destruct work
on a workqueue, and then calls sas_port_delete() to remove the
related sysfs files while that work may still be pending.

Dan tried to fix this in a different way:

 https://patchwork.kernel.org/patch/6450921/

but that patch was never applied. I take a better approach,
as suggested by Johannes: wait for the pending destruct work
to remove the child sysfs files, and only then remove the
parent sysfs files.
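
To illustrate the ordering, here is a minimal user-space sketch, not
kernel code. The sim_* names, the fixed-size FIFO and the warnings
counter are all made up for illustration; they only model "destruct
work queued, parent deleted, work runs" and use no libsas or
workqueue API.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORK 8

typedef void (*work_fn)(void *ctx);

struct work_item {
	work_fn fn;
	void *ctx;
};

/* Hypothetical single-threaded FIFO standing in for the host workqueue. */
struct sim_queue {
	struct work_item items[MAX_WORK];
	size_t head, tail;
};

struct sim_port {
	bool parent_sysfs_present;
	int warnings;	/* child teardowns that found the parent already gone */
};

static bool sim_queue_work(struct sim_queue *q, work_fn fn, void *ctx)
{
	if (q->tail == MAX_WORK)
		return false;
	q->items[q->tail].fn = fn;
	q->items[q->tail].ctx = ctx;
	q->tail++;
	return true;
}

/* Run everything queued so far, in order (the "flush"). */
static void sim_flush(struct sim_queue *q)
{
	while (q->head < q->tail) {
		struct work_item *w = &q->items[q->head++];

		w->fn(w->ctx);
	}
}

/* Stand-in for the queued destruct work removing child sysfs files. */
static void sim_destruct(void *ctx)
{
	struct sim_port *port = ctx;

	if (!port->parent_sysfs_present)
		port->warnings++;
}

/* Queue the destruct, optionally flush, delete the parent, then drain. */
static int sim_teardown(bool flush_before_delete)
{
	struct sim_queue q = {0};
	struct sim_port port = { .parent_sysfs_present = true, .warnings = 0 };

	sim_queue_work(&q, sim_destruct, &port);
	if (flush_before_delete)
		sim_flush(&q);
	port.parent_sysfs_present = false;	/* parent removal analogue */
	sim_flush(&q);				/* anything left runs too late */
	return port.warnings;
}
```

In this model sim_teardown(false) counts one late destruct (the WARN
case), while sim_teardown(true) counts none, because the pending work
drains before the parent goes away.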

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Praveen Murali <pmurali@logicube.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 drivers/scsi/libsas/sas_discover.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Johannes Thumshirn Nov. 28, 2017, 8:20 a.m. UTC | #1
On Mon, Nov 27, 2017 at 04:24:45PM -0800, Cong Wang wrote:
> We saw dozens of the following kernel warning:
> 
>  WARNING: CPU: 0 PID: 705 at fs/sysfs/group.c:224 sysfs_remove_group+0x54/0x88()
>  sysfs group ffffffff81ab7670 not found for kobject '6:0:3:0'
>  Modules linked in: cpufreq_ondemand x86_pkg_temp_thermal coretemp kvm_intel kvm microcode raid0 iTCO_wdt iTCO_vendor_support sb_edac edac_core lpc_ich mfd_core ioatdma i2c_i801 shpchp wmi hed acpi_cpufreq lp parport tcp_diag inet_diag ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel igb ptp pps_core i2c_algo_bit i2c_core crc32c_intel isci libsas scsi_transport_sas dca ipv6
>  CPU: 0 PID: 705 Comm: kworker/u240:0 Not tainted 4.1.35.el7.x86_64 #1

This should by now be fixed with commit fbce4d97fd43 ("scsi: fixup kernel
warning during rmmod()"), which went into v4.14-rc6.
John Garry Nov. 28, 2017, 11:18 a.m. UTC | #2
On 28/11/2017 08:20, Johannes Thumshirn wrote:
> On Mon, Nov 27, 2017 at 04:24:45PM -0800, Cong Wang wrote:
>> We saw dozens of the following kernel warning:
>>
>>  WARNING: CPU: 0 PID: 705 at fs/sysfs/group.c:224 sysfs_remove_group+0x54/0x88()
>>  sysfs group ffffffff81ab7670 not found for kobject '6:0:3:0'
>>  Modules linked in: cpufreq_ondemand x86_pkg_temp_thermal coretemp kvm_intel kvm microcode raid0 iTCO_wdt iTCO_vendor_support sb_edac edac_core lpc_ich mfd_core ioatdma i2c_i801 shpchp wmi hed acpi_cpufreq lp parport tcp_diag inet_diag ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel igb ptp pps_core i2c_algo_bit i2c_core crc32c_intel isci libsas scsi_transport_sas dca ipv6
>>  CPU: 0 PID: 705 Comm: kworker/u240:0 Not tainted 4.1.35.el7.x86_64 #1
>
> This should by now be fixed with commit fbce4d97fd43 ("scsi: fixup kernel
> warning during rmmod()"), which went into v4.14-rc6.
>

Is that the same issue? I think Cong Wang is just trying to deal with 
the longstanding libsas hotplug WARN.

We at Huawei are still working to fix it. Our patchset is under internal 
test at the moment.

As for this patch:
 >  drivers/scsi/libsas/sas_discover.c | 7 ++++++-
 >  1 file changed, 6 insertions(+), 1 deletion(-)
 >
 > diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c
 > index 60de66252fa2..27c11fc7aa2b 100644
 > --- a/drivers/scsi/libsas/sas_discover.c
 > +++ b/drivers/scsi/libsas/sas_discover.c
 > @@ -388,6 +388,11 @@ void sas_unregister_dev(struct asd_sas_port *port, struct domain_device *dev)
 >  	}
 >  }
 >
 > +static void sas_flush_work(struct asd_sas_port *port)
 > +{
 > +	scsi_flush_work(port->ha->core.shost);
 > +}
 > +
 >  void sas_unregister_domain_devices(struct asd_sas_port *port, int gone)
 >  {
 >  	struct domain_device *dev, *n;
 > @@ -401,8 +406,8 @@ void sas_unregister_domain_devices(struct asd_sas_port *port, int gone)
 >  	list_for_each_entry_safe(dev, n, &port->disco_list, disco_list_node)
 >  		sas_unregister_dev(port, dev);
 >
 > +	sas_flush_work(port);

How can this work as sas_unregister_domain_devices() may be called from 
the same workqueue which you're trying to flush?

 >  	port->port->rphy = NULL;
 > -
 >  }
 >
 >  void sas_device_set_phy(struct domain_device *dev, struct sas_port *port)
 >

Thanks,
John
Cong Wang Nov. 28, 2017, 5 p.m. UTC | #3
On Tue, Nov 28, 2017 at 12:20 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> On Mon, Nov 27, 2017 at 04:24:45PM -0800, Cong Wang wrote:
>> We saw dozens of the following kernel warning:
>>
>>  WARNING: CPU: 0 PID: 705 at fs/sysfs/group.c:224 sysfs_remove_group+0x54/0x88()
>>  sysfs group ffffffff81ab7670 not found for kobject '6:0:3:0'
>>  Modules linked in: cpufreq_ondemand x86_pkg_temp_thermal coretemp kvm_intel kvm microcode raid0 iTCO_wdt iTCO_vendor_support sb_edac edac_core lpc_ich mfd_core ioatdma i2c_i801 shpchp wmi hed acpi_cpufreq lp parport tcp_diag inet_diag ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel igb ptp pps_core i2c_algo_bit i2c_core crc32c_intel isci libsas scsi_transport_sas dca ipv6
>>  CPU: 0 PID: 705 Comm: kworker/u240:0 Not tainted 4.1.35.el7.x86_64 #1
>
> This should by now be fixed with commit fbce4d97fd43 ("scsi: fixup kernel
> warning during rmmod()"), which went into v4.14-rc6.

I don't see the full backtrace in commit fbce4d97fd43, but our case
is probably not the rmmod path.
Cong Wang Nov. 28, 2017, 5:04 p.m. UTC | #4
On Tue, Nov 28, 2017 at 3:18 AM, John Garry <john.garry@huawei.com> wrote:
> On 28/11/2017 08:20, Johannes Thumshirn wrote:
>>
>> On Mon, Nov 27, 2017 at 04:24:45PM -0800, Cong Wang wrote:
>>>
>>> We saw dozens of the following kernel warning:
>>>
>>>  WARNING: CPU: 0 PID: 705 at fs/sysfs/group.c:224
>>> sysfs_remove_group+0x54/0x88()
>>>  sysfs group ffffffff81ab7670 not found for kobject '6:0:3:0'
>>>  Modules linked in: cpufreq_ondemand x86_pkg_temp_thermal coretemp
>>> kvm_intel kvm microcode raid0 iTCO_wdt iTCO_vendor_support sb_edac edac_core
>>> lpc_ich mfd_core ioatdma i2c_i801 shpchp wmi hed acpi_cpufreq lp parport
>>> tcp_diag inet_diag ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel igb ptp
>>> pps_core i2c_algo_bit i2c_core crc32c_intel isci libsas scsi_transport_sas
>>> dca ipv6
>>>  CPU: 0 PID: 705 Comm: kworker/u240:0 Not tainted 4.1.35.el7.x86_64 #1
>>
>>
>> This should by now be fixed with commit fbce4d97fd43 ("scsi: fixup kernel
>> warning during rmmod()"), which went into v4.14-rc6.
>>
>
> Is that the same issue? I think Cong Wang is just trying to deal with the
> longstanding libsas hotplug WARN.

Right, we saw it on both 4.1 and 3.14, clearly an old bug.


>
> We at Huawei are still working to fix it. Our patchset is under internal
> test at the moment.
>
> As for this patch:
>>  drivers/scsi/libsas/sas_discover.c | 7 ++++++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/scsi/libsas/sas_discover.c
>> b/drivers/scsi/libsas/sas_discover.c
>> index 60de66252fa2..27c11fc7aa2b 100644
>> --- a/drivers/scsi/libsas/sas_discover.c
>> +++ b/drivers/scsi/libsas/sas_discover.c
>> @@ -388,6 +388,11 @@ void sas_unregister_dev(struct asd_sas_port *port,
>> struct domain_device *dev)
>>       }
>>  }
>>
>> +static void sas_flush_work(struct asd_sas_port *port)
>> +{
>> +     scsi_flush_work(port->ha->core.shost);
>> +}
>> +
>>  void sas_unregister_domain_devices(struct asd_sas_port *port, int gone)
>>  {
>>       struct domain_device *dev, *n;
>> @@ -401,8 +406,8 @@ void sas_unregister_domain_devices(struct asd_sas_port
>> *port, int gone)
>>       list_for_each_entry_safe(dev, n, &port->disco_list, disco_list_node)
>>               sas_unregister_dev(port, dev);
>>
>> +     sas_flush_work(port);
>
> How can this work as sas_unregister_domain_devices() may be called from the
> same workqueue which you're trying to flush?


I don't understand, the only caller of sas_unregister_domain_devices()
is sas_deform_port().
John Garry Dec. 7, 2017, 1:37 p.m. UTC | #5
On 28/11/2017 17:04, Cong Wang wrote:
> On Tue, Nov 28, 2017 at 3:18 AM, John Garry <john.garry@huawei.com> wrote:
>> On 28/11/2017 08:20, Johannes Thumshirn wrote:
>>>
>>> On Mon, Nov 27, 2017 at 04:24:45PM -0800, Cong Wang wrote:
>>>>
>>>> We saw dozens of the following kernel warning:
>>>>
>>>>  WARNING: CPU: 0 PID: 705 at fs/sysfs/group.c:224
>>>> sysfs_remove_group+0x54/0x88()
>>>>  sysfs group ffffffff81ab7670 not found for kobject '6:0:3:0'
>>>>  Modules linked in: cpufreq_ondemand x86_pkg_temp_thermal coretemp
>>>> kvm_intel kvm microcode raid0 iTCO_wdt iTCO_vendor_support sb_edac edac_core
>>>> lpc_ich mfd_core ioatdma i2c_i801 shpchp wmi hed acpi_cpufreq lp parport
>>>> tcp_diag inet_diag ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel igb ptp
>>>> pps_core i2c_algo_bit i2c_core crc32c_intel isci libsas scsi_transport_sas
>>>> dca ipv6
>>>>  CPU: 0 PID: 705 Comm: kworker/u240:0 Not tainted 4.1.35.el7.x86_64 #1
>>>
>>>
>>> This should by now be fixed with commit fbce4d97fd43 ("scsi: fixup kernel
>>> warning during rmmod()"), which went into v4.14-rc6.
>>>
>>
>> Is that the same issue? I think Cong Wang is just trying to deal with the
>> longstanding libsas hotplug WARN.
>
> Right, we saw it on both 4.1 and 3.14, clearly an old bug.
>
>
>>
>> We at Huawei are still working to fix it. Our patchset is under internal
>> test at the moment.
>>
>> As for this patch:
>>>  drivers/scsi/libsas/sas_discover.c | 7 ++++++-
>>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/scsi/libsas/sas_discover.c
>>> b/drivers/scsi/libsas/sas_discover.c
>>> index 60de66252fa2..27c11fc7aa2b 100644
>>> --- a/drivers/scsi/libsas/sas_discover.c
>>> +++ b/drivers/scsi/libsas/sas_discover.c
>>> @@ -388,6 +388,11 @@ void sas_unregister_dev(struct asd_sas_port *port,
>>> struct domain_device *dev)
>>>       }
>>>  }
>>>
>>> +static void sas_flush_work(struct asd_sas_port *port)
>>> +{
>>> +     scsi_flush_work(port->ha->core.shost);
>>> +}
>>> +
>>>  void sas_unregister_domain_devices(struct asd_sas_port *port, int gone)
>>>  {
>>>       struct domain_device *dev, *n;
>>> @@ -401,8 +406,8 @@ void sas_unregister_domain_devices(struct asd_sas_port
>>> *port, int gone)
>>>       list_for_each_entry_safe(dev, n, &port->disco_list, disco_list_node)
>>>               sas_unregister_dev(port, dev);
>>>
>>> +     sas_flush_work(port);
>>
>> How can this work as sas_unregister_domain_devices() may be called from the
>> same workqueue which you're trying to flush?
>

Sorry for the slow reply, I just remembered this now.

>
> I don't understand, the only caller of sas_unregister_domain_devices()
> is sas_deform_port().
>

And sas_deform_port() may be called from another worker on the same 
queue, right? As in sas_phye_loss_of_signal()->sas_deform_port()

As I see it today, this is the problematic call chain:
sas_deform_port()
sas_unregister_domain_devices()
sas_unregister_dev()
sas_discover_event(DISCE_DESTRUCT)

The device destruct takes place in a separate worker from the one in 
which sas_deform_port() is called, but on the same queue. So the queued 
destruct happens after the port is fully deformed, hence the WARN.
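
A rough user-space sketch of that ordering, in case it helps. This is 
not kernel code: the sim_* names and the in_worker flag are invented 
for illustration, and the model simply refuses a flush issued from the 
queue's own worker. My assumption is that in the kernel such a flush 
would end up waiting on the very work item that called it.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORK 8

struct sim_host;
typedef void (*work_fn)(struct sim_host *h);

/* Hypothetical model of one host with a single ordered work queue. */
struct sim_host {
	work_fn pending[MAX_WORK];
	size_t head, tail;
	bool in_worker;		/* true while a work item is executing */
	bool port_deformed;
	int late_destructs;	/* destructs that ran after the port was gone */
	int refused_flushes;	/* flushes attempted from the queue's own worker */
};

static void sim_queue_work(struct sim_host *h, work_fn fn)
{
	if (h->tail < MAX_WORK)
		h->pending[h->tail++] = fn;
}

/* Drain the queue one item at a time, as a single worker would. */
static void sim_run(struct sim_host *h)
{
	while (h->head < h->tail) {
		work_fn fn = h->pending[h->head++];

		h->in_worker = true;
		fn(h);
		h->in_worker = false;
	}
}

/* A flush from inside the worker cannot make progress; count and refuse it. */
static bool sim_flush(struct sim_host *h)
{
	if (h->in_worker) {
		h->refused_flushes++;
		return false;
	}
	sim_run(h);
	return true;
}

/* Stand-in for the queued device destruct. */
static void sim_destruct(struct sim_host *h)
{
	if (h->port_deformed)
		h->late_destructs++;
}

/* Stand-in for a loss-of-signal handler that deforms the port. */
static void sim_loss_of_signal(struct sim_host *h)
{
	sim_queue_work(h, sim_destruct);	/* unregister queues the destruct */
	(void)sim_flush(h);			/* the proposed flush, from the worker */
	h->port_deformed = true;		/* port deletion analogue */
}

/* Deliver one hotplug event through the queue and return the final state. */
static struct sim_host sim_hotplug_event(void)
{
	struct sim_host h = {0};

	sim_queue_work(&h, sim_loss_of_signal);
	sim_run(&h);
	return h;
}
```

Here sim_hotplug_event() ends with refused_flushes == 1 and 
late_destructs == 1: the destruct queued by the deform handler can 
only run after that handler has returned.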

I guess you only tested your patch on disks attached through an expander.

Thanks,
John
Cong Wang Dec. 7, 2017, 10:57 p.m. UTC | #6
On Thu, Dec 7, 2017 at 5:37 AM, John Garry <john.garry@huawei.com> wrote:
> On 28/11/2017 17:04, Cong Wang wrote:
>>
>> I don't understand, the only caller of sas_unregister_domain_devices()
>> is sas_deform_port().
>>
>
> And sas_deform_port() may be called from another worker on the same queue,
> right? As in sas_phye_loss_of_signal()->sas_deform_port()

Oh, good catch! I didn't notice this subtle call path.

Do you have any better idea to fix this? We saw this on 4.9 too.

>
> The device destruct takes place in a separate worker from which
> sas_deform_port() is called, but the same queue. So we have this queued
> destruct happen after the port is fully deformed -> hence the WARN.
>
> I guess you only tested your patch on disks attached through an expander.

I have very limited scsi hardware, so my testing is limited too.
Jason Yan Dec. 8, 2017, 7:54 a.m. UTC | #7
On 2017/12/8 6:57, Cong Wang wrote:
> On Thu, Dec 7, 2017 at 5:37 AM, John Garry <john.garry@huawei.com> wrote:
>> On 28/11/2017 17:04, Cong Wang wrote:
>>>
>>> I don't understand, the only caller of sas_unregister_domain_devices()
>>> is sas_deform_port().
>>>
>>
>> And sas_deform_port() may be called from another worker on the same queue,
>> right? As in sas_phye_loss_of_signal()->sas_deform_port()
>
> Oh, good catch! I didn't notice this subtle call path.
>
> Do you have any better idea to fix this? We saw this on 4.9 too.
>

We have sent a patchset to fix this and to enhance libsas hotplug.
Please refer to https://lkml.org/lkml/2017/9/6/142

And I'm going to send a new version soon.

Jason

>>
>> The device destruct takes place in a separate worker from which
>> sas_deform_port() is called, but the same queue. So we have this queued
>> destruct happen after the port is fully deformed -> hence the WARN.
>>
>> I guess you only tested your patch on disks attached through an expander.
>
> I have very limited scsi hardware, so my testing is limited too.
Cong Wang Dec. 9, 2017, 7:51 p.m. UTC | #8
On Thu, Dec 7, 2017 at 11:54 PM, Jason Yan <yanaijie@huawei.com> wrote:
>
> We have sent a patchset to fix this and to enhance libsas hotplug.
> Please refer to https://lkml.org/lkml/2017/9/6/142
>
> And I'm going to send a new version soon.

Thanks for working on it! Please make sure they will be queued
for -stable too, since 3.14, 4.1 and 4.9 are all affected.

Patch

diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c
index 60de66252fa2..27c11fc7aa2b 100644
--- a/drivers/scsi/libsas/sas_discover.c
+++ b/drivers/scsi/libsas/sas_discover.c
@@ -388,6 +388,11 @@  void sas_unregister_dev(struct asd_sas_port *port, struct domain_device *dev)
 	}
 }
 
+static void sas_flush_work(struct asd_sas_port *port)
+{
+	scsi_flush_work(port->ha->core.shost);
+}
+
 void sas_unregister_domain_devices(struct asd_sas_port *port, int gone)
 {
 	struct domain_device *dev, *n;
@@ -401,8 +406,8 @@  void sas_unregister_domain_devices(struct asd_sas_port *port, int gone)
 	list_for_each_entry_safe(dev, n, &port->disco_list, disco_list_node)
 		sas_unregister_dev(port, dev);
 
+	sas_flush_work(port);
 	port->port->rphy = NULL;
-
 }
 
 void sas_device_set_phy(struct domain_device *dev, struct sas_port *port)