diff mbox

[v2] libsas: fix "sysfs group not found" warnings at port teardown time

Message ID 20150520230012.15322.1013.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Dan Williams May 20, 2015, 11 p.m. UTC
Praveen reports:

    After some debugging this is what I have found

    sas_phye_loss_of_signal gets triggered on phy_event from mvsas
        sas_phye_loss_of_signal calls sas_deform_port
             sas_deform_port posts a DISCE_DESTRUCT event (sas_unregister_domain_devices-> sas_unregister_dev)
             sas_deform_port calls sas_port_delete
                     sas_port_delete calls sas_port_delete_link
                             sysfs_remove_group: kobject 'port-X:Y'
                     sas_port_delete calls device_del
                             sysfs_remove_group: kobject 'port-X:Y'

    sas_destruct_devices gets triggered for the destruct event (DISCE_DESTRUCT)
        sas_destruct_devices calls sas_rphy_delete
            sas_rphy_delete calls scsi_remove_device
                 scsi_remove_device calls __scsi_remove_device
                         __scsi_remove_device calls bsg_unregister_queue
                                 bsg_unregister_queue -> device_unregister -> device_del -> sysfs_remove_group: kobject 'X:0:0:0'

    Since X:0:0:0 falls under port-X:Y (which got deleted during
    sas_port_delete), this call results in the warning. All the later
    warnings in the dmesg output I sent earlier are trying to delete objects
    under port-X:Y. Since port-X:Y got recursively deleted, all these calls
    result in warnings. Since, the PHY and DISC events are processed in two
    different work queues (and one triggers the other), is there any way
    other than checking if the object exists in sysfs (in device_del) before
    deleting?

    ------------[ cut here ]------------
    WARNING: CPU: 2 PID: 6 at fs/sysfs/group.c:219 device_del+0x40/0x1c0()
    sysfs group ffffffff818b97e0 not found for kobject '2:0:4:0'
    [..]
    CPU: 2 PID: 6 Comm: kworker/u8:0 Tainted: P        W  O  3.16.7-ckt9-logicube-ng.3 #1
    Hardware name: To be filled by O.E.M. To be filled by O.E.M./VT6085, BIOS 4.6.5 01/23/2015
    Workqueue: scsi_wq_2 sas_destruct_devices [libsas]
     0000000000000009 ffffffff8151cd18 ffff88011b35bcd8 ffffffff810687b7
     ffff88011a661400 ffff88011b35bd28 ffff8800c6e5e968 ffff880000028810
     ffff8800c89f2c00 ffffffff8106881c ffffffff81733b68 0000000000000028
    Call Trace:
     [<ffffffff8151cd18>] ? dump_stack+0x41/0x51
     [<ffffffff810687b7>] ? warn_slowpath_common+0x77/0x90
     [<ffffffff8106881c>] ? warn_slowpath_fmt+0x4c/0x50
     [<ffffffff813ad2d0>] ? device_del+0x40/0x1c0
     [<ffffffff813ad46a>] ? device_unregister+0x1a/0x70
     [<ffffffff812a535e>] ? bsg_unregister_queue+0x5e/0xb0
     [<ffffffffa00781a9>] ? __scsi_remove_device+0xa9/0xd0 [scsi_mod]

It appears we've always been double deleting the devices below sas_port,
but recent sysfs changes now exposes this problem.  Libsas should delete
all the devices from rphy down before deleting the parent port.

Cc: <stable@vger.kernel.org>
Reported-by: Praveen Murali <pmurali@logicube.com>
Tested-by: Praveen Murali <pmurali@logicube.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

v2: drop the "---" separators that will confuse git-am.  Thanks Luis!

 drivers/scsi/libsas/sas_discover.c |    6 +++---
 drivers/scsi/libsas/sas_port.c     |    1 -
 2 files changed, 3 insertions(+), 4 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Luis Henriques May 21, 2015, 7:54 a.m. UTC | #1
On Wed, May 20, 2015 at 07:00:53PM -0400, Dan Williams wrote:
> Praveen reports:
> 
>     After some debugging this is what I have found
> 
>     sas_phye_loss_of_signal gets triggered on phy_event from mvsas
>         sas_phye_loss_of_signal calls sas_deform_port
>              sas_deform_port posts a DISCE_DESTRUCT event (sas_unregister_domain_devices-> sas_unregister_dev)
>              sas_deform_port calls sas_port_delete
>                      sas_port_delete calls sas_port_delete_link
>                              sysfs_remove_group: kobject 'port-X:Y'
>                      sas_port_delete calls device_del
>                              sysfs_remove_group: kobject 'port-X:Y'
> 
>     sas_destruct_devices gets triggered for the destruct event (DISCE_DESTRUCT)
>         sas_destruct_devices calls sas_rphy_delete
>             sas_rphy_delete calls scsi_remove_device
>                  scsi_remove_device calls __scsi_remove_device
>                          __scsi_remove_device calls bsg_unregister_queue
>                                  bsg_unregister_queue -> device_unregister -> device_del -> sysfs_remove_group: kobject 'X:0:0:0'
> 
>     Since X:0:0:0 falls under port-X:Y (which got deleted during
>     sas_port_delete), this call results in the warning. All the later
>     warnings in the dmesg output I sent earlier are trying to delete objects
>     under port-X:Y. Since port-X:Y got recursively deleted, all these calls
>     result in warnings. Since, the PHY and DISC events are processed in two
>     different work queues (and one triggers the other), is there any way
>     other than checking if the object exists in sysfs (in device_del) before
>     deleting?
> 
>     ------------[ cut here ]------------
>     WARNING: CPU: 2 PID: 6 at fs/sysfs/group.c:219 device_del+0x40/0x1c0()
>     sysfs group ffffffff818b97e0 not found for kobject '2:0:4:0'
>     [..]
>     CPU: 2 PID: 6 Comm: kworker/u8:0 Tainted: P        W  O  3.16.7-ckt9-logicube-ng.3 #1
>     Hardware name: To be filled by O.E.M. To be filled by O.E.M./VT6085, BIOS 4.6.5 01/23/2015
>     Workqueue: scsi_wq_2 sas_destruct_devices [libsas]
>      0000000000000009 ffffffff8151cd18 ffff88011b35bcd8 ffffffff810687b7
>      ffff88011a661400 ffff88011b35bd28 ffff8800c6e5e968 ffff880000028810
>      ffff8800c89f2c00 ffffffff8106881c ffffffff81733b68 0000000000000028
>     Call Trace:
>      [<ffffffff8151cd18>] ? dump_stack+0x41/0x51
>      [<ffffffff810687b7>] ? warn_slowpath_common+0x77/0x90
>      [<ffffffff8106881c>] ? warn_slowpath_fmt+0x4c/0x50
>      [<ffffffff813ad2d0>] ? device_del+0x40/0x1c0
>      [<ffffffff813ad46a>] ? device_unregister+0x1a/0x70
>      [<ffffffff812a535e>] ? bsg_unregister_queue+0x5e/0xb0
>      [<ffffffffa00781a9>] ? __scsi_remove_device+0xa9/0xd0 [scsi_mod]
> 
> It appears we've always been double deleting the devices below sas_port,
> but recent sysfs changes now exposes this problem.  Libsas should delete
> all the devices from rphy down before deleting the parent port.
> 
> Cc: <stable@vger.kernel.org>
> Reported-by: Praveen Murali <pmurali@logicube.com>
> Tested-by: Praveen Murali <pmurali@logicube.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> 
> v2: drop the "---" separators that will confuse git-am.  Thanks Luis!
> 

Awesome, thanks a lot!

Cheers,
--
Luís

>  drivers/scsi/libsas/sas_discover.c |    6 +++---
>  drivers/scsi/libsas/sas_port.c     |    1 -
>  2 files changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c
> index 60de66252fa2..a4db770fe8b0 100644
> --- a/drivers/scsi/libsas/sas_discover.c
> +++ b/drivers/scsi/libsas/sas_discover.c
> @@ -362,11 +362,14 @@ static void sas_destruct_devices(struct work_struct *work)
>  	clear_bit(DISCE_DESTRUCT, &port->disc.pending);
>  
>  	list_for_each_entry_safe(dev, n, &port->destroy_list, disco_list_node) {
> +		struct sas_port *sas_port = dev_to_sas_port(dev->rphy->dev.parent);
> +
>  		list_del_init(&dev->disco_list_node);
>  
>  		sas_remove_children(&dev->rphy->dev);
>  		sas_rphy_delete(dev->rphy);
>  		sas_unregister_common_dev(port, dev);
> +		sas_port_delete(sas_port);
>  	}
>  }
>  
> @@ -400,9 +403,6 @@ void sas_unregister_domain_devices(struct asd_sas_port *port, int gone)
>  
>  	list_for_each_entry_safe(dev, n, &port->disco_list, disco_list_node)
>  		sas_unregister_dev(port, dev);
> -
> -	port->port->rphy = NULL;
> -
>  }
>  
>  void sas_device_set_phy(struct domain_device *dev, struct sas_port *port)
> diff --git a/drivers/scsi/libsas/sas_port.c b/drivers/scsi/libsas/sas_port.c
> index d3c5297c6c89..9a25ae3a52a4 100644
> --- a/drivers/scsi/libsas/sas_port.c
> +++ b/drivers/scsi/libsas/sas_port.c
> @@ -219,7 +219,6 @@ void sas_deform_port(struct asd_sas_phy *phy, int gone)
>  
>  	if (port->num_phys == 1) {
>  		sas_unregister_domain_devices(port, gone);
> -		sas_port_delete(port->port);
>  		port->port = NULL;
>  	} else {
>  		sas_port_delete_phy(port->port, phy->phy);
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Thumshirn March 19, 2017, 12:44 p.m. UTC | #2
On Wed, May 20, 2015 at 07:00:53PM -0400, Dan Williams wrote:
> Praveen reports:
> 
>     After some debugging this is what I have found
> 
>     sas_phye_loss_of_signal gets triggered on phy_event from mvsas
>         sas_phye_loss_of_signal calls sas_deform_port
>              sas_deform_port posts a DISCE_DESTRUCT event (sas_unregister_domain_devices-> sas_unregister_dev)
>              sas_deform_port calls sas_port_delete
>                      sas_port_delete calls sas_port_delete_link
>                              sysfs_remove_group: kobject 'port-X:Y'
>                      sas_port_delete calls device_del
>                              sysfs_remove_group: kobject 'port-X:Y'
> 
>     sas_destruct_devices gets triggered for the destruct event (DISCE_DESTRUCT)
>         sas_destruct_devices calls sas_rphy_delete
>             sas_rphy_delete calls scsi_remove_device
>                  scsi_remove_device calls __scsi_remove_device
>                          __scsi_remove_device calls bsg_unregister_queue
>                                  bsg_unregister_queue -> device_unregister -> device_del -> sysfs_remove_group: kobject 'X:0:0:0'
> 
>     Since X:0:0:0 falls under port-X:Y (which got deleted during
>     sas_port_delete), this call results in the warning. All the later
>     warnings in the dmesg output I sent earlier are trying to delete objects
>     under port-X:Y. Since port-X:Y got recursively deleted, all these calls
>     result in warnings. Since, the PHY and DISC events are processed in two
>     different work queues (and one triggers the other), is there any way
>     other than checking if the object exists in sysfs (in device_del) before
>     deleting?
> 
>     ------------[ cut here ]------------
>     WARNING: CPU: 2 PID: 6 at fs/sysfs/group.c:219 device_del+0x40/0x1c0()
>     sysfs group ffffffff818b97e0 not found for kobject '2:0:4:0'
>     [..]
>     CPU: 2 PID: 6 Comm: kworker/u8:0 Tainted: P        W  O  3.16.7-ckt9-logicube-ng.3 #1
>     Hardware name: To be filled by O.E.M. To be filled by O.E.M./VT6085, BIOS 4.6.5 01/23/2015
>     Workqueue: scsi_wq_2 sas_destruct_devices [libsas]
>      0000000000000009 ffffffff8151cd18 ffff88011b35bcd8 ffffffff810687b7
>      ffff88011a661400 ffff88011b35bd28 ffff8800c6e5e968 ffff880000028810
>      ffff8800c89f2c00 ffffffff8106881c ffffffff81733b68 0000000000000028
>     Call Trace:
>      [<ffffffff8151cd18>] ? dump_stack+0x41/0x51
>      [<ffffffff810687b7>] ? warn_slowpath_common+0x77/0x90
>      [<ffffffff8106881c>] ? warn_slowpath_fmt+0x4c/0x50
>      [<ffffffff813ad2d0>] ? device_del+0x40/0x1c0
>      [<ffffffff813ad46a>] ? device_unregister+0x1a/0x70
>      [<ffffffff812a535e>] ? bsg_unregister_queue+0x5e/0xb0
>      [<ffffffffa00781a9>] ? __scsi_remove_device+0xa9/0xd0 [scsi_mod]
> 
> It appears we've always been double deleting the devices below sas_port,
> but recent sysfs changes now exposes this problem.  Libsas should delete
> all the devices from rphy down before deleting the parent port.
> 
> Cc: <stable@vger.kernel.org>
> Reported-by: Praveen Murali <pmurali@logicube.com>
> Tested-by: Praveen Murali <pmurali@logicube.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---

Hi Dan,

JFYI it looks like your patch doesn't completely solve the problem. I can
still trigger this with the following one-liner:

echo 1 > /sys/module/isci/drivers/pci\:isci/0000\:03\:00.0/remove

It's the 'power' sysfs attribute and it happens for SAS port, end_device and
phy.

And as I've seen a report for lpfc as well, I _think_ it is a generic problem
with the SCSI transport classes, not just libsas. Or it could be
scsi_transport_sas and scsi_transport_fc doing nasty things and abuse the
transport class interface.

Byte,
	Johannes
Johannes Thumshirn March 24, 2017, 4:53 p.m. UTC | #3
[ +Cc Tejun ]

On Fri, Mar 24, 2017 at 11:44:55AM +0000, John Garry wrote:
> To be clear, was this the same test with isci which you initially reported?

Yes, just echo into the PCI device's sysfs remove file and it'll trigger the
problem.

I did some archeology and it seems as if commit bcdde7e ("sysfs: make
__sysfs_remove_dir() recursive") introduced/uncovered this behavior.

For reference, here's one of my calltraces (the first of 40!):
------------[ cut here ]------------
WARNING: CPU: 2 PID: 5 at fs/sysfs/group.c:241 sysfs_remove_group+0xc3/0xd0
sysfs group 'power' not found for kobject 'end_device-6:0'
CPU: 16 PID: 5884 Comm: repro.sh Not tainted 4.11.0-rc3-libsas+ #504
Call Trace:
 dump_stack+0x85/0xc2
 __warn+0xc6/0xe0
 warn_slowpath_fmt+0x4a/0x50
 sysfs_remove_group+0xc3/0xd0
 dpm_sysfs_remove+0x52/0x60
 device_del+0x13c/0x360
 ? device_remove_file+0x14/0x20
 attribute_container_class_device_del+0x15/0x20
 transport_remove_classdev+0x4c/0x60
 ? transport_add_class_device+0x40/0x40
 attribute_container_device_trigger+0xb3/0xc0
 transport_remove_device+0x10/0x20
 sas_port_delete+0x12d/0x160 [scsi_transport_sas]
 sas_deform_port+0x1bf/0x1d0 [libsas]
 sas_unregister_ports+0x36/0x50 [libsas]
 sas_unregister_ha+0x1b/0x40 [libsas]
 isci_unregister+0x2a/0x40 [isci]
 isci_pci_remove+0x52/0xb0 [isci]
 ? __pm_runtime_resume+0x56/0x80
 pci_device_remove+0x34/0xb0
 device_release_driver_internal+0x158/0x210
 device_release_driver+0xd/0x10
 pci_stop_bus_device+0x85/0x90
 pci_stop_and_remove_bus_device_locked+0x15/0x30
 remove_store+0x59/0x70
 dev_attr_store+0x13/0x20
 sysfs_kf_write+0x40/0x50
 kernfs_fop_write+0x130/0x1b0
 __vfs_write+0x23/0x130
 ? rcu_read_lock_sched_held+0x6d/0x80
 ? rcu_sync_lockdep_assert+0x2a/0x50
 ? __sb_start_write+0xd7/0x1e0
 ? vfs_write+0x1a4/0x1f0
 vfs_write+0xc6/0x1f0
 SyS_write+0x44/0xa0
 entry_SYSCALL_64_fastpath+0x23/0xc6

But as I said, I don't belive this is a problem in the SAS transport or the
SAS drivers, but a device core or transport class.

Byte,
	Johannes
Tejun Heo March 28, 2017, 9:41 p.m. UTC | #4
Hello,

On Fri, Mar 24, 2017 at 05:53:54PM +0100, Johannes Thumshirn wrote:
> [ +Cc Tejun ]
> 
> On Fri, Mar 24, 2017 at 11:44:55AM +0000, John Garry wrote:
> > To be clear, was this the same test with isci which you initially reported?
> 
> Yes, just echo into the PCI device's sysfs remove file and it'll trigger the
> problem.
> 
> I did some archeology and it seems as if commit bcdde7e ("sysfs: make
> __sysfs_remove_dir() recursive") introduced/uncovered this behavior.

I couldn't reproduce it with other devices (don't have a sas controller).

> For reference, here's one of my calltraces (the first of 40!):
> ------------[ cut here ]------------
> WARNING: CPU: 2 PID: 5 at fs/sysfs/group.c:241 sysfs_remove_group+0xc3/0xd0
> sysfs group 'power' not found for kobject 'end_device-6:0'
> CPU: 16 PID: 5884 Comm: repro.sh Not tainted 4.11.0-rc3-libsas+ #504
> Call Trace:
>  dump_stack+0x85/0xc2
>  __warn+0xc6/0xe0
>  warn_slowpath_fmt+0x4a/0x50
>  sysfs_remove_group+0xc3/0xd0
>  dpm_sysfs_remove+0x52/0x60
>  device_del+0x13c/0x360
>  ? device_remove_file+0x14/0x20
>  attribute_container_class_device_del+0x15/0x20
>  transport_remove_classdev+0x4c/0x60
>  ? transport_add_class_device+0x40/0x40
>  attribute_container_device_trigger+0xb3/0xc0
>  transport_remove_device+0x10/0x20
>  sas_port_delete+0x12d/0x160 [scsi_transport_sas]
>  sas_deform_port+0x1bf/0x1d0 [libsas]
>  sas_unregister_ports+0x36/0x50 [libsas]
>  sas_unregister_ha+0x1b/0x40 [libsas]
>  isci_unregister+0x2a/0x40 [isci]
>  isci_pci_remove+0x52/0xb0 [isci]
>  ? __pm_runtime_resume+0x56/0x80
>  pci_device_remove+0x34/0xb0
>  device_release_driver_internal+0x158/0x210
>  device_release_driver+0xd/0x10
>  pci_stop_bus_device+0x85/0x90
>  pci_stop_and_remove_bus_device_locked+0x15/0x30
>  remove_store+0x59/0x70
>  dev_attr_store+0x13/0x20
>  sysfs_kf_write+0x40/0x50
>  kernfs_fop_write+0x130/0x1b0
>  __vfs_write+0x23/0x130
>  ? rcu_read_lock_sched_held+0x6d/0x80
>  ? rcu_sync_lockdep_assert+0x2a/0x50
>  ? __sb_start_write+0xd7/0x1e0
>  ? vfs_write+0x1a4/0x1f0
>  vfs_write+0xc6/0x1f0
>  SyS_write+0x44/0xa0
>  entry_SYSCALL_64_fastpath+0x23/0xc6
> 
> But as I said, I don't belive this is a problem in the SAS transport or the
> SAS drivers, but a device core or transport class.

So, what's most likely happening is that the parent device or kobject
which contains the attribute group has already been removed earlier
and because the removal is recursive, the later explicit removal is
trying to remove already removed files.  It can be fixed either by
reordering so that the parent node is removed after the children or
simply dropping the explicit removal of children.

Thanks.
Johannes Thumshirn March 29, 2017, 8:11 a.m. UTC | #5
On Tue, Mar 28, 2017 at 05:41:49PM -0400, Tejun Heo wrote:
> > But as I said, I don't belive this is a problem in the SAS transport or the
> > SAS drivers, but a device core or transport class.
> 
> So, what's most likely happening is that the parent device or kobject
> which contains the attribute group has already been removed earlier
> and because the removal is recursive, the later explicit removal is
> trying to remove already removed files.  It can be fixed either by
> reordering so that the parent node is removed after the children or
> simply dropping the explicit removal of children.

Hi Tejun,

Thanks for haivng a look at this. I already did this earlier and it didn't have
an effect at all. But I forgot to flush the SAS desctruct workqueue so it didn't
have an effect.

Thanks for re-pointing me to it. Patches fixing the issue are coming finally :-).

Byte,
	Johannes
diff mbox

Patch

diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c
index 60de66252fa2..a4db770fe8b0 100644
--- a/drivers/scsi/libsas/sas_discover.c
+++ b/drivers/scsi/libsas/sas_discover.c
@@ -362,11 +362,14 @@  static void sas_destruct_devices(struct work_struct *work)
 	clear_bit(DISCE_DESTRUCT, &port->disc.pending);
 
 	list_for_each_entry_safe(dev, n, &port->destroy_list, disco_list_node) {
+		struct sas_port *sas_port = dev_to_sas_port(dev->rphy->dev.parent);
+
 		list_del_init(&dev->disco_list_node);
 
 		sas_remove_children(&dev->rphy->dev);
 		sas_rphy_delete(dev->rphy);
 		sas_unregister_common_dev(port, dev);
+		sas_port_delete(sas_port);
 	}
 }
 
@@ -400,9 +403,6 @@  void sas_unregister_domain_devices(struct asd_sas_port *port, int gone)
 
 	list_for_each_entry_safe(dev, n, &port->disco_list, disco_list_node)
 		sas_unregister_dev(port, dev);
-
-	port->port->rphy = NULL;
-
 }
 
 void sas_device_set_phy(struct domain_device *dev, struct sas_port *port)
diff --git a/drivers/scsi/libsas/sas_port.c b/drivers/scsi/libsas/sas_port.c
index d3c5297c6c89..9a25ae3a52a4 100644
--- a/drivers/scsi/libsas/sas_port.c
+++ b/drivers/scsi/libsas/sas_port.c
@@ -219,7 +219,6 @@  void sas_deform_port(struct asd_sas_phy *phy, int gone)
 
 	if (port->num_phys == 1) {
 		sas_unregister_domain_devices(port, gone);
-		sas_port_delete(port->port);
 		port->port = NULL;
 	} else {
 		sas_port_delete_phy(port->port, phy->phy);