
multipath-tools: document why dev_loss_tmo is set to infinity for HPE 3PAR

Message ID 20181212165833.12407-1-xose.vazquez@gmail.com (mailing list archive)
State Not Applicable, archived
Delegated to: christophe varoqui
Series multipath-tools: document why dev_loss_tmo is set to infinity for HPE 3PAR

Commit Message

Xose Vazquez Perez Dec. 12, 2018, 4:58 p.m. UTC
It's needed by Peer Persistence, documented in SLES and RHEL guides:
https://support.hpe.com/hpsc/doc/public/display?docId=a00053835
https://support.hpe.com/hpsc/doc/public/display?docId=c04448818

Cc: Christophe Varoqui <christophe.varoqui@opensvc.com>
Cc: DM-DEVEL ML <dm-devel@redhat.com>
Signed-off-by: Xose Vazquez Perez <xose.vazquez@gmail.com>
---
 libmultipath/hwtable.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Roger Heflin Dec. 12, 2018, 7:44 p.m. UTC | #1
One thing that seems to be a mess with the TMO value being inherited
from the underlying driver is that the setting at the SCSI layer means
something significantly different from what multipath calls TMO.

In the case I have seen with the lpfc driver this is often set fairly
low (HPE's doc references 14 seconds, and this is similar to what my
employer is using).
parm:           lpfc_devloss_tmo:Seconds driver will hold I/O waiting
for a device to come back (int)
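For reference, one way to compare the module-level default with the
per-port values the FC transport actually has in effect is via sysfs
(paths are illustrative and may differ by distro/kernel; treat this as
a sketch):

  # module-parameter default for lpfc
  cat /sys/module/lpfc/parameters/lpfc_devloss_tmo

  # values in effect at the fc_transport level, per remote port
  grep -H . /sys/class/fc_remote_ports/rport-*/dev_loss_tmo
  grep -H . /sys/class/fc_remote_ports/rport-*/fast_io_fail_tmo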

But setting this on the SCSI layer causes it to quickly return an
error to the multipath layer.  It does not mean that the SCSI layer
removes the device from the system, just that it returns an error so
that the layer above it can deal with it.  The multipath layer
interprets its value of TMO as when to clean up/remove the underlying
path once dev_loss_tmo is hit.  TMO is used in both names, but they do
not have the same usage and meaning, and the SCSI layer's TMO should
not be inherited by the multipath layer, as they don't appear to
actually be the same thing.  In multipath it should probably be called
remove_fault_paths or something similar.

This incorrect inheritance has caused issues: prior to multipath
inheriting TMO from the SCSI layer, multipath did not remove the paths
when I/O had failed for TMO time.  Before the inheritance, the paths
stayed around and errored until the underlying issue was fixed, a
reboot happened, or someone manually removed the failing paths.  When I
first saw this I had processes in place to deal with it, and we noticed
when it started automatically cleaning up paths; that was good, since
it eliminated manual work, until it caused issues during a firmware
update.  HPE's change to infinity is presumably a response to the
inherited TMO change causing issues.

On Wed, Dec 12, 2018 at 10:58 AM Xose Vazquez Perez
<xose.vazquez@gmail.com> wrote:
>
> It's needed by Peer Persistence, documented in SLES and RHEL guides:
> https://support.hpe.com/hpsc/doc/public/display?docId=a00053835
> https://support.hpe.com/hpsc/doc/public/display?docId=c04448818
>
> Cc: Christophe Varoqui <christophe.varoqui@opensvc.com>
> Cc: DM-DEVEL ML <dm-devel@redhat.com>
> Signed-off-by: Xose Vazquez Perez <xose.vazquez@gmail.com>
> ---
>  libmultipath/hwtable.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/libmultipath/hwtable.c b/libmultipath/hwtable.c
> index d3a8d9b..543bacd 100644
> --- a/libmultipath/hwtable.c
> +++ b/libmultipath/hwtable.c
> @@ -116,6 +116,7 @@ static struct hwentry default_hw[] = {
>                 .prio_name     = PRIO_ALUA,
>                 .no_path_retry = 18,
>                 .fast_io_fail  = 10,
> +               /* infinity is needed by Peer Persistence */
>                 .dev_loss      = MAX_DEV_LOSS_TMO,
>         },
>         {
> --
> 2.19.2
Martin Wilck Dec. 12, 2018, 11:44 p.m. UTC | #2
On Wed, 2018-12-12 at 13:44 -0600, Roger Heflin wrote:

> One thing that seems to be a mess with the tmo value that is being
> inherited from the underlying driver, is that the setting for the
> scsi
> layer is significantly different from what multipath calls TMO.
> 
> In the case I have seen with the lpfc driver this is often set fairly
> low (HPE's doc references 14 seconds, and this is similar to what my
> employer is using).
> parm:           lpfc_devloss_tmo:Seconds driver will hold I/O waiting
> for a device to come back (int)
> 
> But setting this on the scsi layer causes it to quickly return an
> error to the multipath layer.  It does not mean that the scsi layer
> removes the device from the system, just that it returns an error so
> that the layer above it can deal with it. 

You are confusing fast_io_fail_tmo and dev_loss_tmo. What you just
described is fast_io_fail_tmo. If dev_loss_tmo expires, the SCSI layer
does indeed remove the SCSI target. See comments on the
fc_remote_port_delete() function.
(https://elixir.bootlin.com/linux/latest/source/drivers/scsi/scsi_transport_fc.c#L2906)

For multipath, what really matters is fast_io_fail_tmo. dev_loss_tmo
only matters if fast_io_fail_tmo is unset. fast_io_fail is preferred,
because path failure/reinstantiation is much easier to handle than path
removal/re-addition, on both kernel and user space level. The reason
dev_loss_tmo is not infinity by default is twofold: 1) if fast_io_fail
is not used and dev_loss_tmo is infinity, IOs might block on a removed
device forever; 2) even with fast_io_fail, if a lost device doesn't
come back after a long time, it might be good not to carry it around
forever - chances are that the storage admin really removed the device
or changed the zoning.
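As a rough multipath.conf sketch of that division of labour (vendor
and product strings below are placeholders, and the numbers are only
examples, not recommendations):

  devices {
      device {
          # placeholder vendor/product for some FC array
          vendor            "EXAMPLE"
          product           "ARRAY"
          # fail I/O back to dm-multipath quickly so it can switch paths
          fast_io_fail_tmo  10
          # remove the SCSI devices only after a much longer outage
          dev_loss_tmo      600
          no_path_retry     18
      }
  }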

>   The multipath layer
> interprets its value of TMO as when to clean up/remove the underlying
> path once dev_loss_tmo is hit.    TMO is used in both names, but
> they are not the same usage and meaning and the scsi layer's TMO
> should not be inherited by the multipath layer, as they don't appear
> to actually be the same thing.   In multipath it should probably be
> called remove_fault_paths or something similar.

I'm not sure what you mean by "multipath layer". The kernel
dm-multipath layer has nothing to do with dev_loss_tmo at all.
multipath-tools don't "inherit" this value, either. They *set* it to
match the settings from multipath.conf and the internal hwtable, taking
other related settings into account (in particular, no_path_retry).
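If in doubt, the merged result and what multipathd applies can be
inspected like this (a sketch; the output format differs between
versions):

  # dump the effective configuration (built-in hwtable merged with
  # multipath.conf)
  multipath -t | less
  multipathd show config | grep -E 'fast_io_fail|dev_loss|no_path_retry'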

> This incorrect inheritance has caused issues, as prior to multipath
> inheriting TMO from the scsi layer, multipath did not remove the
> paths
> when IO failed for TMO time. 

Sorry, no. multipathd *never* removes SCSI paths. If it receives an
event about removal of a path, it updates its own data structures, and
the maps in the dm-multipath layer. That's it.

The only thing that multipath-tools do that may cause SCSI devices to
get removed is to set dev_loss_tmo to a low value. But that would be
a matter of (unusual) configuration.

>   The paths prior to the inheritance
> stayed around and errored until the underlying issue was fixed, or a
> reboot happened, or until someone manually removed the failing paths.
> When I first saw this I had processes to deal with this, and we
> noticed when it started automatically cleaning up paths and it was
> good
> since it eliminated manual work, that is until it caused issues
> during
> firmware update.  HPE's update to infinity will be a response to the
> inherited TMO change causing issues.

I'm wondering what you're talking about. dev_loss_tmo has been in the
SCSI layer for ages.

Regards
Martin


Roger Heflin Dec. 13, 2018, 4:46 p.m. UTC | #3
> You are confusing fast_io_fail_tmo and dev_loss_tmo. What you just
> described is fast_io_fail_tmo. If dev_loss_tmo expires, the SCSI layer
> does indeed remove the SCSI target. See comments on the
> fc_remote_port_delete() function.
> (https://elixir.bootlin.com/linux/latest/source/drivers/scsi/scsi_transport_fc.c#L2906)

The lpfc driver lets one set dev_loss_tmo, and the description of the
parameter reads as though it should be fast_io_fail_tmo rather than
dev_loss_tmo; from how it works, it appears to be used to set
dev_loss_tmo in the SCSI layer.  And the lpfc driver does not have a
setting for fast_io_fail_tmo, which would seem to be what is actually
needed/wanted.  The reason for setting it was that we have had FC
fabric failures that did not result in an error being returned to
multipath, such that multipath could not fail over to the other working
paths.


>
> For multipath, what really matters is fast_io_fail_tmo. dev_loss_tmo
> only matters if fast_io_fail_tmo is unset. fast_io_fail is preferred,
> because path failure/reinstantiation is much easier to handle than path
> removal/re-addition, on both kernel and user space level. The reason
> dev_loss_tmo is not infinity by default is twofold: 1) if fast_io_fail
> is not used and dev_loss_tmo is infinity, IOs might block on a removed
> device forever; 2) even with fast_io_fail, if a lost device doesn't
> come back after a long time, it might be good not to carry it around
> forever - chances are that the storage admin really removed the device
> or changed the zoning.

We are thinking of setting dev_loss_tmo to 86400 (24 hours) as a happy
medium: it leaves the paths around during reasonable events, but
results in a clean-up after 24 hours.
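A minimal multipath.conf sketch of that policy (numbers as above; note
that, as far as I know, the FC transport only accepts a dev_loss_tmo
above 600 seconds if fast_io_fail_tmo is also set):

  defaults {
      # fail I/O over to other working paths quickly
      fast_io_fail_tmo  5
      # keep lost remote ports around for 24 hours before cleanup
      dev_loss_tmo      86400
  }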

>
> >   The multipath layer
> > interprets its value of TMO as when to clean up/remove the underlying
> > path once dev_loss_tmo is hit.    TMO is used in both names, but
> > they are not the same usage and meaning and the scsi layer's TMO
> > should not be inherited by the multipath layer, as they don't appear
> > to actually be the same thing.   In multipath it should probably be
> > called remove_fault_paths or something similar.
>
> I'm not sure what you mean with "multipath layer". The kernel dm-
> multipath layer has nothing to do with dev_loss_tmo at all. multipath-
> tools don't "inherit" this value, either. They *set* it to match the
> settings from multipath.conf and the internal hwtable, taking other
> related settings into account (in particular, no_path_retry).

ok.
>
> > This incorrect inheritance has caused issues, as prior to multipath
> > inheriting TMO from the scsi layer, multipath did not remove the
> > paths
> > when IO failed for TMO time.
>
> Sorry, no. multipathd *never* removes SCSI paths. If it receives an
> event about removal of a path, it updates its own data structures, and
> the maps in the dm-multipath layer. That's it.
>

> >   The paths prior to the inheritance
> > stayed around and errored until the underlying issue was fixed, or a
> > reboot happened, or until someone manually removed the failing paths.
> > When I first saw this I had processes to deal with this, and we
> > noticed when it started automatically cleaning up paths and it was
> > good
> > since it eliminated manual work, that is until it caused issues
> > during
> > firmware update.  HPE's update to infinity will be a response to the
> > inherited TMO change causing issues.
>
> I'm wondering what you're talking about. dev_loss_tmo has been in the
> SCSI layer for ages.

Do you have an idea how many years ago dev_loss_tmo started actually
removing the device?  I am guessing that what I saw was when that
change was backported into RHEL, but I don't know exactly when it was
backported.

Prior to that we had processes to evaluate why a given path was
erroring and either fix it or clean it up, so the change was fairly
easy for us to see.  Maybe when that change went in, the lpfc driver
should have started setting fast_io_fail_tmo rather than dev_loss_tmo
in the SCSI layer, as fast_io_fail_tmo is closer to what the described
option does on the lpfc driver.

Thanks, I think this helps my understanding of how to tune things.

Martin Wilck Dec. 13, 2018, 5:16 p.m. UTC | #4
On Thu, 2018-12-13 at 10:46 -0600, Roger Heflin wrote:
> > You are confusing fast_io_fail_tmo and dev_loss_tmo. What you just
> > described is fast_io_fail_tmo. If dev_loss_tmo expires, the SCSI
> > layer
> > does indeed remove the SCSI target. See comments on the
> > fc_remote_port_delete() function.
> > (
> the lpfc driver lets one set dev_loss_tmo and the description on the
> > parameter seems like it should be fast_io_fail_tmo rather than
> dev_loss_tmo, from how it is working it appears to be used to set
> dev_loss_tmo in the scsi layer.

On my system, the docstring of lpfc.devloss_tmo says "Seconds driver
will hold I/O waiting for a device to come back", which is basically
true, although it does not say that when the waiting is over, the
device node is removed.

>    And the lpfc driver does not have a
> setting for the fast_io_fail_tmo and that would seem to be what is
> actually needed/wanted. 

That is set via the generic scsi_transport_fc layer. Normally you do it
with multipath-tools, as the parameter is only useful in multipath
scenarios.
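Concretely, that knob lives on the remote port object, so it can also
be poked directly for testing (the rport name here is illustrative),
even though multipathd is normally what manages it:

  # per remote port, via the generic fc_transport class
  echo 10 > /sys/class/fc_remote_ports/rport-2:0-3/fast_io_fail_tmo
  cat /sys/class/fc_remote_ports/rport-2:0-3/fast_io_fail_tmo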

>  The reason for setting it was we have had fc
> fabric failures that did not result in an error being returned to
> multipath, such that multipath could not failover to the other
> working
> paths.

You should have been setting fast_io_fail_tmo in multipath.conf.


> > For multipath, what really matters is fast_io_fail_tmo.
> > dev_loss_tmo
> > only matters if fast_io_fail_tmo is unset. fast_io_fail is
> > preferred,
> > because path failure/reinstantiation is much easier to handle than
> > path
> > removal/re-addition, on both kernel and user space level. The
> > reason
> > dev_loss_tmo is not infinity by default is twofold: 1) if
> > fast_io_fail
> > is not used and dev_loss_tmo is infinity, IOs might block on a
> > removed
> > device forever; 2) even with fast_io_fail, if a lost device doesn't
> > come back after a long time, it might be good not to carry it
> > around
> > forever - chances are that the storage admin really removed the
> > device
> > or changed the zoning.
> 
> We are thinking of setting dev_loss_tmo to 86400 (24 hours) as that
> is
> a happy medium, and leaves the paths around during reasonable events,
> but results in a clean-up at 24 hours.

That sounds reasonable. But that's a matter of policy, which
differs vastly between different installations and administrator
preferences. The point I'm trying to make is: it doesn't make a lot 
of sense to tie this setting to the storage hardware properties, as
multipath currently does. It's really much more a matter of data center
administration. That's different for fast_io_fail_tmo - it makes
sense to relate this timeout to hardware properties, e.g. the time it
takes to do failover or failback.

IMO, in a way, the different dev_loss_tmo settings in multipath's
hardware table reflect the different vendors' ideas of how the storage
should be administered rather than the actual properties of the
hardware.

> 
> > I'm wondering what you're talking about. dev_loss_tmo has been in
> > the
> > SCSI layer for ages.
> 
> Do you have an idea how many years ago the dev_loss_tmo started
> actually removing the device?   I am guessing when that was
> backported
> into rhel was what I saw it start, but I don't know exactly when it
> was backported.

I can see it in 2.6.12 (2005):

https://elixir.bootlin.com/linux/v2.6.12/source/drivers/scsi/scsi_transport_fc.c#L1549

You need to understand that, when time starts ticking towards
dev_loss_tmo, the FC remote port is *already gone*. At the transport
layer, there's nothing to "remove" any more. The kernel just keeps the
SCSI layer structures and waits to see whether the device comes back,
as it would for temporary failures such as a network outage or an
operator having pulled the wrong cable.
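This can be observed on a live system while a port is missing (the
rport name is illustrative, and the exact state strings depend on the
kernel):

  # while dev_loss_tmo is ticking, the remote port object still exists,
  # but is blocked rather than online
  cat /sys/class/fc_remote_ports/rport-2:0-3/port_state    # e.g. "Blocked"
  cat /sys/class/fc_remote_ports/rport-2:0-3/dev_loss_tmo
  # once dev_loss_tmo expires, the SCSI devices behind the port are
  # deleted and the remote port object itself goes away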

Regards
Martin

Roger Heflin Dec. 13, 2018, 5:56 p.m. UTC | #5
On Thu, Dec 13, 2018 at 11:16 AM Martin Wilck <mwilck@suse.de> wrote:
>
> On Thu, 2018-12-13 at 10:46 -0600, Roger Heflin wrote:
> > > You are confusing fast_io_fail_tmo and dev_loss_tmo. What you just
> > > described is fast_io_fail_tmo. If dev_loss_tmo expires, the SCSI
> > > layer
> > > does indeed remove the SCSI target. See comments on the
> > > fc_remote_port_delete() function.
> > > (
> > the lpfc driver lets one set dev_loss_tmo and the description on the
> > parameter seems like it should be fast_io_fail_tmo rather than
> > dev_loss_tmo, from how it is working it appears to be used to set
> > dev_loss_tmo in the scsi layer.
>
> On my system, the docstring of lpfc.devloss_tmo says "Seconds driver
> will hold I/O waiting for a device to come back". Which is basically
> true, although it does not say that when the waiting is over, the
> device node is removed.

In older versions the device was not removed, so there was a behavior change.

>
> >    And the lpfc driver does not have a
> > setting for the fast_io_fail_tmo and that would seem to be what is
> > actually needed/wanted.
>
> That is set via the generic scsi_transport_fc layer. Normally you do it
> with multipath-tools, as the parameter is only useful in multipath
> scenarios.
>
> >  The reason for setting it was we have had fc
> > fabric failures that did not result in an error being returned to
> > multipath, such that multipath could not failover to the other
> > working
> > paths.
>
> You should have been setting fast_io_fail_tmo in multipath.conf.

When we started setting the lpfc parameter, multipath did not yet
manage fast_io_fail_tmo or dev_loss_tmo, so it was not an option.

> > We are thinking of setting dev_loss_tmo to 86400 (24 hours) as that
> > is
> > a happy medium, and leaves the paths around during reasonable events,
> > but results in a clean-up at 24 hours.
>
> That sounds reasonable. But that's a matter of policy, which
> differs vastly between different installations and administrator
> preferences. The point I'm trying to make is: it doesn't make a lot
> of sense to tie this setting to the storage hardware properties, as
> multipath currently does. It's really much more a matter of data center
> administration. That's different for fast_io_fail_tmo - it makes
> sense to relate this timeout to hardware properties, e.g. the time it
> takes to do failover or failback.
>
> IMO, in a way, the different dev_loss_tmo settings in multipath's
> hardware table reflect the different vendor's ideas of how the storage
> should be administrated rather than the actual properties of the
> hardware.

And that is kind of the conclusion we were coming to: it is a
preference based on data center administration.

>

> > Do you have an idea how many years ago the dev_loss_tmo started
> > actually removing the device?   I am guessing when that was
> > backported
> > into rhel was what I saw it start, but I don't know exactly when it
> > was backported.
>
> I can see it in 2.6.12 (2005):
>
> https://elixir.bootlin.com/linux/v2.6.12/source/drivers/scsi/scsi_transport_fc.c#L1549
>
In RHEL 5 (2.6.18+) it did not actually delete the device until around
5.8, I believe, so all of the magic may not quite have been working
yet.

> You need to understand that, when time starts ticking towards the
> dev_loss_tmo, the FC remote port is *already gone*. On the
> transport layer, there's nothing to "remove" any more. The kernel just
> keeps the SCSI layer structures and waits if the device comes back, as
> it would for temporary failures such as network outage or an operator
> having pulled the wrong cable.
>

We understand that.  The issue seems to be that once the device is
deleted, the process that brings the device back as a path when the
rport/cable is fixed is not reliable (it fails in roughly 1 in 100
events and causes issues) when the cable is far enough removed from the
host (i.e. the array port connected to an FC switch).  And when routed
FC storage is involved, everything gets much less reliable all around.
It always works if the cable issue being fixed is the cable to the
host, but once it gets far enough away there seems to be an issue
sometimes.  So the hope is that if the device is still there and still
being probed by multipath, there will be less reliance on the imperfect
FC magic.  It may or may not help anything, and we may have to upgrade
something and then actively scan and resolve path issues manually
before upgrading the next component that will take out another path.
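For reference, a typical manual recovery sequence might look roughly
like this (host numbers are illustrative, and the exact multipathd
invocation depends on the version installed):

  # ask the HBA to rediscover the fabric
  echo 1 > /sys/class/fc_host/host2/issue_lip
  # rescan for targets/LUNs that have come back
  echo "- - -" > /sys/class/scsi_host/host2/scan
  # have multipathd re-evaluate its paths
  multipathd reconfigure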


Patch

diff --git a/libmultipath/hwtable.c b/libmultipath/hwtable.c
index d3a8d9b..543bacd 100644
--- a/libmultipath/hwtable.c
+++ b/libmultipath/hwtable.c
@@ -116,6 +116,7 @@  static struct hwentry default_hw[] = {
 		.prio_name     = PRIO_ALUA,
 		.no_path_retry = 18,
 		.fast_io_fail  = 10,
+		/* infinity is needed by Peer Persistence */
 		.dev_loss      = MAX_DEV_LOSS_TMO,
 	},
 	{