Message ID: 20181212165833.12407-1-xose.vazquez@gmail.com (mailing list archive)
State: Not Applicable, archived
Delegated to: Christophe Varoqui
Series: multipath-tools: document why dev_loss_tmo is set to infinity for HPE 3PAR
One thing that seems to be a mess with the tmo value being inherited from the underlying driver is that the setting for the SCSI layer is significantly different from what multipath calls TMO.

In the case I have seen with the lpfc driver this is often set fairly low (HPE's doc references 14 seconds, and this is similar to what my employer is using):

    parm: lpfc_devloss_tmo:Seconds driver will hold I/O waiting for a device to come back (int)

But setting this on the SCSI layer causes it to quickly return an error to the multipath layer. It does not mean that the SCSI layer removes the device from the system, just that it returns an error so that the layer above it can deal with it. The multipath layer interprets its value of TMO as the point at which to clean up/remove the underlying path, i.e. when dev_loss_tmo is hit.

TMO is used in both names, but the usage and meaning are not the same, and the SCSI layer's TMO should not be inherited by the multipath layer, as they don't appear to actually be the same thing. In multipath it should probably be called remove_fault_paths or something similar.

This incorrect inheritance has caused issues: prior to multipath inheriting TMO from the SCSI layer, multipath did not remove the paths when IO failed for TMO time. The paths stayed around and errored until the underlying issue was fixed, a reboot happened, or someone manually removed the failing paths. When I first saw this I had processes to deal with it, and we did notice when it started automatically cleaning up paths; that was good, since it eliminated manual work, until it caused issues during a firmware update. HPE's update to infinity is presumably a response to the inherited TMO change causing issues.
On Wed, Dec 12, 2018 at 10:58 AM Xose Vazquez Perez <xose.vazquez@gmail.com> wrote:
>
> It's needed by Peer Persistence, documented in SLES and RHEL guides:
> https://support.hpe.com/hpsc/doc/public/display?docId=a00053835
> https://support.hpe.com/hpsc/doc/public/display?docId=c04448818
>
> Cc: Christophe Varoqui <christophe.varoqui@opensvc.com>
> Cc: DM-DEVEL ML <dm-devel@redhat.com>
> Signed-off-by: Xose Vazquez Perez <xose.vazquez@gmail.com>
> ---
>  libmultipath/hwtable.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/libmultipath/hwtable.c b/libmultipath/hwtable.c
> index d3a8d9b..543bacd 100644
> --- a/libmultipath/hwtable.c
> +++ b/libmultipath/hwtable.c
> @@ -116,6 +116,7 @@ static struct hwentry default_hw[] = {
> 		.prio_name = PRIO_ALUA,
> 		.no_path_retry = 18,
> 		.fast_io_fail = 10,
> +		/* infinity is needed by Peer Persistence */
> 		.dev_loss = MAX_DEV_LOSS_TMO,
> 	},
> 	{
> --
> 2.19.2

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
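As a side note on the driver parameter Roger describes above: its docstring and the effective per-port value can be inspected as sketched below. This is only a sketch assuming the standard FC transport sysfs layout; it degrades gracefully on systems without the lpfc module or FC hardware.

```shell
# Show the lpfc module's devloss parameter description; harmless if lpfc
# (or even modinfo) is absent.
param_desc=$(modinfo -p lpfc 2>/dev/null | grep devloss || echo "lpfc module not present")
echo "$param_desc"

# The value the SCSI FC transport actually uses lives per remote port in sysfs.
for f in /sys/class/fc_remote_ports/rport-*/dev_loss_tmo; do
    [ -r "$f" ] && printf '%s = %s\n' "$f" "$(cat "$f")"
done
true  # an empty glob (no FC hardware) is not an error
```

Note that `modinfo` only shows the module's documented default behavior; the per-rport sysfs files show what is in effect right now.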
On Wed, 2018-12-12 at 13:44 -0600, Roger Heflin wrote:
> On thing that seems to be a mess with the tmo value that is being
> inherited from the underlying driver, is that the setting for the scsi
> layer is significantly different from what multipath calls TMO.
>
> In the case I have seen with the lpfc driver this is often set fairly
> low (HPE's doc references 14 seconds, and this is similar to what my
> employer is using).
> parm: lpfc_devloss_tmo:Seconds driver will hold I/O waiting
> for a device to come back (int)
>
> But setting this on the scsi layer causes it to quickly return an
> error to the multipath layer. It does not mean that the scsi layer
> removes the device from the system, just that it returns an error so
> that the layer above it can deal with it.

You are confusing fast_io_fail_tmo and dev_loss_tmo. What you just described is fast_io_fail_tmo. If dev_loss_tmo expires, the SCSI layer does indeed remove the SCSI target. See comments on the fc_remote_port_delete() function.
(https://elixir.bootlin.com/linux/latest/source/drivers/scsi/scsi_transport_fc.c#L2906)

For multipath, what really matters is fast_io_fail_tmo. dev_loss_tmo only matters if fast_io_fail_tmo is unset. fast_io_fail is preferred, because path failure/reinstantiation is much easier to handle than path removal/re-addition, on both kernel and user space level. The reason dev_loss_tmo is not infinity by default is twofold: 1) if fast_io_fail is not used and dev_loss_tmo is infinity, IOs might block on a removed device forever; 2) even with fast_io_fail, if a lost device doesn't come back after a long time, it might be good not to carry it around forever - chances are that the storage admin really removed the device or changed the zoning.

> The multipath layer
> interprets its value of TMO as when to clean up/remove the underlying
> path that when dev_loss_tmo is hit. TMO is used in both names, but
> they are not the same usage and meaning and the scsi layer's TMO
> should not be inherited by the multipath layer, as they don't appear
> to actually be the same thing. In multipath it should probably be
> called remove_fault_paths or something similar.

I'm not sure what you mean with "multipath layer". The kernel dm-multipath layer has nothing to do with dev_loss_tmo at all. multipath-tools don't "inherit" this value, either. They *set* it to match the settings from multipath.conf and the internal hwtable, taking other related settings into account (in particular, no_path_retry).

> This incorrect inheritance has caused issues, as prior to multipath
> inheriting TMO from the scsi layer, multipath did not remove the paths
> when IO failed for TMO time.

Sorry, no. multipathd *never* removes SCSI paths. If it receives an event about removal of a path, it updates its own data structures, and the maps in the dm-multipath layer. That's it. The only thing that multipath-tools do that may cause SCSI devices to get removed is to set dev_loss_tmo to a low value. But that would be a matter of (unusual) configuration.

> The paths prior to the inheritance
> stayed around and errored until the underlying issue was fixed, or a
> reboot happened, or until someone manually removed the failing paths.
> When I first saw this I had processes to deal with this, and we did
> noticed when it stated automatically cleaning up paths and it was good
> since it eliminated manual work, that is until it caused issues during
> firmware update. HPE's update to infinity will be a response to the
> inherited TMO change causing issues.

I'm wondering what you're talking about. dev_loss_tmo has been in the SCSI layer for ages.

Regards
Martin
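Martin's distinction corresponds to two separate sysfs attributes that exist on each FC remote port, both of which multipath-tools writes rather than inherits. The sketch below reads them; the rport name used here is hypothetical (real names look like rport-1:0-3), and the script prints a placeholder on systems without FC hardware.

```shell
# Sketch: fast_io_fail_tmo and dev_loss_tmo are independent per-rport knobs.
RPORT=${RPORT:-/sys/class/fc_remote_ports/rport-0:0-1}  # hypothetical path

out=$(for attr in fast_io_fail_tmo dev_loss_tmo; do
    if [ -r "$RPORT/$attr" ]; then
        printf '%s = %s\n' "$attr" "$(cat "$RPORT/$attr")"
    else
        printf '%s: rport not present on this system\n' "$attr"
    fi
done)
echo "$out"
```

When fast_io_fail_tmo fires, I/O errors are surfaced to dm-multipath while the SCSI device stays; only when dev_loss_tmo expires is the target actually removed, matching Martin's description.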
> You are confusing fast_io_fail_tmo and dev_loss_tmo. What you just
> described is fast_io_fail_tmo. If dev_loss_tmo expires, the SCSI layer
> does indeed remove the SCSI target. See comments on the
> fc_remote_port_delete() function.
> (https://elixir.bootlin.com/linux/latest/source/drivers/scsi/scsi_transport_fc.c#L2906)

The lpfc driver lets one set dev_loss_tmo, and the description of the parameter reads like it should be fast_io_fail_tmo rather than dev_loss_tmo; from how it is working, it appears to be used to set dev_loss_tmo in the scsi layer. And the lpfc driver does not have a setting for fast_io_fail_tmo, which would seem to be what is actually needed/wanted. The reason for setting it was that we have had fc fabric failures that did not result in an error being returned to multipath, such that multipath could not fail over to the other working paths.

> For multipath, what really matters is fast_io_fail_tmo. dev_loss_tmo
> only matters if fast_io_fail_tmo is unset. fast_io_fail is preferred,
> because path failure/reinstantiation is much easier to handle than path
> removal/re-addition, on both kernel and user space level. The reason
> dev_loss_tmo is not infinity by default is twofold: 1) if fast_io_fail
> is not used and dev_loss_tmo is infinity, IOs might block on a removed
> device forever; 2) even with fast_io_fail, if a lost device doesn't
> come back after a long time, it might be good not to carry it around
> forever - chances are that the storage admin really removed the device
> or changed the zoning.

We are thinking of setting dev_loss_tmo to 86400 (24 hours) as that is a happy medium: it leaves the paths around during reasonable events, but results in a clean-up at 24 hours.

> > The multipath layer
> > interprets its value of TMO as when to clean up/remove the underlying
> > path that when dev_loss_tmo is hit. TMO is used in both names, but
> > they are not the same usage and meaning and the scsi layer's TMO
> > should not be inherited by the multipath layer, as they don't appear
> > to actually be the same thing. In multipath it should probably be
> > called remove_fault_paths or something similar.
>
> I'm not sure what you mean with "multipath layer". The kernel dm-
> multipath layer has nothing to do with dev_loss_tmo at all. multipath-
> tools don't "inherit" this value, either. They *set* it to match the
> settings from multipath.conf and the internal hwtable, taking other
> related settings into account (in particular, no_path_retry).

ok.

> > This incorrect inheritance has caused issues, as prior to multipath
> > inheriting TMO from the scsi layer, multipath did not remove the paths
> > when IO failed for TMO time.
>
> Sorry, no. multipathd *never* removes SCSI paths. If it receives an
> event about removal of a path, it updates its own data structures, and
> the maps in the dm-multipath layer. That's it.
>
> I'm wondering what you're talking about. dev_loss_tmo has been in the
> SCSI layer for ages.

Do you have an idea how many years ago dev_loss_tmo started actually removing the device? I am guessing that what I saw was when it was backported into rhel, but I don't know exactly when it was backported.

Prior to that we had processes to evaluate why a given path was erroring and either fix it or clean it up, so the change was fairly easy for us to see. Maybe when that change went in, the lpfc driver should have started setting fast_io_fail_tmo rather than dev_loss_tmo in the scsi layer, as fast_io_fail_tmo is closer to what the described option does on the lpfc driver.

Thanks, I think this helps me understand how to tune things.
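For illustration, the 24-hour compromise mentioned above would look roughly like this in multipath.conf. This is a sketch, not a recommendation; the values are simply the ones discussed in this thread (fast_io_fail of 10 s from the hwtable entry, dev_loss of 86400 s), and per-array overrides would normally go in a devices section instead of defaults.

```
defaults {
	# return I/O errors quickly so dm-multipath can fail over to
	# the remaining working paths
	fast_io_fail_tmo 10
	# keep lost paths around for 24 hours before the SCSI layer
	# removes the devices
	dev_loss_tmo 86400
}
```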
On Thu, 2018-12-13 at 10:46 -0600, Roger Heflin wrote:
> > You are confusing fast_io_fail_tmo and dev_loss_tmo. What you just
> > described is fast_io_fail_tmo. If dev_loss_tmo expires, the SCSI
> > layer does indeed remove the SCSI target. See comments on the
> > fc_remote_port_delete() function.
>
> the lpfc driver lets one set dev_loss_tmo and the description on the
> parameter seems like it should be fast_io_fail_tmo rather that
> dev_loss_tmo, from how it is working it appears to be used to set
> dev_loss_tmo in the scsi layer.

On my system, the docstring of lpfc.devloss_tmo says "Seconds driver will hold I/O waiting for a device to come back". Which is basically true, although it does not say that when the waiting is over, the device node is removed.

> And the lpfc driver does not have a
> setting for the fast_io_fail_tmo and that would seem to be what is
> actually needed/wanted.

That is set via the generic scsi_transport_fc layer. Normally you do it with multipath-tools, as the parameter is only useful in multipath scenarios.

> The reason for setting it was we have had fc
> fabric failures that did not result in an error being return to
> multipath, such that multipath could not failover to the other working
> paths.

You should have been setting fast_io_fail_tmo in multipath.conf.

> > For multipath, what really matters is fast_io_fail_tmo. dev_loss_tmo
> > only matters if fast_io_fail_tmo is unset. fast_io_fail is preferred,
> > because path failure/reinstantiation is much easier to handle than
> > path removal/re-addition, on both kernel and user space level. The
> > reason dev_loss_tmo is not infinity by default is twofold: 1) if
> > fast_io_fail is not used and dev_loss_tmo is infinity, IOs might
> > block on a removed device forever; 2) even with fast_io_fail, if a
> > lost device doesn't come back after a long time, it might be good not
> > to carry it around forever - chances are that the storage admin
> > really removed the device or changed the zoning.
>
> We are thinking of setting dev_loss_tmo to 86400 (24 hours) as that is
> a happy medium, and leaves the paths around during reasonable events,
> but results in a clean-up at 24 hours.

That sounds reasonable. But that's a matter of policy, which differs vastly between different installations and administrator preferences. The point I'm trying to make is: it doesn't make a lot of sense to tie this setting to the storage hardware properties, as multipath currently does. It's really much more a matter of data center administration. That's different for fast_io_fail_tmo - it makes sense to relate this timeout to hardware properties, e.g. the time it takes to do failover or failback.

IMO, in a way, the different dev_loss_tmo settings in multipath's hardware table reflect the different vendors' ideas of how the storage should be administrated rather than the actual properties of the hardware.

> > I'm wondering what you're talking about. dev_loss_tmo has been in the
> > SCSI layer for ages.
>
> Do you have an idea how many years ago the dev_loss_tmo started
> actually removing the device? I am guessing when that was backported
> into rhel was what I saw it start, but I don't know exactly when it
> was backported.

I can see it in 2.6.12 (2005):

https://elixir.bootlin.com/linux/v2.6.12/source/drivers/scsi/scsi_transport_fc.c#L1549

You need to understand that, when time starts ticking towards dev_loss_tmo, the FC remote port is *already gone*. On the transport layer, there's nothing to "remove" any more. The kernel just keeps the SCSI layer structures and waits to see if the device comes back, as it would for temporary failures such as a network outage or an operator having pulled the wrong cable.

Regards
Martin
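Martin's point about the rport being "already gone" can be observed from user space: while dev_loss_tmo is ticking, the rport still appears in sysfs, but its port_state is no longer "Online". A sketch for checking that (the rport path is hypothetical, and the script just prints a placeholder where no FC rport exists):

```shell
# Sketch: inspect the transport-layer state of a remote port while the
# dev_loss_tmo timer runs (typically "Blocked" or "Not Present").
state_file=/sys/class/fc_remote_ports/rport-0:0-1/port_state  # hypothetical

if [ -r "$state_file" ]; then
    state=$(cat "$state_file")
else
    state="no FC rport on this system"
fi
echo "rport state: $state"
```

Once dev_loss_tmo expires, the rport directory and the associated SCSI devices disappear entirely, which is the removal behavior discussed above.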
On Thu, Dec 13, 2018 at 11:16 AM Martin Wilck <mwilck@suse.de> wrote:
> On my system, the docstring of lpfc.devloss_tmo says "Seconds driver
> will hold I/O waiting for a device to come back". Which is basically
> true, although it does not say that when the waiting is over, the
> device node is removed.

In older versions the device was not removed, so there was a behavior change.

> > And the lpfc driver does not have a
> > setting for the fast_io_fail_tmo and that would seem to be what is
> > actually needed/wanted.
>
> That is set via the generic scsi_transport_fc layer. Normally you do it
> with multipath-tools, as the parameter is only useful in multipath
> scenarios.
>
> You should have been setting fast_io_fail_tmo in multipath.conf.

When we started setting the lpfc parameter, multipath did not yet manage fast_io_fail_tmo nor dev_loss_tmo, so it was not an option.

> > We are thinking of setting dev_loss_tmo to 86400 (24 hours) as that
> > is a happy medium, and leaves the paths around during reasonable
> > events, but results in a clean-up at 24 hours.
>
> That sounds reasonable. But that's a matter of policy, which differs
> vastly between different installations and administrator preferences.
> The point I'm trying to make is: it doesn't make a lot of sense to tie
> this setting to the storage hardware properties, as multipath currently
> does. It's really much more a matter of data center administration.
>
> IMO, in a way, the different dev_loss_tmo settings in multipath's
> hardware table reflect the different vendors' ideas of how the storage
> should be administrated rather than the actual properties of the
> hardware.

And that is kind of the conclusion we were coming to: it is a preference of the datacenter admin.

> > Do you have an idea how many years ago the dev_loss_tmo started
> > actually removing the device? I am guessing when that was backported
> > into rhel was what I saw it start, but I don't know exactly when it
> > was backported.
>
> I can see it in 2.6.12 (2005):
>
> https://elixir.bootlin.com/linux/v2.6.12/source/drivers/scsi/scsi_transport_fc.c#L1549

In RHEL 5 (2.6.18+) I believe it did not actually delete the device until around 5.8, so all of the magic may not have quite been working yet.

> You need to understand that, when time starts ticking towards the
> dev_loss_tmo, the FC remote port is *already gone*. On the transport
> layer, there's nothing to "remove" any more. The kernel just keeps the
> SCSI layer structures and waits if the device comes back, as it would
> for temporary failures such as network outage or an operator having
> pulled the wrong cable.

We understand that. The issue seems to be that once the device is deleted, the process that brings the device back as a path when the rport/cable is fixed is not reliable (it fails, say, 1 in 100 events and causes issues) when the cable is far enough removed from the host (i.e. the array port connected to an fc switch). And when routed fc storage is involved, everything gets much less reliable all around. It always works if the cable issue being fixed is the cable to the host, but once it gets far enough away there seems to be an issue sometimes.

So the hope is that if the device is still there and still being probed by multipath, there will be less reliance on the imperfect fc magic. It may or may not help anything, and we may have to upgrade something and actively scan and resolve path issues manually before upgrading the next component that will take out another path.
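When a deleted path fails to reappear on its own as Roger describes, the usual manual fallback is a wildcard rescan of the SCSI hosts followed by a check in multipathd. A sketch (the scan write needs root; the guards make it a harmless no-op otherwise or on systems without FC hardware):

```shell
# Sketch: ask every SCSI host to rescan all channels/targets/LUNs, then
# check whether multipathd sees the path again.
for scan in /sys/class/scsi_host/host*/scan; do
    [ -w "$scan" ] && echo "- - -" > "$scan"  # "- - -" = wildcard scan
done
true  # an empty glob or unwritable file is not an error here

# multipathd's view of the paths (requires the daemon to be running):
result=$(command -v multipathd >/dev/null 2>&1 && multipathd show paths \
         || echo "multipathd not available")
: "${result:=no paths reported}"
echo "$result"
```

This does not fix the underlying rediscovery unreliability, but it is the manual "actively scan and resolve" step mentioned above.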
diff --git a/libmultipath/hwtable.c b/libmultipath/hwtable.c
index d3a8d9b..543bacd 100644
--- a/libmultipath/hwtable.c
+++ b/libmultipath/hwtable.c
@@ -116,6 +116,7 @@ static struct hwentry default_hw[] = {
 		.prio_name = PRIO_ALUA,
 		.no_path_retry = 18,
 		.fast_io_fail = 10,
+		/* infinity is needed by Peer Persistence */
 		.dev_loss = MAX_DEV_LOSS_TMO,
 	},
 	{
It's needed by Peer Persistence, documented in SLES and RHEL guides:
https://support.hpe.com/hpsc/doc/public/display?docId=a00053835
https://support.hpe.com/hpsc/doc/public/display?docId=c04448818

Cc: Christophe Varoqui <christophe.varoqui@opensvc.com>
Cc: DM-DEVEL ML <dm-devel@redhat.com>
Signed-off-by: Xose Vazquez Perez <xose.vazquez@gmail.com>
---
 libmultipath/hwtable.c | 1 +
 1 file changed, 1 insertion(+)