Message ID | 1537571127-10143-19-git-send-email-bmarzins@redhat.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
Delegated to: | christophe varoqui |
Headers | show |
Series | Misc Multipath patches | expand |
On Fri, 2018-09-21 at 18:05 -0500, Benjamin Marzinski wrote: > When pathinfo fails for some likely transient reason, it clears the > path > wwid, but otherwise returns successfully, to keep the path around but > not usable until it gets fully initialized. However, if the path has > already been initialized, and pathinfo hits a transient error, it > shouldn't clear the wwid. > > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> > --- > libmultipath/discovery.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/libmultipath/discovery.c b/libmultipath/discovery.c > index 3e0db7f..33815dc 100644 > --- a/libmultipath/discovery.c > +++ b/libmultipath/discovery.c > @@ -1991,9 +1991,9 @@ blank: > /* > * Recoverable error, for example faulty or offline path > */ > - memset(pp->wwid, 0, WWID_SIZE); > pp->chkrstate = pp->state = PATH_DOWN; > - pp->initialized = INIT_FAILED; > + if (pp->initialized == INIT_FAILED) > + memset(pp->wwid, 0, WWID_SIZE); > > return PATHINFO_OK; > } I am uncertain about this one. The old code sets pp->initialized to INIT_FAILED. If the state had been INIT_MISSING_UDEV or INIT_REQUESTED_UDEV before, this patch might change how the code behaves later in check_path(), where these conditions are checked. Likewise, tests for strlen(pp->wwid) are used in various places around the code. These tests would now yield different results for paths in "recoverable error" state. Have you considered these possible side effects? Martin
Hi Ben, On Tue, 2018-10-02 at 00:00 +0200, Martin Wilck wrote: > On Fri, 2018-09-21 at 18:05 -0500, Benjamin Marzinski wrote: > > When pathinfo fails for some likely transient reason, it clears the > > path > > wwid, but otherwise returns successfully, to keep the path around > > but > > not usable until it gets fully initialized. However, if the path > > has > > already been initialized, and pathinfo hits a transient error, it > > shouldn't clear the wwid. > > > > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> > > --- > > libmultipath/discovery.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/libmultipath/discovery.c b/libmultipath/discovery.c > > index 3e0db7f..33815dc 100644 > > --- a/libmultipath/discovery.c > > +++ b/libmultipath/discovery.c > > @@ -1991,9 +1991,9 @@ blank: > > /* > > * Recoverable error, for example faulty or offline path > > */ > > - memset(pp->wwid, 0, WWID_SIZE); > > pp->chkrstate = pp->state = PATH_DOWN; > > - pp->initialized = INIT_FAILED; > > + if (pp->initialized == INIT_FAILED) > > + memset(pp->wwid, 0, WWID_SIZE); > > > > return PATHINFO_OK; > > } > > I am uncertain about this one. The old code sets pp->initialized to > INIT_FAILED. If the state had been INIT_MISSING_UDEV or > INIT_REQUESTED_UDEV before, this patch might change how the code > behaves later in check_path(), where these conditions are checked. > > Likewise, tests for strlen(pp->wwid) are used in various places > around > the code. These tests would now yield different results for paths in > "recoverable error" state. > > Have you considered these possible side effects? I've pondered over this a lot. The dust is clearing up a bit. 1. With your patch in place, INIT_FAILED is never set except in alloc_path() (we might rename it to INIT_NEW or the like, but see below). 2. I don't understand how you handle repeated failure to retrieve the WWID. I see that get_uid() (actually, scsi_uid_fallback()) would retrieve the WWID from sysfs after retriggers are exhausted. But I don't see how pathinfo(DI_WWID) would ever be called in this situation: In the last invocation, pathinfo() had failed to retrieve the WWID and set pp->initialized = INIT_MISSING_UDEV. There it will remain because check_path() won't set it to INIT_REQUESTED_UDEV any more after retries are exhausted. And now, check_path() won't call pathinfo(DI_ALL) any more from the "add missing path" code, because of the (pp->initialized != INIT_MISSING_UDEV) condition. Am I overlooking something? 3. If "blank" state means that important device information couldn't be retrieved because of presumably transient failure conditions, we should retry to retrieve this information by calling pathinfo again later. But unless the WWID is (reset to) the empty string, check_path() won't call pathinfo(DI_ALL) any more. 4. The "blank" logic in pathinfo() combines several very different cases. a) PATH_REMOVED status from path_offline(). This means that elementary sysfs attributes were missing. This is almost the same as failure in sysfs_pathinfo(), which results in PATHINFO_FAILED return status; but for PATH_REMOVED we return PATHINFO_OK and keep the path around. b) Failure in checker_check(). If the path is offline in the first place, the checker isn't called, and WWID determination is attempted. But if the checker returns PATH_UNCHECKED or PATH_WILD, we goto "blank" state. c) Failure in scsi_ioctl_pathinfo() or cciss_ioctl_pathinfo(). Both functions never fail, so this can't happen. I've patches here to fix that. d) Failure to open pp->fd. d) is the only case in which the "blank" logic makes really sense to me. It can happen only at the first pathinfo() invocation, meaning pp->wwid is still empty, and pp->initialized is INIT_FAILED. Your patch would change nothing for this case. a) and b) can happen for paths that have been initialized already. I think in case a) the WWID should be reset, probably initialized should be set to INIT_FAILED, and PATHINFO_FAILED should be returned. In case b) we should IMO proceed normally rather than goto "blank". Resetting the WWID in case b) is nonsense, agreed. Altogether, if my analysis is correct, your patch (not blanking the WWID) should be applied to case b) only. Please comment - I still feel a bit confused and may have overlooked something essential. Regards Martin
On Wed, Oct 03, 2018 at 12:37:17AM +0200, Martin Wilck wrote: > Hi Ben, > > > I am uncertain about this one. The old code sets pp->initialized to > > INIT_FAILED. If the state had been INIT_MISSING_UDEV or > > INIT_REQUESTED_UDEV before, this patch might change how the code > > behaves later in check_path(), where these conditions are checked. > > > > Likewise, tests for strlen(pp->wwid) are used in various places > > around > > the code. These tests would now yield different results for paths in > > "recoverable error" state. > > > > Have you considered these possible side effects? > > I've pondered over this a lot. The dust is clearing up a bit. > > 1. With your patch in place, INIT_FAILED is never set except in > alloc_path() (we might rename it to INIT_NEW or the like, but see > below). Correct. The idea is that INIT_FAILED means that the path has never been fully initialized and that it is missing more than simply the wwid (which would set INIT_MISSING_UDEV instead). > 2. I don't understand how you handle repeated failure to retrieve the > WWID. I see that get_uid() (actually, scsi_uid_fallback()) would > retrieve the WWID from sysfs after retriggers are exhausted. But I > don't see how pathinfo(DI_WWID) would ever be called in this situation: > > In the last invocation, pathinfo() had failed to retrieve the WWID and > set pp->initialized = INIT_MISSING_UDEV. There it will remain because > check_path() won't set it to INIT_REQUESTED_UDEV any more after retries > are exhausted. And now, check_path() won't call pathinfo(DI_ALL) any > more from the "add missing path" code, because of the (pp->initialized > != INIT_MISSING_UDEV) condition. > > Am I overlooking something? Your analysis looks right. This shouldn't have anything to do with the correctness of the not blanking initialized paths, but we should fix it anyway. Right now, as you mentioned, if multipathd fails to get the wwid after retrigger_tries new uevents, it only has one shot at the failback methods, and then mutipathd cannot use that path in the future. However, it will still check if the path is up or down, which is pretty pointless. It should either 1. set pp->intialized to INIT_OK, to keep it from being checked at all in the future, or 2. try to grab the missing pathinfo, so that it can use the path. The benefit of doing the first is that mutipathd doesn't keep messing with paths that really aren't set up to be multipathed. On the other hand it might ignore some paths that really should be multipathed. perhaps the best idea is to keep trying to get the missing info, but just increase pp->tick (possibly progressively) so we try less often. > 3. If "blank" state means that important device information couldn't be > retrieved because of presumably transient failure conditions, we should > retry to retrieve this information by calling pathinfo again later. But > unless the WWID is (reset to) the empty string, check_path() won't call > pathinfo(DI_ALL) any more. But if we set the wwid at some point in time, that means we got all the necessary information. My argument is that if something happens to cause a later pathinfo() call to fail, we shouldn't pretend like we didn't already get this info. > 4. The "blank" logic in pathinfo() combines several very different > cases. > a) PATH_REMOVED status from path_offline(). This means that > elementary sysfs attributes were missing. This is almost the same as > failure in sysfs_pathinfo(), which results in PATHINFO_FAILED return > status; but for PATH_REMOVED we return PATHINFO_OK and keep the path > around. True, but I've always assumed that this means that we will be getting a uevent to remove this path momentarily, so I doesn't really matter what we do with it. We could blank the path here, but if we already go the necessary info, it seems like just setting the path to PATH_DOWN, and waiting for the remove uevent should be fine. > b) Failure in checker_check(). If the path is offline in the first > place, the checker isn't called, and WWID determination is attempted. > But if the checker returns PATH_UNCHECKED or PATH_WILD, we goto "blank" > state. This is clearly the wrong thing to do. > c) Failure in scsi_ioctl_pathinfo() or cciss_ioctl_pathinfo(). Both > functions never fail, so this can't happen. I've patches here to fix > that. I just ignored this state because it was impossible. But even with your patches, my argument remains the same. If we already got this info, failing to get it again shouldn't cause us to delete the wwid. We should probably remember the existing values, to make sure that we don't lose them if the calls fail for some reason. > d) Failure to open pp->fd. This is clearly the right thing to do. > d) is the only case in which the "blank" logic makes really sense to > me. It can happen only at the first pathinfo() invocation, meaning > pp->wwid is still empty, and pp->initialized is INIT_FAILED. Your patch > would change nothing for this case. > > a) and b) can happen for paths that have been initialized already. I > think in case a) the WWID should be reset, probably initialized should > be set to INIT_FAILED, and PATHINFO_FAILED should be returned. In case > b) we should IMO proceed normally rather than goto "blank". Resetting > the WWID in case b) is nonsense, agreed. > > Altogether, if my analysis is correct, your patch (not blanking the > WWID) should be applied to case b) only. I don't really care about blanking the wwid in case a). I just think it's unnecessary. Why do you think we should blank it in case c) if we have previously gotten those values? -Ben > Please comment - I still feel a bit confused and may have overlooked > something essential. > > Regards > Martin > > -- > Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107 > SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton > HRB 21284 (AG Nürnberg) > -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Fri, 2018-09-21 at 18:05 -0500, Benjamin Marzinski wrote: > When pathinfo fails for some likely transient reason, it clears the > path > wwid, but otherwise returns successfully, to keep the path around but > not usable until it gets fully initialized. However, if the path has > already been initialized, and pathinfo hits a transient error, it > shouldn't clear the wwid. > > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> After follow-up discussion: Reviewed-by: Martin Wilck <mwilck@suse.com> > --- > libmultipath/discovery.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/libmultipath/discovery.c b/libmultipath/discovery.c > index 3e0db7f..33815dc 100644 > --- a/libmultipath/discovery.c > +++ b/libmultipath/discovery.c > @@ -1991,9 +1991,9 @@ blank: > /* > * Recoverable error, for example faulty or offline path > */ > - memset(pp->wwid, 0, WWID_SIZE); > pp->chkrstate = pp->state = PATH_DOWN; > - pp->initialized = INIT_FAILED; > + if (pp->initialized == INIT_FAILED) > + memset(pp->wwid, 0, WWID_SIZE); > > return PATHINFO_OK; > }
On Fri, 2018-10-05 at 14:38 -0500, Benjamin Marzinski wrote: > On Wed, Oct 03, 2018 at 12:37:17AM +0200, Martin Wilck wrote: > > Hi Ben, > > > > > I am uncertain about this one. The old code sets pp->initialized > > > to > > > INIT_FAILED. If the state had been INIT_MISSING_UDEV or > > > INIT_REQUESTED_UDEV before, this patch might change how the code > > > behaves later in check_path(), where these conditions are > > > checked. > > > > > > Likewise, tests for strlen(pp->wwid) are used in various places > > > around > > > the code. These tests would now yield different results for paths > > > in > > > "recoverable error" state. > > > > > > Have you considered these possible side effects? > > > > I've pondered over this a lot. The dust is clearing up a bit. > > > > 1. With your patch in place, INIT_FAILED is never set except in > > alloc_path() (we might rename it to INIT_NEW or the like, but see > > below). > > Correct. The idea is that INIT_FAILED means that the path has never > been > fully initialized and that it is missing more than simply the wwid > (which would set INIT_MISSING_UDEV instead). > > > 2. I don't understand how you handle repeated failure to retrieve > > the > > WWID. I see that get_uid() (actually, scsi_uid_fallback()) would > > retrieve the WWID from sysfs after retriggers are exhausted. But I > > don't see how pathinfo(DI_WWID) would ever be called in this > > situation: > > > > In the last invocation, pathinfo() had failed to retrieve the WWID > > and > > set pp->initialized = INIT_MISSING_UDEV. There it will remain > > because > > check_path() won't set it to INIT_REQUESTED_UDEV any more after > > retries > > are exhausted. And now, check_path() won't call pathinfo(DI_ALL) > > any > > more from the "add missing path" code, because of the (pp- > > >initialized > > != INIT_MISSING_UDEV) condition. > > > > Am I overlooking something? > > Your analysis looks right. This shouldn't have anything to do with > the > correctness of the not blanking initialized paths, but we should fix > it > anyway. Right now, as you mentioned, if multipathd fails to get the > wwid after retrigger_tries new uevents, it only has one shot at the > failback methods, and then mutipathd cannot use that path in the > future. > However, it will still check if the path is up or down, which is > pretty > pointless. It should either > > 1. set pp->intialized to INIT_OK, to keep it from being checked at > all > in the future, or That would be lying to ourselves. I'd rather call it INIT_BROKEN or so, or maybe even delete the path from pathvec altogether. [ While we're at this: we call pathinfo with DI_ALL from configure->path_discovery(), but with DI_ALL|DI_BLACKLIST from uev_add_path(). As a result, paths discovered in uevents won't be added to pathvec in the first place if they're blacklisted, while blacklisted paths discovered at startup will be in the pathvec, but orphaned. Is that intentional, and if yes, why ? ] > 2. try to grab the missing pathinfo, so that it can use the path. > > The benefit of doing the first is that mutipathd doesn't keep messing > with paths that really aren't set up to be multipathed. On the other > hand it might ignore some paths that really should be multipathed. > perhaps the best idea is to keep trying to get the missing info, but > just increase pp->tick (possibly progressively) so we try less often. The latter sounds overly complex to me, in particular as in the daemon, we're only looking at paths retrieved via udev anyway, as you pointed out in your other post. So perhaps INIT_MISSING_UDEV is just about the right state for this. We may just consider changing uev_update_path() such that the path is passed to uev_add_path() not only for INIT_REQUESTED_UDEV, but also for INIT_MISSING_UDEV state. > > > 3. If "blank" state means that important device information > > couldn't be > > retrieved because of presumably transient failure conditions, we > > should > > retry to retrieve this information by calling pathinfo again later. > > But > > unless the WWID is (reset to) the empty string, check_path() won't > > call > > pathinfo(DI_ALL) any more. > > But if we set the wwid at some point in time, that means we got all > the > necessary information. My argument is that if something happens to > cause a later pathinfo() call to fail, we shouldn't pretend like we > didn't already get this info. The question is whether the problem encountered means that the earlier determined WWID is still reliable. If the path is entirely gone from sysfs (case a) below), I wouldn't bet on that. In this case we rely on the assumption that we'll receive an uevent later - likely but not guaranteed. OTOH, by blanking the WWID, we forfeit the opportunity to detect a WWID change later - with that in mind, blanking is actually wrong in this situation, and your patch should be an improvement. > > > 4. The "blank" logic in pathinfo() combines several very different > > cases. > > a) PATH_REMOVED status from path_offline(). This means that > > elementary sysfs attributes were missing. This is almost the same > > as > > failure in sysfs_pathinfo(), which results in PATHINFO_FAILED > > return > > status; but for PATH_REMOVED we return PATHINFO_OK and keep the > > path > > around. > > True, but I've always assumed that this means that we will be getting > a > uevent to remove this path momentarily, so I doesn't really matter > what > we do with it. We could blank the path here, but if we already go the > necessary info, it seems like just setting the path to PATH_DOWN, and > waiting for the remove uevent should be fine. Perhaps. I just wanted to point out the different treatment of two very similar scenarios. > > b) Failure in checker_check(). If the path is offline in the > > first > > place, the checker isn't called, and WWID determination is > > attempted. > > But if the checker returns PATH_UNCHECKED or PATH_WILD, we goto > > "blank" > > state. > > This is clearly the wrong thing to do. > > > c) Failure in scsi_ioctl_pathinfo() or cciss_ioctl_pathinfo(). > > Both > > functions never fail, so this can't happen. I've patches here to > > fix > > that. > > I just ignored this state because it was impossible. But even with > your > patches, my argument remains the same. If we already got this info, > failing to get it again shouldn't cause us to delete the wwid. We > should > probably remember the existing values, to make sure that we don't > lose > them if the calls fail for some reason. My patches wouldn't change a thing here. They just turn these two functions to void type. > > > d) Failure to open pp->fd. > > This is clearly the right thing to do. > > > d) is the only case in which the "blank" logic makes really sense > > to > > me. It can happen only at the first pathinfo() invocation, meaning > > pp->wwid is still empty, and pp->initialized is INIT_FAILED. Your > > patch > > would change nothing for this case. > > > > a) and b) can happen for paths that have been initialized already. > > I > > think in case a) the WWID should be reset, probably initialized > > should > > be set to INIT_FAILED, and PATHINFO_FAILED should be returned. In > > case > > b) we should IMO proceed normally rather than goto "blank". > > Resetting > > the WWID in case b) is nonsense, agreed. > > > > Altogether, if my analysis is correct, your patch (not blanking the > > WWID) should be applied to case b) only. > > I don't really care about blanking the wwid in case a). I just think > it's unnecessary. Why do you think we should blank it in case c) if > we > have previously gotten those values? Summarizing, a) you convinced me about that one, b) your patch is good, but we may need additional changes in a follow- up patch, c) doesn't happen, d) I agreed in the first place. And my concern about INIT_MISSING_UDEV was orthogonal to your patch anyway. So I'll give your patch a reviewed-by. Sorry for the noise. I guess you'll concede the handling of path states in multipath-tools is non-trivial. Martin
On Mon, Oct 08, 2018 at 11:41:35AM +0200, Martin Wilck wrote: > On Fri, 2018-10-05 at 14:38 -0500, Benjamin Marzinski wrote: > > On Wed, Oct 03, 2018 at 12:37:17AM +0200, Martin Wilck wrote: > > Your analysis looks right. This shouldn't have anything to do with > > the > > correctness of the not blanking initialized paths, but we should fix > > it > > anyway. Right now, as you mentioned, if multipathd fails to get the > > wwid after retrigger_tries new uevents, it only has one shot at the > > failback methods, and then mutipathd cannot use that path in the > > future. > > However, it will still check if the path is up or down, which is > > pretty > > pointless. It should either > > > > 1. set pp->intialized to INIT_OK, to keep it from being checked at > > all > > in the future, or > > That would be lying to ourselves. I'd rather call it INIT_BROKEN or so, > or maybe even delete the path from pathvec altogether. > > [ While we're at this: we call pathinfo with DI_ALL from > configure->path_discovery(), but with DI_ALL|DI_BLACKLIST from > uev_add_path(). As a result, paths discovered in uevents won't be added > to pathvec in the first place if they're blacklisted, while blacklisted > paths discovered at startup will be in the pathvec, but orphaned. Is > that intentional, and if yes, why ? ] I really don't know why configure() works this way. The call to filter_path() just after path_discovery() will remove the blacklisted paths, so they won't actually end up in the pathvec. But it does seem pretty pointless to not just check if they are blacklisted while initializing them. > > 2. try to grab the missing pathinfo, so that it can use the path. > > > > The benefit of doing the first is that mutipathd doesn't keep messing > > with paths that really aren't set up to be multipathed. On the other > > hand it might ignore some paths that really should be multipathed. > > perhaps the best idea is to keep trying to get the missing info, but > > just increase pp->tick (possibly progressively) so we try less often. > > The latter sounds overly complex to me, in particular as in the daemon, > we're only looking at paths retrieved via udev anyway, as you pointed > out in your other post. So perhaps INIT_MISSING_UDEV is just about the > right state for this. We may just consider changing uev_update_path() > such that the path is passed to uev_add_path() not only for > INIT_REQUESTED_UDEV, but also for INIT_MISSING_UDEV state. The thought behind not just using uev_update_path() is that it doesn't get called very predicably. There's a chance that the path may become usable, but we won't get any event for it. There's also a chance that a path that will never have a wwid (and thus shouldn't be multipathed) will be part of a uevent storm, and multipath will do a bunch of pointless work. If we do the checks in the checkerloop(), we can control exactly how often we check these paths. > > Summarizing, > > a) you convinced me about that one, > b) your patch is good, but we may need additional changes in a follow- > up patch, > c) doesn't happen, > d) I agreed in the first place. > > And my concern about INIT_MISSING_UDEV was orthogonal to your patch > anyway. So I'll give your patch a reviewed-by. Sorry for the noise. I > guess you'll concede the handling of path states in multipath-tools is > non-trivial. Sure. We do need to figure out what to do when we really can't get a wwid for the device, but that can happen in a different patchset. -Ben > Martin > > -- > Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107 > SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton > HRB 21284 (AG Nürnberg) > -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
diff --git a/libmultipath/discovery.c b/libmultipath/discovery.c index 3e0db7f..33815dc 100644 --- a/libmultipath/discovery.c +++ b/libmultipath/discovery.c @@ -1991,9 +1991,9 @@ blank: /* * Recoverable error, for example faulty or offline path */ - memset(pp->wwid, 0, WWID_SIZE); pp->chkrstate = pp->state = PATH_DOWN; - pp->initialized = INIT_FAILED; + if (pp->initialized == INIT_FAILED) + memset(pp->wwid, 0, WWID_SIZE); return PATHINFO_OK; }
When pathinfo fails for some likely transient reason, it clears the path wwid, but otherwise returns successfully, to keep the path around but not usable until it gets fully initialized. However, if the path has already been initialized, and pathinfo hits a transient error, it shouldn't clear the wwid. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> --- libmultipath/discovery.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)