
multipathd: the sysfs prioritizer can return stale data

Message ID 20240124020007.25259-1-brian@purestorage.com (mailing list archive)
State Not Applicable, archived
Delegated to: christophe varoqui
Series multipathd: the sysfs prioritizer can return stale data

Commit Message

Brian Bunker Jan. 24, 2024, 2 a.m. UTC
When a path is lost and then reinstated later, the ALUA
device handler will not pick up this change and continue
to possibly provide incorrect (stale) information about
its ALUA state to the 'sysfs' prioritizer if the path's
priorities were changed prior to the loss of those paths.

On the loss of an I_T nexus, the path state should not
continue to use its last known state. Much can change in the
potentially long interval between the path loss and its
eventual reinstatement.

The ALUA device handler didn't have this issue, since it
always got the ALUA state from the target by sending an RTPG
request on the path and updated the priority. However, with
detect_prio set to true by default, the 'sysfs' prioritizer
will be used on targets that support ALUA rather than the
'alua' prioritizer.

Without re-evaluating the ALUA state when a path returns,
multipath is left with path states that do not reflect
the actual ALUA states the target is providing.

3624a9370f0f545fc7c3e46a100011010 dm-2 PURE,FlashArray
size=3.0T features='0' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 13:0:0:1 sdg 8:96  active ready running
| |- 12:0:0:1 sdf 8:80  active ready running
| |- 9:0:0:1  sdc 8:32  active ready running
| `- 1:0:0:1  sdb 8:16  active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 14:0:0:1 sdh 8:112 active ready running
  |- 15:0:0:1 sdi 8:128 active ready running
  |- 10:0:0:1 sdd 8:48  active ready running
  `- 11:0:0:1 sde 8:64  active ready running

 # sg_rtpg /dev/sdh (Active/Optimized)
target port group asymmetric access state : 0x00

Signed-off-by: Brian Bunker <brian@purestorage.com>
Signed-off-by: Seamus Connor <sconnor@purestorage.com>
---
 multipathd/main.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Benjamin Marzinski Jan. 30, 2024, 4:41 p.m. UTC | #1
On Tue, Jan 23, 2024 at 06:00:07PM -0800, Brian Bunker wrote:
> When a path is lost and then reinstated later, the ALUA
> device handler will not pick up this change and continue
> to possibly provide incorrect (stale) information about
> its ALUA state to the 'sysfs' prioritizer if the path's
> priorities were changed prior to the loss of those paths.
> 
> On the loss of an I_T nexus, the path state should not
> continue to use its last known state. Many things and a lot
> of time could have passed between the path loss and its
> eventual reinstatemnt.
> 
> The ALUA device handler didn't have this issue since it
> always got the ALUA state from the target by sending an RTPG
> request on the path and updated the priority. However with
> the detect priority set to true by default, the 'sysfs'
> prioritzier will be used on targets that support ALUA rather
> than the 'alua' prioritizer.
> 
> Without re-evaluating the ALUA state when a path returns
> multipath is left with path states which do not reflect
> the actual ALUA states the target is providing.
> 
> 3624a9370f0f545fc7c3e46a100011010 dm-2 PURE,FlashArray
> size=3.0T features='0' hwhandler='1 alua' wp=rw
> |-+- policy='service-time 0' prio=50 status=active
> | |- 13:0:0:1 sdg 8:96  active ready running
> | |- 12:0:0:1 sdf 8:80  active ready running
> | |- 9:0:0:1  sdc 8:32  active ready running
> | `- 1:0:0:1  sdb 8:16  active ready running
> `-+- policy='service-time 0' prio=10 status=enabled
>   |- 14:0:0:1 sdh 8:112 active ready running
>   |- 15:0:0:1 sdi 8:128 active ready running
>   |- 10:0:0:1 sdd 8:48  active ready running
>   `- 11:0:0:1 sde 8:64  active ready running
> 
>  # sg_rtpg /dev/sdh (Active/Optimized)
> target port group asymmetric access state : 0x00
> 
> Signed-off-by: Brian Bunker <brian@purestorage.com>
> Signed-off-by: Seamus Connor <sconnor@purestorage.com>
> ---
>  multipathd/main.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/multipathd/main.c b/multipathd/main.c
> index 230c9d10..dd48be74 100644
> --- a/multipathd/main.c
> +++ b/multipathd/main.c
> @@ -1937,6 +1937,11 @@ reinstate_path (struct path * pp)
>  	else {
>  		condlog(2, "%s: reinstated", pp->dev_t);
>  		update_queue_mode_add_path(pp->mpp);
> +		if (strcmp(pp->prio.name, PRIO_SYSFS) == 0) {
> +			condlog(2, "%s: rescan target to update priorities",
> +				pp->dev_t);
> +			rescan_path(pp->udev);

I can see why this is necessary with the sysfs target, but AFAICT
rescanning the scsi device will end up calling scsi_execute_cmd to
update the vpd pages, and this can block.  Obviously, we are doing this
after the checker has verified that the path is up, but even still,
we're trying to cut down on the amount of code that can block multipathd
on an inaccessible device, instead of increase it.

Perhaps it would be better to make the prioritizer itself smarter, so
that it could know when it needs to run the rescan(), and it could do
that in a separate thread.

Thoughts?

-Ben

> +		}
>  	}
>  }
>  
> -- 
> 2.43.0
Martin Wilck Jan. 30, 2024, 5:57 p.m. UTC | #2
On Tue, 2024-01-30 at 11:41 -0500, Benjamin Marzinski wrote:
> On Tue, Jan 23, 2024 at 06:00:07PM -0800, Brian Bunker wrote:
> > When a path is lost and then reinstated later, the ALUA
> > device handler will not pick up this change and continue
> > to possibly provide incorrect (stale) information about
> > its ALUA state to the 'sysfs' prioritizer if the path's
> > priorities were changed prior to the loss of those paths.
> > 
> > On the loss of an I_T nexus, the path state should not
> > continue to use its last known state. Many things and a lot
> > of time could have passed between the path loss and its
> > eventual reinstatemnt.
> > 
> > The ALUA device handler didn't have this issue since it
> > always got the ALUA state from the target by sending an RTPG
> > request on the path and updated the priority. However with
> > the detect priority set to true by default, the 'sysfs'
> > prioritzier will be used on targets that support ALUA rather
> > than the 'alua' prioritizer.
> > 
> > Without re-evaluating the ALUA state when a path returns
> > multipath is left with path states which do not reflect
> > the actual ALUA states the target is providing.
> > 
> > 3624a9370f0f545fc7c3e46a100011010 dm-2 PURE,FlashArray
> > size=3.0T features='0' hwhandler='1 alua' wp=rw
> > |-+- policy='service-time 0' prio=50 status=active
> > | |- 13:0:0:1 sdg 8:96  active ready running
> > | |- 12:0:0:1 sdf 8:80  active ready running
> > | |- 9:0:0:1  sdc 8:32  active ready running
> > | `- 1:0:0:1  sdb 8:16  active ready running
> > `-+- policy='service-time 0' prio=10 status=enabled
> >   |- 14:0:0:1 sdh 8:112 active ready running
> >   |- 15:0:0:1 sdi 8:128 active ready running
> >   |- 10:0:0:1 sdd 8:48  active ready running
> >   `- 11:0:0:1 sde 8:64  active ready running
> > 
> >  # sg_rtpg /dev/sdh (Active/Optimized)
> > target port group asymmetric access state : 0x00
> > 
> > Signed-off-by: Brian Bunker <brian@purestorage.com>
> > Signed-off-by: Seamus Connor <sconnor@purestorage.com>
> > ---
> >  multipathd/main.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/multipathd/main.c b/multipathd/main.c
> > index 230c9d10..dd48be74 100644
> > --- a/multipathd/main.c
> > +++ b/multipathd/main.c
> > @@ -1937,6 +1937,11 @@ reinstate_path (struct path * pp)
> >  	else {
> >  		condlog(2, "%s: reinstated", pp->dev_t);
> >  		update_queue_mode_add_path(pp->mpp);
> > +		if (strcmp(pp->prio.name, PRIO_SYSFS) == 0) {
> > +			condlog(2, "%s: rescan target to update
> > priorities",
> > +				pp->dev_t);
> > +			rescan_path(pp->udev);
> 
> I can see why this is necessary with the sysfs target, but AFAICT
> rescanning the scsi device will end up calling scsi_execute_cmd to
> update the vpd pages, and this can block.  Obviously, we are doing
> this
> after the checker has verified that the path is up, but even still,
> we're trying to cut down on the amount of code that can block
> multipathd
> on an inaccessible device, instead of increase it.
> 
> Perhaps it would be better to make the prioritizer itself smarter, so
> that it could know when it needs to run the rescan(), and it could do
> that in a separate thread.
> 
> Thoughts?

A full rescan shouldn't be necessary. All that's needed is that the
kernel issue another RTPG. AFAICS that should happen as soon as the
target responds to any command with a sense key of UNIT ATTENTION with
ASC=0x2a and ascq=6 or 7 (ALUA state change, ALUA state transition
failed).

@Brian, does your storage unit not do this? If so, I suggest we disable
the sysfs prioritizer for pure storage.

Otherwise, as far as multipathd is concerned, when a path is
reinstated, it should be sufficient to send any IO command to trigger
an RTPG. Or am I missing something here?

Martin
Brian Bunker Jan. 30, 2024, 6:43 p.m. UTC | #3
> 
> On Jan 30, 2024, at 9:57 AM, Martin Wilck <mwilck@suse.com> wrote:
> 
> On Tue, 2024-01-30 at 11:41 -0500, Benjamin Marzinski wrote:
>> On Tue, Jan 23, 2024 at 06:00:07PM -0800, Brian Bunker wrote:
>>> When a path is lost and then reinstated later, the ALUA
>>> device handler will not pick up this change and continue
>>> to possibly provide incorrect (stale) information about
>>> its ALUA state to the 'sysfs' prioritizer if the path's
>>> priorities were changed prior to the loss of those paths.
>>> 
>>> On the loss of an I_T nexus, the path state should not
>>> continue to use its last known state. Many things and a lot
>>> of time could have passed between the path loss and its
>>> eventual reinstatemnt.
>>> 
>>> The ALUA device handler didn't have this issue since it
>>> always got the ALUA state from the target by sending an RTPG
>>> request on the path and updated the priority. However with
>>> the detect priority set to true by default, the 'sysfs'
>>> prioritzier will be used on targets that support ALUA rather
>>> than the 'alua' prioritizer.
>>> 
>>> Without re-evaluating the ALUA state when a path returns
>>> multipath is left with path states which do not reflect
>>> the actual ALUA states the target is providing.
>>> 
>>> 3624a9370f0f545fc7c3e46a100011010 dm-2 PURE,FlashArray
>>> size=3.0T features='0' hwhandler='1 alua' wp=rw
>>> |-+- policy='service-time 0' prio=50 status=active
>>> | |- 13:0:0:1 sdg 8:96  active ready running
>>> | |- 12:0:0:1 sdf 8:80  active ready running
>>> | |- 9:0:0:1  sdc 8:32  active ready running
>>> | `- 1:0:0:1  sdb 8:16  active ready running
>>> `-+- policy='service-time 0' prio=10 status=enabled
>>>   |- 14:0:0:1 sdh 8:112 active ready running
>>>   |- 15:0:0:1 sdi 8:128 active ready running
>>>   |- 10:0:0:1 sdd 8:48  active ready running
>>>   `- 11:0:0:1 sde 8:64  active ready running
>>> 
>>>  # sg_rtpg /dev/sdh (Active/Optimized)
>>> target port group asymmetric access state : 0x00
>>> 
>>> Signed-off-by: Brian Bunker <brian@purestorage.com>
>>> Signed-off-by: Seamus Connor <sconnor@purestorage.com>
>>> ---
>>>  multipathd/main.c | 5 +++++
>>>  1 file changed, 5 insertions(+)
>>> 
>>> diff --git a/multipathd/main.c b/multipathd/main.c
>>> index 230c9d10..dd48be74 100644
>>> --- a/multipathd/main.c
>>> +++ b/multipathd/main.c
>>> @@ -1937,6 +1937,11 @@ reinstate_path (struct path * pp)
>>>   else {
>>>   condlog(2, "%s: reinstated", pp->dev_t);
>>>   update_queue_mode_add_path(pp->mpp);
>>> + if (strcmp(pp->prio.name, PRIO_SYSFS) == 0) {
>>> + condlog(2, "%s: rescan target to update
>>> priorities",
>>> + pp->dev_t);
>>> + rescan_path(pp->udev);
>> 
>> I can see why this is necessary with the sysfs target, but AFAICT
>> rescanning the scsi device will end up calling scsi_execute_cmd to
>> update the vpd pages, and this can block.  Obviously, we are doing
>> this
>> after the checker has verified that the path is up, but even still,
>> we're trying to cut down on the amount of code that can block
>> multipathd
>> on an inaccessible device, instead of increase it.
>> 
>> Perhaps it would be better to make the prioritizer itself smarter, so
>> that it could know when it needs to run the rescan(), and it could do
>> that in a separate thread.
>> 
>> Thoughts?
> 
> A full rescan shouldn't be necessary. All that's needed is that the
> kernel issue another RTPG. AFAICS that should happen as soon as the
> target responds to any command with a sense key of UNIT ATTENTION with
> ASC=0x2a and ascq=6 or 7 (ALUA state change, ALUA state transition
> failed).
> 
> @Brian, does your storage unit not do this? If so, I suggest we disable
> the sysfs prioritizer for pure storage.
> 
> Otherwise, as far as multipathd is concerned, when a path is
> reinstated, it should be sufficient to send any IO command to trigger
> an RTPG. Or am I missing something here?
> 
> Martin

Martin,

What I am gaining with the rescan is exactly that. You are correct that with the ALUA device handler, the kernel has to send an RTPG to the target.

We do set a unit attention to have the initiator paths go into the ANO state before the reboot leading to the path loss, but we do not set a unit attention when the paths come back up. We have relied on the initiators' polling to pick up the ALUA state change, which it always has in the past and the 'alua' prioritizer still will. Adding a unit attention would work for us, but there are a couple of issues with that.

1. Unit attentions may not get back to the initiator. Delivery is not guaranteed.
2. Paths could take a very long time to come back. Sometimes it is just a reboot; other times it is a hardware replacement. It is possible for us to keep this state forever and post it when that I_T nexus returns, but we haven't had to.

If we did post the unit attention, everything would work as expected. I have verified this, but I would also hope that the polling of the checkers would unstick my stale ALUA state so that we don't have to.

I put this rescan_path inline to show the problem and the fix. I wasn't sure of the 'right' place to put it. I get that it would be better not to block on this; it should be possible to put it in a thread so that it does not. The other caller of rescan_path is, I guess, doing the same thing when it is handling the wwid change.

Thanks,
Brian
Martin Wilck Jan. 31, 2024, 10:19 a.m. UTC | #4
Hi Brian,

On Tue, 2024-01-30 at 10:43 -0800, Brian Bunker wrote:
> > 
> > A full rescan shouldn't be necessary. All that's needed is that the
> > kernel issue another RTPG. AFAICS that should happen as soon as the
> > target responds to any command with a sense key of UNIT ATTENTION
> > with
> > ASC=0x2a and ascq=6 or 7 (ALUA state change, ALUA state transition
> > failed).
> > 
> > @Brian, does your storage unit not do this? If so, I suggest we
> > disable
> > the sysfs prioritizer for pure storage.
> > 
> > Otherwise, as far as multipathd is concerned, when a path is
> > reinstated, it should be sufficient to send any IO command to
> > trigger
> > an RTPG. Or am I missing something here?
> > 
> > Martin
> 
> Martin,
> 
> What I am gaining with the rescan is exactly that. You are correct
> the ALUA device handler the kernel has to send an RTPG to the target.
> 
> We do set a unit attention to have the initiator paths go into the
> ANO state before we reboot leading to the path loss, but we do not
> set a unit attention when the paths come back up.
> 
>  We have relied on the initiator’s polling to pick up the ALUA state
> change which they always have in the past and the ‘alua’ prioritizer
> still will. For us to add a unit attention would work, but there are
> couple of issues with that.
> 
> 1. Unit attentions may not get back to the initiator. It is not
> guaranteed.

That's news for me. If that happens, wouldn't it mean that the
initiator sees a timeout (no response) to some command, IOW that
there's still something very wrong with this I_T nexus?

> 2. Paths could take a very long time to come back. We might not get
> these paths back for a very long time. Sometimes it is just a reboot.
> Other times it is a hardware replacement. It is possible for us to
> keep this state forever and post when when that I_T nexus returns but
> we haven’t had to.

No offense, that sounds somewhat lazy ;-) Note that it's also kind of
dangerous. You are hiding the state change from the initiator. If the
Linux kernel decided to use the access_state device attribute for
anything else but feeding the sysfs attribute, things might go badly
wrong.

> If we did post the unit attention, everything works as expected. I
> have verified this, but I would also hope that the polling of the
> checkers would also unstick my stale ALUA state and we won’t have to.
> 
> I put this rescan_path inline to show the problem and the fix. I
> wasn’t sure the ‘right’ place to put it. I get that it would be
> better not to block on this. It should be possible to put this in a
> thread so that it does not. The other caller of rescan_path I guess
> is also doing the same thing when it is handling the wwid change.

That's true and not optimal, but wwid changes are rare events and an error condition in their own right. Patches moving rescan_path() into a thread would receive sympathetic reviews :-)
The big benefit of the sysfs prioritizer is that it never blocks, without needing pthreads complexity.

Btw, I think you'd need to wait for the RTPG to finish after triggering
the rescan, if you want to obtain the right priority (alua_rescan() ->
alua_initialize() -> alua_check_vpd() will only queue an RTPG and not
wait for the result).

Unfortunately, the kernel has no API for manually triggering an update
of the access_state. I believe that would be useful elsewhere, too. We
can consider adding it, but it won't help with current kernels.

IMO the best option for your storage arrays would be to force using the
alua prioritizer rather than the sysfs one. You are not alone, we're
doing this for RDAC already (see check_rdac() call in detect_prio()).
This can be configured in multipath.conf right now by setting
"detect_prio no" and "prio alua", and we can make it the default for
your storage with a trivial patch for hwtable.c.
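For reference, the multipath.conf override Martin describes would look roughly like this. The vendor/product strings are taken from the topology above; treat the exact stanza as an illustrative sketch, not a verified configuration.

```
devices {
    device {
        vendor      "PURE"
        product     "FlashArray"
        detect_prio "no"
        prio        "alua"
    }
}
```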

Regards
Martin
Brian Bunker Jan. 31, 2024, 5:45 p.m. UTC | #5
> On Jan 31, 2024, at 2:19 AM, Martin Wilck <mwilck@suse.com> wrote:
> 
> Hi Brian,
> 
> On Tue, 2024-01-30 at 10:43 -0800, Brian Bunker wrote:
>>> 
>>> A full rescan shouldn't be necessary. All that's needed is that the
>>> kernel issue another RTPG. AFAICS that should happen as soon as the
>>> target responds to any command with a sense key of UNIT ATTENTION
>>> with
>>> ASC=0x2a and ascq=6 or 7 (ALUA state change, ALUA state transition
>>> failed).
>>> 
>>> @Brian, does your storage unit not do this? If so, I suggest we
>>> disable
>>> the sysfs prioritizer for pure storage.
>>> 
>>> Otherwise, as far as multipathd is concerned, when a path is
>>> reinstated, it should be sufficient to send any IO command to
>>> trigger
>>> an RTPG. Or am I missing something here?
>>> 
>>> Martin
>> 
>> Martin,
>> 
>> What I am gaining with the rescan is exactly that. You are correct
>> the ALUA device handler the kernel has to send an RTPG to the target.
>> 
>> We do set a unit attention to have the initiator paths go into the
>> ANO state before we reboot leading to the path loss, but we do not
>> set a unit attention when the paths come back up.
>> 
>> We have relied on the initiator’s polling to pick up the ALUA state
>> change which they always have in the past and the ‘alua’ prioritizer
>> still will. For us to add a unit attention would work, but there are
>> couple of issues with that.
>> 
>> 1. Unit attentions may not get back to the initiator. It is not
>> guaranteed.
> 
> That's news for me. If that happens, wouldn't it mean that the
> initiator sees a timeout (no response) to some command, IOW that
> there's still something very wrong with this I_T nexus?
Probably. My point is just that any individual response could be lost
anywhere, and there is no burden on the target to ensure the initiator
got the unit attention.
> 
>> 2. Paths could take a very long time to come back. We might not get
>> these paths back for a very long time. Sometimes it is just a reboot.
>> Other times it is a hardware replacement. It is possible for us to
>> keep this state forever and post when when that I_T nexus returns but
>> we haven’t had to.
> 
> No offense, that sounds somewhat lazy ;-) Note that it's also kind of
> dangerous. You are hiding the state change from the initiator. If the
> Linux kernel decided to use the access_state device attribute for
> anything else but feeding the sysfs attribute, things might go badly
> wrong.
That is fair. We definitely could do better here. In general, that unit
attention coming out of the preferred state didn’t buy us any speed.
Those non-preferred paths weren’t serving I/O so the first I/O that
would pick up the unit attention on those paths would be the path
checker. The same run of the path checker picked up the new ALUA
state. When going into the preferred state, there is read and write
I/O which means those unit attentions are picked up very quickly and
the ALUA state change is picked up in the kernel before the checker
runs again.

Have you ever considered an RTPG-based checker as opposed to TUR?
That would seemingly solve a lot of this too, since you would be
getting path state and priority in the same round trip.
> 
>> If we did post the unit attention, everything works as expected. I
>> have verified this, but I would also hope that the polling of the
>> checkers would also unstick my stale ALUA state and we won’t have to.
>> 
>> I put this rescan_path inline to show the problem and the fix. I
>> wasn’t sure the ‘right’ place to put it. I get that it would be
>> better not to block on this. It should be possible to put this in a
>> thread so that it does not. The other caller of rescan_path I guess
>> is also doing the same thing when it is handling the wwid change.
> 
> That's true and not optimal, but wwid changes are rare events and an
> error condition in its own right. Patches converting moving
> rescan_path() into a thread would receive sympathetic reviews :-)
> The big benefit of the sysfs prioritizer is that it never blocks,
> without needing pthreads complexity.
> 
> Btw, I think you'd need to wait for the RTPG to finish after triggering
> the rescan, if you want to obtain the right priority (alua_rescan() ->
> alua_initialize() -> alua_check_vpd() will only queue an RTPG and not
> wait for the result).
For our purposes we didn't really need to wait for the rescan, as long as
it happened; the next time the checker ran, it would pick it up. The paths
returning for us are redundant paths. We want them back as soon as possible,
but we have other paths that can serve I/O while waiting for the HA paths.

I can create a patch in the sysfs prioritizer to do the rescan_path in a thread
that the checker and priority run don't wait on. Would that be well received,
or am I better served by either posting a unit attention or just setting
detect_prio to no and leaving the 'sysfs' prioritizer alone?
> 
> Unfortunately, the kernel has no API for manually triggering an update
> of the access_state. I believe that would be useful elsewhere, too. We
> can consider adding it, but it won't help with current kernels.
> 
> IMO the best option for your storage arrays would is to force using the
> alua prioritizer rather than the sysfs one. You are not alone, we're
> doing this for RDAC already (see check_rdac() call in detect_prio()).
> This can be configured in multipath.conf right now by setting
> "detect_prio no" and "prio alua", and we can make it the default for
> your storage with a trivial patch for hwtable.c.
This is what we are doing now in our recommended configuration. I will
probably add a patch for our hw table entry soon. It is still a bit strange
to me that detect_prio would mean replacing the one that I am explicitly
stating in the device section. To me, detect_prio would apply only if I
didn't provide one and wanted multipath to choose for me.
> 
> Regards
> Martin
> 
Thanks,
Brian
Martin Wilck Jan. 31, 2024, 8:12 p.m. UTC | #6
Hi Brian,

On Wed, 2024-01-31 at 09:45 -0800, Brian Bunker wrote:
> 
> 
> > On Jan 31, 2024, at 2:19 AM, Martin Wilck <mwilck@suse.com> wrote:
> > 
> > Hi Brian,
> > 
> > On Tue, 2024-01-30 at 10:43 -0800, Brian Bunker wrote:
> > > > 
> > > > A full rescan shouldn't be necessary. All that's needed is that
> > > > the
> > > > kernel issue another RTPG. AFAICS that should happen as soon as
> > > > the
> > > > target responds to any command with a sense key of UNIT
> > > > ATTENTION
> > > > with
> > > > ASC=0x2a and ascq=6 or 7 (ALUA state change, ALUA state
> > > > transition
> > > > failed).
> > > > 
> > > > @Brian, does your storage unit not do this? If so, I suggest we
> > > > disable
> > > > the sysfs prioritizer for pure storage.
> > > > 
> > > > Otherwise, as far as multipathd is concerned, when a path is
> > > > reinstated, it should be sufficient to send any IO command to
> > > > trigger
> > > > an RTPG. Or am I missing something here?
> > > > 
> > > > Martin
> > > 
> > > Martin,
> > > 
> > > What I am gaining with the rescan is exactly that. You are
> > > correct
> > > the ALUA device handler the kernel has to send an RTPG to the
> > > target.
> > > 
> > > We do set a unit attention to have the initiator paths go into
> > > the
> > > ANO state before we reboot leading to the path loss, but we do
> > > not
> > > set a unit attention when the paths come back up.
> > > 
> > > We have relied on the initiator’s polling to pick up the ALUA
> > > state
> > > change which they always have in the past and the ‘alua’
> > > prioritizer
> > > still will. For us to add a unit attention would work, but there
> > > are
> > > couple of issues with that.
> > > 
> > > 1. Unit attentions may not get back to the initiator. It is not
> > > guaranteed.
> > 
> > That's news for me. If that happens, wouldn't it mean that the
> > initiator sees a timeout (no response) to some command, IOW that
> > there's still something very wrong with this I_T nexus?
> Probably. My point is just that any individual response could be lost
> wherever and there is no burden on the target to ensure the initiator
> got the unit attention.

Right. In the specific case of the ALUA-related UAs that we've been
discussing, the storage *might* implement logic to keep responding
with a UA until an RTPG is received. But that may have unwanted side
effects; I haven't thought it through.

> > 
> > > 2. Paths could take a very long time to come back. We might not
> > > get
> > > these paths back for a very long time. Sometimes it is just a
> > > reboot.
> > > Other times it is a hardware replacement. It is possible for us
> > > to
> > > keep this state forever and post when when that I_T nexus returns
> > > but
> > > we haven’t had to.
> > 
> > No offense, that sounds somewhat lazy ;-) Note that it's also kind
> > of
> > dangerous. You are hiding the state change from the initiator. If
> > the
> > Linux kernel decided to use the access_state device attribute for
> > anything else but feeding the sysfs attribute, things might go
> > badly
> > wrong.
> That is fair. We definitely could do better here. In general, that
> unit
> attention coming out of the preferred state didn’t buy us any speed.
> Those non-preferred paths weren’t serving I/O so the first I/O that
> would pick up the unit attention on those paths would be the path
> checker. The same run of the path checker picked up the new ALUA
> state. When going into the preferred state, there is read and write
> I/O which means those unit attentions are picked up very quickly and
> the ALUA state change is picked up in the kernel before the checker
> runs again.
> 
> Have you ever considered a checker of RTPG as opposed to TUR?
> That would seeming solve a lot of this too since you would be getting
> path state and priority in the same trip.

Interesting idea. AFAICT, no one has thought about it so far. In the
past I've invested some thought in tying the checker and prioritizer
more closely together. Unfortunately, the current multipathd
architecture treats them as entirely separate, which makes this
complicated to achieve.

> > 
> > > If we did post the unit attention, everything works as expected.
> > > I
> > > have verified this, but I would also hope that the polling of the
> > > checkers would also unstick my stale ALUA state and we won’t have
> > > to.
> > > 
> > > I put this rescan_path inline to show the problem and the fix. I
> > > wasn’t sure the ‘right’ place to put it. I get that it would be
> > > better not to block on this. It should be possible to put this in
> > > a
> > > thread so that it does not. The other caller of rescan_path I
> > > guess
> > > is also doing the same thing when it is handling the wwid change.
> > 
> > That's true and not optimal, but wwid changes are rare events and
> > an
> > error condition in its own right. Patches converting moving
> > rescan_path() into a thread would receive sympathetic reviews :-)
> > The big benefit of the sysfs prioritizer is that it never blocks,
> > without needing pthreads complexity.
> > 
> > Btw, I think you'd need to wait for the RTPG to finish after
> > triggering
> > the rescan, if you want to obtain the right priority (alua_rescan()
> > ->
> > alua_initialize() -> alua_check_vpd() will only queue an RTPG and
> > not
> > wait for the result).
> For our purposes we didn’t really need to wait for the rescan. As
> long as
> it happened. The next time the checker ran it would pick it up. These
> paths
> returning for us are redundant paths. We want them back as soon as
> possible
> but we have other paths that can serve I/O while waiting for the HA
> paths.
> 
> I can create a patch in the sysfs prioritizer to do the rescan_path
> in a thread
> that the checker and priority run doesn’t wait on. Would that be well
> received
> or I am better served by either posting a unit attention or just
> using detect_prio
> set to no and leaving the ’sysfs’ prioritizer alone?

I am not sure. Like I said, the beauty of the sysfs prioritizer is that
it doesn't do any IO; rescanning a device from this prioritizer
basically voids that benefit. You might as well just use alua.
As I said, we do it for RDAC as well.

But if you want to invest more effort, feel free to submit patches ;-)
You can do this any time on top of the hwtable change.

> > Unfortunately, the kernel has no API for manually triggering an
> > update
> > of the access_state. I believe that would be useful elsewhere, too.
> > We
> > can consider adding it, but it won't help with current kernels.
> > 
> > IMO the best option for your storage arrays would is to force using
> > the
> > alua prioritizer rather than the sysfs one. You are not alone,
> > we're
> > doing this for RDAC already (see check_rdac() call in
> > detect_prio()).
> > This can be configured in multipath.conf right now by setting
> > "detect_prio no" and "prio alua", and we can make it the default
> > for
> > your storage with a trivial patch for hwtable.c.
> This is what we are doing now in our recommended configuration. I
> will
> probably add a patch for our hw table entry soon. It is a bit strange
> to
> me still that detect_prio would mean replace the one that I am
> explicitly
> stating in the device section. To me detect_prio would be if I didn’t
> provide one and wanted multipath to choose for me.

True. The design of the "detect_prio" option is awkward and confusing.
It happens all the time that people set "prio" and wonder why the
setting isn't applied. Or worse, they think it is applied but it's not.
I think the original idea was to use ALUA whenever supported, even if
historically the hwtable contained something else for a given storage.
It has been this way for a long time, I don't think we can change the
semantics easily.
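
For reference, the workaround described above would look roughly like
this in multipath.conf (an illustrative sketch, not an official
recommendation; the vendor and product strings are taken from the array
in the commit message):

```
devices {
	device {
		vendor		"PURE"
		product		"FlashArray"
		detect_prio	no
		prio		alua
	}
}
```

With detect_prio set to no, multipathd honors the explicit prio setting
instead of auto-selecting the sysfs prioritizer for ALUA-capable
targets.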

Regards
Martin

> > 
> > Regards
> > Martin
> > 
> Thanks,
> Brian
>
Benjamin Marzinski Jan. 31, 2024, 8:34 p.m. UTC | #7
On Wed, Jan 31, 2024 at 09:12:23PM +0100, Martin Wilck wrote:
> Hi Brian,
> > This is what we are doing now in our recommended configuration. I
> > will probably add a patch for our hwtable entry soon. It still seems
> > strange to me that detect_prio means replacing the prioritizer I am
> > explicitly setting in the device section. To me, detect_prio should
> > only apply if I didn’t provide one and wanted multipath to choose
> > for me.
> 
> True. The design of the "detect_prio" option is awkward and confusing.
> It happens all the time that people set "prio" and wonder why the
> setting isn't applied. Or worse, they think it is applied but it's not.
> I think the original idea was to use ALUA whenever supported, even if
> historically the hwtable contained something else for a given storage.
> It has been this way for a long time, I don't think we can change the
> semantics easily.
> 

It's so that you can select the fallback prioritizer if multipath can't
autodetect one. This made more sense when detect_prio defaulted to "no",
so autodetection was only used by devices that specifically enabled it.
Now it just is what it is.

-Ben
Brian Bunker Jan. 31, 2024, 10:23 p.m. UTC | #8
> On Jan 31, 2024, at 12:12 PM, Martin Wilck <mwilck@suse.com> wrote:
> 
> Hi Brian,
> 
> On Wed, 2024-01-31 at 09:45 -0800, Brian Bunker wrote:
>> 
>> 
>>> On Jan 31, 2024, at 2:19 AM, Martin Wilck <mwilck@suse.com> wrote:
>>> 
>>> Hi Brian,
>>> 
>>> On Tue, 2024-01-30 at 10:43 -0800, Brian Bunker wrote:
>>>>> 
>>>>> A full rescan shouldn't be necessary. All that's needed is that
>>>>> the
>>>>> kernel issue another RTPG. AFAICS that should happen as soon as
>>>>> the
>>>>> target responds to any command with a sense key of UNIT
>>>>> ATTENTION
>>>>> with
>>>>> ASC=0x2a and ascq=6 or 7 (ALUA state change, ALUA state
>>>>> transition
>>>>> failed).
>>>>> 
>>>>> @Brian, does your storage unit not do this? If so, I suggest we
>>>>> disable the sysfs prioritizer for Pure Storage.
>>>>> 
>>>>> Otherwise, as far as multipathd is concerned, when a path is
>>>>> reinstated, it should be sufficient to send any IO command to
>>>>> trigger
>>>>> an RTPG. Or am I missing something here?
>>>>> 
>>>>> Martin
>>>> 
>>>> Martin,
>>>> 
>>>> What I am gaining with the rescan is exactly that. You are correct
>>>> that with the ALUA device handler, the kernel has to send an RTPG
>>>> to the target.
>>>> 
>>>> We do set a unit attention to have the initiator paths go into the
>>>> ANO state before the reboot that leads to the path loss, but we do
>>>> not set a unit attention when the paths come back up.
>>>> 
>>>> We have relied on the initiators’ polling to pick up the ALUA state
>>>> change, which they always have in the past and which the ‘alua’
>>>> prioritizer still will. For us to add a unit attention would work,
>>>> but there are a couple of issues with that.
>>>> 
>>>> 1. Unit attentions may not get back to the initiator. It is not
>>>> guaranteed.
>>> 
>>> That's news to me. If that happens, wouldn't it mean that the
>>> initiator sees a timeout (no response) to some command, IOW that
>>> there's still something very wrong with this I_T nexus?
>> Probably. My point is just that any individual response could be lost
>> anywhere along the way, and there is no burden on the target to
>> ensure the initiator got the unit attention.
> 
> Right. In the specific case of the ALUA-related UAs that we've been
> discussing, the storage *might* implement logic to repeat responding
> with a UA until an RTPG is received. But that may have unwanted side
> effects; I haven't thought it through.
> 
>>> 
>>>> 2. Paths could take a very long time to come back. Sometimes it is
>>>> just a reboot; other times it is a hardware replacement. It is
>>>> possible for us to keep this state forever and post it when that
>>>> I_T nexus returns, but we haven’t had to.
>>> 
>>> No offense, that sounds somewhat lazy ;-) Note that it's also kind
>>> of
>>> dangerous. You are hiding the state change from the initiator. If
>>> the
>>> Linux kernel decided to use the access_state device attribute for
>>> anything else but feeding the sysfs attribute, things might go
>>> badly
>>> wrong.
>> That is fair. We definitely could do better here. In general, that
>> unit attention coming out of the preferred state didn’t buy us any
>> speed: those non-preferred paths weren’t serving I/O, so the first
>> I/O to pick up the unit attention on those paths would be the path
>> checker, and the same run of the path checker picked up the new ALUA
>> state. When going into the preferred state, there is read and write
>> I/O, which means those unit attentions are picked up very quickly and
>> the ALUA state change reaches the kernel before the checker runs
>> again.
>> 
>> Have you ever considered a checker based on RTPG as opposed to TUR?
>> That would seemingly solve a lot of this too, since you would be
>> getting path state and priority in the same trip.
> 
> Interesting idea. AFAICT, no one has thought about it so far. In the
> past I've invested some thought in tying checker and prioritizer more
> closely together. Unfortunately, the current multipathd architecture
> treats them as entirely separate, which makes it complicated to
> achieve this.
Yeah. It would be a re-do of the separation between checkers and
prioritizers, and might leave you with a mess if you had to keep
everything that isn’t ALUA functioning the same way. I think it might
make check_path much cleaner, though.
>>> 
>>>> If we did post the unit attention, everything would work as
>>>> expected; I have verified this. But I would also hope that the
>>>> polling of the checkers could unstick my stale ALUA state so that
>>>> we won’t have to.
>>>> 
>>>> I put this rescan_path inline to show the problem and the fix. I
>>>> wasn’t sure the ‘right’ place to put it. I get that it would be
>>>> better not to block on this. It should be possible to put this in a
>>>> thread so that it does not. The other caller of rescan_path, I
>>>> guess, is also doing the same thing when it is handling the wwid
>>>> change.
>>> 
>>> That's true and not optimal, but wwid changes are rare events and
>>> error conditions in their own right. Patches moving rescan_path()
>>> into a thread would receive sympathetic reviews :-)
>>> The big benefit of the sysfs prioritizer is that it never blocks,
>>> without needing pthreads complexity.
>>> 
>>> Btw, I think you'd need to wait for the RTPG to finish after
>>> triggering the rescan, if you want to obtain the right priority
>>> (alua_rescan() -> alua_initialize() -> alua_check_vpd() will only
>>> queue an RTPG and not wait for the result).
>> For our purposes we didn’t really need to wait for the rescan; as
>> long as it happened, the next time the checker ran it would pick it
>> up. These returning paths are redundant paths for us. We want them
>> back as soon as possible, but we have other paths that can serve I/O
>> while waiting for the HA paths.
>> 
>> I can create a patch in the sysfs prioritizer to do the rescan_path
>> in a thread that the checker and priority run don’t wait on. Would
>> that be well received, or am I better served by either posting a unit
>> attention or just setting detect_prio to no and leaving the ’sysfs’
>> prioritizer alone?
> 
> I am not sure. Like I said, the beauty of the sysfs prioritizer is
> that it doesn't do any IO. Rescanning a device from this prioritizer
> basically voids that benefit. You might as well just use alua.
> As I said, we do it for RDAC as well.
> 
> But if you want to invest more effort, feel free to submit patches ;-)
> You can do this any time on top of the hwtable change.
OK, I will submit the hwtable change first, since there is no
controversy there at all.

My change only affects the reinstate path, so it isn’t like ‘alua’,
where priority is evaluated with an RTPG on every checker run.
The rescan should be rare, since I_T nexus loss shouldn’t be too common.
> 
>>> Unfortunately, the kernel has no API for manually triggering an
>>> update of the access_state. I believe that would be useful
>>> elsewhere, too. We can consider adding it, but it won't help with
>>> current kernels.
>>> 
>>> IMO the best option for your storage arrays would be to force using
>>> the alua prioritizer rather than the sysfs one. You are not alone;
>>> we're doing this for RDAC already (see the check_rdac() call in
>>> detect_prio()).
>>> This can be configured in multipath.conf right now by setting
>>> "detect_prio no" and "prio alua", and we can make it the default
>>> for your storage with a trivial patch for hwtable.c.
>> This is what we are doing now in our recommended configuration. I
>> will probably add a patch for our hwtable entry soon. It still seems
>> strange to me that detect_prio means replacing the prioritizer I am
>> explicitly setting in the device section. To me, detect_prio should
>> only apply if I didn’t provide one and wanted multipath to choose
>> for me.
> 
> True. The design of the "detect_prio" option is awkward and confusing.
> It happens all the time that people set "prio" and wonder why the
> setting isn't applied. Or worse, they think it is applied but it's not.
> I think the original idea was to use ALUA whenever supported, even if
> historically the hwtable contained something else for a given storage.
> It has been this way for a long time, I don't think we can change the
> semantics easily.
I saw Ben’s later comment explaining the history of how detect_prio
came to be this way.
> 
> Regards
> Martin
Thanks,
Brian
> 
>>> 
>>> Regards
>>> Martin
>>> 
>> Thanks,
>> Brian
diff mbox series

Patch

diff --git a/multipathd/main.c b/multipathd/main.c
index 230c9d10..dd48be74 100644
--- a/multipathd/main.c
+++ b/multipathd/main.c
@@ -1937,6 +1937,11 @@  reinstate_path (struct path * pp)
 	else {
 		condlog(2, "%s: reinstated", pp->dev_t);
 		update_queue_mode_add_path(pp->mpp);
+		if (strcmp(pp->prio.name, PRIO_SYSFS) == 0) {
+			condlog(2, "%s: rescan target to update priorities",
+				pp->dev_t);
+			rescan_path(pp->udev);
+		}
 	}
 }