diff mbox series

[3/3] libmultipath: reset nr_timeouts if we freed the context

Message ID 1678229374-18788-4-git-send-email-bmarzins@redhat.com (mailing list archive)
State Not Applicable, archived
Delegated to: christophe varoqui
Headers show
Series handle transitioning devices in TUR checker | expand

Commit Message

Benjamin Marzinski March 7, 2023, 10:49 p.m. UTC
If a the tur checker creates a new context because an old thread is
still running, but the old thread finishes before the checker drops
the old context, the checker should reset nr_timeouts to 0, since
the old thread did complete in time.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
 libmultipath/checkers/tur.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Martin Wilck March 8, 2023, 6:17 p.m. UTC | #1
On Tue, 2023-03-07 at 16:49 -0600, Benjamin Marzinski wrote:
> If a the tur checker creates a new context because an old thread is
> still running, but the old thread finishes before the checker drops
> the old context, the checker should reset nr_timeouts to 0, since
> the old thread did complete in time.

This looks strange to me. First of all, the old thread _did_ timeout,
otherwise we wouldn't be in this code path at all. And even if you say
this timeout shouldn't count, as the thread isn't in stalled state any
more, shouldn't you just decrease nr_timeouts by 1? The fact that
_this_ thread terminated doesn't tell us anything about other hanging
threads.

Martin




> 
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> ---
>  libmultipath/checkers/tur.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/libmultipath/checkers/tur.c
> b/libmultipath/checkers/tur.c
> index a19becf5..fe6a2f14 100644
> --- a/libmultipath/checkers/tur.c
> +++ b/libmultipath/checkers/tur.c
> @@ -406,9 +406,11 @@ int libcheck_check(struct checker * c)
>                         }
>                         ((struct tur_checker_context *)c->context)-
> >nr_timeouts = ct->nr_timeouts;
>  
> -                       if (!uatomic_sub_return(&ct->holders, 1))
> +                       if (!uatomic_sub_return(&ct->holders, 1)) {
>                                 /* It did terminate, eventually */
>                                 cleanup_context(ct);
> +                               ((struct tur_checker_context *)c-
> >context)->nr_timeouts = 0;
> +                       }
>  
>                         ct = c->context;
>                 } else

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Benjamin Marzinski March 8, 2023, 8:37 p.m. UTC | #2
On Wed, Mar 08, 2023 at 06:17:05PM +0000, Martin Wilck wrote:
> On Tue, 2023-03-07 at 16:49 -0600, Benjamin Marzinski wrote:
> > If a the tur checker creates a new context because an old thread is
> > still running, but the old thread finishes before the checker drops
> > the old context, the checker should reset nr_timeouts to 0, since
> > the old thread did complete in time.
> 
> This looks strange to me. First of all, the old thread _did_ timeout,
> otherwise we wouldn't be in this code path at all. And even if you say
> this timeout shouldn't count, as the thread isn't in stalled state any
> more, shouldn't you just decrease nr_timeouts by 1? The fact that
> _this_ thread terminated doesn't tell us anything about other hanging
> threads.

We can't track the old threads once we drop the old context.  So
instead, we just assume that if the last thread we created did complete
we're back to a normal state, and reset nr_timeouts to 0. This change is
just extending that same logic to the case where the old thread didn't
complete until the last possible moment where we could find that out.

The MAX_NR_TIMEOUTS code was added because a user ran into a situation
where a scsi kernel issue made TUR commands block uninterruptibly
forever, and multipathd kept making more threads, tens of thousnds of
them. The goal was to stop creating threads if two in a row haven't
completed, and wait for the last thread to complete. It's not the most
robust solution possible, but I think it's good enough. We could revist
the whole solution, but the way things are, nr_timeouts = 1 was supposed
to mean that, as far as we know, the last thread was cancelled but
didn't complete, and we are going to try one more thread before stopping
and waiting. I'm just fixing to code to take care of the corner case
where we find out that the last thread did complete when we drop the old
context. 

-Ben
 
> Martin
> 
> 
> 
> 
> > 
> > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> > ---
> >  libmultipath/checkers/tur.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/libmultipath/checkers/tur.c
> > b/libmultipath/checkers/tur.c
> > index a19becf5..fe6a2f14 100644
> > --- a/libmultipath/checkers/tur.c
> > +++ b/libmultipath/checkers/tur.c
> > @@ -406,9 +406,11 @@ int libcheck_check(struct checker * c)
> >                         }
> >                         ((struct tur_checker_context *)c->context)-
> > >nr_timeouts = ct->nr_timeouts;
> >  
> > -                       if (!uatomic_sub_return(&ct->holders, 1))
> > +                       if (!uatomic_sub_return(&ct->holders, 1)) {
> >                                 /* It did terminate, eventually */
> >                                 cleanup_context(ct);
> > +                               ((struct tur_checker_context *)c-
> > >context)->nr_timeouts = 0;
> > +                       }
> >  
> >                         ct = c->context;
> >                 } else
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
diff mbox series

Patch

diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c
index a19becf5..fe6a2f14 100644
--- a/libmultipath/checkers/tur.c
+++ b/libmultipath/checkers/tur.c
@@ -406,9 +406,11 @@  int libcheck_check(struct checker * c)
 			}
 			((struct tur_checker_context *)c->context)->nr_timeouts = ct->nr_timeouts;
 
-			if (!uatomic_sub_return(&ct->holders, 1))
+			if (!uatomic_sub_return(&ct->holders, 1)) {
 				/* It did terminate, eventually */
 				cleanup_context(ct);
+				((struct tur_checker_context *)c->context)->nr_timeouts = 0;
+			}
 
 			ct = c->context;
 		} else