[v3,01/19] libmultipath: fix tur checker timeout

Message ID	1537571127-10143-2-git-send-email-bmarzins@redhat.com (mailing list archive)
State	Not Applicable, archived
Delegated to:	christophe varoqui
Headers	show Return-Path: <dm-devel-bounces@redhat.com> From: "Benjamin Marzinski" <bmarzins@redhat.com> To: device-mapper development <dm-devel@redhat.com> Date: Fri, 21 Sep 2018 18:05:09 -0500 Message-Id: <1537571127-10143-2-git-send-email-bmarzins@redhat.com> In-Reply-To: <1537571127-10143-1-git-send-email-bmarzins@redhat.com> References: <1537571127-10143-1-git-send-email-bmarzins@redhat.com> Cc: Martin Wilck <mwilck@suse.com> Subject: [dm-devel] [PATCH v3 01/19] libmultipath: fix tur checker timeout Precedence: junk MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: dm-devel-bounces@redhat.com Errors-To: dm-devel-bounces@redhat.com
Series	Misc Multipath patches \| expand [v3,00/19] Misc Multipath patches [v3,01/19] libmultipath: fix tur checker timeout [v3,02/19] libmultipath: fix tur checker double locking [v3,03/19] libmultipath: fix tur memory misuse [v3,04/19] libmultipath: cleanup tur locking [v3,05/19] libmultipath: fix tur checker timeout issue [v3,06/19] libmultipath: fix set_int error path [v3,07/19] libmultipath: fix length issues in get_vpd_sgio [v3,08/19] libmultipath: _install_keyword cleanup [v3,09/19] libmultipath: remove unused code [v3,10/19] libmultipath: fix memory issue in path_latency prio [v3,11/19] libmultipath: fix null dereference int alloc_path_group [v3,12/19] libmutipath: don't use malformed uevents [v3,13/19] multipath: fix max array size in print_cmd_valid [v3,14/19] multipathd: function return value tweaks [v3,15/19] multipathd: minor fixes [v3,16/19] multipathd: remove useless check and fix format [v3,17/19] multipathd: fix memory leak on error in configure [v3,18/19] libmultipath: Don't blank intialized paths [v3,19/19] libmultipath: Fixup updating paths

Benjamin Marzinski Sept. 21, 2018, 11:05 p.m. UTC

The code previously was timing out mode if ct->thread was 0 but
ct->running wasn't. This combination never happens.  The idea was to
timeout if for some reason the path checker tried to kill the thread,
but it didn't die.  The correct thing to check for this is ct->holders.
ct->holders will always be at least one when libcheck_check() is called,
since libcheck_free() won't get called until the device is no longer
being checked. So, if ct->holders is 2, that means that the tur thread
is has not shut down yet.

Also, instead of returning PATH_TIMEOUT whenever the thread hasn't died,
it should only time out if the thread didn't successfully get a value,
which means the previous state was already PATH_TIMEOUT.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
 libmultipath/checkers/tur.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

Martin Wilck Oct. 1, 2018, 7:51 p.m. UTC | #1

On Fri, 2018-09-21 at 18:05 -0500, Benjamin Marzinski wrote:
> The code previously was timing out mode if ct->thread was 0 but
> ct->running wasn't. This combination never happens.  The idea was to
> timeout if for some reason the path checker tried to kill the thread,
> but it didn't die.  The correct thing to check for this is ct-
> >holders.
> ct->holders will always be at least one when libcheck_check() is
> called,
> since libcheck_free() won't get called until the device is no longer
> being checked. So, if ct->holders is 2, that means that the tur
> thread
> is has not shut down yet.
> 
> Also, instead of returning PATH_TIMEOUT whenever the thread hasn't
> died,
> it should only time out if the thread didn't successfully get a
> value,
> which means the previous state was already PATH_TIMEOUT.

What about PATH_UNCHECKED?

> 
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> ---
>  libmultipath/checkers/tur.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/libmultipath/checkers/tur.c
> b/libmultipath/checkers/tur.c
> index bf8486d..275541f 100644
> --- a/libmultipath/checkers/tur.c
> +++ b/libmultipath/checkers/tur.c
> @@ -355,12 +355,15 @@ int libcheck_check(struct checker * c)
>  		}
>  		pthread_mutex_unlock(&ct->lock);
>  	} else {
> -		if (uatomic_read(&ct->running) != 0) {
> -			/* pthread cancel failed. continue in sync mode
> */
> -			pthread_mutex_unlock(&ct->lock);
> -			condlog(3, "%s: tur thread not responding",
> -				tur_devt(devt, sizeof(devt), ct));
> -			return PATH_TIMEOUT;
> +		if (uatomic_read(&ct->holders) > 1) {
> +			/* pthread cancel failed. If it didn't get the
> path
> +			   state already, timeout */
> +			if (ct->state == PATH_PENDING) {
> +				pthread_mutex_unlock(&ct->lock);
> +				condlog(3, "%s: tur thread not
> responding",
> +					tur_devt(devt, sizeof(devt),
> ct));
> +				return PATH_TIMEOUT;
> +			}
>  		}

I'm not quite getting it yet. We're entering this code path if 
ct->thread is 0. As you point out, if there are >1 holders, a
previously cancelled TUR thread must be still "alive". But now we want
to check *again*. The previous code refused to do that - a single
hanging TUR thread could block further checks from happening, forever.
The PATH_TIMEOUT return code was actually questionable - in this
libcheck_check() invocation, we didn't even try to check, so nothing
could actually time out (but that's obviously independent of your
patch).

With your patch, if ct->state != PATH_PENDING, we'd try to start
another TUR thread although there's still a hanging one. That's not
necessarily wrong, but I fail to see why it depends on ct->state.
Anyway, if we do this, we take the risk of starting more and more TUR
threads which might all end up hanging; I wonder if we should stop
creating more threads if ct->holders grows further (e.g. > 5).

Regards,
Martin

>  		/* Start new TUR checker */
>  		ct->state = PATH_UNCHECKED;

Benjamin Marzinski Oct. 4, 2018, 4:31 p.m. UTC | #2

On Mon, Oct 01, 2018 at 09:51:22PM +0200, Martin Wilck wrote:
> On Fri, 2018-09-21 at 18:05 -0500, Benjamin Marzinski wrote:
> > The code previously was timing out mode if ct->thread was 0 but
> > ct->running wasn't. This combination never happens.  The idea was to
> > timeout if for some reason the path checker tried to kill the thread,
> > but it didn't die.  The correct thing to check for this is ct-
> > >holders.
> > ct->holders will always be at least one when libcheck_check() is
> > called,
> > since libcheck_free() won't get called until the device is no longer
> > being checked. So, if ct->holders is 2, that means that the tur
> > thread
> > is has not shut down yet.
> > 
> > Also, instead of returning PATH_TIMEOUT whenever the thread hasn't
> > died,
> > it should only time out if the thread didn't successfully get a
> > value,
> > which means the previous state was already PATH_TIMEOUT.
> 
> What about PATH_UNCHECKED?
> 

I don't see how that could happen. In this state we've definitely set
ct->state to PATH_PENDING when we started the thread. The thread will
only change  ct->state to the return of tur_check(), which doesn't
ever return PATH_UNCHECKED. Am I missing something here?

> > 
> > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> > ---
> >  libmultipath/checkers/tur.c | 15 +++++++++------
> >  1 file changed, 9 insertions(+), 6 deletions(-)
> > 
> > diff --git a/libmultipath/checkers/tur.c
> > b/libmultipath/checkers/tur.c
> > index bf8486d..275541f 100644
> > --- a/libmultipath/checkers/tur.c
> > +++ b/libmultipath/checkers/tur.c
> > @@ -355,12 +355,15 @@ int libcheck_check(struct checker * c)
> >  		}
> >  		pthread_mutex_unlock(&ct->lock);
> >  	} else {
> > -		if (uatomic_read(&ct->running) != 0) {
> > -			/* pthread cancel failed. continue in sync mode
> > */
> > -			pthread_mutex_unlock(&ct->lock);
> > -			condlog(3, "%s: tur thread not responding",
> > -				tur_devt(devt, sizeof(devt), ct));
> > -			return PATH_TIMEOUT;
> > +		if (uatomic_read(&ct->holders) > 1) {
> > +			/* pthread cancel failed. If it didn't get the
> > path
> > +			   state already, timeout */
> > +			if (ct->state == PATH_PENDING) {
> > +				pthread_mutex_unlock(&ct->lock);
> > +				condlog(3, "%s: tur thread not
> > responding",
> > +					tur_devt(devt, sizeof(devt),
> > ct));
> > +				return PATH_TIMEOUT;
> > +			}
> >  		}
> 
> I'm not quite getting it yet. We're entering this code path if 
> ct->thread is 0. As you point out, if there are >1 holders, a
> previously cancelled TUR thread must be still "alive". But now we want
> to check *again*. The previous code refused to do that - a single
> hanging TUR thread could block further checks from happening, forever.
> The PATH_TIMEOUT return code was actually questionable - in this
> libcheck_check() invocation, we didn't even try to check, so nothing
> could actually time out (but that's obviously independent of your
> patch).
> 
> With your patch, if ct->state != PATH_PENDING, we'd try to start
> another TUR thread although there's still a hanging one. That's not
> necessarily wrong, but I fail to see why it depends on ct->state.
> Anyway, if we do this, we take the risk of starting more and more TUR
> threads which might all end up hanging; I wonder if we should stop
> creating more threads if ct->holders grows further (e.g. > 5).

It's been a while, and I'm not exactly sure what I was thinking here.
But IIRC the idea was that if the state isn't set yet, then the old
thread could mess with the results of a future thread. Also, it was to
avoid a corner case where the tur checker was called, started the
thread, got the result before the checker exitted, cancelled the thread
and returned the result and then was called very shortly afterwards,
before the thread could clean itself up.  In that case, the right thing
to do (I thought) would be to start a new thread, because the other one
would be ending soon.  In truth, starting a new thread at all is
probably a bad idea, since the old thread will still mess with the
checker context on exit. 

A better idea might be to simply fail back to a syncronous call to
tur_check() when you notice a cancelled thread that hasn't exitted.
That can cause all the other checkers to get delayed, but at least in
that case you are still checking paths. The other option is to do as
before and just not check this path, which won't effect the other
checkers. It seems very unlikely that the thread could remain
uncancelled forever, especially if the path was usable.

thoughts?

-Ben

> Regards,
> Martin
> 
> 
> >  		/* Start new TUR checker */
> >  		ct->state = PATH_UNCHECKED;
> 
> -- 
> Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Martin Wilck Oct. 5, 2018, 10:11 a.m. UTC | #3

On Thu, 2018-10-04 at 11:31 -0500,  Benjamin Marzinski  wrote:
> On Mon, Oct 01, 2018 at 09:51:22PM +0200, Martin Wilck wrote:
> > On Fri, 2018-09-21 at 18:05 -0500, Benjamin Marzinski wrote:
> > > The code previously was timing out mode if ct->thread was 0 but
> > > ct->running wasn't. This combination never happens.  The idea was
> > > to
> > > timeout if for some reason the path checker tried to kill the
> > > thread,
> > > but it didn't die.  The correct thing to check for this is ct-
> > > > holders.
> > > 
> > > ct->holders will always be at least one when libcheck_check() is
> > > called,
> > > since libcheck_free() won't get called until the device is no
> > > longer
> > > being checked. So, if ct->holders is 2, that means that the tur
> > > thread
> > > is has not shut down yet.
> > > 
> > > Also, instead of returning PATH_TIMEOUT whenever the thread
> > > hasn't
> > > died,
> > > it should only time out if the thread didn't successfully get a
> > > value,
> > > which means the previous state was already PATH_TIMEOUT.
> > 
> > What about PATH_UNCHECKED?
> > 
> 
> I don't see how that could happen. In this state we've definitely set
> ct->state to PATH_PENDING when we started the thread. The thread will
> only change  ct->state to the return of tur_check(), which doesn't
> ever return PATH_UNCHECKED. Am I missing something here?

OK. The only thing you missed was a comment :-)

This stuff is so subtle that we should describe exactly how
the locking works, which actor is supposed/entitled to set what field
to what value in what situation, and so on.

> 
> > > 
> > > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> > > ---
> > >  libmultipath/checkers/tur.c | 15 +++++++++------
> > >  1 file changed, 9 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/libmultipath/checkers/tur.c
> > > b/libmultipath/checkers/tur.c
> > > index bf8486d..275541f 100644
> > > --- a/libmultipath/checkers/tur.c
> > > +++ b/libmultipath/checkers/tur.c
> > > @@ -355,12 +355,15 @@ int libcheck_check(struct checker * c)
> > >  		}
> > >  		pthread_mutex_unlock(&ct->lock);
> > >  	} else {
> > > -		if (uatomic_read(&ct->running) != 0) {
> > > -			/* pthread cancel failed. continue in sync mode
> > > */
> > > -			pthread_mutex_unlock(&ct->lock);
> > > -			condlog(3, "%s: tur thread not responding",
> > > -				tur_devt(devt, sizeof(devt), ct));
> > > -			return PATH_TIMEOUT;
> > > +		if (uatomic_read(&ct->holders) > 1) {
> > > +			/* pthread cancel failed. If it didn't get the
> > > path
> > > +			   state already, timeout */
> > > +			if (ct->state == PATH_PENDING) {
> > > +				pthread_mutex_unlock(&ct->lock);
> > > +				condlog(3, "%s: tur thread not
> > > responding",
> > > +					tur_devt(devt, sizeof(devt),
> > > ct));
> > > +				return PATH_TIMEOUT;
> > > +			}
> > >  		}
> > 
> > I'm not quite getting it yet. We're entering this code path if 
> > ct->thread is 0. As you point out, if there are >1 holders, a
> > previously cancelled TUR thread must be still "alive". But now we
> > want
> > to check *again*. The previous code refused to do that - a single
> > hanging TUR thread could block further checks from happening,
> > forever.
> > The PATH_TIMEOUT return code was actually questionable - in this
> > libcheck_check() invocation, we didn't even try to check, so
> > nothing
> > could actually time out (but that's obviously independent of your
> > patch).
> > 
> > With your patch, if ct->state != PATH_PENDING, we'd try to start
> > another TUR thread although there's still a hanging one. That's not
> > necessarily wrong, but I fail to see why it depends on ct->state.
> > Anyway, if we do this, we take the risk of starting more and more
> > TUR
> > threads which might all end up hanging; I wonder if we should stop
> > creating more threads if ct->holders grows further (e.g. > 5).
> 
> It's been a while, and I'm not exactly sure what I was thinking here.
> But IIRC the idea was that if the state isn't set yet, then the old
> thread could mess with the results of a future thread. Also, it was
> to
> avoid a corner case where the tur checker was called, started the
> thread, got the result before the checker exitted, cancelled the
> thread
> and returned the result and then was called very shortly afterwards,
> before the thread could clean itself up.  In that case, the right
> thing
> to do (I thought) would be to start a new thread, because the other
> one
> would be ending soon.  In truth, starting a new thread at all is
> probably a bad idea, since the old thread will still mess with the
> checker context on exit. 
> 
> A better idea might be to simply fail back to a syncronous call to
> tur_check() when you notice a cancelled thread that hasn't exitted.
> That can cause all the other checkers to get delayed, but at least in
> that case you are still checking paths. The other option is to do as
> before and just not check this path, which won't effect the other
> checkers. It seems very unlikely that the thread could remain
> uncancelled forever, especially if the path was usable.
>
> thoughts?

Generally speaking, we're deeply in the realm of highly unlikely
situations I would say. But we should get it right eventually.

Maybe we can add logic to the tur thread to keep its hands off the
context if it's a "zombie", like below (just a thought, untested)?

Martin


diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c
index bf8486d3..e8493ca8 100644
--- a/libmultipath/checkers/tur.c
+++ b/libmultipath/checkers/tur.c
@@ -219,11 +219,18 @@ static void cleanup_func(void *data)
 	rcu_unregister_thread();
 }
 
+#define tur_thread_quit_unless_owner(__ct)	   \
+	if (__ct->thread != pthread_self()) {	   \
+		pthread_mutex_unlock(&__ct->lock); \
+		pthread_exit(NULL);		   \
+	}
+
 static void copy_msg_to_tcc(void *ct_p, const char *msg)
 {
 	struct tur_checker_context *ct = ct_p;
 
 	pthread_mutex_lock(&ct->lock);
+	tur_thread_quit_unless_owner(ct);
 	strlcpy(ct->message, msg, sizeof(ct->message));
 	pthread_mutex_unlock(&ct->lock);
 }
@@ -243,6 +250,7 @@ static void *tur_thread(void *ctx)
 
 	/* TUR checker start up */
 	pthread_mutex_lock(&ct->lock);
+	tur_thread_quit_unless_owner(ct);
 	ct->state = PATH_PENDING;
 	ct->message[0] = '\0';
 	pthread_mutex_unlock(&ct->lock);
@@ -252,6 +260,7 @@ static void *tur_thread(void *ctx)
 
 	/* TUR checker done */
 	pthread_mutex_lock(&ct->lock);
+	tur_thread_quit_unless_owner(ct);
 	ct->state = state;
 	pthread_cond_signal(&ct->active);
 	pthread_mutex_unlock(&ct->lock);

Benjamin Marzinski Oct. 5, 2018, 5:02 p.m. UTC | #4

On Fri, Oct 05, 2018 at 12:11:18PM +0200, Martin Wilck wrote:
> On Thu, 2018-10-04 at 11:31 -0500,  Benjamin Marzinski  wrote:
> > It's been a while, and I'm not exactly sure what I was thinking here.
> > But IIRC the idea was that if the state isn't set yet, then the old
> > thread could mess with the results of a future thread. Also, it was
> > to
> > avoid a corner case where the tur checker was called, started the
> > thread, got the result before the checker exitted, cancelled the
> > thread
> > and returned the result and then was called very shortly afterwards,
> > before the thread could clean itself up.  In that case, the right
> > thing
> > to do (I thought) would be to start a new thread, because the other
> > one
> > would be ending soon.  In truth, starting a new thread at all is
> > probably a bad idea, since the old thread will still mess with the
> > checker context on exit. 
> > 
> > A better idea might be to simply fail back to a syncronous call to
> > tur_check() when you notice a cancelled thread that hasn't exitted.
> > That can cause all the other checkers to get delayed, but at least in
> > that case you are still checking paths. The other option is to do as
> > before and just not check this path, which won't effect the other
> > checkers. It seems very unlikely that the thread could remain
> > uncancelled forever, especially if the path was usable.
> >
> > thoughts?
> 
> Generally speaking, we're deeply in the realm of highly unlikely
> situations I would say. But we should get it right eventually.
> 
> Maybe we can add logic to the tur thread to keep its hands off the
> context if it's a "zombie", like below (just a thought, untested)?

This still wouldn't stop a thread from racing with new thread creation
to change ct->holders or ct->running.

-Ben

> Martin
> 
> 
> diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c
> index bf8486d3..e8493ca8 100644
> --- a/libmultipath/checkers/tur.c
> +++ b/libmultipath/checkers/tur.c
> @@ -219,11 +219,18 @@ static void cleanup_func(void *data)
>  	rcu_unregister_thread();
>  }
>  
> +#define tur_thread_quit_unless_owner(__ct)	   \
> +	if (__ct->thread != pthread_self()) {	   \
> +		pthread_mutex_unlock(&__ct->lock); \
> +		pthread_exit(NULL);		   \
> +	}
> +
>  static void copy_msg_to_tcc(void *ct_p, const char *msg)
>  {
>  	struct tur_checker_context *ct = ct_p;
>  
>  	pthread_mutex_lock(&ct->lock);
> +	tur_thread_quit_unless_owner(ct);
>  	strlcpy(ct->message, msg, sizeof(ct->message));
>  	pthread_mutex_unlock(&ct->lock);
>  }
> @@ -243,6 +250,7 @@ static void *tur_thread(void *ctx)
>  
>  	/* TUR checker start up */
>  	pthread_mutex_lock(&ct->lock);
> +	tur_thread_quit_unless_owner(ct);
>  	ct->state = PATH_PENDING;
>  	ct->message[0] = '\0';
>  	pthread_mutex_unlock(&ct->lock);
> @@ -252,6 +260,7 @@ static void *tur_thread(void *ctx)
>  
>  	/* TUR checker done */
>  	pthread_mutex_lock(&ct->lock);
> +	tur_thread_quit_unless_owner(ct);
>  	ct->state = state;
>  	pthread_cond_signal(&ct->active);
>  	pthread_mutex_unlock(&ct->lock);
> 
> 
> 
> 
> -- 
> Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
> SUSELinux GmbH, GF: Felix Imendörffer, Jane Smithard,  Graham Norton
> HRB 21284 (AG Nürnberg)
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Martin Wilck Oct. 5, 2018, 7:23 p.m. UTC | #5

On Fri, 2018-10-05 at 12:02 -0500, Benjamin Marzinski wrote:
> On Fri, Oct 05, 2018 at 12:11:18PM +0200, Martin Wilck wrote:
> > On Thu, 2018-10-04 at 11:31 -0500,  Benjamin Marzinski  wrote:
> > > 
> > > thoughts?
> > 
> > Generally speaking, we're deeply in the realm of highly unlikely
> > situations I would say. But we should get it right eventually.
> > 
> > Maybe we can add logic to the tur thread to keep its hands off the
> > context if it's a "zombie", like below (just a thought, untested)?
> 
> This still wouldn't stop a thread from racing with new thread
> creation
> to change ct->holders or ct->running.

I don't see a problem with ct->holders. Being a refcount, it seems to
be made quite for this purpose. The zombie thread _must_ decrement it
when it eventually exits. Wrt ct->running, we may have to move it under
the lock again if we want to be absolutely sure. 

Regards
Martin

[v3,01/19] libmultipath: fix tur checker timeout

Commit Message

Comments

Patch