Message ID | 1537571127-10143-2-git-send-email-bmarzins@redhat.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
Delegated to: | christophe varoqui |
Headers | show |
Series | Misc Multipath patches | expand |
On Fri, 2018-09-21 at 18:05 -0500, Benjamin Marzinski wrote: > The code previously was timing out mode if ct->thread was 0 but > ct->running wasn't. This combination never happens. The idea was to > timeout if for some reason the path checker tried to kill the thread, > but it didn't die. The correct thing to check for this is ct- > >holders. > ct->holders will always be at least one when libcheck_check() is > called, > since libcheck_free() won't get called until the device is no longer > being checked. So, if ct->holders is 2, that means that the tur > thread > is has not shut down yet. > > Also, instead of returning PATH_TIMEOUT whenever the thread hasn't > died, > it should only time out if the thread didn't successfully get a > value, > which means the previous state was already PATH_TIMEOUT. What about PATH_UNCHECKED? > > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> > --- > libmultipath/checkers/tur.c | 15 +++++++++------ > 1 file changed, 9 insertions(+), 6 deletions(-) > > diff --git a/libmultipath/checkers/tur.c > b/libmultipath/checkers/tur.c > index bf8486d..275541f 100644 > --- a/libmultipath/checkers/tur.c > +++ b/libmultipath/checkers/tur.c > @@ -355,12 +355,15 @@ int libcheck_check(struct checker * c) > } > pthread_mutex_unlock(&ct->lock); > } else { > - if (uatomic_read(&ct->running) != 0) { > - /* pthread cancel failed. continue in sync mode > */ > - pthread_mutex_unlock(&ct->lock); > - condlog(3, "%s: tur thread not responding", > - tur_devt(devt, sizeof(devt), ct)); > - return PATH_TIMEOUT; > + if (uatomic_read(&ct->holders) > 1) { > + /* pthread cancel failed. If it didn't get the > path > + state already, timeout */ > + if (ct->state == PATH_PENDING) { > + pthread_mutex_unlock(&ct->lock); > + condlog(3, "%s: tur thread not > responding", > + tur_devt(devt, sizeof(devt), > ct)); > + return PATH_TIMEOUT; > + } > } I'm not quite getting it yet. We're entering this code path if ct->thread is 0. As you point out, if there are >1 holders, a previously cancelled TUR thread must be still "alive". But now we want to check *again*. The previous code refused to do that - a single hanging TUR thread could block further checks from happening, forever. The PATH_TIMEOUT return code was actually questionable - in this libcheck_check() invocation, we didn't even try to check, so nothing could actually time out (but that's obviously independent of your patch). With your patch, if ct->state != PATH_PENDING, we'd try to start another TUR thread although there's still a hanging one. That's not necessarily wrong, but I fail to see why it depends on ct->state. Anyway, if we do this, we take the risk of starting more and more TUR threads which might all end up hanging; I wonder if we should stop creating more threads if ct->holders grows further (e.g. > 5). Regards, Martin > /* Start new TUR checker */ > ct->state = PATH_UNCHECKED;
On Mon, Oct 01, 2018 at 09:51:22PM +0200, Martin Wilck wrote: > On Fri, 2018-09-21 at 18:05 -0500, Benjamin Marzinski wrote: > > The code previously was timing out mode if ct->thread was 0 but > > ct->running wasn't. This combination never happens. The idea was to > > timeout if for some reason the path checker tried to kill the thread, > > but it didn't die. The correct thing to check for this is ct- > > >holders. > > ct->holders will always be at least one when libcheck_check() is > > called, > > since libcheck_free() won't get called until the device is no longer > > being checked. So, if ct->holders is 2, that means that the tur > > thread > > is has not shut down yet. > > > > Also, instead of returning PATH_TIMEOUT whenever the thread hasn't > > died, > > it should only time out if the thread didn't successfully get a > > value, > > which means the previous state was already PATH_TIMEOUT. > > What about PATH_UNCHECKED? > I don't see how that could happen. In this state we've definitely set ct->state to PATH_PENDING when we started the thread. The thread will only change ct->state to the return of tur_check(), which doesn't ever return PATH_UNCHECKED. Am I missing something here? > > > > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> > > --- > > libmultipath/checkers/tur.c | 15 +++++++++------ > > 1 file changed, 9 insertions(+), 6 deletions(-) > > > > diff --git a/libmultipath/checkers/tur.c > > b/libmultipath/checkers/tur.c > > index bf8486d..275541f 100644 > > --- a/libmultipath/checkers/tur.c > > +++ b/libmultipath/checkers/tur.c > > @@ -355,12 +355,15 @@ int libcheck_check(struct checker * c) > > } > > pthread_mutex_unlock(&ct->lock); > > } else { > > - if (uatomic_read(&ct->running) != 0) { > > - /* pthread cancel failed. continue in sync mode > > */ > > - pthread_mutex_unlock(&ct->lock); > > - condlog(3, "%s: tur thread not responding", > > - tur_devt(devt, sizeof(devt), ct)); > > - return PATH_TIMEOUT; > > + if (uatomic_read(&ct->holders) > 1) { > > + /* pthread cancel failed. If it didn't get the > > path > > + state already, timeout */ > > + if (ct->state == PATH_PENDING) { > > + pthread_mutex_unlock(&ct->lock); > > + condlog(3, "%s: tur thread not > > responding", > > + tur_devt(devt, sizeof(devt), > > ct)); > > + return PATH_TIMEOUT; > > + } > > } > > I'm not quite getting it yet. We're entering this code path if > ct->thread is 0. As you point out, if there are >1 holders, a > previously cancelled TUR thread must be still "alive". But now we want > to check *again*. The previous code refused to do that - a single > hanging TUR thread could block further checks from happening, forever. > The PATH_TIMEOUT return code was actually questionable - in this > libcheck_check() invocation, we didn't even try to check, so nothing > could actually time out (but that's obviously independent of your > patch). > > With your patch, if ct->state != PATH_PENDING, we'd try to start > another TUR thread although there's still a hanging one. That's not > necessarily wrong, but I fail to see why it depends on ct->state. > Anyway, if we do this, we take the risk of starting more and more TUR > threads which might all end up hanging; I wonder if we should stop > creating more threads if ct->holders grows further (e.g. > 5). It's been a while, and I'm not exactly sure what I was thinking here. But IIRC the idea was that if the state isn't set yet, then the old thread could mess with the results of a future thread. Also, it was to avoid a corner case where the tur checker was called, started the thread, got the result before the checker exitted, cancelled the thread and returned the result and then was called very shortly afterwards, before the thread could clean itself up. In that case, the right thing to do (I thought) would be to start a new thread, because the other one would be ending soon. In truth, starting a new thread at all is probably a bad idea, since the old thread will still mess with the checker context on exit. A better idea might be to simply fail back to a syncronous call to tur_check() when you notice a cancelled thread that hasn't exitted. That can cause all the other checkers to get delayed, but at least in that case you are still checking paths. The other option is to do as before and just not check this path, which won't effect the other checkers. It seems very unlikely that the thread could remain uncancelled forever, especially if the path was usable. thoughts? -Ben > Regards, > Martin > > > > /* Start new TUR checker */ > > ct->state = PATH_UNCHECKED; > > -- > Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107 > SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton > HRB 21284 (AG Nürnberg) > -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Thu, 2018-10-04 at 11:31 -0500, Benjamin Marzinski wrote: > On Mon, Oct 01, 2018 at 09:51:22PM +0200, Martin Wilck wrote: > > On Fri, 2018-09-21 at 18:05 -0500, Benjamin Marzinski wrote: > > > The code previously was timing out mode if ct->thread was 0 but > > > ct->running wasn't. This combination never happens. The idea was > > > to > > > timeout if for some reason the path checker tried to kill the > > > thread, > > > but it didn't die. The correct thing to check for this is ct- > > > > holders. > > > > > > ct->holders will always be at least one when libcheck_check() is > > > called, > > > since libcheck_free() won't get called until the device is no > > > longer > > > being checked. So, if ct->holders is 2, that means that the tur > > > thread > > > is has not shut down yet. > > > > > > Also, instead of returning PATH_TIMEOUT whenever the thread > > > hasn't > > > died, > > > it should only time out if the thread didn't successfully get a > > > value, > > > which means the previous state was already PATH_TIMEOUT. > > > > What about PATH_UNCHECKED? > > > > I don't see how that could happen. In this state we've definitely set > ct->state to PATH_PENDING when we started the thread. The thread will > only change ct->state to the return of tur_check(), which doesn't > ever return PATH_UNCHECKED. Am I missing something here? OK. The only thing you missed was a comment :-) This stuff is so subtle that we should describe exactly how the locking works, which actor is supposed/entitled to set what field to what value in what situation, and so on. > > > > > > > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> > > > --- > > > libmultipath/checkers/tur.c | 15 +++++++++------ > > > 1 file changed, 9 insertions(+), 6 deletions(-) > > > > > > diff --git a/libmultipath/checkers/tur.c > > > b/libmultipath/checkers/tur.c > > > index bf8486d..275541f 100644 > > > --- a/libmultipath/checkers/tur.c > > > +++ b/libmultipath/checkers/tur.c > > > @@ -355,12 +355,15 @@ int libcheck_check(struct checker * c) > > > } > > > pthread_mutex_unlock(&ct->lock); > > > } else { > > > - if (uatomic_read(&ct->running) != 0) { > > > - /* pthread cancel failed. continue in sync mode > > > */ > > > - pthread_mutex_unlock(&ct->lock); > > > - condlog(3, "%s: tur thread not responding", > > > - tur_devt(devt, sizeof(devt), ct)); > > > - return PATH_TIMEOUT; > > > + if (uatomic_read(&ct->holders) > 1) { > > > + /* pthread cancel failed. If it didn't get the > > > path > > > + state already, timeout */ > > > + if (ct->state == PATH_PENDING) { > > > + pthread_mutex_unlock(&ct->lock); > > > + condlog(3, "%s: tur thread not > > > responding", > > > + tur_devt(devt, sizeof(devt), > > > ct)); > > > + return PATH_TIMEOUT; > > > + } > > > } > > > > I'm not quite getting it yet. We're entering this code path if > > ct->thread is 0. As you point out, if there are >1 holders, a > > previously cancelled TUR thread must be still "alive". But now we > > want > > to check *again*. The previous code refused to do that - a single > > hanging TUR thread could block further checks from happening, > > forever. > > The PATH_TIMEOUT return code was actually questionable - in this > > libcheck_check() invocation, we didn't even try to check, so > > nothing > > could actually time out (but that's obviously independent of your > > patch). > > > > With your patch, if ct->state != PATH_PENDING, we'd try to start > > another TUR thread although there's still a hanging one. That's not > > necessarily wrong, but I fail to see why it depends on ct->state. > > Anyway, if we do this, we take the risk of starting more and more > > TUR > > threads which might all end up hanging; I wonder if we should stop > > creating more threads if ct->holders grows further (e.g. > 5). > > It's been a while, and I'm not exactly sure what I was thinking here. > But IIRC the idea was that if the state isn't set yet, then the old > thread could mess with the results of a future thread. Also, it was > to > avoid a corner case where the tur checker was called, started the > thread, got the result before the checker exitted, cancelled the > thread > and returned the result and then was called very shortly afterwards, > before the thread could clean itself up. In that case, the right > thing > to do (I thought) would be to start a new thread, because the other > one > would be ending soon. In truth, starting a new thread at all is > probably a bad idea, since the old thread will still mess with the > checker context on exit. > > A better idea might be to simply fail back to a syncronous call to > tur_check() when you notice a cancelled thread that hasn't exitted. > That can cause all the other checkers to get delayed, but at least in > that case you are still checking paths. The other option is to do as > before and just not check this path, which won't effect the other > checkers. It seems very unlikely that the thread could remain > uncancelled forever, especially if the path was usable. > > thoughts? Generally speaking, we're deeply in the realm of highly unlikely situations I would say. But we should get it right eventually. Maybe we can add logic to the tur thread to keep its hands off the context if it's a "zombie", like below (just a thought, untested)? Martin diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c index bf8486d3..e8493ca8 100644 --- a/libmultipath/checkers/tur.c +++ b/libmultipath/checkers/tur.c @@ -219,11 +219,18 @@ static void cleanup_func(void *data) rcu_unregister_thread(); } +#define tur_thread_quit_unless_owner(__ct) \ + if (__ct->thread != pthread_self()) { \ + pthread_mutex_unlock(&__ct->lock); \ + pthread_exit(NULL); \ + } + static void copy_msg_to_tcc(void *ct_p, const char *msg) { struct tur_checker_context *ct = ct_p; pthread_mutex_lock(&ct->lock); + tur_thread_quit_unless_owner(ct); strlcpy(ct->message, msg, sizeof(ct->message)); pthread_mutex_unlock(&ct->lock); } @@ -243,6 +250,7 @@ static void *tur_thread(void *ctx) /* TUR checker start up */ pthread_mutex_lock(&ct->lock); + tur_thread_quit_unless_owner(ct); ct->state = PATH_PENDING; ct->message[0] = '\0'; pthread_mutex_unlock(&ct->lock); @@ -252,6 +260,7 @@ static void *tur_thread(void *ctx) /* TUR checker done */ pthread_mutex_lock(&ct->lock); + tur_thread_quit_unless_owner(ct); ct->state = state; pthread_cond_signal(&ct->active); pthread_mutex_unlock(&ct->lock);
On Fri, Oct 05, 2018 at 12:11:18PM +0200, Martin Wilck wrote: > On Thu, 2018-10-04 at 11:31 -0500, Benjamin Marzinski wrote: > > It's been a while, and I'm not exactly sure what I was thinking here. > > But IIRC the idea was that if the state isn't set yet, then the old > > thread could mess with the results of a future thread. Also, it was > > to > > avoid a corner case where the tur checker was called, started the > > thread, got the result before the checker exitted, cancelled the > > thread > > and returned the result and then was called very shortly afterwards, > > before the thread could clean itself up. In that case, the right > > thing > > to do (I thought) would be to start a new thread, because the other > > one > > would be ending soon. In truth, starting a new thread at all is > > probably a bad idea, since the old thread will still mess with the > > checker context on exit. > > > > A better idea might be to simply fail back to a syncronous call to > > tur_check() when you notice a cancelled thread that hasn't exitted. > > That can cause all the other checkers to get delayed, but at least in > > that case you are still checking paths. The other option is to do as > > before and just not check this path, which won't effect the other > > checkers. It seems very unlikely that the thread could remain > > uncancelled forever, especially if the path was usable. > > > > thoughts? > > Generally speaking, we're deeply in the realm of highly unlikely > situations I would say. But we should get it right eventually. > > Maybe we can add logic to the tur thread to keep its hands off the > context if it's a "zombie", like below (just a thought, untested)? This still wouldn't stop a thread from racing with new thread creation to change ct->holders or ct->running. -Ben > Martin > > > diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c > index bf8486d3..e8493ca8 100644 > --- a/libmultipath/checkers/tur.c > +++ b/libmultipath/checkers/tur.c > @@ -219,11 +219,18 @@ static void cleanup_func(void *data) > rcu_unregister_thread(); > } > > +#define tur_thread_quit_unless_owner(__ct) \ > + if (__ct->thread != pthread_self()) { \ > + pthread_mutex_unlock(&__ct->lock); \ > + pthread_exit(NULL); \ > + } > + > static void copy_msg_to_tcc(void *ct_p, const char *msg) > { > struct tur_checker_context *ct = ct_p; > > pthread_mutex_lock(&ct->lock); > + tur_thread_quit_unless_owner(ct); > strlcpy(ct->message, msg, sizeof(ct->message)); > pthread_mutex_unlock(&ct->lock); > } > @@ -243,6 +250,7 @@ static void *tur_thread(void *ctx) > > /* TUR checker start up */ > pthread_mutex_lock(&ct->lock); > + tur_thread_quit_unless_owner(ct); > ct->state = PATH_PENDING; > ct->message[0] = '\0'; > pthread_mutex_unlock(&ct->lock); > @@ -252,6 +260,7 @@ static void *tur_thread(void *ctx) > > /* TUR checker done */ > pthread_mutex_lock(&ct->lock); > + tur_thread_quit_unless_owner(ct); > ct->state = state; > pthread_cond_signal(&ct->active); > pthread_mutex_unlock(&ct->lock); > > > > > -- > Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107 > SUSELinux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton > HRB 21284 (AG Nürnberg) > -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Fri, 2018-10-05 at 12:02 -0500, Benjamin Marzinski wrote: > On Fri, Oct 05, 2018 at 12:11:18PM +0200, Martin Wilck wrote: > > On Thu, 2018-10-04 at 11:31 -0500, Benjamin Marzinski wrote: > > > > > > thoughts? > > > > Generally speaking, we're deeply in the realm of highly unlikely > > situations I would say. But we should get it right eventually. > > > > Maybe we can add logic to the tur thread to keep its hands off the > > context if it's a "zombie", like below (just a thought, untested)? > > This still wouldn't stop a thread from racing with new thread > creation > to change ct->holders or ct->running. I don't see a problem with ct->holders. Being a refcount, it seems to be made quite for this purpose. The zombie thread _must_ decrement it when it eventually exits. Wrt ct->running, we may have to move it under the lock again if we want to be absolutely sure. Regards Martin
diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c index bf8486d..275541f 100644 --- a/libmultipath/checkers/tur.c +++ b/libmultipath/checkers/tur.c @@ -355,12 +355,15 @@ int libcheck_check(struct checker * c) } pthread_mutex_unlock(&ct->lock); } else { - if (uatomic_read(&ct->running) != 0) { - /* pthread cancel failed. continue in sync mode */ - pthread_mutex_unlock(&ct->lock); - condlog(3, "%s: tur thread not responding", - tur_devt(devt, sizeof(devt), ct)); - return PATH_TIMEOUT; + if (uatomic_read(&ct->holders) > 1) { + /* pthread cancel failed. If it didn't get the path + state already, timeout */ + if (ct->state == PATH_PENDING) { + pthread_mutex_unlock(&ct->lock); + condlog(3, "%s: tur thread not responding", + tur_devt(devt, sizeof(devt), ct)); + return PATH_TIMEOUT; + } } /* Start new TUR checker */ ct->state = PATH_UNCHECKED;
The code previously was timing out mode if ct->thread was 0 but ct->running wasn't. This combination never happens. The idea was to timeout if for some reason the path checker tried to kill the thread, but it didn't die. The correct thing to check for this is ct->holders. ct->holders will always be at least one when libcheck_check() is called, since libcheck_free() won't get called until the device is no longer being checked. So, if ct->holders is 2, that means that the tur thread is has not shut down yet. Also, instead of returning PATH_TIMEOUT whenever the thread hasn't died, it should only time out if the thread didn't successfully get a value, which means the previous state was already PATH_TIMEOUT. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> --- libmultipath/checkers/tur.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-)