Message ID | 1518134167-15938-2-git-send-email-bmarzins@redhat.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
Delegated to: | Mike Snitzer |
Headers | show |
On Thu, 2018-02-08 at 17:56 -0600, Benjamin Marzinski wrote: > static void cleanup_func(void *data) > { > - int holders; > + int running, holders; > struct tur_checker_context *ct = data; > - pthread_spin_lock(&ct->hldr_lock); > - ct->holders--; > - holders = ct->holders; > - ct->thread = 0; > - pthread_spin_unlock(&ct->hldr_lock); > + > + running = uatomic_xchg(&ct->running, 0); > + holders = uatomic_sub_return(&ct->holders, 1); > if (!holders) > cleanup_context(ct); > + if (!running) > + pause(); > } Hello Ben, Why has the pause() call been added? I think it is safe to call pthread_cancel() for a non-detached thread that has finished so I don't think that pause() call is necessary. > static int tur_running(struct tur_checker_context *ct) > { > - pthread_t thread; > - > - pthread_spin_lock(&ct->hldr_lock); > - thread = ct->thread; > - pthread_spin_unlock(&ct->hldr_lock); > - > - return thread != 0; > + return (uatomic_read(&ct->running) != 0); > } Is such a one line function really useful? I think the code will be easier to read if this function is inlined into its callers. > @@ -418,8 +396,12 @@ int libcheck_check(struct checker * c) > (tur_status == PATH_PENDING || tur_status == PATH_UNCHECKED)) { > condlog(3, "%s: tur checker still running", > tur_devt(devt, sizeof(devt), ct)); > - ct->running = 1; > tur_status = PATH_PENDING; > + } else { > + int running = uatomic_xchg(&ct->running, 0); > + if (running) > + pthread_cancel(ct->thread); > + ct->thread = 0; > } > } Why has this pthread_cancel() call been added? I think that else clause can only be reached if ct->running == 0 so I don't think that the pthread_cancel() call will ever be reached. Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Fri, Feb 09, 2018 at 04:15:34PM +0000, Bart Van Assche wrote: > On Thu, 2018-02-08 at 17:56 -0600, Benjamin Marzinski wrote: > > static void cleanup_func(void *data) > > { > > - int holders; > > + int running, holders; > > struct tur_checker_context *ct = data; > > - pthread_spin_lock(&ct->hldr_lock); > > - ct->holders--; > > - holders = ct->holders; > > - ct->thread = 0; > > - pthread_spin_unlock(&ct->hldr_lock); > > + > > + running = uatomic_xchg(&ct->running, 0); > > + holders = uatomic_sub_return(&ct->holders, 1); > > if (!holders) > > cleanup_context(ct); > > + if (!running) > > + pause(); > > } > > Hello Ben, > > Why has the pause() call been added? I think it is safe to call pthread_cancel() > for a non-detached thread that has finished so I don't think that pause() call > is necessary. Martin objected to having the threads getting detached as part of cancelling them (I think. I'm a little fuzzy on what he didn't like). But he definitely said he preferred the thread to start detached, so in this version, it does. That's why we need the pause(). If he's fine with the threads getting detached later, I will happily replace the pause() with if (running) pthread_detach(pthread_self()); and add pthread_detach(ct->thread) after the calls to pthread_cancel(ct->thread). Otherwise we need the pause() to solve your original bug. As an aside, Martin, if your problem is the thread detaching itself, we can skip that if we are fine with a zombie thread hanging around until the next time we call libcheck_check() or libcheck_free(). Then the checker can always be in charge of detaching the thread. > > static int tur_running(struct tur_checker_context *ct) > > { > > - pthread_t thread; > > - > > - pthread_spin_lock(&ct->hldr_lock); > > - thread = ct->thread; > > - pthread_spin_unlock(&ct->hldr_lock); > > - > > - return thread != 0; > > + return (uatomic_read(&ct->running) != 0); > > } > > Is such a one line function really useful? Nope. I just left it there to keep the number of changes that the patch makes lower, to make it more straightforward to review. I'm fine will inlining it. > I think the code will be easier to read if this function is inlined > into its callers. > > @@ -418,8 +396,12 @@ int libcheck_check(struct checker * c) > > (tur_status == PATH_PENDING || tur_status == PATH_UNCHECKED)) { > > condlog(3, "%s: tur checker still running", > > tur_devt(devt, sizeof(devt), ct)); > > - ct->running = 1; > > tur_status = PATH_PENDING; > > + } else { > > + int running = uatomic_xchg(&ct->running, 0); > > + if (running) > > + pthread_cancel(ct->thread); > > + ct->thread = 0; > > } > > } > > Why has this pthread_cancel() call been added? I think that else clause can only be > reached if ct->running == 0 so I don't think that the pthread_cancel() call will ever > be reached. It can be reached if ct->running is 1, as long as tur_status has been updated. In practice this means that the thread should have done everything it needs to do, and all that's left is for it to shutdown. However, if the thread doesn't shut itself down before the next time you call libcheck_check(), the checker will give up and return PATH_TIMEOUT. It seems pretty unlikely that this will happen, since there should be a significant delay before calling libcheck_check() again. This theoretical race has been in the code for a while, and AFAIK, it's never occured. But there definitely is the possiblity that the thread will still be running at the end of libcheck_check(), and it doesn't hurt things to forceably shut the thread down, if it is. -Ben > Thanks, > > Bart. > > > -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Fri, 2018-02-09 at 11:26 -0600, Benjamin Marzinski wrote: > On Fri, Feb 09, 2018 at 04:15:34PM +0000, Bart Van Assche wrote: > > On Thu, 2018-02-08 at 17:56 -0600, Benjamin Marzinski wrote: > > > static void cleanup_func(void *data) > > > { > > > - int holders; > > > + int running, holders; > > > struct tur_checker_context *ct = data; > > > - pthread_spin_lock(&ct->hldr_lock); > > > - ct->holders--; > > > - holders = ct->holders; > > > - ct->thread = 0; > > > - pthread_spin_unlock(&ct->hldr_lock); > > > + > > > + running = uatomic_xchg(&ct->running, 0); > > > + holders = uatomic_sub_return(&ct->holders, 1); > > > if (!holders) > > > cleanup_context(ct); > > > + if (!running) > > > + pause(); > > > } > > > > Hello Ben, > > > > Why has the pause() call been added? I think it is safe to call pthread_cancel() > > for a non-detached thread that has finished so I don't think that pause() call > > is necessary. > > Martin objected to having the threads getting detached as part of > cancelling them (I think. I'm a little fuzzy on what he didn't like). > But he definitely said he preferred the thread to start detached, so in > this version, it does. That's why we need the pause(). If he's fine with > the threads getting detached later, I will happily replace the pause() > with > > if (running) > pthread_detach(pthread_self()); > > and add pthread_detach(ct->thread) after the calls to > pthread_cancel(ct->thread). Otherwise we need the pause() to solve your > original bug. Ah, thanks, I had overlooked that the tur checker detaches the checker thread. Have you considered to add a comment above the pause() call that explains the purpose of that call? Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Thu, 2018-02-08 at 17:56 -0600, Benjamin Marzinski wrote: > ct->running is now an atomic variable. When the thread is started > it is set to 1. When the checker wants to kill a thread, it > atomically > sets the value to 0 and reads the previous value. If it was 1, > the checker cancels the thread. If it was 0, the nothing needs to be > done. After the checker has dealt with the thread, it sets ct- > >thread > to NULL. > > When the thread is done, it atomicalllys sets the value of ct- > >running > to 0 and reads the previous value. If it was 1, the thread just > exits. > If it was 0, then the checker is trying to cancel the thread, and so > the thread calls pause(), which is a cancellation point. > I'm missing one thing here. My poor brain is aching. cleanup_func() can be entered in two ways: a) if the thread has been cancelled and passed a cancellation point already, or b) if it exits normally and calls pthread_cleanup_pop(). In case b), waiting for the cancellation request by calling pause() makes sense to me. But in case a), the thread has already seen the cancellation request - wouldn't calling pause() cause it to sleep forever? Martin
On Fri, Feb 09, 2018 at 09:30:56PM +0100, Martin Wilck wrote: > On Thu, 2018-02-08 at 17:56 -0600, Benjamin Marzinski wrote: > > ct->running is now an atomic variable. When the thread is started > > it is set to 1. When the checker wants to kill a thread, it > > atomically > > sets the value to 0 and reads the previous value. If it was 1, > > the checker cancels the thread. If it was 0, the nothing needs to be > > done. After the checker has dealt with the thread, it sets ct- > > >thread > > to NULL. > > > > When the thread is done, it atomicalllys sets the value of ct- > > >running > > to 0 and reads the previous value. If it was 1, the thread just > > exits. > > If it was 0, then the checker is trying to cancel the thread, and so > > the thread calls pause(), which is a cancellation point. > > > > I'm missing one thing here. My poor brain is aching. > > cleanup_func() can be entered in two ways: a) if the thread has been > cancelled and passed a cancellation point already, or b) if it exits > normally and calls pthread_cleanup_pop(). > In case b), waiting for the cancellation request by calling pause() > makes sense to me. But in case a), the thread has already seen the > cancellation request - wouldn't calling pause() cause it to sleep > forever? Urgh. You're right. If it is in the cleanup helper because it already has been cancelled, then the pause isn't going get cancelled. So much for my quick rewrite. That leaves three options. 1. have either the thread or the checker detach the thread (depending on which one exits first) 2. make the checker always cancel and detach the thread. This simplifies the code, but there will zombie threads hanging around between calls to the checker. 3. just move the condlog I really don't care which one we pick anymore. -Ben > > Martin > > -- > Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107 > SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton > HRB 21284 (AG Nürnberg) -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c index b4a5cb2..894ad41 100644 --- a/libmultipath/checkers/tur.c +++ b/libmultipath/checkers/tur.c @@ -15,6 +15,7 @@ #include <errno.h> #include <sys/time.h> #include <pthread.h> +#include <urcu/uatomic.h> #include "checkers.h" @@ -44,7 +45,6 @@ struct tur_checker_context { pthread_t thread; pthread_mutex_t lock; pthread_cond_t active; - pthread_spinlock_t hldr_lock; int holders; char message[CHECKER_MSG_LEN]; }; @@ -74,13 +74,12 @@ int libcheck_init (struct checker * c) ct->state = PATH_UNCHECKED; ct->fd = -1; - ct->holders = 1; + uatomic_set(&ct->holders, 1); pthread_cond_init_mono(&ct->active); pthread_mutexattr_init(&attr); pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE); pthread_mutex_init(&ct->lock, &attr); pthread_mutexattr_destroy(&attr); - pthread_spin_init(&ct->hldr_lock, PTHREAD_PROCESS_PRIVATE); c->context = ct; return 0; @@ -90,7 +89,6 @@ static void cleanup_context(struct tur_checker_context *ct) { pthread_mutex_destroy(&ct->lock); pthread_cond_destroy(&ct->active); - pthread_spin_destroy(&ct->hldr_lock); free(ct); } @@ -99,16 +97,14 @@ void libcheck_free (struct checker * c) if (c->context) { struct tur_checker_context *ct = c->context; int holders; - pthread_t thread; - - pthread_spin_lock(&ct->hldr_lock); - ct->holders--; - holders = ct->holders; - thread = ct->thread; - pthread_spin_unlock(&ct->hldr_lock); - if (holders) - pthread_cancel(thread); - else + int running; + + running = uatomic_xchg(&ct->running, 0); + if (running) + pthread_cancel(ct->thread); + ct->thread = 0; + holders = uatomic_sub_return(&ct->holders, 1); + if (!holders) cleanup_context(ct); c->context = NULL; } @@ -218,26 +214,20 @@ retry: static void cleanup_func(void *data) { - int holders; + int running, holders; struct tur_checker_context *ct = data; - pthread_spin_lock(&ct->hldr_lock); - ct->holders--; - holders = ct->holders; - ct->thread = 0; - pthread_spin_unlock(&ct->hldr_lock); + + running = uatomic_xchg(&ct->running, 0); + holders = uatomic_sub_return(&ct->holders, 1); if (!holders) cleanup_context(ct); + if (!running) + pause(); } static int tur_running(struct tur_checker_context *ct) { - pthread_t thread; - - pthread_spin_lock(&ct->hldr_lock); - thread = ct->thread; - pthread_spin_unlock(&ct->hldr_lock); - - return thread != 0; + return (uatomic_read(&ct->running) != 0); } static void copy_msg_to_tcc(void *ct_p, const char *msg) @@ -325,7 +315,6 @@ int libcheck_check(struct checker * c) int tur_status, r; char devt[32]; - if (!ct) return PATH_UNCHECKED; @@ -349,35 +338,26 @@ int libcheck_check(struct checker * c) return PATH_WILD; } - if (ct->running) { - /* - * Check if TUR checker is still running. Hold hldr_lock - * around the pthread_cancel() call to avoid that - * pthread_cancel() gets called after the (detached) TUR - * thread has exited. - */ - pthread_spin_lock(&ct->hldr_lock); - if (ct->thread) { - if (tur_check_async_timeout(c)) { - condlog(3, "%s: tur checker timeout", - tur_devt(devt, sizeof(devt), ct)); + if (ct->thread) { + if (tur_check_async_timeout(c)) { + int running = uatomic_xchg(&ct->running, 0); + if (running) pthread_cancel(ct->thread); - ct->running = 0; - MSG(c, MSG_TUR_TIMEOUT); - tur_status = PATH_TIMEOUT; - } else { - condlog(3, "%s: tur checker not finished", + condlog(3, "%s: tur checker timeout", + tur_devt(devt, sizeof(devt), ct)); + ct->thread = 0; + MSG(c, MSG_TUR_TIMEOUT); + tur_status = PATH_TIMEOUT; + } else if (tur_running(ct)) { + condlog(3, "%s: tur checker not finished", tur_devt(devt, sizeof(devt), ct)); - ct->running++; - tur_status = PATH_PENDING; - } + tur_status = PATH_PENDING; } else { /* TUR checker done */ - ct->running = 0; + ct->thread = 0; tur_status = ct->state; strlcpy(c->message, ct->message, sizeof(c->message)); } - pthread_spin_unlock(&ct->hldr_lock); pthread_mutex_unlock(&ct->lock); } else { if (tur_running(ct)) { @@ -391,19 +371,17 @@ int libcheck_check(struct checker * c) ct->state = PATH_UNCHECKED; ct->fd = c->fd; ct->timeout = c->timeout; - pthread_spin_lock(&ct->hldr_lock); - ct->holders++; - pthread_spin_unlock(&ct->hldr_lock); + uatomic_add(&ct->holders, 1); + uatomic_set(&ct->running, 1); tur_set_async_timeout(c); setup_thread_attr(&attr, 32 * 1024, 1); r = pthread_create(&ct->thread, &attr, tur_thread, ct); pthread_attr_destroy(&attr); if (r) { - pthread_spin_lock(&ct->hldr_lock); - ct->holders--; - pthread_spin_unlock(&ct->hldr_lock); - pthread_mutex_unlock(&ct->lock); + uatomic_sub(&ct->holders, 1); + uatomic_set(&ct->running, 0); ct->thread = 0; + pthread_mutex_unlock(&ct->lock); condlog(3, "%s: failed to start tur thread, using" " sync mode", tur_devt(devt, sizeof(devt), ct)); return tur_check(c->fd, c->timeout, @@ -418,8 +396,12 @@ int libcheck_check(struct checker * c) (tur_status == PATH_PENDING || tur_status == PATH_UNCHECKED)) { condlog(3, "%s: tur checker still running", tur_devt(devt, sizeof(devt), ct)); - ct->running = 1; tur_status = PATH_PENDING; + } else { + int running = uatomic_xchg(&ct->running, 0); + if (running) + pthread_cancel(ct->thread); + ct->thread = 0; } }
Commit 6e2423fd fixed a bug where the tur checker could cancel a detached thread after it had exitted. However in fixing this, the new code grabbed a mutex (to call condlog) while holding a spin_lock. To deal with this, I've done away with the holder spin_lock completely, and replaced it with two atomic variables, based on a suggestion by Martin Wilck. The holder variable works exactly like before. When the checker is initialized, it is set to 1. When a thread is created it is incremented. When either the thread or the checker are done with the context, they atomically decrement the holder variable and check its value. If it is 0, they free the context. If it is 1, they never touch the context again. The other variable has changed. First, ct->running and ct->thread have switched uses. ct->thread is now only ever accessed by the checker, never the thread. If it is non-NULL, a thread has started up, but hasn't been dealt with by the checker yet. It is also obviously used by the checker to cancel the thread. ct->running is now an atomic variable. When the thread is started it is set to 1. When the checker wants to kill a thread, it atomically sets the value to 0 and reads the previous value. If it was 1, the checker cancels the thread. If it was 0, the nothing needs to be done. After the checker has dealt with the thread, it sets ct->thread to NULL. When the thread is done, it atomicalllys sets the value of ct->running to 0 and reads the previous value. If it was 1, the thread just exits. If it was 0, then the checker is trying to cancel the thread, and so the thread calls pause(), which is a cancellation point. Cc: Martin Wilck <mwilck@suse.com> Cc: Bart Van Assche <Bart.VanAssche@wdc.com> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> --- libmultipath/checkers/tur.c | 98 ++++++++++++++++++--------------------------- 1 file changed, 40 insertions(+), 58 deletions(-)