Message ID | 1518043787-7066-2-git-send-email-bmarzins@redhat.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
Delegated to: | Mike Snitzer |
Headers | show |
Hello Ben, On Wed, 2018-02-07 at 16:49 -0600, Benjamin Marzinski wrote: > Commit 6e2423fd fixed a bug where the tur checker could cancel a > detached thread after it had exitted. However in fixing this, the new > code grabbed a mutex (to call condlog) while holding a spin_lock. To > deal with that, and to try to keep with the maixim "lock data, not > code", I've changed how the tur checker synchronizes with its thread. Hmm, if it was only for the condlog issue, it'd certainly be possible to find a simpler solution (just move the log message out of the spin- locked code). And please explain some more why your patch implements "lock data not code" better than what we did before - it's not obvious to me. > Now, the tur checker creates joinable threads, and detaches them when > the thread is finished or has timed out. To track the state of the > threads, I've added a new variable to the checker context, ct- > >attached. Hmm, again. This adds more complexity to an already complex contruct, because now we don't know in the first place whether the checker thread is joinable or detached. IMO it makes a lot of sense for the checker to run in detached mode right from the start. If multipathd is terminated while checkers are blocked, it'll have to detach these threads in the process of terminating - I find that a bit weird. While I didn't spot obvious errors in your patch, changing the locking fundamentally is always risky to some extent, and I'm not yet convinced that the problem you're trying to solve justifies this risk. > When a thread starts, attached is set to 1. When the thread finishes, > it > saves the value of attached, and then zeros it out, while locked. If > attached was set, it detaches itself. Why aren't you simply using pthread_attr_getdetachstate()? > > When the tur checker gives up on a thread, it also saves and > decrements > ct->attached, while locked. At the same time it saves the value of > ct->thread. If attached was set, it cancels the thread, and then > detaches it. Have you thought about using uatomic_add_return(), which we have available anyway through liburcu? https://github.com/urcu/userspace-rcu/blob/master/doc/uatomic-api.md We might actually be able to get rid of the hldr_lock altogether, which would solve the initial spin_lock/mutex problem without any additional code. Some minor remarks below, Regards, Martin > So the values that are protected by the spin lock are now ct- > >holders, > ct->thread, and ct->attached. There are cases where the tur checker > just > wants to know if the thread is running. If so, its safe to simply > read > ct->thread without locking. Also, if it knows that the thread isn't > running, it can freely access all of these variables. I've left the > locking in-place in these cases to make the static analyzers happy. > However I have added comments stating when the locking isn't actually > necessary. > > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> > --- > libmultipath/checkers/tur.c | 66 ++++++++++++++++++++++++++++++++--- > ---------- > 1 file changed, 48 insertions(+), 18 deletions(-) > > diff --git a/libmultipath/checkers/tur.c > b/libmultipath/checkers/tur.c > index b4a5cb2..6ae9700 100644 > --- a/libmultipath/checkers/tur.c > +++ b/libmultipath/checkers/tur.c > @@ -46,6 +46,7 @@ struct tur_checker_context { > pthread_cond_t active; > pthread_spinlock_t hldr_lock; > int holders; > + int attached; > char message[CHECKER_MSG_LEN]; > }; > > @@ -98,17 +99,21 @@ void libcheck_free (struct checker * c) > { > if (c->context) { > struct tur_checker_context *ct = c->context; > - int holders; > + int attached, holders; > pthread_t thread; > > pthread_spin_lock(&ct->hldr_lock); > ct->holders--; > holders = ct->holders; > + attached = ct->attached; > + ct->attached = 0; > thread = ct->thread; > pthread_spin_unlock(&ct->hldr_lock); > - if (holders) > + if (attached) { > pthread_cancel(thread); > - else > + pthread_detach(thread); > + } > + if (!holders) > cleanup_context(ct); > c->context = NULL; > } > @@ -218,15 +223,21 @@ retry: > > static void cleanup_func(void *data) > { > - int holders; > + int attached, holders; > + pthread_t thread; > struct tur_checker_context *ct = data; > pthread_spin_lock(&ct->hldr_lock); > ct->holders--; > holders = ct->holders; > + attached = ct->attached; > + ct->attached = 0; > + thread = ct->thread; > ct->thread = 0; > pthread_spin_unlock(&ct->hldr_lock); > if (!holders) > cleanup_context(ct); > + if (attached) > + pthread_detach(thread); > } > > static int tur_running(struct tur_checker_context *ct) > @@ -324,7 +335,8 @@ int libcheck_check(struct checker * c) > pthread_attr_t attr; > int tur_status, r; > char devt[32]; > - > + pthread_t thread; > + int timed_out, attached; > > if (!ct) > return PATH_UNCHECKED; > @@ -350,18 +362,27 @@ int libcheck_check(struct checker * c) > } > > if (ct->running) { > - /* > - * Check if TUR checker is still running. Hold > hldr_lock > - * around the pthread_cancel() call to avoid that > - * pthread_cancel() gets called after the (detached) > TUR > - * thread has exited. > - */ > - pthread_spin_lock(&ct->hldr_lock); > - if (ct->thread) { > - if (tur_check_async_timeout(c)) { > + timed_out = tur_check_async_timeout(c); > + if (timed_out) { > + pthread_spin_lock(&ct->hldr_lock); > + attached = ct->attached; > + ct->attached = 0; > + thread = ct->thread; > + pthread_spin_unlock(&ct->hldr_lock); > + if (attached) { > + pthread_cancel(thread); > + pthread_detach(thread); > + } > + } else { > + /* locking here solely to make static > analyzers happy */ Out of curiosity: which analyzers have you been using? > + pthread_spin_lock(&ct->hldr_lock); > + thread = ct->thread; > + pthread_spin_unlock(&ct->hldr_lock); > + } > + if (thread) { > + if (timed_out) { > condlog(3, "%s: tur checker > timeout", > tur_devt(devt, sizeof(devt), > ct)); > - pthread_cancel(ct->thread); > ct->running = 0; > MSG(c, MSG_TUR_TIMEOUT); > tur_status = PATH_TIMEOUT; > @@ -377,9 +398,13 @@ int libcheck_check(struct checker * c) > tur_status = ct->state; > strlcpy(c->message, ct->message, sizeof(c- > >message)); > } > - pthread_spin_unlock(&ct->hldr_lock); > pthread_mutex_unlock(&ct->lock); > } else { > + /* > + * locking is necessary here, so that we know that > the > + * thread finished all access to the context before > we > + * delare it not running > + */ > if (tur_running(ct)) { > /* pthread cancel failed. continue in sync > mode */ > pthread_mutex_unlock(&ct->lock); > @@ -391,19 +416,23 @@ int libcheck_check(struct checker * c) > ct->state = PATH_UNCHECKED; > ct->fd = c->fd; > ct->timeout = c->timeout; > + /* locking here solely to make static analyzers > happy */ > pthread_spin_lock(&ct->hldr_lock); > ct->holders++; > + ct->attached = 1; > pthread_spin_unlock(&ct->hldr_lock); > tur_set_async_timeout(c); > - setup_thread_attr(&attr, 32 * 1024, 1); > + setup_thread_attr(&attr, 32 * 1024, 0); > r = pthread_create(&ct->thread, &attr, tur_thread, > ct); > pthread_attr_destroy(&attr); > if (r) { > + /* locking here solely to make static > analyzers happy */ > pthread_spin_lock(&ct->hldr_lock); > ct->holders--; > + ct->attached = 0; > + ct->thread = 0; > pthread_spin_unlock(&ct->hldr_lock); > pthread_mutex_unlock(&ct->lock); > - ct->thread = 0; > condlog(3, "%s: failed to start tur thread, This part (moving ct->thread =0 into the cricital section) is a bug fix. > using" > " sync mode", tur_devt(devt, > sizeof(devt), ct)); > return tur_check(c->fd, c->timeout, > @@ -414,6 +443,7 @@ int libcheck_check(struct checker * c) > tur_status = ct->state; > strlcpy(c->message, ct->message, sizeof(c- > >message)); > pthread_mutex_unlock(&ct->lock); > + /* locking here solely to make static analyzers > happy */ > if (tur_running(ct) && > (tur_status == PATH_PENDING || tur_status == > PATH_UNCHECKED)) { > condlog(3, "%s: tur checker still running",
On Thu, Feb 08, 2018 at 09:49:20AM +0100, Martin Wilck wrote: > Hello Ben, > > On Wed, 2018-02-07 at 16:49 -0600, Benjamin Marzinski wrote: > > Commit 6e2423fd fixed a bug where the tur checker could cancel a > > detached thread after it had exitted. However in fixing this, the new > > code grabbed a mutex (to call condlog) while holding a spin_lock. To > > deal with that, and to try to keep with the maixim "lock data, not > > code", I've changed how the tur checker synchronizes with its thread. > > Hmm, if it was only for the condlog issue, it'd certainly be possible > to find a simpler solution (just move the log message out of the spin- > locked code). And please explain some more why your patch implements > "lock data not code" better than what we did before - it's not obvious > to me. The way I view spin_locks is that they hold up a processor, so they should only be locking a defined set of shared resources, and making sure that they move from a defined state to another defined state correctly. The existing code calls pthread_cancel(), tur_check_async_timeout(), and condlog() in the spinlock, and only condlog() can be removed from it. While I don't believe that pthread_cancel() and tur_check_async_timeout() can block, they certainly don't need to be run under a spin_lock. My code just makes sure that holders, thread, and attached are updated atomically. > > > Now, the tur checker creates joinable threads, and detaches them when > > the thread is finished or has timed out. To track the state of the > > threads, I've added a new variable to the checker context, ct- > > >attached. > > Hmm, again. This adds more complexity to an already complex contruct, > because now we don't know in the first place whether the checker thread > is joinable or detached. IMO it makes a lot of sense for the checker to > run in detached mode right from the start. If multipathd is terminated > while checkers are blocked, it'll have to detach these threads in the > process of terminating - I find that a bit weird. I'm not sure I get what's weird here. Calling pthread_cancel() and then pthread_detach() is commonly recommended online as an alternative to calling pthread_cancel() and then pthread_join() if you don't want to wait for the results. > While I didn't spot obvious errors in your patch, changing the locking > fundamentally is always risky to some extent, and I'm not yet convinced > that the problem you're trying to solve justifies this risk. > > > When a thread starts, attached is set to 1. When the thread finishes, > > it > > saves the value of attached, and then zeros it out, while locked. If > > attached was set, it detaches itself. > > Why aren't you simply using pthread_attr_getdetachstate()? I'm not sure how you would avoid races if you did this. ct->attach gets set to 0 inside the spinlock to notify the other side that they don't have to do the detach. If you just check if the thread is currently detached, without being able to atomically noitify the other side that you are going to do the detach, what's to stop both sides from checking at the same time and both doing the detach? Am I missing something here? > > > > When the tur checker gives up on a thread, it also saves and > > decrements > > ct->attached, while locked. At the same time it saves the value of > > ct->thread. If attached was set, it cancels the thread, and then > > detaches it. > > Have you thought about using uatomic_add_return(), which we have > available anyway through liburcu? > https://github.com/urcu/userspace-rcu/blob/master/doc/uatomic-api.md > > We might actually be able to get rid of the hldr_lock altogether, which > would solve the initial spin_lock/mutex problem without any additional > code. No, I haven't. I can look into that. > Some minor remarks below, > Regards, Martin > > > So the values that are protected by the spin lock are now ct- > > >holders, > > ct->thread, and ct->attached. There are cases where the tur checker > > just > > wants to know if the thread is running. If so, its safe to simply > > read > > ct->thread without locking. Also, if it knows that the thread isn't > > running, it can freely access all of these variables. I've left the > > locking in-place in these cases to make the static analyzers happy. > > However I have added comments stating when the locking isn't actually > > necessary. > > > > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> > > --- > > libmultipath/checkers/tur.c | 66 ++++++++++++++++++++++++++++++++--- > > ---------- > > 1 file changed, 48 insertions(+), 18 deletions(-) > > > > diff --git a/libmultipath/checkers/tur.c > > b/libmultipath/checkers/tur.c > > index b4a5cb2..6ae9700 100644 > > --- a/libmultipath/checkers/tur.c > > +++ b/libmultipath/checkers/tur.c > > @@ -46,6 +46,7 @@ struct tur_checker_context { > > pthread_cond_t active; > > pthread_spinlock_t hldr_lock; > > int holders; > > + int attached; > > char message[CHECKER_MSG_LEN]; > > }; > > > > @@ -98,17 +99,21 @@ void libcheck_free (struct checker * c) > > { > > if (c->context) { > > struct tur_checker_context *ct = c->context; > > - int holders; > > + int attached, holders; > > pthread_t thread; > > > > pthread_spin_lock(&ct->hldr_lock); > > ct->holders--; > > holders = ct->holders; > > + attached = ct->attached; > > + ct->attached = 0; > > thread = ct->thread; > > pthread_spin_unlock(&ct->hldr_lock); > > - if (holders) > > + if (attached) { > > pthread_cancel(thread); > > - else > > + pthread_detach(thread); > > + } > > + if (!holders) > > cleanup_context(ct); > > c->context = NULL; > > } > > @@ -218,15 +223,21 @@ retry: > > > > static void cleanup_func(void *data) > > { > > - int holders; > > + int attached, holders; > > + pthread_t thread; > > struct tur_checker_context *ct = data; > > pthread_spin_lock(&ct->hldr_lock); > > ct->holders--; > > holders = ct->holders; > > + attached = ct->attached; > > + ct->attached = 0; > > + thread = ct->thread; > > ct->thread = 0; > > pthread_spin_unlock(&ct->hldr_lock); > > if (!holders) > > cleanup_context(ct); > > + if (attached) > > + pthread_detach(thread); > > } > > > > static int tur_running(struct tur_checker_context *ct) > > @@ -324,7 +335,8 @@ int libcheck_check(struct checker * c) > > pthread_attr_t attr; > > int tur_status, r; > > char devt[32]; > > - > > + pthread_t thread; > > + int timed_out, attached; > > > > if (!ct) > > return PATH_UNCHECKED; > > @@ -350,18 +362,27 @@ int libcheck_check(struct checker * c) > > } > > > > if (ct->running) { > > - /* > > - * Check if TUR checker is still running. Hold > > hldr_lock > > - * around the pthread_cancel() call to avoid that > > - * pthread_cancel() gets called after the (detached) > > TUR > > - * thread has exited. > > - */ > > - pthread_spin_lock(&ct->hldr_lock); > > - if (ct->thread) { > > - if (tur_check_async_timeout(c)) { > > + timed_out = tur_check_async_timeout(c); > > + if (timed_out) { > > + pthread_spin_lock(&ct->hldr_lock); > > + attached = ct->attached; > > + ct->attached = 0; > > + thread = ct->thread; > > + pthread_spin_unlock(&ct->hldr_lock); > > + if (attached) { > > + pthread_cancel(thread); > > + pthread_detach(thread); > > + } > > + } else { > > + /* locking here solely to make static > > analyzers happy */ > > Out of curiosity: which analyzers have you been using? You would have to ask Bart, who I forgot to CC on this patch. He added the code to always read ct->thread in a spin_lock, and said it was to make an analyzer happy. > > + pthread_spin_lock(&ct->hldr_lock); > > + thread = ct->thread; > > + pthread_spin_unlock(&ct->hldr_lock); > > + } > > + if (thread) { > > + if (timed_out) { > > condlog(3, "%s: tur checker > > timeout", > > tur_devt(devt, sizeof(devt), > > ct)); > > - pthread_cancel(ct->thread); > > ct->running = 0; > > MSG(c, MSG_TUR_TIMEOUT); > > tur_status = PATH_TIMEOUT; > > @@ -377,9 +398,13 @@ int libcheck_check(struct checker * c) > > tur_status = ct->state; > > strlcpy(c->message, ct->message, sizeof(c- > > >message)); > > } > > - pthread_spin_unlock(&ct->hldr_lock); > > pthread_mutex_unlock(&ct->lock); > > } else { > > + /* > > + * locking is necessary here, so that we know that > > the > > + * thread finished all access to the context before > > we > > + * delare it not running > > + */ > > if (tur_running(ct)) { > > /* pthread cancel failed. continue in sync > > mode */ > > pthread_mutex_unlock(&ct->lock); > > @@ -391,19 +416,23 @@ int libcheck_check(struct checker * c) > > ct->state = PATH_UNCHECKED; > > ct->fd = c->fd; > > ct->timeout = c->timeout; > > + /* locking here solely to make static analyzers > > happy */ > > pthread_spin_lock(&ct->hldr_lock); > > ct->holders++; > > + ct->attached = 1; > > pthread_spin_unlock(&ct->hldr_lock); > > tur_set_async_timeout(c); > > - setup_thread_attr(&attr, 32 * 1024, 1); > > + setup_thread_attr(&attr, 32 * 1024, 0); > > r = pthread_create(&ct->thread, &attr, tur_thread, > > ct); > > pthread_attr_destroy(&attr); > > if (r) { > > + /* locking here solely to make static > > analyzers happy */ > > pthread_spin_lock(&ct->hldr_lock); > > ct->holders--; > > + ct->attached = 0; > > + ct->thread = 0; > > pthread_spin_unlock(&ct->hldr_lock); > > pthread_mutex_unlock(&ct->lock); > > - ct->thread = 0; > > condlog(3, "%s: failed to start tur thread, > > This part (moving ct->thread =0 into the cricital section) is a bug > fix. Well, we've already determined that another thread wasn't running, and then we failed to start a new thread, so really all the locking here is unnecessary, but I agree that we should either lock all the variables or none of them. > > > using" > > " sync mode", tur_devt(devt, > > sizeof(devt), ct)); > > return tur_check(c->fd, c->timeout, > > @@ -414,6 +443,7 @@ int libcheck_check(struct checker * c) > > tur_status = ct->state; > > strlcpy(c->message, ct->message, sizeof(c- > > >message)); > > pthread_mutex_unlock(&ct->lock); > > + /* locking here solely to make static analyzers > > happy */ > > if (tur_running(ct) && > > (tur_status == PATH_PENDING || tur_status == > > PATH_UNCHECKED)) { > > condlog(3, "%s: tur checker still running", > > -- > Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107 > SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton > HRB 21284 (AG Nürnberg) -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Wed, 2018-02-07 at 16:49 -0600, Benjamin Marzinski wrote: > Commit 6e2423fd fixed a bug where the tur checker could cancel a > detached thread after it had exitted. However in fixing this, the new > code grabbed a mutex (to call condlog) while holding a spin_lock. To > deal with that, and to try to keep with the maixim "lock data, not > code", I've changed how the tur checker synchronizes with its thread. > > Now, the tur checker creates joinable threads, and detaches them when > the thread is finished or has timed out. To track the state of the > threads, I've added a new variable to the checker context, ct->attached. > When a thread starts, attached is set to 1. When the thread finishes, it > saves the value of attached, and then zeros it out, while locked. If > attached was set, it detaches itself. > > When the tur checker gives up on a thread, it also saves and decrements > ct->attached, while locked. At the same time it saves the value of > ct->thread. If attached was set, it cancels the thread, and then > detaches it. > > So the values that are protected by the spin lock are now ct->holders, > ct->thread, and ct->attached. There are cases where the tur checker just > wants to know if the thread is running. If so, its safe to simply read > ct->thread without locking. Also, if it knows that the thread isn't > running, it can freely access all of these variables. I've left the > locking in-place in these cases to make the static analyzers happy. > However I have added comments stating when the locking isn't actually > necessary. Hello Ben, Have you considered to move the condlog() statements out of the spinlock section? I think that would lead to a much smaller and less contrived change than the patch you posted. Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Thu, 2018-02-08 at 11:52 -0600, Benjamin Marzinski wrote: > On Thu, Feb 08, 2018 at 09:49:20AM +0100, Martin Wilck wrote: > > On Wed, 2018-02-07 at 16:49 -0600, Benjamin Marzinski wrote: > > > + } else { > > > + /* locking here solely to make static > > > analyzers happy */ > > > > Out of curiosity: which analyzers have you been using? > > You would have to ask Bart, who I forgot to CC on this patch. He added > the code to always read ct->thread in a spin_lock, and said it was to > make an analyzer happy. The analyzer I used is a dynamic analyzer, namely drd. See also http://valgrind.org/docs/manual/drd-manual.html. Bart. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Thu, Feb 08, 2018 at 06:19:32PM +0000, Bart Van Assche wrote: > On Wed, 2018-02-07 at 16:49 -0600, Benjamin Marzinski wrote: > > Commit 6e2423fd fixed a bug where the tur checker could cancel a > > detached thread after it had exitted. However in fixing this, the new > > code grabbed a mutex (to call condlog) while holding a spin_lock. To > > deal with that, and to try to keep with the maixim "lock data, not > > code", I've changed how the tur checker synchronizes with its thread. > > > > Now, the tur checker creates joinable threads, and detaches them when > > the thread is finished or has timed out. To track the state of the > > threads, I've added a new variable to the checker context, ct->attached. > > When a thread starts, attached is set to 1. When the thread finishes, it > > saves the value of attached, and then zeros it out, while locked. If > > attached was set, it detaches itself. > > > > When the tur checker gives up on a thread, it also saves and decrements > > ct->attached, while locked. At the same time it saves the value of > > ct->thread. If attached was set, it cancels the thread, and then > > detaches it. > > > > So the values that are protected by the spin lock are now ct->holders, > > ct->thread, and ct->attached. There are cases where the tur checker just > > wants to know if the thread is running. If so, its safe to simply read > > ct->thread without locking. Also, if it knows that the thread isn't > > running, it can freely access all of these variables. I've left the > > locking in-place in these cases to make the static analyzers happy. > > However I have added comments stating when the locking isn't actually > > necessary. > > Hello Ben, > > Have you considered to move the condlog() statements out of the spinlock > section? I think that would lead to a much smaller and less contrived change > than the patch you posted. Yes. I thought it was better to keep the amount of code locked under a spin_lock as small as possible. But it seems that I am alone in thinking my trade-off was worth it. If both you and Martin object, I'm fine with redoing the patch to simply move the condlog(), since I don't believe that either tur_check_async_timeout() or pthread_cancel() can block. -Ben > Thanks, > > Bart. > > -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Thu, 2018-02-08 at 13:27 -0600, Benjamin Marzinski wrote: > On Thu, Feb 08, 2018 at 06:19:32PM +0000, Bart Van Assche wrote: > > On Wed, 2018-02-07 at 16:49 -0600, Benjamin Marzinski wrote: > > > Commit 6e2423fd fixed a bug where the tur checker could cancel a > > > detached thread after it had exitted. However in fixing this, the new > > > code grabbed a mutex (to call condlog) while holding a spin_lock. To > > > deal with that, and to try to keep with the maixim "lock data, not > > > code", I've changed how the tur checker synchronizes with its thread. > > > > > > Now, the tur checker creates joinable threads, and detaches them when > > > the thread is finished or has timed out. To track the state of the > > > threads, I've added a new variable to the checker context, ct->attached. > > > When a thread starts, attached is set to 1. When the thread finishes, it > > > saves the value of attached, and then zeros it out, while locked. If > > > attached was set, it detaches itself. > > > > > > When the tur checker gives up on a thread, it also saves and decrements > > > ct->attached, while locked. At the same time it saves the value of > > > ct->thread. If attached was set, it cancels the thread, and then > > > detaches it. > > > > > > So the values that are protected by the spin lock are now ct->holders, > > > ct->thread, and ct->attached. There are cases where the tur checker just > > > wants to know if the thread is running. If so, its safe to simply read > > > ct->thread without locking. Also, if it knows that the thread isn't > > > running, it can freely access all of these variables. I've left the > > > locking in-place in these cases to make the static analyzers happy. > > > However I have added comments stating when the locking isn't actually > > > necessary. > > > > Hello Ben, > > > > Have you considered to move the condlog() statements out of the spinlock > > section? I think that would lead to a much smaller and less contrived change > > than the patch you posted. > > Yes. I thought it was better to keep the amount of code locked under a > spin_lock as small as possible. But it seems that I am alone in thinking > my trade-off was worth it. If both you and Martin object, I'm fine with > redoing the patch to simply move the condlog(), since I don't believe > that either tur_check_async_timeout() or pthread_cancel() can block. Hello Ben, Have you considered to convert the hldr_lock spinlock into a mutex? I think that the hldr_lock spinlock got introduced by commit 892f7b333a03 ("multipath: fix scsi async tur checker corruption"). However, I have not found an explanation in the description of that commit why hldr_lock is a spinlock and not any other type of synchronization primitive. Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c index b4a5cb2..6ae9700 100644 --- a/libmultipath/checkers/tur.c +++ b/libmultipath/checkers/tur.c @@ -46,6 +46,7 @@ struct tur_checker_context { pthread_cond_t active; pthread_spinlock_t hldr_lock; int holders; + int attached; char message[CHECKER_MSG_LEN]; }; @@ -98,17 +99,21 @@ void libcheck_free (struct checker * c) { if (c->context) { struct tur_checker_context *ct = c->context; - int holders; + int attached, holders; pthread_t thread; pthread_spin_lock(&ct->hldr_lock); ct->holders--; holders = ct->holders; + attached = ct->attached; + ct->attached = 0; thread = ct->thread; pthread_spin_unlock(&ct->hldr_lock); - if (holders) + if (attached) { pthread_cancel(thread); - else + pthread_detach(thread); + } + if (!holders) cleanup_context(ct); c->context = NULL; } @@ -218,15 +223,21 @@ retry: static void cleanup_func(void *data) { - int holders; + int attached, holders; + pthread_t thread; struct tur_checker_context *ct = data; pthread_spin_lock(&ct->hldr_lock); ct->holders--; holders = ct->holders; + attached = ct->attached; + ct->attached = 0; + thread = ct->thread; ct->thread = 0; pthread_spin_unlock(&ct->hldr_lock); if (!holders) cleanup_context(ct); + if (attached) + pthread_detach(thread); } static int tur_running(struct tur_checker_context *ct) @@ -324,7 +335,8 @@ int libcheck_check(struct checker * c) pthread_attr_t attr; int tur_status, r; char devt[32]; - + pthread_t thread; + int timed_out, attached; if (!ct) return PATH_UNCHECKED; @@ -350,18 +362,27 @@ int libcheck_check(struct checker * c) } if (ct->running) { - /* - * Check if TUR checker is still running. Hold hldr_lock - * around the pthread_cancel() call to avoid that - * pthread_cancel() gets called after the (detached) TUR - * thread has exited. - */ - pthread_spin_lock(&ct->hldr_lock); - if (ct->thread) { - if (tur_check_async_timeout(c)) { + timed_out = tur_check_async_timeout(c); + if (timed_out) { + pthread_spin_lock(&ct->hldr_lock); + attached = ct->attached; + ct->attached = 0; + thread = ct->thread; + pthread_spin_unlock(&ct->hldr_lock); + if (attached) { + pthread_cancel(thread); + pthread_detach(thread); + } + } else { + /* locking here solely to make static analyzers happy */ + pthread_spin_lock(&ct->hldr_lock); + thread = ct->thread; + pthread_spin_unlock(&ct->hldr_lock); + } + if (thread) { + if (timed_out) { condlog(3, "%s: tur checker timeout", tur_devt(devt, sizeof(devt), ct)); - pthread_cancel(ct->thread); ct->running = 0; MSG(c, MSG_TUR_TIMEOUT); tur_status = PATH_TIMEOUT; @@ -377,9 +398,13 @@ int libcheck_check(struct checker * c) tur_status = ct->state; strlcpy(c->message, ct->message, sizeof(c->message)); } - pthread_spin_unlock(&ct->hldr_lock); pthread_mutex_unlock(&ct->lock); } else { + /* + * locking is necessary here, so that we know that the + * thread finished all access to the context before we + * delare it not running + */ if (tur_running(ct)) { /* pthread cancel failed. continue in sync mode */ pthread_mutex_unlock(&ct->lock); @@ -391,19 +416,23 @@ int libcheck_check(struct checker * c) ct->state = PATH_UNCHECKED; ct->fd = c->fd; ct->timeout = c->timeout; + /* locking here solely to make static analyzers happy */ pthread_spin_lock(&ct->hldr_lock); ct->holders++; + ct->attached = 1; pthread_spin_unlock(&ct->hldr_lock); tur_set_async_timeout(c); - setup_thread_attr(&attr, 32 * 1024, 1); + setup_thread_attr(&attr, 32 * 1024, 0); r = pthread_create(&ct->thread, &attr, tur_thread, ct); pthread_attr_destroy(&attr); if (r) { + /* locking here solely to make static analyzers happy */ pthread_spin_lock(&ct->hldr_lock); ct->holders--; + ct->attached = 0; + ct->thread = 0; pthread_spin_unlock(&ct->hldr_lock); pthread_mutex_unlock(&ct->lock); - ct->thread = 0; condlog(3, "%s: failed to start tur thread, using" " sync mode", tur_devt(devt, sizeof(devt), ct)); return tur_check(c->fd, c->timeout, @@ -414,6 +443,7 @@ int libcheck_check(struct checker * c) tur_status = ct->state; strlcpy(c->message, ct->message, sizeof(c->message)); pthread_mutex_unlock(&ct->lock); + /* locking here solely to make static analyzers happy */ if (tur_running(ct) && (tur_status == PATH_PENDING || tur_status == PATH_UNCHECKED)) { condlog(3, "%s: tur checker still running",
Commit 6e2423fd fixed a bug where the tur checker could cancel a detached thread after it had exitted. However in fixing this, the new code grabbed a mutex (to call condlog) while holding a spin_lock. To deal with that, and to try to keep with the maixim "lock data, not code", I've changed how the tur checker synchronizes with its thread. Now, the tur checker creates joinable threads, and detaches them when the thread is finished or has timed out. To track the state of the threads, I've added a new variable to the checker context, ct->attached. When a thread starts, attached is set to 1. When the thread finishes, it saves the value of attached, and then zeros it out, while locked. If attached was set, it detaches itself. When the tur checker gives up on a thread, it also saves and decrements ct->attached, while locked. At the same time it saves the value of ct->thread. If attached was set, it cancels the thread, and then detaches it. So the values that are protected by the spin lock are now ct->holders, ct->thread, and ct->attached. There are cases where the tur checker just wants to know if the thread is running. If so, its safe to simply read ct->thread without locking. Also, if it knows that the thread isn't running, it can freely access all of these variables. I've left the locking in-place in these cases to make the static analyzers happy. However I have added comments stating when the locking isn't actually necessary. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> --- libmultipath/checkers/tur.c | 66 ++++++++++++++++++++++++++++++++------------- 1 file changed, 48 insertions(+), 18 deletions(-)