Message ID | 1603487708-12547-6-git-send-email-bmarzins@redhat.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
Delegated to: | christophe varoqui |
Headers | show |
Series | Misc Multipath patches | expand |
On Fri, 2020-10-23 at 16:15 -0500, Benjamin Marzinski wrote: > The multipathd tur checker thread is designed to be able to finish at > any time, even after the tur checker itself has been freed. The > multipathd shutdown code makes sure all the checkers have been freed > before freeing the checker_class and calling dlclose() to unload the > DSO, but this doesn't guarantee that the checker threads have > finished. > If one hasn't, the DSO will get unloaded while the thread still > running > code from it, causing a segfault. Unfortunately, it's not possible to > be > sure that all tur checker threads have ended during shutdown, without > making them joinable. > > However, since libmultipath will never be reinitialized after it has > been uninitialzed, not dlclosing the tur checker DSO once a thread is > started has minimal cost (keeping the DSO code around until the > program > exits, which usually happens right after freeing the checkers). I'm not against this, but have you considered using an atomic refcount for the DSO? With every tur thread starting, we could increase it, and decrease it in the cleanup function of the thread when it exits. That should be safe. If the refcount was positive when we exit, we could refrain from unloading the DSO. Regards, Martin -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Fri, Oct 30, 2020 at 09:15:39PM +0000, Martin Wilck wrote: > On Fri, 2020-10-23 at 16:15 -0500, Benjamin Marzinski wrote: > > The multipathd tur checker thread is designed to be able to finish at > > any time, even after the tur checker itself has been freed. The > > multipathd shutdown code makes sure all the checkers have been freed > > before freeing the checker_class and calling dlclose() to unload the > > DSO, but this doesn't guarantee that the checker threads have > > finished. > > If one hasn't, the DSO will get unloaded while the thread still > > running > > code from it, causing a segfault. Unfortunately, it's not possible to > > be > > sure that all tur checker threads have ended during shutdown, without > > making them joinable. > > > > However, since libmultipath will never be reinitialized after it has > > been uninitialzed, not dlclosing the tur checker DSO once a thread is > > started has minimal cost (keeping the DSO code around until the > > program > > exits, which usually happens right after freeing the checkers). > > I'm not against this, but have you considered using an atomic refcount > for the DSO? With every tur thread starting, we could increase it, and > decrease it in the cleanup function of the thread when it exits. That > should be safe. If the refcount was positive when we exit, we could > refrain from unloading the DSO. I tried exactly that, and I would still get crashes, even if it put the code that decrements the atomic variable in a function that's not part of the DSO, and put a pthread_exit() before the final pthread_cleanup_pop() that would decrement it in tur_thread, so that after the cleanup code is called the thread should never return to code that is in the DSO. I had to add sleeps in code to force various orderings, but I couldn't find any way that worked for all possible orderings. I would love it if this worked, and you're free to try, but I could not get this method to work. -Ben > Regards, > Martin -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Fri, Oct 30, 2020 at 09:15:39PM +0000, Martin Wilck wrote: > On Fri, 2020-10-23 at 16:15 -0500, Benjamin Marzinski wrote: > > The multipathd tur checker thread is designed to be able to finish at > > any time, even after the tur checker itself has been freed. The > > multipathd shutdown code makes sure all the checkers have been freed > > before freeing the checker_class and calling dlclose() to unload the > > DSO, but this doesn't guarantee that the checker threads have > > finished. > > If one hasn't, the DSO will get unloaded while the thread still > > running > > code from it, causing a segfault. Unfortunately, it's not possible to > > be > > sure that all tur checker threads have ended during shutdown, without > > making them joinable. > > > > However, since libmultipath will never be reinitialized after it has > > been uninitialzed, not dlclosing the tur checker DSO once a thread is > > started has minimal cost (keeping the DSO code around until the > > program > > exits, which usually happens right after freeing the checkers). > > I'm not against this, but have you considered using an atomic refcount > for the DSO? With every tur thread starting, we could increase it, and > decrease it in the cleanup function of the thread when it exits. That > should be safe. If the refcount was positive when we exit, we could > refrain from unloading the DSO. > > Regards, > Martin NAK. I apparently forgot to commit the version file changes. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Mon, 2020-11-02 at 12:27 -0600, Benjamin Marzinski wrote: > On Fri, Oct 30, 2020 at 09:15:39PM +0000, Martin Wilck wrote: > > On Fri, 2020-10-23 at 16:15 -0500, Benjamin Marzinski wrote: > > > The multipathd tur checker thread is designed to be able to > > > finish at > > > any time, even after the tur checker itself has been freed. The > > > multipathd shutdown code makes sure all the checkers have been > > > freed > > > before freeing the checker_class and calling dlclose() to unload > > > the > > > DSO, but this doesn't guarantee that the checker threads have > > > finished. > > > If one hasn't, the DSO will get unloaded while the thread still > > > running > > > code from it, causing a segfault. Unfortunately, it's not > > > possible to > > > be > > > sure that all tur checker threads have ended during shutdown, > > > without > > > making them joinable. > > > > > > However, since libmultipath will never be reinitialized after it > > > has > > > been uninitialzed, not dlclosing the tur checker DSO once a > > > thread is > > > started has minimal cost (keeping the DSO code around until the > > > program > > > exits, which usually happens right after freeing the checkers). > > > > I'm not against this, but have you considered using an > > atomic refcount > > for the DSO? With every tur thread starting, we could increase it, > > and > > decrease it in the cleanup function of the thread when it exits. > > That > > should be safe. If the refcount was positive when we exit, we could > > refrain from unloading the DSO. > > I tried exactly that, and I would still get crashes, even if it put > the > code that decrements the atomic variable in a function that's not > part > of the DSO, and put a pthread_exit() before the final > pthread_cleanup_pop() that would decrement it in tur_thread, so that > after the cleanup code is called the thread should never return to > code > that is in the DSO. I had to add sleeps in code to force various > orderings, but I couldn't find any way that worked for all possible > orderings. I would love it if this worked, and you're free to try, > but > I could not get this method to work. I've experimented a bit with a trivial test program, and I found that it worked stably if decrementing the refcount is really the last thing thread's cleanup function does. Could you provide some details about the sleeps that you'd put in that made this approach fail? Martin
On Wed, Nov 04, 2020 at 10:39:39PM +0000, Martin Wilck wrote: > On Mon, 2020-11-02 at 12:27 -0600, Benjamin Marzinski wrote: > > On Fri, Oct 30, 2020 at 09:15:39PM +0000, Martin Wilck wrote: > > > On Fri, 2020-10-23 at 16:15 -0500, Benjamin Marzinski wrote: > > > > The multipathd tur checker thread is designed to be able to > > > > finish at > > > > any time, even after the tur checker itself has been freed. The > > > > multipathd shutdown code makes sure all the checkers have been > > > > freed > > > > before freeing the checker_class and calling dlclose() to unload > > > > the > > > > DSO, but this doesn't guarantee that the checker threads have > > > > finished. > > > > If one hasn't, the DSO will get unloaded while the thread still > > > > running > > > > code from it, causing a segfault. Unfortunately, it's not > > > > possible to > > > > be > > > > sure that all tur checker threads have ended during shutdown, > > > > without > > > > making them joinable. > > > > > > > > However, since libmultipath will never be reinitialized after it > > > > has > > > > been uninitialzed, not dlclosing the tur checker DSO once a > > > > thread is > > > > started has minimal cost (keeping the DSO code around until the > > > > program > > > > exits, which usually happens right after freeing the checkers). > > > > > > I'm not against this, but have you considered using an > > > atomic refcount > > > for the DSO? With every tur thread starting, we could increase it, > > > and > > > decrease it in the cleanup function of the thread when it exits. > > > That > > > should be safe. If the refcount was positive when we exit, we could > > > refrain from unloading the DSO. > > > > I tried exactly that, and I would still get crashes, even if it put > > the > > code that decrements the atomic variable in a function that's not > > part > > of the DSO, and put a pthread_exit() before the final > > pthread_cleanup_pop() that would decrement it in tur_thread, so that > > after the cleanup code is called the thread should never return to > > code > > that is in the DSO. I had to add sleeps in code to force various > > orderings, but I couldn't find any way that worked for all possible > > orderings. I would love it if this worked, and you're free to try, > > but > > I could not get this method to work. > > I've experimented a bit with a trivial test program, and I found that > it worked stably if decrementing the refcount is really the last thing > thread's cleanup function does. Could you provide some details about > the sleeps that you'd put in that made this approach fail? I believe the situation that continued to crash was where the tur_thread() exitted naturally (so it set running to 0), although I'm not sure that this is necessary, or if it would still crash when running the cleanup function because of a cancel. I put the cleanup function to decrement the count in libmultipath, so that it wasn't part of the DSO, and then I put a sleep(5) as the last line of the cleanup function, and a sleep(10) as the last line of cleanup_checkers(). I also had to set running to 0 before starting the thread, and then take out the code to pause the thread if running was aleady 0, to make sure the thread acted like it was the one to set running to 0. Then the idea is to stop multipathd while there is a thread in its sleep period, so that multipathd sees that the counter is 0, and closes the dso, and then the thread finishes before multipathd shuts the rest of the way down. -Ben > Martin > > -- Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107 SUSE > Software Solutions Germany GmbH HRB 36809, AG Nürnberg GF: Felix > Imendörffer > -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Wed, 2020-11-04 at 22:39 +0000, Martin Wilck wrote: > On Mon, 2020-11-02 at 12:27 -0600, Benjamin Marzinski wrote: > > On Fri, Oct 30, 2020 at 09:15:39PM +0000, Martin Wilck wrote: > > > On Fri, 2020-10-23 at 16:15 -0500, Benjamin Marzinski wrote: > > > > The multipathd tur checker thread is designed to be able to > > > > finish at > > > > any time, even after the tur checker itself has been freed. The > > > > multipathd shutdown code makes sure all the checkers have been > > > > freed > > > > before freeing the checker_class and calling dlclose() to > > > > unload > > > > the > > > > DSO, but this doesn't guarantee that the checker threads have > > > > finished. > > > > If one hasn't, the DSO will get unloaded while the thread still > > > > running > > > > code from it, causing a segfault. Unfortunately, it's not > > > > possible to > > > > be > > > > sure that all tur checker threads have ended during shutdown, > > > > without > > > > making them joinable. > > > > > > > > However, since libmultipath will never be reinitialized after > > > > it > > > > has > > > > been uninitialzed, not dlclosing the tur checker DSO once a > > > > thread is > > > > started has minimal cost (keeping the DSO code around until the > > > > program > > > > exits, which usually happens right after freeing the checkers). > > > > > > I'm not against this, but have you considered using an > > > atomic refcount > > > for the DSO? With every tur thread starting, we could increase > > > it, > > > and > > > decrease it in the cleanup function of the thread when it exits. > > > That > > > should be safe. If the refcount was positive when we exit, we > > > could > > > refrain from unloading the DSO. > > > > I tried exactly that, and I would still get crashes, even if it put > > the > > code that decrements the atomic variable in a function that's not > > part > > of the DSO, and put a pthread_exit() before the final > > pthread_cleanup_pop() that would decrement it in tur_thread, so > > that > > after the cleanup code is called the thread should never return to > > code > > that is in the DSO. I had to add sleeps in code to force various > > orderings, but I couldn't find any way that worked for all possible > > orderings. I would love it if this worked, and you're free to try, > > but > > I could not get this method to work. > > I've experimented a bit with a trivial test program, and I found that > it worked stably if decrementing the refcount is really the last > thing > thread's cleanup function does. Could you provide some details about > the sleeps that you'd put in that made this approach fail? I've made some more experiments and if I repeat my test enough, I could see SIGSEGV. I was able to make it work by using a thread entrypoint outside the DSO (calling the DSO's thread function), and decrementing the refcount in the outermost cleanup function (which is also not in the DSO), all using proper atomics and barriers. Martin
On Wed, 2020-11-04 at 17:27 -0600, Benjamin Marzinski wrote: > On Wed, Nov 04, 2020 at 10:39:39PM +0000, Martin Wilck wrote: > > On Mon, 2020-11-02 at 12:27 -0600, Benjamin Marzinski wrote: > > > On Fri, Oct 30, 2020 at 09:15:39PM +0000, Martin Wilck wrote: > > > > On Fri, 2020-10-23 at 16:15 -0500, Benjamin Marzinski wrote: > > > > > The multipathd tur checker thread is designed to be able to > > > > > finish at > > > > > any time, even after the tur checker itself has been freed. > > > > > The > > > > > multipathd shutdown code makes sure all the checkers have > > > > > been > > > > > freed > > > > > before freeing the checker_class and calling dlclose() to > > > > > unload > > > > > the > > > > > DSO, but this doesn't guarantee that the checker threads have > > > > > finished. > > > > > If one hasn't, the DSO will get unloaded while the thread > > > > > still > > > > > running > > > > > code from it, causing a segfault. Unfortunately, it's not > > > > > possible to > > > > > be > > > > > sure that all tur checker threads have ended during shutdown, > > > > > without > > > > > making them joinable. > > > > > > > > > > However, since libmultipath will never be reinitialized after > > > > > it > > > > > has > > > > > been uninitialzed, not dlclosing the tur checker DSO once a > > > > > thread is > > > > > started has minimal cost (keeping the DSO code around until > > > > > the > > > > > program > > > > > exits, which usually happens right after freeing the > > > > > checkers). > > > > > > > > I'm not against this, but have you considered using an > > > > atomic refcount > > > > for the DSO? With every tur thread starting, we could increase > > > > it, > > > > and > > > > decrease it in the cleanup function of the thread when it > > > > exits. > > > > That > > > > should be safe. If the refcount was positive when we exit, we > > > > could > > > > refrain from unloading the DSO. > > > > > > I tried exactly that, and I would still get crashes, even if it > > > put > > > the > > > code that decrements the atomic variable in a function that's not > > > part > > > of the DSO, and put a pthread_exit() before the final > > > pthread_cleanup_pop() that would decrement it in tur_thread, so > > > that > > > after the cleanup code is called the thread should never return > > > to > > > code > > > that is in the DSO. I had to add sleeps in code to force various > > > orderings, but I couldn't find any way that worked for all > > > possible > > > orderings. I would love it if this worked, and you're free to > > > try, > > > but > > > I could not get this method to work. > > > > I've experimented a bit with a trivial test program, and I found > > that > > it worked stably if decrementing the refcount is really the last > > thing > > thread's cleanup function does. Could you provide some details > > about > > the sleeps that you'd put in that made this approach fail? > > I believe the situation that continued to crash was where the > tur_thread() exitted naturally (so it set running to 0), although I'm > not sure that this is necessary, or if it would still crash when > running > the cleanup function because of a cancel. I put the cleanup function > to > decrement the count in libmultipath, so that it wasn't part of the > DSO, > and then I put a sleep(5) as the last line of the cleanup function, > and > a sleep(10) as the last line of cleanup_checkers(). I also had to set > running to 0 before starting the thread, and then take out the code > to > pause the thread if running was aleady 0, to make sure the thread > acted > like it was the one to set running to 0. Then the idea is to stop > multipathd while there is a thread in its sleep period, so that > multipathd sees that the counter is 0, and closes the dso, and then > the thread finishes before multipathd shuts the rest of the way > down. I guess the key is that the thread's entry point must also be in libmultipath (i.e. outside the DSO). In pseudo-code: entrypoint() { refcount++; pthread_cleanup_push(refcount--); tur_thread(ct); pthread_cleanup_pop(1); } This way the thread can't be in DSO code any more when refcount goes to zero. Martin
On Wed, Nov 04, 2020 at 11:56:07PM +0000, Martin Wilck wrote: > On Wed, 2020-11-04 at 17:27 -0600, Benjamin Marzinski wrote: > > On Wed, Nov 04, 2020 at 10:39:39PM +0000, Martin Wilck wrote: > > > On Mon, 2020-11-02 at 12:27 -0600, Benjamin Marzinski wrote: > > > > On Fri, Oct 30, 2020 at 09:15:39PM +0000, Martin Wilck wrote: > > > > > On Fri, 2020-10-23 at 16:15 -0500, Benjamin Marzinski wrote: > > > > > > The multipathd tur checker thread is designed to be able to > > > > > > finish at > > > > > > any time, even after the tur checker itself has been freed. > > > > > > The > > > > > > multipathd shutdown code makes sure all the checkers have > > > > > > been > > > > > > freed > > > > > > before freeing the checker_class and calling dlclose() to > > > > > > unload > > > > > > the > > > > > > DSO, but this doesn't guarantee that the checker threads have > > > > > > finished. > > > > > > If one hasn't, the DSO will get unloaded while the thread > > > > > > still > > > > > > running > > > > > > code from it, causing a segfault. Unfortunately, it's not > > > > > > possible to > > > > > > be > > > > > > sure that all tur checker threads have ended during shutdown, > > > > > > without > > > > > > making them joinable. > > > > > > > > > > > > However, since libmultipath will never be reinitialized after > > > > > > it > > > > > > has > > > > > > been uninitialzed, not dlclosing the tur checker DSO once a > > > > > > thread is > > > > > > started has minimal cost (keeping the DSO code around until > > > > > > the > > > > > > program > > > > > > exits, which usually happens right after freeing the > > > > > > checkers). > > > > > > > > > > I'm not against this, but have you considered using an > > > > > atomic refcount > > > > > for the DSO? With every tur thread starting, we could increase > > > > > it, > > > > > and > > > > > decrease it in the cleanup function of the thread when it > > > > > exits. > > > > > That > > > > > should be safe. If the refcount was positive when we exit, we > > > > > could > > > > > refrain from unloading the DSO. > > > > > > > > I tried exactly that, and I would still get crashes, even if it > > > > put > > > > the > > > > code that decrements the atomic variable in a function that's not > > > > part > > > > of the DSO, and put a pthread_exit() before the final > > > > pthread_cleanup_pop() that would decrement it in tur_thread, so > > > > that > > > > after the cleanup code is called the thread should never return > > > > to > > > > code > > > > that is in the DSO. I had to add sleeps in code to force various > > > > orderings, but I couldn't find any way that worked for all > > > > possible > > > > orderings. I would love it if this worked, and you're free to > > > > try, > > > > but > > > > I could not get this method to work. > > > > > > I've experimented a bit with a trivial test program, and I found > > > that > > > it worked stably if decrementing the refcount is really the last > > > thing > > > thread's cleanup function does. Could you provide some details > > > about > > > the sleeps that you'd put in that made this approach fail? > > > > I believe the situation that continued to crash was where the > > tur_thread() exitted naturally (so it set running to 0), although I'm > > not sure that this is necessary, or if it would still crash when > > running > > the cleanup function because of a cancel. I put the cleanup function > > to > > decrement the count in libmultipath, so that it wasn't part of the > > DSO, > > and then I put a sleep(5) as the last line of the cleanup function, > > and > > a sleep(10) as the last line of cleanup_checkers(). I also had to set > > running to 0 before starting the thread, and then take out the code > > to > > pause the thread if running was aleady 0, to make sure the thread > > acted > > like it was the one to set running to 0. Then the idea is to stop > > multipathd while there is a thread in its sleep period, so that > > multipathd sees that the counter is 0, and closes the dso, and then > > the thread finishes before multipathd shuts the rest of the way > > down. > > I guess the key is that the thread's entry point must also be in > libmultipath (i.e. outside the DSO). In pseudo-code: > > entrypoint() { > refcount++; > pthread_cleanup_push(refcount--); > tur_thread(ct); > pthread_cleanup_pop(1); > } > > This way the thread can't be in DSO code any more when refcount goes to > zero. Oh! I didn't think of solving it that way, but it makes sense. So, were you planning on posting a patch? -Ben > > Martin > > -- > Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107 > SUSE Software Solutions Germany GmbH > HRB 36809, AG Nürnberg GF: Felix > Imendörffer > -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Wed, 2020-11-04 at 18:41 -0600, Benjamin Marzinski wrote: > On Wed, Nov 04, 2020 at 11:56:07PM +0000, Martin Wilck wrote: > > > > I guess the key is that the thread's entry point must also be in > > libmultipath (i.e. outside the DSO). In pseudo-code: > > > > entrypoint() { > > refcount++; > > pthread_cleanup_push(refcount--); > > tur_thread(ct); > > pthread_cleanup_pop(1); > > } > > > > This way the thread can't be in DSO code any more when refcount > > goes to > > zero. > > Oh! I didn't think of solving it that way, but it makes sense. So, > were > you planning on posting a patch? I just did ("libmultipath: prevent DSO unloading with astray checker threads"). Please have a look and possibly test it using the setup that failed for you before. Regards Martin
diff --git a/libmultipath/checkers.c b/libmultipath/checkers.c index 18b1f5eb..35a17f8c 100644 --- a/libmultipath/checkers.c +++ b/libmultipath/checkers.c @@ -22,6 +22,7 @@ struct checker_class { void (*reset)(void); /* to reset the global variables */ const char **msgtable; short msgtable_size; + int keep_dso; }; static const char *checker_state_names[PATH_MAX_STATE] = { @@ -74,7 +75,7 @@ void free_checker_class(struct checker_class *c) list_del(&c->node); if (c->reset) c->reset(); - if (c->handle) { + if (c->handle && !c->keep_dso) { if (dlclose(c->handle) != 0) { condlog(0, "Cannot unload checker %s: %s", c->name, dlerror()); @@ -197,6 +198,13 @@ out: return NULL; } +void checker_keep_dso(struct checker * c) +{ + if (!c || !c->cls) + return; + c->cls->keep_dso = 1; +} + void checker_set_fd (struct checker * c, int fd) { if (!c) diff --git a/libmultipath/checkers.h b/libmultipath/checkers.h index 9d5f90b9..af5a4006 100644 --- a/libmultipath/checkers.h +++ b/libmultipath/checkers.h @@ -146,6 +146,7 @@ void checker_reset (struct checker *); void checker_set_sync (struct checker *); void checker_set_async (struct checker *); void checker_set_fd (struct checker *, int); +void checker_keep_dso(struct checker *c); void checker_enable (struct checker *); void checker_disable (struct checker *); int checker_check (struct checker *, int); diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c index e886fcf8..fd58d62a 100644 --- a/libmultipath/checkers/tur.c +++ b/libmultipath/checkers/tur.c @@ -394,6 +394,7 @@ int libcheck_check(struct checker * c) uatomic_set(&ct->running, 1); tur_set_async_timeout(c); setup_thread_attr(&attr, 32 * 1024, 1); + checker_keep_dso(c); r = pthread_create(&ct->thread, &attr, tur_thread, ct); pthread_attr_destroy(&attr); if (r) {
The multipathd tur checker thread is designed to be able to finish at any time, even after the tur checker itself has been freed. The multipathd shutdown code makes sure all the checkers have been freed before freeing the checker_class and calling dlclose() to unload the DSO, but this doesn't guarantee that the checker threads have finished. If one hasn't, the DSO will get unloaded while the thread still running code from it, causing a segfault. Unfortunately, it's not possible to be sure that all tur checker threads have ended during shutdown, without making them joinable. However, since libmultipath will never be reinitialized after it has been uninitialzed, not dlclosing the tur checker DSO once a thread is started has minimal cost (keeping the DSO code around until the program exits, which usually happens right after freeing the checkers). Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> --- libmultipath/checkers.c | 10 +++++++++- libmultipath/checkers.h | 1 + libmultipath/checkers/tur.c | 1 + 3 files changed, 11 insertions(+), 1 deletion(-)