mbox series

[v2,00/22] Yet Another path checker refactor

Message ID 20240912214947.783819-1-bmarzins@redhat.com (mailing list archive)
Headers show
Series Yet Another path checker refactor | expand

Message

Benjamin Marzinski Sept. 12, 2024, 9:49 p.m. UTC
This patchset cleans up some of the ugly code from my last refactor, and
speeds up multipathd path checking, by refactoring how paths are
checked.

multipathd previously ran the checker on a path, waited briefly for it
to complete, and if it did, updated the path and the priorities for the
multipath device and possibly reloaded it.

This had multiple issues.
1. All the paths waited for their checkers to complete serially. This
wasted time with no benefit. It also meant that it wasn't that uncommon
for a checker to not finish in time, and get marked pending.

2. The prioritiers could get run on paths multiple times during a
checker loop if multiple paths were restored at the same time, which is
a common occurance.

3. In an effort to keep a multipath device's state consistent despite
the delays introduced by waiting for the path checkers, paths were
checked by multipath device. However checking the paths could involve
reloading a multipath device table, or even removing a multipath device.
This added some ugly backout and retry logic to the path check code.

With this patchset, the code now first starts the path checkers on all
devices due for a path check. If there are path checkers that need
waiting it then unocks the vecs lock and waits 5ms. Next, it goes back
and updates the state of all the path devices. Once all of the paths
have been updated, it goes through each multipath device, updating path
priorities and reloading the device as necessary.

This allows multipathd to spend less time and do less redundant work
while checking paths, while making paths more likely to not spend a
checker cycle marked as pending. Since multipathd drops the vecs lock
while waiting, uevents and user commands can be processed while its
waiting.

Since there isn't a delay waiting for the previous checker before
starting the next one, the path checker code has reverted to checking
all paths in pathvec order, instead of by multipath device. Updating the
paths in pathvec order simplifies the code, since the multipath devices
can change during path updates. Starting the checkers in pathvec order
keeps a path from having its checker started towards the end of the list
of paths but checking for the result towards the start of the list of
paths.

Changes since v1:

The original patches have been updated based on suggestions from Martin
Wilck (patches listed using their original numbering):

- Moved checker path_state initialization code from 06 to 01
- Check for NULL before dereference in checker_get_state() in 04
- Moved adding start_checker to version file from 07 to 06
- Renamed check_path_state() to get_new_state() in 08. updated 09, 11,
  and 12 to deal with the name change.
- Added new patch before 13 ('fix "fail path" and "reinstate path"
  commands') to fix issues with manually failing and reinstating paths
  between when the checkers are run and the paths are updated
- Fixed check in update_paths in 13
- Split patch 13 into three patches, one to move priority updating out
  of the individual path updating code, and two others to clean up some
  of the logic about updating priorities and path groups.

After the now 18 patches from the original patchset, four new patches
have been added to move the waiting for pending paths out from
checker_get_state() and into the existing waiting code in checkerloop().
Doing the waiting outside of the checkers (in fact, outside of the
vecs lock completely) means that we can't wait till a checker has
completed. We must just pick a time and wait for all of it. On Martin's
suggestion I picked 5ms. Since multipathd isn't sleeping with the vecs
lock held, it's much less critical to do this quickly.

For checker run in pathinfo, I kept the previous timeout of 1ms, but I
waffled on whether to wait at all here, and I'd be perfectly fine with
removing the wait altogether, and just let the paths be pending when
called from multipathd with async checkers.

Also, I've just added these patches to the end of the patchset to make
it easier to see the changes from v1, but if people want less churn in
the checker code commits, I'd be find with resending this patchset with
these patches refactored into the earlier patches.

Benjamin Marzinski (22):
  libmultipath: store checker_check() result in checker struct
  libmultipath: add missing checker function prototypes
  libmultipath: split out the code to wait for pending checkers
  libmultipath: remove pending wait code from libcheck_check calls
  multipath-tools tests: fix up directio tests
  libmultipath: split get_state into two functions
  libmultipath: change path_offline to path_sysfs_state
  multipathd: split check_path_state into two functions
  multipathd: split do_checker_path
  multipathd: split check_path into two functions
  multipathd: split handle_uninitialized_path into two functions
  multipathd: split check_paths into two functions
  multipathd: fix "fail path" and "reinstate path" commands
  multipathd: update priority once after updating all paths
  multipathd: simplify checking for followover_should_failback
  multipathd: only refresh prios on PATH_UP and PATH_GHOST
  multipathd: remove pointless check
  multipathd: fix deferred_failback_tick for reload removes
  libmultipath: add libcheck_need_wait checker function
  libmultipath: don't wait in libcheck_pending
  multipathd: wait for checkers to complete
  multipath-tools tests: fix up directio tests

 libmultipath/checkers.c           |  47 ++--
 libmultipath/checkers.h           |  10 +-
 libmultipath/checkers/directio.c  | 144 +++++++----
 libmultipath/checkers/tur.c       |  78 +++---
 libmultipath/discovery.c          |  97 +++++---
 libmultipath/discovery.h          |   6 +-
 libmultipath/libmultipath.version |   4 +-
 libmultipath/print.c              |   2 +-
 libmultipath/propsel.c            |   1 +
 libmultipath/structs.h            |  21 +-
 multipathd/cli_handlers.c         |   1 +
 multipathd/main.c                 | 399 ++++++++++++++++++------------
 tests/directio.c                  | 182 ++++++++++----
 13 files changed, 643 insertions(+), 349 deletions(-)

Comments

Martin Wilck Sept. 13, 2024, 9:30 a.m. UTC | #1
On Thu, 2024-09-12 at 17:49 -0400, Benjamin Marzinski wrote:
> This patchset cleans up some of the ugly code from my last refactor,
> and
> speeds up multipathd path checking, by refactoring how paths are
> checked.
> 

I haven't started reviewing this yet, but I'm seeing a CI problem with
the sysfs test, only with clang 18 on Fedora 40:
https://github.com/openSUSE/multipath-tools/actions/runs/10845899189

Martin
Benjamin Marzinski Sept. 16, 2024, 9:11 p.m. UTC | #2
On Fri, Sep 13, 2024 at 11:30:23AM +0200, Martin Wilck wrote:
> On Thu, 2024-09-12 at 17:49 -0400, Benjamin Marzinski wrote:
> > This patchset cleans up some of the ugly code from my last refactor,
> > and
> > speeds up multipathd path checking, by refactoring how paths are
> > checked.
> > 
> 
> I haven't started reviewing this yet, but I'm seeing a CI problem with
> the sysfs test, only with clang 18 on Fedora 40:
> https://github.com/openSUSE/multipath-tools/actions/runs/10845899189


I'm not sure what this is and I've been entirely unable to reproduce it
using Fedora 40 and clang 18.1.8 (the current f40 version of clang). It
seems pretty obvious that I has nothing to do with my patchset. It
looks like it's related to the renaming open() issues. Given the

"%s() has remaining non-returned values."

and

"__wrap___open64_2: '%s' parameter still has values that haven't been
checked."

messages, I'm guessing something other than __open64_2() was called. But
like I said, I can't reproduce this.

-Ben

> 
> Martin
Martin Wilck Sept. 17, 2024, 10:13 a.m. UTC | #3
On Mon, 2024-09-16 at 17:11 -0400, Benjamin Marzinski wrote:
> 
> I'm not sure what this is and I've been entirely unable to reproduce
> it
> using Fedora 40 and clang 18.1.8 (the current f40 version of clang).
> It
> seems pretty obvious that I has nothing to do with my patchset. It
> looks like it's related to the renaming open() issues. Given the
> 
> "%s() has remaining non-returned values."
> 
> and
> 
> "__wrap___open64_2: '%s' parameter still has values that haven't been
> checked."
> 
> messages, I'm guessing something other than __open64_2() was called.
> But
> like I said, I can't reproduce this.

OK, I'll double check. I don't recall seeing this before your patch
set, but that can have multiple reasons.

Martin