Message ID: 1592439867-18427-1-git-send-email-bmarzins@redhat.com (mailing list archive)
Series: Fix multipath/multipathd flush issue
On Wed, 2020-06-17 at 19:24 -0500, Benjamin Marzinski wrote:
>
> One source of complexity in these patches is that multipath suspends
> the device with flushing before removing it, while multipathd doesn't.
> It does seem possible that the suspend could hang for a while, so I
> can understand multipathd not using it. However, that would take the
> multipath device getting opened and IO being issued, between the time
> when _dm_flush_map() verifies that the device isn't opened, and when
> it suspends the device. But more importantly, I don't understand how
> suspending the device is useful.

I've looked up the origin of 9a4ff93 in the SUSE bugzilla. The problem
that the patch tried to solve was that if a map without healthy paths
and with queue_if_no_path set was removed, the removal might fail with
"map is in use". Hannes' comment at the time was this:

"Problem is that we're not waiting for all outstanding I/Os to
complete. So the check for 'map in use' might trigger a false
positive, as the outstanding I/O will keep the map busy. Once it's
finished we can remove the map."

"I'll add an explicit 'suspend/resume' cycle here, which will wait for
all outstanding I/O to finish. That should resolve the situation."

Thus setting queue_if_no_path = 0 and doing a suspend/resume was
basically a trick to force dm to flush outstanding IO.

> If the device were to be opened between when _dm_flush_map() checks
> the usage, and when it tries to delete the device, device-mapper would
> simply fail the remove. And if the device isn't in use, there can't be
> any outstanding IO to flush. When this code was added in commit
> 9a4ff93, there was no check if the device was in use,

Hm, I see this in the code preceding 9a4ff93:

extern int
_dm_flush_map (const char * mapname, int need_sync)
{
        [...]
        if (dm_get_opencount(mapname)) {
                condlog(2, "%s: map in use", mapname);
                return 1;
        }

... so it seems that there was a check, even back then.

> and disabling queue_if_no_path could cause anything that had the
> device open to error out and close it. But now that multipath checks
> first if the device is open, I don't see the benefit to doing this
> anymore. Removing it would allow multipathd and multipath to use the
> same code to remove maps. Any thoughts?

With queue_if_no_path, there could be outstanding IO even if the
opencount was 0. This IO must be flushed somehow before the map can be
removed. Apparently just setting queue_if_no_path = 0 was not enough;
that's why Hannes added the suspend/resume. Maybe today we have some
better way to force the flush? libdm has the _error_device() function
(aka "dmsetup wipe_table") that replaces the map's table by the error
target. But this does a map reload, which implies a suspend, too.
Perhaps simply an fsync() on the dm device, or just waiting a while
until all outstanding IO has errored out? But these alternatives don't
actually look better to me than Hannes' suspend/resume; they will take
at least as much time. Do you have a better idea?

Regards
Martin
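For reference, the trick described above boils down to the following
sequence of device-mapper operations. This is only a minimal sketch
using the public libdevmapper API, not the actual multipath-tools code;
the helper names are made up for illustration and error handling is
reduced to the bare minimum.

    /* Sketch of "disable queueing, then flush via suspend/resume",
     * expressed with plain libdevmapper calls.  Illustrative only. */
    #include <libdevmapper.h>

    static int dm_simple_task(int task_type, const char *name)
    {
        struct dm_task *dmt;
        int r = 0;

        if (!(dmt = dm_task_create(task_type)))
            return 0;
        if (!dm_task_set_name(dmt, name))
            goto out;
        r = dm_task_run(dmt);   /* DM_DEVICE_SUSPEND flushes by default */
    out:
        dm_task_destroy(dmt);
        return r;
    }

    /* equivalent of "dmsetup message <map> 0 fail_if_no_path" */
    static int disable_queueing(const char *name)
    {
        struct dm_task *dmt;
        int r = 0;

        if (!(dmt = dm_task_create(DM_DEVICE_TARGET_MSG)))
            return 0;
        if (!dm_task_set_name(dmt, name) ||
            !dm_task_set_sector(dmt, 0) ||
            !dm_task_set_message(dmt, "fail_if_no_path"))
            goto out;
        r = dm_task_run(dmt);
    out:
        dm_task_destroy(dmt);
        return r;
    }

    int flush_and_remove(const char *name)
    {
        if (!disable_queueing(name))
            return 0;
        /* suspend with flush: does not return until all previously
         * queued I/O has completed or failed */
        if (!dm_simple_task(DM_DEVICE_SUSPEND, name))
            return 0;
        if (!dm_simple_task(DM_DEVICE_RESUME, name))
            return 0;
        return dm_simple_task(DM_DEVICE_REMOVE, name);
    }

It is the flushing suspend that actually waits for the queued I/O; as
noted later in the thread, the kernel disables queueing itself when a
"suspend with flush" is issued, so the explicit message is largely
belt-and-suspenders.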
On Thu, Jun 18, 2020 at 09:00:49AM +0000, Martin Wilck wrote:
> On Wed, 2020-06-17 at 19:24 -0500, Benjamin Marzinski wrote:
> > [...]
>
> Thus setting queue_if_no_path = 0 and doing a suspend/resume was
> basically a trick to force dm to flush outstanding IO.

I get the intention, I just don't think it currently is doing anything.

> > If the device were to be opened between when _dm_flush_map() checks
> > the usage, and when it tries to delete the device, device-mapper
> > would simply fail the remove. And if the device isn't in use, there
> > can't be any outstanding IO to flush. When this code was added in
> > commit 9a4ff93, there was no check if the device was in use,
>
> Hm, I see this in the code preceding 9a4ff93:
>
> extern int
> _dm_flush_map (const char * mapname, int need_sync)
> {
>         [...]
>         if (dm_get_opencount(mapname)) {
>                 condlog(2, "%s: map in use", mapname);
>                 return 1;
>         }
>
> ... so it seems that there was a check, even back then.

Oops. I missed that.

> > and disabling queue_if_no_path could cause anything that had the
> > device open to error out and close it. But now that multipath checks
> > first if the device is open, I don't see the benefit to doing this
> > anymore. Removing it would allow multipathd and multipath to use the
> > same code to remove maps. Any thoughts?
>
> With queue_if_no_path, there could be outstanding IO even if the
> opencount was 0.

This is what I don't accept at face value. I wrote a little test
program that fired off an async read, closed the fd without waiting for
a response, and then tried to exit. Running this on a queueing
multipath device causes the exit to hang like this

cat /proc/22655/task/22655/stack
[<0>] exit_aio+0xdc/0xf0
[<0>] mmput+0x33/0x130
[<0>] do_exit+0x27f/0xb30
[<0>] do_group_exit+0x3a/0xa0
[<0>] __x64_sys_exit_group+0x14/0x20
[<0>] do_syscall_64+0x5b/0x160
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff

and the multipath device to remain in use

# dmsetup info mpathb
Name:              mpathb
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        0
Event number:      7
Major, minor:      253, 5
Number of targets: 1
UUID: mpath-3600d0230000000000e13954ed5f89301

I've asked around, and device-mapper is certainly supposed to flush all
IO before the last close. Of course, there may be a corner case where
it doesn't. If you know of one, that would be worth sharing.

What I think that flush previously accomplished is that, just like the
flush_on_last_del code, if you stop queueing and error out the IO, then
whatever is waiting on it will get those errors, and likely close the
device soon after, hopefully in time for multipath to remove the
device. However, since multipath now checks if the device is in use
before disabling queue_if_no_path, I don't think this code is actually
helpful right now.

> This IO must be flushed somehow before the map can be removed.
> Apparently just setting queue_if_no_path = 0 was not enough; that's
> why Hannes added the suspend/resume. Maybe today we have some better
> way to force the flush? libdm has the _error_device() function (aka
> "dmsetup wipe_table") that replaces the map's table by the error
> target. But this does a map reload, which implies a suspend, too.
> Perhaps simply an fsync() on the dm device, or just waiting a while
> until all outstanding IO has errored out? But these alternatives don't
> actually look better to me than Hannes' suspend/resume; they will take
> at least as much time. Do you have a better idea?

To go back to the old behavior, we would need to move the check if the
device is in use until after we suspended the device. Or we can keep
the current behavior, and simply remove the flushing and suspending.
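The hang described above can be reproduced with a few lines of Linux
AIO code. The sketch below is only a guess at what such a test looks
like (the real aio_test program is linked later in the thread); it
assumes a multipath map with all paths down and queue_if_no_path set,
and should be linked with -laio.

    /* Submit one async O_DIRECT read against the map, close the fd
     * without waiting for completion, and exit: exit_aio() in the
     * kernel then blocks on the queued I/O. Hypothetical test code. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <libaio.h>

    int main(int argc, char **argv)
    {
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        void *buf;
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s /dev/mapper/<map>\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (posix_memalign(&buf, 4096, 4096)) {
            perror("posix_memalign");
            return 1;
        }
        if (io_setup(1, &ctx) < 0) {
            fprintf(stderr, "io_setup failed\n");
            return 1;
        }
        io_prep_pread(&cb, fd, buf, 4096, 0);
        if (io_submit(ctx, 1, cbs) != 1) {   /* read is now queued by dm-multipath */
            fprintf(stderr, "io_submit failed\n");
            return 1;
        }
        close(fd);   /* close without waiting for the completion */
        return 0;    /* exit now waits in exit_aio() until the I/O fails or completes */
    }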
On Thu, 2020-06-18 at 13:04 -0500, Benjamin Marzinski wrote:
> On Thu, Jun 18, 2020 at 09:00:49AM +0000, Martin Wilck wrote:
> >
> > With queue_if_no_path, there could be outstanding IO even if the
> > opencount was 0.
>
> This is what I don't accept at face value. I wrote a little test
> program that fired off an async read, closed the fd without waiting
> for a response, and then tried to exit. Running this on a queueing
> multipath device causes the exit to hang like this
>
> [...]
>
> and the multipath device to remain in use
>
> # dmsetup info mpathb
> Name:              mpathb
> State:             ACTIVE
> Read Ahead:        256
> Tables present:    LIVE
> Open count:        0
> Event number:      7
> Major, minor:      253, 5
> Number of targets: 1
> UUID: mpath-3600d0230000000000e13954ed5f89301

The open count is 0. Wouldn't this be exactly the situation that
Hannes' patch was targeting - opencount 0, but still unable to
flush/close because of outstanding IO? If multipath was trying to flush
the map in this situation, it would disable queueing. Your process
would get an IO error sooner or later, and exit. But depending on the
amount of outstanding IO and the weather, this might not happen before
multipath had attempted to flush the map, and the flush attempt would
fail. By inserting the synchronous flush operation after disabling
queueing, multipath avoids that failure. Am I misunderstanding
something?

> I've asked around, and device-mapper is certainly supposed to flush
> all IO before the last close. Of course, there may be a corner case
> where it doesn't. If you know of one, that would be worth sharing.
>
> [...]
>
> To go back to the old behavior, we would need to move the check if the
> device is in use until after we suspended the device. Or we can keep
> the current behavior, and simply remove the flushing and suspending.

AFAICS the "in use" check looks at the opencount, and according to your
output above, it can be 0 while IO is outstanding. I haven't checked
this myself yet. But I can confirm that there was an actual bug (map
removal failing) that was allegedly fixed by the suspend/resume. It was
long ago; I can't tell with confidence whether it could still happen.

Don't get me wrong, I'm not generally opposed to the removal of the
flush/suspend. I just wanted to clarify why it was there. It has been
used in multipath only, anyway. If we delegate to multipathd, it makes
sense to handle the situation in multipathd's manner. If we want to
make sure the user experience for "multipath -f" users is unchanged, we
could handle failure accordingly (suspend, resume, and retry flushing
the map).

Best,
Martin
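For context, the "in use" check being discussed is simply the open
count reported by a DM_DEVICE_INFO ioctl. A minimal sketch via
libdevmapper is shown below; it is illustrative only and not the actual
dm_get_opencount() implementation in libmultipath.

    /* Return the dm open count for a map, or -1 on error. */
    #include <libdevmapper.h>

    int map_opencount(const char *name)
    {
        struct dm_task *dmt;
        struct dm_info info;
        int count = -1;

        if (!(dmt = dm_task_create(DM_DEVICE_INFO)))
            return -1;
        if (!dm_task_set_name(dmt, name) || !dm_task_run(dmt))
            goto out;
        if (dm_task_get_info(dmt, &info) && info.exists)
            count = info.open_count;
    out:
        dm_task_destroy(dmt);
        return count;  /* 0 can still mean queued I/O, as the test above shows */
    }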
On Thu, Jun 18, 2020 at 08:06:53PM +0000, Martin Wilck wrote:
> On Thu, 2020-06-18 at 13:04 -0500, Benjamin Marzinski wrote:
> > [...]
> >
> > and the multipath device to remain in use
> >
> > # dmsetup info mpathb
> > Name:              mpathb
> > State:             ACTIVE
> > Read Ahead:        256
> > Tables present:    LIVE
> > Open count:        0
> > Event number:      7
> > Major, minor:      253, 5
> > Number of targets: 1
> > UUID: mpath-3600d0230000000000e13954ed5f89301
>
> The open count is 0.

Whoops. I copied the output from after I restored the path, and the
program exited. While it is hung, you see this:

# dmsetup info mpathb
Name:              mpathb
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      8
Major, minor:      253, 5
Number of targets: 1
UUID: mpath-3600d0230000000000e13954ed5f89301

I uploaded the test program, aio_test:

https://github.com/bmarzins/test_programs.git

You just need to run it on a queueing multipath device with no active
paths and an open count of 0. It will hang with the device open.
Restore a path, and it will exit, and the open count will go to 0.

-Ben

> Wouldn't this be exactly the situation that Hannes' patch was
> targeting - opencount 0, but still unable to flush/close because of
> outstanding IO?
> [...]
On Thu, 2020-06-18 at 18:06 -0500, Benjamin Marzinski wrote:
>
> I uploaded the test program, aio_test:
>
> https://github.com/bmarzins/test_programs.git
>
> You just need to run it on a queueing multipath device with no active
> paths and an open count of 0. It will hang with the device open.
> Restore a path, and it will exit, and the open count will go to 0.

Tried it now, it behaves as you say. I admit I can't imagine how the
suspend/resume would improve anything here. Indeed, as you say, it
opens up a race window. Another process might open the device while
it's suspended. Worse perhaps, once the device is resumed, an uevent
will be generated, and stacked devices might (in principle at least) be
recreated before we get down to flushing the map.

MAYBE the suspend/resume was necessary in the past because some earlier
kernels wouldn't immediately fail all outstanding commands when
queue_if_no_path was deactivated? (Just speculating, I don't know if
this is the case.)

Regards,
Martin
On Wed, Jul 01, 2020 at 10:54:34PM +0200, Martin Wilck wrote:
> On Thu, 2020-06-18 at 18:06 -0500, Benjamin Marzinski wrote:
> > [...]
>
> Tried it now, it behaves as you say. I admit I can't imagine how the
> suspend/resume would improve anything here. Indeed, as you say, it
> opens up a race window. Another process might open the device while
> it's suspended. Worse perhaps, once the device is resumed, an uevent
> will be generated, and stacked devices might (in principle at least)
> be recreated before we get down to flushing the map.
>
> MAYBE the suspend/resume was necessary in the past because some
> earlier kernels wouldn't immediately fail all outstanding commands
> when queue_if_no_path was deactivated? (Just speculating, I don't know
> if this is the case.)

If you disable queue_if_no_path and then do a suspend with flushing,
you are guaranteed that the suspend won't return until all the IO has
completed or failed. This would allow anything that was waiting on
queued IO to have that IO fail back, which could allow it to close the
device in time for multipath to be able to remove it (obviously this is
racy). However, this assumes that you do your open checks after the
suspend, which multipath no longer does. I realize that multipath can't
suspend until after it tries to remove the partition devices, otherwise
those can get stuck. But there probably is some order that gets this
all right-ish.

So, for a while now, the suspending has been doing nothing for us. We
could either try to reorder things so that we actually try to flush the
queued IOs back first (with or without suspending), or we could just
remove the suspending and say that things are working fine the way they
currently are.

-Ben
On Wed, 2020-07-01 at 22:14 -0500, Benjamin Marzinski wrote:
> On Wed, Jul 01, 2020 at 10:54:34PM +0200, Martin Wilck wrote:
> > [...]
> >
> > MAYBE the suspend/resume was necessary in the past because some
> > earlier kernels wouldn't immediately fail all outstanding commands
> > when queue_if_no_path was deactivated? (Just speculating, I don't
> > know if this is the case.)
>
> If you disable queue_if_no_path and then do a suspend with flushing,
> you are guaranteed that the suspend won't return until all the IO has
> completed or failed.

I just realized that if we suspend, we don't even need to disable
queue_if_no_path, because the kernel does that automatically if a
"suspend with flush" is issued, and has been doing so basically
forever.

> This would allow anything that was waiting on queued IO to have that
> IO fail back, which could allow it to close the device in time for
> multipath to be able to remove it (obviously this is racy). However,
> this assumes that you do your open checks after the suspend, which
> multipath no longer does. I realize that multipath can't suspend until
> after it tries to remove the partition devices, otherwise those can
> get stuck. But there probably is some order that gets this all
> right-ish.

Our code is currently racy. IMO we should

0 unset queue_if_no_path
1 remove partition mappings
2 open(O_EXCL|O_RDONLY) the device
3 If this fails, in multipath, try again after a suspend/resume cycle.
  In multipathd, I think we should just fail for now; perhaps later we
  could handle the explicit "remove map" command like multipath.
4 If it fails again, bail out (in multipath, retry if asked to do so)
5 run a "deferred remove" ioctl
6 close the fd; the dm device will now be removed "atomically"
  (see the sketch below).

This cuts down the race window to the minimum possible - after (2), no
mounts / kpartx / lvm operations will be possible any more.

When we remove the partition mappings, we could use the same technique
to avoid races on that layer. Unfortunately, a pending "deferred
remove" operation doesn't cause new open() or mount() calls to fail; if
it did, we could use a simpler approach.

> So, for a while now, the suspending has been doing nothing for us. We
> could either try to reorder things so that we actually try to flush
> the queued IOs back first (with or without suspending), or we could
> just remove the suspending and say that things are working fine the
> way they currently are.

What else can we do except suspend/resume to avoid racing with pending
close(), umount() or similar operations? Well, I guess if we open the
device anyway as I proposed, we could do an fsync() on it. That might
actually be better because it avoids the uevent being sent on resume.

We definitely can't suspend the map before we remove the partitions. We
could try a suspend/resume on the partition devices themselves (or
fsync()) if the opencount is > 0.

Regards,
Martin
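A rough sketch of steps 2, 5 and 6 of the procedure proposed above,
assuming libdevmapper's deferred-remove support
(dm_task_deferred_remove()) is available. This is hypothetical
illustration code with made-up names, not an actual multipath-tools
patch, and error handling is kept to a minimum.

    /* Grab the map exclusively, queue a deferred remove, then drop
     * our reference so the map disappears "atomically". */
    #include <fcntl.h>
    #include <unistd.h>
    #include <errno.h>
    #include <libdevmapper.h>

    int remove_map_exclusive(const char *name, const char *devnode)
    {
        struct dm_task *dmt;
        int fd, r = 0;

        /* step 2: fails with EBUSY if the device is mounted or
         * exclusively held (e.g. by kpartx or LVM) */
        fd = open(devnode, O_RDONLY | O_EXCL);
        if (fd < 0)
            return -errno;

        /* step 5: remove with the deferred-remove flag set; it takes
         * effect when the last opener (us) closes the device */
        if (!(dmt = dm_task_create(DM_DEVICE_REMOVE)))
            goto out_close;
        if (!dm_task_set_name(dmt, name) || !dm_task_deferred_remove(dmt))
            goto out_task;
        r = dm_task_run(dmt);
    out_task:
        dm_task_destroy(dmt);
    out_close:
        /* step 6: closing drops the last reference */
        close(fd);
        return r ? 0 : -1;
    }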
On Thu, Jul 02, 2020 at 12:24:32PM +0000, Martin Wilck wrote:
> On Wed, 2020-07-01 at 22:14 -0500, Benjamin Marzinski wrote:
> > [...]
>
> Our code is currently racy. IMO we should
>
> 0 unset queue_if_no_path
> 1 remove partition mappings
> 2 open(O_EXCL|O_RDONLY) the device
> 3 If this fails, in multipath, try again after a suspend/resume cycle.
>   In multipathd, I think we should just fail for now; perhaps later we
>   could handle the explicit "remove map" command like multipath.
> 4 If it fails again, bail out (in multipath, retry if asked to do so)
> 5 run a "deferred remove" ioctl
> 6 close the fd; the dm device will now be removed "atomically".
>
> This cuts down the race window to the minimum possible - after (2), no
> mounts / kpartx / lvm operations will be possible any more.

I wasn't actually worried about someone wanting to use the device. In
that case the remove should fail. I was worried about things that would
close the device, but couldn't because of queued IO. The race I was
talking about is that after whatever has the device open gets the IO
error, it might not close the device before multipath checks the open
count.

Also, I'm not sure that resume helps us, since that will trigger
uevents, which will open the device again.

> When we remove the partition mappings, we could use the same technique
> to avoid races on that layer. Unfortunately, a pending "deferred
> remove" operation doesn't cause new open() or mount() calls to fail;
> if it did, we could use a simpler approach.
>
> [...]
>
> What else can we do except suspend/resume to avoid racing with pending
> close(), umount() or similar operations? Well, I guess if we open the
> device anyway as I proposed, we could do an fsync() on it. That might
> actually be better because it avoids the uevent being sent on resume.

Yeah. I think that simply disabling queueing and doing an fsync() is
probably better than suspending. If new IO comes in after that, then
something IS using the device, and we don't want to remove it. In
multipath, maybe (a sketch follows below):

1. disable queueing
2. remove partition mappings
3. open device
4. flush
5. check if we are the only opener.
   If not, and there are retries left... goto 4? sleep and recheck?
   We don't want to wait if the answer is that the device really is
   in use.
6. close device
7. remove device

Possibly the solution to not wanting to wait when a device is in use is
that we could add an option to multipath to distinguish between the way
flushing works now, where we check early if the device is in use, and
just bail if it is, and a more aggressive attempt at removal that might
take longer if used on devices that are in use.

-Ben

> We definitely can't suspend the map before we remove the partitions.
> We could try a suspend/resume on the partition devices themselves (or
> fsync()) if the opencount is > 0.
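A sketch of what the fsync()-based variant could look like, reusing the
helpers sketched earlier in this thread (disable_queueing(),
map_opencount(), dm_simple_task()). All names are made up and the retry
policy is exactly the open question discussed above, so treat this as
an illustration of the idea rather than a proposed patch.

    #include <fcntl.h>
    #include <unistd.h>
    #include <libdevmapper.h>

    /* helpers from the earlier sketches in this thread */
    int disable_queueing(const char *name);
    int map_opencount(const char *name);
    int dm_simple_task(int task_type, const char *name);

    int flush_then_remove(const char *name, const char *devnode, int retries)
    {
        int fd, ret = -1;

        if (!disable_queueing(name))          /* 1. stop queueing */
            return -1;
        /* 2. partition mappings would be removed at this point */
        fd = open(devnode, O_RDONLY);         /* 3. open the map */
        if (fd < 0)
            return -1;
        do {
            fsync(fd);                        /* 4. let queued I/O drain or error out */
            if (map_opencount(name) <= 1) {   /* 5. nobody else holds it open */
                ret = 0;
                break;
            }
            sleep(1);                         /* give other openers a chance to close */
        } while (retries-- > 0);
        close(fd);                            /* 6. drop our own reference */
        if (ret == 0)                         /* 7. remove the map */
            ret = dm_simple_task(DM_DEVICE_REMOVE, name) ? 0 : -1;
        return ret;
    }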
On Thu, 2020-07-02 at 10:18 -0500, Benjamin Marzinski wrote:
> On Thu, Jul 02, 2020 at 12:24:32PM +0000, Martin Wilck wrote:
> > [...]
> >
> > This cuts down the race window to the minimum possible - after (2),
> > no mounts / kpartx / lvm operations will be possible any more.
>
> I wasn't actually worried about someone wanting to use the device. In
> that case the remove should fail. I was worried about things that
> would close the device, but couldn't because of queued IO. The race I
> was talking about is that after whatever has the device open gets the
> IO error, it might not close the device before multipath checks the
> open count.

Understood.

> Also, I'm not sure that resume helps us, since that will trigger
> uevents, which will open the device again.

Not sure if I understand correctly: It's possible to just suspend/flush
and then remove the device, without resuming. But that's dangerous,
because if some process opens the device while it's suspended, even if
it's just for reading a single block (think blkid), the open will
succeed, the IO will hang, the opencount will be increased, and the
REMOVE ioctl will fail. Therefore I think *if* we suspend, we should
also resume. But I concur wrt the uevent, of course.

> [...]
>
> Yeah. I think that simply disabling queueing and doing an fsync() is
> probably better than suspending. If new IO comes in after that, then
> something IS using the device, and we don't want to remove it.
>
> [...]
>
> Possibly the solution to not wanting to wait when a device is in use
> is that we could add an option to multipath to distinguish between the
> way flushing works now, where we check early if the device is in use,
> and just bail if it is, and a more aggressive attempt at removal that
> might take longer if used on devices that are in use.

What's wrong with deferred remove? After all, the user explicitly asked
for a flush. As long as some other process has the device open, it
won't be removed. That's why I like the O_EXCL idea, which will allow
small programs like blkid to access the device, but will cause all
attempts to mount or add stacked devices to fail until the device is
actually removed. I see no reason not to do this, as it's a race anyway
if some other process opens the device when we're supposed to flush it
and the opencount has already dropped to 0. By using O_EXCL, we just
increase multipathd's chances of winning the race and doing what the
user asked for.

To make sure we're on the same page: this is a topic for a separate
patch set IMO, I'm not expecting you to fix it in a v3.

Cheers,
Martin
On Thu, Jul 02, 2020 at 04:45:21PM +0000, Martin Wilck wrote:
> On Thu, 2020-07-02 at 10:18 -0500, Benjamin Marzinski wrote:
> > [...]
>
> Not sure if I understand correctly: It's possible to just
> suspend/flush and then remove the device, without resuming. But that's
> dangerous, because if some process opens the device while it's
> suspended, even if it's just for reading a single block (think blkid),
> the open will succeed, the IO will hang, the opencount will be
> increased, and the REMOVE ioctl will fail. Therefore I think *if* we
> suspend, we should also resume. But I concur wrt the uevent, of
> course.

We obviously need to resume if the remove fails. I'm just not sure we
want to resume before deciding that the remove has failed, since it
will just add openers that we don't care about.

> What's wrong with deferred remove? After all, the user explicitly
> asked for a flush. As long as some other process has the device open,
> it won't be removed. That's why I like the O_EXCL idea, which will
> allow small programs like blkid to access the device, but will cause
> all attempts to mount or add stacked devices to fail until the device
> is actually removed. I see no reason not to do this, as it's a race
> anyway if some other process opens the device when we're supposed to
> flush it and the opencount has already dropped to 0. By using O_EXCL,
> we just increase multipathd's chances of winning the race and doing
> what the user asked for.

I'm not actually a fan of deferred remove in general. It leaves the
device in this weird state where it is there but no longer openable. I
wish I had originally dealt with deferred removes by having multipathd
occasionally try to flush devices with no paths, or possibly listen for
notifications that the device has been closed, and flush then.

My specific objection here is that not all things that open a device
for longer than an instant do so with O_EXCL. So it's very possible
that you run "multipath -F", and it returns having removed a number of
unused devices. But some of the devices it didn't remove were opened
without O_EXCL, and they will stick around for a while, and then
suddenly disappear. Even if they don't stay around for that long, this
can still affect scripts or other programs that expect to check the
device state immediately after calling multipath -F, and not have it
change a second or so later. So far multipath -f/-F will not return
until it has removed all the removable devices (and waited for them to
be removed from udev). I think it should stay this way.

> To make sure we're on the same page: this is a topic for a separate
> patch set IMO, I'm not expecting you to fix it in a v3.

Yep. It's future work.
On Thu, 2020-07-02 at 14:41 -0500, Benjamin Marzinski wrote:
> On Thu, Jul 02, 2020 at 04:45:21PM +0000, Martin Wilck wrote:
> >
> > What's wrong with deferred remove? After all, the user explicitly
> > asked for a flush. As long as some other process has the device
> > open, it won't be removed. That's why I like the O_EXCL idea, which
> > will allow small programs like blkid to access the device, but will
> > cause all attempts to mount or add stacked devices to fail until the
> > device is actually removed.
>
> I'm not actually a fan of deferred remove in general. It leaves the
> device in this weird state where it is there but no longer openable.

Ok, I didn't expect that ;-)

AFAICS, devices in DEFERRED REMOVE state are actually still openable. I
just tested this once more on a 5.3 kernel.

As long as the device is opened by some process and thus not removed,
it can be opened by other processes, and is not deleted until the last
opener closes it. It's even possible to create new device mapper layers
like kpartx partitions on top of a DM device in DEFERRED REMOVE state.

> I wish I had originally dealt with deferred removes by having
> multipathd occasionally try to flush devices with no paths, or
> possibly listen for notifications that the device has been closed, and
> flush then.
>
> My specific objection here is that not all things that open a device
> for longer than an instant do so with O_EXCL. So it's very possible
> that you run "multipath -F", and it returns having removed a number of
> unused devices. But some of the devices it didn't remove were opened
> without O_EXCL, and they will stick around for a while, and then
> suddenly disappear. Even if they don't stay around for that long, this
> can still affect scripts or other programs that expect to check the
> device state immediately after calling multipath -F, and not have it
> change a second or so later. So far multipath -f/-F will not return
> until it has removed all the removable devices (and waited for them to
> be removed from udev). I think it should stay this way.

I see. That's a valid point. IMHO it'd be better if the kernel didn't
allow any new access to devices in "deferred remove" state, and
possibly sent a REMOVE uevent and hid the device immediately after the
deferred remove ioctl.

That would also be closer to how "lazy umount" (umount -l) behaves.
But I'm certainly overlooking some subtle semantic issues.

@Mike, Zdenek: perhaps you can explain why "deferred remove" behaves
like this?

Martin
On Fri, Jul 03 2020 at 11:12am -0400, Martin Wilck <Martin.Wilck@suse.com> wrote:
> On Thu, 2020-07-02 at 14:41 -0500, Benjamin Marzinski wrote:
> > [...]
>
> I see. That's a valid point. IMHO it'd be better if the kernel didn't
> allow any new access to devices in "deferred remove" state, and
> possibly sent a REMOVE uevent and hid the device immediately after the
> deferred remove ioctl.
>
> That would also be closer to how "lazy umount" (umount -l) behaves.
> But I'm certainly overlooking some subtle semantic issues.
>
> @Mike, Zdenek: perhaps you can explain why "deferred remove" behaves
> like this?

"deferred remove" was introduced with commits:

2c140a246dc dm: allow remove to be deferred
acfe0ad74d2 dm: allocate a special workqueue for deferred device removal

The feature was developed to cater to the docker "devicemapper" graph
driver [1][2] that uses DM thin provisioning in the backend. (Red Hat's
OpenShift once used a docker that used thinp in production for thinp's
snapshot capabilities. overlayfs is now used instead because it allows
page cache sharing, which results in the ability to support vastly more
containers that are all layered on snapshots of the same "device".)

Anyway, back to deferred remove: docker's Go-lang based implementation
and storage graph driver interface were clumsily written to require
this lazy removal of used resources. As such, we had to adapt, and the
result was "deferred device remove" that really could be used by any DM
device.

Docker couldn't have later opens fail due to a pending removal -- it'd
break their app. So if you want it to do what you'd have imagined it to
be, we'll need to introduce a new flag that alters the behavior (maybe
as a module param off of DM core's dm-mod.ko). Patches welcome -- but
you'll need a pretty good reason (I haven't read back far enough, but
maybe you have one?).

Thanks,
Mike

[1] https://docs.docker.com/storage/storagedriver/device-mapper-driver/
[2] https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_atomic_host/7/html/managing_containers/managing_storage_with_docker_formatted_containers
On Fri, 2020-07-03 at 12:39 -0400, Mike Snitzer wrote:
>
> Docker couldn't have later opens fail due to a pending removal -- it'd
> break their app. So if you want it to do what you'd have imagined it
> to be, we'll need to introduce a new flag that alters the behavior
> (maybe as a module param off of DM core's dm-mod.ko). Patches welcome
> -- but you'll need a pretty good reason (I haven't read back far
> enough, but maybe you have one?).

Thanks a lot for the explanation. I don't think I'm going to write
patches and reason about it. We were just looking for the best way to
safely flush maps in multipath-tools, and I'd considered deferred
remove as one option, which it most likely is not. Anyway, I like to
understand why things are the way they are, so thanks again.

Cheers,
Martin