Message ID: 1592439867-18427-1-git-send-email-bmarzins@redhat.com (mailing list archive)
Series: Fix multipath/multipathd flush issue
On Wed, 2020-06-17 at 19:24 -0500, Benjamin Marzinski wrote:
>
> One source of complexity in these patches is that multipath suspends
> the device with flushing before removing it, while multipathd doesn't.
> It does seem possible that the suspend could hang for a while, so I
> can understand multipathd not using it. However, that would take the
> multipath device getting opened and IO being issued, between the time
> when _dm_flush_map() verifies that the device isn't opened, and when
> it suspends the device. But more importantly, I don't understand how
> suspending the device is useful.

I've looked up the origin of 9a4ff93 in the SUSE bugzilla. The problem
that the patch tried to solve was that if a map without healthy paths
and with queue_if_no_path set was removed, the removal might fail with
"map is in use". Hannes' comment at the time was this:

"Problem is that we're not waiting for all outstanding I/Os to
complete. So the check for 'map in use' might trigger a false
positive, as the outstanding I/O will keep the map busy. Once it's
finished we can remove the map."

"I'll add an explicit 'suspend/resume' cycle here, which will wait for
all outstanding I/O to finish. That should resolve the situation."

Thus setting queue_if_no_path = 0 and doing a suspend/resume was
basically a trick to force dm to flush outstanding IO.

> If the device were to be opened between when _dm_flush_map() checks
> the usage, and when it tries to delete the device, device-mapper would
> simply fail the remove. And if the device isn't in use, there can't be
> any outstanding IO to flush. When this code was added in commit
> 9a4ff93, there was no check if the device was in use,

Hm, I see this in the code preceding 9a4ff93:

extern int
_dm_flush_map (const char * mapname, int need_sync)
{
        [...]
        if (dm_get_opencount(mapname)) {
                condlog(2, "%s: map in use", mapname);
                return 1;
        }

... so it seems that there was a check, even back then.

> and disabling queue_if_no_path could cause anything that had the
> device open to error out and close it. But now that multipath checks
> first if the device is open, I don't see the benefit to doing this
> anymore. Removing it would allow multipathd and multipath to use the
> same code to remove maps. Any thoughts?

With queue_if_no_path, there could be outstanding IO even if the
opencount was 0. This IO must be flushed somehow before the map can be
removed. Apparently just setting queue_if_no_path = 0 was not enough;
that's why Hannes added the suspend/resume. Maybe today we have some
better way to force the flush? libdm has the _error_device() function
(aka "dmsetup wipe_table") that replaces the map's table by the error
target. But this does a map reload, which implies a suspend, too.
Perhaps simply an fsync() on the dm device, or just waiting a while
until all outstanding IO has errored out? But these alternatives don't
actually look better to me than Hannes' suspend/resume; they will take
at least as much time. Do you have a better idea?

Regards
Martin
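For reference, the trick described above boils down to the following
sequence of device-mapper operations. This is only a minimal sketch
using the public libdevmapper API, not the actual multipath-tools code;
the helper names are made up for illustration and error handling is
reduced to the bare minimum.

    /* Sketch of "disable queueing, then flush via suspend/resume",
     * expressed with plain libdevmapper calls.  Illustrative only. */
    #include <libdevmapper.h>

    static int dm_simple_task(int task_type, const char *name)
    {
        struct dm_task *dmt;
        int r = 0;

        if (!(dmt = dm_task_create(task_type)))
            return 0;
        if (!dm_task_set_name(dmt, name))
            goto out;
        r = dm_task_run(dmt);   /* DM_DEVICE_SUSPEND flushes by default */
    out:
        dm_task_destroy(dmt);
        return r;
    }

    /* equivalent of "dmsetup message <map> 0 fail_if_no_path" */
    static int disable_queueing(const char *name)
    {
        struct dm_task *dmt;
        int r = 0;

        if (!(dmt = dm_task_create(DM_DEVICE_TARGET_MSG)))
            return 0;
        if (!dm_task_set_name(dmt, name) ||
            !dm_task_set_sector(dmt, 0) ||
            !dm_task_set_message(dmt, "fail_if_no_path"))
            goto out;
        r = dm_task_run(dmt);
    out:
        dm_task_destroy(dmt);
        return r;
    }

    int flush_and_remove(const char *name)
    {
        if (!disable_queueing(name))
            return 0;
        /* suspend with flush: does not return until all previously
         * queued I/O has completed or failed */
        if (!dm_simple_task(DM_DEVICE_SUSPEND, name))
            return 0;
        if (!dm_simple_task(DM_DEVICE_RESUME, name))
            return 0;
        return dm_simple_task(DM_DEVICE_REMOVE, name);
    }

It is the flushing suspend that actually waits for the queued I/O; as
noted later in the thread, the kernel disables queueing itself when a
"suspend with flush" is issued, so the explicit message is largely
belt-and-suspenders.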
On Thu, Jun 18, 2020 at 09:00:49AM +0000, Martin Wilck wrote:
> On Wed, 2020-06-17 at 19:24 -0500, Benjamin Marzinski wrote:
> > [...]
>
> Thus setting queue_if_no_path = 0 and doing a suspend/resume was
> basically a trick to force dm to flush outstanding IO.

I get the intention, I just don't think it currently is doing anything.

> > If the device were to be opened between when _dm_flush_map() checks
> > the usage, and when it tries to delete the device, device-mapper
> > would simply fail the remove. And if the device isn't in use, there
> > can't be any outstanding IO to flush. When this code was added in
> > commit 9a4ff93, there was no check if the device was in use,
>
> Hm, I see this in the code preceding 9a4ff93:
>
> extern int
> _dm_flush_map (const char * mapname, int need_sync)
> {
>         [...]
>         if (dm_get_opencount(mapname)) {
>                 condlog(2, "%s: map in use", mapname);
>                 return 1;
>         }
>
> ... so it seems that there was a check, even back then.

Oops. I missed that.

> > and disabling queue_if_no_path could cause anything that had the
> > device open to error out and close it. But now that multipath checks
> > first if the device is open, I don't see the benefit to doing this
> > anymore. Removing it would allow multipathd and multipath to use the
> > same code to remove maps. Any thoughts?
>
> With queue_if_no_path, there could be outstanding IO even if the
> opencount was 0.

This is what I don't accept at face value. I wrote a little test
program that fired off an async read, closed the fd without waiting for
a response, and then tried to exit. Running this on a queueing
multipath device causes the exit to hang like this

cat /proc/22655/task/22655/stack
[<0>] exit_aio+0xdc/0xf0
[<0>] mmput+0x33/0x130
[<0>] do_exit+0x27f/0xb30
[<0>] do_group_exit+0x3a/0xa0
[<0>] __x64_sys_exit_group+0x14/0x20
[<0>] do_syscall_64+0x5b/0x160
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff

and the multipath device to remain in use

# dmsetup info mpathb
Name:              mpathb
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        0
Event number:      7
Major, minor:      253, 5
Number of targets: 1
UUID: mpath-3600d0230000000000e13954ed5f89301

I've asked around, and device-mapper is certainly supposed to flush all
IO before the last close. Of course, there may be a corner case where
it doesn't. If you know of one, that would be worth sharing.

What I think that flush previously accomplished is that, just like the
flush_on_last_del code, if you stop queueing and error out the IO, then
whatever is waiting on it will get those errors, and likely close the
device soon after, hopefully in time for multipath to remove the
device. However, since multipath now checks if the device is in use
before disabling queue_if_no_path, I don't think this code is actually
helpful right now.

> This IO must be flushed somehow before the map can be removed.
> Apparently just setting queue_if_no_path = 0 was not enough; that's
> why Hannes added the suspend/resume. Maybe today we have some better
> way to force the flush? libdm has the _error_device() function (aka
> "dmsetup wipe_table") that replaces the map's table by the error
> target. But this does a map reload, which implies a suspend, too.
> Perhaps simply an fsync() on the dm device, or just waiting a while
> until all outstanding IO has errored out? But these alternatives don't
> actually look better to me than Hannes' suspend/resume; they will take
> at least as much time. Do you have a better idea?

To go back to the old behavior, we would need to move the check if the
device is in use until after we suspended the device. Or we can keep
the current behavior, and simply remove the flushing and suspending.
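The hang described above can be reproduced with a few lines of Linux
AIO code. The sketch below is only a guess at what such a test looks
like (the real aio_test program is linked later in the thread); it
assumes a multipath map with all paths down and queue_if_no_path set,
and should be linked with -laio.

    /* Submit one async O_DIRECT read against the map, close the fd
     * without waiting for completion, and exit: exit_aio() in the
     * kernel then blocks on the queued I/O. Hypothetical test code. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <libaio.h>

    int main(int argc, char **argv)
    {
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        void *buf;
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s /dev/mapper/<map>\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (posix_memalign(&buf, 4096, 4096)) {
            perror("posix_memalign");
            return 1;
        }
        if (io_setup(1, &ctx) < 0) {
            fprintf(stderr, "io_setup failed\n");
            return 1;
        }
        io_prep_pread(&cb, fd, buf, 4096, 0);
        if (io_submit(ctx, 1, cbs) != 1) {   /* read is now queued by dm-multipath */
            fprintf(stderr, "io_submit failed\n");
            return 1;
        }
        close(fd);   /* close without waiting for the completion */
        return 0;    /* exit now waits in exit_aio() until the I/O fails or completes */
    }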
On Thu, 2020-06-18 at 13:04 -0500, Benjamin Marzinski wrote:
> On Thu, Jun 18, 2020 at 09:00:49AM +0000, Martin Wilck wrote:
> >
> > With queue_if_no_path, there could be outstanding IO even if the
> > opencount was 0.
>
> This is what I don't accept at face value. I wrote a little test
> program that fired off an async read, closed the fd without waiting
> for a response, and then tried to exit. Running this on a queueing
> multipath device causes the exit to hang like this
>
> [...]
>
> and the multipath device to remain in use
>
> # dmsetup info mpathb
> Name:              mpathb
> State:             ACTIVE
> Read Ahead:        256
> Tables present:    LIVE
> Open count:        0
> Event number:      7
> Major, minor:      253, 5
> Number of targets: 1
> UUID: mpath-3600d0230000000000e13954ed5f89301

The open count is 0. Wouldn't this be exactly the situation that
Hannes' patch was targeting - opencount 0, but still unable to
flush/close because of outstanding IO? If multipath was trying to flush
the map in this situation, it would disable queueing. Your process
would get an IO error sooner or later, and exit. But depending on the
amount of outstanding IO and the weather, this might not happen before
multipath had attempted to flush the map, and the flush attempt would
fail. By inserting the synchronous flush operation after disabling
queueing, multipath avoids that failure. Am I misunderstanding
something?

> I've asked around, and device-mapper is certainly supposed to flush
> all IO before the last close. Of course, there may be a corner case
> where it doesn't. If you know of one, that would be worth sharing.
>
> [...]
>
> To go back to the old behavior, we would need to move the check if the
> device is in use until after we suspended the device. Or we can keep
> the current behavior, and simply remove the flushing and suspending.

AFAICS the "in use" check looks at the opencount, and according to your
output above, it can be 0 while IO is outstanding. I haven't checked
this myself yet. But I can confirm that there was an actual bug (map
removal failing) that was allegedly fixed by the suspend/resume. It was
long ago; I can't tell with confidence whether it could still happen.

Don't get me wrong, I'm not generally opposed to the removal of the
flush/suspend. I just wanted to clarify why it was there. It has been
used in multipath only, anyway. If we delegate to multipathd, it makes
sense to handle the situation in multipathd's manner. If we want to
make sure the user experience for "multipath -f" users is unchanged, we
could handle failure accordingly (suspend, resume, and retry flushing
the map).

Best,
Martin
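For context, the "in use" check being discussed is simply the open
count reported by a DM_DEVICE_INFO ioctl. A minimal sketch via
libdevmapper is shown below; it is illustrative only and not the actual
dm_get_opencount() implementation in libmultipath.

    /* Return the dm open count for a map, or -1 on error. */
    #include <libdevmapper.h>

    int map_opencount(const char *name)
    {
        struct dm_task *dmt;
        struct dm_info info;
        int count = -1;

        if (!(dmt = dm_task_create(DM_DEVICE_INFO)))
            return -1;
        if (!dm_task_set_name(dmt, name) || !dm_task_run(dmt))
            goto out;
        if (dm_task_get_info(dmt, &info) && info.exists)
            count = info.open_count;
    out:
        dm_task_destroy(dmt);
        return count;  /* 0 can still mean queued I/O, as the test above shows */
    }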
On Thu, Jun 18, 2020 at 08:06:53PM +0000, Martin Wilck wrote:
> On Thu, 2020-06-18 at 13:04 -0500, Benjamin Marzinski wrote:
> > [...]
> >
> > and the multipath device to remain in use
> >
> > # dmsetup info mpathb
> > Name:              mpathb
> > State:             ACTIVE
> > Read Ahead:        256
> > Tables present:    LIVE
> > Open count:        0
> > Event number:      7
> > Major, minor:      253, 5
> > Number of targets: 1
> > UUID: mpath-3600d0230000000000e13954ed5f89301
>
> The open count is 0.

Whoops. I copied the output from after I restored the path, and the
program exited. While it is hung, you see this:

# dmsetup info mpathb
Name:              mpathb
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      8
Major, minor:      253, 5
Number of targets: 1
UUID: mpath-3600d0230000000000e13954ed5f89301

I uploaded the test program, aio_test:

https://github.com/bmarzins/test_programs.git

You just need to run it on a queueing multipath device with no active
paths and an open count of 0. It will hang with the device open.
Restore a path, and it will exit, and the open count will go to 0.

-Ben

> Wouldn't this be exactly the situation that Hannes' patch was
> targeting - opencount 0, but still unable to flush/close because of
> outstanding IO?
> [...]
On Thu, 2020-06-18 at 18:06 -0500, Benjamin Marzinski wrote:
>
> I uploaded the test program, aio_test:
>
> https://github.com/bmarzins/test_programs.git
>
> You just need to run it on a queueing multipath device with no active
> paths and an open count of 0. It will hang with the device open.
> Restore a path, and it will exit, and the open count will go to 0.

Tried it now, it behaves as you say. I admit I can't imagine how the
suspend/resume would improve anything here. Indeed, as you say, it
opens up a race window. Another process might open the device while
it's suspended. Worse perhaps, once the device is resumed, an uevent
will be generated, and stacked devices might (in principle at least) be
recreated before we get down to flushing the map.

MAYBE the suspend/resume was necessary in the past because some earlier
kernels wouldn't immediately fail all outstanding commands when
queue_if_no_path was deactivated? (Just speculating, I don't know if
this is the case.)

Regards,
Martin
On Wed, Jul 01, 2020 at 10:54:34PM +0200, Martin Wilck wrote:
> On Thu, 2020-06-18 at 18:06 -0500, Benjamin Marzinski wrote:
> > [...]
>
> Tried it now, it behaves as you say. I admit I can't imagine how the
> suspend/resume would improve anything here. Indeed, as you say, it
> opens up a race window. Another process might open the device while
> it's suspended. Worse perhaps, once the device is resumed, an uevent
> will be generated, and stacked devices might (in principle at least)
> be recreated before we get down to flushing the map.
>
> MAYBE the suspend/resume was necessary in the past because some
> earlier kernels wouldn't immediately fail all outstanding commands
> when queue_if_no_path was deactivated? (Just speculating, I don't know
> if this is the case.)

If you disable queue_if_no_path and then do a suspend with flushing,
you are guaranteed that the suspend won't return until all the IO has
completed or failed. This would allow anything that was waiting on
queued IO to have that IO fail back, which could allow it to close the
device in time for multipath to be able to remove it (obviously this is
racy). However, this assumes that you do your open checks after the
suspend, which multipath no longer does. I realize that multipath can't
suspend until after it tries to remove the partition devices, otherwise
those can get stuck. But there probably is some order that gets this
all right-ish.

So, for a while now, the suspending has been doing nothing for us. We
could either try to reorder things so that we actually try to flush the
queued IOs back first (with or without suspending), or we could just
remove the suspending and say that things are working fine the way they
currently are.

-Ben
On Wed, 2020-07-01 at 22:14 -0500, Benjamin Marzinski wrote:
> On Wed, Jul 01, 2020 at 10:54:34PM +0200, Martin Wilck wrote:
> > [...]
> >
> > MAYBE the suspend/resume was necessary in the past because some
> > earlier kernels wouldn't immediately fail all outstanding commands
> > when queue_if_no_path was deactivated? (Just speculating, I don't
> > know if this is the case.)
>
> If you disable queue_if_no_path and then do a suspend with flushing,
> you are guaranteed that the suspend won't return until all the IO has
> completed or failed.

I just realized that if we suspend, we don't even need to disable
queue_if_no_path, because the kernel does that automatically if a
"suspend with flush" is issued, and has been doing so basically
forever.

> This would allow anything that was waiting on queued IO to have that
> IO fail back, which could allow it to close the device in time for
> multipath to be able to remove it (obviously this is racy). However,
> this assumes that you do your open checks after the suspend, which
> multipath no longer does. I realize that multipath can't suspend until
> after it tries to remove the partition devices, otherwise those can
> get stuck. But there probably is some order that gets this all
> right-ish.

Our code is currently racy. IMO we should

0 unset queue_if_no_path
1 remove partition mappings
2 open(O_EXCL|O_RDONLY) the device
3 If this fails, in multipath, try again after a suspend/resume cycle.
  In multipathd, I think we should just fail for now; perhaps later we
  could handle the explicit "remove map" command like multipath.
4 If it fails again, bail out (in multipath, retry if asked to do so)
5 run a "deferred remove" ioctl
6 close the fd; the dm device will now be removed "atomically"
  (see the sketch below).

This cuts down the race window to the minimum possible - after (2), no
mounts / kpartx / lvm operations will be possible any more.

When we remove the partition mappings, we could use the same technique
to avoid races on that layer. Unfortunately, a pending "deferred
remove" operation doesn't cause new open() or mount() calls to fail; if
it did, we could use a simpler approach.

> So, for a while now, the suspending has been doing nothing for us. We
> could either try to reorder things so that we actually try to flush
> the queued IOs back first (with or without suspending), or we could
> just remove the suspending and say that things are working fine the
> way they currently are.

What else can we do except suspend/resume to avoid racing with pending
close(), umount() or similar operations? Well, I guess if we open the
device anyway as I proposed, we could do an fsync() on it. That might
actually be better because it avoids the uevent being sent on resume.

We definitely can't suspend the map before we remove the partitions. We
could try a suspend/resume on the partition devices themselves (or
fsync()) if the opencount is > 0.

Regards,
Martin
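A rough sketch of steps 2, 5 and 6 of the procedure proposed above,
assuming libdevmapper's deferred-remove support
(dm_task_deferred_remove()) is available. This is hypothetical
illustration code with made-up names, not an actual multipath-tools
patch, and error handling is kept to a minimum.

    /* Grab the map exclusively, queue a deferred remove, then drop
     * our reference so the map disappears "atomically". */
    #include <fcntl.h>
    #include <unistd.h>
    #include <errno.h>
    #include <libdevmapper.h>

    int remove_map_exclusive(const char *name, const char *devnode)
    {
        struct dm_task *dmt;
        int fd, r = 0;

        /* step 2: fails with EBUSY if the device is mounted or
         * exclusively held (e.g. by kpartx or LVM) */
        fd = open(devnode, O_RDONLY | O_EXCL);
        if (fd < 0)
            return -errno;

        /* step 5: remove with the deferred-remove flag set; it takes
         * effect when the last opener (us) closes the device */
        if (!(dmt = dm_task_create(DM_DEVICE_REMOVE)))
            goto out_close;
        if (!dm_task_set_name(dmt, name) || !dm_task_deferred_remove(dmt))
            goto out_task;
        r = dm_task_run(dmt);
    out_task:
        dm_task_destroy(dmt);
    out_close:
        /* step 6: closing drops the last reference */
        close(fd);
        return r ? 0 : -1;
    }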
On Thu, Jul 02, 2020 at 12:24:32PM +0000, Martin Wilck wrote:
> On Wed, 2020-07-01 at 22:14 -0500, Benjamin Marzinski wrote:
> > [...]
>
> Our code is currently racy. IMO we should
>
> 0 unset queue_if_no_path
> 1 remove partition mappings
> 2 open(O_EXCL|O_RDONLY) the device
> 3 If this fails, in multipath, try again after a suspend/resume cycle.
>   In multipathd, I think we should just fail for now; perhaps later we
>   could handle the explicit "remove map" command like multipath.
> 4 If it fails again, bail out (in multipath, retry if asked to do so)
> 5 run a "deferred remove" ioctl
> 6 close the fd; the dm device will now be removed "atomically".
>
> This cuts down the race window to the minimum possible - after (2), no
> mounts / kpartx / lvm operations will be possible any more.

I wasn't actually worried about someone wanting to use the device. In
that case the remove should fail. I was worried about things that would
close the device, but couldn't because of queued IO. The race I was
talking about is that after whatever has the device open gets the IO
error, it might not close the device before multipath checks the open
count.

Also, I'm not sure that resume helps us, since that will trigger
uevents, which will open the device again.

> When we remove the partition mappings, we could use the same technique
> to avoid races on that layer. Unfortunately, a pending "deferred
> remove" operation doesn't cause new open() or mount() calls to fail;
> if it did, we could use a simpler approach.
>
> [...]
>
> What else can we do except suspend/resume to avoid racing with pending
> close(), umount() or similar operations? Well, I guess if we open the
> device anyway as I proposed, we could do an fsync() on it. That might
> actually be better because it avoids the uevent being sent on resume.

Yeah. I think that simply disabling queueing and doing an fsync() is
probably better than suspending. If new IO comes in after that, then
something IS using the device, and we don't want to remove it. In
multipath, maybe (a sketch follows below):

1. disable queueing
2. remove partition mappings
3. open device
4. flush
5. check if we are the only opener.
   If not, and there are retries left... goto 4? sleep and recheck?
   We don't want to wait if the answer is that the device really is
   in use.
6. close device
7. remove device

Possibly the solution to not wanting to wait when a device is in use is
that we could add an option to multipath to distinguish between the way
flushing works now, where we check early if the device is in use, and
just bail if it is, and a more aggressive attempt at removal that might
take longer if used on devices that are in use.

-Ben

> We definitely can't suspend the map before we remove the partitions.
> We could try a suspend/resume on the partition devices themselves (or
> fsync()) if the opencount is > 0.
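A sketch of what the fsync()-based variant could look like, reusing the
helpers sketched earlier in this thread (disable_queueing(),
map_opencount(), dm_simple_task()). All names are made up and the retry
policy is exactly the open question discussed above, so treat this as
an illustration of the idea rather than a proposed patch.

    #include <fcntl.h>
    #include <unistd.h>
    #include <libdevmapper.h>

    /* helpers from the earlier sketches in this thread */
    int disable_queueing(const char *name);
    int map_opencount(const char *name);
    int dm_simple_task(int task_type, const char *name);

    int flush_then_remove(const char *name, const char *devnode, int retries)
    {
        int fd, ret = -1;

        if (!disable_queueing(name))          /* 1. stop queueing */
            return -1;
        /* 2. partition mappings would be removed at this point */
        fd = open(devnode, O_RDONLY);         /* 3. open the map */
        if (fd < 0)
            return -1;
        do {
            fsync(fd);                        /* 4. let queued I/O drain or error out */
            if (map_opencount(name) <= 1) {   /* 5. nobody else holds it open */
                ret = 0;
                break;
            }
            sleep(1);                         /* give other openers a chance to close */
        } while (retries-- > 0);
        close(fd);                            /* 6. drop our own reference */
        if (ret == 0)                         /* 7. remove the map */
            ret = dm_simple_task(DM_DEVICE_REMOVE, name) ? 0 : -1;
        return ret;
    }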
On Thu, 2020-07-02 at 10:18 -0500, Benjamin Marzinski wrote:
> On Thu, Jul 02, 2020 at 12:24:32PM +0000, Martin Wilck wrote:
> > [...]
> >
> > This cuts down the race window to the minimum possible - after (2),
> > no mounts / kpartx / lvm operations will be possible any more.
>
> I wasn't actually worried about someone wanting to use the device. In
> that case the remove should fail. I was worried about things that
> would close the device, but couldn't because of queued IO. The race I
> was talking about is that after whatever has the device open gets the
> IO error, it might not close the device before multipath checks the
> open count.

Understood.

> Also, I'm not sure that resume helps us, since that will trigger
> uevents, which will open the device again.

Not sure if I understand correctly: It's possible to just suspend/flush
and then remove the device, without resuming. But that's dangerous,
because if some process opens the device while it's suspended, even if
it's just for reading a single block (think blkid), the open will
succeed, the IO will hang, the opencount will be increased, and the
REMOVE ioctl will fail. Therefore I think *if* we suspend, we should
also resume. But I concur wrt the uevent, of course.

> [...]
>
> Yeah. I think that simply disabling queueing and doing an fsync() is
> probably better than suspending. If new IO comes in after that, then
> something IS using the device, and we don't want to remove it.
>
> [...]
>
> Possibly the solution to not wanting to wait when a device is in use
> is that we could add an option to multipath to distinguish between the
> way flushing works now, where we check early if the device is in use,
> and just bail if it is, and a more aggressive attempt at removal that
> might take longer if used on devices that are in use.

What's wrong with deferred remove? After all, the user explicitly asked
for a flush. As long as some other process has the device open, it
won't be removed. That's why I like the O_EXCL idea, which will allow
small programs like blkid to access the device, but will cause all
attempts to mount or add stacked devices to fail until the device is
actually removed. I see no reason not to do this, as it's a race anyway
if some other process opens the device when we're supposed to flush it
and the opencount has already dropped to 0. By using O_EXCL, we just
increase multipathd's chances of winning the race and doing what the
user asked for.

To make sure we're on the same page: this is a topic for a separate
patch set IMO, I'm not expecting you to fix it in a v3.

Cheers,
Martin
On Thu, Jul 02, 2020 at 04:45:21PM +0000, Martin Wilck wrote:
> On Thu, 2020-07-02 at 10:18 -0500, Benjamin Marzinski wrote:
> > [...]
>
> Not sure if I understand correctly: It's possible to just
> suspend/flush and then remove the device, without resuming. But that's
> dangerous, because if some process opens the device while it's
> suspended, even if it's just for reading a single block (think blkid),
> the open will succeed, the IO will hang, the opencount will be
> increased, and the REMOVE ioctl will fail. Therefore I think *if* we
> suspend, we should also resume. But I concur wrt the uevent, of
> course.

We obviously need to resume if the remove fails. I'm just not sure we
want to resume before deciding that the remove has failed, since it
will just add openers that we don't care about.

> What's wrong with deferred remove? After all, the user explicitly
> asked for a flush. As long as some other process has the device open,
> it won't be removed. That's why I like the O_EXCL idea, which will
> allow small programs like blkid to access the device, but will cause
> all attempts to mount or add stacked devices to fail until the device
> is actually removed. I see no reason not to do this, as it's a race
> anyway if some other process opens the device when we're supposed to
> flush it and the opencount has already dropped to 0. By using O_EXCL,
> we just increase multipathd's chances of winning the race and doing
> what the user asked for.

I'm not actually a fan of deferred remove in general. It leaves the
device in this weird state where it is there but no longer openable. I
wish I had originally dealt with deferred removes by having multipathd
occasionally try to flush devices with no paths, or possibly listen for
notifications that the device has been closed, and flush then.

My specific objection here is that not all things that open a device
for longer than an instant do so with O_EXCL. So it's very possible
that you run "multipath -F", and it returns having removed a number of
unused devices. But some of the devices it didn't remove were opened
without O_EXCL, and they will stick around for a while, and then
suddenly disappear. Even if they don't stay around for that long, this
can still affect scripts or other programs that expect to check the
device state immediately after calling multipath -F, and not have it
change a second or so later. So far multipath -f/-F will not return
until it has removed all the removable devices (and waited for them to
be removed from udev). I think it should stay this way.

> To make sure we're on the same page: this is a topic for a separate
> patch set IMO, I'm not expecting you to fix it in a v3.

Yep. It's future work.
On Thu, 2020-07-02 at 14:41 -0500, Benjamin Marzinski wrote:
> On Thu, Jul 02, 2020 at 04:45:21PM +0000, Martin Wilck wrote:
> >
> > What's wrong with deferred remove? After all, the user explicitly
> > asked for a flush. As long as some other process has the device
> > open, it won't be removed. That's why I like the O_EXCL idea, which
> > will allow small programs like blkid to access the device, but will
> > cause all attempts to mount or add stacked devices to fail until the
> > device is actually removed.
>
> I'm not actually a fan of deferred remove in general. It leaves the
> device in this weird state where it is there but no longer openable.

Ok, I didn't expect that ;-)

AFAICS, devices in DEFERRED REMOVE state are actually still openable. I
just tested this once more on a 5.3 kernel.

As long as the device is opened by some process and thus not removed,
it can be opened by other processes, and is not deleted until the last
opener closes it. It's even possible to create new device mapper layers
like kpartx partitions on top of a DM device in DEFERRED REMOVE state.

> I wish I had originally dealt with deferred removes by having
> multipathd occasionally try to flush devices with no paths, or
> possibly listen for notifications that the device has been closed, and
> flush then.
>
> My specific objection here is that not all things that open a device
> for longer than an instant do so with O_EXCL. So it's very possible
> that you run "multipath -F", and it returns having removed a number of
> unused devices. But some of the devices it didn't remove were opened
> without O_EXCL, and they will stick around for a while, and then
> suddenly disappear. Even if they don't stay around for that long, this
> can still affect scripts or other programs that expect to check the
> device state immediately after calling multipath -F, and not have it
> change a second or so later. So far multipath -f/-F will not return
> until it has removed all the removable devices (and waited for them to
> be removed from udev). I think it should stay this way.

I see. That's a valid point. IMHO it'd be better if the kernel didn't
allow any new access to devices in "deferred remove" state, and
possibly sent a REMOVE uevent and hid the device immediately after the
deferred remove ioctl.

That would also be closer to how "lazy umount" (umount -l) behaves.
But I'm certainly overlooking some subtle semantic issues.

@Mike, Zdenek: perhaps you can explain why "deferred remove" behaves
like this?

Martin
On Fri, Jul 03 2020 at 11:12am -0400, Martin Wilck <Martin.Wilck@suse.com> wrote:
> On Thu, 2020-07-02 at 14:41 -0500, Benjamin Marzinski wrote:
> > [...]
>
> I see. That's a valid point. IMHO it'd be better if the kernel didn't
> allow any new access to devices in "deferred remove" state, and
> possibly sent a REMOVE uevent and hid the device immediately after the
> deferred remove ioctl.
>
> That would also be closer to how "lazy umount" (umount -l) behaves.
> But I'm certainly overlooking some subtle semantic issues.
>
> @Mike, Zdenek: perhaps you can explain why "deferred remove" behaves
> like this?

"deferred remove" was introduced with commits:

2c140a246dc dm: allow remove to be deferred
acfe0ad74d2 dm: allocate a special workqueue for deferred device removal

The feature was developed to cater to the docker "devicemapper" graph
driver [1][2] that uses DM thin provisioning in the backend. (Red Hat's
OpenShift once used a docker that used thinp in production for thinp's
snapshot capabilities. overlayfs is now used instead because it allows
page cache sharing, which results in the ability to support vastly more
containers that are all layered on snapshots of the same "device".)

Anyway, back to deferred remove: docker's Go-lang based implementation
and storage graph driver interface were clumsily written to require
this lazy removal of used resources. As such, we had to adapt, and the
result was "deferred device remove" that really could be used by any DM
device.

Docker couldn't have later opens fail due to a pending removal -- it'd
break their app. So if you want it to do what you'd have imagined it to
be, we'll need to introduce a new flag that alters the behavior (maybe
as a module param off of DM core's dm-mod.ko). Patches welcome -- but
you'll need a pretty good reason (I haven't read back far enough, but
maybe you have one?).

Thanks,
Mike

[1] https://docs.docker.com/storage/storagedriver/device-mapper-driver/
[2] https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_atomic_host/7/html/managing_containers/managing_storage_with_docker_formatted_containers
On Fri, 2020-07-03 at 12:39 -0400, Mike Snitzer wrote:
>
> Docker couldn't have later opens fail due to a pending removal -- it'd
> break their app. So if you want it to do what you'd have imagined it
> to be, we'll need to introduce a new flag that alters the behavior
> (maybe as a module param off of DM core's dm-mod.ko). Patches welcome
> -- but you'll need a pretty good reason (I haven't read back far
> enough, but maybe you have one?).

Thanks a lot for the explanation. I don't think I'm going to write
patches and reason about it. We were just looking for the best way to
safely flush maps in multipath-tools, and I'd considered deferred
remove as one option, which it most likely is not. Anyway, I like to
understand why things are the way they are, so thanks again.

Cheers,
Martin