diff mbox

[4/6] libmultipath: fix suspended devs from failed reloads

Message ID 1494349025-6561-5-git-send-email-bmarzins@redhat.com (mailing list archive)
State Not Applicable, archived
Delegated to: christophe varoqui
Headers show

Commit Message

Benjamin Marzinski May 9, 2017, 4:57 p.m. UTC
When multipath reloads a device, it can either fail while loading the
new table or while resuming the device. If it fails while resuming the
device, the device can get stuck in the suspended state.  To fix this,
multipath needs to resume the device again so that it can continue using
the old table.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
 libmultipath/devmapper.c | 19 ++++++++++++++++++-
 libmultipath/devmapper.h |  1 +
 2 files changed, 19 insertions(+), 1 deletion(-)

Comments

Martin Wilck May 11, 2017, 8:26 p.m. UTC | #1
On Tue, 2017-05-09 at 11:57 -0500, Benjamin Marzinski wrote:
> When multipath reloads a device, it can either fail while loading the
> new table or while resuming the device. If it fails while resuming
> the
> device, the device can get stuck in the suspended state.  To fix
> this,
> multipath needs to resume the device again so that it can continue
> using
> the old table.
> 
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> ---
>  libmultipath/devmapper.c | 19 ++++++++++++++++++-
>  libmultipath/devmapper.h |  1 +
>  2 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/libmultipath/devmapper.c b/libmultipath/devmapper.c
> index 2c4a13a..69b634b 100644
> --- a/libmultipath/devmapper.c
> +++ b/libmultipath/devmapper.c
> @@ -396,7 +396,13 @@ int dm_addmap_reload(struct multipath *mpp, char
> *params, int flush)
>  	if (r)
>  		r = dm_simplecmd(DM_DEVICE_RESUME, mpp->alias,
> !flush,
>  				 1, udev_flags, 0);
> -	return r;
> +	if (r)
> +		return r;
> +
> +	if (dm_is_suspended(mpp->alias))
> +		dm_simplecmd(DM_DEVICE_RESUME, mpp->alias, !flush,
> 1,
> +			     udev_flags, 0);
> +	return 0;
>  }

Why would the second DM_DEVICE_RESUME call succeed if the first one
failed?

Martin
Benjamin Marzinski May 16, 2017, 5:33 p.m. UTC | #2
On Thu, May 11, 2017 at 10:26:52PM +0200, Martin Wilck wrote:
> On Tue, 2017-05-09 at 11:57 -0500, Benjamin Marzinski wrote:
> > When multipath reloads a device, it can either fail while loading the
> > new table or while resuming the device. If it fails while resuming
> > the
> > device, the device can get stuck in the suspended state.  To fix
> > this,
> > multipath needs to resume the device again so that it can continue
> > using
> > the old table.
> > 
> > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> > ---
> >  libmultipath/devmapper.c | 19 ++++++++++++++++++-
> >  libmultipath/devmapper.h |  1 +
> >  2 files changed, 19 insertions(+), 1 deletion(-)
> > 
> > diff --git a/libmultipath/devmapper.c b/libmultipath/devmapper.c
> > index 2c4a13a..69b634b 100644
> > --- a/libmultipath/devmapper.c
> > +++ b/libmultipath/devmapper.c
> > @@ -396,7 +396,13 @@ int dm_addmap_reload(struct multipath *mpp, char
> > *params, int flush)
> >  	if (r)
> >  		r = dm_simplecmd(DM_DEVICE_RESUME, mpp->alias,
> > !flush,
> >  				 1, udev_flags, 0);
> > -	return r;
> > +	if (r)
> > +		return r;
> > +
> > +	if (dm_is_suspended(mpp->alias))
> > +		dm_simplecmd(DM_DEVICE_RESUME, mpp->alias, !flush,
> > 1,
> > +			     udev_flags, 0);
> > +	return 0;
> >  }
> 
> Why would the second DM_DEVICE_RESUME call succeed if the first one
> failed?

Because if the first resume fails, device-mapper rolls back to the
original table.

The specific way that someone found this was by running a

multipathd resize

when only some of the paths had been rescaned and noticed the new,
larger, size. Since the first path had changed size (that's all the
multipath looks at to find the new size) the multipath device tried to
reload larger. However, when it tried to resume with the larger size, it
failed since some of the devices were too small.  The second resume lets
it try again with the original table.

Of course, that isn't the only way that this could fail, and it's nice
to be able to gracefully continue using the old table instead of just
being suspended.

-Ben

> 
> Martin
> 
> -- 
> Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Martin Wilck May 16, 2017, 6:56 p.m. UTC | #3
On Tue, 2017-05-16 at 12:33 -0500, Benjamin Marzinski wrote:
> On Thu, May 11, 2017 at 10:26:52PM +0200, Martin Wilck wrote:
> > On Tue, 2017-05-09 at 11:57 -0500, Benjamin Marzinski wrote:
> > > When multipath reloads a device, it can either fail while loading
> > > the
> > > new table or while resuming the device. If it fails while
> > > resuming
> > > the
> > > device, the device can get stuck in the suspended state.  To fix
> > > this,
> > > multipath needs to resume the device again so that it can
> > > continue
> > > using
> > > the old table.
> > > 
> > > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> > > ---
> > >  libmultipath/devmapper.c | 19 ++++++++++++++++++-
> > >  libmultipath/devmapper.h |  1 +
> > >  2 files changed, 19 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/libmultipath/devmapper.c b/libmultipath/devmapper.c
> > > index 2c4a13a..69b634b 100644
> > > --- a/libmultipath/devmapper.c
> > > +++ b/libmultipath/devmapper.c
> > > @@ -396,7 +396,13 @@ int dm_addmap_reload(struct multipath *mpp,
> > > char
> > > *params, int flush)
> > >  	if (r)
> > >  		r = dm_simplecmd(DM_DEVICE_RESUME, mpp->alias,
> > > !flush,
> > >  				 1, udev_flags, 0);
> > > -	return r;
> > > +	if (r)
> > > +		return r;
> > > +
> > > +	if (dm_is_suspended(mpp->alias))
> > > +		dm_simplecmd(DM_DEVICE_RESUME, mpp->alias,
> > > !flush,
> > > 1,
> > > +			     udev_flags, 0);
> > > +	return 0;
> > >  }
> > 
> > Why would the second DM_DEVICE_RESUME call succeed if the first one
> > failed?
> 
> Because if the first resume fails, device-mapper rolls back to the
> original table.

Ah. I didn't know that, and I'm likely to forget it again. Would you
mind adding a short comment at that point in the code? 

Otherwise, ACK.

Martin
Benjamin Marzinski May 17, 2017, 4:25 p.m. UTC | #4
On Tue, May 16, 2017 at 08:56:40PM +0200, Martin Wilck wrote:
> On Tue, 2017-05-16 at 12:33 -0500, Benjamin Marzinski wrote:
> > On Thu, May 11, 2017 at 10:26:52PM +0200, Martin Wilck wrote:
> > > On Tue, 2017-05-09 at 11:57 -0500, Benjamin Marzinski wrote:
> > > > When multipath reloads a device, it can either fail while loading
> > > > the
> > > > new table or while resuming the device. If it fails while
> > > > resuming
> > > > the
> > > > device, the device can get stuck in the suspended state.  To fix
> > > > this,
> > > > multipath needs to resume the device again so that it can
> > > > continue
> > > > using
> > > > the old table.
> > > > 
> > > > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> > > > ---
> > > >  libmultipath/devmapper.c | 19 ++++++++++++++++++-
> > > >  libmultipath/devmapper.h |  1 +
> > > >  2 files changed, 19 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/libmultipath/devmapper.c b/libmultipath/devmapper.c
> > > > index 2c4a13a..69b634b 100644
> > > > --- a/libmultipath/devmapper.c
> > > > +++ b/libmultipath/devmapper.c
> > > > @@ -396,7 +396,13 @@ int dm_addmap_reload(struct multipath *mpp,
> > > > char
> > > > *params, int flush)
> > > >  	if (r)
> > > >  		r = dm_simplecmd(DM_DEVICE_RESUME, mpp->alias,
> > > > !flush,
> > > >  				 1, udev_flags, 0);
> > > > -	return r;
> > > > +	if (r)
> > > > +		return r;
> > > > +
> > > > +	if (dm_is_suspended(mpp->alias))
> > > > +		dm_simplecmd(DM_DEVICE_RESUME, mpp->alias,
> > > > !flush,
> > > > 1,
> > > > +			     udev_flags, 0);
> > > > +	return 0;
> > > >  }
> > > 
> > > Why would the second DM_DEVICE_RESUME call succeed if the first one
> > > failed?
> > 
> > Because if the first resume fails, device-mapper rolls back to the
> > original table.
> 
> Ah. I didn't know that, and I'm likely to forget it again. Would you
> mind adding a short comment at that point in the code? 

Sure.

-Ben

> 
> Otherwise, ACK.
> 
> Martin
> 
> -- 
> Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
diff mbox

Patch

diff --git a/libmultipath/devmapper.c b/libmultipath/devmapper.c
index 2c4a13a..69b634b 100644
--- a/libmultipath/devmapper.c
+++ b/libmultipath/devmapper.c
@@ -396,7 +396,13 @@  int dm_addmap_reload(struct multipath *mpp, char *params, int flush)
 	if (r)
 		r = dm_simplecmd(DM_DEVICE_RESUME, mpp->alias, !flush,
 				 1, udev_flags, 0);
-	return r;
+	if (r)
+		return r;
+
+	if (dm_is_suspended(mpp->alias))
+		dm_simplecmd(DM_DEVICE_RESUME, mpp->alias, !flush, 1,
+			     udev_flags, 0);
+	return 0;
 }
 
 static int
@@ -1030,6 +1036,17 @@  dm_geteventnr (char *name)
 	return info.event_nr;
 }
 
+int
+dm_is_suspended(const char *name)
+{
+	struct dm_info info;
+
+	if (do_get_info(name, &info) != 0)
+		return -1;
+
+	return info.suspended;
+}
+
 char *
 dm_mapname(int major, int minor)
 {
diff --git a/libmultipath/devmapper.h b/libmultipath/devmapper.h
index 5b66865..fa739bc 100644
--- a/libmultipath/devmapper.h
+++ b/libmultipath/devmapper.h
@@ -52,6 +52,7 @@  int dm_enablegroup(char * mapname, int index);
 int dm_disablegroup(char * mapname, int index);
 int dm_get_maps (vector mp);
 int dm_geteventnr (char *name);
+int dm_is_suspended(const char *name);
 int dm_get_major_minor (const char *name, int *major, int *minor);
 char * dm_mapname(int major, int minor);
 int dm_remove_partmaps (const char * mapname, int need_sync,