diff mbox series

[2/2] libxl: do not automatically force detach of block devices

Message ID 20200903100537.1337-3-paul@xen.org (mailing list archive)
State Superseded
Headers show
Series fix 'xl block-detach' | expand

Commit Message

Paul Durrant Sept. 3, 2020, 10:05 a.m. UTC
From: Paul Durrant <pdurrant@amazon.com>

The manpage for 'xl' documents that guest co-operation is required for a (non-
forced) block-detach operation and that it may consequently fail. Currently,
however, the implementation of generic device removal means that a time-out
of a block-detach is being automatically re-tried with the force flag set
rather than failing. This patch stops such behaviour.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
---
Cc: Ian Jackson <iwj@xenproject.org>
Cc: Wei Liu <wl@xen.org>
Cc: Anthony PERARD <anthony.perard@citrix.com>
---
 tools/libxl/libxl_device.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Comments

Roger Pau Monné Sept. 4, 2020, 8:52 a.m. UTC | #1
On Thu, Sep 03, 2020 at 11:05:37AM +0100, Paul Durrant wrote:
> From: Paul Durrant <pdurrant@amazon.com>
> 
> The manpage for 'xl' documents that guest co-operation is required for a (non-
> forced) block-detach operation and that it may consequently fail. Currently,
> however, the implementation of generic device removal means that a time-out
> of a block-detach is being automatically re-tried with the force flag set
> rather than failing. This patch stops such behaviour.

Won't this break cleanup on domain shutdown if the guest doesn't close
the devices itself?

I think we need some special-casing on shutdown that keeps the current
behavior on that case.

> 
> Signed-off-by: Paul Durrant <pdurrant@amazon.com>
> ---
> Cc: Ian Jackson <iwj@xenproject.org>
> Cc: Wei Liu <wl@xen.org>
> Cc: Anthony PERARD <anthony.perard@citrix.com>
> ---
>  tools/libxl/libxl_device.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/libxl/libxl_device.c b/tools/libxl/libxl_device.c
> index 0381c5d509..d17ca78848 100644
> --- a/tools/libxl/libxl_device.c
> +++ b/tools/libxl/libxl_device.c
> @@ -1092,7 +1092,8 @@ static void device_backend_callback(libxl__egc *egc, libxl__ev_devstate *ds,
>  
>      if (rc == ERROR_TIMEDOUT &&
>          aodev->action == LIBXL__DEVICE_ACTION_REMOVE &&
> -        !aodev->force) {
> +        !aodev->force &&
> +        aodev->dev->kind != LIBXL__DEVICE_KIND_VBD) {

Doing this differentiation for block only seems weird, we should treat
all devices equally.

Thanks, Roger.
Paul Durrant Sept. 4, 2020, 9:47 a.m. UTC | #2
> -----Original Message-----
> From: Roger Pau Monné <roger.pau@citrix.com>
> Sent: 04 September 2020 09:53
> To: Paul Durrant <paul@xen.org>
> Cc: xen-devel@lists.xenproject.org; Paul Durrant <pdurrant@amazon.com>; Ian Jackson
> <iwj@xenproject.org>; Wei Liu <wl@xen.org>; Anthony PERARD <anthony.perard@citrix.com>
> Subject: Re: [PATCH 2/2] libxl: do not automatically force detach of block devices
> 
> On Thu, Sep 03, 2020 at 11:05:37AM +0100, Paul Durrant wrote:
> > From: Paul Durrant <pdurrant@amazon.com>
> >
> > The manpage for 'xl' documents that guest co-operation is required for a (non-
> > forced) block-detach operation and that it may consequently fail. Currently,
> > however, the implementation of generic device removal means that a time-out
> > of a block-detach is being automatically re-tried with the force flag set
> > rather than failing. This patch stops such behaviour.
> 
> Won't this break cleanup on domain shutdown if the guest doesn't close
> the devices itself?
> 

I don't think so... AFAIK that's a destroy i.e. it's always forced (since there's no way the guest can co-operate at the point).

> I think we need some special-casing on shutdown that keeps the current
> behavior on that case.
> 
> >
> > Signed-off-by: Paul Durrant <pdurrant@amazon.com>
> > ---
> > Cc: Ian Jackson <iwj@xenproject.org>
> > Cc: Wei Liu <wl@xen.org>
> > Cc: Anthony PERARD <anthony.perard@citrix.com>
> > ---
> >  tools/libxl/libxl_device.c | 9 ++++++---
> >  1 file changed, 6 insertions(+), 3 deletions(-)
> >
> > diff --git a/tools/libxl/libxl_device.c b/tools/libxl/libxl_device.c
> > index 0381c5d509..d17ca78848 100644
> > --- a/tools/libxl/libxl_device.c
> > +++ b/tools/libxl/libxl_device.c
> > @@ -1092,7 +1092,8 @@ static void device_backend_callback(libxl__egc *egc, libxl__ev_devstate *ds,
> >
> >      if (rc == ERROR_TIMEDOUT &&
> >          aodev->action == LIBXL__DEVICE_ACTION_REMOVE &&
> > -        !aodev->force) {
> > +        !aodev->force &&
> > +        aodev->dev->kind != LIBXL__DEVICE_KIND_VBD) {
> 
> Doing this differentiation for block only seems weird, we should treat
> all devices equally.
> 

That is not how things are documented in the xl manpage though; block-detach is the only one to have a force option. I assume that was deliberate.

  Paul

> Thanks, Roger.
Roger Pau Monné Sept. 4, 2020, 1:50 p.m. UTC | #3
On Fri, Sep 04, 2020 at 10:47:37AM +0100, Paul Durrant wrote:
> > -----Original Message-----
> > From: Roger Pau Monné <roger.pau@citrix.com>
> > Sent: 04 September 2020 09:53
> > To: Paul Durrant <paul@xen.org>
> > Cc: xen-devel@lists.xenproject.org; Paul Durrant <pdurrant@amazon.com>; Ian Jackson
> > <iwj@xenproject.org>; Wei Liu <wl@xen.org>; Anthony PERARD <anthony.perard@citrix.com>
> > Subject: Re: [PATCH 2/2] libxl: do not automatically force detach of block devices
> > 
> > On Thu, Sep 03, 2020 at 11:05:37AM +0100, Paul Durrant wrote:
> > > From: Paul Durrant <pdurrant@amazon.com>
> > >
> > > The manpage for 'xl' documents that guest co-operation is required for a (non-
> > > forced) block-detach operation and that it may consequently fail. Currently,
> > > however, the implementation of generic device removal means that a time-out
> > > of a block-detach is being automatically re-tried with the force flag set
> > > rather than failing. This patch stops such behaviour.
> > 
> > Won't this break cleanup on domain shutdown if the guest doesn't close
> > the devices itself?
> > 
> 
> I don't think so... AFAIK that's a destroy i.e. it's always forced (since there's no way the guest can co-operate at the point).

Urgh, yes, sorry.

> > I think we need some special-casing on shutdown that keeps the current
> > behavior on that case.
> > 
> > >
> > > Signed-off-by: Paul Durrant <pdurrant@amazon.com>
> > > ---
> > > Cc: Ian Jackson <iwj@xenproject.org>
> > > Cc: Wei Liu <wl@xen.org>
> > > Cc: Anthony PERARD <anthony.perard@citrix.com>
> > > ---
> > >  tools/libxl/libxl_device.c | 9 ++++++---
> > >  1 file changed, 6 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/tools/libxl/libxl_device.c b/tools/libxl/libxl_device.c
> > > index 0381c5d509..d17ca78848 100644
> > > --- a/tools/libxl/libxl_device.c
> > > +++ b/tools/libxl/libxl_device.c
> > > @@ -1092,7 +1092,8 @@ static void device_backend_callback(libxl__egc *egc, libxl__ev_devstate *ds,
> > >
> > >      if (rc == ERROR_TIMEDOUT &&
> > >          aodev->action == LIBXL__DEVICE_ACTION_REMOVE &&
> > > -        !aodev->force) {
> > > +        !aodev->force &&
> > > +        aodev->dev->kind != LIBXL__DEVICE_KIND_VBD) {
> > 
> > Doing this differentiation for block only seems weird, we should treat
> > all devices equally.
> > 
> 
> That is not how things are documented in the xl manpage though; block-detach is the only one to have a force option. I assume that was deliberate.

Oh right, that's weird. I guess removing a block device without guest
cooperation will likely result in a guest crash, while the same it's
not true for other devices.

Thanks, Roger.
Wei Liu Sept. 8, 2020, 2:19 p.m. UTC | #4
On Thu, Sep 03, 2020 at 11:05:37AM +0100, Paul Durrant wrote:
> From: Paul Durrant <pdurrant@amazon.com>
> 
> The manpage for 'xl' documents that guest co-operation is required for a (non-
> forced) block-detach operation and that it may consequently fail. Currently,
> however, the implementation of generic device removal means that a time-out
> of a block-detach is being automatically re-tried with the force flag set
> rather than failing. This patch stops such behaviour.
> 
> Signed-off-by: Paul Durrant <pdurrant@amazon.com>

I'm two-minded here.

On the one hand, special-casing VBD in libxl to conform to xl's
behaviour seems wrong to me.

On the other hand, if we want to treat all devices the same in libxl,
libxl should drop its force-removal behaviour all together, and the
retry behaviour would need to be implemented in xl for all devices
excepts for VBD. This requires a bit of code churn and, more
importantly, changes how those device removal APIs behave. In the end
this effort may not worth it.

If we go with the patch here, we should document this special case on
libxl level somehow.

Anthony and Ian, do you have any opinion on this?

Wei.

> ---
> Cc: Ian Jackson <iwj@xenproject.org>
> Cc: Wei Liu <wl@xen.org>
> Cc: Anthony PERARD <anthony.perard@citrix.com>
> ---
>  tools/libxl/libxl_device.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/libxl/libxl_device.c b/tools/libxl/libxl_device.c
> index 0381c5d509..d17ca78848 100644
> --- a/tools/libxl/libxl_device.c
> +++ b/tools/libxl/libxl_device.c
> @@ -1092,7 +1092,8 @@ static void device_backend_callback(libxl__egc *egc, libxl__ev_devstate *ds,
>  
>      if (rc == ERROR_TIMEDOUT &&
>          aodev->action == LIBXL__DEVICE_ACTION_REMOVE &&
> -        !aodev->force) {
> +        !aodev->force &&
> +        aodev->dev->kind != LIBXL__DEVICE_KIND_VBD) {
>          LOGD(DEBUG, aodev->dev->domid, "Timeout reached, initiating forced remove");
>          aodev->force = 1;
>          libxl__initiate_device_generic_remove(egc, aodev);
> @@ -1103,7 +1104,8 @@ static void device_backend_callback(libxl__egc *egc, libxl__ev_devstate *ds,
>          LOGD(ERROR, aodev->dev->domid, "unable to %s device with path %s",
>                      libxl__device_action_to_string(aodev->action),
>                      libxl__device_backend_path(gc, aodev->dev));
> -        goto out;
> +        if (!aodev->force)
> +            goto out;
>      }
>  
>      device_hotplug(egc, aodev);
> @@ -1319,7 +1321,8 @@ static void device_hotplug_done(libxl__egc *egc, libxl__ao_device *aodev)
>      device_hotplug_clean(gc, aodev);
>  
>      /* Clean xenstore if it's a disconnection */
> -    if (aodev->action == LIBXL__DEVICE_ACTION_REMOVE) {
> +    if (aodev->action == LIBXL__DEVICE_ACTION_REMOVE &&
> +        (aodev->force || !aodev->rc)) {
>          rc = libxl__device_destroy(gc, aodev->dev);
>          if (!aodev->rc)
>              aodev->rc = rc;
> -- 
> 2.20.1
>
Roger Pau Monné Sept. 14, 2020, 10:41 a.m. UTC | #5
On Tue, Sep 08, 2020 at 02:19:03PM +0000, Wei Liu wrote:
> On Thu, Sep 03, 2020 at 11:05:37AM +0100, Paul Durrant wrote:
> > From: Paul Durrant <pdurrant@amazon.com>
> > 
> > The manpage for 'xl' documents that guest co-operation is required for a (non-
> > forced) block-detach operation and that it may consequently fail. Currently,
> > however, the implementation of generic device removal means that a time-out
> > of a block-detach is being automatically re-tried with the force flag set
> > rather than failing. This patch stops such behaviour.
> > 
> > Signed-off-by: Paul Durrant <pdurrant@amazon.com>
> 
> I'm two-minded here.
> 
> On the one hand, special-casing VBD in libxl to conform to xl's
> behaviour seems wrong to me.
> 
> On the other hand, if we want to treat all devices the same in libxl,
> libxl should drop its force-removal behaviour all together, and the
> retry behaviour would need to be implemented in xl for all devices
> excepts for VBD. This requires a bit of code churn and, more
> importantly, changes how those device removal APIs behave. In the end
> this effort may not worth it.

I would be worried about those changes, as we would likely have to
also change libvirt or any other downstreams?

> If we go with the patch here, we should document this special case on
> libxl level somehow.
> 
> Anthony and Ian, do you have any opinion on this?

Maybe a new function should be introduced instead, that attempts to
remove a device gracefully and fail otherwise?

Then none of the current APIs would change, and xl could use this new
function to handle VBD removal?

Roger.
Wei Liu Sept. 14, 2020, 12:26 p.m. UTC | #6
On Mon, Sep 14, 2020 at 12:41:09PM +0200, Roger Pau Monné wrote:
> On Tue, Sep 08, 2020 at 02:19:03PM +0000, Wei Liu wrote:
> > On Thu, Sep 03, 2020 at 11:05:37AM +0100, Paul Durrant wrote:
> > > From: Paul Durrant <pdurrant@amazon.com>
> > > 
> > > The manpage for 'xl' documents that guest co-operation is required for a (non-
> > > forced) block-detach operation and that it may consequently fail. Currently,
> > > however, the implementation of generic device removal means that a time-out
> > > of a block-detach is being automatically re-tried with the force flag set
> > > rather than failing. This patch stops such behaviour.
> > > 
> > > Signed-off-by: Paul Durrant <pdurrant@amazon.com>
> > 
> > I'm two-minded here.
> > 
> > On the one hand, special-casing VBD in libxl to conform to xl's
> > behaviour seems wrong to me.
> > 
> > On the other hand, if we want to treat all devices the same in libxl,
> > libxl should drop its force-removal behaviour all together, and the
> > retry behaviour would need to be implemented in xl for all devices
> > excepts for VBD. This requires a bit of code churn and, more
> > importantly, changes how those device removal APIs behave. In the end
> > this effort may not worth it.
> 
> I would be worried about those changes, as we would likely have to
> also change libvirt or any other downstreams?
> 
> > If we go with the patch here, we should document this special case on
> > libxl level somehow.
> > 
> > Anthony and Ian, do you have any opinion on this?
> 
> Maybe a new function should be introduced instead, that attempts to
> remove a device gracefully and fail otherwise?
> 
> Then none of the current APIs would change, and xl could use this new
> function to handle VBD removal?

This sounds fine to me.

Wei.

> 
> Roger.
Ian Jackson Sept. 28, 2020, 1:43 p.m. UTC | #7
Wei Liu writes ("Re: [PATCH 2/2] libxl: do not automatically force detach of block devices"):
> On Mon, Sep 14, 2020 at 12:41:09PM +0200, Roger Pau Monné wrote:
> > Maybe a new function should be introduced instead, that attempts to
> > remove a device gracefully and fail otherwise?
> > 
> > Then none of the current APIs would change, and xl could use this new
> > function to handle VBD removal?
> 
> This sounds fine to me.

I agree.

If there is going to be different default policy for different devices
it ought to be in xl, not libxl, but frankly I think this is an
anomaly.

I suggest we have a new entrypoint that specifies the fallback
behaviour explicitly.  ISTM that there are three possible behaviours:
 - fail if the guest does not cooperate
 - fall back to force remove
 - rip the device out immediately

The last of these would be useful only in rare situations.  IDK if the
length of the timeout (for the first two cases) ought to be a
parameter too.

Ian.
diff mbox series

Patch

diff --git a/tools/libxl/libxl_device.c b/tools/libxl/libxl_device.c
index 0381c5d509..d17ca78848 100644
--- a/tools/libxl/libxl_device.c
+++ b/tools/libxl/libxl_device.c
@@ -1092,7 +1092,8 @@  static void device_backend_callback(libxl__egc *egc, libxl__ev_devstate *ds,
 
     if (rc == ERROR_TIMEDOUT &&
         aodev->action == LIBXL__DEVICE_ACTION_REMOVE &&
-        !aodev->force) {
+        !aodev->force &&
+        aodev->dev->kind != LIBXL__DEVICE_KIND_VBD) {
         LOGD(DEBUG, aodev->dev->domid, "Timeout reached, initiating forced remove");
         aodev->force = 1;
         libxl__initiate_device_generic_remove(egc, aodev);
@@ -1103,7 +1104,8 @@  static void device_backend_callback(libxl__egc *egc, libxl__ev_devstate *ds,
         LOGD(ERROR, aodev->dev->domid, "unable to %s device with path %s",
                     libxl__device_action_to_string(aodev->action),
                     libxl__device_backend_path(gc, aodev->dev));
-        goto out;
+        if (!aodev->force)
+            goto out;
     }
 
     device_hotplug(egc, aodev);
@@ -1319,7 +1321,8 @@  static void device_hotplug_done(libxl__egc *egc, libxl__ao_device *aodev)
     device_hotplug_clean(gc, aodev);
 
     /* Clean xenstore if it's a disconnection */
-    if (aodev->action == LIBXL__DEVICE_ACTION_REMOVE) {
+    if (aodev->action == LIBXL__DEVICE_ACTION_REMOVE &&
+        (aodev->force || !aodev->rc)) {
         rc = libxl__device_destroy(gc, aodev->dev);
         if (!aodev->rc)
             aodev->rc = rc;