diff mbox

[v2,1/2] libxl/devd: fix a race with concurrent device addition/removal

Message ID 20170511105544.35197-2-roger.pau@citrix.com (mailing list archive)
State New, archived
Headers show

Commit Message

Roger Pau Monné May 11, 2017, 10:55 a.m. UTC
Current code can free the libxl__device inside of the libxl__ddomain_device
before the addition has finished if a removal happens while an addition is
still in process:

  backend_watch_callback
            |
            v
       add_device
            |                 backend_watch_callback
    (async operation)                   |
            |                           v
            |                     remove_device
            |                           |
            |                           V
            |                    device_complete
            |                 (free libxl__device)
            v
     device_complete
  (deref libxl__device)

Fix this by creating a temporary copy of the libxl__device, that's tracked by
the GC of the nested async operation. This ensures that the libxl__device used
by the async operations cannot be freed while being used.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
---
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Julien Grall <julien.grall@arm.com>
---
 tools/libxl/libxl_device.c | 32 +++++++++++++++++++-------------
 1 file changed, 19 insertions(+), 13 deletions(-)

Comments

Ian Jackson May 11, 2017, 11:06 a.m. UTC | #1
Roger Pau Monne writes ("[PATCH v2 1/2] libxl/devd: fix a race with concurrent device addition/removal"):
> Current code can free the libxl__device inside of the libxl__ddomain_device
> before the addition has finished if a removal happens while an addition is
> still in process:
...
> Fix this by creating a temporary copy of the libxl__device, that's
> tracked by the GC of the nested async operation. This ensures that
> the libxl__device used by the async operations cannot be freed while
> being used.
...
>          GCNEW(aodev);
>          libxl__prepare_ao_device(ao, aodev);
> -        aodev->dev = dev;
> +        /*
> +         * Clone the libxl__device to avoid races if remove_device is called
> +         * before the device addition has finished.
> +         */
> +        GCNEW(aodev->dev);
> +        *aodev->dev = *dev;

This does conveniently disentangle the memory management, so I think
it's a good approach.

But it reads kind of oddly to me.  I think it is not buggy, but can
you add a comment to the definition of libxl__device, saying that it
is a transparent structure containing no external memory references ?

Otherwise this copy is not really justifiable, because in C, in
general, structs might contain private fields, or memory references or
linked list entries or something.

Thanks,
Ian.
Roger Pau Monné May 11, 2017, 11:43 a.m. UTC | #2
On Thu, May 11, 2017 at 12:06:08PM +0100, Ian Jackson wrote:
> Roger Pau Monne writes ("[PATCH v2 1/2] libxl/devd: fix a race with concurrent device addition/removal"):
> > Current code can free the libxl__device inside of the libxl__ddomain_device
> > before the addition has finished if a removal happens while an addition is
> > still in process:
> ...
> > Fix this by creating a temporary copy of the libxl__device, that's
> > tracked by the GC of the nested async operation. This ensures that
> > the libxl__device used by the async operations cannot be freed while
> > being used.
> ...
> >          GCNEW(aodev);
> >          libxl__prepare_ao_device(ao, aodev);
> > -        aodev->dev = dev;
> > +        /*
> > +         * Clone the libxl__device to avoid races if remove_device is called
> > +         * before the device addition has finished.
> > +         */
> > +        GCNEW(aodev->dev);
> > +        *aodev->dev = *dev;
> 
> This does conveniently disentangle the memory management, so I think
> it's a good approach.
> 
> But it reads kind of oddly to me.  I think it is not buggy, but can
> you add a comment to the definition of libxl__device, saying that it
> is a transparent structure containing no external memory references ?

Sure, before implementing this I already took a look at the contents of the
libxl__device struct, but I agree that a comment is in place in case someone
expands the fields of the struct later on.

> Otherwise this copy is not really justifiable, because in C, in
> general, structs might contain private fields, or memory references or
> linked list entries or something.

Thanks, Roger.

NB: FWIW, I'm planning to keep Wei's RB since this is a cosmetic change.
Julien Grall May 15, 2017, 1:37 p.m. UTC | #3
Hi Roger,

On 11/05/17 12:43, Roger Pau Monne wrote:
> On Thu, May 11, 2017 at 12:06:08PM +0100, Ian Jackson wrote:
>> Roger Pau Monne writes ("[PATCH v2 1/2] libxl/devd: fix a race with concurrent device addition/removal"):
>>> Current code can free the libxl__device inside of the libxl__ddomain_device
>>> before the addition has finished if a removal happens while an addition is
>>> still in process:
>> ...
>>> Fix this by creating a temporary copy of the libxl__device, that's
>>> tracked by the GC of the nested async operation. This ensures that
>>> the libxl__device used by the async operations cannot be freed while
>>> being used.
>> ...
>>>          GCNEW(aodev);
>>>          libxl__prepare_ao_device(ao, aodev);
>>> -        aodev->dev = dev;
>>> +        /*
>>> +         * Clone the libxl__device to avoid races if remove_device is called
>>> +         * before the device addition has finished.
>>> +         */
>>> +        GCNEW(aodev->dev);
>>> +        *aodev->dev = *dev;
>>
>> This does conveniently disentangle the memory management, so I think
>> it's a good approach.
>>
>> But it reads kind of oddly to me.  I think it is not buggy, but can
>> you add a comment to the definition of libxl__device, saying that it
>> is a transparent structure containing no external memory references ?
>
> Sure, before implementing this I already took a look at the contents of the
> libxl__device struct, but I agree that a comment is in place in case someone
> expands the fields of the struct later on.
>
>> Otherwise this copy is not really justifiable, because in C, in
>> general, structs might contain private fields, or memory references or
>> linked list entries or something.
>
> Thanks, Roger.
>
> NB: FWIW, I'm planning to keep Wei's RB since this is a cosmetic change.

Is this patch series targeting Xen 4.9?

Cheers,
Roger Pau Monné May 15, 2017, 5:42 p.m. UTC | #4
On Mon, May 15, 2017 at 02:37:33PM +0100, Julien Grall wrote:
> Hi Roger,
> 
> On 11/05/17 12:43, Roger Pau Monne wrote:
> > On Thu, May 11, 2017 at 12:06:08PM +0100, Ian Jackson wrote:
> > > Roger Pau Monne writes ("[PATCH v2 1/2] libxl/devd: fix a race with concurrent device addition/removal"):
> > > > Current code can free the libxl__device inside of the libxl__ddomain_device
> > > > before the addition has finished if a removal happens while an addition is
> > > > still in process:
> > > ...
> > > > Fix this by creating a temporary copy of the libxl__device, that's
> > > > tracked by the GC of the nested async operation. This ensures that
> > > > the libxl__device used by the async operations cannot be freed while
> > > > being used.
> > > ...
> > > >          GCNEW(aodev);
> > > >          libxl__prepare_ao_device(ao, aodev);
> > > > -        aodev->dev = dev;
> > > > +        /*
> > > > +         * Clone the libxl__device to avoid races if remove_device is called
> > > > +         * before the device addition has finished.
> > > > +         */
> > > > +        GCNEW(aodev->dev);
> > > > +        *aodev->dev = *dev;
> > > 
> > > This does conveniently disentangle the memory management, so I think
> > > it's a good approach.
> > > 
> > > But it reads kind of oddly to me.  I think it is not buggy, but can
> > > you add a comment to the definition of libxl__device, saying that it
> > > is a transparent structure containing no external memory references ?
> > 
> > Sure, before implementing this I already took a look at the contents of the
> > libxl__device struct, but I agree that a comment is in place in case someone
> > expands the fields of the struct later on.
> > 
> > > Otherwise this copy is not really justifiable, because in C, in
> > > general, structs might contain private fields, or memory references or
> > > linked list entries or something.
> > 
> > Thanks, Roger.
> > 
> > NB: FWIW, I'm planning to keep Wei's RB since this is a cosmetic change.
> 
> Is this patch series targeting Xen 4.9?

Yes, I think so. I will post a new version soon and Cc you.

Roger.
Ian Jackson May 16, 2017, 10:34 a.m. UTC | #5
Julien Grall writes ("Re: [PATCH v2 1/2] libxl/devd: fix a race with concurrent device addition/removal"):
> Is this patch series targeting Xen 4.9?

IMO it should be.  The changes are confined to the xl devd code, which
is currently broken.  So the reward/risk ratio is good.

Ian.
diff mbox

Patch

diff --git a/tools/libxl/libxl_device.c b/tools/libxl/libxl_device.c
index 5e966762c6..cd4ad05a6f 100644
--- a/tools/libxl/libxl_device.c
+++ b/tools/libxl/libxl_device.c
@@ -1415,9 +1415,6 @@  static void device_complete(libxl__egc *egc, libxl__ao_device *aodev)
                libxl__device_action_to_string(aodev->action),
                aodev->rc ? "failed" : "succeed");
 
-    if (aodev->action == LIBXL__DEVICE_ACTION_REMOVE)
-        free(aodev->dev);
-
     libxl__nested_ao_free(aodev->ao);
 }
 
@@ -1521,7 +1518,12 @@  static int add_device(libxl__egc *egc, libxl__ao *ao,
 
         GCNEW(aodev);
         libxl__prepare_ao_device(ao, aodev);
-        aodev->dev = dev;
+        /*
+         * Clone the libxl__device to avoid races if remove_device is called
+         * before the device addition has finished.
+         */
+        GCNEW(aodev->dev);
+        *aodev->dev = *dev;
         aodev->action = LIBXL__DEVICE_ACTION_ADD;
         aodev->callback = device_complete;
         libxl__wait_device_connection(egc, aodev);
@@ -1564,7 +1566,12 @@  static int remove_device(libxl__egc *egc, libxl__ao *ao,
 
         GCNEW(aodev);
         libxl__prepare_ao_device(ao, aodev);
-        aodev->dev = dev;
+        /*
+         * Clone the libxl__device to avoid races if there's a add_device
+         * running in parallel.
+         */
+        GCNEW(aodev->dev);
+        *aodev->dev = *dev;
         aodev->action = LIBXL__DEVICE_ACTION_REMOVE;
         aodev->callback = device_complete;
         libxl__initiate_device_generic_remove(egc, aodev);
@@ -1576,7 +1583,6 @@  static int remove_device(libxl__egc *egc, libxl__ao *ao,
                 goto out;
         }
         libxl__device_destroy(gc, dev);
-        free(dev);
         /* Fall through to return > 0, no ao has been dispatched */
     default:
         rc = 1;
@@ -1597,7 +1603,7 @@  static void backend_watch_callback(libxl__egc *egc, libxl__ev_xswatch *watch,
     char *p, *path;
     const char *sstate, *sonline;
     int state, online, rc, num_devs;
-    libxl__device *dev = NULL;
+    libxl__device *dev;
     libxl__ddomain_device *ddev = NULL;
     libxl__ddomain_guest *dguest = NULL;
     bool free_ao = false;
@@ -1625,7 +1631,7 @@  static void backend_watch_callback(libxl__egc *egc, libxl__ev_xswatch *watch,
         goto skip;
     online = atoi(sonline);
 
-    dev = libxl__zalloc(NOGC, sizeof(*dev));
+    GCNEW(dev);
     rc = libxl__parse_backend_path(gc, path, dev);
     if (rc)
         goto skip;
@@ -1659,7 +1665,8 @@  static void backend_watch_callback(libxl__egc *egc, libxl__ev_xswatch *watch,
          * to the list of active devices for a given guest.
          */
         ddev = libxl__zalloc(NOGC, sizeof(*ddev));
-        ddev->dev = dev;
+        ddev->dev = libxl__zalloc(NOGC, sizeof(*ddev->dev));
+        *ddev->dev = *dev;
         LIBXL_SLIST_INSERT_HEAD(&dguest->devices, ddev, next);
         LOGD(DEBUG, dev->domid, "Added device %s to the list of active devices",
              path);
@@ -1670,9 +1677,6 @@  static void backend_watch_callback(libxl__egc *egc, libxl__ev_xswatch *watch,
         /*
          * Removal of an active device, remove it from the list and
          * free it's data structures if they are no longer needed.
-         *
-         * The free of the associated libxl__device is left to the
-         * helper remove_device function.
          */
         LIBXL_SLIST_REMOVE(&dguest->devices, ddev, libxl__ddomain_device,
                            next);
@@ -1682,6 +1686,7 @@  static void backend_watch_callback(libxl__egc *egc, libxl__ev_xswatch *watch,
         if (rc > 0)
             free_ao = true;
 
+        free(ddev->dev);
         free(ddev);
         /* If this was the last device in the domain, remove it from the list */
         num_devs = dguest->num_vifs + dguest->num_vbds + dguest->num_qdisks;
@@ -1703,7 +1708,8 @@  static void backend_watch_callback(libxl__egc *egc, libxl__ev_xswatch *watch,
 
 skip:
     libxl__nested_ao_free(nested_ao);
-    free(dev);
+    if (ddev)
+        free(ddev->dev);
     free(ddev);
     free(dguest);
     return;