diff mbox

[3/3] rbd: close remove vs. notify race leading to use-after-free

Message ID 1377757447-23515-4-git-send-email-josh.durgin@inktank.com (mailing list archive)
State New, archived
Headers show

Commit Message

Josh Durgin Aug. 29, 2013, 6:24 a.m. UTC
Removing a device deallocates the disk, unschedules the watch, and
finally cleans up the rbd_dev structure. rbd_dev_refresh(), called
from the watch callback, updates the disk size and rbd_dev
structure. With no locking between them, rbd_dev_refresh() may use the
device or rbd_dev after they've been freed.

To fix this, use the header_rwsem to protect all the work
rbd_dev_refresh() does, and take it exclusively in rbd_remove() where
the block device is released and the watch is canceled.
rbd_bus_del_dev() ends up releasing the block device, so no requests
to the device remain after this. This makes all the work in
rbd_dev_refresh() unnecessary, as well as race-prone, so skip it if
the watch has been canceled.

Finally, flush the osd client's notify queue before deallocating the
rbd dev, so that any callbacks remaining can read rbd_dev->watch_event
safely. No more notifies can enter the queue at this point since the
watch has been canceled.

Fixes: http://tracker.ceph.com/issues/5636
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
---
 drivers/block/rbd.c |   25 ++++++++++++++++++++++---
 1 files changed, 22 insertions(+), 3 deletions(-)

Comments

Sage Weil Aug. 29, 2013, 2:46 p.m. UTC | #1
On Wed, 28 Aug 2013, Josh Durgin wrote:
> Removing a device deallocates the disk, unschedules the watch, and
> finally cleans up the rbd_dev structure. rbd_dev_refresh(), called
> from the watch callback, updates the disk size and rbd_dev
> structure. With no locking between them, rbd_dev_refresh() may use the
> device or rbd_dev after they've been freed.
> 
> To fix this, use the header_rwsem to protect all the work
> rbd_dev_refresh() does, and take it exclusively in rbd_remove() where
> the block device is released and the watch is canceled.
> rbd_bus_del_dev() ends up releasing the block device, so no requests
> to the device remain after this. This makes all the work in
> rbd_dev_refresh() unnecessary, as well as race-prone, so skip it if
> the watch has been canceled.
> 
> Finally, flush the osd client's notify queue before deallocating the
> rbd dev, so that any callbacks remaining can read rbd_dev->watch_event
> safely. No more notifies can enter the queue at this point since the
> watch has been canceled.
> 
> Fixes: http://tracker.ceph.com/issues/5636
> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
> ---
>  drivers/block/rbd.c |   25 ++++++++++++++++++++++---
>  1 files changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index fef3687..63e1590 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -3327,10 +3327,13 @@ static void rbd_exists_validate(struct rbd_device *rbd_dev)
>  static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>  {
>  	u64 mapping_size;
> -	int ret;
> +	int ret = 0;
>  
>  	rbd_assert(rbd_image_format_valid(rbd_dev->image_format));
>  	down_write(&rbd_dev->header_rwsem);
> +	if (!rbd_dev->watch_event)
> +		goto out;
> +
>  	mapping_size = rbd_dev->mapping.size;
>  	if (rbd_dev->image_format == 1)
>  		ret = rbd_dev_v1_header_info(rbd_dev);
> @@ -3340,7 +3343,6 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>  	/* If it's a mapped snapshot, validate its EXISTS flag */
>  
>  	rbd_exists_validate(rbd_dev);
> -	up_write(&rbd_dev->header_rwsem);
>  
>  	if (mapping_size != rbd_dev->mapping.size) {
>  		sector_t size;
> @@ -3350,7 +3352,8 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>  		set_capacity(rbd_dev->disk, size);
>  		revalidate_disk(rbd_dev->disk);
>  	}
> -
> +out:
> +	up_write(&rbd_dev->header_rwsem);
>  	return ret;
>  }
>  
> @@ -5159,10 +5162,26 @@ static ssize_t rbd_remove(struct bus_type *bus,
>  	if (ret < 0 || already)
>  		return ret;
>  
> +	/*
> +	 * take header semaphore while destroying the device and
> +	 * canceling the watch so that device destruction will
> +	 * not race with device updates from the watch callback
> +	 */
> +	down_write(&rbd_dev->header_rwsem);
>  	rbd_bus_del_dev(rbd_dev);
>  	ret = rbd_dev_header_watch_sync(rbd_dev, false);
> +	up_write(&rbd_dev->header_rwsem);

If I'm reading thsi right, we're holding the rwsem exclusively, and 
blocking the workqueue, for the duration of the osd request to remove the 
watch.  It worries me to block up the workqueue for that long (possibly 
for other images).  I don't think it will deadlock, but if hte cluster is 
unhealthy this one request could take a while.

What we if we just add a separate flag "shutting down" and use that 
instead of the rbd_dev->watch_event?  Then we can up_write prior to the 
watch removal call...

sage


> +
>  	if (ret)
>  		rbd_warn(rbd_dev, "failed to cancel watch event (%d)\n", ret);
> +
> +	/*
> +	 * flush remaing watch callbacks - these don't update anything
> +	 * anymore since rbd_dev->watch_event is NULL, but it avoids
> +	 * the watch callback using a freed rbd_dev
> +	 */
> +	dout("%s: flushing notifies", __func__);
> +	ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc);
>  	rbd_dev_image_release(rbd_dev);
>  	module_put(THIS_MODULE);
>  
> -- 
> 1.7.2.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Elder Aug. 29, 2013, 3:21 p.m. UTC | #2
On 08/29/2013 01:24 AM, Josh Durgin wrote:
> Removing a device deallocates the disk, unschedules the watch, and
> finally cleans up the rbd_dev structure. rbd_dev_refresh(), called
> from the watch callback, updates the disk size and rbd_dev
> structure. With no locking between them, rbd_dev_refresh() may use the
> device or rbd_dev after they've been freed.
> 
> To fix this, use the header_rwsem to protect all the work
> rbd_dev_refresh() does, and take it exclusively in rbd_remove() where
> the block device is released and the watch is canceled.

It seems to me this was a little bit of a tricky area.  As I recall
there was a problem related to lock inversion, holding the semaphore
while calling revalidate_disk().  (I'm sorry, this may be wrong and
I can't dig much into it further at the moment.)  In any case, make
sure you test with lockdep enabled and look for splats to the console.

Also, there was once a function rbd_update_mapping_size(),
which encapsulated both updating the size field and then
calling set_capacity() immediately thereafter.  When images
were refreshed (in rbd_dev_vX_refresh()), these updates would
be made by calling rbd_update_mapping_size().  That made good
sense, but after this commit:
    00a653e216  rbd: update capacity in rbd_dev_refresh()
the function became pretty trivial and it got removed.

I think due to the lockdep thing, revalidate_disk() was
later called outside the protection of the mutex.  It may
be necessary to at least move those size updates back into
the refresh functions to resolve the lockdep problem.

Anyway, having a function that encapsulates these size changes
like before might be nice, but it's not that important.  If
there really is a problem with calling revalidate_disk() with
the mutex held, then it's not all done in a single unit anyway.

> rbd_bus_del_dev() ends up releasing the block device, so no requests
> to the device remain after this. This makes all the work in
> rbd_dev_refresh() unnecessary, as well as race-prone, so skip it if
> the watch has been canceled.

This makes sense, but a brief comment explaining why you're
skipping it at the point in the code would be helpful.

> Finally, flush the osd client's notify queue before deallocating the
> rbd dev, so that any callbacks remaining can read rbd_dev->watch_event
> safely. No more notifies can enter the queue at this point since the
> watch has been canceled.
> 
> Fixes: http://tracker.ceph.com/issues/5636
> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

Please verify there's no lockdep reports if you haven't already.
And consider my other comments, but barring lockdep issues:

Reviewed-by: Alex Elder <elder@linaro.org>

> ---
>  drivers/block/rbd.c |   25 ++++++++++++++++++++++---
>  1 files changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index fef3687..63e1590 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -3327,10 +3327,13 @@ static void rbd_exists_validate(struct rbd_device *rbd_dev)
>  static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>  {
>  	u64 mapping_size;
> -	int ret;
> +	int ret = 0;
>  
>  	rbd_assert(rbd_image_format_valid(rbd_dev->image_format));
>  	down_write(&rbd_dev->header_rwsem);
> +	if (!rbd_dev->watch_event)
> +		goto out;
> +
>  	mapping_size = rbd_dev->mapping.size;
>  	if (rbd_dev->image_format == 1)
>  		ret = rbd_dev_v1_header_info(rbd_dev);
> @@ -3340,7 +3343,6 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>  	/* If it's a mapped snapshot, validate its EXISTS flag */
>  
>  	rbd_exists_validate(rbd_dev);
> -	up_write(&rbd_dev->header_rwsem);
>  
>  	if (mapping_size != rbd_dev->mapping.size) {
>  		sector_t size;
> @@ -3350,7 +3352,8 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>  		set_capacity(rbd_dev->disk, size);
>  		revalidate_disk(rbd_dev->disk);
>  	}
> -
> +out:
> +	up_write(&rbd_dev->header_rwsem);
>  	return ret;
>  }
>  
> @@ -5159,10 +5162,26 @@ static ssize_t rbd_remove(struct bus_type *bus,
>  	if (ret < 0 || already)
>  		return ret;
>  
> +	/*
> +	 * take header semaphore while destroying the device and
> +	 * canceling the watch so that device destruction will
> +	 * not race with device updates from the watch callback
> +	 */
> +	down_write(&rbd_dev->header_rwsem);
>  	rbd_bus_del_dev(rbd_dev);
>  	ret = rbd_dev_header_watch_sync(rbd_dev, false);
> +	up_write(&rbd_dev->header_rwsem);
> +
>  	if (ret)
>  		rbd_warn(rbd_dev, "failed to cancel watch event (%d)\n", ret);
> +
> +	/*
> +	 * flush remaing watch callbacks - these don't update anything
> +	 * anymore since rbd_dev->watch_event is NULL, but it avoids
> +	 * the watch callback using a freed rbd_dev
> +	 */
> +	dout("%s: flushing notifies", __func__);
> +	ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc);
>  	rbd_dev_image_release(rbd_dev);
>  	module_put(THIS_MODULE);
>  
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josh Durgin Aug. 29, 2013, 6:33 p.m. UTC | #3
On 08/29/2013 08:21 AM, Alex Elder wrote:
> On 08/29/2013 01:24 AM, Josh Durgin wrote:
>> Removing a device deallocates the disk, unschedules the watch, and
>> finally cleans up the rbd_dev structure. rbd_dev_refresh(), called
>> from the watch callback, updates the disk size and rbd_dev
>> structure. With no locking between them, rbd_dev_refresh() may use the
>> device or rbd_dev after they've been freed.
>>
>> To fix this, use the header_rwsem to protect all the work
>> rbd_dev_refresh() does, and take it exclusively in rbd_remove() where
>> the block device is released and the watch is canceled.
>
> It seems to me this was a little bit of a tricky area.  As I recall
> there was a problem related to lock inversion, holding the semaphore
> while calling revalidate_disk().  (I'm sorry, this may be wrong and
> I can't dig much into it further at the moment.)  In any case, make
> sure you test with lockdep enabled and look for splats to the console.

You're right, there was a lock inversion with the header_rwsem and
bdev->bd_mutex. I'll switch to using a new mutex to avoid this.

> Also, there was once a function rbd_update_mapping_size(),
> which encapsulated both updating the size field and then
> calling set_capacity() immediately thereafter.  When images
> were refreshed (in rbd_dev_vX_refresh()), these updates would
> be made by calling rbd_update_mapping_size().  That made good
> sense, but after this commit:
>      00a653e216  rbd: update capacity in rbd_dev_refresh()
> the function became pretty trivial and it got removed.
>
> I think due to the lockdep thing, revalidate_disk() was
> later called outside the protection of the mutex.  It may
> be necessary to at least move those size updates back into
> the refresh functions to resolve the lockdep problem.
>
> Anyway, having a function that encapsulates these size changes
> like before might be nice, but it's not that important.  If
> there really is a problem with calling revalidate_disk() with
> the mutex held, then it's not all done in a single unit anyway.

Good idea.

>> rbd_bus_del_dev() ends up releasing the block device, so no requests
>> to the device remain after this. This makes all the work in
>> rbd_dev_refresh() unnecessary, as well as race-prone, so skip it if
>> the watch has been canceled.
>
> This makes sense, but a brief comment explaining why you're
> skipping it at the point in the code would be helpful.

I'll add one. Thanks!

Josh

>> Finally, flush the osd client's notify queue before deallocating the
>> rbd dev, so that any callbacks remaining can read rbd_dev->watch_event
>> safely. No more notifies can enter the queue at this point since the
>> watch has been canceled.
>>
>> Fixes: http://tracker.ceph.com/issues/5636
>> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
>
> Please verify there's no lockdep reports if you haven't already.
> And consider my other comments, but barring lockdep issues:
>
> Reviewed-by: Alex Elder <elder@linaro.org>
>
>> ---
>>   drivers/block/rbd.c |   25 ++++++++++++++++++++++---
>>   1 files changed, 22 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>> index fef3687..63e1590 100644
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -3327,10 +3327,13 @@ static void rbd_exists_validate(struct rbd_device *rbd_dev)
>>   static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>>   {
>>   	u64 mapping_size;
>> -	int ret;
>> +	int ret = 0;
>>
>>   	rbd_assert(rbd_image_format_valid(rbd_dev->image_format));
>>   	down_write(&rbd_dev->header_rwsem);
>> +	if (!rbd_dev->watch_event)
>> +		goto out;
>> +
>>   	mapping_size = rbd_dev->mapping.size;
>>   	if (rbd_dev->image_format == 1)
>>   		ret = rbd_dev_v1_header_info(rbd_dev);
>> @@ -3340,7 +3343,6 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>>   	/* If it's a mapped snapshot, validate its EXISTS flag */
>>
>>   	rbd_exists_validate(rbd_dev);
>> -	up_write(&rbd_dev->header_rwsem);
>>
>>   	if (mapping_size != rbd_dev->mapping.size) {
>>   		sector_t size;
>> @@ -3350,7 +3352,8 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>>   		set_capacity(rbd_dev->disk, size);
>>   		revalidate_disk(rbd_dev->disk);
>>   	}
>> -
>> +out:
>> +	up_write(&rbd_dev->header_rwsem);
>>   	return ret;
>>   }
>>
>> @@ -5159,10 +5162,26 @@ static ssize_t rbd_remove(struct bus_type *bus,
>>   	if (ret < 0 || already)
>>   		return ret;
>>
>> +	/*
>> +	 * take header semaphore while destroying the device and
>> +	 * canceling the watch so that device destruction will
>> +	 * not race with device updates from the watch callback
>> +	 */
>> +	down_write(&rbd_dev->header_rwsem);
>>   	rbd_bus_del_dev(rbd_dev);
>>   	ret = rbd_dev_header_watch_sync(rbd_dev, false);
>> +	up_write(&rbd_dev->header_rwsem);
>> +
>>   	if (ret)
>>   		rbd_warn(rbd_dev, "failed to cancel watch event (%d)\n", ret);
>> +
>> +	/*
>> +	 * flush remaing watch callbacks - these don't update anything
>> +	 * anymore since rbd_dev->watch_event is NULL, but it avoids
>> +	 * the watch callback using a freed rbd_dev
>> +	 */
>> +	dout("%s: flushing notifies", __func__);
>> +	ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc);
>>   	rbd_dev_image_release(rbd_dev);
>>   	module_put(THIS_MODULE);
>>
>>
>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josh Durgin Aug. 29, 2013, 6:36 p.m. UTC | #4
On 08/29/2013 07:46 AM, Sage Weil wrote:
> On Wed, 28 Aug 2013, Josh Durgin wrote:
>> Removing a device deallocates the disk, unschedules the watch, and
>> finally cleans up the rbd_dev structure. rbd_dev_refresh(), called
>> from the watch callback, updates the disk size and rbd_dev
>> structure. With no locking between them, rbd_dev_refresh() may use the
>> device or rbd_dev after they've been freed.
>>
>> To fix this, use the header_rwsem to protect all the work
>> rbd_dev_refresh() does, and take it exclusively in rbd_remove() where
>> the block device is released and the watch is canceled.
>> rbd_bus_del_dev() ends up releasing the block device, so no requests
>> to the device remain after this. This makes all the work in
>> rbd_dev_refresh() unnecessary, as well as race-prone, so skip it if
>> the watch has been canceled.
>>
>> Finally, flush the osd client's notify queue before deallocating the
>> rbd dev, so that any callbacks remaining can read rbd_dev->watch_event
>> safely. No more notifies can enter the queue at this point since the
>> watch has been canceled.
>>
>> Fixes: http://tracker.ceph.com/issues/5636
>> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
>> ---
>>   drivers/block/rbd.c |   25 ++++++++++++++++++++++---
>>   1 files changed, 22 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>> index fef3687..63e1590 100644
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -3327,10 +3327,13 @@ static void rbd_exists_validate(struct rbd_device *rbd_dev)
>>   static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>>   {
>>   	u64 mapping_size;
>> -	int ret;
>> +	int ret = 0;
>>
>>   	rbd_assert(rbd_image_format_valid(rbd_dev->image_format));
>>   	down_write(&rbd_dev->header_rwsem);
>> +	if (!rbd_dev->watch_event)
>> +		goto out;
>> +
>>   	mapping_size = rbd_dev->mapping.size;
>>   	if (rbd_dev->image_format == 1)
>>   		ret = rbd_dev_v1_header_info(rbd_dev);
>> @@ -3340,7 +3343,6 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>>   	/* If it's a mapped snapshot, validate its EXISTS flag */
>>
>>   	rbd_exists_validate(rbd_dev);
>> -	up_write(&rbd_dev->header_rwsem);
>>
>>   	if (mapping_size != rbd_dev->mapping.size) {
>>   		sector_t size;
>> @@ -3350,7 +3352,8 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>>   		set_capacity(rbd_dev->disk, size);
>>   		revalidate_disk(rbd_dev->disk);
>>   	}
>> -
>> +out:
>> +	up_write(&rbd_dev->header_rwsem);
>>   	return ret;
>>   }
>>
>> @@ -5159,10 +5162,26 @@ static ssize_t rbd_remove(struct bus_type *bus,
>>   	if (ret < 0 || already)
>>   		return ret;
>>
>> +	/*
>> +	 * take header semaphore while destroying the device and
>> +	 * canceling the watch so that device destruction will
>> +	 * not race with device updates from the watch callback
>> +	 */
>> +	down_write(&rbd_dev->header_rwsem);
>>   	rbd_bus_del_dev(rbd_dev);
>>   	ret = rbd_dev_header_watch_sync(rbd_dev, false);
>> +	up_write(&rbd_dev->header_rwsem);
>
> If I'm reading thsi right, we're holding the rwsem exclusively, and
> blocking the workqueue, for the duration of the osd request to remove the
> watch.  It worries me to block up the workqueue for that long (possibly
> for other images).  I don't think it will deadlock, but if hte cluster is
> unhealthy this one request could take a while.
>
> What we if we just add a separate flag "shutting down" and use that
> instead of the rbd_dev->watch_event?  Then we can up_write prior to the
> watch removal call...

Due to the lock inversion Alex mentioned, I'll use a separate mutex and
flag. Since we're draining the notify queue after this, only
rbd_bus_del_dev() needs to be protected from running at the same time -
canceling the watch doesn't matter for correctness.

Josh

> sage
>
>
>> +
>>   	if (ret)
>>   		rbd_warn(rbd_dev, "failed to cancel watch event (%d)\n", ret);
>> +
>> +	/*
>> +	 * flush remaing watch callbacks - these don't update anything
>> +	 * anymore since rbd_dev->watch_event is NULL, but it avoids
>> +	 * the watch callback using a freed rbd_dev
>> +	 */
>> +	dout("%s: flushing notifies", __func__);
>> +	ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc);
>>   	rbd_dev_image_release(rbd_dev);
>>   	module_put(THIS_MODULE);
>>
>> --
>> 1.7.2.5
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index fef3687..63e1590 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3327,10 +3327,13 @@  static void rbd_exists_validate(struct rbd_device *rbd_dev)
 static int rbd_dev_refresh(struct rbd_device *rbd_dev)
 {
 	u64 mapping_size;
-	int ret;
+	int ret = 0;
 
 	rbd_assert(rbd_image_format_valid(rbd_dev->image_format));
 	down_write(&rbd_dev->header_rwsem);
+	if (!rbd_dev->watch_event)
+		goto out;
+
 	mapping_size = rbd_dev->mapping.size;
 	if (rbd_dev->image_format == 1)
 		ret = rbd_dev_v1_header_info(rbd_dev);
@@ -3340,7 +3343,6 @@  static int rbd_dev_refresh(struct rbd_device *rbd_dev)
 	/* If it's a mapped snapshot, validate its EXISTS flag */
 
 	rbd_exists_validate(rbd_dev);
-	up_write(&rbd_dev->header_rwsem);
 
 	if (mapping_size != rbd_dev->mapping.size) {
 		sector_t size;
@@ -3350,7 +3352,8 @@  static int rbd_dev_refresh(struct rbd_device *rbd_dev)
 		set_capacity(rbd_dev->disk, size);
 		revalidate_disk(rbd_dev->disk);
 	}
-
+out:
+	up_write(&rbd_dev->header_rwsem);
 	return ret;
 }
 
@@ -5159,10 +5162,26 @@  static ssize_t rbd_remove(struct bus_type *bus,
 	if (ret < 0 || already)
 		return ret;
 
+	/*
+	 * take header semaphore while destroying the device and
+	 * canceling the watch so that device destruction will
+	 * not race with device updates from the watch callback
+	 */
+	down_write(&rbd_dev->header_rwsem);
 	rbd_bus_del_dev(rbd_dev);
 	ret = rbd_dev_header_watch_sync(rbd_dev, false);
+	up_write(&rbd_dev->header_rwsem);
+
 	if (ret)
 		rbd_warn(rbd_dev, "failed to cancel watch event (%d)\n", ret);
+
+	/*
+	 * flush remaing watch callbacks - these don't update anything
+	 * anymore since rbd_dev->watch_event is NULL, but it avoids
+	 * the watch callback using a freed rbd_dev
+	 */
+	dout("%s: flushing notifies", __func__);
+	ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc);
 	rbd_dev_image_release(rbd_dev);
 	module_put(THIS_MODULE);