Btrfs: avoid deadlock with memory reclaim due to allocation of devices
diff mbox series

Message ID 20181213211725.14832-1-fdmanana@kernel.org
State New
Headers show
Series
  • Btrfs: avoid deadlock with memory reclaim due to allocation of devices
Related show

Commit Message

Filipe Manana Dec. 13, 2018, 9:17 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

Several places allocate a device while holding the device list mutex. This
can result in a deadlock if reclaim happens because the device, and its
flush bio, are allocated using GFP_KERNEL mode (by __alloc_device() which
is used by btrfs_alloc_device()). A transaction commit, which reclaim can
trigger, needs to lock the device list mutex in its critical section, done
at btrfs_update_commit_device_size().

Some of these places are device_list_add(), which ends up being called
through the device scan ioctl, and btrfs_close_one_device(), which ends up
being called through the device remove ioctl.

Since all the places that add elements to the list of resized devices (the
device grow and shrink functions) only lock the chunk mutex before adding
a device to the list, drop the need to acquire the device list mutex from
btrfs_update_commit_device_size(), which is the only other place that uses
this list and it already locks the chunk mutex.

Fixes: 78f2c9e6dbb14 ("btrfs: device add and remove: use GFP_KERNEL")
Fixes: e0ae999414238 ("btrfs: preallocate device flush bio")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/volumes.c | 8 ++------
 fs/btrfs/volumes.h | 1 +
 2 files changed, 3 insertions(+), 6 deletions(-)

Comments

Nikolay Borisov Dec. 14, 2018, 7:27 a.m. UTC | #1
On 13.12.18 г. 23:17 ч., fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Several places allocate a device while holding the device list mutex. This
> can result in a deadlock if reclaim happens because the device, and its
> flush bio, are allocated using GFP_KERNEL mode (by __alloc_device() which
> is used by btrfs_alloc_device()). A transaction commit, which reclaim can
> trigger, needs to lock the device list mutex in its critical section, done
> at btrfs_update_commit_device_size().
> 
> Some of these places are device_list_add(), which ends up being called
> through the device scan ioctl, and btrfs_close_one_device(), which ends up
> being called through the device remove ioctl.
> 
> Since all the places that add elements to the list of resized devices (the
> device grow and shrink functions) only lock the chunk mutex before adding
> a device to the list, drop the need to acquire the device list mutex from
> btrfs_update_commit_device_size(), which is the only other place that uses
> this list and it already locks the chunk mutex.
> 
> Fixes: 78f2c9e6dbb14 ("btrfs: device add and remove: use GFP_KERNEL")
> Fixes: e0ae999414238 ("btrfs: preallocate device flush bio")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>


Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Filipe Manana Jan. 8, 2019, 11:51 a.m. UTC | #2
On Thu, Dec 13, 2018 at 9:18 PM <fdmanana@kernel.org> wrote:
>
> From: Filipe Manana <fdmanana@suse.com>
>
> Several places allocate a device while holding the device list mutex. This
> can result in a deadlock if reclaim happens because the device, and its
> flush bio, are allocated using GFP_KERNEL mode (by __alloc_device() which
> is used by btrfs_alloc_device()). A transaction commit, which reclaim can
> trigger, needs to lock the device list mutex in its critical section, done
> at btrfs_update_commit_device_size().
>
> Some of these places are device_list_add(), which ends up being called
> through the device scan ioctl, and btrfs_close_one_device(), which ends up
> being called through the device remove ioctl.
>
> Since all the places that add elements to the list of resized devices (the
> device grow and shrink functions) only lock the chunk mutex before adding
> a device to the list, drop the need to acquire the device list mutex from
> btrfs_update_commit_device_size(), which is the only other place that uses
> this list and it already locks the chunk mutex.
>
> Fixes: 78f2c9e6dbb14 ("btrfs: device add and remove: use GFP_KERNEL")
> Fixes: e0ae999414238 ("btrfs: preallocate device flush bio")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Ping.

> ---
>  fs/btrfs/volumes.c | 8 ++------
>  fs/btrfs/volumes.h | 1 +
>  2 files changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index c872adfc939e..74c4ed29e36e 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -176,7 +176,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
>   * chunk_mutex
>   * -----------
>   * protects chunks, adding or removing during allocation, trim or when a new
> - * device is added/removed
> + * device is added/removed, and the list of resized devices at struct
> + * btrfs_fs_info::fs_devices::resized_devices
>   *
>   * cleaner_mutex
>   * -------------
> @@ -7298,10 +7299,6 @@ void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info)
>         struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>         struct btrfs_device *curr, *next;
>
> -       if (list_empty(&fs_devices->resized_devices))
> -               return;
> -
> -       mutex_lock(&fs_devices->device_list_mutex);
>         mutex_lock(&fs_info->chunk_mutex);
>         list_for_each_entry_safe(curr, next, &fs_devices->resized_devices,
>                                  resized_list) {
> @@ -7309,7 +7306,6 @@ void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info)
>                 curr->commit_total_bytes = curr->disk_total_bytes;
>         }
>         mutex_unlock(&fs_info->chunk_mutex);
> -       mutex_unlock(&fs_devices->device_list_mutex);
>  }
>
>  /* Must be invoked during the transaction commit */
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index aefce895e994..362574b9c37a 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -229,6 +229,7 @@ struct btrfs_fs_devices {
>         struct mutex device_list_mutex;
>         struct list_head devices;
>
> +       /* protected by struct btrfs_fs_info::chunk_mutex */
>         struct list_head resized_devices;
>         /* devices not currently being allocated */
>         struct list_head alloc_list;
> --
> 2.11.0
>
David Sterba Jan. 9, 2019, 6:26 p.m. UTC | #3
On Thu, Dec 13, 2018 at 09:17:25PM +0000, fdmanana@kernel.org wrote:
> -	if (list_empty(&fs_devices->resized_devices))
> -		return;
> -
> -	mutex_lock(&fs_devices->device_list_mutex);
>  	mutex_lock(&fs_info->chunk_mutex);
>  	list_for_each_entry_safe(curr, next, &fs_devices->resized_devices,
>  				 resized_list) {
> @@ -7309,7 +7306,6 @@ void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info)
>  		curr->commit_total_bytes = curr->disk_total_bytes;

I'm not sure about removing the device_list_mutex that's said to protect
the commit_total_bytes (comment in struct btrfs_device).

Otherwise the logic is ok, the double lock could happen as you describe.

btrfs_update_commit_device_size is called from btrfs_commit_transaction,
at the same time as commit_bytes_used. The latter is handled in a
similar way in btrfs_update_commit_device_bytes_used, but does not take
the device_list_mutex.

commit_total_bytes is checked several times (eg. in write_dev_supers) to
see if writing the superblock copy is still within the device range.

So, without the protected change, it's theoretically possible that a
stale value is used for the test and the superblock is either written
though it should not, and the other way around.

This would require a resize racing at the time of the check. Grow and
shrink seem to take chunk_mutex while adjusting all the total/size
values, but it's not actually easy to follow as sometimes there are
helpers like btrfs_device_set_total_bytes used and sometimes it's direct
access.

That the device_list_mutex can be safely dropped probably follows from
the simple fact that btrfs_update_commit_device_bytes_used is called
before write_dev_supers in the same context.

But this sounds too simple, given that there are locks taken and
released and btrfs_write_and_wait_transaction called between.

Referencing this code:

2201         btrfs_update_commit_device_size(fs_info);
2202         btrfs_update_commit_device_bytes_used(cur_trans);
2203
2204         clear_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags);
2205         clear_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags);
2206
2207         btrfs_trans_release_chunk_metadata(trans);
2208
2209         spin_lock(&fs_info->trans_lock);
2210         cur_trans->state = TRANS_STATE_UNBLOCKED;
2211         fs_info->running_transaction = NULL;
2212         spin_unlock(&fs_info->trans_lock);
2213         mutex_unlock(&fs_info->reloc_mutex);
2214
2215         wake_up(&fs_info->transaction_wait);
2216
2217         ret = btrfs_write_and_wait_transaction(trans);
2218         if (ret) {
2219                 btrfs_handle_fs_error(fs_info, ret,
2220                                       "Error while writing out transaction");
2221                 mutex_unlock(&fs_info->tree_log_mutex);
2222                 goto scrub_continue;
2223         }
2224
2225         ret = write_all_supers(fs_info, 0);
Filipe Manana Jan. 9, 2019, 7:48 p.m. UTC | #4
On Wed, Jan 9, 2019 at 6:27 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Thu, Dec 13, 2018 at 09:17:25PM +0000, fdmanana@kernel.org wrote:
> > -     if (list_empty(&fs_devices->resized_devices))
> > -             return;
> > -
> > -     mutex_lock(&fs_devices->device_list_mutex);
> >       mutex_lock(&fs_info->chunk_mutex);
> >       list_for_each_entry_safe(curr, next, &fs_devices->resized_devices,
> >                                resized_list) {
> > @@ -7309,7 +7306,6 @@ void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info)
> >               curr->commit_total_bytes = curr->disk_total_bytes;
>
> I'm not sure about removing the device_list_mutex that's said to protect
> the commit_total_bytes (comment in struct btrfs_device).
>
> Otherwise the logic is ok, the double lock could happen as you describe.
>
> btrfs_update_commit_device_size is called from btrfs_commit_transaction,
> at the same time as commit_bytes_used. The latter is handled in a
> similar way in btrfs_update_commit_device_bytes_used, but does not take
> the device_list_mutex.
>
> commit_total_bytes is checked several times (eg. in write_dev_supers) to
> see if writing the superblock copy is still within the device range.
>
> So, without the protected change, it's theoretically possible that a
> stale value is used for the test and the superblock is either written
> though it should not, and the other way around.
>
> This would require a resize racing at the time of the check. Grow and
> shrink seem to take chunk_mutex while adjusting all the total/size
> values, but it's not actually easy to follow as sometimes there are
> helpers like btrfs_device_set_total_bytes used and sometimes it's direct
> access.
>
> That the device_list_mutex can be safely dropped probably follows from
> the simple fact that btrfs_update_commit_device_bytes_used is called
> before write_dev_supers in the same context.
>
> But this sounds too simple, given that there are locks taken and
> released and btrfs_write_and_wait_transaction called between.

Regardless of all that (and honestly I haven't double checked and
skimmed only through what you said),
there's a more important aspect I missed before: write_all_supers()
takes (and needs) the device list mutex,
therefore this change won't fix the deadlock because of that.

thanks

>
> Referencing this code:
>
> 2201         btrfs_update_commit_device_size(fs_info);
> 2202         btrfs_update_commit_device_bytes_used(cur_trans);
> 2203
> 2204         clear_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags);
> 2205         clear_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags);
> 2206
> 2207         btrfs_trans_release_chunk_metadata(trans);
> 2208
> 2209         spin_lock(&fs_info->trans_lock);
> 2210         cur_trans->state = TRANS_STATE_UNBLOCKED;
> 2211         fs_info->running_transaction = NULL;
> 2212         spin_unlock(&fs_info->trans_lock);
> 2213         mutex_unlock(&fs_info->reloc_mutex);
> 2214
> 2215         wake_up(&fs_info->transaction_wait);
> 2216
> 2217         ret = btrfs_write_and_wait_transaction(trans);
> 2218         if (ret) {
> 2219                 btrfs_handle_fs_error(fs_info, ret,
> 2220                                       "Error while writing out transaction");
> 2221                 mutex_unlock(&fs_info->tree_log_mutex);
> 2222                 goto scrub_continue;
> 2223         }
> 2224
> 2225         ret = write_all_supers(fs_info, 0);
Nikolay Borisov Jan. 10, 2019, 7:03 a.m. UTC | #5
On 9.01.19 г. 20:26 ч., David Sterba wrote:
> On Thu, Dec 13, 2018 at 09:17:25PM +0000, fdmanana@kernel.org wrote:
>> -	if (list_empty(&fs_devices->resized_devices))
>> -		return;
>> -
>> -	mutex_lock(&fs_devices->device_list_mutex);
>>  	mutex_lock(&fs_info->chunk_mutex);
>>  	list_for_each_entry_safe(curr, next, &fs_devices->resized_devices,
>>  				 resized_list) {
>> @@ -7309,7 +7306,6 @@ void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info)
>>  		curr->commit_total_bytes = curr->disk_total_bytes;
> 
> I'm not sure about removing the device_list_mutex that's said to protect
> the commit_total_bytes (comment in struct btrfs_device).
> 
> Otherwise the logic is ok, the double lock could happen as you describe.
> 
> btrfs_update_commit_device_size is called from btrfs_commit_transaction,
> at the same time as commit_bytes_used. The latter is handled in a
> similar way in btrfs_update_commit_device_bytes_used, but does not take
> the device_list_mutex.
> 
> commit_total_bytes is checked several times (eg. in write_dev_supers) to
> see if writing the superblock copy is still within the device range.
> 
> So, without the protected change, it's theoretically possible that a
> stale value is used for the test and the superblock is either written
> though it should not, and the other way around.

But can it really, btrfs_[grow|shrink]_device happen under transaction
and their modification of the device_disk_total_bytes (the value
assigned to commit_total_bytes) always happen under chunk_mutex. Also
the updates to both values are really owned by the transaction, so even
if grow/shrink modify those value they will queue those changes in a new
transaction, hence write_dev_super will see consistent value in the
current transaction.


I have a patch from Jeff (which is part of a bigger work) that actually
unifies the resize/device size change lists into a single single and
makes the code a bit easier to grok, nevertheless the above explanation
is still correct even without this patch.


> 
> This would require a resize racing at the time of the check. Grow and
> shrink seem to take chunk_mutex while adjusting all the total/size
> values, but it's not actually easy to follow as sometimes there are
> helpers like btrfs_device_set_total_bytes used and sometimes it's direct
> access.
> 
> That the device_list_mutex can be safely dropped probably follows from
> the simple fact that btrfs_update_commit_device_bytes_used is called
> before write_dev_supers in the same context.
> 
> But this sounds too simple, given that there are locks taken and
> released and btrfs_write_and_wait_transaction called between.
> 
> Referencing this code:
> 
> 2201         btrfs_update_commit_device_size(fs_info);
> 2202         btrfs_update_commit_device_bytes_used(cur_trans);
> 2203
> 2204         clear_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags);
> 2205         clear_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags);
> 2206
> 2207         btrfs_trans_release_chunk_metadata(trans);
> 2208
> 2209         spin_lock(&fs_info->trans_lock);
> 2210         cur_trans->state = TRANS_STATE_UNBLOCKED;
> 2211         fs_info->running_transaction = NULL;
> 2212         spin_unlock(&fs_info->trans_lock);
> 2213         mutex_unlock(&fs_info->reloc_mutex);
> 2214
> 2215         wake_up(&fs_info->transaction_wait);
> 2216
> 2217         ret = btrfs_write_and_wait_transaction(trans);
> 2218         if (ret) {
> 2219                 btrfs_handle_fs_error(fs_info, ret,
> 2220                                       "Error while writing out transaction");
> 2221                 mutex_unlock(&fs_info->tree_log_mutex);
> 2222                 goto scrub_continue;
> 2223         }
> 2224
> 2225         ret = write_all_supers(fs_info, 0);
>
Anand Jain Jan. 10, 2019, 7:32 a.m. UTC | #6
On 01/10/2019 03:48 AM, Filipe Manana wrote:
> On Wed, Jan 9, 2019 at 6:27 PM David Sterba <dsterba@suse.cz> wrote:
>>
>> On Thu, Dec 13, 2018 at 09:17:25PM +0000, fdmanana@kernel.org wrote:
>>> -     if (list_empty(&fs_devices->resized_devices))
>>> -             return;
>>> -
>>> -     mutex_lock(&fs_devices->device_list_mutex);
>>>        mutex_lock(&fs_info->chunk_mutex);
>>>        list_for_each_entry_safe(curr, next, &fs_devices->resized_devices,
>>>                                 resized_list) {
>>> @@ -7309,7 +7306,6 @@ void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info)
>>>                curr->commit_total_bytes = curr->disk_total_bytes;
>>
>> I'm not sure about removing the device_list_mutex that's said to protect
>> the commit_total_bytes (comment in struct btrfs_device).
>>
>> Otherwise the logic is ok, the double lock could happen as you describe.
>>
>> btrfs_update_commit_device_size is called from btrfs_commit_transaction,
>> at the same time as commit_bytes_used. The latter is handled in a
>> similar way in btrfs_update_commit_device_bytes_used, but does not take
>> the device_list_mutex.
>>
>> commit_total_bytes is checked several times (eg. in write_dev_supers) to
>> see if writing the superblock copy is still within the device range.
>>
>> So, without the protected change, it's theoretically possible that a
>> stale value is used for the test and the superblock is either written
>> though it should not, and the other way around.
>>
>> This would require a resize racing at the time of the check. Grow and
>> shrink seem to take chunk_mutex while adjusting all the total/size
>> values, but it's not actually easy to follow as sometimes there are
>> helpers like btrfs_device_set_total_bytes used and sometimes it's direct
>> access.
>>
>> That the device_list_mutex can be safely dropped probably follows from
>> the simple fact that btrfs_update_commit_device_bytes_used is called
>> before write_dev_supers in the same context.
>>
>> But this sounds too simple, given that there are locks taken and
>> released and btrfs_write_and_wait_transaction called between.
> 
> Regardless of all that (and honestly I haven't double checked and
> skimmed only through what you said),
> there's a more important aspect I missed before: write_all_supers()
> takes (and needs) the device list mutex,
> therefore this change won't fix the deadlock because of that.

  Though this won't fix the problem, this patch is still ok
  as its drops the unnecessary device_list_mutex in
  btrfs_update_commit_device_size(). So for that if the change log
  updated,
    Reviewed-by: Anand Jain <anand.jain@oracle.com>

  To address the actual problem.

  Functions which call btrfs_alloc_device() are..
     device_list_add()
     close_fs_devices()
     btrfs_init_dev_replace_tgtdev()
     clone_fs_devices()
     btrfs_init_new_device()

  Now among the above following holds the device_list_mutex when calling
  btrfs_alloc_device()
     device_list_add()
     close_fs_devices()
     clone_fs_devices()

  Now among above three, the lock at device_list_add() and
clone_fs_devices() can be ignored, because the reclaim or flush IO can't
take place on these FSID:devices as device_list_add() is called when FS
is not-mounted or in the mounting context, and clone_fs_devices() is
called when the SEED device is still a read-only FS.

  And so we have to only worry about close_fs_devices().

  close_fs_devices() - I didn't like the way it does, that is allocate a
new struct btrfs_device instead of just zero-ing the struct btrfs_device
during unmount. I guess it was done to avoid RCU warning (not sure) and
if it isn't real issues I am happy to see btrfs_device_alloc() is
dropped in close_fs_devices(). Which means it also fixes the problem
that - you need more memory to unmount an ideal FS.

Thanks.

> thanks
> 
>>
>> Referencing this code:
>>
>> 2201         btrfs_update_commit_device_size(fs_info);
>> 2202         btrfs_update_commit_device_bytes_used(cur_trans);
>> 2203
>> 2204         clear_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags);
>> 2205         clear_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags);
>> 2206
>> 2207         btrfs_trans_release_chunk_metadata(trans);
>> 2208
>> 2209         spin_lock(&fs_info->trans_lock);
>> 2210         cur_trans->state = TRANS_STATE_UNBLOCKED;
>> 2211         fs_info->running_transaction = NULL;
>> 2212         spin_unlock(&fs_info->trans_lock);
>> 2213         mutex_unlock(&fs_info->reloc_mutex);
>> 2214
>> 2215         wake_up(&fs_info->transaction_wait);
>> 2216
>> 2217         ret = btrfs_write_and_wait_transaction(trans);
>> 2218         if (ret) {
>> 2219                 btrfs_handle_fs_error(fs_info, ret,
>> 2220                                       "Error while writing out transaction");
>> 2221                 mutex_unlock(&fs_info->tree_log_mutex);
>> 2222                 goto scrub_continue;
>> 2223         }
>> 2224
>> 2225         ret = write_all_supers(fs_info, 0);

Patch
diff mbox series

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c872adfc939e..74c4ed29e36e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -176,7 +176,8 @@  static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
  * chunk_mutex
  * -----------
  * protects chunks, adding or removing during allocation, trim or when a new
- * device is added/removed
+ * device is added/removed, and the list of resized devices at struct
+ * btrfs_fs_info::fs_devices::resized_devices
  *
  * cleaner_mutex
  * -------------
@@ -7298,10 +7299,6 @@  void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info)
 	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
 	struct btrfs_device *curr, *next;
 
-	if (list_empty(&fs_devices->resized_devices))
-		return;
-
-	mutex_lock(&fs_devices->device_list_mutex);
 	mutex_lock(&fs_info->chunk_mutex);
 	list_for_each_entry_safe(curr, next, &fs_devices->resized_devices,
 				 resized_list) {
@@ -7309,7 +7306,6 @@  void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info)
 		curr->commit_total_bytes = curr->disk_total_bytes;
 	}
 	mutex_unlock(&fs_info->chunk_mutex);
-	mutex_unlock(&fs_devices->device_list_mutex);
 }
 
 /* Must be invoked during the transaction commit */
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index aefce895e994..362574b9c37a 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -229,6 +229,7 @@  struct btrfs_fs_devices {
 	struct mutex device_list_mutex;
 	struct list_head devices;
 
+	/* protected by struct btrfs_fs_info::chunk_mutex */
 	struct list_head resized_devices;
 	/* devices not currently being allocated */
 	struct list_head alloc_list;