Message ID | 20181213211725.14832-1-fdmanana@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Btrfs: avoid deadlock with memory reclaim due to allocation of devices | expand |
On 13.12.18 г. 23:17 ч., fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > Several places allocate a device while holding the device list mutex. This > can result in a deadlock if reclaim happens because the device, and its > flush bio, are allocated using GFP_KERNEL mode (by __alloc_device() which > is used by btrfs_alloc_device()). A transaction commit, which reclaim can > trigger, needs to lock the device list mutex in its critical section, done > at btrfs_update_commit_device_size(). > > Some of these places are device_list_add(), which ends up being called > through the device scan ioctl, and btrfs_close_one_device(), which ends up > being called through the device remove ioctl. > > Since all the places that add elements to the list of resized devices (the > device grow and shrink functions) only lock the chunk mutex before adding > a device to the list, drop the need to acquire the device list mutex from > btrfs_update_commit_device_size(), which is the only other place that uses > this list and it already locks the chunk mutex. > > Fixes: 78f2c9e6dbb14 ("btrfs: device add and remove: use GFP_KERNEL") > Fixes: e0ae999414238 ("btrfs: preallocate device flush bio") > Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com>
On Thu, Dec 13, 2018 at 9:18 PM <fdmanana@kernel.org> wrote: > > From: Filipe Manana <fdmanana@suse.com> > > Several places allocate a device while holding the device list mutex. This > can result in a deadlock if reclaim happens because the device, and its > flush bio, are allocated using GFP_KERNEL mode (by __alloc_device() which > is used by btrfs_alloc_device()). A transaction commit, which reclaim can > trigger, needs to lock the device list mutex in its critical section, done > at btrfs_update_commit_device_size(). > > Some of these places are device_list_add(), which ends up being called > through the device scan ioctl, and btrfs_close_one_device(), which ends up > being called through the device remove ioctl. > > Since all the places that add elements to the list of resized devices (the > device grow and shrink functions) only lock the chunk mutex before adding > a device to the list, drop the need to acquire the device list mutex from > btrfs_update_commit_device_size(), which is the only other place that uses > this list and it already locks the chunk mutex. > > Fixes: 78f2c9e6dbb14 ("btrfs: device add and remove: use GFP_KERNEL") > Fixes: e0ae999414238 ("btrfs: preallocate device flush bio") > Signed-off-by: Filipe Manana <fdmanana@suse.com> Ping. > --- > fs/btrfs/volumes.c | 8 ++------ > fs/btrfs/volumes.h | 1 + > 2 files changed, 3 insertions(+), 6 deletions(-) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index c872adfc939e..74c4ed29e36e 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -176,7 +176,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, > * chunk_mutex > * ----------- > * protects chunks, adding or removing during allocation, trim or when a new > - * device is added/removed > + * device is added/removed, and the list of resized devices at struct > + * btrfs_fs_info::fs_devices::resized_devices > * > * cleaner_mutex > * ------------- > @@ -7298,10 +7299,6 @@ void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info) > struct btrfs_fs_devices *fs_devices = fs_info->fs_devices; > struct btrfs_device *curr, *next; > > - if (list_empty(&fs_devices->resized_devices)) > - return; > - > - mutex_lock(&fs_devices->device_list_mutex); > mutex_lock(&fs_info->chunk_mutex); > list_for_each_entry_safe(curr, next, &fs_devices->resized_devices, > resized_list) { > @@ -7309,7 +7306,6 @@ void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info) > curr->commit_total_bytes = curr->disk_total_bytes; > } > mutex_unlock(&fs_info->chunk_mutex); > - mutex_unlock(&fs_devices->device_list_mutex); > } > > /* Must be invoked during the transaction commit */ > diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h > index aefce895e994..362574b9c37a 100644 > --- a/fs/btrfs/volumes.h > +++ b/fs/btrfs/volumes.h > @@ -229,6 +229,7 @@ struct btrfs_fs_devices { > struct mutex device_list_mutex; > struct list_head devices; > > + /* protected by struct btrfs_fs_info::chunk_mutex */ > struct list_head resized_devices; > /* devices not currently being allocated */ > struct list_head alloc_list; > -- > 2.11.0 >
On Thu, Dec 13, 2018 at 09:17:25PM +0000, fdmanana@kernel.org wrote: > - if (list_empty(&fs_devices->resized_devices)) > - return; > - > - mutex_lock(&fs_devices->device_list_mutex); > mutex_lock(&fs_info->chunk_mutex); > list_for_each_entry_safe(curr, next, &fs_devices->resized_devices, > resized_list) { > @@ -7309,7 +7306,6 @@ void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info) > curr->commit_total_bytes = curr->disk_total_bytes; I'm not sure about removing the device_list_mutex that's said to protect the commit_total_bytes (comment in struct btrfs_device). Otherwise the logic is ok, the double lock could happen as you describe. btrfs_update_commit_device_size is called from btrfs_commit_transaction, at the same time as commit_bytes_used. The latter is handled in a similar way in btrfs_update_commit_device_bytes_used, but does not take the device_list_mutex. commit_total_bytes is checked several times (eg. in write_dev_supers) to see if writing the superblock copy is still within the device range. So, without the protected change, it's theoretically possible that a stale value is used for the test and the superblock is either written though it should not, and the other way around. This would require a resize racing at the time of the check. Grow and shrink seem to take chunk_mutex while adjusting all the total/size values, but it's not actually easy to follow as sometimes there are helpers like btrfs_device_set_total_bytes used and sometimes it's direct access. That the device_list_mutex can be safely dropped probably follows from the simple fact that btrfs_update_commit_device_bytes_used is called before write_dev_supers in the same context. But this sounds too simple, given that there are locks taken and released and btrfs_write_and_wait_transaction called between. Referencing this code: 2201 btrfs_update_commit_device_size(fs_info); 2202 btrfs_update_commit_device_bytes_used(cur_trans); 2203 2204 clear_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags); 2205 clear_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags); 2206 2207 btrfs_trans_release_chunk_metadata(trans); 2208 2209 spin_lock(&fs_info->trans_lock); 2210 cur_trans->state = TRANS_STATE_UNBLOCKED; 2211 fs_info->running_transaction = NULL; 2212 spin_unlock(&fs_info->trans_lock); 2213 mutex_unlock(&fs_info->reloc_mutex); 2214 2215 wake_up(&fs_info->transaction_wait); 2216 2217 ret = btrfs_write_and_wait_transaction(trans); 2218 if (ret) { 2219 btrfs_handle_fs_error(fs_info, ret, 2220 "Error while writing out transaction"); 2221 mutex_unlock(&fs_info->tree_log_mutex); 2222 goto scrub_continue; 2223 } 2224 2225 ret = write_all_supers(fs_info, 0);
On Wed, Jan 9, 2019 at 6:27 PM David Sterba <dsterba@suse.cz> wrote: > > On Thu, Dec 13, 2018 at 09:17:25PM +0000, fdmanana@kernel.org wrote: > > - if (list_empty(&fs_devices->resized_devices)) > > - return; > > - > > - mutex_lock(&fs_devices->device_list_mutex); > > mutex_lock(&fs_info->chunk_mutex); > > list_for_each_entry_safe(curr, next, &fs_devices->resized_devices, > > resized_list) { > > @@ -7309,7 +7306,6 @@ void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info) > > curr->commit_total_bytes = curr->disk_total_bytes; > > I'm not sure about removing the device_list_mutex that's said to protect > the commit_total_bytes (comment in struct btrfs_device). > > Otherwise the logic is ok, the double lock could happen as you describe. > > btrfs_update_commit_device_size is called from btrfs_commit_transaction, > at the same time as commit_bytes_used. The latter is handled in a > similar way in btrfs_update_commit_device_bytes_used, but does not take > the device_list_mutex. > > commit_total_bytes is checked several times (eg. in write_dev_supers) to > see if writing the superblock copy is still within the device range. > > So, without the protected change, it's theoretically possible that a > stale value is used for the test and the superblock is either written > though it should not, and the other way around. > > This would require a resize racing at the time of the check. Grow and > shrink seem to take chunk_mutex while adjusting all the total/size > values, but it's not actually easy to follow as sometimes there are > helpers like btrfs_device_set_total_bytes used and sometimes it's direct > access. > > That the device_list_mutex can be safely dropped probably follows from > the simple fact that btrfs_update_commit_device_bytes_used is called > before write_dev_supers in the same context. > > But this sounds too simple, given that there are locks taken and > released and btrfs_write_and_wait_transaction called between. Regardless of all that (and honestly I haven't double checked and skimmed only through what you said), there's a more important aspect I missed before: write_all_supers() takes (and needs) the device list mutex, therefore this change won't fix the deadlock because of that. thanks > > Referencing this code: > > 2201 btrfs_update_commit_device_size(fs_info); > 2202 btrfs_update_commit_device_bytes_used(cur_trans); > 2203 > 2204 clear_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags); > 2205 clear_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags); > 2206 > 2207 btrfs_trans_release_chunk_metadata(trans); > 2208 > 2209 spin_lock(&fs_info->trans_lock); > 2210 cur_trans->state = TRANS_STATE_UNBLOCKED; > 2211 fs_info->running_transaction = NULL; > 2212 spin_unlock(&fs_info->trans_lock); > 2213 mutex_unlock(&fs_info->reloc_mutex); > 2214 > 2215 wake_up(&fs_info->transaction_wait); > 2216 > 2217 ret = btrfs_write_and_wait_transaction(trans); > 2218 if (ret) { > 2219 btrfs_handle_fs_error(fs_info, ret, > 2220 "Error while writing out transaction"); > 2221 mutex_unlock(&fs_info->tree_log_mutex); > 2222 goto scrub_continue; > 2223 } > 2224 > 2225 ret = write_all_supers(fs_info, 0);
On 9.01.19 г. 20:26 ч., David Sterba wrote: > On Thu, Dec 13, 2018 at 09:17:25PM +0000, fdmanana@kernel.org wrote: >> - if (list_empty(&fs_devices->resized_devices)) >> - return; >> - >> - mutex_lock(&fs_devices->device_list_mutex); >> mutex_lock(&fs_info->chunk_mutex); >> list_for_each_entry_safe(curr, next, &fs_devices->resized_devices, >> resized_list) { >> @@ -7309,7 +7306,6 @@ void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info) >> curr->commit_total_bytes = curr->disk_total_bytes; > > I'm not sure about removing the device_list_mutex that's said to protect > the commit_total_bytes (comment in struct btrfs_device). > > Otherwise the logic is ok, the double lock could happen as you describe. > > btrfs_update_commit_device_size is called from btrfs_commit_transaction, > at the same time as commit_bytes_used. The latter is handled in a > similar way in btrfs_update_commit_device_bytes_used, but does not take > the device_list_mutex. > > commit_total_bytes is checked several times (eg. in write_dev_supers) to > see if writing the superblock copy is still within the device range. > > So, without the protected change, it's theoretically possible that a > stale value is used for the test and the superblock is either written > though it should not, and the other way around. But can it really, btrfs_[grow|shrink]_device happen under transaction and their modification of the device_disk_total_bytes (the value assigned to commit_total_bytes) always happen under chunk_mutex. Also the updates to both values are really owned by the transaction, so even if grow/shrink modify those value they will queue those changes in a new transaction, hence write_dev_super will see consistent value in the current transaction. I have a patch from Jeff (which is part of a bigger work) that actually unifies the resize/device size change lists into a single single and makes the code a bit easier to grok, nevertheless the above explanation is still correct even without this patch. > > This would require a resize racing at the time of the check. Grow and > shrink seem to take chunk_mutex while adjusting all the total/size > values, but it's not actually easy to follow as sometimes there are > helpers like btrfs_device_set_total_bytes used and sometimes it's direct > access. > > That the device_list_mutex can be safely dropped probably follows from > the simple fact that btrfs_update_commit_device_bytes_used is called > before write_dev_supers in the same context. > > But this sounds too simple, given that there are locks taken and > released and btrfs_write_and_wait_transaction called between. > > Referencing this code: > > 2201 btrfs_update_commit_device_size(fs_info); > 2202 btrfs_update_commit_device_bytes_used(cur_trans); > 2203 > 2204 clear_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags); > 2205 clear_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags); > 2206 > 2207 btrfs_trans_release_chunk_metadata(trans); > 2208 > 2209 spin_lock(&fs_info->trans_lock); > 2210 cur_trans->state = TRANS_STATE_UNBLOCKED; > 2211 fs_info->running_transaction = NULL; > 2212 spin_unlock(&fs_info->trans_lock); > 2213 mutex_unlock(&fs_info->reloc_mutex); > 2214 > 2215 wake_up(&fs_info->transaction_wait); > 2216 > 2217 ret = btrfs_write_and_wait_transaction(trans); > 2218 if (ret) { > 2219 btrfs_handle_fs_error(fs_info, ret, > 2220 "Error while writing out transaction"); > 2221 mutex_unlock(&fs_info->tree_log_mutex); > 2222 goto scrub_continue; > 2223 } > 2224 > 2225 ret = write_all_supers(fs_info, 0); >
On 01/10/2019 03:48 AM, Filipe Manana wrote: > On Wed, Jan 9, 2019 at 6:27 PM David Sterba <dsterba@suse.cz> wrote: >> >> On Thu, Dec 13, 2018 at 09:17:25PM +0000, fdmanana@kernel.org wrote: >>> - if (list_empty(&fs_devices->resized_devices)) >>> - return; >>> - >>> - mutex_lock(&fs_devices->device_list_mutex); >>> mutex_lock(&fs_info->chunk_mutex); >>> list_for_each_entry_safe(curr, next, &fs_devices->resized_devices, >>> resized_list) { >>> @@ -7309,7 +7306,6 @@ void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info) >>> curr->commit_total_bytes = curr->disk_total_bytes; >> >> I'm not sure about removing the device_list_mutex that's said to protect >> the commit_total_bytes (comment in struct btrfs_device). >> >> Otherwise the logic is ok, the double lock could happen as you describe. >> >> btrfs_update_commit_device_size is called from btrfs_commit_transaction, >> at the same time as commit_bytes_used. The latter is handled in a >> similar way in btrfs_update_commit_device_bytes_used, but does not take >> the device_list_mutex. >> >> commit_total_bytes is checked several times (eg. in write_dev_supers) to >> see if writing the superblock copy is still within the device range. >> >> So, without the protected change, it's theoretically possible that a >> stale value is used for the test and the superblock is either written >> though it should not, and the other way around. >> >> This would require a resize racing at the time of the check. Grow and >> shrink seem to take chunk_mutex while adjusting all the total/size >> values, but it's not actually easy to follow as sometimes there are >> helpers like btrfs_device_set_total_bytes used and sometimes it's direct >> access. >> >> That the device_list_mutex can be safely dropped probably follows from >> the simple fact that btrfs_update_commit_device_bytes_used is called >> before write_dev_supers in the same context. >> >> But this sounds too simple, given that there are locks taken and >> released and btrfs_write_and_wait_transaction called between. > > Regardless of all that (and honestly I haven't double checked and > skimmed only through what you said), > there's a more important aspect I missed before: write_all_supers() > takes (and needs) the device list mutex, > therefore this change won't fix the deadlock because of that. Though this won't fix the problem, this patch is still ok as its drops the unnecessary device_list_mutex in btrfs_update_commit_device_size(). So for that if the change log updated, Reviewed-by: Anand Jain <anand.jain@oracle.com> To address the actual problem. Functions which call btrfs_alloc_device() are.. device_list_add() close_fs_devices() btrfs_init_dev_replace_tgtdev() clone_fs_devices() btrfs_init_new_device() Now among the above following holds the device_list_mutex when calling btrfs_alloc_device() device_list_add() close_fs_devices() clone_fs_devices() Now among above three, the lock at device_list_add() and clone_fs_devices() can be ignored, because the reclaim or flush IO can't take place on these FSID:devices as device_list_add() is called when FS is not-mounted or in the mounting context, and clone_fs_devices() is called when the SEED device is still a read-only FS. And so we have to only worry about close_fs_devices(). close_fs_devices() - I didn't like the way it does, that is allocate a new struct btrfs_device instead of just zero-ing the struct btrfs_device during unmount. I guess it was done to avoid RCU warning (not sure) and if it isn't real issues I am happy to see btrfs_device_alloc() is dropped in close_fs_devices(). Which means it also fixes the problem that - you need more memory to unmount an ideal FS. Thanks. > thanks > >> >> Referencing this code: >> >> 2201 btrfs_update_commit_device_size(fs_info); >> 2202 btrfs_update_commit_device_bytes_used(cur_trans); >> 2203 >> 2204 clear_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags); >> 2205 clear_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags); >> 2206 >> 2207 btrfs_trans_release_chunk_metadata(trans); >> 2208 >> 2209 spin_lock(&fs_info->trans_lock); >> 2210 cur_trans->state = TRANS_STATE_UNBLOCKED; >> 2211 fs_info->running_transaction = NULL; >> 2212 spin_unlock(&fs_info->trans_lock); >> 2213 mutex_unlock(&fs_info->reloc_mutex); >> 2214 >> 2215 wake_up(&fs_info->transaction_wait); >> 2216 >> 2217 ret = btrfs_write_and_wait_transaction(trans); >> 2218 if (ret) { >> 2219 btrfs_handle_fs_error(fs_info, ret, >> 2220 "Error while writing out transaction"); >> 2221 mutex_unlock(&fs_info->tree_log_mutex); >> 2222 goto scrub_continue; >> 2223 } >> 2224 >> 2225 ret = write_all_supers(fs_info, 0);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index c872adfc939e..74c4ed29e36e 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -176,7 +176,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, * chunk_mutex * ----------- * protects chunks, adding or removing during allocation, trim or when a new - * device is added/removed + * device is added/removed, and the list of resized devices at struct + * btrfs_fs_info::fs_devices::resized_devices * * cleaner_mutex * ------------- @@ -7298,10 +7299,6 @@ void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info) struct btrfs_fs_devices *fs_devices = fs_info->fs_devices; struct btrfs_device *curr, *next; - if (list_empty(&fs_devices->resized_devices)) - return; - - mutex_lock(&fs_devices->device_list_mutex); mutex_lock(&fs_info->chunk_mutex); list_for_each_entry_safe(curr, next, &fs_devices->resized_devices, resized_list) { @@ -7309,7 +7306,6 @@ void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info) curr->commit_total_bytes = curr->disk_total_bytes; } mutex_unlock(&fs_info->chunk_mutex); - mutex_unlock(&fs_devices->device_list_mutex); } /* Must be invoked during the transaction commit */ diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index aefce895e994..362574b9c37a 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -229,6 +229,7 @@ struct btrfs_fs_devices { struct mutex device_list_mutex; struct list_head devices; + /* protected by struct btrfs_fs_info::chunk_mutex */ struct list_head resized_devices; /* devices not currently being allocated */ struct list_head alloc_list;