
[v2,2/4] btrfs: Update per-profile available space when device size/used space get updated

Message ID 20200102112746.145045-3-wqu@suse.com
Series Introduce per-profile available space array to avoid over-confident can_overcommit()

Commit Message

Qu Wenruo Jan. 2, 2020, 11:27 a.m. UTC
There are 4 locations where device size or used space gets updated:
- Chunk allocation
- Chunk removal
- Device grow
- Device shrink

Now also update the per-profile available space at those times.

For __btrfs_alloc_chunk() we can't acquire device_list_mutex, as in
btrfs_finish_chunk_alloc() we could hold device_list_mutex and cause a
deadlock.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)
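
For readers who have not seen patch 1 of the series, the idea behind the per-profile available space array can be modelled in user space: for each profile, keep handing out stripe sets from the devices with the most unallocated space and count how much of that holds data. The sketch below is only such a model, with made-up device sizes and helper names; the real calc_per_profile_avail() from patch 1 works on btrfs_device structures under the locking discussed in the comments below.

```c
#include <stdio.h>
#include <stdlib.h>

/* Sort helper: descending order of unallocated space. */
static int cmp_desc(const void *a, const void *b)
{
	unsigned long long x = *(const unsigned long long *)a;
	unsigned long long y = *(const unsigned long long *)b;

	if (x < y)
		return 1;
	if (x > y)
		return -1;
	return 0;
}

/*
 * Simulate chunk allocation for one profile: each round takes one
 * @stripe_size stripe from each of the @ndevs devices with the most
 * unallocated space, of which @data_stripes stripes hold data
 * (e.g. RAID1: ndevs = 2, data_stripes = 1).
 */
static unsigned long long profile_avail(unsigned long long *free_mib,
					int nr_devs, int ndevs,
					int data_stripes,
					unsigned long long stripe_size)
{
	unsigned long long avail = 0;
	int i;

	while (nr_devs >= ndevs) {
		qsort(free_mib, nr_devs, sizeof(*free_mib), cmp_desc);
		if (free_mib[ndevs - 1] < stripe_size)
			break;
		for (i = 0; i < ndevs; i++)
			free_mib[i] -= stripe_size;
		avail += (unsigned long long)data_stripes * stripe_size;
	}
	return avail;
}

int main(void)
{
	/* Made-up example: 10 GiB, 5 GiB and 1 GiB unallocated, in MiB. */
	unsigned long long single[3] = { 10240, 5120, 1024 };
	unsigned long long raid1[3]  = { 10240, 5120, 1024 };

	printf("SINGLE available: %llu MiB\n",
	       profile_avail(single, 3, 1, 1, 1024));
	printf("RAID1  available: %llu MiB\n",
	       profile_avail(raid1, 3, 2, 1, 1024));
	return 0;
}
```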

Comments

Josef Bacik Jan. 2, 2020, 4:17 p.m. UTC | #1
On 1/2/20 6:27 AM, Qu Wenruo wrote:
> There are 4 locations where device size or used space gets updated:
> - Chunk allocation
> - Chunk removal
> - Device grow
> - Device shrink
> 
> Now also update the per-profile available space at those times.
> 
> For __btrfs_alloc_chunk() we can't acquire device_list_mutex, as in
> btrfs_finish_chunk_alloc() we could hold device_list_mutex and cause a
> deadlock.
> 

These are protecting two different things though, holding the chunk_mutex 
doesn't keep things from being removed from the device list.

Looking at patch 1 can't we just do the device list traversal under RCU and then 
not have to worry about the locking at all?  Thanks,

Josef
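
For context, the RCU-based traversal Josef suggests would look roughly like the sketch below for the RCU-protected fs_devices->devices list. This is purely illustrative and not part of the patch, and the helper name is made up; as the follow-ups conclude, alloc_list itself is protected by chunk_mutex rather than RCU.

```c
/*
 * Illustration only, not part of the patch: a lockless walk of the
 * RCU-protected fs_devices->devices list.
 */
static u64 total_unallocated_rcu(struct btrfs_fs_devices *fs_devices)
{
	struct btrfs_device *device;
	u64 unalloc = 0;

	rcu_read_lock();
	list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
		if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state))
			continue;
		unalloc += device->total_bytes - device->bytes_used;
	}
	rcu_read_unlock();

	return unalloc;
}
```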
Qu Wenruo Jan. 3, 2020, 12:51 a.m. UTC | #2
On 2020/1/3 12:17 AM, Josef Bacik wrote:
> On 1/2/20 6:27 AM, Qu Wenruo wrote:
>> There are 4 locations where device size or used space gets updated:
>> - Chunk allocation
>> - Chunk removal
>> - Device grow
>> - Device shrink
>>
>> Now also update the per-profile available space at those times.
>>
>> For __btrfs_alloc_chunk() we can't acquire device_list_mutex, as in
>> btrfs_finish_chunk_alloc() we could hold device_list_mutex and cause a
>> deadlock.
>>
> 
> These are protecting two different things though, holding the
> chunk_mutex doesn't keep things from being removed from the device list.
> 
> Looking at patch 1 can't we just do the device list traversal under RCU
> and then not have to worry about the locking at all?  Thanks,

That's a very interesting solution.

But the comment on btrfs_fs_devices::alloc_list says:
```
        /*
         * Devices which can satisfy space allocation. Protected by
         * chunk_mutex
         */
        struct list_head alloc_list;
```

And __btrfs_alloc_chunk() is iterating alloc_list without extra
protection, so it should be OK, I guess.

Thanks,
Qu
> 
> Josef
Qu Wenruo Jan. 3, 2020, 9:52 a.m. UTC | #3
On 2020/1/3 12:17 AM, Josef Bacik wrote:
> On 1/2/20 6:27 AM, Qu Wenruo wrote:
>> There are 4 locations where device size or used space gets updated:
>> - Chunk allocation
>> - Chunk removal
>> - Device grow
>> - Device shrink
>>
>> Now also update the per-profile available space at those times.
>>
>> For __btrfs_alloc_chunk() we can't acquire device_list_mutex, as in
>> btrfs_finish_chunk_alloc() we could hold device_list_mutex and cause a
>> deadlock.
>>
> 
> These are protecting two different things though, holding the
> chunk_mutex doesn't keep things from being removed from the device list.

Looking further into the locking scheme, Nikolay's comment is right:
chunk_mutex protects alloc_list (rw devices).

A device won't disappear from alloc_list, as in btrfs_rm_device() we take
chunk_mutex before removing a writeable device.

In fact, device_list_mutex is unrelated to alloc_list.

So the other calc_per_profile_avail() call sites are in fact not safe; I
shouldn't take device_list_mutex but chunk_mutex, which is much easier to
get.

> 
> Looking at patch 1 can't we just do the device list traversal under RCU
> and then not have to worry about the locking at all?  Thanks,

I guess we need SRCU, since in the __btrfs_alloc_chunk() context we need
to sleep while searching the dev extent tree.

And I'm not confident about some corner cases; an RCU sync might cause
unexpected performance degradation, as it can be more expensive than a
simple mutex lock.

Thanks,
Qu

> 
> Josef
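
To make the SRCU point concrete: a plain rcu_read_lock() section must not sleep, while an SRCU read side may, at the cost of writers having to use synchronize_srcu() (the sync overhead Qu is worried about). The snippet below only illustrates that API shape under a made-up SRCU domain; btrfs does not actually define one for alloc_list.

```c
/*
 * Illustration of the SRCU point: unlike a plain rcu_read_lock()
 * section, an SRCU read side may sleep.  The SRCU domain below is
 * made up and only shows the API shape.
 */
DEFINE_STATIC_SRCU(alloc_list_srcu);

static void walk_alloc_list_srcu(struct btrfs_fs_devices *fs_devices)
{
	struct btrfs_device *device;
	int idx;

	idx = srcu_read_lock(&alloc_list_srcu);
	list_for_each_entry_rcu(device, &fs_devices->alloc_list,
				dev_alloc_list) {
		/* May block here, e.g. while searching the dev extent tree. */
	}
	srcu_read_unlock(&alloc_list_srcu, idx);
}

/*
 * Writers removing devices would then need synchronize_srcu(), which is
 * the sync cost mentioned above.
 */
```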
Josef Bacik Jan. 3, 2020, 4:35 p.m. UTC | #4
On 1/3/20 4:52 AM, Qu Wenruo wrote:
> 
> 
> On 2020/1/3 12:17 AM, Josef Bacik wrote:
>> On 1/2/20 6:27 AM, Qu Wenruo wrote:
>>> There are 4 locations where device size or used space gets updated:
>>> - Chunk allocation
>>> - Chunk removal
>>> - Device grow
>>> - Device shrink
>>>
>>> Now also update the per-profile available space at those times.
>>>
>>> For __btrfs_alloc_chunk() we can't acquire device_list_mutex, as in
>>> btrfs_finish_chunk_alloc() we could hold device_list_mutex and cause a
>>> deadlock.
>>>
>>
>> These are protecting two different things though, holding the
>> chunk_mutex doesn't keep things from being removed from the device list.
> 
> Looking further into the locking scheme, Nikolay's comment is right:
> chunk_mutex protects alloc_list (rw devices).
> 
> A device won't disappear from alloc_list, as in btrfs_rm_device() we take
> chunk_mutex before removing a writeable device.
> 
> In fact, device_list_mutex is unrelated to alloc_list.
> 
> So the other calc_per_profile_avail() call sites are in fact not safe; I
> shouldn't take device_list_mutex but chunk_mutex, which is much easier to
> get.
> 
>>
>> Looking at patch 1 can't we just do the device list traversal under RCU
>> and then not have to worry about the locking at all?  Thanks,
> 
> I guess we need SRCU, since in the __btrfs_alloc_chunk() context we need
> to sleep while searching the dev extent tree.
> 
> And I'm not confident about some corner cases; an RCU sync might cause
> unexpected performance degradation, as it can be more expensive than a
> simple mutex lock.
> 

Sigh, sorry, I was reading it as if we were doing fs_devices->devices, which is
all handled via RCU.  But you're right, alloc_list is covered by the chunk
mutex, so just make sure the chunk mutex is taken everywhere and you should be
good to go.  Thanks,

Josef
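
The direction agreed on here, recalculating under chunk_mutex instead of device_list_mutex, would look roughly like the hypothetical helper below. The helper name is made up and this is not the actual follow-up patch, only a sketch of the locking; calc_per_profile_avail() is the function introduced in patch 1.

```c
/*
 * Sketch only: recalculate the per-profile array while chunk_mutex is
 * held, since chunk_mutex (not device_list_mutex) is what protects
 * alloc_list.
 */
static int update_per_profile_avail(struct btrfs_fs_info *fs_info)
{
	int ret;

	mutex_lock(&fs_info->chunk_mutex);
	/* alloc_list and per-device sizes cannot change under us here. */
	ret = calc_per_profile_avail(fs_info);
	mutex_unlock(&fs_info->chunk_mutex);

	return ret;
}
```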

Patch

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index fa948f7eebc2..be8250332c60 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2804,6 +2804,7 @@  int btrfs_grow_device(struct btrfs_trans_handle *trans,
 	struct btrfs_super_block *super_copy = fs_info->super_copy;
 	u64 old_total;
 	u64 diff;
+	int ret;
 
 	if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state))
 		return -EACCES;
@@ -2832,6 +2833,11 @@  int btrfs_grow_device(struct btrfs_trans_handle *trans,
 			      &trans->transaction->dev_update_list);
 	mutex_unlock(&fs_info->chunk_mutex);
 
+	mutex_lock(&fs_info->fs_devices->device_list_mutex);
+	ret = calc_per_profile_avail(fs_info);
+	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+	if (ret < 0)
+		return ret;
 	return btrfs_update_device(trans, device);
 }
 
@@ -3010,7 +3016,12 @@  int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
 			goto out;
 		}
 	}
+	ret = calc_per_profile_avail(fs_info);
 	mutex_unlock(&fs_devices->device_list_mutex);
+	if (ret < 0) {
+		btrfs_abort_transaction(trans, ret);
+		goto out;
+	}
 
 	ret = btrfs_free_chunk(trans, chunk_offset);
 	if (ret) {
@@ -4826,6 +4837,10 @@  int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 			device->fs_devices->total_rw_bytes += diff;
 		atomic64_add(diff, &fs_info->free_chunk_space);
 		mutex_unlock(&fs_info->chunk_mutex);
+	} else {
+		mutex_lock(&fs_info->fs_devices->device_list_mutex);
+		ret = calc_per_profile_avail(fs_info);
+		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 	}
 	return ret;
 }
@@ -5143,8 +5158,14 @@  static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	check_raid56_incompat_flag(info, type);
 	check_raid1c34_incompat_flag(info, type);
 
+	/*
+	 * In this context we don't need device_list_mutex as chunk_mutex is
+	 * protecting us.
+	 */
+	ret = calc_per_profile_avail(info);
+
 	kfree(devices_info);
-	return 0;
+	return ret;
 
 error_del_extent:
 	write_lock(&em_tree->lock);