
[05/11] Btrfs: protect fs_info->alloc_start

Message ID 50EEB880.4060308@cn.fujitsu.com (mailing list archive)
State New, archived

Commit Message

Miao Xie Jan. 10, 2013, 12:48 p.m. UTC
fs_info->alloc_start was not strictly protected; it might be changed while
we were accessing it. This patch fixes the problem by using two locks to
protect it (fs_info->chunk_mutex and sb->s_umount). On the read side, we
just need to take one of these two locks, and on the write side, we must
take both of them.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/ctree.h | 10 ++++++++++
 fs/btrfs/super.c |  4 ++++
 2 files changed, 14 insertions(+)

Comments

David Sterba Jan. 10, 2013, 5:10 p.m. UTC | #1
On Thu, Jan 10, 2013 at 08:48:00PM +0800, Miao Xie wrote:
> fs_info->alloc_start was not strictly protected; it might be changed while
> we were accessing it. This patch fixes the problem by using two locks to
> protect it (fs_info->chunk_mutex and sb->s_umount). On the read side, we
> just need to take one of these two locks, and on the write side, we must
> take both of them.

Can you please describe how this can theoretically cause any problems?

Out of curiosity I've been running this command on a 60G ssd disk for a while

for i in `seq 1 10000|shuf`; do
	mount -o remount,alloc_start=$(($i*1024*1024)) /mnt/sdc
	#sleep 0 / 1 / 5
done

together with one of my favourite stresstests (heavy writes, subvolume
activity).

There are two direct users of fs_info->alloc_start:

* statfs via btrfs_calc_avail_data_space
* find_free_dev_extent

960         search_start = max(root->fs_info->alloc_start, 1024ull * 1024);

as remount calls sync, there is a very tiny window where an allocation could
get the old value of alloc_start just between sync and do_remount. Theoretical
and without any visible effect.

My NAK for this patch.

david
Miao Xie Jan. 14, 2013, 8:04 a.m. UTC | #2
On Thu, 10 Jan 2013 18:10:40 +0100, David Sterba wrote:
> On Thu, Jan 10, 2013 at 08:48:00PM +0800, Miao Xie wrote:
>> fs_info->alloc_start was not strictly protected; it might be changed while
>> we were accessing it. This patch fixes the problem by using two locks to
>> protect it (fs_info->chunk_mutex and sb->s_umount). On the read side, we
>> just need to take one of these two locks, and on the write side, we must
>> take both of them.
> 
> Can you please describe how this can theoretically cause any problems?
> 
> Out of curiosity I've been running this command on a 60G ssd disk for a while
> 
> for i in `seq 1 10000|shuf`; do
> 	mount -o remount,alloc_start=$(($i*1024*1024)) /mnt/sdc
> 	#sleep 0 / 1 / 5
> done
> 
> together with one of my favourite stresstests (heavy writes, subvolume
> activity).
> 
> There are two direct users of fs_info->alloc_start:
> 
> * statfs via btrfs_calc_avail_data_space
> * find_free_dev_extent
> 
> 960         search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
> 
> as remount calls sync, there is a very tiny window where an allocation could
> get the old value of alloc_start just between sync and do_remount. Theoretical
> and without any visible effect.

->alloc_start is a 64-bit variable; on a 32-bit machine, we have to load it
with two instructions.

Assuming -> alloc_start is 0x00010000 at the beginning, then we remount and
set ->alloc_start to 0x00000100
	Task0 			Task1
				load low 32 bits
	set high 32 bits
				load high 32 bits
	set low 32 bits

Task1 will get 0.
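
A minimal user-space sketch of this interleaving (just an illustration, not
btrfs code; it assumes a 32-bit build where a plain 64-bit load/store is
compiled into two 32-bit accesses, and the values are arbitrary):

/* torn_read.c - user-space sketch of the interleaving above.
 * Build as a 32-bit binary, e.g.: gcc -m32 -O2 -pthread torn_read.c
 * On i386 a plain (non-atomic) 64-bit load/store is usually done as two
 * 32-bit accesses, so the reader can observe a half-updated value. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define OLD_VAL 0x0000000100000000ULL	/* halves: hi = 1, lo = 0 */
#define NEW_VAL 0x0000000000010000ULL	/* halves: hi = 0, lo = 0x10000 */

static volatile uint64_t alloc_start = OLD_VAL;

static void *writer(void *unused)
{
	for (;;) {
		alloc_start = NEW_VAL;	/* typically two 32-bit stores */
		alloc_start = OLD_VAL;
	}
	return NULL;
}

static void *reader(void *unused)
{
	for (;;) {
		uint64_t v = alloc_start;	/* typically two 32-bit loads */

		if (v != OLD_VAL && v != NEW_VAL)
			printf("torn read: 0x%016llx\n",
			       (unsigned long long)v);
	}
	return NULL;
}

int main(void)
{
	pthread_t w, r;

	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(w, NULL);	/* runs until interrupted */
	return 0;
}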

Thanks
Miao

> 
> My NAK for this patch.
> 
> david
> 

David Sterba Jan. 16, 2013, 12:43 p.m. UTC | #3
On Mon, Jan 14, 2013 at 04:04:59PM +0800, Miao Xie wrote:
> On Thu, 10 Jan 2013 18:10:40 +0100, David Sterba wrote:
> > On Thu, Jan 10, 2013 at 08:48:00PM +0800, Miao Xie wrote:
> >> fs_info->alloc_start was not strictly protected; it might be changed while
> >> we were accessing it. This patch fixes the problem by using two locks to
> >> protect it (fs_info->chunk_mutex and sb->s_umount). On the read side, we
> >> just need to take one of these two locks, and on the write side, we must
> >> take both of them.
> > 
> > Can you please describe how this can theoretically cause any problems?
> > 
> > Out of curiosity I've been running this command on a 60G ssd disk for a while
> > 
> > for i in `seq 1 10000|shuf`; do
> > 	mount -o remount,alloc_start=$(($i*1024*1024)) /mnt/sdc
> > 	#sleep 0 / 1 / 5
> > done
> > 
> > together with one of my favourite stresstests (heavy writes, subvolume
> > activity).
> > 
> > There are two direct users of fs_info->alloc_start:
> > 
> > * statfs via btrfs_calc_avail_data_space
> > * find_free_dev_extent
> > 
> > 960         search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
> > 
> > as remount calls sync, there is a very tiny window where an allocation could
> > get the old value of alloc_start just between sync and do_remount. Theoretical
> > and without any visible effect.
> 
> ->alloc_start is a 64-bit variable; on a 32-bit machine, we have to load it
> with two instructions.
> 
> Assuming -> alloc_start is 0x00010000 at the beginning, then we remount and
> set ->alloc_start to 0x00000100
> 	Task0 			Task1
> 				load low 32 bits
> 	set high 32 bits
> 				load high 32 bits
> 	set low 32 bits
> 
> Task1 will get 0.

That's an insufficient description of how a race could actually happen;
it's only a high-level scheme of how a 64-bit value can get mixed up by
32-bit accesses.

How exactly does it cause problems in btrfs code?

IMO the bug is not there, due to various reasons like instruction
scheduling and cache effects that may hide the actual split of the 64-bit
value into 32-bit accesses.

* the value set on one cpu is not immediately visible on another cpu
* the full 64bit value of alloc_start lives inside a single cacheline
  and will be updated as a whole in context of btrfs -- there's only one
  place where it's set so multiple places that would store the value and
  fight over the cacheline exclusivity are out of the question;
  the storing cpu will flush the pending writes at some point in time
  so they're visible to other cpus; until then the load side reads a
  consistent value -- in almost all cases. The exception gets even more
  wild and counts on the exact timing of memory barriers triggered from
  the other cpu and on cacheline ownership: the store finishes only one
  half, barrier, the load finds half new and half old value, then the
  store does the other half
* this needs to happen within 4-5 instructions for an operation that
  involves an IO sync -- which makes it much harder to provoke the right
  timing

Your example with 0x00010000 and 0x00000100 uses only 32-bit values, so
it cannot be applied here. Also, alloc_start is always compared against
the hardcoded 1M limit, so it cannot drop down to 0.

The point is not the one extra mutex lock, but that you're attempting to
fix a bug that is highly improbable, if possible at all, without a proper
description and analysis. I don't want to spend the time on this and do
the work instead of you; next time, please try harder to convince us.

david
Miao Xie Jan. 17, 2013, 4:13 a.m. UTC | #4
On Wed, 16 Jan 2013 13:43:00 +0100, David Sterba wrote:
> On Mon, Jan 14, 2013 at 04:04:59PM +0800, Miao Xie wrote:
>> On Thu, 10 Jan 2013 18:10:40 +0100, David Sterba wrote:
>>> On Thu, Jan 10, 2013 at 08:48:00PM +0800, Miao Xie wrote:
>>>> fs_info->alloc_start was not strictly protected; it might be changed while
>>>> we were accessing it. This patch fixes the problem by using two locks to
>>>> protect it (fs_info->chunk_mutex and sb->s_umount). On the read side, we
>>>> just need to take one of these two locks, and on the write side, we must
>>>> take both of them.
>>>
>>> Can you please describe how this can theoretically cause any problems?
>>>
>>> Out of curiosity I've been running this command on a 60G ssd disk for a while
>>>
>>> for i in `seq 1 10000|shuf`; do
>>> 	mount -o remount,alloc_start=$(($i*1024*1024)) /mnt/sdc
>>> 	#sleep 0 / 1 / 5
>>> done
>>>
>>> together with one of my favourite stresstests (heavy writes, subvolume
>>> activity).
>>>
>>> There are two direct users of fs_info->alloc_start:
>>>
>>> * statfs via btrfs_calc_avail_data_space
>>> * find_free_dev_extent
>>>
>>> 960         search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
>>>
>>> as remount calls sync, there is a very tiny window where an allocation could
>>> get the old value of alloc_start just between sync and do_remount. Theoretical
>>> and without any visible effect.
>>
>> ->alloc_start is a 64-bit variable; on a 32-bit machine, we have to load it
>> with two instructions.
>>
>> Assuming -> alloc_start is 0x00010000 at the beginning, then we remount and
				^ it should be 0x0000 0001 0000 0000

>> set ->alloc_start to 0x00000100
				^ should be 0x0000 0000 0001 0000

>> 	Task0 			Task1
>> 				load low 32 bits
>> 	set high 32 bits
>> 				load high 32 bits
>> 	set low 32 bits
>>
>> Task1 will get 0.
> 
> That's an insufficient description of how a race could actually happen;
> it's only a high-level scheme of how a 64-bit value can get mixed up by
> 32-bit accesses.
> 
> How exactly does it cause problems in btrfs code?
> 
> IMO the bug is not there, due to various reasons like instruction
> scheduling and cache effects that may hide the actual split of the 64-bit
> value into 32-bit accesses.
> 
> * the value set on one cpu is not immediately visible on another cpu
> * the full 64bit value of alloc_start lives inside a single cacheline
>   and will be updated as a whole in context of btrfs -- there's only one

I know it lives inside a single cacheline, but it is the cacheline that is
updated as a whole, not the 64-bit value. And a cacheline is not a lock; it
cannot prevent the other CPU from acquiring it while only half of the value
has been updated. Right?

>   place where it's set so multiple places that would store the value and
>   fight over the cacheline exclusivity are out of the question;
>   the storing cpu will flush the pending writes at some point in time
>   so they're visible to other cpus; until then the load side reads a
>   consistent value -- in almost all cases. The exception gets even more
>   wild and counts on the exact timing of memory barriers triggered from
>   the other cpu and on cacheline ownership: the store finishes only one
>   half, barrier, the load finds half new and half old value, then the
>   store does the other half
> * this needs to happen within 4-5 instructions for an operation that
>   involves an IO sync -- which makes it much harder to provoke the right
>   timing
> 
> Your example with 0x00010000 and 0x00000100 uses only 32-bit values, so

My mistake, sorry. The numbers we want to use are 64-bit (see above).

> it cannot be applied here. Also, alloc_start is always compared against
> the hardcoded 1M limit, so it cannot drop down to 0.

0 is just an example. The point is that we should not allocate an extent from
an unexpected address.
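
For instance, with the corrected 64-bit values above (old = 0x0000000100000000,
new = 0x0000000000010000): in the interleaving I described, the reader combines
the new high half with the old low half and gets 0; with the opposite ordering
(high half loaded before the store, low half after it) it would get
0x0000000100010000, an address nobody ever set. Either way chunk allocation
would search from an unexpected offset.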

> The point is not the one extra mutex lock, but that you're attempting to
> fix a bug that is highly improbable, if possible at all, without a proper

On 32-bit machines it is not highly improbable; we can trigger this bug easily
(the attached code, which simulates what btrfs does, can reproduce the bug).

The reason this problem doesn't show up in btrfs is that:
1. Most users run btrfs on 64-bit machines.
2. Changing alloc_start is infrequent, and chunk allocation is not frequent
   either.
So the race is hard to trigger in btrfs, but that doesn't mean it cannot
happen.

> description and analysis. I don't want to spend the time on this and do
> the work instead of you; next time, please try harder to convince us.

Miao
Alex Lyakas Jan. 20, 2013, 1:11 p.m. UTC | #5
Hi David,
do you think you can share this favorite stress-test that you have mentioned?

Thanks,
Alex.


On Thu, Jan 10, 2013 at 7:10 PM, David Sterba <dsterba@suse.cz> wrote:
> On Thu, Jan 10, 2013 at 08:48:00PM +0800, Miao Xie wrote:
>> fs_info->alloc_start was not strictly protected; it might be changed while
>> we were accessing it. This patch fixes the problem by using two locks to
>> protect it (fs_info->chunk_mutex and sb->s_umount). On the read side, we
>> just need to take one of these two locks, and on the write side, we must
>> take both of them.
>
> Can you please describe how this can theoretically cause any problems?
>
> Out of curiosity I've been running this command on a 60G ssd disk for a while
>
> for i in `seq 1 10000|shuf`; do
>         mount -o remount,alloc_start=$(($i*1024*1024)) /mnt/sdc
>         #sleep 0 / 1 / 5
> done
>
> together with one of my favourite stresstests (heavy writes, subvolume
> activity).
>
> There are two direct users of fs_info->alloc_start:
>
> * statfs via btrfs_calc_avail_data_space
> * find_free_dev_extent
>
> 960         search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
>
> as remount calls sync, there is a very tiny window where an allocation could
> get the old value of alloc_start just between sync and do_remount. Theoretical
> and without any visible effect.
>
> My NAK for this patch.
>
> david

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3e672916..201be7d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1295,6 +1295,16 @@  struct btrfs_fs_info {
 	 * so it is also safe.
 	 */
 	u64 max_inline;
+	/*
+	 * Protected by ->chunk_mutex and sb->s_umount.
+	 *
+	 * The reason that we use two locks to protect it is that only the
+	 * mount and remount operations can change it, and both of them run
+	 * under sb->s_umount, but the read side (chunk allocation) cannot
+	 * acquire sb->s_umount or a deadlock would happen. So we use two
+	 * locks to protect it: on the write side, we must acquire both, and
+	 * on the read side, we just need to acquire one of them.
+	 */
 	u64 alloc_start;
 	struct btrfs_transaction *running_transaction;
 	wait_queue_head_t transaction_throttle;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 073756a..6f0524d 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -519,7 +519,9 @@  int btrfs_parse_options(struct btrfs_root *root, char *options)
 		case Opt_alloc_start:
 			num = match_strdup(&args[0]);
 			if (num) {
+				mutex_lock(&info->chunk_mutex);
 				info->alloc_start = memparse(num, NULL);
+				mutex_unlock(&info->chunk_mutex);
 				kfree(num);
 				printk(KERN_INFO
 					"btrfs: allocations start at %llu\n",
@@ -1289,7 +1291,9 @@  restore:
 	fs_info->mount_opt = old_opts;
 	fs_info->compress_type = old_compress_type;
 	fs_info->max_inline = old_max_inline;
+	mutex_lock(&fs_info->chunk_mutex);
 	fs_info->alloc_start = old_alloc_start;
+	mutex_unlock(&fs_info->chunk_mutex);
 	btrfs_resize_thread_pool(fs_info,
 		old_thread_pool_size, fs_info->thread_pool_size);
 	fs_info->metadata_ratio = old_metadata_ratio;
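
For reference, the hunks above cover only the write side. Under the scheme
described in the ctree.h comment, a reader that does not already hold
fs_info->chunk_mutex would take just that mutex around the access; a
hypothetical sketch (the helper name is made up and is not part of the
posted patch):

/*
 * Hypothetical read-side helper for the two-lock scheme described in the
 * ctree.h comment; not part of the posted patch. The read side (chunk
 * allocation) cannot take sb->s_umount without risking a deadlock, so
 * holding fs_info->chunk_mutex alone is what makes the read consistent.
 */
static u64 btrfs_get_alloc_start(struct btrfs_fs_info *fs_info)
{
	u64 alloc_start;

	mutex_lock(&fs_info->chunk_mutex);
	alloc_start = fs_info->alloc_start;
	mutex_unlock(&fs_info->chunk_mutex);

	return alloc_start;
}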