[05/11] Btrfs: protect fs_info->alloc_start

Message ID	50EEB880.4060308@cn.fujitsu.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-btrfs-owner@vger.kernel.org> Message-ID: <50EEB880.4060308@cn.fujitsu.com> Date: Thu, 10 Jan 2013 20:48:00 +0800 From: Miao Xie <miaox@cn.fujitsu.com> Reply-To: miaox@cn.fujitsu.com User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: Linux Btrfs <linux-btrfs@vger.kernel.org> Subject: [PATCH 05/11] Btrfs: protect fs_info->alloc_start Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=UTF-8 Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk

Miao Xie Jan. 10, 2013, 12:48 p.m. UTC

fs_info->alloc_start was not protected strictly, it might be changed while
we were accessing it. This patch fixes this problem by using two locks to
protect it (fs_info->chunk_mutex and sb->s_umount). On the read side, we
just need get one of these two locks, and on the write side, we must lock
all of them.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/ctree.h | 10 ++++++++++
 fs/btrfs/super.c |  4 ++++
 2 files changed, 14 insertions(+)

David Sterba Jan. 10, 2013, 5:10 p.m. UTC | #1

On Thu, Jan 10, 2013 at 08:48:00PM +0800, Miao Xie wrote:
> fs_info->alloc_start was not protected strictly, it might be changed while
> we were accessing it. This patch fixes this problem by using two locks to
> protect it (fs_info->chunk_mutex and sb->s_umount). On the read side, we
> just need get one of these two locks, and on the write side, we must lock
> all of them.

Can you please describe how this can theoretically cause any problems?

Out of curiosity I've been running this command on a 60G ssd disk for a while

for i in `seq 1 10000|shuf`;do
	mount -o remount,alloc_start=$(($i*1024*1024)) /mnt/sdc;done
	#sleep 0 / 1 / 5
done

together with one of my favourite stresstests (heavy writes, subvolume
activity).

There are two direct users of fs_info->alloc_start:

* statfs via btrfs_calc_avail_data_space
* find_free_dev_extent

960         search_start = max(root->fs_info->alloc_start, 1024ull * 1024);

as remount calls sync, there is a very tiny window where an allocation could
get the old value of alloc_start just between sync and do_remount. Theoretical
and without any visible effect.

My NAK for this patch.

david
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Miao Xie Jan. 14, 2013, 8:04 a.m. UTC | #2

On Thu, 10 Jan 2013 18:10:40 +0100, David Sterba wrote:
> On Thu, Jan 10, 2013 at 08:48:00PM +0800, Miao Xie wrote:
>> fs_info->alloc_start was not protected strictly, it might be changed while
>> we were accessing it. This patch fixes this problem by using two locks to
>> protect it (fs_info->chunk_mutex and sb->s_umount). On the read side, we
>> just need get one of these two locks, and on the write side, we must lock
>> all of them.
> 
> Can you please describe how this can theoretically cause any problems?
> 
> Out of curiosity I've been running this command on a 60G ssd disk for a while
> 
> for i in `seq 1 10000|shuf`;do
> 	mount -o remount,alloc_start=$(($i*1024*1024)) /mnt/sdc;done
> 	#sleep 0 / 1 / 5
> done
> 
> together with one of my favourite stresstests (heavy writes, subvolume
> activity).
> 
> There are two direct users of fs_info->alloc_start:
> 
> * statfs via btrfs_calc_avail_data_space
> * find_free_dev_extent
> 
> 960         search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
> 
> as remount calls sync, there is a very tiny window where an allocation could
> get the old value of alloc_start just between sync and do_remount. Theoretical
> and without any visible effect.

->alloc_start is a 64bits variant, on the 32bits machine, we have to load it
by two instructions.

Assuming -> alloc_start is 0x00010000 at the beginning, then we remount and
set ->alloc_start to 0x00000100
	Task0 			Task1
				load low 32 bits
	set high 32 bits
				load high 32 bits
	set low 32 bits

Task1 will get 0.

Thanks
Miao

> 
> My NAK for this patch.
> 
> david
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Sterba Jan. 16, 2013, 12:43 p.m. UTC | #3

On Mon, Jan 14, 2013 at 04:04:59PM +0800, Miao Xie wrote:
> On Thu, 10 Jan 2013 18:10:40 +0100, David Sterba wrote:
> > On Thu, Jan 10, 2013 at 08:48:00PM +0800, Miao Xie wrote:
> >> fs_info->alloc_start was not protected strictly, it might be changed while
> >> we were accessing it. This patch fixes this problem by using two locks to
> >> protect it (fs_info->chunk_mutex and sb->s_umount). On the read side, we
> >> just need get one of these two locks, and on the write side, we must lock
> >> all of them.
> > 
> > Can you please describe how this can theoretically cause any problems?
> > 
> > Out of curiosity I've been running this command on a 60G ssd disk for a while
> > 
> > for i in `seq 1 10000|shuf`;do
> > 	mount -o remount,alloc_start=$(($i*1024*1024)) /mnt/sdc;done
> > 	#sleep 0 / 1 / 5
> > done
> > 
> > together with one of my favourite stresstests (heavy writes, subvolume
> > activity).
> > 
> > There are two direct users of fs_info->alloc_start:
> > 
> > * statfs via btrfs_calc_avail_data_space
> > * find_free_dev_extent
> > 
> > 960         search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
> > 
> > as remount calls sync, there is a very tiny window where an allocation could
> > get the old value of alloc_start just between sync and do_remount. Theoretical
> > and without any visible effect.
> 
> ->alloc_start is a 64bits variant, on the 32bits machine, we have to load it
> by two instructions.
> 
> Assuming -> alloc_start is 0x00010000 at the beginning, then we remount and
> set ->alloc_start to 0x00000100
> 	Task0 			Task1
> 				load low 32 bits
> 	set high 32 bits
> 				load high 32 bits
> 	set low 32 bits
> 
> Task1 will get 0.

That's an insufficient description of how a race could actually happen,
it's only a high-level scheme how a 64bit value can get mixed up with
32bit access.

How exactly does it cause problems in btrfs code?

IMO the bug is not there due to various reasons like instruction
scheduling and cache effect that may hide the actual split of the 32bit
value.

* the value set on one cpu is not immediatelly visible on another cpu
* the full 64bit value of alloc_start lives inside a single cacheline
  and will be updated as a whole in context of btrfs -- there's only one
  place where it's set so multiple places that would store the value and
  fight over the cacheline exclusivity are out of question;
  the storing cpu will flush the pending writes at some point in time
  so they're visible by other cpus, until then the load side reads a
  consistent value -- in almost all cases, and this gets even more wild
  and counts on exact timing of a memory barriers triggered from other
  cpu and cacheline ownership, so the store finishes only one half,
  barrier, load finds half new and half old value, store does the other
  half
* this needs to happen within 4-5 instructions for an operation that
  involves a IO sync -- makes it much harder to provoke right timing

Your example with 0x00010000 and 0x00000100 uses only 32bit values, so
it cannot be applied here. Also, alloc_start is always compared with the
hardcoded 1M limit so it cannot go below to 0.

The point is not the one extra mutex lock, but that you're attempting to
fix a bug highly improbable if possible at all without a proper
description and analysis. I don't want to spend the time on this and do
the work for instead of you next time, please try harder to convince us.

david
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Miao Xie Jan. 17, 2013, 4:13 a.m. UTC | #4

On wed, 16 Jan 2013 13:43:00 +0100, David Sterba wrote:
> On Mon, Jan 14, 2013 at 04:04:59PM +0800, Miao Xie wrote:
>> On Thu, 10 Jan 2013 18:10:40 +0100, David Sterba wrote:
>>> On Thu, Jan 10, 2013 at 08:48:00PM +0800, Miao Xie wrote:
>>>> fs_info->alloc_start was not protected strictly, it might be changed while
>>>> we were accessing it. This patch fixes this problem by using two locks to
>>>> protect it (fs_info->chunk_mutex and sb->s_umount). On the read side, we
>>>> just need get one of these two locks, and on the write side, we must lock
>>>> all of them.
>>>
>>> Can you please describe how this can theoretically cause any problems?
>>>
>>> Out of curiosity I've been running this command on a 60G ssd disk for a while
>>>
>>> for i in `seq 1 10000|shuf`;do
>>> 	mount -o remount,alloc_start=$(($i*1024*1024)) /mnt/sdc;done
>>> 	#sleep 0 / 1 / 5
>>> done
>>>
>>> together with one of my favourite stresstests (heavy writes, subvolume
>>> activity).
>>>
>>> There are two direct users of fs_info->alloc_start:
>>>
>>> * statfs via btrfs_calc_avail_data_space
>>> * find_free_dev_extent
>>>
>>> 960         search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
>>>
>>> as remount calls sync, there is a very tiny window where an allocation could
>>> get the old value of alloc_start just between sync and do_remount. Theoretical
>>> and without any visible effect.
>>
>> ->alloc_start is a 64bits variant, on the 32bits machine, we have to load it
>> by two instructions.
>>
>> Assuming -> alloc_start is 0x00010000 at the beginning, then we remount and
				^ it should be 0x0000 0001 0000 0000

>> set ->alloc_start to 0x00000100
				^ should be 0x0000 0000 0001 0000

>> 	Task0 			Task1
>> 				load low 32 bits
>> 	set high 32 bits
>> 				load high 32 bits
>> 	set low 32 bits
>>
>> Task1 will get 0.
> 
> That's an insufficient description of how a race could actually happen,
> it's only a high-level scheme how a 64bit value can get mixed up with
> 32bit access.
> 
> How exactly does it cause problems in btrfs code?
> 
> IMO the bug is not there due to various reasons like instruction
> scheduling and cache effect that may hide the actual split of the 32bit
> value.
> 
> * the value set on one cpu is not immediatelly visible on another cpu
> * the full 64bit value of alloc_start lives inside a single cacheline
>   and will be updated as a whole in context of btrfs -- there's only one

I know it lives inside a single cacheline, but only the cacheline will be updated
as a whole, not the 64bit value. And cacheline is not a lock, it can not prevent
that it is acquired by the other CPU when only a half is updated. Right?

>   place where it's set so multiple places that would store the value and
>   fight over the cacheline exclusivity are out of question;
>   the storing cpu will flush the pending writes at some point in time
>   so they're visible by other cpus, until then the load side reads a
>   consistent value -- in almost all cases, and this gets even more wild
>   and counts on exact timing of a memory barriers triggered from other
>   cpu and cacheline ownership, so the store finishes only one half,
>   barrier, load finds half new and half old value, store does the other
>   half
> * this needs to happen within 4-5 instructions for an operation that
>   involves a IO sync -- makes it much harder to provoke right timing
> 
> Your example with 0x00010000 and 0x00000100 uses only 32bit values, so

My mistake, sorry. The number we want to use is 64bit(see above).

> it cannot be applied here. Also, alloc_start is always compared with the
> hardcoded 1M limit so it cannot go below to 0.

0 is just a sample. The point is we should not allocate a extent from a unexpected
address.

> The point is not the one extra mutex lock, but that you're attempting to
> fix a bug highly improbable if possible at all without a proper

On 32bit machines, it is not highly improbable, we can trigger this bug easily
(The attached code which simulates the process of the btrfs can reproduce the bug).

This problem don't happen on btrfs is because:
1. Most of the users use btrfs on 64bit machine
2. Changing alloc_start is infrequent?and chunk allocation is not frequent.
So the race is hard to be triggered on btrfs. But it doesn't means it can
not happen.

> description and analysis. I don't want to spend the time on this and do
> the work for instead of you next time, please try harder to convince us.

Miao

Alex Lyakas Jan. 20, 2013, 1:11 p.m. UTC | #5

Hi David,
do you think you can share this favorite stress-test that you have mentioned?

Thanks,
Alex.


On Thu, Jan 10, 2013 at 7:10 PM, David Sterba <dsterba@suse.cz> wrote:
> On Thu, Jan 10, 2013 at 08:48:00PM +0800, Miao Xie wrote:
>> fs_info->alloc_start was not protected strictly, it might be changed while
>> we were accessing it. This patch fixes this problem by using two locks to
>> protect it (fs_info->chunk_mutex and sb->s_umount). On the read side, we
>> just need get one of these two locks, and on the write side, we must lock
>> all of them.
>
> Can you please describe how this can theoretically cause any problems?
>
> Out of curiosity I've been running this command on a 60G ssd disk for a while
>
> for i in `seq 1 10000|shuf`;do
>         mount -o remount,alloc_start=$(($i*1024*1024)) /mnt/sdc;done
>         #sleep 0 / 1 / 5
> done
>
> together with one of my favourite stresstests (heavy writes, subvolume
> activity).
>
> There are two direct users of fs_info->alloc_start:
>
> * statfs via btrfs_calc_avail_data_space
> * find_free_dev_extent
>
> 960         search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
>
> as remount calls sync, there is a very tiny window where an allocation could
> get the old value of alloc_start just between sync and do_remount. Theoretical
> and without any visible effect.
>
> My NAK for this patch.
>
> david
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[05/11] Btrfs: protect fs_info->alloc_start

Commit Message

Comments

Patch