diff mbox

Btrfs: fix race when reusing stale extent buffers that leads to BUG_ON

Message ID 1429784928-12665-1-git-send-email-fdmanana@suse.com (mailing list archive)
State Accepted
Headers show

Commit Message

Filipe Manana April 23, 2015, 10:28 a.m. UTC
There's a race between releasing extent buffers that are flagged as stale
and recycling them that makes us hit the following BUG_ON at
btrfs_release_extent_buffer_page:

    BUG_ON(extent_buffer_under_io(eb))

The BUG_ON is triggered because the extent buffer has the flag
EXTENT_BUFFER_DIRTY set as a consequence of having been reused and made
dirty by another concurrent task.

Here follows a sequence of steps that leads to the BUG_ON.

      CPU 0                                                    CPU 1                                                CPU 2

path->nodes[0] == eb X
X->refs == 2 (1 for the tree, 1 for the path)
btrfs_header_generation(X) == current trans id
flag EXTENT_BUFFER_DIRTY set on X

btrfs_release_path(path)
    unlocks X

                                                      reads eb X
                                                         X->refs incremented to 3
                                                      locks eb X
                                                      btrfs_del_items(X)
                                                         X becomes empty
                                                         clean_tree_block(X)
                                                             clear EXTENT_BUFFER_DIRTY from X
                                                         btrfs_del_leaf(X)
                                                             unlocks X
                                                             extent_buffer_get(X)
                                                                X->refs incremented to 4
                                                             btrfs_free_tree_block(X)
                                                                X's range is not pinned
                                                                X's range added to free
                                                                  space cache
                                                             free_extent_buffer_stale(X)
                                                                lock X->refs_lock
                                                                set EXTENT_BUFFER_STALE on X
                                                                release_extent_buffer(X)
                                                                    X->refs decremented to 3
                                                                    unlocks X->refs_lock
                                                      btrfs_release_path()
                                                         unlocks X
                                                         free_extent_buffer(X)
                                                             X->refs becomes 2

                                                                                                      __btrfs_cow_block(Y)
                                                                                                          btrfs_alloc_tree_block()
                                                                                                              btrfs_reserve_extent()
                                                                                                                  find_free_extent()
                                                                                                                      gets offset == X->start
                                                                                                              btrfs_init_new_buffer(X->start)
                                                                                                                  btrfs_find_create_tree_block(X->start)
                                                                                                                      alloc_extent_buffer(X->start)
                                                                                                                          find_extent_buffer(X->start)
                                                                                                                              finds eb X in radix tree

    free_extent_buffer(X)
        lock X->refs_lock
            test X->refs == 2
            test bit EXTENT_BUFFER_STALE is set
            test !extent_buffer_under_io(eb)

                                                                                                                              increments X->refs to 3
                                                                                                                              mark_extent_buffer_accessed(X)
                                                                                                                                  check_buffer_tree_ref(X)
                                                                                                                                    --> does nothing,
                                                                                                                                        X->refs >= 2 and
                                                                                                                                        EXTENT_BUFFER_TREE_REF
                                                                                                                                        is set in X
                                                                                                              clear EXTENT_BUFFER_STALE from X
                                                                                                              locks X
                                                                                                          btrfs_mark_buffer_dirty()
                                                                                                              set_extent_buffer_dirty(X)
                                                                                                                  check_buffer_tree_ref(X)
                                                                                                                     --> does nothing, X->refs >= 2 and
                                                                                                                         EXTENT_BUFFER_TREE_REF is set
                                                                                                                  sets EXTENT_BUFFER_DIRTY on X

            test and clear EXTENT_BUFFER_TREE_REF
            decrements X->refs to 2
        release_extent_buffer(X)
            decrements X->refs to 1
            unlock X->refs_lock

                                                                                                      unlock X
                                                                                                      free_extent_buffer(X)
                                                                                                          lock X->refs_lock
                                                                                                          release_extent_buffer(X)
                                                                                                              decrements X->refs to 0
                                                                                                              btrfs_release_extent_buffer_page(X)
                                                                                                                   BUG_ON(extent_buffer_under_io(X))
                                                                                                                       --> EXTENT_BUFFER_DIRTY set on X

Fix this by making find_extent_buffer wait for any ongoing task currently
executing free_extent_buffer()/free_extent_buffer_stale() if the extent
buffer has the stale flag set.
A cleaner alternative would be to always increment the extent buffer's
reference count while holding its refs_lock spinlock, but find_extent_buffer
is a performance-critical area and that would cause lock contention whenever
multiple tasks search for the same extent buffer concurrently.

A build server running a SLES 12 kernel (3.12 kernel + over 450 upstream
btrfs patches backported from newer kernels) was hitting this often:

[1212302.461948] kernel BUG at ../fs/btrfs/extent_io.c:4507!
(...)
[1212302.470219] CPU: 1 PID: 19259 Comm: bs_sched Not tainted 3.12.36-38-default #1
[1212302.540792] Hardware name: Supermicro PDSM4/PDSM4, BIOS 6.00 04/17/2006
[1212302.540792] task: ffff8800e07e0100 ti: ffff8800d6412000 task.ti: ffff8800d6412000
[1212302.540792] RIP: 0010:[<ffffffffa0507081>]  [<ffffffffa0507081>] btrfs_release_extent_buffer_page.constprop.51+0x101/0x110 [btrfs]
(...)
[1212302.630008] Call Trace:
[1212302.630008]  [<ffffffffa05070cd>] release_extent_buffer+0x3d/0xa0 [btrfs]
[1212302.630008]  [<ffffffffa04c2d9d>] btrfs_release_path+0x1d/0xa0 [btrfs]
[1212302.630008]  [<ffffffffa04c5c7e>] read_block_for_search.isra.33+0x13e/0x3a0 [btrfs]
[1212302.630008]  [<ffffffffa04c8094>] btrfs_search_slot+0x3f4/0xa80 [btrfs]
[1212302.630008]  [<ffffffffa04cf5d8>] lookup_inline_extent_backref+0xf8/0x630 [btrfs]
[1212302.630008]  [<ffffffffa04d13dd>] __btrfs_free_extent+0x11d/0xc40 [btrfs]
[1212302.630008]  [<ffffffffa04d64a4>] __btrfs_run_delayed_refs+0x394/0x11d0 [btrfs]
[1212302.630008]  [<ffffffffa04db379>] btrfs_run_delayed_refs.part.66+0x69/0x280 [btrfs]
[1212302.630008]  [<ffffffffa04ed2ad>] __btrfs_end_transaction+0x2ad/0x3d0 [btrfs]
[1212302.630008]  [<ffffffffa04f7505>] btrfs_evict_inode+0x4a5/0x500 [btrfs]
[1212302.630008]  [<ffffffff811b9e28>] evict+0xa8/0x190
[1212302.630008]  [<ffffffff811b0330>] do_unlinkat+0x1a0/0x2b0

I was also able to reproduce this on a 3.19 kernel, corresponding to Chris'
integration branch from about a month ago, running the following stress
test on a qemu/kvm guest (with 4 virtual cpus and 16 GiB of RAM):

  while true; do
     mkfs.btrfs -l 4096 -f -b `expr 20 \* 1024 \* 1024 \* 1024` /dev/sdd
     mount /dev/sdd /mnt
     snapshot_cmd="btrfs subvolume snapshot -r /mnt"
     snapshot_cmd="$snapshot_cmd /mnt/snap_\`date +'%H_%M_%S_%N'\`"
     fsstress -d /mnt -n 25000 -p 8 -x "$snapshot_cmd" -X 100
     umount /mnt
  done

This usually triggers the BUG_ON in less than 24 hours:

[49558.618097] ------------[ cut here ]------------
[49558.619732] kernel BUG at fs/btrfs/extent_io.c:4551!
(...)
[49558.620031] CPU: 3 PID: 23908 Comm: fsstress Tainted: G        W      3.19.0-btrfs-next-7+ #3
[49558.620031] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[49558.620031] task: ffff8800319fc0d0 ti: ffff880220da8000 task.ti: ffff880220da8000
[49558.620031] RIP: 0010:[<ffffffffa0476b1a>]  [<ffffffffa0476b1a>] btrfs_release_extent_buffer_page+0x20/0xe9 [btrfs]
(...)
[49558.620031] Call Trace:
[49558.620031]  [<ffffffffa0476c73>] release_extent_buffer+0x90/0xd3 [btrfs]
[49558.620031]  [<ffffffff8142b10c>] ? _raw_spin_lock+0x3b/0x43
[49558.620031]  [<ffffffffa0477052>] ? free_extent_buffer+0x37/0x94 [btrfs]
[49558.620031]  [<ffffffffa04770ab>] free_extent_buffer+0x90/0x94 [btrfs]
[49558.620031]  [<ffffffffa04396d5>] btrfs_release_path+0x4a/0x69 [btrfs]
[49558.620031]  [<ffffffffa0444907>] __btrfs_free_extent+0x778/0x80c [btrfs]
[49558.620031]  [<ffffffffa044a485>] __btrfs_run_delayed_refs+0xad2/0xc62 [btrfs]
[49558.728054]  [<ffffffff811420d5>] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18
[49558.728054]  [<ffffffffa044c1e8>] btrfs_run_delayed_refs+0x6d/0x1ba [btrfs]
[49558.728054]  [<ffffffffa045917f>] ? join_transaction.isra.9+0xb9/0x36b [btrfs]
[49558.728054]  [<ffffffffa045a75c>] btrfs_commit_transaction+0x4c/0x981 [btrfs]
[49558.728054]  [<ffffffffa0434f86>] btrfs_sync_fs+0xd5/0x10d [btrfs]
[49558.728054]  [<ffffffff81155923>] ? iterate_supers+0x60/0xc4
[49558.728054]  [<ffffffff8117966a>] ? do_sync_work+0x91/0x91
[49558.728054]  [<ffffffff8117968a>] sync_fs_one_sb+0x20/0x22
[49558.728054]  [<ffffffff81155939>] iterate_supers+0x76/0xc4
[49558.728054]  [<ffffffff811798e8>] sys_sync+0x55/0x83
[49558.728054]  [<ffffffff8142bbd2>] system_call_fastpath+0x12/0x17

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

Comments

Holger Hoffstätte April 23, 2015, 12:16 p.m. UTC | #1
On Thu, 23 Apr 2015 11:28:48 +0100, Filipe Manana wrote:

> There's a race between releasing extent buffers that are flagged as stale
> and recycling them that makes us hit the following BUG_ON at
> btrfs_release_extent_buffer_page:
> 
>     BUG_ON(extent_buffer_under_io(eb))
> 
> The BUG_ON is triggered because the extent buffer has the flag
> EXTENT_BUFFER_DIRTY set as a consequence of having been reused and made
> dirty by another concurrent task.

Awesome analysis!

> @@ -4768,6 +4768,25 @@ struct extent_buffer *find_extent_buffer(struct btrfs_fs_info *fs_info,
>  			       start >> PAGE_CACHE_SHIFT);
>  	if (eb && atomic_inc_not_zero(&eb->refs)) {
>  		rcu_read_unlock();
> +		/*
> +		 * Lock our eb's refs_lock to avoid races with
> +		 * free_extent_buffer. When we get our eb it might be flagged
> +		 * with EXTENT_BUFFER_STALE and another task running
> +		 * free_extent_buffer might have seen that flag set,
> +		 * eb->refs == 2, that the buffer isn't under IO (dirty and
> +		 * writeback flags not set) and it's still in the tree (flag
> +		 * EXTENT_BUFFER_TREE_REF set), therefore being in the process
> +		 * of decrementing the extent buffer's reference count twice.
> +		 * So here we could race and increment the eb's reference count,
> +		 * clear its stale flag, mark it as dirty and drop our reference
> +		 * before the other task finishes executing free_extent_buffer,
> +		 * which would later result in an attempt to free an extent
> +		 * buffer that is dirty.
> +		 */
> +		if (test_bit(EXTENT_BUFFER_STALE, &eb->bflags)) {
> +			spin_lock(&eb->refs_lock);
> +			spin_unlock(&eb->refs_lock);
> +		}
>  		mark_extent_buffer_accessed(eb, NULL);
>  		return eb;
>  	}

After staring at this (and the Lovecraftian horrors of free_extent_buffer())
for over an hour and trying to understand how and why this could even remotely
work, I cannot help but think that this fix would shift the race to the much
smaller window between the test_bit and the first spin_lock.
Essentially you subtly phase-shifted all participants and made them avoid the
race most of the time, yet I cannot help but think it's still there (just much
smaller), and could strike again with different scheduling intervals.

Would this be accurate?

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Filipe Manana April 23, 2015, 12:34 p.m. UTC | #2
On 04/23/2015 01:16 PM, Holger Hoffstätte wrote:
> On Thu, 23 Apr 2015 11:28:48 +0100, Filipe Manana wrote:
> 
>> There's a race between releasing extent buffers that are flagged as stale
>> and recycling them that makes us hit the following BUG_ON at
>> btrfs_release_extent_buffer_page:
>>
>>     BUG_ON(extent_buffer_under_io(eb))
>>
>> The BUG_ON is triggered because the extent buffer has the flag
>> EXTENT_BUFFER_DIRTY set as a consequence of having been reused and made
>> dirty by another concurrent task.
> 
> Awesome analysis!
> 
> After staring at this (and the Lovecraftian horrors of free_extent_buffer())
> for over an hour and trying to understand how and why this could even remotely
> work, I cannot help but think that this fix would shift the race to the much
> smaller window between the test_bit and the first spin_lock.
> Essentially you subtly phase-shifted all participants and made them avoid the
> race most of the time, yet I cannot help but think it's still there (just much
> smaller), and could strike again with different scheduling intervals.
> 
> Would this be accurate?

Hi Holger,

Can you explain how the race can still happen?

The goal here is just to make sure a reader does not advance too fast if
the eb is stale and there's a concurrent call to free_extent_buffer() in
progress.

thanks

> 
> -h
> 
Holger Hoffstätte April 23, 2015, 1:22 p.m. UTC | #3
On Thu, 23 Apr 2015 13:34:15 +0100, Filipe Manana wrote:

> On 04/23/2015 01:16 PM, Holger Hoffstätte wrote:
>> On Thu, 23 Apr 2015 11:28:48 +0100, Filipe Manana wrote:
>> 
>>> There's a race between releasing extent buffers that are flagged as stale
>>> and recycling them that makes us hit the following BUG_ON at
>>> btrfs_release_extent_buffer_page:
>>>
>>>     BUG_ON(extent_buffer_under_io(eb))
>>>
>>> The BUG_ON is triggered because the extent buffer has the flag
>>> EXTENT_BUFFER_DIRTY set as a consequence of having been reused and made
>>> dirty by another concurrent task.
>> 
>> Awesome analysis!
>> 
>> After staring at this (and the Lovecraftian horrors of free_extent_buffer())
>> for over an hour and trying to understand how and why this could even remotely
>> work, I cannot help but think that this fix would shift the race to the much
>> smaller window between the test_bit and the first spin_lock.
>> Essentially you subtly phase-shifted all participants and made them avoid the
>> race most of the time, yet I cannot help but think it's still there (just much
>> smaller), and could strike again with different scheduling intervals.
>> 
>> Would this be accurate?
> 
> Hi Holger,
> 
> Can you explain how the race can still happen?
> 
> The goal here is just to make sure a reader does not advance too fast if
> the eb is stale and there's a concurrent call to free_extent_buffer() in
> progress.

Yes, that much I understood. I look at this change from the perspective of
an optimizing compiler:

- without the new if-block we would fall through and mark_extent_buffer
while it's being deleted, which complains with a BUG.

- the new block therefore checks for the problematic state, but then does
something that - from a work perspective - could be eliminated (lock+unlock),
since nothing seemingly useful happens in between.

- that leaves the if without observable side effect and so could be
eliminated as well.

So theoretically we have not really "coordinated" anything. That made me
suspicious: at the very least I would have expected some kind of loop
or something that protects/reliably delays mark_extent_buffer so that it
really has a completion/atomicity guarantee, not just a small "bump" in
its timeline. You said that a "real fix" would be proper refcounting -
that's sort of what I was expecting, at least in terms of side effects.

Now, I understand that the block is not eliminated, but the if followed
by the lock acquisition is not atomic. That's where - very theoretically -
the same race as before could happen, e.g. on a single core when the thread
running free_extent_buffer is scheduled after the if but before the lock.
We'd then get the lock, immediately unlock it and proceed to mark.

I don't know if any of this can actually happen! It's just that I've never
seen a construct like this (and I like lock-free/wait-free coordination),
so this got me curious.

Thanks!
Holger

Filipe Manana April 23, 2015, 1:43 p.m. UTC | #4
On 04/23/2015 02:22 PM, Holger Hoffstätte wrote:
> On Thu, 23 Apr 2015 13:34:15 +0100, Filipe Manana wrote:
> 
>> On 04/23/2015 01:16 PM, Holger Hoffstätte wrote:
>>> On Thu, 23 Apr 2015 11:28:48 +0100, Filipe Manana wrote:
>>>
>>>> There's a race between releasing extent buffers that are flagged as stale
>>>> and recycling them that makes us hit the following BUG_ON at
>>>> btrfs_release_extent_buffer_page:
>>>>
>>>>     BUG_ON(extent_buffer_under_io(eb))
>>>>
>>>> The BUG_ON is triggered because the extent buffer has the flag
>>>> EXTENT_BUFFER_DIRTY set as a consequence of having been reused and made
>>>> dirty by another concurrent task.
>>>
>>> Awesome analysis!
>>>
>>> After staring at this (and the Lovecraftian horrors of free_extent_buffer())
>>> for over an hour and trying to understand how and why this could even remotely
>>> work, I cannot help but think that this fix would shift the race to the much
>>> smaller window between the test_bit and the first spin_lock.
>>> Essentially you subtly phase-shifted all participants and made them avoid the
>>> race most of the time, yet I cannot help but think it's still there (just much
>>> smaller), and could strike again with different scheduling intervals.
>>>
>>> Would this be accurate?
>>
>> Hi Holger,
>>
>> Can you explain how the race can still happen?
>>
>> The goal here is just to make sure a reader does not advance too fast if
>> the eb is stale and there's a concurrent call to free_extent_buffer() in
>> progress.
> 
> Yes, that much I understood. I look at this change from the perspective of
> an optimizing compiler:
> 
> - without the new if-block we would fall through and mark_extent_buffer
> while it's being deleted, which complains with a BUG.
> 
> - the new block therefore checks for the problematic state, but then does
> something that - from a work perspective - could be eliminated (lock+unlock),
> since nothing seemimgly useful happens inbetween.
> 
> -that leaves the if without observable side effect and so could be
> eliminated as well.

I don't think a lock followed by an unlock with nothing in between (be
it a spinlock, mutex, or any other kind of lock) will be seen by the
compiler as a nop. Pretty sure I've seen this pattern being done in the
kernel and in many other places as a mechanism to wait for something.

> 
> So theoretically we have not really "coordinated" anything. That made me
> suspicious: at the very least I would have expected some kind of loop
> or something that protects/reliably delays mark_extent_buffer so that it
> really has a completion/atomicity guarantee, not just a small "bump" in
> its timeline. You said that a "real fix" would be proper refcounting -
> that's sort of what I was expecting, at least in terms of side effects.

I didn't say a "real fix", but rather a "cleaner alternative" fix.

> 
> Now, I understand that the block is not eliminated, but the if followed
> by the lock acquiry is not atomic. That's where - very theoretically -
> the same race as before could happen e.g. on a single core and the thread
> running free_extent_buffer is scheduled after the if but before the lock.
> We'd then get the lock, immediately unlock it and proceed to mark.

That can't happen. If the thread running free_extent_buffer() is
scheduled after the "if" but before the "lock", it will see refs == 3
and therefore decrement the ref count by 1 only, leaving the tree_ref
flag set:

void free_extent_buffer(struct extent_buffer *eb)
{

(...)

	spin_lock(&eb->refs_lock);
	if (atomic_read(&eb->refs) == 2 &&
	    test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags))
		atomic_dec(&eb->refs);

	if (atomic_read(&eb->refs) == 2 &&
	    test_bit(EXTENT_BUFFER_STALE, &eb->bflags) &&
	    !extent_buffer_under_io(eb) &&
	    test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
		atomic_dec(&eb->refs);

	release_extent_buffer(eb);
}


> 
> I don't know if any of this can actually happen! It's just that I've never
> seen a construct like this (and I like lock-free/wait-free coordination),
> so this got me curious.
> 
> Thanks!
> Holger
> 
Holger Hoffstätte April 23, 2015, 1:53 p.m. UTC | #5
On Thu, 23 Apr 2015 14:43:40 +0100, Filipe Manana wrote:

> I don't think a lock followed by an unlock with nothing in between (be
> it a spinlock, mutex, or any other kind of lock) will be seen by the
> compiler as a nop. Pretty sure I've seen this pattern being done in the

No, I didn't say they would - that would be wrong. I just found it odd,
that's all.

> kernel and in many other places as a mechanism to wait for something.

I also completely forgot that spinlocks disable preemption, since
otherwise nothing would really work. That's the real reason why
any of this works. Well, that and the refcount==2 thing.

Cool! Thanks!
Holger

Filipe Manana April 24, 2015, 4:08 p.m. UTC | #6
On 04/23/2015 02:53 PM, Holger Hoffstätte wrote:
> On Thu, 23 Apr 2015 14:43:40 +0100, Filipe Manana wrote:
> 
>> I don't think a lock followed by an unlock with nothing in between (be
>> it a spinlock, mutex, or any other kind of lock) will be seen by the
>> compiler as a nop. Pretty sure I've seen this pattern being done in the
> 
> No, I didn't say they would - that would be wrong. I just found it odd,
> that's all.

For reference, there are plenty of examples in the kernel source tree:

$ find . -name '*.c' | xargs pcregrep --line-offsets -nrM '^\s*spin_lock.*\n^\s*spin_unlock'
(...)
./fs/ext4/balloc.c:644:0,68
(...)
./mm/truncate.c:458:0,51
(...)

$ cat ./fs/ext4/balloc.c | head -646 | tail -4
	if (!(*errp) && (flags & EXT4_MB_DELALLOC_RESERVED)) {
		spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
		spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
		dquot_alloc_block_nofail(inode,

$ cat ./mm/truncate.c | head -460 | tail -9
		/*
		 * As truncation uses a lockless tree lookup, cycle
		 * the tree lock to make sure any ongoing tree
		 * modification that does not see AS_EXITING is
		 * completed before starting the final truncate.
		 */
		spin_lock_irq(&mapping->tree_lock);
		spin_unlock_irq(&mapping->tree_lock);


> 
>> kernel and in many other places as a mechanism to wait for something.
> 
> I also completely forgot that spinlocks disable preemption, since
> otherwise nothing would really work. That's the real reason why
> any of this works. Well, that and the refcount==2 thing.
> 
> Cool! Thanks!
> Holger
> 
David Sterba May 7, 2015, 4:19 p.m. UTC | #7
On Thu, Apr 23, 2015 at 11:28:48AM +0100, Filipe Manana wrote:
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Phew, great catch and fix.

Reviewed-by: David Sterba <dsterba@suse.cz>

Chris, please try to squeeze it into 4.1-rc.
Chris Mason May 7, 2015, 7:30 p.m. UTC | #8
On 05/07/2015 12:19 PM, David Sterba wrote:
> On Thu, Apr 23, 2015 at 11:28:48AM +0100, Filipe Manana wrote:
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> 
> Phew, great catch and fix.
> 
> Reviewed-by: David Sterba <dsterba@suse.cz>
> 
> Chris, please try to squeeze it into 4.1-rc.
> 

Yes, I've been giving this one extra soak time, but I'll get it into 4.1

-chris

David Sterba May 12, 2015, 1:04 p.m. UTC | #9
On Thu, Apr 23, 2015 at 12:16:21PM +0000, Holger Hoffstätte wrote:
> After staring at this (and the Lovecraftian horrors of free_extent_buffer())
> for over an hour and trying to understand how and why this could even remotely
> work, I cannot help but think that this fix would shift the race to the much
> smaller window between the test_bit and the first spin_lock.

Btw, I had exactly the same concerns, as this is close to the classical
example of a shared resource check outside of the critical section. But
the extent buffers are protected by more than just the spinlock, and if
the race mentioned above happens, the lifted reference counts will
prevent the last reference from being dropped.

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 62e8706..46d6e68 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4768,6 +4768,25 @@  struct extent_buffer *find_extent_buffer(struct btrfs_fs_info *fs_info,
 			       start >> PAGE_CACHE_SHIFT);
 	if (eb && atomic_inc_not_zero(&eb->refs)) {
 		rcu_read_unlock();
+		/*
+		 * Lock our eb's refs_lock to avoid races with
+		 * free_extent_buffer. When we get our eb it might be flagged
+		 * with EXTENT_BUFFER_STALE and another task running
+		 * free_extent_buffer might have seen that flag set,
+		 * eb->refs == 2, that the buffer isn't under IO (dirty and
+		 * writeback flags not set) and it's still in the tree (flag
+		 * EXTENT_BUFFER_TREE_REF set), therefore being in the process
+		 * of decrementing the extent buffer's reference count twice.
+		 * So here we could race and increment the eb's reference count,
+		 * clear its stale flag, mark it as dirty and drop our reference
+		 * before the other task finishes executing free_extent_buffer,
+		 * which would later result in an attempt to free an extent
+		 * buffer that is dirty.
+		 */
+		if (test_bit(EXTENT_BUFFER_STALE, &eb->bflags)) {
+			spin_lock(&eb->refs_lock);
+			spin_unlock(&eb->refs_lock);
+		}
 		mark_extent_buffer_accessed(eb, NULL);
 		return eb;
 	}