diff mbox series

btrfs: fix rw_devices count in __btrfs_free_extra_devids

Message ID b3a0a629df98bd044a1fd5c4964f381ff6e7aa05.1600777827.git.anand.jain@oracle.com (mailing list archive)
State New, archived
Headers show
Series btrfs: fix rw_devices count in __btrfs_free_extra_devids | expand

Commit Message

Anand Jain Sept. 22, 2020, 12:30 p.m. UTC
syzbot reported a warning [1] in close_fs_devcies() which it reproduces
using a crafted image.

        WARN_ON(fs_devices->rw_devices);

The crafted image successfully creates a replace-device with the devid 0.
But as there isn't any replace-item. We clean the extra the devid 0, at
__btrfs_free_extra_devids().

rw_devices is incremented in btrfs_open_one_device() for all write-able
devices except for devid == BTRFS_DEV_REPLACE_DEVID.
But while we clean up the extra devices in __btrfs_free_extra_devids()
we used the BTRFS_DEV_STATE_REPLACE_TGT flag which isn't set because
there isn't the replace-item. So rw_devices went below zero.

So let __btrfs_free_extra_devids() also depend on the
devid != BTRFS_DEV_REPLACE_DEVID to manage the rw_devices.

[1]
WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x198/0x1fd lib/dump_stack.c:118
 panic+0x347/0x7c0 kernel/panic.c:231
 __warn.cold+0x20/0x46 kernel/panic.c:600
 report_bug+0x1bd/0x210 lib/bug.c:198
 handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234
 exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254
 asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536
RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
Code: 0f b6 04 02 84 c0 74 02 7e 33 48 8b 44 24 18 c6 80 30 01 00 00 00 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 e8 99 ce 6a fe <0f> 0b e9 71 ff ff ff e8 8d ce 6a fe 0f 0b e9 20 ff ff ff e8 d1 d5
RSP: 0018:ffffc900091777e0 EFLAGS: 00010246
RAX: 0000000000040000 RBX: ffffffffffffffff RCX: ffffc9000c8b7000
RDX: 0000000000040000 RSI: ffffffff83097f47 RDI: 0000000000000007
RBP: dffffc0000000000 R08: 0000000000000001 R09: ffff8880988a187f
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88809593a130
R13: ffff88809593a1ec R14: ffff8880988a1908 R15: ffff88809593a050
 close_fs_devices fs/btrfs/volumes.c:1193 [inline]
 btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
 open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434
 btrfs_fill_super fs/btrfs/super.c:1316 [inline]
 btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672
 legacy_get_tree+0x105/0x220 fs/fs_context.c:592
 vfs_get_tree+0x89/0x2f0 fs/super.c:1547
 fc_mount fs/namespace.c:978 [inline]
 vfs_kern_mount.part.0+0xd3/0x170 fs/namespace.c:1008
 vfs_kern_mount+0x3c/0x60 fs/namespace.c:995
 btrfs_mount+0x234/0xaa0 fs/btrfs/super.c:1732
 legacy_get_tree+0x105/0x220 fs/fs_context.c:592
 vfs_get_tree+0x89/0x2f0 fs/super.c:1547
 do_new_mount fs/namespace.c:2875 [inline]
 path_mount+0x1387/0x2070 fs/namespace.c:3192
 do_mount fs/namespace.c:3205 [inline]
 __do_sys_mount fs/namespace.c:3413 [inline]
 __se_sys_mount fs/namespace.c:3390 [inline]
 __x64_sys_mount+0x27f/0x300 fs/namespace.c:3390
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x46004a
Code: b8 a6 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 fd 89 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 da 89 fb ff c3 66 0f 1f 84 00 00 00 00 00
RSP: 002b:00007f414d78da88 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007f414d78db20 RCX: 000000000046004a
RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007f414d78dae0
RBP: 00007f414d78dae0 R08: 00007f414d78db20 R09: 0000000020000000
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000020000000
R13: 0000000020000100 R14: 0000000020000200 R15: 000000002001a800

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/volumes.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Comments

Josef Bacik Sept. 22, 2020, 1:08 p.m. UTC | #1
On 9/22/20 8:33 AM, Anand Jain wrote:
> syzbot reported a warning [1] in close_fs_devcies() which it reproduces
> using a crafted image.
> 
>          WARN_ON(fs_devices->rw_devices);
> 
> The crafted image successfully creates a replace-device with the devid 0.
> But as there isn't any replace-item. We clean the extra the devid 0, at
> __btrfs_free_extra_devids().
> 
> rw_devices is incremented in btrfs_open_one_device() for all write-able
> devices except for devid == BTRFS_DEV_REPLACE_DEVID.
> But while we clean up the extra devices in __btrfs_free_extra_devids()
> we used the BTRFS_DEV_STATE_REPLACE_TGT flag which isn't set because
> there isn't the replace-item. So rw_devices went below zero.
> 
> So let __btrfs_free_extra_devids() also depend on the
> devid != BTRFS_DEV_REPLACE_DEVID to manage the rw_devices.
> 

This is an invalid state for the fs to be in, I'd rather fix it by detecting we 
have a devid == BTRFS_DEV_REPLACE_DEVID with no corresponding dev_replace item 
and fail out before we get to this point.  Thanks,

Josef
Anand Jain Sept. 23, 2020, 4:42 a.m. UTC | #2
On 22/9/20 9:08 pm, Josef Bacik wrote:
> On 9/22/20 8:33 AM, Anand Jain wrote:
>> syzbot reported a warning [1] in close_fs_devcies() which it reproduces
>> using a crafted image.
>>
>>          WARN_ON(fs_devices->rw_devices);
>>
>> The crafted image successfully creates a replace-device with the devid 0.
>> But as there isn't any replace-item. We clean the extra the devid 0, at
>> __btrfs_free_extra_devids().
>>
>> rw_devices is incremented in btrfs_open_one_device() for all write-able
>> devices except for devid == BTRFS_DEV_REPLACE_DEVID.
>> But while we clean up the extra devices in __btrfs_free_extra_devids()
>> we used the BTRFS_DEV_STATE_REPLACE_TGT flag which isn't set because
>> there isn't the replace-item. So rw_devices went below zero.
>>
>> So let __btrfs_free_extra_devids() also depend on the
>> devid != BTRFS_DEV_REPLACE_DEVID to manage the rw_devices.
>>
> 
> This is an invalid state for the fs to be in,

OK, to be more specific. There is an alien device that is pretending to 
be the replace-target (devid = 0).

 > I'd rather fix it by
 > detecting we have a devid == BTRFS_DEV_REPLACE_DEVID with no
 > corresponding dev_replace item and fail out before we get to this
 > point.  Thanks,

Yes. __btrfs_free_extra_devids() is already doing in a way the same.

------------------------------------
1040 static void __btrfs_free_extra_devids(struct btrfs_fs_devices 
*fs_devices,
::
1059 if (device->devid == BTRFS_DEV_REPLACE_DEVID) {
::
1070 if (step == 0 || test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
1071 &device->dev_state)) {
1072 continue;
1073 }
------------------------------------

OR I did not understand what do you mean.

Thanks, Anand

> 
> Josef
Josef Bacik Sept. 23, 2020, 1:42 p.m. UTC | #3
On 9/23/20 12:42 AM, Anand Jain wrote:
> 
> 
> On 22/9/20 9:08 pm, Josef Bacik wrote:
>> On 9/22/20 8:33 AM, Anand Jain wrote:
>>> syzbot reported a warning [1] in close_fs_devcies() which it reproduces
>>> using a crafted image.
>>>
>>>          WARN_ON(fs_devices->rw_devices);
>>>
>>> The crafted image successfully creates a replace-device with the devid 0.
>>> But as there isn't any replace-item. We clean the extra the devid 0, at
>>> __btrfs_free_extra_devids().
>>>
>>> rw_devices is incremented in btrfs_open_one_device() for all write-able
>>> devices except for devid == BTRFS_DEV_REPLACE_DEVID.
>>> But while we clean up the extra devices in __btrfs_free_extra_devids()
>>> we used the BTRFS_DEV_STATE_REPLACE_TGT flag which isn't set because
>>> there isn't the replace-item. So rw_devices went below zero.
>>>
>>> So let __btrfs_free_extra_devids() also depend on the
>>> devid != BTRFS_DEV_REPLACE_DEVID to manage the rw_devices.
>>>
>>
>> This is an invalid state for the fs to be in,
> 
> OK, to be more specific. There is an alien device that is pretending to be the 
> replace-target (devid = 0).
> 
>  > I'd rather fix it by
>  > detecting we have a devid == BTRFS_DEV_REPLACE_DEVID with no
>  > corresponding dev_replace item and fail out before we get to this
>  > point.  Thanks,
> 
> Yes. __btrfs_free_extra_devids() is already doing in a way the same.
> 
> ------------------------------------
> 1040 static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
> ::
> 1059 if (device->devid == BTRFS_DEV_REPLACE_DEVID) {
> ::
> 1070 if (step == 0 || test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
> 1071 &device->dev_state)) {
> 1072 continue;
> 1073 }
> ------------------------------------
> 
> OR I did not understand what do you mean.
> 

Yeah I mean we do something in btrfs_init_dev_replace(), like when we search for 
the key, we double check to make sure we don't have a devid == 
BTRFS_DEV_REPLACE_DEVID in our devices if we don't find a key.  If we do we 
return -EIO and bail out of the mount.  Thanks,

Josef
Anand Jain Sept. 24, 2020, 5:19 a.m. UTC | #4
On 23/9/20 9:42 pm, Josef Bacik wrote:
> On 9/23/20 12:42 AM, Anand Jain wrote:
>>
>>
>> On 22/9/20 9:08 pm, Josef Bacik wrote:
>>> On 9/22/20 8:33 AM, Anand Jain wrote:
>>>> syzbot reported a warning [1] in close_fs_devcies() which it reproduces
>>>> using a crafted image.
>>>>
>>>>          WARN_ON(fs_devices->rw_devices);
>>>>
>>>> The crafted image successfully creates a replace-device with the 
>>>> devid 0.
>>>> But as there isn't any replace-item. We clean the extra the devid 0, at
>>>> __btrfs_free_extra_devids().
>>>>
>>>> rw_devices is incremented in btrfs_open_one_device() for all write-able
>>>> devices except for devid == BTRFS_DEV_REPLACE_DEVID.
>>>> But while we clean up the extra devices in __btrfs_free_extra_devids()
>>>> we used the BTRFS_DEV_STATE_REPLACE_TGT flag which isn't set because
>>>> there isn't the replace-item. So rw_devices went below zero.
>>>>
>>>> So let __btrfs_free_extra_devids() also depend on the
>>>> devid != BTRFS_DEV_REPLACE_DEVID to manage the rw_devices.
>>>>
>>>
>>> This is an invalid state for the fs to be in,
>>
>> OK, to be more specific. There is an alien device that is pretending 
>> to be the replace-target (devid = 0).
>>
>>  > I'd rather fix it by
>>  > detecting we have a devid == BTRFS_DEV_REPLACE_DEVID with no
>>  > corresponding dev_replace item and fail out before we get to this
>>  > point.  Thanks,
>>
>> Yes. __btrfs_free_extra_devids() is already doing in a way the same.
>>
>> ------------------------------------
>> 1040 static void __btrfs_free_extra_devids(struct btrfs_fs_devices 
>> *fs_devices,
>> ::
>> 1059 if (device->devid == BTRFS_DEV_REPLACE_DEVID) {
>> ::
>> 1070 if (step == 0 || test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
>> 1071 &device->dev_state)) {
>> 1072 continue;
>> 1073 }
>> ------------------------------------
>>
>> OR I did not understand what do you mean.
>>
> 
> Yeah I mean we do something in btrfs_init_dev_replace(), like when we 
> search for the key, we double check to make sure we don't have a devid 
> == BTRFS_DEV_REPLACE_DEVID in our devices if we don't find a key.  If we 
> do we return -EIO and bail out of the mount.  Thanks,

  That's brilliant approach. Let me try.

Thanks, Anand

> 
> Josef
David Sterba Sept. 24, 2020, 11:25 a.m. UTC | #5
On Wed, Sep 23, 2020 at 09:42:17AM -0400, Josef Bacik wrote:
> On 9/23/20 12:42 AM, Anand Jain wrote:
> > On 22/9/20 9:08 pm, Josef Bacik wrote:
> >> On 9/22/20 8:33 AM, Anand Jain wrote:

> Yeah I mean we do something in btrfs_init_dev_replace(), like when we search for 
> the key, we double check to make sure we don't have a devid == 
> BTRFS_DEV_REPLACE_DEVID in our devices if we don't find a key.  If we do we 
> return -EIO and bail out of the mount.  Thanks,

From user perspective, then do what? Or do we treat this with minimal
efforts to provide a sane fallback and error handling just to pass
fuzzers (like in many other cases)?
Josef Bacik Sept. 24, 2020, 2:02 p.m. UTC | #6
On 9/24/20 7:25 AM, David Sterba wrote:
> On Wed, Sep 23, 2020 at 09:42:17AM -0400, Josef Bacik wrote:
>> On 9/23/20 12:42 AM, Anand Jain wrote:
>>> On 22/9/20 9:08 pm, Josef Bacik wrote:
>>>> On 9/22/20 8:33 AM, Anand Jain wrote:
> 
>> Yeah I mean we do something in btrfs_init_dev_replace(), like when we search for
>> the key, we double check to make sure we don't have a devid ==
>> BTRFS_DEV_REPLACE_DEVID in our devices if we don't find a key.  If we do we
>> return -EIO and bail out of the mount.  Thanks,
> 
>  From user perspective, then do what? Or do we treat this with minimal
> efforts to provide a sane fallback and error handling just to pass
> fuzzers (like in many other cases)?
> 

That's a question for fsck.  I don't want to spend a lot of time chasing 
imaginary cases that fuzzers come up with, I just want them to fail as quickly 
as possible so we can move on with our lives.

If this happened in the real world then it would be because we either

1) Lost the replace item somehow?
2) Got a random corruption that changed the devid to 0

I think for #1 it's impossible to detect really, unless you can tell which 
device was being replaced somehow?  I'm not sure  how you would do that, I'm not 
familiar enough with the replace code to see if we could figure that out.

For #2 it should be straightforward, as long as we can determine that we really 
weren't doing a device replace, then we just change the devid to 1 or something 
and carry on with life?  Thanks,

Josef
Anand Jain Sept. 25, 2020, 10:11 a.m. UTC | #7
On 24/9/20 10:02 pm, Josef Bacik wrote:
> On 9/24/20 7:25 AM, David Sterba wrote:
>> On Wed, Sep 23, 2020 at 09:42:17AM -0400, Josef Bacik wrote:
>>> On 9/23/20 12:42 AM, Anand Jain wrote:
>>>> On 22/9/20 9:08 pm, Josef Bacik wrote:
>>>>> On 9/22/20 8:33 AM, Anand Jain wrote:
>>
>>> Yeah I mean we do something in btrfs_init_dev_replace(), like when we 
>>> search for
>>> the key, we double check to make sure we don't have a devid ==
>>> BTRFS_DEV_REPLACE_DEVID in our devices if we don't find a key. 


>>> If we 
>>> do we
>>> return -EIO and bail out of the mount.  Thanks,


I read fast and missed the bailout part before.

If we bailout the mount, it means a btrfs rootfs can fail to boot up.

To recover from it, the user has to remove the trespassing/extra device
manually and reboot.
For a non-rootfs, the user would have to remove the device manually and run
'btrfs dev scan --forget' to free up the extra devices.
What we are doing now is removing the extra/trespassing device
internally.

IMO. The case of trespassing/extra device trying to sabotage the setup
is a bit different from a corrupted device, in the former case
resilience is preferred?

Thanks, Anand


>>
>>  From user perspective, then do what? Or do we treat this with minimal
>> efforts to provide a sane fallback and error handling just to pass
>> fuzzers (like in many other cases)?
>>
> 
> That's a question for fsck.  I don't want to spend a lot of time chasing 
> imaginary cases that fuzzers come up with, I just want them to fail as 
> quickly as possible so we can move on with our lives.
> 
> If this happened in the real world then it would be because we either
> 
> 1) Lost the replace item somehow?
> 2) Got a random corruption that changed the devid to 0
> 
> I think for #1 it's impossible to detect really, unless you can tell 
> which device was being replaced somehow?  I'm not sure  how you would do 
> that, I'm not familiar enough with the replace code to see if we could 
> figure that out.
> 
> For #2 it should be straightforward, as long as we can determine that we 
> really weren't doing a device replace, then we just change the devid to 
> 1 or something and carry on with life?  Thanks,
> 




> Josef
Josef Bacik Sept. 25, 2020, 2:28 p.m. UTC | #8
On 9/25/20 6:11 AM, Anand Jain wrote:
> On 24/9/20 10:02 pm, Josef Bacik wrote:
>> On 9/24/20 7:25 AM, David Sterba wrote:
>>> On Wed, Sep 23, 2020 at 09:42:17AM -0400, Josef Bacik wrote:
>>>> On 9/23/20 12:42 AM, Anand Jain wrote:
>>>>> On 22/9/20 9:08 pm, Josef Bacik wrote:
>>>>>> On 9/22/20 8:33 AM, Anand Jain wrote:
>>>
>>>> Yeah I mean we do something in btrfs_init_dev_replace(), like when we search 
>>>> for
>>>> the key, we double check to make sure we don't have a devid ==
>>>> BTRFS_DEV_REPLACE_DEVID in our devices if we don't find a key. 
> 
> 
>>>> If we do we
>>>> return -EIO and bail out of the mount.  Thanks,
> 
> 
> I read fast and missed the bailout part before.
> 
> If we bailout the mount, it means a btrfs rootfs can fail to boot up.
> 
> To recover from it, the user has to remove the trespassing/extra device
> manually and reboot.
> For a non-rootfs, the user would have to remove the device manually and run
> 'btrfs dev scan --forget' to free up the extra devices.
> What we are doing now is removing the extra/trespassing device
> internally.
> 
> IMO. The case of trespassing/extra device trying to sabotage the setup
> is a bit different from a corrupted device, in the former case
> resilience is preferred?
> 

Well this doesn't happen in real life right?  This is purely from a fuzzing 
standpoint, so while resilience should be the first thing we shoot for, I'd 
rather not spend a long time trying to make it work.

In the case of just randomly deleting a device, I don't think that's a decision 
that the kernel can/should make, we should require a user to intervene at that 
point.  That makes failure the best option here, thanks,

Josef
Anand Jain Oct. 6, 2020, 1:12 p.m. UTC | #9
On 25/9/20 10:28 pm, Josef Bacik wrote:
> On 9/25/20 6:11 AM, Anand Jain wrote:
>> On 24/9/20 10:02 pm, Josef Bacik wrote:
>>> On 9/24/20 7:25 AM, David Sterba wrote:
>>>> On Wed, Sep 23, 2020 at 09:42:17AM -0400, Josef Bacik wrote:
>>>>> On 9/23/20 12:42 AM, Anand Jain wrote:
>>>>>> On 22/9/20 9:08 pm, Josef Bacik wrote:
>>>>>>> On 9/22/20 8:33 AM, Anand Jain wrote:
>>>>
>>>>> Yeah I mean we do something in btrfs_init_dev_replace(), like when 
>>>>> we search for
>>>>> the key, we double check to make sure we don't have a devid ==
>>>>> BTRFS_DEV_REPLACE_DEVID in our devices if we don't find a key. 
>>
>>
>>>>> If we do we
>>>>> return -EIO and bail out of the mount.  Thanks,
>>
>>
>> I read fast and missed the bailout part before.
>>
>> If we bailout the mount, it means a btrfs rootfs can fail to boot up.
>>
>> To recover from it, the user has to remove the trespassing/extra device
>> manually and reboot.
>> For a non-rootfs, the user would have to remove the device manually 
>> and run
>> 'btrfs dev scan --forget' to free up the extra devices.
>> What we are doing now is removing the extra/trespassing device
>> internally.
>>
>> IMO. The case of trespassing/extra device trying to sabotage the setup
>> is a bit different from a corrupted device, in the former case
>> resilience is preferred?
>>
> 
> Well this doesn't happen in real life right?  This is purely from a 
> fuzzing standpoint, so while resilience should be the first thing we 
> shoot for, I'd rather not spend a long time trying to make it work.
> 
> In the case of just randomly deleting a device, I don't think that's a 
> decision that the kernel can/should make, we should require a user to 
> intervene at that point.  That makes failure the best option here, thanks,
> 

  It makes sense to me, its different from what we had before.
  Made those changes in v2.

Thanks, Anand

> Josef
diff mbox series

Patch

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ec9dac40b4f1..2fd73eab6219 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1080,8 +1080,7 @@  static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
 		if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
 			list_del_init(&device->dev_alloc_list);
 			clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
-			if (!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
-				      &device->dev_state))
+			if (device->devid != BTRFS_DEV_REPLACE_DEVID)
 				fs_devices->rw_devices--;
 		}
 		list_del_init(&device->dev_list);