Message ID | b3a0a629df98bd044a1fd5c4964f381ff6e7aa05.1600777827.git.anand.jain@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | btrfs: fix rw_devices count in __btrfs_free_extra_devids | expand |
On 9/22/20 8:33 AM, Anand Jain wrote: > syzbot reported a warning [1] in close_fs_devcies() which it reproduces > using a crafted image. > > WARN_ON(fs_devices->rw_devices); > > The crafted image successfully creates a replace-device with the devid 0. > But as there isn't any replace-item. We clean the extra the devid 0, at > __btrfs_free_extra_devids(). > > rw_devices is incremented in btrfs_open_one_device() for all write-able > devices except for devid == BTRFS_DEV_REPLACE_DEVID. > But while we clean up the extra devices in __btrfs_free_extra_devids() > we used the BTRFS_DEV_STATE_REPLACE_TGT flag which isn't set because > there isn't the replace-item. So rw_devices went below zero. > > So let __btrfs_free_extra_devids() also depend on the > devid != BTRFS_DEV_REPLACE_DEVID to manage the rw_devices. > This is an invalid state for the fs to be in, I'd rather fix it by detecting we have a devid == BTRFS_DEV_REPLACE_DEVID with no corresponding dev_replace item and fail out before we get to this point. Thanks, Josef
On 22/9/20 9:08 pm, Josef Bacik wrote: > On 9/22/20 8:33 AM, Anand Jain wrote: >> syzbot reported a warning [1] in close_fs_devcies() which it reproduces >> using a crafted image. >> >> WARN_ON(fs_devices->rw_devices); >> >> The crafted image successfully creates a replace-device with the devid 0. >> But as there isn't any replace-item. We clean the extra the devid 0, at >> __btrfs_free_extra_devids(). >> >> rw_devices is incremented in btrfs_open_one_device() for all write-able >> devices except for devid == BTRFS_DEV_REPLACE_DEVID. >> But while we clean up the extra devices in __btrfs_free_extra_devids() >> we used the BTRFS_DEV_STATE_REPLACE_TGT flag which isn't set because >> there isn't the replace-item. So rw_devices went below zero. >> >> So let __btrfs_free_extra_devids() also depend on the >> devid != BTRFS_DEV_REPLACE_DEVID to manage the rw_devices. >> > > This is an invalid state for the fs to be in, OK, to be more specific. There is an alien device that is pretending to be the replace-target (devid = 0). > I'd rather fix it by > detecting we have a devid == BTRFS_DEV_REPLACE_DEVID with no > corresponding dev_replace item and fail out before we get to this > point. Thanks, Yes. __btrfs_free_extra_devids() is already doing in a way the same. ------------------------------------ 1040 static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, :: 1059 if (device->devid == BTRFS_DEV_REPLACE_DEVID) { :: 1070 if (step == 0 || test_bit(BTRFS_DEV_STATE_REPLACE_TGT, 1071 &device->dev_state)) { 1072 continue; 1073 } ------------------------------------ OR I did not understand what do you mean. Thanks, Anand > > Josef
On 9/23/20 12:42 AM, Anand Jain wrote: > > > On 22/9/20 9:08 pm, Josef Bacik wrote: >> On 9/22/20 8:33 AM, Anand Jain wrote: >>> syzbot reported a warning [1] in close_fs_devcies() which it reproduces >>> using a crafted image. >>> >>> WARN_ON(fs_devices->rw_devices); >>> >>> The crafted image successfully creates a replace-device with the devid 0. >>> But as there isn't any replace-item. We clean the extra the devid 0, at >>> __btrfs_free_extra_devids(). >>> >>> rw_devices is incremented in btrfs_open_one_device() for all write-able >>> devices except for devid == BTRFS_DEV_REPLACE_DEVID. >>> But while we clean up the extra devices in __btrfs_free_extra_devids() >>> we used the BTRFS_DEV_STATE_REPLACE_TGT flag which isn't set because >>> there isn't the replace-item. So rw_devices went below zero. >>> >>> So let __btrfs_free_extra_devids() also depend on the >>> devid != BTRFS_DEV_REPLACE_DEVID to manage the rw_devices. >>> >> >> This is an invalid state for the fs to be in, > > OK, to be more specific. There is an alien device that is pretending to be the > replace-target (devid = 0). > > > I'd rather fix it by > > detecting we have a devid == BTRFS_DEV_REPLACE_DEVID with no > > corresponding dev_replace item and fail out before we get to this > > point. Thanks, > > Yes. __btrfs_free_extra_devids() is already doing in a way the same. > > ------------------------------------ > 1040 static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, > :: > 1059 if (device->devid == BTRFS_DEV_REPLACE_DEVID) { > :: > 1070 if (step == 0 || test_bit(BTRFS_DEV_STATE_REPLACE_TGT, > 1071 &device->dev_state)) { > 1072 continue; > 1073 } > ------------------------------------ > > OR I did not understand what do you mean. > Yeah I mean we do something in btrfs_init_dev_replace(), like when we search for the key, we double check to make sure we don't have a devid == BTRFS_DEV_REPLACE_DEVID in our devices if we don't find a key. If we do we return -EIO and bail out of the mount. Thanks, Josef
On 23/9/20 9:42 pm, Josef Bacik wrote: > On 9/23/20 12:42 AM, Anand Jain wrote: >> >> >> On 22/9/20 9:08 pm, Josef Bacik wrote: >>> On 9/22/20 8:33 AM, Anand Jain wrote: >>>> syzbot reported a warning [1] in close_fs_devcies() which it reproduces >>>> using a crafted image. >>>> >>>> WARN_ON(fs_devices->rw_devices); >>>> >>>> The crafted image successfully creates a replace-device with the >>>> devid 0. >>>> But as there isn't any replace-item. We clean the extra the devid 0, at >>>> __btrfs_free_extra_devids(). >>>> >>>> rw_devices is incremented in btrfs_open_one_device() for all write-able >>>> devices except for devid == BTRFS_DEV_REPLACE_DEVID. >>>> But while we clean up the extra devices in __btrfs_free_extra_devids() >>>> we used the BTRFS_DEV_STATE_REPLACE_TGT flag which isn't set because >>>> there isn't the replace-item. So rw_devices went below zero. >>>> >>>> So let __btrfs_free_extra_devids() also depend on the >>>> devid != BTRFS_DEV_REPLACE_DEVID to manage the rw_devices. >>>> >>> >>> This is an invalid state for the fs to be in, >> >> OK, to be more specific. There is an alien device that is pretending >> to be the replace-target (devid = 0). >> >> > I'd rather fix it by >> > detecting we have a devid == BTRFS_DEV_REPLACE_DEVID with no >> > corresponding dev_replace item and fail out before we get to this >> > point. Thanks, >> >> Yes. __btrfs_free_extra_devids() is already doing in a way the same. >> >> ------------------------------------ >> 1040 static void __btrfs_free_extra_devids(struct btrfs_fs_devices >> *fs_devices, >> :: >> 1059 if (device->devid == BTRFS_DEV_REPLACE_DEVID) { >> :: >> 1070 if (step == 0 || test_bit(BTRFS_DEV_STATE_REPLACE_TGT, >> 1071 &device->dev_state)) { >> 1072 continue; >> 1073 } >> ------------------------------------ >> >> OR I did not understand what do you mean. >> > > Yeah I mean we do something in btrfs_init_dev_replace(), like when we > search for the key, we double check to make sure we don't have a devid > == BTRFS_DEV_REPLACE_DEVID in our devices if we don't find a key. If we > do we return -EIO and bail out of the mount. Thanks, That's brilliant approach. Let me try. Thanks, Anand > > Josef
On Wed, Sep 23, 2020 at 09:42:17AM -0400, Josef Bacik wrote: > On 9/23/20 12:42 AM, Anand Jain wrote: > > On 22/9/20 9:08 pm, Josef Bacik wrote: > >> On 9/22/20 8:33 AM, Anand Jain wrote: > Yeah I mean we do something in btrfs_init_dev_replace(), like when we search for > the key, we double check to make sure we don't have a devid == > BTRFS_DEV_REPLACE_DEVID in our devices if we don't find a key. If we do we > return -EIO and bail out of the mount. Thanks, From user perspective, then do what? Or do we treat this with minimal efforts to provide a sane fallback and error handling just to pass fuzzers (like in many other cases)?
On 9/24/20 7:25 AM, David Sterba wrote: > On Wed, Sep 23, 2020 at 09:42:17AM -0400, Josef Bacik wrote: >> On 9/23/20 12:42 AM, Anand Jain wrote: >>> On 22/9/20 9:08 pm, Josef Bacik wrote: >>>> On 9/22/20 8:33 AM, Anand Jain wrote: > >> Yeah I mean we do something in btrfs_init_dev_replace(), like when we search for >> the key, we double check to make sure we don't have a devid == >> BTRFS_DEV_REPLACE_DEVID in our devices if we don't find a key. If we do we >> return -EIO and bail out of the mount. Thanks, > > From user perspective, then do what? Or do we treat this with minimal > efforts to provide a sane fallback and error handling just to pass > fuzzers (like in many other cases)? > That's a question for fsck. I don't want to spend a lot of time chasing imaginary cases that fuzzers come up with, I just want them to fail as quickly as possible so we can move on with our lives. If this happened in the real world then it would be because we either 1) Lost the replace item somehow? 2) Got a random corruption that changed the devid to 0 I think for #1 it's impossible to detect really, unless you can tell which device was being replaced somehow? I'm not sure how you would do that, I'm not familiar enough with the replace code to see if we could figure that out. For #2 it should be straightforward, as long as we can determine that we really weren't doing a device replace, then we just change the devid to 1 or something and carry on with life? Thanks, Josef
On 24/9/20 10:02 pm, Josef Bacik wrote: > On 9/24/20 7:25 AM, David Sterba wrote: >> On Wed, Sep 23, 2020 at 09:42:17AM -0400, Josef Bacik wrote: >>> On 9/23/20 12:42 AM, Anand Jain wrote: >>>> On 22/9/20 9:08 pm, Josef Bacik wrote: >>>>> On 9/22/20 8:33 AM, Anand Jain wrote: >> >>> Yeah I mean we do something in btrfs_init_dev_replace(), like when we >>> search for >>> the key, we double check to make sure we don't have a devid == >>> BTRFS_DEV_REPLACE_DEVID in our devices if we don't find a key. >>> If we >>> do we >>> return -EIO and bail out of the mount. Thanks, I read fast and missed the bailout part before. If we bailout the mount, it means a btrfs rootfs can fail to boot up. To recover from it, the user has to remove the trespassing/extra device manually and reboot. For a non-rootfs, the user would have to remove the device manually and run 'btrfs dev scan --forget' to free up the extra devices. What we are doing now is removing the extra/trespassing device internally. IMO. The case of trespassing/extra device trying to sabotage the setup is a bit different from a corrupted device, in the former case resilience is preferred? Thanks, Anand >> >> From user perspective, then do what? Or do we treat this with minimal >> efforts to provide a sane fallback and error handling just to pass >> fuzzers (like in many other cases)? >> > > That's a question for fsck. I don't want to spend a lot of time chasing > imaginary cases that fuzzers come up with, I just want them to fail as > quickly as possible so we can move on with our lives. > > If this happened in the real world then it would be because we either > > 1) Lost the replace item somehow? > 2) Got a random corruption that changed the devid to 0 > > I think for #1 it's impossible to detect really, unless you can tell > which device was being replaced somehow? I'm not sure how you would do > that, I'm not familiar enough with the replace code to see if we could > figure that out. > > For #2 it should be straightforward, as long as we can determine that we > really weren't doing a device replace, then we just change the devid to > 1 or something and carry on with life? Thanks, > > Josef
On 9/25/20 6:11 AM, Anand Jain wrote: > On 24/9/20 10:02 pm, Josef Bacik wrote: >> On 9/24/20 7:25 AM, David Sterba wrote: >>> On Wed, Sep 23, 2020 at 09:42:17AM -0400, Josef Bacik wrote: >>>> On 9/23/20 12:42 AM, Anand Jain wrote: >>>>> On 22/9/20 9:08 pm, Josef Bacik wrote: >>>>>> On 9/22/20 8:33 AM, Anand Jain wrote: >>> >>>> Yeah I mean we do something in btrfs_init_dev_replace(), like when we search >>>> for >>>> the key, we double check to make sure we don't have a devid == >>>> BTRFS_DEV_REPLACE_DEVID in our devices if we don't find a key. > > >>>> If we do we >>>> return -EIO and bail out of the mount. Thanks, > > > I read fast and missed the bailout part before. > > If we bailout the mount, it means a btrfs rootfs can fail to boot up. > > To recover from it, the user has to remove the trespassing/extra device > manually and reboot. > For a non-rootfs, the user would have to remove the device manually and run > 'btrfs dev scan --forget' to free up the extra devices. > What we are doing now is removing the extra/trespassing device > internally. > > IMO. The case of trespassing/extra device trying to sabotage the setup > is a bit different from a corrupted device, in the former case > resilience is preferred? > Well this doesn't happen in real life right? This is purely from a fuzzing standpoint, so while resilience should be the first thing we shoot for, I'd rather not spend a long time trying to make it work. In the case of just randomly deleting a device, I don't think that's a decision that the kernel can/should make, we should require a user to intervene at that point. That makes failure the best option here, thanks, Josef
On 25/9/20 10:28 pm, Josef Bacik wrote: > On 9/25/20 6:11 AM, Anand Jain wrote: >> On 24/9/20 10:02 pm, Josef Bacik wrote: >>> On 9/24/20 7:25 AM, David Sterba wrote: >>>> On Wed, Sep 23, 2020 at 09:42:17AM -0400, Josef Bacik wrote: >>>>> On 9/23/20 12:42 AM, Anand Jain wrote: >>>>>> On 22/9/20 9:08 pm, Josef Bacik wrote: >>>>>>> On 9/22/20 8:33 AM, Anand Jain wrote: >>>> >>>>> Yeah I mean we do something in btrfs_init_dev_replace(), like when >>>>> we search for >>>>> the key, we double check to make sure we don't have a devid == >>>>> BTRFS_DEV_REPLACE_DEVID in our devices if we don't find a key. >> >> >>>>> If we do we >>>>> return -EIO and bail out of the mount. Thanks, >> >> >> I read fast and missed the bailout part before. >> >> If we bailout the mount, it means a btrfs rootfs can fail to boot up. >> >> To recover from it, the user has to remove the trespassing/extra device >> manually and reboot. >> For a non-rootfs, the user would have to remove the device manually >> and run >> 'btrfs dev scan --forget' to free up the extra devices. >> What we are doing now is removing the extra/trespassing device >> internally. >> >> IMO. The case of trespassing/extra device trying to sabotage the setup >> is a bit different from a corrupted device, in the former case >> resilience is preferred? >> > > Well this doesn't happen in real life right? This is purely from a > fuzzing standpoint, so while resilience should be the first thing we > shoot for, I'd rather not spend a long time trying to make it work. > > In the case of just randomly deleting a device, I don't think that's a > decision that the kernel can/should make, we should require a user to > intervene at that point. That makes failure the best option here, thanks, > It makes sense to me, its different from what we had before. Made those changes in v2. Thanks, Anand > Josef
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index ec9dac40b4f1..2fd73eab6219 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1080,8 +1080,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) { list_del_init(&device->dev_alloc_list); clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state); - if (!test_bit(BTRFS_DEV_STATE_REPLACE_TGT, - &device->dev_state)) + if (device->devid != BTRFS_DEV_REPLACE_DEVID) fs_devices->rw_devices--; } list_del_init(&device->dev_list);
syzbot reported a warning [1] in close_fs_devcies() which it reproduces using a crafted image. WARN_ON(fs_devices->rw_devices); The crafted image successfully creates a replace-device with the devid 0. But as there isn't any replace-item. We clean the extra the devid 0, at __btrfs_free_extra_devids(). rw_devices is incremented in btrfs_open_one_device() for all write-able devices except for devid == BTRFS_DEV_REPLACE_DEVID. But while we clean up the extra devices in __btrfs_free_extra_devids() we used the BTRFS_DEV_STATE_REPLACE_TGT flag which isn't set because there isn't the replace-item. So rw_devices went below zero. So let __btrfs_free_extra_devids() also depend on the devid != BTRFS_DEV_REPLACE_DEVID to manage the rw_devices. [1] WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x198/0x1fd lib/dump_stack.c:118 panic+0x347/0x7c0 kernel/panic.c:231 __warn.cold+0x20/0x46 kernel/panic.c:600 report_bug+0x1bd/0x210 lib/bug.c:198 handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234 exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254 asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536 RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166 Code: 0f b6 04 02 84 c0 74 02 7e 33 48 8b 44 24 18 c6 80 30 01 00 00 00 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 e8 99 ce 6a fe <0f> 0b e9 71 ff ff ff e8 8d ce 6a fe 0f 0b e9 20 ff ff ff e8 d1 d5 RSP: 0018:ffffc900091777e0 EFLAGS: 00010246 RAX: 0000000000040000 RBX: ffffffffffffffff RCX: ffffc9000c8b7000 RDX: 0000000000040000 RSI: ffffffff83097f47 RDI: 0000000000000007 RBP: dffffc0000000000 R08: 0000000000000001 R09: ffff8880988a187f R10: 0000000000000000 R11: 0000000000000001 R12: ffff88809593a130 R13: ffff88809593a1ec R14: ffff8880988a1908 R15: ffff88809593a050 close_fs_devices fs/btrfs/volumes.c:1193 [inline] btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179 open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434 btrfs_fill_super fs/btrfs/super.c:1316 [inline] btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672 legacy_get_tree+0x105/0x220 fs/fs_context.c:592 vfs_get_tree+0x89/0x2f0 fs/super.c:1547 fc_mount fs/namespace.c:978 [inline] vfs_kern_mount.part.0+0xd3/0x170 fs/namespace.c:1008 vfs_kern_mount+0x3c/0x60 fs/namespace.c:995 btrfs_mount+0x234/0xaa0 fs/btrfs/super.c:1732 legacy_get_tree+0x105/0x220 fs/fs_context.c:592 vfs_get_tree+0x89/0x2f0 fs/super.c:1547 do_new_mount fs/namespace.c:2875 [inline] path_mount+0x1387/0x2070 fs/namespace.c:3192 do_mount fs/namespace.c:3205 [inline] __do_sys_mount fs/namespace.c:3413 [inline] __se_sys_mount fs/namespace.c:3390 [inline] __x64_sys_mount+0x27f/0x300 fs/namespace.c:3390 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x46004a Code: b8 a6 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 fd 89 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 da 89 fb ff c3 66 0f 1f 84 00 00 00 00 00 RSP: 002b:00007f414d78da88 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 00007f414d78db20 RCX: 000000000046004a RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007f414d78dae0 RBP: 00007f414d78dae0 R08: 00007f414d78db20 R09: 0000000020000000 R10: 0000000000000000 R11: 0000000000000202 R12: 0000000020000000 R13: 0000000020000100 R14: 0000000020000200 R15: 000000002001a800 Signed-off-by: Anand Jain <anand.jain@oracle.com> --- fs/btrfs/volumes.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)