btrfs: free device if we fail to open it

Message ID 7cfd63a1232900565abf487e82b7aa4af5fbca29.1638393521.git.josef@toxicpanda.com (mailing list archive)
State New, archived
Series: btrfs: free device if we fail to open it

Commit Message

Josef Bacik Dec. 1, 2021, 9:18 p.m. UTC
We've been having transient failures of btrfs/197 because sometimes we
don't show a missing device.

This turned out to be because I use LVM for my devices, and sometimes we
race with udev and get the "/dev/dm-#" device registered as a device in
the fs_devices list instead of "/dev/mapper/vg0-lv#" device.  Thus when
the test adds a device to the existing mount it doesn't find the old
device in the original fs_devices list and remove it properly.

This is fine in general, because when we open the devices we check the
UUID, and properly skip using the device that we added to the other file
system.  However we do not remove it from our fs_devices, so when we get
the device info we still see this old device and think that everything
is ok.

We have a check for -ENODATA coming back from reading the super off of
the device to catch the wipefs case, but we don't catch literally any
other error where the device is no longer valid and thus do not remove
the device.

Fix this to not special case an empty device when reading the super
block, and simply remove any device we are unable to open.

With this fix we properly print out missing devices in the test case,
and after 500 iterations I'm no longer able to reproduce the problem,
whereas I could usually reproduce within 200 iterations.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/disk-io.c | 2 +-
 fs/btrfs/volumes.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Comments

Anand Jain Dec. 2, 2021, 7:09 a.m. UTC | #1
On 02/12/2021 05:18, Josef Bacik wrote:
> We've been having transient failures of btrfs/197 because sometimes we
> don't show a missing device.

> This turned out to be because I use LVM for my devices, and sometimes we
> race with udev and get the "/dev/dm-#" device registered as a device in
> the fs_devices list instead of "/dev/mapper/vg0-lv#" device.
>  Thus when
> the test adds a device to the existing mount it doesn't find the old
> device in the original fs_devices list and remove it properly.

  The above para is confusing. It can go, IMHO. The device path shouldn't
  matter, as we depend on the bdev comparison in the device add thread.

2637         bdev = blkdev_get_by_path(device_path, FMODE_WRITE | 
FMODE_EXCL,
2638                                   fs_info->bdev_holder);
::
2657         list_for_each_entry_rcu(device, &fs_devices->devices, 
dev_list) {
2658                 if (device->bdev == bdev) {
2659                         ret = -EEXIST;
2660                         rcu_read_unlock();
2661                         goto error;
2662                 }
2663         }


> This is fine in general, because when we open the devices we check the
> UUID, and properly skip using the device that we added to the other file
> system.  However we do not remove it from our fs_devices,

Hmm, context/thread is missing. Like, is it during device add or during 
mkfs/dev-scan?

AFAIK btrfs_free_stale_devices() checks for and handles the removal of
stale devices from the fs_devices in both contexts/threads: device-add
and mkfs (device-scan).

  For example:

$ alias cnt_free_stale_devices="bpftrace -e 
'kprobe:btrfs_free_stale_devices { @ = count(); }'"

New FSID on 2 devices, we call free_stale_devices():

$ cnt_free_stale_devices -c 'mkfs.btrfs -fq -draid1 -mraid1 
/dev/vg/scratch0 /dev/vg/scratch1'
Attaching 1 probe...

@: 2

  We do it only when there is a new fsid/device added to the fs_devices.

For example:

Clean up the fs_devices:
$ cnt_free_stale_devices -c 'btrfs dev scan --forget'
Attaching 1 probe...

@: 1

Now the devices being mounted are new to the fs_devices list, so we call
free_stale_devices().

$ cnt_free_stale_devices -c 'mount -o device=/dev/vg/scratch0 
/dev/vg/scratch1 /btrfs'
Attaching 1 probe...

@: 2

$ umount /btrfs

Below we didn't call free_stale_devices() because these two devices are 
already in the appropriate fs_devices.

$ cnt_free_stale_devices -c 'mount -o device=/dev/vg/scratch0 
/dev/vg/scratch1 /btrfs'
Attaching 1 probe...

@: 0

$

To me, it looks to be working correctly.

> so when we get
> the device info we still see this old device and think that everything
> is ok.


> We have a check for -ENODATA coming back from reading the super off of
> the device to catch the wipefs case, but we don't catch literally any
> other error where the device is no longer valid and thus do not remove
> the device.
> 

> Fix this to not special case an empty device when reading the super
> block, and simply remove any device we are unable to open.
> 
> With this fix we properly print out missing devices in the test case,
> and after 500 iterations I'm no longer able to reproduce the problem,
> whereas I could usually reproduce within 200 iterations.
> 

commit 7f551d969037 ("btrfs: free alien device after device add")
fixed the case where we weren't freeing the stale device in the
device-add context.

However, I don't understand in which thread/context we are failing to
free the stale device here.

> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>   fs/btrfs/disk-io.c | 2 +-
>   fs/btrfs/volumes.c | 2 +-
>   2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 5c598e124c25..fa34b8807f8d 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3924,7 +3924,7 @@ struct btrfs_super_block *btrfs_read_dev_one_super(struct block_device *bdev,
>   	super = page_address(page);
>   	if (btrfs_super_magic(super) != BTRFS_MAGIC) {
>   		btrfs_release_disk_super(super);
> -		return ERR_PTR(-ENODATA);
> +		return ERR_PTR(-EINVAL);
>   	}

  I think you are removing ENODATA because it is going away in the
parent function. And we don't reach this condition in the test case
btrfs/197.
  Instead, we fail at line 645 below, and we return -EINVAL;

  645         if (memcmp(device->uuid, disk_super->dev_item.uuid, 
BTRFS_UUID_SIZE))
  646                 goto error_free_page;
  647

  687 error_free_page:
  688         btrfs_release_disk_super(disk_super);
  689         blkdev_put(bdev, flags);
  690
  691         return -EINVAL;

The function call stack carries the return code up to the parent open_fs_devices():

open_fs_devices()
  btrfs_open_one_device()
   btrfs_get_bdev_and_sb()
    btrfs_read_dev_super()
     btrfs_read_dev_one_super()



>   	if (btrfs_super_bytenr(super) != bytenr_orig) {
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index f38c230111be..890153dd2a2b 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1223,7 +1223,7 @@ static int open_fs_devices(struct btrfs_fs_devices *fs_devices,
>   		if (ret == 0 &&
>   		    (!latest_dev || device->generation > latest_dev->generation)) {
>   			latest_dev = device;
> -		} else if (ret == -ENODATA) {
> +		} else if (ret) {
>   			fs_devices->num_devices--;
>   			list_del(&device->dev_list);
>   			btrfs_free_device(device);
> 

There are many other cases, for example -EACCES (from
blkdev_get_by_path()) (I haven't checked other error codes), for which
calling btrfs_free_device() is inappropriate.

To be safer, how about explicitly checking for -EINVAL? We could
probably free the device when it is -EINVAL.

Thanks, Anand
Josef Bacik Dec. 2, 2021, 4:02 p.m. UTC | #2
On Thu, Dec 02, 2021 at 03:09:38PM +0800, Anand Jain wrote:
> On 02/12/2021 05:18, Josef Bacik wrote:
> > We've been having transient failures of btrfs/197 because sometimes we
> > don't show a missing device.
> 
> > This turned out to be because I use LVM for my devices, and sometimes we
> > race with udev and get the "/dev/dm-#" device registered as a device in
> > the fs_devices list instead of "/dev/mapper/vg0-lv#" device.
> >  Thus when
> > the test adds a device to the existing mount it doesn't find the old
> > device in the original fs_devices list and remove it properly.
> 

I think most of your confusion is because you don't know what btrfs/197 does, so
I'll explain that and then answer your questions below.

DEV=/dev/vg0/lv0
RAID_DEVS=/dev/vg0/lv1 /dev/vg0/lv2 /dev/vg0/lv3 /dev/vg0/lv4

# First we create a single fs and mount it
mkfs.btrfs -f $DEV
mount $DEV /mnt/test

# Now we create the RAID fs
mkfs.btrfs -f -d raid10 -m raid10 $RAID_DEVS

# Now we add one of the raid devs to the single mount above
btrfs device add /dev/vg0/lv2 /mnt/test

# /dev/vg0/lv2 is no longer part of the fs it was made on, it's part of the fs
# that's mounted at /mnt/test

# Mount degraded with the raid setup
mount -o degraded /dev/vg0/lv1 /mnt/scratch

# Now we shouldn't have found /dev/vg0/lv2, because it was wiped and is no
# longer part of the fs_devices for this thing, except it is because it wasn't
# removed, so when we do the following it doesn't show as missing
btrfs filesystem show /mnt/scratch

>  The above para is confusing. It can go, IMHO. The device path shouldn't
>  matter, as we depend on the bdev comparison in the device add thread.
> 
> 2637         bdev = blkdev_get_by_path(device_path, FMODE_WRITE |
> FMODE_EXCL,
> 2638                                   fs_info->bdev_holder);
> ::
> 2657         list_for_each_entry_rcu(device, &fs_devices->devices, dev_list)
> {
> 2658                 if (device->bdev == bdev) {
> 2659                         ret = -EEXIST;
> 2660                         rcu_read_unlock();
> 2661                         goto error;
> 2662                 }
> 2663         }
> 

This is on the init thread; it is just checking the fs_devices of /mnt/test,
not the fs_devices of the RAID setup that we created, so this doesn't error out
(nor should it) because we're adding it to our mounted fs.

> 
> > This is fine in general, because when we open the devices we check the
> > UUID, and properly skip using the device that we added to the other file
> > system.  However we do not remove it from our fs_devices,
> 
> Hmm, context/thread is missing. Like, is it during device add or during
> mkfs/dev-scan?
> 
> AFAIK btrfs_free_stale_devices() checks for and handles the removal of stale
> devices from the fs_devices in both contexts/threads: device-add and mkfs
> (device-scan).
> 
>  For example:
> 
> $ alias cnt_free_stale_devices="bpftrace -e 'kprobe:btrfs_free_stale_devices
> { @ = count(); }'"
> 
> New FSID on 2 devices, we call free_stale_devices():
> 
> $ cnt_free_stale_devices -c 'mkfs.btrfs -fq -draid1 -mraid1 /dev/vg/scratch0
> /dev/vg/scratch1'
> Attaching 1 probe...
> 
> @: 2
> 
>  We do it only when there is a new fsid/device added to the fs_devices.
> 
> For example:
> 
> Clean up the fs_devices:
> $ cnt_free_stale_devices -c 'btrfs dev scan --forget'
> Attaching 1 probe...
> 
> @: 1
> 
> Now the devices being mounted are new to the fs_devices list, so we call
> free_stale_devices().
> 
> $ cnt_free_stale_devices -c 'mount -o device=/dev/vg/scratch0
> /dev/vg/scratch1 /btrfs'
> Attaching 1 probe...
> 
> @: 2
> 
> $ umount /btrfs
> 
> Below we didn't call free_stale_devices() because these two devices are
> already in the appropriate fs_devices.
> 
> $ cnt_free_stale_devices -c 'mount -o device=/dev/vg/scratch0
> /dev/vg/scratch1 /btrfs'
> Attaching 1 probe...
> 
> @: 0
> 
> $
> 
> To me, it looks to be working correctly.
> 

Yes it does work correctly, most of the time.  If you run it in a loop 500 times
it'll fail, because _sometimes_ udev goes in and does the btrfs device scan and
changes the name of the device in the fs_devices for the RAID group.  So the
btrfs_free_stale_devices() thing doesn't find the existing device, because it's
just looking at the device->name, which is different from the device we're
adding.

We have the fs_devices for the RAID fs, and instead of /dev/vg0/lv2, we have
/dev/dm-4 or whatever.  So we do the addition of /dev/vg0/lv2, go to find it in
any other fs_devices, and don't find it because strcmp("/dev/vg0/lv2",
"/dev/dm0-4") != 0, and thus leave the device on the fs_devices for the RAID
file system.
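
For reference, the match btrfs_free_stale_devices() does is (roughly --
this is a from-memory sketch of device_path_matched() in
fs/btrfs/volumes.c, not a verbatim quote) just a strcmp() on the
registered path:

static bool device_path_matched(const char *path, struct btrfs_device *device)
{
	int found;

	rcu_read_lock();
	/* device->name is whatever path got registered last, e.g. /dev/dm-4 */
	found = strcmp(rcu_str_deref(device->name), path);
	rcu_read_unlock();

	return found == 0;
}

so two aliases of the same block device never compare equal.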

> > so when we get
> > the device info we still see this old device and think that everything
> > is ok.
> 
> 
> > We have a check for -ENODATA coming back from reading the super off of
> > the device to catch the wipefs case, but we don't catch literally any
> > other error where the device is no longer valid and thus do not remove
> > the device.
> > 
> 
> > Fix this to not special case an empty device when reading the super
> > block, and simply remove any device we are unable to open.
> > 
> > With this fix we properly print out missing devices in the test case,
> > and after 500 iterations I'm no longer able to reproduce the problem,
> > whereas I could usually reproduce within 200 iterations.
> > 
> 
> commit 7f551d969037 ("btrfs: free alien device after device add")
> fixed the case where we weren't freeing the stale device in the device-add
> context.
> 
> However, I don't understand in which thread/context we are failing to free
> the stale device here.
> 
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> >   fs/btrfs/disk-io.c | 2 +-
> >   fs/btrfs/volumes.c | 2 +-
> >   2 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 5c598e124c25..fa34b8807f8d 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -3924,7 +3924,7 @@ struct btrfs_super_block *btrfs_read_dev_one_super(struct block_device *bdev,
> >   	super = page_address(page);
> >   	if (btrfs_super_magic(super) != BTRFS_MAGIC) {
> >   		btrfs_release_disk_super(super);
> > -		return ERR_PTR(-ENODATA);
> > +		return ERR_PTR(-EINVAL);
> >   	}
> 
>  I think you are removing ENODATA because it is going away in the parent
> function. And we don't reach this condition in the test case btrfs/197.
>  Instead, we fail at line 645 below, and we return -EINVAL;

I'm changing it to be consistent, instead of special casing this one specific
failure, just treat all failures like we need to remove the device, and thus we
can just make this be EINVAL as well.

> 
>  645         if (memcmp(device->uuid, disk_super->dev_item.uuid,
> BTRFS_UUID_SIZE))
>  646                 goto error_free_page;
>  647
> 
>  687 error_free_page:
>  688         btrfs_release_disk_super(disk_super);
>  689         blkdev_put(bdev, flags);
>  690
>  691         return -EINVAL;
> 
> The function call stack carries the return code up to the parent open_fs_devices():
> 
> open_fs_devices()
>  btrfs_open_one_device()
>   btrfs_get_bdev_and_sb()
>    btrfs_read_dev_super()
>     btrfs_read_dev_one_super()
> 
> 
> 
> >   	if (btrfs_super_bytenr(super) != bytenr_orig) {
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index f38c230111be..890153dd2a2b 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -1223,7 +1223,7 @@ static int open_fs_devices(struct btrfs_fs_devices *fs_devices,
> >   		if (ret == 0 &&
> >   		    (!latest_dev || device->generation > latest_dev->generation)) {
> >   			latest_dev = device;
> > -		} else if (ret == -ENODATA) {
> > +		} else if (ret) {
> >   			fs_devices->num_devices--;
> >   			list_del(&device->dev_list);
> >   			btrfs_free_device(device);
> > 
> 
> There are many other cases, for example -EACCES (from
> blkdev_get_by_path()) (I haven't checked other error codes), for which
> calling btrfs_free_device() is inappropriate.

Yeah I was a little fuzzy on this.  I think *any* failure should mean that we
remove the device from the fs_devices tho right?  So that we show we're missing
a device, since we can't actually access it?  I'm actually asking, because I
think we can go either way, but to me I think any failure should result in the
removal of the device so we can re-scan the correct one.  Thanks,

Josef
Anand Jain Dec. 3, 2021, 8:31 a.m. UTC | #3
On 03/12/2021 00:02, Josef Bacik wrote:
> On Thu, Dec 02, 2021 at 03:09:38PM +0800, Anand Jain wrote:
>> On 02/12/2021 05:18, Josef Bacik wrote:
>>> We've been having transient failures of btrfs/197 because sometimes we
>>> don't show a missing device.
>>
>>> This turned out to be because I use LVM for my devices, and sometimes we
>>> race with udev and get the "/dev/dm-#" device registered as a device in
>>> the fs_devices list instead of "/dev/mapper/vg0-lv#" device.
>>>   Thus when
>>> the test adds a device to the existing mount it doesn't find the old
>>> device in the original fs_devices list and remove it properly.
>>
> 
> I think most of your confusion is because you don't know what btrfs/197 does, so
> I'll explain that and then answer your questions below.
> 
> DEV=/dev/vg0/lv0
> RAID_DEVS=/dev/vg0/lv1 /dev/vg0/lv2 /dev/vg0/lv3 /dev/vg0/lv4
> 
> # First we create a single fs and mount it
> mkfs.btrfs -f $DEV
> mount $DEV /mnt/test
> 
> # Now we create the RAID fs
> mkfs.btrfs -f -d raid10 -m raid10 $RAID_DEVS
> 
> # Now we add one of the raid devs to the single mount above
> btrfs device add /dev/vg0/lv2 /mnt/test
> 
> # /dev/vg0/lv2 is no longer part of the fs it was made on, it's part of the fs
> # that's mounted at /mnt/test
> 
> # Mount degraded with the raid setup
> mount -o degraded /dev/vg0/lv1 /mnt/scratch
> 
> # Now we shouldn't have found /dev/vg0/lv2, because it was wiped and is no
> # longer part of the fs_devices for this thing, except it is because it wasn't
> # removed, so when we do the following it doesn't show as missing
> btrfs filesystem show /mnt/scratch
> 

  I thought I understood the test case. Now it is better. Thanks for 
taking the time to explain.


>>   The above para is confusing. It can go, IMHO. The device path shouldn't
>>   matter, as we depend on the bdev comparison in the device add thread.
>>
>> 2637         bdev = blkdev_get_by_path(device_path, FMODE_WRITE |
>> FMODE_EXCL,
>> 2638                                   fs_info->bdev_holder);
>> ::
>> 2657         list_for_each_entry_rcu(device, &fs_devices->devices, dev_list)
>> {
>> 2658                 if (device->bdev == bdev) {
>> 2659                         ret = -EEXIST;
>> 2660                         rcu_read_unlock();
>> 2661                         goto error;
>> 2662                 }
>> 2663         }
>>
> 
> This is on the init thread; it is just checking the fs_devices of /mnt/test,
> not the fs_devices of the RAID setup that we created, so this doesn't error out
> (nor should it) because we're adding it to our mounted fs.
> 
>>
>>> This is fine in general, because when we open the devices we check the
>>> UUID, and properly skip using the device that we added to the other file
>>> system.  However we do not remove it from our fs_devices,
>>
>> Hmm, context/thread is missing. Like, is it during device add or during
>> mkfs/dev-scan?
>>
>> AFAIK btrfs_free_stale_devices() checks for and handles the removal of stale
>> devices from the fs_devices in both contexts/threads: device-add and mkfs
>> (device-scan).
>>
>>   For example:
>>
>> $ alias cnt_free_stale_devices="bpftrace -e 'kprobe:btrfs_free_stale_devices
>> { @ = count(); }'"
>>
>> New FSID on 2 devices, we call free_stale_devices():
>>
>> $ cnt_free_stale_devices -c 'mkfs.btrfs -fq -draid1 -mraid1 /dev/vg/scratch0
>> /dev/vg/scratch1'
>> Attaching 1 probe...
>>
>> @: 2
>>
>>   We do it only when there is a new fsid/device added to the fs_devices.
>>
>> For example:
>>
>> Clean up the fs_devices:
>> $ cnt_free_stale_devices -c 'btrfs dev scan --forget'
>> Attaching 1 probe...
>>
>> @: 1
>>
>> Now the devices being mounted are new to the fs_devices list, so we call
>> free_stale_devices().
>>
>> $ cnt_free_stale_devices -c 'mount -o device=/dev/vg/scratch0
>> /dev/vg/scratch1 /btrfs'
>> Attaching 1 probe...
>>
>> @: 2
>>
>> $ umount /btrfs
>>
>> Below we didn't call free_stale_devices() because these two devices are
>> already in the appropriate fs_devices.
>>
>> $ cnt_free_stale_devices -c 'mount -o device=/dev/vg/scratch0
>> /dev/vg/scratch1 /btrfs'
>> Attaching 1 probe...
>>
>> @: 0
>>
>> $
>>
>> To me, it looks to be working correctly.
>>
> 
> Yes it does work correctly, most of the time.  If you run it in a loop 500 times
> it'll fail, because _sometimes_ udev goes in and does the btrfs device scan and
> changes the name of the device in the fs_devices for the RAID group.  So the
> btrfs_free_stale_devices() thing doesn't find the existing device, because it's
> just looking at the device->name, which is different from the device we're
> adding.
> 
> We have the fs_devices for the RAID fs, and instead of /dev/vg0/lv2, we have
> /dev/dm-4 or whatever.  So we do the addition of /dev/vg0/lv2, go to find it in
> any other fs_devices, and don't find it because strcmp("/dev/vg0/lv2",
> "/dev/dm0-4") != 0, and thus leave the device on the fs_devices for the RAID
> file system.
> 

  I got it. It shouldn't be difficult to reproduce, and I could
reproduce it without this patch.


  Below is a device with two different paths: dm and its mapper.

----------
  $ ls -bli /dev/mapper/vg-scratch1  /dev/dm-1
  561 brw-rw---- 1 root disk 252, 1 Dec  3 12:13 /dev/dm-1
  565 lrwxrwxrwx 1 root root      7 Dec  3 12:13 /dev/mapper/vg-scratch1 
-> ../dm-1
----------

  Clean the fs_devices.

----------
  $ btrfs dev scan --forget
----------

  Use the mapper to do mkfs.btrfs.

----------
  $ mkfs.btrfs -fq /dev/mapper/vg-scratch0
  $ mount /dev/mapper/vg-scratch0 /btrfs
----------

  Create raid1 again using the mapper path.

----------
  $ mkfs.btrfs -U $uuid -fq -draid1 -mraid1 /dev/mapper/vg-scratch1 
/dev/mapper/vg-scratch2
----------

  Use dm path to add the device which belongs to another btrfs filesystem.

----------
  $ btrfs dev add -f /dev/dm-1 /btrfs
----------

  Now mount the above raid1 in degraded mode.

----------
  $ mount -o degraded /dev/mapper/vg-scratch2 /btrfs1
----------


Before patch:

  The /dev/mapper/vg-scratch1 is not shown as missing in the /btrfs1.

----------
  $ btrfs fi show -m /btrfs1
  Label: none  uuid: 12345678-1234-1234-1234-123456789abc
	Total devices 2 FS bytes used 704.00KiB
	devid    1 size 10.00GiB used 1.26GiB path /dev/mapper/vg-scratch1
	devid    2 size 10.00GiB used 2.54GiB path /dev/mapper/vg-scratch2

$ btrfs fi show -m /btrfs
Label: none  uuid: e89a49c3-66a2-473c-aadf-3bf23d37a3a3
	Total devices 2 FS bytes used 192.00KiB
	devid    1 size 10.00GiB used 536.00MiB path /dev/mapper/vg-scratch0
	devid    2 size 10.00GiB used 0.00B path /dev/mapper/vg-scratch1
----------

  The kernel does set the devid-1 as missing.
  (Checked using boilerplate code, which dumps fs_devices).

----------
  $ cat /proc/fs/btrfs/devlist
::
[fsid: 12345678-1234-1234-1234-123456789abc]
::
		device:		/dev/dm-1  <---
		devid:		1
::
		dev_state:	|IN_FS_METADATA|MISSING|dev_stats_valid
----------


After patch:

------------
$ btrfs fi show /btrfs1
Label: none  uuid: 12345678-1234-1234-1234-123456789abc
	Total devices 2 FS bytes used 128.00KiB
	devid    2 size 10.00GiB used 1.26GiB path /dev/mapper/vg-scratch2
	*** Some devices missing

$ btrfs fi show /btrfs
Label: none  uuid: 761115c2-7fc2-4dc8-afab-baf2509ea6fe
	Total devices 2 FS bytes used 192.00KiB
	devid    1 size 10.00GiB used 536.00MiB path /dev/mapper/vg-scratch0
	devid    2 size 10.00GiB used 0.00B path /dev/mapper/vg-scratch1
------------

------------
$ cat /proc/fs/btrfs/devlist
::
[fsid: 12345678-1234-1234-1234-123456789abc]
::
		device:		(null)  <---
		devid:		1
::
		dev_state:	|IN_FS_METADATA|MISSING|dev_stats_valid
-------------



>>> so when we get
>>> the device info we still see this old device and think that everything
>>> is ok.
>>
>>
>>> We have a check for -ENODATA coming back from reading the super off of
>>> the device to catch the wipefs case, but we don't catch literally any
>>> other error where the device is no longer valid and thus do not remove
>>> the device.
>>>
>>
>>> Fix this to not special case an empty device when reading the super
>>> block, and simply remove any device we are unable to open.
>>>
>>> With this fix we properly print out missing devices in the test case,
>>> and after 500 iterations I'm no longer able to reproduce the problem,
>>> whereas I could usually reproduce within 200 iterations.
>>>
>>
>> commit 7f551d969037 ("btrfs: free alien device after device add")
>> fixed the case where we weren't freeing the stale device in the device-add
>> context.
>>
>> However, I don't understand in which thread/context we are failing to free
>> the stale device here.
>>
>>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>>> ---
>>>    fs/btrfs/disk-io.c | 2 +-
>>>    fs/btrfs/volumes.c | 2 +-
>>>    2 files changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 5c598e124c25..fa34b8807f8d 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -3924,7 +3924,7 @@ struct btrfs_super_block *btrfs_read_dev_one_super(struct block_device *bdev,
>>>    	super = page_address(page);
>>>    	if (btrfs_super_magic(super) != BTRFS_MAGIC) {
>>>    		btrfs_release_disk_super(super);
>>> -		return ERR_PTR(-ENODATA);
>>> +		return ERR_PTR(-EINVAL);
>>>    	}
>>
>>   I think you are removing ENODATA because it is going away in the parent
>> function. And we don't reach this condition in the test case btrfs/197.
>>   Instead, we fail at line 645 below, and we return -EINVAL;
> 
> I'm changing it to be consistent, instead of special casing this one specific
> failure, just treat all failures like we need to remove the device, and thus we
> can just make this be EINVAL as well.

  Hmm, as this change is not related to the bug fix here. Can it be a 
separate patch? The actual bug fix has to go to stable kernels as well.

> 
>>
>>   645         if (memcmp(device->uuid, disk_super->dev_item.uuid,
>> BTRFS_UUID_SIZE))
>>   646                 goto error_free_page;
>>   647
>>
>>   687 error_free_page:
>>   688         btrfs_release_disk_super(disk_super);
>>   689         blkdev_put(bdev, flags);
>>   690
>>   691         return -EINVAL;
>>
>> The function call stack carries the return code up to the parent open_fs_devices():
>>
>> open_fs_devices()
>>   btrfs_open_one_device()
>>    btrfs_get_bdev_and_sb()
>>     btrfs_read_dev_super()
>>      btrfs_read_dev_one_super()
>>
>>
>>
>>>    	if (btrfs_super_bytenr(super) != bytenr_orig) {
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index f38c230111be..890153dd2a2b 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -1223,7 +1223,7 @@ static int open_fs_devices(struct btrfs_fs_devices *fs_devices,
>>>    		if (ret == 0 &&
>>>    		    (!latest_dev || device->generation > latest_dev->generation)) {
>>>    			latest_dev = device;
>>> -		} else if (ret == -ENODATA) {
>>> +		} else if (ret) {
>>>    			fs_devices->num_devices--;
>>>    			list_del(&device->dev_list);
>>>    			btrfs_free_device(device);
>>>
>>
>> There are many other cases, for example -EACCES (from
>> blkdev_get_by_path()) (I haven't checked other error codes), for which
>> calling btrfs_free_device() is inappropriate.
> 
> Yeah I was a little fuzzy on this.  I think *any* failure should mean that we
> remove the device from the fs_devices tho right?  So that we show we're missing
> a device, since we can't actually access it?  I'm actually asking, because I
> think we can go either way, but to me I think any failure should result in the
> removal of the device so we can re-scan the correct one.  Thanks,
> 

It is difficult to generalize, I guess. For example, consider transient
errors during boot-up, errors due to slow-to-ready devices, or
system-related errors such as ENOMEM/EACCES; none of these calls for
freeing the device. If we free the device on a transient error, any
further attempt to mount will fail unless there is a device scan again.

Here the bug is about btrfs_free_stale_devices(), which fails to
identify the same device when tricked by mixing the dm and mapper paths.
Can I check with you if there is another way to fix this, by checking
the device major and minor numbers or the serial number from the device
inquiry page?

Thanks, Anand

> Josef
>
Josef Bacik Dec. 3, 2021, 3:08 p.m. UTC | #4
On Fri, Dec 03, 2021 at 04:31:49PM +0800, Anand Jain wrote:
> 
> 
> On 03/12/2021 00:02, Josef Bacik wrote:
> > On Thu, Dec 02, 2021 at 03:09:38PM +0800, Anand Jain wrote:
> > > On 02/12/2021 05:18, Josef Bacik wrote:
> > > > We've been having transient failures of btrfs/197 because sometimes we
> > > > don't show a missing device.
> > > 
> > > > This turned out to be because I use LVM for my devices, and sometimes we
> > > > race with udev and get the "/dev/dm-#" device registered as a device in
> > > > the fs_devices list instead of "/dev/mapper/vg0-lv#" device.
> > > >   Thus when
> > > > the test adds a device to the existing mount it doesn't find the old
> > > > device in the original fs_devices list and remove it properly.
> > > 
> > 
> > I think most of your confusion is because you don't know what btrfs/197 does, so
> > I'll explain that and then answer your questions below.
> > 
> > DEV=/dev/vg0/lv0
> > RAID_DEVS=/dev/vg0/lv1 /dev/vg0/lv2 /dev/vg0/lv3 /dev/vg0/lv4
> > 
> > # First we create a single fs and mount it
> > mkfs.btrfs -f $DEV
> > mount $DEV /mnt/test
> > 
> > # Now we create the RAID fs
> > mkfs.btrfs -f -d raid10 -m raid10 $RAID_DEVS
> > 
> > # Now we add one of the raid devs to the single mount above
> > btrfs device add /dev/vg0/lv2 /mnt/test
> > 
> > # /dev/vg0/lv2 is no longer part of the fs it was made on, it's part of the fs
> > # that's mounted at /mnt/test
> > 
> > # Mount degraded with the raid setup
> > mount -o degraded /dev/vg0/lv1 /mnt/scratch
> > 
> > # Now we shouldn't have found /dev/vg0/lv2, because it was wiped and is no
> > # longer part of the fs_devices for this thing, except it is because it wasn't
> > # removed, so when we do the following it doesn't show as missing
> > btrfs filesystem show /mnt/scratch
> > 
> 
>  I thought I understood the test case. Now it is better. Thanks for taking
> the time to explain.
> 
> 
> > >   The above para is confusing. It can go, IMHO. The device path shouldn't
> > >   matter, as we depend on the bdev comparison in the device add thread.
> > > 
> > > 2637         bdev = blkdev_get_by_path(device_path, FMODE_WRITE |
> > > FMODE_EXCL,
> > > 2638                                   fs_info->bdev_holder);
> > > ::
> > > 2657         list_for_each_entry_rcu(device, &fs_devices->devices, dev_list)
> > > {
> > > 2658                 if (device->bdev == bdev) {
> > > 2659                         ret = -EEXIST;
> > > 2660                         rcu_read_unlock();
> > > 2661                         goto error;
> > > 2662                 }
> > > 2663         }
> > > 
> > 
> > This is on the init thread; it is just checking the fs_devices of /mnt/test,
> > not the fs_devices of the RAID setup that we created, so this doesn't error out
> > (nor should it) because we're adding it to our mounted fs.
> > 
> > > 
> > > > This is fine in general, because when we open the devices we check the
> > > > UUID, and properly skip using the device that we added to the other file
> > > > system.  However we do not remove it from our fs_devices,
> > > 
> > > Hmm, context/thread is missing. Like, is it during device add or during
> > > mkfs/dev-scan?
> > > 
> > > AFAIK btrfs_free_stale_devices() checks for and handles the removal of stale
> > > devices from the fs_devices in both contexts/threads: device-add and mkfs
> > > (device-scan).
> > > 
> > >   For example:
> > > 
> > > $ alias cnt_free_stale_devices="bpftrace -e 'kprobe:btrfs_free_stale_devices
> > > { @ = count(); }'"
> > > 
> > > New FSID on 2 devices, we call free_stale_devices():
> > > 
> > > $ cnt_free_stale_devices -c 'mkfs.btrfs -fq -draid1 -mraid1 /dev/vg/scratch0
> > > /dev/vg/scratch1'
> > > Attaching 1 probe...
> > > 
> > > @: 2
> > > 
> > >   We do it only when there is a new fsid/device added to the fs_devices.
> > > 
> > > For example:
> > > 
> > > Clean up the fs_devices:
> > > $ cnt_free_stale_devices -c 'btrfs dev scan --forget'
> > > Attaching 1 probe...
> > > 
> > > @: 1
> > > 
> > > Now the devices being mounted are new to the fs_devices list, so we call
> > > free_stale_devices().
> > > 
> > > $ cnt_free_stale_devices -c 'mount -o device=/dev/vg/scratch0
> > > /dev/vg/scratch1 /btrfs'
> > > Attaching 1 probe...
> > > 
> > > @: 2
> > > 
> > > $ umount /btrfs
> > > 
> > > Below we didn't call free_stale_devices() because these two devices are
> > > already in the appropriate fs_devices.
> > > 
> > > $ cnt_free_stale_devices -c 'mount -o device=/dev/vg/scratch0
> > > /dev/vg/scratch1 /btrfs'
> > > Attaching 1 probe...
> > > 
> > > @: 0
> > > 
> > > $
> > > 
> > > To me, it looks to be working correctly.
> > > 
> > 
> > Yes it does work correctly, most of the time.  If you run it in a loop 500 times
> > it'll fail, because _sometimes_ udev goes in and does the btrfs device scan and
> > changes the name of the device in the fs_devices for the RAID group.  So the
> > btrfs_free_stale_devices() thing doesn't find the existing device, because it's
> > just looking at the device->name, which is different from the device we're
> > adding.
> > 
> > We have the fs_devices for the RAID fs, and instead of /dev/vg0/lv2, we have
> > /dev/dm-4 or whatever.  So we do the addition of /dev/vg0/lv2, go to find it in
> > any other fs_devices, and don't find it because strcmp("/dev/vg0/lv2",
> > "/dev/dm0-4") != 0, and thus leave the device on the fs_devices for the RAID
> > file system.
> > 
> 
>  I got it. It shouldn't be difficult to reproduce, and I could
> reproduce it without this patch.
> 
> 
>  Below is a device with two different paths: dm and its mapper.
> 
> ----------
>  $ ls -bli /dev/mapper/vg-scratch1  /dev/dm-1
>  561 brw-rw---- 1 root disk 252, 1 Dec  3 12:13 /dev/dm-1
>  565 lrwxrwxrwx 1 root root      7 Dec  3 12:13 /dev/mapper/vg-scratch1 ->
> ../dm-1
> ----------
> 
>  Clean the fs_devices.
> 
> ----------
>  $ btrfs dev scan --forget
> ----------
> 
>  Use the mapper to do mkfs.btrfs.
> 
> ----------
>  $ mkfs.btrfs -fq /dev/mapper/vg-scratch0
>  $ mount /dev/mapper/vg-scratch0 /btrfs
> ----------
> 
>  Create raid1 again using the mapper path.
> 
> ----------
>  $ mkfs.btrfs -U $uuid -fq -draid1 -mraid1 /dev/mapper/vg-scratch1
> /dev/mapper/vg-scratch2
> ----------
> 
>  Use dm path to add the device which belongs to another btrfs filesystem.
> 
> ----------
>  $ btrfs dev add -f /dev/dm-1 /btrfs
> ----------
> 
>  Now mount the above raid1 in degraded mode.
> 
> ----------
>  $ mount -o degraded /dev/mapper/vg-scratch2 /btrfs1
> ----------
> 

Ahhh nice, I couldn't figure out a way to trigger it manually.  I wonder if we
can figure out a way to do this in xfstests without needing to have your
SCRATCH_DEV on lvm already?

<snip>

> > 
> > Yeah I was a little fuzzy on this.  I think *any* failure should mean that we
> > remove the device from the fs_devices tho right?  So that we show we're missing
> > a device, since we can't actually access it?  I'm actually asking, because I
> > think we can go either way, but to me I think any failure should result in the
> > removal of the device so we can re-scan the correct one.  Thanks,
> > 
> 
> It is difficult to generalize, I guess. For example, consider transient
> errors during boot-up, errors due to slow-to-ready devices, or system-related
> errors such as ENOMEM/EACCES; none of these calls for freeing the device. If
> we free the device on a transient error, any further attempt to mount will
> fail unless there is a device scan again.
> 
> Here the bug is about btrfs_free_stale_devices(), which fails to identify
> the same device when tricked by mixing the dm and mapper paths.
> Can I check with you if there is another way to fix this, by checking the
> device major and minor numbers or the serial number from the device inquiry
> page?

I suppose I could just change it so that our verification processes, like the
MAGIC or FSID checks, return ENODATA and we only do it in those cases.  Does
that seem reasonable?
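
To make that concrete, a minimal sketch (assuming btrfs_open_one_device()
is reworked so that the error_free_page label returns a ret variable
instead of today's hard-coded -EINVAL):

	ret = -EINVAL;
	if (devid != device->devid)
		goto error_free_page;

	if (memcmp(device->uuid, disk_super->dev_item.uuid, BTRFS_UUID_SIZE)) {
		/* Identity mismatch: the device is no longer ours. */
		ret = -ENODATA;
		goto error_free_page;
	}
::
error_free_page:
	btrfs_release_disk_super(disk_super);
	blkdev_put(bdev, flags);
	return ret;

open_fs_devices() would then keep freeing the device only on -ENODATA,
and transient errors (ENOMEM, EACCES, ...) would leave it in place for a
later rescan.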

Thanks
Anand Jain Dec. 6, 2021, 2:32 p.m. UTC | #5
<snip>


>>   I got it. It shouldn't be difficult to reproduce, and I could
>> reproduce it without this patch.
>>
>>
>>   Below is a device with two different paths: dm and its mapper.
>>
>> ----------
>>   $ ls -bli /dev/mapper/vg-scratch1  /dev/dm-1
>>   561 brw-rw---- 1 root disk 252, 1 Dec  3 12:13 /dev/dm-1
>>   565 lrwxrwxrwx 1 root root      7 Dec  3 12:13 /dev/mapper/vg-scratch1 ->
>> ../dm-1
>> ----------
>>
>>   Clean the fs_devices.
>>
>> ----------
>>   $ btrfs dev scan --forget
>> ----------
>>
>>   Use the mapper to do mkfs.btrfs.
>>
>> ----------
>>   $ mkfs.btrfs -fq /dev/mapper/vg-scratch0
>>   $ mount /dev/mapper/vg-scratch0 /btrfs
>> ----------
>>
>>   Create raid1 again using the mapper path.
>>
>> ----------
>>   $ mkfs.btrfs -U $uuid -fq -draid1 -mraid1 /dev/mapper/vg-scratch1
>> /dev/mapper/vg-scratch2
>> ----------
>>
>>   Use dm path to add the device which belongs to another btrfs filesystem.
>>
>> ----------
>>   $ btrfs dev add -f /dev/dm-1 /btrfs
>> ----------
>>
>>   Now mount the above raid1 in degraded mode.
>>
>> ----------
>>   $ mount -o degraded /dev/mapper/vg-scratch2 /btrfs1
>> ----------
>>
> 
> Ahhh nice, I couldn't figure out a way to trigger it manually.  I wonder if we
> can figure out a way to do this in xfstests without needing to have your
> SCRATCH_DEV on lvm already?

Yep. A dm linear on top of a raw device will help. I have a rough draft 
working. I could send it to xfstests if you want?

>>> Yeah I was a little fuzzy on this.  I think *any* failure should mean that we
>>> remove the device from the fs_devices tho right?  So that we show we're missing
>>> a device, since we can't actually access it?  I'm actually asking, because I
>>> think we can go either way, but to me I think any failure should result in the
>>> removal of the device so we can re-scan the correct one.  Thanks,
>>>
>>
>> It is difficult to generalize, I guess. For example, consider transient
>> errors during boot-up, errors due to slow-to-ready devices, or system-related
>> errors such as ENOMEM/EACCES; none of these calls for freeing the device. If
>> we free the device on a transient error, any further attempt to mount will
>> fail unless there is a device scan again.
>>
>> Here the bug is about btrfs_free_stale_devices(), which fails to identify
>> the same device when tricked by mixing the dm and mapper paths.
>> Can I check with you if there is another way to fix this, by checking the
>> device major and minor numbers or the serial number from the device inquiry
>> page?
> 

> I suppose I could just change it so that our verification processes, like the
> MAGIC or FSID checks, return ENODATA and we only do it in those cases.  Does
> that seem reasonable?

The 'btrfs device add' calls btrfs_free_stale_devices(); however,
device_path_matched() fails to match the device by its path. So IMO,
the fix has to be in device_path_matched(), but with a different
parameter to match on instead of the device path.

Here is another manifestation of the same problem.

$ mkfs.btrfs -fq /dev/dm-0
$ cat /proc/fs/btrfs/devlist | grep device:
  device: /dev/dm-0

$ btrfs dev scan --forget /dev/dm-0
ERROR: cannot unregister device '/dev/mapper/tsdb': No such file or 
directory

Above, mkfs.btrfs does not sanitize the input device path; however, the
forget command sanitizes the input device to /dev/mapper/tsdb, so the
device_path_matched() in btrfs_free_stale_devices() fails. So the fix
has to be in device_path_matched(), and why not replace the strcmp()
with a comparison of the major and minor device numbers?
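
For example, something along these lines (a rough sketch only; the
btrfs_device::devt field is hypothetical here and would have to be
cached at scan time):

static bool device_path_matched(const char *path, struct btrfs_device *device)
{
	dev_t dev;

	/* Resolve whatever alias the user passed to its device number. */
	if (lookup_bdev(path, &dev))
		return false;

	/* Now /dev/dm-0 and /dev/mapper/tsdb compare equal. */
	return device->devt == dev;
}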

Thanks, Anand
Josef Bacik Dec. 6, 2021, 7:16 p.m. UTC | #6
On Mon, Dec 06, 2021 at 10:32:07PM +0800, Anand Jain wrote:
> 
> <snip>
> 
> 
> > >   I got it. It shouldn't be difficult to reproduce, and I could
> > > reproduce it without this patch.
> > > 
> > > 
> > >   Below is a device with two different paths: dm and its mapper.
> > > 
> > > ----------
> > >   $ ls -bli /dev/mapper/vg-scratch1  /dev/dm-1
> > >   561 brw-rw---- 1 root disk 252, 1 Dec  3 12:13 /dev/dm-1
> > >   565 lrwxrwxrwx 1 root root      7 Dec  3 12:13 /dev/mapper/vg-scratch1 ->
> > > ../dm-1
> > > ----------
> > > 
> > >   Clean the fs_devices.
> > > 
> > > ----------
> > >   $ btrfs dev scan --forget
> > > ----------
> > > 
> > >   Use the mapper to do mkfs.btrfs.
> > > 
> > > ----------
> > >   $ mkfs.btrfs -fq /dev/mapper/vg-scratch0
> > >   $ mount /dev/mapper/vg-scratch0 /btrfs
> > > ----------
> > > 
> > >   Create raid1 again using the mapper path.
> > > 
> > > ----------
> > >   $ mkfs.btrfs -U $uuid -fq -draid1 -mraid1 /dev/mapper/vg-scratch1
> > > /dev/mapper/vg-scratch2
> > > ----------
> > > 
> > >   Use dm path to add the device which belongs to another btrfs filesystem.
> > > 
> > > ----------
> > >   $ btrfs dev add -f /dev/dm-1 /btrfs
> > > ----------
> > > 
> > >   Now mount the above raid1 in degraded mode.
> > > 
> > > ----------
> > >   $ mount -o degraded /dev/mapper/vg-scratch2 /btrfs1
> > > ----------
> > > 
> > 
> > Ahhh nice, I couldn't figure out a way to trigger it manually.  I wonder if we
> > can figure out a way to do this in xfstests without needing to have your
> > SCRATCH_DEV on lvm already?
> 
> Yep. A dm linear on top of a raw device will help. I have a rough draft
> working. I could send it to xfstests if you want?
> 

Yes please, that would be helpful.

> > > > Yeah I was a little fuzzy on this.  I think *any* failure should mean that we
> > > > remove the device from the fs_devices tho right?  So that we show we're missing
> > > > a device, since we can't actually access it?  I'm actually asking, because I
> > > > think we can go either way, but to me I think any failure should result in the
> > > > removal of the device so we can re-scan the correct one.  Thanks,
> > > > 
> > > 
> > > It is difficult to generalize, I guess. For example, consider transient
> > > errors during boot-up, errors due to slow-to-ready devices, or system-related
> > > errors such as ENOMEM/EACCES; none of these calls for freeing the device. If
> > > we free the device on a transient error, any further attempt to mount will
> > > fail unless there is a device scan again.
> > > 
> > > Here the bug is about btrfs_free_stale_devices(), which fails to identify
> > > the same device when tricked by mixing the dm and mapper paths.
> > > Can I check with you if there is another way to fix this, by checking the
> > > device major and minor numbers or the serial number from the device inquiry
> > > page?
> > 
> 
> > I suppose I could just change it so that our verification processes, like the
> > MAGIC or FSID checks, return ENODATA and we only do it in those cases.  Does
> > that seem reasonable?
> 
> The 'btrfs device add' calls btrfs_free_stale_devices(); however,
> device_path_matched() fails to match the device by its path. So IMO, the fix
> has to be in device_path_matched(), but with a different parameter to match
> on instead of the device path.
> 
> Here is another manifestation of the same problem.
> 
> $ mkfs.btrfs -fq /dev/dm-0
> $ cat /proc/fs/btrfs/devlist | grep device:
>  device: /dev/dm-0
> 
> $ btrfs dev scan --forget /dev/dm-0
> ERROR: cannot unregister device '/dev/mapper/tsdb': No such file or
> directory
> 
> Above, mkfs.btrfs does not sanitize the input device path; however, the
> forget command sanitizes the input device to /dev/mapper/tsdb, so the
> device_path_matched() in btrfs_free_stale_devices() fails. So the fix has to
> be in device_path_matched(), and why not replace the strcmp() with a
> comparison of the major and minor device numbers?

I like that better actually, you mind wiring it up since it's your idea?
Thanks,

Josef
Anand Jain Dec. 6, 2021, 10:23 p.m. UTC | #7
On 07/12/2021 03:16, Josef Bacik wrote:
> On Mon, Dec 06, 2021 at 10:32:07PM +0800, Anand Jain wrote:
>>
>> <snip>
>>
>>
>>>>    I got it. It shouldn't be difficult to reproduce, and I could
>>>> reproduce it without this patch.
>>>>
>>>>
>>>>    Below is a device with two different paths: dm and its mapper.
>>>>
>>>> ----------
>>>>    $ ls -bli /dev/mapper/vg-scratch1  /dev/dm-1
>>>>    561 brw-rw---- 1 root disk 252, 1 Dec  3 12:13 /dev/dm-1
>>>>    565 lrwxrwxrwx 1 root root      7 Dec  3 12:13 /dev/mapper/vg-scratch1 ->
>>>> ../dm-1
>>>> ----------
>>>>
>>>>    Clean the fs_devices.
>>>>
>>>> ----------
>>>>    $ btrfs dev scan --forget
>>>> ----------
>>>>
>>>>    Use the mapper to do mkfs.btrfs.
>>>>
>>>> ----------
>>>>    $ mkfs.btrfs -fq /dev/mapper/vg-scratch0
>>>>    $ mount /dev/mapper/vg-scratch0 /btrfs
>>>> ----------
>>>>
>>>>    Create raid1 again using the mapper path.
>>>>
>>>> ----------
>>>>    $ mkfs.btrfs -U $uuid -fq -draid1 -mraid1 /dev/mapper/vg-scratch1
>>>> /dev/mapper/vg-scratch2
>>>> ----------
>>>>
>>>>    Use dm path to add the device which belongs to another btrfs filesystem.
>>>>
>>>> ----------
>>>>    $ btrfs dev add -f /dev/dm-1 /btrfs
>>>> ----------
>>>>
>>>>    Now mount the above raid1 in degraded mode.
>>>>
>>>> ----------
>>>>    $ mount -o degraded /dev/mapper/vg-scratch2 /btrfs1
>>>> ----------
>>>>
>>>
>>> Ahhh nice, I couldn't figure out a way to trigger it manually.  I wonder if we
>>> can figure out a way to do this in xfstests without needing to have your
>>> SCRATCH_DEV on lvm already?
>>
>> Yep. A dm linear on top of a raw device will help. I have a rough draft
>> working. I could send it to xfstests if you want?
>>
> 
> Yes please, that would be helpful.
> 
>>>>> Yeah I was a little fuzzy on this.  I think *any* failure should mean that we
>>>>> remove the device from the fs_devices tho right?  So that we show we're missing
>>>>> a device, since we can't actually access it?  I'm actually asking, because I
>>>>> think we can go either way, but to me I think any failure should result in the
>>>>> removal of the device so we can re-scan the correct one.  Thanks,
>>>>>
>>>>
>>>> It is difficult to generalize, I guess. For example, consider transient
>>>> errors during boot-up, errors due to slow-to-ready devices, or system-related
>>>> errors such as ENOMEM/EACCES; none of these calls for freeing the device. If
>>>> we free the device on a transient error, any further attempt to mount will
>>>> fail unless there is a device scan again.
>>>>
>>>> Here the bug is about btrfs_free_stale_devices(), which fails to identify
>>>> the same device when tricked by mixing the dm and mapper paths.
>>>> Can I check with you if there is another way to fix this, by checking the
>>>> device major and minor numbers or the serial number from the device inquiry
>>>> page?
>>>
>>
>>> I suppose I could just change it so that our verification processes, like the
>>> MAGIC or FSID checks, return ENODATA and we only do it in those cases.  Does
>>> that seem reasonable?
>>
>> The 'btrfs device add' calls btrfs_free_stale_devices(); however,
>> device_path_matched() fails to match the device by its path. So IMO, the fix
>> has to be in device_path_matched(), but with a different parameter to match
>> on instead of the device path.
>>
>> Here is another manifestation of the same problem.
>>
>> $ mkfs.btrfs -fq /dev/dm-0
>> $ cat /proc/fs/btrfs/devlist | grep device:
>>   device: /dev/dm-0
>>
>> $ btrfs dev scan --forget /dev/dm-0
>> ERROR: cannot unregister device '/dev/mapper/tsdb': No such file or
>> directory
>>
>> Above, mkfs.btrfs does not sanitize the input device path; however, the
>> forget command sanitizes the input device to /dev/mapper/tsdb, so the
>> device_path_matched() in btrfs_free_stale_devices() fails. So the fix has to
>> be in device_path_matched(), and why not replace the strcmp() with a
>> comparison of the major and minor device numbers?
> 
> I like that better actually, you mind wiring it up since it's your idea?
> Thanks,
> 

  Sure. For both xfstests and the fix.

Thanks, Anand

Patch

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5c598e124c25..fa34b8807f8d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3924,7 +3924,7 @@  struct btrfs_super_block *btrfs_read_dev_one_super(struct block_device *bdev,
 	super = page_address(page);
 	if (btrfs_super_magic(super) != BTRFS_MAGIC) {
 		btrfs_release_disk_super(super);
-		return ERR_PTR(-ENODATA);
+		return ERR_PTR(-EINVAL);
 	}
 
 	if (btrfs_super_bytenr(super) != bytenr_orig) {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f38c230111be..890153dd2a2b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1223,7 +1223,7 @@  static int open_fs_devices(struct btrfs_fs_devices *fs_devices,
 		if (ret == 0 &&
 		    (!latest_dev || device->generation > latest_dev->generation)) {
 			latest_dev = device;
-		} else if (ret == -ENODATA) {
+		} else if (ret) {
 			fs_devices->num_devices--;
 			list_del(&device->dev_list);
 			btrfs_free_device(device);