
[v3] Btrfs: fix incremental send's decision to delay a dir move/rename

Message ID 1395002246-3840-1-git-send-email-fdmanana@gmail.com (mailing list archive)
State Accepted

Commit Message

Filipe Manana March 16, 2014, 8:37 p.m. UTC
It's possible to change the parent/child relationship between directories
in such a way that if a child directory has a higher inode number than
its parent, it doesn't necessarily mean the child rename/move operation
can be performed immediately. The parent might have its own rename/move
operation delayed; therefore, in this case the child needs to have its
rename/move operation delayed too, and be performed after its new parent's
rename/move.

Steps to reproduce the issue:

      $ umount /mnt
      $ mkfs.btrfs -f /dev/sdd
      $ mount /dev/sdd /mnt

      $ mkdir /mnt/A
      $ mkdir /mnt/B
      $ mkdir /mnt/C
      $ mv /mnt/C /mnt/A
      $ mv /mnt/B /mnt/A/C
      $ mkdir /mnt/A/C/D

      $ btrfs subvolume snapshot -r /mnt /mnt/snap1
      $ btrfs send /mnt/snap1 -f /tmp/base.send

      $ mv /mnt/A/C/D /mnt/A/D2
      $ mv /mnt/A/C/B /mnt/A/D2/B2
      $ mv /mnt/A/C /mnt/A/D2/B2/C2

      $ btrfs subvolume snapshot -r /mnt /mnt/snap2
      $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send

The incremental send caused the kernel code to enter an infinite loop when
building the path string for directory C after its references are processed.

The necessary conditions here are that C has an inode number higher than both
A and B, B has a higher inode number than A, and D has the highest inode
number, that is:
    inode_number(A) < inode_number(B) < inode_number(C) < inode_number(D)
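
For reference (not part of the original reproducer), the ordering can be
confirmed before taking the first snapshot by printing the inode numbers of
the directories at their paths after the mv steps above, e.g.:

      $ ls -id /mnt/A        # inode_number(A)
      $ ls -id /mnt/A/C/B    # inode_number(B)
      $ ls -id /mnt/A/C      # inode_number(C)
      $ ls -id /mnt/A/C/D    # inode_number(D)

The four numbers should come out in increasing order.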

The same issue could happen if after the first snapshot there's any number
of intermediary parent directories between D2 and B2, and between B2 and C2.

A test case for xfstests follows, covering this simple case and more advanced
ones, with files and hard links created inside the directories.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
---

V2: Right version of the patch. The previously sent one came from the wrong VM.
V3: The condition needed to check already existed, so just moved it to the
    top, instead of adding it again.

 fs/btrfs/send.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

Marc MERLIN March 16, 2014, 10:20 p.m. UTC | #1
I just created this array:
polgara:/mnt/btrfs_backupcopy# btrfs fi show
Label: backupcopy  uuid: 7d8e1197-69e4-40d8-8d86-278d275af896
        Total devices 10 FS bytes used 220.32GiB
        devid    1 size 465.76GiB used 25.42GiB path /dev/dm-0
        devid    2 size 465.76GiB used 25.40GiB path /dev/dm-1
        devid    3 size 465.75GiB used 25.40GiB path /dev/mapper/crypt_sde1
        devid    4 size 465.76GiB used 25.40GiB path /dev/dm-3
        devid    5 size 465.76GiB used 25.40GiB path /dev/dm-4
        devid    6 size 465.76GiB used 25.40GiB path /dev/dm-5
        devid    7 size 465.76GiB used 25.40GiB path /dev/dm-6
        devid    8 size 465.76GiB used 25.40GiB path /dev/mapper/crypt_sdj1
        devid    9 size 465.76GiB used 25.40GiB path /dev/dm-9
        devid    10 size 465.76GiB used 25.40GiB path /dev/dm-8

And clearly it has issues with one of the drives.

I have a copy that is still going on to it.

Last I tried to boot a raid5 btrfs array with a drive missing, that didn't work at all.

Since this array is still running, what are my options?
I can't tell btrfs to replace drive sde1 with a new drive I plugged in
because the code doesn't exist, correct?
If I yank sde1 and reboot, the array will not come back up from what I understand,
or is that incorrect?
Do rebuilds work at all with a missing drive to a spare drive?

This is with 3.14.0-rc5.

Do I have other options?
(data is not important at all, I just want to learn how to deal with such a case
with the current code)

[59532.543415] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 162, rd 2444, flush 0, corrupt 0, gen 0
[59547.654888] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 162, rd 2445, flush 0, corrupt 0, gen 0
[59547.655755] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 162, rd 2446, flush 0, corrupt 0, gen 0
[59552.096038] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 162, rd 2447, flush 0, corrupt 0, gen 0
[59552.096613] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 162, rd 2448, flush 0, corrupt 0, gen 0
[59557.124736] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 162, rd 2449, flush 0, corrupt 0, gen 0
[59557.125569] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 162, rd 2450, flush 0, corrupt 0, gen 0
[59572.694548] BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
[59572.695757] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 163, rd 2450, flush 0, corrupt 0, gen 0
[59572.696295] BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
[59572.696976] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 164, rd 2450, flush 0, corrupt 0, gen 0
[59572.697693] BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
[59572.698397] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 165, rd 2450, flush 0, corrupt 0, gen 0
[59586.844083] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 165, rd 2451, flush 0, corrupt 0, gen 0
[59586.844614] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 165, rd 2452, flush 0, corrupt 0, gen 0
[59587.087696] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 165, rd 2453, flush 0, corrupt 0, gen 0
[59587.088378] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 165, rd 2454, flush 0, corrupt 0, gen 0
[59587.188784] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 165, rd 2455, flush 0, corrupt 0, gen 0
[59587.189280] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 165, rd 2456, flush 0, corrupt 0, gen 0
[59587.189737] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 165, rd 2457, flush 0, corrupt 0, gen 0
[59612.829235] BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
[59612.829871] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 166, rd 2457, flush 0, corrupt 0, gen 0
[59612.830767] BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
[59612.831397] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 167, rd 2457, flush 0, corrupt 0, gen 0
[59612.832220] BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
[59612.832848] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 168, rd 2457, flush 0, corrupt 0, gen 0
[59648.014743] BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
[59648.015221] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 169, rd 2457, flush 0, corrupt 0, gen 0
[59648.015694] BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
[59648.016154] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 170, rd 2457, flush 0, corrupt 0, gen 0
[59648.017249] BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1

By the way, I found this very amusing:
polgara:/mnt/btrfs_backupcopy# smartctl -i /dev/sde
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.14.0-rc5-amd64-i915-preempt-20140216c] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               /14:0:0:
Product:              0
User Capacity:        600,332,565,813,390,450 bytes [600 PB]
Logical block size:   774843950 bytes
Physical block size:  1549687900 bytes

I have a 600PB drive for sale, please make me offers :)

Thanks,
Marc
Chris Murphy March 16, 2014, 10:55 p.m. UTC | #2
On Mar 16, 2014, at 4:20 PM, Marc MERLIN <marc@merlins.org> wrote:


> If I yank sde1 and reboot, the array will not come back up from what I understand,
> or is that incorrect?
> Do rebuilds work at all with a missing drive to a spare drive?

The part that isn't working well enough is faulty status. The drive keeps hanging around producing a lot of errors, instead of getting booted. btrfs replace start ought to still work, but if the faulty drive is fussy it might slow down the rebuild, or even prevent it.

The more conservative approach is to pull the drive. If you've previously tested this hardware setup to tolerate hot swap, you can give that a shot. Otherwise, to avoid instability and additional problems, unmount the file system first. Do the hot swap. Then mount it -o degraded. Then use btrfs replace start.
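
Roughly, that sequence would be the following (untested here; the devid, new device name and mount point are placeholders for your setup):

    umount /mnt/btrfs_backupcopy                                  # stop using the volume before touching hardware
    # physically swap the faulty drive for the new one here
    mount -o degraded LABEL=backupcopy /mnt/btrfs_backupcopy      # mount without the missing device
    btrfs replace start <faulty-devid> /dev/<new-drive> /mnt/btrfs_backupcopy   # rebuild onto the new drive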


Chris Murphy

Chris Murphy March 16, 2014, 11:12 p.m. UTC | #3
On Mar 16, 2014, at 4:55 PM, Chris Murphy <lists@colorremedies.com> wrote:

> Then use btrfs replace start.

Looks like in 3.14rc6 replace isn't yet supported. I get "dev_replace cannot yet handle RAID5/RAID6".

When I do:
btrfs device add <new> <mp>

The command hangs, no kernel messages.

Chris Murphy

Marc MERLIN March 16, 2014, 11:17 p.m. UTC | #4
On Sun, Mar 16, 2014 at 05:12:10PM -0600, Chris Murphy wrote:
> 
> On Mar 16, 2014, at 4:55 PM, Chris Murphy <lists@colorremedies.com> wrote:
> 
> > Then use btrfs replace start.
> 
> Looks like in 3.14rc6 replace isn't yet supported. I get "dev_replace cannot yet handle RAID5/RAID6".
> 
> When I do:
> btrfs device add <new> <mp>
> 
> The command hangs, no kernel messages.

Ok, that's kind of what I thought.
So, for now, with raid5:
- btrfs seems to handle a drive not working 
- you say I can mount with the drive missing in degraded mode (I haven't
  tried that, I will)
- but no matter how I remove the faulty drive, there is no rebuild on a
  new drive procedure that works yet

Correct?

Thanks,
Marc
Chris Murphy March 16, 2014, 11:20 p.m. UTC | #5
On Mar 16, 2014, at 5:12 PM, Chris Murphy <lists@colorremedies.com> wrote:

> 
> On Mar 16, 2014, at 4:55 PM, Chris Murphy <lists@colorremedies.com> wrote:
> 
>> Then use btrfs replace start.
> 
> Looks like in 3.14rc6 replace isn't yet supported. I get "dev_replace cannot yet handle RAID5/RAID6".
> 
> When I do:
> btrfs device add <new> <mp>
> 
> The command hangs, no kernel messages.

So even though the device add command hangs, another shell with btrfs fi show reports that it succeeded:

Label: none  uuid: d50b6c0f-518a-455f-9740-e29779649250
	Total devices 4 FS bytes used 5.70GiB
	devid    1 size 7.81GiB used 4.02GiB path /dev/sdb
	devid    2 size 7.81GiB used 3.01GiB path /dev/sdc
	devid    3 size 7.81GiB used 4.01GiB path 
	devid    4 size 7.81GiB used 0.00 path /dev/sdd

Yet umount <mp> says the target is busy. ps reports the command status D+. And it doesn't cancel. So at the moment I'm stuck coming up with a workaround.


Chris Murphy
Chris Murphy March 16, 2014, 11:23 p.m. UTC | #6
On Mar 16, 2014, at 5:17 PM, Marc MERLIN <marc@merlins.org> wrote:

> - but no matter how I remove the faulty drive, there is no rebuild on a
>  new drive procedure that works yet
> 
> Correct?

I'm not sure. From what I've read we should be able to add a device to raid5/6, but I don't know if it's expected we can add a device to a degraded raid5/6. If the add device succeeded, then I ought to be able to remove the missing devid, and then do a balance which should cause reconstruction.

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg30714.html
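
In command terms that would be something like this (untested on raid5/6; the new device and mount point are placeholders):

    btrfs device add /dev/<new-drive> /mnt/point     # add a replacement device to the degraded volume
    btrfs device delete missing /mnt/point           # remove the missing devid
    btrfs balance start /mnt/point                   # rewrite chunks, which should cause reconstruction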


Chris Murphy

ronnie sahlberg March 16, 2014, 11:40 p.m. UTC | #7
On Sun, Mar 16, 2014 at 4:17 PM, Marc MERLIN <marc@merlins.org> wrote:
> On Sun, Mar 16, 2014 at 05:12:10PM -0600, Chris Murphy wrote:
>>
>> On Mar 16, 2014, at 4:55 PM, Chris Murphy <lists@colorremedies.com> wrote:
>>
>> > Then use btrfs replace start.
>>
>> Looks like in 3.14rc6 replace isn't yet supported. I get "dev_replace cannot yet handle RAID5/RAID6".
>>
>> When I do:
>> btrfs device add <new> <mp>
>>
>> The command hangs, no kernel messages.
>
> Ok, that's kind of what I thought.
> So, for now, with raid5:
> - btrfs seems to handle a drive not working
> - you say I can mount with the drive missing in degraded mode (I haven't
>   tried that, I will)
> - but no matter how I remove the faulty drive, there is no rebuild on a
>   new drive procedure that works yet
>
> Correct?

There was a discussion a while back that suggested that a "balance"
would read all blocks and write them out again and that would recover
the data.

I have no idea if that works or not.
Only do this as a last resort once you have already considered all
data lost forever.
Marc MERLIN March 17, 2014, 12:51 a.m. UTC | #8
On Sun, Mar 16, 2014 at 05:23:25PM -0600, Chris Murphy wrote:
> 
> On Mar 16, 2014, at 5:17 PM, Marc MERLIN <marc@merlins.org> wrote:
> 
> > - but no matter how I remove the faulty drive, there is no rebuild on a
> >  new drive procedure that works yet
> > 
> > Correct?
> 
> I'm not sure. From what I've read we should be able to add a device to raid5/6, but I don't know if it's expected we can add a device to a degraded raid5/6. If the add device succeeded, then I ought to be able to remove the missing devid, and then do a balance which should cause reconstruction.
> 
> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg30714.html

Thanks for the link, that's what I thought I read recently.

So, on 3.14, I can confirm
polgara:/mnt/btrfs_backupcopy# btrfs replace start 3 /dev/sdm1 /mnt/btrfs_backupcopy
[68377.679233] BTRFS warning (device dm-9): dev_replace cannot yet handle RAID5/RAID6

polgara:/mnt/btrfs_backupcopy# btrfs device delete /dev/mapper/crypt_sde1 `pwd`
ERROR: error removing the device '/dev/mapper/crypt_sde1' - Invalid argument
and yet
Mar 16 17:48:35 polgara kernel: [69285.032615] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 805, rd 4835, flush 0, corrupt 0, gen 0
Mar 16 17:48:35 polgara kernel: [69285.033791] BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
Mar 16 17:48:35 polgara kernel: [69285.034379] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 806, rd 4835, flush 0, corrupt 0, gen 0
Mar 16 17:48:35 polgara kernel: [69285.035361] BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
Mar 16 17:48:35 polgara kernel: [69285.035943] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 807, rd 4835, flush 0, corrupt 0, gen 0

So from here, it sounds like I can try:
1) unmount the filesystem
2) hope that remounting it without that device will work
3) btrfs device add to recreate the missing drive.

Before I do #1 and get myself in a worse state than I am (working
filesystem), does that sound correct?

(again, the data is irrelevant, I have a btrfs receive on it that has
been running for hours and that I'd have to restart, but that's it).

Thanks,
Marc
Chris Murphy March 17, 2014, 1:06 a.m. UTC | #9
On Mar 16, 2014, at 6:51 PM, Marc MERLIN <marc@merlins.org> wrote:
> 
> 
> polgara:/mnt/btrfs_backupcopy# btrfs device delete /dev/mapper/crypt_sde1 `pwd`
> ERROR: error removing the device '/dev/mapper/crypt_sde1' - Invalid argument

You didn't specify a mount point, is the reason for that error. But also, since you're already effectively degraded with 1 disk you can't remove a 2nd without causing array collapse. You have to add a new device first *and* you have to "rebuild" with balance. Then presumably we can remove the device. But I'm stuck adding so I can't test anything else.

> 
> So from here, it sounds like I can try:
> 1) unmount the filesystem
> 2) hope that remounting it without that device will work
> 3) btrfs device add to recreate the missing drive.
> 
> Before I do #1 and get myself in a worse state than I am (working
> filesystem), does that sound correct?
> 
> (again, the data is irrelevant, I have a btrfs receive on it that has
> been running for hours and that I'd have to restart, but that's it).

Well at this point I'd leave it alone because at least for me, device add hangs that command and all other subsequent btrfs user space commands. So for all I know (untested) the whole volume will block on this device add and is effectively useless.

Chris Murphy
Marc MERLIN March 17, 2014, 1:17 a.m. UTC | #10
On Sun, Mar 16, 2014 at 07:06:23PM -0600, Chris Murphy wrote:
> 
> On Mar 16, 2014, at 6:51 PM, Marc MERLIN <marc@merlins.org> wrote:
> > 
> > 
> > polgara:/mnt/btrfs_backupcopy# btrfs device delete /dev/mapper/crypt_sde1 `pwd`
> > ERROR: error removing the device '/dev/mapper/crypt_sde1' - Invalid argument
> 
> You didn't specify a mount point, is the reason for that error. But also, since you're already effectively degraded with 1 disk you can't remove a 2nd without causing array collapse. You have to add a new device first *and* you have to "rebuild" with balance. Then presumably we can remove the device. But I'm stuck adding so I can't test anything else.

You missed the `pwd` :)
I'm trying to remove the drive that is causing issues, that doesn't make
things worse, does it?
Does btrfs not know that device is the bad one even though it's spamming my
logs continuously about it?

If I add a device, isn't it going to grow my raid to make it bigger instead
of trying to replace the bad device?
In swraid5, if I add a device, it will grow the raid, unless the array is
running in degraded mode.
However, I can't see if btrfs tools know it's in degraded mode or not.

If you are sure adding a device won't grow my raid, I'll give it a shot.

> > (again, the data is irrelevant, I have a btrfs receive on it that has
> > been running for hours and that I'd have to restart, but that's it).
> 
> Well at this point I'd leave it alone because at least for me, device add hangs that command and all other subsequent btrfs user space commands. So for all I know (untested) the whole volume will block on this device add and is effectively useless.

Right. I was hoping that my kernel slightly newer than yours and maybe real
devices would help, but of course I don't know that.

I'll add the new device first after you confirm that there is no chance
it'll try to grow the filesystem :)

Thanks,
Marc
Chris Murphy March 17, 2014, 2:56 a.m. UTC | #11
On Mar 16, 2014, at 7:17 PM, Marc MERLIN <marc@merlins.org> wrote:

> On Sun, Mar 16, 2014 at 07:06:23PM -0600, Chris Murphy wrote:
>> 
>> On Mar 16, 2014, at 6:51 PM, Marc MERLIN <marc@merlins.org> wrote:
>>> 
>>> 
>>> polgara:/mnt/btrfs_backupcopy# btrfs device delete /dev/mapper/crypt_sde1 `pwd`
>>> ERROR: error removing the device '/dev/mapper/crypt_sde1' - Invalid argument
>> 
>> You didn't specify a mount point, is the reason for that error. But also, since you're already effectively degraded with 1 disk you can't remove a 2nd without causing array collapse. You have to add a new device first *and* you have to "rebuild" with balance. Then presumably we can remove the device. But I'm stuck adding so I can't test anything else.
> 
> You missed the `pwd` :)

I just don't know what it means, it's not a reference to mount point I'm familiar with.

> I'm trying to remove the drive that is causing issues, that doesn't make
> things worse, does it?

I don't think you can force a Btrfs volume to go degraded with a device delete command right now, just like there isn't a command to make it go missing or faulty, like md raid.


> Does btrfs not know that device is the bad one even though it's spamming my
> logs continuously about it?

With raid5, you're always at the minimum number of devices to be normally mounted. Removing one immediately makes it degraded which I don't think it's going to permit. At least, I get an error when I do it even without a device giving me fits.

> 
> If I add a device, isn't it going to grow my raid to make it bigger instead
> of trying to replace the bad device?

Yes if it's successful. No if it fails which is the problem I'm having.

> In swraid5, if I add a device, it will grow the raid, unless the array is
> running in degraded mode.
> However, I can't see if btrfs tools know it's in degraded mode or not.

Only once the device is missing, apparently, and then mounted -o degraded.

> 
> If you are sure adding a device won't grow my raid, I'll give it a shot.

No I'm not sure. And yes I suspect it will make it bigger. But so far a.) replace isn't supported yet; and b.) delete causes the volume to go below the minimum required for normal operation which it won't allow; which leaves c.) add a device but I'm getting a hang. So I'm stuck at this point.


> 
>>> (again, the data is irrelevant, I have a btrfs receive on it that has
>>> been running for hours and that I'd have to restart, but that's it).
>> 
>> Well at this point I'd leave it alone because at least for me, device add hangs that command and all other subsequent btrfs user space commands. So for all I know (untested) the whole volume will block on this device add and is effectively useless.
> 
> Right. I was hoping that my kernel slightly newer than yours and maybe real
> devices would help, but of course I don't know that.
> 
> I'll add the new device first after you confirm that there is no chance
> it'll try to grow the filesystem :)

I confirm nothing since I can't proceed with a device add.


Chris Murphy

Marc MERLIN March 17, 2014, 3:44 a.m. UTC | #12
On Sun, Mar 16, 2014 at 08:56:35PM -0600, Chris Murphy wrote:
> >>> polgara:/mnt/btrfs_backupcopy# btrfs device delete /dev/mapper/crypt_sde1 `pwd`
> >>> ERROR: error removing the device '/dev/mapper/crypt_sde1' - Invalid argument
> >> 
> >> You didn't specify a mount point, is the reason for that error. But also, since you're already effectively degraded with 1 disk you can't remove a 2nd without causing array collapse. You have to add a new device first *and* you have to "rebuild" with balance. Then presumably we can remove the device. But I'm stuck adding so I can't test anything else.
> > 
> > You missed the `pwd` :)
> 
> I just don't know what it means, it's not a reference to mount point I'm familiar with.

Try echo `pwd` and you'll understand :)
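
It just expands to the current working directory, so run from inside the mount
point these two are the same command:

polgara:/mnt/btrfs_backupcopy# btrfs device delete /dev/mapper/crypt_sde1 `pwd`
polgara:/mnt/btrfs_backupcopy# btrfs device delete /dev/mapper/crypt_sde1 /mnt/btrfs_backupcopy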
 
> > I'm trying to remove the drive that is causing issues, that doesn't make
> > things worse, does it?
> 
> I don't think you can force a Btrfs volume to go degraded with a device delete command right now, just like there isn't a command to make it go missing or faulty, like md raid.
> > Does btrfs not know that device is the bad one even though it's spamming my
> > logs continuously about it?
> 
> With raid5, you're always at the minimum number of devices to be normally mounted. Removing one immediately makes it degraded which I don't think it's going to permit. At least, I get an error when I do it even without a device giving me fits.

Ok, I understand that.

> > If I add a device, isn't it going to grow my raid to make it bigger instead
> > of trying to replace the bad device?
> 
> Yes if it's successful. No if it fails which is the problem I'm having.

That's where I don't follow you.
You just agreed that it will grow my raid.
So right now it's 4.5TB with 10 drives, if I add one drive, it will grow to
5TB with 11 drives.
How does that help?
Why would btrfs allow me to remove the faulty device since it does not let
you remove a device from a running raid. If I grow it to a bigger raid, it
still won't let me remove the device, will it?
 
> > In swraid5, if I add a device, it will grow the raid, unless the array is
> > running in degraded mode.
> > However, I can't see if btrfs tools know it's in degraded mode or not.
> 
> Only once the device is missing, apparently, and then mounted -o degraded.
 
Duly noted.
If you agree that adding an 11th drive to my array will not help, I'll
unmount the filesystem, remount it in degraded mode with 9 drives and try to
add the new 11th drive.

> > If you are sure adding a device won't grow my raid, I'll give it a shot.
> 
> No I'm not sure. And yes I suspect it will make it bigger. But so far a.) replace isn't supported yet; and b.) delete causes the volume to go below the minimum required for normal operation which it won't allow; which leaves c.) add a device but I'm getting a hang. So I'm stuck at this point.

Right. So I think we also agree that adding a device to the running
filesystem is not what I want to do, since it'll grow it and do nothing to
let me remove the faulty one.

> > I'll add the new device first after you confirm that there is no chance
> > it'll try to grow the filesystem :)
> 
> I confirm nothing since I can't proceed with a device add.

Fair enough.

So unless someone tells me otherwise, I will unmount the filesystem, remount
it in degraded mode, and then try to add the 11th drive when the 10th one is
missing.

Thanks,
Marc
Chris Murphy March 17, 2014, 5:12 a.m. UTC | #13
On Mar 16, 2014, at 9:44 PM, Marc MERLIN <marc@merlins.org> wrote:

> On Sun, Mar 16, 2014 at 08:56:35PM -0600, Chris Murphy wrote:
> 
>>> If I add a device, isn't it going to grow my raid to make it bigger instead
>>> of trying to replace the bad device?
>> 
>> Yes if it's successful. No if it fails which is the problem I'm having.
> 
> That's where I don't follow you.
> You just agreed that it will grow my raid.
> So right now it's 4.5TB with 10 drives, if I add one drive, it will grow to
> 5TB with 11 drives.
> How does that help?

If you swap the faulty drive for a good drive, I'm thinking then you'll be able to device delete the bad device, which ought to be "missing" at that point; or if that fails you should be able to do a balance, and then be able to device delete the faulty drive.

The problem I'm having is that when I detach one device out of a 3 device raid5, btrfs fi show doesn't list it as missing. It's listed without the /dev/sdd designation it had when attached, but now it's just blank.


> Why would btrfs allow me to remove the faulty device since it does not let
> you remove a device from a running raid. If I grow it to a bigger raid, it
> still won't let me remove the device, will it?

Maybe not, but it seems like it ought to let you balance, which should only be across available devices at which point you should be able to device delete the bad one. That's assuming you've physically detached the faulty device from the start though.

> 
>>> In swraid5, if I add a device, it will grow the raid, unless the array is
>>> running in degraded mode.
>>> However, I can't see if btrfs tools know it's in degraded mode or not.
>> 
>> Only once the device is missing, apparently, and then mounted -o degraded.
> 
> Duly noted.
> If you agree that adding an 11th drive to my array will not help, I'll
> unmount the filesystem, remount it in degraded mode with 9 drives and try to
> add the new 11th drive.

That's the only option I see at the moment in any case, other than blowing it all away and starting from scratch. What I don't know is whether you will be able to 'btrfs device delete' what ought to now be a missing device, since you have enough drives added to proceed with that deletion; or if you'll have to balance first. And I don't even know if the balance will work and then let you device delete if it's a dead end at this point.

> 
>>> If you are sure adding a device won't grow my raid, I'll give it a shot.
>> 
>> No I'm not sure. And yes I suspect it will make it bigger. But so far a.) replace isn't supported yet; and b.) delete causes the volume to go below the minimum required for normal operation which it won't allow; which leaves c.) add a device but I'm getting a hang. So I'm stuck at this point.
> 
> Right. So I think we also agree that adding a device to the running
> filesystem is not what I want to do, since it'll grow it and do nothing to
> let me remove the faulty one.

The grow is entirely beside the point. You definitely can't btrfs replace, or btrfs device delete, so what else is there but try btrfs device add, or obliterate it and start over?


Chris Murphy

Marc MERLIN March 17, 2014, 4:13 p.m. UTC | #14
On Sun, Mar 16, 2014 at 11:12:43PM -0600, Chris Murphy wrote:
> 
> On Mar 16, 2014, at 9:44 PM, Marc MERLIN <marc@merlins.org> wrote:
> 
> > On Sun, Mar 16, 2014 at 08:56:35PM -0600, Chris Murphy wrote:
> > 
> >>> If I add a device, isn't it going to grow my raid to make it bigger instead
> >>> of trying to replace the bad device?
> >> 
> >> Yes if it's successful. No if it fails which is the problem I'm having.
> > 
> > That's where I don't follow you.
> > You just agreed that it will grow my raid.
> > So right now it's 4.5TB with 10 drives, if I add one drive, it will grow to
> > 5TB with 11 drives.
> > How does that help?
> 
> If you swap the faulty drive for a good drive, I'm thinking then you'll be able to device delete the bad device, which ought to be "missing" at that point; or if that fails you should be able to do a balance, and then be able to device delete the faulty drive.
> 
> The problem I'm having is that when I detach one device out of a 3 device raid5, btrfs fi show doesn't list it as missing. It's listed without the /dev/sdd designation it had when attached, but now it's just blank.

Ok, I tried unmounting and remounting degraded this morning:

polgara:~# mount -v -t btrfs -o compress=zlib,space_cache,noatime,degraded LABEL=backupcopy /mnt/btrfs_backupcopy
Mar 17 08:57:35 polgara kernel: [123824.344085] BTRFS: device label backupcopy devid 9 transid 3837 /dev/mapper/crypt_sdk1
Mar 17 08:57:35 polgara kernel: [123824.454641] BTRFS info (device dm-9): allowing degraded mounts
Mar 17 08:57:35 polgara kernel: [123824.454978] BTRFS info (device dm-9): disk space caching is enabled
Mar 17 08:57:35 polgara kernel: [123824.497437] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 3888, rd 321927975, flush 0, corrupt 0, gen 0
/dev/mapper/crypt_sdk1 on /mnt/btrfs_backupcopy type btrfs (rw,noatime,compress=zlib,space_cache,degraded)

What's confusing is that mounting in degraded mode shows all devices:
polgara:~# btrfs fi show
Label: backupcopy  uuid: 7d8e1197-69e4-40d8-8d86-278d275af896
        Total devices 10 FS bytes used 376.27GiB
        devid    1 size 465.76GiB used 42.42GiB path /dev/dm-0
        devid    2 size 465.76GiB used 42.40GiB path /dev/dm-1
        devid    3 size 465.75GiB used 42.40GiB path /dev/mapper/crypt_sde1 << this is missing
        devid    4 size 465.76GiB used 42.40GiB path /dev/dm-3
        devid    5 size 465.76GiB used 42.40GiB path /dev/dm-4
        devid    6 size 465.76GiB used 42.40GiB path /dev/dm-5
        devid    7 size 465.76GiB used 42.40GiB path /dev/dm-6
        devid    8 size 465.76GiB used 42.40GiB path /dev/mapper/crypt_sdj1
        devid    9 size 465.76GiB used 42.40GiB path /dev/mapper/crypt_sdk1
        devid    10 size 465.76GiB used 42.40GiB path /dev/dm-8

Ok, so mount in degraded mode works.

Adding a new device failed though:
polgara:~# btrfs device add /dev/mapper/crypt_sdm1 /mnt/btrfs_backupcopy
BTRFS: bad tree block start 852309604880683448 156237824
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1963 at fs/btrfs/super.c:257 __btrfs_abort_transaction+0x50/0x100()
BTRFS: Transaction aborted (error -5)
Modules linked in: xts gf128mul ipt_MASQUERADE ipt_REJECT xt_tcpudp xt_conntrack xt_LOG iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables cpufreq_userspace cpufreq_powersave cpufreq_conservative cpufreq_stats ppdev rfcomm bnep autofs4 binfmt_misc uinput nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc fuse dm_crypt dm_mod configs parport_pc lp parport input_polldev loop firewire_sbp2 firewire_core crc_itu_t ecryptfs btusb bluetooth 6lowpan_iphc rfkill usbkbd usbmouse joydev hid_generic usbhid hid iTCO_wdt iTCO_vendor_support gpio_ich coretemp kvm_intel kvm microcode snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec pcspkr snd_hwdep i2c_i801 snd_pcm_oss snd_mixer_oss lpc_ich snd_pcm snd_seq_midi snd_seq_midi_event sg sr_mod cdrom snd_rawmidi snd_seq snd_seq_device snd_timer atl1 mii mvsas snd nouveau libsas scsi_transport_
soundcore ttm ehci_pci asus_atk0110 floppy uhci_hcd ehci_hcd usbcore acpi_cpufreq usb_common processor evdev
CPU: 0 PID: 1963 Comm: btrfs Tainted: G        W    3.14.0-rc5-amd64-i915-preempt-20140216c #1
Hardware name: System manufacturer P5KC/P5KC, BIOS 0502    05/24/2007
 0000000000000000 ffff88004b5c9988 ffffffff816090b3 ffff88004b5c99d0
 ffff88004b5c99c0 ffffffff81050025 ffffffff8120913a 00000000fffffffb
 ffff8800144d5800 ffff88007bd3ba00 ffffffff81839280 ffff88004b5c9a20
Call Trace:
 [<ffffffff816090b3>] dump_stack+0x4e/0x7a
 [<ffffffff81050025>] warn_slowpath_common+0x7f/0x98
 [<ffffffff8120913a>] ? __btrfs_abort_transaction+0x50/0x100
 [<ffffffff8105008a>] warn_slowpath_fmt+0x4c/0x4e
 [<ffffffff8120913a>] __btrfs_abort_transaction+0x50/0x100
 [<ffffffff81216fed>] __btrfs_free_extent+0x6ce/0x712
 [<ffffffff8121bc89>] __btrfs_run_delayed_refs+0x939/0xbdf
 [<ffffffff8121dac8>] btrfs_run_delayed_refs+0x81/0x18f
 [<ffffffff8122aeb2>] btrfs_commit_transaction+0xeb/0x849
 [<ffffffff8124e777>] btrfs_init_new_device+0x9a1/0xc00
 [<ffffffff8114069b>] ? ____cache_alloc+0x1c/0x29b
 [<ffffffff81129d3e>] ? mem_cgroup_end_update_page_stat+0x17/0x26
 [<ffffffff8125570f>] ? btrfs_ioctl+0x989/0x24b1
 [<ffffffff81141096>] ? __kmalloc_track_caller+0x130/0x144
 [<ffffffff8125570f>] ? btrfs_ioctl+0x989/0x24b1
 [<ffffffff81255730>] btrfs_ioctl+0x9aa/0x24b1
 [<ffffffff81611e15>] ? __do_page_fault+0x330/0x3df
 [<ffffffff8116da43>] ? mntput_no_expire+0x33/0x12b
 [<ffffffff81163b16>] do_vfs_ioctl+0x3d2/0x41d
 [<ffffffff8115676b>] ? ____fput+0xe/0x10
 [<ffffffff8106973a>] ? task_work_run+0x87/0x98
 [<ffffffff81163bb8>] SyS_ioctl+0x57/0x82
 [<ffffffff81611ed2>] ? do_page_fault+0xe/0x10
 [<ffffffff816154ad>] system_call_fastpath+0x1a/0x1f
---[ end trace 7d08b9b7f2f17b38 ]---
BTRFS: error (device dm-9) in __btrfs_free_extent:5755: errno=-5 IO failure
BTRFS info (device dm-9): forced readonly
ERROR: error adding the device '/dev/mapper/crypt_sdm1' - Input/output error
polgara:~# Mar 17 09:07:14 polgara kernel: [124403.240880] BTRFS: error (device dm-9) in btrfs_run_delayed_refs:2713: errno=-5 IO failure

Mmmh, dm-9 is another device, although it seems to work:
polgara:~# dd if=/dev/dm-9 of=/dev/null bs=1M
^C1255+0 records in
1254+0 records out
1314914304 bytes (1.3 GB) copied, 15.169 s, 86.7 MB/s

polgara:~# btrfs device stats /dev/dm-9
[/dev/mapper/crypt_sdk1].write_io_errs   0
[/dev/mapper/crypt_sdk1].read_io_errs    0
[/dev/mapper/crypt_sdk1].flush_io_errs   0
[/dev/mapper/crypt_sdk1].corruption_errs 0
[/dev/mapper/crypt_sdk1].generation_errs 0


I also started getting errors on my device after hours of use last night (pasted below).
Not sure if I really have a 2nd device problem or not:

/dev/mapper/crypt_sde1 is dm-2,

BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
quiet_error: 123 callbacks suppressed
Buffer I/O error on device dm-2, logical block 16
Buffer I/O error on device dm-2, logical block 16384
Buffer I/O error on device dm-2, logical block 67108864
Buffer I/O error on device dm-2, logical block 16
Buffer I/O error on device dm-2, logical block 16384
Buffer I/O error on device dm-2, logical block 67108864
BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
Buffer I/O error on device dm-2, logical block 0
Buffer I/O error on device dm-2, logical block 1
Buffer I/O error on device dm-2, logical block 2
Buffer I/O error on device dm-2, logical block 3
Buffer I/O error on device dm-2, logical block 0
Buffer I/O error on device dm-2, logical block 122095101
Buffer I/O error on device dm-2, logical block 122095101
Buffer I/O error on device dm-2, logical block 0
Buffer I/O error on device dm-2, logical block 0
btrfs_dev_stat_print_on_error: 366 callbacks suppressed
btrfs_dev_stat_print_on_error: 346 callbacks suppressed
btrfs_dev_stat_print_on_error: 606 callbacks suppressed
btrfs_dev_stat_print_on_error: 276 callbacks suppressed
BTRFS: bad tree block start 16817792799093053571 2701656064
BTRFS: bad tree block start 16817792799093053571 2701656064
BTRFS: bad tree block start 16817792799093053571 2701656064
BTRFS: bad tree block start 16817792799093053571 2701656064
BTRFS: bad tree block start 16817792799093053571 2701656064
BTRFS: bad tree block start 16817792799093053571 2701656064
BTRFS: bad tree block start 16817792799093053571 2701656064
BTRFS: bad tree block start 16817792799093053571 2701656064
BTRFS: bad tree block start 16817792799093053571 2701656064
BTRFS: bad tree block start 16817792799093053571 2701656064
btrfs_dev_stat_print_on_error: 11469 callbacks suppressed
btree_readpage_end_io_hook: 31227 callbacks suppressed
BTRFS: bad tree block start 16817792799093053571 2701656064
BTRFS: bad tree block start 16817792799093053571 2701656064
BTRFS: bad tree block start 16817792799093053571 2701656064

eventually it turned into:
BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 3891, rd 321927996, flush 0, corrupt 0, gen 0
BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 3891, rd 321927997, flush 0, corrupt 0, gen 0
BTRFS: bad tree block start 17271740454546054736 1265680384
------------[ cut here ]------------
WARNING: CPU: 1 PID: 10414 at fs/btrfs/super.c:257 __btrfs_abort_transaction+0x50/0x100()
BTRFS: Transaction aborted (error -5)
Modules linked in: xts gf128mul ipt_MASQUERADE ipt_REJECT xt_tcpudp xt_conntrack xt_LOG iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables cpufreq_userspace cpufreq_powersave cpufreq_conservative cpufreq_stats ppdev rfcomm bnep autofs4 binfmt_misc uinput nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc fuse dm_crypt dm_mod configs parport_pc lp parport input_polldev loop firewire_sbp2 firewire_core crc_itu_t ecryptfs btusb bluetooth 6lowpan_iphc rfkill usbkbd usbmouse joydev hid_generic usbhid hid iTCO_wdt iTCO_vendor_support gpio_ich coretemp kvm_intel kvm microcode snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec pcspkr snd_hwdep i2c_i801 snd_pcm_oss snd_mixer_oss lpc_ich snd_pcm snd_seq_midi snd_seq_midi_event sg sr_mod cdrom snd_rawmidi snd_seq snd_seq_device snd_timer atl1 mii mvsas snd nouveau libsas scsi_transport_
soundcore ttm ehci_pci asus_atk0110 floppy uhci_hcd ehci_hcd usbcore acpi_cpufreq usb_common processor evdev
CPU: 1 PID: 10414 Comm: btrfs-transacti Not tainted 3.14.0-rc5-amd64-i915-preempt-20140216c #1
Hardware name: System manufacturer P5KC/P5KC, BIOS 0502    05/24/2007
 0000000000000000 ffff88004ae4fb30 ffffffff816090b3 ffff88004ae4fb78
 ffff88004ae4fb68 ffffffff81050025 ffffffff8120913a 00000000fffffffb
 ffff88004f2e7800 ffff8800603804c0 ffffffff81839280 ffff88004ae4fbc8
Call Trace:
 [<ffffffff816090b3>] dump_stack+0x4e/0x7a
 [<ffffffff81050025>] warn_slowpath_common+0x7f/0x98
 [<ffffffff8120913a>] ? __btrfs_abort_transaction+0x50/0x100
 [<ffffffff8105008a>] warn_slowpath_fmt+0x4c/0x4e
 [<ffffffff8120913a>] __btrfs_abort_transaction+0x50/0x100
 [<ffffffff81216fed>] __btrfs_free_extent+0x6ce/0x712
 [<ffffffff8121bc89>] __btrfs_run_delayed_refs+0x939/0xbdf
 [<ffffffff8121dac8>] btrfs_run_delayed_refs+0x81/0x18f
 [<ffffffff8122ae40>] btrfs_commit_transaction+0x79/0x849
 [<ffffffff812277ca>] transaction_kthread+0xf8/0x1ab
 [<ffffffff812276d2>] ? btrfs_cleanup_transaction+0x43f/0x43f
 [<ffffffff8106bc56>] kthread+0xae/0xb6
 [<ffffffff8106bba8>] ? __kthread_parkme+0x61/0x61
 [<ffffffff816153fc>] ret_from_fork+0x7c/0xb0
 [<ffffffff8106bba8>] ? __kthread_parkme+0x61/0x61
---[ end trace 7d08b9b7f2f17b35 ]---
BTRFS: error (device dm-9) in __btrfs_free_extent:5755: errno=-5 IO failure
BTRFS info (device dm-9): forced readonly
BTRFS: error (device dm-9) in btrfs_run_delayed_refs:2713: errno=-5 IO failure
------------[ cut here ]------------
Chris Murphy March 17, 2014, 5:38 p.m. UTC | #15
On Mar 17, 2014, at 10:13 AM, Marc MERLIN <marc@merlins.org> wrote:
> 
> What's confusing is that mounting in degraded mode shows all devices:
> polgara:~# btrfs fi show
> Label: backupcopy  uuid: 7d8e1197-69e4-40d8-8d86-278d275af896
>        Total devices 10 FS bytes used 376.27GiB
>        devid    1 size 465.76GiB used 42.42GiB path /dev/dm-0
>        devid    2 size 465.76GiB used 42.40GiB path /dev/dm-1
>        devid    3 size 465.75GiB used 42.40GiB path /dev/mapper/crypt_sde1 << this is missing
>        devid    4 size 465.76GiB used 42.40GiB path /dev/dm-3
>        devid    5 size 465.76GiB used 42.40GiB path /dev/dm-4
>        devid    6 size 465.76GiB used 42.40GiB path /dev/dm-5
>        devid    7 size 465.76GiB used 42.40GiB path /dev/dm-6
>        devid    8 size 465.76GiB used 42.40GiB path /dev/mapper/crypt_sdj1
>        devid    9 size 465.76GiB used 42.40GiB path /dev/mapper/crypt_sdk1
>        devid    10 size 465.76GiB used 42.40GiB path /dev/dm-8

/dev/mapper/crypt_sde1 is completely unavailable, as in not listed by lsblk? If it's not connected and not listed by lsblk, yet it's listed by btrfs fi show, that's a bug.

> 
> eventually it turned into:
> BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 3891, rd 321927996, flush 0, corrupt 0, gen 0
> BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 3891, rd 321927997, flush 0, corrupt 0, gen 0
[snip]
> BTRFS: error (device dm-9) in __btrfs_free_extent:5755: errno=-5 IO failure
> BTRFS info (device dm-9): forced readonly
> BTRFS: error (device dm-9) in btrfs_run_delayed_refs:2713: errno=-5 IO failure

I think it's a lost cause at this point. Your setup is substantially more complicated than my simple setup, and I can't even get the simple setup to recover from an idealized single device raid5 failure. The only apparent way out is to mount degraded, backup, and then start over.
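
In other words, roughly (the backup target is a placeholder, and rsync is just one way to copy the data off):

    mount -o degraded LABEL=backupcopy /mnt/btrfs_backupcopy    # get the data readable again
    rsync -a /mnt/btrfs_backupcopy/ /some/other/storage/        # copy off whatever you care about
    umount /mnt/btrfs_backupcopy
    mkfs.btrfs -f -d raid5 -m raid5 -L backupcopy <good devices...>   # start over without the faulty drive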

In your case it looks like at least two devices are reporting, or Btrfs thinks they're reporting, I/O errors. Whether this is the physical drive itself or some other layer is unclear (it looks like these are dmcrypt logical block devices).


Chris Murphy
Duncan March 18, 2014, 9:02 a.m. UTC | #16
Marc MERLIN posted on Sun, 16 Mar 2014 15:20:26 -0700 as excerpted:

> Do I have other options?
> (data is not important at all, I just want to learn how to deal with
> such a case with the current code)

First just a note that you hijacked Mr Manana's patch thread.  Replying 
to a post and changing the topic (the usual cause of such hijacks) does 
NOT change the thread, as the References and In-Reply-To headers still 
include the Message-IDs from the original thread, and that's what good 
clients thread by since the subject line isn't a reliable means of 
threading.  To start a NEW thread, don't reply to an existing thread, 
compose a NEW message, starting a NEW thread. =:^)

Back on topic...

Since you don't have to worry about the data I'd suggest blowing it away 
and starting over.  Btrfs raid5/6 code is known to be incomplete at this 
point, to work in normal mode and write everything out, but with 
incomplete recovery code.  So I'd treat it like the raid-0 mode it 
effectively is, and consider it lost if a device drops.

There *IS* a post from an earlier thread where someone mentioned a 
recovery under some specific circumstance that worked for him, but I'd 
consider that the exception not the norm since the code is known to be 
incomplete and I think he just got lucky and didn't hit the particular 
missing code in his specific case.  Certainly you could try to go back 
and see what he did and under what conditions, and that might actually be 
worth doing if you had valuable data you'd be losing otherwise, but since 
you don't, while of course it's up to you, I'd not bother were it me.

Which I haven't.  My use-case wouldn't be looking at raid5/6 (or raid0) 
anyway, but even if it were, I'd not touch the current code unless it 
/was/ just for something I'd consider risking on a raid0.  Other than 
pure testing, the /only/ case I'd consider btrfs raid5/6 for right now, 
would be something that I'd consider raid0 riskable currently, but with 
the bonus of it upgrading "for free" to raid5/6 when the code is complete 
without any further effort on my part, since it's actually being written 
as raid5/6 ATM, the recovery simply can't be relied upon as raid5/6, so 
in recovery terms you're effectively running raid0 until it can be.  
Other than that and for /pure/ testing, I just don't see the point of 
even thinking about raid5/6 at this point.
Marc MERLIN March 19, 2014, 6:09 a.m. UTC | #17
On Tue, Mar 18, 2014 at 09:02:07AM +0000, Duncan wrote:
> First just a note that you hijacked Mr Manana's patch thread.  Replying 
(...)
I did, I use mutt, I know about In-Reply-To, I was tired, I screwed up,
sorry, and there was no undo :)

> Since you don't have to worry about the data I'd suggest blowing it away 
> and starting over.  Btrfs raid5/6 code is known to be incomplete at this 
> point, to work in normal mode and write everything out, but with 
> incomplete recovery code.  So I'd treat it like the raid-0 mode it 
> effectively is, and consider it lost if a device drops.
>
> Which I haven't.  My use-case wouldn't be looking at raid5/6 (or raid0) 
> anyway, but even if it were, I'd not touch the current code unless it 
> /was/ just for something I'd consider risking on a raid0.  Other than 

Thank you for the warning, and yes I know the risk and the data I'm putting
on it is ok with that risk :)

So, I was a bit quiet because I was diagnosing problems with the underlying
hardware.
My disk array was creating disk faults due to insufficient power coming in.

Now that I fixed that and made sure the drives work with a full run of
hdrecover on all the drives in parallel (it exercises the drives while making
sure all their blocks work), I ran the tests again:

Summary:
1) You can grow and shrink a raid5 volume while it's mounted => very cool
2) shrinking causes a rebalance
3) growing requires you to run rebalance
4) btrfs cannot replace a drive in raid5, whether it's present or not;
   that's the biggest thing missing: there are just no rebuilds in any way
5) you can mount a raid5 with a missing device with -o degraded
6) adding a drive to a degraded array will grow the array, not rebuild
   the missing bits
7) you can remove a drive from an array, add files, and then if you plug
   the drive back in, it apparently gets sucked back into the array.
There is no rebuild that happens; you now have an inconsistent array where
one drive is not at the same level as the other ones (I lost all files I added
after the drive was removed when I added the drive back).

In other words, everything seems to work except there is no rebuild that I could 
see anywhere.

Here are all the details:

Creation
> polgara:/dev/disk/by-id# mkfs.btrfs -f -d raid5 -m raid5 -L backupcopy /dev/mapper/crypt_sd[bdfghijkl]1
> 
> WARNING! - Btrfs v3.12 IS EXPERIMENTAL
> WARNING! - see http://btrfs.wiki.kernel.org before using
> 
> Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
> Turning ON incompat feature 'raid56': raid56 extended format
> adding device /dev/mapper/crypt_sdd1 id 2
> adding device /dev/mapper/crypt_sdf1 id 3
> adding device /dev/mapper/crypt_sdg1 id 4
> adding device /dev/mapper/crypt_sdh1 id 5
> adding device /dev/mapper/crypt_sdi1 id 6
> adding device /dev/mapper/crypt_sdj1 id 7
> adding device /dev/mapper/crypt_sdk1 id 8
> adding device /dev/mapper/crypt_sdl1 id 9
> fs created label backupcopy on /dev/mapper/crypt_sdb1
>         nodesize 16384 leafsize 16384 sectorsize 4096 size 4.09TiB
> polgara:/dev/disk/by-id# mount -L backupcopy /mnt/btrfs_backupcopy/
> polgara:/mnt/btrfs_backupcopy# df -h .
> Filesystem              Size  Used Avail Use% Mounted on
> /dev/mapper/crypt_sdb1  4.1T  3.0M  4.1T   1% /mnt/btrfs_backupcopy

Let's add one drive
> polgara:/mnt/btrfs_backupcopy# btrfs device add -f /dev/mapper/crypt_sdm1 /mnt/btrfs_backupcopy/
> polgara:/mnt/btrfs_backupcopy# df -h .
> Filesystem              Size  Used Avail Use% Mounted on
> /dev/mapper/crypt_sdb1  4.6T  3.0M  4.6T   1% /mnt/btrfs_backupcopy

Oh look, it's bigger now. We need a manual rebalance to use the new drive:
> polgara:/mnt/btrfs_backupcopy# btrfs balance start . 
> Done, had to relocate 6 out of 6 chunks
> 
> polgara:/mnt/btrfs_backupcopy#  btrfs device delete /dev/mapper/crypt_sdm1 .
> BTRFS info (device dm-9): relocating block group 23314563072 flags 130
> BTRFS info (device dm-9): relocating block group 22106603520 flags 132
> BTRFS info (device dm-9): found 6 extents
> BTRFS info (device dm-9): relocating block group 12442927104 flags 129
> BTRFS info (device dm-9): found 1 extents
> polgara:/mnt/btrfs_backupcopy# df -h .
> Filesystem              Size  Used Avail Use% Mounted on
> /dev/mapper/crypt_sdb1  4.1T  4.7M  4.1T   1% /mnt/btrfs_backupcopy

Ah, it's smaller again. Note that it's not degraded; you can just keep removing drives
and it'll do a forced rebalance to fit the data on the remaining drives.

Ok, I've unmounted the filesystem, and will manually remove a device:
> polgara:~# dmsetup remove crypt_sdl1
> polgara:~# mount -L backupcopy /mnt/btrfs_backupcopy/
> mount: wrong fs type, bad option, bad superblock on /dev/mapper/crypt_sdk1,
>        missing codepage or helper program, or other error
>        In some cases useful info is found in syslog - try
>        dmesg | tail  or so
> BTRFS: open /dev/dm-9 failed
> BTRFS info (device dm-7): disk space caching is enabled
> BTRFS: failed to read chunk tree on dm-7
> BTRFS: open_ctree failed

So a normal mount fails. You have to mount with -o degraded to acknowledge this.
> polgara:~# mount -o degraded -L backupcopy /mnt/btrfs_backupcopy/
> BTRFS: device label backupcopy devid 8 transid 50 /dev/mapper/crypt_sdk1
> BTRFS: open /dev/dm-9 failed
> BTRFS info (device dm-7): allowing degraded mounts
> BTRFS info (device dm-7): disk space caching is enabled
 
Re-adding a device that was missing:
> polgara:/mnt/btrfs_backupcopy# cryptsetup luksOpen /dev/sdl1 crypt_sdl1
> Enter passphrase for /dev/sdl1: 
> polgara:/mnt/btrfs_backupcopy# df -h .
> Filesystem              Size  Used Avail Use% Mounted on
> /dev/mapper/crypt_sdb1  4.1T  2.5M  3.7T   1% /mnt/btrfs_backupcopy
> polgara:/mnt/btrfs_backupcopy# btrfs device add -f /dev/mapper/crypt_sdl1 /mnt/btrfs_backupcopy/
> /dev/mapper/crypt_sdl1 is mounted
> BTRFS: device label backupcopy devid 9 transid 50 /dev/dm-9
> BTRFS: device label backupcopy devid 9 transid 50 /dev/dm-9
=> whoa, btrfs noticed that the device came back and knew it was its own, so it slurped it right away
(I was not able to add the device because it had already been auto-added)

Adding another device does grow the size, which adding sdl1 did not:
> polgara:/mnt/btrfs_backupcopy# btrfs device add -f /dev/mapper/crypt_sdm1 /mnt/btrfs_backupcopy/
> polgara:/mnt/btrfs_backupcopy# df -h .
> Filesystem              Size  Used Avail Use% Mounted on
> /dev/mapper/crypt_sdb1  4.6T  2.5M  4.1T   1% /mnt/btrfs_backupcopy

Ok, harder, let's pull a drive now. Strangely btrfs doesn't notice right away but logs this eventually:
BTRFS: bdev /dev/dm-6 errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
BTRFS: lost page write due to I/O error on /dev/dm-6
BTRFS: bdev /dev/dm-6 errs: wr 1, rd 0, flush 1, corrupt 0, gen 0
BTRFS: lost page write due to I/O error on /dev/dm-6
BTRFS: bdev /dev/dm-6 errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
BTRFS: lost page write due to I/O error on /dev/dm-6
BTRFS: bdev /dev/dm-6 errs: wr 3, rd 0, flush 1, corrupt 0, gen 0

From what I can tell, it buffers the writes to the missing drive and retries them in the background.
Technically it is in degraded mode, but it doesn't seem to think so.

This is where it now fails, I cannot remove the bad drive from the array:
polgara:/mnt/btrfs_backupcopy# btrfs device delete /dev/mapper/crypt_sdj1 .
ERROR: error removing the device '/dev/mapper/crypt_sdj1' - Invalid argument

Drive replace is not yet implemented:
> polgara:/mnt/btrfs_backupcopy# btrfs replace start -r /dev/mapper/crypt_sdj1 /dev/mapper/crypt_sde1  .
> quiet_error: 138 callbacks suppressed
> Buffer I/O error on device dm-6, logical block 122095344
> Buffer I/O error on device dm-6, logical block 122095364
> Buffer I/O error on device dm-6, logical block 0
> Buffer I/O error on device dm-6, logical block 1
> Buffer I/O error on device dm-6, logical block 122095365
> Buffer I/O error on device dm-6, logical block 122095365
> Buffer I/O error on device dm-6, logical block 122095365
> Buffer I/O error on device dm-6, logical block 122095365
> Buffer I/O error on device dm-6, logical block 122095365
> Buffer I/O error on device dm-6, logical block 122095365
> BTRFS warning (device dm-8): dev_replace cannot yet handle RAID5/RAID6

Adding a device at this point will not help because the filesystem is not in degraded mode; btrfs is still
kind of hoping that dm-6 (aka crypt_sdj1) will come back. So if I add a device, it would just grow the raid.

Let's mount the array in degraded mode:
> polgara:~# mount -v -t btrfs -o compress=zlib,space_cache,noatime,degraded LABEL=backupcopy /mnt/btrfs_backupcopy 
> polgara:~# btrfs fi show
> Label: backupcopy  uuid: 5ccda389-748b-419c-bfa9-c14c4136e1c4
>         Total devices 10 FS bytes used 680.05MiB
>         devid    1 size 465.76GiB used 1.14GiB path /dev/mapper/crypt_sdb1
>         devid    2 size 465.76GiB used 1.14GiB path /dev/dm-1
>         devid    3 size 465.75GiB used 1.14GiB path /dev/dm-2
>         devid    4 size 465.76GiB used 1.14GiB path /dev/dm-3
>         devid    5 size 465.76GiB used 1.14GiB path /dev/dm-4
>         devid    6 size 465.76GiB used 1.14GiB path /dev/dm-5
>         devid    7 size 465.76GiB used 1.14GiB path /dev/dm-6
>         devid    8 size 465.76GiB used 1.14GiB path /dev/mapper/crypt_sdk1
>         devid    9 size 465.76GiB used 1.14GiB path /dev/mapper/crypt_sdl1
>         devid    10 size 465.76GiB used 1.14GiB path /dev/mapper/crypt_sdm1
>
> quiet_error: 250 callbacks suppressed
> Buffer I/O error on device dm-6, logical block 122095344
> Buffer I/O error on device dm-6, logical block 122095344
> Buffer I/O error on device dm-6, logical block 122095364
> Buffer I/O error on device dm-6, logical block 122095364
> Buffer I/O error on device dm-6, logical block 0
> Buffer I/O error on device dm-6, logical block 0
> Buffer I/O error on device dm-6, logical block 1
> Buffer I/O error on device dm-6, logical block 122095365
> Buffer I/O error on device dm-6, logical block 122095365
> Buffer I/O error on device dm-6, logical block 122095365

Even though it cannot access dm-6, it still included it in the mount because the device node still exists.

Adding a device does not help, it just grew the array in degraded mode:
polgara:/mnt/btrfs_backupcopy# btrfs device add /dev/mapper/crypt_sde1  .
polgara:/mnt/btrfs_backupcopy# df -h .
Filesystem              Size  Used Avail Use% Mounted on
/dev/mapper/crypt_sdb1  5.1T  681M  4.6T   1% /mnt/btrfs_backupcopy

Balance is not happy:
polgara:/mnt/btrfs_backupcopy# btrfs balance start . 
> BTRFS info (device dm-8): relocating block group 63026233344 flags 129
> BTRFS info (device dm-8): csum failed ino 257 off 917504 csum 1017609526 expected csum 4264281942
> BTRFS info (device dm-8): csum failed ino 257 off 966656 csum 389256117 expected csum 2901202041
> BTRFS info (device dm-8): csum failed ino 257 off 970752 csum 4107355973 expected csum 3954832285
> BTRFS info (device dm-8): csum failed ino 257 off 974848 csum 1121660380 expected csum 2872112983
> BTRFS info (device dm-8): csum failed ino 257 off 978944 csum 2032023730 expected csum 2250478230
> BTRFS info (device dm-8): csum failed ino 257 off 933888 csum 297434258 expected csum 3687027701
> BTRFS info (device dm-8): csum failed ino 257 off 937984 csum 1176910550 expected csum 3400460732
> BTRFS info (device dm-8): csum failed ino 257 off 942080 csum 366743485 expected csum 2321497660
> BTRFS info (device dm-8): csum failed ino 257 off 946176 csum 1849642521 expected csum 931611495
> BTRFS info (device dm-8): csum failed ino 257 off 921600 csum 1075941372 expected csum 2126420528
ERROR: error during balancing '.' - Input/output error

This looks bad, but my filesystem didn't look corrupted after that.
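
A full balance rewrites every chunk, including the damaged ones. A filtered balance is
a gentler poke, although it obviously cannot repair csum errors either; sketch, assuming
balance filters are available in this kernel/progs combination:

    $ btrfs balance start -dusage=50 -musage=50 .   # only rewrite chunks that are at most 50% full
    $ btrfs balance status .                        # progress, from another shell
    $ btrfs balance cancel .                        # bail out if it wedges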

I am not allowed to remove the new device I just added:
polgara:~# btrfs device delete /dev/mapper/crypt_sde1  .
ERROR: error removing the device '/dev/mapper/crypt_sde1' - Inappropriate ioctl for device

Let's now remove the device node of that bad drive, unmount and remount the array:
polgara:~# dmsetup remove crypt_sdj1
polgara:~# btrfs fi show
Label: 'backupcopy'  uuid: 5ccda389-748b-419c-bfa9-c14c4136e1c4
        Total devices 11 FS bytes used 682.30MiB
        devid    1 size 465.76GiB used 2.14GiB path /dev/mapper/crypt_sdb1
        devid    2 size 465.76GiB used 2.14GiB path /dev/mapper/crypt_sdd1
        devid    3 size 465.75GiB used 2.14GiB path /dev/mapper/crypt_sdf1
        devid    4 size 465.76GiB used 2.14GiB path /dev/mapper/crypt_sdg1
        devid    5 size 465.76GiB used 2.14GiB path /dev/mapper/crypt_sdh1
        devid    6 size 465.76GiB used 2.14GiB path /dev/mapper/crypt_sdi1
        devid    8 size 465.76GiB used 2.14GiB path /dev/mapper/crypt_sdk1
        devid    9 size 465.76GiB used 2.14GiB path /dev/mapper/crypt_sdl1
        devid   10 size 465.76GiB used 2.14GiB path /dev/mapper/crypt_sdm1
        devid   11 size 465.76GiB used 1.00GiB path /dev/mapper/crypt_sde1
        *** Some devices missing
=> ok, that's good, one device is missing

Now when I mount the array, I see this:
polgara:~# mount -v -t btrfs -o compress=zlib,space_cache,noatime,degraded LABEL=backupcopy /mnt/btrfs_backupcopy 
> BTRFS: device label backupcopy devid 11 transid 150 /dev/mapper/crypt_sde1
> BTRFS: open /dev/dm-6 failed
> BTRFS info (device dm-10): allowing degraded mounts
> BTRFS info (device dm-10): disk space caching is enabled
> BTRFS: bdev /dev/dm-6 errs: wr 12, rd 0, flush 4, corrupt 0, gen 0
/dev/mapper/crypt_sde1 on /mnt/btrfs_backupcopy type btrfs (rw,noatime,compress=zlib,space_cache,degraded)
polgara:~# btrfs fi show
Label: backupcopy  uuid: 5ccda389-748b-419c-bfa9-c14c4136e1c4
        Total devices 11 FS bytes used 682.30MiB
        devid    1 size 465.76GiB used 2.14GiB path /dev/dm-0
        devid    2 size 465.76GiB used 2.14GiB path /dev/dm-1
        devid    3 size 465.75GiB used 2.14GiB path /dev/dm-2
        devid    4 size 465.76GiB used 2.14GiB path /dev/dm-3
        devid    5 size 465.76GiB used 2.14GiB path /dev/dm-4
        devid    6 size 465.76GiB used 2.14GiB path /dev/dm-5
        devid    7 size 465.76GiB used 1.14GiB path /dev/dm-6
        devid    8 size 465.76GiB used 2.14GiB path /dev/dm-7
        devid    9 size 465.76GiB used 2.14GiB path /dev/dm-9
        devid    10 size 465.76GiB used 2.14GiB path /dev/dm-8
        devid    11 size 465.76GiB used 1.00GiB path /dev/mapper/crypt_sde1

That's bad: it still shows me dm-6 even though it's gone now. I think
this means you cannot get btrfs to show that it's running in degraded mode.

Ok, let's re-add the device:
polgara:/mnt/btrfs_backupcopy# cryptsetup luksOpen /dev/sdj1 crypt_sdj1
Enter passphrase for /dev/sdj1: 
> BTRFS: device label backupcopy devid 7 transid 137 /dev/dm-6
> btrfs-rmw-2: page allocation failure: order:1, mode:0x8020
> CPU: 0 PID: 7511 Comm: btrfs-rmw-2 Tainted: G        W    3.14.0-rc5-amd64-i915-preempt-20140216c #1
> Hardware name: System manufacturer P5KC/P5KC, BIOS 0502    05/24/2007
>  0000000000000000 ffff880011173690 ffffffff816090b3 0000000000000000
>  ffff880011173718 ffffffff811037b0 00000001fffffffe 0000000000000001
>  ffff88006bb2a0d0 0000000200000000 0000003000000000 ffff88007ff7ce00
> Call Trace:
>  [<ffffffff816090b3>] dump_stack+0x4e/0x7a
>  [<ffffffff811037b0>] warn_alloc_failed+0x111/0x125
>  [<ffffffff81106cb2>] __alloc_pages_nodemask+0x707/0x854
>  [<ffffffff8110654e>] ? get_page_from_freelist+0x6c0/0x71d
>  [<ffffffff81014650>] dma_generic_alloc_coherent+0xa7/0x11c
>  [<ffffffff811354e8>] dma_pool_alloc+0x10a/0x1cb
>  [<ffffffffa00f2aa0>] mvs_task_prep+0x192/0xa42 [mvsas]
>  [<ffffffff81140d66>] ? ____cache_alloc_node+0xf1/0x134
>  [<ffffffffa00f33ad>] mvs_task_exec.isra.9+0x5d/0xc9 [mvsas]
>  [<ffffffffa00f3a76>] mvs_queue_command+0x3d/0x29b [mvsas]
>  [<ffffffff8114118d>] ? kmem_cache_alloc+0xe3/0x161
>  [<ffffffffa00e5d1c>] sas_ata_qc_issue+0x1cd/0x235 [libsas]
>  [<ffffffff814a9598>] ata_qc_issue+0x291/0x2f1
>  [<ffffffff814af413>] ? ata_scsiop_mode_sense+0x29c/0x29c
>  [<ffffffff814b049e>] __ata_scsi_queuecmd+0x184/0x1e0
>  [<ffffffff814b05a5>] ata_sas_queuecmd+0x31/0x4d
>  [<ffffffffa00e47ba>] sas_queuecommand+0x98/0x1fe [libsas]
>  [<ffffffff8148fdee>] scsi_dispatch_cmd+0x14f/0x22e
>  [<ffffffff814964da>] scsi_request_fn+0x4da/0x507
>  [<ffffffff812e01a3>] __blk_run_queue_uncond+0x22/0x2b
>  [<ffffffff812e01c5>] __blk_run_queue+0x19/0x1b
>  [<ffffffff812fc16d>] cfq_insert_request+0x391/0x3b5
>  [<ffffffff812e002f>] ? perf_trace_block_rq_with_error+0x45/0x14f
>  [<ffffffff812e512c>] ? blk_recount_segments+0x1e/0x2e
>  [<ffffffff812dc08c>] __elv_add_request+0x1fc/0x276
>  [<ffffffff812e1c6c>] blk_queue_bio+0x237/0x256
>  [<ffffffff812df92c>] generic_make_request+0x9c/0xdb
>  [<ffffffff812dfa7d>] submit_bio+0x112/0x131
>  [<ffffffff8128274c>] rmw_work+0x112/0x162
>  [<ffffffff8125073f>] worker_loop+0x168/0x4d8
>  [<ffffffff812505d7>] ? btrfs_queue_worker+0x283/0x283
>  [<ffffffff8106bc56>] kthread+0xae/0xb6
>  [<ffffffff8106bba8>] ? __kthread_parkme+0x61/0x61
>  [<ffffffff816153fc>] ret_from_fork+0x7c/0xb0
>  [<ffffffff8106bba8>] ? __kthread_parkme+0x61/0x61

My system hung soon after that, but it could have been due to issues
with my SATA driver too.

I rebooted and tried a mount:
polgara:~# mount -v -t btrfs -o compress=zlib,space_cache,noatime LABEL=backupcopy /mnt/btrfs_backupcopy
> BTRFS: device label backupcopy devid 11 transid 152 /dev/mapper/crypt_sde1
> BTRFS info (device dm-10): disk space caching is enabled
> BTRFS: bdev /dev/dm-6 errs: wr 12, rd 0, flush 4, corrupt 0, gen 0
/dev/mapper/crypt_sde1 on /mnt/btrfs_backupcopy type btrfs (rw,noatime,compress=zlib,space_cache)

Ok, there is a problem here: my filesystem is missing data I added after my sdj1 device died.
In other words, btrfs happily added back my device that was way behind and gave me an incomplete filesystem, instead of noticing
that sdj1 was behind and giving me a degraded filesystem.
Moral of the story: do not ever re-add a device that got kicked out if you wrote new data after that, or you will end up with an older version of your filesystem (on the plus side, it's consistent and apparently without data corruption). That said, btrfs scrub complained loudly about many errors it didn't know how to fix:
> BTRFS: bdev /dev/dm-6 errs: wr 12, rd 0, flush 4, corrupt 0, gen 0
> BTRFS: bad tree block start 6438453874765710835 61874388992
> BTRFS: bad tree block start 8828340560360071357 61886726144
> BTRFS: bad tree block start 5332618200988957279 61895868416
> BTRFS: bad tree block start 9233018093866324599 61895884800
> BTRFS: bad tree block start 17393001018657664843 61895917568
> BTRFS: bad tree block start 6438453874765710835 61874388992
> BTRFS: bad tree block start 8828340560360071357 61886726144
> BTRFS: bad tree block start 5332618200988957279 61895868416
> BTRFS: bad tree block start 9233018093866324599 61895884800
> BTRFS: bad tree block start 17393001018657664843 61895917568
> BTRFS: checksum error at logical 61826662400 on dev /dev/dm-6, sector 2541568: metadata leaf (level 0) in tree 5
> BTRFS: checksum error at logical 61826662400 on dev /dev/dm-6, sector 2541568: metadata leaf (level 0) in tree 5
> BTRFS: bdev /dev/dm-6 errs: wr 12, rd 0, flush 4, corrupt 1, gen 0
> BTRFS: unable to fixup (regular) error at logical 61826662400 on dev /dev/dm-6
> BTRFS: checksum error at logical 61826678784 on dev /dev/dm-6, sector 2541600: metadata leaf (level 0) in tree 5
> BTRFS: checksum error at logical 61826678784 on dev /dev/dm-6, sector 2541600: metadata leaf (level 0) in tree 5
> BTRFS: bdev /dev/dm-6 errs: wr 12, rd 0, flush 4, corrupt 2, gen 0
> BTRFS: unable to fixup (regular) error at logical 61826678784 on dev /dev/dm-6
> BTRFS: checksum error at logical 61826695168 on dev /dev/dm-6, sector 2541632: metadata leaf (level 0) in tree 5
> BTRFS: checksum error at logical 61826695168 on dev /dev/dm-6, sector 2541632: metadata leaf (level 0) in tree 5
> BTRFS: bdev /dev/dm-6 errs: wr 12, rd 0, flush 4, corrupt 3, gen 0
> BTRFS: unable to fixup (regular) error at logical 61826695168 on dev /dev/dm-6
(...)
> BTRFS: unable to fixup (regular) error at logical 61827186688 on dev /dev/dm-5
> scrub_handle_errored_block: 632 callbacks suppressed
> BTRFS: checksum error at logical 61849731072 on dev /dev/dm-6, sector 2586624: metadata leaf (level 0) in tree 5
> BTRFS: checksum error at logical 61849731072 on dev /dev/dm-6, sector 2586624: metadata leaf (level 0) in tree 5
> btrfs_dev_stat_print_on_error: 632 callbacks suppressed
> BTRFS: bdev /dev/dm-6 errs: wr 12, rd 0, flush 4, corrupt 166, gen 0
> scrub_handle_errored_block: 632 callbacks suppressed
> BTRFS: unable to fixup (regular) error at logical 61849731072 on dev /dev/dm-6
(...)
> BTRFS: unable to fixup (regular) error at logical 61864853504 on dev /dev/dm-5
> btree_readpage_end_io_hook: 16 callbacks suppressed
> BTRFS: bad tree block start 17393001018657664843 61895917568
> BTRFS: bad tree block start 17393001018657664843 61895917568
> scrub_handle_errored_block: 697 callbacks suppressed
> BTRFS: checksum error at logical 61871751168 on dev /dev/dm-3, sector 2629632: metadata leaf (level 0) in tree 5
> BTRFS: checksum error at logical 61871751168 on dev /dev/dm-3, sector 2629632: metadata leaf (level 0) in tree 5
> btrfs_dev_stat_print_on_error: 697 callbacks suppressed
> BTRFS: bdev /dev/dm-3 errs: wr 0, rd 0, flush 0, corrupt 236, gen 0
> scrub_handle_errored_block: 697 callbacks suppressed
> BTRFS: unable to fixup (regular) error at logical 61871751168 on dev /dev/dm-3
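
For reference, a scrub like the one above is typically kicked off along these lines
(-B keeps it in the foreground, -d prints per-device statistics at the end):

    $ btrfs scrub start -Bd /mnt/btrfs_backupcopy
    $ btrfs scrub status /mnt/btrfs_backupcopy      # if it was started in the background instead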

On the plus side, I can remove the last drive I added, now that I'm no longer in degraded mode:
polgara:/mnt/btrfs_backupcopy# btrfs device delete /dev/mapper/crypt_sde1 .
> BTRFS info (device dm-10): relocating block group 72689909760 flags 129
> BTRFS info (device dm-10): found 1 extents
> BTRFS info (device dm-10): found 1 extents


There you go, hope this helps.

Marc
Chris Murphy March 19, 2014, 6:32 a.m. UTC | #18
On Mar 19, 2014, at 12:09 AM, Marc MERLIN <marc@merlins.org> wrote:
> 
> 7) you can remove a drive from an array, add files, and then if you plug
>   the drive in, it apparently gets auto sucked in back in the array.
> There is no rebuild that happens, you now have an inconsistent array where
> one drive is not at the same level than the other ones (I lost all files I added 
> after the drive was removed when I added the drive back).

Seems worthy of a dedicated bug report and keeping an eye on in the future, not good.

>> 
>> polgara:/mnt/btrfs_backupcopy# df -h .
>> Filesystem              Size  Used Avail Use% Mounted on
>> /dev/mapper/crypt_sdb1  4.1T  3.0M  4.1T   1% /mnt/btrfs_backupcopy
> 
> Let's add one drive
>> polgara:/mnt/btrfs_backupcopy# btrfs device add -f /dev/mapper/crypt_sdm1 /mnt/btrfs_backupcopy/
>> polgara:/mnt/btrfs_backupcopy# df -h .
>> Filesystem              Size  Used Avail Use% Mounted on
>> /dev/mapper/crypt_sdb1  4.6T  3.0M  4.6T   1% /mnt/btrfs_backupcopy
> 
> Oh look it's bigger now. We need to manual rebalance to use the new drive:

You don't have to. As soon as you add the additional drive, newly allocated chunks will stripe across all available drives. e.g. 1 GB allocations striped across 3x drives, if I add a 4th drive, initially any additional writes are only to the first three drives but once a new data chunk is allocated it gets striped across 4 drives.
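
A rough way to watch that happen (my sketch; the file name and the 2GiB write are arbitrary):

    $ btrfs filesystem show                          # note the per-devid "used" numbers first
    $ dd if=/dev/zero of=/mnt/btrfs_backupcopy/fill bs=1M count=2048   # force new chunk allocations
    $ sync
    $ btrfs filesystem show                          # the newly added devid should now show chunk-sized growth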


> 
> > In other words, btrfs happily added my device that was way behind and gave me an incomplete filesystem instead of noticing
> > that sdj1 was behind and giving me a degraded filesystem.
> > Moral of the story: do not ever re-add a device that got kicked out if you wrote new data after that, or you will end up with an older version of your filesystem (on the plus side, it's consistent and apparently without data corruption). That said, btrfs scrub complained loudly of many errors it didn't know how to fix.

Sure the whole thing isn't corrupt. But if anything written while degraded vanishes once the missing device is reattached, and you remount normally (non-degraded), that's data loss. Yikes!


> There you go, hope this helps.

Yes. Thanks!

Chris Murphy
Marc MERLIN March 19, 2014, 3:40 p.m. UTC | #19
On Wed, Mar 19, 2014 at 12:32:55AM -0600, Chris Murphy wrote:
> 
> On Mar 19, 2014, at 12:09 AM, Marc MERLIN <marc@merlins.org> wrote:
> > 
> > 7) you can remove a drive from an array, add files, and then if you plug
> >   the drive in, it apparently gets auto sucked in back in the array.
> > There is no rebuild that happens, you now have an inconsistent array where
> > one drive is not at the same level than the other ones (I lost all files I added 
> > after the drive was removed when I added the drive back).
> 
> Seems worthy of a dedicated bug report and keeping an eye on in the future, not good.
 
Since it's not supposed to be working, I didn't file a bug, but I figured
it'd be good for people to know about it in the meantime.

> >> polgara:/mnt/btrfs_backupcopy# btrfs device add -f /dev/mapper/crypt_sdm1 /mnt/btrfs_backupcopy/
> >> polgara:/mnt/btrfs_backupcopy# df -h .
> >> Filesystem              Size  Used Avail Use% Mounted on
> >> /dev/mapper/crypt_sdb1  4.6T  3.0M  4.6T   1% /mnt/btrfs_backupcopy
> > 
> > Oh look it's bigger now. We need to manual rebalance to use the new drive:
> 
> You don't have to. As soon as you add the additional drive, newly allocated chunks will stripe across all available drives. e.g. 1 GB allocations striped across 3x drives, if I add a 4th drive, initially any additional writes are only to the first three drives but once a new data chunk is allocated it gets striped across 4 drives.
 
That's the thing though. Until the bad device was forcibly removed (and
apparently the only way to do that was to unmount, make the device node
disappear, and remount in degraded mode), it looked to me like btrfs was
still considering the drive part of the array and trying to write to it.
After adding a drive, I couldn't quite tell if it was striping over 11
drives or 10, but it felt that, at least at times, it was striping over 11
drives with write failures on the missing drive.
I can't prove it, but I'm thinking the new data I was writing was being
striped in degraded mode.

> Sure the whole thing isn't corrupt. But if anything written while degraded vanishes once the missing device is reattached, and you remount normally (non-degraded), that's data loss. Yikes!

Yes, although it's limited, you apparently only lose new data that was added
after you went into degraded mode and only if you add another drive where
you write more data.
In real life this shouldn't be too common, even if it is indeed a bug.

Cheers,
Marc
Chris Murphy March 19, 2014, 4:53 p.m. UTC | #20
On Mar 19, 2014, at 9:40 AM, Marc MERLIN <marc@merlins.org> wrote:
> 
> After adding a drive, I couldn't quite tell if it was striping over 11
> drives or 10, but it felt that at least at times, it was striping over 11
> drives with write failures on the missing drive.
> I can't prove it, but I'm thinking the new data I was writing was being
> striped in degraded mode.

Well it does sound fragile after all to add a drive to a degraded array, especially when it's not expressly treating the faulty drive as faulty. I think iotop will show what block devices are being written to. And in a VM it's easy (albeit rudimentary) with sparse files, as you can see them grow.
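
For what it's worth, the sparse-file trick also works outside a VM with loop devices.
Rough sketch, paths and sizes arbitrary:

    $ for i in 0 1 2 3; do truncate -s 10G /var/tmp/d$i.img; done     # sparse backing files
    $ for i in 0 1 2 3; do losetup /dev/loop$i /var/tmp/d$i.img; done
    $ mkfs.btrfs -f -d raid5 -m raid5 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
    $ mount /dev/loop0 /mnt/test
    $ ls -ls /var/tmp/d?.img    # after writing data: the allocated-blocks column shows which devices actually grew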

> 
> Yes, although it's limited, you apparently only lose new data that was added
> after you went into degraded mode and only if you add another drive where
> you write more data.
> In real life this shouldn't be too common, even if it is indeed a bug.

It's entirely plausible a drive power/data cable comes loose and the array runs for hours degraded before the wayward device is reseated. It'll be common enough. It's definitely not OK for all of that data in the interim to vanish just because the volume has resumed from degraded to normal. Two states of data, normal vs degraded, is scary. It sounds like totally silent data loss. So yeah, if it's reproducible it's worthy of a separate bug.


Chris Murphy

Marc MERLIN March 19, 2014, 10:40 p.m. UTC | #21
On Wed, Mar 19, 2014 at 10:53:33AM -0600, Chris Murphy wrote:
> > Yes, although it's limited, you apparently only lose new data that was added
> > after you went into degraded mode and only if you add another drive where
> > you write more data.
> > In real life this shouldn't be too common, even if it is indeed a bug.
> 
> It's entirely plausible a drive power/data cable comes loose, runs for hours degraded before the wayward device is reseated. It'll be common enough. It's definitely not OK for all of that data in the interim to vanish just because the volume has resumed from degraded to normal. Two states of data, normal vs degraded, is scary. It sounds like totally silent data loss. So yeah if it's reproducible it's worthy of a separate bug.

Actually what I did is more complex: I first added a drive to a degraded
array, and then re-added the drive that had been removed.
I don't know if re-adding the same drive that was removed would cause the
bug I saw.

For now, my array is back to actually trying to store the backup I had meant
for it, and the drives seem stable now that I fixed the power issue.

Does someone else want to try? :)

Marc
Marc MERLIN March 20, 2014, 12:46 a.m. UTC | #22
On Thu, Mar 20, 2014 at 01:44:20AM +0100, Tobias Holst wrote:
> I tried the RAID6 implementation of btrfs and it looks like I had the
> same problem. Rebuild with "balance" worked but when a drive was
> removed when mounted and then readded, the chaos began. I tried it a
> few times. So when a drive fails (and this is just because of
> connection lost or similar non severe problems), then it is necessary
> to wipe the disc first before readding it, so btrfs will add it as a
> new disk and not try to readd the old one.

Good to know you got this too.

Just to confirm: did you get it to rebuild, or once a drive is lost/gets
behind, you're in degraded mode forever for those blocks?

Or were you able to balance?

Marc
Duncan March 20, 2014, 7:37 a.m. UTC | #23
Marc MERLIN posted on Wed, 19 Mar 2014 08:40:31 -0700 as excerpted:

> That's the thing though. If the bad device hadn't been forcibly removed,
> and apparently the only way to do this was to unmount, make the device
> node disappear, and remount in degraded mode, it looked to me like btrfs
> was still considering that the drive was part of the array and trying to
> write to it.
> After adding a drive, I couldn't quite tell if it was striping over 11
> drives or 10, but it felt that at least at times, it was striping over
> 11 drives with write failures on the missing drive.
> I can't prove it, but I'm thinking the new data I was writing was being
> striped in degraded mode.

FWIW, there are at least two problems here: one is a bug (or perhaps it'd more 
accurately be described as an as-yet-incomplete feature) unrelated to 
btrfs raid5/6 mode, the other is the incomplete raid5/6 support.  Both are 
known issues, however.

The incomplete raid5/6 is discussed well enough elsewhere including in 
this thread as a whole, which leaves the other issue.

The other issue, not specifically raid5/6 mode related, is that 
currently, in-kernel btrfs is basically oblivious to disappearing drives, 
thus explaining some of the more complex bits of the behavior you 
described.  Yes, the kernel has the device data and other layers know 
when a device goes missing, but it's basically a case of the right hand 
not knowing what the left hand is doing -- once set up on a set of 
devices, in-kernel btrfs basically doesn't do anything with the device 
information available to it, at least in terms of removing a device from 
its listing when it goes missing.  (It does seem to transparently handle 
a missing btrfs component device reappearing, arguably /too/ 
transparently!)

Basically all btrfs does is log errors when a component device 
disappears.  It doesn't do anything with the disappeared device, and 
really doesn't "know" it has disappeared at all, until an unmount and 
(possibly degraded) remount, at which point it re-enumerates the devices 
and again knows what's actually there... until a device disappears again.

There are actually patches being worked on to fix that situation as we 
speak, and it's possible they're actually in btrfs-next already.  (I've 
seen the patches and discussion go by on the list but haven't tracked 
them to the extent that I know current status, other than that they're 
not in mainline yet.)

Meanwhile, counter-intuitively, btrfs-userspace is sometimes more aware 
of current device status than btrfs-kernel is ATM, since parts of 
userspace actually either get current status from the kernel, or trigger 
a rescan in order to get it.  But even after a rescan updates what 
userspace knows and thus what the kernel as a whole knows, btrfs-kernel 
still doesn't actually use that new information available to it in the 
same kernel that btrfs-userspace used to get it from!
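
As far as I can tell, that userspace-triggered rescan amounts to little more than:

    $ btrfs device scan        # ask the kernel to re-enumerate btrfs component devices
    $ btrfs filesystem show    # then re-read what it now believes is present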

Knowing about that rather counterintuitive "little" inconsistency (which isn't 
actually so little) goes quite a way toward explaining what otherwise 
looks like illogical btrfs behavior -- how could kernel-btrfs not know 
the status of its own devices?
Tobias Holst March 20, 2014, 7:37 a.m. UTC | #24
I think after the balance it was a fine, non-degraded RAID again... As
far as I remember.

Tobby


2014-03-20 1:46 GMT+01:00 Marc MERLIN <marc@merlins.org>:
>
> On Thu, Mar 20, 2014 at 01:44:20AM +0100, Tobias Holst wrote:
> > I tried the RAID6 implementation of btrfs and it looks like I had the
> > same problem. Rebuild with "balance" worked but when a drive was
> > removed when mounted and then readded, the chaos began. I tried it a
> > few times. So when a drive fails (and this is just because of
> > connection lost or similar non severe problems), then it is necessary
> > to wipe the disc first before readding it, so btrfs will add it as a
> > new disk and not try to readd the old one.
>
> Good to know you got this too.
>
> Just to confirm: did you get it to rebuild, or once a drive is lost/gets
> behind, you're in degraded mode forever for those blocks?
>
> Or were you able to balance?
>
> Marc
> --
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems ....
>                                       .... what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/
Marc MERLIN March 23, 2014, 7:22 p.m. UTC | #25
On Wed, Mar 19, 2014 at 10:53:33AM -0600, Chris Murphy wrote:
> 
> On Mar 19, 2014, at 9:40 AM, Marc MERLIN <marc@merlins.org> wrote:
> > 
> > After adding a drive, I couldn't quite tell if it was striping over 11
> > drives or 10, but it felt that at least at times, it was striping over 11
> > drives with write failures on the missing drive.
> > I can't prove it, but I'm thinking the new data I was writing was being
> > striped in degraded mode.
> 
> Well it does sound fragile after all to add a drive to a degraded array, especially when it's not expressly treating the faulty drive as faulty. I think iotop will show what block devices are being written to. And in a VM it's easy (albeit rudimentary) with sparse files, as you can see them grow.
> 
> > 
> > Yes, although it's limited, you apparently only lose new data that was added
> > after you went into degraded mode and only if you add another drive where
> > you write more data.
> > In real life this shouldn't be too common, even if it is indeed a bug.
> 
> It's entirely plausible a drive power/data cable comes loose, runs for hours degraded before the wayward device is reseated. It'll be common enough. It's definitely not OK for all of that data in the interim to vanish just because the volume has resumed from degraded to normal. Two states of data, normal vs degraded, is scary. It sounds like totally silent data loss. So yeah if it's reproducible it's worthy of a separate bug.

I just got around to filing that bug:
https://bugzilla.kernel.org/show_bug.cgi?id=72811

In other news, I was able to do the following (rough command sketch after the list):
1) remove a drive
2) mount degraded
3) add a new drive
4) rebalance (that took 2 days with little data, 4 deadlocks and reboots
though)
5) remove the missing drive from the filesystem
6) remount the array without -o degraded
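
Roughly, in commands (device names hypothetical, and step 4 is the slow part):

    $ mount -o degraded LABEL=backupcopy /mnt/btrfs_backupcopy
    $ btrfs device add /dev/mapper/crypt_new /mnt/btrfs_backupcopy
    $ btrfs balance start /mnt/btrfs_backupcopy
    $ btrfs device delete missing /mnt/btrfs_backupcopy
    $ umount /mnt/btrfs_backupcopy
    $ mount LABEL=backupcopy /mnt/btrfs_backupcopy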

Now, I'm testing
1) add a new drive
2) remove a working drive
3) automatic rebalance from #2 should rebuild on the new drive automatically

Marc
diff mbox

Patch

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 6463691..d869079 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -3184,12 +3184,12 @@  static int wait_for_parent_move(struct send_ctx *sctx,
 	struct fs_path *path_after = NULL;
 	int len1, len2;
 
-	if (parent_ref->dir <= sctx->cur_ino)
-		return 0;
-
 	if (is_waiting_for_move(sctx, ino))
 		return 1;
 
+	if (parent_ref->dir <= sctx->cur_ino)
+		return 0;
+
 	ret = get_inode_info(sctx->parent_root, ino, NULL, &old_gen,
 			     NULL, NULL, NULL, NULL);
 	if (ret == -ENOENT)