[0/6] Chunk allocator DUP fix and cleanups

Message ID	20181004212443.26519-1-hans.van.kranenburg@mendix.com (mailing list archive)
Headers	show Return-Path: <linux-btrfs-owner@kernel.org> From: Hans van Kranenburg <hans.van.kranenburg@mendix.com> To: linux-btrfs@vger.kernel.org Subject: [PATCH 0/6] Chunk allocator DUP fix and cleanups Date: Thu, 4 Oct 2018 23:24:37 +0200 Message-Id: <20181004212443.26519-1-hans.van.kranenburg@mendix.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk
Series	Chunk allocator DUP fix and cleanups \| expand [0/6] Chunk allocator DUP fix and cleanups [1/6] btrfs: alloc_chunk: do not refurbish num_bytes [2/6] btrfs: alloc_chunk: improve chunk size variable name [3/6] btrfs: alloc_chunk: fix more DUP stripe size handling [4/6] btrfs: fix ncopies raid_attr for RAID56 [5/6] btrfs: introduce nparity raid_attr [6/6] btrfs: alloc_chunk: rework chunk/stripe calculations

Hans van Kranenburg Oct. 4, 2018, 9:24 p.m. UTC

This patch set contains an additional fix for a newly exposed bug after
the previous attempt to fix a chunk allocator bug for new DUP chunks:

https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c903@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374

The DUP fix is "fix more DUP stripe size handling". I did that one
before starting to change more things so it can be applied to earlier
LTS kernels.

Besides that patch, which is fixing the bug in a way that is least
intrusive, I added a bunch of other patches to help getting the chunk
allocator code in a state that is a bit less error-prone and
bug-attracting.

When running this and trying the reproduction scenario, I can now see
that the created DUP device extent is 827326464 bytes long, which is
good.

I wrote and tested this on top of linus 4.19-rc5. I still need to create
a list of related use cases and test more things to at least walk
through a bunch of obvious use cases to see if there's nothing exploding
too quickly with these changes. However, I'm quite confident about it,
so I'm sharing all of it already.

Any feedback and review is appreciated. Be gentle and keep in mind that
I'm still very much in a learning stage regarding kernel development.

The stable patches handling workflow is not 100% clear to me yet. I
guess I have to add a Fixes: in the DUP patch which points to the
previous commit 92e222df7b.

Moo!,
Knorrie

Hans van Kranenburg (6):
btrfs: alloc_chunk: do not refurbish num_bytes
btrfs: alloc_chunk: improve chunk size variable name
btrfs: alloc_chunk: fix more DUP stripe size handling
btrfs: fix ncopies raid_attr for RAID56
btrfs: introduce nparity raid_attr
btrfs: alloc_chunk: rework chunk/stripe calculations

fs/btrfs/volumes.c | 84 +++++++++++++++++++++++-----------------------
fs/btrfs/volumes.h | 4 ++-
2 files changed, 45 insertions(+), 43 deletions(-)

Qu Wenruo Oct. 5, 2018, 7:51 a.m. UTC | #1

On 2018/10/5 上午5:24, Hans van Kranenburg wrote:
> This patch set contains an additional fix for a newly exposed bug after
> the previous attempt to fix a chunk allocator bug for new DUP chunks:
> 
> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c903@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374

For that bug, did you succeeded in reproducing the bug?

I'm adding dev extent overlap checking in btrfs_verify_dev_extents() and
btrfs-progs.
I think it could help to detect such problem.

Thanks,
Qu

> 
> The DUP fix is "fix more DUP stripe size handling". I did that one
> before starting to change more things so it can be applied to earlier
> LTS kernels.
> 
> Besides that patch, which is fixing the bug in a way that is least
> intrusive, I added a bunch of other patches to help getting the chunk
> allocator code in a state that is a bit less error-prone and
> bug-attracting.
> 
> When running this and trying the reproduction scenario, I can now see
> that the created DUP device extent is 827326464 bytes long, which is
> good.
> 
> I wrote and tested this on top of linus 4.19-rc5. I still need to create
> a list of related use cases and test more things to at least walk
> through a bunch of obvious use cases to see if there's nothing exploding
> too quickly with these changes. However, I'm quite confident about it,
> so I'm sharing all of it already.
> 
> Any feedback and review is appreciated. Be gentle and keep in mind that
> I'm still very much in a learning stage regarding kernel development.
> 
> The stable patches handling workflow is not 100% clear to me yet. I
> guess I have to add a Fixes: in the DUP patch which points to the
> previous commit 92e222df7b.
> 
> Moo!,
> Knorrie
> 
> Hans van Kranenburg (6):
>   btrfs: alloc_chunk: do not refurbish num_bytes
>   btrfs: alloc_chunk: improve chunk size variable name
>   btrfs: alloc_chunk: fix more DUP stripe size handling
>   btrfs: fix ncopies raid_attr for RAID56
>   btrfs: introduce nparity raid_attr
>   btrfs: alloc_chunk: rework chunk/stripe calculations
> 
>  fs/btrfs/volumes.c | 84 +++++++++++++++++++++++-----------------------
>  fs/btrfs/volumes.h |  4 ++-
>  2 files changed, 45 insertions(+), 43 deletions(-)
>

Qu Wenruo Oct. 5, 2018, 7:51 a.m. UTC | #2

On 2018/10/5 上午5:24, Hans van Kranenburg wrote:
> This patch set contains an additional fix for a newly exposed bug after
> the previous attempt to fix a chunk allocator bug for new DUP chunks:
> 
> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c903@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374

For that bug, did you succeed in reproducing the bug?

I'm adding dev extent overlap checking in btrfs_verify_dev_extents() and
btrfs-progs.
I think it could help to detect such problem.

Thanks,
Qu

> 
> The DUP fix is "fix more DUP stripe size handling". I did that one
> before starting to change more things so it can be applied to earlier
> LTS kernels.
> 
> Besides that patch, which is fixing the bug in a way that is least
> intrusive, I added a bunch of other patches to help getting the chunk
> allocator code in a state that is a bit less error-prone and
> bug-attracting.
> 
> When running this and trying the reproduction scenario, I can now see
> that the created DUP device extent is 827326464 bytes long, which is
> good.
> 
> I wrote and tested this on top of linus 4.19-rc5. I still need to create
> a list of related use cases and test more things to at least walk
> through a bunch of obvious use cases to see if there's nothing exploding
> too quickly with these changes. However, I'm quite confident about it,
> so I'm sharing all of it already.
> 
> Any feedback and review is appreciated. Be gentle and keep in mind that
> I'm still very much in a learning stage regarding kernel development.
> 
> The stable patches handling workflow is not 100% clear to me yet. I
> guess I have to add a Fixes: in the DUP patch which points to the
> previous commit 92e222df7b.
> 
> Moo!,
> Knorrie
> 
> Hans van Kranenburg (6):
>   btrfs: alloc_chunk: do not refurbish num_bytes
>   btrfs: alloc_chunk: improve chunk size variable name
>   btrfs: alloc_chunk: fix more DUP stripe size handling
>   btrfs: fix ncopies raid_attr for RAID56
>   btrfs: introduce nparity raid_attr
>   btrfs: alloc_chunk: rework chunk/stripe calculations
> 
>  fs/btrfs/volumes.c | 84 +++++++++++++++++++++++-----------------------
>  fs/btrfs/volumes.h |  4 ++-
>  2 files changed, 45 insertions(+), 43 deletions(-)
>

Hans van Kranenburg Oct. 5, 2018, 10:58 a.m. UTC | #3

On 10/05/2018 09:51 AM, Qu Wenruo wrote:
> 
> 
> On 2018/10/5 上午5:24, Hans van Kranenburg wrote:
>> This patch set contains an additional fix for a newly exposed bug after
>> the previous attempt to fix a chunk allocator bug for new DUP chunks:
>>
>> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c903@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374
> 
> For that bug, did you succeeded in reproducing the bug?

Yes, open the above link and scroll to "Steps to reproduce".

o/

> I'm adding dev extent overlap checking in btrfs_verify_dev_extents() and
> btrfs-progs.
> I think it could help to detect such problem.
> 
> Thanks,
> Qu
> 
>>
>> The DUP fix is "fix more DUP stripe size handling". I did that one
>> before starting to change more things so it can be applied to earlier
>> LTS kernels.
>>
>> Besides that patch, which is fixing the bug in a way that is least
>> intrusive, I added a bunch of other patches to help getting the chunk
>> allocator code in a state that is a bit less error-prone and
>> bug-attracting.
>>
>> When running this and trying the reproduction scenario, I can now see
>> that the created DUP device extent is 827326464 bytes long, which is
>> good.
>>
>> I wrote and tested this on top of linus 4.19-rc5. I still need to create
>> a list of related use cases and test more things to at least walk
>> through a bunch of obvious use cases to see if there's nothing exploding
>> too quickly with these changes. However, I'm quite confident about it,
>> so I'm sharing all of it already.
>>
>> Any feedback and review is appreciated. Be gentle and keep in mind that
>> I'm still very much in a learning stage regarding kernel development.
>>
>> The stable patches handling workflow is not 100% clear to me yet. I
>> guess I have to add a Fixes: in the DUP patch which points to the
>> previous commit 92e222df7b.
>>
>> Moo!,
>> Knorrie
>>
>> Hans van Kranenburg (6):
>>   btrfs: alloc_chunk: do not refurbish num_bytes
>>   btrfs: alloc_chunk: improve chunk size variable name
>>   btrfs: alloc_chunk: fix more DUP stripe size handling
>>   btrfs: fix ncopies raid_attr for RAID56
>>   btrfs: introduce nparity raid_attr
>>   btrfs: alloc_chunk: rework chunk/stripe calculations
>>
>>  fs/btrfs/volumes.c | 84 +++++++++++++++++++++++-----------------------
>>  fs/btrfs/volumes.h |  4 ++-
>>  2 files changed, 45 insertions(+), 43 deletions(-)
>>
>

Qu Wenruo Oct. 8, 2018, 6:43 a.m. UTC | #4

On 2018/10/5 下午6:58, Hans van Kranenburg wrote:
> On 10/05/2018 09:51 AM, Qu Wenruo wrote:
>>
>>
>> On 2018/10/5 上午5:24, Hans van Kranenburg wrote:
>>> This patch set contains an additional fix for a newly exposed bug after
>>> the previous attempt to fix a chunk allocator bug for new DUP chunks:
>>>
>>> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c903@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374
>>
>> For that bug, did you succeeded in reproducing the bug?
> 
> Yes, open the above link and scroll to "Steps to reproduce".

That's beyond device boundary one. Also reproduced here.
And hand-crafted a super small image as test case for btrfs-progs.

But I'm a little curious about the dev extent overlapping case.
Have you got one?

Thanks,
Qu

> 
> o/
> 
>> I'm adding dev extent overlap checking in btrfs_verify_dev_extents() and
>> btrfs-progs.
>> I think it could help to detect such problem.
>>
>> Thanks,
>> Qu
>>
>>>
>>> The DUP fix is "fix more DUP stripe size handling". I did that one
>>> before starting to change more things so it can be applied to earlier
>>> LTS kernels.
>>>
>>> Besides that patch, which is fixing the bug in a way that is least
>>> intrusive, I added a bunch of other patches to help getting the chunk
>>> allocator code in a state that is a bit less error-prone and
>>> bug-attracting.
>>>
>>> When running this and trying the reproduction scenario, I can now see
>>> that the created DUP device extent is 827326464 bytes long, which is
>>> good.
>>>
>>> I wrote and tested this on top of linus 4.19-rc5. I still need to create
>>> a list of related use cases and test more things to at least walk
>>> through a bunch of obvious use cases to see if there's nothing exploding
>>> too quickly with these changes. However, I'm quite confident about it,
>>> so I'm sharing all of it already.
>>>
>>> Any feedback and review is appreciated. Be gentle and keep in mind that
>>> I'm still very much in a learning stage regarding kernel development.
>>>
>>> The stable patches handling workflow is not 100% clear to me yet. I
>>> guess I have to add a Fixes: in the DUP patch which points to the
>>> previous commit 92e222df7b.
>>>
>>> Moo!,
>>> Knorrie
>>>
>>> Hans van Kranenburg (6):
>>>   btrfs: alloc_chunk: do not refurbish num_bytes
>>>   btrfs: alloc_chunk: improve chunk size variable name
>>>   btrfs: alloc_chunk: fix more DUP stripe size handling
>>>   btrfs: fix ncopies raid_attr for RAID56
>>>   btrfs: introduce nparity raid_attr
>>>   btrfs: alloc_chunk: rework chunk/stripe calculations
>>>
>>>  fs/btrfs/volumes.c | 84 +++++++++++++++++++++++-----------------------
>>>  fs/btrfs/volumes.h |  4 ++-
>>>  2 files changed, 45 insertions(+), 43 deletions(-)
>>>
>>
> 
>

Hans van Kranenburg Oct. 8, 2018, 1:19 p.m. UTC | #5

On 10/08/2018 08:43 AM, Qu Wenruo wrote:
> 
> 
> On 2018/10/5 下午6:58, Hans van Kranenburg wrote:
>> On 10/05/2018 09:51 AM, Qu Wenruo wrote:
>>>
>>>
>>> On 2018/10/5 上午5:24, Hans van Kranenburg wrote:
>>>> This patch set contains an additional fix for a newly exposed bug after
>>>> the previous attempt to fix a chunk allocator bug for new DUP chunks:
>>>>
>>>> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c903@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374
>>>
>>> For that bug, did you succeeded in reproducing the bug?
>>
>> Yes, open the above link and scroll to "Steps to reproduce".
> 
> That's beyond device boundary one. Also reproduced here.
> And hand-crafted a super small image as test case for btrfs-progs.
> 
> But I'm a little curious about the dev extent overlapping case.
> Have you got one?

Ah ok, I see. No, I didn't do that yet.

By using unmodified tooling, I think this can be done by a combination
of a few resizings and using very specific balance to cause a fs of
exactly 7880MiB again with a single 1578MiB gap inside...

I'll try later today to see if I can come up with a recipe for this.

Hans van Kranenburg Oct. 8, 2018, 11:51 p.m. UTC | #6

On 10/08/2018 03:19 PM, Hans van Kranenburg wrote:
> On 10/08/2018 08:43 AM, Qu Wenruo wrote:
>>
>>
>> On 2018/10/5 下午6:58, Hans van Kranenburg wrote:
>>> On 10/05/2018 09:51 AM, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2018/10/5 上午5:24, Hans van Kranenburg wrote:
>>>>> This patch set contains an additional fix for a newly exposed bug after
>>>>> the previous attempt to fix a chunk allocator bug for new DUP chunks:
>>>>>
>>>>> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c903@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374
>>>>
>>>> For that bug, did you succeeded in reproducing the bug?
>>>
>>> Yes, open the above link and scroll to "Steps to reproduce".
>>
>> That's beyond device boundary one. Also reproduced here.
>> And hand-crafted a super small image as test case for btrfs-progs.
>>
>> But I'm a little curious about the dev extent overlapping case.
>> Have you got one?
> 
> Ah ok, I see. No, I didn't do that yet.
> 
> By using unmodified tooling, I think this can be done by a combination
> of a few resizings and using very specific balance to cause a fs of
> exactly 7880MiB again with a single 1578MiB gap inside...
> 
> I'll try later today to see if I can come up with a recipe for this.

Ok, this is actually pretty simple to do:

---- >8 ----

-# mkdir bork
-# cd bork
-# dd if=/dev/zero of=image bs=1 count=0 seek=1024M
0+0 records in
0+0 records out
0 bytes copied, 0.000183343 s, 0.0 kB/s
-# mkfs.btrfs -d dup -m dup image
-# losetup -f image
-# mount -o space_cache=v2 /dev/loop0 mountpoint
-# dd if=/dev/zero of=mountpoint/flapsie bs=1M
dd: error writing 'mountpoint/flapsie': No space left on device
453+0 records in
452+0 records out
474185728 bytes (474 MB, 452 MiB) copied, 4.07663 s, 116 MB/s

---- >8 ----

-# ./show_usage.py /bork/mountpoint/
Target profile for SYSTEM (chunk tree): DUP
Target profile for METADATA: DUP
Target profile for DATA: DUP
Mixed groups: False

Virtual space usage by block group type:
|
| type                    total         used
| ----                    -----         ----
| Data                452.31MiB    452.22MiB
| System                8.00MiB     16.00KiB
| Metadata             51.19MiB    656.00KiB

Total raw filesystem size: 1.00GiB
Total raw allocated bytes: 1023.00MiB
Allocatable bytes remaining: 1.00MiB
Unallocatable raw bytes: 0.00B
Unallocatable bytes that can be reclaimed by balancing: 0.00B

Estimated virtual space left to use for metadata: 51.05MiB
Estimated virtual space left to use for data: 96.00KiB

Allocated raw disk bytes by chunk type. Parity is a reserved part of the
allocated bytes, limiting the amount that can be used for data or metadata:
|
| flags               allocated         used       parity
| -----               ---------         ----       ------
| DATA|DUP            904.62MiB    904.44MiB        0.00B
| SYSTEM|DUP           16.00MiB     32.00KiB        0.00B
| METADATA|DUP        102.38MiB      1.28MiB        0.00B

Allocated bytes per device:
|
| devid      total size    allocated path
| -----      ----------    --------- ----
| 1             1.00GiB   1023.00MiB /dev/loop0

Allocated bytes per device, split up per chunk type. Parity bytes are again
part of the total amount of allocated bytes.
|
| Device ID: 1
| | flags               allocated       parity
| | -----               ---------       ------
| | DATA|DUP            904.62MiB        0.00B
| | SYSTEM|DUP           16.00MiB        0.00B
| | METADATA|DUP        102.38MiB        0.00B

Unallocatable raw bytes per device:
|
| devid    unallocatable
| -----    -------------
| 1               0.00B

---- >8 ----

Now we're gonna cause some neat 1578MiB to happen that we're going to
free up later to reproduce the failure:

-# dd if=/dev/zero of=image bs=1 count=0 seek=2602M
0+0 records in
0+0 records out
0 bytes copied, 0.000188621 s, 0.0 kB/s
-# losetup -c /dev/loop0
-# btrfs fi resize max mountpoint/
Resize 'mountpoint/' of 'max'
-# dd if=/dev/zero of=mountpoint/1578MiB bs=1M
dd: error writing 'mountpoint/1578MiB': No space left on device
790+0 records in
789+0 records out
827326464 bytes (827 MB, 789 MiB) copied, 12.2452 s, 67.6 MB/s

---- >8 ----

-# python3
import btrfs
fs = btrfs.FileSystem('/bork/mountpoint/')
for d in fs.dev_extents():
  print("start {} end {} vaddr {}".format(d.paddr, d.paddr + d.length,
d.vaddr))

start 1048576 end 11534336 vaddr 547880960
start 11534336 end 22020096 vaddr 547880960
start 22020096 end 30408704 vaddr 22020096
start 30408704 end 38797312 vaddr 22020096
start 38797312 end 92471296 vaddr 30408704
start 92471296 end 146145280 vaddr 30408704
start 146145280 end 213254144 vaddr 84082688
start 213254144 end 280363008 vaddr 84082688
start 280363008 end 397803520 vaddr 151191552
start 397803520 end 515244032 vaddr 151191552
start 515244032 end 632684544 vaddr 268632064
start 632684544 end 750125056 vaddr 268632064
start 750125056 end 867565568 vaddr 386072576
start 867565568 end 985006080 vaddr 386072576
start 985006080 end 1029373952 vaddr 503513088
start 1029373952 end 1073741824 vaddr 503513088
start 1073741824 end 1358954496 vaddr 558366720
start 1358954496 end 1644167168 vaddr 558366720
start 1644167168 end 1929379840 vaddr 843579392
start 1929379840 end 2214592512 vaddr 843579392
start 2214592512 end 2471493632 vaddr 1128792064
start 2471493632 end 2728394752 vaddr 1128792064

The last three block groups were just added, using up the new space:

for vaddr in (558366720, 843579392, 1128792064):
  print(fs.block_group(vaddr))

block group vaddr 558366720 length 285212672 flags DATA|DUP used
285212672 used_pct 100
block group vaddr 843579392 length 285212672 flags DATA|DUP used
285212672 used_pct 100
block group vaddr 1128792064 length 256901120 flags DATA|DUP used
256901120 used_pct 100

By the way.. for the first and second time (to do the writeup) I did
this, extent allocation looks like paste linked below O_o... I have no
idea where the pattern with decreasing sizes the first time is coming
from, and no idea why the second time all the data is being stored in
144KiB extents...

http://paste.debian.net/plainh/537bdd95

---- >8 ----

Ok, now let's extend the disk to our famous number of 7880M

-# dd if=/dev/zero of=image bs=1 count=0 seek=7880M
0+0 records in
0+0 records out
0 bytes copied, 0.000185768 s, 0.0 kB/s
-# losetup -c /dev/loop0
-# btrfs fi resize max mountpoint/
Resize 'mountpoint/' of 'max'

-# ./show_usage.py /bork/mountpoint/
[...]
Total raw filesystem size: 7.70GiB
Total raw allocated bytes: 2.54GiB
Allocatable bytes remaining: 5.16GiB
[...]

-# dd if=/dev/zero of=mountpoint/bakkiepleur bs=1M
dd: error writing 'mountpoint/bakkiepleur': No space left on device
2384+0 records in
2383+0 records out
2498756608 bytes (2.5 GB, 2.3 GiB) copied, 34.1946 s, 73.1 MB/s

-# ./show_usage.py /bork/mountpoint/
Target profile for SYSTEM (chunk tree): DUP
Target profile for METADATA: DUP
Target profile for DATA: DUP
Mixed groups: False

Virtual space usage by block group type:
|
| type                    total         used
| ----                    -----         ----
| Data                  3.54GiB      3.54GiB
| System                8.00MiB     16.00KiB
| Metadata            307.19MiB    102.05MiB

Total raw filesystem size: 7.70GiB
Total raw allocated bytes: 7.69GiB
Allocatable bytes remaining: 1.00MiB
Unallocatable raw bytes: 0.00B
Unallocatable bytes that can be reclaimed by balancing: 0.00B

Estimated virtual space left to use for metadata: 205.64MiB
Estimated virtual space left to use for data: 0.00B

Allocated raw disk bytes by chunk type. Parity is a reserved part of the
allocated bytes, limiting the amount that can be used for data or metadata:
|
| flags               allocated         used       parity
| -----               ---------         ----       ------
| DATA|DUP              7.08GiB      7.08GiB        0.00B
| SYSTEM|DUP           16.00MiB     32.00KiB        0.00B
| METADATA|DUP        614.38MiB    204.09MiB        0.00B

Allocated bytes per device:
|
| devid      total size    allocated path
| -----      ----------    --------- ----
| 1             7.70GiB      7.69GiB /dev/loop0

Allocated bytes per device, split up per chunk type. Parity bytes are again
part of the total amount of allocated bytes.
|
| Device ID: 1
| | flags               allocated       parity
| | -----               ---------       ------
| | DATA|DUP              7.08GiB        0.00B
| | SYSTEM|DUP           16.00MiB        0.00B
| | METADATA|DUP        614.38MiB        0.00B

Unallocatable raw bytes per device:
|
| devid    unallocatable
| -----    -------------
| 1               0.00B

---- >8 ----

Now, generating the gap in the physical is as simple as dropping the
1578MiB file and triggering a transaction commit to clean up empty block
groups:

-# rm mountpoint/1578MiB
-# btrfs fi sync mountpoint/

for d in fs.dev_extents():
  print("start {} end {} vaddr {}".format(d.paddr, d.paddr + d.length,
d.vaddr))

start 1048576 end 11534336 vaddr 547880960
start 11534336 end 22020096 vaddr 547880960
start 22020096 end 30408704 vaddr 22020096
start 30408704 end 38797312 vaddr 22020096
start 38797312 end 92471296 vaddr 30408704
start 92471296 end 146145280 vaddr 30408704
start 146145280 end 213254144 vaddr 84082688
start 213254144 end 280363008 vaddr 84082688
start 280363008 end 397803520 vaddr 151191552
start 397803520 end 515244032 vaddr 151191552
start 515244032 end 632684544 vaddr 268632064
start 632684544 end 750125056 vaddr 268632064
start 750125056 end 867565568 vaddr 386072576
start 867565568 end 985006080 vaddr 386072576
start 985006080 end 1029373952 vaddr 503513088
start 1029373952 end 1073741824 vaddr 503513088

  [558366720, 843579392 and 1128792064 are gone]

start 2728394752 end 3567255552 vaddr 1385693184
start 3567255552 end 4406116352 vaddr 1385693184
start 4406116352 end 5244977152 vaddr 2224553984
start 5244977152 end 6083837952 vaddr 2224553984
start 6083837952 end 6352273408 vaddr 3063414784
start 6352273408 end 6620708864 vaddr 3063414784
start 6620708864 end 7441743872 vaddr 3331850240
start 7441743872 end 8262778880 vaddr 3331850240

-# ./show_usage.py /bork/mountpoint/
[...]
Allocatable bytes remaining: 1.54GiB
[...]

---- >8 ----

And then...

-# dd if=/dev/zero of=mountpoint/omgwtfbbq bs=1M
dd: error writing 'mountpoint/omgwtfbbq': No space left on device
801+0 records in
800+0 records out
838860800 bytes (839 MB, 800 MiB) copied, 12.6689 s, 66.2 MB/s

We happily write 800MiB(!), destroying the first 22MiB of the device
extent at 2728394752...

for d in fs.dev_extents():
  print("start {} end {} vaddr {}".format(d.paddr, d.paddr + d.length,
d.vaddr))

[...]
start 1912602624 end 2751463424 vaddr 4152885248
start 2728394752 end 3567255552 vaddr 1385693184
[...]

import btrfs
fs = btrfs.FileSystem('/bork/mountpoint/')
paddr = 1912602624
tree = btrfs.ctree.DEV_TREE_OBJECTID
devid = 1
for header, data in btrfs.ioctl.search_v2(fs.fd, tree, key, key):
  btrfs.utils.pretty_print(btrfs.ctree.DevExtent(header, data))

<btrfs.ctree.DevExtent (1 DEV_EXTENT 1912602624)>
devid: 1 (key objectid)
paddr: 1912602624 (key offset)
length: 838860800
chunk_offset: 4152885248
uuid: 00000000-0000-0000-0000-000000000000
chunk_objectid: 256
chunk_tree: 3

Ouch!

-# ./show_usage.py /bork/mountpoint/

[...]
Total raw filesystem size: 7.70GiB
Total raw allocated bytes: 7.72GiB
Allocatable bytes remaining: 0.00B
Unallocatable raw bytes: -22020096.00B
[...]

---- >8 ----

And now more fun:

-# btrfs scrub status mountpoint/
scrub status for 1f7a998a-4ea3-4117-8b42-9d0e6f1d3b4c
	scrub started at Tue Oct  9 01:35:30 2018 and finished after 00:00:13
	total bytes scrubbed: 6.52GiB with 0 errors

But that's probably because it's all data from /dev/zero...

-# btrfs check /dev/loop0
Checking filesystem on /dev/loop0
UUID: 1f7a998a-4ea3-4117-8b42-9d0e6f1d3b4c
checking extents
Device extent[1, 2728394752, 838860800] existed.
Chunk[256, 228, 1385693184] stripe[1, 2728394752] dismatch dev extent[1,
1912602624, 838860800]
Dev extent's total-byte(7445938176) is not equal to
byte-used(8284798976) in dev[1, 216, 1]
ERROR: errors found in extent allocation tree or chunk allocation
checking free space tree
checking fs roots
checking only csum items (without verifying data)
checking root refs
found 3917824000 bytes used, error(s) found
total csum bytes: 3722464
total tree bytes: 106020864
total fs tree bytes: 48955392
total extent tree bytes: 48644096
btree space waste bytes: 1428937
file data blocks allocated: 3811803136
 referenced 3811803136

This one actually already detects that something is wrong.

David Sterba Oct. 11, 2018, 3:13 p.m. UTC | #7

On Thu, Oct 04, 2018 at 11:24:37PM +0200, Hans van Kranenburg wrote:
> This patch set contains an additional fix for a newly exposed bug after
> the previous attempt to fix a chunk allocator bug for new DUP chunks:
> 
> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c903@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374
> 
> The DUP fix is "fix more DUP stripe size handling". I did that one
> before starting to change more things so it can be applied to earlier
> LTS kernels.
> 
> Besides that patch, which is fixing the bug in a way that is least
> intrusive, I added a bunch of other patches to help getting the chunk
> allocator code in a state that is a bit less error-prone and
> bug-attracting.
> 
> When running this and trying the reproduction scenario, I can now see
> that the created DUP device extent is 827326464 bytes long, which is
> good.
> 
> I wrote and tested this on top of linus 4.19-rc5. I still need to create
> a list of related use cases and test more things to at least walk
> through a bunch of obvious use cases to see if there's nothing exploding
> too quickly with these changes. However, I'm quite confident about it,
> so I'm sharing all of it already.
> 
> Any feedback and review is appreciated. Be gentle and keep in mind that
> I'm still very much in a learning stage regarding kernel development.

The patches look good, thanks. Problem is explained, preparatory work is
separated from the fix itself.

> The stable patches handling workflow is not 100% clear to me yet. I
> guess I have to add a Fixes: in the DUP patch which points to the
> previous commit 92e222df7b.

Almost nobody does it right, no worries. If you can identify a single
patch that introduces a bug then it's for Fixes:, otherwise a CC: stable
with version where it makes sense & applies is enough. I do that check
myself regardless of what's in the patch.

I ran the patches in a VM and hit a division-by-zero in test
fstests/btrfs/011, stacktrace below. First guess is that it's caused by
patch 3/6.

[ 3116.065595] BTRFS: device fsid e3bd8db5-304f-4b1a-8488-7791ea94088f devid 1 transid 5 /dev/vdb
[ 3116.071274] BTRFS: device fsid e3bd8db5-304f-4b1a-8488-7791ea94088f devid 2 transid 5 /dev/vdc
[ 3116.087086] BTRFS info (device vdb): disk space caching is enabled
[ 3116.088644] BTRFS info (device vdb): has skinny extents
[ 3116.089796] BTRFS info (device vdb): flagging fs with big metadata feature
[ 3116.093971] BTRFS info (device vdb): checking UUID tree
[ 3125.853755] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd started
[ 3125.860269] divide error: 0000 [#1] PREEMPT SMP
[ 3125.861264] CPU: 1 PID: 6477 Comm: btrfs Not tainted 4.19.0-rc7-default+ #288
[ 3125.862841] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
[ 3125.865385] RIP: 0010:__btrfs_alloc_chunk+0x368/0xa70 [btrfs]
[ 3125.870541] RSP: 0018:ffffa4ea0409fa48 EFLAGS: 00010206
[ 3125.871862] RAX: 0000000004000000 RBX: ffff94e074374508 RCX: 0000000000000002
[ 3125.873587] RDX: 0000000000000000 RSI: ffff94e017818c80 RDI: 0000000002000000
[ 3125.874677] RBP: 0000000080800000 R08: 0000000000000000 R09: 0000000000000002
[ 3125.875816] R10: 0000000300000000 R11: 0000000080900000 R12: 0000000000000000
[ 3125.876742] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000002
[ 3125.877657] FS:  00007f6de34208c0(0000) GS:ffff94e07d600000(0000) knlGS:0000000000000000
[ 3125.878862] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3125.880080] CR2: 00007ffe963d5ce8 CR3: 000000007659b000 CR4: 00000000000006e0
[ 3125.881485] Call Trace:
[ 3125.882105]  do_chunk_alloc+0x266/0x3e0 [btrfs]
[ 3125.882841]  btrfs_inc_block_group_ro+0x10e/0x160 [btrfs]
[ 3125.883875]  scrub_enumerate_chunks+0x18b/0x5d0 [btrfs]
[ 3125.884658]  ? is_module_address+0x11/0x30
[ 3125.885271]  ? wait_for_completion+0x160/0x190
[ 3125.885928]  btrfs_scrub_dev+0x1b8/0x5a0 [btrfs]
[ 3125.887767]  ? start_transaction+0xa1/0x470 [btrfs]
[ 3125.888648]  btrfs_dev_replace_start.cold.19+0x155/0x17e [btrfs]
[ 3125.889459]  btrfs_dev_replace_by_ioctl+0x35/0x60 [btrfs]
[ 3125.890251]  btrfs_ioctl+0x2a94/0x31d0 [btrfs]
[ 3125.890885]  ? do_sigaction+0x7c/0x210
[ 3125.891731]  ? do_vfs_ioctl+0xa2/0x6b0
[ 3125.892652]  do_vfs_ioctl+0xa2/0x6b0
[ 3125.893642]  ? do_sigaction+0x1a7/0x210
[ 3125.894665]  ksys_ioctl+0x3a/0x70
[ 3125.895523]  __x64_sys_ioctl+0x16/0x20
[ 3125.896339]  do_syscall_64+0x5a/0x1a0
[ 3125.896949]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 3125.897638] RIP: 0033:0x7f6de28ecaa7
[ 3125.901313] RSP: 002b:00007ffe963da9c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 3125.902486] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f6de28ecaa7
[ 3125.903538] RDX: 00007ffe963dae00 RSI: 00000000ca289435 RDI: 0000000000000003
[ 3125.904878] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 3125.905788] R10: 0000000000000008 R11: 0000000000000246 R12: 00007ffe963de26f
[ 3125.906700] R13: 0000000000000001 R14: 0000000000000004 R15: 000055fceeceb2a0
[ 3125.907954] Modules linked in: btrfs libcrc32c xor zstd_decompress zstd_compress xxhash raid6_pq loop
[ 3125.909342] ---[ end trace 5492bb467d3be2da ]---
[ 3125.910031] RIP: 0010:__btrfs_alloc_chunk+0x368/0xa70 [btrfs]
[ 3125.913600] RSP: 0018:ffffa4ea0409fa48 EFLAGS: 00010206
[ 3125.914595] RAX: 0000000004000000 RBX: ffff94e074374508 RCX: 0000000000000002
[ 3125.916209] RDX: 0000000000000000 RSI: ffff94e017818c80 RDI: 0000000002000000
[ 3125.917701] RBP: 0000000080800000 R08: 0000000000000000 R09: 0000000000000002
[ 3125.919209] R10: 0000000300000000 R11: 0000000080900000 R12: 0000000000000000
[ 3125.920782] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000002
[ 3125.922413] FS:  00007f6de34208c0(0000) GS:ffff94e07d600000(0000) knlGS:0000000000000000
[ 3125.924264] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3125.925627] CR2: 00007ffe963d5ce8 CR3: 000000007659b000 CR4: 00000000000006e0

Hans van Kranenburg Oct. 11, 2018, 7:40 p.m. UTC | #8

On 10/11/2018 05:13 PM, David Sterba wrote:
> On Thu, Oct 04, 2018 at 11:24:37PM +0200, Hans van Kranenburg wrote:
>> This patch set contains an additional fix for a newly exposed bug after
>> the previous attempt to fix a chunk allocator bug for new DUP chunks:
>>
>> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c903@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374
>>
>> The DUP fix is "fix more DUP stripe size handling". I did that one
>> before starting to change more things so it can be applied to earlier
>> LTS kernels.
>>
>> Besides that patch, which is fixing the bug in a way that is least
>> intrusive, I added a bunch of other patches to help getting the chunk
>> allocator code in a state that is a bit less error-prone and
>> bug-attracting.
>>
>> When running this and trying the reproduction scenario, I can now see
>> that the created DUP device extent is 827326464 bytes long, which is
>> good.
>>
>> I wrote and tested this on top of linus 4.19-rc5. I still need to create
>> a list of related use cases and test more things to at least walk
>> through a bunch of obvious use cases to see if there's nothing exploding
>> too quickly with these changes. However, I'm quite confident about it,
>> so I'm sharing all of it already.
>>
>> Any feedback and review is appreciated. Be gentle and keep in mind that
>> I'm still very much in a learning stage regarding kernel development.
> 
> The patches look good, thanks. Problem is explained, preparatory work is
> separated from the fix itself.

\o/

>> The stable patches handling workflow is not 100% clear to me yet. I
>> guess I have to add a Fixes: in the DUP patch which points to the
>> previous commit 92e222df7b.
> 
> Almost nobody does it right, no worries. If you can identify a single
> patch that introduces a bug then it's for Fixes:, otherwise a CC: stable
> with version where it makes sense & applies is enough. I do that check
> myself regardless of what's in the patch.

It's 92e222df7b and the thing I'm not sure about is if it also will
catch the same patch which was already backported to LTS kernels since
92e222df7b also has Fixes in it... So by now the new bug is in 4.19,
4.14, 4.9, 4.4, 3.16...

> I ran the patches in a VM and hit a division-by-zero in test
> fstests/btrfs/011, stacktrace below. First guess is that it's caused by
> patch 3/6.

Ah interesting, dev replace.

I'll play around with replace and find out how to run the tests properly
and then reproduce this.

The code introduced in patch 3 is removed again in patch 6, so I don't
suspect that one. :)

But, I'll find out.

Thanks for testing.

Hans

> [ 3116.065595] BTRFS: device fsid e3bd8db5-304f-4b1a-8488-7791ea94088f devid 1 transid 5 /dev/vdb
> [ 3116.071274] BTRFS: device fsid e3bd8db5-304f-4b1a-8488-7791ea94088f devid 2 transid 5 /dev/vdc
> [ 3116.087086] BTRFS info (device vdb): disk space caching is enabled
> [ 3116.088644] BTRFS info (device vdb): has skinny extents
> [ 3116.089796] BTRFS info (device vdb): flagging fs with big metadata feature
> [ 3116.093971] BTRFS info (device vdb): checking UUID tree
> [ 3125.853755] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd started
> [ 3125.860269] divide error: 0000 [#1] PREEMPT SMP
> [ 3125.861264] CPU: 1 PID: 6477 Comm: btrfs Not tainted 4.19.0-rc7-default+ #288
> [ 3125.862841] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
> [ 3125.865385] RIP: 0010:__btrfs_alloc_chunk+0x368/0xa70 [btrfs]
> [ 3125.870541] RSP: 0018:ffffa4ea0409fa48 EFLAGS: 00010206
> [ 3125.871862] RAX: 0000000004000000 RBX: ffff94e074374508 RCX: 0000000000000002
> [ 3125.873587] RDX: 0000000000000000 RSI: ffff94e017818c80 RDI: 0000000002000000
> [ 3125.874677] RBP: 0000000080800000 R08: 0000000000000000 R09: 0000000000000002
> [ 3125.875816] R10: 0000000300000000 R11: 0000000080900000 R12: 0000000000000000
> [ 3125.876742] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000002
> [ 3125.877657] FS:  00007f6de34208c0(0000) GS:ffff94e07d600000(0000) knlGS:0000000000000000
> [ 3125.878862] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3125.880080] CR2: 00007ffe963d5ce8 CR3: 000000007659b000 CR4: 00000000000006e0
> [ 3125.881485] Call Trace:
> [ 3125.882105]  do_chunk_alloc+0x266/0x3e0 [btrfs]
> [ 3125.882841]  btrfs_inc_block_group_ro+0x10e/0x160 [btrfs]
> [ 3125.883875]  scrub_enumerate_chunks+0x18b/0x5d0 [btrfs]
> [ 3125.884658]  ? is_module_address+0x11/0x30
> [ 3125.885271]  ? wait_for_completion+0x160/0x190
> [ 3125.885928]  btrfs_scrub_dev+0x1b8/0x5a0 [btrfs]
> [ 3125.887767]  ? start_transaction+0xa1/0x470 [btrfs]
> [ 3125.888648]  btrfs_dev_replace_start.cold.19+0x155/0x17e [btrfs]
> [ 3125.889459]  btrfs_dev_replace_by_ioctl+0x35/0x60 [btrfs]
> [ 3125.890251]  btrfs_ioctl+0x2a94/0x31d0 [btrfs]
> [ 3125.890885]  ? do_sigaction+0x7c/0x210
> [ 3125.891731]  ? do_vfs_ioctl+0xa2/0x6b0
> [ 3125.892652]  do_vfs_ioctl+0xa2/0x6b0
> [ 3125.893642]  ? do_sigaction+0x1a7/0x210
> [ 3125.894665]  ksys_ioctl+0x3a/0x70
> [ 3125.895523]  __x64_sys_ioctl+0x16/0x20
> [ 3125.896339]  do_syscall_64+0x5a/0x1a0
> [ 3125.896949]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ 3125.897638] RIP: 0033:0x7f6de28ecaa7
> [ 3125.901313] RSP: 002b:00007ffe963da9c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 3125.902486] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f6de28ecaa7
> [ 3125.903538] RDX: 00007ffe963dae00 RSI: 00000000ca289435 RDI: 0000000000000003
> [ 3125.904878] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> [ 3125.905788] R10: 0000000000000008 R11: 0000000000000246 R12: 00007ffe963de26f
> [ 3125.906700] R13: 0000000000000001 R14: 0000000000000004 R15: 000055fceeceb2a0
> [ 3125.907954] Modules linked in: btrfs libcrc32c xor zstd_decompress zstd_compress xxhash raid6_pq loop
> [ 3125.909342] ---[ end trace 5492bb467d3be2da ]---
> [ 3125.910031] RIP: 0010:__btrfs_alloc_chunk+0x368/0xa70 [btrfs]
> [ 3125.913600] RSP: 0018:ffffa4ea0409fa48 EFLAGS: 00010206
> [ 3125.914595] RAX: 0000000004000000 RBX: ffff94e074374508 RCX: 0000000000000002
> [ 3125.916209] RDX: 0000000000000000 RSI: ffff94e017818c80 RDI: 0000000002000000
> [ 3125.917701] RBP: 0000000080800000 R08: 0000000000000000 R09: 0000000000000002
> [ 3125.919209] R10: 0000000300000000 R11: 0000000080900000 R12: 0000000000000000
> [ 3125.920782] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000002
> [ 3125.922413] FS:  00007f6de34208c0(0000) GS:ffff94e07d600000(0000) knlGS:0000000000000000
> [ 3125.924264] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3125.925627] CR2: 00007ffe963d5ce8 CR3: 000000007659b000 CR4: 00000000000006e0
>

Hans van Kranenburg Oct. 11, 2018, 8:34 p.m. UTC | #9

On 10/11/2018 09:40 PM, Hans van Kranenburg wrote:
> On 10/11/2018 05:13 PM, David Sterba wrote:
>> On Thu, Oct 04, 2018 at 11:24:37PM +0200, Hans van Kranenburg wrote:
>>> This patch set contains an additional fix for a newly exposed bug after
>>> the previous attempt to fix a chunk allocator bug for new DUP chunks:
>>>
>>> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c903@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374
>>>
>>> The DUP fix is "fix more DUP stripe size handling". I did that one
>>> before starting to change more things so it can be applied to earlier
>>> LTS kernels.
>>>
>>> Besides that patch, which is fixing the bug in a way that is least
>>> intrusive, I added a bunch of other patches to help getting the chunk
>>> allocator code in a state that is a bit less error-prone and
>>> bug-attracting.
>>>
>>> When running this and trying the reproduction scenario, I can now see
>>> that the created DUP device extent is 827326464 bytes long, which is
>>> good.
>>>
>>> I wrote and tested this on top of linus 4.19-rc5. I still need to create
>>> a list of related use cases and test more things to at least walk
>>> through a bunch of obvious use cases to see if there's nothing exploding
>>> too quickly with these changes. However, I'm quite confident about it,
>>> so I'm sharing all of it already.
>>>
>>> Any feedback and review is appreciated. Be gentle and keep in mind that
>>> I'm still very much in a learning stage regarding kernel development.
>>
>> The patches look good, thanks. Problem is explained, preparatory work is
>> separated from the fix itself.
> 
> \o/
> 
>>> The stable patches handling workflow is not 100% clear to me yet. I
>>> guess I have to add a Fixes: in the DUP patch which points to the
>>> previous commit 92e222df7b.
>>
>> Almost nobody does it right, no worries. If you can identify a single
>> patch that introduces a bug then it's for Fixes:, otherwise a CC: stable
>> with version where it makes sense & applies is enough. I do that check
>> myself regardless of what's in the patch.
> 
> It's 92e222df7b and the thing I'm not sure about is if it also will
> catch the same patch which was already backported to LTS kernels since
> 92e222df7b also has Fixes in it... So by now the new bug is in 4.19,
> 4.14, 4.9, 4.4, 3.16...
> 
>> I ran the patches in a VM and hit a division-by-zero in test
>> fstests/btrfs/011, stacktrace below. First guess is that it's caused by
>> patch 3/6.
> 
> Ah interesting, dev replace.
> 
> I'll play around with replace and find out how to run the tests properly
> and then reproduce this.
> 
> The code introduced in patch 3 is removed again in patch 6, so I don't
> suspect that one. :)

Actually, while writing this I realize that this means it should be
tested separately (like, older kernel with only 3), because, well,
obvious I guess.

> But, I'll find out.
> 
> Thanks for testing.
> 
> Hans
> 
>> [ 3116.065595] BTRFS: device fsid e3bd8db5-304f-4b1a-8488-7791ea94088f devid 1 transid 5 /dev/vdb
>> [ 3116.071274] BTRFS: device fsid e3bd8db5-304f-4b1a-8488-7791ea94088f devid 2 transid 5 /dev/vdc
>> [ 3116.087086] BTRFS info (device vdb): disk space caching is enabled
>> [ 3116.088644] BTRFS info (device vdb): has skinny extents
>> [ 3116.089796] BTRFS info (device vdb): flagging fs with big metadata feature
>> [ 3116.093971] BTRFS info (device vdb): checking UUID tree
>> [ 3125.853755] BTRFS info (device vdb): dev_replace from /dev/vdb (devid 1) to /dev/vdd started
>> [ 3125.860269] divide error: 0000 [#1] PREEMPT SMP
>> [ 3125.861264] CPU: 1 PID: 6477 Comm: btrfs Not tainted 4.19.0-rc7-default+ #288
>> [ 3125.862841] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
>> [ 3125.865385] RIP: 0010:__btrfs_alloc_chunk+0x368/0xa70 [btrfs]
>> [ 3125.870541] RSP: 0018:ffffa4ea0409fa48 EFLAGS: 00010206
>> [ 3125.871862] RAX: 0000000004000000 RBX: ffff94e074374508 RCX: 0000000000000002
>> [ 3125.873587] RDX: 0000000000000000 RSI: ffff94e017818c80 RDI: 0000000002000000
>> [ 3125.874677] RBP: 0000000080800000 R08: 0000000000000000 R09: 0000000000000002
>> [ 3125.875816] R10: 0000000300000000 R11: 0000000080900000 R12: 0000000000000000
>> [ 3125.876742] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000002
>> [ 3125.877657] FS:  00007f6de34208c0(0000) GS:ffff94e07d600000(0000) knlGS:0000000000000000
>> [ 3125.878862] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 3125.880080] CR2: 00007ffe963d5ce8 CR3: 000000007659b000 CR4: 00000000000006e0
>> [ 3125.881485] Call Trace:
>> [ 3125.882105]  do_chunk_alloc+0x266/0x3e0 [btrfs]
>> [ 3125.882841]  btrfs_inc_block_group_ro+0x10e/0x160 [btrfs]
>> [ 3125.883875]  scrub_enumerate_chunks+0x18b/0x5d0 [btrfs]
>> [ 3125.884658]  ? is_module_address+0x11/0x30
>> [ 3125.885271]  ? wait_for_completion+0x160/0x190
>> [ 3125.885928]  btrfs_scrub_dev+0x1b8/0x5a0 [btrfs]
>> [ 3125.887767]  ? start_transaction+0xa1/0x470 [btrfs]
>> [ 3125.888648]  btrfs_dev_replace_start.cold.19+0x155/0x17e [btrfs]
>> [ 3125.889459]  btrfs_dev_replace_by_ioctl+0x35/0x60 [btrfs]
>> [ 3125.890251]  btrfs_ioctl+0x2a94/0x31d0 [btrfs]
>> [ 3125.890885]  ? do_sigaction+0x7c/0x210
>> [ 3125.891731]  ? do_vfs_ioctl+0xa2/0x6b0
>> [ 3125.892652]  do_vfs_ioctl+0xa2/0x6b0
>> [ 3125.893642]  ? do_sigaction+0x1a7/0x210
>> [ 3125.894665]  ksys_ioctl+0x3a/0x70
>> [ 3125.895523]  __x64_sys_ioctl+0x16/0x20
>> [ 3125.896339]  do_syscall_64+0x5a/0x1a0
>> [ 3125.896949]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> [ 3125.897638] RIP: 0033:0x7f6de28ecaa7
>> [ 3125.901313] RSP: 002b:00007ffe963da9c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>> [ 3125.902486] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f6de28ecaa7
>> [ 3125.903538] RDX: 00007ffe963dae00 RSI: 00000000ca289435 RDI: 0000000000000003
>> [ 3125.904878] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
>> [ 3125.905788] R10: 0000000000000008 R11: 0000000000000246 R12: 00007ffe963de26f
>> [ 3125.906700] R13: 0000000000000001 R14: 0000000000000004 R15: 000055fceeceb2a0
>> [ 3125.907954] Modules linked in: btrfs libcrc32c xor zstd_decompress zstd_compress xxhash raid6_pq loop
>> [ 3125.909342] ---[ end trace 5492bb467d3be2da ]---
>> [ 3125.910031] RIP: 0010:__btrfs_alloc_chunk+0x368/0xa70 [btrfs]
>> [ 3125.913600] RSP: 0018:ffffa4ea0409fa48 EFLAGS: 00010206
>> [ 3125.914595] RAX: 0000000004000000 RBX: ffff94e074374508 RCX: 0000000000000002
>> [ 3125.916209] RDX: 0000000000000000 RSI: ffff94e017818c80 RDI: 0000000002000000
>> [ 3125.917701] RBP: 0000000080800000 R08: 0000000000000000 R09: 0000000000000002
>> [ 3125.919209] R10: 0000000300000000 R11: 0000000080900000 R12: 0000000000000000
>> [ 3125.920782] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000002
>> [ 3125.922413] FS:  00007f6de34208c0(0000) GS:ffff94e07d600000(0000) knlGS:0000000000000000
>> [ 3125.924264] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 3125.925627] CR2: 00007ffe963d5ce8 CR3: 000000007659b000 CR4: 00000000000006e0
>>
> 
>

David Sterba Nov. 13, 2018, 3:03 p.m. UTC | #10

On Thu, Oct 11, 2018 at 07:40:22PM +0000, Hans van Kranenburg wrote:
> On 10/11/2018 05:13 PM, David Sterba wrote:
> > On Thu, Oct 04, 2018 at 11:24:37PM +0200, Hans van Kranenburg wrote:
> >> This patch set contains an additional fix for a newly exposed bug after
> >> the previous attempt to fix a chunk allocator bug for new DUP chunks:
> >>
> >> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c903@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374
> >>
> >> The DUP fix is "fix more DUP stripe size handling". I did that one
> >> before starting to change more things so it can be applied to earlier
> >> LTS kernels.
> >>
> >> Besides that patch, which is fixing the bug in a way that is least
> >> intrusive, I added a bunch of other patches to help getting the chunk
> >> allocator code in a state that is a bit less error-prone and
> >> bug-attracting.
> >>
> >> When running this and trying the reproduction scenario, I can now see
> >> that the created DUP device extent is 827326464 bytes long, which is
> >> good.
> >>
> >> I wrote and tested this on top of linus 4.19-rc5. I still need to create
> >> a list of related use cases and test more things to at least walk
> >> through a bunch of obvious use cases to see if there's nothing exploding
> >> too quickly with these changes. However, I'm quite confident about it,
> >> so I'm sharing all of it already.
> >>
> >> Any feedback and review is appreciated. Be gentle and keep in mind that
> >> I'm still very much in a learning stage regarding kernel development.
> > 
> > The patches look good, thanks. Problem is explained, preparatory work is
> > separated from the fix itself.
> 
> \o/
> 
> >> The stable patches handling workflow is not 100% clear to me yet. I
> >> guess I have to add a Fixes: in the DUP patch which points to the
> >> previous commit 92e222df7b.
> > 
> > Almost nobody does it right, no worries. If you can identify a single
> > patch that introduces a bug then it's for Fixes:, otherwise a CC: stable
> > with version where it makes sense & applies is enough. I do that check
> > myself regardless of what's in the patch.
> 
> It's 92e222df7b and the thing I'm not sure about is if it also will
> catch the same patch which was already backported to LTS kernels since
> 92e222df7b also has Fixes in it... So by now the new bug is in 4.19,
> 4.14, 4.9, 4.4, 3.16...
> 
> > I ran the patches in a VM and hit a division-by-zero in test
> > fstests/btrfs/011, stacktrace below. First guess is that it's caused by
> > patch 3/6.
> 
> Ah interesting, dev replace.
> 
> I'll play around with replace and find out how to run the tests properly
> and then reproduce this.
> 
> The code introduced in patch 3 is removed again in patch 6, so I don't
> suspect that one. :)
> 
> But, I'll find out.
> 
> Thanks for testing.

I've merged patches 1-5 to misc-next as they seem to be safe and pass
fstests, please let me know when you have updates to the last one.
Thanks.

Hans van Kranenburg Nov. 13, 2018, 4:45 p.m. UTC | #11

On 11/13/18 4:03 PM, David Sterba wrote:
> On Thu, Oct 11, 2018 at 07:40:22PM +0000, Hans van Kranenburg wrote:
>> On 10/11/2018 05:13 PM, David Sterba wrote:
>>> On Thu, Oct 04, 2018 at 11:24:37PM +0200, Hans van Kranenburg wrote:
>>>> This patch set contains an additional fix for a newly exposed bug after
>>>> the previous attempt to fix a chunk allocator bug for new DUP chunks:
>>>>
>>>> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c903@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374
>>>>
>>>> The DUP fix is "fix more DUP stripe size handling". I did that one
>>>> before starting to change more things so it can be applied to earlier
>>>> LTS kernels.
>>>>
>>>> Besides that patch, which is fixing the bug in a way that is least
>>>> intrusive, I added a bunch of other patches to help getting the chunk
>>>> allocator code in a state that is a bit less error-prone and
>>>> bug-attracting.
>>>>
>>>> When running this and trying the reproduction scenario, I can now see
>>>> that the created DUP device extent is 827326464 bytes long, which is
>>>> good.
>>>>
>>>> I wrote and tested this on top of linus 4.19-rc5. I still need to create
>>>> a list of related use cases and test more things to at least walk
>>>> through a bunch of obvious use cases to see if there's nothing exploding
>>>> too quickly with these changes. However, I'm quite confident about it,
>>>> so I'm sharing all of it already.
>>>>
>>>> Any feedback and review is appreciated. Be gentle and keep in mind that
>>>> I'm still very much in a learning stage regarding kernel development.
>>>
>>> The patches look good, thanks. Problem is explained, preparatory work is
>>> separated from the fix itself.
>>
>> \o/
>>
>>>> The stable patches handling workflow is not 100% clear to me yet. I
>>>> guess I have to add a Fixes: in the DUP patch which points to the
>>>> previous commit 92e222df7b.
>>>
>>> Almost nobody does it right, no worries. If you can identify a single
>>> patch that introduces a bug then it's for Fixes:, otherwise a CC: stable
>>> with version where it makes sense & applies is enough. I do that check
>>> myself regardless of what's in the patch.
>>
>> It's 92e222df7b and the thing I'm not sure about is if it also will
>> catch the same patch which was already backported to LTS kernels since
>> 92e222df7b also has Fixes in it... So by now the new bug is in 4.19,
>> 4.14, 4.9, 4.4, 3.16...
>>
>>> I ran the patches in a VM and hit a division-by-zero in test
>>> fstests/btrfs/011, stacktrace below. First guess is that it's caused by
>>> patch 3/6.
>>
>> Ah interesting, dev replace.
>>
>> I'll play around with replace and find out how to run the tests properly
>> and then reproduce this.
>>
>> The code introduced in patch 3 is removed again in patch 6, so I don't
>> suspect that one. :)
>>
>> But, I'll find out.
>>
>> Thanks for testing.
> 
> I've merged patches 1-5 to misc-next as they seem to be safe and pass
> fstests, please let me know when you have updates to the last one.
> Thanks.

I'll definitely follow up on that. It has not happened because something
about todo lists and ordering work and not doing too many things at the
same time.

Thanks for already confirming it was not patch 3 but 6 that has the
problem, I already suspected that.

For patch 3, the actual fix, how do we get that on top of old stable
kernels where the previous fix (92e222df7b) is in? Because of the Fixes:
line that one was picked again in 4.14, 4.9, 4.4 and even 3.16. How does
this work? Does it need all the other commit ids from those branches in
Fixes lines? Or is there magic somewhere that does this?

From your "development cycle open" mails, I understand that for testing
myself, I would test based on misc-next, for-next or for-x.y in that
order depending on where the first 5 are yet at that moment?

Hans

[0/6] Chunk allocator DUP fix and cleanups

Message

Comments