[RFC] btrfs-progs: mkfs: Enforce 4k sectorsize by default

Message ID	20230321020320.2555362-1-neal@gompa.dev (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-btrfs-owner@vger.kernel.org> From: Neal Gompa <neal@gompa.dev> To: linux-btrfs@vger.kernel.org Cc: David Sterba <dsterba@suse.cz>, Hector Martin <marcan@marcan.st>, Davide Cavalca <davide@cavalca.name>, Neal Gompa <neal@gompa.dev> Subject: [RFC PATCH] btrfs-progs: mkfs: Enforce 4k sectorsize by default Date: Mon, 20 Mar 2023 22:03:20 -0400 Message-Id: <20230321020320.2555362-1-neal@gompa.dev> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[RFC] btrfs-progs: mkfs: Enforce 4k sectorsize by default \| expand [RFC] btrfs-progs: mkfs: Enforce 4k sectorsize by default

Neal Gompa March 21, 2023, 2:03 a.m. UTC

We have had working subpage support in Btrfs for many cycles now.
Generally, we do not want people creating filesystems by default
with non-4k sectorsizes since it creates portability problems.

Signed-off-by: Neal Gompa <neal@gompa.dev>
---
 Documentation/Subpage.rst    |  9 +++++----
 Documentation/mkfs.btrfs.rst | 11 +++++++----
 mkfs/main.c                  |  2 +-
 3 files changed, 13 insertions(+), 9 deletions(-)

Anand Jain March 21, 2023, 3:03 a.m. UTC | #1

On 21/03/2023 10:03, Neal Gompa wrote:
> We have had working subpage support in Btrfs for many cycles now.
> Generally, we do not want people creating filesystems by default
> with non-4k sectorsizes since it creates portability problems.
> 

I agree.

> Signed-off-by: Neal Gompa <neal@gompa.dev>
> ---
>   Documentation/Subpage.rst    |  9 +++++----
>   Documentation/mkfs.btrfs.rst | 11 +++++++----
>   mkfs/main.c                  |  2 +-
>   3 files changed, 13 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/Subpage.rst b/Documentation/Subpage.rst
> index 21a495d5..d7e9b241 100644
> --- a/Documentation/Subpage.rst
> +++ b/Documentation/Subpage.rst
> @@ -9,10 +9,11 @@ to the exactly same size of the block and page. On x86_64 this is typically
>   pages, like 64KiB on 64bit ARM or PowerPC. This means filesystems created
>   with 64KiB sector size cannot be mounted on a system with 4KiB page size.
>   
> -While with subpage support, systems with 64KiB page size can create (still needs
> -"-s 4k" option for mkfs.btrfs) and mount filesystems with 4KiB sectorsize,
> -allowing us to push 4KiB sectorsize as default sectorsize for all platforms in the
> -near future.
> +Since v6.3, filesystems are created with a 4KiB sectorsize by default,
> +though it remains possible to create filesystems with other page sizes
> +(such as 64KiB with the "-s 64k" option for mkfs.btrfs). This ensures that
> +new filesystems are compatible across other architecture variants using
> +larger page sizes.
>   

This part is a little confusing.

We can also include kernel versions that started supporting subpages

v5.12   read support of blocksize<pagesize (subpage)
v5.15   write support of blocksize<pagesize (subpage)


>   Requirements, limitations
>   -------------------------
> diff --git a/Documentation/mkfs.btrfs.rst b/Documentation/mkfs.btrfs.rst
> index ba7227b3..af0b9c03 100644
> --- a/Documentation/mkfs.btrfs.rst
> +++ b/Documentation/mkfs.btrfs.rst
> @@ -116,10 +116,13 @@ OPTIONS
>   -s|--sectorsize <size>
>           Specify the sectorsize, the minimum data block allocation unit.
>   
> -        The default value is the page size and is autodetected. If the sectorsize
> -        differs from the page size, the created filesystem may not be mountable by the
> -        running kernel. Therefore it is not recommended to use this option unless you
> -        are going to mount it on a system with the appropriate page size.
> +        The default value is 4KiB (4096). If larger page sizes (such as 64KiB [16384])
> +        are used, the created filesystem will only mount on systems with a running kernel
> +        running on a matching page size. Therefore it is not recommended to use this option
> +        unless you are going to mount it on a system with the appropriate page size.
> +
> +        .. note::
> +                Versions up to 6.3 set the sectorsize matching to the page size.
>   


IMO this can be rewritten as below:

By default, the value is 4KiB, but it can be manually set to match the 
system page size. However, if the sector size is different from the page 
size, the resulting filesystem may not be mountable by the current 
kernel, apart from the default 4KiB. Hence, it's not advisable to use 
this option unless you intend to mount it on a system with the suitable 
page size.


Thanks, Anand


>   -L|--label <string>
>           Specify a label for the filesystem. The *string* should be less than 256
> diff --git a/mkfs/main.c b/mkfs/main.c
> index f5e34cbd..5e1834d7 100644
> --- a/mkfs/main.c
> +++ b/mkfs/main.c
> @@ -1207,7 +1207,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
>   	}
>   
>   	if (!sectorsize)
> -		sectorsize = (u32)sysconf(_SC_PAGESIZE);
> +		sectorsize = (u32)SZ_4K;
>   	if (btrfs_check_sectorsize(sectorsize))
>   		goto error;
>

Qu Wenruo March 21, 2023, 3:21 a.m. UTC | #2

On 2023/3/21 10:03, Neal Gompa wrote:
> We have had working subpage support in Btrfs for many cycles now.
> Generally, we do not want people creating filesystems by default
> with non-4k sectorsizes since it creates portability problems.

My biggest concern for now is, I haven't yet done any subpage testing 
for a while.

The bottle neck is the lack of computing power.

I only have one decent (RK3588 based SBC, Rock5B) board available for 
testing, but it's taken by my daily fstests runs, and it's still using 
4K page size for the aarc64 VM.

Although I have an backup SBC (the same Rock5B), it's reserved mostly 
for upstreaming and testing for the RK3588 SoC.

Personally speaking, this change would bring a lot of more testing 
feedback from Asahi guys, thus would be pretty awesome.

But on the other hand, I really don't want any big bug screwing up early 
adopters, and further damage the reputation of btrfs.


Would the Asahi guys gives us some early test results?
(AKA, full fstests runs with 16K page size and 4K sectorsize).

If nothing wrong happened, I am very happy to this patch.

Thanks,
Qu
> 
> Signed-off-by: Neal Gompa <neal@gompa.dev>
> ---
>   Documentation/Subpage.rst    |  9 +++++----
>   Documentation/mkfs.btrfs.rst | 11 +++++++----
>   mkfs/main.c                  |  2 +-
>   3 files changed, 13 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/Subpage.rst b/Documentation/Subpage.rst
> index 21a495d5..d7e9b241 100644
> --- a/Documentation/Subpage.rst
> +++ b/Documentation/Subpage.rst
> @@ -9,10 +9,11 @@ to the exactly same size of the block and page. On x86_64 this is typically
>   pages, like 64KiB on 64bit ARM or PowerPC. This means filesystems created
>   with 64KiB sector size cannot be mounted on a system with 4KiB page size.
>   
> -While with subpage support, systems with 64KiB page size can create (still needs
> -"-s 4k" option for mkfs.btrfs) and mount filesystems with 4KiB sectorsize,
> -allowing us to push 4KiB sectorsize as default sectorsize for all platforms in the
> -near future.
> +Since v6.3, filesystems are created with a 4KiB sectorsize by default,
> +though it remains possible to create filesystems with other page sizes
> +(such as 64KiB with the "-s 64k" option for mkfs.btrfs). This ensures that
> +new filesystems are compatible across other architecture variants using
> +larger page sizes.
>   
>   Requirements, limitations
>   -------------------------
> diff --git a/Documentation/mkfs.btrfs.rst b/Documentation/mkfs.btrfs.rst
> index ba7227b3..af0b9c03 100644
> --- a/Documentation/mkfs.btrfs.rst
> +++ b/Documentation/mkfs.btrfs.rst
> @@ -116,10 +116,13 @@ OPTIONS
>   -s|--sectorsize <size>
>           Specify the sectorsize, the minimum data block allocation unit.
>   
> -        The default value is the page size and is autodetected. If the sectorsize
> -        differs from the page size, the created filesystem may not be mountable by the
> -        running kernel. Therefore it is not recommended to use this option unless you
> -        are going to mount it on a system with the appropriate page size.
> +        The default value is 4KiB (4096). If larger page sizes (such as 64KiB [16384])
> +        are used, the created filesystem will only mount on systems with a running kernel
> +        running on a matching page size. Therefore it is not recommended to use this option
> +        unless you are going to mount it on a system with the appropriate page size.
> +
> +        .. note::
> +                Versions up to 6.3 set the sectorsize matching to the page size.
>   
>   -L|--label <string>
>           Specify a label for the filesystem. The *string* should be less than 256
> diff --git a/mkfs/main.c b/mkfs/main.c
> index f5e34cbd..5e1834d7 100644
> --- a/mkfs/main.c
> +++ b/mkfs/main.c
> @@ -1207,7 +1207,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
>   	}
>   
>   	if (!sectorsize)
> -		sectorsize = (u32)sysconf(_SC_PAGESIZE);
> +		sectorsize = (u32)SZ_4K;
>   	if (btrfs_check_sectorsize(sectorsize))
>   		goto error;
>

Hector Martin March 21, 2023, 10:07 a.m. UTC | #3

On 21/03/2023 12.21, Qu Wenruo wrote:
> 
> 
> On 2023/3/21 10:03, Neal Gompa wrote:
>> We have had working subpage support in Btrfs for many cycles now.
>> Generally, we do not want people creating filesystems by default
>> with non-4k sectorsizes since it creates portability problems.
> 
> My biggest concern for now is, I haven't yet done any subpage testing 
> for a while.
> 
> The bottle neck is the lack of computing power.
> 
> I only have one decent (RK3588 based SBC, Rock5B) board available for 
> testing, but it's taken by my daily fstests runs, and it's still using 
> 4K page size for the aarc64 VM.
> 
> Although I have an backup SBC (the same Rock5B), it's reserved mostly 
> for upstreaming and testing for the RK3588 SoC.
> 
> Personally speaking, this change would bring a lot of more testing 
> feedback from Asahi guys, thus would be pretty awesome.

Note that we already have a bunch of people running Fedora Asahi, and as
far as I know those images are created on 4K systems and then used on
16K systems, so this should be already getting real-world testing (and
will get a lot more once we get to official announcement/release)
regardless of the default.

IOW, this change is mostly about people creating secondary btrfs
filesystems on Asahi directly, which is relatively niche in comparison.
So if you have a subpage bug it's going to hit Asahi users whether this
change happens or not :)

> 
> But on the other hand, I really don't want any big bug screwing up early 
> adopters, and further damage the reputation of btrfs.
> 
> 
> Would the Asahi guys gives us some early test results?
> (AKA, full fstests runs with 16K page size and 4K sectorsize).

Gave it a shot. Tested with options:

export TEST_DEV=/dev/nvme0n1p6
export TEST_DIR=/mnt/test
#export SCRATCH_DEV=/dev/nvme0n1p7
export SCRATCH_MNT=/mnt/scratch
export SCRATCH_DEV_POOL="/dev/nvme0n1p7 /dev/nvme0n1p8 /dev/nvme0n1p9
/dev/nvme0n1p10 /dev/nvme0n1p11"
export MKFS_OPTIONS="-s 4096"
export FSTYP=btrfs

btrfs/012 is broken, the converted FS fails to mount with:

[  784.588172] BTRFS warning (device nvme0n1p7): v1 space cache is not
supported for page size 16384 with sectorsize 4096
[  784.588199] BTRFS error (device nvme0n1p7): open_ctree failed

btrfs/131 and 136 have the same issue.

btrfs/106 has a size mismatch in the output, but I get the feeling
that's just a bad test that assumes 4K somewhere?

btrfs/140 is the first one that looks bad. Looks like the corruption
correction didn't work for some reason.

... and then btrfs/142 which is similar actually managed to log a bunch
of errors on our NVMe controller. Nice.

[ 1240.000104] nvme-apple 393cc0000.nvme: RTKit: syslog message:
host_ans2.c:1564: cmd parsing error for tag 12 fast decode err 0x1
[ 1240.000767] nvme0n1: Read(0x2) @ LBA 322753843, 0 blocks, Invalid
Field in Command (sct 0x0 / sc 0x2)
[ 1240.000771] nvme-apple 393cc0000.nvme: RTKit: syslog message:
host_ans2.c:1469: tag 12, completed with err BAD_CMD-
[ 1240.000775] operation not supported error, dev nvme0n1, sector
2582030751 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2

Looks like it tried to read 0 blocks? I'm pretty sure that's not a valid
block device operation... and then that test hangs because the
_btrfs_direct_read_on_mirror() common function is completely broken, as
it infinite loops if the exec'd command fails (which it does here, with
ENOENT). Cc sven/axboe/asahi: I'm pretty sure we should catch
upper-layer badness like this before sending it to the controller.

Excluding that one and moving on, 143 is broken the same way, as are
btrfs/265, 266, 267, 269.

213 failed to balance with ENOSPC.

btrfs/246 has an output discrepancy, I don't know what's up with that.

generic/251 then failed too, with dmesg logs about failing to trim block
groups.

generic/520 failed with an EBUSY on umount, but I suspect this might be
some desktop/systemd stuff trying to use the mount?

generic/563 suggests there may be cgroup accounting issues

generic/619 seems known to be pathologically slow on arm64, so I killed
it (https://www.spinics.net/lists/linux-btrfs/msg131553.html).

generic/708 failed but that pointed to a known kernel bug that we don't
have the fix for yet (this is running on 6.2 + asahi patches).

Run output and some select dmesg sections:
https://gist.github.com/marcan/822a34266bcaf4f4cffaa6a198b4c616

Let me know if you need anything else.

> 
> If nothing wrong happened, I am very happy to this patch.
> 
> Thanks,
> Qu
>>
>> Signed-off-by: Neal Gompa <neal@gompa.dev>
>> ---
>>   Documentation/Subpage.rst    |  9 +++++----
>>   Documentation/mkfs.btrfs.rst | 11 +++++++----
>>   mkfs/main.c                  |  2 +-
>>   3 files changed, 13 insertions(+), 9 deletions(-)
>>
>> diff --git a/Documentation/Subpage.rst b/Documentation/Subpage.rst
>> index 21a495d5..d7e9b241 100644
>> --- a/Documentation/Subpage.rst
>> +++ b/Documentation/Subpage.rst
>> @@ -9,10 +9,11 @@ to the exactly same size of the block and page. On x86_64 this is typically
>>   pages, like 64KiB on 64bit ARM or PowerPC. This means filesystems created
>>   with 64KiB sector size cannot be mounted on a system with 4KiB page size.
>>   
>> -While with subpage support, systems with 64KiB page size can create (still needs
>> -"-s 4k" option for mkfs.btrfs) and mount filesystems with 4KiB sectorsize,
>> -allowing us to push 4KiB sectorsize as default sectorsize for all platforms in the
>> -near future.
>> +Since v6.3, filesystems are created with a 4KiB sectorsize by default,
>> +though it remains possible to create filesystems with other page sizes
>> +(such as 64KiB with the "-s 64k" option for mkfs.btrfs). This ensures that
>> +new filesystems are compatible across other architecture variants using
>> +larger page sizes.
>>   
>>   Requirements, limitations
>>   -------------------------
>> diff --git a/Documentation/mkfs.btrfs.rst b/Documentation/mkfs.btrfs.rst
>> index ba7227b3..af0b9c03 100644
>> --- a/Documentation/mkfs.btrfs.rst
>> +++ b/Documentation/mkfs.btrfs.rst
>> @@ -116,10 +116,13 @@ OPTIONS
>>   -s|--sectorsize <size>
>>           Specify the sectorsize, the minimum data block allocation unit.
>>   
>> -        The default value is the page size and is autodetected. If the sectorsize
>> -        differs from the page size, the created filesystem may not be mountable by the
>> -        running kernel. Therefore it is not recommended to use this option unless you
>> -        are going to mount it on a system with the appropriate page size.
>> +        The default value is 4KiB (4096). If larger page sizes (such as 64KiB [16384])
>> +        are used, the created filesystem will only mount on systems with a running kernel
>> +        running on a matching page size. Therefore it is not recommended to use this option
>> +        unless you are going to mount it on a system with the appropriate page size.
>> +
>> +        .. note::
>> +                Versions up to 6.3 set the sectorsize matching to the page size.
>>   
>>   -L|--label <string>
>>           Specify a label for the filesystem. The *string* should be less than 256
>> diff --git a/mkfs/main.c b/mkfs/main.c
>> index f5e34cbd..5e1834d7 100644
>> --- a/mkfs/main.c
>> +++ b/mkfs/main.c
>> @@ -1207,7 +1207,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
>>   	}
>>   
>>   	if (!sectorsize)
>> -		sectorsize = (u32)sysconf(_SC_PAGESIZE);
>> +		sectorsize = (u32)SZ_4K;
>>   	if (btrfs_check_sectorsize(sectorsize))
>>   		goto error;
>>   
> 


- Hector

Qu Wenruo March 21, 2023, 11:02 a.m. UTC | #4

On 2023/3/21 18:07, Hector Martin wrote:
> On 21/03/2023 12.21, Qu Wenruo wrote:
[...]
> 
> Note that we already have a bunch of people running Fedora Asahi, and as
> far as I know those images are created on 4K systems and then used on
> 16K systems, so this should be already getting real-world testing (and
> will get a lot more once we get to official announcement/release)
> regardless of the default.
> 
> IOW, this change is mostly about people creating secondary btrfs
> filesystems on Asahi directly, which is relatively niche in comparison.
> So if you have a subpage bug it's going to hit Asahi users whether this
> change happens or not :)

Awesome, nothing happened would already be the best thing.

The worst scenario, in which tons of users reporting subpage crash, is 
already avoided.

> 
>>
>> But on the other hand, I really don't want any big bug screwing up early
>> adopters, and further damage the reputation of btrfs.
>>
>>
>> Would the Asahi guys gives us some early test results?
>> (AKA, full fstests runs with 16K page size and 4K sectorsize).
> 
> Gave it a shot. Tested with options:
> 
> export TEST_DEV=/dev/nvme0n1p6
> export TEST_DIR=/mnt/test
> #export SCRATCH_DEV=/dev/nvme0n1p7
> export SCRATCH_MNT=/mnt/scratch
> export SCRATCH_DEV_POOL="/dev/nvme0n1p7 /dev/nvme0n1p8 /dev/nvme0n1p9
> /dev/nvme0n1p10 /dev/nvme0n1p11"
> export MKFS_OPTIONS="-s 4096"
> export FSTYP=btrfs
> 
> btrfs/012 is broken, the converted FS fails to mount with:
> 
> [  784.588172] BTRFS warning (device nvme0n1p7): v1 space cache is not
> supported for page size 16384 with sectorsize 4096
> [  784.588199] BTRFS error (device nvme0n1p7): open_ctree failed
> 
> btrfs/131 and 136 have the same issue.

This is v1 cache problems, IIRC newer progs from ArchlinuxARM upstream 
should already default to v2 cache, it's the test case itself requesting 
v1 cache, thus causing problems.
(v1 cache is explicitly rejected because I'm a little too lazy to fix 
all the hard coded PAGE_SIZE usage there).

I can update those test cases to avoid if v1 cache is not supported.

> 
> btrfs/106 has a size mismatch in the output, but I get the feeling
> that's just a bad test that assumes 4K somewhere?

Yep, that looks like the case.

Although I'll need dig deeper to find a way to fix the test case.

> 
> btrfs/140 is the first one that looks bad. Looks like the corruption
> correction didn't work for some reason.

This one seems interesting, would definitely bring some machine time for 
16K page sized testsing.

> 
> ... and then btrfs/142 which is similar actually managed to log a bunch
> of errors on our NVMe controller. Nice.
> 
> [ 1240.000104] nvme-apple 393cc0000.nvme: RTKit: syslog message:
> host_ans2.c:1564: cmd parsing error for tag 12 fast decode err 0x1
> [ 1240.000767] nvme0n1: Read(0x2) @ LBA 322753843, 0 blocks, Invalid
> Field in Command (sct 0x0 / sc 0x2)
> [ 1240.000771] nvme-apple 393cc0000.nvme: RTKit: syslog message:
> host_ans2.c:1469: tag 12, completed with err BAD_CMD-
> [ 1240.000775] operation not supported error, dev nvme0n1, sector
> 2582030751 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
> 
> Looks like it tried to read 0 blocks? I'm pretty sure that's not a valid
> block device operation... and then that test hangs because the
> _btrfs_direct_read_on_mirror() common function is completely broken, as
> it infinite loops if the exec'd command fails (which it does here, with
> ENOENT). Cc sven/axboe/asahi: I'm pretty sure we should catch
> upper-layer badness like this before sending it to the controller.

And also very weird, definitely worthy halting my daily runs.
(To be honest, recent misc-next is a little too boring, which means 
David is doing a pretty good job).

I also believe we should catch such zero lengthed bios inside btrfs 
before we had a block layer sanity checks.
(I know btrfs RAID56 code uses zero length bio as a way for completion 
barrier, but it should not reach lower layer, thus this must be a bug)

> 
> Excluding that one and moving on, 143 is broken the same way, as are
> btrfs/265, 266, 267, 269.
> 
> 213 failed to balance with ENOSPC.
> 
> btrfs/246 has an output discrepancy, I don't know what's up with that.
> 
> generic/251 then failed too, with dmesg logs about failing to trim block
> groups.
> 
> generic/520 failed with an EBUSY on umount, but I suspect this might be
> some desktop/systemd stuff trying to use the mount?
> 
> generic/563 suggests there may be cgroup accounting issues
> 
> generic/619 seems known to be pathologically slow on arm64, so I killed
> it (https://www.spinics.net/lists/linux-btrfs/msg131553.html).
> 
> generic/708 failed but that pointed to a known kernel bug that we don't
> have the fix for yet (this is running on 6.2 + asahi patches).
> 
> Run output and some select dmesg sections:
> https://gist.github.com/marcan/822a34266bcaf4f4cffaa6a198b4c616
> 
> Let me know if you need anything else.

Thank you very much for the test results and all your awesome Apple 
enablement work!
Qu

> 
>>
>> If nothing wrong happened, I am very happy to this patch.
>>
>> Thanks,
>> Qu
>>>
>>> Signed-off-by: Neal Gompa <neal@gompa.dev>
>>> ---
>>>    Documentation/Subpage.rst    |  9 +++++----
>>>    Documentation/mkfs.btrfs.rst | 11 +++++++----
>>>    mkfs/main.c                  |  2 +-
>>>    3 files changed, 13 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/Documentation/Subpage.rst b/Documentation/Subpage.rst
>>> index 21a495d5..d7e9b241 100644
>>> --- a/Documentation/Subpage.rst
>>> +++ b/Documentation/Subpage.rst
>>> @@ -9,10 +9,11 @@ to the exactly same size of the block and page. On x86_64 this is typically
>>>    pages, like 64KiB on 64bit ARM or PowerPC. This means filesystems created
>>>    with 64KiB sector size cannot be mounted on a system with 4KiB page size.
>>>    
>>> -While with subpage support, systems with 64KiB page size can create (still needs
>>> -"-s 4k" option for mkfs.btrfs) and mount filesystems with 4KiB sectorsize,
>>> -allowing us to push 4KiB sectorsize as default sectorsize for all platforms in the
>>> -near future.
>>> +Since v6.3, filesystems are created with a 4KiB sectorsize by default,
>>> +though it remains possible to create filesystems with other page sizes
>>> +(such as 64KiB with the "-s 64k" option for mkfs.btrfs). This ensures that
>>> +new filesystems are compatible across other architecture variants using
>>> +larger page sizes.
>>>    
>>>    Requirements, limitations
>>>    -------------------------
>>> diff --git a/Documentation/mkfs.btrfs.rst b/Documentation/mkfs.btrfs.rst
>>> index ba7227b3..af0b9c03 100644
>>> --- a/Documentation/mkfs.btrfs.rst
>>> +++ b/Documentation/mkfs.btrfs.rst
>>> @@ -116,10 +116,13 @@ OPTIONS
>>>    -s|--sectorsize <size>
>>>            Specify the sectorsize, the minimum data block allocation unit.
>>>    
>>> -        The default value is the page size and is autodetected. If the sectorsize
>>> -        differs from the page size, the created filesystem may not be mountable by the
>>> -        running kernel. Therefore it is not recommended to use this option unless you
>>> -        are going to mount it on a system with the appropriate page size.
>>> +        The default value is 4KiB (4096). If larger page sizes (such as 64KiB [16384])
>>> +        are used, the created filesystem will only mount on systems with a running kernel
>>> +        running on a matching page size. Therefore it is not recommended to use this option
>>> +        unless you are going to mount it on a system with the appropriate page size.
>>> +
>>> +        .. note::
>>> +                Versions up to 6.3 set the sectorsize matching to the page size.
>>>    
>>>    -L|--label <string>
>>>            Specify a label for the filesystem. The *string* should be less than 256
>>> diff --git a/mkfs/main.c b/mkfs/main.c
>>> index f5e34cbd..5e1834d7 100644
>>> --- a/mkfs/main.c
>>> +++ b/mkfs/main.c
>>> @@ -1207,7 +1207,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
>>>    	}
>>>    
>>>    	if (!sectorsize)
>>> -		sectorsize = (u32)sysconf(_SC_PAGESIZE);
>>> +		sectorsize = (u32)SZ_4K;
>>>    	if (btrfs_check_sectorsize(sectorsize))
>>>    		goto error;
>>>    
>>
> 
> 
> - Hector

Neal Gompa March 21, 2023, 11:25 a.m. UTC | #5

On Mon, Mar 20, 2023 at 11:03 PM Anand Jain <anand.jain@oracle.com> wrote:
>
> On 21/03/2023 10:03, Neal Gompa wrote:
> > We have had working subpage support in Btrfs for many cycles now.
> > Generally, we do not want people creating filesystems by default
> > with non-4k sectorsizes since it creates portability problems.
> >
>
> I agree.
>
> > Signed-off-by: Neal Gompa <neal@gompa.dev>
> > ---
> >   Documentation/Subpage.rst    |  9 +++++----
> >   Documentation/mkfs.btrfs.rst | 11 +++++++----
> >   mkfs/main.c                  |  2 +-
> >   3 files changed, 13 insertions(+), 9 deletions(-)
> >
> > diff --git a/Documentation/Subpage.rst b/Documentation/Subpage.rst
> > index 21a495d5..d7e9b241 100644
> > --- a/Documentation/Subpage.rst
> > +++ b/Documentation/Subpage.rst
> > @@ -9,10 +9,11 @@ to the exactly same size of the block and page. On x86_64 this is typically
> >   pages, like 64KiB on 64bit ARM or PowerPC. This means filesystems created
> >   with 64KiB sector size cannot be mounted on a system with 4KiB page size.
> >
> > -While with subpage support, systems with 64KiB page size can create (still needs
> > -"-s 4k" option for mkfs.btrfs) and mount filesystems with 4KiB sectorsize,
> > -allowing us to push 4KiB sectorsize as default sectorsize for all platforms in the
> > -near future.
> > +Since v6.3, filesystems are created with a 4KiB sectorsize by default,
> > +though it remains possible to create filesystems with other page sizes
> > +(such as 64KiB with the "-s 64k" option for mkfs.btrfs). This ensures that
> > +new filesystems are compatible across other architecture variants using
> > +larger page sizes.
> >
>
> This part is a little confusing.
>
> We can also include kernel versions that started supporting subpages
>
> v5.12   read support of blocksize<pagesize (subpage)
> v5.15   write support of blocksize<pagesize (subpage)
>

This is covered further down in the document. Though it does remind me
that I should rephrase some of it so that it doesn't say a default
mode is experimental. :)

>
> >   Requirements, limitations
> >   -------------------------
> > diff --git a/Documentation/mkfs.btrfs.rst b/Documentation/mkfs.btrfs.rst
> > index ba7227b3..af0b9c03 100644
> > --- a/Documentation/mkfs.btrfs.rst
> > +++ b/Documentation/mkfs.btrfs.rst
> > @@ -116,10 +116,13 @@ OPTIONS
> >   -s|--sectorsize <size>
> >           Specify the sectorsize, the minimum data block allocation unit.
> >
> > -        The default value is the page size and is autodetected. If the sectorsize
> > -        differs from the page size, the created filesystem may not be mountable by the
> > -        running kernel. Therefore it is not recommended to use this option unless you
> > -        are going to mount it on a system with the appropriate page size.
> > +        The default value is 4KiB (4096). If larger page sizes (such as 64KiB [16384])
> > +        are used, the created filesystem will only mount on systems with a running kernel
> > +        running on a matching page size. Therefore it is not recommended to use this option
> > +        unless you are going to mount it on a system with the appropriate page size.
> > +
> > +        .. note::
> > +                Versions up to 6.3 set the sectorsize matching to the page size.
> >
>
>
> IMO this can be rewritten as below:
>
> By default, the value is 4KiB, but it can be manually set to match the
> system page size. However, if the sector size is different from the page
> size, the resulting filesystem may not be mountable by the current
> kernel, apart from the default 4KiB. Hence, it's not advisable to use
> this option unless you intend to mount it on a system with the suitable
> page size.
>
>
> Thanks, Anand
>
>
> >   -L|--label <string>
> >           Specify a label for the filesystem. The *string* should be less than 256
> > diff --git a/mkfs/main.c b/mkfs/main.c
> > index f5e34cbd..5e1834d7 100644
> > --- a/mkfs/main.c
> > +++ b/mkfs/main.c
> > @@ -1207,7 +1207,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
> >       }
> >
> >       if (!sectorsize)
> > -             sectorsize = (u32)sysconf(_SC_PAGESIZE);
> > +             sectorsize = (u32)SZ_4K;
> >       if (btrfs_check_sectorsize(sectorsize))
> >               goto error;
> >
>

Qu Wenruo March 22, 2023, 5:19 a.m. UTC | #6

On 2023/3/21 18:07, Hector Martin wrote:
[...]
> 
> btrfs/140 is the first one that looks bad. Looks like the corruption
> correction didn't work for some reason.
> 
> ... and then btrfs/142 which is similar actually managed to log a bunch
> of errors on our NVMe controller. Nice.
> 
> [ 1240.000104] nvme-apple 393cc0000.nvme: RTKit: syslog message:
> host_ans2.c:1564: cmd parsing error for tag 12 fast decode err 0x1
> [ 1240.000767] nvme0n1: Read(0x2) @ LBA 322753843, 0 blocks, Invalid
> Field in Command (sct 0x0 / sc 0x2)
> [ 1240.000771] nvme-apple 393cc0000.nvme: RTKit: syslog message:
> host_ans2.c:1469: tag 12, completed with err BAD_CMD-
> [ 1240.000775] operation not supported error, dev nvme0n1, sector
> 2582030751 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2

Mind to share the reproducibility?

I tried it with my aarch64 VMs with the following setup:

- Page size 16K
   The host SoC supports all 4K/16K/64K page size

- Virtio-blk
   Backed by a NVME drive, with unsafe cache mode (aka, ignoring all file
   sync)

- Virtual memory 8G

- 4 vCPU cores
   All pinned to A76 cores, as qemu failed to boot if it's initialized
   across A76 and A55 cores.

   Paired with the virtio-blk with unsafe cache, the VM should have a
   very fast storage, while lame CPU perf.

- With extra ASSERT()s in btrfs_submit_bio()
   Making sure all bios submitted through btrfs_submit_bio() has a non-
   zero bi_size.

Unfortunately unable to reproduce the problem.


If you can reproduce, mind to provide a call trace?

> 
> Looks like it tried to read 0 blocks? I'm pretty sure that's not a valid
> block device operation... and then that test hangs because the
> _btrfs_direct_read_on_mirror() common function is completely broken, as
> it infinite loops if the exec'd command fails (which it does here, with
> ENOENT). Cc sven/axboe/asahi: I'm pretty sure we should catch
> upper-layer badness like this before sending it to the controller.
> 
> Excluding that one and moving on, 143 is broken the same way, as are
> btrfs/265, 266, 267, 269.

Also tried those ones, all passed here...
> 
> 213 failed to balance with ENOSPC.

Passed with 5x10G LVs as scratch dev pool.

I hope it's not Apple Silicon too fast to trigger it, or it would be 
pretty hard to reproduce...

> 
> btrfs/246 has an output discrepancy, I don't know what's up with that.

This is a limitation on the compressed write support, would update the 
test case to skip it.

> 
> generic/251 then failed too, with dmesg logs about failing to trim block
> groups.

It takes much longer time but still passed here.

I'll try my best to got some runs on real Apple machines meanwhile, but 
don't expect any improvement on my hardware situation any time soon...

Thanks,
Qu

> 
> generic/520 failed with an EBUSY on umount, but I suspect this might be
> some desktop/systemd stuff trying to use the mount?
> 
> generic/563 suggests there may be cgroup accounting issues
> 
> generic/619 seems known to be pathologically slow on arm64, so I killed
> it (https://www.spinics.net/lists/linux-btrfs/msg131553.html).
> 
> generic/708 failed but that pointed to a known kernel bug that we don't
> have the fix for yet (this is running on 6.2 + asahi patches).
> 
> Run output and some select dmesg sections:
> https://gist.github.com/marcan/822a34266bcaf4f4cffaa6a198b4c616
> 
> Let me know if you need anything else.
>

Su Yue March 22, 2023, 2:07 p.m. UTC | #7

On 2023/3/21 18:07, Hector Martin wrote:
> On 21/03/2023 12.21, Qu Wenruo wrote:
>>
>>
>> On 2023/3/21 10:03, Neal Gompa wrote:
>>> We have had working subpage support in Btrfs for many cycles now.
>>> Generally, we do not want people creating filesystems by default
>>> with non-4k sectorsizes since it creates portability problems.
>>
>> My biggest concern for now is, I haven't yet done any subpage testing
>> for a while.
>>
>> The bottle neck is the lack of computing power.
>>
>> I only have one decent (RK3588 based SBC, Rock5B) board available for
>> testing, but it's taken by my daily fstests runs, and it's still using
>> 4K page size for the aarc64 VM.
>>
>> Although I have an backup SBC (the same Rock5B), it's reserved mostly
>> for upstreaming and testing for the RK3588 SoC.
>>
>> Personally speaking, this change would bring a lot of more testing
>> feedback from Asahi guys, thus would be pretty awesome.
> 
> Note that we already have a bunch of people running Fedora Asahi, and as
> far as I know those images are created on 4K systems and then used on
> 16K systems, so this should be already getting real-world testing (and
> will get a lot more once we get to official announcement/release)
> regardless of the default.
> 
> IOW, this change is mostly about people creating secondary btrfs
> filesystems on Asahi directly, which is relatively niche in comparison.
> So if you have a subpage bug it's going to hit Asahi users whether this
> change happens or not :)
> 
>>
>> But on the other hand, I really don't want any big bug screwing up early
>> adopters, and further damage the reputation of btrfs.
>>
>>
>> Would the Asahi guys gives us some early test results?
>> (AKA, full fstests runs with 16K page size and 4K sectorsize).
> 
> Gave it a shot. Tested with options:
> 
> export TEST_DEV=/dev/nvme0n1p6
> export TEST_DIR=/mnt/test
> #export SCRATCH_DEV=/dev/nvme0n1p7
> export SCRATCH_MNT=/mnt/scratch
> export SCRATCH_DEV_POOL="/dev/nvme0n1p7 /dev/nvme0n1p8 /dev/nvme0n1p9
> /dev/nvme0n1p10 /dev/nvme0n1p11"
> export MKFS_OPTIONS="-s 4096"
> export FSTYP=btrfs
> 
> btrfs/012 is broken, the converted FS fails to mount with:
> 
> [  784.588172] BTRFS warning (device nvme0n1p7): v1 space cache is not
> supported for page size 16384 with sectorsize 4096
> [  784.588199] BTRFS error (device nvme0n1p7): open_ctree failed
> 
> btrfs/131 and 136 have the same issue.
> 
> btrfs/106 has a size mismatch in the output, but I get the feeling
> that's just a bad test that assumes 4K somewhere?
> 
> btrfs/140 is the first one that looks bad. Looks like the corruption
> correction didn't work for some reason.
> 
> ... and then btrfs/142 which is similar actually managed to log a bunch
> of errors on our NVMe controller. Nice.
> 
> [ 1240.000104] nvme-apple 393cc0000.nvme: RTKit: syslog message:
> host_ans2.c:1564: cmd parsing error for tag 12 fast decode err 0x1
> [ 1240.000767] nvme0n1: Read(0x2) @ LBA 322753843, 0 blocks, Invalid
> Field in Command (sct 0x0 / sc 0x2)
> [ 1240.000771] nvme-apple 393cc0000.nvme: RTKit: syslog message:
> host_ans2.c:1469: tag 12, completed with err BAD_CMD-
> [ 1240.000775] operation not supported error, dev nvme0n1, sector
> 2582030751 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
> 
> Looks like it tried to read 0 blocks? I'm pretty sure that's not a valid
> block device operation... and then that test hangs because the
> _btrfs_direct_read_on_mirror() common function is completely broken, as
> it infinite loops if the exec'd command fails (which it does here, with
> ENOENT). Cc sven/axboe/asahi: I'm pretty sure we should catch
> upper-layer badness like this before sending it to the controller.
> 
> Excluding that one and moving on, 143 is broken the same way, as are
> btrfs/265, 266, 267, 269.
> 
> 213 failed to balance with ENOSPC.
> 
> btrfs/246 has an output discrepancy, I don't know what's up with that.
> 
> generic/251 then failed too, with dmesg logs about failing to trim block
> groups.
> 
> generic/520 failed with an EBUSY on umount, but I suspect this might be
> some desktop/systemd stuff trying to use the mount?
> 
> generic/563 suggests there may be cgroup accounting issues
> 
> generic/619 seems known to be pathologically slow on arm64, so I killed
> it (https://www.spinics.net/lists/linux-btrfs/msg131553.html).
> 
> generic/708 failed but that pointed to a known kernel bug that we don't
> have the fix for yet (this is running on 6.2 + asahi patches).
> 

As Qu asked for test results on M1 mac mini, I ran above tests on 
misc-next in my qemu box but no error found. Those tests failed on bare 
metal
machines.


qemu configuration:

qemu-system-aarch64 \
     -M virt,highmem=no \
     -accel hvf -cpu host -name alarm \
     -smp cpus=8,sockets=1,cores=8,threads=1 -m 10096M \
     -bios $uefi \
     -device virtio-gpu -device qemu-xhci,id=xhci -device usb-kbd \
     -device usb-tablet -device virtio-rng-pci \
     -nic user,model=virtio,hostfwd=tcp::10022-:22 \
     -rtc base=localtime,clock=host \
     -chardev socket,id=mon0,host=localhost,port=63855,server=on,wait=off \
     -mon chardev=mon0,mode=control,pretty=on \
     -drive file=${alarm},if=virtio,id=boot,cache=writethrough \
     -vga none -nographic \
     -drive file=/dev/disk0s3,if=virtio,cache=unsafe,format=raw \
     -netdev vmnet-bridged,id=net0,ifname=en1

--
Su
> Run output and some select dmesg sections:
> https://gist.github.com/marcan/822a34266bcaf4f4cffaa6a198b4c616
> 
> Let me know if you need anything else.
> 
>>
>> If nothing wrong happened, I am very happy to this patch.
>>
>> Thanks,
>> Qu
>>>
>>> Signed-off-by: Neal Gompa <neal@gompa.dev>
>>> ---
>>>    Documentation/Subpage.rst    |  9 +++++----
>>>    Documentation/mkfs.btrfs.rst | 11 +++++++----
>>>    mkfs/main.c                  |  2 +-
>>>    3 files changed, 13 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/Documentation/Subpage.rst b/Documentation/Subpage.rst
>>> index 21a495d5..d7e9b241 100644
>>> --- a/Documentation/Subpage.rst
>>> +++ b/Documentation/Subpage.rst
>>> @@ -9,10 +9,11 @@ to the exactly same size of the block and page. On x86_64 this is typically
>>>    pages, like 64KiB on 64bit ARM or PowerPC. This means filesystems created
>>>    with 64KiB sector size cannot be mounted on a system with 4KiB page size.
>>>    
>>> -While with subpage support, systems with 64KiB page size can create (still needs
>>> -"-s 4k" option for mkfs.btrfs) and mount filesystems with 4KiB sectorsize,
>>> -allowing us to push 4KiB sectorsize as default sectorsize for all platforms in the
>>> -near future.
>>> +Since v6.3, filesystems are created with a 4KiB sectorsize by default,
>>> +though it remains possible to create filesystems with other page sizes
>>> +(such as 64KiB with the "-s 64k" option for mkfs.btrfs). This ensures that
>>> +new filesystems are compatible across other architecture variants using
>>> +larger page sizes.
>>>    
>>>    Requirements, limitations
>>>    -------------------------
>>> diff --git a/Documentation/mkfs.btrfs.rst b/Documentation/mkfs.btrfs.rst
>>> index ba7227b3..af0b9c03 100644
>>> --- a/Documentation/mkfs.btrfs.rst
>>> +++ b/Documentation/mkfs.btrfs.rst
>>> @@ -116,10 +116,13 @@ OPTIONS
>>>    -s|--sectorsize <size>
>>>            Specify the sectorsize, the minimum data block allocation unit.
>>>    
>>> -        The default value is the page size and is autodetected. If the sectorsize
>>> -        differs from the page size, the created filesystem may not be mountable by the
>>> -        running kernel. Therefore it is not recommended to use this option unless you
>>> -        are going to mount it on a system with the appropriate page size.
>>> +        The default value is 4KiB (4096). If larger page sizes (such as 64KiB [16384])
>>> +        are used, the created filesystem will only mount on systems with a running kernel
>>> +        running on a matching page size. Therefore it is not recommended to use this option
>>> +        unless you are going to mount it on a system with the appropriate page size.
>>> +
>>> +        .. note::
>>> +                Versions up to 6.3 set the sectorsize matching to the page size.
>>>    
>>>    -L|--label <string>
>>>            Specify a label for the filesystem. The *string* should be less than 256
>>> diff --git a/mkfs/main.c b/mkfs/main.c
>>> index f5e34cbd..5e1834d7 100644
>>> --- a/mkfs/main.c
>>> +++ b/mkfs/main.c
>>> @@ -1207,7 +1207,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
>>>    	}
>>>    
>>>    	if (!sectorsize)
>>> -		sectorsize = (u32)sysconf(_SC_PAGESIZE);
>>> +		sectorsize = (u32)SZ_4K;
>>>    	if (btrfs_check_sectorsize(sectorsize))
>>>    		goto error;
>>>    
>>
> 
> 
> - Hector

[RFC] btrfs-progs: mkfs: Enforce 4k sectorsize by default

Commit Message

Comments

Patch