btrfs: Fix no space bug caused by removing bg

Message ID 15fc8f8d002e4ffcdb46e769736f240ae7ace20b.1442839332.git.zhaolei@cn.fujitsu.com (mailing list archive)
State New, archived

Commit Message

Zhaolei Sept. 21, 2015, 12:59 p.m. UTC
btrfs in v4.3-rc1 fails many xfstests cases when run with the '-o nospace_cache'
mount option.

Failed cases are:
 btrfs/008,016,019,020,026,027,028,029,031,041,046,048,050,051,053,054,
 077,083,084,087,092,094
 generic/004,010,014,023,024,074,075,080,086,087,089,091,092,100,112,123,
 124,125,126,127,131,133,192,193,198,207,208,209,213,214,215,228,239,240,
 246,247,248,255,263,285,306,313,316,323

We can reproduce this bug with the following simple commands:
 TEST_DEV=/dev/vdh
 TEST_DIR=/mnt/tmp

 umount "$TEST_DEV" >/dev/null
 mkfs.btrfs -f "$TEST_DEV"
 mount "$TEST_DEV" "$TEST_DIR"

 umount "$TEST_DEV"
 mount "$TEST_DEV" "$TEST_DIR"

 cp /bin/bash $TEST_DIR

Result is:
 (omit previous commands)
 # cp /bin/bash $TEST_DIR
 cp: writing `/mnt/tmp/bash': No space left on device

Bisection shows it is triggered by the patch titled:
 commit e44163e17796
 ("btrfs: explictly delete unused block groups in close_ctree and ro-remount")

But the buggy code is not in that patch: btrfs deletes all chunks
when the filesystem holds no data, and the patch above merely makes
the problem easy to hit.

Detailed reason:
 1: mkfs a blank filesystem, or delete everything in the filesystem
 2: umount the fs
    (the current code deletes all data chunks)
 3: mount the fs
    Because there are no data chunks, the data space_info has no
    chance to be initialized, which means: space_info->total_bytes == 0
    and space_info->full == 1.
 4: do some write
    The current code skips chunk allocation because space_info->full
    is set, and returns -ENOSPC (see the sketch after this list).
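
A minimal userspace sketch of that failure path (illustrative only; the
struct and function names merely echo the btrfs ones, this is not the
kernel code):

  /* Model of steps 3 and 4 above. */
  #include <errno.h>
  #include <stdio.h>

  struct space_info {
          unsigned long long total_bytes;
          unsigned long long bytes_used;
          int full;       /* "no more chunks can be allocated" */
  };

  /* Step 3: no data chunks exist, so the data space_info is created
   * empty and, wrongly, already marked full. */
  static struct space_info data_sinfo = { 0, 0, 1 };

  /* Step 4: rough shape of the data free-space check. */
  static int check_data_free_space(unsigned long long bytes)
  {
          if (data_sinfo.bytes_used + bytes > data_sinfo.total_bytes) {
                  if (data_sinfo.full)
                          return -ENOSPC; /* what the user sees */
                  /* otherwise a new data chunk would be allocated here */
          }
          return 0;
  }

  int main(void)
  {
          printf("write -> %d\n", check_data_free_space(4096)); /* -28 */
          return 0;
  }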

Fix:
 Don't auto-delete the last block group of a raid type.
 Deleting every block group of a raid type not only causes the bug
 above, but may also silently convert the filesystem to all-single
 profiles in some cases.

Test:
 Tested with the script above, and confirmed the logic via debug output.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
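
For reference, the one-line change (quoted in full in the comments below)
skips removal when block_group->list.next == block_group->list.prev. On a
circular doubly-linked list like the kernel's struct list_head, both
pointers of an entry point at the list head exactly when that entry is
the sole one on the list, so the test identifies the last block group on
its per-raid-type list. A minimal userspace sketch (illustrative; only
list_add mirrors the kernel helper):

  #include <stdio.h>

  struct list_head {
          struct list_head *next, *prev;
  };

  /* Same insertion rule as the kernel's list_add(). */
  static void list_add(struct list_head *entry, struct list_head *head)
  {
          entry->next = head->next;
          entry->prev = head;
          head->next->prev = entry;
          head->next = entry;
  }

  int main(void)
  {
          struct list_head head = { &head, &head };
          struct list_head bg1, bg2;

          list_add(&bg1, &head);
          /* Sole entry: next and prev both point at the list head. */
          printf("one bg:  %d\n", bg1.next == bg1.prev);  /* prints 1 */

          list_add(&bg2, &head);
          /* A second block group makes the pointers differ. */
          printf("two bgs: %d\n", bg1.next == bg1.prev);  /* prints 0 */
          return 0;
  }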

Comments

Filipe Manana Sept. 21, 2015, 1:27 p.m. UTC | #1
On Mon, Sep 21, 2015 at 1:59 PM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
> btrfs in v4.3-rc1 fails many xfstests cases when run with the '-o nospace_cache'
> mount option.
>
> Failed cases are:
>  btrfs/008,016,019,020,026,027,028,029,031,041,046,048,050,051,053,054,
>  077,083,084,087,092,094

Hi Zhao,

So far I tried a few of those against Chris' integration-4.3 branch
(same btrfs code as 4.3-rc1):

MOUNT_OPTIONS="-o nospace_cache" ./check btrfs/008 btrfs/016 btrfs/019 btrfs/020
FSTYP         -- btrfs
PLATFORM      -- Linux/x86_64 debian3 4.2.0-rc5-btrfs-next-12+
MKFS_OPTIONS  -- /dev/sdc
MOUNT_OPTIONS -- -o nospace_cache /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

btrfs/008 2s ... 1s
btrfs/016 4s ... 3s
btrfs/019 4s ... 2s
btrfs/020 2s ... 1s
Ran: btrfs/008 btrfs/016 btrfs/019 btrfs/020
Passed all 4 tests

And none of the tests failed...

>  generic/004,010,014,023,024,074,075,080,086,087,089,091,092,100,112,123,
>  124,125,126,127,131,133,192,193,198,207,208,209,213,214,215,228,239,240,
>  246,247,248,255,263,285,306,313,316,323
>
> We can reproduce this bug with the following simple commands:
>  TEST_DEV=/dev/vdh
>  TEST_DIR=/mnt/tmp
>
>  umount "$TEST_DEV" >/dev/null
>  mkfs.btrfs -f "$TEST_DEV"
>  mount "$TEST_DEV" "$TEST_DIR"
>
>  umount "$TEST_DEV"
>  mount "$TEST_DEV" "$TEST_DIR"
>
>  cp /bin/bash $TEST_DIR
>
> Result is:
>  (omit previous commands)
>  # cp /bin/bash $TEST_DIR
>  cp: writing `/mnt/tmp/bash': No space left on device
>
> Bisection shows it is triggered by the patch titled:
>  commit e44163e17796
>  ("btrfs: explictly delete unused block groups in close_ctree and ro-remount")
>
> But the buggy code is not in that patch: btrfs deletes all chunks
> when the filesystem holds no data, and the patch above merely makes
> the problem easy to hit.
>
> Detailed reason:
>  1: mkfs a blank filesystem, or delete everything in the filesystem
>  2: umount the fs
>     (the current code deletes all data chunks)
>  3: mount the fs
>     Because there are no data chunks, the data space_info has no
>     chance to be initialized, which means: space_info->total_bytes == 0
>     and space_info->full == 1.

Right, and that's the problem. When the space_info is initialized it
should never be flagged as full, otherwise any buffered write attempts
fail immediately with enospc instead of trying to allocate a data
block group (at extent-tree.c:btrfs_check_data_free_space()).

That was fixed recently by:

https://patchwork.kernel.org/patch/7133451/

(with a respective test too, https://patchwork.kernel.org/patch/7133471/)

>  4: do some write
>     The current code skips chunk allocation because space_info->full
>     is set, and returns -ENOSPC.
>
> Fix:
>  Don't auto-delete the last block group of a raid type.
>  Deleting every block group of a raid type not only causes the bug
>  above, but may also silently convert the filesystem to all-single
>  profiles in some cases.

I don't get this. Can you mention in which cases that happens and why
(in the commit message)?

It isn't clear when reading the patch why we need to keep at least one
block group of each type/profile, and it seems to be a workaround for a
different problem.

thanks

>
> Test:
>  Tested with the script above, and confirmed the logic via debug output.
>
> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
> ---
>  fs/btrfs/extent-tree.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 5411f0a..35cf7eb 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10012,7 +10012,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>                                                bg_list);
>                 space_info = block_group->space_info;
>                 list_del_init(&block_group->bg_list);
> -               if (ret || btrfs_mixed_space_info(space_info)) {
> +               if (ret || btrfs_mixed_space_info(space_info) ||
> +                   block_group->list.next == block_group->list.prev) {
>                         btrfs_put_block_group(block_group);
>                         continue;
>                 }
> --
> 1.8.5.1
>
Filipe Manana Sept. 21, 2015, 1:37 p.m. UTC | #2
On Mon, Sep 21, 2015 at 2:27 PM, Filipe David Manana <fdmanana@gmail.com> wrote:
> On Mon, Sep 21, 2015 at 1:59 PM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
>> Fix:
>>  Don't auto-delete the last block group of a raid type.
>>  Deleting every block group of a raid type not only causes the bug
>>  above, but may also silently convert the filesystem to all-single
>>  profiles in some cases.
>
> I don't get this. Can you mention in which cases that happens and why
> (in the commit message)?
>
> It isn't clear when reading the patch why we need to keep at least one
> block group of each type/profile, and it seems to be a workaround for a
> different problem.

Plus it would be a bad fix for such a problem, as anyone can still
trigger deletion of the last block group via a balance operation (like
in the test at https://patchwork.kernel.org/patch/7133471/), i.e.,
preventing deletion by the cleaner kthread is not enough to guarantee
the last block group of a kind isn't deleted...

Zhaolei Sept. 22, 2015, 10:06 a.m. UTC | #3
Hi, Filipe David Manana

Thanks for reviewing this patch.

> -----Original Message-----
> From: Filipe David Manana [mailto:fdmanana@gmail.com]
> Sent: Monday, September 21, 2015 9:27 PM
> To: Zhao Lei <zhaolei@cn.fujitsu.com>
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH] btrfs: Fix no space bug caused by removing bg
> 
> On Mon, Sep 21, 2015 at 1:59 PM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
> > btrfs in v4.3-rc1 fails many xfstests cases when run with the '-o nospace_cache'
> > mount option.
> >
> > Failed cases are:
> >
> > btrfs/008,016,019,020,026,027,028,029,031,041,046,048,050,051,053,054,
> >  077,083,084,087,092,094
> 
> Hi Zhao,
> 
> So far I tried a few of those against Chris' integration-4.3 branch (same btrfs
> code as 4.3-rc1):
> 
> MOUNT_OPTIONS="-o nospace_cache" ./check btrfs/008 btrfs/016 btrfs/019
> btrfs/020
> FSTYP         -- btrfs
> PLATFORM      -- Linux/x86_64 debian3 4.2.0-rc5-btrfs-next-12+
> MKFS_OPTIONS  -- /dev/sdc
> MOUNT_OPTIONS -- -o nospace_cache /dev/sdc
> /home/fdmanana/btrfs-tests/scratch_1
> 
> btrfs/008 2s ... 1s
> btrfs/016 4s ... 3s
> btrfs/019 4s ... 2s
> btrfs/020 2s ... 1s
> Ran: btrfs/008 btrfs/016 btrfs/019 btrfs/020 Passed all 4 tests
> 
> And none of the tests failed...
> 
Sorry, I hadn't pasted the details of my test commands.

They came from a coincidental operation that differs somewhat from the
standard steps (such as yours): I mounted the fs with -o nospace_cache
manually, without setting MOUNT_OPTIONS, so xfstests entered a special
path and triggered the bug:
  export TEST_DEV='/dev/sdb5'
  export TEST_DIR='/var/ltf/tester/mnt'
  mkdir -p '/var/ltf/tester/mnt'

  export SCRATCH_DEV_POOL='/dev/sdb6 /dev/sdb7 /dev/sdb8 /dev/sdb9 /dev/sdb10 /dev/sdb11'
  export SCRATCH_MNT='/var/ltf/tester/scratch_mnt'
  mkdir -p '/var/ltf/tester/scratch_mnt'

  export DIFF_LENGTH=0
  
  mkfs.btrfs -f "$TEST_DEV"
  mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"

  ./check generic/014

Result:
  FSTYP         -- btrfs
  PLATFORM      -- Linux/x86_64 lenovo 4.3.0-rc2_HEAD_1f93e4a96c9109378204c147b3eec0d0e8100fde_
  MKFS_OPTIONS  -- /dev/sdb6
  MOUNT_OPTIONS -- /dev/sdb6 /var/ltf/tester/scratch_mnt

  generic/014 0s ... - output mismatch (see /var/lib/xfstests/results//generic/014.out.bad)
      --- tests/generic/014.out   2015-09-22 17:46:13.855391451 +0800
      +++ /var/lib/xfstests/results//generic/014.out.bad  2015-09-22 17:57:06.446095748 +0800
      @@ -3,4 +3,5 @@
       ------
       test 1
       ------
      -OK
      +truncfile returned 1 : "write: No space left on device
      +Seed = 1442915826 (use "-s 1442915826" to re-execute this test)"
  Ran: generic/014
  Failures: generic/014
  Failed 1 of 1 tests

The simple reproduction script above came from tracing this test.
Maybe I should drop the xfstests description from the changelog, since
these are not the standard steps.

> >
> > Detailed reason:
> >  1: mkfs a blank filesystem, or delete everything in the filesystem
> >  2: umount the fs
> >     (the current code deletes all data chunks)
> >  3: mount the fs
> >     Because there are no data chunks, the data space_info has no
> >     chance to be initialized, which means: space_info->total_bytes == 0
> >     and space_info->full == 1.
> 
> Right, and that's the problem. When the space_info is initialized it should never
> be flagged as full, otherwise any buffered write attempts fail immediately with
> enospc instead of trying to allocate a data block group (at
> extent-tree.c:btrfs_check_data_free_space()).
> 
> That was fixed recently by:
> 
> https://patchwork.kernel.org/patch/7133451/
> 
> (with a respective test too, https://patchwork.kernel.org/patch/7133471/)
> 

That fixes the problem at mount time, but it cannot fix the "raid-level
change" problem; please see below.

> >  4: do some write
> >     The current code skips chunk allocation because space_info->full
> >     is set, and returns -ENOSPC.
> >
> > Fix:
> >  Don't auto-delete the last block group of a raid type.
> >  Deleting every block group of a raid type not only causes the bug
> >  above, but may also silently convert the filesystem to all-single
> >  profiles in some cases.
> 
> I don't get this. Can you mention in which cases that happens and why (in the
> commit message)?
> 
> It isn't clear when reading the patch why we need to keep at least one block
> group of each type/profile, and it seems to be a workaround for a different
> problem.
> 
Simply speaking, if we run the following commands after applying your patch:

  TEST_DEV=(/dev/vdg /dev/vdh)
  TEST_DIR=/mnt/tmp

  umount "$TEST_DEV" >/dev/null
  mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
  mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"

  umount "$TEST_DEV"
  mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"

  btrfs filesystem usage $TEST_DIR

The result is:
  # btrfs filesystem usage $TEST_DIR
  (omit)
  Data,single: Size:8.00MiB, Used:0.00B
     /dev/vdg        8.00MiB
  ...

We can see the data chunks changed from raid1 to single: if all data
chunks are deleted before mount, no raid-type information is left in
the filesystem, and btrfs uses a raid type of "0x0" (single) for new
data chunks even with your patch applied.

So leaving at least one data chunk is a simple workaround for both of
the above bugs; the sketch below models the profile loss.
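
A rough userspace model of that profile loss (an illustrative sketch of
the behaviour described above, not the kernel's allocator; the RAID1 bit
value matches the on-disk BTRFS_BLOCK_GROUP_RAID1 flag):

  #include <stdio.h>

  #define BLOCK_GROUP_RAID1 (1ULL << 4)   /* on-disk RAID1 profile bit */

  /* Union of the profile bits of the data block groups that still
   * exist; the target profile for a new chunk is derived from state
   * like this, and 0x0 means "single". */
  static unsigned long long avail_data_profiles;

  static unsigned long long new_data_chunk_profile(void)
  {
          return avail_data_profiles;
  }

  int main(void)
  {
          avail_data_profiles = BLOCK_GROUP_RAID1;
          printf("with a raid1 data bg: 0x%llx\n", new_data_chunk_profile());

          /* the last (empty) raid1 data bg gets auto-deleted */
          avail_data_profiles = 0;
          printf("after deleting it:    0x%llx (single)\n",
                 new_data_chunk_profile());
          return 0;
  }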

Thanks
Zhaolei

Filipe Manana Sept. 22, 2015, 10:22 a.m. UTC | #4
On Tue, Sep 22, 2015 at 11:06 AM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
> Hi, Filipe David Manana
>> I don't get this. Can you mention in which cases that happens and why (in the
>> commit message)?
>>
>> It isn't clear when reading the patch why we need to keep at least one block
>> group of each type/profile, and it seems to be a workaround for a different
>> problem.
>>
> Simply speaking, if we run the following commands after applying your patch:
>
>   TEST_DEV=(/dev/vdg /dev/vdh)
>   TEST_DIR=/mnt/tmp
>
>   umount "$TEST_DEV" >/dev/null
>   mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
>   mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
>
>   umount "$TEST_DEV"
>   mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
>
>   btrfs filesystem usage $TEST_DIR
>
> The result is:
>   # btrfs filesystem usage $TEST_DIR
>   (omit)
>   Data,single: Size:8.00MiB, Used:0.00B
>      /dev/vdg        8.00MiB
>   ...
>
> We can see the data chunks changed from raid1 to single: if all data
> chunks are deleted before mount, no raid-type information is left in
> the filesystem, and btrfs uses a raid type of "0x0" (single) for new
> data chunks even with your patch applied.
>
> So leaving at least one data chunk is a simple workaround for both of
> the above bugs.

Well, it's a hack that besides not being very logical is also
incomplete and unreliable: preventing the cleaner kthread (automatic
removal of unused block groups) from deleting the last block group of
a kind doesn't prevent the block group from being deleted by a
balance, just as the new regression xfstest does...

In other words, the problem existed before we had automatic removal of
unused block groups; it was just less likely for someone to run into it.

So please let's have a real fix for this problem. It seems like we're
losing the raid profile type somewhere when allocating a new block group.

thanks


Zhaolei Sept. 22, 2015, 10:22 a.m. UTC | #5
Hi, Filipe David Manana

> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org
> [mailto:linux-btrfs-owner@vger.kernel.org] On Behalf Of Zhao Lei
> Sent: Tuesday, September 22, 2015 6:06 PM
> To: fdmanana@gmail.com
> Cc: linux-btrfs@vger.kernel.org
> Subject: RE: [PATCH] btrfs: Fix no space bug caused by removing bg
> 
> Hi, Filipe David Manana
> 
> Thanks for review this patch.
> 
> > -----Original Message-----
> > From: Filipe David Manana [mailto:fdmanana@gmail.com]
> > Sent: Monday, September 21, 2015 9:27 PM
> > To: Zhao Lei <zhaolei@cn.fujitsu.com>
> > Cc: linux-btrfs@vger.kernel.org
> > Subject: Re: [PATCH] btrfs: Fix no space bug caused by removing bg
> >
> >
> > Hi Zhao,
> >
> > So far I tried a few of those against Chris' integration-4.3 branch
> > (same btrfs code as 4.3-rc1):
> >
> > MOUNT_OPTIONS="-o nospace_cache" ./check btrfs/008 btrfs/016 btrfs/019
> > btrfs/020
> > FSTYP         -- btrfs
> > PLATFORM      -- Linux/x86_64 debian3 4.2.0-rc5-btrfs-next-12+
> > MKFS_OPTIONS  -- /dev/sdc
> > MOUNT_OPTIONS -- -o nospace_cache /dev/sdc
> > /home/fdmanana/btrfs-tests/scratch_1
> >
> > btrfs/008 2s ... 1s
> > btrfs/016 4s ... 3s
> > btrfs/019 4s ... 2s
> > btrfs/020 2s ... 1s
> > Ran: btrfs/008 btrfs/016 btrfs/019 btrfs/020 Passed all 4 tests
> >
> > And none of the tests failed...
> >

Plus, on retesting, the xfstests failure also happens with the standard
steps on my node (with the newest xfstests):

  # btrfs --version
  btrfs-progs v4.2
  # uname -a
  Linux lenovo 4.3.0-rc2_HEAD_1f93e4a96c9109378204c147b3eec0d0e8100fde_ #1 SMP Mon Sep 21 06:34:49 CST 2015 x86_64 x86_64 x86_64 GNU/Linux
  # MOUNT_OPTIONS="-o nospace_cache" ./check btrfs/008
  FSTYP         -- btrfs
  PLATFORM      -- Linux/x86_64 lenovo 4.3.0-rc2_HEAD_1f93e4a96c9109378204c147b3eec0d0e8100fde_
  MKFS_OPTIONS  -- /dev/sdb6
  MOUNT_OPTIONS -- -o nospace_cache /dev/sdb6 /var/ltf/tester/scratch_mnt

  btrfs/008 1s ... [failed, exit status 1] - output mismatch (see /var/lib/xfstests/results//btrfs/008.out.bad)
      --- tests/btrfs/008.out     2015-09-22 17:46:12.530391386 +0800
      +++ /var/lib/xfstests/results//btrfs/008.out.bad    2015-09-22 18:17:24.699154708 +0800
      @@ -1,2 +1,3 @@
       QA output created by 008
      -Silence is golden
      +send failed
      +(see /var/lib/xfstests/results//btrfs/008.full for details)
  Ran: btrfs/008
  Failures: btrfs/008
  Failed 1 of 1 tests

Maybe there is some difference between our nodes, but I think it has no
relationship with this bug, and the detailed reason does not need
investigating.

Thanks
Zhaolei

Zhaolei Sept. 22, 2015, 11:24 a.m. UTC | #6
Hi, Filipe David Manana

> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org
> [mailto:linux-btrfs-owner@vger.kernel.org] On Behalf Of Filipe David Manana
> Sent: Tuesday, September 22, 2015 6:22 PM
> To: Zhao Lei <zhaolei@cn.fujitsu.com>
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH] btrfs: Fix no space bug caused by removing bg
> 
> On Tue, Sep 22, 2015 at 11:06 AM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
> > We can see the data chunks changed from raid1 to single: if all data
> > chunks are deleted before mount, no raid-type information is left in
> > the filesystem, and btrfs uses a raid type of "0x0" (single) for new
> > data chunks even with your patch applied.
> >
> > So leaving at least one data chunk is a simple workaround for both of
> > the above bugs.
> 
> Well, it's a hack that besides not being very logical is also incomplete and
> unreliable: preventing the cleaner kthread (automatic removal of unused block
> groups) from deleting the last block group of a kind doesn't prevent the block
> group from being deleted by a balance, just as the new regression xfstest
> does...
> 

What this patch fixes is the regression caused by the
"auto-remove-bg-at-umount" patch.

The balance problem is another bug, which should be fixed by
another patch.

> In other words, the problem existed before we had automatic removal of
> unused block groups; it was just less likely for someone to run into it.
>
> So please let's have a real fix for this problem. It seems like we're
> losing the raid profile type somewhere when allocating a new block group.
> 
I don't think the "keep the last bg" approach is a hack; it is just like
creating one blank data chunk in mkfs.

In detail, maybe "writing the raid type to disk" is the more direct way,
but it would change our on-disk format (e.g. the superblock) and require
changes to lots of code, such as chunk allocation, balance, mkfs, check,
and more, plus compatibility handling for old filesystems.

So the approach in this patch is maybe the best fit for the above bugs.

Thanks
Zhaolei

Filipe Manana Sept. 22, 2015, 12:45 p.m. UTC | #7
On Tue, Sep 22, 2015 at 12:24 PM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
> Hi, Filipe David Manana
>
>> -----Original Message-----
>> From: linux-btrfs-owner@vger.kernel.org
>> [mailto:linux-btrfs-owner@vger.kernel.org] On Behalf Of Filipe David Manana
>> Sent: Tuesday, September 22, 2015 6:22 PM
>> To: Zhao Lei <zhaolei@cn.fujitsu.com>
>> Cc: linux-btrfs@vger.kernel.org
>> Subject: Re: [PATCH] btrfs: Fix no space bug caused by removing bg
>>
>> Well, it's a hack that besides not being very logical is also incomplete and
>> unreliable: preventing the cleaner kthread (automatic removal of unused block
>> groups) from deleting the last block group of a kind doesn't prevent the block
>> group from being deleted by a balance, just as the new regression xfstest
>> does...
>>
>
> What this patch fixes is the regression caused by the
> "auto-remove-bg-at-umount" patch.

Well no, that patch did not introduce any problem so far... It only
exposes more easily a problem that has existed for a long time...

>
> The balance problem is another bug, which should be fixed by
> another patch.
>
>> In other words, the problem existed before we had automatic removal of
>> unused block groups; it was just less likely for someone to run into it.
>>
>> So please let's have a real fix for this problem. It seems like we're
>> losing the raid profile type somewhere when allocating a new block group.
>>
> I don't think the "keep the last bg" approach is a hack; it is just like
> creating one blank data chunk in mkfs.
>
> In detail, maybe "writing the raid type to disk" is the more direct way,
> but it would change our on-disk format (e.g. the superblock) and require
> changes to lots of code, such as chunk allocation, balance, mkfs, check,
> and more, plus compatibility handling for old filesystems.
>
> So the approach in this patch is maybe the best fit for the above bugs.



At the very least this patch should have its title and description
changed - to reflect that it attempts to solve only (and partially)
the problem of losing raid profile type. Because the enospc issue is
not caused by not having any data block group allocated, but it's
instead due to a recent regression when initializing a space_info
object as full, as described and proved in the patch mentioned before
[1] - you could run into the enospc issue even without having the
automatic removal of empty block groups or the patch [2] introduced in
4.3 (just try the new xfstest [3] to verify it yourself).

[1] https://patchwork.kernel.org/patch/7133451/
[2] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=e44163e177960ee60e32a73bffdd53c3a5827406
[3] https://git.kernel.org/cgit/fs/xfs/xfstests-dev.git/commit/?id=da82741228c7efaedae2e027efc54b1892800fe1

>
> Thanks
> Zhaolei
>
>> thanks
>>
>>
>> >
>> > Thanks
>> > Zhaolei
>> >
>> >> thanks
>> >>
>> >> >
>> >> > Test:
>> >> >  Test by above script, and confirmed the logic by debug output.
>> >> >
>> >> > Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
>> >> > ---
>> >> >  fs/btrfs/extent-tree.c | 3 ++-
>> >> >  1 file changed, 2 insertions(+), 1 deletion(-)
>> >> >
>> >> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> >> > index 5411f0a..35cf7eb 100644
>> >> > --- a/fs/btrfs/extent-tree.c
>> >> > +++ b/fs/btrfs/extent-tree.c
>> >> > @@ -10012,7 +10012,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>> >> >                                                bg_list);
>> >> >                 space_info = block_group->space_info;
>> >> >                 list_del_init(&block_group->bg_list);
>> >> > -               if (ret || btrfs_mixed_space_info(space_info)) {
>> >> > +               if (ret || btrfs_mixed_space_info(space_info) ||
>> >> > +                   block_group->list.next == block_group->list.prev) {
>> >> >                         btrfs_put_block_group(block_group);
>> >> >                         continue;
>> >> >                 }
>> >> > --
>> >> > 1.8.5.1
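
The check added in the hunk quoted above is the circular doubly linked
list idiom for "this entry is the only element": with a separate list
head, an entry's next and prev both point at the head exactly when
nothing else is on the list. A standalone model (the names below are
invented for the sketch, not the kernel's list API):

    #include <stdio.h>

    struct list_head { struct list_head *next, *prev; };

    static void list_init(struct list_head *h)
    {
        h->next = h->prev = h;
    }

    /* Insert n right after h, as the kernel's list_add() does. */
    static void list_add(struct list_head *n, struct list_head *h)
    {
        n->next = h->next;
        n->prev = h;
        h->next->prev = n;
        h->next = n;
    }

    static int is_sole_entry(const struct list_head *n)
    {
        return n->next == n->prev;
    }

    int main(void)
    {
        struct list_head head, bg1, bg2;

        list_init(&head);
        list_add(&bg1, &head);
        printf("one bg:  sole=%d\n", is_sole_entry(&bg1)); /* 1 -> skip */
        list_add(&bg2, &head);
        printf("two bgs: sole=%d\n", is_sole_entry(&bg1)); /* 0 -> may delete */
        return 0;
    }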
>> >> >
>
Jeff Mahoney Sept. 22, 2015, 12:59 p.m. UTC | #8

On 9/21/15 8:59 AM, Zhao Lei wrote:
> btrfs in v4.3-rc1 failed many xfstests items with '-o
> nospace_cache' mount option.
> 
> Failed cases are:
> btrfs/008,016,019,020,026,027,028,029,031,041,046,048,050,051,053,054,
> 077,083,084,087,092,094
> generic/004,010,014,023,024,074,075,080,086,087,089,091,092,100,112,123,
> 124,125,126,127,131,133,192,193,198,207,208,209,213,214,215,228,239,240,
> 246,247,248,255,263,285,306,313,316,323
>
> We can reproduce this bug with following simple command:
> TEST_DEV=/dev/vdh
> TEST_DIR=/mnt/tmp
>
> umount "$TEST_DEV" >/dev/null
> mkfs.btrfs -f "$TEST_DEV"
> mount "$TEST_DEV" "$TEST_DIR"
>
> umount "$TEST_DEV"
> mount "$TEST_DEV" "$TEST_DIR"
>
> cp /bin/bash $TEST_DIR
>
> Result is:
> (omit previous commands)
> # cp /bin/bash $TEST_DIR
> cp: writing `/mnt/tmp/bash': No space left on device
>
> By bisect, we can see it is triggered by patch titled:
> commit e44163e17796 ("btrfs: explictly delete unused block groups in
> close_ctree and ro-remount")
>
> But the wrong code is not in above patch, btrfs delete all chunks
> if no data in filesystem, and above patch just make it obviously.
>
> Detail reason:
> 1: mkfs a blank filesystem, or delete everything in filesystem
> 2: umount fs (current code will delete all data chunks)
> 3: mount fs
>    Because no any data chunks, data's space_cache have no chance
>    to init, it means: space_info->total_bytes == 0, and
>    space_info->full == 1.
> 4: do some write
>    Current code will ignore chunk allocate because space_info->full,
>    and return -ENOSPC.
> 
> Fix: Don't auto-delete last blockgroup for a raid type.

The only reason not to do this is the loss of the raid type, which is
an issue that needs to be addressed separately.  As Filipe said in his
responses, this isn't the source of the problem - it just makes it
obvious.  We've seen in actual bug reports that it's possible to
create exactly the same scenario by balancing an empty file system.

So if the way we want to prevent the loss of raid type info is by
maintaining the last block group allocated with that raid type, fine,
but that's a separate discussion.  Personally, I think keeping 1GB
allocated as a placeholder is a bit much.  Beyond that, I've been
thinking casually about ways to direct the allocator to use certain
devices for certain things (e.g. in a hybrid system with SSDs and
HDDs, always allocate metadata on the SSD) and there's some overlap
there.  As it stands, we can fake that in mkfs but it'll get stomped
by balance nearly immediately.

-Jeff

> If we delete all blockgroup for a raidtype, it not only cause above
> bug, but also may change filesystem to all-single in some case.
>
> Test:
> Test by above script, and confirmed the logic by debug output.
>
> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
> ---
>  fs/btrfs/extent-tree.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 5411f0a..35cf7eb 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10012,7 +10012,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>                                                bg_list);
>                 space_info = block_group->space_info;
>                 list_del_init(&block_group->bg_list);
> -               if (ret || btrfs_mixed_space_info(space_info)) {
> +               if (ret || btrfs_mixed_space_info(space_info) ||
> +                   block_group->list.next == block_group->list.prev) {
>                         btrfs_put_block_group(block_group);
>                         continue;
>                 }
> 


-- 
Jeff Mahoney
SUSE Labs
Hugo Mills Sept. 22, 2015, 1:28 p.m. UTC | #9
On Tue, Sep 22, 2015 at 08:59:57AM -0400, Jeff Mahoney wrote:
[snip]
> So if they way we want to prevent the loss of raid type info is by
> maintaining the last block group allocated with that raid type, fine,
> but that's a separate discussion.  Personally, I think keeping 1GB
> allocated as a placeholder is a bit much.  Beyond that, I've been
> thinking casually about ways to direct the allocator to use certain
> devices for certain things (e.g. in a hybrid system with SSDs and
> HDDs, always allocate metadata on the SSD) and there's some overlap
> there.  As it stands, we can fake that in mkfs but it'll get stomped
> by balance nearly immediately.

   In terms of selecting the location of chunks within the allocator,
I wrote up a design for a pretty generic way of doing it some time ago
[1]. It would allow things like metadata to SSDs, but also defining
failure domains for replication (i.e. "I want two copies of my data in
RAID-1, but I want each copy to go on a different storage array"). It
would also give us the ability to handle different allocation
strategies, such as filling up one device at a time.
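
As a toy illustration of the failure-domain idea from [1]: tag each
device with an admin-assigned domain and require the two copies of a
RAID-1 chunk to land in distinct domains. Everything below is invented
for the sketch; it is not the code from that thread:

    #include <stdio.h>
    #include <string.h>

    struct dev { const char *name; const char *domain; };

    /* Pick two devices for a RAID-1 pair whose failure domains differ. */
    static int pick_raid1(const struct dev *devs, int n,
                          const struct dev **a, const struct dev **b)
    {
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (strcmp(devs[i].domain, devs[j].domain) != 0) {
                    *a = &devs[i];
                    *b = &devs[j];
                    return 0;
                }
        return -1;   /* no placement satisfies the policy */
    }

    int main(void)
    {
        struct dev devs[] = {
            { "sda", "array0" }, { "sdb", "array0" }, { "sdc", "array1" },
        };
        const struct dev *a, *b;

        if (pick_raid1(devs, 3, &a, &b) == 0)
            printf("copy 1 -> %s, copy 2 -> %s\n", a->name, b->name);
        return 0;
    }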

   I got as far as some python to demonstrate the algorithms and
structure (also in that mail thread). I started trying to work out how
to rewrite the allocator in the kernel to support it, but I got lost
in the code fairly rapidly, particularly about how to store the
relevant policy metadata (for the FS as a whole, and, later, on a
per-subvolume basis).

   Hugo.

[1] http://www.mail-archive.com/linux-btrfs%40vger.kernel.org/msg33499.html

> - -Jeff
> 
> > If we delete all blockgroup for a raidtype, it not only cause above
> > bug, but also may change filesystem to all-single in some case.
> > 
> > Test: Test by above script, and confirmed the logic by debug
> > output.
> > 
> > Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> --- 
> > fs/btrfs/extent-tree.c | 3 ++- 1 file changed, 2 insertions(+), 1
> > deletion(-)
> > 
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index
> > 5411f0a..35cf7eb 100644 --- a/fs/btrfs/extent-tree.c +++
> > b/fs/btrfs/extent-tree.c @@ -10012,7 +10012,8 @@ void
> > btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info) bg_list); 
> > space_info = block_group->space_info; 
> > list_del_init(&block_group->bg_list); -		if (ret ||
> > btrfs_mixed_space_info(space_info)) { +		if (ret ||
> > btrfs_mixed_space_info(space_info) || +		    block_group->list.next
> > == block_group->list.prev) { btrfs_put_block_group(block_group); 
> > continue; }
> > 
> 
> 
> - -- 
> Jeff Mahoney
> SUSE Labs
Holger Hoffstätte Sept. 22, 2015, 1:36 p.m. UTC | #10
On 09/22/15 14:59, Jeff Mahoney wrote:
(snip)
> So if they way we want to prevent the loss of raid type info is by
> maintaining the last block group allocated with that raid type, fine,
> but that's a separate discussion.  Personally, I think keeping 1GB

At this point I'm much more surprised to learn that the RAID type can
apparently get "lost" in the first place, and is not persisted
separately. I mean..wat?

> allocated as a placeholder is a bit much.  Beyond that, I've been

Can you explain why keeping at least one data chunk (or the appropriate
number across devices, depending on RAID level..) is "a bit much"?
IMHO this would have no real negative impact on end users (who think
in terms of overall filesystem space, not how much of that has been
lazily touched), nor for more obscure use cases like sparse images -
which would still be sparsely reserved. So there would not be any
actual downside that I can see. From a fs consistency point of view
it sounds much more sane to assume that at least a basic set of data
chunks always need to exist. I know I was very surprised recently to
see all my data chunks cleaned up on an otherwise empty fs.
I mean..it's good to see that the house cleaning works, but you also
gotta know when to stop!

> thinking casually about ways to direct the allocator to use certain
> devices for certain things (e.g. in a hybrid system with SSDs and
> HDDs, always allocate metadata on the SSD) and there's some overlap
> there.  As it stands, we can fake that in mkfs but it'll get stomped
> by balance nearly immediately.

Please share those casual thoughts with the group. :)

-h
Hugo Mills Sept. 22, 2015, 1:41 p.m. UTC | #11
On Tue, Sep 22, 2015 at 03:36:43PM +0200, Holger Hoffstätte wrote:
> On 09/22/15 14:59, Jeff Mahoney wrote:
> (snip)
> > So if they way we want to prevent the loss of raid type info is by
> > maintaining the last block group allocated with that raid type, fine,
> > but that's a separate discussion.  Personally, I think keeping 1GB
> 
> At this point I'm much more surprised to learn that the RAID type can
> apparently get "lost" in the first place, and is not persisted
> separately. I mean..wat?

   It's always been like that, unfortunately.

   The code tries to use the RAID type that's already present to work
out what the next allocation should be. If there aren't any chunks in
the FS, the configuration is lost, because it's not stored anywhere
else. It's one of the things that tripped me up badly when I was
failing to rewrite the chunk allocator last year.

> > allocated as a placeholder is a bit much.  Beyond that, I've been
> 
> Can you explain why keeping at least one data chunk (or the appropriate
> number across devices, depending on RAID level..) is "a bit much"?
> IMHO this would have no real negative impact on end users (who think
> in terms of overall filesystem space, not how much of that has been
> lazily touched), nor for more obscure use cases like sparse images -
> which would still be sparsely reserved. So there would not be any
> actual downside that I can see. From a fs consistency point of view
> it sounds much more sane to assume that at least a basic set of data
> chunks always need to exist. I know I was very surprised recently to
> see all my data chunks cleaned up on an otherwise empty fs.
> I mean..it's good to see that the house cleaning works, but you also
> gotta know when to stop!
> 
> > thinking casually about ways to direct the allocator to use certain
> > devices for certain things (e.g. in a hybrid system with SSDs and
> > HDDs, always allocate metadata on the SSD) and there's some overlap
> > there.  As it stands, we can fake that in mkfs but it'll get stomped
> > by balance nearly immediately.
> 
> Please share those casual thoughts with the group. :)

   I had some as well last year (see my other mail). I hope they line
up with Jeff's. :)

   Hugo.
David Sterba Sept. 22, 2015, 2:23 p.m. UTC | #12
On Tue, Sep 22, 2015 at 01:41:31PM +0000, Hugo Mills wrote:
> On Tue, Sep 22, 2015 at 03:36:43PM +0200, Holger Hoffstätte wrote:
> > On 09/22/15 14:59, Jeff Mahoney wrote:
> > (snip)
> > > So if they way we want to prevent the loss of raid type info is by
> > > maintaining the last block group allocated with that raid type, fine,
> > > but that's a separate discussion.  Personally, I think keeping 1GB
> > 
> > At this point I'm much more surprised to learn that the RAID type can
> > apparently get "lost" in the first place, and is not persisted
> > separately. I mean..wat?
> 
>    It's always been like that, unfortunately.
> 
>    The code tries to use the RAID type that's already present to work
> out what the next allocation should be. If there aren't any chunks in
> the FS, the configuration is lost, because it's not stored anywhere
> else. It's one of the things that tripped me up badly when I was
> failing to rewrite the chunk allocator last year.

Yeah, right now there's no persistent default for the allocator. I'm
still hoping that the object properties will magically solve that.
Hugo Mills Sept. 22, 2015, 2:36 p.m. UTC | #13
On Tue, Sep 22, 2015 at 04:23:33PM +0200, David Sterba wrote:
> On Tue, Sep 22, 2015 at 01:41:31PM +0000, Hugo Mills wrote:
> > On Tue, Sep 22, 2015 at 03:36:43PM +0200, Holger Hoffstätte wrote:
> > > On 09/22/15 14:59, Jeff Mahoney wrote:
> > > (snip)
> > > > So if they way we want to prevent the loss of raid type info is by
> > > > maintaining the last block group allocated with that raid type, fine,
> > > > but that's a separate discussion.  Personally, I think keeping 1GB
> > > 
> > > At this point I'm much more surprised to learn that the RAID type can
> > > apparently get "lost" in the first place, and is not persisted
> > > separately. I mean..wat?
> > 
> >    It's always been like that, unfortunately.
> > 
> >    The code tries to use the RAID type that's already present to work
> > out what the next allocation should be. If there aren't any chunks in
> > the FS, the configuration is lost, because it's not stored anywhere
> > else. It's one of the things that tripped me up badly when I was
> > failing to rewrite the chunk allocator last year.
> 
> Yeah, right now there's no persistent default for the allocator. I'm
> still hoping that the object properties will magically solve that.

   There's no obvious place that filesystem-wide properties can be
stored, though. There's a userspace tool to manipulate the few current
FS-wide properties, but that's all special-cased to use the
"historical" ioctls for those properties, with no generalisation of a
property store, or even (IIRC) any external API for them.

   We're nominally using xattrs in the btrfs: namespace on directories
and files, and presumably on the top directory of a subvolume for
subvol-wide properties, but it's not clear where the FS-wide values
should go: in the top directory of subvolid=5 would be confusing,
because then you couldn't separate the properties for *that subvol*
from the ones for the whole FS (say, the default replication policy,
where you might want the top subvol to have different properties from
everything else).

   Hugo.
Austin S. Hemmelgarn Sept. 22, 2015, 2:54 p.m. UTC | #14
On 2015-09-22 10:36, Hugo Mills wrote:
> On Tue, Sep 22, 2015 at 04:23:33PM +0200, David Sterba wrote:
>> On Tue, Sep 22, 2015 at 01:41:31PM +0000, Hugo Mills wrote:
>>> On Tue, Sep 22, 2015 at 03:36:43PM +0200, Holger Hoffstätte wrote:
>>>> On 09/22/15 14:59, Jeff Mahoney wrote:
>>>> (snip)
>>>>> So if they way we want to prevent the loss of raid type info is by
>>>>> maintaining the last block group allocated with that raid type, fine,
>>>>> but that's a separate discussion.  Personally, I think keeping 1GB
>>>>
>>>> At this point I'm much more surprised to learn that the RAID type can
>>>> apparently get "lost" in the first place, and is not persisted
>>>> separately. I mean..wat?
>>>
>>>     It's always been like that, unfortunately.
>>>
>>>     The code tries to use the RAID type that's already present to work
>>> out what the next allocation should be. If there aren't any chunks in
>>> the FS, the configuration is lost, because it's not stored anywhere
>>> else. It's one of the things that tripped me up badly when I was
>>> failing to rewrite the chunk allocator last year.
>>
>> Yeah, right now there's no persistent default for the allocator. I'm
>> still hoping that the object properties will magically solve that.
>
>     There's no obvious place that filesystem-wide properties can be
> stored, though. There's a userspace tool to manipulate the few current
> FS-wide properties, but that's all special-cased to use the
> "historical" ioctls for those properties, with no generalisation of a
> property store, or even (IIRC) any external API for them.
>
>     We're nominally using xattrs in the btrfs: namespace on directories
> and files, and presumably on the top directory of a subvolume for
> subvol-wide properties, but it's not clear where the FS-wide values
> should go: in the top directory of subvolid=5 would be confusing,
> because then you couldn't separate the properties for *that subvol*
> from the ones for the whole FS (say, the default replication policy,
> where you might want the top subvol to have different properties from
> everything else).
Possibly do special names for the defaults and store them there?  In 
general, I personally see little value in having some special 'default' 
properties however.

The way I would expect things to work is that a new subvolume inherits 
its properties from its parent (if it's a snapshot), or from the next 
higher subvolume it's nested in.  This would obviate the need for some 
special 'default' properties, and would be relatively intuitive behavior 
for a significant majority of people.
Hugo Mills Sept. 22, 2015, 3:39 p.m. UTC | #15
On Tue, Sep 22, 2015 at 10:54:45AM -0400, Austin S Hemmelgarn wrote:
> On 2015-09-22 10:36, Hugo Mills wrote:
> >On Tue, Sep 22, 2015 at 04:23:33PM +0200, David Sterba wrote:
> >>On Tue, Sep 22, 2015 at 01:41:31PM +0000, Hugo Mills wrote:
> >>>On Tue, Sep 22, 2015 at 03:36:43PM +0200, Holger Hoffstätte wrote:
> >>>>On 09/22/15 14:59, Jeff Mahoney wrote:
> >>>>(snip)
> >>>>>So if they way we want to prevent the loss of raid type info is by
> >>>>>maintaining the last block group allocated with that raid type, fine,
> >>>>>but that's a separate discussion.  Personally, I think keeping 1GB
> >>>>
> >>>>At this point I'm much more surprised to learn that the RAID type can
> >>>>apparently get "lost" in the first place, and is not persisted
> >>>>separately. I mean..wat?
> >>>
> >>>    It's always been like that, unfortunately.
> >>>
> >>>    The code tries to use the RAID type that's already present to work
> >>>out what the next allocation should be. If there aren't any chunks in
> >>>the FS, the configuration is lost, because it's not stored anywhere
> >>>else. It's one of the things that tripped me up badly when I was
> >>>failing to rewrite the chunk allocator last year.
> >>
> >>Yeah, right now there's no persistent default for the allocator. I'm
> >>still hoping that the object properties will magically solve that.
> >
> >    There's no obvious place that filesystem-wide properties can be
> >stored, though. There's a userspace tool to manipulate the few current
> >FS-wide properties, but that's all special-cased to use the
> >"historical" ioctls for those properties, with no generalisation of a
> >property store, or even (IIRC) any external API for them.
> >
> >    We're nominally using xattrs in the btrfs: namespace on directories
> >and files, and presumably on the top directory of a subvolume for
> >subvol-wide properties, but it's not clear where the FS-wide values
> >should go: in the top directory of subvolid=5 would be confusing,
> >because then you couldn't separate the properties for *that subvol*
> >from the ones for the whole FS (say, the default replication policy,
> >where you might want the top subvol to have different properties from
> >everything else).
> Possibly do special names for the defaults and store them there?  In
> general, I personally see little value in having some special
> 'default' properties however.

   That would work.

> The way I would expect things to work is that a new subvolume
> inherits it's properties from it's parent (if it's a snapshot),

   Definitely this.

> or
> from the next higher subvolume it's nested in.

   I don't think I like this. I'm not quite sure why, though, at the
moment.

   It definitely makes the process at the start of allocating a new
block group much more complex: you have to walk back up through an
arbitrary depth of nested subvols to find the one that's actually got
a replication policy record in it. (Because after this feature is
brought in, there will be lots of filesystems without per-subvol
replication policies in them, and we have to have some way of dealing
with those as well).

   With an FS default policy, you only need check the current subvol,
and then fall back to the FS default if that's not found.
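
   A small sketch of the difference between the two lookups (the types
and field names here are invented for illustration, not actual btrfs
structures):

    #include <stdio.h>

    struct subvol {
        struct subvol *parent;   /* enclosing subvolume, NULL at the top */
        const char *policy;      /* replication policy, NULL if unset */
    };

    /* Nested inheritance: walk back up through arbitrary nesting depth. */
    static const char *policy_nested(const struct subvol *sv,
                                     const char *fs_default)
    {
        for (; sv; sv = sv->parent)
            if (sv->policy)
                return sv->policy;
        return fs_default;
    }

    /* FS-default model: one check on the current subvol, then fall back. */
    static const char *policy_flat(const struct subvol *sv,
                                   const char *fs_default)
    {
        return sv->policy ? sv->policy : fs_default;
    }

    int main(void)
    {
        struct subvol top = { NULL, "raid1" };
        struct subvol nested = { &top, NULL };  /* no policy of its own */

        printf("nested walk: %s\n", policy_nested(&nested, "single"));
        printf("flat lookup: %s\n", policy_flat(&nested, "single"));
        return 0;
    }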

   These things are, I think, likely to be lightly used: I would be
reasonably surprised to find more than two or possibly three storage
policies in use on any given system with a sane sysadmin.

   I'm actually not sure what the interactions of multiple storage
policies are going to be like. It's entirely possible, particularly
with some of the more exotic (but useful) suggestions I've thought of,
that the behaviour of the FS is dependent on the order in which the
block groups are allocated. (i.e. "20 GiB to subvol-A, then 20 GiB to
subvol-B" results in different behaviour than "1 GiB to subvol-A then
1 GiB to subvol-B and repeat"). I tried some simple Monte-Carlo
simulations, but I didn't get any concrete results out of it before
the end of the train journey. :)

>  This would obviate
> the need for some special 'default' properties, and would be
> relatively intuitive behavior for a significant majority of people.

   Of course, you shouldn't be nesting subvolumes anyway. It makes
it much harder to manage them.

   Hugo.
Jeff Mahoney Sept. 22, 2015, 4:23 p.m. UTC | #16

On 9/22/15 9:36 AM, Holger Hoffstätte wrote:
> On 09/22/15 14:59, Jeff Mahoney wrote: (snip)
>> So if they way we want to prevent the loss of raid type info is
>> by maintaining the last block group allocated with that raid
>> type, fine, but that's a separate discussion.  Personally, I
>> think keeping 1GB
> 
> At this point I'm much more surprised to learn that the RAID type
> can apparently get "lost" in the first place, and is not persisted 
> separately. I mean..wat?
> 
>> allocated as a placeholder is a bit much.  Beyond that, I've
>> been
> 
> Can you explain why keeping at least one data chunk (or the
> appropriate number across devices, depending on RAID level..) is "a
> bit much"? IMHO this would have no real negative impact on end
> users (who think in terms of overall filesystem space, not how much
> of that has been lazily touched), nor for more obscure use cases
> like sparse images - which would still be sparsely reserved. So
> there would not be any actual downside that I can see. From a fs
> consistency point of view it sounds much more sane to assume that
> at least a basic set of data chunks always need to exist. I know I
> was very surprised recently to see all my data chunks cleaned up on
> an otherwise empty fs. I mean..it's good to see that the house
> cleaning works, but you also gotta know when to stop!

Ok, say you have a file system populated entirely with ~3k files.
Those are small enough to be inlined into the metadata trees, so such a
file system won't actually need a data chunk at all, and the placeholder
chunk would just be wasted.  It's a contrived case, sure, but not an
unrealistic one.  We've already seen workloads like this.

>> thinking casually about ways to direct the allocator to use
>> certain devices for certain things (e.g. in a hybrid system with
>> SSDs and HDDs, always allocate metadata on the SSD) and there's
>> some overlap there.  As it stands, we can fake that in mkfs but
>> it'll get stomped by balance nearly immediately.
> 
> Please share those casual thoughts with the group. :)

The simplest way would be to add a flag to disallow removal of block
groups.  If the goal is to say "this device is only to be used for
metadata at X RAID level" then it's enough to say that chunks on that
device can't be balanced away.
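
A rough model of that flag idea, with an invented bit and helper
(nothing like this exists in btrfs today):

    #include <stdio.h>

    #define BG_FLAG_PINNED (1ULL << 0)   /* invented bit, not on-disk */

    struct bg { unsigned long long flags; };

    /* Both the cleaner thread and balance would check this first. */
    static int bg_may_remove(const struct bg *bg)
    {
        return !(bg->flags & BG_FLAG_PINNED);
    }

    int main(void)
    {
        struct bg pinned = { BG_FLAG_PINNED }, normal = { 0 };

        printf("pinned: %d, normal: %d\n",
               bg_may_remove(&pinned), bg_may_remove(&normal));
        return 0;
    }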

Beyond that, we start to run into the same problems we have with RAID
levels now.  The disk format technically supports multiple RAID levels
simultaneously for both data and metadata.  How that would work in
practice gets a little tricky.  Maybe, as David said, something like
object properties are the answer.  If a subvolume is annotated as
needing storage of a particular description, then we can create chunks
meeting that requirement as required.

-Jeff

-- 
Jeff Mahoney
SUSE Labs
Austin S. Hemmelgarn Sept. 22, 2015, 5:32 p.m. UTC | #17
On 2015-09-22 11:39, Hugo Mills wrote:
> On Tue, Sep 22, 2015 at 10:54:45AM -0400, Austin S Hemmelgarn wrote:
>> On 2015-09-22 10:36, Hugo Mills wrote:
>>> On Tue, Sep 22, 2015 at 04:23:33PM +0200, David Sterba wrote:
>>>> On Tue, Sep 22, 2015 at 01:41:31PM +0000, Hugo Mills wrote:
>>>>> On Tue, Sep 22, 2015 at 03:36:43PM +0200, Holger Hoffstätte wrote:
>>>>>> On 09/22/15 14:59, Jeff Mahoney wrote:
>>>>>> (snip)
>>>>>>> So if they way we want to prevent the loss of raid type info is by
>>>>>>> maintaining the last block group allocated with that raid type, fine,
>>>>>>> but that's a separate discussion.  Personally, I think keeping 1GB
>>>>>>
>>>>>> At this point I'm much more surprised to learn that the RAID type can
>>>>>> apparently get "lost" in the first place, and is not persisted
>>>>>> separately. I mean..wat?
>>>>>
>>>>>     It's always been like that, unfortunately.
>>>>>
>>>>>     The code tries to use the RAID type that's already present to work
>>>>> out what the next allocation should be. If there aren't any chunks in
>>>>> the FS, the configuration is lost, because it's not stored anywhere
>>>>> else. It's one of the things that tripped me up badly when I was
>>>>> failing to rewrite the chunk allocator last year.
>>>>
>>>> Yeah, right now there's no persistent default for the allocator. I'm
>>>> still hoping that the object properties will magically solve that.
>>>
>>>     There's no obvious place that filesystem-wide properties can be
>>> stored, though. There's a userspace tool to manipulate the few current
>>> FS-wide properties, but that's all special-cased to use the
>>> "historical" ioctls for those properties, with no generalisation of a
>>> property store, or even (IIRC) any external API for them.
>>>
>>>     We're nominally using xattrs in the btrfs: namespace on directories
>>> and files, and presumably on the top directory of a subvolume for
>>> subvol-wide properties, but it's not clear where the FS-wide values
>>> should go: in the top directory of subvolid=5 would be confusing,
>>> because then you couldn't separate the properties for *that subvol*
>> >from the ones for the whole FS (say, the default replication policy,
>>> where you might want the top subvol to have different properties from
>>> everything else).
>> Possibly do special names for the defaults and store them there?  In
>> general, I personally see little value in having some special
>> 'default' properties however.
>
>     That would work.
>
>> The way I would expect things to work is that a new subvolume
>> inherits it's properties from it's parent (if it's a snapshot),
>
>     Definitely this.
>
>> or
>> from the next higher subvolume it's nested in.
>
>     I don't think I like this. I'm not quite sure why, though, at the
> moment.
>
>     It definitely makes the process at the start of allocating a new
> block group much more complex: you have to walk back up through an
> arbitrary depth of nested subvols to find the one that's actually got
> a replication policy record in it. (Because after this feature is
> brought in, there will be lots of filesystems without per-subvol
> replication policies in them, and we have to have some way of dealing
> with those as well).
ro-compat flag perhaps?
>
>     With an FS default policy, you only need check the current subvol,
> and then fall back to the FS default if that's not found.
>
>     These things are, I think, likely to be lightly used: I would be
> reasonably surprised to find more than two or possibly three storage
> policies in use on any given system with a sane sysadmin.
>
>     I'm actually not sure what the interactions of multiple storage
> policies are going to be like. It's entirely possible, particularly
> with some of the more exotic (but useful) suggestions I've thought of,
> that the behaviour of the FS is dependent on the order in which the
> block groups are allocated. (i.e. "20 GiB to subvol-A, then 20 GiB to
> subvol-B" results in different behaviour than "1 GiB to subvol-A then
> 1 GiB to subvol-B and repeat"). I tried some simple Monte-Carlo
> simulations, but I didn't get any concrete results out of it before
> the end of the train journey. :)
Yeah, I could easily see that getting complicated when you add in the 
(hopefully soon) possibility of n-copy replication.
>
>>   This would obviate
>> the need for some special 'default' properties, and would be
>> relatively intuitive behavior for a significant majority of people.
>
>     Of course, you shouldn't be nesting subvolumes anyway. It makes
> it much harder to manage them.
That depends though, I only ever do single nesting (i.e., a subvolume in a 
subvolume), and I use it to exclude stuff from getting saved in 
snapshots (mostly stuff like clones of public git trees, or other stuff 
that's easy to reproduce without a backup).  Beyond that though, there 
are other inherent issues of course.
Austin S. Hemmelgarn Sept. 22, 2015, 5:37 p.m. UTC | #18
On 2015-09-22 13:32, Austin S Hemmelgarn wrote:
> On 2015-09-22 11:39, Hugo Mills wrote:
>> On Tue, Sep 22, 2015 at 10:54:45AM -0400, Austin S Hemmelgarn wrote:
>>> On 2015-09-22 10:36, Hugo Mills wrote:
>>>> On Tue, Sep 22, 2015 at 04:23:33PM +0200, David Sterba wrote:
>>>>> On Tue, Sep 22, 2015 at 01:41:31PM +0000, Hugo Mills wrote:
>>>>>> On Tue, Sep 22, 2015 at 03:36:43PM +0200, Holger Hoffstätte wrote:
>>>>>>> On 09/22/15 14:59, Jeff Mahoney wrote:
>>>>>>> (snip)
>>>>>>>> So if they way we want to prevent the loss of raid type info is by
>>>>>>>> maintaining the last block group allocated with that raid type,
>>>>>>>> fine,
>>>>>>>> but that's a separate discussion.  Personally, I think keeping 1GB
>>>>>>>
>>>>>>> At this point I'm much more surprised to learn that the RAID type
>>>>>>> can
>>>>>>> apparently get "lost" in the first place, and is not persisted
>>>>>>> separately. I mean..wat?
>>>>>>
>>>>>>     It's always been like that, unfortunately.
>>>>>>
>>>>>>     The code tries to use the RAID type that's already present to
>>>>>> work
>>>>>> out what the next allocation should be. If there aren't any chunks in
>>>>>> the FS, the configuration is lost, because it's not stored anywhere
>>>>>> else. It's one of the things that tripped me up badly when I was
>>>>>> failing to rewrite the chunk allocator last year.
>>>>>
>>>>> Yeah, right now there's no persistent default for the allocator. I'm
>>>>> still hoping that the object properties will magically solve that.
>>>>
>>>>     There's no obvious place that filesystem-wide properties can be
>>>> stored, though. There's a userspace tool to manipulate the few current
>>>> FS-wide properties, but that's all special-cased to use the
>>>> "historical" ioctls for those properties, with no generalisation of a
>>>> property store, or even (IIRC) any external API for them.
>>>>
>>>>     We're nominally using xattrs in the btrfs: namespace on directories
>>>> and files, and presumably on the top directory of a subvolume for
>>>> subvol-wide properties, but it's not clear where the FS-wide values
>>>> should go: in the top directory of subvolid=5 would be confusing,
>>>> because then you couldn't separate the properties for *that subvol*
>>> >from the ones for the whole FS (say, the default replication policy,
>>>> where you might want the top subvol to have different properties from
>>>> everything else).
>>> Possibly do special names for the defaults and store them there?  In
>>> general, I personally see little value in having some special
>>> 'default' properties however.
>>
>>     That would work.
>>
>>> The way I would expect things to work is that a new subvolume
>>> inherits it's properties from it's parent (if it's a snapshot),
>>
>>     Definitely this.
>>
>>> or
>>> from the next higher subvolume it's nested in.
>>
>>     I don't think I like this. I'm not quite sure why, though, at the
>> moment.
>>
>>     It definitely makes the process at the start of allocating a new
>> block group much more complex: you have to walk back up through an
>> arbitrary depth of nested subvols to find the one that's actually got
>> a replication policy record in it. (Because after this feature is
>> brought in, there will be lots of filesystems without per-subvol
>> replication policies in them, and we have to have some way of dealing
>> with those as well).
> ro-compat flag perhaps?
>>
>>     With an FS default policy, you only need check the current subvol,
>> and then fall back to the FS default if that's not found.
>>
>>     These things are, I think, likely to be lightly used: I would be
>> reasonably surprised to find more than two or possibly three storage
>> policies in use on any given system with a sane sysadmin.
>>
>>     I'm actually not sure what the interactions of multiple storage
>> policies are going to be like. It's entirely possible, particularly
>> with some of the more exotic (but useful) suggestions I've thought of,
>> that the behaviour of the FS is dependent on the order in which the
>> block groups are allocated. (i.e. "20 GiB to subvol-A, then 20 GiB to
>> subvol-B" results in different behaviour than "1 GiB to subvol-A then
>> 1 GiB to subvol-B and repeat"). I tried some simple Monte-Carlo
>> simulations, but I didn't get any concrete results out of it before
>> the end of the train journey. :)
> Yeah, I could easily see that getting complicated when you add in the
> (hopefully soon) possibility of n-copy replication.
On that note, it might be nice to have the ability to say 'store at 
least n copies of this data' in addition to being able to say 'store 
exactly this many copies of this data'. (could be really helpful for 
filesystems with differing device sizes).
>>
>>>   This would obviate
>>> the need for some special 'default' properties, and would be
>>> relatively intuitive behavior for a significant majority of people.
>>
>>     Of course, you shouldn't be nesting subvolumes anyway. It makes
>> it much harder to manage them.
> That depends though, I only ever do single nesting (ie, a subvolume in a
> subvolume), and I use it to exclude stuff from getting saved in
> snapshots (mostly stuff like clones of public git trees, or other stuff
> that's easy to reproduce without a backup).  Beyond that though, there
> are other inherent issues of course.
Zhaolei Sept. 23, 2015, 1:59 a.m. UTC | #19
Hi, Filipe David Manana

> -----Original Message-----
> At the very least this patch should have its title and description changed - to
> reflect that it attempts to solve only (and partially) the problem of losing raid
> profile type.
>
When I found the bug while testing v4.3-rc1, the only error message was
"no space", and there was no way to trigger the "losing raid profile type"
problem in v4.3-rc1; this is why this patch was titled "fix no space bug".

Now I need to update the title, because your patch is already in
integration-4.3, so at the point where this patch applies, the bug has
changed to "losing raid profile type".

> Because the enospc issue is not caused by not having any data
> block group allocated, but it's instead due to a recent regression when
> initializing a space_info object as full, as described and proved in the patch
> mentioned before [1].
>
The space_info->full flag is the direct reason, but not the root reason.

"remove all block groups" will cause the file system lost all raidtype information,
in this case, we have no way to get raid type again in mount time.
Neither "return no space in write" nor "create single-type chunk" can not help
user to restore filesystem into right state.

> - you could run into the enospc issue even without having
> the automatic removal of empty block groups or the patch [2] introduced in
> 4.3 (just try the new xfstest [3] to verify it yourself).
> 
Yes, a balance operation can also delete all block groups in some cases.
That is a problem in the balance code.

Thanks
Zhaolei

> [1] https://patchwork.kernel.org/patch/7133451/
> [2] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=e44163e177960ee60e32a73bffdd53c3a5827406
> [3] https://git.kernel.org/cgit/fs/xfs/xfstests-dev.git/commit/?id=da82741228c7efaedae2e027efc54b1892800fe1
> 
> >
> > Thanks
> > Zhaolei
> >
> >> thanks
> >>
> >>
> >> >
> >> > Thanks
> >> > Zhaolei
> >> >
> >> >> thanks
> >> >>
> >> >> >
> >> >> > Test:
> >> >> >  Test by above script, and confirmed the logic by debug output.
> >> >> >
> >> >> > Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
> >> >> > ---
> >> >> >  fs/btrfs/extent-tree.c | 3 ++-
> >> >> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >> >> >
> >> >> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> >> >> > index 5411f0a..35cf7eb 100644
> >> >> > --- a/fs/btrfs/extent-tree.c
> >> >> > +++ b/fs/btrfs/extent-tree.c
> >> >> > @@ -10012,7 +10012,8 @@ void btrfs_delete_unused_bgs(struct
> >> >> btrfs_fs_info *fs_info)
> >> >> >                                                bg_list);
> >> >> >                 space_info = block_group->space_info;
> >> >> >                 list_del_init(&block_group->bg_list);
> >> >> > -               if (ret || btrfs_mixed_space_info(space_info)) {
> >> >> > +               if (ret || btrfs_mixed_space_info(space_info) ||
> >> >> > +                   block_group->list.next == block_group->list.prev) {
> >> >> >                         btrfs_put_block_group(block_group);
> >> >> >                         continue;
> >> >> >                 }
> >> >> > --
> >> >> > 1.8.5.1
> >> >> >
Zhaolei Sept. 23, 2015, 2:14 a.m. UTC | #20
Hi, Jeff Mahoney

> -----Original Message-----
> From: Jeff Mahoney [mailto:jeffm@suse.com]
> Sent: Tuesday, September 22, 2015 9:00 PM
> To: Zhao Lei <zhaolei@cn.fujitsu.com>; linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH] btrfs: Fix no space bug caused by removing bg
> 
> 
> On 9/21/15 8:59 AM, Zhao Lei wrote:
> > btrfs in v4.3-rc1 failed many xfstests items with '-o nospace_cache'
> > mount option.
> >
> > Failed cases are:
> > btrfs/008,016,019,020,026,027,028,029,031,041,046,048,050,051,053,054,
> > 077,083,084,087,092,094
> > generic/004,010,014,023,024,074,075,080,086,087,089,091,092,100,112,123,
> > 124,125,126,127,131,133,192,193,198,207,208,209,213,214,215,228,239,240,
> > 246,247,248,255,263,285,306,313,316,323
> >
> > We can reproduce this bug with following simple command:
> > TEST_DEV=/dev/vdh
> > TEST_DIR=/mnt/tmp
> >
> > umount "$TEST_DEV" >/dev/null
> > mkfs.btrfs -f "$TEST_DEV"
> > mount "$TEST_DEV" "$TEST_DIR"
> >
> > umount "$TEST_DEV"
> > mount "$TEST_DEV" "$TEST_DIR"
> >
> > cp /bin/bash $TEST_DIR
> >
> > Result is:
> > (omit previous commands)
> > # cp /bin/bash $TEST_DIR
> > cp: writing `/mnt/tmp/bash': No space left on device
> >
> > By bisect, we can see it is triggered by patch titled:
> > commit e44163e17796 ("btrfs: explictly delete unused block groups in
> > close_ctree and ro-remount")
> >
> > But the wrong code is not in above patch, btrfs delete all chunks if
> > no data in filesystem, and above patch just make it obviously.
> >
> > Detail reason:
> > 1: mkfs a blank filesystem, or delete everything in filesystem
> > 2: umount fs (current code will delete all data chunks)
> > 3: mount fs
> >    Because no any data chunks, data's space_cache have no chance
> >    to init, it means: space_info->total_bytes == 0, and
> >    space_info->full == 1.
> > 4: do some write
> >    Current code will ignore chunk allocate because space_info->full,
> >    and return -ENOSPC.
> >
> > Fix: Don't auto-delete last blockgroup for a raid type.
> 
> The only reason not to do this is the loss off the raid type, which is an issue that
> needs to be addressed separately.  As Filipe said in his responses, this isn't
> the source of the problem - it just makes it obvious.  We've seen in actual bug
> reports that it's possible to create exactly the same scenario by balancing an
> empty file system.
> 
I replied to this in Filipe's mail; in short, I'll change the patch title
and description.

> So if they way we want to prevent the loss of raid type info is by maintaining the
> last block group allocated with that raid type, fine, but that's a separate
> discussion.  Personally, I think keeping 1GB allocated as a placeholder is a bit
> much.  Beyond that, I've been thinking casually about ways to direct the
> allocator to use certain devices for certain things (e.g. in a hybrid system with
> SSDs and HDDs, always allocate metadata on the SSD) and there's some
> overlap there.
Good idea!
It will increase performance.

Thanks
Zhaolei

> As it stands, we can fake that in mkfs but it'll get stomped by
> balance nearly immediately.
> 
> - -Jeff
> 
> > If we delete all blockgroup for a raidtype, it not only cause above
> > bug, but also may change filesystem to all-single in some case.
> >
> > Test:
> > Test by above script, and confirmed the logic by debug output.
> >
> > Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
> > ---
> >  fs/btrfs/extent-tree.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 5411f0a..35cf7eb 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -10012,7 +10012,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
> >                                                bg_list);
> >                 space_info = block_group->space_info;
> >                 list_del_init(&block_group->bg_list);
> > -               if (ret || btrfs_mixed_space_info(space_info)) {
> > +               if (ret || btrfs_mixed_space_info(space_info) ||
> > +                   block_group->list.next == block_group->list.prev) {
> >                         btrfs_put_block_group(block_group);
> >                         continue;
> >                 }
> >
> 
> 
> - --
> Jeff Mahoney
> SUSE Labs

Duncan Sept. 23, 2015, 4:49 a.m. UTC | #21
Austin S Hemmelgarn posted on Tue, 22 Sep 2015 13:32:58 -0400 as
excerpted:

>> Of course, you shouldn't be nesting subvolumes anyway. It makes
>> it much harder to manage them.
> 
> That depends though, I only ever do single nesting (ie, a subvolume in a
> subvolume), and I use it to exclude stuff from getting saved in
> snapshots

Good point.

Such snapshot-exclusion-via-subvolume-boundary is a frequently 
recommended use-case for nocow files, of course, to keep them from 
fragmenting due to snapshot-provoked COW.

If there's no (other) reason to mount the subvolumes separately, either 
because they're primarily snapshot-exclusion subvolumes or for whatever 
other reason they aren't intended to be specifically mounted, but 
instead, simply used as ordinary subdirs for most purposes, then nesting 
really /does/ make sense, because relying on the normal subdir behavior, 
that does avoid the otherwise necessary separate mount.


Which, meanwhile, brings up a particular bone I have to pick with 
systemd's use of btrfs subvolumes in various cases, particularly in 
tmpfiles.d usage, as well.  Since IIRC systemd 219, various previously 
created directories, including the (virtual-) machines tree, are by 
default subvolume-creations, now, of course only if they don't already 
exist.

These subvolumes are explicitly nested at whatever location they happen 
to appear in the subtree, as tmpfiles.d simply creates them as subvolumes, 
with a fallback to the standard subdirs they were created as in earlier 
versions on filesystems other than btrfs.  Other than the standard mount 
processing, which involves admins specifically setting up the appropriate 
subvolumes and mounts for them ahead of time, there's no provision made 
for creating them in a standard subvolume location at root level, and 
mounting them at whatever tree location their final destination is, as is 
standard recommended practice for btrfs "upstream".

Tho personally, I just prefer that they stay as subdirs, since I don't 
have any particular need to and therefore don't want to complicate my 
btrfs use-case with subvolumes, when I already break my tree up into 
multiple independent btrfs /filesystems/ using standard partitioning, 
etc, specifically in order to avoid having so many data eggs in one 
filesystem basket, and thus avoid the risk of losing them all should the 
bottom of that single basket fall out with the filesystem going bad on 
me.  I know from hard learned experience with much-too-big-mdraid arrays 
how painful maintenance operations can be when they take hours or days, 
compared to the seconds (due to ssd) or minutes (were they on spinning 
rust) maintenance such as balance or scrub now takes me on each of my all 
well under hundred-gig separate btrfs, with the difference even more 
pronounced since most of the time there's only one or two filesystems I 
need to do maintenance on, instead of the whole tree.

So I use multiple small independent btrfs, and don't want or need 
subvolumes, which would only complicate my use-case.  Fortunately, I can 
avoid systemd creating those subvolumes even tho I'm on btrfs, the normal 
target for them, by simply already having the subdirs created... as 
subdirs.  Tho the initial 219 implementation bugged out on them 
(subvolume-create error because my root filesystem is deliberately left 
read-only by default, which propagated thru to a tmpfiles.d service 
failure), I filed a bug, and by 221 (IIRC) tmpfiles.d was working 
correctly, leaving my existing subdirs alone.

Unfortunately, now I have to worry about manual subdir creation to avoid 
the automated subvolume creation on a new install, or should I forget, 
worry about discovering and undoing the automated subvolume create.

Oh, well...
David Sterba Sept. 23, 2015, 1:12 p.m. UTC | #22
On Tue, Sep 22, 2015 at 02:36:02PM +0000, Hugo Mills wrote:
> > Yeah, right now there's no persistent default for the allocator. I'm
> > still hoping that the object properties will magically solve that.
> 
>    There's no obvious place that filesystem-wide properties can be
> stored, though. There's a userspace tool to manipulate the few current
> FS-wide properties, but that's all special-cased to use the
> "historical" ioctls for those properties, with no generalisation of a
> property store, or even (IIRC) any external API for them.

From the UI point of view, we proposed to add a specifier that would
route the property to either the subvolume or the filesystem:

$ btrfs prop set -t filesystem bgtype raid0
$ btrfs prop set -t subvolume bgtype raid1

How this will get stored in the xattrs is another question. As there's
always only a single instance of the filesystem properties, it could be
something like 'btrfs.fs.bgtype' and be stored as an xattr of the
toplevel subvolume.
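
Hypothetically, if the property did end up as an xattr on the toplevel
subvolume, userspace could set it with a plain setxattr() call; the
attribute name below is just the proposal above, not an existing
interface:

    #include <stdio.h>
    #include <sys/xattr.h>

    int main(void)
    {
        const char *mnt = "/mnt";  /* toplevel subvolume mount point */

        /* "btrfs.fs.bgtype" is the name proposed above, not a real xattr. */
        if (setxattr(mnt, "btrfs.fs.bgtype", "raid1", 5, 0) != 0)
            perror("setxattr");
        return 0;
    }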
Qu Wenruo Sept. 23, 2015, 1:19 p.m. UTC | #23
On 2015/09/23 21:12, David Sterba wrote:
> On Tue, Sep 22, 2015 at 02:36:02PM +0000, Hugo Mills wrote:
>>> Yeah, right now there's no persistent default for the allocator. I'm
>>> still hoping that the object properties will magically solve that.
>>
>>     There's no obvious place that filesystem-wide properties can be
>> stored, though. There's a userspace tool to manipulate the few current
>> FS-wide properties, but that's all special-cased to use the
>> "historical" ioctls for those properties, with no generalisation of a
>> property store, or even (IIRC) any external API for them.
>
>  From the UI point, we proposed to add a specifier that would route the
> property to either subvolume or the filesystem:
>
> $ btrfs prop set -t filesystem bgtype raid0
> $ btrfs prop set -t subvolume bgtype raid1
>

BTW, is btrfs going to support different chunk/bg types per subvolume?!
I thought data/meta/system chunk types were all at the filesystem level,
and I was planning to use the superblock to record them...

If we really support that, does it mean we will have different meta/data 
types for each subvolume?
That's a little too flexible for me....

Thanks,
Qu

> How this will get stored in the xattrs is another question. As there's
> always only a single instance of the filesystem properties, it could be
> something like 'btrfs.fs.bgtype' and be stored as a xattr of the
> toplevel subvolume.
>
Hugo Mills Sept. 23, 2015, 1:28 p.m. UTC | #24
On Wed, Sep 23, 2015 at 03:12:26PM +0200, David Sterba wrote:
> On Tue, Sep 22, 2015 at 02:36:02PM +0000, Hugo Mills wrote:
> > > Yeah, right now there's no persistent default for the allocator. I'm
> > > still hoping that the object properties will magically solve that.
> > 
> >    There's no obvious place that filesystem-wide properties can be
> > stored, though. There's a userspace tool to manipulate the few current
> > FS-wide properties, but that's all special-cased to use the
> > "historical" ioctls for those properties, with no generalisation of a
> > property store, or even (IIRC) any external API for them.
> 
> From the UI point of view, we proposed to add a specifier that would route the
> property to either subvolume or the filesystem:
> 
> $ btrfs prop set -t filesystem bgtype raid0
> $ btrfs prop set -t subvolume bgtype raid1
> 
> How this will get stored in the xattrs is another question. As there's
> always only a single instance of the filesystem properties, it could be
> something like 'btrfs.fs.bgtype' and be stored as an xattr of the
> toplevel subvolume.

   That's what Austin suggested as well. Makes sense to me.

   Hugo.
David Sterba Sept. 23, 2015, 1:28 p.m. UTC | #25
On Tue, Sep 22, 2015 at 03:39:30PM +0000, Hugo Mills wrote:
> > The way I would expect things to work is that a new subvolume
> > inherits its properties from its parent (if it's a snapshot),
> 
>    Definitely this.
> 
> > or
> > from the next higher subvolume it's nested in.
> 
>    I don't think I like this. I'm not quite sure why, though, at the
> moment.

I don't like inheritance from other than the parent subvolume because
this makes things less obvious.

>    It definitely makes the process at the start of allocating a new
> block group much more complex: you have to walk back up through an
> arbitrary depth of nested subvols to find the one that's actually got
> a replication policy record in it. (Because after this feature is
> brought in, there will be lots of filesystems without per-subvol
> replication policies in them, and we have to have some way of dealing
> with those as well).
> 
>    With an FS default policy, you only need check the current subvol,
> and then fall back to the FS default if that's not found.

That looks reasonable to me.

>    These things are, I think, likely to be lightly used: I would be
> reasonably surprised to find more than two or possibly three storage
> policies in use on any given system with a sane sysadmin.

Agreed. At the moment I'm thinking about all the configuration
possibilities we want to give to the users. E.g. the inheritance could be
made configurable at the property level.

The usecase: the toplevel has compression enabled but I don't want any
new subvolume to share this property automatically.

(The blockgroup type is probably a bad example for configurable
inheritance as it would not work for shared extents if the type is
different.)

>    I'm actually not sure what the interactions of multiple storage
> policies are going to be like. It's entirely possible, particularly
> with some of the more exotic (but useful) suggestions I've thought of,
> that the behaviour of the FS is dependent on the order in which the
> block groups are allocated. (i.e. "20 GiB to subvol-A, then 20 GiB to
> subvol-B" results in different behaviour than "1 GiB to subvol-A then
> 1 GiB to subvol-B and repeat"). I tried some simple Monte-Carlo
> simulations, but I didn't get any concrete results out of it before
> the end of the train journey. :)
> 
> >  This would obviate
> > the need for some special 'default' properties, and would be
> > relatively intuitive behavior for a significant majority of people.
> 
>    Of course, you shouldn't be nesting subvolumes anyway. It makes
> it much harder to manage them.

Depends, there are valid usecases for nested subvolumes but yeah, the
management is harder.
Austin S. Hemmelgarn Sept. 23, 2015, 1:32 p.m. UTC | #26
On 2015-09-23 09:19, Qu Wenruo wrote:
>
>
> ? 2015?09?23? 21:12, David Sterba ??:
>> On Tue, Sep 22, 2015 at 02:36:02PM +0000, Hugo Mills wrote:
>>>> Yeah, right now there's no persistent default for the allocator. I'm
>>>> still hoping that the object properties will magically solve that.
>>>
>>>     There's no obvious place that filesystem-wide properties can be
>>> stored, though. There's a userspace tool to manipulate the few current
>>> FS-wide properties, but that's all special-cased to use the
>>> "historical" ioctls for those properties, with no generalisation of a
>>> property store, or even (IIRC) any external API for them.
>>
>>  From the UI point of view, we proposed to add a specifier that would route the
>> property to either subvolume or the filesystem:
>>
>> $ btrfs prop set -t filesystem bgtype raid0
>> $ btrfs prop set -t subvolume bgtype raid1
>>
>
> BTW, is btrfs going to support different chunk/bg types per subvolume?!
> I thought data/meta/system chunk types were all at the per-filesystem
> level, and was planning to use the superblock to record them...
>
> If we really support that, does it mean we will have different meta/data
> types for each subvolume?
> That's a little too flexible for me....
>
This has actually been a planned feature for a while, and really is 
needed to compete with the flexibility that ZFS gives for this kind of 
thing.  System chunk types should still be set separately (although, 
once we have n-way replication, they really should be set separately 
from metadata, with at least one copy per device in multi-device filesystems).
David Sterba Sept. 23, 2015, 1:37 p.m. UTC | #27
On Wed, Sep 23, 2015 at 09:19:38PM +0800, Qu Wenruo wrote:
> >  From the UI point of view, we proposed to add a specifier that would route the
> > property to either subvolume or the filesystem:
> >
> > $ btrfs prop set -t filesystem bgtype raid0
> > $ btrfs prop set -t subvolume bgtype raid1
> 
> BTW, is btrfs going to support different chunk/bg types per subvolume?!

People have been asking for that for years.

> I thought data/meta/system chunk types were all at the per-filesystem
> level, and was planning to use the superblock to record them...

Any superblock change like that will lead to backward incompatibilities
worse than just storing some key:value as an extended attribute. Older
kernels will just ignore the constraints, but the filesystem will still be
mountable/usable.

> If we really support that, does it mean we will have different meta/data
> types for each subvolume?
> That's a little too flexible for me....

btrfs design is ready for this kind of flexibility. The hard task is to
stay sane and offer something flexible yet manageable.
Hugo Mills Sept. 23, 2015, 1:45 p.m. UTC | #28
On Wed, Sep 23, 2015 at 09:19:38PM +0800, Qu Wenruo wrote:
> On 2015/09/23 21:12, David Sterba wrote:
> >On Tue, Sep 22, 2015 at 02:36:02PM +0000, Hugo Mills wrote:
> >>>Yeah, right now there's no persistent default for the allocator. I'm
> >>>still hoping that the object properties will magically solve that.
> >>
> >>    There's no obvious place that filesystem-wide properties can be
> >>stored, though. There's a userspace tool to manipulate the few current
> >>FS-wide properties, but that's all special-cased to use the
> >>"historical" ioctls for those properties, with no generalisation of a
> >>property store, or even (IIRC) any external API for them.
> >
> > From the UI point of view, we proposed to add a specifier that would route the
> >property to either subvolume or the filesystem:
> >
> >$ btrfs prop set -t filesystem bgtype raid0
> >$ btrfs prop set -t subvolume bgtype raid1
> >
> 
> BTW, is btrfs going to support different chunk/bg types per subvolume?!
> I thought data/meta/system chunk types were all at the per-filesystem
> level, and was planning to use the superblock to record them...

   Well, it's been talked about for a *long* time, and it's certainly
something for which there are use-cases, and non-zero demand from users.

> If we really support that, does it mean we will have different
> meta/data types for each subvolume?
> That's a little too flexible for me....

   I think you'd have to have a single metadata type for the whole FS
-- it doesn't make much sense to try splitting that up "by subvolume",
as almost all of the data structures are common to all subvolumes. For
data, however, it should certainly be possible to specify that a
subvol's data be stored with a given replication policy.

   While David and Austin's suggestion of a special xattr on the top
level subvol for the FS default is one possible implementation, using
the superblock would work just as well -- with the minor detail that
you'd have to special-case it into the properties UI, whereas using
xattrs, there would be less special-casing involved.

   Hugo.

Austin S. Hemmelgarn Sept. 23, 2015, 1:57 p.m. UTC | #29
On 2015-09-23 09:28, David Sterba wrote:
> On Tue, Sep 22, 2015 at 03:39:30PM +0000, Hugo Mills wrote:
>>> The way I would expect things to work is that a new subvolume
>>> inherits its properties from its parent (if it's a snapshot),
>>
>>     Definitely this.
>>
>>> or
>>> from the next higher subvolume it's nested in.
>>
>>     I don't think I like this. I'm not quite sure why, though, at the
>> moment.
>
> I don't like inheritance from other than the parent subvolume because
> this makes things less obvious.
Possibly, but that depends on how you view things.  Internally, 
subvolumes are independent of each other, but to a regular user (or 
anything that just uses the VFS layer), they look hierarchical, and as 
such, without knowing the internals of the FS (which no regular user 
should need to know), I would expect the profile to propagate down from 
the (apparently) next higher subvolume.

To be clear, I mean that it should just copy the properties 
from the next higher subvolume at creation, and if the next higher 
subvolume doesn't have the properties set, then just use the filesystem 
defaults (and don't try to walk back up more than one level).
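
A rough sketch of that rule as I understand it, with entirely made-up
names:

/* Sketch of the proposed one-level inheritance: at subvolume
 * creation, copy the property from the (apparently) next higher
 * subvolume; if it is unset there, fall back to the filesystem
 * default. Never walk up more than one level. */
struct subvol_props {
        int bgtype_set;         /* property explicitly set here? */
        int bgtype;             /* e.g. RAID1, RAID5, ... */
};

static int bgtype_for_new_subvol(const struct subvol_props *enclosing,
                                 const struct subvol_props *fs_default)
{
        if (enclosing && enclosing->bgtype_set)
                return enclosing->bgtype;       /* copied at creation */
        return fs_default->bgtype;              /* one-level fallback */
}
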
>>     It definitely makes the process at the start of allocating a new
>> block group much more complex: you have to walk back up through an
>> arbitrary depth of nested subvols to find the one that's actually got
>> a replication policy record in it. (Because after this feature is
>> brought in, there will be lots of filesystems without per-subvol
>> replication policies in them, and we have to have some way of dealing
>> with those as well).
>>
>>     With an FS default policy, you only need check the current subvol,
>> and then fall back to the FS default if that's not found.
>
> That looks reasonable to me.
>
>>     These things are, I think, likely to be lightly used: I would be
>> reasonably surprised to find more than two or possibly three storage
>> policies in use on any given system with a sane sysadmin.
>
> Agreed. At the moment I'm thinking about all the configuration
> possibilities we want to give to the users. E.g. the inheritance could be
> made configurable at the property level.
>
> The usecase: the toplevel has compression enabled but I don't want any
> new subvolume to share this property automatically.
>
> (The blockgroup type is probably a bad example for configurable
> inheritance as it would not work for shared extents if the type is
> different.)
The ideal situation in my opinion WRT block-group profile inheritance 
would be that when an extent becomes shared, it gets re-striped at the 
highest raid profile (prioritizing 3-copy or higher replication over 
raid56) of any of the files that share the extent.  This would result in 
some more I/O than the usual clone operation in some cases, but any 
other method has the potential to silently reduce the degree of data 
safety for a given file.
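
A sketch of the selection rule that implies, using an invented redundancy
ranking (how exactly 3-copy compares to raid6 is precisely the policy
question):

/* Sketch only: an invented ranking expressing "prioritize 3-copy or
 * higher replication over raid56"; not btrfs's profile encoding. */
enum profile { P_SINGLE, P_RAID0, P_RAID5, P_RAID6, P_RAID1, P_3COPY };

static const int redundancy_rank[] = {
        [P_SINGLE] = 0,
        [P_RAID0]  = 0,
        [P_RAID5]  = 1,         /* survives one device loss */
        [P_RAID6]  = 2,         /* survives two device losses */
        [P_RAID1]  = 3,         /* replication ranked above parity */
        [P_3COPY]  = 4,
};

/* Profile a newly shared extent would be re-striped to. */
static enum profile shared_extent_profile(enum profile a, enum profile b)
{
        return redundancy_rank[a] >= redundancy_rank[b] ? a : b;
}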
Qu Wenruo Sept. 23, 2015, 2 p.m. UTC | #30
On 2015/09/23 21:32, Austin S Hemmelgarn wrote:
> On 2015-09-23 09:19, Qu Wenruo wrote:
>>
>>
>> On 2015/09/23 21:12, David Sterba wrote:
>>> On Tue, Sep 22, 2015 at 02:36:02PM +0000, Hugo Mills wrote:
>>>>> Yeah, right now there's no persistent default for the allocator. I'm
>>>>> still hoping that the object properties will magically solve that.
>>>>
>>>>     There's no obvious place that filesystem-wide properties can be
>>>> stored, though. There's a userspace tool to manipulate the few current
>>>> FS-wide properties, but that's all special-cased to use the
>>>> "historical" ioctls for those properties, with no generalisation of a
>>>> property store, or even (IIRC) any external API for them.
>>>
>>>  From the UI point of view, we proposed to add a specifier that would route the
>>> property to either subvolume or the filesystem:
>>>
>>> $ btrfs prop set -t filesystem bgtype raid0
>>> $ btrfs prop set -t subvolume bgtype raid1
>>>
>>
>> BTW, is btrfs going to support different chunk/bg types per subvolume?!
>> I thought data/meta/system chunk types were all at the per-filesystem
>> level, and was planning to use the superblock to record them...
>>
>> If we really support that, does it mean we will have different meta/data
>> types for each subvolume?
>> That's a little too flexible for me....
>>
> This has actually been a planned feature for a while, and really is
> needed to compete with the flexibility that ZFS gives for this kind of
> thing.  System chunk types should still be set separately (although,
> once we have n-way replication, they really should be set separately
> from metadata, with at least one copy per device in multi-device filesystems).
>
>

Thanks, David and Austin.

As it's already planned, I think it will need a new incompat flag 
anyway, or an old kernel could easily remove/convert the desired profile.
So what about adding new ROOT_ITEM members to record the data/meta profile?
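
Purely to illustrate the shape of that idea (field names and types below
are made up, not a proposed on-disk format):

/* Hypothetical new root item members; __le64 stands in for the
 * kernel's little-endian on-disk integer type. */
typedef unsigned long long __le64;

struct btrfs_root_item_profile_ext {
        __le64 data_profile;    /* e.g. a BTRFS_BLOCK_GROUP_* flag */
        __le64 meta_profile;
};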

I'm not a big fan of using xattrs: they're quite easy to modify from user 
space, and I'm not completely confident about the possible concurrency 
between xattr modification and chunk allocation.

And at the logical level, an xattr is at the same level as an inode, but 
a subvolume is definitely at a higher level, even if it's normally treated 
as a directory.
So for me, I'd like to record it in the root_item, not as a subvolume 
xattr, even if it may need an extra ioctl to handle it.

Anyway, I'm just a newcomer to this feature; it's completely OK to 
ignore my stupid ideas.

BTW, what about the original patch? I hope it can be merged in the next 
RC, as it affects quite a lot of test cases...

Thanks,
Qu
Hugo Mills Sept. 23, 2015, 2:05 p.m. UTC | #31
On Wed, Sep 23, 2015 at 03:28:29PM +0200, David Sterba wrote:
> On Tue, Sep 22, 2015 at 03:39:30PM +0000, Hugo Mills wrote:
> > > The way I would expect things to work is that a new subvolume
> > > inherits its properties from its parent (if it's a snapshot),
> > 
> >    Definitely this.
> > 
> > > or
> > > from the next higher subvolume it's nested in.
> > 
> >    I don't think I like this. I'm not quite sure why, though, at the
> > moment.
> 
> I don't like inheritance from other than the parent subvolume because
> this makes things less obvious.
> 
> >    It definitely makes the process at the start of allocating a new
> > block group much more complex: you have to walk back up through an
> > arbitrary depth of nested subvols to find the one that's actually got
> > a replication policy record in it. (Because after this feature is
> > brought in, there will be lots of filesystems without per-subvol
> > replication policies in them, and we have to have some way of dealing
> > with those as well).
> > 
> >    With an FS default policy, you only need check the current subvol,
> > and then fall back to the FS default if that's not found.
> 
> That looks reasonable to me.
> 
> >    These things are, I think, likely to be lightly used: I would be
> > reasonably surprised to find more than two or possibly three storage
> > policies in use on any given system with a sane sysadmin.
> 
> Agreed. At the moment I'm thinking about all the configuration
> possibilities we want to give to the users. E.g. the inheritance could be
> made configurable at the property level.
> 
> The usecase: the toplevel has compression enabled but I don't want any
> new subvolume to share this property automatically.
> 
> (The blockgroup type is probably a bad example for configurable
> inheritance as it would not work for shared extents if the type is
> different.)

   There are some major questions about the semantics and the UI for
per-subvol RAID, too:

(Assume that A is RAID-1)
# btrfs sub snap A A'
# btrfs prop set A' raid raid-5

   What happens? Do we convert A' to RAID-5 immediately, doubling (or
more) the disk usage? Do we do lazy conversion, so that only new data
in A' is written to RAID-5? Do we need a "btrfs sub restripe A'"
command? Is directly setting properties actually the right interface
to use here? (Reading properties... yes, but we still have the
possible issues of getting information on partially-converted
subvols).

# btrfs sub snap A A'         # A' is also RAID-1 here: inherit from A
# btrfs prop set A' raid raid-5
# btrfs sub restripe A'
(you wait, time passes)
# btrfs sub snap A A''
# btrfs prop set A'' raid raid-5
# btrfs sub restripe A''

   How do I make A' and A'' pick up their sharing again? Dedup? Will
this cause problems with incremental send?

# btrfs balance start -dconvert=single A

   What does this do? Convert the whole FS's data to single? Just
subvol A? If the whole FS, will all the subvols' raid xattrs be
dropped? Or will only the FS-default ones be changed? If only the
FS-default xattrs are changed, then a write to any non-default subvol
will revert to its configured RAID level, and we almost immediately
lose the "everything single" configuration requested by the convert...

# btrfs balance start -mconvert=single A

   What does this do? (Probably convert the whole FS's metadata to
single -- which may well answer the previous question as well, unless
we want to surprise users).

   I'll probably think of some more questions later, but that's enough
to be going on with. :)

   Hugo.
David Sterba Sept. 23, 2015, 5:28 p.m. UTC | #32
On Wed, Sep 23, 2015 at 10:00:12PM +0800, Qu Wenruo wrote:
> As it's already planned, I think it will need a new incompat flag
> anyway, or an old kernel could easily remove/convert the desired profile.

The usecase with the old kernel is closer to the rescue scenario than
regular use. We do have support for read-only compatibility (grep for
COMPAT_RO), so all of this might be covered by such an incompat flag.
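
Mount-time gating with such flags generally follows a common pattern;
below is a schematic sketch with generic names, not the actual btrfs
identifiers:

/* Schematic of feature-bit gating at mount time. */
#include <errno.h>

typedef unsigned long long u64;         /* stand-in for the kernel's u64 */

#define INCOMPAT_SUPPORTED      0x3ULL  /* bits this kernel understands */
#define COMPAT_RO_SUPPORTED     0x1ULL

static int check_features(u64 incompat, u64 compat_ro, int readonly)
{
        /* Unknown incompat bit: refuse to mount at all. */
        if (incompat & ~INCOMPAT_SUPPORTED)
                return -EINVAL;

        /* Unknown compat_ro bit: permit read-only mounts only. */
        if (!readonly && (compat_ro & ~COMPAT_RO_SUPPORTED))
                return -EACCES;

        return 0;
}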

> So what about adding new ROOT_ITEM members to record data/meta profile?

The profile is not the only property we'd like to add. It would be possible
to add it to btrfs_root_item in a safe way, like the new send-related
items.

> I'm not a big fan of using xattrs: they're quite easy to modify from user
> space, and I'm not completely confident about the possible concurrency
> between xattr modification and chunk allocation.

The point is to be able to modify it from userspace, but in a safe way.
The xattrs are in their own namespace and have handlers that propagate
the changes anywhere they're needed. The idea is to cache the values
alongside the in-memory structures and only update them via the
handlers.
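
For context, btrfs's existing properties (fs/btrfs/props.c) are wired up
through a per-property handler table along roughly these lines; the
'bgtype' entry sketched here is hypothetical:

/* Handler-table pattern, kernel-style sketch; the bgtype property
 * itself is invented. */
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct inode;

struct prop_handler {
        const char *xattr_name;         /* e.g. "btrfs.bgtype" (invented) */
        int (*validate)(const char *value, size_t len);
        int (*apply)(struct inode *inode, const char *value, size_t len);
        const char *(*extract)(struct inode *inode);
        int inheritable;
};

static int prop_bgtype_validate(const char *value, size_t len)
{
        /* Accept only profile names we know; sketch only. */
        if (len == 5 && (!strncmp(value, "raid0", len) ||
                         !strncmp(value, "raid1", len)))
                return 0;
        return -EINVAL;
}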

> And at the logical level, an xattr is at the same level as an inode, but
> a subvolume is definitely at a higher level, even if it's normally treated
> as a directory.
> So for me, I'd like to record it in the root_item, not as a subvolume
> xattr, even if it may need an extra ioctl to handle it.

The xattrs give us extensibility that does not require changes in the
size of the root_item structure. Basically, the unknown and unsupported
xattrs (properties) are invisible in the b-tree.

> Anyway, I'm just a newcomer to this feature; it's completely OK to
> ignore my stupid ideas.

No worries. We had similar discussions in the past; I proposed special
ioctls first, but then the xattrs were found more suitable. So we're
letting you know the status, but there's still some chance that we missed
something and you can point it out.
diff mbox

Patch

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5411f0a..35cf7eb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10012,7 +10012,8 @@  void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 					       bg_list);
 		space_info = block_group->space_info;
 		list_del_init(&block_group->bg_list);
-		if (ret || btrfs_mixed_space_info(space_info)) {
+		if (ret || btrfs_mixed_space_info(space_info) ||
+		    block_group->list.next == block_group->list.prev) {
 			btrfs_put_block_group(block_group);
 			continue;
 		}
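
For readers puzzling over the new condition: a node that is linked into a
circular doubly-linked list and whose next and prev pointers are equal
must have both of them pointing at the list head, i.e. it is the only
entry on that list. So, as I read it, the test skips deletion when this
is the last block group on its space_info list, in other words the last
one of its raid type. A tiny userspace mock of the kernel's list_head
demonstrating the invariant:

#include <assert.h>
#include <stdio.h>

/* Minimal mock of the kernel's circular doubly-linked list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

/* Insert n right after h, as the kernel's list_add() does. */
static void list_add(struct list_head *n, struct list_head *h)
{
        n->next = h->next;
        n->prev = h;
        h->next->prev = n;
        h->next = n;
}

int main(void)
{
        struct list_head head, a, b;

        list_init(&head);
        list_add(&a, &head);
        /* Sole entry: next and prev both point at the head. */
        assert(a.next == a.prev);

        list_add(&b, &head);
        /* Two entries: a's neighbours now differ (b vs. head). */
        assert(a.next != a.prev);

        printf("next == prev iff the node is the list's only entry\n");
        return 0;
}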