btrfs: Continue replace when set_block_ro failed

Message ID	786c7fb7179fe73f419c5ab9bea9f86d49510962.1447411735.git.zhaolei@cn.fujitsu.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <linux-btrfs-owner@kernel.org> From: Zhao Lei <zhaolei@cn.fujitsu.com> To: <linux-btrfs@vger.kernel.org> CC: Zhao Lei <zhaolei@cn.fujitsu.com> Subject: [PATCH] btrfs: Continue replace when set_block_ro failed Date: Fri, 13 Nov 2015 19:38:27 +0800 Message-ID: <786c7fb7179fe73f419c5ab9bea9f86d49510962.1447411735.git.zhaolei@cn.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk

Zhaolei Nov. 13, 2015, 11:38 a.m. UTC

xfstests/011 failed in node with small_size filesystem.
Can be reproduced by following script:
  DEV_LIST="/dev/vdd /dev/vde"
  DEV_REPLACE="/dev/vdf"

  do_test()
  {
      local mkfs_opt="$1"
      local size="$2"

      dmesg -c >/dev/null
      umount $SCRATCH_MNT &>/dev/null

      echo  mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}"
      mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1
      mount "${DEV_LIST[0]}" $SCRATCH_MNT

      echo -n "Writing big files"
      dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M count=1 >/dev/null 2>&1
      for ((i = 1; i <= size; i++)); do
          echo -n .
          /bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1
      done
      echo

      echo Start replace
      btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE" $SCRATCH_MNT || {
          dmesg
          return 1
      }
      return 0
  }

  # Set size to value near fs size
  # for example, 1897 can trigger this bug in 2.6G device.
  #
  ./do_test "-d raid1 -m raid1" 1897

System will report replace fail with following warning in dmesg:

Reason:
 When big data writen to fs, the whole free space will be allocated
 for data chunk.
 And operation as scrub need to set_block_ro(), and when there is
 only one metadata chunk in system(or other metadata chunks
 are all full), the function will try to allocate a new chunk,
 and failed because no space in device.

Fix:
 When set_block_ro failed for metadata chunk, it is not a problem
 because scrub_lock forbids commit_trancaction.
 Let replace continue in this case is no problem.

Tested by above script, and xfstests/011, plus 100 times xfstests/070.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
---
 fs/btrfs/scrub.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

Filipe Manana Nov. 13, 2015, 12:01 p.m. UTC | #1

On Fri, Nov 13, 2015 at 11:38 AM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
> xfstests/011 failed in node with small_size filesystem.
> Can be reproduced by following script:
>   DEV_LIST="/dev/vdd /dev/vde"
>   DEV_REPLACE="/dev/vdf"
>
>   do_test()
>   {
>       local mkfs_opt="$1"
>       local size="$2"
>
>       dmesg -c >/dev/null
>       umount $SCRATCH_MNT &>/dev/null
>
>       echo  mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}"
>       mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1
>       mount "${DEV_LIST[0]}" $SCRATCH_MNT
>
>       echo -n "Writing big files"
>       dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M count=1 >/dev/null 2>&1
>       for ((i = 1; i <= size; i++)); do
>           echo -n .
>           /bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1
>       done
>       echo
>
>       echo Start replace
>       btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE" $SCRATCH_MNT || {
>           dmesg
>           return 1
>       }
>       return 0
>   }
>
>   # Set size to value near fs size
>   # for example, 1897 can trigger this bug in 2.6G device.
>   #
>   ./do_test "-d raid1 -m raid1" 1897
>
> System will report replace fail with following warning in dmesg:
>
> Reason:
>  When big data writen to fs, the whole free space will be allocated
>  for data chunk.
>  And operation as scrub need to set_block_ro(), and when there is
>  only one metadata chunk in system(or other metadata chunks
>  are all full), the function will try to allocate a new chunk,
>  and failed because no space in device.
>
> Fix:
>  When set_block_ro failed for metadata chunk, it is not a problem
>  because scrub_lock forbids commit_trancaction.

Hi Zhao, I'm confused reading this explanation.

Why isn't it a problem if the block group gets modified while the
transaction is ongoing? Through writepages() for example.
Committing a transaction isn't the only way to write data or metadata
to a block group.

thanks

>  Let replace continue in this case is no problem.
>
> Tested by above script, and xfstests/011, plus 100 times xfstests/070.
>
> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
> ---
>  fs/btrfs/scrub.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index a39f5d1..5a30753 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3532,6 +3532,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>         u64 length;
>         u64 chunk_offset;
>         int ret = 0;
> +       int ro_set = 0;
>         int slot;
>         struct extent_buffer *l;
>         struct btrfs_key key;
> @@ -3617,10 +3618,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>                 scrub_pause_on(fs_info);
>                 ret = btrfs_inc_block_group_ro(root, cache);
>                 scrub_pause_off(fs_info);
> -               if (ret) {
> -                       btrfs_put_block_group(cache);
> -                       break;
> -               }
> +               ro_set = !ret;
>
>                 dev_replace->cursor_right = found_key.offset + length;
>                 dev_replace->cursor_left = found_key.offset;
> @@ -3660,7 +3658,8 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>
>                 scrub_pause_off(fs_info);
>
> -               btrfs_dec_block_group_ro(root, cache);
> +               if (ro_set)
> +                       btrfs_dec_block_group_ro(root, cache);
>
>                 btrfs_put_block_group(cache);
>                 if (ret)
> --
> 1.8.5.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Zhaolei Nov. 16, 2015, 10:07 a.m. UTC | #2

Hi, Filipe Manana

Thanks for review.

> -----Original Message-----
> From: Filipe Manana [mailto:fdmanana@gmail.com]
> Sent: Friday, November 13, 2015 8:02 PM
> To: Zhao Lei <zhaolei@cn.fujitsu.com>
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH] btrfs: Continue replace when set_block_ro failed
> 
> On Fri, Nov 13, 2015 at 11:38 AM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
> > xfstests/011 failed in node with small_size filesystem.
> > Can be reproduced by following script:
> >   DEV_LIST="/dev/vdd /dev/vde"
> >   DEV_REPLACE="/dev/vdf"
> >
> >   do_test()
> >   {
> >       local mkfs_opt="$1"
> >       local size="$2"
> >
> >       dmesg -c >/dev/null
> >       umount $SCRATCH_MNT &>/dev/null
> >
> >       echo  mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}"
> >       mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1
> >       mount "${DEV_LIST[0]}" $SCRATCH_MNT
> >
> >       echo -n "Writing big files"
> >       dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M
> count=1 >/dev/null 2>&1
> >       for ((i = 1; i <= size; i++)); do
> >           echo -n .
> >           /bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1
> >       done
> >       echo
> >
> >       echo Start replace
> >       btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE"
> $SCRATCH_MNT || {
> >           dmesg
> >           return 1
> >       }
> >       return 0
> >   }
> >
> >   # Set size to value near fs size
> >   # for example, 1897 can trigger this bug in 2.6G device.
> >   #
> >   ./do_test "-d raid1 -m raid1" 1897
> >
> > System will report replace fail with following warning in dmesg:
> >
> > Reason:
> >  When big data writen to fs, the whole free space will be allocated
> > for data chunk.
> >  And operation as scrub need to set_block_ro(), and when there is
> > only one metadata chunk in system(or other metadata chunks  are all
> > full), the function will try to allocate a new chunk,  and failed
> > because no space in device.
> >
> > Fix:
> >  When set_block_ro failed for metadata chunk, it is not a problem
> > because scrub_lock forbids commit_trancaction.
> 
> Hi Zhao, I'm confused reading this explanation.
> 
> Why isn't it a problem if the block group gets modified while the transaction is
> ongoing? Through writepages() for example.
> Committing a transaction isn't the only way to write data or metadata to a
> block group.
> 
Metadata chunks always updated in cow, the writepages() will write
new data to different place with operation done by replace.

Thanks
Zhaolei

> thanks
> 
> >  Let replace continue in this case is no problem.
> >
> > Tested by above script, and xfstests/011, plus 100 times xfstests/070.
> >
> > Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
> > ---
> >  fs/btrfs/scrub.c | 9 ++++-----
> >  1 file changed, 4 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index
> > a39f5d1..5a30753 100644
> > --- a/fs/btrfs/scrub.c
> > +++ b/fs/btrfs/scrub.c
> > @@ -3532,6 +3532,7 @@ int scrub_enumerate_chunks(struct scrub_ctx
> *sctx,
> >         u64 length;
> >         u64 chunk_offset;
> >         int ret = 0;
> > +       int ro_set = 0;
> >         int slot;
> >         struct extent_buffer *l;
> >         struct btrfs_key key;
> > @@ -3617,10 +3618,7 @@ int scrub_enumerate_chunks(struct scrub_ctx
> *sctx,
> >                 scrub_pause_on(fs_info);
> >                 ret = btrfs_inc_block_group_ro(root, cache);
> >                 scrub_pause_off(fs_info);
> > -               if (ret) {
> > -                       btrfs_put_block_group(cache);
> > -                       break;
> > -               }
> > +               ro_set = !ret;
> >
> >                 dev_replace->cursor_right = found_key.offset + length;
> >                 dev_replace->cursor_left = found_key.offset; @@
> > -3660,7 +3658,8 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> >
> >                 scrub_pause_off(fs_info);
> >
> > -               btrfs_dec_block_group_ro(root, cache);
> > +               if (ro_set)
> > +                       btrfs_dec_block_group_ro(root, cache);
> >
> >                 btrfs_put_block_group(cache);
> >                 if (ret)
> > --
> > 1.8.5.1
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
> > in the body of a message to majordomo@vger.kernel.org More majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> --
> Filipe David Manana,
> 
> "Reasonable men adapt themselves to the world.
>  Unreasonable men adapt the world to themselves.
>  That's why all progress depends on unreasonable men."

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Filipe Manana Nov. 16, 2015, 10:27 a.m. UTC | #3

On Mon, Nov 16, 2015 at 10:07 AM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
> Hi, Filipe Manana
>
> Thanks for review.
>
>> -----Original Message-----
>> From: Filipe Manana [mailto:fdmanana@gmail.com]
>> Sent: Friday, November 13, 2015 8:02 PM
>> To: Zhao Lei <zhaolei@cn.fujitsu.com>
>> Cc: linux-btrfs@vger.kernel.org
>> Subject: Re: [PATCH] btrfs: Continue replace when set_block_ro failed
>>
>> On Fri, Nov 13, 2015 at 11:38 AM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
>> > xfstests/011 failed in node with small_size filesystem.
>> > Can be reproduced by following script:
>> >   DEV_LIST="/dev/vdd /dev/vde"
>> >   DEV_REPLACE="/dev/vdf"
>> >
>> >   do_test()
>> >   {
>> >       local mkfs_opt="$1"
>> >       local size="$2"
>> >
>> >       dmesg -c >/dev/null
>> >       umount $SCRATCH_MNT &>/dev/null
>> >
>> >       echo  mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}"
>> >       mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1
>> >       mount "${DEV_LIST[0]}" $SCRATCH_MNT
>> >
>> >       echo -n "Writing big files"
>> >       dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M
>> count=1 >/dev/null 2>&1
>> >       for ((i = 1; i <= size; i++)); do
>> >           echo -n .
>> >           /bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1
>> >       done
>> >       echo
>> >
>> >       echo Start replace
>> >       btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE"
>> $SCRATCH_MNT || {
>> >           dmesg
>> >           return 1
>> >       }
>> >       return 0
>> >   }
>> >
>> >   # Set size to value near fs size
>> >   # for example, 1897 can trigger this bug in 2.6G device.
>> >   #
>> >   ./do_test "-d raid1 -m raid1" 1897
>> >
>> > System will report replace fail with following warning in dmesg:
>> >
>> > Reason:
>> >  When big data writen to fs, the whole free space will be allocated
>> > for data chunk.
>> >  And operation as scrub need to set_block_ro(), and when there is
>> > only one metadata chunk in system(or other metadata chunks  are all
>> > full), the function will try to allocate a new chunk,  and failed
>> > because no space in device.
>> >
>> > Fix:
>> >  When set_block_ro failed for metadata chunk, it is not a problem
>> > because scrub_lock forbids commit_trancaction.
>>
>> Hi Zhao, I'm confused reading this explanation.
>>
>> Why isn't it a problem if the block group gets modified while the transaction is
>> ongoing? Through writepages() for example.
>> Committing a transaction isn't the only way to write data or metadata to a
>> block group.
>>
> Metadata chunks always updated in cow, the writepages() will write
> new data to different place with operation done by replace.

So why do we try to set the block group ro? Your change is really
confusing - if we fail setting the block group to read-only mode, we
just ignore it and proceed - why bother in the first place to set it
read-only? Someone reading the code will find it fishy, and going back
to git history, your change log doesn't really explains why this is
being done and why is it safe to do it.

You also forgot to paste the warning you mention below "System will
report replace fail with following warning in dmesg:" in the change
log.

thanks

>
> Thanks
> Zhaolei
>
>> thanks
>>
>> >  Let replace continue in this case is no problem.
>> >
>> > Tested by above script, and xfstests/011, plus 100 times xfstests/070.
>> >
>> > Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
>> > ---
>> >  fs/btrfs/scrub.c | 9 ++++-----
>> >  1 file changed, 4 insertions(+), 5 deletions(-)
>> >
>> > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index
>> > a39f5d1..5a30753 100644
>> > --- a/fs/btrfs/scrub.c
>> > +++ b/fs/btrfs/scrub.c
>> > @@ -3532,6 +3532,7 @@ int scrub_enumerate_chunks(struct scrub_ctx
>> *sctx,
>> >         u64 length;
>> >         u64 chunk_offset;
>> >         int ret = 0;
>> > +       int ro_set = 0;
>> >         int slot;
>> >         struct extent_buffer *l;
>> >         struct btrfs_key key;
>> > @@ -3617,10 +3618,7 @@ int scrub_enumerate_chunks(struct scrub_ctx
>> *sctx,
>> >                 scrub_pause_on(fs_info);
>> >                 ret = btrfs_inc_block_group_ro(root, cache);
>> >                 scrub_pause_off(fs_info);
>> > -               if (ret) {
>> > -                       btrfs_put_block_group(cache);
>> > -                       break;
>> > -               }
>> > +               ro_set = !ret;
>> >
>> >                 dev_replace->cursor_right = found_key.offset + length;
>> >                 dev_replace->cursor_left = found_key.offset; @@
>> > -3660,7 +3658,8 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>> >
>> >                 scrub_pause_off(fs_info);
>> >
>> > -               btrfs_dec_block_group_ro(root, cache);
>> > +               if (ro_set)
>> > +                       btrfs_dec_block_group_ro(root, cache);
>> >
>> >                 btrfs_put_block_group(cache);
>> >                 if (ret)
>> > --
>> > 1.8.5.1
>> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
>> > in the body of a message to majordomo@vger.kernel.org More majordomo
>> > info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> Filipe David Manana,
>>
>> "Reasonable men adapt themselves to the world.
>>  Unreasonable men adapt the world to themselves.
>>  That's why all progress depends on unreasonable men."
>

Zhaolei Nov. 16, 2015, 10:44 a.m. UTC | #4

Hi, Filipe Manana

> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org
> [mailto:linux-btrfs-owner@vger.kernel.org] On Behalf Of Filipe Manana
> Sent: Monday, November 16, 2015 6:27 PM
> To: Zhao Lei <zhaolei@cn.fujitsu.com>
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH] btrfs: Continue replace when set_block_ro failed
> 
> On Mon, Nov 16, 2015 at 10:07 AM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
> > Hi, Filipe Manana
> >
> > Thanks for review.
> >
> >> -----Original Message-----
> >> From: Filipe Manana [mailto:fdmanana@gmail.com]
> >> Sent: Friday, November 13, 2015 8:02 PM
> >> To: Zhao Lei <zhaolei@cn.fujitsu.com>
> >> Cc: linux-btrfs@vger.kernel.org
> >> Subject: Re: [PATCH] btrfs: Continue replace when set_block_ro failed
> >>
> >> On Fri, Nov 13, 2015 at 11:38 AM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
> >> > xfstests/011 failed in node with small_size filesystem.
> >> > Can be reproduced by following script:
> >> >   DEV_LIST="/dev/vdd /dev/vde"
> >> >   DEV_REPLACE="/dev/vdf"
> >> >
> >> >   do_test()
> >> >   {
> >> >       local mkfs_opt="$1"
> >> >       local size="$2"
> >> >
> >> >       dmesg -c >/dev/null
> >> >       umount $SCRATCH_MNT &>/dev/null
> >> >
> >> >       echo  mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}"
> >> >       mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1
> >> >       mount "${DEV_LIST[0]}" $SCRATCH_MNT
> >> >
> >> >       echo -n "Writing big files"
> >> >       dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M
> >> count=1 >/dev/null 2>&1
> >> >       for ((i = 1; i <= size; i++)); do
> >> >           echo -n .
> >> >           /bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1
> >> >       done
> >> >       echo
> >> >
> >> >       echo Start replace
> >> >       btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE"
> >> $SCRATCH_MNT || {
> >> >           dmesg
> >> >           return 1
> >> >       }
> >> >       return 0
> >> >   }
> >> >
> >> >   # Set size to value near fs size
> >> >   # for example, 1897 can trigger this bug in 2.6G device.
> >> >   #
> >> >   ./do_test "-d raid1 -m raid1" 1897
> >> >
> >> > System will report replace fail with following warning in dmesg:
> >> >
> >> > Reason:
> >> >  When big data writen to fs, the whole free space will be allocated
> >> > for data chunk.
> >> >  And operation as scrub need to set_block_ro(), and when there is
> >> > only one metadata chunk in system(or other metadata chunks  are all
> >> > full), the function will try to allocate a new chunk,  and failed
> >> > because no space in device.
> >> >
> >> > Fix:
> >> >  When set_block_ro failed for metadata chunk, it is not a problem
> >> > because scrub_lock forbids commit_trancaction.
> >>
> >> Hi Zhao, I'm confused reading this explanation.
> >>
> >> Why isn't it a problem if the block group gets modified while the
> >> transaction is ongoing? Through writepages() for example.
> >> Committing a transaction isn't the only way to write data or metadata
> >> to a block group.
> >>
> > Metadata chunks always updated in cow, the writepages() will write new
> > data to different place with operation done by replace.
> 
> So why do we try to set the block group ro? Your change is really confusing - if
> we fail setting the block group to read-only mode, we just ignore it and proceed
> - why bother in the first place to set it read-only?
>
I find a problem that set_group_ro will failed for metadata when disk almost full,
but we also have a good news that the time when set_group_ro failed is
just when it is not necessary.

> Someone reading the code will find it fishy, and going back to git history,
> your change log doesn't really explains why this is being done and why is
> it safe to do it.
> 
Thanks for notice, it is necessary to add detail explanation into changelog.

> You also forgot to paste the warning you mention below "System will report
> replace fail with following warning in dmesg:" in the change log.
> 
Agree, I'll paste it in changelog.

Thanks
Zhaolei

> thanks
> 
> >
> > Thanks
> > Zhaolei
> >
> >> thanks
> >>
> >> >  Let replace continue in this case is no problem.
> >> >
> >> > Tested by above script, and xfstests/011, plus 100 times xfstests/070.
> >> >
> >> > Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
> >> > ---
> >> >  fs/btrfs/scrub.c | 9 ++++-----
> >> >  1 file changed, 4 insertions(+), 5 deletions(-)
> >> >
> >> > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index
> >> > a39f5d1..5a30753 100644
> >> > --- a/fs/btrfs/scrub.c
> >> > +++ b/fs/btrfs/scrub.c
> >> > @@ -3532,6 +3532,7 @@ int scrub_enumerate_chunks(struct scrub_ctx
> >> *sctx,
> >> >         u64 length;
> >> >         u64 chunk_offset;
> >> >         int ret = 0;
> >> > +       int ro_set = 0;
> >> >         int slot;
> >> >         struct extent_buffer *l;
> >> >         struct btrfs_key key;
> >> > @@ -3617,10 +3618,7 @@ int scrub_enumerate_chunks(struct scrub_ctx
> >> *sctx,
> >> >                 scrub_pause_on(fs_info);
> >> >                 ret = btrfs_inc_block_group_ro(root, cache);
> >> >                 scrub_pause_off(fs_info);
> >> > -               if (ret) {
> >> > -                       btrfs_put_block_group(cache);
> >> > -                       break;
> >> > -               }
> >> > +               ro_set = !ret;
> >> >
> >> >                 dev_replace->cursor_right = found_key.offset +
> length;
> >> >                 dev_replace->cursor_left = found_key.offset; @@
> >> > -3660,7 +3658,8 @@ int scrub_enumerate_chunks(struct scrub_ctx
> >> > *sctx,
> >> >
> >> >                 scrub_pause_off(fs_info);
> >> >
> >> > -               btrfs_dec_block_group_ro(root, cache);
> >> > +               if (ro_set)
> >> > +                       btrfs_dec_block_group_ro(root, cache);
> >> >
> >> >                 btrfs_put_block_group(cache);
> >> >                 if (ret)
> >> > --
> >> > 1.8.5.1
> >> >
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
> >> > in the body of a message to majordomo@vger.kernel.org More
> >> > majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >>
> >>
> >> --
> >> Filipe David Manana,
> >>
> >> "Reasonable men adapt themselves to the world.
> >>  Unreasonable men adapt the world to themselves.
> >>  That's why all progress depends on unreasonable men."
> >
> 
> 
> 
> --
> Filipe David Manana,
> 
> "Reasonable men adapt themselves to the world.
>  Unreasonable men adapt the world to themselves.
>  That's why all progress depends on unreasonable men."
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body
> of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Filipe Manana Nov. 16, 2015, 10:57 a.m. UTC | #5

On Mon, Nov 16, 2015 at 10:44 AM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
> Hi, Filipe Manana
>
>> -----Original Message-----
>> From: linux-btrfs-owner@vger.kernel.org
>> [mailto:linux-btrfs-owner@vger.kernel.org] On Behalf Of Filipe Manana
>> Sent: Monday, November 16, 2015 6:27 PM
>> To: Zhao Lei <zhaolei@cn.fujitsu.com>
>> Cc: linux-btrfs@vger.kernel.org
>> Subject: Re: [PATCH] btrfs: Continue replace when set_block_ro failed
>>
>> On Mon, Nov 16, 2015 at 10:07 AM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
>> > Hi, Filipe Manana
>> >
>> > Thanks for review.
>> >
>> >> -----Original Message-----
>> >> From: Filipe Manana [mailto:fdmanana@gmail.com]
>> >> Sent: Friday, November 13, 2015 8:02 PM
>> >> To: Zhao Lei <zhaolei@cn.fujitsu.com>
>> >> Cc: linux-btrfs@vger.kernel.org
>> >> Subject: Re: [PATCH] btrfs: Continue replace when set_block_ro failed
>> >>
>> >> On Fri, Nov 13, 2015 at 11:38 AM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
>> >> > xfstests/011 failed in node with small_size filesystem.
>> >> > Can be reproduced by following script:
>> >> >   DEV_LIST="/dev/vdd /dev/vde"
>> >> >   DEV_REPLACE="/dev/vdf"
>> >> >
>> >> >   do_test()
>> >> >   {
>> >> >       local mkfs_opt="$1"
>> >> >       local size="$2"
>> >> >
>> >> >       dmesg -c >/dev/null
>> >> >       umount $SCRATCH_MNT &>/dev/null
>> >> >
>> >> >       echo  mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}"
>> >> >       mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1
>> >> >       mount "${DEV_LIST[0]}" $SCRATCH_MNT
>> >> >
>> >> >       echo -n "Writing big files"
>> >> >       dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M
>> >> count=1 >/dev/null 2>&1
>> >> >       for ((i = 1; i <= size; i++)); do
>> >> >           echo -n .
>> >> >           /bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1
>> >> >       done
>> >> >       echo
>> >> >
>> >> >       echo Start replace
>> >> >       btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE"
>> >> $SCRATCH_MNT || {
>> >> >           dmesg
>> >> >           return 1
>> >> >       }
>> >> >       return 0
>> >> >   }
>> >> >
>> >> >   # Set size to value near fs size
>> >> >   # for example, 1897 can trigger this bug in 2.6G device.
>> >> >   #
>> >> >   ./do_test "-d raid1 -m raid1" 1897
>> >> >
>> >> > System will report replace fail with following warning in dmesg:
>> >> >
>> >> > Reason:
>> >> >  When big data writen to fs, the whole free space will be allocated
>> >> > for data chunk.
>> >> >  And operation as scrub need to set_block_ro(), and when there is
>> >> > only one metadata chunk in system(or other metadata chunks  are all
>> >> > full), the function will try to allocate a new chunk,  and failed
>> >> > because no space in device.
>> >> >
>> >> > Fix:
>> >> >  When set_block_ro failed for metadata chunk, it is not a problem
>> >> > because scrub_lock forbids commit_trancaction.
>> >>
>> >> Hi Zhao, I'm confused reading this explanation.
>> >>
>> >> Why isn't it a problem if the block group gets modified while the
>> >> transaction is ongoing? Through writepages() for example.
>> >> Committing a transaction isn't the only way to write data or metadata
>> >> to a block group.
>> >>
>> > Metadata chunks always updated in cow, the writepages() will write new
>> > data to different place with operation done by replace.
>>
>> So why do we try to set the block group ro? Your change is really confusing - if
>> we fail setting the block group to read-only mode, we just ignore it and proceed
>> - why bother in the first place to set it read-only?
>>
> I find a problem that set_group_ro will failed for metadata when disk almost full,
> but we also have a good news that the time when set_group_ro failed is
> just when it is not necessary.

Yes - but the code will ignore the failure for any error code, not
just -ENOSPC. Surely btrfs_inc_block_group_ro() can return error codes
different from -ENOSPC. You should make it clear in the code (and in
the change log) why ENOSPC is such a special case - why we need to set
the block to RO and why we can ignore the ENOSPC failure.

>
>> Someone reading the code will find it fishy, and going back to git history,
>> your change log doesn't really explains why this is being done and why is
>> it safe to do it.
>>
> Thanks for notice, it is necessary to add detail explanation into changelog.
>
>> You also forgot to paste the warning you mention below "System will report
>> replace fail with following warning in dmesg:" in the change log.
>>
> Agree, I'll paste it in changelog.
>
> Thanks
> Zhaolei
>
>> thanks
>>
>> >
>> > Thanks
>> > Zhaolei
>> >
>> >> thanks
>> >>
>> >> >  Let replace continue in this case is no problem.
>> >> >
>> >> > Tested by above script, and xfstests/011, plus 100 times xfstests/070.
>> >> >
>> >> > Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
>> >> > ---
>> >> >  fs/btrfs/scrub.c | 9 ++++-----
>> >> >  1 file changed, 4 insertions(+), 5 deletions(-)
>> >> >
>> >> > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index
>> >> > a39f5d1..5a30753 100644
>> >> > --- a/fs/btrfs/scrub.c
>> >> > +++ b/fs/btrfs/scrub.c
>> >> > @@ -3532,6 +3532,7 @@ int scrub_enumerate_chunks(struct scrub_ctx
>> >> *sctx,
>> >> >         u64 length;
>> >> >         u64 chunk_offset;
>> >> >         int ret = 0;
>> >> > +       int ro_set = 0;
>> >> >         int slot;
>> >> >         struct extent_buffer *l;
>> >> >         struct btrfs_key key;
>> >> > @@ -3617,10 +3618,7 @@ int scrub_enumerate_chunks(struct scrub_ctx
>> >> *sctx,
>> >> >                 scrub_pause_on(fs_info);
>> >> >                 ret = btrfs_inc_block_group_ro(root, cache);
>> >> >                 scrub_pause_off(fs_info);
>> >> > -               if (ret) {
>> >> > -                       btrfs_put_block_group(cache);
>> >> > -                       break;
>> >> > -               }
>> >> > +               ro_set = !ret;
>> >> >
>> >> >                 dev_replace->cursor_right = found_key.offset +
>> length;
>> >> >                 dev_replace->cursor_left = found_key.offset; @@
>> >> > -3660,7 +3658,8 @@ int scrub_enumerate_chunks(struct scrub_ctx
>> >> > *sctx,
>> >> >
>> >> >                 scrub_pause_off(fs_info);
>> >> >
>> >> > -               btrfs_dec_block_group_ro(root, cache);
>> >> > +               if (ro_set)
>> >> > +                       btrfs_dec_block_group_ro(root, cache);
>> >> >
>> >> >                 btrfs_put_block_group(cache);
>> >> >                 if (ret)
>> >> > --
>> >> > 1.8.5.1
>> >> >
>> >> > --
>> >> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
>> >> > in the body of a message to majordomo@vger.kernel.org More
>> >> > majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>
>> >>
>> >>
>> >> --
>> >> Filipe David Manana,
>> >>
>> >> "Reasonable men adapt themselves to the world.
>> >>  Unreasonable men adapt the world to themselves.
>> >>  That's why all progress depends on unreasonable men."
>> >
>>
>>
>>
>> --
>> Filipe David Manana,
>>
>> "Reasonable men adapt themselves to the world.
>>  Unreasonable men adapt the world to themselves.
>>  That's why all progress depends on unreasonable men."
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body
>> of a message to majordomo@vger.kernel.org More majordomo info at
>> http://vger.kernel.org/majordomo-info.html
>

Zhaolei Nov. 16, 2015, 11:43 a.m. UTC | #6

Hi, Filipe Manana

> -----Original Message-----
> From: Filipe Manana [mailto:fdmanana@gmail.com]
> Sent: Monday, November 16, 2015 6:57 PM
> To: Zhao Lei <zhaolei@cn.fujitsu.com>
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH] btrfs: Continue replace when set_block_ro failed
> 
> On Mon, Nov 16, 2015 at 10:44 AM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
> > Hi, Filipe Manana
> >
> >> -----Original Message-----
> >> From: linux-btrfs-owner@vger.kernel.org
> >> [mailto:linux-btrfs-owner@vger.kernel.org] On Behalf Of Filipe Manana
> >> Sent: Monday, November 16, 2015 6:27 PM
> >> To: Zhao Lei <zhaolei@cn.fujitsu.com>
> >> Cc: linux-btrfs@vger.kernel.org
> >> Subject: Re: [PATCH] btrfs: Continue replace when set_block_ro failed
> >>
> >> On Mon, Nov 16, 2015 at 10:07 AM, Zhao Lei <zhaolei@cn.fujitsu.com>
> wrote:
> >> > Hi, Filipe Manana
> >> >
> >> > Thanks for review.
> >> >
> >> >> -----Original Message-----
> >> >> From: Filipe Manana [mailto:fdmanana@gmail.com]
> >> >> Sent: Friday, November 13, 2015 8:02 PM
> >> >> To: Zhao Lei <zhaolei@cn.fujitsu.com>
> >> >> Cc: linux-btrfs@vger.kernel.org
> >> >> Subject: Re: [PATCH] btrfs: Continue replace when set_block_ro
> >> >> failed
> >> >>
> >> >> On Fri, Nov 13, 2015 at 11:38 AM, Zhao Lei <zhaolei@cn.fujitsu.com>
> wrote:
> >> >> > xfstests/011 failed in node with small_size filesystem.
> >> >> > Can be reproduced by following script:
> >> >> >   DEV_LIST="/dev/vdd /dev/vde"
> >> >> >   DEV_REPLACE="/dev/vdf"
> >> >> >
> >> >> >   do_test()
> >> >> >   {
> >> >> >       local mkfs_opt="$1"
> >> >> >       local size="$2"
> >> >> >
> >> >> >       dmesg -c >/dev/null
> >> >> >       umount $SCRATCH_MNT &>/dev/null
> >> >> >
> >> >> >       echo  mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}"
> >> >> >       mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1
> >> >> >       mount "${DEV_LIST[0]}" $SCRATCH_MNT
> >> >> >
> >> >> >       echo -n "Writing big files"
> >> >> >       dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M
> >> >> count=1 >/dev/null 2>&1
> >> >> >       for ((i = 1; i <= size; i++)); do
> >> >> >           echo -n .
> >> >> >           /bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return
> 1
> >> >> >       done
> >> >> >       echo
> >> >> >
> >> >> >       echo Start replace
> >> >> >       btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE"
> >> >> $SCRATCH_MNT || {
> >> >> >           dmesg
> >> >> >           return 1
> >> >> >       }
> >> >> >       return 0
> >> >> >   }
> >> >> >
> >> >> >   # Set size to value near fs size
> >> >> >   # for example, 1897 can trigger this bug in 2.6G device.
> >> >> >   #
> >> >> >   ./do_test "-d raid1 -m raid1" 1897
> >> >> >
> >> >> > System will report replace fail with following warning in dmesg:
> >> >> >
> >> >> > Reason:
> >> >> >  When big data writen to fs, the whole free space will be
> >> >> > allocated for data chunk.
> >> >> >  And operation as scrub need to set_block_ro(), and when there
> >> >> > is only one metadata chunk in system(or other metadata chunks
> >> >> > are all full), the function will try to allocate a new chunk,
> >> >> > and failed because no space in device.
> >> >> >
> >> >> > Fix:
> >> >> >  When set_block_ro failed for metadata chunk, it is not a
> >> >> > problem because scrub_lock forbids commit_trancaction.
> >> >>
> >> >> Hi Zhao, I'm confused reading this explanation.
> >> >>
> >> >> Why isn't it a problem if the block group gets modified while the
> >> >> transaction is ongoing? Through writepages() for example.
> >> >> Committing a transaction isn't the only way to write data or
> >> >> metadata to a block group.
> >> >>
> >> > Metadata chunks always updated in cow, the writepages() will write
> >> > new data to different place with operation done by replace.
> >>
> >> So why do we try to set the block group ro? Your change is really
> >> confusing - if we fail setting the block group to read-only mode, we
> >> just ignore it and proceed
> >> - why bother in the first place to set it read-only?
> >>
> > I find a problem that set_group_ro will failed for metadata when disk
> > almost full, but we also have a good news that the time when
> > set_group_ro failed is just when it is not necessary.
> 
> Yes - but the code will ignore the failure for any error code, not just -ENOSPC.
> Surely btrfs_inc_block_group_ro() can return error codes different from
> -ENOSPC. You should make it clear in the code (and in the change log) why
> ENOSPC is such a special case - why we need to set the block to RO and why we
> can ignore the ENOSPC failure.
> 
Thanks for notice.
Will make, test and send v2.

Thanks
Zhaolei

> >
> >> Someone reading the code will find it fishy, and going back to git
> >> history, your change log doesn't really explains why this is being
> >> done and why is it safe to do it.
> >>
> > Thanks for notice, it is necessary to add detail explanation into changelog.
> >
> >> You also forgot to paste the warning you mention below "System will
> >> report replace fail with following warning in dmesg:" in the change log.
> >>
> > Agree, I'll paste it in changelog.
> >
> > Thanks
> > Zhaolei
> >
> >> thanks
> >>
> >> >
> >> > Thanks
> >> > Zhaolei
> >> >
> >> >> thanks
> >> >>
> >> >> >  Let replace continue in this case is no problem.
> >> >> >
> >> >> > Tested by above script, and xfstests/011, plus 100 times xfstests/070.
> >> >> >
> >> >> > Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
> >> >> > ---
> >> >> >  fs/btrfs/scrub.c | 9 ++++-----
> >> >> >  1 file changed, 4 insertions(+), 5 deletions(-)
> >> >> >
> >> >> > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index
> >> >> > a39f5d1..5a30753 100644
> >> >> > --- a/fs/btrfs/scrub.c
> >> >> > +++ b/fs/btrfs/scrub.c
> >> >> > @@ -3532,6 +3532,7 @@ int scrub_enumerate_chunks(struct
> >> >> > scrub_ctx
> >> >> *sctx,
> >> >> >         u64 length;
> >> >> >         u64 chunk_offset;
> >> >> >         int ret = 0;
> >> >> > +       int ro_set = 0;
> >> >> >         int slot;
> >> >> >         struct extent_buffer *l;
> >> >> >         struct btrfs_key key;
> >> >> > @@ -3617,10 +3618,7 @@ int scrub_enumerate_chunks(struct
> >> >> > scrub_ctx
> >> >> *sctx,
> >> >> >                 scrub_pause_on(fs_info);
> >> >> >                 ret = btrfs_inc_block_group_ro(root, cache);
> >> >> >                 scrub_pause_off(fs_info);
> >> >> > -               if (ret) {
> >> >> > -                       btrfs_put_block_group(cache);
> >> >> > -                       break;
> >> >> > -               }
> >> >> > +               ro_set = !ret;
> >> >> >
> >> >> >                 dev_replace->cursor_right = found_key.offset +
> >> length;
> >> >> >                 dev_replace->cursor_left = found_key.offset; @@
> >> >> > -3660,7 +3658,8 @@ int scrub_enumerate_chunks(struct scrub_ctx
> >> >> > *sctx,
> >> >> >
> >> >> >                 scrub_pause_off(fs_info);
> >> >> >
> >> >> > -               btrfs_dec_block_group_ro(root, cache);
> >> >> > +               if (ro_set)
> >> >> > +                       btrfs_dec_block_group_ro(root, cache);
> >> >> >
> >> >> >                 btrfs_put_block_group(cache);
> >> >> >                 if (ret)
> >> >> > --
> >> >> > 1.8.5.1
> >> >> >
> >> >> > --
> >> >> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
> >> >> > in the body of a message to majordomo@vger.kernel.org More
> >> >> > majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Filipe David Manana,
> >> >>
> >> >> "Reasonable men adapt themselves to the world.
> >> >>  Unreasonable men adapt the world to themselves.
> >> >>  That's why all progress depends on unreasonable men."
> >> >
> >>
> >>
> >>
> >> --
> >> Filipe David Manana,
> >>
> >> "Reasonable men adapt themselves to the world.
> >>  Unreasonable men adapt the world to themselves.
> >>  That's why all progress depends on unreasonable men."
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe
> >> linux-btrfs" in the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
> 
> 
> 
> --
> Filipe David Manana,
> 
> "Reasonable men adapt themselves to the world.
>  Unreasonable men adapt the world to themselves.
>  That's why all progress depends on unreasonable men."

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

btrfs: Continue replace when set_block_ro failed

Commit Message

Comments

Patch