
[RFC] Btrfs: expose bad chunks in sysfs

Message ID 20180205231502.12900-1-bo.li.liu@oracle.com (mailing list archive)
State New, archived

Commit Message

Liu Bo Feb. 5, 2018, 11:15 p.m. UTC
Btrfs tries its best to tolerate write errors, but it does so mostly
silently (apart from some messages in the kernel log).

For raid1 and raid10 this is usually not a problem because there is a
copy to serve as a backup, while for parity based raid setups, i.e.
raid5 and raid6, the problem is that if a write error occurs due to
some bad sectors, one horizontal stripe becomes degraded and the number
of write errors it can tolerate is reduced by one; if two disks then
fail, data may be lost forever.

One way to mitigate the data loss pain is to expose 'bad chunks',
i.e. degraded chunks, to users, so that they can use 'btrfs balance'
to relocate the whole chunk and get the full raid6 protection again
(if the relocation works).

This introduces 'bad_chunks' in btrfs's per-fs sysfs directory.  Once
a chunk of raid5 or raid6 becomes degraded, it will appear in
'bad_chunks'.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
- In this patch, 'bad chunks' is not persistent on disk, but it can be
  added if it's thought to be a good idea.
- This is lightly tested, comments are very welcome.

 fs/btrfs/ctree.h       |  8 +++++++
 fs/btrfs/disk-io.c     |  2 ++
 fs/btrfs/extent-tree.c | 13 +++++++++++
 fs/btrfs/raid56.c      | 59 ++++++++++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/sysfs.c       | 26 ++++++++++++++++++++++
 fs/btrfs/volumes.c     | 15 +++++++++++--
 fs/btrfs/volumes.h     |  2 ++
 7 files changed, 121 insertions(+), 4 deletions(-)

Comments

Qu Wenruo Feb. 6, 2018, 1:28 a.m. UTC | #1
On 2018年02月06日 07:15, Liu Bo wrote:
> Btrfs tries its best to tolerate write errors, but kind of silently
> (except some messages in kernel log).
> 
> For raid1 and raid10, this is usually not a problem because there is a
> copy as backup, while for parity based raid setup, i.e. raid5 and
> raid6, the problem is that, if a write error occurs due to some bad
> sectors, one horizonal stripe becomes degraded and the number of write
> errors it can tolerate gets reduced by one, now if two disk fails,
> data may be lost forever.
> 
> One way to mitigate the data loss pain is to expose 'bad chunks',
> i.e. degraded chunks, to users, so that they can use 'btrfs balance'
> to relocate the whole chunk and get the full raid6 protection again
> (if the relocation works).
> 
> This introduces 'bad_chunks' in btrfs's per-fs sysfs directory.  Once
> a chunk of raid5 or raid6 becomes degraded, it will appear in
> 'bad_chunks'.

Sysfs looks good.

Although other systems use their own interfaces to report such status:
mdadm uses /proc/mdstat to show it, LVM uses lvdisplay/lvs.

So here comes a new sysfs interface.

> 
> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> ---
> - In this patch, 'bad chunks' is not persistent on disk, but it can be
>   added if it's thought to be a good idea.

IMHO such a bad chunks list could be built from the existing dev status
at mount time.

Although using dev status may cause extra problems like false alerts.

> - This is lightly tested, comments are very welcome.

Just checked the code; there are 2 concerns:

1) The way to remove a bad chunk
   Currently it can only be removed when the chunk itself is removed.
   If any transient write error happened, the bad chunk will just stay
   there forever (if not removed).

   That seems to cause false alerts.

   And extra logic in the kernel to determine whether a chunk is really
   bad seems a little complex and less flexible.
   (Maybe an interface to inform userspace where the problem happens is
    more flexible?)

2) Bad chunks are only added on writes
   The read routine should also be able to detect bad chunks, with better
   accuracy.

> 
>  fs/btrfs/ctree.h       |  8 +++++++
>  fs/btrfs/disk-io.c     |  2 ++
>  fs/btrfs/extent-tree.c | 13 +++++++++++
>  fs/btrfs/raid56.c      | 59 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  fs/btrfs/sysfs.c       | 26 ++++++++++++++++++++++
>  fs/btrfs/volumes.c     | 15 +++++++++++--
>  fs/btrfs/volumes.h     |  2 ++
>  7 files changed, 121 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 13c260b..08aad65 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1101,6 +1101,9 @@ struct btrfs_fs_info {
>  	spinlock_t ref_verify_lock;
>  	struct rb_root block_tree;
>  #endif
> +
> +	struct list_head bad_chunks;

An rbtree may be better here, since iterating a list to remove a bad
chunk can sometimes be slow.
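
A rough sketch of an rbtree keyed by chunk_offset (hypothetical names,
not part of this patch):

struct btrfs_bad_chunk {
	u64 chunk_offset;
	struct rb_node node;
};

/* Caller holds the write side of fs_info->bc_lock. */
static bool insert_bad_chunk(struct rb_root *root, struct btrfs_bad_chunk *bc)
{
	struct rb_node **p = &root->rb_node;
	struct rb_node *parent = NULL;

	while (*p) {
		struct btrfs_bad_chunk *cur;

		parent = *p;
		cur = rb_entry(parent, struct btrfs_bad_chunk, node);
		if (bc->chunk_offset < cur->chunk_offset)
			p = &(*p)->rb_left;
		else if (bc->chunk_offset > cur->chunk_offset)
			p = &(*p)->rb_right;
		else
			return false;	/* already recorded */
	}

	rb_link_node(&bc->node, parent, p);
	rb_insert_color(&bc->node, root);
	return true;
}

Removal then becomes an O(log n) lookup followed by rb_erase().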

> +	seqlock_t bc_lock;
>  };
>  
>  static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
> @@ -2568,6 +2571,11 @@ static inline gfp_t btrfs_alloc_write_mask(struct address_space *mapping)
>  
>  /* extent-tree.c */
>  
> +struct btrfs_bad_chunk {
> +	u64 chunk_offset;

It would be better to have chunk_size to inform the user.
Just the chunk start won't tell the user how serious the problem is.

And according to the code, any write error gets the chunk marked as
bad, no matter whether the error can be tolerated.

It would be better if we had different error types: 'bad' for errors
that cannot be tolerated, and 'degraded' for ones that can.
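
For example, something along these lines (purely illustrative, the
names are made up):

enum btrfs_bad_chunk_type {
	BTRFS_BAD_CHUNK_DEGRADED,	/* write error tolerated, redundancy reduced */
	BTRFS_BAD_CHUNK_BAD,		/* write error beyond what the profile tolerates */
};

struct btrfs_bad_chunk {
	u64 chunk_offset;
	u64 chunk_size;
	enum btrfs_bad_chunk_type type;
	struct list_head list;
};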

Thanks,
Qu

> +	struct list_head list;
> +};
> +
>  enum btrfs_inline_ref_type {
>  	BTRFS_REF_TYPE_INVALID =	 0,
>  	BTRFS_REF_TYPE_BLOCK =		 1,
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index a8ecccf..061e7f94 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2568,6 +2568,8 @@ int open_ctree(struct super_block *sb,
>  	init_waitqueue_head(&fs_info->async_submit_wait);
>  
>  	INIT_LIST_HEAD(&fs_info->pinned_chunks);
> +	INIT_LIST_HEAD(&fs_info->bad_chunks);
> +	seqlock_init(&fs_info->bc_lock);
>  
>  	/* Usable values until the real ones are cached from the superblock */
>  	fs_info->nodesize = 4096;
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 2f43285..3ca7cb4 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9903,6 +9903,19 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
>  		kobject_del(&space_info->kobj);
>  		kobject_put(&space_info->kobj);
>  	}
> +
> +	/* Clean up bad chunks. */
> +	write_seqlock_irq(&info->bc_lock);
> +	while (!list_empty(&info->bad_chunks)) {
> +		struct btrfs_bad_chunk *bc;
> +
> +		bc = list_first_entry(&info->bad_chunks,
> +				      struct btrfs_bad_chunk, list);
> +		list_del_init(&bc->list);
> +		kfree(bc);
> +	}
> +	write_sequnlock_irq(&info->bc_lock);
> +
>  	return 0;
>  }
>  
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index a7f7925..e960247 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -888,14 +888,19 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
>  }
>  
>  /*
> - * end io function used by finish_rmw.  When we finally
> - * get here, we've written a full stripe
> + * end io function used by finish_rmw.  When we finally get here, we've written
> + * a full stripe.
> + *
> + * Note that this is not under interrupt context as we queued endio to workers.
>   */
>  static void raid_write_end_io(struct bio *bio)
>  {
>  	struct btrfs_raid_bio *rbio = bio->bi_private;
>  	blk_status_t err = bio->bi_status;
>  	int max_errors;
> +	u64 stripe_start = rbio->bbio->raid_map[0];
> +	struct btrfs_fs_info *fs_info = rbio->fs_info;
> +	int err_cnt;
>  
>  	if (err)
>  		fail_bio_stripe(rbio, bio);
> @@ -908,12 +913,58 @@ static void raid_write_end_io(struct bio *bio)
>  	err = BLK_STS_OK;
>  
>  	/* OK, we have read all the stripes we need to. */
> +	err_cnt = atomic_read(&rbio->error);
>  	max_errors = (rbio->operation == BTRFS_RBIO_PARITY_SCRUB) ?
>  		     0 : rbio->bbio->max_errors;
>  	if (atomic_read(&rbio->error) > max_errors)
>  		err = BLK_STS_IOERR;
>  
>  	rbio_orig_end_io(rbio, err);
> +
> +	/*
> +	 * If there is any error, this stripe is a degraded one, so is the whole
> +	 * chunk, expose this chunk info to sysfs.
> +	 */
> +	if (unlikely(err_cnt)) {
> +		struct btrfs_bad_chunk *bc;
> +		struct btrfs_bad_chunk *tmp;
> +		struct extent_map *em;
> +		unsigned long flags;
> +
> +		em = get_chunk_map(fs_info, stripe_start, 1);
> +		if (IS_ERR(em))
> +			return;
> +
> +		bc = kzalloc(sizeof(*bc), GFP_NOFS);
> +		/* If allocation fails, it's OK. */
> +		if (!bc) {
> +			free_extent_map(em);
> +			return;
> +		}
> +
> +		write_seqlock_irqsave(&fs_info->bc_lock, flags);
> +		list_for_each_entry(tmp, &fs_info->bad_chunks, list) {
> +			if (tmp->chunk_offset != em->start)
> +				continue;
> +
> +			/*
> +			 * Don't bother if this chunk has already been recorded.
> +			 */
> +			write_sequnlock_irqrestore(&fs_info->bc_lock, flags);
> +			kfree(bc);
> +			free_extent_map(em);
> +			return;
> +		}
> +
> +		/* Add new bad chunk to list. */
> +		bc->chunk_offset = em->start;
> +		free_extent_map(em);
> +
> +		INIT_LIST_HEAD(&bc->list);
> +		list_add(&bc->list, &fs_info->bad_chunks);
> +
> +		write_sequnlock_irqrestore(&fs_info->bc_lock, flags);
> +	}
>  }
>  
>  /*
> @@ -1320,6 +1371,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
>  		bio->bi_end_io = raid_write_end_io;
>  		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
>  
> +		btrfs_bio_wq_end_io(rbio->fs_info, bio, BTRFS_WQ_ENDIO_RAID56);
> +
>  		submit_bio(bio);
>  	}
>  	return;
> @@ -2465,6 +2518,8 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
>  		bio->bi_end_io = raid_write_end_io;
>  		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
>  
> +		btrfs_bio_wq_end_io(rbio->fs_info, bio, BTRFS_WQ_ENDIO_RAID56);
> +
>  		submit_bio(bio);
>  	}
>  	return;
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index a28bba8..0baaa33 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -490,12 +490,38 @@ static ssize_t quota_override_store(struct kobject *kobj,
>  
>  BTRFS_ATTR_RW(, quota_override, quota_override_show, quota_override_store);
>  
> +static ssize_t btrfs_bad_chunks_show(struct kobject *kobj,
> +				     struct kobj_attribute *a, char *buf)
> +{
> +	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
> +	struct btrfs_bad_chunk *bc;
> +	int len = 0;
> +	unsigned int seq;
> +
> +	/* read lock please */
> +	do {
> +		seq = read_seqbegin(&fs_info->bc_lock);
> +		list_for_each_entry(bc, &fs_info->bad_chunks, list) {
> +			len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
> +					bc->chunk_offset);
> +			/* chunk offset is u64 */
> +			if (len >= PAGE_SIZE)
> +				break;
> +		}
> +	} while (read_seqretry(&fs_info->bc_lock, seq));
> +
> +	return len;
> +}
> +
> +BTRFS_ATTR(, bad_chunks, btrfs_bad_chunks_show);
> +
>  static const struct attribute *btrfs_attrs[] = {
>  	BTRFS_ATTR_PTR(, label),
>  	BTRFS_ATTR_PTR(, nodesize),
>  	BTRFS_ATTR_PTR(, sectorsize),
>  	BTRFS_ATTR_PTR(, clone_alignment),
>  	BTRFS_ATTR_PTR(, quota_override),
> +	BTRFS_ATTR_PTR(, bad_chunks),
>  	NULL,
>  };
>  
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index a256842..d71f11a 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2803,8 +2803,8 @@ static int btrfs_del_sys_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>  	return ret;
>  }
>  
> -static struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
> -					u64 logical, u64 length)
> +struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
> +				 u64 logical, u64 length)
>  {
>  	struct extent_map_tree *em_tree;
>  	struct extent_map *em;
> @@ -2840,6 +2840,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
>  	u64 dev_extent_len = 0;
>  	int i, ret = 0;
>  	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> +	struct btrfs_bad_chunk *bc;
>  
>  	em = get_chunk_map(fs_info, chunk_offset, 1);
>  	if (IS_ERR(em)) {
> @@ -2916,6 +2917,16 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
>  	}
>  
>  out:
> +	write_seqlock_irq(&fs_info->bc_lock);
> +	list_for_each_entry(bc, &fs_info->bad_chunks, list) {
> +		if (bc->chunk_offset == chunk_offset) {
> +			list_del_init(&bc->list);
> +			kfree(bc);
> +			break;
> +		}
> +	}
> +	write_sequnlock_irq(&fs_info->bc_lock);
> +
>  	/* once for us */
>  	free_extent_map(em);
>  	return ret;
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index ff15208..4e846ba 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -396,6 +396,8 @@ static inline enum btrfs_map_op btrfs_op(struct bio *bio)
>  	}
>  }
>  
> +struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
> +				 u64 logical, u64 length);
>  int btrfs_account_dev_extents_size(struct btrfs_device *device, u64 start,
>  				   u64 end, u64 *length);
>  void btrfs_get_bbio(struct btrfs_bio *bbio);
>
Nikolay Borisov Feb. 6, 2018, 9:11 a.m. UTC | #2
On  6.02.2018 01:15, Liu Bo wrote:
> Btrfs tries its best to tolerate write errors, but kind of silently
> (except some messages in kernel log).
> 
> For raid1 and raid10, this is usually not a problem because there is a
> copy as backup, while for parity based raid setup, i.e. raid5 and
> raid6, the problem is that, if a write error occurs due to some bad
> sectors, one horizonal stripe becomes degraded and the number of write
> errors it can tolerate gets reduced by one, now if two disk fails,
> data may be lost forever.
> 
> One way to mitigate the data loss pain is to expose 'bad chunks',
> i.e. degraded chunks, to users, so that they can use 'btrfs balance'
> to relocate the whole chunk and get the full raid6 protection again
> (if the relocation works).
> 
> This introduces 'bad_chunks' in btrfs's per-fs sysfs directory.  Once
> a chunk of raid5 or raid6 becomes degraded, it will appear in
> 'bad_chunks'.
> 
> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> ---
> - In this patch, 'bad chunks' is not persistent on disk, but it can be
>   added if it's thought to be a good idea.
> - This is lightly tested, comments are very welcome.
> 
>  fs/btrfs/ctree.h       |  8 +++++++
>  fs/btrfs/disk-io.c     |  2 ++
>  fs/btrfs/extent-tree.c | 13 +++++++++++
>  fs/btrfs/raid56.c      | 59 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  fs/btrfs/sysfs.c       | 26 ++++++++++++++++++++++
>  fs/btrfs/volumes.c     | 15 +++++++++++--
>  fs/btrfs/volumes.h     |  2 ++
>  7 files changed, 121 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 13c260b..08aad65 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1101,6 +1101,9 @@ struct btrfs_fs_info {
>  	spinlock_t ref_verify_lock;
>  	struct rb_root block_tree;
>  #endif
> +
> +	struct list_head bad_chunks;
> +	seqlock_t bc_lock;
>  };
>  
>  static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
> @@ -2568,6 +2571,11 @@ static inline gfp_t btrfs_alloc_write_mask(struct address_space *mapping)
>  
>  /* extent-tree.c */
>  
> +struct btrfs_bad_chunk {
> +	u64 chunk_offset;
> +	struct list_head list;
> +};
> +
>  enum btrfs_inline_ref_type {
>  	BTRFS_REF_TYPE_INVALID =	 0,
>  	BTRFS_REF_TYPE_BLOCK =		 1,
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index a8ecccf..061e7f94 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2568,6 +2568,8 @@ int open_ctree(struct super_block *sb,
>  	init_waitqueue_head(&fs_info->async_submit_wait);
>  
>  	INIT_LIST_HEAD(&fs_info->pinned_chunks);
> +	INIT_LIST_HEAD(&fs_info->bad_chunks);
> +	seqlock_init(&fs_info->bc_lock);
>  
>  	/* Usable values until the real ones are cached from the superblock */
>  	fs_info->nodesize = 4096;
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 2f43285..3ca7cb4 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9903,6 +9903,19 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
>  		kobject_del(&space_info->kobj);
>  		kobject_put(&space_info->kobj);
>  	}
> +
> +	/* Clean up bad chunks. */
> +	write_seqlock_irq(&info->bc_lock);
> +	while (!list_empty(&info->bad_chunks)) {

Why not use the idiomatic list_for_each_entry_safe? That way you remove
the list_first_entry invocation altogether and still get a well-formed
btrfs_bad_chunk object.
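
I.e. something along these lines (just a sketch):

	struct btrfs_bad_chunk *bc;
	struct btrfs_bad_chunk *next;

	write_seqlock_irq(&info->bc_lock);
	/* safe iteration: 'next' is fetched before 'bc' is freed */
	list_for_each_entry_safe(bc, next, &info->bad_chunks, list) {
		list_del(&bc->list);
		kfree(bc);
	}
	write_sequnlock_irq(&info->bc_lock);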

> +		struct btrfs_bad_chunk *bc;
> +
> +		bc = list_first_entry(&info->bad_chunks,
> +				      struct btrfs_bad_chunk, list);
> +		list_del_init(&bc->list);

nit: no need to use the _init variant, you are directly freeing the
entry, less code to execute :)

> +		kfree(bc);
> +	}
> +	write_sequnlock_irq(&info->bc_lock);
> +
>  	return 0;
>  }
>  
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index a7f7925..e960247 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -888,14 +888,19 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
>  }
>  
>  /*
> - * end io function used by finish_rmw.  When we finally
> - * get here, we've written a full stripe
> + * end io function used by finish_rmw.  When we finally get here, we've written
> + * a full stripe.
> + *
> + * Note that this is not under interrupt context as we queued endio to workers.
>   */
>  static void raid_write_end_io(struct bio *bio)
>  {
>  	struct btrfs_raid_bio *rbio = bio->bi_private;
>  	blk_status_t err = bio->bi_status;
>  	int max_errors;
> +	u64 stripe_start = rbio->bbio->raid_map[0];
> +	struct btrfs_fs_info *fs_info = rbio->fs_info;
> +	int err_cnt;
>  
>  	if (err)
>  		fail_bio_stripe(rbio, bio);
> @@ -908,12 +913,58 @@ static void raid_write_end_io(struct bio *bio)
>  	err = BLK_STS_OK;
>  
>  	/* OK, we have read all the stripes we need to. */
> +	err_cnt = atomic_read(&rbio->error);
>  	max_errors = (rbio->operation == BTRFS_RBIO_PARITY_SCRUB) ?
>  		     0 : rbio->bbio->max_errors;
>  	if (atomic_read(&rbio->error) > max_errors)
>  		err = BLK_STS_IOERR;
>  
>  	rbio_orig_end_io(rbio, err);
> +
> +	/*
> +	 * If there is any error, this stripe is a degraded one, so is the whole
> +	 * chunk, expose this chunk info to sysfs.
> +	 */
> +	if (unlikely(err_cnt)) {
> +		struct btrfs_bad_chunk *bc;
> +		struct btrfs_bad_chunk *tmp;
> +		struct extent_map *em;
> +		unsigned long flags;
> +
> +		em = get_chunk_map(fs_info, stripe_start, 1);
> +		if (IS_ERR(em))
> +			return;
> +
> +		bc = kzalloc(sizeof(*bc), GFP_NOFS);
> +		/* If allocation fails, it's OK. */
> +		if (!bc) {
> +			free_extent_map(em);
> +			return;
> +		}
> +
> +		write_seqlock_irqsave(&fs_info->bc_lock, flags);

Why do you disable interrupts here when the comment at the beginning of
the function claims this code can't be executed in irq context? Given
the comment, I'd expect that if you put the following assert at the
beginning of the function it should never trigger:

ASSERT(!in_irq())

> +		list_for_each_entry(tmp, &fs_info->bad_chunks, list) {
> +			if (tmp->chunk_offset != em->start)
> +				continue;
> +
> +			/*
> +			 * Don't bother if this chunk has already been recorded.
> +			 */
> +			write_sequnlock_irqrestore(&fs_info->bc_lock, flags);
> +			kfree(bc);
> +			free_extent_map(em);
> +			return;
> +		}
> +
> +		/* Add new bad chunk to list. */
> +		bc->chunk_offset = em->start;
> +		free_extent_map(em);
> +
> +		INIT_LIST_HEAD(&bc->list);

nit: There is no need to initialize the list head of the entry itself.

> +		list_add(&bc->list, &fs_info->bad_chunks);
> +
> +		write_sequnlock_irqrestore(&fs_info->bc_lock, flags);
> +	}
>  }
>  
>  /*
> @@ -1320,6 +1371,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
>  		bio->bi_end_io = raid_write_end_io;
>  		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
>  
> +		btrfs_bio_wq_end_io(rbio->fs_info, bio, BTRFS_WQ_ENDIO_RAID56);
> +
>  		submit_bio(bio);
>  	}
>  	return;
> @@ -2465,6 +2518,8 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
>  		bio->bi_end_io = raid_write_end_io;
>  		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
>  
> +		btrfs_bio_wq_end_io(rbio->fs_info, bio, BTRFS_WQ_ENDIO_RAID56);
> +
>  		submit_bio(bio);
>  	}
>  	return;
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index a28bba8..0baaa33 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -490,12 +490,38 @@ static ssize_t quota_override_store(struct kobject *kobj,
>  
>  BTRFS_ATTR_RW(, quota_override, quota_override_show, quota_override_store);
>  
> +static ssize_t btrfs_bad_chunks_show(struct kobject *kobj,
> +				     struct kobj_attribute *a, char *buf)
> +{
> +	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
> +	struct btrfs_bad_chunk *bc;
> +	int len = 0;
> +	unsigned int seq;
> +
> +	/* read lock please */
> +	do {
> +		seq = read_seqbegin(&fs_info->bc_lock);
> +		list_for_each_entry(bc, &fs_info->bad_chunks, list) {
> +			len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
> +					bc->chunk_offset);
> +			/* chunk offset is u64 */
> +			if (len >= PAGE_SIZE)
> +				break;
> +		}
> +	} while (read_seqretry(&fs_info->bc_lock, seq));
> +
> +	return len;
> +}
> +
> +BTRFS_ATTR(, bad_chunks, btrfs_bad_chunks_show);
> +
>  static const struct attribute *btrfs_attrs[] = {
>  	BTRFS_ATTR_PTR(, label),
>  	BTRFS_ATTR_PTR(, nodesize),
>  	BTRFS_ATTR_PTR(, sectorsize),
>  	BTRFS_ATTR_PTR(, clone_alignment),
>  	BTRFS_ATTR_PTR(, quota_override),
> +	BTRFS_ATTR_PTR(, bad_chunks),
>  	NULL,
>  };
>  
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index a256842..d71f11a 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2803,8 +2803,8 @@ static int btrfs_del_sys_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>  	return ret;
>  }
>  
> -static struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
> -					u64 logical, u64 length)
> +struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
> +				 u64 logical, u64 length)

nit: Since you are exposing the function as an API I think this is a
good opportunity to add proper kernel doc for it.
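
Something like this, perhaps (just a sketch of the wording):

/**
 * get_chunk_map - look up the chunk mapping for a logical range
 * @fs_info: the filesystem
 * @logical: start of the logical range
 * @length:  length of the range in bytes
 *
 * Returns the extent_map of the chunk covering @logical, with a
 * reference taken (to be dropped with free_extent_map()), or an
 * ERR_PTR() if no chunk mapping covers the range.
 */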

>  {
>  	struct extent_map_tree *em_tree;
>  	struct extent_map *em;
> @@ -2840,6 +2840,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
>  	u64 dev_extent_len = 0;
>  	int i, ret = 0;
>  	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> +	struct btrfs_bad_chunk *bc;
>  
>  	em = get_chunk_map(fs_info, chunk_offset, 1);
>  	if (IS_ERR(em)) {
> @@ -2916,6 +2917,16 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
>  	}
>  
>  out:
> +	write_seqlock_irq(&fs_info->bc_lock);
> +	list_for_each_entry(bc, &fs_info->bad_chunks, list) {

Use list_for_each_entry_safe to make it more apparent that you are going
to be removing from the list. The code as-is works since you do a break
after deleting the element from the list, but this is somewhat subtle.
Also, it's not necessary to re-init the deleted entry since you are
directly freeing it.
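
Roughly (just a sketch):

	struct btrfs_bad_chunk *bc;
	struct btrfs_bad_chunk *next;

	write_seqlock_irq(&fs_info->bc_lock);
	list_for_each_entry_safe(bc, next, &fs_info->bad_chunks, list) {
		if (bc->chunk_offset == chunk_offset) {
			list_del(&bc->list);
			kfree(bc);
			break;
		}
	}
	write_sequnlock_irq(&fs_info->bc_lock);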

> +		if (bc->chunk_offset == chunk_offset) {
> +			list_del_init(&bc->list);
> +			kfree(bc);
> +			break;
> +		}
> +	}
> +	write_sequnlock_irq(&fs_info->bc_lock);
> +
>  	/* once for us */
>  	free_extent_map(em);
>  	return ret;
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index ff15208..4e846ba 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -396,6 +396,8 @@ static inline enum btrfs_map_op btrfs_op(struct bio *bio)
>  	}
>  }
>  
> +struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
> +				 u64 logical, u64 length);
>  int btrfs_account_dev_extents_size(struct btrfs_device *device, u64 start,
>  				   u64 end, u64 *length);
>  void btrfs_get_bbio(struct btrfs_bio *bbio);
> 
Liu Bo Feb. 7, 2018, 10:57 p.m. UTC | #3
On Tue, Feb 06, 2018 at 09:28:14AM +0800, Qu Wenruo wrote:
> 
> 
> On 2018年02月06日 07:15, Liu Bo wrote:
> > Btrfs tries its best to tolerate write errors, but kind of silently
> > (except some messages in kernel log).
> > 
> > For raid1 and raid10, this is usually not a problem because there is a
> > copy as backup, while for parity based raid setup, i.e. raid5 and
> > raid6, the problem is that, if a write error occurs due to some bad
> > sectors, one horizonal stripe becomes degraded and the number of write
> > errors it can tolerate gets reduced by one, now if two disk fails,
> > data may be lost forever.
> > 
> > One way to mitigate the data loss pain is to expose 'bad chunks',
> > i.e. degraded chunks, to users, so that they can use 'btrfs balance'
> > to relocate the whole chunk and get the full raid6 protection again
> > (if the relocation works).
> > 
> > This introduces 'bad_chunks' in btrfs's per-fs sysfs directory.  Once
> > a chunk of raid5 or raid6 becomes degraded, it will appear in
> > 'bad_chunks'.
> 
> Sysfs looks good.
> 
> Although other systems uses their own interface to handle their status.
> Mdadm uses /proc/mdstat to show such status, LVM uses lvdisplay/lvs.
>

It's more like the badblocks list in md, rather than /proc/mdstat.

> So here comes to a new sys-fs interface.
> 
> > 
> > Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> > ---
> > - In this patch, 'bad chunks' is not persistent on disk, but it can be
> >   added if it's thought to be a good idea.
> 
> IHMO such bad chunks list can be built using existing dev status at
> mount time.
>

What dev status offers is counters, but chunk info is needed here if
we want balance to do the relocation.  I'll think harder about how to
use it.

> Although using dev status may cause extra problems like false alerts.
> 
> > - This is lightly tested, comments are very welcome.
> 
> Just checked the code, there are 2 concerns:
> 
> 1) The way to remove bad chunk
>    Currently it can only be removed when the chunk is removed.
>    If any transient write error happened, the bad chunk will just be
>    there forever (if not removed)
>

The fundamental assumption about write errors is that the filesystem
should not see any transient write errors, as the underlying layers in
the IO stack should do their best to get rid of them.
(Probably I should add this to the patch log.)

So once we get a bad chunk, there is a real IO error; for now, what I
can think of is to use balance to create a new chunk that holds
everything in the bad chunk and has the full raid protection.

>    It seems to cause false alert.
> 
>    And extra logic to determine if it's a real bad chunk in kernel seems
>    a little complex and less flex.
>    (Maybe an interface to info userspace where problem happens is more
>     flex?)
>

It depends on what users care about.  When raid6 is in use, I think
users care about how many disk failures btrfs can tolerate at any
point; whether a bad chunk is a true or false alert, they probably
don't care, but it would help a lot if some operation could be done to
get the system back to the protection level they want.

> 2) Bad chunk is only added when writing
>    Read routine should also be able to detect bad chunks, with better
>    accuracy.
>

Do you mean a read error should also report a bad chunk?
Or am I misunderstanding your point?

Typically a read failure would trigger reconstruction and a write for
correction would be issued; then we could get bad chunks if the
correction write fails.

> > 
> >  fs/btrfs/ctree.h       |  8 +++++++
> >  fs/btrfs/disk-io.c     |  2 ++
> >  fs/btrfs/extent-tree.c | 13 +++++++++++
> >  fs/btrfs/raid56.c      | 59 ++++++++++++++++++++++++++++++++++++++++++++++++--
> >  fs/btrfs/sysfs.c       | 26 ++++++++++++++++++++++
> >  fs/btrfs/volumes.c     | 15 +++++++++++--
> >  fs/btrfs/volumes.h     |  2 ++
> >  7 files changed, 121 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 13c260b..08aad65 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -1101,6 +1101,9 @@ struct btrfs_fs_info {
> >  	spinlock_t ref_verify_lock;
> >  	struct rb_root block_tree;
> >  #endif
> > +
> > +	struct list_head bad_chunks;
> 
> Rbtree may be better here.
> 
> Since iterating a list to remove bad chunk can sometimes be slow.
>

At the time I wrote the patch, I thought bad chunks would be a rare
case, so a list search is fine, but now I'm not sure.

> > +	seqlock_t bc_lock;
> >  };
> >  
> >  static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
> > @@ -2568,6 +2571,11 @@ static inline gfp_t btrfs_alloc_write_mask(struct address_space *mapping)
> >  
> >  /* extent-tree.c */
> >  
> > +struct btrfs_bad_chunk {
> > +	u64 chunk_offset;
> 
> It would be better to have chunk_size to info user.
> Just chunk start won't tell user how serious the problem is.
>

Hmm, I don't understand what extra value chunk_size can offer.

> And according to the code, any write error makes the chunk marked as
> bad, no matter if the error can be tolerant.
> 
> It would be better if we have different error type, bad for error
> intolerant, while degraded for some something tolerant.
>

Let me see...

If the error cannot be tolerated, -EIO would be received by userspace
applications and the data is already lost; even balance would not make
any difference.  In this case, a bad chunk entry doesn't seem to make
much sense.

If the error can be tolerated, btrfs just carries on and userspace
applications are not aware of the error; then balance can help regain
the full raid protection.

So it makes sense that we add a tag to tell whether it's tolerable or
not.

Thanks a lot for the comments.

Thanks,

-liubo

> > +	struct list_head list;
> > +};
> > +
> >  enum btrfs_inline_ref_type {
> >  	BTRFS_REF_TYPE_INVALID =	 0,
> >  	BTRFS_REF_TYPE_BLOCK =		 1,
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index a8ecccf..061e7f94 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -2568,6 +2568,8 @@ int open_ctree(struct super_block *sb,
> >  	init_waitqueue_head(&fs_info->async_submit_wait);
> >  
> >  	INIT_LIST_HEAD(&fs_info->pinned_chunks);
> > +	INIT_LIST_HEAD(&fs_info->bad_chunks);
> > +	seqlock_init(&fs_info->bc_lock);
> >  
> >  	/* Usable values until the real ones are cached from the superblock */
> >  	fs_info->nodesize = 4096;
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 2f43285..3ca7cb4 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -9903,6 +9903,19 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
> >  		kobject_del(&space_info->kobj);
> >  		kobject_put(&space_info->kobj);
> >  	}
> > +
> > +	/* Clean up bad chunks. */
> > +	write_seqlock_irq(&info->bc_lock);
> > +	while (!list_empty(&info->bad_chunks)) {
> > +		struct btrfs_bad_chunk *bc;
> > +
> > +		bc = list_first_entry(&info->bad_chunks,
> > +				      struct btrfs_bad_chunk, list);
> > +		list_del_init(&bc->list);
> > +		kfree(bc);
> > +	}
> > +	write_sequnlock_irq(&info->bc_lock);
> > +
> >  	return 0;
> >  }
> >  
> > diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> > index a7f7925..e960247 100644
> > --- a/fs/btrfs/raid56.c
> > +++ b/fs/btrfs/raid56.c
> > @@ -888,14 +888,19 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
> >  }
> >  
> >  /*
> > - * end io function used by finish_rmw.  When we finally
> > - * get here, we've written a full stripe
> > + * end io function used by finish_rmw.  When we finally get here, we've written
> > + * a full stripe.
> > + *
> > + * Note that this is not under interrupt context as we queued endio to workers.
> >   */
> >  static void raid_write_end_io(struct bio *bio)
> >  {
> >  	struct btrfs_raid_bio *rbio = bio->bi_private;
> >  	blk_status_t err = bio->bi_status;
> >  	int max_errors;
> > +	u64 stripe_start = rbio->bbio->raid_map[0];
> > +	struct btrfs_fs_info *fs_info = rbio->fs_info;
> > +	int err_cnt;
> >  
> >  	if (err)
> >  		fail_bio_stripe(rbio, bio);
> > @@ -908,12 +913,58 @@ static void raid_write_end_io(struct bio *bio)
> >  	err = BLK_STS_OK;
> >  
> >  	/* OK, we have read all the stripes we need to. */
> > +	err_cnt = atomic_read(&rbio->error);
> >  	max_errors = (rbio->operation == BTRFS_RBIO_PARITY_SCRUB) ?
> >  		     0 : rbio->bbio->max_errors;
> >  	if (atomic_read(&rbio->error) > max_errors)
> >  		err = BLK_STS_IOERR;
> >  
> >  	rbio_orig_end_io(rbio, err);
> > +
> > +	/*
> > +	 * If there is any error, this stripe is a degraded one, so is the whole
> > +	 * chunk, expose this chunk info to sysfs.
> > +	 */
> > +	if (unlikely(err_cnt)) {
> > +		struct btrfs_bad_chunk *bc;
> > +		struct btrfs_bad_chunk *tmp;
> > +		struct extent_map *em;
> > +		unsigned long flags;
> > +
> > +		em = get_chunk_map(fs_info, stripe_start, 1);
> > +		if (IS_ERR(em))
> > +			return;
> > +
> > +		bc = kzalloc(sizeof(*bc), GFP_NOFS);
> > +		/* If allocation fails, it's OK. */
> > +		if (!bc) {
> > +			free_extent_map(em);
> > +			return;
> > +		}
> > +
> > +		write_seqlock_irqsave(&fs_info->bc_lock, flags);
> > +		list_for_each_entry(tmp, &fs_info->bad_chunks, list) {
> > +			if (tmp->chunk_offset != em->start)
> > +				continue;
> > +
> > +			/*
> > +			 * Don't bother if this chunk has already been recorded.
> > +			 */
> > +			write_sequnlock_irqrestore(&fs_info->bc_lock, flags);
> > +			kfree(bc);
> > +			free_extent_map(em);
> > +			return;
> > +		}
> > +
> > +		/* Add new bad chunk to list. */
> > +		bc->chunk_offset = em->start;
> > +		free_extent_map(em);
> > +
> > +		INIT_LIST_HEAD(&bc->list);
> > +		list_add(&bc->list, &fs_info->bad_chunks);
> > +
> > +		write_sequnlock_irqrestore(&fs_info->bc_lock, flags);
> > +	}
> >  }
> >  
> >  /*
> > @@ -1320,6 +1371,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
> >  		bio->bi_end_io = raid_write_end_io;
> >  		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
> >  
> > +		btrfs_bio_wq_end_io(rbio->fs_info, bio, BTRFS_WQ_ENDIO_RAID56);
> > +
> >  		submit_bio(bio);
> >  	}
> >  	return;
> > @@ -2465,6 +2518,8 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
> >  		bio->bi_end_io = raid_write_end_io;
> >  		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
> >  
> > +		btrfs_bio_wq_end_io(rbio->fs_info, bio, BTRFS_WQ_ENDIO_RAID56);
> > +
> >  		submit_bio(bio);
> >  	}
> >  	return;
> > diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> > index a28bba8..0baaa33 100644
> > --- a/fs/btrfs/sysfs.c
> > +++ b/fs/btrfs/sysfs.c
> > @@ -490,12 +490,38 @@ static ssize_t quota_override_store(struct kobject *kobj,
> >  
> >  BTRFS_ATTR_RW(, quota_override, quota_override_show, quota_override_store);
> >  
> > +static ssize_t btrfs_bad_chunks_show(struct kobject *kobj,
> > +				     struct kobj_attribute *a, char *buf)
> > +{
> > +	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
> > +	struct btrfs_bad_chunk *bc;
> > +	int len = 0;
> > +	unsigned int seq;
> > +
> > +	/* read lock please */
> > +	do {
> > +		seq = read_seqbegin(&fs_info->bc_lock);
> > +		list_for_each_entry(bc, &fs_info->bad_chunks, list) {
> > +			len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
> > +					bc->chunk_offset);
> > +			/* chunk offset is u64 */
> > +			if (len >= PAGE_SIZE)
> > +				break;
> > +		}
> > +	} while (read_seqretry(&fs_info->bc_lock, seq));
> > +
> > +	return len;
> > +}
> > +
> > +BTRFS_ATTR(, bad_chunks, btrfs_bad_chunks_show);
> > +
> >  static const struct attribute *btrfs_attrs[] = {
> >  	BTRFS_ATTR_PTR(, label),
> >  	BTRFS_ATTR_PTR(, nodesize),
> >  	BTRFS_ATTR_PTR(, sectorsize),
> >  	BTRFS_ATTR_PTR(, clone_alignment),
> >  	BTRFS_ATTR_PTR(, quota_override),
> > +	BTRFS_ATTR_PTR(, bad_chunks),
> >  	NULL,
> >  };
> >  
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index a256842..d71f11a 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -2803,8 +2803,8 @@ static int btrfs_del_sys_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
> >  	return ret;
> >  }
> >  
> > -static struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
> > -					u64 logical, u64 length)
> > +struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
> > +				 u64 logical, u64 length)
> >  {
> >  	struct extent_map_tree *em_tree;
> >  	struct extent_map *em;
> > @@ -2840,6 +2840,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
> >  	u64 dev_extent_len = 0;
> >  	int i, ret = 0;
> >  	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> > +	struct btrfs_bad_chunk *bc;
> >  
> >  	em = get_chunk_map(fs_info, chunk_offset, 1);
> >  	if (IS_ERR(em)) {
> > @@ -2916,6 +2917,16 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
> >  	}
> >  
> >  out:
> > +	write_seqlock_irq(&fs_info->bc_lock);
> > +	list_for_each_entry(bc, &fs_info->bad_chunks, list) {
> > +		if (bc->chunk_offset == chunk_offset) {
> > +			list_del_init(&bc->list);
> > +			kfree(bc);
> > +			break;
> > +		}
> > +	}
> > +	write_sequnlock_irq(&fs_info->bc_lock);
> > +
> >  	/* once for us */
> >  	free_extent_map(em);
> >  	return ret;
> > diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> > index ff15208..4e846ba 100644
> > --- a/fs/btrfs/volumes.h
> > +++ b/fs/btrfs/volumes.h
> > @@ -396,6 +396,8 @@ static inline enum btrfs_map_op btrfs_op(struct bio *bio)
> >  	}
> >  }
> >  
> > +struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
> > +				 u64 logical, u64 length);
> >  int btrfs_account_dev_extents_size(struct btrfs_device *device, u64 start,
> >  				   u64 end, u64 *length);
> >  void btrfs_get_bbio(struct btrfs_bio *bbio);
> > 
> 



Qu Wenruo Feb. 8, 2018, 2:49 a.m. UTC | #4
On 2018年02月08日 06:57, Liu Bo wrote:
> On Tue, Feb 06, 2018 at 09:28:14AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2018年02月06日 07:15, Liu Bo wrote:
>>> Btrfs tries its best to tolerate write errors, but kind of silently
>>> (except some messages in kernel log).
>>>
>>> For raid1 and raid10, this is usually not a problem because there is a
>>> copy as backup, while for parity based raid setup, i.e. raid5 and
>>> raid6, the problem is that, if a write error occurs due to some bad
>>> sectors, one horizonal stripe becomes degraded and the number of write
>>> errors it can tolerate gets reduced by one, now if two disk fails,
>>> data may be lost forever.
>>>
>>> One way to mitigate the data loss pain is to expose 'bad chunks',
>>> i.e. degraded chunks, to users, so that they can use 'btrfs balance'
>>> to relocate the whole chunk and get the full raid6 protection again
>>> (if the relocation works).
>>>
>>> This introduces 'bad_chunks' in btrfs's per-fs sysfs directory.  Once
>>> a chunk of raid5 or raid6 becomes degraded, it will appear in
>>> 'bad_chunks'.
>>
>> Sysfs looks good.
>>
>> Although other systems uses their own interface to handle their status.
>> Mdadm uses /proc/mdstat to show such status, LVM uses lvdisplay/lvs.
>>
> 
> It's more like badblocks in md, instead of /proc/mdstat.

I see the point now.

> 
>> So here comes to a new sys-fs interface.
>>
>>>
>>> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
>>> ---
>>> - In this patch, 'bad chunks' is not persistent on disk, but it can be
>>>   added if it's thought to be a good idea.
>>
>> IHMO such bad chunks list can be built using existing dev status at
>> mount time.
>>
> 
> What dev status offers is counters, but here chunk info. is needed if
> we want balance to do relocation.  I'll think harder about how to use
> it.

In my opinion, if we get a write error, relocation may help for a short
time, but as long as we're using the same device it may happen again,
and the real fix will be to replace the device.

> 
>> Although using dev status may cause extra problems like false alerts.
>>
>>> - This is lightly tested, comments are very welcome.
>>
>> Just checked the code, there are 2 concerns:
>>
>> 1) The way to remove bad chunk
>>    Currently it can only be removed when the chunk is removed.
>>    If any transient write error happened, the bad chunk will just be
>>    there forever (if not removed)
>>
> 
> The fundamental assumption about write error is that filesystem should
> not get any transient write error, as the underlying layers in IO
> stack should do their best to get rid of transient write error.
> (probably I should add this to the patch log.)
> 
> So once we get a bad chunk, there is a real IO error, for now what I
> can think of is to use balance to create a new chunk to hold
> everything in the bad chunk and the new chunk has the full raid
> protection.

Then the problem is about granularity.

If a write error happens, should we just ignore those bad blocks, or
the whole device?

And in that case I prefer the latter.

> 
>>    It seems to cause false alert.
>>
>>    And extra logic to determine if it's a real bad chunk in kernel seems
>>    a little complex and less flex.
>>    (Maybe an interface to info userspace where problem happens is more
>>     flex?)
>>
> 
> It depends on what users care about, when raid6 is in use, I think
> users would care how many disk failures btrfs could tolerate at any
> point, about bad chunks whether it's true or false, probably they
> don't care, they might think it'd help a lot if some operations could
> be done to get the system back to the protect level they want.
> 
>> 2) Bad chunk is only added when writing
>>    Read routine should also be able to detect bad chunks, with better
>>    accuracy.
>>
> 
> Do you mean a read error should also report bad chunk?
> Or am I misunderstanding your point?
> 
> Typically read failure would trigger reconstruction and a write for
> correction will be issued, then we could get bad chunks if correction
> write fails.

Right, I just forgot the fix procedure.

> 
>>>
>>>  fs/btrfs/ctree.h       |  8 +++++++
>>>  fs/btrfs/disk-io.c     |  2 ++
>>>  fs/btrfs/extent-tree.c | 13 +++++++++++
>>>  fs/btrfs/raid56.c      | 59 ++++++++++++++++++++++++++++++++++++++++++++++++--
>>>  fs/btrfs/sysfs.c       | 26 ++++++++++++++++++++++
>>>  fs/btrfs/volumes.c     | 15 +++++++++++--
>>>  fs/btrfs/volumes.h     |  2 ++
>>>  7 files changed, 121 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index 13c260b..08aad65 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -1101,6 +1101,9 @@ struct btrfs_fs_info {
>>>  	spinlock_t ref_verify_lock;
>>>  	struct rb_root block_tree;
>>>  #endif
>>> +
>>> +	struct list_head bad_chunks;
>>
>> Rbtree may be better here.
>>
>> Since iterating a list to remove bad chunk can sometimes be slow.
>>
> 
> At the point I wrote the patch, I thought bad chunk should be rare
> case so list search is fine, but now I'm not sure.

My concern is that if a write error happens, it may or may not be
limited to just a few blocks.

And when the bad blocks grow, they will easily make a lot of chunks bad.

> 
>>> +	seqlock_t bc_lock;
>>>  };
>>>  
>>>  static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
>>> @@ -2568,6 +2571,11 @@ static inline gfp_t btrfs_alloc_write_mask(struct address_space *mapping)
>>>  
>>>  /* extent-tree.c */
>>>  
>>> +struct btrfs_bad_chunk {
>>> +	u64 chunk_offset;
>>
>> It would be better to have chunk_size to info user.
>> Just chunk start won't tell user how serious the problem is.
>>
> 
> Hmm, I don't understand what extra value chunk_size can offer.

Just an example:

If we only have chunk start bytenr:

1G
5G
10G

Then one may think: OK, just 3 chunks with maybe a 256M chunk size.
That's just 768M affected.

But with sizes:

1G size 4G
5G size 5G
10G size 5G

Then 14G is affected.

This is a big difference.

> 
>> And according to the code, any write error makes the chunk marked as
>> bad, no matter if the error can be tolerant.
>>
>> It would be better if we have different error type, bad for error
>> intolerant, while degraded for some something tolerant.
>>
> 
> Let me see..
> 
> If the error cannot be tolerated, -EIO would be received by userspace
> applications and data is already lost, even balance would not make any
> difference.  In this case, seems bad chunk doesn't make sense.

Yep, that's the point.

But my current biggest concern is about the granularity.
In most cases, users seem to care more about which device is to blame
than about which chunk needs to be relocated.

Thanks,
Qu

> 
> If the error can be tolerated, btrfs just carries on and userspace
> applications are not aware of this error, then balance can help regain
> the full raid protection.
> 
> So it makes sense, we can add a tag to tell whether it's tolerable or
> not.
> 
> Thanks a lot for the comments.
> 
> Thanks,
> 
> -liubo
> 
>>> +	struct list_head list;
>>> +};
>>> +
>>>  enum btrfs_inline_ref_type {
>>>  	BTRFS_REF_TYPE_INVALID =	 0,
>>>  	BTRFS_REF_TYPE_BLOCK =		 1,
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index a8ecccf..061e7f94 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -2568,6 +2568,8 @@ int open_ctree(struct super_block *sb,
>>>  	init_waitqueue_head(&fs_info->async_submit_wait);
>>>  
>>>  	INIT_LIST_HEAD(&fs_info->pinned_chunks);
>>> +	INIT_LIST_HEAD(&fs_info->bad_chunks);
>>> +	seqlock_init(&fs_info->bc_lock);
>>>  
>>>  	/* Usable values until the real ones are cached from the superblock */
>>>  	fs_info->nodesize = 4096;
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 2f43285..3ca7cb4 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -9903,6 +9903,19 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
>>>  		kobject_del(&space_info->kobj);
>>>  		kobject_put(&space_info->kobj);
>>>  	}
>>> +
>>> +	/* Clean up bad chunks. */
>>> +	write_seqlock_irq(&info->bc_lock);
>>> +	while (!list_empty(&info->bad_chunks)) {
>>> +		struct btrfs_bad_chunk *bc;
>>> +
>>> +		bc = list_first_entry(&info->bad_chunks,
>>> +				      struct btrfs_bad_chunk, list);
>>> +		list_del_init(&bc->list);
>>> +		kfree(bc);
>>> +	}
>>> +	write_sequnlock_irq(&info->bc_lock);
>>> +
>>>  	return 0;
>>>  }
>>>  
>>> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
>>> index a7f7925..e960247 100644
>>> --- a/fs/btrfs/raid56.c
>>> +++ b/fs/btrfs/raid56.c
>>> @@ -888,14 +888,19 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
>>>  }
>>>  
>>>  /*
>>> - * end io function used by finish_rmw.  When we finally
>>> - * get here, we've written a full stripe
>>> + * end io function used by finish_rmw.  When we finally get here, we've written
>>> + * a full stripe.
>>> + *
>>> + * Note that this is not under interrupt context as we queued endio to workers.
>>>   */
>>>  static void raid_write_end_io(struct bio *bio)
>>>  {
>>>  	struct btrfs_raid_bio *rbio = bio->bi_private;
>>>  	blk_status_t err = bio->bi_status;
>>>  	int max_errors;
>>> +	u64 stripe_start = rbio->bbio->raid_map[0];
>>> +	struct btrfs_fs_info *fs_info = rbio->fs_info;
>>> +	int err_cnt;
>>>  
>>>  	if (err)
>>>  		fail_bio_stripe(rbio, bio);
>>> @@ -908,12 +913,58 @@ static void raid_write_end_io(struct bio *bio)
>>>  	err = BLK_STS_OK;
>>>  
>>>  	/* OK, we have read all the stripes we need to. */
>>> +	err_cnt = atomic_read(&rbio->error);
>>>  	max_errors = (rbio->operation == BTRFS_RBIO_PARITY_SCRUB) ?
>>>  		     0 : rbio->bbio->max_errors;
>>>  	if (atomic_read(&rbio->error) > max_errors)
>>>  		err = BLK_STS_IOERR;
>>>  
>>>  	rbio_orig_end_io(rbio, err);
>>> +
>>> +	/*
>>> +	 * If there is any error, this stripe is a degraded one, so is the whole
>>> +	 * chunk, expose this chunk info to sysfs.
>>> +	 */
>>> +	if (unlikely(err_cnt)) {
>>> +		struct btrfs_bad_chunk *bc;
>>> +		struct btrfs_bad_chunk *tmp;
>>> +		struct extent_map *em;
>>> +		unsigned long flags;
>>> +
>>> +		em = get_chunk_map(fs_info, stripe_start, 1);
>>> +		if (IS_ERR(em))
>>> +			return;
>>> +
>>> +		bc = kzalloc(sizeof(*bc), GFP_NOFS);
>>> +		/* If allocation fails, it's OK. */
>>> +		if (!bc) {
>>> +			free_extent_map(em);
>>> +			return;
>>> +		}
>>> +
>>> +		write_seqlock_irqsave(&fs_info->bc_lock, flags);
>>> +		list_for_each_entry(tmp, &fs_info->bad_chunks, list) {
>>> +			if (tmp->chunk_offset != em->start)
>>> +				continue;
>>> +
>>> +			/*
>>> +			 * Don't bother if this chunk has already been recorded.
>>> +			 */
>>> +			write_sequnlock_irqrestore(&fs_info->bc_lock, flags);
>>> +			kfree(bc);
>>> +			free_extent_map(em);
>>> +			return;
>>> +		}
>>> +
>>> +		/* Add new bad chunk to list. */
>>> +		bc->chunk_offset = em->start;
>>> +		free_extent_map(em);
>>> +
>>> +		INIT_LIST_HEAD(&bc->list);
>>> +		list_add(&bc->list, &fs_info->bad_chunks);
>>> +
>>> +		write_sequnlock_irqrestore(&fs_info->bc_lock, flags);
>>> +	}
>>>  }
>>>  
>>>  /*
>>> @@ -1320,6 +1371,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
>>>  		bio->bi_end_io = raid_write_end_io;
>>>  		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
>>>  
>>> +		btrfs_bio_wq_end_io(rbio->fs_info, bio, BTRFS_WQ_ENDIO_RAID56);
>>> +
>>>  		submit_bio(bio);
>>>  	}
>>>  	return;
>>> @@ -2465,6 +2518,8 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
>>>  		bio->bi_end_io = raid_write_end_io;
>>>  		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
>>>  
>>> +		btrfs_bio_wq_end_io(rbio->fs_info, bio, BTRFS_WQ_ENDIO_RAID56);
>>> +
>>>  		submit_bio(bio);
>>>  	}
>>>  	return;
>>> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
>>> index a28bba8..0baaa33 100644
>>> --- a/fs/btrfs/sysfs.c
>>> +++ b/fs/btrfs/sysfs.c
>>> @@ -490,12 +490,38 @@ static ssize_t quota_override_store(struct kobject *kobj,
>>>  
>>>  BTRFS_ATTR_RW(, quota_override, quota_override_show, quota_override_store);
>>>  
>>> +static ssize_t btrfs_bad_chunks_show(struct kobject *kobj,
>>> +				     struct kobj_attribute *a, char *buf)
>>> +{
>>> +	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
>>> +	struct btrfs_bad_chunk *bc;
>>> +	int len = 0;
>>> +	unsigned int seq;
>>> +
>>> +	/* read lock please */
>>> +	do {
>>> +		seq = read_seqbegin(&fs_info->bc_lock);
>>> +		list_for_each_entry(bc, &fs_info->bad_chunks, list) {
>>> +			len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
>>> +					bc->chunk_offset);
>>> +			/* chunk offset is u64 */
>>> +			if (len >= PAGE_SIZE)
>>> +				break;
>>> +		}
>>> +	} while (read_seqretry(&fs_info->bc_lock, seq));
>>> +
>>> +	return len;
>>> +}
>>> +
>>> +BTRFS_ATTR(, bad_chunks, btrfs_bad_chunks_show);
>>> +
>>>  static const struct attribute *btrfs_attrs[] = {
>>>  	BTRFS_ATTR_PTR(, label),
>>>  	BTRFS_ATTR_PTR(, nodesize),
>>>  	BTRFS_ATTR_PTR(, sectorsize),
>>>  	BTRFS_ATTR_PTR(, clone_alignment),
>>>  	BTRFS_ATTR_PTR(, quota_override),
>>> +	BTRFS_ATTR_PTR(, bad_chunks),
>>>  	NULL,
>>>  };
>>>  
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index a256842..d71f11a 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -2803,8 +2803,8 @@ static int btrfs_del_sys_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>>>  	return ret;
>>>  }
>>>  
>>> -static struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
>>> -					u64 logical, u64 length)
>>> +struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
>>> +				 u64 logical, u64 length)
>>>  {
>>>  	struct extent_map_tree *em_tree;
>>>  	struct extent_map *em;
>>> @@ -2840,6 +2840,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
>>>  	u64 dev_extent_len = 0;
>>>  	int i, ret = 0;
>>>  	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>>> +	struct btrfs_bad_chunk *bc;
>>>  
>>>  	em = get_chunk_map(fs_info, chunk_offset, 1);
>>>  	if (IS_ERR(em)) {
>>> @@ -2916,6 +2917,16 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
>>>  	}
>>>  
>>>  out:
>>> +	write_seqlock_irq(&fs_info->bc_lock);
>>> +	list_for_each_entry(bc, &fs_info->bad_chunks, list) {
>>> +		if (bc->chunk_offset == chunk_offset) {
>>> +			list_del_init(&bc->list);
>>> +			kfree(bc);
>>> +			break;
>>> +		}
>>> +	}
>>> +	write_sequnlock_irq(&fs_info->bc_lock);
>>> +
>>>  	/* once for us */
>>>  	free_extent_map(em);
>>>  	return ret;
>>> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
>>> index ff15208..4e846ba 100644
>>> --- a/fs/btrfs/volumes.h
>>> +++ b/fs/btrfs/volumes.h
>>> @@ -396,6 +396,8 @@ static inline enum btrfs_map_op btrfs_op(struct bio *bio)
>>>  	}
>>>  }
>>>  
>>> +struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
>>> +				 u64 logical, u64 length);
>>>  int btrfs_account_dev_extents_size(struct btrfs_device *device, u64 start,
>>>  				   u64 end, u64 *length);
>>>  void btrfs_get_bbio(struct btrfs_bio *bbio);
>>>
>>
> 
> 
>
Anand Jain Feb. 8, 2018, 9:47 a.m. UTC | #5
On 02/06/2018 07:15 AM, Liu Bo wrote:
> Btrfs tries its best to tolerate write errors, but kind of silently
> (except some messages in kernel log).
> 
> For raid1 and raid10, this is usually not a problem because there is a
> copy as backup, while for parity based raid setup, i.e. raid5 and
> raid6, the problem is that, if a write error occurs due to some bad
> sectors, one horizonal stripe becomes degraded and the number of write
> errors it can tolerate gets reduced by one, now if two disk fails,
> data may be lost forever.

This is equally true for raid1, raid10, and raid5.  Sorry, I didn't get
the point of why a degraded stripe is critical only for the parity
based profiles (raid5/6).
And does it really need a bad chunk list to fix in the case of parity
based stripes, or can balance without a bad chunks list fix it as well?

> One way to mitigate the data loss pain is to expose 'bad chunks',
> i.e. degraded chunks, to users, so that they can use 'btrfs balance'
> to relocate the whole chunk and get the full raid6 protection again
> (if the relocation works).

Depending on the type of disk error, the recovery action will vary. For
example, it can be a complete disk failure or an interim RW failure due
to environmental/transport factors. The disk's automatic reallocation
will take care of relocating the real bad blocks on most modern disks.
The challenging task will be to know where to draw the line between a
complete disk failure (failed) and an interim disk failure (offline),
so I had plans to make it tunable based on the number of disk errors.

If it's confirmed that a disk has failed, auto-replace with the hot
spare disk will be its recovery action. Balance with a failed disk won't
help.

Patches to these are in the ML.

If the failure is momentary, due to environmental factors including the
transport layer, then as we expect the disk with the data to come back,
we shouldn't kick in the hot spare; that is the disk state offline, or
maybe a state where reading old data is fine but new data cannot be
written.
I think you are addressing this interim state. It's better to define the
disk states first so that their recovery actions can be defined. I can
revise the patches on that, so that replace vs. re-balance using bad
chunks can be decided.

> This introduces 'bad_chunks' in btrfs's per-fs sysfs directory.  Once
> a chunk of raid5 or raid6 becomes degraded, it will appear in
> 'bad_chunks'.

AFAIK a variable-length list of output is not allowed in sysfs.

IMHO a list of bad chunks won't help the user (it's OK if it's needed by
the kernel). It would help if you provided a list of the affected files,
so that the user can script making an additional interim external copy
until the disk recovers from the interim error.

Thanks, Anand
Goffredo Baroncelli Feb. 8, 2018, 6:52 p.m. UTC | #6
On 02/06/2018 12:15 AM, Liu Bo wrote:
[...]
> One way to mitigate the data loss pain is to expose 'bad chunks',
> i.e. degraded chunks, to users, so that they can use 'btrfs balance'
> to relocate the whole chunk and get the full raid6 protection again
> (if the relocation works).

[...]
[...]

> +
> +	/* read lock please */
> +	do {
> +		seq = read_seqbegin(&fs_info->bc_lock);
> +		list_for_each_entry(bc, &fs_info->bad_chunks, list) {
> +			len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
> +					bc->chunk_offset);
> +			/* chunk offset is u64 */
> +			if (len >= PAGE_SIZE)
> +				break;
> +		}
> +	} while (read_seqretry(&fs_info->bc_lock, seq));

Using this interface, how many chunks can you list? If I read the code correctly, only enough to fill a kernel page.

If my math is correct (PAGE_SIZE=4k, a u64 could require up to 19 bytes), it is possible to list only a few hundred chunks (~200). Not more; and the last one could even be listed incomplete.

IIRC a chunk size is at most 1GB; if you lost 500GB of disks, the chunks to list could be more than 200.

My first suggestion is to limit the number of chunks shown to 200 (a page should be big enough to contain all these chunk offsets). If the number of chunks is greater, end the list with a marker (something like '[...]\n').
This would remove the ambiguity about whether the listed chunks are complete or not. In any case you cannot list all the chunks.
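
Roughly, in the show callback (just a sketch of the idea):

	do {
		seq = read_seqbegin(&fs_info->bc_lock);
		len = 0;
		list_for_each_entry(bc, &fs_info->bad_chunks, list) {
			/* leave room for one u64, a newline and the marker */
			if (len > PAGE_SIZE - 32) {
				len += snprintf(buf + len, PAGE_SIZE - len,
						"[...]\n");
				break;
			}
			len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
					bc->chunk_offset);
		}
	} while (read_seqretry(&fs_info->bc_lock, seq));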

However, my second suggestion is to ... change the interface completely. What about adding a directory in sysfs where each entry is a chunk?

Something like:

/sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/type		# data/metadata/sys
/sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/profile		# dup/linear....
/sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/size		# size
/sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/devs		# chunks devs 

And so on. 

Checking "[...]<chunks-offset>/devs", it would be easy to understand whether the chunk is in "degraded" mode or not.

However I have to admit that I don't know how feasible it is to iterate over a sysfs directory which maps a list of kernel objects.

I think that if this interface were good enough, we could get rid of a lot of ioctl(TREE_SEARCH) calls in btrfs-progs.

BR
G.Baroncelli
Liu Bo Feb. 8, 2018, 7:07 p.m. UTC | #7
On Thu, Feb 08, 2018 at 07:52:05PM +0100, Goffredo Baroncelli wrote:
> On 02/06/2018 12:15 AM, Liu Bo wrote:
> [...]
> > One way to mitigate the data loss pain is to expose 'bad chunks',
> > i.e. degraded chunks, to users, so that they can use 'btrfs balance'
> > to relocate the whole chunk and get the full raid6 protection again
> > (if the relocation works).
> 
> [...]
> [...]
> 
> > +
> > +	/* read lock please */
> > +	do {
> > +		seq = read_seqbegin(&fs_info->bc_lock);
> > +		list_for_each_entry(bc, &fs_info->bad_chunks, list) {
> > +			len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
> > +					bc->chunk_offset);
> > +			/* chunk offset is u64 */
> > +			if (len >= PAGE_SIZE)
> > +				break;
> > +		}
> > +	} while (read_seqretry(&fs_info->bc_lock, seq));
> 
> Using this interface, how many chunks can you list ? If I read the code correctly, only up to fill a kernel page.
>
> If my math are correctly (PAGE_SIZE=4k, a u64 could require up to 19 bytes) it is possible to list only few hundred of chunks (~200). Not more; and the last one could be even listed incomplete.
> 

That's true.

> IIRC a chunk size is max 1GB; If you lost a 500GB of disks, the chunks to list could be more than 200.
>
> My first suggestion is to limit the number of chunks to show to 200 (a page should be big enough to contains all these chunks offset). If the chunks number are greater, ends the list with a marker (something like '[...]\n').
> This would solve the ambiguity about the fact that the list chunks are complete or not. Anyway you cannot list all the chunks.
>

Good point, and I need to add one more field to each record to specify
whether it's metadata or data.
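
Something like this, perhaps:

struct btrfs_bad_chunk {
	u64 chunk_offset;
	u64 type;	/* BTRFS_BLOCK_GROUP_DATA/_METADATA/_SYSTEM */
	struct list_head list;
};

so that the sysfs output can print the block group type next to the
offset.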

So what I have in mind is that this kind of error is rare and reflects
bad sectors on disk, but if there are really that many errors, I think
we need to replace the disk without hesitation.

> However, my second suggestions is to ... change completely the interface. What about adding a directory in sysfs, where each entry is a chunk ?
> 
> Something like:
> 
> /sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/type		# data/metadata/sys
> /sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/profile		# dup/linear....
> /sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/size		# size
> /sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/devs		# chunks devs 
> 
> And so on. 
> 
> Checking  "[...]<chunks-offset>/devs", it would be easy understand if the chunk is in "degraded" mode or not.

I'm afraid we cannot do that, as it'll occupy too much memory for large
filesystems given that a typical chunk size is 1GB.

> 
> However I have to admit that I don't know how feasible is iterate over a sysfs directory which is a map of a kernel objects list.
> 
> I think that if these interface would be good enough, we could get rid of a lot of ioctl(TREE_SEARCH) from btrfs-progs. 
>

TREE_SEARCH is faster than iterating sysfs (if we could), isn't it?

thanks,
-liubo
Qu Wenruo Feb. 9, 2018, 12:23 a.m. UTC | #8
On 2018年02月09日 03:07, Liu Bo wrote:
> On Thu, Feb 08, 2018 at 07:52:05PM +0100, Goffredo Baroncelli wrote:
>> On 02/06/2018 12:15 AM, Liu Bo wrote:
>> [...]
>>> One way to mitigate the data loss pain is to expose 'bad chunks',
>>> i.e. degraded chunks, to users, so that they can use 'btrfs balance'
>>> to relocate the whole chunk and get the full raid6 protection again
>>> (if the relocation works).
>>
>> [...]
>> [...]
>>
>>> +
>>> +	/* read lock please */
>>> +	do {
>>> +		seq = read_seqbegin(&fs_info->bc_lock);
>>> +		list_for_each_entry(bc, &fs_info->bad_chunks, list) {
>>> +			len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
>>> +					bc->chunk_offset);
>>> +			/* chunk offset is u64 */
>>> +			if (len >= PAGE_SIZE)
>>> +				break;
>>> +		}
>>> +	} while (read_seqretry(&fs_info->bc_lock, seq));
>>
>> Using this interface, how many chunks can you list ? If I read the code correctly, only up to fill a kernel page.
>>
>> If my math are correctly (PAGE_SIZE=4k, a u64 could require up to 19 bytes) it is possible to list only few hundred of chunks (~200). Not more; and the last one could be even listed incomplete.
>>
> 
> That's true.
> 
>> IIRC a chunk size is max 1GB; If you lost a 500GB of disks, the chunks to list could be more than 200.
>>
>> My first suggestion is to limit the number of chunks to show to 200 (a page should be big enough to contains all these chunks offset). If the chunks number are greater, ends the list with a marker (something like '[...]\n').
>> This would solve the ambiguity about the fact that the list chunks are complete or not. Anyway you cannot list all the chunks.
>>
> 
> Good point, and I need to add one more field to each record to specify
> it's metadata or data.
> 
> So what I have in my mind is that this kind of error is rare and
> reflects bad sectors on disk, but if there are really that many
> errors, I think we need to replace the disk without hesitation.
> 
>> However, my second suggestions is to ... change completely the interface. What about adding a directory in sysfs, where each entry is a chunk ?
>>
>> Something like:
>>
>> /sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/type		# data/metadata/sys
>> /sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/profile		# dup/linear....
>> /sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/size		# size
>> /sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/devs		# chunks devs 

What about a netlink interface?

Although it may need an extra daemon to listen to it, and some people
won't be happy about the extra daemon.

Thanks,
Qu

>>
>> And so on. 
>>
>> Checking  "[...]<chunks-offset>/devs", it would be easy understand if the chunk is in "degraded" mode or not.
> 
> I'm afraid we cannot do that as it'll occupy too much memory for large
> filesystems given a typical chunk size is 1GB.
> 
>>
>> However I have to admit that I don't know how feasible is iterate over a sysfs directory which is a map of a kernel objects list.
>>
>> I think that if these interface would be good enough, we could get rid of a lot of ioctl(TREE_SEARCH) from btrfs-progs. 
>>
> 
> TREE_SEARCH is faster than iterating sysfs (if we could), isn't it?
> 
> thanks,
> -liubo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Darrick J. Wong Feb. 9, 2018, 4:57 a.m. UTC | #9
On Mon, Feb 05, 2018 at 04:15:02PM -0700, Liu Bo wrote:
> Btrfs tries its best to tolerate write errors, but kind of silently
> (except some messages in kernel log).
> 
> For raid1 and raid10, this is usually not a problem because there is a
> copy as backup, while for parity based raid setup, i.e. raid5 and
> raid6, the problem is that, if a write error occurs due to some bad
> sectors, one horizonal stripe becomes degraded and the number of write
> errors it can tolerate gets reduced by one, now if two disk fails,
> data may be lost forever.
> 
> One way to mitigate the data loss pain is to expose 'bad chunks',
> i.e. degraded chunks, to users, so that they can use 'btrfs balance'
> to relocate the whole chunk and get the full raid6 protection again
> (if the relocation works).
> 
> This introduces 'bad_chunks' in btrfs's per-fs sysfs directory.  Once
> a chunk of raid5 or raid6 becomes degraded, it will appear in
> 'bad_chunks'.
> 
> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> ---
> - In this patch, 'bad chunks' is not persistent on disk, but it can be
>   added if it's thought to be a good idea.
> - This is lightly tested, comments are very welcome.

Hmmm... sorry to be late to the party and dump a bunch of semirelated
work suggestions, but what if you implemented GETFSMAP for btrfs?  Then
you could define a new 'defective' fsmap type/flag/whatever and export
it for whatever metadata/filedata/whatever is now screwed up?  Existing
interface, you don't have to kludge sysfs data, none of this string
interpretation stuff...

--D

> 
>  fs/btrfs/ctree.h       |  8 +++++++
>  fs/btrfs/disk-io.c     |  2 ++
>  fs/btrfs/extent-tree.c | 13 +++++++++++
>  fs/btrfs/raid56.c      | 59 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  fs/btrfs/sysfs.c       | 26 ++++++++++++++++++++++
>  fs/btrfs/volumes.c     | 15 +++++++++++--
>  fs/btrfs/volumes.h     |  2 ++
>  7 files changed, 121 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 13c260b..08aad65 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1101,6 +1101,9 @@ struct btrfs_fs_info {
>  	spinlock_t ref_verify_lock;
>  	struct rb_root block_tree;
>  #endif
> +
> +	struct list_head bad_chunks;
> +	seqlock_t bc_lock;
>  };
>  
>  static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
> @@ -2568,6 +2571,11 @@ static inline gfp_t btrfs_alloc_write_mask(struct address_space *mapping)
>  
>  /* extent-tree.c */
>  
> +struct btrfs_bad_chunk {
> +	u64 chunk_offset;
> +	struct list_head list;
> +};
> +
>  enum btrfs_inline_ref_type {
>  	BTRFS_REF_TYPE_INVALID =	 0,
>  	BTRFS_REF_TYPE_BLOCK =		 1,
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index a8ecccf..061e7f94 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2568,6 +2568,8 @@ int open_ctree(struct super_block *sb,
>  	init_waitqueue_head(&fs_info->async_submit_wait);
>  
>  	INIT_LIST_HEAD(&fs_info->pinned_chunks);
> +	INIT_LIST_HEAD(&fs_info->bad_chunks);
> +	seqlock_init(&fs_info->bc_lock);
>  
>  	/* Usable values until the real ones are cached from the superblock */
>  	fs_info->nodesize = 4096;
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 2f43285..3ca7cb4 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9903,6 +9903,19 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
>  		kobject_del(&space_info->kobj);
>  		kobject_put(&space_info->kobj);
>  	}
> +
> +	/* Clean up bad chunks. */
> +	write_seqlock_irq(&info->bc_lock);
> +	while (!list_empty(&info->bad_chunks)) {
> +		struct btrfs_bad_chunk *bc;
> +
> +		bc = list_first_entry(&info->bad_chunks,
> +				      struct btrfs_bad_chunk, list);
> +		list_del_init(&bc->list);
> +		kfree(bc);
> +	}
> +	write_sequnlock_irq(&info->bc_lock);
> +
>  	return 0;
>  }
>  
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index a7f7925..e960247 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -888,14 +888,19 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
>  }
>  
>  /*
> - * end io function used by finish_rmw.  When we finally
> - * get here, we've written a full stripe
> + * end io function used by finish_rmw.  When we finally get here, we've written
> + * a full stripe.
> + *
> + * Note that this is not under interrupt context as we queued endio to workers.
>   */
>  static void raid_write_end_io(struct bio *bio)
>  {
>  	struct btrfs_raid_bio *rbio = bio->bi_private;
>  	blk_status_t err = bio->bi_status;
>  	int max_errors;
> +	u64 stripe_start = rbio->bbio->raid_map[0];
> +	struct btrfs_fs_info *fs_info = rbio->fs_info;
> +	int err_cnt;
>  
>  	if (err)
>  		fail_bio_stripe(rbio, bio);
> @@ -908,12 +913,58 @@ static void raid_write_end_io(struct bio *bio)
>  	err = BLK_STS_OK;
>  
>  	/* OK, we have read all the stripes we need to. */
> +	err_cnt = atomic_read(&rbio->error);
>  	max_errors = (rbio->operation == BTRFS_RBIO_PARITY_SCRUB) ?
>  		     0 : rbio->bbio->max_errors;
>  	if (atomic_read(&rbio->error) > max_errors)
>  		err = BLK_STS_IOERR;
>  
>  	rbio_orig_end_io(rbio, err);
> +
> +	/*
> +	 * If there is any error, this stripe is a degraded one, so is the whole
> +	 * chunk, expose this chunk info to sysfs.
> +	 */
> +	if (unlikely(err_cnt)) {
> +		struct btrfs_bad_chunk *bc;
> +		struct btrfs_bad_chunk *tmp;
> +		struct extent_map *em;
> +		unsigned long flags;
> +
> +		em = get_chunk_map(fs_info, stripe_start, 1);
> +		if (IS_ERR(em))
> +			return;
> +
> +		bc = kzalloc(sizeof(*bc), GFP_NOFS);
> +		/* If allocation fails, it's OK. */
> +		if (!bc) {
> +			free_extent_map(em);
> +			return;
> +		}
> +
> +		write_seqlock_irqsave(&fs_info->bc_lock, flags);
> +		list_for_each_entry(tmp, &fs_info->bad_chunks, list) {
> +			if (tmp->chunk_offset != em->start)
> +				continue;
> +
> +			/*
> +			 * Don't bother if this chunk has already been recorded.
> +			 */
> +			write_sequnlock_irqrestore(&fs_info->bc_lock, flags);
> +			kfree(bc);
> +			free_extent_map(em);
> +			return;
> +		}
> +
> +		/* Add new bad chunk to list. */
> +		bc->chunk_offset = em->start;
> +		free_extent_map(em);
> +
> +		INIT_LIST_HEAD(&bc->list);
> +		list_add(&bc->list, &fs_info->bad_chunks);
> +
> +		write_sequnlock_irqrestore(&fs_info->bc_lock, flags);
> +	}
>  }
>  
>  /*
> @@ -1320,6 +1371,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
>  		bio->bi_end_io = raid_write_end_io;
>  		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
>  
> +		btrfs_bio_wq_end_io(rbio->fs_info, bio, BTRFS_WQ_ENDIO_RAID56);
> +
>  		submit_bio(bio);
>  	}
>  	return;
> @@ -2465,6 +2518,8 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
>  		bio->bi_end_io = raid_write_end_io;
>  		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
>  
> +		btrfs_bio_wq_end_io(rbio->fs_info, bio, BTRFS_WQ_ENDIO_RAID56);
> +
>  		submit_bio(bio);
>  	}
>  	return;
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index a28bba8..0baaa33 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -490,12 +490,38 @@ static ssize_t quota_override_store(struct kobject *kobj,
>  
>  BTRFS_ATTR_RW(, quota_override, quota_override_show, quota_override_store);
>  
> +static ssize_t btrfs_bad_chunks_show(struct kobject *kobj,
> +				     struct kobj_attribute *a, char *buf)
> +{
> +	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
> +	struct btrfs_bad_chunk *bc;
> +	int len = 0;
> +	unsigned int seq;
> +
> +	/* read lock please */
> +	do {
> +		seq = read_seqbegin(&fs_info->bc_lock);
> +		list_for_each_entry(bc, &fs_info->bad_chunks, list) {
> +			len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
> +					bc->chunk_offset);
> +			/* chunk offset is u64 */
> +			if (len >= PAGE_SIZE)
> +				break;
> +		}
> +	} while (read_seqretry(&fs_info->bc_lock, seq));
> +
> +	return len;
> +}
> +
> +BTRFS_ATTR(, bad_chunks, btrfs_bad_chunks_show);
> +
>  static const struct attribute *btrfs_attrs[] = {
>  	BTRFS_ATTR_PTR(, label),
>  	BTRFS_ATTR_PTR(, nodesize),
>  	BTRFS_ATTR_PTR(, sectorsize),
>  	BTRFS_ATTR_PTR(, clone_alignment),
>  	BTRFS_ATTR_PTR(, quota_override),
> +	BTRFS_ATTR_PTR(, bad_chunks),
>  	NULL,
>  };
>  
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index a256842..d71f11a 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2803,8 +2803,8 @@ static int btrfs_del_sys_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>  	return ret;
>  }
>  
> -static struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
> -					u64 logical, u64 length)
> +struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
> +				 u64 logical, u64 length)
>  {
>  	struct extent_map_tree *em_tree;
>  	struct extent_map *em;
> @@ -2840,6 +2840,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
>  	u64 dev_extent_len = 0;
>  	int i, ret = 0;
>  	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> +	struct btrfs_bad_chunk *bc;
>  
>  	em = get_chunk_map(fs_info, chunk_offset, 1);
>  	if (IS_ERR(em)) {
> @@ -2916,6 +2917,16 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
>  	}
>  
>  out:
> +	write_seqlock_irq(&fs_info->bc_lock);
> +	list_for_each_entry(bc, &fs_info->bad_chunks, list) {
> +		if (bc->chunk_offset == chunk_offset) {
> +			list_del_init(&bc->list);
> +			kfree(bc);
> +			break;
> +		}
> +	}
> +	write_sequnlock_irq(&fs_info->bc_lock);
> +
>  	/* once for us */
>  	free_extent_map(em);
>  	return ret;
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index ff15208..4e846ba 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -396,6 +396,8 @@ static inline enum btrfs_map_op btrfs_op(struct bio *bio)
>  	}
>  }
>  
> +struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
> +				 u64 logical, u64 length);
>  int btrfs_account_dev_extents_size(struct btrfs_device *device, u64 start,
>  				   u64 end, u64 *length);
>  void btrfs_get_bbio(struct btrfs_bio *bbio);
> -- 
> 2.9.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Goffredo Baroncelli Feb. 9, 2018, 5:02 p.m. UTC | #10
On 02/08/2018 08:07 PM, Liu Bo wrote:
> On Thu, Feb 08, 2018 at 07:52:05PM +0100, Goffredo Baroncelli wrote:
>> On 02/06/2018 12:15 AM, Liu Bo wrote:
>> [...]
>>> One way to mitigate the data loss pain is to expose 'bad chunks',
>>> i.e. degraded chunks, to users, so that they can use 'btrfs balance'
>>> to relocate the whole chunk and get the full raid6 protection again
>>> (if the relocation works).
>>
>> [...]
>> [...]
>>
>>> +
>>> +	/* read lock please */
>>> +	do {
>>> +		seq = read_seqbegin(&fs_info->bc_lock);
>>> +		list_for_each_entry(bc, &fs_info->bad_chunks, list) {
>>> +			len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
>>> +					bc->chunk_offset);
>>> +			/* chunk offset is u64 */
>>> +			if (len >= PAGE_SIZE)
>>> +				break;
>>> +		}
>>> +	} while (read_seqretry(&fs_info->bc_lock, seq));
>>
>> Using this interface, how many chunks can you list ? If I read the code correctly, only up to fill a kernel page.
>>
>> If my math are correctly (PAGE_SIZE=4k, a u64 could require up to 19 bytes) it is possible to list only few hundred of chunks (~200). Not more; and the last one could be even listed incomplete.
>>
> 
> That's true.
> 
>> IIRC a chunk size is max 1GB; If you lost a 500GB of disks, the chunks to list could be more than 200.
>>
>> My first suggestion is to limit the number of chunks to show to 200 (a page should be big enough to contains all these chunks offset). If the chunks number are greater, ends the list with a marker (something like '[...]\n').
>> This would solve the ambiguity about the fact that the list chunks are complete or not. Anyway you cannot list all the chunks.
>>
> 
> Good point, and I need to add one more field to each record to specify
> it's metadata or data.
> 
> So what I have in my mind is that this kind of error is rare and
> reflects bad sectors on disk, but if there are really that many
> errors, I think we need to replace the disk without hesitation.

What happens if you lose an entire disk? If so, you fill up "bad_chunks".


> 
>> However, my second suggestions is to ... change completely the interface. What about adding a directory in sysfs, where each entry is a chunk ?
>>
>> Something like:
>>
>> /sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/type		# data/metadata/sys
>> /sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/profile		# dup/linear....
>> /sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/size		# size
>> /sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/devs		# chunks devs 
>>
>> And so on. 
>>
>> Checking  "[...]<chunks-offset>/devs", it would be easy understand if the chunk is in "degraded" mode or not.
> 
> I'm afraid we cannot do that as it'll occupy too much memory for large
> filesystems given a typical chunk size is 1GB.
> 
>>
>> However I have to admit that I don't know how feasible is iterate over a sysfs directory which is a map of a kernel objects list.
>>
>> I think that if these interface would be good enough, we could get rid of a lot of ioctl(TREE_SEARCH) from btrfs-progs. 
>>
> 
> TREE_SEARCH is faster than iterating sysfs (if we could), isn't it?

Likely yes. However TREE_SEARCH is bad because it is hard-wired to the filesystem structure, i.e. if for some reason BTRFS changes the on-disk layout, what happens to an old application which expects the previous disk format? And for the same reason, it is root-only (UID==0).

Let me give another idea: what about exporting this information via an ioctl? It could be extended to query all information about (all) the chunks...

> 
> thanks,
> -liubo
>
Liu Bo Feb. 12, 2018, 2:17 p.m. UTC | #11
On Tue, Feb 06, 2018 at 11:11:55AM +0200, Nikolay Borisov wrote:
> 
> 
> On  6.02.2018 01:15, Liu Bo wrote:
> > Btrfs tries its best to tolerate write errors, but kind of silently
> > (except some messages in kernel log).
> > 
> > For raid1 and raid10, this is usually not a problem because there is a
> > copy as backup, while for parity based raid setup, i.e. raid5 and
> > raid6, the problem is that, if a write error occurs due to some bad
> > sectors, one horizonal stripe becomes degraded and the number of write
> > errors it can tolerate gets reduced by one, now if two disk fails,
> > data may be lost forever.
> > 
> > One way to mitigate the data loss pain is to expose 'bad chunks',
> > i.e. degraded chunks, to users, so that they can use 'btrfs balance'
> > to relocate the whole chunk and get the full raid6 protection again
> > (if the relocation works).
> > 
> > This introduces 'bad_chunks' in btrfs's per-fs sysfs directory.  Once
> > a chunk of raid5 or raid6 becomes degraded, it will appear in
> > 'bad_chunks'.
> > 
> > Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> > ---
> > - In this patch, 'bad chunks' is not persistent on disk, but it can be
> >   added if it's thought to be a good idea.
> > - This is lightly tested, comments are very welcome.
> > 
> >  fs/btrfs/ctree.h       |  8 +++++++
> >  fs/btrfs/disk-io.c     |  2 ++
> >  fs/btrfs/extent-tree.c | 13 +++++++++++
> >  fs/btrfs/raid56.c      | 59 ++++++++++++++++++++++++++++++++++++++++++++++++--
> >  fs/btrfs/sysfs.c       | 26 ++++++++++++++++++++++
> >  fs/btrfs/volumes.c     | 15 +++++++++++--
> >  fs/btrfs/volumes.h     |  2 ++
> >  7 files changed, 121 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 13c260b..08aad65 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -1101,6 +1101,9 @@ struct btrfs_fs_info {
> >  	spinlock_t ref_verify_lock;
> >  	struct rb_root block_tree;
> >  #endif
> > +
> > +	struct list_head bad_chunks;
> > +	seqlock_t bc_lock;
> >  };
> >  
> >  static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
> > @@ -2568,6 +2571,11 @@ static inline gfp_t btrfs_alloc_write_mask(struct address_space *mapping)
> >  
> >  /* extent-tree.c */
> >  
> > +struct btrfs_bad_chunk {
> > +	u64 chunk_offset;
> > +	struct list_head list;
> > +};
> > +
> >  enum btrfs_inline_ref_type {
> >  	BTRFS_REF_TYPE_INVALID =	 0,
> >  	BTRFS_REF_TYPE_BLOCK =		 1,
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index a8ecccf..061e7f94 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -2568,6 +2568,8 @@ int open_ctree(struct super_block *sb,
> >  	init_waitqueue_head(&fs_info->async_submit_wait);
> >  
> >  	INIT_LIST_HEAD(&fs_info->pinned_chunks);
> > +	INIT_LIST_HEAD(&fs_info->bad_chunks);
> > +	seqlock_init(&fs_info->bc_lock);
> >  
> >  	/* Usable values until the real ones are cached from the superblock */
> >  	fs_info->nodesize = 4096;
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 2f43285..3ca7cb4 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -9903,6 +9903,19 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
> >  		kobject_del(&space_info->kobj);
> >  		kobject_put(&space_info->kobj);
> >  	}
> > +
> > +	/* Clean up bad chunks. */
> > +	write_seqlock_irq(&info->bc_lock);
> > +	while (!list_empty(&info->bad_chunks)) {
> 
> Why not the idiomatic list_for_each_entry_safe, that way you remove the
> list_first_entry invocation altogether and still get a well-formed
> btrfs_bad_chunk object.
> 
> > +		struct btrfs_bad_chunk *bc;
> > +
> > +		bc = list_first_entry(&info->bad_chunks,
> > +				      struct btrfs_bad_chunk, list);
> > +		list_del_init(&bc->list);
> 
> nit: no need to use the _init variant, you are directly freeing the
> entry, less code to execute :)
> 
> > +		kfree(bc);
> > +	}
> > +	write_sequnlock_irq(&info->bc_lock);
> > +
> >  	return 0;
> >  }
> >  
> > diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> > index a7f7925..e960247 100644
> > --- a/fs/btrfs/raid56.c
> > +++ b/fs/btrfs/raid56.c
> > @@ -888,14 +888,19 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
> >  }
> >  
> >  /*
> > - * end io function used by finish_rmw.  When we finally
> > - * get here, we've written a full stripe
> > + * end io function used by finish_rmw.  When we finally get here, we've written
> > + * a full stripe.
> > + *
> > + * Note that this is not under interrupt context as we queued endio to workers.
> >   */
> >  static void raid_write_end_io(struct bio *bio)
> >  {
> >  	struct btrfs_raid_bio *rbio = bio->bi_private;
> >  	blk_status_t err = bio->bi_status;
> >  	int max_errors;
> > +	u64 stripe_start = rbio->bbio->raid_map[0];
> > +	struct btrfs_fs_info *fs_info = rbio->fs_info;
> > +	int err_cnt;
> >  
> >  	if (err)
> >  		fail_bio_stripe(rbio, bio);
> > @@ -908,12 +913,58 @@ static void raid_write_end_io(struct bio *bio)
> >  	err = BLK_STS_OK;
> >  
> >  	/* OK, we have read all the stripes we need to. */
> > +	err_cnt = atomic_read(&rbio->error);
> >  	max_errors = (rbio->operation == BTRFS_RBIO_PARITY_SCRUB) ?
> >  		     0 : rbio->bbio->max_errors;
> >  	if (atomic_read(&rbio->error) > max_errors)
> >  		err = BLK_STS_IOERR;
> >  
> >  	rbio_orig_end_io(rbio, err);
> > +
> > +	/*
> > +	 * If there is any error, this stripe is a degraded one, so is the whole
> > +	 * chunk, expose this chunk info to sysfs.
> > +	 */
> > +	if (unlikely(err_cnt)) {
> > +		struct btrfs_bad_chunk *bc;
> > +		struct btrfs_bad_chunk *tmp;
> > +		struct extent_map *em;
> > +		unsigned long flags;
> > +
> > +		em = get_chunk_map(fs_info, stripe_start, 1);
> > +		if (IS_ERR(em))
> > +			return;
> > +
> > +		bc = kzalloc(sizeof(*bc), GFP_NOFS);
> > +		/* If allocation fails, it's OK. */
> > +		if (!bc) {
> > +			free_extent_map(em);
> > +			return;
> > +		}
> > +
> > +		write_seqlock_irqsave(&fs_info->bc_lock, flags);
> 
> Why do you disable interrupts here and the comment at the beginning of
> the function claims this code can't be executed in irq context? Given
> the comment I'd expect if you put the following assert at the beginning
> of the function it should never trigger:
> 
> ASSERT(in_irq())

I think you're right, no one is processing the object in irq context.

> 
> > +		list_for_each_entry(tmp, &fs_info->bad_chunks, list) {
> > +			if (tmp->chunk_offset != em->start)
> > +				continue;
> > +
> > +			/*
> > +			 * Don't bother if this chunk has already been recorded.
> > +			 */
> > +			write_sequnlock_irqrestore(&fs_info->bc_lock, flags);
> > +			kfree(bc);
> > +			free_extent_map(em);
> > +			return;
> > +		}
> > +
> > +		/* Add new bad chunk to list. */
> > +		bc->chunk_offset = em->start;
> > +		free_extent_map(em);
> > +
> > +		INIT_LIST_HEAD(&bc->list);
> 
> nit: There is no need to initialize the list head of the entry itself.
> 
> > +		list_add(&bc->list, &fs_info->bad_chunks);
> > +
> > +		write_sequnlock_irqrestore(&fs_info->bc_lock, flags);
> > +	}
> >  }
> >  
> >  /*
> > @@ -1320,6 +1371,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
> >  		bio->bi_end_io = raid_write_end_io;
> >  		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
> >  
> > +		btrfs_bio_wq_end_io(rbio->fs_info, bio, BTRFS_WQ_ENDIO_RAID56);
> > +
> >  		submit_bio(bio);
> >  	}
> >  	return;
> > @@ -2465,6 +2518,8 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
> >  		bio->bi_end_io = raid_write_end_io;
> >  		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
> >  
> > +		btrfs_bio_wq_end_io(rbio->fs_info, bio, BTRFS_WQ_ENDIO_RAID56);
> > +
> >  		submit_bio(bio);
> >  	}
> >  	return;
> > diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> > index a28bba8..0baaa33 100644
> > --- a/fs/btrfs/sysfs.c
> > +++ b/fs/btrfs/sysfs.c
> > @@ -490,12 +490,38 @@ static ssize_t quota_override_store(struct kobject *kobj,
> >  
> >  BTRFS_ATTR_RW(, quota_override, quota_override_show, quota_override_store);
> >  
> > +static ssize_t btrfs_bad_chunks_show(struct kobject *kobj,
> > +				     struct kobj_attribute *a, char *buf)
> > +{
> > +	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
> > +	struct btrfs_bad_chunk *bc;
> > +	int len = 0;
> > +	unsigned int seq;
> > +
> > +	/* read lock please */
> > +	do {
> > +		seq = read_seqbegin(&fs_info->bc_lock);
> > +		list_for_each_entry(bc, &fs_info->bad_chunks, list) {
> > +			len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
> > +					bc->chunk_offset);
> > +			/* chunk offset is u64 */
> > +			if (len >= PAGE_SIZE)
> > +				break;
> > +		}
> > +	} while (read_seqretry(&fs_info->bc_lock, seq));
> > +
> > +	return len;
> > +}
> > +
> > +BTRFS_ATTR(, bad_chunks, btrfs_bad_chunks_show);
> > +
> >  static const struct attribute *btrfs_attrs[] = {
> >  	BTRFS_ATTR_PTR(, label),
> >  	BTRFS_ATTR_PTR(, nodesize),
> >  	BTRFS_ATTR_PTR(, sectorsize),
> >  	BTRFS_ATTR_PTR(, clone_alignment),
> >  	BTRFS_ATTR_PTR(, quota_override),
> > +	BTRFS_ATTR_PTR(, bad_chunks),
> >  	NULL,
> >  };
> >  
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index a256842..d71f11a 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -2803,8 +2803,8 @@ static int btrfs_del_sys_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
> >  	return ret;
> >  }
> >  
> > -static struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
> > -					u64 logical, u64 length)
> > +struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
> > +				 u64 logical, u64 length)
> 
> nit: Since you are exposing the function as an API I think this is a
> good opportunity to add proper kernel doc for it.
>

It has nothing to do with the patch's purpose, let's leave it to a
separate one.

> >  {
> >  	struct extent_map_tree *em_tree;
> >  	struct extent_map *em;
> > @@ -2840,6 +2840,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
> >  	u64 dev_extent_len = 0;
> >  	int i, ret = 0;
> >  	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> > +	struct btrfs_bad_chunk *bc;
> >  
> >  	em = get_chunk_map(fs_info, chunk_offset, 1);
> >  	if (IS_ERR(em)) {
> > @@ -2916,6 +2917,16 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
> >  	}
> >  
> >  out:
> > +	write_seqlock_irq(&fs_info->bc_lock);
> > +	list_for_each_entry(bc, &fs_info->bad_chunks, list) {
> 
> Use list_for_each_entry_safe to make it more apparent you are going to
> be removing from the list. The code as-is works since you are doing a
> break after deleting element from the list but this is somewhat subtle.

To be honest, I don't see much difference.

I think the _safe version is there to protect us from races when others
remove objects from the list, and the write lock is held, so we're safe.
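
For reference, the _safe variant would look roughly like this (sketch
only, 'next' is just a second struct btrfs_bad_chunk pointer used as
the iterator's lookahead):

	write_seqlock_irq(&fs_info->bc_lock);
	list_for_each_entry_safe(bc, next, &fs_info->bad_chunks, list) {
		if (bc->chunk_offset == chunk_offset) {
			list_del(&bc->list);
			kfree(bc);
			break;
		}
	}
	write_sequnlock_irq(&fs_info->bc_lock);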

> Also it's not necessary to re-init the deleted entry since you are
> directly freeing it.
>

OK.

Thanks for the comments.

Thanks,

-liubo
> > +		if (bc->chunk_offset == chunk_offset) {
> > +			list_del_init(&bc->list);
> > +			kfree(bc);
> > +			break;
> > +		}
> > +	}
> > +	write_sequnlock_irq(&fs_info->bc_lock);
> > +
> >  	/* once for us */
> >  	free_extent_map(em);
> >  	return ret;
> > diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> > index ff15208..4e846ba 100644
> > --- a/fs/btrfs/volumes.h
> > +++ b/fs/btrfs/volumes.h
> > @@ -396,6 +396,8 @@ static inline enum btrfs_map_op btrfs_op(struct bio *bio)
> >  	}
> >  }
> >  
> > +struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
> > +				 u64 logical, u64 length);
> >  int btrfs_account_dev_extents_size(struct btrfs_device *device, u64 start,
> >  				   u64 end, u64 *length);
> >  void btrfs_get_bbio(struct btrfs_bio *bbio);
> > 
Liu Bo Feb. 12, 2018, 2:37 p.m. UTC | #12
On Thu, Feb 08, 2018 at 05:47:46PM +0800, Anand Jain wrote:
> 
> 
> On 02/06/2018 07:15 AM, Liu Bo wrote:
> > Btrfs tries its best to tolerate write errors, but kind of silently
> > (except some messages in kernel log).
> > 
> > For raid1 and raid10, this is usually not a problem because there is a
> > copy as backup, while for parity based raid setup, i.e. raid5 and
> > raid6, the problem is that, if a write error occurs due to some bad
> > sectors, one horizonal stripe becomes degraded and the number of write
> > errors it can tolerate gets reduced by one, now if two disk fails,
> > data may be lost forever.
> 
> This is equally true in raid1, raid10, and raid5.  Sorry I didn't get
> the point why degraded stripe is critical only to the parity based
> stripes (raid5/6)?

Hmm, right, it also applies to raid1 and raid10.

> And does it really need a bad chunk list to fix in case of parity
> based stripes or the balance without bad chunks list can fix as well?
>

A full balance can surely fix it, but the intent here is to move only
the chunk which the bad stripe belongs to, so that a full, expensive
balance can be avoided.

> > One way to mitigate the data loss pain is to expose 'bad chunks',
> > i.e. degraded chunks, to users, so that they can use 'btrfs balance'
> > to relocate the whole chunk and get the full raid6 protection again
> > (if the relocation works).
> 
> Depending on the type of disk error its recovery action would vary. For
> example, it can be a complete disk fail or interim RW failure due to
> environmental/transport factors. The disk auto relocation will do the
> job of relocating the real bad blocks in the most of the modern disks.
> The challenging task will be to know where to draw the line between
> complete disk failure (failed) vs interim disk failure (offline) so I
> had plans of making it tunable base on number of disk errors.
>

Right, I agree with the configurable stuff.

> If it's confirmed that a disk is failed, the auto-replace with the hot
> spare disk will be its recovery action. Balance with a failed disk won't
> help.
>
> Patches to these are in the ML.
> 
> If the failure is momentary due to environmental factors, including the
> transport layer, then as we expect the disk with the data will come back
> we shouldn't kick in the hot spare, that is disk state offline, or maybe
> its a state where read old data is fine, but cannot write new data.
> I think you are addressing this interim state. It's better to define the
> disk states first so that its recovery action can be defined. I can
> revise the patches on that. So that replace VS re-balance using bad chunks
> can be decided.

That's true; with the bad chunk info, one can decide for oneself whether
to balance or replace.  If the whole device has failed, balance doesn't
make sense and will fail for sure.

> 
> > This introduces 'bad_chunks' in btrfs's per-fs sysfs directory.  Once
> > a chunk of raid5 or raid6 becomes degraded, it will appear in
> > 'bad_chunks'.
> 
> AFAIK a variable list of output is not allowed on sysfs.
>
> IMHO list of bad chunks won't help the user (it ok if its needed by
> kernel). It will help if you provide the list of affected-files
> so that the user can use it script to make additional interim external
> copy until the disk recovers from the interim error.
>

Probably we can do that, but given the complexity of walking the btree
several times to find the name/path/other stuff, chunk granularity is
good enough for the use case I have.  Users here do not want to
interfere too much with the low-level filesystem through a script;
probably the only feasible thing for them is to click some buttons in
a GUI.

Thanks a lot for your inputs.

Thanks,

-liubo
Liu Bo Feb. 12, 2018, 3:23 p.m. UTC | #13
On Thu, Feb 08, 2018 at 08:57:39PM -0800, Darrick J. Wong wrote:
> On Mon, Feb 05, 2018 at 04:15:02PM -0700, Liu Bo wrote:
> > Btrfs tries its best to tolerate write errors, but kind of silently
> > (except some messages in kernel log).
> > 
> > For raid1 and raid10, this is usually not a problem because there is a
> > copy as backup, while for parity based raid setup, i.e. raid5 and
> > raid6, the problem is that, if a write error occurs due to some bad
> > sectors, one horizonal stripe becomes degraded and the number of write
> > errors it can tolerate gets reduced by one, now if two disk fails,
> > data may be lost forever.
> > 
> > One way to mitigate the data loss pain is to expose 'bad chunks',
> > i.e. degraded chunks, to users, so that they can use 'btrfs balance'
> > to relocate the whole chunk and get the full raid6 protection again
> > (if the relocation works).
> > 
> > This introduces 'bad_chunks' in btrfs's per-fs sysfs directory.  Once
> > a chunk of raid5 or raid6 becomes degraded, it will appear in
> > 'bad_chunks'.
> > 
> > Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> > ---
> > - In this patch, 'bad chunks' is not persistent on disk, but it can be
> >   added if it's thought to be a good idea.
> > - This is lightly tested, comments are very welcome.
> 
> Hmmm... sorry to be late to the party and dump a bunch of semirelated
> work suggestions, but what if you implemented GETFSMAP for btrfs?  Then
> you could define a new 'defective' fsmap type/flag/whatever and export
> it for whatever metadata/filedata/whatever is now screwed up?  Existing
> interface, you don't have to kludge sysfs data, none of this string
> interpretation stuff...

After checking the GETFSMAP semantics, I think I can give it a shot.

Right now we have the dev stat ioctl, but it's per-device and not
suitable for exposing per-fs stuff, so GETFSMAP could probably
provide what I want.
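
From userspace it could then be queried along these lines (only a
sketch: it assumes btrfs grows GETFSMAP support and a hypothetical
FMR_OF_DEFECTIVE output flag, neither of which exists today):

#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fsmap.h>

#define FMR_OF_DEFECTIVE	(1U << 7)	/* hypothetical flag */

int main(int argc, char **argv)
{
	char buf[sizeof(struct fsmap_head) + 64 * sizeof(struct fsmap)];
	struct fsmap_head *head = (struct fsmap_head *)buf;
	unsigned int i;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;

	memset(buf, 0, sizeof(buf));
	head->fmh_count = 64;
	/* low key stays all zeroes, high key covers the whole fs */
	head->fmh_keys[1].fmr_device = UINT_MAX;
	head->fmh_keys[1].fmr_flags = UINT_MAX;
	head->fmh_keys[1].fmr_physical = ULLONG_MAX;
	head->fmh_keys[1].fmr_owner = ULLONG_MAX;
	head->fmh_keys[1].fmr_offset = ULLONG_MAX;

	if (ioctl(fd, FS_IOC_GETFSMAP, head) < 0)
		return 1;

	for (i = 0; i < head->fmh_entries; i++) {
		struct fsmap *rec = &head->fmh_recs[i];

		if (rec->fmr_flags & FMR_OF_DEFECTIVE)
			printf("degraded: physical %llu len %llu\n",
			       (unsigned long long)rec->fmr_physical,
			       (unsigned long long)rec->fmr_length);
	}

	close(fd);
	return 0;
}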

Thanks for the suggestion, Darrick.

Thanks,

-liubo
> 
> --D
> 
> > 
> >  fs/btrfs/ctree.h       |  8 +++++++
> >  fs/btrfs/disk-io.c     |  2 ++
> >  fs/btrfs/extent-tree.c | 13 +++++++++++
> >  fs/btrfs/raid56.c      | 59 ++++++++++++++++++++++++++++++++++++++++++++++++--
> >  fs/btrfs/sysfs.c       | 26 ++++++++++++++++++++++
> >  fs/btrfs/volumes.c     | 15 +++++++++++--
> >  fs/btrfs/volumes.h     |  2 ++
> >  7 files changed, 121 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 13c260b..08aad65 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -1101,6 +1101,9 @@ struct btrfs_fs_info {
> >  	spinlock_t ref_verify_lock;
> >  	struct rb_root block_tree;
> >  #endif
> > +
> > +	struct list_head bad_chunks;
> > +	seqlock_t bc_lock;
> >  };
> >  
> >  static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
> > @@ -2568,6 +2571,11 @@ static inline gfp_t btrfs_alloc_write_mask(struct address_space *mapping)
> >  
> >  /* extent-tree.c */
> >  
> > +struct btrfs_bad_chunk {
> > +	u64 chunk_offset;
> > +	struct list_head list;
> > +};
> > +
> >  enum btrfs_inline_ref_type {
> >  	BTRFS_REF_TYPE_INVALID =	 0,
> >  	BTRFS_REF_TYPE_BLOCK =		 1,
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index a8ecccf..061e7f94 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -2568,6 +2568,8 @@ int open_ctree(struct super_block *sb,
> >  	init_waitqueue_head(&fs_info->async_submit_wait);
> >  
> >  	INIT_LIST_HEAD(&fs_info->pinned_chunks);
> > +	INIT_LIST_HEAD(&fs_info->bad_chunks);
> > +	seqlock_init(&fs_info->bc_lock);
> >  
> >  	/* Usable values until the real ones are cached from the superblock */
> >  	fs_info->nodesize = 4096;
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 2f43285..3ca7cb4 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -9903,6 +9903,19 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
> >  		kobject_del(&space_info->kobj);
> >  		kobject_put(&space_info->kobj);
> >  	}
> > +
> > +	/* Clean up bad chunks. */
> > +	write_seqlock_irq(&info->bc_lock);
> > +	while (!list_empty(&info->bad_chunks)) {
> > +		struct btrfs_bad_chunk *bc;
> > +
> > +		bc = list_first_entry(&info->bad_chunks,
> > +				      struct btrfs_bad_chunk, list);
> > +		list_del_init(&bc->list);
> > +		kfree(bc);
> > +	}
> > +	write_sequnlock_irq(&info->bc_lock);
> > +
> >  	return 0;
> >  }
> >  
> > diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> > index a7f7925..e960247 100644
> > --- a/fs/btrfs/raid56.c
> > +++ b/fs/btrfs/raid56.c
> > @@ -888,14 +888,19 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
> >  }
> >  
> >  /*
> > - * end io function used by finish_rmw.  When we finally
> > - * get here, we've written a full stripe
> > + * end io function used by finish_rmw.  When we finally get here, we've written
> > + * a full stripe.
> > + *
> > + * Note that this is not under interrupt context as we queued endio to workers.
> >   */
> >  static void raid_write_end_io(struct bio *bio)
> >  {
> >  	struct btrfs_raid_bio *rbio = bio->bi_private;
> >  	blk_status_t err = bio->bi_status;
> >  	int max_errors;
> > +	u64 stripe_start = rbio->bbio->raid_map[0];
> > +	struct btrfs_fs_info *fs_info = rbio->fs_info;
> > +	int err_cnt;
> >  
> >  	if (err)
> >  		fail_bio_stripe(rbio, bio);
> > @@ -908,12 +913,58 @@ static void raid_write_end_io(struct bio *bio)
> >  	err = BLK_STS_OK;
> >  
> >  	/* OK, we have read all the stripes we need to. */
> > +	err_cnt = atomic_read(&rbio->error);
> >  	max_errors = (rbio->operation == BTRFS_RBIO_PARITY_SCRUB) ?
> >  		     0 : rbio->bbio->max_errors;
> >  	if (atomic_read(&rbio->error) > max_errors)
> >  		err = BLK_STS_IOERR;
> >  
> >  	rbio_orig_end_io(rbio, err);
> > +
> > +	/*
> > +	 * If there is any error, this stripe is a degraded one, so is the whole
> > +	 * chunk, expose this chunk info to sysfs.
> > +	 */
> > +	if (unlikely(err_cnt)) {
> > +		struct btrfs_bad_chunk *bc;
> > +		struct btrfs_bad_chunk *tmp;
> > +		struct extent_map *em;
> > +		unsigned long flags;
> > +
> > +		em = get_chunk_map(fs_info, stripe_start, 1);
> > +		if (IS_ERR(em))
> > +			return;
> > +
> > +		bc = kzalloc(sizeof(*bc), GFP_NOFS);
> > +		/* If allocation fails, it's OK. */
> > +		if (!bc) {
> > +			free_extent_map(em);
> > +			return;
> > +		}
> > +
> > +		write_seqlock_irqsave(&fs_info->bc_lock, flags);
> > +		list_for_each_entry(tmp, &fs_info->bad_chunks, list) {
> > +			if (tmp->chunk_offset != em->start)
> > +				continue;
> > +
> > +			/*
> > +			 * Don't bother if this chunk has already been recorded.
> > +			 */
> > +			write_sequnlock_irqrestore(&fs_info->bc_lock, flags);
> > +			kfree(bc);
> > +			free_extent_map(em);
> > +			return;
> > +		}
> > +
> > +		/* Add new bad chunk to list. */
> > +		bc->chunk_offset = em->start;
> > +		free_extent_map(em);
> > +
> > +		INIT_LIST_HEAD(&bc->list);
> > +		list_add(&bc->list, &fs_info->bad_chunks);
> > +
> > +		write_sequnlock_irqrestore(&fs_info->bc_lock, flags);
> > +	}
> >  }
> >  
> >  /*
> > @@ -1320,6 +1371,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
> >  		bio->bi_end_io = raid_write_end_io;
> >  		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
> >  
> > +		btrfs_bio_wq_end_io(rbio->fs_info, bio, BTRFS_WQ_ENDIO_RAID56);
> > +
> >  		submit_bio(bio);
> >  	}
> >  	return;
> > @@ -2465,6 +2518,8 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
> >  		bio->bi_end_io = raid_write_end_io;
> >  		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
> >  
> > +		btrfs_bio_wq_end_io(rbio->fs_info, bio, BTRFS_WQ_ENDIO_RAID56);
> > +
> >  		submit_bio(bio);
> >  	}
> >  	return;
> > diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> > index a28bba8..0baaa33 100644
> > --- a/fs/btrfs/sysfs.c
> > +++ b/fs/btrfs/sysfs.c
> > @@ -490,12 +490,38 @@ static ssize_t quota_override_store(struct kobject *kobj,
> >  
> >  BTRFS_ATTR_RW(, quota_override, quota_override_show, quota_override_store);
> >  
> > +static ssize_t btrfs_bad_chunks_show(struct kobject *kobj,
> > +				     struct kobj_attribute *a, char *buf)
> > +{
> > +	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
> > +	struct btrfs_bad_chunk *bc;
> > +	int len = 0;
> > +	unsigned int seq;
> > +
> > +	/* read lock please */
> > +	do {
> > +		seq = read_seqbegin(&fs_info->bc_lock);
> > +		list_for_each_entry(bc, &fs_info->bad_chunks, list) {
> > +			len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
> > +					bc->chunk_offset);
> > +			/* chunk offset is u64 */
> > +			if (len >= PAGE_SIZE)
> > +				break;
> > +		}
> > +	} while (read_seqretry(&fs_info->bc_lock, seq));
> > +
> > +	return len;
> > +}
> > +
> > +BTRFS_ATTR(, bad_chunks, btrfs_bad_chunks_show);
> > +
> >  static const struct attribute *btrfs_attrs[] = {
> >  	BTRFS_ATTR_PTR(, label),
> >  	BTRFS_ATTR_PTR(, nodesize),
> >  	BTRFS_ATTR_PTR(, sectorsize),
> >  	BTRFS_ATTR_PTR(, clone_alignment),
> >  	BTRFS_ATTR_PTR(, quota_override),
> > +	BTRFS_ATTR_PTR(, bad_chunks),
> >  	NULL,
> >  };
> >  
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index a256842..d71f11a 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -2803,8 +2803,8 @@ static int btrfs_del_sys_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
> >  	return ret;
> >  }
> >  
> > -static struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
> > -					u64 logical, u64 length)
> > +struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
> > +				 u64 logical, u64 length)
> >  {
> >  	struct extent_map_tree *em_tree;
> >  	struct extent_map *em;
> > @@ -2840,6 +2840,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
> >  	u64 dev_extent_len = 0;
> >  	int i, ret = 0;
> >  	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> > +	struct btrfs_bad_chunk *bc;
> >  
> >  	em = get_chunk_map(fs_info, chunk_offset, 1);
> >  	if (IS_ERR(em)) {
> > @@ -2916,6 +2917,16 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
> >  	}
> >  
> >  out:
> > +	write_seqlock_irq(&fs_info->bc_lock);
> > +	list_for_each_entry(bc, &fs_info->bad_chunks, list) {
> > +		if (bc->chunk_offset == chunk_offset) {
> > +			list_del_init(&bc->list);
> > +			kfree(bc);
> > +			break;
> > +		}
> > +	}
> > +	write_sequnlock_irq(&fs_info->bc_lock);
> > +
> >  	/* once for us */
> >  	free_extent_map(em);
> >  	return ret;
> > diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> > index ff15208..4e846ba 100644
> > --- a/fs/btrfs/volumes.h
> > +++ b/fs/btrfs/volumes.h
> > @@ -396,6 +396,8 @@ static inline enum btrfs_map_op btrfs_op(struct bio *bio)
> >  	}
> >  }
> >  
> > +struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
> > +				 u64 logical, u64 length);
> >  int btrfs_account_dev_extents_size(struct btrfs_device *device, u64 start,
> >  				   u64 end, u64 *length);
> >  void btrfs_get_bbio(struct btrfs_bio *bbio);
> > -- 
> > 2.9.4
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liu Bo Feb. 12, 2018, 3:26 p.m. UTC | #14
On Fri, Feb 09, 2018 at 06:02:01PM +0100, Goffredo Baroncelli wrote:
> On 02/08/2018 08:07 PM, Liu Bo wrote:
> > On Thu, Feb 08, 2018 at 07:52:05PM +0100, Goffredo Baroncelli wrote:
> >> On 02/06/2018 12:15 AM, Liu Bo wrote:
> >> [...]
> >>> One way to mitigate the data loss pain is to expose 'bad chunks',
> >>> i.e. degraded chunks, to users, so that they can use 'btrfs balance'
> >>> to relocate the whole chunk and get the full raid6 protection again
> >>> (if the relocation works).
> >>
> >> [...]
> >> [...]
> >>
> >>> +
> >>> +	/* read lock please */
> >>> +	do {
> >>> +		seq = read_seqbegin(&fs_info->bc_lock);
> >>> +		list_for_each_entry(bc, &fs_info->bad_chunks, list) {
> >>> +			len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
> >>> +					bc->chunk_offset);
> >>> +			/* chunk offset is u64 */
> >>> +			if (len >= PAGE_SIZE)
> >>> +				break;
> >>> +		}
> >>> +	} while (read_seqretry(&fs_info->bc_lock, seq));
> >>
> >> Using this interface, how many chunks can you list ? If I read the code correctly, only up to fill a kernel page.
> >>
> >> If my math are correctly (PAGE_SIZE=4k, a u64 could require up to 19 bytes) it is possible to list only few hundred of chunks (~200). Not more; and the last one could be even listed incomplete.
> >>
> > 
> > That's true.
> > 
> >> IIRC a chunk size is max 1GB; If you lost a 500GB of disks, the chunks to list could be more than 200.
> >>
> >> My first suggestion is to limit the number of chunks to show to 200 (a page should be big enough to contains all these chunks offset). If the chunks number are greater, ends the list with a marker (something like '[...]\n').
> >> This would solve the ambiguity about the fact that the list chunks are complete or not. Anyway you cannot list all the chunks.
> >>
> > 
> > Good point, and I need to add one more field to each record to specify
> > it's metadata or data.
> > 
> > So what I have in my mind is that this kind of error is rare and
> > reflects bad sectors on disk, but if there are really that many
> > errors, I think we need to replace the disk without hesitation.
> 
> What happens if you loose an entire disk ? If so, you fill "bad_sector"
> 
> 
> > 
> >> However, my second suggestions is to ... change completely the interface. What about adding a directory in sysfs, where each entry is a chunk ?
> >>
> >> Something like:
> >>
> >> /sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/type		# data/metadata/sys
> >> /sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/profile		# dup/linear....
> >> /sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/size		# size
> >> /sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/devs		# chunks devs 
> >>
> >> And so on. 
> >>
> >> Checking  "[...]<chunks-offset>/devs", it would be easy understand if the chunk is in "degraded" mode or not.
> > 
> > I'm afraid we cannot do that as it'll occupy too much memory for large
> > filesystems given a typical chunk size is 1GB.
> > 
> >>
> >> However I have to admit that I don't know how feasible is iterate over a sysfs directory which is a map of a kernel objects list.
> >>
> >> I think that if these interface would be good enough, we could get rid of a lot of ioctl(TREE_SEARCH) from btrfs-progs. 
> >>
> > 
> > TREE_SEARCH is faster than iterating sysfs (if we could), isn't it?
> 
> Likely yes. However TREE_SEARCH is bad because it is hard linked to the filesystem structure. I.e. if for some reason BTRFS change the on disk layout, what happens for the old application which expect the previous disk format ? And for the same reason, it is root-only (UID==0)
> 
> Let me to give another idea: what about exporting these information via an ioctl ? It could be extended to query all information about (all) the chunks...
>

It's all about choice; by using sysfs one can check it at the lowest cost.

But as mentioned in the thread, I'll go and try a new ioctl way.
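
Not settled yet, but roughly something along these lines (completely
hypothetical, the names and layout are made up just to illustrate):

struct btrfs_ioctl_bad_chunks_args {
	__u64 flags;		/* in: reserved, must be 0 */
	__u64 nr_items;		/* in: room in 'items'; out: items filled */
	struct {
		__u64 chunk_offset;	/* logical start of the chunk */
		__u64 type;		/* BTRFS_BLOCK_GROUP_* flags */
	} items[];
};

i.e. userspace passes in a buffer, the kernel fills in one entry per
degraded chunk, and 'btrfs balance' (or a GUI on top of it) can then
relocate exactly those chunks.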

Thanks,

-liubo
Nikolay Borisov Feb. 12, 2018, 5:39 p.m. UTC | #15
On 12.02.2018 16:17, Liu Bo wrote:
> On Tue, Feb 06, 2018 at 11:11:55AM +0200, Nikolay Borisov wrote:
>>
>>
>> On  6.02.2018 01:15, Liu Bo wrote:
>>> Btrfs tries its best to tolerate write errors, but kind of silently
>>> (except some messages in kernel log).
>>>
>>> For raid1 and raid10, this is usually not a problem because there is a
>>> copy as backup, while for parity based raid setup, i.e. raid5 and
>>> raid6, the problem is that, if a write error occurs due to some bad
>>> sectors, one horizonal stripe becomes degraded and the number of write
>>> errors it can tolerate gets reduced by one, now if two disk fails,
>>> data may be lost forever.
>>>
>>> One way to mitigate the data loss pain is to expose 'bad chunks',
>>> i.e. degraded chunks, to users, so that they can use 'btrfs balance'
>>> to relocate the whole chunk and get the full raid6 protection again
>>> (if the relocation works).
>>>
>>> This introduces 'bad_chunks' in btrfs's per-fs sysfs directory.  Once
>>> a chunk of raid5 or raid6 becomes degraded, it will appear in
>>> 'bad_chunks'.
>>>
>>> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
>>> ---
>>> - In this patch, 'bad chunks' is not persistent on disk, but it can be
>>>   added if it's thought to be a good idea.
>>> - This is lightly tested, comments are very welcome.
>>>
>>>  fs/btrfs/ctree.h       |  8 +++++++
>>>  fs/btrfs/disk-io.c     |  2 ++
>>>  fs/btrfs/extent-tree.c | 13 +++++++++++
>>>  fs/btrfs/raid56.c      | 59 ++++++++++++++++++++++++++++++++++++++++++++++++--
>>>  fs/btrfs/sysfs.c       | 26 ++++++++++++++++++++++
>>>  fs/btrfs/volumes.c     | 15 +++++++++++--
>>>  fs/btrfs/volumes.h     |  2 ++
>>>  7 files changed, 121 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index 13c260b..08aad65 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -1101,6 +1101,9 @@ struct btrfs_fs_info {
>>>  	spinlock_t ref_verify_lock;
>>>  	struct rb_root block_tree;
>>>  #endif
>>> +
>>> +	struct list_head bad_chunks;
>>> +	seqlock_t bc_lock;
>>>  };
>>>  
>>>  static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
>>> @@ -2568,6 +2571,11 @@ static inline gfp_t btrfs_alloc_write_mask(struct address_space *mapping)
>>>  
>>>  /* extent-tree.c */
>>>  
>>> +struct btrfs_bad_chunk {
>>> +	u64 chunk_offset;
>>> +	struct list_head list;
>>> +};
>>> +
>>>  enum btrfs_inline_ref_type {
>>>  	BTRFS_REF_TYPE_INVALID =	 0,
>>>  	BTRFS_REF_TYPE_BLOCK =		 1,
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index a8ecccf..061e7f94 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -2568,6 +2568,8 @@ int open_ctree(struct super_block *sb,
>>>  	init_waitqueue_head(&fs_info->async_submit_wait);
>>>  
>>>  	INIT_LIST_HEAD(&fs_info->pinned_chunks);
>>> +	INIT_LIST_HEAD(&fs_info->bad_chunks);
>>> +	seqlock_init(&fs_info->bc_lock);
>>>  
>>>  	/* Usable values until the real ones are cached from the superblock */
>>>  	fs_info->nodesize = 4096;
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 2f43285..3ca7cb4 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -9903,6 +9903,19 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
>>>  		kobject_del(&space_info->kobj);
>>>  		kobject_put(&space_info->kobj);
>>>  	}
>>> +
>>> +	/* Clean up bad chunks. */
>>> +	write_seqlock_irq(&info->bc_lock);
>>> +	while (!list_empty(&info->bad_chunks)) {
>>
>> Why not the idiomatic list_for_each_entry_safe, that way you remove the
>> list_first_entry invocation altogether and still get a well-formed
>> btrfs_bad_chunk object.
>>
>>> +		struct btrfs_bad_chunk *bc;
>>> +
>>> +		bc = list_first_entry(&info->bad_chunks,
>>> +				      struct btrfs_bad_chunk, list);
>>> +		list_del_init(&bc->list);
>>
>> nit: no need to use the _init variant, you are directly freeing the
>> entry, less code to execute :)
>>
>>> +		kfree(bc);
>>> +	}
>>> +	write_sequnlock_irq(&info->bc_lock);
>>> +
>>>  	return 0;
>>>  }
>>>  
>>> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
>>> index a7f7925..e960247 100644
>>> --- a/fs/btrfs/raid56.c
>>> +++ b/fs/btrfs/raid56.c
>>> @@ -888,14 +888,19 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
>>>  }
>>>  
>>>  /*
>>> - * end io function used by finish_rmw.  When we finally
>>> - * get here, we've written a full stripe
>>> + * end io function used by finish_rmw.  When we finally get here, we've written
>>> + * a full stripe.
>>> + *
>>> + * Note that this is not under interrupt context as we queued endio to workers.
>>>   */
>>>  static void raid_write_end_io(struct bio *bio)
>>>  {
>>>  	struct btrfs_raid_bio *rbio = bio->bi_private;
>>>  	blk_status_t err = bio->bi_status;
>>>  	int max_errors;
>>> +	u64 stripe_start = rbio->bbio->raid_map[0];
>>> +	struct btrfs_fs_info *fs_info = rbio->fs_info;
>>> +	int err_cnt;
>>>  
>>>  	if (err)
>>>  		fail_bio_stripe(rbio, bio);
>>> @@ -908,12 +913,58 @@ static void raid_write_end_io(struct bio *bio)
>>>  	err = BLK_STS_OK;
>>>  
>>>  	/* OK, we have read all the stripes we need to. */
>>> +	err_cnt = atomic_read(&rbio->error);
>>>  	max_errors = (rbio->operation == BTRFS_RBIO_PARITY_SCRUB) ?
>>>  		     0 : rbio->bbio->max_errors;
>>>  	if (atomic_read(&rbio->error) > max_errors)
>>>  		err = BLK_STS_IOERR;
>>>  
>>>  	rbio_orig_end_io(rbio, err);
>>> +
>>> +	/*
>>> +	 * If there is any error, this stripe is a degraded one, so is the whole
>>> +	 * chunk, expose this chunk info to sysfs.
>>> +	 */
>>> +	if (unlikely(err_cnt)) {
>>> +		struct btrfs_bad_chunk *bc;
>>> +		struct btrfs_bad_chunk *tmp;
>>> +		struct extent_map *em;
>>> +		unsigned long flags;
>>> +
>>> +		em = get_chunk_map(fs_info, stripe_start, 1);
>>> +		if (IS_ERR(em))
>>> +			return;
>>> +
>>> +		bc = kzalloc(sizeof(*bc), GFP_NOFS);
>>> +		/* If allocation fails, it's OK. */
>>> +		if (!bc) {
>>> +			free_extent_map(em);
>>> +			return;
>>> +		}
>>> +
>>> +		write_seqlock_irqsave(&fs_info->bc_lock, flags);
>>
>> Why do you disable interrupts here when the comment at the beginning of
>> the function claims this code can't be executed in irq context? Given
>> the comment, I'd expect that if you put the following assert at the
>> beginning of the function it should never trigger:
>>
>> ASSERT(!in_irq())
> 
> I think you're right, no one is processing the object in irq context.
> 
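Assuming the end-io path really is queued to a worker and runs in process
context, the writer side could then drop the irq-safe variants; a minimal
sketch of what that would look like:

        /* Process context only, so plain seqlock write locking is enough. */
        write_seqlock(&fs_info->bc_lock);
        list_add(&bc->list, &fs_info->bad_chunks);
        write_sequnlock(&fs_info->bc_lock);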
>>
>>> +		list_for_each_entry(tmp, &fs_info->bad_chunks, list) {
>>> +			if (tmp->chunk_offset != em->start)
>>> +				continue;
>>> +
>>> +			/*
>>> +			 * Don't bother if this chunk has already been recorded.
>>> +			 */
>>> +			write_sequnlock_irqrestore(&fs_info->bc_lock, flags);
>>> +			kfree(bc);
>>> +			free_extent_map(em);
>>> +			return;
>>> +		}
>>> +
>>> +		/* Add new bad chunk to list. */
>>> +		bc->chunk_offset = em->start;
>>> +		free_extent_map(em);
>>> +
>>> +		INIT_LIST_HEAD(&bc->list);
>>
>> nit: There is no need to initialize the list head of the entry itself.
>>
>>> +		list_add(&bc->list, &fs_info->bad_chunks);
>>> +
>>> +		write_sequnlock_irqrestore(&fs_info->bc_lock, flags);
>>> +	}
>>>  }
>>>  
>>>  /*
>>> @@ -1320,6 +1371,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
>>>  		bio->bi_end_io = raid_write_end_io;
>>>  		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
>>>  
>>> +		btrfs_bio_wq_end_io(rbio->fs_info, bio, BTRFS_WQ_ENDIO_RAID56);
>>> +
>>>  		submit_bio(bio);
>>>  	}
>>>  	return;
>>> @@ -2465,6 +2518,8 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
>>>  		bio->bi_end_io = raid_write_end_io;
>>>  		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
>>>  
>>> +		btrfs_bio_wq_end_io(rbio->fs_info, bio, BTRFS_WQ_ENDIO_RAID56);
>>> +
>>>  		submit_bio(bio);
>>>  	}
>>>  	return;
>>> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
>>> index a28bba8..0baaa33 100644
>>> --- a/fs/btrfs/sysfs.c
>>> +++ b/fs/btrfs/sysfs.c
>>> @@ -490,12 +490,38 @@ static ssize_t quota_override_store(struct kobject *kobj,
>>>  
>>>  BTRFS_ATTR_RW(, quota_override, quota_override_show, quota_override_store);
>>>  
>>> +static ssize_t btrfs_bad_chunks_show(struct kobject *kobj,
>>> +				     struct kobj_attribute *a, char *buf)
>>> +{
>>> +	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
>>> +	struct btrfs_bad_chunk *bc;
>>> +	int len = 0;
>>> +	unsigned int seq;
>>> +
>>> +	/* read lock please */
>>> +	do {
>>> +		seq = read_seqbegin(&fs_info->bc_lock);
>>> +		list_for_each_entry(bc, &fs_info->bad_chunks, list) {
>>> +			len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
>>> +					bc->chunk_offset);
>>> +			/* chunk offset is u64 */
>>> +			if (len >= PAGE_SIZE)
>>> +				break;
>>> +		}
>>> +	} while (read_seqretry(&fs_info->bc_lock, seq));
>>> +
>>> +	return len;
>>> +}
>>> +
>>> +BTRFS_ATTR(, bad_chunks, btrfs_bad_chunks_show);
>>> +
>>>  static const struct attribute *btrfs_attrs[] = {
>>>  	BTRFS_ATTR_PTR(, label),
>>>  	BTRFS_ATTR_PTR(, nodesize),
>>>  	BTRFS_ATTR_PTR(, sectorsize),
>>>  	BTRFS_ATTR_PTR(, clone_alignment),
>>>  	BTRFS_ATTR_PTR(, quota_override),
>>> +	BTRFS_ATTR_PTR(, bad_chunks),
>>>  	NULL,
>>>  };
>>>  
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index a256842..d71f11a 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -2803,8 +2803,8 @@ static int btrfs_del_sys_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>>>  	return ret;
>>>  }
>>>  
>>> -static struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
>>> -					u64 logical, u64 length)
>>> +struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
>>> +				 u64 logical, u64 length)
>>
>> nit: Since you are exposing the function as an API I think this is a
>> good opportunity to add proper kernel doc for it.
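For reference, a kernel-doc block for the newly exported helper might look
something like the sketch below (wording is illustrative and based only on
how the function is used in this patch):

        /**
         * get_chunk_map - look up the chunk mapping covering a logical range
         * @fs_info: the filesystem
         * @logical: start of the logical address range
         * @length:  length of the range in bytes
         *
         * Returns the extent_map of the chunk containing the range, with an
         * elevated reference count that the caller must drop with
         * free_extent_map(), or an ERR_PTR() if no mapping covers the range.
         */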
>>
> 
> It has nothing to do with the patch's purpose, let's leave it to a
> separate one.
> 
>>>  {
>>>  	struct extent_map_tree *em_tree;
>>>  	struct extent_map *em;
>>> @@ -2840,6 +2840,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
>>>  	u64 dev_extent_len = 0;
>>>  	int i, ret = 0;
>>>  	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>>> +	struct btrfs_bad_chunk *bc;
>>>  
>>>  	em = get_chunk_map(fs_info, chunk_offset, 1);
>>>  	if (IS_ERR(em)) {
>>> @@ -2916,6 +2917,16 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
>>>  	}
>>>  
>>>  out:
>>> +	write_seqlock_irq(&fs_info->bc_lock);
>>> +	list_for_each_entry(bc, &fs_info->bad_chunks, list) {
>>
>> Use list_for_each_entry_safe to make it more apparent you are going to
>> be removing from the list. The code as-is works since you are doing a
>> break after deleting an element from the list, but this is somewhat subtle.
> 
> To be honest, I don't see much difference.
> 
> I think the _safe version is there to protect us from races where others
> remove objects from the list, and the write lock is held so we're safe.

No, the _safe version uses the second argument (n) as the list iterator.
The non-safe version just uses 'pos', and in case you remove 'pos' from
the list AND continue iterating you will deref an invalid pointer. So
_safe is actually really necessary for correctness when you intend to
remove an entry from a list you are iterating, irrespective of any locks
you might have.
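Concretely, applied to the removal loop in btrfs_remove_chunk(), the _safe
variant keeps a second cursor pointing at the next entry, so freeing the
current one cannot break the iteration. A rough sketch (names follow the
patch; 'tmp' is the lookahead cursor):

        struct btrfs_bad_chunk *bc, *tmp;

        list_for_each_entry_safe(bc, tmp, &fs_info->bad_chunks, list) {
                if (bc->chunk_offset != chunk_offset)
                        continue;
                /* 'tmp' already holds the next entry, so kfree(bc) is safe. */
                list_del(&bc->list);
                kfree(bc);
                break;
        }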
> 
>> Also it's not necessary to re-init the deleted entry since you are
>> directly freeing it.
>>
> 
> OK.
> 
> Thanks for the comments.
> 
> Thanks,
> 
> -liubo
>>> +		if (bc->chunk_offset == chunk_offset) {
>>> +			list_del_init(&bc->list);
>>> +			kfree(bc);
>>> +			break;
>>> +		}
>>> +	}
>>> +	write_sequnlock_irq(&fs_info->bc_lock);
>>> +
>>>  	/* once for us */
>>>  	free_extent_map(em);
>>>  	return ret;
>>> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
>>> index ff15208..4e846ba 100644
>>> --- a/fs/btrfs/volumes.h
>>> +++ b/fs/btrfs/volumes.h
>>> @@ -396,6 +396,8 @@ static inline enum btrfs_map_op btrfs_op(struct bio *bio)
>>>  	}
>>>  }
>>>  
>>> +struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
>>> +				 u64 logical, u64 length);
>>>  int btrfs_account_dev_extents_size(struct btrfs_device *device, u64 start,
>>>  				   u64 end, u64 *length);
>>>  void btrfs_get_bbio(struct btrfs_bio *bbio);
>>>
Liu Bo Feb. 13, 2018, 12:06 a.m. UTC | #16
On Mon, Feb 12, 2018 at 07:39:02PM +0200, Nikolay Borisov wrote:
> 
> 
> On 12.02.2018 16:17, Liu Bo wrote:
> > On Tue, Feb 06, 2018 at 11:11:55AM +0200, Nikolay Borisov wrote:
[...]
> >>
> >> Use list_for_each_entry_safe to make it more apparent you are going to
> >> be removing from the list. The code as-is works since you are doing a
> >> break after deleting an element from the list, but this is somewhat subtle.
> > 
> > To be honest, I don't see much difference.
> > 
> > I think the _safe version is there to protect us from races where others
> > remove objects from the list, and the write lock is held so we're safe.
> 
> No, the _safe version uses the second argument (n) as the list iterator.
> The non-safe version just uses 'pos', and in case you remove 'pos' from
> the list AND continue iterating you will deref an invalid pointer. So
> _safe is actually really necessary for correctness when you intend to
> remove an entry from a list you are iterating, irrespective of any locks
> you might have.

You're right, I know it takes an extra pointer to store the next
entry, but I missed the point about deleting the entry we're iterating on.

Thank you for the clarification.

thanks,
-liubo
David Sterba Feb. 26, 2018, 5:18 p.m. UTC | #17
On Mon, Feb 12, 2018 at 07:23:05AM -0800, Liu Bo wrote:
> On Thu, Feb 08, 2018 at 08:57:39PM -0800, Darrick J. Wong wrote:
> > On Mon, Feb 05, 2018 at 04:15:02PM -0700, Liu Bo wrote:
> > > Btrfs tries its best to tolerate write errors, but kind of silently
> > > (except some messages in kernel log).
> > > 
> > > For raid1 and raid10, this is usually not a problem because there is a
> > > copy as backup, while for parity based raid setup, i.e. raid5 and
> > > raid6, the problem is that, if a write error occurs due to some bad
> > > sectors, one horizontal stripe becomes degraded and the number of write
> > > errors it can tolerate gets reduced by one; now if two disks fail,
> > > data may be lost forever.
> > > 
> > > One way to mitigate the data loss pain is to expose 'bad chunks',
> > > i.e. degraded chunks, to users, so that they can use 'btrfs balance'
> > > to relocate the whole chunk and get the full raid6 protection again
> > > (if the relocation works).
> > > 
> > > This introduces 'bad_chunks' in btrfs's per-fs sysfs directory.  Once
> > > a chunk of raid5 or raid6 becomes degraded, it will appear in
> > > 'bad_chunks'.
> > > 
> > > Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> > > ---
> > > - In this patch, 'bad chunks' is not persistent on disk, but it can be
> > >   added if it's thought to be a good idea.
> > > - This is lightly tested, comments are very welcome.
> > 
> > Hmmm... sorry to be late to the party and dump a bunch of semirelated
> > work suggestions, but what if you implemented GETFSMAP for btrfs?  Then
> > you could define a new 'defective' fsmap type/flag/whatever and export
> > it for whatever metadata/filedata/whatever is now screwed up?  Existing
> > interface, you don't have to kludge sysfs data, none of this string
> > interpretation stuff...
> 
> By checking the GETFSMAP semantics, I think I can give it a shot.

Using GETFSMAP sounds good to me, if not the best of all the options
mentioned in the thread.
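For context, GETFSMAP is the FS_IOC_GETFSMAP ioctl consumed from user space.
The rough sketch below shows how a tool could walk the returned records and
pick out degraded ranges, assuming btrfs grew support for the ioctl; the
FMR_OF_DEFECTIVE flag is hypothetical and stands in for whatever new
'defective' type or flag such an extension would define:

        #include <linux/fsmap.h>
        #include <sys/ioctl.h>
        #include <limits.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* Hypothetical flag, not defined in linux/fsmap.h today. */
        #define FMR_OF_DEFECTIVE        0x100

        #define NR_RECS 128

        static void dump_degraded(int fd)
        {
                struct fsmap_head *head;
                unsigned int i;

                head = calloc(1, sizeof(*head) + NR_RECS * sizeof(struct fsmap));
                if (!head)
                        return;

                head->fmh_count = NR_RECS;
                /* High key: cover the whole physical address space. */
                head->fmh_keys[1].fmr_device = UINT_MAX;
                head->fmh_keys[1].fmr_physical = ULLONG_MAX;
                head->fmh_keys[1].fmr_owner = ULLONG_MAX;
                head->fmh_keys[1].fmr_offset = ULLONG_MAX;
                head->fmh_keys[1].fmr_flags = UINT_MAX;

                if (ioctl(fd, FS_IOC_GETFSMAP, head) == 0) {
                        for (i = 0; i < head->fmh_entries; i++) {
                                struct fsmap *rec = &head->fmh_recs[i];

                                if (rec->fmr_flags & FMR_OF_DEFECTIVE)
                                        printf("degraded: physical %llu len %llu\n",
                                               (unsigned long long)rec->fmr_physical,
                                               (unsigned long long)rec->fmr_length);
                        }
                }
                free(head);
        }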
diff mbox

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 13c260b..08aad65 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1101,6 +1101,9 @@  struct btrfs_fs_info {
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;
 #endif
+
+	struct list_head bad_chunks;
+	seqlock_t bc_lock;
 };
 
 static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
@@ -2568,6 +2571,11 @@  static inline gfp_t btrfs_alloc_write_mask(struct address_space *mapping)
 
 /* extent-tree.c */
 
+struct btrfs_bad_chunk {
+	u64 chunk_offset;
+	struct list_head list;
+};
+
 enum btrfs_inline_ref_type {
 	BTRFS_REF_TYPE_INVALID =	 0,
 	BTRFS_REF_TYPE_BLOCK =		 1,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a8ecccf..061e7f94 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2568,6 +2568,8 @@  int open_ctree(struct super_block *sb,
 	init_waitqueue_head(&fs_info->async_submit_wait);
 
 	INIT_LIST_HEAD(&fs_info->pinned_chunks);
+	INIT_LIST_HEAD(&fs_info->bad_chunks);
+	seqlock_init(&fs_info->bc_lock);
 
 	/* Usable values until the real ones are cached from the superblock */
 	fs_info->nodesize = 4096;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2f43285..3ca7cb4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9903,6 +9903,19 @@  int btrfs_free_block_groups(struct btrfs_fs_info *info)
 		kobject_del(&space_info->kobj);
 		kobject_put(&space_info->kobj);
 	}
+
+	/* Clean up bad chunks. */
+	write_seqlock_irq(&info->bc_lock);
+	while (!list_empty(&info->bad_chunks)) {
+		struct btrfs_bad_chunk *bc;
+
+		bc = list_first_entry(&info->bad_chunks,
+				      struct btrfs_bad_chunk, list);
+		list_del_init(&bc->list);
+		kfree(bc);
+	}
+	write_sequnlock_irq(&info->bc_lock);
+
 	return 0;
 }
 
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index a7f7925..e960247 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -888,14 +888,19 @@  static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
 }
 
 /*
- * end io function used by finish_rmw.  When we finally
- * get here, we've written a full stripe
+ * end io function used by finish_rmw.  When we finally get here, we've written
+ * a full stripe.
+ *
+ * Note that this is not under interrupt context as we queued endio to workers.
  */
 static void raid_write_end_io(struct bio *bio)
 {
 	struct btrfs_raid_bio *rbio = bio->bi_private;
 	blk_status_t err = bio->bi_status;
 	int max_errors;
+	u64 stripe_start = rbio->bbio->raid_map[0];
+	struct btrfs_fs_info *fs_info = rbio->fs_info;
+	int err_cnt;
 
 	if (err)
 		fail_bio_stripe(rbio, bio);
@@ -908,12 +913,58 @@  static void raid_write_end_io(struct bio *bio)
 	err = BLK_STS_OK;
 
 	/* OK, we have read all the stripes we need to. */
+	err_cnt = atomic_read(&rbio->error);
 	max_errors = (rbio->operation == BTRFS_RBIO_PARITY_SCRUB) ?
 		     0 : rbio->bbio->max_errors;
 	if (atomic_read(&rbio->error) > max_errors)
 		err = BLK_STS_IOERR;
 
 	rbio_orig_end_io(rbio, err);
+
+	/*
+	 * If there is any error, this stripe is a degraded one, so is the whole
+	 * chunk, expose this chunk info to sysfs.
+	 */
+	if (unlikely(err_cnt)) {
+		struct btrfs_bad_chunk *bc;
+		struct btrfs_bad_chunk *tmp;
+		struct extent_map *em;
+		unsigned long flags;
+
+		em = get_chunk_map(fs_info, stripe_start, 1);
+		if (IS_ERR(em))
+			return;
+
+		bc = kzalloc(sizeof(*bc), GFP_NOFS);
+		/* If allocation fails, it's OK. */
+		if (!bc) {
+			free_extent_map(em);
+			return;
+		}
+
+		write_seqlock_irqsave(&fs_info->bc_lock, flags);
+		list_for_each_entry(tmp, &fs_info->bad_chunks, list) {
+			if (tmp->chunk_offset != em->start)
+				continue;
+
+			/*
+			 * Don't bother if this chunk has already been recorded.
+			 */
+			write_sequnlock_irqrestore(&fs_info->bc_lock, flags);
+			kfree(bc);
+			free_extent_map(em);
+			return;
+		}
+
+		/* Add new bad chunk to list. */
+		bc->chunk_offset = em->start;
+		free_extent_map(em);
+
+		INIT_LIST_HEAD(&bc->list);
+		list_add(&bc->list, &fs_info->bad_chunks);
+
+		write_sequnlock_irqrestore(&fs_info->bc_lock, flags);
+	}
 }
 
 /*
@@ -1320,6 +1371,8 @@  static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 		bio->bi_end_io = raid_write_end_io;
 		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
 
+		btrfs_bio_wq_end_io(rbio->fs_info, bio, BTRFS_WQ_ENDIO_RAID56);
+
 		submit_bio(bio);
 	}
 	return;
@@ -2465,6 +2518,8 @@  static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 		bio->bi_end_io = raid_write_end_io;
 		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
 
+		btrfs_bio_wq_end_io(rbio->fs_info, bio, BTRFS_WQ_ENDIO_RAID56);
+
 		submit_bio(bio);
 	}
 	return;
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index a28bba8..0baaa33 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -490,12 +490,38 @@  static ssize_t quota_override_store(struct kobject *kobj,
 
 BTRFS_ATTR_RW(, quota_override, quota_override_show, quota_override_store);
 
+static ssize_t btrfs_bad_chunks_show(struct kobject *kobj,
+				     struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
+	struct btrfs_bad_chunk *bc;
+	int len = 0;
+	unsigned int seq;
+
+	/* read lock please */
+	do {
+		seq = read_seqbegin(&fs_info->bc_lock);
+		list_for_each_entry(bc, &fs_info->bad_chunks, list) {
+			len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
+					bc->chunk_offset);
+			/* chunk offset is u64 */
+			if (len >= PAGE_SIZE)
+				break;
+		}
+	} while (read_seqretry(&fs_info->bc_lock, seq));
+
+	return len;
+}
+
+BTRFS_ATTR(, bad_chunks, btrfs_bad_chunks_show);
+
 static const struct attribute *btrfs_attrs[] = {
 	BTRFS_ATTR_PTR(, label),
 	BTRFS_ATTR_PTR(, nodesize),
 	BTRFS_ATTR_PTR(, sectorsize),
 	BTRFS_ATTR_PTR(, clone_alignment),
 	BTRFS_ATTR_PTR(, quota_override),
+	BTRFS_ATTR_PTR(, bad_chunks),
 	NULL,
 };
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a256842..d71f11a 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2803,8 +2803,8 @@  static int btrfs_del_sys_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
 	return ret;
 }
 
-static struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
-					u64 logical, u64 length)
+struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
+				 u64 logical, u64 length)
 {
 	struct extent_map_tree *em_tree;
 	struct extent_map *em;
@@ -2840,6 +2840,7 @@  int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
 	u64 dev_extent_len = 0;
 	int i, ret = 0;
 	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_bad_chunk *bc;
 
 	em = get_chunk_map(fs_info, chunk_offset, 1);
 	if (IS_ERR(em)) {
@@ -2916,6 +2917,16 @@  int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
 	}
 
 out:
+	write_seqlock_irq(&fs_info->bc_lock);
+	list_for_each_entry(bc, &fs_info->bad_chunks, list) {
+		if (bc->chunk_offset == chunk_offset) {
+			list_del_init(&bc->list);
+			kfree(bc);
+			break;
+		}
+	}
+	write_sequnlock_irq(&fs_info->bc_lock);
+
 	/* once for us */
 	free_extent_map(em);
 	return ret;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index ff15208..4e846ba 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -396,6 +396,8 @@  static inline enum btrfs_map_op btrfs_op(struct bio *bio)
 	}
 }
 
+struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
+				 u64 logical, u64 length);
 int btrfs_account_dev_extents_size(struct btrfs_device *device, u64 start,
 				   u64 end, u64 *length);
 void btrfs_get_bbio(struct btrfs_bio *bbio);