[v2] bcache: set max writeback rate when I/O request is idle

Message ID	20180724040310.1590-1-colyli@suse.de (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-block-owner@kernel.org> From: Coly Li <colyli@suse.de> To: colyli@suse.de, linux-bcache@vger.kernel.org Cc: linux-block@vger.kernel.org, stable@vger.kernel.org, Michael Lyle <mlyle@lyle.org>, Stefan Priebe <s.priebe@profihost.ag> Subject: [PATCH v2] bcache: set max writeback rate when I/O request is idle Date: Tue, 24 Jul 2018 12:03:10 +0800 Message-Id: <20180724040310.1590-1-colyli@suse.de> Sender: linux-block-owner@vger.kernel.org Precedence: bulk
Series	[v2] bcache: set max writeback rate when I/O request is idle \| expand [v2] bcache: set max writeback rate when I/O request is idle

Coly Li July 24, 2018, 4:03 a.m. UTC

Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
allows the writeback rate to be faster if there is no I/O request on a
bcache device. It works well if there is only one bcache device attached
to the cache set. If there are many bcache devices attached to a cache
set, it may introduce performance regression because multiple faster
writeback threads of the idle bcache devices will compete the btree level
locks with the bcache device who have I/O requests coming.

This patch fixes the above issue by only permitting fast writebac when
all bcache devices attached on the cache set are idle. And if one of the
bcache devices has new I/O request coming, minimized all writeback
throughput immediately and let PI controller __update_writeback_rate()
to decide the upcoming writeback rate for each bcache device.

Also when all bcache devices are idle, limited wrieback rate to a small
number is wast of thoughput, especially when backing devices are slower
non-rotation devices (e.g. SATA SSD). This patch sets a max writeback
rate for each backing device if the whole cache set is idle. A faster
writeback rate in idle time means new I/Os may have more available space
for dirty data, and people may observe a better write performance then.

Please note bcache may change its cache mode in run time, and this patch
still works if the cache mode is switched from writeback mode and there
is still dirty data on cache.

Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
Cc: stable@vger.kernel.org #4.16+
Signed-off-by: Coly Li <colyli@suse.de>
Tested-by: Kai Krakow <kai@kaishome.de>
Cc: Michael Lyle <mlyle@lyle.org>
Cc: Stefan Priebe <s.priebe@profihost.ag>
---
Channgelog:
v2, Fix a deadlock reported by Stefan Priebe.
v1, Initial version.

 drivers/md/bcache/bcache.h    |  11 ++--
 drivers/md/bcache/request.c   |  51 ++++++++++++++-
 drivers/md/bcache/super.c     |   1 +
 drivers/md/bcache/sysfs.c     |  14 +++--
 drivers/md/bcache/util.c      |   2 +-
 drivers/md/bcache/util.h      |   2 +-
 drivers/md/bcache/writeback.c | 115 ++++++++++++++++++++++++++--------
 7 files changed, 155 insertions(+), 41 deletions(-)

Stefan Priebe - Profihost AG July 24, 2018, 7:16 a.m. UTC | #1

Am 24.07.2018 um 06:03 schrieb Coly Li:
> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
> allows the writeback rate to be faster if there is no I/O request on a
> bcache device. It works well if there is only one bcache device attached
> to the cache set. If there are many bcache devices attached to a cache
> set, it may introduce performance regression because multiple faster
> writeback threads of the idle bcache devices will compete the btree level
> locks with the bcache device who have I/O requests coming.
> 
> This patch fixes the above issue by only permitting fast writebac when
> all bcache devices attached on the cache set are idle. And if one of the
> bcache devices has new I/O request coming, minimized all writeback
> throughput immediately and let PI controller __update_writeback_rate()
> to decide the upcoming writeback rate for each bcache device.
> 
> Also when all bcache devices are idle, limited wrieback rate to a small
> number is wast of thoughput, especially when backing devices are slower
> non-rotation devices (e.g. SATA SSD). This patch sets a max writeback
> rate for each backing device if the whole cache set is idle. A faster
> writeback rate in idle time means new I/Os may have more available space
> for dirty data, and people may observe a better write performance then.
> 
> Please note bcache may change its cache mode in run time, and this patch
> still works if the cache mode is switched from writeback mode and there
> is still dirty data on cache.
> 
> Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
> Cc: stable@vger.kernel.org #4.16+
> Signed-off-by: Coly Li <colyli@suse.de>
> Tested-by: Kai Krakow <kai@kaishome.de>
> Cc: Michael Lyle <mlyle@lyle.org>
> Cc: Stefan Priebe <s.priebe@profihost.ag>

Hi Coly,

i'm still experiencing the same bug as yesterday while rebooting every
two times i get a deadlock in cache_set_free.

Sadly it's so late ion the shutdown process that netconsole is already
unloaded or at least i get no messages from while it happens. The traces
look the same like yesterday.

Greets,
Stefan

> ---
> Channgelog:
> v2, Fix a deadlock reported by Stefan Priebe.
> v1, Initial version.
> 
>  drivers/md/bcache/bcache.h    |  11 ++--
>  drivers/md/bcache/request.c   |  51 ++++++++++++++-
>  drivers/md/bcache/super.c     |   1 +
>  drivers/md/bcache/sysfs.c     |  14 +++--
>  drivers/md/bcache/util.c      |   2 +-
>  drivers/md/bcache/util.h      |   2 +-
>  drivers/md/bcache/writeback.c | 115 ++++++++++++++++++++++++++--------
>  7 files changed, 155 insertions(+), 41 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index d6bf294f3907..469ab1a955e0 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -328,13 +328,6 @@ struct cached_dev {
>  	 */
>  	atomic_t		has_dirty;
>  
> -	/*
> -	 * Set to zero by things that touch the backing volume-- except
> -	 * writeback.  Incremented by writeback.  Used to determine when to
> -	 * accelerate idle writeback.
> -	 */
> -	atomic_t		backing_idle;
> -
>  	struct bch_ratelimit	writeback_rate;
>  	struct delayed_work	writeback_rate_update;
>  
> @@ -514,6 +507,8 @@ struct cache_set {
>  	struct cache_accounting accounting;
>  
>  	unsigned long		flags;
> +	atomic_t		idle_counter;
> +	atomic_t		at_max_writeback_rate;
>  
>  	struct cache_sb		sb;
>  
> @@ -523,6 +518,8 @@ struct cache_set {
>  
>  	struct bcache_device	**devices;
>  	unsigned		devices_max_used;
> +	/* See set_at_max_writeback_rate() for how it is used */
> +	unsigned		previous_dirty_dc_nr;
>  	struct list_head	cached_devs;
>  	uint64_t		cached_dev_sectors;
>  	struct closure		caching;
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index ae67f5fa8047..1af3d96abfa5 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -1104,6 +1104,43 @@ static void detached_dev_do_request(struct bcache_device *d, struct bio *bio)
>  
>  /* Cached devices - read & write stuff */
>  
> +static void quit_max_writeback_rate(struct cache_set *c,
> +				    struct cached_dev *this_dc)
> +{
> +	int i;
> +	struct bcache_device *d;
> +	struct cached_dev *dc;
> +
> +	/*
> +	 * If bch_register_lock is acquired by other attach/detach operations,
> +	 * waiting here will increase I/O request latency for seconds or more.
> +	 * To avoid such situation, only writeback rate of current cached device
> +	 * is set to 1, and __update_write_back() will decide writeback rate
> +	 * of other cached devices (remember c->idle_counter is 0 now).
> +	 */
> +	if (mutex_trylock(&bch_register_lock)){
> +		for (i = 0; i < c->devices_max_used; i++) {
> +			if (!c->devices[i])
> +				continue;
> +
> +			if (UUID_FLASH_ONLY(&c->uuids[i]))
> +				continue;
> +
> +			d = c->devices[i];
> +			dc = container_of(d, struct cached_dev, disk);
> +			/*
> +			 * set writeback rate to default minimum value,
> +			 * then let update_writeback_rate() to decide the
> +			 * upcoming rate.
> +			 */
> +			atomic64_set(&dc->writeback_rate.rate, 1);
> +		}
> +
> +		mutex_unlock(&bch_register_lock);
> +	} else
> +		atomic64_set(&this_dc->writeback_rate.rate, 1);
> +}
> +
>  static blk_qc_t cached_dev_make_request(struct request_queue *q,
>  					struct bio *bio)
>  {
> @@ -1119,7 +1156,19 @@ static blk_qc_t cached_dev_make_request(struct request_queue *q,
>  		return BLK_QC_T_NONE;
>  	}
>  
> -	atomic_set(&dc->backing_idle, 0);
> +	if (d->c) {
> +		atomic_set(&d->c->idle_counter, 0);
> +		/*
> +		 * If at_max_writeback_rate of cache set is true and new I/O
> +		 * comes, quit max writeback rate of all cached devices
> +		 * attached to this cache set, and set at_max_writeback_rate
> +		 * to false.
> +		 */
> +		if (unlikely(atomic_read(&d->c->at_max_writeback_rate) == 1)) {
> +			atomic_set(&d->c->at_max_writeback_rate, 0);
> +			quit_max_writeback_rate(d->c, dc);
> +		}
> +	}
>  	generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0);
>  
>  	bio_set_dev(bio, dc->bdev);
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index fa4058e43202..fa532d9f9353 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1687,6 +1687,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
>  	c->block_bits		= ilog2(sb->block_size);
>  	c->nr_uuids		= bucket_bytes(c) / sizeof(struct uuid_entry);
>  	c->devices_max_used	= 0;
> +	c->previous_dirty_dc_nr	= 0;
>  	c->btree_pages		= bucket_pages(c);
>  	if (c->btree_pages > BTREE_MAX_PAGES)
>  		c->btree_pages = max_t(int, c->btree_pages / 4,
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index 225b15aa0340..d719021bff81 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -170,7 +170,8 @@ SHOW(__bch_cached_dev)
>  	var_printf(writeback_running,	"%i");
>  	var_print(writeback_delay);
>  	var_print(writeback_percent);
> -	sysfs_hprint(writeback_rate,	dc->writeback_rate.rate << 9);
> +	sysfs_hprint(writeback_rate,
> +		     atomic64_read(&dc->writeback_rate.rate) << 9);
>  	sysfs_hprint(io_errors,		atomic_read(&dc->io_errors));
>  	sysfs_printf(io_error_limit,	"%i", dc->error_limit);
>  	sysfs_printf(io_disable,	"%i", dc->io_disable);
> @@ -188,7 +189,8 @@ SHOW(__bch_cached_dev)
>  		char change[20];
>  		s64 next_io;
>  
> -		bch_hprint(rate,	dc->writeback_rate.rate << 9);
> +		bch_hprint(rate,
> +			   atomic64_read(&dc->writeback_rate.rate) << 9);
>  		bch_hprint(dirty,	bcache_dev_sectors_dirty(&dc->disk) << 9);
>  		bch_hprint(target,	dc->writeback_rate_target << 9);
>  		bch_hprint(proportional,dc->writeback_rate_proportional << 9);
> @@ -255,8 +257,12 @@ STORE(__cached_dev)
>  
>  	sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent, 0, 40);
>  
> -	sysfs_strtoul_clamp(writeback_rate,
> -			    dc->writeback_rate.rate, 1, INT_MAX);
> +	if (attr == &sysfs_writeback_rate) {
> +		int v;
> +
> +		sysfs_strtoul_clamp(writeback_rate, v, 1, INT_MAX);
> +		atomic64_set(&dc->writeback_rate.rate, v);
> +	}
>  
>  	sysfs_strtoul_clamp(writeback_rate_update_seconds,
>  			    dc->writeback_rate_update_seconds,
> diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
> index fc479b026d6d..84f90c3d996d 100644
> --- a/drivers/md/bcache/util.c
> +++ b/drivers/md/bcache/util.c
> @@ -200,7 +200,7 @@ uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t done)
>  {
>  	uint64_t now = local_clock();
>  
> -	d->next += div_u64(done * NSEC_PER_SEC, d->rate);
> +	d->next += div_u64(done * NSEC_PER_SEC, atomic64_read(&d->rate));
>  
>  	/* Bound the time.  Don't let us fall further than 2 seconds behind
>  	 * (this prevents unnecessary backlog that would make it impossible
> diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
> index cced87f8eb27..7e17f32ab563 100644
> --- a/drivers/md/bcache/util.h
> +++ b/drivers/md/bcache/util.h
> @@ -442,7 +442,7 @@ struct bch_ratelimit {
>  	 * Rate at which we want to do work, in units per second
>  	 * The units here correspond to the units passed to bch_next_delay()
>  	 */
> -	uint32_t		rate;
> +	atomic64_t		rate;
>  };
>  
>  static inline void bch_ratelimit_reset(struct bch_ratelimit *d)
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index ad45ebe1a74b..11ffadc3cf8f 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -49,6 +49,80 @@ static uint64_t __calc_target_rate(struct cached_dev *dc)
>  	return (cache_dirty_target * bdev_share) >> WRITEBACK_SHARE_SHIFT;
>  }
>  
> +static bool set_at_max_writeback_rate(struct cache_set *c,
> +				      struct cached_dev *dc)
> +{
> +	int i, dirty_dc_nr = 0;
> +	struct bcache_device *d;
> +
> +	/*
> +	 * bch_register_lock is acquired in cached_dev_detach_finish() before
> +	 * calling cancel_writeback_rate_update_dwork() to stop the delayed
> +	 * kworker writeback_rate_update (where the context we are for now).
> +	 * Therefore call mutex_lock() here may introduce deadlock when shut
> +	 * down the bcache device.
> +	 * c->previous_dirty_dc_nr is used to record previous calculated
> +	 * dirty_dc_nr when mutex_trylock() last time succeeded. Then if
> +	 * mutex_trylock() failed here, use c->previous_dirty_dc_nr as dirty
> +	 * cached device number. Of cause it might be inaccurate, but a few more
> +	 * or less loop before setting c->at_max_writeback_rate is much better
> +	 * then a deadlock here.
> +	 */
> +	if (mutex_trylock(&bch_register_lock)) {
> +		for (i = 0; i < c->devices_max_used; i++) {
> +			if (!c->devices[i])
> +				continue;
> +			if (UUID_FLASH_ONLY(&c->uuids[i]))
> +				continue;
> +			d = c->devices[i];
> +			dc = container_of(d, struct cached_dev, disk);
> +			if (atomic_read(&dc->has_dirty))
> +				dirty_dc_nr++;
> +		}
> +		c->previous_dirty_dc_nr = dirty_dc_nr;
> +
> +		mutex_unlock(&bch_register_lock);
> +	} else
> +		dirty_dc_nr = c->previous_dirty_dc_nr;
> +
> +	/*
> +	 * Idle_counter is increased everytime when update_writeback_rate()
> +	 * is rescheduled in. If all backing devices attached to the same
> +	 * cache set has same dc->writeback_rate_update_seconds value, it
> +	 * is about 10 rounds of update_writeback_rate() is called on each
> +	 * backing device, then the code will fall through at set 1 to
> +	 * c->at_max_writeback_rate, and a max wrteback rate to each
> +	 * dc->writeback_rate.rate. This is not very accurate but works well
> +	 * to make sure the whole cache set has no new I/O coming before
> +	 * writeback rate is set to a max number.
> +	 */
> +	if (atomic_inc_return(&c->idle_counter) < dirty_dc_nr * 10)
> +		return false;
> +
> +	if (atomic_read(&c->at_max_writeback_rate) != 1)
> +		atomic_set(&c->at_max_writeback_rate, 1);
> +
> +
> +	atomic64_set(&dc->writeback_rate.rate, INT_MAX);
> +
> +	/* keep writeback_rate_target as existing value */
> +	dc->writeback_rate_proportional = 0;
> +	dc->writeback_rate_integral_scaled = 0;
> +	dc->writeback_rate_change = 0;
> +
> +	/*
> +	 * Check c->idle_counter and c->at_max_writeback_rate agagain in case
> +	 * new I/O arrives during before set_at_max_writeback_rate() returns.
> +	 * Then the writeback rate is set to 1, and its new value should be
> +	 * decided via __update_writeback_rate().
> +	 */
> +	if (atomic_read(&c->idle_counter) < dirty_dc_nr * 10 ||
> +	    !atomic_read(&c->at_max_writeback_rate))
> +		return false;
> +
> +	return true;
> +}
> +
>  static void __update_writeback_rate(struct cached_dev *dc)
>  {
>  	/*
> @@ -104,8 +178,9 @@ static void __update_writeback_rate(struct cached_dev *dc)
>  
>  	dc->writeback_rate_proportional = proportional_scaled;
>  	dc->writeback_rate_integral_scaled = integral_scaled;
> -	dc->writeback_rate_change = new_rate - dc->writeback_rate.rate;
> -	dc->writeback_rate.rate = new_rate;
> +	dc->writeback_rate_change = new_rate -
> +			atomic64_read(&dc->writeback_rate.rate);
> +	atomic64_set(&dc->writeback_rate.rate, new_rate);
>  	dc->writeback_rate_target = target;
>  }
>  
> @@ -138,9 +213,16 @@ static void update_writeback_rate(struct work_struct *work)
>  
>  	down_read(&dc->writeback_lock);
>  
> -	if (atomic_read(&dc->has_dirty) &&
> -	    dc->writeback_percent)
> -		__update_writeback_rate(dc);
> +	if (atomic_read(&dc->has_dirty) && dc->writeback_percent) {
> +		/*
> +		 * If the whole cache set is idle, set_at_max_writeback_rate()
> +		 * will set writeback rate to a max number. Then it is
> +		 * unncessary to update writeback rate for an idle cache set
> +		 * in maximum writeback rate number(s).
> +		 */
> +		if (!set_at_max_writeback_rate(c, dc))
> +			__update_writeback_rate(dc);
> +	}
>  
>  	up_read(&dc->writeback_lock);
>  
> @@ -422,27 +504,6 @@ static void read_dirty(struct cached_dev *dc)
>  
>  		delay = writeback_delay(dc, size);
>  
> -		/* If the control system would wait for at least half a
> -		 * second, and there's been no reqs hitting the backing disk
> -		 * for awhile: use an alternate mode where we have at most
> -		 * one contiguous set of writebacks in flight at a time.  If
> -		 * someone wants to do IO it will be quick, as it will only
> -		 * have to contend with one operation in flight, and we'll
> -		 * be round-tripping data to the backing disk as quickly as
> -		 * it can accept it.
> -		 */
> -		if (delay >= HZ / 2) {
> -			/* 3 means at least 1.5 seconds, up to 7.5 if we
> -			 * have slowed way down.
> -			 */
> -			if (atomic_inc_return(&dc->backing_idle) >= 3) {
> -				/* Wait for current I/Os to finish */
> -				closure_sync(&cl);
> -				/* And immediately launch a new set. */
> -				delay = 0;
> -			}
> -		}
> -
>  		while (!kthread_should_stop() &&
>  		       !test_bit(CACHE_SET_IO_DISABLE, &dc->disk.c->flags) &&
>  		       delay) {
> @@ -715,7 +776,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
>  	dc->writeback_running		= true;
>  	dc->writeback_percent		= 10;
>  	dc->writeback_delay		= 30;
> -	dc->writeback_rate.rate		= 1024;
> +	atomic64_set(&dc->writeback_rate.rate, 1024);
>  	dc->writeback_rate_minimum	= 8;
>  
>  	dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT;
>

Coly Li July 24, 2018, 8:28 a.m. UTC | #2

On 2018/7/24 3:16 PM, Stefan Priebe - Profihost AG wrote:
> Am 24.07.2018 um 06:03 schrieb Coly Li:
>> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>> allows the writeback rate to be faster if there is no I/O request on a
>> bcache device. It works well if there is only one bcache device attached
>> to the cache set. If there are many bcache devices attached to a cache
>> set, it may introduce performance regression because multiple faster
>> writeback threads of the idle bcache devices will compete the btree level
>> locks with the bcache device who have I/O requests coming.
>>
>> This patch fixes the above issue by only permitting fast writebac when
>> all bcache devices attached on the cache set are idle. And if one of the
>> bcache devices has new I/O request coming, minimized all writeback
>> throughput immediately and let PI controller __update_writeback_rate()
>> to decide the upcoming writeback rate for each bcache device.
>>
>> Also when all bcache devices are idle, limited wrieback rate to a small
>> number is wast of thoughput, especially when backing devices are slower
>> non-rotation devices (e.g. SATA SSD). This patch sets a max writeback
>> rate for each backing device if the whole cache set is idle. A faster
>> writeback rate in idle time means new I/Os may have more available space
>> for dirty data, and people may observe a better write performance then.
>>
>> Please note bcache may change its cache mode in run time, and this patch
>> still works if the cache mode is switched from writeback mode and there
>> is still dirty data on cache.
>>
>> Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>> Cc: stable@vger.kernel.org #4.16+
>> Signed-off-by: Coly Li <colyli@suse.de>
>> Tested-by: Kai Krakow <kai@kaishome.de>
>> Cc: Michael Lyle <mlyle@lyle.org>
>> Cc: Stefan Priebe <s.priebe@profihost.ag>
> 
> Hi Coly,
> 
> i'm still experiencing the same bug as yesterday while rebooting every
> two times i get a deadlock in cache_set_free.
> 
> Sadly it's so late ion the shutdown process that netconsole is already
> unloaded or at least i get no messages from while it happens. The traces
> look the same like yesterday.

Hi Stefan,

Do you use the v2 patch on latest SLE15 kernel source? Let me try to
reproduce it on my machine. I assume when you reboot, the cache is still
dirty and not clean up, am I right ? And which file system do you mount
on top of the bcache device ?

Thanks.

Coly Li

Stefan Priebe - Profihost AG July 24, 2018, 8:33 a.m. UTC | #3

Am 24.07.2018 um 10:28 schrieb Coly Li:
> On 2018/7/24 3:16 PM, Stefan Priebe - Profihost AG wrote:
>> Am 24.07.2018 um 06:03 schrieb Coly Li:
>>> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>>> allows the writeback rate to be faster if there is no I/O request on a
>>> bcache device. It works well if there is only one bcache device attached
>>> to the cache set. If there are many bcache devices attached to a cache
>>> set, it may introduce performance regression because multiple faster
>>> writeback threads of the idle bcache devices will compete the btree level
>>> locks with the bcache device who have I/O requests coming.
>>>
>>> This patch fixes the above issue by only permitting fast writebac when
>>> all bcache devices attached on the cache set are idle. And if one of the
>>> bcache devices has new I/O request coming, minimized all writeback
>>> throughput immediately and let PI controller __update_writeback_rate()
>>> to decide the upcoming writeback rate for each bcache device.
>>>
>>> Also when all bcache devices are idle, limited wrieback rate to a small
>>> number is wast of thoughput, especially when backing devices are slower
>>> non-rotation devices (e.g. SATA SSD). This patch sets a max writeback
>>> rate for each backing device if the whole cache set is idle. A faster
>>> writeback rate in idle time means new I/Os may have more available space
>>> for dirty data, and people may observe a better write performance then.
>>>
>>> Please note bcache may change its cache mode in run time, and this patch
>>> still works if the cache mode is switched from writeback mode and there
>>> is still dirty data on cache.
>>>
>>> Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>>> Cc: stable@vger.kernel.org #4.16+
>>> Signed-off-by: Coly Li <colyli@suse.de>
>>> Tested-by: Kai Krakow <kai@kaishome.de>
>>> Cc: Michael Lyle <mlyle@lyle.org>
>>> Cc: Stefan Priebe <s.priebe@profihost.ag>
>>
>> Hi Coly,
>>
>> i'm still experiencing the same bug as yesterday while rebooting every
>> two times i get a deadlock in cache_set_free.
>>
>> Sadly it's so late ion the shutdown process that netconsole is already
>> unloaded or at least i get no messages from while it happens. The traces
>> look the same like yesterday.
> 
> Hi Stefan,
> 
> Do you use the v2 patch on latest SLE15 kernel source?

Yes - i use the kernel code from here:
https://github.com/openSUSE/kernel-source/commits/SLE15

> Let me try to
> reproduce it on my machine. I assume when you reboot, the cache is still
> dirty and not clean up, am I right ?

Yes with around 15GB of dirty data - ceph is running on top of it and
generated many random writes.

> And which file system do you mount
> on top of the bcache device ?
XFS

> 
> Thanks.
> 
> Coly Li

Greets,
Stefan

Coly Li July 24, 2018, 4:36 p.m. UTC | #4

On 2018/7/24 4:33 PM, Stefan Priebe - Profihost AG wrote:
> Am 24.07.2018 um 10:28 schrieb Coly Li:
>> On 2018/7/24 3:16 PM, Stefan Priebe - Profihost AG wrote:
>>> Am 24.07.2018 um 06:03 schrieb Coly Li:
>>>> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>>>> allows the writeback rate to be faster if there is no I/O request on a
>>>> bcache device. It works well if there is only one bcache device attached
>>>> to the cache set. If there are many bcache devices attached to a cache
>>>> set, it may introduce performance regression because multiple faster
>>>> writeback threads of the idle bcache devices will compete the btree level
>>>> locks with the bcache device who have I/O requests coming.
>>>>
>>>> This patch fixes the above issue by only permitting fast writebac when
>>>> all bcache devices attached on the cache set are idle. And if one of the
>>>> bcache devices has new I/O request coming, minimized all writeback
>>>> throughput immediately and let PI controller __update_writeback_rate()
>>>> to decide the upcoming writeback rate for each bcache device.
>>>>
>>>> Also when all bcache devices are idle, limited wrieback rate to a small
>>>> number is wast of thoughput, especially when backing devices are slower
>>>> non-rotation devices (e.g. SATA SSD). This patch sets a max writeback
>>>> rate for each backing device if the whole cache set is idle. A faster
>>>> writeback rate in idle time means new I/Os may have more available space
>>>> for dirty data, and people may observe a better write performance then.
>>>>
>>>> Please note bcache may change its cache mode in run time, and this patch
>>>> still works if the cache mode is switched from writeback mode and there
>>>> is still dirty data on cache.
>>>>
>>>> Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>>>> Cc: stable@vger.kernel.org #4.16+
>>>> Signed-off-by: Coly Li <colyli@suse.de>
>>>> Tested-by: Kai Krakow <kai@kaishome.de>
>>>> Cc: Michael Lyle <mlyle@lyle.org>
>>>> Cc: Stefan Priebe <s.priebe@profihost.ag>
>>>
>>> Hi Coly,
>>>
>>> i'm still experiencing the same bug as yesterday while rebooting every
>>> two times i get a deadlock in cache_set_free.
>>>
>>> Sadly it's so late ion the shutdown process that netconsole is already
>>> unloaded or at least i get no messages from while it happens. The traces
>>> look the same like yesterday.
>>
>> Hi Stefan,
>>
>> Do you use the v2 patch on latest SLE15 kernel source?
> 
> Yes - i use the kernel code from here:
> https://github.com/openSUSE/kernel-source/commits/SLE15
> 
>> Let me try to
>> reproduce it on my machine. I assume when you reboot, the cache is still
>> dirty and not clean up, am I right ?
> 
> Yes with around 15GB of dirty data - ceph is running on top of it and
> generated many random writes.
> 
>> And which file system do you mount
>> on top of the bcache device ?
> XFS

Hi Stefan,

Thanks for the information. I managed to reproduce a similar deadlock on
my small box. Let me see how to fix it and post a new version ASAP :-)

Coly Li

Stefan Priebe - Profihost AG July 25, 2018, 6:16 a.m. UTC | #5

Am 24.07.2018 um 18:36 schrieb Coly Li:
> On 2018/7/24 4:33 PM, Stefan Priebe - Profihost AG wrote:
>> Am 24.07.2018 um 10:28 schrieb Coly Li:
>>> On 2018/7/24 3:16 PM, Stefan Priebe - Profihost AG wrote:
>>>> Am 24.07.2018 um 06:03 schrieb Coly Li:
>>>>> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>>>>> allows the writeback rate to be faster if there is no I/O request on a
>>>>> bcache device. It works well if there is only one bcache device attached
>>>>> to the cache set. If there are many bcache devices attached to a cache
>>>>> set, it may introduce performance regression because multiple faster
>>>>> writeback threads of the idle bcache devices will compete the btree level
>>>>> locks with the bcache device who have I/O requests coming.
>>>>>
>>>>> This patch fixes the above issue by only permitting fast writebac when
>>>>> all bcache devices attached on the cache set are idle. And if one of the
>>>>> bcache devices has new I/O request coming, minimized all writeback
>>>>> throughput immediately and let PI controller __update_writeback_rate()
>>>>> to decide the upcoming writeback rate for each bcache device.
>>>>>
>>>>> Also when all bcache devices are idle, limited wrieback rate to a small
>>>>> number is wast of thoughput, especially when backing devices are slower
>>>>> non-rotation devices (e.g. SATA SSD). This patch sets a max writeback
>>>>> rate for each backing device if the whole cache set is idle. A faster
>>>>> writeback rate in idle time means new I/Os may have more available space
>>>>> for dirty data, and people may observe a better write performance then.
>>>>>
>>>>> Please note bcache may change its cache mode in run time, and this patch
>>>>> still works if the cache mode is switched from writeback mode and there
>>>>> is still dirty data on cache.
>>>>>
>>>>> Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>>>>> Cc: stable@vger.kernel.org #4.16+
>>>>> Signed-off-by: Coly Li <colyli@suse.de>
>>>>> Tested-by: Kai Krakow <kai@kaishome.de>
>>>>> Cc: Michael Lyle <mlyle@lyle.org>
>>>>> Cc: Stefan Priebe <s.priebe@profihost.ag>
>>>>
>>>> Hi Coly,
>>>>
>>>> i'm still experiencing the same bug as yesterday while rebooting every
>>>> two times i get a deadlock in cache_set_free.
>>>>
>>>> Sadly it's so late ion the shutdown process that netconsole is already
>>>> unloaded or at least i get no messages from while it happens. The traces
>>>> look the same like yesterday.
>>>
>>> Hi Stefan,
>>>
>>> Do you use the v2 patch on latest SLE15 kernel source?
>>
>> Yes - i use the kernel code from here:
>> https://github.com/openSUSE/kernel-source/commits/SLE15
>>
>>> Let me try to
>>> reproduce it on my machine. I assume when you reboot, the cache is still
>>> dirty and not clean up, am I right ?
>>
>> Yes with around 15GB of dirty data - ceph is running on top of it and
>> generated many random writes.
>>
>>> And which file system do you mount
>>> on top of the bcache device ?
>> XFS
> 
> Hi Stefan,
> 
> Thanks for the information. I managed to reproduce a similar deadlock on
> my small box. Let me see how to fix it and post a new version ASAP :-)

Great!

> 
> Coly Li
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

Stefan Priebe - Profihost AG July 26, 2018, 5:44 a.m. UTC | #6

> Am 25.07.2018 um 08:16 schrieb Stefan Priebe - Profihost AG <s.priebe@profihost.ag>:
> 
>> Am 24.07.2018 um 18:36 schrieb Coly Li:
>>> On 2018/7/24 4:33 PM, Stefan Priebe - Profihost AG wrote:
>>>> Am 24.07.2018 um 10:28 schrieb Coly Li:
>>>>> On 2018/7/24 3:16 PM, Stefan Priebe - Profihost AG wrote:
>>>>>> Am 24.07.2018 um 06:03 schrieb Coly Li:
>>>>>> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>>>>>> allows the writeback rate to be faster if there is no I/O request on a
>>>>>> bcache device. It works well if there is only one bcache device attached
>>>>>> to the cache set. If there are many bcache devices attached to a cache
>>>>>> set, it may introduce performance regression because multiple faster
>>>>>> writeback threads of the idle bcache devices will compete the btree level
>>>>>> locks with the bcache device who have I/O requests coming.
>>>>>> 
>>>>>> This patch fixes the above issue by only permitting fast writebac when
>>>>>> all bcache devices attached on the cache set are idle. And if one of the
>>>>>> bcache devices has new I/O request coming, minimized all writeback
>>>>>> throughput immediately and let PI controller __update_writeback_rate()
>>>>>> to decide the upcoming writeback rate for each bcache device.
>>>>>> 
>>>>>> Also when all bcache devices are idle, limited wrieback rate to a small
>>>>>> number is wast of thoughput, especially when backing devices are slower
>>>>>> non-rotation devices (e.g. SATA SSD). This patch sets a max writeback
>>>>>> rate for each backing device if the whole cache set is idle. A faster
>>>>>> writeback rate in idle time means new I/Os may have more available space
>>>>>> for dirty data, and people may observe a better write performance then.
>>>>>> 
>>>>>> Please note bcache may change its cache mode in run time, and this patch
>>>>>> still works if the cache mode is switched from writeback mode and there
>>>>>> is still dirty data on cache.
>>>>>> 
>>>>>> Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>>>>>> Cc: stable@vger.kernel.org #4.16+
>>>>>> Signed-off-by: Coly Li <colyli@suse.de>
>>>>>> Tested-by: Kai Krakow <kai@kaishome.de>
>>>>>> Cc: Michael Lyle <mlyle@lyle.org>
>>>>>> Cc: Stefan Priebe <s.priebe@profihost.ag>
>>>>> 
>>>>> Hi Coly,
>>>>> 
>>>>> i'm still experiencing the same bug as yesterday while rebooting every
>>>>> two times i get a deadlock in cache_set_free.
>>>>> 
>>>>> Sadly it's so late ion the shutdown process that netconsole is already
>>>>> unloaded or at least i get no messages from while it happens. The traces
>>>>> look the same like yesterday.
>>>> 
>>>> Hi Stefan,
>>>> 
>>>> Do you use the v2 patch on latest SLE15 kernel source?
>>> 
>>> Yes - i use the kernel code from here:
>>> https://github.com/openSUSE/kernel-source/commits/SLE15
>>> 
>>>> Let me try to
>>>> reproduce it on my machine. I assume when you reboot, the cache is still
>>>> dirty and not clean up, am I right ?
>>> 
>>> Yes with around 15GB of dirty data - ceph is running on top of it and
>>> generated many random writes.
>>> 
>>>> And which file system do you mount
>>>> on top of the bcache device ?
>>> XFS
>> 
>> Hi Stefan,
>> 
>> Thanks for the information. I managed to reproduce a similar deadlock on
>> my small box. Let me see how to fix it and post a new version ASAP :-)
> 
> Great!

Any news on this already?

Greets,
Stefan

> 
>> 
>> Coly Li
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>

Coly Li July 26, 2018, 7:04 a.m. UTC | #7

On 2018/7/26 1:44 PM, Stefan Priebe - Profihost AG wrote:
>> Am 25.07.2018 um 08:16 schrieb Stefan Priebe - Profihost AG <s.priebe@profihost.ag>:
>>
>>> Am 24.07.2018 um 18:36 schrieb Coly Li:
>>>> On 2018/7/24 4:33 PM, Stefan Priebe - Profihost AG wrote:
>>>>> Am 24.07.2018 um 10:28 schrieb Coly Li:
>>>>>> On 2018/7/24 3:16 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>> Am 24.07.2018 um 06:03 schrieb Coly Li:
>>>>>>> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>>>>>>> allows the writeback rate to be faster if there is no I/O request on a
>>>>>>> bcache device. It works well if there is only one bcache device attached
>>>>>>> to the cache set. If there are many bcache devices attached to a cache
>>>>>>> set, it may introduce performance regression because multiple faster
>>>>>>> writeback threads of the idle bcache devices will compete the btree level
>>>>>>> locks with the bcache device who have I/O requests coming.
>>>>>>>
>>>>>>> This patch fixes the above issue by only permitting fast writebac when
>>>>>>> all bcache devices attached on the cache set are idle. And if one of the
>>>>>>> bcache devices has new I/O request coming, minimized all writeback
>>>>>>> throughput immediately and let PI controller __update_writeback_rate()
>>>>>>> to decide the upcoming writeback rate for each bcache device.
>>>>>>>
>>>>>>> Also when all bcache devices are idle, limited wrieback rate to a small
>>>>>>> number is wast of thoughput, especially when backing devices are slower
>>>>>>> non-rotation devices (e.g. SATA SSD). This patch sets a max writeback
>>>>>>> rate for each backing device if the whole cache set is idle. A faster
>>>>>>> writeback rate in idle time means new I/Os may have more available space
>>>>>>> for dirty data, and people may observe a better write performance then.
>>>>>>>
>>>>>>> Please note bcache may change its cache mode in run time, and this patch
>>>>>>> still works if the cache mode is switched from writeback mode and there
>>>>>>> is still dirty data on cache.
>>>>>>>
>>>>>>> Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>>>>>>> Cc: stable@vger.kernel.org #4.16+
>>>>>>> Signed-off-by: Coly Li <colyli@suse.de>
>>>>>>> Tested-by: Kai Krakow <kai@kaishome.de>
>>>>>>> Cc: Michael Lyle <mlyle@lyle.org>
>>>>>>> Cc: Stefan Priebe <s.priebe@profihost.ag>
>>>>>>
>>>>>> Hi Coly,
>>>>>>
>>>>>> i'm still experiencing the same bug as yesterday while rebooting every
>>>>>> two times i get a deadlock in cache_set_free.
>>>>>>
>>>>>> Sadly it's so late ion the shutdown process that netconsole is already
>>>>>> unloaded or at least i get no messages from while it happens. The traces
>>>>>> look the same like yesterday.
>>>>>
>>>>> Hi Stefan,
>>>>>
>>>>> Do you use the v2 patch on latest SLE15 kernel source?
>>>>
>>>> Yes - i use the kernel code from here:
>>>> https://github.com/openSUSE/kernel-source/commits/SLE15
>>>>
>>>>> Let me try to
>>>>> reproduce it on my machine. I assume when you reboot, the cache is still
>>>>> dirty and not clean up, am I right ?
>>>>
>>>> Yes with around 15GB of dirty data - ceph is running on top of it and
>>>> generated many random writes.
>>>>
>>>>> And which file system do you mount
>>>>> on top of the bcache device ?
>>>> XFS
>>>
>>> Hi Stefan,
>>>
>>> Thanks for the information. I managed to reproduce a similar deadlock on
>>> my small box. Let me see how to fix it and post a new version ASAP :-)
>>
>> Great!
> 
> Any news on this already?

I am testing the new version, and it also requires other patches I
posted earlier for 4.19 (which gets rid of bch_register_lock in
writeback thread).

Hope I can make it today :-)

Coly Li

Coly Li Dec. 15, 2018, 4:07 a.m. UTC | #8

On 12/15/18 2:10 AM, Michael Lyle wrote:
> Coly--
> 
> Apologies for the late reply on this.  I just noticed it based on Greg's
> comment about stable.
> 
> When I wrote the previous "accelerate writeback" patchset, my first
> attempt was very much like this.  I believe it was asked (by you?)
> whether it would impact the latency of front-end I/O because of deep
> backing device queues when a new request comes in.
> 
> Won't this cause lots of requests to be pending to backing, so if there
> is intermittent front-end I/O they'll have to wait for the device? 
> That's why I previous had it set to only complete one writeback at a
> time, to bound the impact on latency-- based on that review feedback.

Hi Mike,

This patch is a much more conservative effort. It sets a high writeback
rate only when all attached bcache device are idled for quite many
seconds. In this situation, the cache set is really quite and spared.

Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
just looks at single bcache device. If there are I/Os for other bcache
device on the cache set, and a single bcache device is idle, a faster
writeback rate for this single idle bcache device will happen, I/O to
read dirty data on cache for writeback will have negative impact to I/O
requests of other bcache devices.

Therefore I give up a specific faster writeback, to make sure the
latency of front end I/O in general.

Thanks.

Coly Li



> On Mon, Jul 23, 2018 at 9:03 PM Coly Li <colyli@suse.de
> <mailto:colyli@suse.de>> wrote:
> 
>     Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>     allows the writeback rate to be faster if there is no I/O request on a
>     bcache device. It works well if there is only one bcache device attached
>     to the cache set. If there are many bcache devices attached to a cache
>     set, it may introduce performance regression because multiple faster
>     writeback threads of the idle bcache devices will compete the btree
>     level
>     locks with the bcache device who have I/O requests coming.
> 
>     This patch fixes the above issue by only permitting fast writebac when
>     all bcache devices attached on the cache set are idle. And if one of the
>     bcache devices has new I/O request coming, minimized all writeback
>     throughput immediately and let PI controller __update_writeback_rate()
>     to decide the upcoming writeback rate for each bcache device.
> 
>     Also when all bcache devices are idle, limited wrieback rate to a small
>     number is wast of thoughput, especially when backing devices are slower
>     non-rotation devices (e.g. SATA SSD). This patch sets a max writeback
>     rate for each backing device if the whole cache set is idle. A faster
>     writeback rate in idle time means new I/Os may have more available space
>     for dirty data, and people may observe a better write performance then.
> 
>     Please note bcache may change its cache mode in run time, and this patch
>     still works if the cache mode is switched from writeback mode and there
>     is still dirty data on cache.
> 
>     Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when
>     backing idle")
>     Cc: stable@vger.kernel.org <mailto:stable@vger.kernel.org> #4.16+
>     Signed-off-by: Coly Li <colyli@suse.de <mailto:colyli@suse.de>>
>     Tested-by: Kai Krakow <kai@kaishome.de <mailto:kai@kaishome.de>>
>     Cc: Michael Lyle <mlyle@lyle.org <mailto:mlyle@lyle.org>>
>     Cc: Stefan Priebe <s.priebe@profihost.ag <mailto:s.priebe@profihost.ag>>
>     ---
>     Channgelog:
>     v2, Fix a deadlock reported by Stefan Priebe.
>     v1, Initial version.
> 
>      drivers/md/bcache/bcache.h    |  11 ++--
>      drivers/md/bcache/request.c   |  51 ++++++++++++++-
>      drivers/md/bcache/super.c     |   1 +
>      drivers/md/bcache/sysfs.c     |  14 +++--
>      drivers/md/bcache/util.c      |   2 +-
>      drivers/md/bcache/util.h      |   2 +-
>      drivers/md/bcache/writeback.c | 115 ++++++++++++++++++++++++++--------
>      7 files changed, 155 insertions(+), 41 deletions(-)
> 
>     diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>     index d6bf294f3907..469ab1a955e0 100644
>     --- a/drivers/md/bcache/bcache.h
>     +++ b/drivers/md/bcache/bcache.h
>     @@ -328,13 +328,6 @@ struct cached_dev {
>              */
>             atomic_t                has_dirty;
> 
>     -       /*
>     -        * Set to zero by things that touch the backing volume-- except
>     -        * writeback.  Incremented by writeback.  Used to determine
>     when to
>     -        * accelerate idle writeback.
>     -        */
>     -       atomic_t                backing_idle;
>     -
>             struct bch_ratelimit    writeback_rate;
>             struct delayed_work     writeback_rate_update;
> 
>     @@ -514,6 +507,8 @@ struct cache_set {
>             struct cache_accounting accounting;
> 
>             unsigned long           flags;
>     +       atomic_t                idle_counter;
>     +       atomic_t                at_max_writeback_rate;
> 
>             struct cache_sb         sb;
> 
>     @@ -523,6 +518,8 @@ struct cache_set {
> 
>             struct bcache_device    **devices;
>             unsigned                devices_max_used;
>     +       /* See set_at_max_writeback_rate() for how it is used */
>     +       unsigned                previous_dirty_dc_nr;
>             struct list_head        cached_devs;
>             uint64_t                cached_dev_sectors;
>             struct closure          caching;
>     diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
>     index ae67f5fa8047..1af3d96abfa5 100644
>     --- a/drivers/md/bcache/request.c
>     +++ b/drivers/md/bcache/request.c
>     @@ -1104,6 +1104,43 @@ static void detached_dev_do_request(struct
>     bcache_device *d, struct bio *bio)
> 
>      /* Cached devices - read & write stuff */
> 
>     +static void quit_max_writeback_rate(struct cache_set *c,
>     +                                   struct cached_dev *this_dc)
>     +{
>     +       int i;
>     +       struct bcache_device *d;
>     +       struct cached_dev *dc;
>     +
>     +       /*
>     +        * If bch_register_lock is acquired by other attach/detach
>     operations,
>     +        * waiting here will increase I/O request latency for
>     seconds or more.
>     +        * To avoid such situation, only writeback rate of current
>     cached device
>     +        * is set to 1, and __update_write_back() will decide
>     writeback rate
>     +        * of other cached devices (remember c->idle_counter is 0 now).
>     +        */
>     +       if (mutex_trylock(&bch_register_lock)){
>     +               for (i = 0; i < c->devices_max_used; i++) {
>     +                       if (!c->devices[i])
>     +                               continue;
>     +
>     +                       if (UUID_FLASH_ONLY(&c->uuids[i]))
>     +                               continue;
>     +
>     +                       d = c->devices[i];
>     +                       dc = container_of(d, struct cached_dev, disk);
>     +                       /*
>     +                        * set writeback rate to default minimum value,
>     +                        * then let update_writeback_rate() to
>     decide the
>     +                        * upcoming rate.
>     +                        */
>     +                       atomic64_set(&dc->writeback_rate.rate, 1);
>     +               }
>     +
>     +               mutex_unlock(&bch_register_lock);
>     +       } else
>     +               atomic64_set(&this_dc->writeback_rate.rate, 1);
>     +}
>     +
>      static blk_qc_t cached_dev_make_request(struct request_queue *q,
>                                             struct bio *bio)
>      {
>     @@ -1119,7 +1156,19 @@ static blk_qc_t
>     cached_dev_make_request(struct request_queue *q,
>                     return BLK_QC_T_NONE;
>             }
> 
>     -       atomic_set(&dc->backing_idle, 0);
>     +       if (d->c) {
>     +               atomic_set(&d->c->idle_counter, 0);
>     +               /*
>     +                * If at_max_writeback_rate of cache set is true and
>     new I/O
>     +                * comes, quit max writeback rate of all cached devices
>     +                * attached to this cache set, and set
>     at_max_writeback_rate
>     +                * to false.
>     +                */
>     +               if
>     (unlikely(atomic_read(&d->c->at_max_writeback_rate) == 1)) {
>     +                       atomic_set(&d->c->at_max_writeback_rate, 0);
>     +                       quit_max_writeback_rate(d->c, dc);
>     +               }
>     +       }
>             generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0);
> 
>             bio_set_dev(bio, dc->bdev);
>     diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>     index fa4058e43202..fa532d9f9353 100644
>     --- a/drivers/md/bcache/super.c
>     +++ b/drivers/md/bcache/super.c
>     @@ -1687,6 +1687,7 @@ struct cache_set *bch_cache_set_alloc(struct
>     cache_sb *sb)
>             c->block_bits           = ilog2(sb->block_size);
>             c->nr_uuids             = bucket_bytes(c) / sizeof(struct
>     uuid_entry);
>             c->devices_max_used     = 0;
>     +       c->previous_dirty_dc_nr = 0;
>             c->btree_pages          = bucket_pages(c);
>             if (c->btree_pages > BTREE_MAX_PAGES)
>                     c->btree_pages = max_t(int, c->btree_pages / 4,
>     diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
>     index 225b15aa0340..d719021bff81 100644
>     --- a/drivers/md/bcache/sysfs.c
>     +++ b/drivers/md/bcache/sysfs.c
>     @@ -170,7 +170,8 @@ SHOW(__bch_cached_dev)
>             var_printf(writeback_running,   "%i");
>             var_print(writeback_delay);
>             var_print(writeback_percent);
>     -       sysfs_hprint(writeback_rate,    dc->writeback_rate.rate << 9);
>     +       sysfs_hprint(writeback_rate,
>     +                    atomic64_read(&dc->writeback_rate.rate) << 9);
>             sysfs_hprint(io_errors,         atomic_read(&dc->io_errors));
>             sysfs_printf(io_error_limit,    "%i", dc->error_limit);
>             sysfs_printf(io_disable,        "%i", dc->io_disable);
>     @@ -188,7 +189,8 @@ SHOW(__bch_cached_dev)
>                     char change[20];
>                     s64 next_io;
> 
>     -               bch_hprint(rate,        dc->writeback_rate.rate << 9);
>     +               bch_hprint(rate,
>     +                          atomic64_read(&dc->writeback_rate.rate)
>     << 9);
>                     bch_hprint(dirty,     
>      bcache_dev_sectors_dirty(&dc->disk) << 9);
>                     bch_hprint(target,      dc->writeback_rate_target << 9);
>                    
>     bch_hprint(proportional,dc->writeback_rate_proportional << 9);
>     @@ -255,8 +257,12 @@ STORE(__cached_dev)
> 
>             sysfs_strtoul_clamp(writeback_percent,
>     dc->writeback_percent, 0, 40);
> 
>     -       sysfs_strtoul_clamp(writeback_rate,
>     -                           dc->writeback_rate.rate, 1, INT_MAX);
>     +       if (attr == &sysfs_writeback_rate) {
>     +               int v;
>     +
>     +               sysfs_strtoul_clamp(writeback_rate, v, 1, INT_MAX);
>     +               atomic64_set(&dc->writeback_rate.rate, v);
>     +       }
> 
>             sysfs_strtoul_clamp(writeback_rate_update_seconds,
>                                 dc->writeback_rate_update_seconds,
>     diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
>     index fc479b026d6d..84f90c3d996d 100644
>     --- a/drivers/md/bcache/util.c
>     +++ b/drivers/md/bcache/util.c
>     @@ -200,7 +200,7 @@ uint64_t bch_next_delay(struct bch_ratelimit *d,
>     uint64_t done)
>      {
>             uint64_t now = local_clock();
> 
>     -       d->next += div_u64(done * NSEC_PER_SEC, d->rate);
>     +       d->next += div_u64(done * NSEC_PER_SEC,
>     atomic64_read(&d->rate));
> 
>             /* Bound the time.  Don't let us fall further than 2 seconds
>     behind
>              * (this prevents unnecessary backlog that would make it
>     impossible
>     diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
>     index cced87f8eb27..7e17f32ab563 100644
>     --- a/drivers/md/bcache/util.h
>     +++ b/drivers/md/bcache/util.h
>     @@ -442,7 +442,7 @@ struct bch_ratelimit {
>              * Rate at which we want to do work, in units per second
>              * The units here correspond to the units passed to
>     bch_next_delay()
>              */
>     -       uint32_t                rate;
>     +       atomic64_t              rate;
>      };
> 
>      static inline void bch_ratelimit_reset(struct bch_ratelimit *d)
>     diff --git a/drivers/md/bcache/writeback.c
>     b/drivers/md/bcache/writeback.c
>     index ad45ebe1a74b..11ffadc3cf8f 100644
>     --- a/drivers/md/bcache/writeback.c
>     +++ b/drivers/md/bcache/writeback.c
>     @@ -49,6 +49,80 @@ static uint64_t __calc_target_rate(struct
>     cached_dev *dc)
>             return (cache_dirty_target * bdev_share) >>
>     WRITEBACK_SHARE_SHIFT;
>      }
> 
>     +static bool set_at_max_writeback_rate(struct cache_set *c,
>     +                                     struct cached_dev *dc)
>     +{
>     +       int i, dirty_dc_nr = 0;
>     +       struct bcache_device *d;
>     +
>     +       /*
>     +        * bch_register_lock is acquired in
>     cached_dev_detach_finish() before
>     +        * calling cancel_writeback_rate_update_dwork() to stop the
>     delayed
>     +        * kworker writeback_rate_update (where the context we are
>     for now).
>     +        * Therefore call mutex_lock() here may introduce deadlock
>     when shut
>     +        * down the bcache device.
>     +        * c->previous_dirty_dc_nr is used to record previous calculated
>     +        * dirty_dc_nr when mutex_trylock() last time succeeded. Then if
>     +        * mutex_trylock() failed here, use c->previous_dirty_dc_nr
>     as dirty
>     +        * cached device number. Of cause it might be inaccurate,
>     but a few more
>     +        * or less loop before setting c->at_max_writeback_rate is
>     much better
>     +        * then a deadlock here.
>     +        */
>     +       if (mutex_trylock(&bch_register_lock)) {
>     +               for (i = 0; i < c->devices_max_used; i++) {
>     +                       if (!c->devices[i])
>     +                               continue;
>     +                       if (UUID_FLASH_ONLY(&c->uuids[i]))
>     +                               continue;
>     +                       d = c->devices[i];
>     +                       dc = container_of(d, struct cached_dev, disk);
>     +                       if (atomic_read(&dc->has_dirty))
>     +                               dirty_dc_nr++;
>     +               }
>     +               c->previous_dirty_dc_nr = dirty_dc_nr;
>     +
>     +               mutex_unlock(&bch_register_lock);
>     +       } else
>     +               dirty_dc_nr = c->previous_dirty_dc_nr;
>     +
>     +       /*
>     +        * Idle_counter is increased everytime when
>     update_writeback_rate()
>     +        * is rescheduled in. If all backing devices attached to the
>     same
>     +        * cache set has same dc->writeback_rate_update_seconds
>     value, it
>     +        * is about 10 rounds of update_writeback_rate() is called
>     on each
>     +        * backing device, then the code will fall through at set 1 to
>     +        * c->at_max_writeback_rate, and a max wrteback rate to each
>     +        * dc->writeback_rate.rate. This is not very accurate but
>     works well
>     +        * to make sure the whole cache set has no new I/O coming before
>     +        * writeback rate is set to a max number.
>     +        */
>     +       if (atomic_inc_return(&c->idle_counter) < dirty_dc_nr * 10)
>     +               return false;
>     +
>     +       if (atomic_read(&c->at_max_writeback_rate) != 1)
>     +               atomic_set(&c->at_max_writeback_rate, 1);
>     +
>     +
>     +       atomic64_set(&dc->writeback_rate.rate, INT_MAX);
>     +
>     +       /* keep writeback_rate_target as existing value */
>     +       dc->writeback_rate_proportional = 0;
>     +       dc->writeback_rate_integral_scaled = 0;
>     +       dc->writeback_rate_change = 0;
>     +
>     +       /*
>     +        * Check c->idle_counter and c->at_max_writeback_rate
>     agagain in case
>     +        * new I/O arrives during before set_at_max_writeback_rate()
>     returns.
>     +        * Then the writeback rate is set to 1, and its new value
>     should be
>     +        * decided via __update_writeback_rate().
>     +        */
>     +       if (atomic_read(&c->idle_counter) < dirty_dc_nr * 10 ||
>     +           !atomic_read(&c->at_max_writeback_rate))
>     +               return false;
>     +
>     +       return true;
>     +}
>     +
>      static void __update_writeback_rate(struct cached_dev *dc)
>      {
>             /*
>     @@ -104,8 +178,9 @@ static void __update_writeback_rate(struct
>     cached_dev *dc)
> 
>             dc->writeback_rate_proportional = proportional_scaled;
>             dc->writeback_rate_integral_scaled = integral_scaled;
>     -       dc->writeback_rate_change = new_rate - dc->writeback_rate.rate;
>     -       dc->writeback_rate.rate = new_rate;
>     +       dc->writeback_rate_change = new_rate -
>     +                       atomic64_read(&dc->writeback_rate.rate);
>     +       atomic64_set(&dc->writeback_rate.rate, new_rate);
>             dc->writeback_rate_target = target;
>      }
> 
>     @@ -138,9 +213,16 @@ static void update_writeback_rate(struct
>     work_struct *work)
> 
>             down_read(&dc->writeback_lock);
> 
>     -       if (atomic_read(&dc->has_dirty) &&
>     -           dc->writeback_percent)
>     -               __update_writeback_rate(dc);
>     +       if (atomic_read(&dc->has_dirty) && dc->writeback_percent) {
>     +               /*
>     +                * If the whole cache set is idle,
>     set_at_max_writeback_rate()
>     +                * will set writeback rate to a max number. Then it is
>     +                * unncessary to update writeback rate for an idle
>     cache set
>     +                * in maximum writeback rate number(s).
>     +                */
>     +               if (!set_at_max_writeback_rate(c, dc))
>     +                       __update_writeback_rate(dc);
>     +       }
> 
>             up_read(&dc->writeback_lock);
> 
>     @@ -422,27 +504,6 @@ static void read_dirty(struct cached_dev *dc)
> 
>                     delay = writeback_delay(dc, size);
> 
>     -               /* If the control system would wait for at least half a
>     -                * second, and there's been no reqs hitting the
>     backing disk
>     -                * for awhile: use an alternate mode where we have
>     at most
>     -                * one contiguous set of writebacks in flight at a
>     time.  If
>     -                * someone wants to do IO it will be quick, as it
>     will only
>     -                * have to contend with one operation in flight, and
>     we'll
>     -                * be round-tripping data to the backing disk as
>     quickly as
>     -                * it can accept it.
>     -                */
>     -               if (delay >= HZ / 2) {
>     -                       /* 3 means at least 1.5 seconds, up to 7.5 if we
>     -                        * have slowed way down.
>     -                        */
>     -                       if (atomic_inc_return(&dc->backing_idle) >= 3) {
>     -                               /* Wait for current I/Os to finish */
>     -                               closure_sync(&cl);
>     -                               /* And immediately launch a new set. */
>     -                               delay = 0;
>     -                       }
>     -               }
>     -
>                     while (!kthread_should_stop() &&
>                            !test_bit(CACHE_SET_IO_DISABLE,
>     &dc->disk.c->flags) &&
>                            delay) {
>     @@ -715,7 +776,7 @@ void bch_cached_dev_writeback_init(struct
>     cached_dev *dc)
>             dc->writeback_running           = true;
>             dc->writeback_percent           = 10;
>             dc->writeback_delay             = 30;
>     -       dc->writeback_rate.rate         = 1024;
>     +       atomic64_set(&dc->writeback_rate.rate, 1024);
>             dc->writeback_rate_minimum      = 8;
> 
>             dc->writeback_rate_update_seconds =
>     WRITEBACK_RATE_UPDATE_SECS_DEFAULT;
>     -- 
>     2.17.1
>

Michael Lyle Dec. 15, 2018, 4:22 a.m. UTC | #9

On Fri, Dec 14, 2018 at 8:08 PM Coly Li <colyli@suse.de> wrote:
> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
> just looks at single bcache device. If there are I/Os for other bcache
> device on the cache set, and a single bcache device is idle, a faster
> writeback rate for this single idle bcache device will happen, I/O to
> read dirty data on cache for writeback will have negative impact to I/O
> requests of other bcache devices.

Yes, but this will potentially fill the queue on the idle device.  If
you have an application that is intermittently doing I/O every few
seconds, the latency of the individual I/Os will be high.  I was told
during review that this was unacceptable, and revised my patch.  But
now we have merged a change that is more like my first patch that was
rejected...

Mike

Coly Li Dec. 15, 2018, 4:37 a.m. UTC | #10

On 12/15/18 12:22 PM, Michael Lyle wrote:
> On Fri, Dec 14, 2018 at 8:08 PM Coly Li <colyli@suse.de> wrote:
>> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
>> just looks at single bcache device. If there are I/Os for other bcache
>> device on the cache set, and a single bcache device is idle, a faster
>> writeback rate for this single idle bcache device will happen, I/O to
>> read dirty data on cache for writeback will have negative impact to I/O
>> requests of other bcache devices.
> 
> Yes, but this will potentially fill the queue on the idle device.  If
> you have an application that is intermittently doing I/O every few
> seconds, the latency of the individual I/Os will be high.  I was told
> during review that this was unacceptable, and revised my patch.  But
> now we have merged a change that is more like my first patch that was
> rejected...

The original patch only looks at a single bcache, this fix looks at all
bcache devices.

If only a single bcache attached on cache set, by default it will be
idle for 30 seconds before the writeback rate to be set to a high rate.
And if more bcache devices attached, longer threshold required. For 4
bcache devices attached the default idle threshold will be about 120
seconds. If I/O comes between more than 30 seconds (even minutes), I
will not take it as an online system, most cases the latency should not
be sensitive, or bcache is not proper for such workload.

Coly Li

[v2] bcache: set max writeback rate when I/O request is idle

Commit Message

Comments

Patch