Message ID | 20180724040310.1590-1-colyli@suse.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] bcache: set max writeback rate when I/O request is idle | expand |
Am 24.07.2018 um 06:03 schrieb Coly Li: > Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") > allows the writeback rate to be faster if there is no I/O request on a > bcache device. It works well if there is only one bcache device attached > to the cache set. If there are many bcache devices attached to a cache > set, it may introduce performance regression because multiple faster > writeback threads of the idle bcache devices will compete the btree level > locks with the bcache device who have I/O requests coming. > > This patch fixes the above issue by only permitting fast writebac when > all bcache devices attached on the cache set are idle. And if one of the > bcache devices has new I/O request coming, minimized all writeback > throughput immediately and let PI controller __update_writeback_rate() > to decide the upcoming writeback rate for each bcache device. > > Also when all bcache devices are idle, limited wrieback rate to a small > number is wast of thoughput, especially when backing devices are slower > non-rotation devices (e.g. SATA SSD). This patch sets a max writeback > rate for each backing device if the whole cache set is idle. A faster > writeback rate in idle time means new I/Os may have more available space > for dirty data, and people may observe a better write performance then. > > Please note bcache may change its cache mode in run time, and this patch > still works if the cache mode is switched from writeback mode and there > is still dirty data on cache. > > Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") > Cc: stable@vger.kernel.org #4.16+ > Signed-off-by: Coly Li <colyli@suse.de> > Tested-by: Kai Krakow <kai@kaishome.de> > Cc: Michael Lyle <mlyle@lyle.org> > Cc: Stefan Priebe <s.priebe@profihost.ag> Hi Coly, i'm still experiencing the same bug as yesterday while rebooting every two times i get a deadlock in cache_set_free. Sadly it's so late ion the shutdown process that netconsole is already unloaded or at least i get no messages from while it happens. The traces look the same like yesterday. Greets, Stefan > --- > Channgelog: > v2, Fix a deadlock reported by Stefan Priebe. > v1, Initial version. > > drivers/md/bcache/bcache.h | 11 ++-- > drivers/md/bcache/request.c | 51 ++++++++++++++- > drivers/md/bcache/super.c | 1 + > drivers/md/bcache/sysfs.c | 14 +++-- > drivers/md/bcache/util.c | 2 +- > drivers/md/bcache/util.h | 2 +- > drivers/md/bcache/writeback.c | 115 ++++++++++++++++++++++++++-------- > 7 files changed, 155 insertions(+), 41 deletions(-) > > diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h > index d6bf294f3907..469ab1a955e0 100644 > --- a/drivers/md/bcache/bcache.h > +++ b/drivers/md/bcache/bcache.h > @@ -328,13 +328,6 @@ struct cached_dev { > */ > atomic_t has_dirty; > > - /* > - * Set to zero by things that touch the backing volume-- except > - * writeback. Incremented by writeback. Used to determine when to > - * accelerate idle writeback. > - */ > - atomic_t backing_idle; > - > struct bch_ratelimit writeback_rate; > struct delayed_work writeback_rate_update; > > @@ -514,6 +507,8 @@ struct cache_set { > struct cache_accounting accounting; > > unsigned long flags; > + atomic_t idle_counter; > + atomic_t at_max_writeback_rate; > > struct cache_sb sb; > > @@ -523,6 +518,8 @@ struct cache_set { > > struct bcache_device **devices; > unsigned devices_max_used; > + /* See set_at_max_writeback_rate() for how it is used */ > + unsigned previous_dirty_dc_nr; > struct list_head cached_devs; > uint64_t cached_dev_sectors; > struct closure caching; > diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c > index ae67f5fa8047..1af3d96abfa5 100644 > --- a/drivers/md/bcache/request.c > +++ b/drivers/md/bcache/request.c > @@ -1104,6 +1104,43 @@ static void detached_dev_do_request(struct bcache_device *d, struct bio *bio) > > /* Cached devices - read & write stuff */ > > +static void quit_max_writeback_rate(struct cache_set *c, > + struct cached_dev *this_dc) > +{ > + int i; > + struct bcache_device *d; > + struct cached_dev *dc; > + > + /* > + * If bch_register_lock is acquired by other attach/detach operations, > + * waiting here will increase I/O request latency for seconds or more. > + * To avoid such situation, only writeback rate of current cached device > + * is set to 1, and __update_write_back() will decide writeback rate > + * of other cached devices (remember c->idle_counter is 0 now). > + */ > + if (mutex_trylock(&bch_register_lock)){ > + for (i = 0; i < c->devices_max_used; i++) { > + if (!c->devices[i]) > + continue; > + > + if (UUID_FLASH_ONLY(&c->uuids[i])) > + continue; > + > + d = c->devices[i]; > + dc = container_of(d, struct cached_dev, disk); > + /* > + * set writeback rate to default minimum value, > + * then let update_writeback_rate() to decide the > + * upcoming rate. > + */ > + atomic64_set(&dc->writeback_rate.rate, 1); > + } > + > + mutex_unlock(&bch_register_lock); > + } else > + atomic64_set(&this_dc->writeback_rate.rate, 1); > +} > + > static blk_qc_t cached_dev_make_request(struct request_queue *q, > struct bio *bio) > { > @@ -1119,7 +1156,19 @@ static blk_qc_t cached_dev_make_request(struct request_queue *q, > return BLK_QC_T_NONE; > } > > - atomic_set(&dc->backing_idle, 0); > + if (d->c) { > + atomic_set(&d->c->idle_counter, 0); > + /* > + * If at_max_writeback_rate of cache set is true and new I/O > + * comes, quit max writeback rate of all cached devices > + * attached to this cache set, and set at_max_writeback_rate > + * to false. > + */ > + if (unlikely(atomic_read(&d->c->at_max_writeback_rate) == 1)) { > + atomic_set(&d->c->at_max_writeback_rate, 0); > + quit_max_writeback_rate(d->c, dc); > + } > + } > generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0); > > bio_set_dev(bio, dc->bdev); > diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c > index fa4058e43202..fa532d9f9353 100644 > --- a/drivers/md/bcache/super.c > +++ b/drivers/md/bcache/super.c > @@ -1687,6 +1687,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb) > c->block_bits = ilog2(sb->block_size); > c->nr_uuids = bucket_bytes(c) / sizeof(struct uuid_entry); > c->devices_max_used = 0; > + c->previous_dirty_dc_nr = 0; > c->btree_pages = bucket_pages(c); > if (c->btree_pages > BTREE_MAX_PAGES) > c->btree_pages = max_t(int, c->btree_pages / 4, > diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c > index 225b15aa0340..d719021bff81 100644 > --- a/drivers/md/bcache/sysfs.c > +++ b/drivers/md/bcache/sysfs.c > @@ -170,7 +170,8 @@ SHOW(__bch_cached_dev) > var_printf(writeback_running, "%i"); > var_print(writeback_delay); > var_print(writeback_percent); > - sysfs_hprint(writeback_rate, dc->writeback_rate.rate << 9); > + sysfs_hprint(writeback_rate, > + atomic64_read(&dc->writeback_rate.rate) << 9); > sysfs_hprint(io_errors, atomic_read(&dc->io_errors)); > sysfs_printf(io_error_limit, "%i", dc->error_limit); > sysfs_printf(io_disable, "%i", dc->io_disable); > @@ -188,7 +189,8 @@ SHOW(__bch_cached_dev) > char change[20]; > s64 next_io; > > - bch_hprint(rate, dc->writeback_rate.rate << 9); > + bch_hprint(rate, > + atomic64_read(&dc->writeback_rate.rate) << 9); > bch_hprint(dirty, bcache_dev_sectors_dirty(&dc->disk) << 9); > bch_hprint(target, dc->writeback_rate_target << 9); > bch_hprint(proportional,dc->writeback_rate_proportional << 9); > @@ -255,8 +257,12 @@ STORE(__cached_dev) > > sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent, 0, 40); > > - sysfs_strtoul_clamp(writeback_rate, > - dc->writeback_rate.rate, 1, INT_MAX); > + if (attr == &sysfs_writeback_rate) { > + int v; > + > + sysfs_strtoul_clamp(writeback_rate, v, 1, INT_MAX); > + atomic64_set(&dc->writeback_rate.rate, v); > + } > > sysfs_strtoul_clamp(writeback_rate_update_seconds, > dc->writeback_rate_update_seconds, > diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c > index fc479b026d6d..84f90c3d996d 100644 > --- a/drivers/md/bcache/util.c > +++ b/drivers/md/bcache/util.c > @@ -200,7 +200,7 @@ uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t done) > { > uint64_t now = local_clock(); > > - d->next += div_u64(done * NSEC_PER_SEC, d->rate); > + d->next += div_u64(done * NSEC_PER_SEC, atomic64_read(&d->rate)); > > /* Bound the time. Don't let us fall further than 2 seconds behind > * (this prevents unnecessary backlog that would make it impossible > diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h > index cced87f8eb27..7e17f32ab563 100644 > --- a/drivers/md/bcache/util.h > +++ b/drivers/md/bcache/util.h > @@ -442,7 +442,7 @@ struct bch_ratelimit { > * Rate at which we want to do work, in units per second > * The units here correspond to the units passed to bch_next_delay() > */ > - uint32_t rate; > + atomic64_t rate; > }; > > static inline void bch_ratelimit_reset(struct bch_ratelimit *d) > diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c > index ad45ebe1a74b..11ffadc3cf8f 100644 > --- a/drivers/md/bcache/writeback.c > +++ b/drivers/md/bcache/writeback.c > @@ -49,6 +49,80 @@ static uint64_t __calc_target_rate(struct cached_dev *dc) > return (cache_dirty_target * bdev_share) >> WRITEBACK_SHARE_SHIFT; > } > > +static bool set_at_max_writeback_rate(struct cache_set *c, > + struct cached_dev *dc) > +{ > + int i, dirty_dc_nr = 0; > + struct bcache_device *d; > + > + /* > + * bch_register_lock is acquired in cached_dev_detach_finish() before > + * calling cancel_writeback_rate_update_dwork() to stop the delayed > + * kworker writeback_rate_update (where the context we are for now). > + * Therefore call mutex_lock() here may introduce deadlock when shut > + * down the bcache device. > + * c->previous_dirty_dc_nr is used to record previous calculated > + * dirty_dc_nr when mutex_trylock() last time succeeded. Then if > + * mutex_trylock() failed here, use c->previous_dirty_dc_nr as dirty > + * cached device number. Of cause it might be inaccurate, but a few more > + * or less loop before setting c->at_max_writeback_rate is much better > + * then a deadlock here. > + */ > + if (mutex_trylock(&bch_register_lock)) { > + for (i = 0; i < c->devices_max_used; i++) { > + if (!c->devices[i]) > + continue; > + if (UUID_FLASH_ONLY(&c->uuids[i])) > + continue; > + d = c->devices[i]; > + dc = container_of(d, struct cached_dev, disk); > + if (atomic_read(&dc->has_dirty)) > + dirty_dc_nr++; > + } > + c->previous_dirty_dc_nr = dirty_dc_nr; > + > + mutex_unlock(&bch_register_lock); > + } else > + dirty_dc_nr = c->previous_dirty_dc_nr; > + > + /* > + * Idle_counter is increased everytime when update_writeback_rate() > + * is rescheduled in. If all backing devices attached to the same > + * cache set has same dc->writeback_rate_update_seconds value, it > + * is about 10 rounds of update_writeback_rate() is called on each > + * backing device, then the code will fall through at set 1 to > + * c->at_max_writeback_rate, and a max wrteback rate to each > + * dc->writeback_rate.rate. This is not very accurate but works well > + * to make sure the whole cache set has no new I/O coming before > + * writeback rate is set to a max number. > + */ > + if (atomic_inc_return(&c->idle_counter) < dirty_dc_nr * 10) > + return false; > + > + if (atomic_read(&c->at_max_writeback_rate) != 1) > + atomic_set(&c->at_max_writeback_rate, 1); > + > + > + atomic64_set(&dc->writeback_rate.rate, INT_MAX); > + > + /* keep writeback_rate_target as existing value */ > + dc->writeback_rate_proportional = 0; > + dc->writeback_rate_integral_scaled = 0; > + dc->writeback_rate_change = 0; > + > + /* > + * Check c->idle_counter and c->at_max_writeback_rate agagain in case > + * new I/O arrives during before set_at_max_writeback_rate() returns. > + * Then the writeback rate is set to 1, and its new value should be > + * decided via __update_writeback_rate(). > + */ > + if (atomic_read(&c->idle_counter) < dirty_dc_nr * 10 || > + !atomic_read(&c->at_max_writeback_rate)) > + return false; > + > + return true; > +} > + > static void __update_writeback_rate(struct cached_dev *dc) > { > /* > @@ -104,8 +178,9 @@ static void __update_writeback_rate(struct cached_dev *dc) > > dc->writeback_rate_proportional = proportional_scaled; > dc->writeback_rate_integral_scaled = integral_scaled; > - dc->writeback_rate_change = new_rate - dc->writeback_rate.rate; > - dc->writeback_rate.rate = new_rate; > + dc->writeback_rate_change = new_rate - > + atomic64_read(&dc->writeback_rate.rate); > + atomic64_set(&dc->writeback_rate.rate, new_rate); > dc->writeback_rate_target = target; > } > > @@ -138,9 +213,16 @@ static void update_writeback_rate(struct work_struct *work) > > down_read(&dc->writeback_lock); > > - if (atomic_read(&dc->has_dirty) && > - dc->writeback_percent) > - __update_writeback_rate(dc); > + if (atomic_read(&dc->has_dirty) && dc->writeback_percent) { > + /* > + * If the whole cache set is idle, set_at_max_writeback_rate() > + * will set writeback rate to a max number. Then it is > + * unncessary to update writeback rate for an idle cache set > + * in maximum writeback rate number(s). > + */ > + if (!set_at_max_writeback_rate(c, dc)) > + __update_writeback_rate(dc); > + } > > up_read(&dc->writeback_lock); > > @@ -422,27 +504,6 @@ static void read_dirty(struct cached_dev *dc) > > delay = writeback_delay(dc, size); > > - /* If the control system would wait for at least half a > - * second, and there's been no reqs hitting the backing disk > - * for awhile: use an alternate mode where we have at most > - * one contiguous set of writebacks in flight at a time. If > - * someone wants to do IO it will be quick, as it will only > - * have to contend with one operation in flight, and we'll > - * be round-tripping data to the backing disk as quickly as > - * it can accept it. > - */ > - if (delay >= HZ / 2) { > - /* 3 means at least 1.5 seconds, up to 7.5 if we > - * have slowed way down. > - */ > - if (atomic_inc_return(&dc->backing_idle) >= 3) { > - /* Wait for current I/Os to finish */ > - closure_sync(&cl); > - /* And immediately launch a new set. */ > - delay = 0; > - } > - } > - > while (!kthread_should_stop() && > !test_bit(CACHE_SET_IO_DISABLE, &dc->disk.c->flags) && > delay) { > @@ -715,7 +776,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc) > dc->writeback_running = true; > dc->writeback_percent = 10; > dc->writeback_delay = 30; > - dc->writeback_rate.rate = 1024; > + atomic64_set(&dc->writeback_rate.rate, 1024); > dc->writeback_rate_minimum = 8; > > dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT; >
On 2018/7/24 3:16 PM, Stefan Priebe - Profihost AG wrote: > Am 24.07.2018 um 06:03 schrieb Coly Li: >> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") >> allows the writeback rate to be faster if there is no I/O request on a >> bcache device. It works well if there is only one bcache device attached >> to the cache set. If there are many bcache devices attached to a cache >> set, it may introduce performance regression because multiple faster >> writeback threads of the idle bcache devices will compete the btree level >> locks with the bcache device who have I/O requests coming. >> >> This patch fixes the above issue by only permitting fast writebac when >> all bcache devices attached on the cache set are idle. And if one of the >> bcache devices has new I/O request coming, minimized all writeback >> throughput immediately and let PI controller __update_writeback_rate() >> to decide the upcoming writeback rate for each bcache device. >> >> Also when all bcache devices are idle, limited wrieback rate to a small >> number is wast of thoughput, especially when backing devices are slower >> non-rotation devices (e.g. SATA SSD). This patch sets a max writeback >> rate for each backing device if the whole cache set is idle. A faster >> writeback rate in idle time means new I/Os may have more available space >> for dirty data, and people may observe a better write performance then. >> >> Please note bcache may change its cache mode in run time, and this patch >> still works if the cache mode is switched from writeback mode and there >> is still dirty data on cache. >> >> Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") >> Cc: stable@vger.kernel.org #4.16+ >> Signed-off-by: Coly Li <colyli@suse.de> >> Tested-by: Kai Krakow <kai@kaishome.de> >> Cc: Michael Lyle <mlyle@lyle.org> >> Cc: Stefan Priebe <s.priebe@profihost.ag> > > Hi Coly, > > i'm still experiencing the same bug as yesterday while rebooting every > two times i get a deadlock in cache_set_free. > > Sadly it's so late ion the shutdown process that netconsole is already > unloaded or at least i get no messages from while it happens. The traces > look the same like yesterday. Hi Stefan, Do you use the v2 patch on latest SLE15 kernel source? Let me try to reproduce it on my machine. I assume when you reboot, the cache is still dirty and not clean up, am I right ? And which file system do you mount on top of the bcache device ? Thanks. Coly Li
Am 24.07.2018 um 10:28 schrieb Coly Li: > On 2018/7/24 3:16 PM, Stefan Priebe - Profihost AG wrote: >> Am 24.07.2018 um 06:03 schrieb Coly Li: >>> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") >>> allows the writeback rate to be faster if there is no I/O request on a >>> bcache device. It works well if there is only one bcache device attached >>> to the cache set. If there are many bcache devices attached to a cache >>> set, it may introduce performance regression because multiple faster >>> writeback threads of the idle bcache devices will compete the btree level >>> locks with the bcache device who have I/O requests coming. >>> >>> This patch fixes the above issue by only permitting fast writebac when >>> all bcache devices attached on the cache set are idle. And if one of the >>> bcache devices has new I/O request coming, minimized all writeback >>> throughput immediately and let PI controller __update_writeback_rate() >>> to decide the upcoming writeback rate for each bcache device. >>> >>> Also when all bcache devices are idle, limited wrieback rate to a small >>> number is wast of thoughput, especially when backing devices are slower >>> non-rotation devices (e.g. SATA SSD). This patch sets a max writeback >>> rate for each backing device if the whole cache set is idle. A faster >>> writeback rate in idle time means new I/Os may have more available space >>> for dirty data, and people may observe a better write performance then. >>> >>> Please note bcache may change its cache mode in run time, and this patch >>> still works if the cache mode is switched from writeback mode and there >>> is still dirty data on cache. >>> >>> Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") >>> Cc: stable@vger.kernel.org #4.16+ >>> Signed-off-by: Coly Li <colyli@suse.de> >>> Tested-by: Kai Krakow <kai@kaishome.de> >>> Cc: Michael Lyle <mlyle@lyle.org> >>> Cc: Stefan Priebe <s.priebe@profihost.ag> >> >> Hi Coly, >> >> i'm still experiencing the same bug as yesterday while rebooting every >> two times i get a deadlock in cache_set_free. >> >> Sadly it's so late ion the shutdown process that netconsole is already >> unloaded or at least i get no messages from while it happens. The traces >> look the same like yesterday. > > Hi Stefan, > > Do you use the v2 patch on latest SLE15 kernel source? Yes - i use the kernel code from here: https://github.com/openSUSE/kernel-source/commits/SLE15 > Let me try to > reproduce it on my machine. I assume when you reboot, the cache is still > dirty and not clean up, am I right ? Yes with around 15GB of dirty data - ceph is running on top of it and generated many random writes. > And which file system do you mount > on top of the bcache device ? XFS > > Thanks. > > Coly Li Greets, Stefan
On 2018/7/24 4:33 PM, Stefan Priebe - Profihost AG wrote: > Am 24.07.2018 um 10:28 schrieb Coly Li: >> On 2018/7/24 3:16 PM, Stefan Priebe - Profihost AG wrote: >>> Am 24.07.2018 um 06:03 schrieb Coly Li: >>>> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") >>>> allows the writeback rate to be faster if there is no I/O request on a >>>> bcache device. It works well if there is only one bcache device attached >>>> to the cache set. If there are many bcache devices attached to a cache >>>> set, it may introduce performance regression because multiple faster >>>> writeback threads of the idle bcache devices will compete the btree level >>>> locks with the bcache device who have I/O requests coming. >>>> >>>> This patch fixes the above issue by only permitting fast writebac when >>>> all bcache devices attached on the cache set are idle. And if one of the >>>> bcache devices has new I/O request coming, minimized all writeback >>>> throughput immediately and let PI controller __update_writeback_rate() >>>> to decide the upcoming writeback rate for each bcache device. >>>> >>>> Also when all bcache devices are idle, limited wrieback rate to a small >>>> number is wast of thoughput, especially when backing devices are slower >>>> non-rotation devices (e.g. SATA SSD). This patch sets a max writeback >>>> rate for each backing device if the whole cache set is idle. A faster >>>> writeback rate in idle time means new I/Os may have more available space >>>> for dirty data, and people may observe a better write performance then. >>>> >>>> Please note bcache may change its cache mode in run time, and this patch >>>> still works if the cache mode is switched from writeback mode and there >>>> is still dirty data on cache. >>>> >>>> Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") >>>> Cc: stable@vger.kernel.org #4.16+ >>>> Signed-off-by: Coly Li <colyli@suse.de> >>>> Tested-by: Kai Krakow <kai@kaishome.de> >>>> Cc: Michael Lyle <mlyle@lyle.org> >>>> Cc: Stefan Priebe <s.priebe@profihost.ag> >>> >>> Hi Coly, >>> >>> i'm still experiencing the same bug as yesterday while rebooting every >>> two times i get a deadlock in cache_set_free. >>> >>> Sadly it's so late ion the shutdown process that netconsole is already >>> unloaded or at least i get no messages from while it happens. The traces >>> look the same like yesterday. >> >> Hi Stefan, >> >> Do you use the v2 patch on latest SLE15 kernel source? > > Yes - i use the kernel code from here: > https://github.com/openSUSE/kernel-source/commits/SLE15 > >> Let me try to >> reproduce it on my machine. I assume when you reboot, the cache is still >> dirty and not clean up, am I right ? > > Yes with around 15GB of dirty data - ceph is running on top of it and > generated many random writes. > >> And which file system do you mount >> on top of the bcache device ? > XFS Hi Stefan, Thanks for the information. I managed to reproduce a similar deadlock on my small box. Let me see how to fix it and post a new version ASAP :-) Coly Li
Am 24.07.2018 um 18:36 schrieb Coly Li: > On 2018/7/24 4:33 PM, Stefan Priebe - Profihost AG wrote: >> Am 24.07.2018 um 10:28 schrieb Coly Li: >>> On 2018/7/24 3:16 PM, Stefan Priebe - Profihost AG wrote: >>>> Am 24.07.2018 um 06:03 schrieb Coly Li: >>>>> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") >>>>> allows the writeback rate to be faster if there is no I/O request on a >>>>> bcache device. It works well if there is only one bcache device attached >>>>> to the cache set. If there are many bcache devices attached to a cache >>>>> set, it may introduce performance regression because multiple faster >>>>> writeback threads of the idle bcache devices will compete the btree level >>>>> locks with the bcache device who have I/O requests coming. >>>>> >>>>> This patch fixes the above issue by only permitting fast writebac when >>>>> all bcache devices attached on the cache set are idle. And if one of the >>>>> bcache devices has new I/O request coming, minimized all writeback >>>>> throughput immediately and let PI controller __update_writeback_rate() >>>>> to decide the upcoming writeback rate for each bcache device. >>>>> >>>>> Also when all bcache devices are idle, limited wrieback rate to a small >>>>> number is wast of thoughput, especially when backing devices are slower >>>>> non-rotation devices (e.g. SATA SSD). This patch sets a max writeback >>>>> rate for each backing device if the whole cache set is idle. A faster >>>>> writeback rate in idle time means new I/Os may have more available space >>>>> for dirty data, and people may observe a better write performance then. >>>>> >>>>> Please note bcache may change its cache mode in run time, and this patch >>>>> still works if the cache mode is switched from writeback mode and there >>>>> is still dirty data on cache. >>>>> >>>>> Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") >>>>> Cc: stable@vger.kernel.org #4.16+ >>>>> Signed-off-by: Coly Li <colyli@suse.de> >>>>> Tested-by: Kai Krakow <kai@kaishome.de> >>>>> Cc: Michael Lyle <mlyle@lyle.org> >>>>> Cc: Stefan Priebe <s.priebe@profihost.ag> >>>> >>>> Hi Coly, >>>> >>>> i'm still experiencing the same bug as yesterday while rebooting every >>>> two times i get a deadlock in cache_set_free. >>>> >>>> Sadly it's so late ion the shutdown process that netconsole is already >>>> unloaded or at least i get no messages from while it happens. The traces >>>> look the same like yesterday. >>> >>> Hi Stefan, >>> >>> Do you use the v2 patch on latest SLE15 kernel source? >> >> Yes - i use the kernel code from here: >> https://github.com/openSUSE/kernel-source/commits/SLE15 >> >>> Let me try to >>> reproduce it on my machine. I assume when you reboot, the cache is still >>> dirty and not clean up, am I right ? >> >> Yes with around 15GB of dirty data - ceph is running on top of it and >> generated many random writes. >> >>> And which file system do you mount >>> on top of the bcache device ? >> XFS > > Hi Stefan, > > Thanks for the information. I managed to reproduce a similar deadlock on > my small box. Let me see how to fix it and post a new version ASAP :-) Great! > > Coly Li > -- > To unsubscribe from this list: send the line "unsubscribe linux-bcache" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >
> Am 25.07.2018 um 08:16 schrieb Stefan Priebe - Profihost AG <s.priebe@profihost.ag>: > >> Am 24.07.2018 um 18:36 schrieb Coly Li: >>> On 2018/7/24 4:33 PM, Stefan Priebe - Profihost AG wrote: >>>> Am 24.07.2018 um 10:28 schrieb Coly Li: >>>>> On 2018/7/24 3:16 PM, Stefan Priebe - Profihost AG wrote: >>>>>> Am 24.07.2018 um 06:03 schrieb Coly Li: >>>>>> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") >>>>>> allows the writeback rate to be faster if there is no I/O request on a >>>>>> bcache device. It works well if there is only one bcache device attached >>>>>> to the cache set. If there are many bcache devices attached to a cache >>>>>> set, it may introduce performance regression because multiple faster >>>>>> writeback threads of the idle bcache devices will compete the btree level >>>>>> locks with the bcache device who have I/O requests coming. >>>>>> >>>>>> This patch fixes the above issue by only permitting fast writebac when >>>>>> all bcache devices attached on the cache set are idle. And if one of the >>>>>> bcache devices has new I/O request coming, minimized all writeback >>>>>> throughput immediately and let PI controller __update_writeback_rate() >>>>>> to decide the upcoming writeback rate for each bcache device. >>>>>> >>>>>> Also when all bcache devices are idle, limited wrieback rate to a small >>>>>> number is wast of thoughput, especially when backing devices are slower >>>>>> non-rotation devices (e.g. SATA SSD). This patch sets a max writeback >>>>>> rate for each backing device if the whole cache set is idle. A faster >>>>>> writeback rate in idle time means new I/Os may have more available space >>>>>> for dirty data, and people may observe a better write performance then. >>>>>> >>>>>> Please note bcache may change its cache mode in run time, and this patch >>>>>> still works if the cache mode is switched from writeback mode and there >>>>>> is still dirty data on cache. >>>>>> >>>>>> Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") >>>>>> Cc: stable@vger.kernel.org #4.16+ >>>>>> Signed-off-by: Coly Li <colyli@suse.de> >>>>>> Tested-by: Kai Krakow <kai@kaishome.de> >>>>>> Cc: Michael Lyle <mlyle@lyle.org> >>>>>> Cc: Stefan Priebe <s.priebe@profihost.ag> >>>>> >>>>> Hi Coly, >>>>> >>>>> i'm still experiencing the same bug as yesterday while rebooting every >>>>> two times i get a deadlock in cache_set_free. >>>>> >>>>> Sadly it's so late ion the shutdown process that netconsole is already >>>>> unloaded or at least i get no messages from while it happens. The traces >>>>> look the same like yesterday. >>>> >>>> Hi Stefan, >>>> >>>> Do you use the v2 patch on latest SLE15 kernel source? >>> >>> Yes - i use the kernel code from here: >>> https://github.com/openSUSE/kernel-source/commits/SLE15 >>> >>>> Let me try to >>>> reproduce it on my machine. I assume when you reboot, the cache is still >>>> dirty and not clean up, am I right ? >>> >>> Yes with around 15GB of dirty data - ceph is running on top of it and >>> generated many random writes. >>> >>>> And which file system do you mount >>>> on top of the bcache device ? >>> XFS >> >> Hi Stefan, >> >> Thanks for the information. I managed to reproduce a similar deadlock on >> my small box. Let me see how to fix it and post a new version ASAP :-) > > Great! Any news on this already? Greets, Stefan > >> >> Coly Li >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >>
On 2018/7/26 1:44 PM, Stefan Priebe - Profihost AG wrote: >> Am 25.07.2018 um 08:16 schrieb Stefan Priebe - Profihost AG <s.priebe@profihost.ag>: >> >>> Am 24.07.2018 um 18:36 schrieb Coly Li: >>>> On 2018/7/24 4:33 PM, Stefan Priebe - Profihost AG wrote: >>>>> Am 24.07.2018 um 10:28 schrieb Coly Li: >>>>>> On 2018/7/24 3:16 PM, Stefan Priebe - Profihost AG wrote: >>>>>>> Am 24.07.2018 um 06:03 schrieb Coly Li: >>>>>>> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") >>>>>>> allows the writeback rate to be faster if there is no I/O request on a >>>>>>> bcache device. It works well if there is only one bcache device attached >>>>>>> to the cache set. If there are many bcache devices attached to a cache >>>>>>> set, it may introduce performance regression because multiple faster >>>>>>> writeback threads of the idle bcache devices will compete the btree level >>>>>>> locks with the bcache device who have I/O requests coming. >>>>>>> >>>>>>> This patch fixes the above issue by only permitting fast writebac when >>>>>>> all bcache devices attached on the cache set are idle. And if one of the >>>>>>> bcache devices has new I/O request coming, minimized all writeback >>>>>>> throughput immediately and let PI controller __update_writeback_rate() >>>>>>> to decide the upcoming writeback rate for each bcache device. >>>>>>> >>>>>>> Also when all bcache devices are idle, limited wrieback rate to a small >>>>>>> number is wast of thoughput, especially when backing devices are slower >>>>>>> non-rotation devices (e.g. SATA SSD). This patch sets a max writeback >>>>>>> rate for each backing device if the whole cache set is idle. A faster >>>>>>> writeback rate in idle time means new I/Os may have more available space >>>>>>> for dirty data, and people may observe a better write performance then. >>>>>>> >>>>>>> Please note bcache may change its cache mode in run time, and this patch >>>>>>> still works if the cache mode is switched from writeback mode and there >>>>>>> is still dirty data on cache. >>>>>>> >>>>>>> Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") >>>>>>> Cc: stable@vger.kernel.org #4.16+ >>>>>>> Signed-off-by: Coly Li <colyli@suse.de> >>>>>>> Tested-by: Kai Krakow <kai@kaishome.de> >>>>>>> Cc: Michael Lyle <mlyle@lyle.org> >>>>>>> Cc: Stefan Priebe <s.priebe@profihost.ag> >>>>>> >>>>>> Hi Coly, >>>>>> >>>>>> i'm still experiencing the same bug as yesterday while rebooting every >>>>>> two times i get a deadlock in cache_set_free. >>>>>> >>>>>> Sadly it's so late ion the shutdown process that netconsole is already >>>>>> unloaded or at least i get no messages from while it happens. The traces >>>>>> look the same like yesterday. >>>>> >>>>> Hi Stefan, >>>>> >>>>> Do you use the v2 patch on latest SLE15 kernel source? >>>> >>>> Yes - i use the kernel code from here: >>>> https://github.com/openSUSE/kernel-source/commits/SLE15 >>>> >>>>> Let me try to >>>>> reproduce it on my machine. I assume when you reboot, the cache is still >>>>> dirty and not clean up, am I right ? >>>> >>>> Yes with around 15GB of dirty data - ceph is running on top of it and >>>> generated many random writes. >>>> >>>>> And which file system do you mount >>>>> on top of the bcache device ? >>>> XFS >>> >>> Hi Stefan, >>> >>> Thanks for the information. I managed to reproduce a similar deadlock on >>> my small box. Let me see how to fix it and post a new version ASAP :-) >> >> Great! > > Any news on this already? I am testing the new version, and it also requires other patches I posted earlier for 4.19 (which gets rid of bch_register_lock in writeback thread). Hope I can make it today :-) Coly Li
On 12/15/18 2:10 AM, Michael Lyle wrote: > Coly-- > > Apologies for the late reply on this. I just noticed it based on Greg's > comment about stable. > > When I wrote the previous "accelerate writeback" patchset, my first > attempt was very much like this. I believe it was asked (by you?) > whether it would impact the latency of front-end I/O because of deep > backing device queues when a new request comes in. > > Won't this cause lots of requests to be pending to backing, so if there > is intermittent front-end I/O they'll have to wait for the device? > That's why I previous had it set to only complete one writeback at a > time, to bound the impact on latency-- based on that review feedback. Hi Mike, This patch is a much more conservative effort. It sets a high writeback rate only when all attached bcache device are idled for quite many seconds. In this situation, the cache set is really quite and spared. Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") just looks at single bcache device. If there are I/Os for other bcache device on the cache set, and a single bcache device is idle, a faster writeback rate for this single idle bcache device will happen, I/O to read dirty data on cache for writeback will have negative impact to I/O requests of other bcache devices. Therefore I give up a specific faster writeback, to make sure the latency of front end I/O in general. Thanks. Coly Li > On Mon, Jul 23, 2018 at 9:03 PM Coly Li <colyli@suse.de > <mailto:colyli@suse.de>> wrote: > > Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") > allows the writeback rate to be faster if there is no I/O request on a > bcache device. It works well if there is only one bcache device attached > to the cache set. If there are many bcache devices attached to a cache > set, it may introduce performance regression because multiple faster > writeback threads of the idle bcache devices will compete the btree > level > locks with the bcache device who have I/O requests coming. > > This patch fixes the above issue by only permitting fast writebac when > all bcache devices attached on the cache set are idle. And if one of the > bcache devices has new I/O request coming, minimized all writeback > throughput immediately and let PI controller __update_writeback_rate() > to decide the upcoming writeback rate for each bcache device. > > Also when all bcache devices are idle, limited wrieback rate to a small > number is wast of thoughput, especially when backing devices are slower > non-rotation devices (e.g. SATA SSD). This patch sets a max writeback > rate for each backing device if the whole cache set is idle. A faster > writeback rate in idle time means new I/Os may have more available space > for dirty data, and people may observe a better write performance then. > > Please note bcache may change its cache mode in run time, and this patch > still works if the cache mode is switched from writeback mode and there > is still dirty data on cache. > > Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when > backing idle") > Cc: stable@vger.kernel.org <mailto:stable@vger.kernel.org> #4.16+ > Signed-off-by: Coly Li <colyli@suse.de <mailto:colyli@suse.de>> > Tested-by: Kai Krakow <kai@kaishome.de <mailto:kai@kaishome.de>> > Cc: Michael Lyle <mlyle@lyle.org <mailto:mlyle@lyle.org>> > Cc: Stefan Priebe <s.priebe@profihost.ag <mailto:s.priebe@profihost.ag>> > --- > Channgelog: > v2, Fix a deadlock reported by Stefan Priebe. > v1, Initial version. > > drivers/md/bcache/bcache.h | 11 ++-- > drivers/md/bcache/request.c | 51 ++++++++++++++- > drivers/md/bcache/super.c | 1 + > drivers/md/bcache/sysfs.c | 14 +++-- > drivers/md/bcache/util.c | 2 +- > drivers/md/bcache/util.h | 2 +- > drivers/md/bcache/writeback.c | 115 ++++++++++++++++++++++++++-------- > 7 files changed, 155 insertions(+), 41 deletions(-) > > diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h > index d6bf294f3907..469ab1a955e0 100644 > --- a/drivers/md/bcache/bcache.h > +++ b/drivers/md/bcache/bcache.h > @@ -328,13 +328,6 @@ struct cached_dev { > */ > atomic_t has_dirty; > > - /* > - * Set to zero by things that touch the backing volume-- except > - * writeback. Incremented by writeback. Used to determine > when to > - * accelerate idle writeback. > - */ > - atomic_t backing_idle; > - > struct bch_ratelimit writeback_rate; > struct delayed_work writeback_rate_update; > > @@ -514,6 +507,8 @@ struct cache_set { > struct cache_accounting accounting; > > unsigned long flags; > + atomic_t idle_counter; > + atomic_t at_max_writeback_rate; > > struct cache_sb sb; > > @@ -523,6 +518,8 @@ struct cache_set { > > struct bcache_device **devices; > unsigned devices_max_used; > + /* See set_at_max_writeback_rate() for how it is used */ > + unsigned previous_dirty_dc_nr; > struct list_head cached_devs; > uint64_t cached_dev_sectors; > struct closure caching; > diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c > index ae67f5fa8047..1af3d96abfa5 100644 > --- a/drivers/md/bcache/request.c > +++ b/drivers/md/bcache/request.c > @@ -1104,6 +1104,43 @@ static void detached_dev_do_request(struct > bcache_device *d, struct bio *bio) > > /* Cached devices - read & write stuff */ > > +static void quit_max_writeback_rate(struct cache_set *c, > + struct cached_dev *this_dc) > +{ > + int i; > + struct bcache_device *d; > + struct cached_dev *dc; > + > + /* > + * If bch_register_lock is acquired by other attach/detach > operations, > + * waiting here will increase I/O request latency for > seconds or more. > + * To avoid such situation, only writeback rate of current > cached device > + * is set to 1, and __update_write_back() will decide > writeback rate > + * of other cached devices (remember c->idle_counter is 0 now). > + */ > + if (mutex_trylock(&bch_register_lock)){ > + for (i = 0; i < c->devices_max_used; i++) { > + if (!c->devices[i]) > + continue; > + > + if (UUID_FLASH_ONLY(&c->uuids[i])) > + continue; > + > + d = c->devices[i]; > + dc = container_of(d, struct cached_dev, disk); > + /* > + * set writeback rate to default minimum value, > + * then let update_writeback_rate() to > decide the > + * upcoming rate. > + */ > + atomic64_set(&dc->writeback_rate.rate, 1); > + } > + > + mutex_unlock(&bch_register_lock); > + } else > + atomic64_set(&this_dc->writeback_rate.rate, 1); > +} > + > static blk_qc_t cached_dev_make_request(struct request_queue *q, > struct bio *bio) > { > @@ -1119,7 +1156,19 @@ static blk_qc_t > cached_dev_make_request(struct request_queue *q, > return BLK_QC_T_NONE; > } > > - atomic_set(&dc->backing_idle, 0); > + if (d->c) { > + atomic_set(&d->c->idle_counter, 0); > + /* > + * If at_max_writeback_rate of cache set is true and > new I/O > + * comes, quit max writeback rate of all cached devices > + * attached to this cache set, and set > at_max_writeback_rate > + * to false. > + */ > + if > (unlikely(atomic_read(&d->c->at_max_writeback_rate) == 1)) { > + atomic_set(&d->c->at_max_writeback_rate, 0); > + quit_max_writeback_rate(d->c, dc); > + } > + } > generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0); > > bio_set_dev(bio, dc->bdev); > diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c > index fa4058e43202..fa532d9f9353 100644 > --- a/drivers/md/bcache/super.c > +++ b/drivers/md/bcache/super.c > @@ -1687,6 +1687,7 @@ struct cache_set *bch_cache_set_alloc(struct > cache_sb *sb) > c->block_bits = ilog2(sb->block_size); > c->nr_uuids = bucket_bytes(c) / sizeof(struct > uuid_entry); > c->devices_max_used = 0; > + c->previous_dirty_dc_nr = 0; > c->btree_pages = bucket_pages(c); > if (c->btree_pages > BTREE_MAX_PAGES) > c->btree_pages = max_t(int, c->btree_pages / 4, > diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c > index 225b15aa0340..d719021bff81 100644 > --- a/drivers/md/bcache/sysfs.c > +++ b/drivers/md/bcache/sysfs.c > @@ -170,7 +170,8 @@ SHOW(__bch_cached_dev) > var_printf(writeback_running, "%i"); > var_print(writeback_delay); > var_print(writeback_percent); > - sysfs_hprint(writeback_rate, dc->writeback_rate.rate << 9); > + sysfs_hprint(writeback_rate, > + atomic64_read(&dc->writeback_rate.rate) << 9); > sysfs_hprint(io_errors, atomic_read(&dc->io_errors)); > sysfs_printf(io_error_limit, "%i", dc->error_limit); > sysfs_printf(io_disable, "%i", dc->io_disable); > @@ -188,7 +189,8 @@ SHOW(__bch_cached_dev) > char change[20]; > s64 next_io; > > - bch_hprint(rate, dc->writeback_rate.rate << 9); > + bch_hprint(rate, > + atomic64_read(&dc->writeback_rate.rate) > << 9); > bch_hprint(dirty, > bcache_dev_sectors_dirty(&dc->disk) << 9); > bch_hprint(target, dc->writeback_rate_target << 9); > > bch_hprint(proportional,dc->writeback_rate_proportional << 9); > @@ -255,8 +257,12 @@ STORE(__cached_dev) > > sysfs_strtoul_clamp(writeback_percent, > dc->writeback_percent, 0, 40); > > - sysfs_strtoul_clamp(writeback_rate, > - dc->writeback_rate.rate, 1, INT_MAX); > + if (attr == &sysfs_writeback_rate) { > + int v; > + > + sysfs_strtoul_clamp(writeback_rate, v, 1, INT_MAX); > + atomic64_set(&dc->writeback_rate.rate, v); > + } > > sysfs_strtoul_clamp(writeback_rate_update_seconds, > dc->writeback_rate_update_seconds, > diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c > index fc479b026d6d..84f90c3d996d 100644 > --- a/drivers/md/bcache/util.c > +++ b/drivers/md/bcache/util.c > @@ -200,7 +200,7 @@ uint64_t bch_next_delay(struct bch_ratelimit *d, > uint64_t done) > { > uint64_t now = local_clock(); > > - d->next += div_u64(done * NSEC_PER_SEC, d->rate); > + d->next += div_u64(done * NSEC_PER_SEC, > atomic64_read(&d->rate)); > > /* Bound the time. Don't let us fall further than 2 seconds > behind > * (this prevents unnecessary backlog that would make it > impossible > diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h > index cced87f8eb27..7e17f32ab563 100644 > --- a/drivers/md/bcache/util.h > +++ b/drivers/md/bcache/util.h > @@ -442,7 +442,7 @@ struct bch_ratelimit { > * Rate at which we want to do work, in units per second > * The units here correspond to the units passed to > bch_next_delay() > */ > - uint32_t rate; > + atomic64_t rate; > }; > > static inline void bch_ratelimit_reset(struct bch_ratelimit *d) > diff --git a/drivers/md/bcache/writeback.c > b/drivers/md/bcache/writeback.c > index ad45ebe1a74b..11ffadc3cf8f 100644 > --- a/drivers/md/bcache/writeback.c > +++ b/drivers/md/bcache/writeback.c > @@ -49,6 +49,80 @@ static uint64_t __calc_target_rate(struct > cached_dev *dc) > return (cache_dirty_target * bdev_share) >> > WRITEBACK_SHARE_SHIFT; > } > > +static bool set_at_max_writeback_rate(struct cache_set *c, > + struct cached_dev *dc) > +{ > + int i, dirty_dc_nr = 0; > + struct bcache_device *d; > + > + /* > + * bch_register_lock is acquired in > cached_dev_detach_finish() before > + * calling cancel_writeback_rate_update_dwork() to stop the > delayed > + * kworker writeback_rate_update (where the context we are > for now). > + * Therefore call mutex_lock() here may introduce deadlock > when shut > + * down the bcache device. > + * c->previous_dirty_dc_nr is used to record previous calculated > + * dirty_dc_nr when mutex_trylock() last time succeeded. Then if > + * mutex_trylock() failed here, use c->previous_dirty_dc_nr > as dirty > + * cached device number. Of cause it might be inaccurate, > but a few more > + * or less loop before setting c->at_max_writeback_rate is > much better > + * then a deadlock here. > + */ > + if (mutex_trylock(&bch_register_lock)) { > + for (i = 0; i < c->devices_max_used; i++) { > + if (!c->devices[i]) > + continue; > + if (UUID_FLASH_ONLY(&c->uuids[i])) > + continue; > + d = c->devices[i]; > + dc = container_of(d, struct cached_dev, disk); > + if (atomic_read(&dc->has_dirty)) > + dirty_dc_nr++; > + } > + c->previous_dirty_dc_nr = dirty_dc_nr; > + > + mutex_unlock(&bch_register_lock); > + } else > + dirty_dc_nr = c->previous_dirty_dc_nr; > + > + /* > + * Idle_counter is increased everytime when > update_writeback_rate() > + * is rescheduled in. If all backing devices attached to the > same > + * cache set has same dc->writeback_rate_update_seconds > value, it > + * is about 10 rounds of update_writeback_rate() is called > on each > + * backing device, then the code will fall through at set 1 to > + * c->at_max_writeback_rate, and a max wrteback rate to each > + * dc->writeback_rate.rate. This is not very accurate but > works well > + * to make sure the whole cache set has no new I/O coming before > + * writeback rate is set to a max number. > + */ > + if (atomic_inc_return(&c->idle_counter) < dirty_dc_nr * 10) > + return false; > + > + if (atomic_read(&c->at_max_writeback_rate) != 1) > + atomic_set(&c->at_max_writeback_rate, 1); > + > + > + atomic64_set(&dc->writeback_rate.rate, INT_MAX); > + > + /* keep writeback_rate_target as existing value */ > + dc->writeback_rate_proportional = 0; > + dc->writeback_rate_integral_scaled = 0; > + dc->writeback_rate_change = 0; > + > + /* > + * Check c->idle_counter and c->at_max_writeback_rate > agagain in case > + * new I/O arrives during before set_at_max_writeback_rate() > returns. > + * Then the writeback rate is set to 1, and its new value > should be > + * decided via __update_writeback_rate(). > + */ > + if (atomic_read(&c->idle_counter) < dirty_dc_nr * 10 || > + !atomic_read(&c->at_max_writeback_rate)) > + return false; > + > + return true; > +} > + > static void __update_writeback_rate(struct cached_dev *dc) > { > /* > @@ -104,8 +178,9 @@ static void __update_writeback_rate(struct > cached_dev *dc) > > dc->writeback_rate_proportional = proportional_scaled; > dc->writeback_rate_integral_scaled = integral_scaled; > - dc->writeback_rate_change = new_rate - dc->writeback_rate.rate; > - dc->writeback_rate.rate = new_rate; > + dc->writeback_rate_change = new_rate - > + atomic64_read(&dc->writeback_rate.rate); > + atomic64_set(&dc->writeback_rate.rate, new_rate); > dc->writeback_rate_target = target; > } > > @@ -138,9 +213,16 @@ static void update_writeback_rate(struct > work_struct *work) > > down_read(&dc->writeback_lock); > > - if (atomic_read(&dc->has_dirty) && > - dc->writeback_percent) > - __update_writeback_rate(dc); > + if (atomic_read(&dc->has_dirty) && dc->writeback_percent) { > + /* > + * If the whole cache set is idle, > set_at_max_writeback_rate() > + * will set writeback rate to a max number. Then it is > + * unncessary to update writeback rate for an idle > cache set > + * in maximum writeback rate number(s). > + */ > + if (!set_at_max_writeback_rate(c, dc)) > + __update_writeback_rate(dc); > + } > > up_read(&dc->writeback_lock); > > @@ -422,27 +504,6 @@ static void read_dirty(struct cached_dev *dc) > > delay = writeback_delay(dc, size); > > - /* If the control system would wait for at least half a > - * second, and there's been no reqs hitting the > backing disk > - * for awhile: use an alternate mode where we have > at most > - * one contiguous set of writebacks in flight at a > time. If > - * someone wants to do IO it will be quick, as it > will only > - * have to contend with one operation in flight, and > we'll > - * be round-tripping data to the backing disk as > quickly as > - * it can accept it. > - */ > - if (delay >= HZ / 2) { > - /* 3 means at least 1.5 seconds, up to 7.5 if we > - * have slowed way down. > - */ > - if (atomic_inc_return(&dc->backing_idle) >= 3) { > - /* Wait for current I/Os to finish */ > - closure_sync(&cl); > - /* And immediately launch a new set. */ > - delay = 0; > - } > - } > - > while (!kthread_should_stop() && > !test_bit(CACHE_SET_IO_DISABLE, > &dc->disk.c->flags) && > delay) { > @@ -715,7 +776,7 @@ void bch_cached_dev_writeback_init(struct > cached_dev *dc) > dc->writeback_running = true; > dc->writeback_percent = 10; > dc->writeback_delay = 30; > - dc->writeback_rate.rate = 1024; > + atomic64_set(&dc->writeback_rate.rate, 1024); > dc->writeback_rate_minimum = 8; > > dc->writeback_rate_update_seconds = > WRITEBACK_RATE_UPDATE_SECS_DEFAULT; > -- > 2.17.1 >
On Fri, Dec 14, 2018 at 8:08 PM Coly Li <colyli@suse.de> wrote: > Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") > just looks at single bcache device. If there are I/Os for other bcache > device on the cache set, and a single bcache device is idle, a faster > writeback rate for this single idle bcache device will happen, I/O to > read dirty data on cache for writeback will have negative impact to I/O > requests of other bcache devices. Yes, but this will potentially fill the queue on the idle device. If you have an application that is intermittently doing I/O every few seconds, the latency of the individual I/Os will be high. I was told during review that this was unacceptable, and revised my patch. But now we have merged a change that is more like my first patch that was rejected... Mike
On 12/15/18 12:22 PM, Michael Lyle wrote: > On Fri, Dec 14, 2018 at 8:08 PM Coly Li <colyli@suse.de> wrote: >> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") >> just looks at single bcache device. If there are I/Os for other bcache >> device on the cache set, and a single bcache device is idle, a faster >> writeback rate for this single idle bcache device will happen, I/O to >> read dirty data on cache for writeback will have negative impact to I/O >> requests of other bcache devices. > > Yes, but this will potentially fill the queue on the idle device. If > you have an application that is intermittently doing I/O every few > seconds, the latency of the individual I/Os will be high. I was told > during review that this was unacceptable, and revised my patch. But > now we have merged a change that is more like my first patch that was > rejected... The original patch only looks at a single bcache, this fix looks at all bcache devices. If only a single bcache attached on cache set, by default it will be idle for 30 seconds before the writeback rate to be set to a high rate. And if more bcache devices attached, longer threshold required. For 4 bcache devices attached the default idle threshold will be about 120 seconds. If I/O comes between more than 30 seconds (even minutes), I will not take it as an online system, most cases the latency should not be sensitive, or bcache is not proper for such workload. Coly Li
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index d6bf294f3907..469ab1a955e0 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -328,13 +328,6 @@ struct cached_dev { */ atomic_t has_dirty; - /* - * Set to zero by things that touch the backing volume-- except - * writeback. Incremented by writeback. Used to determine when to - * accelerate idle writeback. - */ - atomic_t backing_idle; - struct bch_ratelimit writeback_rate; struct delayed_work writeback_rate_update; @@ -514,6 +507,8 @@ struct cache_set { struct cache_accounting accounting; unsigned long flags; + atomic_t idle_counter; + atomic_t at_max_writeback_rate; struct cache_sb sb; @@ -523,6 +518,8 @@ struct cache_set { struct bcache_device **devices; unsigned devices_max_used; + /* See set_at_max_writeback_rate() for how it is used */ + unsigned previous_dirty_dc_nr; struct list_head cached_devs; uint64_t cached_dev_sectors; struct closure caching; diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index ae67f5fa8047..1af3d96abfa5 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -1104,6 +1104,43 @@ static void detached_dev_do_request(struct bcache_device *d, struct bio *bio) /* Cached devices - read & write stuff */ +static void quit_max_writeback_rate(struct cache_set *c, + struct cached_dev *this_dc) +{ + int i; + struct bcache_device *d; + struct cached_dev *dc; + + /* + * If bch_register_lock is acquired by other attach/detach operations, + * waiting here will increase I/O request latency for seconds or more. + * To avoid such situation, only writeback rate of current cached device + * is set to 1, and __update_write_back() will decide writeback rate + * of other cached devices (remember c->idle_counter is 0 now). + */ + if (mutex_trylock(&bch_register_lock)){ + for (i = 0; i < c->devices_max_used; i++) { + if (!c->devices[i]) + continue; + + if (UUID_FLASH_ONLY(&c->uuids[i])) + continue; + + d = c->devices[i]; + dc = container_of(d, struct cached_dev, disk); + /* + * set writeback rate to default minimum value, + * then let update_writeback_rate() to decide the + * upcoming rate. + */ + atomic64_set(&dc->writeback_rate.rate, 1); + } + + mutex_unlock(&bch_register_lock); + } else + atomic64_set(&this_dc->writeback_rate.rate, 1); +} + static blk_qc_t cached_dev_make_request(struct request_queue *q, struct bio *bio) { @@ -1119,7 +1156,19 @@ static blk_qc_t cached_dev_make_request(struct request_queue *q, return BLK_QC_T_NONE; } - atomic_set(&dc->backing_idle, 0); + if (d->c) { + atomic_set(&d->c->idle_counter, 0); + /* + * If at_max_writeback_rate of cache set is true and new I/O + * comes, quit max writeback rate of all cached devices + * attached to this cache set, and set at_max_writeback_rate + * to false. + */ + if (unlikely(atomic_read(&d->c->at_max_writeback_rate) == 1)) { + atomic_set(&d->c->at_max_writeback_rate, 0); + quit_max_writeback_rate(d->c, dc); + } + } generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0); bio_set_dev(bio, dc->bdev); diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index fa4058e43202..fa532d9f9353 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1687,6 +1687,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb) c->block_bits = ilog2(sb->block_size); c->nr_uuids = bucket_bytes(c) / sizeof(struct uuid_entry); c->devices_max_used = 0; + c->previous_dirty_dc_nr = 0; c->btree_pages = bucket_pages(c); if (c->btree_pages > BTREE_MAX_PAGES) c->btree_pages = max_t(int, c->btree_pages / 4, diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c index 225b15aa0340..d719021bff81 100644 --- a/drivers/md/bcache/sysfs.c +++ b/drivers/md/bcache/sysfs.c @@ -170,7 +170,8 @@ SHOW(__bch_cached_dev) var_printf(writeback_running, "%i"); var_print(writeback_delay); var_print(writeback_percent); - sysfs_hprint(writeback_rate, dc->writeback_rate.rate << 9); + sysfs_hprint(writeback_rate, + atomic64_read(&dc->writeback_rate.rate) << 9); sysfs_hprint(io_errors, atomic_read(&dc->io_errors)); sysfs_printf(io_error_limit, "%i", dc->error_limit); sysfs_printf(io_disable, "%i", dc->io_disable); @@ -188,7 +189,8 @@ SHOW(__bch_cached_dev) char change[20]; s64 next_io; - bch_hprint(rate, dc->writeback_rate.rate << 9); + bch_hprint(rate, + atomic64_read(&dc->writeback_rate.rate) << 9); bch_hprint(dirty, bcache_dev_sectors_dirty(&dc->disk) << 9); bch_hprint(target, dc->writeback_rate_target << 9); bch_hprint(proportional,dc->writeback_rate_proportional << 9); @@ -255,8 +257,12 @@ STORE(__cached_dev) sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent, 0, 40); - sysfs_strtoul_clamp(writeback_rate, - dc->writeback_rate.rate, 1, INT_MAX); + if (attr == &sysfs_writeback_rate) { + int v; + + sysfs_strtoul_clamp(writeback_rate, v, 1, INT_MAX); + atomic64_set(&dc->writeback_rate.rate, v); + } sysfs_strtoul_clamp(writeback_rate_update_seconds, dc->writeback_rate_update_seconds, diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c index fc479b026d6d..84f90c3d996d 100644 --- a/drivers/md/bcache/util.c +++ b/drivers/md/bcache/util.c @@ -200,7 +200,7 @@ uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t done) { uint64_t now = local_clock(); - d->next += div_u64(done * NSEC_PER_SEC, d->rate); + d->next += div_u64(done * NSEC_PER_SEC, atomic64_read(&d->rate)); /* Bound the time. Don't let us fall further than 2 seconds behind * (this prevents unnecessary backlog that would make it impossible diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h index cced87f8eb27..7e17f32ab563 100644 --- a/drivers/md/bcache/util.h +++ b/drivers/md/bcache/util.h @@ -442,7 +442,7 @@ struct bch_ratelimit { * Rate at which we want to do work, in units per second * The units here correspond to the units passed to bch_next_delay() */ - uint32_t rate; + atomic64_t rate; }; static inline void bch_ratelimit_reset(struct bch_ratelimit *d) diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index ad45ebe1a74b..11ffadc3cf8f 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -49,6 +49,80 @@ static uint64_t __calc_target_rate(struct cached_dev *dc) return (cache_dirty_target * bdev_share) >> WRITEBACK_SHARE_SHIFT; } +static bool set_at_max_writeback_rate(struct cache_set *c, + struct cached_dev *dc) +{ + int i, dirty_dc_nr = 0; + struct bcache_device *d; + + /* + * bch_register_lock is acquired in cached_dev_detach_finish() before + * calling cancel_writeback_rate_update_dwork() to stop the delayed + * kworker writeback_rate_update (where the context we are for now). + * Therefore call mutex_lock() here may introduce deadlock when shut + * down the bcache device. + * c->previous_dirty_dc_nr is used to record previous calculated + * dirty_dc_nr when mutex_trylock() last time succeeded. Then if + * mutex_trylock() failed here, use c->previous_dirty_dc_nr as dirty + * cached device number. Of cause it might be inaccurate, but a few more + * or less loop before setting c->at_max_writeback_rate is much better + * then a deadlock here. + */ + if (mutex_trylock(&bch_register_lock)) { + for (i = 0; i < c->devices_max_used; i++) { + if (!c->devices[i]) + continue; + if (UUID_FLASH_ONLY(&c->uuids[i])) + continue; + d = c->devices[i]; + dc = container_of(d, struct cached_dev, disk); + if (atomic_read(&dc->has_dirty)) + dirty_dc_nr++; + } + c->previous_dirty_dc_nr = dirty_dc_nr; + + mutex_unlock(&bch_register_lock); + } else + dirty_dc_nr = c->previous_dirty_dc_nr; + + /* + * Idle_counter is increased everytime when update_writeback_rate() + * is rescheduled in. If all backing devices attached to the same + * cache set has same dc->writeback_rate_update_seconds value, it + * is about 10 rounds of update_writeback_rate() is called on each + * backing device, then the code will fall through at set 1 to + * c->at_max_writeback_rate, and a max wrteback rate to each + * dc->writeback_rate.rate. This is not very accurate but works well + * to make sure the whole cache set has no new I/O coming before + * writeback rate is set to a max number. + */ + if (atomic_inc_return(&c->idle_counter) < dirty_dc_nr * 10) + return false; + + if (atomic_read(&c->at_max_writeback_rate) != 1) + atomic_set(&c->at_max_writeback_rate, 1); + + + atomic64_set(&dc->writeback_rate.rate, INT_MAX); + + /* keep writeback_rate_target as existing value */ + dc->writeback_rate_proportional = 0; + dc->writeback_rate_integral_scaled = 0; + dc->writeback_rate_change = 0; + + /* + * Check c->idle_counter and c->at_max_writeback_rate agagain in case + * new I/O arrives during before set_at_max_writeback_rate() returns. + * Then the writeback rate is set to 1, and its new value should be + * decided via __update_writeback_rate(). + */ + if (atomic_read(&c->idle_counter) < dirty_dc_nr * 10 || + !atomic_read(&c->at_max_writeback_rate)) + return false; + + return true; +} + static void __update_writeback_rate(struct cached_dev *dc) { /* @@ -104,8 +178,9 @@ static void __update_writeback_rate(struct cached_dev *dc) dc->writeback_rate_proportional = proportional_scaled; dc->writeback_rate_integral_scaled = integral_scaled; - dc->writeback_rate_change = new_rate - dc->writeback_rate.rate; - dc->writeback_rate.rate = new_rate; + dc->writeback_rate_change = new_rate - + atomic64_read(&dc->writeback_rate.rate); + atomic64_set(&dc->writeback_rate.rate, new_rate); dc->writeback_rate_target = target; } @@ -138,9 +213,16 @@ static void update_writeback_rate(struct work_struct *work) down_read(&dc->writeback_lock); - if (atomic_read(&dc->has_dirty) && - dc->writeback_percent) - __update_writeback_rate(dc); + if (atomic_read(&dc->has_dirty) && dc->writeback_percent) { + /* + * If the whole cache set is idle, set_at_max_writeback_rate() + * will set writeback rate to a max number. Then it is + * unncessary to update writeback rate for an idle cache set + * in maximum writeback rate number(s). + */ + if (!set_at_max_writeback_rate(c, dc)) + __update_writeback_rate(dc); + } up_read(&dc->writeback_lock); @@ -422,27 +504,6 @@ static void read_dirty(struct cached_dev *dc) delay = writeback_delay(dc, size); - /* If the control system would wait for at least half a - * second, and there's been no reqs hitting the backing disk - * for awhile: use an alternate mode where we have at most - * one contiguous set of writebacks in flight at a time. If - * someone wants to do IO it will be quick, as it will only - * have to contend with one operation in flight, and we'll - * be round-tripping data to the backing disk as quickly as - * it can accept it. - */ - if (delay >= HZ / 2) { - /* 3 means at least 1.5 seconds, up to 7.5 if we - * have slowed way down. - */ - if (atomic_inc_return(&dc->backing_idle) >= 3) { - /* Wait for current I/Os to finish */ - closure_sync(&cl); - /* And immediately launch a new set. */ - delay = 0; - } - } - while (!kthread_should_stop() && !test_bit(CACHE_SET_IO_DISABLE, &dc->disk.c->flags) && delay) { @@ -715,7 +776,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc) dc->writeback_running = true; dc->writeback_percent = 10; dc->writeback_delay = 30; - dc->writeback_rate.rate = 1024; + atomic64_set(&dc->writeback_rate.rate, 1024); dc->writeback_rate_minimum = 8; dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT;