
[GIT,PULL,20/20] lightnvm: pblk: sync RB and RL states during GC

Message ID 20180528085841.26684-21-mb@lightnvm.io (mailing list archive)
State New, archived

Commit Message

Matias Bjorling May 28, 2018, 8:58 a.m. UTC
From: Igor Konopko <igor.j.konopko@intel.com>

During sequential workloads we can hit the case when almost all the
lines are fully written with data. In that case the rate limiter will
significantly reduce the maximum number of requests available for user
IOs.

Unfortunately, in the case when the ring buffer is flushed to the drive
but the entries are not yet removed (which is fine, since there are
still enough free entries in the ring buffer for user IO), we hang on
user IO due to a lack of free entries in the rate limiter. The reason is
that the rate limiter user entries are only decreased after the ring
buffer entries are freed, which does not happen as long as there is
still plenty of space in the ring buffer.

This patch forces the ring buffer to be freed by calling
pblk_rb_sync_l2p, thus making new free entries available in the rate
limiter when there are not enough of them for user IO.

Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Signed-off-by: Marcin Dziegielewski <marcin.dziegielewski@intel.com>
[mb: reworded description]
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
---
 drivers/lightnvm/pblk-init.c | 2 ++
 drivers/lightnvm/pblk-rb.c   | 7 +++----
 2 files changed, 5 insertions(+), 4 deletions(-)

Comments

Javier Gonzalez May 28, 2018, 10:51 a.m. UTC | #1

I somehow missed these patches in the mailing list. Sorry for coming
with feedback this late. I'll look at my filters - in any case, would
you mind Cc'ing me in the future?

> On 28 May 2018, at 10.58, Matias Bjørling <mb@lightnvm.io> wrote:
> 
> From: Igor Konopko <igor.j.konopko@intel.com>
> 
> During sequential workloads we can hit the case when almost all the
> lines are fully written with data. In that case the rate limiter will
> significantly reduce the maximum number of requests available for user
> IOs.

Do you mean random writes? On fully sequential, a line will either be
fully written, fully invalidated, or on its way to be written. When
invalidating, the whole line is invalidated at once and GC will free it
without having to move valid data.

> 
> Unfortunately, in the case when the ring buffer is flushed to the drive
> but the entries are not yet removed (which is fine, since there are
> still enough free entries in the ring buffer for user IO), we hang on
> user IO due to a lack of free entries in the rate limiter. The reason is
> that the rate limiter user entries are only decreased after the ring
> buffer entries are freed, which does not happen as long as there is
> still plenty of space in the ring buffer.
> 
> This patch forces the ring buffer to be freed by calling
> pblk_rb_sync_l2p, thus making new free entries available in the rate
> limiter when there are not enough of them for user IO.

I can see why you might have problems with very low OP due to the rate
limiter, but unfortunately this is not a good way of solving the
problem. When you do this, you basically make the L2P point to the
device instead of pointing to the write cache, which in essence bypasses
mw_cunits. As a result, if a read comes in to one of the synced entries,
it will violate the device-host contract and most probably fail (it will
fail for sure on > SLC).
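
To make the contract concrete, here is a simplified model (a toy helper,
not pblk code; kernel types assumed) of when a read is safe:

static bool toy_read_is_safe(unsigned int sector_off,
			     unsigned int write_ptr,
			     unsigned int mw_cunits,
			     bool in_cache)
{
	/*
	 * Reads served from the write cache are always safe. Reads that
	 * go to the device are only safe once the sector is at least
	 * mw_cunits behind the line's write pointer.
	 */
	return in_cache || sector_off + mw_cunits <= write_ptr;
}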

I think that the right way of solving this problem is separating the
write and GC buffers and then assigning tokens to them. The write thread
will then consume both buffers based on these tokens. In this case, user
I/O will have a buffer for itself, which will be guaranteed to advance
at the rate the rate-limiter is allowing it to. Note that the 2 buffers
can be a single buffer with a new set of pointers so that the lookup can
be done with a single bit.
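
Roughly what I have in mind (just a sketch with hypothetical names, not
an existing pblk API):

struct split_rb_rl {
	unsigned int user_tokens;	/* entries budgeted for user writes */
	unsigned int gc_tokens;		/* entries budgeted for GC writes */
};

/* one bit in the entry flags tells the lookup which region it is in */
#define RB_ENTRY_GC	(1U << 0)

static bool split_rb_may_write(struct split_rb_rl *rl, bool gc,
			       unsigned int nr_entries)
{
	unsigned int *tokens = gc ? &rl->gc_tokens : &rl->user_tokens;

	/* user I/O can no longer be starved by GC consuming its entries */
	if (*tokens < nr_entries)
		return false;
	*tokens -= nr_entries;
	return true;
}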

I have been looking for time to implement this for a while. If you want
to give it a go, we can talk and I can give you some pointers on
potential issues I have thought about.

Javier
Igor Konopko May 29, 2018, 1:07 p.m. UTC | #2
> From: Javier Gonzalez [mailto:javier@cnexlabs.com]
> Do you mean random writes? On fully sequential, a line will either be
> fully written, fully invalidated, or on its way to be written. When
> invalidating, the whole line is invalidated at once and GC will free
> it without having to move valid data.

I meant sequential writes, since this is the easiest way to reach
rl->rb_state = PBLK_RL_LOW. When we update these values inside
__pblk_rl_update_rates(), most of the time rl->rb_user_cnt becomes equal
to rl->rb_user_max, which means that we cannot handle any more user IOs.
In that case pblk_rl_user_may_insert() will return false, no one will
call pblk_rb_update_l2p(), rl->rb_user_cnt will not be decreased, and
user IOs will be stuck forever.
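
As a toy model of the accounting (heavily simplified, not the real pblk
code - the real pblk_rl_user_may_insert() returns an IO status rather
than a bool):

struct toy_rl {
	unsigned int rb_user_cnt;	/* user entries currently held */
	unsigned int rb_user_max;	/* cap imposed by PBLK_RL_LOW */
};

/* simplified pblk_rl_user_may_insert() */
static bool toy_user_may_insert(struct toy_rl *rl, unsigned int nr_entries)
{
	return rl->rb_user_cnt + nr_entries <= rl->rb_user_max;
}

/* simplified pblk_rl_out(), reached only via pblk_rb_update_l2p() */
static void toy_rl_out(struct toy_rl *rl, unsigned int user_io)
{
	/* if no one ever syncs the L2P, this never runs and we hang */
	rl->rb_user_cnt -= user_io;
}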

> I can see why you might have problems with very low OP due to the rate
> limiter, but unfortunately this is not a good way of solving the
> problem. When you do this, you basically make the L2P point to the
> device instead of pointing to the write cache, which in essence bypasses
> mw_cunits. As a result, if a read comes in to one of the synced entries,
> it will violate the device-host contract and most probably fail (it will
> fail for sure on > SLC).

What about using on that path some modified version of
pblk_rb_sync_l2p() which would synchronize all the RB entries except for
the last mw_cunits of them?
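
Roughly like this, on top of the current implementation (only a sketch;
it treats the distance to the sync pointer as an approximation of the
mw_cunits window):

/* hypothetical variant of pblk_rb_sync_l2p(); caller holds rb->w_lock */
void pblk_rb_sync_l2p_partial(struct pblk_rb *rb, unsigned int mw_cunits)
{
	unsigned int sync;
	unsigned int to_update;

	/* Protect from reads and writes */
	sync = smp_load_acquire(&rb->sync);

	to_update = pblk_rb_ring_count(sync, rb->l2p_update, rb->nr_entries);

	/* leave the newest mw_cunits entries pointing to the cache */
	if (to_update > mw_cunits)
		__pblk_rb_update_l2p(rb, to_update - mw_cunits);
}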

Also, what you wrote about mw_cunits definitely makes sense, but even
without my changes I believe that we can run into such a situation -
especially for a pblk instance with a small number of LUNs assigned,
under write IOs with a high sector count. pblk_rb_update_l2p() does not
explicitly take mw_cunits into consideration right now.

> I think that the right way of solving this problem is separating the
> write and GC buffers and then assigning tokens to them. The write thread
> will then consume both buffers based on these tokens. In this case, user
> I/O will have a buffer for itself, which will be guaranteed to advance
> at the rate the rate-limiter is allowing it to. Note that the 2 buffers
> can be a single buffer with a new set of pointers so that the lookup can
> be done with a single bit.
> 
> I have been looking for time to implement this for a while. If you want
> to give it a go, we can talk and I can give you some pointers on
> potential issues I have thought about.

I believe this is an interesting option - we can discuss it for one of
the next releases.
Javier Gonzalez May 29, 2018, 5:58 p.m. UTC | #3
> On 29 May 2018, at 15.07, Konopko, Igor J <igor.j.konopko@intel.com> wrote:
> 
>> From: Javier Gonzalez [mailto:javier@cnexlabs.com]
>> Do you mean random writes? On fully sequential, a line will either be
>> fully written, fully invalidated, or on its way to be written. When
>> invalidating, the whole line is invalidated at once and GC will free
>> it without having to move valid data.
> 
> I meant sequential writes, since this is the easiest way to reach
> rl->rb_state = PBLK_RL_LOW. When we update these values inside
> __pblk_rl_update_rates(), most of the time rl->rb_user_cnt becomes
> equal to rl->rb_user_max, which means that we cannot handle any more
> user IOs. In that case pblk_rl_user_may_insert() will return false, no
> one will call pblk_rb_update_l2p(), rl->rb_user_cnt will not be
> decreased, and user IOs will be stuck forever.
> 

What OP are you using? Even with 5%, full lines should start being freed
as they become completely invalid. But fair enough: if you decide to
optimize for a guaranteed sequential workload, where the OP is only
there to support grown bad blocks, you will run into this issue.

>> I can see why you might have problems with very low OP due to the rate
>> limiter, but unfortunately this is not a good way of solving the
>> problem. When you do this, you basically make the L2P point to the
>> device instead of pointing to the write cache, which in essence bypasses
>> mw_cunits. As a result, if a read comes in to one of the synced entries,
>> it will violate the device-host contract and most probably fail (it will
>> fail for sure on > SLC).
> 
> What about using on that path some modified version of
> pblk_rb_sync_l2p() which would synchronize all the RB entries except
> for the last mw_cunits of them?
> 

Typically, the size of the buffer is the closest upper power of two to
nr_luns * mw_cunits. So in essence, we only update the L2P as we wrap
up. You could go down to match nr_luns * mw_cunits exactly, which,
depending on the media geometry, can gain you some extra user entries.
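
For reference, the sizing rule is essentially this (simplified; in the
kernel you would just use roundup_pow_of_two() on the product):

static unsigned int rb_nr_entries(unsigned int nr_luns,
				  unsigned int mw_cunits)
{
	unsigned int min_entries = nr_luns * mw_cunits;
	unsigned int size = 1;

	/* closest upper power of two to nr_luns * mw_cunits */
	while (size < min_entries)
		size <<= 1;

	return size;
}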

> Also, what you wrote about mw_cunits definitely makes sense, but even
> without my changes I believe that we can run into such a situation -
> especially for a pblk instance with a small number of LUNs assigned,
> under write IOs with a high sector count. pblk_rb_update_l2p() does not
> explicitly take mw_cunits into consideration right now.
> 

As mentioned, the size of the buffer is always over nr_luns * mw_cunits
and we only update when we wrap up - that is, when the mem pointer
(which writes new data to the write buffer) wraps around and starts
writing new data. So this case is guaranteed not to happen. In fact,
this particular case is what inspired the design of the write buffer and
the current 5 pointers to extend the typical head and tail.

>> I think that the right way of solving this problem is separating the
>> write and GC buffers and then assigning tokens to them. The write thread
>> will then consume both buffers based on these tokens. In this case, user
>> I/O will have a buffer for itself, which will be guaranteed to advance
>> at the rate the rate-limiter is allowing it to. Note that the 2 buffers
>> can be a single buffer with a new set of pointers so that the lookup can
>> be done with a single bit.
>> 
>> I have been looking for time to implement this for a while. If you want
>> to give it a go, we can talk and I can give you some pointers on
>> potential issues I have thought about.
> 
> I believe this is interesting option - we can discuss this for one of
> next releases.

Sure. Ping me if you want to discuss in more detail.

Javier

Patch

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 25aa1e73984f..9d7d9e3b8506 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -1159,7 +1159,9 @@ static void pblk_tear_down(struct pblk *pblk, bool graceful)
 		__pblk_pipeline_flush(pblk);
 	__pblk_pipeline_stop(pblk);
 	pblk_writer_stop(pblk);
+	spin_lock(&pblk->rwb.w_lock);
 	pblk_rb_sync_l2p(&pblk->rwb);
+	spin_unlock(&pblk->rwb.w_lock);
 	pblk_rl_free(&pblk->rl);
 
 	pr_debug("pblk: consistent tear down (graceful:%d)\n", graceful);
diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index 1b74ec51a4ad..91824cd3e8d8 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -266,21 +266,18 @@ static int pblk_rb_update_l2p(struct pblk_rb *rb, unsigned int nr_entries,
  * Update the l2p entry for all sectors stored on the write buffer. This means
  * that all future lookups to the l2p table will point to a device address, not
  * to the cacheline in the write buffer.
+ * Caller must ensure that rb->w_lock is taken.
  */
 void pblk_rb_sync_l2p(struct pblk_rb *rb)
 {
 	unsigned int sync;
 	unsigned int to_update;
 
-	spin_lock(&rb->w_lock);
-
 	/* Protect from reads and writes */
 	sync = smp_load_acquire(&rb->sync);
 
 	to_update = pblk_rb_ring_count(sync, rb->l2p_update, rb->nr_entries);
 	__pblk_rb_update_l2p(rb, to_update);
-
-	spin_unlock(&rb->w_lock);
 }
 
 /*
@@ -462,6 +459,8 @@ int pblk_rb_may_write_user(struct pblk_rb *rb, struct bio *bio,
 	spin_lock(&rb->w_lock);
 	io_ret = pblk_rl_user_may_insert(&pblk->rl, nr_entries);
 	if (io_ret) {
+		/* Sync RB & L2P in order to update rate limiter values */
+		pblk_rb_sync_l2p(rb);
 		spin_unlock(&rb->w_lock);
 		return io_ret;
 	}