Message ID | 20220622225102.2112026-7-joel@joelfernandes.org (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Implement call_rcu_lazy() and miscellaneous fixes | expand |
On Wed, Jun 22, 2022 at 10:50:59PM +0000, Joel Fernandes (Google) wrote: > We notice that rcu_barrier() can take a really long time. It appears > that this can happen when all CBs are lazy and the timer does not fire > yet. So after flushing, nothing wakes up GP thread. This patch forces > GP thread to wake when bypass flushing happens, this fixes the > rcu_barrier() delays with lazy CBs. I am wondering if there is a bug in non-rcu_barrier() lazy callback processing hiding here as well? Thanx, Paul > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > --- > kernel/rcu/tree_nocb.h | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h > index 2f5da12811a5..b481f1ea57c0 100644 > --- a/kernel/rcu/tree_nocb.h > +++ b/kernel/rcu/tree_nocb.h > @@ -325,6 +325,8 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl); > WRITE_ONCE(rdp->nocb_bypass_first, j); > rcu_nocb_bypass_unlock(rdp); > + > + wake_nocb_gp(rdp, true); > return true; > } > > -- > 2.37.0.rc0.104.g0611611a94-goog >
On Sat, Jun 25, 2022 at 09:06:22PM -0700, Paul E. McKenney wrote: > On Wed, Jun 22, 2022 at 10:50:59PM +0000, Joel Fernandes (Google) wrote: > > We notice that rcu_barrier() can take a really long time. It appears > > that this can happen when all CBs are lazy and the timer does not fire > > yet. So after flushing, nothing wakes up GP thread. This patch forces > > GP thread to wake when bypass flushing happens, this fixes the > > rcu_barrier() delays with lazy CBs. > > I am wondering if there is a bug in non-rcu_barrier() lazy callback > processing hiding here as well? I don't think so because in both nocb_try_bypass and nocb_gp_wait, we are not going to an indefinite sleep after the flush. However, with rcu_barrier() , there is nothing to keep the RCU GP thread awake. That's my theory at least. In practice, I have not been able to reproduce this issue with non-rcu_barrier(). With rcu_barrier() I happen to hit it thanks to the rcuscale changes I did. That's an interesting story! As I apply call_rcu_lazy() to the file table code, turns out that on boot, the initram unpacking code continously triggers call_rcu_lazy(). This happens apparently in a different thread than the one that rcuscale is running in. In rcuscale, I did rcu_barrier() at init time and this stalled for a long time to my surprise, and this patch fixed it. thanks, - Joel > > Thanx, Paul > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > --- > > kernel/rcu/tree_nocb.h | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h > > index 2f5da12811a5..b481f1ea57c0 100644 > > --- a/kernel/rcu/tree_nocb.h > > +++ b/kernel/rcu/tree_nocb.h > > @@ -325,6 +325,8 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > > rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl); > > WRITE_ONCE(rdp->nocb_bypass_first, j); > > rcu_nocb_bypass_unlock(rdp); > > + > > + wake_nocb_gp(rdp, true); > > return true; > > } > > > > -- > > 2.37.0.rc0.104.g0611611a94-goog > >
On Sun, Jun 26, 2022 at 01:45:32PM +0000, Joel Fernandes wrote: > On Sat, Jun 25, 2022 at 09:06:22PM -0700, Paul E. McKenney wrote: > > On Wed, Jun 22, 2022 at 10:50:59PM +0000, Joel Fernandes (Google) wrote: > > > We notice that rcu_barrier() can take a really long time. It appears > > > that this can happen when all CBs are lazy and the timer does not fire > > > yet. So after flushing, nothing wakes up GP thread. This patch forces > > > GP thread to wake when bypass flushing happens, this fixes the > > > rcu_barrier() delays with lazy CBs. > > > > I am wondering if there is a bug in non-rcu_barrier() lazy callback > > processing hiding here as well? > > I don't think so because in both nocb_try_bypass and nocb_gp_wait, we are not > going to an indefinite sleep after the flush. However, with rcu_barrier() , > there is nothing to keep the RCU GP thread awake. That's my theory at least. > In practice, I have not been able to reproduce this issue with > non-rcu_barrier(). > > With rcu_barrier() I happen to hit it thanks to the rcuscale changes I did. > That's an interesting story! As I apply call_rcu_lazy() to the file table > code, turns out that on boot, the initram unpacking code continously triggers > call_rcu_lazy(). This happens apparently in a different thread than the one > that rcuscale is running in. In rcuscale, I did rcu_barrier() at init time > and this stalled for a long time to my surprise, and this patch fixed it. Cool! Then should this wake_nocb_gp() instead go into the rcu_barrier() code path? As shown below, wouldn't we be doing some spurious wakeups? Thanx, Paul > thanks, > > - Joel > > > > > > Thanx, Paul > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > --- > > > kernel/rcu/tree_nocb.h | 2 ++ > > > 1 file changed, 2 insertions(+) > > > > > > diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h > > > index 2f5da12811a5..b481f1ea57c0 100644 > > > --- a/kernel/rcu/tree_nocb.h > > > +++ b/kernel/rcu/tree_nocb.h > > > @@ -325,6 +325,8 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > > > rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl); > > > WRITE_ONCE(rdp->nocb_bypass_first, j); > > > rcu_nocb_bypass_unlock(rdp); > > > + > > > + wake_nocb_gp(rdp, true); > > > return true; > > > } > > > > > > -- > > > 2.37.0.rc0.104.g0611611a94-goog > > >
On Sun, Jun 26, 2022 at 06:52:40AM -0700, Paul E. McKenney wrote: > On Sun, Jun 26, 2022 at 01:45:32PM +0000, Joel Fernandes wrote: > > On Sat, Jun 25, 2022 at 09:06:22PM -0700, Paul E. McKenney wrote: > > > On Wed, Jun 22, 2022 at 10:50:59PM +0000, Joel Fernandes (Google) wrote: > > > > We notice that rcu_barrier() can take a really long time. It appears > > > > that this can happen when all CBs are lazy and the timer does not fire > > > > yet. So after flushing, nothing wakes up GP thread. This patch forces > > > > GP thread to wake when bypass flushing happens, this fixes the > > > > rcu_barrier() delays with lazy CBs. > > > > > > I am wondering if there is a bug in non-rcu_barrier() lazy callback > > > processing hiding here as well? > > > > I don't think so because in both nocb_try_bypass and nocb_gp_wait, we are not > > going to an indefinite sleep after the flush. However, with rcu_barrier() , > > there is nothing to keep the RCU GP thread awake. That's my theory at least. > > In practice, I have not been able to reproduce this issue with > > non-rcu_barrier(). > > > > With rcu_barrier() I happen to hit it thanks to the rcuscale changes I did. > > That's an interesting story! As I apply call_rcu_lazy() to the file table > > code, turns out that on boot, the initram unpacking code continously triggers > > call_rcu_lazy(). This happens apparently in a different thread than the one > > that rcuscale is running in. In rcuscale, I did rcu_barrier() at init time > > and this stalled for a long time to my surprise, and this patch fixed it. > > Cool! > > Then should this wake_nocb_gp() instead go into the rcu_barrier() > code path? As shown below, wouldn't we be doing some spurious wakeups? You are right. In my testing, I don't see any issue with the extra wake up which is going to happen anyway and my thought was why not do it so that a future bypass flush from some other path forgets to call wake up. I'll refine it to be rcu-barrier-only then. thanks! - Joel
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h index 2f5da12811a5..b481f1ea57c0 100644 --- a/kernel/rcu/tree_nocb.h +++ b/kernel/rcu/tree_nocb.h @@ -325,6 +325,8 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl); WRITE_ONCE(rdp->nocb_bypass_first, j); rcu_nocb_bypass_unlock(rdp); + + wake_nocb_gp(rdp, true); return true; }
We notice that rcu_barrier() can take a really long time. It appears that this can happen when all CBs are lazy and the timer does not fire yet. So after flushing, nothing wakes up GP thread. This patch forces GP thread to wake when bypass flushing happens, this fixes the rcu_barrier() delays with lazy CBs. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> --- kernel/rcu/tree_nocb.h | 2 ++ 1 file changed, 2 insertions(+)