| Message ID | 20230904083602.106703-1-tim@klingt.org (mailing list archive) |
| --- | --- |
| State | Superseded |
| Series | [1/1] rpmsg: virtio_rpmsg_bus - prevent possible race condition |
Hello Tim, On 9/4/23 10:36, Tim Blechmann wrote: > when we cannot get a tx buffer (`get_a_tx_buf`) `rpmsg_upref_sleepers` > enables tx-complete interrupt. > however if the interrupt is executed after `get_a_tx_buf` and before > `rpmsg_upref_sleepers` we may mis the tx-complete interrupt and sleep > for the full 15 seconds. Is there any reason why your co-processor is unable to release the TX RPMSG buffers for 15 seconds? If not, you should first determine the reason why it is stalled. Regards, Arnaud > > in this case, so we re-try once before we really start to sleep > > Signed-off-by: Tim Blechmann <tim@klingt.org> > --- > drivers/rpmsg/virtio_rpmsg_bus.c | 24 +++++++++++++++--------- > 1 file changed, 15 insertions(+), 9 deletions(-) > > diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c b/drivers/rpmsg/virtio_rpmsg_bus.c > index 905ac7910c98..2a9d42225e60 100644 > --- a/drivers/rpmsg/virtio_rpmsg_bus.c > +++ b/drivers/rpmsg/virtio_rpmsg_bus.c > @@ -587,21 +587,27 @@ static int rpmsg_send_offchannel_raw(struct rpmsg_device *rpdev, > > /* no free buffer ? wait for one (but bail after 15 seconds) */ > while (!msg) { > /* enable "tx-complete" interrupts, if not already enabled */ > rpmsg_upref_sleepers(vrp); > > - /* > - * sleep until a free buffer is available or 15 secs elapse. > - * the timeout period is not configurable because there's > - * little point in asking drivers to specify that. > - * if later this happens to be required, it'd be easy to add. > - */ > - err = wait_event_interruptible_timeout(vrp->sendq, > - (msg = get_a_tx_buf(vrp)), > - msecs_to_jiffies(15000)); > + /* make sure to retry to grab tx buffer before we start waiting */ > + msg = get_a_tx_buf(vrp); > + if (msg) { > + err = 0; > + } else { > + /* > + * sleep until a free buffer is available or 15 secs elapse. > + * the timeout period is not configurable because there's > + * little point in asking drivers to specify that. > + * if later this happens to be required, it'd be easy to add. > + */ > + err = wait_event_interruptible_timeout(vrp->sendq, > + (msg = get_a_tx_buf(vrp)), > + msecs_to_jiffies(15000)); > + } > > /* disable "tx-complete" interrupts if we're the last sleeper */ > rpmsg_downref_sleepers(vrp); > > /* timeout ? */ > if (!err) {
On Mon, Sep 04, 2023 at 03:52:56PM +0200, Arnaud POULIQUEN wrote: > Hello Tim, > > On 9/4/23 10:36, Tim Blechmann wrote: > > when we cannot get a tx buffer (`get_a_tx_buf`) `rpmsg_upref_sleepers` > > enables tx-complete interrupt. > > however if the interrupt is executed after `get_a_tx_buf` and before > > `rpmsg_upref_sleepers` we may mis the tx-complete interrupt and sleep > > for the full 15 seconds. > > > Is there any reason why your co-processor is unable to release the TX RPMSG > buffers for 15 seconds? If not, you should first determine the reason why it is > stalled. Arnaud's concern is valid. If the remote processor can't consume a buffer within 15 seconds, something is probably wrong. That said, I believe your assessment of the situation is correct. *If* the TX callback is disabled and there is no buffer available, there is a window of opportunity between calls to get_a_tx_buf() and rpmsg_upref_sleepers() for an interrupt to arrive in function rpmsg_send_offchannel_raw(). From here three things need to happen: 1) You send another version of this patch with a changelog that uses proper English, i.e. capital letters where they are needed and no spelling mistakes. 2) Arnaud confirms our suspicions. 3) This patch gets applied when rc1 comes out so that it has 6 or 7 weeks to soak. No errors or locks are reported due to this patch during that time. > > Regards, > Arnaud > > > > in this case, so we re-try once before we really start to sleep > > > > Signed-off-by: Tim Blechmann <tim@klingt.org> > > --- > > drivers/rpmsg/virtio_rpmsg_bus.c | 24 +++++++++++++++--------- > > 1 file changed, 15 insertions(+), 9 deletions(-) > > > > diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c b/drivers/rpmsg/virtio_rpmsg_bus.c > > index 905ac7910c98..2a9d42225e60 100644 > > --- a/drivers/rpmsg/virtio_rpmsg_bus.c > > +++ b/drivers/rpmsg/virtio_rpmsg_bus.c > > @@ -587,21 +587,27 @@ static int rpmsg_send_offchannel_raw(struct rpmsg_device *rpdev, > > > > /* no free buffer ? wait for one (but bail after 15 seconds) */ > > while (!msg) { > > /* enable "tx-complete" interrupts, if not already enabled */ > > rpmsg_upref_sleepers(vrp); > > > > - /* > > - * sleep until a free buffer is available or 15 secs elapse. > > - * the timeout period is not configurable because there's > > - * little point in asking drivers to specify that. > > - * if later this happens to be required, it'd be easy to add. > > - */ > > - err = wait_event_interruptible_timeout(vrp->sendq, > > - (msg = get_a_tx_buf(vrp)), > > - msecs_to_jiffies(15000)); > > + /* make sure to retry to grab tx buffer before we start waiting */ > > + msg = get_a_tx_buf(vrp); > > + if (msg) { > > + err = 0; > > + } else { > > + /* > > + * sleep until a free buffer is available or 15 secs elapse. > > + * the timeout period is not configurable because there's > > + * little point in asking drivers to specify that. > > + * if later this happens to be required, it'd be easy to add. > > + */ > > + err = wait_event_interruptible_timeout(vrp->sendq, > > + (msg = get_a_tx_buf(vrp)), > > + msecs_to_jiffies(15000)); > > + } > > > > /* disable "tx-complete" interrupts if we're the last sleeper */ > > rpmsg_downref_sleepers(vrp); > > > > /* timeout ? */ > > if (!err) {
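To make the window Mathieu describes easier to follow, here is a simplified sketch of the suspected interleaving. The function names are the real ones from drivers/rpmsg/virtio_rpmsg_bus.c, but the ordering shown is the hypothesis under discussion in this thread, not confirmed behaviour.

```c
/*
 * Suspected interleaving (simplified):
 *
 *   sender, rpmsg_send_offchannel_raw()      tx-complete path
 *   -----------------------------------      ----------------------------------
 *   msg = get_a_tx_buf(vrp);
 *          // NULL, all buffers in flight
 *                                            remote releases a buffer
 *                                            vring_interrupt()
 *                                              -> rpmsg_xmit_done()
 *                                                 -> wake_up_interruptible(&vrp->sendq)
 *                                                    // nobody is sleeping yet,
 *                                                    // so the wake-up is lost
 *   rpmsg_upref_sleepers(vrp);
 *          // tx-complete callbacks enabled only now
 *   wait_event_interruptible_timeout(vrp->sendq, ...);
 *          // no further interrupt arrives,
 *          // so the full 15 s timeout elapses
 */
```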
>>> when we cannot get a tx buffer (`get_a_tx_buf`) `rpmsg_upref_sleepers` >>> enables tx-complete interrupt. >>> however if the interrupt is executed after `get_a_tx_buf` and before >>> `rpmsg_upref_sleepers` we may mis the tx-complete interrupt and sleep >>> for the full 15 seconds. >> >> >> Is there any reason why your co-processor is unable to release the TX RPMSG >> buffers for 15 seconds? If not, you should first determine the reason why it is >> stalled. > > Arnaud's concern is valid. If the remote processor can't consume a buffer > within 15 seconds, something is probably wrong. > > That said, I believe your assesment of the situation is correct. *If* the TX > callback is disabled and there is no buffer available, there is a window of > opportunity between calls to get_a_tx_buf() and rpmsg_upref_sleepers() for an > interrupt to arrive in function rpmsg_send_offchannel_raw(). the remote processor certainly releases the tx buffer and according to my tracing the `vring_interrupt` fires immediately before `rpmsg_send` enters the `rpmsg_upref_sleepers`. after applying this patch we haven't been able to reproduce the 15s timeout anymore, whereas before we could easily reproduce it with certain workloads. > 3) This patch gets applied when rc1 comes out so that it has 6 or 7 weeks to > soak. No error are locks are reported due to this patch during that time. mentioning locks: i was a bit uncertain about a good way to implement the retry, since both `rpmsg_upref_sleepers` and `get_a_tx_buf` both acquire the same mutex. i briefly considered to add `get_a_tx_buf` into `rpmsg_upref_sleepers` to avoid locking the same mutex multiple times, though it adds a bit of complexity to the implementation and harms readability a bit. are there any recommendations on this topic or are (likely non-contended) locks not expensive enough to justify the added complexity? thanks, tim > >> >> Regards, >> Arnaud >> >>> >>> in this case, so we re-try once before we really start to sleep >>> >>> Signed-off-by: Tim Blechmann <tim@klingt.org> >>> --- >>> drivers/rpmsg/virtio_rpmsg_bus.c | 24 +++++++++++++++--------- >>> 1 file changed, 15 insertions(+), 9 deletions(-) >>> >>> diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c b/drivers/rpmsg/virtio_rpmsg_bus.c >>> index 905ac7910c98..2a9d42225e60 100644 >>> --- a/drivers/rpmsg/virtio_rpmsg_bus.c >>> +++ b/drivers/rpmsg/virtio_rpmsg_bus.c >>> @@ -587,21 +587,27 @@ static int rpmsg_send_offchannel_raw(struct rpmsg_device *rpdev, >>> >>> /* no free buffer ? wait for one (but bail after 15 seconds) */ >>> while (!msg) { >>> /* enable "tx-complete" interrupts, if not already enabled */ >>> rpmsg_upref_sleepers(vrp); >>> >>> - /* >>> - * sleep until a free buffer is available or 15 secs elapse. >>> - * the timeout period is not configurable because there's >>> - * little point in asking drivers to specify that. >>> - * if later this happens to be required, it'd be easy to add. >>> - */ >>> - err = wait_event_interruptible_timeout(vrp->sendq, >>> - (msg = get_a_tx_buf(vrp)), >>> - msecs_to_jiffies(15000)); >>> + /* make sure to retry to grab tx buffer before we start waiting */ >>> + msg = get_a_tx_buf(vrp); >>> + if (msg) { >>> + err = 0; >>> + } else { >>> + /* >>> + * sleep until a free buffer is available or 15 secs elapse. >>> + * the timeout period is not configurable because there's >>> + * little point in asking drivers to specify that. >>> + * if later this happens to be required, it'd be easy to add. 
>>> + */ >>> + err = wait_event_interruptible_timeout(vrp->sendq, >>> + (msg = get_a_tx_buf(vrp)), >>> + msecs_to_jiffies(15000)); >>> + } >>> >>> /* disable "tx-complete" interrupts if we're the last sleeper */ >>> rpmsg_downref_sleepers(vrp); >>> >>> /* timeout ? */ >>> if (!err) { >
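To make the locking trade-off Tim raises concrete, the folded-in variant he describes could look roughly like the following. This is a sketch only: `__get_a_tx_buf_locked()` is a hypothetical helper (the body of `get_a_tx_buf()` with its own locking stripped out), while the field names match `struct virtproc_info` in virtio_rpmsg_bus.c.

```c
/*
 * Sketch: bump the sleeper count and retry the buffer grab under a single
 * tx_lock critical section instead of taking the mutex twice.
 */
static void *rpmsg_upref_sleepers_and_retry(struct virtproc_info *vrp)
{
	void *msg;

	mutex_lock(&vrp->tx_lock);

	/* enable "tx-complete" interrupts if we're the first sleeper */
	if (atomic_inc_return(&vrp->sleepers) == 1)
		virtqueue_enable_cb(vrp->svq);

	/* hypothetical helper: get_a_tx_buf() minus its own locking */
	msg = __get_a_tx_buf_locked(vrp);

	mutex_unlock(&vrp->tx_lock);

	return msg;
}
```

Whether the saved lock round-trip is worth the loss in readability is exactly the question Tim asks; Mathieu answers it below.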
On Mon, 4 Sept 2023 at 19:33, Tim Blechmann <tim@klingt.org> wrote: > > >>> when we cannot get a tx buffer (`get_a_tx_buf`) `rpmsg_upref_sleepers` > >>> enables tx-complete interrupt. > >>> however if the interrupt is executed after `get_a_tx_buf` and before > >>> `rpmsg_upref_sleepers` we may mis the tx-complete interrupt and sleep > >>> for the full 15 seconds. > >> > >> > >> Is there any reason why your co-processor is unable to release the TX RPMSG > >> buffers for 15 seconds? If not, you should first determine the reason why it is > >> stalled. > > > > Arnaud's concern is valid. If the remote processor can't consume a buffer > > within 15 seconds, something is probably wrong. > > > > That said, I believe your assesment of the situation is correct. *If* the TX > > callback is disabled and there is no buffer available, there is a window of > > opportunity between calls to get_a_tx_buf() and rpmsg_upref_sleepers() for an > > interrupt to arrive in function rpmsg_send_offchannel_raw(). > > the remote processor certainly releases the tx buffer and according to > my tracing the `vring_interrupt` fires immediately before `rpmsg_send` > enters the `rpmsg_upref_sleepers`. > > after applying this patch we haven't been able to reproduce the 15s > timeout anymore, whereas before we could easily reproduce it with > certain workloads. > > > 3) This patch gets applied when rc1 comes out so that it has 6 or 7 weeks to > > soak. No error are locks are reported due to this patch during that time. > > mentioning locks: i was a bit uncertain about a good way to implement > the retry, since both `rpmsg_upref_sleepers` and `get_a_tx_buf` both > acquire the same mutex. i briefly considered to add `get_a_tx_buf` into > `rpmsg_upref_sleepers` to avoid locking the same mutex multiple times, > though it adds a bit of complexity to the implementation and harms > readability a bit. The current implementation looks good to me. > are there any recommendations on this topic or are (likely > non-contended) locks not expensive enough to justify the added complexity? > Since we are changing how buffers are delivered to applications in a highly concurrent environment, it is possible for this patch to introduce unintended side effects, something I want to avoid as much as possible. I am not worried about contentions for the vrp->tx_lock. > thanks, > tim > > > > > >> > >> Regards, > >> Arnaud > >> > >>> > >>> in this case, so we re-try once before we really start to sleep > >>> > >>> Signed-off-by: Tim Blechmann <tim@klingt.org> > >>> --- > >>> drivers/rpmsg/virtio_rpmsg_bus.c | 24 +++++++++++++++--------- > >>> 1 file changed, 15 insertions(+), 9 deletions(-) > >>> > >>> diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c b/drivers/rpmsg/virtio_rpmsg_bus.c > >>> index 905ac7910c98..2a9d42225e60 100644 > >>> --- a/drivers/rpmsg/virtio_rpmsg_bus.c > >>> +++ b/drivers/rpmsg/virtio_rpmsg_bus.c > >>> @@ -587,21 +587,27 @@ static int rpmsg_send_offchannel_raw(struct rpmsg_device *rpdev, > >>> > >>> /* no free buffer ? wait for one (but bail after 15 seconds) */ > >>> while (!msg) { > >>> /* enable "tx-complete" interrupts, if not already enabled */ > >>> rpmsg_upref_sleepers(vrp); > >>> > >>> - /* > >>> - * sleep until a free buffer is available or 15 secs elapse. > >>> - * the timeout period is not configurable because there's > >>> - * little point in asking drivers to specify that. > >>> - * if later this happens to be required, it'd be easy to add. 
> >>> - */ > >>> - err = wait_event_interruptible_timeout(vrp->sendq, > >>> - (msg = get_a_tx_buf(vrp)), > >>> - msecs_to_jiffies(15000)); > >>> + /* make sure to retry to grab tx buffer before we start waiting */ > >>> + msg = get_a_tx_buf(vrp); > >>> + if (msg) { > >>> + err = 0; > >>> + } else { > >>> + /* > >>> + * sleep until a free buffer is available or 15 secs elapse. > >>> + * the timeout period is not configurable because there's > >>> + * little point in asking drivers to specify that. > >>> + * if later this happens to be required, it'd be easy to add. > >>> + */ > >>> + err = wait_event_interruptible_timeout(vrp->sendq, > >>> + (msg = get_a_tx_buf(vrp)), > >>> + msecs_to_jiffies(15000)); > >>> + } > >>> > >>> /* disable "tx-complete" interrupts if we're the last sleeper */ > >>> rpmsg_downref_sleepers(vrp); > >>> > >>> /* timeout ? */ > >>> if (!err) { > > >
On 9/5/23 03:33, Tim Blechmann wrote: >>>> when we cannot get a tx buffer (`get_a_tx_buf`) `rpmsg_upref_sleepers` >>>> enables tx-complete interrupt. >>>> however if the interrupt is executed after `get_a_tx_buf` and before >>>> `rpmsg_upref_sleepers` we may mis the tx-complete interrupt and sleep >>>> for the full 15 seconds. >>> >>> >>> Is there any reason why your co-processor is unable to release the TX RPMSG >>> buffers for 15 seconds? If not, you should first determine the reason why it is >>> stalled. >> >> Arnaud's concern is valid. If the remote processor can't consume a buffer >> within 15 seconds, something is probably wrong. >> >> That said, I believe your assesment of the situation is correct. *If* the TX >> callback is disabled and there is no buffer available, there is a window of >> opportunity between calls to get_a_tx_buf() and rpmsg_upref_sleepers() for an >> interrupt to arrive in function rpmsg_send_offchannel_raw(). > > the remote processor certainly releases the tx buffer and according to my > tracing the `vring_interrupt` fires immediately before `rpmsg_send` enters the > `rpmsg_upref_sleepers`. If I have understood your point correctly, the issue occurs in the following race condition: - all the Tx buffers are used - in rpmsg_send_offchannel_raw(), we try to get a buffer using get_a_tx_buf(vrp), which returns NULL - rpmsg_xmit_done is called as a Tx buffer is released by the remote processor and is now free - in rpmsg_send_offchannel_raw() rpmsg_upref_sleepers is called At this point nothing happens for 15 seconds, because rpmsg_xmit_done is never called again, which would wake up the waitqueue to call get_a_tx_buf(). Am I right? If yes, what is not clear to me is that wait_event_interruptible_timeout() seems to test the condition (so call get_a_tx_buf()) before going to sleep [1]. A free TX buffer should be found at this step. [1]https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L534 Regards, Arnaud > > after applying this patch we haven't been able to reproduce the 15s timeout > anymore, whereas before we could easily reproduce it with certain workloads. > >> 3) This patch gets applied when rc1 comes out so that it has 6 or 7 weeks to >> soak. No error are locks are reported due to this patch during that time. > > mentioning locks: i was a bit uncertain about a good way to implement the retry, > since both `rpmsg_upref_sleepers` and `get_a_tx_buf` both acquire the same > mutex. i briefly considered to add `get_a_tx_buf` into `rpmsg_upref_sleepers` to > avoid locking the same mutex multiple times, though it adds a bit of complexity > to the implementation and harms readability a bit. > are there any recommendations on this topic or are (likely non-contended) locks > not expensive enough to justify the added complexity? > > thanks, > tim > > >> >>> >>> Regards, >>> Arnaud >>> >>>> >>>> in this case, so we re-try once before we really start to sleep >>>> >>>> Signed-off-by: Tim Blechmann <tim@klingt.org> >>>> --- >>>> drivers/rpmsg/virtio_rpmsg_bus.c | 24 +++++++++++++++--------- >>>> 1 file changed, 15 insertions(+), 9 deletions(-) >>>> >>>> diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c >>>> b/drivers/rpmsg/virtio_rpmsg_bus.c >>>> index 905ac7910c98..2a9d42225e60 100644 >>>> --- a/drivers/rpmsg/virtio_rpmsg_bus.c >>>> +++ b/drivers/rpmsg/virtio_rpmsg_bus.c >>>> @@ -587,21 +587,27 @@ static int rpmsg_send_offchannel_raw(struct >>>> rpmsg_device *rpdev, >>>> /* no free buffer ?
wait for one (but bail after 15 seconds) */ >>>> while (!msg) { >>>> /* enable "tx-complete" interrupts, if not already enabled */ >>>> rpmsg_upref_sleepers(vrp); >>>> - /* >>>> - * sleep until a free buffer is available or 15 secs elapse. >>>> - * the timeout period is not configurable because there's >>>> - * little point in asking drivers to specify that. >>>> - * if later this happens to be required, it'd be easy to add. >>>> - */ >>>> - err = wait_event_interruptible_timeout(vrp->sendq, >>>> - (msg = get_a_tx_buf(vrp)), >>>> - msecs_to_jiffies(15000)); >>>> + /* make sure to retry to grab tx buffer before we start waiting */ >>>> + msg = get_a_tx_buf(vrp); >>>> + if (msg) { >>>> + err = 0; >>>> + } else { >>>> + /* >>>> + * sleep until a free buffer is available or 15 secs elapse. >>>> + * the timeout period is not configurable because there's >>>> + * little point in asking drivers to specify that. >>>> + * if later this happens to be required, it'd be easy to add. >>>> + */ >>>> + err = wait_event_interruptible_timeout(vrp->sendq, >>>> + (msg = get_a_tx_buf(vrp)), >>>> + msecs_to_jiffies(15000)); >>>> + } >>>> /* disable "tx-complete" interrupts if we're the last sleeper */ >>>> rpmsg_downref_sleepers(vrp); >>>> /* timeout ? */ >>>> if (!err) { >> >
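Arnaud's reference [1] points at the macro definition itself. Paraphrased and slightly simplified (see include/linux/wait.h for the real code), it has roughly the following shape, which is why the condition -- here `(msg = get_a_tx_buf(vrp))` -- is evaluated once before the task is ever put to sleep:

```c
/*
 * Simplified paraphrase of wait_event_interruptible_timeout() from
 * include/linux/wait.h -- not the verbatim kernel code.
 */
#define wait_event_interruptible_timeout(wq_head, condition, timeout)	\
({									\
	long __ret = timeout;						\
	/* check the condition first; only sleep if it is still false */ \
	if (!___wait_cond_timeout(condition))				\
		__ret = __wait_event_interruptible_timeout(wq_head,	\
						condition, timeout);	\
	__ret;								\
})
```

If that pre-sleep check already calls get_a_tx_buf(), a buffer freed in the window the patch targets should normally be picked up there, which is what makes the reported 15-second stall puzzling.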
>>>>> when we cannot get a tx buffer (`get_a_tx_buf`) `rpmsg_upref_sleepers` >>>>> enables tx-complete interrupt. >>>>> however if the interrupt is executed after `get_a_tx_buf` and before >>>>> `rpmsg_upref_sleepers` we may mis the tx-complete interrupt and sleep >>>>> for the full 15 seconds. >>>> >>>> >>>> Is there any reason why your co-processor is unable to release the TX RPMSG >>>> buffers for 15 seconds? If not, you should first determine the reason why it is >>>> stalled. >>> >>> Arnaud's concern is valid. If the remote processor can't consume a buffer >>> within 15 seconds, something is probably wrong. >>> >>> That said, I believe your assesment of the situation is correct. *If* the TX >>> callback is disabled and there is no buffer available, there is a window of >>> opportunity between calls to get_a_tx_buf() and rpmsg_upref_sleepers() for an >>> interrupt to arrive in function rpmsg_send_offchannel_raw(). >> >> the remote processor certainly releases the tx buffer and according to my >> tracing the `vring_interrupt` fires immediately before `rpmsg_send` enters the >> `rpmsg_upref_sleepers`. > > > If I well understood your point, the issue occur in following race condition > > - all the Tx buffers are used > - in rpmsg_send_offchannel_raw() function, we try to get a buffer using > get_a_tx_buf(vrp) that returns NULL > - rpmsg_xmit_done is called as a Tx buffer is released by the remote processor > and now free > - in rpmsg_send_offchannel_raw() rpmsg_upref_sleepers is called > > At this point you are nothing happen until 15 second because rpmsg_xmit_done is > never called again that would wake up the waitqueue to call get_a_tx_buf() yes, this is the behaviour that i've been seeing: virtio emits a few times: [ 404.XXXXXX] No more buffers in queue and then [ 404.588722] remoteproc remoteproc0: vq index 1 is interrupted [ 404.597249] virtqueue interrupt with no work for 6d53f13a > If yes what is not clear to me is that wait_event_interruptible_timeout() seems > to test the condition (so call get_a_tx_buf()) before entering in sleep[1]. A > free TX buffer should be found at this step. > > [1]https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L534 hmm, interesting. but that would imply a timing issue where calling get_a_tx_buf twice somehow makes a difference as opposed to calling it only once. would the "interrupt with no work" point to a different cause for the issue that we are seeing? > > Regards, > Arnaud > >> >> after applying this patch we haven't been able to reproduce the 15s timeout >> anymore, whereas before we could easily reproduce it with certain workloads. >> >>> 3) This patch gets applied when rc1 comes out so that it has 6 or 7 weeks to >>> soak. No error are locks are reported due to this patch during that time. >> >> mentioning locks: i was a bit uncertain about a good way to implement the retry, >> since both `rpmsg_upref_sleepers` and `get_a_tx_buf` both acquire the same >> mutex. i briefly considered to add `get_a_tx_buf` into `rpmsg_upref_sleepers` to >> avoid locking the same mutex multiple times, though it adds a bit of complexity >> to the implementation and harms readability a bit. >> are there any recommendations on this topic or are (likely non-contended) locks >> not expensive enough to justify the added complexity? 
>> >> thanks, >> tim >> >> >>> >>>> >>>> Regards, >>>> Arnaud >>>> >>>>> >>>>> in this case, so we re-try once before we really start to sleep >>>>> >>>>> Signed-off-by: Tim Blechmann <tim@klingt.org> >>>>> --- >>>>> drivers/rpmsg/virtio_rpmsg_bus.c | 24 +++++++++++++++--------- >>>>> 1 file changed, 15 insertions(+), 9 deletions(-) >>>>> >>>>> diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c >>>>> b/drivers/rpmsg/virtio_rpmsg_bus.c >>>>> index 905ac7910c98..2a9d42225e60 100644 >>>>> --- a/drivers/rpmsg/virtio_rpmsg_bus.c >>>>> +++ b/drivers/rpmsg/virtio_rpmsg_bus.c >>>>> @@ -587,21 +587,27 @@ static int rpmsg_send_offchannel_raw(struct >>>>> rpmsg_device *rpdev, >>>>> /* no free buffer ? wait for one (but bail after 15 seconds) */ >>>>> while (!msg) { >>>>> /* enable "tx-complete" interrupts, if not already enabled */ >>>>> rpmsg_upref_sleepers(vrp); >>>>> - /* >>>>> - * sleep until a free buffer is available or 15 secs elapse. >>>>> - * the timeout period is not configurable because there's >>>>> - * little point in asking drivers to specify that. >>>>> - * if later this happens to be required, it'd be easy to add. >>>>> - */ >>>>> - err = wait_event_interruptible_timeout(vrp->sendq, >>>>> - (msg = get_a_tx_buf(vrp)), >>>>> - msecs_to_jiffies(15000)); >>>>> + /* make sure to retry to grab tx buffer before we start waiting */ >>>>> + msg = get_a_tx_buf(vrp); >>>>> + if (msg) { >>>>> + err = 0; >>>>> + } else { >>>>> + /* >>>>> + * sleep until a free buffer is available or 15 secs elapse. >>>>> + * the timeout period is not configurable because there's >>>>> + * little point in asking drivers to specify that. >>>>> + * if later this happens to be required, it'd be easy to add. >>>>> + */ >>>>> + err = wait_event_interruptible_timeout(vrp->sendq, >>>>> + (msg = get_a_tx_buf(vrp)), >>>>> + msecs_to_jiffies(15000)); >>>>> + } >>>>> /* disable "tx-complete" interrupts if we're the last sleeper */ >>>>> rpmsg_downref_sleepers(vrp); >>>>> /* timeout ? */ >>>>> if (!err) { >>> >> >
On 9/9/23 08:28, Tim Blechmann wrote: >>>>>> when we cannot get a tx buffer (`get_a_tx_buf`) `rpmsg_upref_sleepers` >>>>>> enables tx-complete interrupt. >>>>>> however if the interrupt is executed after `get_a_tx_buf` and before >>>>>> `rpmsg_upref_sleepers` we may mis the tx-complete interrupt and sleep >>>>>> for the full 15 seconds. >>>>> >>>>> >>>>> Is there any reason why your co-processor is unable to release the TX RPMSG >>>>> buffers for 15 seconds? If not, you should first determine the reason why >>>>> it is >>>>> stalled. >>>> >>>> Arnaud's concern is valid. If the remote processor can't consume a buffer >>>> within 15 seconds, something is probably wrong. >>>> >>>> That said, I believe your assesment of the situation is correct. *If* the TX >>>> callback is disabled and there is no buffer available, there is a window of >>>> opportunity between calls to get_a_tx_buf() and rpmsg_upref_sleepers() for an >>>> interrupt to arrive in function rpmsg_send_offchannel_raw(). >>> >>> the remote processor certainly releases the tx buffer and according to my >>> tracing the `vring_interrupt` fires immediately before `rpmsg_send` enters the >>> `rpmsg_upref_sleepers`. >> >> >> If I well understood your point, the issue occur in following race condition >> >> - all the Tx buffers are used >> - in rpmsg_send_offchannel_raw() function, we try to get a buffer using >> get_a_tx_buf(vrp) that returns NULL >> - rpmsg_xmit_done is called as a Tx buffer is released by the remote processor >> and now free >> - in rpmsg_send_offchannel_raw() rpmsg_upref_sleepers is called >> >> At this point you are nothing happen until 15 second because rpmsg_xmit_done is >> never called again that would wake up the waitqueue to call get_a_tx_buf() > > yes, this is the behaviour that i've been seeing: > > virtio emits a few times: > [ 404.XXXXXX] No more buffers in queue > and then > [ 404.588722] remoteproc remoteproc0: vq index 1 is interrupted > [ 404.597249] virtqueue interrupt with no work for 6d53f13a Difficult to understand the scheduling with your trace. Could you enable ftrace to observe it (mailbox interrupts,rpmsg and virtio functions) > >> If yes what is not clear to me is that wait_event_interruptible_timeout() seems >> to test the condition (so call get_a_tx_buf()) before entering in sleep[1]. A >> free TX buffer should be found at this step. >> >> [1]https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L534 > > hmm, interesting. but that would imply a timing issue where calling get_a_tx_buf > twice somehow makes a difference as opposed to calling it only once. > > would the "interrupt with no work" point to a different cause for the issue that > we are seeing? Look like you have not released the buffer on remote side so no buffer in vring used list. Do you use the zero copy (rpmsg_hold_rx_buffer) on remote processor? If yes, please check that you have following fix: https://github.com/OpenAMP/open-amp/commit/2b6a38a9c6890869960ed41d6a1dc9930c079e0d > >> >> Regards, >> Arnaud >> >>> >>> after applying this patch we haven't been able to reproduce the 15s timeout >>> anymore, whereas before we could easily reproduce it with certain workloads. >>> >>>> 3) This patch gets applied when rc1 comes out so that it has 6 or 7 weeks to >>>> soak. No error are locks are reported due to this patch during that time. >>> >>> mentioning locks: i was a bit uncertain about a good way to implement the retry, >>> since both `rpmsg_upref_sleepers` and `get_a_tx_buf` both acquire the same >>> mutex. 
i briefly considered to add `get_a_tx_buf` into `rpmsg_upref_sleepers` to >>> avoid locking the same mutex multiple times, though it adds a bit of complexity >>> to the implementation and harms readability a bit. >>> are there any recommendations on this topic or are (likely non-contended) locks >>> not expensive enough to justify the added complexity? >>> >>> thanks, >>> tim >>> >>> >>>> >>>>> >>>>> Regards, >>>>> Arnaud >>>>> >>>>>> >>>>>> in this case, so we re-try once before we really start to sleep >>>>>> >>>>>> Signed-off-by: Tim Blechmann <tim@klingt.org> >>>>>> --- >>>>>> drivers/rpmsg/virtio_rpmsg_bus.c | 24 +++++++++++++++--------- >>>>>> 1 file changed, 15 insertions(+), 9 deletions(-) >>>>>> >>>>>> diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c >>>>>> b/drivers/rpmsg/virtio_rpmsg_bus.c >>>>>> index 905ac7910c98..2a9d42225e60 100644 >>>>>> --- a/drivers/rpmsg/virtio_rpmsg_bus.c >>>>>> +++ b/drivers/rpmsg/virtio_rpmsg_bus.c >>>>>> @@ -587,21 +587,27 @@ static int rpmsg_send_offchannel_raw(struct >>>>>> rpmsg_device *rpdev, >>>>>> /* no free buffer ? wait for one (but bail after 15 seconds) */ >>>>>> while (!msg) { >>>>>> /* enable "tx-complete" interrupts, if not already enabled */ >>>>>> rpmsg_upref_sleepers(vrp); >>>>>> - /* >>>>>> - * sleep until a free buffer is available or 15 secs elapse. >>>>>> - * the timeout period is not configurable because there's >>>>>> - * little point in asking drivers to specify that. >>>>>> - * if later this happens to be required, it'd be easy to add. >>>>>> - */ >>>>>> - err = wait_event_interruptible_timeout(vrp->sendq, >>>>>> - (msg = get_a_tx_buf(vrp)), >>>>>> - msecs_to_jiffies(15000)); >>>>>> + /* make sure to retry to grab tx buffer before we start waiting */ >>>>>> + msg = get_a_tx_buf(vrp); >>>>>> + if (msg) { >>>>>> + err = 0; >>>>>> + } else { >>>>>> + /* >>>>>> + * sleep until a free buffer is available or 15 secs elapse. >>>>>> + * the timeout period is not configurable because there's >>>>>> + * little point in asking drivers to specify that. >>>>>> + * if later this happens to be required, it'd be easy to add. >>>>>> + */ >>>>>> + err = wait_event_interruptible_timeout(vrp->sendq, >>>>>> + (msg = get_a_tx_buf(vrp)), >>>>>> + msecs_to_jiffies(15000)); >>>>>> + } >>>>>> /* disable "tx-complete" interrupts if we're the last sleeper */ >>>>>> rpmsg_downref_sleepers(vrp); >>>>>> /* timeout ? */ >>>>>> if (!err) { >>>> >>> >>
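For readers unfamiliar with the zero-copy path Arnaud refers to: on the remote (OpenAMP) side, a buffer held with rpmsg_hold_rx_buffer() keeps a vring descriptor out of circulation until it is released, which is why a missing release can starve the Linux TX path. A minimal sketch, where `queue_for_deferred_processing()` is hypothetical application code and not part of OpenAMP:

```c
#include <openamp/rpmsg.h>

/* hypothetical application hook, stand-in for real deferred processing */
void queue_for_deferred_processing(void *data, size_t len);

/*
 * Remote-side endpoint callback using zero-copy: the buffer is held past
 * the callback and must be released later, otherwise its descriptor is
 * never handed back and the Linux side eventually runs out of TX buffers.
 */
static int ept_cb(struct rpmsg_endpoint *ept, void *data, size_t len,
		  uint32_t src, void *priv)
{
	rpmsg_hold_rx_buffer(ept, data);           /* keep the buffer */
	queue_for_deferred_processing(data, len);
	return RPMSG_SUCCESS;
}

/* called by the application once it is done with the held buffer */
static void deferred_processing_done(struct rpmsg_endpoint *ept, void *data)
{
	rpmsg_release_rx_buffer(ept, data);        /* give the descriptor back */
}
```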
>> virtio emits a few times: >> [ 404.XXXXXX] No more buffers in queue >> and then >> [ 404.588722] remoteproc remoteproc0: vq index 1 is interrupted >> [ 404.597249] virtqueue interrupt with no work for 6d53f13a > > Difficult to understand the scheduling with your trace. Could you enable ftrace to > observe it (mailbox interrupts,rpmsg and virtio functions) does this help (it's a trace that a coworker did a few days ago)? ``` root:.../kernel/tracing# cat trace | head -50 # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 0) + 10.250 us | rpmsg_sg_init(); 0) + 42.375 us | } /* rpmsg_send_offchannel_raw */ 0) + 51.125 us | } /* virtio_rpmsg_send */ 0) + 59.625 us | } /* rpmsg_send */ ------------------------------------------ 0) stm32mp-414 => kworker-365 ------------------------------------------ 0) | rpmsg_recv_done() { 0) | rpmsg_recv_single() { 0) 5.542 us | rpmsg_sg_init(); 0) + 36.542 us | } 0) + 48.250 us | } ------------------------------------------ 0) kworker-365 => stm32mp-414 ------------------------------------------ 0) 7.000 us | rpmsg_get_buffer_size(); 0) | rpmsg_send() { 0) | virtio_rpmsg_send() { 0) | rpmsg_send_offchannel_raw() { 0) 5.625 us | rpmsg_sg_init(); 0) + 37.333 us | } 0) + 45.916 us | } 0) + 54.250 us | } ------------------------------------------ 0) stm32mp-414 => kworker-365 ------------------------------------------ 0) | rpmsg_recv_done() { 0) | rpmsg_recv_single() { 0) 6.041 us | rpmsg_sg_init(); 0) + 37.292 us | } 0) + 49.542 us | } ------------------------------------------ 0) kworker-365 => stm32mp-414 ------------------------------------------ 0) 6.500 us | rpmsg_get_buffer_size(); 0) | rpmsg_send() { 0) | virtio_rpmsg_send() { 0) | rpmsg_send_offchannel_raw() { 0) 5.417 us | rpmsg_sg_init(); 0) + 36.583 us | } 0) + 45.167 us | } 0) + 53.583 us | } ``` apparently this is where it gets stuck (not sure if it's enough context, though). >> >>> If yes what is not clear to me is that wait_event_interruptible_timeout() seems >>> to test the condition (so call get_a_tx_buf()) before entering in sleep[1]. A >>> free TX buffer should be found at this step. >>> >>> [1]https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L534 >> >> hmm, interesting. but that would imply a timing issue where calling get_a_tx_buf >> twice somehow makes a difference as opposed to calling it only once. >> >> would the "interrupt with no work" point to a different cause for the issue that >> we are seeing? > > Look like you have not released the buffer on remote side so no buffer in vring > used list. > > Do you use the zero copy (rpmsg_hold_rx_buffer) on remote processor? > If yes, please check that you have following fix: > https://github.com/OpenAMP/open-amp/commit/2b6a38a9c6890869960ed41d6a1dc9930c079e0d yes, we use rpmsg_hold_rx_buffer, however the fix from 2b6a38a9c6890869960ed41d6a1dc9930c079e0d apparently did not change the behaviour. many thanks, tim
On 9/13/23 09:07, Tim Blechmann wrote: >>> virtio emits a few times: >>> [ 404.XXXXXX] No more buffers in queue >>> and then >>> [ 404.588722] remoteproc remoteproc0: vq index 1 is interrupted >>> [ 404.597249] virtqueue interrupt with no work for 6d53f13a >> >> Difficult to understand the scheduling with your trace. Could you >> enable ftrace to >> observe it (mailbox interrupts,rpmsg and virtio functions) full trace, that my worker did (quite large, so i zipped it): https://fileservice.klingt.org/downloads/15526951694567393180-tracefull.zip the interesting part starts at 907.2244250 many thanks, tim
hello Tim On 9/13/23 03:11, Tim Blechmann wrote: > On 9/13/23 09:07, Tim Blechmann wrote: >>>> virtio emits a few times: >>>> [ 404.XXXXXX] No more buffers in queue >>>> and then >>>> [ 404.588722] remoteproc remoteproc0: vq index 1 is interrupted >>>> [ 404.597249] virtqueue interrupt with no work for 6d53f13a >>> >>> Difficult to understand the scheduling with your trace. Could you enable >>> ftrace to >>> observe it (mailbox interrupts,rpmsg and virtio functions) > > full trace, that my worker did (quite large, so i zipped it): > https://fileservice.klingt.org/downloads/15526951694567393180-tracefull.zip please find below an extract of your trace with my analysis: stm32mp1_bulk_p-390 [001] ..... 907.241226: rpmsg_send <-rpmsg_intercore_send_buffer.constprop.0 stm32mp1_bulk_p-390 [001] ..... 907.241228: virtio_rpmsg_send <-rpmsg_send stm32mp1_bulk_p-390 [001] ..... 907.241237: virtqueue_enable_cb <-rpmsg_send_offchannel_raw stm32mp1_bulk_p-390 [001] ..... 907.241239: virtqueue_enable_cb_prepare At this point seems that no more TX-buffer <-rpmsg_recv_single kworker/0:4-67 [000] ..... 907.242533: vring_interrupt <-rproc_vq_interrupt kworker/0:4-67 [000] ..... 907.242536: rpmsg_xmit_done Here you receive an interrupt indicating that TX buffer has been released by the remote. that's the expected behavior. <-rpmsg_send_offchannel_raw stm32mp1_bulk_p-390 [000] ..... 984.054941: rpmsg_send <-rpmsg_intercore_send_buffer.constprop.0 stm32mp1_bulk_p-390 [000] ..... 984.054943: virtio_rpmsg_send <-rpmsg_send stm32mp1_bulk_p-390 [000] ..... 984.054956: virtqueue_enable_cb <-rpmsg_send_offchannel_raw stm32mp1_bulk_p-390 [000] ..... 984.054958: virtqueue_enable_cb_prepare <-virtqueue_enable_cb stm32mp1_bulk_p-390 [000] ..... 999.398667: virtqueue_disable_cb <-rpmsg_send_offchannel_raw stm32mp1_bulk_p-390 [000] ..... 999.414840: rpmsg_send <-rpmsg_intercore_send_buffer.constprop.0 stm32mp1_bulk_p-390 [000] ..... 999.414843: virtio_rpmsg_send <-rpmsg_send stm32mp1_bulk_p-390 [000] ..... 999.414855: virtqueue_enable_cb <-rpmsg_send_offchannel_raw stm32mp1_bulk_p-390 [000] ..... 999.414857: virtqueue_enable_cb_prepare Here you have again no more TX buffer. From this point there is no more activity neither in TX nor in RX until the timeout of 15 seconds. If you have a look to rproc_vq_interrupt the last one occurs at 907.242533 As you have no more virtqueue interrupts call for both directions, the issue is probably either in the Cortex-M firmware, which seems to be stalled, or due to a disable of the IRQs in Linux. <-virtqueue_enable_cb stm32mp1_bulk_p-390 [000] ..... 1014.758678: virtqueue_disable_cb <-rpmsg_send_offchannel_raw stm32mp1_bulk_p-390 [000] ..... 1014.774802: rpmsg_send <-rpmsg_intercore_send_buffer.constprop.0 stm32mp1_bulk_p-390 [000] ..... 1014.774804: virtio_rpmsg_send <-rpmsg_send stm32mp1_bulk_p-390 [000] ..... 1014.774815: virtqueue_enable_cb Regards, Arnaud > > the interesting part starts at 907.2244250 > > many thanks, > tim
many thanks for your analysis, very interesting. > please find below an extract of your trace with my analysis: > > > stm32mp1_bulk_p-390 [001] ..... 907.241226: rpmsg_send > <-rpmsg_intercore_send_buffer.constprop.0 > stm32mp1_bulk_p-390 [001] ..... 907.241228: virtio_rpmsg_send <-rpmsg_send > stm32mp1_bulk_p-390 [001] ..... 907.241237: virtqueue_enable_cb > <-rpmsg_send_offchannel_raw > stm32mp1_bulk_p-390 [001] ..... 907.241239: virtqueue_enable_cb_prepare > > At this point seems that no more TX-buffer > > <-rpmsg_recv_single > kworker/0:4-67 [000] ..... 907.242533: vring_interrupt > <-rproc_vq_interrupt > kworker/0:4-67 [000] ..... 907.242536: rpmsg_xmit_done > > Here you receive an interrupt indicating that TX buffer has been released by > the remote. that's the expected behavior. > > > <-rpmsg_send_offchannel_raw > stm32mp1_bulk_p-390 [000] ..... 984.054941: rpmsg_send > <-rpmsg_intercore_send_buffer.constprop.0 > stm32mp1_bulk_p-390 [000] ..... 984.054943: virtio_rpmsg_send <-rpmsg_send > stm32mp1_bulk_p-390 [000] ..... 984.054956: virtqueue_enable_cb > <-rpmsg_send_offchannel_raw > stm32mp1_bulk_p-390 [000] ..... 984.054958: virtqueue_enable_cb_prepare > <-virtqueue_enable_cb > stm32mp1_bulk_p-390 [000] ..... 999.398667: virtqueue_disable_cb > <-rpmsg_send_offchannel_raw > stm32mp1_bulk_p-390 [000] ..... 999.414840: rpmsg_send > <-rpmsg_intercore_send_buffer.constprop.0 > stm32mp1_bulk_p-390 [000] ..... 999.414843: virtio_rpmsg_send <-rpmsg_send > stm32mp1_bulk_p-390 [000] ..... 999.414855: virtqueue_enable_cb > <-rpmsg_send_offchannel_raw > stm32mp1_bulk_p-390 [000] ..... 999.414857: virtqueue_enable_cb_prepare > > Here you have again no more TX buffer. From this point there is no more activity > neither in TX nor in RX until the timeout of 15 seconds. > If you have a look to rproc_vq_interrupt the last one occurs at 907.242533 > > > As you have no more virtqueue interrupts call for both directions, the issue is > probably either in the Cortex-M firmware, which seems to be stalled, or due to a > disable of the IRQs in Linux. afaict we can rule out a complete stall of the cortex-m firmware: if we change the rpmsg_send to a rpmsg_trysend/msleep loop, the trysend will succeed to get a buffer after a few iterations. > or due to a disable of the IRQs in Linux. do you have some recommendations how we could trace this? many thanks, tim > <-virtqueue_enable_cb > stm32mp1_bulk_p-390 [000] ..... 1014.758678: virtqueue_disable_cb > <-rpmsg_send_offchannel_raw > stm32mp1_bulk_p-390 [000] ..... 1014.774802: rpmsg_send > <-rpmsg_intercore_send_buffer.constprop.0 > stm32mp1_bulk_p-390 [000] ..... 1014.774804: virtio_rpmsg_send <-rpmsg_send > stm32mp1_bulk_p-390 [000] ..... 1014.774815: virtqueue_enable_cb
On 9/13/23 10:47, Tim Blechmann wrote: > many thanks for your analysis, very interesting. > >> please find below an extract of your trace with my analysis: >> >> >> stm32mp1_bulk_p-390 [001] ..... 907.241226: rpmsg_send >> <-rpmsg_intercore_send_buffer.constprop.0 >> stm32mp1_bulk_p-390 [001] ..... 907.241228: virtio_rpmsg_send >> <-rpmsg_send >> stm32mp1_bulk_p-390 [001] ..... 907.241237: virtqueue_enable_cb >> <-rpmsg_send_offchannel_raw >> stm32mp1_bulk_p-390 [001] ..... 907.241239: virtqueue_enable_cb_prepare >> >> At this point seems that no more TX-buffer >> >> <-rpmsg_recv_single >> kworker/0:4-67 [000] ..... 907.242533: vring_interrupt >> <-rproc_vq_interrupt >> kworker/0:4-67 [000] ..... 907.242536: rpmsg_xmit_done >> >> Here you receive an interrupt indicating that TX buffer has been released by >> the remote. that's the expected behavior. >> >> >> <-rpmsg_send_offchannel_raw >> stm32mp1_bulk_p-390 [000] ..... 984.054941: rpmsg_send >> <-rpmsg_intercore_send_buffer.constprop.0 >> stm32mp1_bulk_p-390 [000] ..... 984.054943: virtio_rpmsg_send >> <-rpmsg_send >> stm32mp1_bulk_p-390 [000] ..... 984.054956: virtqueue_enable_cb >> <-rpmsg_send_offchannel_raw >> stm32mp1_bulk_p-390 [000] ..... 984.054958: virtqueue_enable_cb_prepare >> <-virtqueue_enable_cb >> stm32mp1_bulk_p-390 [000] ..... 999.398667: virtqueue_disable_cb >> <-rpmsg_send_offchannel_raw >> stm32mp1_bulk_p-390 [000] ..... 999.414840: rpmsg_send >> <-rpmsg_intercore_send_buffer.constprop.0 >> stm32mp1_bulk_p-390 [000] ..... 999.414843: virtio_rpmsg_send >> <-rpmsg_send >> stm32mp1_bulk_p-390 [000] ..... 999.414855: virtqueue_enable_cb >> <-rpmsg_send_offchannel_raw >> stm32mp1_bulk_p-390 [000] ..... 999.414857: virtqueue_enable_cb_prepare >> >> Here you have again no more TX buffer. From this point there is no more activity >> neither in TX nor in RX until the timeout of 15 seconds. >> If you have a look to rproc_vq_interrupt the last one occurs at 907.242533 >> >> >> As you have no more virtqueue interrupts call for both directions, the issue is >> probably either in the Cortex-M firmware, which seems to be stalled, or due to a >> disable of the IRQs in Linux. > > afaict we can rule out a complete stall of the cortex-m firmware: if we change > the rpmsg_send to a rpmsg_trysend/msleep loop, the trysend will succeed to get a > buffer after a few iterations. > >> or due to a disable of the IRQs in Linux. > > do you have some recommendations how we could trace this? First, check if the Cortex-M sends the IPCC mailbox notification during the 15-second period. Then, you can verify the IPCC registers to check if some flags are pending. On the Linux side, there is probably literature available to explain how to trace it. > > many thanks, > tim > >> <-virtqueue_enable_cb >> stm32mp1_bulk_p-390 [000] ..... 1014.758678: virtqueue_disable_cb >> <-rpmsg_send_offchannel_raw >> stm32mp1_bulk_p-390 [000] ..... 1014.774802: rpmsg_send >> <-rpmsg_intercore_send_buffer.constprop.0 >> stm32mp1_bulk_p-390 [000] ..... 1014.774804: virtio_rpmsg_send >> <-rpmsg_send >> stm32mp1_bulk_p-390 [000] ..... 1014.774815: virtqueue_enable_cb > >
On 9/4/23 22:43, Mathieu Poirier wrote: > On Mon, Sep 04, 2023 at 03:52:56PM +0200, Arnaud POULIQUEN wrote: >> Hello Tim, >> >> On 9/4/23 10:36, Tim Blechmann wrote: >>> when we cannot get a tx buffer (`get_a_tx_buf`) `rpmsg_upref_sleepers` >>> enables tx-complete interrupt. >>> however if the interrupt is executed after `get_a_tx_buf` and before >>> `rpmsg_upref_sleepers` we may mis the tx-complete interrupt and sleep >>> for the full 15 seconds. >> >> >> Is there any reason why your co-processor is unable to release the TX RPMSG >> buffers for 15 seconds? If not, you should first determine the reason why it is >> stalled. > > Arnaud's concern is valid. If the remote processor can't consume a buffer > within 15 seconds, something is probably wrong. > > That said, I believe your assesment of the situation is correct. *If* the TX > callback is disabled and there is no buffer available, there is a window of > opportunity between calls to get_a_tx_buf() and rpmsg_upref_sleepers() for an > interrupt to arrive in function rpmsg_send_offchannel_raw(). > > From here three things need to happen: > > 1) You send another version of this patch with a changelong that uses proper > english, i.e capital letters when they are needed and no spelling mistake. > > 2) Arnaud confirms our suspicions. Seems to me that this patch is useless - wait_event_interruptible_timeout() function already seems to test the condition (so call get_a_tx_buf()) before entering in sleep[1]. - ftraces show that vq interrupt is not called during the 15-second period. So it is a normal behavior that the vrp->sendq is never waked-up. Tim needs to analyze the reason why no mailbox interrupt occurs. [1]https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L534 > > 3) This patch gets applied when rc1 comes out so that it has 6 or 7 weeks to > soak. No error are locks are reported due to this patch during that time. > >> >> Regards, >> Arnaud >> >>> >>> in this case, so we re-try once before we really start to sleep >>> >>> Signed-off-by: Tim Blechmann <tim@klingt.org> >>> --- >>> drivers/rpmsg/virtio_rpmsg_bus.c | 24 +++++++++++++++--------- >>> 1 file changed, 15 insertions(+), 9 deletions(-) >>> >>> diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c b/drivers/rpmsg/virtio_rpmsg_bus.c >>> index 905ac7910c98..2a9d42225e60 100644 >>> --- a/drivers/rpmsg/virtio_rpmsg_bus.c >>> +++ b/drivers/rpmsg/virtio_rpmsg_bus.c >>> @@ -587,21 +587,27 @@ static int rpmsg_send_offchannel_raw(struct rpmsg_device *rpdev, >>> >>> /* no free buffer ? wait for one (but bail after 15 seconds) */ >>> while (!msg) { >>> /* enable "tx-complete" interrupts, if not already enabled */ >>> rpmsg_upref_sleepers(vrp); >>> >>> - /* >>> - * sleep until a free buffer is available or 15 secs elapse. >>> - * the timeout period is not configurable because there's >>> - * little point in asking drivers to specify that. >>> - * if later this happens to be required, it'd be easy to add. >>> - */ >>> - err = wait_event_interruptible_timeout(vrp->sendq, >>> - (msg = get_a_tx_buf(vrp)), >>> - msecs_to_jiffies(15000)); >>> + /* make sure to retry to grab tx buffer before we start waiting */ >>> + msg = get_a_tx_buf(vrp); >>> + if (msg) { >>> + err = 0; >>> + } else { >>> + /* >>> + * sleep until a free buffer is available or 15 secs elapse. >>> + * the timeout period is not configurable because there's >>> + * little point in asking drivers to specify that. >>> + * if later this happens to be required, it'd be easy to add. 
>>> + */ >>> + err = wait_event_interruptible_timeout(vrp->sendq, >>> + (msg = get_a_tx_buf(vrp)), >>> + msecs_to_jiffies(15000)); >>> + } >>> >>> /* disable "tx-complete" interrupts if we're the last sleeper */ >>> rpmsg_downref_sleepers(vrp); >>> >>> /* timeout ? */ >>> if (!err) {
On Wed, Sep 13, 2023 at 12:10:39PM +0200, Arnaud POULIQUEN wrote: > > > On 9/4/23 22:43, Mathieu Poirier wrote: > > On Mon, Sep 04, 2023 at 03:52:56PM +0200, Arnaud POULIQUEN wrote: > >> Hello Tim, > >> > >> On 9/4/23 10:36, Tim Blechmann wrote: > >>> when we cannot get a tx buffer (`get_a_tx_buf`) `rpmsg_upref_sleepers` > >>> enables tx-complete interrupt. > >>> however if the interrupt is executed after `get_a_tx_buf` and before > >>> `rpmsg_upref_sleepers` we may mis the tx-complete interrupt and sleep > >>> for the full 15 seconds. > >> > >> > >> Is there any reason why your co-processor is unable to release the TX RPMSG > >> buffers for 15 seconds? If not, you should first determine the reason why it is > >> stalled. > > > > Arnaud's concern is valid. If the remote processor can't consume a buffer > > within 15 seconds, something is probably wrong. > > > > That said, I believe your assesment of the situation is correct. *If* the TX > > callback is disabled and there is no buffer available, there is a window of > > opportunity between calls to get_a_tx_buf() and rpmsg_upref_sleepers() for an > > interrupt to arrive in function rpmsg_send_offchannel_raw(). > > > > From here three things need to happen: > > > > 1) You send another version of this patch with a changelong that uses proper > > english, i.e capital letters when they are needed and no spelling mistake. > > > > 2) Arnaud confirms our suspicions. > > Seems to me that this patch is useless > - wait_event_interruptible_timeout() function already seems > to test the condition (so call get_a_tx_buf()) before entering in sleep[1]. > - ftraces show that vq interrupt is not called during the 15-second period. > So it is a normal behavior that the vrp->sendq is never waked-up. > Thanks for looking further into this issue. I will wait for you guys to get to the root of the problems. > Tim needs to analyze the reason why no mailbox interrupt occurs. > > [1]https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L534 > > > > > > 3) This patch gets applied when rc1 comes out so that it has 6 or 7 weeks to > > soak. No error are locks are reported due to this patch during that time. > > > >> > >> Regards, > >> Arnaud > >> > >>> > >>> in this case, so we re-try once before we really start to sleep > >>> > >>> Signed-off-by: Tim Blechmann <tim@klingt.org> > >>> --- > >>> drivers/rpmsg/virtio_rpmsg_bus.c | 24 +++++++++++++++--------- > >>> 1 file changed, 15 insertions(+), 9 deletions(-) > >>> > >>> diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c b/drivers/rpmsg/virtio_rpmsg_bus.c > >>> index 905ac7910c98..2a9d42225e60 100644 > >>> --- a/drivers/rpmsg/virtio_rpmsg_bus.c > >>> +++ b/drivers/rpmsg/virtio_rpmsg_bus.c > >>> @@ -587,21 +587,27 @@ static int rpmsg_send_offchannel_raw(struct rpmsg_device *rpdev, > >>> > >>> /* no free buffer ? wait for one (but bail after 15 seconds) */ > >>> while (!msg) { > >>> /* enable "tx-complete" interrupts, if not already enabled */ > >>> rpmsg_upref_sleepers(vrp); > >>> > >>> - /* > >>> - * sleep until a free buffer is available or 15 secs elapse. > >>> - * the timeout period is not configurable because there's > >>> - * little point in asking drivers to specify that. > >>> - * if later this happens to be required, it'd be easy to add. 
> >>> - */ > >>> - err = wait_event_interruptible_timeout(vrp->sendq, > >>> - (msg = get_a_tx_buf(vrp)), > >>> - msecs_to_jiffies(15000)); > >>> + /* make sure to retry to grab tx buffer before we start waiting */ > >>> + msg = get_a_tx_buf(vrp); > >>> + if (msg) { > >>> + err = 0; > >>> + } else { > >>> + /* > >>> + * sleep until a free buffer is available or 15 secs elapse. > >>> + * the timeout period is not configurable because there's > >>> + * little point in asking drivers to specify that. > >>> + * if later this happens to be required, it'd be easy to add. > >>> + */ > >>> + err = wait_event_interruptible_timeout(vrp->sendq, > >>> + (msg = get_a_tx_buf(vrp)), > >>> + msecs_to_jiffies(15000)); > >>> + } > >>> > >>> /* disable "tx-complete" interrupts if we're the last sleeper */ > >>> rpmsg_downref_sleepers(vrp); > >>> > >>> /* timeout ? */ > >>> if (!err) {
On 9/13/23 12:57, Agostiño Carballeira wrote: > Hello! > > I am the main developer of the CM4 firmware for this project. > First of all thanks for taking the time to analyse the trace. > Further analysis on my side has shown that the CM4 is not completely stalled in > this situation, but it is stuck on a busywait loop within the MAILBOX_Notify > function, awaiting a window to send a "buffer used" notification to the CA7. So > it seems that the mailbox is locked both ways and neither side is giving way to > unclog the traffic jam. The Cortex-M4 is probably blocked, waiting for Linux mailbox to acknowledge a mailbox notification[1]. Are you 100% sure that you have never exited this loop during the issue? Could you provide the value of the 'id' variable and the call stack? That would mean that the mailbox has not been acknowledged by Linux [2] or [3]. I don't understand how it could be possible... [1] https://github.com/STMicroelectronics/STM32CubeMP1/blob/master/Middlewares/Third_Party/OpenAMP/mw_if/platform_if/mbox_ipcc_template.c#L182 [2] https://elixir.bootlin.com/linux/latest/source/drivers/mailbox/stm32-ipcc.c#L105 [3] https://elixir.bootlin.com/linux/latest/source/drivers/mailbox/stm32-ipcc.c#L134 > Interestingly, when we replace rpmsg_send by rpmsg_trysend + busywait loop, this > mutual stall doesn't happen at all. What do you mean by busywait? Do you add a delay between 2 rpmsg_trysend() calls? If yes, you probably add delay that avoid the issue. That said rpmsg_trysend is recommended for baremetal to avoid to block the system. > Does that give you any clues? > > Thanks > Agos > > On Wed, Sep 13, 2023 at 10:47 AM Tim Blechmann <tim@klingt.org > <mailto:tim@klingt.org>> wrote: > > many thanks for your analysis, very interesting. > > > please find below an extract of your trace with my analysis: > > > > > > stm32mp1_bulk_p-390 [001] ..... 907.241226: rpmsg_send > > <-rpmsg_intercore_send_buffer.constprop.0 > > stm32mp1_bulk_p-390 [001] ..... 907.241228: virtio_rpmsg_send > <-rpmsg_send > > stm32mp1_bulk_p-390 [001] ..... 907.241237: virtqueue_enable_cb > > <-rpmsg_send_offchannel_raw > > stm32mp1_bulk_p-390 [001] ..... 907.241239: > virtqueue_enable_cb_prepare > > > > At this point seems that no more TX-buffer > > > > <-rpmsg_recv_single > > kworker/0:4-67 [000] ..... 907.242533: vring_interrupt > > <-rproc_vq_interrupt > > kworker/0:4-67 [000] ..... 907.242536: rpmsg_xmit_done > > > > Here you receive an interrupt indicating that TX buffer has been released by > > the remote. that's the expected behavior. > > > > > > <-rpmsg_send_offchannel_raw > > stm32mp1_bulk_p-390 [000] ..... 984.054941: rpmsg_send > > <-rpmsg_intercore_send_buffer.constprop.0 > > stm32mp1_bulk_p-390 [000] ..... 984.054943: virtio_rpmsg_send > <-rpmsg_send > > stm32mp1_bulk_p-390 [000] ..... 984.054956: virtqueue_enable_cb > > <-rpmsg_send_offchannel_raw > > stm32mp1_bulk_p-390 [000] ..... 984.054958: > virtqueue_enable_cb_prepare > > <-virtqueue_enable_cb > > stm32mp1_bulk_p-390 [000] ..... 999.398667: virtqueue_disable_cb > > <-rpmsg_send_offchannel_raw > > stm32mp1_bulk_p-390 [000] ..... 999.414840: rpmsg_send > > <-rpmsg_intercore_send_buffer.constprop.0 > > stm32mp1_bulk_p-390 [000] ..... 999.414843: virtio_rpmsg_send > <-rpmsg_send > > stm32mp1_bulk_p-390 [000] ..... 999.414855: virtqueue_enable_cb > > <-rpmsg_send_offchannel_raw > > stm32mp1_bulk_p-390 [000] ..... 999.414857: > virtqueue_enable_cb_prepare > > > > Here you have again no more TX buffer. 
From this point there is no more > activity > > neither in TX nor in RX until the timeout of 15 seconds. > > If you have a look to rproc_vq_interrupt the last one occurs at 907.242533 > > > > > > As you have no more virtqueue interrupts call for both directions, the > issue is > > probably either in the Cortex-M firmware, which seems to be stalled, or > due to a > > disable of the IRQs in Linux. > > afaict we can rule out a complete stall of the cortex-m firmware: if we > change the rpmsg_send to a rpmsg_trysend/msleep loop, the trysend will > succeed to get a buffer after a few iterations. > > > or due to a disable of the IRQs in Linux. > > do you have some recommendations how we could trace this? > > many thanks, > tim > > > <-virtqueue_enable_cb > > stm32mp1_bulk_p-390 [000] ..... 1014.758678: virtqueue_disable_cb > > <-rpmsg_send_offchannel_raw > > stm32mp1_bulk_p-390 [000] ..... 1014.774802: rpmsg_send > > <-rpmsg_intercore_send_buffer.constprop.0 > > stm32mp1_bulk_p-390 [000] ..... 1014.774804: virtio_rpmsg_send > <-rpmsg_send > > stm32mp1_bulk_p-390 [000] ..... 1014.774815: virtqueue_enable_cb > > > > > -- > > Agostiño Carballeira > > Senior Embedded Software Engineer > > agostino.carballeira@native-instruments.com > <mailto:agostino.carballeira@native-instruments.com> > > > Native Instruments <https://www.native-instruments.com>– now including iZotope, > Plugin Alliance, and Brainworx
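For context, the busy-wait Agostiño describes has roughly the following shape in the mbox_ipcc template Arnaud links to; the two `ipcc_*` helpers below are invented placeholders standing in for the real IPCC register accesses in the STM32Cube code, not the actual HAL API:

```c
#include <stdint.h>

/* invented placeholders for the real IPCC register accesses */
int  ipcc_tx_channel_occupied(uint32_t id);
void ipcc_trigger_tx_interrupt(uint32_t id);

int MAILBOX_Notify(void *priv, uint32_t id)
{
	/*
	 * Spin until the previous notification on this IPCC channel has been
	 * acknowledged by the Linux side -- this is where the CM4 was
	 * observed to be stuck while Linux was sleeping in rpmsg_send().
	 */
	while (ipcc_tx_channel_occupied(id))
		;

	ipcc_trigger_tx_interrupt(id);	/* raise the "buffer used" notification */

	return 0;
}
```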
>> Interestingly, when we replace rpmsg_send by rpmsg_trysend + busywait loop, this >> mutual stall doesn't happen at all. > > What do you mean by busywait? Do you add a delay between 2 rpmsg_trysend() > calls? If yes, you probably add delay that avoid the issue. > > That said rpmsg_trysend is recommended for baremetal to avoid to block the system. to avoid misunderstanding here: replacing `rpmsg_send` by a `rpmsg_trysend` / `usleep` loop on the *linux* side works around the issue (i.e. in our own small module and in rpmsg_char). thanks, tim
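A minimal sketch of that Linux-side workaround, assuming a kernel-module context; the retry count and delay are arbitrary placeholders, not values taken from Tim's module:

```c
#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/rpmsg.h>

/*
 * Workaround sketch: poll rpmsg_trysend() instead of blocking in
 * rpmsg_send(), sleeping briefly between attempts so the remote side
 * gets a chance to return a TX buffer.
 */
static int rpmsg_send_polled(struct rpmsg_endpoint *ept, void *data, int len)
{
	int i, ret;

	for (i = 0; i < 1000; i++) {
		ret = rpmsg_trysend(ept, data, len);
		if (ret != -ENOMEM)
			return ret;	/* sent, or a real error */
		usleep_range(1000, 2000);
	}

	return -ETIMEDOUT;
}
```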
diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c b/drivers/rpmsg/virtio_rpmsg_bus.c
index 905ac7910c98..2a9d42225e60 100644
--- a/drivers/rpmsg/virtio_rpmsg_bus.c
+++ b/drivers/rpmsg/virtio_rpmsg_bus.c
@@ -587,21 +587,27 @@ static int rpmsg_send_offchannel_raw(struct rpmsg_device *rpdev,
 
 	/* no free buffer ? wait for one (but bail after 15 seconds) */
 	while (!msg) {
 		/* enable "tx-complete" interrupts, if not already enabled */
 		rpmsg_upref_sleepers(vrp);
 
-		/*
-		 * sleep until a free buffer is available or 15 secs elapse.
-		 * the timeout period is not configurable because there's
-		 * little point in asking drivers to specify that.
-		 * if later this happens to be required, it'd be easy to add.
-		 */
-		err = wait_event_interruptible_timeout(vrp->sendq,
-					(msg = get_a_tx_buf(vrp)),
-					msecs_to_jiffies(15000));
+		/* make sure to retry to grab tx buffer before we start waiting */
+		msg = get_a_tx_buf(vrp);
+		if (msg) {
+			err = 0;
+		} else {
+			/*
+			 * sleep until a free buffer is available or 15 secs elapse.
+			 * the timeout period is not configurable because there's
+			 * little point in asking drivers to specify that.
+			 * if later this happens to be required, it'd be easy to add.
+			 */
+			err = wait_event_interruptible_timeout(vrp->sendq,
+						(msg = get_a_tx_buf(vrp)),
+						msecs_to_jiffies(15000));
+		}
 
 		/* disable "tx-complete" interrupts if we're the last sleeper */
 		rpmsg_downref_sleepers(vrp);
 
 		/* timeout ? */
 		if (!err) {
When we cannot get a TX buffer (`get_a_tx_buf`), `rpmsg_upref_sleepers` enables the tx-complete interrupt. However, if the interrupt is executed after `get_a_tx_buf` and before `rpmsg_upref_sleepers`, we may miss the tx-complete interrupt and sleep for the full 15 seconds. In that case, retry once before we really start to sleep.

Signed-off-by: Tim Blechmann <tim@klingt.org>
---
 drivers/rpmsg/virtio_rpmsg_bus.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)