Message ID | 20250327055314.8679-1-senozhatsky@chromium.org |
---|---|
State | Superseded |
Series | [RFC] thunderbolt: do not double dequeue a request |
Hi,

On Thu, Mar 27, 2025 at 02:52:54PM +0900, Sergey Senozhatsky wrote:
> Some of our devices crash in tb_cfg_request_dequeue():
>
>  general protection fault, probably for non-canonical address 0xdead000000000122: 0000 [#1] PREEMPT SMP NOPTI
>
>  CPU: 6 PID: 91007 Comm: kworker/6:2 Tainted: G U W 6.6.65-06391-gbdec63d10750 #1 (HASH:cf42 1)
>  RIP: 0010:tb_cfg_request_dequeue+0x2d/0xa0
>  Call Trace:
>   <TASK>
>   ? tb_cfg_request_dequeue+0x2d/0xa0
>   tb_cfg_request_work+0x33/0x80
>   worker_thread+0x386/0x8f0
>   kthread+0xed/0x110
>   ret_from_fork+0x38/0x50
>   ret_from_fork_asm+0x1b/0x30
>
> The circumstances are unclear, however, the theory is that
> tb_cfg_request_work() can be scheduled twice for a request:
> first time via frame.callback from ring_work() and second
> time from tb_cfg_request(). Both times kworkers will execute
> tb_cfg_request_dequeue(), which results in double list_del()
> from the ctl->request_queue (the list poison dereference hints
> at it: 0xdead000000000122).

I remember seeing something similar, but it was a long time ago.

> Do not dequeue requests that don't have the TB_CFG_REQUEST_ACTIVE
> bit set.
>
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> ---
>  drivers/thunderbolt/ctl.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/thunderbolt/ctl.c b/drivers/thunderbolt/ctl.c
> index cd15e84c47f4..3ad15febc7df 100644
> --- a/drivers/thunderbolt/ctl.c
> +++ b/drivers/thunderbolt/ctl.c
> @@ -151,7 +151,12 @@ static void tb_cfg_request_dequeue(struct tb_cfg_request *req)
> 	struct tb_ctl *ctl = req->ctl;
>
> 	mutex_lock(&ctl->request_queue_lock);
> -	list_del(&req->list);
> +	if (!test_bit(TB_CFG_REQUEST_ACTIVE, &req->flags)) {
> +		mutex_unlock(&ctl->request_queue_lock);
> +		return;
> +	}
> +
> +	list_del_init(&req->list);

Why this change? We are not putting the req back on the list anymore.

Otherwise this looks good to me.

>
> 	clear_bit(TB_CFG_REQUEST_ACTIVE, &req->flags);
> 	if (test_bit(TB_CFG_REQUEST_CANCELED, &req->flags))
> 		wake_up(&tb_cfg_request_cancel_queue);
> --
> 2.49.0.395.g12beb8f557-goog
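The list-poison address in the report follows directly from how list_del() behaves: it unlinks the entry and then writes the LIST_POISON1/LIST_POISON2 sentinels into it, so a second list_del() on the same entry dereferences one of those non-canonical pointers. Below is a minimal standalone userspace sketch of that mechanism; the struct and poison constants are simplified re-declarations for illustration only, not the driver code.

```c
/*
 * Standalone userspace sketch (assumption: simplified re-declaration of
 * the kernel's doubly-linked list, just to illustrate the failure mode).
 */
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

/* Values mirror the kernel's list poison on x86_64 with the default delta. */
#define LIST_POISON1 ((struct list_head *)0xdead000000000100UL)
#define LIST_POISON2 ((struct list_head *)0xdead000000000122UL)

static void list_del(struct list_head *entry)
{
	/* Unlink by touching the neighbours... */
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	/* ...then poison the entry so stale users fault loudly. */
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
}

int main(void)
{
	struct list_head queue = { &queue, &queue };
	struct list_head req;

	/* "Enqueue" req as the only element on the queue. */
	req.next = &queue;
	req.prev = &queue;
	queue.next = &req;
	queue.prev = &req;

	list_del(&req);		/* first dequeue: fine, req is now poisoned */
	printf("after first del: next=%p prev=%p\n",
	       (void *)req.next, (void *)req.prev);

	/*
	 * A second list_del(&req) would dereference the poison pointers,
	 * crashing on a 0xdead... address much like the kworker's GPF.
	 * Left commented out so the sketch exits cleanly.
	 */
	/* list_del(&req); */
	return 0;
}
```

Running it prints the poisoned pointers after the first removal; uncommenting the second list_del() reproduces the crash on the poisoned address.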
On (25/03/27 13:17), Mika Westerberg wrote:
> On Thu, Mar 27, 2025 at 02:52:54PM +0900, Sergey Senozhatsky wrote:
> > Some of our devices crash in tb_cfg_request_dequeue():
> >
> >  general protection fault, probably for non-canonical address 0xdead000000000122: 0000 [#1] PREEMPT SMP NOPTI
> >
> >  CPU: 6 PID: 91007 Comm: kworker/6:2 Tainted: G U W 6.6.65-06391-gbdec63d10750 #1 (HASH:cf42 1)
> >  RIP: 0010:tb_cfg_request_dequeue+0x2d/0xa0
> >  Call Trace:
> >   <TASK>
> >   ? tb_cfg_request_dequeue+0x2d/0xa0
> >   tb_cfg_request_work+0x33/0x80
> >   worker_thread+0x386/0x8f0
> >   kthread+0xed/0x110
> >   ret_from_fork+0x38/0x50
> >   ret_from_fork_asm+0x1b/0x30
> >
> > The circumstances are unclear, however, the theory is that
> > tb_cfg_request_work() can be scheduled twice for a request:
> > first time via frame.callback from ring_work() and second
> > time from tb_cfg_request(). Both times kworkers will execute
> > tb_cfg_request_dequeue(), which results in double list_del()
> > from the ctl->request_queue (the list poison dereference hints
> > at it: 0xdead000000000122).
>
> I remember seeing something similar, but it was a long time ago.

Another possibility could be via tb_cfg_request_sync():

tb_cfg_request_sync()
  tb_cfg_request()
    schedule_work(&req->work) -> tb_cfg_request_dequeue()
  tb_cfg_request_cancel()
    schedule_work(&req->work) -> tb_cfg_request_dequeue()

[..]

> > 	mutex_lock(&ctl->request_queue_lock);
> > -	list_del(&req->list);
> > +	if (!test_bit(TB_CFG_REQUEST_ACTIVE, &req->flags)) {
> > +		mutex_unlock(&ctl->request_queue_lock);
> > +		return;
> > +	}
> > +
> > +	list_del_init(&req->list);
>
> Why this change? We are not putting the req back on the list anymore.

Purely just to be safe. Do you want me to resend without that line?
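Whichever of the two paths ends up racing, the fix makes the second dequeue a no-op by checking TB_CFG_REQUEST_ACTIVE under the queue lock before touching the list. Below is a minimal userspace sketch of that guard pattern, with pthread_mutex_t and a plain flag standing in for the kernel's mutex and test_bit()/clear_bit(); all names here are illustrative, not the driver's.

```c
/*
 * Minimal sketch of the guard the patch adds: the second dequeue
 * attempt finds the "active" flag already cleared and returns without
 * touching the queue again. Hypothetical names, not driver code.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_request {
	bool active;	/* stands in for TB_CFG_REQUEST_ACTIVE */
	int queued;	/* 1 while "on the queue" */
};

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

static void fake_dequeue(struct fake_request *req)
{
	pthread_mutex_lock(&queue_lock);
	if (!req->active) {
		/* Already dequeued once: bail out, do not unlink again. */
		pthread_mutex_unlock(&queue_lock);
		printf("second dequeue: no-op\n");
		return;
	}
	req->queued = 0;	/* the one and only list removal */
	req->active = false;
	pthread_mutex_unlock(&queue_lock);
	printf("first dequeue: removed from queue\n");
}

int main(void)
{
	struct fake_request req = { .active = true, .queued = 1 };

	/* Both scheduling paths described above end up here; only the
	 * first call still sees the flag set. */
	fake_dequeue(&req);
	fake_dequeue(&req);
	return 0;
}
```

Only the first call performs the removal; the second returns before touching the queue, which is what the patch does with test_bit(TB_CFG_REQUEST_ACTIVE, &req->flags) in tb_cfg_request_dequeue().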
```diff
diff --git a/drivers/thunderbolt/ctl.c b/drivers/thunderbolt/ctl.c
index cd15e84c47f4..3ad15febc7df 100644
--- a/drivers/thunderbolt/ctl.c
+++ b/drivers/thunderbolt/ctl.c
@@ -151,7 +151,12 @@ static void tb_cfg_request_dequeue(struct tb_cfg_request *req)
 	struct tb_ctl *ctl = req->ctl;
 
 	mutex_lock(&ctl->request_queue_lock);
-	list_del(&req->list);
+	if (!test_bit(TB_CFG_REQUEST_ACTIVE, &req->flags)) {
+		mutex_unlock(&ctl->request_queue_lock);
+		return;
+	}
+
+	list_del_init(&req->list);
 	clear_bit(TB_CFG_REQUEST_ACTIVE, &req->flags);
 	if (test_bit(TB_CFG_REQUEST_CANCELED, &req->flags))
 		wake_up(&tb_cfg_request_cancel_queue);
```
Some of our devices crash in tb_cfg_request_dequeue():

 general protection fault, probably for non-canonical address 0xdead000000000122: 0000 [#1] PREEMPT SMP NOPTI

 CPU: 6 PID: 91007 Comm: kworker/6:2 Tainted: G U W 6.6.65-06391-gbdec63d10750 #1 (HASH:cf42 1)
 RIP: 0010:tb_cfg_request_dequeue+0x2d/0xa0
 Call Trace:
  <TASK>
  ? tb_cfg_request_dequeue+0x2d/0xa0
  tb_cfg_request_work+0x33/0x80
  worker_thread+0x386/0x8f0
  kthread+0xed/0x110
  ret_from_fork+0x38/0x50
  ret_from_fork_asm+0x1b/0x30

The circumstances are unclear, however, the theory is that
tb_cfg_request_work() can be scheduled twice for a request:
first time via frame.callback from ring_work() and second
time from tb_cfg_request(). Both times kworkers will execute
tb_cfg_request_dequeue(), which results in double list_del()
from the ctl->request_queue (the list poison dereference hints
at it: 0xdead000000000122).

Do not dequeue requests that don't have the TB_CFG_REQUEST_ACTIVE
bit set.

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
 drivers/thunderbolt/ctl.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)