Message ID | 20220117120102.2395157-1-william.xuanziyang@huawei.com (mailing list archive) |
---|---|
State | Awaiting Upstream |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net] can: isotp: isotp_rcv_cf(): fix so->rx race problem | expand |
Hi, the referenced syzbot issue has already been fixed in upstream here: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=5f33a09e769a9da0482f20a6770a342842443776 ("can: isotp: convert struct tpcon::{idx,len} to unsigned int") Additionally this fix changes some behaviour that is required by the ISO 16765-2 specification (see below). On 17.01.22 13:01, Ziyang Xuan wrote: > When receive a FF, the current code logic does not consider the real > so->rx.state but set so->rx.state to ISOTP_IDLE directly. That will > make so->rx accessed by multiple receiving processes concurrently. This is intentionally. "multiple receiving processes" are not allowed resp. specified by ISO 15765-2. > The following syz problem is one of the scenarios. so->rx.len is > changed by isotp_rcv_ff() during isotp_rcv_cf(), so->rx.len equals > 0 before alloc_skb() and equals 4096 after alloc_skb(). That will > trigger skb_over_panic() in skb_put(). > > ======================================================= > CPU: 1 PID: 19 Comm: ksoftirqd/1 Not tainted 5.16.0-rc8-syzkaller #0 > RIP: 0010:skb_panic+0x16c/0x16e net/core/skbuff.c:113 > Call Trace: > <TASK> > skb_over_panic net/core/skbuff.c:118 [inline] > skb_put.cold+0x24/0x24 net/core/skbuff.c:1990 > isotp_rcv_cf net/can/isotp.c:570 [inline] > isotp_rcv+0xa38/0x1e30 net/can/isotp.c:668 > deliver net/can/af_can.c:574 [inline] > can_rcv_filter+0x445/0x8d0 net/can/af_can.c:635 > can_receive+0x31d/0x580 net/can/af_can.c:665 > can_rcv+0x120/0x1c0 net/can/af_can.c:696 > __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5465 > __netif_receive_skb+0x24/0x1b0 net/core/dev.c:5579 > > Check so->rx.state equals ISOTP_IDLE firstly in isotp_rcv_ff(). > Make sure so->rx idle when receive another new packet. And set > so->rx.state to ISOTP_IDLE after whole packet being received. > > Fixes: e057dd3fc20f ("can: add ISO 15765-2:2016 transport protocol") > Reported-by: syzbot+4c63f36709a642f801c5@syzkaller.appspotmail.com > Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> > --- > net/can/isotp.c | 28 +++++++++++++++++----------- > 1 file changed, 17 insertions(+), 11 deletions(-) > > diff --git a/net/can/isotp.c b/net/can/isotp.c > index df6968b28bf4..a4b174f860f3 100644 > --- a/net/can/isotp.c > +++ b/net/can/isotp.c > @@ -443,8 +443,10 @@ static int isotp_rcv_ff(struct sock *sk, struct canfd_frame *cf, int ae) > int off; > int ff_pci_sz; > > + if (so->rx.state != ISOTP_IDLE) > + return 0; > + > hrtimer_cancel(&so->rxtimer); > - so->rx.state = ISOTP_IDLE; No matter in which receiving state we receive a first frame (FF) we are required to start a fresh reception process and/or terminate the current attempt. Best regards, Oliver > > /* get the used sender LL_DL from the (first) CAN frame data length */ > so->rx.ll_dl = padlen(cf->len); > @@ -518,8 +520,6 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, > so->lastrxcf_tstamp = skb->tstamp; > } > > - hrtimer_cancel(&so->rxtimer); > - > /* CFs are never longer than the FF */ > if (cf->len > so->rx.ll_dl) > return 1; > @@ -531,15 +531,15 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, > return 1; > } > > + hrtimer_cancel(&so->rxtimer); > + > if ((cf->data[ae] & 0x0F) != so->rx.sn) { > /* wrong sn detected - report 'illegal byte sequence' */ > sk->sk_err = EILSEQ; > if (!sock_flag(sk, SOCK_DEAD)) > sk_error_report(sk); > > - /* reset rx state */ > - so->rx.state = ISOTP_IDLE; > - return 1; > + goto err_out; > } > so->rx.sn++; > so->rx.sn %= 16; > @@ -551,21 +551,18 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, > } > > if (so->rx.idx >= so->rx.len) { > - /* we are done */ > - so->rx.state = ISOTP_IDLE; > - > if ((so->opt.flags & ISOTP_CHECK_PADDING) && > check_pad(so, cf, i + 1, so->opt.rxpad_content)) { > /* malformed PDU - report 'not a data message' */ > sk->sk_err = EBADMSG; > if (!sock_flag(sk, SOCK_DEAD)) > sk_error_report(sk); > - return 1; > + goto err_out; > } > > nskb = alloc_skb(so->rx.len, gfp_any()); > if (!nskb) > - return 1; > + goto err_out; > > memcpy(skb_put(nskb, so->rx.len), so->rx.buf, > so->rx.len); > @@ -573,6 +570,10 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, > nskb->tstamp = skb->tstamp; > nskb->dev = skb->dev; > isotp_rcv_skb(nskb, sk); > + > + /* we are done */ > + so->rx.state = ISOTP_IDLE; > + > return 0; > } > > @@ -591,6 +592,11 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, > /* we reached the specified blocksize so->rxfc.bs */ > isotp_send_fc(sk, ae, ISOTP_FC_CTS); > return 0; > + > +err_out: > + /* reset rx state */ > + so->rx.state = ISOTP_IDLE; > + return 1; > } > > static void isotp_rcv(struct sk_buff *skb, void *data)
> Hi, > > the referenced syzbot issue has already been fixed in upstream here: > > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=5f33a09e769a9da0482f20a6770a342842443776 > > ("can: isotp: convert struct tpcon::{idx,len} to unsigned int") > > Additionally this fix changes some behaviour that is required by the ISO 16765-2 specification (see below). > > On 17.01.22 13:01, Ziyang Xuan wrote: >> When receive a FF, the current code logic does not consider the real >> so->rx.state but set so->rx.state to ISOTP_IDLE directly. That will >> make so->rx accessed by multiple receiving processes concurrently. > > This is intentionally. "multiple receiving processes" are not allowed resp. specified by ISO 15765-2. Does it can be a network attack? It receives packets from network, but unexpected packets order make server panic. > >> The following syz problem is one of the scenarios. so->rx.len is >> changed by isotp_rcv_ff() during isotp_rcv_cf(), so->rx.len equals >> 0 before alloc_skb() and equals 4096 after alloc_skb(). That will >> trigger skb_over_panic() in skb_put(). >> >> ======================================================= >> CPU: 1 PID: 19 Comm: ksoftirqd/1 Not tainted 5.16.0-rc8-syzkaller #0 >> RIP: 0010:skb_panic+0x16c/0x16e net/core/skbuff.c:113 >> Call Trace: >> <TASK> >> skb_over_panic net/core/skbuff.c:118 [inline] >> skb_put.cold+0x24/0x24 net/core/skbuff.c:1990 >> isotp_rcv_cf net/can/isotp.c:570 [inline] >> isotp_rcv+0xa38/0x1e30 net/can/isotp.c:668 >> deliver net/can/af_can.c:574 [inline] >> can_rcv_filter+0x445/0x8d0 net/can/af_can.c:635 >> can_receive+0x31d/0x580 net/can/af_can.c:665 >> can_rcv+0x120/0x1c0 net/can/af_can.c:696 >> __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5465 >> __netif_receive_skb+0x24/0x1b0 net/core/dev.c:5579 >> >> Check so->rx.state equals ISOTP_IDLE firstly in isotp_rcv_ff(). >> Make sure so->rx idle when receive another new packet. And set >> so->rx.state to ISOTP_IDLE after whole packet being received. >> >> Fixes: e057dd3fc20f ("can: add ISO 15765-2:2016 transport protocol") >> Reported-by: syzbot+4c63f36709a642f801c5@syzkaller.appspotmail.com >> Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> >> --- >> net/can/isotp.c | 28 +++++++++++++++++----------- >> 1 file changed, 17 insertions(+), 11 deletions(-) >> >> diff --git a/net/can/isotp.c b/net/can/isotp.c >> index df6968b28bf4..a4b174f860f3 100644 >> --- a/net/can/isotp.c >> +++ b/net/can/isotp.c >> @@ -443,8 +443,10 @@ static int isotp_rcv_ff(struct sock *sk, struct canfd_frame *cf, int ae) >> int off; >> int ff_pci_sz; >> + if (so->rx.state != ISOTP_IDLE) >> + return 0; >> + >> hrtimer_cancel(&so->rxtimer); >> - so->rx.state = ISOTP_IDLE; > > No matter in which receiving state we receive a first frame (FF) we are required to start a fresh reception process and/or terminate the current attempt. > > Best regards, > Oliver > >> /* get the used sender LL_DL from the (first) CAN frame data length */ >> so->rx.ll_dl = padlen(cf->len); >> @@ -518,8 +520,6 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, >> so->lastrxcf_tstamp = skb->tstamp; >> } >> - hrtimer_cancel(&so->rxtimer); >> - >> /* CFs are never longer than the FF */ >> if (cf->len > so->rx.ll_dl) >> return 1; >> @@ -531,15 +531,15 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, >> return 1; >> } >> + hrtimer_cancel(&so->rxtimer); >> + >> if ((cf->data[ae] & 0x0F) != so->rx.sn) { >> /* wrong sn detected - report 'illegal byte sequence' */ >> sk->sk_err = EILSEQ; >> if (!sock_flag(sk, SOCK_DEAD)) >> sk_error_report(sk); >> - /* reset rx state */ >> - so->rx.state = ISOTP_IDLE; >> - return 1; >> + goto err_out; >> } >> so->rx.sn++; >> so->rx.sn %= 16; >> @@ -551,21 +551,18 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, >> } >> if (so->rx.idx >= so->rx.len) { >> - /* we are done */ >> - so->rx.state = ISOTP_IDLE; >> - >> if ((so->opt.flags & ISOTP_CHECK_PADDING) && >> check_pad(so, cf, i + 1, so->opt.rxpad_content)) { >> /* malformed PDU - report 'not a data message' */ >> sk->sk_err = EBADMSG; >> if (!sock_flag(sk, SOCK_DEAD)) >> sk_error_report(sk); >> - return 1; >> + goto err_out; >> } >> nskb = alloc_skb(so->rx.len, gfp_any()); >> if (!nskb) >> - return 1; >> + goto err_out; >> memcpy(skb_put(nskb, so->rx.len), so->rx.buf, >> so->rx.len); >> @@ -573,6 +570,10 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, >> nskb->tstamp = skb->tstamp; >> nskb->dev = skb->dev; >> isotp_rcv_skb(nskb, sk); >> + >> + /* we are done */ >> + so->rx.state = ISOTP_IDLE; >> + >> return 0; >> } >> @@ -591,6 +592,11 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, >> /* we reached the specified blocksize so->rxfc.bs */ >> isotp_send_fc(sk, ae, ISOTP_FC_CTS); >> return 0; >> + >> +err_out: >> + /* reset rx state */ >> + so->rx.state = ISOTP_IDLE; >> + return 1; >> } >> static void isotp_rcv(struct sk_buff *skb, void *data) > .
On 18.01.22 13:46, Ziyang Xuan (William) wrote: >> Hi, >> >> the referenced syzbot issue has already been fixed in upstream here: >> >> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=5f33a09e769a9da0482f20a6770a342842443776 >> >> ("can: isotp: convert struct tpcon::{idx,len} to unsigned int") >> >> Additionally this fix changes some behaviour that is required by the ISO 16765-2 specification (see below). >> >> On 17.01.22 13:01, Ziyang Xuan wrote: >>> When receive a FF, the current code logic does not consider the real >>> so->rx.state but set so->rx.state to ISOTP_IDLE directly. That will >>> make so->rx accessed by multiple receiving processes concurrently. >> >> This is intentionally. "multiple receiving processes" are not allowed resp. specified by ISO 15765-2. > > Does it can be a network attack? Yes. You can see it like this. The ISO 15765-2 protocol is an unreliable UDP-like datagram protocol and the session layer takes care about timeouts and packet lost. If you want to disturb that protocol you can also send PDUs with out-of-sync packet counters which will make the receiver drop the communication attempt. This is 'CAN-style' ... as usually the bus is very reliable. Security and reliable communication is done on top of these protocols. > It receives packets from network, but unexpected packets order make server panic. Haha, no :-) Unexpected packets should not make the server panic but only drop the communication process. In the case pointed out by syzbot the unsigned 32 bit length information was stored in a signed 32 bit integer which caused a sanity check to fail. This is now fixed with the patch from Marc. Regards, Oliver
> On 18.01.22 13:46, Ziyang Xuan (William) wrote: >>> Hi, >>> >>> the referenced syzbot issue has already been fixed in upstream here: >>> >>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=5f33a09e769a9da0482f20a6770a342842443776 >>> >>> ("can: isotp: convert struct tpcon::{idx,len} to unsigned int") >>> >>> Additionally this fix changes some behaviour that is required by the ISO 16765-2 specification (see below). >>> >>> On 17.01.22 13:01, Ziyang Xuan wrote: >>>> When receive a FF, the current code logic does not consider the real >>>> so->rx.state but set so->rx.state to ISOTP_IDLE directly. That will >>>> make so->rx accessed by multiple receiving processes concurrently. >>> >>> This is intentionally. "multiple receiving processes" are not allowed resp. specified by ISO 15765-2. >> >> Does it can be a network attack? > > Yes. You can see it like this. The ISO 15765-2 protocol is an unreliable UDP-like datagram protocol and the session layer takes care about timeouts and packet lost. > > If you want to disturb that protocol you can also send PDUs with out-of-sync packet counters which will make the receiver drop the communication attempt. > > This is 'CAN-style' ... as usually the bus is very reliable. Security and reliable communication is done on top of these protocols. > >> It receives packets from network, but unexpected packets order make server panic. > > Haha, no :-) > > Unexpected packets should not make the server panic but only drop the communication process. I have reproduced the syz problem with Marc's commit, the commit can not fix the panic problem. So I tried to find the root cause for panic and gave my solution. Marc's commit just fix the condition that packet size bigger than INT_MAX which trigger tpcon::{idx,len} integer overflow, but the packet size is 4096 in the syz problem. so->rx.len is 0 after the following logic in isotp_rcv_ff(): /* get the FF_DL */ so->rx.len = (cf->data[ae] & 0x0F) << 8; so->rx.len += cf->data[ae + 1]; so->rx.len is 4096 after the following logic in isotp_rcv_ff(): /* FF_DL = 0 => get real length from next 4 bytes */ so->rx.len = cf->data[ae + 2] << 24; so->rx.len += cf->data[ae + 3] << 16; so->rx.len += cf->data[ae + 4] << 8; so->rx.len += cf->data[ae + 5]; so->rx.len is 0 before alloc_skb() and is 4096 after alloc_skb() in isotp_rcv_cf(). The following skb_put() will trigger panic. The following log is my reproducing log with Marc's commit and my debug modification in isotp_rcv_cf(). [ 150.605776][ C6] isotp_rcv_cf: before alloc_skb so->rc.len: 0, after alloc_skb so->rx.len: 4096 [ 150.611477][ C6] skbuff: skb_over_panic: text:ffffffff881ff7be len:4096 put:4096 head:ffff88807f93a800 data:ffff88807f93a800 tail:0x1000 end:0xc0 dev:<NULL> [ 150.615837][ C6] ------------[ cut here ]------------ [ 150.617238][ C6] kernel BUG at net/core/skbuff.c:113! > In the case pointed out by syzbot the unsigned 32 bit length information was stored in a signed 32 bit integer which caused a sanity check to fail. > > This is now fixed with the patch from Marc. > > Regards, > Oliver > .
On 20.01.22 07:24, Ziyang Xuan (William) wrote: > I have reproduced the syz problem with Marc's commit, the commit can not fix the panic problem. > So I tried to find the root cause for panic and gave my solution. > > Marc's commit just fix the condition that packet size bigger than INT_MAX which trigger > tpcon::{idx,len} integer overflow, but the packet size is 4096 in the syz problem. > > so->rx.len is 0 after the following logic in isotp_rcv_ff(): > > /* get the FF_DL */ > so->rx.len = (cf->data[ae] & 0x0F) << 8; > so->rx.len += cf->data[ae + 1]; > > so->rx.len is 4096 after the following logic in isotp_rcv_ff(): > > /* FF_DL = 0 => get real length from next 4 bytes */ > so->rx.len = cf->data[ae + 2] << 24; > so->rx.len += cf->data[ae + 3] << 16; > so->rx.len += cf->data[ae + 4] << 8; > so->rx.len += cf->data[ae + 5]; > In these cases the values 0 could be the minimum value in so->rx.len - but e.g. the value 0 can not show up in isotp_rcv_cf() as this function requires so->rx.state to be ISOTP_WAIT_DATA. And when so->rx.len is 0 in isotp_rcv_ff() this check if (so->rx.len + ae + off + ff_pci_sz < so->rx.ll_dl) return 1; will return from isotp_rcv_ff() before ISOTP_WAIT_DATA is set at the end. So after that above check we are still in ISOTP_IDLE state. Or did I miss something here? > so->rx.len is 0 before alloc_skb() and is 4096 after alloc_skb() in isotp_rcv_cf(). The following > skb_put() will trigger panic. > > The following log is my reproducing log with Marc's commit and my debug modification in isotp_rcv_cf(). > > [ 150.605776][ C6] isotp_rcv_cf: before alloc_skb so->rc.len: 0, after alloc_skb so->rx.len: 4096 But so->rx_len is not a value that is modified by alloc_skb(): nskb = alloc_skb(so->rx.len, gfp_any()); if (!nskb) return 1; memcpy(skb_put(nskb, so->rx.len), so->rx.buf, so->rx.len); Can you send your debug modification changes please? Best regards, Oliver > [ 150.611477][ C6] skbuff: skb_over_panic: text:ffffffff881ff7be len:4096 put:4096 head:ffff88807f93a800 data:ffff88807f93a800 tail:0x1000 end:0xc0 dev:<NULL> > [ 150.615837][ C6] ------------[ cut here ]------------ > [ 150.617238][ C6] kernel BUG at net/core/skbuff.c:113! >
> > On 20.01.22 07:24, Ziyang Xuan (William) wrote: > >> I have reproduced the syz problem with Marc's commit, the commit can not fix the panic problem. >> So I tried to find the root cause for panic and gave my solution. >> >> Marc's commit just fix the condition that packet size bigger than INT_MAX which trigger >> tpcon::{idx,len} integer overflow, but the packet size is 4096 in the syz problem. >> >> so->rx.len is 0 after the following logic in isotp_rcv_ff(): >> >> /* get the FF_DL */ >> so->rx.len = (cf->data[ae] & 0x0F) << 8; >> so->rx.len += cf->data[ae + 1]; >> >> so->rx.len is 4096 after the following logic in isotp_rcv_ff(): >> >> /* FF_DL = 0 => get real length from next 4 bytes */ >> so->rx.len = cf->data[ae + 2] << 24; >> so->rx.len += cf->data[ae + 3] << 16; >> so->rx.len += cf->data[ae + 4] << 8; >> so->rx.len += cf->data[ae + 5]; >> > > In these cases the values 0 could be the minimum value in so->rx.len - but e.g. the value 0 can not show up in isotp_rcv_cf() as this function requires so->rx.state to be ISOTP_WAIT_DATA. Consider the scenario that isotp_rcv_cf() and isotp_rcv_cf() are concurrent for the same isotp_sock as following sequence: isotp_rcv_cf() if (so->rx.state != ISOTP_WAIT_DATA) [false] isotp_rcv_ff() so->rx.state = ISOTP_IDLE /* get the FF_DL */ [so->rx.len == 0] alloc_skb() [so->rx.len == 0] /* FF_DL = 0 => get real length from next 4 bytes */ [so->rx.len == 4096] skb_put(nskb, so->rx.len) [so->rx.len == 4096] skb_over_panic() > > And when so->rx.len is 0 in isotp_rcv_ff() this check > > if (so->rx.len + ae + off + ff_pci_sz < so->rx.ll_dl) > return 1; > > will return from isotp_rcv_ff() before ISOTP_WAIT_DATA is set at the end. So after that above check we are still in ISOTP_IDLE state. > > Or did I miss something here? > >> so->rx.len is 0 before alloc_skb() and is 4096 after alloc_skb() in isotp_rcv_cf(). The following >> skb_put() will trigger panic. >> >> The following log is my reproducing log with Marc's commit and my debug modification in isotp_rcv_cf(). >> >> [ 150.605776][ C6] isotp_rcv_cf: before alloc_skb so->rc.len: 0, after alloc_skb so->rx.len: 4096 > > > But so->rx_len is not a value that is modified by alloc_skb(): > > nskb = alloc_skb(so->rx.len, gfp_any()); > if (!nskb) > return 1; > > memcpy(skb_put(nskb, so->rx.len), so->rx.buf, > so->rx.len); > > > Can you send your debug modification changes please? My reproducing debug as attachment and following: diff --git a/net/can/isotp.c b/net/can/isotp.c index df6968b28bf4..8b12d63b4d59 100644 --- a/net/can/isotp.c +++ b/net/can/isotp.c @@ -119,8 +119,8 @@ enum { }; struct tpcon { - int idx; - int len; + unsigned int idx; + unsigned int len; u32 state; u8 bs; u8 sn; @@ -505,6 +505,7 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, struct isotp_sock *so = isotp_sk(sk); struct sk_buff *nskb; int i; + bool unexpection = false; if (so->rx.state != ISOTP_WAIT_DATA) return 0; @@ -562,11 +563,13 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, sk_error_report(sk); return 1; } - + if (so->rx.len == 0) + unexpection = true; nskb = alloc_skb(so->rx.len, gfp_any()); if (!nskb) return 1; - + if (unexpection) + printk("%s: before alloc_skb so->rc.len: 0, after alloc_skb so->rx.len: %u\n", __func__, so->rx.len); memcpy(skb_put(nskb, so->rx.len), so->rx.buf, so->rx.len); > > Best regards, > Oliver > >> [ 150.611477][ C6] skbuff: skb_over_panic: text:ffffffff881ff7be len:4096 put:4096 head:ffff88807f93a800 data:ffff88807f93a800 tail:0x1000 end:0xc0 dev:<NULL> >> [ 150.615837][ C6] ------------[ cut here ]------------ >> [ 150.617238][ C6] kernel BUG at net/core/skbuff.c:113! >> > > . From 16c94e5aee258b1d856e29493746db9afce8963a Mon Sep 17 00:00:00 2001 From: Ziyang Xuan <william.xuanziyang@huawei.com> Date: Thu, 20 Jan 2022 15:17:54 +0800 Subject: [PATCH] can: isotp: debug for reproducing isotp_rcv panic Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> --- net/can/isotp.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/net/can/isotp.c b/net/can/isotp.c index df6968b28bf4..8b12d63b4d59 100644 --- a/net/can/isotp.c +++ b/net/can/isotp.c @@ -119,8 +119,8 @@ enum { }; struct tpcon { - int idx; - int len; + unsigned int idx; + unsigned int len; u32 state; u8 bs; u8 sn; @@ -505,6 +505,7 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, struct isotp_sock *so = isotp_sk(sk); struct sk_buff *nskb; int i; + bool unexpection = false; if (so->rx.state != ISOTP_WAIT_DATA) return 0; @@ -562,11 +563,13 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, sk_error_report(sk); return 1; } - + if (so->rx.len == 0) + unexpection = true; nskb = alloc_skb(so->rx.len, gfp_any()); if (!nskb) return 1; - + if (unexpection) + printk("%s: before alloc_skb so->rc.len: 0, after alloc_skb so->rx.len: %u\n", __func__, so->rx.len); memcpy(skb_put(nskb, so->rx.len), so->rx.buf, so->rx.len);
On 20.01.22 12:28, Ziyang Xuan (William) wrote: >> >> On 20.01.22 07:24, Ziyang Xuan (William) wrote: >> >>> I have reproduced the syz problem with Marc's commit, the commit can not fix the panic problem. >>> So I tried to find the root cause for panic and gave my solution. >>> >>> Marc's commit just fix the condition that packet size bigger than INT_MAX which trigger >>> tpcon::{idx,len} integer overflow, but the packet size is 4096 in the syz problem. >>> >>> so->rx.len is 0 after the following logic in isotp_rcv_ff(): >>> >>> /* get the FF_DL */ >>> so->rx.len = (cf->data[ae] & 0x0F) << 8; >>> so->rx.len += cf->data[ae + 1]; >>> >>> so->rx.len is 4096 after the following logic in isotp_rcv_ff(): >>> >>> /* FF_DL = 0 => get real length from next 4 bytes */ >>> so->rx.len = cf->data[ae + 2] << 24; >>> so->rx.len += cf->data[ae + 3] << 16; >>> so->rx.len += cf->data[ae + 4] << 8; >>> so->rx.len += cf->data[ae + 5]; >>> >> >> In these cases the values 0 could be the minimum value in so->rx.len - but e.g. the value 0 can not show up in isotp_rcv_cf() as this function requires so->rx.state to be ISOTP_WAIT_DATA. > > Consider the scenario that isotp_rcv_cf() and isotp_rcv_cf() are concurrent for the same isotp_sock as following sequence: o_O Sorry but the receive path is not designed to handle concurrent receptions that would run isotp_rcv_cf() and isotp_rcv_ff() simultaneously. > isotp_rcv_cf() > if (so->rx.state != ISOTP_WAIT_DATA) [false] > isotp_rcv_ff() > so->rx.state = ISOTP_IDLE > /* get the FF_DL */ [so->rx.len == 0] > alloc_skb() [so->rx.len == 0] > /* FF_DL = 0 => get real length from next 4 bytes */ [so->rx.len == 4096] > skb_put(nskb, so->rx.len) [so->rx.len == 4096] > skb_over_panic() > Even though this case is not possible with a real CAN bus due to the CAN frame transmission times we could introduce some locking (or dropping of concurrent CAN frames) in isotp_rcv() - but this code runs in net softirq context ... Regards, Oliver >> >> And when so->rx.len is 0 in isotp_rcv_ff() this check >> >> if (so->rx.len + ae + off + ff_pci_sz < so->rx.ll_dl) >> return 1; >> >> will return from isotp_rcv_ff() before ISOTP_WAIT_DATA is set at the end. So after that above check we are still in ISOTP_IDLE state. >> >> Or did I miss something here? >> >>> so->rx.len is 0 before alloc_skb() and is 4096 after alloc_skb() in isotp_rcv_cf(). The following >>> skb_put() will trigger panic. >>> >>> The following log is my reproducing log with Marc's commit and my debug modification in isotp_rcv_cf(). >>> >>> [ 150.605776][ C6] isotp_rcv_cf: before alloc_skb so->rc.len: 0, after alloc_skb so->rx.len: 4096 >> >> >> But so->rx_len is not a value that is modified by alloc_skb(): >> >> nskb = alloc_skb(so->rx.len, gfp_any()); >> if (!nskb) >> return 1; >> >> memcpy(skb_put(nskb, so->rx.len), so->rx.buf, >> so->rx.len); >> >> >> Can you send your debug modification changes please? > > My reproducing debug as attachment and following: > > diff --git a/net/can/isotp.c b/net/can/isotp.c > index df6968b28bf4..8b12d63b4d59 100644 > --- a/net/can/isotp.c > +++ b/net/can/isotp.c > @@ -119,8 +119,8 @@ enum { > }; > > struct tpcon { > - int idx; > - int len; > + unsigned int idx; > + unsigned int len; > u32 state; > u8 bs; > u8 sn; > @@ -505,6 +505,7 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, > struct isotp_sock *so = isotp_sk(sk); > struct sk_buff *nskb; > int i; > + bool unexpection = false; > > if (so->rx.state != ISOTP_WAIT_DATA) > return 0; > @@ -562,11 +563,13 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, > sk_error_report(sk); > return 1; > } > - > + if (so->rx.len == 0) > + unexpection = true; > nskb = alloc_skb(so->rx.len, gfp_any()); > if (!nskb) > return 1; > - > + if (unexpection) > + printk("%s: before alloc_skb so->rc.len: 0, after alloc_skb so->rx.len: %u\n", __func__, so->rx.len); > memcpy(skb_put(nskb, so->rx.len), so->rx.buf, > so->rx.len); > > >> >> Best regards, >> Oliver >> >>> [ 150.611477][ C6] skbuff: skb_over_panic: text:ffffffff881ff7be len:4096 put:4096 head:ffff88807f93a800 data:ffff88807f93a800 tail:0x1000 end:0xc0 dev:<NULL> >>> [ 150.615837][ C6] ------------[ cut here ]------------ >>> [ 150.617238][ C6] kernel BUG at net/core/skbuff.c:113! >>> >> >> .
> > On 20.01.22 12:28, Ziyang Xuan (William) wrote: >>> >>> On 20.01.22 07:24, Ziyang Xuan (William) wrote: >>> >>>> I have reproduced the syz problem with Marc's commit, the commit can not fix the panic problem. >>>> So I tried to find the root cause for panic and gave my solution. >>>> >>>> Marc's commit just fix the condition that packet size bigger than INT_MAX which trigger >>>> tpcon::{idx,len} integer overflow, but the packet size is 4096 in the syz problem. >>>> >>>> so->rx.len is 0 after the following logic in isotp_rcv_ff(): >>>> >>>> /* get the FF_DL */ >>>> so->rx.len = (cf->data[ae] & 0x0F) << 8; >>>> so->rx.len += cf->data[ae + 1]; >>>> >>>> so->rx.len is 4096 after the following logic in isotp_rcv_ff(): >>>> >>>> /* FF_DL = 0 => get real length from next 4 bytes */ >>>> so->rx.len = cf->data[ae + 2] << 24; >>>> so->rx.len += cf->data[ae + 3] << 16; >>>> so->rx.len += cf->data[ae + 4] << 8; >>>> so->rx.len += cf->data[ae + 5]; >>>> >>> >>> In these cases the values 0 could be the minimum value in so->rx.len - but e.g. the value 0 can not show up in isotp_rcv_cf() as this function requires so->rx.state to be ISOTP_WAIT_DATA. >> >> Consider the scenario that isotp_rcv_cf() and isotp_rcv_cf() are concurrent for the same isotp_sock as following sequence: > > o_O > > Sorry but the receive path is not designed to handle concurrent receptions that would run isotp_rcv_cf() and isotp_rcv_ff() simultaneously. > >> isotp_rcv_cf() >> if (so->rx.state != ISOTP_WAIT_DATA) [false] >> isotp_rcv_ff() >> so->rx.state = ISOTP_IDLE >> /* get the FF_DL */ [so->rx.len == 0] >> alloc_skb() [so->rx.len == 0] >> /* FF_DL = 0 => get real length from next 4 bytes */ [so->rx.len == 4096] >> skb_put(nskb, so->rx.len) [so->rx.len == 4096] >> skb_over_panic() >> > > Even though this case is not possible with a real CAN bus due to the CAN frame transmission times we could introduce some locking (or dropping of concurrent CAN frames) in isotp_rcv() - but this code runs in net softirq context ... > I thought the kernel code logic should make sure the kernel availability no matter what happens in user space code. And tx path has considered so->tx race condition actually but rx path for so->rx. > Regards, > Oliver > > >>> >>> And when so->rx.len is 0 in isotp_rcv_ff() this check >>> >>> if (so->rx.len + ae + off + ff_pci_sz < so->rx.ll_dl) >>> return 1; >>> >>> will return from isotp_rcv_ff() before ISOTP_WAIT_DATA is set at the end. So after that above check we are still in ISOTP_IDLE state. >>> >>> Or did I miss something here? >>> >>>> so->rx.len is 0 before alloc_skb() and is 4096 after alloc_skb() in isotp_rcv_cf(). The following >>>> skb_put() will trigger panic. >>>> >>>> The following log is my reproducing log with Marc's commit and my debug modification in isotp_rcv_cf(). >>>> >>>> [ 150.605776][ C6] isotp_rcv_cf: before alloc_skb so->rc.len: 0, after alloc_skb so->rx.len: 4096 >>> >>> >>> But so->rx_len is not a value that is modified by alloc_skb(): >>> >>> nskb = alloc_skb(so->rx.len, gfp_any()); >>> if (!nskb) >>> return 1; >>> >>> memcpy(skb_put(nskb, so->rx.len), so->rx.buf, >>> so->rx.len); >>> >>> >>> Can you send your debug modification changes please? >> >> My reproducing debug as attachment and following: >> >> diff --git a/net/can/isotp.c b/net/can/isotp.c >> index df6968b28bf4..8b12d63b4d59 100644 >> --- a/net/can/isotp.c >> +++ b/net/can/isotp.c >> @@ -119,8 +119,8 @@ enum { >> }; >> >> struct tpcon { >> - int idx; >> - int len; >> + unsigned int idx; >> + unsigned int len; >> u32 state; >> u8 bs; >> u8 sn; >> @@ -505,6 +505,7 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, >> struct isotp_sock *so = isotp_sk(sk); >> struct sk_buff *nskb; >> int i; >> + bool unexpection = false; >> >> if (so->rx.state != ISOTP_WAIT_DATA) >> return 0; >> @@ -562,11 +563,13 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, >> sk_error_report(sk); >> return 1; >> } >> - >> + if (so->rx.len == 0) >> + unexpection = true; >> nskb = alloc_skb(so->rx.len, gfp_any()); >> if (!nskb) >> return 1; >> - >> + if (unexpection) >> + printk("%s: before alloc_skb so->rc.len: 0, after alloc_skb so->rx.len: %u\n", __func__, so->rx.len); >> memcpy(skb_put(nskb, so->rx.len), so->rx.buf, >> so->rx.len); >> >> >>> >>> Best regards, >>> Oliver >>> >>>> [ 150.611477][ C6] skbuff: skb_over_panic: text:ffffffff881ff7be len:4096 put:4096 head:ffff88807f93a800 data:ffff88807f93a800 tail:0x1000 end:0xc0 dev:<NULL> >>>> [ 150.615837][ C6] ------------[ cut here ]------------ >>>> [ 150.617238][ C6] kernel BUG at net/core/skbuff.c:113! >>>> >>> >>> . > .
Hello Ziyang Xuan, On 21.01.22 02:50, Ziyang Xuan (William) wrote: >> >> On 20.01.22 12:28, Ziyang Xuan (William) wrote: >>>> >>>> On 20.01.22 07:24, Ziyang Xuan (William) wrote: >>>> >>>>> I have reproduced the syz problem with Marc's commit, the commit can not fix the panic problem. >>>>> So I tried to find the root cause for panic and gave my solution. >>>>> >>>>> Marc's commit just fix the condition that packet size bigger than INT_MAX which trigger >>>>> tpcon::{idx,len} integer overflow, but the packet size is 4096 in the syz problem. >>>>> >>>>> so->rx.len is 0 after the following logic in isotp_rcv_ff(): >>>>> >>>>> /* get the FF_DL */ >>>>> so->rx.len = (cf->data[ae] & 0x0F) << 8; >>>>> so->rx.len += cf->data[ae + 1]; >>>>> >>>>> so->rx.len is 4096 after the following logic in isotp_rcv_ff(): >>>>> >>>>> /* FF_DL = 0 => get real length from next 4 bytes */ >>>>> so->rx.len = cf->data[ae + 2] << 24; >>>>> so->rx.len += cf->data[ae + 3] << 16; >>>>> so->rx.len += cf->data[ae + 4] << 8; >>>>> so->rx.len += cf->data[ae + 5]; >>>>> >>>> >>>> In these cases the values 0 could be the minimum value in so->rx.len - but e.g. the value 0 can not show up in isotp_rcv_cf() as this function requires so->rx.state to be ISOTP_WAIT_DATA. >>> >>> Consider the scenario that isotp_rcv_cf() and isotp_rcv_cf() are concurrent for the same isotp_sock as following sequence: >> >> o_O >> >> Sorry but the receive path is not designed to handle concurrent receptions that would run isotp_rcv_cf() and isotp_rcv_ff() simultaneously. >> >>> isotp_rcv_cf() >>> if (so->rx.state != ISOTP_WAIT_DATA) [false] >>> isotp_rcv_ff() >>> so->rx.state = ISOTP_IDLE >>> /* get the FF_DL */ [so->rx.len == 0] >>> alloc_skb() [so->rx.len == 0] >>> /* FF_DL = 0 => get real length from next 4 bytes */ [so->rx.len == 4096] >>> skb_put(nskb, so->rx.len) [so->rx.len == 4096] >>> skb_over_panic() >>> >> >> Even though this case is not possible with a real CAN bus due to the CAN frame transmission times we could introduce some locking (or dropping of concurrent CAN frames) in isotp_rcv() - but this code runs in net softirq context ... >> As discussed off-list I added a spin_lock() in isotp_rcv() as https://www.kernel.org/doc/htmldocs/kernel-locking/lock-softirqs.html suggests. Please give this patch[1] a try in your test setup. Many thanks, Oliver [1]: https://lore.kernel.org/linux-can/20220127192429.336335-1-socketcan@hartkopp.net/T/
Answering myself ... I've seen the frame processing sometimes freezes for one second when stressing the isotp_rcv() from multiple sources. This finally freezes the entire softirq which is either not good and not needed as we only need to fix this race for stress tests - and not for real world usage that does not create this case. Therefore I created a V2 patch which uses the spin_trylock() to simply drop the incomming frame in the race condition. https://lore.kernel.org/linux-can/20220128074327.52229-1-socketcan@hartkopp.net/T/ Please take a look, if it also fixes the issue in your test setup. Thanks & best regards, Oliver On 27.01.22 20:44, Oliver Hartkopp wrote: > Hello Ziyang Xuan, > > On 21.01.22 02:50, Ziyang Xuan (William) wrote: >>> >>> On 20.01.22 12:28, Ziyang Xuan (William) wrote: >>>>> >>>>> On 20.01.22 07:24, Ziyang Xuan (William) wrote: >>>>> >>>>>> I have reproduced the syz problem with Marc's commit, the commit >>>>>> can not fix the panic problem. >>>>>> So I tried to find the root cause for panic and gave my solution. >>>>>> >>>>>> Marc's commit just fix the condition that packet size bigger than >>>>>> INT_MAX which trigger >>>>>> tpcon::{idx,len} integer overflow, but the packet size is 4096 in >>>>>> the syz problem. >>>>>> >>>>>> so->rx.len is 0 after the following logic in isotp_rcv_ff(): >>>>>> >>>>>> /* get the FF_DL */ >>>>>> so->rx.len = (cf->data[ae] & 0x0F) << 8; >>>>>> so->rx.len += cf->data[ae + 1]; >>>>>> >>>>>> so->rx.len is 4096 after the following logic in isotp_rcv_ff(): >>>>>> >>>>>> /* FF_DL = 0 => get real length from next 4 bytes */ >>>>>> so->rx.len = cf->data[ae + 2] << 24; >>>>>> so->rx.len += cf->data[ae + 3] << 16; >>>>>> so->rx.len += cf->data[ae + 4] << 8; >>>>>> so->rx.len += cf->data[ae + 5]; >>>>>> >>>>> >>>>> In these cases the values 0 could be the minimum value in >>>>> so->rx.len - but e.g. the value 0 can not show up in isotp_rcv_cf() >>>>> as this function requires so->rx.state to be ISOTP_WAIT_DATA. >>>> >>>> Consider the scenario that isotp_rcv_cf() and isotp_rcv_cf() are >>>> concurrent for the same isotp_sock as following sequence: >>> >>> o_O >>> >>> Sorry but the receive path is not designed to handle concurrent >>> receptions that would run isotp_rcv_cf() and isotp_rcv_ff() >>> simultaneously. >>> >>>> isotp_rcv_cf() >>>> if (so->rx.state != ISOTP_WAIT_DATA) [false] >>>> isotp_rcv_ff() >>>> so->rx.state = ISOTP_IDLE >>>> /* get the FF_DL */ [so->rx.len == 0] >>>> alloc_skb() [so->rx.len == 0] >>>> /* FF_DL = 0 => get real length from next 4 >>>> bytes */ [so->rx.len == 4096] >>>> skb_put(nskb, so->rx.len) [so->rx.len == 4096] >>>> skb_over_panic() >>>> >>> >>> Even though this case is not possible with a real CAN bus due to the >>> CAN frame transmission times we could introduce some locking (or >>> dropping of concurrent CAN frames) in isotp_rcv() - but this code >>> runs in net softirq context ... >>> > > As discussed off-list I added a spin_lock() in isotp_rcv() as > https://www.kernel.org/doc/htmldocs/kernel-locking/lock-softirqs.html > suggests. > > Please give this patch[1] a try in your test setup. > > Many thanks, > Oliver > > [1]: > https://lore.kernel.org/linux-can/20220127192429.336335-1-socketcan@hartkopp.net/T/ >
On 28.01.2022 08:56:19, Oliver Hartkopp wrote: > I've seen the frame processing sometimes freezes for one second when > stressing the isotp_rcv() from multiple sources. This finally freezes > the entire softirq which is either not good and not needed as we only > need to fix this race for stress tests - and not for real world usage > that does not create this case. Hmmm, this doesn't sound good. Can you test with LOCKDEP enabled? > > Therefore I created a V2 patch which uses the spin_trylock() to simply drop > the incomming frame in the race condition. > > https://lore.kernel.org/linux-can/20220128074327.52229-1-socketcan@hartkopp.net/T/ > > Please take a look, if it also fixes the issue in your test setup. regards, Marc
On 28.01.22 09:07, Marc Kleine-Budde wrote: > On 28.01.2022 08:56:19, Oliver Hartkopp wrote: >> I've seen the frame processing sometimes freezes for one second when >> stressing the isotp_rcv() from multiple sources. This finally freezes >> the entire softirq which is either not good and not needed as we only >> need to fix this race for stress tests - and not for real world usage >> that does not create this case. > > Hmmm, this doesn't sound good. Can you test with LOCKDEP enabled? In kernel config? I enabled almost everything with LOCKing ;-) CONFIG_LOCKDEP_SUPPORT=y CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y CONFIG_ASN1=y CONFIG_UNINLINE_SPIN_UNLOCK=y CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y CONFIG_MUTEX_SPIN_ON_OWNER=y CONFIG_RWSEM_SPIN_ON_OWNER=y CONFIG_LOCK_SPIN_ON_OWNER=y CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y CONFIG_QUEUED_SPINLOCKS=y CONFIG_ARCH_USE_QUEUED_RWLOCKS=y CONFIG_QUEUED_RWLOCKS=y CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE=y CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE=y CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y CONFIG_FREEZER=y CONFIG_HWSPINLOCK=y CONFIG_I8253_LOCK=y # # Debug Oops, Lockups and Hangs # # CONFIG_PANIC_ON_OOPS is not set CONFIG_PANIC_ON_OOPS_VALUE=0 CONFIG_PANIC_TIMEOUT=0 CONFIG_LOCKUP_DETECTOR=y CONFIG_SOFTLOCKUP_DETECTOR=y # CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0 CONFIG_HARDLOCKUP_DETECTOR_PERF=y CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y CONFIG_HARDLOCKUP_DETECTOR=y # CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=0 CONFIG_DETECT_HUNG_TASK=y CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120 <--- the isotp timeout is 1000 ms what I can observe here with v1 patch # CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=0 # CONFIG_WQ_WATCHDOG is not set # CONFIG_TEST_LOCKUP is not set # end of Debug Oops, Lockups and Hangs Only this debugging stuff is not really enabled: # # Lock Debugging (spinlocks, mutexes, etc...) # CONFIG_LOCK_DEBUGGING_SUPPORT=y # CONFIG_PROVE_LOCKING is not set # CONFIG_LOCK_STAT is not set # CONFIG_DEBUG_RT_MUTEXES is not set # CONFIG_DEBUG_SPINLOCK is not set # CONFIG_DEBUG_MUTEXES is not set # CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set # CONFIG_DEBUG_RWSEMS is not set # CONFIG_DEBUG_LOCK_ALLOC is not set # CONFIG_DEBUG_ATOMIC_SLEEP is not set # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set # CONFIG_LOCK_TORTURE_TEST is not set # CONFIG_WW_MUTEX_SELFTEST is not set # CONFIG_SCF_TORTURE_TEST is not set # CONFIG_CSD_LOCK_WAIT_DEBUG is not set # end of Lock Debugging (spinlocks, mutexes, etc...) Would this help to be enabled for this test (of the v1 patch? Best regards, Oliver
On 28.01.2022 09:32:40, Oliver Hartkopp wrote: > > > On 28.01.22 09:07, Marc Kleine-Budde wrote: > > On 28.01.2022 08:56:19, Oliver Hartkopp wrote: > > > I've seen the frame processing sometimes freezes for one second when > > > stressing the isotp_rcv() from multiple sources. This finally freezes > > > the entire softirq which is either not good and not needed as we only > > > need to fix this race for stress tests - and not for real world usage > > > that does not create this case. > > > > Hmmm, this doesn't sound good. Can you test with LOCKDEP enabled? > In kernel config? I enabled almost everything with LOCKing ;-) > > > CONFIG_LOCKDEP_SUPPORT=y > > CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y > > CONFIG_ASN1=y > CONFIG_UNINLINE_SPIN_UNLOCK=y > CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y > CONFIG_MUTEX_SPIN_ON_OWNER=y > CONFIG_RWSEM_SPIN_ON_OWNER=y > CONFIG_LOCK_SPIN_ON_OWNER=y > CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y > CONFIG_QUEUED_SPINLOCKS=y > CONFIG_ARCH_USE_QUEUED_RWLOCKS=y > CONFIG_QUEUED_RWLOCKS=y > CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE=y > CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE=y > CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y > CONFIG_FREEZER=y > > CONFIG_HWSPINLOCK=y > > CONFIG_I8253_LOCK=y > > # > # Debug Oops, Lockups and Hangs > # > # CONFIG_PANIC_ON_OOPS is not set > CONFIG_PANIC_ON_OOPS_VALUE=0 > CONFIG_PANIC_TIMEOUT=0 > CONFIG_LOCKUP_DETECTOR=y > CONFIG_SOFTLOCKUP_DETECTOR=y > # CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set > CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0 > CONFIG_HARDLOCKUP_DETECTOR_PERF=y > CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y > CONFIG_HARDLOCKUP_DETECTOR=y > # CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set > CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=0 > CONFIG_DETECT_HUNG_TASK=y > CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120 <--- the isotp timeout is 1000 ms what > I can observe here with v1 patch > > # CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set > CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=0 > # CONFIG_WQ_WATCHDOG is not set > # CONFIG_TEST_LOCKUP is not set > # end of Debug Oops, Lockups and Hangs > > Only this debugging stuff is not really enabled: > > # > # Lock Debugging (spinlocks, mutexes, etc...) > # > CONFIG_LOCK_DEBUGGING_SUPPORT=y > # CONFIG_PROVE_LOCKING is not set CONFIG_PROVE_LOCKING=y Marc > # CONFIG_LOCK_STAT is not set > # CONFIG_DEBUG_RT_MUTEXES is not set > # CONFIG_DEBUG_SPINLOCK is not set > # CONFIG_DEBUG_MUTEXES is not set > # CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set > # CONFIG_DEBUG_RWSEMS is not set > # CONFIG_DEBUG_LOCK_ALLOC is not set > # CONFIG_DEBUG_ATOMIC_SLEEP is not set > # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set > # CONFIG_LOCK_TORTURE_TEST is not set > # CONFIG_WW_MUTEX_SELFTEST is not set > # CONFIG_SCF_TORTURE_TEST is not set > # CONFIG_CSD_LOCK_WAIT_DEBUG is not set > # end of Lock Debugging (spinlocks, mutexes, etc...) > > Would this help to be enabled for this test (of the v1 patch? > > Best regards, > Oliver > >
Hello Marc, hello William, On 28.01.22 09:46, Marc Kleine-Budde wrote: > On 28.01.2022 09:32:40, Oliver Hartkopp wrote: >> >> >> On 28.01.22 09:07, Marc Kleine-Budde wrote: >>> On 28.01.2022 08:56:19, Oliver Hartkopp wrote: >>>> I've seen the frame processing sometimes freezes for one second when >>>> stressing the isotp_rcv() from multiple sources. This finally freezes >>>> the entire softirq which is either not good and not needed as we only >>>> need to fix this race for stress tests - and not for real world usage >>>> that does not create this case. >>> >>> Hmmm, this doesn't sound good. Can you test with LOCKDEP enabled? >> # >> # Lock Debugging (spinlocks, mutexes, etc...) >> # >> CONFIG_LOCK_DEBUGGING_SUPPORT=y >> # CONFIG_PROVE_LOCKING is not set > CONFIG_PROVE_LOCKING=y Now enabled even more locking (seen relevant kernel config at the end). It turns out that there is no visible difference when using spin_lock() or spin_trylock(). I only got some of these kernel log entries Jan 28 11:13:14 silver kernel: [ 2396.323211] perf: interrupt took too long (2549 > 2500), lowering kernel.perf_event_max_sample_rate to 78250 Jan 28 11:25:49 silver kernel: [ 3151.172773] perf: interrupt took too long (3188 > 3186), lowering kernel.perf_event_max_sample_rate to 62500 Jan 28 11:45:24 silver kernel: [ 4325.583328] perf: interrupt took too long (4009 > 3985), lowering kernel.perf_event_max_sample_rate to 49750 Jan 28 12:15:46 silver kernel: [ 6148.238246] perf: interrupt took too long (5021 > 5011), lowering kernel.perf_event_max_sample_rate to 39750 Jan 28 13:01:45 silver kernel: [ 8907.303715] perf: interrupt took too long (6285 > 6276), lowering kernel.perf_event_max_sample_rate to 31750 But I get these sporadically anyway. No other LOCKDEP splat. At least the issue reported by William should be fixed now - but I'm still unclear whether spin_lock() or spin_trylock() is the best approach here in the NET_RX softirq?!? Best regards, Oliver $ grep LOCK .config | grep -v BLOCK | grep -v CLOCK CONFIG_LOCKDEP_SUPPORT=y # CONFIG_PM_WAKELOCKS is not set CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y # CONFIG_LOCK_EVENT_COUNTS is not set CONFIG_UNINLINE_SPIN_UNLOCK=y CONFIG_LOCK_SPIN_ON_OWNER=y CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y CONFIG_QUEUED_SPINLOCKS=y CONFIG_ARCH_USE_QUEUED_RWLOCKS=y CONFIG_QUEUED_RWLOCKS=y CONFIG_SPLIT_PTLOCK_CPUS=4 CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK=y CONFIG_PCI_LOCKLESS_CONFIG=y # CONFIG_DRM_DEBUG_MODESET_LOCK is not set CONFIG_HWSPINLOCK=y CONFIG_I8253_LOCK=y CONFIG_FILE_LOCKING=y # CONFIG_SECURITY_LOCKDOWN_LSM is not set # CONFIG_SECURITY_LANDLOCK is not set # CONFIG_CRYPTO_DEV_PADLOCK is not set CONFIG_ARCH_USE_CMPXCHG_LOCKREF=y CONFIG_LOCKUP_DETECTOR=y CONFIG_SOFTLOCKUP_DETECTOR=y # CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0 CONFIG_HARDLOCKUP_DETECTOR_PERF=y CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y CONFIG_HARDLOCKUP_DETECTOR=y # CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=0 # CONFIG_TEST_LOCKUP is not set CONFIG_LOCK_DEBUGGING_SUPPORT=y CONFIG_PROVE_LOCKING=y CONFIG_PROVE_RAW_LOCK_NESTING=y # CONFIG_LOCK_STAT is not set CONFIG_DEBUG_SPINLOCK=y CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_LOCKDEP=y CONFIG_LOCKDEP_BITS=15 CONFIG_LOCKDEP_CHAINS_BITS=16 CONFIG_LOCKDEP_STACK_TRACE_BITS=19 CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14 CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12 CONFIG_DEBUG_LOCKDEP=y # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set # CONFIG_LOCK_TORTURE_TEST is not set # CONFIG_CSD_LOCK_WAIT_DEBUG is not set
On 28.01.2022 15:48:05, Oliver Hartkopp wrote: > Hello Marc, hello William, > > On 28.01.22 09:46, Marc Kleine-Budde wrote: > > On 28.01.2022 09:32:40, Oliver Hartkopp wrote: > > > > > > > > > On 28.01.22 09:07, Marc Kleine-Budde wrote: > > > > On 28.01.2022 08:56:19, Oliver Hartkopp wrote: > > > > > I've seen the frame processing sometimes freezes for one second when > > > > > stressing the isotp_rcv() from multiple sources. This finally freezes > > > > > the entire softirq which is either not good and not needed as we only > > > > > need to fix this race for stress tests - and not for real world usage > > > > > that does not create this case. > > > > > > > > Hmmm, this doesn't sound good. Can you test with LOCKDEP enabled? > > > > > # > > > # Lock Debugging (spinlocks, mutexes, etc...) > > > # > > > CONFIG_LOCK_DEBUGGING_SUPPORT=y > > > # CONFIG_PROVE_LOCKING is not set > > CONFIG_PROVE_LOCKING=y > > Now enabled even more locking (seen relevant kernel config at the end). > > It turns out that there is no visible difference when using spin_lock() or > spin_trylock(). > > I only got some of these kernel log entries > > Jan 28 11:13:14 silver kernel: [ 2396.323211] perf: interrupt took too long > (2549 > 2500), lowering kernel.perf_event_max_sample_rate to 78250 > Jan 28 11:25:49 silver kernel: [ 3151.172773] perf: interrupt took too long > (3188 > 3186), lowering kernel.perf_event_max_sample_rate to 62500 > Jan 28 11:45:24 silver kernel: [ 4325.583328] perf: interrupt took too long > (4009 > 3985), lowering kernel.perf_event_max_sample_rate to 49750 > Jan 28 12:15:46 silver kernel: [ 6148.238246] perf: interrupt took too long > (5021 > 5011), lowering kernel.perf_event_max_sample_rate to 39750 > Jan 28 13:01:45 silver kernel: [ 8907.303715] perf: interrupt took too long > (6285 > 6276), lowering kernel.perf_event_max_sample_rate to 31750 > > But I get these sporadically anyway. No other LOCKDEP splat. > > At least the issue reported by William should be fixed now - but I'm still > unclear whether spin_lock() or spin_trylock() is the best approach here in > the NET_RX softirq?!? With the !spin_trylock() -> return you are saying if something concurrent happens, drop it. This doesn't sound correct. Marc
Hi Marc, On 07.02.22 09:11, Marc Kleine-Budde wrote: > On 28.01.2022 15:48:05, Oliver Hartkopp wrote: >> Hello Marc, hello William, >> >> On 28.01.22 09:46, Marc Kleine-Budde wrote: >>> On 28.01.2022 09:32:40, Oliver Hartkopp wrote: >>>> >>>> >>>> On 28.01.22 09:07, Marc Kleine-Budde wrote: >>>>> On 28.01.2022 08:56:19, Oliver Hartkopp wrote: >>>>>> I've seen the frame processing sometimes freezes for one second when >>>>>> stressing the isotp_rcv() from multiple sources. This finally freezes >>>>>> the entire softirq which is either not good and not needed as we only >>>>>> need to fix this race for stress tests - and not for real world usage >>>>>> that does not create this case. >>>>> >>>>> Hmmm, this doesn't sound good. Can you test with LOCKDEP enabled? >> >> >>>> # >>>> # Lock Debugging (spinlocks, mutexes, etc...) >>>> # >>>> CONFIG_LOCK_DEBUGGING_SUPPORT=y >>>> # CONFIG_PROVE_LOCKING is not set >>> CONFIG_PROVE_LOCKING=y >> >> Now enabled even more locking (seen relevant kernel config at the end). >> >> It turns out that there is no visible difference when using spin_lock() or >> spin_trylock(). >> >> I only got some of these kernel log entries >> >> Jan 28 11:13:14 silver kernel: [ 2396.323211] perf: interrupt took too long >> (2549 > 2500), lowering kernel.perf_event_max_sample_rate to 78250 >> Jan 28 11:25:49 silver kernel: [ 3151.172773] perf: interrupt took too long >> (3188 > 3186), lowering kernel.perf_event_max_sample_rate to 62500 >> Jan 28 11:45:24 silver kernel: [ 4325.583328] perf: interrupt took too long >> (4009 > 3985), lowering kernel.perf_event_max_sample_rate to 49750 >> Jan 28 12:15:46 silver kernel: [ 6148.238246] perf: interrupt took too long >> (5021 > 5011), lowering kernel.perf_event_max_sample_rate to 39750 >> Jan 28 13:01:45 silver kernel: [ 8907.303715] perf: interrupt took too long >> (6285 > 6276), lowering kernel.perf_event_max_sample_rate to 31750 >> >> But I get these sporadically anyway. No other LOCKDEP splat. >> >> At least the issue reported by William should be fixed now - but I'm still >> unclear whether spin_lock() or spin_trylock() is the best approach here in >> the NET_RX softirq?!? > > With the !spin_trylock() -> return you are saying if something > concurrent happens, drop it. This doesn't sound correct. Yes, I had the same feeling and did some extensive load tests using both variants. It turned out the standard spin_lock() works excellent to fix the issue. Thanks for taking it for upstream here: https://lore.kernel.org/linux-can/20220209074818.3ylfz4zmuhit7orc@pengutronix.de/T/#t Best regards, Oliver
diff --git a/net/can/isotp.c b/net/can/isotp.c index df6968b28bf4..a4b174f860f3 100644 --- a/net/can/isotp.c +++ b/net/can/isotp.c @@ -443,8 +443,10 @@ static int isotp_rcv_ff(struct sock *sk, struct canfd_frame *cf, int ae) int off; int ff_pci_sz; + if (so->rx.state != ISOTP_IDLE) + return 0; + hrtimer_cancel(&so->rxtimer); - so->rx.state = ISOTP_IDLE; /* get the used sender LL_DL from the (first) CAN frame data length */ so->rx.ll_dl = padlen(cf->len); @@ -518,8 +520,6 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, so->lastrxcf_tstamp = skb->tstamp; } - hrtimer_cancel(&so->rxtimer); - /* CFs are never longer than the FF */ if (cf->len > so->rx.ll_dl) return 1; @@ -531,15 +531,15 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, return 1; } + hrtimer_cancel(&so->rxtimer); + if ((cf->data[ae] & 0x0F) != so->rx.sn) { /* wrong sn detected - report 'illegal byte sequence' */ sk->sk_err = EILSEQ; if (!sock_flag(sk, SOCK_DEAD)) sk_error_report(sk); - /* reset rx state */ - so->rx.state = ISOTP_IDLE; - return 1; + goto err_out; } so->rx.sn++; so->rx.sn %= 16; @@ -551,21 +551,18 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, } if (so->rx.idx >= so->rx.len) { - /* we are done */ - so->rx.state = ISOTP_IDLE; - if ((so->opt.flags & ISOTP_CHECK_PADDING) && check_pad(so, cf, i + 1, so->opt.rxpad_content)) { /* malformed PDU - report 'not a data message' */ sk->sk_err = EBADMSG; if (!sock_flag(sk, SOCK_DEAD)) sk_error_report(sk); - return 1; + goto err_out; } nskb = alloc_skb(so->rx.len, gfp_any()); if (!nskb) - return 1; + goto err_out; memcpy(skb_put(nskb, so->rx.len), so->rx.buf, so->rx.len); @@ -573,6 +570,10 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, nskb->tstamp = skb->tstamp; nskb->dev = skb->dev; isotp_rcv_skb(nskb, sk); + + /* we are done */ + so->rx.state = ISOTP_IDLE; + return 0; } @@ -591,6 +592,11 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae, /* we reached the specified blocksize so->rxfc.bs */ isotp_send_fc(sk, ae, ISOTP_FC_CTS); return 0; + +err_out: + /* reset rx state */ + so->rx.state = ISOTP_IDLE; + return 1; } static void isotp_rcv(struct sk_buff *skb, void *data)
When receive a FF, the current code logic does not consider the real so->rx.state but set so->rx.state to ISOTP_IDLE directly. That will make so->rx accessed by multiple receiving processes concurrently. The following syz problem is one of the scenarios. so->rx.len is changed by isotp_rcv_ff() during isotp_rcv_cf(), so->rx.len equals 0 before alloc_skb() and equals 4096 after alloc_skb(). That will trigger skb_over_panic() in skb_put(). ======================================================= CPU: 1 PID: 19 Comm: ksoftirqd/1 Not tainted 5.16.0-rc8-syzkaller #0 RIP: 0010:skb_panic+0x16c/0x16e net/core/skbuff.c:113 Call Trace: <TASK> skb_over_panic net/core/skbuff.c:118 [inline] skb_put.cold+0x24/0x24 net/core/skbuff.c:1990 isotp_rcv_cf net/can/isotp.c:570 [inline] isotp_rcv+0xa38/0x1e30 net/can/isotp.c:668 deliver net/can/af_can.c:574 [inline] can_rcv_filter+0x445/0x8d0 net/can/af_can.c:635 can_receive+0x31d/0x580 net/can/af_can.c:665 can_rcv+0x120/0x1c0 net/can/af_can.c:696 __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5465 __netif_receive_skb+0x24/0x1b0 net/core/dev.c:5579 Check so->rx.state equals ISOTP_IDLE firstly in isotp_rcv_ff(). Make sure so->rx idle when receive another new packet. And set so->rx.state to ISOTP_IDLE after whole packet being received. Fixes: e057dd3fc20f ("can: add ISO 15765-2:2016 transport protocol") Reported-by: syzbot+4c63f36709a642f801c5@syzkaller.appspotmail.com Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> --- net/can/isotp.c | 28 +++++++++++++++++----------- 1 file changed, 17 insertions(+), 11 deletions(-)