Message ID | 20240823021333.1252272-2-johunt@akamai.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | NULL ptr dereference in tcp_rearm_rto | expand |
Hello Josh, On Fri, Aug 23, 2024 at 11:02 AM Josh Hunt <johunt@akamai.com> wrote: > > There have been multiple occassions where we have crashed in this path > because packets_out suggested there were packets on the write or retransmit > queues, but in fact there weren't leading to a NULL skb being dereferenced. Could you show us the detailed splats and more information about it so that we can know what exactly happened? Thanks, Jason
On 8/22/24 8:27 PM, Jason Xing wrote: > > Hello Josh, > > On Fri, Aug 23, 2024 at 11:02 AM Josh Hunt <johunt@akamai.com> wrote: >> >> There have been multiple occassions where we have crashed in this path >> because packets_out suggested there were packets on the write or retransmit >> queues, but in fact there weren't leading to a NULL skb being dereferenced. > > Could you show us the detailed splats and more information about it so > that we can know what exactly happened? Hey Jason Yeah for some reason my cover letter did not come through which has the oops info that we hit. I'll resend it now. Fingers crossed it goes through this time :) Josh
On 8/22/24 8:33 PM, Josh Hunt wrote: > On 8/22/24 8:27 PM, Jason Xing wrote: >> >> Hello Josh, >> >> On Fri, Aug 23, 2024 at 11:02 AM Josh Hunt <johunt@akamai.com> wrote: >>> >>> There have been multiple occassions where we have crashed in this path >>> because packets_out suggested there were packets on the write or >>> retransmit >>> queues, but in fact there weren't leading to a NULL skb being >>> dereferenced. >> >> Could you show us the detailed splats and more information about it so >> that we can know what exactly happened? > > Hey Jason > > Yeah for some reason my cover letter did not come through which has the > oops info that we hit. I'll resend it now. Fingers crossed it goes > through this time :) > > Josh Seems like our mail server is block the cover letter for some reason right now. I'll have to figure out why tomorrow. I filed a bug with Ubuntu as well as sending this patch upstream b/c the kernel we're running is a stock Ubuntu kernel. The bug report there has most of what I put in the cover letter: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2077657 Josh
On Fri, Aug 23, 2024 at 4:14 AM Josh Hunt <johunt@akamai.com> wrote: > > There have been multiple occassions where we have crashed in this path > because packets_out suggested there were packets on the write or retransmit > queues, but in fact there weren't leading to a NULL skb being dereferenced. > While we should fix that root cause we should also just make sure the skb > is not NULL before dereferencing it. Also add a warn once here to capture > some information if/when the problem case is hit again. > > Signed-off-by: Josh Hunt <johunt@akamai.com> > --- > include/net/tcp.h | 15 ++++++++++++--- > 1 file changed, 12 insertions(+), 3 deletions(-) > > diff --git a/include/net/tcp.h b/include/net/tcp.h > index 2aac11e7e1cc..19ea6ed87880 100644 > --- a/include/net/tcp.h > +++ b/include/net/tcp.h > @@ -2433,10 +2433,19 @@ void tcp_plb_update_state_upon_rto(struct sock *sk, struct tcp_plb_state *plb); > static inline s64 tcp_rto_delta_us(const struct sock *sk) > { > const struct sk_buff *skb = tcp_rtx_queue_head(sk); > - u32 rto = inet_csk(sk)->icsk_rto; > - u64 rto_time_stamp_us = tcp_skb_timestamp_us(skb) + jiffies_to_usecs(rto); > + u32 rto = jiffies_to_usecs(inet_csk(sk)->icsk_rto); > + > + if (likely(skb)) { > + u64 rto_time_stamp_us = tcp_skb_timestamp_us(skb) + rto; > + > + return rto_time_stamp_us - tcp_sk(sk)->tcp_mstamp; > + } else { > + WARN_ONCE(1, > + "rtx queue emtpy: inflight %u tlp_high_seq %u state %u\n", > + tcp_sk(sk)->packets_out, tcp_sk(sk)->tlp_high_seq, sk->sk_state); > + return rto; > + } > > - return rto_time_stamp_us - tcp_sk(sk)->tcp_mstamp; > } > > /* > -- > 2.34.1 > Are you using a recent linux kernel version ? I am asking because sometimes patches are submitted while the authors are using very old kernels, and they do not state this clearly. I have never seen such a state. Please CC Neal on your next submission.
On Fri, Aug 23, 2024 at 2:44 PM Eric Dumazet <edumazet@google.com> wrote: > > On Fri, Aug 23, 2024 at 4:14 AM Josh Hunt <johunt@akamai.com> wrote: > > > > There have been multiple occassions where we have crashed in this path > > because packets_out suggested there were packets on the write or retransmit > > queues, but in fact there weren't leading to a NULL skb being dereferenced. > > While we should fix that root cause we should also just make sure the skb > > is not NULL before dereferencing it. Also add a warn once here to capture > > some information if/when the problem case is hit again. > > > > Signed-off-by: Josh Hunt <johunt@akamai.com> > > --- > > include/net/tcp.h | 15 ++++++++++++--- > > 1 file changed, 12 insertions(+), 3 deletions(-) > > > > diff --git a/include/net/tcp.h b/include/net/tcp.h > > index 2aac11e7e1cc..19ea6ed87880 100644 > > --- a/include/net/tcp.h > > +++ b/include/net/tcp.h > > @@ -2433,10 +2433,19 @@ void tcp_plb_update_state_upon_rto(struct sock *sk, struct tcp_plb_state *plb); > > static inline s64 tcp_rto_delta_us(const struct sock *sk) > > { > > const struct sk_buff *skb = tcp_rtx_queue_head(sk); > > - u32 rto = inet_csk(sk)->icsk_rto; > > - u64 rto_time_stamp_us = tcp_skb_timestamp_us(skb) + jiffies_to_usecs(rto); > > + u32 rto = jiffies_to_usecs(inet_csk(sk)->icsk_rto); > > + > > + if (likely(skb)) { > > + u64 rto_time_stamp_us = tcp_skb_timestamp_us(skb) + rto; > > + > > + return rto_time_stamp_us - tcp_sk(sk)->tcp_mstamp; > > + } else { > > + WARN_ONCE(1, > > + "rtx queue emtpy: inflight %u tlp_high_seq %u state %u\n", > > + tcp_sk(sk)->packets_out, tcp_sk(sk)->tlp_high_seq, sk->sk_state); > > + return rto; > > + } > > > > - return rto_time_stamp_us - tcp_sk(sk)->tcp_mstamp; > > } > > > > /* > > -- > > 2.34.1 > > > > Are you using a recent linux kernel version ? I'm afraid that he doesn't. What that link shows to me is the 5.4.0-174-generic ubuntu kernel.
diff --git a/include/net/tcp.h b/include/net/tcp.h index 2aac11e7e1cc..19ea6ed87880 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -2433,10 +2433,19 @@ void tcp_plb_update_state_upon_rto(struct sock *sk, struct tcp_plb_state *plb); static inline s64 tcp_rto_delta_us(const struct sock *sk) { const struct sk_buff *skb = tcp_rtx_queue_head(sk); - u32 rto = inet_csk(sk)->icsk_rto; - u64 rto_time_stamp_us = tcp_skb_timestamp_us(skb) + jiffies_to_usecs(rto); + u32 rto = jiffies_to_usecs(inet_csk(sk)->icsk_rto); + + if (likely(skb)) { + u64 rto_time_stamp_us = tcp_skb_timestamp_us(skb) + rto; + + return rto_time_stamp_us - tcp_sk(sk)->tcp_mstamp; + } else { + WARN_ONCE(1, + "rtx queue emtpy: inflight %u tlp_high_seq %u state %u\n", + tcp_sk(sk)->packets_out, tcp_sk(sk)->tlp_high_seq, sk->sk_state); + return rto; + } - return rto_time_stamp_us - tcp_sk(sk)->tcp_mstamp; } /*
There have been multiple occassions where we have crashed in this path because packets_out suggested there were packets on the write or retransmit queues, but in fact there weren't leading to a NULL skb being dereferenced. While we should fix that root cause we should also just make sure the skb is not NULL before dereferencing it. Also add a warn once here to capture some information if/when the problem case is hit again. Signed-off-by: Josh Hunt <johunt@akamai.com> --- include/net/tcp.h | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-)