Message ID | 20240328170309.2172584-4-edumazet@google.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | net: rps: misc changes | expand |
On Fri, Mar 29, 2024 at 1:07 AM Eric Dumazet <edumazet@google.com> wrote: > > If the device attached to the packet given to enqueue_to_backlog() > is not running, we drop the packet. > > But we accidentally increase sd->dropped, giving false signals > to admins: sd->dropped should be reserved to cpu backlog pressure, > not to temporary glitches at device dismantles. It seems that drop action happening is intended in this case (see commit e9e4dd3267d0c ("net: do not process device backlog during unregistration")). We can see the strange/unexpected behaviour at least through simply taking a look at /proc/net/softnet_stat file. > > While we are at it, perform the netif_running() test before > we get the rps lock, and use REASON_DEV_READY > drop reason instead of NOT_SPECIFIED. > > Signed-off-by: Eric Dumazet <edumazet@google.com> > --- > net/core/dev.c | 9 +++++---- > 1 file changed, 5 insertions(+), 4 deletions(-) > > diff --git a/net/core/dev.c b/net/core/dev.c > index 5d36a634f468ffdeaca598c3dd033fe06d240bd0..af7a34b0a7d6683c6ffb21dd3388ed678473d95e 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -4791,12 +4791,13 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu, > unsigned long flags; > unsigned int qlen; > > - reason = SKB_DROP_REASON_NOT_SPECIFIED; > + reason = SKB_DROP_REASON_DEV_READY; > + if (!netif_running(skb->dev)) > + goto bad_dev; > + > sd = &per_cpu(softnet_data, cpu); > > backlog_lock_irq_save(sd, &flags); > - if (!netif_running(skb->dev)) > - goto drop; > qlen = skb_queue_len(&sd->input_pkt_queue); > if (qlen <= READ_ONCE(net_hotdata.max_backlog) && > !skb_flow_limit(skb, qlen)) { > @@ -4817,10 +4818,10 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu, > } > reason = SKB_DROP_REASON_CPU_BACKLOG; > > -drop: > sd->dropped++; > backlog_unlock_irq_restore(sd, &flags); > > +bad_dev: > dev_core_stats_rx_dropped_inc(skb->dev); > kfree_skb_reason(skb, reason); > return NET_RX_DROP; > -- > 2.44.0.478.gd926399ef9-goog > >
On Fri, Mar 29, 2024 at 4:21 AM Jason Xing <kerneljasonxing@gmail.com> wrote: > > On Fri, Mar 29, 2024 at 1:07 AM Eric Dumazet <edumazet@google.com> wrote: > > > > If the device attached to the packet given to enqueue_to_backlog() > > is not running, we drop the packet. > > > > But we accidentally increase sd->dropped, giving false signals > > to admins: sd->dropped should be reserved to cpu backlog pressure, > > not to temporary glitches at device dismantles. > > It seems that drop action happening is intended in this case (see > commit e9e4dd3267d0c ("net: do not process device backlog during > unregistration")). We can see the strange/unexpected behaviour at > least through simply taking a look at /proc/net/softnet_stat file. I disagree. We are dismantling a device, temporary drops are expected, and this patch adds a more precise drop_reason. I have seen admins being worried about this counter being not zero on carefully tuned hosts. If you think we have to carry these drops forever in /proc/net/softnet_stat, you will have give a more precise reason than "This was intentionally added in 2015" e9e4dd3267d0c5234c changelog was very long, but said nothing about why sd->dropped _had_ to be updated.
On Fri, Mar 29, 2024 at 2:31 PM Eric Dumazet <edumazet@google.com> wrote: > > On Fri, Mar 29, 2024 at 4:21 AM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > On Fri, Mar 29, 2024 at 1:07 AM Eric Dumazet <edumazet@google.com> wrote: > > > > > > If the device attached to the packet given to enqueue_to_backlog() > > > is not running, we drop the packet. > > > > > > But we accidentally increase sd->dropped, giving false signals > > > to admins: sd->dropped should be reserved to cpu backlog pressure, > > > not to temporary glitches at device dismantles. > > > > It seems that drop action happening is intended in this case (see > > commit e9e4dd3267d0c ("net: do not process device backlog during > > unregistration")). We can see the strange/unexpected behaviour at > > least through simply taking a look at /proc/net/softnet_stat file. > > I disagree. > > We are dismantling a device, temporary drops are expected, and this > patch adds a more precise drop_reason. > > I have seen admins being worried about this counter being not zero on > carefully tuned hosts. > > If you think we have to carry these drops forever in > /proc/net/softnet_stat, you will have give > a more precise reason than "This was intentionally added in 2015" > > e9e4dd3267d0c5234c changelog was very long, but said nothing > about why sd->dropped _had_ to be updated. Indeed, the author might think the skb is dropped in the RPS process so it's necessary to record that and reflect these actions in this proc file. Don't get me wrong. I'm not against what you did, just proposing my concern about changing the meaning/understanding of 'drop'. Thanks, Jason
diff --git a/net/core/dev.c b/net/core/dev.c index 5d36a634f468ffdeaca598c3dd033fe06d240bd0..af7a34b0a7d6683c6ffb21dd3388ed678473d95e 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4791,12 +4791,13 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu, unsigned long flags; unsigned int qlen; - reason = SKB_DROP_REASON_NOT_SPECIFIED; + reason = SKB_DROP_REASON_DEV_READY; + if (!netif_running(skb->dev)) + goto bad_dev; + sd = &per_cpu(softnet_data, cpu); backlog_lock_irq_save(sd, &flags); - if (!netif_running(skb->dev)) - goto drop; qlen = skb_queue_len(&sd->input_pkt_queue); if (qlen <= READ_ONCE(net_hotdata.max_backlog) && !skb_flow_limit(skb, qlen)) { @@ -4817,10 +4818,10 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu, } reason = SKB_DROP_REASON_CPU_BACKLOG; -drop: sd->dropped++; backlog_unlock_irq_restore(sd, &flags); +bad_dev: dev_core_stats_rx_dropped_inc(skb->dev); kfree_skb_reason(skb, reason); return NET_RX_DROP;
If the device attached to the packet given to enqueue_to_backlog() is not running, we drop the packet. But we accidentally increase sd->dropped, giving false signals to admins: sd->dropped should be reserved to cpu backlog pressure, not to temporary glitches at device dismantles. While we are at it, perform the netif_running() test before we get the rps lock, and use REASON_DEV_READY drop reason instead of NOT_SPECIFIED. Signed-off-by: Eric Dumazet <edumazet@google.com> --- net/core/dev.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)