diff mbox series

2nd RTM_NEWLINK notification with operstate down is always 1 second delayed

Message ID DS7PR84MB303940368E1CC7CE98A49E96D70F2@DS7PR84MB3039.NAMPRD84.PROD.OUTLOOK.COM (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series 2nd RTM_NEWLINK notification with operstate down is always 1 second delayed | expand

Checks

Context Check Description
netdev/tree_selection success Guessing tree name failed - patch did not apply

Commit Message

Tom, Deepak Abraham April 17, 2024, 5:37 p.m. UTC
Hi!

I have a system configured with 2 physical eth interfaces connected to a switch.
When I reboot the switch, I see that the userspace RTM_NEWLINK notifications for the interfaces are always 1 second apart although both links actually go down almost simultaneously!
The subsequent RTM_NEWLINK notifications when the switch comes back up are however only delayed by a few microseconds between each other, which is as expected.

Turns out this delay is intentionally introudced by the linux kernel networking code in net/core/link_watch.c, last modified 17 years ago in commit 294cc44:
         /*
          * Limit the number of linkwatch events to one
          * per second so that a runaway driver does not
          * cause a storm of messages on the netlink
          * socket.  This limit does not apply to up events
          * while the device qdisc is down.
          */


On modern high performance systems, limiting the number of down events to just one per second have far reaching consequences.
I was wondering if it would be advisable to reduce this delay to something smaller, say 5ms (so 5ms+scheduling delay practically):


I have tested this change in my environment, and it works as expected. I don't see any new issues popping up because of this.

Are there any concerns with making this change today? Hoping to get some feedback.


Thank You,
Deepak Abraham Tom

Comments

Stephen Hemminger April 17, 2024, 10:33 p.m. UTC | #1
On Wed, 17 Apr 2024 17:37:40 +0000
"Tom, Deepak Abraham" <deepak-abraham.tom@hpe.com> wrote:

> Hi!
> 
> I have a system configured with 2 physical eth interfaces connected to a switch.
> When I reboot the switch, I see that the userspace RTM_NEWLINK notifications for the interfaces are always 1 second apart although both links actually go down almost simultaneously!
> The subsequent RTM_NEWLINK notifications when the switch comes back up are however only delayed by a few microseconds between each other, which is as expected.
> 
> Turns out this delay is intentionally introudced by the linux kernel networking code in net/core/link_watch.c, last modified 17 years ago in commit 294cc44:
>          /*
>           * Limit the number of linkwatch events to one
>           * per second so that a runaway driver does not
>           * cause a storm of messages on the netlink
>           * socket.  This limit does not apply to up events
>           * while the device qdisc is down.
>           */
> 
> 
> On modern high performance systems, limiting the number of down events to just one per second have far reaching consequences.
> I was wondering if it would be advisable to reduce this delay to something smaller, say 5ms (so 5ms+scheduling delay practically):

The reason is that for systems that are connected to the Internet with routing daemons
the impact of link state change is huge. A single link transistion may keep FRR (nee Quagga)
busy for a several seconds as it linearly evaluates 3 Million route entries. Maybe more recent
versions of FRR got smarter. This is also to avoid routing daemon propagating lots of changes
a.k.a route flap.
Tom, Deepak Abraham April 18, 2024, 7:26 p.m. UTC | #2
Maybe I'm missing something, but could you please explain how this really helps to not keep FRR busy?
If I understood this right, the link watch code does not ignore events but merely delays them. So any link transition will be propagated whether its scheduled urgently or not urgently.
So FRR will have to still deal with each transition keeping it busy with or without this change, unless FRR dampens flaps on its own?

Also from a design perspective, would it be better if FRR's issues with route flaps be dealt directly in FRR code itself? That way, in use cases where FRR does not come in to play, such a delay is not causing other consequences? Are there more such situations where such a delay is absolutely required?

Thank You,
Deepak Abraham Tom

-----Original Message-----
From: Stephen Hemminger <stephen@networkplumber.org> 
Sent: Wednesday, April 17, 2024 4:34 PM
To: Tom, Deepak Abraham <deepak-abraham.tom@hpe.com>
Cc: netdev@vger.kernel.org
Subject: Re: 2nd RTM_NEWLINK notification with operstate down is always 1 second delayed

On Wed, 17 Apr 2024 17:37:40 +0000
"Tom, Deepak Abraham" <deepak-abraham.tom@hpe.com> wrote:

> Hi!
> 
> I have a system configured with 2 physical eth interfaces connected to a switch.
> When I reboot the switch, I see that the userspace RTM_NEWLINK notifications for the interfaces are always 1 second apart although both links actually go down almost simultaneously!
> The subsequent RTM_NEWLINK notifications when the switch comes back up are however only delayed by a few microseconds between each other, which is as expected.
> 
> Turns out this delay is intentionally introudced by the linux kernel networking code in net/core/link_watch.c, last modified 17 years ago in commit 294cc44:
>          /*
>           * Limit the number of linkwatch events to one
>           * per second so that a runaway driver does not
>           * cause a storm of messages on the netlink
>           * socket.  This limit does not apply to up events
>           * while the device qdisc is down.
>           */
> 
> 
> On modern high performance systems, limiting the number of down events to just one per second have far reaching consequences.
> I was wondering if it would be advisable to reduce this delay to something smaller, say 5ms (so 5ms+scheduling delay practically):

The reason is that for systems that are connected to the Internet with routing daemons the impact of link state change is huge. A single link transistion may keep FRR (nee Quagga) busy for a several seconds as it linearly evaluates 3 Million route entries. Maybe more recent versions of FRR got smarter. This is also to avoid routing daemon propagating lots of changes a.k.a route flap.
Stephen Hemminger April 19, 2024, 4:31 p.m. UTC | #3
On Thu, 18 Apr 2024 19:26:51 +0000
"Tom, Deepak Abraham" <deepak-abraham.tom@hpe.com> wrote:

> Maybe I'm missing something, but could you please explain how this really helps to not keep FRR busy?
> If I understood this right, the link watch code does not ignore events but merely delays them. So any link transition will be propagated whether its scheduled urgently or not urgently.
> So FRR will have to still deal with each transition keeping it busy with or without this change, unless FRR dampens flaps on its own?
>

A poor connection to a switch can cause repeated link down/up. I haven't seen it in person,
but have had to deal with user reports of poor router connections.

> Also from a design perspective, would it be better if FRR's issues with route flaps be dealt directly in FRR code itself? That way, in use cases where FRR does not come in to play, such a delay is not causing other consequences? Are there more such situations where such a delay is absolutely required?

Too late, now. Can't change Linux semantics without breaking many things. And it impacts not just FRR.
diff mbox series

Patch

--- a/net/core/link_watch.c
+++ b/net/core/link_watch.c
@@ -130,8 +130,8 @@  static void linkwatch_schedule_work(int urgent)
                delay = 0;
        }

-       /* If we wrap around we'll delay it by at most HZ. */
-       if (delay > HZ)
+       /* If we wrap around we'll delay it by at most HZ/200. */
+       if (delay > (HZ/200))
                delay = 0;

        /*
@@ -187,15 +187,15 @@  static void __linkwatch_run_queue(int urgent_only)

        /*
         * Limit the number of linkwatch events to one
-        * per second so that a runaway driver does not
+        * per 5 millisecond so that a runaway driver does not
         * cause a storm of messages on the netlink
         * socket.  This limit does not apply to up events
         * while the device qdisc is down.
         */
        if (!urgent_only)
-               linkwatch_nextevent = jiffies + HZ;
+               linkwatch_nextevent = jiffies + (HZ/200);
        /* Limit wrap-around effect on delay. */
-       else if (time_after(linkwatch_nextevent, jiffies + HZ))
+       else if (time_after(linkwatch_nextevent, jiffies + (HZ/200)))
                linkwatch_nextevent = jiffies;

        clear_bit(LW_URGENT, &linkwatch_flags);