diff mbox series

[net] net: fix raising a softirq on the current cpu with rps enabled

Message ID 20230325152417.5403-1-kerneljasonxing@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net] net: fix raising a softirq on the current cpu with rps enabled | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 26 this patch: 26
netdev/cc_maintainers success CCed 6 of 6 maintainers
netdev/build_clang success Errors and warnings before: 18 this patch: 18
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 26 this patch: 26
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 8 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Jason Xing March 25, 2023, 3:24 p.m. UTC
From: Jason Xing <kernelxing@tencent.com>

Since we decide to put the skb into a backlog queue of another
cpu, we should not raise the softirq for the current cpu. When
to raise a softirq is based on whether we have more data left to
process later. As to the current cpu, there is no indication of
more data enqueued, so we do not need this action. After enqueuing
to another cpu, net_rx_action() function will call ipi and then
another cpu will raise the softirq as expected.

Also, raising more softirqs which set the corresponding bit field
can make the IRQ mechanism think we probably need to start ksoftirqd
on the current cpu. Actually it shouldn't happen.

Fixes: 0a9627f2649a ("rps: Receive Packet Steering")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/core/dev.c | 2 --
 1 file changed, 2 deletions(-)

Comments

Eric Dumazet March 25, 2023, 3:56 p.m. UTC | #1
On Sat, Mar 25, 2023 at 8:26 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> From: Jason Xing <kernelxing@tencent.com>
>
> Since we decide to put the skb into a backlog queue of another
> cpu, we should not raise the softirq for the current cpu. When
> to raise a softirq is based on whether we have more data left to
> process later. As to the current cpu, there is no indication of
> more data enqueued, so we do not need this action. After enqueuing
> to another cpu, net_rx_action() function will call ipi and then
> another cpu will raise the softirq as expected.
>
> Also, raising more softirqs which set the corresponding bit field
> can make the IRQ mechanism think we probably need to start ksoftirqd
> on the current cpu. Actually it shouldn't happen.
>
> Fixes: 0a9627f2649a ("rps: Receive Packet Steering")
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
>  net/core/dev.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1518a366783b..bfaaa652f50c 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4594,8 +4594,6 @@ static int napi_schedule_rps(struct softnet_data *sd)
>         if (sd != mysd) {
>                 sd->rps_ipi_next = mysd->rps_ipi_list;
>                 mysd->rps_ipi_list = sd;
> -
> -               __raise_softirq_irqoff(NET_RX_SOFTIRQ);
>                 return 1;
>         }
>  #endif /* CONFIG_RPS */
> --
> 2.37.3
>

This is not going to work in some cases. Please take a deeper look.

I have to run, if you (or others) do not find the reason, I will give
more details when I am done traveling.
Jason Xing March 26, 2023, 3:27 a.m. UTC | #2
On Sat, Mar 25, 2023 at 11:57 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sat, Mar 25, 2023 at 8:26 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > Since we decide to put the skb into a backlog queue of another
> > cpu, we should not raise the softirq for the current cpu. When
> > to raise a softirq is based on whether we have more data left to
> > process later. As to the current cpu, there is no indication of
> > more data enqueued, so we do not need this action. After enqueuing
> > to another cpu, net_rx_action() function will call ipi and then
> > another cpu will raise the softirq as expected.
> >
> > Also, raising more softirqs which set the corresponding bit field
> > can make the IRQ mechanism think we probably need to start ksoftirqd
> > on the current cpu. Actually it shouldn't happen.
> >
> > Fixes: 0a9627f2649a ("rps: Receive Packet Steering")
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> >  net/core/dev.c | 2 --
> >  1 file changed, 2 deletions(-)
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 1518a366783b..bfaaa652f50c 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4594,8 +4594,6 @@ static int napi_schedule_rps(struct softnet_data *sd)
> >         if (sd != mysd) {
> >                 sd->rps_ipi_next = mysd->rps_ipi_list;
> >                 mysd->rps_ipi_list = sd;
> > -
> > -               __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> >                 return 1;
> >         }
> >  #endif /* CONFIG_RPS */
> > --
> > 2.37.3
> >
>
> This is not going to work in some cases. Please take a deeper look.

I'll do it. I've already been struggling with this for a few days. I
still have no clue.

I found out in some cases that frequently starting ksoftirqd is not
that good because this thread may be blocked when waiting in the
runqueue.
So I tried to avoid raising softirqs to mitigate the issue and then I
noticed this one.

>
> I have to run, if you (or others) do not find the reason, I will give
> more details when I am done traveling.

Happy traveling :)

Thanks,
Jason
Jason Xing March 26, 2023, 3:30 a.m. UTC | #3
On Sun, Mar 26, 2023 at 9:39 AM Hillf Danton <hdanton@sina.com> wrote:
>
> On Sat, Mar 25, 2023 at 8:26 AM Jason Xing <kerneljasonxing@gmail.com>
> >
> > @@ -4594,8 +4594,6 @@ static int napi_schedule_rps(struct softnet_data *sd)
> >         if (sd !=3D mysd) {
> >                 sd->rps_ipi_next =3D mysd->rps_ipi_list;
> >                 mysd->rps_ipi_list =3D sd;
> > -
> > -               __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> >                 return 1;
> >         }
>
> Nope, ipi should be sent. But no ipi can go without irq enabled.
>

Sorry, I didn't get it. IPI is sent in net_rx_action() and apparently
I didn't touch this part in this patch. Here is only about whether we
should raise an IRQ even if the skb will be enqueued into another cpu.

> So feel free to work out what sense made by disabling irq at the call site
> then directly send ipi instead.
Jason Xing March 26, 2023, 4:04 a.m. UTC | #4
On Sat, Mar 25, 2023 at 11:57 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sat, Mar 25, 2023 at 8:26 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > Since we decide to put the skb into a backlog queue of another
> > cpu, we should not raise the softirq for the current cpu. When
> > to raise a softirq is based on whether we have more data left to
> > process later. As to the current cpu, there is no indication of
> > more data enqueued, so we do not need this action. After enqueuing
> > to another cpu, net_rx_action() function will call ipi and then
> > another cpu will raise the softirq as expected.
> >
> > Also, raising more softirqs which set the corresponding bit field
> > can make the IRQ mechanism think we probably need to start ksoftirqd
> > on the current cpu. Actually it shouldn't happen.
> >
> > Fixes: 0a9627f2649a ("rps: Receive Packet Steering")
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> >  net/core/dev.c | 2 --
> >  1 file changed, 2 deletions(-)
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 1518a366783b..bfaaa652f50c 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4594,8 +4594,6 @@ static int napi_schedule_rps(struct softnet_data *sd)
> >         if (sd != mysd) {
> >                 sd->rps_ipi_next = mysd->rps_ipi_list;
> >                 mysd->rps_ipi_list = sd;
> > -
> > -               __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> >                 return 1;
> >         }
> >  #endif /* CONFIG_RPS */
> > --
> > 2.37.3
> >
>
> This is not going to work in some cases. Please take a deeper look.
>
> I have to run, if you (or others) do not find the reason, I will give
> more details when I am done traveling.

I'm wondering whether we could use @mysd instead of @sd like this:

if (!__test_and_set_bit(NAPI_STATE_SCHED, &mysd->backlog.state))
    __raise_softirq_irqoff(NET_RX_SOFTIRQ);

I traced back to some historical changes and saw some relations with
this commit ("net: solve a NAPI race"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=39e6c8208d7b6fb9d2047850fb3327db567b564b

Thanks,
Jason
Jason Xing March 26, 2023, 10:10 a.m. UTC | #5
On Sun, Mar 26, 2023 at 12:04 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Sat, Mar 25, 2023 at 11:57 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Sat, Mar 25, 2023 at 8:26 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > Since we decide to put the skb into a backlog queue of another
> > > cpu, we should not raise the softirq for the current cpu. When
> > > to raise a softirq is based on whether we have more data left to
> > > process later. As to the current cpu, there is no indication of
> > > more data enqueued, so we do not need this action. After enqueuing
> > > to another cpu, net_rx_action() function will call ipi and then
> > > another cpu will raise the softirq as expected.
> > >
> > > Also, raising more softirqs which set the corresponding bit field
> > > can make the IRQ mechanism think we probably need to start ksoftirqd
> > > on the current cpu. Actually it shouldn't happen.
> > >
> > > Fixes: 0a9627f2649a ("rps: Receive Packet Steering")
> > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > ---
> > >  net/core/dev.c | 2 --
> > >  1 file changed, 2 deletions(-)
> > >
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index 1518a366783b..bfaaa652f50c 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -4594,8 +4594,6 @@ static int napi_schedule_rps(struct softnet_data *sd)
> > >         if (sd != mysd) {
> > >                 sd->rps_ipi_next = mysd->rps_ipi_list;
> > >                 mysd->rps_ipi_list = sd;
> > > -
> > > -               __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> > >                 return 1;
> > >         }
> > >  #endif /* CONFIG_RPS */
> > > --
> > > 2.37.3
> > >
> >
> > This is not going to work in some cases. Please take a deeper look.
> >
> > I have to run, if you (or others) do not find the reason, I will give
> > more details when I am done traveling.
>
> I'm wondering whether we could use @mysd instead of @sd like this:
>
> if (!__test_and_set_bit(NAPI_STATE_SCHED, &mysd->backlog.state))
>     __raise_softirq_irqoff(NET_RX_SOFTIRQ);

Ah, I have to add more precise code because the above codes may mislead people.

diff --git a/net/core/dev.c b/net/core/dev.c
index 1518a366783b..9ac9b32e392f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4594,8 +4594,9 @@ static int napi_schedule_rps(struct softnet_data *sd)
        if (sd != mysd) {
                sd->rps_ipi_next = mysd->rps_ipi_list;
                mysd->rps_ipi_list = sd;
+               if (!__test_and_set_bit(NAPI_STATE_SCHED, &mysd->backlog.state))
+                       __raise_softirq_irqoff(NET_RX_SOFTIRQ);

-               __raise_softirq_irqoff(NET_RX_SOFTIRQ);
                return 1;
        }
 #endif /* CONFIG_RPS */

Eric, I realized that some paths don't call the ipi to notify another
cpu. If someone grabs the NAPI_STATE_SCHED flag, we know that at the
end of net_rx_action() or the beginning of process_backlog(), the
net_rps_action_and_irq_enable() will handle the information delivery.
However, if no one grabs the flag, in some paths we could not have a
chance immediately to tell another cpu to raise the softirq and then
process those pending data. Thus, I have to make sure if someone owns
the napi poll as shown above.

If I get this wrong, please correct me if you're available. Thanks in advance.

>
> I traced back to some historical changes and saw some relations with
> this commit ("net: solve a NAPI race"):
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=39e6c8208d7b6fb9d2047850fb3327db567b564b
>
> Thanks,
> Jason
Jason Xing March 26, 2023, 2:56 p.m. UTC | #6
On Sun, Mar 26, 2023 at 6:10 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Sun, Mar 26, 2023 at 12:04 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Sat, Mar 25, 2023 at 11:57 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Sat, Mar 25, 2023 at 8:26 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > Since we decide to put the skb into a backlog queue of another
> > > > cpu, we should not raise the softirq for the current cpu. When
> > > > to raise a softirq is based on whether we have more data left to
> > > > process later. As to the current cpu, there is no indication of
> > > > more data enqueued, so we do not need this action. After enqueuing
> > > > to another cpu, net_rx_action() function will call ipi and then
> > > > another cpu will raise the softirq as expected.
> > > >
> > > > Also, raising more softirqs which set the corresponding bit field
> > > > can make the IRQ mechanism think we probably need to start ksoftirqd
> > > > on the current cpu. Actually it shouldn't happen.
> > > >
> > > > Fixes: 0a9627f2649a ("rps: Receive Packet Steering")
> > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > ---
> > > >  net/core/dev.c | 2 --
> > > >  1 file changed, 2 deletions(-)
> > > >
> > > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > > index 1518a366783b..bfaaa652f50c 100644
> > > > --- a/net/core/dev.c
> > > > +++ b/net/core/dev.c
> > > > @@ -4594,8 +4594,6 @@ static int napi_schedule_rps(struct softnet_data *sd)
> > > >         if (sd != mysd) {
> > > >                 sd->rps_ipi_next = mysd->rps_ipi_list;
> > > >                 mysd->rps_ipi_list = sd;
> > > > -
> > > > -               __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> > > >                 return 1;
> > > >         }
> > > >  #endif /* CONFIG_RPS */
> > > > --
> > > > 2.37.3
> > > >
> > >
> > > This is not going to work in some cases. Please take a deeper look.
> > >
> > > I have to run, if you (or others) do not find the reason, I will give
> > > more details when I am done traveling.
> >
> > I'm wondering whether we could use @mysd instead of @sd like this:
> >
> > if (!__test_and_set_bit(NAPI_STATE_SCHED, &mysd->backlog.state))
> >     __raise_softirq_irqoff(NET_RX_SOFTIRQ);
>
> Ah, I have to add more precise code because the above codes may mislead people.
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1518a366783b..9ac9b32e392f 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4594,8 +4594,9 @@ static int napi_schedule_rps(struct softnet_data *sd)
>         if (sd != mysd) {
>                 sd->rps_ipi_next = mysd->rps_ipi_list;
>                 mysd->rps_ipi_list = sd;
> +               if (!__test_and_set_bit(NAPI_STATE_SCHED, &mysd->backlog.state))

Forgive me. Really I need some coffee. I made a mistake. This line
above should be:

+               if (!test_bit(NAPI_STATE_SCHED, &mysd->backlog.state))

But the whole thing doesn't feel right. I need a few days to dig into
this part until Eric can help me with more of it.

Thanks,
Jason

> +                       __raise_softirq_irqoff(NET_RX_SOFTIRQ);
>
> -               __raise_softirq_irqoff(NET_RX_SOFTIRQ);
>                 return 1;
>         }
>  #endif /* CONFIG_RPS */
>
> Eric, I realized that some paths don't call the ipi to notify another
> cpu. If someone grabs the NAPI_STATE_SCHED flag, we know that at the
> end of net_rx_action() or the beginning of process_backlog(), the
> net_rps_action_and_irq_enable() will handle the information delivery.
> However, if no one grabs the flag, in some paths we could not have a
> chance immediately to tell another cpu to raise the softirq and then
> process those pending data. Thus, I have to make sure if someone owns
> the napi poll as shown above.
>
> If I get this wrong, please correct me if you're available. Thanks in advance.
>
> >
> > I traced back to some historical changes and saw some relations with
> > this commit ("net: solve a NAPI race"):
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=39e6c8208d7b6fb9d2047850fb3327db567b564b
> >
> > Thanks,
> > Jason
Eric Dumazet March 26, 2023, 5:35 p.m. UTC | #7
>
> Forgive me. Really I need some coffee. I made a mistake. This line
> above should be:
>
> +               if (!test_bit(NAPI_STATE_SCHED, &mysd->backlog.state))
>
> But the whole thing doesn't feel right. I need a few days to dig into
> this part until Eric can help me with more of it.
>

I am still traveling, and this is weekend time :/

It should not be too hard to read net/core/dev.c and remember that not
_all_ drivers (or some core networking functions) use the NAPI model.

eg look at netif_rx() and ask yourself why your patch is buggy.

Just look at callers of enqueue_to_backlog() and ask yourself if all
of them are called from net_rx_action()

[The answer is no, just in case you wonder]

In order to add your optimization, more work is needed, like adding
new parameters so that we do not miss critical
__raise_softirq_irqoff(NET_RX_SOFTIRQ) when _needed_.

We keep going circles around softirq deficiencies, I feel you are
trying to fix a second-order 'issue'.

Real cause is elsewhere, look at recent patches from Jakub.

Thanks.
Jason Xing March 27, 2023, 2:25 a.m. UTC | #8
On Mon, Mar 27, 2023 at 1:35 AM Eric Dumazet <edumazet@google.com> wrote:
>
> >
> > Forgive me. Really I need some coffee. I made a mistake. This line
> > above should be:
> >
> > +               if (!test_bit(NAPI_STATE_SCHED, &mysd->backlog.state))
> >
> > But the whole thing doesn't feel right. I need a few days to dig into
> > this part until Eric can help me with more of it.
> >
>
> I am still traveling, and this is weekend time :/

Thanks for your time, Eric, really appreciate it.

>
> It should not be too hard to read net/core/dev.c and remember that not
> _all_ drivers (or some core networking functions) use the NAPI model.
>
> eg look at netif_rx() and ask yourself why your patch is buggy.

Yes, it is. In my last email I sent yesterday I encountered one issue
which exactly happened when I started hundreds of iperf processes
transmitting data to loopback.
It got stuck :( So I realized it is the non-napi case that triggers
such a problem.

>
> Just look at callers of enqueue_to_backlog() and ask yourself if all
> of them are called from net_rx_action()
>
> [The answer is no, just in case you wonder]
>
> In order to add your optimization, more work is needed, like adding
> new parameters so that we do not miss critical
> __raise_softirq_irqoff(NET_RX_SOFTIRQ) when _needed_.

Thanks, I need to do more work/study on it.

>
> We keep going circles around softirq deficiencies, I feel you are
> trying to fix a second-order 'issue'.

Right, going circles gives me a headache.

>
> Real cause is elsewhere, look at recent patches from Jakub.

After you pointed out, I searched and found there is indeed one patchset in 2022

The tile like this:

[PATCH 0/3] softirq: uncontroversial change

Thanks,
Jason


>
> Thanks.
diff mbox series

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index 1518a366783b..bfaaa652f50c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4594,8 +4594,6 @@  static int napi_schedule_rps(struct softnet_data *sd)
 	if (sd != mysd) {
 		sd->rps_ipi_next = mysd->rps_ipi_list;
 		mysd->rps_ipi_list = sd;
-
-		__raise_softirq_irqoff(NET_RX_SOFTIRQ);
 		return 1;
 	}
 #endif /* CONFIG_RPS */