mbox series

[RFC,0/7] Add support to process rx packets in thread

Message ID 1595351666-28193-1-git-send-email-pillair@codeaurora.org (mailing list archive)
Headers show
Series Add support to process rx packets in thread | expand

Message

Rakesh Pillai July 21, 2020, 5:14 p.m. UTC
NAPI gets scheduled on the CPU core which got the
interrupt. The linux scheduler cannot move it to a
different core, even if the CPU on which NAPI is running
is heavily loaded. This can lead to degraded wifi
performance when running traffic at peak data rates.

A thread on the other hand can be moved to different
CPU cores, if the one on which its running is heavily
loaded. During high incoming data traffic, this gives
better performance, since the thread can be moved to a
less loaded or sometimes even a more powerful CPU core
to account for the required CPU performance in order
to process the incoming packets.

This patch series adds the support to use a high priority
thread to process the incoming packets, as opposed to
everything being done in NAPI context.

The rx thread can be enabled by using a module parameter
when loading the ath10k_snoc module.

---
This patch series is dependent on the below patch series
https://patchwork.kernel.org/project/ath10k/list/?series=315759

Rakesh Pillai (7):
  mac80211: Add check for napi handle before WARN_ON
  ath10k: Add support to process rx packet in thread
  ath10k: Add module param to enable rx thread
  ath10k: Do not exhaust budget on process tx completion
  ath10k: Handle the rx packet processing in thread
  ath10k: Add deliver to stack from thread context
  ath10k: Handle rx thread suspend and resume

 drivers/net/wireless/ath/ath10k/core.c   |  64 +++++++++++++++++++
 drivers/net/wireless/ath/ath10k/core.h   |  33 ++++++++++
 drivers/net/wireless/ath/ath10k/htt.h    |   2 +
 drivers/net/wireless/ath/ath10k/htt_rx.c |  66 ++++++++++++++-----
 drivers/net/wireless/ath/ath10k/snoc.c   | 105 ++++++++++++++++++++++++++++++-
 net/mac80211/rx.c                        |   2 +-
 6 files changed, 253 insertions(+), 19 deletions(-)

Comments

Andrew Lunn July 21, 2020, 5:25 p.m. UTC | #1
On Tue, Jul 21, 2020 at 10:44:19PM +0530, Rakesh Pillai wrote:
> NAPI gets scheduled on the CPU core which got the
> interrupt. The linux scheduler cannot move it to a
> different core, even if the CPU on which NAPI is running
> is heavily loaded. This can lead to degraded wifi
> performance when running traffic at peak data rates.
> 
> A thread on the other hand can be moved to different
> CPU cores, if the one on which its running is heavily
> loaded. During high incoming data traffic, this gives
> better performance, since the thread can be moved to a
> less loaded or sometimes even a more powerful CPU core
> to account for the required CPU performance in order
> to process the incoming packets.
> 
> This patch series adds the support to use a high priority
> thread to process the incoming packets, as opposed to
> everything being done in NAPI context.

I don't see why this problem is limited to the ath10k driver. I expect
it applies to all drivers using NAPI. So shouldn't you be solving this
in the NAPI core? Allow a driver to request the NAPI core uses a
thread?

	Andrew
Florian Fainelli July 21, 2020, 6:05 p.m. UTC | #2
On 7/21/20 10:25 AM, Andrew Lunn wrote:
> On Tue, Jul 21, 2020 at 10:44:19PM +0530, Rakesh Pillai wrote:
>> NAPI gets scheduled on the CPU core which got the
>> interrupt. The linux scheduler cannot move it to a
>> different core, even if the CPU on which NAPI is running
>> is heavily loaded. This can lead to degraded wifi
>> performance when running traffic at peak data rates.
>>
>> A thread on the other hand can be moved to different
>> CPU cores, if the one on which its running is heavily
>> loaded. During high incoming data traffic, this gives
>> better performance, since the thread can be moved to a
>> less loaded or sometimes even a more powerful CPU core
>> to account for the required CPU performance in order
>> to process the incoming packets.
>>
>> This patch series adds the support to use a high priority
>> thread to process the incoming packets, as opposed to
>> everything being done in NAPI context.
> 
> I don't see why this problem is limited to the ath10k driver. I expect
> it applies to all drivers using NAPI. So shouldn't you be solving this
> in the NAPI core? Allow a driver to request the NAPI core uses a
> thread?

What's more, you should be able to configure interrupt affinity to steer
RX processing onto a desired CPU core, is not that working for you somehow?
David Laight July 22, 2020, 9:12 a.m. UTC | #3
From: Andrew Lunn
> Sent: 21 July 2020 18:25
> 
> On Tue, Jul 21, 2020 at 10:44:19PM +0530, Rakesh Pillai wrote:
> > NAPI gets scheduled on the CPU core which got the
> > interrupt. The linux scheduler cannot move it to a
> > different core, even if the CPU on which NAPI is running
> > is heavily loaded. This can lead to degraded wifi
> > performance when running traffic at peak data rates.
> >
> > A thread on the other hand can be moved to different
> > CPU cores, if the one on which its running is heavily
> > loaded. During high incoming data traffic, this gives
> > better performance, since the thread can be moved to a
> > less loaded or sometimes even a more powerful CPU core
> > to account for the required CPU performance in order
> > to process the incoming packets.
> >
> > This patch series adds the support to use a high priority
> > thread to process the incoming packets, as opposed to
> > everything being done in NAPI context.
> 
> I don't see why this problem is limited to the ath10k driver. I expect
> it applies to all drivers using NAPI. So shouldn't you be solving this
> in the NAPI core? Allow a driver to request the NAPI core uses a
> thread?

It's not just NAPI the problem is with the softint processing.
I suspect a lot of systems would work better if it ran as
a (highish priority) kernel thread.

I've had to remove the main locks from a multi-threaded application
and replace them with atomic counters.
Consider what happens when the threads remove items from a shared
work list.
The code looks like:
	mutex_enter();
	remove_item_from_list();
	mutex_exit().
The mutex is only held for a few instructions, so while you'd expect
the cache line to be 'hot' you wouldn't get real contention.
However the following scenarios happen:
1) An ethernet interrupt happens while the mutex is held.
   This stops the other threads until all the softint processing
   has finished.
2) An ethernet interrupt (and softint) runs on a thread that is
   waiting for the mutex.
   (Or on the cpu that the thread's processor affinity ties it to.)
   In this case the 'fair' (ticket) mutex code won't let any other
   thread acquire the mutex.
   So again everything stops until the softints all complete.

The second one is also a problem when trying to wake up all
the threads (eg after adding a lot of items to the list).
The ticket locks force them to wake in order, but
sometimes the 'thundering herd' would work better.

IIRC this is actually worse for processes running under the RT
scheduler (without CONFIG_PREEMPT) because the they are almost
always scheduled on the same cpu they ran on last.
If it is busy, but cannot be pre-empted, they are not moved
to an idle cpu.
   
To confound things there is a very broken workaround for broken
hardware in the driver for the e1000 interface on (at least)
Ivy Bridge cpu that can cause the driver to spin for a very
long time (IIRC milliseconds) whenever it has to write to a
MAC register (ie on every transmit setup).

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Jakub Kicinski July 22, 2020, 4:20 p.m. UTC | #4
On Tue, 21 Jul 2020 19:25:14 +0200 Andrew Lunn wrote:
> On Tue, Jul 21, 2020 at 10:44:19PM +0530, Rakesh Pillai wrote:
> > NAPI gets scheduled on the CPU core which got the
> > interrupt. The linux scheduler cannot move it to a
> > different core, even if the CPU on which NAPI is running
> > is heavily loaded. This can lead to degraded wifi
> > performance when running traffic at peak data rates.
> > 
> > A thread on the other hand can be moved to different
> > CPU cores, if the one on which its running is heavily
> > loaded. During high incoming data traffic, this gives
> > better performance, since the thread can be moved to a
> > less loaded or sometimes even a more powerful CPU core
> > to account for the required CPU performance in order
> > to process the incoming packets.
> > 
> > This patch series adds the support to use a high priority
> > thread to process the incoming packets, as opposed to
> > everything being done in NAPI context.  
> 
> I don't see why this problem is limited to the ath10k driver. I expect
> it applies to all drivers using NAPI. So shouldn't you be solving this
> in the NAPI core? Allow a driver to request the NAPI core uses a
> thread?

Agreed, this is a problem we have with all drivers today.
We see seriously sub-optimal behavior in data center workloads, 
because kernel overloads the cores doing packet processing.

I think the fix may actually be in the scheduler. AFAIU the scheduler
counts the softIRQ time towards the interrupted process, and on top
of that tries to move processes to the cores handling their IO. In the
end the configuration which works somewhat okay is when each core has
its own IRQ and queues, which is seriously sub-optimal.
Rakesh Pillai July 23, 2020, 6:21 p.m. UTC | #5
> -----Original Message-----
> From: Florian Fainelli <f.fainelli@gmail.com>
> Sent: Tuesday, July 21, 2020 11:35 PM
> To: Andrew Lunn <andrew@lunn.ch>; Rakesh Pillai <pillair@codeaurora.org>
> Cc: ath10k@lists.infradead.org; linux-wireless@vger.kernel.org; linux-
> kernel@vger.kernel.org; kvalo@codeaurora.org; johannes@sipsolutions.net;
> davem@davemloft.net; kuba@kernel.org; netdev@vger.kernel.org;
> dianders@chromium.org; evgreen@chromium.org
> Subject: Re: [RFC 0/7] Add support to process rx packets in thread
> 
> On 7/21/20 10:25 AM, Andrew Lunn wrote:
> > On Tue, Jul 21, 2020 at 10:44:19PM +0530, Rakesh Pillai wrote:
> >> NAPI gets scheduled on the CPU core which got the
> >> interrupt. The linux scheduler cannot move it to a
> >> different core, even if the CPU on which NAPI is running
> >> is heavily loaded. This can lead to degraded wifi
> >> performance when running traffic at peak data rates.
> >>
> >> A thread on the other hand can be moved to different
> >> CPU cores, if the one on which its running is heavily
> >> loaded. During high incoming data traffic, this gives
> >> better performance, since the thread can be moved to a
> >> less loaded or sometimes even a more powerful CPU core
> >> to account for the required CPU performance in order
> >> to process the incoming packets.
> >>
> >> This patch series adds the support to use a high priority
> >> thread to process the incoming packets, as opposed to
> >> everything being done in NAPI context.
> >
> > I don't see why this problem is limited to the ath10k driver. I expect
> > it applies to all drivers using NAPI. So shouldn't you be solving this
> > in the NAPI core? Allow a driver to request the NAPI core uses a
> > thread?
> 
> What's more, you should be able to configure interrupt affinity to steer
> RX processing onto a desired CPU core, is not that working for you
> somehow?

Hi Florian,
Yes, the affinity of IRQ does work for me.
But the affinity of IRQ does not happen runtime based on load.


> --
> Florian
Florian Fainelli July 23, 2020, 7:02 p.m. UTC | #6
On 7/23/20 11:21 AM, Rakesh Pillai wrote:
> 
> 
>> -----Original Message-----
>> From: Florian Fainelli <f.fainelli@gmail.com>
>> Sent: Tuesday, July 21, 2020 11:35 PM
>> To: Andrew Lunn <andrew@lunn.ch>; Rakesh Pillai <pillair@codeaurora.org>
>> Cc: ath10k@lists.infradead.org; linux-wireless@vger.kernel.org; linux-
>> kernel@vger.kernel.org; kvalo@codeaurora.org; johannes@sipsolutions.net;
>> davem@davemloft.net; kuba@kernel.org; netdev@vger.kernel.org;
>> dianders@chromium.org; evgreen@chromium.org
>> Subject: Re: [RFC 0/7] Add support to process rx packets in thread
>>
>> On 7/21/20 10:25 AM, Andrew Lunn wrote:
>>> On Tue, Jul 21, 2020 at 10:44:19PM +0530, Rakesh Pillai wrote:
>>>> NAPI gets scheduled on the CPU core which got the
>>>> interrupt. The linux scheduler cannot move it to a
>>>> different core, even if the CPU on which NAPI is running
>>>> is heavily loaded. This can lead to degraded wifi
>>>> performance when running traffic at peak data rates.
>>>>
>>>> A thread on the other hand can be moved to different
>>>> CPU cores, if the one on which its running is heavily
>>>> loaded. During high incoming data traffic, this gives
>>>> better performance, since the thread can be moved to a
>>>> less loaded or sometimes even a more powerful CPU core
>>>> to account for the required CPU performance in order
>>>> to process the incoming packets.
>>>>
>>>> This patch series adds the support to use a high priority
>>>> thread to process the incoming packets, as opposed to
>>>> everything being done in NAPI context.
>>>
>>> I don't see why this problem is limited to the ath10k driver. I expect
>>> it applies to all drivers using NAPI. So shouldn't you be solving this
>>> in the NAPI core? Allow a driver to request the NAPI core uses a
>>> thread?
>>
>> What's more, you should be able to configure interrupt affinity to steer
>> RX processing onto a desired CPU core, is not that working for you
>> somehow?
> 
> Hi Florian,
> Yes, the affinity of IRQ does work for me.
> But the affinity of IRQ does not happen runtime based on load.

It can if you also run irqbalance.
Rakesh Pillai July 24, 2020, 6:20 a.m. UTC | #7
> -----Original Message-----
> From: Florian Fainelli <f.fainelli@gmail.com>
> Sent: Friday, July 24, 2020 12:33 AM
> To: Rakesh Pillai <pillair@codeaurora.org>; 'Andrew Lunn'
> <andrew@lunn.ch>
> Cc: ath10k@lists.infradead.org; linux-wireless@vger.kernel.org; linux-
> kernel@vger.kernel.org; kvalo@codeaurora.org; johannes@sipsolutions.net;
> davem@davemloft.net; kuba@kernel.org; netdev@vger.kernel.org;
> dianders@chromium.org; evgreen@chromium.org
> Subject: Re: [RFC 0/7] Add support to process rx packets in thread
> 
> On 7/23/20 11:21 AM, Rakesh Pillai wrote:
> >
> >
> >> -----Original Message-----
> >> From: Florian Fainelli <f.fainelli@gmail.com>
> >> Sent: Tuesday, July 21, 2020 11:35 PM
> >> To: Andrew Lunn <andrew@lunn.ch>; Rakesh Pillai
> <pillair@codeaurora.org>
> >> Cc: ath10k@lists.infradead.org; linux-wireless@vger.kernel.org; linux-
> >> kernel@vger.kernel.org; kvalo@codeaurora.org;
> johannes@sipsolutions.net;
> >> davem@davemloft.net; kuba@kernel.org; netdev@vger.kernel.org;
> >> dianders@chromium.org; evgreen@chromium.org
> >> Subject: Re: [RFC 0/7] Add support to process rx packets in thread
> >>
> >> On 7/21/20 10:25 AM, Andrew Lunn wrote:
> >>> On Tue, Jul 21, 2020 at 10:44:19PM +0530, Rakesh Pillai wrote:
> >>>> NAPI gets scheduled on the CPU core which got the
> >>>> interrupt. The linux scheduler cannot move it to a
> >>>> different core, even if the CPU on which NAPI is running
> >>>> is heavily loaded. This can lead to degraded wifi
> >>>> performance when running traffic at peak data rates.
> >>>>
> >>>> A thread on the other hand can be moved to different
> >>>> CPU cores, if the one on which its running is heavily
> >>>> loaded. During high incoming data traffic, this gives
> >>>> better performance, since the thread can be moved to a
> >>>> less loaded or sometimes even a more powerful CPU core
> >>>> to account for the required CPU performance in order
> >>>> to process the incoming packets.
> >>>>
> >>>> This patch series adds the support to use a high priority
> >>>> thread to process the incoming packets, as opposed to
> >>>> everything being done in NAPI context.
> >>>
> >>> I don't see why this problem is limited to the ath10k driver. I expect
> >>> it applies to all drivers using NAPI. So shouldn't you be solving this
> >>> in the NAPI core? Allow a driver to request the NAPI core uses a
> >>> thread?
> >>
> >> What's more, you should be able to configure interrupt affinity to steer
> >> RX processing onto a desired CPU core, is not that working for you
> >> somehow?
> >
> > Hi Florian,
> > Yes, the affinity of IRQ does work for me.
> > But the affinity of IRQ does not happen runtime based on load.
> 
> It can if you also run irqbalance.


Hi Florian,

Is it some kernel feature ?  Sorry I am not aware of this ?
I know it can be done in userspace.

> --
> Florian
Florian Fainelli July 24, 2020, 10:28 p.m. UTC | #8
On 7/23/20 11:20 PM, Rakesh Pillai wrote:
> 
> 
>> -----Original Message-----
>> From: Florian Fainelli <f.fainelli@gmail.com>
>> Sent: Friday, July 24, 2020 12:33 AM
>> To: Rakesh Pillai <pillair@codeaurora.org>; 'Andrew Lunn'
>> <andrew@lunn.ch>
>> Cc: ath10k@lists.infradead.org; linux-wireless@vger.kernel.org; linux-
>> kernel@vger.kernel.org; kvalo@codeaurora.org; johannes@sipsolutions.net;
>> davem@davemloft.net; kuba@kernel.org; netdev@vger.kernel.org;
>> dianders@chromium.org; evgreen@chromium.org
>> Subject: Re: [RFC 0/7] Add support to process rx packets in thread
>>
>> On 7/23/20 11:21 AM, Rakesh Pillai wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Florian Fainelli <f.fainelli@gmail.com>
>>>> Sent: Tuesday, July 21, 2020 11:35 PM
>>>> To: Andrew Lunn <andrew@lunn.ch>; Rakesh Pillai
>> <pillair@codeaurora.org>
>>>> Cc: ath10k@lists.infradead.org; linux-wireless@vger.kernel.org; linux-
>>>> kernel@vger.kernel.org; kvalo@codeaurora.org;
>> johannes@sipsolutions.net;
>>>> davem@davemloft.net; kuba@kernel.org; netdev@vger.kernel.org;
>>>> dianders@chromium.org; evgreen@chromium.org
>>>> Subject: Re: [RFC 0/7] Add support to process rx packets in thread
>>>>
>>>> On 7/21/20 10:25 AM, Andrew Lunn wrote:
>>>>> On Tue, Jul 21, 2020 at 10:44:19PM +0530, Rakesh Pillai wrote:
>>>>>> NAPI gets scheduled on the CPU core which got the
>>>>>> interrupt. The linux scheduler cannot move it to a
>>>>>> different core, even if the CPU on which NAPI is running
>>>>>> is heavily loaded. This can lead to degraded wifi
>>>>>> performance when running traffic at peak data rates.
>>>>>>
>>>>>> A thread on the other hand can be moved to different
>>>>>> CPU cores, if the one on which its running is heavily
>>>>>> loaded. During high incoming data traffic, this gives
>>>>>> better performance, since the thread can be moved to a
>>>>>> less loaded or sometimes even a more powerful CPU core
>>>>>> to account for the required CPU performance in order
>>>>>> to process the incoming packets.
>>>>>>
>>>>>> This patch series adds the support to use a high priority
>>>>>> thread to process the incoming packets, as opposed to
>>>>>> everything being done in NAPI context.
>>>>>
>>>>> I don't see why this problem is limited to the ath10k driver. I expect
>>>>> it applies to all drivers using NAPI. So shouldn't you be solving this
>>>>> in the NAPI core? Allow a driver to request the NAPI core uses a
>>>>> thread?
>>>>
>>>> What's more, you should be able to configure interrupt affinity to steer
>>>> RX processing onto a desired CPU core, is not that working for you
>>>> somehow?
>>>
>>> Hi Florian,
>>> Yes, the affinity of IRQ does work for me.
>>> But the affinity of IRQ does not happen runtime based on load.
>>
>> It can if you also run irqbalance.
> 
> 
> Hi Florian,
> 
> Is it some kernel feature ?  Sorry I am not aware of this ?
> I know it can be done in userspace.

The kernel interface is /proc/<irq>/smp_affinity and the users-space
implementation resides here:

https://github.com/Irqbalance/irqbalance
Hillf Danton July 25, 2020, 8:16 a.m. UTC | #9
On Wed, 22 Jul 2020 09:12:42 +0000 David Laight wrote:
> > On 21 July 2020 18:25 Andrew Lunn wrote:
> > 
> > On Tue, Jul 21, 2020 at 10:44:19PM +0530, Rakesh Pillai wrote:
> > > NAPI gets scheduled on the CPU core which got the
> > > interrupt. The linux scheduler cannot move it to a
> > > different core, even if the CPU on which NAPI is running
> > > is heavily loaded. This can lead to degraded wifi
> > > performance when running traffic at peak data rates.
> > >
> > > A thread on the other hand can be moved to different
> > > CPU cores, if the one on which its running is heavily
> > > loaded. During high incoming data traffic, this gives
> > > better performance, since the thread can be moved to a
> > > less loaded or sometimes even a more powerful CPU core
> > > to account for the required CPU performance in order
> > > to process the incoming packets.
> > >
> > > This patch series adds the support to use a high priority
> > > thread to process the incoming packets, as opposed to
> > > everything being done in NAPI context.
> > 
> > I don't see why this problem is limited to the ath10k driver. I expect
> > it applies to all drivers using NAPI. So shouldn't you be solving this
> > in the NAPI core? Allow a driver to request the NAPI core uses a
> > thread?
> 
> It's not just NAPI the problem is with the softint processing.
> I suspect a lot of systems would work better if it ran as
> a (highish priority) kernel thread.

Hi folks

Below is a minimunm poc implementation I can imagine on top of workqueue
to make napi threaded. Thoughts are appreciated.

> I've had to remove the main locks from a multi-threaded application
> and replace them with atomic counters.
> Consider what happens when the threads remove items from a shared
> work list.
> The code looks like:
> 	mutex_enter();
> 	remove_item_from_list();
> 	mutex_exit().
> The mutex is only held for a few instructions, so while you'd expect
> the cache line to be 'hot' you wouldn't get real contention.
> However the following scenarios happen:
> 1) An ethernet interrupt happens while the mutex is held.
>    This stops the other threads until all the softint processing
>    has finished.
> 2) An ethernet interrupt (and softint) runs on a thread that is
>    waiting for the mutex.
>    (Or on the cpu that the thread's processor affinity ties it to.)
>    In this case the 'fair' (ticket) mutex code won't let any other
>    thread acquire the mutex.
>    So again everything stops until the softints all complete.
> 
> The second one is also a problem when trying to wake up all
> the threads (eg after adding a lot of items to the list).
> The ticket locks force them to wake in order, but
> sometimes the 'thundering herd' would work better.
> 
> IIRC this is actually worse for processes running under the RT
> scheduler (without CONFIG_PREEMPT) because the they are almost
> always scheduled on the same cpu they ran on last.
> If it is busy, but cannot be pre-empted, they are not moved
> to an idle cpu.
>    
> To confound things there is a very broken workaround for broken
> hardware in the driver for the e1000 interface on (at least)
> Ivy Bridge cpu that can cause the driver to spin for a very
> long time (IIRC milliseconds) whenever it has to write to a
> MAC register (ie on every transmit setup).
> 
> 	David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

To make napi threaded, if either irq or softirq thread is entirely ruled
out, add napi::work that will be queued on a highpri workqueue. It is
actually a unbound one to facilitate scheduler to catter napi loads on to
idle CPU cores. What users need to do with the threaded napi
is s/netif_napi_add/netif_threaded_napi_add/ and no more.

--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -338,6 +338,9 @@ struct napi_struct {
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
 	unsigned int		napi_id;
+#ifdef CONFIG_THREADED_NAPI
+	struct work_struct	work;
+#endif
 };
 
 enum {
@@ -2234,6 +2237,19 @@ static inline void *netdev_priv(const st
 void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
 		    int (*poll)(struct napi_struct *, int), int weight);
 
+#ifdef CONFIG_THREADED_NAPI
+void netif_threaded_napi_add(struct net_device *dev, struct napi_struct *napi,
+		    int (*poll)(struct napi_struct *, int), int weight);
+#else
+static inline void netif_threaded_napi_add(struct net_device *dev,
+					struct napi_struct *napi,
+					int (*poll)(struct napi_struct *, int),
+					int weight)
+{
+	netif_napi_add(dev, napi, poll, weight);
+}
+#endif
+
 /**
  *	netif_tx_napi_add - initialize a NAPI context
  *	@dev:  network device
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6277,6 +6277,61 @@ static int process_backlog(struct napi_s
 	return work;
 }
 
+#ifdef CONFIG_THREADED_NAPI
+/* unbound highpri workqueue for threaded napi */
+static struct workqueue_struct *napi_workq;
+
+static void napi_workfn(struct work_struct *work)
+{
+	struct napi_struct *n = container_of(work, struct napi_struct, work);
+
+	for (;;) {
+		if (!test_bit(NAPI_STATE_SCHED, &n->state))
+			return;
+
+		if (n->poll(n, n->weight) < n->weight)
+			return;
+
+		if (need_resched()) {
+			/*
+			 * have to pay for the latency of task switch even if 
+			 * napi is scheduled
+			 */
+			if (test_bit(NAPI_STATE_SCHED, &n->state))
+				queue_work(napi_workq, work);
+			return;
+		}
+	}
+}
+
+void netif_threaded_napi_add(struct net_device *dev,
+				struct napi_struct *napi,
+				int (*poll)(struct napi_struct *, int),
+				int weight)
+{
+	netif_napi_add(dev, napi, poll, weight);
+	INIT_WORK(&napi->work, napi_workfn);
+}
+
+static inline bool is_threaded_napi(struct napi_struct *n)
+{
+	return n->work.func == napi_workfn;
+}
+
+static inline void threaded_napi_sched(struct napi_struct *n)
+{
+	if (is_threaded_napi(n))
+		queue_work(napi_workq, &n->work);
+	else
+		____napi_schedule(this_cpu_ptr(&softnet_data), n);
+}
+#else
+static inline void threaded_napi_sched(struct napi_struct *n)
+{
+	____napi_schedule(this_cpu_ptr(&softnet_data), n);
+}
+#endif
+
 /**
  * __napi_schedule - schedule for receive
  * @n: entry to schedule
@@ -6289,7 +6344,7 @@ void __napi_schedule(struct napi_struct
 	unsigned long flags;
 
 	local_irq_save(flags);
-	____napi_schedule(this_cpu_ptr(&softnet_data), n);
+	threaded_napi_sched(n);
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL(__napi_schedule);
@@ -6335,7 +6390,7 @@ EXPORT_SYMBOL(napi_schedule_prep);
  */
 void __napi_schedule_irqoff(struct napi_struct *n)
 {
-	____napi_schedule(this_cpu_ptr(&softnet_data), n);
+	threaded_napi_sched(n);
 }
 EXPORT_SYMBOL(__napi_schedule_irqoff);
 
@@ -10685,6 +10740,10 @@ static int __init net_dev_init(void)
 		sd->backlog.weight = weight_p;
 	}
 
+#ifdef CONFIG_THREADED_NAPI
+	napi_workq = alloc_workqueue("napi_workq", WQ_UNBOUND | WQ_HIGHPRI,
+					    WQ_UNBOUND_MAX_ACTIVE);
+#endif
 	dev_boot_phase = 0;
 
 	/* The loopback device is special if any other network devices
Sebastian Gottschall July 25, 2020, 10:38 a.m. UTC | #10
you may consider this

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1142611.html 
<https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1142611.html>

years ago someone already wanted to bring this feature upstream, but it 
was denied. i already tested this patch the last 2 days and it worked so 
far (with some little modifications)
so such a solution existed already and may be considered here

Sebastian


someone

Am 25.07.2020 um 10:16 schrieb Hillf Danton:
> On Wed, 22 Jul 2020 09:12:42 +0000 David Laight wrote:
>>> On 21 July 2020 18:25 Andrew Lunn wrote:
>>>
>>> On Tue, Jul 21, 2020 at 10:44:19PM +0530, Rakesh Pillai wrote:
>>>> NAPI gets scheduled on the CPU core which got the
>>>> interrupt. The linux scheduler cannot move it to a
>>>> different core, even if the CPU on which NAPI is running
>>>> is heavily loaded. This can lead to degraded wifi
>>>> performance when running traffic at peak data rates.
>>>>
>>>> A thread on the other hand can be moved to different
>>>> CPU cores, if the one on which its running is heavily
>>>> loaded. During high incoming data traffic, this gives
>>>> better performance, since the thread can be moved to a
>>>> less loaded or sometimes even a more powerful CPU core
>>>> to account for the required CPU performance in order
>>>> to process the incoming packets.
>>>>
>>>> This patch series adds the support to use a high priority
>>>> thread to process the incoming packets, as opposed to
>>>> everything being done in NAPI context.
>>> I don't see why this problem is limited to the ath10k driver. I expect
>>> it applies to all drivers using NAPI. So shouldn't you be solving this
>>> in the NAPI core? Allow a driver to request the NAPI core uses a
>>> thread?
>> It's not just NAPI the problem is with the softint processing.
>> I suspect a lot of systems would work better if it ran as
>> a (highish priority) kernel thread.
> Hi folks
>
> Below is a minimunm poc implementation I can imagine on top of workqueue
> to make napi threaded. Thoughts are appreciated.
>
>> I've had to remove the main locks from a multi-threaded application
>> and replace them with atomic counters.
>> Consider what happens when the threads remove items from a shared
>> work list.
>> The code looks like:
>> 	mutex_enter();
>> 	remove_item_from_list();
>> 	mutex_exit().
>> The mutex is only held for a few instructions, so while you'd expect
>> the cache line to be 'hot' you wouldn't get real contention.
>> However the following scenarios happen:
>> 1) An ethernet interrupt happens while the mutex is held.
>>     This stops the other threads until all the softint processing
>>     has finished.
>> 2) An ethernet interrupt (and softint) runs on a thread that is
>>     waiting for the mutex.
>>     (Or on the cpu that the thread's processor affinity ties it to.)
>>     In this case the 'fair' (ticket) mutex code won't let any other
>>     thread acquire the mutex.
>>     So again everything stops until the softints all complete.
>>
>> The second one is also a problem when trying to wake up all
>> the threads (eg after adding a lot of items to the list).
>> The ticket locks force them to wake in order, but
>> sometimes the 'thundering herd' would work better.
>>
>> IIRC this is actually worse for processes running under the RT
>> scheduler (without CONFIG_PREEMPT) because the they are almost
>> always scheduled on the same cpu they ran on last.
>> If it is busy, but cannot be pre-empted, they are not moved
>> to an idle cpu.
>>     
>> To confound things there is a very broken workaround for broken
>> hardware in the driver for the e1000 interface on (at least)
>> Ivy Bridge cpu that can cause the driver to spin for a very
>> long time (IIRC milliseconds) whenever it has to write to a
>> MAC register (ie on every transmit setup).
>>
>> 	David
>>
>> -
>> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
>> Registration No: 1397386 (Wales)
> To make napi threaded, if either irq or softirq thread is entirely ruled
> out, add napi::work that will be queued on a highpri workqueue. It is
> actually a unbound one to facilitate scheduler to catter napi loads on to
> idle CPU cores. What users need to do with the threaded napi
> is s/netif_napi_add/netif_threaded_napi_add/ and no more.
>
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -338,6 +338,9 @@ struct napi_struct {
>   	struct list_head	dev_list;
>   	struct hlist_node	napi_hash_node;
>   	unsigned int		napi_id;
> +#ifdef CONFIG_THREADED_NAPI
> +	struct work_struct	work;
> +#endif
>   };
>   
>   enum {
> @@ -2234,6 +2237,19 @@ static inline void *netdev_priv(const st
>   void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
>   		    int (*poll)(struct napi_struct *, int), int weight);
>   
> +#ifdef CONFIG_THREADED_NAPI
> +void netif_threaded_napi_add(struct net_device *dev, struct napi_struct *napi,
> +		    int (*poll)(struct napi_struct *, int), int weight);
> +#else
> +static inline void netif_threaded_napi_add(struct net_device *dev,
> +					struct napi_struct *napi,
> +					int (*poll)(struct napi_struct *, int),
> +					int weight)
> +{
> +	netif_napi_add(dev, napi, poll, weight);
> +}
> +#endif
> +
>   /**
>    *	netif_tx_napi_add - initialize a NAPI context
>    *	@dev:  network device
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6277,6 +6277,61 @@ static int process_backlog(struct napi_s
>   	return work;
>   }
>   
> +#ifdef CONFIG_THREADED_NAPI
> +/* unbound highpri workqueue for threaded napi */
> +static struct workqueue_struct *napi_workq;
> +
> +static void napi_workfn(struct work_struct *work)
> +{
> +	struct napi_struct *n = container_of(work, struct napi_struct, work);
> +
> +	for (;;) {
> +		if (!test_bit(NAPI_STATE_SCHED, &n->state))
> +			return;
> +
> +		if (n->poll(n, n->weight) < n->weight)
> +			return;
> +
> +		if (need_resched()) {
> +			/*
> +			 * have to pay for the latency of task switch even if
> +			 * napi is scheduled
> +			 */
> +			if (test_bit(NAPI_STATE_SCHED, &n->state))
> +				queue_work(napi_workq, work);
> +			return;
> +		}
> +	}
> +}
> +
> +void netif_threaded_napi_add(struct net_device *dev,
> +				struct napi_struct *napi,
> +				int (*poll)(struct napi_struct *, int),
> +				int weight)
> +{
> +	netif_napi_add(dev, napi, poll, weight);
> +	INIT_WORK(&napi->work, napi_workfn);
> +}
> +
> +static inline bool is_threaded_napi(struct napi_struct *n)
> +{
> +	return n->work.func == napi_workfn;
> +}
> +
> +static inline void threaded_napi_sched(struct napi_struct *n)
> +{
> +	if (is_threaded_napi(n))
> +		queue_work(napi_workq, &n->work);
> +	else
> +		____napi_schedule(this_cpu_ptr(&softnet_data), n);
> +}
> +#else
> +static inline void threaded_napi_sched(struct napi_struct *n)
> +{
> +	____napi_schedule(this_cpu_ptr(&softnet_data), n);
> +}
> +#endif
> +
>   /**
>    * __napi_schedule - schedule for receive
>    * @n: entry to schedule
> @@ -6289,7 +6344,7 @@ void __napi_schedule(struct napi_struct
>   	unsigned long flags;
>   
>   	local_irq_save(flags);
> -	____napi_schedule(this_cpu_ptr(&softnet_data), n);
> +	threaded_napi_sched(n);
>   	local_irq_restore(flags);
>   }
>   EXPORT_SYMBOL(__napi_schedule);
> @@ -6335,7 +6390,7 @@ EXPORT_SYMBOL(napi_schedule_prep);
>    */
>   void __napi_schedule_irqoff(struct napi_struct *n)
>   {
> -	____napi_schedule(this_cpu_ptr(&softnet_data), n);
> +	threaded_napi_sched(n);
>   }
>   EXPORT_SYMBOL(__napi_schedule_irqoff);
>   
> @@ -10685,6 +10740,10 @@ static int __init net_dev_init(void)
>   		sd->backlog.weight = weight_p;
>   	}
>   
> +#ifdef CONFIG_THREADED_NAPI
> +	napi_workq = alloc_workqueue("napi_workq", WQ_UNBOUND | WQ_HIGHPRI,
> +					    WQ_UNBOUND_MAX_ACTIVE);
> +#endif
>   	dev_boot_phase = 0;
>   
>   	/* The loopback device is special if any other network devices
>
>
> _______________________________________________
> ath10k mailing list
> ath10k@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/ath10k
>
Hillf Danton July 25, 2020, 12:25 p.m. UTC | #11
On Sat, 25 Jul 2020 12:38:00 +0200 Sebastian Gottschall wrote:
> you may consider this
> 
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1142611.html 
> <https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1142611.html>

Thanks very much for your link.

> years ago someone already wanted to bring this feature upstream, but it 
> was denied. i already tested this patch the last 2 days and it worked so 
> far (with some little modifications)
> so such a solution existed already and may be considered here

I don't see outstanding difference in principle from Paolo's work in
2016 except for the use of kthread_create() and friends because kworker 
made use of them even before 2016. This is a simpler one as shown by
the diff stat in his cover letter.

Paolo, feel free to correct me if I misread anything.

Finally I don't see the need to add sysfs attr, given CONFIG_THREADED_NAPI
in this work.

BTW let us know if anyone has plans to pick up the 2016 RFC.

Hillf

Paolo Abeni (2):
  net: implement threaded-able napi poll loop support
  net: add sysfs attribute to control napi threaded mode

 include/linux/netdevice.h |   4 ++
 net/core/dev.c            | 113 ++++++++++++++++++++++++++++++++++++++++++++++
 net/core/net-sysfs.c      | 102 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 219 insertions(+)
> 
> Sebastian
> 
> 
> someone
> 
> Am 25.07.2020 um 10:16 schrieb Hillf Danton:
> > On Wed, 22 Jul 2020 09:12:42 +0000 David Laight wrote:
> >>> On 21 July 2020 18:25 Andrew Lunn wrote:
> >>>
> >>> On Tue, Jul 21, 2020 at 10:44:19PM +0530, Rakesh Pillai wrote:
> >>>> NAPI gets scheduled on the CPU core which got the
> >>>> interrupt. The linux scheduler cannot move it to a
> >>>> different core, even if the CPU on which NAPI is running
> >>>> is heavily loaded. This can lead to degraded wifi
> >>>> performance when running traffic at peak data rates.
> >>>>
> >>>> A thread on the other hand can be moved to different
> >>>> CPU cores, if the one on which its running is heavily
> >>>> loaded. During high incoming data traffic, this gives
> >>>> better performance, since the thread can be moved to a
> >>>> less loaded or sometimes even a more powerful CPU core
> >>>> to account for the required CPU performance in order
> >>>> to process the incoming packets.
> >>>>
> >>>> This patch series adds the support to use a high priority
> >>>> thread to process the incoming packets, as opposed to
> >>>> everything being done in NAPI context.
> >>> I don't see why this problem is limited to the ath10k driver. I expect
> >>> it applies to all drivers using NAPI. So shouldn't you be solving this
> >>> in the NAPI core? Allow a driver to request the NAPI core uses a
> >>> thread?
> >> It's not just NAPI the problem is with the softint processing.
> >> I suspect a lot of systems would work better if it ran as
> >> a (highish priority) kernel thread.
> > Hi folks
> >
> > Below is a minimunm poc implementation I can imagine on top of workqueue
> > to make napi threaded. Thoughts are appreciated.
> >
> >> I've had to remove the main locks from a multi-threaded application
> >> and replace them with atomic counters.
> >> Consider what happens when the threads remove items from a shared
> >> work list.
> >> The code looks like:
> >> 	mutex_enter();
> >> 	remove_item_from_list();
> >> 	mutex_exit().
> >> The mutex is only held for a few instructions, so while you'd expect
> >> the cache line to be 'hot' you wouldn't get real contention.
> >> However the following scenarios happen:
> >> 1) An ethernet interrupt happens while the mutex is held.
> >>     This stops the other threads until all the softint processing
> >>     has finished.
> >> 2) An ethernet interrupt (and softint) runs on a thread that is
> >>     waiting for the mutex.
> >>     (Or on the cpu that the thread's processor affinity ties it to.)
> >>     In this case the 'fair' (ticket) mutex code won't let any other
> >>     thread acquire the mutex.
> >>     So again everything stops until the softints all complete.
> >>
> >> The second one is also a problem when trying to wake up all
> >> the threads (eg after adding a lot of items to the list).
> >> The ticket locks force them to wake in order, but
> >> sometimes the 'thundering herd' would work better.
> >>
> >> IIRC this is actually worse for processes running under the RT
> >> scheduler (without CONFIG_PREEMPT) because the they are almost
> >> always scheduled on the same cpu they ran on last.
> >> If it is busy, but cannot be pre-empted, they are not moved
> >> to an idle cpu.
> >>     
> >> To confound things there is a very broken workaround for broken
> >> hardware in the driver for the e1000 interface on (at least)
> >> Ivy Bridge cpu that can cause the driver to spin for a very
> >> long time (IIRC milliseconds) whenever it has to write to a
> >> MAC register (ie on every transmit setup).
> >>
> >> 	David
> >>
> >> -
> >> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> >> Registration No: 1397386 (Wales)
> >>
> >>
> > To make napi threaded, if either irq or softirq thread is entirely ruled
> > out, add napi::work that will be queued on a highpri workqueue. It is
> > actually a unbound one to facilitate scheduler to catter napi loads on to
> > idle CPU cores. What users need to do with the threaded napi
> > is s/netif_napi_add/netif_threaded_napi_add/ and no more.
> >
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -338,6 +338,9 @@ struct napi_struct {
> >   	struct list_head	dev_list;
> >   	struct hlist_node	napi_hash_node;
> >   	unsigned int		napi_id;
> > +#ifdef CONFIG_THREADED_NAPI
> > +	struct work_struct	work;
> > +#endif
> >   };
> >   
> >   enum {
> > @@ -2234,6 +2237,19 @@ static inline void *netdev_priv(const st
> >   void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
> >   		    int (*poll)(struct napi_struct *, int), int weight);
> >   
> > +#ifdef CONFIG_THREADED_NAPI
> > +void netif_threaded_napi_add(struct net_device *dev, struct napi_struct *napi,
> > +		    int (*poll)(struct napi_struct *, int), int weight);
> > +#else
> > +static inline void netif_threaded_napi_add(struct net_device *dev,
> > +					struct napi_struct *napi,
> > +					int (*poll)(struct napi_struct *, int),
> > +					int weight)
> > +{
> > +	netif_napi_add(dev, napi, poll, weight);
> > +}
> > +#endif
> > +
> >   /**
> >    *	netif_tx_napi_add - initialize a NAPI context
> >    *	@dev:  network device
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -6277,6 +6277,61 @@ static int process_backlog(struct napi_s
> >   	return work;
> >   }
> >   
> > +#ifdef CONFIG_THREADED_NAPI
> > +/* unbound highpri workqueue for threaded napi */
> > +static struct workqueue_struct *napi_workq;
> > +
> > +static void napi_workfn(struct work_struct *work)
> > +{
> > +	struct napi_struct *n = container_of(work, struct napi_struct, work);
> > +
> > +	for (;;) {
> > +		if (!test_bit(NAPI_STATE_SCHED, &n->state))
> > +			return;
> > +
> > +		if (n->poll(n, n->weight) < n->weight)
> > +			return;
> > +
> > +		if (need_resched()) {
> > +			/*
> > +			 * have to pay for the latency of task switch even if
> > +			 * napi is scheduled
> > +			 */
> > +			if (test_bit(NAPI_STATE_SCHED, &n->state))
> > +				queue_work(napi_workq, work);
> > +			return;
> > +		}
> > +	}
> > +}
> > +
> > +void netif_threaded_napi_add(struct net_device *dev,
> > +				struct napi_struct *napi,
> > +				int (*poll)(struct napi_struct *, int),
> > +				int weight)
> > +{
> > +	netif_napi_add(dev, napi, poll, weight);
> > +	INIT_WORK(&napi->work, napi_workfn);
> > +}
> > +
> > +static inline bool is_threaded_napi(struct napi_struct *n)
> > +{
> > +	return n->work.func == napi_workfn;
> > +}
> > +
> > +static inline void threaded_napi_sched(struct napi_struct *n)
> > +{
> > +	if (is_threaded_napi(n))
> > +		queue_work(napi_workq, &n->work);
> > +	else
> > +		____napi_schedule(this_cpu_ptr(&softnet_data), n);
> > +}
> > +#else
> > +static inline void threaded_napi_sched(struct napi_struct *n)
> > +{
> > +	____napi_schedule(this_cpu_ptr(&softnet_data), n);
> > +}
> > +#endif
> > +
> >   /**
> >    * __napi_schedule - schedule for receive
> >    * @n: entry to schedule
> > @@ -6289,7 +6344,7 @@ void __napi_schedule(struct napi_struct
> >   	unsigned long flags;
> >   
> >   	local_irq_save(flags);
> > -	____napi_schedule(this_cpu_ptr(&softnet_data), n);
> > +	threaded_napi_sched(n);
> >   	local_irq_restore(flags);
> >   }
> >   EXPORT_SYMBOL(__napi_schedule);
> > @@ -6335,7 +6390,7 @@ EXPORT_SYMBOL(napi_schedule_prep);
> >    */
> >   void __napi_schedule_irqoff(struct napi_struct *n)
> >   {
> > -	____napi_schedule(this_cpu_ptr(&softnet_data), n);
> > +	threaded_napi_sched(n);
> >   }
> >   EXPORT_SYMBOL(__napi_schedule_irqoff);
> >   
> > @@ -10685,6 +10740,10 @@ static int __init net_dev_init(void)
> >   		sd->backlog.weight = weight_p;
> >   	}
> >   
> > +#ifdef CONFIG_THREADED_NAPI
> > +	napi_workq = alloc_workqueue("napi_workq", WQ_UNBOUND | WQ_HIGHPRI,
> > +					    WQ_UNBOUND_MAX_ACTIVE);
> > +#endif
> >   	dev_boot_phase = 0;
> >   
> >   	/* The loopback device is special if any other network devices
> >
> >
> > _______________________________________________
> > ath10k mailing list
> > ath10k@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/ath10k
Sebastian Gottschall July 25, 2020, 2:08 p.m. UTC | #12
Am 25.07.2020 um 14:25 schrieb Hillf Danton:
> On Sat, 25 Jul 2020 12:38:00 +0200 Sebastian Gottschall wrote:
>> you may consider this
>>
>> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1142611.html 
>>
>> <https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1142611.html> 
>>
> Thanks very much for your link.
>
>> years ago someone already wanted to bring this feature upstream, but it
>> was denied. i already tested this patch the last 2 days and it worked so
>> far (with some little modifications)
>> so such a solution existed already and may be considered here
> I don't see outstanding difference in principle from Paolo's work in
> 2016 except for the use of kthread_create() and friends because kworker
> made use of them even before 2016. This is a simpler one as shown by
> the diff stat in his cover letter.
i agree. i just can say that i tested this patch recently due this 
discussion here. and it can be changed by sysfs. but it doesnt work for
wifi drivers which are mainly using dummy netdev devices. for this i 
made a small patch to get them working using napi_set_threaded manually 
hardcoded in the drivers. (see patch bellow)
i also tested various networking drivers. one thing i notice doesnt 
work. some napi code is used for tx polling. so from my experience this 
concept just works good for rx with the most drivers.
so far i tested mt76, ath10k and some soc ethernet chipsets with good 
success. on ath10k i had about 10 - 20% performance gain on multicore 
systems. using standard iperf3 with 4 parallel streams.

-5439,7 +5441,7 @@ int napi_set_threaded(struct napi_struct *n, bool
                 clear_bit(NAPI_STATE_THREADED, &n->state);

         /* if the device is initializing, nothing todo */
-       if (test_bit(__LINK_STATE_START, &n->dev->state))
+       if (test_bit(__LINK_STATE_START, &n->dev->state) && 
n->dev->reg_state != NETREG_DUMMY)
                 return 0;

         napi_thread_stop(n)
;


>
> Paolo, feel free to correct me if I misread anything.
>
> Finally I don't see the need to add sysfs attr, given 
> CONFIG_THREADED_NAPI
> in this work.
>
> BTW let us know if anyone has plans to pick up the 2016 RFC.
>
> Hillf
>
> Paolo Abeni (2):
>    net: implement threaded-able napi poll loop support
>    net: add sysfs attribute to control napi threaded mode
>
>   include/linux/netdevice.h |   4 ++
>   net/core/dev.c            | 113 
> ++++++++++++++++++++++++++++++++++++++++++++++
>   net/core/net-sysfs.c      | 102 
> +++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 219 insertions(+)
>> Sebastian
>>
>>
>> someone
>>
>> Am 25.07.2020 um 10:16 schrieb Hillf Danton:
>>> On Wed, 22 Jul 2020 09:12:42 +0000 David Laight wrote:
>>>>> On 21 July 2020 18:25 Andrew Lunn wrote:
>>>>>
>>>>> On Tue, Jul 21, 2020 at 10:44:19PM +0530, Rakesh Pillai wrote:
>>>>>> NAPI gets scheduled on the CPU core which got the
>>>>>> interrupt. The linux scheduler cannot move it to a
>>>>>> different core, even if the CPU on which NAPI is running
>>>>>> is heavily loaded. This can lead to degraded wifi
>>>>>> performance when running traffic at peak data rates.
>>>>>>
>>>>>> A thread on the other hand can be moved to different
>>>>>> CPU cores, if the one on which its running is heavily
>>>>>> loaded. During high incoming data traffic, this gives
>>>>>> better performance, since the thread can be moved to a
>>>>>> less loaded or sometimes even a more powerful CPU core
>>>>>> to account for the required CPU performance in order
>>>>>> to process the incoming packets.
>>>>>>
>>>>>> This patch series adds the support to use a high priority
>>>>>> thread to process the incoming packets, as opposed to
>>>>>> everything being done in NAPI context.
>>>>> I don't see why this problem is limited to the ath10k driver. I 
>>>>> expect
>>>>> it applies to all drivers using NAPI. So shouldn't you be solving 
>>>>> this
>>>>> in the NAPI core? Allow a driver to request the NAPI core uses a
>>>>> thread?
>>>> It's not just NAPI the problem is with the softint processing.
>>>> I suspect a lot of systems would work better if it ran as
>>>> a (highish priority) kernel thread.
>>> Hi folks
>>>
>>> Below is a minimunm poc implementation I can imagine on top of 
>>> workqueue
>>> to make napi threaded. Thoughts are appreciated.
>>>
>>>> I've had to remove the main locks from a multi-threaded application
>>>> and replace them with atomic counters.
>>>> Consider what happens when the threads remove items from a shared
>>>> work list.
>>>> The code looks like:
>>>>     mutex_enter();
>>>>     remove_item_from_list();
>>>>     mutex_exit().
>>>> The mutex is only held for a few instructions, so while you'd expect
>>>> the cache line to be 'hot' you wouldn't get real contention.
>>>> However the following scenarios happen:
>>>> 1) An ethernet interrupt happens while the mutex is held.
>>>>      This stops the other threads until all the softint processing
>>>>      has finished.
>>>> 2) An ethernet interrupt (and softint) runs on a thread that is
>>>>      waiting for the mutex.
>>>>      (Or on the cpu that the thread's processor affinity ties it to.)
>>>>      In this case the 'fair' (ticket) mutex code won't let any other
>>>>      thread acquire the mutex.
>>>>      So again everything stops until the softints all complete.
>>>>
>>>> The second one is also a problem when trying to wake up all
>>>> the threads (eg after adding a lot of items to the list).
>>>> The ticket locks force them to wake in order, but
>>>> sometimes the 'thundering herd' would work better.
>>>>
>>>> IIRC this is actually worse for processes running under the RT
>>>> scheduler (without CONFIG_PREEMPT) because the they are almost
>>>> always scheduled on the same cpu they ran on last.
>>>> If it is busy, but cannot be pre-empted, they are not moved
>>>> to an idle cpu.
>>>>      To confound things there is a very broken workaround for broken
>>>> hardware in the driver for the e1000 interface on (at least)
>>>> Ivy Bridge cpu that can cause the driver to spin for a very
>>>> long time (IIRC milliseconds) whenever it has to write to a
>>>> MAC register (ie on every transmit setup).
>>>>
>>>>     David
>>>>
>>>> -
>>>> Registered Address Lakeside, Bramley Road, Mount Farm, Milton 
>>>> Keynes, MK1 1PT, UK
>>>> Registration No: 1397386 (Wales)
>>>>
>>>>
>>> To make napi threaded, if either irq or softirq thread is entirely 
>>> ruled
>>> out, add napi::work that will be queued on a highpri workqueue. It is
>>> actually a unbound one to facilitate scheduler to catter napi loads 
>>> on to
>>> idle CPU cores. What users need to do with the threaded napi
>>> is s/netif_napi_add/netif_threaded_napi_add/ and no more.
>>>
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -338,6 +338,9 @@ struct napi_struct {
>>>        struct list_head    dev_list;
>>>        struct hlist_node    napi_hash_node;
>>>        unsigned int        napi_id;
>>> +#ifdef CONFIG_THREADED_NAPI
>>> +    struct work_struct    work;
>>> +#endif
>>>    };
>>>       enum {
>>> @@ -2234,6 +2237,19 @@ static inline void *netdev_priv(const st
>>>    void netif_napi_add(struct net_device *dev, struct napi_struct 
>>> *napi,
>>>                int (*poll)(struct napi_struct *, int), int weight);
>>>    +#ifdef CONFIG_THREADED_NAPI
>>> +void netif_threaded_napi_add(struct net_device *dev, struct 
>>> napi_struct *napi,
>>> +            int (*poll)(struct napi_struct *, int), int weight);
>>> +#else
>>> +static inline void netif_threaded_napi_add(struct net_device *dev,
>>> +                    struct napi_struct *napi,
>>> +                    int (*poll)(struct napi_struct *, int),
>>> +                    int weight)
>>> +{
>>> +    netif_napi_add(dev, napi, poll, weight);
>>> +}
>>> +#endif
>>> +
>>>    /**
>>>     *    netif_tx_napi_add - initialize a NAPI context
>>>     *    @dev:  network device
>>> --- a/net/core/dev.c
>>> +++ b/net/core/dev.c
>>> @@ -6277,6 +6277,61 @@ static int process_backlog(struct napi_s
>>>        return work;
>>>    }
>>>    +#ifdef CONFIG_THREADED_NAPI
>>> +/* unbound highpri workqueue for threaded napi */
>>> +static struct workqueue_struct *napi_workq;
>>> +
>>> +static void napi_workfn(struct work_struct *work)
>>> +{
>>> +    struct napi_struct *n = container_of(work, struct napi_struct, 
>>> work);
>>> +
>>> +    for (;;) {
>>> +        if (!test_bit(NAPI_STATE_SCHED, &n->state))
>>> +            return;
>>> +
>>> +        if (n->poll(n, n->weight) < n->weight)
>>> +            return;
>>> +
>>> +        if (need_resched()) {
>>> +            /*
>>> +             * have to pay for the latency of task switch even if
>>> +             * napi is scheduled
>>> +             */
>>> +            if (test_bit(NAPI_STATE_SCHED, &n->state))
>>> +                queue_work(napi_workq, work);
>>> +            return;
>>> +        }
>>> +    }
>>> +}
>>> +
>>> +void netif_threaded_napi_add(struct net_device *dev,
>>> +                struct napi_struct *napi,
>>> +                int (*poll)(struct napi_struct *, int),
>>> +                int weight)
>>> +{
>>> +    netif_napi_add(dev, napi, poll, weight);
>>> +    INIT_WORK(&napi->work, napi_workfn);
>>> +}
>>> +
>>> +static inline bool is_threaded_napi(struct napi_struct *n)
>>> +{
>>> +    return n->work.func == napi_workfn;
>>> +}
>>> +
>>> +static inline void threaded_napi_sched(struct napi_struct *n)
>>> +{
>>> +    if (is_threaded_napi(n))
>>> +        queue_work(napi_workq, &n->work);
>>> +    else
>>> +        ____napi_schedule(this_cpu_ptr(&softnet_data), n);
>>> +}
>>> +#else
>>> +static inline void threaded_napi_sched(struct napi_struct *n)
>>> +{
>>> +    ____napi_schedule(this_cpu_ptr(&softnet_data), n);
>>> +}
>>> +#endif
>>> +
>>>    /**
>>>     * __napi_schedule - schedule for receive
>>>     * @n: entry to schedule
>>> @@ -6289,7 +6344,7 @@ void __napi_schedule(struct napi_struct
>>>        unsigned long flags;
>>>           local_irq_save(flags);
>>> -    ____napi_schedule(this_cpu_ptr(&softnet_data), n);
>>> +    threaded_napi_sched(n);
>>>        local_irq_restore(flags);
>>>    }
>>>    EXPORT_SYMBOL(__napi_schedule);
>>> @@ -6335,7 +6390,7 @@ EXPORT_SYMBOL(napi_schedule_prep);
>>>     */
>>>    void __napi_schedule_irqoff(struct napi_struct *n)
>>>    {
>>> -    ____napi_schedule(this_cpu_ptr(&softnet_data), n);
>>> +    threaded_napi_sched(n);
>>>    }
>>>    EXPORT_SYMBOL(__napi_schedule_irqoff);
>>>    @@ -10685,6 +10740,10 @@ static int __init net_dev_init(void)
>>>            sd->backlog.weight = weight_p;
>>>        }
>>>    +#ifdef CONFIG_THREADED_NAPI
>>> +    napi_workq = alloc_workqueue("napi_workq", WQ_UNBOUND | 
>>> WQ_HIGHPRI,
>>> +                        WQ_UNBOUND_MAX_ACTIVE);
>>> +#endif
>>>        dev_boot_phase = 0;
>>>           /* The loopback device is special if any other network 
>>> devices
>>>
>>>
>>> _______________________________________________
>>> ath10k mailing list
>>> ath10k@lists.infradead.org
>>> http://lists.infradead.org/mailman/listinfo/ath10k
>
Hillf Danton July 25, 2020, 2:57 p.m. UTC | #13
On Sat, 25 Jul 2020 16:08:41 +0200 Sebastian Gottschall wrote:
> Am 25.07.2020 um 14:25 schrieb Hillf Danton:
> > On Sat, 25 Jul 2020 12:38:00 +0200 Sebastian Gottschall wrote:
> >> you may consider this
> >>
> >> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1142611.html 
> >>
> >> <https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1142611.html> 
> >>
> > Thanks very much for your link.
> >
> >> years ago someone already wanted to bring this feature upstream, but it
> >> was denied. i already tested this patch the last 2 days and it worked so
> >> far (with some little modifications)
> >> so such a solution existed already and may be considered here
> >
> > I don't see outstanding difference in principle from Paolo's work in
> > 2016 except for the use of kthread_create() and friends because kworker
> > made use of them even before 2016. This is a simpler one as shown by
> > the diff stat in his cover letter.
> >
> i agree. i just can say that i tested this patch recently due this 
> discussion here. and it can be changed by sysfs. but it doesnt work for
> wifi drivers which are mainly using dummy netdev devices. for this i 
> made a small patch to get them working using napi_set_threaded manually 
> hardcoded in the drivers. (see patch bellow)

By CONFIG_THREADED_NAPI, there is no need to consider what you did here
in the napi core because device drivers know better and are responsible
for it before calling napi_schedule(n).

> i also tested various networking drivers. one thing i notice doesnt 
> work. some napi code is used for tx polling. so from my experience this 
> concept just works good for rx with the most drivers.

Drivers are also taking care of the napi::poll cb before calling
netif_threaded_napi_add(), while the core offers napi threads. But these
are the trivial differences from the 2016 RFC AFAICS.

> so far i tested mt76, ath10k and some soc ethernet chipsets with good 
> success. on ath10k i had about 10 - 20% performance gain on multicore 
> systems. using standard iperf3 with 4 parallel streams.

Thanks for sharing the tests.

>
> -5439,7 +5441,7 @@ int napi_set_threaded(struct napi_struct *n, bool
>                  clear_bit(NAPI_STATE_THREADED, &n->state);
> 
>          /* if the device is initializing, nothing todo */
> -       if (test_bit(__LINK_STATE_START, &n->dev->state))
> +       if (test_bit(__LINK_STATE_START, &n->dev->state) && 
> n->dev->reg_state != NETREG_DUMMY)
>                  return 0;
> 
>          napi_thread_stop(n);
Sebastian Gottschall July 25, 2020, 3:41 p.m. UTC | #14
>> i agree. i just can say that i tested this patch recently due this
>> discussion here. and it can be changed by sysfs. but it doesnt work for
>> wifi drivers which are mainly using dummy netdev devices. for this i
>> made a small patch to get them working using napi_set_threaded manually
>> hardcoded in the drivers. (see patch bellow)
> By CONFIG_THREADED_NAPI, there is no need to consider what you did here
> in the napi core because device drivers know better and are responsible
> for it before calling napi_schedule(n).
yeah. but that approach will not work for some cases. some stupid 
drivers are using locking context in the napi poll function.
in that case the performance will runto shit. i discovered this with the 
mvneta eth driver (marvell) and mt76 tx polling (rx  works)
for mvneta is will cause very high latencies and packet drops. for mt76 
it causes packet stop. doesnt work simply (on all cases no crashes)
so the threading will only work for drivers which are compatible with 
that approach. it cannot be used as drop in replacement from my point of 
view.
its all a question of the driver design
Felix Fietkau July 25, 2020, 5:57 p.m. UTC | #15
On 2020-07-25 10:16, Hillf Danton wrote:
> Hi folks
> 
> Below is a minimunm poc implementation I can imagine on top of workqueue
> to make napi threaded. Thoughts are appreciated.
Hi Hillf,

For some reason I don't see your mails on linux-wireless/netdev.
I've cleaned up your implementation a bit and I ran some tests with mt76
on an mt7621 embedded board. The results look pretty nice, performance
is a lot more consistent in my tests now.

Here are the changes that I've made compared to your version:

- remove the #ifdef, I think it's unnecessary
- add a state bit for threaded NAPI
- make netif_threaded_napi_add inline
- run queue_work outside of local_irq_save/restore (it does that
internally already)

If you don't mind, I'd like to propose this to netdev soon. Can I have
your Signed-off-by for that?

Thanks,

- Felix

---
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -347,6 +347,7 @@ struct napi_struct {
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
 	unsigned int		napi_id;
+	struct work_struct	work;
 };
 
 enum {
@@ -357,6 +358,7 @@ enum {
 	NAPI_STATE_HASHED,	/* In NAPI hash (busy polling possible) */
 	NAPI_STATE_NO_BUSY_POLL,/* Do not add in napi_hash, no busy polling */
 	NAPI_STATE_IN_BUSY_POLL,/* sk_busy_loop() owns this NAPI */
+	NAPI_STATE_THREADED,	/* Use threaded NAPI */
 };
 
 enum {
@@ -367,6 +369,7 @@ enum {
 	NAPIF_STATE_HASHED	 = BIT(NAPI_STATE_HASHED),
 	NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
 	NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
+	NAPIF_STATE_THREADED	 = BIT(NAPI_STATE_THREADED),
 };
 
 enum gro_result {
@@ -2315,6 +2318,26 @@ static inline void *netdev_priv(const struct net_device *dev)
 void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
 		    int (*poll)(struct napi_struct *, int), int weight);
 
+/**
+ *	netif_threaded_napi_add - initialize a NAPI context
+ *	@dev:  network device
+ *	@napi: NAPI context
+ *	@poll: polling function
+ *	@weight: default weight
+ *
+ * This variant of netif_napi_add() should be used from drivers using NAPI
+ * with CPU intensive poll functions.
+ * This will schedule polling from a high priority workqueue that
+ */
+static inline void netif_threaded_napi_add(struct net_device *dev,
+					   struct napi_struct *napi,
+					   int (*poll)(struct napi_struct *, int),
+					   int weight)
+{
+	set_bit(NAPI_STATE_THREADED, &napi->state);
+	netif_napi_add(dev, napi, poll, weight);
+}
+
 /**
  *	netif_tx_napi_add - initialize a NAPI context
  *	@dev:  network device
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -158,6 +158,7 @@ static DEFINE_SPINLOCK(offload_lock);
 struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
 struct list_head ptype_all __read_mostly;	/* Taps */
 static struct list_head offload_base __read_mostly;
+static struct workqueue_struct *napi_workq;
 
 static int netif_rx_internal(struct sk_buff *skb);
 static int call_netdevice_notifiers_info(unsigned long val,
@@ -6286,6 +6287,11 @@ void __napi_schedule(struct napi_struct *n)
 {
 	unsigned long flags;
 
+	if (test_bit(NAPI_STATE_THREADED, &n->state)) {
+		queue_work(napi_workq, &n->work);
+		return;
+	}
+
 	local_irq_save(flags);
 	____napi_schedule(this_cpu_ptr(&softnet_data), n);
 	local_irq_restore(flags);
@@ -6333,6 +6339,11 @@ EXPORT_SYMBOL(napi_schedule_prep);
  */
 void __napi_schedule_irqoff(struct napi_struct *n)
 {
+	if (test_bit(NAPI_STATE_THREADED, &n->state)) {
+		queue_work(napi_workq, &n->work);
+		return;
+	}
+
 	____napi_schedule(this_cpu_ptr(&softnet_data), n);
 }
 EXPORT_SYMBOL(__napi_schedule_irqoff);
@@ -6601,6 +6612,29 @@ static void init_gro_hash(struct napi_struct *napi)
 	napi->gro_bitmask = 0;
 }
 
+static void napi_workfn(struct work_struct *work)
+{
+	struct napi_struct *n = container_of(work, struct napi_struct, work);
+
+	for (;;) {
+		if (!test_bit(NAPI_STATE_SCHED, &n->state))
+			return;
+
+		if (n->poll(n, n->weight) < n->weight)
+			return;
+
+		if (need_resched()) {
+			/*
+			 * have to pay for the latency of task switch even if
+			 * napi is scheduled
+			 */
+			if (test_bit(NAPI_STATE_SCHED, &n->state))
+				queue_work(napi_workq, work);
+			return;
+		}
+	}
+}
+
 void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
 		    int (*poll)(struct napi_struct *, int), int weight)
 {
@@ -6621,6 +6655,7 @@ void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
 #ifdef CONFIG_NETPOLL
 	napi->poll_owner = -1;
 #endif
+	INIT_WORK(&napi->work, napi_workfn);
 	set_bit(NAPI_STATE_SCHED, &napi->state);
 	napi_hash_add(napi);
 }
@@ -10676,6 +10711,10 @@ static int __init net_dev_init(void)
 		sd->backlog.weight = weight_p;
 	}
 
+	napi_workq = alloc_workqueue("napi_workq", WQ_UNBOUND | WQ_HIGHPRI,
+				     WQ_UNBOUND_MAX_ACTIVE);
+	BUG_ON(!napi_workq);
+
 	dev_boot_phase = 0;
 
 	/* The loopback device is special if any other network devices
Hillf Danton July 26, 2020, 1:22 a.m. UTC | #16
Hi Felix,

On Sat, 25 Jul 2020 19:57:23 +0200 Felix Fietkau wrote:
> On 2020-07-25 10:16, Hillf Danton wrote:
> > Hi folks
> > 
> > Below is a minimunm poc implementation I can imagine on top of workqueue
> > to make napi threaded. Thoughts are appreciated.
> > 
> Hi Hillf,
> 
> For some reason I don't see your mails on linux-wireless/netdev.

All is down to my inbox ...

> I've cleaned up your implementation a bit and I ran some tests with mt76
> on an mt7621 embedded board. The results look pretty nice, performance
> is a lot more consistent in my tests now.

Thanks very much for your cleanup and testing.

> Here are the changes that I've made compared to your version:
> 
> - remove the #ifdef, I think it's unnecessary

This makes the code a mile nicer and perhaps I fear pulls some questions
from EricD.

> - add a state bit for threaded NAPI
> - make netif_threaded_napi_add inline
> - run queue_work outside of local_irq_save/restore (it does that
> internally already)
>
> If you don't mind, I'd like to propose this to netdev soon. Can I have
> your Signed-off-by for that?

Feel free to do that. Is it likely for me to select a Cc?

Thanks
Hillf

> Thanks,
> 
> - Felix
> 
> ---
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -347,6 +347,7 @@ struct napi_struct {
>  	struct list_head	dev_list;
>  	struct hlist_node	napi_hash_node;
>  	unsigned int		napi_id;
> +	struct work_struct	work;
>  };
>  
>  enum {
> @@ -357,6 +358,7 @@ enum {
>  	NAPI_STATE_HASHED,	/* In NAPI hash (busy polling possible) */
>  	NAPI_STATE_NO_BUSY_POLL,/* Do not add in napi_hash, no busy polling */
>  	NAPI_STATE_IN_BUSY_POLL,/* sk_busy_loop() owns this NAPI */
> +	NAPI_STATE_THREADED,	/* Use threaded NAPI */
>  };
>  
>  enum {
> @@ -367,6 +369,7 @@ enum {
>  	NAPIF_STATE_HASHED	 = BIT(NAPI_STATE_HASHED),
>  	NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
>  	NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
> +	NAPIF_STATE_THREADED	 = BIT(NAPI_STATE_THREADED),
>  };
>  
>  enum gro_result {
> @@ -2315,6 +2318,26 @@ static inline void *netdev_priv(const struct net_device *dev)
>  void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
>  		    int (*poll)(struct napi_struct *, int), int weight);
>  
> +/**
> + *	netif_threaded_napi_add - initialize a NAPI context
> + *	@dev:  network device
> + *	@napi: NAPI context
> + *	@poll: polling function
> + *	@weight: default weight
> + *
> + * This variant of netif_napi_add() should be used from drivers using NAPI
> + * with CPU intensive poll functions.
> + * This will schedule polling from a high priority workqueue that
> + */
> +static inline void netif_threaded_napi_add(struct net_device *dev,
> +					   struct napi_struct *napi,
> +					   int (*poll)(struct napi_struct *, int),
> +					   int weight)
> +{
> +	set_bit(NAPI_STATE_THREADED, &napi->state);
> +	netif_napi_add(dev, napi, poll, weight);
> +}
> +
>  /**
>   *	netif_tx_napi_add - initialize a NAPI context
>   *	@dev:  network device
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -158,6 +158,7 @@ static DEFINE_SPINLOCK(offload_lock);
>  struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
>  struct list_head ptype_all __read_mostly;	/* Taps */
>  static struct list_head offload_base __read_mostly;
> +static struct workqueue_struct *napi_workq;

Is __read_mostly missing?

>  static int netif_rx_internal(struct sk_buff *skb);
>  static int call_netdevice_notifiers_info(unsigned long val,
> @@ -6286,6 +6287,11 @@ void __napi_schedule(struct napi_struct *n)
>  {
>  	unsigned long flags;
>  
> +	if (test_bit(NAPI_STATE_THREADED, &n->state)) {
> +		queue_work(napi_workq, &n->work);
> +		return;
> +	}
> +
>  	local_irq_save(flags);
>  	____napi_schedule(this_cpu_ptr(&softnet_data), n);
>  	local_irq_restore(flags);
> @@ -6333,6 +6339,11 @@ EXPORT_SYMBOL(napi_schedule_prep);
>   */
>  void __napi_schedule_irqoff(struct napi_struct *n)
>  {
> +	if (test_bit(NAPI_STATE_THREADED, &n->state)) {
> +		queue_work(napi_workq, &n->work);
> +		return;
> +	}
> +
>  	____napi_schedule(this_cpu_ptr(&softnet_data), n);
>  }
>  EXPORT_SYMBOL(__napi_schedule_irqoff);
> @@ -6601,6 +6612,29 @@ static void init_gro_hash(struct napi_struct *napi)
>  	napi->gro_bitmask = 0;
>  }
>  
> +static void napi_workfn(struct work_struct *work)
> +{
> +	struct napi_struct *n = container_of(work, struct napi_struct, work);
> +
> +	for (;;) {
> +		if (!test_bit(NAPI_STATE_SCHED, &n->state))
> +			return;
> +
> +		if (n->poll(n, n->weight) < n->weight)
> +			return;
> +
> +		if (need_resched()) {
> +			/*
> +			 * have to pay for the latency of task switch even if
> +			 * napi is scheduled
> +			 */
> +			if (test_bit(NAPI_STATE_SCHED, &n->state))
> +				queue_work(napi_workq, work);
> +			return;
> +		}
> +	}
> +}
> +
>  void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
>  		    int (*poll)(struct napi_struct *, int), int weight)
>  {
> @@ -6621,6 +6655,7 @@ void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
>  #ifdef CONFIG_NETPOLL
>  	napi->poll_owner = -1;
>  #endif
> +	INIT_WORK(&napi->work, napi_workfn);
>  	set_bit(NAPI_STATE_SCHED, &napi->state);
>  	napi_hash_add(napi);
>  }
> @@ -10676,6 +10711,10 @@ static int __init net_dev_init(void)
>  		sd->backlog.weight = weight_p;
>  	}
>  
> +	napi_workq = alloc_workqueue("napi_workq", WQ_UNBOUND | WQ_HIGHPRI,
> +				     WQ_UNBOUND_MAX_ACTIVE);
> +	BUG_ON(!napi_workq);
> +
>  	dev_boot_phase = 0;
>  
>  	/* The loopback device is special if any other network devices
>
Felix Fietkau July 26, 2020, 8:10 a.m. UTC | #17
On 2020-07-26 03:22, Hillf Danton wrote:
>> - add a state bit for threaded NAPI
>> - make netif_threaded_napi_add inline
>> - run queue_work outside of local_irq_save/restore (it does that
>> internally already)
>>
>> If you don't mind, I'd like to propose this to netdev soon. Can I have
>> your Signed-off-by for that?
> 
> Feel free to do that. Is it likely for me to select a Cc?
Shall I use Signed-off-by: Hillf Danton <hdanton@sina.com>?
What Cc do you want me to add?

>> ---
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -347,6 +347,7 @@ struct napi_struct {
>>  	struct list_head	dev_list;
>>  	struct hlist_node	napi_hash_node;
>>  	unsigned int		napi_id;
>> +	struct work_struct	work;
>>  };
>>  
>>  enum {
>> @@ -357,6 +358,7 @@ enum {
>>  	NAPI_STATE_HASHED,	/* In NAPI hash (busy polling possible) */
>>  	NAPI_STATE_NO_BUSY_POLL,/* Do not add in napi_hash, no busy polling */
>>  	NAPI_STATE_IN_BUSY_POLL,/* sk_busy_loop() owns this NAPI */
>> +	NAPI_STATE_THREADED,	/* Use threaded NAPI */
>>  };
>>  
>>  enum {
>> @@ -367,6 +369,7 @@ enum {
>>  	NAPIF_STATE_HASHED	 = BIT(NAPI_STATE_HASHED),
>>  	NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
>>  	NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
>> +	NAPIF_STATE_THREADED	 = BIT(NAPI_STATE_THREADED),
>>  };
>>  
>>  enum gro_result {
>> @@ -2315,6 +2318,26 @@ static inline void *netdev_priv(const struct net_device *dev)
>>  void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
>>  		    int (*poll)(struct napi_struct *, int), int weight);
>>  
>> +/**
>> + *	netif_threaded_napi_add - initialize a NAPI context
>> + *	@dev:  network device
>> + *	@napi: NAPI context
>> + *	@poll: polling function
>> + *	@weight: default weight
>> + *
>> + * This variant of netif_napi_add() should be used from drivers using NAPI
>> + * with CPU intensive poll functions.
>> + * This will schedule polling from a high priority workqueue that
>> + */
>> +static inline void netif_threaded_napi_add(struct net_device *dev,
>> +					   struct napi_struct *napi,
>> +					   int (*poll)(struct napi_struct *, int),
>> +					   int weight)
>> +{
>> +	set_bit(NAPI_STATE_THREADED, &napi->state);
>> +	netif_napi_add(dev, napi, poll, weight);
>> +}
>> +
>>  /**
>>   *	netif_tx_napi_add - initialize a NAPI context
>>   *	@dev:  network device
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -158,6 +158,7 @@ static DEFINE_SPINLOCK(offload_lock);
>>  struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
>>  struct list_head ptype_all __read_mostly;	/* Taps */
>>  static struct list_head offload_base __read_mostly;
>> +static struct workqueue_struct *napi_workq;
> 
> Is __read_mostly missing?
Yes, thanks. I will add that before sending the patch.

- Felix
Hillf Danton July 26, 2020, 8:32 a.m. UTC | #18
On Sun, 26 Jul 2020 10:10:15 +0200 Felix Fietkau wrote:
> On 2020-07-26 03:22, Hillf Danton wrote:
> > 
> > Feel free to do that. Is it likely for me to select a Cc?
> > 
> Shall I use Signed-off-by: Hillf Danton <hdanton@sina.com>?

s/Signed-off-by/Cc/

> What Cc do you want me to add?

I prefer Cc over other tags.
Felix Fietkau July 26, 2020, 8:59 a.m. UTC | #19
On 2020-07-26 10:32, Hillf Danton wrote:
> 
> On Sun, 26 Jul 2020 10:10:15 +0200 Felix Fietkau wrote:
>> On 2020-07-26 03:22, Hillf Danton wrote:
>> > 
>> > Feel free to do that. Is it likely for me to select a Cc?
>> > 
>> Shall I use Signed-off-by: Hillf Danton <hdanton@sina.com>?
> 
> s/Signed-off-by/Cc/
> 
>> What Cc do you want me to add?
> 
> I prefer Cc over other tags.
Ah, okay. I was planning on adding you as the author of the patch, since
you did most of the work on it. If you want to be attributed as author
(or in Co-developed-by), I'd need your Signed-off-by.

If you don't want that, I can submit it under my name and leave you in
as Cc only.

- Felix
David Laight July 26, 2020, 11:16 a.m. UTC | #20
From: Sebastian Gottschall <s.gottschall@dd-wrt.com>
> Sent: 25 July 2020 16:42
> >> i agree. i just can say that i tested this patch recently due this
> >> discussion here. and it can be changed by sysfs. but it doesnt work for
> >> wifi drivers which are mainly using dummy netdev devices. for this i
> >> made a small patch to get them working using napi_set_threaded manually
> >> hardcoded in the drivers. (see patch bellow)

> > By CONFIG_THREADED_NAPI, there is no need to consider what you did here
> > in the napi core because device drivers know better and are responsible
> > for it before calling napi_schedule(n).

> yeah. but that approach will not work for some cases. some stupid
> drivers are using locking context in the napi poll function.
> in that case the performance will runto shit. i discovered this with the
> mvneta eth driver (marvell) and mt76 tx polling (rx  works)
> for mvneta is will cause very high latencies and packet drops. for mt76
> it causes packet stop. doesnt work simply (on all cases no crashes)
> so the threading will only work for drivers which are compatible with
> that approach. it cannot be used as drop in replacement from my point of
> view.
> its all a question of the driver design

Why should it make (much) difference whether the napi callbacks (etc)
are done in the context of the interrupted process or that of a
specific kernel thread.
The process flags (or whatever) can even be set so that it appears
to be the expected 'softint' context.

In any case running NAPI from a thread will just show up the next
piece of code that runs for ages in softint context.
I think I've seen the tail end of memory being freed under rcu
finally happening under softint and taking absolutely ages.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Rakesh Pillai July 28, 2020, 4:59 p.m. UTC | #21
> -----Original Message-----
> From: David Laight <David.Laight@ACULAB.COM>
> Sent: Sunday, July 26, 2020 4:46 PM
> To: 'Sebastian Gottschall' <s.gottschall@dd-wrt.com>; Hillf Danton
> <hdanton@sina.com>
> Cc: Andrew Lunn <andrew@lunn.ch>; Rakesh Pillai <pillair@codeaurora.org>;
> netdev@vger.kernel.org; linux-wireless@vger.kernel.org; linux-
> kernel@vger.kernel.org; ath10k@lists.infradead.org;
> dianders@chromium.org; Markus Elfring <Markus.Elfring@web.de>;
> evgreen@chromium.org; kuba@kernel.org; johannes@sipsolutions.net;
> davem@davemloft.net; kvalo@codeaurora.org
> Subject: RE: [RFC 0/7] Add support to process rx packets in thread
> 
> From: Sebastian Gottschall <s.gottschall@dd-wrt.com>
> > Sent: 25 July 2020 16:42
> > >> i agree. i just can say that i tested this patch recently due this
> > >> discussion here. and it can be changed by sysfs. but it doesnt work for
> > >> wifi drivers which are mainly using dummy netdev devices. for this i
> > >> made a small patch to get them working using napi_set_threaded
> manually
> > >> hardcoded in the drivers. (see patch bellow)
> 
> > > By CONFIG_THREADED_NAPI, there is no need to consider what you did
> here
> > > in the napi core because device drivers know better and are responsible
> > > for it before calling napi_schedule(n).
> 
> > yeah. but that approach will not work for some cases. some stupid
> > drivers are using locking context in the napi poll function.
> > in that case the performance will runto shit. i discovered this with the
> > mvneta eth driver (marvell) and mt76 tx polling (rx  works)
> > for mvneta is will cause very high latencies and packet drops. for mt76
> > it causes packet stop. doesnt work simply (on all cases no crashes)
> > so the threading will only work for drivers which are compatible with
> > that approach. it cannot be used as drop in replacement from my point of
> > view.
> > its all a question of the driver design
> 
> Why should it make (much) difference whether the napi callbacks (etc)
> are done in the context of the interrupted process or that of a
> specific kernel thread.
> The process flags (or whatever) can even be set so that it appears
> to be the expected 'softint' context.
> 
> In any case running NAPI from a thread will just show up the next
> piece of code that runs for ages in softint context.
> I think I've seen the tail end of memory being freed under rcu
> finally happening under softint and taking absolutely ages.
> 
> 	David
> 

Hi All,

Is the threaded NAPI change posted to kernel ? 
Is the conclusion of this discussion that " we cannot use threads for processing packets " ??


> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes,
> MK1 1PT, UK
> Registration No: 1397386 (Wales)
Hillf Danton July 29, 2020, 1:34 a.m. UTC | #22
On Tue, 28 Jul 2020 22:29:02 +0530 Rakesh Pillai wrote:
> > -----Original Message-----
> > From: David Laight <David.Laight@ACULAB.COM>
> > Sent: Sunday, July 26, 2020 4:46 PM
> > To: 'Sebastian Gottschall' <s.gottschall@dd-wrt.com>; Hillf Danton
> > <hdanton@sina.com>
> > Cc: Andrew Lunn <andrew@lunn.ch>; Rakesh Pillai =
> <pillair@codeaurora.org>;
> > netdev@vger.kernel.org; linux-wireless@vger.kernel.org; linux-
> > kernel@vger.kernel.org; ath10k@lists.infradead.org;
> > dianders@chromium.org; Markus Elfring <Markus.Elfring@web.de>;
> > evgreen@chromium.org; kuba@kernel.org; johannes@sipsolutions.net;
> > davem@davemloft.net; kvalo@codeaurora.org
> > Subject: RE: [RFC 0/7] Add support to process rx packets in thread
> >=20
> > From: Sebastian Gottschall <s.gottschall@dd-wrt.com>
> > > Sent: 25 July 2020 16:42
> > > >> i agree. i just can say that i tested this patch recently due =
> this
> > > >> discussion here. and it can be changed by sysfs. but it doesnt =
> work for
> > > >> wifi drivers which are mainly using dummy netdev devices. for =
> this i
> > > >> made a small patch to get them working using napi_set_threaded
> > manually
> > > >> hardcoded in the drivers. (see patch bellow)
> >=20
> > > > By CONFIG_THREADED_NAPI, there is no need to consider what you did
> > here
> > > > in the napi core because device drivers know better and are =
> responsible
> > > > for it before calling napi_schedule(n).
> >=20
> > > yeah. but that approach will not work for some cases. some stupid
> > > drivers are using locking context in the napi poll function.
> > > in that case the performance will runto shit. i discovered this with =
> the
> > > mvneta eth driver (marvell) and mt76 tx polling (rx  works)
> > > for mvneta is will cause very high latencies and packet drops. for =
> mt76
> > > it causes packet stop. doesnt work simply (on all cases no crashes)
> > > so the threading will only work for drivers which are compatible =
> with
> > > that approach. it cannot be used as drop in replacement from my =
> point of
> > > view.
> > > its all a question of the driver design
> >=20
> > Why should it make (much) difference whether the napi callbacks (etc)
> > are done in the context of the interrupted process or that of a
> > specific kernel thread.
> > The process flags (or whatever) can even be set so that it appears
> > to be the expected 'softint' context.
> >=20
> > In any case running NAPI from a thread will just show up the next
> > piece of code that runs for ages in softint context.
> > I think I've seen the tail end of memory being freed under rcu
> > finally happening under softint and taking absolutely ages.
> >=20
> > 	David
> >=20
> 
> Hi All,
> 
> Is the threaded NAPI change posted to kernel ?

https://lore.kernel.org/netdev/20200726163119.86162-1-nbd@nbd.name/
https://lore.kernel.org/netdev/20200727123239.4921-1-nbd@nbd.name/

> Is the conclusion of this discussion that " we cannot use threads for
> processing packets " ??

That isn't it if any conclusion reached. Hard to answer your question
TBH, and OTOH I'm wondering in which context device driver developer
prefers to handle tx/rx, IRQ or BH or user context on available idle
CPUs, what is preventing them from doing that? Is it likely making
ant-antenna-size sense to set the napi::weight to 3 and turn to 30
kworkers for processing the ten-minute packet flood hitting the hardware
for instance on a system with 32 CPU cores or more?