diff mbox series

[RFC,1/2] net: Add sysfs files for threaded NAPI.

Message ID 20230524111259.1323415-2-bigeasy@linutronix.de (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series net: Add sysfs files for threaded NAPI. | expand

Checks

Context Check Description
netdev/series_format warning Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 4195 this patch: 4195
netdev/cc_maintainers warning 3 maintainers not CCed: gregkh@linuxfoundation.org wangyufen@huawei.com davem@davemloft.net
netdev/build_clang success Errors and warnings before: 921 this patch: 921
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 4414 this patch: 4414
netdev/checkpatch warning WARNING: line length of 82 exceeds 80 columns WARNING: line length of 94 exceeds 80 columns WARNING: line length of 97 exceeds 80 columns
netdev/kdoc fail Errors and warnings before: 0 this patch: 1
netdev/source_inline success Was 0 now: 0

Commit Message

Sebastian Andrzej Siewior May 24, 2023, 11:12 a.m. UTC
I've been looking into threaded NAPI. One awkward thing to do is
to figure out the thread names, pids in order to adjust the thread
priorities and SMP affinity.
On PREEMPT_RT the NAPI thread is treated (by the user) the same way as
the threaded interrupt which means a dedicate CPU affinity for the
thread and a higher task priority to be favoured over other tasks on the
CPU. Otherwise the NAPI thread can be preempted by other threads leading
to delays in packet delivery.
Having to run ps/ grep is awkward to get the PID right. It is not easy
to match the interrupt since there is no obvious relation between the
IRQ and the NAPI thread.
NAPI threads are enabled often to mitigate the problems caused by a
"pending" ksoftirqd (which has been mitigated recently by doing softiqrs
regardless of ksoftirqd status). There is still the part that the NAPI
thread does not use softnet_data::poll_list.

To make things easier to setup NAPI threads here is a sysfs interfaces.
It provides for each NAPI instance a folder containing the name and PID
of the NAPI thread and an interrupt number of the interrupt scheduling
the NAPI thread. The latter requires support from the driver.
The name of the napi-instance can also be set by driver so it does not
fallback to the NAPI-id.

I've been thinking to wire up task affinity to follow the affinity of
the interrupt thread. While this would require some extra work, it
shouldn't be needed since the PID of the NAPI thread and interrupt
number is exposed so the user may use chrt/ taskset to adjust the
priority accordingly and the interrupt affinity does not change
magically.

Having said all that, there is still no generic solution to the
"overload" problem. Part of the problem is lack of policy since the
offload to ksoftirqd is not welcomed by everyone. Also "better" cards
support filtering by ether type which allows to filter the problematic
part to another NAPI instance avoiding the prbolem.

This is what the structure looks with the igb driver after adding the
name/ irq hints (second patch).

| root@box:/sys/class/net# cd eno0
| root@box:/sys/class/net/eno0# ls -l napi
| total 0

Empty before threaded NAPI is enabled.

| root@box:/sys/class/net/eno0# echo 1 > threaded
| root@box:/sys/class/net/eno0# ls -l napi
| total 0
| drwxr-xr-x 2 root root 0 May 24 09:42 eno0-TxRx-0
| drwxr-xr-x 2 root root 0 May 24 09:42 eno0-TxRx-1
| drwxr-xr-x 2 root root 0 May 24 09:42 eno0-TxRx-2
| drwxr-xr-x 2 root root 0 May 24 09:42 eno0-TxRx-3
| drwxr-xr-x 2 root root 0 May 24 09:42 eno0-TxRx-4
| drwxr-xr-x 2 root root 0 May 24 09:42 eno0-TxRx-5
| drwxr-xr-x 2 root root 0 May 24 09:42 eno0-TxRx-6
| drwxr-xr-x 2 root root 0 May 24 09:42 eno0-TxRx-7

Deployed using names supplied by the driver which map the names which
are used for the IRQ.

| root@box:/sys/class/net/eno0# grep . napi/*/*
| napi/eno0-TxRx-0/interrupt:37
| napi/eno0-TxRx-0/name:napi/eno0-8193
| napi/eno0-TxRx-0/pid:2253
| napi/eno0-TxRx-1/interrupt:38
| napi/eno0-TxRx-1/name:napi/eno0-8194
| napi/eno0-TxRx-1/pid:2252
| napi/eno0-TxRx-2/interrupt:39
| napi/eno0-TxRx-2/name:napi/eno0-8195
| napi/eno0-TxRx-2/pid:2251
| napi/eno0-TxRx-3/interrupt:40
| napi/eno0-TxRx-3/name:napi/eno0-8196
| napi/eno0-TxRx-3/pid:2250
| napi/eno0-TxRx-4/interrupt:41
| napi/eno0-TxRx-4/name:napi/eno0-8197
| napi/eno0-TxRx-4/pid:2249
| napi/eno0-TxRx-5/interrupt:42
| napi/eno0-TxRx-5/name:napi/eno0-8198
| napi/eno0-TxRx-5/pid:2248
| napi/eno0-TxRx-6/interrupt:43
| napi/eno0-TxRx-6/name:napi/eno0-8199
| napi/eno0-TxRx-6/pid:2247
| napi/eno0-TxRx-7/interrupt:44
| napi/eno0-TxRx-7/name:napi/eno0-8200
| napi/eno0-TxRx-7/pid:2246

Thread name, pid and interrupt number as provided by the driver.

| root@box:/sys/class/net/eno0# grep eno0-TxRx-7 /proc/interrupts | sed 's@ \+@ @g'
|  44: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 123 0 0 0 0 0 0 0 0 0 0 0 0 0 IR-PCI-MSIX-0000:07:00.0 8-edge eno0-TxRx-7
| root@box:/sys/class/net/eno0# cat /proc/irq/44/smp_affinity_list
| 0-7,16-23
| root@box:/sys/class/net/eno0# cat /proc/irq/44/effective_affinity_list
| 18

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/netdevice.h |   6 ++
 net/core/dev.c            |  13 ++++
 net/core/net-sysfs.c      | 137 ++++++++++++++++++++++++++++++++++++++
 net/core/net-sysfs.h      |  12 ++++
 4 files changed, 168 insertions(+)

Comments

Simon Horman May 24, 2023, 1:38 p.m. UTC | #1
On Wed, May 24, 2023 at 01:12:58PM +0200, Sebastian Andrzej Siewior wrote:
> I've been looking into threaded NAPI. One awkward thing to do is
> to figure out the thread names, pids in order to adjust the thread
> priorities and SMP affinity.
> On PREEMPT_RT the NAPI thread is treated (by the user) the same way as
> the threaded interrupt which means a dedicate CPU affinity for the
> thread and a higher task priority to be favoured over other tasks on the
> CPU. Otherwise the NAPI thread can be preempted by other threads leading
> to delays in packet delivery.
> Having to run ps/ grep is awkward to get the PID right. It is not easy
> to match the interrupt since there is no obvious relation between the
> IRQ and the NAPI thread.
> NAPI threads are enabled often to mitigate the problems caused by a
> "pending" ksoftirqd (which has been mitigated recently by doing softiqrs
> regardless of ksoftirqd status). There is still the part that the NAPI
> thread does not use softnet_data::poll_list.
> 
> To make things easier to setup NAPI threads here is a sysfs interfaces.
> It provides for each NAPI instance a folder containing the name and PID
> of the NAPI thread and an interrupt number of the interrupt scheduling
> the NAPI thread. The latter requires support from the driver.
> The name of the napi-instance can also be set by driver so it does not
> fallback to the NAPI-id.
> 
> I've been thinking to wire up task affinity to follow the affinity of
> the interrupt thread. While this would require some extra work, it
> shouldn't be needed since the PID of the NAPI thread and interrupt
> number is exposed so the user may use chrt/ taskset to adjust the
> priority accordingly and the interrupt affinity does not change
> magically.
> 
> Having said all that, there is still no generic solution to the
> "overload" problem. Part of the problem is lack of policy since the
> offload to ksoftirqd is not welcomed by everyone. Also "better" cards
> support filtering by ether type which allows to filter the problematic
> part to another NAPI instance avoiding the prbolem.
> 
> This is what the structure looks with the igb driver after adding the
> name/ irq hints (second patch).
> 
> | root@box:/sys/class/net# cd eno0
> | root@box:/sys/class/net/eno0# ls -l napi
> | total 0
> 
> Empty before threaded NAPI is enabled.
> 
> | root@box:/sys/class/net/eno0# echo 1 > threaded
> | root@box:/sys/class/net/eno0# ls -l napi
> | total 0
> | drwxr-xr-x 2 root root 0 May 24 09:42 eno0-TxRx-0
> | drwxr-xr-x 2 root root 0 May 24 09:42 eno0-TxRx-1
> | drwxr-xr-x 2 root root 0 May 24 09:42 eno0-TxRx-2
> | drwxr-xr-x 2 root root 0 May 24 09:42 eno0-TxRx-3
> | drwxr-xr-x 2 root root 0 May 24 09:42 eno0-TxRx-4
> | drwxr-xr-x 2 root root 0 May 24 09:42 eno0-TxRx-5
> | drwxr-xr-x 2 root root 0 May 24 09:42 eno0-TxRx-6
> | drwxr-xr-x 2 root root 0 May 24 09:42 eno0-TxRx-7
> 
> Deployed using names supplied by the driver which map the names which
> are used for the IRQ.
> 
> | root@box:/sys/class/net/eno0# grep . napi/*/*
> | napi/eno0-TxRx-0/interrupt:37
> | napi/eno0-TxRx-0/name:napi/eno0-8193
> | napi/eno0-TxRx-0/pid:2253
> | napi/eno0-TxRx-1/interrupt:38
> | napi/eno0-TxRx-1/name:napi/eno0-8194
> | napi/eno0-TxRx-1/pid:2252
> | napi/eno0-TxRx-2/interrupt:39
> | napi/eno0-TxRx-2/name:napi/eno0-8195
> | napi/eno0-TxRx-2/pid:2251
> | napi/eno0-TxRx-3/interrupt:40
> | napi/eno0-TxRx-3/name:napi/eno0-8196
> | napi/eno0-TxRx-3/pid:2250
> | napi/eno0-TxRx-4/interrupt:41
> | napi/eno0-TxRx-4/name:napi/eno0-8197
> | napi/eno0-TxRx-4/pid:2249
> | napi/eno0-TxRx-5/interrupt:42
> | napi/eno0-TxRx-5/name:napi/eno0-8198
> | napi/eno0-TxRx-5/pid:2248
> | napi/eno0-TxRx-6/interrupt:43
> | napi/eno0-TxRx-6/name:napi/eno0-8199
> | napi/eno0-TxRx-6/pid:2247
> | napi/eno0-TxRx-7/interrupt:44
> | napi/eno0-TxRx-7/name:napi/eno0-8200
> | napi/eno0-TxRx-7/pid:2246
> 
> Thread name, pid and interrupt number as provided by the driver.
> 
> | root@box:/sys/class/net/eno0# grep eno0-TxRx-7 /proc/interrupts | sed 's@ \+@ @g'
> |  44: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 123 0 0 0 0 0 0 0 0 0 0 0 0 0 IR-PCI-MSIX-0000:07:00.0 8-edge eno0-TxRx-7
> | root@box:/sys/class/net/eno0# cat /proc/irq/44/smp_affinity_list
> | 0-7,16-23
> | root@box:/sys/class/net/eno0# cat /proc/irq/44/effective_affinity_list
> | 18
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Hi Sebastian,

thanks for your patch.
A few nits from my side.

...

> @@ -2411,6 +2414,7 @@ struct net_device {
>  	struct rtnl_hw_stats64	*offload_xstats_l3;
>  
>  	struct devlink_port	*devlink_port;
> +	struct kset		*napi_kset;
>  };

nit: Please add napi_kset to the kdoc for struct net_device
     which is immediately above it.

...

> +int napi_thread_add_kobj(struct napi_struct *n)
> +{
> +	struct kobject *kobj;
> +	char napi_name[32];
> +	const char *name_ptr;

nit: Please use reverse xmas tree order - longest line to shortest - for
     networking code.

> +	int ret;
> +
> +	if (n->napi_name) {
> +		name_ptr = n->napi_name;
> +	} else {
> +		name_ptr = napi_name;
> +		scnprintf(napi_name, sizeof(napi_name), "napi_id-%d", n->napi_id);
> +	}
> +	kobj = &n->kobj;
> +	kobj->kset = n->dev->napi_kset;
> +	ret = kobject_init_and_add(kobj, &napi_ktype, NULL,
> +				   name_ptr);
> +	return ret;
> +}

...
Eric Dumazet May 24, 2023, 1:53 p.m. UTC | #2
On Wed, May 24, 2023 at 1:13 PM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> I've been looking into threaded NAPI. One awkward thing to do is
> to figure out the thread names, pids in order to adjust the thread
> priorities and SMP affinity.
> On PREEMPT_RT the NAPI thread is treated (by the user) the same way as
> the threaded interrupt which means a dedicate CPU affinity for the
> thread and a higher task priority to be favoured over other tasks on the
> CPU. Otherwise the NAPI thread can be preempted by other threads leading
> to delays in packet delivery.
> Having to run ps/ grep is awkward to get the PID right. It is not easy
> to match the interrupt since there is no obvious relation between the
> IRQ and the NAPI thread.
> NAPI threads are enabled often to mitigate the problems caused by a
> "pending" ksoftirqd (which has been mitigated recently by doing softiqrs
> regardless of ksoftirqd status). There is still the part that the NAPI
> thread does not use softnet_data::poll_list.
>

How is interface rename handled ?

root@edumazet1:~# ip link show dev dummy0
4: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode
DEFAULT group default qlen 1000
    link/ether f2:38:20:69:b4:ca brd ff:ff:ff:ff:ff:ff
root@edumazet1:~# ip link set dummy0 name new-name
root@edumazet1:~# ip link show dev dummy0
Device "dummy0" does not exist.
root@edumazet1:~# ip link show dev new-name
4: new-name: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode
DEFAULT group default qlen 1000
    link/ether f2:38:20:69:b4:ca brd ff:ff:ff:ff:ff:ff

Thanks.
Jiri Pirko May 24, 2023, 2:41 p.m. UTC | #3
Wed, May 24, 2023 at 03:53:27PM CEST, edumazet@google.com wrote:
>On Wed, May 24, 2023 at 1:13 PM Sebastian Andrzej Siewior
><bigeasy@linutronix.de> wrote:
>>
>> I've been looking into threaded NAPI. One awkward thing to do is
>> to figure out the thread names, pids in order to adjust the thread
>> priorities and SMP affinity.
>> On PREEMPT_RT the NAPI thread is treated (by the user) the same way as
>> the threaded interrupt which means a dedicate CPU affinity for the
>> thread and a higher task priority to be favoured over other tasks on the
>> CPU. Otherwise the NAPI thread can be preempted by other threads leading
>> to delays in packet delivery.
>> Having to run ps/ grep is awkward to get the PID right. It is not easy
>> to match the interrupt since there is no obvious relation between the
>> IRQ and the NAPI thread.
>> NAPI threads are enabled often to mitigate the problems caused by a
>> "pending" ksoftirqd (which has been mitigated recently by doing softiqrs
>> regardless of ksoftirqd status). There is still the part that the NAPI
>> thread does not use softnet_data::poll_list.
>>
>
>How is interface rename handled ?

Well, having name of interface in sysfs dir/file name is silly, even if
it would be handled correctly. Should not be there.


>
>root@edumazet1:~# ip link show dev dummy0
>4: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode
>DEFAULT group default qlen 1000
>    link/ether f2:38:20:69:b4:ca brd ff:ff:ff:ff:ff:ff
>root@edumazet1:~# ip link set dummy0 name new-name
>root@edumazet1:~# ip link show dev dummy0
>Device "dummy0" does not exist.
>root@edumazet1:~# ip link show dev new-name
>4: new-name: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode
>DEFAULT group default qlen 1000
>    link/ether f2:38:20:69:b4:ca brd ff:ff:ff:ff:ff:ff
>
>Thanks.
>
Sebastian Andrzej Siewior May 24, 2023, 2:43 p.m. UTC | #4
On 2023-05-24 15:53:27 [+0200], Eric Dumazet wrote:
> How is interface rename handled ?
> 
> root@edumazet1:~# ip link show dev dummy0
> 4: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode
> DEFAULT group default qlen 1000
>     link/ether f2:38:20:69:b4:ca brd ff:ff:ff:ff:ff:ff
> root@edumazet1:~# ip link set dummy0 name new-name
> root@edumazet1:~# ip link show dev dummy0
> Device "dummy0" does not exist.
> root@edumazet1:~# ip link show dev new-name
> 4: new-name: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode
> DEFAULT group default qlen 1000
>     link/ether f2:38:20:69:b4:ca brd ff:ff:ff:ff:ff:ff

Not sure I understood the question. But after
	ip link set eno0 name newdev

I have
| root@box:/sys/class/net/newdev# ps uax | grep napi
| root        2246  0.0  0.0      0     0 ?        S    12:04   0:00 [napi/eno0-8200]
| root        2247  0.0  0.0      0     0 ?        S    12:04   0:00 [napi/eno0-8199]
| root        2248  0.0  0.0      0     0 ?        S    12:04   0:00 [napi/eno0-8198]
| root        2249  0.0  0.0      0     0 ?        S    12:04   0:00 [napi/eno0-8197]
| root        2250  0.0  0.0      0     0 ?        S    12:04   0:00 [napi/eno0-8196]
| root        2251  0.0  0.0      0     0 ?        S    12:04   0:00 [napi/eno0-8195]
| root        2252  0.0  0.0      0     0 ?        S    12:04   0:00 [napi/eno0-8194]
| root        2253  0.0  0.0      0     0 ?        S    12:04   0:00 [napi/eno0-8193]
|
| root@box:/sys/class/net/newdev# ls -lh napi
| total 0
| drwxr-xr-x 2 root root 0 May 24 12:05 eno0-TxRx-0
| drwxr-xr-x 2 root root 0 May 24 12:05 eno0-TxRx-1
| drwxr-xr-x 2 root root 0 May 24 12:05 eno0-TxRx-2
| drwxr-xr-x 2 root root 0 May 24 12:05 eno0-TxRx-3
| drwxr-xr-x 2 root root 0 May 24 12:05 eno0-TxRx-4
| drwxr-xr-x 2 root root 0 May 24 12:05 eno0-TxRx-5
| drwxr-xr-x 2 root root 0 May 24 12:05 eno0-TxRx-6
| drwxr-xr-x 2 root root 0 May 24 12:05 eno0-TxRx-7

NAPI was not freed/ allocated again. For some reason the interface went
down during the rename. I suspect the fancy network manager because it
does not know the interface. It did noto happen again after a further
rename of the renamed device and it remained up.

A link up request resulted in requesting the interrupts again and so the
names of the IRQ-threads contain now "newdev" in /proc/interrupts and
the relevant interrupt thread. This isn't the case if I rename it again
since the device isn't going down.

If I reconfigure the queues
	ethtool -L newdev combined 6 other 1

which drops all NAPI devices and asks for new then the NAPI threads are
displayed properly (in ps) but the sysfs entries in the napi folder are
napi_id-$id so I probably missed a spot in the igb driver. And reverting
back to the original 8 channels does not change the fact that I still
see napi_id-$id. So details… But the interrupt number and so on match :)

> Thanks.

Sebastian
Sebastian Andrzej Siewior May 24, 2023, 3:52 p.m. UTC | #5
On 2023-05-24 16:41:19 [+0200], Jiri Pirko wrote:
> >How is interface rename handled ?
> 
> Well, having name of interface in sysfs dir/file name is silly, even if
> it would be handled correctly. Should not be there.

The name is there due to the fact that the driver is using it as the
IRQ-name and that is what I recycled here. So it matches what is seen at
the time in proc/interrupts.
Given that it relies on the IRQ-number we may just avoid using the
interrupt name for the folder and simply stick with the napi_id scheme.
The IRQ number should give information for proper mapping.

By providing an explicit name for the napi instance like TxRx-0 (without
the device name) we would have a stable name for a given configuration.
The napi_id while mostly stable is generated on the fly and depends on
device registration order.

Sebastian
Jakub Kicinski May 24, 2023, 3:55 p.m. UTC | #6
On Wed, 24 May 2023 13:12:58 +0200 Sebastian Andrzej Siewior wrote:
> I've been looking into threaded NAPI. One awkward thing to do is
> to figure out the thread names, pids in order to adjust the thread
> priorities and SMP affinity.
> On PREEMPT_RT the NAPI thread is treated (by the user) the same way as
> the threaded interrupt which means a dedicate CPU affinity for the
> thread and a higher task priority to be favoured over other tasks on the
> CPU. Otherwise the NAPI thread can be preempted by other threads leading
> to delays in packet delivery.
> Having to run ps/ grep is awkward to get the PID right. It is not easy
> to match the interrupt since there is no obvious relation between the
> IRQ and the NAPI thread.
> NAPI threads are enabled often to mitigate the problems caused by a
> "pending" ksoftirqd (which has been mitigated recently by doing softiqrs
> regardless of ksoftirqd status). There is still the part that the NAPI
> thread does not use softnet_data::poll_list.

This needs to go to the netdev netlink family, not sysfs.
There's much more information about NAPI to expose and sysfs will
quickly start showing its limitations.
Nambiar, Amritha May 25, 2023, 12:16 a.m. UTC | #7
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Wednesday, May 24, 2023 9:25 PM
> To: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: netdev@vger.kernel.org; Eric Dumazet <edumazet@google.com>;
> Kanzenbach, Kurt <kurt.kanzenbach@linutronix.de>; Paolo Abeni
> <pabeni@redhat.com>; Thomas Gleixner <tglx@linutronix.de>
> Subject: Re: [RFC PATCH 1/2] net: Add sysfs files for threaded NAPI.
> 
> On Wed, 24 May 2023 13:12:58 +0200 Sebastian Andrzej Siewior wrote:
> > I've been looking into threaded NAPI. One awkward thing to do is
> > to figure out the thread names, pids in order to adjust the thread
> > priorities and SMP affinity.
> > On PREEMPT_RT the NAPI thread is treated (by the user) the same way as
> > the threaded interrupt which means a dedicate CPU affinity for the
> > thread and a higher task priority to be favoured over other tasks on the
> > CPU. Otherwise the NAPI thread can be preempted by other threads
> leading
> > to delays in packet delivery.
> > Having to run ps/ grep is awkward to get the PID right. It is not easy
> > to match the interrupt since there is no obvious relation between the
> > IRQ and the NAPI thread.
> > NAPI threads are enabled often to mitigate the problems caused by a
> > "pending" ksoftirqd (which has been mitigated recently by doing softiqrs
> > regardless of ksoftirqd status). There is still the part that the NAPI
> > thread does not use softnet_data::poll_list.
> 
> This needs to go to the netdev netlink family, not sysfs.
> There's much more information about NAPI to expose and sysfs will
> quickly start showing its limitations.

Based on the discussion in https://lore.kernel.org/netdev/c8476530638a5f4381d64db0e024ed49c2db3b02.camel@gmail.com/T/#m00999652a8b4731fbdb7bf698d2e3666c65a60e7 ,
I am working on a patch series (will post RFCs next week), to extend the
netdev-genl interface to expose napi_ids and associated queues. The patch
series will enable support for retrieving napi information contained in the
netdev struct, such as napi id, queue/queue-set (both RX and TX) associated
with each napi instance etc. I figure this requirement for exposing napi thread
to PID association can be done as a follow-on/extension to my series.
David Laight May 25, 2023, 8:32 a.m. UTC | #8
From: Sebastian Andrzej Siewior
> Sent: 24 May 2023 12:13
> 
> I've been looking into threaded NAPI. One awkward thing to do is
> to figure out the thread names, pids in order to adjust the thread
> priorities and SMP affinity.
> On PREEMPT_RT the NAPI thread is treated (by the user) the same way as
> the threaded interrupt which means a dedicate CPU affinity for the
> thread and a higher task priority to be favoured over other tasks on the
> CPU. Otherwise the NAPI thread can be preempted by other threads leading
> to delays in packet delivery.

Dropped packets....

> Having to run ps/ grep is awkward to get the PID right. It is not easy
> to match the interrupt since there is no obvious relation between the
> IRQ and the NAPI thread.
> NAPI threads are enabled often to mitigate the problems caused by a
> "pending" ksoftirqd (which has been mitigated recently by doing softiqrs
> regardless of ksoftirqd status). There is still the part that the NAPI
> thread does not use softnet_data::poll_list.

I had to enable both threaded NAPI and RFS (splitting IP processing
to multiple threads) in order to avoid dropping ethernet packets.

To make it all work the NAPI threads processing the receive ring
had to be running under the RT scheduler.
None of the other NAPI threads need to be RT - and, indeed, it
is likely to be detrimental to run them under the RT scheduler.
(You pretty much need to limit the total number of RT threads
to the number of cpu cores so that the scheduler effectively
assigns a cpu to each RT thread.)
Trying to find the correct thread pids was definitely non-trivial.
Especially if you are trying to set it all up for an unknown system.
(As soon as you run ps | grep you are open to arbitrary process names.)

One of the problems with using softints for network receive is
that they suddenly drop from 'higher priority than any RT thread'
to 'a normal priority thread that might have its priority reduced'.
The latter is pretty useless for emptying network receive rings.

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
diff mbox series

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a937b9329af52..34b584b4d5d94 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -376,6 +376,9 @@  struct napi_struct {
 	/* control-path-only fields follow */
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
+	struct kobject		kobj;
+	const char		*napi_name;
+	int			interrupt_num;
 };
 
 enum {
@@ -2411,6 +2414,7 @@  struct net_device {
 	struct rtnl_hw_stats64	*offload_xstats_l3;
 
 	struct devlink_port	*devlink_port;
+	struct kset		*napi_kset;
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
@@ -2688,6 +2692,8 @@  static inline void netif_napi_del(struct napi_struct *napi)
 	synchronize_net();
 }
 
+void netif_napi_add_hints(struct napi_struct *napi, const char *name, unsigned int interrupt);
+
 struct packet_type {
 	__be16			type;	/* This is really htons(ether_type). */
 	bool			ignore_outgoing;
diff --git a/net/core/dev.c b/net/core/dev.c
index 318ae441df1f5..79789dbc1c521 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1375,7 +1375,10 @@  static int napi_kthread_create(struct napi_struct *n)
 		pr_err("kthread_run failed with err %d\n", err);
 		n->thread = NULL;
 	}
+	if (err)
+		return err;
 
+	err = napi_thread_add_kobj(n);
 	return err;
 }
 
@@ -6383,6 +6386,7 @@  void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 	list_add_rcu(&napi->dev_list, &dev->napi_list);
 	napi_hash_add(napi);
 	napi_get_frags_check(napi);
+	napi->interrupt_num = -1;
 	/* Create kthread for this napi if dev->threaded is set.
 	 * Clear dev->threaded if kthread creation failed so that
 	 * threaded mode will not be enabled in napi_enable().
@@ -6392,6 +6396,14 @@  void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 }
 EXPORT_SYMBOL(netif_napi_add_weight);
 
+void netif_napi_add_hints(struct napi_struct *napi, const char *name,
+			  unsigned int interrupt)
+{
+	napi->napi_name = name;
+	napi->interrupt_num = interrupt;
+}
+EXPORT_SYMBOL_GPL(netif_napi_add_hints);
+
 void napi_disable(struct napi_struct *n)
 {
 	unsigned long val, new;
@@ -6464,6 +6476,7 @@  void __netif_napi_del(struct napi_struct *napi)
 	napi->gro_bitmask = 0;
 
 	if (napi->thread) {
+		napi_thread_remove_kobj(napi);
 		kthread_stop(napi->thread);
 		napi->thread = NULL;
 	}
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 15e3f4606b5f9..a050f8cd4913c 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1820,6 +1820,136 @@  static int register_queue_kobjects(struct net_device *dev)
 	return error;
 }
 
+#ifdef CONFIG_SYSFS
+
+struct napi_thread_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct napi_struct *n, char *buf);
+	ssize_t (*store)(struct napi_struct *n, const char *buf, size_t len);
+};
+
+static ssize_t napi_thread_interrupt_show(struct napi_struct *n, char *buf)
+{
+	if (n->interrupt_num < 0)
+		return -EINVAL;
+	return sysfs_emit(buf, "%u\n", n->interrupt_num);
+}
+
+static ssize_t napi_thread_name_show(struct napi_struct *n, char *buf)
+{
+	char comm_buf[TASK_COMM_LEN];
+
+	get_task_comm(comm_buf, n->thread);
+	return sysfs_emit(buf, "%s\n", comm_buf);
+}
+
+static ssize_t napi_thread_pid_show(struct napi_struct *n, char *buf)
+{
+	return sysfs_emit(buf, "%d\n", task_pid_nr(n->thread));
+}
+
+#define NAPI_THREAD_ATTR(__name)	\
+	static struct napi_thread_attribute thread_napi_##__name##_attribute __ro_after_init =	\
+	__ATTR(__name, 0444, napi_thread_##__name##_show, NULL)
+
+NAPI_THREAD_ATTR(interrupt);
+NAPI_THREAD_ATTR(name);
+NAPI_THREAD_ATTR(pid);
+
+static struct attribute *napi_thread_default_attrs[] __ro_after_init = {
+	&thread_napi_interrupt_attribute.attr,
+	&thread_napi_name_attribute.attr,
+	&thread_napi_pid_attribute.attr,
+	NULL,
+};
+ATTRIBUTE_GROUPS(napi_thread_default);
+
+#define to_napi_struct_attr(_attr) \
+	container_of(_attr, struct napi_thread_attribute, attr)
+
+#define to_napi_struct(obj) container_of(obj, struct napi_struct, kobj)
+
+static ssize_t napi_thread_attr_show(struct kobject *kobj, struct attribute *attr,
+				     char *buf)
+{
+	const struct napi_thread_attribute *attribute = to_napi_struct_attr(attr);
+	struct napi_struct *n = to_napi_struct(kobj);
+	ssize_t ret = -EINVAL;
+
+	if (!attribute->show)
+		return -EIO;
+
+	if (!rtnl_trylock())
+		return restart_syscall();
+	if (dev_isalive(n->dev))
+		ret = attribute->show(n, buf);
+
+	rtnl_unlock();
+	return ret;
+}
+
+static const struct sysfs_ops napi_thread_sysfs_ops = {
+	.show = napi_thread_attr_show,
+};
+
+static void napi_thread_release(struct kobject *kobj)
+{
+	memset(kobj, 0, sizeof(*kobj));
+}
+
+static const struct kobj_type napi_ktype = {
+	.sysfs_ops = &napi_thread_sysfs_ops,
+	.release = napi_thread_release,
+	.default_groups = napi_thread_default_groups,
+};
+
+int napi_thread_add_kobj(struct napi_struct *n)
+{
+	struct kobject *kobj;
+	char napi_name[32];
+	const char *name_ptr;
+	int ret;
+
+	if (n->napi_name) {
+		name_ptr = n->napi_name;
+	} else {
+		name_ptr = napi_name;
+		scnprintf(napi_name, sizeof(napi_name), "napi_id-%d", n->napi_id);
+	}
+	kobj = &n->kobj;
+	kobj->kset = n->dev->napi_kset;
+	ret = kobject_init_and_add(kobj, &napi_ktype, NULL,
+				   name_ptr);
+	return ret;
+}
+
+void napi_thread_remove_kobj(struct napi_struct *n)
+{
+	kobject_put(&n->kobj);
+}
+
+static int register_napi_kobjects(struct net_device *dev)
+{
+	dev->napi_kset = kset_create_and_add("napi",
+					     NULL, &dev->dev.kobj);
+	if (!dev->napi_kset)
+		return -ENOMEM;
+	return 0;
+}
+
+static int unregister_napi_kobjects(struct net_device *dev)
+{
+	kset_unregister(dev->napi_kset);
+	return 0;
+}
+
+#else /* !CONFIG_SYSFS */
+
+static int register_napi_kobjects(struct net_device *dev) { return 0; }
+static void unregister_napi_kobjects(struct net_device *dev) { }
+
+#endif
+
 static int queue_change_owner(struct net_device *ndev, kuid_t kuid, kgid_t kgid)
 {
 	int error = 0, real_rx = 0, real_tx = 0;
@@ -2009,6 +2139,7 @@  void netdev_unregister_kobject(struct net_device *ndev)
 	kobject_get(&dev->kobj);
 
 	remove_queue_kobjects(ndev);
+	unregister_napi_kobjects(ndev);
 
 	pm_runtime_set_memalloc_noio(dev, false);
 
@@ -2050,6 +2181,12 @@  int netdev_register_kobject(struct net_device *ndev)
 		return error;
 	}
 
+	error = register_napi_kobjects(ndev);
+	if (error) {
+		remove_queue_kobjects(ndev);
+		device_del(dev);
+		return -ENOMEM;
+	}
 	pm_runtime_set_memalloc_noio(dev, true);
 
 	return error;
diff --git a/net/core/net-sysfs.h b/net/core/net-sysfs.h
index 8a5b04c2699aa..6b185a309290d 100644
--- a/net/core/net-sysfs.h
+++ b/net/core/net-sysfs.h
@@ -11,4 +11,16 @@  int netdev_queue_update_kobjects(struct net_device *net,
 int netdev_change_owner(struct net_device *, const struct net *net_old,
 			const struct net *net_new);
 
+#ifdef CONFIG_SYSFS
+
+int napi_thread_add_kobj(struct napi_struct *n);
+void napi_thread_remove_kobj(struct napi_struct *n);
+
+#else
+
+static inline int napi_thread_add_kobj(struct napi_struct *n) { return 0; }
+static inline void napi_thread_remove_kobj(struct napi_struct *n) { }
+
+#endif
+
 #endif