Message ID | 20240812082244.22810-1-e.velu@criteo.com (mailing list archive) |
---|---|
State | Accepted |
Commit | e5efc2311cc437e2b565d164a3de884fa33f13e9 |
Delegated to: | Netdev Maintainers |
Series | net/mlx5: Use cpumask_local_spread() instead of custom code |
On 12/08/2024 11:22, Erwan Velu wrote:
> Commit 2acda57736de ("net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints")
> removed the usage of cpumask_local_spread().
>
> The issue explained in this commit was fixed by
> commit 406d394abfcd ("cpumask: improve on cpumask_local_spread() locality").
>
> Since this commit, mlx5_cpumask_default_spread() is having the same
> behavior as cpumask_local_spread().
>

Adding Yuri.

One patch led to the other, finally they were all submitted within the same
patchset.

cpumask_local_spread() indeed improved, and AFAIU is functionally equivalent
to existing logic.
According to [1] the current code is faster.
However, this alone is not a relevant enough argument, as we're talking
about slowpath here.

Yuri, is that accurate? Is this the only difference?

If so, I am fine with this change, preferring simplicity.

[1] https://elixir.bootlin.com/linux/v6.11-rc3/source/lib/cpumask.c#L122

> This commit is about :
> - removing the specific logic and use cpumask_local_spread() instead
> - passing mlx5_core_dev as argument to more flexibility
>
> mlx5_cpumask_default_spread() is kept as it could be useful for some
> future specific quirks.
>
> Signed-off-by: Erwan Velu <e.velu@criteo.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/eq.c | 27 +++-----------------
>  1 file changed, 4 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> index cb7e7e4104af..f15ecaef1331 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> @@ -835,28 +835,9 @@ static void comp_irq_release_pci(struct mlx5_core_dev *dev, u16 vecidx)
>  	mlx5_irq_release_vector(irq);
>  }
>  
> -static int mlx5_cpumask_default_spread(int numa_node, int index)
> +static int mlx5_cpumask_default_spread(struct mlx5_core_dev *dev, int index)
>  {
> -	const struct cpumask *prev = cpu_none_mask;
> -	const struct cpumask *mask;
> -	int found_cpu = 0;
> -	int i = 0;
> -	int cpu;
> -
> -	rcu_read_lock();
> -	for_each_numa_hop_mask(mask, numa_node) {
> -		for_each_cpu_andnot(cpu, mask, prev) {
> -			if (i++ == index) {
> -				found_cpu = cpu;
> -				goto spread_done;
> -			}
> -		}
> -		prev = mask;
> -	}
> -
> -spread_done:
> -	rcu_read_unlock();
> -	return found_cpu;
> +	return cpumask_local_spread(index, dev->priv.numa_node);
>  }
>  
>  static struct cpu_rmap *mlx5_eq_table_get_pci_rmap(struct mlx5_core_dev *dev)
> @@ -880,7 +861,7 @@ static int comp_irq_request_pci(struct mlx5_core_dev *dev, u16 vecidx)
>  	int cpu;
>  
>  	rmap = mlx5_eq_table_get_pci_rmap(dev);
> -	cpu = mlx5_cpumask_default_spread(dev->priv.numa_node, vecidx);
> +	cpu = mlx5_cpumask_default_spread(dev, vecidx);
>  	irq = mlx5_irq_request_vector(dev, cpu, vecidx, &rmap);
>  	if (IS_ERR(irq))
>  		return PTR_ERR(irq);
> @@ -1145,7 +1126,7 @@ int mlx5_comp_vector_get_cpu(struct mlx5_core_dev *dev, int vector)
>  	if (mask)
>  		cpu = cpumask_first(mask);
>  	else
> -		cpu = mlx5_cpumask_default_spread(dev->priv.numa_node, vector);
> +		cpu = mlx5_cpumask_default_spread(dev, vector);
>  
>  	return cpu;
>  }
On Wed, Aug 14, 2024 at 10:48:40AM +0300, Tariq Toukan wrote:
>
> On 12/08/2024 11:22, Erwan Velu wrote:
> > Commit 2acda57736de ("net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints")
> > removed the usage of cpumask_local_spread().
> >
> > The issue explained in this commit was fixed by
> > commit 406d394abfcd ("cpumask: improve on cpumask_local_spread() locality").
> >
> > Since this commit, mlx5_cpumask_default_spread() is having the same
> > behavior as cpumask_local_spread().
>
> Adding Yuri.
>
> One patch led to the other, finally they were all submitted within the same
> patchset.
>
> cpumask_local_spread() indeed improved, and AFAIU is functionally equivalent
> to existing logic.
> According to [1] the current code is faster.
> However, this alone is not a relevant enough argument, as we're talking
> about slowpath here.
>
> Yuri, is that accurate? Is this the only difference?
>
> If so, I am fine with this change, preferring simplicity.
>
> [1] https://elixir.bootlin.com/linux/v6.11-rc3/source/lib/cpumask.c#L122

If you end up calling mlx5_cpumask_default_spread() for each CPU, it
would be O(N^2). If you call cpumask_local_spread() for each CPU, your
complexity would be O(N*logN), because under the hood it uses binary
search.

The comment you've mentioned says that you can traverse your CPUs in
O(N) if you can manage to put all the logic inside the
for_each_numa_hop_mask() iterator. It doesn't seem to be your case.

I agree with you. mlx5_cpumask_default_spread() should be switched to
using library code.

Acked-by: Yury Norov <yury.norov@gmail.com>

You may be interested in siblings-aware CPU distribution I've made
for mana ethernet driver in 91bfe210e196. This is also an example
where using for_each_numa_hop_mask() over simple cpumask_local_spread()
is justified.

Thanks,
Yury
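Yury's complexity argument can be illustrated with a small user-space model. Everything below is hypothetical illustration, not kernel code: the CPU ordering stands in for what the hop iteration would produce, and the function names are invented. The point is that an index-based helper re-walks the ordering on every call (O(N^2) total across all vectors), while a single pass assigns every vector in O(N).

```c
#include <assert.h>

/* Hypothetical locality ordering for one device: CPUs of the local
 * node first (4-7), then increasingly distant ones (0-3).
 */
static const int hop_ordered_cpus[] = { 4, 5, 6, 7, 0, 1, 2, 3 };
static const int ncpus = 8;

/* Model of an index-based spread helper: each call re-walks the
 * ordering from the start, so calling it once per vector costs
 * O(N^2) overall.
 */
static int spread_linear(int index)
{
	int i;

	for (i = 0; i < ncpus; i++)
		if (i == index)
			return hop_ordered_cpus[i];
	return hop_ordered_cpus[0];
}

/* One-pass assignment: every vector mapped in a single O(N) walk,
 * which is what folding all the logic into the iterator itself
 * would buy.
 */
static void spread_all(int *cpu_of_vec, int nvec)
{
	int i;

	for (i = 0; i < nvec && i < ncpus; i++)
		cpu_of_vec[i] = hop_ordered_cpus[i];
}
```

Both produce the same mapping here; the difference Yury describes is purely in how often the ordering is traversed.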
On 14/08/2024 17:45, Yury Norov wrote:
> On Wed, Aug 14, 2024 at 10:48:40AM +0300, Tariq Toukan wrote:
>>
>> On 12/08/2024 11:22, Erwan Velu wrote:
>>> Commit 2acda57736de ("net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints")
>>> removed the usage of cpumask_local_spread().
>>>
>>> The issue explained in this commit was fixed by
>>> commit 406d394abfcd ("cpumask: improve on cpumask_local_spread() locality").
>>>
>>> Since this commit, mlx5_cpumask_default_spread() is having the same
>>> behavior as cpumask_local_spread().
>>
>> Adding Yuri.
>>
>> One patch led to the other, finally they were all submitted within the same
>> patchset.
>>
>> cpumask_local_spread() indeed improved, and AFAIU is functionally equivalent
>> to existing logic.
>> According to [1] the current code is faster.
>> However, this alone is not a relevant enough argument, as we're talking
>> about slowpath here.
>>
>> Yuri, is that accurate? Is this the only difference?
>>
>> If so, I am fine with this change, preferring simplicity.
>>
>> [1] https://elixir.bootlin.com/linux/v6.11-rc3/source/lib/cpumask.c#L122
>
> If you end up calling mlx5_cpumask_default_spread() for each CPU, it
> would be O(N^2). If you call cpumask_local_spread() for each CPU, your
> complexity would be O(N*logN), because under the hood it uses binary
> search.
>
> The comment you've mentioned says that you can traverse your CPUs in
> O(N) if you can manage to put all the logic inside the
> for_each_numa_hop_mask() iterator. It doesn't seem to be your case.
>
> I agree with you. mlx5_cpumask_default_spread() should be switched to
> using library code.
>
> Acked-by: Yury Norov <yury.norov@gmail.com>
>
> You may be interested in siblings-aware CPU distribution I've made
> for mana ethernet driver in 91bfe210e196. This is also an example
> where using for_each_numa_hop_mask() over simple cpumask_local_spread()
> is justified.
>
> Thanks,
> Yury

Thanks Yuri.
For the patch:
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>

Targeting net-next.
The patch subject should have stated this clearly.

Jakub,
Please note that this patch is also mistakenly marked 'Not Applicable' already...

Regards,
Tariq
Hello:

This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 12 Aug 2024 10:22:42 +0200 you wrote:
> Commit 2acda57736de ("net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints")
> removed the usage of cpumask_local_spread().
>
> The issue explained in this commit was fixed by
> commit 406d394abfcd ("cpumask: improve on cpumask_local_spread() locality").
>
> Since this commit, mlx5_cpumask_default_spread() is having the same
> behavior as cpumask_local_spread().
>
> [...]

Here is the summary with links:
  - net/mlx5: Use cpumask_local_spread() instead of custom code
    https://git.kernel.org/netdev/net-next/c/e5efc2311cc4

You are awesome, thank you!
[...]
> You may be interested in siblings-aware CPU distribution I've made
> for mana ethernet driver in 91bfe210e196. This is also an example
> where using for_each_numa_hop_mask() over simple cpumask_local_spread()
> is justified.

That's clearly a topic I'd like to discuss, because the allocation strategy may
vary depending on the hardware and/or usage. I've been investigating a case
where the default mlx5 allocation isn't what I need.

1/ I noticed that setting smp_affinity in an RFS context didn't change the IRQ
allocation, and I was wondering if that is expected behavior. It prevents any
later tuning that an application could require. It would be super helpful to be
able to influence the placement from the host, to avoid hardcoded allocators
that may not match a particular hardware configuration.

2/ I was also wondering if we shouldn't have a kernel module option to choose
the allocation algorithm (I have a POC in that direction). The benefit would be
allowing the platform owner to select the allocation algorithm the sysadmin
needs.

On single-package AMD EPYC servers, the NUMA topology is pretty handy for
mapping the L3 affinity, but it doesn't provide any particular hint about the
actual "distance" to the network device. You can have up to 12 NUMA nodes on a
single package, yet the actual distance to the NIC is almost identical for all
of them, as each core needs to go through the IO die to reach the PCI devices.
The NUMA allocation logic carries assumptions like "1 NUMA node per package",
whereas the actual distance between nodes is what should be considered.

In my case, the NIC is reported on NUMA node 6 (of 8), inherited from the PXM
configuration. With the current "proximity" logic, all cores are consumed
within this NUMA domain before reaching the next one, and so on. This leads to
a very unbalanced configuration where a few NUMA domains are fully allocated
while others are free.
When SMT is enabled, consuming all cores from a NUMA domain also means using
hyperthreads, which can be less optimal than using real cores from adjacent
nodes.

In a hypervisor-like use case, where multiple containers from various users run
on the same system, having RFS enabled helps each user carry the load of its
own generated traffic. In such a configuration, it'd be better to let the
allocator consume cores from each NUMA node of the same package one by one to
get a balanced configuration; that would also have the advantage of avoiding
hyperthreads until at least 1 IRQ per physical core is reached.

That allocation logic could be worth sharing between various drivers, to let
sysadmins get a balanced IRQ mapping on modern, multi-node-per-socket
architectures.

WDYT of having selectable logic and adding this type of "package-balanced"
allocator?

Erwan,
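A rough user-space sketch of the "package-balanced" idea described above. The topology, names, and striping rule here are invented for illustration (this is not the POC mentioned): IRQ indices are striped across the NUMA nodes of the package, and SMT siblings are only used once every physical core already holds one IRQ.

```c
#include <assert.h>

/* Hypothetical topology: 2 NUMA nodes, 2 physical cores each, with
 * the SMT siblings listed after the physical cores of their node.
 */
#define NNODES 2
#define NSLOTS 4 /* per node: 2 physical cores, then their 2 siblings */

static const int node_cpus[NNODES][NSLOTS] = {
	{ 0, 1,  8,  9 },	/* node 0: cores 0,1; siblings 8,9 */
	{ 4, 5, 12, 13 },	/* node 1: cores 4,5; siblings 12,13 */
};

/* index-th IRQ -> CPU: stripe across nodes first, go one slot deeper
 * per full round, so siblings (slots 2-3) are reached only after all
 * physical cores (slots 0-1) on every node are taken.
 */
static int balanced_spread(int index)
{
	int node = index % NNODES;
	int slot = (index / NNODES) % NSLOTS;

	return node_cpus[node][slot];
}
```

With this invented topology, the first four IRQs land on the four physical cores (one per node per round), and only the fifth IRQ starts consuming hyperthreads.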
On Mon, 19 Aug 2024 12:15:10 +0200 Erwan Velu wrote:
> 2/ I was also wondering if we shouldn't have a kernel module option to
> choose the allocation algorithm (I have a POC in that direction).
> The benefit could be allowing the platform owner to select the
> allocation algorithm that sys-admin needs.
> On single-package AMD EPYC servers, the numa topology is pretty handy
> for mapping the L3 affinity but it doesn't provide any particular hint
> about the actual "distance" to the network device.
> You can have up to 12 NUMA nodes on a single package but the actual
> distance to the nic is almost identical as each core needs to use the
> IOdie to reach the PCI devices.
> We can see in the NUMA allocation logic assumptions like "1 NUMA per
> package" logic that the actual distance between nodes should be
> considered in the allocation logic.

I think user space has more information on what the appropriate
placement is than the kernel. We can have a reasonable default,
and maybe try not to stupidly reset the settings when config
changes (I don't think mlx5 does that but other drivers do);
but having a way to select algorithm would only work if there
was a well understood and finite set of algorithms.

IMHO we should try to sell this task to systemd-networkd or some other
user space daemon. We now have netlink access to NAPI information,
including IRQ<>NAPI<>queue mapping. It's possible to implement a
completely driver-agnostic IRQ mapping support from user space
(without the need to grep irq names like we used to).
On Mon, Aug 19, 2024 at 5:34 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 19 Aug 2024 12:15:10 +0200 Erwan Velu wrote:
> > 2/ I was also wondering if we shouldn't have a kernel module option to
> > choose the allocation algorithm (I have a POC in that direction).
> > The benefit could be allowing the platform owner to select the
> > allocation algorithm that sys-admin needs.
> > On single-package AMD EPYC servers, the numa topology is pretty handy
> > for mapping the L3 affinity but it doesn't provide any particular hint
> > about the actual "distance" to the network device.
> > You can have up to 12 NUMA nodes on a single package but the actual
> > distance to the nic is almost identical as each core needs to use the
> > IOdie to reach the PCI devices.
> > We can see in the NUMA allocation logic assumptions like "1 NUMA per
> > package" logic that the actual distance between nodes should be
> > considered in the allocation logic.
>
> I think user space has more information on what the appropriate
> placement is than the kernel. We can have a reasonable default,
> and maybe try not to stupidly reset the settings when config
> changes (I don't think mlx5 does that but other drivers do);
> but having a way to select algorithm would only work if there
> was a well understood and finite set of algorithms.

I totally agree with this view. I'm wondering if people who used to
work on the mlx driver can provide hints about this task; I have no
idea whether it requires any particular work at the fw level. Is this
a complex task to perform? That feature would be super helpful for
precise tuning.

> IMHO we should try to sell this task to systemd-networkd or some other
> user space daemon. We now have netlink access to NAPI information,
> including IRQ<>NAPI<>queue mapping. It's possible to implement a
> completely driver-agnostic IRQ mapping support from user space
> (without the need to grep irq names like we used to)

Clearly that would be a nice path to achieve this feature.
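As a rough sketch of the user-space direction discussed here: a daemon could compute its own placement and write it through the existing /proc/irq/<n>/smp_affinity interface, which accepts a hex CPU mask. The policy, helper names, and 64-CPU limit below are illustrative assumptions; only the /proc interface itself is real, and writing it requires root plus an IRQ that actually exists.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Format a CPU mask (up to 64 CPUs for simplicity) as the hex string
 * accepted by /proc/irq/<n>/smp_affinity, e.g. 0xf -> "f".
 */
static void format_affinity(uint64_t cpumask, char *buf, size_t len)
{
	snprintf(buf, len, "%llx", (unsigned long long)cpumask);
}

/* Hypothetical helper: pin one IRQ to the given mask.  Returns -1 if
 * the file can't be opened (not root, or the IRQ doesn't exist).
 */
static int apply_affinity(int irq, uint64_t cpumask)
{
	char path[64], mask[32];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
	format_affinity(cpumask, mask, sizeof(mask));
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%s\n", mask);
	fclose(f);
	return 0;
}
```

A daemon built on the netlink NAPI information Jakub mentions would use that data to pick the IRQ numbers and masks; the write path itself is driver-agnostic.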
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index cb7e7e4104af..f15ecaef1331 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -835,28 +835,9 @@ static void comp_irq_release_pci(struct mlx5_core_dev *dev, u16 vecidx)
 	mlx5_irq_release_vector(irq);
 }
 
-static int mlx5_cpumask_default_spread(int numa_node, int index)
+static int mlx5_cpumask_default_spread(struct mlx5_core_dev *dev, int index)
 {
-	const struct cpumask *prev = cpu_none_mask;
-	const struct cpumask *mask;
-	int found_cpu = 0;
-	int i = 0;
-	int cpu;
-
-	rcu_read_lock();
-	for_each_numa_hop_mask(mask, numa_node) {
-		for_each_cpu_andnot(cpu, mask, prev) {
-			if (i++ == index) {
-				found_cpu = cpu;
-				goto spread_done;
-			}
-		}
-		prev = mask;
-	}
-
-spread_done:
-	rcu_read_unlock();
-	return found_cpu;
+	return cpumask_local_spread(index, dev->priv.numa_node);
 }
 
 static struct cpu_rmap *mlx5_eq_table_get_pci_rmap(struct mlx5_core_dev *dev)
@@ -880,7 +861,7 @@ static int comp_irq_request_pci(struct mlx5_core_dev *dev, u16 vecidx)
 	int cpu;
 
 	rmap = mlx5_eq_table_get_pci_rmap(dev);
-	cpu = mlx5_cpumask_default_spread(dev->priv.numa_node, vecidx);
+	cpu = mlx5_cpumask_default_spread(dev, vecidx);
 	irq = mlx5_irq_request_vector(dev, cpu, vecidx, &rmap);
 	if (IS_ERR(irq))
 		return PTR_ERR(irq);
@@ -1145,7 +1126,7 @@ int mlx5_comp_vector_get_cpu(struct mlx5_core_dev *dev, int vector)
 	if (mask)
 		cpu = cpumask_first(mask);
 	else
-		cpu = mlx5_cpumask_default_spread(dev->priv.numa_node, vector);
+		cpu = mlx5_cpumask_default_spread(dev, vector);
 
 	return cpu;
 }
Commit 2acda57736de ("net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints")
removed the usage of cpumask_local_spread().

The issue explained in this commit was fixed by
commit 406d394abfcd ("cpumask: improve on cpumask_local_spread() locality").

Since this commit, mlx5_cpumask_default_spread() has the same
behavior as cpumask_local_spread().

This commit:
- removes the specific logic and uses cpumask_local_spread() instead
- passes mlx5_core_dev as argument for more flexibility

mlx5_cpumask_default_spread() is kept as it could be useful for some
future specific quirks.

Signed-off-by: Erwan Velu <e.velu@criteo.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 27 +++-----------------
 1 file changed, 4 insertions(+), 23 deletions(-)
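For reference, the cpumask_local_spread() semantics the patch relies on can be modeled in user space roughly as "the i-th online CPU, enumerating CPUs of the requested NUMA node first". The toy topology and function below are illustrative only; the real helper operates on cpumask structures and also orders remote nodes by distance rather than lumping them into one pass.

```c
#include <assert.h>

#define NR_CPUS 8
/* Toy topology: node 0 = CPUs 0-3, node 1 = CPUs 4-7. */
static const int node_of_cpu[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/* Model of cpumask_local_spread(i, node): return the i-th CPU,
 * visiting CPUs of @node first (pass 0), then all remaining CPUs
 * (pass 1).
 */
static int local_spread(int i, int node)
{
	int pass, cpu, seen = 0;

	for (pass = 0; pass < 2; pass++)
		for (cpu = 0; cpu < NR_CPUS; cpu++)
			if ((node_of_cpu[cpu] == node) == (pass == 0) &&
			    seen++ == i)
				return cpu;
	return 0;
}
```

So for a device on node 1, the first four vectors land on CPUs 4-7 and only then does allocation spill over to node 0, which is exactly the behavior the removed for_each_numa_hop_mask() loop provided.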