mbox series

[net-next,0/4] Allow configuration of multipath hash seed

Message ID 20240529111844.13330-1-petrm@nvidia.com (mailing list archive)
Headers show
Series Allow configuration of multipath hash seed | expand

Message

Petr Machata May 29, 2024, 11:18 a.m. UTC
Let me just quote the commit message of patch #2 here to inform the
motivation and some of the implementation:

    When calculating hashes for the purpose of multipath forwarding,
    both IPv4 and IPv6 code currently fall back on
    flow_hash_from_keys(). That uses a randomly-generated seed. That's a
    fine choice by default, but unfortunately some deployments may need
    a tighter control over the seed used.

    In this patchset, make the seed configurable by adding a new sysctl
    key, net.ipv4.fib_multipath_hash_seed to control the seed. This seed
    is used specifically for multipath forwarding and not for the other
    concerns that flow_hash_from_keys() is used for, such as queue
    selection. Expose the knob as sysctl because other such settings,
    such as headers to hash, are also handled that way.

    Despite being placed in the net.ipv4 namespace, the multipath seed
    sysctl is used for both IPv4 and IPv6, similarly to e.g. a number of
    TCP variables. Like those, the multipath hash seed is a per-netns
    variable.

    The new sysctl is added with permissions 0600 so that the hash is
    only readable and writable by root.

    The seed used by flow_hash_from_keys() is a 128-bit quantity.
    However it seems that usually the seed is a much more modest value.
    32 bits seem typical (Cisco, Cumulus), some systems go even lower.
    For that reason, and to decouple the user interface from
    implementation details, go with a 32-bit quantity, which is then
    quadruplicated to form the siphash key.

One example of use of this interface is avoiding hash polarization,
where two ECMP routers, one behind the other, happen to make consistent
hashing decisions, and as a result, part of the ECMP space of the latter
router is never used. Another is a load balancer where several machines
forward traffic to one of a number of leaves, and the forwarding
decisions need to be made consistently. (This is a case of a desired
hash polarization, mentioned e.g. in chapter 6.3 of [0].)

There has already been a proposal to include a hash seed control
interface in the past[1]. This patchset uses broadly the same ideas, but
limits the externally visible seed size to 32 bits.

- Patches #1-#2 contain the substance of the work
- Patch #3 is a mlxsw offload
- Patch #4 is a selftest

[0] https://www.usenix.org/system/files/conference/nsdi18/nsdi18-araujo.pdf
[1] https://lore.kernel.org/netdev/YIlVpYMCn%2F8WfE1P@rnd/

Petr Machata (4):
  net: ipv4,ipv6: Pass multipath hash computation through a helper
  net: ipv4: Add a sysctl to set multipath hash seed
  mlxsw: spectrum_router: Apply user-defined multipath hash seed
  selftests: forwarding: router_mpath_hash: Add a new selftest

 Documentation/networking/ip-sysctl.rst        |  10 +
 .../ethernet/mellanox/mlxsw/spectrum_router.c |  14 +-
 include/net/flow_dissector.h                  |   2 +
 include/net/ip_fib.h                          |  24 ++
 include/net/netns/ipv4.h                      |  10 +
 net/core/flow_dissector.c                     |   7 +
 net/ipv4/route.c                              |  12 +-
 net/ipv4/sysctl_net_ipv4.c                    |  82 +++++
 net/ipv6/route.c                              |  12 +-
 .../testing/selftests/net/forwarding/Makefile |   1 +
 .../net/forwarding/router_mpath_seed.sh       | 322 ++++++++++++++++++
 11 files changed, 482 insertions(+), 14 deletions(-)
 create mode 100755 tools/testing/selftests/net/forwarding/router_mpath_seed.sh

Comments

Nikolay Aleksandrov May 29, 2024, 7:57 p.m. UTC | #1
On 5/29/24 14:18, Petr Machata wrote:
> Let me just quote the commit message of patch #2 here to inform the
> motivation and some of the implementation:
> 
>     When calculating hashes for the purpose of multipath forwarding,
>     both IPv4 and IPv6 code currently fall back on
>     flow_hash_from_keys(). That uses a randomly-generated seed. That's a
>     fine choice by default, but unfortunately some deployments may need
>     a tighter control over the seed used.
> 
>     In this patchset, make the seed configurable by adding a new sysctl
>     key, net.ipv4.fib_multipath_hash_seed to control the seed. This seed
>     is used specifically for multipath forwarding and not for the other
>     concerns that flow_hash_from_keys() is used for, such as queue
>     selection. Expose the knob as sysctl because other such settings,
>     such as headers to hash, are also handled that way.
> 
>     Despite being placed in the net.ipv4 namespace, the multipath seed
>     sysctl is used for both IPv4 and IPv6, similarly to e.g. a number of
>     TCP variables. Like those, the multipath hash seed is a per-netns
>     variable.
> 
>     The new sysctl is added with permissions 0600 so that the hash is
>     only readable and writable by root.
> 
>     The seed used by flow_hash_from_keys() is a 128-bit quantity.
>     However it seems that usually the seed is a much more modest value.
>     32 bits seem typical (Cisco, Cumulus), some systems go even lower.
>     For that reason, and to decouple the user interface from
>     implementation details, go with a 32-bit quantity, which is then
>     quadruplicated to form the siphash key.
> 
> One example of use of this interface is avoiding hash polarization,
> where two ECMP routers, one behind the other, happen to make consistent
> hashing decisions, and as a result, part of the ECMP space of the latter
> router is never used. Another is a load balancer where several machines
> forward traffic to one of a number of leaves, and the forwarding
> decisions need to be made consistently. (This is a case of a desired
> hash polarization, mentioned e.g. in chapter 6.3 of [0].)
> 
> There has already been a proposal to include a hash seed control
> interface in the past[1]. This patchset uses broadly the same ideas, but
> limits the externally visible seed size to 32 bits.
> 
> - Patches #1-#2 contain the substance of the work
> - Patch #3 is a mlxsw offload
> - Patch #4 is a selftest
> 
> [0] https://www.usenix.org/system/files/conference/nsdi18/nsdi18-araujo.pdf
> [1] https://lore.kernel.org/netdev/YIlVpYMCn%2F8WfE1P@rnd/
> 
> Petr Machata (4):
>   net: ipv4,ipv6: Pass multipath hash computation through a helper
>   net: ipv4: Add a sysctl to set multipath hash seed
>   mlxsw: spectrum_router: Apply user-defined multipath hash seed
>   selftests: forwarding: router_mpath_hash: Add a new selftest
> 
>  Documentation/networking/ip-sysctl.rst        |  10 +
>  .../ethernet/mellanox/mlxsw/spectrum_router.c |  14 +-
>  include/net/flow_dissector.h                  |   2 +
>  include/net/ip_fib.h                          |  24 ++
>  include/net/netns/ipv4.h                      |  10 +
>  net/core/flow_dissector.c                     |   7 +
>  net/ipv4/route.c                              |  12 +-
>  net/ipv4/sysctl_net_ipv4.c                    |  82 +++++
>  net/ipv6/route.c                              |  12 +-
>  .../testing/selftests/net/forwarding/Makefile |   1 +
>  .../net/forwarding/router_mpath_seed.sh       | 322 ++++++++++++++++++
>  11 files changed, 482 insertions(+), 14 deletions(-)
>  create mode 100755 tools/testing/selftests/net/forwarding/router_mpath_seed.sh
> 

Hi,
I think that using memory management for such simple task is an
overkill. Would it be simpler to define 2 x 4 byte seed variables
in netns_ipv4 (e.g. user_seed, mp_seed). One is set only by the
user through the sysctl, which would also set mp_seed. Then you
can use mp_seed in the fast-path to construct that siphash key.
If the user_seed is set to 0 then you reset to some static init
hash value that is generated on init_net's creation. The idea
is to avoid leaking that initial seed, to have the same seed
for all netns (known behaviour), be able to recognize when a
seed was set and if the user sets a seed then overwrite it for
that ns, but to be able to reset it as well.
Since 32 bits are enough I don't see why we should be using
the flow hash seed, note that init_net's initialization already
uses get_random_bytes() for hashes. This seems like a simpler
scheme that doesn't require memory management for a 32 bit seed.
Also it has the benefit that it will remove the test when generating
a hash because in the initial/non-user-set case we just have the
initial seed in mp_seed which is used to generate the siphash key,
i.e. we always use that internal seed for the hash, regardless if
it was set by the user or it's the initial seed.

That's just one suggestion, if you decide to use more memory you
can keep the whole key in netns_ipv4 instead, the point is I don't
think we need memory management for this value.

Cheers,
 Nik
Petr Machata May 30, 2024, 3:25 p.m. UTC | #2
Nikolay Aleksandrov <razor@blackwall.org> writes:

> I think that using memory management for such simple task is an
> overkill. Would it be simpler to define 2 x 4 byte seed variables
> in netns_ipv4 (e.g. user_seed, mp_seed). One is set only by the
> user through the sysctl, which would also set mp_seed. Then you
> can use mp_seed in the fast-path to construct that siphash key.
> If the user_seed is set to 0 then you reset to some static init
> hash value that is generated on init_net's creation. The idea

Currently the flow dissector siphash key is initialized lazily so that
the pool of random bytes is full when the initialization is done:

    https://lore.kernel.org/all/20131023180527.GC2403@order.stressinduktion.org
    https://lore.kernel.org/all/20131023111219.GA31531@order.stressinduktion.org

I'm not sure how important that is -- the mailing does not really
discuss much in the way of rationale, and admits it's not critical. But
initializing the seed during net init would undo that. At the same time,
initializing it lazily would be a bit of a mess, as we would have to
possibly retroactively update mp_hash as well, which would be racy vs.
user_hash update unless locked. So dunno.

If you are OK with giving up on the siphash key "quality", I'm fine with
this.

Alternatively I can keep the dispatch in like it currently is. I.e.:

	if (user_seed) {
		sip_hash = construct(user_seed)
		return flow_hash_from_keys_seed(sip_hash)
	} else {
		return flow_hash_from_keys()
	}

I wanted to have the flow dispatcher hash init early as well, as it made
the code branch-free like you note below, but then Ido dug out that
there are $reasons for how it's currently done.

> is to avoid leaking that initial seed, to have the same seed
> for all netns (known behaviour), be able to recognize when a
> seed was set and if the user sets a seed then overwrite it for
> that ns, but to be able to reset it as well.
> Since 32 bits are enough I don't see why we should be using
> the flow hash seed, note that init_net's initialization already

No deep reason in using the dissector hash as far as I'm concerned.
I just didn't want to change things arbitrarily, so kept the current
behavior except where I needed it to change.

> uses get_random_bytes() for hashes. This seems like a simpler
> scheme that doesn't require memory management for a 32 bit seed.
> Also it has the benefit that it will remove the test when generating
> a hash because in the initial/non-user-set case we just have the
> initial seed in mp_seed which is used to generate the siphash key,
> i.e. we always use that internal seed for the hash, regardless if
> it was set by the user or it's the initial seed.
>
> That's just one suggestion, if you decide to use more memory you
> can keep the whole key in netns_ipv4 instead, the point is I don't
> think we need memory management for this value.

I kept the RCU stuff in because it makes it easy to precompute the
siphash key while allowing readers to access it lock-free. I could
inline it and guard with a seqlock instead, but that's a bit messier
code-wise. Or indeed construct in-situ, it's an atomic access plus like
four instructions or something like that.
Nikolay Aleksandrov May 30, 2024, 5:27 p.m. UTC | #3
On 5/30/24 18:25, Petr Machata wrote:
> 
> Nikolay Aleksandrov <razor@blackwall.org> writes:
> 
>> I think that using memory management for such simple task is an
>> overkill. Would it be simpler to define 2 x 4 byte seed variables
>> in netns_ipv4 (e.g. user_seed, mp_seed). One is set only by the
>> user through the sysctl, which would also set mp_seed. Then you
>> can use mp_seed in the fast-path to construct that siphash key.
>> If the user_seed is set to 0 then you reset to some static init
>> hash value that is generated on init_net's creation. The idea
> 
> Currently the flow dissector siphash key is initialized lazily so that
> the pool of random bytes is full when the initialization is done:
> 
>     https://lore.kernel.org/all/20131023180527.GC2403@order.stressinduktion.org
>     https://lore.kernel.org/all/20131023111219.GA31531@order.stressinduktion.org
> 
> I'm not sure how important that is -- the mailing does not really
> discuss much in the way of rationale, and admits it's not critical. But
> initializing the seed during net init would undo that. At the same time,
> initializing it lazily would be a bit of a mess, as we would have to
> possibly retroactively update mp_hash as well, which would be racy vs.
> user_hash update unless locked. So dunno.

If you want to keep the late init, untested:
init_mp_seed_once() -> DO_ONCE(func (net_get_random_once(&init_mp_seed),
memcpy(&init_net.mp_seed, &init_mp_seed))

> 
> If you are OK with giving up on the siphash key "quality", I'm fine with
> this.
> 

IMO that's fine, early init of the seed wouldn't be a problem. net's
hash_mix is already initialized early.

> Alternatively I can keep the dispatch in like it currently is. I.e.:
> 
> 	if (user_seed) {
> 		sip_hash = construct(user_seed)
> 		return flow_hash_from_keys_seed(sip_hash)
> 	} else {
> 		return flow_hash_from_keys()
> 	}
> 
> I wanted to have the flow dispatcher hash init early as well, as it made
> the code branch-free like you note below, but then Ido dug out that

+1

> there are $reasons for how it's currently done.
> >> is to avoid leaking that initial seed, to have the same seed
>> for all netns (known behaviour), be able to recognize when a
>> seed was set and if the user sets a seed then overwrite it for
>> that ns, but to be able to reset it as well.
>> Since 32 bits are enough I don't see why we should be using
>> the flow hash seed, note that init_net's initialization already
> 
> No deep reason in using the dissector hash as far as I'm concerned.
> I just didn't want to change things arbitrarily, so kept the current
> behavior except where I needed it to change.
> 
>> uses get_random_bytes() for hashes. This seems like a simpler
>> scheme that doesn't require memory management for a 32 bit seed.
>> Also it has the benefit that it will remove the test when generating
>> a hash because in the initial/non-user-set case we just have the
>> initial seed in mp_seed which is used to generate the siphash key,
>> i.e. we always use that internal seed for the hash, regardless if
>> it was set by the user or it's the initial seed.
>>
>> That's just one suggestion, if you decide to use more memory you
>> can keep the whole key in netns_ipv4 instead, the point is I don't
>> think we need memory management for this value.
> 
> I kept the RCU stuff in because it makes it easy to precompute the
> siphash key while allowing readers to access it lock-free. I could
> inline it and guard with a seqlock instead, but that's a bit messier
> code-wise. Or indeed construct in-situ, it's an atomic access plus like
> four instructions or something like that.

You can READ/WRITE_ONCE() the full 8 bytes every time so it's lock-free
and consistent view of both values for observers. For fast-path it'll
only be accessing one of the two values, so it's fine either way. You
can use barriers to ensure latest value is seen by interested readers,
but for most eventual consistency would be enough.
Nikolay Aleksandrov May 30, 2024, 6:07 p.m. UTC | #4
On 5/30/24 20:27, Nikolay Aleksandrov wrote:
> On 5/30/24 18:25, Petr Machata wrote:
>>
>> Nikolay Aleksandrov <razor@blackwall.org> writes:
>>
>>> I think that using memory management for such simple task is an
>>> overkill. Would it be simpler to define 2 x 4 byte seed variables
>>> in netns_ipv4 (e.g. user_seed, mp_seed). One is set only by the
>>> user through the sysctl, which would also set mp_seed. Then you
>>> can use mp_seed in the fast-path to construct that siphash key.
>>> If the user_seed is set to 0 then you reset to some static init
>>> hash value that is generated on init_net's creation. The idea
>>
>> Currently the flow dissector siphash key is initialized lazily so that
>> the pool of random bytes is full when the initialization is done:
>>
>>     https://lore.kernel.org/all/20131023180527.GC2403@order.stressinduktion.org
>>     https://lore.kernel.org/all/20131023111219.GA31531@order.stressinduktion.org
>>
>> I'm not sure how important that is -- the mailing does not really
>> discuss much in the way of rationale, and admits it's not critical. But
>> initializing the seed during net init would undo that. At the same time,
>> initializing it lazily would be a bit of a mess, as we would have to
>> possibly retroactively update mp_hash as well, which would be racy vs.
>> user_hash update unless locked. So dunno.
> 
> If you want to keep the late init, untested:
> init_mp_seed_once() -> DO_ONCE(func (net_get_random_once(&init_mp_seed),
> memcpy(&init_net.mp_seed, &init_mp_seed))
> 

Blah, just get_random_bytes() instead of net_get*. :)
It'll be executed once anyway. But again I think we can do without all
of this.

>>
>> If you are OK with giving up on the siphash key "quality", I'm fine with
>> this.
>>
> 
> IMO that's fine, early init of the seed wouldn't be a problem. net's
> hash_mix is already initialized early.
> 
[snip]
>>
>> I kept the RCU stuff in because it makes it easy to precompute the
>> siphash key while allowing readers to access it lock-free. I could
>> inline it and guard with a seqlock instead, but that's a bit messier
>> code-wise. Or indeed construct in-situ, it's an atomic access plus like
>> four instructions or something like that.
> 
> You can READ/WRITE_ONCE() the full 8 bytes every time so it's lock-free
> and consistent view of both values for observers. For fast-path it'll
> only be accessing one of the two values, so it's fine either way. You
> can use barriers to ensure latest value is seen by interested readers,
> but for most eventual consistency would be enough.
> 

Actually aren't we interested only in user_seed in the external reader
case? We don't care what's in mp_seed, so this is much simpler.
Nikolay Aleksandrov May 30, 2024, 9:34 p.m. UTC | #5
On 5/30/24 21:07, Nikolay Aleksandrov wrote:
> On 5/30/24 20:27, Nikolay Aleksandrov wrote:
>> On 5/30/24 18:25, Petr Machata wrote:
>>>
>>> Nikolay Aleksandrov <razor@blackwall.org> writes:
>>>
>>
> [snip]
[snip snip]
>>>
>>> I kept the RCU stuff in because it makes it easy to precompute the
>>> siphash key while allowing readers to access it lock-free. I could
>>> inline it and guard with a seqlock instead, but that's a bit messier
>>> code-wise. Or indeed construct in-situ, it's an atomic access plus like
>>> four instructions or something like that.
>>
>> You can READ/WRITE_ONCE() the full 8 bytes every time so it's lock-free
>> and consistent view of both values for observers. For fast-path it'll
>> only be accessing one of the two values, so it's fine either way. You
>> can use barriers to ensure latest value is seen by interested readers,
>> but for most eventual consistency would be enough.
>>
> 
> Actually aren't we interested only in user_seed in the external reader
> case? We don't care what's in mp_seed, so this is much simpler.
> 

Oh, I misunderstood you, didn't I? :) Were you talking about
constructing the siphash key in the fast-path above? If yes,
then sure it's a few instructions but nothing conditional.
I don't think we need anything atomic in that case.
Petr Machata June 3, 2024, 9:21 a.m. UTC | #6
Nikolay Aleksandrov <razor@blackwall.org> writes:

> On 5/30/24 21:07, Nikolay Aleksandrov wrote:
>> On 5/30/24 20:27, Nikolay Aleksandrov wrote:
>>> On 5/30/24 18:25, Petr Machata wrote:
>>>>
>>>> I kept the RCU stuff in because it makes it easy to precompute the
>>>> siphash key while allowing readers to access it lock-free. I could
>>>> inline it and guard with a seqlock instead, but that's a bit messier
>>>> code-wise. Or indeed construct in-situ, it's an atomic access plus like
>>>> four instructions or something like that.
>>>
>>> You can READ/WRITE_ONCE() the full 8 bytes every time so it's lock-free
>>> and consistent view of both values for observers. For fast-path it'll
>>> only be accessing one of the two values, so it's fine either way. You
>>> can use barriers to ensure latest value is seen by interested readers,
>>> but for most eventual consistency would be enough.
>> 
>> Actually aren't we interested only in user_seed in the external reader
>> case? We don't care what's in mp_seed, so this is much simpler.
>
> Oh, I misunderstood you, didn't I? :) Were you talking about
> constructing the siphash key in the fast-path above? If yes,
> then sure it's a few instructions but nothing conditional.

That's what I meant. I tried to be concise and went overboard.

> I don't think we need anything atomic in that case.

Hmm, right, no competing increments of any sort, so WRITE_ONCE in the
control path and READ_ONCE in fastpath should be enough.

Thanks for the feedback, I'll send v2 this week.