Message ID | 20240529111844.13330-1-petrm@nvidia.com (mailing list archive) |
---|---|
Headers | show |
Series | Allow configuration of multipath hash seed | expand |
On 5/29/24 14:18, Petr Machata wrote: > Let me just quote the commit message of patch #2 here to inform the > motivation and some of the implementation: > > When calculating hashes for the purpose of multipath forwarding, > both IPv4 and IPv6 code currently fall back on > flow_hash_from_keys(). That uses a randomly-generated seed. That's a > fine choice by default, but unfortunately some deployments may need > a tighter control over the seed used. > > In this patchset, make the seed configurable by adding a new sysctl > key, net.ipv4.fib_multipath_hash_seed to control the seed. This seed > is used specifically for multipath forwarding and not for the other > concerns that flow_hash_from_keys() is used for, such as queue > selection. Expose the knob as sysctl because other such settings, > such as headers to hash, are also handled that way. > > Despite being placed in the net.ipv4 namespace, the multipath seed > sysctl is used for both IPv4 and IPv6, similarly to e.g. a number of > TCP variables. Like those, the multipath hash seed is a per-netns > variable. > > The new sysctl is added with permissions 0600 so that the hash is > only readable and writable by root. > > The seed used by flow_hash_from_keys() is a 128-bit quantity. > However it seems that usually the seed is a much more modest value. > 32 bits seem typical (Cisco, Cumulus), some systems go even lower. > For that reason, and to decouple the user interface from > implementation details, go with a 32-bit quantity, which is then > quadruplicated to form the siphash key. > > One example of use of this interface is avoiding hash polarization, > where two ECMP routers, one behind the other, happen to make consistent > hashing decisions, and as a result, part of the ECMP space of the latter > router is never used. Another is a load balancer where several machines > forward traffic to one of a number of leaves, and the forwarding > decisions need to be made consistently. (This is a case of a desired > hash polarization, mentioned e.g. in chapter 6.3 of [0].) > > There has already been a proposal to include a hash seed control > interface in the past[1]. This patchset uses broadly the same ideas, but > limits the externally visible seed size to 32 bits. > > - Patches #1-#2 contain the substance of the work > - Patch #3 is a mlxsw offload > - Patch #4 is a selftest > > [0] https://www.usenix.org/system/files/conference/nsdi18/nsdi18-araujo.pdf > [1] https://lore.kernel.org/netdev/YIlVpYMCn%2F8WfE1P@rnd/ > > Petr Machata (4): > net: ipv4,ipv6: Pass multipath hash computation through a helper > net: ipv4: Add a sysctl to set multipath hash seed > mlxsw: spectrum_router: Apply user-defined multipath hash seed > selftests: forwarding: router_mpath_hash: Add a new selftest > > Documentation/networking/ip-sysctl.rst | 10 + > .../ethernet/mellanox/mlxsw/spectrum_router.c | 14 +- > include/net/flow_dissector.h | 2 + > include/net/ip_fib.h | 24 ++ > include/net/netns/ipv4.h | 10 + > net/core/flow_dissector.c | 7 + > net/ipv4/route.c | 12 +- > net/ipv4/sysctl_net_ipv4.c | 82 +++++ > net/ipv6/route.c | 12 +- > .../testing/selftests/net/forwarding/Makefile | 1 + > .../net/forwarding/router_mpath_seed.sh | 322 ++++++++++++++++++ > 11 files changed, 482 insertions(+), 14 deletions(-) > create mode 100755 tools/testing/selftests/net/forwarding/router_mpath_seed.sh > Hi, I think that using memory management for such simple task is an overkill. Would it be simpler to define 2 x 4 byte seed variables in netns_ipv4 (e.g. user_seed, mp_seed). One is set only by the user through the sysctl, which would also set mp_seed. Then you can use mp_seed in the fast-path to construct that siphash key. If the user_seed is set to 0 then you reset to some static init hash value that is generated on init_net's creation. The idea is to avoid leaking that initial seed, to have the same seed for all netns (known behaviour), be able to recognize when a seed was set and if the user sets a seed then overwrite it for that ns, but to be able to reset it as well. Since 32 bits are enough I don't see why we should be using the flow hash seed, note that init_net's initialization already uses get_random_bytes() for hashes. This seems like a simpler scheme that doesn't require memory management for a 32 bit seed. Also it has the benefit that it will remove the test when generating a hash because in the initial/non-user-set case we just have the initial seed in mp_seed which is used to generate the siphash key, i.e. we always use that internal seed for the hash, regardless if it was set by the user or it's the initial seed. That's just one suggestion, if you decide to use more memory you can keep the whole key in netns_ipv4 instead, the point is I don't think we need memory management for this value. Cheers, Nik
Nikolay Aleksandrov <razor@blackwall.org> writes: > I think that using memory management for such simple task is an > overkill. Would it be simpler to define 2 x 4 byte seed variables > in netns_ipv4 (e.g. user_seed, mp_seed). One is set only by the > user through the sysctl, which would also set mp_seed. Then you > can use mp_seed in the fast-path to construct that siphash key. > If the user_seed is set to 0 then you reset to some static init > hash value that is generated on init_net's creation. The idea Currently the flow dissector siphash key is initialized lazily so that the pool of random bytes is full when the initialization is done: https://lore.kernel.org/all/20131023180527.GC2403@order.stressinduktion.org https://lore.kernel.org/all/20131023111219.GA31531@order.stressinduktion.org I'm not sure how important that is -- the mailing does not really discuss much in the way of rationale, and admits it's not critical. But initializing the seed during net init would undo that. At the same time, initializing it lazily would be a bit of a mess, as we would have to possibly retroactively update mp_hash as well, which would be racy vs. user_hash update unless locked. So dunno. If you are OK with giving up on the siphash key "quality", I'm fine with this. Alternatively I can keep the dispatch in like it currently is. I.e.: if (user_seed) { sip_hash = construct(user_seed) return flow_hash_from_keys_seed(sip_hash) } else { return flow_hash_from_keys() } I wanted to have the flow dispatcher hash init early as well, as it made the code branch-free like you note below, but then Ido dug out that there are $reasons for how it's currently done. > is to avoid leaking that initial seed, to have the same seed > for all netns (known behaviour), be able to recognize when a > seed was set and if the user sets a seed then overwrite it for > that ns, but to be able to reset it as well. > Since 32 bits are enough I don't see why we should be using > the flow hash seed, note that init_net's initialization already No deep reason in using the dissector hash as far as I'm concerned. I just didn't want to change things arbitrarily, so kept the current behavior except where I needed it to change. > uses get_random_bytes() for hashes. This seems like a simpler > scheme that doesn't require memory management for a 32 bit seed. > Also it has the benefit that it will remove the test when generating > a hash because in the initial/non-user-set case we just have the > initial seed in mp_seed which is used to generate the siphash key, > i.e. we always use that internal seed for the hash, regardless if > it was set by the user or it's the initial seed. > > That's just one suggestion, if you decide to use more memory you > can keep the whole key in netns_ipv4 instead, the point is I don't > think we need memory management for this value. I kept the RCU stuff in because it makes it easy to precompute the siphash key while allowing readers to access it lock-free. I could inline it and guard with a seqlock instead, but that's a bit messier code-wise. Or indeed construct in-situ, it's an atomic access plus like four instructions or something like that.
On 5/30/24 18:25, Petr Machata wrote: > > Nikolay Aleksandrov <razor@blackwall.org> writes: > >> I think that using memory management for such simple task is an >> overkill. Would it be simpler to define 2 x 4 byte seed variables >> in netns_ipv4 (e.g. user_seed, mp_seed). One is set only by the >> user through the sysctl, which would also set mp_seed. Then you >> can use mp_seed in the fast-path to construct that siphash key. >> If the user_seed is set to 0 then you reset to some static init >> hash value that is generated on init_net's creation. The idea > > Currently the flow dissector siphash key is initialized lazily so that > the pool of random bytes is full when the initialization is done: > > https://lore.kernel.org/all/20131023180527.GC2403@order.stressinduktion.org > https://lore.kernel.org/all/20131023111219.GA31531@order.stressinduktion.org > > I'm not sure how important that is -- the mailing does not really > discuss much in the way of rationale, and admits it's not critical. But > initializing the seed during net init would undo that. At the same time, > initializing it lazily would be a bit of a mess, as we would have to > possibly retroactively update mp_hash as well, which would be racy vs. > user_hash update unless locked. So dunno. If you want to keep the late init, untested: init_mp_seed_once() -> DO_ONCE(func (net_get_random_once(&init_mp_seed), memcpy(&init_net.mp_seed, &init_mp_seed)) > > If you are OK with giving up on the siphash key "quality", I'm fine with > this. > IMO that's fine, early init of the seed wouldn't be a problem. net's hash_mix is already initialized early. > Alternatively I can keep the dispatch in like it currently is. I.e.: > > if (user_seed) { > sip_hash = construct(user_seed) > return flow_hash_from_keys_seed(sip_hash) > } else { > return flow_hash_from_keys() > } > > I wanted to have the flow dispatcher hash init early as well, as it made > the code branch-free like you note below, but then Ido dug out that +1 > there are $reasons for how it's currently done. > >> is to avoid leaking that initial seed, to have the same seed >> for all netns (known behaviour), be able to recognize when a >> seed was set and if the user sets a seed then overwrite it for >> that ns, but to be able to reset it as well. >> Since 32 bits are enough I don't see why we should be using >> the flow hash seed, note that init_net's initialization already > > No deep reason in using the dissector hash as far as I'm concerned. > I just didn't want to change things arbitrarily, so kept the current > behavior except where I needed it to change. > >> uses get_random_bytes() for hashes. This seems like a simpler >> scheme that doesn't require memory management for a 32 bit seed. >> Also it has the benefit that it will remove the test when generating >> a hash because in the initial/non-user-set case we just have the >> initial seed in mp_seed which is used to generate the siphash key, >> i.e. we always use that internal seed for the hash, regardless if >> it was set by the user or it's the initial seed. >> >> That's just one suggestion, if you decide to use more memory you >> can keep the whole key in netns_ipv4 instead, the point is I don't >> think we need memory management for this value. > > I kept the RCU stuff in because it makes it easy to precompute the > siphash key while allowing readers to access it lock-free. I could > inline it and guard with a seqlock instead, but that's a bit messier > code-wise. Or indeed construct in-situ, it's an atomic access plus like > four instructions or something like that. You can READ/WRITE_ONCE() the full 8 bytes every time so it's lock-free and consistent view of both values for observers. For fast-path it'll only be accessing one of the two values, so it's fine either way. You can use barriers to ensure latest value is seen by interested readers, but for most eventual consistency would be enough.
On 5/30/24 20:27, Nikolay Aleksandrov wrote: > On 5/30/24 18:25, Petr Machata wrote: >> >> Nikolay Aleksandrov <razor@blackwall.org> writes: >> >>> I think that using memory management for such simple task is an >>> overkill. Would it be simpler to define 2 x 4 byte seed variables >>> in netns_ipv4 (e.g. user_seed, mp_seed). One is set only by the >>> user through the sysctl, which would also set mp_seed. Then you >>> can use mp_seed in the fast-path to construct that siphash key. >>> If the user_seed is set to 0 then you reset to some static init >>> hash value that is generated on init_net's creation. The idea >> >> Currently the flow dissector siphash key is initialized lazily so that >> the pool of random bytes is full when the initialization is done: >> >> https://lore.kernel.org/all/20131023180527.GC2403@order.stressinduktion.org >> https://lore.kernel.org/all/20131023111219.GA31531@order.stressinduktion.org >> >> I'm not sure how important that is -- the mailing does not really >> discuss much in the way of rationale, and admits it's not critical. But >> initializing the seed during net init would undo that. At the same time, >> initializing it lazily would be a bit of a mess, as we would have to >> possibly retroactively update mp_hash as well, which would be racy vs. >> user_hash update unless locked. So dunno. > > If you want to keep the late init, untested: > init_mp_seed_once() -> DO_ONCE(func (net_get_random_once(&init_mp_seed), > memcpy(&init_net.mp_seed, &init_mp_seed)) > Blah, just get_random_bytes() instead of net_get*. :) It'll be executed once anyway. But again I think we can do without all of this. >> >> If you are OK with giving up on the siphash key "quality", I'm fine with >> this. >> > > IMO that's fine, early init of the seed wouldn't be a problem. net's > hash_mix is already initialized early. > [snip] >> >> I kept the RCU stuff in because it makes it easy to precompute the >> siphash key while allowing readers to access it lock-free. I could >> inline it and guard with a seqlock instead, but that's a bit messier >> code-wise. Or indeed construct in-situ, it's an atomic access plus like >> four instructions or something like that. > > You can READ/WRITE_ONCE() the full 8 bytes every time so it's lock-free > and consistent view of both values for observers. For fast-path it'll > only be accessing one of the two values, so it's fine either way. You > can use barriers to ensure latest value is seen by interested readers, > but for most eventual consistency would be enough. > Actually aren't we interested only in user_seed in the external reader case? We don't care what's in mp_seed, so this is much simpler.
On 5/30/24 21:07, Nikolay Aleksandrov wrote: > On 5/30/24 20:27, Nikolay Aleksandrov wrote: >> On 5/30/24 18:25, Petr Machata wrote: >>> >>> Nikolay Aleksandrov <razor@blackwall.org> writes: >>> >> > [snip] [snip snip] >>> >>> I kept the RCU stuff in because it makes it easy to precompute the >>> siphash key while allowing readers to access it lock-free. I could >>> inline it and guard with a seqlock instead, but that's a bit messier >>> code-wise. Or indeed construct in-situ, it's an atomic access plus like >>> four instructions or something like that. >> >> You can READ/WRITE_ONCE() the full 8 bytes every time so it's lock-free >> and consistent view of both values for observers. For fast-path it'll >> only be accessing one of the two values, so it's fine either way. You >> can use barriers to ensure latest value is seen by interested readers, >> but for most eventual consistency would be enough. >> > > Actually aren't we interested only in user_seed in the external reader > case? We don't care what's in mp_seed, so this is much simpler. > Oh, I misunderstood you, didn't I? :) Were you talking about constructing the siphash key in the fast-path above? If yes, then sure it's a few instructions but nothing conditional. I don't think we need anything atomic in that case.
Nikolay Aleksandrov <razor@blackwall.org> writes: > On 5/30/24 21:07, Nikolay Aleksandrov wrote: >> On 5/30/24 20:27, Nikolay Aleksandrov wrote: >>> On 5/30/24 18:25, Petr Machata wrote: >>>> >>>> I kept the RCU stuff in because it makes it easy to precompute the >>>> siphash key while allowing readers to access it lock-free. I could >>>> inline it and guard with a seqlock instead, but that's a bit messier >>>> code-wise. Or indeed construct in-situ, it's an atomic access plus like >>>> four instructions or something like that. >>> >>> You can READ/WRITE_ONCE() the full 8 bytes every time so it's lock-free >>> and consistent view of both values for observers. For fast-path it'll >>> only be accessing one of the two values, so it's fine either way. You >>> can use barriers to ensure latest value is seen by interested readers, >>> but for most eventual consistency would be enough. >> >> Actually aren't we interested only in user_seed in the external reader >> case? We don't care what's in mp_seed, so this is much simpler. > > Oh, I misunderstood you, didn't I? :) Were you talking about > constructing the siphash key in the fast-path above? If yes, > then sure it's a few instructions but nothing conditional. That's what I meant. I tried to be concise and went overboard. > I don't think we need anything atomic in that case. Hmm, right, no competing increments of any sort, so WRITE_ONCE in the control path and READ_ONCE in fastpath should be enough. Thanks for the feedback, I'll send v2 this week.